Best-Insilco Epitope Design

Methods in Molecular Biology 2131
Namrata Tomar, Editor
Immunoinformatics
Third Edition
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, UK
For further volumes:
http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and
methodologies in the critically acclaimed Methods in Molecular Biology series. The series was
the first to introduce the step-by-step protocols approach that has become the standard in all
biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-
step fashion, opening with an introductory overview, a list of the materials and reagents
needed to complete the experiment, and followed by a detailed procedure that is supported
with a helpful notes section offering tips and tricks of the trade as well as troubleshooting
advice. These hallmark features were introduced by series editor Dr. John Walker and
constitute the key ingredient in each and every volume of the Methods in Molecular Biology
series. Tested and trusted, comprehensive and reliable, all protocols from the series are
indexed in PubMed.
Immunoinformatics
Third Edition
Edited by
Namrata Tomar
Department of BioMedical Engineering, Medical College of Wisconsin, Milwaukee, WI, USA
ISSN 1064-3745 ISSN 1940-6029 (electronic)

Methods in Molecular Biology
ISBN 978-1-0716-0388-8 ISBN 978-1-0716-0389-5 (eBook)
https://doi.org/10.1007/978-1-0716-0389-5
© Springer Science+Business Media, LLC, part of Springer Nature 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty,
expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been
made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer
Nature.
The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface
The immune system is highly complex, consisting of numerous cell types, molecular pathways,
and signals that help a host distinguish normal, healthy cells from unhealthy ones. Each
immune cell type has a specific role and its own way of recognizing potentially harmful
foreign bodies. To understand the diverse details of an immune network, a researcher may
need to optimize immune responses for a specific problem, which may range from minor
infections to cancer. This requires data mining, statistics, and machine learning approaches
to convert high-throughput immune data into meaningful insights. In simpler terms,
immunoinformatics applies bioinformatics methods, mathematical models, and statistical
techniques to the study of immune systems biology. The development of immunoinformatics
tools, databases, and models involves computer scientists and modeling experts working
closely with immunologists in a multidisciplinary team. Modeling and computational
approaches have been widely applied to problems in immunology, such as quantifying the data
generated in laboratory experiments and extracting meaningful biological information on
their kinetics. To realize the value of computational tools and models in immunology
research, we need a variety of immune system-related databases, prediction software and
modeling tools, informatics, and computational infrastructure for connecting computer
modeling with wet-lab experimentation, as well as data analytics and visualization.
This book consists of 23 chapters that cover diverse immunoinformatics research topics.
It covers tools and databases for epitope prediction, HLA gene analysis, MHC
characterization, in silico vaccine design, mathematical modeling of host-pathogen
interactions, and network analysis of immune system data.
Content and General Outline of the Book
Chapter 1 introduces the reverse vaccinology approach and its advantages and applications.
The approach searches genomic sequences to predict antigens that could serve as vaccine
candidates. The chapter describes the web tools, databases, and software required to
predict potential epitopes for vaccine development.
Chapter 2 introduces a peptide-based vaccine approach to design an in silico vaccine
against Zika virus.
Chapter 3 focuses on high-definition genomic analysis of human leukocyte antigen
(HLA) genes, which encode major histocompatibility complex (MHC) proteins. HLA alleles
are genotyped from whole genome sequencing data, whole exome sequencing data, or
targeted sequencing of HLA genes by using next-generation sequencing technology.
Chapter 4 describes detailed steps for a computational vaccine design for MERS-CoV
infections. It mostly makes use of IEDB software to predict the suitable MERS-CoV epitope
vaccine.
Chapter 5 introduces an alignment-independent platform for allergenicity prediction. It
utilizes three modular servers to assess the allergenicity of a randomly selected allergenic
protein and demonstrates a protocol for fast and reliable in silico prediction.
Chapter 6 reviews the methodology used for computational identification of B and T
cell epitopes against enterotoxigenic Escherichia coli, along with other databases of epitopes
and analysis tools for T cell and B cell epitope prediction and vaccine design.
Chapter 7 introduces immunoinformatics and molecular docking methods to screen
potential vaccine candidates for Leptospira, which is responsible for Leptospirosis, a zoo-
notic disease.
Chapter 8 considers a residue-centric presentation score based on both the mutated
residues and the MHC-I genotype, and hypothesizes that high scores (corresponding to poor
presentation) correlate with high mutation frequencies within tumors. To explain,
MHC class I proteins present on the cell’s surface recognize tumor-specific neoantigens of
early neoplastic cells and eliminate them before the tumor develops.
Chapter 9 suggests the importance of the network analysis of large-scale data and its
application in the field of immunological research. Network analysis is a way to extract
complex information from high-throughput data and develop advanced algorithms to
unveil the underlying mechanisms. This chapter discusses the ways to integrate and analyze
networks using genome-wide transcriptional profiles.
Chapter 10 discusses the implementation of in silico tools using a multiparametric
approach to screen both B and T cell epitopes, along with a ranking system to shortlist
potential mimotope candidates to be used as peptide cancer vaccine candidates.
Chapter 11 introduces immunoinformatics approaches, e.g., epitope prediction tools,
molecular docking, and population coverage analysis, for designing immunogenic peptides.
The chapter uses these approaches to select potential peptides containing multiple T
(CD8+ and CD4+) and B cell epitopes from the avian H3N2 M1 protein.
Chapter 12 focuses on monoclonal antibody (mAb) formulations, where protein-
protein interfaces formed by mAb aggregation could be selectively recognized by short
peptides with random amino acid sequences. These aggregated mAb are used to screen a
phage display peptide library to pick peptides that can recognize mAb aggregates.
Chapter 13 details the protocol and provides software to build variability-free pro-
teomes for epitope vaccine design implemented for human herpesvirus 1 (HHV1) and
involves the identification of protein clusters, followed by multiple sequence alignments
and Shannon variability calculations.
Chapter 14 utilizes the immunoinformatics tools to identify immunodominant epitopes
for Shigella flexneri and validates them through an in vivo model.
Chapter 15 overviews immunoinformatics tools and their application in in silico vaccine
design against viral diseases.
Chapter 16 presents two online servers, EPCES and EPSVR, for discontinuous epitope
prediction for which all methods were benchmarked by a curated independent test set. Here
all antigens had no complex structures with the antibody, and their epitopes were identified
by various biochemical experiments.
Chapter 17 introduces SVMTriP, which uses a Support Vector Machine that combines
tri-peptide similarity and propensity scores to predict B cell epitopes. It was
applied to non-redundant B cell linear epitopes extracted from IEDB, and it achieved
a sensitivity of 80.1% and a precision of 55.2% with fivefold cross-validation.
Chapter 18 introduces the use of mathematical modeling in the immunoinformatics
domain to study phage-bacteria dynamics.
Chapter 19 focuses on the use of simulation techniques to understand
mycobacteriophage and host interactions. Mycobacterium sp. exhibits complex evolution of
antimicrobial resistance. Phage treatment using phage-encoded products can be used
instead of directly using the bacteriophages (scavengers of bacteria).
Chapter 20 describes electrochemiluminescence immunoassays that are based on the
principle of light emission in a chemical environment to detect and analyze different proteins
and biomolecules. It uses the Mesoscale Discovery System with optimization protocols to
discover more biologically relevant markers.
Chapter 21 presents the database AAgAtlas 1.0 that mines PubMed to support basic and
translational studies associated with autoimmunity. It focuses on autoantibodies that work
against host self-proteins and play significant roles in homeostasis maintenance and also lead
to autoimmune disorders.
Chapter 22 presents different ensemble meta-learning approaches for epitope predic-
tion based on stacked generalization, cascade generalization, and meta-decision trees. The
meta-learning approach enables the integration of multiple prediction models and thereby
outperforms the single best-performing model. Also, it provides a flexibility to researchers
to construct various meta-classification hierarchies for epitope prediction in different protein
domains.
Chapter 23 presents a server PCPS for predicting cleavage sites generated by both the
constitutive proteasome and the immunoproteasome. PCPS is implemented for free public
use online at http://imed.med.ucm.es/pcps/.
Milwaukee, WI, USA Namrata Tomar

Contents
Preface
Contributors
1 Reverse Vaccinology and Its Applications
Amol M. Kanampalliwar
2 Computational Methodology for Peptide Vaccine Design for Zika Virus: A Bioinformatics Approach
Ashesh Nandy, Smarajit Manna, and Subhash C. Basak
3 High-Definition Genomic Analysis of HLA Genes Via Comprehensive HLA Allele Genotyping
Shuji Kawaguchi and Fumihiko Matsuda
4 A Computational Vaccine Designing Approach for MERS-CoV Infections
Hiba Siddig Ibrahim and Shamsoun Khamis Kafi
5 An Alignment-Independent Platform for Allergenicity Prediction
Ivan Dimitrov and Irini Doytchinova
6 Immunoinformatics and Epitope Prediction
Jayashree Ramana and Kusum Mehla
7 Vaccine Design Against Leptospirosis Using an Immunoinformatic Approach
Kumari Snehkant Lata, Vibhisha Vaghasia, Shivarudrappa Bhairappanvar, Saumya Patel, and Jayashankar Das
8 Characterizing MHC-I Genotype Predictive Power for Oncogenic Mutation Probability in Cancer Patients
Lainie Beauchemin, Michael Slifker, David Rossell, and Joan Font-Burgada
9 Network Analysis of Large-Scale Data and Its Application to Immunology
Lauren Benoodt and Juilee Thakar
10 In Silico-Guided Sequence Modification of Epitopes in Cancer Vaccine Development
Winfrey Pui Yee Hoo, Pui Yan Siak, and Lionel L. A. In
11 An Immunoinformatics Approach in Design of Synthetic Peptide Vaccine Against Influenza Virus
Neha Lohia and Manoj Baranwal
12 A New Approach to Assess mAb Aggregation
Illarion V. Turko
13 Generation of Variability-Free Reference Proteomes from Pathogenic Organisms for Epitope-Vaccine Design
Jose L. Sanchez-Trincado and Pedro A. Reche
14 Immunoinformatic Identification of Potential Epitopes
Priti Desai, Divya Tarwadi, Bhargav Pandya, and Bhrugu Yagnik
15 Immunoinformatic Approaches for Vaccine Designing Against Viral Infections
Richa Anand and Richa Raghuwanshi
16 EPCES and EPSVR: Prediction of B-Cell Antigenic Epitopes on Protein Surfaces with Conformational Information
Shide Liang, Dandan Zheng, Bo Yao, and Chi Zhang
17 SVMTriP: A Method to Predict B-Cell Linear Antigenic Epitopes
Bo Yao, Dandan Zheng, Shide Liang, and Chi Zhang
18 Modeling Phage–Bacteria Dynamics
Saptarshi Sinha, Rajdeep Kaur Grewal, and Soumen Roy
19 Dynamics of Mycobacteriophage–Mycobacterial Host Interaction
Arabinda Ghosh, Tridip Phukan, Surabhi Johari, Ashwani Sharma, Abha Vashista, and Subrata Sinha
20 Multiplexing of Immune Markers via Electrochemiluminescence Immunoassays for Systems Biology
Vrushali Abhyankar and Ammaar H. Abidi
21 AAgAtlas 1.0: A Database of Human Autoantigens Extracted from Biomedical Literature
Dan Wang, Yupeng Zhang, Qing Meng, and Xiaobo Yu
22 Application of Meta Learning to B-Cell Conformational Epitope Prediction
Yuh-Jyh Hu
23 PCPS: A Web Server to Predict Proteasomal Cleavage Sites
Marta Gomez-Perosanz, Alvaro Ras-Carmona, and Pedro A. Reche
Index
Contributors
VRUSHALI ABHYANKAR • American Academy of Periodontology, UTHSC, College of Dentistry,
Memphis, TN, USA
AMMAAR H. ABIDI • Department of Bioscience Research and General Dentistry, College of
Dentistry, University of Tennessee Health Science Center, Memphis, TN, USA
RICHA ANAND • Department of Applied Sciences, Indian Institute of Information Technology,
Allahabad, UP, India
MANOJ BARANWAL • Department of Biotechnology, Thapar Institute of Engineering and
Technology, Patiala, India
SUBHASH C. BASAK • Department of Chemistry and Biochemistry, University of Minnesota,
Duluth, MN, USA
LAINIE BEAUCHEMIN • Cancer Biology Program, Fox Chase Cancer Center, Philadelphia, PA,
USA
LAUREN BENOODT • Department of Biochemistry and Biophysics, University of Rochester
Medical Center, Rochester, NY, USA
SHIVARUDRAPPA BHAIRAPPANVAR • Gujarat Biotechnology Research Centre, Department of
Science and Technology, Government of Gujarat, Gandhinagar, India
JAYASHANKAR DAS • Gujarat Biotechnology Research Centre, Department of Science and
Technology, Government of Gujarat, Gandhinagar, India
PRITI DESAI • Department of Biotechnology and Biological Sciences, Institute of Advanced
Research (IAR), Gandhinagar, Gujarat, India
IVAN DIMITROV • Faculty of Pharmacy, Medical University of Sofia, Sofia, Bulgaria
IRINI DOYTCHINOVA • Faculty of Pharmacy, Medical University of Sofia, Sofia, Bulgaria
JOAN FONT-BURGADA • Cancer Biology Program, Fox Chase Cancer Center, Philadelphia,
PA, USA
ARABINDA GHOSH • Microbiology Division, Department of Botany, Gauhati University,
Guwahati, Assam, India
MARTA GOMEZ-PEROSANZ • Department of Immunology, School of Medicine, Complutense
University of Madrid, Madrid, Spain
RAJDEEP KAUR GREWAL • Department of Physics, Bose Institute, Kolkata, India
WINFREY PUI YEE HOO • Department of Biotechnology, Faculty of Applied Sciences, UCSI
University, Kuala Lumpur, Malaysia
YUH-JYH HU • College of Computer Science, National Chiao Tung University, Hsinchu,
Taiwan; Institute of Biomedical Engineering, National Chiao Tung University, Hsinchu,
Taiwan
HIBA SIDDIG IBRAHIM • Sudan Diabetic Childhood Center, Khartoum, Sudan
LIONEL L. A. IN • Department of Biotechnology, Faculty of Applied Sciences, UCSI
University, Kuala Lumpur, Malaysia
SURABHI JOHARI • Institute of Management Studies (IMSUC), Ghaziabad, Uttar Pradesh,
India
SHAMSOUN KHAMIS KAFI • Faculty of Medical Laboratory Science (MLS), The National Ribat
University, Khartoum, Sudan
AMOL M. KANAMPALLIWAR • Master of Technology, School of Biotechnology, UTD, Rajiv
Gandhi Proudyogiki Vishwavidyalaya, Bhopal, India
SHUJI KAWAGUCHI • Center for Genomic Medicine, Kyoto University Graduate School of
Medicine, Kyoto, Japan
KUMARI SNEHKANT LATA • Gujarat Biotechnology Research Centre, Department of Science
and Technology, Government of Gujarat, Gandhinagar, India; Department of Botany,
Bioinformatics and Climate Change, Gujarat University, Ahmedabad, India
SHIDE LIANG • Department of R&D, Bio-Thera Solutions, Guangzhou, China
NEHA LOHIA • Department of Biotechnology, Thapar Institute of Engineering and
Technology, Patiala, India; School of Life Sciences, Jaipur National University, Jaipur,
India
SMARAJIT MANNA • Centre for Interdisciplinary Research and Education, Kolkata, India;
Jagadis Bose National Science Talent Search, Kolkata, India
FUMIHIKO MATSUDA • Center for Genomic Medicine, Kyoto University Graduate School of
Medicine, Kyoto, Japan
KUSUM MEHLA • National Bureau of Animal Genetic Resources, Karnal, Haryana, India
QING MENG • State Key Laboratory of Proteomics, Beijing Proteome Research Center,
National Center for Protein Sciences-Beijing (PHOENIX Center), Beijing Institute of
Lifeomics, Beijing, China
ASHESH NANDY • Centre for Interdisciplinary Research and Education, Kolkata, India
BHARGAV PANDYA • Department of Biotechnology and Biological Sciences, Institute of
Advanced Research (IAR), Gandhinagar, Gujarat, India
SAUMYA PATEL • Department of Botany, Bioinformatics and Climate Change, Gujarat
University, Ahmedabad, India
TRIDIP PHUKAN • Microbiology Division, Department of Botany, Gauhati University,
Guwahati, Assam, India
RICHA RAGHUWANSHI • Department of Botany, Mahila Mahavidyalaya, Banaras Hindu
University, Varanasi, UP, India
JAYASHREE RAMANA • Department of Biotechnology and Bioinformatics, Jaypee University of
Information Technology, Waknaghat, HP, India
ALVARO RAS-CARMONA • Department of Immunology, School of Medicine, Complutense
PEDRO A. RECHE • Department of Immunology, School of Medicine, Complutense University
of Madrid, Madrid, Spain
DAVID ROSSELL • Department of Economics and Business, Universitat Pompeu Fabra,
Barcelona, Spain
SOUMEN ROY • Department of Physics, Bose Institute, Kolkata, India
JOSE L. SANCHEZ-TRINCADO • Department of Immunology, School of Medicine, Complutense
ASHWANI SHARMA • Biopredic International, Parc d’activité de la Bretèche Bâtiment A4,
Saint Grégoire, France
PUI YAN SIAK • Department of Biotechnology, Faculty of Applied Sciences, UCSI University,
Kuala Lumpur, Malaysia
SAPTARSHI SINHA • Department of Physics, Bose Institute, Kolkata, India
SUBRATA SINHA • Centre for Biotechnology and Bioinformatics, Dibrugarh University,
Dibrugarh, Assam, India
MICHAEL SLIFKER • Cancer Biology Program, Fox Chase Cancer Center, Philadelphia, PA,
USA
DIVYA TARWADI • Department of Biotechnology and Biological Sciences, Institute of Advanced
Research (IAR), Gandhinagar, Gujarat, India
JUILEE THAKAR • Department of Microbiology and Immunology, University of Rochester
Medical Center, Rochester, NY, USA; Department of Biostatistics and Computational
Biology, University of Rochester Medical Center, Rochester, NY, USA
ILLARION V. TURKO • Biomolecular Measurement Division, National Institute of Standards
and Technology, Gaithersburg, MD, USA; Institute for Bioscience and Biotechnology
Research, Rockville, MD, USA
VIBHISHA VAGHASIA • Gujarat Biotechnology Research Centre, Department of Science and
Technology, Government of Gujarat, Gandhinagar, India; Department of Botany,
Bioinformatics and Climate Change, Gujarat University, Ahmedabad, India
ABHA VASHISTA • Institute of Management Studies (IMSUC), Ghaziabad, Uttar Pradesh,
India
DAN WANG • State Key Laboratory of Proteomics, Beijing Proteome Research Center,
BHRUGU YAGNIK • Emory Vaccine Center, Yerkes National Primate Research Center, Emory
University, Atlanta, GA, USA; Department of Microbiology and Immunology, Emory
School of Medicine, Emory University, Atlanta, GA, USA
BO YAO • Quantitative Biomedical Research Center, University of Texas Southwestern
Medical Center, Dallas, TX, USA
XIAOBO YU • State Key Laboratory of Proteomics, Beijing Proteome Research Center,
CHI ZHANG • School of Biological Sciences, University of Nebraska – Lincoln, Lincoln, NE,
USA
YUPENG ZHANG • State Key Laboratory of Proteomics, Beijing Proteome Research Center,
DANDAN ZHENG • Department of Radiation Oncology, University of Nebraska Medical
Center, Omaha, NE, USA
Chapter 1
Reverse Vaccinology and Its Applications

Amol M. Kanampalliwar
Abstract
The application of pharmacogenomics and pharmacogenetics to vaccine design, combined with
bioinformatics, has recently been termed “vaccinomics.” The enormous amount of information
generated by whole genome sequencing projects and the rise of bioinformatics have triggered
a new era of vaccine research and development, leading to a “third generation” of vaccines
based on the application of vaccinomics to vaccinology. The first example of such an
approach is reverse vaccinology, which reduces the period of vaccine target detection and
evaluation to 1–2 years. The approach analyzes the genomic sequence and predicts the
antigens most likely to be vaccine candidates. It allows not only the identification of all
the antigens obtained by previous methods but also the discovery of new antigens that work
on a totally different paradigm. Hence this method helps in the discovery of novel
mechanisms of immune intervention. Epitope-based immune-derived vaccines (IDV) are
generally considered safe compared with vectored or attenuated live vaccines. Epitope-based
IDV may also provide essential T-cell help for antibody-directed vaccines. Such vaccines
may have a significant advantage over earlier vaccine design approaches, as the careful
selection of components may diminish undesired side effects.
Key words Vaccinomics, Reverse vaccinology, IDV, Epitope
1 Introduction
The use of genomic information, with the aid of a computer, to prepare vaccines without
culturing the microorganism is known as reverse vaccinology. Since the introduction of
vaccination into western medicine in 1796 with the smallpox vaccine developed by
Edward Jenner, vaccines have become the most cost-effective way of controlling infectious
diseases. The combined application of pharmacogenomics and pharmacogenetics to vaccine
design, together with bioinformatics, has been termed “vaccinomics” [1]. The enormous
amount of information generated by whole genome sequencing projects and the rise of
bioinformatics have triggered a new era of vaccine research and development, leading to a
“third generation” of vaccines based on the application of vaccinomics to vaccinology.
Namrata Tomar (ed.), Immunoinformatics, Methods in Molecular Biology, vol. 2131,
https://doi.org/10.1007/978-1-0716-0389-5_1, © Springer Science+Business Media, LLC, part of Springer Nature 2020
Reverse vaccinology is the first example of such an approach [2]. It analyzes the genomic
sequence and predicts the antigens most likely to be vaccine candidates. It allows not only
the identification of all the antigens obtained by previous methods but also the discovery
of new antigens that work on a totally different paradigm. Hence this method helps in the
discovery of novel mechanisms of immune intervention [3]. Epitope-based immune-derived
vaccines (IDV) are generally considered safe compared with vectored or attenuated live
vaccines [4–7]. Epitope-based IDV may also provide essential T-cell help for
antibody-directed vaccines. Such vaccines may have a significant advantage over earlier
vaccine design approaches, as the careful selection of components may diminish the
undesired side effects that have been observed with whole-pathogen and protein subunit
vaccines [8]. The reverse vaccinology approach has been used to prepare vaccines against
many diseases, such as malaria, anthrax, endocarditis, and meningitis.
Medical workers have long helped the body’s immune system prepare for future attacks
through vaccination. Vaccines are biological preparations that improve a person’s immunity
against a particular disease [9]. Vaccines can be prepared by various means depending on
the pathogenicity of the microbes. Vaccines may be prophylactic [10], used to reduce the
effect of a disease occurring in the future, while some may be curative. Vaccines may
consist of genetically modified or killed microbes, components of microbes, or microbial
DNA. Hence vaccines are among the best and safest ways of preventing infectious diseases.
With the start of the genomic era, new revolutions have been taking place in the field of
vaccines [11]. Shotgun sequencing has provided the whole genome sequences of several
pathogens. With the completion of the first genome sequence of a living organism, genomic
data began to be used for the preparation of vaccines against that organism. The complete
genomic sequence of an organism is a reservoir of genes encoding proteins that can act as
potential antigens and serve as vaccine candidates. The novel technique of identifying the
proteins exposed on the surface of microorganisms by using their genome is known as reverse
vaccinology [3]. More generally, the use of genomic information, with the aid of a
computer, to prepare vaccines without culturing the microorganism is known as reverse
vaccinology.
The first revolution in the field of vaccination was the use of genetic engineering to
produce vaccines. In this approach the pathogenic components of organisms were identified
by culturing them in the laboratory, but it was not a very successful approach for vaccine
preparation.
The second revolution took place in the twentieth century with the aid of genomic
technology [12]. Nowadays, various technologies are available for determining the whole
genome sequence of an organism, which can then be explored for protein-coding sequences
that may serve as potential targets for vaccine preparation.
1.1 Modification in Reverse Vaccinology
1.1.1 Pan Genomic Reverse Vaccinology
In this approach the genomes of different isolates of the same organism are compared
with each other by computer analysis. The first pan-genome approach was applied to
Streptococcus agalactiae [13].
1.1.2 Comparative Reverse Vaccinology
In this approach the pathogenic and nonpathogenic strains of one species are compared
at the genetic level. It deals with the differences in structure of proteins of different
organisms.
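The two variants described above can be pictured as set operations on gene inventories. The following is a minimal sketch under stated assumptions, not the chapter's procedure: each isolate or strain is reduced to a set of gene identifiers, the pan-genomic step keeps genes present in every isolate, and the comparative step keeps genes shared by all pathogenic strains but absent from every nonpathogenic one. The function names, strain names, and gene names are all hypothetical.

```python
from typing import Dict, Set


def core_genes(isolates: Dict[str, Set[str]]) -> Set[str]:
    """Return the genes present in every isolate (pan-genomic view)."""
    gene_sets = list(isolates.values())
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


def pathogen_specific_genes(
    pathogenic: Dict[str, Set[str]],
    nonpathogenic: Dict[str, Set[str]],
) -> Set[str]:
    """Return genes shared by all pathogenic strains and absent from
    every nonpathogenic strain (comparative view)."""
    background = set().union(*nonpathogenic.values())
    return core_genes(pathogenic) - background


# Hypothetical gene inventories, for illustration only.
pathogenic_strains = {
    "isolate_1": {"geneA", "geneB", "geneC"},
    "isolate_2": {"geneB", "geneC", "geneD"},
}
nonpathogenic_strains = {"commensal_1": {"geneC", "geneE"}}

print(sorted(core_genes(pathogenic_strains)))
print(sorted(pathogen_specific_genes(pathogenic_strains, nonpathogenic_strains)))
```

In practice, deciding whether two isolates carry "the same" gene requires sequence comparison rather than name matching; the sets here stand in for the result of that step.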
1.2 Advantages of Reverse Vaccinology [14]
1. It allowed identification of a much broader spectrum of candidates, including proteins
that had not been identified before because they were masked by other immunodominant
targets.
2. It allowed the identification of potential vaccine targets in organisms that were
difficult to cultivate in the laboratory.
Thus we obtained some epitopes against HIV-1 that can be used to prepare a candidate
(peptide) vaccine by using reverse vaccinology with the help of bioinformatics.
1.3 The Role of Epitope Prediction in Reverse Vaccinology
When conventional methods fail to yield a vaccine, nonconventional methods must be
followed. To date, the genomic sequences of more than 500 pathogens, including bacteria
and viruses, are available on the NIH list. Because techniques are now available for
studying host-pathogen interactions, whole genomes, and individual unique genes, work is
now focused on developing epitope-driven vaccines that are target specific.
An epitope is an antigenic determinant that plays an important role in the immunity of an
organism. Epitopes are present on the surface of organisms, where they can be detected by
antibodies [15]. Reverse vaccinology deals with computational analysis of the genome to
predict epitopes on surface proteins. Epitopes therefore play an important role in the
development of a candidate vaccine. The major roles in the immune system are played by B
and T lymphocytes. B cells recognize the epitopes of antigens that are bound by the
paratopes of antibodies. T cells play a role in cell-mediated immunity: processed
antigenic peptides interact with the T cell when they are presented in the context of MHC
molecules. Prediction of both T and B cell epitopes is therefore important in selecting a
candidate vaccine and in designing epitope-based vaccines.
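Most sequence-based T cell epitope predictors start by cutting a protein into overlapping fixed-length peptides, each of which is then scored. The sketch below shows only that enumeration step. It assumes a window of nine residues, a common length for MHC class I ligands, and it does not reproduce the scoring of RANKPEP or any other tool. The function name and the ten-residue input string are illustrative.

```python
from typing import List, Tuple


def candidate_peptides(sequence: str, length: int = 9) -> List[Tuple[int, str]]:
    """Enumerate overlapping peptides of a fixed length.

    Returns (start_position, peptide) pairs with 1-based positions.
    Whitespace in the input is ignored and residues are upper-cased.
    """
    if length < 1:
        raise ValueError("length must be a positive integer")
    seq = "".join(sequence.split()).upper()
    return [(i + 1, seq[i:i + length]) for i in range(len(seq) - length + 1)]


# Arbitrary ten-residue string, used only to show the windowing.
for position, peptide in candidate_peptides("MGARASVLSG"):
    print(position, peptide)
```

A sequence shorter than the window yields an empty list, so short inputs need no special handling by the caller.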
2 Materials

2.1 Web-Based Databases and Programs

1. http://www.ncbi.nlm.nih.gov/.
2. http://www.violinet.org/vaxign/, http://tools.immuneepitope.org/mhc/.
3. http://www.rcsb.org/pdb/home/home.do.
4. http://www.expasy.org/spdbv/.
5. http://autodock.scripps.edu/downloads.
6. http://bio.dfci.harvard.edu/RANKPEP/.
7. http://tools.immuneepitope.org/esm/userMappingFrontP.jsp.
8. http://dicsoft1.physics.iisc.ernet.in/rp/.

2.2 Software Requirements

1. Modeller9.10 (any homology modeling software).
2. AutoDock (any molecular docking software).
3. Cygwin (required for AutoDock 4.0).
4. Spdb-viewer (for energy minimization).
5. RasMol (for molecular analysis or any other).
3 Methods

To meet the objectives of the project, the following steps were carried out: selection of proteins from the HIV-1 whole genome database, epitope prediction and selection, homology modeling, and docking.

3.1 Selection of Proteins from HIV-1 Whole Genome Database

1. The whole genome sequence of HIV-1 was downloaded from the NCBI database.
2. All proteins encoded by the HIV-1 genome were selected for further study, and their amino acid sequences were downloaded and saved in FASTA format. Table 1 lists the attributes of the selected protein.
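Once the sequences are saved in FASTA format, they can be checked programmatically before being pasted into a prediction server. The following is a minimal sketch in plain Python, not part of the original protocol; the function name read_fasta and the sample record are illustrative assumptions, and the sequence shown is only a short placeholder fragment.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record into one string
    return {h: "".join(parts) for h, parts in records.items()}


if __name__ == "__main__":
    # Placeholder record for illustration only (not a full protein sequence)
    sample = ">NP_057850.1 Pr55(Gag)\nMGARASVLSG\nGELDRWEK\n"
    for name, seq in read_fasta(sample).items():
        print(name, len(seq))
```

Checking the sequence length against the Length column of Table 1 is a quick way to confirm that the full protein was saved.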
Table 1
List of selected HIV-1 proteins
Name Accession Start Stop GeneID Locus Locus tag Protein product Length
Pr55(gag) NC_001802.1 336 1838 155030 Gag HIV1gp2 NP_057850.1 500
3.2 Epitope Prediction by RANKPEP (Figs. 1, 2, 3 and Tables 2, 3)

1. The home page of RANKPEP was opened (http://tools.immuneepitope.org/mhc/). The page to open depends on whether epitopes are to be predicted against MHC I or MHC II.
2. The amino acid sequences of the proteins selected in the previous step were pasted into the query window of RANKPEP. Parameters such as MHC source species, prediction method, allele, and length were then selected.
3. The query was then submitted to RANKPEP.
4. A result page was generated listing the predicted epitopes, categorized on the basis of their percentile rank.
5. The epitopes with the highest percentile rank were selected for further study.
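The selection in step 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' script: the Epitope type and select_epitopes function are hypothetical names, and the rows are copied from Table 2. The highest_first flag is included because some servers report percentile ranks for which lower values indicate stronger predicted binding, so the direction should be checked against the documentation of the tool in use.

```python
from typing import NamedTuple


class Epitope(NamedTuple):
    sequence: str
    mhc_class: str
    percentile_rank: float


def select_epitopes(epitopes, top_n=3, highest_first=True):
    """Return the top_n epitopes ordered by percentile rank."""
    ranked = sorted(epitopes, key=lambda e: e.percentile_rank,
                    reverse=highest_first)
    return ranked[:top_n]


# Rows copied from Table 2 (MHC class I, Pr55 Gag)
TABLE_2 = [
    Epitope("QILGQLQPSL", "I", 44.59),
    Epitope("WMTNNPPIPV", "I", 35.32),
    Epitope("RTLNAWVKVV", "I", 34.59),
    Epitope("RAPRKKGCWK", "I", 38.04),
    Epitope("ATLYCVHQRI", "I", 36.86),
    Epitope("KCFNCGK", "I", 35.96),
    Epitope("SEGATPQDL", "I", 29.72),
]

if __name__ == "__main__":
    for ep in select_epitopes(TABLE_2):
        print(ep.sequence, ep.percentile_rank)
```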
Fig. 1 Query window for the prediction of MHC I binding epitopes in RANKPEP
Fig. 2 Outcomes of RANKPEP
Fig. 3 MHC class II predicted epitopes of Pr55 (Gag)
Table 2
List of predicted epitopes against MHC class I for Pr55 (Gag)

Sr. no.  Epitope sequence  MHC class  Percentile rank
1        QILGQLQPSL        I          44.59
2        WMTNNPPIPV        I          35.32
3        RTLNAWVKVV        I          34.59
4        RAPRKKGCWK        I          38.04
5        ATLYCVHQRI        I          36.86
6        KCFNCGK           I          35.96
15       SEGATPQDL         I          29.72

MHC alleles: HLA-A*11:01, HLA-B*40:02, HLA-*32:01, HLA-*29:02, HLA-*08:01, HLA-*68:02, HLA-*07:02, HLA-*01:02, HLA-*01:04, HLA-*30:01, HLA-*68:02, HLA-*74:01
Table 3
List of predicted epitopes against MHC class II for Pr55 (Gag)

Sr. no.  Epitope sequence  MHC class  Percentile rank
1        SILDIRQGP         II         32.02
2        ANPDCKTIL         II         31.36
3        IRQGPKEPF         II         30.09
4        AMSQVTNSA         II         36.56
5        ALSEGATPQ         II         29.80
6        IQGQMVHQA         II         29.27
7        YPIVQNIQG         II         43.54

MHC alleles: HLA-DQA1*05:01/DQB1*03:01, HLA-DRB1*01:06, HLA-DRB1*01:16, HLA-DRB1*01:04, HLA-DRB1*01:17, HLA-DRB1*01:19, HLA-DRB1*03:11, HLA-DRB1*03:35
Table 4
List of predicted models for MHC I epitopes of Pr55 (Gag)

Sr. no.  Epitope no.  Epitope sequence  PDB ID of epitope model  Position of amino acid  Percentage identity  E-value
1        Pr1          PIPVGEIYKRW       1E6J                     123–133                 100%                 8.60 1012
2        Pr2          EPRGSDIAGTT       1E6J                     98–108                  100%                 8.65 1012

Table 5
List of predicted models for IEDB MHC II epitopes of Pr55 (Gag)

Sr. no.  Epitope no.  Epitope sequence  PDB ID of epitope model  Position of amino acid  Percentage identity  E-value
1        Pr1          RPEPTAPPEESFRSG   2C55                     4–18                    90%                  7.79 1021
2        Pr2          KIVKCFNCGKEGHTA   1AAF                     11–26                   93%                  1.33 1028
3.3 Homology Modeling (Tables 4, 5 and Figs. 4, 5, 6, 7)

Steps involved in homology modeling were as follows:

1. The command prompt of MODELLER 9.10 was opened.
2. The working directory was changed to the folder containing the epitope sequences:
   (a) To move up one directory level, type "cd .." and press Enter.
   (b) To enter the desired folder, type "cd folder_name" and press Enter.
3. The files TvLDH.ali, build_profile.py, pdb_95.pir, compare.py, align2d.py, and model-single.py were copied from the basic example folder of MODELLER into the desired folder.
Fig. 4 Predicted model and Ramachandran plot of Pr1
Fig. 5 Predicted model and Ramachandran plot of Pr2
4. Copy the sequence of the epitope to be modeled into the TvLDH.ali file and save it as "protein_name.ali". For example, the first predicted epitope sequence for Gag-Pol was copied and saved as "gp1.ali":
    >P1;gp1
    sequence:gp1:::::::0.00: 0.00
    KQWPLTEEKI*

Fig. 6 Predicted model and Ramachandran plot of Pr 1
Fig. 7 Predicted model and Ramachandran plot of Pr 2
5. Open the build_profile.py file and change the name of the protein. For example, to generate a model for the gp1.ali file, change the input file name accordingly:
    aln = alignment(env)
    aln.append(file='gp1.ali', alignment_format='PIR', align_codes='ALL')
6. Run at the command prompt:
    C:\modeling\gp1> mod9.10 build_profile.py
   and press Enter. This command takes the "pdb_95.pir" file and the "gp1.ali" file prepared for the target sequence as input and generates three output files, namely, pdb_95.bin, build_profile.prf, and build_profile.ali.
7. Among the output files generated in the folder, the build_profile.prf file was opened in WordPad. This file lists the PDB IDs of the candidate templates found in the previous step. Templates showing maximum sequence similarity and minimum energy were selected, and their protein structures were downloaded from the website http://www.rcsb.org/pdb/home/home.do.
8. The compare.py file was opened in WordPad. The PDB IDs of the templates selected in the previous step were entered in the specified command line and the file was saved. For example:
    for (pdb, chain) in (('1rth', 'A'), ('1rth', 'B'), ('3hvt', 'B')):
9. Run at the command prompt:
    C:\modeling\gp1> mod9.10 compare.py
   and press Enter.
10. Running this creates two output files. The Compare.txt file contains a dendrogram of all the selected candidate templates. With the help of this dendrogram, the single best and most closely related template was chosen for the next step. The chosen template had the best crystallographic resolution (lowest value in Å) and maximum similarity with the query sequence (epitope).
11. The align2d.py file was opened in WordPad, and the PDB ID of the selected template was specified in the program lines:
    mdl = model(env, file='1rth', model_segment=('FIRST:A','LAST:A'))
    aln.append_model(mdl, align_codes='1rthA', atom_files='1rth.pdb')
    aln.append(file='gp1.ali', align_codes='gp1')
    aln.align2d()
    aln.write(file='gp1-1rthA.ali', alignment_format='PIR')
    aln.write(file='gp1-1rthA.pap', alignment_format='PAP')
   This step aligns the sequence of the target (epitope) with the structure of the selected template in MODELLER. Although it is based on a dynamic programming algorithm, it differs from standard sequence-sequence alignment methods because it takes structural information from the template into account when constructing the alignment.
12. Run at the command prompt:
    C:\modeling\gp1> mod9.10 align2d.py
   and press Enter.
13. The file named model-single.py was opened in WordPad, and the chosen template name was entered in the relevant program lines:
    env = environ()
    a = automodel(env, alnfile='gp1-1rthA.ali',
                  knowns='1rthA', sequence='gp1',
                  assess_methods=(assess.DOPE, assess.GA341))
14. Run at the command prompt:
    C:\modeling\gp1> mod9.10 model-single.py
   and press Enter.
15. This step generated five probable models for the target epitope, details of which are given in the output file model-single.txt. The model with the lowest DOPE score was selected as the final model for the target epitope.
16. For analysis of the generated structure, a Ramachandran plot was generated with the Swiss-PdbViewer software [16].
17. The modeled structure was opened in Swiss-PdbViewer:
    File → open pdb file → gp1.B99990002.pdb.
    Select → all.
    Wind → Ramachandran Plot.
   The images of the modeled epitopes were taken with RasMol:
    Display → wireframe.
    Colour → group.
    Option → label.
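When many epitopes have to be modeled, the PIR file described in step 4 can be generated rather than edited by hand. The sketch below is an assumption-laden illustration, not part of the original protocol: pir_record and write_pir are hypothetical helper names, and the layout follows the three-line record shown for gp1.ali (header, colon-separated description line, sequence ending in an asterisk).

```python
from pathlib import Path


def pir_record(code, sequence):
    """Build a PIR record for a target sequence that has no structure yet."""
    seq = sequence.strip().upper().rstrip("*")
    header = f">P1;{code}"
    # Description line: ten colon-separated fields, mostly left empty
    description = f"sequence:{code}:::::::0.00: 0.00"
    return f"{header}\n{description}\n{seq}*\n"


def write_pir(code, sequence, folder="."):
    """Write <code>.ali into folder and return the path of the new file."""
    path = Path(folder) / f"{code}.ali"
    path.write_text(pir_record(code, sequence))
    return path


if __name__ == "__main__":
    print(pir_record("gp1", "KQWPLTEEKI"), end="")
```

The code passed as the first argument must match the align_codes and sequence names used later in align2d.py and model-single.py.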
3.4 Molecular Docking (Tables 6, 7, 8, 9 and Figs. 8, 9)

In this project, AutoDock software was used for molecular docking, and the protocol was designed based on the results of previous experiments [17, 18].

1. Broadly neutralizing antibodies (BrNAbs) produced by the immune systems of some HIV-infected patients were selected for molecular docking [19–21]. The PDB structures of the antibodies were downloaded from the website http://www.rcsb.org/pdb/home/home.do.
2. A folder was created to hold the files needed for docking with AUTODOCK 1.5.4. These files include the PDB file of the target protein (antibody), the PDB file of the ligand (modeled epitope structure), and the autodock and autogrid executable files.
Table 6
Docking results of RANKPEP MHC I Pr55 (gag) epitopes
Sr. no. Epitope Antibody Run Lowest binding/docked energy

1. Pr1 2G12 3 2771.11
4E10 10 28.77
B12 9 264619.75
PG9 10 85159.05
2. Pr2 2G12 9 2022.44
4E10 3 111.83
B12 4 106922.83
PG9 10 18086.23
3. Pr3 2G12 2 20648.23
4E10 1 31644.80
B12 5 582708.69
PG9 9 388484.97
Table 7
Software and links for B-cell epitope prediction (adapted from Kanampalliwar et al., 2013) [22]

Epitope prediction tool  URL
Bepitope   http://www-dsv.cea.fr/en/institutes/institute-of-environmental-biology-and-biotechnology-ibeb/services2/department-of-biochemistry-and-nuclear-toxicology-sbtn/molecular-recognition-and-interactions-laboratory-lirm/research/software/bepitope
BcePred    http://www.imtech.res.in/raghava/bcepred/
Pepitope   http://pepitope.tau.ac.il/
Ellipro    http://tools.immuneepitope.org/tools/ElliPro
Epitopia   http://epitopia.tau.ac.il
ABCpred    http://www.imtech.res.in/raghava/abcpred
FBCpred    http://ailab.cs.iastate.edu/bcpreds/
Discotope  http://www.cbs.dtu.dk/services/DiscoTope-2.0
3. The software was opened and the following steps were followed for docking.
4. The target file (antibody PDB file) was prepared first. The target file was opened in AutoDock:
    File → read molecule → open target file.
    Color → by atom type → all geometries → ok.
    Edit → hydrogens → add polar only.
Table 8
Immunological databases and their links (adapted from Kanampalliwar et al., 2013) [22]

Immunological database             URL
Immune epitope database            http://www.iedb.org
SYFPEITHI                          http://syfpeithi.bmi-heidelberg.com/scripts/MHCServer.dll/home.htm
SVM Server                         http://sysbio.unl.edu/SVMTriP
MHCPEP                             http://wehih.wehi.edu.au/mhcpep
JenPep                             http://www.jenner.ac.uk/JenPep
FIMM                               http://sdmc.krdl.org.sg:8080/fimm
MHCBN                              http://www.imtech.res.in/raghava/mhcbn
EPIMHC                             http://mif.dfci.harvard.edu/Tools/db_query_epimhe.html
HIV molecular immunology database  http://hiv-web.lanl.gov/content/immunology/
Table 9
Software and links for T-cell epitope prediction (MHC binding peptides) (adapted from Kanampalliwar et al., 2013) [22]

Epitope prediction tool  URL
IEDB analysis resources  http://tools.immuneepitope.org/mhc/
PREDEPP                  http://bioinfo.md.huji.ac.il/marg/Teppred/mhc-bind
Epipredict               http://www.epipredict.de/index.html
Predict                  http://sdmc.krdl.org.sg:8080/predict-demo
MHCpred                  http://www.jenner.ac.uk/MHCPred
NetMHC                   http://www.cbs.dtu.dk/services/NetMHC
PREDIVAC                 http://predivac.biosci.uq.edu.au
RANKPEP                  http://bio.dfci.harvard.edu/RANKPEP/
EpiMatrix                http://www.epivax.com
    Edit → charges → add Kollman charges → ok.
    Edit → charges → check totals on residues → spread the charge → dismiss.
    Edit → atoms → assign AD4 type.
    File → save → write PDBQT → copy all the residues from the left side and add them to the right side → browse to your folder and save the file as target.pdbqt → press OK.
Fig. 8 Epitope Pr1-Antibody 4E10 interactions
Fig. 9 Epitope Pr2-Antibody 4E10 interactions
5. The ligand file (modeled epitope PDB file) was then prepared:
    Ligand → input → open PDB file of ligand → ok.
    Ligand → torsion tree → detect root.
    Ligand → torsion tree → show/hide root marker.
    Ligand → torsion tree → choose torsions → done.
    Ligand → torsion tree → set number of torsions → dismiss.
    Ligand → output → save PDBQT → save as ligand.pdbqt.
6. The grid file was then created:
    Grid → macromolecule → choose → target → select molecule → save under the same name with the pdbqt extension.
    Grid → grid box → give the coordinates if you know them, otherwise keep the defaults → file → close saving current.
    Grid → output → save as grid.gpf.
7. The commands were then entered in the Cygwin command prompt window. Cygwin provides a Linux-like environment for Windows. To enter the folder containing all the needed files and run AutoGrid, type the following, pressing Enter after each line:
    cd g:
    cd example
    ./autogrid4.exe -p grid.gpf -l grid.glg &
8. Preparation of the docking file:
    Docking → macromolecule → set rigid file name → open PDBQT file of target.
    Docking → ligand → choose → select ligand → accept.
    Docking → search parameters → genetic algorithm → accept.
    Docking → docking parameters → accept.
    Docking → other options → autodock4 parameters → accept.
    Docking → output → Lamarckian GA → save as dock.dpf.
9. The following command was given in the Cygwin window, followed by Enter:
    ./autodock4.exe -p dock.dpf -l dock.dlg &
10. After completion of the above step (which takes from 1 h to 3 h), a file "dock.dlg" was created in the folder. This file contains the binding energies, docking energies, and RMSD values on the basis of which the docked conformations can be analyzed.
11. The dock.dlg file was then opened in AutoDock:
    Analyze → Dockings → open the dock.dlg file.
    Analyze → macromolecule → open.
    Analyze → conformations → play → click on "&" → show info.
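The lowest binding energy reported per antibody in Table 6 can also be pulled out of dock.dlg without opening the graphical interface. The sketch below is an illustration only: it assumes that each docking run in the log contains a line beginning with "DOCKED: USER" and the text "Estimated Free Energy of Binding = <value> kcal/mol", which should be confirmed against an actual log file from the AutoDock version in use. The function names are hypothetical.

```python
import re

# Assumed layout of the per-run energy line in an AutoDock 4 docking log
ENERGY_LINE = re.compile(
    r"^DOCKED: USER\s+Estimated Free Energy of Binding\s*=\s*"
    r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)",
    re.MULTILINE,
)


def binding_energies(dlg_text):
    """Return one binding energy per docking run, in order of appearance."""
    return [float(m.group(1)) for m in ENERGY_LINE.finditer(dlg_text)]


def lowest_binding_energy(dlg_text):
    """Return (run_number, energy) for the run with the lowest energy."""
    energies = binding_energies(dlg_text)
    if not energies:
        raise ValueError("no binding energies found in the log text")
    best = min(range(len(energies)), key=energies.__getitem__)
    return best + 1, energies[best]  # runs are numbered from 1


if __name__ == "__main__":
    with open("dock.dlg") as handle:
        print(lowest_binding_energy(handle.read()))
```

Only lines prefixed with "DOCKED:" are counted so that energies repeated in later summary sections of the log are not mistaken for additional runs.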
References
1. Poland GA, Ovsyannikova IG, Jacobson RM (2009) Application of pharmacogenomics to vaccines. Pharmacogenomics 10(5):837–852
2. Bagnoli F et al (2011) Designing the next generation of vaccines for global public health. OMICS 15(9):545–566
3. Rappuoli R (2000) Reverse vaccinology. Curr Opin Microbiol 3(5):445–450
4. Elliott SL et al (2008) Phase I trial of a CD8+ T-cell peptide epitope-based vaccine for infectious mononucleosis. J Virol 82(3):1448–1457
5. Gahery H et al (2006) New CD4+ and CD8+ T cell responses induced in chronically HIV type-1-infected patients after immunizations with an HIV type 1 lipopeptide vaccine. AIDS Res Hum Retrovir 22(7):684–694
6. Asjo B et al (2002) Phase I trial of a therapeutic HIV type 1 vaccine, Vacc-4x, in HIV type 1-infected individuals with or without antiretroviral therapy. AIDS Res Hum Retrovir 18(18):1357–1365
7. Kran AM et al (2004) HLA- and dose-dependent immunogenicity of a peptide-based HIV-1 immunotherapy candidate (Vacc-4x). AIDS 18(14):1875–1883
8. De Groot AS et al (2011) Tools for vaccine design: prediction and validation of highly immunogenic and conserved class II epitopes and development of epitope-driven vaccines, in development of vaccines. John Wiley & Sons, Inc., Hoboken, New Jersey, pp 65–94
9. Lara HH, Garza-Treviño EN, Ixtepan-Turrent L, Singh DK (2011) Silver nanoparticles are broad-spectrum bactericidal and virucidal compounds. J Nanobiotechnology 9:30
10. Geels MJ et al (2011) European vaccine initiative: lessons from developing malaria vaccines. Expert Rev Vaccines 10(12):1697–1708
11. Rinaudo CD, Telford JL, Rappuoli R, Seib KL (2009) Vaccinology in the genome era. J Clin Invest 119(9):2515–2525
12. LM L (2010) New strategies for vaccine development. SPCV 2:e4
13. Lefébure T, Stanhope MJ (2007) Evolution of the core and pan-genome of streptococcus: positive selection, recombination, and genome composition. Genome Biol 8:R71
14. Donati C, Rappuoli R (2013) Reverse vaccinology in the 21st century: improvements over the original design. Ann N Y Acad Sci 1285:115–132
15. Ansari HR, Raghava GP (2010) Identification of conformational B-cell epitopes in an antigen from its primary sequence. Immunome Res 6:6
16. Guex N, Peitsch MC (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 18(15):2714–2723
17. Goodsell DS, Olson AJ (1990) Automated docking of substrates to proteins by simulated annealing. Proteins 8(3):195–202
18. Sotriffer CA et al (2000) Automated docking of ligands to antibodies: methods and applications. Methods 20(3):280–291
19. Walker LM et al (2011) Broad neutralization coverage of HIV by multiple highly potent antibodies. Nature 477(7365):466–470
20. Walker LM et al (2009) Broad and potent neutralizing antibodies from an African donor reveal a new HIV-1 vaccine target. Science 326(5950):285–289
21. Trkola A et al (1996) Human monoclonal antibody 2G12 defines a distinctive neutralization epitope on the gp120 glycoprotein of human immunodeficiency virus type 1. J Virol 70(2):1100–1108
22. Kanampalliwar AM, Soni R, Girdhar A, Tiwari A (2013) Web based tools and databases for epitope prediction and analysis: a contextual review. Int J Comput Bioinform In Silico Model 2(4):180–185
Chapter 2
Computational Methodology for Peptide Vaccine Design for Zika Virus: A Bioinformatics Approach
Ashesh Nandy, Smarajit Manna, and Subhash C. Basak
Abstract
With the increasing frequency of viral epidemics, vaccines to augment the human immune response system
have been the medium of choice to combat viral infections. The tragic consequences of the Zika virus
pandemic in South and Central America a few years ago brought the issues into sharper focus. While
traditional vaccine development is time-consuming and expensive, recent advances in information technol-
ogy, immunoinformatics, genetics, bioinformatics, and related sciences have opened the doors to new
paradigms in vaccine design and applications.
Peptide vaccines are one group of the new approaches to vaccine formulation. In this chapter, we discuss
the various issues involved in the design of peptide vaccines and their advantages and shortcomings, with
special reference to the Zika virus for which no drugs or vaccines are as yet available. In the process, we
outline our work in this field giving a detailed step-by-step description of the protocol we follow for such
vaccine design so that interested researchers can easily follow them and do their own designing. Several
flowcharts and figures are included to provide a background of the software to be used and results to be
anticipated.
Key words Peptide vaccine, Sequence descriptors, Vaccine design protocol, Alignment-free techniques, Average solvent accessibility (ASA) and protein variability, Epitopes, Graphical methods
1 Introduction
The Zika virus pandemic of 2015–2016 has left an indelible mark on the collective consciousness of the world, so devastating and widespread have been its clinical manifestations [1]. To date no drugs or vaccines have been successfully formulated or implemented to protect us from its menace should a pandemic raise its head again. Understanding the process of drug discovery and vaccine design thus remains a top priority in labs around the world.

The human body has an immune system that protects it from invasive pathogens, referred to as antigens, and to date that remains the best form of prevention. Figure 1 gives a general overview of how the immune system works. While it can often confront the pathogen onslaught by itself, in the event of new and potent pathogens with no previous history, the system can be assisted by augmentation in the form of vaccination. Vaccination was known to the ancients but was first approached methodically when Edward Jenner used cowpox to vaccinate patients against smallpox in the eighteenth century [2]. Subsequent work by Pasteur against rabies and by other researchers [3] led to a functional approach to the formulation of vaccines against many viruses and other pathogens, and to the worldwide eradication of smallpox and the near eradication of poliomyelitis [4].

Virus attack
Part of the invading pathogens are removed by action of the innate immunity system
The remaining antigens enter the blood cells and serum
T cells and B cells of the organism's adaptive immunity system circulate a variety of receptors (antibodies) to bind to antigens
After identification of specific receptors which can bind with the antigen, thousands of copies of the antibodies are generated
Antibodies bind with antigens and eliminate them by phagocytosis
Fig. 1 Flowchart: How the immune system works against a viral infection
Traditionally, vaccines have been conceived and implemented in three main types: live attenuated vaccines, where the virus is passaged successively until its pathogenicity is significantly reduced; inactivated vaccines, where the replication systems of the virus are eliminated; and subunit vaccines, where one or more surface proteins are used as vaccines. The idea is to inoculate the recipient with a suitable dose of the vaccine to activate the body's immune defense system, which then goes into action when the real pathogen infects the host. Design and development of such vaccines take 10 years or more and around half a billion dollars to move one vaccine from lab to marketplace [5], taking into account issues such as maintenance of viability during the manufacturing process, stability during storage and transportation, administration in different environments, and so on. Even then there is the prospect of failure because viral genomes, especially those of hypervariable ribonucleic acid (RNA) viruses like influenza, change rapidly through mutations in their sequences. Moreover, even when a vaccine is functionally stable, the same vaccine is administered in every country regardless of community and physiology, although the immune response is known to vary from population to population and between individuals, so not everyone reacts to the same vaccine in a similar fashion.
2 The Dawn of Peptide Vaccines
Clearly, a new strategy for vaccine design is called for. Recent developments in immunoinformatics, information technology, genomics, and bioinformatics have led to a better understanding of immune responses [6] and provide an opportunity to focus more specifically on the way vaccines operate [7]. This is leading to a new paradigm in vaccine development, the era of vaccinomics, where one moves away from the "one size fits all" concept toward developments that orient vaccine design to fit population, community, and individual profiles [8, 9]. These new strategies of rational vaccine design and the science of "reverse vaccinology" are the beacons of the future [10].

Peptide vaccines are one of the new strategies for designing novel vaccines [11]. The basic concept is to identify, from knowledge of the pathogen's genome, those segments of the pathogen that can elicit a strong response from the host's immune system to eliminate the invading pathogen, and then to determine peptides that could serve as vaccines to augment this process. One can then perform various tests to determine how strong the response is for a particular population group, whether the peptides cause any adverse reactions, etc. This procedure is in increasing use by researchers across the globe; at last count (August 2, 2019), the US National Institutes of Health's website ClinicalTrials.gov listed 268 studies on peptide vaccines in different trial phases, mostly for various types of cancers but also for several viral diseases. We have applied our own approach to peptide vaccine research by analyzing 514 influenza H5N1 avian flu and 425 influenza H1N1 swine flu neuraminidase sequences to determine six vaccine candidates for the H5N1 virus [12], 220 available influenza H7N9 hemagglutinin sequences to determine 3 vaccine targets [13], 433 VP7 surface glycoprotein sequences of rotavirus for a repertoire of 4 vaccine targets [14], 222 L1 capsid genes of human papillomavirus for 5 peptides as vaccines [15], and 60 envelope proteins of Zika virus that suggested 4 possible vaccine targets [16]. In view of these developments, and since this is a Methods and Protocols book, we concentrate on the protocol we have used for designing peptide vaccines (flowchart in Fig. 2), taking the Zika virus case as an example, and refer the interested reader to other approaches to the task [16–23].
Collect all information on the virus and its structure
Determine surface situated proteins of the virion for vaccine design
Get average solvent accessibility (ASA) information on protein sequence | Get protein variability index by Graphical Sliding Window Method
Determine protein sequence segments that have high ASA and low protein variability
Use 3D structural model of the protein to ensure the selected segments are indeed surface situated
Determine through web servers like IEDB if the selected segments have acceptable epitope potential
Run peptide segments from above step through a BLAST or similar software to ensure no autoimmune threats
Final list of peptides that pass all the above tests are to be submitted for wet lab experiments
Fig. 2 Flowchart for in silico peptide vaccine design
3 Peptide Design Protocol
Our protocol begins by identifying the protein most likely to be surface exposed on the antigen; this can be ascertained from the viral and protein structural information in a database such as UniProt. The Zika virus is a positive-sense, single-stranded RNA virus containing over 10,000 bases that code for 11 proteins: three structural proteins, capsid (C), premembrane/membrane (prM/M), and envelope (E), and eight nonstructural (NS) proteins, NS1, NS2A, NS2B, NS3, NS4A, NS4B, and NS5 together with a "2K" 23 amino acid peptide generated by cleavage of the N terminus of the NS4B signal sequence [24]. The structural proteins form an icosahedral structure that contains the RNA genome within it, with 180 envelope proteins making up the icosahedral shell [25]. While several researchers have predicted good antibody binding potential in the capsid, NS2A, NS3, NS4B, and NS5 proteins of Zika virus [19–21], the envelope proteins are the major surface proteins [26] and a good candidate for vaccine design [16–18, 22] when adequate sequence information is available.
3.1 Graphical Methods

Sometimes the available information is partial, as, e.g., in the case of the Zika virus during the early days of the outbreak. In such cases, we make a simple 2D graphical representation of the available sequences and examine how the various data fragments fit together and whether they can be used for a deeper analysis; this was the case for the Zika virus, where we needed this step to ensure we had the right fit [16]. The method is to assign the four bases of a nucleotide sequence to the four cardinal directions of a Cartesian coordinate system [27]; e.g., assign adenine (a) to the negative x-axis, cytosine (c) to the positive y-axis, guanine (g) to the positive x-axis, and thymine (t) to the negative y-axis (in the databases, uracil (u) in RNA sequences is generally represented by t). To plot a graph, one takes a step for a base in the sequence in the assigned direction, then the next step for the next base, and so on until the whole sequence is plotted. This charts the base distribution of a sequence as a curve on the 2D grid (see Note 1), including some degeneracy arising out of overlapping steps, and two or more sequences can be compared visually to see where they are similar or different. This feature was used in one of our papers to group together several human papillomavirus types to design one variety of vaccine [15]. Moreover, the graphical representation can be quantified by defining a center of mass

$$\mu_x = \frac{\sum_i x_i}{N}, \qquad \mu_y = \frac{\sum_i y_i}{N}$$

and a graph radius

$$g_R = \sqrt{\mu_x^2 + \mu_y^2}$$

that is the distance of the center of mass from the origin [28]; here $(x_i, y_i)$ are the coordinates of the i-th point of the plot and N is the number of bases. The gR turns out to be a very sensitive measure of the base distribution of a sequence, and equal gR implies practically identical sequences [29]. The gR, therefore, has been designated as an index, a descriptor that is characteristic of a sequence. The advent of graphical representations and their numerical characterization gave rise to many other approaches [30], but for ease of computation, we operate with the 2D method outlined above (see Notes 2–4).
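The 2D walk and the graph radius can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' software: the names STEP, walk_2d, and graph_radius are hypothetical, each step is taken to have unit length, and the center of mass is averaged over the points reached after each base. Ambiguous base codes such as n are not handled and raise a KeyError.

```python
import math

# Direction assigned to each base, following the convention in the text:
# a -> negative x, c -> positive y, g -> positive x, t (or u) -> negative y
STEP = {"a": (-1, 0), "c": (0, 1), "g": (1, 0), "t": (0, -1), "u": (0, -1)}


def walk_2d(sequence):
    """Return the (x, y) point reached after each base of the sequence."""
    x = y = 0
    points = []
    for base in sequence.lower():
        dx, dy = STEP[base]
        x += dx
        y += dy
        points.append((x, y))
    return points


def graph_radius(sequence):
    """Distance from the origin to the center of mass of the 2D walk."""
    points = walk_2d(sequence)
    if not points:
        raise ValueError("sequence is empty")
    n = len(points)
    mu_x = sum(p[0] for p in points) / n
    mu_y = sum(p[1] for p in points) / n
    return math.hypot(mu_x, mu_y)


if __name__ == "__main__":
    print(walk_2d("acgt"))
    print(graph_radius("acgt"))
```

The list returned by walk_2d can be passed directly to any plotting library to draw the curve described in the text.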
To handle protein sequences, we proposed a parallel system some time ago [31]. In this case, there are 20 amino acids that combine to form a protein sequence, and we assign each amino acid to a designated axis in a 20D Cartesian coordinate system where, in principle, a sequence can be plotted using the same principles as for the nucleotide sequences. The graph, of course, is in abstract space but can be computationally compared with other such graphs. Here also the first-order moments and the resultant vector from the origin to the center of mass are defined as

$$\mu_1 = \frac{\sum_i x_{1i}}{N}, \quad \mu_2 = \frac{\sum_i x_{2i}}{N}, \quad \ldots, \quad \mu_{20} = \frac{\sum_i x_{20i}}{N}$$

and

$$p_R = \sqrt{\sum_{j=1}^{20} \mu_j^2}$$

where the pR, as in the case of the gR, acts as a descriptor of the sequence. With a little adjustment in the computation, as explained in our paper [14] (see Note 5), the pR turns out to be an index of a sequence or segment where equal pR values for two segments imply identical sequences. This property of the pR is very useful for our purposes of vaccine design.
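A direct transcription of the two formulas into Python looks as follows. It is a sketch under stated assumptions: the function name protein_radius is hypothetical, the assignment of amino acids to axes (alphabetical by one-letter code) is arbitrary, unit steps are assumed, and the adjustment described in [14] and Note 5 is not reproduced. Without that adjustment the value is not a unique index; for example, the sketch returns the same pR for "AC" and "CA".

```python
import math

# Arbitrary axis assignment for illustration: one axis per amino acid
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AXIS = {aa: j for j, aa in enumerate(AMINO_ACIDS)}


def protein_radius(sequence):
    """Unadjusted pR: norm of the center of mass of the 20D walk."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence is empty")
    position = [0] * 20      # current coordinates of the walk
    moment_sum = [0] * 20    # running sum of each coordinate over all points
    for aa in sequence.upper():
        position[AXIS[aa]] += 1          # unit step along this residue's axis
        for j in range(20):
            moment_sum[j] += position[j]
    return math.sqrt(sum((s / n) ** 2 for s in moment_sum))


if __name__ == "__main__":
    print(protein_radius("KQWPLTEEKI"))
```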
3.2 Viral Mutations

After identifying the surface protein for which we wish to design vaccines, it is necessary to ask whether the protein, or segments of it, are stable against mutations, lest our efforts be overcome in a short time. Because viruses mutate very rapidly, and more so RNA viruses like Zika, dengue, and others that do not have an error-correcting mechanism in their replication machinery, one has to ensure that the segments identified will be reasonably stable. There are two approaches to this task: alignment of sequences to determine conserved regions using software such as BLAST [32], and an alignment-free approach. Alignment procedures are well established and familiar to most molecular biologists. The difficulties with this approach are that (1) the result depends on the model chosen for handling mismatches and (2) it is highly computation intensive and cannot accommodate too many sequences at a time, being limited by the computing resources available. Alignment-free approaches, like the graphical methods outlined above, are relatively recent, take sequences as they are without the need to fit them to other sequences, and can be computed fairly easily for a very large number of sequences.
3.3 Graphical Sliding Window Method for Protein Variability
This methodology is used to determine those segments of the selected
antigen that are fairly well conserved compared to the rest of the
sequence by comparing the relevant index for each segment of all the
sequences available (see Ref. 14 for details). Immunogenetic studies
indicate that a segment of between 10 and 14 amino acids is a potent
peptide to consider for immune response generation, and thus also for
segment stability against mutations [14]. We therefore take a window
of, say, 12 amino acids and compute the pR value for that window,
starting with amino acids 1–12. Then we move the window by one
position and compute another pR for amino acids 2–13, and continue in
this way to the end of the sequence. Doing the computation for all
the available sequences of the viral protein, we can line up the pR
values by window number and scan each column to count how many
different pR values it contains. The fewer the distinct values, the
more conserved is the amino acid string in that window. This is most
easily seen in a graph where we plot a moving average of these counts
to smooth over the many kinks.
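The sliding window procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function window_descriptor is a stand-in for pR that assumes a 20D-walk form (distance of the mean point of the walk from the origin) with an arbitrary axis order, and should be replaced by the definition given earlier in the chapter; all function names are hypothetical. Sequences are compared position by position, without alignment, and are assumed to contain only the 20 standard residues.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed axis order; any fixed order works


def window_descriptor(peptide):
    """Stand-in for pR: distance of the mean point of a 20D walk from the origin."""
    position = [0] * 20  # current coordinates of the walk
    totals = [0] * 20    # running sum of the coordinates along each axis
    for residue in peptide:
        position[AMINO_ACIDS.index(residue)] += 1
        for axis in range(20):
            totals[axis] += position[axis]
    n = len(peptide)
    # Rounding avoids spurious differences caused by floating-point noise.
    return round(math.sqrt(sum((t / n) ** 2 for t in totals)), 6)


def variability_profile(sequences, window=12):
    """Number of distinct descriptor values found at each window position."""
    usable = min(len(s) for s in sequences)
    profile = []
    for start in range(usable - window + 1):
        values = {window_descriptor(s[start:start + window]) for s in sequences}
        profile.append(len(values))
    return profile


def moving_average(values, span=12):
    """Simple moving average used to smooth a jagged profile."""
    return [sum(values[i:i + span]) / span
            for i in range(len(values) - span + 1)]
```

A window position where variability_profile returns 1 is one in which every sequence gave the same descriptor value, i.e., the most conserved case in the sense used above.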
3.4 ASA Profile
A further piece of information we need in parallel is the average
solvent accessibility (ASA), which shows how much of the amino acid
chain is situated at the surface and exposed to the solvent. For this
we submit our sequences to selected web servers; the ones we have
used are SABLE and I-TASSER. With a suitable choice of options, the
servers can return the ASA as a percentage from 0 (wholly
hydrophobic) to 99 (wholly hydrophilic), though the results usually
are in the 30–60 range. The results are again averaged and a moving
average is plotted against amino acid number (see Note 6). Plotting
the two graphs together on the same chart, we can then visually
inspect and pick out the most likely candidates, namely those that
are highly conserved (least protein variability) and have high ASA.
Figure 3 provides an example of such a graph for the H5N1
neuraminidase protein, where we ultimately identified six candidates
that fit our criteria.
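The visual comparison of the two curves could be approximated by a simple threshold rule. The sketch below is an assumption-laden illustration, not the procedure used in the chapter: the function name and both cut-off parameters are hypothetical and would have to be chosen by inspecting the actual profiles.

```python
def candidate_windows(variability, asa, max_variability, min_asa):
    """Return (start, end) index ranges where variability is low and ASA is high.

    Both inputs are smoothed profiles indexed by window position; the two
    thresholds are user-chosen, hypothetical cut-offs.
    """
    runs = []
    start = None
    length = min(len(variability), len(asa))
    for i in range(length):
        ok = variability[i] <= max_variability and asa[i] >= min_asa
        if ok and start is None:
            start = i                    # a qualifying stretch begins here
        elif not ok and start is not None:
            runs.append((start, i - 1))  # the stretch ended at the previous index
            start = None
    if start is not None:
        runs.append((start, length - 1))
    return runs
```

Each returned range marks a stretch of consecutive window positions that satisfies both criteria and can then be checked by eye against the plotted curves.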
3.5 3D Modeling
However, we need to check whether parts of the identified candidates
could be covered by neighboring proteins. For this we consider a 3D
model of the structure of the protein, using structural information
from a protein database such as UniProt and modeling software such as
Cn3D or PyMOL. We have found a space-filling rendition to be the best
visual, where the identified peptides can be highlighted in different
colors (see Fig. 4). If any of the selected peptides are overshadowed
by neighboring proteins, we can leave those out and carry on with a
shorter list (see Note 7).
[Figure 3: x-axis, sliding window position (0–500); candidate segments labeled A–F]
Fig. 3 Plots of average solvent accessibility (ASA) profile (upper curve) of H5N1
neuraminidase protein sequence with segment variability (lower curve)
determined by the graphical sliding window method. Comparison of the two curves
yielded six likely candidates, labeled (A–F), that showed the least protein
segment variability and maximum solvent accessibility. (Reproduced from Ref.
12 with permission of the authors)
Fig. 4 A space-filling model of four conserved surface exposed segments of
envelope protein of Zika virus identified by comparison of protein variability and
ASA profile analysis. (Reproduced from Ref. 23 with permission of the authors)
3.6 Epitope Potential
The next step is to determine whether one or more of the peptides in
the short list has good potential to generate an immune response. Not
all amino acid combinations are able to do that; those that do are
classified as epitopes. One of the most widely used web servers for
this purpose is the IEDB (Immune Epitope Database and Analysis
Resource). We submit the viral protein sequence and select from the
many options what we want as output. The idea is that our antigen
epitopes should elicit the T-cell and B-cell (antibody) responses
that enable the immune reaction to the invading pathogen. This is
also facilitated by a knowledge of the target host community’s human
leukocyte antigen (HLA) alleles, a profile of which is available in a
separate database within the IEDB suite. On running the system, the
IEDB output is a series of 15-mer peptides with percentile ranks
indicating how strong the immune response could be, from 0 (strong)
to 99 (weak); generally, percentile ranks above 10 are considered
rather weak, and only peptides with ranks below 5 are considered
worth further tests. In our case of vaccine design candidates, we
consider only those peptides in the IEDB output that contain the
short-listed peptides identified in the previous step and determine
whether they fit in the overall scheme of acceptable epitopes (see
Note 8).
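The final filtering step can be illustrated as follows. The sketch assumes that the IEDB predictions have already been read into a list of (15-mer, percentile rank) pairs; that data layout and the function name are hypothetical, and no IEDB file format or API is implied.

```python
def epitopes_containing(shortlist, predictions, max_rank=5.0):
    """For each short-listed peptide, list the predicted 15-mers that contain it.

    predictions: iterable of (peptide_15mer, percentile_rank) pairs
                 (hypothetical layout; lower rank means stronger response).
    Only 15-mers with a percentile rank below max_rank are kept.
    """
    predictions = list(predictions)
    hits = {}
    for peptide in shortlist:
        hits[peptide] = [(p15, rank) for p15, rank in predictions
                         if peptide in p15 and rank < max_rank]
    return hits
```

A short-listed peptide with an empty hit list has no predicted 15-mer below the chosen rank and would not be carried forward.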
3.7 Autoimmune Threats
If some, or all, of the short-listed peptides are acceptable as
epitopes, we next need to ensure that they will not pose any
autoimmune threat. This can happen if any of the peptides we select
as suitable candidates resembles an existing peptide in the host
proteins, in which case the immune response may also attack the host
protein. To safeguard against such an eventuality, we run a BLAST
analysis with our short-listed peptides to determine whether any host
protein or peptide has sequence overlap with them; any peptide
showing such overlap needs to be removed from our list.
What we finally have is a small list of peptides that have the
potential to evoke a good immune response with no autoimmune
threats. These peptides can be prepared in a laboratory and tested in
tissue culture and animals before undertaking human clinical trials.
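As a rough illustration of the overlap idea only, the sketch below flags peptides that share an exact run of k residues with a host protein. This is a crude stand-in and not a BLAST search: it finds no inexact matches, the word length k=6 is an arbitrary assumption, and the function names and the dictionary of host sequences are hypothetical.

```python
def shared_kmers(peptide, host_proteins, k=6):
    """Map host protein name -> k-mers of the peptide found verbatim in it."""
    kmers = {peptide[i:i + k] for i in range(len(peptide) - k + 1)}
    hits = {}
    for name, sequence in host_proteins.items():
        found = sorted(kmer for kmer in kmers if kmer in sequence)
        if found:
            hits[name] = found
    return hits


def drop_overlapping(shortlist, host_proteins, k=6):
    """Keep only the peptides that share no k-mer with any host protein."""
    return [p for p in shortlist if not shared_kmers(p, host_proteins, k)]
```

In practice the BLAST analysis described above remains the method to use; a check of this kind is at most a quick pre-screen.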
4 Discussion and Summary
Currently, a large number of software tools are available on the
World Wide Web for prediction of epitopes, analyses of biomolecular
sequences, comparison of different strains, etc. We have mentioned a
few in the above paragraphs. As a starting point for the interested
reader, Table 1 gives a brief list of some of the software and
databases we have used. Each web server-based system has help
sections and often tutorials to introduce the beginner to the
resources it has to offer. They are worth exploring.
Finally, we may claim that, compared to traditional approaches
for vaccine development, our computational approach has short-
ened the lead time for the determination of suitable peptide vaccine
candidates for laboratory testing. Provided these are found accept-
able, peptide vaccines have several advantages over traditional vac-
cines, viz., purity in manufacture, scalability, minor adjustments to
suit local communities’ HLA profiles, stability, transport, adminis-
tration, etc. [11, 33]. However, such vaccines turn out to be weaker
than attenuated or inactivated vaccines [34], and several issues still
need to be sorted out before peptide vaccines can get into regular
or routine clinical use in the public health systems:
• Several peptides from the short list can be clubbed together to
  form a multivalent peptide vaccine for a better immune response.
• In case more than one surface protein of the antigen exists, a
  parallel exercise can be done for a more effective multivalent
  vaccine.
• The use of adjuvants that act as stimulators for an immune
  response should be systematized.
• Care has to be exercised in dosage design lest excess generation
  of cytotoxic antibodies turn against the host system [35].
Table 1
Selection of databases, protein structure, and epitope prediction software
DB/server | URL | Brief remarks
OpenFlu DB | http://openflu.vital-it.ch/about.php | Sequence data
Influenza research database | https://www.fludb.org/brc/search_landing.spg?decorator=influenza | Sequence, epitopes, other data
EpiFlu | https://www.gisaid.org/epiflu-applications/nextflu-app/ | Part of GISAID, the Global Initiative on Sharing All Influenza Data
Virus Pathogen Resource (ViPR) | https://www.viprbrc.org/brc/home.spg?decorator=vipr | Sequences, analysis tools
Virus variation resource | https://www.ncbi.nlm.nih.gov/genome/viruses/variation/ | Influenza, rotavirus, dengue, yellow fever, Ebola, Zika, MERS sequences
GenBank | https://www.ncbi.nlm.nih.gov/genbank/ | Annotated collection of all publicly available DNA/RNA/protein sequences
RCSB Protein Data Bank | https://www.rcsb.org | 3D structure of proteins
UniProt | https://www.uniprot.org | Protein sequence and functional information. Contains SwissProt database
NCBI PDB | https://www.ncbi.nlm.nih.gov/protein/ | Protein database for biological structure and function
Cn3D | https://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml | View protein 3D structures
PyMOL | https://pymol.org/2/ | Molecular visualization system
BLAST (Basic Local Alignment Search Tool) | https://blast.ncbi.nlm.nih.gov/Blast.cgi | To search similar sequences based on local alignment
SABLE | https://sable.cchmc.org | Average solvent accessibility prediction tool
I-TASSER | https://zhanglab.ccmb.med.umich.edu/I-TASSER/ | Protein structure and function prediction software
JPRED | http://www.compbio.dundee.ac.uk/www-jpred | Protein secondary structure predictor
IEDB (Immune Epitope Database and Analysis Resource) | https://www.iedb.org | Epitope prediction software suite
ABCPred | http://crdd.osdd.net/raghava/abcpred/ | To predict B-cell epitopes
BepiPred | http://www.cbs.dtu.dk/services/BepiPred/ | Linear B-cell epitope predictor
While a peptide vaccine can be developed quickly enough to respond to
an epidemic, these several issues need to be overcome satisfactorily
before such vaccines can be marketed for human use. In the meantime,
the computational systems could be made more robust and responsive so
as to yield the desired results from the minimum set of genomic
information that may be available at the time of a sudden epidemic.
5 Notes
1. While we have assigned the four bases to four specified cardinal
directions, two more independent assignments are possible. The
assignment described in the text takes adenine on the negative
x-axis and then cytosine, guanine, and thymine in the other
cardinal directions going clockwise (axes assignment ACGT); the
two others would be AGCT and AGTC [30]. The graphical
representations of the same sequence would be different in the
three axes systems. Figure 5 shows an example of a Zika virus
envelope gene (Accession No. KY785476) plotted in the three
systems. Depending on which properties one may wish to dwell
on, one can choose one or more of such axes systems. Ref. 36, in
fact, is based on the AGTC representation to show base
distribution differences between different virus types.
[Figure 5, three panels: Zika virus envelope gene KY785476 in axes ACGT, axes AGCT, and axes AGTC]
Fig. 5 The Zika virus (KY785476) envelope gene drawn in three different axes systems in the 2D graphical
representation model
2. Sequences sometimes have ambiguous or unidentified bases,
usually marked as “n” in the database entries (e.g.,
atgnnnnccgtaa), where sequencing was not conclusive. In drawing
a 2D graph, such bases do not matter if the run is small compared
to the overall length of the sequence. But if the number of n’s is
large (e.g., 292 ambiguous bases in the Zika genome, Accession
No. MK817846), then the graph will not be representative
and computations from it may go awry. We find it best to select
sequences that have no, or at most a few, ambiguous or
unidentified bases so that the results are all comparable.
3. In computing the graph descriptor, gR, we have to divide by
N, the total number of bases in the sequence. In the case of
ambiguous bases, there is a question of what N should mean: the
total number less the number of ambiguous bases, or the total
number overall. We generally avoid sequences with ambiguous
bases, but even if one or two such bases are present, we keep N as
the total length of the sequence, ambiguous bases included.
4. The gR comes in very handy through its property that equal gR
implies identical sequences [29]. We have used this property to
weed out duplicates when working with a large number of
sequences.
5. Since in the 20D representation of protein sequences [31] all
directions are equivalent, there can be instances where two
peptides with different sequences lead to the same pR value. To
avoid this, two methods have been described in our paper
(Ref. 14), where the emphasis is on comparing pR values rather
than ascribing any meaning to their absolute values. One method
assigns a different constant number to each axis, and the other
simply adds a string of all 20 different amino acids to the
sequence for which pR is being calculated.
6. The protein variability and ASA profile computations have to
be compared to see at which amino acid positions there will be
high surface exposure and low protein variability. These are
necessarily bounded by the window size. Even then, the graphs
will be very jagged. We therefore use a moving average method,
taking the peptide window size as the window size for the moving
average also, to smooth the representative plots, and then
compare them visually to determine the peptide stretches that
fit our criteria. Efforts are on to automate this process.
7. The 3D protein structure from the PDB or equivalent
databases may be missing several residues at the beginning
and/or end of the sequence, depending on the experimental
materials and procedures. One therefore needs to be careful in
concluding which of the selected peptides can be expected to be
acceptable if they happen to be close to the end points.
8. The IEDB output considers 15-mer peptides at a time, starting
from the first amino acid, and then moves one amino acid along to
consider the next 15-mer peptide, and so on, computing the
binding affinity for each peptide. We look for the peptides we
identified in the previous steps in the IEDB output and deter-
mine how strong the binding is. These would be linear epi-
topes. Another option in IEDB will predict conformational
epitopes where we look for our peptides in the list; because of
protein folding, conformational epitopes are more relevant
than linear epitopes. Our peptides are short in this exercise
and will be only a part of the whole conformational epitope
identified by IEDB.
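Notes 1 to 3 can be made concrete with a short sketch of the 2D walk in the three axes systems. The descriptor below assumes that gR is the distance of the mean point of the walk from the origin, with the sums divided by the total sequence length N as in Note 3; that form, the function names, and the choice to take no step for an ambiguous base are assumptions for illustration and should be checked against the definitions given in the text.

```python
import math

# Unit steps in the order used by the axes label: the first base lies on
# the negative x-axis and the others follow clockwise (+y, +x, -y).
DIRECTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def walk(sequence, axes="ACGT"):
    """Return the points of the 2D walk for the chosen axes system."""
    steps = dict(zip(axes, DIRECTIONS))
    x = y = 0
    points = []
    for base in sequence.upper():
        if base not in steps:  # ambiguous base such as "N": no step (assumption)
            continue
        dx, dy = steps[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def graph_radius(sequence, axes="ACGT"):
    """Assumed form of gR: distance of the mean walk point from the origin.

    N is the total sequence length, ambiguous bases included (Note 3).
    """
    points = walk(sequence, axes)
    n = len(sequence)
    mu_x = sum(p[0] for p in points) / n
    mu_y = sum(p[1] for p in points) / n
    return math.sqrt(mu_x ** 2 + mu_y ** 2)
```

Calling walk with axes="AGCT" or axes="AGTC" gives the other two representations of the same sequence, which is how plots like those of Fig. 5 would be generated.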
Acknowledgments
We would like to acknowledge with thanks the help provided by
Mr. Tathagata Dutta in software and graph rendering.
References
1. Nandy A, Basak SC (2017) The epidemic that shook the world - the Zika virus rampage. Explor Res Hypothesis Med 2(3):43–56. https://doi.org/10.14218/ERHM.2017.00018
2. Riedel S (2005) Edward Jenner and the history of smallpox and vaccination. Proc (Baylor Univ Med Cent) 18(1):21–25
3. Fu ZF (1997) Rabies and rabies research: past, present and future. Vaccine 15:S20–S24
4. Roy P, Nandy A, Basak SC (2019) Zika virus - the quest for vaccines. In: Basak SC, Bhattacharjee AK, Nandy A (eds) Zika virus surveillance, vaccinology and anti-Zika drug discovery: computer-assisted strategies to combat the menace. Nova Science Publishers Inc., New York
5. Chit A, Parker J, Halperin SA, Papadimitropoulos M, Krahn M, Grootendorst P (2014) Toward more specific and transparent research and development costs: the case of seasonal influenza vaccines. Vaccine 32(26):3336–3340. https://doi.org/10.1016/j.vaccine.2013.06.055
6. Backert L, Kohlbacher O (2015) Immunoinformatics and epitope prediction in the age of genomic medicine. Genome Med 7:119
7. Tomar N, De RK (2014) Immunoinformatics: a brief review. In: De RK TN (ed) Methods Mol Biol, vol 1184. Springer Science+Business Media, New York. https://doi.org/10.1007/978-1-4939-1115-8_3
8. Poland GA, Kennedy RB, Ovsyannikova IG (2011) Vaccinomics and personalized vaccinology: is science leading us toward a new path of directed vaccine development and discovery? PLoS Pathog 7:e1002344
9. Poland GA, Whitaker JA, Poland CM, Ovsyannikova IG, Kennedy RB (2016) Vaccinology in the third millennium: scientific and social challenges. Curr Opin Virol 17:116–125
10. Rappuoli R (2001) Reverse vaccinology, a genome-based approach to vaccine development. Vaccine 19:2688–2691
11. Purcell AW, McCluskey J, Rossjohn J (2007) More than one reason to rethink the use of peptides in vaccine design. Nat Rev 6:404–414
12. Ghosh A, Nandy A, Nandy P (2010) Computational analysis and determination of a highly conserved surface exposed segment in H5N1 avian flu and H1N1 swine flu neuraminidase. BMC Struct Biol 10:6. https://doi.org/10.1186/1472-6807-10-6
13. Sarkar T, Das S, De A, Nandy P, Chattopadhyay S, Chawla-Sarkar M, Nandy A (2015) H7N9 influenza outbreak in China 2013: in silico analyses of conserved segments of the hemagglutinin as a basis for the selection of peptide vaccine targets. Comput Biol Chem 59:8–15
14. Ghosh A, Chattopadhyay S, Chawla-Sarkar M et al (2012) In Silico study of rotavirus VP7 surface accessible conserved regions for antiviral drug/vaccine design. PLoS One 7:e40749. https://doi.org/10.1371/journal.pone.0040749
15. Dey S, De A, Nandy A (2016) Rational design of peptide vaccines against multiple types of human papillomavirus. Cancer Inform 15:1–16. https://doi.org/10.4137/CIN.S39071
16. Dey S, Nandy A, Basak SC, Nandy P, Das S (2017) A bioinformatics approach to designing a Zika virus vaccine. Comput Biol Chem 68:143–152. https://doi.org/10.1016/j.compbiolchem.2017.03.002
17. Shawan MMAK, Mahmud HA, Hasan MM et al (2014) In Silico Modeling and Immunoinformatics probing disclose the epitope based peptide vaccine against Zika virus envelope glycoprotein. Ind J Pharma Biol Res 2:44–57
18. Badawi MM, Osman MM, Alla AAEF et al (2016) Highly conserved epitopes of ZIKA envelope glycoprotein may act as a novel peptide vaccine with high coverage: Immunoinformatics approach. Am J Biomed Res 4:46–60. https://doi.org/10.12691/ajbr-4-3-1
19. Dar H, Zaheer T, Rehman MT et al (2016) Prediction of promiscuous T-cell epitopes in the Zika virus polyprotein: an in silico approach. Asian Pac J Tropical Med 9:844–850. https://doi.org/10.1016/j.apjtm.2016.07.004
20. Dikhit MR, Ansari MY, Vijaymahantesh K et al (2016) Computational prediction and analysis of potential antigenic CTL epitopes in Zika virus: a first step towards vaccine development. Infect Genet Evol 45:187–197. https://doi.org/10.1016/j.meegid.2016.08.037
21. Mirza MU, Rafique S, Ali A et al (2016) Towards peptide vaccines against Zika virus: immunoinformatics combined with molecular dynamics simulations to predict antigenic epitopes of Zika viral proteins. Sci Rep 6:37313. https://doi.org/10.1038/srep37313
22. Nandy A, Basak SC (2016) A brief review of computer-assisted approaches to rational design of peptide vaccines. International J Mol Sci 17:666
23. Dey S, Das S, Nandy A (2017) Characterization of Zika and other human infecting Flavivirus envelope proteins and determination of common conserved epitope regions. EC Microbiol 8:29–46
24. Kuno G, Chang GJ (2007) Full-length sequencing and genomic characterization of Bagaza, Kedougou, and Zika viruses. Arch Virol 152:687–696. https://doi.org/10.1007/s00705-006-0903-z
25. Sirohi D, Chen Z, Sun L, Klose T, Pierson TC, Rossmann MG et al (2016) The 3.8 Å resolution cryo-EM structure of Zika virus. Science 352(6284):467–470. https://doi.org/10.1126/science.aaf5316
26. Dai L, Song J, Lu X, Deng Y-Q, Musyoki AM, Cheng H et al (2016) Structures of the Zika virus envelope protein and its complex with a Flavivirus broadly protective antibody. Cell Host Microbe 19:696–704. https://doi.org/10.1016/j.chom.2016.04.013
27. Nandy A (1994) A new graphical representation and analysis of DNA sequence structure: I. methodology and application to globin genes. Curr Sci 66:309–314
28. Raychaudhury C, Nandy A (1999) Indexing scheme and similarity measures for macromolecular sequences. J Chem Inf Comput Sci 39:243–247
29. Nandy A, Nandy P (2003) On the uniqueness of quantitative DNA difference descriptors in 2D graphical representation models. Chem Phys Lett 368:102–107
30. Nandy A, Harle M, Basak SC (2006) Mathematical descriptors of DNA sequences: development and applications. ARKIVOC 9:211–238
31. Nandy A, Ghosh A, Nandy P (2009) Numerical characterization of protein sequences and application to voltage-gated sodium channel alpha subunit phylogeny. In Silico Biol 9:77–87. https://doi.org/10.3233/ISB-2009-0389
32. National Centre for Biotechnology Information, BLAST (2019) https://blast.ncbi.nlm.nih.gov/Blast.cgi. Accessed 08 August 2019
33. Moisa AA, Kolesanova EF (2012) Synthetic peptide vaccines. In: Roy P (ed) An insight and control of infectious disease in global scenario. InTech, Rijeka Croatia
34. Disis ML, Bernhard H, Shiota FM, Hand SL, Gralow JR, Huseby ES, Gillis S, Cheever MA (1996) Granulocyte-macrophage Colony-stimulating factor: an effective adjuvant for protein and peptide-based vaccines. Blood 88(1):202–210
35. Liu F, Feuer R, Hassett DE, Whitton JL (2006) Peptide vaccination of mice immune to LCMV or vaccinia virus causes serious CD8+ T cell mediated, TNF-dependent immunopathology. J Clin Investig 116(2):465–475
36. Roy P, Dey S, Nandy A, Basak SC, Das S (2019) Base distribution in dengue nucleotide sequences differs significantly from other mosquito-borne human-infecting Flavivirus members. Curr Comput Aided Drug Des 15:29–44
Chapter 3
High-Definition Genomic Analysis of HLA Genes Via Comprehensive HLA Allele Genotyping
Shuji Kawaguchi and Fumihiko Matsuda
Abstract
HLA is essential for various medical applications, such as genomic studies of multifactorial diseases,
including immune system and inflammation-related disorders. Therefore, an accurate HLA typing method
that is applicable for any allele registered in HLA allele databases is required to deduce scientific evidence
related to disorders. Here, we describe a method for determining HLA alleles from next-generation
sequencing (NGS) results by using currently available HLA sequence data in public HLA databases and
show its application in association analysis.
Key words HLA allele, Genotyping, NGS, Software, Database, Bioinformatics, Logistic regression
1 Introduction
The human leukocyte antigen (HLA), which is a gene complex
encoding major histocompatibility complex (MHC) proteins in
humans, plays an important role in the regulation of the human
immune system. The HLA gene region is known as one of the most
polymorphic regions in the human genome. HLA alleles have been
reported to be associated with many immune disorders, such as
rheumatoid arthritis [1], systemic lupus erythematosus [2], and
IgG4-related disease [3]. Recent next-generation sequencing
(NGS) technology has enabled efficient sequencing of complete
HLA genes through high-throughput read generation using
human genomic DNA as a template [4]. Consequently, it has become
possible to genotype HLA alleles from whole genome sequencing
(WGS) data [5, 6], whole exome sequencing (WES) data, or targeted
sequencing of HLA genes [7] by using NGS technology. However,
determination of the correct HLA alleles from NGS data is still
demanding because of the polymorphic nature of the region; 22,362
HLA alleles have been recorded to date in IPD-IMGT/HLA (Release
3.36.0, April 2019), which is the largest HLA database in the world
[8]. To solve the problem, we developed a novel HLA genotyping
algorithm, called HLA typing from a high-quality dictionary
(HLA-HD) [9], which creates an HLA allele dictionary that
completes the current HLA allele information recorded in the
IPD-IMGT/HLA database and determines HLA alleles with
six-digit precision from NGS data. Several reports to date have
described the performance of HLA-HD in genotyping classical
class I and II HLA genes [10, 11].
HLA-HD can lead to identification of risk or protective HLA
alleles associated with a disease or a phenotype of interest through
case/control association analysis. However, to further examine
biological roles of associated HLA alleles, it is essential to clarify
which portion of HLA protein influences disease onset or pheno-
type. For such analyses, amino acid sequences of HLA alleles iden-
tified in case and control populations need to be aligned based on
the protein structure, and distribution of amino acid residues at
each position along the protein should be compared between cases
and controls.
In this chapter, we describe the usage of HLA-HD for NGS
data and show how to treat the genotyping result for an association
analysis of amino acid residues.
2 Materials
2.1 HLA-HD Setup
HLA-HD was developed to enable HLA typing using WGS data,
WES data, RNA-seq data, and target sequence data of HLA genes.
HLA-HD is freely available for academic use and research purposes
upon registration and can be downloaded from the HLA-HD
website (https://www.genome.med.kyoto-u.ac.jp/HLA-HD/).
HLA-HD works on almost any operating system, including
Linux, Mac OS, and Windows. After the HLA-HD source program
is downloaded from the website, HLA-HD can be installed by
typing “sh install.sh”. Installation requires the GNU Compiler
Collection (https://gcc.gnu.org/). After the installation, PATH
must be set in the computer environment being used, e.g., by
adding the line
“export PATH=$PATH:/path_to_HLA-HD_install_directory/bin”
to .bashrc. HLA-HD requires bowtie2
(http://www.metagenomics.wiki/tools/bowtie2/) [12] to align the
NGS reads to the sequences of HLA alleles.
2.2 Input Data for HLA-HD
HLA-HD requires fastq (fastq.gz) data that include sequences of
the HLA genes generated by NGS. Mapping of WGS and WES
data takes time because HLA-HD maps all NGS reads to the currently
available HLA allele sequences. Therefore, to reduce the time
cost, reads that do not map to HLA allele sequences should be
removed in advance by using samtools
(http://samtools.sourceforge.net/) and picard tools
(https://broadinstitute.github.io/picard/) as follows (see Notes 1
and 2).
1. Obtain the full-resolution (eight-digit) HLA sequence data from
the IPD-IMGT/HLA site (ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_gen.fasta).
2. Create the bowtie2 index by typing:
    bowtie2-build hla_gen.fasta hla_gen
and map the fastq files to it:
    bowtie2 -x hla_gen -1 sample_1.fastq -2 sample_2.fastq -S sample.hlamap.sam
3. Extract the mapped reads:
    samtools view -h -F 4 sample.hlamap.sam > sample.mapped.sam
4. Convert the mapped sam file to fastq with picard tools:
    java -jar picard.jar SamToFastq I=sample.mapped.sam F=sample.hlatmp.1.fastq F2=sample.hlatmp.2.fastq
5. Change the fastq IDs as follows:
    cat sample.hlatmp.1.fastq | awk '{if(NR%4==1){O=$0;gsub("/1"," 1",O);print O}else{print $0}}' > sample.hla.1.fastq
    cat sample.hlatmp.2.fastq | awk '{if(NR%4==1){O=$0;gsub("/2"," 2",O);print O}else{print $0}}' > sample.hla.2.fastq
After the filtering, “sample.hla.1.fastq” and “sample.hla.2.fastq”
are used as the new hlahd input.
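For readers who prefer Python to awk, the ID change in step 5 can be sketched as below. The sketch assumes that the intent of that step is to replace the “/1” or “/2” mate suffix in each read header (every fourth line, starting with the first) with a space followed by the mate number; the function name is hypothetical.

```python
def rewrite_fastq_ids(lines, mate):
    """Yield fastq lines with '/<mate>' in each header replaced by ' <mate>'.

    lines: iterable of fastq lines; the header is line 1 of every 4-line record.
    mate:  1 or 2, the read number of the file being processed.
    """
    old, new = "/" + str(mate), " " + str(mate)
    for index, line in enumerate(lines):
        # index is zero-based, so headers are at 0, 4, 8, ...
        yield line.replace(old, new) if index % 4 == 0 else line
```

A file can be processed by passing an open file handle as lines and writing the yielded lines to the output file, e.g., sample.hla.1.fastq for mate 1.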
2.3 Amino Acid Residue Data for HLA Protein
To perform association analyses of amino acid residues in HLA
proteins, a data set of aligned positions for each HLA allele is
required. The IPD-IMGT/HLA database provides the aligned
position data at its FTP site
(ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/alignments/).
Aligned data for the HLA proteins targeted in the research of
interest can be downloaded; e.g., data for HLA-A are provided as
“A_prot.txt”. However, these positions are merely numbered from
the first amino acid residue of the leader peptide and therefore
are not convenient for a comparative study between various HLA
domains. Furthermore, sequences of almost all the alleles are
unfortunately recorded only at the MHC groove domains
(G-DOMAINs). In light of this problem, IMGT proposed a unique
numbering system to unify the positions of G-DOMAINs [13]. This
numbering has become highly valuable for association studies of
amino acid residues in HLA proteins.
HLA-HD can genotype any HLA genes and alleles recorded in
the IPD-IMGT/HLA database. Therefore, HLA-HD is best suited
for analysis of G-DOMAINs, which are known as significant
domains in many immune disorders.
2.4 Data Set and Scripts of Analysis for Amino Acid Residues
The data set of typed HLA alleles must be converted to four-digit
resolution. If HLA alleles are newly identified and not recorded in
the IPD-IMGT/HLA database, they should be registered in the
database before analysis [8]. Sample scripts and data can be
downloaded from
https://www.genome.med.kyoto-u.ac.jp/HLA-HD/hla_aa_analysis/
(see Note 3). In the demonstration, we used public HLA data of
the Southern Han Chinese (CHS) and Japanese in Tokyo (JPT)
populations, whose samples were sequenced for the 1000 Genomes
Project [14] and typed previously [15]. The scripts are written in
Python and R and were checked with Python versions 2.7.10 and
3.5.4 and R version 3.6.0.
3 Methods
3.1 HLA Typing by HLA-HD
1. HLA-HD: for paired-end short read data, the input command is
    hlahd.sh [-m <int>] [-t <int>] [-c 0 to 1.0] [-f /path/to/freq/data] <fastq data 1> <fastq data 2> <hla gene split file> /path/to/dictionary <result name> /output/directory
where square brackets indicate options. If the NGS data are
single-end short reads, the same file name given for <fastq
data 1> is also given for <fastq data 2>.
2. The HLA genes that are going to be used for typing must be
included in <hla gene split file>. All the classical HLA genes
and other major genes are listed in the HLA_gene.split.txt file
in the installed directory.
3. -m: minimum length of input reads. A read that is shorter than
this parameter is ignored. The default size is 100.
4. -t: number of cores used to execute the program. This
parameter applies to both mapping and typing.
5. -c: if a completely matching sequence is not found in the
dictionary, the read is trimmed until some sequence is matched
or the read reaches this ratio. The default is 1.0.
6. -f: this option enables allele determination by allele frequencies
from allele count data in cases where an allele pair is not
uniquely determined by the end of the run (see Note 4). The
default setting uses data from the allele frequency net database
[16], which are located in the installed directory
(/hlahd.version/freq_data).
3.2 Convert Typing Results to IMGT Unique Numbering
The HLA alleles typed by HLA-HD are recorded at six-digit
resolution in sampleID_final.result.txt as tab-separated text. If
the gene was typed as homozygous, one of the two alleles is
represented by a hyphen. If a single candidate has not been
determined for an allele pair by the end of the run, multiple
candidates are listed in parallel. In contrast, the allele pair is
recorded as “Not typed” if no candidates were obtained. To convert
HLA alleles to amino acid residues aligned by IMGT unique
numbering, first, each six-digit allele name must be reduced to
four-digit resolution by truncating the allele name at the second
colon, and the typed alleles must be merged into a tab-separated
text file with the columns sample, gene 1 allele 1, gene 1 allele
2, gene 2 allele 1, gene 2 allele 2, and so on, for
each sample set (Fig. 1).
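The truncation to four-digit resolution amounts to keeping the first two colon-separated fields of the allele name. The helper functions below are a minimal sketch with hypothetical names; copying the first allele when the second is a hyphen is an assumption based on the homozygous rows of Fig. 1.

```python
def to_four_digit(allele):
    """Truncate an allele name such as 'A*24:02:01' to 'A*24:02'."""
    if allele in ("-", "Not typed"):
        return allele  # placeholders are passed through unchanged
    return ":".join(allele.split(":")[:2])


def four_digit_pair(allele_1, allele_2):
    """Return both alleles at four-digit resolution.

    A hyphen for the second allele is taken to mean a homozygous call,
    so the first allele is repeated (assumed from the layout of Fig. 1).
    """
    first = to_four_digit(allele_1)
    second = first if allele_2 == "-" else to_four_digit(allele_2)
    return first, second
```

For example, four_digit_pair("A*24:02:01", "-") gives the pair A*24:02 and A*24:02, matching the way homozygous samples appear in the merged table.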
Second, the amino acid position files of the HLA proteins to be
analyzed must be downloaded from the IPD-IMGT/HLA ftp site
(discussed in Subheading 2.3). By using python scripts and data
files, the typed allele list is realigned to IMGT unique numbering as
follows:
1. Concatenate the amino acid residues of each allele in the position
file, e.g., the position file for HLA-A, “A_prot.txt”, is
concatenated by using the python script:
    python concatinate_aa.py A_prot.txt > A_con.txt
2. Align the concatenated file to the G-DOMAIN positions (Fig. 2):
    python align_to_gdomain.py G-DOMAIN.txt A_con.txt A > A_aa.gd.txt
3. Repeat steps 1 and 2 for all HLA genes to be analyzed.
4. Merge the aa.gd.txt files by typing:
    cat *_aa.gd.txt > aa.gd.all.txt
Sample ID A_1 A_2 B_1 B_2 C_1 C_2 DRB1_1 DRB1_2

NA18939 A*11:01 A*31:01 B*27:04 B*67:01 C*07:02 C*12:02 DRB1*15:01 DRB1*15:01
NA18940 A*24:02 A*24:02 B*46:01 B*52:01 C*01:03 C*12:02 DRB1*08:02 DRB1*15:02
NA18941 A*24:02 A*24:20 B*15:07 B*40:01 C*03:03 C*03:04 DRB1*14:54 DRB1*04:05
NA18942 A*24:02 A*33:03 B*35:01 B*44:03 C*03:03 C*14:03 DRB1*04:05 DRB1*12:01
NA18943 A*02:06 A*02:07 B*46:01 B*35:01 C*01:02 C*03:03 DRB1*08:02 DRB1*08:03
NA18944 A*02:06 A*24:02 B*40:02 B*51:01 C*03:04 C*14:02 DRB1*08:02 DRB1*12:01
NA18945 A*31:01 A*33:03 B*15:01 B*44:03 C*07:02 C*14:03 DRB1*13:02 DRB1*09:01
NA18946 A*02:01 A*24:02 B*07:02 B*52:01 C*07:02 C*12:02 DRB1*01:01 DRB1*04:05
NA18947 A*02:06 A*24:02 B*52:01 B*52:01 C*12:02 C*12:02 DRB1*15:02 DRB1*15:02
NA18948 A*26:01 A*30:01 B*07:02 B*13:02 C*06:02 C*07:02 DRB1*01:01 DRB1*07:01
NA18949 A*24:02 A*26:03 B*07:02 B*40:02 C*03:04 C*07:02 DRB1*01:01 DRB1*09:01
NA18950 A*24:02 A*24:02 B*07:02 B*52:01 C*07:02 C*12:02 DRB1*01:01 DRB1*15:02
Fig. 1 A sample txt file of the typing result. Each column must be separated by tab character, and the header of
two alleles should be described as genename_1 and genename_2
Allele Domain Amino acid residues of G-DOMAIN

#DQB1 D2 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,~
DQB1*05:01 D2 EDFVYQFKGLCYFTNGTERVRGVTRHIYNREEYVRFDSDVGVYRAVTPQGRPVAEYWNSQKEVLEGARASVDRVCRHNYEVAYRGILQRR
DQB1*05:02 D2 EDFVYQFKGLCYFTNGTERVRGVTRHIYNREEYVRFDSDVGVYRAVTPQGRPSAEYWNSQKEVLEGARASVDRVCRHNYEVAYRGILQRR
DQB1*05:03 D2 EDFVYQFKGLCYFTNGTERVRGVTRHIYNREEYVRFDSDVGVYRAVTPQGRPDAEYWNSQKEVLEGARASVDRVCRHNYEVAYRGILQRR
DQB1*05:04 D2 EDFVYQFKGLCYFTNGTERVRGVTRYIYNREEYVRFDSDVGVYRAVTPQGRPSAEYWNSQKDILEEDRASVDRVCRHNYEVAYRGILQRR
DQB1*05:05 D2 *DFVYQFKGLCYFTNGTERVRGVTRHIYNREEYARFDSDVGVYRAVTPQGRPSAEYWNSQKEVLEGARASVDRVCRHNYEVAYRGILQRR
DQB1*05:106 D2 EDFVYQFKGLCYFTNGTERVRGVTRHIYNREEYVRFDSDVGVYRAVTPQGRPSAEYWNSQKEVLEGARASVDRVCRHNYEVAYRGILQRR
DQB1*05:107 D2 *DFVYQFKGLCYFTNGTERVRGVTRHIYNREEYVRFDSDVGVYRAVTPQGRPVAEYWNSQKEVLEGARASVDRVCRHNYEVAYRGILQRR
DQB1*05:108 D2 *DFVYQFKGLCYFTNGTERVRGVTRHIYNREEYVRFDSDVGVYRAVTPQGRPDAEYWNSQKEVLEGARASVDRVCRHNYEVAYRGILQRR
DQB1*05:109 D2 *DFVYQFKGLCYFTNGTERVRGVTRHIYNREEYVRFDSDVGVYRAVTPQGRPDAEYWNSQKEVLEGARASVDRVCRHNYKVAYRGILQRR
DQB1*05:110N D2 *DFVYQFKGLCYFTNGTERVRGVTRHIYNX............................................................
Fig. 2 A converted data aligned to G-DOMAINs for HLA-DRB1 (DRB1_aa.gd.txt). An asterisk and a dot mean
amino acid residue is not recorded in the IPD-IMGT/HLA database and not translated at this position,
respectively
5. Convert typed alleles to amino acid information by using aa.gd.all.txt:
“python create_r_input.py 1000Genomes_CHS_hla.txt 1000Genomes_JPT_hla.txt aa.gd.all.txt CHS_JPT_hlagd”.
Finally, “CHS_JPT_hlagd.aa.txt” and “CHS_JPT_hlagd.pos.txt” are created (see Note 5).
3.3 Association Analysis of HLA Amino Acid Residues of G-DOMAINs
To perform association analysis of the amino acid residues of G-DOMAINs based on a binomial logistic regression model [17] (see Note 6), run the R program by typing “R” on the console or launching the R application, and load the script “log_reg_hla_aa.r” by “source("log_reg_hla_aa.r")”. The logistic regression is then performed as follows:
“glmhla <- glmHLA("CHS_JPT_hlagd.aa.txt","CHS_JPT_hlagd.pos.txt")”.
The regression result of the most significant position (glmhla$minpos) and its p value (glmhla$minp), p values of each domain (glmhla$p), counts of amino acid residues (glmhla$n), and positions at G-DOMAIN (glmhla$pos) can then be examined (Fig. 3a). The amino acid positions significantly associated with a trait of interest, such as a disease or a phenotype, can also be viewed graphically by the command: “plot_glmhla(glmhla)” (Fig. 3b).
[Figure 3, panels a and b. Panel b: y-axis, log10 p (0 to −15); x-axis, positions 1–92 of each G-DOMAIN]
Fig. 3 (a) A result of binomial logistic regression for amino acid residues of G-DOMAINs between CHS and JPT
data sets. The most significant position and frequencies of amino acid residues of the position can be checked
from the returned values. (b) A plot of p values at G-DOMAINs. The red circle at HLA-A D2 domain (G-ALPHA2
domain) shows the most significant amino acid position. The gray line shows the significant threshold as
determined by Bonferroni correction, considering the number of positions contributing to variation in the
G-DOMAINs
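The Bonferroni threshold mentioned in the caption can be sketched as follows. This is a minimal illustration under stated assumptions: the significance level of 0.05 and the function names are hypothetical, and the count of variable positions treats dots and asterisks as missing data, as in Fig. 2; the chapter's R script may compute it differently.

```python
import math


def count_variable_positions(residues_by_position):
    """Count positions at which more than one amino acid residue is observed."""
    missing = {".", "*"}  # not translated / not recorded, as in Fig. 2
    return sum(
        1 for residues in residues_by_position
        if len(set(residues) - missing) > 1
    )


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold: alpha divided by the number of tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Toy columns of an alignment (one string per position), for illustration only.
positions = ["AAAA", "AVAA", "G.GG", "ST*S"]
n_variable = count_variable_positions(positions)
threshold = bonferroni_threshold(n_variable)
print(n_variable, threshold, math.log10(threshold))
```

On a log10 p axis such as the one in Fig. 3b, the threshold line would sit at the log10 of this value.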
4 Notes
1. The use of all sequence reads without filtering is recommended to maximize typing accuracy, at the expense of calculation cost.
2. To use Picard, Java 1.8 must be installed on the computer used for the analysis.
3. Data files and script files are compressed as imminfo_data.zip and imminfo_scripts.zip, respectively.
4. If an allele pair is not uniquely determined even after the typing algorithm has finished, the pair (A, B) with the highest P(A)P(B) is selected, where P(Z) is the frequency of allele Z.
5. If the typing result file contains an untyped allele, multiple candidates, or an allele unknown in prot.txt, the converted amino acids are filled with dots and ignored in the logistic regression analysis.
6. This demonstration does not include covariates. To avoid confounding effects, appropriate covariates must be included in the model.
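The selection rule in Note 4 can be sketched in Python as follows. This is an illustration only, not HLA-HD's implementation: the function name is hypothetical, alleles missing from the frequency table are assumed to have frequency 0, and the frequencies shown are made-up values rather than data from the Allele Frequency Net Database.

```python
def select_pair(candidates, freq):
    """Pick the candidate pair (A, B) with the highest product P(A) * P(B)."""
    return max(
        candidates,
        key=lambda pair: freq.get(pair[0], 0.0) * freq.get(pair[1], 0.0),
    )


# Made-up frequencies for illustration.
freq = {"A*24:02": 0.36, "A*24:20": 0.01, "A*24:07": 0.002}
candidates = [("A*24:02", "A*24:07"), ("A*24:02", "A*24:20")]
best = select_pair(candidates, freq)
print(best)
```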
References
1. Okada Y, Kim K, Han B et al (2014) Risk for ACPA-positive rheumatoid arthritis is driven by shared HLA amino acid polymorphisms in Asian and European populations. Hum Mol Genet 23:6916–6926
2. Dostál C, Iványi D, Macurová H et al (1977) HLA antigens in systemic lupus erythematosus. Ann Rheum Dis 36:83–85
3. Terao C, Ota M, Iwasaki T et al (2019) IgG4-related disease in the Japanese population: a genome-wide association study. Lancet Rheumatol 1:e14–e22
4. Mardis ER (2008) The impact of next-generation sequencing technology on genetics. Trends Genet 24:133–141
5. Erlich RL, Jia X, Anderson S et al (2011) Next-generation sequencing for HLA typing of class I loci. BMC Genomics 12:42
6. Gabriel C, Fürst D, Faé I et al (2014) HLA typing by next-generation sequencing – getting closer to reality. Tissue Antigens 83:65–75
7. Hosomichi K, Jinam TA, Mitsunaga S et al (2013) Phase-defined complete sequencing of the HLA genes by next-generation sequencing. BMC Genomics 14:355
8. Robinson J, Soormally AR, Hayhurst JD et al (2016) The IPD-IMGT/HLA database – new developments in reporting HLA variation. Hum Immunol 77:233–237
9. Kawaguchi S, Higasa K, Shimizu M et al (2017) HLA-HD: an accurate HLA typing algorithm for next-generation sequencing data. Hum Mutat 38:788–797
10. Marty Pyke R, Thompson WK, Salem RM et al (2018) Evolutionary pressure against MHC class II binding cancer mutations. Cell 175:416–428.e13
11. Kishikawa T, Momozawa Y, Ozeki T et al (2019) Empirical evaluation of variant calling accuracy using ultra-deep whole-genome sequencing data. Sci Rep 9:1784
12. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with bowtie 2. Nat Methods 9:357–359
13. Lefranc M-P, Duprat E, Kaas Q et al (2005) IMGT unique numbering for MHC groove G-DOMAIN and MHC superfamily (MhcSF) G-LIKE-DOMAIN. Dev Comp Immunol 29:917–938
14. 1000 Genomes Project Consortium, Abecasis GR, Auton A et al (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65
15. Abi-Rached L, Gouret P, Yeh J-H et al (2018) Immune diversity sheds light on missing variation in worldwide genetic diversity panels. PLoS One 13:e0206512
16. González-Galarza FF, Takeshita LYC, Santos EJM et al (2015) Allele frequency net 2015 update: new features for HLA epitopes, KIR and disease and HLA adverse drug reaction associations. Nucleic Acids Res 43:D784–D788
17. Raychaudhuri S, Sandor C, Stahl EA et al (2012) Five amino acids in three HLA proteins explain most of the association between MHC and seropositive rheumatoid arthritis. Nat Genet 44:291–296
Chapter 4
A Computational Vaccine Designing Approach for MERS-CoV Infections
Hiba Siddig Ibrahim and Shamsoun Khamis Kafi
Abstract
The aim of this study was to use IEDB software to predict suitable MERS-CoV epitope vaccine candidates against the most common alleles in the world population, using four selected protein sequences: the S glycoprotein, the envelope (E) protein, and modified sequences of each, following the spread of MERS-CoV in 2012. IEDB services are among the available computational methods. The output of this study showed that the S glycoprotein, the E protein, and the modified S and E sequences of MERS-CoV might be considered protective immunogens with high conservancy, because they can elicit both neutralizing antibody and T-cell responses when reacting with B cells, T-helper cells, and cytotoxic T lymphocytes. NetCTL, NetChop, and MHC-NP were used to confirm our results. Population coverage analysis showed that the putative helper T-cell epitopes and CTL epitopes could cover most of the world population in more than 60 geographical regions. According to the AllerHunter results, all of the selected proteins were predicted to be non-allergens; this finding makes this computational vaccine study more desirable for vaccine synthesis.
Key words Middle East respiratory syndrome coronavirus, Severe acute respiratory syndrome coronavirus, Federal Drug Administration, Immune epitope database, FAO, AllerHunter
1 Introduction
Vaccine development is among the most important means of protection against highly infectious diseases, especially when no treatment is available. Vaccine design can now draw on immunoinformatics, which relies on software to identify the most immunogenic parts of an organism (epitopes). Such software was used in this study in an attempt to develop a more strongly immunogenic MERS-CoV vaccine; previous MERS-CoV vaccine candidates have been inactivated coronavirus, live attenuated coronavirus, S protein-based vaccines, DNA vaccines, or combination vaccines against coronaviruses. Coronaviruses were first described in the 1960s, from the nasal cavities of patients with the common cold; these strains were called HC-229E and HC-OC43. In 2003, an outbreak of severe acute respiratory syndrome (SARS) resulted in over 8000 infections, about 10% of which resulted in death. On 24 September 2012, the Egyptian virologist Dr. Ali Mohamed Zaki, in Jeddah, Saudi Arabia, first reported the isolation of a novel SARS-CoV-like coronavirus from the lungs of a 60-year-old male patient with acute pneumonia and acute renal failure; the finding was posted on ProMED-mail, and the virus was later named MERS-CoV [1–3]. MERS-CoV belongs to the group C β-coronaviruses: it is an enveloped, positive-sense ssRNA virus with a 30 kb genome and 10 predicted open reading frames (ORFs), including E, M, and S. MERS-CoV can be grown in culture media. Genome size, organization, and sequence analysis revealed that the novel coronavirus (NCoV) is most closely related to the bat coronaviruses BtCoV-HKU4 and BtCoV-HKU5. A partial spike gene sequence from South African Neoromicia bats was also considered a close relative of MERS-CoV, as illustrated by the nucleotide percentage distance substitution model with the complete deletion option in MEGA. This makes the possibility of a common coronavirus vaccine more desirable [3–5].
This study used the S and E protein sequences, together with modified S and E sequences, in an in silico approach to develop a MERS-CoV vaccine and to study the effects of mutations in the selected sequences on vaccine development. The spike glycoprotein is a trimeric, envelope-anchored, type I fusion glycoprotein that interacts with the human dipeptidyl peptidase 4 (DPP4) receptor to mediate viral entry. It is composed of two subunits: S1, which contains the receptor-binding domain and determines cell tropism, and S2, the location of the cell fusion machinery. The E protein is considered part of the viral membrane [4, 6].
This study showed that S, E, and their modified sequences can be considered safe and highly promising MERS-CoV vaccine candidates, without any predicted allergic reactions.
2 Materials and Methods
2.1 Protein Sequence Retrieval
A total of 130 spike (S) glycoprotein and 41 envelope (E) protein sequences of MERS-CoV were retrieved from the NCBI database (http://www.ncbi.nlm.nih.gov/protein/) in September 2016. The sequences had been collected in different parts of the world, such as Saudi Arabia, China, Thailand, the United Kingdom, Qatar, Tunisia, and South Africa. The accession numbers of the retrieved strains are listed in Supplementary Tables 1 and 2. All methods below were applied to the S, E, and modified S and E proteins; the modified S and E proteins were made by randomly changing some amino acids in their reference sequences. GenBank accession numbers are given in Table 1 for the envelope protein (E) and in Table 2 for the spike glycoprotein (S).
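The chapter does not state how the random amino acid changes were made or how many were introduced. One way to produce such a modified sequence reproducibly is sketched below; the function name, the number of changes, the fixed random seed, and the toy input string are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mutate(sequence, n_changes, seed=0):
    """Substitute n_changes randomly chosen positions with a different residue.

    Returns the modified sequence and a list of changes such as 'A12V'
    (original residue, 1-based position, new residue).
    """
    rng = random.Random(seed)  # fixed seed so the result can be reproduced
    residues = list(sequence)
    positions = rng.sample(range(len(residues)), n_changes)
    changes = []
    for pos in sorted(positions):
        original = residues[pos]
        replacement = rng.choice([aa for aa in AMINO_ACIDS if aa != original])
        residues[pos] = replacement
        changes.append(f"{original}{pos + 1}{replacement}")
    return "".join(residues), changes


# Toy input, not a MERS-CoV sequence.
reference = "ACDEFGHIKLMNPQRSTVWY"
modified, changes = mutate(reference, n_changes=3, seed=1)
print(modified, changes)
```

Recording the list of changes alongside the modified sequence makes it straightforward to submit each substitution to the mutation-effect tools described later.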
Table 1
GenBank accession numbers of the envelope (E) protein sequences
Accession No of E protein Date and place of collection Type of specimen

YP_009047209.1 13-Jun-2012
AKJ80142.1 27-May-2015/China Nasopharyngeal swab
AIZ74456.1 07-May-2013/France Sputum on Vero E6
AIZ74443.1 07-May-2013/France Induced sputum
AIZ74434.1 07-May-2013/France Induced sputum
AIZ74422.1 26-Apr-2013/France Broncho-alveolar lavage
AIZ74406.1 26-Apr-2013/France Broncho-alveolar lavage
AID50423.1 10-Feb-2013/United Kingdom Throat swab
AID50423.1 10-Feb-2013/United Kingdom Throat swab
ALD51909.1 17-Jun-2015/Thailand Sputum
AMQ49075.1 24-Aug-2015/Saudi Arabia Respiratory secretions
AMQ49020.1 12-Jul-2015/Saudi Arabia Respiratory secretions
ALW82736.1 02-Feb-2015/Saudi Arabia
ALW82714.1 05-Feb-2015/Saudi Arabia Respiratory secretions
ALW82674.1 27-Mar-2015/Saudi Arabia Respiratory secretions
AFY13312.1 11-Sep-2012/United Kingdom
AIG13101.1 2011/South Africa
AHY21474.1 Mammalian cell line Vero CCL81
AHY22569.1 Nov-2013/Saudi Arabia nasal swab (camel)
AHB33331.1 07-May-2013/France Vero E6 isolate/sputum
AHC74092.1 13-Oct-2013/Qatar
AHC74103.1 17-Oct-2013/Qatar
AHI48522.1 02-May-2013/Saudi Arabia
AHI48566.1 05-Aug-2013/Saudi Arabia
AHI48533.1 17-Jul-2013/Saudi Arabia
AHI48555.1 12-Jun-2013/Saudi Arabia
AHI48588.1 02-Jul-2013/Saudi Arabia
AHI48599.1 12-Jun-2013/Saudi Arabia
AHI48610.1 01-Mar-2013/Saudi Arabia
2.2 In Silico PCR
In silico PCR amplification (http://insilico.ehu.es/PCR_virus/) is a program that simulates PCR amplification against sequenced viruses and also serves as a primer confirmation tool. Here it was used for the above viruses with its stored GenBank sequences; it contains 1783 sequences from 1421 completely sequenced viruses (last update: 31 May 2010).
2.3 Determination of Conserved Regions
The retrieved sequences, which were collected from NCBI, were used as a platform to obtain the conserved regions by multiple sequence alignment (MSA). Sequences were aligned with the aid of ClustalW as implemented in the BioEdit program, version 7.0.9.0.
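Once an alignment is available, conserved regions can be read off as runs of columns in which every sequence carries the same residue. The sketch below illustrates that idea on already aligned strings; it is not part of BioEdit or ClustalW, and the function name, the minimum run length, and the toy alignment are assumptions.

```python
def conserved_regions(aligned, min_length=1):
    """Return (start, end, sequence) for runs of fully conserved columns.

    Positions are 1-based and inclusive, relative to the alignment.
    Columns containing a gap ('-') are treated as not conserved.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    regions = []
    run_start = None
    for i in range(length + 1):  # one extra step closes a run at the end
        conserved = (
            i < length
            and aligned[0][i] != "-"
            and len({seq[i] for seq in aligned}) == 1
        )
        if conserved and run_start is None:
            run_start = i
        elif not conserved and run_start is not None:
            if i - run_start >= min_length:
                regions.append((run_start + 1, i, aligned[0][run_start:i]))
            run_start = None
    return regions


# Toy alignment for illustration only.
alignment = ["MKTAYIAKQR", "MKTAYLAKQR", "MKT-YIAKQR"]
regions = conserved_regions(alignment, min_length=3)
print(regions)
```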
2.4 B-Cell Epitope Prediction
A B-cell epitope is characterized by being hydrophilic, accessible, and flexible, by having antigenic propensity, and by lying in a beta turn region. Thus, the classical propensity scale methods and hidden Markov model-based software from the IEDB analysis resource (http://www.iedb.org/) were used for the following aspects:
2.4.1 Prediction of Linear B-Cell Epitopes
BepiPred from the Immune Epitope Database and Analysis Resource (http://tools.iedb.org/bcell/) was used for linear B-cell epitope prediction from the conserved regions with a default threshold value of 0.350. BepiPred combines the predictions of a hidden Markov model and the propensity scale of Parker et al., as described in Larsen et al. (Immunome Research, 2006).
2.4.2 Prediction of Surface Accessibility
With the Emini surface accessibility prediction tool of the Immune Epitope Database (IEDB), the surface-accessible epitopes were predicted from the conserved regions, holding the default threshold value of 1.000 or higher.
Table 2
GenBank accession numbers of the S glycoprotein sequences
Accession No of S glycoprotein Date and place of collection Type of specimen

YP_009047204.1 13-Jun-2012
AHX00721.1 30-Dec-2013/Saudi Arabia Camel
AHX00711.1 30-Dec-2013/Saudi Arabia Dromedary
AHX00731.1 30-Nov-2013/Saudi Arabia Dromedary
AHZ90568.1 08-May-2013/Tunisia Serum
AHX71946.1 16-Feb-2014/Qatar Camelus dromedaries
ALJ54521.1 12-May-2015/Saudi Arabia Respiratory secretions
ALJ54520.1 13-Jun-2015/Saudi Arabia Respiratory secretions
ALJ54513.1 25-Apr-2015/Saudi Arabia Respiratory secretions
ALJ54504.1 20-May-2015/Saudi Arabia Respiratory secretions
ALJ54501.1 21-Mar-2015/Saudi Arabia Respiratory secretions
ALJ54486.1 28-Feb-2015/Saudi Arabia Respiratory secretions
AID55078.1 2014/Saudi Arabia
AID55073.1 22-Apr-2014/Saudi Arabia
ALJ54462.1 Saudi Arabia Respiratory secretions
ALJ54478.1 29-Mar-2015/Saudi Arabia Respiratory secretions
ALJ54462.1 30-Jan-2015/Saudi Arabia Respiratory secretions

2.4.3 Prediction of Epitope Antigenicity Sites
The Kolaskar and Tongaonkar antigenicity method was used to determine the antigenic sites with a default threshold value of 1.045.
2.4.4 Prediction of Epitope Hydrophilicity
The Parker hydrophilicity prediction tool was used to determine the hydrophilicity of the conserved regions; the default threshold value was 1.286.
2.4.5 Prediction of Beta Turn Sites
The Chou and Fasman beta turn prediction method was used with the default threshold of 1.009 to determine the sites that contain beta turns.
2.4.6 Prediction of Flexibility
The Karplus and Schulz flexibility prediction tool was used to predict chain flexibility in the proteins (for selection of peptide antigens) with a default threshold value of 0.992.
The thresholds for all tools were provided by IEDB; each is calculated by the software mainly as the average score of the tested protein for the corresponding tool.
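The thresholding logic described above, with the threshold taken as the average score and residues at or above it retained, can be sketched as follows. This is not the IEDB implementation: the per-residue scores are assumed to be already computed (the IEDB tools derive them from sliding windows), and the function names and toy values are hypothetical.

```python
def mean_threshold(scores):
    """Threshold taken as the average score over the protein."""
    return sum(scores) / len(scores)


def segments_at_or_above(sequence, scores, threshold, min_length=1):
    """Return (start, end, peptide) for runs of residues scoring >= threshold.

    Positions are 1-based and inclusive.
    """
    if len(sequence) != len(scores):
        raise ValueError("one score per residue is required")
    segments = []
    start = None
    for i in range(len(scores) + 1):  # one extra step closes a final run
        passing = i < len(scores) and scores[i] >= threshold
        if passing and start is None:
            start = i
        elif not passing and start is not None:
            if i - start >= min_length:
                segments.append((start + 1, i, sequence[start:i]))
            start = None
    return segments


# Toy sequence and made-up scores, for illustration only.
sequence = "MKTAYIAK"
scores = [0.2, 0.5, 0.6, 0.1, 0.9, 0.9, 0.9, 0.1]
threshold = mean_threshold(scores)
segments = segments_at_or_above(sequence, scores, threshold)
print(threshold, segments)
```

Running the same function on the scores from each propensity scale and intersecting the resulting segments is one way to find regions that pass several criteria at once.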
2.5 T-Cell Epitope Prediction
T-cell epitope prediction scans an antigen sequence for amino acid patterns indicative of the following:
2.5.1 MHC Class I Binding Predictions
Peptide binding to MHC class I molecules was assessed by the IEDB MHC I prediction tool (http://tools.iedb.org/mhci/). For MHC-I binding prediction, several alleles reported as frequent around the world were used, including HLA-A, HLA-B, HLA-C, and HLA-E. Presentation of the MHC-I peptide complex to T lymphocytes involves several steps; the step predicted here was the attachment of cleaved peptides to MHC molecules. The consensus method, which combines ANN, SMM, and scoring matrices derived from combinatorial peptide libraries (Comblib_Sidney2008), was used, and an epitope length of 9-mer was selected. All internationally conserved epitopes that bind to alleles with a percentile rank of 1.0 or less (low percentile rank = good binders) were selected for further analysis, as described in “Selecting thresholds (cut-offs) for MHC class I and II binding predictions” (http://help.iedb.org/entries/23854373-Selecting-thresholds-cut-offs-for-MHC-class-I-and-II-binding-predictions).
Note: For the S glycoprotein, the sequence was divided into ten parts because of a software limitation: no more than 200 FASTA sequences can be entered [7–11].
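The two mechanical steps in this section, enumerating 9-mers and keeping predictions at or below a percentile rank cutoff, can be sketched as follows. The sketch does not call the IEDB tool; the record layout (dictionaries with allele, peptide, and percentile_rank keys), the function names, and all values are assumptions for illustration and are not results from this study.

```python
def kmers(sequence, k=9):
    """Return (1-based start, peptide) for every overlapping k-mer."""
    return [(i + 1, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def good_binders(predictions, max_rank=1.0):
    """Keep predictions with a percentile rank at or below max_rank.

    A low percentile rank indicates a good binder.
    """
    return [p for p in predictions if p["percentile_rank"] <= max_rank]


# Toy sequence and made-up prediction records.
peptides = kmers("ACDEFGHIKLM")
predictions = [
    {"allele": "HLA-A*02:01", "peptide": "ACDEFGHIK", "percentile_rank": 0.4},
    {"allele": "HLA-A*02:01", "peptide": "CDEFGHIKL", "percentile_rank": 3.2},
    {"allele": "HLA-B*07:02", "peptide": "DEFGHIKLM", "percentile_rank": 1.0},
]
selected = good_binders(predictions, max_rank=1.0)
print(peptides, [p["peptide"] for p in selected])
```

The same filter with a different cutoff value would apply to MHC class II predictions.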
2.5.2 MHC Class II Binding Predictions
Peptide binding to MHC class II molecules was assessed by the IEDB MHC II prediction tool (http://tools.immuneepitope.org/mhcii/). For MHC-II binding prediction, the reference set of alleles was used, which includes the HLA-DQ, HLA-DP, and HLA-DR alleles that are most frequent around the world. The MHC class II groove can bind peptides of different lengths. There are seven prediction methods in the IEDB MHC II prediction tool; NetMHCIIpan was used in this study. The conserved epitopes that bind to alleles with a percentile rank of 10 or less were selected for further analysis, as described in “Selecting thresholds (cut-offs) for MHC class I and II binding predictions” (http://help.iedb.org/entries/23854373-Selecting-thresholds-cut-offs-for-MHC-class-I-and-II-binding-predictions) [7, 11–14].
2.5.3 Proteasomal Cleavage/TAP Transport/MHC Class I Combined Predictor
This tool combines predictors of proteasomal processing, TAP transport, and MHC binding to produce an overall score for each peptide’s intrinsic potential of being a T-cell epitope. In this study, NetMHCpan was used with immunoproteasomal cleavage prediction; there are two types of proteasomes, the constitutively expressed “housekeeping” type and immunoproteasomes, which are induced by IFN-γ secretion. Results are displayed as a proteasome score, TAP score, MHC score, processing score, total score, and IC50 score. The prediction outputs are explained as follows:
Proteasome cleavage: The scores can be interpreted as logarithms of the total amount of cleavage site usage liberating the peptide C-terminus; this amount also depends on many other factors, e.g., the amount of source protein degraded.
TAP transport: The TAP score estimates an effective log(IC50) value for the binding to TAP of a peptide or its N-terminally prolonged precursors.
MHC binding: The MHC binding prediction is identical to the MHC class I binding prediction, with log(IC50) values as output.
Processing: This score combines the proteasomal cleavage and TAP transport predictions. It predicts a quantity proportional to the amount of peptide present in the ER, where a peptide can bind to multiple MHC molecules. This allows predicting T-cell epitope candidates independent of MHC restriction.
Total: This score combines the proteasomal cleavage, TAP transport, and MHC binding predictions. It predicts a quantity proportional to the amount of peptide presented by MHC molecules on the cell surface. High scores mean high efficiency.
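The text does not give the formula by which the scores are combined. Because the individual scores are described as logarithmic quantities, one plausible reading is that they are added; the sketch below makes that assumption explicit and further assumes that the MHC term enters as −log10(IC50), so that stronger binding raises the total. Neither assumption is confirmed by the chapter, and the function names and input values are hypothetical.

```python
import math


def mhc_score(ic50_nm):
    """Assumed MHC term: -log10 of the predicted IC50 (lower IC50, higher score)."""
    return -math.log10(ic50_nm)


def processing_score(proteasome, tap):
    """Assumed additive combination of the proteasome and TAP scores."""
    return proteasome + tap


def total_score(proteasome, tap, ic50_nm):
    """Assumed additive combination of processing and MHC binding terms."""
    return processing_score(proteasome, tap) + mhc_score(ic50_nm)


# Made-up values for illustration.
example_total = total_score(proteasome=1.0, tap=0.5, ic50_nm=100.0)
print(example_total)
```

Under these assumptions, two peptides with the same processing score are ranked by predicted binding affinity alone.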
2.5.4 Neural Network-Based Prediction of Proteasomal Cleavage Sites (NetChop) and T-Cell Epitopes (NetCTL and NetCTLpan)
NetChop, which was used here, is a predictor of proteasomal processing based on a neural network. NetCTL and NetCTLpan are predictors of T-cell epitopes along a protein sequence. The positive prediction thresholds were 0.5, 0.75, and 1 for NetChop, NetCTL, and NetCTLpan, respectively; predictions above the threshold are displayed in green, and predictions below the threshold in red.
2.5.5 MHC-NP: Prediction of Peptides Naturally Processed by the MHC
MHC-NP employs data obtained from MHC elution experiments to assess the probability that a given peptide is naturally processed and binds to a given MHC molecule. The tool used in this study was the winner of the second Machine Learning Competition in Immunology. Its data comprise three groups of peptides: binders, nonbinders, and eluted peptides, the last being considered naturally processed peptides. A higher probability score therefore indicates that a peptide is more likely to be naturally processed.
2.6 Epitope Analysis Tools
2.6.1 Population Coverage Calculation
All potential MHC I and MHC II binders from the spike glycoprotein, the E protein, and the modified S and E sequences were assessed for population coverage against the whole world population, with particular attention to Saudi Arabia and the other countries with reported MERS-CoV. Calculations were made for the selected interacting MHC-I and MHC-II alleles with the IEDB population coverage calculation tool (http://tools.iedb.org/tools/population/iedb_input), which computes the projected population coverage, the average number of epitope hits/HLA combinations recognized by the population, and the minimum number of epitope hits/HLA combinations recognized by 90% of the population (PC90).
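To give a sense of what projected population coverage measures, the sketch below uses a simplified model: within one locus, the fraction of individuals carrying at least one covered allele is computed from allele frequencies under Hardy-Weinberg proportions, and loci are combined as if independent. This is a back-of-envelope illustration, not the algorithm of the IEDB tool, which also reports hit distributions and PC90; the function names and frequencies are made up.

```python
def locus_coverage(covered_allele_freqs):
    """Fraction of individuals with at least one covered allele at one locus.

    Assumes Hardy-Weinberg proportions: an individual is uncovered only if
    both of its alleles fall outside the covered set.
    """
    total = sum(covered_allele_freqs)
    if total < 0 or total > 1 + 1e-9:
        raise ValueError("allele frequencies at a locus must sum to at most 1")
    total = min(total, 1.0)
    return 1.0 - (1.0 - total) ** 2


def combined_coverage(loci):
    """Coverage across several loci, assuming the loci are independent."""
    uncovered = 1.0
    for freqs in loci:
        uncovered *= 1.0 - locus_coverage(freqs)
    return 1.0 - uncovered


# Made-up allele frequencies for two loci.
single = locus_coverage([0.3, 0.2])
overall = combined_coverage([[0.3, 0.2], [0.5]])
print(single, overall)
```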
2.7 Homology Modeling
The complete 3D structures of the spike glycoprotein and the envelope protein were obtained with Phyre2 (http://www.sbg.bio.ic.ac.uk/phyre2), which uses advanced remote homology detection methods to build 3D models. UCSF Chimera (version 1.8), which is available from the Chimera website (http://www.cgl.ucsf.edu/chimera), was used to visualize the 3D structures. Homology modeling was performed for further verification of the surface accessibility and hydrophilicity of the predicted B-lymphocyte epitopes, as well as for visualization of all predicted T-cell epitopes at the structural level.
In addition to the above methods, three other software tools were used to determine the effect of the amino acid changes (SNPs, single nucleotide polymorphisms) introduced into the S and E reference sequences.
2.8 Confirmation of Amino Acid Change in Spike Glycoprotein (S) and Envelope Protein (E) Sequence
2.8.1 PolyPhen-2
PolyPhen-2 (Polymorphism Phenotyping v2) (http://genetics.bwh.harvard.edu/pph2/index.shtml), an online bioinformatics program that automatically predicts the consequence of an amino acid change on the structure and function of a protein, was used here. Basically, this program searches for 3D protein structures, multiple alignments of homologous sequences, and amino acid contact information in several protein structure databases, then calculates position-specific independent count (PSIC) scores for each of the two variants and computes the PSIC score difference between them. PolyPhen scores were assigned as probably damaging (2.00 or more), possibly damaging (1.40–1.90), potentially damaging (1.0–1.50), and benign (0.00–0.90). PolyPhen accepts input in the form of SNPs or protein sequences [18].
2.8.2 I-Mutant Suite
I-Mutant version 3.0 (http://gpcr2.biocomp.unibo.it/cgi/predictors/I-Mutant3.0/I-Mutant3.0.cgi) was used to predict protein stability changes upon single-site mutations. I-Mutant3.0 can evaluate the stability change of a single-site mutation starting from either the protein structure or the protein sequence. The program was trained on a data set derived from ProTherm, which is considered the most comprehensive database of experimental data on protein mutations [18].
2.8.3 Project Hope Mutation
HOPE version 1.1.0 (http://www.cmbi.ru.nl/hope/) is an easy-to-use web service that analyzes the structural effects of a point mutation in a protein sequence.
2.8.4 SNPs and GO
SNPs and GO (http://snps.biofold.org/snps-and-go//snps-and-go.html) was used to predict disease-associated variations by using GO terms. It collects, in a single framework, information derived from protein sequence, 3D structure, protein sequence profile, and protein function, together with gene ontology annotation, to predict whether a given variation can be classified as disease-related or neutral. It reports results for three methods, which differ in SVM type and input data:
PANTHER: output of the PANTHER algorithm.
PhD-SNP: the SVM input is the sequence and profile at the mutated position.
SNPs and GO: the SVM input is all the input of PhD-SNP together with PANTHER and GO term features; the output is a disease probability (a mutation is predicted to be disease-related if the probability is >0.5).
2.9 Peptide Search Tool
The peptide search tool (http://www.uniprot.org/peptidesearch/) was used to find all UniProtKB sequences that exactly match a query peptide sequence. The desired peptides can then be synthesized in the laboratory, by cloning or other methods, to study the impact of a peptide from any organism on the immune system by injecting it into laboratory animals.
2.10 AllerHunter
AllerHunter (http://tiger.dbs.nus.edu.sg/AllerHunter/index.html) is a cross-reactive allergen prediction program built on a combination of support vector machine (SVM) and pairwise sequence similarity. Query sequences can be evaluated with both AllerHunter and the FAO/WHO evaluation scheme. In AllerHunter, a sequence is considered a cross-reactive allergen if it has a probability of ≥0.06, while the FAO/WHO guideline states that a sequence is potentially allergenic if it has either an identity of at least 6 contiguous amino acids or >35% sequence identity over a window of 80 amino acids when compared to known allergens.
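The two FAO/WHO criteria stated above can be sketched as follows. This is a simplified illustration, not AllerHunter's method: window identity is computed without gaps by sliding equal-length stretches against each other, whereas real evaluations typically rely on alignment software; the function names and toy sequences are hypothetical.

```python
def shares_contiguous_match(query, allergen, k=6):
    """True if the two sequences share an identical stretch of k residues."""
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(query[i:i + k] in allergen_kmers
               for i in range(len(query) - k + 1))


def max_window_identity(query, allergen, window=80):
    """Highest ungapped percent identity between any two windows.

    Returns 0.0 if either sequence is shorter than the window.
    """
    best = 0.0
    for i in range(len(query) - window + 1):
        q = query[i:i + window]
        for j in range(len(allergen) - window + 1):
            a = allergen[j:j + window]
            matches = sum(1 for x, y in zip(q, a) if x == y)
            best = max(best, 100.0 * matches / window)
    return best


def potentially_allergenic(query, allergen, k=6, window=80, min_identity=35.0):
    """Apply both criteria: a shared k-mer, or identity above the cutoff."""
    return (shares_contiguous_match(query, allergen, k)
            or max_window_identity(query, allergen, window) > min_identity)


# Toy sequences; a 10-residue window is used only to keep the example short.
identity = max_window_identity("ACDEFGHIKL", "WWDEFGHIWW", window=10)
flagged = potentially_allergenic("ACDEFGHIKL", "WWDEFGHIWW", window=10)
print(identity, flagged)
```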
2.11 AlgPred: Prediction of Allergenic Proteins and Mapping of IgE Epitopes
AlgPred (http://www.imtech.res.in/raghava/algpred/index.html) was used to predict allergenic proteins and to map IgE epitopes. The server offers the following:
1. Prediction of allergens based on the similarity of a known epitope to any region of the protein.
2. Mapping of IgE epitope(s), which allows the user to locate the position of an epitope in their protein.
3. A search for MEME/MAST allergen motifs using MAST; a protein is assigned as an allergen if it has any motif.
4. Prediction of allergens based on SVM modules using amino acid or dipeptide composition.
5. A BLAST search against 2890 allergen-representative peptides (ARPs) obtained from Bjorklund et al. (2005); a protein is assigned as an allergen if it has a BLAST hit.
6. A hybrid option that predicts allergens using a combined approach (SVMc + IgE epitope + ARPs BLAST + MAST).
2.12 VaxiJen v2.0
VaxiJen (http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen_help.html) is the first server for alignment-independent prediction of protective antigens. It was developed to allow antigen classification solely based on the physicochemical properties of proteins without recourse to sequence alignment.
3 Results
3.1 Prediction of B-Cell Epitopes
The spike glycoprotein, the E protein, and the modified S and E proteins were subjected to the BepiPred linear epitope prediction, Emini surface accessibility, Kolaskar and Tongaonkar antigenicity, Parker hydrophilicity, Chou and Fasman beta turn, and Karplus and Schulz flexibility prediction methods in IEDB; the results are shown in Figs. 1–24.
Fig. 1 BepiPred linear epitope prediction of S glycoprotein. The desired epitope residues are shown in yellow. The red horizontal line indicates the BepiPred linear epitope threshold (0.35)
3.1.1 BepiPred Linear Epitope Prediction Method
The average binder score of the spike glycoprotein to B cells was 0.35; all values equal to or greater than the default threshold of 0.35 were predicted to be potential B-cell binders.
3.1.2 Emini Surface Accessibility Prediction
The average surface accessibility score of the protein was 1.000; all values equal to or greater than the default threshold of 1.0 were regarded as potentially on the surface. The number of positive peptides was 481 out of 1349 for the S glycoprotein and 23 out of 77 for the E protein; for the modified S and E sequences, it was 485 out of 485 and 17 out of 77 peptides, respectively.
3.1.3 Kolaskar and Tongaonkar Antigenicity
The default antigenicity threshold of the protein was 1.045; all values greater than 1.045 were considered potential antigenic determinants. The number of positive peptides was 655 out of 1348 for the S glycoprotein and 55 out of 76 for the E protein; for the modified S and E sequences, it was 668 out of 668 and 47 out of 76 peptides, respectively.
3.1.4 Parker Hydrophilicity Prediction
The average hydrophilicity score of the protein was 1.286; all values equal to or greater than the default threshold of 1.286 were potentially hydrophilic.
The positive result number of S glycoprotein peptide
Fig. 2 Emini surface accessibility prediction of S glycoprotein. The desired epitope residues for surface accessibility are shown in yellow, while residues below the threshold (1.000) are shown in green
Fig. 3 Kolaskar and Tongaonkar antigenicity prediction of S glycoprotein. The desired epitope residues for antigenicity are shown in yellow, while the green color below the red horizontal line indicates antigenicity below the threshold (1.045)
Fig. 4 Parker hydrophilicity prediction of S glycoprotein. The desired epitope residues are shown in yellow. The red horizontal line indicates the Parker hydrophilicity threshold (1.286)
Fig. 5 Chou and Fasman beta turn prediction of S glycoprotein. The desired epitope residues are shown in yellow. The red horizontal line indicates the beta turn prediction threshold (1.009)
Fig. 6 Karplus and Schulz flexibility prediction of S glycoprotein. The desired epitope residues are shown in yellow. The red horizontal line indicates the flexibility threshold (0.992)
Fig. 7 BepiPred linear epitope prediction of S glycoprotein modified sequence. The desired epitope residues are shown in yellow. The red horizontal line indicates the BepiPred linear epitope threshold (0.35)
Fig. 8 Emini surface accessibility prediction of S glycoprotein modified sequence. The desired epitope residues are shown in yellow, while the green color lies below the red horizontal line, which indicates the surface accessibility threshold (1.000)
Fig. 9 Kolaskar and Tongaonkar antigenicity prediction of S glycoprotein modified sequence. The desired epitope residues are shown in yellow. The red horizontal line indicates the antigenicity threshold (1.045)
Fig. 10 Parker hydrophilicity prediction of S glycoprotein modified sequence. The desired epitope residues are shown in yellow, while the green color lies below the red horizontal line, which indicates the hydrophilicity threshold (1.286)
Fig. 11 Chou and Fasman beta turn prediction of S glycoprotein modified sequence. The desired epitope residues are shown in yellow. The red horizontal line indicates the beta turn threshold (1.009)
Fig. 12 Karplus and Schulz flexibility prediction of S glycoprotein modified sequence. The desired epitope residues are shown in yellow, while the green color lies below the red horizontal line, which indicates the flexibility threshold (0.992)
3
Threshold
1
Score
–1
–2
–3
0 20 40 60 80 100
Position
Fig. 13 BePipred linear epitope prediction of E protein. The desired epitope residue showed in yellow color.
The red horizontal line indicates Bepipred Linear Epitope threshold (0.35)
represents 693 out of 1348, while in E protein represents 18 out of

76 and in S and E modified sequence represents 690 out of 695 and
20 out of 76 peptides sequentially.
Fig. 14 Emini surface accessibility prediction of E protein (x-axis: Position; y-axis: Score). The desired epitope residues are shown in yellow, while green below the red horizontal line indicates values below the surface accessibility threshold (1.000)
Fig. 15 Kolaskar and Tongaonkar antigenicity prediction of E protein (x-axis: Position; y-axis: Score). The desired epitope residues are shown in yellow, while green below the red horizontal line indicates values below the antigenicity threshold (1.045)
3.1.5 Chou and Fasman Beta Turn Prediction
To determine the sites that contain beta turns, the default threshold was 1.009; all values equal to or greater than the default threshold were considered beta turn sites. The number of positive peptides was 668 out of 1348 in S glycoprotein and 19 out of 76 in E protein, while in the S and E modified sequences it was 673 out of 673 and 21 out of 76, respectively.
Fig. 16 Parker hydrophilicity prediction of E protein (x-axis: Position; y-axis: Score). The desired epitope residues are shown in yellow. The red horizontal line indicates the hydrophilicity threshold (1.286)
Fig. 17 Chou and Fasman beta turn prediction of E protein (x-axis: Position; y-axis: Score). The desired epitope residues are shown in yellow. The red horizontal line indicates the beta turn threshold (1.009)
3.1.6 Karplus and Schulz Flexibility Prediction
The default threshold of 0.992 was used to determine chain flexibility; all values equal to or greater than the default threshold were considered flexible regions of the protein. The number of positive peptides was 679 out of 1347 in S glycoprotein and 24 out of 24 in E protein, while in the S and E modified sequences it was 680 out of 681 and 24 out of 75, respectively.
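The default thresholds quoted in Subheadings 3.1.1 to 3.1.6 can be collected in one place, which makes it easy to see which methods a given residue or peptide passes. The following is a sketch under stated assumptions: the dictionary keys, the function names, and the example scores are hypothetical; the threshold values and the strict "greater than" wording for Kolaskar and Tongaonkar are taken from the text.

```python
import operator

# Default thresholds as quoted in the text. The comparison follows the
# wording used there: "greater than" for Kolaskar and Tongaonkar,
# "equal or greater than" for the other methods.
DEFAULT_RULES = {
    "BepiPred": (0.35, operator.ge),
    "Emini": (1.000, operator.ge),
    "Kolaskar-Tongaonkar": (1.045, operator.gt),
    "Parker": (1.286, operator.ge),
    "Chou-Fasman": (1.009, operator.ge),
    "Karplus-Schulz": (0.992, operator.ge),
}


def passing_methods(scores_by_method):
    """Return the sorted names of the methods whose threshold is passed."""
    passed = []
    for name, score in scores_by_method.items():
        threshold, compare = DEFAULT_RULES[name]
        if compare(score, threshold):
            passed.append(name)
    return sorted(passed)


def passes_all(scores_by_method):
    """True only if a score is supplied for every method and all pass."""
    return set(passing_methods(scores_by_method)) == set(DEFAULT_RULES)


# Made-up scores for a single hypothetical residue.
example = {
    "BepiPred": 0.50,
    "Emini": 1.20,
    "Kolaskar-Tongaonkar": 1.045,  # exactly at threshold: not "greater than"
    "Parker": 2.00,
    "Chou-Fasman": 1.10,
    "Karplus-Schulz": 1.00,
}
print(passing_methods(example))
print(passes_all(example))
```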
The most common B-cell epitope for E protein is YVKFQDS at position 69, while for the E protein modified sequence they are VYVPQQD, YVPQQDS, and PPLPED/PPLPEDV at positions 68, 69, and 77, respectively.
The most common B-cell epitopes for both S and modified S are DVGPDSV, PDSVKSA, DSVKSAC, PRPIDVS, HTPATDC, AKPSGSV, KPSGSVV, SGTPPQV, GTPPQVY, TPPQVYN, QLSPLEG, YGPLQTP, PRSVRSV, RSVRSVP, SVKSSQS, VKSSQSS, SQSSPII, and SLNTKYV at positions 23, 26, 27, 48, 211, 371, 372, 393, 394, 395, 547, 707, 750, 751, 855, 856, 859 (or 857 in modified S), and 1202, respectively. However, QVDQLNS and VDQLNSS at positions 772 and 773 are found only in S glycoprotein, while LTPTSSY, TPTSSYV, PTSSYVD, TSSYVDV, DHGDYYV, YSQDVKQ, ANQYSPC, NQYSPCV, and YYRKQLS at positions 15, 16, 17, 18, 83, 108, 523, 524, and 543, respectively, are found only in the S glycoprotein modified sequence.
Fig. 18 Karplus and Schulz flexibility prediction of E protein (x-axis: Position; y-axis: Score). The desired epitope residues are shown in yellow, while green below the red horizontal line indicates flexibility below the threshold (0.992)
Fig. 19 BepiPred linear epitope prediction of the E protein modified sequence (x-axis: Position; y-axis: Score). The desired epitope residues are shown in yellow. The red horizontal line indicates the BepiPred linear epitope threshold (0.35)
Fig. 20 Emini surface accessibility prediction of the E protein modified sequence (x-axis: Position; y-axis: Score). The desired epitope residues are shown in yellow, above the red horizontal threshold line (1.000)
Fig. 21 Kolaskar and Tongaonkar antigenicity prediction of the E protein modified sequence (x-axis: Position; y-axis: Score). The desired epitope residues are shown in yellow, while green indicates antigenicity below the threshold (1.045)
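The comparison between the reference and modified sequences amounts to set operations on the predicted epitopes. A minimal sketch follows; the function name `compare_epitopes` is hypothetical, and the two short lists are only a subset of the epitopes named above, chosen for illustration.

```python
def compare_epitopes(reference, modified):
    """Split two epitope lists into shared and sequence-specific groups."""
    ref, mod = set(reference), set(modified)
    return {
        "common": sorted(ref & mod),
        "reference_only": sorted(ref - mod),
        "modified_only": sorted(mod - ref),
    }


# Subset of the epitopes listed in the text (illustration only).
s_epitopes = ["DVGPDSV", "PDSVKSA", "QVDQLNS", "VDQLNSS"]
modified_s_epitopes = ["DVGPDSV", "PDSVKSA", "LTPTSSY", "TPTSSYV"]

result = compare_epitopes(s_epitopes, modified_s_epitopes)
for group, peptides in result.items():
    print(group, peptides)
```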
3.2 T-Cell Epitope Prediction
The spike glycoprotein, E protein, and S and E modified sequences were analyzed with the following tools in IEDB: the consensus method for MHC-I binding; NetMHCIIpan for MHC-II binding; NetMHCpan for the proteasomal cleavage/TAP transport/MHC class I combined predictor; NetChop and NetCTL for neural network-based prediction of proteasomal cleavage sites (NetChop) and T-cell epitopes (NetCTL and NetCTLpan); and MHC-NP for prediction of peptides that are naturally processed by the MHC.
3.2.1 MHC Class I Binding Predictions
Binding of peptide sequences to MHC class I molecules was analyzed with the consensus method, selecting conserved epitopes that bind to alleles with a percentile rank equal to or less than 1.0. The number of positive peptides was 602 out of 53,800 in S glycoprotein and 63 out of 3626 in E protein, while in the S and E modified sequences it was 612 out of 58,457 and 41 out of 3234, respectively.
Fig. 22 Parker hydrophilicity prediction of the E protein modified sequence (x-axis: Position; y-axis: Score). The desired epitope residues are shown in yellow. The red horizontal line indicates the hydrophilicity threshold (1.286)
Fig. 23 Chou and Fasman beta turn prediction of the E protein modified sequence (x-axis: Position; y-axis: Score). The desired epitope residues are shown in yellow, while green below the red horizontal line indicates values below the beta turn threshold (1.009)
Fig. 24 Karplus and Schulz flexibility prediction of the E protein modified sequence (x-axis: Position; y-axis: Score). The desired epitope residues are shown in yellow, relative to the flexibility threshold (0.992)
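Selecting binders by percentile rank is a simple filter over the prediction rows. The sketch below assumes the predictions have already been exported into (allele, peptide, percentile rank) records; the `Prediction` type, the function name, and the percentile values are hypothetical, while the cutoffs (1.0 for class I, and 10 for class II as used in Subheading 3.2.2) are those stated in the text.

```python
from typing import Iterable, List, NamedTuple


class Prediction(NamedTuple):
    allele: str
    peptide: str
    percentile_rank: float


MHC_I_CUTOFF = 1.0    # class I cutoff stated in the text
MHC_II_CUTOFF = 10.0  # class II cutoff stated in Subheading 3.2.2


def select_binders(predictions: Iterable[Prediction],
                   cutoff: float) -> List[Prediction]:
    """Keep predictions whose percentile rank is <= cutoff
    (a lower percentile rank means a stronger predicted binder)."""
    return [p for p in predictions if p.percentile_rank <= cutoff]


# Alleles and peptides appear in the text; the ranks are made up.
example = [
    Prediction("HLA-A*02:01", "FIFTVVCAI", 0.4),
    Prediction("HLA-A*23:01", "ITLLVCMAF", 1.0),
    Prediction("HLA-B*46:01", "LVQPALYLY", 2.5),
]
for hit in select_binders(example, MHC_I_CUTOFF):
    print(hit.allele, hit.peptide, hit.percentile_rank)
```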
Seven alleles were not found in the E protein modified sequence: HLA-A∗03:01, HLA-A∗11:01, HLA-A∗31:01, HLA-A∗68:01, HLA-B∗14:02, HLA-B∗40:01, and HLA-B∗40:02. In E protein, four alleles were not found: HLA-B∗48:01, HLA-B∗58:02, HLA-C∗04:01, and HLA-E∗01:01. The remainder of the alleles are common to both. Three peptide sequences are common to both, namely CMTGFNTLLn, MTGFNTLLVn, and QCMTGFNTLn, while HLCVQCMTG, KPPLPEDVW, LLVCTAFLT, LLVQPALSL, LTATHLCVQ, LVCTAFLTA, PALSLYMTG, PNFFDFTVVn, SLYMTGRSV, VCTAFLTAT, VQERIGWFI, VQPALSLYM, VVCDITLLV, and WFIPNFFDFn are found only in the E modified sequence.
In E protein, the HLA-A∗02:01 allele showed the highest frequency number (six), followed by HLA-A∗23:01, HLA-A∗29:02, HLA-A∗68:02, and HLA-B∗46:01, each with a frequency number of four; the same holds for the peptide sequences FIFTVVCAI, ITLLVCMAF, IVNFFIFTVn, and LVQPALYLY. In modified E, HLA-C∗03:03 showed the highest frequency number (forty-three), while HLA-A∗02:01, HLA-A∗02:06, HLA-A∗29:02, and HLA-B∗38:01 each had a frequency number of three.
Among the peptide sequences in E protein, FIFTVVCAI had the highest frequency number (five), followed by ITLLVCMAF, IVNFFIFTVn, and LVQPALYLY. In the E protein modified sequence, by contrast, LVQPALSLY had the highest frequency number (five), followed by CMTGFNTLLn, FLTATHLCV, FVQERIGWF, ITLLVCTAF, LYMTGRSVY, WFIPNFFDFn, and YMTGRSVYV, each with a frequency number of four, except QCMTGFNTLn, which had a frequency number of three.
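One way to obtain such frequency numbers is to count how often each allele and each peptide occurs among the selected (allele, peptide) pairs. This reading of "frequency number" is an assumption, as are the function name and the example pairs below; the alleles and peptides are taken from the text, but their pairing is made up for illustration.

```python
from collections import Counter


def frequency_numbers(pairs):
    """Count occurrences of each allele and each peptide in a list of
    (allele, peptide) pairs. Hypothetical helper, for illustration."""
    allele_counts = Counter(allele for allele, _ in pairs)
    peptide_counts = Counter(peptide for _, peptide in pairs)
    return allele_counts, peptide_counts


# Made-up pairing of alleles and peptides named in the text.
selected_pairs = [
    ("HLA-A*02:01", "FIFTVVCAI"),
    ("HLA-A*02:06", "FIFTVVCAI"),
    ("HLA-A*02:01", "ITLLVCMAF"),
    ("HLA-A*23:01", "ITLLVCMAF"),
    ("HLA-A*02:01", "LVQPALYLY"),
]

allele_counts, peptide_counts = frequency_numbers(selected_pairs)
print(allele_counts.most_common(1))
print(peptide_counts["FIFTVVCAI"])
```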
N.B.: the trailing superscript n indicates the presence of asparagine (N) in a peptide sequence, which can hide the epitope from recognition by the immune system, so the common epitopes should be treated with caution. There are 11 peptide sequences with asparagine in E and 13 in modified E, while there are 8 in S and 46 in the modified S sequence.
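Flagging asparagine-containing peptides is a one-line check on the one-letter amino acid code. The sketch below is illustrative only; the function name is hypothetical, and the peptides are a small subset of those listed for the E modified sequence.

```python
def flag_asparagine(peptides):
    """Map each peptide to True if it contains asparagine (one-letter
    code N), otherwise False."""
    return {peptide: "N" in peptide.upper() for peptide in peptides}


# Subset of peptides from the text, written without the superscript n.
peptides = ["CMTGFNTLL", "MTGFNTLLV", "KPPLPEDVW", "SLYMTGRSV", "LLVQPALSL"]
flags = flag_asparagine(peptides)
with_asparagine = [p for p, has_n in flags.items() if has_n]
print(with_asparagine)
print(len(with_asparagine), "of", len(peptides), "peptides contain asparagine")
```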
The HLA-A∗30:02 allele was not found in the S glycoprotein modified sequence, while HLA-B∗38:01, HLA-B∗39:01, HLA-B∗40:01, HLA-B∗40:02, HLA-B∗44:02, HLA-B∗44:03, HLA-B∗46:01, HLA-B∗48:01, HLA-B∗51:01, and HLA-B∗53:01 were not found in the S sequence but were found in the S modified sequence. This means that 15 peptide sequences were absent from the S sequence (AGYKVLPPL, APQVTYQNIn, CKLPLGQSL, CVFFILCCV, DVKQFDNGFn, DYYVYSAGH, FKLSIPTNFn, FLLTPTSSY, GEMRLASIA, GNYTYYHKWn, GPASARDLI, GTDTNSVCIn, HKWPWYIWL, HSKFLLMFL, IAPVNGYFIn) but present in the modified S sequence. In addition, 34 further peptide sequences are lacking, such as AGPISQFNYn, CMGKLKCNRn, DLSQLHCSY, DVKQFANGFn, FATYHTPAT, FLLTPTESY, FQFATLPVY, FVYDAYQNLn, GTNCMGKLKn, GVRQQRFVY, HSVFLLMFL, and ICAQYVAGY; the other peptide sequences are not shown here.
In S glycoprotein, the HLA-A∗29:02 allele showed the highest frequency number (41), followed by HLA-A∗30:02 (37), HLA-A∗01:01 (31), HLA-B∗15:01 (29), HLA-C∗14:02 (27), HLA-A∗25:01 (25), HLA-A∗23:01 (24), HLA-B∗58:01 (23), and HLA-C∗06:02 (22). The modified S glycoprotein sequence partially shared the same high-frequency alleles: HLA-A∗29:02 again had the highest frequency number (33), followed by HLA-C∗14:02 (27), HLA-A∗01:01 (25), HLA-B∗46:01 (22), HLA-A∗23:01, HLA-B∗58:01, and HLA-C∗06:02 (21 each), and HLA-B∗15:01 (20). In S glycoprotein, the peptide sequences with the highest frequency numbers were FSFGVTQEY and ITYQGLFPY (10), WSYTGSSFY (8), KAWAAFYVY (7), FVYDAYQNLn, ITITYQGLF, and QTAQGVHLF (6), and FQFATLPVY, NSYTSFATYn, SLILDYFSY, STVWEDGDY, VSVPVSVIY, and YTYYNKWPWn (5). In modified S glycoprotein the frequencies differed: 10 for FSFGVTQEY; 4 for FLLTPTSSY, FSSRYVDLY, FVANYSQDVn, FYVYKLQPL, and IAFNHPIQVn; and 3 for ASIAFNHPIn, DEILEWFGI, DYFSYPLSM, EAAYTSSLL, FCSKINQALn, FFNHTLVLLn, FQDELDEFF, FSDGKMGRF, FSNPTCLILn, GEMRLASIA, GRFFNHTLVn, HISSTMSQY, and HKWPWYIWL.
N.B.: n indicates the presence of asparagine (N) in a peptide sequence, which can hide the epitope from recognition by the immune system.
3.2.2 MHC Class II Binding Predictions
Peptide binding to MHC class II molecules was assessed by selecting conserved epitopes that bind to alleles with a percentile rank equal to or less than 10. The number of positive epitopes was 212 out of 4819 in S glycoprotein and 685 out of 4148 in E protein, while in the S and E modified proteins it was 6896 out of 75,206 and 685 out of 4148, respectively.
The following alleles are common to S glycoprotein, E protein, and the S and E modified sequences: HLA-DPA1∗01:03/DPB1∗02:01, HLA-DPA1∗02:01/DPB1∗01:01, HLA-DRB1∗01:01, HLA-DRB1∗01:02, HLA-DRB1∗04:04, HLA-DRB1∗04:05, HLA-DRB1∗04:08, HLA-DRB1∗11:06, HLA-DRB1∗12:01, HLA-DRB1∗13:04, HLA-DRB1∗13:11, HLA-DRB1∗13:21, and HLA-DRB4∗01:01. S and modified S glycoprotein both contain 42 other alleles that are not shown here. In E and modified E protein, HLA-DRB1∗01:01 had the highest frequency number (20), followed by HLA-DRB1∗01:02 (17), HLA-DRB1∗12:01 (11), HLA-DRB1∗11:04, HLA-DRB1∗11:06, and HLA-DRB1∗13:11 (10), and HLA-DRB1∗07:01, HLA-DRB1∗07:03, and HLA-DRB1∗13:21 (9). In S and modified S glycoprotein, the alleles with the highest frequency numbers (S/modified S) were HLA-DRB1∗04:08 (200/199); HLA-DRB1∗04:01, HLA-DRB1∗04:21, and HLA-DRB1∗04:26 (199/201); HLA-DRB1∗09:01 (194/190); HLA-DRB1∗04:05 (192/189); HLA-DRB1∗07:01 and HLA-DRB1∗07:03 (167/167); HLA-DRB1∗15:02 (164/167); HLA-DRB1∗13:02 (160/159); HLA-DRB1∗11:14, HLA-DRB1∗11:20, and HLA-DRB1∗13:23 (159/159); and HLA-DRB3∗01:01 (152/158).
E and modified E protein had the same peptide sequences with the same frequency numbers; the highest frequency numbers were seen for the following peptides: GFNTLLVQPALSLYMn (15), TGFNTLLVQPALSLYn (14), FNTLLVQPALSLYMT (13), MTGFNTLLVQPALSLn (12), NTLLVQPALSLYMTGn (11), and ALSLYMTGRSVYVPQ, LSLYMTGRSVYVPQQ, PALSLYMTGRSVYVP, and QPALSLYMTGRSVYV (10 each).
N.B.:
1. The following alleles are not available for S glycoprotein, E protein, or the S and E modified sequences: DPA1∗01-DPB1∗04:01, DRB1∗03:09, DRB1∗08:17, and DRB1∗13:28.
2. The same peptide sequence can be shared by more than one allele, and the same allele can have different peptide sequences.
3. Variation in frequency numbers among both alleles and peptide sequences was observed when comparing the reference sequences of S and E protein with their modified sequences.
4. The superscript n in the peptide sequences above indicates the presence of asparagine (N) in the sequence.
3.2.3 Proteasomal Cleavage/TAP Transport/MHC Class I Combined Predictor
In NetMHCpan, a high score means high efficiency, because the tool predicts a quantity proportional to the amount of peptide presented by MHC molecules on the cell surface. Peptides with a total score equal to or higher than 0 were selected for S and modified S glycoprotein; for E protein the cutoff was a total score equal to or higher than 0.3, and for modified E protein it was a total score equal to or higher than 2.82; see Tables 3 and 4.
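Because a different total-score cutoff was used for each protein, the selection can be expressed as a lookup followed by a filter. The sketch below is illustrative: the dictionary, the function name, and the total scores are hypothetical, while the cutoffs and the peptide names come from the text and Table 4.

```python
# Cutoffs on the combined-predictor total score, as stated in the text.
TOTAL_SCORE_CUTOFFS = {
    "S": 0.0,
    "modified S": 0.0,
    "E": 0.3,
    "modified E": 2.82,
}


def select_by_total_score(rows, protein):
    """Return peptides whose total score is >= the cutoff for `protein`.
    `rows` is a list of (peptide, total_score) tuples."""
    cutoff = TOTAL_SCORE_CUTOFFS[protein]
    return [peptide for peptide, total in rows if total >= cutoff]


# Peptides from Table 4 with made-up total scores.
e_rows = [("ALYLYNTGR", 0.45), ("CMAFLTATR", 0.30), ("FTVVCAITL", 0.10)]
modified_e_rows = [("KPPLPEDVW", 2.82), ("LLVQPALSL", 1.50)]

print(select_by_total_score(e_rows, "E"))
print(select_by_total_score(modified_e_rows, "modified E"))
```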
3.2.4 Neural Network-Based Prediction of Proteasomal Cleavage Sites (NetChop) and T-Cell Epitopes (NetCTL and NetCTLpan)
The positive prediction thresholds (shown in green) are 0.5 for NetChop and 0.75 for NetCTL; values at or above these thresholds were considered positive predictions of proteasomal cleavage sites and T-cell epitopes, respectively; see Figs. 25-38 and Table 5.
A NetChop prediction score equal to or greater than 0.5 represented a positive result. In S glycoprotein, more than 300 peptides out of 1353 were positive, while in modified S glycoprotein 5 out of 66 were positive; in E protein 28 out of 82 were positive, and in modified E protein 28 out of 82 were positive.
Both E and modified E protein showed 28 amino acids that crossed the threshold of 0.5. The following residue positions were shared: F → 33; L → 58, 50, 39, 51, 28, 56, 2; Q → 70; R → 63; Y → 59 and 66; V → 67, 65, 41, 21, 22, 52, 29. The exceptions were: V → 82 in E protein but position 10 in modified E protein; L → 76 in E protein but positions 34 and 6 in modified E protein; F → 69 in E protein but positions 17 and 19 in modified E protein; W → 81 in E protein but position 11 in modified E protein; R → 38, I → 18, and K → 68 and 73 in E protein, while A → 32 in modified E protein; and M → 60 and Y → 57 in E protein.
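The comparison of cleavage positions between the two sequences can be sketched as thresholding followed by a set comparison on residue positions. This is an illustration under stated assumptions: the function names are hypothetical, the six-residue fragments are only the first residues of the E and modified E peptides listed in Table 5, and the scores are made up rather than NetChop output.

```python
NETCHOP_THRESHOLD = 0.5  # positive prediction threshold stated in the text


def cleavage_sites(sequence, scores, threshold=NETCHOP_THRESHOLD):
    """Return {1-based position: residue} for scores >= threshold."""
    if len(sequence) != len(scores):
        raise ValueError("one score is required per residue")
    return {
        pos: residue
        for pos, (residue, score) in enumerate(zip(sequence, scores), start=1)
        if score >= threshold
    }


def compare_positions(reference_sites, modified_sites):
    """Return (shared, reference-only, modified-only) sorted positions."""
    ref, mod = set(reference_sites), set(modified_sites)
    return sorted(ref & mod), sorted(ref - mod), sorted(mod - ref)


# Short fragments and made-up scores, for illustration only.
e_sites = cleavage_sites("MLPFVQ", [0.1, 0.9, 0.2, 0.7, 0.3, 0.6])
modified_e_sites = cleavage_sites("MLQFVQ", [0.1, 0.8, 0.6, 0.2, 0.3, 0.7])

shared, e_only, modified_only = compare_positions(e_sites, modified_e_sites)
print(shared, e_only, modified_only)
```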
Table 3
Positive peptide sequences selected for both S and modified S glycoprotein sequences by the NetMHCpan prediction tool
S Modified S
AFYCILEPRa AFYCILEPRa
ASLNSFKEYa,b ASLNSFKEYa,b
ATDCSDGNYa,b ATDCSDGNYa,b
AYQNLVGYYa,b AYQNLVGYYa,b
ALALCVFFIa AAIPFAQSI
CGTLLRAFYa ALGAMQTGF
CTFMYTYNIa,b AVNNNAQALb
CYSSLILDYa ALALCVFFIa
CMGKLKCNRa,b CGTLLRAFYa
DAYQNLVGYa,b CTFMYTYNIa,b
ESFDVESGV CYSSLILDYa
EMRLASIAFa CMGKLKCNRa,b
ETKTHATLFa DLSQLHCSY
ESAALSAQLa DAYQNLVGYa,b
FANGFVVRI b ETKTHATLFa
FLLTPTESYa EMRLASIAFa
FFNHTLVLLa,b EAAYTSSLL
FSDGKMGRFa ESAALSAQLa
FSSRYVDLYa FLLTPTSSYa
FQFATLPVY FFNHTLVLLa,b
FSVDGYIRR FSDGKMGRFa
FYVYKLQPLa FSSRYVDLYa
FSNPTCLILa,b FTNCNYNLTb
FQNCTAVGVa,b FYVYKLQPLa
FSFGVTQEYa FSNPTCLILa,b
FVVNAPNGL b FQNCTAVGVa,b
FQDELDEFFa FVYDAYQNLb
GVHLFSSRYa FSFGVTQEYa
GLVNSSLFVa,b FAQSIFYRL
GYYSDDGNYa,b FQDELDEFFa
GLYFMHVGYa GVHLFSSRYa
GQGTHIVSF GVRQQRFVY
GRLTTLNAFa,b GYYSDDGNYa,b
HSVFLLMFL GLVNSSLFVa,b
HISSTMSQYa GWTAGLSSF
IEVDIQQTFa GRLTTLNAFa,b
IIYPQGRTYc GLYFMHVGYa
ITITYQGLF HISSTMSQYa
ITYQGLFPYa IEVDIQQTFa
ITEDEILEWa IIYPQTRTYc
IASNCYSSLa,b ITYQGLFPYa
ILATVPHNLa,b ITEDEILEWa
ILDYFSYPLa IASNCYSSLa,b
ITKPLKYSYa ILATVPHNLa
IAFNHPIQVa,b ILDYFSYPLa
IEVVSAYGLa ITKPLKYSYa
IAGLVALALa IAFNHPIQVa,b
KQFANGFVVa,b ICAQYVAGY
KAWAAFYVYa IPFAQSIFY
KLQPLTFLLc IANKFNQALb
KETKTHATLa IEVVSAYGLa
KVTIADPGYa IPNFGSLTFb
KVTVDCKQYa IAGLVALALa
KELGNYTYYa,b KQFDNGFVVa,b
KYVAPQVTYa KAWAAFYVYa
LLRAFYCILa KLQPLTFLWc
LLDFSVDGY KETKTHATLa
LPVYDTIKYa KVTVDCKQYa
LYGGNMFQFb KVTIADPGYa
LSGTPPQVYa KYVAPQVTYa
LSLFSVNDF b KELGNYTYYa,b
LSIPTNFSFa,b LLRAFYCILa
LQMGFGITVa LPVYDTIKYa
LINGRLTTLa,b LSGTPPQVYa
LVRSESAALa LTFLWDFSV
LYFMHVGYYa LQMGFGITVa
LVALALCVFa LSIPTNFSFa,b
MGRFFNHTLa,b LGSIAGVGW
MLGSSVGNFa,b LSSFAAIPF
MGFGITVQYa LASELSNTFb
MTEQLQMGFa LINGRLTTLa,b
MLKRRDSTY LVRSESAALa
MSQYSRSTRa LTFINTTLLb
NLRNCTFMYa,b LYFMHVGYYa
NSYTSFATYa,b LVALALCVFa
NSVCPKLEFa,b MGRFFNHTLa,b
NHIEVVSAYa,b MLGSSVGNFa,b
NTTLLDLTY b MGFGITVQYa
PVYDTIKYY MSQYSRSTRa
QFANGFVVR b MTEQLQMGFa
QTAQGVHLFa MEAAYTSSL
QPLTFLLDFc NLRNCTFMYa,b
QSFSNPTCLa,b NSYTSFATYa,b
QALHGANLRb NSVCIKLEFa,b
QSSPIIPGFa NHIEVVSAYa,b
RFFNHTLVLa,b QTAQGVHLFa
RNCTFMYTYa QLHCSYESF
RLVFTNCNYa,b QPLTFLWDFc
RSTRSMLKRa QSFSNPTCLa,b
RSAIEDLLFa QQRFVYDAY
SVFLLMFLL QVDQLNSSY b
SFKEYFNLRa,b QSSPIIPGFa
SLNSFKEYFa,b RFFNHTLVLa,b
SFDVESGVYa RNCTFMYTYa,b
SGVYSVSSFa RLVFTNCNYa,b
SLILDYFSYa RSTRSMLKRa
SQFNYKQSFa,b RSAIEDLLFa
SSAGPISQFa SFKEYFNLRa,b
SPLEGGGWLa SLNSFKEYFa,b
SQLGNCVEYa,b SFDVESGVYa
STVAMTEQL SGVYSVSSFa
STVWEDGDYa SLILDYFSYa
SYINKCSRLa,b SPLEGGGWLa
SSTMSQYSRa SQFNYKQSFa,b
STLTPRSVRa SSAGPISQFa
STRSMLKRRa STVWEDGDYa
SVRNLFASVa,b SYINKCSRLa,b
TFFDKTWPRa SSTMSQYSRa
TYSNITITYa,b STRSMLKRRa
TAVGVRQQRa SQLGNCVEYa,b
TVWEDGDYYa STLTPRSVRa
TLLDLTYEM SLLGSIAGV
TSIPNFGSLa,b SVRNLFASVa,b
TYQNISTNLa,b TFFDKTWPRa
TYYNKWPWYa,b TYSNITITYa,b
VSKADGIIYa TTITKPLKY
VYKLQPLTFa TVWEDGDYYa
VECDFSPLLa TAVGVRQQRa
VYNFKRLVFa,b TTNEAFQKVb
VASGSTVAM TSIPNFGSLa,b
VSIVPSTVWa TYQNISTNLa,b
VSVPVSVIYa TYYHKWPWYa
VNAPNGLYFa,b VSKADGIIYa
VVNAPNGLYa,b VECDFSPLLa
VALALCVFFa VYKLQPLTFa
VVKALNESYa,b VYNFKRLVFa,b
WPWYIWLGFa VSIVPSTVWa
WAAFYVYKLa VSVPVSVIYa
YQGDHGDMYc VNAPNGLYFa,b
YFNLRNCTFa,b VVNAPNGLYa,b
YYSIIPHSIa VALALCVFFa
YSIIPHSIRa VVKALNESYa,b
YNLTKLLSLa,b WPWYIWLGFa
YPLSMKSDLa WSYTGSSFY
YSSLILDYFa WTAGLSSFA
YGVSGRGVFa WAAFYVYKLa
YINKCSRLLa YQGDHGDYYc
YSLYGVSGRa YFNLRNCTFa,b
YSYINKCSRa,b YNLTKLLSLa,b
YYRKQLSPLa YSIIPHSIRa
YSRSTRSMLa YYSIIPHSIa
YYSDDGNYYa,b YINKCSRLLa,b
YYPSNHIEVa,b YPLSMKSDLa
YAPEPITSLa YSSLILDYFa
YTYYNKWPWb,c YSYINKCSRa,b
YYNKWPWYIb,c YYRKQLSPLa
YGVSGRGVFa
YSLYGVSGRa
YSRSTRSMLa
YYSDDGNYYa,b
YAPEPITSLa
YYPSNHIEVa,b
YTYYHKWPWc
YYHKWPWYIc
a Indicates a common peptide sequence
b Indicates the presence of asparagine (N) in the sequence
c Indicates partial similarity between the reference sequence and the modified sequence
Table 4
Positive peptide sequences selected for both E and modified E protein by the NetMHCpan prediction tool
E            Modified E
ALYLYNTGRa   KPPLPEDVW
CMAFLTATR
FTVVCAITL
FVQERIGLF
ITLLVCMAF
LFIVNFFIFa
LVQPALYLY
LYNTGRSVYa
MAFLTATRL
RIGLFIVNFa
TLLVQPALY
a Indicates the presence of asparagine (N) in the sequence
Fig. 25 NetChop prediction for E protein (x-axis: Position; y-axis: Score; legend: threshold 0.5, positive prediction, negative prediction). Positive predictions have a score equal to or greater than the threshold of 0.5
N.B.:
1. The peptide sequences of E and modified E protein differed even when they shared the same residue position.
2. NetCTL was used only for E and modified E protein, because with S glycoprotein it produces large amounts of data and is time-consuming.
3. NetCTL charts for modified E protein are not shown here.
Fig. 26 NetChop prediction for modified E protein (x-axis: Position; y-axis: Score). Positive predictions have a score equal to or greater than the threshold of 0.5
Fig. 27 NetCTL prediction for E protein, supertype A1 (x-axis: Position; y-axis: Score). Positive predictions, with scores equal to or greater than the threshold of 0.75 (red line), are shown in green
3.2.5 MHC-NP: Prediction of Peptides Naturally Processed by the MHC
A higher probe score indicates a naturally processed peptide; peptides with probe scores greater than 0 were considered naturally processed.
The total number of positive epitopes (naturally processed peptides) was 10,189 out of 10,760 in S glycoprotein and 10,187 out of 10,760 in modified S glycoprotein, while it was 568 out of 592 in E and 566 out of 592 in modified E protein.
E protein showed the following allele frequencies: H-2-Db (74), H-2-Kb (74), HLA-A∗02:01 (68), HLA-B∗07:02 (66), HLA-B∗35:01 (74), HLA-B∗44:03 (74), HLA-B∗53:01 (73), and HLA-B∗57:01 (62), while in modified E they were H-2-Db (28), H-2-Kb (16), HLA-A∗02:01 (5), HLA-B∗07:02 (2), HLA-B∗35:01 (6), HLA-B∗44:03 (28), HLA-B∗53:01 (60), and HLA-B∗57:01 (4).
Fig. 28 NetCTL prediction for E protein, supertype A2 (x-axis: Position; y-axis: Score). Positive predictions, with scores equal to or greater than the threshold of 0.75 (red line), are shown in green
Fig. 29 NetCTL prediction for E protein, supertype A3 (x-axis: Position; y-axis: Score). Positive predictions, with scores equal to or greater than the threshold of 0.75 (red line), are shown in green
Fig. 30 NetCTL prediction for E protein, supertype A24 (x-axis: Position; y-axis: Score). Positive predictions, with scores equal to or greater than the threshold of 0.75 (red line), are shown in green
Fig. 31 NetCTL prediction for E protein, supertype A26 (x-axis: Position; y-axis: Score). Positive predictions, with scores equal to or greater than the threshold of 0.75 (red line), are shown in green
Fig. 32 NetCTL negative prediction for E protein, supertype B7 (x-axis: Position; y-axis: Score), with scores below the threshold of 0.75
[Figure: NetCTL Prediction (x-axis: Position; y-axis: Score)]
N.B.: modified E protein showed lower allele frequencies than E protein, in addition to some epitope differences even at the same positions.
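The MHC-NP counts reported in this subsection are easier to compare as fractions of the total. The short sketch below recomputes them from the numbers given in the text; the dictionary layout and the function name are choices made for this example only.

```python
# (positive, total) counts for naturally processed peptides, from the text.
MHC_NP_COUNTS = {
    "S": (10189, 10760),
    "modified S": (10187, 10760),
    "E": (568, 592),
    "modified E": (566, 592),
}


def positive_fraction(positive, total):
    """Fraction of peptides with a probe score greater than 0."""
    if total <= 0:
        raise ValueError("total must be positive")
    return positive / total


fractions = {
    name: positive_fraction(positive, total)
    for name, (positive, total) in MHC_NP_COUNTS.items()
}
for name, fraction in fractions.items():
    print(f"{name}: {100 * fraction:.2f}%")
```

In all four sequences roughly 95% of the peptides exceed the probe score cutoff, so the reference and modified sequences differ only marginally by this measure.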
Fig. 34 NetCTL negative prediction for E protein, supertype B27 (x-axis: Position; y-axis: Score)
[Figure: NetCTL Prediction (x-axis: Position; y-axis: Score)]
[Figure: NetCTL Prediction (x-axis: Position; y-axis: Score)]
Fig. 37 NetCTL prediction for E protein, supertype B58 (x-axis: Position; y-axis: Score). Positive predictions, with scores equal to or greater than the threshold of 0.75 (red line), are shown in green
Fig. 38 NetCTL prediction for E protein, supertype B62 (x-axis: Position; y-axis: Score; legend: threshold 0.75, positive prediction, negative prediction). Positive predictions, with scores equal to or greater than the threshold of 0.75 (red line), are shown in green
3.3 Epitope Analysis Tools
3.3.1 Population Coverage Calculation
For the alleles interacting with MHC-I and MHC-II, the IEDB population coverage calculation tool was used to compute the average number of epitope hits/HLA combinations recognized by the population and the minimum number of epitope hits/HLA combinations recognized by 90% of the population (PC90); see tables below.
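The idea behind a coverage percentage can be illustrated with a deliberately simplified model. The sketch below is not the IEDB algorithm: it assumes Hardy-Weinberg proportions at each locus and independence between loci, and the function names and allele frequencies are hypothetical. Under these assumptions, an individual is covered at a locus if at least one of the two inherited alleles is among those that bind a selected epitope.

```python
from typing import Iterable


def locus_coverage(covered_allele_frequencies: Iterable[float]) -> float:
    """Probability that a diploid individual carries at least one covered
    allele at a single locus, assuming Hardy-Weinberg proportions."""
    total = sum(covered_allele_frequencies)
    if total < 0 or total > 1 + 1e-9:
        raise ValueError("frequencies at one locus must sum to at most 1")
    total = min(total, 1.0)
    return 1.0 - (1.0 - total) ** 2


def combined_coverage(per_locus_coverages: Iterable[float]) -> float:
    """Coverage across loci, assuming the loci are independent."""
    uncovered = 1.0
    for coverage in per_locus_coverages:
        uncovered *= 1.0 - coverage
    return 1.0 - uncovered


# Hypothetical allele frequencies for two loci in one population.
locus_a = locus_coverage([0.2, 0.1])
locus_b = locus_coverage([0.5])
print(round(locus_a, 4), round(locus_b, 4))
print(round(combined_coverage([locus_a, locus_b]), 4))
```

A population with no frequency data for the covered alleles yields 0% in this model, which is one possible reading of the 0.00% entries in the coverage tables.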
The following E protein epitopes were selected for the population coverage calculation:
PFVQER, VQERIG, QERIGL, FLTATR, LYLYNT,
YLYNTG, LYNTGR, YNTGRS, NTGRSV, TGRSVY, RSVYVK,
YVKFQD, VKFQDS, KFQDSK, FQDSKP, QDSKPP, DSKPPL,
SKPPLP, KPPLPP, PPLPPD, PLPPDE, LPPDEW, PPDEWV,
MLPFVQE, LPFVQER, PFVQERI, VQERIGL, RIGLFIV,
IGLFIVN, GLFIVNF, LFIVNFF, FIVNFFI, IVNFFIF, and
VNFFIFT.
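The epitopes listed above are overlapping 6-mers and 7-mers, which correspond to sliding a fixed-length window along the sequence. A minimal sketch follows; the function name is hypothetical, and the nine-residue input is the peptide MLPFVQERI listed for E protein in Table 5 rather than the full E protein sequence.

```python
def sliding_windows(sequence, k):
    """Return all overlapping substrings of length k, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


fragment = "MLPFVQERI"  # first E protein peptide listed in Table 5
print(sliding_windows(fragment, 7))
print(sliding_windows(fragment, 6))
```

The 7-mers MLPFVQE, LPFVQER, and PFVQERI produced from this fragment all appear in the list above.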
There are differences between the MHC-I and MHC-II population coverage percentages.
MHC-I coverage is similar for E and modified E protein, but there are still differences between them for MHC-II.
The following modified E protein epitopes were selected for the population coverage calculation:
RSVYVP, LYMTGR, VYVPQQ, PLPEDV, QERIGW, TGRSVY, YMTGRS, QFVQER, VPQQDS, SKPPLP, PPLPED, DSKPPL, YVPQQD, KPPLPE, QDSKPP, PQQDSK, QQDSKP, PLPEDVW, QFVQERI, AFLTATH, MLQFVQE, ALSLYMT, LQFVQER, VQCMTGF, YVPQQDS, GFNTLLV, PPLPEDV, FLTATHL, TGRSVYV, PALSLYM, NTLLVQP, FNTLLVQ, LPEDVWV, and CTAFLTA.
Table 5
NetCTL positive results in E and modified E protein, indicating similarities and differences in the peptide sequences between them, together with their total numbers
Supertype  E protein   Modified E protein  Residue position (E/modified E)
A1         LVQPALYLY   LVQPALSLY           51/51
           LYNTGRSVY                       58/58
A2         FVQERIGWF   FVQERIGWF           4/4
           VVCDITLLV   VVCDITLLV           21/21
           FLTATHLCV   FLTATHLCV           33/33
           LLVQPALSL   LLVQPALSL           50/50
           SLYMTGRSV   SLYMTGRSV           57/57
           YMTGRSVYV   YMTGRSVYV           59/59
A3         ALYLYNTGR   ALSLYMTGR           55/55
           NTGRSVYVK                       60/
           VYVKFQDSK                       65/
A24        MLPFVQERI   MLQFVQERI           1/1
           PFVQERIGL   FVQERIGWF           3/4
           FVQERIGLF   RIGWFIPNF           4/8
           RIGLFIVNF   WFIPNFFDF           8/11
           IGLFIVNFF   FTVVCDITL           9/19
           LFIVNFFIF   ITLLVCTAF           11/25
           FTVVCAITL   LVQPALSLY           19/51
           ITLLVCMAF   LYMTGRSVY           25/58
           MAFLTATRL                       31/
           LVQPALYLY                       51/
           LYNTGRSVY                       58/
           TGRSVYVKF                       61/
           KFQDSKPPL                       68/
A26        FVQERIGWF   FVQERIGWF           4/4
           RIGWFIPNF   RIGWFIPNF           8/8
           WFIPNFFDF   WFIPNFFDF           11/11
           TVVCDITLL   TVVCDITLL           20/20
           ITLLVCTAF   ITLLVCTAF           25/25
           ATHLCVQCM   ATHLCVQCM           36/36
           LCVQCMTGF   LCVQCMTGF           39/39
           QCMTGFNTL   QCMTGFNTL           42/42
           NTLLVQPAL   NTLLVQPAL           48/48
           LVQPALSLY   LVQPALSLY           51/51
B7         –           LLVQPALSL           /50
                       QPALSLYMT           /53
                       KPPLPEDVW           /3
B8         FVQERIGLF   FVQERIGWF           4/4
           TGRSVYVKF   WFIPNFFDF           61/11
B27        –           –                   –
B39        YNTGRSVYV   YMTGRSVYV           59/59
           KFQDSKPPL                       68
B44        –           –                   –
B58        ITLLVCMAF   IGWFIPNFF           25/9
           KPPLPPDEW   ITLLVCTAF           73/25
                       KPPLPEDVW           /3
B62        FVQERIGLF   FVQERIGWF           4/4
           ITLLVCMAF   WFIPNFFDF           25/11
           TLLVQPALY   ITLLVCTAF           49/25
           LVQPALYLY   LVQPALSLY           51/51
           YLYNTGRSV   LYMTGRSVY           57/58
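The residue positions reported in Table 5 are 1-based start positions of each peptide in the parent sequence. The sketch below shows how such a position can be looked up; the function name is hypothetical, and the 17-residue fragment is an assumption, assembled here from the overlapping A24 peptides listed for E protein in Table 5 (MLPFVQERI, RIGLFIVNF, IGLFIVNFF) rather than taken from a sequence database.

```python
from typing import Optional


def one_based_position(sequence: str, peptide: str) -> Optional[int]:
    """Return the 1-based start of the first occurrence of `peptide`
    in `sequence`, or None if the peptide is absent."""
    index = sequence.find(peptide)
    return index + 1 if index >= 0 else None


# N-terminal fragment reconstructed from overlapping Table 5 peptides.
e_fragment = "MLPFVQERIGLFIVNFF"

for peptide in ["MLPFVQERI", "PFVQERIGL", "FVQERIGLF", "RIGLFIVNF", "IGLFIVNFF"]:
    print(peptide, one_based_position(e_fragment, peptide))
```

For this fragment the computed start positions agree with the E protein positions given for the first five A24 peptides in Table 5.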
The population coverage percentage was similar for the S glycoprotein reference sequence and modified S glycoprotein. For MHC-I it was 95.60% of the world population; 118 countries showed a higher percentage, especially Chile Amerindian (100%), while 69 other countries showed 0%. Coverage was 94.80% in East Asia, 92.84% in South Korea and South Korea Oriental, 88.77% in China, 91.53% in Iran and Iran Persian but 0.00% in Iran Kurd, 76.80% in Jordan and Jordan Arab, 95.82% in Oman and Oman Arab, 96.38% in Saudi Arabia and Saudi Arabia Arab, 0.00% in United Arab Emirates and United Arab Emirates Arab, 86.43% in Sudan, 49.41% in Sudan Arab, 0.00% in Sudan Black, and 87.06% in Sudan Mixed; see Table 6.
For MHC-II, the population coverage percentage was likewise similar for the S glycoprotein reference sequence and modified S glycoprotein. It was 81.81% of the world population; 64 countries showed a higher percentage, especially Norway and Norway Caucasoid (94.71%), while 59 other countries showed 0%. Coverage was 94.80% in East Asia, 85.32% in South Korea and South Korea Oriental, 59.99% in China, 64.22% in Iran, 55.78% in Iran Persian, 65.72% in Iran Kurd, 52.88% in Jordan and Jordan Arab, 0.00% in Oman and Oman Arab, 80.14% in Saudi Arabia and Saudi Arabia Arab, 32.92% in United Arab Emirates and United Arab Emirates Arab, 60.56% in Sudan, 0.00% in Sudan Arab, 0.00% in Sudan Black, and 60.56% in Sudan Mixed; see Table 7.
For MHC-I coverage of E protein, the world coverage was 95.60%; 116 countries showed a higher percentage, especially Chile Amerindian (100%), and 23 other countries showed more than 4% but less than 50%, while in East Asia it
Table 6
MHC-I population coverage for S and modified S glycoprotein
Class I
Population/Area Coverage^a Average hit^b PC90^c
East Asia 94.80% 10.93 2.58
Japan 96.19% 11.44 3.12
Japan Oriental 96.19% 11.44 3.12
Korea, South 92.84% 10.41 2.16
Korea, South Oriental 92.84% 10.41 2.16
Mongolia 94.37% 10.07 3.12
Mongolia Oriental 94.37% 10.07 3.12
Northeast Asia 88.80% 9.38 0.89
China 88.77% 9.33 0.89
China Oriental 88.77% 9.33 0.89
Hong Kong 90.85% 10.01 1.91
Hong Kong Oriental 90.85% 10.01 1.91
South Asia 86.54% 8.03 0.74
India 82.00% 7.21 0.56
India Asian 82.00% 7.21 0.56
Pakistan 88.63% 8.74 1.76
Pakistan Asian 87.30% 8.38 1.58
Pakistan Mixed 91.12% 9.42 3.23
Sri Lanka 52.39% 3.74 0.84
Sri Lanka Asian 52.39% 3.74 0.84
Southeast Asia 87.81% 9.99 0.82
Borneo 0.00% 0 ?
Borneo Austronesian 0.00% 0 ?
Indonesia 76.44% 7.8 0.42
Indonesia Austronesian 76.44% 7.8 0.42
Malaysia 76.30% 7.64 0.42
Malaysia Austronesian 40.59% 3.17 0.34
Malaysia Oriental 84.44% 9.02 0.64
Philippines 92.86% 11.56 8.01
Philippines Austronesian 92.86% 11.56 8.01
Singapore 85.74% 9.04 0.7
Singapore Austronesian 82.82% 8.55 0.58
Singapore Oriental 88.96% 9.64 0.91
Taiwan 92.58% 11.31 6.08
Taiwan Oriental 92.58% 11.31 6.08
Thailand 82.85% 7.46 0.58
Thailand Oriental 82.85% 7.46 0.58
Vietnam 84.58% 8.55 0.65
Vietnam Oriental 84.58% 8.55 0.65
Southwest Asia 85.77% 7.59 0.7
Iran 91.53% 8.6 1.33
Iran Kurd 0.00% 0 ?
Iran Persian 91.53% 8.6 1.33
Israel 82.14% 7.29 0.56
Israel Arab 89.15% 9.13 0.92
Israel Jew 87.17% 7.84 0.78
Jordan 76.80% 6.52 0.43
Jordan Arab 76.80% 6.52 0.43
Lebanon 0.00% 0 0
Lebanon Arab 0.00% 0 ?
Lebanon Mixed 0.00% 0 0
Oman 95.82% 9.96 3.04
Oman Arab 95.82% 9.96 3.04
Saudi Arabia 96.38% 9.87 3.65
Saudi Arabia Arab 96.38% 9.87 3.65
United Arab Emirates 0.00% 0 0
United Arab Emirates Arab 0.00% 0 0
Europe 97.81% 11.07 5.29
Austria 98.78% 11.29 6
Austria Caucasoid 98.78% 11.29 6
Belarus 0.00% 0 ?
Belarus Caucasoid 0.00% 0 ?
Belgium 98.75% 10.62 6.02
Belgium Caucasoid 98.75% 10.62 6.02
Bulgaria 96.59% 11.08 4.52
Bulgaria Caucasoid 96.56% 11.25 4.57
Bulgaria Other 97.43% 10.02 4.35
Croatia 97.76% 11.79 6.12
Croatia Caucasoid 97.76% 11.79 6.12
Czech Republic 96.20% 9.39 4.33
Czech Republic Caucasoid 96.20% 9.39 4.33
Czech Republic Other 0.00% 0 ?
Denmark 0.00% 0 0
Denmark Caucasoid 0.00% 0 0
England 99.29% 11.43 6.21
England Caucasoid 99.29% 11.43 6.21
England Jew 0.00% 0 0
England Mixed 0.00% 0 ?
Finland 99.80% 12.56 7.8
Finland Caucasoid 99.80% 12.56 7.8
France 98.05% 10.72 4.75
France Caucasoid 98.05% 10.72 4.75
Georgia 95.62% 10.98 4.48
Georgia Caucasoid 97.22% 11.66 6.21
Georgia Kurd 89.99% 9.26 1
Germany 99.07% 11.71 6.4
Germany Caucasoid 99.07% 11.71 6.4
Greece 0.00% 0 ?
Greece Caucasoid 0.00% 0 ?
Ireland Northern 99.40% 11.43 6.27
Ireland Northern Caucasoid 99.40% 11.43 6.27
Ireland South 98.83% 10.82 4.85
Ireland South Caucasoid 98.83% 10.82 4.85
Italy 96.52% 9.83 4.16
Italy Caucasoid 96.52% 9.83 4.16
Macedonia 11.83% 0.86 0.45
Macedonia Caucasoid 11.83% 0.86 0.45
Netherlands 0.00% 0 ?
Netherlands Caucasoid 0.00% 0 ?
Norway 0.00% 0 ?
Norway Caucasoid 0.00% 0 ?
Poland 97.99% 11.25 6.02
Poland Caucasoid 97.99% 11.25 6.02
Portugal 97.11% 10.98 4.73
Portugal Caucasoid 97.11% 10.98 4.73
Romania 97.94% 11.56 5.94
Romania Caucasoid 97.94% 11.56 5.94
Russia 96.71% 11.38 4.59
Russia Caucasoid 0.00% 0 0
Russia Mixed 0.00% 0 0
Russia Other 98.34% 12.46 6.71
Russia Siberian 97.30% 11.52 4.53
Scotland 15.91% 0.81 0.24
Scotland Caucasoid 15.91% 0.81 0.24
Serbia 43.75% 0.78 0.18
Serbia Caucasoid 43.75% 0.78 0.18
Slovakia 0.00% 0 ?
Slovakia Caucasoid 0.00% 0 ?
Slovenia 0.00% 0 ?
Slovenia Caucasoid 0.00% 0 ?
Spain 71.85% 5.51 0.36
Spain Caucasoid 71.85% 5.51 0.36
Spain Jew 0.00% 0 ?
Spain Other 0.00% 0 ?
Sweden 99.69% 12.61 6.84
Sweden Caucasoid 99.69% 12.61 6.84
Switzerland 0.00% 0 0
Switzerland Caucasoid 0.00% 0 0
Turkey 44.80% 3.58 1.45
Turkey Caucasoid 44.80% 3.58 1.45
Ukraine 0.00% 0 ?
Ukraine Caucasoid 0.00% 0 ?
United Kingdom 0.00% 0 0
United Kingdom Caucasoid 0.00% 0 0
Wales 0.00% 0 0
Wales Caucasoid 0.00% 0 0
East Africa 86.99% 6.96 0.77
Kenya 85.86% 6.62 0.71
Kenya Black 85.86% 6.62 0.71
Uganda 91.04% 8.19 1.48
Uganda Black 91.04% 8.19 1.48
Zambia 95.32% 7.98 4.01
Zambia Black 95.32% 7.98 4.01
Zimbabwe 91.57% 7.69 1.71
Zimbabwe Black 91.57% 7.69 1.71
West Africa 92.60% 8.71 1.67
Burkina Faso 58.50% 3.24 0.24
Burkina Faso Black 58.50% 3.24 0.24
Cape Verde 96.69% 10.09 4.14
Cape Verde Black 96.69% 10.09 4.14
Gambia 0.00% 0 ?
Gambia Black 0.00% 0 ?
Ghana 0.00% 0 0
Ghana Black 0.00% 0 0
Guinea-Bissau 92.66% 8.7 1.49
Guinea-Bissau Black 92.66% 8.7 1.49
Ivory Coast 58.05% 0.78 0.24
Ivory Coast Black 58.05% 0.78 0.24
Liberia 0.00% 0 ?
Liberia Black 0.00% 0 ?
Nigeria 0.00% 0 ?
Nigeria Black 0.00% 0 ?
Senegal 95.03% 9.11 4
Senegal Black 95.03% 9.11 4
Central Africa 84.98% 6.7 0.67
Cameroon 88.67% 7.35 0.88
Cameroon Black 88.67% 7.35 0.88
Central African Republic 10.75% 0.27 0.11
Central African Republic Black 10.75% 0.27 0.11
Congo 0.00% 0 ?
Congo Black 0.00% 0 ?
Equatorial Guinea 0.00% 0 0
Equatorial Guinea Black 0.00% 0 0
Gabon 0.00% 0 ?
Gabon Black 0.00% 0 ?
Rwanda 23.09% 1.33 0.13
Rwanda Black 23.09% 1.33 0.13
Sao Tome and Principe 95.54% 8.72 2.29
Sao Tome and Principe Black 95.54% 8.72 2.29
North Africa 91.87% 8.61 1.86
Algeria 0.00% 0 ?
Algeria Arab 0.00% 0 ?
Ethiopia 0.00% 0 ?
Ethiopia Black 0.00% 0 ?
Mali 94.28% 8.82 1.74
Mali Black 94.28% 8.82 1.74
Morocco 95.95% 9.47 4.19
Morocco Arab 97.89% 10.2 4.47
Morocco Caucasoid 94.32% 8.96 4.02
Sudan 86.43% 7.53 0.74
Sudan Arab 49.41% 4.62 0.59
Sudan Black 0.00% 0 0
Sudan Mixed 87.06% 7.56 0.77
Tunisia 96.04% 9.85 4.19
Tunisia Arab 96.04% 9.85 4.19
Tunisia Berber 0.00% 0 ?
South Africa 91.05% 8 2.1
South Africa Black 86.71% 6.67 0.75
South Africa Other 93.82% 9.59 2.73
West Indies 97.34% 10.78 4.6
Cuba 97.20% 10.65 4.53
Cuba Caucasoid 97.64% 11.2 4.77
Cuba Mixed 0.00% 0 ?
Cuba Mulatto 96.58% 9.66 4.09
Jamaica 0.00% 0 ?
Jamaica Black 0.00% 0 ?
Martinique 22.56% 2.03 1.16
Martinique Black 22.56% 2.03 1.16
Trinidad and Tobago 0.00% 0 0
Trinidad and Tobago Asian 0.00% 0 0
North America 96.88% 10.98 4.65
Canada 0.00% 0 ?
Canada Amerindian 0.00% 0 ?
Mexico 97.10% 11 6.02
Mexico Amerindian 99.86% 13 7.84
Mexico Mestizo 96.78% 10.7 4.46
United States 96.93% 10.98 4.66
United States Amerindian 99.44% 13.15 8.19
United States Asian 92.39% 10.32 2.29
United States Austronesian 0.00% 0 ?
United States Black 94.18% 8.83 2.54
United States Caucasoid 98.65% 11.4 6.08
United States Hispanic 97.46% 11.01 4.77
United States Mestizo 98.09% 11.2 4.97
United States Polynesian 97.53% 11.57 3.62
Central America 5.10% 0.16 0.11
Costa Rica 0.00% 0 ?
Costa Rica Mestizo 0.00% 0 ?
Guatemala 5.10% 0.16 0.11
Guatemala Amerindian 5.10% 0.16 0.11
South America 86.24% 8.01 0.73
Argentina 98.02% 8.76 2.61
Argentina Amerindian 98.02% 8.76 2.61
Argentina Caucasoid 0.00% 0 ?
Bolivia 0.00% 0 ?
Bolivia Amerindian 0.00% 0 ?
Brazil 93.72% 9.43 2.69
Brazil Amerindian 92.35% 8.37 2.16
Brazil Caucasoid 97.68% 11.33 5.35
Brazil Mixed 95.06% 9.85 3.75
Brazil Mulatto 0.00% 0 ?
Brazil Other 0.00% 0 0
Chile 94.93% 10.63 4.37
Chile Amerindian 100.00% 14.31 9.11
Chile Hispanic 0.00% 0 ?
Chile Mixed 87.43% 8.16 0.8
Colombia 9.86% 0.76 0.67
Colombia Amerindian 0.00% 0 0
Colombia Black 5.79% 0.42 0.64
Colombia Mestizo 14.81% 1.17 0.7
Ecuador 76.97% 8.77 1.74
Ecuador Amerindian 76.97% 8.77 1.74
Ecuador Black 0.00% 0 ?
Paraguay 0.00% 0 ?
Paraguay Amerindian 0.00% 0 ?
Peru 99.98% 13.69 8.37
Peru Amerindian 99.98% 13.69 8.37
Peru Mestizo 0.00% 0 0
Venezuela 88.37% 9.05 0.86
Venezuela Amerindian 88.88% 8.98 0.9
Venezuela Caucasoid 9.18% 0.83 0.99
Venezuela Mestizo 7.84% 0.71 0.98
Venezuela Mixed 0.00% 0 ?
Oceania 91.82% 10.92 4.06
American Samoa 95.26% 12.14 7.15
American Samoa Polynesian 95.26% 12.14 7.15
Australia 89.30% 9.93 0.93
Australia Australian Aborigines 82.36% 9.31 0.57
Australia Caucasoid 99.06% 11.46 6.16
Chile 94.93% 10.63 4.37
Cook Islands 0.00% 0 ?
Cook Islands Polynesian 0.00% 0 ?
Fiji 0.00% 0 ?
Fiji Melanesian 0.00% 0 ?
Kiribati 0.00% 0 ?
Kiribati Micronesian 0.00% 0 ?
Nauru 0.00% 0 ?
Nauru Micronesian 0.00% 0 ?
New Caledonia 96.70% 12.14 8.63
New Caledonia Melanesian 96.70% 12.14 8.63
New Zealand 0.00% 0 ?
New Zealand Polynesian 0.00% 0 ?
Niue 0.00% 0 ?
Niue Polynesian 0.00% 0 ?
Papua New Guinea 97.26% 12.58 8.57
Papua New Guinea Melanesian 97.26% 12.58 8.57
Samoa 0.00% 0 ?
Samoa Polynesian 0.00% 0 ?
Tokelau 0.00% 0 ?
Tokelau Polynesian 0.00% 0 ?
Tonga 0.00% 0 ?
Tonga Polynesian 0.00% 0 ?
Average 55.31% 5.73 ?
(Standard deviation) 44.16% 4.92 (?)
a Projected population coverage
b Average number of epitope hits/HLA combinations recognized by the population
c Minimum number of epitope hits/HLA combinations recognized by 90% of the population
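The flattened table rows above follow a regular pattern (population name, coverage percentage, average hit, and the 90% value, with "?" where no value was reported), so they can be read back into structured records. The following is a minimal Python sketch under that assumption; the names CoverageRow, ROW_RE, and parse_row are hypothetical helpers introduced here for illustration and are not part of the population coverage tool.

```python
import re
from typing import NamedTuple, Optional


class CoverageRow(NamedTuple):
    population: str
    coverage_pct: float      # footnote a: projected population coverage
    average_hit: float       # footnote b: average epitope hits/HLA combinations
    pc90: Optional[float]    # footnote c: None when the table prints "?"


# Name, then "<number>%", then a number, then a number or "?".
ROW_RE = re.compile(
    r"^(?P<name>.+?)\s+(?P<cov>\d+(?:\.\d+)?)%\s+"
    r"(?P<avg>\d+(?:\.\d+)?)\s+(?P<pc90>\?|\d+(?:\.\d+)?)$"
)


def parse_row(line: str) -> Optional[CoverageRow]:
    """Parse one flattened table row; return None for non-data lines."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    pc90 = None if m["pc90"] == "?" else float(m["pc90"])
    return CoverageRow(m["name"], float(m["cov"]), float(m["avg"]), pc90)


if __name__ == "__main__":
    for text in ("Sri Lanka Asian 52.39% 3.74 0.84", "Borneo 0.00% 0 ?"):
        print(parse_row(text))
```

Lines that are not data rows, such as table titles, simply return None and can be skipped by the caller.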
Table 7
MHC-II population coverage for S and modified S glycoprotein
Class II
Population/area  Coverage (a)  Average hit (b)  PC90 (c)
World 81.81% 8.16 1.1
East Asia 81.82% 8.83 1.1
Japan 74.83% 7.85 0.79
Korea, South 85.32% 9.56 1.36
Mongolia 81.85% 7.79 1.1
China 59.99% 5.33 0.5
Hong Kong 0.00% 0 ?
Hong Kong Oriental 0.00% 0 ?
South Asia 75.38% 7.4 0.81
India 74.99% 7.35 0.8
India Asian 74.99% 7.35 0.8
Pakistan 1.18% 0.09 0.81
Pakistan Mixed 0.00% 0 0
Sri Lanka 0.00% 0 ?
Sri Lanka Asian 0.00% 0 ?
Borneo 49.02% 4.03 0.39
Borneo Austronesian 49.02% 4.03 0.39
Indonesia 47.84% 4.4 0.38
Malaysia 57.99% 5.34 0.48
Philippines 28.56% 2.52 0.28
Singapore 65.78% 6.04 0.58
Singapore Oriental 0.00% 0 ?
Taiwan 67.88% 6.13 0.62
Thailand 63.90% 5.92 0.55
Vietnam 54.44% 4.43 0.44
Iran 64.22% 5.65 0.56
Iran Kurd 55.78% 4.74 0.45
Iran Persian 65.72% 5.83 0.58
Israel 68.79% 6.4 0.64
Israel Arab 67.51% 6.2 0.62
Israel Jew 69.65% 6.51 0.66
Jordan 52.88% 4.56 0.42
Jordan Arab 52.88% 4.56 0.42
Lebanon 70.46% 6.48 0.68
Lebanon Arab 70.46% 6.48 0.68
Lebanon Mixed 0.00% 0 ?
Oman 0.00% 0 ?
Oman Arab 0.00% 0 ?
Saudi Arabia 80.14% 8.31 1.01
United Arab Emirates 32.92% 0.66 0.3
United Arab Emirates Arab 32.92% 0.66 0.3
Europe 85.83% 8.88 1.41
Austria 93.34% 10.8 2.82
Austria Caucasoid 93.34% 10.8 2.82
Belarus 43.81% 3.55 1.25
Belarus Caucasoid 43.81% 3.55 1.25
Belgium 79.39% 7.16 0.97
Bulgaria 57.23% 4.95 0.47
Bulgaria Other 0.00% 0 ?
Croatia 66.71% 5.89 0.6
Czech Republic Other 64.14% 6.4 0.56
Denmark 88.98% 9.04 1.81
Denmark Caucasoid 88.98% 9.04 1.81
England 93.48% 10.49 2.74
England Jew 0.00% 0 ?
England Mixed 0.00% 0 0
Finland 51.14% 4.24 0.41
France 88.54% 9.29 1.74
Georgia 75.05% 7.09 0.8
Georgia Kurd 0.00% 0 ?
Germany 91.14% 10.14 2.26
Greece 66.92% 6.29 0.6
Greece Caucasoid 66.92% 6.29 0.6
Ireland South 93.15% 10 2.51
Ireland South Caucasoid 93.15% 10 2.51
Italy 85.90% 5.93 1.42
Macedonia 66.53% 6.2 0.6
Netherlands 83.44% 8.33 1.21
Netherlands Caucasoid 83.44% 8.33 1.21
Norway 94.71% 10.56 3.01
Norway Caucasoid 94.71% 10.56 3.01
Poland 84.46% 8.85 1.29
Portugal 78.00% 7.74 0.91
Romania 0.00% 0 ?
Romania Caucasoid 0.00% 0 ?
Russia 77.62% 7.24 0.89
Russia Caucasoid 88.52% 9.81 1.74
Russia Other 85.01% 9.2 1.33
Scotland 90.82% 10.1 2.2
Serbia 0.00% 0 ?
Serbia Caucasoid 0.00% 0 ?
Slovakia 18.28% 0.37 0.24
Slovakia Caucasoid 18.28% 0.37 0.24
Slovenia 84.85% 8.74 1.32
Slovenia Caucasoid 84.85% 8.74 1.32
Spain 80.51% 8.28 1.03
Spain Jew 0.00% 0 ?
Spain Other 6.30% 0.57 0.96
Sweden 88.07% 9.13 1.68
Switzerland 0.00% 0 ?
Switzerland Caucasoid 0.00% 0 ?
Turkey 76.19% 7.3 0.84
Ukraine 50.64% 4.17 1.42
Ukraine Caucasoid 50.64% 4.17 1.42
Wales 0.00% 0 0
East Africa 68.30% 5.65 0.63
Kenya 0.00% 0 0
Kenya Black 0.00% 0 0
Uganda 0.00% 0 0
Uganda Black 0.00% 0 0
Zambia 0.00% 0 ?
Zambia Black 0.00% 0 ?
Zimbabwe 68.30% 5.65 0.63
West Africa 65.23% 6.13 0.58
Burkina Faso 0.00% 0 ?
Burkina Faso Black 0.00% 0 ?
Cape Verde 80.38% 8.1 1.02
Gambia 0.00% 0 0
Gambia Black 0.00% 0 0
Ghana 0.00% 0 ?
Ghana Black 0.00% 0 ?
Ivory Coast 0.00% 0 ?
Ivory Coast Black 0.00% 0 ?
Liberia 0.00% 0 0
Liberia Black 0.00% 0 0
Nigeria 0.00% 0 0
Nigeria Black 0.00% 0 0
Senegal 30.28% 2.32 0.29
Senegal Black 30.28% 2.32 0.29
Cameroon 49.87% 3.31 0.4
Congo 68.66% 5.93 0.64
Congo Black 68.66% 5.93 0.64
Equatorial Guinea 47.58% 3.55 0.38
Equatorial Guinea Black 47.58% 3.55 0.38
Gabon 41.78% 3.84 1.2
Gabon Black 41.78% 3.84 1.2
Rwanda 62.79% 5.38 0.54
Rwanda Black 62.79% 5.38 0.54
North Africa 75.06% 7 0.8
Algeria 77.15% 7.25 0.88
Algeria Arab 77.15% 7.25 0.88
Ethiopia 83.00% 8.71 1.18
Ethiopia Black 83.00% 8.71 1.18
Mali 0.00% 0 ?
Mali Black 0.00% 0 ?
Morocco 83.44% 8.14 1.21
Morocco Arab 85.07% 8.25 1.34
Sudan 60.56% 4.52 0.51
Sudan Arab 0.00% 0 ?
Sudan Mixed 60.56% 4.52 0.51
Tunisia 74.26% 6.82 0.78
Tunisia Arab 74.97% 6.78 0.8
Tunisia Berber 74.47% 7.43 0.78
South Africa 32.10% 1.11 0.29
South Africa 32.10% 1.11 0.29
South Africa Other 0.00% 0 ?
West Indies 69.22% 6.67 0.65
Cuba 85.48% 9.66 1.38
Cuba Caucasoid 0.00% 0 ?
Cuba Mixed 85.48% 9.66 1.38
Cuba Mulatto 0.00% 0 ?
Jamaica 27.41% 2.28 0.28
Jamaica Black 27.41% 2.28 0.28
Martinique 74.51% 7.17 0.78
Trinidad and Tobago 0.00% 0 ?
Trinidad and Tobago Asian 0.00% 0 ?
North America 87.89% 9.12 1.65
Canada 38.41% 2.21 0.32
Canada Amerindian 38.41% 2.21 0.32
Mexico 55.04% 4.3 0.44
Mexico Amerindian 42.59% 3.09 0.35
United States 88.10% 9.17 1.68
United States Austronesian 58.09% 5.47 0.48
Costa Rica 24.31% 2.21 0.26
Costa Rica Mestizo 24.31% 2.21 0.26
Guatemala 49.16% 3.37 0.39
Argentina 62.67% 5.36 0.54
Argentina Caucasoid 80.65% 7.85 1.03
Bolivia 77.82% 5.97 0.9
Bolivia Amerindian 77.82% 5.97 0.9
Brazil 63.80% 5.16 0.55
Brazil Mixed 77.50% 6.94 0.89
Brazil Mulatto 74.09% 6.89 0.77
Brazil Other 0.00% 0 ?
Chile 67.08% 5.82 0.61
Chile Hispanic 0.00% 0 0
Chile Mixed 52.65% 4.39 0.42
Colombia 54.02% 4.34 0.43
Colombia Amerindian 47.40% 3.65 0.38
Ecuador 52.17% 3.75 1.25
Ecuador Black 0.00% 0 0
Paraguay 4.90% 0.29 0.63
Paraguay Amerindian 4.90% 0.29 0.63
Peru 49.87% 3.47 0.4
Venezuela 3.01% 0.06 0.21
Venezuela Amerindian 0.00% 0 0
Venezuela Caucasoid 0.00% 0 ?
Venezuela Mestizo 0.00% 0 ?
Venezuela Mixed 3.17% 0.06 0.21
Oceania 59.87% 5.38 0.5
American Samoa 0.00% 0 ?
American Samoa Polynesian 0.00% 0 ?
Australia 33.15% 2.21 0.3
Australia Caucasoid 0.00% 0 ?
Chile 67.08% 5.82 0.61
Cook Islands 78.59% 6.44 0.93
Cook Islands Polynesian 78.59% 6.44 0.93
Fiji 79.87% 7.5 0.99
Fiji Melanesian 79.87% 7.5 0.99
Kiribati 10.89% 0.85 0.22
Kiribati Micronesian 10.89% 0.85 0.22
Nauru 38.66% 3.4 0.33
Nauru Micronesian 38.66% 3.4 0.33
New Caledonia 81.41% 8.44 3.77
New Zealand 84.46% 6.76 1.29
New Zealand Polynesian 84.46% 6.76 1.29
Niue 77.82% 4.27 0.9
Niue Polynesian 77.82% 4.27 0.9
Samoa 80.86% 7.29 1.04
Samoa Polynesian 80.86% 7.29 1.04
Tokelau 55.11% 2.82 0.45
Tokelau Polynesian 55.11% 2.82 0.45
Tonga 71.91% 6.12 0.71
Tonga Polynesian 71.91% 6.12 0.71
Average 51.14% 4.7 ?
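a Projected population coverage
b Average number of epitope hits/HLA combinations recognized by the population
c Minimum number of epitope hits/HLA combinations recognized by 90% of the population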
represents 94.80%, South Korea and South Korea Oriental
(92.84%), China (88.77%), Iran and Iran Persian (91.53%), Jordan
and Jordan Arab (76.80%), Oman and Oman Arab (95.82%),
Saudi Arabia and Saudi Arabia Arab (96.38%), Sudan (86.43%),
Sudan Arab (49.41%), Sudan Black (0.00%), and Sudan Mixed
(87.06%); see Table 8. Iran Kurd, United Arab Emirates, and
United Arab Emirates Arab were not listed and showed no
results in this tool.
According to the MHC-I population coverage for the modified E
protein, which covered 95.60% of the world population, 112 countries
showed a high percentage, especially Chile Amerindian (100.00%),
whereas 96 other countries showed 0%. East Asia showed 94.80%,
South Korea and South Korea Oriental (92.84%), China (88.77%),
Iran (91.53%), Iran Persian (91.53%), Iran Kurd (0.00%), Jordan and
Jordan Arab (76.80%), Oman and Oman Arab (95.82%), Saudi Arabia
and Saudi Arabia Arab (96.38%), United Arab Emirates and United
Arab Emirates Arab (0.00%), Sudan (86.43%), Sudan Arab (49.41%),
Sudan Black (0.00%), and Sudan Mixed (87.06%); see Table 9.
According to the MHC-II population coverage for the E protein,
which covered 81.81% of the world population, 63 countries showed
a high percentage, especially Norway and Norway Caucasoid
(94.71%), whereas 45 other countries showed from 0% to less than
50%. East Asia showed 81.82%, South Korea and South Korea
Oriental (85.32%), China (59.99%), Iran (64.22%), Iran Persian
(65.72%), Iran Kurd (55.78%), Saudi Arabia and Saudi Arabia Arab
(80.14%), United Arab Emirates and United Arab Emirates Arab
(32.92%), and Sudan and Sudan Mixed (60.56%); see Table 10.
Oman, Jordan, Sudan Black, and Sudan Arab were not listed and
showed no results in this tool.
According to the MHC-II population coverage for the modified E
protein, which covered 81.81% of the world population, 62 countries
showed a high percentage, especially Norway and Norway Caucasoid
(94.71%), whereas 59 other countries showed 0%. East Asia showed
81.82%, South Korea and South Korea Oriental (85.32%), China
(59.99%), Iran (64.22%), Iran Persian (65.72%), Iran Kurd (55.78%),
Jordan and Jordan Arab (52.88%), Oman and Oman Arab (0.00%),
Saudi Arabia and Saudi Arabia Arab (80.14%), United Arab Emirates
and United Arab Emirates Arab (32.92%), Sudan and Sudan Mixed
(60.56%), and Sudan Arab and Sudan Black (0.00%); see Table 11.
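The tallies quoted in these paragraphs (populations with high coverage versus populations at 0%) can be recomputed from the table rows. The following is a minimal Python sketch, assuming the rows have already been reduced to (population, coverage %) pairs. The function name tally_coverage and the 50% cutoff are illustrative assumptions (the text only mentions "from 0% to less than 50%" as a low band); the sample rows are copied from Table 9.

```python
from typing import Dict, List, Tuple


def tally_coverage(
    rows: List[Tuple[str, float]], high_cutoff: float = 50.0
) -> Dict[str, List[str]]:
    """Group populations into zero, low, and high coverage bands.

    high_cutoff is an assumed threshold, not one stated by the tool.
    """
    bands: Dict[str, List[str]] = {"zero": [], "low": [], "high": []}
    for name, coverage in rows:
        if coverage == 0.0:
            bands["zero"].append(name)
        elif coverage < high_cutoff:
            bands["low"].append(name)
        else:
            bands["high"].append(name)
    return bands


# A few rows from Table 9 (MHC-I, modified E protein).
sample = [
    ("East Asia", 94.80),
    ("China", 88.77),
    ("Iran Kurd", 0.00),
    ("Sudan Arab", 49.41),
    ("Macedonia", 11.83),
    ("Finland", 99.80),
]

if __name__ == "__main__":
    for band, names in tally_coverage(sample).items():
        print(band, len(names), names)
```

Keeping the cutoff as a parameter makes it explicit that the split between "high" and "low" coverage is a reporting choice and not a property of the data.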
3.4 Homology Modeling
The results of homology modeling are not shown here because
they are not necessary.
Table 8
MHC-I population coverage for E protein
Class I
Population/area  Coverage (a)  Average hit (b)  PC90 (c)
World 95.60% 10.57 4.38
East Asia 94.80% 10.93 2.58
Japan 96.19% 11.44 3.12
Korea, South 92.84% 10.41 2.16
Mongolia 94.37% 10.07 3.12
China 88.77% 9.33 0.89
Hong Kong 90.85% 10.01 1.91
South Asia 86.54% 8.03 0.74
India 82.00% 7.21 0.56
India Asian 82.00% 7.21 0.56
Pakistan 88.63% 8.74 1.76
Sri Lanka 52.39% 3.74 0.84
Indonesia 76.44% 7.8 0.42
Malaysia 76.30% 7.64 0.42
Philippines 92.86% 11.56 8.01
Singapore 85.74% 9.04 0.7
Taiwan 92.58% 11.31 6.08
Thailand 82.85% 7.46 0.58
Vietnam 84.58% 8.55 0.65
Iran 91.53% 8.6 1.33
Iran Persian 91.53% 8.6 1.33
Israel 82.14% 7.29 0.56
Israel Arab 89.15% 9.13 0.92
Israel Jew 87.17% 7.84 0.78
Jordan 76.80% 6.52 0.43
Jordan Arab 76.80% 6.52 0.43
Oman 95.82% 9.96 3.04
Oman Arab 95.82% 9.96 3.04
Saudi Arabia 96.38% 9.87 3.65
Europe 97.81% 11.07 5.29
Austria 98.78% 11.29 6
Belgium 98.75% 10.62 6.02
Bulgaria 96.59% 11.08 4.52
Croatia 97.76% 11.79 6.12
England 99.29% 11.43 6.21
Finland 99.80% 12.56 7.8
France 98.05% 10.72 4.75
Georgia 95.62% 10.98 4.48
Germany 99.07% 11.71 6.4
Ireland South 98.83% 10.82 4.85
Italy 96.52% 9.83 4.16
Macedonia 11.83% 0.86 0.45
Poland 97.99% 11.25 6.02
Portugal 97.11% 10.98 4.73
Romania 97.94% 11.56 5.94
Russia 96.71% 11.38 4.59
Russia Other 98.34% 12.46 6.71
Scotland 15.91% 0.81 0.24
Serbia 43.75% 0.78 0.18
Spain 71.85% 5.51 0.36
Sweden 99.69% 12.61 6.84
Turkey 44.80% 3.58 1.45
East Africa 86.99% 6.96 0.77
Kenya 85.86% 6.62 0.71
Kenya Black 85.86% 6.62 0.71
Uganda 91.04% 8.19 1.48
Uganda Black 91.04% 8.19 1.48
Zambia 95.32% 7.98 4.01
Zambia Black 95.32% 7.98 4.01
Zimbabwe 91.57% 7.69 1.71
West Africa 92.60% 8.71 1.67
Burkina Faso 58.50% 3.24 0.24
Cape Verde 96.69% 10.09 4.14
Ivory Coast 58.05% 0.78 0.24
Senegal 95.03% 9.11 4
Cameroon 88.67% 7.35 0.88
Rwanda 23.09% 1.33 0.13
Rwanda Black 23.09% 1.33 0.13
North Africa 91.87% 8.61 1.86
Mali 94.28% 8.82 1.74
Mali Black 94.28% 8.82 1.74
Morocco 95.95% 9.47 4.19
Morocco Arab 97.89% 10.2 4.47
Sudan 86.43% 7.53 0.74
Sudan Arab 49.41% 4.62 0.59
Sudan Mixed 87.06% 7.56 0.77
Tunisia 96.04% 9.85 4.19
Tunisia Arab 96.04% 9.85 4.19
West Indies 97.34% 10.78 4.6
Cuba 97.20% 10.65 4.53
Cuba Mulatto 96.58% 9.66 4.09
Martinique 22.56% 2.03 1.16
North America 96.88% 10.98 4.65
Mexico 97.10% 11 6.02
United States 96.93% 10.98 4.66
Guatemala 5.10% 0.16 0.11
Argentina 98.02% 8.76 2.61
Brazil 93.72% 9.43 2.69
Brazil Mixed 95.06% 9.85 3.75
Chile 94.93% 10.63 4.37
Chile Mixed 87.43% 8.16 0.8
Colombia 9.86% 0.76 0.67
Ecuador 76.97% 8.77 1.74
Peru 99.98% 13.69 8.37
Venezuela 88.37% 9.05 0.86
Oceania 91.82% 10.92 4.06
Australia 89.30% 9.93 0.93
Chile 94.93% 10.63 4.37
New Caledonia 96.70% 12.14 8.63
Average 55.31% 5.73 ?
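a Projected population coverage
b Average number of epitope hits/HLA combinations recognized by the population
c Minimum number of epitope hits/HLA combinations recognized by 90% of the population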
3.5 Confirmation of Amino Acid Change in Spike Glycoprotein (S) and Envelope Protein (E) Sequence
The results of the confirmatory amino acid change are not shown
here because they are not necessary.
3.6 Peptide Search Tool
The results of the peptide search tool showed the presence of the
selected peptide sequences in other organisms, such as Leishmania
donovani, Drosophila sechellia (fruit fly), Leishmania infantum,
Table 9
MHC-I population coverage for modified E protein
Class I
Population/area  Coverage (a)  Average hit (b)  PC90 (c)
World 95.60% 10.57 4.38
East Asia 94.80% 10.93 2.58
Japan 96.19% 11.44 3.12
Korea, South 92.84% 10.41 2.16
Mongolia 94.37% 10.07 3.12
China 88.77% 9.33 0.89
Hong Kong 90.85% 10.01 1.91
South Asia 86.54% 8.03 0.74
India 82.00% 7.21 0.56
India Asian 82.00% 7.21 0.56
Pakistan 88.63% 8.74 1.76
Sri Lanka 52.39% 3.74 0.84
Borneo 0.00% 0 ?
Borneo Austronesian 0.00% 0 ?
Indonesia 76.44% 7.8 0.42
Malaysia 76.30% 7.64 0.42
Philippines 92.86% 11.56 8.01
Singapore 85.74% 9.04 0.7
Taiwan 92.58% 11.31 6.08
Thailand 82.85% 7.46 0.58
Vietnam 84.58% 8.55 0.65
Iran 91.53% 8.6 1.33
Iran Kurd 0.00% 0 ?
Iran Persian 91.53% 8.6 1.33
Israel 82.14% 7.29 0.56
Israel Arab 89.15% 9.13 0.92
Israel Jew 87.17% 7.84 0.78
Jordan 76.80% 6.52 0.43
Jordan Arab 76.80% 6.52 0.43
Lebanon 0.00% 0 0
Lebanon Arab 0.00% 0 ?
Lebanon Mixed 0.00% 0 0
Oman 95.82% 9.96 3.04
Oman Arab 95.82% 9.96 3.04
Saudi Arabia 96.38% 9.87 3.65
United Arab Emirates 0.00% 0 0
United Arab Emirates Arab 0.00% 0 0
Europe 97.81% 11.07 5.29
Austria 98.78% 11.29 6
Belarus 0.00% 0 ?
Belarus Caucasoid 0.00% 0 ?
Belgium 98.75% 10.62 6.02
Bulgaria 96.59% 11.08 4.52
Croatia 97.76% 11.79 6.12
Czech Republic Other 0.00% 0 ?
Denmark 0.00% 0 0
Denmark Caucasoid 0.00% 0 0
England 99.29% 11.43 6.21
England Jew 0.00% 0 0
England Mixed 0.00% 0 ?
Finland 99.80% 12.56 7.8
France 98.05% 10.72 4.75
Georgia 95.62% 10.98 4.48
Germany 99.07% 11.71 6.4
Greece 0.00% 0 ?
Greece Caucasoid 0.00% 0 ?
Ireland South 98.83% 10.82 4.85
Italy 96.52% 9.83 4.16
Macedonia 11.83% 0.86 0.45
Netherlands 0.00% 0 ?
Netherlands Caucasoid 0.00% 0 ?
Norway 0.00% 0 ?
Norway Caucasoid 0.00% 0 ?
Poland 97.99% 11.25 6.02
Portugal 97.11% 10.98 4.73
Romania 97.94% 11.56 5.94
Russia 96.71% 11.38 4.59
Russia Caucasoid 0.00% 0 0
Russia Other 98.34% 12.46 6.71
Scotland 15.91% 0.81 0.24
Serbia 43.75% 0.78 0.18
Slovakia 0.00% 0 ?
Slovakia Caucasoid 0.00% 0 ?
Slovenia 0.00% 0 ?
Slovenia Caucasoid 0.00% 0 ?
Spain 71.85% 5.51 0.36
Spain Jew 0.00% 0 ?
Spain Other 0.00% 0 ?
Sweden 99.69% 12.61 6.84
Switzerland 0.00% 0 0
Switzerland Caucasoid 0.00% 0 0
Turkey 44.80% 3.58 1.45
Ukraine 0.00% 0 ?
Ukraine Caucasoid 0.00% 0 ?
Wales 0.00% 0 0
East Africa 86.99% 6.96 0.77
Kenya 85.86% 6.62 0.71
Kenya Black 85.86% 6.62 0.71
Uganda 91.04% 8.19 1.48
Uganda Black 91.04% 8.19 1.48
Zambia 95.32% 7.98 4.01
Zambia Black 95.32% 7.98 4.01
Zimbabwe 91.57% 7.69 1.71
West Africa 92.60% 8.71 1.67
Burkina Faso 58.50% 3.24 0.24
Cape Verde 96.69% 10.09 4.14
Gambia 0.00% 0 ?
Gambia Black 0.00% 0 ?
Ghana 0.00% 0 0
Ghana Black 0.00% 0 0
Ivory Coast 58.05% 0.78 0.24
Liberia 0.00% 0 ?
Liberia Black 0.00% 0 ?
Nigeria 0.00% 0 ?
Nigeria Black 0.00% 0 ?
Senegal 95.03% 9.11 4
Cameroon 88.67% 7.35 0.88
Congo 0.00% 0 ?
Congo Black 0.00% 0 ?
Equatorial Guinea 0.00% 0 0
Equatorial Guinea Black 0.00% 0 0
Gabon 0.00% 0 ?
Gabon Black 0.00% 0 ?
Rwanda 23.09% 1.33 0.13
Rwanda Black 23.09% 1.33 0.13
North Africa 91.87% 8.61 1.86
Algeria 0.00% 0 ?
Algeria Arab 0.00% 0 ?
Ethiopia 0.00% 0 ?
Ethiopia Black 0.00% 0 ?
Mali 94.28% 8.82 1.74
Mali Black 94.28% 8.82 1.74
Morocco 95.95% 9.47 4.19
Morocco Arab 97.89% 10.2 4.47
Sudan 86.43% 7.53 0.74
Sudan Arab 49.41% 4.62 0.59
Sudan Mixed 87.06% 7.56 0.77
Tunisia 96.04% 9.85 4.19
Tunisia Arab 96.04% 9.85 4.19
Tunisia Berber 0.00% 0 ?
West Indies 97.34% 10.78 4.6
Cuba 97.20% 10.65 4.53
Cuba Mixed 0.00% 0 ?
Cuba Mulatto 96.58% 9.66 4.09
Jamaica 0.00% 0 ?
Jamaica Black 0.00% 0 ?
Martinique 22.56% 2.03 1.16
Trinidad and Tobago 0.00% 0 0
Trinidad and Tobago Asian 0.00% 0 0
North America 96.88% 10.98 4.65
Canada 0.00% 0 ?
Canada Amerindian 0.00% 0 ?
Mexico 97.10% 11 6.02
United States 96.93% 10.98 4.66
United States Austronesian 0.00% 0 ?
Costa Rica 0.00% 0 ?
Costa Rica Mestizo 0.00% 0 ?
Guatemala 5.10% 0.16 0.11
Argentina 98.02% 8.76 2.61
Argentina Caucasoid 0.00% 0 ?
Bolivia 0.00% 0 ?
Bolivia Amerindian 0.00% 0 ?
Brazil 93.72% 9.43 2.69
Brazil Mixed 95.06% 9.85 3.75
Brazil Mulatto 0.00% 0 ?
Brazil Other 0.00% 0 0
Chile 94.93% 10.63 4.37
Chile Hispanic 0.00% 0 ?
Chile Mixed 87.43% 8.16 0.8
Colombia 9.86% 0.76 0.67
Colombia Amerindian 0.00% 0 0
Ecuador 76.97% 8.77 1.74
Ecuador Black 0.00% 0 ?
Paraguay 0.00% 0 ?
Paraguay Amerindian 0.00% 0 ?
Peru 99.98% 13.69 8.37
Venezuela 88.37% 9.05 0.86
Venezuela Mixed 0.00% 0 ?
Oceania 91.82% 10.92 4.06
Australia 89.30% 9.93 0.93
Chile 94.93% 10.63 4.37
Cook Islands 0.00% 0 ?
Cook Islands Polynesian 0.00% 0 ?
Fiji 0.00% 0 ?
Fiji Melanesian 0.00% 0 ?
Kiribati 0.00% 0 ?
Kiribati Micronesian 0.00% 0 ?
Nauru 0.00% 0 ?
Nauru Micronesian 0.00% 0 ?
New Caledonia 96.70% 12.14 8.63
New Zealand 0.00% 0 ?
New Zealand Polynesian 0.00% 0 ?
Niue 0.00% 0 ?
Niue Polynesian 0.00% 0 ?
Samoa 0.00% 0 ?
Samoa Polynesian 0.00% 0 ?
Tokelau 0.00% 0 ?
Tokelau Polynesian 0.00% 0 ?
Tonga 0.00% 0 ?
Tonga Polynesian 0.00% 0 ?
Average 55.31% 5.73 ?
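a Projected population coverage
b Average number of epitope hits/HLA combinations recognized by the population
c Minimum number of epitope hits/HLA combinations recognized by 90% of the population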
Table 10
MHC-II population coverage for E protein
Class II
Population/area  Coverage (a)  Average hit (b)  PC90 (c)
World 81.81% 8.16 1.1
East Asia 81.82% 8.83 1.1
Japan 74.83% 7.85 0.79
Korea, South 85.32% 9.56 1.36
Mongolia 81.85% 7.79 1.1
China 59.99% 5.33 0.5
South Asia 75.38% 7.4 0.81
India 74.99% 7.35 0.8
India Asian 74.99% 7.35 0.8
Pakistan 1.18% 0.09 0.81
Borneo 49.02% 4.03 0.39
Indonesia 47.84% 4.4 0.38
Malaysia 57.99% 5.34 0.48
Philippines 28.56% 2.52 0.28
Singapore 65.78% 6.04 0.58
Taiwan 67.88% 6.13 0.62
Thailand 63.90% 5.92 0.55
Vietnam 54.44% 4.43 0.44
Iran 64.22% 5.65 0.56
Iran Kurd 55.78% 4.74 0.45
Iran Persian 65.72% 5.83 0.58
Israel 68.79% 6.4 0.64
Israel Arab 67.51% 6.2 0.62
Israel Jew 69.65% 6.51 0.66
Jordan 52.88% 4.56 0.42
Jordan Arab 52.88% 4.56 0.42
Lebanon 70.46% 6.48 0.68
Lebanon Arab 70.46% 6.48 0.68
Saudi Arabia 80.14% 8.31 1.01
Europe 85.83% 8.88 1.41
Austria 93.34% 10.8 2.82
Belarus 43.81% 3.55 1.25
Belgium 79.39% 7.16 0.97
Bulgaria 57.23% 4.95 0.47
Croatia 66.71% 5.89 0.6
Denmark 88.98% 9.04 1.81
England 93.48% 10.49 2.74
Finland 51.14% 4.24 0.41
France 88.54% 9.29 1.74
Georgia 75.05% 7.09 0.8
Germany 91.14% 10.14 2.26
Greece 66.92% 6.29 0.6
Italy 85.90% 5.93 1.42
Macedonia 66.53% 6.2 0.6
Netherlands 83.44% 8.33 1.21
Norway 94.71% 10.56 3.01
Poland 84.46% 8.85 1.29
Portugal 78.00% 7.74 0.91
Russia 77.62% 7.24 0.89
Russia Other 85.01% 9.2 1.33
Scotland 90.82% 10.1 2.2
Slovakia 18.28% 0.37 0.24
Slovenia 84.85% 8.74 1.32
Spain 80.51% 8.28 1.03
Spain Other 6.30% 0.57 0.96
Sweden 88.07% 9.13 1.68
Turkey 76.19% 7.3 0.84
Ukraine 50.64% 4.17 1.42
East Africa 68.30% 5.65 0.63
Zimbabwe 68.30% 5.65 0.63
West Africa 65.23% 6.13 0.58
Cape Verde 80.38% 8.1 1.02
Senegal 30.28% 2.32 0.29
Cameroon 49.87% 3.31 0.4
Congo 68.66% 5.93 0.64
Congo Black 68.66% 5.93 0.64
Gabon 41.78% 3.84 1.2
Gabon Black 41.78% 3.84 1.2
Rwanda 62.79% 5.38 0.54
Rwanda Black 62.79% 5.38 0.54
Algeria 77.15% 7.25 0.88
Algeria Arab 77.15% 7.25 0.88
Ethiopia 83.00% 8.71 1.18
Morocco 83.44% 8.14 1.21
Morocco Arab 85.07% 8.25 1.34
Sudan 60.56% 4.52 0.51
Sudan Mixed 60.56% 4.52 0.51
Tunisia 74.26% 6.82 0.78
Tunisia Arab 74.97% 6.78 0.8
South Africa 32.10% 1.11 0.29
South Africa 32.10% 1.11 0.29
West Indies 69.22% 6.67 0.65
Cuba 85.48% 9.66 1.38
Cuba Mixed 85.48% 9.66 1.38
Jamaica 27.41% 2.28 0.28
Martinique 74.51% 7.17 0.78
Canada 38.41% 2.21 0.32
Mexico 55.04% 4.3 0.44
Costa Rica 24.31% 2.21 0.26
Guatemala 49.16% 3.37 0.39
Argentina 62.67% 5.36 0.54
Bolivia 77.82% 5.97 0.9
Brazil 63.80% 5.16 0.55
Brazil Mixed 77.50% 6.94 0.89
Chile 67.08% 5.82 0.61
Chile Mixed 52.65% 4.39 0.42
Colombia 54.02% 4.34 0.43
Ecuador 52.17% 3.75 1.25
Paraguay 4.90% 0.29 0.63
Peru 49.87% 3.47 0.4
Venezuela 3.01% 0.06 0.21
Oceania 59.87% 5.38 0.5
Australia 33.15% 2.21 0.3
Chile 67.08% 5.82 0.61
Cook Islands 78.59% 6.44 0.93
Fiji 79.87% 7.5 0.99
Kiribati 10.89% 0.85 0.22
Nauru 38.66% 3.4 0.33
New Zealand 84.46% 6.76 1.29
Niue 77.82% 4.27 0.9
Samoa 80.86% 7.29 1.04
Tokelau 55.11% 2.82 0.45
Tonga 71.91% 6.12 0.71
Average 51.14% 4.7 ?
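a Projected population coverage
b Average number of epitope hits/HLA combinations recognized by the population
c Minimum number of epitope hits/HLA combinations recognized by 90% of the population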
Table 11
MHC-II population coverage for modified E protein
Class II
Population/area  Coverage (a)  Average hit (b)  PC90 (c)
World 81.81% 8.16 1.1
East Asia 81.82% 8.83 1.1
Japan 74.83% 7.85 0.79
Korea, South 85.32% 9.56 1.36
Mongolia 81.85% 7.79 1.1
China 59.99% 5.33 0.5
Hong Kong 0.00% 0 ?
Hong Kong Oriental 0.00% 0 ?
South Asia 75.38% 7.4 0.81
India 74.99% 7.35 0.8
India Asian 74.99% 7.35 0.8
Pakistan 1.18% 0.09 0.81
Pakistan Mixed 0.00% 0 0
Sri Lanka 0.00% 0 ?
Sri Lanka Asian 0.00% 0 ?
Borneo 49.02% 4.03 0.39
Indonesia 47.84% 4.4 0.38
Malaysia 57.99% 5.34 0.48
Philippines 28.56% 2.52 0.28
Singapore 65.78% 6.04 0.58
Taiwan 67.88% 6.13 0.62
Thailand 63.90% 5.92 0.55
Vietnam 54.44% 4.43 0.44
Iran 64.22% 5.65 0.56
Iran Kurd 55.78% 4.74 0.45
Iran Persian 65.72% 5.83 0.58
Israel 68.79% 6.4 0.64
Israel Arab 67.51% 6.2 0.62
Israel Jew 69.65% 6.51 0.66
Jordan 52.88% 4.56 0.42
Jordan Arab 52.88% 4.56 0.42
Lebanon 70.46% 6.48 0.68
Lebanon Arab 70.46% 6.48 0.68
Lebanon Mixed 0.00% 0 ?
Oman 0.00% 0 ?
Oman Arab 0.00% 0 ?
Saudi Arabia 80.14% 8.31 1.01
Europe 85.83% 8.88 1.41
Austria 93.34% 10.8 2.82
Belarus 43.81% 3.55 1.25
Belgium 79.39% 7.16 0.97
Bulgaria 57.23% 4.95 0.47
Bulgaria Other 0.00% 0 ?
Croatia 66.71% 5.89 0.6
Denmark 88.98% 9.04 1.81
England 93.48% 10.49 2.74
England Jew 0.00% 0 ?
England Mixed 0.00% 0 0
Finland 51.14% 4.24 0.41
France 88.54% 9.29 1.74
Georgia 75.05% 7.09 0.8
Georgia Kurd 0.00% 0 ?
Germany 91.14% 10.14 2.26
Greece 66.92% 6.29 0.6
Italy 85.90% 5.93 1.42
Macedonia 66.53% 6.2 0.6
Netherlands 83.44% 8.33 1.21
Norway 94.71% 10.56 3.01
Poland 84.46% 8.85 1.29
Portugal 78.00% 7.74 0.91
Romania 0.00% 0 ?
Romania Caucasoid 0.00% 0 ?
Russia 77.62% 7.24 0.89
Russia Other 85.01% 9.2 1.33
Scotland 90.82% 10.1 2.2
Serbia 0.00% 0 ?
Serbia Caucasoid 0.00% 0 ?
Slovakia 18.28% 0.37 0.24
Slovenia 84.85% 8.74 1.32
Spain 80.51% 8.28 1.03
Spain Jew 0.00% 0 ?
Spain Other 6.30% 0.57 0.96
Sweden 88.07% 9.13 1.68
Switzerland 0.00% 0 ?
Switzerland Caucasoid 0.00% 0 ?
Turkey 76.19% 7.3 0.84
Ukraine 50.64% 4.17 1.42
Wales 0.00% 0 0
East Africa 68.30% 5.65 0.63
Kenya 0.00% 0 0
Kenya Black 0.00% 0 0
Uganda 0.00% 0 0
Uganda Black 0.00% 0 0
Zambia 0.00% 0 ?
Zambia Black 0.00% 0 ?
Zimbabwe 68.30% 5.65 0.63
West Africa 65.23% 6.13 0.58
Burkina Faso 0.00% 0 ?
Burkina Faso Black 0.00% 0 ?
Cape Verde 80.38% 8.1 1.02
Gambia 0.00% 0 0
Gambia Black 0.00% 0 0
Ghana 0.00% 0 ?
Ghana Black 0.00% 0 ?
Ivory Coast 0.00% 0 ?
Ivory Coast Black 0.00% 0 ?
Liberia 0.00% 0 0
Liberia Black 0.00% 0 0
Nigeria 0.00% 0 0
Nigeria Black 0.00% 0 0
Senegal 30.28% 2.32 0.29
Cameroon 49.87% 3.31 0.4
Congo 68.66% 5.93 0.64
Congo Black 68.66% 5.93 0.64
Gabon 41.78% 3.84 1.2
Gabon Black 41.78% 3.84 1.2
Rwanda 62.79% 5.38 0.54
Rwanda Black 62.79% 5.38 0.54
Algeria 77.15% 7.25 0.88
Algeria Arab 77.15% 7.25 0.88
Ethiopia 83.00% 8.71 1.18
Mali 0.00% 0 ?
Mali Black 0.00% 0 ?
Morocco 83.44% 8.14 1.21
Morocco Arab 85.07% 8.25 1.34
Sudan 60.56% 4.52 0.51
Sudan Arab 0.00% 0 ?
Sudan Mixed 60.56% 4.52 0.51
Tunisia 74.26% 6.82 0.78
Tunisia Arab 74.97% 6.78 0.8
South Africa 32.10% 1.11 0.29
South Africa 32.10% 1.11 0.29
South Africa Other 0.00% 0 ?
West Indies 69.22% 6.67 0.65
Cuba 85.48% 9.66 1.38
Cuba Caucasoid 0.00% 0 ?
Cuba Mixed 85.48% 9.66 1.38
Cuba Mulatto 0.00% 0 ?
Jamaica 27.41% 2.28 0.28
Martinique 74.51% 7.17 0.78
Trinidad and Tobago 0.00% 0 ?
Trinidad and Tobago Asian 0.00% 0 ?
Canada 38.41% 2.21 0.32
Mexico 55.04% 4.3 0.44
Costa Rica 24.31% 2.21 0.26
Guatemala 49.16% 3.37 0.39
Argentina 62.67% 5.36 0.54
Bolivia 77.82% 5.97 0.9
Brazil 63.80% 5.16 0.55
Brazil Mixed 77.50% 6.94 0.89
Brazil Other 0.00% 0 ?
Chile 67.08% 5.82 0.61
Chile Hispanic 0.00% 0 0
Chile Mixed 52.65% 4.39 0.42
Colombia 54.02% 4.34 0.43
Ecuador 52.17% 3.75 1.25
Ecuador Black 0.00% 0 0
Paraguay 4.90% 0.29 0.63
Peru 49.87% 3.47 0.4
Venezuela 3.01% 0.06 0.21
Venezuela Amerindian 0.00% 0 0
Venezuela Caucasoid 0.00% 0 ?
Venezuela Mestizo 0.00% 0 ?
Oceania 59.87% 5.38 0.5
American Samoa 0.00% 0 ?
American Samoa Polynesian 0.00% 0 ?
Australia 33.15% 2.21 0.3
Australia Caucasoid 0.00% 0 ?
Chile 67.08% 5.82 0.61
Cook Islands 78.59% 6.44 0.93
Fiji 79.87% 7.5 0.99
Kiribati 10.89% 0.85 0.22
Nauru 38.66% 3.4 0.33
New Zealand 84.46% 6.76 1.29
Niue 77.82% 4.27 0.9
Samoa 80.86% 7.29 1.04
Tokelau 55.11% 2.82 0.45
Tonga 71.91% 6.12 0.71
Average 51.14% 4.7 ?
Trypanosoma cruzi Dm28c, Strigamia maritima, and Nocardioides dokdonensis, besides some species of Mycobacteria, Salmonella, and Streptococcus. The presence of these peptides in those organisms may indicate a relationship with respiratory disease, but this suggestion needs further investigation before it can be confirmed. In addition, the desired peptides can easily be synthesized in the laboratory by using one of these organisms (cloning techniques), because this is easy and carries no risk of acquiring a very dangerous infection. The impact of the peptide sequences on the immune system can then be determined by injecting laboratory animals with the selected peptide sequences from any of these organisms.
3.7 AllerHunter: Cross-Reactive Allergen Prediction Program
Any sequence can be considered a cross-reactive allergen if its probability is ≥0.06. The results indicated that the envelope (E) protein, spike (S) glycoprotein, and modified S glycoprotein are potential non-allergens with scores of 0.01, 0.0, and 0.0, respectively, while the modified E protein sequence was too short for prediction (AllerHunter predicted the query sequence as a potential allergen with a score of 0.07). According to the FAO/WHO evaluation scheme for cross-reactive allergen prediction, the E and modified E protein sequences are classified as non-allergens because they do not meet its criteria, whereas the S and modified S glycoproteins are classified as potential allergens because the query sequence matches at least one sequence in the AllerHunter data set with at least 35 percent identity over 80 amino acids.
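The FAO/WHO identity criterion can be illustrated with a minimal sketch. It assumes an ungapped, position-by-position comparison of 80-residue windows, which is a simplification: AllerHunter and similar tools rely on proper sequence alignment, and the function names below are hypothetical.

```python
def count_matches(a: str, b: str) -> int:
    """Number of positions at which two equal-length windows carry the same residue."""
    return sum(1 for x, y in zip(a, b) if x == y)


def meets_identity_criterion(query: str, allergen: str,
                             window: int = 80, min_percent: int = 35) -> bool:
    """Return True if any ungapped 80-residue window of the query shares at
    least 35% identical positions with any window of the known allergen.

    Assumption: ungapped comparison only; real tools use gapped alignment.
    """
    # A sequence shorter than the window cannot be evaluated by this rule.
    if len(query) < window or len(allergen) < window:
        return False
    for i in range(len(query) - window + 1):
        q = query[i:i + window]
        for j in range(len(allergen) - window + 1):
            matches = count_matches(q, allergen[j:j + window])
            # Integer arithmetic avoids floating-point comparison at the cutoff.
            if matches * 100 >= min_percent * window:
                return True
    return False
```

A sequence shorter than the window returns False here, which mirrors the situation in which a short query such as the modified E protein cannot be assessed by the 80-residue rule.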
3.8 AlgPred: Prediction of Allergenic Proteins and Mapping of IgE Epitopes
AlgPred classified all four sequences (S, E, modified S, and modified E proteins) as non-allergens, as follows:
1. Prediction by mapping of IgE epitope: The protein sequence does not contain experimentally proven IgE epitope.
2. MAST RESULT: No Hits found; NON ALLERGEN.
3. BLAST results of ARPS: No hits found, NON-ALLERGEN.
4. Prediction by hybrid approach: NON-ALLERGEN/ALLERGEN.
There were slight differences between the four sequences in the SVM prediction methods based on amino acid composition and dipeptide composition, as shown in Tables 12 and 13.
3.9 VaxiJen v2.0
The VaxiJen server classified three of the four protein sequences as probable antigens, as illustrated below:
S glycoprotein: threshold for this model, 0.4; overall antigen prediction, 0.4827 (probable ANTIGEN).
Modified S glycoprotein: threshold for this model, 0.4; overall antigen prediction, 0.4907 (probable ANTIGEN).
Table 12
SVM prediction methods based on amino acid composition for the four protein sequences
Types of protein sequence | SVM prediction based on amino acid composition | Score | Threshold | Positive predictive value | Negative predictive value
S glycoprotein | Allergen | 0.014762929 | 0.4 | 70.05% | 80.74%
Modified S glycoprotein | Allergen | 0.0065929692 | 0.4 | 70.05% | 80.74%
E protein | Allergen | 0.3638541 | 0.4 | 47.13% | 89.71%
Modified E protein | Non-allergen | 1.08932 | 0.4 | 15.19% | 94.18%
Table 13
SVM prediction methods based on dipeptide composition for the four protein sequences
Types of protein sequence | SVM prediction based on dipeptide composition | Score | Threshold | Positive predictive value | Negative predictive value
S glycoprotein | Allergen | 0.04096577 | 0.2 | 63.1% | 85.56%
Modified S glycoprotein | Allergen | 0.059498832 | 0.2 | 63.1% | 85.56%
E protein | Non-allergen | 0.7511982 | 0.2 | 13.26% | 74.19%
Modified E protein | Non-allergen | 0.65278098 | 0.2 | 13.26% | 74.19%
E protein: threshold for this model, 0.4; overall antigen prediction, 0.3811 (probable NON-ANTIGEN).
Modified E protein: threshold for this model, 0.4; overall antigen prediction, 0.4417 (probable ANTIGEN).
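The threshold rule behind these calls can be written out as a short sketch. The scores and the 0.4 threshold are those reported above; treating a score exactly equal to the threshold as an antigen is an assumption, and the function name is hypothetical.

```python
THRESHOLD = 0.4  # threshold reported for this VaxiJen model

# Overall antigen prediction scores reported in this section.
scores = {
    "S glycoprotein": 0.4827,
    "Modified S glycoprotein": 0.4907,
    "E protein": 0.3811,
    "Modified E protein": 0.4417,
}


def classify(score: float, threshold: float = THRESHOLD) -> str:
    """Label a score; counting score == threshold as antigen is an assumption."""
    return "probable ANTIGEN" if score >= threshold else "probable NON-ANTIGEN"


calls = {name: classify(score) for name, score in scores.items()}
n_antigens = sum(1 for call in calls.values() if call == "probable ANTIGEN")

for name, call in calls.items():
    print(f"{name}: {scores[name]} ({call})")
```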
4 Discussions
Today, there are many different approaches to developing a MERS-CoV vaccine. Some have partially succeeded and others have failed, while the remainder have neither succeeded nor failed, because they depend on software programs for different reasons and still need to go through vaccine development protocols. In those studies, the S1 protein subunit, especially the RBD (the most mutable region, containing the mutation sites that define antibody escape variants), was considered the basis for several MERS-CoV vaccine candidates. For example, RBD combined with aluminum salt or oil-in-water adjuvants can elicit neutralizing antibodies of high potency across multiple viral strains (Modjarrad [4]), and Wang et al. [6] reported that full-length S DNA and a truncated S1 subunit glycoprotein can elicit a higher titer of neutralizing antibodies; this kind of immunization protected non-human primates (NHPs) from severe lung disease after intratracheal challenge with MERS-CoV.
In another study, done in Iran, Poorinmohammad and Mohabatkar [15] selected the computational prediction tools NetCTL 1.2 (Larsen et al., 2007), EpiJen (Doytchinova et al., 2006), and nHLAPred (Bhasin and Raghava, 2007), together with the PEPstr server for modeling (Kaur et al., 2007), to identify cytotoxic T-lymphocyte epitopes presented by the human leukocyte antigen HLA-A*0201, the most frequent HLA class I allele among Middle Eastern populations, within the RBD selected for their study. They reported LLSGTPPQV, ILDYFSYPL, ILATVPHNL, NLTTITKPL, LQMGFGITV, and FSNPTCLIL as selected epitopes, but only LLSGTPPQV and FSNPTCLIL were considered real epitopes, because peptides with binding orientations closer to the native structure and lower binding free energy scores are ranked higher in their potential to be real epitopes.
In contrast, Shi et al. [19], using the Immune Epitope Database, reported that the nucleocapsid (N) protein of MERS-CoV might be a better protective immunogen, with high conservancy and the potential to elicit both neutralizing antibodies and T-cell responses, when compared with the spike (S) protein. In addition, 71 peptides were identified as helper T-cell epitopes and 34 peptides as CTL epitopes. Only the top 10 helper T-cell epitopes and CTL epitopes, selected on the basis of the maximum number of HLA binding alleles and able to elicit protective cellular immune responses against MERS-CoV, were considered MERS vaccine candidates, and they cover 15 geographic regions [19].
This study consists of two parts, the reference and the modified sequences of both the S glycoprotein and the E protein. I found that the most common B-cell epitope that passed all B-cell prediction methods [IEDB prediction tool] for the E protein is YVKFQDS at position 69, and for the modified E protein they are the VYVPQQD, YVPQQDS, and PPLPED/PPLPEDV epitopes at positions 68, 69, and 77, respectively. For S and modified S, they are DVGPDSV, PDSVKSA, DSVKSAC, PRPIDVS, HTPATDC, AKPSGSV, KPSGSVV, SGTPPQV, GTPPQVY, TPPQVYN, QLSPLEG, YGPLQTP, PRSVRSV, RSVRSVP, SVKSSQS, VKSSQSS, SQSSPII, and SLNTKYV at positions 23, 26, 27, 48, 211, 371, 372, 393, 394, 395, 547, 707, 750, 751, 856, 859 (857 in modified S glycoprotein), and 1202, respectively. The QVDQLNS and VDQLNSS epitopes at positions 772 and 773 are found only in the S glycoprotein, while the LTPTSSY, TPTSSYV, PTSSYVD, TSSYVDV, DHGDYYV, YSQDVKQ, ANQYSPC, NQYSPCV, and YYRKQLS epitopes at positions 15, 16, 17, 18, 83, 108, 523, 524, and 543 are found only in the modified S glycoprotein. The results for the S and modified S glycoproteins partially agree with the study done at the Africa City of Technology, Khartoum, Sudan, by Badawi et al. [16], namely in the epitopes GTPPQVY at positions 391–397 and LTPRSVRSVP at positions 745–754; the differences may be due to the different numbers of selected MERS-CoV protein sequences.
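The comparison between the reference and modified S glycoprotein epitope lists amounts to simple set operations, sketched below. The epitopes are copied from the paragraph above, but only a subset of the shared epitopes is included for brevity, and the helper name is hypothetical.

```python
# A subset of the epitopes reported as common to S and modified S (illustrative).
shared_subset = {"DVGPDSV", "PDSVKSA", "GTPPQVY", "SLNTKYV"}

# Epitopes reported only for the reference S glycoprotein.
reference_s = shared_subset | {"QVDQLNS", "VDQLNSS"}

# Epitopes reported only for the modified S glycoprotein.
modified_s = shared_subset | {
    "LTPTSSY", "TPTSSYV", "PTSSYVD", "TSSYVDV", "DHGDYYV",
    "YSQDVKQ", "ANQYSPC", "NQYSPCV", "YYRKQLS",
}


def compare_epitopes(reference: set, modified: set) -> dict:
    """Split two epitope sets into shared, reference-only, and modified-only."""
    return {
        "shared": reference & modified,
        "reference_only": reference - modified,
        "modified_only": modified - reference,
    }


result = compare_epitopes(reference_s, modified_s)
print(sorted(result["reference_only"]))
print(len(result["modified_only"]), "epitopes appear only in the modified sequence")
```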
In the prediction of cytotoxic T-lymphocyte epitopes and their interaction with MHC class I, the ILDYFSYPL epitope was found in my study as well as in the studies of Badawi et al. [16] and Poorinmohammad and Mohabatkar [15]. There was partial similarity with the Iranian study [15] in the LLSGTPPQV, ILATVPHNL, LQMGFGITV, and FSNPTCLIL epitopes, except for the NLTTITKPL epitope, which was absent from the S and modified S sequences in my study; FSNPTCLIL represents the only epitope that is found in my study in the S and modified S sequences. FSFGVTQEY has a high affinity to bind to many alleles, a finding that agrees with Badawi et al. [16], as does ITYQGLFPY in the S glycoprotein sequence of my study. However, the number of selected epitopes that reacted with MHC-I was higher in my study than in Badawi et al. [16]. In the E protein, the FIFTVVCAI epitope has a higher allele affinity, followed by ITLLVCMAF, IVNFFIFTV, and LVQPALYLY. In contrast, in the modified E protein the LVQPALSLY epitope showed a high affinity, followed by LYMTGRSVY, WFIPNFFDF, YMTGRSVYV, ITLLVCTAF, FVQERIGWF, FLTATHLCV, and CMTGFNTLL; the last epitope is common to the E and modified E protein sequences.
The prediction of T-helper cell epitopes and their interactions with MHC class II showed the FNLTLLEPVSISTGS epitope, which was considered the most suitable epitope, with a high affinity to 26 alleles, in Badawi et al. [16]. This epitope was also found in the S and modified S sequences in my study, but here it could not be considered the most suitable epitope with a high binding affinity to different alleles, as it was in the study of Badawi et al. [16]. There are no research results related to epitope vaccines for the E protein or the modified E and S glycoproteins, apart from the partial similarity that I found between the S and modified S glycoproteins.
No previous study has illustrated allergic reactions to the S glycoprotein and E protein; the only such study, by Shi et al. [19], addressed the N protein. In this study, the S and E proteins showed no allergic reaction according to the AllerHunter server. Furthermore, Shi et al. [19] reported that, for the N protein, the analysis of the surface accessibility of the predicted peptides showed a maximum surface probability value of 6.971 at amino acid positions 363 to 368 (363KKEKKQ368) and a minimum value of 0.074 for the peptide 205GIGAVG210. In the analysis of the flexibility of the predicted peptides, the maximum flexibility value was 1.160 at amino acid position from 170 to 176 (167GNSQSSS173), with a minimum value of 0.903 for the peptide 97RWYFYYT103. In MHC-II, the epitope 329LRYSGAIKL337, interacting with 357 HLA-DR alleles, was the epitope with the maximum number of binding HLA-DR alleles, while 230VKQSQPKVI238, interacting with 94 HLA-DR alleles, had the minimum number. The same pattern occurred with MHC-I: KQLAPRWYF100 had the highest number of binding HLA-A alleles, followed by 343NYNKWLELL351, 72AQNAGYWRR80, and 387RVQGSITQR395 (see [19] for population coverage).
In addition, Sharmin and Islam [20] showed that WDYPKCDRA is a highly conserved epitope in the RNA-directed RNA polymerase of human coronaviruses. They applied a multiple sequence alignment (MSA) approach to the spike (S), membrane (M), envelope (E), and nucleocapsid (N) proteins and replicase polyprotein 1ab to identify which one is highly conserved in all coronavirus strains, and then used various in silico tools to predict consensus immunogenic and conserved peptides.
Further information not shown here is that I used the software below to confirm the MHC-II results; their results only partially agree with the IEDB MHC-I results, and I do not know why: EpiDOCK, a molecular docking-based tool for MHC class II binding prediction (http://epidock.ddg-pharmfac.net/), and EpiTOP1.0 (http://www.pharmfac.net/EpiTOP/index.php). I also do not agree with Shi et al. [19], who aligned S, E, M, and other proteins with all human coronaviruses and reported that the most common peptide was found in the N protein alone. When I aligned S, M, ORFA1, and other proteins, I found some alignments between those proteins and different coronavirus strains; this may indicate the presence of some common peptides, but it still needs more study.
4.1 Conclusions
As mentioned before, software-based vaccine and drug design has become very important in both first and third world countries to avoid wasting resources, time, and effort. For MERS-CoV, it is important to design an effective vaccine that protects not only against MERS-CoV but also against newly emerging strains and the other human coronaviruses, especially as MERS-CoV vaccines have not yet passed all vaccine design protocols.
In this study I found the following points:
The emergence of new strains may cause only minor changes in the peptide sequences of a vaccine, especially when the selected viral parts are neither longer nor shorter in length.
In B-cell prediction, mutations can lead to an increased number of selected epitopes with very few sequence changes, in addition to a large number of shared epitopes between the reference and modified sequences. This means that the mutated sequence has the ability to elicit the same immune response (IR), that is, a response to the virus by the same antibodies as in the first infection.
Mutations of the virus sequence can change the frequencies of alleles and the numbers of peptides, either by increasing or decreasing these numbers, besides the presence or absence of some new or old alleles or peptides; the same alleles can have different peptide sequences and vice versa.
For MHC-II, no changes were noticed in the alleles of the E and modified E proteins or their frequencies, nor in the peptide sequences or their frequencies; this may be due to the short E protein sequence. For the S and modified S glycoproteins there are minor differences in some peptide frequencies, which increased or decreased by only one or two, and the same holds for the alleles.
There is an allele similarity between the E, S, modified E, and modified S proteins in MHC-II, besides a tiny difference between the S and modified S peptide sequences in MHC-II due to the modification that I introduced into the S reference sequence.
The absence of a very small number of S reference peptide sequences from the modified S sequence leads to the presence of new peptide sequences.
In MHC-I, many selected peptide sequences that are present in the S glycoprotein reference sequence are missing from the modified one; the opposite holds for the E protein, where the modified sequence contains additional epitopes.
The presence of arginine in some selected vaccine peptide sequences makes them ineffective, so we need to solve this problem either by replacing it with another amino acid from the same group or by finding other ways to make those epitopes visible to the immune system (IS).
The presence of a mutated sequence can affect the population coverage in MHC-II through the presence or absence of some countries and through changes in percentages; in contrast, no changes were noticed for MHC-I.
Acknowledgments
The author would like to thank Allah, her family for always supporting her, and the members of the National Ribat University.
References
1. Coronavirus-Vaccine-a-6110.html, 2013
2. http://en.wikipedia.org/wiki/Coronavirus, 2014
3. Khan G (2013) A novel coronavirus capable of lethal human infections: an emerging picture. Virol J 10:66. http://virologyj.biomedcentral.com/articles/10.1186/1743-422X-10-66
4. Modjarrad K (2016) MERS-CoV vaccine candidates in development: the current landscape. Vaccine 34(26):2982–2987
5. Ithete NL, Stoffberg S, Corman VM, Cottontail VM, Richards LR, Schoeman MC, Drosten C, Drexler JF, Preiser W (2013) Close relative of human middle east respiratory syndrome coronavirus in Bat, South Africa. Emerg Infect Dis 19(10):1697–1699
6. Wang L, Shi W, Joyce GM, Modjarrad K, Zhang Y, Leung K, Lees RC, Zhou T, Yassine MH et al (2015) Evaluation of candidate vaccine approaches for MERS-CoV. Nat Commun 6:7712. http://www.nature.com/articles/ncomms8712
7. Kim Y, Ponomarenko J, Zhu Z, Tamang D, Wang P, Greenbaum J, Lundegaard C, Sette A, Lund O, Bourne PE, Nielsen M, Peters B (2012) Immune epitope database analysis resource. Nucleic Acids Res 40:W525–W530
8. Sidney J, Assarsson E, Moore C, Ngo S, Pinilla C, Sette A, Peters B (2008) Quantitative peptide binding motifs for 19 human and mouse MHC class I molecules derived using positional scanning combinatorial peptide libraries. Immunome Res 4:2
9. Hoof I, Peters B, Sidney J, Pedersen LE, Sette A, Lund O, Buus S, Nielsen M (2009) NetMHCpan, a method for MHC class I binding prediction beyond humans. Immunogenetics 61:1–13
10. Nielsen M, Lundegaard C, Worning P, Lauemøller SL, Lamberth K, Buus S, Brunak S, Lund O (2003) Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci 12:1007–1017
11. Peters B, Sette A (2005) Generating quantitative models describing the sequence specificity of biological processes with the stabilized matrix method. BMC Bioinformatics 6:132
12. Karosiene E, Rasmussen M, Blicher T, Lund O, Buus S, Nielsen M (2013) NetMHCIIpan-3.0, a common pan-specific MHC class II prediction method including all three human MHC class II isotypes, HLA-DR, HLA-DP and HLA-DQ. Immunogenetics 65(10):711
13. Nielsen M, Lundegaard C, Blicher T, Lamberth K, Harndahl M, Justesen S, Roder G, Peters B, Sette A, Lund O, Buus S (2007) NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and -B locus protein of known sequence. PLoS One 2:e796
14. Nielsen M, Lundegaard C, Blicher T, Peters B, Sette A, Justesen S, Buus S, Lund O (2008) Quantitative predictions of peptide binding to any HLA-DR molecule of known sequence: NetMHCIIpan. PLoS Comput Biol 4(7):e1000107
15. Poorinmohammad N, Mohabatkar H (2014) Identification of HLA-A*0201-restricted CTL epitopes from the receptor-binding domain of MERS-CoV spike protein using a combinatorial in silico approach. Turk J Biol 38:628–632. http://journals.tubitak.gov.tr/biology/issues/biy-14-38-5/biy-38-5-10-1401-21.pdf
16. Badawi MM, Salaheldin AM, Suliman MM, AbduRahim AS, Mohammed AEA, SidAhmed SAA, Othman MM, Salih AM (2016) In silico prediction of a novel universal multi-epitope peptide vaccine in the whole spike glycoprotein of MERS CoV. Am J Microbiol Res 4(4):101–121
17. Du L, Zhao G, Kou Z (2013) Identification of a receptor-binding domain in the S protein of the novel human coronavirus Middle East respiratory syndrome coronavirus as an essential target for vaccine development. J Virol 87(17):9939–9942
18. Mohamed HA, Mohamed YO, Salam AB, Yousif AH, Hassan MM, Kaheel HH, Hassan AM (2014) In silico analysis of single nucleotide polymorphisms (SNPs) in human FANCA gene. Int J Comput Bioinform In Silico Model 3(5):502–513
19. Shi J, Zhang J, Li S, Sun J, Teng Y, Wu M, Li J, Li Y, Hu N, Wang H, Hu Y (2015) Epitope-based vaccine target screening against highly pathogenic MERS-CoV: an in silico approach applied to emerging infectious diseases. PLoS One 10(12):e0144475
20. Sharmin R, Islam AB (2014) A highly conserved WDYPKCDRA epitope in the RNA directed RNA polymerase of human coronaviruses can be used as epitope-based universal vaccine design. BMC Bioinformatics 15:161
21. Saha S, Raghava GPS (2006) AlgPred: prediction of allergenic proteins and mapping of IgE epitopes. Nucleic Acids Res 34:W202–W209
22. Doytchinova AI, Flower RD (2007) VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinformatics 8:4
23. Doytchinova AI, Flower RD (2007) Identifying candidate subunit vaccines using an alignment-independent method based on principal amino acid properties. Vaccine 25:856–866
24. Doytchinova AI, Flower RD (2008) Bioinformatic approach for identifying parasite and fungal candidate subunit vaccines. Open Vaccines J 1:22–26
Chapter 5
An Alignment-Independent Platform for Allergenicity Prediction
Ivan Dimitrov and Irini Doytchinova
Abstract
A great number of novel proteins have been generated from new sources and genetically modified foods
during the last decade. As the allergenicity of these proteins is of particular importance for their safe usage,
fast and reliable screening strategies for allergenicity assessment are required. The WHO/FAO guidance
directs to structural similarities between the novel proteins and known allergens detected by sequence
alignment. However, the allergic response involves conformational IgE epitopes that are undetectable by
sequence alignment. Here, we present a protocol for allergenicity prediction based on a platform of three
alignment-independent servers developed in our lab: AllerTOP v.1, AllerTOP v.2, and AllergenFP. The
servers use similar datasets but different chemical descriptors and methods to derive models for allergenicity
prediction. The platform is freely accessible and user-friendly. The protocol is demonstrated stepwise on a
randomly chosen query protein.
Key words Allergenicity, Allergenicity prediction, Physicochemical properties of amino acids, Alignment-independent methods
1 Introduction
Allergy is a type of hypersensitivity of the human immune system to
normally innocuous substances, such as dust, pollen, foods, or
drugs. Allergens are small antigens that commonly provoke
human immunoglobulin E (IgE) antibody response. As the num-
ber of novel proteins generated from new sources and genetically
modified foods increased significantly during the last decade, a fast
and reliable screening strategy for allergenicity assessment is
required. According to the guidance of the World Health Organi-
zation (WHO) and Food and Agriculture Organization (FAO) of
the United Nations, the allergenicity assessment is based on struc-
tural similarities between the novel proteins and known allergens
detected by sequence alignment [1]. However, the allergic
response involves conformational IgE epitopes that are undetect-
able by sequence alignment.

Recently, we developed three alignment-independent models for allergenicity prediction. They were implemented in three modular servers: AllerTOP v.1 [2], AllerTOP v.2 [3], and AllergenFP [4]. AllerTOP v.1 uses a dataset of 2395 known allergens as a positive training set and a set of 2395 non-allergens from the same species as a negative training set. The protein sequences are encoded by three amino acid z-scores [5], transformed into uniform vectors by auto- and cross-covariance (ACC) [6], and analyzed by the k nearest neighbor (kNN) algorithm. The model recognizes 94% of the allergens and 94% of the non-allergens in the external test set [2].
AllerTOP v.2 is an updated version of AllerTOP v.1. It uses an updated set of 2427 known allergens and a set of 2427 non-allergenic proteins from widely used food species and non-immunogenic human proteins. The data processing is similar to that used in AllerTOP v.1: protein sequences are encoded by five amino acid E-descriptors [7], transformed by ACC, and analyzed by a kNN classifier. The model recognizes 87% of the allergens and 91% of the non-allergens in the external test set.
The AllergenFP model is derived on the AllerTOP v.2 training
set of transformed protein sequences represented as binary finger-
prints. The predictions are made by comparing the fingerprint of
the tested protein to the fingerprints of the proteins from the
training set. The similarity is assessed by Tanimoto index [8]. This
model recognizes 87% of the allergens and 89% of the non-allergens
in the external test set.
Here, we utilize the three modular servers to assess the aller-
genicity of a randomly selected allergenic protein and demonstrate
a protocol for fast and reliable in silico prediction. The protocol
could be automated by a simple script and used for in silico screen-
ing of novel proteins.
2 Databases and Servers
2.1 Swiss-Prot
Swiss-Prot is the manually annotated and reviewed database of the Universal Protein Knowledgebase (UniProtKB) (http://www.uniprot.org). It contains more than 50,000 protein sequences (July 2019).
2.2 NCBI (National Center for Biotechnology Information) Protein Database
The NCBI Protein database (https://www.ncbi.nlm.nih.gov/) is a collection of sequences from several sources including translations from annotated coding regions in other NCBI databases like GenBank, RefSeq, and TPA as well as records from other databases like SwissProt, PIR, PRF, and PDB.
2.3 AllerTOP v.1
AllerTOP is an alignment-independent server for in silico prediction of allergens based on the main physicochemical properties of the amino acid residues in proteins. The dataset of allergens consists of 84 food, 1156 inhalant, and 555 toxin (venom or salivary) allergenic proteins collected from several databases. The set of non-allergens consists of proteins from the same species selected by BLAST search tailored to mirror the allergen set.
The protein sequences were encoded by three amino acid z-scores accounting for hydrophobicity, size, and polarity. As the proteins were of different lengths, ACC transformation was applied to derive a set of uniform vectors. The set was analyzed by the kNN algorithm, and the model was optimized in terms of lag length and k neighbors and tested on an external set. The highest accuracy of 94% was achieved at a lag of 5 and k of 3 [2]. Additionally, AllerTOP is able to predict the route of allergen exposure: food, inhalant, or toxin.
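The way ACC turns sequences of different lengths into vectors of equal length can be sketched as follows. This is a minimal illustration, assuming the commonly used form ACC_jk(lag) = Σ_i z_j(i) · z_k(i + lag) / (n − lag); the source does not state the formula, the descriptor values below are placeholders rather than the published z-scores, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# Placeholder descriptor values for three residues only; NOT the published z-scores.
TOY_DESCRIPTORS: Dict[str, Tuple[float, float, float]] = {
    "A": (0.1, -1.0, 0.5),
    "G": (2.0, -2.0, -1.0),
    "L": (-4.0, -1.0, -1.5),
}


def acc_transform(sequence: str,
                  descriptors: Dict[str, Tuple[float, ...]],
                  max_lag: int = 5) -> List[float]:
    """Auto- and cross-covariance transform of a descriptor-encoded sequence.

    Returns a vector of length n_descriptors**2 * max_lag, regardless of the
    sequence length (assumed formula, see lead-in).
    """
    z = [descriptors[residue] for residue in sequence]
    n = len(z)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_desc = len(z[0])
    vector: List[float] = []
    for lag in range(1, max_lag + 1):
        for j in range(n_desc):          # descriptor taken at position i
            for k in range(n_desc):      # descriptor taken at position i + lag
                total = sum(z[i][j] * z[i + lag][k] for i in range(n - lag))
                vector.append(total / (n - lag))
    return vector


short_vec = acc_transform("AGLAGLAG", TOY_DESCRIPTORS)
long_vec = acc_transform("AGLAGLAGLLAGGAL", TOY_DESCRIPTORS)
print(len(short_vec), len(long_vec))  # both vectors have 3 * 3 * 5 entries
```

With three descriptors and a lag of 5, every protein is represented by 45 values, so vectors from proteins of any length can be compared directly by a kNN classifier.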
2.4 AllerTOP v.2
The AllerTOP v.2 model was developed on an updated dataset of 2427 allergens and 2427 non-allergens. The set of allergens was collected from several databases containing allergens of different origin and route of exposure. The set of non-allergens was collected from widely used food species like tomato, pepper, potato, bread wheat, rice, and human non-immunogenic proteins.
The protein sequences were encoded by five E-descriptors accounting for hydrophobicity, size, helix- or β-strand forming propensities, and relative abundance. After ACC transformation, the dataset was analyzed by the kNN algorithm. The best performing model showed an accuracy of 85% at k = 1 [3].
2.5 AllergenFP
AllergenFP uses the same ACC-transformed dataset as AllerTOP v.2. Each vector from the dataset is encoded as a binary fingerprint. The tested protein is presented as a string of E-descriptors, transformed by ACC, and coded in binary form to generate a fingerprint. The fingerprint is compared to the protein fingerprints from the AllergenFP dataset by the Tanimoto similarity index. The protein showing the highest similarity with the tested protein determines whether the tested protein is predicted to be an allergen or a non-allergen. The external validation of the model gave 88% accuracy [4].
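The fingerprint comparison can be sketched with the standard Tanimoto index for binary vectors, c / (a + b − c), where a and b are the numbers of set bits in each fingerprint and c is the number of bits set in both. The toy fingerprints, the labels, and the function names are hypothetical, and returning 0.0 when both fingerprints are empty is an assumed convention.

```python
from typing import List, Sequence, Tuple


def tanimoto(fp_a: Sequence[int], fp_b: Sequence[int]) -> float:
    """Tanimoto similarity between two binary fingerprints of equal length."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a == 1 or b == 1)
    # Assumed convention: two all-zero fingerprints have similarity 0.0.
    return both / either if either else 0.0


def predict_by_nearest_fingerprint(
        query_fp: Sequence[int],
        labelled: List[Tuple[str, Sequence[int]]]) -> Tuple[str, float]:
    """Return the class of the most similar training fingerprint and its score."""
    best_label, best_fp = max(labelled, key=lambda item: tanimoto(query_fp, item[1]))
    return best_label, tanimoto(query_fp, best_fp)


# Toy training set: (class, fingerprint) pairs, invented for illustration.
training = [
    ("allergen", [1, 1, 0, 1]),
    ("non-allergen", [0, 0, 1, 0]),
]
label, score = predict_by_nearest_fingerprint([1, 1, 0, 0], training)
print(label, round(score, 2))
```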
3 Methods
In the present protocol, the three modular servers described above are used to predict the allergenicity of a randomly chosen query protein. The allergen AHZ10957.1 (Fra a 1.02) from Fragaria x ananassa subsp. ananassa (strawberry) was selected as the query protein. It contains 160 residues, is proven experimentally to be an allergen [9], and is not present in the datasets used to derive the models in the three servers.
3.1 Allergenicity Prediction by AllerTOP v.1
1. Retrieve the protein sequence for Fra a 1.02 from UniProtKB (entry: A0A024B3G5) or GenBank (AHZ10957.1).
2. Download or copy the sequence in FASTA format.
3. Open AllerTOP v.1 (URL: http://www.ddg-pharmfac.net/allertop).
4. Enter the protein sequence in plain format (removing the title line: delimiter >).
5. Press “Get the Result” button.
6. The AllerTOP v.1 RESULTS page lists the statement “PROBABLE ALLERGEN” (Fig. 1) and a list of the three nearest proteins presented by accession number, hyperlink to UniProtKB or NCBI protein database, and class allergen or non-allergen. In our case, the three nearest neighbors are:
(a) 60280861 defined as allergen: major allergen Mal d 1.03F from Malus domestica (apple) [10], with accession number in the NCBI protein database.
(b) Q84LA7 defined as allergen: major allergen Mal d 1 from Malus domestica (apple) [11], with accession number in the UniProtKB database.
(c) 60280843 defined as allergen: major allergen Mal d 1B from Malus domestica (apple) [10], with accession number in the NCBI protein database.
Fig. 1 The AllerTOP v.1 RESULTS page for the query protein AHZ10957.1 from Fragaria x ananassa subsp. ananassa (strawberry)
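Step 4 of this procedure, removing the FASTA title line before pasting the sequence, is easy to script. The sketch below is one possible helper; the function name is hypothetical and the example record is a made-up placeholder, not the real Fra a 1.02 sequence.

```python
def fasta_to_plain(fasta_text: str) -> str:
    """Drop title lines (those starting with '>') and join the sequence lines."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


# Made-up placeholder record used only to show the behaviour.
example = ">demo|placeholder record\nMGVFTYE\nSEFTSV\n"
plain = fasta_to_plain(example)
print(plain)
```

The resulting string can then be pasted into the input field of any of the three servers.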
3.2 Allergenicity Prediction by AllerTOP v.2
1. Retrieve the protein sequence for Fra a 1.02 from UniProtKB (entry: A0A024B3G5) or GenBank (AHZ10957.1).
2. Download or copy the sequence in FASTA format.
3. Go to AllerTOP v.2 homepage (URL: http://www.ddg-pharmfac.net/AllerTOP).
4. Enter the protein sequence in plain format (removing the title line: delimiter >).
5. Press “Get the Result” button.
6. The AllerTOP v.2 RESULTS page lists the statement: “Your sequence is PROBABLE ALLERGEN. The nearest protein is NCBI gi number 60280861 (hyperlink) defined as allergen” (Fig. 2). Here again, the nearest neighbor is the major allergen Mal d 1.03F from Malus domestica (apple) [10].
Fig. 2 The AllerTOP v.2 RESULTS page for the query protein AHZ10957.1 from Fragaria x ananassa subsp. ananassa (strawberry)
3.3 Allergenicity Prediction by AllergenFP
1. Retrieve the protein sequence for Fra a 1.02 from UniProtKB (entry: A0A024B3G5) or GenBank (AHZ10957.1).
2. Download or copy the sequence in FASTA format.
3. Go to AllergenFP homepage (URL: http://www.ddg-pharmfac.net/AllergenFP).
4. Enter the protein sequence in plain format (removing the title line: delimiter >).
5. Press “Get the Result” button.
6. The AllergenFP RESULTS page lists the statement: “Your sequence is PROBABLE ALLERGEN. The protein with the highest Tanimoto similarity index 0.88 is UniProtKB accession number O50001 (hyperlink)” (Fig. 3). Here, the protein with the highest Tanimoto index is the major allergen Pru ar 1 from Prunus armeniaca (Apricot) (Armeniaca vulgaris) [12].
Fig. 3 The AllergenFP RESULTS page for the query protein AHZ10957.1 from Fragaria x ananassa subsp. ananassa (strawberry)
3.4 Conclusion
The decision on the allergenicity or non-allergenicity of the query protein is based on the predictions made by the three servers. In our case, all the nearest proteins are defined as allergens. Hence, the query protein is predicted to be an allergen as well. In the general case, if only one of the nearest proteins is defined as an allergen, the tested protein should be considered a probable allergen.
The platform for allergenicity assessment presented here is freely accessible and user-friendly. The three servers complement one another, and the platform could be considered an expert system for the prediction of protein allergenicity.
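The decision rule described in this section can be expressed as a short sketch. It assumes that each server's result has already been reduced to the class of its nearest protein ("allergen" or "non-allergen"); the function name, the dictionary layout, and the "probable non-allergen" label for the case in which no server reports an allergen are assumptions made for illustration.

```python
from typing import Dict


def platform_decision(nearest_classes: Dict[str, str]) -> str:
    """Combine the nearest-protein classes reported by the three servers.

    Following the rule above, a single 'allergen' call is enough to flag
    the tested protein as a probable allergen.
    """
    is_allergen = any(cls == "allergen" for cls in nearest_classes.values())
    return "probable allergen" if is_allergen else "probable non-allergen"


# Classes of the nearest proteins reported for the query protein in this protocol.
fra_a_1_02 = {
    "AllerTOP v.1": "allergen",
    "AllerTOP v.2": "allergen",
    "AllergenFP": "allergen",
}
print(platform_decision(fra_a_1_02))
```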
Acknowledgments
This work has been accomplished with the financial support by the
Grant No BG05M2OP001-1.001-0003, financed by the Science
and Education for Smart Growth Operational Program (2014-
2020), and co-financed by the European Union through the
European structural and investment funds.
References
1. FAO/WHO (2009) Foods derived from modern biotechnology (annex: assessment of possible allergenicity, Codex Alimentarius - joint FAO/WHO food standards), 2nd edn. WHO, Geneva
2. Dimitrov I, Flower DR, Doytchinova I (2013) AllerTOP - a server for in silico prediction of allergens. BMC Bioinformatics 14(Suppl.6):S4
3. Dimitrov I, Bangov I, Flower DR, Doytchinova I (2014) AllerTOP v.2 - a server for in silico prediction of allergens. J Mol Model 20:2278
4. Dimitrov I, Naneva L, Bangov I, Doytchinova I (2014) AllergenFP: allergenicity prediction by descriptor fingerprints. Bioinformatics 30:846–851
5. Hellberg S, Sjöström M, Skagerberg B, Wold S (1987) Peptide quantitative structure-activity relationships, a multivariate approach. J Med Chem 30:1126–1135
6. Wold S, Jonsson J, Sjöström M, Sandberg M, Rännar S (1993) DNA and peptide sequences and chemical processes multivariately modelled by principal components analysis and partial least squares projections to latent structures. Anal Chim Acta 277:239–253
7. Venkatarajan MS, Braun W (2001) New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. J Mol Model 7:445–453
8. Tanimoto TT (1958) An elementary mathematical theory of classification and prediction. IBM Research Yorktown Heights, New York
9. Franz-Oberdorf K, Eberlein B, Edelmann K, Hücherig S, Besbes F, Darsow U, Ring J, Schwab W (2016) Fra a 1.02 is the most potent isoform of the bet v 1-like allergen in strawberry fruit. J Agric Food Chem 64:3688–3696
10. Gao ZS, van de Weg WE, Schaart JG, Schouten HJ, Tran DH, Kodde LP, van der Meer IM, van der Geest AHM, Kodde J, Breiteneder H, Hoffmann-Sommergruber K, Bosch D, Gilissen LJWJ (2005) Genomic cloning and linkage mapping of the mal d 1 (PR-10) gene family in apple (Malus domestica). Theor Appl Genet 111:171–183
11. Hsieh LS, Moos M, Lin Y (1995) Characterization of apple 18 and 31 kd allergens by microsequencing and evaluation of their content during storage and ripening. J Allergy Clin Immunol 96:960–970
12. Scheurer S, Pastorello EA, Wangorsch A, Kästner M, Haustein D, Vieths S (2001) Recombinant allergens Pru av 1 and Pru av 4 and a newly identified lipid transfer protein in the in vitro diagnosis of cherry allergy. J Allergy Clin Immunol 107:724–731
Chapter 6
Immunoinformatics and Epitope Prediction

Jayashree Ramana and Kusum Mehla
Abstract
With advances in sequencing technologies, a vast amount of experimental data has accumulated. Owing to
this accumulation of data and to rapid progress in the development of bioinformatics tools, immunoinformatics,
or computational immunology, has emerged as a branch of bioinformatics that applies computational
approaches to understanding and interpreting immunological data. One extensively studied
aspect of applied immunology is the use of available databases and tools to predict B- and T-cell
epitopes. B and T cells constitute the two arms of adaptive immunity.
This chapter first reviews the methodology we used for the computational identification of B- and T-cell
epitopes against enterotoxigenic Escherichia coli (ETEC). We then discuss other epitope databases and
analysis tools for T-cell and B-cell epitope prediction and vaccine design. The predicted peptides were
analyzed for conservation and population coverage. HLA distribution analysis for the predicted epitopes
identified efficient MHC binders. Computational docking studies were then used to test whether the
epitopes bind in the cleft of the MHC-I molecule. The predicted epitopes were conserved and covered more
than 80% of the world population.
Key words Immunoinformatics, Epitope, B cell, T cell, Vaccine, MHC
1 Introduction
Immunoinformatics is a specialized branch of bioinformatics that
uses a wide variety of computational, mathematical, and statistical
methods to better understand and interpret problems in immunology.
With the advent of genome sequencing, immunoinformatics techniques
have been used widely in MHC genotyping, epitope selection, and
epitope assembly; for predicting allergenicity and cross-reactivity;
for disease surveillance; and for simulating the response to a
disease [1]. The immune system comprises innate and adaptive
immunity. While innate immunity involves rapid, nonspecific defense
mechanisms for antigen clearance, adaptive immunity is highly
specific and works by recognizing foreign agents through B cells,
which involve antibodies, or through T cells, which involve
cell-mediated toxicity and complement activation [2]. Most of the
research has been directed toward identification of T-cell epitopes
because they bind to MHC molecules in a linear fashion, which makes
accurate modeling of the binding interface feasible.
Advances in sequencing techniques have allowed researchers to
move beyond the traditional vaccinology approach and use the genomic
information of an organism to predict vaccine candidates
computationally. Vaccine identification has focused mainly on
well-characterized virulence factors such as toxins, major
colonization factors, and adhesion proteins [3], which are vital for
pathogen establishment and subsequent host damage. However, pathogen
genomes encode many uncharacterized proteins whose potential to
harbor antigenic segments has not been explored. Immunoinformatics
techniques can serve as a powerful aid, particularly for pathogens
for which knowledge about pathogenesis mechanisms or antigenic
determinants is limited.
We present a reverse vaccinology approach that has been
successfully applied to a number of pathogens, such as Neisseria
meningitidis serogroup B (MenB), Escherichia coli, and Staphylococcus
aureus [4–6], with some modifications to the original methodology.
Antibodies offer immediate immunological protection against
biological weapons, while T cells confer long-lasting immunity. In
this study, we proposed both B- and T-cell epitopes with the
potential to act as vaccine candidates against enterotoxigenic
Escherichia coli (ETEC); their efficiency in eliciting immune
responses could be further validated through in vivo and in vitro
experiments. Our approach was unbiased in that we considered the
whole genome of the pathogen without first referencing the functions
of its proteins. Consequently, we retrieved a list of five
high-priority T-cell epitopes, five sequential B-cell epitopes, and
conformational B-cell epitopes as potential vaccine candidates. The
predicted epitopes showed high population coverage across different
ethnicities and geographical areas and are thus representative of a
large part of the human population; they were also efficient binders
of various HLA alleles. Similar approaches have subsequently been
used to propose vaccines against other pathogens as well.
2 Our Approach
2.1 In Silico Orthologous Protein Detection
To retain proteins of E. coli O139:H28 (strain E24377A/ETEC) that
have orthologues in other pathogenic strains of the same species and
to exclude proteins orthologous to those of commensal E. coli strains,
a two-step comparative genomics procedure was employed. The entire
proteomes of the pathogenic ETEC strains E24377A and H10407 and of a
commensal strain (E. coli SE11) were downloaded from UniProt.
Comparative genomics of pathogenic and commensal strains can
reveal proteins implicated in pathogenesis, which can eventually
serve as vaccine candidates. Hence, we sought to identify proteins
that were shared between pathogenic ETEC strains and absent
from commensal E. coli. The most probable set of orthologous
proteins shared by the two pathogenic ETEC strains was identified
using the reciprocal BLAST analysis detailed below. First, the entire
proteome of one pathogenic strain was searched with BLASTP against
the proteome of the other pathogenic strain at an E-value cutoff of
10⁻¹². To increase stringency further, the alignment between the
subject and the query protein was required to cover at least 80% of
the entire protein length, with at least 40% similarity, for both the
query and the subject protein. In the next step, the orthologous
proteins selected in the previous step were subjected to a BLASTP
search against the commensal strain SE11 at an E-value cutoff of
10⁻¹². At this step, proteins shared between the pathogenic and
commensal strains were filtered out, as they probably do not have a
role in pathogenicity.
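The two-step comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hit` record, the function names, and the toy hits are hypothetical, hits are assumed to have been parsed already from BLASTP output, and "reciprocal" is interpreted here as reciprocal best hit, which the text does not state explicitly.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLASTP hit; field names are illustrative, not a BLAST format."""
    query: str
    subject: str
    evalue: float
    align_len: int     # number of aligned residues
    similarity: float  # percent similar (positive-scoring) residues


def passes(hit, lengths, max_evalue=1e-12, min_cov=0.80, min_sim=40.0):
    """Apply the E-value, coverage (query and subject), and similarity cutoffs."""
    cov_query = hit.align_len / lengths[hit.query]
    cov_subject = hit.align_len / lengths[hit.subject]
    return (hit.evalue <= max_evalue and hit.similarity >= min_sim
            and cov_query >= min_cov and cov_subject >= min_cov)


def best_hits(hits, lengths):
    """Map each query to its lowest-E-value subject among passing hits."""
    best = {}
    for hit in hits:
        if passes(hit, lengths):
            current = best.get(hit.query)
            if current is None or hit.evalue < current.evalue:
                best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}


def reciprocal_orthologues(a_vs_b, b_vs_a, lengths):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, lengths)
    reverse = best_hits(b_vs_a, lengths)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


def drop_commensal(pairs, vs_commensal, max_evalue=1e-12):
    """Remove pairs whose first member has a significant commensal hit."""
    shared = {hit.query for hit in vs_commensal if hit.evalue <= max_evalue}
    return {pair for pair in pairs if pair[0] not in shared}


# Hypothetical toy data: protein IDs, lengths, and hits are invented.
lengths = {"A1": 300, "A2": 250, "A3": 200, "B1": 310, "B2": 240, "B3": 205}
a_vs_b = [Hit("A1", "B1", 1e-50, 290, 85.0),
          Hit("A2", "B2", 1e-30, 150, 90.0),   # fails the 80% coverage rule
          Hit("A3", "B3", 1e-40, 195, 70.0)]
b_vs_a = [Hit("B1", "A1", 1e-48, 290, 85.0),
          Hit("B2", "A2", 1e-30, 150, 90.0),
          Hit("B3", "A3", 1e-41, 195, 70.0)]
vs_commensal = [Hit("A3", "C9", 1e-60, 198, 95.0)]  # A3 also occurs in the commensal

orthologues = reciprocal_orthologues(a_vs_b, b_vs_a, lengths)
pathogen_specific = drop_commensal(orthologues, vs_commensal)
print(sorted(pathogen_specific))
```

In this toy run, A2 is rejected by the coverage rule and A3 by the commensal search, so only the A1/B1 pair would be carried forward.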
Protein sequences selected from this pool were further filtered
by length. The average protein length in bacteria is 267 aa [7].
Proteins shorter than 100 aa were excluded from the analysis, as they
are less likely to represent functional proteins. The remaining
protein sequences were searched with BLASTP against the human
proteome at an E-value cutoff of 0.05 to avoid selecting proteins
homologous to human proteins. Excluding pathogen proteins that share
similarity with the human proteome is essential to avoid situations
in which a peptide mimics host proteins and provokes an autoimmune
response against host enzymes, with adverse effects in humans [8].
Only protein sequences that showed no similarity to the human
proteome at this E-value threshold were selected. This analysis
yielded 61 protein sequences.
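The length and human-homology filters above amount to a simple subtractive step. The sketch below is one possible way to express it; the function name, the input dictionaries, and the protein IDs are hypothetical, and the best E-value of each protein against the human proteome is assumed to have been obtained beforehand from a BLASTP run.

```python
def subtractive_filter(proteins, human_best_evalue,
                       min_length=100, human_evalue_cutoff=0.05):
    """Keep proteins that are long enough and have no human hit at the cutoff.

    proteins: dict of protein ID -> amino acid sequence
    human_best_evalue: dict of protein ID -> best E-value against the
        human proteome; proteins without any hit are simply absent.
    """
    kept = []
    for pid, seq in proteins.items():
        if len(seq) < min_length:
            continue  # shorter than 100 aa: excluded
        evalue = human_best_evalue.get(pid)
        if evalue is not None and evalue <= human_evalue_cutoff:
            continue  # similar to a human protein: excluded
        kept.append(pid)
    return kept


# Hypothetical inputs, invented for illustration.
proteins = {"P1": "M" * 150, "P2": "M" * 80, "P3": "M" * 200, "P4": "M" * 120}
human_best_evalue = {"P3": 1e-20, "P4": 2.0}

selected = subtractive_filter(proteins, human_best_evalue)
print(selected)
```

Here P2 is dropped for its length and P3 for its human homologue, while P4 is kept because its only human hit lies above the 0.05 cutoff.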
2.2 Antigenicity and Transmembrane Prediction
The VaxiJen server [9], which transforms protein sequence
information into vectors of amino acid properties, was used to
identify protein sequences predicted to be antigenic above a
threshold score of 0.6. The cellular location of a protein holds some
clues to its function. Hence, the cellular locations of the
potentially antigenic sequences were predicted with PSORTb at a
localization score cutoff of 7.5. PSORTb uses a Bayesian network
model to calculate the associated probability for five major
localization sites in Gram-negative bacteria, viz., cytoplasmic,
inner membrane, periplasmic, outer membrane, and extracellular [10].
In the search for vaccine candidates, it is imperative to look for
segments with high antigenic potential; VaxiJen was therefore used
first to identify antigenic proteins and later to identify peptides
with high antigenicity values. VaxiJen predicted nine proteins as
antigenic above a threshold of 0.6, three of which were localized in
the outer membrane, making them ideal candidates for epitope
identification. A Venn diagram shows the three prioritized proteins
retained after filtering on physicochemical properties, which were
used further for epitope mapping (Fig. 1).
[Fig. 1: Venn diagram with four sets labeled Pathogenic, Non-human, Antigenic, and Membrane]
Fig. 1 Representation of three prioritized target proteins: The proteins characterized as (1) shared between
pathogenic strains and excluded from commensal strain, (2) non-homologous to humans, (3) antigenic,
(4) membrane localized are shown using different colors in the Venn diagram. Proteins satisfying a particular
parameter are shown in the corresponding category of the Venn diagram. Three proteins were prioritized for
vaccine candidate identification. The image has been generated by Jvenn. (Reproduced by permission of The
Royal Society of Chemistry [47])
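The prioritization summarized in the Venn diagram is, in effect, a set intersection. A minimal sketch follows, assuming each protein already carries a VaxiJen score and a PSORTb localization call; the record layout, the protein IDs, and all scores are hypothetical, and only the 0.6 and 7.5 cutoffs come from the text.

```python
VAXIJEN_THRESHOLD = 0.6  # antigenicity threshold quoted in the text
PSORTB_CUTOFF = 7.5      # PSORTb cutoff quoted in the text

# Hypothetical protein IDs and scores, invented for illustration only.
records = {
    "prot01": {"vaxijen": 0.82, "site": "OuterMembrane", "psortb": 9.5},
    "prot02": {"vaxijen": 0.71, "site": "Cytoplasmic", "psortb": 8.9},
    "prot03": {"vaxijen": 0.45, "site": "OuterMembrane", "psortb": 9.9},
    "prot04": {"vaxijen": 0.66, "site": "OuterMembrane", "psortb": 6.0},
}

# One set per criterion, mirroring the categories of the Venn diagram.
antigenic = {pid for pid, rec in records.items()
             if rec["vaxijen"] > VAXIJEN_THRESHOLD}
membrane = {pid for pid, rec in records.items()
            if rec["site"] == "OuterMembrane" and rec["psortb"] >= PSORTB_CUTOFF}

prioritized = sorted(antigenic & membrane)        # satisfy both criteria
antigenic_only = sorted(antigenic - membrane)     # antigenic, not outer membrane
print(prioritized, antigenic_only)
```

Keeping one set per criterion makes it easy to report how many proteins fall into each region of the diagram, not only the final intersection.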
2.3 T-Cell Epitope Prediction
For T-cell epitope prediction, parameters such as T-cell processing,
the number of human leukocyte antigen (HLA) alleles covered, and
significant population coverage were considered. To predict cytotoxic
T-lymphocyte (CTL) epitopes from the outer membrane-localized
sequences, the NetCTL server was used at a threshold value of 0.75,
and predictions were restricted to 12 major histocompatibility
complex class I (MHC-I) supertypes. NetCTL is a neural network-based
method that combines predictions of peptide–MHC-I binding,
proteasomal C-terminal cleavage, and TAP transport efficiency [11].
The CTL epitopes were also evaluated for their antigenic potential.
Based on their antigenicity scores, epitopes were categorized into
low-, medium-, and high-priority vaccine candidates. As shown in
Table 1, a total of 50 epitopes were predicted by NetCTL, which after
antigenicity assessment were grouped into high priority
(5 candidates), medium priority (11 candidates), and low priority
(9 candidates).
Table 1
Most probable predicted epitopes, selected on the basis of their NetCTL (MHC binding, proteasomal
processing, and TAP transport) and VaxiJen (antigenicity) scores
Priority  Epitope set  Uniprot Id  Protein name               Epitope    NetCTL score  VaxiJen score
High      Set I        A7ZGR5      Putative membrane protein  LICFFTLSY  1.8004        1.7024
High      Set I        A7ZGR5      Putative membrane protein  PLNPLILLY  1.7020        1.7194
High      Set I        A7ZGR5      Putative membrane protein  PIVNLFLLY  1.0515        1.9812
High      Set II       A7ZGK4      Uncharacterized protein    SVSVFIFLF  0.9647        3.0057
High      Set II       A7ZGK4      Uncharacterized protein    SVFIFLFIY  1.0519        4.1539
Medium    Set I        A7ZGR5      Putative membrane protein  IIAFYEFMY  1.1235        1.6211
Medium    Set I        A7ZGR5      Putative membrane protein  FYEFMYINY  1.1529        1.1573
Medium    Set I        A7ZGR5      Putative membrane protein  SLFGPEFLY  0.7607        1.5082
Medium    Set I        A7ZGR5      Putative membrane protein  KTALLICFF  0.9668        1.2435
Medium    Set I        A7ZGR5      Putative membrane protein  FTLSYNVLY  3.2100        1.1549
Medium    Set I        A7ZGR5      Putative membrane protein  STTIHSLFF  1.1339        1.2332
Medium    Set I        A7ZGR5      Putative membrane protein  KASNAHQRY  1.4873        1.2538
Medium    Set II       A7ZTH5      O-antigen polymerase       ASHATTAGY  1.5819        1.1027
Medium    Set II       A7ZTH5      O-antigen polymerase       KTTLYTINF  0.8133        1.1789
Medium    Set II       A7ZTH5      O-antigen polymerase       YTINFMLSL  0.9437        1.6805
Medium    Set II       A7ZTH5      O-antigen polymerase       SVGARLAMY  0.9550        1.0684
Low       Set I        A7ZGR5      Putative membrane protein  TLLLGVLIY  1.0991        0.7237
Low       Set I        A7ZGR5      Putative membrane protein  TLSYNVLYF  0.8201        0.9366
Low       Set I        A7ZGR5      Putative membrane protein  VSTTIHSLF  1.1339        0.7892
Low       Set I        A7ZGR5      Putative membrane protein  YYRFNDLFY  0.8111        0.7745
Low       Set I        A7ZGR5      Putative membrane protein  YSSTKNIHQ  0.9350        0.8034
Low       Set II       A7ZGK4      Uncharacterized protein    MVRACIQMY  0.8529        0.8009
Low       Set III      A7ZTH5      O-antigen polymerase       SLATQLLFF  0.9629        0.7659
Low       Set III      A7ZTH5      O-antigen polymerase       TTAGYIILF  1.6656        0.9246
Low       Set III      A7ZTH5      O-antigen polymerase       FSAILIYAL  0.7571        0.7841
Epitopes are categorized into high-, medium-, and low-priority epitopes based on the antigenicity scale
(VaxiJen score). (Reproduced by permission of The Royal Society of Chemistry [47])
Antigenicity scale
0.7–1.0: low priority
1.0–1.7: medium priority
1.7 and above: high priority
To predict MHC-I alleles that bind efficiently to the high-priority
epitopes, an Immune Epitope Database (IEDB) tool was used [12]. To
compute inhibitory concentration (IC50) values for MHC-I and MHC-II
alleles, MHCPred, a partial least squares-based multivariate model,
was used [13]. Alleles predicted by both MHCPred and IEDB to bind
with an IC50 value below 500 nM were considered efficient peptide
binders. A list of the MHC-I and MHC-II alleles that may serve as
efficient binders for each high-priority epitope was prepared, and
the results are summarized in Table 2. Although data were unavailable
for the majority of MHC-II alleles, our predicted peptides showed
good population coverage. MHC is highly polymorphic; identifying
peptides that can bind to several MHC alleles is therefore an
important factor in vaccine identification [14]. We observed that all
the predicted high-priority epitopes interacted with more than
20 MHC alleles.
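The consensus rule used here (an allele counts as an efficient binder only if both predictors place it below 500 nM) can be sketched as a set intersection. The function name and the IC50 values below are hypothetical; the per-tool predictions are assumed to be available as plain dictionaries of allele name to IC50 in nM.

```python
IC50_CUTOFF_NM = 500.0  # affinity threshold quoted in the text


def consensus_binders(pred_a, pred_b, cutoff=IC50_CUTOFF_NM):
    """Return alleles predicted below the cutoff by both tools, sorted by name."""
    strong_a = {allele for allele, ic50 in pred_a.items() if ic50 < cutoff}
    strong_b = {allele for allele, ic50 in pred_b.items() if ic50 < cutoff}
    return sorted(strong_a & strong_b)


# Hypothetical IC50 values (nM) for one epitope from two predictors.
mhcpred_ic50 = {"HLA-A*11:01": 18.8, "HLA-A*02:01": 317.0, "HLA-B*07:02": 820.0}
iedb_ic50 = {"HLA-A*11:01": 45.0, "HLA-A*02:01": 650.0, "HLA-B*07:02": 300.0}

binders = consensus_binders(mhcpred_ic50, iedb_ic50)
print(binders)
```

Requiring agreement between two tools trades some sensitivity for fewer false positives; in this toy case only one of the three alleles passes both predictors.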
Table 2
Predicted potential T-cell epitopes, along with their interacting MHC-I and MHC-II alleles with an
affinity of <500 nM and the corresponding IC50 values (in parentheses)
Epitope: LICFFTLSY
Total no. of MHC-peptide binders: 24
MHC-I alleles: HLA-A∗01:01 (92.47), HLA-A∗02:06 (105.68), HLA-A∗02:17 (130.82),
HLA-A∗03:01 (305.49), HLA-A∗11:01 (257.22), HLA-A∗29:02 (30.89),
HLA-A∗30:02 (400.37), HLA-A∗32:01 (468.75), HLA-A∗32:07 (23.72),
HLA-A∗32:15 (164.53), HLA-A∗68:01 (59.57), HLA-A∗68:02 (492.04),
HLA-A∗68:23 (99.25), HLA-A∗80:01 (26.42), HLA-B∗15:01 (54.83),
HLA-B∗15:02 (293.24), HLA-B∗15:03 (56.51), HLA-B∗15:17 (212.18),
HLA-B∗27:20 (14.40), HLA-B∗35:01 (133.53), HLA-B∗40:13 (179.03),
HLA-C∗12:03 (14.73), HLA-C∗14:02 (468.40)
MHC-II alleles: HLA-DRB1∗01:01 (9.40)
Epitope: PLNPLILLY
Total no. of MHC-peptide binders: 23
MHC-I alleles: HLA-A∗01:01 (285.76), HLA-A∗02:03 (4.80), HLA-A∗02:17 (280.32),
HLA-A∗03:01 (323.59), HLA-A∗11:01 (144.88), HLA-A∗29:02 (25.63),
HLA-A∗31:01 (210.38), HLA-A∗32:07 (20.42), HLA-A∗32:15 (47.78),
HLA-A∗68:01 (127.64), HLA-A∗68:23 (91.57), HLA-A∗80:01 (2.30),
HLA-B∗15:02 (144.62), HLA-B∗27:20 (4.76), HLA-B∗35:01 (436.52),
HLA-B∗40:13 (399.88), HLA-C∗05:01 (102.02), HLA-C∗07:01 (326.72),
HLA-C∗07:02 (295.24), HLA-C∗12:03 (48.11), HLA-C∗14:02 (228.36)
MHC-II alleles: HLA-DRB1∗01:01 (1.01), HLA-DRB1∗04:01 (359.75)
Epitope: PIVNLFLLY
Total no. of MHC-peptide binders: 22
MHC-I alleles: HLA-A∗01:01 (14.13), HLA-A∗02:01 (456.04), HLA-A∗02:03 (280.54),
HLA-A∗02:06 (38.55), HLA-A∗03:01 (20.94), HLA-A∗11:01 (49.32),
HLA-A∗26:01 (450.41), HLA-A∗26:02 (108.22), HLA-A∗29:02 (28.96),
HLA-A∗32:07 (26.92), HLA-A∗32:15 (52.63), HLA-A∗68:01 (99.54),
HLA-A∗68:23 (76.34), HLA-A∗80:01 (4.53), HLA-B∗15:02 (159.31),
HLA-B∗27:20 (35.69), HLA-B∗35:01 (251.77), HLA-B∗40:13 (93.53),
HLA-C∗03:03 (255.63), HLA-C∗12:03 (23.62)
MHC-II alleles: HLA-DRB1∗01:01 (0.78), HLA-DRB1∗07:01 (47.64)
Epitope: SVSVFIFLF
Total no. of MHC-peptide binders: 27
MHC-I alleles: HLA-A∗02:01 (306.90), HLA-A∗02:02 (269.78), HLA-A∗02:06 (307.61),
HLA-A∗02:50 (74.00), HLA-A∗11:01 (6.90), HLA-A∗24:02 (467.34),
HLA-A∗24:03 (388.63), HLA-A∗26:02 (63.58), HLA-A∗29:02 (325.66),
HLA-A∗32:01 (111.41), HLA-A∗32:07 (7.91), HLA-A∗32:15 (179.99),
HLA-A∗68:01 (50.23), HLA-A∗68:02 (202.77), HLA-A∗68:23 (11.72),
HLA-B∗15:02 (222.45), HLA-B∗15:03 (495.55), HLA-B∗15:17 (93.69),
HLA-B∗27:20 (70.39), HLA-B∗35:01 (242.66), HLA-B∗40:13 (106.40),
HLA-B∗58:01 (454.26), HLA-C∗03:03 (92.17), HLA-C∗07:01 (239.98),
HLA-C∗07:02 (297.97), HLA-C∗12:03 (84.96)
MHC-II alleles: HLA-DRB1∗01:01 (12.45)
Epitope: SVFIFLFIY
Total no. of MHC-peptide binders: 25
MHC-I alleles: HLA-A∗02:01 (316.96), HLA-A∗02:02 (467.74), HLA-A∗02:06 (77.62),
HLA-A∗02:17 (90.50), HLA-A∗03:01 (239.33), HLA-A∗11:01 (18.84),
HLA-A∗26:01 (454.58), HLA-A∗26:02 (73.17), HLA-A∗29:02 (21.27),
HLA-A∗30:02 (77.89), HLA-A∗32:01 (253.48), HLA-A∗32:07 (13.09),
HLA-A∗32:15 (53.12), HLA-A∗68:01 (15.56), HLA-A∗68:02 (283.79),
HLA-A∗68:23 (5.53), HLA-A∗80:01 (18.37), HLA-B∗15:02 (157.12),
HLA-B∗15:17 (46.96), HLA-B∗27:20 (41.83), HLA-B∗35:01 (160.91),
HLA-B∗40:13 (20.37), HLA-C∗07:02 (444.81), HLA-C∗12:03 (11.07)
MHC-II alleles: HLA-DRB1∗01:01 (41.59)
2.4 HLA Distribution Analysis
The five high-priority epitopes were grouped into two sets, as they
belonged to two different proteins. Another IEDB tool [15], which
performs human population coverage analysis based on data from the
Allele Frequency Net Database (AFND), was used to study the
distribution of human HLA alleles for the high-priority epitopes in
these sets. Both sets were predicted to elicit an immune response in
more than 95% of the world population. For one epitope set, coverage
was highest in Europe (99.23%), followed by North America (97.53%),
South Asia (94.42%), and East Asia (93.44%). For the second epitope
set, coverage was highest in Europe (97.25%), followed by North
America (96.01%), East Asia (96.01%), and Southeast Asia (93.05%). As
shown in Table 3, population coverage for the predicted epitopes was
high in resource-poor regions such as Asia and Africa, which is
relevant given the high incidence of diarrhea in these regions. In
developed regions such as Europe and North America, population
coverage was highest, which is relevant because these regions usually
account for the largest number of travelers to developing
nations [16].
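To give a feel for how allele frequencies translate into population coverage, the sketch below uses a deliberately simplified model: Hardy-Weinberg proportions within each locus and independence between loci. It is not the algorithm of the IEDB tool, and the function names and allele frequencies are hypothetical.

```python
def locus_coverage(allele_freqs):
    """Fraction of individuals carrying at least one covered allele at a locus.

    Assumes Hardy-Weinberg proportions: an individual is uncovered only
    if both of its alleles fall outside the covered set.
    """
    covered = min(sum(allele_freqs), 1.0)
    return 1.0 - (1.0 - covered) ** 2


def population_coverage(freqs_by_locus):
    """Combine loci, assuming they are independent of one another."""
    uncovered = 1.0
    for freqs in freqs_by_locus.values():
        uncovered *= 1.0 - locus_coverage(freqs)
    return 1.0 - uncovered


# Hypothetical frequencies of epitope-binding alleles in one population.
freqs_by_locus = {"HLA-A": [0.25, 0.15], "HLA-B": [0.10]}

coverage = population_coverage(freqs_by_locus)
print(f"{coverage:.2%}")
```

Even modest allele frequencies add up quickly under this model, which is one reason epitopes that bind many alleles tend to reach high coverage values.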
2.5 Molecular Docking Studies of HLA Epitope
To study the molecular interactions between the high-priority
T-cell epitopes and HLA molecules, the three-dimensional structure of
each epitope is a prerequisite; these structures were modeled with
PEP-FOLD [17], a hidden Markov model-based method. HLA-A∗11:01 was
selected as the target for docking simulations, since all five
high-priority T-cell epitopes interacted with HLA-A∗11:01 with
varying affinities. The crystal structure of HLA-A∗11:01 in complex
with a SARS nucleocapsid peptide (PDB Id: 1X7Q) was reduced to the
HLA-A∗11:01 molecule alone. Hex, which considers both shape and
electrostatic criteria, is a fast Fourier transform (FFT)-based
protein docking server [18]. Hex was used to dock HLA-A∗11:01 with
all five high-priority epitopes, and the best conformation was
selected based on the Etotal (binding affinity) value. HLA-A∗11:01
binds epitopes LICFFTLSY, PLNPLILLY, PIVNLFLLY, SVSVFIFLF, and
SVFIFLFIY with docking energies of -405.02, -404.60, -408.70,
-400.57, and -446.00, respectively. These docking results indicate
that all five peptides bind efficiently to the HLA allele. As shown
in Fig. 2, the docked complex of epitope SVFIFLFIY and the
interactions involved were visualized in PyMOL [19] and LigPlot [20],
respectively.
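Ranking the docked complexes by energy is a one-line operation once the values are tabulated. The sketch below uses the five docking energies reported in this section; the variable names are illustrative, and it assumes that a more negative energy corresponds to a more favorable pose.

```python
# Docking energies for HLA-A*11:01, as reported in the text above.
docking_energy = {
    "LICFFTLSY": -405.02,
    "PLNPLILLY": -404.60,
    "PIVNLFLLY": -408.70,
    "SVSVFIFLF": -400.57,
    "SVFIFLFIY": -446.00,
}

# Sort from most negative (assumed most favorable) to least negative.
ranked = sorted(docking_energy.items(), key=lambda item: item[1])
best_epitope, best_energy = ranked[0]

for epitope, energy in ranked:
    print(f"{epitope}  {energy:8.2f}")
```

On these values SVFIFLFIY ranks first, which matches the complex chosen for visualization in Fig. 2.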
In summary, epitope set II of the high-priority T-cell epitopes has
high antigenicity values and the maximum number of interacting MHC
alleles. SVSVFIFLF bound the largest number of MHC alleles (27).
Epitope SVFIFLFIY also showed binding affinity (316.96 nM) for the
MHC supertype allele HLA-A∗02:01. Epitopes from both sets (I and II)
cover different populations effectively. The epitopes from set II,
SVSVFIFLF and SVFIFLFIY, had the highest antigenicity values, 3.0057
and 4.1539, respectively.
Table 3
Population coverage of predicted epitopes (set I and set II) based on MHC-I and MHC-II
restriction data
Population/area    Class I and II coverage (a)   Class I and II coverage (b)
World              97.58%                        95.57%
East Asia          93.44%                        96.01%
Northeast Asia     92.26%                        91.23%
South Asia         94.42%                        90.26%
Southeast Asia     87.54%                        93.05%
Southwest Asia     89.71%                        85.79%
Europe             99.23%                        97.25%
East Africa        86.70%                        85.47%
West Africa        88.20%                        88.68%
Central Africa     84.79%                        84.55%
North Africa       93.27%                        88.91%
South Africa       91.71%                        91.08%
West Indies        92.42%                        90.89%
North America      97.53%                        96.01%
Central America    14.41%                        4.85%
South America      89.09%                        87.93%
Oceania            80.85%                        95.57%
(a) Maximum population coverage by Europe. (b) Maximum population coverage by Europe. Set I represents
epitopes LICFFTLSY, PLNPLILLY, and PIVNLFLLY from protein with Uniprot Id A7ZGR5. Set II represents
epitopes SVSVFIFLF and SVFIFLFIY from protein with Uniprot Id A7ZGK4. (Reproduced by permission of The
Royal Society of Chemistry [47])
Fig. 2 (a) Docked complex of HLA-A∗11:01 with epitope SVFIFLFIY visualized in PyMOL and (b) the corresponding
interactions involved in binding visualized in LigPlot. (Reproduced by permission of The Royal Society of
Chemistry [47])
2.6 B-Cell Epitope Identification
For B-cell epitope prediction, we identified epitopes that could be
efficiently recognized by B lymphocytes, chosen on the criteria of
surface accessibility, flexibility, and hydrophilicity. To identify
potential antigens that can interact with B lymphocytes, the
BCPred [21] and AAP [22] algorithms on the BCPREDS server were used.
IEDB tools, viz., Emini surface accessibility prediction [23],
Karplus and Schulz flexibility prediction [24], and Parker
hydrophilicity prediction [25], were also used for the same purpose.
The combined predictions from both servers were used to generate a
list of epitopes. These epitopes were further checked for
antigenicity, which yielded five sequential B-cell epitopes with
antigenicity score > 0.9, as shown in Table 4. For prediction of
conformational B-cell epitopes, the CBTOPE server was used, which
uses an SVM-based model to predict the probability that a residue is
part of a B-cell epitope [26]. Residues predicted to be part of
conformational B-cell epitopes are listed in Table 5. We considered
residues with a DiscoTope score greater than 1. Conformational B-cell
epitopes are found more often in coil or turn regions than in helices
or beta sheets, and the majority of the antibody-interacting residues
predicted in our study also adopted a coil conformation. This
agreement lends further support to our results.
Table 4
Five most potential B-cell epitopes by combined predictions of AAP, BCPred, and IEDB tools (Emini
surface accessibility, Karplus and Schulz flexibility, Parker hydrophilicity), further filtered based on
their antigenicity values (VaxiJen score)
Uniprot Id  Start residue  End residue  Epitope            Length  VaxiJen score
A7ZGR5      345            352          PETHKSDN           8       1.1952
A7ZGR5      237            246          FKNQFNKKIT         10      0.9161
A7ZGR5      179            189          YSSTKNIHQQK        11      1.0160
A7ZTH5      272            279          YSHDNTRT           8       1.9700
A7ZTH5      303            319          EKRAEKIHELEEKEPRL  17      1.0362
Table 5
Potential conformational B-cell epitope residues predicted using the CBTOPE and DiscoTope servers, along
with the secondary structure conformation each residue adopts and the DiscoTope scores
(a dash marks an empty cell)
            CBTOPE                                                        DiscoTope
Uniprot Id  Start residue  End residue  Epitope  Length  Secondary        Residue no.  Residue  DiscoTope score
                                                         structure type
A7ZGR5      322            322          I        1       C                238          K        2.081
A7ZGR5      342            345          VLFP     4       CCCC             325          Q        1.718
A7ZGR5      349            351          KSD      3       CCC              327          I        1.920
A7ZGR5      404            405          IR       2       EC               328          N        3.520
-           -              -            -        -       -                363          L        2.118
-           -              -            -        -       -                366          K        1.393
A7ZGK4      17             17           V        1       C                -            -        -
A7ZGK4      136            139          PDKR     4       CCCC             -            -        -
A7ZTH5      291            293          GLK      3       HHH              -            -        -
A7ZTH5      324            324          P        1       C                -            -        -
A7ZTH5      410            410          S        1       C                -            -        -
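The coil preference described in this section can be tallied directly from the secondary structure strings reported for the CBTOPE residues in Table 5. The short sketch below does this; it assumes the usual three-state code (C for coil, E for strand, H for helix), which the table does not define explicitly, and the variable names are illustrative.

```python
from collections import Counter

# Secondary structure strings of the CBTOPE residues, copied from Table 5.
# Assumed three-state code: C = coil, E = strand, H = helix.
cbtope_segments = {
    "A7ZGR5": ["C", "CCCC", "CCC", "EC"],
    "A7ZGK4": ["C", "CCCC"],
    "A7ZTH5": ["HHH", "C", "C"],
}

# Count every residue state across all predicted segments.
counts = Counter(state
                 for segments in cbtope_segments.values()
                 for segment in segments
                 for state in segment)
total = sum(counts.values())
coil_fraction = counts["C"] / total

print(dict(counts), f"coil fraction = {coil_fraction:.0%}")
```

Under that reading of the codes, 16 of the 20 listed residues are in coil, which is the majority referred to in the text.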
In summary, the tools and databases employed in our approach
allowed us to identify a group of ETEC-specific vaccine candidates
with high potential to act as protective immunogens. The in silico
analysis supports the antigenic and protective potential of a subset
of the candidates. Further experimental validation of the ability of
the predicted epitopes to elicit humoral and cell-mediated immune
responses in vitro and in vivo is required before they can add to our
therapeutic arsenal against ETEC.
3 Other Databases and Tools
Most immunoinformatics applications revolve around the design of
epitope-based subunit vaccines, which hold promise as vaccine
candidates and hence play an important role in disease diagnosis and
management. To elicit a strong immune response, the identified
epitope must engage the humoral (B-cell) and cellular (T-cell) arms
of the immune system. Identifying the ideal vaccine candidate
essentially involves searching for immunogenic peptides that are
conserved across different stages of the pathogen and that elicit the
desired immune response. Accurately predicting such epitopes through
computational approaches would significantly reduce the time, cost,
and labor involved in experimental verification. Several databases
and tools have been developed to house and analyze the immunological
data associated with biological entities. Integrating large-scale
immunological data in databases with the tools developed for
interpreting those data would ultimately help in understanding immune
system functions and disease pathogenesis mechanisms. In this
section, we briefly describe databases and tools used at various
stages of epitope prediction that were not used in our approach.
3.1 B-Cell Epitope Prediction
B cells mediate humoral adaptive immunity by recognizing
solvent-exposed antigens by means of B-cell receptors (BCRs), which
are present on the surface of B cells. BCRs contain membrane-bound
immunoglobulins, which upon activation are released in their soluble
forms and act against antigens, for example by neutralizing them or
tagging them for apoptosis. B-cell epitopes can be continuous (a
linear stretch of amino acids) or discontinuous/conformational (a set
of spatially separated amino acids that are brought into close
proximity as a result of protein folding). Although 90% of B-cell
epitopes are conformational, few data are available for such epitopes
because a structure is a prerequisite, which makes conformational
epitope prediction even more challenging. Bcipep is a collection of
3031 experimentally determined linear B-cell epitopes curated from
the literature and other public repositories [27]. Based on
immunogenicity, these epitopes are categorized into immunodominant,
immunogenic, and null-immunogenic epitopes. CED is another B-cell
epitope database, which consists of well-defined conformational
epitopes and related information, such as epitope residue composition
and location, structure and immunological properties, and the
antibody binding to the particular epitope, curated from the
published literature [28]. Epitome implements a semi-automated tool
for analyzing known antigen–antibody complex structures to identify
antigenic interactions [29]. Epitome houses these known
antigen–antibody complex structures, the residues involved in the
identified antigenic interactions, and their sequence/structure
environments.
B-cell epitope prediction servers are mainly based on
physicochemical properties of the amino acids, such as
hydrophilicity, surface accessibility, and flexibility, and on their
secondary structure content. BEPITOPE [30] predicts continuous B-cell
epitopes and provides prediction profiles, which enable the user to
exclude undesired predictions. BEPITOPE also allows whole genomes to
be processed and user-defined patterns to be searched. ABCpred is an
artificial neural network (ANN)-based model trained on epitope data
from the Bcipep database [31]. The Antigen–Antibody Interaction
Database (AgAbDb) is a resource of antigen–antibody interactions that
contains information on the interacting and buried residues involved
in the binding of antibody and peptide antigen [32]. AgAbDb also
offers a tool for designing mimotopes that represent an antigen.
BepiPred-2.0 [33] is a random forest-based model trained on crystal
structure data of antibody–antigen complexes; it outperforms its
earlier versions, which were trained and tested on linear peptides.
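Many of the property-based predictors mentioned above rest on the same basic operation: averaging a per-residue scale over a sliding window. The sketch below illustrates that operation with the Kyte-Doolittle hydropathy scale, which is swapped in here for illustration and is not the Parker hydrophilicity scale used by the IEDB tool. The function names are hypothetical, and the test sequence is an artificial concatenation of two peptides that appear in Tables 1 and 4.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Mean hydropathy of every window of the given length along seq."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def most_hydrophilic_window(seq, window=7):
    """Return (start index, segment, mean hydropathy) of the lowest-scoring window."""
    profile = hydropathy_profile(seq, window)
    start = min(range(len(profile)), key=profile.__getitem__)
    return start, seq[start:start + window], profile[start]


# Artificial test sequence: a hydrophobic peptide followed by a polar one.
sequence = "LICFFTLSY" + "PETHKSDN"
start, segment, score = most_hydrophilic_window(sequence)
print(start, segment, round(score, 3))
```

The most hydrophilic window falls inside the polar half of the sequence, as expected; a real predictor would combine several such profiles and apply a trained threshold.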
PEPITO [34] makes conformational epitope predictions by
combining amino acid propensity scores with side chain conformation
and solvent accessibility information in a linear equation. ElliPro
represents a protein structure as an ellipsoid and calculates
protrusion index scores for each residue lying outside the ellipsoid
boundary [35]. ElliPro allows visualization of the identified
epitopes in the protein structure. Epitopia [36] is a naïve Bayes
classifier-based model that calculates, for each solvent-accessible
residue, an immunogenicity score relative to other residues and a
probability score for the residue being part of an epitope. The
DiscoTope server is widely used for the prediction of conformational
epitopes [37]. DiscoTope makes predictions based on statistical
scores of amino acids, surface accessibility, and propensity scores
based on spatial neighborhood.
3.2 T-Cell Epitope Prediction
Antigens are translocated from the cytoplasm to the endoplasmic
reticulum (ER) through the action of TAP (transporter associated with
antigen presentation); there, the TAP transport protein dissociates
and the antigens are bound to MHC molecules. The antigen–MHC complex
then leaves the ER to be presented on the surface of
antigen-presenting cells (APCs). These MHC-bound antigens presented
on the surface of APCs are recognized by T cells by means of the
T-cell receptor (TCR). T-cell epitopes are bound to MHC class I
(MHC-I) and class II (MHC-II) molecules, which are recognized by
CD8+ and CD4+ T cells, respectively. To accurately model the binding
of antigens, which serve as potential T-cell epitopes, to TAP and MHC
molecules, we need well-characterized data; this is where T-cell
epitope databases come into play.
SYFPEITHI is a collection of more than 7000 peptide motifs from a
wide range of species, such as humans, mice, cattle, and apes [38].
All data on MHC ligands and peptide motifs are retrieved from the
published literature. The international ImMunoGeneTics database
(IMGT) is an integrated platform for nucleotide, protein, structural,
and genetic immunogenetics data on immunoglobulins, TCRs, and MHC
molecules of humans and other vertebrates [39]. The HLA ligand
database was created to include information on motifs and ligands for
MHC-I and MHC-II molecules [40]. The database can be used to build
new algorithms for the prediction of novel peptides. AntiJen is a
rich collection of over 24,000 entries of kinetic, thermodynamic,
functional, and cellular information on B-cell (both continuous and
conformational) and T-cell epitopes [41]. Recently, the database was
updated to include information on peptide libraries, copy numbers,
and diffusion coefficients. EPIMHC houses data on MHC-binding
peptides and T-cell epitopes retrieved from real proteins [42]. The
database includes entries for tumor-associated antigens as well.
MMBPred was designed to identify peptides in an antigenic sequence that
are either high-affinity or promiscuous (able to bind many HLA alleles)
[43]. RANKPEP employs position-specific scoring matrices (PSSMs) to
predict MHC-I and MHC-II binders, which serve as potential CD8 and CD4
T-cell epitopes, respectively [44]. A variability masking function is
also implemented so that RANKPEP favors conserved protein segments over
variable ones and hence predicts invariant T-cell epitopes, which are
less prone to mutation as an immune evasion mechanism. SVRMHC, based on
a support vector machine regression (SVR) model, is a sequence-based
quantitative method for predicting peptide binding affinities [45]. This
peptide-MHC-binding server computes a percent score for each submitted
peptide based on a dataset of ~528,500 peptides derived from randomly
picked proteins in the Swiss-Prot database. NetMHCpan is a server for
quantitative prediction of MHC-peptide binding affinities with an ANN as
the underlying model [46]. NetMHCpan is pan-specific in the sense that it
predicts the affinity of any peptide toward any human HLA-A or HLA-B
molecule. By translating sequence features of peptides and HLA molecules
into quantitative binding affinities, the method can cluster MHC
specificities and select novel MHC molecules using this information. It
can predict the binding affinity of peptides to any MHC molecule with
known sequence information.
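The PSSM-based scoring mentioned above for RANKPEP can be illustrated with a minimal Python sketch. The matrix values, the three-residue peptide length, and the function names are hypothetical and chosen only for illustration; they are not RANKPEP's actual profiles or code.

```python
from typing import Dict, List, Tuple

# Toy position-specific scoring matrix: one dict of residue scores per
# peptide position. All values are invented for illustration only.
TOY_PSSM: List[Dict[str, float]] = [
    {"A": 1.0, "L": 2.0},
    {"K": 1.5, "L": 0.5},
    {"V": 2.0, "A": -1.0},
]


def pssm_score(peptide: str, pssm: List[Dict[str, float]],
               default: float = 0.0) -> float:
    """Sum the per-position scores of a peptide; unseen residues get `default`."""
    if len(peptide) != len(pssm):
        raise ValueError("peptide length must equal the number of PSSM columns")
    return sum(column.get(res, default) for res, column in zip(peptide, pssm))


def rank_peptides(sequence: str,
                  pssm: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """Score every overlapping window of the sequence and sort best first."""
    k = len(pssm)
    scored = [(sequence[i:i + k], pssm_score(sequence[i:i + k], pssm))
              for i in range(len(sequence) - k + 1)]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling rank_peptides on an antigen sequence returns all overlapping peptides ordered by score, so the top entries are the candidate binders under the assumed matrix.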
4 Conclusion
The classical approach to vaccine identification was time- and
labor-intensive and had some inherent limitations. The field of
immunoinformatics, along with whole genome sequencing data and
bioinformatics tools, has profoundly helped researchers in analyzing
and modeling complex diseases. Immunoinformatics techniques enable
screening of an entire genome to identify vaccine candidates in a rapid
and reliable fashion. We applied these techniques to predict potential
vaccine candidates against ETEC. The epitopes predicted in the study are
potential vaccine candidates on the grounds of conservation, higher
population coverage, and the ability to interact with many HLA alleles
with an IC50 value of less than 500 nM [47]; however, they need to be
validated experimentally in vitro and in vivo. Such approaches are
general and hence can be applied to other hosts or other enteric
pathogens.
References
1. Kazi A, Chuah C, Majeed ABA et al (2018) Current progress of immunoinformatics approach harnessed for cellular- and antibody-dependent vaccine design. Pathog Glob Health 112(3):123–131
2. Evans MC (2008) Recent advances in immunoinformatics: application of in silico tools to drug development. Curr Opin Drug Discov Devel 11(2):233–241
3. Walker RI (2015) An assessment of enterotoxigenic Escherichia coli and Shigella vaccine candidates for infants and children. Vaccine 33(8):954–965
4. Pizza M, Scarlato V, Masignani V et al (2000) Identification of vaccine candidates against serogroup B meningococcus by Whole-Genome sequencing. Science 287(5459):1816–1820
5. Moriel DG, Bertoldi I, Spagnuolo A et al (2010) Identification of protective and broadly conserved vaccine antigens from the genome of extraintestinal pathogenic Escherichia coli. Proc Natl Acad Sci U S A 107(20):9072–9077
6. McCarthy A, Lindsay J (2010) Genetic variation in Staphylococcus aureus surface and immune evasion genes is lineage associated: implications for vaccine design and host-pathogen interactions. BMC Microbiol 10(1):173
7. Brocchieri L, Karlin S (2005) Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res 33(10):3390–3400
8. Butt AM, Nasrullah I, Tahir S et al (2012) Comparative genomics analysis of Mycobacterium ulcerans for the identification of putative essential genes and therapeutic candidates. PLoS One 7(8):e43080
9. Doytchinova I, Flower D (2007) VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinformatics 8(1):4
10. Yu NY, Wagner JR, Laird M et al (2010) PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics 26(13):1608–1615
11. Larsen MV, Lundegaard C, Lamberth K et al (2007) Large-scale validation of methods for cytotoxic T-lymphocyte epitope prediction. BMC Bioinformatics 8:424
12. Tenzer S, Peters B, Bulik S et al (2005) Modeling the MHC class I pathway by combining predictions of proteasomal cleavage, TAP transport and MHC class I binding. Cell Mol Life Sci 62(9):1025–1037
13. Guan P, Doytchinova IA, Zygouri C et al (2003) MHCPred: a server for quantitative prediction of peptide-MHC binding. Nucleic Acids Res 31(13):3621–3624
14. Thorpe C, Edwards L, Snelgrove R et al (2007) Discovery of a vaccine antigen that protects mice from Chlamydia pneumoniae infection. Vaccine 25(12):2252–2260
15. Bui HH, Sidney J, Dinh K et al (2006) Predicting population coverage of T-cell epitope-based diagnostics and vaccines. BMC Bioinformatics 7:153
16. Harris JA, Roy K, Woo-Rasberry V et al (2011) Directed evaluation of enterotoxigenic Escherichia coli autotransporter proteins as putative vaccine candidates. PLoS Negl Trop Dis 5(12):e1428
17. Thevenet P, Shen Y, Maupetit J et al (2012) PEP-FOLD: an updated de novo structure prediction server for both linear and disulfide bonded cyclic peptides. Nucleic Acids Res 40(Web Server issue):W288–W293
18. Macindoe G, Mavridis L, Venkatraman V et al (2010) HexServer: an FFT-based protein docking server powered by graphics processors. Nucleic Acids Res 38(Web Server issue):W445–W449
19. The PyMOL Molecular Graphics System (2010) Version 1.3r1. Schrodinger, LLC
20. Laskowski RA, Swindells MB (2011) LigPlot+: multiple ligand-protein interaction diagrams for drug discovery. J Chem Inf Model 51(10):2778–2786
21. El-Manzalawy Y, Dobbs D, Honavar V (2008) Predicting linear B-cell epitopes using string kernels. J Mol Recognit 21(4):243–255
22. Chen J, Liu H, Yang J et al (2007) Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids 33(3):423–428
23. Emini EA, Hughes JV, Perlow DS et al (1985) Induction of hepatitis A virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol 55(3):836–839
24. Karplus PA, Schulz GE (1985) Prediction of chain flexibility in proteins. Naturwissenschaften 72(4):212–213
25. Parker JMR, Guo D, Hodges RS (1986) New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 25(19):5425–5432
26. Ansari HR, Raghava GPS (2010) Identification of conformational B-cell Epitopes in an antigen from its primary sequence. Immunome Res 6:6
27. Saha S, Bhasin M, Raghava GPS (2005) Bcipep: a database of B-cell epitopes. BMC Genomics 6:79
28. Huang J, Honda W (2006) CED: a conformational epitope. BMC Immunol 7:7
29. Schlessinger A, Ofran Y, Yachdav G et al (2006) Epitome: database of structure-inferred antigenic epitopes. Nucleic Acids Res 34:D777–D780
30. Odorico M, Pellequer JL (2003) BEPITOPE: predicting the location of continuous epitopes and patterns in proteins. J Mol Recognit 16(1):20–22
31. Saha S, Raghava GPS (2006) Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins 65(1):40–48
32. Kulkarni-Kale U, Raskar-Renuse S, Natekar-Kalantre G et al (2014) Antigen–antibody interaction database (AgAbDb): a compendium of antigen–antibody interactions. In: De R, Tomar N (eds) Immunoinformatics, Methods in molecular biology (methods and protocols), vol 1184. Humana Press, New York, pp 149–164
33. Jespersen MC, Peters B, Nielsen M et al (2017) BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res 45(W1):W24–W29
34. Sweredoski MJ, Baldi P (2008) PEPITO: improved discontinuous B-cell epitope prediction using multiple distance thresholds and half sphere exposure. Bioinformatics 24(12):1459–1460
35. Ponomarenko J, Bui HH, Li W et al (2008) ElliPro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinformatics 9:514
36. Rubinstein ND, Mayrose I, Martz E et al (2009) Epitopia: a web-server for predicting B-cell epitopes. BMC Bioinformatics 10:287
37. Kringelum JV, Lundegaard C, Lund O et al (2012) Reliable B cell epitope predictions: impacts of method development and improved benchmarking. PLoS Comput Biol 8(12):e1002829
38. Rammensee H, Bachmann J, Emmerich NP et al (1999) SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 50(3–4):213–219
39. Lefranc MP (2001) IMGT, the international ImMunoGeneTics database. Nucleic Acids Res 29(1):207–209
40. Sathiamurthy M, Hickman HD, Cavett JW et al (2003) Population of the HLA ligand database. Tissue Antigens 61(1):12–19
41. Toseland CP, Clayton DJ, McSparron H et al (2005) AntiJen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data. Immunome Res 1(1):4
42. Reche PA, Zhang H, Glutting JP et al (2005) EPIMHC: a curated database of MHC-binding peptides for customized computational vaccinology. Bioinformatics 21(9):2140–2141
43. Bhasin M, Raghava GP (2003) Prediction of promiscuous and high-affinity mutated MHC binders. Hybrid Hybridomics 22(4):229–234
44. Reche PA, Glutting JP, Zhang H et al (2004) Enhancement to the RANKPEP resource for the prediction of peptide binding to MHC molecules using profiles. Immunogenetics 56(6):405–419
45. Wan J, Liu W, Xu Q et al (2006) SVRMHC prediction server for MHC-binding peptides. BMC Bioinformatics 7:463
46. Jurtz V, Paul S, Andreatta M et al (2017) NetMHCpan-4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. J Immunol 199(9):3360–3368
47. Mehla K, Ramana J (2016) Identification of epitope-based peptide vaccine candidates against enterotoxigenic Escherichia coli: a comparative genomics and immunoinformatics approach. Mol BioSyst 12(3):890–901
Chapter 7
Vaccine Design Against Leptospirosis Using an Immunoinformatic Approach
Kumari Snehkant Lata, Vibhisha Vaghasia,
Shivarudrappa Bhairappanvar, Saumya Patel,
and Jayashankar Das
Abstract
Vaccination is the best way to prevent the spread of emerging or reemerging infectious diseases. Current
research on vaccine development is mainly focused on recombinant, subunit, and peptide-based vaccines.
Immunoinformatics has proven to be a powerful method for identifying potential vaccine candidates by
analyzing immunodominant B- and T-cell epitopes. This method can greatly reduce the time and cost of
experiments by reducing the number of vaccine candidates that must be tested experimentally for efficacy.
This chapter describes the use of immunoinformatics and molecular docking methods to screen potential
vaccine candidates, taking Leptospira as a model.
Key words Immunoinformatics, Immunogenicity, Leptospirosis, Outer membrane protein, Epitopes,
Vaccine candidate, Molecular docking, Simulation, Binding interaction
1 Introduction
Leptospirosis is a widespread zoonotic disease caused by infection
with pathogenic species of Leptospira [1–3]. The pathogenic species
comprise more than 250 antigenically distinct serovars [3, 4]. This
antigenic diversity makes eradicating the disease a challenge for
researchers. The global burden of this disease is increasing year by
year [3]. Vaccination is usually considered the most feasible way to
eradicate infectious diseases. Generally, a vaccine is made from the
whole-cell inactivated or killed form of the disease-causing
microorganism, or from its toxins or immunogenic proteins. In some
cases, vaccines derived from the whole pathogen may cause adverse
effects, provide only short-term immunity, and fail to provide a broad
spectrum of protection across multiple strains of the pathogen
[3, 5, 6]. Advances in the molecular understanding of antigen
presentation have led to the

development of peptide- and epitope-based vaccines. Usually, a
peptide-based vaccine relies on the identification of immunodominant
B- and T-cell epitopes from immunogenic proteins, which can induce a
specific immune response. An epitope, also called an antigenic
determinant, is the portion of an antigen that is recognized by
particular receptors on the surface of immune cells, mainly B and/or
T cells. Epitopes recognized by B cells are known as B-cell epitopes,
whereas epitopes recognized by T cells are known as T-cell epitopes.
Current research on peptide vaccines is mostly focused on outer
membrane proteins (OMPs), because these proteins play a major role in
the interaction of pathogens with host cells and are probably
associated with pathogenesis [3, 7]. The availability of genomic,
proteomic, and immunological data, together with advances in
computational algorithms, has improved the identification of
immunodominant outer membrane proteins and thereby of potential vaccine
candidates; this field of study is known as immunoinformatics [8–11].
Immunoinformatics is attracting growing interest in vaccine development
because it uses genome- and proteome-based information and offers a
high level of confidence in the prediction of potential vaccine
candidates [3, 12, 13]. Recently, the approach has been broadly
accepted for screening potential immunogens for vaccine design against
infectious diseases. Identification of epitopes in an antigen has
become key to developing subunit and peptide-based vaccines that confer
long-lasting protection against pathogens. An in silico approach can
guide the mapping of potential epitopes on antigens, whereas
conventional methods rely on pathogen cultivation and protein
extraction, and testing the resulting proteins on a large scale is
expensive and time-consuming [3]. Several in silico tools are now
available for predicting epitopes on a target antigen; they reduce time
and cost by shortening the list of potential epitopes for experimental
validation. Several vaccine candidates identified in silico have been
reported to produce promising results in preclinical and clinical
trials [14, 15]. This chapter outlines systematic methods for screening
potential vaccine candidates using an immunoinformatic approach. The
methods described below are general and can be applied to any target
disease.
A pictorial representation of the workflow is shown in Fig. 1.

Epitope prediction methods
Fig. 1 Workflow representing the key steps in screening of potential B- and T-cell epitopes. The conforma-
tional B-cell epitope depicted in this flowchart was predicted from 3D structure (PDB ID: 2ZZ8) of LipL32
protein of Leptospira. Structure shown here for visualization of binding interaction was downloaded from RCSB
PDB (PDB ID: 1B0G)
2.1 Retrieval of the Target Sequences
The first step in the immunoinformatic approach to vaccine design is
to retrieve the protein sequences or the whole proteome in FASTA
format. Individual protein sequences can be extracted from UniProtKB or
the NCBI database, and whole proteome sequences can be retrieved from
the UniProt Proteomes database. Once the protein sequences are
obtained, they are screened for immunogenicity.
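Once a FASTA file has been downloaded, it usually needs to be read into memory before any screening step. The following is a minimal Python sketch of a FASTA reader; the function name, the choice of the first header token as identifier, and the example records are assumptions made for illustration, not part of the original protocol.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {identifier: sequence}.

    The identifier is taken as the first whitespace-delimited token of
    each header line (an assumption; other conventions are possible).
    """
    records: Dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line  # sequences may span several lines
    return records


# Invented example records, for illustration only.
EXAMPLE = """>protA hypothetical protein
MKKLLA
VAGLT
>protB
MSTNPK
"""
```

Calling parse_fasta(EXAMPLE) yields a dictionary that later steps can iterate over, one protein at a time.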
2.2 Identification of Immunogenic Proteins
Immunogenicity refers to the ability of an antigen to elicit an immune
response. The selection of an optimal immunogen is the first step in
vaccine design; hence, to identify the most probable immunogenic
proteins, the whole proteome of Leptospira was submitted to the VaxiJen
v2.0 server [16] (http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html),
which was developed for the prediction of protective antigens and
subunit vaccines, with an accuracy of 70 to 89% (see Note 1). As input,
a protein sequence can be pasted in plain format, or a list of protein
sequences can be uploaded in FASTA format. A target organism can also
be selected based on the source of the antigens; for Leptospira, we
selected bacteria as the target organism. The overall score reflects
the potential of each protein sequence to induce an immune response.
Proteins with higher scores are predicted to be more immunogenic.
2.3 Identification of Outer Membrane Protein (OMP)
The localization of a protein plays a vital role in determining its
function. A potential immunogen has to be easily recognized by immune
cells. Outer membrane proteins are surface-exposed, are easily
recognized by the host immune system, and are involved in the
interaction between bacterial cells and their host [17, 18]. In
pathogenic bacteria, OMPs have proven to be the most promising vaccine
candidates because of their interaction with host immune cells; hence,
identification of OMPs is crucial for reliable and rapid vaccine
development [3]. Therefore, the protein sequences predicted as
antigenic by the VaxiJen server were submitted to the CELLO v.2.5
server [19, 20] (http://cello.life.nctu.edu.tw/) to retrieve the outer
membrane proteins (see Note 1). CELLO uses a machine learning method, a
support vector machine algorithm, to predict the localization of
proteins.
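The two screening steps above (antigenicity score, then subcellular localization) amount to a simple filter over tabulated server results. The sketch below assumes the results have already been collected by hand into records; the record fields, the label "OuterMembrane", and the cutoff passed in by the caller are hypothetical and do not reproduce the actual output format of VaxiJen or CELLO.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProteinPrediction:
    protein_id: str
    antigenicity_score: float  # overall score transcribed from the antigenicity server
    localization: str          # predicted compartment transcribed from the localization server


def select_omp_candidates(predictions: List[ProteinPrediction],
                          score_cutoff: float,
                          omp_label: str = "OuterMembrane") -> List[ProteinPrediction]:
    """Keep proteins at or above the cutoff that are predicted outer membrane.

    Results are sorted by descending antigenicity score. The cutoff is a
    user choice; no particular value is implied by the protocol text.
    """
    hits = [p for p in predictions
            if p.antigenicity_score >= score_cutoff
            and p.localization == omp_label]
    return sorted(hits, key=lambda p: p.antigenicity_score, reverse=True)
```

Keeping the cutoff as an explicit argument makes the screening criterion visible and easy to report alongside the results.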
2.4 B-Cell Epitope Prediction
A B-cell epitope is the antigenic region of an antigen that is
recognized by the B-cell receptors of the immune system and is able to
stimulate a humoral immune response, which causes B lymphocytes to
differentiate into antibody-secreting plasma cells and memory cells
[21]. After activation, plasma cells secrete antigen-specific
antibodies and circulate to the bone marrow, where they can encounter
the antigen. Memory B cells are distributed throughout the body and
respond quickly to eliminate the antigen if it is encountered again
[22]. B-cell epitopes can be categorized as linear (continuous) or
conformational (discontinuous) based on their spatial structure.
2.4.1 Linear B-Cell Epitope
A linear B-cell epitope is a consecutive sequence of amino acids on an
antigen. B-cell epitopes can be predicted using the IEDB analysis
resource, where the protein sequence is pasted in plain format or a
Swiss-Prot ID is provided and an appropriate prediction method is
selected. Linear B-cell epitopes can be predicted using the BepiPred
prediction method [23]. Antigenic B-cell epitopes were predicted by the
Kolaskar and Tongaonkar method at the Immune Epitope Database (IEDB)
analysis resource (http://tools.iedb.org/main/bcell/) [24]. This method
predicts antigenic peptides by analyzing the physicochemical properties
of amino acid residues and their abundance in experimentally determined
antigenic epitopes [3, 24]. The accuracy of this method in predicting
epitopes is about 75%. Surface accessibility, flexibility, and
hydrophilicity are also main characteristics of B-cell epitopes [25];
hence, to predict these properties, the Emini surface accessibility
[26], Karplus and Schulz flexibility [27], and Parker hydrophilicity
[28] prediction methods were employed, respectively, with the default
parameters of the IEDB analysis resource (see Note 2).
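Scale-based methods of this kind generally assign each residue a propensity value and smooth it over a sliding window. The following minimal Python sketch shows that general idea only; the window size, the two-letter toy scale, and the function names are assumptions for illustration and are not the published scales or the IEDB implementation.

```python
from typing import Dict, List, Tuple


def window_profile(sequence: str, scale: Dict[str, float],
                   window: int = 7) -> List[Tuple[int, float]]:
    """Return (center_index, mean_score) for every full window.

    Indices are 0-based; the center of a window starting at `start`
    is taken as start + window // 2.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    half = window // 2
    profile = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        mean = sum(scale[res] for res in segment) / window
        profile.append((start + half, mean))
    return profile


def residues_above(profile: List[Tuple[int, float]],
                   threshold: float) -> List[int]:
    """Positions whose smoothed score exceeds the chosen threshold."""
    return [pos for pos, score in profile if score > threshold]


# Toy scale with invented values; real scales cover all 20 amino acids.
TOY_SCALE = {"A": 0.0, "K": 1.0}
```

Runs of consecutive positions returned by residues_above would correspond to candidate linear epitope regions under the assumed scale and threshold.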
2.4.2 Conformational B-Cell Epitope
A conformational B-cell epitope consists of amino acid residues that
are discontinuous in the protein sequence but lie in close proximity in
the 3D structure of the protein (see Note 3). Conformational B-cell
epitopes in the 3D structure were predicted using ElliPro [29]. This
tool predicts epitopes based on the geometrical properties of the
protein structure, and it discriminates predicted epitopes from
non-epitopes on the basis of known protein-antibody complexes.
Conformational B-cell epitopes with a protrusion index (PI) value above
0.7 were selected. The PI score reflects the percentage of protein
atoms that extend beyond the molecular bulk and are responsible for
antibody binding [29].
2.5 T-Cell Epitope
Unlike B cells, T cells do not recognize antigen directly; the antigen
is first processed by antigen-presenting cells (APCs), e.g., dendritic
cells, B cells, or macrophages, and then presented to the T-cell
receptor (TCR) by the major histocompatibility complex (MHC) [22, 30].
There are mainly two types of T-cell epitopes: cytotoxic T-lymphocyte
(CTL) and helper T-cell (HTC) epitopes. A T cell expresses a cluster of
differentiation (CD) receptor on its surface that recognizes the
antigen presented on an MHC molecule [8]. CTLs express the CD8+
receptor and recognize peptides presented by MHC class I molecules,
while HTCs express the CD4+ receptor, which recognizes the MHC class
II-antigen complex [8].
2.5.1 Helper T-Cell (HTC) Epitope
Helper T cells are crucial for activating efficient humoral and
cell-mediated immune responses by stimulating the differentiation and
proliferation of B cells and cytotoxic T cells [31]. The binding of
epitopes complexed with MHC class II to the T-cell receptor can
activate the HTC response. Hence, to predict MHC class II-restricted
HTC epitopes, the protein sequences were submitted to the NetMHCIIpan
3.1 server (http://www.cbs.dtu.dk/services/NetMHCIIpan/), with
threshold values set to 0.5% and 2% for strong binding peptides (SB)
and weak binding peptides (WB), respectively, to determine the binding
affinities between epitopes and MHC-II alleles [32]. NetMHCIIpan is one
of the most accurate prediction tools; it covers all human leukocyte
antigen (HLA) class II molecules and is based on an artificial neural
network algorithm. In this tool, protein sequences can be uploaded or
pasted in FASTA format, and HLA loci can be selected from the drop-down
option. The binding strength of an HTC epitope to HLA molecules is
expected to be a key factor in the immunogenicity of the T-cell
epitope, and a good T-cell epitope candidate should bind the maximum
number of HLA alleles to achieve broader population coverage [33, 34].
Therefore, strong binder epitopes with an IC50 value <50 nM and
epitopes binding the maximum number of HLA alleles were considered
putative HTC epitopes for vaccine candidates.
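The selection logic described here (label each peptide-allele pair as a strong or weak binder, then prefer peptides that bind many alleles) can be sketched in Python as follows. The tuple layout, the function names, the inclusive treatment of the threshold boundaries, and the example peptide and allele labels are assumptions for illustration; the sketch does not parse real NetMHCIIpan output.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def classify_binder(percent_rank: float,
                    strong_cutoff: float = 0.5,
                    weak_cutoff: float = 2.0) -> str:
    """Label a prediction SB, WB, or NB from its rank value (in percent).

    The cutoffs mirror the 0.5% and 2% thresholds quoted in the text;
    treating the boundaries as inclusive is an assumption.
    """
    if percent_rank <= strong_cutoff:
        return "SB"
    if percent_rank <= weak_cutoff:
        return "WB"
    return "NB"


def promiscuity(predictions: List[Tuple[str, str, float]]) -> List[Tuple[str, int]]:
    """Count, per peptide, the distinct alleles bound as strong binders.

    `predictions` holds (peptide, allele, percent_rank) tuples. The result
    is sorted by descending allele count, then by peptide for stability.
    """
    alleles: Dict[str, Set[str]] = defaultdict(set)
    for peptide, allele, rank in predictions:
        if classify_binder(rank) == "SB":
            alleles[peptide].add(allele)
    counts = [(peptide, len(bound)) for peptide, bound in alleles.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))
```

Peptides at the top of the returned list bind the largest number of alleles and would be the first candidates for the population coverage argument made above.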
2.5.2 Cytotoxic T-Lymphocyte (CTL) Epitope
Consistent prediction of CTL epitopes is very important for coherent
vaccine design. Because humoral immunity is sometimes not sufficient to
completely clear the infection, cell-mediated immunity is required to
induce cell death and completely destroy the bacterial habitat.
Although pathogenic Leptospira is not considered a typical
intracellular pathogen, some bacterial proteins may be able to escape
from the phagolysosome, reach the cytosol of host cells, and be exposed
to the host CD8+ T-cell response [3, 35]. Hence, the presence of CTL
epitopes in OMPs was predicted using the NetCTL 1.2 server
(http://www.cbs.dtu.dk/services/NetCTL), with default parameters [36].
This server predicts epitopes by integrating predictions of MHC class I
binding, proteasomal C-terminal cleavage, and TAP transport efficiency.
MHC class I binding and proteasomal C-terminal cleavage are predicted
by artificial neural networks, while a weight matrix is used to predict
TAP transport efficiency. The tool offers an option to select the MHC-I
supertype; one supertype can be selected at a time from the drop-down
option. As output, it generates 9-mer epitope sequences and their
respective scores for C-terminal cleavage, TAP transport efficiency,
threshold of epitope identification, and overall combined score.
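To make the idea of an integrated score concrete, the sketch below enumerates the overlapping 9-mers of a sequence and combines three per-peptide scores as a weighted sum. The weighted-sum form, the weights used in any call, and the function names are assumptions for illustration; the text does not specify how the server combines its predictions, and this is not the server's code.

```python
from typing import List


def enumerate_kmers(sequence: str, k: int = 9) -> List[str]:
    """All overlapping peptides of length k, in sequence order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def combined_score(mhc_binding: float, cleavage: float, tap: float,
                   w_cleavage: float, w_tap: float) -> float:
    """Weighted sum of the three component predictions.

    The weights are supplied by the caller; no default values are
    implied here (hypothetical integration scheme).
    """
    return mhc_binding + w_cleavage * cleavage + w_tap * tap


def is_epitope(score: float, threshold: float) -> bool:
    """Call a peptide an epitope when its combined score reaches the threshold."""
    return score >= threshold
```

In practice each 9-mer returned by enumerate_kmers would be paired with its three predicted component scores before the combination step.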
2.6 Immunogenicity Prediction of T-Cell Epitopes
Peptides with strong immunogenicity are more likely to be T-cell
epitopes than those with weak immunogenicity. Therefore, the
immunogenicity of putative T-cell epitopes was evaluated using the IEDB
immunogenicity prediction tools. The CD4 T-cell immunogenicity
prediction method at IEDB (http://tools.iedb.org/CD4episcore/) was used
to predict the immunogenicity of HTC epitopes [37]. Here, the HTC
epitope sequences are submitted, and the prediction method and maximum
percentile rank threshold can be selected. The default is the IEDB
combined method, which reports a final score that combines the
seven-allele and immunogenicity prediction methods. The immunogenicity
of CTL epitopes, on the other hand, was predicted using the class I
immunogenicity method with default parameters at the IEDB T-cell
analysis tool (http://tools.iedb.org/immunogenicity/) [38].
2.7 Allergenicity Prediction
We also predicted the allergenicity of the predicted epitopes, because
a potential vaccine candidate should be non-allergenic. To predict the
allergenicity of epitopes, the AllerHunter server
(http://tiger.dbs.nus.edu.sg/AllerHunter) was used, which is based on a
support vector machine (SVM) and pairwise sequence similarity [39] (see
Note 4). AllerHunter predicts allergens as well as non-allergens with
high sensitivity and specificity and efficiently distinguishes
allergens from allergen-like non-allergen sequences, which makes
AllerHunter a very useful tool for allergen prediction.
2.8 Conservancy Analysis
In epitope-based vaccine design, selecting conserved epitopes is
crucial for providing a broad spectrum of protection across several
serovars, strains, or species of Leptospira. To evaluate the
conservancy of the predicted epitopes among the available strains of
Leptospira, we first downloaded the orthologous protein sequences of
the target protein for all strains from NCBI and saved all sequences in
a single file (in FASTA format). We then analyzed the conservancy of
the predicted epitopes among the orthologous sequences using the
epitope conservancy analysis tool at the IEDB analysis resource
(http://tools.immuneepitope.org/tools/conservancy) [40]. This tool
calculates the degree of conservancy of an epitope within a provided
protein sequence set at different degrees of identity. The degree of
conservancy is defined as the fraction of protein sequences that
contain the epitope at a specified identity level.
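The definition of conservancy given above can be written out directly. The Python sketch below compares the epitope against every ungapped window of each protein and counts the proteins in which the best window reaches the requested identity; the ungapped comparison, the function names, and the example sequences in the test are assumptions for illustration and do not reproduce the IEDB tool itself.

```python
from typing import List


def best_identity(epitope: str, protein: str) -> float:
    """Highest fraction of matching positions over all ungapped windows."""
    k = len(epitope)
    if k == 0 or len(protein) < k:
        return 0.0
    best = 0
    for i in range(len(protein) - k + 1):
        window = protein[i:i + k]
        matches = sum(a == b for a, b in zip(epitope, window))
        best = max(best, matches)
    return best / k


def conservancy(epitope: str, proteins: List[str],
                identity: float = 1.0) -> float:
    """Fraction of proteins containing the epitope at >= `identity`."""
    if not proteins:
        raise ValueError("at least one protein sequence is required")
    hits = sum(best_identity(epitope, p) >= identity for p in proteins)
    return hits / len(proteins)
```

For example, an epitope found exactly in one of three orthologous sequences has a conservancy of one third at 100% identity, and the value can only stay the same or rise as the identity level is lowered.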
2.9 Molecular Binding Interaction Analysis of Predicted Epitopes with HLA Molecules
A molecular docking study was performed to examine the molecular
binding interaction between HLA molecules and our predicted T-cell
epitopes. A docking study requires the three-dimensional (3D)
structures of the predicted epitopes and the HLA molecules. Since the
HLA-A2 allele is one of the most frequent MHC class I alleles in most
human populations, we downloaded the 3D structure of the HLA-A2 allele
from the RCSB Protein Data Bank (https://www.rcsb.org/) [41, 42].
2.9.1 3D Structure Prediction of T-Cell Epitopes
The 3D structures of all predicted T-cell epitopes, excluding the
allergenic ones, were modeled with the PEP-FOLD3 server
(http://mobyle.rpbs.univ-paris-diderot.fr/cgi-bin/portal.py#forms::PEP-FOLD3)
using 200 simulation runs [43]. The PEP-FOLD3 server first clusters the
different conformational models of a given epitope and then sorts them
by sOPEP energy value. The best ranked model was selected to analyze
the interactions with the selected HLA molecules.
2.9.2 Molecular Docking
Molecular docking between the HLA molecule and the predicted T-cell
epitopes was performed using the PatchDock rigid-body docking server
(https://bioinfo3d.cs.tau.ac.il/PatchDock/php.php). In this tool, the
3D structures of the receptor and ligand molecules are uploaded; here,
the HLA molecule is the receptor and the epitope is the ligand. The
tool computes complexes with good molecular shape complementarity based
on the geometry of the molecules. The PatchDock output contains a list
of predicted complex structures ranked by score. The best ranked docked
complex was further refined using the FireDock (Fast Interaction
Refinement in Molecular Docking) server
(http://bioinfo3d.cs.tau.ac.il/FireDock/php.php) [44, 45] (see Notes 1
and 5). The FireDock output includes the ten best solutions for the
final refined complex, based on the binding score. The tool ranks the
refined complexes based on global energy, attractive and repulsive van
der Waals forces, atomic contact energy, and hydrogen bonding
interaction scores. The complex with the lowest global energy was
ranked first and considered the best-suited conformation for complex
formation. A screenshot of FireDock result output, taken from the help
file, is shown in Fig. 2.
Fig. 2 Screenshot of refined docking complex, created from help file of FireDock
tool. The complexes are ranked based on their global energy
2.9.3 Molecular Dynamics Simulations
The binding stability of the docked epitope-HLA complexes was checked
by molecular dynamics simulations using GROMACS v2016.3 software [46].
For each docked epitope-HLA complex, a production simulation of 5 ns at
a temperature of 300 K and a pressure of 1 bar was run after stepwise
energy minimization and equilibration of the solvated system with the
TIP3P water model. Trajectory analysis was then performed to examine
hydrogen bonding and root mean square deviation (RMSD) [3].
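For readers unfamiliar with the RMSD measure used in the trajectory analysis, the following minimal Python sketch computes it for two equally sized sets of atomic coordinates. It deliberately omits the structural superposition (fitting) step that trajectory analysis tools usually apply before the calculation; the function name and the coordinate representation are assumptions for illustration, not GROMACS code.

```python
import math
from typing import Sequence, Tuple

Coordinate = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coordinate],
         coords_b: Sequence[Coordinate]) -> float:
    """Root mean square deviation between matched atoms.

    Both inputs must list the same atoms in the same order. No
    least-squares fitting is performed here, so the value reflects the
    coordinates exactly as given.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Computing this value for every frame against the starting structure gives the RMSD curve that is typically inspected to judge whether a complex remains stable during the simulation.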
2.9.4 Visualization of Interaction
Hydrogen bond interactions in the docked complex can be analyzed with
the molecular visualization tools UCSF Chimera 1.11.2 or PyMOL
[47, 48]. We used Chimera to visualize the hydrogen bond interactions
between the HLA molecule and the epitope (see Note 6).
3 Notes
1. Users are advised to use the most recent versions of the tools and
servers, as these are continuously updated with improved prediction
algorithms and datasets.
2. As an alternative to the tools mentioned above for B-cell epitope
prediction, you can use the BcePred Prediction Server
(http://crdd.osdd.net/raghava/bcepred/). This tool gives scores for
hydrophilicity, turns, surface exposure, flexibility, polarity,
accessibility, and antigenic propensity for each residue of the
protein.
3. For predicting conformational B-cell epitopes, a 3D structure of the
target protein is required. The 3D structure can be downloaded from the
RCSB PDB database (https://www.rcsb.org/), if available. Otherwise, you
can model the 3D structure of the target protein using Modeller
(https://salilab.org/modeller/) or the I-TASSER tool
(https://zhanglab.ccmb.med.umich.edu/I-TASSER/). Modeller is a command
line tool, whereas I-TASSER is a web server.
4. As an alternative, the allergenicity of epitopes can be predicted
using the AllergenFP (http://ddg-pharmfac.net/AllergenFP/) and AlgPred
(http://crdd.osdd.net/raghava/algpred/submission.html) tools.
5. Before using any web server or tool, users are advised to read its
help/FAQ section to understand the stepwise procedure, the basic
principles, and the usage of parameters.
Fig. 3 Parameters for visualization of H-bond between epitope and HLA molecule. The 3D structure of HLA-A2-
peptide complex (visualized in this image) was downloaded from RCSB PDB (PDB ID: 1B0G)
6. In Chimera, first open the PDB file of the docked complex by
browsing to the file location. Then, in the Select menu, choose
Structure → protein. Go to the Tools menu → Structure Analysis →
FindHBond. For the H-bond parameters, see Fig. 3.
References
1. Vijayachari P, Sugunan AP, Shriram AN (2008) Leptospirosis: an emerging global public health problem. J Biosci 33(4):557–569
2. Adler B, de la Peña Moctezuma A (2010) Leptospira and leptospirosis. Vet Microbiol 140(3–4):287–296
3. Lata KS, Kumar S, Vaghasia V et al (2018) Exploring Leptospiral proteomes to identify potential candidates for vaccine design against leptospirosis using an immunoinformatics approach. Sci Rep 8(1):6935
4. Levett PN (2015) Systematics of leptospiraceae. In: Leptospira and leptospirosis. Springer, Berlin, Heidelberg, pp 11–20
5. Adler B (2015) Vaccines against leptospirosis. In: Leptospira and leptospirosis. Springer, Berlin, Heidelberg, pp 251–272
6. Ellis WA (2015) Animal leptospirosis. In: Leptospira and leptospirosis. Springer, Berlin, Heidelberg, pp 99–137
7. Wang Z, Jin L, Węgrzyn A (2007) Leptospirosis vaccines. Microb Cell Factories 6(1):39
8. Tomar N, De RK (2010) Immunoinformatics: an integrated scenario. Immunology 131(2):153–168
9. De Gregorio E, Rappuoli R (2012) Vaccines for the future: learning from human immunology. Microb Biotechnol 5(2):149–155
10. Patronov A, Doytchinova I (2013) T-cell epitope vaccine design by immunoinformatics. Open Biol 3(1):120139
11. Yang X, Yu X (2009) An introduction to epitope prediction methods and software. Rev Med Virol 19(2):77–96
12. Dellagostin O, Grassmann A, Rizzi C et al (2017) Reverse vaccinology: an approach for identifying leptospiral vaccine candidates. Int J Mol Sci 18(1):158
13. Chaudhuri R, Ramachandran S (2017) Immunoinformatics as a tool for new antifungal
vaccines. In: Kalkum M, Semis M (eds) Meth- 27. Karplus PA, Schulz GE (1985) Prediction of
ods and protocols. Methods in molecular biol- chain flexibility in proteins. Naturwissenschaf-
ogy, vol 1625. Springer, Heidelberg, pp 31–43 ten 72(4):212–213
14. Davies MN, Flower DR (2007) Harnessing 28. Parker JMR, Guo D, Hodges RS (1986) New
bioinformatics to discover new vaccines. Drug hydrophilicity scale derived from high-
Discov Today 12(9–10):389–395 performance liquid chromatography peptide
15. Groot ASD, Rappuoli R (2004) Genome- retention data: correlation of predicted surface
derived vaccines. Expert Rev Vaccines 3 residues with antigenicity and X-ray-derived
(1):59–76 accessible sites. Biochemistry 25
16. Doytchinova IA, Flower DR (2007) VaxiJen: a (19):5425–5432
server for prediction of protective antigens, 29. Ponomarenko J, Bui HH, Li W et al (2008)
tumour antigens and subunit vaccines. BMC ElliPro: a new structure-based tool for the pre-
Bioinformatics 8(1):4 diction of antibody epitopes. BMC Bioinfor-
17. Lin J, Huang S, Zhang Q (2002) Outer mem- matics 9(1):514
brane proteins: key players for bacterial adapta- 30. Sanchez-Trincado JL, Gomez-Perosanz M,
tion in host niches. Microbes Infect 4 Reche PA (2017) Fundamentals and methods
(3):325–331 for T-and B-cell epitope prediction. J Immunol
18. Rodrı́guez-Ortega MJ, Norais N, Bensi G et al Res 2017:2680160
(2006) Characterization and identification of 31. Chen K, Kolls JK (2013) T cell–mediated host
vaccine candidate proteins through analysis of immune defenses in the lung. Annu Rev
the group A Streptococcus surface proteome. Immunol 31:605–633
Nat Biotechnol 24(2):191 32. Karosiene E, Rasmussen M, Blicher T et al
19. Yu CS, Lin CJ, Hwang JK (2004) Predicting (2013) NetMHCIIpan-3. 0, a common
subcellular localization of proteins for Gram- pan-specific MHC class II prediction method
negative bacteria by support vector machines including all three human MHC class II iso-
based on n-peptide compositions. Protein Sci types, HLA-DR, HLA-DP and HLA-DQ.
13(5):1402–1406 Immunogenetics 65(10):711–724
20. Yu CS, Chen YC, Lu CH, Hwang JK (2006) 33. Lazarski CA, Chaves FA, Jenks SA et al (2005)
Prediction of protein subcellular localization. The kinetic stability of MHC class II: peptide
Proteins 64(3):643–651 complexes is a key parameter that dictates
21. Nair DT, Singh K, Siddiqui Z et al (2002) immunodominance. Immunity 23(1):29–40
Epitope recognition by diverse antibodies sug- 34. Weber CA, Mehta PJ, Ardito M et al (2009) T
gests conformational convergence in an anti- cell epitope: friend or foe? Immunogenicity of
body response. J Immunol 168(5):2371–2382 biologics in context. Adv Drug Deliv Rev 61
22. Paul WE (2012) Fundamental immunology. (11):965–976
Lippincott Williams & Wilkins, Philadelphia 35. Fraga TR, Barbosa AS, Isaac L (2011) Lepto-
23. Jespersen MC, Peters B, Nielsen M, Marcatili P spirosis: aspects of innate immunity, immuno-
(2017) BepiPred-2.0: improving sequence- pathogenesis and immune evasion from the
based B-cell epitope prediction using confor- complement system. Scand J Immunol 73
mational epitopes. Nucleic Acids Res 45(W1): (5):408–419
W24–W29 36. Larsen MV, Lundegaard C, Lamberth K et al
24. Kolaskar AS, Tongaonkar PC (1990) A semi- (2007) Large-scale validation of methods for
empirical method for prediction of antigenic cytotoxic T-lymphocyte epitope prediction.
determinants on protein antigens. FEBS Lett BMC Bioinformatics 8(1):424
276(1–2):172–174 37. Dhanda S, Karosiene E, Edwards L et al (2018)
25. Fieser TM, Tainer JA, Geysen HM, Houghten Predicting HLA CD4 immunogenicity in
RA, Lerner RA (1987) Influence of protein human populations. Front Immunol 9:1369
flexibility and peptide conformation on reactiv- 38. Calis JJ, Maybeno M, Greenbaum JA et al
ity of monoclonal anti-peptide antibodies with (2013) Properties of MHC class I presented
a protein alpha-helix. Proc Natl Acad Sci 84 peptides that enhance immunogenicity. PLoS
(23):8568–8572 Comput Biol 9(10):e1003266
26. Emini EA, Hughes JV, Perlow D, Boger J 39. Muh HC, Tong JC, Tammi MT (2009) Aller-
(1985) Induction of hepatitis a virus- Hunter: a SVM-pairwise system for assessment
neutralizing antibody by a virus-specific syn- of allergenicity and allergic cross-reactivity in
thetic peptide. J Virol 55(3):836–839 proteins. PLoS One 4(6):e5861
40. Bui HH, Sidney J, Li W et al (2007) Develop- 44. Andrusier N, Nussinov R, Wolfson HJ (2007)
ment of an epitope conservancy analysis tool to FireDock: fast interaction refinement in molec-
facilitate the design of epitope-based diagnos- ular docking. Proteins 69(1):139–159
tics and vaccines. BMC Bioinformatics 8 45. Mashiach E, Schneidman-Duhovny D, Andru-
(1):361 sier N et al (2008) FireDock: a web server for
41. Hildesheim A, Apple RJ, Chen CJ et al (2002) fast interaction refinement in molecular dock-
Association of HLA class I and II alleles and ing. Nucleic Acids Res 36(suppl_2):
extended haplotypes with nasopharyngeal car- W229–W232
cinoma in Taiwan. J Natl Cancer Inst 94 46. Berendsen HJ, van der Spoel D, van Drunen R
(23):1780–1789 (1995) GROMACS: a message-passing parallel
42. Rivoltini L, Loftus DJ, Barracchini K et al molecular dynamics implementation. Comput
(1996) Binding and presentation of peptides Phys Commun 91(1–3):43–56
derived from melanoma antigens MART-1 47. Pettersen EF, Goddard TD, Huang CC et al
and glycoprotein-100 by HLA-A2 subtypes. (2004) UCSF chimera—a visualization system
Implications for peptide-based immunother- for exploratory research and analysis. J Comput
apy. J Immunol 156(10):3882–3891 Chem 25(13):1605–1612
43. Lamiable A, Thévenet P, Rey J et al (2016) 48. DeLano WL (2002) The PyMOL molecular
PEP-FOLD3: faster de novo structure predic- graphics system. http://www.pymol.org
tion for linear peptides in solution and in com-
plex. Nucleic Acids Res 44(W1):W449–W454
Chapter 8
Characterizing MHC-I Genotype Predictive Power for Oncogenic Mutation Probability in Cancer Patients
Lainie Beauchemin, Michael Slifker, David Rossell,
and Joan Font-Burgada
Abstract
MHC class I proteins present intracellular peptides on the cell’s surface, enabling the immune system to
recognize tumor-specific neoantigens of early neoplastic cells and eliminate them before the tumor develops
further. However, variability in peptide-MHC-I affinity results in variable presentation of oncogenic
peptides, leading to variable likelihood of immune evasion across both individuals and mutations. Since
the major determinant of peptide-MHC-I affinity in patients is individual MHC-I genotype, we developed
a residue-centric presentation score taking both mutated residues and MHC-I genotype into account and
hypothesized that high scores (which correspond to poor presentation) would correlate to high mutation
frequencies within tumors. We applied our scoring system to 9176 tumor samples from TCGA across 1018
recurrent mutations and found that, indeed, presentation scores predicted mutation probability. These
findings open the door to more personalized treatment plans based on simple genotyping. Here, we outline
the computational tools and statistical methods used to arrive at this conclusion.
Key words Cancer predisposition, Cancer susceptibility prediction, Major histocompatibility complex
(MHC), Human leukocyte antigen (HLA), Antigen presentation, Cancer, Immunology, Immunoe-
diting, Immunotherapy, Neoantigens
1 Introduction
Early in tumor development, neoplastic cells are subject to
recognition and elimination by the host’s immune system in what is
known as immunosurveillance [1, 2]. This is due to T-cell recognition
of neoantigens, antigenic peptides which are the result of
tumor-specific mutations or aberrations and are novel to the
immune system, thereby marking the cells that display them for
destruction. Recognition of these neoantigens is enabled by the major
histocompatibility complex class I (MHC-I) cell surface proteins,
which display intracellular peptides extracellularly. The three main
MHC-I genes are encoded at the human leukocyte antigen
(HLA) locus on chromosome 6: HLA-A, HLA-B, and HLA-C
[3]. Targeted elimination halts further tumor development unless
the nascent tumor cells can evade this process, in which case the
tumor enters an equilibrium phase. During this time, tumor cells
are subject to constant immune selective pressure, continually
mutating and evolving in a way that promotes cell survival, until
they eventually develop effective mechanisms to suppress or evade
the host’s immunity. At this point, the tumor has entered the escape
phase. It is during this phase that tumors are usually detected [4].
This escape phase is clinically important, as thwarting these
suppression and evasion mechanisms is the aim of immune check-
point inhibitors. These inhibitors target tumor mechanisms which
deactivate T cells, thereby increasing tumor susceptibility to
immune response [5]. However, the elimination phase interaction,
which takes place early in tumor development, is also of great
importance, as it plays a large role in determining the driver muta-
tional makeup of tumors that eventually progress to clinically
detectable disease. The size of this role is amplified by the idea
that early-occurring driver mutations would be those which are
most susceptible to immune selection pressures, as early neoplastic
cells have not had the chance to develop the suppression mechan-
isms of those which arise later in the tumor lifespan. Therefore,
during the tumor’s nascent stages, the patient’s MHC-I genotype is
extremely important in determining a cancer cell’s ability to evade
immune recognition. MHC-I genotype determines the subset of
the intracellular proteome that can and cannot be presented and by
extension the tumor-promoting mutations which go detected and
undetected by the immune system. Since poor presentation of
intracellular oncogenic peptides is the sole way tumor cells avoid
immunosurveillance before specific evasion tactics are developed,
the susceptibility of various tumor-causing mutations to immune
response within a patient is determined by the patient’s MHC-I
genotype. We hypothesized, therefore, that MHC-I genotype and
resultant peptide presentation determine which cancer-causing
mutations go undetected in the early stages of tumor development
and continue to develop into full-blown, clinically diagnosable
disease. It would follow, then, that the likely mutational profile of
a patient’s cancer is predictable, at least in part, from that
patient’s MHC-I genotype.
Unfortunately, early tumor-immune interactions are difficult to
systematically study, particularly in humans. In “MHC-I Genotype
Restricts the Oncogenic Mutational Landscape” [6], we sought to
test our hypothesis by confirming or refuting the existence of a
relationship between peptide presentation ability, as determined by
the patient’s MHC-I genotype, and the mutational makeup of that
patient’s tumor. In other words, are the oncogenic mutations
which generate poorly presented peptides more likely to be present
in a tumor than those which are well-presented? An affirmative
answer would suggest that MHC-I genotype shapes the
oncogenome of a tumor by targeting mutated cells which are
recognized and destroyed by the immune system. In order to
answer this question, we used whole-exome data from 9176
patients in The Cancer Genome Atlas (TCGA) to ascertain
tumor-specific mutation profiles and MHC-I genotypes. We then
developed a presentation score, using prediction software to
determine the probability that a given oncogenic mutation is
presented by the patient’s six MHC-I alleles. We fitted a model to
estimate the likelihood that a mutation occurs in a given patient
based on the patient’s presentation scores. We found a statistically
significant positive association between presentation scores and
mutation likelihood; because high scores correspond to poor
presentation, this indicates that poorly presented mutations are
more frequent. Thus, oncogenic mutations are biased toward those
undetectable by the patient’s immunosurveillance machinery.
Presentation scores can therefore be used as a predictor of mutation
presence. It was further determined in this study that while
presentation scores cannot be used to predict which patients are
likely to carry a given mutation (within-mutation prediction), the
scores are highly predictive of which mutations a given patient is
likely to have (within-patient prediction). Here, we detail the
computational tools and statistical methods employed to arrive at
these conclusions.
2 Data Required for Analysis
To determine the relationship between neoantigen presentation
scores and mutation probability, a few pieces of information from
the patients of interest are required: their HLA genotype which
determines MHC-I binding affinity, the mutation profiles of their
tumors, and the possible peptide sequences which could contain
each mutated residue. The aim of this methodology description is
to outline the methods used to analyze the relationship between
neoantigen presentation and mutation probability, skipping the
methods used to acquire and validate the input data. However,
we will briefly detail the origin of the different datasets for clarity:
2.1 Acquisition of Patient Data
The Cancer Genome Atlas (TCGA) contains data from thousands
of tumor samples, including clinical data, DNA sequence data, copy
number information, and mRNA expression data
(https://cancergenome.nih.gov/). Data were accessed from the NCI
Genomic Data Commons (https://portal.gdc.cancer.gov/).
2.2 HLA Typing of Patient Data
We used exome sequencing data from the TCGA patients to type the
HLA-A, HLA-B, and HLA-C alleles of all patients. In order to
maximize the number of patients included in the analysis without
compromising the accuracy of the calls, we used three different
typing tools: PolySolver [7], Optitype [8], and Snp2HLA [9].
PolySolver and Optitype both take germline whole-exome data as
input, while Snp2HLA uses germline genotype data containing a
patient’s SNPs. One limitation of Snp2HLA is that the analysis
works solely for the tumor samples of Caucasian patients, so a
principal component analysis (PCA) was done to determine which
samples Snp2HLA was capable of analyzing. The three tools were
thus used in conjunction to increase the accuracy of the results
while also maximizing sample size. Each sample was run through
both Optitype and PolySolver; if the two typings disagreed at more
than one of the six alleles, the sample was discarded. If either
PolySolver or Optitype failed to output a genotype, the successful
tool’s output was used. If both tools failed to type the sample, as
was the case for 425 of the 9176 samples eventually typed, Snp2HLA
was used, provided that the sample was of Caucasian origin
according to the PCA.
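The decision rule described above can be sketched in R as follows. This is a minimal illustration rather than the pipeline used in the study: the function name `consensus_hla` and its arguments are hypothetical, and returning the Optitype call when the two tools differ by at most one allele is an assumption, since the text does not state which call is retained in that case.

```r
# Hypothetical helper: combine HLA calls from three typing tools.
# Each call is a character vector of six alleles, or NULL if the tool failed.
consensus_hla <- function(optitype, polysolver, snp2hla = NULL,
                          caucasian = FALSE) {
  # Count alleles of `b` left unmatched after pairing them one-to-one
  # with alleles of `a` (multiset comparison, so homozygous calls work).
  n_mismatch <- function(a, b) {
    for (allele in a) {
      hit <- match(allele, b)
      if (!is.na(hit)) b <- b[-hit]
    }
    length(b)
  }
  if (!is.null(optitype) && !is.null(polysolver)) {
    # Both tools succeeded: discard if they disagree at more than one allele.
    if (n_mismatch(optitype, polysolver) > 1) return(NULL)
    return(optitype)  # assumption: keep the Optitype call
  }
  if (!is.null(optitype)) return(optitype)
  if (!is.null(polysolver)) return(polysolver)
  # Both failed: fall back to SNP-based typing for Caucasian samples only.
  if (caucasian && !is.null(snp2hla)) return(snp2hla)
  NULL
}

# Toy genotypes (illustrative allele names only).
a  <- c("A*02:01", "A*01:01", "B*07:02", "B*08:01", "C*07:01", "C*07:02")
b  <- c("A*02:01", "A*01:01", "B*07:02", "B*08:01", "C*07:01", "C*06:02")
c2 <- c("A*03:01", "A*01:01", "B*07:02", "B*08:01", "C*07:01", "C*06:02")

result_one_diff <- consensus_hla(a, b)    # one allele differs: sample kept
result_two_diff <- consensus_hla(a, c2)   # two alleles differ: sample discarded
```

A `NULL` return value marks a sample that would be excluded from the analysis.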
2.3 Generation of the Mutation Matrix
We created a matrix containing the mutation information for every
patient in our sample. Among all mutations present in all tumors,
we selected only those that occur in the top 100 ranked oncogenes
and top 100 ranked tumor suppressors according to Davoli et al.
[10] and that were observed in at least three TCGA patients. In
addition, we retained only mutations that generate amino acid
changes (missense mutations and in-frame indels). With these
considerations, we generated a binary 9176 × 1018 matrix for
presence/absence of mutation, with each column representing a
recurrent mutation and each row representing a patient sample.
Mutations outside the 200 top-ranked oncogenes and tumor
suppressors were deemed passenger-like mutations, and 1000 were
sampled for analysis. Another 1000 common germline variants were
sampled from the Exome Variant Server
(http://evs.gs.washington.edu/EVS/).
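As a rough sketch of how such a presence/absence matrix might be assembled, the R code below builds a binary patient-by-mutation matrix from a long table of mutation calls and keeps only mutations seen in at least three patients. The data frame `calls`, its column names, and the toy entries are hypothetical and are not taken from TCGA.

```r
# Toy long-format mutation calls (hypothetical data, not from TCGA).
calls <- data.frame(
  patient  = c("P1", "P2", "P3", "P1", "P4", "P2"),
  mutation = c("KRAS_G12D", "KRAS_G12D", "KRAS_G12D",
               "TP53_R175H", "TP53_R175H", "BRAF_V600E"),
  stringsAsFactors = FALSE
)

# Binary patient x mutation matrix: 1 = mutation observed, 0 = not observed.
tab <- table(calls$patient, calls$mutation)
mut_matrix <- matrix(as.integer(tab > 0), nrow = nrow(tab),
                     dimnames = list(rownames(tab), colnames(tab)))

# Keep only recurrent mutations (observed in at least `min_patients` patients).
min_patients <- 3
recurrent <- mut_matrix[, colSums(mut_matrix) >= min_patients, drop = FALSE]
```

In this toy example only one of the three mutations passes the recurrence filter; `drop = FALSE` keeps the result a matrix even when a single column remains.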
2.4 Binding Likelihood of Specific Oncogenic Mutations on a Given MHC-I Allele
In order to determine the likelihood that a peptide carrying a
specific mutation binds a given human MHC-I allele, we used
NetMHCPan 3.0 [11]. From two inputs, a peptide of length 8–11 and
an MHC-I allele, NetMHCPan 3.0 outputs a percentile rank score,
which measures the percentage of 400,000 randomly selected natural
peptides (length 8–11) in the IEDB that bind to a given allele better
than the peptide in question. For example, if a peptide receives a
percentile rank score of 2.0 for an HLA allele, it is a stronger binder
for that particular MHC-I molecule than 98% of natural peptides of
a similar length [11]. Lower rank scores therefore indicate stronger
binding. The tool suggests a rank threshold of 2.0 for a peptide to be
classified as a weak binder and a threshold of 0.5 for it to be
classified as a strong one.
An important challenge lies in using the software’s peptide-centric
affinity scoring system to synthesize residue-centric presentation
scores: while we were interested in determining residue-MHC-I
interactions, NetMHCPan 3.0 is designed to quantify peptide-MHC
interactions. In order to determine and accurately represent the
binding affinity of a residue using the software’s algorithm, we had
to consider all possible peptides of length 8–11 containing the
mutated residue of interest, which amounted to 38 possible peptides
(8 + 9 + 10 + 11). For indels, any new peptide of length 8–11
resulting from the insertion or deletion of the residue was included
in the collection of possible peptides. We then explored a few
options for synthesizing the rank scores from all 38 possible
peptides into a single presentation score for the residue of interest
and for a given MHC-I allele:
1. Best rank: the lowest rank score of all 38 possible peptides
containing the residue.
2. Number < 2: the total number of all possible peptides classified
as at least weak binders (i.e., percentile rank less than 2.0).
Ranges from 0 to 38 with 0 being very unlikely to be presented
and 38 being very likely to be presented.
3. Number < 0.5: the total number of all possible peptides
classified as strong binders (i.e., percentile rank less than 0.5).
Ranges from 0 to 38 with 0 being very unlikely to be presented
and 38 being very likely to be presented.
4. Best rank with cleavage: the lowest rank score of all possible
peptides after the peptides were filtered to remove those
unlikely to be cleaved by the proteasome. The NetChop tool
[12] was used to assign cleavage scores to each peptide from
0 to 1, with 1 being the most likely to be generated by
proteasomal cleavage and 0 being the least likely. All peptides with
cleavage scores below a 0.5 threshold were removed.
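The four candidate scores listed above can be illustrated with a short R sketch. It first enumerates the 38 peptides of length 8–11 that contain a given residue and then summarizes a vector of percentile ranks. The helper `peptides_containing`, the toy sequence, and the random rank and cleavage values are all assumptions for illustration; in practice the ranks would come from NetMHCPan and the cleavage scores from NetChop.

```r
# Enumerate every peptide of the given lengths that contains position `pos`
# of protein sequence `seq`; windows running off either end are skipped.
peptides_containing <- function(seq, pos, lengths = 8:11) {
  out <- character(0)
  for (len in lengths) {
    starts <- (pos - len + 1):pos
    starts <- starts[starts >= 1 & starts + len - 1 <= nchar(seq)]
    if (length(starts) > 0) {
      out <- c(out, substring(seq, starts, starts + len - 1))
    }
  }
  out
}

# Toy 30-residue sequence with the residue of interest at position 12.
toy_seq <- "MTEYKLVVVGADGVGKSALTIQLIQNHFVD"
peps <- peptides_containing(toy_seq, pos = 12)   # 8 + 9 + 10 + 11 peptides

# Hypothetical percentile ranks (one per peptide) for a single MHC-I allele.
set.seed(1)
ranks <- runif(length(peps), min = 0.01, max = 50)

best_rank <- min(ranks)         # 1. best rank
n_weak    <- sum(ranks < 2.0)   # 2. number of at least weak binders
n_strong  <- sum(ranks < 0.5)   # 3. number of strong binders

# 4. Best rank with cleavage: hypothetical cleavage scores in [0, 1];
#    peptides scoring below 0.5 are removed before taking the minimum.
cleavage <- runif(length(peps))
kept <- ranks[cleavage >= 0.5]
best_rank_cleaved <- if (length(kept) > 0) min(kept) else NA_real_
```

For a residue close to either end of the protein, fewer than 38 windows exist, which is why the helper filters out start positions that fall outside the sequence.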
A comprehensive database of cell surface-presented peptides [13],
consisting of peptides identified by mass spectrometry as bound to
MHC-I molecules encoded by 16 different HLA alleles, was used as
the gold standard to determine which of the four strategies above
had the most predictive power for presentation likelihood. The
power of each strategy to discriminate presented peptides from
random negatives was evaluated, and the best rank (BR) score was
found to be the best predictor.
The next step was to numerically represent a patient’s ability to
present a mutation by combining the best rank scores of the
patient’s six HLA alleles. We evaluated the simple best rank among
the six alleles (patient best rank (PBR)) and the harmonic mean of
the six best rank scores (patient harmonic best rank (PHBR)), which
is strongly influenced by the lowest values in a set. For this analysis
we used mass spectrometry data from five cell lines with confirmed
four-digit HLA typing for the corresponding six MHC-I alleles [14],
which showed that PBR and PHBR had similar power to predict
binding. The following analyses were therefore conducted using
both measures, although here we show only PHBR for simplicity.
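A small worked example may help to show how the two patient-level scores differ. The six best rank values below are hypothetical; the harmonic mean is computed as the number of alleles divided by the sum of reciprocal ranks.

```r
# Best rank scores of one mutation for a patient's six MHC-I alleles
# (hypothetical values for illustration).
best_ranks <- c(0.4, 3.0, 12.0, 25.0, 8.0, 40.0)

# Patient best rank: the single lowest (best) score among the six alleles.
pbr <- min(best_ranks)

# Patient harmonic best rank: harmonic mean of the six scores.
phbr <- length(best_ranks) / sum(1 / best_ranks)

# Arithmetic mean, shown only for comparison.
arith <- mean(best_ranks)
```

Here the PHBR is about 1.93, far closer to the lowest score (0.4) than the arithmetic mean of about 14.7, which illustrates how the harmonic mean is dominated by the best-presenting alleles.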
3 Characterizing the Relationship Between Presentation Scores and Mutation Presence Across All Cancer Types
3.1 Depicting the Global Link Between PHBR Scores and Mutation Probability
We condensed our processed data into two matrices, each with
9176 rows and 1018 columns. Each of the 9176 rows represents
an HLA-typed patient and each of the 1018 columns represents a
recurrent oncogenic mutation. One matrix comprises the PHBR
scores corresponding to the patient and the mutation specified by
the entry’s row and column, respectively. The other is a binary
matrix, with a 1 indicating that the patient represented by the
cell’s row carries the mutation specified by the column and a 0
indicating that the patient does not.
In order to explore the general relationship between PHBR
and the presence of mutation in patient samples, we compared the
distributions of PHBR scores associated with mutation and those
not associated with mutation. That is, we took PHBR scores
corresponding to a 1 (mutation) in one group and those
corresponding to a 0 (no mutation) in another. The distributions
of such PHBR scores in both groups can be visualized using a
boxplot (Fig. 1a). As is apparent from the boxplot alone, higher
PHBR scores are more commonly associated with mutations. To
gain a more detailed understanding of this trend, we generated a
histogram using the same data (Fig. 1b). Indeed, for PHBR scores
<2, PHBR scores not associated with mutation are enriched, while
the opposite appears true for >2. Overall, mutation-associated
scores tend to be greater than scores not associated with mutations,
indicating that mutations tend to occur in the context of lower
MHC-I presentation.
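The comparison described above can be sketched in R on simulated data. The matrix names `phbr_mat` and `mut_mat`, their dimensions, and the simulation parameters are assumptions for illustration only. The plotting functions are called with `plot = FALSE` so that the sketch returns the underlying statistics; dropping that argument draws the figures.

```r
set.seed(42)
n_pat <- 200
n_mut <- 50

# Hypothetical PHBR matrix (patients x mutations), log-normally distributed.
phbr_mat <- matrix(rlnorm(n_pat * n_mut, meanlog = 0, sdlog = 1), n_pat, n_mut)

# Hypothetical binary mutation matrix whose odds rise with log-PHBR.
p <- plogis(-3 + 0.25 * log(phbr_mat))
mut_mat <- matrix(rbinom(n_pat * n_mut, 1, p), n_pat, n_mut)

# Flatten both matrices; entries stay aligned because both are column-major.
phbr <- as.vector(phbr_mat)
mut  <- as.vector(mut_mat)

# Median PHBR in the no-mutation (0) and mutation (1) groups.
group_medians <- tapply(phbr, mut, median)

# Boxplot statistics per group.
bp <- boxplot(phbr ~ mut, plot = FALSE)

# Histogram densities per group on a common 0-10 scale (values above 10
# are capped at 10 so that the breaks span the data).
breaks <- seq(0, 10, by = 0.5)
h0 <- hist(pmin(phbr[mut == 0], 10), breaks = breaks, plot = FALSE)
h1 <- hist(pmin(phbr[mut == 1], 10), breaks = breaks, plot = FALSE)
```

Comparing `h0$density` and `h1$density` bin by bin gives the kind of enrichment pattern discussed for the histogram.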
The functional form of the relationship underlying this ten-
dency, however, remained unclear. We wondered if the relationship
was in some sense linear, in that an increase in PHBR uniformly
corresponds to an increased likelihood of mutation. The histogram
made clear that differences in PHBR frequency between the two
groups are highly variable across the range of PHBR scores, with
the most drastic differences seen when PHBR scores are less than
1 and modest differentials when PHBR scores exceed 2. We won-
dered if there was some sort of threshold after which PHBRs no
longer had a significant effect or if there was another explanation.
To investigate this issue, a generalized additive logistic regression
model using cubic spline basis functions can be fitted, as
implemented in the mgcv R package [15]. The R code to fit the model is
given below. The code assumes that mut is a vector of 1’s and 0’s
indicating mutation presence/absence and that PHBR is a vector
storing the PHBR scores.
library(mgcv)
gam1 <- gam(mut ~ s(log(PHBR)), family = "binomial")
Fig. 1 PHBR score distributions depending on mutation status. (a) Boxplot
("PHBR Score Distribution") showing the distribution of PHBR scores (y-axis)
by mutation presence (x-axis: 0 or 1). (b) Histogram ("Frequency of PHBR
Scores") showing the frequency (y-axis) of PHBR scores (x-axis) for the
no-mutation and mutation groups
The fitted model reveals a linear association between
log-PHBR scores and logit mutation probability (Fig. 2b). We
also computed the probabilities as a function of the untransformed
PHBR scores. The curve steadily increases but begins to level off at
higher PHBR scores (Fig. 2a).
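One way to obtain curves of this kind is to predict from the fitted GAM over a grid of PHBR values, once on the logit (link) scale and once on the probability (response) scale. The sketch below uses simulated stand-ins for `mut` and `PHBR`; the grid, the seed, and the simulation coefficients are assumptions, and the fitted values are not those of the study.

```r
library(mgcv)  # recommended package distributed with standard R installations

set.seed(7)
# Simulated stand-ins for the flattened PHBR scores and mutation indicators.
PHBR <- rlnorm(5000, meanlog = 0, sdlog = 1)
mut  <- rbinom(5000, 1, plogis(-3 + 0.25 * log(PHBR)))

# Same model call as in the text.
gam1 <- gam(mut ~ s(log(PHBR)), family = binomial)

# Predict over a grid of PHBR values on both scales.
grid <- data.frame(PHBR = seq(0.1, 10, length.out = 100))
grid$logit <- as.numeric(predict(gam1, newdata = grid, type = "link"))
grid$prob  <- as.numeric(predict(gam1, newdata = grid, type = "response"))
```

Plotting `grid$logit` against `log(grid$PHBR)` and `grid$prob` against `grid$PHBR` gives the two views of the same fit: a relationship that is straight on the log/logit scale bends and flattens on the untransformed scale.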
To assess the robustness of the results given by the GAMs, the
percentage of mutations in the actual data can be plotted after
grouping the PHBR into nine ranges: 0–0.5, 0.5–1, 1–1.5,
1.5–2.0, 2.0–2.5, 2.5–3, 3–4, 4–5, and 5–infinity (Fig. 3a). The
x-value of each point is then determined by the mean of all PHBR
scores within the range and y-value by the percentage of mutations
(entries equal to 1) in the mutation matrix. The y-values are there-
fore the observed proportion of mutations within each range. The
observed pattern in this nonparametric analysis is largely similar to
the model-based results (Fig. 2). The plotted logit of the true
mutation-probability data again shows a linear distribution match-
ing the GAM results. To increase the resolution here, PHBR scores
are grouped into 20 ranges (Fig. 3b).
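The binned check described above can be sketched in base R. The nine PHBR ranges follow the text, while the simulated `PHBR` and `mut` vectors and the choice of left-closed intervals are assumptions for illustration.

```r
set.seed(11)
# Simulated stand-ins for the PHBR scores and mutation indicators.
PHBR <- rlnorm(20000, meanlog = 0, sdlog = 1)
mut  <- rbinom(20000, 1, plogis(-3 + 0.25 * log(PHBR)))

# Nine ranges: 0-0.5, 0.5-1, 1-1.5, 1.5-2, 2-2.5, 2.5-3, 3-4, 4-5, 5-infinity.
breaks <- c(0, 0.5, 1, 1.5, 2, 2.5, 3, 4, 5, Inf)
bin <- cut(PHBR, breaks = breaks, right = FALSE)

# x-value: mean PHBR within each range; y-value: observed mutation proportion.
binned <- data.frame(
  mean_phbr = as.numeric(tapply(PHBR, bin, mean)),
  prop_mut  = as.numeric(tapply(mut, bin, mean))
)

# Transformed scale for comparison with the GAM fit.
binned$log_phbr <- log(binned$mean_phbr)
binned$logit    <- qlogis(binned$prop_mut)
```

Replacing `breaks` with a finer set of cut points, for example 20 quantile-based ranges, increases the resolution in the same way.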
Fig. 2 Nonparametric estimates of the mutation probability as a function of PHBR scores. (a) GAM-based
mutation probability vs. PHBR. (b) GAM-based logit-mutation probability vs. log PHBR
Fig. 3 Validation of the GAM models. (a) Frequency of mutation for PHBR groups. (b) Scatterplot with each
point representing the log-average PHBR of a group and the logit-mutation frequency within that group
This analysis establishes a global relationship between
presentation scores and mutation probability: a unit increase in the
log-PHBR score corresponds approximately to a constant increase in
the logit-mutation probability. Our hypothesized link between
mutation presentation and mutation presence was thus shown to
exist, paving the way for further examination of the model’s
predictive power.
3.2 Determining Predictive Power of Presentation Scores for Mutation Probability
We had at this point firm evidence of a bias toward higher
presentation scores in observed mutations, establishing a global link
between PHBR and mutation status. To determine the extent to
which PHBR scores can predict mutation presence, it has to be
taken into account that the influence of MHC-I presentation on
mutation probability can operate at different levels. For instance,
PHBR scores could help identify patients that are more likely to
have a given mutation, or they could help identify which mutations
are more likely to occur in a given patient. To investigate these
different relationships and understand the relationship between
PHBR and mutation probability more precisely, we analyzed
mutation prediction using two models: the within-mutation model,
which assesses the power of the PHBR scores to predict which
patients will have a particular mutation, and the within-patient
model, which assesses their power to predict which mutations a
particular patient’s tumor is more likely to have.
3.2.1 Determining Predictive Power of Presentation Scores Within a Given Mutation
In order to model the probability of a given mutation across
patients based on their PHBR scores, we fit a generalized linear
model with random effects using the glmer() function from the
lme4 R package [16], using mutations present in ≥5 tumors in
TCGA patients:

$$\operatorname{logit} P(y_{ij} = 1 \mid x_{ij}) = \beta_j + \gamma \log(x_{ij})$$

where $y_{ij}$ indicates whether patient $i$ carries mutation $j$,
$x_{ij}$ is the corresponding PHBR score, $\gamma$ measures the effect
of the log-PHBRs on the mutation probability, and $\beta_j$ captures
random mutation-specific effects due to each residue having a
potentially different baseline probability of mutation. We then fit
this model and tested the null hypothesis that $\gamma = 0$ or, in
other words, that log-PHBR has no effect on the mutation probability.
The model outputs a coefficient of 0.02660 with a p-value of
0.0753, which provides suggestive but not conclusive evidence that
PHBR affects the logit-probability. In addition, the output reveals a
random effects standard deviation of 0.794, indicating that variance
in mutation rates across residues is substantial.
3.2.2 Determining Predictive Power of Presentation Scores Within a Given Patient
To model the ability of PHBR scores to predict mutation
probability within a patient, the same analysis as before can be
implemented, except that this time a model with patient-specific
random effects is fitted (i.e., certain patients are more likely to have
more mutations for reasons other than their MHC-I genotype). The
formula is as follows:

$$\operatorname{logit} P(y_{ij} = 1 \mid x_{ij}) = \eta_i + \gamma \log(x_{ij})$$

where $\gamma$ still corresponds to the degree to which log-PHBR
affects mutation probability but $\eta_i$ now represents
patient-specific random effects.
The model output gives a regression coefficient of 0.24730,
rejecting the null hypothesis with a p-value less than 2 × 10^-16 and
demonstrating with near certainty that the log(PHBR) influences
mutation probability within patients. The standard deviation of the
random effects is 0.506, indicating that there is a fair amount of
variance due to patient-specific mutation propensities, though not
as much variance as in the within-mutation model.
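The two random-intercept models of Subheadings 3.2.1 and 3.2.2 could be expressed with `glmer()` roughly as follows. The formula syntax is our reading of the model equations, not code from the study, and the long-format data frame `d`, its column names, and all simulated values are assumptions. The fits are wrapped in a check for the lme4 package so that the data preparation can be run even where lme4 is not installed.

```r
set.seed(3)
n_pat <- 200
n_mut <- 30

# Long format: one row per patient-mutation pair (simulated data).
d <- expand.grid(patient  = factor(seq_len(n_pat)),
                 mutation = factor(seq_len(n_mut)))
d$PHBR <- rlnorm(nrow(d))

# Simulated mutation-specific and patient-specific baselines.
beta <- rnorm(n_mut, mean = -3, sd = 0.8)
eta  <- rnorm(n_pat, mean = 0, sd = 0.5)
lin  <- beta[as.integer(d$mutation)] + eta[as.integer(d$patient)] +
        0.25 * log(d$PHBR)
d$mut <- rbinom(nrow(d), 1, plogis(lin))

if (requireNamespace("lme4", quietly = TRUE)) {
  # Within-mutation model: random intercept per mutation (beta_j).
  fit_mut <- lme4::glmer(mut ~ log(PHBR) + (1 | mutation),
                         family = binomial, data = d)
  # Within-patient model: random intercept per patient (eta_i).
  fit_pat <- lme4::glmer(mut ~ log(PHBR) + (1 | patient),
                         family = binomial, data = d)
  # Estimated gamma (effect of log-PHBR) from each model.
  gamma_mut <- lme4::fixef(fit_mut)[["log(PHBR)"]]
  gamma_pat <- lme4::fixef(fit_pat)[["log(PHBR)"]]
}
```

Calling `summary()` on either fit reports the coefficient, its p-value, and the standard deviation of the random intercepts, which are the three quantities quoted in the text.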
3.3 Odds Ratios
As discussed above, there is strong evidence that PHBR scores have
an effect on mutation probability within patients and marginal
evidence of a slight effect on the mutation probability across
patients. To help interpret the degree to which scores predict
mutation probability in both models, we compute their odds ratios
(ORs). The OR is 1.282 in the within-patient analysis; that is, a +1
increase in log-PHBR score is estimated to increase the odds of
mutation by 28.2%. The estimated OR for the within-mutation
analysis was only 1.028 and the upper end of the 95% confidence
interval (CI) was 1.059, indicating that any potential increase in
odds is unlikely to be larger than 5.9% per unit of log-PHBR
(Table 1). The same models can be fitted on common germline
variants or passenger mutations, for which no association or a much
weaker association is expected, respectively. As expected, the results
show that the PHBR scores of passenger mutations or germline
variants are not predictive of mutation probability in either the
within-mutation or the within-patient model (Table 1).
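For a logistic model, the odds ratio for a one-unit increase in a predictor is the exponential of its coefficient. The sketch below converts a coefficient and standard error into an OR with a Wald-type 95% interval. The helper `odds_ratio` is hypothetical, the standard error is an illustrative value rather than one reported in the text, and whether the study used Wald intervals is an assumption.

```r
# Convert a log-odds coefficient and its standard error into an odds ratio
# with a Wald confidence interval.
odds_ratio <- function(coef, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(OR = exp(coef), lower = exp(coef - z * se), upper = exp(coef + z * se))
}

# Coefficient quoted in the text for the within-patient model; the standard
# error is a hypothetical value chosen for illustration only.
or_pat <- odds_ratio(coef = 0.24730, se = 0.0104)

# Percentage increase in the odds per unit increase in log-PHBR.
pct_increase <- 100 * (or_pat[["OR"]] - 1)
```

With this coefficient the OR comes out at about 1.28, that is, an increase in the odds of roughly 28% per unit of log-PHBR.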
To more fully understand the degree to which PHBR is predictive
of mutation probability both within mutations and within patients,
we computed odds ratios for both models with multiple mutation
frequency cutoffs. That is, we repeated the previous analyses,
restricting attention to mutations that appeared in ≥3, ≥5, ≥10,
≥20, and ≥40 tumors (Table 2).
The within-mutation model has an odds ratio close to 1 for all
cutoffs, with 1 included in the 95% confidence interval, suggesting
that this model has very limited predictive power, if any at all. By
contrast, the within-patient model proves predictive, with the
degree of predictive power highly dependent on the mutation
frequency cutoff. The OR is always significant, and as the cutoff
increases from 3 through 5 and 10 to 20, the odds ratio of the
within-patient model concomitantly increases from 1.183 to 1.545.
However, at a frequency cutoff of 40, the OR, although still >1 and
highly significant, is lower than at a cutoff of 20 (Table 2). We
interpret this result as a consequence of lower statistical power at
the highest cutoffs due to the low number of mutations considered
by the model.

Table 1
Quantitative estimate of the association between PHBR scores and mutation probability

                     Within-mutation OR (95% CI)  p-Value  Within-patient OR (95% CI)  p-Value
≥5 mutations         1.028 (0.998, 1.059)         0.066    1.282 (1.256, 1.308)        <2 × 10^-16
Passenger mutations  1 (0.996, 1.037)             0.953    0.999 (1.164, 1.204)        0.95
Germline variants    1 (0.994, 0.999)             0.022    0.995 (0.994, 0.996)        5.8 × 10^-10

ORs, 95% CIs, and p-values for both the within-mutation and within-patient glmer models, as calculated with a mutation
frequency threshold of ≥5. The same is shown for 1000 sampled passenger mutations and 1000 common germline variants

Table 2
Quantitative estimate of the association between PHBR scores and mutation probability as a function of
oncogenic mutation frequency

Frequency cutoff  Within-patient odds ratio  95% confidence interval  p-Value
≥3                1.183                      (1.164, 1.204)           <2 × 10^-16
≥5                1.282                      (1.256, 1.308)           <2 × 10^-16
≥10               1.319                      (1.287, 1.352)           <2 × 10^-16
≥20               1.545                      (1.493, 1.599)           <2 × 10^-16
≥40               1.301                      (1.256, 1.359)           <2 × 10^-16

ORs, 95% CIs, and p-values for the within-patient glmer model at different thresholds of oncogenic mutation frequencies
4 Analysis Across Different Cancer Types
To determine whether the predictive power of PHBR on mutation

probability persists equally across all cancer types, or whether the
effect is more pronounced in some tissues than in others, we
repeated our analysis separately for each cancer type in the TCGA
cohort.
We coded tissue information from each tumor in a 9176 × 2
matrix, with the first column comprising TCGA patient IDs and the
second column detailing the tissue type from which the tumor
was derived. For each tissue type, three models are created relating
log-PHBR to logit-mutation probability: a global model, a random
effects model accounting for mutation-specific effects, and a random
effects model accounting for patient-specific effects.
We then create an odds ratio table both for the within-patient
model and the within-mutation model. To visualize these and
facilitate comparison of the predictive power between tissue types
within each model, we generated two plots, one for the within-
patient model and one for the within-mutation model, and added
the odds ratio data for each tissue type (Fig. 4).

Fig. 4 Predictive power of PHBR scores across cancer types. (a–d) ORs as black boxes and 95% CIs as red
dotted lines for different cancer types (y-axis: cancer type; x-axis: odds ratio) using (a) oncogenic mutations
with frequency ≥5 and the within-mutation model, (b) same as in (a) but using the within-patient model,
(c) oncogenic mutations with frequency ≥20 and the within-mutation model, (d) same as in (c) but using the
within-patient model
The analysis is restricted to cancer types with more than 100
samples so as to increase the confidence of the odds ratio
estimates. In addition, as observed in the pan-cancer analyses,
when we restrict the cancer type analysis to highly recurrent
mutations, specifically those with at least 20 instances as opposed
to the default threshold of 5, the ORs obtained across cancer types
are considerably higher in the within-patient model, reaching up to
2.51 in some cases. An OR of 2.51 means that a one-unit increase in
log-PHBR multiplies the odds of the mutation occurring by 2.51,
that is, a 151% increase in the odds. As expected, none of the
cancer types is associated with a significant
OR in the within-mutation models.
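As a hedged illustration of how an odds ratio and its confidence interval relate to a logistic regression coefficient, the sketch below back-calculates the coefficient and an approximate standard error from the within-patient values reported in Table 2 at cutoff 5 (OR 1.282, 95% CI 1.256 to 1.308). The function names are hypothetical, and a Wald-type interval (estimate ± 1.96 standard errors on the log-odds scale) is assumed; the text does not state how the intervals were computed.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval (assumed Wald CI)


def odds_ratio_with_ci(beta, se, z=Z_95):
    """Convert a log-odds coefficient and its standard error to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coefficient_from_or(odds_ratio, lower, upper, z=Z_95):
    """Back-calculate the coefficient and standard error from a reported OR and CI."""
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


def percent_change_in_odds(odds_ratio):
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0


# Values taken from Table 2, frequency cutoff 5 (within-patient model).
beta, se = coefficient_from_or(1.282, 1.256, 1.308)
estimate, lower, upper = odds_ratio_with_ci(beta, se)
print(f"beta={beta:.4f}, se={se:.4f}, OR={estimate:.3f} ({lower:.3f}, {upper:.3f})")
print(f"OR 2.51 corresponds to a {percent_change_in_odds(2.51):.0f}% increase in the odds")
```

The round trip reproduces the reported interval only approximately, because the published bounds are rounded to three decimals.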
As a further check of the hypothesis that the predictive power of
the PHBR score differs significantly across tissues, we used a fixed
effects logistic regression model with an interaction term between
PHBR and tissue. This interaction term was statistically significant
(likelihood ratio test, p value = 1.61 × 10^-14), suggesting
that the PHBR predictive power indeed varies across
tissues, although currently it is not possible to attribute these
differences to specific tissue characteristics. A simple explanation
could be that stronger effects are detected on tumor types with
larger patient sample sizes and only few and frequent mutations,
while cancer types with smaller groups of patients and more
low-frequency mutations appear to be more weakly predictive.
However, the differences observed could be reflecting some
biological aspects of the immune system response that operate
differently depending on the cancer type.
Acknowledgments
This work has been supported by NIH K99/R00CA191152 grant

to J. F-B and funded in part through the NIH/NCI Cancer Center
Support Grant P30 CA006927. David Rossell was partially funded
by the grant RyC-2015-18544, Plan Estatal PGC2018-101643-B-
I00, and Ayudas Fundacion BBVA a equipos de investigacion
cientifica 2017.
References
1. Kaplan DH et al (1998) Demonstration of an https://doi.org/10.1126/science.271.5256.
interferon gamma-dependent tumor surveil- 1734
lance system in immunocompetent mice. Proc 6. Marty R et al (2017) MHC-I genotype restricts
Natl Acad Sci U S A 95:7556–7561. https:// the oncogenic mutational landscape. Cell
doi.org/10.1073/pnas.95.13.7556 171:1272–1283. e1215. https://doi.org/10.
2. Shankaran V et al (2001) IFNgamma and lym- 1016/j.cell.2017.09.050
phocytes prevent primary tumour development 7. Shukla SA et al (2015) Comprehensive analysis
and shape tumour immunogenicity. Nature of cancer-associated somatic mutations in class
410:1107–1111. https://doi.org/10.1038/ I HLA genes. Nat Biotechnol 33:1152–1158.
35074122 https://doi.org/10.1038/nbt.3344
3. Sidney J, Peters B, Frahm N, Brander C, Sette 8. Szolek A et al (2014) OptiType: precision HLA
A (2008) HLA class I supertypes: a revised and typing from next-generation sequencing data.
updated classification. BMC Immunol 9:1. Bioinformatics 30:3310–3316. https://doi.
https://doi.org/10.1186/1471-2172-9-1 org/10.1093/bioinformatics/btu548
4. Zitvogel L, Tesniere A, Kroemer G (2006) 9. Jia X et al (2013) Imputing amino acid poly-
Cancer despite immunosurveillance: immuno- morphisms in human leukocyte antigens. PLoS
selection and immunosubversion. Nat Rev One 8:e64683. https://doi.org/10.1371/
Immunol 6:715–727. https://doi.org/10. journal.pone.0064683
1038/nri1936 10. Davoli T et al (2013) Cumulative haploinsuffi-
5. Leach DR, Krummel MF, Allison JP (1996) ciency and triplosensitivity drive aneuploidy
Enhancement of antitumor immunity by patterns and shape the cancer genome. Cell
CTLA-4 blockade. Science 271:1734–1736. 155:948–962. https://doi.org/10.1016/j.
cell.2013.10.011
11. Nielsen M, Andreatta M (2016) NetMHCpan- 14. Bassani-Sternberg M, Pletscher-Frankild S,

3.0; improved prediction of binding to MHC Jensen LJ, Mann M (2015) Mass spectrometry
class I molecules integrating information from of human leukocyte antigen class I peptidomes
multiple receptor and peptide length datasets. reveals strong effects of protein abundance and
Genome Med 8:33. https://doi.org/10. turnover on antigen presentation. Mol Cell
1186/s13073-016-0288-x Proteomics 14:658–673. https://doi.org/10.
12. Kesmir C, Nussbaum AK, Schild H, Detours V, 1074/mcp.M114.042812
Brunak S (2002) Prediction of proteasome 15. Wood SN (2011) Fast stable restricted maxi-
cleavage motifs by neural networks. Protein mum likelihood and marginal likelihood esti-
Eng 15:287–296 mation of semiparametric generalized linear
13. Abelin JG et al (2017) Mass spectrometry models. J R Stat Soc Series B Stat Methodol
profiling of HLA-associated Peptidomes in 73:3–36. https://doi.org/10.1111/j.1467-
mono-allelic cells enables more accurate epi- 9868.2010.00749.x
tope prediction. Immunity 46:315–326. 16. Bates D, M€achler M, Bolker B, Walker S
https://doi.org/10.1016/j.immuni.2017.02. (2015) Fitting linear mixed-effects models
007 using lme4. J Stat Softw 67(1):1–48 https://
doi.org/10.18637/jss.v067.i01
Chapter 9
Network Analysis of Large-Scale Data and Its Application to Immunology
Lauren Benoodt and Juilee Thakar
Abstract
Diseases and infections elicit a multilayered immune response which consists of molecular and cellular
interaction cascades. Recent advances in high-throughput technologies have facilitated multiparameter
investigation of immune cells involved in human immune responses. These multiparameter investigations
generate large-scale datasets and advanced computational techniques are required to gain useful informa-
tion from them. Networks or graphs offer a practical way to represent complex information and develop
advanced algorithms to unveil the underlying mechanisms. Here we discuss ways to assemble and analyze
networks using genome-wide transcriptional profiles. Additionally, we discuss ways to integrate information
available in primary literature and databases with the networks assembled using large-scale datasets. Finally,
we describe ways in which network analysis offers insights into human immune responses.
Key words Gene co-expression, Network, Immunology, Clustering, Correlation, Mutual informa-
tion, Transcriptomics
1 Introduction
Current high-throughput methods such as RNA sequencing and

proteomics can generate massive amounts of data. Gaining useful
information from these large-scale data requires advanced compu-
tational techniques. Graphs or networks allow representation of
complex data and its analysis through the utilization of graph
theoretical tools and the development of novel techniques. A net-
work is a representation of data that highlights the interactions
between different features such as genes or proteins. Networks
consist of nodes representing features and edges representing the
connections between the nodes. The analysis of networks facilitates
mechanistic investigation since networks explicitly depict associa-
tions between genes or proteins. The edges of a network can be
directed, indicating that the state of one node is dependent on the
state of a connected upstream node, or undirected in which case
causality is unknown. The strength of the interaction can be
depicted by the weights of the edges. Thus, networks can be
utilized to conceptualize data by representing different
characteristics that enable multifactorial investigation.
Generally, biological networks can be grouped into knowledge-
or data-driven networks. Knowledge-driven networks represent
known pathways or signaling cascades which are assembled from
primary literature, some of which are available in interaction data-
bases. Data-driven networks such as co-expression networks can be
constructed by finding association scores between genes from tran-
scriptomic data that describe their interaction. Principles from
graph theory are then used to infer different functional relation-
ships based on how the nodes of the network are connected. Both
knowledge- and data-driven networks can be integrated to learn
from previous experiments and at the same time infer novel inter-
actions. Other sources of biological knowledge can be added to the
data-driven networks, such as predicted transcription factor targets,
to improve understanding of the functionality of the genes within
the network. Here we will review ways to assemble and analyze
these networks.
The different types of networks discussed above can be used to
study the immune responses involved in diseases and infections.
Here we will discuss the applications of network-based methods to
study immune response in humans in the context of sample collec-
tion, multiparameter immunological datasets, and demographic
variables.
2 Construction of Data-Driven Networks
Edges in a data-driven network are based on the calculation of

association scores between the genes or proteins with the assump-
tion that genes with associated expression levels are functionally
related. These networks are referred to as co-expression networks.
The scores rely on associations between gene-expression patterns
across samples. Commonly used techniques to measure associations
include Pearson correlation and mutual information (MI). Both of
these methods have different strengths and weaknesses that make
them appropriate for use in different situations.
Pearson correlation is a measure of linear correlation which can
be positive or negative, suggesting activating or inhibitory
relationships between genes, respectively. Unfortunately, Pearson
correlation only detects linear relationships. This means that if a
scatter plot of the expression levels of two genes formed a curve
instead of a line, the pair would have a low Pearson correlation
score despite there being an association between the two genes. It
is also very sensitive to outliers (see Note 1). Pearson correlation
is computationally fast to run, which is important when analyzing
very large datasets. An alternative, Spearman correlation, is used
as a metric of association for large-scale data that are not
normally distributed. It is similar to Pearson correlation but
relies on ranks rather than actual values and captures monotonic as
opposed to strictly linear relationships. Spearman correlation is
less sensitive to drastic outliers, since the rank of a value away
from the trend does not change regardless of how
large the difference is.
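The contrast between the two correlation measures can be sketched in a few lines of Python. This is a minimal illustration on made-up expression vectors, not code from any package discussed here; the function and variable names are hypothetical. It shows a U-shaped relationship that Pearson correlation misses, a monotonic curve that Spearman correlation scores as perfect, and a single outlier that lowers Pearson but not Spearman.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up expression values for illustration only.
gene_a = [float(v) for v in range(-5, 6)]
u_shaped = [v ** 2 for v in gene_a]          # associated, but not linearly
monotonic = [math.exp(v) for v in gene_a]    # nonlinear but always increasing
trend = [float(v) for v in range(10)]
with_outlier = trend[:-1] + [100.0]          # last sample far above the trend

print("U-shaped  Pearson:", round(pearson(gene_a, u_shaped), 3))
print("Monotonic Pearson:", round(pearson(gene_a, monotonic), 3),
      "Spearman:", round(spearman(gene_a, monotonic), 3))
print("Outlier   Pearson:", round(pearson(trend, with_outlier), 3),
      "Spearman:", round(spearman(trend, with_outlier), 3))
```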
MI is an information theory-based metric that can detect
nonlinear as well as linear relationships. For this type of
analysis, the data need to be discretized into a number of bins,
after which entropy estimates, for example with Miller-Madow's bias
correction (see Note 2), measure how much the changes in the
expression of one gene depend on another [1]. Since MI requires
discretized data, the number of bins used has an impact on the
outcome of the MI calculation. Generally, the number of bins is
based on the number of samples in the dataset, and the square root
of the number of samples is a reasonable estimate (see Note 3) [1].
Once MI scores are calculated, they are normalized to a range of
zero to one by dividing each score by the maximum score, which
facilitates comparison between studies. One drawback of this
technique is that MI scores are unsigned and undirected. On the
other hand, MI will be nonzero whenever any statistical dependence
exists between two features, including dependence that Pearson
correlation may miss, and it is less sensitive to outliers [2].
Unfortunately, the computational cost of MI is high, and the
calculation is slow for very large datasets. Packages such as MINET
in R (see Note 4) provide the framework to run this type of
analysis on biological data [1]. Another method, the maximal
information coefficient (MIC), has been used as an alternative to
MI, but it is a normalized estimate of MI and has
been shown to be less accurate in noisy data [3].
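The steps described above (discretize, estimate entropies, combine them into MI, normalize by the maximum) can be sketched as follows. This is a minimal illustration under stated assumptions, not the MINET implementation: equal-width binning is assumed, the number of bins defaults to the rounded square root of the sample count, entropies are in nats, and all function names are hypothetical. The Miller-Madow correction adds (occupied bins - 1) / (2 × samples) to each empirical entropy.

```python
import math
from collections import Counter


def discretize(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (assumed binning scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels, miller_madow=False):
    """Empirical entropy in nats, optionally with the Miller-Madow bias correction."""
    n = len(labels)
    counts = Counter(labels)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    if miller_madow:
        h += (len(counts) - 1) / (2 * n)
    return h


def mutual_information(x, y, n_bins=None, miller_madow=False):
    """MI(X;Y) = H(X) + H(Y) - H(X,Y) on discretized data."""
    if n_bins is None:
        n_bins = max(2, round(math.sqrt(len(x))))
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    joint = list(zip(bx, by))
    return (entropy(bx, miller_madow) + entropy(by, miller_madow)
            - entropy(joint, miller_madow))


def normalize_by_max(scores):
    """Scale scores to the range 0..1 by dividing by the largest score."""
    top = max(scores.values())
    return {pair: s / top for pair, s in scores.items()}


# Made-up data: a U-shaped dependence and an uninformative constant profile.
x = [i / 10 for i in range(-20, 21)]
curve = [v * v for v in x]
flat = [1.0] * len(x)

raw = {("x", "curve"): mutual_information(x, curve),
       ("x", "flat"): mutual_information(x, flat)}
normalized = normalize_by_max(raw)
print(raw)
print(normalized)
```

The U-shaped pair, which would have a Pearson correlation near zero, receives a clearly positive MI score, while the constant profile scores zero.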
Once association scores have been calculated, genes without
any association to other genes and edges with very low associations
can be filtered out by setting a cutoff, reducing the size of the
network (see Note 5). In general, the cutoff point for association
scores varies based on the data. Samples containing mixtures of cell
types, such as peripheral blood mononuclear cells (PBMCs), have
relatively low association scores, whereas purified cell types, such as
dendritic cells, have higher association scores. The selection of a
cutoff is therefore based on the distribution of the association
scores. Choosing a data-specific cutoff keeps the network at a
manageable size, leading to interpretable co-expression networks.
Furthermore, the
association scores can be used as edge weights in the network.
These weights can be used with graph theoretical measures to
interpret the data.
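One simple way to derive a data-specific cutoff is to take a quantile of the observed association scores. The sketch below assumes that approach (the text does not prescribe a particular rule), uses hypothetical function names, and runs on invented toy scores that do not come from any study. Edges below the cutoff are dropped, genes left without edges disappear from the network, and the retained scores serve as edge weights.

```python
def quantile_cutoff(scores, q):
    """Return the score at quantile q (0..1) of the observed distribution."""
    ordered = sorted(scores)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


def filter_network(edge_scores, cutoff):
    """Keep edges scoring at or above the cutoff; return (weighted edges, remaining genes)."""
    kept = {pair: s for pair, s in edge_scores.items() if s >= cutoff}
    genes = {gene for pair in kept for gene in pair}
    return kept, genes


# Invented association scores for illustration only.
edge_scores = {
    ("IRF7", "MX1"): 0.82,
    ("IRF7", "OAS1"): 0.74,
    ("MX1", "OAS1"): 0.69,
    ("IRF7", "ACTB"): 0.12,
    ("MX1", "GAPDH"): 0.08,
    ("ACTB", "GAPDH"): 0.15,
}

cutoff = quantile_cutoff(edge_scores.values(), 0.5)
weighted_edges, genes = filter_network(edge_scores, cutoff)
print("cutoff:", cutoff)
print("edges kept:", weighted_edges)
print("genes kept:", sorted(genes))
```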
3 Primary Analysis of Data-Driven Networks Using Graph Theoretical Measures
Data-driven networks are typically very dense and graph theoretical

analysis is the first choice for their characterization. The topology of
gene co-expression networks can provide useful insights into
potential gene interactions and functional relationships. Much like
cell signaling cascades, how the nodes of a network are connected is
representative of information flow. Graph theory includes various
metrics that help to interpret the network. Moreover, these metrics
can also be used to characterize the genes from canonical pathways
and their interactions with genes whose function is previously
unknown.
Node degree is the number of direct connections a node has to
other nodes in the network (see Note 6). The nodes with the highest
degree are referred to as hub nodes, and perturbation of these genes
will have an impact on many parts of the network. The genes
represented by high-degree nodes could also be part of many
canonical pathways, which makes them useful for studying how
multiple pathways interact during a perturbation such as an
infection. Even the distribution of node degrees informs about
the overall pattern of gene associations within a network.
Some nodes in the network will be required to connect different
sections of the network. These nodes have a high betweenness
centrality. They might not have a direct connection to many
other nodes, but they act as a bottleneck connecting other groups
of nodes. For example, a network describing the response of
epithelial and dendritic cells to influenza infection shows that
the MX2 gene has a low node degree and a high betweenness
centrality [4]. This gene is known to be induced by interferon and
has been linked to viral resistance [5]. Alternatively, these nodes
can be the same as the nodes with high node degree. The same study
found that IMMT had both a high betweenness centrality and a high
node degree [4]. This gene has been linked to apoptosis in the
immune response to influenza [6]. Nodes with high betweenness
centrality lie on many shortest paths between genes. A shortest
path is the path connecting two nodes through the minimum number of
intermediate nodes (Fig. 1a). This network metric can reveal novel
genes associated with genes known to be involved in antiviral
response [4].
Another way of assessing the nodes of a network is based on the
clustering coefficient of nodes. This is a measure of how well
grouped the nodes of interest are within the network. This is useful
in biological analysis when looking at genes from gene sets or
canonical pathways to probe their connectedness. Thus, different
types of centrality measures provide information about how a gene
is involved in information transfer through a network. More details
about graph theoretical analysis have been described [7].
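The three measures discussed in this section can be computed on a toy graph with the short Python sketch below. The graph is invented for illustration: two triangles joined through a single bridge node "D", which therefore has a low degree but a high betweenness centrality, much like the MX2 example. Betweenness is computed with Brandes' algorithm for unweighted, undirected graphs; the function names are hypothetical, and in practice a library such as NetworkX or igraph would normally be used.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (node, node) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted graph)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies back toward s
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice


def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))


edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G")]
adj = build_adjacency(edges)
degree = {v: len(nbrs) for v, nbrs in adj.items()}
bc = betweenness(adj)

for v in sorted(adj):
    print(v, "degree:", degree[v], "betweenness:", bc[v],
          "clustering:", round(clustering_coefficient(adj, v), 2))
```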
Fig. 1 Epithelial (top) and dendritic cell (bottom) co-expression networks responsive to influenza infection. (a)
The shortest path based network highlighting the interactions between genes from the KEGG cytosolic DNA
sensing pathway. (b) Subnetworks enriched in genes from the KEGG cytosolic DNA sensing pathway. (c)
Nearest neighbors of the CXCL10 gene [4]
Numerical measures of network topology are very informative,
but visualizing a network can make all of these points easier to
understand. Tools like igraph in R or NetworkX in Python are
useful for generating networks and extracting network
characteristics. The igraph package has tools for network
visualization similar to those of more user-friendly tools like
Cytoscape. Choices of how the graph is laid out (see Note 7) and
how the nodes are represented can highlight different network
characteristics and emphasize evidence supporting a hypothesis.
4 Advanced Analysis of Data-Driven Networks Using Clustering Methods
Functionally related genes can be obtained from canonical path-

ways or large-scale studies available in the public domain. These
data-driven networks can also be directly used to infer the function-
ally related genes using clustering. Further, integrating prior
knowledge with data-specific clusters improves the confidence in
the inference. Different methods of clustering can be used to
identify strongly associated groups of genes.
One of the most common methods is hierarchical clustering. In
this case the genes are iteratively split into groups based on their
similarity with other genes measured by a distance measure. The
groups are split until each gene is in its own group. This makes a
dendrogram which can be cut using a threshold to get clusters. The
chosen threshold will determine the number of clusters (see Note 8).
Another type of commonly used clustering is k-means clustering,
which puts each data point into a group based on its distance from a
predetermined number of randomly assigned cluster centers. The
centers are recalculated based on the data points assigned to them
for a preset number of iterations. In this method the number of
clusters must be decided before the analysis is run (see Note 9).
These two methods will output different groupings of the data, and
within each method there are different options for how the
clustering is achieved based on different distance measures. In
hierarchical clustering, Euclidean distance is commonly used and
represents the straight-line distance between two data points, but
other measures such as MI and Pearson correlation can also be used
as distance measures [8]. These are forms of hard clustering, which
means that each data point is a member of only one cluster.
Alternatively, fuzzy clustering can be used. In this type of
analysis, each data point is potentially a member of every cluster
and has a membership score for each cluster. Using a score cutoff,
data points are assigned to one or more clusters (Fig. 2). This
method requires that the number of clusters be set in advance, as in
k-means clustering, and that a cutoff be chosen, as in hierarchical
clustering. This type of clustering is more representative of
biological pathways, in which many genes are associated with more
than one signaling cascade or biological process [9].

Fig. 2 A chord diagram of fuzzy cluster membership. The width of the lines represents the number of genes
from a cluster overlapping with another cluster. The KEGG pathways indicated next to the cluster numbers
show enrichment of that pathway in a cluster [9]
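How a score cutoff turns fuzzy memberships into overlapping clusters can be sketched as follows. The membership formula is the standard fuzzy c-means update, u_k = 1 / sum_j (d_k / d_j)^(2/(m-1)), with fuzzifier m = 2 assumed; the cluster centers are fixed by hand rather than fitted, the coordinates are invented, and the function names are hypothetical. A full fuzzy c-means run would alternate this step with recomputing the centers.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Membership of one point in each cluster (fuzzy c-means membership formula)."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0 for d in dists):
        # The point coincides with one or more centers: share membership among them.
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_k / d_j) ** exponent for d_j in dists) for d_k in dists]


def assign_clusters(memberships, cutoff):
    """Indices of all clusters whose membership score reaches the cutoff."""
    return [k for k, u in enumerate(memberships) if u >= cutoff]


# Invented two-dimensional cluster centers and data points.
centers = [(0.0, 0.0), (4.0, 0.0), (2.0, 6.0)]
shared_gene = (2.0, 0.0)    # halfway between the first two centers
specific_gene = (0.1, 0.0)  # close to the first center only

for name, point in [("shared", shared_gene), ("specific", specific_gene)]:
    u = fuzzy_memberships(point, centers)
    print(name, [round(v, 3) for v in u], "clusters:", assign_clusters(u, cutoff=0.3))
```

With a cutoff of 0.3, the point lying between two centers is assigned to both clusters, whereas the point near a single center belongs to only one.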
It is also possible to cluster genes into subnetworks based on
association scores and the network topology (Fig. 1b). This is also
known as community detection, where a community is a very dense
section of a larger graph. A community has more interactions
within the community than with the rest of the network. One
frequently used approach is based on random walks, where a node
with many edges connected to it has more opportunities for a walk
to pass through it. An example of this is the Walktrap method
implemented in the R package igraph, which relies on short random
walks to determine the likelihood of nodes being members of a
community and then uses hierarchical clustering to merge adjacent
communities [10]. It is possible to have weighted random
walks where the association score for each edge is used as an edge
weight and the probability of a random walk traversing an edge is
based on the edge weight [11]. Other methods utilize prior
knowledge-based information, such as canonical pathway networks, to
generate subnetworks that incorporate connections known to be
biologically meaningful while allowing for the inclusion of
previously uncharacterized interactions [12]. Random walk-based
methods are just one type of network-topology-based community
detection. There are also methods for finding overlapping
communities. Much like fuzzy clustering, these methods
output clusters where nodes can belong to multiple clusters [13].
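The idea behind weighted random walks can be illustrated with a short sketch: the probability of stepping along an edge is its weight divided by the total weight of the edges leaving the current node, and a short walk started inside a dense community tends to stay there. This is not the Walktrap implementation in igraph; it only computes the walk distribution that such methods build on. The graph, its weights, and the function names are invented for illustration.

```python
def transition_probabilities(weights):
    """Per-node step probabilities proportional to edge weight (undirected graph)."""
    adj = {}
    for (a, b), w in weights.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    return {node: {nbr: w / sum(nbrs.values()) for nbr, w in nbrs.items()}
            for node, nbrs in adj.items()}


def walk_distribution(transitions, start, steps):
    """Probability of being at each node after a fixed number of steps from start."""
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, p in dist.items():
            for nbr, t in transitions[node].items():
                nxt[nbr] = nxt.get(nbr, 0.0) + p * t
        dist = nxt
    return dist


# Two tightly associated triads joined by one weak edge (invented weights).
weights = {
    ("A", "B"): 0.9, ("A", "C"): 0.9, ("B", "C"): 0.9,
    ("D", "E"): 0.9, ("D", "F"): 0.9, ("E", "F"): 0.9,
    ("C", "D"): 0.1,
}

transitions = transition_probabilities(weights)
after_three = walk_distribution(transitions, start="A", steps=3)
inside = sum(p for node, p in after_three.items() if node in {"A", "B", "C"})
print({node: round(p, 3) for node, p in sorted(after_three.items())})
print("probability of remaining in the starting community:", round(inside, 3))
```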
5 Comparison Between the Methods Deriving Data-Driven Networks and Their Utility in Studying Immune Response
Methods have been developed specifically for the calculation of
gene co-expression networks. These packages calculate association
scores and offer additional functionalities for the analysis. One
such method is an algorithm for the reconstruction of gene
regulatory networks (ARACNe), which is implemented in R [14].
ARACNe is based on the relevance networks method, which uses MI to
construct co-expression networks [15]. Improving on raw MI scores,
ARACNe shuffles gene-expression levels and determines a p-value
for each MI score to identify the statistically significant MI
scores. ARACNe then filters the remaining edges using the
data-processing inequality. Specifically, if two nodes joined by an
edge are also connected through a third node by two edges that both
have higher MI than the direct edge, the direct edge is deemed an
indirect association and removed. These extra steps mean that the
resulting network is sparser.
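The data-processing inequality filter can be sketched as a pass over all fully connected gene triplets that removes the weakest edge of each. This is a simplified illustration, not the ARACNe code: it omits the permutation-based p-value step and any tolerance parameter, breaks ties arbitrarily, and uses invented MI scores and hypothetical function names.

```python
from itertools import combinations


def dpi_prune(mi):
    """Remove the lowest-MI edge from every triangle of connected genes.

    mi maps frozenset({gene_a, gene_b}) to an MI score; a pruned copy is returned.
    """
    genes = sorted({g for edge in mi for g in edge})
    to_remove = set()
    for a, b, c in combinations(genes, 3):
        triangle = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(edge in mi for edge in triangle):
            to_remove.add(min(triangle, key=lambda edge: mi[edge]))
    return {edge: score for edge, score in mi.items() if edge not in to_remove}


# Invented scores: TF -> G1 -> G2 is a chain, so TF-G2 is likely indirect.
mi_scores = {
    frozenset(("TF", "G1")): 0.9,
    frozenset(("G1", "G2")): 0.8,
    frozenset(("TF", "G2")): 0.3,
    frozenset(("G2", "G3")): 0.2,  # not part of any triangle, so it is kept
}

pruned = dpi_prune(mi_scores)
print(sorted(tuple(sorted(edge)) for edge in pruned))
```

Edges are marked first and removed afterwards so that the outcome does not depend on the order in which triplets are visited.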
ARACNe was utilized to infer disease and cell-type-specific
co-expression networks and their transcription factor regulators
[16]. These networks were then used for the deconvolution of
skin cells and immune cells to find disease-specific transcriptional
regulation.
Another popular method of generating co-expression
networks is the weighted correlation network analysis (WGCNA) R
package. WGCNA offers multiple options, such as Pearson or Spearman
correlation for association scores and hierarchical, k-means, or
fuzzy clustering to identify groups of highly correlated genes
called modules [17]. These options make the package very flexible
and allow it to be applied in a very data-specific manner. WGCNA
also includes methods for studying the topological characteristics
of the network as well as functions for visualization of the network
properties. The package is very user friendly and even includes
tutorials for researchers with less experience in computational
methods. It has also been applied to many immunological
problems. For example, an influenza vaccine study on PBMCs
from human subjects before and after vaccination used WGCNA to
characterize the immune response [18]. This study showed that genes
from the mixture of cell types in PBMCs clustered according to
specific immune and cell-type signatures.
ARACNe and WGCNA generate gene co-expression networks
in different ways. ARACNe is more complex and uses extra statisti-
cal methods to infer sparse networks. WGCNA is very flexible and
contains clustering techniques to get data-driven gene sets/mod-
ules. Although these packages use different techniques to make
networks, both have been developed for the analysis of gene-
expression data.
These are only two of many algorithms that have been developed
for the analysis of gene co-expression networks. The context
likelihood of relatedness (CLR) algorithm [19] uses MI and is
based on relevance networks like ARACNe, but is focused on the
identification of transcriptional relationships. CLR filters out
less likely interactions by comparing the MI score of one
interaction with a background of all MI scores involving either
gene of the interacting pair. This means that the method is useful
for finding the interactions most likely to be biologically
relevant when one transcription factor is weakly associated with
many genes or vice versa. Bayesian networks have also been applied
to many transcriptomic datasets [20]. A benefit of Bayesian
networks is that the method can generate directed graphs, which
rely on time series data. If time series data are available, this
method is very useful for inferring the dynamics of a system, but
with the added complexity of the temporal factor (see Note 10)
[21]. Tree-based methods, such as random forests, have also been
applied to gene-expression data. Methods like gene network
inference with ensemble of trees (GENIE3) use regression trees to
define modules [22]. These represent some of the commonly used
methods for the inference of gene co-expression networks, but many
other methods have been developed with similar base methods and
unique applications to the
problem of generating accurate co-expression networks.
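The background comparison used by CLR can be sketched as a z-score computed for each gene against its own MI scores, with the two z-scores of a pair then combined. The sketch assumes the commonly described form, in which negative z-scores are set to zero and the pair score is the square root of the sum of squared z-scores; the use of the population standard deviation, the toy MI matrix, and the function names are assumptions made for illustration rather than details taken from [19].

```python
import math
from statistics import mean, pstdev


def clr_scores(mi):
    """CLR-style scores from a symmetric MI matrix given as a list of lists."""
    n = len(mi)

    def z(i, j):
        # Background: all MI scores involving gene i, excluding the diagonal.
        background = [mi[i][k] for k in range(n) if k != i]
        sd = pstdev(background)
        if sd < 1e-12:
            return 0.0  # flat background carries no evidence for any edge
        return max(0.0, (mi[i][j] - mean(background)) / sd)

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            scores[i][j] = scores[j][i] = math.sqrt(z(i, j) ** 2 + z(j, i) ** 2)
    return scores


# Invented MI matrix: genes 0 and 1 stand out against a uniformly weak background.
mi_matrix = [
    [0.0, 0.8, 0.1, 0.1],
    [0.8, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

scores = clr_scores(mi_matrix)
for row in scores:
    print([round(v, 3) for v in row])
```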
The integration of multiple data types such as transcriptomic,
DNA methylation, metabolomics, and miRNA data can improve
the robustness of gene co-expression networks. However, combining
different data types has some difficulties. Extra steps are
required, such as normalizing the data types so that they can be
compared and standardizing the mapping of gene or protein names.
Even combining data from different technologies that measure the
same biomolecule, such as RNA sequencing and microarrays, is not
trivial because the technologies differ in their techniques and in
the true positive measurements they yield. One example of
combining data types uses transcriptomic and DNA methylation data
[23] to generate a network. This method only considers an edge to
be valid if the association is identified in both datasets,
increasing the likelihood that the edge is a true positive.
However, the number of true positive measurements varies with the
type of data; thus this approach has the potential to exclude a
reasonable number of true edges. The approach described in [23]
also integrates knowledge-driven protein–protein interaction (PPI)
networks. The interactions from the PPI network are used in the
calculation of edge weights to increase the confidence of the
data-driven network
edges.
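The rule of keeping an edge only when it is found in both datasets amounts to intersecting the edge sets of the two networks. The sketch below shows only that step, on invented edge lists with hypothetical function names; it does not reproduce the edge-weight calculation or the PPI integration of the approach in [23].

```python
def undirected(edges):
    """Represent each edge as a frozenset so that (a, b) and (b, a) match."""
    return {frozenset(edge) for edge in edges}


def consensus_edges(*networks):
    """Edges present in every supplied network."""
    edge_sets = [undirected(network) for network in networks]
    return set.intersection(*edge_sets)


# Invented edge lists for two data types.
expression_edges = [("STAT1", "IRF1"), ("IRF1", "CXCL10"), ("STAT1", "ACTB")]
methylation_edges = [("IRF1", "STAT1"), ("CXCL10", "IRF1"), ("GAPDH", "ACTB")]

shared = consensus_edges(expression_edges, methylation_edges)
print(sorted(tuple(sorted(edge)) for edge in shared))
```

The edge seen in only one data type is discarded, which illustrates how a strict intersection can also exclude true edges that one technology fails to detect.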
WGCNA also allows integration of multiple datasets. For example,
in a study of the immune response to West Nile virus,
transcriptional modules identified by WGCNA were compared with
cytokine levels measured by Luminex [24]. In this study,
gene-expression modules associated with IL-4, IL-6, and TNF were
identified.
Thus data-driven networks can be enhanced using additional
datasets and can be utilized to investigate various immunological
questions.
6 Networks Analyzing Immune Responses
The immune response consists of multiple different cell types,

cytokines, chemokines, and molecular networks within different
cell types. While most of the earlier studies measure human immune
response using blood samples, more recent studies are utilizing
specific cell types and tissue samples. We have studied
co-expression networks developed from PBMCs obtained from
blood to investigate immune responses to influenza and respiratory
syncytial virus (RSV) infections. Co-expression networks were
developed using MI, and gene modules were identified using fuzzy
C-means clustering. It has been shown that the systemic immune
response to influenza and RSV can be characterized by investigating
gene expression in PBMCs. Despite PBMCs being located away
from the site of infection, they have been shown to have gene-
expression patterns for interferon-related genes that correlate with
airway epithelial cells in RSV and influenza infection [25]. This is
important since obtaining samples from the primary infected cells
from human subjects is not always possible and frequently mixtures
of cells like PBMCs are utilized to measure the immune response.
We have also studied the response in specific cells such as epithelial
and dendritic cells. Airway epithelial cells are constantly exposed to
pathogens and are responsible for secreting immune-activating
signals in response to harmful pathogens [26]. Dendritic cells are
a part of the airway mucosa and lung parenchyma which process
and present antigens to T-cells and are involved in the inflammatory
response to infection [26].
When studying the immune response, differences between
human subjects are important factors. Age, gender, race, time
since onset of symptoms, and other factors can impact the expres-
sion of immune-related genes. For example, when studying RSV
infection in infants, the developing immune system responds dif-
ferently to infection depending on the age of the patients
[27]. Regression-based methods have been developed to use
these covariates when calculating association scores between
genes which can reduce false discovery [28].
To investigate immune response, it is critical to understand the
intercellular and cell-cytokine interactions. In the past, knowledge-
driven networks of cytokine interactions have been combined with
data-driven networks to find novel interactions between cells and
cytokines in the immune response [29]. This study used a database
of cytokine-cytokine relationships to expand the networks derived
from bacterial respiratory infections and asthma. The authors were
able to infer the role of NK cells in combined asthma and bacterial
infection, which was previously uncharacterized.
Recent developments in technologies such as flow cytometry,
cyTOF, and single cell RNAseq have allowed for deeper investigations
of cell–cell interactions. Flow cytometry requires predetermined
markers for analysis and is limited in how many markers can
be assessed at one time. The cyTOF assay identifies immune popu-
lations using time of flight mass cytometry [30]. A combination of
these cellular phenotyping assays with gene expression has facili-
tated methods to deconvolute cell-type-specific expression
networks.
Single cell RNAseq provides new opportunities to investigate
the differences between cells. Methods like Seurat use a network-based
clustering algorithm to identify populations of cells from single
cell RNAseq data [32]. The cells are clustered by expression and
each cluster is assigned a putative cell type based on the expression
of cell-type-specific genes. Cell-type-specific networks are then
made for each cluster, reducing the noise associated with mixed
cells. Unfortunately, the cost of the analysis can be prohibitive and
the data itself is inherently noisy. There are many missing values for
gene expression in single cell RNAseq which makes calculating
association scores difficult. Variation between human subjects
becomes more apparent when comparing single cell expression
levels. However, with enough cells from each individual, these
data could aid inference of cell-type-specific gene networks asso-
ciated with clinical symptoms.
7 Conclusions
In this chapter we review methods to assemble and analyze networks
from large-scale data. We describe ways to integrate these
networks with available information from primary literature and
databases in order to improve confidence in data-driven networks.
At the end we review immunological networks and describe newer
technologies that allow investigation of cell-specific molecular
networks.
8 Notes
1. In Pearson correlation a pair of genes could be given a low
association score because of a few samples that do not follow
the same trend as the majority, resulting in false negatives.
2. The Miller-Madow correction reduces the downward bias of
the entropy function without increasing computational cost.
Other options for an estimator include the shrink entropy
estimator for data with a small sample size and the Schurmann-Grassberger
estimator to incorporate prior knowledge into
the entropy calculation [1].
3. The bins can be split using either equal width, where the range
of each variable is divided into intervals of equal size that can
hold different numbers of data points, or equal frequency,
where all of the bins have an equal number of data points [1].
4. R is a statistical programming language that has a very active community
of users. RStudio, a development environment that combines a console with a text
editor, can be useful when starting out. There are tutorial
R packages like swirl for learning to use this language.
There are also many useful bioinformatics packages in
Bioconductor.
5. Selecting a cutoff can be made easier by using a visual representation
of the data distribution, like a histogram. For example, a
lot of gene-expression data has a positively skewed distribution,
and finding the inflection point is a good start for picking a
cutoff. A histogram also shows how cutoffs based on statistical measures like the
mean and standard deviation will change the number of genes
that will be filtered out.
6. Node degree can be useful in determining the density of a
network. Very dense networks will have many edges connecting
the nodes and will be more difficult to interpret. Very sparse
networks won’t have enough edges to evaluate. It is important to find a
balance between these so that potentially new interactions are
present while false positive interactions are limited.
Looking at the average and distribution of node degree in a
network can help to evaluate the cutoff chosen for the associa-
tion scores.
7. It will sometimes be difficult to choose a layout where all of the
nodes and edges are visible, especially with dense networks.
Cytoscape and iGraph have built in functions for determining
the layout of the graph, and Cytoscape allows the user to move
nodes which, while time consuming, can vastly improve the
layout of the network.
8. A visual representation of the data, like a heatmap, can be very
useful in this step. Clusters are usually visible in the
heatmap, and adding a dendrogram makes them easier to see. Evaluating the
dendrogram can inform the expected number of clusters in
the data.
9. K-means clustering can be biased toward forming clusters of
equal size, which may skew the results.
10. Bayesian networks can be more sensitive to noise than other
methods.
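Notes 2 and 3 can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the function names and the toy expression values are hypothetical, and it is not the minet implementation. It contrasts equal-width with equal-frequency binning and applies the Miller-Madow correction, which adds (m - 1)/(2N) to the empirical entropy, where m is the number of non-empty bins and N is the number of samples.

```python
import math
from collections import Counter

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant data
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign values to bins that hold (nearly) equal numbers of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(values)
    return labels

def entropy_miller_madow(labels):
    """Empirical entropy in nats plus the Miller-Madow bias correction."""
    n = len(labels)
    counts = Counter(labels)
    h_emp = -sum(c / n * math.log(c / n) for c in counts.values())
    return h_emp + (len(counts) - 1) / (2 * n)

# Hypothetical expression values for one gene, with a single outlier.
expression = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 9.0]
width_labels = equal_width_bins(expression, 4)
freq_labels = equal_frequency_bins(expression, 4)
print(Counter(width_labels), Counter(freq_labels))
print(entropy_miller_madow(freq_labels))
```

With skewed data such as this, equal-width binning places almost every sample in one bin, while equal-frequency binning spreads the samples evenly, which is one reason the choice of discretization can change the resulting association scores.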
References
1. Meyer PE, Lafitte F, Bontempi G (2008) Minet: a R/bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics 9:461
2. Priness I, Maimon O, Ben-Gal I (2007) Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics 8:111
3. Kinney JB, Atwal GS (2014) Equitability, mutual information, and the maximal information coefficient. Proc Natl Acad Sci U S A 111:3354–3359
4. Katanic D, Khan A, Thakar J (2016) PathCellNet: cell-type specific pathogen-response network explorer. J Immunol Methods 439:15–22. https://doi.org/10.1016/j.jim.2016.09.005
5. Stertz S, Dittmann J, Blanco JCG et al (2007) The antiviral potential of interferon-induced cotton rat mx proteins against orthomyxovirus (influenza), rhabdovirus, and bunyavirus. J Interf Cytokine Res 27:847–855
6. Shim JM, Kim J, Tenson T et al (2017) Influenza virus infection, interferon response, viral counter-response, and apoptosis. Viruses 9:1–12
7. Christensen C, Thakar J, Reka A (2007) Systems-level insights into cellular regulation: inferring, analysing, and modeling intracellular networks. IET Syst Biol 1:61–77
8. D’Haeseleer P (2005) How does gene expression clustering work? Nat Biotechnol 23:1499–1501
9. Khan A, Katanic D, Thakar J (2017) Meta-analysis of cell-specific transcriptomic data using fuzzy c-means clustering discovers versatile viral responsive genes. BMC Bioinformatics 18:295
10. Pons P, Latapy M (2006) Computing communities in large networks using random walks. J Graph Algorithms Appl 10:191–218
11. Rosvall M, Bergstrom CT (2007) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci U S A 105:1118–1123
12. Bourdakou MM, Spyrou GM (2017) Informed walks: whispering hints to gene hunters inside networks’ jungle. BMC Syst Biol 11:1–11
13. Javed MA, Younis MS, Latif S et al (2018) Community detection in networks: a multidisciplinary review. J Netw Comput Appl 108:87–111
14. Margolin AA, Nemenman I, Basso K et al (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7(Suppl 1):S7
15. Butte AJ, Kohane IS (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput 5:415–426
16. Chen JC, Cerise JE, Jabbari A et al (2015) Master regulators of infiltrate recruitment in autoimmune disease identified through network-based molecular deconvolution. Cell Syst 1:326–337
17. Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9:559
18. Voigt EA, Grill DE, Zimmermann MT et al (2018) Transcriptomic signatures of cellular and humoral immune responses in older adults after seasonal influenza vaccination identified by data-driven clustering. Sci Rep 8:1–16
19. Faith JJ, Hayete B, Thaden JT et al (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 5:0054–0066
20. Friedman N, Linial M, Nachman I, Pe’er D (2000) Using Bayesian networks to analyze expression data. J Comput Biol 7:601–620
21. Sanchez-Castillo M, Blanco D, Tienda-Luna IM et al (2018) A Bayesian framework for the inference of gene regulatory networks from time and pseudo-time series data. Bioinformatics 34:964–970
22. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P (2010) Inferring regulatory networks from expression data using tree-based methods. PLoS One 5:1–10
23. Li J, Zhang Q, Chen Z et al (2019) A network-based pathway-extending approach using DNA methylation and gene expression data to identify altered pathways. Sci Rep 9:11853
24. Qian F, Thakar J, Yuan X et al (2014) Immune markers associated with host susceptibility to infection with west nile virus. Viral Immunol 27:39–47
25. Ioannidis I, McNally B, Willette M et al (2012) Plasticity and virus specificity of the airway epithelial cell immune response during respiratory virus infection. J Virol 86:5422–5436
26. Holt PG, Strickland DH, Wikström ME, Jahnsen FL (2008) Regulation of immunological homeostasis in the respiratory tract. Nat Rev Immunol 8:142–152
27. Walsh EE, Mariani TJ, Chu C et al (2019) Aims, study design, and Enrollment results from the assessing predictors of infant respiratory syncytial virus effects and severity study. JMIR Res Protoc 8:e12907
28. Xie J (2018) False discovery rate control for high dimensional networks of quantile associations conditioning on covariates. J R Stat Soc Series B Stat Methodol 80:1015–1034
29. Campbell C, Thakar J, Albert R (2011) Network analysis reveals cross-links of the immune pathways activated by bacteria and allergen. Phys Rev E Stat Nonlin Soft Matter Phys 84:1–12
30. Gadalla R, Noamani B, MacLeod BL et al (2019) Validation of CyTOF against flow cytometry for immunological studies and monitoring of human cancer clinical trials. Front Oncol 9:1–13
31. Krutzik PO, Hale MB, Nolan GP (2005) Characterization of the murine immunological Signaling network with Phosphospecific flow Cytometry. J Immunol 175:2366–2373
32. Butler A, Hoffman P, Smibert P et al (2018) Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36:411–420
Chapter 10
In Silico-Guided Sequence Modification of Epitopes in Cancer Vaccine Development
Winfrey Pui Yee Hoo, Pui Yan Siak, and Lionel L. A. In
Abstract
Discovery of tumor antigenic epitopes is important for cancer vaccine development. Such epitopes can be
designed and modified to become more antigenic and immunogenic in order to overcome immunosup-
pression towards the native tumor antigen. In silico-guided modification of epitope sequences allows
predictive discrimination of those that may be potentially immunogenic. Therefore, only candidates
predicted with high antigenicity will be selected, constructed, and tested in the lab. Here, we describe
the employment of in silico tools using a multiparametric approach to assess both potential T-cell epitopes
(MHC class I/II binding) and B-cell epitopes (hydrophilicity, surface accessibility, antigenicity, and linear
epitope). A scoring and ranking system based on these parameters was developed to shortlist potential
mimotope candidates for further development as peptide cancer vaccines.
Key words In silico prediction, T-cell epitope, B-cell epitope, Tumor antigen, IEDB, Vaccine design,
Mimotope
1 Introduction
Approaches employing peptide mimotopes (mimics of epitopes) are
gaining much traction as a form of cancer vaccine immunotherapy
capable of being recognized by T-cell receptors and paratopes of
antibodies [1, 2]. Mimotopes are also proficient at eliciting the
secretion of cytokines, having superior MHC-restriction capabil-
ities and potentially developing immunological memory against
targeted tumor antigens [3–5]. To date, several studies have
reported the use of sequence modifications to overcome poorly
immunogenic self-antigens in vaccine design. These include
amino acid substitution(s) of HLA anchor residues to increase
binding affinity to MHC class I/II molecules while enhancing
antigenicity [6–9], and such substitutions have been associated with improved
immunogenicity in vivo [10–13]. Owing to numerous amino acid
substitution possibilities, computational methods are often
adopted to predict and shortlist antigenicity and immunogenicity
changes from a large array of modified sequence candidates before
subjecting them to proof-of-concept downstream studies. Here, we
selected the Immune Epitope Database and Analysis Resource
(IEDB) computational platform to predict and assess properties
of T- and B-cell mimotopes.
Designing potential mimotopes is highly dependent on the
intended immune response. Selection of potential T-cell mimo-
topes is conceptually based on MHC class I/II epitope prediction.
Human MHC class I epitope predictions include HLA-A/B/C
alleles, while HLA supertypes (34 allelic subtypes) typically cover
a large proportion of highly polymorphic HLA-DP/DQ/DR
MHC class II alleles. When mouse models are used as a proof of
concept, mouse MHC class I epitope predictions should ideally
cover H-2D, H-2K, and H-2L alleles, while predictions for MHC
class II focus on the H-2I allele. On the other hand, predictions of
B-cell mimotope linearity and hydrophilicity stretches are done
using the BepiPred method which uses a combination of a hidden
Markov model and two amino acid propensity scales: the Parker’s
hydrophilicity scale and Levitt’s secondary structure scale [14–
16]. The hydrophilicity scale used in Parker’s hydrophilicity prediction
is based on peptide retention times in high-performance
liquid chromatography (HPLC) with a reversed-phase
column. The more hydrophilic the peptide fragment, the shorter
the retention time in reversed-phase HPLC [15]. Kolaskar and Tongaonkar
antigenicity prediction uses a semiempirical approach developed based
on physicochemical properties of amino acid residues to predict and
score mimotopes on segments within a protein sequence that are
likely to be antigenic enough for antigen-presenting cell (APC)
recognition [17, 18]. Emini surface accessibility prediction uses a
scale that suggests hexapeptide mimotope regions for surface acces-
sibility of the epitope being bound on the surface of APCs.
After obtaining predicted raw scores from all six parameters, a
scoring and ranking system should be established in descending
priority order so that shortlisting of top mimotope candidates
based on a multiparametric approach can be carried out. All para-
meters are assigned with a reducing score range as follows: peptides
having high MHC class II restriction > MHC class I restriction >
surface accessibility > hydrophilicity > antigenicity > epitope lin-
earity. The total sum of normalized percentile scores yields a series
of ranked mimotopes in descending order of immunogenicity. Top
mimotope sequences that are highly predicted to be immunogenic
can then be chosen, synthesized, and subjected to further develop-
mental studies or preclinical animal assessments to evaluate their
immunogenicity.
2 Methods
2.1 In Silico Amino Acid Substitution(s) on Peptide Sequence
1. Enter the website for GenBank, National Center for Biotechnology
Information at https://www.ncbi.nlm.nih.gov/genbank/
2. Click on the drop-down button beside the “search” bar and
select “Protein.” Type the accession number, if known. Other-
wise, type the name of protein and species of choice. Click
“search.”
3. Obtain and download the full-length peptide sequence of
interest for sequence manipulation. As peptide sequences may
come from various species, select the peptide sequence based on
the target species.
4. Select a region from the full-length peptide sequence for mod-
ification (see Note 1).
5. Modify the peptide sequences by adding single amino acid
substitutions flanking the original mutations, thus generating
peptide sequences with up to 19 different amino acid possibi-
lities (see Note 2).
6. Use all query sequences generated for subsequent T- and B-cell
epitope predictions.
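The substitution step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function names, the example 10-mer, and the choice of which position to hold fixed are all assumptions made here. Each non-fixed residue is replaced once by each of the 19 other standard amino acids, and the variants are written as FASTA text ready to paste into a prediction query box.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def single_substitutions(peptide, fixed_positions=()):
    """Yield (name, sequence) for every single-residue substitution.

    Positions listed in fixed_positions (0-based) are left untouched,
    for example the oncogenic mutant residue itself.
    """
    for pos, original in enumerate(peptide):
        if pos in fixed_positions:
            continue
        for aa in AMINO_ACIDS:
            if aa != original:
                name = f"{original}{pos + 1}{aa}"
                yield name, peptide[:pos] + aa + peptide[pos + 1:]

def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "\n".join(f">{name}\n{seq}" for name, seq in records)

peptide = "KLVVVGADGV"  # arbitrary 10-mer, used only for illustration
variants = list(single_substitutions(peptide, fixed_positions={7}))
fasta_text = to_fasta(variants)
print(len(variants))
```

A 10-residue region with one fixed position yields 9 x 19 = 171 variants, so longer regions may need to be split across several submissions if the server limits the number of sequences per query.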
2.2 T-Cell Epitope Prediction
2.2.1 Input of Query Sequences
1. Go to the IEDB website using the following link: http://www.iedb.org/
(Fig. 1) and click on “MHC I Binding” or “MHC
II Binding” found under “T Cell Epitope Prediction” in the
section “Epitope Analysis Resource.” This will allow you to
enter the MHC I and II binding prediction sites, which can
also be accessed via http://tools.iedb.org/mhci/ (Fig. 2) and
http://tools.iedb.org/mhcii/ (Fig. 3), respectively.
2. Under “Specify Sequence(s),” insert protein sequence in
FASTA format or click the “Choose File” button to select a
file from your desktop (see Note 3).
Fig. 1 The official website of Immune Epitope Database and Analysis Resource
Fig. 2 IEDB web server for MHC class I binding prediction
3. Use the default IEDB recommended prediction method (see
Note 4).
4. For MHC I binding prediction, under “Specify what to make
binding predictions for”:
(a) Select the MHC source species by clicking on the drop-
down button. The options available include chimpanzee,
cow, gorilla, human, macaque, mouse, pig, and rat (see
Note 5).
(b) The “Show only frequently occurring alleles” checkbox is
checked by default. Select MHC allele(s) from the drop-
down button to include one to multiple alleles of different
lengths. Lengths can be selected in the drop-down button
on the right (Fig. 2) (see Note 6).
Fig. 3 IEDB web server for MHC class II binding prediction
5. For MHC II binding prediction, under “Specify what to make
binding predictions for”:
(a) Select the species or locus by clicking on the drop-down
button. The options available include human (HLA-DP/DQ/DR)
and mouse (H-2-I).
(b) Select MHC allele(s) from the drop-down button to
include one to multiple alleles (Fig. 3).
6. Under “Specify Output,” peptides can be sorted according to
“Percentile Rank” or “Position in Sequence.” All predictions
can be generated in an XHTML table or a text file. Choose the
preferred option for the generation of the output before click-
ing the “Submit” button to perform the prediction (Figs. 2
and 3).
Fig. 4 Example of the generated web page displaying the results of the MHC class I binding prediction
Fig. 5 Example of the generated web page displaying the results of the MHC class II binding prediction
2.2.2 Interpretation of Output Results
1. Both MHC class I and II binding prediction results display the
“Input Sequences” and a table showing the predicted
sequences along with their percentile ranks (Figs. 4 and 5)
(see Note 7).
2. Click on the “Check to expand the result” checkbox to expand
the tabulated predictions, which will show scores from individ-
ual prediction methods used (Figs. 6 and 7) (see Note 8).
3. Click to download the predicted results in Microsoft Excel
format.
Fig. 6 Example of the expanded results of the MHC class I binding prediction
Fig. 7 Example of the expanded results of the MHC class II binding prediction
4. Select the peptide sequence with the lowest percentile rank for
each query peptide sequence (see Note 9).
5. For MHC class I binding prediction results, categorize the
peptide sequences into high (0–50 nM), intermediate
(51–500 nM), and low (501–5000 nM) binders based on the
percentile rank.
6. For MHC class II binding prediction results, categorize the
peptide sequences into high (0.01–4.65 nM), intermediate
(4.66–9.28 nM), and low (≥9.29 nM) binders based on the
percentile rank.
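Steps 4 to 6 above can be sketched as follows. This is a hedged illustration only: the row layout and the field names `percentile_rank` and `value` are hypothetical and do not reflect the column headers of the actual IEDB download, and values falling between two published ranges (for example between 50 and 51) are assigned to the higher-numbered category here by assumption.

```python
def categorize(value, high_max, intermediate_max):
    """Return 'high', 'intermediate', or 'low' for a raw prediction value
    where smaller values mean stronger predicted binding."""
    if value <= high_max:
        return "high"
    if value <= intermediate_max:
        return "intermediate"
    return "low"

# Upper bounds of the high and intermediate ranges given in steps 5 and 6.
MHC_I_CUTOFFS = (50, 500)
MHC_II_CUTOFFS = (4.65, 9.28)

def best_prediction(rows):
    """Pick the row with the lowest percentile rank for one query peptide."""
    return min(rows, key=lambda row: row["percentile_rank"])

# Hypothetical rows, as they might be read from a downloaded result table.
rows = [
    {"peptide": "KLVVVGADGV", "percentile_rank": 1.2, "value": 38.0},
    {"peptide": "LVVVGADGVG", "percentile_rank": 14.0, "value": 730.0},
]
best = best_prediction(rows)
category = categorize(best["value"], *MHC_I_CUTOFFS)
print(best["peptide"], category)
```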
Fig. 8 IEDB web server for B-cell epitope prediction
2.3 B-Cell Epitope Prediction
2.3.1 Input of Query Sequences
1. Go to the IEDB website using the following link: http://www.iedb.org/
and click on “Antigen Sequence Properties” found
under “B Cell Epitope Prediction” in the section “Epitope
Analysis Resource.” This will allow you to enter the Antibody
Epitope Prediction site, which can also be accessed using this
link: http://tools.immuneepitope.org/bcell/ (Fig. 8).
2. Enter protein sequence containing less than 50,000 residues in
plain .txt format into the query box (see Note 10).
3. Choose a prediction method in the section “Choose a method”
according to the parameter of choice.
4. Click on the “submit” button to start the prediction.
2.3.2 Interpretation of Output Results
BepiPred Linear B-Cell Epitope Prediction
1. At the end of the prediction, BepiPred displays a prediction
score for every residue in the input sequence, presented
in tables and graphs. Residues with scores above the
default threshold value are predicted to be part of a linear
epitope (Fig. 9) (see Note 11).
2. Optional. To readjust the threshold, enter the threshold value
and click on the “Recalculate” button (see Note 12).
3. Press in the result table below to download the predicted
results in Microsoft Excel format.
4. Calculate the linear epitope scores for each query sequence by
summing up the scores of amino acid residues that are above
the default threshold (Fig. 9).
5. Categorize the peptide sequences into high (≥5.82), intermediate
(0.36–5.81), or low (0–0.35) linearity based on the
respective scores.
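Steps 4 and 5 above amount to a thresholded sum followed by a lookup. The sketch below assumes the per-residue scores have already been read from the downloaded table. The example scores are made up, the function names are hypothetical, and the "high" category is treated as 5.82 and above, which is an assumption about the intended cutoff.

```python
def linear_epitope_score(residue_scores, threshold=0.35):
    """Sum the per-residue scores that lie above the threshold."""
    return sum(score for score in residue_scores if score > threshold)

def linearity_category(total):
    """Map the summed score onto the categories listed in step 5."""
    if total >= 5.82:  # assumed lower bound of the 'high' category
        return "high"
    if total >= 0.36:
        return "intermediate"
    return "low"

# Hypothetical per-residue scores for one short query peptide.
scores = [0.10, 0.40, 0.90, 1.20, 0.30, -0.20]
total = linear_epitope_score(scores)
print(round(total, 2), linearity_category(total))
```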
Fig. 9 Example of a BepiPred linear epitope prediction result of a peptide sequence. Graph and table display
the scores obtained by each residue in the inserted sequence. X- and Y-axis in the graph represent the
sequence position and epitope score, respectively. The yellow region above the threshold was predicted to be
part of an epitope
Parker Hydrophilicity Prediction of Amino Acid Residues
1. Parker hydrophilicity prediction displays the scores for each
window size of 7 amino acid residues of the query peptide
sequence (Fig. 10) (see Note 13).
3. Select the highest residue score as the raw hydrophilic score for
each query peptide sequence.
4. Categorize the peptide sequences into high (3.444–4.614),
intermediate (2.615–3.443), or low (0–2.614) hydrophilicity
based on the respective scores.
Kolaskar and Tongaonkar Antigenicity Prediction
1. Kolaskar and Tongaonkar antigenicity prediction displays the
scores for each window size of 7 amino acid residues of the
query peptide sequence (similar to Parker hydrophilicity prediction)
(Fig. 11) (see Note 13).
3. Select the highest residue score as the raw antigenicity score for
each query peptide sequence (see Note 14).
4. Categorize the peptide sequences into high (≥1.0) and intermediate
(<1.0) antigenicity based on the respective scores.
Fig. 10 Example of a Parker hydrophilicity prediction result of a peptide sequence. Graph and table display
the scores obtained by each residue containing 7 amino acids. The X- and Y-axis in the graph represent the
sequence position and hydrophilic propensity score, respectively. The yellow regions above the threshold are
hydrophilic
Fig. 11 Example of a Kolaskar and Tongaonkar antigenicity prediction result of a peptide sequence. Graph and
table display the scores obtained by each residue containing 7 amino acids. The X- and Y-axis in the graph
represent the sequence position and antigenicity score, respectively. The yellow regions above the threshold
are antigenic
Emini Surface Accessibility Prediction
Emini surface accessibility prediction calculates the surface probability using the
formula $S_n = \left(\prod_{i=1}^{6} \delta_{n+4+i}\right) \times (0.37)^{-6}$, where $S_n$ is the surface
probability, $\delta_n$ is the fractional surface probability value, and $i$ varies from
1 to 6 [19].
1. Emini surface accessibility prediction displays the scores for
each window size of 7 amino acid residues of the query peptide
sequences (similar to Parker hydrophilicity, and Kolaskar and
Tongaonkar antigenicity prediction) (Fig. 12) (see Note 13).
Fig. 12 Example of an Emini surface accessibility prediction result of a peptide sequence. Graph and table
display the scores obtained by each residue containing 7 amino acids. The X- and Y-axis in the graph
represent the sequence position and surface probability score, respectively. The yellow regions above the
threshold are parts of the peptide that can be accessed by B-cell receptors
2. Select the highest residue score as the raw surface accessibility
score for each query peptide sequence (see Note 15).
3. Categorize the peptide sequences into high (1.689–2.01),
intermediate (1.367–1.688), and low (1.048–1.368) surface
accessibility based on the respective scores.
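The surface probability calculation can be illustrated with a short Python sketch. It assumes the product form of the formula, in which the six fractional surface probability values of a hexapeptide are multiplied together and scaled by 0.37 to the power of -6. The function name is hypothetical and the input values are placeholders rather than entries from the published scale.

```python
import math

def emini_surface_probability(deltas):
    """Surface probability S_n for one hexapeptide.

    deltas: the six fractional surface probability values of the
    residues in the hexapeptide (to be taken from a published scale).
    """
    if len(deltas) != 6:
        raise ValueError("expected six fractional surface probabilities")
    return math.prod(deltas) * 0.37 ** -6

# Placeholder inputs, chosen only to show how the scaling behaves.
baseline = emini_surface_probability([0.37] * 6)
exposed = emini_surface_probability([0.5] * 6)
print(baseline, exposed)
```

Under this form, a hexapeptide whose six values all equal 0.37 scores 1.0, which is consistent with the use of 1.0 as the cutoff for likely surface exposure in Note 15.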
2.4 Establishment of a Scoring and Ranking System
1. Establish a T- and B-cell epitope scoring system by prioritizing
the parameters in the following order: MHC class II restriction
> MHC class I restriction > surface accessibility > hydrophilicity
> antigenicity > epitope linearity (see Note 16).
2. Assign the range of score percentages for parameters used in
T-cell epitope prediction, which will total up to 100%. For each
parameter, further assign each range of score percentages into
high, intermediate, and low score percentages. The range of
scores is reduced in decreasing priority based on the prediction
parameter in step 1.
3. Repeat step 2 to assign the range of score percentages for each
parameter used in B-cell epitope prediction (see Note 17).
4. Convert each predicted raw score obtained from IEDB to the
score percentages assigned for each immunologic parameter in
step 2 (see Note 18) (Table 1).
Table 1
Example of a raw score conversion table for each parameter assessed
Raw score range for conversion (maximum weightage (%))
Parameter (maximum weightage) | High | Intermediate | Low
MHC class II (58%) | 0.01–4.65 nM (58%) | 4.66–9.28 nM (39%) | ≥9.29 nM (20%)
MHC class I (42%) | 0–50 nM (42%) | 51–500 nM (28%) | 501–5000 nM (14%)
Surface accessibility (34%) | 2.531–3.149 (34%) | 1.913–2.530 (23%) | 1.294–1.912 (12%)
Hydrophilicity (29%) | 3.444–4.614 (29%) | 2.615–3.443 (19%) | 0–2.614 (9%)
Antigenicity (19%) | ≥1.0 (19%) | <1.0 (13%) | N/A
Linear epitope (12%) | ≥5.82 (12%) | 0.36–5.81 (8%) | 0–0.35 (4%)
(Adapted from Ng et al. 2018)
N/A non-applicable
5. Obtain the total score percentage for each peptide sequence by
summing up all the score values of the immunologic parameters
(Table 1).
6. Rank the peptide sequences from high to low (see Note 19).
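The conversion and ranking in steps 4 to 6 can be sketched as a lookup followed by a sort. The converted percentages below are copied from Table 1. The parameter keys, function names, and the two candidate peptides with their assigned categories are hypothetical and serve only to show the mechanics.

```python
# Converted score (%) per category, copied from Table 1.
WEIGHTS = {
    "mhc_ii": {"high": 58, "intermediate": 39, "low": 20},
    "mhc_i": {"high": 42, "intermediate": 28, "low": 14},
    "accessibility": {"high": 34, "intermediate": 23, "low": 12},
    "hydrophilicity": {"high": 29, "intermediate": 19, "low": 9},
    "antigenicity": {"high": 19, "intermediate": 13},  # no 'low' category
    "linearity": {"high": 12, "intermediate": 8, "low": 4},
}

def total_score(categories):
    """Sum the converted scores for one peptide's categorized parameters."""
    return sum(WEIGHTS[param][level] for param, level in categories.items())

def rank_peptides(peptides):
    """Return (name, total) pairs sorted from highest to lowest total."""
    totals = {name: total_score(cats) for name, cats in peptides.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical categorized results for two mimotope candidates.
peptides = {
    "mimotope_B": {"mhc_ii": "low", "mhc_i": "high",
                   "accessibility": "intermediate", "hydrophilicity": "intermediate",
                   "antigenicity": "intermediate", "linearity": "high"},
    "mimotope_A": {"mhc_ii": "high", "mhc_i": "intermediate",
                   "accessibility": "low", "hydrophilicity": "high",
                   "antigenicity": "high", "linearity": "low"},
}
ranking = rank_peptides(peptides)
print(ranking)
```

Keeping the weights in one table makes it straightforward to substitute a different scoring range without touching the ranking logic.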
3 Notes
1. MHC class I binding involves peptides of a narrow range of
lengths, usually 8 to 10 amino acids, while the open-ended
nature of the MHC class II peptide-binding groove allows for a
wider range of peptide lengths, between 13 and 17 amino acids
[20]. The region selected must encompass the location
where common mutations or single amino acid polymorphisms
(SAPs) are found in the full-length peptide
sequence, while at the same time favoring both MHC class
I/II binding groove domains.
2. Manipulation of the selected region from the full-length pep-
tide will first involve changing a wild-type protein into a natu-
rally occurring oncogenic mutant type. Modification of the
mutant peptide will then involve another substitution mutation
in one of the amino acid residue(s) flanking the oncogenic
mutant codon. With 20 amino acids commonly occurring in
nature, each amino acid residue within the selected peptide
region will be substituted once by each of the 19 remaining
amino acids, thus generating a large dataset of mimotope
sequences.
3. A maximum of 200 FASTA sequences can be entered into the
query box, while the maximum upload file size is 10 MB per
query.
4. IEDB “recommended” is the default prediction method selec-
tion which incorporates the best possible method for a given
MHC molecule. For MHC class I binding prediction, the
default selection allows the server to perform predictions
using the following methods in decreasing order of perfor-
mances: Consensus > Artificial neural network > Stabilized
matrix method > NetMHCpan > CombLib. Meanwhile,
MHC class II binding prediction uses the Consensus method
in combination with NN-align, SMM-align, CombLib, and
Sturniolo; otherwise NetMHCIIpan will be used.
5. T-cell epitope predictions against both human and mouse or
rat libraries are important in order to generate predictive values
of each modified peptide sequence during in vivo assessment.
6. “Show only frequently occurring alleles” checkbox is checked
by default, which allows the user to select alleles that occur in at
least 1% of the human population or allele frequency of 1% or
higher.
7. First, the “Input Sequences” section displays the sequences, which are
numbered in their input order. Second, the “Prediction output
table” displays the predicted results whereby each row corre-
sponds to one peptide-binding prediction, and the table can be
sorted by clicking on the column headers. Third, outputs from
MHC class I and II binding predictions that have low percen-
tile ranks indicate that protein sequences have high affinities to
MHC class I and II molecules. Meanwhile, in MHC class II
binding prediction, the result for Sturniolo method is given as a
raw score whereby a higher score indicates higher affinity.
Fourth, the MHC class II binding predictions incorporate the
use of NetMHCIIpan method when Consensus and other
methods such as SMM_align, NN_align, COMBLIB, and/or
Sturniolo are not available for a particular allele. NetMHCII-
pan method is however used as a second or third method if only
one or two of the methods mentioned are available.
8. Prediction result is condensed by default to show only the
percentile rank. The table can be expanded to display more
columns which shows the individual scores from different
methods used.
9. The predicted output is given in units of IC50 nM. Peptide
sequences with low percentile ranks are considered good MHC
class I/II binders.
10. The maximum query length for prediction is 50,000
residues. Protein sequences that contain more than 50,000
residues have to be inserted separately.
11. A residue with a score above the threshold (default 0.35)
is predicted to be part of an epitope. Such residues are
colored in yellow on the graph (where the Y-axis
depicts residue scores and the X-axis depicts residue positions
in the sequence) and marked with “E” in the output table.
12. The threshold setting is correlated with the sensitivity and
specificity of the prediction. The default threshold of 0.35 is a
sensible value for running the prediction. Users can
set the threshold score to adjust the sensitivity/specificity of
prediction. The website http://www.cbs.dtu.dk/services/BepiPred-1.0/output.php
lists different threshold values with their respective sensitivity
and specificity values for users’ reference. Threshold value will not affect
the E value.
13. To analyze the linearity, antigenicity, and surface accessibility of
an epitope, the software uses a window size of 7 amino acid
residues, whereby the corresponding value of the scale will be
introduced for each of the seven residues. The score is given
based on the arithmetical mean of these 7 residues, and this
value is assigned to the fourth (i + 3) residue in the segment.
The threshold uses the mean values of the total scores acquired
for the whole sequence. In this prediction, the adjustment of
the threshold is not crucial because the hydrophilic and antige-
nicity scores for each peptide sequence are based on the highest
predicted scores.
14. A sequence with an antigenicity score of less than 1.0 is considered
potentially antigenic. In this prediction, output
results show several score values for each inserted sequence
based on different divided segments. Meanwhile, the sequence
with antigenicity score higher than 1.0 will be considered as
highly antigenic.
15. Any hexapeptide sequence with a Sn (Surface probability)
greater than 1.0 indicates a high probability that antigenic
sites are located on the surface of the peptide [21].
16. MHC class II epitope prediction was selected as first priority in
the peptide immunogenicity ranking system due to its impor-
tant role in eliciting Th-1 (cell mediated response) and
memory-based T-cell responses.
17. An example of the normalized scoring ranges for all prediction
parameters can be found in a previous study [22].
18. The score percentages obtained for the peptide sequences are
based on the categorized raw scores predicted for each
parameter.
19. A high total percentage score indicates that a peptide sequence
has high antigenicity and immunogenicity potential.
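The windowed scoring described in Note 13 can be sketched as follows. This is a minimal illustration in Python rather than the software's own code: the function names and the two-letter toy scale are assumptions made for the example, and a real analysis would substitute a published propensity scale.

```python
from statistics import mean

def window_scores(sequence, scale, window=7):
    """Average the scale values over each window and assign the mean
    to the central residue (index i + 3 for a window of seven)."""
    half = window // 2
    scores = {}
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        scores[start + half] = mean(scale[res] for res in segment)
    return scores

def residues_above_mean(scores):
    """Use the mean of all window scores as the threshold."""
    threshold = mean(scores.values())
    above = [pos for pos, score in sorted(scores.items()) if score > threshold]
    return threshold, above

# Toy scale for illustration only (hypothetical values, not a published scale).
toy_scale = {"A": 1.0, "G": 0.0}
scores = window_scores("AAAAAAAGGGGGGG", toy_scale)
threshold, positions = residues_above_mean(scores)
print(round(threshold, 3), positions)
```

Positions are zero-based. The first and last three residues receive no score because a full seven-residue window cannot be centred on them.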
References
1. Pietersz GA, Pouniotis DS, Apostolopoulos V (2006) Design of peptide-based vaccines for cancer. Curr Med Chem 13(14):1591–1607. https://doi.org/10.2174/092986706777441922
2. Li W, Joshi MD, Singhania S, Ramsey KH, Murthy AK (2014) Peptide vaccine: progress and challenges. Vaccines (Basel) 2(3):515–536. https://doi.org/10.3390/vaccines2030515
3. Sharav T, Wiesmüller KH, Walden P (2007) Mimotope vaccines for cancer immunotherapy. Vaccine 25(16):3032–3037. https://doi.org/10.1016/j.vaccine.2007.01.033
4. Knittelfelder R, Riemer AB, Jensen-Jarolim E (2009) Mimotope vaccination - from allergy to cancer. Expert Opin Biol Ther 9(4):493–506. https://doi.org/10.1517/14712590902870386
5. Buhrman JD, Slansky JE (2013) Mimotope vaccine efficacy gets a "boost" from native tumour antigens. OncoImmunology 2(4):e23492. https://doi.org/10.4161/onci.23492
6. Lipford GB, Bauer S, Wagner H, Heeg K (1995) In vivo CTL induction with point-substituted ovalbumin peptides: immunogenicity correlates with peptide-induced MHC class I stability. Vaccine 13(3):313–320. https://doi.org/10.1016/0264-410x(95)93320-9
7. Pogue RR, Eron J, Frelinger JA, Matsui M (1995) Amino-terminal alteration of the HLA-A*0201-restricted human immunodeficiency virus pol peptide increases complex stability and in vitro immunogenicity. Proc Natl Acad Sci U S A 92(18):8166–8170. https://doi.org/10.1073/pnas.92.18.8166
8. Fikes J (2004) Chapter 2: the rational design of T cell epitopes with enhanced immunogenicity. In: Morse MA, Clay TM, Lyerly HK (eds) Handbook of cancer vaccines. Humana Press, New Jersey
9. Bei R, Scardino A (2010) TAA polyepitope DNA-based vaccines: a potential tool for cancer therapy. J Biomed Biotechnol:1–12. https://doi.org/10.1155/2010/102758
10. Huarte E, Sarobe P, Lu J, Casares N, Lasarte JJ, Dotor J, Ruiz M, Prieto J, Celis E, Borrás-Cuesta F (2002) Enhancing immunogenicity of a CTL epitope from carcinoembryonic antigen by selective amino acid replacements. Clin Cancer Res 8(7):2336–2344
11. Schreurs MWJ, Kueter EWM, Scholten KBJ, Lemonnier FA, Meijer CJLM, Hooiberg E (2005) A single amino acid substitution improves the in vivo immunogenicity of the HPV16 oncoprotein E7 (11-20) cytotoxic T lymphocyte epitope. Vaccine 23(31):4005–4010. https://doi.org/10.1016/j.vaccine.2005.03.014
12. Hofmann S, Mead A, Malinovskis A, Hardwick NR, Guinn BA (2015) Analogue peptides for the immunotherapy of human acute myeloid leukemia. Cancer Immunol Immunother 64(11):1357–1367. https://doi.org/10.1007/s00262-015-1762-9
13. Kumar SR, Prabakaran M, Ashok Raj KV, He F, Kwang J (2015) Amino acid substitutions improve the immunogenicity of H7N7HA protein and protect mice against lethal H7N7 viral challenge. PLoS One 10(6):e0128940. https://doi.org/10.1371/journal.pone.0128940
14. Brown JH, Jardetzky TS, Gorga JC et al (1993) Three-dimensional structure of the human class II histocompatibility antigen HLA-DR1. Nature 364(6432):33–39. https://doi.org/10.1038/364033a0
15. Parker JM, Guo D, Hodges RS (1986) New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 25(19):5425–5432. https://doi.org/10.1021/bi00367a013
16. Levitt M (1978) Conformational preferences of amino acids in globular proteins. Biochemistry 17(20):4277–4285. https://doi.org/10.1021/bi00613a026
17. Kavitha K, Saritha R, Vinod Chandra S (2013) Computational methods in linear B-cell epitope prediction. Int J Comput Appl 63(12):28–32
18. Kolaskar AS, Tongaonkar PC (1990) A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett 276(1–2):172–174. https://doi.org/10.1016/0014-5793(90)80535-q
19. Emini EA, Hughes JV, Perlow DS, Boger J (1985) Induction of hepatitis A virus-neutralising antibody by a virus-specific synthetic peptide. J Virol 55(3):836–839
20. Chang ST, Ghosh D, Kirschner DE, Linderman JJ (2006) Peptide length-based prediction of peptide–MHC class II binding. Bioinformatics 22(22):2761–2767. https://doi.org/10.1093/bioinformatics/btl479
21. Srinivasan P, Kumar SP, Karthikeyan M et al (2011) Epitope-based immunoinformatics and molecular docking studies of nucleocapsid protein and ovarian tumor domain of Crimean-Congo hemorrhagic fever virus. Front Genet 2:72. https://doi.org/10.3389/fgene.2011.00072
22. Ng AWR, Tan PJ, Hoo WPY et al (2018) In silico-guided sequence modifications of K-ras epitopes improve immunological outcome against G12V and G13D mutant KRAS antigens. PeerJ 6:e5056. https://doi.org/10.7717/peerj.5056
Chapter 11
An Immunoinformatics Approach in Design of Synthetic Peptide Vaccine Against Influenza Virus
Neha Lohia and Manoj Baranwal
Abstract
Peptide-based vaccines are an appealing strategy that uses short synthetic peptides to
engineer a highly targeted immune response. These short synthetic peptides contain potential T- and
B-cell epitopes. Experimental approaches to identifying these epitopes are time-consuming and expensive;
hence the immunoinformatics approach has come into the picture. It combines epitope
prediction tools, molecular docking, and population coverage analysis to design the desired immunogenic
peptides. To overcome the antigenic variation of viruses, conserved regions are targeted to find
potential epitopes. The present chapter demonstrates the use of the immunoinformatics approach to select
a potential peptide containing multiple T- (CD8+ and CD4+) and B-cell epitopes from Avian H3N2 M1
protein. Further, molecular docking (to analyze HLA-peptide interaction) and population coverage analysis
are used to verify the potential of the peptide to be presented by polymorphic HLA molecules. The in silico
approach to epitope prediction has proven to be a successful methodology for screening putative epitopes
among the numerous possible vaccine targets in a given protein.
Key words Conservation, Influenza epitope, BLASTp, Peptide-based vaccine, Epitope prediction,
Docking, Population coverage
1 Introduction
Influenza is a highly contagious viral infection of the respiratory tract
that causes both endemic seasonal infections and periodic but
unpredictable pandemics. Influenza A virus (IAV) is classified
according to the serologic reactivity of two surface glycoproteins,
hemagglutinin (HA) and neuraminidase (NA).
Eighteen serotypes of HA (H1–H18) and 11 serotypes of NA
(N1–N11) have been identified, which circulate in birds, humans,
swine, and bats [1]. Influenza virus mutates rapidly, causing
pandemics and epidemics in poultry and mammals. However, the
immunogen to which the immune system actually responds is
often a smaller peptide. Therefore, the concept of peptide-based
vaccines is a novel approach that involves the use of short synthetic
peptides to elicit immunity. Peptide-based vaccines offer several
advantages over traditional vaccines [2, 3]:
• Lack of risk of reversion, gene recombination or integration,
oncogenicity, or autoreactivity.
• Immunogenicity, solubility, and stability can be augmented by
introduction of carbohydrate, lipid, and phosphate groups.
• Ease of chemical synthesis, characterization, analysis, quality
control and scale-up, and of storage and distribution in freeze-dried form.
The extensive mutability of influenza virus significantly
enhances its propensity to escape immune recognition, resulting in an
inadequate host immune response against all the circulating
variants. In addition, HLA gene polymorphism causes the immune
responses directed against specific epitopes to differ
from one individual to another. Various influenza peptides have been
reported to elicit an immunogenic response, but the majority of those
peptides are either strain specific or bind to a limited number of HLA
molecules [4]. Therefore, a peptide candidate for a universal influenza
vaccine should be highly conserved, be recognized in the context of
various HLA alleles, and elicit T-cell-mediated immune responses.
Peptides capable of binding to multiple HLA alleles are generally
referred to as promiscuous peptides.
Immunoinformatics has emerged at the intersection of
experimental and informatics approaches for studying host-pathogen
interactions as well as molecular interactions within the
HLA-peptide-T-cell receptor complex. It has paved the way for
engineering immune therapeutics and diagnostics, in addition to
identifying putative epitopes for vaccine design. Applications of B-
and T-cell epitope prediction include personalized immunotherapies
as well as prophylactic and therapeutic vaccines [5]. By identifying
potential epitopes in advance, immunoinformatics reduces the amount
of experimental screening required.
2 Methodology
A brief overview of the methodology for the design of a peptide-based
vaccine against influenza is given in Fig. 1. Details of the
process are as follows.
Fig. 1 Flowchart of the methodology in design of synthetic peptide vaccine
2.1 Sequence Retrieval
Through the collaborative efforts of various research institutes,
sequencing data of influenza virus are stored, organized,
updated, and maintained in specialized databases. The vast pool of
regularly updated information provided by these influenza databases
is a boon to researchers worldwide working on influenza vaccine
development and other therapeutics. Full-length sequences of the
target protein of H1N1 viruses infecting humans can be downloaded
from databases such as the Influenza Virus Resource (NCBI) and the
Influenza Research Database (IRD) [6, 7] using the following steps:
1. Access the home page of the influenza database (Fig. 2).
2. Set different parameters for the download:
(a) Sequence type (protein/protein coding/nucleotide).
(b) Influenza virus type (A, B or C).
(c) Host (human/avian/bat, etc.)
(d) Country/region.
(e) Name of protein.
(f) Subtype of influenza virus (e.g., subtype H1 and N1 for
type A).
(g) Sequence length (full length or partial).
(h) Collection date and release date.
(i) Additional filters, such as inclusion or exclusion of lab or
vaccine strains, pandemic (H1N1) viruses, mixed-subtype
viruses, lineage-defining viruses, etc.
3. Click on the option “collapse identical sequences” in order to
eliminate redundant sequences.
4. Click on “show results.” Download the protein sequences in
FASTA format using the option “download.”
Fig. 2 Influenza virus resource webpage displaying various parameters for downloading the protein sequences
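When sequences come from a source that does not offer the "collapse identical sequences" option, the same de-duplication can be done locally on the downloaded FASTA file. The following is a minimal sketch, not part of the database tooling; the function names and the example records are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def collapse_identical(records):
    """Keep the first record for every distinct sequence (case-insensitive)."""
    first_seen = {}
    for header, sequence in records:
        first_seen.setdefault(sequence.upper(), header)
    return [(header, sequence) for sequence, header in first_seen.items()]

# Hypothetical example records; real input would be the downloaded FASTA file.
example = ">strain_a\nMSLL\nTEV\n>strain_b\nMSLLTEV\n>strain_c\nMSLLTEA\n"
unique = collapse_identical(read_fasta(example))
print(unique)
```

Keeping the first header for each distinct sequence is an arbitrary choice; another representative (for example, the earliest collection date) could be kept instead.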
2.2 Identification of Conserved Sequences
The extensive mutability of influenza virus significantly enhances its
propensity to escape immune recognition, resulting in an inadequate
host immune response against all the circulating variants.
Therefore, a peptide candidate for a universal influenza vaccine
should be highly conserved. To identify highly conserved
peptides:
1. Open the protein sequence file in Microsoft Word, and replace
all the “X” and “J” letters in the protein sequences with “N.”
Save the changes.
2. Align the protein sequences using MUSCLE (multiple sequence
comparison by log-expectation) or any other multiple
sequence alignment tool. Download the results of the alignment
in FASTA format.
3. Load the alignment file into another tool, the Antigen
Variability ANalyzer (AVANA), to find the conserved peptide
stretches.
Note: AVANA is a standalone application that analyzes multiple
sequence alignments based on entropy and calculates the
variability at a given amino acid position. It creates a graphical
presentation of the entropy profile of a multiple sequence
alignment, enabling users to examine position-specific variations
and their frequencies [8].
Fig. 3 Graphical presentation of the results obtained by AVANA analysis: conserved regions of human H3N2 M1 protein (x-axis, position of amino acid; y-axis, entropy of amino acids at a particular position in the sequences)
4. Go to the “alignment” tab on the control panel and choose “Find
conserved region” (Fig. 3). Set the minimum length (9
amino acids), the maximum length, and the peptide conservation
(=100%). Click “Ok.” Save and analyze the result to obtain the
conserved sequences.
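The steps above can be approximated in a few lines of Python for readers who want to check an alignment independently of AVANA. This is a sketch under stated assumptions, not AVANA's implementation: it assumes that variability is measured as Shannon entropy per alignment column, that a conserved region is an ungapped stretch identical in all sequences, and all function names are hypothetical. Replacing “X” and “J” programmatically also leaves the FASTA header lines untouched, which a global find-and-replace would not.

```python
import math
from collections import Counter

def replace_ambiguous(sequence, ambiguous="XJ", replacement="N"):
    """Replace ambiguous residue codes in a sequence string (step 1)."""
    return "".join(replacement if res in ambiguous else res
                   for res in sequence.upper())

def column_entropy(column):
    """Shannon entropy (in bits) of the residues at one alignment position."""
    total = len(column)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(column).values())

def conserved_regions(aligned, min_length=9):
    """Return (start, peptide) for each ungapped stretch that is identical
    in every aligned sequence and at least min_length residues long."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    regions, start = [], None
    for i in range(length + 1):
        identical = (i < length and aligned[0][i] != "-"
                     and all(seq[i] == aligned[0][i] for seq in aligned))
        if identical and start is None:
            start = i
        elif not identical and start is not None:
            if i - start >= min_length:
                regions.append((start, aligned[0][start:i]))
            start = None
    return regions

# Hypothetical three-sequence alignment that differs at one position.
alignment = ["MSLLTEVETYVLQA", "MSLLTEVETYVLKA", "MSLLTEVETYVLQA"]
print(conserved_regions(alignment))
print([round(column_entropy(col), 3) for col in zip(*alignment)])
```

Start positions are zero-based. A column entropy of 0 corresponds to 100% conservation at that position.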
2.3 T-Cell Epitope Prediction by Consensus Approach
Various databases have been established for the collection of
experimental immunological data. Next-generation sequencing and
high-throughput HLA binding assays have taken the lead in the
identification of novel MHC alleles and the understanding of their
binding patterns. The computational tools are pattern recognition
methods trained on large datasets obtained from in vitro experiments.
Pattern recognition is an application of machine learning (ML) in
computer science. ML is employed for the study and construction
of algorithms that can recognize motifs/patterns in training
data (binding peptides) and make predictions. Various ML techniques
are used extensively in immunoinformatics, including
support vector machines (SVMs), position-specific scoring matrices
(PSSMs), artificial neural networks (ANNs), and hidden Markov
models (HMMs). T-cell epitope prediction tools based on different
machine learning approaches are listed in Table 1. The consensus
approach to epitope prediction combines and compares
the results obtained from different prediction algorithms to
identify epitopes.
Table 1
List of various T-cell epitope prediction tools
Server name | HLA I alleles covered (CD8+ T cells) | HLA II alleles covered (CD4+ T cells) | Predictive method | Link | Reference
EpiJen | 24 | - | Multistep algorithm | http://www.ddg-pharmfac.net/epijen/EpiJen/EpiJen.htm | [20]
IEDB binding | 77 | - | ANN and SMM | http://tools.iedb.org/main/tcell/ | [4]
KISS | 64 | - | SVM | http://cbio.ensmp.fr/kiss/ | [21]
MHC2Pred | - | 42 | SVM | http://www.imtech.res.in/raghava/mhc2pred/ | [22]
MHCPred | 14 | 11 | Additive method | http://www.ddg-pharmfac.net/mhcpred/MHCPred/ | [23]
MMBPred | 46 | - | Quantitative matrix | http://www.imtech.res.in/raghava/mmbpred/ | [24]
NetCTL | 12 supertypes* | - | ANN regression | http://www.cbs.dtu.dk/services/NetCTL | [25]
NetCTLpan 1.1 | 12 supertypes* | - | ANN-weight matrix | http://www.cbs.dtu.dk/services/NetCTLpan/ | [26]
netMHCpan 4.0 | 172 human | - | ANN | http://www.cbs.dtu.dk/services/NetMHCpan/ | [27]
netMHC 4.0 | 81 human | - | ANN | http://www.cbs.dtu.dk/services/NetMHC/ | [28]
nHLAPred | 67 | - | ANN | http://www.imtech.res.in/raghava/nhlapred/ | [29]
ProPred | - | 51 | Quantitative matrix | http://www.imtech.res.in/raghava/propred/ | [30]
ProPred I | 47 | - | Quantitative matrix | http://www.imtech.res.in/raghava/propred1/ | [31]
RANKPEP | 118 | 62 | PSSM | http://bio.dfci.harvard.edu/RANKPEP/ | [32]
SVRMHC | 36 | 6 | SVM | http://c1.accurascience.com/SVRMHCdb/ | [33]
SYFPEITHI | 33 (human) | 7 | Published motifs | http://www.syfpeithi.de/bin/MHCServer.dll/EpitopePrediction.htm | [34]
*A supertype is defined as a cluster of functionally related HLA alleles that share binding specificities towards the same
panel of peptides owing to similar structural features of the HLA peptide-binding groove
1. Prediction tools use different parameters to define epitopes;
predicting epitopes by a consensus approach is therefore
appealing as a way to improve the reliability of
the prediction [9]. Further, considering peptides that contain
multiple CD8+ and CD4+ T-cell epitopes is advantageous
when the intention is to induce both helper and cytotoxic
T-cell responses. Several immunoinformatics prediction
tools based on different algorithms or parameters can be chosen
for CD8+ (HLA I) and CD4+ (HLA II) T-cell epitope
prediction.
2. Paste the conserved sequence (obtained in the previous step) in
the window provided on the webserver of the prediction tool.
Parameters can be selected based on the literature or self-
optimization. Click on “submit” and save the results.
3. Compare the results of these prediction tools and obtain a set
of epitopes which is predicted commonly by all the tools (for
each class of epitopes).
4. Carry out BLASTp analysis of all the predicted epitopes to
exclude any potential self-peptides or auto-immunogenic
epitopes. (Note: Nonamers that share at least seven of their nine
amino acids with annotated human peptides in an alignment
without gaps should be eliminated
from further consideration.)
5. Merge the overlapping CD8+ and CD4+ T-cell epitopes after
BLASTp analysis into a single peptide fragment.
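Steps 3 and 5 above (taking the epitopes common to all tools and merging overlapping epitopes) can be sketched as follows. The function names and the two "tool" output sets are hypothetical and serve only to illustrate the logic; the conserved sequence and the three shared nonamers are the CS3 sequence and CD8+ T-cell epitopes reported in Tables 3 and 4 of this chapter.

```python
def consensus(*prediction_sets):
    """Return the epitopes predicted by every tool (set intersection)."""
    common = set(prediction_sets[0])
    for predictions in prediction_sets[1:]:
        common &= set(predictions)
    return common

def merge_overlapping(parent, epitopes):
    """Merge epitopes that overlap on the parent sequence into
    contiguous peptide fragments, returned in sequence order."""
    spans = sorted((parent.index(e), parent.index(e) + len(e)) for e in epitopes)
    merged = []
    for start, end in spans:
        if merged and start < merged[-1][1]:   # shares at least one residue
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [parent[start:end] for start, end in merged]

cs3 = "GLRDDLLENLQTYQKRMGVQMQRFK"
# Hypothetical outputs of two prediction tools for the same input sequence.
tool_a = {"KRMGVQMQR", "DLLENLQTY", "LRDDLLENL", "GLRDDLLEN"}
tool_b = {"KRMGVQMQR", "DLLENLQTY", "LRDDLLENL", "QTYQKRMGV"}

shared = consensus(tool_a, tool_b)
print(merge_overlapping(cs3, shared))
```

With these inputs, the merge yields the two CD8+ T-cell epitope-enriched fragments listed in Table 5 (LRDDLLENLQTY and KRMGVQMQR). Note that `str.index` locates only the first occurrence of each epitope, which is sufficient when an epitope occurs once in the parent sequence.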
2.4 B-Cell Epitope Prediction
Identification of B-cell epitopes is also important for inducing
antibody-mediated immunity. Peptides with the potential to induce
antibody production can be used not only for vaccine development but
also for diagnostic and therapeutic purposes. On the basis of spatial
arrangement, B-cell epitopes are classified into two
types: linear (continuous or sequential) and conformational
(discontinuous). Various immunoinformatics tools based on propensity
scales or machine learning have been developed to predict linear
B-cell epitopes. B-cell epitope prediction tools are listed in Table 2.
For instance, ABCpred can predict epitopes of varying length
(10–20 residues) and assigns a score between 0 and 1 to each
predicted epitope. A score closer to 1 signifies a higher
probability of being an epitope. Another tool, LBtope, assigns a
score between 0% and 100% to each predicted epitope. A
higher score indicates a higher likelihood that the antigenic
sequence is an epitope.
Table 2
Sequence-based B-cell epitope prediction tools
Server name | Predictive method | URL | Reference
BepiPred | Random forest algorithm | http://www.cbs.dtu.dk/services/BepiPred/ | [35]
CBtope | Support vector machine (SVM) | http://crdd.osdd.net/raghava/cbtope/submit.php | [36]
ABCpred | Standard feed-forward (FNN) and recurrent neural network (RNN) | https://webs.iiitd.edu.in/raghava/abcpred/ABC_submission.html | [37]
IgPred | SVM | https://webs.iiitd.edu.in/raghava/igpred/help.html | [38]
BCPRED | SVM | http://ailab.ist.psu.edu/bcpred/predict.html | [39]
SVMtrip | SVM | http://sysbio.unl.edu/SVMTriP/prediction.php | [40]
Bcepred | Parker, Karplus, Emini, and Kolaskar method | http://crdd.osdd.net/raghava/bcepred/ | [41]
BEST | Support vector machine (SVM) method | Standalone software | [42]
Epitopia | Naive Bayes classifier | http://epitopia.tau.ac.il/index.html | [43]
Pepitope | PepSurf and Mapitope algorithms | http://pepitope.tau.ac.il/ | [44]
iBCE-EL | SVM, RF, ERT, GB, AB, and k-NN | http://www.thegleelab.org/iBCE-EL/ | [45]
COBEpro | SVM and propensity score | http://scratch.proteomics.ics.uci.edu/ | [46]
LBtope | SVM and multiple algorithms in Weka | http://crdd.osdd.net/raghava/lbtope/ | [47]
1. Paste the conserved sequence (identified by AVANA) in the
window provided on the webserver of the prediction tool.
2. Select the parameters based on the literature or self-
optimization.
3. Analyze the predicted epitopes by BLASTp to exclude any
potential self-peptide or auto-immunogenic epitopes (as per
the criterion used for T-cell epitope analysis).
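The self-peptide criterion referred to above (at least seven identical residues out of nine, compared without gaps) can be applied as a simple post-filter. The sketch below is not a replacement for BLASTp: it only scans ungapped windows of a set of human protein sequences, the function names are hypothetical, and the "human" sequences in the example are invented for illustration.

```python
def max_ungapped_identity(peptide, protein):
    """Highest number of identical positions between the peptide and any
    same-length, ungapped window of the protein."""
    k = len(peptide)
    best = 0
    for i in range(len(protein) - k + 1):
        window = protein[i:i + k]
        matches = sum(a == b for a, b in zip(peptide, window))
        best = max(best, matches)
    return best

def is_self_like(peptide, human_proteins, min_identical=7):
    """Flag a peptide that matches any human protein at or above the cutoff."""
    return any(max_ungapped_identity(peptide, protein) >= min_identical
               for protein in human_proteins)

# Invented sequences standing in for annotated human proteins.
mock_human = ["AAAKRMGVQMAAAAA", "GGGGGGGGGGGGGGG"]
print(is_self_like("KRMGVQMQR", mock_human))
print(is_self_like("DLLENLQTY", mock_human))
```

In practice the human proteome is far too large for this brute-force scan to be the primary search; it is better suited to re-checking the hits returned by BLASTp against the stated cutoff.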
2.5 Molecular Docking
Epitopes are presented to T cells as peptide-HLA complexes
expressed on the surface of nucleated cells or specialized
antigen-presenting cells. HLA genes are the most polymorphic loci
across human populations worldwide. A peptide capable of
binding to a large array of HLA molecules is expected to be
immunogenic in populations distributed worldwide. To further validate
the promiscuous nature of the identified peptides towards a wide range
of HLAs, the predicted CD8+ T-cell epitopes and CD4+ T-cell
peptides can also be analyzed for binding to different HLAs by a
structure-based molecular docking approach.
Molecular docking has been recognized as a valuable technique
in computer-aided vaccine design [10, 11]. It is an expeditious,
reliable, and accurate technique for analyzing the binding of
peptides to HLA class I and II molecules [12].
1. Fetch high-resolution crystallographic structures of HLA
class I and class II molecules (representing various HLA
supertypes) from the Protein Data Bank.
2. Separate the bound peptides (native peptides) from the
downloaded HLA molecules using Discovery Studio Visualizer
(v4.1). Save the ligand (peptide) and receptor (HLA) as
separate PDB files.
3. Define the grid for each HLA molecule based on the exact
position of its native peptide.
4. Generate the structures of the peptides (epitopes) using the
PEP-FOLD server [13].
5. Dock the native peptides (positive controls) as well as the
predicted epitopes/peptides (test set) with the different HLA
structures using AutoDock Vina [14, 15] or the CABS-dock
tool [16].
Note: HLA class II accommodates longer peptides
(~18–20 amino acids), in contrast to HLA class I, which binds
shorter peptides (8–10 amino acids). Accordingly, from the
selected peptide of the human H3N2 M1 protein, the
PEP-FOLD-generated structures of the CD8+ T-cell epitope
(nonamer peptide) and of the peptide containing multiple CD4+
T-cell epitopes were docked with the HLA class I and II
molecules, respectively.
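Steps 2 and 3 above (separating the native peptide from the HLA structure and defining the grid from its position) can also be scripted. The following sketch reads the fixed-column ATOM records of a PDB file and derives a search box centred on the native peptide, written with the `center_*` and `size_*` keys used in AutoDock Vina configuration files. It is an illustration under assumptions: the function names, the 5 Å padding, and the peptide chain identifier are hypothetical and must be checked against the actual PDB entry.

```python
def split_complex(pdb_lines, peptide_chain):
    """Split ATOM records into receptor and peptide by chain identifier.
    Only ATOM records are kept, so waters and heteroatoms are dropped."""
    receptor, peptide = [], []
    for line in pdb_lines:
        if line.startswith("ATOM"):
            # Column 22 of an ATOM record (index 21) is the chain identifier.
            (peptide if line[21] == peptide_chain else receptor).append(line)
    return receptor, peptide

def grid_box(peptide_lines, padding=5.0):
    """Centre and size (in angstroms) of a box enclosing the peptide atoms."""
    # x, y, z occupy columns 31-38, 39-46, and 47-54 of an ATOM record.
    coords = [(float(l[30:38]), float(l[38:46]), float(l[46:54]))
              for l in peptide_lines]
    lows = [min(axis) for axis in zip(*coords)]
    highs = [max(axis) for axis in zip(*coords)]
    center = [(lo + hi) / 2 for lo, hi in zip(lows, highs)]
    size = [(hi - lo) + 2 * padding for lo, hi in zip(lows, highs)]
    return center, size

def vina_config(center, size):
    """Format the box as lines for an AutoDock Vina configuration file."""
    keys = ("center_x", "center_y", "center_z", "size_x", "size_y", "size_z")
    return "\n".join(f"{key} = {value:.3f}" for key, value in zip(keys, center + size))
```

A typical use would read the downloaded structure with `open(path).read().splitlines()`, call `split_complex` with the chain identifier of the bound peptide, and write the receptor and peptide lines to separate PDB files before passing the peptide lines to `grid_box`.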
2.6 Population Coverage
The frequencies of expression of the various HLA types differ
drastically among individuals from different continents.
Population coverage analysis of the HLA alleles
corresponding to the predicted immunogenic peptides is therefore
imperative. The population coverage tool of IEDB computes the fraction
of individuals expected to respond to a given set of epitopes with
known HLA restriction, taking HLA genotypic frequencies into account
[17]. The tool obtains the HLA allele frequencies of
individual populations from the Allele Frequency Net Database
[18]. The populations are organized hierarchically by
geographical area, continent, country, and ethnicity. The
data from the individual populations in each group are combined
to estimate the allele frequencies of each merged population.
HLA-peptide restriction data acquired from epitope prediction are
used as the input for population coverage analysis [19].
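The idea behind such a coverage estimate can be illustrated with a much simplified calculation. The sketch below is not the IEDB algorithm: it assumes Hardy-Weinberg proportions at each locus and independence between loci, and it only estimates the fraction of individuals carrying at least one allele that restricts a predicted epitope. The function names and all allele frequencies are hypothetical.

```python
def locus_coverage(allele_frequencies):
    """Fraction of individuals carrying at least one of the listed alleles
    at a single locus, assuming Hardy-Weinberg proportions."""
    covered = min(sum(allele_frequencies), 1.0)
    return 1.0 - (1.0 - covered) ** 2

def population_coverage(frequencies_by_locus):
    """Combine loci, assuming they are inherited independently."""
    uncovered = 1.0
    for frequencies in frequencies_by_locus.values():
        uncovered *= 1.0 - locus_coverage(frequencies)
    return 1.0 - uncovered

# Hypothetical frequencies of the epitope-restricting alleles in one population.
example = {"HLA-A": [0.30], "HLA-DRB1": [0.20]}
print(round(population_coverage(example), 4))
```

Because each individual carries two alleles per locus, a summed allele frequency of 0.30 already covers 51% of individuals at that locus, and adding a second locus raises the combined estimate further.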
Demonstration of the Methodology Followed
A study carried out on the matrix 1 (M1) protein of human H3N2
influenza virus using the above-mentioned methodology is described
below:
1. Sequence retrieval: Three unique (non-redundant) sequences of
human H3N2 M1 protein, collected in the Asian region between
January 2011 and June 2011, were obtained from the Influenza
Research Database (IRD) (IRD accession nos.: JQ247217, KP412546,
KP638185).
2. Sequence alignment and identification of conserved sequence:
(a) The three sequences were aligned using multiple sequence
alignment tool provided on the IRD server. The align-
ment was saved in FASTA format and used as an input for
AVANA tool.
(b) The alignment was loaded into the AVANA tool to identify the
conserved sequences showing 100% conservation
(Fig. 3).
(c) After analysis of the results, three conserved sequences
ranging in length from 19 to 206 amino acids were obtained
for the human H3N2 M1 protein (Table 3).
3. T- and B-cell epitope prediction:
(a) Conserved sequence CS3 was used as an input for T-cell
and B-cell epitope prediction tools. Two diverse immu-
noinformatics tools were employed to predict HLA class I
binding (CD8+ T-cell) epitopes (NetCTL and SYF-
PEITHI) as well as HLA class II binding (CD4+ T-cell)
epitopes (NetMHCII 2.2 and IEDB SMM align). B-cell
epitope prediction was carried out using ABCpred.
(b) The predicted epitopes were analyzed, and the set of
epitopes predicted by both tools was identified
for each class of T-cell epitopes (Table 4).
Table 3
Conserved sequences of Avian H3N2 M1 protein
ID | Conserved peptide sequence | Length
CS1 | MSLLTEVETYVLSIVPSGPLKAEIAQRLEDVFAGKNTDLEALMEWLKTRPILSPLTKGILGFVFTLTVPSERGLQRRRFVQNALNGNGDPNNMDKAVKLYRKLKREITFHGAKEIALSYSAGALASCMGLIYNRMGAVTTEVAFGLVCATCEQIADSQHRSHRQMVATTNPLIKHENRMVLASTTAKAMEQMAGSSEQAAEAMEIA | 206
CS2 | QARQMVQAMRAIGTHPSSS | 19
CS3 | GLRDDLLENLQTYQKRMGVQMQRFK | 25
Table 4
CD8+ and CD4+ T-cell epitopes and B-cell epitopes of human H3N2 virus matrix 1 protein
Conserved sequence | CD8+ T-cell epitopes | CD4+ T-cell epitopes | B-cell epitope
CS3 | KRMGVQMQR, DLLENLQTY, LRDDLLENL | LENLQTYQK, LQTYQKRMG, YQKRMGVQM | ENLQTYQKRM
Table 5
Avian H3N2 matrix 1 peptides containing overlapping CD8+ and CD4+ T-cell epitopes and B-cell
epitope
Type of peptide | Sequence | No. of epitopes
Peptides enriched with CD8+ T-cell epitopes | LRDDLLENLQTY, KRMGVQMQR | 3
Peptides enriched with CD4+ T-cell epitopes | LENLQTYQKRMGVQM | 3
B-cell epitope containing peptide | ENLQTYQKRM | 1
Peptide containing all T- and B-cell epitopes | LRDDLLENLQTYQKRMGVQM |
Table 6
HLA molecules used for docking and binding energy of each HLA-epitope complex
Type of HLA | PDB id | HLA molecule | Resolution | Free energy ΔG (in kcal/mol) after docking with native peptide | Free energy ΔG (in kcal/mol) after docking with M1 epitope/peptide
HLA class I | 3MRK | HLA-A2 | 1.4 Å | 7.30 | 6.80 (KRMGVQMQR)
HLA class II | 1KLU | HLA-DR1 | 1.93 Å | 5.70 | 6.20 (LENLQTYQKRMGVQM)
(c) Three CD8+ T-cell epitopes, three CD4+ T-cell epitopes,
and one B-cell epitope were identified from the CS3 sequence
of the human H3N2 M1 protein (Table 4).
(d) All the predicted epitopes were analyzed by BLASTp; no
homology to annotated human proteins was detected as
per the set criterion.
(e) The overlapping CD8+ and CD4+ T-cell epitopes and the B-cell
epitope were merged to generate a single peptide fragment
(Table 5).
4. Molecular docking:
(a) The binding affinity (binding energy/free energy) of the
selected peptides containing multiple epitopes for HLA
molecules was determined with the AutoDock Vina tool
(Table 6).
(b) High-resolution crystallographic structures of two
peptide-bound HLA molecules (one each for HLA class I
and II) were downloaded from the PDB database for
docking (Table 6).
(c) The binding energy obtained by docking the natural peptides,
separated from the parent HLA complexes, with the
corresponding HLA molecules served as the positive control.
(d) From the selected peptide of the human H3N2 M1 protein,
the PEP-FOLD-generated structures of the HLA class I
binding (CD8+ T-cell) epitope (KRMGVQMQR) and of the
peptide containing multiple CD4+ T-cell epitopes
(LENLQTYQKRMGVQM) were docked with
the class I (HLA-A2) and class II (HLA-DR1) molecules,
respectively (Fig. 4).
(e) The binding energies of the CD8+ T-cell epitope and of the
CD4+ T-cell epitope-containing peptide did not differ
significantly from those of the native peptides of HLA-A2 and
HLA-DR1, respectively (Table 6).
5. Population coverage analysis: Population coverage analysis
is necessary for estimating the expected response to the
predicted peptides in different populations of the world. The
response to the human H3N2 M1 peptide is expected to be average
(approx. 40–69%) in all the populations under consideration,
which belong to different continents (Fig. 5).
Fig. 4 Docking pose obtained after docking of HLA class I (3MRK) with the CD8+ T-cell epitope
Fig. 5 Population coverage of the peptide containing CD8+ epitope and CD4+ epitope
References
1. Tong S, Zhu X, Li Y et al (2013) New World bats harbor diverse influenza A viruses. PLoS Pathog 9(10):e1003657
2. Purcell AW, McCluskey J, Rossjohn J (2007) More than one reason to rethink the use of peptides in vaccine design. Nat Rev Drug Discov 6(5):404–414
3. Slingluff CL (2011) The present and future of peptide vaccines for cancer. Cancer J 17(5):343–350
4. Vita R, Overton JA, Greenbaum JA et al (2015) The immune epitope database (IEDB) 3.0. Nucleic Acids Res 43(D1):D405–D412
5. Backert L, Kohlbacher O (2015) Immunoinformatics and epitope prediction in the age of genomic medicine. Genome Med 7(1):119
6. Bao Y, Bolotov P, Dernovoy D et al (2008) The influenza virus resource at the National Center for Biotechnology Information. J Virol 82(2):596–601
7. Zhang Y, Aevermann BD, Anderson TK et al (2017) Influenza research database: an integrated bioinformatics resource for influenza virus research. Nucleic Acids Res 45(D1):D466–D474
8. Miotto O, Heiny AT, Tan TW et al (2007) Identification of human-to-human transmissibility factors in PB2 proteins of influenza A by large-scale mutual information analysis. BMC Bioinformatics 9(1):1–18
9. Lohia N, Baranwal M (2014) Conserved peptides containing overlapping CD4+ and CD8+ T-cell epitopes in the H1N1 influenza virus: an immunoinformatics approach. Viral Immunol 27(5):225–234
10. Agallou M, Athanasiou E, Koutsoni O et al (2014) Experimental validation of multi-epitope peptides including promising MHC class I- and II-restricted epitopes of four known Leishmania infantum proteins. Front Immunol 5:1–16
11. Vijayan R, Subbarao N, Manoharan N (2015) In silico analysis of conformational changes induced by normal and mutation of macrophage infectivity potentiator catalytic residues and its interactions with Rapamycin. Interdiscip Sci 7(3):326–333
12. Patronov A, Dimitrov I, Flower DR et al (2011) Peptide binding prediction for the human class II MHC allele HLA-DP2: a molecular docking approach. BMC Struct Biol 11:32
13. Thévenet P, Shen Y, Maupetit J et al (2012) PEP-FOLD: an updated de novo structure prediction server for both linear and disulfide bonded cyclic peptides. Nucleic Acids Res 40(W1):288–293
14. Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 31(2):455
15. Lohia N, Baranwal M (2018) Highly conserved hemagglutinin peptides of H1N1 influenza virus elicit immune response. 3 Biotech 8(12):492
16. Jain S, Baranwal M (2019) Computational analysis in designing T cell epitopes enriched peptides of Ebola glycoprotein exhibiting strong binding interaction with HLA molecules. J Theor Biol 465:34–44
17. Bui HH, Sidney J, Dinh K et al (2006) Predicting population coverage of T-cell epitope-based diagnostics and vaccines. BMC Bioinformatics 7:1–5
18. González-Galarza FF, Takeshita LYC, Santos EJM et al (2015) Allele frequency net 2015 update: new features for HLA epitopes, KIR and disease and HLA adverse drug reaction associations. Nucleic Acids Res 43(D1):D784–D788
19. Lohia N, Baranwal M (2015) Identification of conserved peptides comprising multiple T cell epitopes of matrix 1 protein in H1N1 influenza virus. Viral Immunol 28(10):570–579
20. Doytchinova IA, Guan P, Flower DR (2006) EpiJen: a server for multistep T cell epitope prediction. BMC Bioinformatics 7:1–11
21. Jacob L, Vert JP (2008) Efficient peptide-MHC-I binding prediction for alleles with few known binders. Bioinformatics 24(3):358–366
22. MHC2PRED: http://crdd.osdd.net/raghava/mhc2pred/info.html
23. Guan P, Doytchinova IA, Zygouri C (2003) MHCPred: a server for quantitative prediction of peptide-MHC binding. Nucleic Acids Res 31(13):3621–3624
24. Bhasin M, Raghava GPS (2003) Prediction of promiscuous and high-affinity mutated MHC binders. Hybrid Hybridomics 22(4):229–234
27. … ligand and peptide binding affinity data. J Immunol 199(9):3360–3368
28. Andreatta M, Nielsen M (2015) Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics 32(4):511–517
29. Bhasin M, Raghava GPS (2007) A hybrid approach for predicting promiscuous MHC class I restricted T cell epitopes. J Biosci 32(1):31–42
30. Singh H, Raghava GPS (2001) ProPred: prediction of HLA-DR binding sites. Bioinformatics 17(12):1236–1237
31. Singh H, Raghava GPS (2003) ProPred1: prediction of promiscuous MHC class-I binding sites. Bioinformatics 19(8):1009–1014
32. Reche PA, Reinherz EL (2007) Prediction of peptide-MHC binding using profiles. Methods Mol Biol 409:185–200
33. Liu W, Wan J, Meng X et al (2007) In silico prediction of peptide-MHC binding affinity using SVRMHC. Methods Mol Biol (Clifton, NJ) 409:283–291
34. Rammensee HG, Bachmann J, Emmerich NPN et al (1999) SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 50(3–4):213–219
35. Jespersen MC, Peters B, Nielsen M et al (2017) BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res 45(W1):W24–W29
36. Ansari H, Raghava GP (2010) Identification of conformational B-cell epitopes in an antigen from its primary sequence. Immunome Res 6(1):6
37. Saha S, Raghava GP (2006) Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins 65(1):40–48
38. Gupta S, Ansari HR, Gautam A et al (2013) Identification of B-cell epitopes in an antigen for inducing specific class of antibodies. Biol Direct 8(1):27
39. Chen J, Liu H, Yang J et al (2007) Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids 33(3):423–428
25. Larsen MV, Lundegaard C, Lamberth K et al (2007) Large-scale validation of methods for
cytotoxic T-lymphocyte epitope prediction. 40. Yao B, Zhang L, Liang S et al (2012)
BMC Bioinformatics 8:1–12 SVMTriP: a method to predict antigenic epi-
26. Stranzl T, Larsen MV, Lundegaard C et al topes using support vector machine to inte-
(2010) NetCTLpan: pan-specific MHC class I grate tri-peptide similarity and propensity.
pathway epitope predictions. Immunogenetics PLoS One 7(9):e45152
62(6):357–368 41. Saha S, Raghava GPS (2004) BcePred: predic-
27. Jurtz V, Paul S, Andreatta M et al (2017) tion of continuous B-cell epitopes in antigenic
NetMHCpan-4.0: improved peptide–MHC sequences using Physico-chemical properties.
class i interaction predictions integrating eluted In: Nicosia G, Cutello V, Bentley PJ, Timmis
J (eds) Artificial immune systems. ICARIS
2004. Lecture notes in computer science, vol 45. Manavalan B, Govindaraj RG, Shin TH et al
3239. Springer, Berlin, Heidelberg (2018) iBCE-EL: a new ensemble learning
42. Gao J, Faraggi E, Zhou Y et al (2012) BEST: framework for improved linear B-cell epitope
improved prediction of B-cell epitopes from prediction. Front Immunol 9:1695
antigen sequences. PLoS One 7(6):e40104 46. Sweredoski MJ, Baldi P (2009) COBEpro: a
43. Rubinstein ND, Mayrose I, Martz E et al novel system for predicting continuous B-cell
(2009) Epitopia: a web-server for predicting epitopes. Protein Eng Des Sel 22(3):113–120
B-cell epitopes. BMC Bioinformatics 10 47. Singh H, Ansari HR, Raghava GP (2013)
(1):287 Improved method for linear b-cell epitope pre-
44. Mayrose I, Penn O, Erez E et al (2007) Pepi- diction using antigen’s primary sequence.
tope: epitope mapping from affinity-selected PLoS One 8(5):e62216
peptides. Bioinformatics 23(23):3244–3246
Chapter 12
A New Approach to Assess mAb Aggregation

Illarion V. Turko
Abstract
A proof of concept for new methodology to detect and potentially quantify mAb aggregation is presented.
Assay development included using an aggregated mAb as bait for screening of a phage display peptide
library and identifying those peptides with random sequence which can recognize mAb aggregates. The
selected peptides can be used for developing homogeneous quantitative methods to assess mAb aggrega-
tion. Results indicate that a peptide-binding method coupled with fluorescence polarization detection can
detect mAb aggregation and potentially monitor the propensity of therapeutic protein candidates to
aggregate.
Key words Monoclonal antibodies, Aggregation, Peptide phage display, Next-generation sequenc-
ing, FITC-peptide
1 Introduction
Monoclonal antibody (mAb) formulations are typically highly concentrated
protein solutions in which various molecular interactions can increase
the likelihood of antibody aggregation. Aggregation is particularly
concerning because of the suspected immunogenicity of antibody
aggregates [1, 2]. mAb aggregates appear in different sizes (from a few
nanometers to hundreds of microns) and in different forms (reversible
non-covalent, irreversible non-covalent, and irreversible covalent)
[3–7]. Unfortunately, most currently available techniques to study mAb
aggregates are semiquantitative in nature and cannot provide absolute
quantitative data [7–13]. Therefore, an analytical assay that quantifies
mAb aggregates in complex samples would be highly beneficial in both
research and clinical settings.
In the present work, we reasoned that protein-protein interfaces formed
by mAb aggregation could be selectively recognized by short peptides
with random amino acid sequences. Assay development included using
purified aggregated mAb as bait for screening a phage display peptide
library and identifying those peptides with random sequences that
recognize mAb aggregates. Once identified, the selected peptides were
conjugated with fluorescein isothiocyanate (FITC) and used to develop a
96-well plate-based fluorescence polarization assay for detection of mAb
aggregation. All phage clones selected for aggregated mAbs were
submitted to next-generation sequencing (NGS); the application of NGS to
peptide phage display and the NGS data analysis are also described.
2 Materials
2.1 NISTmAb Aggregation
1. Formulation buffer: For 500 mL of formulation buffer (25 mmol/L
histidine/HCl, pH 6.0), weigh out 1.3129 g histidine monohydrochloride
monohydrate and 0.9704 g histidine. Quantitatively transfer to a clean
beaker containing 450 mL ultrapure water (resistivity of 18 MΩ·cm at
25 °C). Use a stir bar to mix. Calibrate the pH meter using pH 4 and
pH 7 standards. Adjust pH to 6.00 by dropwise addition of 1 mol/L HCl
with constant stirring (10 drops to 15 drops). Transfer to a 500 mL
class A volumetric flask. Rinse the beaker with ultrapure water and
adjust the flask volume to 500.0 mL using the rinse water. Pass through
a 0.22 μm cellulose-acetate membrane into a sterile plastic bottle.
Store at 2 °C to 8 °C.
2. NISTmAb [14] stock solution: 10 mg/mL humanized IgG1κ monoclonal
antibody (NIST RM 8671) in formulation buffer (see Note 1).
3. Size exclusion chromatography (SEC): can be performed on any fully
automated liquid chromatography system designed for research-scale
separation of proteins (such as ÄKTA FPLC (GE Healthcare Life Sciences)
or NGC Chromatography System (Bio-Rad Laboratories)). Suitable columns
include Superdex 200 Increase 10/300 GL or Superose 6 Increase 10/300 GL
(both from GE Healthcare Life Sciences).
4. Concentrators: Centrifugal Filter Devices Amicon Ultra-4 (for volumes
up to 4 mL) and Amicon Ultra-0.5 (for volumes up to 0.5 mL) with
30,000 molecular mass cutoffs should be used in accordance with
manufacturer recommendations.
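The weighed masses in the formulation buffer recipe can be checked against the stated 25 mmol/L target with a few lines of arithmetic. The following is a minimal sketch, not part of the original protocol; the molar masses are assumed from the standard formula weights of L-histidine and L-histidine monohydrochloride monohydrate, and the function name is hypothetical.

```python
# Molar masses in g/mol (assumed from standard formula weights).
M_HIS = 155.15          # L-histidine, C6H9N3O2
M_HIS_HCL_H2O = 209.63  # L-histidine monohydrochloride monohydrate


def buffer_molarity_mmol_per_l(mass_base_g, mass_salt_g, volume_l):
    """Total histidine concentration (mmol/L) from the two weighed masses."""
    moles = mass_base_g / M_HIS + mass_salt_g / M_HIS_HCL_H2O
    return 1000.0 * moles / volume_l


# Masses and final volume taken from the recipe for 500 mL of buffer.
total = buffer_molarity_mmol_per_l(0.9704, 1.3129, 0.500)
print(f"total histidine: {total:.2f} mmol/L")
```

With the masses given in the recipe, the result should come out close to 25 mmol/L, with the free base and the hydrochloride salt present in nearly equimolar amounts.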
2.2 Phage Display
1. Ph.D.-12 phage display peptide library kit from New England BioLabs
(Ipswich, MA, USA), catalog # E8110S. The library consists of M13
filamentous bacteriophage, on which five copies of a 12-amino-acid
linear random peptide sequence are expressed as N-terminal fusions to
the minor coat protein pIII of the phage. A short glycine-glycine-
glycine-serine (GGGS) linker is present between each displayed peptide
and the pIII protein. The M13 phage is propagated in E. coli host strain
ER2738 (included in the kit, New England BioLabs) in Luria-Bertani (LB)
medium containing 20 μg/mL tetracycline (Sigma-Aldrich, St. Louis, MO).
2. Blocking buffer: 5 mg/mL bovine serum albumin (BSA) in
phosphate-buffered saline (PBS). Filter sterilize and store at 4 °C.
Washing buffer: PBS with 0.01% Tween 20 (PBST). Autoclave and store at
room temperature.
Elution buffer: 1 mg/mL BSA in 0.2 mol/L glycine-HCl (pH 2.2). Filter
sterilize and store at 4 °C.
Neutralizing buffer: 1 mol/L Tris–HCl (pH 9.1). Autoclave and store at
room temperature.
3. PEG/NaCl solution: 20% (w/v) polyethylene glycol 8000 with 2.5 mol/L
NaCl. Autoclave and store at room temperature.
4. LB medium: 10 g Bacto-Tryptone, 5 g yeast extract, and 5 g NaCl per
liter. Autoclave and store at room temperature. Supplement with sterile
filtered 20 μg/mL tetracycline before use.
2.3 Phage DNA Purification
1. TE buffer: 10 mmol/L Tris–HCl (pH 8.0) with 1 mmol/L EDTA.
2. Iodide buffer: TE buffer with 4 mol/L sodium iodide. Store in a dark
bottle covered with foil at room temperature.
3. DNA Clean and Concentrator-5 kit (Zymo Research, catalog # D4003).
2.4 Next-Generation Sequencing
Next-generation sequencing (NGS) was performed by Abm (Richmond, BC,
Canada) using in-house materials. Abm-provided NGS services can be found
through the Genohub website (https://genohub.com/).
2.5 Fluorescence Polarization
1. FITC-labeled peptides were synthesized by Biomatik (Wilmington, DE,
USA) by adding Lys-FITC to the C-terminus of the peptide. Purity of the
FITC-peptides was higher than 95%. FITC-peptides are not soluble in
water, and their 500 μmol/L stock solutions were made in 50% DMSO
(volume) in water.
2. Multimode reader: Synergy Neo2 plate reader (BioTek Instruments,
Winooski, VT, USA).
3. Assay plate: Costar assay plate, 96-well black with clear flat
bottom, non-treated, polystyrene from Corning Inc., Kennebunk, ME, USA.
3 Methods
3.1 NISTmAb Aggregation
To partially aggregate NISTmAb, an aliquot of NISTmAb stock solution was
incubated at 70 °C for 10 min (see Note 2). This treatment typically
aggregates approximately 20% of the NISTmAb. Use 100 μL of
temperature-treated NISTmAb per SEC run to separate aggregated NISTmAb
from monomeric NISTmAb. The mobile phase is PBS and the elution rate is
0.4 mL/min. Collect small fractions, such as 0.3 mL (see Note 3). A
single chromatographic run is not sufficient to obtain the pure
aggregated form of NISTmAb; the peaks are not fully resolved and
cross-contaminate each other (Fig. 1a). The peak fractions of aggregated
NISTmAb from the first run must be pooled, concentrated to 100 μL using
Amicon centrifugal filter devices, and run again by SEC. Figure 1b shows
that after the second run, a small amount of monomer is still present in
the sample of aggregated NISTmAb. The peak fractions of aggregated
NISTmAb should be pooled again, concentrated again to 100 μL, and run a
third time by SEC. Figure 1c shows that after three runs, the sample of
aggregated NISTmAb is apparently free of monomeric NISTmAb and can be
used as bait for phage display screening.
Fig. 1 SEC for temperature-treated NISTmAb. (a) First run. (b) Second
run. (c) Third run
3.2 Phage Display
Protocols that cover applications of Ph.D. Phage Display Libraries can
be found at
https://www.neb.com/-/media/catalog/Datacards%20or%20Manuals/manualE8102.pdf.
Here, we describe only the protocol used in the current study:
1. Phage panning: 150 μL/well of 500 μg/mL untreated NISTmAb or
temperature-treated NISTmAb in PBS was used to coat several wells of a
96-well plate (Nunclon, catalog # 167008) at 4 °C overnight. All
subsequent steps were carried out with gentle orbital shaking at room
temperature. First, wells were washed with 200 μL of PBS for 10 min, and
the PBS was replaced with 200 μL of blocking buffer for 2 h. Then, each
well was loaded with 150 μL of 15-fold diluted Ph.D.-12 phage display
peptide library in blocking buffer. After 2 h of incubation, each well
was washed five times with 200 μL of washing buffer, 10 min per wash.
Finally, attached phage was eluted with 150 μL of elution buffer for
exactly 10 min and transferred to clean autoclaved conical tubes
containing 22 μL of neutralizing buffer.
3.3 Phage Amplification
1. Inoculate 3 mL of LB/tetracycline medium with 1 μL of ER2738 E. coli
cells (stock solution is included in the Ph.D.-12 library kit) and grow
cells overnight at 37 °C with shaking at 250 rpm. The next morning,
inoculate 20 mL of LB/tetracycline medium with 200 μL of overnight
culture and 100 μL of eluted/neutralized phage. Incubate for exactly
4.5 h at 37 °C with shaking at 250 rpm. Then, transfer the culture to a
centrifuge tube and spin at 12,000 × g for 10 min at 4 °C. Discard the
pellet, transfer the supernatant to a new tube, and repeat the
centrifugation. Transfer the supernatant to a new tube and add to it
1/6 volume of 20% PEG/2.5 mol/L NaCl. Mix thoroughly and allow the phage
to precipitate overnight at 4 °C. The next morning, centrifuge the
precipitate at 12,000 × g for 20 min at 4 °C. Discard the supernatant
and suspend the pellet in 0.5 mL of PBS. This is the amplified phage
ready for DNA purification (see Note 4).
3.4 Phage DNA Purification
1. Mix 150 μL of amplified phage with 25 μL of 20% PEG/2.5 mol/L NaCl
and incubate at room temperature for 30 min. Then, centrifuge at
12,000 × g for 20 min at 4 °C. Discard the supernatant and suspend the
pellet in 100 μL of iodide buffer. Add 250 μL of ice-cold ethanol and
incubate at room temperature for 30 min. Centrifuge at 12,000 × g for
10 min at 4 °C. Discard the supernatant and suspend the pellet in 200 μL
of ice-cold 70% (volume) ethanol. Then, centrifuge at 12,000 × g for
10 min at 4 °C. Discard the supernatant and air-dry the pellet for
10 min. Finally, suspend the pellet in 20 μL of TE buffer and keep at
4 °C. At this stage, the UV spectra of phage DNA show a strong
absorbance at 230 nm, which prevents accurate assessment of phage DNA
purity based on the A260/A280 ratio.
2. To further purify phage DNA, the DNA Clean & Concentrator-5 kit (Zymo
Research, catalog # D4003) is used with minor modifications to the
manufacturer's recommended protocol. Essentially, mix 20 μL of phage DNA
with 140 μL of the provided DNA Binding Buffer and load on a Zymo-Spin
column in a collection tube. After centrifugation at 12,000 × g for
30 s, discard the flow-through and wash the column twice with 200 μL of
the provided DNA Wash Buffer, centrifuging at 12,000 × g for 30 s each
time. Finally, transfer the column into a fresh 1.5 mL microcentrifuge
tube and elute phage DNA with 30 μL of the provided DNA Elution Buffer
by centrifugation at 12,000 × g for 30 s. At this stage, phage DNA has
an A260/A280 ratio of 1.9 to 2.0 and is ready for sequencing.
3.5 Next-Generation Sequencing and Data Analysis
1. We used the Genohub website (https://genohub.com/) to
identify a next-generation sequencing (NGS) service (see Note 5). The
submitted samples of phage DNA were subjected to two-step PCR
amplification using nested primers, followed by adaptor addition.
Quality control of the final pooled library was performed using Qubit,
Agilent 2100 Bioanalyzer, and qPCR. Sequencing was completed on an
Illumina MiSeq 300-cycle flow cell. Read length was 2 × 150 bp
paired-end (PE). The FASTQ data were analyzed by merging left and right
reads using FLASH and counting unique DNA sequences. The insert
sequences were translated to amino acid sequences using the EMBOSS
Transeq tool. For the various samples, the delivered number of unique
peptide sequences with their frequencies was in the range of
70,000–130,000 (see Note 6). Finally, peptide sequences found in control
samples (NISTmAb without temperature treatment) were subtracted from
peptide sequences found for purified aggregated NISTmAb treated at 70 °C
for 10 min (see Note 7). Table 1 shows a typical distribution of
peptides as a function of their appearance in the NGS data. The cutoff
for further analysis is subjective, but it is typically limited to a
small number of frequently appearing peptides (see Note 8).
Table 1
Appearance of peptides in NGS data. The 127,396 hits are ranked by their
appearance. The most abundant peptide (#1) was found 156 times
Peptide    Appearance
#1         156
#2         65
#3         41
#4         26
#55        10
#1985      1
#127396    1
Motif LDLKR
    DM  LDCRR  VGCAP
   GPF  LDVKR  NMTV
  HEYQ  LSIER  RLP
     E  LWSKR  ATYPPL
     A  LDLPR  PQIGNR
VTRFHP  LNESR  Y
  SYND  LASFR  LTT
    NS  VDFTR  YTHSG
SQGKDH  LRLLR  P
Fig. 2 An example of a common motif search using The MEME Suite
software. The 100 top NGS sequences sorted by position p-value
2. Once the list of highest appearance peptides is obtained, two major
strategies for assay development can be applied.
The first employs various bioinformatics tools in search of a common
motif. Figure 2 shows an example of the application of motif-based
sequence analysis tools (The MEME Suite,
http://meme-suite.org/) (see Note 9).
The second would be to find the best combination of high- and
low-affinity peptides experimentally to achieve the highest avidity of
the assay.
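The counting and subtraction steps described in step 1 can be sketched in a few lines. This is an illustration only and not the sequencing provider's pipeline: read merging (FLASH) and translation (EMBOSS Transeq) are replaced here by a plain standard-genetic-code lookup, "subtraction" is assumed to mean removing every peptide that appears at all in the control, and all function names and the toy inserts are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate(dna):
    """Translate an in-frame DNA insert; unrecognized codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def count_peptides(inserts):
    """Count how often each translated peptide appears in a sample."""
    return Counter(translate(seq) for seq in inserts)


def subtract_control(treated, control):
    """Keep peptides absent from the control, ranked by appearance."""
    kept = {pep: n for pep, n in treated.items() if pep not in control}
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Toy inserts (hypothetical, far shorter than real 36-nucleotide inserts).
treated = count_peptides(["GCTGATTGG", "GCTGATTGG", "AAACGT"])
control = count_peptides(["AAACGT"])
ranked = subtract_control(treated, control)
print(ranked)
```

The ranked list plays the role of Table 1: the peptides at the top are the candidates carried forward to motif searching or to experimental testing.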
3.6 Fluorescence Polarization
1. Fluorescence polarization assays can detect the binding of a small
fluorescently labeled peptide to a protein of interest [15] (see
Note 10).
2. Fluorescence polarization equilibrium saturation binding assays (see
Note 11) were used in this study and were performed in duplicate with
50 μL samples per well. Serially diluted control and temperature-treated
NISTmAb (from 3 μmol/L to 230 μmol/L) were incubated with
FITC-Peptide#1 (0.5 μmol/L) or a mixture of FITC-Peptide#1 and
FITC-Peptide#7 (each at 0.5 μmol/L) for 20 min at room temperature with
gentle shaking on an orbital shaker (see Note 12). Use the Dual FP and
FP485/530 filters on the Synergy Neo2 plate reader and measure
fluorescence polarization.
3. Anticipated representative results are shown in Fig. 3, plotted as
NISTmAb concentration versus fluorescence polarization (see Note 13).
[Figure 3 plot: fluorescence polarization (y-axis, 0 to 0.2) versus
NISTmAb concentration in μmol/L (x-axis, 0 to 250); series: control/#1,
control/#1+#7, 70 °C/#1, 70 °C/#1+#7]
Fig. 3 A representative fluorescence polarization saturation binding
assay using two FITC-peptides (with appearance #1 and #7 from NGS data)
at 0.5 μmol/L and the indicated concentrations of control and
temperature-treated NISTmAb
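For readers unfamiliar with the readout, the quantities behind a saturation binding curve can be sketched as follows. This is a simplified illustration under stated assumptions, not the plate reader's software or the analysis used for Fig. 3: polarization is computed from parallel and perpendicular emission intensities with an optional G-factor, binding is modeled as a single site with protein in large excess over peptide, and the observed signal is taken as a linear mix of free and bound peptide. The KD and polarization limits in the usage lines are hypothetical values, not results from this study.

```python
def polarization(i_parallel, i_perpendicular, g_factor=1.0):
    """Fluorescence polarization P from the two emission intensities."""
    corrected = g_factor * i_perpendicular
    return (i_parallel - corrected) / (i_parallel + corrected)


def fraction_bound(protein_umol_l, kd_umol_l):
    """One-site saturation binding, assuming protein is in excess over peptide."""
    return protein_umol_l / (kd_umol_l + protein_umol_l)


def expected_polarization(protein_umol_l, kd_umol_l, p_free, p_bound):
    """Polarization modeled as a weighted mix of free and bound peptide."""
    fb = fraction_bound(protein_umol_l, kd_umol_l)
    return p_free + (p_bound - p_free) * fb


# Hypothetical parameters chosen only to show the shape of the curve.
for conc in (3, 30, 100, 230):
    value = expected_polarization(conc, kd_umol_l=50.0, p_free=0.05, p_bound=0.15)
    print(conc, round(value, 3))
```

In this simple model the signal reaches half of its dynamic range when the protein concentration equals the KD, which is why the titration range has to bracket the anticipated KD.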
4 Notes
1. RM 8671-NISTmAb, humanized IgG1κ monoclonal antibody, is available
for purchase directly through the NIST website
(https://www-s.nist.gov/srmors/view_detail.cfm?srm=8671).
2. A water or air bath equipped with a control thermometer must be used
to ensure that the actual temperature is 70 °C.
3. Collecting small fractions helps in selecting those that will be
pooled for further processing and provides the best balance between
purity and yield.
4. In traditional applications of phage display screening, the
panning/amplification cycles are repeated three to five times. This
selects peptides that bind the bait with high affinity and eliminates
peptides with low-affinity binding, which decreases the avidity
available from the selected pool. Avidity refers to the accumulated
strength of multiple individual binding interactions. The ultimate goal
of this study is to develop a homogeneous assay to assess mAb
aggregation, which can rely either on a single peptide with high
affinity or on a mixture of several peptides with different affinities
(avidity). For this reason, peptides with low-affinity binding could be
as useful as peptides with high-affinity binding; therefore, a single
panning/amplification cycle was used in our case.
5. There are many options for obtaining NGS services. Our use of
Genohub does not endorse Genohub and does not imply that
Genohub services are necessarily the best available for the
purpose.
6. Due to random clustering during the sequencing process,
multiplexed samples may not get equal representation. As a
result, some samples may have more reads and some samples
will have fewer reads.
7. Phage display panning for control and temperature-treated
purified aggregated NISTmAb was performed side by side.
The control includes peptides not relevant to aggregation
that interact with plastic, BSA, and non-aggregated NISTmAb.
After subtraction, the final list of peptides includes those pep-
tides that presumably only recognize aggregated NISTmAb.
8. The appearance of peptides in the NGS data apparently correlates with
their affinity for aggregated NISTmAb. Specifically, for the data
presented in Table 1, despite obtaining 127,396 hits, there is a fairly
clear cutoff between the top 50 hits and the remaining peptides, which
provides guidance for further analysis.
9. This is purely a bioinformatics approach to help design peptides with
the highest affinity for aggregated mAb.
10. The fluorescent peptide probe should meet several general
requirements: (a) the peptide should be as short as possible but still
have sufficient binding affinity for aggregated mAb, (b) sufficient
binding affinity is typically in the low micromolar range, (c) the
peptide concentration in the assay should be at least twofold lower than
its KD, and (d) an extra linker between peptide and fluorophore should
be avoided.
11. There are at least two options for how to perform the fluores-
cence polarization binding assay, including equilibrium satura-
tion and equilibrium competition assays [16].
12. Keep the plate wrapped in foil to minimize exposure of fluores-
cent reagents to light.
13. Concentrations of NISTmAb and FITC-peptides are specific
for the experiment shown in Fig. 3. For other applications
beyond these proteins and peptides, the concentrations will
need to be selected based on the anticipated KD value.
Acknowledgments
Certain commercial equipment, instruments, or materials are identified
to adequately specify the experimental procedure. Such identification
does not imply recommendation or endorsement by the National Institute
of Standards and Technology, nor does it imply that the materials or
equipment identified are necessarily the best available for the purpose.
References
1. Lowe D, Dudgeon K, Rouet R, Schofield P, Jermutus L, Christ D (2011)
Aggregation, stability, and formulation of human antibody therapeutics.
Adv Protein Chem Struct Biol 84:41–61
2. Singh SK (2011) Impact of product-related factors on immunogenicity
of therapeutics. J Pharm Sci 100:354–387
3. Cromwell MEM, Hilario E, Jacobson F (2006) Protein aggregation and
bioprocessing. AAPS J 8:E572–E579
4. Vazquez-Rey M, Lang DA (2011) Aggregates in monoclonal antibody
manufacturing processes. Biotechnol Bioeng 108:1494–1508
5. Nishi H, Miyajima M, Nakagami H, Noda M, Uchiyama S, Fukui K (2010)
Phase separation of an IgG1 antibody solution under a low ionic strength
condition. Pharm Res 27:1348–1360
6. Manning MC, Chou DK, Murphy BM, Payne RW, Katayama DS (2010)
Stability of protein pharmaceuticals: an update. Pharm Res 27:544–575
7. Geng SB, Cheung JK, Narasimhan C, Shameem M, Tessier PM (2014)
Improving monoclonal antibody selection and engineering using
measurements of colloidal protein interactions. J Pharm Sci
103:3356–3363
8. Razinkov VI, Treuheit MJ, Becker GW (2015) Accelerated formulation
development of monoclonal antibodies (mAbs) and mAb-based modalities:
review of methods and tools. J Biomol Screen 20:468–483
9. Arora J, Hickey JM, Majumdar R, Esfandiary R, Bishop SM, Samra HS,
Middaugh CR, Weis DD, Volkin DB (2015) Hydrogen exchange mass
spectrometry reveals protein interfaces and distant dynamic coupling
effects during the reversible self-association of an IgG1 monoclonal
antibody. MAbs 7:525–539
10. Tessier PM, Wu J, Dickinson CD (2014) Emerging methods for
identifying monoclonal antibodies with low propensity to self-associate
during the early discovery process. Expert Opin Drug Deliv 11:461–465
11. Yadav S, Shire SJ, Kalonia DS (2010) Factors affecting the viscosity
in high concentration solutions of different monoclonal antibodies. J
Pharm Sci 99:4812–4829
12. Jezek J, Rides M, Derham B, Moore J, Cerasoli E, Simler R,
Perez-Ramirez B (2011) Viscosity of concentrated therapeutic protein
compositions. Adv Drug Deliv Rev 63:1107–1117
13. Binabaji E, Ma J, Zydney AL (2015) Intermolecular interactions and
the viscosity of highly concentrated monoclonal antibody solutions.
Pharm Res 32:3102–3109
14. Schiel JE, Turner A (2018) The NISTmAb reference material 8671
lifecycle management and quality plan. Anal Bioanal Chem 410:2067–2078
15. Moerke NJ (2009) Fluorescence polarization (FP) assays for
monitoring peptide-protein and nucleic acid-protein binding. Curr Protoc
Chem Biol 1:1–15
16. Rossi AM, Taylor CW (2011) Analysis of protein-ligand interactions
by fluorescence polarization. Nat Protoc 6:365–387
Chapter 13
Generation of Variability-Free Reference Proteomes
from Pathogenic Organisms for Epitope-Vaccine Design
Jose L. Sanchez-Trincado and Pedro A. Reche
Abstract
Many pathogenic organisms have an inherent ability to rapidly evolve
into new variants, which enables them to escape previously existing
immune responses. Vaccine design strategies should aim to counteract
such variability by targeting the conserved antigen regions of the
pathogen. Sequence variability analysis allows the identification of
conserved regions from multiple sequence alignments of the relevant
antigens. In this chapter, we describe a detailed protocol and provide
software to build variability-free proteomes for epitope-vaccine design.
The procedure, which is illustrated for human herpesvirus 1 (HHV1),
involves the identification of protein clusters, followed by multiple
sequence alignments and Shannon variability calculations. The software
required to build variability-free proteomes is available at
http://imed.med.ucm.es/software/mmb2019.
Key words Epitope, Consensus sequence, Clustering, Multiple sequence alignments, Variability,
Shannon entropy
1 Introduction
Traditional vaccines, like those based on attenuated organisms, have
proven successful against many pathogens. However, these vaccines have
failed to provide broad and long-lasting immunity against pathogens with
high mutation rates that rapidly generate new genetic variants [1, 2],
such as influenza or HIV-1 viruses. In these cases, new vaccines that
overcome pathogen sequence variability need to be designed. It is often
found that the most immunogenic antigens of the pathogen are those
exhibiting the greatest degree of variability [3]. Moreover, no single
antigen is conserved throughout its entire sequence. Therefore, vaccines
based on conserved epitopes (the precise antigen regions recognized by
the adaptive immune system) are arguably the best choice to induce
protective immunity against highly variable pathogens. Not surprisingly,
researchers often target conserved antigen regions for epitope
prediction and identification [4–7].
Conserved or non-variable antigen regions can be identified after
multiple sequence alignments (MSAs) of the relevant antigens using
sequence similarity/identity thresholds. As an alternative, we
introduced an online tool named PVS (Protein Variability Server:
http://imed.med.ucm.es/PVS) [8], which identifies non-variable
regions through sequence variability analyses and returns consensus
sequences with variable residues masked. Subsequently, the tool permits
the prediction and visualization of epitopes within the conserved
regions. In addition, PVS has become instrumental in vaccine design
based on legacy experimental epitopes [9–15]. In this approach, the
consensus sequences returned by PVS are used to select non-variable
(conserved) epitopes within sets of experimentally validated epitopes
obtained from relevant databases such as IEDB [16] and EPIMHC [17].
Following the success of PVS, in this chapter we describe a detailed
method to obtain consensus sequences with variable residues masked in a
Linux/Unix shell. The procedure starts with the collection of
full-length genomes of a given organism from the NCBI Nucleotide
database (https://www.ncbi.nlm.nih.gov/nuccore/), selecting
one as reference. Genome sequences are then processed to extract the
amino acid coding sequences (CDS), which are clustered by sequence
identity using CD-HIT [18]. Subsequently, MSAs are obtained with MUSCLE
[19] from the clusters that contain reference CDS. MSAs are subjected to
sequence variability analysis using Shannon entropy [20], H, as the
variability metric. Lastly, the variability in each of the MSAs is
assigned to the chosen reference sequences, masking any residue with a
variability above a given threshold, often H > 0.5, to generate
variability-free reference consensus sequences. We illustrate the
protocol step by step on human herpesvirus 1 (HHV1) genome sequences.
In-house software to generate consensus reference proteomes is available
at http://imed.med.ucm.es/software/mmb2019.
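The masking idea can be illustrated with a short sketch. It is not the authors' maskvar_seqanalyse.pl script: it assumes the usual definition H = -Σ p_i log2(p_i) over the residues observed in each alignment column, counts gaps as an ordinary symbol, drops columns where the reference has a gap, and uses "." as a hypothetical mask character.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy H (in bits) of one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_reference(msa, ref_index=0, threshold=0.5, mask="."):
    """Return the reference sequence with variable positions masked."""
    reference = msa[ref_index]
    masked = []
    for pos, residue in enumerate(reference):
        if residue == "-":  # skip columns where the reference has a gap
            continue
        column = [seq[pos] for seq in msa]
        masked.append(mask if shannon_entropy(column) > threshold else residue)
    return "".join(masked)


# Toy alignment (hypothetical): positions 2 and 3 vary, positions 1 and 4 do not.
toy_msa = ["MKTA", "MKSA", "MRTA", "MKTA"]
print(mask_reference(toy_msa))
```

In the toy alignment, one substitution among four sequences gives H of roughly 0.81 at that column, which exceeds the 0.5 threshold, so both variable positions of the reference are masked.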
2 Materials
2.1 Sequences
We obtained HHV1 complete genome sequence files from the NCBI Nucleotide
database (https://www.ncbi.nlm.nih.gov/nuccore). A total of
270 HHV1 genomes, each encompassing approximately 150,000 nucleotides,
were downloaded into a single GenBank file (sequences.gb) (Fig. 1). We
selected as the reference the genome with GenBank accession X14112.1 and
put it into a separate file (ref.gb).
2.2 Software
Software and scripts used in this report are summarized in Table 1.
Sequence clusters were generated using CD-HIT [18] and MSAs using MUSCLE
[19] or ClustalΩ [21]. Perl scripts are used for data formatting,
processing output, executing programs, and computing sequence
variability. Usage of these tools is described in the Methods section.
Fig. 1 NCBI interface showing a search for HHV1 genomic sequences
Table 1
Computer applications and scripts for generating variability-free reference proteomes
Tool                   Description                                                                    Download                                                Reference
CD-HIT                 Package to cluster sequences                                                   https://github.com/weizhongli/cdhit             [18]
MUSCLE                 Application to generate MSAs                                                   https://www.ebi.ac.uk/Tools/msa/muscle/         [19]
cds_gb4cdhit.pl        Perl script to obtain CDS in GenBank files and generate FASTA files            http://imed.med.ucm.es/software/mmb2019         n/a
cds_gbref4cdhit.pl     Identical to cds_gb4cdhit.pl but introduces a “ref” tag in FASTA headers       http://imed.med.ucm.es/software/mmb2019         n/a
parse_cdhit_clstr.pl   Perl script to analyze and process protein clusters                            http://imed.med.ucm.es/software/mmb2019         n/a
maskvar_seqanalyse.pl  Perl script to build consensus sequences with variable sites masked from MSAs  http://imed.med.ucm.es/software/mmb2019         n/a
flat_fasta.pl          Perl script to remove line breaks in sequences                                 http://imed.med.ucm.es/software/mmb2019         n/a
3 Methods
3.1 Collection of CDS from Genome Files
To build HHV1 consensus proteome sequences, we first have to extract CDS
from both the genome selected as reference (ref.gb) and the file
including all genomes (sequences.gb). Alternatively, one could directly
retrieve protein sequences from NCBI (see Note 1). The Perl scripts
cds_gb4cdhit.pl and cds_gbref4cdhit.pl are needed to complete this task.
Both scripts select CDS in GenBank records and generate a single protein
FASTA file from them. The scripts differ in that cds_gbref4cdhit.pl
introduces a “ref” tag in the header of sequences and must be used with
the reference genome. These two scripts are executed as follows:
$ cds_gbref4cdhit.pl ref.gb > ref.fa
$ cds_gb4cdhit.pl sequences.gb > sequences.fa
The number of amino acid sequences in the HHV1 reference proteome
(ref.fa) is 77, while the total number of HHV1 amino acid sequences is
21,422 (sequences.fa). Subsequently, the two FASTA files must be merged
into a single file (sequences_and_ref.fa) using the following command:
$ cat ref.fa sequences.fa > sequences_and_ref.fa
3.2 Generation of Protein Clusters
Clustering of amino acid sequences, in this case from HHV1, is achieved with CD-HIT. This program defines clusters within a collection of sequences according to an identity threshold and generates nonredundant sequence files. The default threshold is 0.9, meaning that sequences sharing at least 90% identity with the representative sequence of a cluster are included in that cluster. Here, however, we show the use of CD-HIT with 0.8 as the identity threshold (see Note 2).
$ cd-hit -i sequences_and_ref.fa -c 0.8 -o sequences_and_ref_08
CD-HIT returns several files named after the output name passed in the command line with the “-o” option. The file with the clstr extension (sequences_and_ref_08.clstr) reports all HHV1 amino acid sequence clusters, listing the names and lengths of the sequences included in each cluster and the percentage (%) of identity with the first sequence in the cluster. This first sequence appears tagged with “*”. CD-HIT also returns a FASTA file containing the tagged representative sequences (sequences_and_ref_08), as well as a file indicating the cluster assigned to each of the selected sequences (sequences_and_ref_08.bak.clstr). In our example, we obtained 351 clusters from all 21,422 HHV1 amino acid sequences included in the file sequences_and_ref.fa. Since the chosen reference genome has only 77 CDS, it follows that some clusters do not contain a reference CDS. In this chapter, we only consider clusters with reference CDS for generating consensus sequences (see Note 3).
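The cluster report described above can also be inspected programmatically. The following Python sketch is not part of the chapter's Perl toolkit; it is a minimal illustration, assuming the usual CD-HIT .clstr layout (a ">Cluster n" line followed by one line per member) and assuming that the “ref” tag added by cds_gbref4cdhit.pl appears as a literal substring of the sequence identifier. The function names parse_clstr and clusters_with_reference are hypothetical.

```python
import re

# One member line of a CD-HIT .clstr file looks like
#   "0\t1240aa, >some_id... *"            (representative sequence)
#   "1\t1239aa, >other_id... at 99.50%"   (other cluster member)
MEMBER_RE = re.compile(r"^\d+\s+(\d+)aa,\s+>(.+?)\.\.\.\s+(\*|at\s+[\d.]+%)$")


def parse_clstr(lines):
    """Return {cluster_name: [(seq_id, length, is_representative), ...]}."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">Cluster"):
            current = line[1:]          # e.g. "Cluster 0"
            clusters[current] = []
        elif line and current is not None:
            match = MEMBER_RE.match(line)
            if match:
                length, seq_id, flag = match.groups()
                clusters[current].append((seq_id, int(length), flag == "*"))
    return clusters


def clusters_with_reference(clusters, ref_tag="ref"):
    """Names of clusters holding at least one identifier containing ref_tag.

    ASSUMPTION: the "ref" tag is a literal substring of the identifier;
    the exact header format written by cds_gbref4cdhit.pl is not known here.
    """
    return [name for name, members in clusters.items()
            if any(ref_tag in seq_id for seq_id, _, _ in members)]
```

Calling parse_clstr(open("sequences_and_ref_08.clstr")) and then clusters_with_reference on the result would list the clusters that are carried forward, under the stated assumptions about the header tag.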
3.3 Generation of Multiple Sequence Alignments from Sequence Clusters
The next step is to generate multiple sequence alignments (MSAs) from the selected clusters, namely those containing reference CDS (see Note 3). This is achieved using the Perl script parse_cdhit_clstr.pl. The script has several options and arguments; usage help can be obtained by calling the program without arguments. In this chapter, it is used as follows:
$ parse_cdhit_clstr.pl -i sequences_and_ref_08.clstr \
-p sequences_and_ref.fa -m 1 -s 1 -u 1
This script goes through each of the clusters defined by CD-HIT (sequences_and_ref_08.clstr, entered with the “-i” option), looks up the sequences in the original FASTA file entered with the “-p” option (sequences_and_ref.fa), and places them in independent FASTA files. Subsequently, the cluster FASTA files are processed according to the other arguments provided in the command line. Thus, the “-m 1” option generates MSAs from the FASTA files using MUSCLE (see Note 4) and saves them in FASTA format. The “-s 1” option sorts the sequences in the MSAs so that the first sequence is a CDS from the reference genome and the remaining sequences are sorted by length. This option also converts the MSAs into CLUSTAL format. Finally, the “-u 1” option guarantees that the MSAs contain only unique sequences without repeated identifiers. The script processes clusters with and without reference sequences independently, placing the results in two different directories, cluster_with_ref and cluster_noref. Under these two directories, the script creates three subdirectories named fastas, msa, and msa_sorted. The fastas directories are populated with FASTA files, the msa directories with MSAs generated by MUSCLE, and the msa_sorted directories with sorted MSAs containing only unique sequences. For the HHV1 case study, the script generated 73 FASTA files out of 351 clusters, found in the fastas directory under the directory clusters_with_ref. Since the reference genome has 77 CDS, it follows that some of the reference CDS share a sequence identity ≥80% and were clustered together. The cluster FASTA files are named cluster_n.fa, where n is the number of the cluster as defined by CD-HIT. Likewise, we find 73 files in the msa directory, corresponding to the MSAs named cluster_n.msa. Finally, in the msa_sorted directory, there are also 73 files named cluster_n_sorted.aln, containing the same MSAs in CLUSTAL format with sequences sorted as described previously.
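The sorting and de-duplication performed by the “-s 1” and “-u 1” options can be pictured with a short Python sketch. This is only an illustration, not the logic of parse_cdhit_clstr.pl itself. It assumes that duplicates are detected by identifier, that the reference is recognized by a “ref” substring in its identifier, and that the remaining sequences are ordered from longest to shortest by ungapped length (the chapter does not state the direction). The name sort_alignment is hypothetical.

```python
def sort_alignment(records, ref_tag="ref"):
    """Order aligned records: reference first, then others by length.

    records: list of (identifier, aligned_sequence) tuples.
    ASSUMPTIONS: duplicates are detected by identifier; the reference is the
    record whose identifier contains ref_tag; the other records are sorted by
    ungapped length, longest first.
    """
    seen = set()
    unique = []
    for name, seq in records:
        if name not in seen:            # keep the first copy of each identifier
            seen.add(name)
            unique.append((name, seq))

    refs = [rec for rec in unique if ref_tag in rec[0]]
    others = [rec for rec in unique if ref_tag not in rec[0]]
    # Gap characters do not count toward sequence length.
    others.sort(key=lambda rec: len(rec[1].replace("-", "")), reverse=True)
    return refs + others
```

Placing the reference first matters for the next step, because the variability values are later assigned to the first sequence of each MSA.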
3.4 Generation of Reference Consensus Proteomes with Variable Sites Masked
Obtaining consensus sequences with variable sites masked requires computing the sequence variability per site/residue in MSAs, and to that end we use Shannon entropy (Eq. 1).
$$H = -\sum_{i=1}^{M} P_i \log_2(P_i) \qquad (1)$$
where P_i is the fraction of residues of amino acid type i and M is the number of amino acid types, 20. H ranges from 0 (only one amino acid type is present at that position) to 4.322 (every amino acid is equally represented at that position). In general, any residue with H ≤ 0.5 is highly conserved [22]. Following these calculations, the H values are assigned to the residues of the first sequence in the MSA (the reference CDS), and any residue with variability above a chosen threshold, often H > 0.5, is masked to generate consensus reference sequences with variable sites masked. All these tasks can be performed using the Perl script maskvar_seqanalyse.pl as follows:
$ maskvar_seqanalyse.pl -i hhv1_msa_fnames.txt -t 0.5 -p 1 -s qry
The input for the script, hhv1_msa_fnames.txt, is a text file listing the names of all the MSAs generated from clusters with reference sequences. The MSAs need to be in the same directory as the input file. The “-t 0.5” option sets the H threshold for variability masking to 0.5. The “-p 1” option indicates that gaps in MSAs are omitted from variability computations. Lastly, “-s qry” selects the first sequence in the MSAs (the reference sequence) for generating consensus sequences with variable sites masked (see Note 5). The output of the script consists of FASTA files named after each of the MSA filenames in hhv1_msa_fnames.txt, the argument passed to the -s option, and the chosen threshold, e.g., msa_filename-qry-0.5.fa. Subsequently, all consensus sequences generated from the MSAs can be concatenated into a single FASTA file containing the entire proteome as follows:
$ cat *-0.5.fa > hhv1_cons_proteome.fa
In Fig. 2, we depict a fragment of the generated HHV1 consensus proteome with variable sites masked. Of note, HHV1 exhibits very little sequence variability, as few sites are masked.
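The variability masking of Subheading 3.4 can be sketched in a few lines of Python. This is a minimal illustration of Eq. 1 and of the masking rule (H > 0.5), not a reimplementation of maskvar_seqanalyse.pl. It assumes that the MSA is given as a list of equal-length strings with the reference sequence first, that gaps are written as “-” and skipped (as with “-p 1”), that masked residues are written as dots (as in Fig. 2), and that columns where the reference has a gap are dropped. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(column, skip_gaps=True):
    """Shannon entropy (base 2) of one alignment column, as in Eq. 1."""
    residues = [aa for aa in column if not (skip_gaps and aa == "-")]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def mask_reference(msa, threshold=0.5, mask_char="."):
    """Return the first (reference) sequence with variable sites masked.

    msa: list of equal-length aligned sequences, reference first.
    ASSUMPTION: columns where the reference has a gap are left out.
    """
    reference = msa[0]
    masked = []
    for i, aa in enumerate(reference):
        if aa == "-":
            continue
        column = [seq[i] for seq in msa]
        masked.append(mask_char if shannon_entropy(column) > threshold else aa)
    return "".join(masked)
```

For a toy alignment such as ["MKTAY", "MKSAY", "MRTAY", "MKTAY"], the second and third columns each hold one variant among four residues (H ≈ 0.81), so both are masked, while the invariant columns (H = 0) are kept.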
3.5 Identification of HHV1 Conserved CD8 T-Cell Epitopes
Consensus sequences with variable residues masked can be used as input for tools like RANKPEP (http://imed.med.ucm.es/Tools/rankpep.html) [23] to predict non-variable T-cell epitopes. In addition, these consensus sequences are also instrumental in identifying conserved epitopes among experimentally validated ones [4–7]. Here, we illustrate this application by identifying conserved HHV1-specific CD8 T-cell epitopes among experimentally validated ones. A search in the IEDB resource (https://www.iedb.org/) identifies 124 CD8 T-cell epitopes from HHV1 that are 9 residues long and meet the following criteria: linear peptides, positive results in T-cell assays, recognition by human subjects, and restriction by HLA I molecules.
Fig. 2 Fragment of the HHV1 consensus proteome. Dots replace sites/residues with H > 0.5
Conserved epitopes can be identified as those producing exact matches with the consensus reference proteome. To determine all possible epitope matches, the sequences in the consensus proteome need to be on a single line, without line breaks. This task can be achieved using the provided Perl script flat_fasta.pl as follows:
$ flat_fasta.pl hhv1_cons_proteome.fa > hhv1_cons_proteome.ffa
Finally, provided that the selected experimental epitopes are in a one-column file with one peptide per line (hvv1_cd8_epi.txt), the command below will select those that are conserved:
$ grep -of hvv1_cd8_epi.txt hhv1_cons_proteome.ffa > con_hhv1_cd8_epi.txt
The file con_hhv1_cd8_epi.txt contains 94 conserved HHV1-specific CD8 T-cell epitopes; that is, about 76% (94 of 124) of the selected experimentally validated epitopes are located in conserved regions and do not bear any residue with a variability, H, above 0.5. A vaccine based on epitopes with such low variability should elicit virus-specific adaptive T-cell immunity that the virus is unlikely to escape.
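The exact-match selection of conserved epitopes can also be written as a short Python sketch. It is offered only as an illustration of the idea behind the flattening and grep steps, under the assumption that masked residues appear as dots in the consensus sequences, so that a peptide overlapping a masked site cannot match. Unlike grep -o, which prints every match, this sketch reports each conserved peptide once. The function names are hypothetical.

```python
def read_fasta_sequences(lines):
    """Return the sequences of a FASTA file, each joined into one string."""
    sequences = []
    current = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
            current = []
        elif line:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


def conserved_epitopes(epitopes, consensus_sequences):
    """Peptides found as exact substrings of any masked consensus sequence."""
    return [pep for pep in epitopes
            if any(pep in seq for seq in consensus_sequences)]
```

Joining the sequence lines before matching plays the same role as flat_fasta.pl: a peptide that spans a line break in the FASTA file would otherwise be missed.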
4 Notes
1. One can easily apply the method described here to protein sequences downloaded directly in FASTA format from relevant databases. However, these sequences can contain chimeric proteins, making results less reliable than those obtained from complete genomes.
2. We recommend using an identity threshold with CD-HIT that
returns a number of clusters similar to the number of CDS
found in the reference genome.
3. Researchers working with pathogens often use a reference genome, and therefore we have described the process of generating variability-free consensus sequences only for clusters containing reference sequences. However, the process can equally be applied to clusters without reference sequences.
4. MUSCLE cannot take more than 1000 sequences. To make MSAs with more than 1000 sequences, we recommend using Clustal Omega.
5. Variability per site in MSAs can also be assigned to the most common residue in the alignment, instead of the residue in the reference sequence, by passing the argument cns to the -s option (“-s cns”). In most cases, the resulting consensus sequences are the same.
Acknowledgments
We wish to thank Marta Gomez-Perosanz and Esther M. Lafuente for critical reading and comments. This work was supported by grant BIO2014-54164R from MINECO to P.A.R.
References
1. Cuevas JM, Geller R, Garijo R, Lopez-Aldeguer J, Sanjuan R (2015) Extremely high mutation rate of HIV-1 in vivo. PLoS Biol 13(9):e1002251. https://doi.org/10.1371/journal.pbio.1002251
2. Qiu X, Duvvuri VR, Bahl J (2019) Computational approaches and challenges to developing universal influenza vaccines. Vaccines (Basel) 7(2). https://doi.org/10.3390/vaccines7020045
3. Vogel M, Bachmann MF (2019) Immunogenicity and Immunodominance in antibody responses. Curr Top Microbiol Immunol. https://doi.org/10.1007/82_2019_160
4. Gomez-Perosanz M, Russo G, Sanchez-Trincado J, Pennisi M, Reche P, Shepherd A, Pappalardo F (2019) Computational Immunogenetics. In: Encyclopedia of bioinformatics and computational biology, vol 2. Elsevier, pp 906–930
5. Sanchez-Trincado JL, Gomez-Perosanz M, Reche PA (2017) Fundamentals and methods for T- and B-cell epitope prediction. J Immunol Res 2017:2680160. https://doi.org/10.1155/2017/2680160
6. Sette A, Rappuoli R (2010) Reverse vaccinology: developing vaccines in the era of genomics. Immunity 33(4):530–541. https://doi.org/10.1016/j.immuni.2010.1009.1017
7. Vivona S, Gardy JL, Ramachandran S, Brinkman FS, Raghava GP, Flower DR, Filippini F (2008) Computer-aided biotechnology: from immuno-informatics to reverse vaccinology. Trends Biotechnol 26(4):190–200. https://doi.org/10.1016/j.tibtech.2007.1012.1006. Epub 2008 Feb 1021
8. Garcia-Boronat M, Diez-Rivero CM, Reinherz EL, Reche PA (2008) PVS: a web server for protein sequence variability analysis tuned to facilitate conserved epitope discovery. Nucleic Acids Res 36(Web Server issue):W35–W41. https://doi.org/10.1093/nar/gkn211
9. Alonso-Padilla J, Lafuente EM, Reche PA (2017) Computer-aided Design of an Epitope-Based Vaccine against Epstein-Barr virus. J Immunol Res 2017:9363750. https://doi.org/10.1155/2017/9363750
10. Damfo SA, Reche P, Gatherer D, Flower DR (2017) In silico design of knowledge-based Plasmodium falciparum epitope ensemble vaccines. J Mol Graph Model 78:195–205. https://doi.org/10.1016/j.jmgm.2017.10.004
11. Molero-Abraham M, Lafuente EM, Flower DR, Reche PA (2013) Selection of conserved epitopes from hepatitis C virus for pan-populational stimulation of T-cell responses. Clin Dev Immunol 2013:601943. https://doi.org/10.1155/2013/601943
12. Murphy D, Reche P, Flower DR (2019) Selection-based design of in silico dengue epitope ensemble vaccines. Chem Biol Drug Des 93(1):21–28. https://doi.org/10.1111/cbdd.13357
13. Reche PA, Keskin DB, Hussey RE, Ancuta P, Gabuzda D, Reinherz EL (2006) Elicitation from virus-naive individuals of cytotoxic T lymphocytes directed against conserved HIV-1 epitopes. Med Immunol 5(1). https://doi.org/10.1186/1476-9433-5-1
14. Shah P, Mistry J, Reche PA, Gatherer D, Flower DR (2018) In silico design of Mycobacterium tuberculosis epitope ensemble vaccines. Mol Immunol 97:56–62. https://doi.org/10.1016/j.molimm.2018.03.007
15. Sheikh QM, Gatherer D, Reche PA, Flower DR (2016) Towards the knowledge-based design of universal influenza epitope ensemble vaccines. Bioinformatics 32(21):3233–3239. https://doi.org/10.1093/bioinformatics/btw399
16. Zhang Q, Wang P, Kim Y, Haste-Andersen P, Beaver J, Bourne PE, Bui HH, Buus S, Frankild S, Greenbaum J, Lund O, Lundegaard C, Nielsen M, Ponomarenko J, Sette A, Zhu Z, Peters B (2008) Immune epitope database analysis resource (IEDB-AR). Nucleic Acids Res 36(Web Server issue):W513–W518. https://doi.org/10.1093/nar/gkn254
17. Reche PA, Zhang H, Glutting JP, Reinherz EL (2005) EPIMHC: a curated database of MHC-binding peptides for customized computational vaccinology. Bioinformatics 21(9):2140–2141. https://doi.org/10.1093/bioinformatics/bti269
18. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659. https://doi.org/10.1093/bioinformatics/btl158
19. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797. https://doi.org/10.1093/nar/gkh340
20. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
21. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal omega. Mol Syst Biol 7(1):539. https://doi.org/10.1038/msb.2011.75
22. Reche PA, Reinherz EL (2003) Sequence variability analysis of human class I and class II MHC molecules: functional and structural correlates of amino acid polymorphisms. J Mol Biol 331(3):623–641. https://doi.org/10.1016/s0022-2836(03)00750-2
23. Reche PA, Glutting JP, Zhang H, Reinherz EL (2004) Enhancement to the RANKPEP resource for the prediction of peptide binding to MHC molecules using profiles. Immunogenetics 56(6):405–419. https://doi.org/10.1007/s00251-004-0709-7
Chapter 14
Immunoinformatic Identification of Potential Epitopes

Priti Desai, Divya Tarwadi, Bhargav Pandya, and Bhrugu Yagnik
Abstract
Immunoinformatics plays a pivotal role in vaccine design and development. Traditional methods, which depend exclusively on immunological experiments, are less effective, relatively expensive, and time-consuming. Recent advances in the field of immunoinformatics, however, have provided innovative tools for the rational design of vaccine candidates. This approach allows the selection of immunodominant regions from the whole-genome sequence of a pathogen. The identified immunodominant regions could be used to develop potential vaccine candidates that can trigger protective immune responses in the host. At present, the epitope-based vaccine is an attractive concept that has been successfully trialed to develop vaccines against a number of pathogens. In this chapter, we outline the methodology and workflow for deploying immunoinformatics tools to identify immunodominant epitopes, using Shigella as a model organism. The immunodominant epitopes derived from S. flexneri 2a using this workflow were validated in an in vivo model, indicating the robustness of the outlined workflow.
Key words Immunoinformatics, Epitope, Vaccine, Shigella
1 Introduction
An epitope is the part of an antigen that is recognized by antigen-specific membrane receptors present on immune cells. Even though multiple sites are present on an antigen, only a few are able to generate robust immune responses, and these sites are known as “immunodominant regions/epitopes” [1, 2]. The identification of epitopes allows us to understand how the immune system responds to a particular pathogen, and this knowledge can be extended further to devise strategies for rapid and accurate diagnosis [3] as well as to design prophylactic and therapeutic vaccines that can induce pathogen-specific immunity [4]. In contrast to traditional vaccination strategies, epitope-based vaccines provide immunological specificity, potency, and safety.
The selection of an immunodominant epitope is an important first step in the design of an epitope-based vaccine. This is achieved with the help of various computational tools [5] that take into account antigen and allele coverage, immunogenicity, and the probability of the epitope being produced during antigen processing [6].
To derive the epitopes, we selected two outer membrane proteins (OMPs): OmpA and OmpC. These proteins are conserved among most Shigella strains as well as in other enteric pathogens [7, 8]. This chapter describes the procedure to predict epitopes using OmpC from S. flexneri 2a as an example.
2 Materials
Suggested databases/servers.
Sr no. | Name | Web address | Use
1 | UniProtKB | https://www.uniprot.org | A database of protein sequence and function information
2 | VaxiJen v2.0 | http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html | Alignment independent prediction of protective antigens
3 | Swiss-model server | http://swissmodel.expasy.org/ | Homology modeling and 3D structure prediction
4 | RAMPAGE | http://www-cryst.bioc.cam.ac.uk/rampage | Evaluation of the backbone conformation and checking non-GLY residues in excluded regions
5 | PROSA | https://prosa.services.came.sbg.ac.at/prosa.php | Protein structure validation
6 | BepiPred | http://www.cbs.dtu.dk/services/BepiPred/ | B-cell epitope prediction
7 | ProPred | http://www.imtech.res.in/raghava/propred/ | Prediction of epitopes binding to MHC class II molecules
8 | IEDB-AR | http://tools.iedb.org/mhci/ | Prediction of epitopes binding to MHC class I molecules
9 | Glide | https://www.schrodinger.com/glide | Receptor-ligand docking
10 | Pepstr | http://www.imtech.res.in/raghava/pepstr/ | Prediction of tertiary structure of peptides
11 | BLASTp | https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins | Searching similar proteins
3 Method
The workflow is summarized in Fig. 1.

1. Sequence retrieval.
Retrieve amino acid sequence of the protein of interest
from UniProtKB protein database.
2. 3D structure prediction and validation
For three-dimensional (3D) structure prediction, submit
the amino acid sequence in the FASTA format for each protein
to Swiss-Model Server (http://swissmodel.expasy.org/) for
homology modeling [9–13]. It performs BLAST using the
input sequence against the PDB (Protein Data Bank) and
suggests the best hits to be used as templates for modeling
(see Note 1).
Perform Ramachandran Plot assessment of the predicted
structures using RAMPAGE web tool (http://www-cryst.bioc.
cam.ac.uk/rampage) to evaluate backbone conformation and
non-GLY residues in excluded regions [14] (see Notes 2 and
3). To do this, upload the PDB file of the model generated by
Swiss-Model Server as input on the RAMPAGE web tool.
Output is provided in two forms: PDF (Portable Document
Format) and PS (PostScript). The output will consist of two
images as shown in Fig. 2a and b.
Validate the overall and local quality of the models using the Z-score, which is determined by the PROSA web tool [15, 16]. The graph shown in Fig. 3a and b depicts the Z-scores of all the experimentally determined structures deposited in the PDB. The two different colors correspond to the different methods used for structure determination, namely, X-ray crystallography and NMR. As can be seen in Fig. 3a and b, both the model and the template have Z-scores within the range, implying that the generated model is well folded (see Note 4).
Fig. 1 Workflow: amino acid sequence retrieval → 3D structure prediction and validation → B-cell epitope prediction → T-cell epitope prediction → population coverage analysis → molecular docking
3. B-cell epitope prediction.
Submit the chosen sequences to the BepiPred server (http://www.cbs.dtu.dk/services/BepiPred/) [17]. Check whether the predicted epitopes lie at surface-exposed positions in the predicted structures and determine their antigenicity using VaxiJen v2.0 [18–20]. Then use the epitope sequences that are surface exposed and have a high VaxiJen score (>0.4) for the prediction of T-cell epitopes (see Notes 5 and 6).
Fig. 2 (a) Ramachandran plot of the OmpC model (φ on the horizontal axis and ψ on the vertical axis, both from −180 to 180). The plot indicates two outliers, namely, Asn315 of chain A and Asp231 of chain C. Number of residues in the favoured region: 986 (94.7%; ~98.0% expected); in the allowed region: 53 (5.1%; ~2.0% expected); in the outlier region: 2 (0.2%). Ninety-eight percent of the residues are expected to be in favored regions, while 2% are expected to be in allowed regions. Ideally there should be zero outliers. (b) Ramachandran plots (General, Glycine, Pre-Pro, and Proline panels) highlighting allowed and favored regions for special cases like glycine, proline, and pre-proline residues
4. T-cell epitope prediction.
MHC class II epitopes.
Use ProPred (http://www.imtech.res.in/raghava/propred/) for prediction of epitopes binding to MHC class II molecules [21]. It can predict epitopes binding to 51 HLA-DR alleles, and the alleles used for prediction can be chosen according to the target population. The threshold value should be no more than 3% for the first screening. A 3% threshold means that the server predicts the peptides that score among the top 3% of natural peptides. The lower the threshold, the lower the false-positive rate and the higher the false-negative rate (see Note 7).
Fig. 3 (a) Z-score of OmpC model, (b) Z-score of OmpC template. The Z-score of the input protein is shown by
the black dot
MHC class I epitopes.
Use IEDB-AR (Immune Epitope Database Analysis Resource) to predict epitopes binding to MHC class I molecules. It provides various prediction methods. Some of them, such as the artificial neural network (ANN), the stabilized matrix method (SMM), and scoring matrices derived from combinatorial peptide libraries (CombLib Sidney 2008), are explained here [22–25]. If you are not sure which method to use, select the “IEDB recommended” option when choosing the prediction method; this is the default selection. Currently, IEDB uses the Consensus method consisting of ANN, SMM, and CombLib if available; otherwise, it uses the NetMHCpan method. See Subheading 4 for an explanation of the methods (see Notes 8–11).
Screen the selected MHC class I-restricted epitopes for sequences that overlap with the identified B-cell epitopes.
5. Population coverage analysis.
Select the overlapping epitopes and choose the high-
scoring epitopes binding to a maximum number of respective
classes of alleles for population coverage analysis. You may also
select the classes of alleles based on the target population and
abundance of the respective alleles in the population of interest.
Use IEDB-AR (http://tools.iedb.org/population/) for population coverage analysis [26]. Provide the peptide sequence and the allele to which it binds as input. You can use multiple peptides binding to different alleles to increase the population coverage. Choose the epitopes showing high population coverage and check them for conservancy against other pathogens and for any similarity with human proteins.
6. Molecular docking.
In this example, we used HLA-A24 as the MHC class I molecule and HLA-DR4 as the MHC class II molecule.
(a) Preparing MHC structures for docking:
• Retrieve the tertiary structures of HLA-A24 (PDB ID: 3VXN) and HLA-DR4 (PDB ID: 2SEB) from the PDB database.
• Prepare the MHC structures for docking using the Protein Preparation Wizard of Glide, Schrödinger (https://www.schrodinger.com/glide; Glide, version 5.5, Schrödinger, LLC, New York, NY, 2009) [27–29].
• Add hydrogen atoms; optimize hydroxyl group orientation and His protonation state.
• Perform constrained refinement and energy minimization with the impref utility using the OPLS 2005 force field.
• Subject the prepared structures of both MHC molecules to grid generation by selecting the α1 and α2 domains of HLA-A24 and the α1 and β1 domains of HLA-DR4, with binding box dimensions of 20 Å × 20 Å × 20 Å.
(b) Preparing selected epitopes for docking:
• Predict the tertiary structure of the selected peptides using Pepstr (http://www.imtech.res.in/raghava/pepstr/) [30, 31] (see Note 12).
• Use the OPLS 2005 force field for cleaning the structure, minimizing the conformation, generating a high-quality 3D structure, and generating multiple protonation/tautomerization states of each peptide.
(c) Perform docking in flexible docking mode using Glide, Schrödinger (see Note 13). Results of the docking studies should be experimentally validated.
To avoid the possibility of the selected epitopes acting as self-antigens, screen the amino acid sequences of the epitopes for similarity with human proteins using BLASTp [32].
This chapter describes the procedure to predict a set of epitopes using a defined workflow. We have used this workflow to predict epitopes against Shigella species. We have also validated these epitopes using in vivo and in vitro methods and found that they elicit immunogenic and protective immune responses in BALB/c mice challenged with S. flexneri. The proposed method for epitope prediction can be applied to other infectious pathogens.
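The screening of MHC class I-restricted epitopes for overlap with B-cell epitopes (step 4) can be illustrated with a small Python sketch. It is not part of the workflow's tools; it simply assumes that each predicted peptide can be located as an exact substring of the protein sequence and that two epitopes overlap when their residue ranges share at least one position. The function names and the toy sequence in the usage note are hypothetical.

```python
def find_span(protein, peptide):
    """Return the (start, end) range of the first occurrence of peptide in
    protein, using 0-based, end-exclusive indices, or None if it is absent."""
    start = protein.find(peptide)
    if start < 0:
        return None
    return (start, start + len(peptide))


def overlapping_epitopes(protein, t_cell_peptides, b_cell_peptides):
    """T-cell peptides whose position overlaps any B-cell epitope."""
    b_spans = []
    for peptide in b_cell_peptides:
        span = find_span(protein, peptide)
        if span is not None:
            b_spans.append(span)

    hits = []
    for peptide in t_cell_peptides:
        span = find_span(protein, peptide)
        if span is None:
            continue  # peptide not found in this protein
        # Two half-open ranges overlap when each starts before the other ends.
        if any(span[0] < b_end and b_start < span[1] for b_start, b_end in b_spans):
            hits.append(peptide)
    return hits
```

With the toy sequence "ACDEFGHIKLMNPQRSTVWY" and the B-cell epitope "FGHIKLMN", the peptide "DEFGHIKLM" is retained because it shares residues with the B-cell epitope, whereas "QRSTVWY" is not.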
4 Notes
1. One of the limiting steps in predicting B-cell epitopes is the unavailability of a complete structure. Without a model structure, it is difficult to predict whether the B-cell epitopes are surface exposed or not. One should also consider the oligomeric form of the protein when checking for surface-exposed epitopes.
2. Ramachandran plot shows the range of ϕ (phi-rotation, around
N-Cα bond) and Ψ (psi-rotation, around Cα-C bond) angles
which are permissible in the proteins. The regions that are not
allowed are based on steric hindrance between different side
chains and also between side chains and peptide backbone.
3. Glycine has a very short side chain (only an H atom); therefore, it can adopt backbone dihedral angles that would not be possible for other amino acids. Proline, on the other hand, has a cyclic side chain that restricts the dihedral angles it can adopt.
4. Z-score is used to check whether the input structure (predicted
model in this case) is within the range of scores typically found
in native proteins of similar size. It is a measure of the deviation
of the total energy of the structure compared to energy distri-
bution of random conformations. If the Z-score of the input
protein is within the range of the Z-scores of the proteins of
similar size, then it implies the input protein is well folded.
5. BepiPred uses a Random Forest Algorithm trained on epitope
and non-epitope amino acids determined from crystal struc-
tures to predict epitopes from the given protein sequence. The
default threshold is 0.5, and as the threshold increases, the
specificity increases and the sensitivity decreases.
6. For the antigenicity prediction by VaxiJen, a score of 0.4 is set
as a threshold by the server and a score below that deems the
epitope nonantigenic.
7. MHC class II proteins have pockets or binding sites which are
capable of binding to nine amino acid long peptides. ProPred
predicts such 9mers using quantitative matrices to calculate the
score for each nonameric peptide. These matrices provide
scores for each amino acid present at position 1–9, and scores
of individual amino acids are summed to give the score for the
peptide.
8. ANN (Artificial Neural Network) uses neural networks trained with a combination of input encodings, such as sparse encoding and Blosum encoding, for quantitative prediction of binding. The final prediction value is the equally weighted average of the sparse-encoded and Blosum-encoded neural network predictions. One of the advantages of ANN is that it relies on mutual information, i.e., how neighboring amino acids affect each other’s binding affinity. The information about the amino acid present at position i gives information about the amino acid present at position i + 1.
9. SMM (Stabilized Matrix Method) uses both individual score of
amino acids at each position and pair coefficient to provide the
final prediction of binding affinity. Matrix entries have been
derived by minimizing the distance between the predicted
scores and measured affinities for a set of training peptides.
Only those coefficients are used for which sufficient training
data was available. A regularization parameter is included in the
minimization function used to derive matrix entries and pair
coefficients to avoid over-fitting.
10. CombLib (Combinatorial Peptide Libraries) method uses data derived from binding assays of different HLA alleles with position scanning combinatorial peptide libraries. These libraries contain mixtures of peptides, and in each mixture, one amino acid is fixed at a certain position while the other positions carry different amino acids on different peptides. In total, 20 × 9 = 180 mixtures are required to obtain data on the contribution of all 20 amino acids at each position of the 9mer peptide. This method is available for 15 alleles.
11. NetMHCpan method uses neural networks trained on a very
large dataset of 79,137 unique peptide-MHC class I interac-
tions. The input data used to train neural networks is low in
redundancy and contains a large fraction of nonbinding data
for each allele. This method can be used to make predictions
for chimpanzee, gorilla, rhesus macaque, and mouse alleles
apart from human alleles.
12. This is to check whether the peptide adopts the same structure on its own as it does when it is part of the fully folded protein.
13. The dimensions of the binding box are kept as small as possible
because the binding sites are already known. Keeping the bind-
ing box small enough to cover only the intended binding site
saves computation power compared to trying to dock the
peptide against the whole MHC molecule.
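The additive matrix scoring described in Note 7 can be sketched in Python as follows. This is a minimal illustration of summing position-specific scores over a nonamer, not ProPred's actual implementation or matrices. The matrix format (a list of nine dictionaries mapping amino acids to scores), the default score of 0.0 for residues missing from the matrix, and the absolute cutoff min_score are all assumptions made for this example; ProPred itself expresses its threshold as a percentage of best-scoring natural peptides.

```python
def score_nonamer(peptide, matrix):
    """Sum the position-specific scores of a 9-residue peptide.

    matrix: list of 9 dicts, one per position, mapping amino acid -> score.
    ASSUMPTION: residues missing from a position's dict score 0.0.
    """
    if len(peptide) != 9 or len(matrix) != 9:
        raise ValueError("peptide and matrix must both cover 9 positions")
    return sum(matrix[pos].get(aa, 0.0) for pos, aa in enumerate(peptide))


def scan_protein(protein, matrix, min_score):
    """Score every overlapping nonamer and keep those reaching min_score.

    Returns (1-based start position, peptide, score) tuples.
    """
    hits = []
    for start in range(len(protein) - 8):
        peptide = protein[start:start + 9]
        score = score_nonamer(peptide, matrix)
        if score >= min_score:
            hits.append((start + 1, peptide, score))
    return hits
```

Sliding the nine-residue window one position at a time mirrors the way every possible nonamer of the input sequence is evaluated against a given allele's matrix.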
References
1. Brusic V, Petrovsky N (2005) Immunoinfor- historical perspective. Electrophoresis 30:
matics and its relevance to understanding S162–S173
human immune disease. Expert Rev Clin 12. Benkert P, Biasini M, Schwede T (2011)
Immunol 1(1):145–157 Toward the estimation of the absolute quality
2. Gershoni JM, Roitburd-Berman A, Siman-Tov of individual protein structure models. Bioin-
DD, Tarnovitski Freund N, Weiss Y (2007) formatics 27:343–350
Epitope mapping: the first step in developing 13. Bertoni M, Kiefer F, Biasini M, Bordoli L,
epitope-based vaccines. BioDrugs 21 Schwede T (2017) Modeling protein quater-
(3):145–156 nary structure of homo- and hetero-oligomers
3. Goldsby RA, Kindt TJ, Kuby J, Osborne BA beyond binary interactions by homology. Sci
(2002) Immunology, 5th edn. W. H. Freeman, Rep 7(1):10480
New York 14. Lovell SC, Davis IW, Arendall IIIWB, de Bak-
4. Khan AM, Miotto O, Heiny AT, Salmon J, ker PIW, Word JM, Prisant MG, Richardson JS,
Srinivasan KN, Nascimento EJ, Marques ET Richardson DC (2002) Structure validation by
Jr, Brusic V, Tan TW, August JT (2006) A Calpha geometry: phi,psi and Cbeta deviation.
systematic bioinformatics approach for selec- Proteins: Structure, Function Genetics
tion of epitope-based vaccine targets. Cell 50:437–450
Immunol 244(2):141–147 15. Wiederstein M, Sippl MJ (2007) ProSA-web:
5. Bremel RD, Homan EJ (2010) An integrated interactive web service for the recognition of
approach to epitope analysis I: dimensional errors in three-dimensional structures of pro-
reduction, visualization and prediction of teins. Nucleic Acids Res 35:W407–W410
MHC binding using amino acid principal com- 16. Sippl MJ (1993) Recognition of errors in
ponents and regression approaches. Immu- three-dimensional structures of proteins. Pro-
nome Res 6:7 teins 17:355–362
6. Schubert B, Lund O, Nielsen M (2013) Evalu- 17. Jespersen MC, Peters B, Nielsen M, Marcatili P
ation of peptide selection approaches for (2017) BepiPred-2.0: improving sequence-
epitope-based vaccine design. Tissue Antigens based B-cell epitope prediction using confor-
82(4):243–251. https://doi.org/10.1111/ mational epitopes. Nucleic Acids Res 45(W1):
tan.12199 W24–W29. https://doi.org/10.1093/nar/
7. Mukhopadhaya A, Mahalanabis D, Chakrabarti gkx352
MK (2006) Role of Shigella flexneri 2a 34 kDa 18. Doytchinova IA, Flower DR (2007) VaxiJen: a
outer membrane protein in induction of pro- server for prediction of protective antigens.
tective immune response. Vaccine tumour antigens and subunit vaccines BMC
24:6028–6036 Bioinformatics 8:4
8. Jarza˛b A, Witkowska D, Ziomek E, 19. Doytchinova IA, Flower DR (2007) Identify-
Da˛browska A, Szewczuk Z, Gamian A (2013) ing candidate subunit vaccines using an
Shigella flexneri 3a outer membrane protein C alignment-independent method based on prin-
epitope is recognized by human umbilical cord cipal amino acid properties. Vaccine
sera and associated with protective activity. 25:856–866
PLoS One 8(8):e70539 20. Doytchinova IA, Flower DR (2008) Bioinfor-
9. Waterhouse A, Bertoni M, Bienert S, Studer G, matic approach for identifying parasite and fun-
Tauriello G, Gumienny R, Heer FT, de Beer gal candidate subunit vaccines. Open Vaccines J
TAP, Rempfer C, Bordoli L, Lepore R, 1:22–26
Schwede T (2018) SWISS-MODEL: homol- 21. Singh H, Raghava GPS (2001) ProPred: pre-
ogy modelling of protein structures and com- diction of HLA-DR binding sites. Bioinfor-
plexes. Nucleic Acids Res 46(W1): matics 17(12):1236–1237
W296–W303
22. Nielsen M, Lundegaard C, Worning P, Laue-
10. Bienert S, Waterhouse A, de Beer TAP, moller SL, Lamberth K, Buus S, Brunak S,
Tauriello G, Studer G, Bordoli L, Schwede T Lund O (2003) Reliable prediction of T-cell
(2017) The SWISS-MODEL repository - new epitopes using neural networks with novel
features and functionality. Nucleic Acids Res sequence representations. Protein Sci
45:D313–D319 12:1007–1017
11. Guex N, Peitsch MC, Schwede T (2009) Auto- 23. Peters B, Sette A (2005) Generating quantita-
mated comparative protein structure modeling tive models describing the sequence specificity
with SWISS-MODEL and Swiss-PdbViewer: a
of biological processes with the stabilized Glide: a new approach for rapid, accurate dock-
matrix method. BMC Bioinformatics 6:132 ing and scoring. 2. Enrichment factors in data-
24. Sidney J, Assarsson E, Moore C, Ngo S, base screening. J Med Chem 47:1750–1759
Pinilla C, Sette A, Peters B (2008) Quantitative 29. Friesner RA, Banks JL, Murphy RB, Halgren
peptide binding motifs for 19 human and TA, Klicic JJ, Mainz DT, Repasky MP, Knoll
mouse MHC class I molecules derived using EH, Shaw DE, Shelley M, Perry JK, Francis P,
positional scanning combinatorial peptide Shenkin PS (2004) Glide: a new approach for
libraries. Immunome Res 4(2) rapid, accurate docking and scoring. 1. Method
25. Hoof I, Peters B, Sidney J, Pedersen LE, and assessment of docking accuracy. J Med
Sette A, Lund O, Buus S, Nielsen M (2009) Chem 47:1739–1749
NetMHCpan, a method for MHC class I bind- 30. Singh S, Singh H, Tuknait A, Chaudhary K,
ing prediction beyond humans. Immunogenet- Singh B, Kumaran S, Raghava GPS (2015)
ics 61(1):1–13 PEPstrMOD: structure prediction of peptides
26. Bui HH, Sidney J, Dinh K, Southwood S, containing natural, non-natural and modified
Newman MJ, Sette A (2006) Predicting popu- residues. Biol Direct 10:73
lation coverage of T-cell epitope-based diag- 31. Kaur H, Garg A, Raghava GPS (2007) PEPstr:
nostics and vaccines. BMC Bioinformatics a de novo method for tertiary structure predic-
7:153 tion of small bioactive peptides. Protein Pept
27. Friesner RA, Murphy RB, Repasky MP, Frye Lett 14:626–630
LL, Greenwood JR, Halgren TA, Sanschagrin 32. Madden TL, Tatusov RL, Zhang J (1996)
PC, Mainz DT (2006) Extra precision glide: [9] Applications of network BLAST server.
docking and scoring incorporating a model of Computer Methods for Macromolecular
hydrophobic enclosure for protein-ligand com- Sequence Analysis:131–141. https://doi.org/
plexes. J Med Chem 49:6177–6196 10.1016/s0076-6879(96)66011-x
28. Halgren TA, Murphy RB, Friesner RA, Beard
HS, Frye LL, Pollard WT, Banks JL (2004)
Chapter 15
Immunoinformatic Approaches for Vaccine Designing Against Viral Infections
Richa Anand and Richa Raghuwanshi
Abstract
Vaccines have become a cost-effective means of preventing or treating viral infections. Designing a vaccine candidate by conventional methods is a laborious process that demands considerable time and money, and many approaches have been proposed to reduce both. In this regard, the immunoinformatic approach is expected to transform vaccine development. This chapter provides an overview of immunoinformatics and its application to in silico vaccine design and development strategies in humans against viral diseases with the help of available databases and tools.
Key words Epitope, Immunomics, Immunoinformatics, Viral infections, Vaccine design
Abbreviations
ANN Artificial Neural Network

CTL Cytotoxic T Lymphocytes
GRAVY Grand Average Hydropathicity
HLA Human Leukocyte Antigen
IC Inhibitory Concentration
IEDB Immune Epitope Database
MHC Major Histocompatibility Complex
NCBI National Center for Biotechnology Information
PIR Protein Information Resource
PSSM Position-Specific Scoring Matrices
QM Quantitative Matrices
SMM Stabilized Matrix Method
SVM Support Vector Machine
TAP Transporter Associated with Antigen Processing
ViPR Virus Pathogen Database and Analysis Resource

1 Introduction
During the last three decades, efforts to control viral diseases through the development of a large number of antiretroviral drugs, public awareness, and other prevention programs across the globe have led to a significant reduction in viral cases. Yet constantly evolving drug-resistant mutations pose a continuous challenge to therapy. Development of vaccines or therapeutic measures often requires prior understanding of the immunological aspects of the natural course of an infection. Conventional vaccines prepared from attenuated or inactivated whole pathogens have a number of limitations, as genetic variation in these pathogens across the globe may reduce the efficacy of these vaccines in different parts of the world. Many vaccine trials are currently being conducted worldwide, but they fail to reach phase III. These facts clearly indicate a large gap between the early-phase clinical trials (phases I and II) and the efficacy trial (phase III), and further research is needed on the minimal components that determine the protective nature of vaccine candidates against viruses [1]. Genetic variation in envelope proteins is one of the main hurdles in designing a vaccine [2]. Experimental identification of conserved glycoprotein regions that maintain their structure and function is a tedious process. In addition, the pathogens used for vaccination may revert to their pathogenic form and cause infection [3]. There is therefore an urgent need for effective vaccines that offer a stable solution to control and eradicate the disease. Significant developments have been seen in the last two decades through the fusion of computational technologies with recombinant DNA technology, leading to the fast growth of biological and genomic information in databases [4–6].
2 Immunoinformatics
The immune system includes innate and adaptive components. According to the traditional dogma of immunology (shown in Fig. 1), vertebrates have both innate and adaptive immune systems, whereas invertebrates possess only an innate immune system [7].
The immune response has been classified as cellular and humoral, and the type of response that is induced depends on the disease. An ideal vaccine that initiates a humoral or cell-mediated immune response is essential to eliminate the chance of re-infection. In this regard, the immunoinformatic approach is expected to transform vaccine development. Immunoinformatics lies at the intersection of immunomics and computational approaches. The immunoinformatics approaches available for vaccine design against viral diseases are shown in Fig. 2.
Fig. 1 Traditional dogma of immunology. The immune system comprises the innate (inborn) immune system, the first line of defence, with anatomic (skin, mucous membranes), physiologic (temperature, low pH), phagocytic (blood monocytes, neutrophils, tissue macrophages), and inflammatory (serum proteins) barriers, and the adaptive (acquired) immune system, with a humoral response (B cells) and a cellular response (T cells). T cells recognize short linear peptide epitopes displayed by MHC class I molecules (cytotoxic T cells, CD8) or MHC class II molecules (helper T cells, CD4)
Fig. 2 Immunoinformatics approaches for vaccine design against viral infections: reverse vaccinology, epitope-based design (T-cell and B-cell epitope prediction), peptide-based vaccine design, alignment-free approach for vaccine design, molecular dynamics simulation and docking, and microarray technique for vaccine design

Epitope-based vaccine design is more promising because it relies not only on understanding the mechanisms of immunodominance but also on the simultaneous analysis of multiple genomes or proteomes to select the most appropriate epitope. Prediction of epitope-based peptide vaccines is discussed in detail in this chapter.
3 Epitope-Based Peptide Vaccine Prediction
The ability to identify the epitopes involved in an immune response has important implications for disease diagnosis. Thus, epitopes for T cells and B cells need to be identified and mapped. Applying multiple tools for epitope prediction may enhance the accuracy of prediction [8]. Depending on the requirement, a researcher may be interested in in silico prediction of T-cell epitope-based peptides, B-cell epitope-based peptides, or both. The sequential steps for epitope-based peptide vaccine prediction are described in detail in two sections, for T cells and B cells, respectively.
3.1 Methods for T-Cell Epitope-Based Vaccine Prediction
The methodology for T-cell epitope-based vaccine design against any viral disease is given below. Use either default threshold parameters or user-defined parameters for prediction of epitopes with the different computational tools. Always select computational tools carefully and according to the needs of the query.
3.1.1 Retrieval of the Target Sequence and Removal of Nonstructural Proteins
• Retrieve the FASTA-formatted amino acid sequences of the virus from ViPR, an integrated resource covering several virus families and their respective species. This database is integrated with the UniProtKB and NCBI databases and thus allows sequences to be downloaded in FASTA format from both UniProtKB and GenBank.
• Use the retrieved FASTA-formatted sequences as input for prediction of the most antigenic protein.
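Before submitting retrieved sequences to any server, it can help to parse and check the FASTA file locally. The following is a minimal sketch in plain Python, assuming standard FASTA text with ">" header lines; the record names and sequences are hypothetical examples, not ViPR data.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def parse_fasta(text):
    """Parse FASTA text into a dict of {identifier: sequence}."""
    records = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            # Use the first word of the header as the identifier.
            current = fields[0] if fields else f"record_{len(records) + 1}"
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def invalid_characters(sequence):
    """Return the characters that are not standard amino acid codes."""
    return sorted(set(sequence) - VALID_RESIDUES)


# Hypothetical input; the second record contains a stray character.
example = """>capsid hypothetical structural protein
MKTAYIAKQR
QISFVKSHFS
>envelope hypothetical structural protein
MDSKGS*QKG
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq), invalid_characters(seq))
```

A non-empty list of invalid characters signals an input file that should be cleaned before submission (see Note 4).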
3.1.2 Antigenicity Prediction
Antigenicity is the ability of an antigen to bind to, or interact with, B-cell or T-cell receptors. It is therefore important to identify the most antigenic protein among the structural proteins of the virus.
• Submit the FASTA-formatted amino acid sequences of all structural proteins to the VaxiJen v2.0 server (an alignment-free approach for antigenicity prediction with 70% to 89% accuracy), which predicts antigenicity from the physicochemical properties of amino acids (see Note 1).
• Sort all antigenic proteins according to their antigenicity score.
• Select the protein with the highest antigenicity score for further evaluation.
• Other tools that can be used for prediction of antigenicity are SVMTriP, Protegen, and EpiToolKit (see Note 2).
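The sorting and selection steps above can be sketched as follows once the scores have been copied from the server output. The scores and protein names are hypothetical, and the threshold of 0.4 is an assumption made for this example; use the threshold chosen on the server.

```python
def rank_by_antigenicity(scores, threshold=0.4):
    """Keep proteins scoring at or above the threshold, best first.

    `scores` maps a protein name to its antigenicity score.
    The default threshold is an assumption for illustration only.
    """
    antigenic = {name: s for name, s in scores.items() if s >= threshold}
    return sorted(antigenic.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for three structural proteins.
scores = {"capsid": 0.47, "envelope": 0.62, "membrane": 0.31}

ranked = rank_by_antigenicity(scores)
best_protein, best_score = ranked[0]
print(ranked)
print(best_protein)
```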
3.1.3 Homology Analysis
• Perform a BLAST search to check for homology between the predicted protein with the highest antigenicity and human proteins (see Note 3).
3.1.4 T-Cell Epitope Prediction
After evaluation of the predicted most antigenic protein, T-cell epitopes can be predicted in silico. T-cell epitopes presented by MHC class-I molecules are typically peptides 8–11 amino acids long, whereas those presented by MHC class-II molecules are typically 13–17 amino acids long. The rationale of a T-cell epitope-based vaccine depends largely on the consistency of peptide prediction. Therefore, apply multiple tools for prediction of epitopes, and select the top-ranked epitopes predicted by all tools/servers for further evaluation. Some frequently used tools are detailed below.
Prediction of MHC Class-I Epitope
• Submit the sequence of the most antigenic protein, either as plain text or in FASTA/PIR/EMBL format, to CTLPred (QM-, ANN-, and SVM-based algorithms) to predict CTL peptides (see Note 4).
• Use the QM, ANN, SVM, Consensus, or Combined approach to predict CTL epitopes (see Note 5).
• Select the top-ranked peptides for processing prediction (see Note 6).
• Other tools that predict CTL-restricted peptides at the supertype and allele-specific level are:
– BIMAS-HLA, an ANN-based algorithm, predicts the peptides that can be recognized by MHC supertypes A1, A3, A24, B7, and B40 and alleles A*0201, B*3501, B*3701, B*5101, and B*5801.
– NetCTL 1.2, an ANN and weight matrix-based algorithm, predicts CTL epitopes restricted to 12 MHC class-I supertypes (A1, A2, A3, A24, A26, B7, B8, B27, B39, B44, B58, and B62).
– ProPred-I, a PSSM-based server, predicts peptides against 47 MHC class-I alleles.
– IEDB, consensus-based, predicts MHC class-I binding peptides.
– nHLAPred, an MHC class-I binding peptide prediction server, consists of two parts: (a) ComPred, based on a hybrid approach of QM and ANN, predicts peptides for 67 MHC class-I alleles, and (b) ANNPred, ANN based, predicts peptides for 30 HLA class-I alleles.
– NetTepi 1.0, based on an approximation algorithm, predicts peptide-MHC binding affinity, peptide-MHC stability, and T-cell propensity.
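The class-I servers listed above score overlapping peptides of fixed length drawn from the submitted protein. As a minimal sketch of that candidate set, assuming the 8–11 residue range given for MHC class-I epitopes, the windows can be enumerated as follows; the function name and the toy sequence are illustrative only.

```python
def candidate_peptides(sequence, min_len=8, max_len=11):
    """List (start_position, peptide) for every window of each length.

    Start positions are 1-based, as is usual in server output.
    """
    peptides = []
    for length in range(min_len, max_len + 1):
        for start in range(len(sequence) - length + 1):
            peptides.append((start + 1, sequence[start:start + length]))
    return peptides


toy_sequence = "ACDEFGHIKLMN"  # hypothetical 12-residue sequence
windows = candidate_peptides(toy_sequence)
print(len(windows))
print(windows[0])
```

A protein of length n yields n − k + 1 windows of length k, so the number of candidates grows linearly with protein length for each peptide length considered.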
MHC Class II-Specific Peptide Prediction
• MHC class-II peptide prediction servers include the ProPred server, the MHC class-II prediction tool of IEDB, EpiDOCK, RANKPEP, HLA-DR4Pred, and MHC2Pred, which predict peptides for one or more MHC class-II alleles of interest.
• ProPred server (QM based) predicts MHC class-II binding peptides against human HLA alleles at the HLA-DR locus.
• RANKPEP (PSSM-based algorithm) predicts MHC class-II binding peptides against human HLA alleles at the HLA-DP, HLA-DQ, and HLA-DR loci.
• HLA-DR4Pred (based on ANN and SVM) predicts binding peptides only for HLA-DRB1*0401 (an MHC class-II allele), with an accuracy of ~86% and ~78% using the SVM and ANN algorithms, respectively.
• MHC2Pred (based on motif, QM, and ANN) predicts peptides against a few HLA-DQ and HLA-DR alleles with an accuracy of ~78%.
• EpiDOCK (molecular docking-based approach) predicts peptides against a few alleles at the human HLA-DP, HLA-DQ, and HLA-DR loci.
• IEDB predicts MHC class-II binding peptides for human and mouse. It can predict peptides against a wide variety of alleles at the human HLA-DP, HLA-DQ, and HLA-DR loci.
• Therefore, submit the antigenic protein sequence to the different MHC class-II peptide prediction servers carefully and according to the needs of the user.
• Select the top-scored peptides obtained through the different servers/tools for immunogenicity and antigenicity analysis.
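Selecting the epitopes that are top-ranked by all tools can be sketched as a set intersection over the ranked outputs. The tool names and peptides below are hypothetical placeholders, and ordering the shared peptides by mean rank is a choice made for this example rather than a rule stated in the protocol.

```python
def consensus_epitopes(predictions, top_n=10):
    """Return peptides found in the top_n list of every tool.

    `predictions` maps a tool name to its peptides, best first.
    Shared peptides are ordered by their mean rank across tools.
    """
    if not predictions:
        return []
    top_sets = [set(ranked[:top_n]) for ranked in predictions.values()]
    shared = set.intersection(*top_sets)

    def mean_rank(peptide):
        ranks = [ranked.index(peptide) for ranked in predictions.values()]
        return sum(ranks) / len(ranks)

    return sorted(shared, key=lambda p: (mean_rank(p), p))


# Hypothetical ranked outputs from three prediction servers.
predictions = {
    "tool_a": ["KLVFFAEDV", "YLQPRTFLL", "GILGFVFTL", "NLVPMVATV"],
    "tool_b": ["YLQPRTFLL", "KLVFFAEDV", "SLYNTVATL", "GILGFVFTL"],
    "tool_c": ["YLQPRTFLL", "GILGFVFTL", "KLVFFAEDV", "ELAGIGILT"],
}

shared = consensus_epitopes(predictions, top_n=3)
print(shared)
```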
3.1.5 MHC Class-I Processing Prediction
• Predict the probability that each of the selected top-scored peptides is a T-cell epitope using the MHC-I processing prediction tool in IEDB.
• Assess the peptides based on proteasomal cleavage score, TAP transport efficiency, and MHC class-I binding affinity.
• The IC50 value for MHC-binding peptides is calculated with the SMM algorithm.
• A lower IC50 value indicates higher affinity.
• Select the best epitope for immunogenicity and antigenicity prediction.
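Because a lower IC50 means tighter binding, peptides are ranked in ascending order of IC50. The sketch below also bins the values using cutoffs of 50, 500, and 5000 nM; these cutoffs are commonly used guideline values and are an assumption here, not values given in this protocol. The peptides and IC50 values are hypothetical.

```python
def affinity_class(ic50_nm):
    """Bin an IC50 value (nM) using assumed guideline cutoffs."""
    if ic50_nm < 50:
        return "high"
    if ic50_nm < 500:
        return "intermediate"
    if ic50_nm < 5000:
        return "low"
    return "non-binder"


def rank_by_affinity(ic50_by_peptide):
    """Sort peptides from strongest (lowest IC50) to weakest binder."""
    return sorted(ic50_by_peptide.items(), key=lambda item: item[1])


# Hypothetical predicted IC50 values in nM.
ic50_by_peptide = {"KLVFFAEDV": 320.0, "YLQPRTFLL": 12.5, "SLYNTVATL": 7800.0}

for peptide, ic50 in rank_by_affinity(ic50_by_peptide):
    print(peptide, ic50, affinity_class(ic50))
```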
3.1.6 Immunogenicity and Antigenicity Analysis
• Submit the predicted epitopes to the immunogenicity prediction tool of IEDB and to the VaxiJen v2.0 server for immunogenicity and antigenicity prediction, respectively.
• Since more immunogenic and antigenic peptides are preferable to less immunogenic and antigenic ones, select the peptides with high immunogenicity and antigenicity for further analysis.
3.1.7 Epitope Conservation Analysis
Conservancy is the fraction of protein sequences that contain the epitope at or above a specified level of identity.
• Submit the sequences of the predicted epitopes, either as plain text or in FASTA format, to the conservancy tool of IEDB for prediction of epitope conservation.
• The server calculates a Bayesian conservation score (ranging from 1 to 9) based on the multiple sequence alignment with the three-dimensional structural coordinates of the protein.
• Higher and lower scores correspond to maximum conservation and extreme mutational variability, respectively.
• Thus, select epitopes with higher conservation scores for population coverage analysis.
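The definition of conservancy given above can be illustrated with a short calculation. This is a minimal sketch that assumes ungapped comparison of the epitope against every window of each sequence; it is a simplification for illustration and is neither the server's implementation nor the Bayesian score mentioned above. All sequences are hypothetical.

```python
def best_identity(epitope, sequence):
    """Highest fraction of matching residues over all ungapped windows."""
    n = len(epitope)
    best = 0.0
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        matches = sum(a == b for a, b in zip(epitope, window))
        best = max(best, matches / n)
    return best


def conservancy(epitope, sequences, identity_threshold=1.0):
    """Fraction of sequences containing the epitope at >= threshold identity."""
    if not sequences:
        return 0.0
    hits = sum(best_identity(epitope, s) >= identity_threshold
               for s in sequences)
    return hits / len(sequences)


epitope = "KLVFFAEDV"
# Hypothetical variants of the same protein from four isolates.
isolates = [
    "MAAKLVFFAEDVGG",   # exact match
    "MAAKLVFYAEDVGG",   # one substitution (8 of 9 residues identical)
    "MGGSSTTPPQQRRLL",  # unrelated
    "KLVFFAEDV",        # exact match
]

print(conservancy(epitope, isolates))                          # 100% identity
print(conservancy(epitope, isolates, identity_threshold=0.8))  # >= 80% identity
```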
3.1.8 Population Coverage Analysis
MHC molecules are extremely polymorphic; it is therefore important to select epitopes that can bind to the maximum number of MHC molecules across the entire target population, which is referred to as population coverage.
• Submit the sequences of the shortlisted epitopes and their respective HLA-binding alleles as input to the IEDB analysis resource for population coverage analysis.
• User-defined populations can be added against MHC HLA-binding alleles.
• Select the epitopes with higher coverage values for further allergenicity and toxicity prediction.
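The idea behind population coverage can be sketched with a simple calculation from allele frequencies. The sketch assumes diploid individuals, Hardy-Weinberg proportions within a locus, and independence between loci; these are simplifying assumptions for illustration, the allele frequencies are hypothetical, and the server may use a different calculation.

```python
def locus_coverage(allele_frequencies, covered_alleles):
    """Fraction of individuals carrying at least one covered allele at a locus.

    With summed covered-allele frequency f, the chance that both
    chromosomes carry uncovered alleles is (1 - f) ** 2.
    """
    f = sum(freq for allele, freq in allele_frequencies.items()
            if allele in covered_alleles)
    return 1.0 - (1.0 - f) ** 2


def population_coverage(frequencies_by_locus, covered_alleles):
    """Combine loci, assuming they are independent of one another."""
    uncovered = 1.0
    for freqs in frequencies_by_locus.values():
        uncovered *= 1.0 - locus_coverage(freqs, covered_alleles)
    return 1.0 - uncovered


# Hypothetical allele frequencies in one population.
frequencies_by_locus = {
    "HLA-A": {"A*02:01": 0.25, "A*01:01": 0.15},
    "HLA-B": {"B*07:02": 0.10},
}
# Alleles predicted to bind the shortlisted epitopes.
covered = {"A*02:01", "B*07:02"}

coverage = population_coverage(frequencies_by_locus, covered)
print(round(coverage, 4))
```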
3.1.9 Allergenicity and Toxicity Prediction
Since most vaccines shift the immune response toward an allergic reaction by inducing immunoglobulin E and type 2 T-helper cells, it is essential, after analyzing population coverage, to assess the allergenicity and toxicity of the predicted epitopes.
• Submit the epitope sequences to AllerTOP v. 2.0 and AllergenFP 1.0 for allergenicity prediction.
• AllerTOP v. 2.0 (based on kNN, k = 1) predicts allergenicity with ~89% accuracy on the basis of amino acid properties such as size, hydrophobicity, helix-forming propensity, β-strand-forming propensity, and relative abundance of amino acids.
• AllergenFP 1.0 (based on a novel alignment-free, descriptor-based fingerprint approach) identifies both allergens and non-allergens with ~88% accuracy.
• The toxicity of the epitopes can be predicted with ToxinPred, which combines quantitative matrices and machine learning techniques using different physicochemical properties of peptides.
• Select the non-allergenic and nontoxic epitopes for further prediction of autoimmunity and molecular docking simulation studies.
3.1.10 Peptide Match for Autoimmunity Prediction
Before selecting an epitope as a vaccine candidate, the probability of an autoimmune reaction must be considered.
• Submit the sequences of the best predicted epitopes to the Peptide Match service available in PIR.
• Select Homo sapiens (human) as the target organism (see Note 7).
• Restrict the search to UniRef100 representative sequences within UniProtKB.
• Select epitopes that do not show similarity with human proteins.
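The principle of this check is an exact peptide search against human protein sequences. The following minimal sketch shows that principle on a small local dictionary; the protein names and sequences are hypothetical, and the sketch does not call the PIR service.

```python
def human_matches(epitope, human_proteins):
    """Names of human proteins that contain the epitope exactly."""
    return sorted(name for name, seq in human_proteins.items()
                  if epitope in seq)


def non_self_epitopes(epitopes, human_proteins):
    """Keep only epitopes with no exact match in any human protein."""
    return [e for e in epitopes if not human_matches(e, human_proteins)]


# Hypothetical human protein sequences.
human_proteins = {
    "HUMAN_PROT_1": "MSTAGKLVFFAEDVLLPQ",
    "HUMAN_PROT_2": "MQQRRSSTTVVWWYY",
}
epitopes = ["KLVFFAEDV", "YLQPRTFLL"]

print(human_matches("KLVFFAEDV", human_proteins))
print(non_self_epitopes(epitopes, human_proteins))
```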
3.1.11 Preparation of the 3D Structure of Selected Epitope and MHC Allele
After carrying out all the above steps, the 3D structures of the selected best peptide epitope and the MHC allele must be prepared in order to perform molecular docking studies.
• Search the PDB for the 3D structures of the best peptide epitope and the MHC allele. If the structures are available, retrieve them in PDB file format and proceed to molecular docking studies.
• If the structures are not available in the PDB, predict the 3D structures as follows:
– Submit the best selected epitopes to the PEP-FOLD3 web server for prediction of their 3D structure. PEP-FOLD3 uses structural alphabet (SA) letters to describe the conformation of four consecutive amino acid residues, coupled with a greedy algorithm and a coarse-grained force field.
– Save the best model predicted by PEP-FOLD3 in PDB file format for docking with MHC alleles.
– Predict the 3D structure [9] of the MHC alleles with two or more protein 3D structure prediction servers/software packages such as Phyre2, LOMETS, MUSTER, MODELLER, and SWISS-MODEL (see Note 8).
– Analyze the resulting models based on the Ramachandran plot, and select the best 3D model for further refinement.
– Refine the structure using ModRefiner until a stable structure is obtained, and then evaluate it with the Ramachandran plot, PROCHECK, Verify3D, and QMEAN6.
3.1.12 Docking Simulation for Selected Epitope
A molecular docking simulation study is performed to analyze the interaction between the predicted epitope and its binding alleles.
• AutoDock Tools, AutoDock Vina, Molegro Virtual Docker, Schrödinger, etc. can be used to perform molecular docking studies.
• A flexible docking approach should be used for better results.
• In general, a more negative binding energy indicates higher binding affinity. Hence, the peptide with the most negative binding energy against the MHC allele can be selected as a potential epitope for in vitro vaccine development.
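Since docking programs report binding energies as negative numbers, the "best" peptide is the one with the lowest value. A minimal sketch of this ranking, using hypothetical peptides and energies, is:

```python
def rank_by_binding_energy(energies_kcal_per_mol):
    """Sort peptides from most negative (strongest) to least negative."""
    return sorted(energies_kcal_per_mol.items(), key=lambda item: item[1])


# Hypothetical docking results for three peptides against one MHC allele.
energies = {"KLVFFAEDV": -7.2, "YLQPRTFLL": -9.1, "GILGFVFTL": -5.4}

ranking = rank_by_binding_energy(energies)
best_peptide, best_energy = ranking[0]
print(best_peptide, best_energy)
```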
3.2 Methods for B-Cell Epitope-Based Vaccine Prediction
The methodology for B-cell epitope-based vaccine design is given below. Use either default threshold parameters or user-defined parameters for prediction of epitopes with the different computational tools.
1. Antigenicity prediction.
2. Homology analysis.
3. Linear B-cell epitope prediction.
4. Discontinuous or conformational epitope prediction.
5. Epitope conservation, immunogenicity, and antigenicity analysis.
6. Allergenicity and toxicity prediction.
7. Physiochemical characterization.
8. Peptide match for autoimmunity analysis.
All the above steps except steps 3, 4, and 7 in Subheading 3.2 remain the same as discussed for T-cell epitope prediction in Subheading 3.1.
9. Linear B-Cell Epitope Prediction
The length of B-cell epitopes varies from 5 to 30 residues, but most web-based tools/servers predict linear B-cell epitopes with a length of 20 amino acid residues.
• Submit the antigenic protein sequence to the web servers LBtope, ABCpred, Bcepred, and BepiPred-2.0.
• LBtope, based on SVM, predicts linear B-cell peptide epitopes with an overall accuracy of ~81%.
• ABCpred uses a recurrent neural network for prediction, with an accuracy of ~65.93%.
• Bcepred predicts linear B-cell peptide epitopes based on physicochemical properties of amino acids such as hydrophilicity, flexibility/mobility, accessibility, polarity, exposed surface, and turns, with an accuracy of ~58.70%.
• BepiPred-2.0 server predicts linear B-cell epitopes based on a Random Forest algorithm.
• Select the top-scored linear B-cell peptides predicted by the different web servers for prediction of epitope conservation, immunogenicity, antigenicity, allergenicity, toxicity, physiochemical properties, and peptide match for autoimmunity.
• The best epitope selected after the aforesaid analysis can be a candidate for in vitro vaccine development.
10. Conformational or Discontinuous B-Cell Epitope Prediction
The minimal amino acid sequence required for proper folding of a discontinuous epitope in the native protein may range from 20 to 400 amino acids.
• Conformational B-cell epitope prediction requires the 3D structure of the antigenic protein.
• Search the PDB for the 3D structure of the selected antigenic protein and retrieve the structure in PDB format for epitope prediction.
• If the 3D structure of the selected antigenic protein is not available in the PDB, model it with the molecular modeling approach discussed in Subheading 3.1.11 (3D structure prediction of MHC allele).
• Submit the 3D structure to ElliPro, DiscoTope 2.0, BEpro (earlier known as PEPITO), and SEPPA 3.0.
• ElliPro uses a combination of single amino acid propensity and geometric features of an antigen for conformational epitope prediction.
• DiscoTope 2.0 predicts conformational epitopes based on calculation of a novel epitope propensity amino acid score and surface accessibility.
• SEPPA 3.0 predicts conformational B-cell epitopes from membrane and secretory antigenic proteins.
• Select the best-scored epitopes for physiochemical characterization.
11. Physiochemical Characterization
• Submit the best-scored linear or conformational B-cell epitopes to the ProtParam tool and SOPMA for prediction of physiochemical properties such as molecular weight, aliphatic index, instability index, isoelectric point (pI), GRAVY, estimated half-life, extinction coefficient, transmembrane helices, bend region, globular region, coiled-coil region, random coil, and solvent accessibility.
• Select the best-scored epitopes for further evaluation of homology with human proteins for autoimmunity prediction (discussed in Subheading 3.1.10).
• An epitope obtained after autoimmunity prediction can be a potential candidate for in vitro vaccine development.
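Two of the properties listed above can be computed by hand, which is useful for checking server output. The sketch below computes GRAVY as the mean Kyte-Doolittle hydropathy value and the aliphatic index as X(Ala) + 2.9 × X(Val) + 3.9 × (X(Ile) + X(Leu)), where X is the mole percent of each residue. The function names and the test peptide are illustrative.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide):
    """Grand average of hydropathicity: mean hydropathy per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


def aliphatic_index(peptide):
    """Aliphatic index from the mole percent of Ala, Val, Ile, and Leu."""
    n = len(peptide)

    def mole_percent(aa):
        return 100.0 * peptide.count(aa) / n

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


peptide = "AVIL"  # toy peptide with one residue of each aliphatic type
print(round(gravy(peptide), 3))
print(round(aliphatic_index(peptide), 1))
```

A positive GRAVY value indicates a hydrophobic peptide and a negative value a hydrophilic one.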
3.3 T- and B-Cell Epitope Superimposition
Peptides capable of inducing both cellular and humoral responses are often suggested as the most probable candidates. Whether to proceed with this additional step therefore depends on the user's interest.
• Select the potential T- and B-cell epitopes obtained through the above approaches and study their alignment.
• Consider the overlapping regions of the epitopes as the most promising epitopes for vaccine development.
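If each epitope is recorded by its start and end position in the antigen sequence, the overlapping regions can be found by interval intersection. The following is a minimal sketch with hypothetical positions; the function names are illustrative.

```python
def overlap(region_a, region_b):
    """Return the shared (start, end) of two inclusive regions, or None."""
    start = max(region_a[0], region_b[0])
    end = min(region_a[1], region_b[1])
    return (start, end) if start <= end else None


def overlapping_regions(t_cell_epitopes, b_cell_epitopes):
    """All regions shared by at least one T-cell and one B-cell epitope."""
    shared = set()
    for t_region in t_cell_epitopes:
        for b_region in b_cell_epitopes:
            common = overlap(t_region, b_region)
            if common is not None:
                shared.add(common)
    return sorted(shared)


# Hypothetical 1-based, inclusive positions in the antigen sequence.
t_cell_epitopes = [(10, 18), (40, 48)]
b_cell_epitopes = [(15, 34), (60, 79)]

print(overlapping_regions(t_cell_epitopes, b_cell_epitopes))
```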
4 Notes
1. For prediction of epitopes using different computational tools, either the default threshold parameters can be used or the user can define the parameters. If unsure how to define the parameters, use the default parameters of the tool or software.
2. Each computational tool adopts a different approach or algorithm to predict epitopes and hence varies in its prediction accuracy. Applying multiple tools for epitope prediction therefore enhances the accuracy of prediction (see ref. [8]). An epitope confirmed by many tools may be selected as the best epitope for further analysis.
3. Homology analysis between the predicted protein and human protein sequences is essential to avoid the induction of a potential autoimmune response.
4. Prepare the sequence input file very carefully, avoiding any extra spaces, lines, or special characters, to prevent erroneous output.
5. Before using any algorithm of a prediction tool, study the documentation of that tool/software in order to select the algorithm that gives the most accurate results.
6. The user can define the number of predicted peptides in the output section of a tool as needed.
7. If the vaccine is to be designed for humans, select Homo sapiens as the organism; otherwise select the particular species for which the vaccine is to be developed.
8. Perform energy minimization with Swiss-PdbViewer (freely available) to obtain a stable 3D structure of the MHC alleles.
References
1. Sundaramurthi JC, Ashokkumar M, Swaminathan S, Hanna LE (2017) HLA based selection of epitopes offers a potential window of opportunity for vaccine design against HIV. Vaccine 35:5568–5575
2. Murrell S, Wu SC, Butler M (2011) Review of dengue virus and the development of a vaccine. Biotechnol Adv 29:239–247
3. Khan MA, Hossain MU, Rakib-Uz-Zaman SM et al (2015) Epitope-based peptide vaccine design and target site depiction against Ebola viruses: an immunoinformatics study. Scand J Immunol 82:25–34
4. Rappuoli R (2000) Reverse vaccinology. Curr Opin Microbiol 3:445–450
5. He Y, Rappuoli R, De Groot AS et al (2010) Emerging vaccine informatics. J Biomed Biotechnol. https://doi.org/10.1155/2010/218590
6. Welly BT, Miller MR, Stot JL et al (2017) Genome report: identification and validation of antigenic proteins from Pajaroellobacter abortibovis using de novo genome sequence assembly and reverse vaccinology. G3 7:321–331
7. Kimbrell DA, Beutler B (2001) The evolution and genetics of innate immunity. Nat Rev Genet 2:256–267
8. Trost B, Bickis M, Kusalik A (2007) Strength in numbers: achieving greater accuracy in MHC-I binding prediction by combining the results from multiple prediction tools. Immunome Res 3:5
9. Anand R (2018) Identification of potential Anti-tuberculosis drugs through docking and virtual screening. Interdiscip Sci Comput Life Sci 10:419–429
Chapter 16
EPCES and EPSVR: Prediction of B-Cell Antigenic Epitopes on Protein Surfaces with Conformational Information
Shide Liang, Dandan Zheng, Bo Yao, and Chi Zhang
Abstract
Accurate prediction of discontinuous antigenic epitopes is important for immunologic research and medical applications, but it is not an easy problem. Currently, only a few prediction servers are available, even though discontinuous epitopes constitute the majority of all B-cell antigenic epitopes. In this chapter, we describe two online servers, EPCES and EPSVR, for discontinuous epitope prediction. All methods were benchmarked on a curated independent test set in which none of the antigens had a complex structure with an antibody and the epitopes were identified by various biochemical experiments. The servers and all datasets are available at http://sysbio.unl.edu/EPCES/ and http://sysbio.unl.edu/EPSVR/.
Key words B-cell epitope prediction, Support vector machine
1 Introduction
Antigenic epitopes are regions of the protein surface that are preferentially recognized by antibodies. Prediction of antigenic epitopes is important but challenging.
B-cell antigenic epitopes are usually classified as either continuous or discontinuous. The majority of available epitope prediction methods focus on continuous epitopes [1–12]. Although discontinuous epitopes dominate most antigenic epitope families [13], their computational complexity means that only a few methods exist for discontinuous epitope prediction: CEP [14], DiscoTope [15], PEPITO [16], ElliPro [17], SEPPA [18], EPITOPIA [19, 20], EPCES [21], and EPSVR [22]. All discontinuous epitope prediction methods require the three-dimensional structure of the antigenic protein. The small number of available antigen-antibody complex structures limits the development of reliable discontinuous epitope prediction methods, and an unbiased benchmark set is very much in demand [21, 23].

For discontinuous epitope prediction, both EPCES and EPSVR integrate six attributes: residue epitope propensity, conservation score, side-chain energy score, contact number, surface planarity score, and secondary structure composition. The prediction accuracy was validated on an independent test set in which the antigens had no available antibody-complex structures and the epitopes were derived from various biochemical experiments. Both EPCES and EPSVR are now available online.
2 Materials
2.1 Webserver
The webservers EPCES and EPSVR were developed for conformational B-cell epitope prediction and are available at http://sysbio.unl.edu/EPCES/ and http://sysbio.unl.edu/EPSVR/. Figure 1 displays the input page of the two webservers, which allows users to enter a PDB ID or upload a protein structure file in PDB format. The chain name is also required (see Note 1). Once a user submits a job by clicking the “submit” button, after typing the correct four-letter word shown in an image to prevent robot submissions (see Note 2), a new page appears that acknowledges the successful submission and displays a URL in red that will be used to check the prediction results (see Note 3). The input structure is first screened against the database of all previously received inputs to see whether it has been predicted before. If so, the existing results are returned directly. Otherwise, the input is passed on to the predictor running in the background, which screens the input structure to generate feature vectors and finally uses a vote system or an SVM classifier to score all candidates (see Note 4). The output is displayed on a web page, as shown in Fig. 2. The output includes (1) the predicted antigenic residues and (2) the color-marked PDB file. If the PDB file is displayed with B-factor coloring, the residues are colored from red to blue according to the predicted likelihood of being an epitope residue (see Note 5). The results are permanently saved in the database, and users can access them with the URL obtained when they first submitted their input.
Fig. 1 The input window of the EPCES webserver
Fig. 2 The result window of the EPCES webserver. The result is saved in a PDB file, which can be downloaded by clicking the button
2.2 Datasets
2.2.1 Training Set
The training set was gathered and screened from three protein data sets: (1) 22 antigen-antibody complexes and their unbound structures from protein docking Benchmark 2.0 [24], (2) 59 representative antigen-antibody complexes compiled by Ponomarenko and Bourne [23], and (3) 17 antigen-antibody complex structures released between February 2006 and October 2008 with available unbound antigen structures, which formed the test set in our previous work [21]. Any antigen-antibody complex was discarded if its antigen had no available unbound structure because the unbound structures were required for prediction. A complex structure was not used if its antigenic epitope consisted of amino acid residues located on multiple chains. A complex was included if the sequence identity between its antigen and all other antigens from the other complex structures was less than 35% following local sequence alignment. For an antigen with a sequence identity in the range of 35–50%, we accepted the antigen-antibody complex if the binding topology was not the same as that of its homologous complex. For an antigen with more than one antigenic epitope, only one was used in order to avoid confusion in the subsequent application of support vector regression methods. As a result, a total of 48 complexes and their unbound structures meeting the above criteria were used as a training set, available for download at http://sysbio.unl.edu/services/EPSVR/training.tar.gz.
2.2.2 Test Set
The test set was curated from 293 entries of the Conformational Epitope Database [25] (CED, Release 0.03) with the following criteria. We only considered entries that had unbound antigen structures, but no complex structures. Multiple entries with the same antigen structure were combined and considered as one target, and antigenic residues from multiple entries were mapped onto one protein structure. The sequence identity between any two selected proteins was also required to be less than 35%. All selected antigens were also screened against the rest of the CED database and our training set; the sequence identity between a selected antigen and other antigens with complex structures in the CED or in the training set was less than 35%. A total of 22 antigenic proteins in the CED met all the above criteria; these were 1www, 1hgu, 1eku, 1mbn, 1av1, 1pv6, 1al2, 2gmf, 1a7c, 1y8o, 1og5, 1jeq, 1dab, 1w7b, 1ly2, 1rec, 1nu6, 2b5i, 2gib, 1p4t, 1xwv, and 1qgt. Three antigenic proteins, 1www, 1hgu, and 1xwv, were excluded since they had multiple antibody-binding sites and the mapped antigenic residues were evenly distributed on the protein surfaces. Therefore, the final test set contained 19 antigen structures, available at http://sysbio.unl.edu/services/EPSVR/testing.tar.gz.
3 Method
3.1 EPCES and EPSVR Predictive Model
Both EPCES and EPSVR employ the same set of features, but their predictive models are different. For the description of all features, see Subheading 3.3.
EPCES conducts prediction with consensus scoring. To take advantage of the multiple features, we used a voting mechanism with the six scoring functions listed above. A patch was considered an interface patch if five of the six terms scored it into the top-ranked patch set. We did not require all six votes because a surface patch with a small contact number cannot have a high planarity score at the same time. The number of residues predicted by each single term is the same, but the threshold of how many top-ranked patches are kept can be varied to yield predictions with different sensitivities.
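The voting step can be illustrated with a minimal Python sketch. It is not the EPCES implementation: the function name, the data layout (one list of patch scores per scoring term), and the reading of "five of the six terms" as "at least five" are assumptions made here for illustration, and every term is assumed to be oriented so that a higher value is more epitope-like.

```python
from typing import Dict, List


def consensus_patches(term_scores: Dict[str, List[float]],
                      top_n: int,
                      min_votes: int = 5) -> List[int]:
    """Return indices of patches placed in the top_n set by >= min_votes terms.

    term_scores maps a scoring-term name to one score per surface patch.
    Assumption: scores are oriented so that higher means more epitope-like
    (e.g., a small contact number would be negated before being passed in).
    """
    votes: Dict[int, int] = {}
    for values in term_scores.values():
        # Rank patches by this term, best first, and keep the top_n of them.
        ranked = sorted(range(len(values)), key=lambda k: values[k], reverse=True)
        for patch in ranked[:top_n]:
            votes[patch] = votes.get(patch, 0) + 1
    # A patch is kept when enough terms voted for it.
    return sorted(p for p, v in votes.items() if v >= min_votes)
```

Varying `top_n` in this sketch plays the role of the threshold on how many top-ranked patches are kept, which trades sensitivity against specificity.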
EPSVR employs support vector regression (SVR), implemented with the SVM package LIBSVM [26]. For the training step, the target value of each surface patch was the number of interface residues it contained, which can be any integer between 0 and the patch size (20 in this work), and each surface patch had six SVR attributes, which were calculated with the six scoring terms: residue epitope propensity, conservation score, side-chain energy score, contact number, surface planarity score, and secondary structure composition. The six scoring terms were the same as those used in our previous work [21]. All SVR parameters were optimized by a grid search ($c = 2^{-10}$ to $2^{1}$, $g = 2^{-12}$ to $2^{3}$, and $p = 2^{-5}$ to $2^{2}$), and for each grid point of triplets, a leave-one-out procedure was applied to evaluate the trained SVR model. For prediction, we first enumerated all surface patches of a given antigen structure and calculated their six SVR attributes. For each surface patch, we predicted the number of putative interface residues with the trained SVR model. Here, a patch score was defined as the ratio of the number of putative interface residues to the total number of amino acid residues in the patch, i.e., 20. Each surface residue was assigned a residue score by averaging the patch scores of all patches that include this residue. Finally, we sorted surface residues according to their residue scores, and the top-ranked ones were considered interface residues. The assumption here is that a residue frequently appearing in top-scoring patches is likely an interface residue.
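The conversion from per-patch SVR output to per-residue scores can be sketched as follows. This is only an illustration of the averaging described above: the function names and the input layout (a list of patch memberships plus one predicted count per patch) are hypothetical, and the SVR prediction itself is taken as given.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def residue_scores(patches: Sequence[Sequence[str]],
                   predicted_counts: Sequence[float],
                   patch_size: int = 20) -> Dict[str, float]:
    """Average patch scores over all patches that contain each residue.

    patches[k] lists the residue identifiers in patch k, and
    predicted_counts[k] is the predicted number of interface residues
    in that patch (assumed to come from a trained SVR model).
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for members, n_pred in zip(patches, predicted_counts):
        patch_score = n_pred / patch_size  # fraction of putative interface residues
        for res in members:
            totals[res] += patch_score
            counts[res] += 1
    return {res: totals[res] / counts[res] for res in totals}


def rank_residues(scores: Dict[str, float]) -> List[str]:
    """Sort residues from highest to lowest residue score."""
    return sorted(scores, key=lambda res: scores[res], reverse=True)
```

The top-ranked residues returned by `rank_residues` would then be reported as putative interface residues.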
3.2 Definition of Surface Residues, Surface Patches, and Interface Residues
Following the previous work [27], we consider an amino acid residue a surface residue if the relative accessibility of its side chain is greater than 6% with probe radius = 1.2 Å. A surface patch is defined as a central surface residue and its 19 nearest surface neighbors in space. Solvent vector constraints [28] were applied in order to avoid patches sampled on different sides of a protein surface. An interface residue is a surface residue whose solvent accessibility decreases by more than 1 Å² upon association.
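As a rough illustration of the patch definition, the sketch below picks a central residue and its nearest surface neighbors. It assumes one representative coordinate per surface residue (for example, the Cα atom, which the text does not specify) and deliberately omits the solvent vector constraints; the function name and data layout are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def surface_patch(center: str,
                  surface_coords: Dict[str, Coord],
                  size: int = 20) -> List[str]:
    """Return the central residue followed by its (size - 1) nearest neighbors.

    surface_coords maps each surface residue identifier to one
    representative 3D coordinate (an assumption of this sketch).
    """
    origin = surface_coords[center]
    neighbors = sorted(
        (res for res in surface_coords if res != center),
        key=lambda res: math.dist(origin, surface_coords[res]),
    )
    return [center] + neighbors[:size - 1]
```

Enumerating all patches then amounts to calling `surface_patch` once for every surface residue.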
3.3 Feature Vectors
Residue epitope propensity [29], conservation score [29], side-chain energy score [29], contact number [15], surface planarity score [30], and secondary structure composition [31] were exploited for antibody-binding site prediction. We previously used the first three terms for protein-protein interface prediction (PINUP [29]), which had the highest prediction accuracy according to an independent study [32]. The last three terms have been used for antibody-binding site prediction by other researchers. We describe the details of these six terms in the following paragraphs.
Residue epitope propensity. The score of antibody-binding site propensity, $E_{\mathrm{propensity}}(i)$, is defined as

$$E_{\mathrm{propensity}}(i) = \ln\left(\frac{P_r^{\mathrm{Interface}}}{P_r^{\mathrm{surface}}}\right)\frac{S_r}{S_r^{\mathrm{ave}}} \qquad (1)$$

where $P_r^{\mathrm{Interface}}$ and $P_r^{\mathrm{surface}}$ are the contribution of residue type r to the antibody-binding site and to the protein surface area, respectively, and $S_r$ and $S_r^{\mathrm{ave}}$ are the relative accessible surface area of residue r at the sequence position i and the average relative accessible surface area of surface residues of type r, respectively. The Cα atom of Gly is considered a side-chain atom for convenience. Since antigen-antibody interfaces have a different residue composition compared with other protein-protein interfaces, we used Protein Dataset 1 to derive the residue antibody-binding site propensity instead of using the former residue interface propensity score [29]. Here, $P_r^{\mathrm{Interface}}$ and $P_r^{\mathrm{surface}}$ were obtained from statistical analysis of Protein Dataset 1. Some antigens in Dataset 1 have multiple epitopes. Residues belonging to any of the epitopes were considered antibody-binding interface residues. The values of $S_r^{\mathrm{ave}}$ for the 20 amino acid residues were obtained from statistical analyses on 41 antigens in Protein Dataset 1.
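A direct transcription of the propensity score into Python might look like the sketch below. It assumes the reading of Eq. (1) in which the relative-exposure ratio multiplies the logarithm of the propensity ratio; the function and argument names are hypothetical, and the statistics themselves would have to come from the training data.

```python
import math


def epitope_propensity(p_interface: float,
                       p_surface: float,
                       s_r: float,
                       s_ave: float) -> float:
    """Propensity score of one residue position.

    p_interface, p_surface: contribution of the residue type to the
        antibody-binding site and to the protein surface, respectively.
    s_r: relative accessible surface area of the residue at this position.
    s_ave: average relative accessible surface area for this residue type.
    """
    # Log-odds of finding this residue type in an epitope, weighted by
    # how exposed this particular residue is relative to the average.
    return math.log(p_interface / p_surface) * (s_r / s_ave)
```

Under this reading, a residue type enriched in epitopes scores positive, and the score grows with the exposure of the individual residue.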
Residue conservation score. Residue conservation was measured by the self-substitution score from the sequence profile. Sequence profiles were obtained by three rounds of PSI-BLAST searches with the BLOSUM62 [33] substitution matrix. The conservation score at position i is defined as

$$E_{\mathrm{conserv}}(i) = \begin{cases} |M_{ir} - B_{rr}|, & \text{if } M_{ir} - B_{rr} < 0, \\ 0, & \text{if } M_{ir} - B_{rr} > 0, \end{cases} \qquad (2)$$

where $M_{ir}$ is the self-substitution score in the position-specific substitution matrix generated by PSI-BLAST for the residue type r at sequence position i, and $B_{rr}$ is the diagonal element of BLOSUM62 for residue type r. Usually, protein-protein interface residues are more conserved than other surface residues due to functional constraints, and hence, the conserved surface residues in the unbound structure will be predicted as interface residues. The residues in the antibody-binding site, however, are less conserved than other surface residues due to the constraint of the host immune system. The unconserved residues are therefore considered the putative antibody-binding site residues.
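The piecewise definition in Eq. (2) reduces to a one-line rule, sketched here in Python. The function name is hypothetical, and the two inputs are assumed to have been read from a PSI-BLAST profile and from the BLOSUM62 diagonal beforehand.

```python
def conservation_score(m_ir: float, b_rr: float) -> float:
    """Score how much less conserved a position is than BLOSUM62 expects.

    m_ir: self-substitution score of residue type r at position i in the
        position-specific matrix.
    b_rr: diagonal BLOSUM62 element for residue type r.
    """
    diff = m_ir - b_rr
    # Only positions that are less conserved than the background
    # (negative difference) contribute; all others score zero.
    return abs(diff) if diff < 0 else 0.0
```

A larger value therefore marks a more variable position, which the method treats as more likely to belong to an antibody-binding site.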
Side-chain energy score. The exact expression for side-chain
energy score can be found in Eq. (3) in PINUP [29]. It was
calculated from the side-chain energies of all possible rotamers for
a given residue type at a sequence position, whereas other sequence
positions have native residue types and observed atomic coordi-
nates. The weights of the energy function were optimized, so that
the native residue was predicted energetically favorable at each
position of the training proteins [34]. We assumed that the residues
at the antibody-binding site had a higher energy score than other
surface residues, so that the free energy of the antigen-antibody
system could go down significantly upon association.
Contact number. The residue contact number is the number of
Cα atoms in the antigen within a distance of 10 Å of the Cα atom of
residue i [15]. A residue with a small contact number was consid-
ered as an antibody-binding site residue.
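The contact number can be computed with a simple pairwise loop, as in the sketch below. Whether the residue itself is counted and whether the 10 Å boundary is inclusive are not stated in the text; this sketch assumes the residue is excluded and the boundary is inclusive. The function name and input layout are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def contact_numbers(ca_coords: Sequence[Coord], cutoff: float = 10.0) -> List[int]:
    """Count, for every residue, the other Cα atoms within `cutoff` Å."""
    n = len(ca_coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # Each close pair adds one contact to both residues.
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts
```

For large antigens, a spatial grid or k-d tree would avoid the quadratic loop, but the direct version is easier to check against the definition.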
Planarity score. The planarity of each surface patch was calcu-
lated by evaluating the root mean squared (rms) deviation of all the
Cα atoms in the surface patch from the least squares plane through
the atoms. The rms deviations were inverted such that a high
planarity score for a patch was interpreted as a planar patch and
antibody-binding site [35].
Secondary structure composition. This score was defined as the fraction of patch residues forming turns or loops among all 20 patch residues. Following Chou & Fasman’s method [36], the α-helix and β-sheet were defined as four or more consecutive residues having φ, ψ angles within 40° of (−60°, −50°) and three or more residues having φ, ψ angles within 40° of (−120°, 110°) or (−140°, 135°), respectively. The remaining regions were considered turns and loops.
4 Notes
1. Multiple chains are allowed. The chain name(s) could be “A” for a single chain or “HL” for two different chains. If the name of the chain of interest is empty in the PDB file, please use “_”. If all chains are of interest, please use “*”.
2. In the image above the “submit” button, all letters are upper case.
3. The URL looks like http://sysbio.unl.edu/EPCES/waiting.php?jobid=5ccb426d7d5709.69830017. The string after “jobid=” is the ID specifically assigned to the submitted job by the webserver, which is unique for each case. Please save this URL for retrieving the outputs in the future.
4. Usually, it will take a while for the prediction step to complete. The average waiting time is about 20 min. However, sometimes the server is very busy, and the computing time for one protein could be more than several hours. If many protein structures are submitted to the server at once from different browser windows, then most of them will stay in the queue for a very long time, possibly more than several days. However, the server will automatically delete jobs that have stayed in the queue for too long, such as more than 24 h.
5. For the score cutoff, we recommend using 90. That is, if the score assigned to an amino acid residue is larger than 90, the residue is a good candidate epitope residue.
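Applying the cutoff from Note 5 to a downloaded result file could be sketched as below. This assumes that the per-residue score is stored in the B-factor column of standard fixed-width ATOM records, which is suggested by the B-factor coloring described in Subheading 2.1 but is not confirmed here; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def residues_above_cutoff(pdb_lines: Iterable[str],
                          cutoff: float = 90.0) -> List[Tuple[str, str, str, float]]:
    """List (chain, residue name, residue number, score) with score > cutoff.

    Assumption: the prediction score sits in the B-factor field
    (columns 61-66) of each ATOM record.
    """
    hits = []
    seen = set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        res_name = line[17:20].strip()   # columns 18-20
        chain = line[21]                 # column 22
        res_seq = line[22:27].strip()    # columns 23-27: number + insertion code
        score = float(line[60:66])       # columns 61-66: B-factor field
        key = (chain, res_seq)
        # Report each residue once, even though it has many atoms.
        if score > cutoff and key not in seen:
            seen.add(key)
            hits.append((chain, res_name, res_seq, score))
    return hits
```

A typical call would pass the lines of the downloaded file, for example `residues_above_cutoff(open(path))`, where `path` is the saved result file.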
Acknowledgments
The work is supported by funding under C.Z.’s startup funds from the University of Nebraska, Lincoln, NE. This work was completed utilizing the Holland Computing Center of the University of Nebraska.
References
1. Parker JM, Guo D, Hodges RS (1986) New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 25(19):5425–5432
2. Emini EA, Hughes JV, Perlow DS et al (1985) Induction of hepatitis A virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol 55(3):836–839
3. Karplus PA, Schulz GE (1985) Prediction of chain flexibility in proteins - a tool for the selection of peptide antigens. Naturwissenschaften 72(4):212–213
4. Kolaskar AS, Tongaonkar PC (1990) A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett 276(1–2):172–174. https://doi.org/10.1016/0014-5793(90)80535-q
5. Larsen JE, Lund O, Nielsen M (2006) Improved method for predicting linear B-cell epitopes. Immunome Res 2:2. https://doi.org/10.1186/1745-7580-2-2
6. Saha S, Raghava GP (2006) Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins 65(1):40–48. https://doi.org/10.1002/prot.21078
7. Chen J, Liu H, Yang J et al (2007) Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids 33(3):423–428. https://doi.org/10.1007/S00726-006-0485-9
8. El-Manzalawy Y, Dobbs D, Honavar V (2008) Predicting linear B-cell epitopes using string kernels. J Mol Recognit 21(4):243–255. https://doi.org/10.1002/jmr.893
9. Blythe MJ, Flower DR (2005) Benchmarking B cell epitope prediction: underperformance of existing methods. Protein Sci 14(1):246–248. https://doi.org/10.1110/ps.041059505
10. Greenbaum JA, Andersen PH, Blythe M et al (2007) Towards a consensus on datasets and evaluation metrics for developing B-cell epitope prediction tools. J Mol Recognit 20(2):75–82. https://doi.org/10.1002/jmr.815
11. Sweredoski MJ, Baldi P (2009) COBEpro: a novel system for predicting continuous B-cell epitopes. Protein Eng Des Sel 22(3):113–120. https://doi.org/10.1093/protein/gzn075
12. Yang X, Yu X (2009) An introduction to epitope prediction methods and software. Rev Med Virol 19(2):77–96. https://doi.org/10.1002/rmv.602
13. Van Regenmortel MHV (1996) Mapping epitope structure and activity: from one-dimensional prediction to four-dimensional description of antigenic specificity. Methods 9(3):465–472
14. Kulkarni-Kale U, Bhosle S, Kolaskar AS (2005) CEP: a conformational epitope prediction server. Nucleic Acids Res 33(Web Server issue):W168–W171. https://doi.org/10.1093/nar/gki460
15. Andersen PH, Nielsen M, Lund O (2006) Prediction of residues in discontinuous B-cell epitopes using protein 3D structures. Protein Sci 15(11):2558–2567. https://doi.org/10.1110/Ps.062405906
16. Sweredoski MJ, Baldi P (2008) PEPITO: improved discontinuous B-cell epitope prediction using multiple distance thresholds and half sphere exposure. Bioinformatics 24(12):1459–1460. https://doi.org/10.1093/bioinformatics/btn199
17. Ponomarenko J, Bui HH, Li W et al (2008) ElliPro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinformatics 9:514. https://doi.org/10.1186/1471-2105-9-514
18. Sun J, Wu D, Xu T et al (2009) SEPPA: a computational server for spatial epitope prediction of protein antigens. Nucleic Acids Res 37(Web Server issue):W612–W616. https://doi.org/10.1093/nar/gkp417
19. Rubinstein ND, Mayrose I, Pupko T (2009) A machine-learning approach for predicting B-cell epitopes. Mol Immunol 46(5):840–847. https://doi.org/10.1016/j.molimm.2008.09.009
20. Rubinstein ND, Mayrose I, Martz E et al (2009) Epitopia: a web-server for predicting B-cell epitopes. BMC Bioinformatics 10:287. https://doi.org/10.1186/1471-2105-10-287
21. Liang S, Zheng D, Zhang C et al (2009) Prediction of antigenic epitopes on protein surfaces by consensus scoring. BMC Bioinformatics 10:302. https://doi.org/10.1186/1471-2105-10-302
22. Liang S, Zheng D, Standley DM et al (2010) EPSVR and EPMeta: prediction of antigenic epitopes using support vector regression and multiple server results. BMC Bioinformatics 11:381. https://doi.org/10.1186/1471-2105-11-381
23. Ponomarenko JV, Bourne PE (2007) Antibody-protein interactions: benchmark datasets and prediction tools evaluation. BMC Struct Biol 7:64. https://doi.org/10.1186/1472-6807-7-64
24. Mintseris J, Wiehe K, Pierce B et al (2005) Protein-Protein Docking Benchmark 2.0: an update. Proteins 60(2):214–216. https://doi.org/10.1002/prot.20560
25. Huang J, Honda W (2006) CED: a conformational epitope database. BMC Immunol 7:7. https://doi.org/10.1186/1471-2172-7-7
26. Fan RE, Chen PH, Lin CJ (2005) Working set selection using second order information for training support vector machines. J Mach Learn Res 6:1889–1918
27. Liang S, Zhang J, Zhang S et al (2004) Prediction of the interaction site on the surface of an isolated protein structure by analysis of side chain energy scores. Proteins 57(3):548–557
28. Jones S, Thornton JM (1997) Analysis of protein-protein interaction sites using surface patches. J Mol Biol 272(1):121–132
29. Liang S, Zhang C, Liu S et al (2006) Protein binding site prediction using an empirical scoring function. Nucleic Acids Res 34(13):3698–3707. https://doi.org/10.1093/nar/gkl454
30. Jones S, Thornton JM (1997) Prediction of protein-protein interaction sites using patch analysis. J Mol Biol 272(1):133–143. https://doi.org/10.1006/jmbi.1997.1233
31. Pellequer JL, Westhof E, Van Regenmortel MH (1993) Correlation between the location of antigenic sites and the prediction of turns in proteins. Immunol Lett 36(1):83–99
32. Zhou HX, Qin S (2007) Interaction-site prediction for protein complexes: a critical assessment. Bioinformatics 23(17):2203–2209. https://doi.org/10.1093/bioinformatics/btm323
33. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89(22):10915–10919
34. Liang S, Grishin NV (2004) Effective scoring function for protein sequence design. Proteins 54(2):271–281. https://doi.org/10.1002/prot.10560
35. Jones S, Thornton JM (1997) Analysis of protein-protein interaction sites using surface patches. J Mol Biol 272(1):121–132. https://doi.org/10.1006/jmbi.1997.1234
36. Chou PY, Fasman GD (1978) Empirical predictions of protein conformation. Annu Rev Biochem 47:251–276. https://doi.org/10.1146/annurev.bi.47.070178.001343
Chapter 17
SVMTriP: A Method to Predict B-Cell Linear Antigenic Epitopes
Bo Yao, Dandan Zheng, Shide Liang, and Chi Zhang
Abstract
Identifying protein antigenic epitopes recognizable by antibodies is a key step in the discovery of new immuno-diagnostic reagents and in vaccine design. To facilitate this process and improve its efficiency, computational methods have been developed to predict antigenic epitopes. For linear B-cell epitope prediction, many methods have been developed, including BepiPred, ABCPred, AAP, BCPred, BayesB, BEOracle/BROracle, BEST, and SVMTriP. Among these methods, SVMTriP, a frontrunner, uses a Support Vector Machine that combines tri-peptide similarity and Propensity scores. Applied to non-redundant B-cell linear epitopes extracted from IEDB, SVMTriP achieved a sensitivity of 80.1% and a precision of 55.2% with a five-fold cross-validation. The AUC value was 0.702. The combination of similarity and propensity of tri-peptide subsequences can improve the prediction performance for linear B-cell epitopes. A webserver based on this method was constructed for public use. The server and all datasets used in the corresponding study are available at http://sysbio.unl.edu/SVMTriP. This chapter describes the SVMTriP webserver.
Key words Linear B-cell epitope prediction, Support vector machine
1 Introduction
Antigenic epitopes are regions of the protein surface that are preferentially recognized by B-cell antibodies [1]. Prediction of antigenic epitopes is useful for investigating the mechanisms of the body’s self-protection systems and can help in the design of vaccine components and immuno-diagnostic reagents [2].
Usually, B-cell antigenic epitopes are classified as either contin-
uous or discontinuous. A continuous (also called linear) epitope is a
consecutive fragment from the protein sequence, while a discontin-
uous epitope is composed of several fragments scattered along the
protein sequence but still forms an antigen-binding interface in 3D
(see Note 1). Currently, the majority of the available epitope prediction methods focus on continuous epitopes due to the relative simplicity of the problem and the convenience of available investigation methods, in which the amino acid sequence of a protein is taken as the input. Such prediction methods are based
upon the amino acid properties including hydrophilicity [3, 4],
solvent accessibility [5], secondary structure [6], flexibility [7],
and antigenicity [8]. In addition, based on the epitope databases,
such as IEDB [9], Bcipep [10], and FIMM [11], there are also
some methods using machine learning approaches, such as Hidden
Markov Model (HMM) [12], Artificial Neural Network (ANN)
[13], and Support Vector Machine (SVM) [14–16], to locate linear
epitopes, such as PREDITOP [8], PEOPLE [17], BEPITOPE
[18], BepiPred [12], ABCPred [13], AAP [14], BCPred [15],
ELISA BayesB [19], BEOracle/BROracle [20], BEST [21], and
SVMTriP [16]. It was therefore hypothesized that a new method
based on 3D protein structures instead of on 2D sequences utiliz-
ing a machine learning approach would further improve the pre-
diction performance.
In this context, SVMTriP was developed by using SVM to
combine the tri-peptide similarity and Propensity scores for predic-
tion. SVMTriP was tested for varying epitope sequence lengths.
With a five-fold cross-validation, SVMTriP achieved a sensitivity
(Sn) of 80.1% and a precision (P) of 55.2% for sequences with
20 amino acids (AA).
2 Materials
2.1 Webserver
The webserver SVMTriP was developed for linear B-cell epitope prediction and is available at http://sysbio.unl.edu/SVMTriP/. Figure 1 displays the input page of the webserver, which allows users to cut and paste the protein sequence. The input required for SVMTriP is a query protein sequence in the FASTA format (see Note 2) and the linear epitope length (see Note 3). The standard 20 characters for amino acids are accepted, and any characters not included in those 20 will be removed by the webserver (see Note 4). Only one sequence per run is allowed as input. If multiple protein sequences are entered, only the first one will be processed. Once a user submits a job by clicking the “submit” button, after typing the correct four-letter word shown in an image to prevent robot submissions (see Note 5), a new page appears that acknowledges the successful submission and displays a URL in red that will be used to check the prediction results (see Note 6). The input sequence is first screened against the database of all previously received input sequences to see whether it has been predicted before. If so, the existing results are returned directly. Otherwise, the protein sequence is passed on to the predictor running in the background, which scans the input sequence with a sliding window, generates feature vectors for each window, and finally uses an SVM classifier to score all candidates (see Note 7). The scores of all candidate sites are returned and displayed on the output page, which is shown in Fig. 2 (see Note 8). The results are permanently saved in the database, and users can access them with the URL obtained when they first submitted their input sequence.
Fig. 1 The input window of the SVMTriP webserver. The only required input is the protein sequence, which can be copied and pasted into the main text box on this page. Name, Organization, and Email are optional
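The input handling and sliding-window step described above can be sketched in Python. This is not the webserver's code: the function names are hypothetical, and converting the input to upper case before filtering is an assumption of this sketch.

```python
from typing import List, Tuple

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
ALLOWED_LENGTHS = (10, 12, 14, 16, 18, 20)


def first_fasta_sequence(text: str) -> str:
    """Return the first sequence in FASTA text, keeping only the 20 standard letters."""
    parts: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if parts:        # a second record starts: only the first is processed
                break
            continue
        parts.append(line)
    raw = "".join(parts).upper()  # assumption: lower-case input is accepted
    return "".join(ch for ch in raw if ch in STANDARD_AA)


def sliding_windows(sequence: str, length: int = 20) -> List[Tuple[int, str]]:
    """Return (1-based start position, subsequence) for every window."""
    if length not in ALLOWED_LENGTHS:
        raise ValueError("epitope length must be one of %s" % (ALLOWED_LENGTHS,))
    return [(start + 1, sequence[start:start + length])
            for start in range(len(sequence) - length + 1)]
```

Each window returned by `sliding_windows` would then be converted to a feature vector and scored by the trained SVM model.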
2.2 Datasets
The dataset was constructed by extracting non-redundant linear B-cell epitopes from IEDB [9]. Initially, a total of 65,456 B-cell linear epitopes were downloaded from IEDB (version June 11, 2012). Redundant epitopes and those possibly related to T cells were removed. The full-length sequences of the corresponding epitopes were also collected. Epitope sequences of various lengths, including 10AA, 12AA, 14AA, 16AA, 18AA, and 20AA, were obtained by trimming long experimentally measured epitopes or by attaching more amino acid residues to both ends of short epitopes according to the full-length sequences. For a given length, epitope sequences with 30% similarity, measured by BLAST [22], were clustered together, and only one of them was kept as an epitope sequence in the dataset. Finally, the dataset for each length had a total of 4925 non-redundant epitope sequences. For the negative dataset, the same number of equal-length subsequences was extracted from the non-epitopic segments in the corresponding antigen sequences (see Note 9).
Fig. 2 The result window of the SVMTriP webserver. All candidate sites for a given protein sequence are displayed. All candidate sites in one group are ranked based on their predicted scores. The rank, location, subsequence, and score for a given site are displayed
2.3 Downloading
The source code for the webserver, the compiled training/test datasets, and the well-trained SVM models used by the webserver are available for downloading at http://sysbio.unl.edu/SVMTriP/download.php.
3 Methods
3.1 SVM Package and Model Parameters
This webserver uses the SVM package SVMlight, implemented by Joachims (http://svmlight.joachims.org/) [23]. The parameters C (the cost) and γ for the RBF kernel in the SVM were optimized (see Note 10). During the five-fold cross-validation procedure, the five test results were used to calculate the mean values and 95% confidence intervals of the sensitivity, precision, and maximal F-measure (see Note 11).
3.2 Feature Vectors
The tri-peptide subsequence space was used to encode the SVM attributes. This kernel has a space of $20^3$ attributes for both tri-peptide substring and propensity. The score of the i-th attribute, $K^{(i)}$, is defined as the tri-peptide subsequence similarity kernel modulated by its corresponding tri-peptide propensity, as in Eq. (1):

$$K^{(i)} = T^{(i)} \cdot P^{(i)}, \qquad (1)$$

where $K^{(i)}$ denotes the score of the i-th attribute, $T^{(i)}$ denotes the i-th tri-peptide subsequence similarity kernel, and $P^{(i)}$ denotes the propensity of the i-th tri-peptide subsequence. The tri-peptide subsequence similarity kernel is defined as:

$$T^{(i)} = \sum_{j} \Phi^{(i)} \otimes \Omega_j, \qquad (2)$$

where $\Phi^{(i)}$ denotes the tri-peptide that represents the i-th attribute and $\Omega_j$ denotes the j-th tri-peptide in the tri-peptide subsequence space for the input sequence (see Note 12). The symbol “$\otimes$” denotes taking the similarity score of two corresponding tri-peptides, i.e., the sum of three similarity scores for three amino acid pairs from a BLOSUM/PAM matrix (see Note 13). For example, assuming the length of a given epitope candidate is 20 AAs, the tri-peptide subsequence similarity kernel for the i-th attribute is generated by summing over the similarity scores of 18 pairs of tri-peptides; each pair consists of one tri-peptide from the input sequence and the tri-peptide that represents the i-th attribute from the tri-peptide subsequence space. Using BLOSUM62 as an example, the steps are shown below:
(a) A sliding window along the sequence was used.
(b) The score Dx, the sum of the BLOSUM62 values of the individual residue pairs between two tri-peptides, was calculated.
(c) The score Dx was set to zero if the BLOSUM62 value of any individual residue pair was negative.
(d) The data for a sequence with $20^3$ features were calculated using the average score Dx with each specific tri-peptide.
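Steps (a) to (c) can be sketched in Python as shown below. To keep the example short, the substitution matrix is passed in as a function and the alphabet is a parameter; a real run would use the 20 standard amino acids and BLOSUM62 values. All names are hypothetical. The sketch sums over tri-peptides as in Eq. (2); the average mentioned in step (d) differs only by the constant number of windows for a fixed candidate length.

```python
from itertools import product
from typing import Callable, Dict, List

SubMatrix = Callable[[str, str], int]


def tripeptide_windows(candidate: str) -> List[str]:
    """Step (a): slide a window of three residues along the candidate."""
    return [candidate[k:k + 3] for k in range(len(candidate) - 2)]


def tripeptide_pair_score(tri_a: str, tri_b: str, sub: SubMatrix) -> int:
    """Steps (b) and (c): sum the three residue-pair scores, or return zero
    when any individual residue pair has a negative score."""
    pair_scores = [sub(a, b) for a, b in zip(tri_a, tri_b)]
    if any(score < 0 for score in pair_scores):
        return 0
    return sum(pair_scores)


def similarity_kernel(candidate: str, alphabet: str, sub: SubMatrix) -> Dict[str, int]:
    """Similarity of every tri-peptide attribute to the candidate sequence."""
    windows = tripeptide_windows(candidate)
    kernel: Dict[str, int] = {}
    for letters in product(alphabet, repeat=3):
        attribute = "".join(letters)
        kernel[attribute] = sum(tripeptide_pair_score(attribute, w, sub)
                                for w in windows)
    return kernel
```

With the full 20-letter alphabet, `similarity_kernel` returns 8000 values per candidate, and a 20-residue candidate contributes 18 tri-peptides to each sum.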
To build the tri-peptide score matrix, BLOSUM and PAM matrices, the most popular scoring matrices used for protein sequence alignment, were tested. The BLOSUM matrix is derived from residue-residue substitution probability, while PAM is based on observed mutations of closely related proteins. In our study, different levels of BLOSUM matrices were tested, such as BLOSUM30, BLOSUM50, BLOSUM62, and BLOSUM75, where the number represents the percentage identity threshold used to determine closely related protein groups during the construction of the BLOSUM matrix. Different PAM matrices were used as well, such as PAM120, PAM160, PAM200, and PAM250, where the number stands for the number of times the primary PAM1 matrix is multiplied by itself when building a PAM matrix. The choice of BLOSUM or PAM matrix influences the prediction results of the final models.
The propensity of the tri-peptide subsequence representing the
i-th attribute is calculated as in Eq. (3):
$$P(i) = \frac{f(i)}{F(i)} \tag{3}$$
where f(i) is the frequency of the i-th type of tri-peptide in the
positive epitopes and F(i) is the frequency of the i-th type of tri-peptide
in $5 \times 10^4$ protein sequences randomly selected from the RefSeq
database [24].
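A minimal sketch of Eq. (3) is given below. It assumes that "frequency" means the relative frequency of a tri-peptide among all overlapping tri-peptides of a sequence set, which the text does not state explicitly. The function names are hypothetical, and tri-peptides absent from the background set are skipped to avoid division by zero.

```python
from collections import Counter

def tripeptide_frequencies(sequences):
    """Relative frequency of every overlapping tri-peptide in a sequence set."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 2):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()} if total else {}

def propensity(epitopes, background):
    """P(i) = f(i) / F(i) for each tri-peptide observed in the background set."""
    f = tripeptide_frequencies(epitopes)     # f(i): positive epitopes
    F = tripeptide_frequencies(background)   # F(i): background proteins
    return {tri: f.get(tri, 0.0) / F[tri] for tri in F}
```

A propensity above 1 then indicates a tri-peptide that is enriched in the positive epitopes relative to the background.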
4 Notes
1. The boundary between continuous and discontinuous epitopes
is vague; a continuous fragment in a discontinuous epitope can
be considered as a continuous epitope.
2. The query sequence must be a protein amino acid sequence in
the FASTA format. The gene in the DNA/RNA sequence has
to be converted to an amino acid sequence first by the user.
Unknown amino acids (e.g., X) must be removed.
3. For the length of linear epitopes, there are six options: 10, 12,
14, 16, 18, 20 Amino Acids (AA), where 20AA is the default
value. To predict a given full-length protein sequence, the
sliding window method is employed to obtain subsequences
with varying lengths, including 10AA, 12AA, 14AA, 16AA,
18AA, and 20AA. For each subsequence, SVMTriP calculates
its score, and a positive score indicates that the subsequence is a
putative antigenic epitope.
4. Standard amino acid characters are
“ACDEFGHIKLMNPQRSTVWY”. Any characters not included in
these 20, such as “X” and “.”, will be removed by the webserver.
5. In the image, above the “submit” button, all letters are
upper case.
6. The URL looks like
http://sysbio.unl.edu/SVMTriP/waiting.php?jobid=5c76f8418544f4.27142663.
The string after “jobid=” is the ID specifically assigned for the
submitted job by the webserver, which is unique for each case.
Please save this URL for retrieving the outputs in the future.
7. Usually, it will take a while for the prediction step to complete.
The waiting time depends on the length of the input protein
sequence and the length of the job queue of the webserver.
The average waiting time is about 20 min. However, some-
times, the server is very busy, and the computing time for one
protein sequence could be more than several hours. If a user has
many proteins, the best way to use this server is to submit one
sequence at a time and wait to get the result before submitting
the next one. If all sequences are submitted to the server at
once from different browser windows, most of them will stay in
the queue for a very long time, likely more than several days.
However, the server will automatically delete jobs that have
stayed in the queue for too long, such as more than 24 h.
8. For one input sequence, all subsequences with scores >0.5 will
be returned, and the top 3% will be flagged as the recom-
mended candidate epitopes.
9. For model training, it has been shown that the optimal ratio of
positive to negative sites is one [25].
10. The parameters were trained with a five-fold cross-validation
method. To carry out the five-fold validation procedure, the
total of 4925 positive epitopes were split into five groups, and
any two epitope sequences from two different groups could
not have sequence similarity more than 20%. After five rounds,
all positive and negative sites in the whole dataset obtained
prediction scores for analysis. All SVM parameters were
optimized by a grid search ($c = 2^{-10}$ to $2^{1}$,
$g = 2^{-12}$ to $2^{3}$, and $p = 2^{-5}$ to $2^{2}$), and the set of
parameters resulting in the highest AUC values was selected. For each grid
point of the triplets, a five-fold cross-validation procedure is
employed to evaluate the performance of the trained SVM
model.
11. For the application on the online server, the prediction model
is obtained by training the whole dataset with the same num-
bers of positive and negative epitopes.
12. For example, if the length of a given epitope candidate is
20 AA, the tri-peptide subsequence similarity kernel for the i-
th attribute is generated by summing over similarity scores of
the 18 pairs of tri-peptides; each pair consists of one tri-peptide
from the input sequence, and the tri-peptide represents i-th
attribute from the tri-peptide subsequence space.
13. A similar subsequence kernel, but only for two amino acid
combination, was previously used to predict protein subcellular
localization by Lei and Dai [26].
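The input handling described in Notes 2 to 4 can be mimicked locally before submitting a sequence. The sketch below is an assumption about equivalent preprocessing, not the webserver's code: the function names are hypothetical, lower-case residues are upper-cased as a convenience, and FASTA header lines are assumed to have been stripped already.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
ALLOWED_LENGTHS = (10, 12, 14, 16, 18, 20)  # epitope lengths listed in Note 3

def clean_sequence(raw):
    """Keep only the 20 standard amino acid characters (Notes 2 and 4)."""
    return "".join(ch for ch in raw.upper() if ch in STANDARD_AMINO_ACIDS)

def candidate_subsequences(sequence, length=20):
    """Sliding window of the chosen length over the full sequence (Note 3)."""
    if length not in ALLOWED_LENGTHS:
        raise ValueError(f"length must be one of {ALLOWED_LENGTHS}")
    return [(start, sequence[start:start + length])
            for start in range(len(sequence) - length + 1)]
```

Each returned pair holds the zero-based start position and the subsequence that would be scored as a candidate epitope.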
Acknowledgments
The work is supported by funding under C.Z.’s startup funds from
the University of Nebraska, Lincoln, NE. This work was completed
utilizing the Holland Computing Center of the University of
Nebraska.
References
1. Getzoff ED, Tainer JA, Lerner RA et al (1988) The chemistry and mechanism of antibody binding to protein antigens. Adv Immunol 43:1–98
2. Milich DR (1989) Synthetic T and B cell recognition sites: implications for vaccine development. Adv Immunol 45:195–282
3. Parker JM, Guo D, Hodges RS (1986) New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 25(19):5425–5432
4. Hopp TP, Woods KR (1981) Prediction of protein antigenic determinants from amino acid sequences. Proc Natl Acad Sci U S A 78(6):3824–3828
5. Emini EA, Hughes JV, Perlow DS et al (1985) Induction of hepatitis A virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol 55(3):836–839
6. Pellequer JL, Westhof E, MHV V (1993) Correlation between the location of antigenic sites and the prediction of turns in proteins. Immunol Lett 36(1):83–100
7. Karplus PA, Schulz GE (1985) Prediction of chain flexibility in proteins - a tool for the selection of peptide antigens. Naturwissenschaften 72(4):212–213
8. Kolaskar AS, Tongaonkar PC (1990) A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett 276(1–2):172–174
9. Vita R, Zarebski L, Greenbaum JA et al (2010) The immune epitope database 2.0. Nucleic Acids Res 38(Database issue):D854–D862. https://doi.org/10.1093/nar/gkp1004
10. Saha S, Bhasin M, Raghava GP (2005) Bcipep: a database of B-cell epitopes. BMC Genomics 6:79. https://doi.org/10.1186/1471-2164-6-79
11. Schonbach C, JLY K, Sheng X et al (2000) FIMM, a database of functional molecular immunology. Nucleic Acids Res 28(1):222–224
12. Larsen JE, Lund O, Nielsen M (2006) Improved method for predicting linear B-cell epitopes. Immunome Res 2:2. https://doi.org/10.1186/1745-7580-2-2
13. Saha S, GPS R (2006) Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins 65(1):40–48. https://doi.org/10.1002/Prot.21078
14. Chen J, Liu H, Yang J et al (2007) Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids 33(3):423–428. https://doi.org/10.1007/S00726-006-0485-9
15. El-Manzalawy Y, Dobbs D, Honavar V (2008) Predicting linear B-cell epitopes using string kernels. J Mol Recognit 21(4):243–255. https://doi.org/10.1002/Jmr.893
16. Yao B, Zhang L, Liang S et al (2012) SVMTriP: a method to predict antigenic epitopes using support vector machine to integrate tri-peptide similarity and propensity. PLoS One 7(9):e45152. https://doi.org/10.1371/journal.pone.0045152
17. Alix AJ (1999) Predictive estimation of protein linear epitopes by using the program PEOPLE. Vaccine 18(3–4):311–314
18. Odorico M, Pellequer JL (2003) BEPITOPE: predicting the location of continuous epitopes and patterns in proteins. J Mol Recognit 16(1):20–22. https://doi.org/10.1002/jmr.602
19. Wee LJ, Simarmata D, Kam YW et al (2010) SVM-based prediction of linear B-cell epitopes using Bayes feature extraction. BMC Genomics 11(Suppl 4):S21. https://doi.org/10.1186/1471-2164-11-S4-S21
20. Wang Y, Wu W, Negre NN et al (2011) Determinants of antigenicity and specificity in immune response for protein sequences. BMC Bioinformatics 12:251. https://doi.org/10.1186/1471-2105-12-251
21. Gao J, Faraggi E, Zhou Y et al (2012) BEST: improved prediction of B-cell epitopes from antigen sequences. PLoS One 7(6):e40104. https://doi.org/10.1371/journal.pone.0040104
22. Altschul SF, Madden TL, Schaffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402
23. Joachims T (1999) Making large-Scale SVM Learning Practical. In: Schölkopf B, Burges C (eds) Advances in Kernel Methods-Support Vector Learning. MIT Press, Cambridge, MA, pp 169–184
24. Pruitt KD, Tatusova T, Klimke W et al (2009) NCBI reference sequences: current status, policy and new initiatives. Nucleic Acids Res 37(Database issue):D32–D36. https://doi.org/10.1093/nar/gkn721
25. Biswas AK, Noman N, Sikder AR (2010) Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information. BMC Bioinformatics 11:273. https://doi.org/10.1186/1471-2105-11-273
26. Lei Z, Dai Y (2005) An SVM-based system for predicting protein subnuclear localizations. BMC Bioinformatics 6:291. https://doi.org/10.1186/1471-2105-6-291
Chapter 18
Modeling Phage–Bacteria Dynamics

Saptarshi Sinha, Rajdeep Kaur Grewal, and Soumen Roy
Abstract
Phage–bacteria interaction is a classic example of competitive coevolution in nature. Mathematical model-
ing of such interactions furnishes new insight into the dynamics of phage and bacteria. Besides its intrinsic
value, a somewhat underutilized aspect of such insight is that it can provide beneficial inputs toward better
experimental design. In this chapter, we discuss several modeling techniques that can be used to study the
dynamics between phages and their host bacteria. Monte Carlo simulations and differential equations (both
ordinary and delay differential equations) can be used to successfully model phage–bacteria dynamics in
well-mixed populations. The presence of spatial restrictions in the interaction media significantly affects the
dynamics of phage–bacteria interactions. For such cases, techniques like cellular automata and
reaction–diffusion equations can be used to capture these effects adequately. We discuss details of the modeling
techniques with specific examples.
Key words Phage–bacteria interactions, Monte Carlo simulations, Differential equations,
Reaction–diffusion equations, Cellular automata
1 Introduction
1.1 Phage–Bacteria Dynamics
Bacteriophages are viruses that depend on host bacterial cells
for their metabolic and reproductive needs. The distribution
of bacteriophages depends on the distribution of the population
of their host cells in nature, as phages replicate inside the host
cell. A phage particle chiefly consists of nucleic acid as its genetic
material and proteins in its outer coating. The genetic material of a
phage could be either DNA or RNA in a single-stranded or double-
stranded form. DNA phages have a much simpler replication cycle
than RNA phages. The overall size of a phage particle varies
between 20 and 200 nm in either length or diameter, depending
on its shape [1].
The replication cycle of a phage particle can be broadly classi-
fied into two types: lytic and lysogenic. The lytic cycle is much
simpler than the lysogenic cycle. At the beginning of a typical
lytic cycle, a phage particle is adsorbed on a host cell surface. This
adsorption is very specific and depends on the host and the phage.

Fig. 1 The lytic and lysogenic cycle of phage
A series of host surface receptors has been identified
in various hosts like E. coli, Salmonella, Mycobacterium, etc. The
receptor molecules vary in their biochemical composition depend-
ing on the host cell [2]. Figure 1 shows the lytic and lysogenic cycle
of a phage.
Upon the adsorption of a phage particle on the bacterial cell
surface, the genetic material of the phage penetrates inside the host
cell. This step is followed by the replication of the phage compo-
nents. Packaging of genetic material of the phage inside the capsid
follows. Ultimately, the newly created phage particles are released
from the host cell through bursting or budding. In this chapter, we
mainly consider the lytic cycle of phage in phage–bacteria dynamics.
On the other hand, in the lysogenic cycle, the phage genetic
material gets integrated into the host chromosome after penetration.
The integrated phage genome is called a prophage. It remains within
the host chromosome for several generations before it returns to the
lytic cycle. During the prophage stage, the phage genetic material
replicates along with the host chromosome.
1.2 Phage–Bacteria Coevolution
Phages and their host bacteria foster a competitive interaction in
natural environments [3, 4]. During their coevolution, bacterial
cells evolve phage resistance mechanisms. These mechanisms are
host specific and have great diversity. These mechanisms involve:
(1) modification of surface-bound receptors, which disrupts
phage adsorption, (2) restriction enzyme modifications, which
phage particles are unable to resist, and (3) modification
of metabolic enzymes, which could inhibit phage propagation.
Simultaneously, phages also develop strategies to overcome
these hurdles. These strategies include modification of tail attachment
proteins to overcome the obstacles of attachment, mutations
in their genetic material to overcome the endonuclease-mediated
phage genome degradation, etc.
Most of the time, we observe a steady state between phages and
bacterial population in nature. In laboratory environments, we
need the host cell culture to maintain a phage population. This is
one of the prime impediments toward the study of a wide range of
phage–bacteria systems, since the majority of total bacteria in
nature are not amenable to culture [5].
1.3 Theoretical Modeling
Phage–bacteria interaction dynamics can be studied either on
solid media like agar or in liquid media like broth. Depending on
the nature of the medium used, the appropriate modeling technique
needs to be selected. A primary motivation for such modeling
is to develop realistic models to the extent possible, which could
help in the development of phage therapy.
Phage–bacteria interactions on agar are generally considered
under the ambit of interactions with spatial restriction, as each
phage particle could only infect its neighboring bacterial cells due
to its limited diffusion rate on the agar plate. The experimental
outcome is generally dependent on plaque morphology, plaque
diameter, etc. Thus, when we consider spatial restrictions in
phage–bacteria dynamics, we also need to incorporate spatial fac-
tors in the model. Examples of such models include reaction–
diffusion equations and cellular automata. On the other hand, if
there is no such spatial restriction on the phage–bacteria interac-
tion, we could consider basic dynamical models dependent on time.
These models include ordinary differential equations, delay differ-
ential equations, and Monte Carlo simulations [6]. In Fig. 2, we
have demonstrated an overall classification of modeling techniques
used in phage–bacteria dynamics.
Fig. 2 Different modeling techniques used to study phage–bacteria dynamics

1.3.1 Models Without Spatial Restriction
When we consider phage–bacteria dynamics in liquid media, every
phage particle is free in principle to interact with any of the host
bacterial cells present in the medium [7]. The spatial restriction is
far less, as the viscosity of the medium is usually rather low.
Moreover, the reaction mixture is generally incubated in the
laboratory with shaking at 120–180 rotations per minute (RPM) during
the interaction time, which creates a well-mixed environment. In such
situations, the experimental results are based on the number of host
cells and phage particles at a particular instant of time. We can
experimentally measure various parameters such as host growth
rate, phage adsorption rate, etc. We can also measure the burst
size, which depends on the specific phage–bacteria system. To
model such scenarios, we can use either Monte Carlo simulations
or a set of differential equations.
1.3.2 Models with Spatial Restriction
When we discuss spatial restrictions in phage–bacteria interactions,
we mainly focus on interactions in solid or
semi-solid media. Such interactions can be found in both laboratory
and natural environments. The clear zone formed on an
opaque bacterial lawn due to phage propagation is called a plaque.
This plaque or plaque-forming unit is used in a laboratory environ-
ment for counting phages. This plaque formation is used not only
for quantification of phage numbers but also for studying the
phage–bacteria dynamics on solid media.
In this method, a phage sample is diluted and mixed with
0.4–0.8% agar, which is called soft or top agar. A host bacterial
suspension, which acts as the indicator bacteria, is also added to this
phage–agar mixture. The combination is then poured onto a
1.8–2% hard agar plate and incubated overnight at a suitable temperature.
After the incubation period has elapsed, clear plaques
are visible on the hazy lawn of indicator bacteria.
Now, this plaque morphology depends on various parameters,
which would be incorporated in our theoretical model. These
parameters include plaque diameter, agar concentration, phage
particle diffusion in soft agar, mechanism of release of phage parti-
cle, the density of host cell, etc. The size and appearance of plaque
are dependent not only on the specific host–phage system but also
on the above-mentioned parameters in a complex manner. Figure 3
shows plaque formation on agar and various factors controlling
plaque diameter.
The adsorption events inside the soft agar depend on agar
density, phage diffusion rate, adsorption rate, host cell density,
and latent period. In a host-crowded environment, the adsorption
rate will be high due to the availability of host cells for phage
adsorption. But at high agar density, it is difficult for a
phage particle to diffuse toward the host cell. Thus, agar density
has a negative effect on phage adsorption. After the first round of
infection, the burst size plays an essential role in determining the
plaque character. A larger burst size indicates that a higher number
of phage particles will be released from an infected cell and will be
available for the next round of infection in the surroundings of an
uninfected cell. The clarity of a plaque depends on both the burst
size and host cell density.
Fig. 3 Plaque formation on agar and various factors controlling plaque diameter
2 Methods
2.1 Monte Carlo Simulations
Among various simulation techniques, Monte Carlo is one of the
most widely applied methods, used in fields as diverse as economics
and biology [8].
Let us consider a simple example of the Monte Carlo simula-
tions. Suppose we are throwing two dice together, each of which
could exhibit values from one to six in a given throw. Now we have
to calculate the probability of a particular value, which is the sum of
the values shown by the two dice when thrown together.
There are 36 possible combinations for two dice, when they are
cast together. We can calculate the probability of a particular out-
come as represented in Fig. 4a. For example, let us consider a
combined score of 5. This score is attainable in the following ways
with a pair of dice: (1,4), (2,3), (3,2), and (4,1). The probability of
scoring 5 will, therefore, be 4/36 ≈ 0.111.
Now computationally we could simulate the same thing repeat-
edly by generating two mutually independent natural random
numbers in the range of 1–6 and then adding these numbers
using Monte Carlo simulations. We can thus calculate the approxi-
mate value of the probability of any given score with two dice. For
10,000 trials, the outcome is represented in Fig. 4b. Obviously, the
determination of this probability can be improved by increasing the
number of trials.
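The dice example can be reproduced with a short script. The sketch below uses Python's standard `random` module; the function names and the seed value are arbitrary choices made for illustration.

```python
import random
from collections import Counter

def exact_probability(total):
    """Exact probability of a given sum, counted over the 36 combinations."""
    ways = sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == total)
    return ways / 36

def simulate_two_dice(trials=10_000, seed=1):
    """Monte Carlo estimate of the probability of each sum of two dice."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) + rng.randint(1, 6)
                     for _ in range(trials))
    return {total: counts[total] / trials for total in range(2, 13)}

if __name__ == "__main__":
    estimate = simulate_two_dice()
    for total in range(2, 13):
        print(total, round(exact_probability(total), 3), estimate[total])
```

Increasing `trials` brings the estimated values closer to the exact ones, which is the improvement mentioned above.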
Somewhat removed from our present discussion, in the
field of finance, Monte Carlo simulations can be similarly used to
carry out risk analysis from such distributions for given scenarios.
Fig. 4 (a) Probability of obtaining various outcomes when a pair of dice is thrown simultaneously.
(b) The probability distribution for 10,000 trials
2.1.1 Multiplicity of Infection (MOI)
The multiplicity of infection, MOI, of phage–bacteria dynamics
indicates the availability of an uninfected host cell for a phage
particle. In a laboratory environment, various values of MOI are
used in experiments. MOI of a phage–bacteria mixture is calculated
as the ratio of the number of phage particles to the number of
uninfected host cells. For example, if 1000 phage particles are
allowed to infect 1000 host cells in a phage–bacteria mixture,
then the MOI of the mixture will be 1. Similarly, if 1000 phage
particles are allowed to infect 10,000 host cells, then the MOI will
be 0.1.
2.1.2 Probability of Adsorption
When MOI = 1, it is theoretically possible that
one host bacterial cell will be available for each phage particle in the
mixture. However, in well-mixed conditions, this will not be the
case every time. Instead, there is a certain probability that a phage
particle is adsorbed on the cell surface of a host. A Poisson distribution
captures this probability quite well [9]. The probability of ‘y’ phage
particles being adsorbed on a host cell, when the mixture has an
MOI of ‘x’, can be determined by the following expression:
$$P(y) = \frac{x^{y} e^{-x}}{y!}$$
This probability helps simulate phage–bacteria dynamics using
Monte Carlo simulations.
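As a small illustration, the MOI and the Poisson expression above can be evaluated directly. The function names below are hypothetical; the last helper uses the fact that the probability of at least one adsorbed phage is 1 − P(0).

```python
import math

def moi(phage_count, host_count):
    """Multiplicity of infection: phage particles per uninfected host cell."""
    return phage_count / host_count

def adsorption_probability(y, x):
    """Poisson probability that exactly y phages adsorb on a cell at MOI x."""
    return (x ** y) * math.exp(-x) / math.factorial(y)

def infected_fraction(x):
    """Probability that a cell adsorbs at least one phage: 1 - P(0)."""
    return 1.0 - adsorption_probability(0, x)
```

At an MOI of 1, for example, `adsorption_probability(0, 1)` equals e⁻¹, so roughly 37% of the host cells are expected to remain without an adsorbed phage.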
2.1.3 Probability of Division
Another important probability, which needs to be considered in
phage–bacteria dynamics, is the probability of division of the host
cell. A bacterial population is usually not synchronized from the
reproductive point of view. The time point of cell
division varies from cell to cell, though the generation time is
the same. This is why the bacterial growth curve is
represented by a smooth line and not by steps. Bacterial growth is
generally quantified by the growth rate constant ‘μ’. This is
expressed as
$$\mu = \frac{\partial \ln(N)}{\partial t}$$
Here ‘∂ln(N)’ is the change in the natural logarithm of the cell
count ‘N’ over the time interval ‘∂t’. The growth rate is expressed in
terms of generations/unit time. For example, a rate constant of
μ = 0.3 implies that within unit time, 0.3 generation of a
bacterial cell is created. We can interpret this growth rate as the
probability of cell division: μ = 0.3 indicates that at each time
instant, every cell has a probability of 0.3 (a 30% chance) of
dividing [10]. We also need
to factor in this probability for simulating phage–bacteria dynamics
in Monte Carlo simulations.
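A brief sketch of how μ might be estimated from two cell counts, and of the growth implied by a per-step division probability, is shown below. Both function names are hypothetical. The second function assumes that every cell divides independently with probability p at each step, in which case the expected population grows by a factor of (1 + p) per step.

```python
import math

def growth_rate(n_start, n_end, elapsed):
    """Finite-difference estimate of mu = d ln(N) / dt between two counts."""
    return (math.log(n_end) - math.log(n_start)) / elapsed

def expected_population(n0, division_probability, steps):
    """Expected cell count if each cell divides with probability p per step."""
    return n0 * (1.0 + division_probability) ** steps
```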
2.1.4 Latent Period and Burst Size
Two other parameters need to be considered in phage–bacteria
dynamics, namely, the latent period and burst size. We saw in the
lytic cycle that after penetration phages require some time for
synthesis and assembly of new particles. This time gap between
infection and release of a new phage particle is referred to as the
latent period. The phage growth curve is represented as a one-step
growth curve. On the other hand, burst size represents the number
of new phage particles released from an infected cell after successful
phage infection. We consider these two parameters in our Monte
Carlo simulations.
2.1.5 Algorithm for Monte Carlo Simulations
An outline of the algorithm for Monte Carlo simulations toward
simulating a typical lytic cycle in phage–bacteria dynamics is represented
in Fig. 5. Here, we need to generate two pseudo-random
numbers (R1, R2) from a uniform distribution using two different
“seeds” for every host cell at every time point. If R1 is less than the
adsorption probability mentioned earlier, then the host cell would
be infected by a phage particle. If no infection occurs, then we need
to consider R2. If R2 is less than the division probability, then the
uninfected cell will divide. Otherwise, no division would occur for
that cell at that time instant. Subsequently, for infected cells, we
need to consider a latent time period. After this duration, a particular
fraction of cells will burst out and release new phage particles
according to the predetermined burst size.
Fig. 5 Algorithm for Monte Carlo simulation of phage–bacteria dynamics. Here, U, P, and I denote the number
of uninfected host, phage, and infected host cells, respectively. Dp and Ap denote the probability of host cell
division and phage adsorption, respectively. Lp is the latent period, while R1 and R2 are random numbers from
a uniform distribution
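The outline above can be turned into a compact simulation. The following sketch is one possible reading of the algorithm, not the authors' code. It assumes constant probabilities Ap and Dp, that each infection removes one free phage, and that every infected cell lyses exactly Lp steps after infection (the text speaks of a fraction of cells). The function name, argument names, and seeds are hypothetical.

```python
import random

def simulate_lytic_cycle(U, P, Ap, Dp, Lp, burst_size, steps, seeds=(1, 2)):
    """Return a list of (time, uninfected, infected, phage) tuples."""
    rng1 = random.Random(seeds[0])  # generator for R1 (adsorption)
    rng2 = random.Random(seeds[1])  # generator for R2 (division)
    infected_at = []                # infection time of each infected cell
    history = [(0, U, 0, P)]
    for t in range(1, steps + 1):
        # Lysis: cells whose latent period has elapsed release new phages.
        remaining = []
        for t_inf in infected_at:
            if t - t_inf >= Lp:
                P += burst_size
            else:
                remaining.append(t_inf)
        infected_at = remaining
        newly_infected = 0
        newly_divided = 0
        for _ in range(U):
            r1 = rng1.random()
            if newly_infected < P and r1 < Ap:
                newly_infected += 1      # cell adsorbs a phage
            else:
                r2 = rng2.random()
                if r2 < Dp:
                    newly_divided += 1   # uninfected cell divides
        U += newly_divided - newly_infected
        P -= newly_infected              # assumption: one phage per infection
        infected_at.extend([t] * newly_infected)
        history.append((t, U, len(infected_at), P))
    return history
```

A call such as `simulate_lytic_cycle(1000, 1000, 0.1, 0.3, 5, 50, 30)` returns the time course of U, I, and P; the parameter values in this call are purely illustrative.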
2.2 Differential Equations
Ordinary differential equations (ODE) consist of derivatives
depending only upon the present value of the function at time ‘t’:
$$\frac{dx}{dt} = f(t, x(t))$$
Here, ‘x’ is the dependent variable and ‘t’ is the independent
variable. However, this is not the case for delay differential equations
(DDE). In a DDE, the derivative of a function depends not
only on the present time instant ‘t’, but also on time instants prior
to ‘t’. In other words,
$$\frac{dx}{dt} = f(t, x(t), x(t-\tau_1), x(t-\tau_2), \ldots, x(t-\tau_n))$$
Here, τ1, τ2, ..., τn denote time delays and are
positive constants. In many biological systems, the outcome of a
function might depend on its previous values, as in the prey–predator
model. The birth rate of predators may depend not only on the
current numbers of predators and prey but also on their numbers
at earlier time instants. Thus, DDEs should naturally be more
effective in capturing the population dynamics of phage–bacteria
interactions. However, a system of ODEs often renders
significant insights into phage–bacteria population dynamics, as
discussed later. Though DDEs are effective in dealing with factors
that are common in real-world systems, they are not easily solvable
and are usually solved using numerical analyses. Various DDE
solving programs are available now. Pydelay in Python [11] and
DDE solver for FORTRAN 90 and FORTRAN 95 [12] are some
of the most frequently used program solvers in recent times.
Below we describe a DDE model, which is one of the basic
models to analyze the population dynamics of phage–bacteria inter-
actions in well-mixed populations. Depending upon the system,
new parameters can be introduced for a better understanding of the
systems under consideration. But one should be careful when
introducing a new parameter. Consideration of too many para-
meters results in complex mathematical models. Besides, when
the reason for the import of parameters is not obvious or clear,
their significance is lost as opposed to a parameter, which can be
experimentally determined. Thus, a mathematical model must be
able to mimic a real-world system at least to a first approximation, as
well as be analytically or numerically solvable.
One of the first delay differential models to describe the phage–
bacteria interactions was introduced by Campbell in 1961 [13]. He
considered a model of two interacting populations—susceptible
bacteria, S, and free phage, P. The model was built to understand
the population dynamics between phage and bacteria under
chemostat-like conditions and is described by following equations:

$$\dot{S}(t) = \alpha S(t)\left(1 - \frac{S(t)}{C}\right) - k S(t) P(t) - a S(t)$$
$$\dot{P}(t) = b k S(t-\tau) P(t-\tau) - \mu_p P(t) - a P(t)$$
The above equations determine the rate of change of concen-
tration of bacteria and free phage, respectively, with respect to time.
Let us first discuss the parameters affecting the change of concentration
of bacteria. ‘α’ corresponds to the growth rate of susceptible
bacteria and ‘C’ is the carrying capacity of the bacterial population.
The rate of adsorption of free phage particle by bacterial cells is
denoted by ‘k’. The phage–bacteria infection is modeled upon the
principle of mass action. In a well-mixed population, the rate at
which two populations interact with each other is directly propor-
tional to the product of the size of both populations. It is to be
noted here that adsorption is considered as an irreversible process.
The constant removal rate of susceptible bacteria and free phage is
denoted by ‘a’. For the case of free phage particles, an infected cell
is lysed after a fixed amount of time called the latent period,
denoted by ‘τ’. Upon lysis, each infected bacterium releases ‘b’ new
free phage particles. ‘b’ is known as burst size. Thus, the number of
free phage particles at time ‘t’ depends on the number of interact-
ing susceptible bacterial cells and free phage particles at time (t-τ).
The rate of spontaneous decay or inactivation of free phage parti-
cles is denoted by ‘μp’. For simplicity, in this model, all rate para-
meters are considered to be constants. In the later sections of this
chapter, the definition of the variables described in the above DDE
model remains the same throughout. New parameters, if intro-
duced, will be explained accordingly.
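To illustrate how such a delay model can be integrated numerically, the sketch below applies a fixed-step explicit Euler scheme with a stored history to the two equations above. It is a simple stand-in for the dedicated DDE solvers mentioned earlier. The function name is hypothetical, the pre-history is assumed constant (S = S0 and P = P0 for t ≤ 0), and any parameter values passed in are illustrative rather than measured.

```python
def simulate_campbell(alpha, C, k, a, b, tau, mu_p, S0, P0, t_end, dt=0.01):
    """Explicit Euler integration of the delay model for S(t) and P(t)."""
    n_delay = int(round(tau / dt))   # the delay expressed in time steps
    n_steps = int(round(t_end / dt))
    S, P = [S0], [P0]
    for n in range(n_steps):
        # Delayed values; constant pre-history assumed for t - tau < 0.
        j = n - n_delay
        S_del = S[j] if j >= 0 else S0
        P_del = P[j] if j >= 0 else P0
        dS = alpha * S[n] * (1.0 - S[n] / C) - k * S[n] * P[n] - a * S[n]
        dP = b * k * S_del * P_del - mu_p * P[n] - a * P[n]
        # Clamp at zero as a numerical guard against negative populations.
        S.append(max(S[n] + dt * dS, 0.0))
        P.append(max(P[n] + dt * dP, 0.0))
    return S, P
```

In the absence of phage (P0 = 0), the bacterial population settles at C(1 − a/α), which gives a convenient check on the implementation. For production work, an adaptive DDE solver is preferable to this fixed-step scheme.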
2.3 Reaction–Diffusion Equations
Earlier we have described the use of ODE and DDE to understand
phage–bacteria interactions in well-mixed environments. However,
most bacteria exist in the form of biofilms. Thus, for a better
understanding of phage–bacteria ecology, we need to explore
methodologies to understand the growth of phage population in
spatially constrained environments. This is because unlike well-
mixed environments, here the interaction between a bacterial pop-
ulation and a phage population is subject to spatial limitation.
In the laboratory, such spatially constrained conditions can be
easily seen in agar gels. Also, agar gel provides a simplified setup to
study the growth of phage population in spatially constrained
environments as compared to biofilms. As discussed above, various
factors that are responsible for phage growth within semi-solid
media are infection of bacterial cells, phage particle diffusion, and
phage-induced lysis of bacterial cells [14]. Koch first modeled
plaque formation or phage growth in semi-solid media in 1964
[15]. The plaque enlargement rate, ‘r’, was estimated to be
proportional to $(D/L)^{1/2}$, i.e.,
$$r = c\left(\frac{D}{L}\right)^{1/2}$$
where ‘D’ is the phage diffusion rate, ‘L’ corresponds to the latent
period, and ‘c’ is the binding constant of phages [15]. A more
mechanistic approach can be adopted by constructing a set of
reaction–diffusion equations to model plaque growth [16]. Three
population interactions were considered for the above modeling,
namely, host bacteria (B), infected host bacteria (I), and free phage
particles (V) given by
$$V + B \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} I \xrightarrow{k_2} Y \cdot V$$
Here, ‘k1’ is the adsorption rate constant, ‘k−1’ corresponds to
the rate constant of desorption of a phage particle from its host bacterial
cell, and ‘k2’ represents the rate constant for lysis of the infected
bacterial cell. ‘Y’ is the burst size per lysed bacterial cell. The
resulting set of equations is as follows [16]:

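$$\frac{\partial [V]}{\partial t} = D\,\frac{\partial^2 [V]}{\partial r^2} + \frac{D}{r}\,\frac{\partial [V]}{\partial r} - k_1 [V][B] + k_{-1}[I] + Y k_2 [I]$$
$$\frac{\partial [B]}{\partial t} = -k_1 [V][B] + k_{-1}[I]$$
$$\frac{\partial [I]}{\partial t} = k_1 [V][B] - k_{-1}[I] - k_2 [I]$$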
where ‘D’ is the phage diffusion rate. It was presumed that the host
cells cannot diffuse. Since the plaques are considered to be radially
symmetric, the above equations were formulated in polar
co-ordinates as a function of time, ‘t’, and position, ‘r’, with
boundary conditions given by:
$$\frac{\partial [V]}{\partial r} = 0 \quad \text{at } r = 0,$$
$$[V] = 0,\ [B] = 0 \ \text{and}\ [I] = 0 \quad \text{as } r \to \infty.$$
The above model was subsequently extended to incorporate
time delay corresponding to phage replication inside infected host
bacterial cells [17].
2.4 Cellular Automata
Cellular automata simulation is an agent-based discrete modeling
system. It has wide applicability in fields like physics, engineering,
and theoretical biology. Stanislaw Ulam and John von Neumann
initially developed the concept of cellular automata in the 1940s,
while at Los Alamos National Laboratory. Subsequently, further
developments have added to the basic technique. Systematic devel-
opments in this area became possible after the eighties, with the
studies of Stephen Wolfram and Matthew Cook.
A cellular automata system consists of a grid of cells in which
every cell has a number of previously determined states. Initially, it
was developed in the 1D lattice and was then extended to 2D and
3D lattices. A cellular automata–based modeling is useful to
describe systems where spatial restrictions are prevalent. Here, the
state of each cell is defined as a function of time and space.
2.4.1 2D Cellular Automata
To describe the basic process of 2D cellular automata, let us
consider an infinitely large sheet of graph paper. Here each small square is
defined as a cell. Each of the cells has one of two definite
states (0 and 1). Every cell also has neighboring cells, which are
defined by one of various neighborhood methods. In 2D cellular
automata, two types of neighborhood methods are mainly
considered: the von Neumann neighborhood and the Moore
neighborhood. The position of each cell is determined by its
coordinate [18]. If we place the central cell at coordinate
(0, 0), the two neighborhood methods consider the coordinates of
the neighboring cells in two different ways.

Fig. 6 Different neighborhood systems in cellular automata: (a) Radial neighborhood, (b) Moore neighborhood,
and (c) von Neumann neighborhood
In 2D cellular automata, the von Neumann neighborhood is the
simplest neighborhood method. According to this method, the
coordinates of the neighboring cells are represented by
$$N = [(0,-1),\,(-1,0),\,(0,0),\,(1,0),\,(0,1)]$$
The Moore neighborhood method considers the neighboring cells
in a slightly different way. According to this method, the coordinates
of the neighboring cells will be
$$N = [(-1,-1),\,(0,-1),\,(1,-1),\,(-1,0),\,(0,0),\,(1,0),\,(-1,1),\,(0,1),\,(1,1)]$$
Other than these, hexagonal and Margolus neighborhood
methods are also used. In 1D cellular automata, on the other hand,
the radial neighborhood method is generally applied. These
neighborhoods are shown in Fig. 6.
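
The two 2D neighborhoods described above can be written down as coordinate offsets from the central cell at (0, 0). The following is a minimal Python sketch; the function names and the clipping at the grid border are assumptions made for illustration (the text considers an unbounded sheet), not part of the cited work.

```python
def von_neumann_offsets():
    """Central cell plus its four edge-sharing neighbours."""
    return [(0, -1), (-1, 0), (0, 0), (1, 0), (0, 1)]


def moore_offsets():
    """Central cell plus all eight surrounding cells, row by row."""
    return [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def neighbours(x, y, offsets, width, height):
    """Coordinates of the neighbours of (x, y) on a finite grid.

    Assumption: cells outside the width x height grid are simply
    dropped; the central cell itself is not returned.
    """
    cells = []
    for dx, dy in offsets:
        nx, ny = x + dx, y + dy
        if (dx, dy) != (0, 0) and 0 <= nx < width and 0 <= ny < height:
            cells.append((nx, ny))
    return cells
```

For an interior cell this gives four neighbours under the von Neumann rule and eight under the Moore rule; corner cells lose the neighbours that fall outside the grid.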
3 SIR-Type Modeling
Cellular automata simulations can be used in SIR-type modeling. S,
I, and R in SIR refer to the susceptible, infected, and removed states,
respectively. This technique was mainly developed to model
epidemic spreading in a population. In this model, a 2D cellular
automaton is laid out as a two-dimensional grid. Each
of the cells has three possible states: S (susceptible), I (infected),
and R (removed). The state of a cell is determined by the states of its
neighbors, using rules similar to those of cellular automata. This
model is useful in modeling plaque dynamics on soft agar. It starts
with an infected cell inside the grid. This infected cell infects
neighboring susceptible cells based on a definite rule or function.
The old infected cells are then removed, or produce a blank
space, after a definite time point, depending on their neighboring
cells. A similar situation arises when we consider plaque
growth. Here, phage infection is initiated with an infected cell. This
infected cell produces new phage particles when it bursts. The
newly produced phage particles infect the neighboring
uninfected cells according to a specific function, which includes
parameters like phage diffusion, agar concentration, host cell
density, etc. A plaque develops when the infected cells are
subsequently converted to the removed state. This removed state
indicates a clear zone or plaque. With this method, we can visualize
plaque formation on a 2D grid, as shown in Fig. 7.

Fig. 7 Representation of plaque propagation in the SIR model obtained by
cellular automata simulations
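
The SIR scheme above can be sketched as a small cellular automaton. The rule used here is an assumption chosen for simplicity, not the rule of any cited study: an infected cell bursts after `lysis_time` steps, becomes removed (plaque), and at that moment infects each susceptible von Neumann neighbour with probability `p_infect`, which stands in for the combined effect of phage diffusion, agar concentration, and host density.

```python
import random

S, I, R = 0, 1, 2                       # susceptible, infected, removed
VON_NEUMANN = ((0, -1), (-1, 0), (1, 0), (0, 1))


def step(grid, age, p_infect, lysis_time, rng):
    """Advance the lawn by one time step and return the new grid and ages."""
    h, w = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]
    new_age = [row[:] for row in age]
    for y in range(h):
        for x in range(w):
            if grid[y][x] != I:
                continue
            new_age[y][x] = age[y][x] + 1
            if new_age[y][x] < lysis_time:
                continue                # still in the latent period
            new_grid[y][x] = R          # burst: the cell joins the plaque
            for dx, dy in VON_NEUMANN:
                nx, ny = x + dx, y + dy
                if (0 <= nx < w and 0 <= ny < h
                        and new_grid[ny][nx] == S
                        and rng.random() < p_infect):
                    new_grid[ny][nx] = I
                    new_age[ny][nx] = 0
    return new_grid, new_age


def simulate_plaque(size, steps, p_infect, lysis_time, seed=0):
    """Start from one infected cell at the centre of a size x size lawn."""
    rng = random.Random(seed)
    grid = [[S] * size for _ in range(size)]
    age = [[0] * size for _ in range(size)]
    grid[size // 2][size // 2] = I
    for _ in range(steps):
        grid, age = step(grid, age, p_infect, lysis_time, rng)
    return grid


def count(grid, state):
    """Number of cells currently in the given state."""
    return sum(row.count(state) for row in grid)
```

With `p_infect = 1.0` the plaque grows as a diamond, one ring of cells per burst; lowering `p_infect` gives ragged plaque edges, and the removed cells trace the clear zone.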
4 Notes
4.1 Monte Carlo Simulations
Monte Carlo simulations have been used to study
mycobacteriophage–mycobacteria infection. We discuss later how
this simulation method can demonstrate the presence of an alternative
killing mechanism of mycobacteria, which is not a phage-mediated
direct killing of the host cell [19]. The initial simulation of phage
dynamics was found to match perfectly with experimental results.
However, the host cell count in the simulations was at variance with
experimental results. This difference can be accounted
for by incorporating a secondary, host cell density-dependent
killing factor in the simulation. It was presumed that, at the time of
burst, this unknown secondary killing factor is produced by
infected cells and causes lysis of neighboring uninfected cells. Later
on, it was found experimentally that reactive oxygen species (ROS)
were indeed produced by infected cells. These ROS were often
found to be lethal for adjacent uninfected cells. The mechanism is
depicted in Fig. 8.

Fig. 8 The secondary killing mechanism of mycobacteria through ROS generation during phage infection
4.2 Ordinary Differential Equations
The phage–bacteria dynamics was modeled using ODEs in which the
degradation rate, ‘μi’, of infected bacterial cells was introduced [20],
with the number of infected bacteria denoted by ‘I’. The model
demonstrates the coexistence of both bacterial host and phages. It
is as follows:
$$\begin{aligned}
\dot{S}(t) &= \alpha S(t)\left(1 - \frac{S(t)}{C}\right) - k S(t) P(t) \\
\dot{I}(t) &= k S(t) P(t) - \mu_i I(t) \\
\dot{P}(t) &= b I(t) - \mu_p P(t)
\end{aligned}$$
It was shown that at stable equilibrium, which depends upon
the carrying capacity of the bacterial cell population, both phage
and host coexist. Furthermore, a carrying capacity of the host
population, ‘C’, with C < μiμp/(bk), results in the extinction of the
phage population along with the infected host cells [20].
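
The extinction condition can be checked numerically. The sketch below integrates the three equations with a hand-written fourth-order Runge-Kutta step; the function names, step size, and all parameter values in the usage note are illustrative assumptions, not values from [20].

```python
def derivatives(state, alpha, C, k, mu_i, mu_p, b):
    """Right-hand side of the S, I, P equations given above."""
    S, I, P = state
    dS = alpha * S * (1.0 - S / C) - k * S * P
    dI = k * S * P - mu_i * I
    dP = b * I - mu_p * P
    return (dS, dI, dP)


def rk4_step(state, dt, **params):
    """One classical Runge-Kutta step of size dt."""
    def shifted(base, slope, h):
        return tuple(x + h * s for x, s in zip(base, slope))

    s1 = derivatives(state, **params)
    s2 = derivatives(shifted(state, s1, dt / 2), **params)
    s3 = derivatives(shifted(state, s2, dt / 2), **params)
    s4 = derivatives(shifted(state, s3, dt), **params)
    return tuple(x + dt / 6 * (a + 2 * b2 + 2 * c + d)
                 for x, a, b2, c, d in zip(state, s1, s2, s3, s4))


def simulate(state, t_end, dt=0.01, **params):
    """Integrate from t = 0 to t_end and return the final (S, I, P)."""
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(state, dt, **params)
    return state


def phage_persistence_threshold(mu_i, mu_p, b, k):
    """Carrying capacity below which the phage dies out: mu_i*mu_p/(b*k)."""
    return mu_i * mu_p / (b * k)
```

As a usage example, with the made-up values `alpha=1.0, k=0.1, mu_i=1.0, mu_p=1.0, b=5.0` the threshold is 2; running `simulate((0.5, 0.0, 1.0), 60.0, C=1.0, ...)` with these values drives the phage population toward zero while the host settles at its carrying capacity.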
The incorporation of host response against both bacteria and
phage into the system of ODE provides significant insights into
phage–bacteria interactions [21]. The therapeutic responses were
shown to be dependent on various density-dependent thresholds.
In passive therapy, a net decrease in the bacterial population occurs due
to lysis caused by inoculated phages (primary infection). Such an
outcome depends on the concentration of the phage. On the
other hand, active therapy, where secondary infection due to lysis
is the prime factor in the removal of bacterial population, depends
on the bacterial concentration as well as on the timing of phage
inoculation.
A set of ODEs was used to study multiple phage adsorption to
a single bacterial cell [22]. A basic reproduction number, ‘R0’, was
linked to phage proliferation. It was shown that both host and
phage can persist if R0 > 1. Moreover, the rate of phage adsorption
on a single bacterial cell is binomially distributed, such that the
mean of the distribution increases with time.
Another deterministic model to understand within-host
dynamics of cholera was formulated using a system of ODEs
[23]. The study considered interactions between human bacteria
and environmental bacteria and the virus. It was observed that the
disease dynamics depends on the basic reproduction number, R´0,
which consists of two components. One component corresponds
to the intrinsic growth of human bacteria, whereas the other relates
to bacteria–virus interaction. R´0 > 1 results in the growth of
human bacteria, which ultimately leads to cholera infection. For
R´0 < 1, the ingested environmental bacteria would not cause
human cholera. Further, mathematical modeling of population
dynamics in Salmonella Enteritidis and ΦSan23 phage system pre-
dicts an increase in the effectiveness of phage-therapy treatment in
the presence of a growth-inhibiting albeit non-lethal
antibiotic [24].
4.3 Delay Differential Equations
The model proposed by Campbell was further extended by the
inclusion of infected bacteria in the interacting population to
study phage–bacteria dynamics [25]. Lenski and Levin
incorporated resource concentrations into the system of DDEs to
describe the evolutionary constraints for E. coli and a virulent phage
in chemostat [26]. A heterogeneous population of bacterial cells
has also been considered [27]. The division of population is based
upon the number of receptors present on each bacterial cell wall.
Cells having the same number of receptors are considered to exhibit
a similar sensitivity to phage infection. It was observed that hetero-
geneity imparts robustness to the bacterial population toward sur-
vival under strong phage pressure. This is different from the case of
homogeneous bacterial populations, which leads to the extinction
of susceptible bacteria in the long run.
It is a widely known fact that bacteria can develop a certain
degree of resistance against the interacting phage due to coevolu-
tion. Many studies have been conducted where such characteristics
of bacteria have been included to model the population dynamics
between phage and bacteria. One of these calculated the bacterial
mutation rate at which the susceptible bacteria gain resistance
against the phage [28]. The study identified various crucial
thresholds, such as the inundation threshold, which is the minimum
free phage population required to cause a decline in the
susceptible bacterial population. The inundation threshold is
defined as follows:
$$P_I = \frac{\alpha - f}{k}$$
Here, ‘f’ represents the rate of bacterial mutation. Another
threshold is the proliferation threshold. For a rise in phage
population, the bacterial population must exceed the proliferation
threshold as ascertained by
$$S_p = \frac{\mu_p}{k(b - 1)}$$
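
Both thresholds are simple ratios and can be evaluated directly. The helper names below are hypothetical, and the sketch reads the inundation threshold as (α − f)/k and the proliferation threshold as μp/(k(b − 1)); the numbers in the usage note are illustrative only.

```python
def inundation_threshold(alpha, f, k):
    """Minimum free-phage density needed for susceptible bacteria to decline.

    alpha: bacterial growth rate, f: mutation rate to resistance,
    k: adsorption rate constant.
    """
    return (alpha - f) / k


def proliferation_threshold(mu_p, k, b):
    """Minimum bacterial density needed for the phage population to rise.

    mu_p: phage decay rate, k: adsorption rate constant, b: burst size.
    """
    if b <= 1:
        raise ValueError("burst size b must exceed 1 for phage to proliferate")
    return mu_p / (k * (b - 1))
```

For example, `inundation_threshold(1.0, 0.0, 0.5)` evaluates to 2.0 and `proliferation_threshold(2.0, 0.5, 5)` to 1.0, in whatever density units the rate constants are expressed.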
The cost of resistance to phage infection was incorporated to
analyze the interactions between bacteriophage-resistant and
bacteriophage-sensitive bacteria in a chemostat [29]. The necessary
and sufficient conditions for the persistence of phage-resistant
bacteria were also determined.
A study based on mycobacteriophage D29 and Mycobacterium
smegmatis interactions uncovered a mechanism of “secondary kill-
ing” for bacterial cell death, apart from the well-known phenome-
non of (primary) death by cell lysis [19]. Secondary killing
mechanism was mentioned in Section 4.1, while discussing
Monte Carlo simulations. The experimental results were modeled
using a system of DDEs. Further, it was initially presumed that only
a fraction of infected cells were lysed. This fraction is denoted by
‘m’. The resulting mathematical model can be represented by the
following system of DDEs:
$$\begin{aligned}
\frac{dS}{dt} &= \underbrace{\alpha S(t)}_{\text{cell growth}} - \underbrace{r S(t) P(t)}_{\text{cell decay due to adsorption}} \\
&\quad - \underbrace{q\,m\,r\,S(t-\tau)\,P(t-\tau)\,S(t)\,\exp\!\left[-t/(a\tau)\right]\,\mathrm{Heavi}(t-\tau)}_{\text{secondary cell decay due to release of superoxide from lysed bacteria}} \\
\frac{dI}{dt} &= \underbrace{r S(t) P(t)}_{\text{infected cells due to adsorption}} - \underbrace{m\,r\,S(t-\tau)\,P(t-\tau)\,\mathrm{Heavi}(t-\tau)}_{\text{fraction of cells lysed (infected cell population decay due to lysis)}} \\
\frac{dP}{dt} &= \underbrace{b}_{\text{burst size}}\;\underbrace{m\,r\,S(t-\tau)\,P(t-\tau)\,\mathrm{Heavi}(t-\tau)}_{\text{fraction of cells lysed resulting in new phages}} - \underbrace{r S(t) P(t)}_{\text{phage decay due to adsorption}}
\end{aligned}$$
However, the factor ‘m’ alone cannot account for the
experimentally observed numbers of phage and bacteria. Only upon
incorporating an additional parameter, ‘q’, can we account for the
experimental observations. This parameter comes into play only once
lysis occurs after the primary infection. The predicted effects
of this “secondary killing factor”, represented mathematically above
by ‘q’, have been verified experimentally [19].
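
To see how ‘q’ acts only after the first round of lysis, the system can be integrated with a fixed-step Euler scheme and a stored history for the delayed terms. This is a rough sketch under stated assumptions: the exponential is taken as decaying, exp[−t/(aτ)], the history before t = 0 is ignored because the Heaviside factor switches the delayed terms off until t ≥ τ, and every parameter value in the usage note is made up for illustration. A dedicated DDE solver would be preferable for quantitative work.

```python
import math


def simulate_dde(alpha, r, b, m, q, a, tau, S0, P0, t_end, dt=0.001):
    """Euler integration of the S, I, P delay equations given above.

    Returns three lists holding S, I and P at every time step.
    """
    n_steps = int(round(t_end / dt))
    lag = int(round(tau / dt))              # the delay tau, in steps
    S, I, P = [S0], [0.0], [P0]
    for n in range(n_steps):
        t = n * dt
        s, i, p = S[n], I[n], P[n]
        if n >= lag:                        # Heavi(t - tau) = 1
            delayed = r * S[n - lag] * P[n - lag]
        else:                               # Heavi(t - tau) = 0
            delayed = 0.0
        adsorption = r * s * p
        secondary = q * m * delayed * s * math.exp(-t / (a * tau))
        S.append(s + dt * (alpha * s - adsorption - secondary))
        I.append(i + dt * (adsorption - m * delayed))
        P.append(p + dt * (b * m * delayed - adsorption))
    return S, I, P
```

Running the function twice, once with `q = 0` and once with `q > 0`, gives identical trajectories up to t = τ; from the first lysis onward the run with `q > 0` loses susceptible cells faster, which is the qualitative signature of secondary killing described in the text.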
4.4 Reaction–Diffusion Equations
After Koch’s model for plaque enlargement, various other models
have been proposed that shed light on the factors
crucial for plaque enlargement [30–32].
Time delay has been included in the set of reaction–diffusion
equations [33]. Using numerical analyses, an approximate value of
plaque development rate is obtained, which is closer to that
observed experimentally. A solution is also obtainable for the fol-
lowing set of equations [33]:
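$$\begin{aligned}
\frac{\tau}{2}[V]_{tt} + [V]_t &= D_{\mathrm{eff}}[V]_{rr} - k_1\left\{[V][B] + \frac{\tau}{2}\big([V][B]\big)_t\right\} \\
&\quad + Y k_2\left\{[I]\left(1 - \frac{[I]}{[I]_{\max}}\right) + \frac{\tau}{2}\left([I]\left(1 - \frac{[I]}{[I]_{\max}}\right)\right)_t\right\} \\
[B]_t &= -k_1 [V][B] \\
[I]_t &= k_1 [V][B] - k_2 [I]\left(1 - \frac{[I]}{[I]_{\max}}\right)
\end{aligned}$$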
Here, [. . .] corresponds to the concentration. The sub-indices
[. . .]rr and [. . .]tt represent second-order derivatives w.r.t. position and
time, respectively, and [. . .]t denotes the first-order time derivative. ‘τ’ is the latent
period. The remaining variables are the same as defined previously.
‘Deff’ is the hindered diffusion constant and is related to the diffusion
constant ‘D’ according to the following equation:
$$D_{\mathrm{eff}} = \frac{1 - f}{1 + f/x}\,D$$
where ‘x’ accounts for bacterium shape and f = B0/Bmax, i.e., the
ratio of the concentration of bacteria to its maximum possible value.
The above models were combined with the Koch plaque growth
model to derive simplified estimations of phage optimal latent
period toward maximizing the plaque size [14].
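
The hindered diffusion constant is easy to evaluate once ‘f’ and ‘x’ are known. The helper below is a small sketch that reads the relation as D(1 − f)/(1 + f/x); the function name and the input checks are assumptions added for illustration.

```python
def effective_diffusion(D, f, x):
    """Hindered diffusion constant D_eff = D * (1 - f) / (1 + f / x).

    D: free diffusion constant of the phage
    f: ratio B0 / Bmax of the bacterial concentration to its maximum
    x: shape factor of the bacterium (must be positive)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f = B0/Bmax must lie between 0 and 1")
    if x <= 0:
        raise ValueError("shape factor x must be positive")
    return D * (1.0 - f) / (1.0 + f / x)
```

The limiting cases behave as expected: with no bacteria (f = 0) the phage diffuses freely and D_eff equals D, while a fully packed lawn (f = 1) gives D_eff = 0.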
A delay-differential reaction–diffusion model was constructed,
and the speed of infection spreading in a 1D environment was
calculated using numerical simulations [34]. Phage–bacteria interaction
in a flow reactor was modeled using delayed reaction–diffusion
equations [35] to predict the survival of both phage and bacteria
in the bio-reactor. In a recent study, a reaction–diffusion model was
developed to address the dynamics of plaque development for both
virulent mutants and temperate phages [36]. It was observed that
with an increase in distance from the plaque center, the temperate
phages produce a greater number of progressively smaller
colonies [36].
4.5 Cellular Automata
Cellular automata have been applied in combination with partial
differential equations (PDEs) to model phage growth in a bacterial
biofilm [37]. This model concludes that the steady state of the
phage–bacteria mixed population in a biofilm depends on nutrient
availability. Simple cellular automata techniques were introduced to
model spatial heterogeneity in phage–bacteria dynamics on the
biofilm surface [38]. These models have described the coexistence
of phage and bacteria in spatially restricted situations
reasonably well.
References
1. Calendar R (ed) (2012) The bacteriophages, vol 1. Springer, Berlin
2. Maurice CF, Bouvier CD, De Wit R et al (2013) Linking the lytic and lysogenic bacteriophage cycles to environmental conditions, host physiology and their variability in coastal lagoons. Environ Microbiol 15:2463–2475
3. Hendrix RW (2002) Bacteriophages: evolution of the majority. Theor Pop Biol 61:471–480
4. Gomez P, Buckling A (2011) Phage-bacteria antagonistic coevolution in soil. Science 332:106–109
5. Weitz JS, Hartman H, Levin SA (2005) Coevolutionary arms races between bacteria and bacteriophage. Proc Natl Acad Sci U S A 102:9535–9540
6. Sinha S, Grewal RK, Roy S (2018) Modelling Bacteria–Phage Interactions and Its Implications for Phage Therapy. Adv Appl Microbiol 103:103–141
7. Lenski RE (1988) Dynamics of interactions between bacteria and virulent bacteriophage. In: Advances in microbial ecology. Springer, Boston, MA, pp 1–44
8. Manly BF (2018) Randomization, bootstrap and Monte Carlo methods in biology. Chapman and Hall/CRC, London
9. Rothfield L, Justice S, Garcia-Lara J (1999) Bacterial cell division. Ann Rev Genet 33:423–448
10. Moldovan R, Chapman-McQuiston E, Wu XL (2007) On kinetics of phage adsorption. Biophys J 93:303–315
11. Flunkert V, Schoell E (2009) Pydelay-a python tool for solving delay differential equations. arXiv preprint arXiv:0911.1633
12. Thompson S, Shampine LF (2006) A friendly Fortran DDE solver. Appl Numer Math 56:503–516
13. Campbell A (1961) Conditions for the existence of bacteriophage. Evolution 15:153–165
14. Abedon ST, Culler RR (2007) Optimizing bacteriophage plaque fecundity. J Theor Biol 249:582–592
15. Koch AL (1964) The growth of viral plaques during the enlargement phase. J Theor Biol 6:413–431
16. Yin J, McCaskill JS (1992) Replication of viruses in a growing plaque: a reaction-diffusion model. Biophys J 61:1540–1549
17. Fort J, Mendez V (2002) Time-delayed spread of viruses in growing plaques. Phys Rev Lett 89:178101
18. Ermentrout GB, Edelstein-Keshet L (1993) Cellular automata approaches to biological modelling. J Theor Biol 160:97–133
19. Samaddar S et al (2016) Dynamics of mycobacteriophage-mycobacterial host interaction: evidence for secondary mechanisms for host lethality. Appl Environ Microbiol 82:124–133
20. Bremermann HJ (1983) Parasites at the origin of life. J Math Biol 16:165–180
21. Payne RJH, Jansen VAA (2001) Understanding bacteriophage therapy as a density-dependent kinetic process. J Theor Biol 208:37–48
22. Smith HL, Trevino RT (2009) Bacteriophage infection dynamics: multiple host binding sites. Math Model Nat Phenom 4:111–136
23. Wang X, Wang J (2017) Modelling the within-host dynamics of cholera: bacterial–viral interaction. J Biol Dyn 11:484–501
24. Holguin AV et al (2019) Host resistance, genomics and population dynamics in a salmonella enteritidis and phage system. Viruses 11:188
25. Levin BR, Stewart FM, Chao L (1977) Resource-limited growth, competition and predation: a model and experimental studies with bacteria and bacteriophage. Am Nat 111:3–24
26. Lenski RE, Levin BR (1985) Constraints on the coevolution of bacteria and virulent phage: a model, some experiments, and predictions for natural communities. Am Nat 125:585–602
27. Chapman-McQuiston E, Wu XL (2008) Stochastic receptor expression allows sensitive bacteria to evade phage attack. Part II: theoretical analyses. Biophys J 94:4537–4548
28. Cairns BJ et al (2009) Quantitative models of in vitro bacteriophage-host dynamics and their application to phage therapy. PLoS Pathog 5(1):e1000253
29. Han Z, Smith HL (2012) Bacteriophage-resistant and Bacteriophage-sensitive bacteria in a chemostat. Math Biosci Eng 9:737–765
30. Da K et al (1981) Appendix: a model of plaque formation. Gene 13:221–225
31. Lee Y, Eisner SD, Yin J (1997) Antiserum inhibition of propagating viruses. Biotechnol Bioeng 55:542–546
32. You L, Yin J (1999) Amplification and spread of viruses in a growing plaque. J Theor Biol 200:365–373
33. Ortega-Cejas V et al (2004) Approximate solution to the speed of spreading viruses. Phys Rev E 69:031909
34. Gourley SA, Kuang Y (2005) A delay reaction-diffusion model of the spread of bacteriophage infection. SIAM J Appl Math 65:550–566
35. Jones DA, Smith HL (2011) Bacteriophage and Bacteria in a flow reactor. Bull Math Biol 73:2357–2383
36. Mitarai N, Brown S, Sneppen K (2016) Population dynamics of phage and bacteria in spatially structured habitats using phage and Escherichia coli. J Bacteriol 198:1783–1793
37. Simmons M et al (2018) Phage mobility is a core determinant of phage–bacteria coexistence in biofilms. ISME J 12:531
38. Kerr B et al (2006) Local migration promotes competitive restraint in a host–pathogen ’tragedy of the commons’. Nature 442:75
Chapter 19
Dynamics of Mycobacteriophage—Mycobacterial Host

Interaction
Arabinda Ghosh, Tridip Phukan, Surabhi Johari, Ashwani Sharma,
Abha Vashista, and Subrata Sinha
Abstract
Mycobacterium spp. show a complex evolution of antimicrobial resistance (AMR) and are therefore
serious human pathogens. Many strategies have been employed to counter their pathogenesis,
but AMR has become a growing threat. Molecular tools employing bacteriophages can offer an effective
alternative treatment against Mycobacterium. Phage-encoded products, such as lysins, which cause
lysis of bacterial cells, could be used instead of the bacteriophages themselves. Modern
technologies used alongside bacteriophage strategies, such as in silico immunoinformatics approaches, machine
learning, and artificial intelligence, are described as means to counter the pathogenesis. Understanding
the underlying molecular mechanisms could therefore open alternative routes to combating the pathogenesis.
Key words Mycobacteriophage, Mycobacterium, Phage treatment, Immunoinformatics, Machine learning
1 Introduction
Bacteriophages are scavengers of bacteria that parasitize
their host cells, for example, Mycobacterium tuberculosis
and Mycobacterium smegmatis [1]. Phages are natural antibacterial
agents: ubiquitous, obligate parasites that are highly specific
to their hosts. Felix d’Herelle officially discovered phages in 1917
and named them for their ability to parasitize bacteria,
settling the controversy over the discovery of phages
[2]. The period between the 1920s and 1950s is considered the
golden age of phage therapy, but with the introduction of
antibiotics in the 1940s, phage therapy was completely replaced by
antibiotics. Bacteriophages are very diverse in origin, genome
organization, evolution, and abundance, as there are more than
10^31 phages on Earth [3, 4]. Viral ecologists have estimated that there
are approximately 10^23 phage infections per second worldwide,
showing that the phage population is extremely large and vibrant
[5]. Phages display extraordinary genetic variation irrespective
of host genome size and organization; for instance,
Leuconostoc phage L5 has 2435 bp in its genome, while Pseudomonas
phage 201phi2-1 contains 316,674 bp. According to the GenBank
archive, more than 330 phages have been sequenced for
Mycobacterium spp., 43 for Streptococcus spp., 100 for Staphylococcus spp.,
61 for Salmonella spp., 15 for Klebsiella spp., and 35 for Vibrio
spp., and some of these phages have also been studied
at the protein level. Phages are being sequenced to
characterize them and explore their potential applications in different
fields; “The Actinobacteriophage Database” contains sequenced
genome information from 10,487 phages, including complete genome
sequences of 1410 Mycobacterium phages, 219 Gordonia phages, and
151 Arthrobacter phages (http://phagesdb.org/).
Even among phages with a similar host range, genome organization
can vary greatly:
Salmonella phage SPN3US contains 240,413 bp in its genome,
whereas Salmonella phage FSL SP-004 contains 29,742 bp,
roughly one-eighth as much (see Note 1). The current
chapter sheds light on the detailed mechanism of phage–host
interaction using in silico approaches.
1.1 Phage-Induced Host Gene Expression and Alteration
Some phages (e.g., D29) have broad host ranges and infect many
species, including both slowly growing (M. tuberculosis) and rapidly
growing (M. smegmatis) mycobacteria [6], while others have extremely
narrow ranges and infect a single known host (e.g., Barnyard). There is
at least one phage (DS6A) whose host range is limited to
M. tuberculosis strains, although only a partial genome sequence
of this potentially very useful and interesting phage is available [7, 8]. The
pathogenesis of a variety of bacterial pathogens, such as Escherichia coli, Salmonella
sp., Corynebacterium diphtheriae, and Vibrio cholerae, is
mediated by phage-encoded toxins. Most M. tuberculosis strains
carry one or two small (about 10 kbp) prophage-like elements,
namely φRv1 and φRv2.
Multiple mycobacterial strains, including M. canettii,
M. marinum, M. abscessus, and M. ulcerans, carry prophages
that are semi-intact and could affect their biodiversity.
Expression of phage-encoded proteins is only one way in which phages can affect
their hosts. An alternative route is integration of the phage genome into a host
gene that is essential for certain physiological processes. Phage
integration usually involves site-specific recombination: an
integrase-mediated recombination between the phage and
bacterial attachment sites (attP and attB), carried out by one of two
distinct kinds of enzymes. Tyrosine integrases are the most
prevalent and typically mediate integration into a tRNA gene
(a significant exception is the well-studied integration of phage
lambda). In contrast, phages using a serine integrase typically use an
attB site located within a host protein-coding gene, which is
interrupted by the integration event. Phage D29 was able to infect both
slow-growing pathogenic mycobacteria and fast-growing
environmental strains, and visible plaques formed on the fast-growing
M. smegmatis following overnight incubation [9, 10]. With this technique, the
presence of viable bacteria can be determined by rapid detection of
the progeny phages released following infection of the target
mycobacterium. These experiments laid the
foundation for the subsequent development of phage amplification
technology and its application in assessing antimycobacterial
drug susceptibility [10]. Although mycobacteriophage
host preferences are expected to be strongly
determined by the availability of specific cell receptors, few receptors have been
identified or studied. Lipid extracts of M. smegmatis have
been shown to inhibit infection by phage D29 and the
uncharacterized phage D4 [11], and a specific peptidoglycolipid, mycoside C
(sm), has been purified and proposed to play a role in phage D4
binding [12]. Glycolipids may act as receptors for the
adsorption of mycobacteriophage Phlei [13], and a subset of
lyxose-containing molecules has been further characterized chemically
[14]. More recently, a single methylated rhamnose
residue on the M. smegmatis cell wall–associated glycopeptidolipid
was shown to be involved in the adsorption of phage I3 [15]
(see Note 2).
Because the discovery and genomic characterization of
mycobacteriophages has been the focus of integrated research and
education programs, including the Phage Hunters Integrating Research
and Education (PHIRE) program and the Howard Hughes Medical
Institute Science Education Alliance Phage Hunters Advancing
Genomics and Evolutionary Science (HHMI SEA-PHAGES) program, a large
number of phages have been isolated using a single host strain,
M. smegmatis mc²155, more than 500 of which have been completely
sequenced [16]. These are mostly from environmental samples;
however, mycobacteriophages have also been isolated
from stool samples of tuberculosis patients [17], although
these have yet to be analyzed genomically.
Clearly, these mycobacteriophages represent only a tiny portion of the
overall phage population, which is predicted to comprise 10^31
particles, making phages the majority of all life forms in the
biosphere.
Moreover, phages are key players in developing more tractable
genetic systems. For instance, some applications depend
specifically on the ability of phages to inject their DNA into
essentially every cell of a mycobacterial population, making phages
ideal for transposon delivery and the construction of complex
transposon libraries [18], gene replacement using specialized
transduction, and efficient delivery of reporter genes
[19]. However, the component parts of the phages also
have great utility and often function in M. tuberculosis
regardless of whether the phage itself infects it. Examples
include integrating plasmid vectors and recombineering approaches,
although there are numerous other potential applications
that have yet to be exploited. The overall diversity
of the phages greatly favors these approaches, providing a
toolbox of more than 50,000 genes that can be exploited
[20, 21]. The recent rise in alarming problems caused by
antibiotic-resistant pathogenic bacteria urgently calls for
alternatives to antibacterial chemotherapy with antibiotics. One potential
candidate is bacteriophage therapy, which until fairly
recently was little known outside the states of the former Soviet
Union [22]. Bacteriophages are bactericidal agents, which has
prompted many researchers to use them to curb bacterial diseases as many
bacteria become resistant to antibiotics. They were a viable
option for treating bacterial infections prior to the antibiotic era
[23–27]. However, phage therapy lost its relevance with the
demonstrated effectiveness of antibiotics [28]. The recent increase in the
incidence of antibiotic resistance, due to genetic changes or
phenotypic variations, has revived the use of bacteriophages
for controlling disease-causing bacteria. Phage
therapy has advantages over antibiotic therapy, as the former can
directly destroy the pathogen whereas the latter can be highly toxic in some
circumstances [29, 30]. It was also reported that many
phage-encoded products, such as lysins, which cause lysis of bacterial
cells, could be used instead of the bacteriophages themselves
[31–34]. Additionally, phages exhibit toxicity to host organisms
and have the ability to inhibit host metabolic activities [35]. It is
very important to understand how they affect the host and to find
alternative strategies that do not necessarily use the phages directly for
controlling antibiotic-resistant bacterial diseases [28] (see Note 3).
Bacteriophages, present in the biosphere with an estimated population
of 10^31, open a new horizon of enormous bio-resources for the
growing interest in biotechnology and biomedical applications
[36, 37]. Among them, mycobacteriophages are bacteriophages
that infect mycobacteria. Mycobacteria include
pathogenic bacteria that cause severe diseases in humans, such as
leprosy (M. leprae) and tuberculosis (M. tuberculosis). Although
different antibiotics are available against these harmful bacteria,
the pathogens persist. This is primarily due to the ability of these
organisms to enter a dormant state and remain in a latent
form for long periods in the host body. Therefore,
diseases such as leprosy and tuberculosis are difficult to eradicate,
as most drugs cannot reach and act upon this latent form of
the pathogens [38–40].
The studies of phage–mycobacteria interaction have revealed
many insights into the molecular mechanisms of host–pathogen
interactions. The lytic mycobacteriophage TM4 and the lysogenic
mycobacteriophage L5 have been studied comprehensively in
mycobacterial research as a result of their growing importance in
understanding the interaction between Mycobacterium and
bacteriophages in the field of biomedical application [41–44]. Of late,
many new mycobacteriophages have been discovered, and more than 842
complete phage genomes have been sequenced to date [17].
1.2 Mycobacterio- The hereditary make-up of various mycobacteriophages has been

phage Biology examined and finished genome maps are currently accessible on the
web. They are among the biggest bacteriophage genomes so far
sequenced with genomes shifting in size from 49.1 to 156 kbp
[45]. Mycobacteriophage genomes show a mosaic course of action
as homologous qualities are sprinkled with irrelevant qualities in an
unmistakable particular example. Around 87% of the qualities so far
portrayed are restrictive to mycobacteriophage. As opposed to
another bacteriophage, no qualities encoding for poisons have
been distinguished in mycobacteriophage. Bacteriophage that con-
veys poison qualities is related to extensive dreariness and mortality.
To gain better insight into the dynamics of mycobacterial
inactivation by mycobacteriophages, an investigation was initiated
using mycobacteriophage D29 and M. smegmatis as the phage-host
system. The authors implemented a goal-oriented iterative cycle of
experiments on the one hand and mathematical modeling combined
with Monte Carlo simulations on the other. This integrative approach
provided valuable insight into the detailed kinetics of
bacterium–phage interactions. Time-dependent changes in host
viability were measured during the growth of phage D29 in
M. smegmatis at various multiplicities of infection (MOI). The
predictions emerging from the theoretical analyses were further
examined using biochemical and cell biological assays [46]. In a
phage-host interaction system where multiple rounds of infection are
allowed to occur, cell counts drop more rapidly than expected if cell
lysis is considered the only mechanism of cell death. This phenomenon
could be explained by considering a secondary factor for cell death
in addition to lysis.
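
The role of the MOI in such simulations can be illustrated with a
minimal Monte Carlo sketch. It is not the model of the cited study. It
assumes, purely for illustration, that every phage adsorbs to one cell
chosen uniformly at random, so that the fraction of cells receiving no
phage approaches the Poisson value exp(-MOI). The function name and
its parameters are hypothetical.

```python
import math
import random


def simulate_uninfected_fraction(n_cells, moi, seed=0):
    """Monte Carlo estimate of the fraction of cells that receive no phage.

    Assumption (illustrative only): every phage adsorbs to one cell chosen
    uniformly at random, independently of all other phages.
    """
    rng = random.Random(seed)
    n_phages = int(round(n_cells * moi))
    hits = [0] * n_cells  # number of phages adsorbed to each cell
    for _ in range(n_phages):
        hits[rng.randrange(n_cells)] += 1
    return sum(1 for h in hits if h == 0) / n_cells


if __name__ == "__main__":
    for moi in (0.1, 1.0, 5.0):
        simulated = simulate_uninfected_fraction(20000, moi)
        expected = math.exp(-moi)  # Poisson probability of zero phages
        print(f"MOI={moi}: simulated={simulated:.4f}, Poisson={expected:.4f}")
```

Comparing the simulated and the analytical fraction at several MOI
values is a quick sanity check before more elaborate mechanisms, such
as a secondary cause of cell death, are added to a model.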
Mycobacteriophages are viruses that infect mycobacterial hosts.
Interest in mycobacteriophages began in the late 1940s with the
isolation of phages that infect Mycobacterium smegmatis and,
later, the discovery of phages that infect Mycobacterium tuberculosis.
dsDNA tailed phages are either temperate, forming stable lysogens
at moderate frequencies (e.g., lambda), or lytic, such that all infec-
tions lead to phage growth and cell death (e.g., T4 and T7). The
classification of mycobacteriophages into two such groups is, how-
ever, complex. A good example of a temperate phage is L5, which
forms obviously turbid plaques from which stable lysogens immune
to superinfection can be readily isolated [47]; in contrast, D29
forms completely clear plaques in which virtually all host cells are
killed.
The first completely sequenced mycobacteriophage genome
was that of phage L5 [43], a temperate phage isolated in Japan
[48]; it is a close relative of phage L1 (Fig. 1), which shares a similar
restriction pattern but does not grow at 42 °C [49]. Both L5 and
L1 infect fast-growing and slowly growing mycobacterial strains,
although efficient infection of slow-growers by L5 requires the
presence of high calcium concentrations [42]. Although the
sequence of L1 has not been determined, derivatives that grow at
both 42 and 30 °C have been identified, followed by isolation and
characterization of temperature-sensitive mutants [50, 51]. The
next complete genome reported was that of D29, which was
isolated in California from a soil sample by enrichment, infects
both fast-growing and slowly growing strains, and is clearly lytic
[41]. D29 has considerable nucleotide sequence similarity to L5,
especially in the left-most parts of the genomes that encode the
virion structural genes [41, 43, 52]. Although D29 forms more
distinctly clear plaques than any other mycobacteriophage,
comparison with the L5 genome indicates that the sequenced version
is likely a recent derivative of a temperate parent; consistent with
this, Bowman noted a mixture of plaque morphologies in the D29
stock [53]. The third sequenced mycobacteriophage, TM4, was
isolated by the induction of a strain of M. avium [54]. It is unclear
whether the original strain was lysogenic or pseudolysogenic, since
TM4 is capable of lysing it as well as M. smegmatis and
M. tuberculosis [54]. Over the past 20 years, many other
mycobacteriophage genomes have been sequenced; all of these phages
were isolated from environmental samples using M. smegmatis
mc2155 as a host.

Fig. 1 Schematic diagram of temperate mycobacteriophage L5
1.3 Phage Therapy
Broadly, bacteriophages are of two types: lysogenic (temperate), in
which the bacteriophage integrates its genome into the host DNA, and
lytic (virulent), in which the bacteriophage replicates rapidly
inside the cell and then bursts the host cell to continue infecting
other bacterial cells. Lytic bacteriophages replicate exponentially
in the host bacterial cell and are released by lysis of the infected
bacterium, which involves the holin-endolysin lysis system [55, 56].
Holins create a lesion in the bacterial membrane through which
endolysins gain access to the peptidoglycan layer [56]. Endolysins
are cell wall hydrolases that degrade the bacterial peptidoglycan
and cause cell lysis and release of progeny phages [57]. Lysis by
phage lambda was shown to follow the holin-endolysin-dependent
process, although a parallel pathway governed by spanins and
accessory lytic proteins such as Ms6 LysB was described later
[58, 59]. Phage therapy has been found effective against various
pathogens, such as Pseudomonas, Staphylococcus, Klebsiella, and
E. coli, and against staphylococcal lung infection [60–63].
Recently, phage therapy has shown significant promise in the
treatment of infections caused by pathogens that are resistant to
multiple antibiotics. Chhibber et al. demonstrated that phages
can be used to treat Klebsiella pneumoniae respiratory tract
infection, and a single dose was sufficient to protect the
majority of the tested animals [64]. Phage therapy is also reported
to be effective against cerebrospinal meningitis in infants [65].
It has shown promise against numerous infections, including E. coli
skin infections [66], recurrent subphrenic and subhepatic abscesses
[67], Pseudomonas aeruginosa infection in cystic fibrosis [68],
staphylococcal eye infections [69], Gram-negative-mediated neonatal
sepsis [70], inflammatory urinary tract infections [71], and Buruli
ulcer caused by Mycobacterium ulcerans [40]. It was also shown that,
during phage therapy of a lethal infection, the phage titer
increases with time, whereas with antibiotics the drug concentration
declines [72, 73]. Earlier, it was believed that phages can act only
against extracellularly multiplying bacteria; however, a recent
report showed that phages are capable of intracellular killing of
engulfed methicillin-resistant Staphylococcus aureus. This was
demonstrated using host bacteria as a vehicle to deliver phages
inside phagocytic cells [74, 75]. Mathematical modeling of
population dynamics showed that a single dose of phage was more
effective than multiple doses of antibiotics [76].
Biofilm formation lowers the antimicrobial susceptibility of a
microorganism; biofilms of Mtb and M. smegmatis show a higher
degree of drug resistance than growing bacilli or bacilli in
planktonic form [77]. Although there is no clear evidence of
biofilm production in TB, particularly in MDR-TB, Mtb expresses a
pilin protein that is actively engaged in binding to the
extracellular matrix [78]. This suggests that the pathogen
surface may be involved in surface attachment [79].
Non-tuberculous mycobacteria isolated from drinking water, including
Mycobacterium abscessus, Mycobacterium chelonae, M. smegmatis,
and M. avium, also showed biofilm-forming ability [37]. M. marinum
showed biofilm formation on hydrophobic surfaces, such as silicon,
and at liquid–air interfaces [80]. Subsequent studies have also
revealed that Mtb, like other mycobacteria, is a biofilm-forming
organism. Mtb biofilms showed genetic variation from planktonic
cells and a higher level of drug resistance [81]. The biofilm-forming
capacity of this organism and its increased antibiotic resistance
have prompted researchers to look for treatments other than
antimicrobials.
Mycobacteriophages can disrupt M. smegmatis biofilms alone or in
combination with mechanical forces, such as water flow and
sonication. For many years, however, it was thought that phages are
active only against extracellular bacteria. Mtb is an intracellular
pathogen that grows inside macrophages, which makes it inaccessible
to phages. Intracellular delivery of phage by nonvirulent
M. smegmatis provided a model for intracellular targeting of
Mycobacterium by phages [82]. Phage therapy has remained relatively
unexplored in mycobacterial diseases because of their prolonged
course and the high risk involved in handling the pathogens. Model
organisms can therefore be used to explore new possibilities for
phage therapy in treating mycobacterial infections. A study of the
dynamics of mycobacteriophage infection across a broad range of
mycobacteria identified the governing factors, including the
number of phages and bacteria present initially, the growth rate of
the bacteria, the length of the latent phase of growth, burst size,
and adsorption rate [28]. The authors used mathematical modeling
based on the Poisson distribution, which showed that the number of
phages is inversely proportional to the number of bacterial cells.
Bacteria to which phage has adsorbed during the latent period are
ultimately destined to lyse about 60 min after infection. A few
bacterial cells survived lysis by the bacteriophage, but it was
later reported that bacterial cell death also occurs through host
cell inactivation by lytic phages and the release of reactive oxygen
species (ROS) [28].
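
How the factors listed above interact can be sketched with a small
discrete-time simulation. The code below is an illustrative
mass-action model, not the published model of [28]: the function name,
the parameter values in the usage example (growth rate, adsorption
rate, burst size), and the simplification that every infected cell
lyses exactly one latent period after adsorption are all assumptions.
Only the latent period of about 60 min is taken from the text.

```python
from collections import deque


def simulate_infection(s0, p0, growth_rate, adsorption_rate,
                       latent_min, burst_size, total_min, dt=1.0):
    """Euler simulation of susceptible cells, infected cells and free phages.

    Returns a list of (time, susceptible, infected, phage) tuples.
    Infected cells lyse exactly `latent_min` minutes after adsorption.
    """
    steps_latent = max(1, int(round(latent_min / dt)))
    pending = deque([0.0] * steps_latent)  # infected cohorts waiting to lyse
    s, p = float(s0), float(p0)
    history = [(0.0, s, 0.0, p)]
    for step in range(1, int(round(total_min / dt)) + 1):
        # The cohort infected one latent period ago lyses and releases progeny.
        lysed = pending.popleft()
        p += burst_size * lysed
        # Mass-action adsorption, capped so that counts never become negative.
        newly_infected = min(adsorption_rate * s * p * dt, s, p)
        s += growth_rate * s * dt - newly_infected
        p -= newly_infected
        pending.append(newly_infected)
        history.append((step * dt, s, sum(pending), p))
    return history


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to make the sketch run.
    for moi in (0.1, 1.0, 10.0):
        run = simulate_infection(s0=1e7, p0=moi * 1e7, growth_rate=0.004,
                                 adsorption_rate=1e-9, latent_min=60,
                                 burst_size=100, total_min=120)
        t, s, infected, p = run[-1]
        print(f"MOI={moi}: t={t:.0f} min, susceptible={s:.3g}, phage={p:.3g}")
```

A lysis-only model of this kind provides the baseline against which a
faster-than-expected drop in cell counts, and hence a secondary
mechanism of cell death, can be detected.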
1.4 Phage Treatment for Tuberculosis
Antibiotic resistance is certainly a growing and worrying
development, especially with the emergence of extensively
drug-resistant (XDR) and totally drug-resistant (TDR) strains, both
of which are especially difficult to control. The delivery of phages
to the lungs should be relatively simple, although there is
considerable doubt as to whether they would effectively reach their
bacterial hosts, which may be intracellular and inside granulomas.
An intriguing suggestion for addressing the access question is to
use infected surrogate mycobacterial cells for delivery [83].
Unfortunately, relatively few efficient phage killers of
M. tuberculosis are available, and because phage resistance is to be
expected, a suite of three to six phages that efficiently kill
M. tuberculosis and elicit different resistance mechanisms in the
host is required. Since only a subset of the phages isolated on
M. smegmatis also infects M. tuberculosis, the isolation of
additional phages known to infect M. tuberculosis is desirable [27].
The prevention and control of tuberculosis (TB) on a global scale
has become increasingly important with the emergence of
multidrug-resistant TB. Mycobacterium tuberculosis phages have
been identified as an important investigative tool. Phage genomes
exhibit a significant level of diversity and a mosaic genome
architecture; nevertheless, they are simple structures that are
amenable to genetic manipulation. Based on these characteristics,
the phages may be used to construct a shuttle plasmid, which is an
important tool in the investigation of TB. Moreover, they may be
used for rapid diagnosis and for assessing the drug susceptibility
of TB, including phage amplification assays and reporter phage
technology. With an improved understanding of mycobacteriophages,
the pathogenesis of TB and the implications for its diagnosis and
treatment may be further elucidated.
Since mycobacteriophages were identified 50 years ago, >2439 types
of mycobacteriophages have been isolated, and the genome sequences
of >363 types of mycobacteriophages have been completed.
Mycobacteriophage genomes have several features, including
diversity and mosaicism, a simple structure, and amenability to
genetic manipulation. Based on these characteristics, a shuttle
plasmid was developed for TB research using recombinant DNA
technology. With improvements in genomics, shuttle plasmids have
also been used to construct various luciferase reporter phages and
fluoromycobacteriophages, which have contributed to the
investigation of mycobacteria and TB [26]. After a long period of
limited research, phage therapy is again an active area of
investigation, particularly with regard to bacteriophage lyase. As
research on mycobacterial phages advances, improvements in the
current understanding of their role in TB, and especially in its
diagnosis and treatment, are expected (see Note 4).
2 In Silico Tools Utilized for Studying Phage Dynamics
New approaches are increasingly used to characterize the
interaction between a phage and its host, for example, the use of
microarrays to study phage transcription throughout the T4–E. coli
infection cycle, including profiling of phage infection against the
host genome [84–86]. RNA sequencing (RNA-seq) offers a unique
opportunity to investigate the reciprocal reprogramming of phage
and bacterium during the phage infection cycle [87]. However,
studies of clinically relevant phage–bacterium interactions are
rare and urgently needed. Mycobacterium includes notorious
pathogens that cause serious diseases in mammals, such as
tuberculosis (M. tuberculosis) and leprosy (M. leprae). More than
5850 mycobacteriophages, bacteriophages known to infect
mycobacteria, have been isolated using a single host strain,
Mycobacterium smegmatis mc2155, and more than 600 of them have
been completely sequenced.
As of mid-2015, 5914 mycobacteriophages have been found, and 853
have been sequenced through these efforts. These phages are
grouped into clusters and subclusters determined by overall genomic
nucleotide similarity [88]. To understand the proteomes of these
phages, putative mycobacteriophage proteins are grouped into
protein families, known as phams, by shared amino acid similarity.
However, despite this wealth of genomic data, the functions of most
putative phams remain unknown, and even less is known about the
expression and regulation of bacteriophage genes. Mass spectrometry
and protein interaction studies have been used to confirm gene
calls, identify new gene products, and elucidate gene function.
Phage Giles has been subjected to global transcriptomic analysis
using RNA-Seq. Like microarrays, this permits a global analysis of
gene expression, but with higher accuracy, especially for
low-abundance transcripts, and it allows single-nucleotide
resolution and identification of novel transcripts [89]. Giles
belongs to cluster Q, which is a small, tightly conserved cluster
(>98% nucleotide similarity among members) that is only distantly
related to other mycobacteriophage clusters. Indeed, Giles was
studied in large part because of its unusual genome architecture
and evolutionary divergence from other mycobacteriophages. To
understand the transcriptome-level dynamics of the most common
mycobacteriophages, using RNA-Seq and mass spectrometry to analyze
the infection of M. smegmatis with a phage gives a more extensive
picture of functional activities. Insights into the evolutionary
arms race between phage and bacterium have revealed much new
molecular machinery, for example, the widespread bacterial defense
system called CRISPR/Cas, which in turn has inspired revolutionary
genome-editing tools [90] and exciting novel approaches to
antimicrobial discovery [28, 83] (see Notes 5 and 6).
2.1 Immunoinformatics Prediction for Mycobacteriophage
Immunoinformatics deals with the application of computational
methods to immunological problems and is therefore considered a
part of bioinformatics. Historically, tools for the prediction of
HLA-binding peptides were the first tools developed specifically
for immunoinformatics applications. These tools paved the way for
more complex applications. The development of immunoinformatics
tools has hinged on the availability of sufficient experimental
data. High-throughput human leukocyte antigen (HLA)-binding assays
led to major progress in this area. More recently, next-generation
sequencing (NGS) has enabled many of the novel applications and
challenges that we will review here. A first area where the
availability of cost-effective sequencing has a large impact is our
knowledge of the major histocompatibility complex (MHC, HLA in
human) itself. The number of known HLA alleles, as registered in
the International ImMunoGeneTics information system [91] database,
has increased from 1000 in 1998 to more than 13,000 in 2015.
Initially, tools for the prediction of HLA binding (often also,
slightly incorrectly, called epitope prediction) were trained on
data for each HLA allele independently; however, the number of new
alleles renders this approach increasingly impractical [92]. The
availability of large-scale data has improved the performance of
immunoinformatics, and for some, although not for all,
applications, there is now a wealth of data available. This
increase in data volume often translates into an increased accuracy
of these tools, primarily because many tools are based on machine
learning methods, which benefit significantly from additional data.
In this context, the availability of comprehensive and well-curated
immunological databases is essential.
2.2 Immunoinformatics Techniques and Databases for Epitope Prediction
The availability of sequence data on HLA-binding peptides in the
mid-1990s prompted a search for commonalities among these
sequences, that is, allele-specific motifs that convey binding.
Some of the approaches are discussed in the following sections.
2.3 Machine Learning Approaches
In supervised ML, a method attempts to learn a function that maps a
given input to its corresponding output from a training data set of
known input and output values (learning from examples). The task
can be either classification (e.g., discrimination between binder
and non-binder) or regression (e.g., the prediction of peptide
binding affinity). The simplest ML technique that is still widely
used is the position-specific scoring matrix (PSSM) [93]. However,
more complex learning methods, such as support vector machines
(SVMs) [94], hidden Markov models (HMMs) [95], or artificial
neural networks (ANNs) [96], have now become more important tools.
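
As an illustration of the simplest of these techniques, the sketch
below builds a log-odds PSSM from a handful of equal-length binder
peptides and scores new peptides with it. It is a generic textbook
construction, not the implementation of any tool named in this
chapter. The uniform background, the pseudocount, the function names,
and the training peptides (made-up 9-mers) are assumptions chosen for
illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Made-up 9-mers used only to exercise the code; not experimental data.
TOY_BINDERS = ["ALAKAAAAV", "GLFGGGGGV", "KLDEETGEV", "QLSTRGVQV", "YLNDHMKAV"]


def build_pssm(binders, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length peptides known to bind.

    Uses a uniform background (1/20 per residue) and additive pseudocounts;
    both are simplifying assumptions of this sketch.
    """
    length = len(binders[0])
    if any(len(p) != length for p in binders):
        raise ValueError("all training peptides must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for peptide in binders:
            counts[peptide[pos]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_peptide(pssm, peptide):
    """Sum the position-specific log-odds scores of a peptide."""
    if len(peptide) != len(pssm):
        raise ValueError("peptide length does not match the matrix")
    return sum(column[aa] for column, aa in zip(pssm, peptide))


if __name__ == "__main__":
    matrix = build_pssm(TOY_BINDERS)
    for candidate in ("ALAKAAAAV", "ADAKAAAAK"):
        print(candidate, round(score_peptide(matrix, candidate), 2))
```

Because each position contributes independently to the score, a PSSM
cannot capture interactions between positions, which is one reason
why SVMs, HMMs, and ANNs are used when enough training data exist.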
2.4 NGS-Based HLA Typing
To predict a T-cell epitope, knowledge of the HLA allotype is
required. Conventional approaches to HLA typing rely on either
antibody-based methods or targeted sequencing [97]. In many
clinical applications, the NGS data of a patient are already
available. Tools that infer the HLA allotype from NGS data (exome,
transcriptome) can thus avoid additional cost. These tools are also
frequently used to infer HLA types for large-scale genome
sequencing projects (e.g., ICGC [98], The Cancer Genome Atlas, and
the 1000 Genomes Project [99]), where no dedicated HLA typing
data are available for the majority of genomes.
2.5 T-Cell Epitope Prediction
Given the HLA type of an individual, it is now possible to predict
the HLA ligands. This is often referred to as T-cell epitope
prediction, although presentation by HLA is necessary, but not
sufficient, for a peptide to become an epitope, since recognition
by the immune system is not guaranteed. HLA ligand binding is a
limiting step in the antigen-processing pathway. It is generally
considered more selective than the subsequent steps of the
antigen-processing pathway and is therefore important for vaccine
design. PSSM-based predictors (e.g., SYFPEITHI [100], RANKPEP
[101], or Bimas [102]), SVM-based predictors (e.g., SVMHC [94]
and SVRMHC), and ANN-based methods (e.g., netMHC [103]) are some
of the most popular of these approaches.
NetMHCpan [104], TEPITOPEpan [105], ADT [106], UniTope [107],
and KISS [108] are pan-specific methods that extend these
approaches. PickPocket and TEPITOPEpan compute the binding
specificity of an HLA molecule by comparing its pocket residues
with those of the HLAs in their libraries and calculating a
weighted average score. KISS is SVM-based, whereas MULTIPRED
trains one predictor for each supertype. In contrast to all other
methods, netMHCpan allows the user to make predictions for
arbitrary HLA class I sequences.
HLA class II epitope predictors include ProPred [109], RANKPEP
[101], TEPITOPE [98], SVRMHC [94], MHC2MIL [110], and
MHC2pred. These tools provide several predictors for the HLA-DR
locus. netMHCII, RANKPEP, and MHC2MIL also provide predictions
for HLA-DQ and DP.
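
In practice, HLA ligand prediction for a protein amounts to scoring
every overlapping peptide of a given length and ranking the results.
The sketch below shows this scanning step only. The scoring function
`toy_anchor_score` is a hypothetical stand-in for a real predictor
such as those listed above; it does not call any of these tools, and
its anchor rule is made up for illustration.

```python
def iter_kmers(sequence, k=9):
    """Yield (1-based start position, peptide) for every overlapping k-mer."""
    for start in range(len(sequence) - k + 1):
        yield start + 1, sequence[start:start + k]


def toy_anchor_score(peptide):
    """Hypothetical scorer: rewards made-up anchor residues at P2 and P-omega.

    A real workflow would replace this function with the output of an
    actual binding predictor.
    """
    score = 0.0
    if peptide[1] in "LM":
        score += 1.0
    if peptide[-1] in "VLI":
        score += 1.0
    return score


def rank_candidates(sequence, scorer, k=9, top_n=5):
    """Score all k-mers of a protein and return the top_n candidates.

    Each result is (score, start position, peptide); ties are broken by
    position so that the output is deterministic.
    """
    scored = [(scorer(pep), pos, pep) for pos, pep in iter_kmers(sequence, k)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_n]


if __name__ == "__main__":
    protein = "AAAA" + "KLAAAAAAV" + "AAAA"  # made-up sequence
    for score, pos, pep in rank_candidates(protein, toy_anchor_score, top_n=3):
        print(pos, pep, score)
```

Keeping the scorer as a parameter makes it straightforward to run the
same scan with several predictors, which is the starting point for
the consensus approaches described next.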
2.6 Consensus Methods
To improve predictions, multiple predictors can be combined to
perform a consensus prediction. The most widely used consensus
methods are CONSENSUS, which is hosted on the IEDB website [111],
and netMHCcons, provided by Karosiene and colleagues.
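
One simple way to combine predictors, assumed here for illustration,
is to convert each predictor's output to a percentile rank and take
the median rank per peptide. The sketch below shows only this
combination step and is not claimed to reproduce CONSENSUS or
netMHCcons. The predictor names and rank values in the usage example
are hypothetical.

```python
from statistics import median


def consensus_rank(percentile_ranks):
    """Combine per-predictor percentile ranks into a median rank per peptide.

    `percentile_ranks` maps a predictor name to a dict of
    peptide -> percentile rank, where a lower rank means a stronger
    predicted binder. Only peptides scored by every predictor are kept.
    Returns (peptide, median rank) pairs sorted from best to worst.
    """
    if not percentile_ranks:
        return []
    per_method = list(percentile_ranks.values())
    shared = set(per_method[0])
    for ranks in per_method[1:]:
        shared &= set(ranks)
    combined = {pep: median(ranks[pep] for ranks in per_method)
                for pep in shared}
    return sorted(combined.items(), key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical percentile ranks from three unnamed predictors.
    example = {
        "method_a": {"KLAAAAAAV": 1.0, "AAAAKLAAA": 10.0},
        "method_b": {"KLAAAAAAV": 3.0, "AAAAKLAAA": 2.0},
        "method_c": {"KLAAAAAAV": 2.0, "AAAAKLAAA": 50.0},
    }
    for peptide, rank in consensus_rank(example):
        print(peptide, rank)
```

The median is less sensitive to a single outlying predictor than the
mean, which is the motivation for using it in this sketch.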
2.7 Integrated Processing Prediction Tools
Examples of tools implementing integrated prediction of antigen
processing are EpiJen [112] and WAPP [113], both built on already
existing prediction methods. NetCTL [114] chose a different
approach.
2.8 Ligands Entry into HLA
Binding of a ligand to HLA does not guarantee that it is recognized
by the T-cell receptor (TCR). The immunogenicity of the ligand is
therefore predicted separately, for example by POPI, an SVM-based
predictor developed by Tung and Ho in 2007 [115].
2.9 Focusing B-cell Epitope
The prediction of B-cell epitopes differs from T-cell epitope
prediction at a very fundamental point. T-cell epitopes are short,
linear peptide sequences, whereas B-cell epitopes are not
necessarily contiguous in sequence. The complex structure of folded
proteins can bring amino acids that are distant in the antigen
sequence into spatial proximity. Recently published predictors for
continuous epitopes are COBEpro [116], BCPred, and FBCPred [117].
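
The notion of residues that are distant in sequence but close in
space can be made concrete with a short sketch. Given one coordinate
per residue (for example, C-alpha positions), the code lists residue
pairs that lie within a distance cutoff while being separated by a
minimum number of positions in the sequence. This is not a B-cell
epitope predictor; the function name, the cutoff of 8 Å, the minimum
separation of 10 residues, and the hairpin-like toy coordinates are
assumptions made for illustration.

```python
import math


def spatial_neighbors(coords, cutoff=8.0, min_separation=10):
    """Return 1-based residue pairs that are close in space but far in sequence.

    `coords` is a list of (x, y, z) tuples, one per residue, in sequence
    order. A pair (i, j) is reported when j - i >= min_separation and the
    Euclidean distance between the two points is at most `cutoff`.
    """
    pairs = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i + 1, j + 1))
    return pairs


def toy_hairpin(n_per_strand=12, spacing=3.8, strand_gap=5.0):
    """Made-up coordinates: two antiparallel straight strands."""
    forward = [(spacing * i, 0.0, 0.0) for i in range(n_per_strand)]
    backward = [(spacing * (n_per_strand - 1 - i), strand_gap, 0.0)
                for i in range(n_per_strand)]
    return forward + backward


if __name__ == "__main__":
    for first, second in spatial_neighbors(toy_hairpin()):
        print(first, second)
```

In the toy hairpin, the first and last residues are 23 positions
apart in sequence yet only 5 Å apart in space, which is the kind of
arrangement that a purely sequence-based predictor cannot see.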
3 Conclusion
Mycobacteriophages infect mycobacteria, eventually leading to
their death. The possibility of using them as agents against
wild-type mycobacteria (TB) thus offers potential for a new era of
treatment. Mycobacteriophages have been essential to the
development of tools for genetic manipulation and thus to the
advancement of mycobacterial genetics. To use bacteriophages as
drugs, however, a quantitative and mechanistic understanding of how
they reduce the bacterial population needs to be obtained. Computer
simulation can be employed to gain better knowledge of any given
system. In immunoinformatics, the behavior of the system during
mycobacteriophage–host interaction is generally monitored.
However, owing to various constraints, we often cannot predict, or
even fully comprehend, the inner processes at work in the system.
Simulating a system allows us to “understand” and even predict its
behavior in response to changes in different parameters. At least
within the experimentally accessible ranges defined by the model
parameters, this can sometimes lead to a better experimental
design.
4 Notes
1. The identification of suitable and unique target sites on the
pathogen constitutes a major component of designing effective
drugs, to be followed up by rational design based on
computer-aided molecular modeling. The distinctive features of the
organization of the mycobacterial cell wall, and the consequent
existence of unique metabolic pathways for their synthesis and
assembly, should lend themselves to genetic and biochemical
analyses and offer targets for rational drug design. A significant
criterion must be the accessibility of the target site to drugs.
2. There are virtually no investigations of these complex
mycobacterial phenomena. Among the intriguing properties are the
pathogens’ invasiveness, their survival in an environment in
which they evade the host defense mechanisms, and the causation of
pathological lesions or symptoms.
3. Protective immunity and pathological immunity are two faces of
mycobacterial diseases. In this respect, it is essential to
consider whether the same or distinct T-cell populations of the
host are involved. It may be very difficult to discover a ‘single’
protective antigen obtained from the pathogen, and the
identification of a ‘related illness’ group is therefore at risk.
These elements must be taken into account when designing new and
more efficient vaccines against mycobacterial diseases.
4. Early diagnosis of mycobacterial infection based on immunology
and recombinant DNA or RNA methods is now available for
sensitive, specific, reliable, and simple-to-conduct diagnostic
testing. Furthermore, the viability of the identified bacteria in
the host is an important characteristic to consider.
5. Is it worth sequencing the entire mycobacterial genome (either
M. tuberculosis, M. leprae, or any other pathogenic species)?
The technology is now accessible worldwide but continues to
be improved.
6. Ultimately, the fundamental data will show mycobacteria’s
distinctive and uncommon characteristics. The correct strain
of mycobacteria for sequencing should be selected very
carefully.
Acknowledgments
The authors are grateful to the Indian Science and Technology
Foundation, Delhi, and the Department of Biotechnology, Govt. of
India, for providing the Bioinformatics Infrastructure Facility (BIF)
at the Center for Biotechnology and Bioinformatics, Dibrugarh
University, where the manuscript was completed. The authors are
also grateful for the DeLCON facility provided by DBT, Govt.
of India.
References
1. Hatfull GF (2012) The secret lives of mycobacteriophages. Adv Virus Res 82:179–288
2. Chanishvili N (2012) Phage therapy – history from Twort and d’Herelle through Soviet experience to current approaches. Adv Virus Res 83:3–40
3. Hendrix RW (2003) Bacteriophage genomics. Curr Opin Microbiol 6:506–511
4. Hatfull GF (2008) Bacteriophage genomics. Curr Opin Microbiol 11:447–453
5. Suttle CA (2007) Marine viruses – major players in the global ecosystem. Nat Rev Microbiol 5:801–812
6. Rybniker J, Kramme S, Small PL (2006) Host range of 14 mycobacteriophages in Mycobacterium ulcerans and seven other mycobacteria including Mycobacterium tuberculosis – application for identification and susceptibility testing. J Med Microbiol 55:37–42
7. Bowman BU (1969) Properties of mycobacteriophage DS6A. I. Immunogenicity in rabbits. Proc Soc Exp Biol Med 131:196–200
8. Jones WD Jr (1975) Differentiation of known strains of BCG from isolates of Mycobacterium bovis and Mycobacterium tuberculosis by using mycobacteriophage 33D. J Clin Microbiol 1:391–392
9. Phillips LM, Sellers MI (1970) Effects of ethambutol, actinomycin D and mitomycin C on the biosynthesis of D29-infected mycobacterium smegmatis. In: Juhasz SE, Plummer G (eds) Host-virus relationships in mycobacterium, nocardia and actinomyces. Charles C. Thomas, Springfield, pp 80–102
10. David HL, Clavel S, Clement F, Moniz-Pereira J (1980) Effects of antituberculosis and antileprosy drugs on mycobacteriophage D29 growth. Antimicrob Agents Chemother 18:357–359
11. Tokunaga T, Kataoka T, Suga K (1970) Phage inactivation by an ethanol-ether extract of Mycobacterium smegmatis. Am Rev Respir Dis 101:309–313
12. Furuchi A, Tokunaga T (1972) Nature of the receptor substance of Mycobacterium smegmatis for D4 bacteriophage adsorption. J Bacteriol 111:404–411
13. Bisso G, Castelnuovo G, Nardelli MG, Orefici G, Arancia G, Lanéelle G, Asselineau C, Asselineau J (1976) A study on the receptor for a mycobacteriophage: phage phlei. Biochimie 58:87–97
14. Khoo KH, Suzuki R, Dell A, Morris HR, McNeil MR, Brennan PJ, Besra GS (1996) Chemistry of the lyxose-containing mycobacteriophage receptors of Mycobacterium phlei/Mycobacterium smegmatis. Biochemistry 35:11812–11819
15. Chen J, Kriakov J, Singh A, Jacobs WR Jr, Besra GS, Bhatt A (2009) Defects in glycopeptidolipid biosynthesis confer phage I3 resistance in Mycobacterium smegmatis. Microbiology 155:4050–4057
16. Hatfull GF (2013) Complete genome sequences of 63 mycobacteriophages. Genome Announc 1(6):e00847-13
17. Hatfull GF (2014) Mycobacteriophages: windows into tuberculosis. PLoS Pathog 10(3):e1003953
18. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D (1998) Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393(6685):537–544
19. Jacobs-Sera D, Marinelli LJ, Bowman C, Broussard GW, Guerrero Bustamante C, Boyle MM, Petrova ZO, Dedrick RM, Pope WH, Science Education Alliance Phage Hunters Advancing Genomics and Evolutionary Science Sea-Phages Program, Modlin RL, Hendrix RW, Hatfull GF (2012) On the nature of mycobacteriophage diversity and host preference. Virology 434:187–201
20. Court DL, Oppenheim AB, Adhya SL (2007) A new look at bacteriophage lambda genetic networks. J Bacteriol 189:298–304
21. Zumla A, George A, Sharma V, Herbert N, Baroness Masham of Ilton (2013) WHO’s 2013 global report on tuberculosis: successes, threats, and opportunities. Lancet 382(9907):1765–1767
22. Waites MJ, Morgan NL, Rockey JS, Higton G (2001) Industrial microbiology: an introduction. Blackwell Science Ltd, Hoboken, p 177
23. Fruciano DE, Bourne S (2007) Phage as an antimicrobial agent: d’Herelle’s heretical theories and their role in the decline of phage prophylaxis in the West. Can J Infect Dis Med Microbiol 18:19–26
24. Herelle FD (1917) An invisible microbe that is antagonistic to the dysentery bacillus. Comptes rendus Acad Sci 165:373–375
25. Levin BR, Bull JJ (2004) Population and evolutionary dynamics of phage therapy. Nat Rev Microbiol 2:166–173. https://doi.org/10.1038/nrmicro822
26. Lu TK, Koeris MS (2011) The next generation of bacteriophage therapy. Curr Opin Microbiol 14:524–531. https://doi.org/10.1016/j.mib.2011.07.028
27. Radetsky P (1996) The good virus. Discover. during therapy. PLoS One 6:e18327.
http://discovermagazine.com/1996/nov/ https://doi.org/10.1371/journal.pone.
thegoodvirus918 0018327
28. Samaddar S, Grewal RK, Sinha S, Ghosh S, 39. Gillespie SH (2002) Evolution of drug resis-
Roy S, Gupta SKD (2016) Dynamics of tance in Mycobacterium tuberculosis: clinical
mycobacteriophage-mycobacterial host inter- and molecular perspective. Antimicrob
action: evidence for secondary mechanisms Agents Chemother 46:267–274. https://
for host lethality. Appl Environ Microbiol doi.org/10.1128/AAC.46.2.267-274.2002
82:124–133 40. Trigo G, Martins TG, Fraga AG, Longatto-
29. Berry M, Gurung A, Easty DL (1995) Toxic- Filho A, Castro AG, Azeredo J, Pedrosa J
ity of antibiotics and antifungals on cultured (2013) Phage therapy is effective against
human corneal cells: effect of mixing, expo- infection by Mycobacterium ulcerans in a
sure and concentration. Eye 9(Part murine footpad model. PLoS Negl Trop Dis
1):110–115. https://doi.org/10.1038/eye. 7:e2183. https://doi.org/10.1371/journal.
1995.17 pntd.0002183
30. Lees AW, Allan GW, Smith J, Tyrrell WF, 41. Ford ME, Stenstrom C, Hendrix RW, Hatfull
Fallon RJ (1971) Toxicity form rifampicin GF (1998) Mycobacteriophage TM4:
plus isoniazid and rifampicin plus ethambutol genome structure and gene expression.
therapy. Tubercle 52:182–190. https://doi. Tuber Lung Dis 79:63–73. https://doi.org/
org/10.1016/0041-3879(71)90041-9 10.1054/tuld.1998.0007
31. Fenton M, Ross P, McAuliffe O, O’Mahony J, 42. Fullner KJ, Hatfull GF (1997) Mycobacter-
Coffey A (2010) Recombinant bacteriophage iophage L5 infection of Mycobacterium bovis
lysins as antibacterials. Bioeng Bugs 1:9–16. BCG: implications for phage genetics in the
https://doi.org/10.4161/bbug.1.1.9818 slow-growing mycobacteria. Mol. Microbio.
32. Fischetti VA (2008) Bacteriophage lysins as 26:755–766
effective antibacterials. Curr Opin Microbiol 43. Hatfull GF, Sarkis GJ (1993) DNA sequence,
11:393–400. https://doi.org/10.1016/j. structure and gene expression of mycobacter-
mib.2008.09.012 iophage L5: a phage system for mycobacterial
33. Matsuzaki S, Rashel M, Uchiyama J, genetics. Mol Microbiol 7:395–405. https://
Sakurai S, Ujihara T, Kuroda M, Ikeuchi M, doi.org/10.1111/j.1365-2958.1993.
Tani T, Fujieda M, Wakiguchi H, Imai S tb01131.x
(2005) Bacteriophage therapy: a revitalized 44. Piuri M, Hatfull GF (2006) A peptidoglycan
therapy against bacterial infectious diseases. J hydrolase motif within the mycobacterioph-
Infect Chemother 11:211–219. https://doi. age TM4 tape measure protein promotes effi-
org/10.1007/s10156-005-0408-9 cient infection of stationary phase cells. Mol
34. Schuch R, Nelson D, Fischetti VA (2002) A Microbiol 62:1569–1585. https://doi.org/
bacteriolytic agent that detects and kills Bacil- 10.1111/j.1365-2958.2006.05473.x
lus anthracis. Nature 418:884–889. https:// 45. Pedulla ML, Ford ME, Houtz JM,
doi.org/10.1038/nature01026 Karthikeyan T, Wadsworth C, Lewis JA,
35. Miller ES, Kutter E, Mosig G, Arisaka F, Jacobs-Sera D, Falbo J, Gross J, Pannunzio
Kunisawa T, Ruger W (2003) Bacteriophage NR, Brucker W, Kumar V, Kandasamy J,
T4 genome. Microbiol Mol Biol Rev Keenan L, Bardarov S, Kriakov J, Lawrence
67:86–156. https://doi.org/10.1128/ JG, Jacobs WR Jr, Hendrix RW, Hatfull GF
MMBR.67.1.86-156.2003 (2003) Origins of highly mosaic mycobacter-
36. Monk A, Rees C, Barrow P, Hagens S, Harper iophage genomes. Cell 113:171–182
D (2010) Bacteriophage applications: where 46. Pena CE, Judy S, Hatfull Graham F (1998)
are we now? Lett. Appl. Microbiol Mycobacteriophage D29 integrase-mediated
51:363–369. https://doi.org/10.1111/j. recombination: specificity of mycobacterioph-
1472-765X.2010.02916.x age integration. Gene 225:143
37. Williams MM, Yakrus MA, Arduino MJ, 47. Donnelly-Wu MK, Jacobs WR Jr, Hatfull GF
Cooksey RC, Crane CB, Banerjee SN, Hil- (1993) Superinfection immunity of mycobac-
born ED, Donlan RM (2009) Structural anal- teriophage L5: applications for genetic trans-
ysis of biofilm formation by rapidly and slowly formation of mycobacteria. Mol Microbiol
growing nontuberculous mycobacteria. Appl 7:407–417
Environ Microbiol 75:2091–2098 48. Doke S (1960) Studies on mycobacterio-
38. Colijn C, Cohen T, Ganesh A, Murray M phages and lysogenic mycobacteria. J Kuma-
(2011) Spontaneous emergence of multiple moto Med Soc 34:1360–1373
drug resistance in tuberculosis before and
49. Lee MH, Pascopella L, Jacobs WR Jr, Hatfull 62. Kaźmierczak Z, Górski A, Da˛browska K
GF (1991) Site-specific integration of myco- (2014) Facing antibiotic resistance: Staphylo-
bacteriophage L5: integration-proficient vec- coccus aureus phages as a medical tool.
tors for Mycobacterium smegmatis, Viruses 6:2551–2570
Mycobacterium tuberculosis, and bacille 63. Pires DP, Vilas Boas D, Sillankorva S, Azeredo
Calmette-Guerin. Proc Natl Acad Sci U S A J (2015) Phage therapy: a Step forward in the
88:3111–3115 treatment of Pseudomonas aeruginosa infec-
50. Chatterjee S, Mitra M, Das Gupta SK (2000) tions. J Virol 89:7449–7456
A high yielding mutant of mycobacteriophage 64. Chhibber S, Kaur S, Kumari S (2008) Thera-
L1 and its application as a diagnostic tool. peutic potential of bacteriophage in treating
FEMS Microbiol Lett 188:47–53 Klebsiella pneumoniae B5055-mediated lobar
51. Chaudhuri B, Sau S, Datta HJ, Mandal NC pneumonia in mice. J Med Microbiol
(1993) Isolation, characterization, and 57:1508–1513
mapping of temperature-sensitive mutations 65. Strój L, Weber-Dabrowska B, Partyka K,
in the genes essential for lysogenic and lytic Mulczyk M, Wójcik M (1999) Successful
growth of the mycobacteriophage L1. Virol- treatment with bacteriophage in purulent
ogy 194:166–172 cerebrospinal meningitis in a newborn.
52. Freitas-Vieira A, Anes E, Moniz-Pereira J Neurol Neurochir Pol 33:693–698
(1998) The site-specific recombination locus 66. Cisło M, Dabrowski M, Weber-Dabrowska B,
of mycobacteriophage Ms6 determines DNA Woytoń A (1987) Bacteriophage treatment of
integration at the tRNA(Ala) gene of Myco- suppurative skin infections. Arch Immunol
bacterium spp. Microbiology Ther Exp (Warsz) 35:175–183
144:3397–3406 67. Kwarcinski W, Lazarkiewicz B, Weber-
53. Bowman B Jr (1958) Quantitative studies on Dabrowska B, Rudnicki J, Kaminski K, Scie-
some mycobacterialphage host systems. J. bura M (1994) Bacteriophage therapy in the
Bacteriol 76:52–62 treatment of repeated subphrenic abscess and
54. Timme TL, Brennan PJ (1984) Induction of subhepatic abscess with jejunal fistula after
bacteriophage from members of the Myco- stomach resection. Pol Tyg Lek 49:535
bacterium avium, Mycobacterium intracellu- 68. Shabalova IA, Karpanov NI, Krylov VN, Shar-
lare, Mycobacterium scrofulaceum ibjanova TO, Akhverdijan VZ (1995) Pseudo-
serocomplex. J Gen Microbiol monas aeruginosa bacteriophage in treatment
130:2059–2066 of p. aeruginosa infection in cystic fibrosis
55. Young R (1992) Bacteriophage lysis: mecha- patients. In Proceedings of IX International
nism and regulation. Microbiol Rev Cystic Fibrosis Congress. International Cystic
56:430–481 Fibrosis Association, Zurich, Switzerland,
56. Young R (2002) Bacteriophage holins: deadly p. 443
diversity. J Mol Microbiol Biotechnol 69. Proskurov VA (1970) Use of staphylococcal
4:21–36 bacteriophage for therapeutic and preventive
57. Loessner MJ (2005) Bacteriophage endoly- purposes. Zh Mikrobiol Epidemiol Immuno-
sins – current state of research and applica- biol 47:104–107
tions. Curr Opin Microbiol 8:480–487 70. Pavlenishvili I, Tsertsvadze T (1993) Bacter-
58. Berry J, Rajaure M, Pang T, Young R (2012) iophagotherapy and enterosrbtion in treat-
The spanin complex is essential for lambda ment of sepsis of newborns caused by gram
lysis. J Bacteriol 194:5667–5674 negative bacteria. Pren Neon Infect 11:104
59. Catalão MJ, Gil F, Moniz-Pereira J, São- 71. Perepanova TS, Darbeeva OS, Kotliarova GA,
José C, Pimentel M (2013) Diversity in bac- Kondrat’eva EM, Maı̆skaia LM, Malysheva
terial lysis systems: bacteriophages show the VF, Baı̆guzina FA, Grishkova NV (1995)
way. FEMS Microbiol Rev 37:554–571 The efficacy of bacteriophage preparations in
60. Capparelli R, Parlato M, Borriello G, treating inflammatory urologic diseases. Urol
Salvatore P, Iannelli D (2007) Experimental Nefrol (Mosk) 5:14–17
phage therapy against Staphylococcus aureus 72. D’hérelle F (1923) (1993) The Bacterio-
in mice. Antimicrob Agents Chemother phage, Its Role in Immunity. Ind Med Gaz.
51:2765–2773 58(9):443–444
61. Denou E, Bruttin A, Barretto C, Ngom- 73. Brüssow H (2005) Phage therapy: the Escher-
Bru C, Brüssow H, Zuber S (2009) T4 phages ichia coli experience. Microbiology
against Escherichia coli diarrhea: Potential 151:2133–2140
and problems. Virology 388:21–30
74. Kaur S, Harjai K, Chhibber S (2014) during bacteriophage PRD1 infection. J

Bacteriophage-aided intracellular killing of Virol 80:8081–8088
engulfed methicillin-resistant Staphylococcus 85. Ravantti JJ, Ruokoranta TM, Alapuranen
aureus (MRSA) by murine macrophages. Appl AM, Bamford DH (2008) Global transcrip-
Microbiol Biotechnol 98:4653–4661 tional responses of Pseudomonas aeruginosa to
75. Gondil VS, Chhibber S (2017) Evading anti- phage PRR1 infection. J Virol 82:2324–2329
body mediated inactivation of bacteriophages 86. Fallico V, Ross RP, Fitzgerald GF, McAuliffe
using delivery systems. J Virol Curr Res O (2011) Genetic response to bacteriophage
1:555–574 infection in Lactococcus lactisreveals a four-
76. Levin B, Bull JJ (1996) Phage therapy revis- strand approach involving induction of mem-
ited: the population biology of a bacterial brane stress proteins, D-alanylation of the cell
infection and its treatment with bacteriophage wall, maintenance of proton motive force, and
and antibiotics. AM Nat 147:881–898 energy conservation. J Virol
77. Bansal S, Soni SK, Harjai K, Chhibber S 85:12032–12042
(2014) Aeromonas punctata derived depoly- 87. Garber M, Grabherr MG, Guttman M, Trap-
merase that disrupts the integrity of Klebsiella nell C (2011) Computational methods for
pneumoniae capsule: optimization of depoly- transcriptome annotation and quantification
merase production. J Basic Microbiol using RNA-seq. Nat Methods 8:469–477
54:711–720 88. Pan Y, Yang X, Duan J, Lu N, Leung AS,
78. Ramsugit S, Guma S, Pillay B, Jain P, Larsen Tran V, Hu Y, Wu N, Liu D, Wang Z, Yu X,
MH, Danaviah S, Pillay M (2013) Pili con- Chen C, Zhang Y, Wan K, Liu J, Zhu B
tribute to biofilm formation in vitro in Myco- (2011) Whole-genome sequences of four
bacterium tuberculosis. Antonie Van Mycobacterium bovis BCG vaccine strains. J
Leeuwenhoek 104:725–735 Bacteriol 193(12):3152e3
79. Alteri CJ, Xicohténcatl-Cortes J, Hess S, 89. Zvi A, Ariel N, Fulkerson J, Sadoff JC, Shaf-
Caballero-Olı́n G, Girón JA, Friedman RL ferman A (2008) Whole genome identifica-
(2007) Mycobacterium tuberculosis produces tion of Mycobacterium tuberculosis vaccine
pili during human infection. Proc Natl Acad candidates by comprehensive data mining
Sci U S A 104:5145–5150 and bioinformatics analyses. BMC Med
80. Hall-Stoodley L, Brun OS, Polshyna G, Genomics 1:18
Barker LP (2006) Mycobacterium marinum 90. Barrangou R, Fremaux C, Deveau H,
biofilm formation reveals cording morphol- Richards M, Boyaval P, Moineau S, Romero
ogy. FEMS Microbiol Lett 257:43–49 DA, Horvath P (2007) CRISPR provides
81. Ojha AK, Baughn AD, Sambandan D, Hsu T, acquired resistance against viruses in prokar-
Trivelli X, Guerardel Y, Alahari A, Kremer L, yotes. Science 315:1709–1712
Jacobs WR Jr, Hatfull GF (2008) Growth of 91. Lefranc MP, Giudicelli V, Ginestoux C et al
mycobacterium tuberculosis biofilms contain- (1999) IMGT, the international ImMunoGe-
ing free mycolic acids and harbouring drug- neTics database. Nucleic Acids Res. 27
tolerant bacteria. Mol Microbiol 69:164–174 (1):209–212
82. Broxmeyer L, Sosnowska D, Miltner E, 92. Singh H, Raghava GPS (2001) ProPred: pre-
Chacón O, Wagner D, McGarvey J, Barletta diction of HLA-DR binding sites. Bioinfor-
RG, Bermudez LE (2002) Killing of Myco- matics 17:1236e7
bacterium avium and Mycobacterium tuber- 93. Altschul SF, Madden TL, Sch€affer AA,
culosis by a mycobacteriophage delivered by a Zhang J, Zhang Z, Miller W, Lipman DJ
nonvirulent mycobacterium: a model for (1997) Gapped BLAST and PSI-BLAST: a
phage therapy of intracellular bacterial patho- new generation of protein database search
gens. J Infect Dis 186:1155–1160 programs. Nucleic Acids Res 25
83. Liu J, Dehbi M, Moeck G, Arhin F, Bauda P, (17):3389–3402
Bergeron D, Callejo M, Ferretti V, Ha N, 94. Dönnes P, Elofsson A (2002) Prediction of
Kwan T, McCarty J, Srikumar R, Williams D, MHC class I binding peptides, using
Wu JJ, Gros P, Pelletier J (2004) Antimicro- SVMHC. BMC Bioinformatics 3:25
bial drug discovery through bacteriophage 95. Noguchi H, Kato R, Hanai T, Matsubara Y,
genomics. Nat Biotechnol 22:185–191 Honda H, Brusic V, Kobayashi T (2002)
84. Poranen MM, Ravantti JJ, Grahn AM, Hidden Markov model-based prediction of
Gupta R, Auvinen P, Bamford DH (2006) antigenic peptides that interact with MHC
Global changes in cellular gene expression class II molecules. J Biosci Bioeng 94
(3):264–270
96. Lundegaard C, Lund O, Nielsen MJ (2011) 107. Toussaint NC, Feldhahn M, Ziehm M,
Prediction of epitopes using neural network Stevanovic S, Kohlbacher O (2011) T-cell
based methods. Immunol Methods 374 epitope prediction based on self-tolerance.
(1-2):26–34 In: Proceedings of the 2nd ACM Conference
97. Erlich H (2012) HLA DNA typing: past, on Bioinformatics, Computational Biology
present, and future. Tissue Antigens 80 and Biomedicine - BCB ’11. New York:
(1):1–11 ACM Press, p. 584
98. Zhang L, Chen Y, Wong HS, Zhou S, 108. Jacob L, Vert JP (2008) Efficient peptide-
Mamitsuka H, Zhu S (2012) TEPITOPEpan: MHC-I binding prediction for alleles with
extending TEPITOPE for peptide binding few known binders. Bioinformatics
prediction covering over 700 HLA-DR mole- 24:358–366
cules. PLoS One 7(2):e30483 109. Singh H, Raghava GP (2001) ProPred: pre-
99. Abecasis GR, Auton A, Brooks LD, DePristo diction of HLA-DR binding sites. Bioinfor-
MA, Durbin RM, Handsaker RE, Kang HM, matics 17:1236–1237
Marth GT, McVean GA (2012) An integrated 110. Wan J, Liu W, Xu Q, Ren Y, Flower DR, Li T
map of genetic variation from 1,092 human (2006) SVRMHC prediction server for
genomes. 1000 Genomes Project Consor- MHC-binding peptides. BMC Bioinformatics
tium. Nature 491(7422):56–65 7:463
100. Rammensee H, Bachmann J, Emmerich NP, 111. Vita R, Overton JA, Greenbaum JA,
Bachor OA, Stevanovic S (1999) SYF- Ponomarenko J, Clark JD, Cantrell JR,
PEITHI: database for MHC ligands and pep- Wheeler DK, Gabbard JL, Hix D, Sette A,
tide motifs. Immunogenetics 50:213–219 Peters B (2015) The immune epitope data-
101. Reche PA, Glutting JP, Reinherz EL (2002) base (IEDB) 3.0. Nucleic Acids Res 43(Data-
Prediction of MHC class I binding peptides base issue):405–412
using profile motifs. Human Immunol 112. Doytchinova IA, Guan P, Flower DR (2006)
63:701–709 EpiJen: a server for multistep T cell epitope
102. Parker KC, Bednarek MA, Coligan JE (1994) prediction. BMC Bioinformatics 7:131.
Scheme for ranking potential HLA-A2 bind- https://doi.org/10.1186/1471-2105-7-
ing peptides based on independent binding of 131
individual peptide side-chains. J Immunol 113. Donnes P, Kohlbacher O (2005) Integrated
152:163–175 modeling of the major events in the MHC
103. Lundegaard C, Lamberth K, Harndahl M, class I antigen processing pathway. Protein
Buus S, Lund O (2008) NetMHC-3.0: accu- Sci 4:2132–2140
rate web accessible predictions of human, 114. Larsen MV, Lundegaard C, Lamberth K,
mouse and monkey MHC class I affinities Buus S, Lund O, Nielsen M (2007) Large-
for peptides of length 8-11. Nucleic Acids scale validation of methods for cytotoxic
Res 36(Web Server issue):509–512 T-lymphocyte epitope prediction. BMC Bio-
104. Nielsen M, Lundegaard C, Blicher T, informatics 8:424
Lamberth K, Harndahl M, Justesen S et al 115. Tung CW, Ho SY (2007) POPI: predicting
(2007) NetMHCpan, a Method for Quanti- immunogenicity of MHC class I binding pep-
tative Predictions of Peptide Binding to Any tides by mining informative physicochemical
HLA-A and -B Locus Protein of Known properties. Bioinformatics 23:942–949
Sequence. PLoS ONE 2(8):e796 116. Sweredoski MJ, Baldi P (2008) (2008).
105. Bian H, Hammer J (2004) Discovery of pro- COBEpro: a novel system for predicting con-
miscuous HLA-II-restricted T cell epitopes tinuous B-cell epitopes. Protein Eng Des Sel
with TEPITOPE. Methods 34:468–475 22(3):113–120
106. Jojic N, Reyes-Gomez M, Heckerman D, 117. EL-Manzalawy Y, Dobbs D, Honavar V
Kadie C, Schueler-Furman O (2006) (2008) Predicting linear B-cell epitopes
Learning MHC I–peptide binding. Bioinfor- using string kernels. J Mol Recognit
matics 22:227–235 21:243–255
Chapter 20
Multiplexing of Immune Markers via Electrochemiluminescence Immunoassays for Systems Biology
Vrushali Abhyankar and Ammaar H. Abidi
Abstract
Electrochemiluminescence immunoassays are based on the principle of light emission in a chemical
environment to detect and analyze different proteins and biomolecules. They have numerous advantages over
traditional analytical methods, including conservation of sample, high sensitivity, broad range, and relative
ease of use. Herein, we describe electrochemiluminescence methods using the Mesoscale Discovery
System, with recommendations and protocol optimizations to aid in the discovery of biologically relevant
markers, and discuss how to avoid major pitfalls for accurate biomarker detection.
Key words Electrochemiluminescence, Immunoassay, ECLIA, Multiplex assays, Mesoscale Discovery
System, Biomarkers
1 Introduction
Electrochemiluminescence immunoassays (ECLIA) enable detection of important biomolecules by
generating light emission stimulated by an electrical signal in the appropriate chemical environment
[1]. The light signal is generated at the electrode surface at the bottom of multi-array and multi-spot
microplates, where conjugated detection antibodies (SULFO-TAG) allow electron transfer, resulting in
an excited state and consequently light emission [2]. Due to the nature of ECLIA, no background
fluorescence is generated, and no excitation light is required. In principle, the immunoassay
immobilizes the analyte (biomolecule of interest) with a pre-coated capture antibody on a high binding
plate at the working electrode surface; a detection antibody then binds tightly to the analyte,
squeezing it between the capture and detection antibodies in what is known as a “sandwich” assay
[3]. Conventional western blots (labor-intensive) [4] and enzyme-linked immunosorbent assays
(ELISA) [5] are other
methods commonly used for biomolecule detection, but they require more time, are more affected by
the complexity of biological matrices, and give less reliable and less sensitive results than ECLIA.
Western blots, once considered the “gold standard,” typically require denaturing of the protein,
making them more intricate and prone to operator error. ELISA and ECLIA are more sensitive than
western blot and capture proteins in their native conformation, but conventional ELISA requires
larger samples and more washing steps and limits how many analytes can be quantified per well.
ECLIA is a highly successful detection system that relies on stably labeled, conjugated biological
molecules without any radioactivity and achieves clinical-quality data from various samples (blood,
serum, cerebrospinal fluid, etc.) [6–9]. The ECLIA technique is used by Mesoscale Discovery (MSD),
whose plates, such as the MSD MULTI-SPOT 96-well V-PLEX and U-PLEX plates, can detect up to
10 analytes of interest. The assay protocol is simple and is not limited to the advertised products.
Several studies have used this technology to build custom ECLIA assays and optimized them with
MSD technology for their respective needs [10–12]. For instance, on-cell markers can be detected on
MSD high binding plates, which makes the approach well suited to external surface markers such as
cell receptors and ion channels. The principle of those assays is similar, but the key difference is that
cells are plated directly on the MSD high binding plates and probed with the respective primary
antibody (e.g., human, rat, goat) and a SULFO-TAG-labeled conjugated secondary detection antibody
that generates the electrochemiluminescence [13]. The general assay protocol can be summarized in
the following stages. (1) The treatment regimen is completed. (2) Samples are added to a pre-coated
MSD plate, followed by SULFO-TAG detection antibodies; adding MSD buffer creates the appropriate
chemical environment, and a voltage applied to the electrodes of the plate causes light emission.
(3) The reader instrument quantifies the intensity of the emitted light, which is proportional to the
amount of analyte present in the sample; the assay is optimized and validated according to the
principles of “Fit-for-Purpose Method and Validation for Successful Biomarker Measurements” [14].
2 Materials
As mentioned above, MSD assays are available as 10-spot multiplex kits, which can be used for
individual assays (e.g., IL-6) or a pre-selected subset of markers and/or customized to experimental
needs. Controls, wash buffer, plate sealers, and detection buffer are usually common to all kits
provided by MSD, whereas detection antibodies, diluents, calibrators, and plate types differ
between kits.
1. Polypropylene microcentrifuge tubes (appropriately sized) and a vortex mixer for reagent
preparation (e.g., calibrators, diluents).
2. Penicillin/Streptomycin (P/S), Gentamicin, DPBS, Hank’s
Buffer, 96-well plates (Fisher Scientific, Waltham, MA).
3. Multichannel or equivalent pipettor for dispensing solutions
(10–150 μL) into 96-well plate format (e.g., conditioned media,
wash buffer, antibody solution, and detection buffer).
4. MSD wash buffer (catalog #R61AA-1), which is included in some kits
but can also be prepared as 1× phosphate-buffered saline
(PBS) with the addition of 0.05% Tween-20.
5. Primary human microglia of unknown sex (PHMG) (Clonex-
press, Inc., Gaithersburg, MD).
6. Dulbecco’s Modified Eagle Media (DMEM).
7. Fetal Bovine Serum (FBS) (Atlanta Biologicals, Flowery
Branch, GA).
8. Primary human periodontal ligament fibroblasts (hPDLFs)
(Lonza Walkersville, MD).
9. Stromal cell–basal medium (SCBM) (Lonza Walkersville, MD).
10. SingleQuot Kit Supplements & Growth Factors Pack (CC-7049) (Lonza
Walkersville, MD).
11. Lipopolysaccharide (LPS), recombinant tumor necrosis factor
alpha (TNF-α), and recombinant interleukin-1 beta (IL-1β)
(Invitrogen).
12. Microplate Shaker (500–1000 rpm) needed during the capture
and detection phase of antibody incubation.
13. Adhesive plate seal (avoid solution spills into other wells).
14. Centrifuge may be needed for certain sample preparations.
15. Water (deionized).
16. Safety equipment for safe laboratory practices (gloves, safety
glasses, lab coats, etc.).
3 Methods
3.1 On Cell Western
1. Plate cells at a density of 20,000 cells/50 μL in full growth
media on high bind plates (Meso-Scale Discovery, Gaithersburg, MD).
2. The following day, gently remove the full growth media and
replace it with 100 μL of media (e.g., DMEM with 1% FBS and
1% P/S). Let the cells harmonize and synchronize their activity over a
24-h period.
3. The next day, add 100 μL of media with 1% FBS and 1% P/S
containing the stimulus to achieve a final well concentration
of 1×. An hour later (after stimulating inflammation), add the
compounds of choice at previously determined concentrations.
Make sure to account for volume differences so that the
final concentrations of ligands and stimulus are 1× (see
Note 1).
4. Discard the medium after 24 h and use the antibody concentrations
most appropriate for your assay. For instance, to study the on-cell
marker for pro-inflammation, add 2 μg/mL of CD16/32 in 30 μL PBS
to each well and incubate for 2 h at room temperature with light
shaking at 130 rpm, followed by two gentle washes with 150 μL PBS.
5. Then add 30 μL of 2 μg/mL anti-rat (CD16/32) SULFO-TAG
antibody for another 2-h incubation at room temperature with light
shaking at 130 rpm.
6. After three gentle washes with 150 μL/well PBS, add 2×
Surfactant-Free Read Buffer and follow the instructions in
Subheading 3.5 [13].
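The volume bookkeeping behind step 3 and Note 1 can be checked with a short calculation. The following is a minimal sketch, not part of any MSD software: the function name `stock_fold` is hypothetical, and the volumes (100 μL already in the well, 100 μL of stimulus medium, 50 μL of compound) are taken from the steps above and Note 1.

```python
def stock_fold(final_volume_ul, added_volume_ul):
    """Fold-concentration a solution needs so that adding `added_volume_ul`
    of it to a well gives 1x in a total of `final_volume_ul`."""
    if added_volume_ul <= 0 or added_volume_ul > final_volume_ul:
        raise ValueError("added volume must be positive and no larger than the final volume")
    return final_volume_ul / added_volume_ul


# Step 3: 100 uL of stimulus medium is added to the 100 uL already in the well.
stimulus_fold = stock_fold(final_volume_ul=200, added_volume_ul=100)

# One hour later: 50 uL of compound is added, bringing the well to 250 uL.
compound_fold = stock_fold(final_volume_ul=250, added_volume_ul=50)

print(f"stimulus medium: {stimulus_fold:g}x")      # prints "stimulus medium: 2x"
print(f"compound solution: {compound_fold:g}x")    # prints "compound solution: 5x"
```

Under these assumed volumes, preparing the compound at 5× in medium that already contains the stimulus at 1× keeps the stimulus at 1× after the addition, which is consistent with Note 1.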
3.2 Conventional Assay (Multiplex)
1. To obtain conditioned media for the immunoassays, seed cells at
densities of 10,000–30,000 cells/well in 96-well polystyrene
flat-bottom plates (use collagen plates if cell staining protocols are
to be performed). Samples can also be obtained from, but are not
limited to, plasma, serum, urine, and CSF (see Note 2).
2. In a cell culture–based system, change the medium after 24 h
from full growth medium (which has a higher serum content) to
medium with 0–1% FBS and the respective antibiotic (e.g., P/S,
Gentamicin) for another 24 h at 37 °C, 5% CO2 to synchronize cell activity.
3. Check the cells for optimal health and morphology, then add the
stimulus of choice (prepared in the 0–1% FBS medium) to induce an
immune response. The most common stimulants include LPS, TNF-α,
and IL-1β.
4. There are a few approaches to testing the ability of a compound or
molecule to increase or decrease immune responses under stimulated
conditions. The stimulus can be added an hour beforehand, at its
respective concentration, to create a proinflammatory response prior
to the addition of the compound. For additive or synergistic effects,
the compound and stimulus can be added together (co-stimulation).
To achieve stimulated inflammation, the compound can be added
after 30–60 min of stimulus [15] (see Note 3).
5. The conditioned medium can be removed based on the assay
parameters or at allotted points in a time course (e.g., 1, 6,
18, 24, 48, 72 h).
6. Remove the conditioned media for use with Mesoscale Discovery
human/mouse kits. Cell culture plates on collagen can be further fixed
and analyzed with the respective staining protocols (see Note 4).
3.3 ECLIA Immunoassay Working Solutions Preparation (MesoScale Discovery System)
1. Preparation of calibrators is an essential step prior to
performing the immunoassay (see Note 5).
2. The manufacturer will provide either a single- or multi-analyte
lyophilized calibrator, based upon the markers selected for the
immunoassay, and the respective diluent to perform a serial dilution
that creates a concentration gradient.
3. Reconstitute the calibrator with the volume of the respective diluent
provided in the kit and name it calibrator #1. After adding the
diluent, it is preferable to invert the calibrator three or
more times rather than vortex it. Leave calibrator #1 (the
reconstituted solution) at room temperature for 15–30 min, then
vortex briefly with short pulses.
4. Prepare a total of 8 calibrators and label them accordingly.
Use a fourfold serial dilution (1:4 ratio) to generate
calibrators #1–7; calibrator #8 is diluent only.
Transfer 100 μL of calibrator #1 (highest concentration) to
another tube containing 300 μL of diluent (calibrator
#2) and vortex. Make calibrator #3 in the same way by
transferring 100 μL of calibrator #2 to another 300 μL of
diluent. Repeat the same process to make the remaining
calibrators (#4–7) (see Notes 6–7).
5. Sample preparation requires diluents, which may vary per kit,
to be added to the samples; run samples at least in duplicate to
optimize study results. Generally, MSD will recommend a dilution;
however, in V-Plex kits a twofold dilution is used for cytokine
detection (e.g., 25 μL sample and 25 μL diluent), while a fourfold
dilution is used for chemokines (e.g., 12.5 μL sample and
37.5 μL diluent) (note: a control pack may also
be purchased in addition to the materials already supplied by
MSD) (see Notes 7–10).
6. To prepare the detection antibody solution, the final volume in
the provided diluent should be equal to 3000 μL. For instance,
if only 1 analyte is present, then use 60 μL of provided detec-
tion antibody and transfer it to a vial with 2940 μL respective
diluent. If several analytes are present, then add 60 μL detec-
tion antibody for each respective analyte to a vial with diluent
to bring the final volume to 3000 μL (see Note 11).
7. Wash buffer is prepared by adding 0.05% Tween-20
to 1× PBS solution. However, several MSD kits come with
wash buffer as a 20× stock solution. The recommended volume of
wash buffer solution per plate is 300 mL. If wash buffer is
included, follow the MSD guidelines for preparing the wash
buffer solution.
8. Read buffer is included in each kit. The
working solution for the read buffer is usually 2×. To make
20 mL of working read buffer, use 10 mL of the provided Read Buffer
(4×) and 10 mL of deionized water.
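The arithmetic in steps 4, 6, and 8 can be laid out as a short script before pipetting. This is only an illustrative sketch: the function names are hypothetical, and the top calibrator concentration of 1000 pg/mL is a placeholder, since the actual value depends on the kit and lot.

```python
def calibrator_series(top_conc, fold=4, n_dilutions=7):
    """Return concentrations for calibrators #1-#7 plus a diluent-only #8.

    top_conc is the concentration of reconstituted calibrator #1; the value
    used below is a placeholder, not a kit specification.
    """
    series = [top_conc / fold ** i for i in range(n_dilutions)]
    series.append(0.0)  # calibrator #8 contains diluent only
    return series


def detection_antibody_mix(n_analytes, antibody_ul=60.0, total_ul=3000.0):
    """Return (antibody volume per analyte, diluent volume) in microliters."""
    diluent_ul = total_ul - n_analytes * antibody_ul
    if n_analytes < 1 or diluent_ul < 0:
        raise ValueError("number of analytes does not fit the final volume")
    return antibody_ul, diluent_ul


def dilute_stock(stock_fold, working_fold, final_ml):
    """Return (stock mL, water mL) needed for final_ml of working solution."""
    stock_ml = final_ml * working_fold / stock_fold
    return stock_ml, final_ml - stock_ml


# Hypothetical top calibrator of 1000 pg/mL, diluted fourfold as in step 4.
for number, conc in enumerate(calibrator_series(1000.0), start=1):
    print(f"calibrator #{number}: {conc:.3f} pg/mL")

print(detection_antibody_mix(1))    # single analyte, as in step 6
print(detection_antibody_mix(10))   # ten analytes sharing one vial
print(dilute_stock(4, 2, 20))       # 4x read buffer diluted to 20 mL of 2x
```

With one analyte the sketch returns 60 μL of antibody and 2940 μL of diluent, matching step 6, and the read buffer call returns 10 mL of stock plus 10 mL of water, matching step 8.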
3.4 ECLIA Immunoassay Assay Protocol (MesoScale Discovery System)
1. Following the preparation of working solutions and reagents,
wash the 96-well multi-spot plate with the wash buffer solution
at least 3 times (150 μL/well). Turn the plate upside
down with a slight jerk to remove the fluid.
2. Add 50 μL/well of calibrator in duplicate (minimum replicate)
from the highest concentration to the lowest:
(A1-A2 → Calibrator 1), (B1-B2 → Calibrator 2) to
(H1-H2 → Calibrator 8) (note: calibrator 8 is diluent only)
(note: calibrators should not be diluted).
3. Follow respective treatment/non-treatment layout of samples
and add 50 μL/well of sample to each well (note: it may save
material and time if diluents for the samples are added directly
to the multi-spot plate followed by addition of the samples. For
instance, in the cytokine plates, a twofold dilution is recom-
mended. Therefore, 25 μL of diluent can be added to the plate
prior to the addition of 25 μL of sample).
4. After the addition of the sample, seal the plate and incubate for
2 h on a plate shaker with rotary motion (500–1000 rpm) (see
Note 12).
5. Discard the conditioned media/supernatant by rotating the
plate 180° with a slight jerk to remove the contents. Do not
aspirate with pipette tips, as they can damage the bottom of the plate.
After the initial removal of the conditioned media, the plate can also
be tilted over a paper towel and tapped lightly to get rid of the
leftover media (see Notes 13–14).
6. Wash the plate at least three times with the wash buffer solution
(150 μL/well). Let it sit on a paper towel for a few
minutes after the washes to dry out the surface.
7. Add 25 μL/well of the prepared detection antibody solution. Seal the
plate and incubate for 2 h on a plate shaker with rotary motion
(500–1000 rpm) (note: incubation times can be increased to
overnight if needed).
8. Wash the plate at least three times with the wash buffer solution
(150 μL/well). Let it sit on a paper towel for a couple
of minutes after the washes to dry out the surface.
9. Add 150 μL/well of 2× read buffer and incubate at room
temperature for 10 min (note: the read buffer incubation time
may vary per kit. Additionally, adding the read buffer is an
important step. The plates can be stored overnight after the
incubation period has been completed. However, once detection
buffer has been added, the plates must be read the same day)
(see Notes 14–15).
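The calibrator positions in step 2 can be written down as a plate map before loading. The sketch below is an illustration under stated assumptions: `plate_layout` is a hypothetical helper, the sample names are placeholders, and filling samples row by row from column 3 is only one possible arrangement, since the actual sample layout follows the treatment design.

```python
import string

ROWS = string.ascii_uppercase[:8]   # rows A-H of a 96-well plate
COLUMNS = range(1, 13)              # columns 1-12


def plate_layout(sample_names, replicates=2):
    """Map well IDs (e.g. 'A1') to their contents.

    Calibrators 1-8 go in duplicate into columns 1-2 of rows A-H, as in
    step 2. Samples then fill the remaining wells row by row, which is an
    assumed arrangement rather than a requirement of the protocol.
    """
    layout = {}
    for number, row in enumerate(ROWS, start=1):
        for col in (1, 2):
            layout[f"{row}{col}"] = f"Calibrator {number}"

    free_wells = [f"{row}{col}" for row in ROWS for col in COLUMNS if col > 2]
    if len(sample_names) * replicates > len(free_wells):
        raise ValueError("too many samples for one plate")

    wells = iter(free_wells)
    for name in sample_names:
        for _ in range(replicates):
            layout[next(wells)] = name
    return layout


# Placeholder sample names for illustration only.
layout = plate_layout(["ctrl", "LPS"])
for well in ("A1", "A2", "H1", "H2", "A3", "A4", "A5", "A6"):
    print(well, layout[well])
```

Keeping the map in one place makes it easier to enter the same layout later in the reader software.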
3.5 ECLIA MesoScale Discovery Reader and Analysis
1. MSD Discovery Workbench (DWB) software can be used to
prepare a template for the experiment or to reuse previously
saved plate layouts (Fig. 1).
2. Launch the DWB software; either make a new template or use a
previously prepared template.
3. Run the plate on the reader (e.g., MSD Sector 2400); the
orientation of the plate does not matter, as the barcodes allow the
reader to detect it. After reading, the plates can be
returned to the input site or sent to an adjacent site if multiple
plates are being read.
4. Click new plate layout icon and select the appropriate spot for
each well. If the plate is a standard kit (proinflammatory panel),
stored kit layouts can be used to simplify the process.
Fig. 1 Creating an experiment in Mesoscale Discovery Workbench Software

Fig. 2 Importing experiment into Mesoscale Discovery Workbench Software
5. After assigning the respective spot for each marker, define
standards, controls, and unknowns using the icons available within
the software. Additionally, select the number of replicates and the
concentration with dilution factor. Continue until the layout
has been completed, then save the plate layout and exit
the file (see Notes 16–17).
6. Click on the plate from Plate (Table) and select the analysis of
the plate using the template.
7. Select the appropriate template with the configuration of the
experimental design.
8. This creates the experiment with the template that will have
saved plate layout and profile (Fig. 2).
9. The experiment parameters, data grid per specific marker, and
standards with statistics can be viewed (Figs. 3, 4, 5, and 6).
10. Plots for standards and unknowns can also be generated within
the software (Fig. 7).
Fig. 3 Experiment 10-spot and plate information in Mesoscale Discovery Workbench Software
11. The data can be exported to Excel as a spreadsheet, from which
graphs can be generated in Excel or other software (Prism, etc.)
(Fig. 8).
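Once the data have been exported, replicate wells can be averaged outside the reader software. The following sketch assumes the spreadsheet has been saved as CSV; the column names and every numeric value in it are invented placeholders for illustration, not DWB's own export headers or real measurements. It also assumes that the exported concentrations have not already been corrected for sample dilution.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical export: headers and values are placeholders for illustration.
EXPORT = """sample,analyte,well,calc_conc_pg_ml
ctrl,IL-6,A3,10.0
ctrl,IL-6,A4,14.0
LPS,IL-6,A5,210.0
LPS,IL-6,A6,190.0
"""


def summarize(csv_text, dilution_factor=1.0):
    """Average replicate wells per (sample, analyte) and apply the
    sample dilution factor (for example 2.0 for a twofold dilution)."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["sample"], row["analyte"])
        groups[key].append(float(row["calc_conc_pg_ml"]))
    return {key: mean(values) * dilution_factor for key, values in groups.items()}


summary = summarize(EXPORT, dilution_factor=2.0)
for (sample, analyte), conc in summary.items():
    print(f"{sample} {analyte}: {conc:.1f} pg/mL")
```

If the dilution factor was already entered in the plate layout, the software may have applied it, in which case `dilution_factor` should be left at 1.0.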
4 Notes
1. As per the protocol, the final volume will become 250 μL if 50 μL
of ligand/stimulus is added. Therefore, prepare the ligand at 5×
concentration with 1× LPS.
2. It is preferable to have at least 25 μL per sample and to perform
3 replicates (minimum 75 μL per sample for three replicates).
3. Always include separate controls for each respective compound
or stimulus for comparison.
4. In the on-cell/polarization assays, cells are maintained and the
conditioned media is discarded, while in cell pathway (intracellular)
studies, lysis buffer is used to acquire lysate before the next
steps are performed.
Fig. 4 Inter-plate statistics of standards and unknowns after read in Mesoscale Discovery Workbench
Software
5. It is important to bring all reagents provided by the manufac-

turer to room temperature prior to beginning of the assay.
6. It is important to note that all incubation steps for each assay
should be performed under the same conditions, preferably between
20 and 26 °C. This helps with consistency and repeatability
between runs.
7. Frozen diluents should be thawed and brought to room
temperature before use. It is advisable to keep them in a
room-temperature water bath (24–25 °C). Note whether any reagents
need to be kept on ice and use them as directed. Generally, all
reagents should be at room temperature before use.
8. When preparing solutions, it is preferable to prepare them in
microcentrifuge tubes while using fresh pipette tips for each
dilution.
9. Be careful with pipetting technique: DO NOT touch the bottom of
the MSD plate well with the pipette tip, as this can damage or
disrupt the plate.
Fig. 5 Spot selection in 10-spot plate displays data grid for selected analyte in Mesoscale Discovery
Workbench Software
10. Avoid making bubbles during pipetting steps as this may also
lead to variability in results. Important: Avoid bubbles espe-
cially during the addition of Read Buffer; if bubbles are pres-
ent, carefully use a pipette tip and gently remove them. DO
NOT touch the pipette tip at the bottom of the well or shake
plate after the addition of Read Buffer.
11. Capture antibodies are pre-coated and exposed to a proprietary
stabilizing treatment on the multi-spot plates unless U-Plex is
purchased, in which case a few additional steps are performed to
prime the plate with capture antibodies.
12. Increasing the rotary motion (rpm) for the plate shaker during
capture/detection stage may aid in reaching equilibrium.
Therefore, increasing above 500 rpm but below 1000 rpm is
recommended if available.
13. Keep plates sealed during incubation periods and carefully
remove the seal as fluids can spill.
Fig. 6 Statistics and protein concentrations of standards in Mesoscale Discovery Workbench Software
14. After the aspiration of media from MSD plates, gently tap the
plates over a paper towel to remove residual content.
15. The plates can also be run partially. To avoid spills affecting
unused wells, keep the unused wells on the plate sealed or covered
during the procedure. The plate can be kept in the fridge
(2–8 °C) for up to 30 days in the foil pack with the desiccant.
16. Additionally, the dilutions in the general guidelines provided
by MSD do not always place every sample within the calibration
curve. If samples fall above or below the range of the calibration
curve fit, dilutions can be altered to optimize the response.
17. If the results after the read are not as expected, consult an
MSD technical specialist, as dilution errors or the assay template
setup in MSD Workbench Software may be the cause (a common
error).
Alternative technologies exist that may also be helpful
in the detection of biomarkers. A brief overview of two
comparable technologies is given below.
Fig. 7 Standard curve and fit of unknowns in Mesoscale Discovery Workbench Software
Homogeneous Time Resolved Fluorescence (HTRF) Immunoassays:
Cisbio HTRF technology is a FRET-based technology that
exhibits great stability and specificity, making it one of the best
applications of time-resolved fluorescence resonance energy
transfer (FRET) (Fig. 9). It is especially stable over time,
allowing repetitive measurements for days. Although Cisbio
HTRF technology lacks multiplex capacity, it can be
configured for a duplex assay. HTRF technology may also be
more cost effective, as several manufacturers of plate readers
(BioTek, Thermo, Molecular Devices, etc.) support the
filters for these assays, making it an easy-to-use and time-saving
approach. The HTRF reagents are resistant to most experimental
conditions, including light, cell media, chelates,
DMSO, pH, temperature, etc. Furthermore, due to the sensitivity
of the assay, it requires only small volumes for small to large
complex interactions, with a stable signal over time and scalability
[16–18].
Fig. 8 Workbench data exported to MS Excel to generate graphs and respective sheets
Fig. 9 HTRF-based assay principle to detect donor–acceptor biotinylated biomolecules. Permission to reprint
this image has been kindly provided by Cisbio
Bead or particle immobilization:
Luminex xMAP technology assays are performed on the
surface of microspheres, which are color coded for multiple
different targets. The beads are read either by flow cytometers
or by compact analyzers that read the reactions on each individual
microsphere. However, specialized equipment is a requirement
for using these types of assays [19–22]. Additionally, in the
event that the marker of choice is not available, users can
still select singleplex assays like ELISA to evaluate one analyte at
a time to build a profile, and create a workflow to combine assays
for multiple targets into a custom multiplex immunoassay via
the companies mentioned above. Thereby, multiple analytes
can be captured to fit the laboratory's needs.
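The volume arithmetic in Notes 1 and 2 can be checked with a short calculation. The following is a minimal sketch only; the function names are assumptions introduced here for illustration and are not part of any MSD software.

```python
def stock_fold(final_volume_ul: float, added_volume_ul: float) -> float:
    """Fold concentration a reagent needs so that it is 1x after addition."""
    return final_volume_ul / added_volume_ul


def min_sample_volume_ul(per_well_ul: float, replicates: int) -> float:
    """Minimum sample volume needed to load all replicate wells."""
    return per_well_ul * replicates


# Note 1: 50 uL of ligand/stimulus in a 250 uL final volume.
print(stock_fold(250, 50))          # 5.0
# Note 2: 25 uL per well, 3 replicates.
print(min_sample_volume_ul(25, 3))  # 75
```

In practice some dead volume should be added on top of the computed minimum.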
References
1. Sano M, Tatsumi N (1996) Electrochemiluminescence immunoassay. Rinsho Byori 44(11):1076–1079
2. Bastarache JA, Koyama T, Wickersham NE et al (2014) Validation of a multiplex electrochemiluminescent immunoassay platform in human and mouse samples. J Immunol Methods 408:13–23. https://doi.org/10.1016/j.jim.2014.04.006
3. Belanger L, Sylvestre C, Dufour D (1973) Enzyme-linked immunoassay for alpha-fetoprotein by competitive and sandwich procedures. Clin Chim Acta 48(1):15–18
4. Mahmood T, Yang P-C (2012) Western blot: technique, theory, and trouble shooting. N Am J Med Sci 4(9):429–434. https://doi.org/10.4103/1947-2714.100998
5. Vashist SK, Luong JHT (2018) Chapter 5 – Enzyme-linked immunoassays. In: Vashist SK, Luong JHT (eds) Handbook of immunoassay technologies. Academic Press, pp 97–127
6. Bittner T, Zetterberg H, Teunissen CE et al (2016) Technical performance of a novel, fully automated electrochemiluminescence immunoassay for the quantitation of beta-amyloid (1-42) in human cerebrospinal fluid. Alzheimers Dement 12(5):517–526. https://doi.org/10.1016/j.jalz.2015.09.009
7. Verstraete AG, Rigo-Bonnin R, Wallemacq P et al (2018) Multicenter evaluation of a new electrochemiluminescence immunoassay for everolimus concentrations in whole blood. Ther Drug Monit 40(1):59–68. https://doi.org/10.1097/ftd.0000000000000474
8. Sanden E, Enriquez Perez J, Visse E et al (2016) Preoperative systemic levels of VEGFA, IL-7, IL-17A, and TNF-beta delineate two distinct groups of children with brain tumors. Pediatr Blood Cancer 63(12):2112–2122. https://doi.org/10.1002/pbc.26158
9. Gafson AR, Thorne T, McKechnie CIJ et al (2018) Lipoprotein markers associated with disability from multiple sclerosis. Sci Rep 8(1):17026. https://doi.org/10.1038/s41598-018-35232-7
10. Kuster DWD, Barefield D, Govindan S et al (2013) A sensitive and specific quantitation method for determination of serum cardiac myosin binding protein-C by electrochemiluminescence immunoassay. J Vis Exp (78):50786. https://doi.org/10.3791/50786
11. Kleinberger G, Yamanishi Y, Suarez-Calvet M et al (2014) TREM2 mutations implicated in neurodegeneration impair cell surface transport and phagocytosis. Sci Transl Med 6(243):243ra286. https://doi.org/10.1126/scitranslmed.3009093
12. Bacioglu M, Maia LF, Preische O et al (2016) Neurofilament light chain in blood and CSF as marker of disease progression in mouse models and in neurodegenerative diseases. Neuron 91(2):494–496. https://doi.org/10.1016/j.neuron.2016.07.007
13. Presley C, Abidi A, Suryawanshi S et al (2015) Preclinical evaluation of SMM-189, a cannabinoid receptor 2-specific inverse agonist. Pharmacol Res Perspect 3(4):e00159. https://doi.org/10.1002/prp2.159
14. Lee JW, Devanarayan V, Barrett YC et al (2006) Fit-for-purpose method development and validation for successful biomarker measurement. Pharm Res 23(2):312–328. https://doi.org/10.1007/s11095-005-9045-3
15. Abidi AH, Presley CS, Dabbous M et al (2018) Anti-inflammatory activity of cannabinoid receptor 2 ligands in primary hPDL fibroblasts. Arch Oral Biol 87:79–85. https://doi.org/10.1016/j.archoralbio.2017.12.005
16. Kimos M, Burton M, Urbain D et al (2016) Development of an HTRF assay for the detection and characterization of inhibitors of catechol-O-methyltransferase. J Biomol Screen 21(5):490–495. https://doi.org/10.1177/1087057115616793
17. Heuninck J, Hounsou C, Dupuis E et al (2019) Time-resolved FRET-based assays to characterize G protein-coupled receptor hetero-oligomer pharmacology. Methods Mol Biol (Clifton, NJ) 1947:151–168. https://doi.org/10.1007/978-1-4939-9121-1_8
18. Ayoub MA, Trebaux J, Vallaghe J et al (2014) Homogeneous time-resolved fluorescence-based assay to monitor extracellular signal-regulated kinase signaling in a high-throughput format. Front Endocrinol (Lausanne) 5:94. https://doi.org/10.3389/fendo.2014.00094
19. Pan J, Zheng QZ, Li Y et al (2019) Discovery and validation of a serological autoantibody panel for early diagnosis of esophageal squamous cell carcinoma. Cancer Epidemiol Biomarkers Prev. https://doi.org/10.1158/1055-9965.Epi-18-1269
20. Cao Q, Xiao B, Jin G et al (2019) Expression of transforming growth factor beta and matrix metalloproteinases in the aqueous humor of patients with congenital ectopia lentis. Mol Med Rep. https://doi.org/10.3892/mmr.2019.10287
21. Huang ZL, Meng PP, Yang Y et al (2019) Establishment of a bead-based duplex assay for the simultaneous quantitative detection of Neuropilin-1 and Neuropilin-2 using xMAP technology and its clinical application. J Clin Lab Anal 33(4):e22850. https://doi.org/10.1002/jcla.22850
22. Bates AM, Fischer CL, Abhyankar VP et al (2018) Matrix metalloproteinase response of dendritic cell, gingival epithelial keratinocyte, and T-cell transwell co-cultures treated with porphyromonas gingivalis hemagglutinin-B. Int J Mol Sci 19(12). https://doi.org/10.3390/ijms19123923
Chapter 21
AAgAtlas 1.0: A Database of Human Autoantigens Extracted
from Biomedical Literature
Dan Wang, Yupeng Zhang, Qing Meng, and Xiaobo Yu
Abstract
Autoantibodies are antibodies against host self-proteins (autoantigens), which play significant roles in
homeostasis maintenance and diseases with autoimmune disorders. Numerous papers were published in
the past decade on the identification of human autoantigens in different human diseases. However, there is
no consensus collection with all the reported autoantigens yet. To address this need, previously we
developed a human autoantigen database, AAgAtlas 1.0, by text-mining and manual curation, which
collects 1126 autoantigens associated with 1071 human diseases. AAgAtlas 1.0 provides a user-friendly
interface to conveniently browse, retrieve, and download human autoantigen genes, their functional
annotation, related diseases, and the evidence from the literature. AAgAtlas is freely available online at
http://biokb.ncpsb.org/aagatlas/. In this chapter, we introduce the AAgAtlas 1.0 database and provide a
guide for its users.
Key words Database, Autoantibody, Autoantigen, Autoimmune disease, Cancer, Biomarker, Diag-
nosis, Therapeutic treatment
1 Introduction
Autoantibodies (AAbs) are antibodies targeting self-proteins
(autoantigens, AAgs), whose generation is subject to genetic
predisposition, environment, pathogen infection, etc. These
AAbs play significant roles in homeostasis maintenance and diseases
with autoimmune disorders. Numerous papers were published in
the past decade on the identification of AAgs associated with the
diagnosis and therapeutic targeting of human diseases [1–3]. However,
there is no consensus collection of all the identified AAgs yet [4].
To address this need, our lab previously developed the AAgAtlas
1.0 database to support basic and translational studies associated
with autoimmunity [4]. We searched all PubMed abstracts by
text mining with the keywords “Autoantibody”, “Autoantigen”,
their synonyms, and lexical variants. We identified 1126
genes and 1071 associated human diseases accordingly. AAgAtlas

Fig. 1 The website of AAgAtlas 1.0 database
1.0 database provides a user-friendly interface to conveniently
browse, retrieve, and download the list of human AAg genes and
their related diseases. Both gene- and disease-centric queries are
provided in the search engine of AAgAtlas 1.0. The list of AAgs and
the corresponding literature evidence can be downloaded online. The
website is freely accessible at http://biokb.ncpsb.org/aagatlas/
(Fig. 1) and is expected to be a valuable resource for translational
studies. In this chapter, we introduce the workflow of AAgAtlas
1.0 database construction and provide a guide for users of the
AAgAtlas 1.0 database.
2 Methods
All AAg genes were collected from the published biomedical
abstracts in the PubMed database. Both keyword and gene
recognition were performed by a customized ontology-based entity
recognizer. Our recognition tool obtained an F-measure of 0.845
when evaluated against the CRAFT corpus for gene/protein
recognition based on Protein Ontology, which is comparable to
that of the BeCAS annotation system (0.76). The three-step procedure
of text mining and manual curation of the human AAg dataset is
described below.
2.1 AAg-Related Keywords Extraction
All PubMed abstracts were extracted from the PubMed database
through the NCBI E-utilities API. The AAg-related abstracts
were obtained by the bio-entity recognizer using the keywords
“autoantigen” or “autoantibody” or their lexical variants,
such as “auto-antigen”, “autoantigens”, “auto-antigens”,
“auto-antibody”, “autoantibodies”, or “auto-antibodies”. As a result,
45,830 abstracts and 94,313 sentences were obtained (see Notes 1 and 2).
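The keyword step above can be illustrated with a regular expression that covers the listed lexical variants. This is a minimal sketch under stated assumptions, not the authors' ontology-based recognizer: the pattern, the naive sentence splitter, and the example abstract are all introduced here for illustration.

```python
import re

# Matches "autoantigen(s)", "auto-antigen(s)", "autoantibody", "auto-antibody",
# "autoantibodies", and "auto-antibodies", ignoring case.
AAG_KEYWORD = re.compile(r"\bauto-?(?:antigens?|antibod(?:y|ies))\b", re.IGNORECASE)


def split_sentences(abstract: str) -> list[str]:
    """Naive splitter on sentence-final punctuation (illustrative only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def keyword_sentences(abstract: str) -> list[str]:
    """Keep only the sentences that mention an AAg-related keyword."""
    return [s for s in split_sentences(abstract) if AAG_KEYWORD.search(s)]


# Hypothetical abstract text, written for this example.
abstract = (
    "Autoantibodies against p53 were detected in patient sera. "
    "Tumor size was recorded at baseline. "
    "BRCA1 was identified as an auto-antigen in breast cancer."
)
hits = keyword_sentences(abstract)
print(len(hits))  # 2
```

Only the keyword-containing sentences are passed on to the gene recognition step.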
2.2 AAg Gene Candidate Recognition
A dictionary of human gene/protein mentions was constructed
with all extracted gene/protein names and their corresponding
synonyms from Protein Ontology. These mentions were then
mapped to official gene symbols from the HUGO Gene Nomenclature
Committee, HGNC (www.genenames.org/). Next, all the genes in
keyword-containing sentences recognized by the gene dictionary
were selected as candidates, which resulted in a total of 3984
candidates from 25,520 abstracts and 43,253 sentences.
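Dictionary-based candidate recognition can be sketched as a lookup from gene/protein mentions to official symbols. The dictionary below is a small hypothetical subset written for this example; the chapter builds its dictionary from Protein Ontology and HGNC, and its matching logic is not described in enough detail to reproduce here.

```python
import re

# Toy synonym dictionary (assumption: illustrative subset only).
SYNONYM_TO_SYMBOL = {
    "brca1": "BRCA1",
    "breast cancer type 1 susceptibility protein": "BRCA1",
    "p53": "TP53",
    "tp53": "TP53",
    "her2": "ERBB2",
    "erbb2": "ERBB2",
}

# Longest synonyms first, so multi-word names win over shorter overlaps.
_MENTION = re.compile(
    r"\b("
    + "|".join(re.escape(s) for s in sorted(SYNONYM_TO_SYMBOL, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def candidate_genes(sentence: str) -> set[str]:
    """Map every dictionary mention in a sentence to its official symbol."""
    return {SYNONYM_TO_SYMBOL[m.group(1).lower()] for m in _MENTION.finditer(sentence)}


sentence = "Autoantibodies to p53 and HER2 were found in breast cancer sera."
print(sorted(candidate_genes(sentence)))  # ['ERBB2', 'TP53']
```

Candidates found this way still include false positives, which is why the manual curation rounds follow.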
2.3 Manual Curation
We performed three rounds of manual curation to remove false
positives and select the bona fide AAgs for our database. (1) All
extracted sentences with appropriate AAg names were checked and
selected by two experienced researchers independently; (2) the
resulting sentences were then submitted to an internal review, in
which all AAg names were manually reviewed and approved again by
three experts; (3) all co-authors were required to randomly
check the database to make sure that all genes imported into it
are bona fide AAgs with appropriate supporting evidence.
Finally, 1126 AAg genes and 1071 related diseases were obtained
(see Note 3). All genes were functionally annotated and uploaded
to the database with the appropriate evidence sentences.
3 Data Search and Navigation
The website interface contains six sections: “Home”,
“Browse & Download”, “FAQ”, “Contact”, “Feedback”,
and “Log in” (Fig. 1). The “Home” page shows the introduction
and references to the AAgAtlas 1.0 database. On the “Browse &
Download” page, the user can find the complete list of human
AAgs, relevant diseases, and supporting evidence. Answers to
common questions about using the AAgAtlas 1.0 database are shown
on the “FAQ” page. For questions that are not addressed in the
“FAQ”, users can send their questions or suggestions to us
using the contact information in the “Contact” section. After
clicking the “Log in” button, the user can register, log in, and
submit new genes to our database through the “Feedback” section.
Two query approaches are provided for searching the database:
query by gene symbol and query by disease term.
3.1 Query by Gene Symbol
On the “Home” page, the user can enter the gene symbol in the
“Gene Symbol” search box. The drop-down menu will suggest
auto-completed gene symbols from the AAgAtlas 1.0 database. Select
one and click the “Search” button; the page will return the search
results. The results are divided into four columns: Gene, Disease,
PubMed Abstracts, and Sentences. Basic information about the
AAg gene and cross-references to external databases can be
obtained by clicking on the hyperlink of the AAg gene symbol in
the Gene column. Supporting literature evidence can be viewed by
clicking on the PubMed abstract or sentence. If the user clicks the
“Reset” button, all current search terms will be cleared.
Here we use breast cancer type 1 susceptibility protein
(BRCA1) as an example to show a query by gene symbol.
BRCA1 is a tumor suppressor that regulates cell growth
through the maintenance of genomic stability. Mutations in
BRCA1 and BRCA2 account for about 25% of familial breast
cancers [5, 6]. After typing BRCA1 in the “Gene Symbol” search
box and clicking “Search”, the search results are displayed,
which contain the BRCA1 gene name, a list of related
diseases, PubMed abstracts, and supporting sentences (Fig. 2).
Detailed information on BRCA1 is shown after clicking on
the BRCA1 gene name; it contains the validated evidence,
synonyms, disease section, description, Entrez gene summary,
chromosome, cytoband, chromosome location (bp), and links
to the Ensembl, Entrez Gene, UniProt, neXtProt, and Antibodypedia
databases (Fig. 3). The search results show that AAbs to
BRCA1 are generated in a variety of human cancers, including
breast cancer, lung cancer, prostate cancer, and ovarian cancer,
with frequencies from 0.7% to 28% [7, 8]. Furthermore, the
combination of BRCA1 and other AAbs (p53, c-myc, HER2,
NY-ESO-1, BRCA2, and MUC1) can achieve a sensitivity of
65% for primary breast cancer and 45% for ductal carcinoma in situ
at a specificity of 85% [8].
Fig. 2 Example of query by autoantigen gene on AAgAtlas 1.0 database

Fig. 3 Detailed information of the queried AAg gene in the AAgAtlas 1.0 database
3.2 Query by Disease Term
The user can also query the AAgs that are associated with a specific
disease using the “Query by disease term” function. For example,
breast cancer, the most common invasive cancer in women, affects
~12% of women worldwide. The development of breast cancer is
associated with being female, obesity, genetics, drinking alcohol,
etc. [9]. Accumulating evidence reveals an association between
breast cancer and autoimmunity, in which the risk of developing
breast cancer is reduced in patients with rheumatoid arthritis
and systemic lupus erythematosus and increased in patients with
psoriasis [10].
To find all the reported AAbs associated with breast cancer, we
entered “breast cancer” into the “Disease Term” search box.
The search engine returned the results, including
88 AAg genes related to breast cancer, the supporting literature
evidence, and the number of sentences (Fig. 4). These AAgs
include well-known cancer-related genes, such as ERBB2,
BRCA1, TP53, TP63, and MUC1. When the user clicks on the
supporting literature evidence or the number of sentences,
the page displays a table containing the gene, disease, PubMed
ID, evidence, and manual verification information. The original
Fig. 4 Example of disease query on AAgAtlas 1.0 database

Fig. 5 GO analysis of breast cancer–associated AAgs using the PANTHER database. (a–d) show the analysis of
biological process, signaling pathway, subcellular location, and protein class, respectively
abstract can be obtained by clicking on the hyperlink to the
evidence, which highlights the keywords, i.e., breast cancer and breast
cancer–related genes. By clicking on the small triangle at the head
of each table column, the results can be sorted in ascending or
descending order.
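The same gene- and disease-centric queries can be reproduced offline on a downloaded evidence list. The sketch below is an assumption-laden illustration: the record layout mirrors the table columns described above, but the field names are hypothetical, the actual download format is not specified in this chapter, and the PubMed IDs and sentences are placeholders rather than database content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    gene: str
    disease: str
    pubmed_id: str
    sentence: str


# Gene-disease pairs are taken from the examples in the text;
# IDs and sentences are placeholders.
ROWS = [
    Evidence("BRCA1", "breast cancer", "PMID_PLACEHOLDER_1", "placeholder sentence"),
    Evidence("BRCA1", "lung cancer", "PMID_PLACEHOLDER_2", "placeholder sentence"),
    Evidence("TP53", "breast cancer", "PMID_PLACEHOLDER_3", "placeholder sentence"),
]


def query_by_gene(rows, symbol):
    """Gene-centric query: all evidence rows for one gene symbol."""
    return [r for r in rows if r.gene.upper() == symbol.upper()]


def query_by_disease(rows, term):
    """Disease-centric query: all evidence rows whose disease contains the term."""
    return [r for r in rows if term.lower() in r.disease.lower()]


# Sorted ascending, as with the column-header sort on the website.
genes_for_breast_cancer = sorted({r.gene for r in query_by_disease(ROWS, "breast cancer")})
diseases_for_brca1 = sorted({r.disease for r in query_by_gene(ROWS, "brca1")})
print(genes_for_breast_cancer)  # ['BRCA1', 'TP53']
print(diseases_for_brca1)       # ['breast cancer', 'lung cancer']
```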
To understand the function of these breast cancer–related AAg
proteins, we performed GO analysis using the PANTHER database
(http://www.pantherdb.org/) (Fig. 5). The biological process analysis
indicates that these AAg proteins participate in the immune system
process, cell proliferation, metabolic process, biological adhesion,
response to stimuli, reproduction, location, and cellular process
(Fig. 5a). The function of these AAg proteins is further
revealed by the signaling pathway analysis, which identifies the
p38 MAPK, EGF receptor, p53, insulin, angiogenesis, apoptosis,
cadherin adhesion, and Wnt signaling pathways (Fig. 5b). The
subcellular location analysis reveals that these AAg proteins are
located in the extracellular region, organelle, protein-containing
complex, membrane, and cell junction (Fig. 5c). These AAg proteins
belong to the protein classes of receptor, isomerase, transcription
factor, hydrolase, signaling molecule, cytoskeleton, and nucleic
acid binding (Fig. 5d). All the results suggest that these AAg
proteins may have important roles in the initiation and progression
of breast cancer and may serve as biomarkers for the diagnosis and
therapeutic treatment of breast cancer, which remains to be
investigated in the future.
4 Notes
1. AAgs that are not mentioned in PubMed abstracts cannot be
recognized by our text-mining approach and were not included
in our database.
2. Post-translational modifications that can be targeted by
AAbs, such as citrullination and glycosylation, were not
considered in the AAgAtlas 1.0 database.
3. All AAgs in the AAgAtlas 1.0 database were selected from the
literature. The function of these AAgs as biomarkers has to be
demonstrated using immunoassays (e.g., protein array and
ELISA) and clinical serum/plasma cohorts.
Acknowledgments
This work was supported by the Chinese National Major Project for
New Drug Innovation (2018ZX09733003), National Key Basic
Research Project (2018YFA0507503, 2017YFC0906703),
National Natural Science Foundation of China (81673040 and
31870823), State Key Laboratory of Proteomics (SKLP-
O201703 and SKLP-K201505), and Capital’s Funds for Health
Improvement and Research (2018-2-4034).
References
1. Ludwig RJ, Vanhoorelbeke K, Leypoldt F et al (2017) Mechanisms of autoantibody-induced pathology. Front Immunol 8:603. https://doi.org/10.3389/fimmu.2017.00603
2. Plotz PH (2003) The autoantibody repertoire: searching for order. Nat Rev Immunol 3(1):73–78. https://doi.org/10.1038/nri976
3. Wang X, Yu J, Sreekumar A et al (2005) Autoantibody signatures in prostate cancer. N Engl J Med 353(12):1224–1235. https://doi.org/10.1056/NEJMoa051931
4. Wang D, Yang L, Zhang P et al (2017) AAgAtlas 1.0: a human autoantigen database. Nucleic Acids Res 45(D1):D769–D776. https://doi.org/10.1093/nar/gkw946
5. Nolan E, Lindeman GJ, Visvader JE (2017) Out-RANKing BRCA1 in mutation carriers. Cancer Res 77(3):595–600. https://doi.org/10.1158/0008-5472.CAN-16-2025
6. Melchor L, Benitez J (2013) The complex genetic landscape of familial breast cancer. Hum Genet 132(8):845–863. https://doi.org/10.1007/s00439-013-1299-y
7. Zhu Q, Han SX, Zhou CY et al (2015) Autoimmune response to PARP and BRCA1/BRCA2 in cancer. Oncotarget 6(13):11575–11584. https://doi.org/10.18632/oncotarget.3428
8. Chapman C, Murray A, Chakrabarti J et al (2007) Autoantibodies in breast cancer: their use as an aid to early diagnosis. Ann Oncol 18(5):868–873. https://doi.org/10.1093/annonc/mdm007
9. McGuire A, Brown JA, Malone C et al (2015) Effects of age on the detection and management of breast cancer. Cancers (Basel) 7(2):908–929. https://doi.org/10.3390/cancers7020815
10. Schairer C, Pfeiffer RM, Gadalla SM (2018) Autoimmune diseases and breast cancer risk by tumor hormone-receptor status among elderly women. Int J Cancer 142(6):1202–1208. https://doi.org/10.1002/ijc.31148
Chapter 22
Application of Meta Learning to B-Cell Conformational
Epitope Prediction
Yuh-Jyh Hu
Abstract
One of the major challenges in the field of vaccine design is identifying B-cell epitopes in continuously
evolving viruses. Various tools have been developed to predict linear or conformational epitopes, each
relying on different physicochemical properties and adopting distinct search strategies. In this chapter, we
propose different ensemble meta-learning approaches for epitope prediction based on stacked generalization,
cascade generalization, and meta decision trees. Through meta learning, we expect a meta learner to be able to
integrate multiple prediction models and outperform the single best-performing model. The objective of
this chapter is twofold: (1) to promote the complementary predictive strengths in different prediction tools
and (2) to introduce computational models to exploit the synergy among various prediction tools. Our
primary goal is not to develop any particular classifier for B-cell epitope prediction, but to advocate the
feasibility of meta learning for epitope prediction. With the flexibility of meta learning, the researcher can
construct various meta classification hierarchies that are applicable to epitope prediction in different protein
domains.
Key words B-cell epitopes, Meta learning, Stacking, Cascade, Meta decision trees
1 Introduction
B-cell epitopes are specific regions on proteins recognized as
antigen-binding sites by the antibodies of B cells. A detailed under-
standing of the interaction between antibodies and epitopes facil-
itates the development of diagnostics and therapeutics as well as
rational design of preventive vaccines [1–3]. Therefore, generation
of potent antibodies through reverse immunological approaches
requires precise knowledge of epitopes. According to their struc-
ture and interaction with antibodies, epitopes can be classified as
conformational and linear epitopes. A linear epitope is formed by a
continuous sequence of amino acids, whereas a conformational
epitope comprises discontinuous sections of the antigen’s primary
sequence; the discontinuous sections are close together in the
three-dimensional (3D) space and interact with an antibody
together. Approximately 10% of B-cell epitopes are linear, whereas
the remaining 90% are conformational [4, 5].
Several different approaches exist for predicting linear and
conformational epitopes. Previous studies relied on the varying
physicochemical properties of amino acids to predict linear epitopes
[5–7]. A study on 484 amino acid scales revealed that predictions
based on the best-performing scales poorly correlated with experi-
mentally confirmed epitopes [8]. This result prompted the devel-
opment of machine-learning methods to improve prediction.
BepiPred combines amino acid propensity scales with a hidden
Markov model to achieve marginal improvement over methods
based on physicochemical properties [9]. ABCPred uses artificial
neural networks (ANN) for predicting linear B-cell epitopes
[10]. Chen et al. proposed the novel amino acid pair (AAP) antige-
nicity scale [11], for which the authors trained a support vector
machine (SVM) classifier, using the AAP propensity scale to distin-
guish epitopes and nonepitopes. BCPREDS uses SVM combined
with a variety of kernel methods, including string kernels, radial
basis kernels, and subsequence kernels, to predict linear B-cell
epitopes [12].
The increasing availability of protein structures has facilitated
the development of computational prediction tools by exploiting
protein antigen structures. The following are examples of the
structural knowledge that has been used for epitope
prediction: (1) spatial neighborhood information and a surface
measure [13], (2) local spatial context, accessible surface area
(ASA) propensity, and consolidated amino acid index [14], (3) loca-
tions of continuous antigenic determinants [15], and (4) the
B-factor to detect atomic fluctuation [16]. Some studies have either
adopted a hybrid approach combining structural and physicochem-
ical features [17, 18], proposed an ensemble meta learner incorpor-
ating consensus results from multiple prediction servers by using a
voting mechanism [19], applied an ensemble of classifiers using
various input features [20], or used a combination of amino acid
composition information, spatial neighborhood information, and a
surface measure for predicting epitopes [21].
In this chapter, we propose combining multiple predictions to
improve epitope prediction based on three meta-learning strate-
gies: stacked generalization (stacking) [22, 23], cascade generaliza-
tion (cascade) [24, 25], and meta decision trees (MDT)
[26]. These meta-learning strategies work in different hierarchical
architectures and utilize the synergy among multiple prediction
methods differently. To evaluate performance, the combinatorial
methods have been tested on an independent set of antigen
proteins that, according to the tools' documentation and
publications, were not used previously to train the epitope
prediction tools. The results indicate the potential of meta
learning for epitope prediction.
2.1 Epitope Prediction as Inductive Learning
In an inductive learning problem, each example is represented by
a set of descriptive attributes, a target attribute, and the
corresponding attribute values. An inductive learning task can then
be defined as follows:
If
$E = \{e_1, e_2, \ldots, e_n\}$ is a set of training examples,
$X = \{x_1, x_2, \ldots, x_m\}$ is a set of descriptive attributes, and
$c$ is the target attribute,
then each training example $e_i$ is represented by a vector
$\langle v_1, v_2, \ldots, v_m, t_i \rangle$, where $v_1, v_2, \ldots, v_m$
denote legal values of the attributes $x_1, x_2, \ldots, x_m$, and
$t_i$ is a legal value of the target attribute $c$.
Assuming
$F: X \to c$ is the target attribute function, which maps an
example represented by a vector of descriptive attribute values to
its target attribute value, and
$H: X \to c$ is a hypothesis that approximates the target attribute
function, $H(X) \approx F(X)$,
then for a test example $t$, the target value is predicted as $H(t)$.
Considering epitope prediction as an inductive learning problem,
given a set of antigens with known epitopic and
nonepitopic regions, the initial goal is to train a classifier from
these antigens, each of which is described by a set of protein
features (see Note 1), and then to apply the classifier to novel
antigens for epitope detection. Many learning-based epitope
prediction tools have been developed [9–21]. Because different
learning algorithms employ different knowledge representations and
search heuristics, they explore different hypothesis spaces and
consequently obtain different results. We propose combining multiple
prediction methods to achieve superior performance compared with that
achieved using a single predictor.
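The formal definition above can be made concrete with a toy learner. The following is a minimal sketch, assuming two hypothetical numeric attributes per residue and a 1-nearest-neighbor rule as the hypothesis; neither the attributes nor the learner is taken from the tools cited in this chapter.

```python
from typing import Callable, Sequence

Vector = Sequence[float]       # <v1, ..., vm>: values of the descriptive attributes
Example = tuple[Vector, str]   # (<v1, ..., vm>, t): attribute values plus target value


def learn_1nn(train: list[Example]) -> Callable[[Vector], str]:
    """Return a hypothesis H that predicts the target of the nearest training example."""

    def H(x: Vector) -> str:
        def squared_distance(e: Example) -> float:
            return sum((a - b) ** 2 for a, b in zip(e[0], x))

        return min(train, key=squared_distance)[1]

    return H


# Hypothetical training set E with m = 2 attributes per example.
E = [
    ((0.9, 0.8), "epitope"),
    ((0.8, 0.7), "epitope"),
    ((0.1, 0.2), "non-epitope"),
    ((0.2, 0.1), "non-epitope"),
]
H = learn_1nn(E)
print(H((0.85, 0.9)))  # epitope
```

A different learning algorithm trained on the same set E would generally yield a different H, which is the diversity that meta learning exploits.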
2.2 Stacked Generalization (Stacking) and Cascade Generalization (Cascade)
Stacked and cascade generalizations are methods of combining the
predictions of multiple learning models that have been trained for a
classification task [22–25]. Unlike approaches based on bagging
[27] or boosting [28], which aim to reduce the variance of multiple
learners to improve performance, stacked and cascade generalizations
both work as layered processes with the aim of reducing
learner bias.
In stacked generalization, each of a set of base learners is trained
on a data set, and the predictions of these base learners become the
meta features. A successive layer of meta learners receives the meta
features as the input with which to train the meta models in parallel,
passing their output to the subsequent layer. A single classifier at
the top level makes the final prediction. Figure 1 shows a stacked
generalization architecture. Stacked generalization is considered a
form of meta learning because the transformations of the training
data for the successive layers contain the information of the predictions
of the preceding learners, which is a form of meta knowledge.
378 Yuh-Jyh Hu
Fig. 1 Generic stacked generalization architecture. In stacked generalization, a varying number of meta
learners are placed in parallel at each level in the hierarchy. They integrate and transform the output from the
preceding level into meta features and pass the meta features as the input to the successive level. One meta
learner serves as the arbitrator at the top level to produce the final meta classification
Similar to stacked generalization, cascade generalization is a
form of meta learning [24, 25]. Cascade generalization is distin-
guishable from stacked generalization because it produces a
sequential, rather than a parallel, composition of classifiers in a
hierarchy. Only one learner exists at each level, and its prediction
becomes a novel feature, in addition to the base features, of the
input to the learner in the successive level (Fig. 2). Stacked gener-
alization combines the predictions of multiple learners in parallel at
each level in a layered architecture to improve classification accu-
racy, whereas cascade generalization connects multiple learners in a
sequential fashion to obtain a meta model by propagating the
prediction of the learner, as a novel feature, to the subsequent
learner.
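The difference between the parallel and the sequential composition can be made concrete with a small sketch. The Python example below is illustrative only: the toy threshold learner `make_trainer`, the helpers `fit_stacking` and `fit_cascade`, and the four-example data set are assumptions introduced here, not the learners or data used in this chapter. For brevity the meta features are computed on the training data itself; in practice they are usually derived from held-out predictions.

```python
from typing import Callable, List

Vector = List[float]
Model = Callable[[Vector], float]
Trainer = Callable[[List[Vector], List[int]], Model]


def make_trainer(col: int) -> Trainer:
    """Hypothetical toy learner: thresholds feature `col` midway between the class means."""
    def train(X: List[Vector], y: List[int]) -> Model:
        pos = [x[col] for x, t in zip(X, y) if t == 1]
        neg = [x[col] for x, t in zip(X, y) if t == 0]
        mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        cut = (mean_pos + mean_neg) / 2
        sign = 1.0 if mean_pos >= mean_neg else -1.0
        return lambda x: 1.0 if sign * (x[col] - cut) > 0 else 0.0
    return train


def fit_stacking(X, y, base_trainers, top_trainer) -> Model:
    """Parallel composition: the base predictions form the meta features."""
    base = [train(X, y) for train in base_trainers]
    meta_X = [[m(x) for m in base] for x in X]
    top = top_trainer(meta_X, y)
    return lambda x: top([m(x) for m in base])


def fit_cascade(X, y, trainers) -> Model:
    """Sequential composition: each prediction is appended to the features."""
    models, current = [], [list(x) for x in X]
    for train in trainers:
        model = train(current, y)
        models.append(model)
        current = [x + [model(x)] for x in current]

    def predict(x: Vector) -> float:
        features, output = list(x), 0.0
        for model in models:
            output = model(features)
            features = features + [output]
        return output
    return predict


# Toy data: two descriptive attributes; label 1 = epitope, 0 = nonepitope.
X = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1]]
y = [0, 0, 1, 1]

stacked = fit_stacking(X, y, [make_trainer(0), make_trainer(1)], make_trainer(0))
# In the cascade, column 2 is the prediction appended by the first learner.
cascade = fit_cascade(X, y, [make_trainer(0), make_trainer(2)])
print(stacked([0.85, 0.2]), cascade([0.85, 0.2]))
```

In the stacked model the top-level learner sees only the outputs of the level below, whereas in the cascade every learner sees the original attributes plus all predictions made so far.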
Fig. 2 Generic cascade generalization architecture. In cascade generalization,
one meta learner is placed at each level in the hierarchy. These meta learners
are connected in a sequence in which each meta learner propagates its output to
the meta learner at the successive level as meta input. Because the propagation
ends at the top level, the top-level meta learner makes the final meta
classification
In this study, we developed multilevel architectures for stacked
and cascade generalizations. We used C4.5 [29], k-nearest neighbors
(k-NN) [30], ANN [31], and SVM [32] as the meta learners
because C4.5 learns comprehensible decision trees, the nearest-
neighbor rule is capable of constructing local approximations to
the target, artificial neural network learning methods provide a
robust approach to approximating a wide variety of target func-
tions, and SVM has demonstrated promising performances in vari-
ous applications (see Note 2). We selected several state-of-the-art
linear and conformational epitope prediction tools as the candidate
B-cell epitope base learners, including BepiPred [9], ABCpred
[10], AAP [11], BCPREDS [12], DiscoTope 2.0 [21], ElliPro
[15], SEPPA 2.0 [14], and Bpredictor [33]. We analyzed and
compared the base features exploited by previous prediction meth-
ods and selected those that characterize physicochemical
propensities and structural properties. We adopted 14 base features:
epitope propensity [21], secondary structure [34], residue
accessibility [35], B factor [36, 37], solvent-excluded surfaces,
solvent-accessible surfaces [38], protein chain flexibility [16],
hydrophilicity [39], PSSM [40], atom volume [41], accessible
surface area [42, 43], side chain polarity [44], hydropathy index
[45], and antigenic propensity [46]. Table 1 lists descriptions of
these features. In the training stage, the outputs of the base learners
and base features are passed to meta learners at higher levels to train
a meta model for classification. In the prediction stage, the trained
meta classifier predicts the epitopes for a previously unseen patho-
gen protein based on the predictions of the base learners and the
base features of the protein.
Table 1
Summary of base features

| Base feature | Description | Reference |
|---|---|---|
| Propensity score | The propensity score is derived from a scoring function that sums the log-odds ratios of the amino acids in the spatial neighborhood (defined in [28]) around each residue in a given protein. | [21] |
| Residue accessibility | Calculated using NACCESS, which computes the accessibilities of the whole molecule submitted in a PDB file. NACCESS calculates the atomic accessible surface defined by rolling a probe around a van der Waals surface. The residue accessibilities are categorized into five classes: all atoms, total side chain, polar side chain, nonpolar side chain, and main chain. | [35] |
| Secondary structure | Secondary structure refers to highly regular local substructures defined by patterns of hydrogen bonds between the main-chain peptide groups. In such cases, the chain of amino acids folds into regular repeating structures such as α helix, β structure, and coil. | [34] |
| Accessible surface area | Calculated using Gerstein et al.’s calc-surface program, which measures the accessible surface area of a sphere, on each point of which the center of a solvent molecule can be placed in contact with this atom without penetrating any other atoms of the molecule. | [42, 43] |
| Atom volume | Calculated using Gerstein et al.’s calc-volume program. It calculates volumes by applying a geometric construction called Voronoi polyhedra to divide the total volume among the atoms in a protein model. | [41] |
| B factor | The B factor is also known as the Debye-Waller factor or the temperature factor. It is used to describe the attenuation of X-ray scattering or coherent neutron scattering caused by thermal motion. Two B factors of a protein were considered in this study: the B factor of the side chain and the B factor of the main chain. | [36, 37] |
| Solvent-excluded surface | Calculated using Sanner et al.’s MSMS program, which builds the solvent-excluded surface based on the reduced surface. | [38] |
| Solvent-accessible surface | Calculated using Sanner et al.’s MSMS program, which builds the solvent-accessible surface based on the reduced surface. | [38] |
| PSSM | Derived by using PSI-BLAST to search the non-redundant protein database; the information content of the resulting position-specific scoring matrix serves as the base feature. | [40] |
| Side chain polarity | The 20 amino acids were divided into four categories: polar, nonpolar, acidic polar, and basic polar. | [44] |
| Hydropathy index | Kyte and Doolittle devised the hydropathy index by applying a sliding-window strategy that continuously determined the average hydropathy in a window as it advanced through the sequence. | [45] |
| Antigenic propensity | Kolaskar and Tongaonkar analyzed 156 antigenic determinants (<20 residues per determinant) in 34 different proteins to obtain the antigenic propensities of amino acid residues. | [46] |
| Flexibility | Karplus and Schulz developed the flexibility scale based on the mobility of the protein segments on 31 proteins with known structures. | [16] |
| Hydrophilic scale | Parker et al. developed the hydrophilic scale based on high-performance liquid chromatography (HPLC) peptide retention data. | [39] |
The multilevel architecture for stacked or cascade generalization
can vary with the arrangement of the meta learners in the
hierarchy. For example, we can place SVM at the top level in a
stacked generalization architecture, or we can substitute C4.5 for
SVM. For cascade generalization, we can place the k-NN prior to
the ANN, or vice versa, in the cascading sequence. By conducting
cross-validation (CV), we can identify the appropriate stacked
architectures, as shown in Figs. 3 and 4. Cascade generalization
performs a sequential composition of meta learners in a hierarchy in
which only one meta learner exists at each level. By evaluating all
24 (4!) possible sequential arrangements of the meta learners SVM,
C4.5, k-NN, and ANN through CV, we can determine the sequential
arrangement with the highest performance. Figure 5 shows a can-
didate cascade generalization architecture.
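The exhaustive search over cascade orderings described above can be sketched in a few lines of Python. This is a hedged illustration rather than the procedure actually used: `best_cascade_order` and `placeholder_score` are hypothetical names, and the scoring function is a stand-in for a real cross-validation run, so its values carry no experimental meaning.

```python
from itertools import permutations
from typing import Callable, Tuple

META_LEARNERS = ("SVM", "C4.5", "k-NN", "ANN")


def best_cascade_order(cv_score: Callable[[Tuple[str, ...]], float]):
    """Score every ordering of the meta learners and return the best one.

    `cv_score` is a hypothetical callable that would train a cascade in the
    given order under cross-validation and return its mean score.
    """
    orders = list(permutations(META_LEARNERS))  # 4! = 24 arrangements
    return max(orders, key=cv_score), len(orders)


def placeholder_score(order: Tuple[str, ...]) -> float:
    """Stand-in for a real CV run; favors orders that place SVM late."""
    return order.index("SVM") / 10


best, n_orders = best_cascade_order(placeholder_score)
print(n_orders, best)
```

With four meta learners the search is cheap, but the number of orderings grows factorially, so a larger pool of learners would call for a heuristic search instead.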
2.3 Meta Decision Trees (MDT)
MDTs [26] are used for meta learning that applies multiple base
classifiers to a single data set by exploiting the classification results
of the base classifiers as a type of meta-knowledge. The structure of
an MDT is identical to that of an ordinary decision tree, in that
both have internal nodes and leaves, and have the same computational
complexity; however, in an MDT, the attributes associated
with the internal nodes and the meaning indicated by the leaves
differ from those of an ordinary decision tree.
Fig. 3 Two-level stacking architecture. The conformational epitope predictors and linear epitope predictors
were all placed at Level 0. One of the learners SVM, C4.5, k-NN, or ANN served as a meta learner to integrate
the output from the base predictors and produced the meta classification as the final result
In both MDT and ordinary decision trees, an internal node
specifies a test on an attribute value. For an ordinary decision tree,
the attribute selected for the internal node must be one of the base
attributes used to describe the data instances, for instance, the
hydrophilic scale. By contrast, the attribute at an internal node of
an MDT is a meta-attribute derived from the output of the base
classifiers. Notably, although the base classifiers used in an MDT are
standard inductive learners (e.g., artificial neural network and
naïve Bayes classifier), they differ from the B-cell epitope prediction
servers (e.g., SEPPA 2.0 [14] and DiscoTope 2.0 [21]) used as the
base predictors by other meta-learning methods [19, 47]. Unlike
the others, the base classifiers used in MDTs can be retrained from
new training data if required. As for the leaves, a leaf of an ordinary
decision tree corresponds to a predicted class, whereas that of an
MDT specifies a particular base classifier for class prediction.
Figure 6 illustrates examples of an ordinary decision tree and an
MDT. The ordinary decision tree in Fig. 6a includes three internal
nodes and four leaf nodes; each internal node specifies a test on a
particular base attribute value [e.g., Feature1 ≤ 0.75 (or > 0.75)],
and each leaf indicates the predicted class (e.g., C1). The MDT in
Fig. 6b also has three internal nodes and four leaf nodes; unlike in
the ordinary decision tree, each internal node in this MDT specifies
a test on a particular meta-attribute derived from the output of a
base classifier [e.g., metaF1(CL1) in Fig. 6b], and rather than pre-
dicting the class, each leaf node predicts the base classifier most
suitable for classification (e.g., CL1).
Fig. 4 Three-level stacking architecture. The conformational epitope predictors and linear epitope predictors
were all placed at Level 0. We selected C4.5, k-NN, and ANN as the Level-1 meta learners that transformed
the output of the base predictors into meta features and passed them to the successive level. We designated
SVM as the top meta learner that learned from the base features and the meta features to produce the meta
classification as the final result
To describe each amino acid on a protein antigen, in addition
to the same 14 base attributes as those used in stacking and cascade
(see Table 1), we may also consider other physicochemical properties
such as surface probability [48], turns [7], exposed surface
[49], and two types of polarity defined by Ponnuswamy et al.
[50] and Grantham [51]. Unlike the stacking and cascade B-cell epitope
predictors described above, an MDT B-cell epitope predictor does not rely on
the output of other B-cell epitope prediction servers (e.g., SEPPA 2.0
[14] and DiscoTope 2.0 [21]). The motivation behind adding more base
features to an MDT is to compensate for the absence of these servers.
Fig. 5 Cascade generalization architecture. The conformational epitope predictors and linear epitope predictors
all served at Level 0 as the base predictors. We placed k-NN, C4.5, ANN, and SVM sequentially from
Levels 1 to 4 as meta learners. Each meta learner generalized the output from the previous level to meta
knowledge in the form of meta features. The meta features and base features propagated sequentially to the
successive level as input to the subsequent meta learner. The top-level meta learner, SVM, produced the final
meta classification
Fig. 6 Sample ordinary decision tree and MDT. (a) An ordinary decision tree and (b) a meta decision tree
A meta-attribute is defined over the output of the trained base
classifiers. In our study, we used C4.5 [29], k-NN [30], SVM [32],
Random Forests (RF) [52], PART [53], Bayesian Network
(BN) [54], JRip [55], and Voted Perceptron (VP) [56] as the
base classifiers (see Note 3). Furthermore, the majority vote of the
base classifiers was also included as a base classification. Following
Todorovski and Dzeroski [26], we calculated the properties of
the class probability distributions predicted by the base classifiers,
reflecting the certainty and confidence of the predictions. Here, we
defined three meta-attributes: epi_prob(x,B), entropy(x,B), and
vote_epi_prop(x), where x is a data instance and B a base classifier.
The meta-attribute epi_prob(x,B) is the probability of epitope pre-
dicted by the base classifier B for the amino acid x. The meta-
attribute entropy(x,B) is the entropy of the class probability distri-
bution predicted by the base classifier B for the amino acid x. The
meta-attribute vote_epi_prop(x) is the proportion of the epitope
class predicted by all base classifiers for the amino acid x. These
meta-attributes reflect the certainty of the base classifier in predicting
the class, each characterizing that confidence in a different way. We
computed the meta-attribute values for each data instance, namely,
the amino acid, on the basis of the output of the base classifiers and
combined them to form a meta training data set. We then trained an
MDT from a training set of data described by the meta-attributes.
MDT construction is identical to that of an ordinary decision tree.
It involves a greedy, top-down, recursive search for the most suit-
able decision tree from a training data set.
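The three meta-attributes defined above can be computed directly from the epitope probabilities returned by the base classifiers. The following Python sketch rests on two assumptions that the text does not state: the class distribution is binary (epitope vs. nonepitope), and a classifier is counted as voting for the epitope class when its epitope probability is at least 0.5. The function `meta_row` and the example probabilities are hypothetical.

```python
import math
from typing import Dict


def epi_prob(p_epitope: float) -> float:
    """epi_prob(x, B): epitope probability predicted by classifier B."""
    return p_epitope


def entropy(p_epitope: float) -> float:
    """entropy(x, B): entropy (in bits) of the two-class distribution."""
    probs = (p_epitope, 1.0 - p_epitope)
    return -sum(p * math.log2(p) for p in probs if p > 0)


def vote_epi_prop(p_by_classifier: Dict[str, float], cutoff: float = 0.5) -> float:
    """vote_epi_prop(x): fraction of base classifiers predicting epitope."""
    votes = [p >= cutoff for p in p_by_classifier.values()]
    return sum(votes) / len(votes)


def meta_row(p_by_classifier: Dict[str, float]) -> Dict[str, float]:
    """Assemble one row of the meta training set for a single amino acid."""
    row = {}
    for name, p in p_by_classifier.items():
        row[f"epi_prob({name})"] = epi_prob(p)
        row[f"entropy({name})"] = entropy(p)
    row["vote_epi_prop"] = vote_epi_prop(p_by_classifier)
    return row


# Made-up probabilities for one residue from four base classifiers.
row = meta_row({"C4.5": 0.9, "k-NN": 0.2, "SVM": 0.6, "RF": 0.4})
print(row)
```

A probability near 0.5 gives an entropy near 1 bit (low certainty), whereas a probability near 0 or 1 gives an entropy near 0 (high certainty).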
Rather than employing the measures of impurity reduction
commonly used for ordinary decision trees, such as information
gain, gain ratio [29], and Gini [57], MDTs focus on the
accuracy of each base classifier in predicting the data instances
S available at an internal node. We defined the new information
measure as follows:
$$\mathrm{info}(S) = 1 - \max_{B \in \mathit{Base}} \mathrm{Accuracy}(B, S), \tag{1}$$
where B is a base classifier, Base is the set of all base classifiers, S is
the data available at an internal node, and Accuracy(B, S) is the classification
accuracy of B on S. We selected the attribute that maximized
the decrease in info of the subsets of S after the partition according
to the values of the selected attribute compared with the original
info of S. The classifier at the leaf node with the maximum accuracy
was used to predict new instances after the tree grew completely. By
constructing an MDT ensemble from multiple random samples
based on a bagging-like strategy [27], we expect the final prediction
based on the predictions of the MDT ensemble to further reduce
the variance among different MDTs and provide a more accurate
approximation to the true target. We computed the probability for
an amino acid of being epitope or nonepitope on the basis of the
predictions of the MDTs trained from the random samples. By
using m MDTs, we defined the scores of epitope, Score_E, and nonepitope,
Score_N, for an amino acid AA as follows:
$$\mathrm{Score}_E = \alpha \cdot \sum_{i=1}^{m} w_i \cdot \lceil e_i - 0.5 \rceil + (1 - \alpha) \cdot \sum_{i=1}^{m} w_i \cdot e_i \tag{2}$$
$$\mathrm{Score}_N = \alpha \cdot \sum_{i=1}^{m} w_i \cdot \lceil n_i - 0.5 \rceil + (1 - \alpha) \cdot \sum_{i=1}^{m} w_i \cdot n_i \tag{3}$$
In Eqs. (2) and (3), $e_i$ (or $n_i$) is the probability of being an
epitope (or nonepitope) according to the prediction of the ith
MDT, and the ceiling term $\lceil e_i - 0.5 \rceil$ equals 1 when $e_i > 0.5$
and 0 otherwise, so it acts as a vote. A higher $w_i$ value indicates a stronger weight exerted on
the score by the base classifier visited in the ith MDT; when $w_i$ is set
to 1, all the base classifiers are treated equally. The first term in Eqs.
(2) and (3) considers only the count of classifications by the
m MDTs, whereas the second term considers the class probabilities.
We used a control parameter α to balance the effects of the two
scoring mechanisms, and its value could be determined through
CV. We defined the probability for the amino acid AA of being
epitope or nonepitope as follows:
$$P_E(AA) = \frac{\mathrm{Score}_E}{\mathrm{Score}_E + \mathrm{Score}_N} \tag{4}$$
$$P_N(AA) = \frac{\mathrm{Score}_N}{\mathrm{Score}_E + \mathrm{Score}_N} \tag{5}$$
To appropriately address the imbalanced class distribution in
B-cell epitopes, we also set a probability threshold for the final
classification as follows:
$$\mathrm{Class}(AA) = \begin{cases} \text{nonepitope}, & P_E(AA) < \theta \\ \text{epitope}, & P_E(AA) \geq \theta \end{cases} \tag{6}$$
where θ is a threshold. A carefully selected θ on the basis of CV or
prior knowledge warrants a reasonable performance of the
class-sensitive bagging MDT approach. Figure 7 illustrates the entire
control flow of BaggingMDT.
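The information measure of Eq. 1 and the scoring scheme of Eqs. 2 to 6 can be sketched as follows. This Python example is a minimal illustration under stated assumptions, not the BaggingMDT implementation: the bracket term in Eqs. 2 and 3 is read as a ceiling, the nonepitope probability is taken as n_i = 1 - e_i (two classes), and all function names and input values are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def info(predictions: Dict[str, List[int]], labels: List[int]) -> float:
    """Eq. 1: one minus the best base-classifier accuracy on the instances S."""
    def accuracy(pred: List[int]) -> float:
        return sum(p == t for p, t in zip(pred, labels)) / len(labels)
    return 1.0 - max(accuracy(pred) for pred in predictions.values())


def epitope_probability(e: Sequence[float],
                        w: Optional[Sequence[float]] = None,
                        alpha: float = 0.5) -> float:
    """Eqs. 2-4: combine vote counts and class probabilities from m MDTs."""
    w = list(w) if w is not None else [1.0] * len(e)
    n = [1.0 - e_i for e_i in e]  # assumption: two classes, n_i = 1 - e_i

    def score(p: Sequence[float]) -> float:
        # ceil(p_i - 0.5) is 1 when p_i > 0.5 and 0 otherwise (a vote).
        votes = sum(w_i * math.ceil(p_i - 0.5) for w_i, p_i in zip(w, p))
        probs = sum(w_i * p_i for w_i, p_i in zip(w, p))
        return alpha * votes + (1.0 - alpha) * probs

    score_e, score_n = score(e), score(n)
    return score_e / (score_e + score_n)


def classify(p_epitope: float, theta: float) -> str:
    """Eq. 6: threshold the epitope probability."""
    return "epitope" if p_epitope >= theta else "nonepitope"


# Made-up epitope probabilities from three MDTs for one amino acid.
p_e = epitope_probability([0.8, 0.6, 0.3], alpha=0.5)
print(round(p_e, 3), classify(p_e, theta=0.3))
```

Because epitope residues are the minority class, a threshold θ below 0.5 trades some precision for sensitivity; the value 0.3 used here is arbitrary.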
2.4 Data Sets and Performance Measures
An epitope prediction server must be trained to obtain its prediction
model before it can make a prediction. Because the epitope
predictors used in this study were web-based servers or software
packages, they could not be retrained using novel training data. To
conduct a consistent and unbiased comparative analysis of the
prediction performances of these servers, we created an indepen-
dent data set of antigens with known epitopes. We collected the test
data sets used in DiscoTope 2.0 [21], SEPPA 2.0 [14], and Bpre-
dictor [33] and combined them with the data of the Epitome
database [58] and Immune Epitope Database (IEDB) [59]. After
removing the duplicate proteins, and filtering out the antigens
without annotations, or previously used to train the base prediction
servers, we built an independent data set of 64 antigens for predic-
tion performance evaluation (Table 2). To ensure fair comparison
between different prediction methods, we used the independent
64 antigens with the epitope residues annotated in the IEDB for
testing and selected 94 antigens that have been previously used to
train the base prediction servers (Table 3) to train the classification
models. The antigen protein 3D structures were used as input for
the structure-based classifiers, and the corresponding antigen
sequences were sent to the sequence-based predictors as input.
Fig. 7 System flow of BaggingMDT
We evaluated prediction performances by using several measures:
TP rate (i.e., sensitivity), FP rate, precision (i.e., positive
predictive value), percentage accuracy, F-score, MCC, and AUC.
Table 4 lists the definitions of these measures. We considered a
predicted antigenic residue a TP if it was within a known epitopic
region. Otherwise, we considered it an FP. We considered a pre-
dicted nonantigenic residue a true negative (TN) if it was outside
the known epitopes, or a false negative (FN) if it was part of a
Table 2
Test data set of 64 protein antigens
1BJ1_V 1BJ1_W 1BVK_F 1BZQ_A 1CZ8_V 1CZ8_W 1FJ1_F 1I9R_A 1J5O_B

1KXT_A 1KXV_A 1MLC_F 1N5Y_B 1N6Q_B 1OAZ_A 1ORQ_C 1OTS_A 1P2C_C
1R0A_B 1R3J_C 1TPX_A 1TQB_A 1TQC_A 1VFB_C 1ZA3_R 2BDN_A 2DQC_Y
2DQD_Y 2DQE_Y 2DQF_C 2DQF_F 2DQG_Y 2DQH_Y 2DQI_Y 2DQJ_Y 2EKS_C
2FJG_V 2FJG_W 2H9G_S 2IFF_Y 2J4W_D 2J5L_A 2JEL_P 2NY7_G 2OZ4_A
2Q8B_A 2R4R_A 2R4S_A 2R56_A 2VIS_C 2VIT_C 2VXS_A 2YSS_C 2ZJS_Y
3B9K_B 3BSZ_F 3CVH_A 3D85_C 3D9A_C 3DVG_Y 3DVN_V 3G04_C 3GBN_B
3GI9_C 3H42_B 3HI6_A 3HI6_B 3KJ4_A 3KJ6_A - - -
Table 3
Training data set of 94 protein antigens
1A2Y_C 1ADQ_A 1AFV_A 1AHW_C 1AR1_B 1BGX_T 1BQL_Y 1BVK_C 1C08_C 1DQJ_C
1DZB_X 1DZB_Y 1EGJ_A 1EO8_A 1EZV_E 1FDL_Y 1FNS_A 1FSK_A 1G7H_C 1G7I_C
1G7J_C 1G7L_C 1G7M_C 1G9M_G 1G9N_G 1GC1_G 1HYS_B 1IC4_Y 1IC5_Y 1IC7_Y
1J1O_Y 1J1P_Y 1J1X_Y 1JHL_A 1JPS_T 1JRH_I 1KIP_C 1KIQ_C 1KIR_C 1KYO_E
1LK3_A 1MEL_L 1MHP_B 1MLC_E 1N8Z_C 1NBY_C 1NBZ_C 1NDG_C 1NDM_C 1NSN_S
1OAK_A 1ORS_C 1OSP_O 1QLE_B 1R3K_C 1RJL_C 1RVF_1 1RVF_2 1RVF_3 1RZJ_G
1RZK_G 1TZH_V 1TZI_V 1UA6_Y 1UAC_Y 1UJ3_C 1V7M_V 1W72_A 1WEJ_F 1XIW_A
1YJD_C 1YQV_Y 1YY9_A 1ZTX_E 2AEP_A 2ARJ_Q 2B2X_A 2DD8_S 2EIZ_C 2HMI_B
2Q8A_A 2QQK_A 2QQN_A 2UZI_R 2VH5_R 2VXQ_A 2VXT_I 2W9E_A 2XTJ_A 2ZUQ_A
3G6D_A 3GRW_A 3O0R_B 3PGF_A - - - - - -
Table 4
Definitions of performance measures

| Performance measure | Definition |
|---|---|
| TPR^a | TP/(TP+FN) |
| FPR | FP/(FP+TN) |
| Precision^b | TP/(TP+FP) |
| Accuracy | (TP+TN)/(TP+TN+FP+FN) |
| F-score | 2 × TPR × precision/(TPR + precision) |
| MCC | $\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ |
| AUC | Area under the ROC curve |

a True positive rate is also known as sensitivity or recall
b Precision is also known as positive predictive value
known epitope. We tested the prediction models on the independent
antigen data. According to the output of the prediction models,
for each amino acid we obtained: (1) the epitope prediction
score or (2) the classification (e.g., epitope or nonepitope based on
a prespecified score threshold). The numbers obtained for TP, TN,
FP, and FN depended on the manner in which the threshold was
selected and provided performance information. In general, corre-
lation exists between the TP rate and the FP rate produced by the
predictive model. Typically, the FP rate increases with the TP rate.
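The threshold-dependent measures in Table 4 follow directly from the four confusion-matrix counts. The Python sketch below is illustrative; the function name `performance` and the counts passed to it are hypothetical and are not taken from the experiments in this chapter. AUC is omitted because it requires the full ranking of prediction scores rather than a single set of counts.

```python
import math
from typing import Dict


def performance(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the threshold-dependent measures of Table 4 from residue counts."""
    tpr = tp / (tp + fn)                      # sensitivity / recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)                # positive predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_score = 2 * tpr * precision / (tpr + precision)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"TPR": tpr, "FPR": fpr, "Precision": precision,
            "Accuracy": accuracy, "F-score": f_score, "MCC": mcc}


# Made-up counts: 50 epitope residues among 1000, 40 of them recovered.
m = performance(tp=40, fp=10, tn=940, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With a strongly imbalanced class distribution, accuracy alone is dominated by the nonepitope residues, which is why F-score and MCC are reported alongside it.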
2.5 Correlation Analysis
A meta classifier can consist of an arbitrary number of base learners,
and its overall performance depends on these learning components.
If the learning components have complementary predictive
strengths, a meta classifier can search a variety of hypotheses in
the hypothesis space and provide better generalizations for
novel test data than a single-component learner can. We introduce
two methods to evaluate the correlation between base learning
components in meta learning. One is based on statistical correlation
analysis; the other is based on clustering analysis.
2.5.1 Stacking and Cascade
We used statistical techniques to analyze the B-cell epitope prediction
tools. We evaluated the correlations between the prediction
scores, and between the rankings of the prediction scores. Using a
Pearson’s correlation analysis, we measured the strength of the
relationship between the prediction scores produced by the tools.
We ranked the prediction scores produced by the tools and calcu-
lated the Spearman’s rank correlation coefficient to investigate the
correlations between the prediction score rankings of the predic-
tion tools. The results from correlation analysis can provide a basis
for selecting the appropriate base learners in stacking and cascade
meta learning.
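Both correlation coefficients can be computed with a few lines of standard-library Python. The sketch below is an illustration with made-up scores; the function names are hypothetical, and tied scores receive the average of their ranks, which is the usual convention but an assumption about how ties were handled in this study.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up per-residue scores from two hypothetical prediction tools.
tool_a = [0.10, 0.40, 0.35, 0.80]
tool_b = [1.0, 4.0, 3.0, 64.0]
print(round(pearson(tool_a, tool_b), 3), round(spearman(tool_a, tool_b), 3))
```

In this toy case the two tools rank the residues identically, so the Spearman coefficient is 1 even though the scores are not linearly related and the Pearson coefficient is lower.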
2.5.2 MDT
An MDT can be constructed from an arbitrary number of different
base classifiers, and its overall performance depends on these
learning components. If the learning components have complementary
predictive strengths, an MDT can search various hypotheses
in the hypothesis space and provide superior generalizations for
novel test data to those of a single-component learner. We used the
adjusted Rand index (ARI) [60] to measure the strength of the
relationship between the predictions produced by two base classi-
fiers. Although the ARI was initially designed to measure agree-
ment between two clustering results, in our case, a higher ARI value
could indicate greater agreement between the two classifiers. If P is
the partition of the amino acids into epitopes and nonepitopes for a
given data set of antigens, according to the predictions of the
classifier A, and Q is the partition produced by the classifier B, a
lower ARI value between P and Q suggests a higher probability that
the two classifiers have complementary strengths.
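The ARI between the partitions P and Q can be computed from the contingency table of the two label vectors. The following Python sketch uses the standard pair-counting formula; the function name and the toy label vectors are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(p: Sequence[Hashable], q: Sequence[Hashable]) -> float:
    """Adjusted Rand index of two labelings of the same items."""
    n = len(p)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(p, q)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(p).values())
    sum_cols = sum(comb(c, 2) for c in Counter(q).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g., a single cluster in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Toy predictions (1 = epitope, 0 = nonepitope) from two classifiers.
classifier_a = [1, 1, 0, 0, 1, 0]
classifier_b = [1, 0, 1, 0, 1, 0]
print(round(adjusted_rand_index(classifier_a, classifier_b), 3))
```

Note that the ARI compares partitions rather than labels: two classifiers whose predictions are exact opposites induce the same partition and therefore also obtain an ARI of 1.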
3 Experimental Results
3.1 Correlation Analysis
For a meta-learning method to perform effectively, the base
learning components must have complementary predictive capabilities,
which can be reflected by relatively low correlation among
their predictions.
3.1.1 Prediction Correlations Between Base Prediction Servers
We evaluated four conformational and four linear epitope predictors
as the base learners in our stacking and cascade architectures.
The conformational predictors were DiscoTope 2.0 [21], ElliPro
[15], SEPPA 2.0 [14], and Bpredictor [33]; the linear epitope
predictors were BepiPred [9], ABCpred [10], AAP [11], and
BCPREDS [12]. We calculated the Pearson’s correlation coefficients
for the prediction scores produced by the base prediction tools. To
further analyze the correlations among predictions based on the
score rankings, we sorted the prediction scores of all protein sites
provided by each base learner and then conducted a Spearman’s
rank correlation analysis. Tables 5 and 6 list the Pearson’s correla-
tion coefficients and Spearman’s rank correlation coefficients of all
pairs of linear and conformational predictors, respectively. The
average correlation coefficients of the linear and conformational
prediction tools were 0.383 and 0.384, respectively, in the Pearson’s
analysis and 0.370 and 0.459 in the Spearman’s analysis, which
indicates a relatively weak correlation among the epitope predictions
of the base learners.
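The averages quoted above can be checked against the pairwise coefficients listed in Tables 5 and 6. The short Python sketch below simply transcribes the off-diagonal values from those tables and averages them; the variable names are hypothetical.

```python
from statistics import mean

# Off-diagonal coefficients transcribed from Table 5 (linear predictors)
# and Table 6 (conformational predictors).
coefficients = {
    ("linear", "Pearson"): [0.241, 0.515, 0.342, 0.383, 0.282, 0.536],
    ("linear", "Spearman"): [0.251, 0.520, 0.287, 0.372, 0.299, 0.489],
    ("conformational", "Pearson"): [0.246, 0.339, 0.372, 0.333, 0.388, 0.624],
    ("conformational", "Spearman"): [0.400, 0.509, 0.364, 0.487, 0.362, 0.630],
}

averages = {key: round(mean(values), 3) for key, values in coefficients.items()}
for (group, kind), value in averages.items():
    print(f"{group:15s} {kind:9s} {value:.3f}")
```

Each group contains six predictor pairs, so every average is taken over six coefficients.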
3.1.2 Prediction Correlations Between Base Inductive Classifiers
We built MDTs based on eight base classifiers: C4.5 [29], k-NN
[30], SVM [32], RF [52], PART [53], BN [54], JRip [55], and VP
[56]. We measured the correlation between two base classifiers by
the ARI [60] of their classifications. Table 7 lists the ARI values of
all pairs of the base classifiers for an independent test data set of
18 antigens. The mean ± standard deviation of the ARI values for the test
data set was 0.238 ± 0.084; this value is relatively low,
indicating a relatively weak correlation among the base classifiers.
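The summary statistic can be recomputed from the 28 pairwise ARI values in Table 7. The Python sketch below transcribes those values; it assumes that the reported spread is the sample standard deviation (n - 1 in the denominator), which the text does not state explicitly.

```python
from statistics import mean, stdev

# Pairwise ARI values transcribed row by row from Table 7.
ari_values = [
    0.198,
    0.189, 0.256,
    0.290, 0.259, 0.282,
    0.157, 0.245, 0.166, 0.164,
    0.232, 0.120, 0.126, 0.240, 0.050,
    0.251, 0.239, 0.281, 0.306, 0.237, 0.191,
    0.248, 0.359, 0.382, 0.386, 0.361, 0.125, 0.335,
]

ari_mean = mean(ari_values)
ari_sd = stdev(ari_values)  # sample standard deviation (assumption)
print(f"{ari_mean:.3f} +/- {ari_sd:.3f} over {len(ari_values)} classifier pairs")
```

Eight base classifiers give 8 × 7 / 2 = 28 unordered pairs, which matches the number of entries in the table.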
3.2 Comparisons Between Stacking/Cascade and BaggingMDT
We conducted two stratified fivefold cross-validations (CV) to evaluate
the performances of different meta-learning architectures. We
randomly divided a data set of 94 antigens into five disjoint folds
(i.e., subsets), each of approximately equal size. We stratified the
folds to maintain the same distribution of epitopes and nonepitopes
as in the original data set. We used one fold of data for testing
prediction performance and used the remaining four folds for
training. We repeated the same training-testing process on each
fold iteratively. Each run produced a result based on the fold
selected for testing. The overall performance was taken as the average
of the results obtained from all iterations of the two fivefold
CVs. The results are shown in Table 8, and for reference, we also
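Table 5
Correlation analysis of linear epitope predictors

| Linear | AAP (Pearson) | AAP (Spearman) | ABCpred (Pearson) | ABCpred (Spearman) | BCPREDS (Pearson) | BCPREDS (Spearman) |
|---|---|---|---|---|---|---|
| AAP | 1 | 1 | - | - | - | - |
| ABCpred | 0.241 | 0.251 | 1 | 1 | - | - |
| BCPREDS | 0.515 | 0.520 | 0.342 | 0.287 | 1 | 1 |
| BepiPred | 0.383 | 0.372 | 0.282 | 0.299 | 0.536 | 0.489 |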
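Table 6
Correlation analysis of conformational epitope predictors

| Conformational | SEPPA 2.0 (Pearson) | SEPPA 2.0 (Spearman) | DiscoTope 2.0 (Pearson) | DiscoTope 2.0 (Spearman) | Bpredictor (Pearson) | Bpredictor (Spearman) |
|---|---|---|---|---|---|---|
| SEPPA 2.0 | 1 | 1 | - | - | - | - |
| DiscoTope 2.0 | 0.246 | 0.400 | 1 | 1 | - | - |
| Bpredictor | 0.339 | 0.509 | 0.372 | 0.364 | 1 | 1 |
| ElliPro | 0.333 | 0.487 | 0.388 | 0.362 | 0.624 | 0.630 |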
Table 7
Correlation analysis of base inductive classifiers based on ARI

| Classifier | C4.5 | KNN | Voted perceptron | PART | Random forest | Bayes Net | JRip |
|---|---|---|---|---|---|---|---|
| KNN | 0.198 | - | - | - | - | - | - |
| Voted perceptron | 0.189 | 0.256 | - | - | - | - | - |
| PART | 0.290 | 0.259 | 0.282 | - | - | - | - |
| Random forest | 0.157 | 0.245 | 0.166 | 0.164 | - | - | - |
| Bayes Net | 0.232 | 0.120 | 0.126 | 0.240 | 0.050 | - | - |
| JRip | 0.251 | 0.239 | 0.281 | 0.306 | 0.237 | 0.191 | - |
| SVM | 0.248 | 0.359 | 0.382 | 0.386 | 0.361 | 0.125 | 0.335 |
present the fivefold CV results of the base prediction servers in
Table 9. Tables 8 and 9 indicate the superior performances of
meta classifiers in comparison with single base B-cell epitope
predictors.
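The stratified fivefold splitting described in this subheading can be sketched as follows. The Python example is a simplified illustration: it stratifies individual items by a class label, whereas the study divided whole antigens into folds while preserving the epitope/nonepitope distribution; the function names, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict
from typing import Hashable, Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[Hashable], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split item indices into k disjoint folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled members of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cv_splits(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists, using each fold once for testing."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy labels: 10 epitope items (1) and 90 nonepitope items (0).
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5, seed=0)
for train, test in cv_splits(folds):
    print(len(train), len(test), sum(labels[i] for i in test))
```

Repeating the procedure with a different seed gives the second of the two fivefold CVs, and the reported performance is the average over all ten test folds.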
From the results of the fivefold CVs, we observed that most of
the base predictors produced high true positive rates of prediction;
nevertheless they also suffered high false positive rates. In contrast
Table 8
Fivefold cross-validation of meta classifiers

| Classifier | TPR | FPR | Precision | Accuracy | F-score | MCC | AUC |
|---|---|---|---|---|---|---|---|
| 2-level Stacking^a | 0.593 | 0.009 | 0.848 | 0.959 | 0.697 | 0.689 | 0.920 |
| 3-level Stacking^b | 0.580 | 0.009 | 0.850 | 0.959 | 0.689 | 0.682 | 0.925 |
| Cascade^c | 0.588 | 0.009 | 0.843 | 0.959 | 0.693 | 0.684 | 0.925 |
| BaggingMDT | 0.448 | 0.018 | 0.676 | 0.939 | 0.539 | 0.520 | 0.869 |

a Two-level stacking meta classifiers with SVM as the top-level meta learner
b Three-level stacking meta classifier (Fig. 4)
c Cascade meta classifier (Fig. 5)
Table 9
Fivefold cross-validation of base epitope prediction servers

| Classifier | TPR | FPR | Precision | Accuracy | F-score | MCC | AUC |
|---|---|---|---|---|---|---|---|
| SEPPA 2.0 | 0.450 | 0.097 | 0.291 | 0.867 | 0.348 | 0.290 | 0.793 |
| DiscoTope 2.0 | 0.930 | 0.761 | 0.096 | 0.294 | 0.173 | 0.110 | 0.617 |
| Bpredictor | 0.129 | 0.017 | 0.399 | 0.916 | 0.195 | 0.192 | 0.690 |
| ElliPro | 0.711 | 0.512 | 0.108 | 0.506 | 0.186 | 0.109 | 0.635 |
| AAP | 0.831 | 0.770 | 0.085 | 0.278 | 0.154 | 0.039 | 0.490 |
| ABCpred | 0.603 | 0.548 | 0.088 | 0.463 | 0.152 | 0.031 | 0.536 |
| BCPREDS | 0.962 | 0.906 | 0.084 | 0.163 | 0.154 | 0.053 | 0.476 |
| BepiPred | 0.718 | 0.500 | 0.110 | 0.517 | 0.191 | 0.118 | 0.609 |
to most of the base tools, the proposed meta-learning approaches
(stacking, cascade, and BaggingMDT) showed lower false positive
rates. Although Bpredictor demonstrated the lowest false positive
rate in the independent test, unfortunately its true positive rate was
also the lowest. Among the eight base prediction tools, SEPPA 2.0
obtained the best balance between true and false positive rates as
indicated by the highest F-score, MCC, and AUC. When compar-
ing SEPPA 2.0 with the meta classifiers, we observed that stacking,
cascade, and BaggingMDT all outperformed SEPPA 2.0 for all the
performance measures except the true positive rate of Bag-
gingMDT. Overall, these observations suggest that the perfor-
mance of an ensemble approach based on meta learning is
superior to that of a single prediction tool for B-cell epitope
prediction.
3.3 Independent Tests
In addition to the comparisons between the meta classifiers and the
base epitope predictors for the same fivefold CVs, we also compared
the meta classifiers with the epitope predictors separately,
using different test antigens selected from the independent test
data set of 64 antigens (see Subheading 2.4). We conducted the
experiments on several representative epitope predictors: SEPPA
2.0, DiscoTope 2.0, Bpredictor, ElliPro, and CBTOPE [61]. Each
of them had been trained and tested on different data sets. We first
trained the meta classifiers (stacking, cascade, and BaggingMDT)
on the same training data set of 94 antigens (see Subheading 2.4),
which had been used previously to train the predictors under comparison.
In each experiment, we selected one epitope predictor for compar-
ison. To conduct consistent and unbiased analysis, from the inde-
pendent test data set of 64 antigens, we removed those that were
also used to train the base predictor selected for comparison to
ensure the training and test data were mutually exclusive. Table 10
shows that stacking, cascade, and BaggingMDT were superior or
comparable to these representative epitope predictors. The results
demonstrate that the synergy in the effects of multiple epitope
predictors or inductive classifiers can achieve superior performance
compared with that produced by a single epitope predictor.
4 Conclusion
Understanding the interactions between antibodies and epitopes
provides the basis for the rational design of preventive vaccines.
Following the increased availability of protein sequences and struc-
tures, various computational tools have been developed for epitope
prediction. The analytical and experimental results reveal the com-
plementary performances of various epitope prediction methods,
suggesting synergy among these computational tools. In this chap-
ter, we introduce three meta-learning strategies: stacking, cascade,
and BaggingMDT. We have examined their performances in the
problem of predicting B-cell epitopes. They are capable of exploit-
ing the synergy among various prediction methods and demon-
strate prediction performance superior to that of a single epitope
predictor. We conducted a consistent and unbiased independent
test on our method and compared the results with those from other
prediction tools. Our results demonstrate that the proposed meta-
learning approach outperforms the single base tools and other
recently developed epitope predictors.
Table 10
Results of independent tests

Classifier          TPR    FPR    Precision  Accuracy  F-Score  MCC    AUC

SEPPA               0.280  0.058  0.221      0.906     0.247    0.199  0.78
3-level Stacking^a  0.214  0.012  0.503      0.945     0.300    0.304  0.828
Cascade^b           0.217  0.012  0.521      0.946     0.306    0.313  0.822
BaggingMDT          0.214  0.040  0.239      0.919     0.226    0.184  0.744

DiscoTope           0.912  0.732  0.095      0.318     0.172    0.111  0.615
3-level Stacking^a  0.429  0.013  0.732      0.943     0.541    0.533  0.877
Cascade^b           0.412  0.012  0.744      0.943     0.530    0.528  0.872
BaggingMDT          0.424  0.032  0.527      0.925     0.470    0.433  0.833

Bpredictor          0.022  0.015  0.117      0.905     0.037    0.016  0.642
3-level Stacking^a  0.431  0.013  0.752      0.941     0.548    0.542  0.877
Cascade^b           0.411  0.011  0.764      0.941     0.534    0.534  0.872
BaggingMDT          0.443  0.031  0.559      0.925     0.494    0.458  0.842

ElliPro             0.761  0.519  0.113      0.503     0.196    0.131  0.654
3-level Stacking^a  0.430  0.013  0.742      0.943     0.544    0.538  0.885
Cascade^b           0.411  0.012  0.748      0.942     0.530    0.528  0.878
BaggingMDT          0.410  0.030  0.546      0.926     0.468    0.434  0.840

CBTOPE              0.379  0.144  0.204      0.814     0.265    0.180  0.649
3-level Stacking^a  0.425  0.014  0.752      0.937     0.543    0.536  0.862
Cascade^b           0.393  0.011  0.770      0.936     0.520    0.522  0.853
BaggingMDT          0.414  0.032  0.554      0.919     0.473    0.436  0.833

^a Three-level stacking meta classifier (Fig. 4)
^b Cascade meta classifier (Fig. 5)
5 Notes
1. These features are the descriptive attributes to represent the

examples in inductive learning. The selection of the features
addresses the inductive bias and consequently affects learning
performance. In this chapter, we introduced 14 base features
for epitope prediction, as shown in Table 1.
2. The selection of the classifiers to be meta learners can affect the
performance of the multilevel meta-learning architectures,
stacking and cascade. Cross-validation is a commonly used
method to determine an appropriate combination of classifiers.
3. The selection of the base classifiers used in MDTs affects the

predictive performance because a meta-attribute is defined over
the output of the trained base classifiers.
References
1. Meloen RH, Puijk WC, Langeveld JP, Langedijk JP, Timmerman P (2003) Design of synthetic peptides for diagnostics. Curr Protein Pept Sci 4:253–260
2. Tanabe S (2007) Epitope peptides and immunotherapy. Curr Protein Pept Sci 8:109–118
3. Naz RK, Dabir P (2007) Peptide vaccines against cancer, infectious diseases, and conception. Front Biosci 12:1833–1844
4. Benjamin DC, Berzofsky JA, East IJ, Gurd FR, Hannum C, Leach SJ et al (1984) The antigenic structure of proteins: a reappraisal. Annu Rev Immunol 2:67–101
5. Pellequer JL, Westhof E, Van Regenmortel MH (1991) Predicting location of continuous epitopes in proteins from their primary structures. Methods Enzymol 203:176–201
6. Hopp TP, Woods KR (1981) Prediction of protein antigenic determinant from amino acid sequences. Proc Natl Acad Sci U S A 78:3824–3828
7. Pellequer J, Westhof E, Van Regenmortel M (1993) Correlation between the location of antigenic sites and the prediction of turns in proteins. Immunol Lett 36(1):83–99
8. Blythe MJ, Doytchinova IA, Flower DR (2002) JenPep: A database of quantitative functional peptide data for immunology. Bioinformatics 18(3):434–439
9. Larsen JE, Lund O, Nielsen M (2006) Improved method for predicting linear B-cell epitopes. Immunome Res 2:2
10. Saha S, Raghava G (2006) Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins 65(1):40–48
11. Chen J, Liu H, Yang J, Chou K (2007) Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids 33(3):423–428
12. El-Manzalawy Y, Dobbs D, Honavar V (2008) Predicting linear B-cell epitopes using string kernels. J Mol Recognit 21(4):243–255
13. Andersen PH, Nielsen M, Lund O (2006) Prediction of residues in discontinuous B-cell epitopes using protein 3D structures. Protein Sci 15:2558–2567
14. Qi T, Qiu T, Zhang Q, Tang K, Fan Y, Qiu J et al (2014) SEPPA 2.0-more refined server to predict spatial epitope considering species of immune host and subcellular localization of protein antigen. Nucleic Acids Res 42:W59–W63
15. Ponomarenko J, Bui HH, Li W, Fusseder N, Bourne PE, Sette A, Peters B (2008) ElliPro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinformatics 9:514
16. Karplus PA, Schulz GE (1985) Prediction of chain flexibility in proteins – a tool for the selection of peptide antigens. Naturwissenschaften 72:212–213
17. Rubinstein ND, Mayrose I, Martz E, Pupko T (2009) Epitopia: a web-server for predicting B-cell epitopes. BMC Bioinformatics 10:287
18. Zhang W, Liu J, Zhao M, Li Q (2012) Predicting linear B-cell epitopes by using sequence-derived structural and physicochemical features. Int J Data Min Bioinform 6(5):557–569
19. Liang S, Zheng D, Standley DM, Yao B, Zacharias M, Zhang C (2010) EPSVR and EPMeta: prediction of antigenic epitopes using support vector regression and multiple server results. BMC Bioinformatics 11:381
20. Zhang W, Niu Y, Xiong Y, Zhao M, Yu R, Liu J (2012) Computational prediction of conformational B-cell epitopes from antigen primary structures by ensemble learning. PLoS One 7(8):e43575
21. Kringelum JV, Lundegaard C, Lund O, Nielsen M (2012) Reliable B cell epitope predictions: impacts of method development and improved benchmarking. PLoS Comput Biol 8(12):e1002829
22. Wolpert DH (1992) Stacked Generalization. Neural Netw 5:241–259
23. Ting KM, Witten IH (1997) Stacked generalization: When does it work? In: International Joint Conference on Artificial Intelligence, pp 866–873
24. Gama J (1998) Combining classifiers by constructive induction. In: European Conference on Machine Learning, pp 178–189
25. Gama J, Brazdil P (2000) Cascade Generalization. Mach Learn 41(3):315–343
26. Todorovski L, Dzeroski S (2000) Combining multiple models with meta decision trees. Lect Notes Comput Sci 1910:54–64
27. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
28. Schapire R (1990) The strength of weak learnability. Mach Learn 5:197–227
29. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers, San Francisco
30. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27
31. Bishop CM (1996) Neural networks for pattern recognition. Oxford University Press, Oxford
32. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27
33. Zhang W, Xiong Y, Zhao M, Zou H, Ye X, Liu J (2011) Prediction of conformational B-cell epitopes from 3D structures by random forests with a distance-based feature. BMC Bioinformatics 12:341
34. Nagano K (1973) Logical analysis of the mechanism of protein folding: I. predictions of helices, loops and beta-structures from primary structure. J Mol Biol 75(2):401–420
35. Hubbard SJ, Thornton JM (1993) NACCESS Computer Program. Department of Biochemistry and Molecular Biology, University College London
36. Lipkin HJ (2004) Physics of Debye-Waller Factors. arXiv:cond-mat/0405023
37. Liu R, Hu J (2011) Prediction of discontinuous B-cell epitopes using logistic regression and structural information. J Proteomics Bioinform 4:10–15
38. Sanner MF, Olson AJ, Spehner JC (1996) Reduced surface: an efficient way to compute molecular surfaces. Biopolymers 38(3):305–320
39. Parker JM, Guo D, Hodges RS (1986) New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 25(19):5425–5432
40. Zhang Z, Schäffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV, Altschul SF (1998) Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res 26(17):3986–3990
41. Gerstein M, Tsai J, Levitt M (1995) The volume of atoms on the protein surface: calculated from simulation, using Voronoi Polyhedra. J Mol Biol 249:955–966
42. Lee B, Richards FM (1971) The interpretation of protein structures: estimation of static accessibility. J Mol Biol 55(3):379–400
43. Gerstein M (1992) A resolution-sensitive procedure for comparing protein surfaces and its application to the comparison of antigen-combining sites. Acta Cryst A48:271–276
44. Hausman RE, Cooper GM (2003) The cell: a molecular approach. ASM Press, Washington, DC
45. Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157(1):105–132
46. Kolaskar AS, Tongaonkar PC (1990) A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett 276(1–2):172–174
47. Hu Y-J, Lin S-C, Lin Y-L, Lin K-H, You S-N (2014) A meta-learning approach for B-cell conformational epitope prediction. BMC Bioinformatics 15:378
48. Emini EA, Hughes JV, Perlow DS, Boger J (1985) Induction of hepatitis A virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol 55:836–839
49. Janin J, Wodak S, Levitt M, Maigret B (1978) Conformation of amino acid side-chains in proteins. J Mol Biol 125(3):357–386
50. Ponnuswamy PK, Prabhakaran M, Manavalan P (1980) Hydrophobic packing and spatial arrangement of amino-acid-residues in globular-proteins. Biochim Biophys Acta 623:301–316
51. Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science 185:862–864
52. Breiman L (2001) Random forests. Mach Learn 45:5–32
53. Frank E, Witten IH (1998) Generating accurate rule sets without global optimization. In: Proceedings of the Fifteenth International Conference on Machine Learning, pp 144–151
54. Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., Burlington, MA
55. Cohen WW (1995) Fast effective rule induction. In: Proceedings of the Fifteenth International Conference on Machine Learning, pp 115–123
56. Freund Y, Schapire RF (1999) Large margin classification using the perceptron algorithm. Mach Learn 37:277–296
57. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, CA
58. Schlessinger A, Ofran Y, Yachdav G, Rost B (2006) Epitome: database of structure-inferred antigenic epitopes. Nucleic Acids Res 34:D777–D780
59. Ponomarenko J, Papangelopoulos N, Zajonc DM, Peters B, Sette A, Bourne PE (2011) IEDB-3D: structural data within the immune epitope database. Nucleic Acids Res 39:D1164–D1170
60. Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
61. Ansari HR, Raghava G (2010) Identification of conformational B-cell Epitopes in an antigen from its primary sequence. Immunome Res 6:6
Chapter 23
PCPS: A Web Server to Predict Proteasomal Cleavage Sites

Marta Gomez-Perosanz, Alvaro Ras-Carmona, and Pedro A. Reche
Abstract
The proteasome complex is mainly responsible for proteolytic degradation of cytosolic proteins, generating
the C-terminus of MHC I-restricted peptide ligands and CD8 T cell epitopes. Therefore, prediction of
proteasomal cleavage sites is relevant for anticipating CD8 T-cell epitopes. There are two different protea-
somes, the constitutive proteasome, expressed in all types of cells, and the immunoproteasome, constitu-
tively expressed in dendritic cells. Although both proteasome forms generate peptides for presentation by
MHC I molecules, the immunoproteasome is the main form involved in providing peptide fragments for
priming CD8 T cells. In contrast, the constitutive proteasome provides peptides for presentation by MHC I
molecules that can be targeted by already primed CD8 T cells. Proteasome cleavage prediction server
(PCPS) is a server for predicting cleavage sites generated by both the constitutive proteasome and the
immunoproteasome. Here, we illustrate the usage of PCPS to predict proteasome and immunoproteasome
cleavage sites and compare the results with those provided by NetChop, a related tool available online.
PCPS is freely available for public use at http://imed.med.ucm.es/Tools/pcps/.
Key words Proteasome, Immunoproteasome, Prediction of cleavage sites, PCPS
1 Introduction
The proteasome is a protein complex responsible for the proteolytic

degradation of cytosolic proteins [1]. While several proteases and
peptidases of the endoplasmic reticulum and the cytosol trim amino
acids from the N-terminus of major histocompatibility complex class I
(MHCI) peptide ligands, the C-terminus (P1 residue of the cleav-
age site) is generally generated by the proteasome, thus shaping the
repertoire of peptides presented by MHCI molecules [2, 3].
The proteasome complex includes a 20S catalytic core and two
19S regulatory cap subunits. The 20S particle consists of four
heptameric rings. The outer rings are made of α subunits, whereas
the two inner rings are made of catalytic β subunits. In mammals,
the catalytic activity specifically resides in the β1, β2, and β5 sub-
units. Most mammalian cells constitutively express this type of
proteasome which is also known as the constitutive proteasome.
Interestingly, under the influence of proinflammatory stimuli,

particularly IFNγ, mammalian hematopoietic cells assemble a dif-

ferent type of proteasome, incorporating alternative β1i, β2i, and
β5i catalytic subunits. This alternative form of proteasome, known
as immunoproteasome, is constitutively expressed by dendritic cells
[1, 3, 4]. The immunoproteasome and the constitutive proteasome
have distinct cleavage patterns [5, 6], both generating peptide
fragments that can eventually be displayed bound to MHCI mole-
cules on the cell surface and recognized by CD8 T cells. However,
the immunoproteasome provides peptides for priming CD8 T cells,
while the constitutive proteasome provides peptides for presentation by MHCI
molecules in target cells that can be recognized by primed CD8 T
cells [7, 8]. Hence, prediction of cleavage sites by both the protea-
some and immunoproteasome serves to identify protective CD8
T-cell epitopes.
The proteasome, whether constitutive or immunoproteasome, is
fairly nonspecific, yet it exhibits certain residue preferences around
the cleavage site, which lies between the P1 and P1’ residues.
Proteasome cleavage models can be derived from experimental
cleavage data consisting of protein fragments generated in vitro
by protein digestion with proteasomes [9–12] and from datasets
of MHCI-restricted peptides and their C-terminal flanking regions
[13–16]. The PCPS tool [16] that we describe in this chapter
(freely available at http://imed.med.ucm.es/Tools/PCPS/)
implements n-gram models trained on MHC I-restricted peptides.
However, unlike related methods, we distinguished between self-peptides
eluted from human MHC I molecules and CD8 T-cell
epitopes, generating proteasome and immunoproteasome n-gram
models, respectively.
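To give a feel for how an n-gram model can score a peptide fragment, the following minimal sketch trains a bigram model on a handful of fragments and compares the log-probability of two query fragments. It is a generic illustration under stated assumptions, not the PCPS implementation: the n-gram order, the add-one smoothing, and the toy training fragments are all choices made here for the example, and the models actually used by PCPS are described in ref. [16].

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_bigram(fragments):
    """Count residue pairs and their left-hand contexts in the fragments."""
    pair_counts, context_counts = Counter(), Counter()
    for frag in fragments:
        for a, b in zip(frag, frag[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts


def log_prob(fragment, model):
    """Log-probability of the residue transitions in a fragment under the
    bigram model, using add-one smoothing over the 20 amino acids."""
    pair_counts, context_counts = model
    total = 0.0
    for a, b in zip(fragment, fragment[1:]):
        p = (pair_counts[(a, b)] + 1) / (context_counts[a] + len(AMINO_ACIDS))
        total += math.log(p)
    return total


# Toy fragments, invented for illustration only.
cleaved = ["KLVALGIN", "RLVALGVN", "KLIALGIN"]
model = train_bigram(cleaved)

# A fragment resembling the training data scores higher than an unrelated one.
print(log_prob("KLVALGIN", model) > log_prob("WWWWWWWW", model))
```

In practice, a classifier would compare scores from a model trained on fragments spanning cleavage sites against scores from a background model, but that decision rule is left out of this sketch.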
We have shown that proteasomal cleavage site predictions serve
to enhance CD8 T-cell epitope predictions by discarding peptides
resulting from peptide-MHC I binding models that do not have a
C-terminus compatible with proteasome cleavage [16]. Here, we
illustrate the usage of PCPS to predict cleavage sites by the consti-
tutive proteasome and the immunoproteasome. Moreover, we use a
set of 59 HCV-specific CD8 T-cell epitopes to analyze cleavage
predictions by PCPS and compare them with those provided by
NetChop, a related method available online [15].
2 PCPS Overview
Proteasome cleavage prediction server (PCPS) is a web-based tool

to predict proteasome and immunoproteasome cleavage sites using
n-gram models. N-gram models are frequently used for language
modeling and they also proved useful for predicting cleavage sites
[16]. An overview of the web interface of PCPS and available
models is shown in Fig. 1. The homepage of PCPS was designed
to be intuitive and user-friendly. Briefly, the main input to
PCPS is one or several protein sequences that can be pasted or
uploaded to the server in multiple formats. The sequences provided
to the server are subjected to a cleavage analysis using n-gram
models that are selected by the user. There are several models
available for both proteasomes, constitutive and immunoproteasome,
which differ in sensitivity and specificity, and users can combine
different proteasome and immunoproteasome models. The
output of PCPS consists of a table indicating the cleavage score of
each residue in the protein queries. Computed scores reflect the
likelihood that the proteasome/immunoproteasome would cleave
the protein after that residue (P1 residue of the cleavage site). Whenever
the cleavage score is higher than 0.5, a tick marks the
corresponding residue (see Subheading 3 below for more
information).
Fig. 1 Web interface of PCPS. Cleavage site predictions by PCPS are performed in three simple steps:
(1) upload the target sequence, (2) choose the proteasome and/or immunoproteasome cleavage model to
apply, and (3) run the analysis
3 PCPS Practical User Guide
In this section, we illustrate the usage of PCPS to predict cleavage

sites by the constitutive proteasome and the immunoproteasome,
using the Hepatitis C Virus (HCV) proteome with GenBank acces-
sion number (ACN) M62321.1 as target sequence.
3.1 Input
The protein query sequence(s) for PCPS can be pasted or
uploaded from a local file. In our example (Fig. 2a), we
entered our target HCV proteome in the “INPUT” field in
FASTA format (see Note 1). If the user opts to upload the sequence from a
local file, there are two sequential steps: first, browse to and choose
the local file with the sequences and, second, click the upload button.
This is done to facilitate preprocessing and error checking of input
data prior to submission to the server.
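Before pasting or uploading sequences, it can help to check the input locally. The sketch below parses FASTA text and reports characters outside the 20 standard amino acid letters. It is an assumption-laden illustration: the function names and the toy record are invented here, and the validation rules applied by the PCPS server itself are not documented in this chapter beyond the requirement of ASCII characters (see Note 1).

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def unexpected_residues(seq, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Return the 1-based positions of characters outside the alphabet."""
    return [i for i, c in enumerate(seq.upper(), start=1) if c not in alphabet]


# Toy record, invented for illustration only.
example = ">toy_protein (invented)\nMSTNPKPQRK\nTKRNTNRRPQ\n"
records = parse_fasta(example)
for header, seq in records.items():
    print(header, len(seq), seq.isascii(), unexpected_residues(seq))
```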
3.2 Cleavage Prediction Models
The web server allows the user to select a single n-gram cleavage
prediction model, proteasome or immunoproteasome, or both
models simultaneously (see Note 2). The different cleavage prediction
models were trained on datasets of peptide fragments of different
lengths (6, 8, or 12 amino acids) consisting of 382 MHCI-eluted
peptides (proteasome model) or 553 naturally processed
CD8 T-cell epitopes (immunoproteasome model). Datasets comprise
two distinct portions with the same number of residues: one
consisting of the C-terminal end of MHCI-restricted peptides and
the other consisting of their C-terminal flanking region. The cleavage
models for proteasome and immunoproteasome available in
PCPS and their performance are collected in Table 1.

Table 1
Cleavage models available at PCPS server

Proteasome               Model  N   Sensitivity   Specificity   ECS           BTR

Constitutive proteasome  1      12  0.87 (±0.03)  0.53 (±0.06)  0.35 (±0.02)  0.53 (±0.02)
                         2      8   0.85 (±0.04)  0.60 (±0.05)  0.38 (±0.02)  0.47 (±0.02)
                         3      6   0.79 (±0.05)  0.72 (±0.04)  0.52 (±0.08)  0.43 (±0.03)
Immunoproteasome         1      12  0.90 (±0.03)  0.41 (±0.03)  0.46 (±0.01)  0.44 (±0.01)
                         2      8   0.91 (±0.02)  0.54 (±0.04)  0.51 (±0.01)  0.39 (±0.01)
                         3      6   0.76 (±0.04)  0.71 (±0.04)  0.39 (±0.01)  0.38 (±0.02)

N Size of the peptide fragments in training and testing sets
ECS expected cleavage sites. Calculated using the equation 100 × C/(N − 1), where C is the average number of cut points per
fragment yielded by a given model when tested in a file of fragments of size N
BTR better than random. Calculated as the difference between SE and ECS (BTR = SE − ECS). The bigger the
difference between sensitivity and ECS, the better the prediction capacity of the model

Fig. 2 PCPS search example and output. The figure illustrates the different parameters selected for computing
PCPS predictions using the HCV proteome (ACN: M62321.1) as an example (a) and the proteasome and immunoproteasome
cleavage site prediction results provided by the server (b)
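On one reading of the dataset description above, each training fragment of size N joins the last N/2 residues of an MHCI-restricted peptide to the first N/2 residues of its C-terminal flanking region, so that the cleavage site sits in the middle. The sketch below builds such a fragment. The function name, the toy protein, and the handling of edge cases are assumptions made for illustration; the exact procedure used to build the PCPS datasets is given in ref. [16].

```python
def build_fragment(protein, peptide, n):
    """Return an n-residue fragment spanning the peptide's C-terminal
    cleavage site: the last n/2 residues of the peptide followed by the
    first n/2 residues of its C-terminal flanking region.

    Returns None if the peptide is not found, is shorter than n/2, or
    lies too close to the C-terminal end of the protein.
    """
    half = n // 2
    start = protein.find(peptide)  # first occurrence only
    if start == -1 or len(peptide) < half:
        return None
    end = start + len(peptide)  # index just past the P1 residue
    if end + half > len(protein):
        return None
    return protein[end - half:end] + protein[end:end + half]


# Toy sequences, invented for illustration only.
protein = "ACDEFGHIKLMNPQRSTVWYACDEFG"
peptide = "GHIKLMNPQ"

print(build_fragment(protein, peptide, 6))
print(build_fragment(protein, peptide, 12))
```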
In our practical example (Fig. 2a), in the “CLEAVAGE MODEL” field
we selected the proteasome and immunoproteasome cleavage
prediction models trained on 12-residue fragments (Model 1), since
these were the ones with the best predictive performance (BTR
value).
3.3 Output
The results of PCPS are presented in a user-friendly format
(Fig. 2b). The server computes a cleavage prediction after each
residue of the protein and returns a table with all the residues of the
input sequence and their cleavage prediction scores, marking the
cleavage sites with a tick when the score is >0.5 (see Note 3). When
both models are selected, as in our example, the server returns a
table with both cleavage predictions. The output of our practical
example is shown in Fig. 2b. At the end of the output page, the user
can download comma- or tab-separated results in .txt or Excel
format, respectively, for further analysis of the cleavage results.
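Once per-residue scores have been downloaded, the tick marks can be reproduced with a simple threshold. The sketch below assumes the scores are already available as a list aligned with the sequence; the function name and the example values are invented for illustration, and the layout of the downloaded file is not assumed here.

```python
def predicted_cleavage_sites(sequence, scores, threshold=0.5):
    """Return (position, residue, score) for every residue whose cleavage
    score is strictly above the threshold. Positions are 1-based, and
    cleavage is predicted after the listed residue (the P1 residue)."""
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    return [
        (i, residue, score)
        for i, (residue, score) in enumerate(zip(sequence, scores), start=1)
        if score > threshold
    ]


# Invented sequence and scores, for illustration only.
sequence = "MSTNPK"
scores = [0.12, 0.61, 0.50, 0.93, 0.08, 0.77]

for position, residue, score in predicted_cleavage_sites(sequence, scores):
    print(position, residue, score)
```

A score exactly equal to 0.5 is not marked, in keeping with the ">0.5" rule stated above.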
4 NetChop vs PCPS Comparison
NetChop (available at http://www.cbs.dtu.dk/services/NetChop/)
is considered a reference tool for predicting proteasome
cleavages [15]. NetChop currently implements two models that
were generated by training artificial neural networks on MHC
I-restricted peptides and in vitro digestion data, respectively. The
default and most accurate predictions in NetChop are provided by
an immunoproteasome model trained on 1260 publicly available
MHC class I ligands (C-term 3.0 model). Since prediction of
proteasome cleavage sites serves to identify CD8 T-cell epitopes
with C-terminal residues compatible with proteasome cleavage, we
compared PCPS and NetChop on a dataset of 59 HCV-specific
CD8 T-cell epitopes (see Note 4). These epitopes were selected out
of 232 HCV-specific CD8 T-cell epitopes obtained from the IEDB
epitope database [17] because they matched a selected reference HCV
proteome (ACN: M62321.1). We predicted proteasomal cleavage
sites using the NetChop immunoproteasome model (C-term 3.0
model) and immunoproteasome model 1 in PCPS (see Table 1).
We chose the same arbitrary cleavage score threshold of 0.5 in both
servers.
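The epitope selection step can be sketched as a simple lookup in the reference proteome. The sketch assumes that "matching" means an exact substring match, which is an interpretation made here rather than a stated criterion, and the sequences are invented for illustration. The position of the C-terminal residue is recorded because that is where the cleavage score is later read.

```python
def epitopes_in_proteome(epitopes, proteome):
    """Map each epitope found as an exact substring of the proteome to the
    1-based position of its C-terminal residue (first occurrence only)."""
    matches = {}
    for epitope in epitopes:
        start = proteome.find(epitope)
        if start != -1:
            matches[epitope] = start + len(epitope)
    return matches


# Invented sequences, for illustration only.
proteome = "MSTNPKPQRKTKRNTNRRPQDVKF"
epitopes = ["NPKPQRKTK", "AAAAAAAAA", "RRPQDVKF"]

print(epitopes_in_proteome(epitopes, proteome))
```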
In order to evaluate the performance of the cleavage models being
compared, we adopted the assumptions reported by Saxová et al.
[14], as previously followed in other works [7, 16]. We assume that
cleavage is more likely to occur at the C-terminus of the epitopes
than at any internal site. Following that scheme, we
can examine each residue cleavage score (CS) computed by the
program and classify the prediction for each epitope as:
– True positive (TP), if the CS at the C-terminus is above the
threshold.
– False negative (FN), if the CS at the C-terminus is below the threshold.
– True negative (TN), if no cleavage sites are predicted (all
cleavage scores are below the threshold) or if the predicted cleavage
scores within the epitope are lower than the CS of the
C-terminus.
– False positive (FP), if at least one residue of the peptide is above
the threshold and has a higher CS than the C-terminus.
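The classification rules above can be sketched as a small function. This is one possible reading of the rules, under two assumptions made here: each epitope contributes one call for its C-terminus (TP or FN) and one call for its internal residues (TN or FP), and scores exactly equal to the threshold or to the C-terminal score are not counted as "above" or "higher".

```python
def classify_epitope(scores, threshold=0.5):
    """Classify one epitope from its per-residue cleavage scores.

    `scores` lists the cleavage score after each residue of the epitope,
    with the C-terminal residue last. Returns a pair: the call for the
    C-terminus ("TP" or "FN") and the call for internal sites ("TN" or "FP").
    """
    *internal, c_term = scores
    site_call = "TP" if c_term > threshold else "FN"
    # An internal site counts against the epitope only if it is predicted
    # as cleaved and scores higher than the C-terminus.
    destroying = any(s > threshold and s > c_term for s in internal)
    internal_call = "FP" if destroying else "TN"
    return site_call, internal_call


# Invented scores, for illustration only.
print(classify_epitope([0.10, 0.20, 0.90]))
print(classify_epitope([0.10, 0.95, 0.90]))
```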
Subsequently, we used Eqs. 1, 2, and 3 to compute sensitivity
(SE), specificity (SP) and Matthews’ correlation coefficient (MCC):
\[ \mathrm{SE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{1} \]
\[ \mathrm{SP} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \tag{2} \]
\[ \mathrm{MCC} = \frac{(\mathrm{TP} \times \mathrm{TN}) - (\mathrm{FN} \times \mathrm{FP})}{\sqrt{(\mathrm{TN} + \mathrm{FN})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TP} + \mathrm{FP})}} \tag{3} \]
In that context, SE measures the proportion of cleavage sites
correctly predicted, SP measures the proportion of true negatives as
defined above, and MCC is a global measure of performance.
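Equations 1, 2, and 3 translate directly into code. In the sketch below the counts are hypothetical and chosen only to make the arithmetic easy to follow; returning 0.0 when the MCC denominator is zero is a common convention adopted here, not something specified in the text.

```python
import math


def sensitivity(tp, fn):
    """Eq. 1: proportion of C-terminal cleavage sites correctly predicted."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Eq. 2: proportion of true negatives among all negatives."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Eq. 3: Matthews' correlation coefficient (0.0 if undefined)."""
    denominator = math.sqrt((tn + fn) * (tp + fn) * (tn + fp) * (tp + fp))
    if denominator == 0:
        return 0.0
    return ((tp * tn) - (fn * fp)) / denominator


# Hypothetical counts, for illustration only.
tp, fn, tn, fp = 8, 2, 6, 4
print(sensitivity(tp, fn), specificity(tn, fp), round(mcc(tp, tn, fp, fn), 3))
```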
The results obtained from both servers are shown in Table 2.
PCPS outperformed NetChop in sensitivity (0.88 and 0.78, respectively);
that is, it more often recognized that the C-terminus
of the tested CD8 T-cell epitopes likely resulted from proteasomal
cleavage. However, PCPS identified additional preferential cleavage
sites within the tested epitopes more often than NetChop, which
resulted in lower specificity (0.57 vs 0.66, respectively). Since the
overall performance of PCPS and NetChop given by the MCC value
is equivalent (0.34 vs 0.33) and PCPS is more sensitive, PCPS
predictions are better suited to enhance CD8
T-cell epitope predictions without losing bona fide CD8 T-cell
epitopes (see Note 5).

Table 2
Predictive performance of PCPS and NetChop

Server   SE    SP    MCC
PCPS     0.88  0.57  0.34
NetChop  0.78  0.66  0.33

SE sensitivity, SP specificity, MCC Matthews’ correlation coefficient of the cleavage
prediction method as computed by Eqs. 1, 2, and 3
5 Notes
1. Additional input sequence formats accepted by PCPS include
GenBank, EMBL, and Phylip formats. Sequences must contain
only ASCII characters.
2. Protective CD8 T-cell epitopes have C-terminal ends that are
compatible with cleavage by the immunoproteasome and the
proteasome.
3. We also recommend checking for epitope-destroying cleavage
sites. Even if the candidate epitope has a C-terminus that is
predicted to be cleaved by the proteasome, it is possible that
one or more residues of the peptide have a cleavage score that is
above the cleavage score of the C-terminus. In such cases,
cleavage by the proteasome can destroy the epitope.
4. It is important to note that PCPS and NetChop are meant to
predict C-terminus of peptides that are compatible with being
generated by proteasome cleavage rather than to predict pro-
teolytic fragments. Proteasome fragmentation patterns (the
size of fragments) may be better reproduced by methods
trained on actual cleavage data [9–12].
5. Enhanced prediction of CD8 T-cell epitopes by combining
predictions of peptide binding to MHC I molecules and
cleavage sites using n-grams can be achieved using the RANKPEP
tool, available for free public use at
http://imed.med.ucm.es/Tools/rankpep.html [18, 19].
Acknowledgments
We wish to thank the Spanish department of science at MINECO

for supporting the research of the immunomedicine group through
grants SAF2006:07879, SAF2009:08301 and BIO2014:54164-R
to P.A.R.
References
1. Kloetzel PM (2001) Antigen processing by the proteasome. Nat Rev Mol Cell Biol 2(3):179–187
2. Blum JS, Wearsch PA, Cresswell P (2013) Pathways of antigen processing. Annu Rev Immunol 31:443–473
3. Rock KL, Goldberg AL (1999) Degradation of cell proteins and the generation of MHC class I-presented peptides. Annu Rev Immunol 17:739–779
4. Craiu A, Akopian T, Goldberg A, Rock KL (1997) Two distinct proteolytic processes in the generation of a major histocompatibility complex class I-presented peptide. Proc Natl Acad Sci U S A 94(20):10850–10855
5. Dalet A, Stroobant V, Vigneron N, Van den Eynde BJ (2011) Differences in the production of spliced antigenic peptides by the standard proteasome and the immunoproteasome. Eur J Immunol 41(1):39–46
6. Morel S, Levy F, Burlet-Schiltz O, Brasseur F, Probst-Kepper M, Peitrequin AL et al (2000) Processing of some antigens by the standard proteasome but not by the immunoproteasome results in poor presentation by dendritic cells. Immunity 12(1):107–117
7. Nielsen M, Lundegaard C, Lund O, Kesmir C (2005) The role of the proteasome in generating cytotoxic T-cell epitopes: insights obtained from improved predictions of proteasomal cleavage. Immunogenetics 57(1–2):33–41
8. Rivett AJ, Hearn AR (2004) Proteasome function in antigen presentation: immunoproteasome complexes, peptide production, and interactions with viral proteins. Curr Protein Pept Sci 5(3):153–161
9. Nussbaum AK, Kuttler C, Hadeler KP, Rammensee HG, Schild H (2001) PAProC: a prediction algorithm for proteasomal cleavages available on the WWW. Immunogenetics 53(2):87–94
10. Tenzer S, Peters B, Bulik S, Schoor O, Lemmel C, Schatz MM et al (2005) Modeling the MHC class I pathway by combining predictions of proteasomal cleavage, TAP transport and MHC class I binding. Cell Mol Life Sci 62(9):1025–1037
11. Holzhutter HG, Frommel C, Kloetzel PM (1999) A theoretical approach towards the identification of cleavage-determining amino acid motifs of the 20S proteasome. J Mol Biol 286(4):1251–1265
12. Kuttler C, Nussbaum AK, Dick TP, Rammensee HG, Schild H, Hadeler KP (2000) An algorithm for the prediction of proteasomal cleavages. J Mol Biol 298(3):417–429
13. Bhasin M, Raghava GP (2005) Pcleavage: an SVM based method for prediction of constitutive proteasome and immunoproteasome cleavage sites in antigenic sequences. Nucleic Acids Res 33(Web Server issue):W202–W207
14. Saxova P, Buus S, Brunak S, Kesmir C (2003) Predicting proteasomal cleavage sites: a comparison of available methods. Int Immunol 15(7):781–787
15. Kesmir C, Nussbaum AK, Schild H, Detours V, Brunak S (2002) Prediction of proteasome cleavage motifs by neural networks. Protein Eng 15(4):287–296
16. Diez-Rivero CM, Lafuente EM, Reche PA (2010) Computational analysis and modeling of cleavage by the immunoproteasome and the constitutive proteasome. BMC Bioinformatics 11:479
17. Fleri W, Paul S, Dhanda SK, Mahajan S, Xu X, Peters B et al (2017) The immune epitope database and analysis resource in epitope discovery and synthetic vaccine design. Front Immunol 8:278
18. Reche PA, Glutting JP, Zhang H, Reinherz EL (2004) Enhancement to the RANKPEP resource for the prediction of peptide binding to MHC molecules using profiles. Immunogenetics 56(6):405–419
19. Reche PA, Glutting JP, Reinherz EL (2002) Prediction of MHC class I binding peptides using profile motifs. Hum Immunol 63(9):701–709
Bayesian network (BN)....................................... 157, 206,
Epitope potential.......................................................20, 23
210, 383, 390 Epitope prediction .............................................. 4, 12, 13,
BcePred server................................................12, 181, 236 26, 42–49, 51, 52, 61, 155–169, 214, 215,
Binding stability ............................................................ 181
219–221, 223, 226, 255, 266, 268, 269,
Breast cancer type 1 susceptibility protein 281–282, 285, 286, 289, 290, 299, 300
(BRCA1)................................................... 368, 370 Epitopes-HLA docked complex................................... 181

EPSVR .... vi, 289–296
Escherichia coli .... vi, 156, 330

F

FASTA format .... 4, 150, 151, 176, 178, 179, 215, 231, 232, 238, 259, 261, 262, 267, 280, 283, 300, 304
Flow cytometers .... 362
Fluorescein isothiocyanate (FITC) .... 246
Fluorescence polarization .... 246, 247, 251–253
Fuzzy C-means clustering .... 204–206, 208

G

GenBank database .... 26
Gene co-expression .... 202, 205–207
Grand Average Hydropathicity (GRAVY) .... 286

H

Helper T-cell (HTC) .... 141, 177–179, 279
Hidden Markov model (HMM) .... 42, 214, 233, 300, 339
HIV-1 .... 4, 255
HLA allele genotyping .... 31–37
HLA distribution analysis .... 161–162
HLA sequence data .... 33
HLA typing .... 187
Homology modeling .... 4, 7, 49, 82, 103, 266, 267
HTRF Immunoassays .... 361
Human leukocyte antigen (HLA) .... 24, 31, 33–36, 49, 80, 92, 102, 110, 120, 128, 138, 141, 158, 159, 162, 168, 169, 178–182, 185
Human papilloma virus .... 19
Hydropathy index .... 380, 381
Hydrophilicity .... 47, 49, 52, 54, 57, 60, 63, 162, 165, 167, 177, 181, 214, 221, 223, 224, 286, 300

I

IC50 value .... 48, 158–160, 169, 178, 282
IMGT database .... 168
Immune epitope database (IEDB) .... v, vi, 13, 23, 26, 29, 42, 47, 49, 69, 80, 141, 158, 159, 161, 162, 165, 177–179, 188, 214–217, 220, 223, 225, 234, 237, 238, 256, 260, 270, 281–283, 300, 301, 340, 386, 404
Immune epitope database analysis resource (IEDB-AR) .... 13, 214, 266, 270
Immunoassays .... 349–363, 373
Immunodominant epitopes .... vi, 265
Immunoglobulins .... 166, 168
Immunosuppression .... 186
Immunosurveillance .... 185–187
Immunotherapy .... 213
Influenza virus .... 229–241
Inhibitory concentration .... 158
In silico amino acid substitution .... 215
In silico PCR .... 42
In silico vaccine design .... v

K

Kernel methods .... 376
k nearest neighbour (kNN) algorithm .... 148, 149
Kolaskar and Tongaonkar antigenicity .... 47, 51–53, 56, 59, 62, 214, 221–223

L

Latent period .... 312, 315–318, 324, 336
Leptospirosis .... vi, 173–182
LIBSVM .... 293

M

Machine learning (ML) .... v, 49, 176, 233, 235, 284, 300, 339, 376
Major histocompatibility complex (MHC) .... 5–7, 12, 13, 31, 33, 48, 49, 63, 67, 74, 155, 156, 159, 162, 167, 168, 177, 213, 216, 217, 225
Meta decision trees (MDT) .... vii, 376, 381–386, 389, 390, 392, 395
Meta learning .... vii, 375–395
Middle East Respiratory Syndrome Coronavirus (MERS-CoV) .... v, 39–144
Mimotopes .... vi, 167, 213, 214, 224
Molecular docking .... vi, 4, 8–15, 143, 162, 179, 180, 236–237, 239–240, 271, 284, 285
Molecular dynamics simulations .... 181
Monoclonal antibody (mAb) .... vi, 245–253
Monte Carlo simulations .... 311–313, 315, 316, 321, 322, 324
Motif-based sequence analysis .... 251
Multiplex assays .... 350, 352–353
Multiplicity of infection (MOI) .... 314, 315
Mycobacteria .... 139, 321, 322, 331, 332, 336–338, 341, 342
Mycobacteriophage .... vi, 321, 324, 331–334, 336–339, 341

N

NCBI database .... 4, 148, 176, 280
NetMHCpan .... 48, 61, 62, 67, 68, 73, 168, 188, 225, 234, 270, 273, 340
Next-generation sequencing (NGS) .... 31, 32, 34, 233, 246, 247, 250–253, 339, 340
O

Oncogenic mutation probability .... 185–197
Oncogenome .... 187
Outer membrane protein (OMP) .... 174, 176, 178, 266

P

Paratopes .... 3, 213
Parker hydrophilicity prediction .... 47, 52, 54, 57, 60, 63, 162, 165, 177, 221, 222
PatchDock rigid-body server .... 180
Pathogen .... 2, 3, 17–19, 24, 156, 157, 166, 169, 173, 174, 178, 208, 230, 255, 262, 265, 266, 271, 272, 278, 330, 332, 336, 338, 341, 342, 355, 365
Pearson correlation .... 200, 201, 204, 209
PEP-FOLD3 .... 180, 284
Peptide-MHC-I affinity .... 400
Peptide vaccine design .... 17–29
Phage-bacteria dynamics .... vi, 309–326
Phage display .... vi, 245–247, 249, 252, 253
Phage panning .... 249
Physicochemical properties .... 51, 149, 177, 214, 280, 286, 376, 383
Population coverage calculation .... 49, 80
Position specific scoring matrices (PSSMs) .... 168, 233, 234, 281, 339, 340, 380, 381
Primary human periodontal ligament fibroblasts (hPDLFs) .... 351
Propensity scores .... vi, 167, 181, 222, 236, 300, 376, 380
Proteasome cleavage .... 48, 400, 403–405
Protein data bank (PDB) .... 7–9, 26, 148, 162, 179, 181, 182, 267, 271, 284, 286, 290, 291, 295
Protein Information Resource (PIR) .... 7, 148, 281, 284
Protein-protein interaction network (PPI) .... vi, 207
Protein Variability Server (PVS) .... 256

Q

qPCR .... 250
Quantum Matrices .... 281, 282

R

Ramachandran Plot assessment .... 267
Random Forest Algorithm .... 236, 272, 286
RANKPEP .... 4–6, 12, 13, 168, 234, 260, 282, 340
Residue conservation score .... 294
Respiratory syncytial virus (RSV) .... 207, 208
Reverse vaccinology .... v, 1–15, 19, 156
RNAseq data .... 208

S

Secondary structure .... 26, 165, 167, 214, 290, 293–295, 300, 380
Severe acute respiratory syndrome coronavirus .... 40
Side chain polarity .... 380, 381
Size exclusion chromatography (SEC) .... 246, 248, 249
Stabilized matrix method (SMM) .... 47, 225, 234, 270, 273, 282
Stromal cell-basal medium (SCBM) .... 351
Support vector regression (SVR) .... 168, 291, 293
SVM package .... 293, 302
SVMTriP .... vi, 236, 281, 299–305
Swiss-Prot database .... 168
SYFPEITHI .... 13, 168, 234, 340
Synthetic peptide vaccine .... 231

T

TAP transport .... 48, 49, 61, 62, 67, 158, 159, 167, 178, 282
T-cell receptor (TCR) .... 167, 168, 177, 213, 230, 280, 341
The Cancer Genome Atlas (TCGA) .... 187, 188, 193, 195, 340
3D modeling .... 23
Transcriptomics .... 200, 206, 207, 338
Transporter associated with antigen presentation (TAP) .... 48, 167, 168

V

Vaccinomics .... 1, 19
VaxiJen server .... 157, 159, 165, 176
Virus Pathogen Database and Analysis Resource (ViPR) .... 280

W

Whole genome database .... 4

Z

Zika virus .... v, 17–29
Zoonotic disease .... 173
Z-score .... 148, 268, 270, 272

