Andrade Martínez Et Al 2022 Computational Tools For The Analysis of Uncultivated Phage Genomes
a Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
b Department of Plant and Environmental Science, University of Copenhagen, Frederiksberg, Denmark
c Department of Microbiome Science, Max Planck Institute for Developmental Biology, Tübingen, Germany
d The GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
e The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri, USA
SUMMARY
INTRODUCTION
ASSEMBLY AND IDENTIFICATION OF PHAGES AND PROPHAGES IN METAGENOMIC DATA
Assembly and Binning Approaches for Viral Metagenomic Data
Main assembly approaches
Selecting an Assembly Tool
Main binning approaches
Selecting a Binning Tool
Detection of Phages and Prophages
Main approaches for phage and prophage identification
Critical factors that affect the performance of phage and prophage prediction tools
Selecting a Phage/Prophage Prediction Tool
Assessing the completeness and contamination of predicted phage contigs
Contribution of phage and prophage prediction tools to the expansion of the known global phage diversity
GENOME ANNOTATION OF PHAGES
Gene Calling
which has prompted the development of computational tools that support the multilevel
characterization of these novel phages based solely on their genome sequences.
The impact of these technologies has been so large that, together with MAGs
(Metagenomic Assembled Genomes), we now have UViGs (Uncultivated Viral Genomes),
which are now officially recognized by the International Committee on Taxonomy
of Viruses (ICTV), and new taxonomic groups can now be created based exclusively on
genomic sequence information. Even though the available tools have contributed immensely
to our knowledge of phage diversity and ecology, the ongoing surge in software
programs makes it challenging to keep up with them and the purpose each one
is designed for. Therefore, in this review, we describe a comprehensive set of currently
available computational tools designed for the characterization of phage genome
sequences, focusing on five specific analyses: (i) assembly and identification of phage
and prophage sequences, (ii) phage genome annotation, (iii) phage taxonomic classification,
(iv) phage-host interaction analysis, and (v) phage microdiversity.
KEYWORDS computational analysis, microdiversity, phage annotation, phage
metagenomics, phage taxonomy, phage and prophage identification, phage-host
interaction, uncultivated viruses, viromes
INTRODUCTION
BOX 1: GLOSSARY
Area under the ROC curve (AUC): Statistical measurement of the performance of
a classifier, based on sensitivity and specificity. Its values range from 0 to 1, with
1 being the best possible. While a value of 0 is considered the worst possible,
any value under 1/N, with N being the number of classes which can be
predicted, is worse than random.
Command line: Also referred to as a terminal, an interface through which a user
can interact with a computer or server without a graphical interface.
Communication via the command line is done through the typing and
execution of commands from a shell programming language (e.g., Bash).
E value: Number of sequences in a database which would be expected to have,
by chance, an alignment as good as or better than the one obtained given a
query sequence and the selected database. The lower the E value, the more
significant the alignment, and the more likely that the aligned sequences are
homologous.
F1 score: Statistical measurement of the performance of a classifier, based on
precision and recall. Its values range from 0 to 1, with 1 being the best value
possible and 0 the worst.
FASTA format: Format for the computational representation of nucleotide or
amino acid sequences. A sequence in FASTA format is composed of (i) one
description line, identified by a leading greater-than (>) sign, which contains
the description of the sequence, and (ii) a sequence line, made up of the
sequence itself without any additional characters.
General feature format (GFF): Format for the storage of descriptions regarding
biologically relevant features in a nucleotide or amino acid sequence. The GFF
format is tab delimited, with one line per feature.
Profile hidden Markov models (pHMMs): Mathematical representations of a set
of conserved regions in a given group of sequences. pHMMs are derived from
multiple sequence alignments and are particularly useful for searching for
used for eukaryotic viruses, and even for cellular organisms, we focus on applications
for phage data. We do not aim to recommend any particular tool over others;
instead, we want to make the reader aware of the advantages and limitations
of the most widely used available tools and to outline the main factors to consider
when selecting a tool, so that an informed decision can be made now and as more
tools are developed.
FIG 1 Proposed workflow for the analysis of phage metagenomic data. Raw metagenomic reads are
first filtered for contaminants and then assembled into contigs and binned. While in principle the
identification of phage and/or prophage contigs could be omitted if the researcher knows that the
reads are enriched for viruses, we suggest performing it as an additional way to filter contaminant
nonviral bins. Phage or prophage contigs can then be subjected to genome annotation, taxonomic
classification, microdiversity analysis, and host-association analysis. Moreover, while they are different
analyses, genome annotation and taxonomic classification are usually done conjointly. While one can
carry out microdiversity and/or phage-host analysis without prior annotation and/or taxonomic
classification, the former two are usually done before either of the latter.
(13) in three distinct data sets composed of both real and artificial viromes. Overall, the
authors recommend the use of MetaSPAdes (14) for virome assembly, which yielded
good results in all the test sets considered, followed by MEGAHIT (15). They also highlight
the presence of repeat sequences, as well as too-high and too-low coverage values,
as the main hindrances to efficient assembly of virome data. In particular, given
that MetaSPAdes performed poorly in the assembly of low-coverage viral genomes,
they suggest that the combined use of MetaSPAdes and MIRA (16) might provide
an overall better assembly than either program employed individually.
ViralAssembly (17), developed as part of the MetaviralSPAdes pipeline, is an adaptation
of MetaSPAdes for assembly of viral data. It leverages the circular genomic
sequence detection of MetaplasmidSPAdes for detecting circular viral genomes and
allows detection and assembly of terminal repeats in linear genomes. In an analysis of
18 real virome data sets, ViralAssembly was shown to outperform MetaSPAdes in terms
of contig completeness in 12 cases. Furthermore, rnaSPAdes (18), originally designed
for the assembly of transcriptomic data, was recently shown to have the capacity to
generate RNA phage contigs from metagenomic data (19).
One of the most exciting prospects in viral assembly is the development of assembly
software for long-read sequencing technologies, which might allow not only
increased viral detection but also more resolution in elucidating the microdiversity of
viral communities (17, 20). While not a specific viral assembler, metaFlye (21), based on
the generation of assembly graphs via high-frequency k-mers, has been shown to
detect and assemble viral genomes in long-read metagenomic data sets with good efficiency.
A specific version for the assembly of viral genomes, viralFlye, is currently in
development (https://github.com/Dmitry-Antipov/viralFlye). The VirION pipeline, now
in its second iteration, VirION2 (22), employs short reads to correct sequencing errors
in long-read assemblies and outperforms hybrid and short-read assemblers when
tested on double-stranded-DNA (dsDNA) viromes. It is important to mention that to
the best of our knowledge, the efficacy of said assembly method has not been tested
yet for RNA or single-stranded-DNA (ssDNA) viruses.
Of the three recommended software programs, MetaviralSPAdes
and metaFlye are available for download from GitHub, where the user can find instructions
on how to install and run them. Both require a basic knowledge of command line
the same viral species in a metagenomic sample. For example, CoCoNet (24) employs
tetranucleotide frequencies and read coverage to train a neural network which computes
the probability that two fragments come from the same genome and then clusters
contigs into bins based on said probabilities. PHAMB (25), in contrast, uses a random forest
classifier to discern the viral bins generated via deep variational autoencoders with
VAMB (26). METABAT (27), now in its second iteration, METABAT2 (28), uses as input
tetranucleotide frequencies and read coverage information, just like CoCoNet, but derives
distances between contigs from these measurements and uses them directly for the
clustering into bins. Of these three tools, only PHAMB discerns viral from nonviral bins
directly, and it therefore might be preferred for use with metagenome data, while
CoCoNet is specialized for use in highly diverse viromes. METABAT is not specialized in
viromes, but output bins from METABAT can be designated as viral via other phage and
prophage software or the taxonomic classification tools discussed in depth below.
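To make the two signals shared by these binning tools more concrete, the following minimal Python sketch computes tetranucleotide frequencies for a contig and combines them with read coverage into a toy pairwise distance. It is not the distance used by METABAT, CoCoNet, or any other tool named above; the function names, the Euclidean/log-coverage combination, and the `coverage_weight` parameter are assumptions made purely for illustration.

```python
from itertools import product
import math

BASES = "ACGT"
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]  # all 256 4-mers


def tetranucleotide_frequencies(seq):
    """Return the 256 relative tetranucleotide frequencies of a sequence."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[k] / total for k in TETRAMERS]


def contig_distance(seq_a, cov_a, seq_b, cov_b, coverage_weight=1.0):
    """Toy distance (assumption): Euclidean TNF distance plus a log-coverage gap."""
    tnf_a = tetranucleotide_frequencies(seq_a)
    tnf_b = tetranucleotide_frequencies(seq_b)
    tnf_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(tnf_a, tnf_b)))
    cov_dist = abs(math.log(cov_a) - math.log(cov_b))  # coverages must be > 0
    return tnf_dist + coverage_weight * cov_dist
```

Contigs with similar composition and similar coverage receive small distances and would tend to be clustered into the same bin; real binners use calibrated probabilistic or learned models rather than this simple sum.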
TABLE 1 Comparison of the different tools presented for identification of phage and prophage sequences

| Tool | Type^a | Input data type | Accessibility | Last update |
|---|---|---|---|---|
| VirSorter (29) | 1 | Viral or phage genomes or contigs; host genomes for prophage prediction (FASTA files) | Web-based (https://cyverse.org/) and stand-alone versions | Last release Oct 2019 |
| VirSorter2 (36) | 3 | Viral or phage genomes or contigs; host genomes for prophage prediction (FASTA files) | Web-based (https://cyverse.org/) and stand-alone versions | Last update Apr 2021 |
| VirFinder (37) | 2 | Viral or phage genomes or contigs | Stand-alone version | Last update Sept 2019 |
| DeepVirFinder (42) | 2 | Viral or phage genomes or contigs | Stand-alone version | Last update Nov 2020 |
| MARVEL (39) | 3 | Viral or phage genomes or contigs and raw reads | Stand-alone version | Last update Apr 2019 |
| PPR-Meta (35) | 2 | Phage and plasmid fragments from metagenomic assemblies | Stand-alone version | Last update Jan 2020 |
| VIBRANT (38) | 3 | Sequence derived from metagenomic assemblies | Web-based (https://cyverse.org/) and stand-alone versions | Last update May 2020 |
| VirMiner (11) | 3 | Processed raw reads | Stand-alone version | Last update May 2020 |
| Prophage Hunter (34) | 3 | Viral or phage genomes or contigs; host genomes for prophage prediction (FASTA files) | Web based (https://pro-hunter.genomics.cn/) | Last update Apr 2019 |
| PhiSpy (40) | 2 | Viral or phage genomes or contigs | Stand-alone version | Last update May 2021 |
| VIRALVERIFY (17) | 3 | Raw reads, phage or plasmid fragments from metagenomic assemblies | Stand-alone version | Last update 2020 |
| ProphET (31) | 1 | Bacterial genome sequences | Stand-alone and web based (https://cpt.tamu.edu/galaxy-pub) | Self-updates or by the user |
| PHASTER (30) | 1 | Viral or phage genomes or contigs; host genomes for prophage prediction (FASTA files) | Web based (http://phaster.ca/) | Last update Dec 2020 |
| Phigaro (32) | 1 | Metagenomic assemblies or raw genomes or contigs; host genomes for prophage prediction | Stand-alone version | Last update Aug 2020 |

^a Corresponds to the approaches described in “Main Approaches.”
derived from phage genomes. Although those methods seem to be database independent,
the ability of such prediction models to accurately identify phage sequences
from a metagenomic assembly depends on the training data set, which is often based
on a database such as the ones mentioned above. Furthermore, the training sequence
sets must include genomic sequences from phages and bacteria, as well as other
host-associated sequences such as plasmids (e.g., PPR-Meta, VirSorter2, and Prophage
Hunter) (34–36). There are a variety of sequence features that different phage predic-
Gene Calling
The process of genome annotation begins with the identification of the genes present
in a given phage. The tools most commonly and successfully applied to predict
open reading frames (ORFs) in phages are Prodigal (57), Glimmer (58), and GeneMarkS
(59, 60), even though they were initially developed for predictions in prokaryotes.
Further improvement in gene calling accuracy has been observed when the above-mentioned
tools are combined; thus, robust packages dedicated to the comprehensive
annotation of microbial genomes, such as RASTtk (61) (now part of PATRIC
[62]) and Prokka (63), make such tasks possible. Nevertheless, predicting ORFs in phages has its
own challenges compared to prediction of ORFs in prokaryotes, and some genes might
be missed by these tools and packages.
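As a point of reference for what ORF prediction involves at its simplest, the sketch below scans the six reading frames of a sequence for stretches that begin with ATG and end at the first in-frame stop codon. This is a deliberately naive illustration and not how Prodigal, Glimmer, or GeneMarkS operate: those tools rely on statistical models of coding potential. The function names and the `min_length` default are assumptions, and alternative start codons, overlapping genes, and alternative genetic codes are ignored.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_length=90):
    """Yield (strand, start, end, orf) for ATG..stop ORFs in all six frames.

    Coordinates are 0-based, end-exclusive, and refer to the strand scanned.
    """
    seq = seq.upper()
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if end - start >= min_length:
                        yield strand, start, end, strand_seq[start:end]
                    start = None  # look for the next ORF in this frame
```

A scan of this kind over-predicts short spurious ORFs and misses genes with atypical features, which is one reason dedicated gene callers are preferred in practice.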
Overlapping genes, the presence of introns/inteins in phage genes, and alternative
coding in phages add an extra layer of complexity to gene calling. As the tools
Annotation Approaches
After gene calling, different strategies for the functional annotation of ORFs can be
applied depending on the genome or genomes in question. For common phages or
phages with close homologs in public databases, query searches against sequence
databases (Table 2) using BLAST (76) or DIAMOND (77) might be sufficient. Nevertheless,
given the mosaicism and the high mutation rate of phages, sequence similarity searches
often return no significant results. In this case, it is advisable to use methods
for detection of remote homologs based on hidden Markov models (HMMs), which
leverage the use of sequence profiles and the information about conservation for each
TABLE 2 Description of protein databases used for functional annotation of predicted ORFs

| Database | Description | Type^a |
|---|---|---|
| Viral RefSeq (149) | Curated NCBI database of viral genomes, genes, and proteins. BLAST search is available online. Periodically updated. | Sequences |
| UniProtKB (150) | Curated collection of proteins and proteomes from all domains of life, derived from either direct submissions or predictions from either the European Nucleotide Archive (ENA), GenBank and the DNA Data Bank of Japan (DDBJ). BLAST search is available online. Periodically updated. | Sequences |
| Pfam (151) | Database of protein families of all domains of life, derived from curated UniProtKB entries. Periodically updated. | MSA and HMM |
| viral eggNOG (79) | Clusters of orthologous viral proteins derived from graph-based unsupervised clustering. Last updated in 2016. | HMM |
| ViPhOGs (80) | Database of clusters of orthologous viral and phage protein domains generated through the CogSoft algorithm. Last updated in 2021. | HMM and MSA |
| pVOGs (33) | Phage gene families derived from orthologous clustering of phage proteins from complete phage genomes. Last updated in 2016. | HMM and MSA |
| NCBI_CD (152) | Collection of conserved bacterial domains, compiled from six different databases. Web search is available (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). Last updated in 2020. | HMM and PSSM |
| SCOP (153, 154) | Database of structurally and evolutionary conserved proteins, organized in a hierarchical classification of families and superfamilies. Conserved domains based on different degrees of sequence identity are also available. Last updated in 2021. | Sequences |
| VOGdb (http://vogdb.org) | Database of clusters of orthologous viral proteins, derived from the combined use of the CogSoft algorithm and HH-suite on RefSeq phage and prophage genomes. The database provides both virus specific proteins (for detection of viral sequences in metagenomes), and panels of essential viral proteins. Periodically updated. | HMM and sequences |
| VPFs (155) | Database derived from the Earth Virome analysis of Paez-Espino et al. (7), consisting of groups of viral orthologous proteins. Last updated in 2016. | HMMs |
| PHROGs (81) | Remote homologous groups from proteins of complete genomes of viruses infecting bacteria or archaea. Last updated in 2021. | HMM and MSA |

^a MSA, multiple sequence alignment; PSSM, position specific scoring matrix.
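When sequence similarity searches against databases such as those in Table 2 are used for annotation, hits are typically filtered by E value before a function is transferred to an ORF. The sketch below assumes the default 12-column BLAST+ tabular output (`-outfmt 6`) and keeps, for each query, the highest-bitscore hit whose E value does not exceed a cutoff. The `best_hits` name, the dictionary layout, and the `1e-5` cutoff are illustrative assumptions, not recommendations from this review.

```python
import csv

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=1e-5):
    """Return {query: hit}, keeping the top-bitscore hit with evalue <= max_evalue."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST6_COLUMNS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        bitscore = float(row["bitscore"])
        if evalue > max_evalue:
            continue  # not significant under the chosen (assumed) cutoff
        current = best.get(row["qseqid"])
        if current is None or bitscore > current["bitscore"]:
            best[row["qseqid"]] = {
                "subject": row["sseqid"],
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best
```

ORFs absent from the returned dictionary are the ones for which a profile-based (HMM) search would be the natural next step.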
Beyond the identification of genes and their functions, annotating a phage genome
encompasses the annotation of RNAs, tandem repeats, and diversity-generating
retroelements, among others. For example, RASTtk allows the annotation of RNAs and
repeat regions. Although it was conceived to annotate bacterial and archaeal
genomes, RASTtk has been a handy tool for phage annotation in the past (86).
Moreover, while VIGA relies only on Prodigal for gene calling, its strength is that it
includes several modules or routines to comprehensively annotate a viral genome. In
addition to gene calling and functional annotation, VIGA allows detection of the contig
shape (linear or circular) with LASTZ (87), prediction of ribosomal genes with INFERNAL
(88), and prediction of tRNA and tmRNA sequences with ARAGORN (89) and runs
PILER-CR (90), Tandem Repeats Finder (TRF) (91), and Inverted Repeats Finder (IRF) (92).
Machine learning-based annotation approaches. As mentioned before, most
phage-borne genes have no close homologs in reference databases and their functional
annotation is usually lacking. To solve this, several groups have used deep neural
networks to recognize phage virion proteins, mainly because these can be considered
a hallmark of phages and play an important role in phage-host interaction.
DeepCapTail (93) classifies an ORF as capsid or tail; VirionFinder makes a binary classification
of phage virion protein or not (94); DeephageTP focuses on Portal, TerL, and
TerS proteins (95); and PhANNs classifies proteins into 10 different classes of structural
proteins (96). Ultimately, specific research and data requirements will lead to the
selection of a specific classifier among those mentioned.
TAXONOMIC CLASSIFICATION
The process of taxonomic classification of viral metagenomic samples is more challenging
than that of cellular organisms. The shared common ancestry of the latter
allows the existence of universal marker genes, most notably 16S and 18S rRNA genes,
which can provide a reasonable representation of their evolutionary origin and divergence
(1, 97). In contrast, this approach is not fully applicable to viruses, since they lack
any equivalent set of universally conserved genes on which to construct a phylogeny
(98, 99).
TABLE 3 Comparison of the tools for taxonomic classification of phage metagenomic data

| Tool | Approach | Accessibility | Recently updated^a |
|---|---|---|---|
| VIRIDIC (104) | BLAST followed by hierarchical clustering | Stand-alone version | Yes |
| VICTOR (105) | BLAST-derived intergenomic distances | Online web service | No |
| VipTree (106) | tBLASTx | Online web service | Yes |
| Dougan and Quake method (107) | tBLASTx and 4-mer distances | Needs to be implemented by user | NA |
| VPF-Class (108) | HMMs against different databases | Stand-alone version | Yes |
| GRAViTy (110, 111) | Presence/absence and synteny of orthologous groups determined via HMMs | Stand-alone version | Yes |
| VIRify (https://GitHub.com/EBI-Metagenomics/emg-viral-pipeline) | HMMs from ViPhOGs | Online web service | Yes |
| Classiphage (97) | HMMs refined by BLASTp | HMM database available for download, distances must be self implemented | Yes |
| vConTACT (110) | Distances derived from Markov-based phage protein clusters | Stand-alone version | Yes |

^a “Yes” indicates that the tool or database has been updated or created in the past 2 years. NA, not applicable.
Viral classification is limited by two factors. First, current viral genome databases do
not reflect the actual diversity of these elements in the biosphere (6). Second, viral taxonomy,
as defined by the International Committee on Taxonomy of Viruses (ICTV), is
currently in a state of flux (100), with changes being made and/or proposed to, among
others, the number of taxonomic ranks to consider (101), the criteria for defining each
specific rank (6), and the validity of traditional viral clades, not based on molecular criteria
(102). However, several tools to classify viruses assembled from metagenomes
have emerged in recent years (103), most of them based on sequence similarity by
using either BLAST or HMMs. In this section, we briefly present some tools for classifying
viruses from metagenomic data. Table 3 compares the main characteristics of these
approaches. Other methods not mentioned here can be found in a recent comprehensive
review by Nooij et al. (10).
the tool and generates a phylogenetic tree based on said similarities. The software is
limited to constructing trees with at most 200 query sequences.
The usefulness of these three tools is limited by the degree of completeness of the
query viral contigs, the prior knowledge the user might have regarding their taxonomy,
and the database employed. VICTOR, given its use of nucleotide distances, is better
suited for determining close evolutionary relationships, while VIRIDIC and VipTree,
given their ability to use proteomic information, are better for more divergent ones. In
the extreme case where little or nothing is known of the taxonomy of the queried
viruses, the use of reference sequences automatically provided by VipTree would be
preferred to the manual input of reference sequences from VICTOR or VIRIDIC.
In fact, tBLASTx has been shown to be able to generate large-scale viral clusterings when
combined with k-mer information: Dougan and Quake (107) employed a combination of
tBLASTx and 4-mers to derive an entropy-based genomic distance for classifying a set of
5,817 viral genomes. These distances were then converted into a multidimensional representation
through t-distributed stochastic neighbor embedding (t-SNE) and clustered to produce
a distance dendrogram. Although it was developed with complete viral genomes, this method could
be adapted for metagenomic data sets by an experienced user and, in contrast to the three
aforementioned tools, is known to work with a large set of viral genomes.
Markov-Based Approaches
Other approaches use HMMs and/or Markov clustering (MCL) for classification (either
with or without BLAST). For example, VPF-Class (108) runs an HMM
search against three different data sets: the NCBI viral sequences, the prophages data
set from Roux et al. (29), and the Global Ocean Virome (109) to determine viral classification
and host prediction.
In contrast, GRAViTy (110, 111) derives protein profile HMMs from BLASTp-based
orthologous protein clusters. These models are then used to scan complete viral genomes
to derive information of their gene content and orientation (i.e., synteny), from which pairwise
distances are computed. GRAViTy has been applied for classification of both complete
eukaryotic (110) and prokaryotic (111) viruses. As mentioned in the section above,
the use of presence/absence information of ViPhOGs (formerly VDOGs) in a given set of viral
genomes, obtained via scanning of genomic proteins against ViPhOG HMMs, can be
Main Approaches
The number of sequence-based tools aimed at identifying which bacteria act
as the hosts of a given phage in a metagenomic sample has increased recently (112, 114–119).
Some of these tools use prediction signals from phage-host interactions that can
be categorized as (i) homology dependent, such as nucleotide similarity BLAST scores
and CRISPR spacer matches (111), as in SpacePHARER (119), CrisprOpenDB (120) and
TABLE 4 Model performance metrics per phage-host prediction tool obtained using their own benchmark data sets, and description of each
prediction methoda
Phage-host prediction
tool Prediction methodb Taxonomic level AUC F1 score Accuracy
SpacePHARER (119) Identifies CRISPR spacers in bacterial genomes, Phylum 0.87
translates those into protein motifs, and
performs a protein alignment against possible
protospacer motifs from phage sequences; the
prediction is based on a probability score which
selects the host with the higher likelihood.
Class 0.84
Order 0.8
Family 0.79
Genus/species 0.77
CrisprOpenDB pipeline Looks for CRISPR spacer matches (alignments) and Genus 0.57
(120) applies some filters (no. of gaps, position of the
spacer in the bacterial genome, and taxonomic
accordance between predictions) to make
predictions at the lowest taxonomic levels
possible.
WIsH (114) Computes the maximum likelihood of phage P Genus 0.35
infecting host H based on the training of a
homogeneous Markov model.
Family 0.43c
Order 0.48c
Class 0.6c
Phylum 0.75c
VirHostMatcher (112) Calculates different dissimilarity/similarity measures Genus 0.33
based on genomic composition to predict the
host of a given phage. The least dissimilar or the
most similar measure indicates a likely positive
interaction.
Family 0.48
Order 0.54
Class 0.67
Phylum 0.75
HostPhinder (115) Predicts the host of a phage as the host of the most Species 0.74
VirHostMatcher-Net Uses homology-dependent and -independent Species 0.43
(118) features to compare between virus and hosts:
CRISPR matches, homology scores from BLAST,
an alignment-free similarity measure, WIsH
maximum likelihood, comparison between
viruses that infect the same host (SV1) and
viruses infecting different hosts (SV2). The
prediction is based on a Markov random field
model.
Genus 0.59
Family 0.7
Order 0.78
Class 0.83
Phylum 0.86
PHISDetector (123) Uses 6-mer frequencies, codon usage, prophage Genus or species 0.93 0.88 0.88
detection, CRISPR matches and protein-protein
interactions information to predict interactions.
The model trained for the prediction is an
ensemble of different machine learning models
(random forest, decision tree, logistic regression,
SVMs with different kernel settings, Gaussian and
Bernoulli naive Bayes).
Host Taxon Predictor Uses nucleotide sequence information such as Domain 0.98 0.93
(HTP) (116) molecule type, mono/dinucleotide absolute
frequencies and di/trinucleotide relative
frequencies to predict if a virus is phage or
nonphage. They have implemented 4 types of
classifiers: logistic regression, quadratic
discriminant analysis, support vector machines,
and k-nearest neighbors.
aNote that the performance metrics are not directly comparable, as different tools were benchmarked with data sets which might or might not be the same.
bSVMs, support vector machines.
cContig-based prediction.
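Several of the tools in Table 4 build on CRISPR spacer matches as a homology-dependent signal. The following minimal sketch illustrates the underlying idea only: each candidate host is scored by the number of its spacers that align, without gaps and within a small mismatch budget, to either strand of a phage sequence. It does not reproduce SpacePHARER, CrisprOpenDB, or any other listed tool, which apply protein-level alignment, additional filters, or statistical scoring; the function names, the input layout (`host_spacers` as a dictionary), and the `max_mismatches` default are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def spacer_matches(spacer, genome, max_mismatches=1):
    """True if the spacer aligns ungapped to the genome within the mismatch budget."""
    width = len(spacer)
    for start in range(len(genome) - width + 1):
        window = genome[start:start + width]
        mismatches = sum(1 for a, b in zip(spacer, window) if a != b)
        if mismatches <= max_mismatches:
            return True
    return False


def rank_hosts(phage_genome, host_spacers, max_mismatches=1):
    """Rank candidate hosts by how many of their spacers hit either phage strand."""
    phage = phage_genome.upper()
    strands = (phage, reverse_complement(phage))
    scores = {}
    for host, spacers in host_spacers.items():
        hits = sum(
            1 for spacer in spacers
            if any(spacer_matches(spacer.upper(), s, max_mismatches) for s in strands)
        )
        if hits:
            scores[host] = hits  # hosts without any matching spacer are omitted
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The quadratic window scan is only practical for small inputs; at scale, the same comparison is usually delegated to an alignment tool.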
phage sequence. While these tools have shown good performance, with an F1 score of
>0.8 (128), it is important to note that they were trained on limited data sets, biased
to culturable phage-host pairs, whose interactions are very well described. Therefore,
these tools may not be very accurate for use with novel or distant phage sequences.
MICRODIVERSITY
Microbial populations experience constant genetic changes. Besides being important
drivers of bacterial evolution, given their capability to promote horizontal gene
transfer, phages undergo several changes within their genomes, leading to a wide collection
of mutations that are subject to natural selection (129). Moreover, the high substitution
rates shown by some viruses in a short time period increase the number of
significant changes within phage genomes, allowing speciation events and, therefore,
the diversification of the ecosystem (130). It is known that the intrapopulation genetic
variation, or microdiversity, of viruses has an impact in shaping host dynamics.
Furthermore, changes related to an increase in the infection rates and host range
might produce bacterial changes related to resistance against viral infection, generating
antagonistic evolution dynamics (131).
Antagonistic changes have been reported in multiple environments, such as the
human gut (130–132) and marine ecosystems (133). These have been strongly associated
with the maintenance of ecological microbial equilibrium within the human gut
but also with its diversification. Given that most alterations within phage genomes are
related to adaptations to new host strains or the ability to infect strains that are resistant
to the infection, an increase in phage populations with new genomic features leads
to the infection of bacterial strains that are usually found in high abundance. This
infection produces a decrease in bacterial population but also a selection pressure that
allows the host to acquire resistance. This pattern leads to the regulation of bacterial
populations and an increase in bacterial adaptation to their new predators (134).
Analogous to this process, elements such as the cyanophages undergo mutations
depending on how optimal their host is. When infecting optimal hosts, phages accumulate
few mutations within their genomes. In the opposite case, in which the host is
suboptimal, the number of mutations increases and so does the diversity of viral populations
(135).
Main Approaches
Despite the importance of the dynamics driven by intraspecific variation on differ-
ent environments, microdiversity analysis is considered a recent approach in virome
analysis. Therefore, the development of specific tools to characterize viral microdiver-
sity is still an ongoing process, with few tools already available. Consequently, studies
regarding single nucleotide polymorphism (SNP) calling, calculation of base substitu-
tions, and virome stability within individuals have had to rely on general software and
frameworks for the discovery of variants from NGS data. As proposed by DePristo et al.
(137), a framework for variant calling and genotyping must include at least 3 phases: (i)
manipulation of NGS data, consisting of mapping the reads against the genomes or
contigs of interest, followed by a series of alignment refinement and base quality reca-
libration; (ii) a variant discovery step, in which SNPs, structural variations (SV), and
indels are reported, followed by a genotyping process; and (iii) association analysis and
computation of microdiversity metrics. When applied to virome data, the variant calling
process must rely on a well-curated set of contigs, which are expected to have a
minimum sequencing depth of 5× (134, 138) or 10× (130, 133, 139). Furthermore, for
a SNP to be considered valid, it must be supported by at least 4 reads (140).
Likewise, a valid SNP will also depend on the quality call threshold, defined as a Phred-
based value representing the confidence of the presence of a specific variation in a
given site (141). However, to the best of our knowledge, there is no universal reference
value for this metric, implying that the threshold is chosen based on specific character-
istics for each analysis.
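The filtering criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CandidateSNP` record, the function names, and the default quality cutoff of 30 are hypothetical and not taken from any of the cited tools, and, as noted above, no universal quality threshold exists. The depth and supporting-read defaults follow the values quoted in the text (10 and 4).

```python
from dataclasses import dataclass


def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred-scaled quality Q into an error probability, P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


@dataclass
class CandidateSNP:
    """Hypothetical record for one candidate variant site (illustrative only)."""
    contig: str
    position: int    # 1-based position on the contig
    depth: int       # total reads covering the site
    alt_reads: int   # reads supporting the variant allele
    quality: float   # Phred-scaled variant quality


def passes_filters(snp: CandidateSNP,
                   min_depth: int = 10,
                   min_alt_reads: int = 4,
                   min_quality: float = 30.0) -> bool:
    """Keep a SNP only if it meets the depth, read-support, and quality cutoffs.

    min_quality is an assumed, analysis-specific value, not a reference standard.
    """
    return (snp.depth >= min_depth
            and snp.alt_reads >= min_alt_reads
            and snp.quality >= min_quality)


candidates = [
    CandidateSNP("contig_1", 120, depth=25, alt_reads=9, quality=45.0),
    CandidateSNP("contig_1", 310, depth=7, alt_reads=5, quality=50.0),   # too shallow at depth 10
    CandidateSNP("contig_2", 58, depth=40, alt_reads=3, quality=60.0),   # too few supporting reads
]
kept = [snp for snp in candidates if passes_filters(snp)]
```

Lowering `min_depth` to 5 corresponds to the more permissive depth cutoff mentioned above; the quality cutoff should be chosen per analysis.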
To draw biological inferences from sequence variant data, multiple metrics
can be applied. To measure the degree of polymorphism in a given population,
the nucleotide diversity (π) metric, understood as the average number of differences
found within a specific region of DNA from two different taxa, can be implemented. As
explained by Schloissnig et al. (140), for estimating π between a pair of metagenomic
samples, a variation of the original formula proposed by Begun et al. (142) is used, in
order to allow for more than two alleles per site. In this case,
the chance of choosing different alleles at a specific, randomly chosen position in
the genome is computed. The fixation index (Fst), in turn, is used for
measuring population differentiation with no changes to the original formula. For
analyzing the effect of natural selection on the virome population, the ratio
between nonsynonymous and synonymous substitutions (pN/pS) is calculated, assuming that all
mutations have the same probability of occurring across the genome (52).
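As a minimal sketch of these metrics, the following Python code computes a per-site diversity as the probability that two reads drawn without replacement at a site carry different alleles, averages it over sites, and computes a pN/pS ratio from substitution and site counts. The function names and the input layout (a mapping from allele to read count per site) are assumptions made for illustration. The estimator shown is a simple within-sample version that permits more than two alleles per site; it is not claimed to reproduce the exact between-sample formula of Schloissnig et al. (140).

```python
from typing import Mapping, Sequence


def site_diversity(allele_counts: Mapping[str, int]) -> float:
    """Probability that two reads drawn without replacement differ at this site."""
    n = sum(allele_counts.values())
    if n < 2:
        return 0.0  # diversity is undefined with fewer than two reads; report 0
    same_pairs = sum(c * (c - 1) for c in allele_counts.values())
    return 1.0 - same_pairs / (n * (n - 1))


def nucleotide_diversity(sites: Sequence[Mapping[str, int]]) -> float:
    """Average per-site diversity over all positions considered."""
    if not sites:
        return 0.0
    return sum(site_diversity(s) for s in sites) / len(sites)


def pn_ps(nonsyn_subs: int, nonsyn_sites: float,
          syn_subs: int, syn_sites: float) -> float:
    """Ratio of nonsynonymous to synonymous substitutions, each normalized per site."""
    if syn_subs == 0 or syn_sites == 0 or nonsyn_sites == 0:
        raise ValueError("pN/pS is undefined without synonymous substitutions or sites")
    pn = nonsyn_subs / nonsyn_sites
    ps = syn_subs / syn_sites
    return pn / ps


# Two positions: one monomorphic, one with two alleles at equal read counts.
example_sites = [{"A": 10}, {"A": 5, "G": 5}]
pi = nucleotide_diversity(example_sites)
```

A pN/pS value below 1 is conventionally read as a signal of purifying selection and a value above 1 as a signal of positive selection, subject to the equal-mutation-probability assumption stated above.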
Even though few specific pipelines for viral microdiversity are available at the
moment, those available allow performing the multiple steps involved in a normal
analysis in a straightforward and user-friendly way. For example, MetaPop is described
as a tool for the manipulation and visualization of micro- and macrodiversity-related
data. In fact, it can be run completely via a single command. Furthermore, it imple-
ments multiple metrics, including the ones described above, but also Watterson’s theta
nucleotide diversity, Tajima’s D, and codon usage biases (143). Among the pipelines
designed exclusively for viral sequences, DiversiTools (http://josephhughes.github.io/DiversiTools/),
which was initially developed for the analysis of eukaryotic viruses, can
also be used for virome data in general, by applying pN/pS metrics for comparisons
between samples (136). In contrast to the previous tools, the inStrain pipeline is consid-
ered a suite of programs for analyzing metagenomic data which can be easily applied to
CONCLUDING REMARKS
We have presented a brief overview of the tools used for the principal analyses of
phage metagenomic data, namely, assembly and detection of phage and prophage
sequences, annotation, taxonomic classification, host identification, and microdiversity.
As we have shown, the choice of a specific tool must consider not only the approach
employed but also accessibility and the data the researcher is working with, as well as
the degree of support from the developers of the software. As more software is inte-
grated into metagenomic pipelines, the use of these tools should become easier for
the user, regardless of their computational expertise. Moreover, such integration will
allow the simultaneous use of multiple tools, employing different approaches, to
address specific metagenomic problems (e.g., taxonomic classification) in a given
phage data set, allowing the cross-examination of their outputs to determine which
findings are sustained between different tools.
Of notable importance is the growth over recent years in the total number of machine
learning-based tools for different virome procedures. Given the reliance of these tools on
their training data sets, the constant increase in virome data, both simulated and real,
should provide us with better benchmarks of their performance and therefore lead to an
overall improvement of their ability to characterize metagenomic data. Parallel improve-
ments might also be expected from other database-reliant but non-machine learning
tools, such as BLAST- or HMM-based annotation or classification software.
It is also relevant to acknowledge the scarcity of tools and knowledge regarding
ssDNA and RNA phages. Despite the fact that both types of phages have been shown
to be highly abundant in a variety of environments, limitations in experimental proto-
cols for viral isolation and lack of sufficient ssDNA and RNA genomes in databases
have led to a reduced number of isolated sequences (19, 147, 148). The recent work of
Callanan et al. in identification and assembly of ssRNA viruses (148), as well as improve-
ments in experimental techniques and sequencing technologies, provides an exciting
outlook for tapping the diversity of these less known phage clades.
REFERENCES
1. Garza DR, Dutilh BE. 2015. From cultured to uncultured genome sequences: metagenomics and modeling microbial ecosystems. Cell Mol Life Sci 72:4287–4308. https://doi.org/10.1007/s00018-015-2004-1.
2. Fancello L, Raoult D, Desnues C. 2012. Computational tools for viral metagenomics and their application in clinical research. Virology 434:162–174. https://doi.org/10.1016/j.virol.2012.09.025.
3. Simmonds P, Adams MJ, Benkő M, Breitbart M, Brister JR, Carstens EB, Davison AJ, Delwart E, Gorbalenya AE, Harrach B, Hull R, King AMQ, Koonin EV, Krupovic M, Kuhn JH, Lefkowitz EJ, Nibert ML, Orton R, Roossinck MJ, Sabanadzovic S, Sullivan MB, Suttle CA, Tesh RB, van der Vlugt RA, Varsani A, Zerbini FM. 2017. Consensus statement: virus taxonomy in the age of metagenomics. Nat Rev Microbiol 15:161–168. https://doi.org/10.1038/nrmicro.2016.177.
4. Edwards RA, Rohwer F. 2005. Viral metagenomics. Nat Rev Microbiol 3:504–510. https://doi.org/10.1038/nrmicro1163.
5. Cook R, Brown N, Redgwell T, Rihtman B, Barnes M, Clokie M, Stekel DJ, Hobman J, Jones MA, Millard A. 2021. INfrastructure for a PHAge REference Database: identification of large-scale biases in the current collection of cultured phage genomes. Phage (New Rochelle) 2:214–223. https://doi.org/10.1089/phage.2021.0007.
6. Dion MB, Oechslin F, Moineau S. 2020. Phage diversity, genomics and phylogeny. Nat Rev Microbiol 18:125–138. https://doi.org/10.1038/s41579-019-0311-5.
7. Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, Mikhailova N, Rubin E, Ivanova NN, Kyrpides NC. 2016. Uncovering Earth’s virome. Nature 536:425–430. https://doi.org/10.1038/nature19094.
8. Cantalupo PG, Pipas JM. 2019. Detecting viral sequences in NGS data. Curr Opin Virol 39:41–48. https://doi.org/10.1016/j.coviro.2019.07.010.
9. Roux S, Adriaenssens EM, Dutilh BE, Koonin EV, Kropinski AM, Krupovic M, Kuhn JH, Lavigne R, Brister JR, Varsani A, Amid C, Aziz RK, Bordenstein SR, Bork P, Breitbart M, Cochrane GR, Daly RA, Desnues C, Duhaime MB, Emerson JB, Enault F, Fuhrman JA, Hingamp P, Hugenholtz P, Hurwitz BL, Ivanova NN, Labonté JM, Lee K-B, Malmstrom RR, Martinez-Garcia M, Mizrachi IK, Ogata H, Páez-Espino D, Petit M-A, Putonti C, Rattei T, Reyes A, Rodriguez-Valera F, Rosario K, Schriml L, Schulz F, Steward GF, Sullivan MB, Sunagawa S, Suttle CA, Temperton B, Tringe SG, Thurber RV, Webster NS, Whiteson KL, et al. 2019. Minimum information about an uncultivated virus genome (MIUViG). Nat Biotechnol 37:29–37. https://doi.org/10.1038/nbt.4306.
10. Nooij S, Schmitz D, Vennema H, Kroneman A, Koopmans MP. 2018. Overview of virus metagenomic classification methods and their biological applications. Front Microbiol 9:749. https://doi.org/10.3389/fmicb.2018.00749.
11. Zheng T, Li J, Ni Y, Kang K, Misiakou MA, Imamovic L, Chow BK, Rode AA, Bytzer P, Sommer M, Panagiotou G. 2019. Mining, analyzing, and integrating viral signals from metagenomic data. Microbiome 7:42. https://doi.org/10.1186/s40168-019-0657-y.
12. Labonté JM, Suttle CA. 2013. Previously unknown and highly divergent ssDNA viruses populate the oceans. ISME J 7:2169–2177. https://doi.org/10.1038/ismej.2013.110.
13. Sutton TD, Clooney AG, Ryan FJ, Ross RP, Hill C. 2019. Choice of assembly software has a critical impact on virome characterisation. Microbiome 7:12. https://doi.org/10.1186/s40168-019-0626-5.
14. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. 2017. metaSPAdes: a new versatile metagenomic assembler. Genome Res 27:824–834. https://doi.org/10.1101/gr.213959.116.
15. Li D, Liu CM, Luo R, Sadakane K, Lam TW. 2015. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31:1674–1676. https://doi.org/10.1093/bioinformatics/btv033.
16. García-López R, Vázquez-Castellanos JF, Moya A. 2015. Fragmentation and coverage variation in viral metagenome assemblies, and their effect in diversity calculations. Front Bioeng Biotechnol 3:141. https://doi.org/10.3389/fbioe.2015.00141.
17. Antipov D, Raiko M, Lapidus A, Pevzner PA. 2020. Metaviral SPAdes: assembly of viruses from metagenomic data. Bioinformatics 36:4126–4129. https://doi.org/10.1093/bioinformatics/btaa490.
18. Bushmanova E, Antipov D, Lapidus A, Prjibelski AD. 2019. rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. GigaScience 8:giz100. https://doi.org/10.1093/gigascience/giz100.
19. Callanan J, Stockdale SR, Shkoporov A, Draper LA, Ross RP, Hill C. 2020. Expansion of known ssRNA phage genomes: from tens to over a thousand. Sci Adv 6:eaay5981. https://doi.org/10.1126/sciadv.aay5981.
20. Warwick-Dugdale J, Solonenko N, Moore K, Chittick L, Gregory AC, Allen MJ, Sullivan MB, Temperton B. 2019. Long-read viral metagenomics captures abundant and microdiverse viral populations and their niche-defining genomic islands. PeerJ 7:e6800. https://doi.org/10.7717/peerj.6800.
21. Kolmogorov M, Bickhart DM, Behsaz B, Gurevich A, Rayko M, Shin SB, Kuhn K, Yuan J, Polevikov E, Smith TP, Pevzner PA. 2020. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods 17:1103–1110. https://doi.org/10.1038/s41592-020-00971-x.
31. Reis-Cunha JL, Bartholomeu DC, Manson AL, Earl AM, Cerqueira GC. 2019. ProphET, prophage estimation tool: a stand-alone prophage sequence prediction tool with self-updating reference database. PLoS One 14:e0223364. https://doi.org/10.1371/journal.pone.0223364.
32. Starikova EV, Tikhonova PO, Prianichnikov NA, Rands CM, Zdobnov EM, Ilina EN, Govorun VM. 2020. Phigaro: high-throughput prophage sequence annotation. Bioinformatics 36:3882–3884. https://doi.org/10.1093/bioinformatics/btaa250.
33. Grazziotin AL, Koonin EV, Kristensen DM. 2017. Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Res 45:D491–D498. https://doi.org/10.1093/nar/gkw975.
34. Song W, Sun H-X, Zhang C, Cheng L, Peng Y, Deng Z, Wang D, Wang Y, Hu M, Liu W, Yang H, Shen Y, Li J, You L, Xiao M. 2019. Prophage Hunter: an integrative hunting tool for active prophages. Nucleic Acids Res 47:W74–W80. https://doi.org/10.1093/nar/gkz380.
35. Fang Z, Tan J, Wu S, Li M, Xu C, Xie Z, Zhu H. 2019. PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning. GigaScience 8:giz066. https://doi.org/10.1093/gigascience/giz066.
36. Guo J, Bolduc B, Zayed AA, Varsani A, Dominguez-Huerta G, Delmont TO, Pratama AA, Gazitúa MC, Vik D, Sullivan MB, Roux S. 2021. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9:37. https://doi.org/10.1186/s40168-020-00990-y.
37. Ren J, Ahlgren NA, Lu YY, Fuhrman JA, Sun F. 2017. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5:69. https://doi.org/10.1186/s40168-017-0283-5.
38. Kieft K, Zhou Z, Anantharaman K. 2020. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome 8:90. https://doi.org/10.1186/s40168-020-00867-0.
39. Amgarten D, Braga LP, da Silva AM, Setubal JC. 2018. MARVEL, a tool for prediction of bacteriophage sequences in metagenomic bins. Front Genet 9:304. https://doi.org/10.3389/fgene.2018.00304.
40. Akhter S, Aziz RK, Edwards RA. 2012. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res 40:e126. https://doi.org/10.1093/nar/gks406.
41. Ponsero AJ, Hurwitz BL. 2019. The promises and pitfalls of machine learning for detecting viruses in aquatic metagenomes. Front Microbiol 10:806. https://doi.org/10.3389/fmicb.2019.00806.
42. Ren J, Song K, Deng C, Ahlgren NA, Fuhrman JA, Li Y, Xie X, Poplin R, Sun F. 2020. Identifying viruses from metagenomic data using deep learning.
51. López-Leal G, Camelo-Valera LC, Hurtado-Ramírez JM, Verleyen J, Castillo-Ramírez S, Reyes-Muñoz A. 2021. Mining of thousands of prokaryotic genomes reveals high abundance of prophage signals. bioRxiv https://doi.org/10.1101/2021.10.20.465230.
52. Gregory AC, Zayed AA, Conceição-Neto N, Temperton B, Bolduc B, Alberti A, Ardyna M, Arkhipova K, Carmichael M, Cruaud C, Dimier C, Domínguez-Huerta G, Ferland J, Kandels S, Liu Y, Marec C, Pesant S, Picheral M, Pisarev S, Poulain J, Tremblay J, Vik D, Coordinators T, Acinas SG, Babin M, Bork P, Boss E, Bowler C, Cochrane G, Vargas C, Follows M, Gorsky G, Grimsley N, Guidi L, Hingamp P, Iudicone D, Jaillon O, Kandels-Lewis S, Karp-Boss L, Karsenti E, Not F, Ogata H, Pesant S, Poulton N, Raes J, Sardet C, Speich S, Stemmann L, Sullivan MB, Sunagawa S, Wincker P, Babin M, Tara Oceans Coordinators, et al. 2019. Marine DNA viral macro- and microdiversity from pole to pole. Cell 177:1109–1123. https://doi.org/10.1016/j.cell.2019.03.040.
53. Camarillo-Guerrero LF, Almeida A, Rangel-Pineros G, Finn RD, Lawley TD. 2021. Massive expansion of human gut bacteriophage diversity. Cell 184:1098–1109. https://doi.org/10.1016/j.cell.2021.01.029.
54. Gregory AC, Zablocki O, Howell A, Bolduc B, Sullivan MB. 2019. The human gut virome database. bioRxiv https://doi.org/10.1101/655910.
55. Nayfach S, Páez-Espino D, Call L, Low SJ, Sberro H, Ivanova NN, Proal AD, Fischbach MA, Bhatt AS, Hugenholtz P, Kyrpides NC. 2021. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat Microbiol 6:960–970. https://doi.org/10.1038/s41564-021-00928-6.
56. Ackermann HW. 1998. Tailed bacteriophages: the order Caudovirales. Adv Virus Res 51:135–201. https://doi.org/10.1016/s0065-3527(08)60785-x.
57. Hyatt D, Chen GL, LoCascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform 11:119. https://doi.org/10.1186/1471-2105-11-119.
58. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL. 1999. Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27:4636–4641. https://doi.org/10.1093/nar/27.23.4636.
59. Besemer J, Lomsadze A, Borodovsky M. 2001. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29:2607–2618. https://doi.org/10.1093/nar/29.12.2607.
60. Borodovsky M, Lomsadze A. 2014. Gene identification in prokaryotic genomes, phages, metagenomes, and EST sequences with GeneMarkS suite. Curr Protoc Microbiol 32:Unit 1E.7. https://doi.org/10.1002/9780471729259.mc01e07s32.
61. Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ, Olson R,
69. Yutin N, Benler S, Shmakov SA, Wolf YI, Tolstoy I, Rayko M, Antipov D, Pevzner PA, Koonin EV. 2021. Analysis of metagenome-assembled viral genomes from the human gut reveals diverse putative CrAss-like phages with unique genomic features. Nat Commun 12:1044. https://doi.org/10.1038/s41467-021-21350-w.
70. Shapiro JW, Putonti C. 2021. Rephine.r: a pipeline for correcting gene calls and clusters to improve phage pangenomes and phylogenies. PeerJ 9:e11950. https://doi.org/10.7717/peerj.11950.
71. Devoto AE, Santini JM, Olm MR, Anantharaman K, Munk P, Tung J, Archie EA, Turnbaugh PJ, Seed KD, Blekhman R, Aarestrup FM, Thomas BC, Banfield JF. 2019. Megaphages infect Prevotella and variants are widespread in gut microbiomes. Nat Microbiol 4:693–700. https://doi.org/10.1038/s41564-018-0338-9.
72. Crisci MA, Chen LX, Devoto AE, Borges AL, Bordin N, Sachdeva R, Tett A, Sharrar AM, Segata N, Debenedetti F, Bailey M, Burt R, Wood RM, Rowden LJ, Corsini PM, van Winden S, Holmes MA, Lei S, Banfield JF, Santini JM. 2021. Closely related Lak megaphages replicate in the microbiomes of diverse animals. iScience 24:102875. https://doi.org/10.1016/j.isci.2021.102875.
73. Dutilh BE, Jurgelenaite R, Szklarczyk R, van Hijum SA, Harhangi HR, Schmid M, de Wild B, Françoijs KJ, Stunnenberg HG, Strous M, Jetten MS, Op den Camp HJ, Huynen MA. 2011. FACIL: fast and accurate genetic code inference and logo. Bioinformatics 27:1929–1933. https://doi.org/10.1093/bioinformatics/btr316.
74. Salisbury A, Tsourkas PK. 2019. A method for improving the accuracy and efficiency of bacteriophage genome annotation. Int J Mol Sci 20:3391. https://doi.org/10.3390/ijms20143391.
75. Lazeroff M, Ryder G, Harris S, Tsourkas PK. 2021. Phage Commander, a software tool for rapid annotation of bacteriophage genomes using multiple programs. Phage 2:204–213. https://doi.org/10.1089/phage.2020.0044.
76. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: architecture and applications. BMC Bioinform 10:421. https://doi.org/10.1186/1471-2105-10-421.
77. Buchfink B, Xie C, Huson DH. 2015. Fast and sensitive protein alignment using DIAMOND. Nat Methods 12:59–60. https://doi.org/10.1038/nmeth.3176.
78. Steinegger M, Meier M, Mirdita M, Vöhringer H, Haunsberger SJ, Söding J. 2019. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20:473. https://doi.org/10.1186/s12859-019-3019-7.
79. Huerta-Cepas J, Szklarczyk D, Heller D, Hernández-Plaza A, Forslund SK, Cook H, Mende DR, Letunic I, Rattei T, Jensen LJ, von Mering C, Bork P.
88. Nawrocki EP, Eddy SR. 2013. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29:2933–2935. https://doi.org/10.1093/bioinformatics/btt509.
89. Laslett D, Canback B. 2004. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res 32:11–16. https://doi.org/10.1093/nar/gkh152.
90. Edgar RC. 2007. PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinform 8:18. https://doi.org/10.1186/1471-2105-8-18.
91. Benson G. 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27:573–580. https://doi.org/10.1093/nar/27.2.573.
92. Warburton PE, Giordano J, Cheung F, Gelfand Y, Benson G. 2004. Inverted repeat structure of the human genome: the X-chromosome contains a preponderance of large, highly homologous inverted repeats that contain testes genes. Genome Res 14:1861–1869. https://doi.org/10.1101/gr.2542904.
93. Abid D, Zhang L. 2018. DeepCapTail: a deep learning framework to predict capsid and tail proteins of phage genomes. bioRxiv. https://doi.org/10.1101/477885.
94. Fang Z, Zhou H. 2021. VirionFinder: identification of complete and partial prokaryote virus virion protein from virome data using the sequence and biochemical properties of amino acids. Front Microbiol 12:615711. https://doi.org/10.3389/fmicb.2021.615711.
95. Chu Y, Guo S, Cui D, Fu X, Ma Y. 2021. DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data. Res Square. https://doi.org/10.21203/rs.3.rs-21641/v2.
96. Cantu VA, Salamon P, Seguritan V, Redfield J, Salamon D, Edwards RA, Segall AM. 2020. PhANNs, a fast and accurate tool and web server to classify phage structural proteins. PLoS Comput Biol 16:e1007845. https://doi.org/10.1371/journal.pcbi.1007845.
97. Chibani CM, Farr A, Klama S, Dietrich S, Liesegang H. 2019. Classifying the unclassified: a phage classification method. Viruses 11:195. https://doi.org/10.3390/v11020195.
98. Mande SS, Mohammed MH, Ghosh TS. 2012. Classification of metagenomic sequences: methods and challenges. Brief Bioinform 13:669–681. https://doi.org/10.1093/bib/bbs054.
99. Iranzo J, Krupovic M, Koonin EV. 2016. The double-stranded DNA virosphere as a modular hierarchical network of gene sharing. mBio 7:e00978-16. https://doi.org/10.1128/mBio.00978-16.
100. Turner D, Kropinski AM, Adriaenssens EM. 2021. A roadmap for genome-based phage taxonomy. Viruses 13:506. https://doi.org/10.3390/v13030506.
abundant ocean viruses. Nature 537:689–693. https://doi.org/10.1038/nature19366.
110. Bin Jang H, Bolduc B, Zablocki O, Kuhn JH, Roux S, Adriaenssens EM, Brister JR, Kropinski AM, Krupovic M, Lavigne R, Turner D, Sullivan MB. 2019. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat Biotechnol 37:632–639. https://doi.org/10.1038/s41587-019-0100-8.
111. Edwards RA, McNair K, Faust K, Raes J, Dutilh BE. 2016. Computational approaches to predict bacteriophage–host relationships. FEMS Microbiol Rev 40:258–272. https://doi.org/10.1093/femsre/fuv048.
112. Ahlgren NA, Ren J, Lu YY, Fuhrman JA, Sun F. 2017. Alignment-free oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res 45:39–53. https://doi.org/10.1093/nar/gkw1002.
113. Coutinho FH, Zaragoza-Solas A, López-Pérez M, Barylski J, Zielezinski A, Dutilh BE, Edwards RA, Rodriguez-Valera F. 2020. RaFAH: A superior method for virus-host prediction. bioRxiv. https://doi.org/10.1101/2020.09.25.313155.
114. Galiez C, Siebert M, Enault F, Vincent J, Söding J. 2017. WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics 33:3113–3114. https://doi.org/10.1093/bioinformatics/btx383.
115. Villarroel J, Kleinheinz KA, Jurtz VI, Zschach H, Lund O, Nielsen M, Larsen MV. 2016. HostPhinder: a phage host prediction tool. Viruses 8:116. https://doi.org/10.3390/v8050116.
116. Gałan W, Bąk M, Jakubowska M. 2019. Host taxon predictor-a tool for predicting taxon of the host of a newly discovered virus. Sci Rep 9:3436. https://doi.org/10.1038/s41598-019-39847-2.
117. Liu D, Ma Y, Jiang X, He T. 2019. Predicting virus-host association by Kernelized logistic matrix factorization and similarity network fusion. BMC Bioinform 20(Suppl 16):594. https://doi.org/10.1186/s12859-019-3082-0.
118. Wang W, Ren J, Tang K, Dart E, Ignacio-Espinoza JC, Fuhrman JA, Braun J, Sun F, Ahlgren NA. 2020. A network-based integrated framework for predicting virus–prokaryote interactions. NAR Genom Bioinform 2:lqaa044. https://doi.org/10.1093/nargab/lqaa044.
119. Zhang R, Mirdita M, Levy Karin E, Norroy C, Galiez C, Söding J. 2021. SpacePHARER: sensitive identification of phages from CRISPR spacers in prokaryotic hosts. Bioinformatics 37:3364–3366. https://doi.org/10.1093/bioinformatics/btab222.
120. Dion MB, Plante PL, Zufferey E, Shah SA, Corbeil J, Moineau S. 2021. Streamlining CRISPR spacer-based bacterial host predictions to decipher the viral dark matter. Nucleic Acids Res 49:3127–3138. https://doi.org/10
131. Buckling A, Rainey PB. 2002. Antagonistic coevolution between a bacterium and a bacteriophage. Proc R Soc Lond B 269:931–936. https://doi.org/10.1098/rspb.2001.1945.
132. Shkoporov AN, Clooney AG, Sutton TDS, Ryan FJ, Daly KM, Nolan JA, McDonnell SA, Khokhlova EV, Draper LA, Forde A, Guerin E, Velayudhan V, Ross RP, Hill C. 2019. The human gut virome is highly diverse, stable, and individual specific. Cell Host Microbe 26:527–541. https://doi.org/10.1016/j.chom.2019.09.009.
133. Ignacio-Espinoza JC, Ahlgren NA, Fuhrman JA. 2020. Long-term stability and Red Queen-like strain dynamics in marine viruses. Nat Microbiol 5:265–271. https://doi.org/10.1038/s41564-019-0628-x.
134. De Sordi L, Lourenço M, Debarbieux L. 2019. “I will survive”: a tale of bacteriophage-bacteria coevolution in the gut. Gut Microbes 10:92–99. https://doi.org/10.1080/19490976.2018.1474322.
135. Enav H, Kirzner S, Lindell D, Mandel-Gutfreund Y, Béjà O. 2018. Adaptation to sub-optimal hosts is a driver of viral diversification in the ocean. Nat Commun 9:4698. https://doi.org/10.1038/s41467-018-07164-3.
136. Coutinho FH, Rosselli R, Rodríguez-Valera F. 2019. Trends of microdiversity reveal depth-dependent evolutionary strategies of viruses in the Mediterranean. mSystems 4:e00554-19. https://doi.org/10.1128/mSystems.00554-19.
137. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43:491–498. https://doi.org/10.1038/ng.806.
138. Chen LX, Méheust R, Crits-Christoph A, McMahon KD, Nelson TC, Slater GF, Warren LA, Banfield JF. 2020. Large freshwater phages with the potential to augment aerobic methane oxidation. Nat Microbiol 5:1504–1515. https://doi.org/10.1038/s41564-020-0779-9.
139. Siranosian BA, Tamburini FB, Sherlock G, Bhatt AS. 2020. Acquisition, transmission and strain diversity of human gut-colonizing crAss-like phages. Nat Commun 11:280. https://doi.org/10.1038/s41467-019-14103-3.
140. Schloissnig S, Arumugam M, Sunagawa S, Mitreva M, Tap J, Zhu A, Waller A, Mende DR, Kultima JR, Martin J, Kota K, Sunyaev SR, Weinstock GM, Bork P. 2013. Genomic variation landscape of the human gut microbiome. Nature 493:45–50. https://doi.org/10.1038/nature11711.
141. Cosgun E, Oh M. 2020. Exploring the consistency of the quality scores with machine learning for next-generation sequencing experiments. Biomed Res Int 2020:8531502. https://doi.org/10.1155/2020/8531502.
142. Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh Y-P, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, Pachter L, Myers E, Langley CH. 2007.
144. Olm MR, Crits-Christoph A, Bouma-Gregson K, Firek BA, Morowitz MJ, Banfield JF. 2021. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat Biotechnol 39:727–736. https://doi.org/10.1038/s41587-020-00797-0.
145. Dixon P. 2003. VEGAN, a package of R functions for community ecology. J Veg Sci 14:927–930. https://doi.org/10.1111/j.1654-1103.2003.tb02228.x.
146. McMurdie PJ, Holmes S. 2013. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One 8:e61217. https://doi.org/10.1371/journal.pone.0061217.
147. Szekely AJ, Breitbart M. 2016. Single-stranded DNA phages: from early molecular biology tools to recent revolutions in environmental microbiology. FEMS Microbiol Lett 363:fnw027. https://doi.org/10.1093/femsle/fnw027.
148. Callanan J, Stockdale SR, Shkoporov A, Draper LA, Ross RP, Hill C. 2018. RNA phage biology in a metagenomic era. Viruses 10:386. https://doi.org/10.3390/v10070386.
149. O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D, Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey KM, Murphy MR, O'Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C, Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C, Webb D, Wu W, Landrum MJ, Kimchi A, et al. 2016. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44:D733–D745. https://doi.org/10.1093/nar/gkv1189.
150. UniProt Consortium. 2019. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506–D515. https://doi.org/10.1093/nar/gky1049.
151. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, Finn RD, Bateman A. 2021. Pfam: the protein families database in 2021. Nucleic Acids Res 49:D412–D419. https://doi.org/10.1093/nar/gkaa913.
152. Lu S, Wang J, Chitsaz F, Derbyshire MK, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Marchler GH, Song JS, Thanki N, Yamashita RA, Yang M, Zhang D, Zheng C, Lanczycki CJ, Marchler-Bauer A. 2020. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Res 48:D265–D268. https://doi.org/10.1093/nar/gkz991.
153. Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG. 2014. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res 42:D310–D314. https://doi.org/10.1093/nar/gkt1242.