Edwards and Holt Microbial Informatics and Experimentation 2013, 3:2

http://www.microbialinformaticsj.com/content/3/1/2

REVIEW Open Access

Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data
David J Edwards1,2 and Kathryn E Holt1*

Abstract
High throughput sequencing is now fast and cheap enough to be considered part of the toolbox for investigating
bacteria, and there are thousands of bacterial genome sequences available for comparison in the public domain.
Bacterial genome analysis is increasingly being performed by diverse groups in research, clinical and public health
labs alike, who are interested in a wide array of topics related to bacterial genetics and evolution. Examples include
outbreak analysis and the study of pathogenicity and antimicrobial resistance. In this beginner’s guide, we aim to
provide an entry point for individuals with a biology background who want to perform their own bioinformatics
analysis of bacterial genome data, to enable them to answer their own research questions. We assume readers will
be familiar with genetics and the basic nature of sequence data, but do not assume any computer programming
skills. The main topics covered are assembly, ordering of contigs, annotation, genome comparison and extracting
common typing information. Each section includes worked examples using publicly available E. coli data and free
software tools, all of which can be performed on a desktop computer.
Keywords: Bacterial, Microbial, Comparative, Genomics, Next generation sequencing, Analysis, Methods

* Correspondence: [email protected]
1 Department of Biochemistry and Molecular Biology, Bio21 Institute, University of Melbourne, Victoria 3010, Australia
Full list of author information is available at the end of the article

© 2013 Edwards and Holt; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Review

Introduction and aims
High throughput sequencing is now fast and cheap enough to be considered part of the toolbox for investigating bacteria [1,2]. This work is performed by diverse groups of individuals including researchers, public health practitioners and clinicians, interested in a wide array of topics related to bacterial genetics and evolution. Examples include the study of clinical isolates as well as laboratory strains and mutants [3]; outbreak investigation [4,5]; and the evolution and spread of drug resistance [6]. Bacterial genome sequences can now be generated in-house in many labs, in a matter of hours or days using benchtop sequencers such as the Illumina MiSeq, Ion Torrent PGM or Roche 454 FLX Junior [1,2]. Much of this data is available in the public domain, allowing for extensive comparative analysis; e.g. in February 2013 the GenBank database included >6,500 bacterial genome assemblies, two thirds of which were in ‘draft’ form (i.e. presented as a set of sequence fragments rather than a single sequence representing the whole genome, see [7] for a detailed discussion).

In this beginner’s guide, we aim to provide an entry point for individuals wanting to make use of whole-genome sequence data for the de novo assembly of genomes to answer questions in the context of their broader research goals. The guide is not aimed at those wishing to perform automated processing of hundreds of genomes at a time; some discussion of the use of sequencing in routine microbiological diagnostic laboratories is available in the literature [8]. We assume readers will be familiar with genetics and the basic nature of sequence data, but do not assume any computer programming skills and all the examples we use can be performed on a desktop computer (Mac, Windows or Linux). The guide is not intended to be exhaustive, but to introduce a set of simple but flexible and free tools that can be used to investigate a variety of common questions including (i) how does this genome compare to that one?, and (ii) does this genome have plasmids,
phage or resistance genes? Each section includes guidance on where to find more detailed technical information, alternative software packages and where to look for more sophisticated approaches.

Examples and tutorial
Throughout the guide, we will use Escherichia coli O104:H4 as a worked example. E. coli O104:H4 was responsible for a lethal foodborne outbreak of haemolytic uraemic syndrome (HUS) in Germany during 2011 [9-11]. Sequence reads and assemblies from a number of outbreak strains, generated using different high throughput sequencing platforms (including Illumina, Ion Torrent and 454) are now available for download from the European Nucleotide Archive [11-17].

The outbreak strain belongs to an enteroaggregative E. coli (EAEC) lineage that has acquired a bacteriophage encoding Shiga-toxin (commonly associated with enterohaemorrhagic E. coli (EHEC)), and multiple antibiotic resistance genes [12]. For the worked examples, we will use a set of paired-end Illumina reads from O104:H4 strain TY-2482 (ENA accession SRR292770), but also include alternatives for the other available short-read data types. For those interested, longer Pacific Biosciences reads are also available, but are not included in this tutorial.

The workflow has been divided into five logical sections: assembly, ordering of contigs, annotation, genome comparison and typing. Examples using E. coli O104:H4 data are presented in the text and figures, and detailed instructions on how to replicate the example are provided in the corresponding tutorial (Additional file 1). The tutorial includes links to the software programs required for each stage, the specific steps needed to use the program(s), and the expected inputs and outputs (instructions for software installation are provided by the developers of each program).

Whilst quality control of raw sequence data can be important in obtaining the best assembly for comparison, the possible steps are too numerous and complex, and the variations between platforms too substantial, to cover in this guide. However, we recommend readers check the quality of raw sequence reads using the tools accompanying their benchtop sequencing machines, or use FastQC to assess the quality of raw read sets (see Tutorial, Additional file 1).

Genome assembly
De novo assembly is the process of merging overlapping sequence reads into contiguous sequences (contigs) without the use of any reference genome as a guide (Figure 1). The most efficient assemblers for short-read sequences are typically those that employ de Bruijn graphs to produce an assembly [18]. An eloquent explanation of how de Bruijn graphs work in sequence assembly can be found in Compeau et al. [18]. One of the first and most widely used de Bruijn graph assemblers is the open-source

Figure 1 Genome assembly with Velvet. Reads are assembled into contigs using Velvet and VelvetOptimiser in two steps: (1) velveth converts reads to k-mers using a hash table, and (2) velvetg assembles overlapping k-mers into contigs via a de Bruijn graph. VelvetOptimiser can be used to automate the optimisation of parameters for velveth and velvetg and generate an optimal assembly. To generate an assembly of E. coli O104:H4 using the command-line tool Velvet:
• Download Velvet [23] (we used version 1.2.08 on Mac OS X, compiled with a maximum k-mer length of 101 bp)
• Download the paired-end Illumina reads for E. coli O104:H4 strain TY-2482 (ENA accession SRR292770 [17])
• Convert the reads to k-mers using this command:
  velveth out_data_35 35 -fastq.gz -shortPaired -separate SRR292770_1.fastq.gz SRR292770_2.fastq.gz
• Then, assemble overlapping k-mers into contigs using this command:
  velvetg out_data_35 -clean yes -exp_cov 21 -cov_cutoff 2.81 -min_contig_lgth 200
This will produce a set of contigs in multifasta format for further analysis. See Additional file 1: Tutorial for further details, including help with downloading reads and using VelvetOptimiser.
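To make the two steps in Figure 1 more concrete, the following is a minimal Python sketch of the general de Bruijn graph idea: reads are broken into k-mers, and overlapping k-mers are chained into a contig. It is a toy illustration only, not Velvet's implementation; the function names (`kmers`, `build_de_bruijn`, `walk_unambiguous`) and the example reads are hypothetical, and real assemblers additionally handle sequencing errors, reverse complements, repeats and coverage.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every overlapping substring of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph.

    Nodes are (k-1)-mers; each distinct k-mer adds an edge from its
    (k-1)-base prefix to its (k-1)-base suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the path has exactly one way forward.

    Simplification (assumption): only out-degree is checked, and the walk
    stops at any branch, dead end, or node that was already visited.
    """
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:
            break
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig


if __name__ == "__main__":
    # Three error-free, overlapping toy reads (hypothetical data).
    reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
    graph = build_de_bruijn(reads, k=4)
    print(walk_unambiguous(graph, "ATG"))  # prints ATGGCGTGCAAT
```

With a smaller k more reads overlap but repeated (k-1)-mers create branches where the walk stops; with a larger k the path is less ambiguous but fewer reads overlap. This is the sensitivity versus specificity trade-off that parameter optimisation for k addresses.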

program Velvet [19]. With further development to improve the resolution of repeats and scaffolding using paired-end and longer reads [20], Velvet remains one of the most-used (and cited) assemblers for bacterial genomes, being best suited to Illumina sequence reads (Velvet is included as the default assembler in the Illumina MiSeq analysis suite).

Ion Torrent reads are better assembled using the open source program MIRA [21], which uses a modified Smith-Waterman algorithm for local alignment rather than a de Bruijn graph method. MIRA is available as a plugin for the Ion Torrent analysis suite. For 454 data, Roche provides a proprietary (de Bruijn graph-based) assembler [22].

When using a de Bruijn graph assembler, a number of variables need to be considered in order to produce optimal contigs [23]. This can be automated quite effectively using VelvetOptimiser [24]. The key issue is selecting an appropriate k-mer length for building the de Bruijn graph. Different sequencing platforms produce fragments of differing length and quality [1], meaning very different ranges of k-mers will be better suited to different types of read sets. A balance must be found between the sensitivity offered by a smaller k-mer and the specificity of a larger one [18]. Other variables to consider when running Velvet include the expected coverage across the genome, the length of the insert sizes in paired-end read libraries, and the minimum coverage (read depth) cut-off value, all of which can be automated using VelvetOptimiser [23]. If the coverage obtained is deeper than 20× on average, the chance of errors being incorporated into the contigs increases, as de Bruijn graph assemblers cannot distinguish between an error and a real variant if there is lots of evidence for the error, as found with higher coverage levels. In this case, a subset of the reads can be sampled and used for the assembly [23].

Instructions on how to assemble Illumina reads from E. coli O104:H4 strain TY-2482 using Velvet are given in Figure 1 and Additional file 1: Tutorial. The assembler takes the sequence reads as input (in fastq format) and outputs the assembled contigs (in multifasta format). Note that the contig set, referred to as the draft assembly, will include sequences derived from all the DNA present in the sequenced sample, including chromosome(s) and any bacteriophage or plasmids.

Ordering and viewing assembled contigs
Once a set of contigs has been assembled from the sequencing reads, the next step is to order those contigs against a suitable reference genome. This may seem counter-intuitive at first as we have applied de novo assembly to obtain these contigs, but ordering the contigs aids the discovery and comparison process. The best reference to use is usually the most closely related bacterium with a ‘finished’ genome, but as in the case of E. coli O104:H4, finding the best reference may itself involve trial and error [12].

Ordering of contigs can be achieved using command-line tools such as MUMmer [25], which can be simplified using a wrapper program like ABACAS [26]. However, we suggest the easiest way for beginners is to use the contig ordering tool in the Java-based graphical-interface program Mauve [27,28]. This ordering algorithm uses an iterative mapping approach to find the best fit for each contig against the reference genome. Mauve takes as input the reference genome in fasta format along with the assembly in multifasta format, and outputs another multifasta file containing the ordered contigs. Detailed instructions for ordering the E. coli O104:H4 contigs against a reference are given in Additional file 1: Tutorial.

Due to evolutionary differences between the reference and novel genome, the presence of (often mobile) repeat elements such as prophages, and the very nature of short-read assemblers, there will almost certainly be assembly errors present within the contigs. Indeed, all assemblers used in Assemblathon 1 [29] and the Genome Assembly Gold-standard Evaluations (GAGE) [30] community “bake-offs” produced assemblies with errors. The error rate of an assembly can be assessed if a closely related reference genome is available. A good option for assessing the error rate is MauveAssemblyMetrics [31] (see Additional file 1: Tutorial for an example with E. coli O104:H4), an optional addition to Mauve that generates a report on assembly quality.

Another way to explore the ordered assembly is by means of visualization. Mauve provides one way to visualize the assembly by alignment to other sequences (see Additional file 1: Tutorial for instructions). Another option is to use Artemis and the companion Artemis Comparison Tool (ACT), a pair of open-source Java-based applications [32]. An example using E. coli O104:H4 is shown in Figure 2 and in Additional file 1: Tutorial. To view comparisons in ACT, you need to first generate a comparison file that identifies regions of homology between your assembly and a reference genome. You can then load this into ACT along with your assembly and reference sequence(s). The comparison file can be generated using the WebACT or DoubleACT websites, or using BLAST+ on your own computer (see Additional file 1: Tutorial for details of these programs). Note that before you can generate the comparison file, the assembly needs to be converted into a single fasta sequence. This can be done in Artemis (Figure 2), or using a command-line tool such as the ‘union’ command in the EMBOSS package [33] (see Additional file 1: Tutorial for details).

Genome annotation
Once the ordered set of contigs has been obtained, the next step is to annotate the draft genome. Annotation is

Figure 2 Pairwise genome comparisons with ACT, the Artemis Comparison Tool. Artemis and ACT are free, interactive genome browsers [32,40] (we used ACT 11.0.0 on Mac OS X).
• Open the assembled E. coli O104:H4 contigs in Artemis and write out a single, concatenated sequence using File -> Write -> All Bases -> FASTA Format.
• Generate a comparison file between the concatenated contigs and 2 alternative reference genomes using the website WebACT (http://www.webact.org/).
• Launch ACT and load in the reference sequences, contigs and comparison files, to get a 3-way comparison like the one shown here.
Here, the E. coli O104:H4 contigs are in the middle row, the enteroaggregative E. coli strain Ec55989 is on top and the enterohaemorrhagic E. coli strain EDL933 is below. Details of the comparison can be viewed by zooming in, to the level of genes or DNA bases.
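For readers curious what the concatenation step in Figure 2 amounts to, here is a minimal plain-Python sketch that joins all contigs of a multifasta file into one FASTA record. It is not how Artemis or EMBOSS implement the step; the function names and the output record name `concatenated_contigs` are hypothetical, and contigs are simply joined end to end in input order with no spacer.

```python
def read_fasta(text):
    """Parse multifasta text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def concatenate_contigs(records, name="concatenated_contigs"):
    """Join all contig sequences, in input order, into one FASTA record."""
    sequence = "".join(seq for _, seq in records)
    return f">{name}\n{sequence}\n"


if __name__ == "__main__":
    # Two toy contigs (hypothetical data); the first spans two lines.
    example = ">contig1\nACGT\nAC\n>contig2\nGGTT\n"
    print(concatenate_contigs(read_fasta(example)), end="")
```

Because the join follows input order, running it on contigs that have already been ordered against a reference keeps that order in the concatenated sequence.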

the process of ‘gene’ finding, and can also include the identification of ribosomal and transfer RNAs encoded in the genome. Bacterial genome annotation is most easily achieved by uploading a genome assembly to an automated web-based tool such as RAST [34,35]. There are also many command-line annotation tools available. These include methods based on de novo discovery of genes, such as Prokka [36] and DIYA [37], or programs that transfer annotation directly from closely related genomes, such as RATT [38] and BG-7 [39].

Since the quality of the final annotation is largely determined by the quality of the gene database used, we prefer the easy-to-use online de novo annotation tool RAST for bacterial genome annotation [35]. RAST takes as input the ordered contigs in multifasta format, identifies open reading frames that are likely to be genes, and uses a series of subsystem techniques (the ‘ST’ in RAST) to compare these with a sophisticated database of genes and RNA sequences, producing a high-quality annotation of the assembly. The genes identified can be viewed, and compared to other genomes, using the RAST online tool. The annotation can also be downloaded in a variety of formats, including GenBank format. See Additional file 1: Tutorial for detailed instructions on how to annotate the E. coli O104:H4 genome using RAST.

Comparative genome analysis
For most sequencing experiments, comparison to other genomes or sequences is a critical step. Sometimes general questions are asked, along the lines of “which genes do these genomes share and which are unique to particular genomes?”. In many cases, users are also interested in looking for specific genes that are known to have important functions, such as virulence genes or drug resistance determinants.

For most users, it is important to be able to visualize these comparisons, both to aid understanding and interpretation of the data, and to generate figures for communicating results. We therefore recommend three software tools that combine data analysis and visualization: BRIG, Mauve and ACT (the latter two have already been introduced above). For more experienced users, comparative questions can also be answered using command-line search tools, such as MUMmer or BLAST.
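The general question quoted above ("which genes do these genomes share and which are unique?") can be sketched with set operations once each genome has a list of gene identifiers. The sketch below is illustrative only and rests on a strong assumption: that equivalent genes already carry the same identifier in every genome (for example, because all genomes were annotated with the same pipeline). The function name and the placeholder genome and gene names are hypothetical.

```python
def compare_gene_content(genomes):
    """Split gene content into shared and genome-specific sets.

    genomes: dict mapping a genome name to an iterable of gene identifiers.
    Returns (core, unique), where core is the set of genes present in every
    genome and unique maps each genome to the genes found in it alone.
    """
    gene_sets = {name: set(genes) for name, genes in genomes.items()}
    core = set.intersection(*gene_sets.values())
    unique = {}
    for name, genes in gene_sets.items():
        # Union of the genes seen in all other genomes.
        others = set().union(
            *(g for other, g in gene_sets.items() if other != name)
        )
        unique[name] = genes - others
    return core, unique


if __name__ == "__main__":
    # Placeholder identifiers, not real annotation output.
    genomes = {
        "genome_A": ["gene1", "gene2", "gene3"],
        "genome_B": ["gene1", "gene3", "gene4"],
        "genome_C": ["gene1", "gene5"],
    }
    core, unique = compare_gene_content(genomes)
    print(sorted(core))                  # ['gene1']
    print(sorted(unique["genome_A"]))    # ['gene2']
```

In practice, deciding that two genes are "the same" across genomes is the hard part and usually relies on sequence similarity searches, which is one reason the graphical tools recommended here are a gentler starting point.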

ACT [32,40] is a Java-based tool for visualizing pairwise comparisons of sequences, including whole genomes. As outlined above, BLAST is used to compare the sequences (this can be done locally, or through web services); the two genomes and the BLAST result are then loaded into ACT for visualization of the comparison (see Additional file 1: Tutorial). Multiple pairwise comparisons can be visualized simultaneously; an example using E. coli O104:H4 is given in Figure 2 and Additional file 1: Tutorial. Regions of sequence homology are linked by blocks, which are coloured red (same orientation) or blue (reverse orientation), with saturation indicating the degree of homology (dark = high homology, light = low homology). Advantages of using ACT include (i) the flexibility to zoom right out to see whole-genome comparisons, (ii) the ability to zoom right down to DNA and/or protein sequences to examine fine-scale comparisons, and (iii) the option to add or edit annotations for the genomes being compared.

Mauve is a Java-based tool for multiple alignment of whole genomes, with a built-in viewer and the option to export comparative genomic information in various forms [27,41]. Its alignment functions can also be used to order and orient contigs against an existing assembly, as outlined above. Mauve takes as input a set of genome assemblies, and generates a multiple whole-genome alignment. It identifies blocks of sequence homology, and assigns each block a unique colour. Each genome can then be visualized as a sequence of these coloured sequence blocks, facilitating visualization of the genome comparisons. An example is given in Figure 3. This makes it easy to identify regions that are conserved among the whole set of input genomes, and regions that are unique to subsets of genomes (islands). The tutorial (Additional file 1) includes a detailed example of how to use Mauve to identify unique regions in the E. coli O104:H4 outbreak assembly compared to EHEC and EAEC chromosomal sequences. Because Mauve generates an alignment of the genome sequences, it can also be used to identify single nucleotide polymorphisms (SNPs, or point mutations) suitable for downstream phylogenetic or evolutionary analyses (see the Mauve user guide for details).

BRIG, or the BLAST Ring Image Generator, is a Java-based tool for visualizing the comparison of a reference sequence to a set of query sequences [42,43]. Results are plotted as a series of rings, each representing a query sequence, which are coloured to indicate the presence of hits to the reference sequence (see Figure 4). BRIG is flexible and can be used to answer a broad range of comparative questions, depending on the selection of the reference and comparison sequences. However, it is important to keep in mind that this particular approach is reference-based, meaning it can show you which regions of the reference sequence are present or absent in query sequences, but it cannot reveal regions of the query sequences that are missing from the reference sequence. Therefore the selection of the reference is critical to understanding the results. An example is given in Figure 4, in which an EHEC genome is used as the reference sequence and the E. coli O104:H4 outbreak genome assembly, along with other pathogenic E. coli genomes, are used as queries. This makes it easy to see that the outbreak strain differs significantly from enterohaemorrhagic E. coli (EHEC) in terms of gene content, but shares with

Figure 3 Mauve for multiple genome alignment. Mauve is a free alignment tool with an interactive browser for visualising results [27,41] (we used Mauve 2.3.1 on Mac OS X).
• Launch Mauve and select File -> Align with progressiveMauve
• Click ‘Add Sequence...’ to add your genome assembly (e.g. annotated E. coli O104:H4 contigs) and other reference genomes for comparison.
• Specify a file for output, then click ‘Align...’
• When the alignment is finished, a visualization of the genome blocks and their homology will be displayed, as shown here.
E. coli O104:H4 is on the top; red lines indicate contig boundaries within the assembly. Sequences outside coloured blocks do not have homologs in the other genomes.
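Since a whole-genome alignment such as the one Mauve produces can be used to identify SNPs, the idea can be sketched in Python as a scan for alignment columns where the aligned bases differ. This is a toy sketch under stated assumptions, not Mauve's SNP export: the input is assumed to be a set of already-aligned sequences of equal length with `-` marking gaps, and the function name and example sequences are hypothetical.

```python
def find_snps(aligned):
    """Report variable alignment columns as candidate SNPs.

    aligned: dict mapping sequence name -> aligned sequence. All sequences
    must have the same length; '-' marks a gap. Columns containing a gap or
    an ambiguous base 'N' are skipped. Positions are 1-based alignment
    columns, not genome coordinates.
    """
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(aligned[name]) != length for name in names):
        raise ValueError("aligned sequences must have equal length")

    snps = []
    for pos in range(length):
        column = [aligned[name][pos].upper() for name in names]
        if "-" in column or "N" in column:
            continue
        if len(set(column)) > 1:
            snps.append((pos + 1, dict(zip(names, column))))
    return snps


if __name__ == "__main__":
    # Toy alignment (hypothetical data).
    alignment = {
        "reference": "ACGTACGT",
        "strain_1":  "ACGAACGT",
        "strain_2":  "AC-TACGC",
    }
    for position, bases in find_snps(alignment):
        print(position, bases)
```

Skipping gapped columns is a conservative choice for this sketch; whether gaps and ambiguous bases should be excluded depends on the downstream phylogenetic method.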

Figure 4 BRIG for multiple genome comparison. BRIG is a free tool [42,43] that requires a local installation of BLAST (we used BRIG 0.95 on Mac OS X). The output is a static image.
• Launch BRIG and set the reference sequence (EHEC EDL933 chromosome) and the location of other E. coli sequences for comparison. If you include reference sequences for the Stx2 phage and LEE pathogenicity island, it will be easy to see where these sequences are located.
• Click ‘Next’ and specify the sequence data and colour for each ring to be displayed in comparison to the reference.
• Click ‘Next’ and specify a title for the centre of the image and an output file, then click ‘Submit’ to run BRIG.
• BRIG will create an output file containing a circular image like the one shown here.
It is easy to see that the Stx2 phage is present in the EHEC chromosomes (purple) and the outbreak genome (black), but not the EAEC or EPEC chromosomes.
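A BRIG ring shows visually which parts of the reference are matched by a query. A rough numeric counterpart, sketched below in Python, is the fraction of the reference covered by at least one hit. This is not part of BRIG; it assumes you already have the reference coordinates of each hit as (start, end) pairs (for example, taken from a BLAST results table), and the function name and example numbers are hypothetical.

```python
def covered_fraction(hits, reference_length):
    """Fraction of a reference sequence covered by at least one hit.

    hits: iterable of (start, end) positions on the reference, 1-based and
    inclusive. Start may be greater than end (as can happen for
    reverse-strand hits), so each pair is normalised first.
    """
    intervals = sorted((min(s, e), max(s, e)) for s, e in hits)

    covered = 0
    current_start = current_end = None
    for start, end in intervals:
        if current_end is None or start > current_end + 1:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1

    return covered / reference_length


if __name__ == "__main__":
    # Hypothetical hits on a 400 bp reference: two overlap, one is reversed.
    hits = [(1, 100), (50, 150), (300, 201)]
    print(covered_fraction(hits, 400))  # prints 0.625
```

Merging overlapping hits before summing avoids counting the same reference bases twice, which matters when several contigs or repeated regions match the same part of the reference.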

it the Stx2 phage sequence, which is missing from enteroaggregative E. coli (EAEC) and enteropathogenic E. coli (EPEC) (highlighted in Figure 4). The tutorial includes a second example, using the E. coli O104:H4 outbreak genome as the reference for comparison.

Typing and public health applications: identifying resistance genes, sequence types, phage, plasmids and other specific sequences
Whole genome sequencing is increasingly being used in place of PCR-based sequencing or typing methods. Here we outline some specialist tools for these purposes. The tutorial contains instructions for using these tools to examine the E. coli O104:H4 outbreak genome.

The detection of antimicrobial resistance genes is a key question for many researchers, especially in public health and diagnostic labs. The ResFinder tool [44], freely available online [45], allows users to upload sequence data to search against its curated database of acquired antimicrobial resistance genes. Sequence search is performed by BLAST, and the output is displayed in a table format that indicates which resistance genes were found, where they were found (contig name and coordinates), and the expected effect on phenotype. The fastest way to use ResFinder is to upload a genome assembly; however, it is also possible to upload raw sequence reads in fastq format, which will be assembled prior to searching for resistance genes.

Multi-locus sequence typing (MLST) is a widely used, sequence-based method for typing of bacterial species and plasmids [46]. In February 2013, public MLST schemes were available for over 100 bacterial species and five plasmid incompatibility types [47]. The Center for Genomic Epidemiology hosts a publicly available web-based tool [48] that allows users to upload sequence data and extract sequence types for most of the publicly available MLST schemes. Like ResFinder, the tool uses BLAST searches of assemblies to identify sequence types, and can accept either genome assemblies or read sets, which are assembled on the fly prior to searching. Sequence types can also be

extracted directly from reads, which can be more sensitive than assembly; see e.g. SRST, a command-line tool based on read mapping [49,50].

For many bacteria, phage are the most dynamic part of the genome and are therefore of key interest to many researchers. Several free online tools exist for the identification of prophage sequences within bacterial genomes. A particularly feature-rich tool is PHAST (PHAge Search Tool) [51]. Genome assemblies can be uploaded in fasta or GenBank format; outputs include summary tables (indicating the location and identity of phage sequences within the assembly) and interactive tools for visualization of both the individual phage annotations and their locations on a circular map of the genome.

In most bacterial genome sequencing experiments, whole genomic DNA is extracted from the isolate and thus the sequence data includes both chromosomal and plasmid DNA. Many researchers are interested in exploring which plasmids are present in their bacterial genomes, particularly in the context of plasmid-borne resistance genes or virulence genes. One approach to rapidly detecting the presence and sequence type of a particular plasmid incompatibility group is to run a plasmid MLST analysis, e.g. using SRST [49]. However, this will only work for the small number of plasmids with MLST schemes, and does not tell you which genes are encoded in the plasmid.

The ability to determine which sequences belong to plasmids and which belong to chromosomes varies with each sequencing experiment. This generally hinges on whether it is possible to assemble whole plasmids into a single sequence, which depends on many factors including read length, the availability of paired-end or mate-pair data, and the presence of repetitive DNA within the plasmid sequence. In most cases it is not possible to confidently assign every single contig to its correct replicon (i.e. chromosome or specific plasmid), without performing additional laboratory experiments. However, it is possible to get a very good idea of what plasmids are present in a genome assembly using comparative analyses. A good place to start is to identify all the contigs that are not definitely chromosomal (by comparing to other sequenced chromosomes using ACT or Mauve, see above) and BLAST these against GenBank or a plasmid-specific database. One such database is available on the PATRIC website [52]. On the PATRIC BLAST page, select ‘blastn’ from the Program dropdown list and select ‘Plasmid sequences’ from the Database dropdown list. At the bottom of the page you can choose to view your results graphically (great if you are just searching a few contigs) or as a table (better if you have lots of contigs to investigate). The most similar plasmid sequences should make good candidates for more detailed comparison and visualization using Mauve, ACT or BRIG as outlined above.

Another useful approach is to perform a blastn (nucleotide BLAST) search of the whole database at NCBI to see which known sequences your non-chromosomal contigs match (go to [53] and click ‘nucleotide blast’, then upload your contigs and make sure you are searching the ‘nr’ database). If you find you have a large contig with lots of matches to plasmid sequences, it’s likely your contig is also part of a plasmid. One advantage of using NCBI’s BLAST search page is that results can be viewed in the form of a phylogenetic tree (click ‘Distance tree of results’ at the top of the results page). This can help to quickly identify the plasmid sequences closest to yours, which can then be used for comparative analysis. If you find a contig that has close matches to part of a known plasmid, it may be of interest to know if the rest of the reference plasmid sequence is also present in the novel genome. You could get a quick idea of this using BRIG: use the known plasmid sequence as the reference and your set of assembled contigs as the query, then look to see how much of the known plasmid is covered by contigs. If a large part of the plasmid is covered, an ACT comparison could be performed using the reference plasmid and the annotated contig set, in order to identify which other contigs are likely to ‘belong’ to the same plasmid replicon and inspect what other genes are carried by the new plasmid.

Other analyses
There are many other methods for performing comparative bacterial genomic analysis, which are not discussed here. In particular, we have not discussed phylogenetic analysis, or how to perform detailed gene content comparisons between sets of genomes.

Arguably, phylogenetic analysis of closely related genomes is best performed using single nucleotide polymorphisms (SNPs) identified by read mapping rather than assembly-based approaches [6,54,55]. Many software programs are available for this task; see [56,57] for a review and the updated software list hosted by the SeqAnswers web forum [58]. The process can be somewhat automated using command-driven pipelines such as Nesoni [59] or graphical interfaces within the MiSeq or Ion Torrent analysis suites or the web-based Galaxy [60].

Detailed gene content comparisons are generally best performed using databases tailored to the bacterial species of interest. An excellent place to start is to explore the web-based tools PATRIC [61] and PGAT [62], which are suitable for biologists with little or no programming skills.

Delving deeper into bioinformatics
For biologists interested in learning more about bioinformatics analysis, we recommend two things. First, get comfortable with the Unix command-line [63,64], which

opens up a huge array of software tools to do more sophisticated analyses (see [58] for a list of available
next-generation sequence analysis tools). Second, learn to use the Python scripting language (tutorial at
[65]) and associated BioPython functions [66], which will help you to write your own snippets of code to do
exactly the analysis you want.

Conclusions
The bench-top sequencing revolution has led to a ‘democratization’ of sequencing, meaning most research
laboratories can afford to sequence whole bacterial genomes when their work demands it. However analysing
the data is now a major bottleneck for most laboratories. We have provided a starting point for biologists
to quickly begin working with their own bacterial genome data, without investing money in expensive
software or training courses. The figures show examples of what can be achieved with the tools presented,
and the accompanying tutorial gives step-by-step instructions for each kind of analysis.

Additional file
Additional file 1: Tutorial. Bacterial Comparative Genomics Tutorial. Detailed tutorial including worked
examples, divided into three sections (1) Genome assembly and annotation, (2) Comparative genome analysis,
and (3) Typing and specialist tools.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
DJE and KEH drafted, read and approved the manuscript.

Acknowledgements
KEH is supported by Fellowship #628930 from the NHMRC of Australia. DJE is supported by a VLSCI MSc
Bioinformatics Studentship from the Victorian Life Sciences Computation Initiative (VLSCI).

Author details
1 Department of Biochemistry and Molecular Biology, Bio21 Institute, University of Melbourne, Victoria 3010,
Australia. 2 Victorian Life Sciences Computation Initiative, University of Melbourne, Victoria 3010,
Australia.

Received: 12 February 2013 Accepted: 31 March 2013 Published: 10 April 2013

References
1. Loman NJ, Constantinidou C, Chan JZ, Halachev M, Sergeant M, Penn CW, Robinson ER, Pallen MJ: High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity. Nat Rev Microbiol 2012, 10:599–606.
2. Stahl PL, Lundeberg J: Toward the single-hour high-quality genome. Annu Rev Biochem 2012, 81:359–378.
3. Howden BP, McEvoy CR, Allen DL, Chua K, Gao W, Harrison PF, Bell J, Coombs G, Bennett-Wood V, Porter JL, et al: Evolution of multidrug resistance during Staphylococcus aureus infection involves mutation of the essential two component regulator WalKR. PLoS Pathog 2011, 7:e1002359.
4. Snitkin ES, Zelazny AM, Thomas PJ, Stock F, Henderson DK, Palmore TN, Segre JA: Tracking a hospital outbreak of carbapenem-resistant Klebsiella pneumoniae with whole-genome sequencing. Sci Transl Med 2012, 4:148ra116.
5. Harris SR, Cartwright EJ, Torok ME, Holden MT, Brown NM, Ogilvy-Stuart AL, Ellington MJ, Quail MA, Bentley SD, Parkhill J, Peacock SJ: Whole-genome sequencing for analysis of an outbreak of meticillin-resistant Staphylococcus aureus: a descriptive study. Lancet Infect Dis 2012, 13:130–136.
6. Holt K, Baker S, Weill F, Holmes E, Kitchen A, Yu J, Sangal V, Brown D, Coia J, Kim D, et al: Shigella sonnei genome sequencing and phylogenetic analysis indicate recent global dissemination from Europe. Nat Genet 2012, 44:1056–1059.
7. Nagarajan N, Cook C, Di Bonaventura M, Ge H, Richards A, Bishop-Lilly KA, DeSalle R, Read TD, Pop M: Finishing genomes with limited resources: lessons from an ensemble of microbial genomes. BMC Genomics 2010, 11:242.
8. Koser CU, Ellington MJ, Cartwright EJ, Gillespie SH, Brown NM, Farrington M, Holden MT, Dougan G, Bentley SD, Parkhill J, Peacock SJ: Routine use of microbial whole genome sequencing in diagnostic and public health microbiology. PLoS Pathog 2012, 8:e1002824.
9. Buchholz U, Bernard H, Werber D, Bohmer MM, Remschmidt C, Wilking H, Delere Y, an der Heiden M, Adlhoch C, Dreesman J, et al: German outbreak of Escherichia coli O104:H4 associated with sprouts. N Engl J Med 2011, 365:1763–1770.
10. Frank C, Werber D, Cramer JP, Askar M, Faber M, an der Heiden M, Bernard H, Fruth A, Prager R, Spode A, et al: Epidemic profile of Shiga-toxin-producing Escherichia coli O104:H4 outbreak in Germany. N Engl J Med 2011, 365:1771–1780.
11. Bielaszewska M, Mellmann A, Zhang W, Kock R, Fruth A, Bauwens A, Peters G, Karch H: Characterisation of the Escherichia coli strain associated with an outbreak of haemolytic uraemic syndrome in Germany, 2011: a microbiological study. Lancet Infect Dis 2011, 11:671–676.
12. Rohde H, Qin J, Cui Y, Li D, Loman NJ, Hentschke M, Chen W, Pu F, Peng Y, Li J, et al: Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4. N Engl J Med 2011, 365:718–724.
13. Brzuszkiewicz E, Thurmer A, Schuldes J, Leimbach A, Liesegang H, Meyer FD, Boelter J, Petersen H, Gottschalk G, Daniel R: Genome sequence analyses of two isolates from the recent Escherichia coli outbreak in Germany reveal the emergence of a new pathotype: Entero-Aggregative-Haemorrhagic Escherichia coli (EAHEC). Arch Microbiol 2011, 193:883–891.
14. Rasko DA, Webster DR, Sahl JW, Bashir A, Boisen N, Scheutz F, Paxinos EE, Sebra R, Chin CS, Iliopoulos D, et al: Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany. N Engl J Med 2011, 365:709–717.
15. Struelens MJ, Palm D, Takkinen J: Enteroaggregative, Shiga toxin-producing Escherichia coli O104:H4 outbreak: new microbiological findings boost coordinated investigations by European public health laboratories. Euro Surveill 2011, 16:19890.
16. Grad YH, Lipsitch M, Feldgarden M, Arachchi HM, Cerqueira GC, Fitzgerald M, Godfrey P, Haas BJ, Murphy CI, Russ C, et al: Genomic epidemiology of the Escherichia coli O104:H4 outbreaks in Europe, 2011. Proc Natl Acad Sci U S A 2012, 109:3065–3070.
17. European Nucleotide Archive. [http://www.ebi.ac.uk/ena/data/search?query=o104:h4]
18. Compeau PE, Pevzner PA, Tesler G: How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 2011, 29:987–991.
19. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008, 18:821–829.
20. Zerbino DR, McEwen GK, Margulies EH, Birney E: Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 2009, 4:e8407.
21. The MIRA Assembler. [http://sourceforge.net/projects/mira-assembler/]
22. 454 Analysis Software. [http://454.com/products/analysis-software/index.asp]
23. Zerbino DR: Using the Velvet de novo assembler for short-read sequencing technologies. In Current Protocols in Bioinformatics. 11th edition. Edited by Baxevanis AD. US: John Wiley and Sons Inc; 2010. Unit 11 15.
24. Velvet Optimiser. [http://bioinformatics.net.au/software.velvetoptimiser.shtml]
25. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol 2004, 5:R12.
26. Assefa S, Keane TM, Otto TD, Newbold C, Berriman M: ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics 2009, 25:1968–1969.
27. Darling AE, Mau B, Perna NT: progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 2010, 5:e11147.
28. Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT: Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics 2009, 25:2071–2073.
29. Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HO, Buffalo V, Zerbino DR, Diekhans M, et al: Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011, 21:2224–2241.
30. Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, et al: GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res 2012, 22:557–567.
31. Darling AE, Tritt A, Eisen JA, Facciotti MT: Mauve assembly metrics. Bioinformatics 2011, 27:2756–2757.
32. Carver TJ, Rutherford KM, Berriman M, Rajandream MA, Barrell BG, Parkhill J: ACT: the Artemis Comparison Tool. Bioinformatics 2005, 21:3422–3423.
33. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16:276–277.
34. RAST (Rapid Annotation using Subsystem Technology). [http://rast.nmpdr.org/]
35. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, et al: The RAST Server: rapid annotations using subsystems technology. BMC Genomics 2008, 9:75.
36. Prokka. [http://www.vicbioinformatics.com/software.prokka.shtml]
37. Stewart AC, Osborne B, Read TD: DIYA: a bacterial annotation pipeline for any genomics lab. Bioinformatics 2009, 25:962–963.
38. Otto TD, Dillon GP, Degrave WS, Berriman M: RATT: Rapid Annotation Transfer Tool. Nucleic Acids Res 2011, 39:e57.
39. Pareja-Tobes P, Manrique M, Pareja-Tobes E, Pareja E, Tobes R: BG7: a new approach for bacterial genome annotation designed for next generation sequencing data. PLoS One 2012, 7:e49239.
40. ACT: Artemis Comparison Tool. [http://www.sanger.ac.uk/resources/software/act/]
41. Mauve Genome Alignment Software. [http://asap.ahabs.wisc.edu/mauve/]
42. BLAST Ring Image Generator (BRIG). [http://brig.sourceforge.net/]
43. Alikhan NF, Petty NK, Ben Zakour NL, Beatson SA: BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons. BMC Genomics 2011, 12:402.
44. Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O, Aarestrup FM, Larsen MV: Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother 2012, 67:2640–2644.
45. ResFinder 1.3 (Acquired antimicrobial resistance gene finder). [http://cge.cbs.dtu.dk/services/ResFinder/]
46. Maiden MC: Multilocus sequence typing of bacteria. Annu Rev Microbiol 2006, 60:561–588.
47. Plasmid MLST Databases. [http://pubmlst.org/plasmid/]
48. MLST 1.5 (MultiLocus Sequence Typing). [http://cge.cbs.dtu.dk/services/MLST/]
49. SRST on SourceForge. [http://srst.sourceforge.net]
50. Inouye M, Conway TC, Zobel J, Holt KE: Short read sequence typing (SRST): multi-locus sequence types from short reads. BMC Genomics 2012, 13:338.
51. PHAST (PHAge Search Tool). [http://phast.wishartlab.com/]
52. PATRIC Blast Search. [http://www.patricbrc.org/portal/portal/patric/Blast]
53. NCBI BLAST Server. [http://blast.ncbi.nlm.nih.gov]
54. Croucher NJ, Harris SR, Fraser C, Quail MA, Burton J, van der Linden M, McGee L, von Gottberg A, Song JH, Ko KS, et al: Rapid pneumococcal evolution in response to clinical interventions. Science 2011, 331:430–434.
55. Harris SR, Feil EJ, Holden MT, Quail MA, Nickerson EK, Chantratita N, Gardete S, Tavares A, Day N, Lindsay JA, et al: Evolution of MRSA during hospital transmission and intercontinental spread. Science 2010, 327:469–474.
56. Li H, Homer N: A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 2010, 11:473–483.
57. Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z: A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform 2013, 14:56–66.
58. SEQanswers Wiki Software. [http://seqanswers.com/wiki/Software]
59. Nesoni. [http://www.vicbioinformatics.com/software.nesoni.shtml]
60. Galaxy - Data intensive biology for everyone. [http://galaxyproject.org/]
61. PATRIC - Pathogen Resource Integration Center. [http://www.patricbrc.org/]
62. PGAT - Prokaryotic Genome Analysis Tool. [http://tools.nwrce.org/pgat/]
63. Software Carpentry - The Shell. [http://software-carpentry.org/4_0/shell/]
64. Stein LD: Unix survival guide. In Current Protocols in Bioinformatics. Edited by Baxevanis AD, et al. US: John Wiley and Sons Inc; 2007. Appendix 1: Appendix 1C.
65. Bassi S: A primer on python for life science researchers. PLoS Comput Biol 2007, 3:e199.
66. Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25:1422–1423.

doi:10.1186/2042-5783-3-2
Cite this article as: Edwards and Holt: Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data. Microbial Informatics and Experimentation 2013 3:2.
