Ten Steps To Get Started in Genome Assembly and Annotation
OPINION ARTICLE
Ten steps to get started in Genome Assembly and Annotation
[version 1; peer review: 2 approved]
Victoria Dominguez Del Angel 1, Erik Hjerde 2, Lieven Sterck 3,4,
Salvador Capella-Gutierrez 5,6, Cedric Notredame 7,8, Olga Vinnere Pettersson 9,
Joelle Amselem 10, Laurent Bouri 1, Stephanie Bocs 11-13,
Brane L. Leskosek 16, Lucile Soler 17, Mahesh Binzer-Panchal 17, Henrik Lantz 17
1Institut Français de Bioinformatique, UMS3601-CNRS, Université Paris-Saclay, Orsay, 91403, France
2Department of Chemistry, Norstruct, UiT The Arctic University of Norway, Tromsø, 9019, Norway
3Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Ghent, Belgium
4VIB-UGent Center for Plant Systems Biology, Ghent University - VIB, Technologiepark 927, 9052 Ghent, Belgium
5Spanish National Bioinformatics Institute (INB), Barcelona, Spain
6Barcelona Supercomputing Center (BSC), Centro Nacional de Supercomputación, Barcelona, Spain
7Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Barcelona, Spain
8Universitat Pompeu Fabra (UPF), Barcelona, Spain
9Uppsala Genome Center, NGI/SciLifeLab, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, SE-752 37, Sweden
10URGI, INRA, Université Paris-Saclay, Versailles, 78026, France
11CIRAD, UMR AGAP, Montpellier, 34398, France
12AGAP, Cirad, INRA, Montpellier SupAgro, Universite Montpellier, Montpellier, France
13South Green Bioinformatics Platform, Montpellier, France
14Genotoul Bioinfo, MIAT, INRA Toulouse, Castanet-Tolosan, France
15Unité de recherche , INRA, Université Paris-Saclay, 78350 Jouy-en-Josas, France
16Faculty of Medicine, Institute for Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
17IMBIM/NBIS/SciLifeLab, Uppsala University, Uppsala, Sweden
Abstract
As a part of the ELIXIR-EXCELERATE efforts in capacity building, we present here 10 steps to facilitate researchers getting started in genome assembly and genome annotation. The guidelines given are broadly applicable, intended to be stable over time, and cover all aspects from start to finish of a general assembly and annotation project.
Intrinsic properties of genomes are discussed, as is the importance of using high quality DNA. Different sequencing technologies and generally applicable workflows for genome assembly are also detailed. We cover structural and functional annotation and encourage readers to also annotate transposable elements, something that is often omitted from annotation workflows. The importance of data management is stressed, and we give advice on where to submit data and how to make your results Findable, Accessible, Interoperable, and Reusable (FAIR).
Invited Reviewers (version 1, published 05 Feb 2018; reports from reviewers 1 and 2)
1 Bruno Contreras-Moreira, Fundación ARAID, Zaragoza, Spain
2 Dave Clements, Johns Hopkins University, Baltimore, USA
Any reports and responses or comments on the article can be found at the end of the article.
F1000Research 2018, 7(ELIXIR):148 Last updated: 18 SEP 2019
Keywords
Genome, Assembly, Annotation, FAIR, NGS, Workflows, DNA
This article is included in the ELIXIR gateway.
This article is included in the International Society
for Computational Biology Community Journal
gateway.
Corresponding author: Henrik Lantz ([email protected])
Author roles: Dominguez Del Angel V: Conceptualization, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing; Hjerde
E: Visualization, Writing – Original Draft Preparation; Sterck L: Conceptualization, Writing – Original Draft Preparation, Writing – Review & Editing;
Capella-Gutierrez S: Writing – Original Draft Preparation, Writing – Review & Editing; Notredame C: Writing – Original Draft Preparation; Vinnere
Pettersson O: Writing – Original Draft Preparation; Amselem J: Writing – Original Draft Preparation; Bouri L: Visualization, Writing – Original Draft
Preparation; Bocs S: Writing – Review & Editing; Klopp C: Writing – Review & Editing; Gibrat JF: Writing – Original Draft Preparation, Writing –
Review & Editing; Vlasova A: Visualization, Writing – Review & Editing; Leskosek BL: Funding Acquisition, Project Administration, Writing –
Review & Editing; Soler L: Writing – Review & Editing; Binzer-Panchal M: Writing – Review & Editing; Lantz H: Conceptualization, Funding
Acquisition, Project Administration, Writing – Original Draft Preparation, Writing – Review & Editing
Competing interests: No competing interests were disclosed.
Grant information: ELIXIR-EXCELERATE is funded by the European Commission within the Research Infrastructures Programme of Horizon
2020 [676559].
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Copyright: © 2018 Dominguez Del Angel V et al. This is an open access article distributed under the terms of the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
How to cite this article: Dominguez Del Angel V, Hjerde E, Sterck L et al. Ten steps to get started in Genome Assembly and Annotation
[version 1; peer review: 2 approved] F1000Research 2018, 7(ELIXIR):148 (https://doi.org/10.12688/f1000research.13598.1)
First published: 05 Feb 2018, 7(ELIXIR):148 (https://doi.org/10.12688/f1000research.13598.1)
Figure 1. Timeline and comparison of different sequencing technologies. The data is based on the throughput metrics for the different
platforms since their first instrument version came out. The figure visualises the results by plotting throughput in raw bases versus read length.
Data released under CC BY 4.0 International license. doi 10.6084/m9.figshare.100940.
SGS and Third-Generation Sequencing
SGS technologies have dominated the market, thanks to their ability to produce enormous volumes of data cheaply. Examples are the Illumina and Ion Torrent sequencers. Many remarkable projects like the 1000 Genomes Project15 and the Human Microbiome Project16 have been finished thanks to SGS technologies. However, some genes and important regions of interest are often not assembled correctly, mainly due to the presence of repeat elements in the sequences17. A promising solution is Third-Generation Sequencing (TGS) based on long reads18. TGS technologies have been used for the reconstruction of highly contiguous regions in eukaryotic genomes19,20 and de novo microbial genomes with high precision21. In terms of resequencing, TGS technology has generated detailed maps of the structural variations in multiple species and has covered many of the gaps in the human reference genome22,23.

Currently, the two most important third-generation DNA sequencing technologies are Pacific Biosciences (PacBio) Single Molecule Real Time (SMRT) and Oxford Nanopore Technology (ONT)24. These technologies can produce long reads averaging between 10,000 and 15,000 bp, with some reads exceeding 100,000 bp. However, these long reads exhibit per-sequence error rates of up to 10–15%, requiring a preliminary stage of correction before or after the assembly process. In fact, long read assembly has caused a paradigm shift in whole-genome assembly in terms of algorithms, software pipelines and supporting steps25.

Supporting technologies
There are also supporting technologies, most of which are used to improve the contiguity of already existing genome assemblies. These include optical mapping methods (e.g., BioNano), linked-read technologies (e.g., 10X Genomics Chromium system), or the genome folding-based approach of HiC26. In a rapidly changing field, it is difficult to recommend one of these technologies over the others. We advise researchers interested in assembling large genomes to read up on the current status of these methods when ordering sequence data, and remember to budget for them. For researchers interested in large-scale structural changes, the improvements of contiguity provided by these methods will be of extra interest.

Long reads definitely have an advantage over shorter reads when used in genome assembly as they deal with repeats much better. In practice, this often leads to less fragmented assemblies,
which is what most researchers are aiming for. The problems with third generation technologies are a higher price, a lack of availability in some countries, and sometimes higher requirements in terms of DNA amount and quality. Unless these complicating factors prevent the use of third generation long read technologies in your research project, we strongly recommend them over short read technologies. That being said, a combination of both might be even better, as the shorter reads have a different error profile and can be used to correct the longer ones27 (see Section 5).

4. Estimate the necessary computational resources
To succeed in a genome assembly and annotation project you need to have sufficient compute resources. The resource demands are different between assembly and annotation, and different tools also have very different requirements, but some generalities can be observed (for examples, see Table 1).

For genome assembly, running times and memory requirements will increase with the amount of data. As more data is needed for large genomes, there is thus also a correlation between genome size and running time/memory requirements. Only a small subset of available assembly programs can distribute the assembly into several processes and run them in parallel on several compute nodes. Tools that cannot do this tend to require a lot of memory on a single node, while programs that can split the process need less memory in each individual node but, on the other hand, work most efficiently when several nodes are available. It is therefore important to select the proper assembly tools early in a project, and make sure that there are enough available compute resources of the right type to run these tools.

Annotation has a different profile when it comes to computer resource use compared to assembly. When external data such as RNA-seq or protein sequences are used (something that is strongly recommended), mapping these sequences to the genome is a major part of the annotation process. Mapping is computationally intense, and it is highly preferable to use annotation tools that can run on several nodes in parallel.

Regarding storage, usually no extra consideration needs to be taken for assembly or annotation projects compared to other NGS projects. Intermediate files are often much larger than the final results, but can often be safely deleted once the run is finished.

5. Assemble your genome
In general, irrespective of the sequencing technology you choose, you would follow the same workflow (Figure 2). In the quality control (QC) stage the sequence reads are examined for overall quality and presence of adapters. Presence of contaminants can also be examined. In the assembly stage, several assemblers are often tried in parallel and the results are then compared in the assembly validation step, where mis-assemblies can also be identified and corrected. Often, assemblers are rerun with new parameters based on the results of the assembly validation. The aim is usually to create a genome assembly with the longest possible assembled sequences (least fragmented assembly) with the smallest number of mis-assemblies.

Quality control of reads and the actual genome assembly are different for the Illumina technology compared with long read technologies. These technologies will be discussed separately hereafter. We end this section with a discussion about assembly validation, which is similar for all technologies.

Illumina genome assembly
The most common approach to perform genome assemblies is de novo assembly, where the genome is reconstructed exclusively from the information of overlapping reads. For prokaryotes, it is also common to assemble with a reference genome, e.g., when complete strain collections are sequenced. The reference sequence can either be used as a template to 1) guide the mapping of reads, or 2) reorder the de novo assembled contigs.

In general, Illumina sequencing technology produces large amounts of high quality short sequence reads. The adapter and multiplex index sequences are screened for and removed after the base calling on the sequencing machine. However, it is highly recommended to assess the raw sequence data quality prior to assembly. Poor quality reads, ambiguous base calling, contamination, biases in the data and even technical issues on the sequencing chip are some, but not all, of the problems that can be detected early and corrected28. Also, if the sequencing libraries contain very short fragments, it is likely that the sequencing reaction will continue past the DNA insert and into the adapter at the 3’ end, a process known as adapter read-through, which may escape the adapter screening step on the sequencing machine29.

Assessing the quality of Illumina short reads
Assessing the quality of the sequence data is important, as it may affect downstream applications and potentially lead to erroneous conclusions. Base calling accuracy is commonly expressed as a Phred quality score (Q score), which encodes the probability that a given base is called incorrectly. Several tools are available for the quality assessment. FastQC30 is a commonly used tool that can be run both from the command line or through an interactive graphical user interface (GUI). It produces plots and statistics showing, among other things, the average and range of the sequence quality values across the reads, over-represented sequences and k-mers, which together can help the user interpret data quality. k-mers represent all subsequences of length k in a sequence read. Most methods for assembling or mapping reads are based on the use of k-mers. More in-depth analysis of k-mers can also be performed, for example using KAT31 to identify error levels, biases and contamination, and this also comes highly recommended.

Pre-processing of raw data
After having investigated the sequence data quality, informed decisions on downstream operations can be made. We would in general recommend that adapters are removed, although there are also assemblers that prefer working with the
Table 1. Examples of time and computer resources used by software dedicated to assembly and annotation. SPAdes is an assembler designed for
the assembly of small genomes using short reads. Smartdenovo is a de novo assembler for PacBio and Oxford Nanopore (ONT) data. The REPET package
is a software suite dedicated to detect, classify and annotate repeats. EuGene is an open integrative gene finder for eukaryotic and prokaryotic genomes.
Processing time and RAM used will be affected by amount of input data, complexity of data, and genome size.
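Because processing time and RAM grow with the amount of input data and with genome size, a rough depth-of-coverage estimate is a useful first check when planning resources. The sketch below is illustrative only: the function names, the example read count and genome size, and the 70x default (chosen here as a midpoint of the 60-80x range the text gives for de Bruijn assemblers) are assumptions, not values taken from Table 1.

```python
def estimate_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth of coverage = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def subsample_fraction(coverage: float, target: float = 70.0) -> float:
    """Fraction of reads to keep so that mean coverage drops to `target`.

    Returns 1.0 when the data set is already at or below the target,
    i.e. no subsampling is needed.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return min(1.0, target / coverage)


if __name__ == "__main__":
    # Hypothetical data set: 40 million reads of 150 bp, 5 Mb genome.
    cov = estimate_coverage(40_000_000, 150, 5_000_000)
    frac = subsample_fraction(cov)
    print(f"coverage: {cov:.0f}x, keep {frac:.1%} of reads")
```

An estimate like this only describes the average depth; real coverage varies along the genome, so it should be treated as a planning aid rather than a substitute for k-mer-based normalisation.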
Figure 2. General steps in a genome assembly workflow. Input and output data are indicated for each step.
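The assembly validation step of this workflow usually begins with contiguity statistics such as N50. As a minimal sketch (the function name `n50` and the example contig lengths are invented for illustration), N50 can be computed by sorting contig lengths from longest to shortest and returning the length at which the running total first reaches half of the total assembly size:

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are ranked from longest to shortest; N50 is the length of the
    contig at which the cumulative length first covers at least 50% of the
    summed length of all contigs.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical assembly with five contigs (lengths in bp).
    print(n50([100, 80, 60, 40, 20]))
```

A higher N50 indicates a less fragmented assembly, but it says nothing about whether contigs were joined correctly, so it is best read alongside mis-assembly and gene-completeness checks.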
raw data, including potential adapter sequences. It is highly recommended that the user studies the assembler documentation to determine whether the program requires quality-trimmed data or not. If trimming is required by the assembler, it would be sensible to omit poor quality data from further analysis by trimming low quality read ends and filtering out low quality reads. A variety of tools are available, such as PRINSEQ32, which offers a standalone command-line version, a version with a GUI and an online web based service, and Trimmomatic33.

Illumina machines produce a wide range of read numbers, from 10 million up to 20 billion (NovaSeq). Reducing the sequence coverage by subsampling for deeply sequenced genomes is recommended, as de Bruijn assemblers work best around 60-80x coverage34. Very high coverage at a particular genome location increases the risk either that the location is treated as a sequencing error or that sequencing errors propagate and start to look like true sequence. BBnorm35, a member of the BBTools package, is a common k-mer-based normalisation tool that can normalise highly covered regions to the expected coverage.

Short reads genome assembly
For the de novo assembly of short reads, the most commonly used algorithms are based on de Bruijn graphs, although other algorithms such as Overlap Layout Consensus (OLC)36 are still being used. One of the advantages of the de Bruijn graph approach over OLC is that it consumes less computational time and memory. A proper tool should be selected depending on the complexity of the genome to be assembled, such as its size, repeat content and ploidy. Some assembly tools, such as SPAdes12, work best with smaller amounts of data and are thus well adapted for bacterial projects, while others handle large amounts of data well and can be used for any type of project. These include allpaths-LG37 and Masurca38. Note that with large amounts of data, available RAM will be a limiting factor.

The characteristics of the genomes being assembled have a greater impact on the results than the choice of the algorithm. Haploid genomes with no sequence repeats will be much easier to reconstruct than genomes of polyploids or genomes with many sequence repeats, e.g. many plant species. The GAGE-B study39 showed that assembly software performing well on one organism performed poorly on another organism. Hence, it is wise to test several approaches: different software, assembly with or without pre-processing of the sequence data, and also different parameter settings. Another approach that will have an impact on the assembly is the use of mate pair sequencing. This enables the generation of long-insert paired-end DNA libraries with fragments up to 15 kb, and can be particularly useful in de novo sequencing. The large inserts can span regions problematic to the assembler, such as repetitive elements, anchor the paired reads in unique parts of the DNA, and reduce the number of contigs and scaffolds. Despite the enormous development in this field, it is still challenging to assemble large genomes from short reads. Further improvements, both in the assembly technology and in read length and fragment size, are needed for more accurate reconstruction of genomes.

Long read genome assembly
TGS developed by Pacific Biosciences or Oxford Nanopore is able to produce long reads with average fragment lengths of over 10,000 base-pairs that can be advantageously used to improve the genome assembly40. In fact, long reads can span stretches of repetitive regions and thus produce a more contiguous reconstruction of the genome. However, raw long reads have a high rate of sequencing error (5–20%). As a result,
some long read assemblers opt to correct these errors prior to assembly.

There are two main families of assemblers based on long reads:

• Long Reads Only assembler (LRO)
• Short and Long Reads combined assembler (SLR)

In general, LRO assemblers are based on the OLC algorithm. First, this algorithm produces alignments between long reads. Then it calculates the best overlap graph, and finally it generates the consensus sequence of the contigs from the graph. LRO assemblers require more sequencing coverage (minimum ~50X) from the long reads dataset than SLR assemblers. Schematically, SLR assemblers instead generate a de Bruijn graph pre-assembly using short reads, then the long reads are used to improve the pre-assembly by closing gaps, ordering contigs, and resolving repetitive regions. It is worth noting that some long read assemblers require corrected long reads as input. Software for correcting long reads is based on one of two strategies. The first strategy consists of aligning long reads against themselves. The second one uses short reads to correct long reads.

A document with guideline practices for long-read genome assemblies is available41. This document shows the performance of long read assembly benchmarked against 4 reference genomes: Acinetobacter DP1, Escherichia coli K12 MG1655, Saccharomyces cerevisiae W303 and Caenorhabditis elegans (sequenced in different TGS platforms and under different conditions). Among the 11 tools that have been evaluated, 8 use only long reads as input data, while the 3 others can assemble genomes using a mix of long and short reads. The tests show that it is strongly recommended to use long read correction software before the assembly42.

Assembly polishing
Although an error correction step may have been part of the assembler pipeline, errors can still be present in the assembly, particularly in long read assemblies. Polishing draft assemblies with either short or long reads can help to improve local base accuracy, in particular by correcting base calls and small insertion-deletion errors, and can also resolve some mis-assemblies caused by poor read alignment during the assembly43.

Scaffolding and gap filling
In scaffolding, assembled contigs are stitched together based on information from paired short reads. The unknown sequence between the contigs will be filled with Ns. If matching reads are instead used to join contigs together, for example long reads, actual sequence will fill in the gaps, and this is referred to as gap filling. In the case of an existing scaffolded assembly, long reads can also be used to replace the N-regions. Note that misassemblies in an existing assembly need to be broken prior to scaffolding in order to join the correct contigs together. Scaffolding and gap filling can be performed with low coverage44.

Determining whether the assembly is ready for annotation
Determining if the assembly is ready for annotation is a key step towards successful genome annotation. Errors in assemblies occur for many reasons. Genomic regions can be incorrectly discarded as artefacts or repeats. Others can be spliced together in the wrong places or in the wrong orientation. Unfortunately, there are few ways to distinguish what is real, what is missing, and what is an experimental artefact. There are, however, some statistics that are often used when choosing between assemblies, and some ways of identifying and removing potential problems.

N50 is often used as a standard metric to evaluate an assembly45. It is found by ranking the contigs from longest to shortest and taking the length of the contig at which the running sum of contig lengths first covers 50% of the total size of all contigs. It is thus a measure of contiguity, with higher numbers indicating lower levels of fragmentation. It is important to note that N50 is not a measure of correctness. So-called aggressive assemblers may produce longer contigs and scaffolds than conservative assemblers, but are also more likely to join regions in the wrong order and orientation. We recommend comparing the output from different assemblers (and of trimmed/filtered data). Assembly evaluation tools, such as Quast46, compare the metrics between assemblies, and allow the user to make educated choices to further improve and select the best assembly. If a reference sequence is available, Quast can also describe mis-assemblies and structural variations relative to the reference. If paired Illumina data is available, tools such as Reapr47 or FRCBam48 can be used to evaluate assemblies and to identify which assembly has the fewest misassemblies. If other organisms were present in the reads (contaminants or symbionts) and have been assembled together with the other reads, these contigs can be identified using for example Blobtools49 and removed, if necessary. To determine how many protein coding genes have been assembled, BUSCO50 is very useful. This tool looks for genes that should be present in a genome of the investigated taxonomic lineage, and reports the number of complete and fragmented genes found. Choosing the assembly with the highest percentage of complete genes could be given greater importance if the purpose of the genome project is to investigate protein coding genes.

Knowing when to stop assembly and move on to annotation is one of the most difficult decisions to take in genome assembly projects. It is always possible to try one more tool or one more setting, and this wish to improve the assembly just a little bit more can delay these types of projects substantially. It is best to have a goal in mind before starting assembly, and to stop when that goal has been reached. If you feel that you can answer the questions you had before starting, then the assembly is good enough for your purposes and it is probably time to move into annotation. It is always possible to release a new and improved version of the genome
later. Be aware that any changes to a genome assembly will most likely necessitate annotation to be re-started from scratch, and you should therefore be sure to “freeze” the assembly completely before starting annotation.

6. Do not neglect to annotate Transposable Elements
The genome annotation stage starts with repeat identification and masking.

There are two different types of repeat sequences: ‘low-complexity’ sequences (such as homopolymeric runs of nucleotides) and transposable elements. Transposable Elements (TEs) are key contributors to genome structure of almost all eukaryotic genomes (animals, plants, fungi). Their abundance, up to 90% of some genomes such as wheat51, is usually correlated with genome size and organization. The ability of TEs to move and to accumulate in genomes makes them major players in genome structure, plasticity, genetic variation and evolution. Interestingly, they can affect gene expression, structure and function when their insertion occurs in the vicinity of genes52 and sometimes through epigenetic mechanisms53.

TEs are classified in two classes including subclasses, orders and superfamilies according to mechanistic and enzymatic criteria. These two classes are based on their mechanism of transposition, using a copy-and-paste (Class I) or cut-and-paste (Class II) mechanism through RNA or DNA intermediates, respectively54.

TE annotation is nowadays considered a major task in genome projects and should be undertaken before any other genome annotation task such as gene prediction. Consequently, there has been a growing interest in developing new methods allowing an efficient computational detection, annotation, and analysis of these TEs, in particular when they are nested and degenerated. Many software tools have been developed to detect and annotate TEs55. One of the best known is RepeatMasker, which harnesses nhmmer, cross_match, ABBlast/WUBlast, RMBlast and Decypher as search engines and uses curated libraries of repeats, currently supporting Dfam (profile HMM library) and Repbase56,57.

Another important tool is the REPET package, one of the most used tools for large eukaryotic genomes with more than 50 genomes analyzed in the framework of international consortia. The REPET package is a suite of pipelines and tools designed to tackle biological issues at the genomic scale.

REPET consists of two main pipelines: TEdenovo and TEannot. First, TEdenovo detects and classifies TEs, then TEannot annotates TEs, including nested and degenerated copies58.

Depending on the complexity and number of detected TEs, additional rounds of TE identification and removal may be needed once the initial gene set has been produced. It is a common practice to analyze the functional annotation of the initial gene set to detect those genes which are primarily annotated with terms associated with TE activity. Those genes can be safely removed if they do not have homologous sequences in related species and/or their homologous sequences have been annotated as TE-related59.

7. Annotate genes with high quality experimental evidence
7.1. Structural annotation – where are the genes and what do they look like?
A raw genomic sequence is to most biologists of no great value as such. Genome annotation consists of attaching biologically meaningful information to genome sequences by analyzing their sequence structure and composition, as well as by considering what we know from closely related species, which can be used as reference. While genome annotation involves characterizing a plethora of biologically significant elements in a genomic sequence, most of the attention is spent on the correct identification of protein coding genes. This is not because the other types of genetic elements are of lesser importance, far from it, but mainly because the approaches to characterize them are either fairly straightforward (e.g. INFERNAL60 and tRNAscan-SE61 for non-coding RNA detection) or are the focus of more specialized analyses (e.g. transcription factor binding sites).

The process of correctly determining the location and structure of the protein coding genes in a genome, “gene prediction”, is fairly well understood, with many successful algorithms having been developed over the past decades. In general, there are three main approaches to predict genes in a genome: intrinsic (or ab initio), extrinsic and the combiners. Where the intrinsic approach focuses solely on information that can be extracted from the genomic sequence itself, such as coding potential and splice site prediction, the extrinsic way uses similarity to other sequence types (e.g. transcripts and/or polypeptides) as information. There are inherent advantages and disadvantages to each of those.

The intrinsic approach is labor intensive as statistical models need to be built and software needs to be trained and optimized. Of prime importance for this approach is a good training set, i.e. a set of structurally well annotated genes used to build models and to train gene prediction software. As each genome is different, these models and software must be specific to each genome and thus need to be rebuilt and retrained for each new species. This is, however, also the big advantage of this approach, as it is capable of predicting fast evolving and species specific genes.

The extrinsic way, on the other hand, is much more universally applicable. A vast number of polypeptide sequences are already described and available in databases (e.g. NCBI non-redundant protein, RefSeq, UniProt), which creates a wealth of information to be exploited in the gene prediction process. Transcript information, be it Sanger sequenced ESTs, RNA-Seq or even long read sequenced transcripts, plays an even bigger role
in this approach. High quality protein sequences of other species provide a good indication of the presence and location of genes and can be very useful to accurately predict the correct gene structure. Indeed, as polypeptide sequences often are more conserved than the underlying nucleotide sequences, they can still be aligned even from distantly related species. Although they are very useful to determine the presence of gene loci, they do not always provide accurate information on the exact structure of a gene. Transcripts on the other hand provide very accurate information for the correct prediction of the genes’ structure but are much less comprehensive and to some extent are noisier. Transcript information will not be available for all genes and sometimes introns can still be present due to incomplete mRNA processing. Nonetheless, accurate alignment of the extrinsic data is key here: transcripts need to be splice-aligned (taking the exon-intron structure of eukaryotic genes into account) and protein sequences need to be compared to the six translation frames of the nucleotide sequences. Moreover, it is a matter of thresholds: too stringent and less conserved genes will be missed, while too lenient will result in less specific information and introduce more false positives. These thresholds will depend on your objectives. A recommendation is to use lenient parameters in order to minimize the number of false negatives, as it is more difficult to create a new gene than to change the status of a false positive to obsolete. Then according to different confidence scores (e.g. coding potential, GO Evidence Codes), you can filter the gene set in order to provide, for instance, a high confidence gene set to train ab initio software, or a high confidence gene set to submit to a suitable repository and keep the full set for manual curation.

The combiners are probably the most popular and widely used gene prediction approach. They integrate the best of both worlds: they have an ab initio part that is then often complemented with extrinsic information (Figure 3). Especially nowadays, with the advances in sequencing technologies, these approaches are increasingly used, as reflected by the growing number of new tools and software trying to integrate RNA-Seq, protein or even intrinsic information. However, not all these combiners are the same. While some simply aim to pick the most appropriate model or build the consensus out of the provided input data (where an ab initio prediction tool might be one of them) for a given locus, others have a more integrated approach in which the intrinsic prediction can be modified by the given extrinsic data. The advantage of the latter is that they allow one type of information to overrule the other if this results in an overall more consistent prediction.

Figure 3. Simplified Illustration of a structural genome annotation using Combiners. On the left, the diagram shows a typical assembly process. At the end of the process, scaffolds or chromosomes ready to be annotated are obtained. These scaffolds are then annotated using two different methods. The first method is called ab initio and requires a known set of training genes. Once the ab initio tool has been trained it can be used to predict other similarly structured genes. The second, similarity-based approach relies on experimental evidence such as CDSs, ESTs, or RNA-seq to build gene models. Combiners (such as Maker or Eugene) can then incorporate all of these results, eliminate incongruences, and present gene models best supported by all methods.

Apart from the choice of which tool to use, the choice of which data to integrate also has an influence on the final result. This is especially the case for the use of protein information. Error propagation is a real danger. Therefore, curated datasets are preferred over the more general but less clean ones because it is vital that the provided information be as reliable as possible. The use of transcript information is less prone to error propagation although it is of importance that one realises what kind of data is being used. Short read RNA-Seq data is easily generated and is often an inherent part of a genome project. A downside is the short length of the reads. It will give accurate information on the location and existence of the exons but it will sometimes be more difficult to know how these exons are combined into a single gene structure. Therefore, it is becoming
common to complement the short read transcript data with long read transcript information. Those will often contain the full set of exons in a single read and will as such provide unambiguous information on the complete gene structure and even alternative transcripts.

When performing genome annotation, choices have to be made, not only what tools to use but equally important what kind of data to use. It is clear that the choice should go towards the more reliable but unfortunately sometimes less comprehensive data sources as the use of lower quality information will inevitably lead to an inferior gene prediction result.

7.2. Functional annotation

The ultimate goal of the functional annotation process (Figure 4) is to assign biologically relevant information to predicted polypeptides, and to the features they derive from (e.g. gene, mRNA). This process is especially relevant nowadays in the context of the NGS era due to the capacity of sequencing, assembling, and annotating full genomes in short periods of time, e.g. less than a month. Functional elements could range from a putative name and/or symbol for a protein-coding gene, e.g. ADH, to its putative biological function, e.g. alcohol dehydrogenase, associated gene ontology terms, e.g. GO:0004022, functional sites, e.g. METAL 47 47 Zinc 1, and domains, e.g. IPR002328, among other features. The function of predicted proteins can be computationally inferred based on the similarity between the sequence of interest and other sequences in different public repositories, e.g. BLASTP against UniProt. Caution should be taken when assigning results merely based on sequence similarity, as two evolutionarily independent sequences which share some common domains could mistakenly be considered homologs62. Thus, whenever possible, it is better to use orthologous sequences for annotation purposes rather than simply similar sequences63. With the growing number of sequences in those public repositories, it is possible to perform various searches and combine the obtained results into a consensus annotation. The accurate assignment of the functional elements is a complex process, and the best annotation will involve manual curation.

Figure 4. Functional Annotation Pipelines. This schema shows a typical functional annotation pipeline, in which functional roles are assigned to coding sequences (CDSs) inferred in the gene prediction process. The process implements three parallel routes for the definition of functions. The first refers to protein domains and motifs, the second to orthology search and the third to homology search. At the end, the output from the three different sources is put together for more valuable predictions.

There are two main outcomes of the functional annotation process. The first is the assignment of functional elements to genes. Downstream analysis of these elements allows further understanding of specific genome properties, e.g. metabolic pathways, and similarities compared with closely related species. The second result of the functional annotation is the additional quality check for the predicted gene set. It is possible to identify problematic and/or suspicious genes by the presence of specific domains, suspicious orthology assignment and/or absence of other functional elements, e.g. functional completeness. These
problematic genes can include those belonging to another species due to contamination, those detected as TEs, non-functional and/or artefactual genes annotated by error.

There are a number of tools available for functional annotation that allow users to obtain annotations for their gene set of interest via public databases in a high-throughput manner. These tools often start with a sequence similarity search using tools like BLAST, HMMER or LAST against the non-redundant sequence database from NCBI GenBank and/or UniProt reference clusters (UniRef). After the initial homology search, candidate sequences can be assigned to one or more orthology groups using either best-reciprocal or tree-based methods63. Alternatively, users can make use of machine learning methods, such as Hidden Markov Models (HMM) or neural networks, to predict particular patterns from a given input gene set. The majority of these tools are freely available to academic users, run under Linux and are often part of large-scale annotation pipelines.

For those users who do not want to run individual tools and combine results, there are a few available workflows that provide the entire annotation process. These pipelines can either include installation of the required tools and corresponding databases, or users are required to make this installation on their own and the pipeline just provides a framework for the analysis.

8. Use well-established output formats and submit your data to suitable repositories

Data formats

The output of a genome annotation pipeline is almost always in GFF format. The information captured includes the structure and often the function of features of the genome, but usually not the actual sequence. Together with the FASTA file that was used in the annotation process, the sequence of these features can however easily be extracted. Other output formats are GTF, BED, GenBank, and EMBL, of which the last two include both sequence and annotation and are often used when submitting annotation results to sequence repositories. Some of these formats use controlled vocabularies and ontologies to guarantee interoperability between analysis and visualisation tools. We highly recommend the adoption of FASTA and GFF3 output formats. Both formats are compatible with the Generic Model Organism Database (GMOD), a powerful suite of tools used for genome annotation, visualisation, and redistribution of genome data. By adhering to commonly used formats, you are making your results more useful to other researchers.

Data submission

To improve the availability and findability of results from genome annotation projects, the annotated sequences have to be submitted to databases, such as GenBank at the National Center for Biotechnology Information (NCBI)64 or the European Nucleotide Archive (ENA)65. In these archives, the information relating to experimental workflows is captured and displayed. A typical workflow includes: 1) the isolation and preparation of material for sequencing, 2) a run of a sequencing machine in which sequencing data are produced, and 3) a subsequent bioinformatic analysis pipeline. ENA records this information in a data model that covers input information (sample, experimental setup, machine configuration), output machine data (sequence traces, reads and quality scores) and interpreted information (assembly, mapping, functional annotation).

There are also a growing number of theme-based genome databases. Human genome sequence projects are recommended to use the European Genome-phenome Archive (EGA)66. EGA is a service for permanently archiving and sharing data resulting from biomedical research projects, and all types of personally identifiable genetic and phenotypic data can be included. This service provides the necessary security to control access and maintain the confidentiality of patient data, while providing access to researchers and authorized physicians to view the data. The data was collected from individuals whose consent agreements authorize the disclosure of data only for specific investigations.

9. Ensure your methods are computationally repeatable and reproducible

Reproducibility and repeatability have been reported as a major scientific issue when it comes to large scale data analysis67. For genomics to fulfil its complete scientific and social potential, in silico analysis must be repeatable, reproducible and traceable. Repeatability refers to the re-computation of an existing result with the original data and the original software. For instance, the authors report numerical instability arising from a mere change of Linux platforms, even when using exactly the same version of the genomic analysis tools.

Fortunately, solutions exist and along with their report of numerical instability, the authors did show that repeatability could be achieved through the efficient combination of container technology and workflow tools. Containers can be described as a new generation of lightweight virtual machines whose deployment has limited impact on performance. Container methods, such as Docker and Singularity, make it possible to compile and deploy software in a given environment, and to later re-deploy that same software in the same original environment while being hosted on a different host environment. Once encapsulated this way, analysis pipelines were shown to become entirely repeatable across platforms.

Several workflow management systems, such as Nextflow, Toil and Galaxy, have recently been reported as having the capacity to use and deploy containers. These tools all share the same philosophy: they make it relatively easy to define and implement new pipelines, and they provide more or less extensive support for the massively parallel deployment of these pipelines across high performance computational (HPC) infrastructures or over the cloud.

Containerization also provides a very powerful way of distributing tools in production mode. This makes it an integral part of the ongoing effort to standardise genome analysis tools. The wide availability of public software repositories, such as GitHub or Docker Hub, provides a context in which the
implementation of existing standards brings immediate benefits to the analysis in terms of cost, repeatability and dissemination across a wide variety of environments.

The choice of a workflow manager and the proper integration of the selected pipelines through a well-thought-out containerization strategy can therefore be considered an integral part of the genome annotation process, especially if one expects annotation to keep being updated over time. This makes the adoption of good computational practices like the ones described here an essential milestone for genomic analysis to become compliant with the new data paradigm. In order to carry this out, the first guidelines to make data “findable, accessible, interoperable and re-usable” (FAIR)68 were published in 2016. Even if FAIR principles were originally focused on data, they are sufficiently general that these high-level concepts can be applied to any Digital Object such as software or pipelines.

Repeatability is merely the most technical side of reproducibility. Reproducibility is a broader concept that encompasses any decision and bookkeeping procedure that could compromise the reproducibility of an established scientific result. For this reason the implementation of the FAIR principles also impacts higher-level aspects of the genome annotation strategy, and for a genomic project to be FAIR compliant, these good practices should be applied to data, meta-data and software. This can be achieved as follows:

Data and meta-data

Findable: Globally unique and persistent identifiers for data and metadata. Identifiers should persist across releases and make it possible to trace back older analyses and relate them to the current annotation. Deprecated annotation should remain traceable. Even when data is no longer available, meta-data should remain and provide a description of the original data.

Accessible: Proper registration of data and metadata in a suitable public or self-maintained repository. All data should be properly indexed, searchable and accessible by identifier using standardized protocols.

Interoperable: Data and meta-data must be deposited using the most commonly used format.

Reusable: Data and meta-data standards should ensure that the data is sufficiently well characterized to be effectively reused in future analysis or to be challenged by novel evaluation methods. Licensing should be as unrestrictive as possible.

Software and pipelines

Findable: Software and pipelines should be deposited in open-source registries along with proper technical descriptions allowing their rapid identification.

Accessible: Software should be deposited in public repositories such as GitHub or Docker Hub, so as to be available. Attempts should be made to keep the licensing as unrestrictive as possible. ELIXIR has taken up the challenge to provide a long-term sustainable infrastructure to host software containers. Thus, this is the desirable solution to ensure software accessibility.

Interoperable: Software should use the most common format and should be adequately documented. It should come along with proper versioning for both the software and the reference biological databases it operates upon. The software behavior should also be adequately described using the right metadata, thus allowing programmatic interaction with other resources.

Reusable: Software should be distributed in open-source format so as to ensure possible long term maintenance by third parties. Software should be encapsulated within containers, ensuring the permanent availability of production mode pipelines. Authors should be encouraged to develop their pipelines in commonly used workflow managers (Galaxy69, Nextflow70, Snakemake71). Decisions should be taken on the basis of a compromise between the level of usage of the selected workflow and its support of the required features. It should also contain meta-data describing which parameters have been used with the software in order to guarantee data reproducibility.

10. Investigate, re-analyse, re-annotate

Successful genome annotation projects do not just end with the publication of a paper; they should produce sustainable resources to promote, extend and improve the genome annotation life cycle.

Some genome consortia choose to manually review and edit their annotation data sets via jamborees, for instance the BioInformatics Platform for Agroecosystem Arthropods. Although this process is time- and resource-intensive, it provides opportunities for community building, education and training. All these elements help to improve the annotation life cycle and are promoted by the International Society for Biocuration. Manual and continuous annotation are critical to achieving accurate and reliable gene models, mRNA, TEs, regulatory sequences, among other elements. In addition, research communities will face the generation of a huge volume of new data including re-sequencing, transcriptomics, transcriptional regulation profiling, epigenetic studies, high-throughput genotyping and other related whole-genome functional studies. Thus, it is important to provide a software infrastructure to facilitate the updating of the genomic data.

Tools such as WebApollo72 from the GMOD project or web portals like ORCAE73 are particularly useful. These tools allow groups of researchers to review, add and delete annotations in a collaborative approach. The applications are robust and flexible enough to allow the members of a group to work simultaneously or at different times. The server administrator can open a session for a user, who, if authorized, can then edit the content.

Thanks to this system, annotations of genomes can be improved in a continuous cycle as data is collected and updated. In this way the annotations can always continue to improve.
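As a minimal sketch of how persistent identifiers support this continuous cycle, the following Python snippet compares the gene features of two GFF3 releases and reports which gene models were added, removed or changed. It assumes that a gene keeps the same ID attribute from one release to the next, as recommended under "Findable" above. The GFF3 records and the function names (read_gene_models, compare_releases) are made up for illustration and are not part of any tool mentioned in this article.

```python
def read_gene_models(gff3_text):
    """Return {gene ID: (seqid, start, end, strand)} for 'gene' features.

    Only the nine tab-separated GFF3 columns are used; comment and
    directive lines (starting with '#') are skipped.
    """
    models = {}
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds 'key=value' attributes separated by ';'.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs:
            models[attrs["ID"]] = (cols[0], int(cols[3]), int(cols[4]), cols[6])
    return models


def compare_releases(old_gff3, new_gff3):
    """Classify gene IDs as added, removed or changed between two releases."""
    old = read_gene_models(old_gff3)
    new = read_gene_models(new_gff3)
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(gid for gid in shared if old[gid] != new[gid]),
    }


# Hypothetical example data: two small releases of the same annotation.
RELEASE_1 = "\n".join([
    "##gff-version 3",
    "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene0001;Name=ADH",
    "chr1\texample\tgene\t1500\t2400\t.\t-\t.\tID=gene0002",
])
RELEASE_2 = "\n".join([
    "##gff-version 3",
    "chr1\texample\tgene\t100\t1000\t.\t+\t.\tID=gene0001;Name=ADH",
    "chr1\texample\tgene\t3000\t3800\t.\t+\t.\tID=gene0003",
])

if __name__ == "__main__":
    for status, gene_ids in compare_releases(RELEASE_1, RELEASE_2).items():
        print(status, gene_ids)
```

A report of this kind only works if identifiers are never recycled: a gene that is removed in one release should not have its ID reassigned to a different locus later, otherwise it would show up as "changed" rather than as one removal and one addition.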
Other useful tools include Artemis74, a successful curation software from the Sanger Institute, and GENCODE75, which seems to succeed the Havana team’s Vega76.

Concluding remarks/general recommendations

Genome assembly and genome annotation are areas where there are no gold standards. Projects are often explorative, and it is often hard to determine whether your results are good or bad. This is especially true if you are working with organisms only distantly related to already sequenced ones, which leaves you with little to compare with. Try to set an aim with your study, and stop working with the assembly and annotation once you have a result that allows you to reach that aim. Do not fall into the trap of wanting a “perfect” genome, as this tends to lead to a project that never ends. But also do not be afraid to start your own assembly and annotation project. With the development of new sequencing technologies it is more feasible than ever, and a well-assembled and annotated genome will be a resource you can use for many years to come. The recommendations we give are broad guidelines, and we try not to force readers into explicit technologies or software. We do, however, present advantages of certain NGS technologies in specific cases, for example when looking at genome properties such as genome size, complexity, or GC content. We also explain pitfalls to avoid throughout the whole assembly and annotation process. Finally, we also encourage the adoption of our guidelines regarding data deposition and reproducibility, as they offer a simple mechanism to improve the quality, findability, reusability and sustainability of results derived from genome assembly and genome annotation projects.

Competing interests

No competing interests were disclosed.

Grant information

ELIXIR-EXCELERATE is funded by the European Commission within the Research Infrastructures Programme of Horizon 2020 [676559].

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
1. Jansen HJ, Liem M, Jong-Raadsen SA, et al.: Rapid de novo assembly of the European eel genome from nanopore sequencing reads. Sci Rep. 2017; 7(1): 7213.
2. Badouin H, Gouzy J, Grassa CJ, et al.: The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution. Nature. 2017; 546(7656): 148–52.
3. Phillippy AM, Schatz MC, Pop M: Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 2008; 9(3): R55.
4. Chaisson MJ, Wilson RK, Eichler EE: Genetic variation and the de novo assembly of human genomes. Nat Rev Genet. 2015; 16(11): 627–40.
5. Pryszcz LP, Gabaldón T: Redundans: an assembly pipeline for highly heterozygous genomes. Nucleic Acids Res. 2016; 44(12): e113.
6. Chen YC, Liu T, Yu CH, et al.: Effects of GC Bias in Next-Generation-Sequencing Data on De Novo Genome Assembly. PLoS One. 2013; 8(4): e62856.
7. Endrullat C, Glökler J, Franke P, et al.: Standardization and quality management in next-generation sequencing. Appl Transl Genom. 2016; 10: 2–9.
8. Porebski S, Bailey LG, Baum BR: Modification of a CTAB DNA extraction protocol for plants containing high polysaccharide and polyphenol components. Plant Mol Biol Rep. 1997; 15(1): 8–15.
9. Blin N, Stafford DW: A general method for isolation of high molecular weight DNA from eukaryotes. Nucleic Acids Res. 1976; 3(9): 2303–2308.
10. Japelaghi RH, Haddad R, Garoosi GA: Rapid and Efficient Isolation of High Quality Nucleic Acids from Plant Tissues Rich in Polyphenols and Polysaccharides. Mol Biotechnol. 2011; 49(2): 129–37.
11. Tsai IJ, Hunt M, Holroyd N, et al.: Summarizing Specific Profiles in Illumina Sequencing from Whole-Genome Amplified DNA. DNA Res. 2014; 21(3): 243–54.
12. Bankevich A, Nurk S, Antipov D, et al.: SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J Comput Biol. 2012; 19(5): 455–77.
13. Lee H, Gurtowski J, Yoo S, et al.: Third-generation sequencing and the future of genomics. bioRxiv. 2016; 048603.
14. Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A. 1977; 74(12): 5463–7.
15. 1000 Genomes Project Consortium, Abecasis GR, Auton A, et al.: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012; 491(7422): 56–65.
16. Li J, Jia H, Cai X, et al.: An integrated catalog of reference genes in the human gut microbiome. Nat Biotechnol. 2014; 32(8): 834–41.
17. Schatz MC, Delcher AL, Salzberg SL: Assembly of large genomes using second-generation sequencing. Genome Res. 2010; 20(9): 1165–73.
18. Nagarajan N, Pop M: Sequence assembly demystified. Nat Rev Genet. 2013; 14(3): 157–67.
19. Rhoads A, Au KF: PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics. 2015; 13(5): 278–89.
20. Chen X, Bracht JR, Goldman AD, et al.: The architecture of a scrambled genome reveals massive levels of genomic rearrangement during development. Cell. 2014; 158(5): 1187–98.
21. Loman NJ, Quick J, Simpson JT: A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods. 2015; 12(8): 733–5.
22. Cao H, Hastie AR, Cao D, et al.: Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology. Gigascience. 2014; 3(1): 34.
23. Chaisson MJ, Huddleston J, Dennis MY, et al.: Resolving the complexity of the human genome using single-molecule sequencing. Nature. 2015; 517(7536): 608–11.
24. Lu H, Giordano F, Ning Z: Oxford Nanopore MinION Sequencing and Genome
Assembly. Genomics Proteomics Bioinformatics. 2016; 14(5): 265–79.
25. Myers EW Jr: A history of DNA sequence assembly. it - Information Technology. 2016; (3): 58.
26. Lieberman-Aiden E, van Berkum NL, Williams L, et al.: Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009; 326(5950): 289–93.
27. Koren S, Schatz MC, Walenz BP, et al.: Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012; 30(7): 693–700.
28. Heydari M, Miclotte G, Demeester P, et al.: Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC Bioinformatics. 2017; 18(1): 374.
29. Sturm M, Schroeder C, Bauer P: SeqPurge: highly-sensitive adapter trimming for paired-end NGS data. BMC Bioinformatics. 2016; 17: 208.
30. Andrews S: FastQC: a quality control tool for high throughput sequence data. 2010.
31. Mapleson D, Garcia Accinelli G, Kettleborough G, et al.: KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics. 2017; 33(4): 574–6.
32. Schmieder R, Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011; 27(6): 863–4.
33. Bolger AM, Lohse M, Usadel B: Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014; 30(15): 2114–20.
34. Desai A, Marwah VS, Yadav A, et al.: Identification of optimum sequencing depth especially for de novo genome assembly of small genomes using next generation sequencing data. PLoS One. 2013; 8(4): e60204.
35. Bushnell B: BBTools Software Package. 2017.
36. Li Z, Chen Y, Mu D, et al.: Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct Genomics. 2012; 11(1): 25–37.
37. Gnerre S, MacCallum I, Przybylski D, et al.: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A. 2011; 108(4): 1513–8.
38. Zimin AV, Marçais G, Puiu D, et al.: The MaSuRCA genome assembler. Bioinformatics. 2013; 29(21): 2669–77.
39. Magoc T, Pabinger S, Canzar S, et al.: GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics. 2013; 29(14): 1718–25.
40. Giordano F, Aigrain L, Quail MA, et al.: De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms. Sci Rep. 2017; 7(1): 1, 3935.
41. Bouri L, Lavenier D, Gibrat JF, et al.: Evaluation of genome assembly software based on long reads. Zenodo. 2017.
42. Salmela L, Rivals E: LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014; 30(24): 3506–14.
43. Walker BJ, Abeel T, Shea T, et al.: Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014; 9(11): e112963.
48. Vezzi F, Narzisi G, Mishra B: Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons. PLoS One. 2012; 7(12): e52210.
49. Laetsch DR, Blaxter ML: BlobTools: Interrogation of genome assemblies [version 1; referees: 2 approved with reservations]. F1000Res. 2017; 6: 1287.
50. Simão FA, Waterhouse RM, Ioannidis P, et al.: BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015; 31(19): 3210–2.
51. Choulet F, Wicker T, Rustenholz C, et al.: Megabase level sequencing reveals contrasted organization and evolution patterns of the wheat gene and transposable element spaces. Plant Cell. 2010; 22(6): 1686–701.
52. Lisch D: How important are transposons for plant evolution? Nat Rev Genet. 2013; 14(1): 49–61.
53. Slotkin RK, Martienssen R: Transposable elements and the epigenetic regulation of the genome. Nat Rev Genet. 2007; 8(4): 272–85.
54. Wicker T, Sabot F, Hua-Van A, et al.: A unified classification system for eukaryotic transposable elements. Nat Rev Genet. 2007; 8(12): 973–82.
55. Flutre T, Duprat E, Feuillet C, et al.: Considering Transposable Element Diversification in De Novo Annotation Approaches. PLoS One. 2011; 6(1): e16526.
56. Hoede C, Arnoux S, Moisset M, et al.: PASTEC: an automatic transposable element classification tool. PLoS One. 2014; 9(5): e91929.
57. Quesneville H, Bergman CM, Andrieu O, et al.: Combined evidence annotation of transposable elements in genome sequences. PLoS Comput Biol. 2005; 1(2): 166–75, e22.
58. Repet Tutorial [Internet]. [cited 2018 Feb 2].
59. Steinbiss S, Willhoeft U, Gremme G, et al.: Fine-grained annotation and classification of de novo predicted LTR retrotransposons. Nucleic Acids Res. 2009; 37(21): 7002–13.
60. Nawrocki EP, Eddy SR: Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013; 29(22): 2933–5.
61. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997; 25(5): 955–64.
62. Galperin MY, Koonin EV: Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol. 1998; 1(1): 55–67.
63. Kristensen DM, Wolf YI, Mushegian AR, et al.: Computational methods for Gene Orthology inference. Brief Bioinform. 2011; 12(5): 379–91.
64. NCBI Resource Coordinators: Database Resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2017; 45(D1): D12–7.
65. Leinonen R, Akhtar R, Birney E, et al.: The European Nucleotide Archive. Nucleic Acids Res. 2011; 39(Database issue): D28–31.
66. Lappalainen I, Almeida-King J, Kumanduri V, et al.: The European Genome-phenome Archive of human data consented for biomedical research. Nat Genet. 2015; 47: 692–5.
67. Munafò MR, Nosek BA, Bishop DV, et al.: A manifesto for reproducible science. Nat Hum Behav. 2017; 1: 0021.
44. English AC, Richards S, Han Y, et al.: Mind the gap: upgrading genomes with 68. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al.: The FAIR Guiding Principles
Pacific Biosciences RS long-read sequencing technology. PLoS One. 2012; for scientific data management and stewardship. Sci Data. 2016; 3: 160018.
7(11): e47768. PubMed Abstract | Publisher Full Text | Free Full Text
PubMed Abstract | Publisher Full Text | Free Full Text 69. Afgan E, Baker D, van den Beek M, et al.: The Galaxy platform for accessible,
45. Yandell M, Ence D: A beginner’s guide to eukaryotic genome annotation. Nat reproducible and collaborative biomedical analyses: 2016 update. Nucleic
Rev Genet. 2012; 13(5): 329–42. Acids Res. 2016; 44(W1): W3–10.
PubMed Abstract | Publisher Full Text PubMed Abstract | Publisher Full Text | Free Full Text
46. Gurevich A, Saveliev V, Vyahhi N, et al.: QUAST: quality assessment tool for 70. Di Tommaso P, Chatzou M, Floden EW, et al.: Nextflow enables reproducible
genome assemblies. Bioinformatics. 2013; 29(8): 1072–5. computational workflows. Nat Biotechnol. 2017; 35: 316–319.
PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text
47. Hunt M, Kikuchi T, Sanders M, et al.: REAPR: a universal tool for genome 71. Köster J, Rahmann S: Snakemake--a scalable bioinformatics workflow engine.
assembly evaluation. Genome Biol. 2013; 14(5): R47. Bioinformatics. 2012; 28(19): 2520–2.
PubMed Abstract | Publisher Full Text | Free Full Text PubMed Abstract | Publisher Full Text
F1000Research 2018, 7(ELIXIR):148 Last updated: 18 SEP 2019
72. Lee E, Helt GA, Reese JT, et al.: Web Apollo: a web-based genomic annotation editing platform. Genome Biol. 2013; 14(8): R93.
73. Sterck L, Billiau K, Abeel T, et al.: ORCAE: online resource for community annotation of eukaryotes. Nat Methods. 2012; 9(11): 1041.
74. Carver T, Berriman M, Tivey A, et al.: Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database. Bioinformatics. 2008; 24(23): 2672–6.
75. GENCODE - Home page [Internet]. [cited 2018 Jan 12].
76. Vega archive [Internet]. [cited 2018 Jan 12].
Version 1
https://doi.org/10.5256/f1000research.14771.r30566
Dave Clements
Department of Biology, Johns Hopkins University, Baltimore, MD, USA
This paper does a good job of covering the big picture of what's needed to assemble and annotate a
genome. Its stated goal is to give guidelines that are "broadly applicable" and "intended to be stable over
time." This paper achieves that goal, and all of my concerns are minor. However, "stable over time" was
a frustrating goal for me. This means that comments on specific sequencing technologies and software
are not as common as they would be in a review article on the state of the art. This is a fine line to walk.
Specific comments
Introduction
1. Think about dropping or shortening the checklist at the end of the introduction. I believe that all of this
is covered in detail in the individual sections.
Repeats
1. "Contigs" is used here for the first time. I am not sure whether the term needs to be explained. (Could reference figure 3.)
Chemical purity
1. What ONT means is explained later. Move that explanation here.
2. The text says:
It is widely known in the PacBio community that samples rich in contaminants...
Are there any references for this? This highlights a larger question with the paper. It is an opinion article
and it contains many opinions such as
The characteristics of the genomes being assembled have a greater impact on the results than the
choice of the algorithm.
I am not disputing any of these statements, but if and when references exist that support them, then those
references should be included.
The very end of this section states:
Be aware that any changes to a genome assembly will most likely necessitate annotation to be
re-started from scratch, and you should therefore be sure to “freeze” the assembly completely before
starting annotation.
I think "restarted from scratch" gives the wrong impression. Tools such as MAKER can do liftover from
one version of an assembly to the next. Perhaps this could be clarified?
Toll -> Toil ?
Is the topic of the opinion article discussed accurately in the context of the current literature?
Yes
Are the conclusions drawn balanced and justified on the basis of the presented arguments?
Yes
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of
expertise to confirm that it is of an acceptable scientific standard.
https://doi.org/10.5256/f1000research.14771.r31487
Bruno Contreras-Moreira
Estación Experimental de Aula Dei-CSIC, Fundación ARAID, Zaragoza, Spain
The opinion article by Victoria Dominguez Del Angel and collaborators is a well-written sort of broad
tutorial which will certainly be of help for anyone trying to sequence and assemble a genome sequence
with little previous experience. While it avoids details that are required when the real work is actually
carried out (for instance, K-mer length), it discusses important questions that must be addressed before
carrying out genomic projects, and choices that must be made along the way down to the point that data
and procedures are published and released.
Next I comment on specific parts of the text that I believe can be improved.
0. Introduction
0.1 Genomics in 2018 cannot possibly be done without a pan-genome perspective. This has been true for
years in the area of microbiology, where projects now rarely assemble and annotate a single genome, but
rather a few dozen related strains. In addition, this is also becoming state-of-the-art for genomic projects
of crops and model plants, such as Brachypodium distachyon, as well as in human medicine. A couple of
sentences should be added explaining that in this context a group of genomes is sequenced, assembled
and annotated in parallel, which makes it more challenging but also facilitates spotting and correcting
errors.
0.2 I would add to the checklist a literature survey to identify related genomes.
1.1 Heterozygosity
I would add a sentence about the outcrossing or selfing nature of species, which has a direct impact on
the expected heterozygosity, and often limits the possibility to obtain inbred individuals. In plants
double-haploids are used to this end (see for instance 1).
1.2 Ploidy level
Please note that it has been estimated that many plant species are polyploid 2-3. One strategy to solve
these complex genomes is to first sequence the genomes of the expected/known parental species.
2.1 Organelle DNA
I would add that frequently chloroplast genomes or plastomes are of high interest as they can provide a
complementary, maternally-biased evolutionary history. This might be of particular interest for polyploid
species as it might help determine which was the maternal and which the paternal genome donor. Even
with a low ratio of organelle DNA, in our experience it is likely that complete chloroplasts can be assembled
and annotated (see for instance 4) as a by-product of whole genome sequencing. Instead, mitochondria
seem to be difficult to assemble in plants.
I would add that the assembly tools selected at the time the proposal was written are likely to be replaced
by others by the time the work is actually performed, due to the pace of innovation in this area.
In the last left-side paragraph it is said that “misassemblies in an existing assembly need to be broken
prior to scaffolding in order to join the correct contigs together.” This is followed by another sentence
later on “Unfortunately, there are few ways to distinguish what is real, what is missing, and what is an
experimental artefact.”
In our experience many scaffolding errors can be spotted by mapping back the original sequence reads to
the assembly and then visualizing the results with browsers such as IGV. Of course this can be
cumbersome for very large assemblies, but tools such as SEQuel and REAPR can be really helpful for such
tasks.
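As a complement to visual inspection, the read-mapping check described above can be roughly approximated in a script. The following is a minimal sketch, assuming per-base read depth along one scaffold has already been computed from the mapped reads (for example with a tool such as samtools depth) and loaded into a list. The function name, the window size and the threshold are hypothetical choices for illustration and are not taken from SEQuel or REAPR.

```python
from statistics import median

def flag_low_coverage_windows(depths, window=1000, min_fraction=0.25):
    """Return (start, end) windows whose mean depth is below
    min_fraction of the scaffold-wide median depth.

    depths: list of per-base read depths for one scaffold (assumed input).
    Coordinates are 0-based, end-exclusive.
    """
    if not depths:
        return []
    typical = median(depths)
    flagged = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        if mean_depth < min_fraction * typical:
            flagged.append((start, start + len(chunk)))
    return flagged

# Toy scaffold: 30x coverage with a 10-base dip to 2x that may mark a bad join.
toy_depths = [30] * 20 + [2] * 10 + [30] * 20
suspect = flag_low_coverage_windows(toy_depths, window=10, min_fraction=0.25)
```

Windows flagged this way are only candidates. A coverage dip can also reflect a region that is hard to sequence, so each one still deserves a look in a browser such as IGV before a scaffold is broken.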
5.1 BUSCO is shown as a way of evaluating assembly quality on page 10.
I would add that in pan-genome projects this can be generalized so that assemblies, and subsequent
annotations, can be evaluated in terms of the proportion of core genes they contain. Poor assemblies can
frequently be identified by having fewer core genes than expected.
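The core-gene criterion can be illustrated with a short calculation. This is a minimal sketch, assuming the set of core genes and the list of genes detected in each assembly are already available as gene identifiers. The function names, the gene lists and the 0.9 threshold are hypothetical and chosen only for illustration.

```python
def core_gene_completeness(core_genes, assemblies):
    """Map each assembly name to the fraction of core genes found in it.

    core_genes: iterable of core gene identifiers (assumed input).
    assemblies: dict of assembly name -> iterable of detected gene identifiers.
    """
    core = set(core_genes)
    if not core:
        raise ValueError("core gene set is empty")
    return {name: len(core & set(found)) / len(core)
            for name, found in assemblies.items()}

def flag_poor_assemblies(completeness, threshold=0.9):
    """Return names of assemblies whose core-gene fraction is below threshold."""
    return sorted(name for name, frac in completeness.items() if frac < threshold)

# Toy data: four core genes, two strains.
core = ["dnaA", "gyrB", "recA", "rpoB"]
found = {
    "strain_A": ["dnaA", "gyrB", "recA", "rpoB", "accessory1"],
    "strain_B": ["dnaA", "recA"],
}
completeness = core_gene_completeness(core, found)
poor = flag_poor_assemblies(completeness, threshold=0.9)
```

In this toy case strain_B contains only half of the core genes and is flagged, whereas accessory genes in strain_A do not affect its score.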
7.1 I would stress that different annotation strategies often yield annotation sets that are implicitly biased.
Therefore, users who plan to compare their genome to others should make an effort to use a similar
approach so that any conclusions regarding issues such as gene family expansions are not caused by the
underlying methodology. Indeed we have seen this happening when annotating a microbial pan-genome
and then comparing it to genomes in public databases.
7.2 I would add a section on microbial genome annotation, mentioning popular tools such as PROKKA, RAST
or NCBI Prokaryotic Genome Annotation Pipeline, and commenting on the annotation of bacterial features
such as CRISPRs or plasmids.
7.3 On page 12, 1st paragraph, it is stated that “Transcripts on the other hand provide very accurate
information for the correct prediction of the genes”
I think it should definitely be mentioned that, unlike HQ protein sequences, transcripts allow the
annotation of untranslated regions (UTRs) and, despite their noise and the isoform deluge, can also be used to
define gene promoters, which can then be annotated in terms of regulation.
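To make the point about UTRs concrete: once a transcript alignment provides exon boundaries and a coding region has been predicted, the UTRs are simply the exonic parts outside the coding region. The following is a minimal sketch under those assumptions, using 1-based inclusive coordinates as in GFF3. The function name and the example coordinates are hypothetical.

```python
def utr_intervals(exons, cds_start, cds_end, strand="+"):
    """Split the exonic parts lying outside the CDS into 5' and 3' UTRs.

    exons: list of (start, end) tuples, 1-based inclusive (assumed input).
    cds_start, cds_end: leftmost and rightmost coding positions on the genome.
    strand: "+" or "-"; on the minus strand the 5' UTR lies to the right.
    """
    left, right = [], []
    for start, end in sorted(exons):
        if start < cds_start:
            # Exonic bases to the left of the coding region.
            left.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            # Exonic bases to the right of the coding region.
            right.append((max(start, cds_end + 1), end))
    if strand == "+":
        return {"five_prime_UTR": left, "three_prime_UTR": right}
    return {"five_prime_UTR": right, "three_prime_UTR": left}

# Toy transcript with three exons and a CDS spanning 150..550.
toy_exons = [(100, 200), (300, 400), (500, 600)]
plus = utr_intervals(toy_exons, 150, 550, strand="+")
minus = utr_intervals(toy_exons, 150, 550, strand="-")
```

Protein evidence alone could only support the coding interval in this example. The flanking intervals come entirely from the transcript.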
7.4 I miss a section on comparison of gene order/synteny
8. Use well-established output formats and submit your data to suitable repositories
I would add that soft/hard repeat-masked versions of the genome sequence can be made available in
FASTA format, which are helpful for subsequent analyses of regulatory sequences.
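The two masked versions are closely related: by common convention, soft-masking writes repeats in lowercase and hard-masking replaces them with N, so a hard-masked file can be derived from a soft-masked one. The following is a minimal sketch of that conversion, assuming the soft-masked FASTA is supplied as an iterable of text lines. The function names and the example record are hypothetical.

```python
def hard_mask(sequence, mask_char="N"):
    """Replace soft-masked (lowercase) bases with mask_char."""
    return "".join(mask_char if base.islower() else base for base in sequence)

def hard_mask_fasta(lines):
    """Yield FASTA lines with sequence lines hard-masked and headers unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else hard_mask(line)

# Toy soft-masked record: lowercase stretches stand for annotated repeats.
soft = [">scaffold_1 toy example", "ACGTacgtacgtTTGA", "ggccAATT"]
hard = list(hard_mask_fasta(soft))
```

The conversion is one-way, since hard-masking discards the underlying bases. That is one reason to release the soft-masked version as well.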
Minor edits:
Page 3, 1st paragraph:
“The advice here presented is based on a need seen while working in the ELIXIR-EXCELERATE task
“Capacity Building in Genome Assembly and Annotation”. In this capacity we have”
can be changed to
“The advice here presented is based on a need first seen while working in the ELIXIR-EXCELERATE task
“Capacity Building in Genome Assembly and Annotation”. In this project we have”
Page 6, 2nd paragraph
Why is “or after” in bold?
Page 14, left column
Fasta format should be FASTA format
References
1. Daccord N, Celton JM, Linsmith G, Becker C, Choisne N, Schijlen E, van de Geest H, Bianco L, Micheletti D, Velasco R, Di Pierro EA, Gouzy J, Rees DJG, Guérif P, Muranty H, Durel CE, Laurens F, Lespinasse Y, Gaillard S, Aubourg S, Quesneville H, Weigel D, van de Weg E, Troggio M, Bucher E: High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development. Nat Genet. 2017; 49 (7): 1099-1106
2. Otto SP, Whitton J: Polyploid incidence and evolution. Annu Rev Genet. 2000; 34: 401-437
3. Doyle J, Sherman-Broyles S: Double trouble: taxonomy and definitions of polyploidy. New Phytologist. 2017; 213 (2): 487-493
4. Sancho R, Cantalapiedra CP, López-Alvarez D, Gordon SP, Vogel JP, Catalán P, Contreras-Moreira B: Comparative plastome genomics and phylogenomics of Brachypodium: flowering time signatures, introgression and recombination in recently diverged ecotypes. New Phytol. 2017.
Is the topic of the opinion article discussed accurately in the context of the current literature?
Yes
Are the conclusions drawn balanced and justified on the basis of the presented arguments?
Yes
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of
expertise to confirm that it is of an acceptable scientific standard.
Reader Comment 16 Feb 2019
Steven Sullivan, New York University, USA
Very useful, but it would be more useful still if section 7.2 on Functional Annotation, where it says,
"There are a number of tools available for functional annotation that allow users to obtain
annotations for their gene set of interest via public databases in a high-throughput manner."
was accompanied by a table listing tools and web servers used for this purpose. The more
comprehensive the better.
Competing Interests: No competing interests were disclosed.
Reader Comment 06 Jul 2018
Yoosook Lee, UC Davis, USA
Is it possible to update Figure 1 to change marker colors? There are three green markers (454 GS FLX,
Illumina Nextseq 500, and Pacbio RS) and it's hard to tell which one is which for those who are not familiar
with those techniques.
Competing Interests: I do not have any competing interest.
Reader Comment 19 Feb 2018
Marc Robinson-Rechavi, Ecology and Evolution, Universite de Lausanne, Switzerland
Figure 1 needs more diversity of color and line types to be readable please.
Competing Interests: No competing interests.