Skip to main content

NucBreak: location of structural errors in a genome assembly by using paired-end Illumina reads

Abstract

Background

Advances in whole genome sequencing strategies have provided the opportunity for genomic and comparative genomic analysis of a vast variety of organisms. The analysis results are highly dependent on the quality of the genome assemblies used. Assessment of the assembly accuracy may significantly increase the reliability of the analysis results and is therefore of great importance.

Results

Here, we present a new tool called NucBreak aimed at localizing structural errors in assemblies, including insertions, deletions, duplications, inversions, and different inter- and intra-chromosomal rearrangements. The approach taken by existing alternative tools is based on analysing reads that do not map properly to the assembly, for instance discordantly mapped reads, soft-clipped reads and singletons. NucBreak uses an entirely different and unique method to localise the errors. It is based on analysing the alignments of reads that are properly mapped to an assembly and exploit information about the alternative read alignments. It does not annotate detected errors. We have compared NucBreak with other existing assembly accuracy assessment tools, namely Pilon, REAPR, and FRCbam as well as with several structural variant detection tools, including BreakDancer, Lumpy, and Wham, by using both simulated and real datasets.

Conclusions

The benchmarking results have shown that NucBreak in general predicts assembly errors of different types and sizes with relatively high sensitivity and with lower false discovery rate than the other tools. Such a balance between sensitivity and false discovery rate makes NucBreak a good alternative to the existing assembly accuracy assessment tools and SV detection tools. NucBreak is freely available at https://github.com/uio-bmi/NucBreak under the MPL license.

Background

Advances in whole genome sequencing technologies have led to a greatly increased number of organisms with sequenced genomes over the recent years. This has provided the opportunity to make genomic and comparative genomic analysis of a vast variety of organisms. The analysis results are highly dependent on the quality of the genome assemblies used. Any errors in an assembly directly impair analysis predictions and inferences based upon them [1]. The assessment of assembly accuracy may significantly increase the reliability of analysis results and is therefore of great importance.

There are several tools developed for genome assembly accuracy assessment, i.e. REAPR [2], FRCbam [3] and Pilon [4]. These tools identify regions with various inconsistencies in the alignments of reads mapped back to the assembly and detect the locations of assembly errors. The inconsistencies include abnormal read coverage, abnormal distance between reads in a pair relative to the insert size, wrong orientation of one or both reads in a pair, and a large percentage of soft-clipped reads (reads that are partly mapped to an assembly: the one end of the read is mapped to the reference while the second is not) and singletons (reads whose partner was not mapped). The tools are aimed at detecting structural errors including medium to long insertions and deletions, as well as inversions, duplications, and inter- and intra-chromosomal rearrangements. Pilon also enables detection of small insertions, deletions and substitutions and performs local assembly to fix detected assembly errors where possible.

The genome assembly accuracy assessment problem is very similar to the structural variant (SV) detection problem. The tools developed to detect structural variants between genomes of the same or closely related species, such as Wham [5], BreakDancer [6] and Lumpy [7], are based on the approaches similar to the ones implemented in REAPR, Pilon and FRCbam. They exploit the same types of inconsistencies in the read alignments in their workflow. The usage of such tools may be a possible alternative to the tools developed for genome assembly error detection.

In addition to the tools mentioned above, it is also possible to use optical mapping data [8] or reads produced by the third-generation sequencing technologies, such as Single-molecule real-time (SMRT) technology developed by Pacific BioSciences (PacBio) [9] and the Oxford Nanopore Technologies (ONT) [10]. Optical mapping data and PacBio reads has been already successfully used to detect assembly errors [11] and to detect structural variants [12], respectively. The use of this kind of technology and data may be more straightforward and accurate for detection of assembly errors than relying on paired-end data, but it would also require additional resources and expenses that may not be available. The methods presented in this work can be applied without additional data.

Here we present a new tool called NucBreak aimed at genome assembly accuracy assessment. In contrast to other tools examining reads that do not map very well to the assembly, like discordantly mapped reads, soft-clipped reads and singletons, NucBreak rather investigates the reads that map well. NucBreak analyses the alignments of these properly mapped reads (where both reads in a pair are fully aligned in correct orientation at a reasonable distance) and exploits information about the alternative read alignments to detect the locations of assembly structural errors. The tool has been compared to REAPR, FRCbam and Pilon, the only existing tools detecting assembly error locations, as well as BreakDancer, Lumpy, and Wham. We have chosen BreakDancer, Lumpy, and Wham because they were developed to perform the analysis in whole genomes of different species and detect various types of structural variants compared to other existing SV detection tools. All tools have been tested for their ability to detect errors in assemblies by using either simulated or real datasets. The test results have shown that NucBreak enables prediction of assembly errors with lower false discovery rate (FDR) than other tools, keeping relatively high level of sensitivity at the same time.

Implementation

NucBreak is a tool created to detect structural errors in an assembly by using paired-end Illumina reads. The reads are first mapped to the assembly, and then the mapping results are rigorously analysed to detect the assembly errors locations. The NucBreak workflow is shown in (Additional file 1: Fig. S1).

Read mapping

The error detection process starts with mapping reads to the assembly by using Bowtie2 [13]. Bowtie2 is run separately for each read file with the parameter settings “--sensitive_local --ma 1 -a” to report all local alignments with an added nucleotide match bonus. The resulting sam files contain all possible alignments for each read not depending on the second read in a pair. A read alignment may contain either a full read sequence or a read sequence clipped on one or both ends. The read clipping is performed when one or both ends of a read accumulate a high number of differences compared to the assembly. The clipped part of a read may be mapped to another location in the assembly or remained unmapped. There may be a few short substitutions, insertions and/or deletions inside mapped reads or their parts.

Properly mapped read pair formation and categorization

Once the mapping results have been obtained, NucBreak combines reads into properly mapped read pairs and categorizes the pairs into several groups. A pair of reads is considered to be properly mapped if all of the five following conditions are fulfilled:

  1. 1.

    Both reads are mapped to the same assembly sequence.

  2. 2.

    The reads have different orientations relative to the assembly sequence.

  3. 3.

    The read with the reverse orientation is located at the same position or further down on the sequence compared to the mapping locations of the forward-oriented read.

  4. 4.

    The beginnings of the read sequences (the first bases of the read sequences as they are given in the input files) are not clipped. The exception is made only for the forward-oriented read mapped to the very beginning of the assembly sequence and the reverse-oriented read mapped to the very end of the assembly sequence.

  5. 5.

    The reads have a proper insert size (see (Additional file 1) for the details about the insert size detection approach).

The alignments of properly mapped reads may contain short substitutions, insertions and deletions.

To combine reads into properly mapped read pairs, NucBreak analyses all possible combinations of the read mapping locations for each input read pair and forms properly mapped read pairs from those reads whose locations satisfy the five conditions mentioned above. Each input read pair may give rise to none, one or several properly mapped read pairs (see (Additional file 1: Fig. S2)).

Then the created properly mapped read pairs are divided into 4 groups, based on the presence of alternative alignments for each read in a pair:

  1. 1.

    Single group - consisting of pairs where both reads are mapped uniquely to a sequence. The pairs from this group point to the non-repeated regions of a genome (Additional file 1: Fig. S3a).

  2. 2.

    Single_Multiple group - consisting of pairs where the forward-oriented read is mapped uniquely to a sequence and the reverse-oriented read has multiple alternative mapping locations. The pairs point to the regions where non-repeated regions end and repeated regions start (Additional file 1: Figure S3b).

  3. 3.

    Multiple_Single group - consisting of pairs where the forward-oriented read has multiple alternative mapping locations and the reverse-oriented read is mapped uniquely to a sequence. The pairs point to the regions where repeated regions end and non-repeated regions start (Additional file 1: Figure S3c).

  4. 4.

    Multiple group - consisting of pairs where both reads have multiple mapping locations. The pairs point to the repeated regions of a genome (Additional file 1: Fig. S3d).

Read path creation and path gap detection

During the third step, reads from each group are merged together to form continuous paths. This is done separately for forward- and reverse-oriented reads. Only neighbouring reads having an overlap of more than 5 bases are involved in the merging process. There may be small substitution, deletion and insertion differences in reads inside the overlapping regions. If neighbouring reads overlap with 5 or less bases, the overlapped bases are clipped, creating an uncovered region between them. The 5-base limit has been introduced to exclude overlaps appearing due to uncertainties in alignment rather than actual overlaps of positions. The obtained paths represent the fragments of a genome that are considered free of assembly errors.

Usually, several paths of the same type and orientation cover a full assembly sequence. The assembly sequence regions located between paths of the same type and orientation are called path gaps (see Additional file 1: Figure S4). The path gaps may potentially contain assembly errors and, therefore, are extensively analysed by NucBreak during the next step.

Path gap analysis

There can be several reasons for path gaps to appear. First, a path gap may appear due to the presence of an assembly error in this region. Second, a path gap may correspond to a region covered by paths of other types. Third, a path gap may appear when there is not enough read coverage to provide the required overlap between reads. Such a situation may occur when: (1) a genome or its fragments were sequenced with a very low coverage, (2) read pairs from these regions are absent due to sequencing errors in reads, (3) read pairs are filtered out due to a violation of condition 4: when there are sequencing errors in the beginning of one of the read in a pair, and (4) there are gaps (a subsequence of N’s) in the genome.

The goal of the fourth step is to exclude path gaps that do not contain assembly errors. NucBreak starts with excluding path gaps that do not overlap with path gaps between paths of the same type in the opposite orientation (Fig. 1a). Such situations are often observed in Single and Multiple paths and are due to low coverage by either forward- or reverse-oriented reads.

Fig. 1
figure 1

Path gap exclusion. The black line represents an assembly. The assembly regions marked by red colour correspond to repeated regions. The repeated regions are identical or near-identical copies of the same repeat or copies of different repeats. The arrows represent read paths. (a) Exclusion of a path gap fully covered by a read path of the same type and another orientation. The rectangles between read paths indicate path gaps. Path gap 1 is excluded due to the presence of a required read path. The path gaps marked by number 2 are not excluded and require further analysis. (b) Exclusion of a path gap appeared due to alternation of paths of different types. The black squares mark the locations of assembly errors. The rectangles between read paths indicate path gaps that are not excluded. The path gaps marked by number 3 is not excluded due to the repetition of read path types (e.g. the Single forward-oriented path is followed by another Single forward-oriented path instead of the Single-Multiple forward-oriented path). The path gaps marked by number 4 are not excluded because one read path type is missed (e.g. Multiple forward-oriented path is followed by Single forward-oriented path instead of Multiple-Single forward-oriented path)

Then NucBreak detects path gaps appearing because of the alternation of paths of different types. To accomplish this, NucBreak analyses the location order of path types and the locations of paths separately for the forward- and reverse-oriented paths. The path types should appear in a fixed order, like a cycle: Single, Single_Multiple, Multiple, Multiple_Single, Single, and so on (Additional file 1: Figure S4). The cycle may start with any type. If one type is skipped or repeated (Fig. 1b), it indicates an error in this region. There is also a requirement for the locations of paths: both a path and the following path gap should overlap with the next path with more than 5 bases. However, we make some exceptions for type order and path locations in special cases (see Additional file 1: Figure S5 for the details). In this way, NucBreak excludes a path gap if the beginning of the path gap is covered with a path that has a correct type order and location.

Unfortunately, it is not always possible to exclude all path gaps located in the assembly sequence regions that do not contain errors. The path gaps that have appeared due to low read coverage or are located in the regions containing subsequences of N’s of appropriate lengths are never excluded.

Error location adjustment

All non-excluded path gaps are treated as containing assembly errors. To narrow down the region where an error is located, NucBreak shortens the path gaps during the fifth step. To accomplish this, it first combines the paths of all types with the same direction together. Then for each path gap, it determines whether the end of any path is inside the path gap region. If it is, the path gap beginning is shifted to the path end (or to the right-most end in case of several paths detected, Fig. 2a). Finally, it determines whether the beginning of any path is inside the path gap region. If it is, the path gap end is shifted to the path beginning (or to the left-most beginning in case of several paths detected, Fig. 2b). If any path gap is fully covered by any path, then this path gap is excluded.

Fig. 2
figure 2

Error location adjustment. The black line represents an assembly. The arrows represent read paths of any type. The rectangles represent initial path gaps. The red areas in the rectangles in cases a) and b) correspond to the adjusted path gaps with the shortened beginning and end, respectively

To pinpoint the locations of errors, NucBreak first finds the union of the adjusted path gaps of all types. This is carried out separately for path gaps located on forward- and reverse-oriented paths. Then NucBreak finds the intersection of the obtained forward- and reverse-oriented unions of regions and pinpoints the error locations. Errors in the beginning and at the end of a sequence (inside the regions with lengths equal to the read length) are excluded by NucBreak, because in most cases they are due to the lack of perfectly mapped read pairs.

Data sets

For the testing purposes, we created four different datasets. For the first and second datasets, we constructed artificial reference genomes and assemblies, and generated simulated Illumina paired-end read libraries. In both datasets, the reference genomes were constructed from random DNA sequences by introducing different interspersed and tandem repeats. The assemblies were generated from the reference genomes sequences by introducing controlled modifications (e.g. relocations, deletions, duplications of different fragments and so on). The detailed description of introduced modifications is given in (Additional file 1: Table S1). Depending on the datasets, different approaches were applied to create an Illumina paired-end read library in each case. For the first dataset, one read library was generated with the help of ART (Q version 2.5.8) [14] run with the “-ss MSv3 -l 250 -p -m 700 -s 40” settings with 40x read coverage for each reference genome. For the second dataset, read libraries with 5x,10x, 40x,100x, and 200x read coverages were generated by ART run with the “-ss MSv3 -l 250 -p -m 700 -s 40” settings.

The third dataset was created on the base of the data provided by the Assemblathon 1 project [15]. An artificially evolved human chromosome 13 (hg18/NCBI36), simulated Illumina paired-end read library with 40x coverage, and genome assembly obtained by PE-assembler [16] were downloaded from the Assemblathon 1 website [17]. To increase the number of errors and to introduce more variability of error types, we deleted all gaps from the assembly.

The fourth dataset consisted of 8 bacterial genomes (Bordetella pertussis str. J081, Brucella melitensis str. 1, Enterobacter cloacae str. AR_0136, Escherichia coli str. 2014C-3599, Klebsiella pneumoniae str. SGH10, Pseudomonas aeruginosa str. AR_0095, Salmonella enterica str. CFSAN047866, and Staphylococcus aureus str. CFSAN007896), MiSeq Illumina paired-end reads libraries provided for these genomes, and assemblies generated using the ABySS (version 2.0.2) [18], SPAdes (version 3.11.0) [19] and Velvet (version 1.2.10) [20] assemblers. The genomes were downloaded from the NCBI database [21], and the reads were downloaded from the EBI database [22]. The genomes accession numbers and information about the read libraries are given in (Additional file 1: Table S2). The parameter settings used to run ABySS, SPAdes and Velvet are described in (Additional file 1). As in the third dataset, we have removed all gaps from the assemblies before testing.

Results

We have created a tool called NucBreak that is aimed at detection of structural errors in assemblies by analysing the placements of properly mapped reads and exploiting information about the alternative alignments of the reads. In this section, we examine the ability of NucBreak as well as REAPR (version 1.0.18), FRCbam (version 1.2.0), Pilon (version 1.22), BreakDancer (version 1.3.6), Lumpy (version 0.2.13), and Wham (version 1.8.0) to detect assembly errors in real and simulated datasets.

All tools, except REAPR, FRCbam and partly NucBreak, were run with their default settings. The parameter settings used to run REAPR, FRCbam and NucBreak are described in (Additional file 1). To validate the results, we compared the obtained results of each test with the ground truth results consisting of real errors. Depending on the test performed, the ground truth results were generated during the simulation process or produced using NucDiff [23], the tool which enables comparison of reference genomes with assemblies. NucDiff was run with the default parameter settings. The ground truth and obtained results were compared using BEDTools (version 2.17.0) [24] to get sensitivity and FDR for each tool and each dataset (see (Additional file 1) for more details).

We studied sensitivity and FDR for each tool allowing various degrees of slack in the location of each region in the comparison by adding flanking regions of different sizes to the ground truth regions. We added 1, 5, 10, 20, 50, 100, 200, 400, and 600 bp both up- and downstream of each ground truth entry. The flanking regions were introduced to investigate the positional accuracy of the tools tested. In addition, for the flanking region size equal to 600 bp, we identified the ability of each tool to detect ground truth errors depending on the read coverage value in one of the tests.

Accuracy assessment in simulated datasets

We created a simulated dataset consisting of ten artificial reference genomes, assemblies, and Illumina paired-end read libraries, as described in Section Data sets (the first dataset), and ran NucBreak, Pilon, REAPR, FRCbam Lumpy, Wham and BreakDancer to detect errors in the assemblies. To enable validation of the obtained results, we also generated the ground truth results during the simulation process. All ground truth errors were divided into several groups according to their types and sizes: insertion, duplication, tandem duplication, deletion, deletion of interspersed repeats or their parts, deletion of tandem repeats or their parts, inversion, relocation (intra-chromosomal rearrangements) with either inserted regions between misjoined regions or without them, and relocation with overlapped misjoined regions groups with error sizes between 10 and 49 bp, 50 and 299 bp, and greater than 299 bp. The overall sensitivity and FDR are presented in Fig. 3. The sensitivity for each ground truth error group is shown in (Additional file 1: Figures S6-S8). The number of ground truth errors in each group is given in (Additional file 1: Table S3).

Fig. 3
figure 3

ROC-like plot based on the simulated datasets with varying flanking region size. The sensitivity and false discovery rate (FDR) are plotted for seven tools (indicated with different colours) using varying flanking region sizes (indicated with different symbols). The flanking region size corresponds to the amount of slack allowed in the position of correct predictions

As can be seen from (Additional file 1: Figures S6-S8), the sensitivity of each tool largely depends on the types and sizes of errors and size of the flanking region. For Pilon and NucBreak, the sensitivity constantly increases with respect to flanking region size increment in all cases where sensitivity is larger than zero. Wham’s and REAPR’s sensitivity either increases with respect to the flanking region size increase or remains approximately the same, depending on the error types and sizes. In case of BreakDancer, FRCbam and Lumpy, sensitivity increases starting from medium- or long-sized flanking regions depending on an error group.

As expected, all tools perform best with 600 bp flanking region. For this flanking region size, Pilon obtains sensitivity equal to 1 in almost all error groups and outperforms other tools in many cases. NucBreak’s and REAPR’s sensitivity is the same or close to Pilon’s one in most groups. Wham shows relatively high sensitivity in many groups, while BreakDancer, FRCbam and Lumpy have low sensitivity in almost all cases.

As shown in Fig. 3, the overall sensitivity increases rapidly for FRCbam, NucBreak, Pilon, and REAPR together with the flanking region size increase and is more than 0.8 with 600 bp flanking region for all the mentioned tools, except FRCbam. The overall sensitivity for the other tools does not increase significantly and is less than 0.5 for all flanking regions. Overall FDR does not change much for Wham and FRCbam and decreases rapidly for all other tools together with the flanking region size increase. All tools except Pilon and FRCbam reach FDR less than 0.2 with a 600 bp flanking region.

Accuracy assessment in simulated datasets depending on read coverage

To explore the influence of read coverage on the results of NucBreak, Pilon, FRCbam, REAPR, Wham, Lumpy, and BreakDancer, we created ten simulated reference genomes, assemblies, and Illumina paired-end read libraries with 5x, 10x, 40x, 100x, and 200x coverage as described in the Section Data sets (the second dataset). As well as in the Section 3.1, the ground truth errors were generated during simulation process and divided into different groups based on the error types and size. The sensitivity and FDR values were calculated with a 600 bp flanking region. The 600 bp flanking region was chosen because all tools performed best with this flanking region size in the previous section. The sensitivity for each ground truth error group are presented in (Additional file 1: Figures S9-S11) and the overall sensitivity and FDR are shown in Fig. 4.

Fig. 4
figure 4

ROC-like plot based on the simulated datasets with varying sequencing coverage. The sensitivity and false discovery rate (FDR) are plotted for seven tools (indicated with different colours) using varying sequencing coverage (indicated with different symbols)

As indicated in provided plots, NucBreak’s and REAPR’s sensitivity either decreases with the coverage increase or is approximately the same starting from 10x coverage. Pilon’s and FRCbam’s sensitivity decreases or increases depending on the error type and coverage values, while in case of Wham, BreakDancer, and Lumpy sensitivity always increases, except a small number of cases when the sensitivity remains approximately the same.

The overall sensitivity decreases for NucBreak and REAPR and increases for Pilon, Wham, and Lumpy together with the coverage increase. However, the overall sensitivity for NucBreak and REAPR never drops below 0.7. In the case of BreakDancer and FRCbam the overall sensitivity either increases or decreases depending on coverage values. REAPR, NucBreak, and Pilon demonstrate a fast decrease of FDR with up to 40x coverage. Starting from 40x coverage, FDR remains the same or slightly decreases. In the case of FRCbam, BreakDancer, and Wham, FDR remains approximately the same for all coverage values or slightly changes with coverage increase. In contrast to other tools, Lumpy is the only tool that demonstrates a rapid increase of FDR together with the coverage increase.

Accuracy assessment in an assembly obtained from simulated reads

To validate the ability of NucBreak, Pilon, REAPR, FRCbam, Lumpy, BreakDancer, and Wham to detect errors in real assemblies, we ran the tools with a dataset where reads were created for an artificially evolved diploid genome and an assembly was generated by the PE-assembler (see Section Data sets, the third dataset for details). The ground truth results were obtained by comparing the assembly with the reference genome using NucDiff. All ground truth errors were divided into types according to the error types and sizes provided by NucDiff: substitution, insertion, duplication, tandem duplication, deletion, deletion of interspersed repeats or their parts, deletion of tandem repeats or their parts, inversion, reshuffling (several neighbouring genome regions are placed in a different order in an assembly), and two groups of rearrangements (arrangement and rearrangement with overlap) with sizes between 10 and 49 bp, between 50 and 299 bp, and greater than 299 bp. The rearrangement group consisted of relocation and translocation (an inter-chromosomal rearrangement) errors with either inserted regions between misjoined regions or without them. The relocation with overlap group contained relocation and translocation errors with overlapped misjoined regions. The sensitivity for each ground truth error group and overall sensitivity and FDR are presented in Fig. 5 and (Additional file 1: Figures S12-S15). The number of ground truth errors in each group is given in (Additional file 1: TableS3).

Fig. 5
figure 5

ROC-like plot based on Assemblathon 1 datasets with varying flanking region size. The sensitivity and false discovery rate (FDR) are plotted for seven tools (indicated with different colours) using varying flanking region sizes (indicated with different symbols). The flanking region size corresponds to the amount of slack allowed in the position of correct predictions

As we see from (Additional file 1: Figures S12-S15), the sensitivity increases with increasing flanking region size for all tools in all groups. As expected, all tools perform best with a 600 bp flanking region. For this flanking region size, Pilon shows high sensitivity in almost all error groups and outperforms other tools in many cases. The sensitivity results of the other tools largely depend on types and sizes of detected errors. However, all tools show high sensitivity in some groups.

The overall sensitivity is quite low for all tools for all flanking regions. Such a low sensitivity can be explained by the fact that there are a lot of small substitutions in the tested assemblies (225,721 small substitutions versus 6933 small deletions, which form the second largest ground truth error group), that are rarely detected by any tool. The overall FDR decreases together with increasing flanking region size for all tools. All tools except FRCbam obtained the lowest FDR value with a 600 bp flanking region. The FDR values for FRCbam are greater than 0.8 for all flanking region sizes.

Accuracy assessment in an assembly obtained from real reads

We also explored the ability of NucBreak, Pilon, REAPR, FRCbam, Lumpy, BreakDancer to detect errors in assemblies obtained from real reads. For this purpose, we downloaded reads for eight bacterial genomes, generated assemblies by using ABySS, SPAdes, and Velvet (see Section Data sets for full description of data and assembler parameter settings used) and ran NucBreak, Pilon, REAPR, FRCbam, Lumpy, BreakDancer, and Wham. Unfortunately, REAPR crashed during execution and was therefore eliminated from the evaluation process. The ground truth errors were obtained by comparison of assemblies with the reference genomes by using NucDiff and categorized into several types according to the error types and sizes provided by NucDiff, in the same way as it was described in Section Accuracy assessment in an assembly obtained from simulated reads. The sensitivity for each ground truth error group and overall sensitivity and FDR were first computed separately for each assembly and genome and then combined together. The final results are presented in Fig. 6 and (Additional file 1: Figures S16-S19). The number of ground truth errors in each group is given in (Additional file 1: Table S3).

Fig. 6
figure 6

ROC-like plot based on bacterial genome datasets with varying flanking region size. The sensitivity and false discovery rate (FDR) are plotted for six tools (indicated with different colours) using varying flanking region sizes (indicated with different symbols). The flanking region size corresponds to the amount of slack allowed in the position of correct predictions

The sensitivity results indicate that Pilon and NucBreak (with some small exceptions) enable detection of ground truth errors in all non-empty groups, and other tools predict errors only in some cases. Pilon outperforms other tools in almost all groups with respect to sensitivity. However, in half of the cases, the NucBreak results are comparable to Pilon’s ones. Pilon and NucBreak have relatively high sensitivity in many cases, while sensitivity of other tools, except Wham’s and FRCbam’s sensitivity in one case, is quite low or equal to 0.

Like with the dataset from the Assemblathon 1 project, the overall sensitivity is low for all tools for all flanking region sizes due to a large number of small substitutions (8000 small substitutions versus 437 small deletions that form the second largest groups), that are rarely detected by any tool. The overall FDR are very high for all tools except NucBreak. NucBreak has relatively low FDR, even with short flanking regions.

Discussion

In this paper, we have introduced a tool called NucBreak that detects errors in assemblies by using short paired-end Illumina reads. Neither a reference genome nor a long jump library are required. NucBreak enables detection of assembly errors of all types and sizes, except (1) small insertions, deletions and substitutions that do not change repeat copy numbers, (2) deletions of copies of long interspersed repeats together with bases between repeat copies or long tandem repeat units, and (3) relocations and translocations with long overlapped misjoined regions. The inability of NucBreak to detect such types of assembly errors can be explained by two facts. First, NucBreak does not analyse small errors (approximately up to 30 bp) that are detected during the mapping process, and, thus, misses small insertion, deletion and substitution assembly errors. Second, NucBreak cannot detect errors in the regions that are covered with overlapping properly mapped reads, and, as the result, deletions of copies of long repeats together with bases between repeat copies and rearrangements with long overlapped misjoined regions remain unnoticed. The benchmarking results have shown that NucBreak detects all other assembly errors with low FDR and relatively high sensitivity. Such a balance between sensitivity and FDR makes NucBreak a good alternative to the existing assembly accuracy assessment tools and SV detection tools.

We have compared NucBreak with several existing tools for assembly accuracy assessment, namely Pilon, FRCbam and REAPR, as well as with some SV detection tools, including BreakDancer, Lumpy and Wham. Only Pilon, REAPR, and Wham detect assembly errors of most types and sizes with high sensitivity. However, the high sensitivity of these tools is always combined with higher FDR compared to NucBreak. All other tools demonstrate quite low sensitivity and high FDR, showing good sensitivity results only for some specific assembly error types and sizes.

The results reveal that all tested tools do not output their predictions with a single-nucleotide positional accuracy. All tools obtain better sensitivity when the flanking region size increases. However, Wham and Lumpy do not show such rapid growth of sensitivity as other tools. It means that their initial predictions were more proximal to the annotated assembly errors when at all detected.

It has been also observed that the read coverage is an important factor for detecting structural errors. In the case of REAPR and NucBreak, increase in coverage leads to decrease of sensitivity, while in case of Wham, BreakDancer, and Lumpy it helps to improve sensitivity. The sensitivity of Pilon and FRCbam either decreases or increases with coverage increment, depending on the types and sizes of detected assembly errors. In addition, we have noticed that NucBreak shows high sensitivity in all error groups when coverage is low. However, this is not a real advantage of the proposed algorithm but rather a coincidence, since the FDR is quite high in such cases. The reason for this behaviour in low coverage areas is that the number of regions where reads do not overlap are quite high, and many assembly errors are therefore predicted just by chance. With increased coverage, the number of uncovered regions is reduced, leading to a sensitivity and FDR decrease.

In case of the Assemblathon 1 dataset and the bacterial genome datasets, the ground truth has been generated using NucDiff. NucDiff has some limitations that may influence the results. First, the locations of the detected differences between the two sequences compared may be shifted a few bases off compared to the real difference locations due to accidental base similarity at the region borders [23]. This influences the sensitivity and FDR results obtained for short flanking regions (1 bp or 5 bp long). Second, a partial loss of information about the types of differences are observed in cases when both relocations/translocations and duplications are involved at the same time [23]. In this case, duplications may be detected as insertions. This could lead to misclassification of errors to the wrong group and thus influences the sensitivity results for the insertion and duplication groups.

Conclusions

We have presented the tool NucBreak aimed at detection of structural errors in assemblies by using Illumina paired-end reads. NucBreak’s approach is unique in that it exploits information about alternative read alignments and analyses the reads that map properly to the assembly, in contrast to alternative tools that are based on analysing reads that do not map properly. It enables localization of insertions, deletions, duplications, inversions, and different inter- and intra-chromosomal rearrangements. However, it does not annotate detected errors. We have compared NucBreak with REAPR, FRCbam, Pilon, BreakDancer, Lumpy, and Wham. The benchmarking results have shown that in general NucBreak predicts assembly errors with relatively high sensitivity and with lower FDR than the other tools. We have also obtained evidence that Lumpy, BreakDancer and Wham, the tools developed for SV detection, can be used for assembly error detection, although in general the sensitivity of these tools, except Wham, is much lower compared to Pilon, REAPR and NucBreak.

Availability of data and materials

• Project name: NucBreak

• Project home page: https://github.com/uio-bmi/NucBreak

• Operating system(s): Unix-like system such as Ubuntu Linux and MacOS X.

• Programming language: Python

• Other requirements: Python 2.7

• License: Mozilla Public License (MPL), version 2.0

• Any restrictions to use by non-academics: No

• Additional data: All data used is available as described in Section Data sets

Abbreviations

bp:

base pairs

FDR:

false discovery rate

ONT:

Oxford Nanopore Technologies

PacBio:

Pacific Biosciences

ROC:

receiver operating characteristic

SMRT:

Single-molecule real-time

SV:

structural variant

References

  1. Meader S, Hillier LW, Locke D, Ponting CP, Lunter G. Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res. 2010 May;20(5):675–84. https://doi.org/10.1101/gr.096966.109.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD. REAPR: a universal tool for genome assembly evaluation. Genome Biol. 2013;14(5):R47. https://doi.org/10.1186/gb-2013-14-5-r47.

    Article  PubMed  PubMed Central  Google Scholar 

  3. Vezzi F, Narzisi G, Mishra B. Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons. PLoS One. 2012;7(12):e52210. https://doi.org/10.1371/journal.pone.0052210.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9(11):e112963. https://doi.org/10.1371/journal.pone.0112963.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Kronenberg ZN, Osborne EJ, Cone KR, Kennedy BJ, Domyan ET, Shapiro MD, Elde NC, Yandell M. Wham: identifying structural variants of biological consequence. PLoS Comput Biol. 2015;11(12):e1004572. https://doi.org/10.1371/journal.pcbi.1004572.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009;6(9):677–81. https://doi.org/10.1038/nmeth.1363.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 2014;15(6):R84. https://doi.org/10.1186/gb-2014-15-6-r84.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Zhou S, Herschleb J, Schwartz DC. (2007). A single molecule system for whole genome analysis. Perspectives in Bioanalysis. 2007;2:265–300. https://doi.org/10.1016/s1871-0069(06)02009-x.

    Article  CAS  Google Scholar 

  9. The SMRT science website. https://www.pacb.com/smrt-science/. .

  10. The Oxford Nanopore Technologies website. https://nanoporetech.com/. Accessed 21 Jan 2018.

  11. Zhou S, Bechner MC, Place M, Churas CP, Pape L, Leong SA, Runnheim R, Forrest DK, Goldstein S, Livny M, Schwartz DC. Validation of rice genome sequence by optical mapping. BMC Genomics. 2007;8:278. https://doi.org/10.1186/1471-2164-8-278.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Chaisson MJ, Huddleston J, Dennis MY, Sudmant PH, Malig M, Hormozdiari F, et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature. 2015;517(7536):608–11. https://doi.org/10.1038/nature13907.

    Article  CAS  PubMed  Google Scholar 

  13. Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods. 2012;9(4):357–9. https://doi.org/10.1038/nmeth.1923.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28(4):593–4. https://doi.org/10.1093/bioinformatics/btr708.

    Article  CAS  PubMed  Google Scholar 

  15. Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 2011;21(12):2224–41. https://doi.org/10.1101/gr.126599.111.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Ariyaratne PN, Sung WK. PE-assembler: de novo assembler using short paired-end reads. Bioinformatics. 2011;27(2):167–74. https://doi.org/10.1093/bioinformatics/btq626.

    Article  CAS  PubMed  Google Scholar 

  17. The Assemblathon 1 dataset website. http://korflab.ucdavis.edu/Datasets/Assemblathon/Assemblathon1. Accessed 9 Aug 2018.

  18. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23. https://doi.org/10.1101/gr.089532.108.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77. https://doi.org/10.1089/cmb.2012.0021.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9. https://doi.org/10.1101/gr.074492.107.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. The NCBI database. https://www.ncbi.nlm.nih.gov/. Accessed 9 Aug 2018.

  22. The EBI database. https://www.ebi.ac.uk/. Accessed 9 Aug 2018.

  23. Khelik K, Lagesen K, Sandve GK, Rognes T, Nederbragt AJ. NucDiff: in-depth characterization and annotation of differences between two sets of DNA sequences. BMC Bioinformatics. 2017;18(1):338. https://doi.org/10.1186/s12859-017-1748-z.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2. https://doi.org/10.1093/bioinformatics/btq033.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

The authors wish to thank the Centre for Ecological and Evolutionary Synthesis (CEES) for access to the computational infrastructure (‘cod’ servers) that enabled the bioinformatics analysis for this project. The authors also wish to thank Karin Lagesen for valuable input in the early phase of the project.

Funding

KK was funded by the Computational Life Science initiative (CLSi) at the University of Oslo. The funding body played no role in the design or conclusions of this study.

Author information

Authors and Affiliations

Authors

Contributions

KK designed and implemented NucBreak. KK, GKS, AJN and TR suggested the demonstration examples and other experiments performed. KK performed all the experiments. KK and TR wrote the manuscript. GKS and AJN revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Torbjørn Rognes.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1:

Supplementary materials. Figure S1. NucBreak workflow. Figure S2. Properly mapped read pair formation. The black line represents an assembly. The arrows represent all possible read mapping locations. The cases a) and b) correspond to the situations when no read pairs are formed or just one read pair is formed, respectively. The cases c) and d) show examples when several read pairs are formed from two given reads. The case d) is an example of the situation when reads are mapped to a tandem repeat. Figure S3. Properly mapped read pair categorization. The black line represents an assembly. The assembly regions marked by red colour correspond to repeated regions. The repeated regions are identical or near-identical copies of the same repeat. The arrows represent all possible read mapping locations. Figure S4. Read paths and path gaps. The black line represents an assembly. The assembly regions marked by red colour correspond to repeated regions. The repeated regions are identical or near-identical copies of the same repeat or copies of different repeats. The arrows represent read paths. The arrows of the same colour correspond to the read paths of the same type. The rectangles between the read paths indicate path gaps. The example demonstrates the correct order of the read paths in the absence of assembly errors. Figure S5. Possible type order and locations of read paths in the absence of breakpoints. Figure S6. Sensitivity results for the insertion, duplication and tandem duplication groups, obtained using the simulated datasets. Figure S7. Sensitivity results for the deletion, deletion_repeat and deletion_tandem groups, obtained using the simulated datasets. The deletion_repeat group contains deletions of interspersed repeats or their parts. The deletion_tandem group contains deletions of tandem repeats or their parts. Figure S8. Sensitivity results for the inversion, relocation and relocation_overlap groups, obtained using the simulated datasets. The relocation group consists of relocations with either inserted regions between misjoined regions (size varied between 10 and 1000) or without them (size is equal to 0). The relocation_overlap group consists of relocations with overlapped misjoined regions. Figure S9. Sensitivity results for the insertion, duplication and tandem duplication groups, obtained using the simulated datasets. Figure S10. Sensitivity results for the deletion, deletion_repeat and deletion_tandem groups, obtained using the simulated datasets. The deletion_repeat group contains deletions of interspersed repeats or their parts. The deletion_tandem group contains deletions of tandem repeats or their parts. Figure S11. Sensitivity results for the inversion, relocation and relocation_overlap groups, obtained using the simulated datasets. The relocation group consists of relocations with either inserted regions between misjoined regions (size varied between 10 and 1000) or without them (size is equal to 0). The relocation_overlap group consists of relocations with overlapped misjoined regions. Figure S12. Sensitivity results for the insertion, duplication and tandem duplication groups, obtained using the datasets from the Assemblathon 1 project. Figure S13. Sensitivity results for the deletion, deletion_repeat and deletion_tandem groups, obtained using the datasets from the Assemblathon 1 project. The deletion_repeat group contains deletions of interspersed repeats or their parts. The deletion_tandem group contains deletions of tandem repeats or their parts. Figure S14. Sensitivity results for the inversion, rearrangement and rearrangement_overlap groups, obtained using the datasets from the Assemblathon 1 project. The rearrangement group consists of relocations and translocations with either inserted regions between misjoined regions (size varied between 1 and 1000) or without them (size is equal to 0). The rearrangement_overlap group consists of relocations and translocations with overlapped misjoined regions. Figure S15. Sensitivity results for the reshuffling and substitution groups, obtained using the datasets from the Assemblathon 1 project. Figure S16. Sensitivity results for the insertion, duplication and tandem duplication groups obtained using the bacterial genome datasets. Figure S17. Sensitivity results for the deletion, deletion_repeat and deletion_tandem groups, obtained using the bacterial genome datasets. The deletion_repeat group contains deletions of interspersed repeats or their parts. The deletion_tandem group contains deletions of tandem repeats or their parts. Figure S18. Sensitivity results for the inversion, rearrangement and rearrangement_overlap groups, obtained using the bacterial genome datasets. The rearrangement group consists of relocations and translocations with either inserted regions between misjoined regions (size varied between 1 and 1000) or without them (size is equal to 0). The rearrangement_overlap group consists of relocations and translocations with overlapped misjoined regions. Figure S19. Sensitivity results for the reshuffling and substitution groups, obtained using the bacterial genome datasets. Table S1. Genome modifications implemented during the simulation process. G and A denote a reference genome and assembly, respectively. All other letters denote reference genome and assembly sequence regions. Diff means difference. C′ is the reverse complement of C. Table S2. List of bacterial genomes. Table S3. Number of ground truth errors in each group.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Khelik, K., Sandve, G.K., Nederbragt, A.J. et al. NucBreak: location of structural errors in a genome assembly by using paired-end Illumina reads. BMC Bioinformatics 21, 66 (2020). https://doi.org/10.1186/s12859-020-3414-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-020-3414-0

Keywords