CIAlign: A highly customisable command line tool to clean, interpret and visualise multiple sequence alignments
ABSTRACT
Background. Throughout biology, multiple sequence alignments (MSAs) form the
basis of much investigation into biological features and relationships. These alignments
are at the heart of many bioinformatics analyses. However, sequences in MSAs are
often incomplete or very divergent, which can lead to poor alignment and large gaps.
This slows down computation and can impact conclusions without being biologically
relevant. Cleaning the alignment by removing common issues such as gaps, divergent
sequences, large insertions and deletions and poorly aligned sequence ends can
substantially improve analyses. Manual editing of MSAs is very widespread but is time-
consuming and difficult to reproduce.
Results. We present a comprehensive, user-friendly MSA trimming tool with multiple
visualisation options. Our highly customisable command line tool aims to give
intervention power to the user by offering various options, and outputs graphical
representations of the alignment before and after processing to give the user a clear
overview of what has been removed. The main functionalities of the tool include
removing regions of low coverage due to insertions, removing gaps, cropping poorly
aligned sequence ends and removing sequences that are too divergent or too short. The
thresholds for each function can be specified by the user and parameters can be adjusted
to each individual MSA. CIAlign is designed with an emphasis on solving specific and
common alignment problems and on providing transparency to the user.
Conclusion. CIAlign effectively removes problematic regions and sequences from
MSAs and provides novel visualisation options. This tool can be used to fine-tune
alignments for further analysis and processing. The tool is aimed at anyone who wishes
to automatically clean up parts of an MSA and those requiring a new, accessible way of
visualising large MSAs.
Subjects Bioinformatics, Computational Biology, Evolutionary Studies, Genomics
Keywords Multiple sequence alignment, Alignment quality, Python tool, Comparative genomics,
Transcriptomics, Phylogenetics
Submitted 6 April 2021
Accepted 1 February 2022
Published 15 March 2022
Corresponding author Katherine Brown
Academic editor Burkhard Morgenstern
Additional Information and Declarations can be found on page 25
DOI 10.7717/peerj.12983
Copyright 2022 Tumescheit et al.
Distributed under Creative Commons CC-BY 4.0
OPEN ACCESS
How to cite this article Tumescheit C, Firth AE, Brown K. 2022. CIAlign: A highly customisable command line tool to clean, interpret
and visualise multiple sequence alignments. PeerJ 10:e12983 http://doi.org/10.7717/peerj.12983
INTRODUCTION
Throughout biology, multiple sequence alignments (MSAs) of DNA, RNA or amino acid
sequences are often the basis of investigation into biological features and relationships.
Applications of MSAs include, but are not limited to, transcriptome analysis, in which
transcripts may need to be aligned to genes; RNA structure prediction, in which an MSA
improves results significantly compared to predictions based on single sequences; and
phylogenetics, where trees are usually created based on MSAs. There are many more
applications of MSA at a gene, transcript and genome level, involved in a huge variety of
traditional and new approaches to genetics and genomics, many of which could benefit
from the tool presented here.
An MSA typically represents three or more DNA, RNA or amino acid sequences,
which represent partial or complete gene, transcript, protein or genome sequences. These
sequences are aligned by inserting gaps between residues to bring more similar residues
(either based on simple sequence similarity or an evolutionary model) into the same
column, allowing insertions, deletions and differences in sequence length to be taken
into account (Boswell, 1987; Higgins & Sharp, 1988). The first widely used automated
method for generating MSAs was Clustal (Higgins & Sharp, 1988) and more recent
versions of this tool are still in use today, along with tools such as MUSCLE (Edgar,
2004), MAFFT (Katoh et al., 2002), T-coffee (Notredame, Higgins & Heringa, 2000) and
many more. The majority of tools are based upon various heuristics used to optimise
progressive sequence alignment using a dynamic programming based algorithm such as
the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970).
It has been shown previously that removing divergent regions from an MSA can improve
the resulting phylogenetic tree (Talavera & Castresana, 2007). Various tools are available
to identify or remove poorly aligned columns, including trimAl (Capella-Gutiérrez, Silla-
Martínez & Gabaldón, 2009), Gblocks (Talavera & Castresana, 2007) and ZORRO (Wu,
Chatterji & Eisen, 2012). These three tools use various algorithms to assign confidence scores
for each column in an MSA. Gblocks (Talavera & Castresana, 2007) identifies and removes
stretches of contiguous columns with low conservation. All positions with gaps, or adjacent
to gaps, are also removed (Talavera & Castresana, 2007). With trimAl (Capella-Gutiérrez,
Silla-Martínez & Gabaldón, 2009), poorly aligned columns are identified using proportion
of gaps, residue similarity and consistency across multiple alignments, either column-by-
column or based on a sliding window across the alignment. ZORRO uses hidden Markov
models to model sequence evolution and calculates posterior probabilities that columns
are correctly aligned (Wu, Chatterji & Eisen, 2012). All of these tools have been shown
to improve the accuracy of phylogenetic analysis under some circumstances and all can
be valuable (Talavera & Castresana, 2007; Capella-Gutiérrez, Silla-Martínez & Gabaldón,
2009; Wu, Chatterji & Eisen, 2012). However, poorly aligned columns are not the only issue
found in MSAs. All of these tools are designed to identify problematic columns, but none
are able to identify problematic rows which are disrupting an alignment. They also cannot
distinguish which gaps are the result of insertions within sequences and which are the
result of partial sequences. Column-wise tools can also be too stringent when working with
highly divergent alignments. Gblocks, trimAl and ZORRO are specifically tailored towards
phylogenetic analysis rather than other applications such as building consensus sequences,
scaffolding of contigs or secondary structure analysis.
Various refinement methods incorporated into alignment software can also improve
MSAs (Edgar, 2004; Katoh et al., 2002). Some tree building software can also take into
Cleaning alignments
CIAlign consists of several functions to clean an MSA by removing commonly encountered
alignment issues. All of these functions are optional and can be fine-tuned using user
parameters. All parameters have default values. The available functions are presented here
in the order they are executed by the program. The order can have a direct impact on the
results: the functions removing positions that lead to the greatest disruptions in the MSA
should be run first, as they potentially make removing further positions unnecessary and
therefore keep processing to a minimum. For example, divergent sequences often contain
many insertions compared to the consensus, so removing these sequences first reduces the
number of insertions which need to be removed. Sequences can be made shorter during
processing with CIAlign, so sequences that are too short are removed last.
Figure 1 shows a graphical representation of an example toy alignment before (Fig. 1A)
and after (Figs. 1B–1F) using each function individually. The remove gap only function is
run by default after every cleaning step, unless otherwise specified by the user.
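The remove gap only step can be illustrated with a minimal sketch. This is not CIAlign's own code: the function name remove_gap_only_columns and the representation of the alignment as a list of equal-length strings are assumptions made purely for illustration.

```python
def remove_gap_only_columns(alignment, gap="-"):
    """Drop every column in which all sequences contain a gap.

    `alignment` is a list of equal-length strings (hypothetical
    representation, not CIAlign's internal data structure).
    """
    if not alignment:
        return []
    # Keep a column if at least one sequence has a residue there.
    keep = [i for i in range(len(alignment[0]))
            if any(seq[i] != gap for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]


toy = ["AC-GT-", "A--GTA", "AC-G--"]
cleaned = remove_gap_only_columns(toy)
```

Only the third column of the toy alignment consists entirely of gaps, so it is the only column removed.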
Remove divergent
For each column in the alignment, this function finds the most common nucleotide
or amino acid and generates a temporary consensus sequence. Each sequence is then
compared individually to this consensus sequence. Sequences which match the consensus
at a proportion of positions less than a user-defined threshold (default: 0.65) are excluded
from the alignment (Fig. 1B). It is recommended to run the make similarity matrix function
to calculate pairwise similarity before removing divergent sequences, in order to adjust the
parameter value for more or less divergent alignments. This function requires an alignment
of three or more sequences.
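A minimal sketch of this procedure is given below. It is an illustration under stated assumptions rather than the CIAlign implementation: the function names are hypothetical, gaps are ignored when building the consensus, and the proportion of matching positions is calculated over the non-gap positions of each sequence.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Most common residue per column (assumption: gaps are ignored)."""
    cons = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        cons.append(Counter(residues).most_common(1)[0][0] if residues else gap)
    return "".join(cons)


def remove_divergent(alignment, min_prop=0.65, gap="-"):
    """Keep sequences matching the consensus at >= min_prop of positions.

    Assumption: the proportion is taken over the non-gap positions of
    each sequence; sequences consisting only of gaps are dropped.
    """
    cons = consensus(alignment, gap)
    kept = []
    for seq in alignment:
        positions = [i for i, c in enumerate(seq) if c != gap]
        if not positions:
            continue
        matches = sum(seq[i] == cons[i] for i in positions)
        if matches / len(positions) >= min_prop:
            kept.append(seq)
    return kept


toy = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "ACGTTCGT", "TGCATGCT"]
kept = remove_divergent(toy)
```

In this toy alignment the last sequence matches the consensus at only one of eight positions and is therefore excluded, while the two sequences with a single mismatch are retained.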
Remove insertions
In order for CIAlign to define a region as an insertion, an alignment gap must be present
in the majority of sequences and flanked by a minimum number of non-gap positions on
either side, which can be defined by the user (default: 5). This pattern can be the result
of an insertion in a minority of sequences or a deletion in a majority of sequences. The
minimum and maximum size of insertion to be removed can also be defined by the user
(default: 3 and 200, respectively) (Fig. 1C). This function requires an alignment of three or
more sequences.
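The criteria above can be sketched as follows. This is a simplified reading, not the CIAlign implementation: the function and parameter names are hypothetical, and the flanking requirement is interpreted here as a run of columns on each side in which gaps are not in the majority.

```python
def find_insertions(alignment, min_size=3, max_size=200, min_flank=5, gap="-"):
    """Return (start, end) column ranges (end exclusive) of candidate insertions."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    # A column is "majority gap" if more than half of the sequences have a gap.
    majority_gap = [sum(seq[i] == gap for seq in alignment) > n_seqs / 2
                    for i in range(length)]
    regions = []
    i = 0
    while i < length:
        if not majority_gap[i]:
            i += 1
            continue
        start = i
        while i < length and majority_gap[i]:
            i += 1
        end = i
        # Assumed flank rule: min_flank non-majority-gap columns on both sides.
        left_ok = (start >= min_flank
                   and not any(majority_gap[start - min_flank:start]))
        right_ok = (end + min_flank <= length
                    and not any(majority_gap[end:end + min_flank]))
        if min_size <= end - start <= max_size and left_ok and right_ok:
            regions.append((start, end))
    return regions


def remove_insertions(alignment, **kwargs):
    """Delete the columns belonging to every detected insertion."""
    drop = {i for start, end in find_insertions(alignment, **kwargs)
            for i in range(start, end)}
    return ["".join(c for i, c in enumerate(seq) if i not in drop)
            for seq in alignment]


toy = ["ACGTA----CGTAC",
       "ACGTA----CGTAC",
       "ACGTATTTTCGTAC",
       "ACGTA----CGTAC"]
regions = find_insertions(toy)
cleaned = remove_insertions(toy)
```

Here the four-residue insertion in the third sequence occupies columns in which three of four sequences have gaps, so the region is detected and removed from all sequences.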
Crop ends
The crop ends function redefines where each sequence starts and ends, based on the
ratio of the numbers of gap and non-gap positions observed up to a given position in
the sequence. It then replaces all non-gap positions before and after the redefined start
and end, respectively, with gaps. The procedure is described here for redefining the sequence
start; crop ends is also applied to the reverse of the sequence to redefine the sequence
end. The number of gap positions separating every two consecutive non-gap positions is
compared to a threshold and, if that number is higher than the threshold, the start of the
sequence will be reset to that position. This threshold is defined as a proportion of the total
sequence length, excluding gaps, and can be defined by the user (default: 0.05) (Figs. 1D,
2). The user can set a parameter that defines the maximum proportion of the sequence for
which to consider the change in gap positions (default: 0.1) and therefore the innermost
position at which the start or end of the sequence may be redefined. It is recommended to
set this parameter no higher than 0.1, since even if there are a large number of gap positions
beyond this point, this is unlikely to be the result of incomplete sequences (Fig. 2). This
function requires an alignment of three or more sequences.
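One possible reading of this description is sketched below for a single sequence. It is a simplified illustration and may differ from the CIAlign implementation: the names find_start, crop_ends, mingap_prop and redefine_prop are hypothetical, and the search window is assumed to be counted in non-gap residues.

```python
def find_start(seq, mingap_prop=0.05, redefine_prop=0.1, gap="-"):
    """Return the redefined start index of `seq` (simplified interpretation)."""
    residues = [i for i, c in enumerate(seq) if c != gap]
    if not residues:
        return 0
    # Threshold: a proportion of the sequence length excluding gaps.
    threshold = mingap_prop * len(residues)
    # Only the first redefine_prop of the residues may be cropped.
    window = int(redefine_prop * len(residues))
    start = residues[0]
    for k in range(min(window, len(residues) - 1)):
        gaps_between = residues[k + 1] - residues[k] - 1
        if gaps_between > threshold:
            start = residues[k + 1]  # the innermost qualifying position wins
    return start


def crop_ends(seq, mingap_prop=0.05, redefine_prop=0.1, gap="-"):
    """Replace residues outside the redefined start and end with gaps."""
    start = find_start(seq, mingap_prop, redefine_prop, gap)
    # The end is found by applying the same rule to the reversed sequence.
    end = len(seq) - find_start(seq[::-1], mingap_prop, redefine_prop, gap)
    return gap * start + seq[start:end] + gap * (len(seq) - end)


ragged = "AC" + "-" * 10 + "G" * 40 + "-" * 10 + "TA"
cropped = crop_ends(ragged)
```

In the toy sequence, two residues at each end are separated from the main body by ten gaps, which exceeds the threshold of 0.05 × 44 = 2.2, so both ends are replaced with gaps.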
Visualisation
There are several ways of visualising the alignment, which both allow the user to interpret
the alignment and clearly show which positions and sequences CIAlign has removed.
CIAlign can also be used simply to visualise an alignment, without running any of the
cleaning functions. All visualisations can be output as publication ready image files.
Mini alignments
CIAlign provides functionality to generate mini alignments, in which an MSA is visualised
using coloured rectangles on a single x and y axis, with each rectangle representing a single
nucleotide or amino acid (e.g., Figs. 1, 3–5). Even for large alignments, this function provides
a visualisation that can be easily viewed and interpreted. Many properties of the resulting
file (dimensions, DPI, file type) are parameterised. In order to minimise the memory and
time required to generate the mini alignments, the matplotlib imshow function (Hunter,
2007) for displaying images is used. Briefly, each position in each sequence in the alignment
forms a single pixel in an image object and a custom dictionary is used to assign colours.
The image object is then stretched to fit the axes.
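The approach can be sketched as follows. The colour palette, function names and figure dimensions are hypothetical choices made for illustration and are not those used by CIAlign; only the use of the matplotlib imshow function is taken from the description above.

```python
# Hypothetical RGB palette; CIAlign's own colours may differ.
PALETTE = {"A": (0, 128, 0), "C": (0, 0, 255), "G": (255, 165, 0),
           "T": (255, 0, 0), "-": (255, 255, 255)}


def alignment_to_pixels(alignment, palette, default=(128, 128, 128)):
    """One RGB pixel per residue: rows are sequences, columns are positions."""
    return [[palette.get(c.upper(), default) for c in seq] for seq in alignment]


def draw_mini_alignment(alignment, palette, outfile, dpi=300):
    """Render the pixel grid with imshow and stretch it to fit the axes."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend for writing files
    import matplotlib.pyplot as plt
    import numpy as np

    pixels = np.array(alignment_to_pixels(alignment, palette), dtype=np.uint8)
    fig, ax = plt.subplots(figsize=(5, 2))
    # aspect="auto" stretches the image; "nearest" keeps the rectangles sharp.
    ax.imshow(pixels, aspect="auto", interpolation="nearest")
    ax.set_xlabel("Position")
    ax.set_ylabel("Sequence")
    fig.savefig(outfile, dpi=dpi, bbox_inches="tight")
    plt.close(fig)


toy = ["AC-", "GT-"]
pixels = alignment_to_pixels(toy, PALETTE)
```

Calling draw_mini_alignment(toy, PALETTE, "mini.png") would then write the image to a file. Because each residue is a single pixel, the cost of building the image grows with the number of residues rather than with the number of drawn shapes.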
Sequence logos
CIAlign can generate traditional sequence logos (Schneider & Stephens, 1990) or sequence
logos using rectangles instead of letters to show the information and base/amino acid
content at each position, which can increase readability in less conserved regions. Sequence
logos can also be generated for sections of the alignment if a set of boundary coordinates
is provided.
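The quantity displayed in a sequence logo can be illustrated with a short sketch. It follows the general definition of Schneider & Stephens (1990), in which the information content of a column is the maximum entropy minus the observed entropy, but it omits the small-sample correction and assumes that gaps are ignored; the function name is hypothetical and the calculation used within CIAlign may differ in these details.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4, gap="-"):
    """Height in bits of each residue in one logo column.

    alphabet_size is 4 for nucleotides and 20 for amino acids.
    Assumptions: gaps are ignored and no small-sample correction is applied.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return {}
    freqs = {c: n / len(residues) for c, n in Counter(residues).items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    info = math.log2(alphabet_size) - entropy
    # Each residue is drawn with height proportional to its frequency.
    return {c: f * info for c, f in freqs.items()}


conserved = column_information("AAAA")
split = column_information("AACC")
uniform = column_information("ACGT")
```

A fully conserved nucleotide column carries two bits, a column split evenly between two bases carries one bit shared between them, and a column with all four bases at equal frequency carries none.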
Interpretation
Some additional functions are provided to further interpret the alignment, for example
plotting the number of sequences with non-gap residues at each position (the coverage),
calculating a pairwise similarity matrix and generating a consensus sequence with various
options.
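Two of these calculations are simple enough to sketch directly. The function names are hypothetical, and the choice to calculate pairwise identity only over columns where both sequences have a residue is an assumption; CIAlign offers its own options for these functions.

```python
def coverage(alignment, gap="-"):
    """Number of sequences with a non-gap residue at each column."""
    return [sum(c != gap for c in column) for column in zip(*alignment)]


def pairwise_identity(seq_a, seq_b, gap="-"):
    """Proportion of identical residues over columns where both are non-gap."""
    shared = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not shared:
        return 0.0
    return sum(a == b for a, b in shared) / len(shared)


def similarity_matrix(alignment, gap="-"):
    """Symmetric matrix of pairwise identities, as nested lists."""
    return [[pairwise_identity(a, b, gap) for b in alignment] for a in alignment]


toy = ["AC-GT", "ACGGT", "-C-GA"]
cov = coverage(toy)
matrix = similarity_matrix(toy)
```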
Given the toy example shown in Fig. 1A, running all possible cleaning functions will
lead to the markup plot shown in Fig. 3A and the result shown in Fig. 3B. In the markup
plot each removed part is highlighted in a different colour corresponding to the function
with which it was removed.
Example alignments
Four example alignments are provided within the software directory to demonstrate the
functionality of CIAlign. Examples 1 and 2 use simulated sequences, whereas Examples 3 and 4
use real biological sequences and are designed to resemble the type of complex alignment
many researchers encounter.
Example 1 is a very short alignment of six sequences which was generated manually
by creating arbitrary sequences of nucleotides that would show every cleaning function
while being as short as possible. This alignment contains an insertion, gaps at the ends of
sequences, a very short sequence and some highly divergent sequences.
Example 2 is a larger alignment based on randomly generated amino acid sequences
using RandSeq (a tool from ExPASy (Gasteiger et al., 2003)) with an average amino acid
composition, which were aligned with MAFFT v7.407, under the default settings (Katoh et
al., 2002). The sequences were adjusted manually to reflect an alignment that would fully
demonstrate the functionalities of CIAlign. It consists of many sequences that align well,
but there are again a few problems: one sequence has a large insertion, one is very
short, one is extremely divergent and some have multiple gaps at the start and at the end.
For Example 3, putative mitochondrial gene cytochrome C oxidase I (COI) sequences
were identified by applying TBLASTN v2.9.0 (Camacho et al., 2009) to the human COI
sequence (GenBank accession NC_012920.1, positions 5,904–7,445, translated to amino
acids), querying against 1,565 transcriptomic datasets from the NCBI transcriptome
shotgun assembly (TSA) database (Transcriptome Shotgun Assembly Sequence Database,
https://www.ncbi.nlm.nih.gov/genbank/tsa/) under the default settings. 2,855 putative COI
transcripts were reverse complemented where required, and those corresponding to the
COI gene of the primary host of the TSA dataset were identified using the BOLD online
specimen identification engine (Ratnasingham & Hebert, 2007) (accessed 07/10/2019)
querying against the species level barcode records. The resulting 232 sequences were then
aligned with MAFFT, under the default settings.
For Example 4, 91 sequences were selected from Example 3 to be representative of as
many taxonomic families as possible and to exclude families with unclear phylogeny in
the literature. These sequences were aligned with MAFFT under the default settings and
RESULTS
Here, an example is presented and the visualisation functions are used to illustrate the
functionality of CIAlign. Results will differ when using different parameters and thresholds.
CIAlign was applied to the Example 2 alignment with the following options:
python3 CIAlign.py --infile INFILE --outfile_stem OUTFILE_STEM --all
Using these settings on the alignment in Fig. 4A results in the markup shown in Fig. 4B
and the output shown in Fig. 4C. The markup shows which function has removed each
sequence or position. The benefits of CIAlign are clear in this simulation—the single poorly
aligned sequence, the large insertion, very short sequences and gap-only columns have been
removed and the unreliably aligned end segments of the sequences have been cropped. The
resulting alignment is significantly shorter, which will speed up and simplify any further
analysis. The clear graphical representation makes it easy to see what has been removed, so
in the case of over-trimming the user can intervene and adjust functions and parameters.
In order to demonstrate the use of CIAlign on real biological sequences, an alignment
was generated based on the COI gene commonly used in phylogenetic analysis and DNA
barcoding (Ratnasingham & Hebert, 2007). As CIAlign addresses some common problems
encountered when generating an MSA based on de novo assembled transcripts, which tend
to have a higher error rate at transcript ends, gaps due to difficult to assemble regions and
divergent sequences due to chimeric connections between unrelated regions (Bushmanova
et al., 2019; Liao et al., 2019), COI-like transcripts were identified by searching the NCBI
transcriptome shotgun assembly database. Aligning these transcripts demonstrated several
common problems—multiple insertions, poor alignment at the starts and ends of sequences
and a few divergent sequences resulting in excessive gaps (Fig. 5A). This alignment was
cleaned using the default CIAlign settings except the threshold for removing divergent
sequences was reset to 50%, as some of the sequences are from evolutionarily distant
species. Cleaning this alignment with CIAlign took an average of 68.1 s and used on
average a maximum of 1.13 GB of RAM (mean across 10 runs on one Intel Core i7-7560U
core with 4 GB of RAM, running at 2.40 GHz; RAM was measured as maximum resident set
size. This machine and 10 replicates were also used for all subsequent measurements of
CIAlign resource requirements in this section). Under these settings, CIAlign resolved
several of the problems with the alignment: the insertions and highly divergent sequences
were removed and the poorly aligned regions at the starts and ends of sequences were
cropped (Fig. 5B). One sequence and 6,029 positions were removed from the alignment
and a total of 2,446 positions were cropped from the ends of 112 sequences. The processed
alignment is 26.6% of the size of the input alignment. However, a minimal amount of
actual sequence data (as opposed to gaps) was removed, with 85.7% of bases remaining.
gap positions removed were much higher: 51–56% for all sets of parameters (Fig. 6A, Fig.
S2, Table 1). This shows that with relaxed and moderate settings, running CIAlign has a
minimal impact on correctly aligned residues in the alignment, while a considerable proportion
of gaps and noise is removed. The more stringent settings should be used cautiously;
however, even with high stringency, a large majority of correctly aligned residues remain
and the majority of gaps are removed. These results are separated by simulation tool
(EvolvAGene, INDELible or BAliBASE) and alignment tool (MUSCLE, MAFFT global,
MAFFT local and Clustal Omega) in Fig. S2.
To directly compare the impact of CIAlign on correctly aligned pairs of residues to
its overall impact, we fitted a linear regression line to show how, on average, the overall
proportion of positions removed from the alignment impacts the proportion of correctly
aligned residues removed (Fig. 6B). The resulting line had a gradient of 0.281 for relaxed
parameters, 0.361 for moderate parameters and 0.554 for stringent parameters. In other
words, for every 1% of material removed from the alignment by CIAlign with relaxed
settings, an average of only 0.281% of correctly aligned residue pairs will be removed,
with moderate settings 0.361% and with stringent settings 0.554% (Fig. 6B). This will vary
depending on the input alignment and the use case. These results are shown separately
for MUSCLE, MAFFT and Clustal Omega in Fig. S2E. The impact of CIAlign on correctly
aligned pairs is most severe on the Clustal Omega EvolvAGene alignments, which have
lower pairwise identity than the alignments generated with the other tools and so have
more sequences removed entirely by the remove divergent function (discussed below).
In most cases, CIAlign is not intended or expected to change the phylogenetic tree
resulting from an alignment, although in many cases it will make building phylogenetic
trees faster. To test this, phylogenetic trees were generated for each of the EvolvAGene
and INDELible alignments (BAliBASE does not provide reference trees) to determine
if cleaning with CIAlign impacts the distance between the true phylogenetic tree and a
phylogenetic tree based on a test alignment (Fig. 6C, Table 1). For the EvolvAGene and
INDELible alignments, the mean n-RF distance and QD between the test trees and true trees
were virtually unchanged by running CIAlign and none of the changes were statistically
deviation of 5.36% (Fig. S3A, Table S3). There was one particular outlier for this metric:
for alignments generated with Clustal Omega (Sievers & Higgins, 2018), an HMM-based method,
CIAlign with stringent settings removes a higher proportion of correctly aligned residues for the EvolvAGene
nucleotide simulations (median 24.5%). This is the result of a higher proportion of
sequences being removed by the remove divergent function, as the mean percentage
identity between pairs of sequences in the Clustal Omega alignments is lower (with a mean
of 57.9% identity) than the threshold of 65% identity used to remove divergent sequences
under the stringent CIAlign settings (Table S1, Fig. S3B).
Otherwise, the extent to which CIAlign will remove positions from an alignment is
primarily related to the number of gaps introduced by the alignment software. Amino acid
alignments generated with the tool DECIPHER (Wright, 2015) are outliers because this
tool introduces fewer and shorter internal gaps (as opposed to terminal gaps) into these
alignments than any other tool (under the default settings), which reduces the number
of positions meeting the criteria to be removed with either the crop ends or the remove
insertions functions (Fig. S3C, Table S3). Across all tools, there is a positive correlation
between the proportion of gaps in the input alignment and the proportion of residues
Realignment
As alignment tools take into account all the sequences and columns in the input file, the
most scrupulous option will always be to unalign and then realign sequences after running
a tool such as CIAlign, rather than using the CIAlign output directly in downstream
analysis. To test the extent to which using CIAlign outputs directly without realignment
could impact results, we removed gaps from the EvolvAGene alignments cleaned with
CIAlign with relaxed, moderate and stringent parameter settings and then reran the
original alignment tool on the result. We then calculated the sum-of-pairs score (Bahr et
al., 2001) treating the realigned file as the true alignment and the CIAlign output as the test
alignment. The mean sum-of-pairs score was 0.984, meaning 98.4% of pairs of nucleotides
aligned in the realigned MSA were also aligned in the CIAlign output (Fig. S5). This suggests that
while realigning the MSA cleaned with CIAlign is diligent, the effect is likely to be minimal.
The full results of this analysis are available in Online Table 7.
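For readers unfamiliar with the metric, the sum-of-pairs score can be sketched as the proportion of residue pairs aligned in the reference alignment that are also aligned in the test alignment. The code below is an illustrative implementation of that definition, not the script used for the analysis; the function names are hypothetical, and both alignments are assumed to contain the same sequences in the same order.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Set of residue pairs sharing a column.

    Each residue is identified as (sequence index, ungapped residue index).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for s, c in enumerate(column):
            if c != gap:
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs(reference, test, gap="-"):
    """Proportion of reference pairs that are also aligned in the test."""
    ref_pairs = aligned_pairs(reference, gap)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test, gap)) / len(ref_pairs)


reference = ["ACGT", "A-GT"]
test = ["ACGT", "AG-T"]
score = sum_of_pairs(reference, test)
```

In this toy case the reference contains three aligned pairs, two of which are preserved in the test alignment, giving a score of 2/3.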
Removing outliers
CIAlign can also be used to remove clear outliers from an alignment, for example prior to
phylogenetic analysis. To illustrate this, we ran the CIAlign cleaning functions on data from
DISCUSSION
We have demonstrated that CIAlign can successfully mitigate the alignment issues caused
by non-majority insertions, poorly aligned sequence ends, highly divergent sequences and
short sequences and demonstrated this capability on specific examples, simulated and
benchmark datasets and large biological datasets. CIAlign has been shown to significantly
improve the accuracy of consensus sequences and secondary structure predictions generated
from MSAs (Figs. 6C and 7D). It also minimises the detrimental effect of adding
poorer-quality sequences to both benchmark and real alignments (Figs. 7C and 8A). In most
cases, the proportion of correctly aligned material removed by CIAlign is minimal.
It is important to note that while CIAlign is helpful in mitigating alignment issues, using
an appropriate alignment tool and parameters to generate the original alignment is still
essential.
Parameters
Providing many parameters gives the user greater control and
flexibility. However, this also means that these parameters should be adjusted, which
requires a good understanding of the cleaning functions and the MSA in question. CIAlign
Future work
New features are planned, such as collapsing very similar
sequences, removing divergent columns, and making the colour scheme for the bases or
amino acids customisable. CIAlign is currently not parallelised, as the most time-limiting
function, remove insertions, requires information from the entire alignment. However, a
future release will incorporate the ability to process more than one alignment in parallel.
CONCLUSIONS
CIAlign is a highly customisable tool which can be used to clean multiple sequence
alignments and address several common alignment problems. Due to its multiple user
options it can be used for many applications. CIAlign provides clear visual output showing
which positions have been removed and for what reason, allowing the user to adjust the
parameters accordingly. A number of additional visualisation and interpretation options
are provided.
Funding
This work was supported by the Wellcome Trust (106207) and the European Research
Council (646891). The funders had no role in study design, data collection and analysis,
decision to publish, or preparation of the manuscript.
Grant Disclosures
The following grant information was disclosed by the authors:
Wellcome Trust: 106207.
European Research Council: 646891.
Competing Interests
The authors declare there are no competing interests.
Data Availability
The following information was supplied regarding data availability:
The code is available at Zenodo: Katy Brown & Charlotte Tumescheit. (2022).
KatyBrown/CIAlign: v1.0.15 (v1.0.15). Zenodo. DOI:10.5281/zenodo.6330781.
The benchmarking data is available at GitHub: https://github.com/KatyBrown/benchmarking_data_CIAlign.
Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj.12983#supplemental-information.
REFERENCES
Arnold C, Matthews LJ, Nunn CL. 2010. The 10kTrees website: a new online resource for
primate phylogeny. Evolutionary Anthropology: Issues, News, and Reviews 19:114–118
DOI 10.1002/evan.20251.
Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR.
2015. A global reference for human genetic variation. Nature 526:68–74
DOI 10.1038/nature15393.
Bäckström D, Yutin N, Jørgensen SL, Dharamshi J, Homa F, Zaremba-Niedwiedzka
K, Spang A, Wolf YI, Koonin EV, Ettema TJG. 2019. Virus genomes from deep sea
sediments expand the ocean megavirome and support independent origins of viral
gigantism. mBio 10(2):e02497-18 DOI 10.1128/mBio.02497-18.
Bahr A, Thompson JD, Thierry J-C, Poch O. 2001. BAliBASE (Benchmark Alignment
dataBASE): enhancements for repeats, transmembrane sequences and circular
permutations. Nucleic Acids Research 29:323–326 DOI 10.1093/nar/29.1.323.
Boswell RD. 1987. Sequence alignment by word processor. Trends in Biochemical Sciences
12:279–280 DOI 10.1016/0968-0004(87)90135-6.
Brito JJ, Li J, Moore JH, Greene CS, Nogoy NA, Garmire LX, Mangul S. 2020. Recommendations
to enhance rigor and reproducibility in biomedical research. GigaScience
9(6):giaa056 DOI 10.1093/gigascience/giaa056.
Bushmanova E, Antipov D, Lapidus A, Prjibelski AD. 2019. rnaSPAdes: a de novo transcriptome
assembler and its application to RNA-Seq data. GigaScience 8(9):giz100
DOI 10.1093/gigascience/giz100.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden
TL. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421
DOI 10.1186/1471-2105-10-421.