List of Online Bioinformatics Tools and Software - Final
This document describes the most commonly used software and algorithms for processing whole
genome sequencing data. It is divided into categories, which describe the key processes for analysing
short read data. Tools of particular interest are tagged with a specific character (historical†, commonly
used*, easy to run#, etc.). The list is not complete and reflects the status as of January 2018. The field
is under continuous development, and new tools not included here will be released.
Most of the presented tools are command line based. To use them, you will need to install
them on your infrastructure. We strongly recommend checking that your infrastructure has sufficient
resources (i.e. storage capacity and memory) to run the tools, as some of them are resource-intensive.
If you want to try this software and do not have suitable infrastructure, we recommend running
Bio-Linux in VirtualBox (footnote 1).
o Seqtk
https://github.com/lh3/seqtk
Windows, Mac OS X and Linux
Tool for processing sequences in the FASTA or FASTQ format that can be used for
adapter removal and trimming of low-quality bases
Assembly
This is the process of joining short/long reads into longer contigs (contiguous lengths of DNA) without
the need for a reference sequence.
o VelvetK
http://www.vicbioinformatics.com/software.velvetk.shtml
Windows, Mac OS X and Linux
Perl script to estimate best k-mer size to use for your Velvet de novo assembly.
o VelvetOptimiser
http://www.vicbioinformatics.com/software.velvetk.shtml
Mac OS X and Linux
Perl script to assist with optimising the assembly.
Comments: optimisation can be made using different metrics (e.g. with best N50, best
coverage…)
o KmerGenie
http://kmergenie.bx.psu.edu/
Informed and Automated k-Mer Size Selection for Genome Assembly. Chikhi R.,
Medvedev P. HiTSeq 2013.
Windows, Mac OS X and Linux
Best k-mer length estimator for single-k genome assemblers like Velvet.
o Khmer
http://khmer.readthedocs.io/en/v2.0/
The khmer software package: enabling efficient nucleotide sequence analysis. Crusoe et
al., 2015. F1000 http://dx.doi.org/10.12688/f1000research.6924.1
Linux and Mac OS X
Set of command-line tools for dealing with large and noisy datasets to normalise and
scale the data for more efficient genome assembly.
o Minia
http://minia.genouest.org/
Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Chikhi,
Rayan and Rizk, Guillaume. Algorithms for Molecular Biology, BioMed Central, 2013, 8
(1), pp.22.
Windows, Mac OS X and Linux
Short-read assembler based on a de Bruijn graph for low-memory assembly.
o SPAdes*#
http://cab.spbu.ru/software/spades/
SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell
Sequencing, Anton Bankevich, Sergey Nurk, Dmitry Anipov, Alexey A. Gurevich, Mikhail
Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D.
Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A.
Alekseyev, and Pavel A. Pevzner. Journal of Computational Biology 19(5) (2012), 455-
477. doi:10.1089/cmb.2012.0021
Mac OS X and Linux
Short and hybrid-long read assembler based on a de Bruijn graph that also performs
error correction and is a multi-k genome assembler.
Comments: Illumina paired reads (2×150 and 2×250) should be assembled with the
specific option --careful (see application note for full details) to get the best assembly
possible.
o Velvet†*
https://www.ebi.ac.uk/~zerbino/velvet/
Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Daniel R.
Zerbino and Ewan Birney. Genome Res. May 2008 18: 821-829; Published in Advance
March 18, 2008, doi:10.1101/gr.074492.107
Linux
De novo short read genome assembler with error correction to produce high quality
unique contigs.
Comments: parameters can be difficult to select; helper scripts have been developed that
work well for choosing the best parameters. Parameter optimisation with VelvetOptimiser or
VelvetK is recommended.
o Canu
http://canu.readthedocs.io/en/stable/index.html
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat
separation. Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Adam M.
Phillippy doi: http://dx.doi.org/10.1101/071282
Windows, Mac OS X and Linux
Long-read assembler designed for high-noise data such as that generated by PacBio or
Oxford Nanopore MinION. Canu also performs error correction.
Comments: specifically designed to work with long reads.
o Unicycler
https://github.com/rrwick/Unicycler
Unicycler: resolving bacterial genome assemblies from short and long sequencing reads.
Ryan R. Wick, Louise M. Judd, Claire L. Gorrie, Kathryn E. Holt. PLoS
Comput Biol (2017) https://doi.org/10.1371/journal.pcbi.1005595
Mac OS X and Linux
Unicycler is an assembly pipeline for bacterial genomes. It can assemble Illumina-only
read sets where it functions as a SPAdes-optimiser. It can also assemble long-read-only
sets (PacBio or Nanopore) where it runs a miniasm+Racon pipeline. For the best possible
assemblies, give it both Illumina reads and long reads, and it will conduct a hybrid
assembly.
Comments: mainly used for hybrid assembly of long reads combined with Illumina reads.
Well documented with a wiki tutorial:
https://github.com/rrwick/Unicycler/wiki/Tips-for-finishing-genomes
o Bandage#
http://rrwick.github.io/Bandage/
Bandage: interactive visualization of de novo genome assemblies. Ryan R. Wick, Mark B.
Schultz, Justin Zobel, and Kathryn E. Holt. Bioinformatics (2015) 31 (20): 3350-3352 first
published online June 22, 2015 doi:10.1093/bioinformatics/btv383
Linux and Mac
Program for visualising de novo assembly graphs. It displays connections that are not
present in the contigs file, which helps with assembly assessment.
Comments: BLAST can be run inside the software to annotate regions of interest.
Can help determine relations between contigs.
Annotation
The process which takes the raw sequence of contigs resulting from assembly and marks it with
features such as gene names and putative functions.
o Prokka*#
http://www.vicbioinformatics.com/software.prokka.shtml
Prokka: rapid prokaryotic genome annotation. Seemann T. Bioinformatics. 2014 Jul
15;30(14):2068-9. PMID:24642063
Windows, Mac OS X and Linux
Software tool for the rapid annotation of prokaryotic genomes.
o RAST
http://rast.nmpdr.org/
The RAST Server: Rapid Annotations using Subsystems Technology. Aziz RK et al. BMC
Genomics, 2008
Online tool
Fully-automated service for annotating complete or nearly complete bacterial and
archaeal genomes.
o Genix
http://labbioinfo.ufpel.edu.br/cgi-bin/genix_index.py
Online tool
Fully automated pipeline for bacterial genome annotation.
o Prodigal
https://github.com/hyattpd/Prodigal/wiki/Introduction
Hyatt, Doug et al. “Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site
Identification.” BMC Bioinformatics 11 (2010): 119. PMC. Web. 25 Apr. 2018.
Windows, Mac OS X, GenericUnix (Linux)
Prodigal is a protein-coding gene prediction software tool for bacterial and
archaeal genomes.
o NCBI Prokaryotic Genome Annotation Pipeline (PGAP)
https://www.ncbi.nlm.nih.gov/genome/annotation_prok/
Online tool – available for GenBank submitters only
PGAP is a pipeline for prediction of protein-coding genes, as well as other functional
genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions,
direct and inverted repeats, insertion sequences, transposons and other mobile elements
Alignment or sequence searching
Tools to align a sequence to other sequences locally or against publicly available nucleotide or
protein archives.
o BLAST†#*
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Basic local alignment search tool. Stephen F. Altschul, Warren Gish, Webb Miller, Eugene
W. Myers, David J. Lipman. Journal of Molecular Biology, Volume 215, Issue 3, 5 October
1990, Pages 403-410
Windows, Mac OS X and Linux
Search tool to find regions of similarity between biological sequences through alignment
and calculating statistical significance.
Comments: classic method for searching for a specific sequence. Different programs, such as
blastn or megablast, can be used depending on the similarity between the biological
sequences. A local, custom database can be created with makeblastdb.
o MUMmer
http://mummer.sourceforge.net/
Versatile and open software for comparing large genomes. A.L. Delcher, A. Phillippy, J.
Carlton, and S.L. Salzberg, Nucleic Acids Research (2002), Vol. 30, No. 11 2478-2483.
Windows, Mac OS X and Linux
A system for rapidly aligning entire genomes and finding matches in DNA sequences.
o Clustal suite – ClustalO and ClustalW
http://www.clustal.org
Thompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting, position-specific
gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
Sievers et al. (2011) Fast, Scalable Generation of High‐quality Protein Multiple Sequence
Alignments Using Clustal Omega. Molecular Systems Biology, 10.1038/msb.2011.75
Windows, Mac OS X and Linux and online (webservers)
Software that performs sequence alignments, mostly based on sequence weighting,
position-specific gap penalties and weight matrix choice.
Comments: ClustalO is usually reported to perform better (faster and more accurate)
than the original ClustalW.
o MUSCLE*#†
https://www.drive5.com/muscle/
Edgar, R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time
and space complexity. BMC Bioinformatics, (5) 113
Windows, Mac OS X and Linux and online (webservers)
Software for multiple alignment of protein sequences.
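As a rough illustration of the BLAST entry above: a local database is typically built with makeblastdb and searched with blastn, and the tabular output (-outfmt 6) can then be filtered with standard tools. The sketch below only covers the filtering step. The function name filter_hits, the file names and the default thresholds (90 % identity, 100 bp alignment length) are illustrative assumptions, not recommendations from this list.

```shell
# Typical commands (not run here; file names are placeholders):
#   makeblastdb -in reference.fasta -dbtype nucl -out refdb
#   blastn -query query.fasta -db refdb -outfmt 6 > hits.tsv
#
# In -outfmt 6, column 3 is the percent identity and column 4 the
# alignment length. filter_hits keeps rows that pass both thresholds.
# Usage: filter_hits FILE [MIN_IDENTITY] [MIN_LENGTH]
filter_hits() {
    awk -F '\t' -v min_id="${2:-90}" -v min_len="${3:-100}" \
        '$3 >= min_id && $4 >= min_len' "$1"
}
```

For example, filter_hits hits.tsv 95 200 would keep only hits with at least 95 % identity over at least 200 aligned positions.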
Mapping
Alignment of short reads against a reference sequence so that the depth of coverage or variation
relative to the reference can be assessed.
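Coverage can be summarised once reads have been mapped. A minimal sketch, assuming samtools 1.x and the three-column output of samtools depth (reference name, position, depth); the function name mean_depth and all file names are hypothetical.

```shell
# Typical commands (not run here; file names are placeholders):
#   bwa index reference.fasta
#   bwa mem reference.fasta reads_R1.fastq reads_R2.fastq | samtools sort -o mapped.bam -
#   samtools depth -a mapped.bam > depth.tsv    # -a also reports zero-depth positions
#
# mean_depth averages the third column (depth) of such a file and
# prints "NA" when the file has no rows.
mean_depth() {
    awk '{ sum += $3 } END { if (NR) printf "%.2f\n", sum / NR; else print "NA" }' "$1"
}
```

Without -a, positions with no coverage are omitted from the depth file, so the average would be computed over covered positions only.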
o BWA*#
http://bio-bwa.sourceforge.net/
Fast and accurate short read alignment with Burrows-Wheeler Transform. Li H. and
Durbin R. (2009) Bioinformatics, 25:1754-60. [PMID: 19451168]
Windows, Mac OS X and Linux
Software package for mapping low-divergent sequences against a large reference
genome using the Burrows-Wheeler transform algorithm.
o Bowtie 2*#
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Fast gapped-read alignment with Bowtie 2. Langmead B, Salzberg S. Nature Methods.
2012, 9:357-359.
Windows, Mac OS X and Linux
Tool for aligning sequencing reads to long reference genomes also based on the
Burrows-Wheeler transform algorithm.
o Tablet
https://ics.hutton.ac.uk/tablet/
Using Tablet for visual exploration of second-generation sequencing data. Milne I,
Stephen G, Bayer M, Cock PJA, Pritchard L, Cardle L, Shaw PD and Marshall D. 2013.
Briefings in Bioinformatics 14(2), 193-202.
Windows, Mac OS X and Linux
Comments: Lightweight, high-performance graphical viewer for next generation sequence
assemblies and alignments that can be used to view mapping results.
Assembly refinement
Process of curating assembly by re-using reads and re-mapping steps.
o Pilon
https://github.com/broadinstitute/pilon/wiki
Bruce J. Walker, Thomas Abeel, Terrance Shea, Margaret Priest, Amr Abouelliel,
Sharadha Sakthikumar, Christina A. Cuomo, Qiandong Zeng, Jennifer Wortman, Sarah K.
Young, Ashlee M. Earl (2014) Pilon: An Integrated Tool for Comprehensive Microbial
Variant Detection and Genome Assembly Improvement. PLoS ONE 9(11): e112963.
doi:10.1371/journal.pone.0112963
Windows, Mac OS X, Linux
Java-based software that automatically improves draft assemblies and finds variation among
strains, including large event detection.
Comments: an assembly needs to be performed before using the software.
o FGAP
https://github.com/pirovc/fgap
Piro, Vitor C et al. “FGAP: An Automated Gap Closing Tool.” BMC Research Notes 7
(2014): 371. PMC
Online servers or Linux and Mac OS X
FGAP is a tool for closing gaps in draft genomes. It uses BLAST to align multiple contigs
against a draft genome assembly aiming to find sequences that overlap gaps. The
algorithm selects the best sequence to fill and eliminate the gaps.
Variant Calling
Variant calling is the process by which variants (differences) are identified from sequence data. It
usually follows the step of mapping reads against a reference.
o SAMtools*
http://samtools.sourceforge.net/
The Sequence alignment/map (SAM) format and SAMtools. Li H.*, Handsaker B.*,
Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000
Genome Project Data Processing Subgroup (2009) Bioinformatics, 25, 2078-9. [PMID:
19505943]
Windows, Mac OS X and Linux
Toolkit that provides various utilities for manipulating alignments in the SAM format and
also can be used for generating consensus sequences and variant calling
o GATK*
https://software.broadinstitute.org/gatk/
The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation
DNA sequencing data. Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko,
Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel,
Mark Daly, and Mark A. DePristo. Genome Res. September 2010 20: 1297-1303;
Published in Advance July 19, 2010, doi:10.1101/gr.107524.110
Windows, Mac OS X and Linux
Toolkit with a primary focus on variant discovery and genotyping.
o Picard
http://broadinstitute.github.io/picard/
Windows, Mac OS X and Linux
A set of command line tools (in Java) for manipulating high-throughput sequencing data
and formats.
Comments: command line only, but helpful for converting, sorting and handling different output
formats (BAM, SAM…).
o Varscan (version 2)
http://dkoboldt.github.io/varscan/
VarScan 2: Koboldt, D., Zhang, Q., Larson, D., Shen, D., McLellan, M., Lin, L., Miller, C.,
Mardis, E., Ding, L., & Wilson, R. (2012). VarScan 2: Somatic mutation and copy number
alteration discovery in cancer by exome sequencing Genome Research DOI:
10.1101/gr.129684.111
Windows, Linux and Mac OS X
A set of Java-based command line tools that detects different kinds of variants, such
as germline variants (SNPs and indels), multi-sample variants (shared or private) in
multi-sample datasets (with mpileup), somatic mutations and somatic copy number
alterations (CNAs).
Phylogenetic analysis
Assessment of the evolutionary relationship between strains using distance-based, maximum-likelihood
or Bayesian methodologies.
o RAxML*
http://sco.h-its.org/exelixis/web/software/raxml/index.html
RAxML Version 8: A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies.
A. Stamatakis. Bioinformatics (2014) 30 (9): 1312-1313.
Windows, Mac OS X and Linux
Randomized Axelerated Maximum Likelihood program for sequential and parallel
Maximum Likelihood based inference of large phylogenetic trees.
Comments: maximum-likelihood methods give more resolution/accuracy than FastTree
but take longer to run. Substitution models can be used as parameters.
o FastTree*#
http://www.microbesonline.org/fasttree/
FastTree: Computing Large Minimum-Evolution Trees with Profiles instead of a Distance
Matrix. Price, M.N., Dehal, P.S., and Arkin, A.P. (2009). Molecular Biology and Evolution
26:1641-1650, doi:10.1093/molbev/msp077.
Windows, Mac OS X and Linux
Comments: Faster tool for speedy inference of approximately-maximum-likelihood
phylogenetic trees from alignments of nucleotide or protein sequences. Particularly useful
to quickly generate trees.
o CSI Phylogeny*#
https://cge.cbs.dtu.dk/services/CSIPhylogeny/
Solving the Problem of Comparing Whole Bacterial Genomes across Different Sequencing
Platforms. Rolf S. Kaas, Pimlapas Leekitcharoenphon, Frank M. Aarestrup, Ole Lund.
PLoS ONE 2014; 9(8): e104984.
Comments: Online tool, easy to use and configure. Tool to call SNPs, filter the SNPs and
to do site validation and inference of phylogeny through a graphical user interface.
o Harvest
https://www.cbcb.umd.edu/software/harvest
The Harvest suite for rapid core-genome alignment and visualization of thousands of
intraspecific microbial genomes. Treangen TJ, Ondov BD, Koren S, Phillippy AM. Genome
Biology, 15 (11), 1-15
Windows, Mac OS X and Linux
Suite of core-genome alignment and visualization tools for quickly analysing thousands of
intraspecific microbial genomes, including variant calls, recombination detection, and
phylogenetic trees.
Comments: parsnp from this suite can compute trees based on a very large number of
assembled genomes.
o Gubbins
http://sanger-pathogens.github.io/gubbins/
Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome
sequences using Gubbins. Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane
J. A., Bentley S. D., Parkhill J., Harris S.R. doi:10.1093/nar/gku1196, Nucleic Acids
Research, 2014.
Windows, Mac OS X and Linux
Gubbins (Genealogies Unbiased By recomBinations In Nucleotide Sequences) is an
algorithm that iteratively identifies loci containing elevated densities of base substitutions
while concurrently constructing a phylogeny based on the putative point mutations
outside of these regions.
Comments: detection of recombination and generation of phylogeny. Depending on the
number of genomes to analyse, this tool can take a long time to run.
o BEAST
http://beast.bio.ed.ac.uk/
Bayesian phylogenetics with BEAUti and the BEAST 1.7. Drummond AJ, Suchard MA, Xie
D & Rambaut A (2012) Molecular Biology And Evolution 29: 1969-1973.
Windows, Mac OS X and Linux
Cross-platform program for Bayesian analysis of molecular sequences using MCMC.
Comments: can be used to generate a phylogeny based on prior information such as time.
Useful if you expect a time relationship in your phylogeny, but takes a long time to run.
o FigTree*#
http://tree.bio.ed.ac.uk/software/figtree/
Windows, Mac OS X and Linux
A graphical viewer of phylogenetic trees and program for producing publication-ready
figures of trees.
Comments: easy-to-use tool to visualise/manipulate trees.
o I-TOL*#
https://itol.embl.de/
Letunic I and Bork P (2016) Nucleic Acids Res doi: 10.1093/nar/gkw290 Interactive Tree
Of Life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other
trees
Online server
I-TOL Interactive Tree Of Life is an online tool for the display, annotation and
management of phylogenetic trees.
Comments: visualisation only. Registration is needed to get a workspace in which to
save/manipulate trees. Really powerful for viewing large/complex trees. An extensive range of
annotations is available.
o Mega†
http://www.megasoftware.net/
MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Kumar
S, Stecher G, and Tamura K (2016) Molecular Biology and Evolution 33:1870-1874
Windows, Mac OS X and Linux
Comments: Sophisticated and user-friendly software suite for analysing DNA and protein
sequence data from species and populations. Contains building tree algorithms.
Virulence prediction
o PathogenFinder
https://cge.cbs.dtu.dk/services/PathogenFinder/
PathogenFinder - Distinguishing Friend from Foe Using Bacterial Whole Genome
Sequence Data. Cosentino S, Voldby Larsen M, Møller Aarestrup F, Lund O. (2013) PLoS
ONE 8(10): e77302.
Online tool
Web-server for the prediction of bacterial pathogenicity by analysing the input proteome,
genome, or raw reads provided by the user.
Cloud Services
If infrastructure is not available, cloud-based services are worth considering.
Genomics-Specific
o MRC CLIMB
http://www.climb.ac.uk/
Microbial bioinformatics cyber-infrastructure.
o Genomics Virtual Laboratory
https://www.gvl.org.au/
A genomics-specific version of Galaxy
o Galaxy
https://usegalaxy.org/
An open source, web-based platform for data intensive biomedical research.
Non-Genomics Specific
o Amazon Web Services
https://aws.amazon.com
Pay per usage cloud computing managed by amazon.com for temporary computing of big
data
o Azure (Microsoft)
https://azure.microsoft.com/en-us/
Multiple services divided into the following categories: AI + Machine Learning, Analytics,
Compute, Containers, Databases, Developer Tools, DevOps, Identity, Integration,
Internet of Things, Management Tools, Media, Migration, Mobile, Networking, Security,
Storage, Web
Commercial software
o Bionumerics Seven
http://www.applied-maths.com/applications
Offers a range of tools to analyse sequence data including MLST, wgMLST, AMR profiling,
wgSNPs.
o Ridom SeqSphere+
http://www.ridom.de/seqsphere/index.shtml
Software designed to analyse NGS data using MLST/cgMLST.
Here is a brief guide on how to set up a Virtual Machine on your PC to simulate a Linux environment
with several bioinformatics tools.
Downloading VirtualBox
Figure 1 VirtualBox 5.0 for Windows. Within VirtualBox Ubuntu 14.04 is running.
For further information on how to set up a VM with whichever OS you like, please refer to the manual
(also attached to this email).
Once this is working, you can delete the .ova file to save space. See the VirtualBox docs for more
details, including how to share folders (also detailed in the next paragraph) and configure hardware.
You will also want to adjust settings such as CPU, RAM and video acceleration to suit your
hardware, using the “System” tab of your VM (when it’s not running), for example:
• Memory: 4096 MB
• 2 CPUs
• 10 MB video memory
or, more generally, we suggest setting both memory and CPU values to half of those of your
actual system, and never below 2 GB of memory.
2 Note, however, that this project is no longer funded/developed, so a better long-term choice may be to set up a
Linux/Ubuntu based machine where you can install all the tools you need.
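The sizing rule above (half of the host's memory and CPUs, with a 2 GB memory floor) can be sketched as a small shell function. The function name suggest_vm_settings and its two arguments (host memory in MB, host CPU count) are illustrative assumptions and not part of VirtualBox or Bio-Linux; the 1-CPU minimum is also an added assumption for single-core hosts.

```shell
# Sketch: suggest VM memory (MB) and CPU count as half of the host values,
# never going below 2048 MB of memory (floor taken from the text above).
# Usage: suggest_vm_settings HOST_MEMORY_MB HOST_CPUS
suggest_vm_settings() (
    host_mem_mb=$1
    host_cpus=$2
    vm_mem=$((host_mem_mb / 2))
    vm_cpus=$((host_cpus / 2))
    [ "$vm_mem" -lt 2048 ] && vm_mem=2048
    [ "$vm_cpus" -lt 1 ] && vm_cpus=1      # assumption: at least one CPU
    echo "memory_mb=$vm_mem cpus=$vm_cpus"
)

# Example: a host with 8 GB of RAM and 4 cores
suggest_vm_settings 8192 4    # prints: memory_mb=4096 cpus=2
```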
NOTE: You should treat the VM as a real machine for security purposes and apply all system security
updates in a timely manner. The default manager password is, clearly, not secure. This might not be
a problem because by default nobody can access the Linux VM unless they have direct access to your
computer, but if you open up the network settings (eg. by adding port forwarding rules) then you
must secure the account with a strong password or else take other steps to limit remote access.
Ideally enforce key-only access via SSH.
Now everything is set up to start working on your VM. If you want, you can try installing the following tools
(which are not included in the Bio-Linux 8 release and which we will use a lot during the training)
directly from the command line. Please make sure you have an internet connection available and open a
terminal window to follow the instructions below.
Trimmomatic
Trimmomatic is a flexible read trimming tool for Illumina NGS data. It is a Java-based tool, so first
check that Java is installed on your VM by typing
which java
(default output should be /usr/bin/java). Then get Trimmomatic by typing
sudo apt-get install trimmomatic
and enter the password “manager”. Once the installation is completed, you should be able to
find it by typing
which TrimmomaticPE
To get usage information, just type
man TrimmomaticPE
on the command line.
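The two "which" checks above can also be combined into one helper. This is only a sketch: the function name check_tools is hypothetical, and it uses the shell built-in "command -v" in place of "which".

```shell
# Sketch: report which of the given commands are missing from PATH.
# Prints one "missing: NAME" line per absent tool and returns non-zero
# if at least one is missing.
check_tools() (
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    exit "$status"
)

# Example: check the two commands used in this section.
# "|| true" keeps the shell going if something is not installed yet.
check_tools java TrimmomaticPE || true
```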
To use Trimmomatic, you need to retrieve the ADAPTERS files (fasta format).
Run
#### GET THE ADAPTERS FOR TRIMMOMATIC
cd /usr/local/bioinf
sudo wget \
http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.36.zip
(note that the last two lines are a single command).
Then type
sudo unzip Trimmomatic-0.36.zip
to extract the files. A usage example for Trimmomatic would be
#### RUN TRIMMOMATIC ON ONE SAMPLE
You can find more information (an explanation of what each option means and why/how to set
it) in Anais’s presentation.
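As a rough illustration of a paired-end run, the sketch below prints (without running) a TrimmomaticPE command for one sample. The trimming steps and thresholds follow the paired-end example in the Trimmomatic manual and are not taken from this guide; the read file naming (SAMPLE_R1.fastq.gz / SAMPLE_R2.fastq.gz), the adapter file location inside the unzipped archive and the function name trimmomatic_cmd are assumptions.

```shell
# Sketch: print (not run) a paired-end Trimmomatic command for one sample.
# Assumption: the adapter file sits in the archive unzipped above.
ADAPTERS=/usr/local/bioinf/Trimmomatic-0.36/adapters/TruSeq3-PE.fa

trimmomatic_cmd() (
    s=$1
    echo "TrimmomaticPE -phred33" \
        "${s}_R1.fastq.gz ${s}_R2.fastq.gz" \
        "${s}_R1_paired.fastq.gz ${s}_R1_unpaired.fastq.gz" \
        "${s}_R2_paired.fastq.gz ${s}_R2_unpaired.fastq.gz" \
        "ILLUMINACLIP:${ADAPTERS}:2:30:10" \
        "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"
)

# Inspect the command first; pipe it to "sh" to actually execute it.
trimmomatic_cmd sample1
```

Printing the command before running it makes it easy to check file names and thresholds for each sample.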
Spades
SPAdes – St. Petersburg genome assembler – is an assembly toolkit containing various assembly
pipelines.
To get it, open the terminal and type
wget http://cab.spbu.ru/files/release3.11.0/SPAdes-3.11.0-Linux.tar.gz
Copy the archive to the installation folder (if a password is required, type “manager”), then move
there and uncompress the file
sudo cp SPAdes-3.11.0-Linux.tar.gz /usr/local/bin
cd /usr/local/bin
sudo tar -xzf SPAdes-3.11.0-Linux.tar.gz
[Optional] Create a soft link to the folder, so you don’t have to change much if you install a newer
version later on:
sudo ln -s SPAdes-3.11.0-Linux/ spades
Add the folder to the path by modifying the .zshrc file (if your command line interpreter is zsh) or
your /etc/profile file (if your command line interpreter is bash); see footnote 3.
Figure 3 Screenshot of the .zshrc file. Please insert the PATH command right after the comments
(lines starting with #) and ignore the rest of the file content.
To apply the change to the path without restarting the VM, type on the command line
export PATH="$PATH:/usr/local/bin/spades/bin"
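Typing the export line more than once (or sourcing the config file repeatedly) adds the same folder to PATH several times. A minimal sketch of a guard against that; the helper name path_append is hypothetical.

```shell
# Sketch: append a directory to PATH only if it is not already listed,
# so that repeating the command does not create duplicate entries.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present: nothing to do
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}

path_append /usr/local/bin/spades/bin
```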
For testing purposes, SPAdes comes with a toy data set (reads that align to the first 1000 bp of E. coli).
To try SPAdes on this data set, run from command line:
spades.py --test
If the installation is successful, you will find the following information at the end of the log:
===== Assembling finished. Used k-mer sizes: 21, 33, 55
* Corrected reads are in spades_test/corrected/
* Assembled contigs are in spades_test/contigs.fasta
* Assembled scaffolds are in spades_test/scaffolds.fasta
* Assembly graph is in spades_test/assembly_graph.fastg
* Assembly graph in GFA format is in spades_test/assembly_graph.gfa
* Paths in the assembly graph corresponding to the contigs are in spades_test/contigs.paths
* Paths in the assembly graph corresponding to the scaffolds are in spades_test/scaffolds.paths
======= SPAdes pipeline finished.
========= TEST PASSED CORRECTLY.
SPAdes log can be found here: spades_test/spades.log
Thank you for using SPAdes!
3 In order to test which interpreter you are using, write echo $0 on the command line. If
your result is zsh and you wish to change it to bash, just type “chsh -s /bin/bash” on the
command line and restart the VM.
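Rather than reading the end of the log by eye, the success marker can be searched for directly. A small sketch, assuming the marker text "TEST PASSED CORRECTLY" is written to the log file reported above; the function name spades_test_passed is hypothetical.

```shell
# Sketch: succeed only if the given SPAdes log contains the final
# "TEST PASSED CORRECTLY" marker.
spades_test_passed() {
    grep -q "TEST PASSED CORRECTLY" "$1" 2>/dev/null
}

# Example (log location as reported by SPAdes):
if spades_test_passed spades_test/spades.log; then
    echo "SPAdes test run OK"
else
    echo "SPAdes test run not confirmed; inspect the log"
fi
```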
Quast
Quast is a quality assessment tool for measuring the quality of your genome assembly. It is
particularly useful because it can generate a table comparing different metrics of your genome
assemblies.
To download the tool, run:
wget https://downloads.sourceforge.net/project/quast/quast-4.5.tar.gz
sudo cp quast-4.5.tar.gz /usr/local/bin
cd /usr/local/bin
sudo tar -xzf quast-4.5.tar.gz
echo 'export PATH="$PATH:/usr/local/bin/quast-4.5"' >> ~/.zshrc
export PATH="$PATH:/usr/local/bin/quast-4.5"
Let’s analyze what these lines are doing:
1. get the compressed installation file from internet
2. copy the compressed file into the /usr/local/bin folder; you have to use sudo to have the
administrator permissions to copy into this folder
3. change directory to /usr/local/bin
4. uncompress your compressed file
5. update your config file so to add the quast folder to your PATH variable
6. update your PATH on the fly to avoid rebooting your machine.
Now that you have quast at hand, you can use it with a list of contig files to compare their qualities.
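Among the metrics QUAST tabulates is N50. As a rough, self-contained illustration of what that number means, the sketch below computes it from a contig FASTA file with awk and sort. The function name n50 is hypothetical, and the quast.py call in the comment uses placeholder file names.

```shell
# A typical QUAST call on two assemblies (not run here):
#   quast.py assembly_a/contigs.fasta assembly_b/contigs.fasta -o quast_results
#
# Sketch: compute N50 from a FASTA file of contigs.
# N50 is the length L such that contigs of length >= L together
# contain at least half of the total assembled bases.
n50() {
    # First pass: print one length per contig (sequences may span lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    # Second pass: walk the lengths from longest to shortest until the
    # running sum reaches half of the total.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { print lens[i]; exit }
             }
         }'
}
```

For contigs of 8, 6, 4 and 2 bp (20 bp in total), the two longest contigs together reach half of the total, so N50 is 6.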
References
https://www.virtualbox.org/
http://environmentalomics.org/whats-new-in-bio-linux-8/
http://www.usadellab.org/cms/index.php?page=trimmomatic
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103590/
http://cab.spbu.ru/software/spades/
http://quast.sourceforge.net/
https://www.ncbi.nlm.nih.gov/pubmed/22506599