Functional metagenomics is used to understand who is doing what in microbial ecosystems. DNA sequencing can be prioritized by activity-based screening of libraries obtained by cloning and expressing metagenomic DNA fragments in an heterologous host. When large insert libraries are used, allowing a direct access to the functions encoded by entire metagenomic loci sizing several dozens of kbp, NGS is required to identify the genes that are responsible for the screened function. MINTIA is an easy to use pipeline assembling and annotating metagenomic inserts.
The assembly module (assemble) assembles, cleans and extracts the longest and most covered contigs for each DNA insert. It handles reads (454, ion torrent,...) or read pairs (Illumina,...). This tools is not able to process PacBio or Oxford Nanopore reads. It sub-selects by default 300X read coverage depending on the sequencing platform and assembles them, removes cloning vector and selects best contigs. Only contigs with length and average depth over given thresholds are kept. The produced HTML report includes a dynamic graphic with contig length and coverage for each sample.
The annotation module (annotate) aims at obtaining main gene functions and a functional classification. The pipeline launches at least prokka for ORF detection, generating fasta files for genes and proteins as well as a tabular file containing ORFs description. Depending on additional options selected, contigs and ORFs are aligned against NCBI NR (Non Redundant) as well as SP (SwissProt) and COGss databases. The produced HTML report includes all results and allow to explore annotations based on igv.js an embeddable interactive genome visualization component.
This MINITA repository is for command line user.
For an easy install, conda environment is recommended. To install Miniconda,
follow this tutorial or,
to install Ananconda, follow this one.
Clone this repository:
$ git clone --recursive https://github.com/Bios4Biol/MINTIA.git
Use conda to install the third party software:
$ cd MINTIA
$ ./setup.sh
Two dependencies will not be installed by conda and must be installed "manually":
- cross_match required for the assemble module (step1): cross_match
- MEGAN5 (optional) for the annotate module (step2):
- xvfb-run
- megan5
Activate the mintia environment:
$ conda activate mintia
Tools dependencies can be checked:
$ mintia check
##############################################
Mintia_v1.0 check dependencies
##############################################
- Step 1 - assemble:
=> cutadpt..........ok...version:2.10
=> spades...........ok...version:v3.14.1
=> cross_match......ok...version:1.090518
- Step 2 - annotate:
=> prokka...........ok...version:1.14.6
=> diamond..........ok...version:0.9.29
=> xvfb-run.........ok
=> MEGAN............ok...version:5.11.3
=> rpsblast.........ok...version:2.9.0
=> samtools.........ok...version:1.9
=> tabix............ok...version:1.9
Tools | Tested version | Tools | Tested version | |
---|---|---|---|---|
cross_match |
1.090518 |
bgzip |
||
spades |
v3.14.1 |
file |
file-5.04 |
|
prokka |
1.14.6 |
grep |
GNU grep 2.6.3 |
|
diamond |
0.9.29 |
which |
||
megan |
5.11.3 |
xvfb-run |
||
samtools |
1.9 |
tabix |
1.9 |
|
rpsblast |
2.9.0 |
|||
cutadapt |
2.10 |
Reference databases are required for the "annotate" module.
- NR: non-redundant protein database indexed for Diamond (used by -F and -M see below)
- Uniprot/Swissprot: protein database indexed for Diamond (used by -F see below)
- COGs: database of Clusters of Orthologous Groups of proteins, input of
rpsblast
(used by -C see below)
Create COGs DB
$ wget ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd//cdd.tar.gz
$ tar -xvzf cdd.tar.gz
$ makeprofiledb -title COG.3-28-17 -in Cog.pn -out Cog.v3-28-17 -threshold 9.82 -scale 100.0 -dbtype rps -index true
$ conda activate mintia
$ mintia check -h
Name:
mintia - Fosmid assembly and annotation pipeline.
Check Synopsis:
mintia check
Check Options:
-h, --help
Print help
$ mintia assemble -h
Name:
mintia - Fosmid assembly and annotation pipeline.
Assemble Synopsis:
mintia assemble -i FASTQ_FILE[S] -v FASTA_FILE -d STR
Assemble Options:
-i, --input FILE[S]
Fastq(.gz) file(s)
For each sample one OR two fastq file must be provided:
- Paired data must contain R1/R2: R[12].f[ast]q[.gz]
Ex: sampleName1_R1.fastq sampleName1_R2.fastq
sampleName2_R1.fq.gz sampleName2_R2.fq.gz
- Single data
Ex: sampleName.f[ast]q[.gz]
-v, --vectorSeq FILE
Vector fasta file
--length INT
Fosmid's expected length [40000]
--minimalContigLength INT
Contig's minimum length [1000]
--minimalContigDepth INT
Contig's minimum depth [8]
-c, --maxDepth INT
Coverage, maximum depth use to filter input reads [300]
-d, --dirOutputs STR
Path to the outputs directory
-H, --htmlOutput STR
HTML output name [mintia_assemble.html]
-L, --logOutput STR
Log output file name [mintia_assemble.log]
-t, --threads
number of threads for SPADES [8]
-h, --help
Print help
Example based on test data:
mintia assemble -t 1 -i Data/Input/Assemble/BifidoAdolescentis.s*gz -v Data/Input/Assemble/pCC1FOS.fasta -len 40000 -c 300 -d Data/Output/Assemble/
$ mintia annotate -h
Name:
mintia - Fosmid assembly and annotation pipeline.
Annotate Synopsis:
mintia annotate -i FASTA_FILE -n NR_DMND_FILE -u UNIPROT_DMND_FILE -F -d STR
Annotate Options:
-i, --input FILE
Fasta(.gz) file
-s, --separator CHAR [#]
Which separator allows retreiving the fosmid name
This separator will be also use to create ORF id
Example: >fosmidName1#contig1... |
>fosmidName1#contig2... | => fosmidName1
>fosmidName1#contig3... |
-n, --nrDB FILE
Non-redundant proteins database indexed for Diamond (Ex: the nr.dmnd)
Required by -F.
-u, --uniprotDB FILE
Proteic sequence database indexed for Diamond (Ex: the uniprot_sprot.dmnd)
Required by -F and -M.
-F, --FunctionalAndTaxonomic
Run functional and taxonomic annotations
-n, --nrDB and -u, --uniprotDB must be provided
-e, --evalue FLOAT
Maximum diamond e-value to report alignments [10e-8]
-q, --queryCover INT
Minimum diamond query cover% to report an alignment [50]
-M, --Megan FILE
Run MEGAN - A license file must be provided
-n, --nrDB must be provided
-C, --Cog FILE
Run annotations with COGs, DB COGs path file
-c, --cMaxEvalue FLOAT
Max Evalue for Rps-Blast COGs filtering [10e-8]
-S, --SubmissionFiles
Build submission files
-D, --DiamondAgainstPrivateDB FILE
Run diamond against your own protein reference FASTA file
-t, --threads INT
Number of threads for Blast [8]
-d, --dirOutputs STR
Path to the outputs directory
-H, --htmlOutput STR
HTML output name [mintia_annotate.html]
-L, --logOutput STR
Log output file name [mintia_annotate.log]
-k, --keepTmpFiles
Keep temporary files
-h, --help
Print help
Example based on test data:
# Without Megan and COG annotations
mintia annotate -F -i Data/Output/Assemble/mintia_assemble.fasta -d Data/Output/Annotate2/ -t 1 -S -n Data/Input/Annotate/NR_subset4test.dmnd -u Data/Input/Annotate/UNIPROT-SP_subset4test.dmnd
# With Megan and COG annotations
mintia annotate -F -C PATH_TO_COG_DB/Cog.v3-28-17 -M PATH_TO_YOUR_LICENSE/MEGAN5-academic-license.txt - Data/Output/Assemble/mintia_assemble.fasta -d Data/Output/Annotate2/ -t 1 -S -n Data/Input/Annotate/NR_subset4test.dmnd -u Data/Input/Annotate/UNIPROT-SP_subset4test.dmnd
GNU GPL v3
2020 INRAE