riboseq-flow - A Nextflow DSL2 pipeline to perform ribo-seq data analysis and comprehensive quality control
- Introduction
- Pipeline summary
- Quick start (test the pipeline)
- Quick start (run the pipeline on your data)
- Pipeline parameters
- Pipeline outputs
- Pre-download container images
- Authors and contact
- Issues and contributions
- Contributing guidelines
- Why use riboseq-flow
riboseq-flow is a Nextflow DSL2 pipeline for the analysis and quality control of ribo-seq data.
- UMI extraction (
UMI-tools
) (Optional) - Adapter and quality trimming, read length filtering (
Cutadapt
) (Optional) - Read QC (
FastQC
) - Pre-mapping to remove reads mapping to contaminants (
Bowtie2
,SAMtools
) (Optional) - Mapping to the genome and transcriptome (
STAR
) - UMI-based deduplication (
UMI-tools
,SAMtools
,BEDTools
) (Optional) - Extensive QC specialised for ribo-seq (
mapping_length_analysis
,R
) (Optional) - Gene-level read quantification (
FeatureCounts
) - P-site identification, CDS occupancy and P-site diagnostics (
riboWaltz
) (Optional) - Principal Component Analysis (PCA) using gene-level read counts and P-site counts over CDS regions (
DESeq2
) - Generation of coverage tracks (
deepTools
) - Design of sgRNA templates to deplete unwanted abundant contaminants (
Ribocutter
) (Optional) - MultiQC report of reads QC, mapping statistics, PCA and summarised ribo-seq QC metrics (
MultiQC
) - Tracking reads and visualisation of read fate through the key pipeline steps (
networkD3
)
- Ensure
Nextflow
(version21.10.3
or later) andDocker
orSingularity
(version3.6.4
or later) are installed on your system. Nextflow installation instructions can be found here. We recommend using Nextflow withJava 17.0.9
or later.
Note: The pipeline has been tested on with Nextflow
versions 21.10.3
, 22.10.3
, 23.04.2
and 23.10.0
.
- Pull the desired version of the pipeline from the GitHub repository:
nextflow pull iraiosub/riboseq-flow -r v1.1.1
- Run the pipeline on the provided test dataset:
Using Singularity:
nextflow run iraiosub/riboseq-flow -r v1.1.1 -profile test,singularity
or using Docker:
nextflow run iraiosub/riboseq-flow -r v1.1.1 -profile test,docker
- Check succesful execution.
- Ensure
Nextflow
andDocker
/Singularity
are installed on your system. - Pull the desired version of the pipeline from the GitHub repository:
nextflow pull iraiosub/riboseq-flow -r v1.1.1
- Create a samplesheet
samplesheet.csv
with information about the samples you would like to analyse before running the pipeline. It has to be a comma-separated file with 2 columns, and a header row as shown in the example below.
Note: Only single-end read data can be used as input; if you used paired-end sequencing make sure the correct read is used for the analysis.
sample,fastq
sample1,/path/to/file1.fastq.gz
sample2,/path/to/file2.fastq.gz
sample3,/path/to/file3.fastq.gz
- Run the pipeline. The typical command for running the pipeline is as follows (the minimum parameters have been specified):
nextflow run iraiosub/riboseq-flow -r v1.1.1 \
-profile singularity,crick \
-resume \
--input samplesheet.csv \
--fasta /path/to/fasta \
--gtf /path/to/gtf \
--contaminants_fasta /path/to/contaminants_fasta \
--strandedness forward
-profile
: specifies a configuration profile. Profiles can give configuration presets for different compute environments. Options aretest
,docker
,singularity
andcrick
depending on the system being used and resources available. Others can be found at nf-core.-resume
: specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files'contents as well.
--input
specifies the input sample sheet--outdir
specifies the output results directory- default:
./results
- default:
--tracedir
specifies the pipeline run trace directory- default:
./results/pipeline_info
- default:
The pipeline is compatible with any well-annotated organism for which a FASTA genone file and GTF annotation (and ideally rRNA and other contaminat sequences) are available. Recommended sources for these files are GENCODE and Ensembl. Important: The GTF file must include UTR annotations, and the format should follow the standards set by Ensembl or GENCODE.
Simplified Option for Human and Mouse:
Use the --org
flag to automatically download and set up reference files for human or mouse genomes, eliminating the need to manually provide them.
Available options for --org
are GRCh38
(human) and GRCm39
(mouse).
When --org
is specified, all annotation files are sourced from the paths in the genomes.config file.
Manual Annotation File Specification:
If --org
is not specified, the user needs to provide full paths to all required annotation files:
-
--fasta
path to the FASTA genome file -
--gtf
path to the GTF annotation file -
--contaminants_fasta
path to the FASTA file of abundant RNA contaminants (like rRNA). (Required if pre-mapping is enabled) -
--star_index
path to directory of pre-built STAR index (Optional). If not provided, the pipeline will generate the STAR index. -
--transcript_info
path to TSV file with a single representative transcript for each gene, with information on CDS start, length, end and transcript length (Optional). If not provided, the pipeline will create it using the GTF file, selecting the longest CDS transcript per gene. The transcript IDs in this file must match those in the GTF file. An example file format can be found here. -
--transcript_fasta
path to transcripts FASTA (full sequence, including CDS and UTRs) and matches thetranscript_info
file. (Required if supplying thetranscript_info
file). If not provided, the pipeline will generate this file for the selected representative transcripts. -
--save_index
save the Bowtie2 and STAR indexes generated by the pipeline. By default, not enabled.
The pipeline allows the user to set preferred parameter values, or the option to skip pipeline steps, as detailed below. Most parameters have default values, which will be used by the pipeline unless the user overrides them by adding the appropriate options to the run script. Where a default value is missing, the user must provide an appropriate value. All pre-defined parameter values are listed here.
The pipeline supports UMI-based deduplication. If UMIs were used, you must explicitly specify so by providing the appropriate arguments.
--with_umi
enables UMI-based read deduplication. Use if you used UMIs in your protocol. By default, not enabled.--skip_umi_extract
skips UMI extraction from the read in case UMIs have been moved to the headers in advance or if UMIs were not used--umi_extract_method
specify method to extract the UMI barcode (options:string
(default) orregex
)--umi_pattern
specifies the UMI barcode pattern, e.g. 'NNNNN' indicates that the first 5 nucleotides of the read are from the UMI.--umi_separator
specifies the UMI barcode separator (default:_
; userbc:
if Ultraplex was used for demultiplexing)
Read trimming steps are executed in the following order: (i) adaptors and low quality bases are trimmed, (ii) bases that need to be removed from read extremities are removed (e.g. non-templated bases, if applicable) and (iii) reads shorter than a minimum length are filtered out.
--skip_trimming
skip the adapter and quality trimming and length filtering step--save_trimmed
save the final and intermediate FASTQ files produced during the trimming steps. By default, not enabled.--adapter_threeprime
sequence of 3' adapter (equivalent to-a
incutadapt
)--adapter_fiveprime
sequence of 5' adapter (equivalent to-g
incutadapt
)--times_trimmed
number of times a read will be adaptor trimmed (default:1
)--cut_end
number of nucleotides to be trimmed from 5' or 3' end of reads (equivalent to-u
incutadapt
). Supply positive values for trimming from the 5' end, and negative values for trimming from the 3'end (default0
, no residues are trimmed). Important:: This step is perfomed after adapter trimming, and after UMIs have been moved to the read header. Useful for libraries where non-templated nucleotides are part of the read and need trimming (e.g. template-switching, OTTR libraries).--minimum_quality
cutoff value for trimming low-quality ends from reads (default:10
)--minimum_length
minimum read length after trimming (default:20
)
Ribocutter
is a tool for designing sgRNA templates to deplete unwanted abundant contaminants.
--skip_ribocutter
skips Ribocutter--guide_number
number of guides to design (default:50
)--max_reads
maximum number of reads analysed (default:1000000
)--min_read_length
minimum read length threshold for reads to be analysed--extra_ribocutter_args
string specifying additional arguments that will be appended to theribocutter
command
--skip_premap
skips pre-mapping to common contaminant sequences--bowtie2_args
allows users to input custom arguments for Bowtie2--star_args
enables the input of specific arguments for STAR
The preset alignment settings for both STAR and Bowtie2 are optimized for standard ribo-seq analysis. Altering these arguments is generally not advised unless specific data requirements or unique analytical needs arise.
--strandedness
specifies the library strandedness (options:forward
,reverse
orunstranded
) (Required)--feature
specifies the feature to be summarised at gene-level (options:exon
orCDS
) (default:exon
)
--skip_qc
skips mapping length analysis and generation of ribo-seq QC plots--expected_length
expected read lengths range. Used to report the proportion of reads of expected lengths in the aligned reads, for the generation of ribo-seq QC plots, and for specifying the range of read lengths used for P-site identification (default:26:32
).
Important: The --expected_length
parameter does not filter reads based on this length range for any other analyses, including alignment, gene-level quantification or track data generation.
P-sites are identified with riboWaltz
. It is strongly recommended to check the riboWaltz method to ensure the approach is suitable for your data. By default, the reads must be in length bins that satisfy periodicity to be used for P-site offset calculations. Additionally, the user can specify the following options:
-
--skip_psite
skips P-site identification and riboWaltz diagnotics plots -
--expected_length
expected read lengths range. Used to report the proportion of reads of expected lengths in the aligned reads, for the generation of ribo-seq QC plots, and for specifying the range of RPF lengths used for P-site identification (default:26:32
). -
--periodicity_threshold
specifies the periodicity threshold for read lengths to be considered for P-site identification (default:50
) -
--psite_method
specifies method used for P-site offsets identification (options:ribowaltz
(default) orglobal_max_5end
).- For
ribowaltz
P-site offsets are defined usingriboWaltz
. - For
global_max_5end
P-site offsets are defined by the distances between the first nucleotide of the translation initiation site and the nucleotide corresponding to the global maximum of the read length-specific 5'end profiles (the first step of the riboWaltz method). Compared to riboWaltz-defined offsets, only the 5' extremities of the reads are considered for calculation and no further offset correction is performed after read-length global maximum identification.
- For
-
exclude_start
specifies the number of nucleotides 3' from the start codon to be excluded from CDS P-site quantification (default:42
, i.e. exclude the first 14 codons) -
exclude_stop
specifies the number of nucleotides 5' from the stop codon to be excluded from CDS P-site quantification (default:27
, i.e. exclude the last 9 codons) -
genomic_codon_coverage
convert transcript positions of inferred P-sites mapping on each triplet of annotated CDS and UTR to genomic coordinates and report geomic coordinates and P-site counts in a BED file. By default, not enabled.
Important: P-sites and information are identified using transcriptomic coordinates. For this, a representative transcript is selected for each gene. The selection is based on the following hierarchy: CDS length > total length > number of exons > 5'UTR length > 3'UTR length. The selection is performed automatically by the pipeline using the information in the provided GTF file, and stored in *.longest_cds.transcript_info.tsv
. The default representative transcript table can be overriden by the user by providing their own table with --transcript_info
(see instructions above).
--bin_size
spcifies bin size for calculating coverage (i.e. the number of nt per bin). Bins are short consecutive counting windows of a defined size. (default:1
)--track_format
specifies output file type (options:bigwig
(default) orbedgraph
)
The pipeline outputs results in a number of folders:
.
├── annotation
├── preprocessed
├── fastqc
├── premapped
├── mapped
├── deduplicated
├── riboseq_qc
├── featurecounts
├── psites
├── coverage_tracks
├── ribocutter
├── multiqc
└── pipeline_info
annotation
contains information on the representative transcript per gene used for ribo-eq QC and P-site analyses, as well as bowtie2, STAR indexes and annotation files used by the pipeline*.longest_cds.transcript_info.tsv
TSV file with a single representative transcript for each gene, with information on CDS start, length and endbowtie2
andstar
folders with indexes used for aligning reads (if--save_index
enabled)
preprocessed
contains reads pre-processed according to user settings for UMI extraction and trimming and filtering (if--save_trimmed
enabled) and the corresponding logs. The*.filtered.fastq.gz
files are used for downstream alignment steps.fastqc
contains FastQC reports of pre-processed reads (*.filtered.fastq.gz
)premapped
contains files from aligning reads to contaminants:*.bam
BAM format alignments to contaminants*.bam.seqs.gz
mapping locations and sequences of contaminant-mapped reads*.unmapped.fastq.gz
sequencing reads that did not map to the contaminants in FASTQ format*.premap.log
is the bowtie2 log file
mapped
contains files resulting from alignment to the genome and transcriptome:*.Aligned.sortedByCoord.out.bam
read alignments to the genome in BAM format*.Aligned.toTranscriptome.out.bam
read alignments to the transcriptome in BAM format*.Log.final.out
is the STAR log output file
deduplicated
contains files resulting after deduplication based on genomic or transcriptomic location and UMIs:*.genome.dedup.sorted.bam
UMI deduplicated alignments to the genome, in BAM format*.transcriptome.dedup.sorted.bam
UMI deduplicated alignments to the transcriptome in BAM format
riboseq_qc
contains QC plots informing on read and mapping lengths, frame, periodicity, signal around start and stop codons, contaminants content, duplication, fraction of useful (mapped to protein-coding transcripts) reads, proprtion of reads remaining after alignment steps etc and folders containing other QC results.*.qc_results.pdf
per-sample QC reports with specialised ribo-seq quality metrics stratified by read lengthmapping_length_analysis
contains CSV files with number of raw and mapped reads by length:*.after_premap.csv
*.before_dedup.csv
*.after_dedup.csv
multiqc_tables
contains TSV files with sample summary metrics for multiQCread_fate
contains sample-specific html files tracking read fate through the pipeline steps, a visualisation that helps understanding the yield of useful reads and troubleshooting.pca
contains PCA plots and rlog-normalised count tables. Only produced if 3 samples or more are analysed.rust_analysis
containsRUST
metafootprint analysis, with plots showing the Kullback–Leibler divergence (K–L) profiles stratified by read length, using the inferred P-sites.
featurecounts
contains gene-level quantification of the UMI deduplicated alignments to the genomepsites
contains P-sites information, codon coverage and CDS coverage tables, and riboWaltz diagnostic plots:psite_offset.tsv.gz
P-site offsets for each read-length for all samples*psite.tsv.gz
ample-specific P-site information for each readribowaltz_qc
folder containing P-site diagnostic plots generated by riboWaltz*.coverage_psite.tsv.gz
P-site counts over the CDS of representative transcripts, or over a CDS window excluding a spcified region downstream start codons and upstream stop codonscodon_coverage_psite.tsv.gz
P-sites counts on each triplet of annotated CDS and UTR for each transcriptcodon_coverage_rpf.tsv.gz
RPF counts on each triplet of annotated CDS and UTR for each transcriptoffset_plot
contains plots that detail P-site assignment using the method
coverage_tracks
contains track files in BED and bigWig or bedGraph format*.bigwig
UMI deduplicated genome coverage tracks in bigWig format*.bedgraph
UMI deduplicated genome coverage tracks in bedGraph format*.genome.dedup.bed.gz
UMI deduplicated alignments to the genome in BED format*.transcriptome.dedup.bed.gz
UMI deduplicated alignments to the transcriptomie in BED formatpsite/*.codon_coverage_psite.bed.gz
folder containing BED files with the genomic coordinates and counts of inferred P-sites mapping on each triplet of annotated CDS and UTR for each transcript. Obtained by converting transcriptomic positions fromcodon_coverage_psite.tsv.gz
to genomic coordinates.
ribocutter
contains results of Ribocutter run on the pre-processed reads of minimum length defined by user, and minimum length of 23 nt, respectively*.csv
CSV files with the designed oligo sequences, their target sequence, the fraction of the library that each oligo targets and the total fraction of the library targeted by all oligosribocutter.pdf
plot showing the estimated fraction of the library that is estimated to be targeted by the guides designed using Ribocutter
multiqc
contains a MultiQC report summarising the FastQC reports, pre-mapping and mapping logs, and summarised ribo-seq QC metricspipeline_info
contains the execution reports, traces and timelines generated by Nextflow:execution_report.html
execution_timeline.html
execution_trace.txt
When a Nextflow pipeline requires multiple Docker images, it can sometimes fail to pull them, leading to pipeline execution failures. In such scenarios, pre-downloading the container images to a designated location on your system can prevent this.
Follow these steps to pre-download and cache the necessary images:
- Identify the desired location on your system where you want to store the container images.
- Set the
NXF_SINGULARITY_CACHEDIR
environment variable to point to this chosen location. For example, you can add the following line to your shell profile or run script:
export NXF_SINGULARITY_CACHEDIR=/path/to/image/cache
- Make sure you have
Singularity
installed. Please use the same version you intend to use for running the pipeline. - Run the code below to pre-download and cache the required Docker images:
#!/bin/sh
# A script to pre-download Singularity images required by iraiosub/riboseq-flow pipeline
# change dir to your NXF_SINGULARITY_CACHEDIR path
cd /path/to/image/cache
singularity pull iraiosub-nf-riboseq-latest.img docker://iraiosub/nf-riboseq
singularity pull iraiosub-mapping-length-latest.img docker://iraiosub/nf-riboseq-qc
singularity pull iraiosub-nf-riboseq-dedup-latest.img docker://iraiosub/nf-riboseq-dedup
riboseq-flow is written and maintained by Ira Iosub in Prof. Jernej Ule's lab at The Francis Crick Institute. It is based on a Snakemake pipeline in collaboration with its original author, Oscar Wilkins.
Contact email: [email protected]
riboseq-flow is under active development by Ira Iosub. For queries related to the pipeline, raise an issue on GitHub. If you are interested in building more functionality or want to get involved please reach out.
We welcome contributions to riboseq-flow.
If you wish to make an addition or change to the pipeline, please follow these steps:
- Open an issue to detail the proposed fix or feature and select the appropriate label.
- Create a new branch based on the
dev
branch, with a short, descriptive name e.g.feat-colours
for making changes to a color palette - Modify the code exclusively on this new branch and mention the relavant issue in the commit messages.
- When your modifications are complete, submit a pull request to the
dev
branch describing the changes. - Request a review from iraiosub on your pull request.
- The pull request will trigger a workflow execution on GitHub Actions for continuous integration (CI) of the pipeline. This is designed to automatically test riboseq-flow whenever a pull request is made to the main or dev branches of the repository. It ensures that the pipeline runs correctly in an Ubuntu environment, helping to catch any issues or errors early in the development process.
- it's user-friendly: it only requires minimal knowledge of the command-line interface
- it can be reliably and consistently executed across different systems using Docker or Singularity
- the user doesn't need to manage or install individual tool dependencies, only Nextflow and Docker or Singularity; containerisation ensures each process has the exact software and environment it needs
- it produces the same results regardless of where the pipeline is run, which is a cornerstone of computational reproduciblity in scientific research
- it enables efficient analysis of many samples in parallel, enabling comparative studies
- it integrates data processing steps with quality control
- it tallies reads at each critical step and produces bespoke ribo-seq quality reports
- it provides users with a transparent view of the read processing and quality, and empowers them with a deeper understanding of the data, enabling more informed decisions for downstream analyses
- it's open-source and version-controlled
- it meets FAIR principles