Salmon
1. Expression index
• Input:
§ FASTQ (slower, because the reads must be mapped first) or BAM files.
§ Reference transcript annotation files (gene-level or transcript-level readout, coding genes only, or even
non-coding regions).
• Output: transcript-level expression (read count, TPM, FPKM), calculated using the effective transcript length.
• Effective transcript length:
§ This is not the full gene length. It is the coding part of the gene: the cDNA, i.e. the exons.
§ Because the ends of transcripts degrade, reads cover the middle well but the ends poorly; this is why
the correction has to be applied.
§ Given the sequence composition of some transcripts, you would expect a priori to sample more reads
from them.
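As a rough illustration of how the output values relate to the effective length, here is a minimal sketch that converts read counts into TPM and FPKM. It takes the effective lengths as given (it does not compute the correction itself), and the function names and toy numbers are made up for this example, not taken from Salmon or RSEM.

```python
def tpm(counts, eff_lengths):
    """Transcripts per million from read counts and effective lengths."""
    # Reads per base of effective length, for each transcript.
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    total = sum(rates)
    # Rescale so that the values of all transcripts sum to one million.
    return [r / total * 1e6 for r in rates]


def fpkm(counts, eff_lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_fragments = sum(counts)
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million).
    return [c * 1e9 / (l * total_fragments) for c, l in zip(counts, eff_lengths)]


# Hypothetical toy data: three transcripts with invented counts and lengths.
counts = [100, 200, 300]
eff_lengths = [1000.0, 2000.0, 1000.0]

tpm_values = tpm(counts, eff_lengths)
fpkm_values = fpkm(counts, eff_lengths)
```

The first two transcripts have different read counts but the same reads-per-base rate, so they end up with the same TPM; this is the point of normalising by length.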
3. Isoform inference
• Depending on the transcript reference file that you give to RSEM: if you only give exons without considering
the isoforms of a gene, it can only give you an overall gene-level expression estimate.
• Given a known set of isoforms:
§ Estimate x (the abundance of each isoform) from n (the number of reads per exon). Knowing the
lengths of the exons and which exons belong to which isoform, you can get the relative abundance of the
isoforms in a sample.
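The idea can be sketched with a toy two-isoform gene. The example assumes a simplified model in which the read density of an exon (reads divided by exon length) equals the summed abundance of the isoforms that contain it, and it solves for x by least squares. This is an illustration only: tools such as RSEM and Salmon use read-level statistical models instead, and the function name and data here are hypothetical.

```python
def estimate_two_isoforms(reads, lengths, membership):
    """Least-squares estimate of per-base abundance for two isoforms.

    reads[j]      : reads observed on exon j (the n in the notes)
    lengths[j]    : length of exon j in bases
    membership[j] : (a, b), 1 if exon j belongs to isoform A / B, else 0
    """
    # Read density per exon; assumed to equal the summed abundance
    # of the isoforms that contain the exon.
    density = [n / l for n, l in zip(reads, lengths)]

    # Normal equations (M^T M) x = M^T d for the 2-column membership matrix M.
    s_aa = sum(a * a for a, b in membership)
    s_ab = sum(a * b for a, b in membership)
    s_bb = sum(b * b for a, b in membership)
    t_a = sum(a * d for (a, b), d in zip(membership, density))
    t_b = sum(b * d for (a, b), d in zip(membership, density))

    det = s_aa * s_bb - s_ab * s_ab
    if det == 0:
        raise ValueError("isoforms are not distinguishable from these exons")
    # Cramer's rule for the 2x2 system.
    x_a = (t_a * s_bb - s_ab * t_b) / det
    x_b = (s_aa * t_b - s_ab * t_a) / det
    return x_a, x_b


# Hypothetical gene: isoform A = exons 1, 2, 3; isoform B skips exon 2.
reads = [500, 400, 500]
lengths = [100, 200, 100]
membership = [(1, 1), (1, 0), (1, 1)]

x_a, x_b = estimate_two_isoforms(reads, lengths, membership)
relative = (x_a / (x_a + x_b), x_b / (x_a + x_b))
```

Exon 2 is only in isoform A, so its density pins down A's abundance; the extra density on exons 1 and 3 is attributed to isoform B.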
4. Pseudoalignment
• RSEM is considered the best quantification approach, but it can be a little slow.
• Pseudoalignment algorithms such as Kallisto and Salmon are faster because, instead of fully aligning
the reads against the whole genome, they compare them only against the coding transcriptome (about 2% of the genome).
• They need reference transcript annotation files.
• They find all the transcripts and positions that a read is compatible with (not useful to detect novel transcripts or gene
fusions).
• Salmon also corrects for sequence-specific and GC-content biases.
• Can run on either FASTQ files or BAM files.
• Can map 10 million reads in a few minutes, sacrificing accuracy for speed.
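To make the notion of "compatible" concrete, here is a toy sketch of the compatibility step: index the k-mers of the reference transcripts, then intersect the transcript sets of all k-mers in a read. It is a deliberately simplified illustration with made-up sequences and hypothetical function names, not the actual data structures or algorithm used by Kallisto or Salmon.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of every k-mer in the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            # Some k-mer matches no common transcript: the read is unassignable.
            return set()
    return compatible or set()


# Hypothetical reference: two isoforms sharing a first exon.
K = 4
transcripts = {
    "t1": "ACGTTGCAGGGCCC",
    "t2": "ACGTTGCATTTAAA",
}
index = build_kmer_index(transcripts, K)

shared_read = compatible_transcripts("CGTTGC", index, K)   # from the shared exon
unique_read = compatible_transcripts("GCAGGG", index, K)   # spans the t1-specific end
novel_read = compatible_transcripts("TTTTTT", index, K)    # absent from the reference
```

The last read returns an empty set because its k-mers are not in the reference, which is why this approach cannot detect novel transcripts or gene fusions.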
5. Output