Salmon

Uploaded by

carucast
Copyright
© All Rights Reserved

RNA-SEQ QUANTIFICATION

RSEM and Salmon

1. Expression index

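• RPKM (reads per kilobase per million):
§ Read count of the gene / (total reads / 1M) / gene length in kb.
§ Corrects for sequencing depth and gene length.
§ Used by the TopHat and Cufflinks pipeline (Cufflinks reports FPKM, the fragment-based equivalent for paired-end data).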
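• TPM (transcripts per million):
§ (Read count / gene length in kb) / scaling factor, where the scaling factor is the sum of RPK across all genes / 1M.
§ Length-normalised proportion of reads mapped to a gene in each sample. TPM values sum to one million in every sample, so they are comparable across samples.
§ It is used by the RSEM algorithm.
• CPM (counts per million): read count / (total reads / 1M), with no length correction. It is used for differential expression analyses.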
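
The three definitions above can be written out in a few lines. This is a minimal sketch, assuming that "total reads" means the sum of the counts passed in and that gene lengths are given in base pairs; the function names and the toy counts are made up for illustration.

```python
def rpkm(counts, lengths_bp):
    """Read count / (total reads / 1M) / gene length in kb."""
    per_million = sum(counts) / 1e6
    return [c / per_million / (l / 1e3) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """(Read count / gene length in kb) / (sum of RPK across genes / 1M)."""
    rpk = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scaling = sum(rpk) / 1e6
    return [r / scaling for r in rpk]


def cpm(counts):
    """Read count / (total reads / 1M); no length correction."""
    per_million = sum(counts) / 1e6
    return [c / per_million for c in counts]


# Toy data (illustrative only): three genes, 1000 reads in total.
counts = [100, 300, 600]
lengths = [1000, 2000, 3000]

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
cpm_values = cpm(counts)
```

With these toy numbers the TPM values add up to one million, while the RPKM values do not add up to any fixed total. That fixed total is the reason TPM is easier to compare between samples.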

2. RSEM for quantification

• Input:
§ FASTQ (slower, because the reads must first be mapped) or BAM files.
§ Reference transcript annotation files: a gene-level readout, a transcript-level readout, the coding genes only, or even non-coding regions.
• Output: transcript-level expression (read count, TPM, FPKM), calculated using the effective transcript length.
• Effective transcript length:
§ This is not the full gene length. It is the coding part of the gene: the cDNA, or exons.
§ Because the ends of transcripts degrade, reads cover the middle of a transcript well but the ends poorly; this is why the correction has to be applied.
§ Given the sequence composition of some transcripts, you would expect a priori to sample more reads from them.
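
One common simplification of this correction, which the notes do not state, is to count the positions at which a fragment can start: a fragment of length f fits at only l - f + 1 positions in a transcript of length l. The sketch below assumes that simplification with a single mean fragment length; the function name and the floor of 1 are illustrative choices, not what RSEM, Kallisto or Salmon necessarily do internally.

```python
def effective_length(transcript_length, mean_fragment_length):
    """Number of positions where a fragment of the mean length can start.

    Assumption: one fixed mean fragment length. Real tools use the
    full fragment length distribution and may add bias terms.
    """
    n_starts = transcript_length - mean_fragment_length + 1
    # Floor at 1 so very short transcripts do not get a zero or negative length.
    return max(n_starts, 1)


example = effective_length(1000, 200)  # 1000 - 200 + 1
```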
3. Isoform inference

• The result depends on the transcript reference file that you give to RSEM: if you only give exons without considering the isoforms of a gene, it gives you a general gene-level expression estimate.
• Given a known set of isoforms:
§ Estimate x (the abundance of each isoform) from n (the number of reads per exon). Knowing the length of each exon and which exons are part of which isoform, you can get the relative abundance of the isoforms in a sample.
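
The idea of recovering x from n can be sketched with a small least-squares fit. This is not RSEM's actual algorithm (RSEM uses a statistical model fitted by expectation-maximisation); it is a toy illustration assuming that the read density of an exon (reads / exon length) equals the summed abundance of the isoforms that contain it. All names and numbers are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve a small square system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("isoforms cannot be told apart from these exons")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def isoform_abundances(exon_reads, exon_lengths, membership):
    """Relative isoform abundances from exon-level read counts.

    membership[e][i] is 1 if exon e is part of isoform i, else 0.
    Solves the normal equations (A^T A) x = A^T d, where d is read density.
    No non-negativity constraint is applied in this sketch.
    """
    density = [n / l for n, l in zip(exon_reads, exon_lengths)]
    k = len(membership[0])
    ata = [[sum(row[i] * row[j] for row in membership) for j in range(k)]
           for i in range(k)]
    atd = [sum(row[i] * d for row, d in zip(membership, density))
           for i in range(k)]
    x = solve_linear(ata, atd)
    total = sum(x)
    return [xi / total for xi in x]


# Toy gene: isoform A uses exons 1, 2 and 3; isoform B skips exon 2.
membership = [[1, 1], [1, 0], [1, 1]]
relative = isoform_abundances([600, 200, 900], [200, 100, 300], membership)
```

In this toy gene the skipped exon has a lower read density than its neighbours, and that drop is what tells the two isoforms apart.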
4. Pseudoalignment

• RSEM is considered the best quantification approach, but it can be a little slow.
• Pseudoalignment algorithms such as Kallisto and Salmon are faster because, instead of doing a full alignment of the reads across the whole genome, they use only the coding transcriptome (about 2% of the genome).
• They need reference transcript annotation files.
• They find all the transcripts and positions that a read is compatible with (so they are not useful for detecting novel transcripts or gene fusions).
• Salmon also corrects for sequence-specific and GC biases.
• Can run on either FASTQ files or BAM files.
• Can map 10 million reads in a few minutes, sacrificing accuracy for speed.
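
The "compatible with" step can be illustrated with a toy k-mer index. This is a heavily simplified sketch of the general idea, not the data structure Kallisto or Salmon actually use: it ignores strand, sequencing errors and positions, and it treats a read with any unknown k-mer as incompatible with everything. The sequences and names are invented.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # no transcript explains every k-mer seen so far
    return compatible or set()  # also covers reads shorter than k


# Two toy transcripts that share their first seven bases.
transcripts = {"t1": "ACGTACGTTT", "t2": "ACGTACGAAA"}
index = build_kmer_index(transcripts, k=4)

shared = compatible_transcripts("CGTACG", index, k=4)
only_t1 = compatible_transcripts("ACGTTT", index, k=4)
novel = compatible_transcripts("GGGGGG", index, k=4)
```

The last read matches nothing in the index, which illustrates why this kind of method cannot report novel transcripts: a read can only be assigned to sequences that are already in the reference.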

5. Output

• Kallisto (abundance.tsv file):
§ target_id: identifier of the transcript (coding region of the genome) to which reads were mapped.
§ length: length of the coding region.
§ eff_length: length corrected for possible end degradation.
§ tpm (transcripts per million): proportion of reads mapped to a gene in each sample.
§ est_counts: estimated number of reads mapped to this coding region of the genome (not per million).
• Salmon (quant.sf file):
§ Name: identifier of the transcript (coding region of the genome) to which reads were mapped.
§ Length: length of the coding region.
§ EffectiveLength: length corrected for possible end degradation.
§ TPM (transcripts per million): proportion of reads mapped to a gene in each sample.
§ NumReads: estimated number of reads mapped to this coding region of the genome (not per million).
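
As a rough illustration of how these columns relate, the sketch below parses a tiny quant.sf-style table and recomputes TPM from NumReads and EffectiveLength. The two rows are made-up values chosen to be internally consistent, not real Salmon output, and the helper names are hypothetical.

```python
import csv
import io

# Made-up example in the tab-separated quant.sf layout.
QUANT_SF = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1200\t1000.0\t250000.0\t100.0\n"
    "tx2\t700\t500.0\t750000.0\t150.0\n"
)


def read_quant(handle):
    """Parse a quant.sf-style file into a list of dictionaries."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "name": row["Name"],
            "length": int(row["Length"]),
            "eff_length": float(row["EffectiveLength"]),
            "tpm": float(row["TPM"]),
            "num_reads": float(row["NumReads"]),
        })
    return rows


def tpm_from_counts(rows):
    """TPM = (reads / effective length), rescaled to sum to one million."""
    rates = [r["num_reads"] / r["eff_length"] for r in rows]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


rows = read_quant(io.StringIO(QUANT_SF))
recomputed = tpm_from_counts(rows)
```

For a real file, `read_quant` could be given `open("quant.sf")` instead of the in-memory string. A Kallisto abundance.tsv would need the column names changed to `target_id`, `length`, `eff_length`, `est_counts` and `tpm`.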
