Module8 RNASeq Pathogen Practical Manual
1 RNA-Seq Expression Analysis 1.3 Practical Outline
1.4 Authors
This tutorial was developed by Jon Ambler, Phelelani Mpangase and Nyasha Chambwe, based in
part on materials from Victoria Offord and Adam Reid.
1.5 Prerequisites
This tutorial assumes that you have the following software or packages and their dependencies
installed on your computer. The software or packages used in this tutorial may be updated from
time to time, so we have also given the version that was used when writing the tutorial. Where
necessary, instructions for downloading additional analysis tools are given in the relevant section.
1.6 Setup
1.6.1 Create Practical Directory
Navigate to the module folder on the Virtual Machine (VM) using the following command:
[ ]: cd /home/manager/course_data/rna_seq_pathogen
Create a copy of the practical dataset so that the original dataset stays intact.
[ ]: cp /home/manager/course_data/rna_seq_pathogen/data/bacterial/* /home/manager/course_data/rna_seq_pathogen/practical
Copy and paste the command above as a single line at your command line window.
Note: if you make a mistake at any point during this tutorial, you can reset by deleting the practical
folder and restarting from this section.
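The reset described in the note can be sketched as below. This is only an illustration under stated assumptions: a throwaway directory stands in for the real course folder, the two file names are invented, and the `rm -rf` and `mkdir -p` steps are not part of the original instructions.

```shell
# Sketch only: a throwaway directory stands in for the real course folder.
course_dir=$(mktemp -d)
mkdir -p "$course_dir/data/bacterial"
# Invented placeholder files, not the real course data.
touch "$course_dir/data/bacterial/reads_1.fq.gz" "$course_dir/data/bacterial/reads_2.fq.gz"

# Reset: delete the practical folder, recreate it, then copy the data again.
# mkdir -p does nothing if the folder already exists, so re-running is harmless.
rm -rf "$course_dir/practical"
mkdir -p "$course_dir/practical"
cp "$course_dir"/data/bacterial/* "$course_dir/practical/"

copied=$(ls "$course_dir/practical" | wc -l | tr -d ' ')
echo "Copied $copied files"
```

On the VM, the same pattern would use the course paths given above in place of `$course_dir`.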
Move to the practical directory
[ ]: cd practical
Reference Transcriptome File
Download the GFF-formatted version of the annotation file.
Unfortunately, due to the lack of standardization in bioinformatics, we need both the GTF file and the
GFF file for analysis (see the differences and similarities between these two file format specifications
here). Salmon uses the GTF file and the GenomicFeatures library in R needs a GFF file. Both files
contain the same annotation information; they are just formatted differently. No tool can reliably
convert one to the other because even GFF files are not formatted consistently.
[ ]: wget "https://www.dropbox.com/s/4yjgbmy3dyhfoad/GCA_000195955.2_ASM19595v2_genomic.gff?dl=1" -O /home/manager/course_data/rna_seq_pathogen/practical/GCA_000195955.2_ASM19595v2_genomic.gff
,→practical_study_design.txt
Note: In this practical, we will use the existing study design file provided. However, in your own
work, you will have to create this file yourself. You can do this in several ways, for example by
exporting an Excel spreadsheet as a ’Tab delimited’ text file. You can also use your favorite text
editor, such as the Text Editor app under the Applications menu on the virtual machine.
2 Introducing The Tutorial Dataset
[ ]: unzip DE_data.RData.zip
The Bioconductor Project provides tools written in R for the ’analysis and comprehension of
high-throughput genomic data’. Here we will install two additional software packages (GenomicFeatures
and tximport) for our practical today (see additional guidelines for Bioconductor package installation).
[ ]: # Bioconductor Packages
install.packages("BiocManager")
BiocManager::install("GenomicFeatures", force = TRUE)
BiocManager::install("tximport")
# CRAN Packages
install.packages("pheatmap")
N2 sample1 Resistant 1
N6 sample2 Resistant 2
N10 sample3 Resistant 3
N14 sample4 Sensitive 1
Research Question: what genes are differentially expressed between these two isolates
that can explain the differences in their phenotypes?
Check that you can see the FASTQ files in the practical directory.
[ ]: ls N*.fq.gz
The FASTQ files contain the raw sequence reads for each sample. There will typically be four lines
per read:
1. Header
2. Sequence
3. Separator (usually a ’+’)
4. Encoded quality value
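To make the four-line record structure concrete, here is a minimal sketch using a toy FASTQ file. The two reads are invented and are not taken from the course data.

```shell
# Toy two-read FASTQ file (made-up reads, not course data).
fq=$(mktemp)
printf '%s\n' \
  '@read1' 'ACGT' '+' 'IIII' \
  '@read2' 'TTGA' '+' 'IIHH' > "$fq"

# Each record occupies four lines, so the number of reads is lines / 4.
lines=$(wc -l < "$fq" | tr -d ' ')
reads=$((lines / 4))
echo "$reads reads"
```

For a gzip-compressed file, the same count can be taken from the decompressed stream, for example by piping `zcat` into `wc -l`.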
Take a look at one of the FASTQ files.
[ ]: zless N2_sub_R1.fq.gz | head
You can find out more about the FASTQ format at https://en.wikipedia.org/wiki/FASTQ_format.
2.1 Exercise 1
2.2 Questions
Q1: Why is there more than one FASTQ file per sample?
Hint: think about why there is an N2_sub_R1.fq.gz and an N2_sub_R2.fq.gz
Q2: How many reads were generated for the N2 sample?
Hint: we want the total number of reads from both files (N2_sub_R1.fq.gz and N2_sub_R2.fq.gz), so
think about the FASTQ format and the number of lines for each read, or whether there’s anything
in the FASTQ header that you can search for and count.
Q3: The three Resistant samples N2, N6 and N10 represent technical replicates. True or False?
Comment on your answer
3 Estimate Transcript Abundance With Salmon 3.1 Create Transcriptome Index
to align the reads directly to a set of target transcripts such as those from a reference database for
your organism.
Inputs include:
• A set of target transcripts - FASTA format Reference Transcriptome GCA_000195955.2_ASM19595v2_genomic.transcripts.fa
• Sample reads - FASTA/FASTQ files for your sample N2_sub_R1.fq.gz.....
See more details in:
Reference:
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware
quantification of transcript expression. Nat Methods. 2017 Apr;14(4):417-419.
doi: 10.1038/nmeth.4197. PMID: 28263959; PMCID: PMC5600148.
Take a look at the various files created in the transcripts_index folder created by salmon.
In a typical workflow, you would run FastQC before and after trimming to check its effects,
but as that is not the purpose of this section, we will skip that for now.
3 Estimate Transcript Abundance With Salmon 3.2 Transcript Quantification
This creates a new folder named “N2” in the directory, which contains a number of files:
The “quant” files contain the read counts for the genes. These are what downstream
tools like DESeq2 or edgeR would use for the differential expression analysis.
Running the commands for each sample one at a time is not always practical, so we can write small
scripts that process multiple samples with the same set of commands, as shown here:
[ ]: for r1 in *_R1.fq.gz
do
    echo "$r1"
    sample=$(basename "$r1")
    sample=${sample%_sub_R1.fq.gz}
    echo "Processing sample: $sample"
    ${sample}_trimmed_reverse_paired.fq.gz ${sample}_trimmed_reverse_unpaired.fq.gz \
    ILLUMINACLIP:/home/manager/miniconda/pkgs/trimmomatic-0.39-1/share/trimmomatic-0.39-1/adapters/TruSeq3-PE.fa:2:30:10:2:keepBothReads \
    -1 ${sample}_trimmed_forward_paired.fq.gz \
    -2 ${sample}_trimmed_reverse_paired.fq.gz \
    -o ${sample}
done
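The loop derives each sample name from its R1 file name. A minimal sketch of just that step is shown below. The helper name `sample_name` and the `/data/` path are illustrative only and do not appear in the tutorial.

```shell
# Hypothetical helper: derive the sample name from an R1 FASTQ file name.
sample_name() {
  # basename drops any leading directory; % strips the shortest matching suffix.
  base=$(basename "$1")
  echo "${base%_sub_R1.fq.gz}"
}

for r1 in N2_sub_R1.fq.gz /data/N10_sub_R1.fq.gz; do
  echo "Processing sample: $(sample_name "$r1")"
done
```

Because the expansion is a plain string operation, the files do not need to exist for this sketch to run.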
3.3 Questions
In the N2 folder, take a look at the quant.sf file and answer the following questions:
• Q1: What does TPM stand for?
• Q2: How many reads in total are mapped to the tRNA genes? (They start with “rna-”)
• Q3: Do you think this level of tRNA is acceptable?
Take a moment to discuss what effect large numbers of reads from tRNA and rRNA could
have on normalisation.
Hint: For further help understanding the quant.sf output files, look at this salmon documentation
page.
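One possible way to total the reads for a group of entries in a quant.sf file is sketched below. It assumes the standard tab-separated salmon columns (Name, Length, EffectiveLength, TPM, NumReads). The three rows are invented toy values, not results from this dataset.

```shell
# Toy quant.sf with invented entries (not real results).
quant=$(mktemp)
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > "$quant"
printf 'gene-toy1\t1500\t1300.0\t10.0\t100\n' >> "$quant"
printf 'rna-toy1\t76\t20.0\t5.0\t7\n' >> "$quant"
printf 'rna-toy2\t74\t18.0\t3.0\t5\n' >> "$quant"

# Skip the header, keep rows whose Name starts with "rna-", and sum NumReads (column 5).
rna_reads=$(awk -F'\t' 'NR > 1 && $1 ~ /^rna-/ { s += $5 } END { print s + 0 }' "$quant")
echo "$rna_reads"
```

Pointing the same awk command at N2/quant.sf would give the total for the real sample.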
4 Differential Expression Analysis with DESeq2
library("pheatmap")
samples
If we look at the samples object, we can see the data from our study design file.
Link Salmon Output File Paths to Samples
Create a files object pointing to the related quant.sf file for each sample
[ ]: root_dir <- "/home/manager/course_data/rna_seq_pathogen/practical"
files <- file.path(root_dir, samples$run, "quant.sf")
files
In the files object, we should now see the path to the quant.sf file in each sample’s folder. This is
much simpler and less error-prone than typing out each file path manually.
Use the run column to map samples to their file paths
[ ]: names(files) <- samples$run
The files object now has the sample name linked to the file path and can be used as a way to map
the information.
The makeTxDbFromGFF function is part of the GenomicFeatures library and is used to create a
Transcript Database (txdb) object from a GFF annotation file. The TxDb class is a container for
storing transcript annotations.
Extract the GENEID key values from the txdb object
[ ]: k <- keys(txdb, keytype = "GENEID")
head(k)
The k object is a character vector of all the gene names, based on the GENEID values extracted
from the annotation file.
Take a look at the output; you will see 2 columns. These will be used to map the gene names from
the salmon files to the annotation files.
Rename the genes in the GENEID column
We do this to match the gene names found in the salmon quant.sf files with those found in the
transcriptome and annotation files.
[ ]: tx2gene[["GENEID"]] <- with(tx2gene, ifelse(!grepl("rna-", TXNAME), paste0("gene-", TXNAME), TXNAME))
head(tx2gene[["GENEID"]])
The above command goes through the tx2gene object and looks for rows where the gene name in the
TXNAME column DOES NOT contain “rna-”. In those rows, it sets GENEID to the TXNAME value with
the “gene-” prefix added; in all other rows, GENEID is set to the unchanged TXNAME value.
When the annotation file is imported by makeTxDbFromGFF, it removes the “gene-” prefix that
is present in the quant files. This is because of the way that the annotation and transcript files are
formatted.
You will see in the GFF and transcript files, that the gene ID is formatted like this:
ID=gene-Rv0005
Whereas in the GTF file it is formatted like this:
gene_id "Rv0005";
Though salmon uses the GTF file, it adds on the gene- prefix where it is missing while
makeTxDbFromGFF removes it where it is present.
[ ]: txi <- tximport(files, type="salmon", tx2gene=tx2gene)
4 Differential Expression Analysis with DESeq2 4.1 Exploratory Visualization
isolate. If we have more than a single experimental factor to consider, we would change how we
specify the design formula.
Note: For a more extensive treatment of how to set up the design formula for more complex
experimental designs, read through A guide to creating design matrices for gene expression experiments.
This step performs the actual normalisation and model fitting for the data, and results in a DESeq
data set.
Load provided .RData for differential expression analysis:
[ ]: load("DE_data.RData")
Apply LFC threshold and adjusted p-value cutoffs to get significantly differentially expressed
genes:
A significance threshold (FDR, or alpha) and a log2 fold change (LFC) threshold can be passed to the
results function to filter out genes that are not significantly differentially expressed.
[ ]: resTxi <- results(ddsTxi, alpha=0.05, lfcThreshold=0.5, altHypothesis="greaterAbs")
summary(resTxi)
sigTxi
Plot heatmap:
[ ]: pheatmap(matrixTxi, cluster_rows=FALSE, show_rownames=TRUE, cluster_cols=TRUE)
4.3 Questions
Q1: How many genes are up- and down-regulated between the resistant and sensitive isolates
of Mycobacterium tuberculosis?
Hint: use the summary function on the DESeq results object.
Q2: How many genes are significantly differentially expressed (i.e., meet the LFC threshold
and adjusted p-value cutoffs) between the resistant and sensitive isolates of Mycobacterium
tuberculosis? Name these genes.
Q3: What are the p-values for the significantly differentially expressed genes?
5 Key Aspects of Differential Expression Analysis 5.2 p-values vs. q-values
In order to accurately ascertain which genes are differentially expressed, and by how much, it is
necessary to use replicated data. As with all biological experiments, doing it once is simply not
enough. There is no simple way to decide how many replicates to do; it is usually a compromise
between statistical power and cost. By determining how much variability there is in the sample
preparation and sequencing reactions, we can better assess how highly genes are really expressed
and more accurately determine any differences. The key to this is performing biological rather than
technical replicates. This means, for instance, growing up three batches of parasites, treating them
all identically, extracting RNA from each and sequencing the three samples separately. Technical
replicates, whereby the same sample is sequenced three times, do not account for the variability that
really exists in biological systems or the experimental error between batches of parasites and RNA
extractions.
Note: more replicates will help improve power for genes that are already detected at high levels, while
deeper sequencing will improve power to detect differential expression for genes which are expressed at
low levels.
to determine whether any particular sorts of genes occur more than expected in your differentially
expressed genes.
6 Normalisation
6.1 Introduction
In a previous section, we looked at estimating transcript abundance with Salmon. The abundances
are reported as transcripts per million (TPM), but what does TPM mean and how is it calculated?
The objectives of this part of the tutorial are:
• understand why RNA-Seq normalisation metrics are used
• understand the difference between RPKM, FPKM and TPM
• understand how RPKM and TPM are calculated for a gene of interest
There are many useful websites, publications and blog posts which go into much more detail about
RNA-Seq normalisation methods. Here are just a few (in no particular order):
• What the FPKM? A review of RNA-Seq expression units
• RPKM, FPKM and TPM, clearly explained
• A survey of best practices for RNA-seq data analysis
• The RNA-seq abundance zoo
6 Normalisation 6.2 Why do we use normalisation units instead of raw counts?
Figure 4. Effect of sequencing depth and gene length on raw read counts
Look at the top part of Figure 4. In which sample, X or Y, is the gene more highly expressed?
Neither, it’s the same in both. What we didn’t tell you was that the total number of reads generated
for sample X was twice that for sample Y. That meant almost twice as many reads were assigned
to the same gene in sample X as in sample Y.
Look at the bottom part of Figure 4. Which gene, X or Y, has the greater gene-level expression?
Neither, they are both expressed at the same level. This time we didn’t tell you that gene X is twice
the length of gene Y. This meant that almost twice as many reads were assigned to gene X as to
gene Y.
In the top part of Figure 4, the gene in sample X has twice as many reads assigned to it as the
same gene in sample Y. What isn’t shown is that sample X had twice the total number of reads of
sample Y, so we would expect more reads to be assigned in sample X. Thus, the gene is expressed at
roughly the same level in both samples. In the bottom part of Figure 4, gene X has twice as many
reads assigned to it as gene Y. However, gene X is twice the length of gene Y, and so we expect
more reads to be assigned to gene X. Again, the expression level is roughly the same.
$$\mathrm{RPKM} = \frac{C}{L \times N}$$
Where:
• C is number of reads mapped to the transcript or gene
• L is the total exon length of the transcript or gene in kilobases
• N is the total number of reads mapped in millions
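The formula can also be evaluated directly from raw units by converting bases to kilobases and reads to millions. The sketch below does this with awk. The function name `rpkm_from_raw` and the total of 3,500,000 mapped reads are hypothetical and are used only for illustration.

```shell
# Hypothetical helper: RPKM from raw units.
# $1 = reads mapped to the gene (C)
# $2 = gene length in bases (divided by 1,000 to give L)
# $3 = total mapped reads (divided by 1,000,000 to give N)
rpkm_from_raw() {
  LC_ALL=C awk -v c="$1" -v bp="$2" -v total="$3" \
    'BEGIN { printf "%.2f\n", c / ((bp / 1000) * (total / 1000000)) }'
}

# Made-up inputs: 10 reads, a 2,000-base gene, 3,500,000 mapped reads.
rpkm_from_raw 10 2000 3500000
```

With these inputs, L is 2 and N is 3.5, so the result is 10 / (2 × 3.5), printed to two decimal places.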
6 Normalisation 6.3 Calculating RPKM and TPM values
Gene  Length        Replicate 1  Replicate 2  Replicate 3
A     2,000 bases   10           12           30
B     4,000 bases   20           25           60
C     1,000 bases   5            8            15
Step 2: normalise for sequencing depth
We now divide our read counts by the per million scaling factor to get our reads per million (RPM).
Before:
Gene  Replicate 1  Replicate 2  Replicate 3
A     10           12           30
B     20           25           60
C     5            8            15
After:
Step 3: get your per kilobase scaling factor
Here we have our gene length in base pairs. For our per kilobase scaling factor we need to get our
gene length in kilobases by dividing it by 1,000.
Gene  Length (bases)  Length (kilobases)
A     2,000           2
B     4,000           4
C     1,000           1
Step 4: normalise for length
Finally, we divide our RPM values from step 2 by our per kilobase scaling factor from step 3 to get
our reads per kilobase per million (RPKM).
Before:
After:
Notice that even though replicate 3 had more reads assigned than the other samples and a greater
sequencing depth, its RPKM is quite similar. Also, although gene B had twice as many reads
assigned as gene A, its RPKM is the same. This is because we have normalised by both
length and sequencing depth.
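The two normalisation steps can be sketched for replicate 1 as below. The gene lengths and read counts are those from the example table. The per million scaling factor of 3.5 is an assumed value chosen only for illustration, so the printed numbers are not the tutorial's own results.

```shell
# Assumed per million scaling factor (illustrative only, not from the tutorial).
total_millions=3.5

# Input columns: gene, length in bases, replicate 1 read count.
# Output columns: gene, RPM, RPKM.
rpkm_table=$(printf '%s\n' 'A 2000 10' 'B 4000 20' 'C 1000 5' |
  LC_ALL=C awk -v n="$total_millions" '
    {
      rpm  = $3 / n              # normalise for sequencing depth
      rpkm = rpm / ($2 / 1000)   # normalise for gene length in kilobases
      printf "%s %.2f %.2f\n", $1, rpm, rpkm
    }')
echo "$rpkm_table"
```

Gene B has twice the reads of gene A but is also twice as long, so the last column comes out the same for both genes.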
Gene  Length        Replicate 1  Replicate 2  Replicate 3
A     2,000 bases   10           12           30
B     4,000 bases   20           25           60
C     1,000 bases   5            8            15
Step 1: get your per kilobase scaling factor
Again, our gene lengths are in base pairs. For our per kilobase scaling factor we need to get our
gene length in kilobases by dividing it by 1,000.
Gene  Length (bases)  Length (kilobases)
A     2,000           2
B     4,000           4
C     1,000           1
Step 2: normalise for length
Now we divide the number of reads which have been assigned to each gene by the per kilobase
scaling factor we just calculated. This will give us our reads per kilobase (RPK).
Before:
Gene  Replicate 1  Replicate 2  Replicate 3
A     10           12           30
B     20           25           60
C     5            8            15
After:
Gene  Replicate 1  Replicate 2  Replicate 3
A     5            6            15
B     5            6.25         15
C     5            8            15
Step 3: get the sum of all RPK values in your sample
Next, we sum the RPK values for each of our replicates. This will give us our total RPK value for
each replicate. To make this example scalable, we assume there are other genes, so the total RPK
is made up.
Gene       Replicate 1  Replicate 2  Replicate 3
A          5            6            15
B          5            6.25         15
C          5            8            15
...        ...          ...          ...
Total RPK  150,000      202,500      450,000
Step 4: get your per million scaling factor
Here, instead of dividing our total mapped reads by 1,000,000 (1 million) to get our per million
scaling factor, we divide our total RPK values by 1,000,000 (1 million).
Step 5: normalise for sequencing depth
Finally, we divide our individual RPK values from step 2 by the per million scaling factor in step 4
to give us our TPM values.
Before:
6 Normalisation 6.4 Which normalisation unit should I use?
Gene  Replicate 1  Replicate 2  Replicate 3
A     5            6            15
B     5            6.25         15
C     5            8            15
After:
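Steps 4 and 5 can be sketched with awk as below, using the RPK values from step 2 and the made-up Total RPK row from step 3. The variable names `t1`, `t2`, `t3` and `tpm_table` are illustrative.

```shell
# Input columns: gene, then RPK for replicates 1 to 3 (from step 2).
# t1, t2, t3 hold the made-up total RPK values from step 3.
tpm_table=$(printf '%s\n' 'A 5 6 15' 'B 5 6.25 15' 'C 5 8 15' |
  LC_ALL=C awk -v t1=150000 -v t2=202500 -v t3=450000 '
    BEGIN {
      # Step 4: per million scaling factors
      f1 = t1 / 1000000; f2 = t2 / 1000000; f3 = t3 / 1000000
    }
    {
      # Step 5: TPM = RPK / per million scaling factor
      printf "%s %.2f %.2f %.2f\n", $1, $2 / f1, $3 / f2, $4 / f3
    }')
echo "$tpm_table"
```

Each output row lists the gene followed by its TPM in replicates 1, 2 and 3, rounded to two decimal places.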
6.4.1 RPKM
6.4.2 TPM
Notice that the total TPM value for each of the replicates is the same. This is not true for RPKM
and FPKM, where the total values differ. With TPM, having the same total value for each replicate
makes it easier to compare the proportion of reads mapping to each gene across replicates (although
you shouldn’t really compare across experiments). With RPKM and FPKM, the differing total values
make it much harder to compare replicates.
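The constant total can be checked with a small sketch. As a simplifying assumption, genes A, B and C are treated here as the whole transcriptome, so the total RPK is just the sum of their three values rather than the made-up totals used earlier.

```shell
# Input columns: gene, then RPK for replicates 1 to 3.
# Simplifying assumption: these three genes are the only genes in the sample.
tpm_sums=$(printf '%s\n' 'A 5 6 15' 'B 5 6.25 15' 'C 5 8 15' |
  LC_ALL=C awk '
    { for (i = 2; i <= NF; i++) { rpk[NR, i] = $i; total[i] += $i } }
    END {
      for (i = 2; i <= 4; i++) {
        sum = 0
        # TPM = RPK / (total RPK / 1,000,000), summed over all genes
        for (g = 1; g <= NR; g++) sum += rpk[g, i] / (total[i] / 1000000)
        printf "%s%.0f", (i > 2 ? " " : ""), sum
      }
      print ""
    }')
echo "$tpm_sums"
```

The three printed totals, one per replicate, are identical even though the replicates differ in their RPK values.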