Intro To RNA-seq Concepts
analysis
Gladstone Institutes
Krishna Choudhary
Bioinformatics Core, GIDB
July, 2020
Overall goals
✦ Demystify RNA-Seq data analysis
Contents
✦ Introduction
✦ Typical RNA-seq protocol
✦ Different platforms for computing
Typical protocol
Check quality
Trim adapters
Check quality
Map reads
Check quality
Tally counts
Check quality
DGE analysis
Bioinformatics software ecosystem:
Tools that “do one thing, and do it well”.
✦ Tools for today: fastqc, cutadapt, STAR, samtools, featureCounts
✦ Available on Galaxy
✦ Some are pre-installed on Wynton; others we can install ourselves
✦ Download a container with all tools installed; use anywhere
✦ Everything can be installed on a laptop
Different platforms for computing
Galaxy provides a collection of tools to analyze a variety of biomedical data
✦ https://usegalaxy.org/
Command line interfaces allow scripting
✦ Graphical User Interface
✦ Consists of windows, icons, menus, pointers
✦ Not always available for bioinformatics
Wynton is a high-performance computing (HPC) system for UCSF affiliates
✦ Most universities/institutions doing data-intensive science have an HPC cluster on campus.
✦ Dependencies complicate installations
✦ Dependencies are also constantly under development
✦ => Many versions around
✦ Containers can be deployed on commercial cloud computing platforms, e.g., Amazon Web Services
(see Description for image source)
Best to use Singularity with Linux (currently)
✦ Limited support for macOS
Problem: identifying differentially expressed genes
✦ Other applications
✦ annotate novel transcriptional events, e.g., exon skipping, alternative 3’ acceptor or 5’ donor sites, intron retention
✦ analysis of genetic variation among expressed genes
✦ RNA editing events
✦ characterization of long noncoding RNAs
✦ …
Experiment design influences data analysis.
(should be planned to address relevant questions)
Dataset
✦ Small dataset with 100k reads (for practice only).
✦ FASTQ to tallying counts.
Sequencing centers provide FASTQ files. (~15 min)
Section goal: Understand the origin and contents of FASTQ files.
cDNA library is applied to a flow cell.
Image adapted from a blog by Stuart M. Brown (link in description).
Flow cells are organized in lanes, columns and tiles.
Image borrowed from slides shared by Broad Institute (link in description).
DNA fragments immobilized on flow cell & amplified into clonal clusters.
FASTQ files contain detailed information about each read.
✦ Read sequence.
✦ Instrument used, flow cell id, lane number, tile number, etc.
Base calling may not be accurate.
Various possible causes: Example
Four lines per read:
1. Read ID
2. Sequence
3. Space for optional info
4. Quality
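The four-line layout can be sketched on the command line. This is a minimal illustration, assuming a hypothetical two-read file `toy.fastq` with made-up read IDs and sequences; the helper names `count_reads` and `show_reads` are not part of any tool.

```shell
# Write a toy FASTQ file with two reads (made-up IDs and sequences).
workdir=$(mktemp -d)
fastq="$workdir/toy.fastq"
printf '%s\n' \
  '@read1' 'GATTACA' '+' 'IIIIIII' \
  '@read2' 'ACGTACG' '+' 'II5II+!' > "$fastq"

# Each read takes four lines, so reads = lines / 4.
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# Print read ID, sequence and quality string side by side.
show_reads() {
  awk 'NR % 4 == 1 { id = substr($0, 2) }
       NR % 4 == 2 { seq = $0 }
       NR % 4 == 0 { print id, seq, $0 }' "$1"
}

count_reads "$fastq"
show_reads "$fastq"
```

The same `count_reads` idea applies to the workshop file, provided it is an uncompressed FASTQ with exactly four lines per read.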
Image adapted from blog by Lauren Launen (link in description).
Quality measured in terms of Phred scores.
Quality is encoded as symbols.
Link for Illumina encoding of scores in description.
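A minimal sketch of how symbols map to Phred scores, assuming the Phred+33 offset (check the encoding link for your instrument). The score is Q = ASCII code - 33 and the error probability is P = 10^(-Q/10); `decode_quality` is a hypothetical helper name.

```shell
# Decode quality symbols, assuming the Phred+33 offset.
# Q = ASCII code - 33; error probability P = 10^(-Q/10).
decode_quality() {
  printf '%s\n' "$1" | awk '
    BEGIN {
      # Lookup table from printable character to its ASCII code.
      for (i = 33; i <= 126; i++) code[sprintf("%c", i)] = i
    }
    {
      for (j = 1; j <= length($0); j++) {
        sym = substr($0, j, 1)
        q = code[sym] - 33
        printf "%s Q=%d P=%.4f\n", sym, q, 10 ^ (-q / 10)
      }
    }'
}

decode_quality 'I5+!'
```

Under this offset, `I` corresponds to Q=40 (1 error in 10,000 calls) and `!` to Q=0.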
Adapters, primers, contaminants, target sequences, etc. represented in FASTQ files.
✦ Open Bacteria_GATTACA_L001_R1_001.fastq.
Length of insert < Length of reads ordered
=> Adapters included in reads.
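A toy illustration of this read-through, assuming a made-up 6-base insert, a made-up adapter sequence, and 10-base reads (none of these values come from a real kit):

```shell
# Toy example: a 6-base insert sequenced with 10-base reads.
# Both sequences are invented for illustration.
insert='GATTAC'
adapter='CCGGTTAA'
read_length=10

# The sequencer reads through the insert and continues into the adapter.
template="${insert}${adapter}"
read_seq=$(printf '%s' "$template" | cut -c "1-${read_length}")

# Number of adapter bases that end up in the read.
adapter_bases=$(( read_length - ${#insert} ))

echo "insert:        $insert"
echo "read:          $read_seq"
echo "adapter bases: $adapter_bases"
```

The last four bases of the read come from the adapter, not from the sample, which is why trimming is needed before mapping.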
FastQC: Tool for quality control of sequencing data
Examples of FastQC reports
✦ Good Illumina data:
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
What if QC gives warn/fail flag?
Check quality ✔️
Trim adapters
Check quality
Map reads
Check quality
Tally counts
Check quality
DGE analysis
Conclusion: Session 1
✦ Bioinformatics software ecosystem: Tools that “do one thing, and do it well”.
✦ Multiple tools are available for the same task.
✦ HPC clusters enable data-intensive science.
✦ Containerization facilitates reproducibility.
✦ Always check “quality” after each step.
✦ Tools and file formats used:
✦ fastqc
✦ FASTQ
Contents: Session 2
✦ Resources for after the workshop
✦ Maybe:
✦ Shell scripting
✦ Submitting jobs to compute nodes
✦ Nextflow
When I need help, which I do need on a daily basis, I visit:
✦ Slack channel for Wynton users
✦ ucsf-wynton.slack.com
✦ http://seqanswers.com/forums/
✦ https://www.biostars.org/
✦ https://www.rna-seqblog.com/
✦ https://stackexchange.com/
✦ Google groups for specific tools
✦ GitHub issues
✦ …
Question from Session 1:
How much programming skill do we need for bioinformatics?
✦ Minimum essential
✦ Introductory R
✦ Introductory command line (also see the cheatsheet)
✦ Available at
✦ Gladstone Data Science Training program
✦ Data Science workshops from the UCSF library
✦ https://www.library.ucsf.edu/ask-an-expert/classes-catalog/
✦ Examples:
$ fastqc -t 16 -o ./ Bacteria_GATTACA_L001_R1_001.fastq
$ fastqc *.fastq
✦ Trimming adapters
$ cutadapt \
> -a file:Adapter_Sequence.fasta \
> -o ./trimmed.fastq Bacteria_GATTACA_L001_R1_001.fastq
File paths := Location of file on computer
✦ Lots of files on a computer
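A short sketch of taking a path apart with standard shell utilities. The directory `/home/user/rnaseq/data` is a hypothetical location used only for illustration.

```shell
# A hypothetical absolute path to a FASTQ file (not a real location).
fastq_path='/home/user/rnaseq/data/Bacteria_GATTACA_L001_R1_001.fastq'

# Split the path into its directory and its file name.
dirname "$fastq_path"
basename "$fastq_path"

# Strip the .fastq extension to build an output file name.
sample=$(basename "$fastq_path" .fastq)
echo "${sample}_trimmed.fastq"
```

An absolute path starts at `/` and works from any directory; a relative path such as `./trimmed.fastq` is resolved against the current working directory.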
Submitting jobs for execution on compute nodes:
Same syntax as that for bioinformatics tools
Schedulers maintain a queue of jobs
✦ Many users are submitting jobs to Wynton
✦ Different resource requirements
✦ In what order should the jobs run?
✦ Popular schedulers
✦ SGE
✦ Slurm
✦ PBS Pro
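A sketch of what a job script for an SGE-style scheduler might look like. The resource values are placeholders and the file name `fastqc_job.sh` is hypothetical; consult the cluster documentation for the settings it expects. The snippet only writes the script to a temporary directory and prints it.

```shell
# Write a hypothetical SGE job script; resource values are placeholders.
workdir=$(mktemp -d)
job="$workdir/fastqc_job.sh"
cat > "$job" <<'EOF'
#!/bin/bash
#$ -S /bin/bash        # shell used to run the job
#$ -cwd                # run from the submission directory
#$ -j y                # merge stderr into stdout
#$ -l h_rt=00:30:00    # requested run time
#$ -l mem_free=4G      # requested memory
#$ -pe smp 4           # requested slots on one node

# NSLOTS is set by the scheduler to the number of granted slots.
fastqc -t "${NSLOTS:-1}" -o ./ Bacteria_GATTACA_L001_R1_001.fastq
EOF

cat "$job"
```

Under SGE such a script is typically submitted with `qsub fastqc_job.sh`. Slurm and PBS Pro use different directive prefixes and submission commands.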
Workflow managers: One command with all required inputs for a pipeline
✦ Examples:
✦ Nextflow
✦ Snakemake
✦ Cromwell
✦ Example usage:
$ nextflow run nf-core/rnaseq \
> --reads '*_R{1,2}.fastq.gz' \
> --genome GRCh37 \
> -profile singularity
Sequencer FASTQ ✔️
Check quality ✔️
Trim adapters
Check quality
Map reads
Check quality
Tally counts
Check quality
DGE analysis
Important!
✦ Know the tools you are using
✦ Know what each tool is doing
✦ Know the best practices for analysis
✦ Read benchmarking papers
✦ Read papers providing practical guidelines
✦ Read papers reviewing common practices
Cleaning up contaminants (20 mins)
Section goal: Run cutadapt on fastq files to remove adapters.
cutadapt removes adapters.
✦ Search for adapter sequence in read.
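A toy version of the search-and-trim idea, assuming exact matching of a made-up adapter at the 3' end. cutadapt itself also tolerates mismatches and partial adapter matches, which this sketch does not attempt; `trim_adapter` is a hypothetical helper.

```shell
# Toy 3' adapter trimming by exact match only.
trim_adapter() {
  # $1 = read sequence, $2 = adapter sequence
  awk -v read="$1" -v adapter="$2" 'BEGIN {
    pos = index(read, adapter)   # 1-based start of adapter, 0 if absent
    if (pos > 0) read = substr(read, 1, pos - 1)
    print read
  }'
}

trim_adapter 'GATTACCCGGTTAA' 'CCGGTTAA'   # adapter present: keep bases before it
trim_adapter 'GATTACAGATTACA' 'CCGGTTAA'   # adapter absent: read unchanged
```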
Redo QC to ensure satisfactory quality.
✦ Run FastQC.
Sequencer ✔️ FASTQ ✔️
Check quality ✔️
Trim adapters ✔️
Check quality ✔️
Map reads
Check quality
Tally counts
Check quality
DGE analysis
Mapping reads (20 mins)
Section goal: Understand alignment method
Mapping := Aligning reads to regions of reference DNA.
✦ Challenges:
✦ Reference sequences can be very long (~3 billion bp for humans).
✦ Order of 100 million reads to be mapped.
✦ Sometimes, need to account for splicing.
✦ Allow for PCR artifacts/sequencing errors.
Inputs needed.
1. Reads to align.
✦ FASTQ file after cleaning (trimming adapters).
Indexing reference sequence speeds up mapping.
Output =>
1. Alignments in SAM format, 2. Summary of mapping statistics.
✦ SAM format:
✦ For each read, mapped where, in what orientation?
✦ Summary statistics:
✦ How many reads mapped?
✦ How many unmapped?
✦ …
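The questions above can be answered from the FLAG column of a SAM file. A minimal sketch, assuming a made-up three-read file `toy.sam`; bit 0x4 of FLAG marks an unmapped read and bit 0x10 marks a read aligned to the reverse strand. `mapping_summary` is a hypothetical helper.

```shell
# Build a toy SAM file: one header line and three made-up alignments.
workdir=$(mktemp -d)
sam="$workdir/toy.sam"
printf '%s\t%s\t%s\n' '@SQ' 'SN:chr1' 'LN:1000' > "$sam"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  read1 0  chr1 101 255 7M  '*' 0 0 GATTACA IIIIIII \
  read2 16 chr1 301 255 7M  '*' 0 0 ACGTACG IIIIIII \
  read3 4  '*'  0   0   '*' '*' 0 0 TTTTTTT IIIIIII >> "$sam"

# Tally mapped/unmapped reads and orientation from the FLAG field (column 2).
mapping_summary() {
  awk -F '\t' '
    /^@/ { next }                               # skip header lines
    int($2 / 4) % 2 == 1 { unmapped++; next }   # FLAG bit 0x4: unmapped
    int($2 / 16) % 2 == 1 { reverse++; next }   # FLAG bit 0x10: reverse strand
    { forward++ }
    END {
      printf "mapped=%d unmapped=%d forward=%d reverse=%d\n",
             forward + reverse, unmapped, forward, reverse
    }' "$1"
}

mapping_summary "$sam"
```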
Binary Alignment/Map (BAM) format
✦ Alignment reports are often very large files.
Sequence Alignment/Map (SAM) format
✦ Open with Excel.
11 fields for each alignment (per row).
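The 11 mandatory fields, in order, are QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL. A small sketch that labels each column of one made-up alignment line (`label_fields` is a hypothetical helper):

```shell
# One made-up alignment line with the 11 tab-separated mandatory fields.
record=$(printf 'read1\t0\tchr1\t101\t255\t7M\t*\t0\t0\tGATTACA\tIIIIIII')

# Print each field next to its name from the SAM specification.
label_fields() {
  printf '%s\n' "$1" | awk -F '\t' '
    BEGIN {
      n = split("QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL", name, " ")
    }
    { for (i = 1; i <= n; i++) print name[i] "=" $i }'
}

label_fields "$record"
```

Aligners may append optional tag fields after column 11; this sketch ignores them.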
Alternatives
✦ Several, e.g., bowtie2, BWA, subread.
Tools to manipulate files are available.
✦ Need to sort alignment report?
✦ samtools
✦ …
✦ Google!
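What coordinate sorting means can be sketched with standard text tools on a toy file. The records below are made up and truncated to four columns for brevity, and `sort_by_coordinate` is a hypothetical helper; it orders reference names alphabetically, which is a simplification.

```shell
# Toy unsorted SAM-like file: two header lines, three truncated records.
workdir=$(mktemp -d)
sam="$workdir/unsorted.sam"
printf '%s\t%s\t%s\n' '@SQ' 'SN:chr1' 'LN:1000' '@SQ' 'SN:chr2' 'LN:1000' > "$sam"
printf '%s\t%s\t%s\t%s\n' \
  readA 0 chr2 50 \
  readB 0 chr1 300 \
  readC 0 chr1 20 >> "$sam"

# Keep header lines first, then sort records by reference name (column 3)
# and numerically by position (column 4).
sort_by_coordinate() {
  tab=$(printf '\t')
  grep '^@' "$1"
  grep -v '^@' "$1" | LC_ALL=C sort -t "$tab" -k3,3 -k4,4n
}

sort_by_coordinate "$sam"
```

In practice this job is done by samtools, e.g. `samtools sort -o sorted.bam input.sam`, which also writes the compressed BAM format.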
Sequencer ✔️ FASTQ ✔️
Check quality ✔️
Trim adapters ✔️
Check quality ✔️
Map reads ✔️
Check quality ✔️
Tally counts
Check quality
DGE analysis
Tally counts (~15 mins)
How many reads overlap annotated regions?
✦ Use featureCounts.
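The counting idea can be sketched as an interval-overlap tally. This is a toy illustration with made-up genes and reads, not featureCounts' actual algorithm: it ignores strand, exon structure and reads that overlap several genes. `count_overlaps` is a hypothetical helper.

```shell
workdir=$(mktemp -d)
genes="$workdir/genes.tsv"   # gene, chromosome, start, end (1-based, inclusive)
reads="$workdir/reads.tsv"   # chromosome, start, end of each aligned read
printf '%s\t%s\t%s\t%s\n' geneA chr1 100 200 geneB chr1 300 400 > "$genes"
printf '%s\t%s\t%s\n' chr1 150 156 chr1 195 201 chr1 350 356 chr1 500 506 > "$reads"

# Count, for each gene, the reads whose interval overlaps the gene interval.
count_overlaps() {
  awk -F '\t' '
    NR == FNR {   # first file: load gene intervals
      name[NR] = $1; chrom[NR] = $2; lo[NR] = $3 + 0; hi[NR] = $4 + 0
      n = NR; next
    }
    {             # second file: test each read against every gene
      for (g = 1; g <= n; g++)
        if ($1 == chrom[g] && ($2 + 0) <= hi[g] && ($3 + 0) >= lo[g]) count[g]++
    }
    END { for (g = 1; g <= n; g++) printf "%s\t%d\n", name[g], count[g] }
  ' "$1" "$2"
}

count_overlaps "$genes" "$reads"
```

Two reads overlap geneA, one overlaps geneB, and the read at 500-506 is not assigned to any gene.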
Sequencer ✔️ FASTQ ✔️
Check quality ✔️
Trim adapters ✔️
Check quality ✔️
Map reads ✔️
Check quality ✔️
Tally counts ✔️
Check quality
DGE analysis
Downstream analysis (~15 mins)
No. 1: Differential gene expression analysis.
Gene-wise counts should be normalized before comparing between samples.
✦ Counts can differ because of different library sizes.
✦ …
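The library-size effect can be sketched with counts per million (CPM), one simple normalization: CPM = count × 10^6 / total counts in the sample. The count table below is made up, and `cpm` is a hypothetical helper; dedicated DGE packages apply more refined normalization methods.

```shell
workdir=$(mktemp -d)
counts="$workdir/counts.tsv"   # gene, count in sample 1, count in sample 2
printf '%s\t%s\t%s\n' geneA 10 20 geneB 30 60 geneC 60 120 > "$counts"

# Scale each count by its sample's library size (sum of the column).
cpm() {
  awk -F '\t' '
    { gene[NR] = $1; c1[NR] = $2; c2[NR] = $3; total1 += $2; total2 += $3 }
    END {
      for (i = 1; i <= NR; i++)
        printf "%s\t%.0f\t%.0f\n", gene[i], c1[i] * 1e6 / total1, c2[i] * 1e6 / total2
    }' "$1"
}

cpm "$counts"
```

Sample 2 has twice the raw counts of sample 1 for every gene, yet the CPM values are identical, so the difference reflects sequencing depth rather than expression.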
Sequencer ✔️ FASTQ ✔️
Check quality ✔️
Trim adapters ✔️
Check quality ✔️
Map reads ✔️
Check quality ✔️
Tally counts ✔️
Check quality ✔️
DGE analysis ✔️
Your feedback is important to us!
✦ https://www.surveymonkey.com/r/RRTZPTC
✦ ~3 min.
Conclusions (~5 min)
Topics covered
✦ Steps of analysis.
Additional information: Sources of data
✦ Sequence read archive
✦ https://www.ncbi.nlm.nih.gov/sra
✦ Step-by-step guide:
✦ https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/#download-sequence-data-files-usi
More tools
✦ Quality control: RSeQC, MultiQC, etc.
✦ Mapping: STAR, BWA, etc.
✦ File manipulation: bedtools, samtools, fastx-toolkit, etc.
✦ Visualization: UCSC Genome Browser
✦ …
Upcoming Workshops
✦ Pathway analysis
Thank you!