
Introduction to RNA-seq data analysis
Gladstone Institutes

Krishna Choudhary
Bioinformatics Core, GIDB
July, 2020
Overall goals
✦ Demystify RNA-Seq data analysis

✦ Enable informed conversations with computational biologists

✦ Demonstrate how to analyze data


✦ using a graphical user interface (Galaxy)
✦ IMO: Good for dabbling but not if one needs to analyze large-scale data often
✦ using a command line interface (HPC)
✦ Most computational biologists use the command line

2
Contents
✦ Introduction
✦ Typical RNA-seq protocol
✦ Different platforms for computing

✦ From sequencer output to count matrix


✦ Tools for today: fastqc, cutadapt, STAR, samtools, featureCounts
✦ File formats: FASTQ, FASTA, BAM/SAM, GFF

✦ Conclusion & Overview of Session 2

3
Typical protocol

Figure: Berge et al., 2018, PeerJ Preprints.
4


Sequencer FASTQ

Check quality

Trim adapters

Check quality

Map reads

Check quality

Tally counts

Check quality

DGE analysis

5
Bioinformatics software ecosystem:
Tools that “do one thing, and do it well”.
✦ Tools for today: fastqc, cutadapt, STAR, samtools, featureCounts
✦ Available on Galaxy
✦ Some are pre-installed on Wynton; others we can install ourselves
✦ Download a container with all tools installed; use anywhere
✦ Everything can be installed on a laptop

6
Different platforms for computing

7
Galaxy provides a collection of tools to analyze a variety of biomedical data
✦ https://usegalaxy.org/

8
Command line interfaces allow scripting
✦ Graphical User Interface
✦ Consists of windows, icons, menus, pointers
✦ Not always available for bioinformatics

✦ Command Line Interface


✦ Text based
✦ Allow automation by scripting
✦ Examples
✦ Wynton CLI
✦ MacOS: Terminal
✦ Windows: Command Prompt, PuTTY

9
Wynton is a high-performance computing (HPC) system for UCSF affiliates

✦ How to access Wynton? Visit: https://wynton.ucsf.edu/hpc/get-started/access-cluster.html

✦ Most universities/institutions doing data-intensive science have an HPC cluster on campus.

(see Description for image source)


10
Containers enable reproducibility
✦ Tools are constantly under development
✦ => Many versions around

✦ Dependencies complicate installations
✦ Dependencies are also constantly under development
✦ => Many versions around

✦ Different labs use different programming languages
11
A container is like a computer within a computer
✦ Containerization software examples:
✦ Singularity
✦ Docker (has security issues in the context of HPC)

✦ Containers can be deployed on commercial cloud computing platforms, e.g., Amazon Web Services
(see Description for image source)
12
Best to use Singularity with Linux (currently)
✦ Limited support for MacOS

✦ Even more limited support for Microsoft Windows

(see Description for image source)
13


Problem definition, shared data files

14
Problem: identifying differentially expressed genes
✦ Other applications
✦ annotate novel transcriptional events, e.g., exon skipping, alternative 3’ acceptor or 5’ donor sites, intron retention
✦ analysis of genetic variation among expressed genes
✦ RNA editing events
✦ characterization of long noncoding RNAs
✦ …
Experiment design influences data analysis.
(should be planned to address relevant questions)

✦ What is the biological question that we seek to answer?
✦ How many tissue types and/or time points to compare?
✦ How deep should we sequence?
✦ Read length?
✦ Which sequencing platform?
✦ Single-end or paired-end?
✦ Pooling?
✦ Biological replicates?
✦ Technical replicates?
✦ Additional considerations?

Not the subject matter today!
• Workshop by Reuben Thomas: Intro to statistics and experimental design.
• Reading material: RNA sequencing data: hitchhiker’s guide to expression analysis by Berge et al., 2018

16
Dataset
✦ Small dataset with 100k reads (for practice only).
✦ FASTQ to tallying counts.

17
Sequencing centers provide FASTQ files. (~15 min)
Section goal: Understanding origin and contents of FASTQ file type.

18
cDNA library is applied to a flow cell.

19
Image adapted from a blog by Stuart M. Brown (link in description).
Flow cells are organized in lanes, columns and tiles.

20
Image borrowed from slides shared by Broad Institute (link in description).
DNA fragments immobilized on flow cell & amplified into clonal clusters.

Figure steps: Immobilization => Bridging => Synthesize complementary strand => Linearization => Repeat steps => Clonal clusters
✦ Immobilization: cDNA fragments stick to complementary sequences on flow cell.
✦ Bridging: Fragments bend. Adapter on other end pairs with complementary sequence.

Image adapted from blog by Lauren Launen (link in description).
21


Sequencing by Synthesis
1. Adapters contain primer binding sites.
2. Nucleotide with reversible terminator &
fluorophore added.
3. Image nucleotide added.
4. Remove terminator and fluorophore.
5. Repeat 2-4.

Image source: DMLapato [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)]
22
Strong signal from monoclonal clusters.

23
FASTQ files contain detailed information about each read.
✦ Read sequence.

✦ Instrument used, flow cell id, lane number, tile number, etc.

✦ Quality of each base call.

24
Base calling may not be accurate.
Various possible causes: Example

Ideal world Real world

Image adapted from a Broad Institute presentation (link in description).
25


Base calling may not be accurate.
Various possible causes: Example

Image adapted from Pfeiffer et al., Scientific Reports, (2018) 8:10950.
26


Base calling may not be accurate.
Possible causes

✦ Blocking of synthesis after one nucleotide addition may be inefficient.


✦ Clusters might not be monoclonal.
✦ A tile may be out of focus.
✦ Oil, reagent, etc. on flow cell or imaging component, etc.

=> Need to record quality of each base call.


27
Example FASTQ file with one read only.
✦ Open Single_read.fastq

28
Four lines per read:
1. Read ID, 2. Sequence, 3. Space for optional info, 4.
Quality.
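
The four-line layout can be read with a few lines of Python. This is only a minimal sketch: the function name `read_fastq`, the `FastqRecord` type and the example read are made up for illustration, and real files are usually parsed with an established library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str    # line 1, without the leading "@"
    sequence: str   # line 2
    optional: str   # line 3, without the leading "+" (often empty)
    quality: str    # line 4, one symbol per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every four lines of a FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the loop
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield FastqRecord(header[1:], sequence, plus[1:], quality)


# Hypothetical one-read file, in the spirit of Single_read.fastq.
example = "@read1 hypothetical\nGATTACA\n+\nIIIIIII\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].read_id, records[0].sequence, records[0].quality)
```

The length check reflects the rule that every base call has exactly one quality symbol.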

29
Image adapted from blog by Lauren Launen (link in description).
Quality measured in terms of Phred scores.

Quality is encoded as symbols.

30
Link for Illumina encoding of scores in description.
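
To see how symbols turn back into numbers, here is a small sketch. It assumes the Phred+33 encoding (ASCII code minus 33), which is the common convention for current Illumina output; older files may use a different offset, so check the encoding of your own data. The function names are made up for illustration.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Phred scores (assumes Phred+33 by default)."""
    return [ord(symbol) - offset for symbol in quality]


def error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


scores = phred_scores("!5I")
print(scores)
print([error_probability(q) for q in scores])
```

A score of 20 means an estimated 1 in 100 chance that the base call is wrong; a score of 40 means 1 in 10,000.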
Adapters, primers, contaminants, target sequences, etc. represented in FASTQ files.
✦ Open Bacteria_GATTACA_L001_R1_001.fastq.

31
Length of insert < Length of reads ordered
=> Adapters included in reads.

Image source: DMLapato [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)]
32
Naming conventions for fastq files.
✦ File names often follow a format.
✦ SampleName_Barcode_LaneNumber_ReadNumber_SetNumber.fastq
✦ Ex – Bacteria_GATTACA_L001_R1_001.fastq

✦ Paired-end reads named with R1 and R2 in file name.


✦ Ex – Bacteria_GATTACA_L001_R1_001.fastq and Bacteria_GATTACA_L001_R2_001.fastq

✦ File extensions may be .fq or even .txt.


✦ Often compressed using gzip.
✦ gzip is free and open-source.
✦ Resulting file names have .gz added. Example – .fq.gz.
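
Because file names follow a pattern, scripts can pull the pieces apart. The sketch below assumes exactly the SampleName_Barcode_LaneNumber_ReadNumber_SetNumber layout shown above; the regular expression and function names are illustrative, and sequencing centers may use other layouts.

```python
import re
from typing import Dict, Optional

# Assumed layout: SampleName_Barcode_LaneNumber_ReadNumber_SetNumber,
# with extension .fastq, .fq or .txt, optionally followed by .gz.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+)_(?P<lane>L\d+)"
    r"_(?P<read>R[12])_(?P<set>\d+)\.(?:fastq|fq|txt)(?:\.gz)?$"
)


def parse_fastq_name(name: str) -> Optional[Dict[str, str]]:
    """Split a FASTQ file name into its parts, or return None if it does not fit."""
    match = FASTQ_NAME.match(name)
    return match.groupdict() if match else None


def mate_name(name: str) -> str:
    """Name of the R2 file that pairs with an R1 file (paired-end data)."""
    return name.replace("_R1_", "_R2_")


print(parse_fastq_name("Bacteria_GATTACA_L001_R1_001.fastq"))
print(mate_name("Bacteria_GATTACA_L001_R1_001.fastq"))
```

Compressed files can be read without unpacking them first, for example with `gzip.open(path, "rt")` from the Python standard library.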
33
Quality control of sequencing files. (~30 mins)
Section goal: Running FastQC and interpreting results.

34
FastQC: Tool for quality control of sequencing data

✦ Summarizes quality of base calls.


✦ Checks for presence of known adapters.
✦ Any sequences more frequently observed than expected?
✦ Any sequence biases?
✦ Any GC biases?
✦ …
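
Two of these summaries are simple enough to sketch by hand: mean base quality at each read position and GC content of a read. This is not how FastQC is implemented, only an illustration of what such summaries measure; the function names are made up and Phred+33 encoding is assumed.

```python
from typing import List, Sequence


def mean_quality_per_position(qualities: Sequence[str], offset: int = 33) -> List[float]:
    """Average Phred score at each read position (assumes Phred+33 encoding)."""
    longest = max(len(q) for q in qualities)
    means = []
    for i in range(longest):
        # Reads shorter than position i simply do not contribute.
        scores = [ord(q[i]) - offset for q in qualities if i < len(q)]
        means.append(sum(scores) / len(scores))
    return means


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in one read."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Three made-up quality strings of different lengths.
print(mean_quality_per_position(["II", "5I", "!"]))
print(gc_fraction("GATTACA"))
```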
Galaxy: Open source, web-based platform that integrates many tools.
✦ Free, public, internet accessible resource.
✦ https://usegalaxy.org/

✦ Data transfer and data storage are not encrypted.


✦ DO NOT UPLOAD PROTECTED DATA!!!

✦ For protected or large data:


✦ Set up a local Galaxy instance.
✦ Run Galaxy on the cloud.

36
Examples of FastQC reports
✦ Good Illumina data: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

✦ Bad Illumina data: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

37
What if QC gives warn/fail flag?

✦ Non-normal GC content per read?


✦ Normal expected for whole-genome shotgun sequencing.
✦ RNA-seq might give different distributions.

✦ Non-uniform sequence content per nucleotide?


✦ First 10-15 nt in RNA-seq often non-uniform.

✦ High duplication levels or over-represented sequences?


✦ Are they contaminants, e.g., adapters or PCR duplicates?
✦ If so, clean up contaminants.
✦ Could be attributed to highly abundant transcripts.

✦ Are sequence biases expected?

✦ For more: https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/


38
Sequencer FASTQ ✔️

Check quality ✔️

Trim adapters

Check quality

Map reads

Check quality

Tally counts

Check quality

DGE analysis

39
Conclusion: Session 1
✦ Bioinformatics software ecosystem: Tools that “do one thing,
and do it well”.
✦ Multiple tools are available for the same task.
✦ HPC clusters enable data-intensive science.
✦ Containerization facilitates reproducibility.
✦ Always check “quality” after each step.
✦ Tools and file formats used:
✦ fastqc
✦ FASTQ
40
Contents: Session 2
✦ Resources for after the workshop

✦ Mapping trimmed reads


✦ cutadapt, STAR or bowtie2, samtools
✦ FASTA, BAM/SAM, GFF

✦ Tallying gene-wise counts

✦ Maybe:
✦ Shell scripting
✦ Submitting jobs to compute nodes
✦ Nextflow
41
When I need help, which I do need on a daily basis,
I visit:
✦ Slack channel for Wynton users
✦ ucsf-wynton.slack.com
✦ http://seqanswers.com/forums/
✦ https://www.biostars.org/
✦ https://www.rna-seqblog.com/
✦ https://stackexchange.com/
✦ Google groups for specific tools
✦ GitHub issues
✦ …
42
Question from Session 1:
How much programming skill do we need for bioinformatics?
✦ Minimum essential
✦ Introductory R
✦ Introductory command line (also see the cheatsheet)

✦ Available at
✦ Gladstone Data Science Training program
✦ Data Science workshops from the UCSF library
✦ https://www.library.ucsf.edu/ask-an-expert/classes-catalog/

✦ For RNA-seq data analysis beyond tallying gene-wise counts:


✦ Intermediate RNA-seq (coming up)
✦ Pathway analysis (coming up)
43
Typical command line syntax of bioinformatics tools
$ Toolname -a 10 -b file.txt -c xyz.fq -o pqrs.tuv

$ Toolname --paramA 10 -b file.txt -c xyz.fq --outFile pqrs.tuv

✦ Examples:

$ fastqc -t 16 -o ./ Bacteria_GATTACA_L001_R1_001.fastq

$ fastqc *.fastq

$ cutadapt -a GATTACA -o ./trimmed.fastq input.fastq


44
Examples of commands to execute various steps

✦ Trimming adapters
$ cutadapt \
> -a file:Adapter_Sequence.fasta \
> -o ./trimmed.fastq Bacteria_GATTACA_L001_R1_001.fastq

✦ Similarly for other tools

45
File paths := Location of file on computer
✦ Lots of files on a computer

✦ Organized in directories which may


contain sub-directories

✦ Bioinformatics tools may need file inputs


and may output files
✦ Where are the input files located on the
computer?
✦ Searching entire computer not practical
✦ What if multiple files have the same name?
✦ Where should the output files be saved?
46
File paths
✦ ./ is for current working directory
✦ ../ is for parent directory of current
working directory
✦ ../../ is for parent directory of parent
directory of current working directory

✦ /Root/DirA/DirC/File4 is the path


of File4 in the image to the right.
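
The same rules can be checked in Python. The sketch below assumes the current working directory is /Root/DirA/DirC (the directory that holds File4 in the path above); the helper name `resolve` is made up, and POSIX-style paths are used so the result is the same on any operating system.

```python
import posixpath
from pathlib import PurePosixPath

# Assumed current working directory, taken from the File4 example path.
cwd = PurePosixPath("/Root/DirA/DirC")


def resolve(start: PurePosixPath, relative: str) -> str:
    """Turn a relative path into an absolute one, collapsing ./ and ../ parts."""
    return posixpath.normpath(str(start / relative))


print(resolve(cwd, "./File4"))   # file in the current working directory
print(resolve(cwd, "../"))       # parent directory
print(resolve(cwd, "../../"))    # parent of the parent directory
```

Writing paths out in full (absolute paths) avoids surprises when a script is run from a different directory.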

47
Submitting jobs for execution on compute nodes:
Same syntax as that for bioinformatics tools

$ qsub -cwd -pe smp 4 -l mem_free=2G -l scratch=50G -l h_rt=00:20:00 script.sh

✦ For details visit:
✦ https://wynton.ucsf.edu/hpc/scheduler/submit-jobs.html

48
Schedulers maintain a queue of jobs
✦ Many users are submitting jobs to Wynton
✦ Different resource requirements
✦ In what order should the jobs run?

✦ HPC administrators decide which scheduler to use

✦ Popular schedulers
✦ SGE
✦ Slurm
✦ PBS Pro
49
Workflow managers: One command with all
required inputs for a pipeline
✦ Examples:
✦ Nextflow
✦ Snakemake
✦ Cromwell

✦ Example usage:
$ nextflow run nf-core/rnaseq \
> --reads '*_R{1,2}.fastq.gz' \
> --genome GRCh37 \
> -profile singularity
50
Sequencer FASTQ ✔️

Check quality ✔️

Trim adapters

Check quality

Map reads

Check quality

Tally counts

Check quality

DGE analysis

51
Important!
✦ Know the tools you are using
✦ Know what each tool is doing
✦ Know the best practices for analysis
✦ Read benchmarking papers
✦ Read papers providing practical guidelines
✦ Read papers reviewing common practices

52
Cleaning up contaminants (20 mins)
Section goal: Run cutadapt on fastq files to remove adapters.

53
cutadapt removes adapters.
✦ Search for adapter sequence in read.

✦ Allow for mismatches in sequence.

✦ If significant alignment, cut.
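
The three steps above can be sketched as follows. This is a deliberately simplified stand-in, not cutadapt's actual algorithm: it only counts mismatches (no insertions or deletions), looks for the adapter towards the 3' end, and the function name, default error rate and minimum overlap are assumptions chosen for illustration.

```python
def trim_3prime_adapter(read: str, adapter: str,
                        max_error_rate: float = 0.1,
                        min_overlap: int = 3) -> str:
    """Cut the read at the first position where the adapter (or its start) matches."""
    for start in range(len(read)):
        # The adapter may run off the end of the read, so compare the overlap only.
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break  # too little sequence left to call a match
        mismatches = sum(
            1 for a, b in zip(read[start:start + overlap], adapter[:overlap]) if a != b
        )
        if mismatches <= int(max_error_rate * overlap):
            return read[:start]  # significant match: keep what precedes the adapter
    return read  # no adapter found, read unchanged


adapter = "AGATCGGAAGAGC"                                   # example adapter sequence
print(trim_3prime_adapter("ACGTACGT" + adapter, adapter))   # full adapter present
print(trim_3prime_adapter("ACGTACGTAGATCG", adapter))       # only the adapter start present
print(trim_3prime_adapter("ACGTACGT", adapter))             # no adapter
```

Allowing a small error rate is what lets an adapter with a sequencing error still be recognized, while the minimum overlap guards against cutting on chance matches of one or two bases.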

(Image adapted from Bolger et al., 2014, Bioinformatics. Link in description.)
54


Alternative approach: Trimmomatic
✦ Say adapter sequence in read is very short.
✦ Can we still identify it?
Primer binding

✦ Yes for paired-end reads.

Adapter

(Image adapted from Bolger et al., 2014, Bioinformatics. Link in description.)
55


What else to clean?
✦ PCR primers?

✦ Unique molecular identifiers?

✦ Poor quality base calls?


✦ …

56
Redo QC to ensure satisfactory quality.
✦ Run FastQC.

✦ Are over-represented sequences gone?

57
Sequencer ✔️ FASTQ ✔️

Check quality ✔️

Trim adapters ✔️

Check quality ✔️

Map reads

Check quality

Tally counts

Check quality

DGE analysis

58
Mapping reads (20 mins)
Section goal: Understand alignment method

59
Mapping := Aligning reads to regions of reference DNA.

✦ After cleaning, reads from real sample only. (Assumption)

✦ Mapping := Aligning reads to regions of reference DNA.

✦ Challenges:
✦ Reference sequences can be very long (~3 billion bp for humans).
✦ Order of 100 million reads to be mapped.
✦ Sometimes, need to account for splicing.
✦ Allow for PCR artifacts/sequencing errors.
Inputs needed.
1. Reads to align.
✦ FASTQ file after cleaning (trimming adapters).

2. Reference sequence to align to.


✦ Example – “rDNA_sequence.fasta”
✦ FASTA format. Two parts per sequence.
I. Header line starting with “>”, followed by sequence name/identifier.
II. Sequence (may be wrapped over several lines).
✦ File extensions: .fasta, .fa, .txt.
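
A FASTA file can be read into a dictionary of name to sequence with a short function. This is a minimal sketch with made-up names and sequences; it also accepts sequences wrapped over several lines, which is common in reference files.

```python
import io
from typing import Dict, List, Optional, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Return {sequence name: sequence} for every record in a FASTA stream."""
    parts: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after ">"; the rest is a free-text description.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is None:
            raise ValueError("sequence found before any '>' header line")
        else:
            parts[name].append(line)
    return {key: "".join(chunks) for key, chunks in parts.items()}


# Hypothetical two-sequence file; seq1 is wrapped over two lines.
example = ">seq1 made-up example\nGATT\nACA\n>seq2\nCCGG\n"
sequences = read_fasta(io.StringIO(example))
print(sequences)
```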

61
Indexing reference sequence speeds up mapping.

✦ Use STAR or bowtie2 to build index.


✦ STAR is popular for RNA-Seq data because it
✦ does unbiased de novo detection of canonical junctions
✦ can discover non-canonical splices
✦ can discover chimeric (fusion) transcripts

✦ Use cleaned reads and index of reference sequence to map.
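
Why does an index help? Without one, every read would have to be compared against every position of the reference. The toy sketch below uses a plain dictionary of k-mers to jump straight to candidate positions. This is only an illustration of the idea: STAR and bowtie2 use far more compact structures (a suffix array and an FM-index, respectively) and tolerate mismatches, and all names here are made up.

```python
from collections import defaultdict
from typing import DefaultDict, List


def build_kmer_index(reference: str, k: int) -> DefaultDict[str, List[int]]:
    """Record every position at which each k-mer of the reference starts."""
    index: DefaultDict[str, List[int]] = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_exact(read: str, reference: str, index, k: int) -> List[int]:
    """Look up the read's first k-mer, then verify the full read at each candidate."""
    candidates = index.get(read[:k], [])
    return [pos for pos in candidates if reference[pos:pos + len(read)] == read]


reference = "ACGTGATTACAGGT"          # made-up reference sequence
index = build_kmer_index(reference, k=4)   # built once, reused for every read
print(find_exact("GATTACA", reference, index, k=4))
print(find_exact("TTTT", reference, index, k=4))
```

The index is built once per reference and reused for all reads, which is why it is a separate step before mapping.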

62
Output =>
1. Alignments in SAM format, 2. Summary of mapping statistics.

✦ SAM format:
✦ For each read, mapped where, in what orientation?

✦ Summary statistics:
✦ How many reads mapped?
✦ How many unmapped?
✦ …

63
Binary Alignment/Map (BAM) format
✦ Alignment reports often very large files.

✦ BAM is the compressed, binary version of SAM (file extension .bam).

64
Sequence Alignment/Map (SAM) format
✦ Open with Excel.

✦ First few lines contain metadata about alignments.


✦ These lines start with “@”.
✦ Example – version of file format, sorting order of alignments, grouping,
etc.

✦ After header, a table of alignments.

65
11 fields for each alignment (per row).
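
The header lines and the 11 tab-separated fields can be separated with a few lines of Python. The field names below follow the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); the function names and the example alignment are made up. In practice, samtools or a library such as pysam is the usual choice.

```python
from typing import Dict, Iterable, List, Tuple

SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")


def parse_sam(lines: Iterable[str]) -> Tuple[List[str], List[Dict]]:
    """Split SAM text into header lines (starting with "@") and alignment rows."""
    header, alignments = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
            continue
        columns = line.split("\t")
        if len(columns) < 11:
            raise ValueError("alignment row with fewer than 11 fields")
        record: Dict = dict(zip(SAM_FIELDS, columns[:11]))
        for key in ("flag", "pos", "mapq", "pnext", "tlen"):
            record[key] = int(record[key])
        record["tags"] = columns[11:]  # optional fields, if any
        alignments.append(record)
    return header, alignments


def strand(record: Dict) -> str:
    """Bit 0x10 of FLAG is set when the read aligned to the reverse strand."""
    return "-" if record["flag"] & 0x10 else "+"


# Hypothetical SAM content: two header lines and one alignment.
sam_text = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chrToy\tLN:100\n"
    "read1\t16\tchrToy\t5\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\n"
)
header, alignments = parse_sam(sam_text.splitlines())
print(alignments[0]["rname"], alignments[0]["pos"], strand(alignments[0]))
```

RNAME and POS answer "mapped where?", and FLAG answers "in what orientation?".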

66
Alternatives
✦ Several. Example – bowtie2, BWA, subread, etc.

✦ Differences in speed and memory requirement.

✦ Pros and cons of each:


✦ Example: Some handle spliced alignment, others do not.
✦ …

67
Tools to manipulate files are available.
✦ Need to sort alignment report?
✦ samtools

✦ Need to convert FASTQ to FASTA?


✦ fastx-toolkit

✦ …

✦ Google!

68
Sequencer ✔️ FASTQ ✔️

Check quality ✔️

Trim adapters ✔️

Check quality ✔️

Map reads ✔️

Check quality ✔️

Tally counts

Check quality

DGE analysis

69
Tally counts (~15 mins)

70
How many reads overlap annotated regions?

✦ Need annotation information.

✦ Need alignment information.

✦ Use featureCounts.
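
The counting idea can be sketched as an interval overlap test. This is a simplified illustration rather than featureCounts itself: genes are single intervals instead of sets of exons, strand is ignored, and reads that overlap more than one gene are skipped. Gene names, coordinates (1-based, inclusive) and the function name are made up.

```python
from typing import Dict, Iterable, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 1-based inclusive


def count_reads(genes: Dict[str, Interval], reads: Iterable[Interval]) -> Dict[str, int]:
    """Count reads per gene; reads hitting no gene or several genes are not counted."""
    counts = {gene: 0 for gene in genes}
    for chrom, start, end in reads:
        hits = [
            gene for gene, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


genes = {"geneA": ("chrToy", 10, 50), "geneB": ("chrToy", 45, 90)}
reads = [
    ("chrToy", 12, 30),    # inside geneA
    ("chrToy", 40, 48),    # overlaps both genes: ambiguous, skipped
    ("chrToy", 60, 80),    # inside geneB
    ("chrToy", 95, 120),   # outside every gene
    ("chrOther", 12, 30),  # different chromosome
]
counts = count_reads(genes, reads)
print(counts)
```

The annotation supplies the gene intervals (GFF file) and the alignments supply the read intervals (SAM/BAM file), which is why both inputs are needed.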
Sequencer ✔️ FASTQ ✔️

Check quality ✔️

Trim adapters ✔️

Check quality ✔️

Map reads ✔️

Check quality ✔️

Tally counts ✔️

Check quality

DGE analysis

72
Downstream analysis (~15 mins)
No. 1: Differential gene expression analysis.

73
Gene-wise counts should be normalized before comparing between samples.
✦ Counts can differ because of different library sizes.

✦ Mapping statistics might be different for samples.

✦ Real change in expression level of a gene.

✦ …

✦ Need to factor out differences due to non-biological reasons.
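
The simplest correction for library size is counts per million (CPM): divide each count by the sample's total and multiply by one million. The sketch below uses made-up counts. Packages for differential expression such as edgeR and DESeq2 use more robust normalization methods, so treat this only as an illustration of the idea.

```python
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw gene counts by library size (total counts in the sample)."""
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("sample has no counts")
    return {gene: count * 1e6 / library_size for gene, count in counts.items()}


# Made-up samples: sample_b was sequenced twice as deeply as sample_a.
sample_a = {"g1": 100, "g2": 300, "g3": 600}
sample_b = {"g1": 200, "g2": 600, "g3": 1200}
cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
print(cpm_a)
print(cpm_b)
```

The raw counts differ by a factor of two, but after scaling the two samples agree, so the difference came from sequencing depth and not from expression.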


Counts may differ due to inherent noisiness of biological systems.
✦ Identical individuals may give different counts.

✦ Inherent variation used as benchmark to call out interesting variation.

✦ Need to estimate inherent variation or dispersion.
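
One common way to put a number on this is the negative binomial model, in which variance = mean + dispersion × mean². The sketch below estimates dispersion for a single gene from replicate counts by rearranging that formula. It is a rough method-of-moments illustration with made-up counts: tools such as edgeR and DESeq2 are based on the same model but estimate dispersion more carefully, sharing information across genes.

```python
from statistics import mean, variance
from typing import Sequence


def moment_dispersion(replicate_counts: Sequence[float]) -> float:
    """Solve variance = mean + dispersion * mean**2 for dispersion (floored at 0)."""
    mu = mean(replicate_counts)
    if mu == 0:
        return 0.0
    s2 = variance(replicate_counts)  # sample variance, needs at least 2 replicates
    return max(0.0, (s2 - mu) / mu ** 2)


# Made-up normalized counts for one gene across three biological replicates.
print(moment_dispersion([90, 100, 110]))   # variance equals the mean: no extra variation
print(moment_dispersion([50, 100, 150]))   # replicates disagree more: larger dispersion
```

A difference between conditions is only interesting if it is large compared with this replicate-to-replicate variation, which is why biological replicates are needed.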

75
Sequencer ✔️ FASTQ ✔️

Check quality ✔️

Trim adapters ✔️

Check quality ✔️

Map reads ✔️

Check quality ✔️

Tally counts ✔️

Check quality ✔️

DGE analysis ✔️

76
Your feedback is important to us!
✦ https://www.surveymonkey.com/r/RRTZPTC

✦ ~3 min.

77
Conclusions (~5 min)

78
Topics covered
✦ Steps of analysis.

✦ Common tools, e.g., cutadapt, fastqc, bowtie2, edgeR, etc.

✦ Common file formats, e.g., FASTQ, FASTA, SAM, GFF, etc.

✦ Analysis with Galaxy and Wynton.

79
Additional information: Sources of data
✦ Sequence read archive
✦ https://www.ncbi.nlm.nih.gov/sra

✦ Download and install SRA toolkit

✦ Step-by-step guide:
✦ https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/#download-sequence-data-files-usi
More tools
✦ Quality control: RSeQC, MultiQC, etc.
✦ Mapping: STAR, BWA, etc.
✦ File manipulation: bedtools, samtools, fastx-toolkit, etc.
✦ Visualization: UCSC Genome Browser
✦ …

81
Upcoming Workshops
✦ Pathway analysis

✦ Intermediate RNA-Seq analysis

✦ Single cell analysis

82
Thank you!

83
