Lecture4 Expression - Analysis 2019

ChIP-seq and related techniques like ChIP-exo and CUT&RUN are described for studying protein-DNA interactions. RNA-seq is summarized as a method for quantifying transcript abundance and discovering novel transcripts and isoforms. Key steps in RNA-seq like library preparation, mapping reads, and analyzing the data including challenges in reconstructing the transcriptome and quantifying expression are covered at a high level.

Uploaded by

Charlie Hou
Genetics 211 - 2019

Lecture 4

Functional Genomics
Gavin Sherlock
January 29th 2019
ChIP-Seq

Sonicate DNA to
produce sheared,
soluble chromatin

Immunoprecipitate and
purify
immunocomplexes

Reverse cross-links,
and purify DNA
Sequence
ChIP-Seq Data
Peak Calling
ChIP-exo
• ChIP-exo improves on the resolution of ChIP-Seq
• ChIP DNA is treated with an exonuclease, to digest away unprotected sequences

[Figure: ChIP compared with ChIP-exo (Exo), showing the 5’ and 3’ ends of both strands.]

Rhee HS, Pugh BF (2011). Cell 147(6):1408-19.


ChIP-exo

Rhee HS, Pugh BF (2011). Cell 147(6):1408-19.


Cleavage Under Targets and Release
Using Nuclease (CUT & RUN)

Skene and Henikoff, elife, 2017


CUT&RUN compared to ChIP-Seq
RNA-seq

• Detect transcript abundance by counting fragments of transcripts
• No prior knowledge needed of which parts of the genome are expressed
• Allows splice site discovery
• 5’ and 3’ UTR mapping
• Novel transcript discovery
• View RNA modifications (editing, other enzymatic
changes)
• With longer reads, can “phase” splice sites
• Possibly discover many novel isoforms
Dynamic Range

Mortazavi A, et al. (2008) Nat Methods 5(7):621-8.


How do we sequence mRNA?

Total RNA

DNase Treatment

Oligo-dT beads

PolyA purified RNA


First Strand cDNA synthesis
• mRNA: 5’ cap structure, AAA(A)n 3’ poly A tail
• Anneal oligo (dT)12-18 primer to the poly A tail
• Add dNTPs and reverse transcriptase to extend the primer
• Product: cDNA:mRNA hybrid
Second Strand Synthesis
• Add dNTPs, RNase H and E. coli polymerase I
• Remnants of mRNA serve as primers for synthesis of second strand of cDNA
• Bacteriophage T4 DNA ligase
• Product: double stranded cDNA

Library construction, similar to genomic DNA, using forked adapters


Shatter RNA, Prime with Random Hexamers
• mRNA: 5’ cap structure, AAA(A)n 3’ poly A tail
• Fragment RNA
• Prime 1st strand synthesis with random hexamers
• Product: double stranded cDNA
• Library construction, using forked adapters


Random Hexamer Induced Sequence Errors

van Gurp TP, McIntyre LM, Verhoeven KJ. (2013). Consistent errors in first
strand cDNA due to random hexamer mispriming. PLoS One 8(12):e85583.
Using dUTP to retain strand specificity
• Fragment mRNA
• 1st strand synthesis with random hexamers and normal dNTPs
• 2nd strand synthesis with dTTP -> dUTP
• Forked adapter ligation
Creating Strand Specificity
• UNG treatment
• Pre-amplification and sequencing (adapters Ad #2 and Ad #1)


Mapping of Reads
• Map reads to both the genome, and the predicted
spliced genome.
• Un-mappable reads may span unknown exon-exon
junctions from novel transcripts or exons.
• Need to be able to accommodate mismatches.
Exonic Read Density
• To measure abundance when sequencing entire
transcripts, you must normalize the data for the
transcript length.
• Exonic Read Density = Reads per kb gene
exon per million mapped reads
– Developed by the Wold lab, but makes intuitive
sense.
• Implies single end data – people now often use FPKM (fragments per kb of exon per million mapped fragments), which works for paired end data too.
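As a minimal sketch of the Exonic Read Density definition above, the calculation can be written out directly. The function name rpkm and the example numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kb of gene exon per million mapped reads."""
    kilobases = exon_length_bp / 1_000              # exon length in kb
    millions_mapped = total_mapped_reads / 1_000_000  # sequencing depth in millions
    return read_count / (kilobases * millions_mapped)


# Hypothetical gene: 500 reads on 2 kb of exon, 10 million mapped reads in total.
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```

Dividing by exon length is what lets a long and a short transcript of equal abundance receive the same value.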
Why Exonic Read Density?
What we observe in mapped reads

What was sequenced

What was present in RNA

1 rpkm 3 rpkm 1 rpkm


Analysis Considerations
• Read Mapping
– Unspliced Aligners
– Spliced Aligners
• Transcriptome Reconstruction
– Genome guided
– Genome independent reconstruction
• Expression quantification
– Gene quantification
– Isoform quantification
• Differential Expression
Unspliced Aligners
• Limited to identifying known exons and
junctions
• Requires a good reference transcriptome
• BWT (e.g. bowtie2, bwa) based aligners are
fast, and have been typically used
• Pseudoaligners (kallisto and sailfish) much
faster, and probably as accurate
Spliced Aligners
Align to the whole genome, including intron-spanning reads that require large gaps
• Exon first (MapSplice, SpliceMap, HiSat2)
– Two step process
• Use unspliced alignment
• Take unmapped reads, split, and look for possible spliced connections
– Typically faster
• Seed-extend (GSNAP, QPALMA)
– Break reads into short seeds and place on genome, then examine
with more sensitive methods
– Find more splice junctions, though not *yet* clear if they tend to
be false positives
Garber et al, 2011, Nature Methods
Transcriptome Reconstruction
• Challenging because
– Transcript abundance spans several orders of
magnitude
– Reads will originate from mature mRNA, as
well as incompletely spliced precursor RNA
– Reads are short, and genes can have many
isoforms, making it challenging to determine
which isoform produced which read
Two Approaches
• Genome Guided
– Relies on reference genome
– Uses spliced reads to reconstruct the transcriptome
– E.g. cufflinks (identifies minimal set of isoforms),
scripture (identifies maximal set of isoforms)
• Genome Independent Approach
– Tries to de novo assemble transcripts
– TransAbyss, Velvet, Trinity
– Sensitive to sequencing errors
– Usually requires more computational resources
Two isoforms of the same gene:
Determining differential
Expression
• A number of packages available
– Cuffdiff, DESeq, edgeR etc.
• Require replicates for each condition, so can
compare within vs. between sample
variance
• Differential expression is easier to detect for more abundant transcripts
Current trends
• No perfect solution
• Kallisto is now widely used, and is
incredibly fast, with low memory
requirements
– Speed allows bootstrapping to determine
uncertainty in abundance estimates
• Sleuth takes advantage of those bootstraps
to identify differential expression
Better Assaying Isoforms
• To better understand a biological system, we really want to
understand all transcripts
– Alternative splicing first seen in viruses in the 1970s
• Splicing generates complexity
– Humans have only ~2X more genes than Drosophila
– More than one gene one protein
– >38,000 Dscam isoforms!
– Alternative splicing can be altered in disease
• With relatively short reads, even with paired end sequencing,
it’s not clear which exons end up with which other exons in
mature isoforms
• Long-Read RNA-Seq results in better isoform determination.
Long-Read RNA-Seq

Sharon D, Tilgner H, Grubert F, Snyder M. (2013). A single-molecule long-read survey of the human transcriptome.
Nat Biotechnol 31(11):1009-14
TIF-Seq
• Transcript Isoform Sequencing
• Does not capture exonic structure
• Instead captures 5’ and 3’ ends of
transcripts
• From only ~6,000 genes in yeast, almost 2
million unique transcript isoforms identified
• 371,087 major TIFs identified genome-wide
Pelechano V, Wei W, Steinmetz LM. (2013). Extensive transcriptional
heterogeneity revealed by isoform profiling. Nature 497(7447):127-31.
TIF-Seq
TIF-Seq
Analysis and visualization of
expression data
Visualizing Data
[Figure: two plots of expression values for individual yeast genes across cultures at OD 0.26, 0.46, 0.80, 1.80, 3.70, 6.90 and 7.30. Labelled genes: MAK16 (YAL025C), ACH1 (YBL015W), YBL048W, YBL049W, YBL064C, YBL078C, HSP26 (YBR072W), YBR139W, YBR147W, HSP30 (YCR021C), YDL085W, YDL204W, NHP2 (YDL208W).]
Extracting Data
Experiments

RNA-Seq data

Genes
Cy3     Cy5      Cy5/Cy3   log2(Cy5/Cy3)
200     10000    50.00      5.64
4800    4800      1.00      0.00
9000    300       0.03     -4.91
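The ratio and log ratio columns on this slide can be reproduced from the two intensity columns. The sketch below uses the three Cy3/Cy5 pairs from the slide; the function name log2_ratio is an assumption made for illustration.

```python
import math


def log2_ratio(cy3, cy5):
    """Log (base 2) of the Cy5/Cy3 intensity ratio."""
    return math.log2(cy5 / cy3)


# (Cy3, Cy5) intensity pairs from the slide.
pairs = [(200, 10000), (4800, 4800), (9000, 300)]
for cy3, cy5 in pairs:
    print(cy3, cy5, round(cy5 / cy3, 2), round(log2_ratio(cy3, cy5), 2))
```

The log transform makes a 50-fold increase and a 30-fold decrease comparable in size (5.64 and -4.91), whereas the raw ratios (50.00 and 0.03) are not.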
Visualizing Data (cont.)
Expression During Sporulation

[Figure: line plot of Log Ratio (y axis, -4 to 5) against Time (hours) (x axis, 0 to 10), one line per gene (Series1 to Series51).]
Organizing Data
In expression studies, we often use clustering algorithms to help us identify patterns in complex data.

For example, we can randomize the data used to represent this painting and see if clustering will help us visualize the pattern.
Clustering algorithms

First, we represent the painting in black and white.


Clustering algorithms

The painting is “sliced” into rows which are then randomized.


Clustering algorithms

Rows ordered by hierarchical clustering with nodes flipped to optimize ordering
Clustering algorithms

Rows ordered by using a Self-Organizing Map (SOM)


Random vs. Biological Data

From Eisen MB, et al, PNAS 1998 95(25):14863-8


Types of Clustering
• Agglomerative
– Bottom up approach
– Different variants of hierarchical clustering
– This is the typical clustering you see
• Partitioning / Divisive
– Top down approach
– K-means Clustering
– Self-Organizing Maps
• All require the ability to compare expression
patterns to each other.
How do we compare expression
profiles?

• Treat expression data for a gene as a multidimensional vector.

• Use a distance/correlation metric to compare the vectors.
Expression Vectors
• Each gene is represented by a vector where coordinates are its values - log(ratio) - in each experiment

• x = log(ratio)expt1
• y = log(ratio)expt2
• z = log(ratio)expt3
• etc.

[Figure: gene vectors plotted on x, y and z axes, with nearby vectors labelled "Similar expression".]
Distance metrics
• Distances or correlations are measured
“between” expression vectors

• Many different ways to measure distance:

• Euclidean distance
• Pearson correlation coefficient(s)
• Spearman’s Rank Correlation
• Manhattan distance
• Mutual information
• Kendall’s Tau
• etc.

• Each has different properties and can reveal different features of the data
Euclidean distance
• Euclidean distance metrics detect similar vectors by identifying those that are closest in space. In this example, Gene A and C are closest.

[Figure: Gene A, Gene B and Gene C plotted with Experiment 1 on the x axis and Experiment 2 on the y axis, both from 0 to 2.5.]
Pearson correlation
• The Pearson correlation disregards the magnitude of the vectors but instead compares their directions. In this example, Gene A and Gene B have the same slope, so would be most similar to each other.

[Figure: Gene A, Gene B and Gene C plotted with Experiment 1 on the x axis and Experiment 2 on the y axis, both from 0 to 2.5.]
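The contrast between the two metrics can be checked numerically. This is a minimal sketch with made-up profiles over four experiments (gene_a, gene_b and gene_c are hypothetical values, not read from the slide figures): gene_b has the same shape as gene_a at half the magnitude, while gene_c sits close to gene_a in space but has a slightly different shape.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(u, v):
    """Pearson correlation coefficient between two expression vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [0.5, 1.0, 1.5, 2.0]  # same shape as gene_a, half the magnitude
gene_c = [1.2, 1.8, 3.3, 3.6]  # near gene_a in space, different shape

for name, other in [("B", gene_b), ("C", gene_c)]:
    print("A vs", name,
          "euclidean:", round(euclidean(gene_a, other), 3),
          "pearson:", round(pearson(gene_a, other), 3))
```

With these values Euclidean distance ranks C as closest to A, while Pearson correlation ranks B as most similar to A, which is the behaviour the two slides describe.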
Agglomerative Hierarchical
Clustering
1. Compare all expression patterns to each other.
2. Join patterns that are the most similar out of all
patterns.
3. Compare all joined and unjoined patterns.
4. Go to step 2, and repeat until all patterns are
joined.

Need a rule to decide how to compare clusters to each other
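The four steps above can be sketched in a few lines. This is an illustrative implementation, not the one used by any particular package: it assumes Euclidean distance, recomputes all distances on every pass (fine for a handful of genes, too slow for a genome), and takes the cluster comparison rule as the argument linkage. The profiles are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(profiles, linkage=min):
    """Join the two closest clusters until one remains; return the joins in order.

    linkage reduces all pairwise distances between two clusters to one number:
    min gives single linkage, max gives complete linkage.
    """
    clusters = [[name] for name in profiles]  # every pattern starts on its own
    joins = []
    while len(clusters) > 1:
        best = None
        # Compare all joined and unjoined patterns to each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Join the most similar pair and repeat.
        joins.append((sorted(clusters[i]), sorted(clusters[j]), round(d, 3)))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return joins


# Hypothetical two-experiment profiles.
profiles = {
    "G1": [0.0, 0.1],
    "G2": [0.1, 0.0],
    "G3": [2.0, 2.1],
    "G4": [2.2, 2.0],
    "G5": [5.0, 5.0],
}
for left, right, dist in agglomerate(profiles):
    print(left, "+", right, "at distance", dist)
```

The recorded join order and distances are exactly what a dendrogram draws: each join is a node, and its distance is the node height.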


Visualization of Hierarchical Clustering

[Figure: hierarchical clustering of six genes (G1 to G6), shown together with the resulting tree.]
Single linkage Clustering
• Uses the Nearest Neighbor
• This method produces long chains which form straggly clusters.
Complete Linkage Clustering
• Uses the Furthest Neighbor
• This method tends to produce very tight clusters of similar patterns
Average Linkage Clustering
• Uses the Average (only shown for two cases)
• The red and blue ‘+’ signs mark the centroids of the two clusters.
Centroid Linkage Clustering
• Uses the Centroid
• The red and blue ‘+’ signs mark the centroids of the two clusters.
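To make the four rules concrete, the sketch below computes each of them for the same pair of clusters. The two clusters are hypothetical points, and "average" is taken here to mean the mean of all pairwise distances, which is one common definition and an assumption about what the slide intends.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(cluster):
    """Coordinate-wise mean of the vectors in a cluster."""
    n = len(cluster)
    return [sum(coords) / n for coords in zip(*cluster)]


def linkage_distances(c1, c2):
    """Distance between two clusters under each of the four linkage rules."""
    pairwise = [euclidean(p, q) for p in c1 for q in c2]
    return {
        "single": min(pairwise),                  # nearest neighbor
        "complete": max(pairwise),                # furthest neighbor
        "average": sum(pairwise) / len(pairwise),  # mean of all pairs
        "centroid": euclidean(centroid(c1), centroid(c2)),
    }


cluster_1 = [[0.0, 0.0], [0.0, 2.0]]
cluster_2 = [[3.0, 0.0], [3.0, 2.0]]
for rule, dist in linkage_distances(cluster_1, cluster_2).items():
    print(rule, round(dist, 3))
```

Single linkage can never exceed average linkage, which in turn can never exceed complete linkage, so the choice of rule changes which clusters look closest and therefore the shape of the tree.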
And we get a cluster:
Single Complete Average Centroid
Two-way clustering
• Just as gene vectors are clustered,
experiment vectors can be clustered.
• All the data points for an experiment can be
used to construct a vector and the vectors of
multiple experiments can be compared.
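In practice, building experiment vectors amounts to reading the same data matrix by column instead of by row. A minimal sketch, with a hypothetical matrix of log ratios (rows are genes, columns are experiments):

```python
genes = ["G1", "G2", "G3"]
experiments = ["expt1", "expt2"]

# Hypothetical log ratios: one row per gene, one column per experiment.
matrix = [
    [1.0, -0.5],
    [0.8, -0.4],
    [-1.2, 0.9],
]

# Gene vectors are the rows; experiment vectors are the columns (the transpose).
gene_vectors = dict(zip(genes, matrix))
experiment_vectors = dict(zip(experiments, (list(col) for col in zip(*matrix))))

print(gene_vectors["G1"])           # [1.0, -0.5]
print(experiment_vectors["expt1"])  # [1.0, 0.8, -1.2]
```

Either set of vectors can then be passed to the same distance metric and clustering procedure.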
Two-way Clustering
Two-way clustering can help show which
samples are most similar, as well as which
genes.
Agglomerative Hierarchical
Clustering
Advantages:
• Simple
• Easy to implement
• Easy to visualize

Disadvantages:
• Can lead to artifacts
• Discarding of subtleties in 2-way clustering
Partitioning Methods
• Split data up into smaller, more homogeneous
sets
• Should avoid artifacts associated with
incorrectly joining dissimilar vectors
• Can cluster each partition independently of
others, by genes and arrays
• K-means clustering and Self-Organizing
Maps are two possible partitioning methods
K-means Clustering
• Split data into ‘n’ partitions, each with
an associated vector.
• Assign genes to partitions, and
recalculate the vector associated with
each partition as the centroid of its
associated genes.
• Repeat until solution converges, or for
a fixed number of iterations.
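The loop described above can be sketched as follows. This is a bare-bones illustration under stated assumptions: Euclidean distance, initial partition vectors drawn at random from the data, and a seed argument added only so the run is repeatable. The example vectors are hypothetical.

```python
import math
import random


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(vectors):
    n = len(vectors)
    return [sum(coords) / n for coords in zip(*vectors)]


def kmeans(vectors, k, max_iter=100, seed=0):
    """Return (partition index for each vector, vector associated with each partition)."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # initial vector for each partition
    assignment = None
    for _ in range(max_iter):
        # Assign each gene to the partition with the closest associated vector.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(v, centers[c])) for v in vectors
        ]
        if new_assignment == assignment:  # nothing moved: the solution has converged
            break
        assignment = new_assignment
        # Recalculate each partition's vector as the centroid of its genes.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty partition keeps its previous vector
                centers[c] = centroid(members)
    return assignment, centers


vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, centers = kmeans(vectors, k=2)
print(assignment)
print(centers)
```

The number of partitions has to be chosen up front, and with less cleanly separated data the result can depend on the starting vectors, so running several seeds is a common precaution.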
Self Organizing Maps
• Create a ‘Map’ of ‘n’ partitions, that
is modeled on the expression data,
where each partition in the map has
an associated vector.

• Genes’ expression vectors are assigned to the partition with the most similar associated vector.

• Neighboring partitions are more similar to each other than they are to distant partitions.
The Map Is Disorganized
Repeat 100,000 times
The Map Is Organized


Dimensional Reduction
• It is hard to get a sense of whether there are clear clusters when clustering data – structure can be hard to discern from the tree
• People have turned to Principal Components Analysis, to be able to project data in 2 (or more) dimensions
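As a rough sketch of what the projection involves, the code below finds the first principal component of a small data set: it centers the data, builds the covariance matrix, and uses power iteration to find the direction of greatest variance. Power iteration is a stand-in chosen to keep the example dependency-free; real analyses would normally use a linear algebra library. The function name and data are hypothetical, and the sketch assumes the data are not all identical and that the starting guess is not orthogonal to the component.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit vector of the first component, score of each row along it)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    vec = [1.0] * dims  # starting guess
    for _ in range(iterations):
        # Multiply by the covariance matrix and renormalize.
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project each centered row onto the component.
    scores = [sum(c * v for c, v in zip(r, vec)) for r in centered]
    return vec, scores


# Hypothetical samples measured in two experiments, lying along the line y = 2x.
rows = [[1, 2], [2, 4], [3, 6], [4, 8]]
component, scores = first_principal_component(rows)
print([round(x, 3) for x in component])
print([round(s, 3) for s in scores])
```

Plotting the scores for the first two components against each other gives the familiar 2-dimensional PCA plot.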
Principal Components Analysis
tSNE
• Similar in principle to
PCA
• However, uses
information in higher
dimensions to better
separate clusters back
in 2-dimensions
• Parameters make a
difference, so beware
Using the Gene Ontology to
assess list of genes
• Many experiments result in a list of
interesting genes
• Typically biologists can make up a story
about any random list
• So, look at all GO annotations for the
genes in a list, and see if the number of
annotations for any GO node is significant
The Categories of GO
(The Gene Ontology)
• Biological Process = goal or objective (Why)

(e.g. DNA replication, Cell Cycle Control, Cell adhesion)

• Molecular Function = elemental activity/task (What)

(e.g. Transcription factor, polymerase, protein kinase)

• Cellular Component = location or complex (Where)

(e.g. pre-replication complex, kinetochore, membrane)

Each Category is a structured, controlled vocabulary


Parent-Child Relationships

Nucleus
– Nucleoplasm
– Nuclear envelope
– Nucleolus
– Chromosome
– Perinuclear space

A child is a subset of a parent’s elements
The cell component term Nucleus has 5 children
Determining P-values for GO
annotation for a list of genes
We can calculate the probability of having x of n
genes having an annotation to a GO node, given
that in the genome, M of N genes have that
annotation, using the hypergeometric
distribution, as:
\[
p = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}
\]
Determining GO significance
To calculate a P-value, we calculate the
probability of having at least x of n annotations:

\[
\text{P-value} = 1 - \sum_{i=0}^{x-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}
\]

Then do multiple hypothesis correction on the p-values
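A minimal sketch of the P-value calculation, using the same symbols as the slides (x of n genes in the list annotated, M of N genes in the genome annotated). The function names are hypothetical. One deliberate difference from the formula as written: the code sums the upper tail directly (i from x up to n) rather than computing 1 minus the lower tail. The two are mathematically equal, but subtracting from 1 loses all precision when the P-value is very small.

```python
from math import comb


def hypergeom_pmf(x, n, M, N):
    """Probability that exactly x of n sampled genes carry the annotation."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


def go_p_value(x, n, M, N):
    """Probability that at least x of n sampled genes carry the annotation."""
    upper = min(n, M)  # cannot observe more annotated genes than exist
    return sum(hypergeom_pmf(i, n, M, N) for i in range(x, upper + 1))


# Tiny worked case: genome of 4 genes, 2 annotated; list of 2 genes, at least 1 annotated.
print(go_p_value(1, 2, 2, 4))  # 5/6

# Counts in the form used for GO results: 12 of 18 in the list vs 66 of 6608 in the genome.
print(go_p_value(12, 18, 66, 6608))
```

The value printed for a real list is an uncorrected P-value; it still needs the multiple hypothesis correction mentioned above, so it need not equal a corrected figure reported by a GO tool.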


Methionine Cluster
[Figure: clustered expression data for the following rows, given as gene name with the second label where it differs: ICY2 (YPL250C), MET11, MXR1 (YER042W), MET17* (YLR302C), SAM3 (YPL274W), MET28, STR3 (YGL184C), MMP1 (YLL061W), MET1, SER33 (YIL074C), MHT1 (YLL062C), MET14, MET16, MET3, MET10, ECM17, MET2* (YNL276C), MUP1, MET17, MET6.]
GO Annotations
• sulfur metabolic process : 2.43e-19 (12/18 vs 66/6608)
• methionine metabolic process : 1.40e-14 (10/18 vs 24/6608)
Recommended Reading
ChIP-Seq
• Valouev, A., Johnson, D.S., Sundquist, A., Medina, C., Anton, E., Batzoglou, S., Myers, R.M. and Sidow, A. (2008). Genome-
wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods 5(9):829-34. QuEST.
• Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum, C., Myers, R.M., Brown, M., Li, W.
and Liu, X.S. (2008). Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9(9):R137.
• Rozowsky, J., Euskirchen, G., Auerbach, R.K., Zhang, Z.D., Gibson, T., Bjornson, R., Carriero, N., Snyder, M. and Gerstein,
M.B. (2009). PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat. Biotechnol. 27:66-75.
• Rhee, H.S. & Pugh, B.F. (2011). Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide
resolution. Cell 147, 1408–1419.
• Nawy, T. (2012). High-resolution chromatin immunoprecipitation. Nature Methods 9, 130.
• Skene, P.J., Henikoff, S. (2015). A simple method for generating high-resolution maps of genome-wide protein binding. Elife
4:e09225.
• Skene, P.J., Henikoff, S. (2017). An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. Elife
6 pii: e21856. doi: 10.7554/eLife.21856.
Recommended Reading
RNA-Seq
• Parkhomchuk, D., Borodina, T., Amstislavskiy, V., Banaru, M., Hallen, L., Krobitsch, S., Lehrach, H., and Soldatov, A. (2009).
Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res. 37(18):e123.
• Borodina, T., Adjaye, J., Sultan, M. (2011). A strand-specific library preparation protocol for RNA sequencing. Methods
Enzymol. 500:79-98.
• Grabherr, M.G., Haas, B.J., et al. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome.
Nat Biotechnol. 29(7):644-52. Trinity
• van Gurp, T.P., McIntyre, L.M., Verhoeven, K.J. (2013). Consistent errors in first strand cDNA due to random hexamer
mispriming. PLoS One 8(12):e85583.
• Sharon, D., Tilgner, H., Grubert, F. and Snyder, M. (2013). A single-molecule long-read survey of the human transcriptome.
Nat Biotechnol. 31(11):1009-14.
• Pelechano, V., Wei, W. and Steinmetz, L.M. (2013). Extensive transcriptional heterogeneity revealed by isoform profiling.
Nature 497(7447):127-31.
• Kim D, Langmead B, Salzberg SL (2015). HISAT: a fast spliced aligner with low memory requirements. Nat Methods.
12(4):357-60.
• Frazee, A.C., Pertea, G., Jaffe, A.E., Langmead, B., Salzberg, S.L., Leek, J.T. (2015). Ballgown bridges the gap between
transcriptome assembly and expression analysis. Nat Biotechnol. 33(3):243-6
• Pertea, M., Pertea, G.M., Antonescu, C.M., Chang, T.C., Mendell, J.T. and Salzberg, S.L. (2015) StringTie enables improved
reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 33(3):290-5.
• Pertea, M., Kim D, Pertea, G.M., Leek, J.T., Salzberg, S.L. (2016). Transcript-level expression analysis of RNA-seq experiments
with HISAT, StringTie and Ballgown. Nat Protoc. 11(9):1650-67.
• Patro, R., Mount, S.M., Kingsford, C. (2014). Sailfish enables alignment-free isoform quantification from RNA-seq reads using
lightweight algorithms. Nat Biotechnol. 32(5):462-4.
• Bray NL, Pimentel H, Melsted P, Pachter L. (2016). Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol.
34(5):525-7 Kallisto
Recommended Reading
Clustering/Expression Data analysis:

• Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998). Cluster analysis and display of
genome-wide expression patterns. Proc Natl Acad Sci U S A 95(25):14863-8.
• Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S.,
Golub, T.R. (1999). Interpreting patterns of gene expression with self-organizing maps:
methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA
96(6):2907.
• Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., Church, G.M. (1999). Systematic
determination of genetic network architecture. Nat Genet. 22(3):281-5.
• Tusher, V.G., Tibshirani, R., Chu, G. (2001). Significance analysis of microarrays applied to
the ionizing radiation response. Proc Natl Acad Sci USA 98(9):5116-21
• Slonim, D.K. (2002). From patterns to pathways: gene expression data analysis comes of age.
Nat Genet. 32 Suppl:502-8.
• McShane, L.M., Radmacher, M.D., Freidlin, B., Yu, R., Li, M.C., Simon, R. (2002). Methods
for assessing reproducibility of clustering patterns observed in analyses of microarray data.
Bioinformatics 18(11):1462-9.
• Bryan, J. (2004). Problems in gene clustering based on gene expression data. Journal of
Multivariate Analysis 90, 44–66.
• Chipman, H. and Tibshirani, R. (2006). Hybrid Hierarchical Clustering with Applications to
Microarray Data. Biostatistics, 7(2):286-301.
