1.RNA Seq Part1 WorkingToTheGoal

This document discusses the goal of RNA-seq analysis for detecting differential gene expression between conditions. It begins by explaining that the goal is to detect differences in mRNA levels between conditions as a proxy for differences in protein activity that cause phenotypic differences. It then discusses key steps in RNA-seq experiments including extracting mRNA, converting to cDNA, sequencing, and normalizing counts between samples. The document emphasizes that the goal is to compare samples to find genes for which counts change systematically between conditions more than expected by chance.

Defining the goal of RNA-seq analysis for differential expression
Joachim Jacob
20 and 27 January 2014
This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts hereof.
Great power comes with great responsibility
RNA-seq enables one to
1) get an idea of which genes are active
2) quantify expression of each transcript
3) quantify alternative splicing
(use your imagination)
Principles of transcriptome analysis and gene expression quantification: an RNA-seq
tutorial. http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12109/abstract
You can't do it all
RNA-seq is powerful, so we have
to aim for a certain goal.
Our goal is to detect
differential expression
on the gene level.
Differential expression, useful?
What are we looking for?
Explanations of observed phenotypes
yeast
GDA
Yeast mutant
GDA + vit C
Why?
The central dogma
What causes the phenotypic differences?
Difference in protein activity
Presence/concentration of proteins in a cell
Level of protein production
Level of templates for protein production
Level of mRNA copies
Does it hold?
Difference in protein activity
Level of mRNA copies
Level of templates for protein production
Level of protein production
Presence/concentration of proteins in a cell
Phenotype
Problem reduction
We can measure mRNA levels (much more easily
than protein levels).
So we measure mRNA.
The level of mRNA is a proxy for the level of
protein activity causing the aberrant
phenotype.
How to measure mRNA
1. Q-PCR (real-time)
A lot of work to measure few
genes, in a relatively wide array
of tissues. Very accurate.
2. Microarray
Easier way to measure many
predefined genes in a relatively
wide array of tissues. Robust.
3. RNA-seq
RNA-seq protocol in a nutshell
Get your sample
Lyse the cells and extract RNA
Convert the RNA to cDNA
The cDNA pool gets sequenced
The result is sequence information from
scratch. No prior information is needed.
Yeast sample
Comprehensive comparative analysis of strand-specific RNA sequencing methods
http://www.nature.com/nmeth/journal/v7/n9/full/nmeth.1491.html
Comparative analysis of RNA sequencing methods for degraded or low-input samples
http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html
The predecessors of RNA-seq
ESTs: expressed sequence tags, ideal for discovery of new genes.
SAGE: serial analysis of gene expression, measurement of the number of copies of mRNA.
Low throughput: long sequence information, but for only ~thousands of genes.
http://www.montana.edu/observatory/people/mcdermottlab.html
http://www.sagenet.org/findings/index.html
Concept of measuring with RNA-seq
Extract mRNA and turn it into cDNA.
Fragment, ligate adaptor, amplify.
Put a fraction of the pool on the sequencer to read fragments.
One template of protein production
Figure: All things must pass: contrasts and commonalities in eukaryotic and bacterial mRNA decay, Nature Reviews Molecular Cell Biology 11, 467–478
GeneA GeneB GeneC
So many steps can fail our assumption
Phenotype
Proteins
mRNA levels
cDNA pool
RNA-seq reads
The RNA-seq reads represent the cDNA pool we've created.
The cDNA pool represents the RNA pool we've extracted.
The mRNA levels are a proxy for protein activity.
The proteins define the phenotype.
Protein activity is regulated:
phosphorylation, ubiquitination, ...
mRNA templates have
different speeds of protein
production: availability of tRNAs,
rate of mRNA degradation,
alternative splicing events, ...
Loss on RNA extraction, 90% of
RNA in the cell is rRNA, ligation
of adapters, conversion to cDNA
not 100%
Failure to map reads to the correct
gene, lane-specific biases on
reading cDNA fragments, ...
Consequence: focus on comparison
Phenotype A
Proteins
mRNA levels
cDNA pool
RNA-seq reads
Phenotype B
Proteins
mRNA levels
cDNA pool
RNA-seq reads
Possibly due
to differences in
expression
DESIGN OF
EXPERIMENT
Comparing number of reads to genes
GeneA GeneB GeneC
sample
RNA-seq
Obviously, the number of reads is dependent on:
1) the expression level of the gene
2) the total number of reads generated
3) the length of the transcript
OUR QUESTION
Normalisation is needed!
Experimental design
Our focus: which genes are differentially expressed
between different conditions?
Obviously, the number of reads is dependent on:
1. the expression level of the gene
2. the total number of reads generated
3. the length of the transcript
Which normalisation is needed?
How many reads to sequence?
"How can we detect genes for which the counts of
reads change between conditions more
systematically than as expected by chance"
We must design an experiment in which we can test
this deviance from chance.
Oshlack et al. 2010. From RNA-seq reads to differential expression results. Genome Biology 2010,
11:220 http://genomebiology.com/2010/11/12/220
How many reads to sequence?
In other words, how deep to sequence? What is the
required 'depth of sequencing'?
The final test will look at ratios:
          GeneA  GeneB  GeneC
sample 1  6      5      3
sample 2  5      6      4
ratio     1.2    0.83   0.75
The difference between the lowest gene count and
the highest gene count is typically 10^5. This is called
the dynamic range.
Linear scale is useless. The logarithmic scale is better.
Wait! Something's not correct here!
Zero remains zero!
We are working with counts. A count is >= 1. A gene
with zero counts can be not yet sequenced (not
deep enough) or is not expressed in that condition.
It is not a full logarithmic scale.
It starts at zero.
So keep all counts above zero?
Assuming equal sequencing depth in the samples,
and these counts: do all these genes differ in
expression?
       sample 1  sample 2  RATIO
GeneA  5         10        2
GeneB  15        30        2
GeneC  40        80        2
GeneD  100       200       2
GeneE  1000      2000      2
GeneZ  1         2         2
So keep everything above zero?
       sample 1  sample 2  RATIO
GeneA  11        10        0.91
GeneB  11        30        2.72
GeneC  60        80        1.33
GeneD  79        200       2.53
GeneE  1150      2000      1.74
GeneZ  5         1         0.20
Q: Is there a trend in how
these numbers change?
Sequencing the result of the same steps again is
called a technical replicate.
Technical replicates
sample
/eneA 0 ! !
/eneC D ! 2
/ene$ D0 !0 '2 '2
/eneD 71 02 10 0
/ene7 00 02' 127 000
/eneL ' 0 0
sample sample sample
We take the same cDNA pool and sequence it several
times: technical replicates.
The Poisson distribution
The counts of technical replicates follow a Poisson
distribution (Marioni et al. 2008). The Poisson distribution
can be applied to systems with a large number of possible
events, each of which is rare.
From Wikipedia. These can be 3
different genes, each with
their own Poisson
distribution. Lambda is
the mean of the gene's
distribution, with a
certain number of reads.
Y-axis: chance to pick
that number of reads.
So when we have 4 technical replicates sequenced
to a big depth, we can get, by
chance, these numbers for 3 different genes:
GeneA  0, 0, 1, 3
GeneB  2, 3, 4, 7
GeneC  8, 9, 11, 14
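The chance variation between technical replicates can be made tangible with a small simulation. The sketch below is only an illustration: the three mean values (1, 4 and 10 reads) are hypothetical choices, and the sampler is Knuth's classic method for Poisson draws, not anything taken from the slides.

```python
import math
import random

def poisson_draw(lam, rng):
    """Draw one Poisson-distributed count with mean lam (Knuth's method)."""
    limit = math.exp(-lam)
    k = 0
    p = rng.random()
    # Keep multiplying uniform random numbers until the product drops below exp(-lam).
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def technical_replicates(lam, n_reps, rng):
    """Simulate read counts of one gene in n_reps technical replicates."""
    return [poisson_draw(lam, rng) for _ in range(n_reps)]

rng = random.Random(1)  # fixed seed so the run is repeatable
# Hypothetical mean counts for three genes (assumption for illustration only).
for name, lam in [("GeneA", 1), ("GeneB", 4), ("GeneC", 10)]:
    print(name, technical_replicates(lam, 4, rng))
```

Running it a few times with different seeds shows how much four replicates of the same cDNA pool can differ purely by chance, especially for the gene with the lowest mean.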
Working the intuition
How many blue balls?
How many red balls?
Draw 10
Draw 10 more
Draw 10 more
Estimate how large the fraction is in the set.
The intuition with the balls
Color     10 draws  20 draws  30 draws  40 draws
Blue
Red
No color
Conclusion of the experiment
The bigger the fraction in the pool, the sooner (i.e.
with less sequencing depth) we are certain about the
estimate of that fraction.
For lower counts, the variance is
relatively bigger than the
variance for higher counts.
CV (coefficient of variation) =
sqrt(count)/count
Genes with lower expression
need much deeper sequencing
than genes with higher
expression levels.
estimate = count; variance = count
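The CV formula above can be evaluated for a few counts to see how quickly the relative uncertainty drops. This is a minimal sketch; the list of counts is an arbitrary choice for illustration.

```python
import math

def poisson_cv(count):
    """Coefficient of variation of a Poisson count: sqrt(count) / count."""
    return math.sqrt(count) / count

# Arbitrary example counts, from a barely detected gene to a highly expressed one.
for count in [1, 10, 100, 1000, 10000]:
    print(f"count={count:>6}  CV={poisson_cv(count):.3f}")
```

A gene with 1 read has a CV of 100%, while 100 reads are needed to bring the Poisson CV down to 10%, which is why lowly expressed genes need much deeper sequencing.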
Comparing counts
"Here we show the overlap of Poisson
distributions of single measurements at
different read counts. Because relative
Poisson uncertainty is high at low read
counts, a count of 1 versus 2 has very
little power to discriminate a true 2X fold
change, though at higher counts a 2X fold
change becomes significant.
In an actual experiment, the width of the
distribution would be greater due to
additional biological and technical
uncertainty, but the uncertainty to the
mean expression would narrow with
each additional replicate."
Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression.
Bioinformatics (2013) doi: 10.1093/bioinformatics/btt015
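The overlap described in the quote can be quantified under a pure Poisson assumption. The sketch below computes the probability that a single measurement from the condition with the doubled mean actually shows the higher count; the chosen means (1, 10, 100) are hypothetical, and biological variation is deliberately ignored.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def prob_higher(lam_low, lam_high):
    """P(Y > X) for independent X ~ Poisson(lam_low) and Y ~ Poisson(lam_high)."""
    # Truncate the sums far out in the tail, where the probabilities are negligible.
    upper = int(lam_high + 10 * math.sqrt(lam_high) + 20)
    total = 0.0
    cdf_x = 0.0  # holds P(X < y) while y increases
    for y in range(upper + 1):
        total += poisson_pmf(y, lam_high) * cdf_x
        cdf_x += poisson_pmf(y, lam_low)
    return total

# Hypothetical mean counts; the second condition always has a true 2X fold change.
for mean in [1, 10, 100]:
    p = prob_higher(mean, 2 * mean)
    print(f"mean {mean} vs {2 * mean}: P(higher count in the 2X condition) = {p:.3f}")
```

At a mean of 1 versus 2 the doubled condition shows the higher count only a little more often than not, whereas at 100 versus 200 it does so almost always.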
Comparing technical replicates
Risso et al. "GC-Content Normalization for RNA-Seq Data"
BMC Bioinformatics 2011, 12:480
http://www.biomedcentral.com/1471-2105/12/480 - EDASeq package (R)
[Figure: mean-variance plot of technical replicates (axes: log2 of the counts), showing the correlation between mean and variance according to Poisson and a lowess fit through the data.]
But Poisson does not seem to fit
Extending the samples to real biological samples, this
mean-variance relationship does not hold!
Plotted using the EDASeq
package in R.
Reasonable fit
Something is going on!
An extra source of variation
Compared with the Poisson distribution, the variance is 'overdispersed':
the variance is bigger than expected for
higher counts between biological replicates.
Something is going on!
An extra source of variation
For Poisson: CV = std dev / mean => CV² = 1/μ
If an additional distribution is involved (also
dependent on the fraction of the gene in the cDNA
pool), we have a
mixture of
distributions:
CV² = 1/μ + φ
Low counts!
φ: dispersion
Generalization of Poisson
with this extra parameter:
the Negative Binomial
Model fits better!
The negative binomial model
The NB model fits observed expression data of
RNA-seq better. It is a generalization of Poisson, and
2 parameters need to be estimated (μ and φ).
Counts (gene g in sample j) have a
Mean = μ_gj
Variance = μ_gj + φ_g μ_gj²
Biological CV² = φ_g => Biological CV = √φ_g
Methods differ in estimating this dispersion per gene.
It can only be measured with true biological replicates.
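The link between the NB variance and the CV² expression can be written out in one step. The symbol K_gj for the count of gene g in sample j is notation introduced here for illustration; μ_gj and φ_g are the mean and dispersion from the slide.

```latex
\begin{align*}
\operatorname{Var}(K_{gj}) &= \mu_{gj} + \phi_g \, \mu_{gj}^2 \\
\mathrm{CV}^2 = \frac{\operatorname{Var}(K_{gj})}{\mu_{gj}^2}
  &= \frac{1}{\mu_{gj}} + \phi_g
\end{align*}
```

The first term is the Poisson part, which shrinks as the mean count grows; the second term is the dispersion, which stays constant.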
Variation summary, intuitively
Total CV² = Technical CV² + Biological CV²
For low counts, the Poisson (technical) variation or
the measurement error is dominant.
For higher counts, the Poisson variation gets smaller,
and another source of variation becomes dominant:
the dispersion or the biological variation. Biological
variation does not get smaller with higher counts.
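A few numbers make this decomposition concrete. The sketch below assumes the NB relation CV² = 1/mean + dispersion; the dispersion of 0.04 (a biological CV of 20%) is a hypothetical value picked for illustration, not an estimate from real data.

```python
import math

def total_cv(mean_count, dispersion):
    """Total CV under the NB model: sqrt(1/mean + dispersion)."""
    technical_cv2 = 1.0 / mean_count  # Poisson part, shrinks with higher counts
    biological_cv2 = dispersion       # does not shrink with higher counts
    return math.sqrt(technical_cv2 + biological_cv2)

DISPERSION = 0.04  # hypothetical: biological CV of 20%

for mean_count in [5, 25, 100, 1000, 10000]:
    technical = math.sqrt(1.0 / mean_count)
    print(f"mean={mean_count:>6}  technical CV={technical:.3f}  "
          f"total CV={total_cv(mean_count, DISPERSION):.3f}")
```

With this dispersion the two sources contribute equally at a mean count of 25 (1/dispersion); above that, sequencing deeper hardly lowers the total CV, which levels off at the biological CV.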
Beyond the NB model
It appears from analysis of many
biological replicates (n=69) that not
every gene can be modeled as NB:
the Poisson-Tweedie model
provides a further generalisation
and a better fit for many genes
(with an additional shape
parameter).
*eft figure, ra. data sho.s that about 2DG of
the genes fit an NB model. Depending on the
estimated shape parameter, other
distributions fit better.
Esnaola et al. BMC Bioinformatics 2013, 14:254
http://www.biomedcentral.com/1471-2105/14/254
Consequence for our design
For low counts: the uncertainty is big due to
Poisson.
For high counts: the uncertainty is big due to
biological variation (highly expressed genes differ
in their natural variation, regulated by cellular
processes, more than lowly expressed genes).
If we focus on the ratios between the conditions:
is it reasonable to set a restriction on fold change?
Highly expressed genes can have a smaller fold change and be
significant. Lowly expressed genes can exceed a fold change of 2 by chance.
Consequence on fold change
The fold-change cut-off readily applied in micro-array analysis
is of no use in RNA-seq.
Blue and red:
known DE genes
Volcano plot
These often-applied cut-offs
can prevent
detecting DE genes
Long story to say...
We need to estimate the model behind the count.
Never work without biological replicates.
Never work with 2 biological replicates.
Try to avoid working with 3 biological replicates.
Go for at least 4 biological replicates.
Break?
Overview
GeneA GeneB GeneC
Sample 1, RNA-seq
Sample 2, RNA-seq
Sample 3, RNA-seq
Sample 4, RNA-seq
Sample 5, RNA-seq
Sample 6, RNA-seq
Condition X
Condition Y
Summary
Obviously, the number of reads is dependent on:
1) chance
Define the count model (NB) from replicates
2) the expression level of the gene
Compare the ratios with a test
3) the total number of reads generated
4) the length of the transcript
The total number of reads generated
GeneA GeneB GeneC
sample
RNA-seq
The number of reads is dependent on the total
number of reads generated. If one library is
sequenced to 20M reads, and another one to
40M, most genes will ~double their counts.
GeneA GeneB GeneC
sample
More RNA-seq
Normalization for library size
Naive approach: divide by total library size. This is no longer
applied!
Why not? Composition matters!
2 things to remember:
- zero sum system (or "we cannot count what we can't sequence")
- 5 orders of magnitude
In every sample, a lot of
reads are spent on a few
extremely highly expressed
genes. Which genes? That
differs between libraries, but
it negatively affects the naive
size normalization if we
include those genes.
Normalization for library size
Schematically: when normalized on library size
(squares represent number of reads).
Rest of the genes
Rest of the genes
Few genes with enormous counts: there is NO SATURATION of these counts
All counts for library A
All counts for library B
Better normalization would be as shown below.
DESeq2 and edgeR apply such an approach (see
later).
Rest of the genes
Rest of the genes
100%
100%
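The composition problem can be reproduced with a toy count table. The sketch below contrasts naive total-count factors with a median-of-ratios factor, the idea published for DESeq (edgeR uses a related but different method, TMM). It is a simplified illustration, not the code of either package, and the counts are hypothetical: five genes are equally expressed in both libraries, while one gene takes up a much larger share of library B.

```python
import math
from statistics import median

def library_size_factors(counts):
    """Naive factors: each library total divided by the mean library total."""
    n_samples = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    mean_total = sum(totals) / n_samples
    return [total / mean_total for total in totals]

def median_ratio_factors(counts):
    """Per sample: median over genes of count / geometric mean across samples."""
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) == 0:
            continue  # geometric mean would be zero, so the gene is skipped
        geo_mean = math.exp(sum(math.log(c) for c in row) / n_samples)
        for j, c in enumerate(row):
            ratios[j].append(c / geo_mean)
    return [median(r) for r in ratios]

def normalise(counts, factors):
    """Divide every count by the size factor of its sample."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]

# Hypothetical counts, columns = [library A, library B], both 1000 reads in total.
counts = [
    [100, 50], [200, 100], [50, 25], [400, 200], [30, 15],  # "rest of the genes"
    [220, 610],                                             # gene with enormous counts
]

print("library size factors:", library_size_factors(counts))
print("median-ratio factors:", median_ratio_factors(counts))
for row in normalise(counts, median_ratio_factors(counts)):
    print([round(value, 1) for value in row])
```

Both libraries have the same total, so dividing by library size changes nothing and the five "rest" genes look two-fold lower in library B. With the median-of-ratios factors those five genes come out equal, and only the one dominant gene remains different.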
Gene length influences the count
"Longer transcripts generate more reads"
True! But the transcript length does not differ
between samples. Since we are concerned with
relative differences between samples, this needs
no normalization (this story changes in case of
absolute quantification).
Sample A Sample B
Gene A
Gene B
Gene A
Gene B
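Why the length drops out of a between-sample comparison can be shown with a simplified model. The notation is an assumption made for this illustration: K_gA is the count of gene g in sample A, s_A a sample-specific scaling factor (library size and composition), L_g the transcript length and e_gA the expression level.

```latex
\begin{align*}
\mathbb{E}[K_{gA}] &= s_A \, L_g \, e_{gA}, \qquad
\mathbb{E}[K_{gB}] = s_B \, L_g \, e_{gB} \\
\frac{\mathbb{E}[K_{gA}]}{\mathbb{E}[K_{gB}]}
  &= \frac{s_A}{s_B} \cdot \frac{e_{gA}}{e_{gB}}
\end{align*}
```

The length L_g appears in both numerator and denominator and cancels, so only the sample-specific factors need normalization. When two different genes are compared within one sample, the lengths differ and no longer cancel.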
Between sample variation
Properties of libraries/samples can affect the
counts, and lead to variation. This is called
between-lane variation. Obvious ones: library
size (how many reads are sampled), library
composition.
Different libraries/samples can exhibit increased
variation by differing in how gene properties
relate to gene counts. This is called within-lane
variation.
GC-content of genes can influence counts
GC-content differs between genes. But it does
not change between samples, so there should
be no problem for relative expression
comparison.
We can visualize the
relationship between
counts and GC very
easily (see right). There is
some trend, and it is
equal for all samples.
EDASeq (R)
Sometimes, samples show different relationships
between GC-content of the genes and the counts.
This within-lane
(or intra-sample) variation
needs to be corrected for,
so that in one sample not
all differentially expressed
genes are also the
GC-rich ones.
Length can also have this
effect.
What we need to know for our set-up
We want to detect differentially expressed genes
between 2 or more conditions.
For this, we need to apply the conditions in a
controlled environment (randomisation, ...).
For good testing, we need to have some biological
replicates per condition.
For cost effectiveness, we determine how deep we
will sequence from each sample.
We analyse the reads, get raw counts and do the test!
Library preparation and lane loading
HiSeq2000: 24 single-index barcodes available.
lane gi#es 00-20 > reads( 4ne lane of 00 bp %7
approx Z(000(
Bioinformatics analysis will take most of your time
1. Quality control (QC) of raw reads
2. Preprocessing: filtering of reads
and read parts, to help our goal
of differential detection.
QC of preprocessing
3. Mapping to a reference genome
(alternative: to a transcriptome)
QC of the mapping
4. Count table extraction
QC of the count table
5. DE test
6. Biological insight
Overview
http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
The numbers get reduced with every step
20>
20>
0>
Deeper, or more replicates?
Variance will be lower with more reads, but
sequencing another biological replicate is
preferred over sequencing deeper, or technical reps.
doi: 10.1093/bioinformatics/btt015
There is a tool to help you set up
Scotty - power analysis
Power: the probability to reject the null hypothesis if the alternative is
true.
'How many samples and how deep in order to minimize false
negatives'.
(a null hypothesis is always a scenario in which there is no difference,
hence no differential expression).
Alternative tools:
http://wiki.bits.vib.be/index.php/RNAseq_toolbox
Help with design
http://wiki.bits.vib.be/index.php/RNAseq_toolbox
http://rnaseq.uoregon.edu/exp_design.html
How many samples to sequence?
Scotty exercise
Keywords
A read count of a gene is dependent on:
1. chance
2. expression level
3. transcript length
4. depth of sequencing
5. GC-content
Poisson distribution
Negative binomial distribution
Condition
Sample
Normalization
Write in your own words what the terms mean
Reads
All my references available at:
https://www.zotero.org/groups/dernaseq/items
