This document discusses the goal of RNA-seq analysis for detecting differential gene expression between conditions. It begins by explaining that the goal is to detect differences in mRNA levels between conditions as a proxy for differences in protein activity that cause phenotypic differences. It then discusses key steps in RNA-seq experiments including extracting mRNA, converting to cDNA, sequencing, and normalizing counts between samples. The document emphasizes that the goal is to compare samples to find genes for which counts change systematically between conditions more than expected by chance.
1.RNA Seq Part1 WorkingToTheGoal
Defining the goal of RNA-seq analysis for differential expression
Joachim Jacob, 20 and 27 January 2014
This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

Great power comes with great responsibility
RNA-seq enables one to
1) get an idea which are all active genes
2) quantify expression of each transcript
3) quantify alternative splicing
(use your imagination)
Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial. http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12109/abstract

You can't do it all
RNA-seq is powerful, so we have to aim for a certain goal. Our goal is to detect differential expression on the gene level.

Differential expression: useful?
What are we looking for? Explanations of observed phenotypes.
Yeast: GDA. Yeast mutant: GDA + vit C. Why?

The central dogma
Yeast: GDA. Yeast mutant: GDA + vit C. What causes the phenotypic differences?
Difference in protein activity
Presence/concentration of proteins in a cell
Level of protein production
Level of templates for protein production
Level of mRNA copies

Does it hold?
Level of mRNA copies -> Level of templates for protein production -> Level of protein production -> Presence/concentration of proteins in a cell -> Difference in protein activity -> Phenotype

Problem reduction
We can measure mRNA levels (much easier than protein levels). So we measure mRNA. The level of mRNA is a proxy of the level of protein activity causing the aberrant phenotype.

How to measure mRNA
1. Q-PCR (real-time): a lot of work to measure few genes, in a relatively wide array of tissues. Very accurate.
2. Microarray: an easier way to measure many predefined genes in a relatively wide array of tissues. Robust.
3. RNA-seq

RNA-seq protocol in a nutshell
Get your sample
Lyse the cells and extract RNA
Convert the RNA to cDNA
The cDNA pool gets sequenced
The result is sequence information from scratch. No prior information is needed.
Comprehensive comparative analysis of strand-specific RNA sequencing methods http://www.nature.com/nmeth/journal/v7/n9/full/nmeth.1491.html
Comparative analysis of RNA sequencing methods for degraded or low-input samples http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html

The predecessors of RNA-seq
ESTs: expressed sequence tags, ideal for discovery of new genes. http://www.montana.edu/observatory/people/mcdermottlab.html
SAGE: serial analysis of gene expression, measurement of the number of copies of mRNA. http://www.sagenet.org/findings/index.html
Low throughput, long sequence information, but for only ~thousands of genes.

Concept of measuring with RNA-seq
Extract mRNA and turn it into cDNA. Fragment, ligate adaptor, amplify. Put a fraction of the pool on the sequencer to read fragments.
[Figure: one template of protein production. Source: All things must pass: contrasts and commonalities in eukaryotic and bacterial mRNA decay, Nature Reviews Molecular Cell Biology 11, 467-478]

So many steps: our assumption must fail
Phenotype
Proteins: define the phenotype
mRNA levels: are a proxy for protein activity
cDNA pool: represents the RNA pool we've extracted
RNA-seq reads: represent the cDNA pool we've created

Where it can go wrong:
Proteins: protein activity is regulated (phosphorylation, ubiquitination, ...)
mRNA levels: mRNA templates have different speeds of protein production (availability of tRNAs, rate of mRNA degradation, alternative splicing events, ...)
cDNA pool: loss on RNA extraction, 90% of RNA in the cell is rRNA, ligation of adapters, conversion to cDNA is not 100%
RNA-seq reads: failure to map reads to the correct gene, lane-specific biases on reading cDNA fragments, ...

Consequence: focus on comparison
Phenotype A: Proteins, mRNA levels, cDNA pool, RNA-seq reads
Phenotype B: Proteins, mRNA levels, cDNA pool, RNA-seq reads
The difference in phenotype is possibly due to differences in expression.
DESIGN OF EXPERIMENT

Comparing number of reads to genes
Obviously, the number of reads is dependent on:
1) the expression level of the gene (OUR QUESTION)
2) the total number of reads generated (normalisation is needed!)
3) the length of the transcript (normalisation is needed!)

Experimental design
Our focus: which genes are differentially expressed between different conditions?
Which normalisation is needed? How many reads to sequence?
"How can we detect genes for which the counts of reads change between conditions more systematically than as expected by chance"
We must design an experiment in which we can test this deviance from chance.
Oshlack et al. 2010. From RNA-seq reads to differential expression results. Genome Biology 2010, 11:220 http://genomebiology.com/2010/11/12/220

How many reads to sequence?
In other words, how deep to sequence? What is the required 'depth of sequencing'?
The final test will look at ratios.
[Figure: counts for GeneA, GeneB and GeneC in two RNA-seq samples, with their ratios]

How many reads to sequence?
The difference between the lowest gene count and the highest gene count is typically 10^5. This is called the dynamic range. Linear scale is useless. The logarithmic scale is better.
Wait! Something's not correct here! Zero remains zero! We are working with counts. A count is >= 0. A gene with zero counts can be not yet sequenced (not deep enough) or is not expressed in that condition. It is not a full logarithmic scale. It starts at zero.

So keep all counts above zero?
Assuming equal sequencing depth in the samples, and these counts. Do all these genes differ in expression?

         sample 1   sample 2   RATIO
GeneA           5         10       2
GeneB          15         30       2
GeneC          40         80       2
GeneD         100        200       2
GeneE        1000       2000       2
GeneZ           1          2       2

So keep everything above zero?

         sample 1   sample 2   RATIO
GeneA          11         10    0.91
GeneB          11         30    2.72
GeneC          60         80    1.33
GeneD          79        200    2.53
GeneE        1150       2000    1.74
GeneZ           5          1    0.20

Is there a trend in how these numbers change?

Technical replicates
Sequencing the result of the same steps again is called a technical replicate. We take the same cDNA pool and sequence it several times: technical replicates.
[Table: counts for GeneA to GeneZ in four technical replicates of one sample]

The Poisson distribution
The counts of technical replicates follow a Poisson distribution (Marioni et al. 2008). The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare.
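To make the Poisson claim for technical replicates concrete, here is a minimal simulation sketch. It is not taken from the presentation: the expected counts (2, 20 and 200), the number of replicates and the helper names are assumptions chosen for illustration. For a Poisson distribution the variance equals the mean, so the two printed columns should be close to each other at every expression level.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    # Keep multiplying uniform draws until the product drops below exp(-lam).
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def technical_replicates(lam, n_replicates, seed=0):
    """Simulate read counts of one gene across technical replicates."""
    rng = random.Random(seed)
    return [sample_poisson(lam, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    # Hypothetical expected counts for a lowly, moderately and highly expressed gene.
    for lam in (2, 20, 200):
        counts = technical_replicates(lam, n_replicates=2000)
        mean = statistics.mean(counts)
        variance = statistics.variance(counts)
        print(f"lambda={lam:>3}  mean={mean:7.2f}  variance={variance:7.2f}")
```

Knuth's method is used only to keep the sketch free of third-party packages; it is slow for large means, and a real analysis would use a statistics library instead.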
[Figure, from Wikipedia: Poisson distributions. These can be 3 different genes, each with their own Poisson distribution. Lambda is the mean of the gene's distribution, with a certain number of reads. Y-axis: chance to pick that number of reads.]

The Poisson distribution
So when we have 4 technical replicates sequenced up to a big depth (millions of reads), we can get, by chance, these numbers for 3 different genes:
GeneA: 0, 0, 1, 3
GeneB: 2, 3, 4, 7
GeneC: 8, 9, 11, 14

Working the intuition
How many blue balls? How many red balls? Draw 10. Draw 10 more. Draw 10 more. Estimate how large the fraction is in the set.

The intuition with the balls
Color      10 draws   20 draws   30 draws   40 draws
Blue
Red
No color

Conclusion of the experiment
The bigger the fraction in the pool, the quicker (i.e. with less sequencing depth) we are certain about the estimate of that fraction.
For lower counts, the variance is relatively bigger than the variance for higher counts: estimate = count; variance = count, so CV (coefficient of variation) = sqrt(count)/count.
Genes with lower expression need much deeper sequencing than genes with higher expression levels.

Comparing counts
"Here we show the overlap of Poisson distributions of single measurements at different read counts. Because relative Poisson uncertainty is high at low read counts, a count of 1 versus 2 has very little power to discriminate a true 2X fold change, though at higher counts a 2X fold change becomes significant. In an actual experiment, the width of the distribution would be greater due to additional biological and technical uncertainty, but the uncertainty to the mean expression would narrow with each additional replicate."
Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Bioinformatics (2013) doi: 10.1093/bioinformatics/btt015

Comparing technical replicates
[Figure: correlation between mean and variance according to Poisson, with a lowess fit through the data (log2 of the counts). Plotted with the EDASeq package (R).]
Risso et al. "GC-Content Normalization for RNA-Seq Data" BMC Bioinformatics 2011, 12:480 http://www.biomedcentral.com/1471-2105/12/480

But Poisson does not seem to fit
Extending the samples to real biological samples, this mean-variance relationship does not hold!
[Figure: mean versus variance for biological replicates, plotted using the EDASeq package in R; the fit is reasonable only for part of the range]

Something is going on! An extra source of variation
Compared with the Poisson distribution, the variance is 'overdispersed': between biological replicates, the variance is bigger than expected for higher counts.
Where Poisson: $\mathrm{CV} = \text{std dev} / \text{mean} \Rightarrow \mathrm{CV}^2 = 1/\mu$
If an additional distribution is involved (also dependent on the fraction of the gene in the cDNA pool), we have a mixture of distributions:
$\mathrm{CV}^2 = 1/\mu + \phi$
The first term matters for low counts; the second term, $\phi$, is the dispersion.
Generalization of Poisson with this extra parameter: the Negative Binomial model fits better!

The negative binomial model
The NB model fits observed expression data of RNA-seq better. It is a generalization of Poisson, and 2 parameters need to be estimated ($\mu$ and $\phi$).
Counts (gene g in sample j) have a
Mean = $\mu_{gj}$
Variance = $\mu_{gj} + \phi_g \mu_{gj}^2$
Biological $\mathrm{CV}^2 = \phi_g \Rightarrow$ Biological $\mathrm{CV} = \sqrt{\phi_g}$
Methods differ in estimating this dispersion per gene. It can only be measured with true biological replicates.

Variation summary, intuitively
Total $\mathrm{CV}^2$ = Technical $\mathrm{CV}^2$ + Biological $\mathrm{CV}^2$
For low counts, the Poisson (technical) variation or the measurement error is dominant.
For higher counts, the Poisson variation gets smaller, and another source of variation becomes dominant: the dispersion or the biological variation. Biological variation does not get smaller with higher counts.

Beyond the NB model
It appears from analysis of many biological replicates (n=69) that not every gene can be modeled as NB: the Poisson-Tweedie model provides a further generalisation and a better fit for many genes (with an additional shape parameter).
[Left figure: share of the genes in the raw data that fit a NB model.] Depending on the estimated shape parameter, other distributions fit better.
Esnaola et al. BMC Bioinformatics 2013, 14:254 http://www.biomedcentral.com/1471-2105/14/254

Consequence for our design
For low counts: the uncertainty is big due to Poisson variation.
For high counts: the uncertainty is big due to biological variation (highly expressed genes differ in their natural variation, regulated by cellular processes, more than lowly expressed genes).
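The variation summary can be checked numerically with a short sketch. It only evaluates the formulas stated in the slides (Poisson CV = sqrt(count)/count, NB variance = mean + dispersion x mean squared); the dispersion value of 0.04 and the function names are assumptions made up for this illustration, not values from the presentation.

```python
import math


def poisson_cv(mu):
    """Technical (Poisson) CV: sqrt(mu) / mu."""
    return math.sqrt(mu) / mu


def nb_variance(mu, phi):
    """Negative binomial variance: mu + phi * mu**2."""
    return mu + phi * mu ** 2


def total_cv(mu, phi):
    """Total CV under the NB model, equal to sqrt(1/mu + phi)."""
    return math.sqrt(nb_variance(mu, phi)) / mu


if __name__ == "__main__":
    phi = 0.04  # hypothetical dispersion, i.e. a biological CV of 20%
    for mu in (1, 10, 100, 1000, 10000):
        print(
            f"mean={mu:>5}  technical CV={poisson_cv(mu):.3f}  "
            f"biological CV={math.sqrt(phi):.3f}  total CV={total_cv(mu, phi):.3f}"
        )
```

With this assumed dispersion the two sources contribute equally at a mean count of 1/0.04 = 25; below that the Poisson term dominates, above it the total CV flattens out near the biological CV no matter how high the count gets.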
If we focus on the ratios between the conditions: is it reasonable to set a restriction of fold change?
Highly expressed genes can have a smaller fold change and be significant. Lowly expressed genes can exceed a fold change of 2 by chance.

Consequence on fold change
The fold-change cut-off readily applied in micro-array analysis is not of use in RNA-seq. These often applied cut-offs can prohibit detecting DE genes.
[Figure: volcano plot; blue and red: known DE genes]

Long story to say...
We need to estimate the model behind the count.
Never work without biological replicates.
Never work with 2 biological replicates.
Try avoiding working with 3 biological replicates.
Go for at least 4 biological replicates.

Break?

Overview
[Figure: Samples 1 to 6, each yielding RNA-seq counts for GeneA, GeneB and GeneC, grouped into Condition X and Condition Y]

Summary
Obviously, the number of reads is dependent on:
1) chance: define the count model (NB) from replicates
2) the expression level of the gene: compare the ratios with a test
3) the total number of reads generated
4) the length of the transcript

The total number of reads generated
The number of reads is dependent on the total number of reads generated. If one library is sequenced to 20M reads, and another one to 40M, most genes will ~double their counts.

Normalization for library size
Naive approach: divide by total library size. This is not applied anymore! Why not? Composition matters!
2 things to remember:
- zero sum system (or "we cannot count what we can't sequence")
- 5 orders of magnitude
In every sample, a lot of reads are spent on a few extremely highly expressed genes. Which genes? That differs between libraries, but it negatively affects the naïve size normalization if we include those genes.
[Figure: schematically, all counts for library A and library B when normalized on library size (squares represent number of reads): the rest of the genes versus a few genes with enormous counts. There is NO SATURATION of these counts.]
A better normalization scales the libraries on the rest of the genes, so that these make up 100% in both libraries. DESeq2 and EdgeR apply such an approach (see later).

Gene length influences the count
"Longer transcripts generate more reads": true! But the transcript length does not differ between samples. Since we are concerned with relative differences between samples, this needs no normalization (this story changes in case of absolute quantification).
[Figure: Gene A and Gene B in Sample A and Sample B]

Between sample variation
Properties of libraries/samples can affect the counts, and lead to variation. This is called between-lane variation. Obvious ones: library size (how many reads are sampled), library composition.
Different libraries/samples can exhibit increased variation by differing in how gene properties relate to gene counts. This is called within-lane variation.

GC-content of genes can influence counts
GC-content differs between genes. But it does not change between samples, so there should be no problem for relative expression comparison. We can visualize the relationship between counts and GC very easily. There is some trend, and it is equal for all samples.
[Figure: counts versus GC-content per sample, plotted with EDASeq (R)]
Sometimes, samples show different relationships between GC-content of the genes and the counts. This within-lane (or intra-sample) variation needs to be corrected for, so that in one sample not all differentially expressed genes are also the GC-rich ones. Length can also have this effect.
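The library-size slides argue that dividing by the total count is misled by a few enormous genes. The sketch below contrasts that naive factor with a median-of-ratios size factor, a simplified version of the idea behind DESeq2's normalization and not the package's actual code. The two toy libraries, their gene names and all counts are invented for illustration: library B is sequenced twice as deep, and one gene additionally explodes in B.

```python
import math
import statistics

# Hypothetical toy libraries: B has double depth, and GeneHuge is strongly induced in B.
LIBRARY_A = {"GeneA": 100, "GeneB": 200, "GeneC": 400, "GeneHuge": 1000}
LIBRARY_B = {"GeneA": 200, "GeneB": 400, "GeneC": 800, "GeneHuge": 20000}


def total_count_factors(samples):
    """Naive size factors: each library's total count relative to the first library."""
    totals = [sum(counts.values()) for counts in samples]
    return [total / totals[0] for total in totals]


def median_of_ratios_factors(samples):
    """Size factors: median ratio of each gene to its geometric mean across samples."""
    # Genes with a zero count in any sample have no usable geometric mean; skip them.
    genes = [g for g in samples[0] if all(s[g] > 0 for s in samples)]
    log_geo_mean = {g: statistics.mean(math.log(s[g]) for s in samples) for g in genes}
    return [
        math.exp(statistics.median(math.log(s[g]) - log_geo_mean[g] for g in genes))
        for s in samples
    ]


if __name__ == "__main__":
    naive = total_count_factors([LIBRARY_A, LIBRARY_B])
    robust = median_of_ratios_factors([LIBRARY_A, LIBRARY_B])
    print("naive B/A factor:           ", round(naive[1] / naive[0], 2))
    print("median-of-ratios B/A factor:", round(robust[1] / robust[0], 2))
```

Because the median ignores the single extreme gene, the second factor reflects the twofold difference in depth shared by the rest of the genes, while the total-count factor is dominated by GeneHuge and would make the unchanged genes look down-regulated in library B.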
What we need to know for our set-up
We want to detect differentially expressed genes between 2 or more conditions.
For this, we need to apply the conditions in a controlled environment (randomisation, ...).
For good testing, we need to have some biological replicates per condition.
For cost effectiveness, we determine how deep we will sequence from each sample.
We analyse the reads, get raw counts and do the test!

Library preparation and lane loading
HiSeq2000: 24 single-index barcodes available. One lane gives 150-180 M reads. One lane of single-end (SE) sequencing costs in the order of a thousand euros.

Bioinformatics analysis will take most of your time
1 Quality control (QC) of raw reads
2 Preprocessing: filtering of reads and read parts, to help our goal of differential detection. QC of preprocessing
3 Mapping to a reference genome (alternative: to a transcriptome). QC of the mapping
4 Count table extraction. QC of the count table
5 DE test
6 Biological insight

Overview
http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
The numbers get reduced with every step.
[Figure: number of reads remaining after each analysis step]

Deeper, or more replicates?
Variance will be lower with more reads, but sequencing another biological replicate is preferred over sequencing deeper, or technical reps. doi: 10.1093/bioinformatics/btt015

There is a tool to help you set up
Scotty - power analysis
Power: the probability to reject the null hypothesis if the alternative is true. 'How many samples and how deep in order to minimize false negatives'. (A null hypothesis is always a scenario in which there is no difference, hence no differential expression.)
Alternative tools: http://wiki.bits.vib.be/index.php/RNAseq_toolbox

Help with design
http://wiki.bits.vib.be/index.php/RNAseq_toolbox
http://rnaseq.uoregon.edu/exp_design.html

How many samples to sequence?
Scotty exercise

Keywords
A read count of a gene is dependent on:
1. chance
2. expression level
3. transcript length
4. depth of sequencing
5. GC-content
Poisson distribution
Negative binomial distribution
Condition
Sample
Normalization
Reads
Write in your own words what the terms mean.

All my references available at: https://www.zotero.org/groups/dernaseq/items
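As a closing worked example for the 'Deeper, or more replicates?' slide, the sketch below compares the two options under the negative binomial model from the slides. It assumes, as a simplification not stated in the presentation, that the mean over n independent biological replicates has a squared CV of (1/mean + dispersion)/n and that doubling depth simply doubles the mean count. The mean counts, the dispersion of 0.1 and the function name are hypothetical.

```python
import math


def cv_of_mean(mu, phi, n_replicates):
    """CV of the mean count over n independent biological replicates (NB model)."""
    return math.sqrt((1.0 / mu + phi) / n_replicates)


if __name__ == "__main__":
    phi = 0.1  # hypothetical dispersion
    for mu in (2, 100):  # a lowly and a moderately expressed gene
        baseline = cv_of_mean(mu, phi, n_replicates=3)
        deeper = cv_of_mean(2 * mu, phi, n_replicates=3)    # double the depth
        more_reps = cv_of_mean(mu, phi, n_replicates=6)     # double the replicates
        print(
            f"mean={mu:>3}  baseline={baseline:.3f}  "
            f"2x depth={deeper:.3f}  2x replicates={more_reps:.3f}"
        )
```

Under these assumptions extra depth only shrinks the Poisson term, whereas extra replicates shrink both terms, which matches the slide's preference for another biological replicate. The sketch ignores cost differences between the two options, which a tool such as Scotty takes into account.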