Exercises 2013 Teachernotes
March 2013
Antoine van Kampen
Bioinformatics Laboratory AMC

Introduction
Go to the web-site www.bioinformaticslaboratory.nl (then proceed to 'Education' and 'Introduction to Bioinformatics'). There you will find pointers to the resources you need for these exercises. The exercises will guide you through several aspects of gene structure and signal prediction using specific software applications and biological databases. In real practice the elucidation of gene structure involves much more detailed and complex analysis.
Exercise 1: Determination of the open reading frame (ORF) of the hemoglobin alpha 2 (HBA2) gene.
In this exercise you will learn how to determine an open reading frame (ORF) and the gene product of that ORF.
a) Retrieve the alpha 2 globin mRNA sequence (NM_000517) from the GenBank database. Can you manually identify the Open Reading Frame (ORF), i.e., the coding sequence (e.g., in notepad or wordpad)? Proceed by determining the start and stop codons (use the genetic code table). Note that the sequence contains triplets of nucleotides that look like start/stop codons but are not the true start and stop codons. Why is that?
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACC
ATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGG
TGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGG
CTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCCCTGACCAACGCCGTGGCGCACGTGGACGACATGC
CCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCC
ACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCC
TGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAA
GCTGGAGCCTCGGTAGCCGTTCCTCCTGCCCGATGGGCCTCCCAACGGGCCCTCCTCCCCTCCTTGCACCGGCCCTTC
CTGGTCTTTGAATAAAGTCTGAGTGGGCGGC
Gene finding
b) Once you have determined the ORF of the HBA2 gene, translate the first 10 codons to the amino acid sequence (use the genetic code table).
First amino acids (start codon followed by 10 codons): M V L S P A D K T N V
c) Are the ORF and the amino acid sequence confirmed by the NM_000517 annotation in the GenBank database?
d) For the automatic determination of putative ORFs you can also use the ORF finder at the NCBI site. Go to the ORF finder and copy/paste the NM_000517 sequence or just type in the accession code (the program is linked to the GenBank database). The results are the ORFs for all six reading frames. The longest ORF is most probably the frame that will be translated to the protein. By 'clicking' on the largest ORF, the corresponding translation is given. Is this correct?
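The manual steps in a) and b) and the six-frame scan in d) can be sketched in a few lines of Python. This is only a minimal sketch and not how the NCBI ORF finder is implemented: it assumes the standard genetic code, accepts only ATG as start codon and only reports ORFs that end in a stop codon. The names CODON_TABLE, translate and find_orfs are made up for this illustration, and the toy sequence is invented; the HBA2 fragment is the 5' end of the mRNA shown in a).

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering of the codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def translate(seq):
    """Translate all complete codons of seq; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq):
    """Return (strand, frame, start, end, protein) for every ATG...stop ORF."""
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i          # the first ATG in this frame opens an ORF
                elif start is not None and codon in STOPS:
                    orfs.append((strand, frame + 1, start, i + 3,
                                 translate(s[start:i])))
                    start = None       # continue looking for the next ORF
    return orfs


# 5' end of the HBA2 mRNA from exercise 1 (5' UTR followed by the first codons).
hba2_5prime = ("ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACC"
               "ATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGG")
start = hba2_5prime.find("ATG")                          # 0-based position
first_codons = translate(hba2_5prime[start:start + 33])  # start codon + 10 codons
print(start, first_codons)

toy = "CCATGAAATTTTAGGG"   # invented sequence containing one complete ORF
longest = max(find_orfs(toy), key=lambda orf: len(orf[4]))
print(longest)
```

Coordinates on the "-" strand refer to the reverse-complemented sequence. Taking the longest ORF is a heuristic, as the exercise points out: it is "most probably" the translated frame, which is why the result should be checked against the GenBank annotation.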
Exercise 2: Determination of the ORF in the Hb Constant Spring.
In this exercise you will learn how a mutation can change the ORF of a gene and, consequently, its gene product. In the Constant Spring (CS) variant of hemoglobin, named for the community in Jamaica where it was first discovered, the alpha 2 chains have 173 amino acids rather than the normal 142. This may reflect a chain termination mutation in the alpha 2 chain (this mutation is further discussed in OMIM 141850).
a) Can you confirm this mutation through inspection of the hemoglobin CS mRNA sequence from the course web-site? You can manually compare the CS and normal HBA2 sequences, or is there a more efficient way to do this? Which nucleotide was mutated? What is the effect of this mutation? Note that the sequence with this particular mutation is not in the GenBank database. Once you have confirmed the nucleotide change, use the NCBI ORF finder to identify the new ORF and gene product.
Original sequence:
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACC
ATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGG
TGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGG
CTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCCCTGACCAACGCCGTGGCGCACGTGGACGACATGC
CCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCC
ACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCC
TGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAA
GCTGGAGCCTCGGTAGCCGTTCCTCCTGCCCGATGGGCCTCCCAACGGGCCCTCCTCCCCTCCTTGCACCGGCCCTTC
CTGGTCTTTGAATAAAGTCTGAGTGGGCGGC
Constant Spring sequence (TAA in the previous sequence is mutated to CAA, which is not a stop codon; the next stop codon is TAA):
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACC
ATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGG
TGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGG
CTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCCCTGACCAACGCCGTGGCGCACGTGGACGACATGC
CCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCC
ACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCC
TGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTCAAGCTGGAGCCTCGGTAGCCGTTCCTCCTGCCCGATGG
GCCTCCCAACGGGCCCTCCTCCCCTCCTTGCACCGGCCCTTCCTGGTCTTTGAA
TAAAGTCTGAGTGGGCGGC
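The comparison of the two sequences and the effect of the mutation on the ORF can also be checked with a short script. The sketch below is an illustration under stated assumptions, not part of the exercise: it uses only the in-frame 3' part of the normal HBA2 mRNA shown above, builds the Constant Spring version by changing the T of the TAA stop codon into a C (as described above), and assumes the standard genetic code. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering of the codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_to_stop(seq):
    """Translate codon by codon and stop at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def differences(a, b):
    """0-based positions at which two equal-length sequences differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# In-frame 3' part of the normal HBA2 mRNA: last five codons, stop codon, 3' UTR.
normal_tail = ("ACCTCCAAATACCGTTAA"
               "GCTGGAGCCTCGGTAGCCGTTCCTCCTGCCCGATGGGCCTCCCAACGGGCCCTCCTCCCCTCCTTGCACCGGCCCTTC"
               "CTGGTCTTTGAATAAAGTCTGAGTGGGCGGC")
# Same region in Constant Spring: the T of the TAA stop codon is replaced by C.
cs_tail = normal_tail[:15] + "C" + normal_tail[16:]

print(differences(normal_tail, cs_tail))   # position and bases of the mutation
print(translate_to_stop(normal_tail))      # normal C-terminus
print(translate_to_stop(cs_tail))          # read-through into the 3' UTR
```

The Constant Spring translation is 31 residues longer than the normal one, which agrees with the 173 versus 142 amino acids mentioned in the exercise.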
Exercise 3: Sickle Cell disorder (HBS)
This is not a gene prediction exercise but, like the previous exercise, it shows you how to use BLAST to compare two mRNA sequences to identify non-synonymous changes. Therefore, you may prefer to do the other exercises first!
Sickle Cell anemia is a group of inherited red blood cell disorders. Sickle hemoglobin (HBS) differs from normal HBB by a single amino acid: valine replaces glutamate at position 6 on the surface of the beta chain.
a) To determine the corresponding mutation in the mRNA sequence you can use BLAST. Go to the BLAST site and select 'Blast two sequences'. Copy/paste the mRNA sequences for Hemoglobin beta chain (HBB, NM_000518) and Sickle beta hemoglobin (HBS, M25113) in the input fields and submit your Blast job. Can you identify the mutation from manual inspection of the aligned sequences?
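Once two coding sequences are aligned, deciding whether a nucleotide change is synonymous or non-synonymous is a codon-by-codon comparison. The sketch below illustrates that step only. The two short fragments are typed in for illustration and are an assumption (the start of the beta globin coding sequence with and without the sickle mutation); for the exercise, take the real sequences from NM_000518 and M25113. The function name codon_changes is hypothetical.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering of the codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_changes(cds_a, cds_b):
    """Compare two aligned, in-frame coding sequences codon by codon."""
    changes = []
    for i in range(0, min(len(cds_a), len(cds_b)) - 2, 3):
        a, b = cds_a[i:i + 3], cds_b[i:i + 3]
        if a != b:
            aa_a, aa_b = CODON_TABLE[a], CODON_TABLE[b]
            kind = "synonymous" if aa_a == aa_b else "non-synonymous"
            changes.append((i // 3 + 1, a, b, aa_a, aa_b, kind))
    return changes


# Illustrative fragments (assumption): first ten codons, starting at ATG.
hbb_start = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
hbs_start = "ATGGTGCATCTGACTCCTGTGGAGAAGTCT"

print(codon_changes(hbb_start, hbs_start))
```

Codon numbers here count the start codon as codon 1, so the change is reported in codon 7; this corresponds to position 6 of the beta chain when the initiator methionine is not counted.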
b) To do at home: To understand how the nucleotide mutation is responsible for the sickling of cells, go to the hemoglobin tutorial and select 'Sickle Hemoglobin' (www.umass.edu/microbio/chime/hemoglob/2frmcont.htm). Next, walk through the tutorial by clicking on the buttons in the right frame. Note that you need the CHIME plugin (www.mdl.com/chime) for this exercise, so you will probably have to do this exercise at home.
Exercise 4. Determining sequence composition and signals.
The EMBOSS toolbox is very useful for sequence analysis (in gene finding). In the next few exercises we will see some examples of EMBOSS programs that may help you to identify some features of sequences. Use the EMBOSS toolbox (http://emboss.bioinformatics.nl). Use the HBA sequence (J00153) (unless otherwise specified) in combination with the following programs. Have a quick look at the categories of available programs.
a) CompSeq: calculate dinucleotide frequencies. Does the GC dinucleotide have a high frequency of occurrence? What does this tell you? Hexamer sequences are more predictive for genes. CompSeq also allows you to calculate the frequencies of all observed hexamers. Give it a try. Remark: don't try to interpret all these values. Just remember that there is a tool to calculate all these frequencies, if you ever need them.
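What a word-frequency count of this kind boils down to can be shown in a few lines. The sketch below is not the CompSeq implementation; it assumes overlapping windows that shift by one nucleotide, and the function name kmer_frequencies is made up for this illustration. Set k=2 for dinucleotides and k=6 for hexamers.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of every observed word of length k (overlapping windows)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


freqs = kmer_frequencies("ACGCG", 2)   # windows: AC, CG, GC, CG
print(freqs)
```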
b) CpgPlot: identify CpG islands. CpG islands typically occur at or near the transcription start site of genes, particularly housekeeping genes, in vertebrates. How many CpG islands are identified?
CpG refers to a C nucleotide immediately followed by a G. The 'p' in 'CpG' refers to the phosphate group linking the two bases. Detection of regions of genomic sequences that are rich in the CpG pattern is important because such regions are resistant to methylation and tend to be associated with genes which are frequently switched on. Regions rich in the CpG pattern are known as CpG islands. CpG islands are identified according to the lower plot. For information about the algorithm see the manual and the corresponding publication.
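The two quantities usually plotted for CpG island detection, the GC percentage and the observed/expected CpG ratio per window, can be computed as sketched below. This is a simplified illustration and not the CpgPlot algorithm: the window size, the step and the function names are assumptions, and the thresholds applied afterwards (commonly a GC content above 50% and an observed/expected ratio above 0.6 over at least 200 bp) should be taken from the manual of the tool you use.

```python
def cpg_stats(window):
    """Return (GC percentage, observed/expected CpG ratio) for one window."""
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")                # a C immediately followed by a G
    gc_percent = 100.0 * (c + g) / n
    # Expected number of CpG if C and G were placed independently: c * g / n
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_percent, obs_exp


def scan(seq, window=100, step=1):
    """Slide a window over seq and report (start, GC%, obs/exp) for each position."""
    return [(i,) + cpg_stats(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


print(cpg_stats("CGCGCGCG"))
print(scan("CGCGCGCGAAAAAAAA", window=8, step=8))
```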
c) Cusp: determine codon usage. Use NM_000518. Codon bias may help to identify coding regions. How can this program be of help with that? Which codon is used most? Which is used with the lowest frequency? Also have a look at Syco. Run this program on the same sequence.
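A codon usage table is essentially a count of the in-frame triplets of a coding sequence. The sketch below shows the idea only; it is not the Cusp implementation, the function name codon_usage is hypothetical, and the input is assumed to be a coding sequence that starts at the first position of the start codon. The example sequence is invented.

```python
from collections import Counter


def codon_usage(cds):
    """Return {codon: (count, frequency per 1000 codons)} for an in-frame CDS."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: (n, 1000.0 * n / total) for codon, n in counts.items()}


usage = codon_usage("ATGGTGGTGTAA")    # invented CDS: ATG GTG GTG TAA
most_used = max(usage, key=lambda codon: usage[codon][0])
print(usage)
print(most_used)
```

Running this on an mRNA rather than on its coding sequence would count triplets of the untranslated regions as well, so extract the CDS first (for example with the coordinates from the GenBank annotation).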
d) Dreg: regular expressions. Use NM_000518. Use Dreg to identify polyA signals in the genomic sequence. What are the major polyA signals? What do they indicate? How many signals did you identify? Did you expect this?
e) GetOrf: determine open reading frame (use NM_000517). Which open reading frame is most likely to be correct? Why?
Exercise 5: Gene prediction with GeneMark.
In this exercise, you will try to identify genes in a small part of the Heliobacillus mobilis genome. GeneMark can check for the presence of the Shine-Dalgarno sequence (ribosomal binding site) to detect the start of a gene with higher confidence. GeneMark can only include the consensus sequence for Escherichia coli, which is AAGGAG and is located within about 10 nucleotides upstream of the start codon. We will use this consensus sequence for our gene prediction.
a) Go to the GeneMark web-site (use the 'GeneMark' program, not 'GeneMark.hmm'). The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of protein-coding DNA sequence as well as homogeneous Markov chain models of non-coding DNA.
b) Copy/paste the fasta sequence file of Heliobacillus mobilis in the input field and select Bacillus subtilis as "Species" (the closest organism). Why is this necessary? Select the option to plot the ORFs and start/stop codons in a PDF file. Use the alternate genetic code (Eukaryote) to ensure that only ATG is seen as start codon (otherwise alternative start codons are considered). Leave other options as default and start the GeneMark program. Do you understand the output? Also have a look at the Help page of GeneMark. The output of GeneMark includes ORFs predicted as coding sequence (CDS; stop codons corresponding to alternative start codons) and 'regions of interest' (regions from start to stop codon). The PDF output for the direct strand should look like Figure 1. Do you understand what is shown?
c) Copy the GeneMark prediction results from the screen into e.g., Microsoft WordPad.
d) The output of GeneMark includes the start and stop positions of the identified ORFs. In order to inspect the input sequence at these positions (e.g., to determine if a ribosomal binding site is present), the fasta format is very inconvenient. Therefore, you can transform the fasta sequence to another format that includes nucleotide positions. Convert the original FASTA sequence file into GenBank format using the ReadSeq program. Do you understand why this is a more convenient format?
e) We will not use the GenBank format to inspect the sequence for ribosomal binding sites (RBS). Instead we will use EMBOSS again. The ribosomal binding site of E. coli is defined in Figure 2. From the sequence logo we see that the RBS is quite variable. Let's assume that in our case the first position is A, C or G. Let's also assume that the next four nucleotides are always AGGA, and that the last nucleotide is a G or an A. Thus the consensus can be represented by (A or C or G)AGGA(G or A). Let us also assume that the distance between the RBS and ATG is between 4 and 12 nucleotides. You can now use the EMBOSS dreg program to search for RBS sequences that are close to the ATG site. Look at the help page of dreg to construct a 'regular expression' that helps you to identify these instances. Do you find a consensus sequence in front of every predicted ORF? Why? Why not?
Figure 2. The ribosomal binding site for E. coli represented by a sequence logo. The variation in distance between the RBS consensus sequence and the translation start site (ATG) is shown in the right histogram (and represented by the pink crosses in the sequence logo).
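The assumptions made in e) translate directly into a regular expression: one of A, C or G, then AGGA, then G or A, then a spacer of 4 to 12 nucleotides, then ATG. The sketch below tests such a pattern with Python's re module; the exact syntax accepted by dreg should be checked on its help page. The function name find_rbs and the toy sequence are invented for this illustration.

```python
import re

# (A or C or G)AGGA(G or A), a spacer of 4-12 nucleotides, then the start codon.
RBS_NEAR_START = re.compile(r"([ACG]AGGA[GA])([ACGT]{4,12})ATG")


def find_rbs(seq):
    """Return (1-based RBS position, RBS sequence, spacer length) for each hit."""
    return [(m.start() + 1, m.group(1), len(m.group(2)))
            for m in RBS_NEAR_START.finditer(seq)]


toy = "TTTAAGGAGGTTTTTTATGAAA"   # invented: RBS, 7-nt spacer, ATG
print(find_rbs(toy))
```

Note that finditer reports non-overlapping matches only, and that the greedy spacer pairs each RBS with the most distant ATG within range; for a first inspection of the predicted start positions this is usually sufficient.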
f) Heliobacillus mobilis is already in GenBank. Thus in this case we can easily check the results of GeneMark. Use BLAST at NCBI to blast the sequence against the Heliobacillus mobilis genome. You should get an alignment as is shown in Figure 3. Click on the accession code EU022681 to retrieve this sequence and annotation from GenBank. Switch to the graphical output (Figure 4). How many ORFs are annotated in GenBank? How many ORFs were identified by GeneMark?
Figure 3. Alignment of the Heliobacillus mobilis sequence against the H. mobilis genome. The long red bar represents the Heliobacillus mobilis heme d1 biosynthesis gene cluster (accession code: EU022681).
Figure 4. Graphical representation of our Heliobacillus mobilis sequence. The green bars indicate the ORFs. The box (top left) indicates the region that is expanded in the lower panel (red bars).
g) To have a better understanding of the GeneMark output, retrieve the sequence of the hep2 ORF: move/click your mouse over the red bar for the hep2 gene and you will see an option to retrieve the nucleotide sequence for hep2. Get the fasta format of the hep2 sequence and use the NCBI ORF Finder to inspect the open reading frames. Once you have the output of the ORF Finder, click on 'six frames' to show all start and stop codons. Then click on reading frame 1 to show the translated protein and nucleotide sequence (Figure 5). How does this compare to the first ORF that was found for frame 2 by GeneMark? (Also inspect the 'start probability', which indicates if an ATG is truly a start codon.)
Exercise 6: Promoter prediction.
In this exercise we will use the BPROM program to identify the promoter region of the hep2 gene of Heliobacillus mobilis.
a) Retrieve the sequence containing the hep2 gene from GenBank (accession code EU022681). Within the GenBank record click on 'gene' to highlight the gene sequence. Copy/paste this gene sequence + 120 bp upstream to e.g., notepad.
b) Use the Readseq program to convert the GenBank format to fasta (or plain text).
c) The BPROM algorithm predicts potential transcription start positions of bacterial genes regulated by sigma70 promoters (major E. coli promoter class). The σ70 (RpoD) factor is the "housekeeping" sigma factor that transcribes most genes in growing cells. A linear discriminant function (LDF) is used to predict these transcription start sites; it combines characteristics describing functional motifs and the oligonucleotide composition of these sites.
Go to the BPROM website and use your sequence as input for promoter prediction. The result is shown in Figure 6.
Figure 6. Output of the BPROM program for the hep2 gene of Heliobacillus mobilis
d) BPROM predicts the Pribnow box (-10) and the Gilbert box (-35). See Figure 7. Does the program predict the correct promoter sequences?
Figure 7. The consensus sequence and position of the Pribnow and Gilbert box.
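To judge whether a predicted box resembles the consensus, it helps to count how many positions of a candidate hexamer agree with it. The sketch below does only that and is not the BPROM method, which uses a linear discriminant function. It assumes the textbook E. coli sigma70 consensus hexamers TATAAT (-10) and TTGACA (-35), since the figure is not reproduced here; the function name consensus_hits and the toy promoter are invented.

```python
PRIBNOW_BOX = "TATAAT"   # -10 consensus (assumed textbook value)
GILBERT_BOX = "TTGACA"   # -35 consensus (assumed textbook value)


def consensus_hits(seq, consensus, min_matches):
    """Return (1-based position, site, matches) where a window agrees with the
    consensus in at least min_matches positions."""
    k = len(consensus)
    hits = []
    for i in range(len(seq) - k + 1):
        site = seq[i:i + k]
        matches = sum(a == b for a, b in zip(site, consensus))
        if matches >= min_matches:
            hits.append((i + 1, site, matches))
    return hits


# Invented promoter: -35 box, 17-nt spacer, -10 box.
toy = "GG" + GILBERT_BOX + "C" * 17 + PRIBNOW_BOX + "GG"
print(consensus_hits(toy, GILBERT_BOX, 6))
print(consensus_hits(toy, PRIBNOW_BOX, 6))
```

Lowering min_matches to 4 or 5 shows how quickly near-matches appear by chance, which is one reason why prediction programs combine motif scores with other sequence characteristics.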