Computational Biology, Part 8: Protein Coding Regions

Robert F. Murphy Copyright 1996-2009. All rights reserved.
Sequence Analysis Tasks

Finding protein coding regions
Goal
Given a DNA or RNA sequence, find those regions that code for protein(s)
Direct approach: Look for stretches that can be interpreted as protein using the genetic code
Statistical approaches: Use other knowledge about likely coding regions
Direct Approach
Genetic codes
The set of tRNAs that an organism possesses defines its genetic code(s)
The universal genetic code is common to nearly all organisms
Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codes
More than one tRNA may be present for a given codon, allowing more than one possible translation product
Genetic codes
Differences among genetic codes occur mainly in start and stop codons
Alternate initiation codons: codons that encode amino acids but can also be used to start translation (GUG, UUG, AUA, UUA, CUG)
Suppressor tRNA codons: codons that normally stop translation but are translated as amino acids (UAG, UGA, UAA)
Genetic codes
[Figure: genetic code tables, not reproduced]
Note additional start codons: UUA, UUG, CUG
Note conversion of stop codon UGA (opal) to Trp
Reading Frames
Since nucleotide sequences are read three bases at a time, there are three possible frames in which a given nucleotide sequence can be read (in the forward direction)
Taking the complement of the sequence and reading in the reverse direction gives three more reading frames
Reading frames
    TTC TCA TGT TTG ACA GCT
RF1 Phe Ser Cys Leu Thr Ala>
RF2 Ser His Val *** Gln Leu>
RF3 Leu Met Phe Asp Ser>
    AAG AGT ACA AAC TGT CGA
RF4 <Glu *** Thr Gln Cys Ser
RF5 <Glu His Lys Val Ala
RF6 <Arg Met Asn Ser Leu
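The six reading frames above can be reproduced with a short script. This is a minimal sketch, assuming the standard genetic code and a DNA alphabet of A, C, G, T only; the names `CODON_TABLE`, `translate` and `six_frames` are illustrative and not from the slides. It translates complete codons only, so the trailing Leu shown for RF2 (read from the partial codon CT) is not produced, and RF4 to RF6 are written 5' to 3' along the reverse strand, i.e. in the opposite order to the right-to-left display above.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return one-letter translations for RF1-RF3 (forward) and RF4-RF6 (reverse)."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"RF{offset + 1}"] = translate(seq[offset:])
        frames[f"RF{offset + 4}"] = translate(rc[offset:])
    return frames


frames = six_frames("TTCTCATGTTTGACAGCT")
for name in sorted(frames):
    print(name, frames[name])
```

One-letter codes are used for compactness; `F S C L T A` in RF1 corresponds to Phe Ser Cys Leu Thr Ala.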
Open Reading Frames (ORF)

Concept: Region of DNA or RNA sequence that could be translated into a peptide sequence (open refers to absence of stop codons)
Prerequisite: A specific genetic code
Definition: (start codon) (amino acid coding codon)^n (stop codon), where n is the number of amino acid coding codons between the start and the stop
Note: Not all ORFs are actually used
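The definition can be checked directly in code. The sketch below is an illustration only: it assumes the standard stop codons and ATG as the sole start codon (alternate initiation codons would need to be added to `START_CODONS`), and the function name `is_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard code, DNA alphabet
START_CODONS = {"ATG"}                # assumption: no alternate initiation codons


def is_orf(seq):
    """True if seq is (start codon)(coding codon)^n(stop codon) with n >= 0."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    has_internal_stop = any(c in STOP_CODONS for c in codons[1:-1])
    return (codons[0] in START_CODONS
            and codons[-1] in STOP_CODONS
            and not has_internal_stop)
```

For example, `is_orf("ATGAAATAA")` holds, while a sequence with a stop codon in the middle or a length that is not a multiple of three does not qualify.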
Open Reading Frames
Click boxes for List ORFS and ORF map
Check reading frame: mod(696,3)=0 -> RF3
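The frame check is modular arithmetic on a position. A minimal sketch, assuming positions are 1-based and that the position given is the first base of a codon on the forward strand (the slide does not say which position 696 refers to, so this is an assumption); `reading_frame` is an illustrative name.

```python
def reading_frame(position):
    """Map a 1-based first-base position to forward reading frame 1, 2 or 3."""
    remainder = position % 3
    # Positions 1, 4, 7, ... give remainder 1 (RF1); 2, 5, 8, ... give 2 (RF2);
    # 3, 6, 9, ... give remainder 0, which is RF3.
    return 3 if remainder == 0 else remainder
```

With this convention `reading_frame(696)` is 3, matching mod(696,3)=0 -> RF3.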
EMBOSS plotorf
Splicing ORFs
For eukaryotes, which have interrupted genes, ORFs in different reading frames may be spliced together to generate the final product
ORFs from forward and reverse directions cannot be combined
Block Diagram for Search for ORFs

Inputs: Sequence to be searched; Genetic code; Both strands?; Ends start/stop?
Search Engine
Output: List of ORF positions
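The block diagram maps naturally onto a function whose parameters are the inputs and whose return value is the list of ORF positions. The following is a sketch under stated assumptions, not the search engine of any particular package: "Ends start/stop?" is interpreted here as "must an ORF begin with a start codon?" (`require_start`), the genetic code is reduced to its start and stop codon sets, and reverse-strand coordinates are reported along the reverse-complemented sequence. All names are hypothetical.

```python
STOPS = frozenset({"TAA", "TAG", "TGA"})
STARTS = frozenset({"ATG"})


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def _scan(seq, first_frame, strand, require_start, starts, stops):
    """Scan the three frames of one strand; return (strand, frame, start, end)."""
    found = []
    for offset in range(3):
        # Without a start-codon requirement, a frame is open from its first base.
        open_at = None if require_start else offset
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in stops:
                if open_at is not None and i > open_at:
                    # 1-based inclusive coordinates; the stop codon is included.
                    found.append((strand, first_frame + offset, open_at + 1, i + 3))
                open_at = None if require_start else i + 3
            elif open_at is None and codon in starts:
                open_at = i
    return found


def find_orfs(seq, both_strands=True, require_start=True,
              starts=STARTS, stops=STOPS):
    """Return ORFs as (strand, reading frame 1-6, start, end) tuples."""
    seq = seq.upper()
    orfs = _scan(seq, 1, "+", require_start, starts, stops)
    if both_strands:
        orfs += _scan(reverse_complement(seq), 4, "-", require_start, starts, stops)
    return orfs
```

ORFs that run off the end of the sequence without a stop codon are not reported in this sketch; a real tool would usually offer that as a further option.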
Statistical Approaches
Calculation Windows
Many sequence analyses require calculating some statistic over a long sequence looking for regions where the statistic is unusually high or low
To do this, we define a window size to be the width of the region over which each calculation is to be done
Example: %AT
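The %AT example can be written as a sliding-window calculation. A minimal sketch; the `step` parameter and the choice to report each window by its 1-based start position are assumptions made for illustration.

```python
def percent_at(seq, window, step=1):
    """Return (1-based window start, %AT) for each full window along seq."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        at_count = sum(base in "AT" for base in chunk)
        results.append((start + 1, 100.0 * at_count / window))
    return results
```

Larger windows smooth the profile but blur the boundaries of the regions of interest, so the window size is usually tuned to the expected feature length.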
Base Composition Bias

For a protein with a roughly normal amino acid composition, the first 2 positions of all codons will be about 50% GC
If an organism has a high GC content overall, the third position of all codons must be mostly GC
Useful for prokaryotes
Not useful for eukaryotes due to large amount of noncoding DNA
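Position-specific GC content is straightforward to compute for a candidate ORF. A minimal sketch, assuming the input is already in frame (its first base is codon position 1); the function name is illustrative.

```python
def gc_by_codon_position(orf):
    """Return the GC fraction at codon positions 1, 2 and 3 of an in-frame ORF."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3   # ignore any trailing partial codon
    if usable < 3:
        raise ValueError("need at least one complete codon")
    fractions = []
    for pos in range(3):
        bases = orf[pos:usable:3]      # every third base starting at this position
        fractions.append(sum(b in "GC" for b in bases) / len(bases))
    return tuple(fractions)
```

In a GC-rich organism, a true coding frame would be expected to show a much higher value in the third element than in the first two.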
Fickett's statistic
Also called TestCode analysis
Looks for asymmetry of base composition
Strong statistical basis for calculations
Method:
For each window on the sequence, calculate the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9...
Calculate statistic from resulting three numbers
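The counting step of the method can be sketched as follows. This covers only the per-phase base composition and a simple asymmetry ratio, max/(min + 1), which is used here as an assumed stand-in for the position parameter; the probability lookup tables and weights of the full TestCode method are not reproduced. Function names are illustrative.

```python
from collections import Counter


def position_composition(window):
    """Count bases at positions 1,4,7..., 2,5,8... and 3,6,9... of the window."""
    window = window.upper()
    return [Counter(window[phase::3]) for phase in range(3)]


def position_asymmetry(window):
    """For each base, ratio of its largest to (smallest + 1) count across phases."""
    counts = position_composition(window)
    return {
        base: max(c[base] for c in counts) / (min(c[base] for c in counts) + 1)
        for base in "ACGT"
    }
```

A base that is evenly spread over the three phases gives a low ratio, whereas a base concentrated in one phase, as is typical of coding sequence, gives a high one.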
Codon Bias (Codon Preference)
Principle
Different levels of expression of different tRNAs for a given amino acid lead to pressure on coding regions to conform to the preferred codon usage
Non-coding regions, on the other hand, feel no selective pressure and can drift
Starting point: Table of observed codon frequencies in known genes from a given organism
Best to use highly expressed genes
Method
Calculate coding potential within a moving window for all three reading frames
Look for ORFs with high scores
Works best for prokaryotes or unicellular eukaryotes because for multicellular eukaryotes, different pools of tRNA may be expressed at different stages of development in different tissues
May have to group genes into sets
Codon bias can also be used to estimate protein expression level
Portion of D. melanogaster codon frequency table

Amino Acid  Codon  Number  Freq/1000  Fraction
Gly         GGG    11      2.60       0.03
Gly         GGA    92      21.74      0.28
Gly         GGT    86      20.33      0.26
Gly         GGC    142     33.56      0.43
Glu         GAG    212     50.11      0.75
Glu         GAA    69      16.31      0.25
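The Fraction column follows from the Number column: each count is divided by the total count for all codons of the same amino acid. A small sketch using the counts from the table; the variable and function names are illustrative. Freq/1000 is not recomputed here because the total number of codons in the data set is not given.

```python
# Codon counts taken from the D. melanogaster table, grouped by amino acid.
COUNTS = {
    "Gly": {"GGG": 11, "GGA": 92, "GGT": 86, "GGC": 142},
    "Glu": {"GAG": 212, "GAA": 69},
}


def synonymous_fractions(counts):
    """Return codon -> fraction of its amino acid's total codon count."""
    fractions = {}
    for codons in counts.values():
        total = sum(codons.values())
        for codon, number in codons.items():
            fractions[codon] = number / total
    return fractions


for codon, fraction in synonymous_fractions(COUNTS).items():
    print(codon, round(fraction, 2))
```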
Comparison of Glycine codon frequencies

Codon  E. coli  D. melanogaster
GGG    0.02     0.03
GGA    0.00     0.28
GGT    0.59     0.26
GGC    0.38     0.43
Illustration of Codon Bias Plots
Use Entrez via MacVector to get sequence of lexA
Under Database select Internet Entrez Search
Select gene=lexA AND organism=Escherichia
Pick one (e.g., region from 89.2 to 92.8)
Under Analyze select Codon Preference Plots
Choose Escherichia coli codon bias file
Choose gene region corresponding to lacZ
Click on Staden codon bias and Gribskov codon bias
Codon Preference Algorithms
The Staden method (from Staden & McLachlan, 1982) uses a codon usage table directly in identifying coding regions. The codon usage table is normalized so that the sum of all 64 codons is 1. The usages for each codon in each reading frame in each window are multiplied together and normalized by the sum of the probabilities in all three positions to generate a relative coding probability.
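The calculation described above can be sketched as follows. This is an illustration of the description on this slide rather than a reimplementation of the published program: the usage table is a made-up toy (every codon gets a count of 1, with three codons arbitrarily boosted), log sums are used in place of raw products to avoid underflow, and every frame is scored over the same number of codons. All names are hypothetical.

```python
import math
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def normalize_usage(counts):
    """Scale codon counts so that all 64 values sum to 1."""
    total = sum(counts[codon] for codon in CODONS)
    return {codon: counts[codon] / total for codon in CODONS}


def staden_frame_probabilities(window, usage):
    """Relative coding probability of each of the three frames of one window."""
    window = window.upper()
    n_codons = (len(window) - 2) // 3          # same codon count in every frame
    log_products = []
    for frame in range(3):
        codons = [window[frame + 3 * k:frame + 3 * k + 3] for k in range(n_codons)]
        log_products.append(sum(math.log(usage[c]) for c in codons))
    peak = max(log_products)                   # subtract the peak before exp()
    weights = [math.exp(lp - peak) for lp in log_products]
    total = sum(weights)
    return [w / total for w in weights]


# Toy usage table (invented for illustration, not real organism data).
toy_counts = {codon: 1 for codon in CODONS}
toy_counts.update({"GCT": 20, "GAA": 20, "AAA": 20})
toy_usage = normalize_usage(toy_counts)
probabilities = staden_frame_probabilities("GCTGAAAAAGCTGAAAAAGC", toy_usage)
```

In the toy window the first frame consists entirely of boosted codons, so it receives almost all of the relative probability. The usage table must have no zero entries, which is why the toy table starts every codon at a count of 1.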
Codon Preference Algorithms
The Gribskov method uses a codon usage table normalized so that the values for the alternative (synonymous) codons of each amino acid sum to 1. The values for each codon for each reading frame in each window are multiplied together and normalized by the random probability expected for that codon given the mononucleotide frequencies of the target sequence. It is the most commonly used method.
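A sketch of this calculation is given below. It is an interpretation of the description on this slide, with several assumptions: the random expectation for a codon is taken as the product of its three base frequencies, renormalized over the synonymous codons of the same amino acid; the window score is reported as the geometric mean of the preferred/random ratios; the standard genetic code is used; and the codon counts are an invented toy table. All names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def per_amino_acid(values):
    """Normalize codon values so synonymous codons sum to 1."""
    totals = defaultdict(float)
    for codon, value in values.items():
        totals[CODON_TABLE[codon]] += value
    return {codon: value / totals[CODON_TABLE[codon]]
            for codon, value in values.items()}


def random_expectation(sequence):
    """Expected synonymous fractions from mononucleotide frequencies.

    Assumes the sequence contains all four bases, so no expectation is zero.
    """
    sequence = sequence.upper()
    freq = {b: sequence.count(b) / len(sequence) for b in BASES}
    raw = {c: freq[c[0]] * freq[c[1]] * freq[c[2]] for c in CODON_TABLE}
    return per_amino_acid(raw)


def gribskov_score(window, frame, preferred, expected):
    """Geometric mean of preferred/expected ratios for one frame of a window."""
    window = window.upper()
    codons = [window[i:i + 3] for i in range(frame, len(window) - 2, 3)]
    logs = [math.log(preferred[c] / expected[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))


# Toy codon counts (invented for illustration): GGC and GAG strongly preferred.
toy_counts = {codon: 1 for codon in CODON_TABLE}
toy_counts.update({"GGC": 50, "GAG": 50})
preferred = per_amino_acid(toy_counts)
target = "GGCGAGGGCGAGGGCGAGAT"
expected = random_expectation(target)
scores = [gribskov_score(target, frame, preferred, expected) for frame in range(3)]
```

Scores above 1 indicate codon choices more typical of the usage table than of random sequence with the same base composition; in the toy example the first frame scores above 1 and the second frame below it.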
Plot from syco
Summary
Translation of nucleic acid sequences into hypothetical protein sequences requires a genetic code
Translation can occur in three forward and three reverse reading frames
Open reading frames are regions that can be translated without encountering a stop codon
Summary
The likelihood that a particular open reading frame is in fact a coding region (actually made into protein) can be estimated using the base composition of the third codon position or codon preference tables
This can be used to scan long sequences for possible coding regions
