All About Dna
DNA
From Wikipedia, the free encyclopedia
For a non-technical introduction to the topic, see Introduction to genetics. For other uses, see DNA
(disambiguation).
The structure of the DNA double helix. The atoms in the structure are colour-coded by element and the detailed structures of two
base pairs are shown in the bottom right.
DNA is well-suited for biological information storage. The DNA backbone is resistant to cleavage, and both strands
of the double-stranded structure store the same biological information. Biological information is replicated as the two
strands are separated. A significant portion of DNA (more than 98% for humans) is non-coding, meaning that these
sections do not serve as patterns for protein sequences.
The two strands of DNA run in opposite directions to each other and are therefore anti-parallel. Attached to each
sugar is one of four types of nucleobases (informally, bases). It is the sequence of these four nucleobases along the
backbone that encodes biological information. Under the genetic code, RNA strands are translated to specify the
sequence of amino acids within proteins. These RNA strands are initially created using DNA strands as a template
in a process called transcription.
Within cells, DNA is organized into long structures called chromosomes. During cell division these chromosomes
are duplicated in the process of DNA replication, providing each cell its own complete set of
chromosomes. Eukaryotic organisms (animals, plants, fungi, and protists) store most of their DNA inside the cell
nucleus and some of their DNA in organelles, such as mitochondria or chloroplasts.[1] In
contrast, prokaryotes (bacteria and archaea) store their DNA only in the cytoplasm. Within the
chromosomes, chromatin proteins such as histones compact and organize DNA. These compact structures guide
the interactions between DNA and other proteins, helping control which parts of the DNA are transcribed.
First isolated by Friedrich Miescher in 1869 and with its molecular structure first identified by James
Watson and Francis Crick in 1953, DNA is used by researchers as a molecular tool to explore physical laws and
theories, such as the ergodic theorem and the theory of elasticity. The unique material properties of DNA have
made it an attractive molecule for material scientists and engineers interested in micro- and nano-fabrication.
Among notable advances in this field are DNA origami and DNA-based hybrid materials.[2]
The obsolete synonym "desoxyribonucleic acid" may occasionally be encountered, for example, in pre-1953
genetics.
Contents
1 Properties
o 1.1 Nucleobase classification
o 1.2 Grooves
o 1.3 Base pairing
o 1.4 Sense and antisense
o 1.5 Supercoiling
o 1.6 Alternate DNA structures
o 1.7 Alternative DNA chemistry
o 1.8 Quadruplex structures
o 1.9 Branched DNA
2 Chemical modifications and altered DNA packaging
o 2.1 Base modifications and DNA packaging
o 2.2 Damage
3 Biological functions
o 3.1 Genes and genomes
o 3.2 Transcription and translation
o 3.3 Replication
Properties
DNA is a long polymer made from repeating units called nucleotides.[3][4][5] DNA was first identified and isolated by
Miescher in 1869 at the University of Tübingen, a substance he called nuclein, and the double helix structure of
DNA was first discovered in 1953 by Watson and Crick at the University of Cambridge, using experimental data
collected by Rosalind Franklin and Maurice Wilkins. The structure of DNA of all species comprises two helical
chains each coiled round the same axis, and each with a pitch of 34 ångströms (3.4 nanometres) and a radius of
10 ångströms (1.0 nanometre).[6] According to another study, when measured in a particular solution, the DNA chain
measured 22 to 26 ångströms wide (2.2 to 2.6 nanometres), and one nucleotide unit measured 3.3 Å (0.33 nm)
long.[7] Although each individual repeating unit is very small, DNA polymers can be very large molecules containing
millions of nucleotides. For instance, the largest human chromosome, chromosome number 1, consists of
approximately 220 million base pairs[8] and is 85 mm long.
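As a rough plausibility check on these figures, the stretched-out length of a DNA molecule can be estimated by multiplying the number of base pairs by the length of one nucleotide unit. The sketch below assumes the 0.33 nm per nucleotide and 220 million base pairs quoted above; the function and constant names are illustrative only.

```python
NM_PER_BASE_PAIR = 0.33  # length of one nucleotide unit, as quoted in the text


def contour_length_mm(base_pairs, nm_per_bp=NM_PER_BASE_PAIR):
    """Estimate the stretched-out length of a DNA molecule in millimetres."""
    return base_pairs * nm_per_bp * 1e-6  # 1 nm = 1e-6 mm


length = contour_length_mm(220_000_000)
print(f"{length:.1f} mm")
```

With these inputs the estimate comes out at roughly 73 mm, the same order of magnitude as the 85 mm quoted; the exact figure depends on which base-pair count and which length per nucleotide are used.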
In living organisms DNA does not usually exist as a single molecule, but instead as a pair of molecules that are held
tightly together.[9][10] These two long strands entwine like vines, in the shape of a double helix. The nucleotide repeats
contain both the segment of the backbone of the molecule, which holds the chain together, and a nucleobase, which
interacts with the other DNA strand in the helix. A nucleobase linked to a sugar is called a nucleoside and a base
linked to a sugar and one or more phosphate groups is called a nucleotide. A polymer comprising multiple linked
nucleotides (as in DNA) is called a polynucleotide.[11]
The backbone of the DNA strand is made from alternating phosphate and sugar residues.[12] The sugar in DNA is 2-deoxyribose, which is a pentose (five-carbon) sugar. The sugars are joined together by phosphate groups that
form phosphodiester bonds between the third and fifth carbon atoms of adjacent sugar rings. These
asymmetric bonds mean a strand of DNA has a direction. In a double helix the direction of the nucleotides in one
strand is opposite to their direction in the other strand: the strands are antiparallel. The asymmetric ends of DNA
strands are called the 5′ (five prime) and 3′ (three prime) ends, with the 5′ end having a terminal phosphate group
and the 3′ end a terminal hydroxyl group. One major difference between DNA and RNA is the sugar, with the 2-deoxyribose in DNA being replaced by the alternative pentose sugar ribose in RNA.[10]
A section of DNA. The bases lie horizontally between the two spiraling strands.[13] (animated version).
The DNA double helix is stabilized primarily by two forces: hydrogen bonds between nucleotides and base-stacking interactions among aromatic nucleobases.[14] In the aqueous environment of the cell, the conjugated
π bonds of nucleotide bases align perpendicular to the axis of the DNA molecule, minimizing their interaction with
the solvation shell and, therefore, the Gibbs free energy. The four bases found in DNA are adenine (abbreviated
A), cytosine (C), guanine (G) and thymine (T). These four bases are attached to the sugar/phosphate to form the
complete nucleotide, as shown for adenosine monophosphate.
Nucleobase classification
The nucleobases are classified into two types: the purines, A and G, being fused five- and six-membered heterocyclic compounds, and the pyrimidines, the six-membered rings C and T.[10] A fifth pyrimidine
nucleobase, uracil (U), usually takes the place of thymine in RNA and differs from thymine by lacking a methyl
group on its ring. In addition to RNA and DNA a large number of artificial nucleic acid analogues have also been
created to study the properties of nucleic acids, or for use in biotechnology.[15]
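The classification above is small enough to write down as a lookup. The following is a minimal sketch, with hypothetical helper names, that labels a base as a purine or pyrimidine and rewrites a DNA sequence in the RNA alphabet by substituting uracil for thymine.

```python
PURINES = {"A", "G"}            # fused five- and six-membered rings
PYRIMIDINES = {"C", "T", "U"}   # single six-membered ring


def classify_base(base):
    """Return 'purine' or 'pyrimidine' for a one-letter nucleobase code."""
    base = base.upper()
    if base in PURINES:
        return "purine"
    if base in PYRIMIDINES:
        return "pyrimidine"
    raise ValueError(f"unknown nucleobase: {base!r}")


def dna_to_rna_alphabet(sequence):
    """Rewrite a DNA sequence with uracil in place of thymine."""
    return sequence.upper().replace("T", "U")


print(classify_base("G"))
print(dna_to_rna_alphabet("GATTACA"))
```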
Uracil is not usually found in DNA, occurring only as a breakdown product of cytosine. However, in a number of
bacteriophages (Bacillus subtilis bacteriophages PBS1 and PBS2 and Yersinia bacteriophage piR1-37) thymine
has been replaced by uracil.[16] Another phage, Staphylococcal phage S6, has been identified with a genome
where thymine has been replaced by uracil.[17]
Base J (beta-d-glucopyranosyloxymethyluracil), a modified form of uracil, is also found in a number of organisms:
the flagellates Diplonema and Euglena, and all the kinetoplastid genera.[18] Biosynthesis of J occurs in two steps: in
the first step a specific thymidine in DNA is converted into hydroxymethyldeoxyuridine; in the second HOMedU is
glycosylated to form J.[19] Proteins that bind specifically to this base have been identified.[20][21][22] These proteins
appear to be distant relatives of the Tet1 oncogene that is involved in the pathogenesis of acute myeloid
leukemia.[23] J appears to act as a termination signal for RNA polymerase II.[24][25]
Major and minor grooves of DNA. Minor groove is a binding site for the dye Hoechst 33258.
Grooves
Twin helical strands form the DNA backbone. Another double helix may be found tracing the spaces, or grooves,
between the strands. These voids are adjacent to the base pairs and may provide a binding site. As the strands are
not symmetrically located with respect to each other, the grooves are unequally sized. One groove, the major
groove, is 22 Å wide and the other, the minor groove, is 12 Å wide.[26] The width of the major groove means that the
edges of the bases are more accessible in the major groove than in the minor groove. As a result, proteins such
as transcription factors that can bind to specific sequences in double-stranded DNA usually make contact with the
sides of the bases exposed in the major groove.[27] This situation varies in unusual conformations of DNA within the
cell (see below), but the major and minor grooves are always named to reflect the differences in size that would be
seen if the DNA were twisted back into the ordinary B form.
Base pairing
Further information: Base pair
In a DNA double helix, each type of nucleobase on one strand bonds with just one type of nucleobase on the other
strand. This is called complementary base pairing. Here, purines form hydrogen bonds to pyrimidines, with adenine
bonding only to thymine in two hydrogen bonds, and cytosine bonding only to guanine in three hydrogen bonds.
This arrangement of two nucleotides binding together across the double helix is called a base pair. As hydrogen
bonds are not covalent, they can be broken and rejoined relatively easily. The two strands of DNA in a double helix
can therefore be pulled apart like a zipper, either by a mechanical force or high temperature.[28] As a result of this
complementarity, all the information in the double-stranded sequence of a DNA helix is duplicated on each strand,
which is vital in DNA replication. Indeed, this reversible and specific interaction between complementary base pairs
is critical for all the functions of DNA in living organisms.[4]
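Because each base pairs with exactly one partner, either strand is enough to reconstruct the other. A minimal sketch of that idea follows; the function names are illustrative, and sequences are assumed to be written 5′ to 3′ using only the letters A, C, G and T.

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Base-by-base partner of a strand, in the same left-to-right order."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand):
    """Partner strand written 5' to 3' (the strands are antiparallel)."""
    return complement(strand)[::-1]


top = "ATGCCG"
bottom = reverse_complement(top)
print(top, bottom)
# Applying the operation twice recovers the original strand.
print(reverse_complement(bottom) == top)
```

The reversal step reflects the antiparallel arrangement: reading the partner strand in its own 5′ to 3′ direction means reading the complement backwards.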
Top, a GC base pair with three hydrogen bonds. Bottom, an AT base pair with two hydrogen bonds. Non-covalent
hydrogen bonds between the pairs are shown as dashed lines.
The two types of base pairs form different numbers of hydrogen bonds, AT forming two hydrogen bonds, and GC
forming three hydrogen bonds (see figures, right). DNA with high GC-content is more stable than DNA with low GC-content.
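The two-versus-three hydrogen bond rule makes it easy to tally the inter-strand hydrogen bonds in a fully paired duplex from one strand alone. The sketch below uses hypothetical function names and assumes a sequence containing only A, C, G and T.

```python
def gc_content(sequence):
    """Fraction of positions that are G or C (0.0 to 1.0)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def hydrogen_bonds(sequence):
    """Hydrogen bonds in the fully paired duplex: 2 per AT pair, 3 per GC pair."""
    sequence = sequence.upper()
    at_pairs = sequence.count("A") + sequence.count("T")
    gc_pairs = sequence.count("G") + sequence.count("C")
    return 2 * at_pairs + 3 * gc_pairs


for seq in ("TATAAT", "GGCCAATT", "GCGCGC"):
    print(seq, gc_content(seq), hydrogen_bonds(seq))
```

Hydrogen-bond count is only one contribution to stability; base stacking also matters and is not modelled here.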
As noted above, most DNA molecules are actually two polymer strands, bound together in a helical fashion by
noncovalent bonds; this double-stranded structure (dsDNA) is maintained largely by the intrastrand base stacking
interactions, which are strongest for G,C stacks. The two strands can come apart, a process known as melting, to
form two single-stranded DNA (ssDNA) molecules. Melting occurs at high temperature, low salt and high
pH (low pH also melts DNA, but since DNA is unstable due to acid depurination, low pH is rarely used).
The stability of the dsDNA form depends not only on the GC-content (% G,C basepairs) but also on sequence (since
stacking is sequence specific) and also length (longer molecules are more stable). The stability can be measured in
various ways; a common way is the "melting temperature", which is the temperature at which 50% of the ds
molecules are converted to ss molecules; melting temperature is dependent on ionic strength and the concentration
of DNA. As a result, both the percentage of GC base pairs and the overall length of a DNA double helix
determine the strength of the association between the two strands of DNA. Long DNA helices with a high GC-content have stronger-interacting strands, while short helices with high AT content have weaker-interacting
strands.[29] In biology, parts of the DNA double helix that need to separate easily, such as the TATAAT Pribnow
box in some promoters, tend to have a high AT content, making the strands easier to pull apart.[30]
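To make the dependence on length and GC-content concrete, the sketch below uses the Wallace rule, a rule of thumb for short oligonucleotides that adds 2 °C per A or T and 4 °C per G or C. This rule is not taken from the article; it is brought in here purely as an illustration, it ignores ionic strength and DNA concentration (which the text notes do affect melting temperature), and it is not meant for long molecules.

```python
def wallace_tm(sequence):
    """Rough melting temperature in degrees C for a short oligonucleotide.

    Wallace rule (an assumption introduced for illustration):
    Tm = 2 * (A + T) + 4 * (G + C).
    """
    sequence = sequence.upper()
    at = sequence.count("A") + sequence.count("T")
    gc = sequence.count("G") + sequence.count("C")
    return 2 * at + 4 * gc


# AT-rich Pribnow box versus a GC-rich sequence of the same length.
print(wallace_tm("TATAAT"))
print(wallace_tm("GCGCGC"))
```

Even this crude estimate shows the pattern described above: for equal length, the AT-rich sequence is predicted to separate at a much lower temperature than the GC-rich one.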
In the laboratory, the strength of this interaction can be measured by finding the temperature necessary to break the
hydrogen bonds, their melting temperature (also called Tm value). When all the base pairs in a DNA double helix
melt, the strands separate and exist in solution as two entirely independent molecules. These single-stranded DNA
molecules (ssDNA) have no single common shape, but some conformations are more stable than others.[31]
Supercoiling
Further information: DNA supercoil
DNA can be twisted like a rope in a process called DNA supercoiling. With DNA in its "relaxed" state, a strand
usually circles the axis of the double helix once every 10.4 base pairs, but if the DNA is twisted the strands become
more tightly or more loosely wound.[38] If the DNA is twisted in the direction of the helix, this is positive supercoiling,
and the bases are held more tightly together. If they are twisted in the opposite direction, this is negative
supercoiling, and the bases come apart more easily. In nature, most DNA has slight negative supercoiling that is
introduced by enzymes called topoisomerases.[39] These enzymes are also needed to relieve the twisting stresses
introduced into DNA strands during processes such as transcription and DNA replication.[40]
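The degree of supercoiling is often expressed numerically. The sketch below assumes the standard textbook definitions, which are not spelled out in this article: the relaxed linking number Lk0 is the number of base pairs divided by 10.4, and the superhelical density is sigma = (Lk - Lk0) / Lk0. The example molecule and its linking number are made up for illustration.

```python
BP_PER_TURN = 10.4  # relaxed DNA, as quoted in the text


def relaxed_linking_number(n_bp, bp_per_turn=BP_PER_TURN):
    """Number of helical turns in a relaxed molecule of n_bp base pairs."""
    return n_bp / bp_per_turn


def superhelical_density(n_bp, linking_number, bp_per_turn=BP_PER_TURN):
    """sigma = (Lk - Lk0) / Lk0; negative values mean the DNA is underwound."""
    lk0 = relaxed_linking_number(n_bp, bp_per_turn)
    return (linking_number - lk0) / lk0


# Hypothetical 5200 bp circular molecule with 470 turns instead of about 500.
sigma = superhelical_density(5200, 470)
print(f"sigma = {sigma:.3f}")
```

A negative sigma corresponds to the negative supercoiling described above, in which the strands come apart more easily.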
Quadruplex structures
Further information: G-quadruplex
At the ends of the linear chromosomes are specialized regions of DNA called telomeres. The main function of these
regions is to allow the cell to replicate chromosome ends using the enzyme telomerase, as the enzymes that
normally replicate DNA cannot copy the extreme 3′ ends of chromosomes.[57] These specialized chromosome caps
also help protect the DNA ends, and stop the DNA repair systems in the cell from treating them as damage to be
corrected.[58] In human cells, telomeres are usually lengths of single-stranded DNA containing several thousand
repeats of a simple TTAGGG sequence.[59]
DNA quadruplex formed by telomere repeats. The looped conformation of the DNA backbone is very different from the typical
DNA helix.[60]
These guanine-rich sequences may stabilize chromosome ends by forming structures of stacked sets of four-base
units, rather than the usual base pairs found in other DNA molecules. Here, four guanine bases form a flat plate and
these flat four-base units then stack on top of each other, to form a stable G-quadruplex structure.[61] These
structures are stabilized by hydrogen bonding between the edges of the bases and chelation of a metal ion in the
centre of each four-base unit.[62] Other structures can also be formed, with the central set of four bases coming from
either a single strand folded around the bases, or several different parallel strands, each contributing one base to
the central structure.
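Sequences with the potential to fold this way can be flagged with a simple pattern search. The sketch below looks for four runs of at least three guanines separated by loops of one to seven bases. That pattern is a commonly used heuristic and an assumption here, not something defined in this article, and a match only indicates that a single strand could fold into a quadruplex, not that it does.

```python
import re

# Four G-runs (3 or more Gs) separated by loops of 1 to 7 bases (heuristic).
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)


def find_g4_motifs(sequence):
    """Return (start, end, motif) for each non-overlapping candidate motif."""
    sequence = sequence.upper()
    return [(m.start(), m.end(), m.group())
            for m in G4_PATTERN.finditer(sequence)]


telomere = "TTAGGG" * 4  # four copies of the human telomere repeat
print(find_g4_motifs(telomere))
```

Four consecutive TTAGGG repeats supply the four guanine runs, with TTA forming each loop.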
In addition to these stacked structures, telomeres also form large loop structures called telomere loops, or T-loops.
Here, the single-stranded DNA curls around in a long circle stabilized by telomere-binding proteins.[63] At the very
end of the T-loop, the single-stranded telomere DNA is held onto a region of double-stranded DNA by the telomere
strand disrupting the double-helical DNA and base pairing to one of the two strands. This triple-stranded structure is
called a displacement loop or D-loop.[61]
Branched DNA with a single branch and with multiple branches.
Branched DNA
Further information: Branched DNA and DNA nanotechnology
In DNA, fraying occurs when non-complementary regions exist at the end of an otherwise complementary double-strand of DNA. However, branched DNA can occur if a third strand of DNA is introduced and contains adjoining
regions able to hybridize with the frayed regions of the pre-existing double-strand. Although the simplest example of
branched DNA involves only three strands of DNA, complexes involving additional strands and multiple branches
are also possible.[64] Branched DNA can be used in nanotechnology to construct geometric shapes; see the section
on uses in technology below.
Structures of cytosine, 5-methylcytosine and thymine, showing cytosine with and without the 5-methyl group. Deamination converts 5-methylcytosine into thymine.
Damage
Further information: DNA damage (naturally occurring), Mutation, DNA damage theory of aging
A covalent adduct between a metabolically activated form of benzo[a]pyrene, the major mutagen in tobacco smoke, and DNA[72]
DNA can be damaged by many sorts of mutagens, which change the DNA sequence. Mutagens include oxidizing
agents, alkylating agents and also high-energy electromagnetic radiation such as ultraviolet light and X-rays. The
type of DNA damage produced depends on the type of mutagen. For example, UV light can damage DNA by
producing thymine dimers, which are cross-links between pyrimidine bases.[73] On the other hand, oxidants such
as free radicals or hydrogen peroxide produce multiple forms of damage, including base modifications, particularly
of guanosine, and double-strand breaks.[74] A typical human cell contains about 150,000 bases that have suffered
oxidative damage.[75] Of these oxidative lesions, the most dangerous are double-strand breaks, as these are difficult
to repair and can produce point mutations, insertions and deletions from the DNA sequence, as well as chromosomal
translocations.[76] These mutations can cause cancer. Because of inherent limitations in the DNA repair mechanisms,
if humans lived long enough, they would all eventually develop cancer.[77][78] Naturally occurring DNA damage,
arising from normal cellular processes that produce reactive oxygen species, the hydrolytic activities of cellular
water, etc., also occurs frequently. Although most of this damage is repaired, in any cell some DNA damage may
remain despite the action of repair processes. This remaining DNA damage accumulates with age in mammalian
postmitotic tissues. This accumulation appears to be an important underlying cause of aging.[79][80][81]
Many mutagens fit into the space between two adjacent base pairs; this is called intercalation. Most intercalators
are aromatic and planar molecules; examples include ethidium bromide, acridines, daunomycin, and doxorubicin.
For an intercalator to fit between base pairs, the bases must separate, distorting the DNA strands by unwinding of
the double helix. This inhibits both transcription and DNA replication, causing toxicity and mutations.[82] As a result,
DNA intercalators may be carcinogens, and in the case of thalidomide, a teratogen.[83] Others such
as benzo[a]pyrene diol epoxide and aflatoxin form DNA adducts that induce errors in replication.[84] Nevertheless,
due to their ability to inhibit DNA transcription and replication, other similar toxins are also used in chemotherapy to
inhibit rapidly growing cancer cells.[85]
Biological functions
DNA usually occurs as linear chromosomes in eukaryotes, and circular chromosomes in prokaryotes. The set of
chromosomes in a cell makes up its genome; the human genome has approximately 3 billion base pairs of DNA
arranged into 46 chromosomes.[86] The information carried by DNA is held in the sequence of pieces of DNA
called genes. Transmission of genetic information in genes is achieved via complementary base pairing. For
example, in transcription, when a cell uses the information in a gene, the DNA sequence is copied into a
complementary RNA sequence through the attraction between the DNA and the correct RNA nucleotides. Usually,
this RNA copy is then used to make a matching protein sequence in a process called translation, which depends on
the same interaction between RNA nucleotides. In alternative fashion, a cell may simply copy its genetic information
in a process called DNA replication. The details of these functions are covered in other articles; here the focus is on
the interactions between DNA and other molecules that mediate the function of the genome.
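The flow from DNA to RNA to protein described here can be sketched in a few lines. The example below assumes the standard genetic code, builds the codon table from the conventional UCAG ordering, takes the template strand written 5′ to 3′, and stops translating at the first stop codon. The function names and the example sequence are illustrative, and real transcription and translation involve start-site selection and processing steps that are ignored here.

```python
from itertools import product

# Standard genetic code in the conventional U, C, A, G codon ordering.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Each DNA template base is paired with its complementary RNA base.
DNA_TO_RNA = str.maketrans("ACGT", "UGCA")


def transcribe(template_strand):
    """mRNA (5' to 3') complementary to a DNA template strand (5' to 3')."""
    return template_strand.upper().translate(DNA_TO_RNA)[::-1]


def translate(mrna):
    """Read codons from position 0 until a stop codon ('*') or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("TTAGGCCAT")
print(mrna, translate(mrna))
```

Both steps rely on the same base-pairing rule: the template dictates the RNA sequence, and the RNA codons in turn specify the amino acids.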
T7 RNA polymerase (blue) producing a mRNA (green) from a DNA template (orange).[91]
Some noncoding DNA sequences play structural roles in chromosomes. Telomeres and centromeres typically
contain few genes, but are important for the function and stability of chromosomes.[58][92] Pseudogenes, which are
copies of genes that have been disabled by mutation, are an abundant form of noncoding DNA in
humans.[93] These sequences are usually just molecular fossils, although they can occasionally serve as raw genetic
material for the creation of new genes through the process of gene duplication and divergence.[94]
DNA replication. The double helix is unwound by a helicase and topoisomerase. Next, one DNA polymerase produces
the leading strand copy. Another DNA polymerase binds to the lagging strand. This enzyme makes discontinuous segments
(called Okazaki fragments) before DNA ligase joins them together.
Replication
Further information: DNA replication
Cell division is essential for an organism to grow, but, when a cell divides, it must replicate the DNA in its genome so
that the two daughter cells have the same genetic information as their parent. The double-stranded structure of DNA
provides a simple mechanism for DNA replication. Here, the two strands are separated and then each
strand's complementary DNA sequence is recreated by an enzyme called DNA polymerase. This enzyme makes
the complementary strand by finding the correct base through complementary base pairing, and bonding it onto the
original strand. As DNA polymerases can only extend a DNA strand in a 5′ to 3′ direction, different mechanisms are
used to copy the antiparallel strands of the double helix.[95] In this way, the base on the old strand dictates which
base appears on the new strand, and the cell ends up with a perfect copy of its DNA.
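The copying scheme described above, in which each old strand templates a new partner, can be sketched as follows. This is a deliberately idealized model: a duplex is represented as a pair of strings, both written 5′ to 3′, and the helper names are hypothetical. It ignores leading and lagging strand synthesis and assumes no copying errors.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Partner strand, written 5' to 3'."""
    return strand.upper().translate(COMPLEMENT)[::-1]


def replicate(duplex):
    """Separate a duplex and build a new partner for each old strand.

    duplex is a (top, bottom) pair of strands, each written 5' to 3'.
    Returns two daughter duplexes, each holding one old and one new strand.
    """
    top, bottom = duplex
    daughter_1 = (top, reverse_complement(top))        # old top, new bottom
    daughter_2 = (reverse_complement(bottom), bottom)  # new top, old bottom
    return daughter_1, daughter_2


parent = ("ATGCCG", "CGGCAT")
first, second = replicate(parent)
print(first == parent, second == parent)
```

Because each daughter keeps one parental strand, the sequence of the old strand fully determines the sequence of the new one.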
DNA-binding proteins
Further information: DNA-binding protein
Interaction of DNA (shown in orange) with histones (shown in blue). These proteins' basic amino acids bind to the acidic
phosphate groups on DNA.
Structural proteins that bind DNA are well-understood examples of non-specific DNA-protein interactions. Within
chromosomes, DNA is held in complexes with structural proteins. These proteins organize the DNA into a compact
structure called chromatin. In eukaryotes this structure involves DNA binding to a complex of small basic proteins
called histones, while in prokaryotes multiple types of proteins are involved.[103][104] The histones form a disk-shaped
complex called a nucleosome, which contains two complete turns of double-stranded DNA wrapped around its
surface. These non-specific interactions are formed through basic residues in the histones making ionic bonds to the
acidic sugar-phosphate backbone of the DNA, and are therefore largely independent of the base
sequence.[105] Chemical modifications of these basic amino acid residues
include methylation, phosphorylation and acetylation.[106] These chemical changes alter the strength of the interaction
between the DNA and the histones, making the DNA more or less accessible to transcription factors and changing
the rate of transcription.[107] Other non-specific DNA-binding proteins in chromatin include the high-mobility group
proteins, which bind to bent or distorted DNA.[108] These proteins are important in bending arrays of nucleosomes
and arranging them into the larger structures that make up chromosomes.[109]
A distinct group of DNA-binding proteins are those that specifically bind single-stranded DNA. In
humans, replication protein A is the best-understood member of this family and is used in processes where the
double helix is separated, including DNA replication, recombination and DNA repair.[110] These binding proteins seem
to stabilize single-stranded DNA and protect it from forming stem-loops or being degraded by nucleases.
In contrast, other proteins have evolved to bind to particular DNA sequences. The most intensively studied of these
are the various transcription factors, which are proteins that regulate transcription. Each transcription factor binds to
one particular set of DNA sequences and activates or inhibits the transcription of genes that have these sequences
close to their promoters. The transcription factors do this in two ways. Firstly, they can bind the RNA polymerase
responsible for transcription, either directly or through other mediator proteins; this locates the polymerase at the
promoter and allows it to begin transcription.[112] Alternatively, transcription factors can bind enzymes that modify the
histones at the promoter. This changes the accessibility of the DNA template to the polymerase.[113]
As these DNA targets can occur throughout an organism's genome, changes in the activity of one type of
transcription factor can affect thousands of genes.[114] Consequently, these proteins are often the targets of the signal
transduction processes that control responses to environmental changes or cellular differentiation and development.
The specificity of these transcription factors' interactions with DNA comes from the proteins making multiple contacts
to the edges of the DNA bases, allowing them to "read" the DNA sequence. Most of these base-interactions are
made in the major groove, where the bases are most accessible.[27]
DNA-modifying enzymes
Nucleases and ligases
Nucleases are enzymes that cut DNA strands by catalyzing the hydrolysis of the phosphodiester bonds. Nucleases
that hydrolyse nucleotides from the ends of DNA strands are called exonucleases, while endonucleases cut within
strands. The most frequently used nucleases in molecular biology are the restriction endonucleases, which cut DNA
at specific sequences. For instance, the EcoRV enzyme shown to the left recognizes the 6-base sequence 5′-GAT|ATC-3′ and makes a cut at the vertical line. In nature, these enzymes protect bacteria against phage infection by
digesting the phage DNA when it enters the bacterial cell, acting as part of the restriction modification system.[116] In
technology, these sequence-specific nucleases are used in molecular cloning and DNA fingerprinting.
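A restriction digest of this kind can be simulated on a sequence string. The sketch below uses the EcoRV recognition sequence GATATC from the text and assumes the enzyme cuts in the middle of the site (GAT|ATC), leaving blunt ends; it treats the DNA as linear and tracks only one strand, which is enough here because the site is palindromic. The function name and the example sequence are illustrative.

```python
ECORV_SITE = "GATATC"
ECORV_CUT_OFFSET = 3  # assumed cut position within the site: GAT|ATC


def digest(sequence, site=ECORV_SITE, cut_offset=ECORV_CUT_OFFSET):
    """Cut a linear sequence at every occurrence of site; return fragments."""
    sequence = sequence.upper()
    fragments = []
    start = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])
    return fragments


dna = "AAGATATCCCGATATCTT"
print(digest(dna))
```

Joining the fragments in order gives back the input, which is a handy sanity check when adapting the sketch to other enzymes and cut offsets.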
Enzymes called DNA ligases can rejoin cut or broken DNA strands.[117] Ligases are particularly important in lagging
strand DNA replication, as they join together the short segments of DNA produced at the replication fork into a
complete copy of the DNA template. They are also used in DNA repair and genetic recombination.[117]
Topoisomerases and helicases
Topoisomerases are enzymes with both nuclease and ligase activity. These proteins change the amount
of supercoiling in DNA. Some of these enzymes work by cutting the DNA helix and allowing one section to rotate,
thereby reducing its level of supercoiling; the enzyme then seals the DNA break.[39] Other types of these enzymes
are capable of cutting one DNA helix and then passing a second strand of DNA through this break, before rejoining
the helix.[118] Topoisomerases are required for many processes involving DNA, such as DNA replication and
transcription.[40]
Helicases are proteins that are a type of molecular motor. They use the chemical energy in nucleoside
triphosphates, predominantly ATP, to break hydrogen bonds between bases and unwind the DNA double helix into
single strands.[119] These enzymes are essential for most processes where enzymes need to access the DNA bases.
Polymerases
Polymerases are enzymes that synthesize polynucleotide chains from nucleoside triphosphates. The sequence of
their products is based on existing polynucleotide chains, which are called templates. These enzymes
function by repeatedly adding a nucleotide to the 3′ hydroxyl group at the end of the growing polynucleotide chain.
As a consequence, all polymerases work in a 5′ to 3′ direction.[120] In the active site of these enzymes, the incoming
nucleoside triphosphate base-pairs to the template: this allows polymerases to accurately synthesize the
complementary strand of their template. Polymerases are classified according to the type of template that they use.
In DNA replication, DNA-dependent DNA polymerases make copies of DNA polynucleotide chains. In order to
preserve biological information, it is essential that the sequence of bases in each copy is precisely complementary
to the sequence of bases in the template strand. Many DNA polymerases have a proofreading activity. Here, the
polymerase recognizes the occasional mistakes in the synthesis reaction by the lack of base pairing between the
mismatched nucleotides. If a mismatch is detected, a 3′ to 5′ exonuclease activity is activated and the incorrect base
removed.[121] In most organisms, DNA polymerases function in a large complex called the replisome that contains
multiple accessory subunits, such as the DNA clamp or helicases.[122]
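The proofreading idea, detecting a failure of base pairing and replacing the offending base, can be illustrated with a toy check. In the sketch below the two strands are assumed to be aligned position by position (the template read 3′ to 5′ against the new strand read 5′ to 3′). A real polymerase proofreads each nucleotide as it is added rather than scanning a finished strand, so this is only a simplified picture with hypothetical names.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def proofread(template, new_strand):
    """Compare a new strand against its template, position by position.

    Returns the corrected strand and the indices where the new strand
    failed to base-pair with the template.
    """
    if len(template) != len(new_strand):
        raise ValueError("strands must be the same length")
    corrected = []
    mismatches = []
    for index, (t_base, n_base) in enumerate(zip(template, new_strand)):
        expected = PAIRS[t_base]
        if n_base != expected:
            mismatches.append(index)  # no base pairing at this position
        corrected.append(expected)    # keep or restore the correct base
    return "".join(corrected), mismatches


print(proofread("TACG", "ATGA"))
```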
RNA-dependent DNA polymerases are a specialized class of polymerases that copy the sequence of an RNA
strand into DNA. They include reverse transcriptase, which is a viral enzyme involved in the infection of cells
by retroviruses, and telomerase, which is required for the replication of telomeres.[57][123] Telomerase is an unusual
polymerase because it contains its own RNA template as part of its structure.[58]
Transcription is carried out by a DNA-dependent RNA polymerase that copies the sequence of a DNA strand into
RNA. To begin transcribing a gene, the RNA polymerase binds to a sequence of DNA called a promoter and
separates the DNA strands. It then copies the gene sequence into a messenger RNA transcript until it reaches a
region of DNA called the terminator, where it halts and detaches from the DNA. As with human DNA-dependent
DNA polymerases, RNA polymerase II, the enzyme that transcribes most of the genes in the human genome,
operates as part of a large protein complex with multiple regulatory and accessory subunits.[124]
Genetic recombination
Structure of the Holliday junction intermediate in genetic recombination. The four separate DNA strands are coloured red,
blue, green and yellow.[125]
Recombination involves the breakage and rejoining of two chromosomes (M and F) to produce two re-arranged chromosomes
(C1 and C2).
A DNA helix usually does not interact with other segments of DNA, and in human cells the different chromosomes
even occupy separate areas in the nucleus called "chromosome territories".[126] This physical separation of different
chromosomes is important for the ability of DNA to function as a stable repository for information, as one of the few
times chromosomes interact is during chromosomal crossover when they recombine. Chromosomal crossover is
when two DNA helices break, swap a section and then rejoin.
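The break, swap and rejoin sequence can be pictured with two strings standing in for the parental chromosomes M and F from the figure caption. The sketch below is an idealized single crossover at a given position; the function name and the example sequences are illustrative, and real crossover happens between aligned double helices rather than plain strings.

```python
def crossover(m, f, point):
    """Single crossover: break both chromosomes at point and swap the tails.

    Returns the two rearranged chromosomes (C1, C2).
    """
    if len(m) != len(f):
        raise ValueError("chromosomes must be the same length")
    if not 0 <= point <= len(m):
        raise ValueError("crossover point out of range")
    c1 = m[:point] + f[point:]
    c2 = f[:point] + m[point:]
    return c1, c2


c1, c2 = crossover("AAAAAA", "GGGGGG", 2)
print(c1, c2)
```

No sequence is gained or lost in this exchange; the material is only redistributed between the two products.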
Recombination allows chromosomes to exchange genetic information and produces new combinations of genes,
which increases the efficiency of natural selection and can be important in the rapid evolution of new
proteins.[127] Genetic recombination can also be involved in DNA repair, particularly in the cell's response to double-strand breaks.[128]
The most common form of chromosomal crossover is homologous recombination, where the two chromosomes
involved share very similar sequences. Non-homologous recombination can be damaging to cells, as it can
produce chromosomal translocations and genetic abnormalities. The recombination reaction is catalyzed by
enzymes known as recombinases, such as RAD51.[129] The first step in recombination is a double-stranded break
caused by either an endonuclease or damage to the DNA.[130] A series of steps catalyzed in part by the recombinase
then leads to joining of the two helices by at least one Holliday junction, in which a segment of a single strand in
each helix is annealed to the complementary strand in the other helix. The Holliday junction is a tetrahedral junction
structure that can be moved along the pair of chromosomes, swapping one strand for another. The recombination
reaction is then halted by cleavage of the junction and re-ligation of the released DNA.[131]
Evolution
Further information: RNA world hypothesis
DNA contains the genetic information that allows all modern living things to function, grow and reproduce. However,
it is unclear how long in the 4-billion-year history of life DNA has performed this function, as it has been proposed
that the earliest forms of life may have used RNA as their genetic material.[132][133] RNA may have acted as the central
part of early cell metabolism as it can both transmit genetic information and carry out catalysis as part
of ribozymes.[134] This ancient RNA world where nucleic acid would have been used for both catalysis and genetics
may have influenced the evolution of the current genetic code based on four nucleotide bases. This would occur,
since the number of different bases in such an organism is a trade-off between a small number of bases increasing
replication accuracy and a large number of bases increasing the catalytic efficiency of ribozymes.[135]
However, there is no direct evidence of ancient genetic systems, as recovery of DNA from most fossils is
impossible. This is because DNA survives in the environment for less than one million years, and slowly degrades
into short fragments in solution.[136] Claims for older DNA have been made, most notably a report of the isolation of a
viable bacterium from a salt crystal 250 million years old,[137] but these claims are controversial.[138][139]
Building blocks of DNA (adenine, guanine and related organic molecules) may have been formed extraterrestrially
in outer space.[140][141][142] Complex DNA and RNA organic compounds of life, including uracil, cytosine and thymine,
have also been formed in the laboratory under conditions mimicking those found in outer space, using starting
chemicals, such as pyrimidine, found in meteorites. Pyrimidine, like polycyclic aromatic hydrocarbons (PAHs), the
most carbon-rich chemical found in the universe, may have been formed in red giants or in interstellar dust and gas
clouds.[143]
Uses in technology
Genetic engineering
Further information: Molecular biology, nucleic acid methods and genetic engineering
Methods have been developed to purify DNA from organisms, such as phenol-chloroform extraction, and to
manipulate it in the laboratory, such as restriction digests and the polymerase chain reaction.
Modern biology and biochemistry make intensive use of these techniques in recombinant DNA
technology. Recombinant DNA is a man-made DNA sequence that has been assembled from other DNA
sequences. Such sequences can be transformed into organisms in the form of plasmids or, in the appropriate format, by using
a viral vector.[144] The genetically modified organisms produced can be used to produce products such as
recombinant proteins, used in medical research,[145] or be grown in agriculture.[146][147]
Forensics
Further information: DNA profiling
Forensic scientists can use DNA in blood, semen, skin, saliva or hair found at a crime scene to identify a matching
DNA of an individual, such as a perpetrator. This process is formally termed DNA profiling, but may also be called
"genetic fingerprinting". In DNA profiling, the lengths of variable sections of repetitive DNA, such as short tandem
repeats and minisatellites, are compared between people. This method is usually an extremely reliable technique for
identifying a matching DNA.[148] However, identification can be complicated if the scene is contaminated with DNA
from several people.[149] DNA profiling was developed in 1984 by British geneticist Sir Alec Jeffreys,[150] and first used
in forensic science to convict Colin Pitchfork in the 1988 Enderby murders case.[151]
The development of forensic science, and the ability to now obtain genetic matching on minute samples of blood,
skin, saliva or hair has led to a re-examination of a number of cases. Evidence can now be uncovered that was not
scientifically possible at the time of the original examination. Combined with the removal of the double jeopardy law
in some places, this can allow cases to be reopened where previous trials have failed to produce sufficient evidence
to convince a jury. People charged with serious crimes may be required to provide a sample of DNA for matching
purposes. The most obvious defence to DNA matches obtained forensically is to claim that cross-contamination of
evidence has taken place. This has resulted in meticulous strict handling procedures with new cases of serious
crime. DNA profiling is also used to identify victims of mass casualty incidents.[152] As well as positively identifying
bodies or body parts in serious accidents, DNA profiling is being successfully used to identify individual victims in
mass war graves matching to family members.
Bioinformatics
Further information: Bioinformatics
Bioinformatics involves the manipulation, searching, and data mining of biological data, and this includes DNA
sequence data. The development of techniques to store and search DNA sequences has led to widely applied
advances in computer science, especially string searching algorithms, machine learning and database
theory.[153] String searching or matching algorithms, which find an occurrence of a sequence of letters inside a larger
sequence of letters, were developed to search for specific sequences of nucleotides.[154] The DNA sequence may
be aligned with other DNA sequences to identify homologous sequences and locate the specific mutations that make
them distinct. These techniques, especially multiple sequence alignment, are used in
studying phylogenetic relationships and protein function.[155] Data sets representing entire genomes' worth of DNA
sequences, such as those produced by the Human Genome Project, are difficult to use without the annotations that
identify the locations of genes and regulatory elements on each chromosome. Regions of DNA sequence that have
the characteristic patterns associated with protein- or RNA-coding genes can be identified by gene
finding algorithms, which allow researchers to predict the presence of particular gene products and their possible
functions in an organism even before they have been isolated experimentally.[156] Entire genomes may also be
compared, which can shed light on the evolutionary history of particular organisms and permit the examination of
complex evolutionary events.
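The string-searching idea described above can be sketched in a few lines of Python. This is only a naive illustration built on the standard `str.find` method, not one of the specialised algorithms used by real sequence-search tools; the function name `find_motif` and the example sequence are made up for the example.

```python
def find_motif(sequence: str, motif: str) -> list[int]:
    """Return the 0-based start positions of every occurrence of motif.

    Overlapping occurrences are reported as well.
    """
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one position later so overlapping matches are not missed.
        start = sequence.find(motif, start + 1)
    return positions


# Hypothetical nucleotide fragment, for illustration only.
fragment = "ATGCGATATAGCGATA"
print(find_motif(fragment, "GATA"))  # [4, 12]
```

Real genome-scale searches rely on indexed or heuristic methods, because scanning billions of bases this way for every query would be too slow.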
DNA nanotechnology
The DNA structure at left (schematic shown) will self-assemble into the structure visualized by atomic force microscopy at
right. DNA nanotechnology is the field that seeks to design nanoscale structures using the molecular recognition properties of
DNA molecules. Image from Strong, 2004.
Information storage
Main article: DNA digital data storage
In a paper published in Nature in January 2013, scientists from the European Bioinformatics Institute and Agilent
Technologies proposed a mechanism to use DNA's ability to code information as a means of digital data storage.
The group was able to encode 739 kilobytes of data into DNA code, synthesize the actual DNA, then sequence the
DNA and decode the information back to its original form, with a reported 100% accuracy. The encoded information
consisted of text files and audio files. A prior experiment was published in August 2012. It was conducted by
researchers at Harvard University, where the text of a 54,000-word book was encoded in DNA.[164][165]
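To make the idea of storing digital data in DNA concrete, the sketch below maps every two bits of a byte to one base. This mapping (00 to A, 01 to C, 10 to G, 11 to T) is an assumption chosen for simplicity; it is not the encoding scheme used in either of the experiments described above, which had to cope with synthesis and sequencing errors.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def encode(data: bytes) -> str:
    """Encode bytes as a base string, four bases per byte, high bits first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def decode(dna: str) -> bytes:
    """Invert encode(); expects a length that is a multiple of four."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


message = b"DNA"
strand = encode(message)
assert decode(strand) == message  # round trip recovers the original bytes
```

A practical scheme would add redundancy and avoid long runs of the same base, neither of which this sketch attempts.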
James Watson and Francis Crick (right), co-originators of the double-helix model, with Maclyn McCarty (left).
DNA was first isolated by the Swiss physician Friedrich Miescher who, in 1869, discovered a microscopic substance
in the pus of discarded surgical bandages. As it resided in the nuclei of cells, he called it "nuclein".[166][167] In
1878, Albrecht Kossel isolated the non-protein component of "nuclein", nucleic acid, and later isolated its five
primary nucleobases.[168][169] In 1919, Phoebus Levene identified the base, sugar and phosphate nucleotide
unit.[170] Levene suggested that DNA consisted of a string of nucleotide units linked together through the phosphate
groups. Levene thought the chain was short and the bases repeated in a fixed order. In 1937, William
Astbury produced the first X-ray diffraction patterns that showed that DNA had a regular structure.[171]
In 1927, Nikolai Koltsov proposed that inherited traits would be inherited via a "giant hereditary molecule" made up
of "two mirror strands that would replicate in a semi-conservative fashion using each strand as a template".[172][173] In
1928, Frederick Griffith in his experiment discovered that traits of the "smooth" form of Pneumococcus could be
transferred to the "rough" form of the same bacteria by mixing killed "smooth" bacteria with the live "rough"
form.[174][175] This system provided the first clear suggestion that DNA carries genetic information. In the
Avery–MacLeod–McCarty experiment, Oswald Avery, along with coworkers Colin MacLeod and Maclyn McCarty,
identified DNA as the transforming principle in 1943.[176] DNA's role in heredity was confirmed in 1952, when Alfred
Hershey and Martha Chase in the Hershey–Chase experiment showed that DNA is the genetic material of the T2
phage.[177]
In 1953, James Watson and Francis Crick suggested what is now accepted as the first correct double-helix model
of DNA structure in the journal Nature.[6] Their double-helix, molecular model of DNA was then based on a single X-ray diffraction image (labeled as "Photo 51")[178] taken by Rosalind Franklin and Raymond Gosling in May 1952, as
well as the information that the DNA bases are paired, also obtained through private communications from Erwin
Chargaff in the previous years.
Experimental evidence supporting the Watson and Crick model was published in a series of five articles in the same
issue of Nature.[179] Of these, Franklin and Gosling's paper was the first publication of their own X-ray diffraction data
and original analysis method that partially supported the Watson and Crick model;[43][180] this issue also contained an
article on DNA structure by Maurice Wilkins and two of his colleagues, whose analysis and in vivo B-DNA X-ray
patterns also supported the presence in vivo of the double-helical DNA configurations as proposed by Crick and
Watson for their double-helix molecular model of DNA in the previous two pages of Nature.[44] In 1962, after
Franklin's death, Watson, Crick, and Wilkins jointly received the Nobel Prize in Physiology or Medicine.[181] Nobel
Prizes were awarded only to living recipients at the time. A debate continues about who should receive credit for the
discovery.[182]
In an influential presentation in 1957, Crick laid out the central dogma of molecular biology, which foretold the
relationship between DNA, RNA, and proteins, and articulated the "adaptor hypothesis".[183] Final confirmation of the
replication mechanism that was implied by the double-helical structure followed in 1958 through the Meselson–Stahl
experiment.[184] Further work by Crick and coworkers showed that the genetic code was based on non-overlapping
triplets of bases, called codons, allowing Har Gobind Khorana, Robert W. Holley and Marshall Warren Nirenberg to
decipher the genetic code.[185] These findings represent the birth of molecular biology.
What is DNA?
DNA, or deoxyribonucleic acid, is the hereditary material in humans and almost all other
organisms. Nearly every cell in a person's body has the same DNA. Most DNA is located in the
cell nucleus (where it is called nuclear DNA), but a small amount of DNA can also be found in the
mitochondria (where it is called mitochondrial DNA or mtDNA).
The information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine
(G), cytosine (C), and thymine (T). Human DNA consists of about 3 billion bases, and more than
99 percent of those bases are the same in all people. The order, or sequence, of these bases
determines the information available for building and maintaining an organism, similar to the way
in which letters of the alphabet appear in a certain order to form words and sentences.
DNA bases pair up with each other, A with T and C with G, to form units called base pairs. Each
base is also attached to a sugar molecule and a phosphate molecule. Together, a base, sugar, and
phosphate are called a nucleotide. Nucleotides are arranged in two long strands that form a spiral
called a double helix. The structure of the double helix is somewhat like a ladder, with the base
pairs forming the ladder's rungs and the sugar and phosphate molecules forming the vertical
sidepieces of the ladder.
An important property of DNA is that it can replicate, or make copies of itself. Each strand of DNA
in the double helix can serve as a pattern for duplicating the sequence of bases. This is critical when
cells divide because each new cell needs to have an exact copy of the DNA present in the old cell.
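The pairing rule (A with T, C with G) is what lets one strand act as a pattern for the other. A minimal Python sketch follows; the function names are illustrative, and because the two strands run in opposite directions, the partner strand read in its own direction is the reversed complement.

```python
# Watson-Crick pairing as described above: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of a strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own (opposite) direction."""
    return complement(strand)[::-1]


template = "ATGC"
print(complement(template))          # TACG
print(reverse_complement(template))  # GCAT
```

Applying `complement` twice returns the original sequence, which mirrors how each strand carries enough information to rebuild its partner.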
each species unique. DNA, along with the instructions it contains, is passed from adult organisms to their offspring
during reproduction.
greatly, ranging from about 1,000 bases to 1 million bases in humans. Genes only make up about 1 percent of the
DNA sequence. DNA sequences outside this 1 percent are involved in regulating when, how and how much of a protein
is made.
DNA profiling
From Wikipedia, the free encyclopedia
This article is about DNA profiling in forensics. For other uses, see DNA profiling (disambiguation).
For DNA testing for inherited diseases, see Genetic testing.
Not to be confused with DNA barcoding or DNA phenotyping.
Forensic DNA profiling (also called DNA testing or DNA typing) is a technique employed by forensic scientists to
identify individuals by characteristics of their DNA. DNA profiles are a small set of DNA variations that are very
likely to be different in all unrelated individuals. DNA profiling should not be confused with full genome
sequencing.[1] DNA profiling is used in, for example, parentage testing and criminal investigation.
Although 99.9% of human DNA sequences are the same in every person, enough of the DNA is different that it is
possible to distinguish one individual from another, unless they are monozygotic ("identical") twins.[2] DNA profiling
uses repetitive ("repeat") sequences that are highly variable,[2] called variable number tandem repeats (VNTRs), in
particular short tandem repeats (STRs). VNTR loci are very similar between closely related humans, but are so
variable that unrelated individuals are extremely unlikely to have the same VNTRs.
The DNA profiling technique was first reported in 1985.[3]
Developed by Alec Jeffreys, the process begins with a sample of an individual's DNA (typically called a "reference
sample"). The most desirable method of collecting a reference sample is the use of a buccal swab, as this reduces
the possibility of contamination. When this is not available (e.g. because a court order is needed but not obtainable)
other methods may need to be used to collect a sample of blood, saliva, semen, or other appropriate fluid or tissue
from personal items (e.g. a toothbrush, razor) or from stored samples (e.g. banked sperm or biopsy tissue).
Samples obtained from blood relatives (related by birth, not marriage) can provide an indication of an individual's
profile, as could human remains that had been previously profiled.
A reference sample is then analyzed to create the individual's DNA profile using one of a number of techniques,
discussed below. The DNA profile is then compared against another sample to determine whether there is a genetic
match.
RFLP analysis
Main article: Restriction fragment length polymorphism
The first methods used for DNA profiling involved RFLP analysis. DNA is collected from
cells, such as a blood sample, and cut into small pieces using a restriction enzyme (a restriction digest). This
generates thousands of DNA fragments of differing sizes as a consequence of variations between DNA sequences
of different individuals. The fragments are then separated on the basis of size using gel electrophoresis.
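The link between sequence variation and fragment sizes can be illustrated with a toy restriction digest in Python. The sketch assumes the enzyme EcoRI, which recognises GAATTC and cuts after the G; the two short sequences are invented, and real digests involve far longer DNA and many more fragments.

```python
def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut sequence at every occurrence of site.

    cut_offset is the position within the site where the cut falls
    (1 means after the first base, as assumed here for EcoRI).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments


def fragment_lengths(sequence: str) -> list[int]:
    """Sorted fragment sizes, roughly what a gel separation reveals."""
    return sorted(len(fragment) for fragment in digest(sequence))


# Hypothetical individuals: a single base change removes the second site.
person_a = "AAGAATTCTTTTGAATTCCC"
person_b = "AAGAATTCTTTTGAGTTCCC"
print(fragment_lengths(person_a))  # [3, 7, 10]
print(fragment_lengths(person_b))  # [3, 17]
```

The two individuals differ by one base, yet they produce different sets of fragment lengths, which is the variation RFLP analysis detects.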
The separated fragments are then transferred to a nitrocellulose or nylon filter; this procedure is called a Southern
blot. The DNA fragments within the blot are permanently fixed to the filter, and the DNA strands
are denatured. Radiolabeled probe molecules are then added that are complementary to sequences in
the genome that contain repeat sequences. These repeat sequences tend to vary in length among different
individuals and are called variable number tandem repeat sequences or VNTRs. The probe molecules hybridize to
DNA fragments containing the repeat sequences and excess probe molecules are washed away. The blot is then
exposed to an X-ray film. Fragments of DNA that have bound to the probe molecules appear as dark bands on the
film.
The Southern blot technique is laborious, and requires large amounts of undegraded sample DNA. Also, Karl
Brown's original technique looked at many minisatellite loci at the same time, increasing the observed variability, but
making it hard to discern individual alleles (and thereby precluding parental testing). These early techniques have
been supplanted by PCR-based assays.
PCR analysis
Main article: polymerase chain reaction
Developed by Kary Mullis in 1983, a process was reported by which specific portions of the sample DNA can be
amplified almost indefinitely (Saiki et al. 1985, 1988). This has revolutionized the whole field of DNA study. The
process, the polymerase chain reaction (PCR), mimics the biological process of DNA replication, but confines it to
specific DNA sequences of interest. With the invention of the PCR technique, DNA profiling took huge strides
forward in both discriminating power and the ability to recover information from very small (or degraded) starting
samples.
PCR greatly amplifies the amounts of a specific region of DNA. In the PCR process, the DNA sample is denatured
into the separate individual polynucleotide strands through heating. Two oligonucleotide DNA primers are used to
hybridize to two corresponding nearby sites on opposite DNA strands in such a fashion that the normal enzymatic
extension of the active terminal of each primer (that is, the 3′ end) leads toward the other primer. PCR uses
replication enzymes that are tolerant of high temperatures, such as the thermostable Taq polymerase. In this
fashion, two new copies of the sequence of interest are generated. Repeated denaturation, hybridization, and
extension in this fashion produce an exponentially growing number of copies of the DNA of interest. Instruments that
perform thermal cycling are now readily available from commercial sources. This process can produce a million-fold
or greater amplification of the desired region in 2 hours or less.
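The exponential growth described above can be checked with simple arithmetic. The sketch below assumes an idealised reaction in which every cycle copies a fraction `efficiency` of the existing molecules (1.0 means perfect doubling); both the parameter and the function names are illustrative assumptions rather than part of any PCR protocol.

```python
import math


def copies_after(cycles: int, initial_copies: int = 1, efficiency: float = 1.0) -> float:
    """Idealised copy number after a given number of thermal cycles."""
    return initial_copies * (1 + efficiency) ** cycles


def cycles_needed(fold: float, efficiency: float = 1.0) -> int:
    """Smallest whole number of cycles giving at least the requested fold amplification."""
    return math.ceil(math.log(fold) / math.log(1 + efficiency))


print(cycles_needed(1_000_000))  # 20
print(copies_after(20))          # 1048576.0
```

With perfect doubling, 20 cycles already exceed a million-fold amplification; real reactions fall short of perfect efficiency and therefore need somewhat more cycles.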
Early assays such as the HLA-DQ alpha reverse dot blot strips grew to be very popular due to their ease of use, and
the speed with which a result could be obtained. However, they were not as discriminating as RFLP analysis. It was
also difficult to determine a DNA profile for mixed samples, such as a vaginal swab from a sexual assault victim.
However, the PCR method was readily adaptable for analyzing VNTR, in particular STR loci. In recent years,
research in human DNA quantitation has focused on new "real-time" quantitative PCR (qPCR) techniques.
Quantitative PCR methods enable automated, precise, and high-throughput measurements. Interlaboratory studies
have demonstrated the importance of human DNA quantitation on achieving reliable interpretation of STR typing
and obtaining consistent results across laboratories.
STR analysis
Main article: Short tandem repeats
The system of DNA profiling used today is based on PCR and uses short tandem repeats (STR). This method uses
highly polymorphic regions that have short repeated sequences of DNA (the most common is 4 bases repeated, but
there are other lengths in use, including 3 and 5 bases). Because unrelated people almost certainly have different
numbers of repeat units, STRs can be used to discriminate between unrelated individuals. These STR loci (locations
on a chromosome) are targeted with sequence-specific primers and amplified using PCR. The DNA fragments that
result are then separated and detected using electrophoresis. There are two common methods of separation and
detection, capillary electrophoresis (CE) and gel electrophoresis.
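What is measured at an STR locus is, in effect, the number of repeat units. The following Python sketch counts the longest uninterrupted run of a repeat unit in a sequence string. It is an illustration only: laboratory typing infers the repeat number from the length of the amplified fragment, not from text matching, and the unit and sequences here are hypothetical.

```python
import re


def max_tandem_repeats(sequence: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of unit in sequence."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


# Two hypothetical alleles of the same locus, differing in repeat number.
allele_1 = "TTGATAGATAGATAGATACC"
allele_2 = "TTGATAGATACC"
print(max_tandem_repeats(allele_1, "GATA"))  # 4
print(max_tandem_repeats(allele_2, "GATA"))  # 2
```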
Each STR is polymorphic, but the number of alleles is very small. Typically each STR allele will be shared by around
5–20% of individuals. The power of STR analysis comes from looking at multiple STR loci simultaneously. The
pattern of alleles can identify an individual quite accurately. Thus STR analysis provides an excellent identification
tool. The more STR regions that are tested in an individual the more discriminating the test becomes.
From country to country, different STR-based DNA-profiling systems are in use. In North America, systems that
amplify the CODIS 13 core loci are almost universal, whereas in the United Kingdom the SGM+ 11 loci system
(which is compatible with The National DNA Database) is in use. Whichever system is used, many of the STR
regions used are the same. These DNA-profiling systems are based on multiplex reactions, whereby many STR
regions will be tested at the same time.
The true power of STR analysis is in its statistical power of discrimination. Because the 13 loci that are currently
used for discrimination in CODIS are independently assorted (having a certain number of repeats at one locus does
not change the likelihood of having any number of repeats at any other locus), the product rule for probabilities can
be applied. This means that, if someone has the DNA type of ABC, where the three loci were independent, we can
say that the probability of having that DNA type is the probability of having type A times the probability of having
type B times the probability of having type C. This has resulted in the ability to generate match probabilities of 1 in a
quintillion (1×10^18) or more. However, DNA database searches have shown false DNA profile matches to be much more
frequent than expected.[4] Moreover, since there are about 12 million monozygotic twins on Earth, the theoretical
probability is not accurate.
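The product rule amounts to multiplying the per-locus frequencies. A small Python sketch follows, using exact fractions to avoid rounding; the three frequencies are invented for illustration and do not come from any population database.

```python
from fractions import Fraction
from math import prod


def random_match_probability(locus_frequencies) -> Fraction:
    """Multiply per-locus frequencies, assuming the loci are independent."""
    return prod(locus_frequencies)


# Hypothetical frequencies for DNA types A, B and C at three independent loci.
frequencies = [Fraction(1, 10), Fraction(1, 20), Fraction(1, 5)]
print(random_match_probability(frequencies))  # 1/1000
```

Each additional independent locus multiplies the result by another small fraction, which is how a profile of 13 loci reaches such extreme theoretical figures. The caveats above (relatives, twins and database effects) are not captured by this calculation.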
In practice, the risk of a match caused by contamination is much greater than the risk of matching a distant relative;
examples include contamination of a sample from nearby objects, or from left-over cells transferred from a prior test. The risk is
greater for matching the most common person in the samples: everything collected from, or in contact with, a victim
is a major source of contamination for any other samples brought into a lab. For that reason, multiple control
samples are typically tested in order to ensure that they stayed clean, when prepared during the same period as the
actual test samples. Unexpected matches (or variations) in several control samples indicate a high probability of
contamination for the actual test samples. In a relationship test, the full DNA profiles should differ (except for twins),
to prove that a person was not actually matched as being related to their own DNA in another sample.
AmpFLP
Main article: Amplified fragment length polymorphism
Another technique, AmpFLP, or amplified fragment length polymorphism, was also put into practice during the early
1990s. This technique was also faster than RFLP analysis and used PCR to amplify DNA samples. It relied
on variable number tandem repeat (VNTR) polymorphisms to distinguish various alleles, which were separated on
a polyacrylamide gel using an allelic ladder (as opposed to a molecular weight ladder). Bands could be visualized
by silver staining the gel. One popular locus for fingerprinting was the D1S80 locus. As with all PCR based methods,
highly degraded DNA or very small amounts of DNA may cause allelic dropout (causing a mistake in thinking a
heterozygote is a homozygote) or other stochastic effects. In addition, because the analysis is done on a gel, very
high number repeats may bunch together at the top of the gel, making it difficult to resolve. AmpFLP analysis can be
highly automated, and allows for easy creation of phylogenetic trees based on comparing individual samples of
DNA. Due to its relatively low cost and ease of set-up and operation, AmpFLP remains popular in lower income
countries.
Using PCR technology, DNA analysis is widely applied to determine genetic family relationships such as paternity,
maternity, siblingship and other kinships.
During conception, the father's sperm cell and the mother's egg cell, each containing half the amount of DNA found
in other body cells, meet and fuse to form a fertilized egg, called a zygote. The zygote contains a complete set of
DNA molecules, a unique combination of DNA from both parents. This zygote divides and multiplies into an embryo
and later, a full human being.
At each stage of development, all the cells forming the body contain the same DNA, half from the father and half
from the mother. This fact allows relationship testing to use all types of samples, including loose cells from the
cheeks collected using buccal swabs, blood or other types of samples.
There are predictable inheritance patterns at certain locations (called loci) in the human genome, which have been
found to be useful in determining identity and biological relationships. These loci contain specific DNA markers that
scientists use to identify individuals. In a routine DNA paternity test, the markers used are Short Tandem
Repeats (STRs), short pieces of DNA that occur in highly differential repeat patterns among individuals.
Each person's DNA contains two copies of these markers, one copy inherited from the father and one from the
mother. Within a population, the markers at each person's DNA location could differ in length and sometimes
sequence, depending on the markers inherited from the parents.
The combination of marker sizes found in each person makes up his/her unique genetic profile. When determining
the relationship between two individuals, their genetic profiles are compared to see if they share the same
inheritance patterns at a statistically conclusive rate.
For example, the following sample report from a commercial DNA paternity testing laboratory, Universal Genetics,
shows how relatedness between parents and child is identified on those special markers:
DNA Marker    Mother    Child     Alleged father
D21S11        28, 30    28, 31    29, 31
D7S820        9, 10     10, 11    11, 12
TH01          14, 15    14, 16    15, 16
D13S317       7, 8      7, 9
D19S433       8, 9
The partial results indicate that the child and the alleged father's DNA match among these five markers. The
complete test results show this correlation on 16 markers between the child and the tested man to enable a
conclusion to be drawn as to whether or not the man is the biological father.
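The comparison behind such a report can be sketched as a simple consistency check: at each marker, one of the child's two alleles must be found in the mother and the other in the alleged father. The Python below applies that rule to the three markers whose rows are complete in the sample report; the function name is illustrative, and real laboratories also allow for mutations and compute likelihoods rather than a yes/no answer.

```python
def consistent_with_paternity(mother, child, alleged_father) -> bool:
    """True if one child allele can come from the mother and the other from the man."""
    first, second = child
    return (first in mother and second in alleged_father) or (
        second in mother and first in alleged_father
    )


# Allele values taken from the complete rows of the sample report:
# marker -> (mother, child, alleged father)
report = {
    "D21S11": ((28, 30), (28, 31), (29, 31)),
    "D7S820": ((9, 10), (10, 11), (11, 12)),
    "TH01": ((14, 15), (14, 16), (15, 16)),
}

for marker, (mother, child, father) in report.items():
    print(marker, consistent_with_paternity(mother, child, father))
```

All three markers come out consistent, in line with the report's statement that the child and the alleged father match.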
Each marker is assigned a Paternity Index (PI), which is a statistical measure of how powerfully a match at a
particular marker indicates paternity. The PIs of all markers are multiplied together to generate the Combined
Paternity Index (CPI), which indicates the overall probability of an individual being the biological father of the tested
child relative to a randomly selected man from the entire population of the same race. The CPI is then converted
into a Probability of Paternity showing the degree of relatedness between the alleged father and child.
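The arithmetic can be sketched as follows. The conversion shown is the usual Bayesian one, and the prior probability of 0.5 is an assumption stated explicitly in the code (with that prior the formula reduces to CPI / (CPI + 1)). The PI values are invented for illustration.

```python
from math import prod


def combined_paternity_index(paternity_indices) -> float:
    """Multiply the per-marker Paternity Index values."""
    return prod(paternity_indices)


def probability_of_paternity(cpi: float, prior: float = 0.5) -> float:
    """Convert a CPI to a probability, given an assumed prior probability."""
    return (cpi * prior) / (cpi * prior + (1 - prior))


# Hypothetical PI values for three markers.
cpi = combined_paternity_index([2.0, 4.0, 1.5])
print(cpi)                            # 12.0
print(probability_of_paternity(cpi))  # about 0.923
```

A CPI of 1 leaves the probability at the prior, meaning the markers gave no information either way.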
The DNA test report in other family relationship tests, such as grandparentage and siblingship tests, is similar to a
paternity test report. Instead of the Combined Paternity Index, a different value, such as a Siblingship Index, is
reported.
The report shows the genetic profiles of each tested person. If there are markers shared among the tested
individuals, the probability of biological relationship is calculated to determine how likely the tested individuals share
the same markers due to a blood relationship.
Y-chromosome analysis
Recent innovations have included the creation of primers targeting polymorphic regions on the Y-chromosome (Y-STR), which allows resolution of a mixed DNA sample from a male and female or cases in which a differential
extraction is not possible. Y-chromosomes are paternally inherited, so Y-STR analysis can help in the identification
of paternally related males. Y-STR analysis was performed in the Sally Hemings controversy to determine if Thomas
Jefferson had sired a son with one of his slaves. The analysis of the Y-chromosome yields weaker results than
autosomal chromosome analysis. The Y male sex-determining chromosome, as it is inherited only by males from
their fathers, is almost identical along the patrilineal line. This leads to a less precise analysis than if autosomal
chromosomes were tested, because of the random matching that occurs between pairs of chromosomes as
zygotes are being made.[5]
Mitochondrial analysis
Main article: Mitochondrial DNA
For highly degraded samples, it is sometimes impossible to get a complete profile of the 13 CODIS STRs. In these
situations, mitochondrial DNA (mtDNA) is sometimes typed due to there being many copies of mtDNA in a cell,
while there may only be 1-2 copies of the nuclear DNA. Forensic scientists amplify the HV1 and HV2 regions of the
mtDNA, and then sequence each region and compare single-nucleotide differences to a reference. Because mtDNA
is maternally inherited, directly linked maternal relatives can be used as match references, such as one's maternal
grandmother's daughter's son. In general, a difference of two or more nucleotides is considered to be an
exclusion. Heteroplasmy and poly-C differences may throw off straight sequence comparisons, so some expertise
on the part of the analyst is required. mtDNA is useful in determining clear identities, such as those of missing
people when a maternally linked relative can be found. mtDNA testing was used in determining that Anna
Anderson was not the Russian princess she had claimed to be, Anastasia Romanov.
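The comparison step can be sketched in Python as a count of positions at which two sequences differ, with the rule of thumb above (two or more differences) as the exclusion threshold. The sketch assumes the sequences are already aligned and of equal length, and it ignores heteroplasmy and poly-C regions, which is exactly where analyst expertise is needed. The sequences are made up.

```python
def nucleotide_differences(sample: str, reference: str) -> int:
    """Count positions at which two pre-aligned, equal-length sequences differ."""
    if len(sample) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(sample, reference))


def is_exclusion(sample: str, reference: str, threshold: int = 2) -> bool:
    """Apply the rule of thumb: threshold or more differences means exclusion."""
    return nucleotide_differences(sample, reference) >= threshold


# Hypothetical short stretches of a hypervariable region.
reference = "ACGTAC"
print(is_exclusion("ACGTAA", reference))  # False (one difference)
print(is_exclusion("AGGTAA", reference))  # True (two differences)
```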
mtDNA can be obtained from such material as hair shafts and old bones/teeth. Control mechanism based on
interaction point with data. This is determined by tooled placement in sample.
DNA databases
Main article: National DNA database
An early application of a DNA database was the compilation of A Mitochondrial DNA Concordance,[6] prepared by
Kevin W. P. Miller and John L. Dawson at the University of Cambridge from 1996 to 1998[7] from data collected as
part of Miller's PhD thesis. There are now several DNA databases in existence around the world. Some are private,
but most of the largest databases are government controlled. The United States maintains the largest DNA
database, with the Combined DNA Index System (CODIS) holding over 5 million records as of 2007.[8] The United
Kingdom maintains the National DNA Database (NDNAD), which is of similar size, despite the UK's smaller
population. The size of this database, and its rate of growth, is giving concern to civil liberties groups in the UK,
where police have wide-ranging powers to take samples and retain them even in the event of acquittal.[9]
The USA PATRIOT Act provides a means for the U.S. government to get DNA samples from other
countries if they[clarification needed] are either a division of or a head office of a company operating in the U.S. Under the act,
the American offices of the company cannot divulge to their subsidiaries/offices in other countries the reasons that
these DNA samples are sought or by whom.[citation needed]
When a search of a national DNA databank links a crime scene to an offender who has provided a DNA
sample to the databank, that link is often referred to as a cold hit. A cold hit is of value in referring the police agency to
a specific suspect but is of less evidential value than a DNA match made from outside the DNA databank.[10]
FBI agents cannot legally store DNA of a person not convicted of a crime. DNA collected from a suspect not later
convicted must be disposed of and not entered into the database. In 1998, a man residing in the UK was arrested
on accusation of burglary. His DNA was taken and tested, and he was later released. Nine months later, this man's
DNA was accidentally and illegally entered in the DNA database. New DNA is automatically compared to the DNA
found at cold cases and, in this case, this man was found to be a match to DNA found at a rape and assault case
one year earlier. The government then prosecuted him for these crimes. During the trial the DNA match was
requested to be removed from the evidence because it had been illegally entered into the database. The request
was carried out.[11]
The DNA collected from victims of rape is often stored for years until it is matched with the perpetrator's, usually when
the perpetrator commits another crime. In 2014, Congress extended a bill that helps states deal with "a backlog" of unexamined
evidence.[12]
In the case of the Phantom of Heilbronn, police detectives found DNA traces from the same woman at various
crime scenes in Austria, Germany, and France, among them murders, burglaries and robberies. Only after the DNA
of the "woman" matched the DNA sampled from the burned body of a male asylum seeker in France did detectives
begin to have serious doubts about the DNA evidence. In that case, DNA traces were already present on the cotton
swabs used to collect the samples at the crime scene, and the swabs had all been produced at the same factory in
Austria. The company's product specification said that the swabs were guaranteed to be sterile, but not DNA-free.
tree is populated from information gathered from public records and criminal justice records. Investigators rule out
family members' involvement in the crime by finding excluding factors such as sex, living out of state or being
incarcerated when the crime was committed. They may also use other leads from the case, such as witness or
victim statements, to identify a suspect. Once a suspect has been identified, investigators seek to legally obtain a
DNA sample from the suspect. This suspect DNA profile is then compared to the sample found at the crime scene to
definitively identify the suspect as the source of the crime scene DNA.
Familial DNA database searching was first used in an investigation leading to the conviction of Craig Harman of
manslaughter in the United Kingdom on April 19, 2004.[21] Craig Harman was convicted using familial DNA because
of the partial matches from Harman's brother. When the police questioned Harman's brother, the police noticed
Harman lived very close to the original crime scene. Harman confessed when his DNA matched the DNA found
on the brick.[22] Currently, familial DNA database searching is not conducted on a national level in the
United States. States determine their own policies and decision making processes for how and when to conduct
familial searches. The first familial DNA search and subsequent conviction in the United States was conducted
in Denver, Colorado, in 2008 using software developed under the leadership of Denver District Attorney Mitch
Morrissey and Denver Police Department Crime Lab Director Gregg LaBerge.[23] California was the first state to
implement a policy for familial searching under then Attorney General, now Governor, Jerry Brown.[24] In his role as
consultant to the Familial Search Working Group of the California Department of Justice, former Alameda County
Prosecutor Rock Harmon is widely considered to have been the catalyst in the adoption of familial search
technology in California. The technique was used to catch the Los Angeles serial killer known as the Grim Sleeper
in 2010.[25] It wasn't a witness or informant that tipped off law enforcement to the identity of the "Grim Sleeper" serial
killer, who had eluded police for more than two decades, but DNA from the suspect's own son. The suspect's son
was arrested and convicted on a felony weapons charge and swabbed for DNA the previous year. When his DNA was
entered into the database of convicted felons, detectives were alerted to a partial match to evidence found at the
"Grim Sleeper" crime scenes. Lonnie David Franklin Jr., also known as the Grim Sleeper, was charged with ten counts of
murder and one count of attempted murder.[26] More recently, familial DNA led to the arrest of 21-year-old Elvis
Garcia on charges of sexual assault and false imprisonment of a woman in Santa Cruz in 2008.[27] In March 2011
Virginia Governor Bob McDonnell announced that Virginia would begin using familial DNA searches.[28] Other states
are expected to follow.
At a press conference in Virginia on March 7, 2011, regarding the East Coast Rapist, Prince William County
prosecutor Paul Ebert and Fairfax County Police Detective John Kelly said the case would have been solved years
ago if Virginia had used familial DNA searching. Aaron Thomas, the suspected East Coast Rapist, was arrested in
connection with the rape of 17 women from Virginia to Rhode Island, but familial DNA was not used in the case.[29]
Critics of familial DNA database searches argue that the technique is an invasion of an individual's Fourth
Amendment rights.[30] Privacy advocates are petitioning for DNA database restrictions, arguing that the only fair way
to search for possible DNA matches to relatives of offenders or arrestees would be to have a population-wide DNA
database.[11] Some scholars have pointed out that the privacy concerns surrounding familial searching are similar in
some respects to other police search techniques,[31] and most have concluded that the practice is
constitutional.[32] The Ninth Circuit Court of Appeals in United States v. Pool (vacated as moot) suggested that this
practice is somewhat analogous to a witness looking at a photograph of one person and stating that it looked like
the perpetrator, which leads law enforcement to show the witness photos of similar looking individuals, one of whom
is identified as the perpetrator.[33] Regardless of whether familial DNA searching was the method used to identify the
suspect, authorities always conduct a normal DNA test to match the suspect's DNA with the DNA left at the
crime scene.
Critics also claim that racial profiling could occur on account of familial DNA testing. In the United States, the
conviction rates of racial minorities are much higher than those of the overall population. It is unclear whether this is
due to discrimination from police officers and the courts, as opposed to a simple higher rate of offence among
minorities. Arrest-based databases, which are found in the majority of the United States, lead to an even greater
level of racial discrimination. An arrest, as opposed to conviction, relies much more heavily on police discretion.[11]
For instance, investigators with the Denver District Attorney's Office successfully identified a suspect in a property theft
case using a familial DNA search. In this example, the suspect's blood left at the scene of the crime strongly
resembled that of a current Colorado Department of Corrections prisoner.[34] Using publicly available records, the
investigators created a family tree. They then eliminated all the family members who were incarcerated at the time
of the offense, as well as all of the females (the crime scene DNA profile was that of a male). Investigators obtained
a court order to collect the suspect's DNA, but the suspect actually volunteered to come to a police station and give
a DNA sample. After providing the sample, the suspect walked free without further interrogation or detainment. Later
confronted with an exact match to the forensic profile, the suspect pled guilty to criminal trespass at the first court
date and was sentenced to two years' probation.
In Italy, a familial DNA search was used to solve the murder of Yara Gambirasio, whose body was
found in the bush three months after her disappearance. A DNA trace was found on the underwear of the murdered
teenager, and a DNA sample was requested from a person who lived near the municipality of Brembate di
Sopra; a common male ancestor was found in the DNA sample of a young man not involved in the murder. After
a long investigation, the father of the supposed killer was identified as Giuseppe Guerinoni, a deceased man, but his
two sons born to his wife were not related to the DNA samples found on Yara's body. After three and a half
years, the DNA found on the underwear of the deceased girl was matched with Massimo Giuseppe Bossetti, who was
arrested and accused of the murder of the 13-year-old girl. Bossetti is in jail awaiting trial.
Partial matches
Partial DNA matches are not searches themselves, but are the result of moderate stringency CODIS searches that
produce a potential match that shares at least one allele at every locus.[35] Partial matching does not involve the use
of familial search software, such as that used in the UK and United States, or additional Y-STR analysis, and
therefore often misses sibling relationships. Partial matching has been used to identify suspects in several cases in
the UK and United States,[36] and has also been used as a tool to exonerate the falsely accused. Darryl Hunt was
wrongly convicted in connection with the rape and murder of a young woman in 1984 in North Carolina.[37] Hunt was
exonerated in 2004 when a DNA database search produced a remarkably close match between a convicted felon
and the forensic profile from the case. The partial match led investigators to the felon's brother, Willard E. Brown,
who confessed to the crime when confronted by police. A judge then signed an order to dismiss the case against
Hunt.
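The partial-match criterion described above (at least one shared allele at every locus) can be sketched as follows. This is a hedged illustration, not how CODIS is implemented: a profile is modelled here as a dictionary from locus name to a pair of allele values, the function name is hypothetical, and all allele values are invented.

```python
from typing import Dict, Tuple

Profile = Dict[str, Tuple[float, float]]


def shares_allele_at_every_locus(a: Profile, b: Profile) -> bool:
    """Return True if the two profiles share at least one allele at each locus.

    Assumption for this sketch: only loci typed in both profiles are compared,
    and two profiles with no locus in common are never reported as a match.
    """
    common_loci = a.keys() & b.keys()
    if not common_loci:
        return False
    return all(set(a[locus]) & set(b[locus]) for locus in common_loci)


# Invented allele values at four loci, for illustration only.
forensic = {"D3S1358": (15, 17), "vWA": (16, 18), "FGA": (21, 24), "TH01": (6, 9.3)}
offender = {"D3S1358": (15, 16), "vWA": (18, 19), "FGA": (21, 22), "TH01": (7, 9.3)}
unrelated = {"D3S1358": (14, 16), "vWA": (18, 19), "FGA": (21, 22), "TH01": (7, 9.3)}

print(shares_allele_at_every_locus(forensic, offender))   # one shared allele per locus
print(shares_allele_at_every_locus(forensic, unrelated))  # nothing shared at D3S1358
```

A result of True here only flags a candidate for follow-up; as the Hunt case shows, the person in the database may be a relative of the true source rather than the source.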
cases where the match probability in relation to all the samples tested is so great that the judge would consider its
probative value to be minimal and decide to exclude the evidence in the exercise of his discretion, but this gives rise
to no new question of principle and can be left for decision on a case by case basis. However, the fact that there
exists in the case of all partial profile evidence the possibility that a "missing" allele might exculpate the accused
altogether does not provide sufficient grounds for rejecting such evidence. In many cases there is a possibility (at least in
theory) that evidence that would assist the accused and perhaps even exculpate him altogether exists, but that does
not provide grounds for excluding relevant evidence that is available and otherwise admissible, though it does make
it important to ensure that the jury are given sufficient information to enable them to evaluate that evidence
properly.[47]
There are state laws on DNA profiling in all 50 states of the United States.[48] Detailed information on database laws
in each state can be found at the National Conference of State Legislatures website.[49]
Cases
In 1986, Richard Buckland was exonerated, despite having admitted to the rape and murder of a teenager
near Leicester, the city where DNA profiling was first developed. This was the first use of DNA fingerprinting in
a criminal investigation.[52]
In 1987, in the same case as Buckland, British baker Colin Pitchfork was the first criminal caught and
convicted using DNA fingerprinting.[53]
In 1987, genetic fingerprinting was used in criminal court for the first time in the trial of a man accused of
unlawful intercourse with a mentally handicapped 14-year-old female who gave birth to a baby.[54]
In 1987, Florida rapist Tommie Lee Andrews was the first person in the United States to be convicted as a result
of DNA evidence, for raping a woman during a burglary; he was convicted on November 6, 1987, and
sentenced to 22 years in prison.[55][56]
In 1988, Timothy Wilson Spencer was the first man in Virginia to be sentenced to death on the basis of DNA testing, for
several rape and murder charges. He was dubbed "The South Side Strangler" because he killed victims on the
south side of Richmond, Virginia. He was executed on April 27, 1994. David Vasquez, initially convicted of one of
Spencer's crimes, became the first man in America exonerated based on DNA evidence.
In 1989, Chicago man Gary Dotson was the first person whose conviction was overturned using DNA evidence.
In 1991, Allan Legere was the first Canadian to be convicted as a result of DNA evidence, for four murders he
had committed while an escaped prisoner in 1989. During his trial, his defense argued that the relatively shallow
gene pool of the region could lead to false positives.
In 1992, DNA evidence was used to prove that Nazi doctor Josef Mengele was buried in Brazil under the name
Wolfgang Gerhard.
In 1992, DNA from a palo verde tree was used to convict Mark Alan Bogan of murder. DNA from seed pods of a
tree at the crime scene was found to match that of seed pods found in Bogan's truck. This is the first instance of
plant DNA admitted in a criminal case.[57][58][59]
In 1993, Kirk Bloodsworth was the first person to have been convicted of murder and sentenced to death,
whose conviction was overturned using DNA evidence.
The 1993 rape and murder of Mia Zapata, lead singer for the Seattle punk band The Gits, was still unsolved nine
years after the murder. A database search in 2001 failed, but the killer's DNA was collected when he was
arrested in Florida for burglary and domestic abuse in 2002.
The science was made famous in the United States in 1994 when prosecutors heavily relied on DNA evidence
allegedly linking O. J. Simpson to a double murder. The case also brought to light the laboratory difficulties and
handling procedure mishaps that can cause such evidence to be significantly doubted.
In 1994, Royal Canadian Mounted Police (RCMP) detectives successfully tested hairs from a cat known
as Snowball, and used the test to link a man to the murder of his wife, thus marking for the first time in forensic
history the use of non-human DNA to identify a criminal (except for the plant DNA in the 1992 Mark Alan Bogan
case above).
In 1994, the claim that Anna Anderson was Grand Duchess Anastasia Nikolaevna of Russia was tested after
her death using samples of her tissue that had been stored at a Charlottesville, Virginia hospital following a
medical procedure. The tissue was tested using DNA fingerprinting, and showed that she bore no relation to
the Romanovs.[60]
In 1994, Earl Washington, Jr., of Virginia had his death sentence commuted to life imprisonment a week before
his scheduled execution date based on DNA evidence. He received a full pardon in 2000 based on more
advanced testing.[61] His case is often cited by opponents of the death penalty.
In 1995, the British Forensic Science Service carried out its first mass intelligence DNA screening in the
investigation of the Naomi Smith murder case.
In 1998, Richard J. Schmidt was convicted of attempted second-degree murder when it was shown that there
was a link between the viral DNA of the human immunodeficiency virus (HIV) he had been accused of injecting
into his girlfriend and viral DNA from one of his patients with AIDS. This was the first time viral DNA fingerprinting
had been used as evidence in a criminal trial.
In 1999, Raymond Easton, a disabled man from Swindon, England, was arrested and detained for seven hours
in connection with a burglary. He was released due to an inaccurate DNA match. His DNA had been retained on
file after an unrelated domestic incident some time previously.[62]
In 2000, DNA profiling proved Frank Lee Smith innocent of the murder of an eight-year-old girl after
he had spent 14 years on death row in Florida, USA. However, he had died of cancer just before his innocence was
proven.[63] In view of this, the Florida state governor ordered that in future any death row inmate claiming
innocence should have DNA testing.[61]
In May 2000 Gordon Graham murdered Paul Gault at his home in Lisburn, Northern Ireland. Graham was
convicted of the murder when his DNA was found on a sports bag left in the house as part of an elaborate ploy
to suggest the murder occurred after a burglary had gone wrong. Graham was having an affair with the victim's
wife at the time of the murder. It was the first time Low Copy Number DNA was used in Northern Ireland.[64]
In 2001, Wayne Butler was convicted for the murder of Celia Douty. It was the first murder in Australia to be
solved using DNA profiling.[65][66]
In 2002, the body of James Hanratty, hanged in 1962 for the "A6 murder", was exhumed and DNA samples
from the body and members of his family were analysed. The results convinced Court of Appeal judges that
Hanratty's guilt, which had been strenuously disputed by campaigners, was proved "beyond doubt".[67] Paul Foot
and some other campaigners continued to believe in Hanratty's innocence and argued that the DNA evidence
could have been contaminated, noting that the small DNA samples from items of clothing, kept in a police
laboratory for over 40 years "in conditions that do not satisfy modern evidential standards", had had to be
subjected to very new amplification techniques in order to yield any genetic profile.[68] However, no DNA other
than Hanratty's was found on the evidence tested, contrary to what would have been expected had the
evidence indeed been contaminated.[69]
In 2002, DNA testing was used to exonerate Douglas Echols, a man who was wrongfully convicted in a 1986
rape case. Echols was the 114th person to be exonerated through post-conviction DNA testing.
In August 2002, Annalisa Vincenzi was shot dead in Tuscany. Bartender Peter Hamkin, 23, was arrested,
in Merseyside, in March 2003 on an extradition warrant heard at Bow Street Magistrates' Court in London to
establish whether he should be taken to Italy to face a murder charge. DNA "proved" he shot her, but he was
cleared on other evidence.[70]
In 2003, Welshman Jeffrey Gafoor was convicted of the 1988 murder of Lynette White, when crime scene
evidence collected 12 years earlier was re-examined using STR techniques, resulting in a match with his
nephew.[71] This may be the first known example of the DNA of an innocent yet related individual being used to
identify the actual criminal, via "familial searching".
In March 2003, Josiah Sutton was released from prison after serving four years of a twelve-year sentence for a
sexual assault charge. Questionable DNA samples taken from Sutton were retested in the wake of the Houston
Police Department's crime lab scandal of mishandling DNA evidence.
In June 2003, because of new DNA evidence, Dennis Halstead, John Kogut and John Restivo won a re-trial on
their murder conviction; their convictions were struck down and they were released.[72] The three men had
already served eighteen years of their thirty-plus-year sentences.
The trial of Robert Pickton (convicted in December 2003) is notable in that DNA evidence was used
primarily to identify the victims, and in many cases to prove their existence.
In 2004, DNA testing shed new light on the mysterious 1912 disappearance of Bobby Dunbar, a four-year-old
boy who vanished during a fishing trip. He was allegedly found alive eight months later in the custody of William
Cantwell Walters, but another woman claimed that the boy was her son, Bruce Anderson, whom she had
entrusted to Walters' custody. The courts disbelieved her claim and convicted Walters for the kidnapping. The
boy was raised and known as Bobby Dunbar throughout the rest of his life. However, DNA tests on Dunbar's
son and nephew revealed the two were not related, thus establishing that the boy found in 1912 was not Bobby
Dunbar, whose real fate remains unknown.[73]
In 2005, Gary Leiterman was convicted of the 1969 murder of Jane Mixer, a law student at the University of
Michigan, after DNA found on Mixer's pantyhose was matched to Leiterman. DNA in a drop of blood on Mixer's
hand was matched to John Ruelas, who was only four years old in 1969 and was never successfully connected
to the case in any other way. Leiterman's defense unsuccessfully argued that the unexplained match of the
blood spot to Ruelas pointed to cross-contamination and raised doubts about the reliability of the lab's
identification of Leiterman.[74][75][76]
In December 2005, Robert Clark was proven innocent of a 1981 attack on an Atlanta woman after serving
twenty-four years in prison. Clark was the 164th person in the United States and the fifth in Georgia to be freed
using post-conviction DNA testing.
In March 2009, Sean Hodgson, who had spent 27 years in jail after being convicted of killing Teresa De Simone, 22, in her car
in Southampton 30 years before, was released by senior judges. Tests proved that DNA from the scene was not his.
British police then reopened the case.
In November 2008, Anthony Curcio was arrested for masterminding one of the most elaborately planned
armored car heists in history. DNA evidence linked Curcio to the crime.[77]
In addition, when proteins are being made, the double helix unwinds to allow a single strand of DNA to serve as a
template. This template strand is then transcribed into mRNA, which is a molecule that conveys vital instructions to
the cell's protein-making machinery.
Genetic code
"Codon" redirects here. For the plant genus, see Codon (genus).
A series of codons in part of a messenger RNA (mRNA) molecule. Each codon consists of three nucleotides, usually
corresponding to a single amino acid. The nucleotides are abbreviated with the letters A, U, G and C. This is mRNA, which uses
U (uracil). DNA uses T (thymine) instead. This mRNA molecule will instruct a ribosome to synthesize a protein according to this
code.
The genetic code is the set of rules by which information encoded within genetic material
(DNA or mRNA sequences) is translated into proteins by living cells. Biological decoding is accomplished by
the ribosome, which links amino acids in an order specified by mRNA, using transfer RNA (tRNA) molecules to carry
amino acids and to read the mRNA three nucleotides at a time. The genetic code is highly similar among all
organisms and can be expressed in a simple table with 64 entries.
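The 64-entry table mentioned above can be written down compactly. The sketch below builds the standard code as a Python dictionary from RNA codon to one-letter amino acid code, with `*` marking stop codons; the variable names are illustrative choices, and variant codes such as the mitochondrial one are not covered.

```python
from itertools import product

BASES = "UCAG"

# One-letter amino acid codes for all 64 codons, ordered by first, second and
# third base, each running through U, C, A, G. "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base U
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

print(len(CODON_TABLE))                             # number of table entries
print(CODON_TABLE["AUG"], CODON_TABLE["UUU"])       # methionine, phenylalanine
print(STOP_CODONS)
```

The table has 61 sense codons for 20 amino acids plus three stop codons, which is where the redundancy discussed later in the article comes from.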
The code defines how sequences of these nucleotide triplets, called codons, specify which amino acid will be added
next during protein synthesis. With some exceptions,[1] a three-nucleotide codon in a nucleic acid sequence specifies
a single amino acid. Because the vast majority of genes are encoded with exactly the same code (see the RNA
codon table), this particular code is often referred to as the canonical or standard genetic code, or simply the genetic
code, though in fact some variant codes have evolved. For example, protein synthesis in human mitochondria relies
on a genetic code that differs from the standard genetic code.
While the genetic code determines the protein sequence for a given coding region, other genomic regions can
influence when and where these proteins are produced.
Discovery
Serious efforts to understand how proteins are encoded began after the structure of DNA was discovered in
1953. George Gamow postulated that sets of three bases must be employed to encode the 20 standard amino acids
used by living cells to build proteins. With four different nucleotides, a code of 2 nucleotides would allow for only a
maximum of 4² = 16 amino acids. A code of 3 nucleotides could code for a maximum of 4³ = 64 amino acids.[2]
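Gamow's counting argument is easy to check numerically. The short sketch below (function and variable names are illustrative) counts the distinct code words available for each code-word length and finds the shortest length that can cover 20 amino acids.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
AMINO_ACID_COUNT = 20


def codeword_count(length: int) -> int:
    """Number of distinct nucleotide words of the given length (4 ** length)."""
    return len(NUCLEOTIDES) ** length


for length in (1, 2, 3):
    print(length, codeword_count(length))

# Shortest word length with at least as many words as amino acids.
shortest = next(n for n in range(1, 10) if codeword_count(n) >= AMINO_ACID_COUNT)
print(shortest)

# Cross-check by enumerating every triplet explicitly.
triplets = ["".join(t) for t in product(NUCLEOTIDES, repeat=3)]
```

Doublets give only 16 words, fewer than the 20 amino acids, so three is the minimum word length.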
The Crick, Brenner et al. experiment first demonstrated that codons consist of three DNA bases; Marshall
Nirenberg and Heinrich J. Matthaei were the first to elucidate the nature of a codon in 1961 at the National Institutes
of Health. They used a cell-free system to translate a poly-uracil RNA sequence (i.e., UUUUU...) and discovered
that the polypeptide that they had synthesized consisted of only the amino acid phenylalanine.[3] They thereby
deduced that the codon UUU specified the amino acid phenylalanine. This was followed by experiments in Severo
Ochoa's laboratory that demonstrated that the poly-adenine RNA sequence (AAAAA...) coded for the polypeptide
poly-lysine[4] and that the poly-cytosine RNA sequence (CCCCC...) coded for the polypeptide polyproline.[5]
Therefore the codon AAA specified the amino acid lysine, and the codon CCC specified the amino
acid proline. Using different copolymers most of the remaining codons were then determined. Subsequent work
by Har Gobind Khorana identified the rest of the genetic code. Shortly thereafter, Robert W. Holley determined the
structure of transfer RNA (tRNA), the adapter molecule that facilitates the process of translating RNA into protein.
This work was based upon earlier studies by Severo Ochoa, who received the Nobel Prize in Physiology or
Medicine in 1959 for his work on the enzymology of RNA synthesis.[6]
Extending this work, Nirenberg and Philip Leder revealed the triplet nature of the genetic code and deciphered the
codons of the standard genetic code. In these experiments, various combinations of mRNA were passed through a
filter that contained ribosomes, the components of cells that translate RNA into protein. Unique triplets promoted the
binding of specific tRNAs to the ribosome. Leder and Nirenberg were able to determine the sequences of 54 out of
64 codons in their experiments.[7] In 1968, Khorana, Holley and Nirenberg received the Nobel Prize in Physiology or
Medicine for their work.[8]
Salient features
Sequence reading frame
A codon is defined by the initial nucleotide from which translation starts. For example, the string GGGAAACCC, if
read from the first position, contains the codons GGG, AAA, and CCC; and, if read from the second position, it
contains the codons GGA and AAC; if read starting from the third position, GAA and ACC. Every sequence can,
thus, be read in three reading frames, each of which will produce a different amino acid sequence (in the given
example, Gly-Lys-Pro, Gly-Asn, or Glu-Thr, respectively). With double-stranded DNA, there are six possible reading
frames, three in the forward orientation on one strand and three reverse on the opposite strand.[9]:330 The actual
frame in which a protein sequence is translated is defined by a start codon, usually the first AUG codon in the
mRNA sequence.
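The reading-frame example above can be reproduced with a short sketch. The helper names are hypothetical, the small lookup table covers only the seven codons that occur in the example, and the reverse complement uses DNA letters because the six-frame statement is about double-stranded DNA.

```python
from typing import List

# Only the codons needed for the worked example in the text.
EXAMPLE_CODE = {
    "GGG": "Gly", "AAA": "Lys", "CCC": "Pro",
    "GGA": "Gly", "AAC": "Asn", "GAA": "Glu", "ACC": "Thr",
}

DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def codons_in_frame(seq: str, offset: int) -> List[str]:
    """Split seq into complete triplets, starting at offset 0, 1 or 2."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.translate(DNA_COMPLEMENT)[::-1]


def six_frames(dna: str) -> List[List[str]]:
    """Three forward frames followed by three frames on the opposite strand."""
    opposite = reverse_complement(dna)
    return ([codons_in_frame(dna, k) for k in range(3)]
            + [codons_in_frame(opposite, k) for k in range(3)])


seq = "GGGAAACCC"
for offset in range(3):
    codons = codons_in_frame(seq, offset)
    print(offset + 1, codons, "-".join(EXAMPLE_CODE[c] for c in codons))
```

Incomplete trailing bases are simply dropped, which is why the second and third frames of the nine-base example contain only two codons each.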
Start/stop codons
Translation starts with a chain initiation codon or start codon. Unlike stop codons, the codon alone is not sufficient to
begin the process. Nearby sequences such as the Shine-Dalgarno sequence in E. coli and initiation factors are also
required to start translation. The most common start codon is AUG, which is read as methionine or, in bacteria,
as formylmethionine. Alternative start codons depending on the organism include "GUG" or "UUG"; these codons
normally represent valine and leucine, respectively, but as start codons they are translated as methionine or
formylmethionine.[10]
The three stop codons have been given names: UAG is amber, UGA is opal (sometimes also called umber), and
UAA is ochre. "Amber" was named by discoverers Richard Epstein and Charles Steinberg after their friend Harris
Bernstein, whose last name means "amber" in German.[11] The other two stop codons were named "ochre" and
"opal" in order to keep the "color names" theme. Stop codons are also called "termination" or "nonsense" codons.
They signal release of the nascent polypeptide from the ribosome because there is no cognate tRNA that has
anticodons complementary to these stop signals, and so a release factor binds to the ribosome instead.[12]
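As a rough sketch of how start and stop codons delimit a translated region, the function below scans an mRNA string for the first AUG and translates triplets until it meets a stop codon. It is deliberately simplified: it ignores the sequence context (such as the Shine-Dalgarno sequence) and the initiation factors that the text says are also required, it does not handle alternative start codons, and the example sequence is made up.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base U
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)
# Standard genetic code: RNA codon -> one-letter amino acid, "*" for stop.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

STOP_NAMES = {"UAG": "amber", "UGA": "opal", "UAA": "ochre"}


def translate_from_first_aug(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # a release factor would bind here instead of a tRNA
        protein.append(amino_acid)
    return "".join(protein)


print(translate_from_first_aug("GGAUGGCCUUUUAAGG"))  # AUG GCC UUU UAA
```

If no in-frame stop codon is present, this sketch simply translates to the end of the string.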
Effect of mutations
During the process of DNA replication, errors occasionally occur in the polymerization of the second strand. These
errors, called mutations, can have an impact on the phenotype of an organism, especially if they occur within the
protein coding sequence of a gene. Error rates are usually very low (1 error in every 10–100 million bases) due to
the "proofreading" ability of DNA polymerases.[14][15]
Missense mutations and nonsense mutations are examples of point mutations, which can cause genetic diseases
such as sickle-cell disease and thalassemia respectively.[16][17][18] Clinically important missense mutations generally
change the properties of the coded amino acid residue between being basic, acidic, polar or non-polar, whereas
nonsense mutations result in a stop codon.[9]:266
Mutations that disrupt the reading frame sequence by indels (insertions or deletions) of a non-multiple of 3
nucleotide bases are known as frameshift mutations. These mutations usually result in a completely different
translation from the original, and are also very likely to cause a stop codon to be read, which truncates the creation
of the protein.[19] These mutations may impair the function of the resulting protein, and are thus rare in in
vivo protein-coding sequences. One reason inheritance of frameshift mutations is rare is that, if the protein being
translated is essential for growth under the selective pressures the organism faces, absence of a functional protein
may cause death before the organism is viable.[20] Frameshift mutations may result in severe genetic diseases such
as Tay-Sachs disease.[21]
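The mutation classes described above can be illustrated with a small sketch. The function names and the short sequences are made up for illustration; the classifier assumes the original codon is a sense codon, and the frameshift demonstration translates from the first base rather than searching for a start codon.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base U
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)
# Standard genetic code: RNA codon -> one-letter amino acid, "*" for stop.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, mutant: str) -> str:
    """Classify a point mutation within a sense codon."""
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "silent"
    if after == "*":
        return "nonsense"
    return "missense"


def translate(mrna: str) -> str:
    """Translate from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def delete_bases(mrna: str, position: int, count: int = 1) -> str:
    """Remove count bases starting at the given 0-based position."""
    return mrna[:position] + mrna[position + count:]


print(classify_substitution("GAG", "GUG"))  # Glu -> Val
print(classify_substitution("GAA", "GAG"))  # Glu -> Glu
print(classify_substitution("CAG", "UAG"))  # Gln -> stop

original = "AUGCUGAACCGU"                      # AUG CUG AAC CGU
print(translate(original))
print(translate(delete_bases(original, 3)))     # one base lost: frame shifts
print(translate(delete_bases(original, 3, 3)))  # three bases lost: frame kept
```

In this made-up sequence the single-base deletion brings a UGA stop codon into frame right after the first codon, while removing a whole codon only drops one amino acid.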
Although most mutations that change protein sequences are harmful or neutral, some mutations have a beneficial
effect on an organism.[22] These mutations may enable the mutant organism to withstand particular environmental
stresses better than wild-type organisms, or reproduce more quickly. In these cases a mutation will tend to become
more common in a population through natural selection.[23] Viruses that use RNA as their genetic material have rapid
mutation rates,[24] which can be an advantage, since these viruses will evolve constantly and rapidly, and thus evade
the defensive responses of e.g. the human immune system.[25] In large populations of asexually reproducing
organisms, for example, E. coli, multiple beneficial mutations may co-occur. This phenomenon is called clonal
interference and causes competition among the mutations.[26]
Degeneracy
Main article: Codon degeneracy
Degeneracy is the redundancy of the genetic code. The genetic code has redundancy but no ambiguity (see
the codon tables below for the full correlation). For example, although codons GAA and GAG both specify glutamic
acid (redundancy), neither of them specifies any other amino acid (no ambiguity). The codons encoding one amino
acid may differ in any of their three positions. For example the amino acid leucine is specified by YUR or
CUN (UUA, UUG, CUU, CUC, CUA, or CUG) codons (difference in the first or third position indicated using IUPAC
notation), while the amino acid serine is specified by UCN or AGY (UCA, UCG, UCC, UCU, AGU, or AGC) codons
(difference in the first, second, or third position).[27]:521–522 A practical consequence of redundancy is that errors in the
third position of the triplet codon cause only a silent mutation or an error that would not affect the protein because
the hydrophilicity or hydrophobicity is maintained by equivalent substitution of amino acids; for example, a codon of
NUN (where N = any nucleotide) tends to code for hydrophobic amino acids. NCN yields amino acid residues that
are small in size and moderate in hydropathy; NAN encodes average size hydrophilic residues. The genetic code is
so well-structured for hydropathy that a mathematical analysis (Singular Value Decomposition) of 12 variables (4
nucleotides x 3 positions) yields a remarkable correlation (C = 0.95) for predicting the hydropathy of the encoded
amino acid directly from the triplet nucleotide sequence, without translation.[28][29] Note in the table, below, eight
amino acids are not affected at all by mutations at the third position of the codon, whereas in the figure above, a
mutation at the second position is likely to cause a radical change in the physicochemical properties of the encoded
amino acid.
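The compressed IUPAC notation used above (YUR, CUN, UCN, AGY) can be expanded mechanically. The following is a minimal sketch, not part of the cited sources; the helper name `expand` and the partial `IUPAC` dictionary (only the symbols used in this section) are illustrative assumptions.

```python
from itertools import product

# IUPAC nucleotide symbols used in this section (RNA alphabet); illustrative subset.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG",    # purine
    "Y": "CU",    # pyrimidine
    "M": "AC",    # amino
    "H": "ACU",   # not G
    "N": "ACGU",  # any nucleotide
}

def expand(pattern):
    """Return every concrete codon matched by a compressed IUPAC pattern."""
    return sorted("".join(p) for p in product(*(IUPAC[s] for s in pattern)))

leucine = set(expand("YUR")) | set(expand("CUN"))
serine = set(expand("UCN")) | set(expand("AGY"))
print(sorted(leucine))
print(sorted(serine))
```

Both sets contain six codons, which is the redundancy described for leucine and serine.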
Grouping of codons by amino acid residue molar volume and hydropathy. A more detailed version is available.
each other by hydrogen bonds in an arrangement known as base pairing. These bonds almost always form between
an adenine base on one strand and a thymine base on the other strand, or between a cytosine base on one strand
and a guanine base on the other. This means that the number of A and T bases will be the same in a given double
helix, as will the number of G and C bases.[27]:102–117 In RNA, thymine (T) is replaced by uracil (U), and the
deoxyribose is substituted by ribose.[27]:127
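The pairing rules can be illustrated with a short sketch. It assumes strands are written as plain strings in the 5' to 3' direction; the function names are hypothetical and chosen only for this example.

```python
# Watson-Crick pairing in DNA: A with T, G with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_strand(strand):
    """Return the partner strand, written 5' to 3' (hence the reversal)."""
    return "".join(PAIR[base] for base in reversed(strand))

def transcribe(coding_strand):
    """Write a DNA coding strand as RNA: thymine (T) becomes uracil (U)."""
    return coding_strand.replace("T", "U")

strand = "ATGCCGTA"
duplex = strand + complement_strand(strand)
# In the double helix the counts of A and T agree, as do G and C.
print(duplex.count("A") == duplex.count("T"), duplex.count("G") == duplex.count("C"))
print(transcribe(strand))
```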
Each protein-coding gene is transcribed into a molecule of the related RNA polymer. In prokaryotes, this RNA
functions as messenger RNA or mRNA; in eukaryotes, the transcript needs to be processed to produce a mature
mRNA. The mRNA is, in turn, translated on a ribosome into a chain of amino acids otherwise known as
a polypeptide.[27]:Chp 12 The process of translation requires transfer RNAs which are covalently attached to a specific
amino acid, guanosine triphosphate as an energy source, and a number of translation factors. tRNAs
have anticodons complementary to the codons in an mRNA and can be covalently "charged" with specific amino
acids at their 3' terminal CCA ends by enzymes known as aminoacyl tRNA synthetases, which have high specificity
for both their cognate amino acid and tRNA. The high specificity of these enzymes is a major reason why the fidelity
of protein translation is maintained.[27]:464–469
There are 4^3 = 64 different codon combinations possible with a triplet codon of three nucleotides; all 64 codons are
assigned to either an amino acid or a stop signal. If, for example, an RNA sequence UUUAAACCC is considered
and the reading frame starts with the first U (by convention, 5' to 3'), there are three codons, namely, UUU, AAA, and
CCC, each of which specifies one amino acid. Therefore, this 9 base RNA sequence will be translated into an amino
acid sequence that is three amino acids long.[27]:521–539 A given amino acid may be encoded by between one and six
different codon sequences. A comparison may be made using bioinformatics tools wherein the codon is similar to
a word, which is the standard data "chunk" and a nucleotide is similar to a bit, in that it is the smallest unit. This
allows for powerful comparisons across species as well as within organisms.
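The reading-frame example can be sketched in code. This is an illustration only: the compact 64-letter string is one common way to write the standard code (codons enumerated with bases in the order U, C, A, G; `*` marks a stop), and the function names are assumptions made for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order; "*" is a stop signal.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def split_codons(rna, frame=0):
    """Cut an RNA string (5' to 3') into complete triplets starting at offset `frame`."""
    usable = len(rna) - (len(rna) - frame) % 3
    return [rna[i:i + 3] for i in range(frame, usable, 3)]

def translate(rna, frame=0):
    """Translate every complete codon in the chosen reading frame."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(rna, frame))

print(split_codons("UUUAAACCC"))        # ['UUU', 'AAA', 'CCC']
print(translate("UUUAAACCC"))           # FKP (Phe, Lys, Pro)
print(translate("UUUAAACCC", frame=1))  # LN: shifting the frame changes the product
```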
The standard genetic code is shown in the following tables. Table 1 shows which amino acid each of the 64 codons
specifies. Table 2 shows which codons specify each of the 20 standard amino acids involved in translation. These
are called forward and reverse codon tables, respectively. For example, the codon "AAU" represents the amino
acid asparagine, and "UGU" and "UGC" represent cysteine (standard three-letter designations, Asn and Cys,
respectively).[27]:522
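1st base | 2nd base U | 2nd base C | 2nd base A | 2nd base G | 3rd base
U | UUU (Phe/F) Phenylalanine | UCU (Ser/S) Serine | UAU (Tyr/Y) Tyrosine | UGU (Cys/C) Cysteine | U
U | UUC (Phe/F) Phenylalanine | UCC (Ser/S) Serine | UAC (Tyr/Y) Tyrosine | UGC (Cys/C) Cysteine | C
U | UUA (Leu/L) Leucine | UCA (Ser/S) Serine | UAA Stop (Ochre) | UGA Stop (Opal) | A
U | UUG (Leu/L) Leucine | UCG (Ser/S) Serine | UAG Stop (Amber) | UGG (Trp/W) Tryptophan | G
C | CUU (Leu/L) Leucine | CCU (Pro/P) Proline | CAU (His/H) Histidine | CGU (Arg/R) Arginine | U
C | CUC (Leu/L) Leucine | CCC (Pro/P) Proline | CAC (His/H) Histidine | CGC (Arg/R) Arginine | C
C | CUA (Leu/L) Leucine | CCA (Pro/P) Proline | CAA (Gln/Q) Glutamine | CGA (Arg/R) Arginine | A
C | CUG (Leu/L) Leucine | CCG (Pro/P) Proline | CAG (Gln/Q) Glutamine | CGG (Arg/R) Arginine | G
A | AUU (Ile/I) Isoleucine | ACU (Thr/T) Threonine | AAU (Asn/N) Asparagine | AGU (Ser/S) Serine | U
A | AUC (Ile/I) Isoleucine | ACC (Thr/T) Threonine | AAC (Asn/N) Asparagine | AGC (Ser/S) Serine | C
A | AUA (Ile/I) Isoleucine | ACA (Thr/T) Threonine | AAA (Lys/K) Lysine | AGA (Arg/R) Arginine | A
A | AUG[A] (Met/M) Methionine | ACG (Thr/T) Threonine | AAG (Lys/K) Lysine | AGG (Arg/R) Arginine | G
G | GUU (Val/V) Valine | GCU (Ala/A) Alanine | GAU (Asp/D) Aspartic acid | GGU (Gly/G) Glycine | U
G | GUC (Val/V) Valine | GCC (Ala/A) Alanine | GAC (Asp/D) Aspartic acid | GGC (Gly/G) Glycine | C
G | GUA (Val/V) Valine | GCA (Ala/A) Alanine | GAA (Glu/E) Glutamic acid | GGA (Gly/G) Glycine | A
G | GUG (Val/V) Valine | GCG (Ala/A) Alanine | GAG (Glu/E) Glutamic acid | GGG (Gly/G) Glycine | G
[A] Footnote: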
The codon AUG both codes for methionine and serves as an initiation site: the first AUG in an mRNA's
coding region is where translation into protein begins.[30]
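Inverse table (compressed using IUPAC notation)
Amino acid | Codons | Compressed | Amino acid | Codons | Compressed
Ala/A | GCU, GCC, GCA, GCG | GCN | Leu/L | UUA, UUG, CUU, CUC, CUA, CUG | YUR, CUN
Arg/R | CGU, CGC, CGA, CGG, AGA, AGG | CGN, MGR | Lys/K | AAA, AAG | AAR
Asn/N | AAU, AAC | AAY | Met/M | AUG |
Asp/D | GAU, GAC | GAY | Phe/F | UUU, UUC | UUY
Cys/C | UGU, UGC | UGY | Pro/P | CCU, CCC, CCA, CCG | CCN
Gln/Q | CAA, CAG | CAR | Ser/S | UCU, UCC, UCA, UCG, AGU, AGC | UCN, AGY
Glu/E | GAA, GAG | GAR | Thr/T | ACU, ACC, ACA, ACG | ACN
Gly/G | GGU, GGC, GGA, GGG | GGN | Trp/W | UGG |
His/H | CAU, CAC | CAY | Tyr/Y | UAU, UAC | UAY
Ile/I | AUU, AUC, AUA | AUH | Val/V | GUU, GUC, GUA, GUG | GUN
START | AUG | | STOP | UAA, UGA, UAG | UAR, URA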
Genetic code logo of the Globobulimina pseudospinescens mitochondrial genome. The logo shows the 64 codons from left to
right, predicted alternatives in red (relative to the standard genetic code). Red line: stop codons. The height of each amino
acid in the stack shows how often it is aligned to the codon in homologous protein domains. The stack height indicates the
support for the prediction.
Since 2001, 40 non-natural amino acids have been added into protein by creating a unique codon (recoding)
and a corresponding transfer-RNA:aminoacyl tRNA-synthetase pair to encode it with diverse physicochemical
and biological properties in order to be used as a tool for exploring protein structure and function or to create
novel or enhanced proteins.[41][42]
H. Murakami and M. Sisido have extended some codons to have four and five bases. Steven A.
Benner constructed a functional 65th (in vivo) codon.[43]
Origin[edit]
If amino acids were randomly assigned to triplet codons, then there would be 1.5 x 10^84 possible genetic codes
to choose from.[44]:163 This number is found by calculating how many ways there are to place 21 items (20 amino
acids plus one stop) in 64 bins, wherein each item is used at least once. [1] The genetic code used by all known
forms of life is nearly universal with few minor variations. One could ask: Has all life on Earth descended from a
single bacterium that mutated to make the final optimization in the genetic code? Many hypotheses on the
evolutionary origins of the genetic code have been proposed.
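The figure of about 1.5 x 10^84 can be checked with a short calculation. The sketch below assumes the count is the number of ways to assign one of 21 meanings (20 amino acids plus stop) to each of 64 codons so that every meaning is used at least once, and evaluates it by inclusion-exclusion; the function name is illustrative.

```python
from math import comb

def count_codes(n_codons=64, n_meanings=21):
    """Number of assignments of meanings to codons in which every meaning occurs.

    Inclusion-exclusion: start from all n_meanings**n_codons assignments and
    alternately subtract and add those that leave out at least k meanings.
    """
    return sum(
        (-1) ** k * comb(n_meanings, k) * (n_meanings - k) ** n_codons
        for k in range(n_meanings + 1)
    )

total = count_codes()
print(f"{float(total):.1e}")
```

The integer arithmetic is exact; only the final formatting rounds the result.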
Four themes run through the many hypotheses about the evolution of the genetic code:[45]
Chemical principles govern specific RNA interaction with amino acids. Experiments with aptamers showed
that some amino acids have a selective chemical affinity for the base triplets that code for them.[46] Recent
experiments show that of the 8 amino acids tested, 6 show some RNA triplet-amino acid association.[44]:170[47]
Biosynthetic expansion. The standard modern genetic code grew from a simpler earlier code through a
process of "biosynthetic expansion". Here the idea is that primordial life "discovered" new amino acids (for
example, as by-products of metabolism) and later incorporated some of these into the machinery of genetic
coding. Although much circumstantial evidence has been found to suggest that fewer different amino acids
were used in the past than today,[48] precise and detailed hypotheses about which amino acids entered the
code in what order have proved far more controversial.[49][50]
Natural selection has led to codon assignments of the genetic code that minimize the effects
of mutations.[51] A recent hypothesis[52] suggests that the triplet code was derived from codes that used
longer than triplet codons (such as quadruplet codons). Longer than triplet decoding would have higher
degree of codon redundancy and would be more error resistant than the triplet decoding. This feature could
allow accurate decoding in the absence of highly complex translational machinery such as
the ribosome and before cells began making ribosomes.
Information channels: Information-theoretic approaches model the process of translating the genetic code
into corresponding amino acids as an error-prone information channel.[53] The inherent noise (that is, the
error) in the channel poses the organism with a fundamental question: how can a genetic code be
constructed to withstand the impact of noise[54] while accurately and efficiently translating information?
These rate-distortion models[55] suggest that the genetic code originated as a result of the interplay of the
three conflicting evolutionary forces: the needs for diverse amino-acids,[56] for error-tolerance[51] and for
minimal cost of resources. The code emerges at a coding transition when the mapping of codons to amino acids becomes nonrandom. The emergence of the code is governed by the topology defined by the
probable errors and is related to the map coloring problem.[57]
Transfer RNA molecules appear to have evolved before modern aminoacyl-tRNA synthetases, so the latter
cannot be part of the explanation of its patterns.[58]
Models encompassing aspects of two or more of the above themes have also been explored. For example,
models based on signaling games combine elements of game theory, natural selection and information
channels. Such models have been used to suggest that the first polypeptides were likely short and had some
use other than enzymatic function. Game theoretic models have also suggested that the organization of RNA
strings into cells may have been necessary to prevent "deceptive" use of the genetic code, i.e. preventing the
ancient equivalent of viruses from overwhelming the RNA world.[59]
The distribution of codon assignments in the genetic code is nonrandom.[60] For example, the genetic code
clusters certain amino acid assignments. Amino acids that share the same biosynthetic pathway tend to have
the same first base in their codons.[61] Amino acids with similar physical properties tend to have similar
codons,[62][63] reducing the problems caused by point mutations and mistranslations.[60] A robust hypothesis for the
origin of genetic code should also address or predict the following gross features of the codon table:[64]
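Genetic code
first | second: U | second: C | second: A | second: G | third
U | UUU - Phe | UCU - Ser | UAU - Tyr | UGU - Cys | U
U | UUC - Phe | UCC - Ser | UAC - Tyr | UGC - Cys | C
U | UUA - Leu | UCA - Ser | UAA - * | UGA - * | A
U | UUG - Leu | UCG - Ser | UAG - * | UGG - Trp | G
C | CUU - Leu | CCU - Pro | CAU - His | CGU - Arg | U
C | CUC - Leu | CCC - Pro | CAC - His | CGC - Arg | C
C | CUA - Leu | CCA - Pro | CAA - Gln | CGA - Arg | A
C | CUG - Leu | CCG - Pro | CAG - Gln | CGG - Arg | G
A | AUU - Ile | ACU - Thr | AAU - Asn | AGU - Ser | U
A | AUC - Ile | ACC - Thr | AAC - Asn | AGC - Ser | C
A | AUA - Ile | ACA - Thr | AAA - Lys | AGA - Arg | A
A | AUG - Met | ACG - Thr | AAG - Lys | AGG - Arg | G
G | GUU - Val | GCU - Ala | GAU - Asp | GGU - Gly | U
G | GUC - Val | GCC - Ala | GAC - Asp | GGC - Gly | C
G | GUA - Val | GCA - Ala | GAA - Glu | GGA - Gly | A
G | GUG - Val | GCG - Ala | GAG - Glu | GGG - Gly | G
(* = stop codon)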
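Abbreviation | Amino acid | Possible codons
Ala | Alanine | GCA, GCC, GCG, GCT
Asx | Asparagine or aspartic acid | AAC, AAT, GAC, GAT
Cys | Cysteine | TGC, TGT
Asp | Aspartic acid | GAC, GAT
Glu | Glutamic acid | GAA, GAG
Phe | Phenylalanine | TTC, TTT
Gly | Glycine | GGA, GGC, GGG, GGT
His | Histidine | CAC, CAT
Ile | Isoleucine | ATA, ATC, ATT
Lys | Lysine | AAA, AAG
Leu | Leucine | CTA, CTC, CTG, CTT, TTA, TTG
Met | Methionine | ATG
Asn | Asparagine | AAC, AAT
Pro | Proline | CCA, CCC, CCG, CCT
Gln | Glutamine | CAA, CAG
Arg | Arginine | AGA, AGG, CGA, CGC, CGG, CGT
Ser | Serine | AGC, AGT, TCA, TCC, TCG, TCT
Thr | Threonine | ACA, ACC, ACG, ACT
Val | Valine | GTA, GTC, GTG, GTT
Trp | Tryptophan | TGG
 | any codon | NNN
Tyr | Tyrosine | TAC, TAT
Glx | Glutamine or glutamic acid | CAA, CAG, GAA, GAG
 | stop codon | TAA, TAG, TGA
Legend (amino acid classes): small (Ala, Gly); acidic / amide; charged, negative (Asp, Glu); charged, positive (Lys, Arg); polar; hydrophobic; size; aliphatic; aromatic.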
DNA), and then transfer ribonucleic acid (tRNA), which is read 3' to 5'. tRNA is the taxi that translates the information on the ribosome
into an amino acid chain or polypeptide.
For mRNA there are 4^3 = 64 different nucleotide combinations possible with a triplet codon of three nucleotides. All 64 possible
combinations are shown in Table 1. However, not all 64 codons of the genetic code specify a single amino acid during translation. The
reason is that in humans only 20 amino acids (except selenocysteine) are involved in translation. Therefore, one amino acid can be
encoded by more than one mRNA codon-triplet. Arginine, leucine and serine are encoded by 6 triplets, isoleucine by 3, methionine and
tryptophan by 1, and all other amino acids by 4 or 2 codons. The redundant codons are typically different at the 3rd base. Table
2 shows the inverse codon assignment, i.e. which codon specifies which of the 20 standard amino acids involved in translation.
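The number of codons per amino acid can be tallied directly from the standard code. The following sketch is illustrative: the compact 64-letter string lists the code with bases in the order U, C, A, G and `*` for stop, and the variable names are assumptions of this example.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Codons per amino acid (one-letter codes), ignoring the stop signal "*".
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

# Group amino acids by how many codons specify them.
by_count = {}
for aa, n in sorted(degeneracy.items()):
    by_count.setdefault(n, []).append(aa)
for n in sorted(by_count, reverse=True):
    print(n, "codons:", " ".join(by_count[n]))
```

The six-codon group printed first contains L, R and S (leucine, arginine and serine); I (isoleucine) has three codons, and M and W (methionine, tryptophan) have one each.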
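Table 1. Genetic code: mRNA codon -> amino acid
1st Base | 2nd Base U | 2nd Base C | 2nd Base A | 2nd Base G | 3rd Base
U | UUU Phenylalanine | UCU Serine | UAU Tyrosine | UGU Cysteine | U
U | UUC Phenylalanine | UCC Serine | UAC Tyrosine | UGC Cysteine | C
U | UUA Leucine | UCA Serine | UAA Stop | UGA Stop | A
U | UUG Leucine | UCG Serine | UAG Stop | UGG Tryptophan | G
C | CUU Leucine | CCU Proline | CAU Histidine | CGU Arginine | U
C | CUC Leucine | CCC Proline | CAC Histidine | CGC Arginine | C
C | CUA Leucine | CCA Proline | CAA Glutamine | CGA Arginine | A
C | CUG Leucine | CCG Proline | CAG Glutamine | CGG Arginine | G
A | AUU Isoleucine | ACU Threonine | AAU Asparagine | AGU Serine | U
A | AUC Isoleucine | ACC Threonine | AAC Asparagine | AGC Serine | C
A | AUA Isoleucine | ACA Threonine | AAA Lysine | AGA Arginine | A
A | AUG Methionine (Start) | ACG Threonine | AAG Lysine | AGG Arginine | G
G | GUU Valine | GCU Alanine | GAU Aspartate | GGU Glycine | U
G | GUC Valine | GCC Alanine | GAC Aspartate | GGC Glycine | C
G | GUA Valine | GCA Alanine | GAA Glutamate | GGA Glycine | A
G | GUG Valine | GCG Alanine | GAG Glutamate | GGG Glycine | G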
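Amino acid | mRNA codons | Amino acid | mRNA codons
Ala/A | GCU, GCC, GCA, GCG | Leu/L | UUA, UUG, CUU, CUC, CUA, CUG
Arg/R | CGU, CGC, CGA, CGG, AGA, AGG | Lys/K | AAA, AAG
Asn/N | AAU, AAC | Met/M | AUG
Asp/D | GAU, GAC | Phe/F | UUU, UUC
Cys/C | UGU, UGC | Pro/P | CCU, CCC, CCA, CCG
Gln/Q | CAA, CAG | Ser/S | UCU, UCC, UCA, UCG, AGU, AGC
Glu/E | GAA, GAG | Thr/T | ACU, ACC, ACA, ACG
Gly/G | GGU, GGC, GGA, GGG | Trp/W | UGG
His/H | CAU, CAC | Tyr/Y | UAU, UAC
Ile/I | AUU, AUC, AUA | Val/V | GUU, GUC, GUA, GUG
START | AUG | STOP | UAA, UAG, UGA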
The direction of reading mRNA is 5' to 3'. tRNA (read 3' to 5') has anticodons complementary to the codons in mRNA and can be
"charged" covalently with an amino acid at its 3' terminus. According to Crick, strict base pairing between the mRNA codon
and the tRNA anticodon takes place only at the 1st and 2nd base. The binding at the 3rd base (i.e. at the 5' end of the tRNA anticodon)
is weaker and can result in different pairs. For codon and anticodon to bind at this position, the bases must wobble out of
their usual positions at the ribosome. Therefore, these base-pairs are sometimes called wobble-pairs.
Table 3 shows the possible wobble-pairs at the 1st, 2nd and 3rd base. The possible pair combinations at the 1st and 2nd base are
identical. At the 3rd base (i.e. at the 3' end of the mRNA codon and the 5' end of the tRNA anticodon) the possible pair combinations are
more ambiguous, which leads to the redundancy in mRNA. The deamination (removal of the amino group NH2) of adenosine (not to be
confused with adenine) produces inosine (I) on tRNA, which forms non-standard wobble-pairs with U, C or A (but not with G) on mRNA.
Inosine may occur at the 3rd base of the tRNA anticodon.
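Table 3. Base-pairs: mRNA codon -> tRNA anticodon
1st and 2nd place (1st = 5' end of the mRNA codon, 3' end of the tRNA anticodon)
mRNA codon | tRNA anticodon
A | U
U | A
G | C
C | G
3rd place (3' end of the mRNA codon, 5' end of the tRNA anticodon)
mRNA codon | tRNA anticodon
A or G | U
U or C | G
U, C or A | I
U | A
G | C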
Table 3 is read in the following way: for the 1st and 2nd base-pairs the wobble-pairs provide uniqueness in the way that U on tRNA
always emerges from A on mRNA, A on tRNA always emerges from U on mRNA, etc. For the 3rd base-pair the genetic code is
redundant in the way that U on tRNA can emerge from A or G on mRNA, G on tRNA can emerge from U or C on mRNA and I on tRNA
can emerge from U, C or A on mRNA. Only A and C at the 3rd place on tRNA are unambiguously assigned to U and G at the 3rd place
on mRNA, respectively.
Due to this combination structure a tRNA can bind to different mRNA codons where synonymous or redundant mRNA codons differ at
the 3rd base (i.e. at the 5' end of tRNA and the 3' end of mRNA). By this logic the minimum number of tRNA anticodons necessary to
encode all amino acids reduces to 31 (excluding the 2 STOP codons AUU and ACU, see Table 5). This means that any tRNA
anticodon can be encoded by one or more different mRNA codons (Table 4). However, there are more than 31 tRNA anticodons
possible for the translation of all 64 mRNA codons. For example, serine has a fourfold degenerate site at the 3rd position (UCU, UCC,
UCA, UCG), which can be translated by AGI (for UCU, UCC and UCA) and AGC on tRNA (for UCG) but also by AGG and AGU. This
means, in turn, that any mRNA codon can also be translated by one or more tRNA anticodons (see Table 5).
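The pairing rules described above can be turned into a small lookup. This is a sketch under the rules as stated in this section (strict pairing at the 1st and 2nd base, wobble at the 3rd); anticodons are written 3' to 5' as in Table 4, and the names `WOBBLE` and `codons_read_by` are hypothetical.

```python
# Strict Watson-Crick pairing for the 1st and 2nd codon positions.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Wobble position: base at the 5' end of the anticodon -> codon 3rd bases it can pair with.
WOBBLE = {"U": "AG", "G": "UC", "I": "UCA", "A": "U", "C": "G"}

def codons_read_by(anticodon_3_to_5):
    """List the mRNA codons (5' to 3') that an anticodon written 3' to 5' can read."""
    first, second, wobble = anticodon_3_to_5
    stem = COMPLEMENT[first] + COMPLEMENT[second]
    return [stem + third for third in WOBBLE[wobble]]

print(codons_read_by("AGI"))  # ['UCU', 'UCC', 'UCA'], the serine example above
print(codons_read_by("AGC"))  # ['UCG']
print(codons_read_by("AAG"))  # ['UUU', 'UUC'], phenylalanine
```

The sketch applies the chemistry only; it does not check that all codons read by one anticodon specify the same amino acid.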
The reason for the occurrence of different wobble-pairs encoding the same amino acid may be a compromise between velocity
and safety in protein synthesis. The redundancy of mRNA codons exists to prevent mistakes in transcription caused by mutations or
variations at the 3rd position but also at other positions. For example, the first position of the leucine codons (UUA, UUG, CUU, CUC,
CUA, CUG) is a twofold degenerate site, while the second position is unambiguous (not redundant). Another example is serine with
mRNA codons UCA, UCG, UCC, UCU, AGU, AGC. Serine is also twofold degenerate at the first position and fourfold
degenerate at the third position, but in addition it is twofold degenerate at the second position. Table 4 shows the assignment of mRNA
codons to any possible tRNA anticodon in eukaryotes for the 20 standard amino acids involved in translation. It is the reverse codon
assignment.
Table 4. Reverse amino acid encoding: amino acid -> tRNA anticodon -> mRNA codon
Amino acid | tRNA anticodon | mRNA codon
Phenylalanine | 3'-AAG-5' | 5'-UUU-3', 5'-UUC-3'
 | 3'-AAA-5' | 5'-UUU-3'
Leucine | 3'-AAU-5' | 5'-UUA-3', 5'-UUG-3'
 | 3'-AAC-5' | 5'-UUG-3'
 | 3'-GAI-5' | 5'-CUU-3', 5'-CUC-3', 5'-CUA-3'
 | 3'-GAG-5' | 5'-CUU-3', 5'-CUC-3'
 | 3'-GAU-5' | 5'-CUA-3', 5'-CUG-3'
 | 3'-GAA-5' | 5'-CUU-3'
 | 3'-GAC-5' | 5'-CUG-3'
Serine | 3'-AGI-5' | 5'-UCU-3', 5'-UCC-3', 5'-UCA-3'
 | 3'-AGG-5' | 5'-UCU-3', 5'-UCC-3'
 | 3'-AGU-5' | 5'-UCA-3', 5'-UCG-3'
 | 3'-AGA-5' | 5'-UCU-3'
 | 3'-AGC-5' | 5'-UCG-3'
 | 3'-UCG-5' | 5'-AGU-3', 5'-AGC-3'
 | 3'-UCA-5' | 5'-AGU-3'
Tyrosine | 3'-AUG-5' | 5'-UAU-3', 5'-UAC-3'
 | 3'-AUA-5' | 5'-UAU-3'
Cysteine | 3'-ACG-5' | 5'-UGU-3', 5'-UGC-3'
 | 3'-ACA-5' | 5'-UGU-3'
Tryptophan | 3'-ACC-5' | 5'-UGG-3'
Proline | 3'-GGI-5' | 5'-CCU-3', 5'-CCC-3', 5'-CCA-3'
 | 3'-GGG-5' | 5'-CCU-3', 5'-CCC-3'
 | 3'-GGU-5' | 5'-CCA-3', 5'-CCG-3'
 | 3'-GGA-5' | 5'-CCU-3'
 | 3'-GGC-5' | 5'-CCG-3'
Histidine | 3'-GUG-5' | 5'-CAU-3', 5'-CAC-3'
 | 3'-GUA-5' | 5'-CAU-3'
Glutamine | 3'-GUU-5' | 5'-CAA-3', 5'-CAG-3'
 | 3'-GUC-5' | 5'-CAG-3'
Arginine | 3'-GCI-5' | 5'-CGU-3', 5'-CGC-3', 5'-CGA-3'
 | 3'-GCG-5' | 5'-CGU-3', 5'-CGC-3'
 | 3'-GCU-5' | 5'-CGA-3', 5'-CGG-3'
 | 3'-GCA-5' | 5'-CGU-3'
 | 3'-GCC-5' | 5'-CGG-3'
 | 3'-UCU-5' | 5'-AGA-3', 5'-AGG-3'
 | 3'-UCC-5' | 5'-AGG-3'
Isoleucine | 3'-UAI-5' | 5'-AUU-3', 5'-AUC-3', 5'-AUA-3'
 | 3'-UAG-5' | 5'-AUU-3', 5'-AUC-3'
 | 3'-UAA-5' | 5'-AUU-3'
 | 3'-UAU-5' | 5'-AUA-3'
Methionine | 3'-UAC-5' | 5'-AUG-3'
Threonine | 3'-UGI-5' | 5'-ACU-3', 5'-ACC-3', 5'-ACA-3'
 | 3'-UGG-5' | 5'-ACU-3', 5'-ACC-3'
 | 3'-UGU-5' | 5'-ACA-3', 5'-ACG-3'
 | 3'-UGA-5' | 5'-ACU-3'
 | 3'-UGC-5' | 5'-ACG-3'
Asparagine | 3'-UUG-5' | 5'-AAU-3', 5'-AAC-3'
 | 3'-UUA-5' | 5'-AAU-3'
Lysine | 3'-UUU-5' | 5'-AAA-3', 5'-AAG-3'
 | 3'-UUC-5' | 5'-AAG-3'
Valine | 3'-CAI-5' | 5'-GUU-3', 5'-GUC-3', 5'-GUA-3'
 | 3'-CAG-5' | 5'-GUU-3', 5'-GUC-3'
 | 3'-CAU-5' | 5'-GUA-3', 5'-GUG-3'
 | 3'-CAA-5' | 5'-GUU-3'
 | 3'-CAC-5' | 5'-GUG-3'
Alanine | 3'-CGI-5' | 5'-GCU-3', 5'-GCC-3', 5'-GCA-3'
 | 3'-CGG-5' | 5'-GCU-3', 5'-GCC-3'
 | 3'-CGU-5' | 5'-GCA-3', 5'-GCG-3'
 | 3'-CGA-5' | 5'-GCU-3'
 | 3'-CGC-5' | 5'-GCG-3'
Aspartate | 3'-CUG-5' | 5'-GAU-3', 5'-GAC-3'
 | 3'-CUA-5' | 5'-GAU-3'
Glutamate | 3'-CUU-5' | 5'-GAA-3', 5'-GAG-3'
 | 3'-CUC-5' | 5'-GAG-3'
Glycine | 3'-CCI-5' | 5'-GGU-3', 5'-GGC-3', 5'-GGA-3'
 | 3'-CCG-5' | 5'-GGU-3', 5'-GGC-3'
 | 3'-CCU-5' | 5'-GGA-3', 5'-GGG-3'
 | 3'-CCA-5' | 5'-GGU-3'
 | 3'-CCC-5' | 5'-GGG-3'
While it is not possible to predict a specific DNA codon from an amino acid, DNA codons can be decoded unambiguously into amino
acids. The reason is that there are 61 different DNA (and mRNA) codons specifying only 20 amino acids. Note that there are 3
additional codons for chain termination, i.e. there are 64 DNA (and thus 64 different mRNA) codons, but only 61 of them specify amino
acids.
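The asymmetry between decoding and back-translation can be made concrete. The sketch below is illustrative only: it builds the standard code over the DNA alphabet (bases in the order T, C, A, G; `*` for chain termination) and counts how many DNA sequences would encode a given peptide; the function names are assumptions of this example.

```python
from collections import Counter
from itertools import product
from math import prod

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
DNA_CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def decode(codon):
    """Codon -> amino acid is a plain lookup: each codon has exactly one meaning."""
    return DNA_CODON_TABLE[codon]

# Number of codons per meaning (one-letter amino acid codes, "*" for stop).
CODONS_PER_RESIDUE = Counter(DNA_CODON_TABLE.values())

def count_encodings(peptide):
    """Number of distinct DNA sequences that translate to `peptide` (one-letter codes)."""
    return prod(CODONS_PER_RESIDUE[aa] for aa in peptide)

sense = sum(n for aa, n in CODONS_PER_RESIDUE.items() if aa != "*")
print(sense, CODONS_PER_RESIDUE["*"])  # 61 3
print(decode("ATG"))                   # M
print(count_encodings("MW"))           # 1
print(count_encodings("LSR"))          # 216
```

Only peptides built entirely from methionine and tryptophan have a single possible coding sequence.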
Table 5 shows the genetic code for the translation of all 64 DNA codons, starting from DNA over mRNA and tRNA to amino acid. In the
last column, the table shows the different tRNA anticodons minimally necessary to translate all DNA codons into amino acids and sums
up the number in the final row. It reveals that the minimum number of tRNA anticodons to translate all DNA codons is 31 (plus 2 STOP
codons). The maximum number of tRNA anticodons that can emerge in amino acid transcription is 70 (plus 3 STOP codons).
Table 5. Genetic code: DNA -> mRNA codon -> tRNA anticodon -> amino acid
No. | DNA codon | mRNA codon | tRNA anticodon(s) | Amino acid | Minimal tRNA anticodon
1 | TTT | UUU | AAA, AAG | Phe | AAG
2 | TTC | UUC | AAG | Phe |
3 | TTA | UUA | AAU | Leu | AAU
4 | TTG | UUG | AAU, AAC | Leu |
5 | TCT | UCU | AGI, AGG, AGA | Ser | AGI
6 | TCC | UCC | AGI, AGG | Ser |
7 | TCA | UCA | AGI, AGU | Ser |
8 | TCG | UCG | AGC, AGU | Ser |
9 | TAT | UAU | AUA, AUG | Tyr | AUG
10 | TAC | UAC | AUG | Tyr |
11 | TAA | UAA | AUU | STOP | AUU
12 | TAG | UAG | AUC, AUU | STOP |
13 | TGT | UGU | ACA, ACG | Cys | ACG
14 | TGC | UGC | ACG | Cys |
15 | TGA | UGA | ACU | STOP | ACU
16 | TGG | UGG | ACC | Trp | ACC
17 | CTT | CUU | GAI, GAG, GAA | Leu | GAI
18 | CTC | CUC | GAI, GAG | Leu |
19 | CTA | CUA | GAI, GAU | Leu |
20 | CTG | CUG | GAC, GAU | Leu |
21 | CCT | CCU | GGI, GGG, GGA | Pro | GGI
22 | CCC | CCC | GGI, GGG | Pro |
23 | CCA | CCA | GGI, GGU | Pro |
24 | CCG | CCG | GGC, GGU | Pro |
25 | CAT | CAU | GUA, GUG | His | GUG
26 | CAC | CAC | GUG | His |
27 | CAA | CAA | GUU | Gln | GUU
28 | CAG | CAG | GUC, GUU | Gln |
29 | CGT | CGU | GCI, GCG, GCA | Arg | GCI
30 | CGC | CGC | GCI, GCG | Arg |
31 | CGA | CGA | GCI, GCU | Arg |
32 | CGG | CGG | GCC, GCU | Arg |
33 | ATT | AUU | UAI, UAG, UAA | Ile | UAI
34 | ATC | AUC | UAI, UAG | Ile |
35 | ATA | AUA | UAI, UAU | Ile |
36 | ATG | AUG | UAC | Met | UAC
37 | ACT | ACU | UGI, UGG, UGA | Thr | UGI
38 | ACC | ACC | UGI, UGG | Thr |
39 | ACA | ACA | UGI, UGU | Thr |
40 | ACG | ACG | UGC, UGU | Thr |
41 | AAT | AAU | UUA, UUG | Asn | UUG
42 | AAC | AAC | UUG | Asn |
43 | AAA | AAA | UUU | Lys | UUU
44 | AAG | AAG | UUC, UUU | Lys |
45 | AGT | AGU | UCA, UCG | Ser | UCG
46 | AGC | AGC | UCG | Ser |
47 | AGA | AGA | UCU | Arg | UCU
48 | AGG | AGG | UCC, UCU | Arg |
49 | GTT | GUU | CAI, CAG, CAA | Val | CAI
50 | GTC | GUC | CAI, CAG | Val |
51 | GTA | GUA | CAI, CAU | Val |
52 | GTG | GUG | CAC, CAU | Val |
53 | GCT | GCU | CGI, CGG, CGA | Ala | CGI
54 | GCC | GCC | CGI, CGG | Ala |
55 | GCA | GCA | CGI, CGU | Ala |
56 | GCG | GCG | CGC, CGU | Ala |
57 | GAT | GAU | CUG, CUA | Asp | CUG
58 | GAC | GAC | CUG | Asp |
59 | GAA | GAA | CUU | Glu | CUU
60 | GAG | GAG | CUU, CUC | Glu |
61 | GGT | GGU | CCI, CCG, CCA | Gly | CCI
62 | GGC | GGC | CCI, CCG | Gly |
63 | GGA | GGA | CCI, CCU | Gly |
64 | GGG | GGG | CCC, CCU | Gly |
No. | 64 | 64 | | | 33
DNA sequencing is the determination of the precise sequence of nucleotides in a sample of DNA.
The most popular method for doing this is called the dideoxy method or Sanger method (named
after its inventor, Frederick Sanger, who was awarded the 1980 Nobel prize in chemistry [his
second] for this achievement).
DNA is synthesized from four deoxynucleotide triphosphates. The top formula shows one of them:
deoxythymidine triphosphate (dTTP). Each new nucleotide is added to the 3'-OH group of the last
nucleotide added.
The dideoxy method gets its name from the critical role played by synthetic nucleotides
that lack the -OH at the 3' carbon atom (red arrow). A dideoxynucleotide (dideoxythymidine
triphosphate, ddTTP, is the one shown here) can be added to the growing DNA strand but
when it is, chain elongation stops because there is no 3'-OH for the next nucleotide to be
attached to. For this reason, the dideoxy method is also called the chain termination method.
The bottom formula shows the structure of azidothymidine (AZT), a drug used to treat AIDS. AZT (which is also
called zidovudine) is taken up by cells where it is converted into the triphosphate. The reverse transcriptase of the
human immunodeficiency virus (HIV) prefers AZT triphosphate to the normal nucleotide (dTTP). Because AZT has no
3'-OH group, DNA synthesis by reverse transcriptase halts when AZT triphosphate is incorporated in the growing DNA
strand. Fortunately, the DNA polymerases of the host cell prefer dTTP, so side effects from the drug are not so severe as
might have been predicted.
The Procedure
The DNA to be sequenced is prepared as a single strand.
This template DNA is supplied with
color when illuminated by a laser beam and an automatic scanner provides a printout of the
sequence.
The 'insert' is a piece of DNA we've purposely put into another (a 'vector') so that we can replicate it. Consequently,
the 'insert' is usually the interesting part. In the case of the Human Genome Project or other sequencing
projects, the insert is the part we want to sequence - the part we don't know. Usually we know the complete
DNA sequence of the vector.
Shotgun Sequencing
Shotgun sequencing is a method for determining the sequence of a very large piece of DNA. The basic DNA
sequencing reaction can only get the sequence of a few hundred nucleotides. For larger ones (like BAC DNA), we
usually fragment the DNA and insert the resultant pieces into a convenient vector (a plasmid, usually) to
replicate them. After we sequence the fragments, we try to deduce from them the sequence of the original BAC
DNA.
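To make the idea of deducing the original sequence from fragments concrete, here is a toy sketch. It is not how real assemblers work: it assumes error-free reads from one strand and simply merges the pair of reads with the longest suffix/prefix overlap until nothing more can be joined. All names and the example reads are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: the reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads

print(greedy_assemble(["GCGTACC", "ATGGCGT", "TACCTGA"]))  # ['ATGGCGTACCTGA']
```

Repeated sequences can mislead a greedy merge like this one, which is one reason large genomes are hard to assemble.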
Well, OK, it's not so easy reading just C's, as you perhaps saw
in the last figure. The spacing between the bands isn't all that
easy to figure out. Imagine, though, that we ran the reaction
with *all four* of the dideoxy nucleotides (A, G, C and T)
present, and with *different* fluorescent colors on each. NOW
look at the gel we'd get (at left). The sequence of the DNA is
rather obvious if you know the color codes ... just read the
colors from bottom to top: TGCGTCCA-(etc).
(Forgive me for using black - it shows up better than yellow).
That's exactly what we do to sequence DNA, then - we run DNA replication reactions in a
test tube, but in the presence of trace amounts of all four of the dideoxy terminator
nucleotides. Electrophoresis is used to separate the resulting fragments by size and we can
'read' the sequence from it, as the colors march past in order.
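Here is a small simulation of that idea, as a sketch only. It assumes an idealized reaction in which termination happens once at every position of the newly made strand, so there is exactly one fragment of each length; the last letter of each fragment stands in for the colored dideoxy terminator. The function names are made up for this example.

```python
import random

def terminated_fragments(new_strand):
    """All chain-terminated products: every prefix of the newly synthesized strand.

    The final base of each fragment is the dideoxy nucleotide that stopped it.
    """
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]

def read_gel(fragments):
    """Sort fragments by size (shortest runs farthest) and read the terminal 'colors'."""
    ordered = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in ordered)

fragments = terminated_fragments("TGCGTCCA")
random.Random(0).shuffle(fragments)  # in the tube the fragments are all mixed together
print(read_gel(fragments))           # TGCGTCCA
```

Sorting by length plays the role of electrophoresis; reading the last base of each fragment plays the role of the laser and detector.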
In a large-scale sequencing lab, we use a machine to run the electrophoresis step and to
monitor the different colors as they come out. Since about 2001, these machines - not
surprisingly called automated DNA sequencers - have used 'capillary electrophoresis',
where the fragments are piped through a tiny glass-fiber capillary during the
electrophoresis step, and they come out the far end in size-order. There's an ultraviolet
laser built into the machine that shoots through the liquid emerging from the end of the
capillaries, checking for pulses of fluorescent colors to emerge. There might be as many
as 96 samples moving through as many capillaries ('lanes') in the most common type of
sequencer.
At left is a screen shot of a real fragment of sequencing gel (this one from an older model
of sequencer, but the concepts are identical). The four colors red, green, blue and yellow
each represent one of the four nucleotides.
The actual gel image, if you could get a monitor large enough to see it all at this
magnification, would be perhaps 3 or 4 meters long and 30 or 40 cm wide.
A 'Scan' of one gel lane:
We don't even have to 'read' the sequence from the gel - the computer does that for us! Below is an example
of what the sequencer's computer shows us for one sample. This is a plot of the colors detected in one 'lane'
of a gel (one sample), scanned from smallest fragments to largest. The computer even interprets the colors by
printing the nucleotide sequence across the top of the plot. This is just a fragment of the entire file, which
would span around 900 or so nucleotides of accurate sequence.
The sequencer also gives the operator a text file containing just the nucleotide sequence, without the color
traces.
As you have seen, we can get the sequence of a fragment of DNA as long as 900 or so
nucleotides. Great! But what about longer pieces? The human genome is 3 *billion* bases
long, arranged on 23 pairs of chromosomes. Our sequencing machine reads just a drop in
the bucket compared to what we really need!
To do it, we break the entire genome up into manageable pieces and sequence them. There
The Publicly Funded Human Genome Project: The National Institutes of Health and the National Science
Foundation have funded the creation of 'libraries' of BAC clones. Each BAC carries a large piece of human
genomic DNA on the order of 100-300 kb. All of these BACs overlap randomly, so that any one gene is
probably on several different overlapping BACs. We can replicate those BACs as many times as necessary, so
there's a virtually endless supply of the large human DNA fragment.
In the publicly funded project, the BACs are subjected to shotgun sequencing (see below) to figure
out their sequence. By sequencing all the BACs, we know enough of the sequence in overlapping
segments to reconstruct how the original chromosome sequence looks.
A Privately-Funded Sequencing Project: Celera Genomics An innovative approach to sequencing the human
genome has been pioneered by Celera Genomics. The founders of this company realized that it might be
possible to skip the entire step of making libraries of BAC clones. Instead, they blast apart the entire human
genome into fragments of 2-10 kb and sequence those. Now the challenge is to assemble those fragments of
sequence into the whole genome sequence.
Imagine, for example that you have hundreds of 500-piece puzzles, each being assembled by a team
of puzzle experts using puzzle-solving computers. Those puzzles are like BACs - smaller puzzles that
make a big genome manageable. Now imagine that Celera throws all those puzzles together into one
room and scrambles the pieces. They, however, have scanners that scan all the puzzle pieces and huge
computers that figure out where they all go.
It is controversial still as to whether the Celera approach will succeed on a puzzle as large as the
human genome. Whether it does or not, they have certainly stirred up the intellectual pot a bit.
Introduction
You can think of the sequences of bases in the coding strand of DNA or in messenger RNA as coded
instructions for building protein chains out of amino acids. There are 20 amino acids used in making
proteins, but only four different bases to be used to code for them.
Obviously one base can't code for one amino acid. That would leave 16 amino acids with no codes.
If you took two bases to code for each amino acid, that would still only give you 16 possible codes
(TT, TC, TA, TG, CT, CC, CA and so on) - still not enough.
However, if you took three bases per amino acid, that gives you 64 codes (TTT, TTC, TTA, TTG,
TCT, TCC and so on). That's enough to code for everything with lots to spare. You will find a full table
of these below.
A three base sequence in DNA or RNA is known as a codon.
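The counting argument is easy to check for yourself. This is just a quick sketch; the function name `codes` is made up, and the bases are listed in the order T, C, A, G so the output matches the order used above.

```python
from itertools import product

def codes(length, bases="TCAG"):
    """Every possible code of the given length built from the four bases."""
    return ["".join(p) for p in product(bases, repeat=length)]

for length in (1, 2, 3):
    print(length, "base(s):", len(codes(length)), "codes")

print(codes(2)[:4])  # ['TT', 'TC', 'TA', 'TG']
print(codes(3)[:6])  # ['TTT', 'TTC', 'TTA', 'TTG', 'TCT', 'TCC']
```

One base gives 4 codes and two give 16, both short of the 20 needed; three give 64.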
to code for individual amino acids - shown by their three letter abbreviation.
The table is arranged in such a way that it is easy to find any particular combination you want. It is
fairly obvious how it works and, in any case, it doesn't take very long just to scan through the table to
find what you want.
The colours are to stress the fact that most of the amino acids have more than one code. Look, for
example, at leucine in the first column. There are six different codons all of which will eventually
produce a leucine (Leu) in the protein chain. There are also six for serine (Ser).
In fact there are only two amino acids which have only one sequence of bases to code for them: methionine (Met) and tryptophan (Trp).
You have probably noticed that three codons don't have an amino acid written beside them, but say
"stop" instead. For obvious reasons these are known as stop codons. We'll leave talking about those
until we have looked at the way the code works in messenger RNA.
In many ways, this is the more useful table. Messenger RNA is directly involved in the production of
the protein chains (see the next page in this sequence). The DNA coding chain is one stage removed
from this because it must first be transcribed into a messenger RNA chain.