All About Dna

DNA carries the genetic instructions for all living organisms. It is composed of two strands coiled around each other to form a double helix structure. Each strand is made up of repeating nucleotide units containing a nucleobase, sugar, and phosphate. The four nucleobases are adenine, cytosine, guanine, and thymine. Hydrogen bonds between the bases on opposite strands hold the DNA structure together. DNA stores and transmits genetic information from one generation of cells to the next.

Uploaded by

Peopwd
Copyright
© All Rights Reserved

ALL ABOUT DNA v.

DNA
From Wikipedia, the free encyclopedia

For a non-technical introduction to the topic, see Introduction to genetics. For other uses, see DNA (disambiguation).

The structure of the DNA double helix. The atoms in the structure are colour-coded by element and the detailed structures of two
base pairs are shown in the bottom right.

The structure of part of a DNA double helix

Deoxyribonucleic acid (DNA) is a molecule that carries most of the genetic instructions used in the
development and functioning of all known living organisms and many viruses.
DNA is a nucleic acid; alongside proteins and carbohydrates, nucleic acids compose the three
major macromolecules essential for all known forms of life. Most DNA molecules consist of two biopolymer strands
coiled around each other to form a double helix. The two DNA strands are known as polynucleotides since they are
composed of simpler units called nucleotides. Each nucleotide is composed of a nitrogen-containing nucleobase,
either guanine (G), adenine (A), thymine (T), or cytosine (C), as well as a monosaccharide sugar
called deoxyribose and a phosphate group. The nucleotides are joined to one another in a chain by covalent
bonds between the sugar of one nucleotide and the phosphate of the next, resulting in an alternating
sugar-phosphate backbone. According to base pairing rules (A with T, and C with G), hydrogen bonds bind the
nitrogenous bases of the two separate polynucleotide strands to make double-stranded DNA.

DNA is well-suited for biological information storage. The DNA backbone is resistant to cleavage, and both strands
of the double-stranded structure store the same biological information. Biological information is replicated as the two
strands are separated. A significant portion of DNA (more than 98% for humans) is non-coding, meaning that these
sections do not serve as patterns for protein sequences.
The two strands of DNA run in opposite directions to each other and are therefore anti-parallel. Attached to each
sugar is one of four types of nucleobases (informally, bases). It is the sequence of these four nucleobases along the
backbone that encodes biological information. Under the genetic code, RNA strands are translated to specify the
sequence of amino acids within proteins. These RNA strands are initially created using DNA strands as a template
in a process called transcription.
Within cells, DNA is organized into long structures called chromosomes. During cell division these chromosomes
are duplicated in the process of DNA replication, providing each cell its own complete set of
chromosomes. Eukaryotic organisms (animals, plants, fungi, and protists) store most of their DNA inside the cell
nucleus and some of their DNA in organelles, such as mitochondria or chloroplasts.[1] In
contrast, prokaryotes (bacteria and archaea) store their DNA only in the cytoplasm. Within the
chromosomes, chromatin proteins such as histones compact and organize DNA. These compact structures guide
the interactions between DNA and other proteins, helping control which parts of the DNA are transcribed.
First isolated by Friedrich Miescher in 1869 and with its molecular structure first identified by James
Watson and Francis Crick in 1953, DNA is used by researchers as a molecular tool to explore physical laws and
theories, such as the ergodic theorem and the theory of elasticity. The unique material properties of DNA have
made it an attractive molecule for material scientists and engineers interested in micro- and nano-fabrication.
Among notable advances in this field are DNA origami and DNA-based hybrid materials.[2]
The obsolete synonym "desoxyribonucleic acid" may occasionally be encountered, for example, in pre-1953
genetics.
Contents

1 Properties
o 1.1 Nucleobase classification
o 1.2 Grooves
o 1.3 Base pairing
o 1.4 Sense and antisense
o 1.5 Supercoiling
o 1.6 Alternate DNA structures
o 1.7 Alternative DNA chemistry
o 1.8 Quadruplex structures
o 1.9 Branched DNA
2 Chemical modifications and altered DNA packaging
o 2.1 Base modifications and DNA packaging
o 2.2 Damage
3 Biological functions
o 3.1 Genes and genomes
o 3.2 Transcription and translation
o 3.3 Replication
o 3.4 Extracellular nucleic acids
4 Interactions with proteins
o 4.1 DNA-binding proteins
o 4.2 DNA-modifying enzymes
4.2.1 Nucleases and ligases
4.2.2 Topoisomerases and helicases
4.2.3 Polymerases
5 Genetic recombination
6 Evolution
7 Uses in technology
o 7.1 Genetic engineering
o 7.2 Forensics
o 7.3 Bioinformatics
o 7.4 DNA nanotechnology
o 7.5 History and anthropology
o 7.6 Information storage
8 History of DNA research
9 See also
10 References
11 Further reading
12 External links

Properties

Chemical structure of DNA; hydrogen bonds shown as dotted lines

DNA is a long polymer made from repeating units called nucleotides.[3][4][5] Miescher first identified and isolated DNA,
a substance he called nuclein, in 1869 at the University of Tübingen, and the double helix structure of
DNA was first discovered in 1953 by Watson and Crick at the University of Cambridge, using experimental data
collected by Rosalind Franklin and Maurice Wilkins. The structure of DNA of all species comprises two helical
chains each coiled round the same axis, and each with a pitch of 34 ångströms (3.4 nanometres) and a radius of
10 ångströms (1.0 nanometre).[6] According to another study, when measured in a particular solution, the DNA chain
measured 22 to 26 ångströms wide (2.2 to 2.6 nanometres), and one nucleotide unit measured 3.3 Å (0.33 nm)
long.[7] Although each individual repeating unit is very small, DNA polymers can be very large molecules containing
millions of nucleotides. For instance, the largest human chromosome, chromosome number 1, consists of
approximately 220 million base pairs[8] and is 85 mm long.

In living organisms DNA does not usually exist as a single molecule, but instead as a pair of molecules that are held
tightly together.[9][10] These two long strands entwine like vines, in the shape of a double helix. The nucleotide repeats
contain both the segment of the backbone of the molecule, which holds the chain together, and a nucleobase, which
interacts with the other DNA strand in the helix. A nucleobase linked to a sugar is called a nucleoside and a base
linked to a sugar and one or more phosphate groups is called a nucleotide. A polymer comprising multiple linked
nucleotides (as in DNA) is called a polynucleotide.[11]
The backbone of the DNA strand is made from alternating phosphate and sugar residues.[12] The sugar in DNA is
2-deoxyribose, which is a pentose (five-carbon) sugar. The sugars are joined together by phosphate groups that
form phosphodiester bonds between the third and fifth carbon atoms of adjacent sugar rings. These
asymmetric bonds mean a strand of DNA has a direction. In a double helix the direction of the nucleotides in one
strand is opposite to their direction in the other strand: the strands are antiparallel. The asymmetric ends of DNA
strands are called the 5′ (five prime) and 3′ (three prime) ends, with the 5′ end having a terminal phosphate group
and the 3′ end a terminal hydroxyl group. One major difference between DNA and RNA is the sugar, with the
2-deoxyribose in DNA being replaced by the alternative pentose sugar ribose in RNA.[10]

A section of DNA. The bases lie horizontally between the two spiraling strands.[13] (animated version).

The DNA double helix is stabilized primarily by two forces: hydrogen bonds between nucleotides and
base-stacking interactions among aromatic nucleobases.[14] In the aqueous environment of the cell, the conjugated
π bonds of nucleotide bases align perpendicular to the axis of the DNA molecule, minimizing their interaction with
the solvation shell and therefore, the Gibbs free energy. The four bases found in DNA are adenine (abbreviated
A), cytosine (C), guanine (G) and thymine (T). These four bases are attached to the sugar/phosphate to form the
complete nucleotide, as shown for adenosine monophosphate.

Nucleobase classification
The nucleobases are classified into two types: the purines, A and G, being fused five- and six-membered
heterocyclic compounds, and the pyrimidines, the six-membered rings C and T.[10] A fifth pyrimidine
nucleobase, uracil (U), usually takes the place of thymine in RNA and differs from thymine by lacking a methyl
group on its ring. In addition to RNA and DNA a large number of artificial nucleic acid analogues have also been
created to study the properties of nucleic acids, or for use in biotechnology.[15]
Uracil is not usually found in DNA, occurring only as a breakdown product of cytosine. However, in a number of
bacteriophages (Bacillus subtilis bacteriophages PBS1 and PBS2 and Yersinia bacteriophage piR1-37), thymine
has been replaced by uracil.[16] Another phage, Staphylococcal phage S6, has been identified with a genome
where thymine has been replaced by uracil.[17]
Base J (beta-d-glucopyranosyloxymethyluracil), a modified form of uracil, is also found in a number of organisms:
the flagellates Diplonema and Euglena, and all the kinetoplastid genera.[18] Biosynthesis of J occurs in two steps: in
the first step a specific thymidine in DNA is converted into hydroxymethyldeoxyuridine (HOMedU); in the second, HOMedU is
glycosylated to form J.[19] Proteins that bind specifically to this base have been identified.[20][21][22] These proteins
appear to be distant relatives of the Tet1 oncogene that is involved in the pathogenesis of acute myeloid
leukemia.[23] J appears to act as a termination signal for RNA polymerase II.[24][25]

Major and minor grooves of DNA. Minor groove is a binding site for the dye Hoechst 33258.

Grooves
Twin helical strands form the DNA backbone. Another double helix may be found tracing the spaces, or grooves,
between the strands. These voids are adjacent to the base pairs and may provide a binding site. As the strands are
not symmetrically located with respect to each other, the grooves are unequally sized. One groove, the major
groove, is 22 Å wide and the other, the minor groove, is 12 Å wide.[26] The width of the major groove means that the
edges of the bases are more accessible in the major groove than in the minor groove. As a result, proteins such
as transcription factors that can bind to specific sequences in double-stranded DNA usually make contact with the
sides of the bases exposed in the major groove.[27] This situation varies in unusual conformations of DNA within the
cell (see below), but the major and minor grooves are always named to reflect the differences in size that would be
seen if the DNA is twisted back into the ordinary B form.

Base pairing
Further information: Base pair
In a DNA double helix, each type of nucleobase on one strand bonds with just one type of nucleobase on the other
strand. This is called complementary base pairing. Here, purines form hydrogen bonds to pyrimidines, with adenine
bonding only to thymine in two hydrogen bonds, and cytosine bonding only to guanine in three hydrogen bonds.
This arrangement of two nucleotides binding together across the double helix is called a base pair. As hydrogen
bonds are not covalent, they can be broken and rejoined relatively easily. The two strands of DNA in a double helix
can therefore be pulled apart like a zipper, either by a mechanical force or high temperature.[28] As a result of this
complementarity, all the information in the double-stranded sequence of a DNA helix is duplicated on each strand,
which is vital in DNA replication. Indeed, this reversible and specific interaction between complementary base pairs
is critical for all the functions of DNA in living organisms.[4]
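
The pairing rules above, together with the antiparallel orientation of the two strands, are enough to derive one strand from the other. The following is a minimal sketch in Python; the function names and the example sequence are illustrative assumptions, not part of the source.

```python
# Pairing rules stated in the text: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the partner base for each position, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the partner strand written 5'->3' (the strands are antiparallel)."""
    return complement(strand)[::-1]

seq = "ATGCGTTA"                 # hypothetical example sequence, written 5'->3'
print(complement(seq))           # partner strand read 3'->5'
print(reverse_complement(seq))   # partner strand read 5'->3'
```

Applying reverse_complement twice returns the original sequence, which reflects the statement that each strand carries all the information of the duplex.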

Top, a GC base pair with three hydrogen bonds. Bottom, an AT base pair with two hydrogen bonds. Non-covalent
hydrogen bonds between the pairs are shown as dashed lines.

The two types of base pairs form different numbers of hydrogen bonds, AT forming two hydrogen bonds, and GC
forming three hydrogen bonds (see figures, right). DNA with high GC-content is more stable than DNA with low
GC-content.
As noted above, most DNA molecules are actually two polymer strands, bound together in a helical fashion by
noncovalent bonds; this double-stranded structure (dsDNA) is maintained largely by the intrastrand base stacking
interactions, which are strongest for G,C stacks. The two strands can come apart, a process known as melting, to
form two single-stranded DNA (ssDNA) molecules. Melting occurs at high temperature, low salt and high
pH (low pH also melts DNA, but since DNA is unstable due to acid depurination, low pH is rarely used).
The stability of the dsDNA form depends not only on the GC-content (% G,C basepairs) but also on sequence (since
stacking is sequence specific) and also length (longer molecules are more stable). The stability can be measured in
various ways; a common way is the "melting temperature", which is the temperature at which 50% of the ds
molecules are converted to ss molecules; melting temperature is dependent on ionic strength and the concentration
of DNA. As a result, both the percentage of GC base pairs and the overall length of a DNA double helix
determine the strength of the association between the two strands of DNA. Long DNA helices with a high
GC-content have stronger-interacting strands, while short helices with high AT content have weaker-interacting
strands.[29] In biology, parts of the DNA double helix that need to separate easily, such as the TATAAT Pribnow
box in some promoters, tend to have a high AT content, making the strands easier to pull apart.[30]
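
As a rough illustration of the point about GC-content, the sketch below computes the GC fraction of a sequence and counts inter-strand hydrogen bonds using the figures given above (two per AT pair, three per GC pair). The function names and the GC-rich comparison sequence are assumptions for illustration. The bond count is only a crude proxy, since the text notes that stacking, sequence and length also affect stability.

```python
def gc_content(seq: str) -> float:
    """Fraction of positions that are G or C (0.0 to 1.0)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def hydrogen_bonds(seq: str) -> int:
    """Hydrogen bonds to the partner strand: 2 per A/T position, 3 per G/C position."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 3 * gc

pribnow = "TATAAT"   # the Pribnow box mentioned in the text
gc_rich = "GCGCGC"   # hypothetical GC-rich sequence of the same length

for name, seq in (("Pribnow box", pribnow), ("GC-rich", gc_rich)):
    print(name, gc_content(seq), hydrogen_bonds(seq))
```

For equal length, the AT-only Pribnow box has fewer hydrogen bonds than the GC-only sequence, consistent with AT-rich regions being easier to pull apart.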
In the laboratory, the strength of this interaction can be measured by finding the temperature necessary to break the
hydrogen bonds, their melting temperature (also called Tm value). When all the base pairs in a DNA double helix
melt, the strands separate and exist in solution as two entirely independent molecules. These single-stranded DNA
molecules (ssDNA) have no single common shape, but some conformations are more stable than others.[31]

Sense and antisense
Further information: Sense (molecular biology)
A DNA sequence is called "sense" if its sequence is the same as that of a messenger RNA copy that is translated
into protein.[32] The sequence on the opposite strand is called the "antisense" sequence. Both sense and antisense
sequences can exist on different parts of the same strand of DNA (i.e. both strands can contain both sense and
antisense sequences). In both prokaryotes and eukaryotes, antisense RNA sequences are produced, but the
functions of these RNAs are not entirely clear.[33] One proposal is that antisense RNAs are involved in
regulating gene expression through RNA-RNA base pairing.[34]
A few DNA sequences in prokaryotes and eukaryotes, and more in plasmids and viruses, blur the distinction
between sense and antisense strands by having overlapping genes.[35] In these cases, some DNA sequences do
double duty, encoding one protein when read along one strand, and a second protein when read in the opposite
direction along the other strand. In bacteria, this overlap may be involved in the regulation of gene
transcription,[36] while in viruses, overlapping genes increase the amount of information that can be encoded within
the small viral genome.[37]
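
The relationship between the sense strand, the antisense (template) strand and the messenger RNA can be sketched as follows. This is a simplified illustration that assumes both DNA strands are written 5′ to 3′ and that uracil replaces thymine in RNA, as stated earlier; the function names and sequences are hypothetical.

```python
# DNA template base -> RNA base placed opposite it (U takes the place of T in RNA).
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}

def mrna_from_sense(sense: str) -> str:
    """The mRNA has the same sequence as the sense strand, with U instead of T."""
    return sense.upper().replace("T", "U")

def mrna_from_template(template: str) -> str:
    """Build the mRNA (5'->3') by pairing against the template strand (given 5'->3')."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in reversed(template.upper()))

sense = "ATGGCCTAA"      # hypothetical sense sequence, 5'->3'
template = "TTAGGCCAT"   # its antisense partner, also written 5'->3'

print(mrna_from_sense(sense))
print(mrna_from_template(template))
```

Both calls give the same RNA sequence, which is what the definition of "sense" above implies.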

Supercoiling
Further information: DNA supercoil
DNA can be twisted like a rope in a process called DNA supercoiling. With DNA in its "relaxed" state, a strand
usually circles the axis of the double helix once every 10.4 base pairs, but if the DNA is twisted the strands become
more tightly or more loosely wound.[38] If the DNA is twisted in the direction of the helix, this is positive supercoiling,
and the bases are held more tightly together. If they are twisted in the opposite direction, this is negative
supercoiling, and the bases come apart more easily. In nature, most DNA has slight negative supercoiling that is
introduced by enzymes called topoisomerases.[39] These enzymes are also needed to relieve the twisting stresses
introduced into DNA strands during processes such as transcription and DNA replication.[40]
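
The figure of 10.4 base pairs per turn gives a simple way to estimate how many helical turns a relaxed molecule contains and whether a constrained molecule is over- or underwound. The sketch below is a deliberate simplification: it treats the number of turns as a single quantity and ignores the distinction between twist and writhe, which the text does not discuss. The function names and the 5,200 bp example are assumptions.

```python
import math

BP_PER_TURN_RELAXED = 10.4  # relaxed value given in the text

def relaxed_turns(n_bp: int) -> float:
    """Helical turns expected for a relaxed duplex of n_bp base pairs."""
    return n_bp / BP_PER_TURN_RELAXED

def supercoil_sign(n_bp: int, turns: float) -> str:
    """Classify a molecule held at `turns` turns relative to its relaxed state."""
    relaxed = relaxed_turns(n_bp)
    if math.isclose(turns, relaxed):
        return "relaxed"
    # More turns than relaxed = twisted in the direction of the helix (positive).
    return "positive" if turns > relaxed else "negative"

n_bp = 5200                       # hypothetical circular DNA; relaxed at about 500 turns
print(relaxed_turns(n_bp))
print(supercoil_sign(n_bp, 480))  # underwound
print(supercoil_sign(n_bp, 520))  # overwound
```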

From left to right, the structures of A, B and Z DNA

Alternate DNA structures
Further information: Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid, Molecular
models of DNA, and DNA structure
DNA exists in many possible conformations that include A-DNA, B-DNA, and Z-DNA forms, although only B-DNA
and Z-DNA have been directly observed in functional organisms.[12] The conformation that DNA adopts depends on
the hydration level, DNA sequence, the amount and direction of supercoiling, chemical modifications of the bases,
the type and concentration of metal ions, as well as the presence of polyamines in solution.[41]
The first published reports of A-DNA X-ray diffraction patterns, and also B-DNA, used analyses based
on Patterson transforms that provided only a limited amount of structural information for oriented fibers of
DNA.[42][43] An alternate analysis was then proposed by Wilkins et al., in 1953, for the in vivo B-DNA X-ray
diffraction/scattering patterns of highly hydrated DNA fibers in terms of squares of Bessel functions.[44] In the same
journal, James Watson and Francis Crick presented their molecular modeling analysis of the DNA X-ray diffraction
patterns to suggest that the structure was a double-helix.[6]
Although the "B-DNA form" is most common under the conditions found in cells,[45] it is not a well-defined
conformation but a family of related DNA conformations[46] that occur at the high hydration levels present in living
cells. Their corresponding X-ray diffraction and scattering patterns are characteristic of molecular paracrystals with a
significant degree of disorder.[47][48]
Compared to B-DNA, the A-DNA form is a wider right-handed spiral, with a shallow, wide minor groove and a
narrower, deeper major groove. The A form occurs under non-physiological conditions in partially dehydrated
samples of DNA, while in the cell it may be produced in hybrid pairings of DNA and RNA strands, as well as in
enzyme-DNA complexes.[49][50] Segments of DNA where the bases have been chemically modified
by methylation may undergo a larger change in conformation and adopt the Z form. Here, the strands turn about the
helical axis in a left-handed spiral, the opposite of the more common B form.[51] These unusual structures can be
recognized by specific Z-DNA binding proteins and may be involved in the regulation of transcription.[52]

Alternative DNA chemistry
For a number of years exobiologists have proposed the existence of a shadow biosphere, a postulated microbial
biosphere of Earth that uses radically different biochemical and molecular processes than currently known life. One
of the proposals was the existence of lifeforms that use arsenic instead of phosphorus in DNA. A report of this
possibility in the bacterium GFAJ-1 was announced in 2010,[53][54] though the research was disputed,[54][55] and evidence
suggests the bacterium actively prevents the incorporation of arsenic into the DNA backbone and other
biomolecules.[56]

Quadruplex structures
Further information: G-quadruplex
At the ends of the linear chromosomes are specialized regions of DNA called telomeres. The main function of these
regions is to allow the cell to replicate chromosome ends using the enzyme telomerase, as the enzymes that
normally replicate DNA cannot copy the extreme 3′ ends of chromosomes.[57] These specialized chromosome caps
also help protect the DNA ends, and stop the DNA repair systems in the cell from treating them as damage to be
corrected.[58] In human cells, telomeres are usually lengths of single-stranded DNA containing several thousand
repeats of a simple TTAGGG sequence.[59]
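
A tandem repeat such as TTAGGG is easy to detect in sequence data. The following sketch counts the longest uninterrupted run of the repeat unit; the function name and the toy sequence are assumptions for illustration, and real telomeres contain thousands of copies rather than the handful used here.

```python
import re

TELOMERE_UNIT = "TTAGGG"  # human telomere repeat given in the text

def longest_tandem_run(seq: str, unit: str = TELOMERE_UNIT) -> int:
    """Return the largest number of back-to-back copies of `unit` found in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq.upper())
    return max((len(run) // len(unit) for run in runs), default=0)

# Toy sequence: five repeats, an interruption, then two more repeats.
toy = "ACGT" + "TTAGGG" * 5 + "CCA" + "TTAGGG" * 2
print(longest_tandem_run(toy))
```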

DNA quadruplex formed by telomere repeats. The looped conformation of the DNA backbone is very different from the typical
DNA helix.[60]

These guanine-rich sequences may stabilize chromosome ends by forming structures of stacked sets of four-base
units, rather than the usual base pairs found in other DNA molecules. Here, four guanine bases form a flat plate and
these flat four-base units then stack on top of each other, to form a stable G-quadruplex structure.[61] These
structures are stabilized by hydrogen bonding between the edges of the bases and chelation of a metal ion in the
centre of each four-base unit.[62] Other structures can also be formed, with the central set of four bases coming from
either a single strand folded around the bases, or several different parallel strands, each contributing one base to
the central structure.
In addition to these stacked structures, telomeres also form large loop structures called telomere loops, or T-loops.
Here, the single-stranded DNA curls around in a long circle stabilized by telomere-binding proteins.[63] At the very
end of the T-loop, the single-stranded telomere DNA is held onto a region of double-stranded DNA by the telomere
strand disrupting the double-helical DNA and base pairing to one of the two strands. This triple-stranded structure is
called a displacement loop or D-loop.[61]

Single branch

Multiple branches

Branched DNA can form networks containing multiple branches.

Branched DNA
Further information: Branched DNA and DNA nanotechnology
In DNA, fraying occurs when non-complementary regions exist at the end of an otherwise complementary
double-strand of DNA. However, branched DNA can occur if a third strand of DNA is introduced and contains adjoining
regions able to hybridize with the frayed regions of the pre-existing double-strand. Although the simplest example of
branched DNA involves only three strands of DNA, complexes involving additional strands and multiple branches
are also possible.[64] Branched DNA can be used in nanotechnology to construct geometric shapes; see the section
on uses in technology below.

Chemical modifications and altered DNA packaging

cytosine

5-methylcytosine

thymine

Structure of cytosine with and without the 5-methyl group. Deamination converts 5-methylcytosine into thymine.

Base modifications and DNA packaging
Further information: DNA methylation, Chromatin remodeling
The expression of genes is influenced by how the DNA is packaged in chromosomes, in a structure
called chromatin. Base modifications can be involved in packaging, with regions that have low or no gene
expression usually containing high levels of methylation of cytosine bases. DNA packaging and its influence on
gene expression can also occur by covalent modifications of the histone protein core around which DNA is wrapped
in the chromatin structure or else by remodeling carried out by chromatin remodeling complexes (see Chromatin
remodeling). There is, further, crosstalk between DNA methylation and histone modification, so they can
coordinately affect chromatin and gene expression.[65]
For example, cytosine methylation produces 5-methylcytosine, which is important for X-chromosome
inactivation.[66] The average level of methylation varies between organisms: the worm Caenorhabditis elegans lacks
cytosine methylation, while vertebrates have higher levels, with up to 1% of their DNA containing
5-methylcytosine.[67] Despite the importance of 5-methylcytosine, it can deaminate to leave a thymine base, so
methylated cytosines are particularly prone to mutations.[68] Other base modifications include adenine methylation in
bacteria, the presence of 5-hydroxymethylcytosine in the brain,[69] and the glycosylation of uracil to produce the
"J-base" in kinetoplastids.[70][71]

Damage
Further information: DNA damage (naturally occurring), Mutation, DNA damage theory of aging

A covalent adduct between a metabolically activated form of benzo[a]pyrene, the major mutagen in tobacco smoke, and DNA[72]

DNA can be damaged by many sorts of mutagens, which change the DNA sequence. Mutagens include oxidizing
agents, alkylating agents and also high-energy electromagnetic radiation such as ultraviolet light and X-rays. The
type of DNA damage produced depends on the type of mutagen. For example, UV light can damage DNA by
producing thymine dimers, which are cross-links between pyrimidine bases.[73] On the other hand, oxidants such
as free radicals or hydrogen peroxide produce multiple forms of damage, including base modifications, particularly
of guanosine, and double-strand breaks.[74] A typical human cell contains about 150,000 bases that have suffered
oxidative damage.[75] Of these oxidative lesions, the most dangerous are double-strand breaks, as these are difficult
to repair and can produce point mutations, insertions and deletions from the DNA sequence, as well as chromosomal
translocations.[76] These mutations can cause cancer. Because of inherent limitations in the DNA repair mechanisms,
if humans lived long enough, they would all eventually develop cancer.[77][78] DNA damages that are naturally
occurring, due to normal cellular processes that produce reactive oxygen species, the hydrolytic activities of cellular
water, etc., also occur frequently. Although most of these damages are repaired, in any cell some DNA damage may
remain despite the action of repair processes. These remaining DNA damages accumulate with age in mammalian
postmitotic tissues. This accumulation appears to be an important underlying cause of aging.[79][80][81]

Many mutagens fit into the space between two adjacent base pairs; this is called intercalation. Most intercalators
are aromatic and planar molecules; examples include ethidium bromide, acridines, daunomycin, and doxorubicin.
For an intercalator to fit between base pairs, the bases must separate, distorting the DNA strands by unwinding of
the double helix. This inhibits both transcription and DNA replication, causing toxicity and mutations.[82] As a result,
DNA intercalators may be carcinogens, and in the case of thalidomide, a teratogen.[83] Others such
as benzo[a]pyrene diol epoxide and aflatoxin form DNA adducts that induce errors in replication.[84] Nevertheless,
due to their ability to inhibit DNA transcription and replication, other similar toxins are also used in chemotherapy to
inhibit rapidly growing cancer cells.[85]

Biological functions

Location of eukaryote nuclear DNA within the chromosomes.

DNA usually occurs as linear chromosomes in eukaryotes, and circular chromosomes in prokaryotes. The set of
chromosomes in a cell makes up its genome; the human genome has approximately 3 billion base pairs of DNA
arranged into 46 chromosomes.[86] The information carried by DNA is held in the sequence of pieces of DNA
called genes. Transmission of genetic information in genes is achieved via complementary base pairing. For
example, in transcription, when a cell uses the information in a gene, the DNA sequence is copied into a
complementary RNA sequence through the attraction between the DNA and the correct RNA nucleotides. Usually,
this RNA copy is then used to make a matching protein sequence in a process called translation, which depends on
the same interaction between RNA nucleotides. In alternative fashion, a cell may simply copy its genetic information
in a process called DNA replication. The details of these functions are covered in other articles; here the focus is on
the interactions between DNA and other molecules that mediate the function of the genome.

Genes and genomes
Further information: Cell nucleus, Chromatin, Chromosome, Gene, Noncoding DNA
Genomic DNA is tightly and orderly packed in the process called DNA condensation to fit the small available
volumes of the cell. In eukaryotes, DNA is located in the cell nucleus, as well as small amounts
in mitochondria and chloroplasts. In prokaryotes, the DNA is held within an irregularly shaped body in the cytoplasm
called the nucleoid.[87] The genetic information in a genome is held within genes, and the complete set of this
information in an organism is called its genotype. A gene is a unit of heredity and is a region of DNA that influences
a particular characteristic in an organism. Genes contain an open reading frame that can be transcribed, as well
as regulatory sequences such as promoters and enhancers, which control the transcription of the open reading
frame.
In many species, only a small fraction of the total sequence of the genome encodes protein. For example, only
about 1.5% of the human genome consists of protein-coding exons, with over 50% of human DNA consisting of noncoding repetitive sequences.[88] The reasons for the presence of so much noncoding DNA in eukaryotic genomes
and the extraordinary differences in genome size, or C-value, among species represent a long-standing puzzle
known as the "C-value enigma".[89] However, some DNA sequences that do not code protein may still encode
functional non-coding RNA molecules, which are involved in the regulation of gene expression.[90]

T7 RNA polymerase (blue) producing an mRNA (green) from a DNA template (orange).[91]

Some noncoding DNA sequences play structural roles in chromosomes. Telomeres and centromeres typically
contain few genes, but are important for the function and stability of chromosomes.[58][92] Pseudogenes, an abundant form of
noncoding DNA in humans, are copies of genes that have been disabled by
mutation.[93] These sequences are usually just molecular fossils, although they can occasionally serve as raw genetic
material for the creation of new genes through the process of gene duplication and divergence.[94]

Transcription and translation
Further information: Genetic code, Transcription (genetics), Protein biosynthesis
A gene is a sequence of DNA that contains genetic information and can influence the phenotype of an organism.
Within a gene, the sequence of bases along a DNA strand defines a messenger RNA sequence, which then defines
one or more protein sequences. The relationship between the nucleotide sequences of genes and the
amino-acid sequences of proteins is determined by the rules of translation, known collectively as the genetic code. The
genetic code consists of three-letter 'words' called codons formed from a sequence of three nucleotides (e.g. ACT,
CAG, TTT).
In transcription, the codons of a gene are copied into messenger RNA by RNA polymerase. This RNA copy is then
decoded by a ribosome that reads the RNA sequence by base-pairing the messenger RNA to transfer RNA, which
carries amino acids. Since there are 4 bases in 3-letter combinations, there are 64 possible codons
(4^3 combinations). These encode the twenty standard amino acids, giving most amino acids more than one possible
codon. There are also three 'stop' or 'nonsense' codons signifying the end of the coding region; these are the TAA,
TGA, and TAG codons.
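
The codon arithmetic above can be checked with a short sketch. The function name split_into_codons and the example sequence are illustrative assumptions, not part of any standard library; the code only enumerates codons and reads a sequence in triplets, and does not translate codons into amino acids.

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TGA", "TAG"}  # the three stop codons named in the text

# Every ordered triple of bases is one codon: 4 * 4 * 4 = 64.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]

# Codons left over for the twenty standard amino acids.
sense_codons = [c for c in codons if c not in STOP_CODONS]


def split_into_codons(coding_sequence):
    """Read a sequence three bases at a time, stopping at the first stop codon."""
    result = []
    for i in range(0, len(coding_sequence) - 2, 3):
        codon = coding_sequence[i:i + 3]
        if codon in STOP_CODONS:
            break
        result.append(codon)
    return result
```

With 61 sense codons shared among twenty amino acids, most amino acids must have more than one codon, which is the redundancy the paragraph describes.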

DNA replication. The double helix is unwound by a helicase and topoisomerase. Next, one DNA polymerase produces
the leading strand copy. Another DNA polymerase binds to the lagging strand. This enzyme makes discontinuous segments
(called Okazaki fragments) before DNA ligase joins them together.

Replication
Further information: DNA replication
Cell division is essential for an organism to grow, but, when a cell divides, it must replicate the DNA in its genome so
that the two daughter cells have the same genetic information as their parent. The double-stranded structure of DNA
provides a simple mechanism for DNA replication. Here, the two strands are separated and then each
strand's complementary DNA sequence is recreated by an enzyme called DNA polymerase. This enzyme makes
the complementary strand by finding the correct base through complementary base pairing, and bonding it onto the
original strand. As DNA polymerases can only extend a DNA strand in a 5′ to 3′ direction, different mechanisms are
used to copy the antiparallel strands of the double helix.[95] In this way, the base on the old strand dictates which
base appears on the new strand, and the cell ends up with a perfect copy of its DNA.
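
As a minimal sketch of how the base on the old strand dictates the base on the new one, the following builds the complementary strand of a template. The function name complementary_strand is an illustrative assumption, and the sketch ignores the enzymatic details (leading and lagging strands, primers, proofreading).

```python
# Watson-Crick pairing rules: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(template):
    """Return the new strand, written 5' to 3', for a template written 5' to 3'."""
    # Pair each template base, then reverse, because the two strands
    # of the double helix are antiparallel.
    return "".join(COMPLEMENT[base] for base in reversed(template))
```

Applying the function twice returns the original sequence, which mirrors the way each daughter helix carries the same information as the parent.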

Extracellular nucleic acids


Naked extracellular DNA (eDNA), most of it released by cell death, is nearly ubiquitous in the environment. Its
concentration in soil may be as high as 2 μg/g, and its concentration in natural aquatic environments may be as high
as 88 μg/L.[96] Various possible functions have been proposed for eDNA: it may be involved in horizontal gene
transfer;[97] it may provide nutrients;[98] and it may act as a buffer to recruit or titrate ions or antibiotics.[99] Extracellular
DNA acts as a functional extracellular matrix component in the biofilms of a number of bacterial species. It may act as a
recognition factor to regulate the attachment and dispersal of specific cell types in the biofilm;[100] it may contribute to
biofilm formation;[101] and it may contribute to the biofilm's physical strength and resistance to biological stress.[102]

Interactions with proteins


All the functions of DNA depend on interactions with proteins. These protein interactions can be non-specific, or the
protein can bind specifically to a single DNA sequence. Enzymes can also bind to DNA and of these, the
polymerases that copy the DNA base sequence in transcription and DNA replication are particularly important.

DNA-binding proteins
Further information: DNA-binding protein

Interaction of DNA (shown in orange) with histones (shown in blue). These proteins' basic amino acids bind to the acidic
phosphate groups on DNA.

Structural proteins that bind DNA are well-understood examples of non-specific DNA-protein interactions. Within
chromosomes, DNA is held in complexes with structural proteins. These proteins organize the DNA into a compact
structure called chromatin. In eukaryotes this structure involves DNA binding to a complex of small basic proteins
called histones, while in prokaryotes multiple types of proteins are involved.[103][104] The histones form a disk-shaped
complex called a nucleosome, which contains two complete turns of double-stranded DNA wrapped around its
surface. These non-specific interactions are formed through basic residues in the histones making ionic bonds to the
acidic sugar-phosphate backbone of the DNA, and are therefore largely independent of the base
sequence.[105] Chemical modifications of these basic amino acid residues
include methylation, phosphorylation and acetylation.[106] These chemical changes alter the strength of the interaction
between the DNA and the histones, making the DNA more or less accessible to transcription factors and changing
the rate of transcription.[107] Other non-specific DNA-binding proteins in chromatin include the high-mobility group
proteins, which bind to bent or distorted DNA.[108] These proteins are important in bending arrays of nucleosomes
and arranging them into the larger structures that make up chromosomes.[109]
A distinct group of DNA-binding proteins binds specifically to single-stranded DNA. In
humans, replication protein A is the best-understood member of this family and is used in processes where the
double helix is separated, including DNA replication, recombination and DNA repair.[110] These binding proteins seem
to stabilize single-stranded DNA and protect it from forming stem-loops or being degraded by nucleases.

The lambda repressor helix-turn-helix transcription factor bound to its DNA target[111]

In contrast, other proteins have evolved to bind to particular DNA sequences. The most intensively studied of these
are the various transcription factors, which are proteins that regulate transcription. Each transcription factor binds to
one particular set of DNA sequences and activates or inhibits the transcription of genes that have these sequences
close to their promoters. The transcription factors do this in two ways. Firstly, they can bind the RNA polymerase
responsible for transcription, either directly or through other mediator proteins; this locates the polymerase at the
promoter and allows it to begin transcription.[112] Alternatively, transcription factors can bind enzymes that modify the
histones at the promoter. This changes the accessibility of the DNA template to the polymerase.[113]
As these DNA targets can occur throughout an organism's genome, changes in the activity of one type of
transcription factor can affect thousands of genes.[114] Consequently, these proteins are often the targets of the signal
transduction processes that control responses to environmental changes or cellular differentiation and development.
The specificity of these transcription factors' interactions with DNA comes from the proteins making multiple contacts
to the edges of the DNA bases, allowing them to "read" the DNA sequence. Most of these base-interactions are
made in the major groove, where the bases are most accessible.[27]

The restriction enzyme EcoRV (green) in a complex with its substrate DNA[115]

DNA-modifying enzymes
Nucleases and ligases
Nucleases are enzymes that cut DNA strands by catalyzing the hydrolysis of the phosphodiester bonds. Nucleases
that hydrolyse nucleotides from the ends of DNA strands are called exonucleases, while endonucleases cut within
strands. The most frequently used nucleases in molecular biology are the restriction endonucleases, which cut DNA
at specific sequences. For instance, the EcoRV enzyme shown to the left recognizes the 6-base sequence 5′-GAT|ATC-3′ and makes a cut at the vertical line. In nature, these enzymes protect bacteria against phage infection by
digesting the phage DNA when it enters the bacterial cell, acting as part of the restriction modification system.[116] In
technology, these sequence-specific nucleases are used in molecular cloning and DNA fingerprinting.
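
A restriction digest can be sketched as a search for the recognition sequence followed by a cut at a fixed offset inside each match. The sketch below assumes the EcoRV site GATATC is cut in the middle, between GAT and ATC; the function name digest and the test sequence are illustrative assumptions, and only one strand is modelled.

```python
SITE = "GATATC"   # EcoRV recognition sequence from the text
CUT_OFFSET = 3    # assumed cut position: between GAT and ATC


def digest(sequence):
    """Split a single-strand sequence at every occurrence of the recognition site."""
    fragments = []
    start = 0
    pos = sequence.find(SITE)
    while pos != -1:
        cut = pos + CUT_OFFSET
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(SITE, pos + 1)
    # Whatever follows the last cut is the final fragment.
    fragments.append(sequence[start:])
    return fragments
```

Joining the fragments in order gives back the input sequence, which is in effect what a ligase does when it reseals the cut strands.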
Enzymes called DNA ligases can rejoin cut or broken DNA strands.[117] Ligases are particularly important in lagging
strand DNA replication, as they join together the short segments of DNA produced at the replication fork into a
complete copy of the DNA template. They are also used in DNA repair and genetic recombination.[117]
Topoisomerases and helicases

Topoisomerases are enzymes with both nuclease and ligase activity. These proteins change the amount
of supercoiling in DNA. Some of these enzymes work by cutting the DNA helix and allowing one section to rotate,
thereby reducing its level of supercoiling; the enzyme then seals the DNA break.[39] Other types of these enzymes
are capable of cutting one DNA helix and then passing a second strand of DNA through this break, before rejoining
the helix.[118] Topoisomerases are required for many processes involving DNA, such as DNA replication and
transcription.[40]
Helicases are proteins that are a type of molecular motor. They use the chemical energy in nucleoside
triphosphates, predominantly ATP, to break hydrogen bonds between bases and unwind the DNA double helix into
single strands.[119] These enzymes are essential for most processes where enzymes need to access the DNA bases.
Polymerases
Polymerases are enzymes that synthesize polynucleotide chains from nucleoside triphosphates. The sequence of
their products is created based on existing polynucleotide chains, which are called templates. These enzymes
function by repeatedly adding a nucleotide to the 3′ hydroxyl group at the end of the growing polynucleotide chain.
As a consequence, all polymerases work in a 5′ to 3′ direction.[120] In the active site of these enzymes, the incoming
nucleoside triphosphate base-pairs to the template: this allows polymerases to accurately synthesize the
complementary strand of their template. Polymerases are classified according to the type of template that they use.
In DNA replication, DNA-dependent DNA polymerases make copies of DNA polynucleotide chains. In order to
preserve biological information, it is essential that the sequence of bases in each copy is precisely complementary
to the sequence of bases in the template strand. Many DNA polymerases have a proofreading activity. Here, the
polymerase recognizes the occasional mistakes in the synthesis reaction by the lack of base pairing between the
mismatched nucleotides. If a mismatch is detected, a 3′ to 5′ exonuclease activity is activated and the incorrect base
removed.[121] In most organisms, DNA polymerases function in a large complex called the replisome that contains
multiple accessory subunits, such as the DNA clamp or helicases.[122]
RNA-dependent DNA polymerases are a specialized class of polymerases that copy the sequence of an RNA
strand into DNA. They include reverse transcriptase, which is a viral enzyme involved in the infection of cells
by retroviruses, and telomerase, which is required for the replication of telomeres.[57][123] Telomerase is an unusual
polymerase because it contains its own RNA template as part of its structure.[58]
Transcription is carried out by a DNA-dependent RNA polymerase that copies the sequence of a DNA strand into
RNA. To begin transcribing a gene, the RNA polymerase binds to a sequence of DNA called a promoter and
separates the DNA strands. It then copies the gene sequence into a messenger RNA transcript until it reaches a
region of DNA called the terminator, where it halts and detaches from the DNA. As with human DNA-dependent
DNA polymerases, RNA polymerase II, the enzyme that transcribes most of the genes in the human genome,
operates as part of a large protein complex with multiple regulatory and accessory subunits.[124]

Genetic recombination

Structure of the Holliday junction intermediate in genetic recombination. The four separate DNA strands are coloured red,
blue, green and yellow.[125]

Further information: Genetic recombination

Recombination involves the breakage and rejoining of two chromosomes (M and F) to produce two re-arranged chromosomes
(C1 and C2).

A DNA helix usually does not interact with other segments of DNA, and in human cells the different chromosomes
even occupy separate areas in the nucleus called "chromosome territories".[126] This physical separation of different
chromosomes is important for the ability of DNA to function as a stable repository for information, as one of the few
times chromosomes interact is during chromosomal crossover when they recombine. Chromosomal crossover is
when two DNA helices break, swap a section and then rejoin.
Recombination allows chromosomes to exchange genetic information and produces new combinations of genes,
which increases the efficiency of natural selection and can be important in the rapid evolution of new
proteins.[127] Genetic recombination can also be involved in DNA repair, particularly in the cell's response to double-strand breaks.[128]
The most common form of chromosomal crossover is homologous recombination, where the two chromosomes
involved share very similar sequences. Non-homologous recombination can be damaging to cells, as it can
produce chromosomal translocations and genetic abnormalities. The recombination reaction is catalyzed by
enzymes known as recombinases, such as RAD51.[129] The first step in recombination is a double-stranded break
caused by either an endonuclease or damage to the DNA.[130] A series of steps catalyzed in part by the recombinase
then leads to joining of the two helices by at least one Holliday junction, in which a segment of a single strand in
each helix is annealed to the complementary strand in the other helix. The Holliday junction is a tetrahedral junction
structure that can be moved along the pair of chromosomes, swapping one strand for another. The recombination
reaction is then halted by cleavage of the junction and re-ligation of the released DNA.[131]

Evolution
Further information: RNA world hypothesis

DNA contains the genetic information that allows all modern living things to function, grow and reproduce. However,
it is unclear how long in the 4-billion-year history of life DNA has performed this function, as it has been proposed
that the earliest forms of life may have used RNA as their genetic material.[132][133] RNA may have acted as the central
part of early cell metabolism as it can both transmit genetic information and carry out catalysis as part
of ribozymes.[134] This ancient RNA world where nucleic acid would have been used for both catalysis and genetics
may have influenced the evolution of the current genetic code based on four nucleotide bases. This would occur,
since the number of different bases in such an organism is a trade-off between a small number of bases increasing
replication accuracy and a large number of bases increasing the catalytic efficiency of ribozymes.[135]
However, there is no direct evidence of ancient genetic systems, as recovery of DNA from most fossils is
impossible. This is because DNA survives in the environment for less than one million years, and slowly degrades
into short fragments in solution.[136] Claims for older DNA have been made, most notably a report of the isolation of a
viable bacterium from a salt crystal 250 million years old,[137] but these claims are controversial.[138][139]
Building blocks of DNA (adenine, guanine and related organic molecules) may have been formed extraterrestrially
in outer space.[140][141][142] Complex DNA and RNA organic compounds of life, including uracil, cytosine and thymine,
have also been formed in the laboratory under conditions mimicking those found in outer space, using starting
chemicals, such as pyrimidine, found in meteorites. Pyrimidine, like polycyclic aromatic hydrocarbons (PAHs), the
most carbon-rich chemical found in the universe, may have been formed in red giants or in interstellar dust and gas
clouds.[143]

Uses in technology

Sculpture of DNA, made out of shopping carts

Genetic engineering
Further information: Molecular biology, nucleic acid methods and genetic engineering

Methods have been developed to purify DNA from organisms, such as phenol-chloroform extraction, and to
manipulate it in the laboratory, such as restriction digests and the polymerase chain reaction.
Modern biology and biochemistry make intensive use of these techniques in recombinant DNA
technology. Recombinant DNA is a man-made DNA sequence that has been assembled from other DNA
sequences. Such sequences can be transformed into organisms in the form of plasmids or, in the appropriate format, by using
a viral vector.[144] The genetically modified organisms produced can be used to produce products such as
recombinant proteins, used in medical research,[145] or be grown in agriculture.[146][147]

Forensics
Further information: DNA profiling
Forensic scientists can use DNA in blood, semen, skin, saliva or hair found at a crime scene to identify the matching
DNA of an individual, such as a perpetrator. This process is formally termed DNA profiling, but may also be called
"genetic fingerprinting". In DNA profiling, the lengths of variable sections of repetitive DNA, such as short tandem
repeats and minisatellites, are compared between people. This method is usually an extremely reliable technique for
identifying a matching DNA.[148] However, identification can be complicated if the scene is contaminated with DNA
from several people.[149] DNA profiling was developed in 1984 by British geneticist Sir Alec Jeffreys,[150] and first used
in forensic science to convict Colin Pitchfork in the 1988 Enderby murders case.[151]
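
The comparison of repeat lengths can be illustrated with a toy sketch. It counts how many times a short motif repeats back to back in a sequence and treats two profiles as matching when every locus has the same count. The function names, the motif GATA, the locus names and the sequences are all hypothetical; laboratory profiling measures fragment lengths at a standard panel of loci rather than scanning raw sequence this way.

```python
def longest_tandem_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif found in sequence."""
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        count = 0
        pos = start
        # Count consecutive copies of the motif beginning at this position.
        while sequence.startswith(motif, pos):
            count += 1
            pos += step
        best = max(best, count)
    return best


def profiles_match(profile_a, profile_b):
    """Two profiles (locus name -> repeat count) match only if every locus agrees."""
    return profile_a == profile_b
```

For example, a hypothetical scene sample and suspect sample that both show three GATA repeats at one locus and five at another would be reported as matching, while a difference at any single locus would not.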
The development of forensic science, and the ability to now obtain genetic matching on minute samples of blood,
skin, saliva or hair has led to a re-examination of a number of cases. Evidence can now be uncovered that was not
scientifically possible at the time of the original examination. Combined with the removal of the double jeopardy law
in some places, this can allow cases to be reopened where previous trials have failed to produce sufficient evidence
to convince a jury. People charged with serious crimes may be required to provide a sample of DNA for matching
purposes. The most obvious defence to DNA matches obtained forensically is to claim that cross-contamination of
evidence has taken place. This has resulted in meticulously strict handling procedures with new cases of serious
crime. DNA profiling is also used to identify victims of mass casualty incidents.[152] As well as positively identifying
bodies or body parts in serious accidents, DNA profiling is being successfully used to identify individual victims in
mass war graves by matching them to family members.

Bioinformatics
Further information: Bioinformatics
Bioinformatics involves the manipulation, searching, and data mining of biological data, and this includes DNA
sequence data. The development of techniques to store and search DNA sequences has led to widely applied
advances in computer science, especially string searching algorithms, machine learning and database
theory.[153] String searching or matching algorithms, which find an occurrence of a sequence of letters inside a larger
sequence of letters, were developed to search for specific sequences of nucleotides.[154] The DNA sequence may
be aligned with other DNA sequences to identify homologous sequences and locate the specific mutations that make
them distinct. These techniques, especially multiple sequence alignment, are used in
studying phylogenetic relationships and protein function.[155] Data sets representing entire genomes' worth of DNA
sequences, such as those produced by the Human Genome Project, are difficult to use without the annotations that
identify the locations of genes and regulatory elements on each chromosome. Regions of DNA sequence that have
the characteristic patterns associated with protein- or RNA-coding genes can be identified by gene
finding algorithms, which allow researchers to predict the presence of particular gene products and their possible
functions in an organism even before they have been isolated experimentally.[156] Entire genomes may also be
compared, which can shed light on the evolutionary history of particular organisms and permit the examination of
complex evolutionary events.
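
Two of the operations described above, finding a short sequence inside a longer one and locating the positions where two aligned sequences differ, can be sketched as follows. The function names and test sequences are illustrative assumptions; production tools use indexed search and gapped alignment rather than this direct scan.

```python
def find_all(genome, pattern):
    """Return every start position of pattern in genome, including overlapping hits."""
    positions = []
    pos = genome.find(pattern)
    while pos != -1:
        positions.append(pos)
        pos = genome.find(pattern, pos + 1)
    return positions


def mismatch_positions(seq_a, seq_b):
    """Return the positions at which two equal-length, already aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
```

The second function assumes the sequences are already aligned without gaps; handling insertions and deletions is the harder problem that sequence alignment algorithms address.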

DNA nanotechnology

The DNA structure at left (schematic shown) will self-assemble into the structure visualized by atomic force microscopy at
right. DNA nanotechnology is the field that seeks to design nanoscale structures using the molecular recognition properties of
DNA molecules. Image from Strong, 2004.

Further information: DNA nanotechnology


DNA nanotechnology uses the unique molecular recognition properties of DNA and other nucleic acids to create
self-assembling branched DNA complexes with useful properties.[157] DNA is thus used as a structural material rather
than as a carrier of biological information. This has led to the creation of two-dimensional periodic lattices (both tile-based and using the "DNA origami" method) as well as three-dimensional structures in the shapes
of polyhedra.[158] Nanomechanical devices and algorithmic self-assembly have also been demonstrated,[159] and these
DNA structures have been used to template the arrangement of other molecules such as gold
nanoparticles and streptavidin proteins.[160]

History and anthropology


Further information: Phylogenetics and Genetic genealogy
Because DNA collects mutations over time, which are then inherited, it contains historical information, and, by
comparing DNA sequences, geneticists can infer the evolutionary history of organisms, their phylogeny.[161] This field
of phylogenetics is a powerful tool in evolutionary biology. If DNA sequences within a species are
compared, population geneticists can learn the history of particular populations. This can be used in studies ranging
from ecological genetics to anthropology; for example, DNA evidence is being used to try to identify the Ten Lost
Tribes of Israel.[162][163]

Information storage
Main article: DNA digital data storage
In a paper published in Nature in January 2013, scientists from the European Bioinformatics Institute and Agilent
Technologies proposed a mechanism to use DNA's ability to code information as a means of digital data storage.
The group was able to encode 739 kilobytes of data into DNA code, synthesize the actual DNA, then sequence the
DNA and decode the information back to its original form, with a reported 100% accuracy. The encoded information
consisted of text files and audio files. A prior experiment was published in August 2012. It was conducted by
researchers at Harvard University, where the text of a 54,000-word book was encoded in DNA.[164][165]
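
The idea of writing digital data as a base sequence can be sketched with the simplest possible mapping, two bits per base. This mapping is an assumption chosen for illustration and is not the encoding used in the experiments described above, which the text does not specify; the names encode and decode are likewise hypothetical.

```python
# Illustrative mapping only: each pair of bits becomes one base.
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS_FOR_BASE = {base: bits for bits, base in BASE_FOR_BITS.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a base sequence, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Recover the original bytes from a sequence produced by encode."""
    bits = "".join(BITS_FOR_BASE[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

A practical scheme would also need error correction and would have to avoid sequences that are hard to synthesize or read, neither of which this sketch attempts.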

History of DNA research


Further information: History of molecular biology

James Watson and Francis Crick (right), co-originators of the double-helix model, with Maclyn McCarty (left).

DNA was first isolated by the Swiss physician Friedrich Miescher who, in 1869, discovered a microscopic substance
in the pus of discarded surgical bandages. As it resided in the nuclei of cells, he called it "nuclein".[166][167] In
1878, Albrecht Kossel isolated the non-protein component of "nuclein", nucleic acid, and later isolated its five
primary nucleobases.[168][169] In 1919, Phoebus Levene identified the base, sugar and phosphate nucleotide
unit.[170] Levene suggested that DNA consisted of a string of nucleotide units linked together through the phosphate
groups. Levene thought the chain was short and the bases repeated in a fixed order. In 1937, William
Astbury produced the first X-ray diffraction patterns that showed that DNA had a regular structure.[171]
In 1927, Nikolai Koltsov proposed that inherited traits would be inherited via a "giant hereditary molecule" made up
of "two mirror strands that would replicate in a semi-conservative fashion using each strand as a template".[172][173] In
1928, Frederick Griffith in his experiment discovered that traits of the "smooth" form of Pneumococcus could be
transferred to the "rough" form of the same bacteria by mixing killed "smooth" bacteria with the live "rough"
form.[174][175] This system provided the first clear suggestion that DNA carries genetic information (the Avery–MacLeod–McCarty
experiment) when Oswald Avery, along with coworkers Colin MacLeod and Maclyn McCarty,
identified DNA as the transforming principle in 1943.[176] DNA's role in heredity was confirmed in 1952, when Alfred
Hershey and Martha Chase in the Hershey–Chase experiment showed that DNA is the genetic material of the T2
phage.[177]
In 1953, James Watson and Francis Crick suggested what is now accepted as the first correct double-helix model
of DNA structure in the journal Nature.[6] Their double-helix, molecular model of DNA was then based on a single X-ray diffraction image (labeled as "Photo 51")[178] taken by Rosalind Franklin and Raymond Gosling in May 1952, as
well as the information that the DNA bases are paired, also obtained through private communications from Erwin
Chargaff in the previous years.
Experimental evidence supporting the Watson and Crick model was published in a series of five articles in the same
issue of Nature.[179] Of these, Franklin and Gosling's paper was the first publication of their own X-ray diffraction data
and original analysis method that partially supported the Watson and Crick model;[43][180] this issue also contained an
article on DNA structure by Maurice Wilkins and two of his colleagues, whose analysis and in vivo B-DNA X-ray
patterns also supported the presence in vivo of the double-helical DNA configurations as proposed by Crick and
Watson for their double-helix molecular model of DNA in the previous two pages of Nature.[44] In 1962, after
Franklin's death, Watson, Crick, and Wilkins jointly received the Nobel Prize in Physiology or Medicine.[181] Nobel
Prizes were awarded only to living recipients at the time. A debate continues about who should receive credit for the
discovery.[182]
In an influential presentation in 1957, Crick laid out the central dogma of molecular biology, which foretold the
relationship between DNA, RNA, and proteins, and articulated the "adaptor hypothesis".[183] Final confirmation of the
replication mechanism that was implied by the double-helical structure followed in 1958 through the Meselson–Stahl
experiment.[184] Further work by Crick and coworkers showed that the genetic code was based on non-overlapping
triplets of bases, called codons, allowing Har Gobind Khorana, Robert W. Holley and Marshall Warren Nirenberg to
decipher the genetic code.[185] These findings represent the birth of molecular biology.

What is DNA?
DNA, or deoxyribonucleic acid, is the hereditary material in humans and almost all other
organisms. Nearly every cell in a person's body has the same DNA. Most DNA is located in the
cell nucleus (where it is called nuclear DNA), but a small amount of DNA can also be found in the
mitochondria (where it is called mitochondrial DNA or mtDNA).
The information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine
(G), cytosine (C), and thymine (T). Human DNA consists of about 3 billion bases, and more than
99 percent of those bases are the same in all people. The order, or sequence, of these bases
determines the information available for building and maintaining an organism, similar to the way
in which letters of the alphabet appear in a certain order to form words and sentences.
DNA bases pair up with each other, A with T and C with G, to form units called base pairs. Each
base is also attached to a sugar molecule and a phosphate molecule. Together, a base, sugar, and
phosphate are called a nucleotide. Nucleotides are arranged in two long strands that form a spiral
called a double helix. The structure of the double helix is somewhat like a ladder, with the base
pairs forming the ladder's rungs and the sugar and phosphate molecules forming the vertical
sidepieces of the ladder.
An important property of DNA is that it can replicate, or make copies of itself. Each strand of DNA
in the double helix can serve as a pattern for duplicating the sequence of bases. This is critical when
cells divide because each new cell needs to have an exact copy of the DNA present in the old cell.

DNA is a double helix formed by base pairs attached to a sugar-phosphate backbone.


What is DNA?
We all know that elephants only give birth to little elephants, giraffes to giraffes, dogs to dogs and so on for every
type of living creature. But why is this so?
The answer lies in a molecule called deoxyribonucleic acid (DNA), which contains the biological instructions that make
each species unique. DNA, along with the instructions it contains, is passed from adult organisms to their offspring
during reproduction.

Where is DNA found?


In organisms called eukaryotes, DNA is found inside a special area of the cell called the nucleus. Because the cell is
very small, and because organisms have many DNA molecules per cell, each DNA molecule must be tightly packaged.
This packaged form of the DNA is called a chromosome.
During DNA replication, DNA unwinds so it can be copied. At other times in the cell cycle, DNA also unwinds so that its
instructions can be used to make proteins and for other biological processes. But during cell division, DNA is in its
compact chromosome form to enable transfer to new cells.
Researchers refer to DNA found in the cell's nucleus as nuclear DNA. An organism's complete set of nuclear DNA is
called its genome.
Besides the DNA located in the nucleus, humans and other complex organisms also have a small amount of DNA in
cell structures known as mitochondria. Mitochondria generate the energy the cell needs to function properly.
In sexual reproduction, organisms inherit half of their nuclear DNA from the male parent and half from the female
parent. However, organisms inherit all of their mitochondrial DNA from the female parent. This occurs because only
egg cells, and not sperm cells, keep their mitochondria during fertilization.

What is DNA made of?


DNA is made of chemical building blocks called nucleotides. These building blocks are made of three parts: a
phosphate group, a sugar group and one of four types of nitrogen bases. To form a strand of DNA, nucleotides are
linked into chains, with the phosphate and sugar groups alternating.
The four types of nitrogen bases found in nucleotides are: adenine (A), thymine (T), guanine (G) and cytosine (C).
The order, or sequence, of these bases determines what biological instructions are contained in a strand of DNA. For
example, the sequence ATCGTT might instruct for blue eyes, while ATCGCT might instruct for brown.
The complete DNA instruction book, or genome, for a human contains about 3 billion bases and about 20,000 genes
on 23 pairs of chromosomes.

What does DNA do?


DNA contains the instructions needed for an organism to develop, survive and reproduce. To carry out these
functions, DNA sequences must be converted into messages that can be used to produce proteins, which are the
complex molecules that do most of the work in our bodies.
Each DNA sequence that contains instructions to make a protein is known as a gene. The size of a gene may vary
greatly, ranging from about 1,000 bases to 1 million bases in humans. Genes only make up about 1 percent of the
DNA sequence. DNA sequences outside this 1 percent are involved in regulating when, how and how much of a protein
is made.

How are DNA sequences used to make proteins?


DNA's instructions are used to make proteins in a two-step process. First, enzymes read the information in a DNA
molecule and transcribe it into an intermediary molecule called messenger ribonucleic acid, or mRNA.
Next, the information contained in the mRNA molecule is translated into the "language" of amino acids, which are the
building blocks of proteins. This language tells the cell's protein-making machinery the precise order in which to link
the amino acids to produce a specific protein. This is a major task because there are 20 types of amino acids, which
can be placed in many different orders to form a wide variety of proteins.

Who discovered DNA?


The Swiss biochemist Friedrich Miescher first observed DNA in the late 1800s. But nearly a century passed from that
discovery until researchers unraveled the structure of the DNA molecule and realized its central importance to biology.
For many years, scientists debated which molecule carried life's biological instructions. Most thought that DNA was too
simple a molecule to play such a critical role. Instead, they argued that proteins were more likely to carry out this
vital function because of their greater complexity and wider variety of forms.
The importance of DNA became clear in 1953 thanks to the work of James Watson, Francis Crick, Maurice Wilkins and
Rosalind Franklin. By studying X-ray diffraction patterns and building models, the scientists figured out the double
helix structure of DNA - a structure that enables it to carry biological information from one generation to the next.

What is the DNA double helix?


Scientists use the term "double helix" to describe DNA's winding, two-stranded chemical structure. This shape - which
looks much like a twisted ladder - gives DNA the power to pass along biological instructions with great precision.
To understand DNA's double helix from a chemical standpoint, picture the sides of the ladder as strands of alternating
sugar and phosphate groups - strands that run in opposite directions. Each "rung" of the ladder is made up of two
nitrogen bases, paired together by hydrogen bonds. Because of the highly specific nature of this type of chemical
pairing, base A always pairs with base T, and likewise C with G. So, if you know the sequence of the bases on one
strand of a DNA double helix, it is a simple matter to figure out the sequence of bases on the other strand.
DNA's unique structure enables the molecule to copy itself during cell division. When a cell prepares to divide, the
DNA helix splits down the middle and becomes two single strands. These single strands serve as templates for
building two new, double-stranded DNA molecules - each a replica of the original DNA molecule. In this process, an A
base is added wherever there is a T, a C where there is a G, and so on until all of the bases once again have partners.
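The pairing rule makes it easy to work out one strand from the other. The sketch below is a minimal illustration; the function names are invented here. Because the two strands run in opposite directions, the partner strand is the complement read in reverse.

```python
# Base-pairing rule: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base that pairs with each base, position by position."""
    return "".join(PAIR[base] for base in strand.upper())


def partner_strand(strand):
    """Return the opposite strand read in its own direction (reverse complement)."""
    return complement(strand)[::-1]


print(complement("ATCGTT"))      # TAGCAA
print(partner_strand("ATCGTT"))  # AACGAT
```

Applying `complement` twice returns the original sequence, which is the property that lets each separated strand serve as a template for a full copy.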

DNA profiling
From Wikipedia, the free encyclopedia

This article is about DNA profiling in forensics. For other uses, see DNA profiling (disambiguation).
For DNA testing for inherited diseases, see Genetic testing.
Not to be confused with DNA barcoding or DNA phenotyping.

Forensic DNA profiling (also called DNA testing or DNA typing) is a technique employed by forensic scientists to
identify individuals by characteristics of their DNA. DNA profiles are a small set of DNA variations that are very
likely to be different in all unrelated individuals. DNA profiling should not be confused with full genome
sequencing.[1] DNA profiling is used in, for example, parentage testing and criminal investigation.
Although 99.9% of human DNA sequences are the same in every person, enough of the DNA is different that it is
possible to distinguish one individual from another, unless they are monozygotic ("identical") twins.[2] DNA profiling
uses repetitive ("repeat") sequences that are highly variable,[2] called variable number tandem repeats (VNTRs), in
particular short tandem repeats (STRs). VNTR loci are very similar between closely related humans, but are so
variable that unrelated individuals are extremely unlikely to have the same VNTRs.
The DNA profiling technique was first reported in 1985.[3]
Contents

1 DNA profiling process


o 1.1 RFLP analysis
o 1.2 PCR analysis
o 1.3 STR analysis
o 1.4 AmpFLP
o 1.5 DNA family relationship analysis
o 1.6 Y-chromosome analysis
o 1.7 Mitochondrial analysis
2 DNA databases
3 Considerations when evaluating DNA evidence
o 3.1 Evidence of genetic relationship
4 Fake DNA evidence
5 DNA evidence as evidence in criminal trials
o 5.1 Familial DNA searching
o 5.2 Partial matches
o 5.3 Surreptitious DNA collecting
o 5.4 England and Wales
5.4.1 Presentation and evaluation of evidence of partial or incomplete DNA profiles
o 5.5 DNA testing in the United States
o 5.6 Development of artificial DNA
6 Cases
7 See also
8 References
9 Further reading
10 External links

DNA profiling process

Variations of VNTR allele lengths in 6 individuals.

Alec Jeffreys, the pioneer of DNA profiling.

Developed by Alec Jeffreys, the process begins with a sample of an individual's DNA (typically called a "reference
sample"). The most desirable method of collecting a reference sample is the use of a buccal swab, as this reduces
the possibility of contamination. When this is not available (e.g. because a court order is needed but not obtainable)
other methods may need to be used to collect a sample of blood, saliva, semen, or other appropriate fluid or tissue
from personal items (e.g. a toothbrush, razor) or from stored samples (e.g. banked sperm or biopsy tissue).
Samples obtained from blood relatives (related by birth, not marriage) can provide an indication of an individual's
profile, as could human remains that had been previously profiled.
A reference sample is then analyzed to create the individual's DNA profile using one of a number of techniques,
discussed below. The DNA profile is then compared against another sample to determine whether there is a genetic
match.

RFLP analysis
Main article: Restriction fragment length polymorphism
The first methods used for DNA profiling involved RFLP analysis. DNA is collected from
cells, such as a blood sample, and cut into small pieces using a restriction enzyme (a restriction digest). This
generates thousands of DNA fragments of differing sizes as a consequence of variations between DNA sequences
of different individuals. The fragments are then separated on the basis of size using gel electrophoresis.
The separated fragments are then transferred to a nitrocellulose or nylon filter; this procedure is called a Southern
blot. The DNA fragments within the blot are permanently fixed to the filter, and the DNA strands
are denatured. Radiolabeled probe molecules are then added that are complementary to sequences in
the genome that contain repeat sequences. These repeat sequences tend to vary in length among different
individuals and are called variable number tandem repeat sequences or VNTRs. The probe molecules hybridize to
DNA fragments containing the repeat sequences and excess probe molecules are washed away. The blot is then
exposed to an X-ray film. Fragments of DNA that have bound to the probe molecules appear as dark bands on the
film.
The Southern blot technique is laborious, and requires large amounts of undegraded sample DNA. Also, Karl
Brown's original technique looked at many minisatellite loci at the same time, increasing the observed variability, but
making it hard to discern individual alleles (and thereby precluding parental testing). These early techniques have
been supplanted by PCR-based assays.
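The idea behind the restriction digest can be sketched as follows. Everything in this example is illustrative: the two sequences are made up, the recognition site GAATTC (cut after the first base) is used as an example enzyme site, and `digest` is a name invented for this sketch. The two "individuals" differ only in the number of repeats between two cut sites, so the digest yields one fragment of a different length.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut sequence at every occurrence of site, cut_offset bases into the site."""
    fragments = []
    start = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])
    return fragments


# Hypothetical sequences: same flanking cut sites, different repeat counts.
person_a = "TTGAATTC" + "CAGT" * 3 + "GAATTCAA"
person_b = "TTGAATTC" + "CAGT" * 5 + "GAATTCAA"

print([len(fragment) for fragment in digest(person_a)])  # [3, 18, 7]
print([len(fragment) for fragment in digest(person_b)])  # [3, 26, 7]
```

Gel electrophoresis then separates fragments by these lengths, which is why the middle fragment would appear as a band in a different position for each person.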

PCR analysis
Main article: polymerase chain reaction

In 1983, Kary Mullis developed a process by which specific portions of the sample DNA can be
amplified almost indefinitely (Saiki et al. 1985, 1988). This has revolutionized the whole field of DNA study. The
process, the polymerase chain reaction (PCR), mimics the biological process of DNA replication, but confines it to
specific DNA sequences of interest. With the invention of the PCR technique, DNA profiling took huge strides
forward in both discriminating power and the ability to recover information from very small (or degraded) starting
samples.
PCR greatly amplifies the amounts of a specific region of DNA. In the PCR process, the DNA sample is denatured
into the separate individual polynucleotide strands through heating. Two oligonucleotide DNA primers are used to
hybridize to two corresponding nearby sites on opposite DNA strands in such a fashion that the normal enzymatic
extension of the active terminal of each primer (that is, the 3′ end) leads toward the other primer. PCR uses
replication enzymes that are tolerant of high temperatures, such as the thermostable Taq polymerase. In this
fashion, two new copies of the sequence of interest are generated. Repeated denaturation, hybridization, and
extension in this fashion produce an exponentially growing number of copies of the DNA of interest. Instruments that
perform thermal cycling are now readily available from commercial sources. This process can produce a million-fold
or greater amplification of the desired region in 2 hours or less.
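The "million-fold" figure follows from simple doubling. As a rough sketch, assuming every cycle copies every target molecule (real reactions fall short of this, which the hypothetical `efficiency` parameter stands in for):

```python
import math


def copies_after(cycles, starting_copies=1, efficiency=1.0):
    """Copies after a number of cycles; efficiency=1.0 means perfect doubling."""
    return starting_copies * (1 + efficiency) ** cycles


def cycles_needed(fold_amplification):
    """Smallest number of perfect-doubling cycles that reaches the target."""
    return math.ceil(math.log2(fold_amplification))


print(copies_after(20))          # 1048576.0
print(cycles_needed(1_000_000))  # 20
```

Under this idealized assumption, about 20 cycles are enough for a million-fold amplification of the target region.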
Early assays such as the HLA-DQ alpha reverse dot blot strips grew to be very popular due to their ease of use, and
the speed with which a result could be obtained. However, they were not as discriminating as RFLP analysis. It was
also difficult to determine a DNA profile for mixed samples, such as a vaginal swab from a sexual assault victim.
However, the PCR method was readily adaptable for analyzing VNTR loci, in particular STR loci. In recent years,
research in human DNA quantitation has focused on new "real-time" quantitative PCR (qPCR) techniques.
Quantitative PCR methods enable automated, precise, and high-throughput measurements. Interlaboratory studies
have demonstrated the importance of human DNA quantitation in achieving reliable interpretation of STR typing
and obtaining consistent results across laboratories.

STR analysis
Main article: Short tandem repeats
The system of DNA profiling used today is based on PCR and uses short tandem repeats (STR). This method uses
highly polymorphic regions that have short repeated sequences of DNA (the most common is 4 bases repeated, but
there are other lengths in use, including 3 and 5 bases). Because unrelated people almost certainly have different
numbers of repeat units, STRs can be used to discriminate between unrelated individuals. These STR loci (locations
on a chromosome) are targeted with sequence-specific primers and amplified using PCR. The DNA fragments that
result are then separated and detected using electrophoresis. There are two common methods of separation and
detection, capillary electrophoresis (CE) and gel electrophoresis.
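What the analysis distinguishes at each STR locus is how many times the short motif repeats. In the laboratory this is read from fragment length, but the underlying idea can be sketched directly on a sequence. The motif GATA and the sequences below are illustrative assumptions, and `longest_repeat_run` is a name invented for this example.

```python
import re


def longest_repeat_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif in sequence."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Hypothetical alleles at one locus: same flanking sequence, different repeat counts.
allele_1 = "CCTA" + "GATA" * 7 + "TTCG"
allele_2 = "CCTA" + "GATA" * 9 + "TTCG"

print(longest_repeat_run(allele_1, "GATA"))  # 7
print(longest_repeat_run(allele_2, "GATA"))  # 9
```

A person carrying these two alleles would be described at this locus by the pair of repeat counts, here 7 and 9.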
Each STR is polymorphic, but the number of alleles is very small. Typically each STR allele will be shared by around
5 - 20% of individuals. The power of STR analysis comes from looking at multiple STR loci simultaneously. The
pattern of alleles can identify an individual quite accurately. Thus STR analysis provides an excellent identification
tool. The more STR regions that are tested in an individual the more discriminating the test becomes.
From country to country, different STR-based DNA-profiling systems are in use. In North America, systems that
amplify the CODIS 13 core loci are almost universal, whereas in the United Kingdom the SGM+ 11 loci system
(which is compatible with the National DNA Database) is in use. Whichever system is used, many of the STR
regions used are the same. These DNA-profiling systems are based on multiplex reactions, whereby many STR
regions will be tested at the same time.
The true power of STR analysis is in its statistical power of discrimination. Because the 13 loci that are currently
used for discrimination in CODIS are independently assorted (having a certain number of repeats at one locus does
not change the likelihood of having any number of repeats at any other locus), the product rule for probabilities can
be applied. This means that, if someone has the DNA type of ABC, where the three loci were independent, we can
say that the probability of having that DNA type is the probability of having type A times the probability of having
type B times the probability of having type C. This has resulted in the ability to generate match probabilities of 1 in a
quintillion (1 × 10^18) or more. However, DNA database searches have shown false DNA profile matches much more
frequently than expected.[4] Moreover, since there are about 12 million monozygotic twins on Earth, the theoretical
probability is not accurate.
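The product rule described above is a single multiplication. In the sketch below the per-locus frequencies are made-up numbers chosen only to show the arithmetic, and the calculation is valid only under the independence assumption stated in the text.

```python
from math import prod


def random_match_probability(locus_frequencies):
    """Multiply per-locus genotype frequencies, assuming independent loci."""
    return prod(locus_frequencies)


# Hypothetical frequencies for types A, B and C at three independent loci.
frequencies = [0.1, 0.2, 0.05]
probability = random_match_probability(frequencies)
print(probability)             # about 0.001
print(round(1 / probability))  # 1000, i.e. roughly 1 in 1,000
```

Each additional independent locus multiplies the probability down further, which is how 13 loci can produce the very small theoretical figures quoted above.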
In practice, the risk of a match caused by contamination is much greater than the risk of matching a distant relative.
Contamination can come from objects near the sample or from left-over cells transferred from a prior test. The risk is
greatest for matching the person who appears most often in the samples: everything collected from, or in contact with, a
victim is a major source of contamination for any other samples brought into a lab. For that reason, multiple control
samples are typically prepared during the same period as the actual test samples and tested to ensure that they stayed
clean. Unexpected matches (or variations) in several control samples indicate a high probability of contamination for
the actual test samples. In a relationship test, the full DNA profiles should differ (except for twins), to prove that a
person was not actually matched as being related to their own DNA in another sample.

AmpFLP
Main article: Amplified fragment length polymorphism
Another technique, AmpFLP, or amplified fragment length polymorphism, was also put into practice during the early
1990s. This technique was also faster than RFLP analysis and used PCR to amplify DNA samples. It relied
on variable number tandem repeat (VNTR) polymorphisms to distinguish various alleles, which were separated on
a polyacrylamide gel using an allelic ladder (as opposed to a molecular weight ladder). Bands could be visualized
by silver staining the gel. One popular locus for fingerprinting was the D1S80 locus. As with all PCR based methods,
highly degraded DNA or very small amounts of DNA may cause allelic dropout (causing a mistake in thinking a
heterozygote is a homozygote) or other stochastic effects. In addition, because the analysis is done on a gel, very
high number repeats may bunch together at the top of the gel, making it difficult to resolve. AmpFLP analysis can be
highly automated, and allows for easy creation of phylogenetic trees based on comparing individual samples of
DNA. Due to its relatively low cost and ease of set-up and operation, AmpFLP remains popular in lower income
countries.

DNA family relationship analysis

1: A cell sample is taken, usually a cheek swab or blood test
2: DNA is extracted from sample
3: Cleavage of DNA by restriction enzyme: the DNA is broken into small fragments
4: Small fragments are amplified by the Polymerase Chain Reaction, which results in many more fragments
5: DNA fragments are separated by electrophoresis
6: The fragments are transferred to an agar plate
7: On the Agar Plate specific DNA fragments are bound to a radioactive DNA probe
8: The Agar Plate is washed free of excess probe
9: An x-ray film is used to detect a radioactive pattern
10: The DNA is compared to other DNA samples

Using PCR technology, DNA analysis is widely applied to determine genetic family relationships such as paternity,
maternity, siblingship and other kinships.
During conception, the father's sperm cell and the mother's egg cell, each containing half the amount of DNA found
in other body cells, meet and fuse to form a fertilized egg, called a zygote. The zygote contains a complete set of
DNA molecules, a unique combination of DNA from both parents. This zygote divides and multiplies into an embryo
and later, a full human being.
At each stage of development, all the cells forming the body contain the same DNA: half from the father and half
from the mother. This fact allows relationship testing to use all types of samples, including loose cells from the
cheeks collected using buccal swabs, blood or other types of samples.
There are predictable inheritance patterns at certain locations (called loci) in the human genome, which have been
found to be useful in determining identity and biological relationships. These loci contain specific DNA markers that
scientists use to identify individuals. In a routine DNA paternity test, the markers used are Short Tandem
Repeats (STRs), short pieces of DNA that occur in highly differential repeat patterns among individuals.
Each person's DNA contains two copies of these markers: one copy inherited from the father and one from the
mother. Within a population, the markers at each person's DNA location could differ in length and sometimes
sequence, depending on the markers inherited from the parents.
The combination of marker sizes found in each person makes up his/her unique genetic profile. When determining
the relationship between two individuals, their genetic profiles are compared to see if they share the same
inheritance patterns at a statistically conclusive rate.
For example, the following sample report from a commercial DNA paternity testing laboratory, Universal Genetics,
shows how relatedness between parents and child is identified on those special markers:
DNA Marker    Mother      Child     Alleged father
D21S11        28, 30      28, 31    29, 31
D7S820        9, 10       10, 11    11, 12
TH01          14, 15      14, 16    15, 16
D13S317       7, 8        7, 9      8, 9
D19S433       14, 16.2    14, 15    15, 17

The partial results indicate that the child and the alleged father's DNA match among these five markers. The
complete test results show this correlation on 16 markers between the child and the tested man to enable a
conclusion to be drawn as to whether or not the man is the biological father.
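The match in the sample report can be checked mechanically: at every marker, one of the child's two alleles should be found in the mother and the other in the alleged father. The sketch below uses the allele values from the report; the function name and data layout are invented for illustration, and the check ignores mutations and other complications a real laboratory must consider.

```python
# Allele pairs from the sample report: (mother, child, alleged father).
REPORT = {
    "D21S11":  ((28, 30),   (28, 31), (29, 31)),
    "D7S820":  ((9, 10),    (10, 11), (11, 12)),
    "TH01":    ((14, 15),   (14, 16), (15, 16)),
    "D13S317": ((7, 8),     (7, 9),   (8, 9)),
    "D19S433": ((14, 16.2), (14, 15), (15, 17)),
}


def consistent_with_parentage(mother, child, alleged_father):
    """True if one child allele can come from the mother and the other from the man."""
    first, second = child
    return ((first in mother and second in alleged_father)
            or (second in mother and first in alleged_father))


for marker, (mother, child, alleged_father) in REPORT.items():
    print(marker, consistent_with_parentage(mother, child, alleged_father))
```

All five markers print `True`. A marker where the child's non-maternal allele is absent from the alleged father would print `False` and count against paternity.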
Each marker is assigned with a Paternity Index (PI), which is a statistical measure of how powerfully a match at a
particular marker indicates paternity. The PI of each marker is multiplied with each other to generate the Combined
Paternity Index (CPI), which indicates the overall probability of an individual being the biological father of the tested
child relative to a randomly selected man from the entire population of the same race. The CPI is then converted
into a Probability of Paternity showing the degree of relatedness between the alleged father and child.
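The arithmetic can be sketched as follows. The individual Paternity Index values are hypothetical, and the conversion uses Bayes' rule with a prior probability of 0.5, a common convention that is an assumption here rather than something the text specifies.

```python
from math import prod


def combined_paternity_index(paternity_indices):
    """Multiply the Paternity Index of every tested marker."""
    return prod(paternity_indices)


def probability_of_paternity(cpi, prior=0.5):
    """Convert a CPI to a probability using Bayes' rule with the given prior."""
    return cpi * prior / (cpi * prior + (1 - prior))


# Hypothetical Paternity Index values for three markers.
cpi = combined_paternity_index([2.0, 5.0, 10.0])
print(cpi)                                      # 100.0
print(round(probability_of_paternity(cpi), 4))  # 0.9901
```

With a prior of 0.5 the formula reduces to CPI / (CPI + 1), so a CPI of 100 corresponds to a probability of paternity of about 99%.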
The DNA test report in other family relationship tests, such as grandparentage and siblingship tests, is similar to a
paternity test report. Instead of the Combined Paternity Index, a different value, such as a Siblingship Index, is
reported.
The report shows the genetic profiles of each tested person. If there are markers shared among the tested
individuals, the probability of biological relationship is calculated to determine how likely the tested individuals share
the same markers due to a blood relationship.

Y-chromosome analysis

Recent innovations have included the creation of primers targeting polymorphic regions on the Y-chromosome (Y-STR), which allows resolution of a mixed DNA sample from a male and female or cases in which a differential
extraction is not possible. Y-chromosomes are paternally inherited, so Y-STR analysis can help in the identification
of paternally related males. Y-STR analysis was performed in the Sally Hemings controversy to determine if Thomas
Jefferson had sired a son with one of his slaves. The analysis of the Y-chromosome yields weaker results than
autosomal chromosome analysis. The Y male sex-determining chromosome, as it is inherited only by males from
their fathers, is almost identical along the patrilineal line. This leads to a less precise analysis than if autosomal
chromosomes were tested, because of the random matching that occurs between pairs of chromosomes as
zygotes are being made.[5]

Mitochondrial analysis
Main article: Mitochondrial DNA
For highly degraded samples, it is sometimes impossible to get a complete profile of the 13 CODIS STRs. In these
situations, mitochondrial DNA (mtDNA) is sometimes typed due to there being many copies of mtDNA in a cell,
while there may only be 1-2 copies of the nuclear DNA. Forensic scientists amplify the HV1 and HV2 regions of the
mtDNA, and then sequence each region and compare single-nucleotide differences to a reference. Because mtDNA
is maternally inherited, directly linked maternal relatives can be used as match references, such as one's maternal
grandmother's daughter's son. In general, a difference of two or more nucleotides is considered to be an
exclusion. Heteroplasmy and poly-C differences may throw off straight sequence comparisons, so some expertise
on the part of the analyst is required. mtDNA is useful in determining clear identities, such as those of missing
people when a maternally linked relative can be found. mtDNA testing was used in determining that Anna
Anderson was not the Russian princess she had claimed to be, Anastasia Romanov.
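The comparison step can be sketched as counting single-nucleotide differences between a sample and a reference. The sequences below are invented, the function names are hypothetical, and labelling a single difference "inconclusive" is an assumption of this sketch; the text only states that two or more differences are treated as an exclusion. The sketch also ignores heteroplasmy and poly-C length variation, which is where the analyst's expertise comes in.

```python
def differences(sample, reference):
    """List (position, reference base, sample base) wherever the sequences differ."""
    if len(sample) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (index + 1, ref_base, sample_base)
        for index, (sample_base, ref_base) in enumerate(zip(sample, reference))
        if sample_base != ref_base
    ]


def interpret(sample, reference, exclusion_threshold=2):
    """Apply the two-or-more-differences rule of thumb described above."""
    count = len(differences(sample, reference))
    if count >= exclusion_threshold:
        return "exclusion"
    if count == 0:
        return "cannot exclude"
    return "inconclusive"  # assumption: the text does not say how one difference is treated


reference = "ACCTAGGT"
print(interpret("ACCTAGGT", reference))  # cannot exclude
print(interpret("ACTTAGGT", reference))  # inconclusive
print(interpret("ACTTAGCT", reference))  # exclusion
```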
mtDNA can be obtained from such material as hair shafts and old bones/teeth. Control mechanism based on
interaction point with data. This is determined by tooled placement in sample.

DNA databases
Main article: National DNA database
An early application of a DNA database was the compilation of A Mitochondrial DNA Concordance,[6] prepared by
Kevin W. P. Miller and John L. Dawson at the University of Cambridge from 1996 to 1998[7] from data collected as
part of Miller's PhD thesis. There are now several DNA databases in existence around the world. Some are private,
but most of the largest databases are government controlled. The United States maintains the largest DNA
database, with the Combined DNA Index System (CODIS) holding over 5 million records as of 2007.[8] The United
Kingdom maintains the National DNA Database (NDNAD), which is of similar size, despite the UK's smaller
population. The size of this database, and its rate of growth, is giving concern to civil liberties groups in the UK,
where police have wide-ranging powers to take samples and retain them even in the event of acquittal.[9]
The U.S. Patriot Act provides a means for the U.S. government to get DNA samples from other
countries if they[clarification needed] are either a division of or a head office of a company operating in the U.S. Under the act,
the American offices of the company cannot divulge to their subsidiaries or offices in other countries the reasons that
these DNA samples are sought or by whom.[citation needed]
When a match is made from a national DNA databank to link a crime scene to an offender who has provided a DNA
sample to the databank, that link is often referred to as a cold hit. A cold hit is of value in referring the police agency to
a specific suspect but is of less evidential value than a DNA match made from outside the DNA databank.[10]
FBI agents cannot legally store DNA of a person not convicted of a crime. DNA collected from a suspect not later
convicted must be disposed of and not entered into the database. In 1998, a man residing in the UK was arrested
on accusation of burglary. His DNA was taken and tested, and he was later released. Nine months later, this man's
DNA was accidentally and illegally entered in the DNA database. New DNA is automatically compared to the DNA
found at cold cases and, in this case, this man was found to be a match to DNA found at a rape and assault case
one year earlier. The government then prosecuted him for these crimes. During the trial the DNA match was
requested to be removed from the evidence because it had been illegally entered into the database. The request
was carried out.[11]
The DNA collected from victims of rape is often stored for years until it is matched with the perpetrator's, usually when
the perpetrator commits another crime. In 2014, Congress extended a bill that helps states deal with "a backlog" of unexamined
evidence.[12]

Considerations when evaluating DNA evidence


In the early days of the use of genetic fingerprinting as criminal evidence, juries were often swayed by spurious
statistical arguments by defense lawyers along these lines: Given a match that had a 1 in 5 million probability of
occurring by chance, the lawyer would argue that this meant that in a country of say 60 million people there were 12
people who would also match the profile. This was then translated to a 1 in 12 chance of the suspect's being the
guilty one. This argument is not sound unless the suspect was drawn at random from the population of the country.
In fact, a jury should consider how likely it is that an individual matching the genetic profile would also have been a
suspect in the case for other reasons. Another spurious statistical argument is based on the false assumption that a
1 in 5 million probability of a match automatically translates into a 1 in 5 million probability of innocence and is
known as the prosecutor's fallacy.
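The numbers in both arguments can be worked through explicitly. The sketch below uses Bayes' rule under simplifying assumptions that are not in the text: the true source always matches, everyone else matches independently with the stated probability, and the "prior" expresses how likely the suspect is to be the source before the DNA result is considered. The function names are invented for illustration.

```python
from fractions import Fraction


def expected_coincidental_matches(population, match_probability):
    """Number of people expected to match purely by chance."""
    return population * match_probability


def posterior_source_probability(prior, match_probability):
    """Probability that a matching suspect is the source, by Bayes' rule."""
    return prior / (prior + (1 - prior) * match_probability)


match_probability = Fraction(1, 5_000_000)
population = 60_000_000

# The defense lawyer's figure: about 12 chance matches in the country.
print(expected_coincidental_matches(population, match_probability))  # 12

# That argument treats the suspect as a random member of the population.
random_prior = Fraction(1, population)
print(float(posterior_source_probability(random_prior, match_probability)))

# If other evidence already made the suspect a 1-in-100 candidate,
# the same match is far more telling.
print(float(posterior_source_probability(Fraction(1, 100), match_probability)))
```

With a purely random prior the result is roughly 1 in 13 (the suspect plus about 12 chance matches), while the 1-in-100 prior gives a probability above 0.9999. Neither figure equals the 1 in 5 million match probability itself, and equating the two is the prosecutor's fallacy.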
When using RFLP, the theoretical risk of a coincidental match is 1 in 100 billion (100,000,000,000), although the
practical risk is actually 1 in 1000 because monozygotic twins are 0.2% of the human population.[citation needed] Moreover,
the rate of laboratory error is almost certainly higher than this, and often actual laboratory procedures do not reflect
the theory under which the coincidence probabilities were computed. For example, the coincidence probabilities
may be calculated based on the probabilities that markers in two samples have bands in precisely the same
location, but a laboratory worker may conclude that similar (but not precisely identical) band patterns result from
identical genetic samples with some imperfection in the agarose gel. However, in this case, the laboratory worker
increases the coincidence risk by expanding the criteria for declaring a match. Recent studies have quoted relatively
high error rates, which may be cause for concern.[13] In the early days of genetic fingerprinting, the necessary
population data to accurately compute a match probability was sometimes unavailable. Between 1992 and 1996,
arbitrary low ceilings were controversially put on match probabilities used in RFLP analysis rather than the higher
theoretically computed ones.[14] Today, RFLP has become widely disused due to the advent of more discriminating,
sensitive and easier technologies.
Since 1998, the DNA profiling system supported by The National DNA Database in the UK is the SGM+ DNA
profiling system that includes 10 STR regions and a sex-indicating test. STRs do not suffer from such subjectivity
and provide similar power of discrimination (1 in 10^13 for unrelated individuals if using a full SGM+ profile). Figures of
this magnitude are not considered to be statistically supportable by scientists in the UK; for unrelated individuals
with full matching DNA profiles a match probability of 1 in a billion is considered statistically supportable. However,
with any DNA technique, the cautious juror should not convict on genetic fingerprint evidence alone if other factors
raise doubt. Contamination with other evidence (secondary transfer) is a key source of incorrect DNA profiles and
raising doubts as to whether a sample has been adulterated is a favorite defense technique. More
rarely, chimerism is one such instance where the lack of a genetic match may unfairly exclude a suspect.

Evidence of genetic relationship


It is also possible to use DNA profiling as evidence of genetic relationship, although such evidence varies in strength
from weak to positive. Testing that shows no relationship is absolutely certain.
While almost all individuals have a single and distinct set of genes, ultra-rare individuals, known as "chimeras", have
at least two different sets of genes. There have been two cases of DNA profiling that falsely suggested that a mother
was unrelated to her children.[15] This happens when two eggs are fertilized at the same time and fuse together to
create one individual instead of twins.

Fake DNA evidence


In one case, a criminal even planted fake DNA evidence in his own body: John Schneeberger raped one of his
sedated patients in 1992 and left semen on her underwear. Police drew what they believed to be Schneeberger's
blood and compared its DNA against the crime scene semen DNA on three occasions, never showing a match. It
turned out that he had surgically inserted a Penrose drain into his arm and filled it with foreign blood
and anticoagulants.
The functional analysis of genes and their coding sequences (open reading frames [ORFs]) typically requires that
each ORF be expressed, the encoded protein purified, antibodies produced, phenotypes examined, intracellular
localization determined, and interactions with other proteins sought.[16] In a study conducted by the life science
company Nucleix and published in the journal Forensic Science International, scientists found that an in
vitro synthesized sample of DNA matching any desired genetic profile can be constructed using standard molecular
biology techniques without obtaining any actual tissue from that person. Nucleix claims they can also prove the
difference between non-altered DNA and any that was synthesized.[17]

In the case of the Phantom of Heilbronn, police detectives found DNA traces from the same woman on various
crime scenes in Austria, Germany, and France, among them murders, burglaries and robberies. Only after the DNA
of the "woman" matched the DNA sampled from the burned body of a male asylum seeker in France, detectives
began to have serious doubts about the DNA evidence. In that case, DNA traces were already present on the cotton
swabs used to collect the samples at the crime scene, and the swabs had all been produced at the same factory in
Austria. The company's product specification said that the swabs were guaranteed to be sterile, but not DNA-free.

DNA evidence as evidence in criminal trials


Familial DNA searching


Familial DNA searching (sometimes referred to as Familial DNA or Familial DNA Database Searching) is the
practice of creating new investigative leads in cases where DNA evidence found at the scene of a crime (forensic
profile) strongly resembles that of an existing DNA profile (offender profile) in a state DNA database but there is not
an exact match.[18][19] After all other leads have been exhausted, investigators may use specially developed software
to compare the forensic profile to all profiles taken from a state's DNA database to generate a list of those offenders
already in the database who are most likely to be a very close relative of the individual whose DNA is in the forensic
profile.[20] To eliminate the majority of this list when the forensic DNA is a man's, crime lab technicians conduct Y-STR
analysis. Using standard investigative techniques, authorities are then able to build a family tree. The family
tree is populated from information gathered from public records and criminal justice records. Investigators rule out
family members' involvement in the crime by finding excluding factors such as sex, living out of state or being
incarcerated when the crime was committed. They may also use other leads from the case, such as witness or
victim statements, to identify a suspect. Once a suspect has been identified, investigators seek to legally obtain a
DNA sample from the suspect. This suspect DNA profile is then compared to the sample found at the crime scene to
definitively identify the suspect as the source of the crime scene DNA.
Familial DNA database searching was first used in an investigation leading to the conviction of Craig Harman of
manslaughter in the United Kingdom on April 19, 2004.[21] Harman was identified because DNA from the crime scene
partially matched the profile of his brother. When the police questioned Harman's brother, they noticed that
Harman lived very close to the original crime scene. Harman confessed when his DNA matched the DNA found
on the brick.[22] Currently, familial DNA database searching is not conducted on a national level in the
United States. States determine their own policies and decision-making processes for how and when to conduct
familial searches. The first familial DNA search and subsequent conviction in the United States was conducted
in Denver, Colorado, in 2008 using software developed under the leadership of Denver District Attorney Mitch
Morrissey and Denver Police Department Crime Lab Director Gregg LaBerge.[23] California was the first state to
implement a policy for familial searching, under then Attorney General, now Governor, Jerry Brown.[24] In his role as
consultant to the Familial Search Working Group of the California Department of Justice, former Alameda County
Prosecutor Rock Harmon is widely considered to have been the catalyst in the adoption of familial search
technology in California. The technique was used to catch the Los Angeles serial killer known as the Grim Sleeper
in 2010.[25] It was not a witness or informant that tipped off law enforcement to the identity of the "Grim Sleeper" serial
killer, who had eluded police for more than two decades, but DNA from the suspect's own son. The suspect's son
had been arrested and convicted on a felony weapons charge and swabbed for DNA the previous year. When his DNA was
entered into the database of convicted felons, detectives were alerted to a partial match to evidence found at the
"Grim Sleeper" crime scenes. Lonnie David Franklin Jr., also known as the Grim Sleeper, was charged with ten counts of
murder and one count of attempted murder.[26] More recently, familial DNA led to the arrest of 21-year-old Elvis
Garcia on charges of sexual assault and false imprisonment of a woman in Santa Cruz in 2008.[27] In March 2011,
Virginia Governor Bob McDonnell announced that Virginia would begin using familial DNA searches.[28] Other states
are expected to follow.
At a press conference in Virginia on March 7, 2011, regarding the East Coast Rapist, Prince William County
prosecutor Paul Ebert and Fairfax County Police Detective John Kelly said the case would have been solved years
ago if Virginia had used familial DNA searching. Aaron Thomas, the suspected East Coast Rapist, was arrested in
connection with the rape of 17 women from Virginia to Rhode Island, but familial DNA was not used in the case.[29]
Critics of familial DNA database searches argue that the technique is an invasion of an individual's Fourth
Amendment rights.[30] Privacy advocates are petitioning for DNA database restrictions, arguing that the only fair way
to search for possible DNA matches to relatives of offenders or arrestees would be to have a population-wide DNA
database.[11] Some scholars have pointed out that the privacy concerns surrounding familial searching are similar in
some respects to other police search techniques,[31] and most have concluded that the practice is
constitutional.[32] The Ninth Circuit Court of Appeals in United States v. Pool (vacated as moot) suggested that this
practice is somewhat analogous to a witness looking at a photograph of one person and stating that it looked like
the perpetrator, which leads law enforcement to show the witness photos of similar looking individuals, one of whom
is identified as the perpetrator.[33] Regardless of whether familial DNA searching was the method used to identify the
suspect, authorities always conduct a normal DNA test to match the suspect's DNA with the DNA left at the
crime scene.
Critics also claim that racial profiling could occur on account of familial DNA testing. In the United States, the
conviction rates of racial minorities are much higher than those of the overall population. It is unclear whether this is
due to discrimination by police officers and the courts or to a higher rate of offending among
minorities. Arrest-based databases, which are found in the majority of the United States, lead to an even greater
level of racial discrimination. An arrest, as opposed to a conviction, relies much more heavily on police discretion.[11]
As an example of a familial search, investigators with the Denver District Attorney's Office successfully identified a
suspect in a property theft case. In this example, the suspect's blood left at the scene of the crime strongly
resembled that of a current Colorado Department of Corrections prisoner.[34] Using publicly available records, the
investigators created a family tree. They then eliminated all the family members who were incarcerated at the time
of the offense, as well as all of the females (the crime scene DNA profile was that of a male). Investigators obtained
a court order to collect the suspect's DNA, but the suspect actually volunteered to come to a police station and give
a DNA sample. After providing the sample, the suspect walked free without further interrogation or detainment. Later
confronted with an exact match to the forensic profile, the suspect pled guilty to criminal trespass at the first court
date and was sentenced to two years' probation.

In Italy, a familial DNA search was used in the investigation of the murder of Yara Gambirasio, whose body was
found in the bush three months after her disappearance. A DNA trace was found on the underwear of the murdered
teenager, and DNA samples were requested from people who lived near the municipality of Brembate di
Sopra. The sample of a young man not involved in the murder showed a common male ancestor with the trace. After
a long investigation, the father of the supposed killer was identified as Giuseppe Guerinoni, a deceased man, but the
two sons born to his wife did not match the DNA found on Yara's body. After three and a half
years, the DNA found on the girl's underwear was matched to Massimo Giuseppe Bossetti, who was
arrested and accused of the murder of the 13-year-old girl. Bossetti is in jail awaiting trial.

Partial matches
Partial DNA matches are not searches themselves, but are the result of moderate-stringency CODIS searches that
produce a potential match that shares at least one allele at every locus.[35] Partial matching does not involve the use
of familial search software, such as that used in the UK and United States, or additional Y-STR analysis, and
therefore often misses sibling relationships. Partial matching has been used to identify suspects in several cases in
the UK and United States,[36] and has also been used as a tool to exonerate the falsely accused. Darryl Hunt was
wrongly convicted in connection with the rape and murder of a young woman in 1984 in North Carolina.[37] Hunt was
exonerated in 2004 when a DNA database search produced a remarkably close match between a convicted felon
and the forensic profile from the case. The partial match led investigators to the felon's brother, Willard E. Brown,
who confessed to the crime when confronted by police. A judge then signed an order to dismiss the case against
Hunt.
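
The "at least one allele at every locus" criterion described above can be sketched in a few lines of Python. This is only an illustration of that single rule, not of how CODIS itself is implemented; the function names are hypothetical, and the allele values in the usage example are invented for illustration (only the locus names are real STR loci).

```python
def shares_allele_at_every_locus(profile_a, profile_b):
    """Return True if the two profiles share at least one allele at each locus.

    Each profile maps a locus name to the pair of alleles observed there.
    Only loci present in both profiles are compared (an assumption of this sketch).
    """
    common_loci = profile_a.keys() & profile_b.keys()
    if not common_loci:
        return False
    return all(set(profile_a[locus]) & set(profile_b[locus]) for locus in common_loci)


def is_exact_match(profile_a, profile_b):
    """Return True if both profiles carry the same alleles at every shared locus."""
    common_loci = profile_a.keys() & profile_b.keys()
    if not common_loci:
        return False
    return all(sorted(profile_a[locus]) == sorted(profile_b[locus]) for locus in common_loci)


# Invented allele values, for illustration only.
forensic = {"D8S1179": (12, 14), "TH01": (6, 9.3), "FGA": (21, 24)}
offender = {"D8S1179": (12, 13), "TH01": (9.3, 9.3), "FGA": (24, 25)}
unrelated = {"D8S1179": (10, 11), "TH01": (6, 7), "FGA": (21, 22)}

partial = shares_allele_at_every_locus(forensic, offender) and not is_exact_match(forensic, offender)
```

Here `forensic` and `offender` form a partial match (an allele is shared at every locus, but the profiles are not identical), while `unrelated` fails at the first locus.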

Surreptitious DNA collecting


Police forces may collect DNA samples without the suspects' knowledge, and use them as evidence. The legality of this
practice has been questioned in Australia.
In the United States, it has been accepted, with courts often holding that there was no expectation of privacy and
citing California v. Greenwood (1988), in which the Supreme Court held that the Fourth Amendment does not
prohibit the warrantless search and seizure of garbage left for collection outside the curtilage of a home. Critics of
this practice underline that this analogy ignores that "most people have no idea that they risk surrendering their
genetic identity to the police by, for instance, failing to destroy a used coffee cup. Moreover, even if they do realize
it, there is no way to avoid abandoning one's DNA in public."[38]
In the UK, the Human Tissue Act 2004 prohibited private individuals from covertly collecting biological samples (hair,
fingernails, etc.) for DNA analysis, but excluded medical and criminal investigations from the offence.[39]
The U.S. Supreme Court ruled 5–4 on June 3, 2013, in the case of Maryland v. King, that DNA sampling of
prisoners arrested for serious crimes is constitutional.[40][41][42]

England and Wales


Evidence from an expert who has compared DNA samples must be accompanied by evidence as to the sources of
the samples and the procedures for obtaining the DNA profiles.[43] The judge must ensure that the jury
understands the significance of DNA matches and mismatches in the profiles. The judge must also ensure that the
jury does not confuse the 'match probability' (the probability that a person chosen at random has a DNA profile
matching the sample from the scene) with the probability that a person with matching DNA committed the
crime. In R v Doheny (1996),[44] Phillips LJ gave this example of a summing up, which should be carefully tailored to
the particular facts in each case:
Members of the Jury, if you accept the scientific evidence called by the Crown, this indicates that there are probably
only four or five white males in the United Kingdom from whom that semen stain could have come. The Defendant is
one of them. If that is the position, the decision you have to reach, on all the evidence, is whether you are sure that it
was the Defendant who left that stain or whether it is possible that it was one of that other small group of men who
share the same DNA characteristics.
Juries should weigh up conflicting and corroborative evidence, using their own common sense and not by using
mathematical formulae, such as Bayes' theorem, so as to avoid "confusion, misunderstanding and misjudgment".[45]
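
The difference between the match probability and the probability that a matching person is the source can be illustrated with a rough calculation. The sketch below is for the reader only, not something a jury is asked to do; the figures (a match probability of 1 in 6 million and 27 million possible donors) are hypothetical, chosen to reproduce the "four or five" men of the summing up, and the second function assumes that every possible donor is equally likely a priori and that no other evidence exists.

```python
def expected_matching_people(match_probability, population_size):
    """Expected number of people in the population whose profile matches by chance."""
    return match_probability * population_size


def chance_defendant_is_source(match_probability, population_size):
    """Chance that a matching defendant is the true source, on the DNA evidence alone.

    Simplifying assumptions of this sketch: the true source is somewhere in the
    population, everyone is equally likely a priori, and there is no other evidence.
    """
    other_expected_matches = match_probability * (population_size - 1)
    return 1.0 / (1.0 + other_expected_matches)


# Hypothetical figures, for illustration only.
match_probability = 1 / 6_000_000
possible_donors = 27_000_000

others = expected_matching_people(match_probability, possible_donors)
chance = chance_defendant_is_source(match_probability, possible_donors)
print(f"Expected matching people: {others:.1f}")
print(f"Chance the defendant is the source (DNA alone): {chance:.2f}")
```

Under these assumptions the match probability is tiny, yet several men are expected to match, so the DNA evidence alone leaves substantial doubt about which of them left the stain.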

Presentation and evaluation of evidence of partial or incomplete DNA profiles
In R v Bates,[46] Moore-Bick LJ said:
We can see no reason why partial profile DNA evidence should not be admissible provided that the jury are made
aware of its inherent limitations and are given a sufficient explanation to enable them to evaluate it. There may be
cases where the match probability in relation to all the samples tested is so great that the judge would consider its
probative value to be minimal and decide to exclude the evidence in the exercise of his discretion, but this gives rise
to no new question of principle and can be left for decision on a case by case basis. However, the fact that there
exists in the case of all partial profile evidence the possibility that a "missing" allele might exculpate the accused
altogether does not provide sufficient grounds for rejecting such evidence. In many cases there is a possibility (at least in
theory) that evidence that would assist the accused and perhaps even exculpate him altogether exists, but that does
not provide grounds for excluding relevant evidence that is available and otherwise admissible, though it does make
it important to ensure that the jury are given sufficient information to enable them to evaluate that evidence
properly.[47]

DNA testing in the United States

CBP chemist reads a DNA profile to determine the origin of a commodity.

There are state laws on DNA profiling in all 50 states of the United States.[48] Detailed information on database laws
in each state can be found at the National Conference of State Legislatures website.[49]

Development of artificial DNA


In August 2009, scientists in Israel raised serious doubts concerning the use of DNA by law enforcement as the
ultimate method of identification. In a paper published in the journal Forensic Science International: Genetics, the
Israeli researchers demonstrated that it is possible to manufacture DNA in a laboratory, thus falsifying DNA
evidence. The scientists fabricated saliva and blood samples, which originally contained DNA from a person other
than the supposed donor of the blood and saliva.[50]
The researchers also showed that, using a DNA database, it is possible to take information from a profile and
manufacture DNA to match it, and that this can be done without access to any actual DNA from the person whose
DNA they are duplicating. The synthetic DNA oligos required for the procedure are common in molecular
laboratories.[50]
The New York Times quoted the lead author, Daniel Frumkin, as saying, "You can just engineer a crime scene ... any
biology undergraduate could perform this".[50] Frumkin perfected a test that can differentiate real DNA samples from
fake ones. His test detects epigenetic modifications, in particular, DNA methylation. Seventy percent of the DNA in
any human genome is methylated, meaning it contains methyl group modifications within a CpG
dinucleotide context. Methylation at the promoter region is associated with gene silencing. The synthetic DNA lacks
this epigenetic modification, which allows the test to distinguish manufactured DNA from genuine DNA.[50]
It is unknown how many police departments, if any, currently use the test. No police lab has publicly announced that
it is using the new test to verify DNA results.[51]

Cases

In 1986, Richard Buckland was exonerated, despite having admitted to the rape and murder of a teenager
near Leicester, the city where DNA profiling was first developed. This was the first use of DNA fingerprinting in
a criminal investigation.[52]
In 1987, in the same case as Buckland, British baker Colin Pitchfork was the first criminal caught and
convicted using DNA fingerprinting.[53]
In 1987, genetic fingerprinting was used in criminal court for the first time in the trial of a man accused of
unlawful intercourse with a mentally handicapped 14-year-old female who gave birth to a baby.[54]

In 1987, Florida rapist Tommie Lee Andrews was the first person in the United States to be convicted as a result
of DNA evidence, for raping a woman during a burglary; he was convicted on November 6, 1987, and
sentenced to 22 years in prison.[55][56]
In 1988, Timothy Wilson Spencer was the first man in Virginia to be sentenced to death through DNA testing, for
several rape and murder charges. He was dubbed "The South Side Strangler" because he killed victims on the
south side of Richmond, Virginia. He was later charged with rape and first-degree murder and was sentenced to
death. He was executed on April 27, 1994. David Vasquez, initially convicted of one of Spencer's crimes,
became the first man in America exonerated based on DNA evidence.
In 1989, Chicago man Gary Dotson was the first person whose conviction was overturned using DNA evidence.
In 1991, Allan Legere was the first Canadian to be convicted as a result of DNA evidence, for four murders he
had committed while an escaped prisoner in 1989. During his trial, his defense argued that the relatively shallow
gene pool of the region could lead to false positives.
In 1992, DNA evidence was used to prove that Nazi doctor Josef Mengele was buried in Brazil under the name
Wolfgang Gerhard.
In 1992, DNA from a palo verde tree was used to convict Mark Alan Bogan of murder. DNA from seed pods of a
tree at the crime scene was found to match that of seed pods found in Bogan's truck. This is the first instance of
plant DNA admitted in a criminal case.[57][58][59]
In 1993, Kirk Bloodsworth was the first person to have been convicted of murder and sentenced to death,
whose conviction was overturned using DNA evidence.
The 1993 rape and murder of Mia Zapata, lead singer for the Seattle punk band The Gits, was still unsolved nine
years after the murder. A database search in 2001 failed, but the killer's DNA was collected when he was
arrested in Florida for burglary and domestic abuse in 2002.
The science was made famous in the United States in 1994 when prosecutors heavily relied on DNA evidence
allegedly linking O. J. Simpson to a double murder. The case also brought to light the laboratory difficulties and
handling procedure mishaps that can cause such evidence to be significantly doubted.
In 1994, Royal Canadian Mounted Police (RCMP) detectives successfully tested hairs from a cat known
as Snowball, and used the test to link a man to the murder of his wife, thus marking the first time in forensic
history that non-human DNA was used to identify a criminal (apart from the plant DNA in the 1992 Bogan case
described above).
In 1994, the claim that Anna Anderson was Grand Duchess Anastasia Nikolaevna of Russia was tested after
her death using samples of her tissue that had been stored at a Charlottesville, Virginia hospital following a
medical procedure. The tissue was tested using DNA fingerprinting, and showed that she bore no relation to
the Romanovs.[60]
In 1994, Earl Washington, Jr., of Virginia had his death sentence commuted to life imprisonment a week before
his scheduled execution date based on DNA evidence. He received a full pardon in 2000 based on more
advanced testing.[61] His case is often cited by opponents of the death penalty.
In 1995, the British Forensic Science Service carried out its first mass intelligence DNA screening in the
investigation of the Naomi Smith murder case.
In 1998, Richard J. Schmidt was convicted of attempted second-degree murder when it was shown that there
was a link between the viral DNA of the human immunodeficiency virus (HIV) he had been accused of injecting
in his girlfriend and viral DNA from one of his patients with AIDS. This was the first time viral DNA fingerprinting
had been used as evidence in a criminal trial.
In 1999, Raymond Easton, a disabled man from Swindon, England, was arrested and detained for seven hours
in connection with a burglary. He was released due to an inaccurate DNA match. His DNA had been retained on
file after an unrelated domestic incident some time previously.[62]
In 2000, DNA profiling proved Frank Lee Smith innocent of the murder of an eight-year-old girl, for which he had
spent 14 years on death row in Florida, USA. However, he had died of cancer just before his innocence was
proven.[63] In view of this, the Florida state governor ordered that in future any death row inmate claiming
innocence should have DNA testing.[61]
In May 2000 Gordon Graham murdered Paul Gault at his home in Lisburn, Northern Ireland. Graham was
convicted of the murder when his DNA was found on a sports bag left in the house as part of an elaborate ploy
to suggest the murder occurred after a burglary had gone wrong. Graham was having an affair with the victim's
wife at the time of the murder. It was the first time Low Copy Number DNA was used in Northern Ireland.[64]
In 2001, Wayne Butler was convicted for the murder of Celia Douty. It was the first murder in Australia to be
solved using DNA profiling.[65][66]
In 2002, the body of James Hanratty, hanged in 1962 for the "A6 murder", was exhumed and DNA samples
from the body and members of his family were analysed. The results convinced Court of Appeal judges that
Hanratty's guilt, which had been strenuously disputed by campaigners, was proved "beyond doubt".[67] Paul Foot
and some other campaigners continued to believe in Hanratty's innocence and argued that the DNA evidence
could have been contaminated, noting that the small DNA samples from items of clothing, kept in a police
laboratory for over 40 years "in conditions that do not satisfy modern evidential standards", had had to be
subjected to very new amplification techniques in order to yield any genetic profile.[68] However, no DNA other
than Hanratty's was found on the evidence tested, contrary to what would have been expected had the
evidence indeed been contaminated.[69]
In 2002, DNA testing was used to exonerate Douglas Echols, a man who was wrongfully convicted in a 1986
rape case. Echols was the 114th person to be exonerated through post-conviction DNA testing.
In August 2002, Annalisa Vincenzi was shot dead in Tuscany. Bartender Peter Hamkin, 23, was arrested,
in Merseyside, in March 2003 on an extradition warrant heard at Bow Street Magistrates' Court in London to
establish whether he should be taken to Italy to face a murder charge. DNA "proved" he shot her, but he was
cleared on other evidence.[70]
In 2003, Welshman Jeffrey Gafoor was convicted of the 1988 murder of Lynette White, when crime scene
evidence collected 12 years earlier was re-examined using STR techniques, resulting in a match with his
nephew.[71] This may be the first known example of the DNA of an innocent yet related individual being used to
identify the actual criminal, via "familial searching".
In March 2003, Josiah Sutton was released from prison after serving four years of a twelve-year sentence for a
sexual assault charge. Questionable DNA samples taken from Sutton were retested in the wake of the Houston
Police Department's crime lab scandal of mishandling DNA evidence.
In June 2003, because of new DNA evidence, Dennis Halstead, John Kogut and John Restivo won a re-trial on
their murder convictions; the convictions were struck down and the men were released.[72] The three men had
already served eighteen years of their thirty-plus-year sentences.
The trial of Robert Pickton (convicted in December 2007) is notable in that DNA evidence was used
primarily to identify the victims, and in many cases to prove their existence.
In 2004, DNA testing shed new light into the mysterious 1912 disappearance of Bobby Dunbar, a four-year-old
boy who vanished during a fishing trip. He was allegedly found alive eight months later in the custody of William
Cantwell Walters, but another woman claimed that the boy was her son, Bruce Anderson, whom she had
entrusted in Walters' custody. The courts disbelieved her claim and convicted Walters for the kidnapping. The
boy was raised and known as Bobby Dunbar throughout the rest of his life. However, DNA tests on Dunbar's
son and nephew revealed the two were not related, thus establishing that the boy found in 1912 was not Bobby
Dunbar, whose real fate remains unknown.[73]
In 2005, Gary Leiterman was convicted of the 1969 murder of Jane Mixer, a law student at the University of
Michigan, after DNA found on Mixer's pantyhose was matched to Leiterman. DNA in a drop of blood on Mixer's
hand was matched to John Ruelas, who was only four years old in 1969 and was never successfully connected
to the case in any other way. Leiterman's defense unsuccessfully argued that the unexplained match of the
blood spot to Ruelas pointed to cross-contamination and raised doubts about the reliability of the lab's
identification of Leiterman.[74][75][76]
In December 2005, Robert Clark was proven innocent of a 1981 attack on an Atlanta woman after serving
twenty-four years in prison. Clark was the 164th person in the United States and the fifth in Georgia to be freed
using post-conviction DNA testing.
In March 2009, Sean Hodgson, who had spent 27 years in jail after being convicted of killing Teresa De Simone, 22,
in her car in Southampton 30 years before, was released by senior judges. Tests proved that DNA from the scene was
not his. British police subsequently reopened the case.
In November 2008, Anthony Curcio was arrested for masterminding one of the most elaborately planned
armored car heists in history. DNA evidence linked Curcio to the crime.[77]

In addition, when proteins are being made, the double helix unwinds to allow a single strand of DNA to serve as a
template. This template strand is then transcribed into mRNA, which is a molecule that conveys vital instructions to
the cell's protein-making machinery.

Genetic code
From Wikipedia, the free encyclopedia

"Codon" redirects here. For the plant genus, see Codon (genus).

A series of codons in part of a messenger RNA (mRNA) molecule. Each codon consists of three nucleotides, usually
corresponding to a single amino acid. The nucleotides are abbreviated with the letters A, U, G and C. This is mRNA, which uses
U (uracil). DNA uses T (thymine) instead. This mRNA molecule will instruct a ribosome to synthesize a protein according to this
code.

The genetic code is the set of rules by which information encoded within genetic material
(DNA or mRNA sequences) is translated into proteins by living cells. Biological decoding is accomplished by
the ribosome, which links amino acids in an order specified by mRNA, using transfer RNA (tRNA) molecules to carry
amino acids and to read the mRNA three nucleotides at a time. The genetic code is highly similar among all
organisms and can be expressed in a simple table with 64 entries.
The code defines how sequences of these nucleotide triplets, called codons, specify which amino acid will be added
next during protein synthesis. With some exceptions,[1] a three-nucleotide codon in a nucleic acid sequence specifies
a single amino acid. Because the vast majority of genes are encoded with exactly the same code (see the RNA
codon table), this particular code is often referred to as the canonical or standard genetic code, or simply the genetic
code, though in fact some variant codes have evolved. For example, protein synthesis in human mitochondria relies
on a genetic code that differs from the standard genetic code.
While the genetic code determines the protein sequence for a given coding region, other genomic regions can
influence when and where these proteins are produced.
Contents

1 Discovery
2 Salient features
o 2.1 Sequence reading frame
o 2.2 Start/stop codons
o 2.3 Effect of mutations
o 2.4 Degeneracy
3 Transfer of information via the genetic code
4 RNA codon table
5 DNA codon table
6 Variations to the standard genetic code
o 6.1 Predicting the genetic code
o 6.2 Expanded genetic code
7 Origin
8 See also
9 References
10 Further reading
11 External links

Discovery

The genetic code

Serious efforts to understand how proteins are encoded began after the structure of DNA was discovered in
1953. George Gamow postulated that sets of three bases must be employed to encode the 20 standard amino acids
used by living cells to build proteins. With four different nucleotides, a code of 2 nucleotides would allow for only a
maximum of 4² = 16 amino acids. A code of 3 nucleotides could code for a maximum of 4³ = 64 amino acids.[2]
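
The counting argument can be checked directly. The following is a minimal Python sketch (the function names are illustrative, not taken from any library) that counts the possible code words of a given length and finds the shortest length able to cover 20 amino acids.

```python
from itertools import product

BASES = "ACGU"  # the four RNA nucleotides


def count_code_words(length, alphabet=BASES):
    """Number of distinct words of the given length over the alphabet."""
    return len(alphabet) ** length


def all_code_words(length, alphabet=BASES):
    """Enumerate every word of the given length, e.g. all 64 triplets."""
    return ["".join(letters) for letters in product(alphabet, repeat=length)]


def shortest_length_covering(n_amino_acids, alphabet=BASES):
    """Smallest word length whose word count is at least n_amino_acids."""
    length = 1
    while count_code_words(length, alphabet) < n_amino_acids:
        length += 1
    return length


print(count_code_words(2))           # 16
print(count_code_words(3))           # 64
print(shortest_length_covering(20))  # 3
```

Doublets give only 16 words, fewer than the 20 standard amino acids, so three is the smallest word length that suffices.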
The Crick, Brenner et al. experiment first demonstrated that codons consist of three DNA bases; Marshall
Nirenberg and Heinrich J. Matthaei were the first to elucidate the nature of a codon in 1961 at the National Institutes
of Health. They used a cell-free system to translate a poly-uracil RNA sequence (i.e., UUUUU...) and discovered
that the polypeptide that they had synthesized consisted of only the amino acid phenylalanine.[3] They thereby
deduced that the codon UUU specified the amino acid phenylalanine. This was followed by experiments in Severo
Ochoa's laboratory that demonstrated that the poly-adenine RNA sequence (AAAAA...) coded for the polypeptide
poly-lysine[4] and that the poly-cytosine RNA sequence (CCCCC...) coded for the polypeptide poly-proline.[5]
Therefore the codon AAA specified the amino acid lysine, and the codon CCC specified the amino
acid proline. Using different copolymers most of the remaining codons were then determined. Subsequent work
by Har Gobind Khorana identified the rest of the genetic code. Shortly thereafter, Robert W. Holley determined the
structure of transfer RNA (tRNA), the adapter molecule that facilitates the process of translating RNA into protein.
This work was based upon earlier studies by Severo Ochoa, who received the Nobel Prize in Physiology or
Medicine in 1959 for his work on the enzymology of RNA synthesis.[6]
Extending this work, Nirenberg and Philip Leder revealed the triplet nature of the genetic code and deciphered the
codons of the standard genetic code. In these experiments, various combinations of mRNA were passed through a
filter that contained ribosomes, the components of cells that translate RNA into protein. Unique triplets promoted the
binding of specific tRNAs to the ribosome. Leder and Nirenberg were able to determine the sequences of 54 out of
64 codons in their experiments.[7] In 1968, Khorana, Holley and Nirenberg received the Nobel Prize in Physiology or
Medicine for their work.[8]

Salient features
Sequence reading frame
A codon is defined by the initial nucleotide from which translation starts. For example, the string GGGAAACCC, if
read from the first position, contains the codons GGG, AAA, and CCC; and, if read from the second position, it
contains the codons GGA and AAC; if read starting from the third position, GAA and ACC. Every sequence can,
thus, be read in three reading frames, each of which will produce a different amino acid sequence (in the given
example, Gly-Lys-Pro, Gly-Asn, or Glu-Thr, respectively). With double-stranded DNA, there are six possible reading
frames, three in the forward orientation on one strand and three reverse on the opposite strand.[9]:330 The actual
frame in which a protein sequence is translated is defined by a start codon, usually the first AUG codon in the
mRNA sequence.
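
The reading frames of the example string can be reproduced with a short Python sketch. The helper names are illustrative; the reverse-strand frames assume a DNA alphabet (A, C, G, T) and are read from the reverse complement.

```python
def codons_in_frame(sequence, offset):
    """Split the sequence into complete triplets, starting at the given offset."""
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


def reverse_complement(dna):
    """Reverse complement of a DNA string over the letters A, C, G and T."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_reading_frames(dna):
    """Three frames on the given strand followed by three on the opposite strand."""
    opposite = reverse_complement(dna)
    return ([codons_in_frame(dna, k) for k in range(3)]
            + [codons_in_frame(opposite, k) for k in range(3)])


for frame in six_reading_frames("GGGAAACCC")[:3]:
    print(frame)
# ['GGG', 'AAA', 'CCC']
# ['GGA', 'AAC']
# ['GAA', 'ACC']
```

Incomplete triplets at the end of a frame are dropped, which is why the second and third frames of the nine-letter example contain only two codons each.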

Start/stop codons
Translation starts with a chain initiation codon or start codon. Unlike stop codons, the codon alone is not sufficient to
begin the process. Nearby sequences such as the Shine-Dalgarno sequence in E. coli and initiation factors are also
required to start translation. The most common start codon is AUG, which is read as methionine or, in bacteria,
as formylmethionine. Alternative start codons depending on the organism include "GUG" or "UUG"; these codons
normally represent valine and leucine, respectively, but as start codons they are translated as methionine or
formylmethionine.[10]
The three stop codons have been given names: UAG is amber, UGA is opal (sometimes also called umber), and
UAA is ochre. "Amber" was named by discoverers Richard Epstein and Charles Steinberg after their friend Harris
Bernstein, whose last name means "amber" in German.[11] The other two stop codons were named "ochre" and
"opal" in order to keep the "color names" theme. Stop codons are also called "termination" or "nonsense" codons.
They signal release of the nascent polypeptide from the ribosome because there is no cognate tRNA that has
anticodons complementary to these stop signals, and so a release factor binds to the ribosome instead.[12]
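
As a simplified illustration of start and stop codons, the Python sketch below translates an mRNA string from its first AUG to the first in-frame stop codon using the standard genetic code. It deliberately ignores everything else described above (the Shine-Dalgarno sequence, initiation factors and alternative start codons such as GUG or UUG), and the example sequence is invented.

```python
BASES = "UCAG"
# Standard genetic code in UCAG order, one letter per amino acid, "*" for stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_NAMES = {"UAG": "amber", "UGA": "opal", "UAA": "ochre"}


def translate_from_first_aug(mrna):
    """Translate from the first AUG until a stop codon.

    Returns (protein, stop_name); stop_name is None if no AUG is found or the
    sequence ends before an in-frame stop codon is reached.
    """
    start = mrna.find("AUG")
    if start == -1:
        return "", None
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_NAMES:
            return "".join(protein), STOP_NAMES[codon]
        protein.append(CODON_TABLE[codon])
    return "".join(protein), None


# Invented example: two leading bases, AUG GGG AAA CCC, then the amber stop codon.
print(translate_from_first_aug("GGAUGGGGAAACCCUAGAA"))  # ('MGKP', 'amber')
```

The start codon both sets the reading frame and contributes the first methionine (M); the stop codon contributes no amino acid.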

Effect of mutations

Examples of notable mutations that can occur in humans.[13]

During the process of DNA replication, errors occasionally occur in the polymerization of the second strand. These
errors, called mutations, can have an impact on the phenotype of an organism, especially if they occur within the
protein coding sequence of a gene. Error rates are usually very low (1 error in every 10–100 million bases) due to
the "proofreading" ability of DNA polymerases.[14][15]
Missense mutations and nonsense mutations are examples of point mutations, which can cause genetic diseases
such as sickle-cell disease and thalassemia respectively.[16][17][18] Clinically important missense mutations generally
change the properties of the coded amino acid residue between being basic, acidic, polar or non-polar, whereas
nonsense mutations result in a stop codon.[9]:266
Mutations that disrupt the reading frame sequence by indels (insertions or deletions) of a non-multiple of 3
nucleotide bases are known as frameshift mutations. These mutations usually result in a completely different
translation from the original, and are also very likely to cause a stop codon to be read, which truncates the creation
of the protein.[19] These mutations may impair the function of the resulting protein, and are thus rare in in vivo
protein-coding sequences. One reason inheritance of frameshift mutations is rare is that, if the protein being
translated is essential for growth under the selective pressures the organism faces, absence of a functional protein
may cause death before the organism is viable.[20] Frameshift mutations may result in severe genetic diseases such
as Tay-Sachs disease.[21]
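The effect of a frameshift can be shown with a short Python sketch. It assumes the standard genetic code and a made-up mRNA; deleting a single base (a hypothetical mutation chosen for illustration) shifts every downstream codon and, in this case, brings a stop codon into frame immediately.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order ("*" marks a stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna):
    """Translate codon by codon from position 0 until the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


original = "AUGCUGACCGGAUAUAAAUAA"        # AUG CUG ACC GGA UAU AAA UAA
shifted = original[:3] + original[4:]     # delete the 4th base (not a multiple of 3)

print(translate(original))  # full-length product
print(translate(shifted))   # truncated product: AUG UGA ... hits a stop at once
```

The original sequence translates to MLTGYK, while the shifted sequence yields only M because the deletion creates the in-frame stop codon UGA.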
Although most mutations that change protein sequences are harmful or neutral, some mutations have a beneficial
effect on an organism.[22] These mutations may enable the mutant organism to withstand particular environmental
stresses better than wild-type organisms, or reproduce more quickly. In these cases a mutation will tend to become
more common in a population through natural selection.[23] Viruses that use RNA as their genetic material have rapid
mutation rates,[24] which can be an advantage, since these viruses will evolve constantly and rapidly, and thus evade
the defensive responses of e.g. the human immune system.[25] In large populations of asexually reproducing
organisms, for example, E. coli, multiple beneficial mutations may co-occur. This phenomenon is called clonal
interference and causes competition among the mutations.[26]

Degeneracy
Main article: Codon degeneracy
Degeneracy is the redundancy of the genetic code. The genetic code has redundancy but no ambiguity (see
the codon tables below for the full correlation). For example, although codons GAA and GAG both specify glutamic
acid (redundancy), neither of them specifies any other amino acid (no ambiguity). The codons encoding one amino
acid may differ in any of their three positions. For example the amino acid leucine is specified by YUR or
CUN (UUA, UUG, CUU, CUC, CUA, or CUG) codons (difference in the first or third position indicated using IUPAC
notation), while the amino acid serine is specified by UCN or AGY (UCA, UCG, UCC, UCU, AGU, or AGC) codons
(difference in the first, second, or third position).[27]:521–522 A practical consequence of redundancy is that errors in the
third position of the triplet codon cause only a silent mutation or an error that would not affect the protein because
the hydrophilicity or hydrophobicity is maintained by equivalent substitution of amino acids; for example, a codon of
NUN (where N = any nucleotide) tends to code for hydrophobic amino acids. NCN yields amino acid residues that
are small in size and moderate in hydropathy; NAN encodes average size hydrophilic residues. The genetic code is
so well-structured for hydropathy that a mathematical analysis (Singular Value Decomposition) of 12 variables (4
nucleotides x 3 positions) yields a remarkable correlation (C = 0.95) for predicting the hydropathy of the encoded
amino acid directly from the triplet nucleotide sequence, without translation.[28][29] Note in the table, below, eight
amino acids are not affected at all by mutations at the third position of the codon, whereas in the figure above, a
mutation at the second position is likely to cause a radical change in the physicochemical properties of the encoded
amino acid.
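The IUPAC shorthand used above (YUR, CUN, UCN, AGY, NUN) can be expanded mechanically. The following Python sketch assumes the standard genetic code and the usual IUPAC ambiguity letters (R = A or G, Y = C or U, N = any base); the helper names expand and amino_acids are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order ("*" marks a stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# IUPAC ambiguity codes needed for the compressed codon notation.
IUPAC = {"U": "U", "C": "C", "A": "A", "G": "G",
         "R": "AG", "Y": "CU", "M": "AC", "H": "ACU", "N": "ACGU"}


def expand(pattern):
    """Return the set of concrete codons matched by an IUPAC pattern."""
    return {"".join(p) for p in product(*(IUPAC[ch] for ch in pattern))}


def amino_acids(pattern):
    """Return the set of one-letter amino acids encoded by a pattern."""
    return {CODON_TABLE[codon] for codon in expand(pattern)}


print(sorted(expand("YUR") | expand("CUN")))  # the six leucine codons
print(amino_acids("UCN") | amino_acids("AGY"))  # serine only
print(sorted(amino_acids("NUN")))  # F, I, L, M, V: all hydrophobic
```

Expanding NUN gives only phenylalanine, leucine, isoleucine, methionine and valine, which matches the statement that codons with U in the second position tend to encode hydrophobic residues.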

Grouping of codons by amino acid residue molar volume and hydropathy. A more detailed version is available.

Transfer of information via the genetic code


The genome of an organism is inscribed in DNA, or, in the case of some viruses, RNA. The portion of the genome
that codes for a protein or an RNA is called a gene. Those genes that code for proteins are composed of
trinucleotide units called codons, each coding for a single amino acid. Each nucleotide sub-unit consists of
a phosphate, a deoxyribose sugar, and one of the four nitrogenous nucleobases. The purine bases adenine (A)
and guanine (G) are larger and consist of two aromatic rings. The pyrimidine bases cytosine (C) and thymine (T) are
smaller and consist of only one aromatic ring. In the double-helix configuration, two strands of DNA are joined to
each other by hydrogen bonds in an arrangement known as base pairing. These bonds almost always form between
an adenine base on one strand and a thymine base on the other strand, or between a cytosine base on one strand
and a guanine base on the other. This means that the number of A and T bases will be the same in a given double
helix, as will the number of G and C bases.[27]:102–117 In RNA, thymine (T) is replaced by uracil (U), and the
deoxyribose is substituted by ribose.[27]:127
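The pairing rules above (A with T, G with C, and U in place of T in RNA) can be sketched in a few lines of Python. The function names and the example strand are hypothetical; the sketch only checks the consequence stated in the text, that a double helix contains equal numbers of A and T and of G and C.

```python
# Watson-Crick partners in DNA.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand):
    """Return the partner strand, written 5' to 3' (the reverse complement)."""
    return "".join(PAIR[base] for base in reversed(strand))


def to_rna(dna):
    """Write a DNA sequence in RNA letters: thymine (T) becomes uracil (U)."""
    return dna.replace("T", "U")


strand = "ATGGCCATT"
partner = complement_strand(strand)
duplex = strand + partner  # all bases of the double helix

print(partner)
print(duplex.count("A") == duplex.count("T"), duplex.count("G") == duplex.count("C"))
print(to_rna(strand))
```

A single strand need not contain equal numbers of A and T (the example has two A and three T); the equality holds only for the two paired strands taken together.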
Each protein-coding gene is transcribed into a molecule of the related RNA polymer. In prokaryotes, this RNA
functions as messenger RNA or mRNA; in eukaryotes, the transcript needs to be processed to produce a mature
mRNA. The mRNA is, in turn, translated on a ribosome into a chain of amino acids otherwise known as
a polypeptide.[27]:Chp 12 The process of translation requires transfer RNAs which are covalently attached to a specific
amino acid, guanosine triphosphate as an energy source, and a number of translation factors. tRNAs
have anticodons complementary to the codons in an mRNA and can be covalently "charged" with specific amino
acids at their 3' terminal CCA ends by enzymes known as aminoacyl tRNA synthetases, which have high specificity
for both their cognate amino acid and tRNA. The high specificity of these enzymes is a major reason why the fidelity
of protein translation is maintained.[27]:464–469
There are 4³ = 64 different codon combinations possible with a triplet codon of three nucleotides; all 64 codons are
assigned to either an amino acid or a stop signal. If, for example, an RNA sequence UUUAAACCC is considered
and the reading frame starts with the first U (by convention, 5' to 3'), there are three codons, namely, UUU, AAA, and
CCC, each of which specifies one amino acid. Therefore, this 9 base RNA sequence will be translated into an amino
acid sequence that is three amino acids long.[27]:521–539 A given amino acid may be encoded by between one and six
different codon sequences. A comparison may be made using bioinformatics tools wherein the codon is similar to
a word, which is the standard data "chunk" and a nucleotide is similar to a bit, in that it is the smallest unit. This
allows for powerful comparisons across species as well as within organisms.
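The UUUAAACCC example can be reproduced with a minimal Python sketch. It assumes the standard genetic code; the helper split_codons is a hypothetical name introduced here for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order ("*" marks a stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def split_codons(rna, frame=0):
    """Cut an RNA string into complete triplets, starting at offset `frame`."""
    return [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]


codons = split_codons("UUUAAACCC")
peptide = [CODON_TABLE[codon] for codon in codons]

print(len(CODON_TABLE))  # 4 bases at 3 positions give 4**3 = 64 codons
print(codons)            # ['UUU', 'AAA', 'CCC']
print(peptide)           # ['F', 'K', 'P'], i.e. Phe, Lys, Pro
```

Starting the frame one base later (frame=1) would give the unrelated codons UUA and AAC, which is why the reading frame has to be fixed before translation.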
The standard genetic code is shown in the following tables. Table 1 shows which amino acid each of the 64 codons
specifies. Table 2 shows which codons specify each of the 20 standard amino acids involved in translation. These
are called forward and reverse codon tables, respectively. For example, the codon "AAU" represents the amino
acid asparagine, and "UGU" and "UGC" represent cysteine (standard three-letter designations, Asn and Cys,
respectively).[27]:522

RNA codon table


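Standard genetic code
Legend (colour key in the original table): nonpolar, polar, basic, acidic, (stop codon)

1st base | 2nd base: U | 2nd base: C | 2nd base: A | 2nd base: G | 3rd base
U | UUU (Phe/F) Phenylalanine | UCU (Ser/S) Serine | UAU (Tyr/Y) Tyrosine | UGU (Cys/C) Cysteine | U
U | UUC (Phe/F) Phenylalanine | UCC (Ser/S) Serine | UAC (Tyr/Y) Tyrosine | UGC (Cys/C) Cysteine | C
U | UUA (Leu/L) Leucine | UCA (Ser/S) Serine | UAA Stop (Ochre) | UGA Stop (Opal) | A
U | UUG (Leu/L) Leucine | UCG (Ser/S) Serine | UAG Stop (Amber) | UGG (Trp/W) Tryptophan | G
C | CUU (Leu/L) Leucine | CCU (Pro/P) Proline | CAU (His/H) Histidine | CGU (Arg/R) Arginine | U
C | CUC (Leu/L) Leucine | CCC (Pro/P) Proline | CAC (His/H) Histidine | CGC (Arg/R) Arginine | C
C | CUA (Leu/L) Leucine | CCA (Pro/P) Proline | CAA (Gln/Q) Glutamine | CGA (Arg/R) Arginine | A
C | CUG (Leu/L) Leucine | CCG (Pro/P) Proline | CAG (Gln/Q) Glutamine | CGG (Arg/R) Arginine | G
A | AUU (Ile/I) Isoleucine | ACU (Thr/T) Threonine | AAU (Asn/N) Asparagine | AGU (Ser/S) Serine | U
A | AUC (Ile/I) Isoleucine | ACC (Thr/T) Threonine | AAC (Asn/N) Asparagine | AGC (Ser/S) Serine | C
A | AUA (Ile/I) Isoleucine | ACA (Thr/T) Threonine | AAA (Lys/K) Lysine | AGA (Arg/R) Arginine | A
A | AUG[A] (Met/M) Methionine | ACG (Thr/T) Threonine | AAG (Lys/K) Lysine | AGG (Arg/R) Arginine | G
G | GUU (Val/V) Valine | GCU (Ala/A) Alanine | GAU (Asp/D) Aspartic acid | GGU (Gly/G) Glycine | U
G | GUC (Val/V) Valine | GCC (Ala/A) Alanine | GAC (Asp/D) Aspartic acid | GGC (Gly/G) Glycine | C
G | GUA (Val/V) Valine | GCA (Ala/A) Alanine | GAA (Glu/E) Glutamic acid | GGA (Gly/G) Glycine | A
G | GUG (Val/V) Valine | GCG (Ala/A) Alanine | GAG (Glu/E) Glutamic acid | GGG (Gly/G) Glycine | G
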
The codon AUG both codes for methionine and serves as an initiation site: the first AUG in an mRNA's
coding region is where translation into protein begins.[30]
Inverse table (compressed using IUPAC notation)

Amino acid | Codons | Compressed | Amino acid | Codons | Compressed
Ala/A | GCU, GCC, GCA, GCG | GCN | Leu/L | UUA, UUG, CUU, CUC, CUA, CUG | YUR, CUN
Arg/R | CGU, CGC, CGA, CGG, AGA, AGG | CGN, MGR | Lys/K | AAA, AAG | AAR
Asn/N | AAU, AAC | AAY | Met/M | AUG |
Asp/D | GAU, GAC | GAY | Phe/F | UUU, UUC | UUY
Cys/C | UGU, UGC | UGY | Pro/P | CCU, CCC, CCA, CCG | CCN
Gln/Q | CAA, CAG | CAR | Ser/S | UCU, UCC, UCA, UCG, AGU, AGC | UCN, AGY
Glu/E | GAA, GAG | GAR | Thr/T | ACU, ACC, ACA, ACG | ACN
Gly/G | GGU, GGC, GGA, GGG | GGN | Trp/W | UGG |
His/H | CAU, CAC | CAY | Tyr/Y | UAU, UAC | UAY
Ile/I | AUU, AUC, AUA | AUH | Val/V | GUU, GUC, GUA, GUG | GUN
START | AUG | | STOP | UAA, UGA, UAG | UAR, URA


DNA codon table


Main article: DNA codon table
The DNA codon table is essentially identical to that for RNA, but with U replaced by T.

Variations to the standard genetic code


See also: List of genetic codes
While slight variations on the standard code had been predicted earlier,[31] none were discovered until 1979,
when researchers studying human mitochondrial genes discovered they used an alternative code. Many slight
variants have been discovered since then,[32] including various alternative mitochondrial codes,[33] and small
variants such as translation of the codon UGA as tryptophan in Mycoplasma species, and translation of CUG as
a serine rather than a leucine in yeasts of the "CTG clade" (Candida albicans is a member of this
group).[34][35][36] Because viruses must use the same genetic code as their hosts, modifications to the standard
genetic code could interfere with the synthesis or functioning of viral proteins. However, some viruses (such
as totiviruses) have adapted to the genetic code modification of the host.[37] In bacteria and archaea, GUG and
UUG are common start codons, but in rare cases, certain proteins may use alternative start codons not normally
used by that species.[32]
In certain proteins, non-standard amino acids are substituted for standard stop codons, depending on
associated signal sequences in the messenger RNA. For example, UGA can code for selenocysteine and UAG
can code for pyrrolysine. Selenocysteine is now viewed as the 21st amino acid, and pyrrolysine is viewed as the
22nd.[32] Unlike selenocysteine, pyrrolysine-encoded UAG is translated with the participation of a dedicated
aminoacyl-tRNA synthetase.[38] Both selenocysteine and pyrrolysine may be present in the same
organism.[39] Although the genetic code is normally fixed in an organism, the prokaryote Acetohalobium
arabaticum can expand its genetic code from 20 to 21 amino acids (by including pyrrolysine) under different
conditions of growth.[40]
Despite these differences, all known naturally-occurring codes are very similar to each other, and the coding
mechanism is the same for all organisms: three-base codons, tRNA, ribosomes, reading the code in the same
direction and translating the code three letters at a time into sequences of amino acids.
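Because the variant codes differ from the standard code in only a few codons, they can be modelled as small overrides of the standard table. The Python sketch below assumes the two reassignments named above (UGA as tryptophan in Mycoplasma, CUG as serine in the CTG clade); the helper names and the test sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order ("*" marks a stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def variant(overrides):
    """Return a copy of the standard table with a few codons reassigned."""
    return {**STANDARD, **overrides}


MYCOPLASMA = variant({"UGA": "W"})  # UGA read as tryptophan instead of stop
CTG_CLADE = variant({"CUG": "S"})   # CUG read as serine instead of leucine


def translate(mrna, table):
    """Translate from position 0 until the first stop codon of `table`."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = table[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = "AUGCUGUGAUUUUAA"  # AUG CUG UGA UUU UAA
for name, table in [("standard", STANDARD), ("Mycoplasma", MYCOPLASMA), ("CTG clade", CTG_CLADE)]:
    print(name, translate(mrna, table))
```

The same message gives ML under the standard code, MLWF under the Mycoplasma variant and MS under the CTG-clade variant, while the decoding mechanism (triplets read in one direction) is unchanged.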

Genetic code logo of the Globobulimina pseudospinescens mitochondrial genome. The logo shows the 64 codons from left to
right, predicted alternatives in red (relative to the standard genetic code). Red line: stop codons. The height of each amino
acid in the stack shows how often it is aligned to the codon in homologous protein domains. The stack height indicates the
support for the prediction.

Predicting the genetic code


The genetic code used by a genome can be predicted by identifying the genes encoded on that genome, and
comparing the codons on the DNA to the amino acids in homologous proteins in other genomes. The
evolutionary conservation of protein sequences makes it possible to predict the amino acid translation for each
codon as the one that is most often aligned to that codon. The program FACIL[41] allows the automated
prediction of the genetic code, searching which amino acids in homologous protein domains are most often
aligned to every codon. The resulting amino acid probabilities for each codon are displayed in a genetic code
logo, that also shows the support for a stop codon.
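The idea of assigning each codon the amino acid most often aligned to it can be sketched as a simple majority vote. This is a toy illustration under stated assumptions, not the FACIL implementation: the function predict_code, the input format (codon, aligned amino acid) and the observations are all hypothetical.

```python
from collections import Counter, defaultdict


def predict_code(aligned_pairs):
    """Predict codon meanings from (codon, aligned_amino_acid) observations.

    Returns {codon: (most_frequent_amino_acid, fraction_of_observations)}.
    """
    votes = defaultdict(Counter)
    for codon, amino_acid in aligned_pairs:
        votes[codon][amino_acid] += 1
    prediction = {}
    for codon, counter in votes.items():
        amino_acid, count = counter.most_common(1)[0]
        prediction[codon] = (amino_acid, count / sum(counter.values()))
    return prediction


# Made-up alignment data: UGA is mostly aligned to tryptophan (W) in homologs.
observations = [("UGA", "W"), ("UGA", "W"), ("UGA", "W"), ("UGA", "C"),
                ("AUG", "M"), ("AUG", "M")]
print(predict_code(observations))
```

The fraction returned next to each amino acid plays the role of the support shown by the stack height in a genetic code logo; a real tool would also need to score the evidence for a stop codon.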

Expanded genetic code


Main article: Expanded genetic code
See also: Nucleic acid analogues

Since 2001, 40 non-natural amino acids have been added into proteins by creating a unique codon (recoding)
and a corresponding transfer-RNA:aminoacyl-tRNA-synthetase pair to encode each one. These amino acids have
diverse physicochemical and biological properties and are used as tools for exploring protein structure and
function or for creating novel or enhanced proteins.[41][42]
H. Murakami and M. Sisido have extended some codons to have four and five bases. Steven A.
Benner constructed a functional 65th (in vivo) codon.[43]

Origin
If amino acids were randomly assigned to triplet codons, then there would be 1.5 × 10⁸⁴ possible genetic codes
to choose from.[44]:163 This number is found by calculating how many ways there are to place 21 items (20 amino
acids plus one stop) in 64 bins, wherein each item is used at least once. [1] The genetic code used by all known
forms of life is nearly universal with few minor variations. One could ask: Has all life on Earth descended from a
single bacterium that mutated to make the final optimization in the genetic code? Many hypotheses on the
evolutionary origins of the genetic code have been proposed.
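The count quoted above can be checked with a short Python sketch. It assumes the interpretation given in the text: each of the 64 codons receives one of 21 labels (20 amino acids plus stop) and every label is used at least once, which is the number of surjections and can be obtained by inclusion-exclusion. The function name surjections is hypothetical.

```python
from math import comb


def surjections(n_bins, n_items):
    """Count assignments of n_items labels to n_bins bins using every label.

    Inclusion-exclusion over the number k of labels that are left unused.
    """
    return sum((-1) ** k * comb(n_items, k) * (n_items - k) ** n_bins
               for k in range(n_items + 1))


n_codes = surjections(64, 21)
print(f"{n_codes:.2e}")  # on the order of 1.5e84, as stated in the text
```

Dropping the requirement that every label be used would give 21⁶⁴, roughly 4 × 10⁸⁴, so the constraint removes a little under two thirds of the assignments.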
Four themes run through the many hypotheses about the evolution of the genetic code:[45]

Chemical principles govern specific RNA interaction with amino acids. Experiments with aptamers showed
that some amino acids have a selective chemical affinity for the base triplets that code for them.[46] Recent
experiments show that of the 8 amino acids tested, 6 show some RNA triplet-amino acid association.[44]:170[47]

Biosynthetic expansion. The standard modern genetic code grew from a simpler earlier code through a
process of "biosynthetic expansion". Here the idea is that primordial life "discovered" new amino acids (for
example, as by-products of metabolism) and later incorporated some of these into the machinery of genetic
coding. Although much circumstantial evidence has been found to suggest that fewer different amino acids
were used in the past than today,[48] precise and detailed hypotheses about which amino acids entered the
code in what order have proved far more controversial.[49][50]
Natural selection has led to codon assignments of the genetic code that minimize the effects
of mutations.[51] A recent hypothesis[52] suggests that the triplet code was derived from codes that used
longer than triplet codons (such as quadruplet codons). Longer than triplet decoding would have higher
degree of codon redundancy and would be more error resistant than the triplet decoding. This feature could
allow accurate decoding in the absence of highly complex translational machinery such as
the ribosome and before cells began making ribosomes.
Information channels: Information-theoretic approaches model the process of translating the genetic code
into corresponding amino acids as an error-prone information channel.[53] The inherent noise (that is, the
error) in the channel poses the organism with a fundamental question: how can a genetic code be
constructed to withstand the impact of noise[54] while accurately and efficiently translating information?
These rate-distortion models[55] suggest that the genetic code originated as a result of the interplay of the
three conflicting evolutionary forces: the needs for diverse amino acids,[56] for error-tolerance[51] and for
minimal cost of resources. The code emerges at a coding transition when the mapping of codons to amino
acids becomes nonrandom. The emergence of the code is governed by the topology defined by the
probable errors and is related to the map coloring problem.[57]

Transfer RNA molecules appear to have evolved before modern aminoacyl-tRNA synthetases, so the latter
cannot be part of the explanation of the genetic code's patterns.[58]
Models encompassing aspects of two or more of the above themes have also been explored. For example,
models based on signaling games combine elements of game theory, natural selection and information
channels. Such models have been used to suggest that the first polypeptides were likely short and had some
use other than enzymatic function. Game theoretic models have also suggested that the organization of RNA
strings into cells may have been necessary to prevent "deceptive" use of the genetic code, i.e. preventing the
ancient equivalent of viruses from overwhelming the RNA world.[59]
The distribution of codon assignments in the genetic code is nonrandom.[60] For example, the genetic code
clusters certain amino acid assignments. Amino acids that share the same biosynthetic pathway tend to have
the same first base in their codons.[61] Amino acids with similar physical properties tend to have similar
codons,[62][63] reducing the problems caused by point mutations and mistranslations.[60] A robust hypothesis for the
origin of genetic code should also address or predict the following gross features of the codon table:[64]

1. absence of codons for D-amino acids
2. secondary codon patterns for some amino acids
3. confinement of synonymous positions to third position
4. limitation to 20 amino acids instead of a number closer to 64
5. relation of stop codon patterns to amino acid coding patterns

Genetic code

NOTE - starting VarNomen version 3 the '*' is used to indicate a translation stop codon, replacing the 'X' used
previously (see Background).

Nucleotide position in codon
first | second: U | second: C | second: A | second: G | third
U | UUU - Phe | UCU - Ser | UAU - Tyr | UGU - Cys | U
U | UUC - Phe | UCC - Ser | UAC - Tyr | UGC - Cys | C
U | UUA - Leu | UCA - Ser | UAA - * | UGA - * | A
U | UUG - Leu | UCG - Ser | UAG - * | UGG - Trp | G
C | CUU - Leu | CCU - Pro | CAU - His | CGU - Arg | U
C | CUC - Leu | CCC - Pro | CAC - His | CGC - Arg | C
C | CUA - Leu | CCA - Pro | CAA - Gln | CGA - Arg | A
C | CUG - Leu | CCG - Pro | CAG - Gln | CGG - Arg | G
A | AUU - Ile | ACU - Thr | AAU - Asn | AGU - Ser | U
A | AUC - Ile | ACC - Thr | AAC - Asn | AGC - Ser | C
A | AUA - Ile | ACA - Thr | AAA - Lys | AGA - Arg | A
A | AUG - Met | ACG - Thr | AAG - Lys | AGG - Arg | G
G | GUU - Val | GCU - Ala | GAU - Asp | GGU - Gly | U
G | GUC - Val | GCC - Ala | GAC - Asp | GGC - Gly | C
G | GUA - Val | GCA - Ala | GAA - Glu | GGA - Gly | A
G | GUG - Val | GCG - Ala | GAG - Glu | GGG - Gly | G

Amino acid descriptions

One letter code | Three letter code | Amino acid | Possible codons
A | Ala | Alanine | GCA, GCC, GCG, GCT
B | Asx | Asparagine or Aspartic acid | AAC, AAT, GAC, GAT
C | Cys | Cysteine | TGC, TGT
D | Asp | Aspartic acid | GAC, GAT
E | Glu | Glutamic acid | GAA, GAG
F | Phe | Phenylalanine | TTC, TTT
G | Gly | Glycine | GGA, GGC, GGG, GGT
H | His | Histidine | CAC, CAT
I | Ile | Isoleucine | ATA, ATC, ATT
K | Lys | Lysine | AAA, AAG
L | Leu | Leucine | CTA, CTC, CTG, CTT, TTA, TTG
M | Met | Methionine | ATG
N | Asn | Asparagine | AAC, AAT
P | Pro | Proline | CCA, CCC, CCG, CCT
Q | Gln | Glutamine | CAA, CAG
R | Arg | Arginine | AGA, AGG, CGA, CGC, CGG, CGT
S | Ser | Serine | AGC, AGT, TCA, TCC, TCG, TCT
T | Thr | Threonine | ACA, ACC, ACG, ACT
V | Val | Valine | GTA, GTC, GTG, GTT
W | Trp | Tryptophan | TGG
X | | any codon | NNN
Y | Tyr | Tyrosine | TAC, TAT
Z | Glx | Glutamine or Glutamic acid | CAA, CAG, GAA, GAG
* | | stop codon | TAA, TAG, TGA


Amino acid properties

Property | Amino acids
small | Ala, Gly
acidic / amide | Asp, Glu, Asn, Gln
charged, negative | Asp, Glu
charged, positive | Lys, Arg
polar | Ala, Gly, Ser, Thr, Pro
hydrophobic | Val, Leu, Ile, Met
size, big | Glu, Gln, His, Ile, Lys, Leu, Met, Phe, Trp, Tyr
size, small | Ala, Asn, Asp, Cys, Gly, Pro, Ser, Thr, Val
aliphatic | Ile, Leu, Val
aromatic | His, Phe, Tyr, Trp


Genetic Code and Amino Acid Translation


Table 1 shows the genetic code of the messenger ribonucleic acid (mRNA), i.e. it shows all 64 possible combinations of codons
composed of three nucleotide bases (tri-nucleotide units) that specify amino acids during protein assembling.
Each codon of the deoxyribonucleic acid (DNA) codes for or specifies a single amino acid and each nucleotide unit consists of a
phosphate, deoxyribose sugar and one of the 4 nitrogenous nucleotide bases, adenine (A), guanine (G), cytosine (C) and thymine (T).
The bases are paired and joined together by hydrogen bonds in the double helix of the DNA. mRNA corresponds to DNA (i.e. the
sequence of nucleotides is the same in both chains) except that in RNA, thymine (T) is replaced by uracil (U), and the deoxyribose is
substituted by ribose.
The process of translating genetic information into a protein requires first mRNA, which is read 5' to 3' (exactly as
DNA), and then transfer ribonucleic acid (tRNA), which is read 3' to 5'. tRNA is the "taxi" that carries amino acids to
the ribosome, where the information is translated into an amino acid chain or polypeptide.
For mRNA there are 4³ = 64 different nucleotide combinations possible with a triplet codon of three nucleotides. All 64 possible
combinations are shown in Table 1. However, the 64 codons do not each specify a different amino acid during translation. The
reason is that in humans only 20 amino acids (not counting selenocysteine) are involved in translation. Therefore, one amino acid can be
encoded by more than one mRNA codon-triplet. Arginine, leucine and serine are encoded by 6 triplets, isoleucine by 3, methionine and
tryptophan by 1, and all other amino acids by 4 or 2 codons. The redundant codons typically differ at the 3rd base. Table
2 shows the inverse codon assignment, i.e. which codon specifies which of the 20 standard amino acids involved in translation.
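The number of codons per amino acid can be tallied directly from the standard code. The Python sketch below is a minimal check under that assumption; the variable names are hypothetical and amino acids are shown by their one-letter codes.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order ("*" marks a stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Count sense codons per amino acid, then group amino acids by that count.
codon_counts = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
by_degeneracy = defaultdict(list)
for amino_acid, count in sorted(codon_counts.items()):
    by_degeneracy[count].append(amino_acid)

for count in sorted(by_degeneracy, reverse=True):
    print(count, by_degeneracy[count])
print(sum(codon_counts.values()), "sense codons for", len(codon_counts), "amino acids")
```

The tally shows three amino acids with six codons (leucine, arginine and serine), isoleucine with three, methionine and tryptophan with one, and 61 sense codons in total for the 20 amino acids.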
Table 1. Genetic code: mRNA codon -> amino acid

1st Base | 2nd Base: U | 2nd Base: C | 2nd Base: A | 2nd Base: G | 3rd Base
U | Phenylalanine | Serine | Tyrosine | Cysteine | U
U | Phenylalanine | Serine | Tyrosine | Cysteine | C
U | Leucine | Serine | Stop | Stop | A
U | Leucine | Serine | Stop | Tryptophan | G
C | Leucine | Proline | Histidine | Arginine | U
C | Leucine | Proline | Histidine | Arginine | C
C | Leucine | Proline | Glutamine | Arginine | A
C | Leucine | Proline | Glutamine | Arginine | G
A | Isoleucine | Threonine | Asparagine | Serine | U
A | Isoleucine | Threonine | Asparagine | Serine | C
A | Isoleucine | Threonine | Lysine | Arginine | A
A | Methionine (Start) [1] | Threonine | Lysine | Arginine | G
G | Valine | Alanine | Aspartate | Glycine | U
G | Valine | Alanine | Aspartate | Glycine | C
G | Valine | Alanine | Glutamate | Glycine | A
G | Valine | Alanine | Glutamate | Glycine | G


Table 2. Reverse codon table: amino acid -> mRNA codon

Amino acid | mRNA codons | Amino acid | mRNA codons
Ala/A | GCU, GCC, GCA, GCG | Leu/L | UUA, UUG, CUU, CUC, CUA, CUG
Arg/R | CGU, CGC, CGA, CGG, AGA, AGG | Lys/K | AAA, AAG
Asn/N | AAU, AAC | Met/M | AUG
Asp/D | GAU, GAC | Phe/F | UUU, UUC
Cys/C | UGU, UGC | Pro/P | CCU, CCC, CCA, CCG
Gln/Q | CAA, CAG | Ser/S | UCU, UCC, UCA, UCG, AGU, AGC
Glu/E | GAA, GAG | Thr/T | ACU, ACC, ACA, ACG
Gly/G | GGU, GGC, GGA, GGG | Trp/W | UGG
His/H | CAU, CAC | Tyr/Y | UAU, UAC
Ile/I | AUU, AUC, AUA | Val/V | GUU, GUC, GUA, GUG
START | AUG | STOP | UAG, UGA, UAA


mRNA is read 5' to 3'. tRNA (read 3' to 5') has anticodons complementary to the codons in mRNA and can be
covalently "charged" with an amino acid at its 3' terminus. According to Crick's wobble hypothesis, strict base-pairing between the
mRNA codon and the tRNA anticodon is required only at the 1st and 2nd base. The pairing at the 3rd base (i.e. at the 5' end of the
tRNA anticodon) is weaker and can form non-standard pairs. For such a pair to form, the bases must wobble out of
their usual positions at the ribosome. These non-standard base-pairs are therefore called wobble-pairs.
Table 3 shows the possible base-pairs at the 1st, 2nd and 3rd base. The 1st and 2nd base allow the same, strictly complementary
pair combinations. At the 3rd base (i.e. at the 3' end of mRNA and 5' end of tRNA) the possible pair combinations are more ambiguous, which
accounts for the redundancy in mRNA codons. Deamination (removal of the amino group NH2) of adenosine (not to be confused with adenine)
produces the nucleoside inosine (I) in tRNA, which forms non-standard wobble-pairs with U, C or A (but not with G) on mRNA.
Inosine may occur at the 3rd base of the tRNA anticodon.
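Table 3. Base-pairs: mRNA codon -> tRNA anticodon

1st (i.e. 5' end) and 2nd place of the mRNA codon, 1st (i.e. 3' end) and 2nd place of the tRNA anticodon:
mRNA codon | tRNA anticodon
A | U
U | A
G | C
C | G

3rd place (i.e. 3' end) of the mRNA codon, 3rd place (i.e. 5' end) of the tRNA anticodon:
mRNA codon | tRNA anticodon
A or G | U
U or C | G
U, C or A | I
U | A
G | C
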

Table 3 is read in the following way: for the 1st and 2nd base-pairs the wobble-pairs provide uniqueness in the way that U on tRNA
always emerges from A on mRNA, A on tRNA always emerges from U on mRNA, etc. For the 3rd base-pair the genetic code is
redundant in the way that U on tRNA can emerge from A or G on mRNA, G on tRNA can emerge from U or C on mRNA and I on tRNA
can emerge from U, C or A on mRNA. Only A and C at the 3rd place on tRNA are unambiguously assigned to U and G at the 3rd place
on mRNA, respectively.
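The reading rules for Table 3 can be written as two small lookup tables. The Python sketch below follows the pairing rules as stated in this section and the document's convention of writing anticodons 3' to 5'; the function name codons_read is hypothetical.

```python
# Strict pairing at the 1st and 2nd place: anticodon base -> codon base.
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Wobble pairing at the 3rd place: anticodon base (5' end) -> codon bases it can read.
WOBBLE = {"U": "AG", "G": "UC", "I": "UCA", "A": "U", "C": "G"}


def codons_read(anticodon_3to5):
    """List the mRNA codons (5' to 3') that an anticodon written 3' to 5' can read."""
    first, second, third = anticodon_3to5
    stem = WATSON_CRICK[first] + WATSON_CRICK[second]
    return [stem + base for base in WOBBLE[third]]


for anticodon in ["AAG", "AAA", "AGI", "AAU", "UAC"]:
    print(f"3'-{anticodon}-5' reads", codons_read(anticodon))
```

For example, the anticodon 3'-AAG-5' reads both phenylalanine codons UUU and UUC, and 3'-AGI-5' reads three of the four UCN serine codons, which is how one tRNA can serve several synonymous codons.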
Due to this combination structure a tRNA can bind to different mRNA codons where synonymous or redundant mRNA codons differ at
the 3rd base (i.e. at the 5' end of tRNA and the 3' end of mRNA). By this logic the minimum number of tRNA anticodons necessary to
encode all amino acids reduces to 31 (excluding the 2 anticodons AUU and ACU listed for the STOP codons, see Table 5). This means that any tRNA
anticodon can pair with one or more different mRNA codons (Table 4). However, there are more than 31 tRNA anticodons
possible for the translation of all 64 mRNA codons. For example, serine has a fourfold degenerate site at the 3rd position (UCU, UCC,
UCA, UCG), which can be translated by AGI (for UCU, UCC and UCA) and AGC on tRNA (for UCG) but also by AGG and AGU. This
means, in turn, that any mRNA codon can also be translated by one or more tRNA anticodons (see Table 5).
The occurrence of different wobble-pairs encoding the same amino acid may reflect a compromise between velocity
and safety in protein synthesis. The redundancy of mRNA codons buffers against mistakes caused by mutations or
variations at the 3rd position but also at other positions. For example, the first position of the leucine codons (UUA, UUG, CUU, CUC,
CUA, CUG) is a twofold degenerate site, while the second position is unambiguous (not redundant). Another example is serine with
mRNA codons UCA, UCG, UCC, UCU, AGU, AGC. Serine is also twofold degenerate at the first position and fourfold
degenerate at the third position, but it is twofold degenerate at the second position in addition. Table 4 shows the assignment of mRNA
codons to any possible tRNA anticodon in eukaryotes for the 20 standard amino acids involved in translation. It is the reverse codon
assignment.
Table 4. Reverse amino acid encoding: amino acid -> tRNA anticodon -> mRNA codon

Amino acid | tRNA anticodon | mRNA codon
Phenylalanine | 3'-AAG-5' | 5'-UUU-3', 5'-UUC-3'
 | 3'-AAA-5' | 5'-UUU-3'
Leucine | 3'-AAU-5' | 5'-UUA-3', 5'-UUG-3'
 | 3'-AAC-5' | 5'-UUG-3'
 | 3'-GAI-5' | 5'-CUU-3', 5'-CUC-3', 5'-CUA-3'
 | 3'-GAG-5' | 5'-CUU-3', 5'-CUC-3'
 | 3'-GAU-5' | 5'-CUA-3', 5'-CUG-3'
 | 3'-GAA-5' | 5'-CUU-3'
 | 3'-GAC-5' | 5'-CUG-3'
Serine | 3'-AGI-5' | 5'-UCU-3', 5'-UCC-3', 5'-UCA-3'
 | 3'-AGG-5' | 5'-UCU-3', 5'-UCC-3'
 | 3'-AGU-5' | 5'-UCA-3', 5'-UCG-3'
 | 3'-AGA-5' | 5'-UCU-3'
 | 3'-AGC-5' | 5'-UCG-3'
 | 3'-UCG-5' | 5'-AGU-3', 5'-AGC-3'
 | 3'-UCA-5' | 5'-AGU-3'
Tyrosine | 3'-AUG-5' | 5'-UAU-3', 5'-UAC-3'
 | 3'-AUA-5' | 5'-UAU-3'
Cysteine | 3'-ACG-5' | 5'-UGU-3', 5'-UGC-3'
 | 3'-ACA-5' | 5'-UGU-3'
Tryptophan | 3'-ACC-5' | 5'-UGG-3'
Proline | 3'-GGI-5' | 5'-CCU-3', 5'-CCC-3', 5'-CCA-3'
 | 3'-GGG-5' | 5'-CCU-3', 5'-CCC-3'
 | 3'-GGU-5' | 5'-CCA-3', 5'-CCG-3'
 | 3'-GGA-5' | 5'-CCU-3'
 | 3'-GGC-5' | 5'-CCG-3'
Histidine | 3'-GUG-5' | 5'-CAU-3', 5'-CAC-3'
 | 3'-GUA-5' | 5'-CAU-3'
Glutamine | 3'-GUU-5' | 5'-CAA-3', 5'-CAG-3'
 | 3'-GUC-5' | 5'-CAG-3'
Arginine | 3'-GCI-5' | 5'-CGU-3', 5'-CGC-3', 5'-CGA-3'
 | 3'-GCG-5' | 5'-CGU-3', 5'-CGC-3'
 | 3'-GCU-5' | 5'-CGA-3', 5'-CGG-3'
 | 3'-GCA-5' | 5'-CGU-3'
 | 3'-GCC-5' | 5'-CGG-3'
 | 3'-UCU-5' | 5'-AGA-3', 5'-AGG-3'
 | 3'-UCC-5' | 5'-AGG-3'
Isoleucine | 3'-UAI-5' | 5'-AUU-3', 5'-AUC-3', 5'-AUA-3'
 | 3'-UAG-5' | 5'-AUU-3', 5'-AUC-3'
 | 3'-UAA-5' | 5'-AUU-3'
 | 3'-UAU-5' | 5'-AUA-3'
Methionine | 3'-UAC-5' | 5'-AUG-3'
Threonine | 3'-UGI-5' | 5'-ACU-3', 5'-ACC-3', 5'-ACA-3'
 | 3'-UGG-5' | 5'-ACU-3', 5'-ACC-3'
 | 3'-UGU-5' | 5'-ACA-3', 5'-ACG-3'
 | 3'-UGA-5' | 5'-ACU-3'
 | 3'-UGC-5' | 5'-ACG-3'
Asparagine | 3'-UUG-5' | 5'-AAU-3', 5'-AAC-3'
 | 3'-UUA-5' | 5'-AAU-3'
Lysine | 3'-UUU-5' | 5'-AAA-3', 5'-AAG-3'
 | 3'-UUC-5' | 5'-AAG-3'
Valine | 3'-CAI-5' | 5'-GUU-3', 5'-GUC-3', 5'-GUA-3'
 | 3'-CAG-5' | 5'-GUU-3', 5'-GUC-3'
 | 3'-CAU-5' | 5'-GUA-3', 5'-GUG-3'
 | 3'-CAA-5' | 5'-GUU-3'
 | 3'-CAC-5' | 5'-GUG-3'
Alanine | 3'-CGI-5' | 5'-GCU-3', 5'-GCC-3', 5'-GCA-3'
 | 3'-CGG-5' | 5'-GCU-3', 5'-GCC-3'
 | 3'-CGU-5' | 5'-GCA-3', 5'-GCG-3'
 | 3'-CGA-5' | 5'-GCU-3'
 | 3'-CGC-5' | 5'-GCG-3'
Aspartate | 3'-CUG-5' | 5'-GAU-3', 5'-GAC-3'
 | 3'-CUA-5' | 5'-GAU-3'
Glutamate | 3'-CUU-5' | 5'-GAA-3', 5'-GAG-3'
 | 3'-CUC-5' | 5'-GAG-3'
Glycine | 3'-CCI-5' | 5'-GGU-3', 5'-GGC-3', 5'-GGA-3'
 | 3'-CCG-5' | 5'-GGU-3', 5'-GGC-3'
 | 3'-CCU-5' | 5'-GGA-3', 5'-GGG-3'
 | 3'-CCA-5' | 5'-GGU-3'
 | 3'-CCC-5' | 5'-GGG-3'


While it is not possible to predict a specific DNA codon from an amino acid, DNA codons can be decoded unambiguously into amino
acids. The reason is that there are 61 different DNA (and mRNA) codons specifying only 20 amino acids. Note that there are 3
additional codons for chain termination, i.e. there are 64 DNA (and thus 64 different mRNA) codons, but only 61 of them specify amino
acids.

Table 5 shows the genetic code for the translation of all 64 DNA codons, starting from DNA over mRNA and tRNA to amino acid. In the
last column, the table shows the different tRNA anticodons minimally necessary to translate all DNA codons into amino acids and sums
up the number in the final row. It reveals that the minimum number of tRNA anticodons to translate all DNA codons is 31 (plus 2 for the
STOP codons). The maximum number of tRNA anticodons that can emerge in amino acid translation is 70 (plus 3 for the STOP codons).
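The minimum of 31 anticodons can be reproduced with a small brute-force search. The Python sketch below is an illustration under stated assumptions: it uses the standard genetic code, the 3rd-place pairing rules described for Table 3 (U reads A or G, G reads U or C, I reads U, C or A, A reads U, C reads G), and requires that an anticodon never reads codons of a different amino acid. The helper min_anticodons is hypothetical, and STOP codons are left out of the count.

```python
from itertools import combinations, product

# Standard genetic code, codons enumerated in U, C, A, G order ("*" marks a stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Wobble base of the anticodon -> 3rd codon bases it can read.
WOBBLE = {"U": "AG", "G": "UC", "I": "UCA", "A": "U", "C": "G"}


def min_anticodons(third_bases):
    """Smallest number of wobble bases that read exactly the given 3rd bases."""
    usable = [w for w, reads in WOBBLE.items() if set(reads) <= third_bases]
    for size in range(1, len(usable) + 1):
        for combo in combinations(usable, size):
            if set().union(*(WOBBLE[w] for w in combo)) == third_bases:
                return size
    raise ValueError("these third bases cannot be covered")


total = 0
for first, second in product(BASES, repeat=2):
    # Within one box of four codons, group the 3rd bases by amino acid.
    groups = {}
    for third in BASES:
        amino_acid = CODON_TABLE[first + second + third]
        if amino_acid != "*":
            groups.setdefault(amino_acid, set()).add(third)
    total += sum(min_anticodons(bases) for bases in groups.values())

print(total)  # minimum number of anticodons for the 61 sense codons
```

A fourfold degenerate box needs two anticodons (for example one ending in I and one in C), a twofold box needs one, and summing over all boxes gives 31.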
Table 5. Genetic code: DNA -> mRNA codon -> tRNA anticodon -> amino acid

Obs. | DNA | mRNA | tRNA | Amino acid | Different AA | Diff. tRNA anticodons to encode all AA
1 | TTT | UUU | AAA, AAG | Phe | Phenylalanine | AAG
2 | TTC | UUC | AAG | Phe | |
3 | TTA | UUA | AAU | Leu | Leucine | AAU
4 | TTG | UUG | AAU, AAC | Leu | |
5 | TCT | UCU | AGI, AGG, AGA | Ser | Serine | AGI; AGC (or AGU)
6 | TCC | UCC | AGI, AGG | Ser | |
7 | TCA | UCA | AGI, AGU | Ser | |
8 | TCG | UCG | AGC, AGU | Ser | |
9 | TAT | UAU | AUA, AUG | Tyr | Tyrosine | AUG
10 | TAC | UAC | AUG | Tyr | |
11 | TAA | UAA | AUU | STOP | | AUU
12 | TAG | UAG | AUC, AUU | STOP | |
13 | TGT | UGU | ACA, ACG | Cys | Cysteine | ACG
14 | TGC | UGC | ACG | Cys | |
15 | TGA | UGA | ACU | STOP | | ACU
16 | TGG | UGG | ACC | Trp | Tryptophan | ACC
17 | CTT | CUU | GAI, GAG, GAA | Leu | | GAI; GAC (or GAU)
18 | CTC | CUC | GAI, GAG | Leu | |
19 | CTA | CUA | GAI, GAU | Leu | |
20 | CTG | CUG | GAC, GAU | Leu | |
21 | CCT | CCU | GGI, GGG, GGA | Pro | Proline | GGI; GGC (or GGU)
22 | CCC | CCC | GGI, GGG | Pro | |
23 | CCA | CCA | GGI, GGU | Pro | |
24 | CCG | CCG | GGC, GGU | Pro | |
25 | CAT | CAU | GUA, GUG | His | Histidine | GUG
26 | CAC | CAC | GUG | His | |
27 | CAA | CAA | GUU | Gln | Glutamine | GUU
28 | CAG | CAG | GUC, GUU | Gln | |
29 | CGT | CGU | GCI, GCG, GCA | Arg | Arginine | GCI; GCC (or GCU)
30 | CGC | CGC | GCI, GCG | Arg | |
31 | CGA | CGA | GCI, GCU | Arg | |
32 | CGG | CGG | GCC, GCU | Arg | |
33 | ATT | AUU | UAI, UAG, UAA | Ile | Isoleucine | UAI
34 | ATC | AUC | UAI, UAG | Ile | |
35 | ATA | AUA | UAI, UAU | Ile | |
36 | ATG | AUG | UAC | Met | Methionine | UAC
37 | ACT | ACU | UGI, UGG, UGA | Thr | Threonine | UGI; UGC (or UGU)
38 | ACC | ACC | UGI, UGG | Thr | |
39 | ACA | ACA | UGI, UGU | Thr | |
40 | ACG | ACG | UGC, UGU | Thr | |
41 | AAT | AAU | UUA, UUG | Asn | Asparagine | UUG
42 | AAC | AAC | UUG | Asn | |
43 | AAA | AAA | UUU | Lys | Lysine | UUU
44 | AAG | AAG | UUC, UUU | Lys | |
45 | AGT | AGU | UCA, UCG | Ser | | UCG
46 | AGC | AGC | UCG | Ser | |
47 | AGA | AGA | UCU | Arg | | UCU
48 | AGG | AGG | UCC, UCU | Arg | |
49 | GTT | GUU | CAI, CAG, CAA | Val | Valine | CAI; CAC (or CAU)
50 | GTC | GUC | CAI, CAG | Val | |
51 | GTA | GUA | CAI, CAU | Val | |
52 | GTG | GUG | CAC, CAU | Val | |
53 | GCT | GCU | CGI, CGG, CGA | Ala | Alanine | CGI; CGC (or CGU)
54 | GCC | GCC | CGI, CGG | Ala | |
55 | GCA | GCA | CGI, CGU | Ala | |
56 | GCG | GCG | CGC, CGU | Ala | |
57 | GAT | GAU | CUG, CUA | Asp | Aspartate | CUG
58 | GAC | GAC | CUG | Asp | |
59 | GAA | GAA | CUU | Glu | Glutamate | CUU
60 | GAG | GAG | CUU, CUC | Glu | |
61 | GGT | GGU | CCI, CCG, CCA | Gly | Glycine | CCI; CCC (or CCU)
62 | GGC | GGC | CCI, CCG | Gly | |
63 | GGA | GGA | CCI, CCU | Gly | |
64 | GGG | GGG | CCC, CCU | Gly | |
No. | 64 | 64 | | | 20 | 33


DNA sequencing is the determination of the precise sequence of nucleotides in a sample of DNA.
The most popular method for doing this is called the dideoxy method or Sanger method (named
after its inventor, Frederick Sanger, who was awarded the 1980 Nobel Prize in Chemistry [his
second] for this achievement).
DNA is synthesized from four deoxynucleotide triphosphates. The top formula shows one of them:
deoxythymidine triphosphate (dTTP). Each new nucleotide is added to the 3'-OH group of the last
nucleotide added.

The dideoxy method gets its name from the critical role played by synthetic nucleotides
that lack the -OH at the 3' carbon atom (red arrow). A dideoxynucleotide (dideoxythymidine
triphosphate, ddTTP, is the one shown here) can be added to the growing DNA strand, but
when it is, chain elongation stops because there is no 3'-OH for the next nucleotide to be
attached to. For this reason, the dideoxy method is also called the chain termination method.

The bottom formula shows the structure of azidothymidine (AZT), a drug used to treat AIDS. AZT (which is also
called zidovudine) is taken up by cells where it is converted into the triphosphate. The reverse transcriptase of the
human immunodeficiency virus (HIV) prefers AZT triphosphate to the normal nucleotide (dTTP). Because AZT has no
3'-OH group, DNA synthesis by reverse transcriptase halts when AZT triphosphate is incorporated in the growing DNA
strand. Fortunately, the DNA polymerases of the host cell prefer dTTP, so side effects from the drug are not as severe as
might have been predicted.

The Procedure
The DNA to be sequenced is prepared as a single strand.
This template DNA is supplied with

a mixture of all four normal (deoxy) nucleotides in ample quantities
o dATP
o dGTP
o dCTP
o dTTP
a mixture of all four dideoxynucleotides, each present in limiting quantities and each labeled
with a "tag" that fluoresces a different color:
o ddATP
o ddGTP
o ddCTP
o ddTTP
DNA polymerase I

Because all four normal nucleotides are present, chain elongation proceeds normally until, by
chance, DNA polymerase inserts a dideoxynucleotide (shown as colored letters) instead of the normal
deoxynucleotide (shown as vertical lines). If the ratio of normal nucleotide to the dideoxy versions
is high enough, some DNA strands will succeed in adding several hundred nucleotides before insertion
of the dideoxy version halts the process.
At the end of the incubation period, the fragments are separated by length from longest to shortest.
The resolution is so good that a difference of one nucleotide is enough to separate that strand from
the next shorter and next longer strand. Each of the four dideoxynucleotides fluoresces a different
color when illuminated by a laser beam, and an automatic scanner provides a printout of the
sequence.
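
The ratio argument above can be made concrete with a little arithmetic. The sketch below assumes, as a simplification that is not in the original text, that every nucleotide addition has the same probability p of being a chain-terminating dideoxynucleotide; the values of p are purely illustrative.

```python
# Sketch: chance that a growing strand survives n additions when each
# addition has probability p of incorporating a chain-terminating ddNTP.
# p is an assumed, illustrative value; real reactions differ per base.

def survival_probability(p: float, n: int) -> float:
    """Probability that none of the first n additions is a dideoxynucleotide."""
    return (1.0 - p) ** n


def mean_fragment_length(p: float) -> float:
    """Expected number of additions up to and including the terminator."""
    return 1.0 / p


for p in (0.05, 0.01, 0.002):
    print(f"p={p}: P(longer than 300 nt)={survival_probability(p, 300):.3g}, "
          f"mean length={mean_fragment_length(p):.0f} nt")
```

Under this simple model, a terminator fraction of 0.002 lets a little over half of the strands grow past 300 nucleotides, while at 0.05 almost none do, which is why the dideoxynucleotides are supplied in limiting quantities.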

How do we Sequence DNA?


DNA Sequencing is at the center of the Human Genome Project, which promises to revolutionize
the Biomedical Sciences and the treatment of human diseases. This page is designed to help you
understand how DNA is sequenced.
If you are looking for information on our DNA sequencing service facility, our home page is here:
The University of Michigan DNA Sequencing Core
First you need to know a few key terms:
As you go through the subsequent discussion, you may need to jump back here to refresh your memory on various
definitions.
DNA
We assume you've read through the description of DNA structure, an earlier link in this thread ... right? You
hopefully also read the link that describes DNA Denaturation, Annealing and Replication, since the following
page builds on those basics.
Plasmid
A 'plasmid' is a small, circular piece of DNA that is often found in bacteria. This innocuous molecule might help
the bacteria survive in the presence of an antibiotic, for example, due to the genes it carries. To scientists,
however, plasmids are important because (i) we can isolate them in large quantities, (ii) we can cut and splice
them, adding whatever DNA we choose, (iii) we can put them back into bacteria, where they'll replicate along
with the bacteria's own DNA, and (iv) we can isolate them again - getting billions of copies of whatever DNA we
inserted into the plasmid! Plasmids are generally limited to sizes of 2.5-20 kilobases (kb).
BAC
The term 'BAC' is an acronym for 'Bacterial Artificial Chromosome', and in principle, it is used like a plasmid. We
construct BACs that carry DNA from humans or mice or wherever, and we insert the BAC into a host bacterium.
As with the plasmid, when we grow that bacterium, we replicate the BAC as well. Huge pieces of DNA can be
easily replicated using BACs - usually on the order of 100-400 kilobases (kb). Using BACs, scientists have cloned
(replicated) major chunks of human DNA. This, as you will see later, is critical to the Human Genome Project.
Vector
The 'vector' is generally the basic type of DNA molecule used to replicate your DNA, like a plasmid or a BAC.
Insert
The 'insert' is a piece of DNA we've purposely put into another piece of DNA (a 'vector') so that we can replicate it.
The 'insert' is usually the interesting part. In the case of the Human Genome Project or other sequencing
projects, the insert is the part we want to sequence - the part we don't know. Usually we know the complete
DNA sequence of the vector.
Shotgun Sequencing
Shotgun sequencing is a method for determining the sequence of a very large piece of DNA. The basic DNA
sequencing reaction can only get the sequence of a few hundred nucleotides. For larger pieces (like BAC DNA), we
usually fragment the DNA and insert the resultant pieces into a convenient vector (a plasmid, usually) to
replicate them. After we sequence the fragments, we try to deduce from them the sequence of the original BAC
DNA.
For more definitions ...
See our Molecular Biology Glossary.

Now for the details on how DNA Sequencing works:


DNA sequencing reactions are just like the PCR reactions for replicating DNA (refer to the previous
page, DNA Denaturation, Annealing and Replication). The reaction mix includes the template DNA, free
nucleotides, an enzyme (usually a variant of Taq polymerase) and a 'primer' - a small piece of
single-stranded DNA about 20-30 nt long that can hybridize to one strand of the template DNA.

The reaction is initiated by heating until the two strands of DNA separate, then the primer sticks
to its intended location and DNA polymerase starts elongating the primer. If allowed to go to
completion, a new strand of DNA would be the result. If we start with a billion identical pieces of
template DNA, we'll get a billion new copies of one of its strands.

Dideoxynucleotides: We run the reactions, however, in the presence of a dideoxyribonucleotide. This
is just like a regular nucleotide, except it has no 3' hydroxyl group - once it's added to the end
of a DNA strand, there's no way to continue elongating it.

Now the key to this is that MOST of the nucleotides are regular ones, and just a fraction of them
are dideoxy nucleotides....

Replicating a DNA strand in the presence of dideoxy-T

MOST of the time when a 'T' is required to make the new strand, the enzyme will get a good one and
there's no problem. MOST of the time after adding a T, the enzyme will go ahead and add more
nucleotides. However, 5% of the time, the enzyme will get a dideoxy-T, and that strand can never
again be elongated. It eventually breaks away from the enzyme, a dead end product.
Sooner or later ALL of the copies will get terminated by a T, but each time the enzyme makes a new
strand, the place it gets stopped will be random. In millions of starts, there will be strands
stopping at every possible T along the way.
ALL of the strands we make started at one exact position. ALL of them end with a T. There are
billions of them ... many millions at each possible T position. To find out where all the T's are in
our newly synthesized strand, all we have to do is find out the sizes of all the terminated products!
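
The argument above can be checked with a small simulation. This is only a sketch: the strand sequence, the number of strands and the function name are made up for illustration, and the 5% figure is the one quoted in the text. Because the example strand is short, some simulated strands reach the end without ever picking up a dideoxy-T; those run-off products are simply ignored here.

```python
import random


def terminated_lengths(new_strand: str, dd_fraction: float, n_strands: int,
                       rng: random.Random) -> set[int]:
    """Simulate n_strands extensions and return the set of fragment lengths seen.

    Every time a T is added there is a dd_fraction chance that it is a
    dideoxy-T, which ends that strand at that position.
    """
    lengths = set()
    for _ in range(n_strands):
        for position, base in enumerate(new_strand, start=1):
            if base == "T" and rng.random() < dd_fraction:
                lengths.add(position)  # this strand ends with this T
                break
    return lengths


rng = random.Random(0)       # fixed seed so the run is repeatable
strand = "TGCGTCCATTAGCT"    # hypothetical newly synthesized strand
observed = terminated_lengths(strand, dd_fraction=0.05, n_strands=100_000, rng=rng)
true_t_positions = {i for i, b in enumerate(strand, start=1) if b == "T"}

print(sorted(observed))              # fragment lengths found in the simulation
print(observed == true_t_positions)  # do they match the positions of the T's?
```

With this many strands, every T position collects thousands of terminated products, so the set of fragment lengths is exactly the set of positions where a T occurs.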

Here's how we find out those fragment sizes.

Gel electrophoresis can be used to separate the fragments by size and measure them. In the cartoon
at left, we depict the results of a sequencing reaction run in the presence of dideoxy-Cytidine
(ddC).
First, let's add one fact: the dideoxy nucleotides in my lab have been chemically modified to
fluoresce under UV light. The dideoxy-C, for example, glows blue. Now put the reaction products onto
an 'electrophoresis gel' (you may need to refer to 'Gel Electrophoresis' in the Molecular Biology
Glossary), and you'll see something like what is depicted at left. The smallest fragments are at the
bottom, the largest at the top. The positions and spacing show the relative sizes. At the bottom is
the smallest fragment that's been terminated by ddC; that's probably the C closest to the end of the
primer (which is omitted from the sequence shown). Simply by scanning up the gel, we can see that we
skip two, and then there are two more C's in a row. Skip another, and there's yet another C. And so
on, all the way up. We can see where all the C's are.

Putting all four deoxynucleotides into the picture:

Well, OK, it's not so easy reading just C's, as you perhaps saw
in the last figure. The spacing between the bands isn't all that
easy to figure out. Imagine, though, that we ran the reaction
with *all four* of the dideoxy nucleotides (A, G, C and T)
present, and with *different* fluorescent colors on each. NOW
look at the gel we'd get (at left). The sequence of the DNA is
rather obvious if you know the color codes ... just read the
colors from bottom to top: TGCGTCCA-(etc).
(Forgive me for using black - it shows up better than yellow).

An Automated sequencing gel:

That's exactly what we do to sequence DNA, then - we run DNA replication reactions in a
test tube, but in the presence of trace amounts of all four of the dideoxy terminator
nucleotides. Electrophoresis is used to separate the resulting fragments by size and we can
'read' the sequence from it, as the colors march past in order.
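
The read-out step described above can be sketched in a few lines. This is a toy model under stated assumptions: it pretends that exactly one terminated product exists for every position, and it uses the last letter of each fragment as a stand-in for the fluorescent color of its terminator. The function names are hypothetical.

```python
def sanger_fragments(new_strand: str) -> list[str]:
    """All chain-terminated products: one prefix ending at every position."""
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]


def read_gel(fragments: list[str]) -> str:
    """Sort fragments shortest first (bottom of the gel) and read each
    fragment's terminal base, which stands in for its fluorescent color."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


strand = "TGCGTCCA"          # the example sequence read from the gel in the text
fragments = sanger_fragments(strand)
fragments.reverse()          # the tube holds the products in no particular order
print(read_gel(fragments))   # TGCGTCCA
```

Sorting by length plays the role of electrophoresis: once the fragments are in size order, the terminal bases spell out the sequence from bottom to top.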
In a large-scale sequencing lab, we use a machine to run the electrophoresis step and to
monitor the different colors as they come out. Since about 2001, these machines - not
surprisingly called automated DNA sequencers - have used 'capillary electrophoresis',
where the fragments are piped through a tiny glass-fiber capillary during the
electrophoresis step, and they come out the far end in size-order. There's an ultraviolet
laser built into the machine that shoots through the liquid emerging from the end of the
capillaries, checking for pulses of fluorescent colors to emerge. There might be as many
as 96 samples moving through as many capillaries ('lanes') in the most common type of
sequencer.
At left is a screen shot of a real fragment of sequencing gel (this one from an older model
of sequencer, but the concepts are identical). The four colors red, green, blue and yellow
each represent one of the four nucleotides.
The actual gel image, if you could get a monitor large enough to see it all at this
magnification, would be perhaps 3 or 4 meters long and 30 or 40 cm wide.
A 'Scan' of one gel lane:

We don't even have to 'read' the sequence from the gel - the computer does that for us! Below is an example
of what the sequencer's computer shows us for one sample. This is a plot of the colors detected in one 'lane'
of a gel (one sample), scanned from smallest fragments to largest. The computer even interprets the colors by
printing the nucleotide sequence across the top of the plot. This is just a fragment of the entire file, which
would span around 900 or so nucleotides of accurate sequence.
The sequencer also gives the operator a text file containing just the nucleotide sequence, without the color
traces.

As you have seen, we can get the sequence of a fragment of DNA as long as 900 or so
nucleotides. Great! But what about longer pieces? The human genome is 3 *billion* bases
long, arranged on 23 pairs of chromosomes. Our sequencing machine reads just a drop in
the bucket compared to what we really need!
To do it, we break the entire genome up into manageable pieces and sequence them. There
are two approaches currently in use:

The Publicly-funded Human Genome Project: The National Institutes of Health and the National Science
Foundation have funded the creation of 'libraries' of BAC clones. Each BAC carries a large piece of human
genomic DNA on the order of 100-300 kb. All of these BACs overlap randomly, so that any one gene is
probably on several different overlapping BACs. We can replicate those BACs as many times as necessary, so
there's a virtually endless supply of the large human DNA fragment.

In the Publicly-funded project, the BACs are subjected to shotgun sequencing (see below) to figure
out their sequence. By sequencing all the BACs, we know enough of the sequence in overlapping
segments to reconstruct how the original chromosome sequence looks.

A Privately-Funded Sequencing Project: Celera Genomics. An innovative approach to sequencing the human
genome has been pioneered by Celera Genomics. The founders of this company realized that it might be
possible to skip the entire step of making libraries of BAC clones. Instead, they blast apart the entire human
genome into fragments of 2-10 kb and sequence those. Now the challenge is to assemble those fragments of
sequence into the whole genome sequence.

Imagine, for example, that you have hundreds of 500-piece puzzles, each being assembled by a team
of puzzle experts using puzzle-solving computers. Those puzzles are like BACs - smaller puzzles that
make a big genome manageable. Now imagine that Celera throws all those puzzles together into one
room and scrambles the pieces. They, however, have scanners that scan all the puzzle pieces and huge
computers that figure out where they all go.
It is controversial still as to whether the Celera approach will succeed on a puzzle as large as the
human genome. Whether it does or not, they have certainly stirred up the intellectual pot a bit.

Shotgun sequencing: assembly of random sequence fragments


To sequence a BAC, we take millions of copies of it and chop them all up randomly. We then insert those into
plasmids and for each one we get, we grow lots of it in bacteria and sequence the insert. If we do this to enough
fragments, eventually we'll be able to reconstruct the sequence of the original BAC based on the overlapping
fragments we've sequenced!
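
To make the idea of reconstructing a sequence from overlapping fragments concrete, here is a minimal sketch of a greedy overlap-and-merge assembler. It is a toy, not the method used by any real sequencing project: it assumes error-free reads that all come from the same strand, and the reads, function names and minimum overlap length are invented for illustration.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest match is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap."""
    unique = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in unique if not any(r != s and r in s for s in unique)]
    while True:
        best_len, best_a, best_b = 0, None, None
        for a in contigs:
            for b in contigs:
                if a is not b:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_a, best_b = n, a, b
        if best_len == 0:
            return contigs  # nothing left to merge
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])


# Three hypothetical overlapping reads from the sequence ATGGCGTACGTTAGC.
reads = ["TACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

Real genomes contain repeats and reads contain errors, so a greedy merge like this can go wrong at scale; it is meant only to show how overlaps let short reads be stitched back together.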

Introduction
You can think of the sequences of bases in the coding strand of DNA or in messenger RNA as coded
instructions for building protein chains out of amino acids. There are 20 amino acids used in making
proteins, but only four different bases to be used to code for them.
Obviously one base can't code for one amino acid. That would leave 16 amino acids with no codes.
If you took two bases to code for each amino acid, that would still only give you 16 possible codes
(TT, TC, TA, TG, CT, CC, CA and so on) - still not enough.
However, if you took three bases per amino acid, that gives you 64 codes (TTT, TTC, TTA, TTG,
TCT, TCC and so on). That's enough to code for everything with lots to spare. You will find a full table
of these below.
A three base sequence in DNA or RNA is known as a codon.
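
The counting argument above is easy to verify by listing every possible code of one, two and three bases. The helper name below is hypothetical; the base order T, C, A, G follows the order used in the text's examples.

```python
from itertools import product

BASES = "TCAG"  # same base order as the examples in the text


def all_codes(length: int) -> list[str]:
    """Every possible sequence of the given length built from the four bases."""
    return ["".join(p) for p in product(BASES, repeat=length)]


for length in (1, 2, 3):
    codes = all_codes(length)
    print(length, len(codes), codes[:6])
```

The counts come out as 4, 16 and 64, so three bases per amino acid is the shortest code length that covers all 20 amino acids.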

The code in DNA


The codes in the coding strand of DNA and in messenger RNA aren't, of course, identical, because in
RNA the base uracil (U) is used instead of thymine (T).
The table shows how the various combinations of three bases in the coding strand of DNA are used
to code for individual amino acids - shown by their three letter abbreviation.

The table is arranged in such a way that it is easy to find any particular combination you want. It is
fairly obvious how it works and, in any case, it doesn't take very long just to scan through the table to
find what you want.
The colours are to stress the fact that most of the amino acids have more than one code. Look, for
example, at leucine in the first column. There are six different codons all of which will eventually
produce a leucine (Leu) in the protein chain. There are also six for serine (Ser).
In fact there are only two amino acids which have only one sequence of bases to code for them: methionine (Met) and tryptophan (Trp).
You have probably noticed that three codons don't have an amino acid written beside them, but say
"stop" instead. For obvious reasons these are known as stop codons. We'll leave talking about those
until we have looked at the way the code works in messenger RNA.
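
The redundancy described above can be counted directly. The sketch below builds the standard genetic code for the DNA coding strand from a compact 64-letter string (codons in T, C, A, G order). It uses one-letter amino acid symbols, with '*' for stop, rather than the three-letter abbreviations in the table; that notation and the variable names are choices made for this example.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons ordered TTT, TTC, TTA, ...
# One-letter amino acid symbols are used here; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

codons_per_aa = Counter(CODON_TABLE.values())
single_codon = sorted(aa for aa, n in codons_per_aa.items() if n == 1)

print(codons_per_aa["L"], codons_per_aa["S"])  # leucine and serine: 6 6
print(single_codon)                            # ['M', 'W'] = Met and Trp
print(codons_per_aa["*"])                      # stop codons: 3
```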

The code in messenger RNA


You will remember that when DNA is transcribed into messenger RNA, the sequence of bases
remains exactly the same, except that each thymine (T) is replaced by uracil (U). That gives you the
table:

In many ways, this is the more useful table. Messenger RNA is directly involved in the production of
the protein chains (see the next page in this sequence). The DNA coding chain is one stage removed
from this because it must first be transcribed into a messenger RNA chain.
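
As a small illustration of the relationship between the two tables, the coding-strand sequence can be converted to the messenger RNA sequence by swapping each T for a U. The function name and example sequence are made up for this sketch.

```python
def coding_strand_to_mrna(coding_dna: str) -> str:
    """Return the mRNA sequence matching a DNA coding strand (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


print(coding_strand_to_mrna("ATGTTTTAA"))  # AUGUUUUAA
```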

Start and stop codons


The stop codons in the RNA table (UAA, UAG and UGA) serve as a signal that the end of the chain
has been reached during protein synthesis - and we will come back to that on the next page.
There is also a start codon - but you won't find it called that in the table!
The codon that marks the start of a protein chain is AUG. If you check the table, that is the codon for the amino acid
methionine (Met). That ought to mean that every protein chain must start with methionine. That's not
quite true because in some cases the methionine can get chopped off the chain after synthesis is
complete.
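
The roles of the start and stop codons can be sketched as a simple reading loop. This is a deliberately simplified model: it begins at the first AUG it finds and reads three bases at a time until a stop codon, ignoring everything else a real cell does to choose a start site. The table uses the standard genetic code with one-letter amino acid symbols ('*' for stop); the function name and the example mRNA are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in mRNA form, codons ordered UUU, UUC, UUA, ...
# One-letter amino acid symbols; '*' marks the stop codons UAA, UAG and UGA.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so no protein chain
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: the chain ends here
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGAUGUUUCUUUGGUAAGC"))  # MFLW
```

The chain begins with M (methionine) because reading starts at AUG, and the UAA codon ends it before the trailing bases are read.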
