
Genomics-Lectures 1 To 8 - 2023 PDF

The document provides an introduction to genomics and the structure of DNA. It explains that the human body contains about 100 trillion cells, each containing the entire human genome stored as DNA. DNA is organized into chromosomes, which contain many genes that code for particular proteins. The key components of DNA are nucleotides containing adenine, thymine, cytosine, and guanine. James Watson and Francis Crick discovered the double helix structure of DNA in 1953, which was key to understanding how genetic information is stored and copied.

Introduction to Genomics

Our body is made up of about 100 trillion cells. That’s 100,000,000,000,000!

Each cell contains the entire human genome.

Cells differentiate by turning on and off different genes.

When unfolded, DNA looks like a double helix: a twisted ladder.

DNA is looped and folded so long stretches can fit into a nucleus.

Chromosomes have many genes: small sections of DNA that code for a particular protein.

The DNA is organized into chromosomes: a human cell carries 46 chromosomes (23 pairs).
What does DNA look like?
• Each position in a DNA strand holds one of four nucleotides: adenine (A), thymine (T), cytosine (C), or guanine (G).

• The bases pair across the two strands: A with T, and C with G.
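
The pairing rule can be written out as a short Python sketch. The function names are made up for illustration; the only biology used is the A-T and C-G pairing stated above, plus the fact that the two strands of the helix run in opposite directions.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]

print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Taking the reverse complement twice gives back the original strand, which is a quick sanity check on the pairing table.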
DNA
Molecule of Life
Deoxyribonucleic Acid

The discovery of the structure of DNA was the most important event in biological science of the 20th century
Discovery!
The structure of DNA was discovered in 1953 by James Watson and Francis Crick.
Rosalind Franklin
Developed techniques for X-ray diffraction photographs.
Franklin’s X-ray pictures of DNA fibers were key to the discovery of the structure of DNA.
X-Ray Diffraction
Using models, Watson and Crick were able to come up with a structure of DNA that matched the pattern in Franklin’s photographs.
NOBEL PRIZE
Watson and Crick won the Nobel Prize in 1962.

Rosalind Franklin died in 1958. The Nobel Prize is only awarded to living scientists.
Chemical structure, nomenclature, properties of nucleotides

Nucleotides have three characteristic components:
(1) a nitrogenous (nitrogen-containing) base,
(2) a pentose sugar,
(3) phosphate groups

The molecule without the phosphate group is called a nucleoside (sugar + base).

The nitrogenous bases are derivatives of two parent compounds, pyrimidine and purine (heterocyclic compounds).
The flow of genetic information

(Central Dogma of Molecular Biology)


DNA → RNA → PROTEIN

• DNA → DNA: replication
• DNA → RNA: transcription
• RNA → protein: translation
• RNA → DNA: reverse transcription (used by retroviruses, a group of RNA viruses)
DNA Replication
(Synthesis of the new DNA Strands)

Required components:

• dNTPs (deoxyribonucleoside 5’-triphosphates): dATP, dTTP, dGTP, dCTP

• DNA template

• DNA polymerase

• Mg2+ (optimizes DNA polymerase activity)

DNA elongation
Polymerase Chain Reaction
What is PCR?
PCR, or Polymerase Chain Reaction, is a molecular technique that produces large quantities of a specific DNA sequence from a DNA template using a simple enzymatic reaction, without a living organism.

In essence, it is a molecular photocopier, in which we reproduce the replication of DNA in a test tube.
A brief history

• The concept of DNA synthesis or oligonucleotide synthesis was first introduced by the Khorana group in the 1970s.

• In their 1970 synthetic gene paper in Nature, they note, “Unpublished experiments by two of us have given encouraging results on the use of DNA polymerase for replication of the gene in the presence of suitable primers.” (Agarwal et al., 1970: 30).
A brief history
• A 1971 paper in the Journal of Molecular Biology by Kleppe and co-workers first described a method using an enzymatic assay to replicate a short DNA template with primers in vitro.

• But progress was limited by primer synthesis and polymerase purification issues.

• It went unnoticed for a good 15 years until PCR burst onto the scene.
Invention of PCR

• The idea was conceived by Kary Mullis in the spring of 1983, when he was working at Cetus Corporation.

• First reported to the scientific community at a Cold Spring Harbor conference in 1986, followed by publication in 1987 in Methods in Enzymology.
Components of PCR
• DNA template, which contains the region of the DNA fragment to be amplified.

• DNA polymerase, which copies the region to be amplified.

• Two primers, which determine the beginning and end of the region to be amplified.

• Nucleotides, from which the DNA polymerase builds the new DNA.

• Buffer, which provides a suitable chemical environment for the DNA polymerase.
Basic cycling in PCR
• In a PCR reaction, the following series of steps is repeated 20-40 times.
(Note: 25 cycles usually take about 2 hours and amplify the DNA fragment of interest about 100,000-fold.)

Step 1: Denature DNA
At 95 ºC, the DNA is denatured (i.e. the two strands are separated).

Step 2: Primers Anneal
At 40 ºC - 65 ºC, the primers anneal (or bind) to their complementary sequences on the single strands of DNA.

Step 3: DNA Polymerase Extends the DNA Chain
DNA polymerase extends the DNA chain by adding nucleotides to the 3’ ends of the primers at ~70 ºC.
Was it easy then?!!!

• No. Initially one had to use water baths and laboratory timers.
– Cycling became easier after the invention of thermal cyclers.

• DNA polymerase is destroyed in each denaturation step at 95 ºC.
– The polymerase problem was solved when scientists at the Cetus Corporation discovered the use of enzymes from thermophilic (heat-loving) bacteria, which grow at temperatures above 90 ºC.
Step 1. Denaturation of DNA
5’ 3’
3’ 5’

5’ 3’

3’ 5’

This occurs at 95 ºC, mimicking the function of helicase in the cell.
Step 2. Annealing or Primers Binding
5’ 3’
3’ 5’ Reverse Primer

Forward Primer 5’ 3’

3’ 5’

Primers bind to the complementary sequence on the target DNA. Primers are chosen such that one is complementary to one strand at one end of the target sequence and the other is complementary to the other strand at the other end of the target sequence.
Step 3. Extension or Primer Extension

5’ 3’
3’ 5’
extension

extension
5’ 3’
3’ 5’

DNA polymerase catalyzes the extension of the strand in the 5’→3’ direction, starting at the primers, attaching the appropriate nucleotide (A-T, C-G).
• The next cycle will begin by denaturing the new DNA strands formed in the previous cycle.

[Figure: denaturation and copying of the strands formed in the previous cycle; each strand is labelled with its 5’ and 3’ ends.]
The Size of the DNA Fragment Produced in
PCR is Dependent on the Primers
• The PCR reaction will amplify the DNA section between the two primers.
• If the DNA sequence is known, primers can be developed to amplify any
piece of an organism’s DNA.
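
How the two primers fix the amplified region can be sketched in Python. This is a toy illustration under simplifying assumptions: exact matches only, one binding site per primer, and a made-up template and primer pair. The function names are hypothetical.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    return "".join(PAIRS[base] for base in reversed(seq))

def amplicon(template, forward_primer, reverse_primer):
    """Return the amplified region of the top strand (primers included), or None."""
    start = template.find(forward_primer)
    if start == -1:
        return None
    # The reverse primer anneals to the top strand, so on the top strand
    # its binding site reads as the primer's reverse complement.
    site = reverse_complement(reverse_primer)
    end = template.find(site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(site)]

template = "TTGACCATGGCATCGATCGTTAGGCCAATT"  # invented sequence
product = amplicon(template, "ACCATG", "GGCCTA")
print(product, len(product))  # ACCATGGCATCGATCGTTAGGCC 23
```

Moving either primer inward or outward along the template changes the length of the product, which is the point made on the slide.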

Forward primer

Reverse primer

Size of fragment that is amplified


Components of PCR
Template:
• Can be any crude preparation – just boil your cells in
buffer, spin and take the supernatant.
• High Specificity, very low requirements – wide range
of templates
• Dried blood
• Semen stains
• Vaginal swabs
• Single hair
• Fingernail scrapings
• Insects in Amber
• Egyptian mummies
• Buccal Swab
• Toothbrushes
Components of PCR
Primers
• Artificial DNA strands, 18-25 nucleotides long.
• Match the beginning and end of the DNA fragment to be amplified.
• Anneal to the DNA template at these starting and ending points.
• DNA polymerase binds there and begins the synthesis of the new DNA strand.
• Primer design criteria – discussed later
Components of PCR
• Reaction Buffer
• Concentration of MgCl2 around 1.5 mM, but to be optimised depending on the reaction.
– Excess MgCl2 leads to accumulation of non-specific amplification products
– Insufficient MgCl2 reduces the yield

• DMSO: an additive that reduces secondary structure of DNA; used with difficult high-GC templates; inhibits Taq; reduces yield

• Buffering agent: maintains pH at higher temperatures. Tris is usually preferred.
Components of PCR
DNA Polymerase

• Initially, E. coli DNA polymerase or T4 DNA polymerase was used.

• Now, Taq DNA polymerase from the thermophile Thermus aquaticus.

• Very efficient, thermostable.

• Problem: Taq DNA polymerase lacks the 3´→5´ proof-reading activity commonly present in other polymerases. Taq mis-incorporates about 1 base in 10,000. The errors are distributed randomly.
Components of PCR

DNA Polymerase
• Polymerases such as Pfu, obtained from Archaea, have proof-reading mechanisms.
• Combinations of both Taq and Pfu are available
• high fidelity
• accurate amplification of DNA.
Components of PCR
dNTPs
• Typically 0.2 mM in the reaction.
• Too little dNTP gives poor yield.
• Too much leads to mis-incorporation by the polymerase.
• Very sensitive to freeze-thaw cycles. Should be kept in small aliquots and discarded after a few freeze-thaw cycles.
Cycles in PCR

Cycles    Copies
1         2
2         4
4         16
10        1024
15        32,768
20        1,048,576
25        33,554,432
30        1,073,741,824

• PCR target molecules accumulate as a function of cycle number.
• The exponential phase lasts for about 30 cycles under standard reaction conditions.
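
The copy numbers in the table are powers of two, which a short Python sketch can reproduce. The `efficiency` parameter is an addition for illustration and is not from the slides: a value of 1.0 means every template is copied in every cycle, and lower values model a less-than-perfect reaction.

```python
def copies_after(cycles, start_copies=1, efficiency=1.0):
    """Copies after `cycles` cycles; efficiency 1.0 means perfect doubling."""
    return start_copies * (1 + efficiency) ** cycles

for n in (1, 2, 4, 10, 15, 20, 25, 30):
    print(n, f"{int(copies_after(n)):,}")
```

With perfect doubling the loop prints the values of the table. Lowering `efficiency` shows how quickly the yield falls below the ideal curve, which is one way to read the gap between the ideal table and the yields seen in practice.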
Cycles in PCR
• Increasing the cycle
number above ~35 has
little positive effect.
• The plateau occurs
when:
– The reagents are depleted
– The products re-anneal
– The polymerase is
damaged
• Unwanted products
accumulate.
Was your PCR successful?
• Run a sample on a gel
• Check for the following:
– Size of the PCR Product
– Is there a single band?
– Multiple bands – mispriming.
– Any single band of your expected size?
– Any band in negative control?
Troubleshooting in PCR

• Annealing temperature of the primers.


• The concentration of Mg2+ in the reaction.
• The extension time.
• The denaturing and annealing times.
• The extension temperature.
• The amount of template and polymerase.
PCR primer design
• The main criteria for primer design are length, sequence and melting
temperature.

• Recommended primer length is 18-25 nucleotides.

• GC content of the primers should range between 40% and 60%.

• The most important region for the specific priming is the 3’ region of the
primer; amplification starts here. The 3’ ends should be free of secondary
structures, repetitive sequences, palindromes and highly degenerate
sequences.

• The sequences should lack complementarity to each other, especially at their 3’ ends (so primer-dimers will not form).

• Most primers should have melting temperatures (Tm) between 55 ºC and 65 ºC.

• Primers that are used together should have similar Tm values and should not differ by more than 5 ºC.
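
Some of these criteria are easy to check automatically. The sketch below tests length, GC content and melting temperature against the ranges listed above. The Tm estimate uses the Wallace rule, Tm = 2(A+T) + 4(G+C), a rough approximation for short primers that is an assumption here and not taken from the slides. The function names and the example primer are invented for illustration.

```python
def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    return 100.0 * sum(primer.count(b) for b in "GC") / len(primer)

def wallace_tm(primer):
    """Rough melting temperature in ºC (Wallace rule, an approximation)."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

def check_primer(primer):
    """Return a list of criteria the primer fails (empty if none)."""
    problems = []
    if not 18 <= len(primer) <= 25:
        problems.append("length outside 18-25 nt")
    if not 40 <= gc_content(primer) <= 60:
        problems.append("GC content outside 40-60%")
    if not 55 <= wallace_tm(primer) <= 65:
        problems.append("Tm outside 55-65 ºC")
    return problems

primer = "AGCTGACCTGAAGCTCATCG"  # invented 20-nt example
print(gc_content(primer), wallace_tm(primer), check_primer(primer))
```

Secondary structure, 3’-end complementarity and primer-dimer checks are not covered by this sketch; dedicated primer design software handles those.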
Can you design a PCR program to
amplify a gene of your interest?
Example PCR program

1. 95 ºC for 5 min
2. 95 ºC for 30 sec
3. 60 ºC for 30 sec
4. 72 ºC for 1 min (for an amplicon size of 1 kb)
5. Go to step 2 for 35 cycles
6. 72 ºC for 10 min
7. 4 ºC hold / stop
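
As a rough planning aid, the example program can be written down as data and its programmed hold time added up. This sketch assumes that "35 cycles" means 35 passes through steps 2-4 in total, ignores the time the machine needs to change temperature, and treats 1 min per kb as a linear rule of thumb for the extension step. The names are made up for illustration.

```python
# (temperature in ºC, seconds) for each step of the example program.
INITIAL_DENATURATION = (95, 5 * 60)
CYCLE = [(95, 30), (60, 30), (72, 60)]  # denature, anneal, extend
CYCLES = 35                             # assumed: 35 cycles in total
FINAL_EXTENSION = (72, 10 * 60)

def total_hold_seconds():
    """Sum of all programmed hold times, without ramping between steps."""
    per_cycle = sum(seconds for _, seconds in CYCLE)
    return INITIAL_DENATURATION[1] + CYCLES * per_cycle + FINAL_EXTENSION[1]

def extension_seconds(amplicon_bp, seconds_per_kb=60):
    """Scale the extension step from the 1 min per 1 kb of the example."""
    return amplicon_bp / 1000 * seconds_per_kb

print(total_hold_seconds() / 60)  # 85.0 (minutes)
print(extension_seconds(2000))    # 120.0 (seconds)
```

The real run time is longer than the 85 minutes of hold time because heating and cooling between steps is not counted.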
Laboratory Applications of PCR
DNA Repair

• Excision repair:
1. Damaged segment is excised by a repair
enzyme (there are over 50 repair enzymes).

2. DNA polymerase and DNA ligase replace


and bond the new nucleotides together.
Transcription

From Genes to mRNA


Transcription

• RNA polymerase: the enzyme that links the RNA nucleotides together
Types of RNA
• There are 3 main types of RNA, each encoded by its own type of gene:

• mRNA - Messenger RNA: RNA molecule transcribed from a DNA template (transcription); it carries the protein-coding message.

• tRNA - Transfer RNA: translates the 3-letter codons of mRNA to amino acids (translation).

• rRNA - Ribosomal RNA: component of ribosomes (translation).
Transcription Steps
• DNA unwinds

• RNA polymerase links the nucleotides together. In the transcription of a gene, specific sequences of DNA nucleotides tell the RNA polymerase where to begin and end the transcribing process.
• The newly formed mRNA (pre-mRNA) has regions that do not code for protein. These regions are called introns and must be removed. Their function is not fully understood.

• The remaining portions of mRNA are called exons. They are spliced together (RNA splicing) to form a “final draft” of mRNA ready for translation.
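
Transcription and splicing can be mimicked with strings. In this sketch the sequences, the intron and the exon coordinates are invented, and real splicing depends on signals that the code does not model. It only shows the bookkeeping: build RNA complementary to the template strand, then join the exons and drop the intron.

```python
# DNA template base -> RNA base added by RNA polymerase.
RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template_3_to_5):
    """Build the RNA complementary to a DNA template strand read 3'->5'."""
    return "".join(RNA_PAIR[base] for base in template_3_to_5)

def splice(pre_mrna, exons):
    """Join the exon slices (start, end) and discard everything else."""
    return "".join(pre_mrna[start:end] for start, end in exons)

exon1 = transcribe("TACCGG")   # AUGGCC
intron = "GUAAGUUUUCAG"        # invented intron
exon2 = "UUUUAA"
pre_mrna = exon1 + intron + exon2

exons = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]
print(splice(pre_mrna, exons))  # AUGGCCUUUUAA
```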
TRANSLATION- RNA TO PROTEIN
(PROTEIN SYNTHESIS)

KEY TERMS
*tRNA
*rRNA
*codon
*anticodon
*codon - in mRNA, a three-base “word” that codes for one amino acid (e.g. AUG)

*anticodon - in tRNA, a triplet of nitrogenous bases that is complementary to a specific codon in mRNA (e.g. UAC)
TRANSLATION-THE STEPS
1. mRNA, tRNA with attached amino acid, and the 2 subunits of a ribosome come together

2. Start codon AUG dictates where translation will begin

3. Amino acids are added one by one to the growing chain of amino acids

4. The elongation process continues until the ribosome reaches a stop codon: UAA, UAG, or UGA
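
The four steps can be traced in a small Python sketch that scans an mRNA for the start codon and reads codons until a stop codon. The codon table is the standard genetic code written compactly in U, C, A, G order, with `*` marking stop codons. The function names and the example mRNA are invented, and the sketch leaves out the ribosome, tRNA charging and everything else that makes the real process work.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid, '*' = stop.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

RNA_PAIR = {"A": "U", "U": "A", "C": "G", "G": "C"}

def anticodon(codon):
    """Base-by-base complement of a codon, written in pairing order."""
    return "".join(RNA_PAIR[base] for base in codon)

def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)

print(anticodon("AUG"))               # UAC
print(translate("GGAUGGCCUUUUAAGG"))  # MAF
```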
Human Genome Project
Genomics

• Our genome encodes an enormous amount of information


about our beings
– our looks
– our size
– how our bodies work
– ….
– our health
– our behaviors
– … who we are!

gcgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggata
gcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatg
attgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgca
ataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgt
aggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgag
gcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagta
agtaagcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttg
tagctgatagtgatgatcgtag …….
Human Genome Project
Goals:
■ Identify all the genes in human DNA,
■ Determine the sequences of the 3 billion chemical base pairs that make up human
DNA,
■ Store this information in databases,
■ Improve tools for data analysis,
■ Transfer related technologies to the private sector, and
■ Address the ethical, legal, and social issues (ELSI) that may arise from the project.

Milestones:
■ 1990: Project initiated as joint effort of U.S. Department of Energy and the National
Institutes of Health (project time and cost: 15 years and $3 billion)
■ June 2000: Completion of a working draft of the entire human genome (using DNA
from 5 individuals)
■ February 2001: Analyses of the working draft were published
■ April 2003: HGP sequencing was completed and the Project was declared finished two
years ahead of schedule
Two strategies for genome sequencing
• Hierarchical Sequencing
• Shotgun Sequencing
DNA Sequencing

Among the different methods for DNA sequencing, the chain termination method developed by Frederick Sanger and his colleagues in 1977 is now the most commonly used all over the world. In fact, the other methods have now become obsolete.
The basic principle of Sanger method:

(i) Generate a nested set of molecules in 4 separate reactions, each terminating at either A, G, T or C.

(ii) Separate these fragments by high-resolution polyacrylamide gel electrophoresis (PAGE).
DNA Sequencing by Sanger Method

• The key feature of this method is that it uses dideoxynucleotide triphosphates (ddNTPs).

• ddNTPs have an H on the 3’ carbon of the sugar instead of the normal OH found in deoxynucleotide triphosphates (dNTPs).

• In a synthesis reaction, if a ddNTP is added instead of the normal deoxynucleotide, the synthesis stops at that point because the 3’ OH necessary for the addition of the next nucleotide is absent.
Difference between dNTP and ddNTP
DNA Sequencing

• Only one of the 4 ddNTPs is added in each of the 4 reactions.

• The ddNTP:dNTP ratio in each reaction is such that a fraction of the chains stops at each occurrence of the base corresponding to the ddNTP in the reaction.

• The ddNTPs are radioactively labelled for detection.

• Separate the 4 pools of molecules on a high-resolution gel.

• Read the sequence ladder from the gel.
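
The four reactions and the gel read-out can be imitated in a few lines of Python. This is an idealised simulation, not a model of the chemistry: it assumes that each reaction yields a fragment for every occurrence of its base in the newly synthesized strand, and that fragment length is counted from the end of the primer. The function names are made up for illustration.

```python
def sanger_lanes(new_strand):
    """Map each ddNTP base to the fragment lengths it terminates."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes

def read_gel(lanes):
    """Read the ladder from the shortest to the longest fragment."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)

lanes = sanger_lanes("GATTACA")
print(lanes["A"])       # [2, 5, 7]
print(read_gel(lanes))  # GATTACA
```

Sorting by length plays the role of the gel: the shortest fragment runs furthest, and the lane in which each successive band appears gives the next base.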


Strategy of the chain-termination method

A small amount of ddGTP plus excess dGTP partially terminates chains at Cs in the template.
Sequencing Gels
• Resolve fragments that differ in length by a single base!

• Length: ~70 cm

• Thickness: 0.1 mm

• Gel type: 7 M urea and 4-8% gradient acrylamide

• Gel running conditions: 2000 volts (17 watts) at 65 °C


Autoradiogram of a sequencing gel
DNA Sequencing using fluorescently labelled ddNTPs

1. All fragments start at the primer.
2. Fragments that end in different bases carry different color tags, and every fragment has a different length.
3. Separating the mixture of products by size reveals the sequence.
Radioactive labelling vs. fluorescent
labelling of ddNTPs
Capillary sequencing

• Involves small-diameter capillaries

- Can withstand high electric fields

- Small capillaries effectively dissipate the heat that is generated

- Large electric fields permit efficient separation of DNA molecules and a reduction in separation times
An automated DNA Sequencing Chromatogram
This is how a commercial sequencing result file looks!
What does the human genome sequence tell us?

By the Numbers
• The human genome contains 3 billion chemical nucleotide bases (A, C,
T, and G).

• The average gene consists of 3000 bases, but sizes vary greatly, with
the largest known human gene being Dystrophin at 2.4 million bases.

• The total number of genes is estimated at around 30,000, much lower than previous estimates of 80,000 to 140,000.

• Almost all (99.9%) nucleotide bases are exactly the same in all people.

• The functions are unknown for over 50% of discovered genes.


Book of Life : 5000 bases per page
CACACTTGCATGTGAGAGCTTCTAATATCTAAATTAATGTTGAATCATTATTCAGAAACAGAGAGCTAACTGTTATCCCATCCTGACTTTATTCTTTATG AGAAAAATACAGTGATTCC
AAGTTACCAAGTTAGTGCTGCTTGCTTTATAAATGAAGTAATATTTTAAAAGTTGTGCATAAGTTAAAATTCAGAAATAAAACTTCATCCTAAAACTCTGTGTGTTGCTTTAAATAATC
AGAGCATCTGC TACTTAATTTTTTGTGTGTGGGTGCACAATAGATGTTTAATGAGATCCTGTCATCTGTCTGCTTTTTTATTGTAAAACAGGAGGGGTTTTAATACTGGAGGAACAA
CTGATGTACCTCTGAAAAGAGA AGAGATTAGTTATTAATTGAATTGAGGGTTGTCTTGTCTTAGTAGCTTTTATTCTCTAGGTACTATTTGATTATGATTGTGAAAATAGAATTTATCC
CTCATTAAATGTAAAATCAACAGGAGAATAGCAAAAACTTATGAGATAGATGAACGTTGTGTGAGTGGCATGGTTTAATTTGTTTGGAAGAAGCACTTGCCCCAGAAGATACACAAT
GAAATTCATGTTATTGAGTAGAGTAGTAATACAGTGTGTTCCCTTGTGAAGTTCATAACCAAGAATTTTAGTAGTGGATAGGTAGGCTGAATAACTGACTTCCTATC ATTTTCAGGTT
CTGCGTTTGATTTTTTTTACATATTAATTTCTTTGATCCACATTAAGCTCAGTTATGTATTTCCATTTTATAAATGAAAAAAAATAGGCACTTGCAAATGTCAGATCACTTGCCTGTGGT
CATTCGGGTAGAGATTTGTGGAGCTAAGTTGGTCTTAATCAAATGTCAAGCTTTTTTTTTTCTTATAAAATATAGGTTTTAATATGAGTTTTAAAATAAAATTAATTAGAAAAAGGCAA
ATTACTCAATATATATAAGGTATTGCATTTGTAATAGGTAGGTATTTCATTTTCTAGTTATGGTGGGATATTATTCAGACTATAATTCCCAATGAAAAAACTTTAAAAAATGCTAGTGA
TTGCACACTTAAAACACCTTTTAAAAAGCATTGAGAGCTTATAAAATTTTAATGAGTGATAAAACCAAATTTGAAGAGAAAAGAAGAACCCAGAGAGGTAAGGATATAACCTTACC
AGTTGCAATTTGCCGATCTCTACAAATATTAATATTTATTTTGACAGTTTCAGGGTGAATGAGAAAGAAACCAAAACCCAAGACTAGCATATGTTGTCTTCTTAAGGAGCCCTCCCCT
AAAAGATTGAGATGACCAAATCTTATACTCTCAGCATAAGGTGAACCAGACAGACCTAAAGCAGTGGTAGCTTGGATCCACTACTTGGGTTTGTGTGTGGCGTGACTCAGGTAATCT
CAAGAATTGAACATTTTTTTAAGGTGGTCCTACTCATACACTGCCCAGGTATTAGGGAGAAGCAAATCTGAATGCTTTATAAAAATACCCTAAAGCTAAATCTTACAATATTCTCAAG
AACACAGTGAA ACAAGGCAAAATAAGTTAAAATCAACAAAAACAACATGAAACATAATTAGACACACAAAGACTTCAAACATTGGAAAATACCAGAGAAAGATAATAAATAT
TTTACTCTTTAAAAATTTAGTTAAAAGCTTAAACTAATTGTAGAGAAAA AACTATGTTAGTATTATATTGTAGATGAAATAAGCAAAACATTTAAAATACAAATGTGATTACTTAAAT
TAAATATAATAGATAATTTACCACCAGATTAGATACCATTGAAGGAATAATTAATATACTGAAATACAGGTCAGTAGAATTTTTTTCAATTCAGCATGGAGATGTAAAAAATGAAAA
TTAATGCAAAAAATAAGGGCACAAAAAGAAATGAGTAATTTTGATCAGAAATGTATTAAAATTAATAAACTGGAAATTTGACATTTAAAAAAAGCATTGTCATCCAAGTAGATGTG
TCTATTAAATAGTTGTTCTCATATCCAGTAATGTAATTATTATTCCCTCTCATGCAGTTCAGATTCTGGGGTAATCTTTAGACATCAGTTTTGTCTTTTATATTATTTATTCTGTTTACTAC
ATTTTATTTTGCTAATGATATTTTTAATTTCTGACATTCTGGAGTATTGCTTGTAAAAGGTATTTTTAAAAATACTTTATGGTTATTTTTGTGATTCCTATTCCTCTATGGACACCAAGGCT
ATTGACATTTTCTTTGGTTTCTTCTGTTACTTCTATTTTCTTAGTGTTTATATCATTTCATAGATAGGATATTCTTTATTTTTTATTTTTATTTAAATATTTGGTGATTCTTGGTTTTCTCAGCC
ATCTATTGTCAAGTGTTCTTATTAAGCATTATTATTAAATAAAGATTATTTCCTCTAATCACATGAGAATCTTTATTTCCCCCAAGTAATTGAAAATTGCAATGCCATGCTGCCATGTGG
TACAGCATGGGTTTGGGCTTGCTTTCTTCTTTTTTTTTTAACTTTTATTTTAGGTTTGGGAGTACCTGTGAAAGTTTGTTATATAGGTAAACTCGTGTCACCAGGGTTTGTTGTACAGATCA
TTTTGTCACCTAGGTACCAAGTACTCAACAATTATTTTTCCTGCTCCTCTGTCTCCTGTCACCCTCCACTCTCAAGTAGACTCCGGTGTCTGCTGTTCCATTCTTTGTGTCCATGTGTTCTC
ATAATTTAGTTCCCCACTTGTAAGTGAGAACATGCAGTATTTTCTAGTATTTGGTTTTTTGTTCCTGTGTTAATTTGCCCAGTATAATAGCCTCCAGCTCCATCCATGTTACTGCAAAGAA
CATGATCTCATTCTTTTTTATAGCTCCATGGTGTCTATATACCACATTTTCTTTATCTAAACTCTTATTGATGAGCATTGAGGTGGATTCTATGTCTTTGCTATTGTGCATATTGCTGCAAG
AACATTTGTGTGCATGTGTCTTTATGGTAGAATGATATATTTTCTTCTGGGTATATATGCAGTAATGCGATTGCTGGTTGGAATGGTAGTTCTGCTTTTATCTCTTTGAGGAATTGCCATG
CTGCTTTCCACAATAGTTGAACTAACTTACACTCCCACTAACAGTGTGTAAGTGTTTCCTTTTCTCCACAACCTGCCAGCATCTGTTATTTTTTGACATTTTAATAGTAGCCATTTTAACT
GGTATGAAATTATATTTCATTGTGGTTTTAATTTGCATTTCTCTAATGATCAGTGATATTGAGTTTGTTTTTTTTCACATGCTTGTTGGCTGCATGTATGTCTTCTTTTAAAAAGTGTCTGTT
CATGTACTTTGCCCACATTTTAATGGGGTTGTTTTTCTCTTGTAAATTTGTTTAAATTCCTTATAGGTGCTGGATTTTAGACATTTGTCAGACGCATAGTTTGCAAATAGTTTCTCCCATTC
TGTAGGTTGTCTGTTTATTTTGTTAATAGTTTCTTTTGCTATGCAGAAGCTCTTAATAAGTTTAATGAGATCCTGATATGTTAGGCTTTGTGTCCCCACCCAAATCTCATCTTGAATTATA
TCTCCATAATCACCACATGGAGAGACCAGGTGGAGGTAATTGAATCTGGGGGTGGTTTCACCCATGCTGTTCTTGTGATAGTGAATGAGTTCTCACGAGATCTAATGGTTTTATGAGG
GGCTCTTCCCAGCTTTGCCTGGTACTTCTCCTTCCTGCCGCTTTGTGAAAAAGGTGCATTGCGTCCCTTTCACCTTCTTCTATAATTGTAAGTTTCCTGAGGCCTTCCCAGCCATGCTGAA
CTTCAAGTCAATTAAACCTTTTTCTTTATAAATTACTCAGTCTCTGGTGGTTCTTTATAGCAGTGTGAAAATGGACTAATGAAGTTCCCATTTATGAATTTTTGCTTTTGTTGCAATTGCTT
TTGACATCTTAGTCATGAAATCCTTGCCTGTTCTAAGTACAGGACGGTATTGCCTAGGTTGTCTTCCAGGGTTTTTCTAATTTTGTGTTTTGCATTTAAGTGTTTAATCCATCTTGAGTTGA
TTTTTGTATATTGTGTAAGGAAGGGGTCCAGTTTCAATCTTTTGCATATGGCTAGTTAGTTATCCCAGTACCATTTATTGAAAAGACAGTCTTTTCCCCATCGCTCGTTTTTGTCAGTTTT
ATTGATGATCAGATAATCATAGCTGTGTGGCTTTATTTCTGGGTTCTTTATTCTGTTCTATTGGTTTATGTCCCTGTTTTTGTGCCAGTACCATGCTGTTTTGGTTAACATAGCCCTGTAGT
ATAGTTTGAGGTCAGATAGCCTGATGCTTCCAGCTTTGTTCTTTTTCTTAAGATTGCCTTGGCTATTTGGCCTCTTTTTTGGTTCCACATGAATTTTAAAACAGTTGTTTCTAGTTTTTGAA
GAATGTCATTGGTAGTTTGATAGAAATAGCATTTAATCTGTAAATTGATTTGTGCAGTATGGCCTTTTAATGATATTGATTCTTCCTATCCATGAGCATGATATGTTTTCCATTTTGTTTG
TATCCTCTCTGATTTCTTTGTGCAGTGTTTTGTAATTCTCAT TGTAGAGATTTTTCACCTCCCTGGTTAGTTGTATTTTACCCTAGATATTT TATTCTTTTTGTGAAAATTGTGAATGGGAT
TGCCTTCCTGATTTGACTGC CAGCTTGGTTACTGTTGGTTTATAGAAATGCTAGTGATTTTTGTACATTG ATTTTCTTTCTAAAACTTTGCTGAAGTTTTTTTTATTAGCAGAAGGAGCT
TTGGGGCTGAGACTATGGGGTTTTCTAGATATAGAATCATGTCAGCTTCAAATAGGGATAATTTTACTTCCTCTCTTCCTATTTGGATGCCCTTTATTTCTTTCTCTTGCCTGATTACTCTG
GCTGGGATTTCCTATGTTGAATAGGAGT CATGAGAGAGGGCATCAAATCTACACATATCAAATACTAACCTTGAATGTCTAGATATTT TATTCTTTTTGTGAAAATTGTGAATGGGAT
Book of Life:
How much data make up the human genome?

3 pallets x 40 boxes per pallet x 5000 pages per box x 5000 bases per page = 3,000,000,000 bases!
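
The arithmetic can be checked in Python. The storage estimate at the end is an addition for illustration: it assumes each base is stored in 2 bits, which is enough to tell four letters apart, and says nothing about how real genome files are stored.

```python
pallets, boxes_per_pallet, pages_per_box, bases_per_page = 3, 40, 5000, 5000

total_bases = pallets * boxes_per_pallet * pages_per_box * bases_per_page
print(f"{total_bases:,}")  # 3,000,000,000

# Assumption: 2 bits per base (A, C, G, T), 8 bits per byte.
megabytes = total_bases * 2 / 8 / 1_000_000
print(megabytes)  # 750.0
```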
Genome sequence resources

• UCSC Genome browser


– Santa Cruz
• Ensembl
– Wellcome Trust, EBI, Sanger
• NCBI Genome Database
– NIH
What does the human genome sequence tell us?

How It's Arranged

• The human genome's gene-dense "urban centers" are predominantly composed of


the DNA building blocks G and C.

• In contrast, the gene-poor "deserts" are rich in the DNA building blocks A and T.

• Genes appear to be concentrated in random areas along the genome, with vast
expanses of noncoding DNA between.

• Stretches of up to 30,000 C and G bases repeating over and over often occur
adjacent to gene-rich areas, forming a barrier between the genes and the "junk DNA."
These CpG islands are believed to help regulate gene expression.
What does the human genome sequence tell us?

Marked variation across the human genome

• Chromosome 1 has the most genes (2968), and the Y chromosome has
the fewest (231).

• Gene rich (Chromosome 19) vs. gene poor (Chromosome 13) regions

• Chromosomes 21 and 22 (the smallest) were sequenced first

• Chromosome 21 ~ 225 genes; Chromosome 22 ~ 550 genes


What does the human genome sequence tell us?

The Wheat from the Chaff

• Less than 2% of the genome codes for proteins.

• Repeated sequences that do not code for proteins ("junk DNA") make up at least
50% of the human genome.

• Repetitive sequences are thought to have no direct functions, but they shed light
on chromosome structure and dynamics. Over time, these repeats reshape the
genome by rearranging it, creating entirely new genes, and modifying and
reshuffling existing genes.

• The human genome has a much greater portion (50%) of repeat sequences than
the mustard weed (11%), the worm (7%), and the fly (3%).

• Hundreds of genes appear to have come from bacteria.


Comparison of Humans with Other Organisms
Genome sizes and number of genes in organisms
What does the human genome sequence tell us?

Comparison of Humans with Other Organisms

• Unlike the human's seemingly random distribution of gene-rich areas,


many other organisms' genomes are more uniform, with genes evenly
spaced throughout.

• Humans have on average three times as many kinds of proteins as the fly
or worm because of mRNA transcript "alternative splicing" and chemical
modifications to the proteins. This process can yield different protein
products from the same gene.
What does the human genome sequence tell us?

Comparison of Humans with Other Organisms

• Humans share most of the same protein families with worms, flies, and
plants; but the number of gene family members has expanded in
humans, especially in proteins involved in development and immunity.

• Although humans appear to have stopped accumulating repeated DNA


over 5 million years ago, there seems to be no such decline in rodents.
This may account for some of the fundamental differences between
hominids and rodents, although gene estimates are similar in these
species. Scientists have proposed many theories to explain evolutionary
contrasts between humans and other organisms, including those of life
span, litter sizes, inbreeding, and genetic drift.
What does the human genome sequence tell us?

Variations and Mutations

• Scientists have identified more than 10 million locations where single-


base DNA differences (SNPs) occur in humans. This information promises
to revolutionize the processes of finding chromosomal locations for
disease-associated sequences and tracing human history.

• The ratio of germline (sperm or egg cell) mutations is 2:1 in males vs.
females. Researchers point to several reasons for the higher mutation rate
in the male germline, including the greater number of cell divisions
required for sperm formation than for eggs.
Genome Analysis
Genome sizes and number of genes

Organism                              Genome Size (Bases)   Estimated Genes
Human (Homo sapiens)                  3 billion             32,000
Laboratory mouse (M. musculus)        2.6 billion           32,000
Mustard weed (A. thaliana)            100 million           26,000
Roundworm (C. elegans)                97 million            19,000
Fruit fly (D. melanogaster)           137 million           13,000
Yeast (S. cerevisiae)                 12.1 million          6,000
Bacterium (E. coli)                   4.6 million           3,200
Human immunodeficiency virus (HIV)    9700                  9
How Can You Sequence the Genome of a
New Organism?
Single nucleotide polymorphisms (SNPs)

• Sites in the genome where the DNA sequences of


many individuals differ by a single base are called
single nucleotide polymorphisms (SNPs).

… T A G C …

… T G G C …

• Some people may have a chromosome with an A at a


particular site where others have a G. Each form is called
an allele.
Genotype at a SNP

• The set of alleles that a person has is called a


genotype.
• For this SNP a person could have the genotype AA,
AG, or GG.

… T A G C …

… T G G C …
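
Because each person carries two copies of the chromosome, the possible genotypes are the unordered pairs of alleles. A minimal Python sketch, with a made-up function name, lists them for the A/G SNP shown here:

```python
from itertools import combinations_with_replacement

def genotypes(alleles):
    """All unordered pairs of alleles a person could carry at one SNP."""
    return ["".join(pair) for pair in combinations_with_replacement(sorted(alleles), 2)]

print(genotypes("AG"))  # ['AA', 'AG', 'GG']
```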
Genotype at a SNP
Occurrence of SNPs across the human genome

More than 10 million SNPs have been identified. It is estimated that about 30 million SNPs exist in human populations.

~10 million common SNPs (minor allele frequency, MAF, > 1-5%): about 1 per 300 bp


Haplotype

• Alleles of SNPs that are close together tend to be


inherited together.

• A set of associated SNP alleles in a region of a


chromosome is called a "haplotype".
Haplotype

• Most chromosome regions have only a few common


haplotypes (each with a frequency of at least 5%), which
account for most of the variation from person to person
in a population.

• A chromosome region may contain many SNPs, but only


a few "tag" SNPs can provide most of the information on
the pattern of genetic variation in the region.
Haplotypes (example)
A .. C .. A .. T .. G .. T ..
A .. C .. C .. G .. C .. T ..
G .. T .. C .. G .. G .. A ..

A chromosome region with only the SNPs shown. Three haplotypes are shown. The two SNPs in color are sufficient to identify (tag) each of the three haplotypes. For example, if a chromosome has alleles A and T at these two tag SNPs, then it has the first haplotype.
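
The idea of a tag SNP can be made concrete with a brute-force search over the three example haplotypes. The Python sketch below looks for the smallest sets of SNP positions whose alleles differ between all haplotypes. It is a toy search for illustration, not the method used by the HapMap Project, and the function names are made up. Positions are counted from 0.

```python
from itertools import combinations

# The three haplotypes of the example, SNP alleles only.
haplotypes = ["ACATGT", "ACCGCT", "GTCGGA"]

def tags(haplotypes, columns):
    """True if the alleles at `columns` differ for every pair of haplotypes."""
    signatures = {tuple(h[c] for c in columns) for h in haplotypes}
    return len(signatures) == len(haplotypes)

def smallest_tag_sets(haplotypes):
    """All position sets of the smallest size that tag every haplotype."""
    n = len(haplotypes[0])
    for size in range(1, n + 1):
        found = [cols for cols in combinations(range(n), size) if tags(haplotypes, cols)]
        if found:
            return found
    return []

tag_sets = smallest_tag_sets(haplotypes)
print(len(tag_sets[0]))    # 2: no single SNP is enough, but pairs are
print((0, 3) in tag_sets)  # True: the 1st and 4th SNPs read A,T / A,G / G,G
```

Several different pairs work for this small example, so the choice of tag SNPs is not unique.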
Future Challenges: Specific Research Areas

• Gene number, exact locations, and functions

• Gene regulation

• Noncoding DNA types, amount, distribution and functions

• Coordination of gene expression

• Predicted vs. experimentally determined gene function

• Evolutionary conservation among organisms


Future Challenges: Specific Research Areas

• Correlation of SNPs and Haplotypes with health and disease

• Disease-susceptibility prediction based on gene sequence variation

• Genes involved in complex traits and multigene diseases

• Complex systems biology including microbial consortia useful for


environmental restoration

• Developmental genetics
International HapMap project
The International HapMap Project is a multi-country effort
to identify and catalog genetic similarities and differences
in human beings.

Goal of this project: To develop a resource to facilitate future


studies that relate human genetic variation to health and disease. By
making this information freely available, the Project will help biomedical
researchers find genes involved in disease and responses to
therapeutic drugs.

The HapMap (=Haplotype Map of the human genome) will


facilitate comparisons among:
• Individuals
• Groups
Genetic variation and use of the HapMap

• Although any two unrelated people are the same at about


99.9% of their DNA sequences, the remaining 0.1% is
important because it contains the genetic variants that
influence how people differ in their risk of disease or their
response to drugs.

• Discovering the DNA sequence variants that contribute to


common disease risk offers one of the best opportunities
for understanding the complex causes of disease in
humans.
Populations Included

• Yoruba (Ibadan, Nigeria)


30 parent-child trios

• Han Chinese (Beijing, China)


45 unrelated individuals

• Japanese (Tokyo, Japan)


45 unrelated individuals

• CEPH (Centre d'Etude du Polymorphisme Humain: Utah residents with Northern & Western European ancestry)
30 parent-child trios
Why is it necessary to take samples from different populations?
Some Facts About Populations

• Any one population includes about 90% of the genetic


variation that exists throughout the world. Therefore, the
most common haplotypes are expected to be found in
all human populations.

• Thus, the HapMap could have been developed with


samples from any one population.
But…………

• The frequencies of particular Haplotypes often


differ among populations.

• Population differences in haplotype frequencies


are important for discovering genes that
influence health and disease.
So…………
• Studying samples from several populations with
different ancestral geographies will make the
HapMap more useful for future studies in
multiple populations (both those sampled and
not sampled).

• A grid sampling strategy would have ignored


population structure, making the HapMap less
useful.
Why These Populations?
Why These Populations?

• Scientific reasons:
* Recommendation to include samples from at least 3 Old World
continents
* Pilot data showing range of haplotype frequencies

• Ethical reasons:
* No small, isolated populations
* Inclusiveness
(find some less common variation)

• Practical reasons:
* Established relationships with communities
* Funding agency interest
Time lines of the International HapMap project

• Officially started in October 2002 and was expected to take about 3 years.

• Comprised three phases: the complete data set obtained in Phase I was published in October 2005; the Phase II dataset was published in October 2007; and the Phase III dataset was first released in 2009.

2010-05-28: HapMap3 Public Release #3

Data analysis continues.

• All data are available at http://hapmap.ncbi.nlm.nih.gov/


International HapMap Project Papers

The International HapMap Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52-58. 2010.

The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851-861. 2007.

The International HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437, 1299-1320. 2005.

The International HapMap Consortium. The International HapMap Project. Nature 426, 789-796. 2003.

The International HapMap Consortium. Integrating Ethics and Science in the International HapMap Project. Nature Reviews Genetics 5, 467-475. 2004.

Thorisson, G.A., Smith, A.V., Krishnan, L., and Stein, L.D. The International HapMap Project Web site. Genome Research 15, 1591-1593. 2005.
