T2 Syllabus Revision Class
• DNA Structure
• Replication
• Repeats in DNA
• Sequencing processes
• Sequencing technologies
Molecular basis of DNA Structure
• Polynucleotide chain – sugar-phosphate backbone
with nitrogenous bases attached to it.
• Phosphodiester Bond – the backbone of DNA
• Nucleotide has three elements – phosphate,
pentose sugar, nitrogenous base
• Pairing between nitrogenous bases is not covalent – bases pair through hydrogen bonds
• Base stacking – allows millions of base pairs to lie one
above the other
Chargaff Rule
• Erwin Chargaff: “A pairs with T & C pairs with G”
• C – G pairing is stronger than A – T (the amount of heat
required to separate the DNA strands increases with
increasing G + C)
• Base composition – G + C as a % of total bases;
differs among species but is constant in all cells of an
organism and within a species.
Human – A 29.8, T 31.8, G 20.2, C 18.2, G + C 38.4 (% of total bases)
• A + G = C + T, that is, purines = pyrimidines
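The base-composition figures above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function name `gc_percent` and the test sequence are assumptions, while the human percentages are the ones quoted on this slide.

```python
import math

def gc_percent(seq: str) -> float:
    """Return G + C as a percentage of all bases in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)

# Base composition (% of total bases) quoted above for human DNA.
human = {"A": 29.8, "T": 31.8, "G": 20.2, "C": 18.2}

gc_human = human["G"] + human["C"]        # G + C content
purines = human["A"] + human["G"]         # A + G
pyrimidines = human["C"] + human["T"]     # C + T

print(round(gc_human, 1))                   # 38.4
print(math.isclose(purines, pyrimidines))   # True: purines = pyrimidines
print(round(gc_percent("ATGCGC"), 1))       # G + C % of a made-up sequence
```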
Watson Crick Model
The Double Helix Structure
• Right hand twisted helix
• Ten base pairs in each helical turn
• Bases are spaced at 3.4 Å
• Each turn measures 34 Å
• Bases are perpendicular to the sugar
phosphate backbone but stacked parallel to
each other
• Grooves – the major & the minor groove
provide binding sites for proteins.
• Read in the 5’ to 3’ direction.
These labels indicate the free carbon on the sugar of the
sugar-phosphate backbone (the 5’ end has a terminal
phosphate group & the 3’ end has a free OH)
• DNA length is measured in base pair (bp) units (1 kb =
1000 bp)
• Sugars lie above and below the plane containing the
base pair
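The helix dimensions listed above (3.4 Å per base pair, ten base pairs per turn, 1 kb = 1000 bp) turn into simple unit conversions. The following is a minimal sketch; the constant and function names are illustrative, not part of the slides.

```python
RISE_PER_BP_ANGSTROM = 3.4         # spacing between adjacent bases
BP_PER_TURN = 10                   # base pairs per helical turn
ANGSTROM_PER_MICROMETRE = 10_000   # 1 μm = 10^4 Å

def helix_length_angstrom(n_bp: int) -> float:
    """Length of a double helix of n_bp base pairs, in Å."""
    return n_bp * RISE_PER_BP_ANGSTROM

def helix_turns(n_bp: int) -> float:
    """Number of helical turns in n_bp base pairs."""
    return n_bp / BP_PER_TURN

def helix_length_um(n_bp: int) -> float:
    """Length of a double helix of n_bp base pairs, in μm."""
    return helix_length_angstrom(n_bp) / ANGSTROM_PER_MICROMETRE

n_bp = 1000  # 1 kb
print(helix_turns(n_bp))                       # 100.0 turns
print(round(helix_length_angstrom(n_bp), 1))   # 3400.0 Å
print(round(helix_length_um(n_bp), 2))         # 0.34 μm
```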
Organization of DNA
• DNA strand is longer than
the nucleus
• Smallest DNA is
14000 μm.
• Average size of nucleus
6 μm
• DNA is packed into
chromosomes
• Packing ratio – Length
of DNA/length of
Chromosome
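The packing ratio is a single division. In the sketch below the DNA length is the 14000 μm quoted above, but the chromosome length of 2 μm is an assumed, purely illustrative value that does not come from the slides.

```python
def packing_ratio(dna_length_um: float, chromosome_length_um: float) -> float:
    """Packing ratio = length of DNA / length of chromosome (same units)."""
    if chromosome_length_um <= 0:
        raise ValueError("chromosome length must be positive")
    return dna_length_um / chromosome_length_um

dna_um = 14000.0        # smallest DNA length quoted above
chromosome_um = 2.0     # ASSUMPTION: illustrative condensed chromosome length

print(packing_ratio(dna_um, chromosome_um))   # 7000.0
```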
Review….
• Genome is the complete DNA sequence of one set of
chromosomes
• Contains both coding and the non-coding sequence of
DNA
• DNA is a polymer called a polynucleotide
• Nucleotide unit is made of – Phosphate group, pentose
sugar (deoxyribose) and nitrogenous base
• The phosphate can attach to the sugar at 3’ or 5’
position
• Nitrogenous base always at 1’ position
• Two polynucleotide chains make a DNA molecule (double helix
structure)
• The two chains are antiparallel – the 5’ end of one chain pairs
with the 3’ end of the other
• Chargaff rule – A pairs with T & G pairs with C
• Base composition – % G + C in a genome Fixed for a
species
• Base stacking – allows millions of base pairs to lie one
above the other
• DNA Length measured in base pair (BP) units (1kb =
1000 bp)
• Both forward and reverse strands are read in the 5’ to 3’ direction
• DNA is a very dynamic molecule
• Satisfies the criteria for genetic material - makes a
copy of itself, codes for life, allows for changes
• DNA is packed into chromosomes
• Packing ratio – Length of DNA/length of Chromosome
• Replication – biological process by which DNA
makes a copy of itself
• Each strand acts as a template
• Can also be performed in vitro
Essentials for replication
• A parent strand as template
• Nucleotides containing bases adenine, guanine,
cytosine & thymine
• RNA primer – oligonucleotide containing up to
30 nucleotides
• In vitro synthesis - DNA primer is used
• DNA polymerase
• Some proteins and enzymes like helicase, ligase
Replication
• Biological process of
producing two identical
replicas of DNA from one
original molecule.
• DNA makes a copy of itself
• Each strand acts as a
template
• Can also be performed in
vitro
• A double-stranded molecule gets converted into
two identical double-stranded DNA molecules
Primer
Replication begins at the 3’ end of each template strand
Bases are added one at a time
Process continues till the strand is completed
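The template-directed synthesis described above can be sketched as a loop that walks the template from its 3’ end and adds one complementary base at a time, so the new strand grows 5’ to 3’. The function name `synthesize_new_strand` and the example template are illustrative assumptions, and the sketch ignores primers and enzymes.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def synthesize_new_strand(template_5to3: str) -> str:
    """Return the newly synthesized strand, written 5' to 3'."""
    new_strand = []
    # Synthesis starts at the 3' end of the template, so walk it backwards.
    for base in reversed(template_5to3.upper()):
        new_strand.append(COMPLEMENT[base])   # one base added at a time
    return "".join(new_strand)

template = "ATGCC"                       # hypothetical template, 5' to 3'
print(synthesize_new_strand(template))   # GGCAT
```

Copying the new strand again returns the original template sequence, which is why each strand can act as a template.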
What does a genome look like?
Coding Regions
• Called Genes
• Roughly 20K in number
• Make up 5% of the total genome
• Eukaryotic genes contain interspersed non-coding
sequences – introns
Repeated Sequences
• Function largely unknown or poorly
understood – Labeled as Junk DNA
• Repeats can be tandem or interspersed
Tandem Repeats
• Read length up to 500 bp
• Low throughput
Illumina Sequencing (II Gen)
• Fast, high throughput, parallel sequencing.
• Library preparation done by fragmenting DNA
(tagmentation)
• Single stranded templates (fragments)
attached to flow cell
• Cluster formation by amplification on flow cell.
• Sequencing
Sequencing by Synthesis
• Requires DNA fragments,
polymerase, and bases carrying
a terminator that can
fluoresce
• Bases are sequenced
one at a time
• The light emitted is
imaged and read as a base.
Sequencing Read Options
• Two types: single-read and paired-end
sequencing
• Single-read sequencing reads DNA fragments
from one end to the other
• In paired-end sequencing, after a DNA
fragment is read from one end, the process
starts again in the other direction. Two reads
generated. Common method.
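The two read options can be illustrated on a short fragment. This is a minimal sketch under simplifying assumptions: the fragment, the read length, and the function names are made up, and the second read of a pair is modelled as the start of the reverse complement of the fragment.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def single_read(fragment: str, read_len: int) -> str:
    """Read the fragment from one end only."""
    return fragment[:read_len]

def paired_end_reads(fragment: str, read_len: int) -> tuple[str, str]:
    """Read the fragment from one end, then from the other end."""
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment)[:read_len]
    return read1, read2

fragment = "ATGCGTACCTGA"    # hypothetical 12 bp fragment
print(single_read(fragment, 4))        # ATGC
print(paired_end_reads(fragment, 4))   # ('ATGC', 'TCAG')
```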
Nanopore Sequencing (III Gen)
• The ionic current through the pore
changes when a strand of nucleic acid
passes through it.
• The flow of ion current depends on the shape
of the molecule translocating through the
pore. Since nucleotides have different shapes,
each nucleotide is recognized by its effect on
the change of the ionic current
• Sample preparation is minimal, less
time consuming.
• Long read lengths.
• No amplification or ligation steps
required before sequencing.
• Challenge is to optimize the speed
of DNA translocation through the
nanopore to improve measurement
accuracy and reduce the
high error rates of base
calling
Base Calling
• Base calling is the process of assigning bases along
with the confidence level for each base.
• Computer programs: Phred (base calling), DNA Baser
Assembler.
• Shows high base calling accuracy.
• Each cycle generates 320,000 images (.TIFF).
• Each image is approx. 7 MB, giving
2.24 x 10^6 MB (2.24 TB) of total data.
• Sequencing a 100 bp fragment requires 100 cycles
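The data-volume figure above is just the product of the image count and the image size. The sketch below reproduces that arithmetic, assuming decimal units (1 TB = 10^6 MB); the variable and function names are illustrative.

```python
images = 320_000        # images quoted above
mb_per_image = 7        # approximate size of one image in MB
MB_PER_TB = 1_000_000   # ASSUMPTION: decimal units, 1 TB = 10^6 MB

total_mb = images * mb_per_image
total_tb = total_mb / MB_PER_TB

def cycles_needed(read_length_bp: int) -> int:
    """One sequencing cycle reads one base, so cycles = read length."""
    return read_length_bp

print(total_mb)             # 2240000
print(total_tb)             # 2.24
print(cycles_needed(100))   # 100
```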
Base quality
• Software used to read the sequence is called a base caller,
e.g. Phred
• The ambiguity for each base is reported as base quality.
• Base quality (Q) is derived from the base caller’s estimate
of the “probability that the base was called incorrectly”.
• Q = -10 log10(p), where Q is the base quality and p is the
probability that the called base is incorrect.
• p is the ratio of incorrect base calls over the entire
cluster in terms of light intensity.
• Phred Quality Score “Q” is an easy representation of
base quality
• Each character in the base quality line corresponds to the
character at the same position in the sequence line of a typical
FASTQ file.
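The relation Q = -10 log10(p) can be evaluated in both directions. The following is a small sketch with illustrative function names; it only applies the formula and is not how any particular base caller estimates p.

```python
import math

def phred_quality(p_error: float) -> float:
    """Q = -10 * log10(p), where p is the probability of a wrong base call."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be in (0, 1]")
    return -10 * math.log10(p_error)

def error_probability(q: float) -> float:
    """Inverse relation: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# One wrong call in 10, 100, 1000 bases.
for p in (0.1, 0.01, 0.001):
    print(p, round(phred_quality(p)))
```

A higher Q therefore means a lower chance that the base is wrong: each step of 10 in Q corresponds to a tenfold drop in p.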
Base Quality Representation
• Base quality is an ASCII-encoded version of
Q = -10 log10(p)
• Method of conversion is called Phred+33.
• Base quality is rounded to the nearest integer and
then 33 is added
• The integer is then mapped to an ASCII character.
• ASCII (American Standard Code for Information Interchange)
is the character encoding standard for electronic
communication
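The Phred+33 conversion described above (round Q, add 33, map to an ASCII character) can be sketched as follows. The function names and the sample read are hypothetical, and rounding uses Python's built-in `round()`.

```python
def encode_phred33(qualities: list[float]) -> str:
    """Round each Q, add 33, and map the integer to its ASCII character."""
    return "".join(chr(round(q) + 33) for q in qualities)

def decode_phred33(quality_line: str) -> list[int]:
    """Recover integer Q values from a FASTQ quality line."""
    return [ord(ch) - 33 for ch in quality_line]

qualities = [0, 10, 20, 30, 40]
line = encode_phred33(qualities)
print(line)                   # !+5?I
print(decode_phred33(line))   # [0, 10, 20, 30, 40]

# Hypothetical read: the quality line has one character per base.
sequence = "ACGTA"
assert len(sequence) == len(line)
```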
Assembly of Reads
• Billions of raw reads generated by sequencing
machines create terabytes of data.
• The processing and assembling of sequences is
done by assemblers like Celera Assembler,
Arachne etc.
• Sequence assembly is an algorithm-driven
automated process.
• The process uses overlaps between sequence reads
to place them in the correct order
Assembly approaches
• De novo approach – the sequence has to be
assembled entirely from scratch.
• Mapping – map the fragments to a previously
sequenced genome
Type of sequencing
• Mapping
Reads generated can be mapped to the reference
genome if the genome has been sequenced previously.
• De Novo assembly
Sequencing a novel genome where there is no
reference sequence available for alignment
Analysis by Mapping
• Reads from the experiment are mapped to the best-fit region
in the reference sequence.
• For complex genomes, multiple locations for a
sequence are possible
• BLAST (Basic Local Alignment Search Tool)
is used in modern genome analysis
• It compares nucleotide or protein sequences
to sequence databases and calculates the
statistical significance of matches.
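The idea of mapping a read to a reference, including the case where a read fits more than one location, can be shown with exact string matching. This is a deliberately naive sketch: the reference and reads are made up, and real mappers and BLAST tolerate mismatches and use indexes and scoring statistics that are not modelled here.

```python
def map_read(read: str, reference: str) -> list[int]:
    """Return every 0-based position where the read matches exactly."""
    if not read:
        raise ValueError("empty read")
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)   # allow overlapping hits
    return positions

reference = "ACGTACGTTAGC"              # hypothetical reference sequence
print(map_read("ACGT", reference))      # [0, 4]  multiple locations
print(map_read("TAGC", reference))      # [8]     unique location
print(map_read("GGGG", reference))      # []      unmapped read
```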
de novo Assembly
• Sequence reads are assembled as contigs.
• Contiguous long sequences are obtained from
reads having common sequences
• The coverage quality of de novo sequence
data depends on the size and continuity of the
contigs (i.e., the number of gaps in the data).
• de novo assemblers include greedy-algorithm
assemblers (SEQAID, CAP) and de
Bruijn graph assemblers (ABySS, SPAdes)
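The greedy idea of building contigs from reads with common sequences can be sketched as repeatedly merging the two sequences with the longest suffix-prefix overlap. This is a toy illustration only: it is not the algorithm used by SEQAID, CAP, or any other named assembler, and the reads, function names, and `min_len` threshold are assumptions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))   # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs                 # remaining contigs share no overlap
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

reads = ["ATGCGT", "GCGTAC", "TACCTG"]     # hypothetical overlapping reads
print(greedy_assemble(reads))              # ['ATGCGTACCTG']
```

Reads that share no overlap stay as separate contigs, which corresponds to gaps in the assembled data.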