Sajeev Sequencing

Uploaded by

lucylit0666
DNA Sequencing

SAJEEV RAJ S
MSc P Batch
INTRODUCTION
• The information content of DNA is encoded in four bases (A, G, C and T).
Determining the order of these bases in a given DNA molecule is called DNA
sequencing.

• DNA fragments can be analyzed to determine the nucleotide sequence of DNA.
Why do we need to know the sequence of a DNA molecule?

• To characterize newly cloned DNA.
• To make predictions about its function.
• To facilitate manipulation of the molecule.
• To confirm the identity of a clone or a mutation.
• To check the fidelity of a newly created mutation.
• As a screening tool to identify polymorphisms and mutations in genes of
particular interest.
• To confirm the product of a PCR.
HISTORY
• 1972: RNA sequencing of bacteriophage MS2 by Walter Fiers.
• 1977: DNA sequencing by the chain termination method (Sanger) and the
chemical degradation method (Allan Maxam and Walter Gilbert); first DNA genome
sequenced, that of bacteriophage ΦX174.
• 1986: Leroy Hood and Lloyd Smith introduced semiautomated sequencing.
• 1987: Applied Biosystems marketed fully automated sequencing machines.
• 1995: Craig Venter and Hamilton Smith published the complete genome sequence
of Haemophilus influenzae.
• 2003: Human Genome Project.
STEPS OF DNA SEQUENCING
1. Isolation of DNA
2. Fragmentation
3. Library preparation
4. Amplification
5. Sequencing
6. Data analysis
Isolation of DNA
• Cell lysis (chemical, mechanical, or enzymatic)
• DNA stabilization
• Protein removal
• DNA precipitation
Fragmentation
• After isolation, the next step is fragmentation of the DNA molecule.

• DNA can be fragmented into smaller pieces by methods such as sonication or
digestion with restriction enzymes.

• SONICATION: High-frequency sound waves are used to randomly fragment DNA.
This method is versatile and allows control over fragment size by adjusting
sonication parameters such as amplitude and duration.
• RESTRICTION ENZYMES: These are endonucleases that recognize specific DNA
sequences (restriction sites) and cleave the DNA at or near these sites. The
resulting fragments have defined ends dictated by the recognition sequences.
This method is useful for creating DNA fragments with known ends, which can be
important for cloning and other applications.
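The restriction-enzyme idea can be illustrated with a minimal in silico sketch. It assumes the EcoRI recognition site GAATTC with the cut after the first G, looks at one strand only, and uses a made-up example sequence; the function name `digest` is hypothetical.

```python
def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Split `sequence` at every occurrence of `site`.

    The cut is placed `cut_offset` bases into the site (1 for EcoRI, G^AATTC).
    Only one strand is modelled, so sticky ends are not represented.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # remainder after the last cut
    return fragments


example = "AAGAATTCTTGAATTCGG"  # made-up sequence containing two sites
print(digest(example))
```

Every internal fragment starts with AATTC and ends with G, which is what "defined ends" means here.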
Library preparation
• After fragmentation, the next step is library preparation of the DNA to be
amplified.

• The DNA fragments to be sequenced are joined to adapters of known sequence,
which may be complementary to the primers added in later steps.
Amplification
• Amplification is then carried out by polymerase chain reaction (PCR).

• Two main amplification methods are used in NGS techniques: emulsion PCR and
bridge amplification.

• EMULSION PCR: Emulsion PCR is a specialized form of PCR that is commonly used
in next-generation sequencing (NGS) technologies. Each individual DNA fragment
is attached to a primer on a bead, and each bead is enclosed in a water-in-oil
droplet.
Bridge Amplification
• To initiate bridge amplification, primers complementary to the adapter
sequences on the DNA fragments are present on the flow cell surface.
• dNTPs and polymerase enzymes are also added.

• First, the DNA fragment forms a bridge-like structure: one adapter end
attaches to a fixed primer and the other end to the nearest primer.

• DNA polymerase then extends the primer by adding nucleotides along the
ssDNA, forming a dsDNA molecule.
• The flow cell is then subjected to conditions that denature the newly
synthesized double-stranded DNA, separating it into two single strands.
• Each single strand remains attached at one end to the flow cell surface via
its adapter sequence.

• The original template strand, which is still attached, serves as a template
for a new round of primer binding and extension.

• Through successive cycles of synthesis and denaturation, clusters of
identical DNA molecules are generated. Each cluster typically contains hundreds
to thousands of identical DNA fragments in close proximity on the flow cell
surface.
Sequencing

1. Sequencing by Synthesis
2. Sequencing by Ligation
Sequencing by Synthesis
• In NGS, sequencing by synthesis is used in 454 sequencing, Ion Torrent
sequencing and Illumina sequencing.

• In 454 sequencing (pyrosequencing), incorporation is detected as light
produced when pyrophosphate (PPi) is released; in Ion Torrent sequencing, it is
detected through the H+ ions released.
• These two methods first use emulsion PCR to build the colonies required for
sequencing, and the complementary strand is then removed.

• A ssDNA sequencing primer then hybridizes to the end of the strand (the
primer-binding region).

• The four different dNTPs are then flowed sequentially in and out of the
wells over the colonies.

• When the correct dNTP arrives, it is enzymatically incorporated into the
strand.
• In 454 sequencing, base calling and sequence reading are done as follows:
• The four reagent combinations are placed in four different tubes.

• The DNA strand to be sequenced is then added to these tubes.

• In Ion Torrent sequencing, base calling is done as follows:
• Numerous wells are arranged on a semiconductor chip with an underlying
sensor.

• Each nucleotide is added to the wells in a fixed order, for a span of
4-10 s each.

• Unbound nucleotides are then washed away. If a nucleotide has been
incorporated, the underlying sensor detects a signal, which is recorded by the
pH meter connected to it.

• The size of the fluctuation in the pH reading corresponds to the number of
nucleotides added to the DNA strand in a single flow.
• These fluctuations are plotted as a graph, and software is then used to
retrieve the bases of each strand and assign a Phred quality score to each base
according to the intensity measured during synthesis.

• Another piece of software then converts this BCL file into a FASTQ file for
further analysis.
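The link between signal size and the number of nucleotides added in one flow can be sketched as follows. This is an idealised illustration only: the flow order "TACG", the normalised signal values and the function name `call_bases` are assumptions, and real base callers correct for noise and phasing.

```python
def call_bases(flow_signals, flow_order="TACG"):
    """Turn per-flow signal intensities into a base sequence.

    Each flow offers one nucleotide (cycling through `flow_order`).
    A signal near 0 means no incorporation, near 1 one base, near 2 a
    homopolymer of two bases, and so on.
    """
    called = []
    for i, signal in enumerate(flow_signals):
        nucleotide = flow_order[i % len(flow_order)]
        count = max(round(signal), 0)  # nearest whole number of incorporations
        called.append(nucleotide * count)
    return "".join(called)


# Hypothetical normalised signals for four consecutive flows (T, A, C, G).
print(call_bases([1.05, 0.02, 2.10, 0.90]))
```

The called sequence is the newly synthesized strand, which is complementary to the template in the well.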
• Illumina sequencing uses a different base-calling process.

• Here the strands to be sequenced are prepared by bridge amplification on the
surface of the flow cell, where each fragment forms a cluster of strands.

• On each strand, the sequence is synthesized starting from the primer.

• Both forward and reverse sequencing are performed.

• It can run about 12 samples in a lane and 96 samples in a flow cell, with
insert sizes of 200-800 bp.
BCL FILE
• Binary Base Call (BCL) files are the raw data files generated by Illumina
sequencers.

• A BCL file is a binary file containing the base calls and quality scores for
each tile in each cycle.
• A quality score is assigned to each base according to the detected
fluorescence intensity.
Sequencing Analysis Viewer (SAV)

• SAV software can be used to review the overall run quality.

• The software is used to check that the sequences have a quality score of 30
or above.
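What a quality score of 30 means can be worked out from the standard Phred definition, Q = -10 · log10(P), where P is the probability that the base call is wrong. The helper names below are illustrative.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score `q` is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability `p`."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

Q30 therefore corresponds to an error probability of 1 in 1000, that is, 99.9% base-call accuracy.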
SAV Application
Q Score Distribution
➔ It shows plots of the number of reads by quality score.
➔ You can select the displayed lane, surface, read, and cycle through the
drop-down lists.
Cutoff slider
➔ The cutoff slider lets you determine how many bases have a given minimum
Q-score or higher. Grab the slider with your mouse pointer and drop it at the
minimum Q-score. The SAV software then calculates how many bases have that
Q-score or higher.
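The calculation behind the cutoff slider can be sketched in a few lines. This is not SAV's own code, only an illustration of counting bases at or above a chosen Q-score; the function name and the example scores are made up.

```python
def bases_at_or_above(quality_scores, cutoff):
    """Return (count, fraction) of bases whose Q-score is >= cutoff."""
    if not quality_scores:
        return 0, 0.0
    passing = sum(1 for q in quality_scores if q >= cutoff)
    return passing, passing / len(quality_scores)


scores = [38, 35, 30, 22, 12, 31]  # hypothetical per-base Q-scores
count, fraction = bases_at_or_above(scores, cutoff=30)
print(f"{count} bases ({fraction:.1%}) have Q >= 30")
```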
BCL Convert
• A standalone software application from Illumina that demultiplexes data and converts BCL
files to standard FASTQ file formats.

bcl2fastq
• Illumina's conversion software that demultiplexes sequencing data and converts BCL files into
FASTQ files. You can use bcl2fastq v1.8.4 for Illumina sequencing systems.

• FASTQ is a text-based sequence file format, generated from the BCL file, that
stores both raw sequence data and quality scores.
Command-line options (BCL Convert):
• --bcl-input-directory: The path to the input directory containing BCL files
• --output-directory: The path to an output directory for newly created FASTQ files
• --sample-sheet: The path to a CSV file containing sample information
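How a FASTQ record stores sequence and quality together can be shown with a small parser. The sketch assumes the common four-line record layout and Phred+33 quality encoding (each quality character's ASCII code minus 33); the record shown is invented.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (read_id, sequence, scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    scores = [ord(ch) - 33 for ch in quality]  # Phred+33 decoding (assumed)
    return header[1:], sequence, scores


record = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]  # made-up record
print(parse_fastq_record(record))
```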
Sequencing by Ligation
• SOLiD uses the sequencing-by-ligation method, in which the enzyme DNA ligase
performs the ligation.

• After emulsion PCR, the beads are deposited onto a glass surface. A high
density of beads can be achieved, which increases the throughput of the
technique.

• Once the beads are deposited, a primer of length N is hybridized to the
adapter.

• The beads are then exposed simultaneously to a library of 8-mer probes
covering 16 different combinations of the first two bases.
• In each 8-mer probe, the first 2 bases are complementary to the template
nucleotides, bases 3-5 are degenerate (able to pair with any of the 4 bases),
and bases 6-8 are inosine bases carrying the fluorescent dye.
• DNA ligase then joins the complementary probe adjacent to the primer.

• A phosphorothioate linkage between bases 5 and 6 allows the fluorescent dye
to be cleaved from the fragment using silver ions.

• This cleavage allows fluorescence to be measured (four different fluorescent
dyes are used, each with a different emission spectrum) and also generates a
5’-phosphate group that can undergo further ligation.
• After the first round of sequencing is completed, the extension product is
melted off and a second round of sequencing is performed with a primer of
length N−1.

• Up to 5 rounds of sequencing, using a shorter primer each time (N−2, N−3 and
N−4 in the later rounds) and measuring the fluorescence, ensure that the whole
target is sequenced.

• Because of the two-base encoding, each base is effectively sequenced twice,
which gives an accuracy of 99.999%; the method is also inexpensive.

• The main disadvantage is that it produces only short reads, which makes it
unsuitable for many applications.
• The results of this method are interpreted using a unique color matrix.
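The color matrix can be sketched with the usual two-base encoding, in which each of the 16 dinucleotides maps to one of four colors. Below, the colors are numbered 0-3 and computed as the XOR of 2-bit base codes (A=0, C=1, G=2, T=3); which dye corresponds to which number is not specified here, and the function names are illustrative.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(sequence: str) -> list[int]:
    """Color (0-3) for each overlapping pair of adjacent bases."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base: str, colors: list[int]) -> str:
    """Rebuild the base sequence from a known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


colors = encode_colors("ATGGC")
print(colors, decode_colors("A", colors))
```

Because every base contributes to two adjacent colors, a true single-base change alters two consecutive colors, whereas an isolated color change is more likely a measurement error.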
IMAGE CALLING
Data Analysis
• After the sequencing process, on the different platforms with their
different sequencing methods, we get a BCL file.

• The BCL file is then converted into a FASTQ file with its corresponding
quality scores, to simplify the later data analysis steps.
Quality control and read filtration:
➢ The raw reads pass through a quality control check using FastQC, which
produces a detailed report on the number, quality, and coverage of reads.
➢ These methods mostly rely on sequence analysis techniques, such as
clustering short reads to calculate their frequency and quality scores.
➢ Read filtration involves removing low-confidence and erroneous reads from
the dataset.

➢ Read filtration also includes clipping adapters and low-quality bases from
the 3’ and 5’ ends using software such as Cutadapt, Trimmomatic and others.
java -jar trimmomatic-0.30.jar PE s_1_1_sequence.txt.gz s_1_2_sequence.txt.gz \
    lane1_forward_paired.fq.gz lane1_forward_unpaired.fq.gz \
    lane1_reverse_paired.fq.gz lane1_reverse_unpaired.fq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 \
    SLIDINGWINDOW:4:15 MINLEN:36

❏ Remove Illumina adapters listed in the TruSeq3-PE.fa file (provided with Trimmomatic).
❏ Remove leading low quality or N bases (below quality 3)
❏ Remove trailing low quality or N bases (below quality 3)
❏ Scan the read with a 4-base wide sliding window, cutting when the average
quality per base drops below 15
❏ Drop reads which are less than 36 bases long after these steps
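The sliding-window and minimum-length steps can be sketched as below. This is a simplified illustration of the idea and not Trimmomatic's actual implementation: the read is cut at the start of the first window whose average quality falls below the threshold. Function names and the example read are made up.

```python
def sliding_window_trim(sequence, qualities, window=4, min_avg=15):
    """Scan from the 5' end; cut where a window's mean quality drops below min_avg."""
    for start in range(0, len(sequence) - window + 1):
        window_scores = qualities[start:start + window]
        if sum(window_scores) / window < min_avg:
            return sequence[:start], qualities[:start]
    return sequence, qualities


def passes_min_length(sequence, min_len=36):
    """MINLEN-style filter: keep the read only if it is long enough."""
    return len(sequence) >= min_len


read = "ACGTACGTAC"
quals = [30, 30, 30, 30, 30, 30, 10, 8, 5, 2]  # quality drops toward the 3' end
trimmed, trimmed_quals = sliding_window_trim(read, quals)
print(trimmed, passes_min_length(trimmed))
```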
Assembly:
➢ Once the read quality is acceptable, millions of raw sequence reads (single-end or
paired-end) are mapped and aligned using either reference-based assembly (when a
reference sequence is available) or de novo assembly (in the absence of a
reference sequence).

➢ Sequence reads of variable length are aligned using bioinformatics
alignment tools such as BWA, Bowtie, and TopHat.
➢ These heuristic-based aligners allow fast sequence alignment. A consensus
sequence is generated from the alignment by finding the overlapping portions of
the reads and merging them into longer sequences, in order to reconstruct a
region of interest, that is, a gene or a whole genome.
➢ The main aim of this step is to generate a consensus sequence from the millions of
reads.
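The idea of merging reads through their overlapping portions can be shown with a toy example. It assumes exact, error-free overlaps between two reads; real assemblers and aligners tolerate mismatches and work on millions of reads. The function names and reads are illustrative.

```python
def longest_overlap(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3):
    """Merge two reads into one longer sequence, or return None if they do not overlap."""
    size = longest_overlap(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


print(merge_reads("ACGTTGCA", "TGCAGGTC"))
```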
Reference based assembly
➔ Reference-assisted assembly is like painting a landscape with the scenery in
front of you.

➔ The landscape in the painting may look a little different, and the terrain
need not be the same, but having the scenery in front of you makes the job
relatively simpler.

➔ It is done using alignment tools such as BWA, Bowtie2, etc.


Shorter-read guided assembly. In this method, shorter reads are aligned against
the reference genome, a consensus assembly is generated, and structural
variations are detected. It can also be used to detect contamination in the
sequenced reads. This approach is used when genomes are re-sequenced to detect
polymorphisms in individuals.

Longer-read guided assembly. Longer reads are aligned against the reference
genome, a consensus genome assembly is constructed, and structural variations
are detected.

Guided de novo genome assembly of shorter reads. Previously de novo assembled
shorter reads are aligned against the reference or a closely related genome to
extend the existing contigs.

Guided de novo genome assembly of longer reads. Longer reads are de novo
assembled into contigs, which are aligned against the reference or a closely
related genome to be extended.
de novo assembly

Short read assembly. Genome assembly using only shorter reads and any assembly
tool to construct contiguous sequences (contigs).

Longer reads assembly. Contig assembly using longer reads (long reads, linked
reads, optical maps) followed by scaffold assembly and gap filling.

Hybrid genome assembly. In this method, shorter reads can be assembled into
contigs and the longer reads can be used for error correction, then the
corrected contigs can be assembled into scaffolds and the gaps filled.

Hybrid genome assembly using pre-assembled contigs. Longer reads are aligned
against de novo pre-assembled contigs from shorter reads, followed by contig
extension.
Variant identification:
➢ The aligned sequence is in SAM format; it is converted to BAM format
(smaller than SAM) using SAMtools.

➢ Variant analysis uses the aligned reads to determine the conserved and
variable nucleotides at specific positions.

➢ Bootstrap resampling of reads can be used to assess the quality of variant
calling scores.
➢ Variations within the genomic sequence, such as single-nucleotide
polymorphisms (SNPs), single-nucleotide variants (SNVs), and indels
(insertions and deletions), are detected using software such as
SAMtools, Genome Analysis Toolkit (GATK), and VarScan.

➢ Both SAMtools and GATK use a Bayesian probabilistic approach to distinguish
true variants from alignment errors, whereas VarScan uses a heuristic approach.

➢ Filters are applied to remove low-confidence variants, using GATK’s
VariantFiltration or other filtering tools.
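A heuristic approach to variant calling can be sketched as a pair of thresholds applied to the bases observed at one position. This toy example is in the spirit of threshold-based callers but is not the algorithm of VarScan or any other tool; the function name and both threshold values are illustrative assumptions.

```python
from collections import Counter


def call_variant(ref_base, pileup_bases, min_coverage=8, min_var_freq=0.2):
    """Return (alt_base, frequency) if a non-reference base passes both thresholds.

    `pileup_bases` is the string of bases that aligned reads show at one position.
    """
    coverage = len(pileup_bases)
    if coverage < min_coverage:
        return None  # too few reads to make a call
    alt_counts = Counter(base for base in pileup_bases if base != ref_base)
    if not alt_counts:
        return None  # every read agrees with the reference
    alt_base, alt_count = alt_counts.most_common(1)[0]
    frequency = alt_count / coverage
    return (alt_base, frequency) if frequency >= min_var_freq else None


print(call_variant("A", "AAAAAGGGGG"))  # half the reads support G
print(call_variant("A", "AAAAAAAAAG"))  # a single G, likely an error
```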
Single Nucleotide Polymorphisms
★ SNPs are variations at a single nucleotide position in the genome where
different individuals may have different nucleotides.
★ For example, in one person, a specific position in the genome might have an
"A," while in another person, it might be a "G." SNPs are common and can
contribute to genetic diversity and disease susceptibility.

Single Nucleotide Variants
★ SNVs are similar to SNPs but encompass any single nucleotide change in a
sequence, regardless of its frequency in the population.
★ SNVs can be rare and include all single nucleotide changes whether or not
they are common.
★ For example, an SNV that changes adenine to thymine in the coding region of
the HBB gene can result in sickle cell disease due to a change in the amino
acid sequence from glutamic acid to valine.

Indels
★ Indels refer to small insertions or deletions of nucleotides in the genome.
★ For example, an indel could involve the addition or removal of a few
nucleotides at a specific location, which can lead to changes in the reading
frame of a gene or affect gene function.
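The difference between these variant types, and the HBB example, can be illustrated with a short sketch. It assumes variants are given as reference/alternate allele strings, and it uses the well-known sickle-cell codon change GAG to GTG (the codon is an added detail, not taken from the text above); the codon table contains only the two codons needed.

```python
# Toy codon table: only the two codons used in this example.
CODON_TO_AMINO_ACID = {"GAG": "Glu", "GTG": "Val"}


def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant from its reference and alternate alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "multi-nucleotide substitution"


def apply_snv(codon: str, position: int, new_base: str) -> str:
    """Replace the base at `position` (0-based) within a codon."""
    return codon[:position] + new_base + codon[position + 1:]


mutant = apply_snv("GAG", 1, "T")  # A -> T at the middle position
print(classify_variant("A", "T"),
      CODON_TO_AMINO_ACID["GAG"], "->", CODON_TO_AMINO_ACID[mutant])
```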
Annotation:
➢ The detected genetic variants are annotated against published
peer-reviewed literature and public genetic variant databases, using
tools like ANNOVAR, SnpEff, or VEP (Variant Effect Predictor).
Interpretation of variants:
➢ Lastly, medical professionals interpret these variants by examining
disease pathways, performing gene network analysis, and identifying the
actual mutations causing a disease.
THANK YOU
