
Module-1 - Merged - Bioinformatics

genomics notes

Genomics Proteomics & Bioinformatics

Topic: Genes and Proteins


Module-1
INTRODUCTION: Polymorphisms – types of polymorphism,
genome sequences and database subscriptions, discovery of new
genes and their function. Early sequencing efforts. Extraction of
DNA, methods of preparing genomic DNA for sequencing, DNA
sequence analysis methods – Maxam & Gilbert method, Sanger
dideoxy method, fluorescence method, shotgun approach. NGS –
different methods and principles.
Introduction to Genes & Proteins

1. Definition of gene
2. How does a gene work?
3. Definition of protein
4. Gene and protein relationship
5. Genetic fine structure
1. Gene

• A gene is the molecular unit of heredity in all living organisms.
• Genes carry the information needed to build and maintain the
cell and to pass genetic traits on to the next generations.
2. Protein
Large molecules composed of one or more chains of amino
acids in a specific order determined by the base sequence of
nucleotides in the DNA coding for the protein.
4. GENE-PROTEIN RELATIONSHIP
Relationship between genotype and phenotype:
Genes → linear sequence of amino acids → enzymes and structural proteins → phenotype of the cell → characteristic features of the organism


From gene to protein
[Figure: in the nucleus, DNA is transcribed into mRNA; in the cytoplasm, the ribosome translates the mRNA into protein, which gives rise to the trait.]
How does mRNA code for proteins?
DNA      TACGCACATTTACGTACGCGG   (4 bases: A, T, C, G)
mRNA     AUGCGUGUAAAUGCAUGCGCC   (4 bases: A, U, C, G)
protein  Met Arg Val Asn Ala Cys Ala
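The DNA → mRNA → protein example above can be reproduced with a short Python sketch. It is an illustration only: the codon table holds just the codons that occur in this example (a full table has 64 entries), the function names are hypothetical, and the strands are paired position by position as drawn on the slide, without modelling strand direction.

```python
# Minimal codon table: only the codons needed for the slide's example,
# plus the three stop codons. A full table has 64 entries.
CODON_TABLE = {
    "AUG": "Met", "CGU": "Arg", "GUA": "Val", "AAU": "Asn",
    "GCA": "Ala", "UGC": "Cys", "GCC": "Ala",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

# Base-pairing rules for copying a DNA template strand into mRNA.
DNA_TO_MRNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_dna):
    """Return the mRNA complementary to a DNA template strand."""
    return "".join(DNA_TO_MRNA[base] for base in template_dna)


def translate(mrna):
    """Read the mRNA three bases at a time and return the amino acids."""
    protein = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "Stop":
            break
        protein.append(residue)
    return protein


mrna = transcribe("TACGCACATTTACGTACGCGG")
print(mrna)                      # AUGCGUGUAAAUGCAUGCGCC
print("".join(translate(mrna)))  # MetArgValAsnAlaCysAla
```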
Prokaryote vs. Eukaryote genes
Prokaryotes:
• DNA in cytoplasm
• circular chromosome
• naked DNA
• no introns
Eukaryotes:
• DNA in nucleus
• linear chromosomes
• DNA wound on histone proteins
• introns vs. exons
[Figure: eukaryotic DNA; introns come out!]
intron = noncoding (in-between) sequence
exon = coding (expressed) sequence
Topic: Polymorphism & types
• Before we begin the discussion of genetic polymorphism, it is essential that we understand what a
gene is.
• A gene can be defined as a segment of the DNA that specifies the sequence of amino acids in a
particular protein.
• Throughout the life cycle of a cell, it is the DNA that directs the cellular functions and exists in an
uncoiled granular form.
• However, during the life cycle of the cell, the normal activities of the cell might cease and the cell
divides.
• Cell division results in the production of new cells.
• At this stage, the DNA is highly coiled and visible under a microscope as a discrete structure called
chromosomes.
• During the early stage of cell division, when the chromosomes become visible, they are made up of two
strands or two DNA molecules that are joined together at a constricted area called the centromere.
• Chromosomes are present in identical sets of two (or in pairs).
• Humans have 46 chromosomes, or 23 pairs of chromosomes.
• Of these, 22 pairs are autosomes and the remaining pair are the sex chromosomes, X and Y. The sex
chromosomes determine the sex of the individual, male or female.
• The autosomes are responsible for all physical characteristics of an individual except primary sex
determination.
• Human cells were estimated to contain around 28,000-30,000 genes (Deloukas et al., 1998); more recent
annotations put the number of protein-coding genes closer to 20,000. These genes carry the information
that determines molecular traits passed on from parents to their offspring.
• Genes encode various traits like hair color, eye color, skin color, hair texture, etc.
• Any change in the DNA sequence brings about change in the genetic information, which brings about
change in the phenotypic expression and also the associated biological function.
• A change in the DNA sequence is known as a mutation.
• Physical anthropologists are concerned with understanding visible human variation, as they are
interested not only in identifying the factors that produce visible physical variation but also in the
underlying genetic determinants that dictate it.
• Genetic variation arises from differences in DNA sequence, relative to the wild-type form, among
individuals and populations.
• Each and every individual has two sets of genomes, one maternal and one paternal.
• Therefore, at each genetic location (locus), the alleles from the maternal and the paternal side, can
either have identical DNA sequence or slightly differing DNA sequence.
• The wild type form in a population refers to individuals with normal phenotype. The wild type form is
usually possessed by the majority of the individuals in the population.
• In contrast to this, mutant type refers to individuals with a phenotype that varies from the normal
population.
• At a given locus, an individual is referred to as homozygous if the alleles on both chromosomes are
identical, or heterozygous if they differ.
POLYMORPHISM
• The term “polymorphism” combines two Greek words, “poly” meaning multiple and “morph” meaning form.
It can be defined as a Mendelian trait that exists in a population in at least two different forms.
Ford (1940) defines genetic polymorphism as the occurrence together in the same habitat of two
or more discontinuous forms or phases of a species in such proportions that the rarest of them cannot be
maintained by recurrent mutations.
• In simpler words, genetic polymorphism refers to the occurrence in the same population of two or more
alleles at the same locus, such that the frequency of the rarer allele is greater than one percent and
the rarer allele is maintained in the population not merely by recurrent
mutations (Cavalli-Sforza and Bodmer, 1971).
• In a nutshell, polymorphisms can be defined as variations in the DNA sequence that are present in the
population at a frequency greater than 1 percent.
• In other words, if the frequency of a variant is more than 1 percent in a population, it is a
polymorphism. Insertion-deletion polymorphisms, single nucleotide polymorphisms and restriction site
polymorphisms (restriction fragment length polymorphisms) are some examples of genetic
polymorphisms.
• Genetic polymorphism refers to the occurrence of multiple forms or variants of a particular gene within a
population. These variations are encoded in the DNA sequences and can manifest through differences in
nucleotide bases or changes in the structure and function of the gene.
• Genetic polymorphism can lead to diverse phenotypic characteristics and traits observed among individuals.
Definition:
Genetic polymorphism is a term used in genetics to describe multiple forms
of a single gene that exist in an individual or among a group of individuals
(Philips, 2016).

Causes of genetic polymorphism

• Deletion and duplication of millions of base pairs of DNA.
• Changes in one or a few bases in the DNA located between genes or
within exons.
• Sequence changes may also be located in the coding sequence of genes
themselves and result in different protein variants, which may in turn
lead to different phenotypes.
Mutation and polymorphism
• The main difference is one of frequency:
If the frequency is more than 1%: polymorphism
If the frequency is less than 1%: mutation
So we can infer that mutations give rise to polymorphisms.
• The other difference is in effect:
Mutations are often associated with disease.
Polymorphisms are sometimes neutral and sometimes harmful.
HUMAN GENE POLYMORPHISM
When a nucleotide change is very rare, and not present in many
individuals, it is often called a mutation. In contrast to mutations, genetic
polymorphisms are usually considered normal variants in population.
When a specific allele occurs in at least 1% of the population, it is said to
be a genetic polymorphism
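The 1% rule can be written as a small Python sketch. The function names and the sample counts are hypothetical; the sketch assumes an autosomal locus in diploid individuals (two allele copies per person) and uses the "at least 1%" wording as the cut-off.

```python
POLYMORPHISM_THRESHOLD = 0.01  # the 1% cut-off quoted in the notes


def allele_frequency(allele_count, individuals_sampled):
    """Frequency of an allele among diploid individuals (2 copies each)."""
    return allele_count / (2 * individuals_sampled)


def classify_variant(frequency):
    """Label a variant using the frequency-only rule from the notes."""
    if frequency >= POLYMORPHISM_THRESHOLD:
        return "polymorphism"
    return "rare variant (mutation)"


# Hypothetical counts, for illustration only.
freq_common = allele_frequency(allele_count=150, individuals_sampled=500)
freq_rare = allele_frequency(allele_count=3, individuals_sampled=500)
print(freq_common, classify_variant(freq_common))  # 0.15 polymorphism
print(freq_rare, classify_variant(freq_rare))      # 0.003 rare variant (mutation)
```

The rule looks only at frequency; it says nothing about whether the variant is harmful.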
Examples of Genetic Polymorphism

• ABO Blood Group System: The ABO blood group system is a classic example of genetic polymorphism. It
is determined by variations in the ABO gene, which results in the expression of different surface antigens
on red blood cells. The system includes four main blood types: A, B, AB, and O, with individuals having
different combinations of antigens, leading to diverse blood groups.
• Human Leukocyte Antigen (HLA) System: The HLA system is a highly polymorphic group of genes
involved in the immune response. Variations in HLA genes influence an individual's susceptibility to certain
autoimmune diseases, transplantation compatibility, and defense against infectious agents.
• Melanocortin-1 Receptor (MC1R) Gene: The MC1R gene is responsible for determining the production and
type of melanin, influencing hair and skin pigmentation. Genetic polymorphism in the MC1R gene
contributes to variations in hair color, ranging from red and blond to brown and black.
• CYP2D6 Gene: The CYP2D6 gene encodes an enzyme involved in drug metabolism. Genetic
polymorphism in this gene affects an individual's ability to metabolize certain medications, leading to
variations in drug response and potential adverse effects.
Causes of Genetic Polymorphism

• Mutation: Genetic mutations, such as point mutations, insertions, deletions, and chromosomal
rearrangements, are a fundamental source of genetic polymorphism. These mutations can
occur spontaneously or due to environmental factors, chemical exposure, or errors during DNA
replication.
• Genetic Drift: Random changes in allele frequency within small populations can lead to the
emergence and maintenance of genetic polymorphism. Genetic drift is particularly significant in
isolated populations or those with limited gene flow.
• Natural Selection: Environmental factors and selective pressures can favor certain alleles over
others, influencing the prevalence of genetic polymorphism. For instance, alleles that confer
advantages in adapting to specific environments or provide resistance to diseases may become
more common in a population over time.
• Gene Flow: Gene flow occurs when genetic material is exchanged between
different populations through migration and interbreeding. It can introduce new alleles into a
population, contributing to genetic polymorphism.
• Non-random Mating: Certain mating patterns, such as assortative mating (choosing partners
with similar traits) or disassortative mating (choosing partners with dissimilar traits), can lead to
genetic polymorphism by altering the distribution of alleles within a population.
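Genetic drift can be illustrated with a minimal Wright-Fisher style simulation. This is a sketch under stated assumptions, not a model taken from these notes: a single locus with two alleles, a constant population of N diploid individuals, random mating, and no selection, mutation or migration. The function name and parameter values are hypothetical.

```python
import random


def simulate_drift(initial_freq, population_size, generations, seed=0):
    """Wright-Fisher sketch: each generation resamples 2N allele copies."""
    rng = random.Random(seed)
    copies = 2 * population_size
    freq = initial_freq
    trajectory = [freq]
    for _generation in range(generations):
        # Each copy in the next generation is drawn at random from the
        # current allele pool; chance alone moves the frequency.
        count = sum(1 for _copy in range(copies) if rng.random() < freq)
        freq = count / copies
        trajectory.append(freq)
    return trajectory


small = simulate_drift(initial_freq=0.5, population_size=10, generations=50)
large = simulate_drift(initial_freq=0.5, population_size=1000, generations=50)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

Running it with different seeds typically shows much larger swings in the small population, which matches the point that drift matters most in small or isolated populations.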
Types of polymorphism

• Single nucleotide polymorphism (SNP)
• Insertion and deletion polymorphism (indel)
• Nucleotide repeat polymorphism / variable number of tandem repeats (VNTR) / microsatellite variation
[Figure: diagrammatic representation of the types of polymorphism]
Single nucleotide polymorphism
• A single nucleotide polymorphism (SNP, pronounced “snip”) is a single-base variation in DNA and a source
of variance in a genome.
• SNPs are the simplest and most common form of genetic polymorphism in the human genome
(90% of all human DNA polymorphisms).
• A SNP is a change in a single base pair (bp) in the genomic DNA. SNPs can affect gene function. For
example, a SNP located in a promoter region may influence the amount of mRNA produced.
• The human genome, that is, the complete genetic material in a cell, consists of about 3 billion base pairs.
A SNP occurs on average about once every 300 nucleotides, meaning that there are around 10 million SNPs
in the human genome.
• All individual humans share genome sequences that are approximately 99.9% the same. The 0.1% variable
region is responsible for the genetic diversity between individuals.
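The two figures quoted above follow from simple arithmetic, shown here as a short Python check. The constant names are illustrative; the numbers are the rounded values given in the notes.

```python
GENOME_SIZE_BP = 3_000_000_000   # about 3 billion base pairs
SNP_SPACING_BP = 300             # roughly one SNP every 300 nucleotides

# Expected number of SNP sites across the genome.
expected_snps = GENOME_SIZE_BP // SNP_SPACING_BP

# 0.1% of the genome (1 base in 1000) differs between two individuals.
variable_bases = GENOME_SIZE_BP // 1000

print(expected_snps)   # 10000000
print(variable_bases)  # 3000000
```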
• There are two types of nucleotide base substitutions resulting in SNPs:
– A transition substitution occurs between purines (A, G) or between pyrimidines (C, T). This type of
substitution constitutes two thirds of all SNPs.
– A transversion substitution occurs between a purine and a pyrimidine.
• For example, a SNP might change the DNA sequence ATGCCTA to ATGCTTA.
• Individuals may be homozygotes (e.g. T/T or C/C) or heterozygotes with different bases (e.g. T/C) at
polymorphic sites.
• For a variation to be considered a SNP, it must occur in at least 1% of the population. There are
~7 million common SNPs with a population frequency of at least 5% across the entire human population.
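The transition/transversion distinction can be sketched in Python as follows. The helper names are hypothetical, and the comparison assumes two already aligned sequences of equal length, as in the ATGCCTA / ATGCTTA example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref, alt):
    """Classify a single-base substitution as transition or transversion."""
    if ref == alt:
        raise ValueError("reference and alternate bases are identical")
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b, type) for each mismatch (1-based)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        (i + 1, a, b, substitution_type(a, b))
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


print(find_snps("ATGCCTA", "ATGCTTA"))  # [(5, 'C', 'T', 'transition')]
```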
CODING REGION SNPS
Synonymous: The substitution causes no amino acid change in the protein
that is produced.
Non-synonymous: The substitution alters the encoded amino acid or
introduces a stop codon. A missense polymorphism changes the
protein by changing a codon so that it specifies a different amino acid.
A nonsense polymorphism results in a misplaced termination codon.
About one half of all coding-sequence SNPs
result in non-synonymous codon changes.

SNPs may also occur in regulatory regions of genes. These SNPs are
capable of changing the amount or timing of protein production.
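A minimal Python sketch of how a coding-region SNP could be classified. It assumes the standard genetic code, DNA codons written on the coding strand, and a known reading frame; the function name and the example codons are chosen for illustration only.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, position, new_base):
    """Classify the effect of changing one base (position 0-2) of a codon."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"      # a stop codon appears where none was
    if before == "*":
        return "stop-lost"     # not covered in the notes; kept for completeness
    return "missense"


print(classify_coding_snp("GCA", 2, "C"))  # synonymous (Ala -> Ala)
print(classify_coding_snp("CGT", 1, "A"))  # missense   (Arg -> His)
print(classify_coding_snp("TGC", 2, "A"))  # nonsense   (Cys -> stop)
```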
Variable Number of Tandem Repeats (VNTR)
Nucleotide repeat polymorphism
• These are arrays of core units of 2 or more base pairs, located adjacent to each other in the
non-coding region of the genome.
• Variable number of tandem repeats refers to loci where the number of copies of the
core unit varies between individuals.
• On the basis of the size of the core unit, they can be categorized as either:
• a) Mini-satellites: collections of moderately sized arrays of tandemly repeated DNA sequences,
with core units usually 10-60 base pairs long, that are dispersed over considerable portions of the
nuclear genome.
• For example, GATACCCCAAAG GATACCCCAAAG GATACCCCAAAG is an array of
12-nucleotide repeats, with arrays spanning 3-20 kbp.
• b) Microsatellites, or short tandem repeats (STRs): small arrays of tandem repeats of a simple
nucleotide sequence, usually less than 10 base pairs. For example, GATA GATA GATA
GATA GATA GATA has 4 bases repeated 6 times, TA TA TA TA TA TA is a dinucleotide repeat
and TAT TAT TAT TAT TAT is a trinucleotide repeat. STRs were first used in the Persian
Gulf War in 1991 for identification of human remains.
Another class of polymorphism is the simple sequence repeats, with repeat units of 1-4 bp.
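The repeat examples above can be checked with a short Python sketch that finds the core unit and the copy number of a perfect tandem array. The function names are hypothetical, the size classes use the core-unit lengths given in these notes, and real repeat arrays are often imperfect, which this sketch does not handle.

```python
def describe_tandem_repeat(sequence):
    """Return (core_unit, copies) for a sequence made of one repeated unit.

    The shortest unit that rebuilds the whole sequence is reported.
    """
    sequence = sequence.replace(" ", "").upper()
    for unit_length in range(1, len(sequence) + 1):
        if len(sequence) % unit_length:
            continue
        unit = sequence[:unit_length]
        copies = len(sequence) // unit_length
        if unit * copies == sequence:
            return unit, copies
    raise ValueError("empty sequence")


def repeat_class(unit):
    """Size classes as given in the notes (core unit length in bp)."""
    if len(unit) < 10:
        return "microsatellite (STR)"
    if len(unit) <= 60:
        return "minisatellite"
    return "unclassified"


for array in ("GATA GATA GATA GATA GATA GATA",
              "TA TA TA TA TA TA",
              "TAT TAT TAT TAT TAT",
              "GATACCCCAAAG GATACCCCAAAG GATACCCCAAAG"):
    unit, copies = describe_tandem_repeat(array)
    print(unit, copies, repeat_class(unit))
```

Individuals who differ in the copy number at the same locus carry different VNTR or STR alleles.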
Insertion and deletion polymorphism:
• Another category of gene polymorphism involves insertions or deletions. Insertions and
deletions can be as small as 1 base, in which case they may also be classified in the category of
single nucleotide polymorphisms, but can also consist of a few bases, one or more exons, or
even a whole gene.
• These refer to genetic variations in which a sequence of DNA is either inserted or deleted from
the gene.

• The frequency of insertion-deletion polymorphisms is only about one tenth
of the frequency of SNPs.
• Studies have shown that around 90 percent of insertion-deletions are of 1-10 nucleotides only.
• Around 9 percent involve sequences of 11 to 100 nucleotides, whereas only 1 percent involve
sequences greater than 100 nucleotides (Mullaney et al., 2010).
• The common forms are the dinucleotide and trinucleotide repeats.
Microsatellite variation (tandem repeat
Topic: Biological database
Biological databases
• Biological databases are libraries of life sciences
information, collected from scientific experiments,
published literature, high-throughput experiment
technology, and computational analysis.
• They contain information from research areas
including genomics, proteomics, metabolomics, microarray
gene expression, and phylogenetics.
• Information contained in biological databases includes gene
function, structure, localization (both cellular and
chromosomal), clinical effects of mutations as well as
similarities of biological sequences and structures.
• Biological knowledge is distributed among many different
general and specialized databases. This sometimes makes it
difficult to ensure the consistency of information.
• Integrative bioinformatics is one field attempting to
tackle this problem by providing unified access. One
solution is for biological databases to cross-reference
other databases by accession number, linking their
related knowledge together.
• Relational database concepts of computer
science and Information retrieval concepts of digital
libraries are important for understanding biological
databases.
• Biological database design, development, and long-
term management is a core area of the discipline
of bioinformatics. Data contents include gene
sequences, textual descriptions, attributes
and ontology classifications, citations, and tabular data.
• These are often described as semi-structured data, and
can be represented as tables, key delimited records, and
XML structures.
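To make the three representations concrete, here is a small Python sketch that writes one record as a table row, as a key-delimited record and as an XML structure. The record is entirely hypothetical (the accession and description are invented placeholders, not entries from any real database) and the field layout is an assumption, not the format of any particular resource.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; every value is an invented placeholder.
record = {
    "accession": "EXAMPLE0001",
    "organism": "Homo sapiens",
    "description": "example gene fragment",
    "sequence": "ATGCGTGTAAATGCATGCGCC",
}

# 1. Table row: values in a fixed column order, tab separated.
columns = ["accession", "organism", "description", "sequence"]
table_row = "\t".join(record[c] for c in columns)

# 2. Key-delimited record: one "KEY   value" line per field.
key_delimited = "\n".join(
    f"{key.upper():<12}{value}" for key, value in record.items()
)

# 3. XML structure: one child element per field.
root = ET.Element("entry")
for key, value in record.items():
    ET.SubElement(root, key).text = value
xml_text = ET.tostring(root, encoding="unicode")

print(table_row)
print(key_delimited)
print(xml_text)
```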
Classification:
• Biological databases can be broadly classified into
sequence, structure and functional databases.
• Nucleic acid and protein sequences are stored in sequence
databases and structure databases store solved structures
of RNA and proteins.
• Functional databases provide information on the
physiological role of gene products, for example enzyme
activities, mutant phenotypes, or biological pathways.
• Model Organism Databases are functional databases that
provide species-specific data.
• Databases are important tools in assisting scientists to
analyze and explain a host of biological phenomena from
the structure of biomolecules and their interaction, to the
whole metabolism of organisms and to understanding
the evolution of species.
• This knowledge helps facilitate the fight against
diseases, assists in the development of medications,
predicting certain genetic diseases and in discovering
basic relationships among species in the history of life.

[Figure: home page of a biological database called STRING, which characterises functional links between proteins]
Biological Databases
The data repositories most relevant to the biological sciences include:
• nucleotide and protein sequences
• protein structures
• genomes
• genetic expression
• bibliography
Main sequence databases:
• NCBI
• EMBL
Main protein databases:
• UniProt
• PDB
• MMDB
Some genome databases:
• ENSEMBL (human, mouse and others)
• SGD (yeast)
• TAIR (Arabidopsis)
Bibliography:
• PubMed
• Web of Science
Human diseases:
• OMIM
Metabolic pathways:
• KEGG
Sequence databases
• A sequence database is a collection of DNA or
protein sequences with some extra relevant
information. The main sequence databases
are GenBank and EMBL. Originally they were just
sequence collections, but they have grown to store
different, heavily interconnected biological databases
and they provide powerful interfaces to search and
browse the stored information.
• The sequences in these databases are split into
different sections to ease the search. Among others,
there are sections for mRNAs, published nucleotide
sequences, genomes, and genes.
GenBank
GenBank is a public collection of annotated sequences hosted by
the NCBI. Among other kinds of sequences, GenBank includes
messenger RNAs, genomic DNAs and ribosomal RNAs.
Some characteristics:
• It is a public repository; anyone can submit sequences to it.
• There are sequences of different qualities; anything submitted is
stored.
• There can be multiple sequences for the same gene or for the
same mRNA.
• A sequence can have several versions that represent the
modifications made by the authors.
Because of the huge number of sequences stored, the database is
split into different divisions to ease the search. These divisions follow two
criteria: the species and the type of sequence. Among the taxonomic
divisions you can find primate, rodent, other mammalian,
invertebrate and others. The other divisions relate to the kind of
sequence, such as EST, WGS, HTGS, and many others.
Genbank format:
RefSeq
• RefSeq is a reference database curated by NCBI.
• In RefSeq there are only well annotated and good quality
sequences. It stores genomic, transcript and protein sequences
and links the sequences that belong to a gene. It just has one
representative sequence for each mRNA in a particular
organism and, thus, it will have as many sequences as different
transcripts and proteins coded for a particular gene in a
particular organism.
UniProt
• UniProt is a protein database that includes information divided
in two sections: Swiss-Prot and TrEMBL. UniProt aims to store
sequence and functional information for the proteins.
• TrEMBL is automatically annotated, while Swiss-Prot is manually
reviewed by curators who add information from the
literature. Because of this effort Swiss-Prot has information of
higher quality, but it has fewer sequences than TrEMBL.
PubMed
• PubMed is a bibliographical database that
comprises biomedical literature (MEDLINE), life
science journals and on-line books. It is a good
collection of publications related to biochemistry,
cellular biology and medicine. As of 2016,
PubMed stored 26 million citations.
For each record it stores:
• title
• authors
• abstract
PDB, Protein Data Bank
• PDB stores 3D structures for proteins and nucleic
acids.
DNA databases
1. Primary databases
International Nucleotide Sequence Database (INSD) consists of the
following databases.
• DNA Data Bank of Japan (National Institute of Genetics)
• EMBL (European Bioinformatics Institute)
• GenBank (National Center for Biotechnology Information)
2. Secondary databases
• 23andMe's database
• HapMap
• OMIM (Online Mendelian Inheritance in Man): inherited diseases
• RefSeq
• 1000 Genomes Project: launched in January 2008. The genomes of
more than a thousand anonymous participants from a number of
different ethnic groups were analyzed and made publicly available.
• EggNOG Database: a hierarchical, functionally and phylogenetically
annotated orthology resource based on 5090 organisms and 2502
viruses. It provides multiple sequence alignments and maximum-
likelihood trees, as well as broad functional annotation
Gene expression databases
• ArrayExpress: archive of functional genomics data; stores data from high-
throughput functional genomics experiments from EMBL
• Bioinformatic Harvester
• Ensembl: provides automatic annotation databases for human, mouse,
other vertebrate and eukaryote genomes
• Ensembl Genomes: provides genome-scale data for bacteria, protists, fungi,
plants and invertebrate metazoa, through a unified set of interactive and
programmatic interfaces (using the Ensembl software platform)
• FlyBase: genome of the model organism Drosophila melanogaster
• Gene Disease Database
Phenotype databases.
• PHI-base: pathogen-host interaction database. It links gene information to
phenotypic information from microbial pathogens on their hosts. Information
is manually curated from peer reviewed literature.
• RGD Rat Genome Database: genomic and phenotype data for Rattus
norvegicus
• PomBase database: manually curated phenotypic data for the
yeast Schizosaccharomyces pombe
RNA databases.
• miRBase: the microRNA database
• Rfam: a database of RNA families
Protein sequence databases
• DisProt: database of experimental evidences of disorder in proteins
(Indiana University School of Medicine, Temple University, University
of Padua)
• InterPro: classifies proteins into families and predicts the presence of
domains and sites
• MobiDB: database of intrinsic protein disorder annotation (University
of Padua)
• neXtProt: a human protein-centric knowledge resource
• Pfam: protein families database of alignments and HMMs (Sanger
Institute)
• PRINTS: a compendium of protein fingerprints (Manchester
University)
• PROSITE: database of protein families
Protein structure databases
• Protein Data Bank (PDB), comprising:
– Protein Data Bank in Europe (PDBe)
– Protein Data Bank Japan (PDBj)
– Research Collaboratory for Structural Bioinformatics (RCSB)
• Structural Classification of Proteins (SCOP)
• Protein model databases
• Protein-protein and other molecular interactions
• Protein expression databases
• Signal transduction pathway databases
• Metabolic pathway and protein function
databases
• Exosomal databases
• Mathematical model databases
• Taxonomic databases
• Radiologic databases
• Antimicrobial resistance databases
• Wiki-style databases
• Specialized databases
MODULE 1
EARLY SEQUENCING EFFORTS
The order of nucleic acids in polynucleotide chains ultimately
contains the information for the hereditary and biochemical
properties of terrestrial life.

Therefore the ability to measure or infer such sequences is
imperative to biological research.

This topic deals with how researchers throughout the years have
addressed the problem of how to sequence DNA, and the
characteristics that define each generation of methodologies for
doing so.
First-Generation Sequencing (Sanger Sequencing)
• Developed by Frederick Sanger in 1977.
• Uses dideoxynucleotides (ddNTPs) to terminate DNA synthesis at specific bases.
• DNA fragments of varying lengths are separated by electrophoresis and read to determine the sequence.
• Key Milestone: Human Genome Project (1990-2003), which was largely completed using this method.
Second-Generation Sequencing (Next-Generation Sequencing or NGS)
• Began in the mid-2000s.
• Techniques include Illumina sequencing and Roche/454 pyrosequencing.
• Allows massively parallel sequencing, increasing speed and reducing cost.
• Key Milestone: Cost of sequencing a human genome dropped dramatically,
facilitating large-scale projects like the 1000 Genomes Project (2008).
Third-Generation Sequencing (Single-Molecule Real-Time Sequencing)
• Emerged around 2010.
• Includes technologies like Pacific Biosciences (PacBio) and Oxford Nanopore.
• Allows sequencing of long DNA fragments in real time without amplification.
• Key Milestone: Achieving high-throughput, real-time sequencing and the potential for on-site diagnostics.
Key Milestones:
• 1977: First complete DNA genome sequenced (bacteriophage ΦX174).
• 1990-2003: Human Genome Project completion.
• 2008: Introduction of the 1000 Genomes Project.
• 2014: First nanopore-based portable sequencer released by Oxford Nanopore.
Each generation brought significant advancements in the speed, cost, and applications of
sequencing technologies.
Extraction of DNA
DNA extraction is a process that isolates DNA from a sample, such as blood, saliva, or tissue, and makes it ready for downstream
applications like sequencing. The basic steps of DNA extraction are:

• Disrupt the cell structure: Break open the cells to create a lysate

• Separate the DNA: Separate the DNA from other components of the cell, like proteins and lipids

• Bind the DNA: Bind the DNA to a purification matrix

• Wash away contaminants: Wash away proteins and other contaminants from the matrix

• Elute the DNA: Release the purified DNA from the matrix

The method used to extract DNA depends on the sample type and the downstream application, but some common methods include:

• Mechanical, chemical, and enzymatic lysis

• Precipitation

• Purification

• Concentration

When choosing a DNA extraction method, it's important to consider the quality and quantity of the DNA, as well as the time, cost, and
other factors.

To ensure the best quality DNA, it's important to minimize the time between sampling and storing the sample, to prevent enzymatic
degradation.
Methods of preparing genomic DNA for sequencing
Genomics Proteomics &
Bioinformatics – 21BT54
Module 1
Topic: DNA sequencing methods
• DNA sequencing is the process of determining the
exact order of nucleotides within a DNA molecule,
that is, the order of the four bases adenine (A),
guanine (G), cytosine (C) and thymine (T) in a strand
of DNA. The advent of rapid DNA sequencing methods
has greatly accelerated biological and medical research.
• Maxam and Gilbert-chemical
sequencing
• Sanger-chain termination
sequencing.
• These two are conventional methods
Purpose of DNA sequencing:
• Compare genes or specific sequences to
find differences and similarities.
• Classify organisms and make disease diagnoses.
• Locate exactly where in a genome or a gene
a mutation lies.
1. Sanger Sequencing:
• The enzymatic method is called the Sanger method. It was developed
by Frederick Sanger and his colleagues in 1977.
The development of this technique won Sanger the Nobel Prize in
Chemistry in 1980.
• From the 1980s to the mid-2000s, Sanger sequencing was the dominant
DNA sequencing platform, bringing the Human Genome Project (HGP) to
successful completion in 2003. Although this technique has largely been replaced
by next-generation sequencing methods, it is still used today for
smaller-scale projects.
• To perform the sequencing, one must first convert double-stranded
DNA into single-stranded DNA. This can be done by denaturing
the double-stranded DNA with NaOH.
• A Sanger reaction consists of the following components:
• Single-stranded DNA fragment (ssDNA template): the DNA strand to be
sequenced (one of the single strands produced by denaturation with NaOH).
• All four deoxyribonucleotide triphosphates, i.e. dATP, dGTP, dTTP and
dCTP.
• NOTE: At least one of the four deoxyribonucleotide
triphosphates should be radioactive in each of the four
reaction mixture tubes in order to permit
autoradiographic detection of the DNA bands
after gel electrophoresis.
• DNA polymerase: Each incubation tube also carries a
DNA polymerase enzyme (Sequenase) to copy
the DNA template by adding nucleotides to the primer
as synthesis proceeds.
• NOTE: The enzyme originally used was the ‘Klenow
fragment’ of E. coli DNA polymerase I, obtained by
removing the first 323 amino acids of the polypeptide
(the region responsible for 5’→3’ exonuclease activity)
with a protease. Sequenase, a modified T7 DNA polymerase,
was later introduced for the same purpose.
• DNA primers: The polymerase needs a primer
from which to start adding nucleotides.
• NOTE: A primer should have the following characteristics:
• 1. The primer should be either a DNA restriction fragment or a short DNA sequence
complementary to the single-stranded DNA template.
• 2. The primer should have a free 3’-OH group, required to form the 3’-5’
phosphodiester linkage with each nucleotide added to the primer.
• 3. The primer should be radioactively labelled at the 5' end.


• Four reaction mixtures
• In addition to radio-labelled primers and DNA polymerase (Sequenase), each
incubation tube (carrying a reaction mixture) should have all four
deoxyribonucleotides (dATP, dCTP, dGTP and dTTP) and one particular
dideoxyribonucleotide triphosphate.
• The four reaction mixtures differ from each other in having a different
dideoxyribonucleotide triphosphate analogue (ddGTP, ddATP,
ddTTP or ddCTP).
• Example: Tube A carries Sequenase, radio-labelled primers, all four
deoxyribonucleotides (dATP, dCTP, dGTP and dTTP) and one
dideoxyribonucleotide triphosphate (ddATP in this case).
• Dideoxynucleotides (ddNTPs) are chain-elongation inhibitors of DNA polymerase,
used in the Sanger method for DNA sequencing.
• A dideoxynucleotide (ddNTP) is an artificial molecule that lacks a hydroxyl (OH)
group at both the 2' and 3' carbons of the sugar moiety. In
contrast, a regular deoxynucleotide triphosphate (dNTP) has a hydroxyl group on
the 3' carbon of the sugar.
• The main purpose of the 3'-OH group is that it is used to
form a phosphodiester bond between two nucleotides -
this allows a DNA strand to elongate.
• During DNA replication, an incoming nucleoside
triphosphate is linked by its 5' α-phosphate group to the 3'
hydroxyl group of the last nucleotide of the growing chain.
With a ddNTP, which has no 3'-OH group, this reaction
cannot take place, so elongation is terminated. Thus, each
new strand stops at a random position where a dNTP has been
replaced by the corresponding ddNTP.
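• The concentration of ddNTP should be about 1% of the
concentration of dNTP. The logic behind this ratio is that
after DNA polymerase is added, polymerization takes place
and terminates only when a ddNTP (for example, ddATP in
tube A) is incorporated into the growing strand. Because
the ddNTP is scarce, different molecules terminate at
different positions, so fragments ending at every
occurrence of that base are produced.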
• Thus, four sets of chain-termination fragments,
corresponding to A, C, T and G, are produced in four
reaction mixtures.
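The effect of the 1% ratio can be illustrated with a small simulation. This is a minimal sketch, not a model of real polymerase kinetics: the function names, the per-position termination probability and the number of template copies are assumptions made only for illustration.

```python
import random


def complement_strand(template):
    """Return the new strand (5'->3') for a template written 3'->5'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in template)


def terminated_fragments(new_strand, dd_base, dd_fraction=0.01,
                         copies=20000, seed=0):
    """Simulate one reaction tube that contains a single ddNTP.

    At every position matching dd_base, the dideoxy analogue is
    incorporated with probability dd_fraction and the chain stops.
    Returns the set of fragment lengths (bases added to the primer).
    """
    rng = random.Random(seed)
    lengths = set()
    for _ in range(copies):
        for position, base in enumerate(new_strand, start=1):
            if base == dd_base and rng.random() < dd_fraction:
                lengths.add(position)
                break  # this copy is terminated
    return lengths


# With a low ddATP fraction, fragments end at every A in the new strand.
tube_a = terminated_fragments("ATGCATGA", "A")
# If only ddATP were present (fraction 1.0), every copy stops at the first A.
tube_a_all_dd = terminated_fragments("ATGCATGA", "A", dd_fraction=1.0)
```

In this sketch a low ddNTP fraction gives fragments ending at positions 1, 5 and 8, while a fraction of 1.0 gives only the shortest fragment, which is why the terminator is kept scarce.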
• Procedure
• The single stranded DNA is mixed with primer and split into four reaction
mixtures. Each reaction mixture contains DNA polymerase, the four
deoxyribonucleotide triphosphates (dNTPs) and one replication terminator
(ddNTP). In each molecule, synthesis proceeds until a replication-terminating
nucleotide (ddNTP) is added. The mixtures are loaded into four separate lanes
of a gel and electrophoresis is used to separate the DNA fragments.
• STEPS:
• 1. Fragmentation and amplification: Fragment the DNA and clone the
fragments into vectors.

• 2. Denaturation: Denature the double stranded DNA (by heat or NaOH)
into single stranded DNA fragments.

• 3. Attach the primer: A primer is a synthetic oligonucleotide, containing
17 to 24 nucleotides. The primer binds to the DNA molecule and provides
a 3' OH group, which is necessary to initiate DNA synthesis. The 3'-OH
group allows for DNA chain elongation.

• 4. Add 4 dNTPs + 1 ddNTP


• Four different reaction vials are taken, each with the four
standard dNTPs (dATP, dGTP, dCTP and dTTP) and DNA
polymerase. The vials differ only in the type of ddNTP
they contain. Each vial will have 1 ddNTP per
100 dNTPs.
• 6. After DNA synthesis has occurred, each reaction vial
will have a unique set of single-stranded DNA molecules of
varying lengths. However, all DNA molecules will have the
same primer sequence at their 5' ends.
• 7. Find the nucleotide sequence using gel electrophoresis:
In the gel, the fragments of varying length are lined up
according to size, and the lane in which a band appears
identifies the nucleotide at the end of that fragment.
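Reading the four lanes can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions: `read_sanger_gel` is a hypothetical helper, band positions are given as fragment lengths rather than gel distances, and every length is assumed to appear in exactly one lane.

```python
def read_sanger_gel(lanes):
    """Reconstruct the synthesized strand from four gel lanes.

    lanes maps each ddNTP base ("A", "C", "G", "T") to the fragment
    lengths seen in that lane. The shortest fragments run farthest,
    so reading from short to long gives the sequence 5'->3'.
    """
    band_to_base = {}
    for base, lengths in lanes.items():
        for length in lengths:
            if length in band_to_base:
                raise ValueError(f"band {length} appears in two lanes")
            band_to_base[length] = base
    return "".join(band_to_base[length] for length in sorted(band_to_base))


lanes = {"A": [1, 5, 8], "T": [2, 6], "G": [3, 7], "C": [4]}
sequence = read_sanger_gel(lanes)  # "ATGCATGA"
```

The result is the sequence of the newly synthesized strand; the template strand is its complement.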
Application:
Other useful applications of DNA sequencing
include single nucleotide polymorphism (SNP) detection,
single strand conformation polymorphism (SSCP),
heteroduplex analysis, and short tandem repeat (STR)
analysis.
Resolving DNA fragments according to differences
in size and/or conformation is the most critical step in
studying these features of the genome.
Advantages:
Long sequences can be sequenced in a
short time, which is very helpful in
sequencing projects.

Disadvantages:

More expensive.
MAXAM & GILBERT PROCEDURE
(Chemical Method)
• Allan Maxam and Walter Gilbert published a DNA sequencing method in 1977
based on chemical modification of DNA and subsequent cleavage at specific bases.
This method allows purified samples of double-stranded DNA to be used without
further cloning.
• Maxam-Gilbert sequencing requires radioactive labelling at 5' end or 3’ end of the
DNA followed by purification of the DNA fragment to be sequenced.
• Procedure (STEPS)
• 1. Radioactive labelling of one end (5' end or 3’ end) of the DNA fragment to be
sequenced by a kinase reaction using 32P.

• 2. Cut the DNA fragment with specific restriction enzyme, resulting in two unequal
DNA fragments

• 3. Denature the double-stranded DNA to single-stranded DNA by increasing
the temperature.

• 4. Cleave the DNA strand at specific positions using chemical reactions. For
example, we can use one of two chemicals followed by addition of piperidine.
Dimethyl sulphate (DMS) selectively attacks purines (A and G), while hydrazine
selectively attacks pyrimidines (C and T). This is called the modification step.
• 5. Chemical treatment generates breaks at the four
nucleotide bases in the four reaction mixtures (G, A+G, C, and
C+ T).
• Reagent mixtures
• 1. Reagent G: It breaks the DNA chain after guanine (G) base
• 2. Reagent A+G: It breaks the DNA chain after adenine (A) and
guanine (G) bases
• 3. Reagent C: It breaks the DNA chain after cytosine (C) base
• 4. Reagent C+T: It breaks the DNA chain after cytosine (C) and
thymine (T) bases
• The concentration of the modifying chemicals is controlled to introduce,
on an average, one modification per DNA molecule. Thus a series of
labelled fragments is generated, starting from the radiolabeled end to the
first "cut" site in each molecule. As a result, we have several differently
sized DNA strands in four reaction tubes.
• Fragments are subjected to
electrophoresis in high-resolution
acrylamide gels for size-based
separation.
• To visualize the fragments, the gel is
exposed to X-ray film for
autoradiography, which yields a series
of dark bands, each corresponding to a
radiolabeled DNA fragment, from which
the nucleotide sequence may be
inferred. In the gel, the fragments are
ordered by size and, thus, we can
deduce the sequence of the DNA
molecule.
• DNA sequencing evaluation: Reading the gel
• 1. The gel is read from bottom to top

• 2. Adjacent bands on the gel differ in length by only one
nucleotide; i.e. each subsequent fragment will be one
nucleotide longer than the previous one.

• 3. The larger the fragment, the more it is slowed down by
the gel; i.e. the smallest fragment will be at the bottom,
while the largest one will be at the top of the gel.

• 4. Each band on the gel identifies a specific nucleotide;
thus, the nucleotide sequence of the DNA fragment can be
read off the gel ‘end to end’.
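The logic of the four Maxam-Gilbert lanes can be sketched as follows. This is a simplified illustration, not a model of the chemistry: the function names are hypothetical, each band is indexed by the position of the destroyed base counted from the labelled end (the labelled fragment itself is one nucleotide shorter), and complete, base-specific cleavage is assumed.

```python
def maxam_gilbert_lanes(sequence):
    """Return band positions for the G, A+G, C and C+T reactions.

    Each band is indexed by the 1-based position of the base that was
    modified and removed, counted from the radiolabelled end.
    """
    targets = {"G": "G", "A+G": "AG", "C": "C", "C+T": "CT"}
    return {
        lane: {i for i, base in enumerate(sequence, start=1) if base in bases}
        for lane, bases in targets.items()
    }


def read_maxam_gilbert(lanes):
    """Deduce the sequence by reading bands from shortest to longest."""
    all_bands = sorted(set().union(*lanes.values()))
    calls = []
    for band in all_bands:
        if band in lanes["G"]:
            calls.append("G")   # band in both G and A+G lanes
        elif band in lanes["A+G"]:
            calls.append("A")   # band in A+G lane only
        elif band in lanes["C"]:
            calls.append("C")   # band in both C and C+T lanes
        else:
            calls.append("T")   # band in C+T lane only
    return "".join(calls)


lanes = maxam_gilbert_lanes("GATTCAG")
deduced = read_maxam_gilbert(lanes)  # "GATTCAG"
```

A base is called A when its band appears only in the A+G lane, and T when its band appears only in the C+T lane, which is why four reactions are enough to distinguish four bases.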
Advantages
• Purified DNA can be read directly.
• Can be used to sequence heterogeneous DNA as well
as homopolymeric sequences.
• Can be used to analyze DNA-protein interactions.
• Can be used to analyze epigenetic modifications and
nucleic acid structure.
Disadvantages
• Requires toxic chemicals and extensive use of
radioactive isotopes; the reagents are highly poisonous
and unstable.
• Cannot read more than about 500 bp.
• Setup is quite complex.
• It is difficult to make a Maxam-Gilbert DNA sequencing
kit.
• Read length decreases with incomplete cleavage
reactions.
3. Shotgun sequencing
• Shotgun sequencing is a laboratory technique for
determining the DNA sequence of an organism's
genome.
• The method involves breaking the genome into a
collection of small DNA fragments that are
sequenced individually.
• A computer program looks for overlaps in the
DNA sequences and uses them to place the
individual fragments in their correct order to
reconstitute the genome.
STEPS IN SHOTGUN SEQUENCING
• 1. RANDOM PHASE
a. Fragmentation:
• The DNA can be fragmented with restriction enzymes, or
broken physically into small pieces (usually 2, 10, 50, and
150 kb) by passing it through a narrow-gauge syringe, by
sonication (breaking the sample with sound waves), or by
other mechanical shearing. Fragmentation is a random
process, so the sequences of the fragments will have some
overlap between them. Fragments of up to about 150 kb are
obtained.
b. Cloning:
• The next step is joining the DNA fragments to a vector,
which is a carrier DNA. This process is known as cloning,
and it creates a sequence library. The sequence of the
entire genome is determined using this library.
c. Sequencing:
• Each clone in the library is sequenced. The fragments are
sequenced individually using the chain termination method
to obtain reads. Multiple overlapping reads are obtained for the
target DNA by performing several rounds of this fragmentation
and sequencing.
2. ASSEMBLY PHASE
Reassembling:
• Based on overlapping regions, the fragments are then
reassembled into their original order.
• Assembling the overlaps creates a “contig”: a long
continuous stretch of DNA sequence that is the decoded
version of the original source DNA.
• Larger and larger contigs will be produced, until a
single ordered contig of the genome is achieved.
As the contigs are built, gaps (where there is no
sequence available) and single stranded regions
(where there is sequence for only one strand) are
identified.
• These gaps and single stranded regions are then
targeted for additional sequencing to produce a fully
sequenced molecule.
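The idea of building contigs from overlaps can be illustrated with a greedy merge. This is a toy sketch under strong assumptions (error-free reads, a single strand, no repeats); `overlap` and `greedy_assemble` are hypothetical helpers, and real assemblers use far more elaborate graph-based methods.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
contigs = greedy_assemble(reads)  # ["ATGCGTACGTTAGC"]
```

When no overlap of at least `min_length` bases remains, the loop stops and returns several contigs, which corresponds to the gaps described above.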
3. FINISHING PHASE
• Alignment of the sequences of overlapping pieces
is done by computer programs or sequence
assembly software.
• In the case of the Human Genome Project, a
massive amount of data was involved, requiring
supercomputer technology. The process ultimately
yields the complete sequence. It is a faster but
more complex technique.

Shotgun cloning usually results in some gaps between contigs because some
sequences are missing from the library by chance. These gaps are filled by:
· Creating a new library, or
· Using known sequences to extend outward from the contig.
Many fragments are sequenced more than once because shotgun sequencing
sequences DNA fragments at random. This gives more certainty that the
sequence is correct than if each fragment had only been sequenced once or
twice.
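How often each base is sequenced, and how much of the genome is likely to be missed by chance, can be estimated with a simple calculation. The sketch below assumes reads land uniformly at random (a Poisson approximation that is not part of these notes), and the function names and example numbers are illustrative only.

```python
import math


def mean_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced (c = N * L / G)."""
    return num_reads * read_length / genome_size


def expected_unsequenced_fraction(coverage):
    """Fraction of bases expected to be missed at a given coverage,
    assuming reads start at uniformly random positions (e^-c)."""
    return math.exp(-coverage)


# Hypothetical example: 100,000 reads of 500 bases from a 5 Mb genome.
coverage = mean_coverage(100_000, 500, 5_000_000)   # 10.0
missed = expected_unsequenced_fraction(coverage)    # about 4.5e-05
```

Under these assumptions, even tenfold coverage is expected to leave a small fraction of bases unsequenced, which is consistent with the gaps described above.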
APPLICATIONS OF SHOTGUN
SEQUENCING
• It is the most widely used tool for genome sequencing.
The human genome was sequenced by the Human Genome
Project using map-based sequencing combined with
shotgun sequencing.
• Shotgun sequencing is now the preferred method
for all other kinds of genome sequencing as well.
• The total genomes of many organisms, such as the
plant Arabidopsis thaliana, rice, the cow, dog, chicken,
chimpanzee, rat, mouse, pufferfish, and many
microorganisms have been sequenced this way.
MODULE 1
Bioinformatics tools and automation in Genome Sequencing, analysis of raw
genome sequence data, Transcriptome (RNA) sequencing, Exome
sequencing, Genome Annotation, Using NGS
to detect sequence variants, Utility of EST database in sequencing.
Bioinformatics tools and automation in
Genome Sequencing
• Bioinformatics tools and automation are used in genome sequencing to analyze and interpret genomic
data, including:
• Analyzing genetic variants
Bioinformatics tools can predict how genetic variants affect gene function, protein structure, and
interaction networks.
• Managing data
Bioinformatics is essential for managing data in modern biology and medicine.
• Annotating genomes
Bioinformatics tools can annotate genomes with information about genes and other functional elements.
• Filtering and trimming reads
Bioinformatics tools can filter and trim reads to improve the quality and reliability of results.
• Integrating and weighing evidence
Automated annotation tools can integrate and weigh evidence.
Some bioinformatics tools used in genome sequencing include:
• ABySS: An algorithm that can handle large data sets
• Velvet: An algorithm that can be used for assembly, and has been improved to address
scalability issues
• EULER SR: An algorithm that is optimized for assembling short reads
• SOAPdenovo: A de novo assembler that has been used for genome sequencing projects
• Nullarbor: A tool that uses a command-line interface for WGS analysis
• INNUca: A tool that provides analysis functions for quality check, contamination detection,
and more
• Pathogenwatch: A tool that can be used for molecular typing, AMR prediction, and more
• MAKER and PASA: Automated annotation tools that can integrate and weigh evidence
• WebApollo: A tool that can be used to edit gene annotations
Analysis of Raw Genome Sequence
Data
• Analyzing raw genome sequence data involves multiple steps, from quality control to
variant analysis. Here's a detailed breakdown:
1. Quality Control (QC)
• Objective: Ensure raw sequencing data is of high quality before downstream analysis.
• Tools:
· FastQC: Assesses sequence quality metrics like per-base quality scores, GC content, and
adapter contamination.
· MultiQC: Aggregates results from multiple QC tools for comprehensive reports.
• Steps:
· Evaluate raw sequence data for quality and anomalies.
· Identify and filter out low-quality reads or bases.
· Trim adapters and low-quality bases using tools like Trimmomatic or Cutadapt.
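The idea behind quality trimming can be sketched without any external tool. This is a simplified illustration and not how Trimmomatic or Cutadapt are implemented: the function names and the threshold of 20 are assumptions, and quality strings are assumed to use the common Phred+33 encoding.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 by default) into integers."""
    return [ord(char) - offset for char in quality_string]


def trim_low_quality_tail(sequence, quality_string, min_quality=20):
    """Remove bases from the 3' end until one reaches min_quality."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def mean_quality(quality_string):
    """Average Phred score of a read (0.0 for an empty read)."""
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores) if scores else 0.0


# "I" encodes Phred 40 and "#" encodes Phred 2 in Phred+33.
trimmed_seq, trimmed_qual = trim_low_quality_tail("ACGTAC", "IIII##")
```

Here the last two bases are removed because their scores fall below the threshold, leaving `ACGT`. A read whose mean quality is still low after trimming would typically be discarded.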
2. Read Alignment/Mapping
• Objective: Align short reads to a reference genome to determine their
original location.
• Tools:
· BWA (Burrows-Wheeler Aligner): Efficiently maps short reads to a reference.
· Bowtie2: Fast and memory-efficient for aligning DNA sequences.
· HISAT2: Specialized for RNA-seq alignment but also usable for DNA.
• Steps:
· Align cleaned reads to the reference genome.
· Generate SAM/BAM files containing aligned reads.
· Convert and sort these files using tools like SAMtools.
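The structure of an alignment record can be shown by parsing one SAM line by hand. This is a minimal sketch for illustration: `Alignment`, `parse_sam_line` and `sort_by_coordinate` are hypothetical names, only a subset of the mandatory tab-separated fields is kept, and the sort is a simplification of coordinate sorting (real tools follow the reference order given in the file header).

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    read_name: str
    flag: int
    reference: str
    position: int   # 1-based leftmost mapping position
    mapq: int
    cigar: str
    sequence: str

    @property
    def is_unmapped(self):
        return bool(self.flag & 0x4)    # SAM flag bit: read unmapped

    @property
    def is_reverse(self):
        return bool(self.flag & 0x10)   # SAM flag bit: reverse strand


def parse_sam_line(line):
    """Parse selected mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return Alignment(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9])


def sort_by_coordinate(alignments):
    """Order alignments by reference name, then by position."""
    return sorted(alignments, key=lambda a: (a.reference, a.position))


record = parse_sam_line(
    "read1\t16\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII")
```

In practice these files are handled with dedicated tools such as SAMtools rather than parsed manually; the sketch only shows what information each record carries.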


3. Post-Alignment Processing
• Objective: Refine alignments for accurate variant calling.
• Tools:
· Picard: Manages duplicate reads, which can arise during PCR amplification.
· GATK (Genome Analysis Toolkit): Includes tools for base quality score
recalibration (BQSR) and indel realignment.
• Steps:
· Mark duplicates: Flag or remove duplicate reads to prevent bias.
· Indel realignment: Correct misalignments around indels.
· Base quality score recalibration: Adjust quality scores to improve accuracy.
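Duplicate marking can be sketched as grouping reads that start at the same place on the same strand. This is a deliberately simplified illustration: the dictionary keys and the `mark_duplicates` function are assumptions, and real tools also take mate positions and clipping into account.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Each read is a dict with keys: name, ref, pos, strand and
    base_quality_sum. Reads sharing reference, start position and
    strand form a group; the read with the highest summed base
    quality is kept as the representative.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["ref"], read["pos"], read["strand"])].append(read)
    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["base_quality_sum"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


reads = [
    {"name": "r1", "ref": "chr1", "pos": 100, "strand": "+", "base_quality_sum": 300},
    {"name": "r2", "ref": "chr1", "pos": 100, "strand": "+", "base_quality_sum": 350},
    {"name": "r3", "ref": "chr1", "pos": 100, "strand": "-", "base_quality_sum": 200},
    {"name": "r4", "ref": "chr1", "pos": 250, "strand": "+", "base_quality_sum": 280},
]
flagged = mark_duplicates(reads)  # {"r1"}
```

Only `r1` is flagged here: it shares its position and strand with `r2`, which has the higher quality sum, while `r3` lies on the opposite strand and `r4` starts elsewhere.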


4. Variant Calling
• Objective: Identify genetic variants like SNPs (single nucleotide
polymorphisms) and indels (insertions/deletions).
• Tools:
· GATK: Performs variant discovery and genotyping.
· FreeBayes: A haplotype-based variant caller for diploid and polyploid
genomes.
· SAMtools/BCFtools: Calls variants from BAM files.
• Steps:
· Call variants using tools like GATK’s HaplotypeCaller or UnifiedGenotyper.
· Generate VCF (Variant Call Format) files containing variant data.
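The basic idea of SNP calling can be sketched as counting the bases observed at each reference position. This is a naive illustration, not the statistical model used by GATK, FreeBayes or BCFtools: the function names, the depth threshold and the allele-fraction threshold are assumptions.

```python
from collections import Counter


def call_snps(reference, pileup, min_depth=4, min_alt_fraction=0.2):
    """Call SNPs from per-position base observations.

    reference: reference sequence string.
    pileup: dict mapping 1-based position -> string of observed read bases.
    Returns a list of (position, ref_base, alt_base, depth, alt_fraction).
    """
    calls = []
    for position in sorted(pileup):
        bases = pileup[position]
        depth = len(bases)
        if depth < min_depth:
            continue  # too few reads to make a call
        ref_base = reference[position - 1]
        alt_counts = Counter(b for b in bases if b != ref_base)
        if not alt_counts:
            continue  # all reads agree with the reference
        alt_base, alt_count = alt_counts.most_common(1)[0]
        fraction = alt_count / depth
        if fraction >= min_alt_fraction:
            calls.append((position, ref_base, alt_base, depth, fraction))
    return calls


def to_vcf_line(chrom, call):
    """Format a call as a minimal VCF data line (QUAL and FILTER missing)."""
    position, ref_base, alt_base, depth, _ = call
    return f"{chrom}\t{position}\t.\t{ref_base}\t{alt_base}\t.\t.\tDP={depth}"


calls = call_snps("ACGTAC", {2: "CCCC", 3: "GGTT", 5: "AAG"})
vcf_line = to_vcf_line("chr1", calls[0])
```

In this example only position 3 is reported: position 2 matches the reference and position 5 is covered by too few reads.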


5. Variant Filtering and Annotation
• Objective: Filter out low-quality variants and annotate with functional
information.
• Tools:
· VCFtools: Filters variants based on criteria like quality score, depth, and allele
frequency.
· ANNOVAR or SnpEff: Annotates variants with information like gene impact,
location, and clinical significance.
• Steps:
· Apply filtering to retain high-confidence variants.
· Annotate variants to add gene, protein, and phenotype information.
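Filtering on quality and depth can be sketched directly on VCF text lines. This is a minimal illustration rather than a replacement for VCFtools: the function names and thresholds are assumptions, and depth is assumed to be stored as `DP` in the INFO column.

```python
def parse_info(info_field):
    """Turn an INFO string such as 'DP=35;AF=0.5' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value
    return info


def passes_filters(vcf_line, min_qual=30.0, min_depth=10):
    """Return True if a VCF data line meets the QUAL and DP thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual, info = fields[5], parse_info(fields[7])
    if qual == "." or float(qual) < min_qual:
        return False
    return int(info.get("DP", 0)) >= min_depth


def filter_vcf(lines, **thresholds):
    """Keep header lines and the data lines that pass the filters."""
    return [line for line in lines
            if line.startswith("#") or passes_filters(line, **thresholds)]


vcf_lines = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=35;AF=0.5",
    "chr1\t200\t.\tC\tT\t12\t.\tDP=40",   # fails the quality threshold
    "chr1\t300\t.\tG\tA\t99\t.\tDP=3",    # fails the depth threshold
]
kept = filter_vcf(vcf_lines)
```

Only the header and the first variant survive with the default thresholds; suitable cut-offs depend on the sequencing depth and the caller used.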


6. Functional Analysis and Interpretation
• Objective: Understand the biological implications of identified variants.
• Tools:
· DAVID or Enrichr: For gene enrichment and pathway analysis.
· Integrative Genomics Viewer (IGV): Visualizes genomic data to validate variants.
• Steps:
· Assess the impact of variants on gene function.
· Perform pathway analysis to determine affected biological processes.
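Pathway analysis commonly asks whether the affected genes fall into a pathway more often than expected by chance. One common way to express this is a hypergeometric over-representation test, sketched below; this is an illustrative assumption and not necessarily the exact statistic used by DAVID or Enrichr, and the function name is hypothetical.

```python
from math import comb


def enrichment_p_value(genome_genes, pathway_genes, hit_genes, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    genome_genes:  total number of annotated genes (N)
    pathway_genes: genes belonging to the pathway (K)
    hit_genes:     genes carrying variants of interest (n)
    overlap:       hit genes that belong to the pathway (k)
    """
    total = comb(genome_genes, hit_genes)
    upper = min(pathway_genes, hit_genes)
    tail = sum(
        comb(pathway_genes, i) * comb(genome_genes - pathway_genes, hit_genes - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 5 in the pathway, 5 hits, all 5 in the pathway.
p_value = enrichment_p_value(10, 5, 5, 5)  # 1 / 252
```

A small p-value suggests the pathway is over-represented among the affected genes. When many pathways are tested, a multiple-testing correction is normally applied as well.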


7. Visualization and Reporting
• Objective: Present data in a comprehensible format.
• Tools:
· IGV: Visualizes sequence alignment and variants.
· Circos: Creates circular genome visualizations for complex data sets.
• Steps:
· Generate visual representations of alignments and variants.
· Prepare detailed reports for interpretation and publication.


Transcriptome (RNA) sequencing
• Long RNAs are first converted into a library of
cDNA fragments through either RNA
fragmentation or DNA fragmentation.
• Sequencing adaptors are subsequently
added to each cDNA fragment and a short
sequence is obtained from each cDNA using
high-throughput sequencing technology.
• The resulting sequence reads are aligned with the
reference genome or transcriptome, and
classified as three types: exonic reads, junction
reads and poly(A) end-reads.
• These three types are used to generate a
base-resolution expression profile for each gene.
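The classification of reads and the base-resolution profile can be sketched as follows. This is a toy illustration: the coordinate convention (1-based, inclusive), the `has_polya_tail` flag and both function names are assumptions, and real pipelines derive this information from spliced alignments.

```python
def classify_read(blocks, exons, has_polya_tail=False):
    """Label a read as 'polyA', 'junction', 'exonic' or 'other'.

    blocks: list of (start, end) aligned segments; a read spanning an
            exon-exon junction has more than one block.
    exons:  list of (start, end) exon coordinates for the gene.
    """
    def inside_exon(block):
        return any(s <= block[0] and block[1] <= e for s, e in exons)

    if has_polya_tail:
        return "polyA"
    if len(blocks) > 1 and all(inside_exon(b) for b in blocks):
        return "junction"
    if len(blocks) == 1 and inside_exon(blocks[0]):
        return "exonic"
    return "other"


def expression_profile(reads, gene_start, gene_end):
    """Count how many aligned blocks cover each base of the gene."""
    depth = [0] * (gene_end - gene_start + 1)
    for blocks in reads:
        for start, end in blocks:
            for pos in range(max(start, gene_start), min(end, gene_end) + 1):
                depth[pos - gene_start] += 1
    return depth


exons = [(1, 5), (11, 15)]
reads = [[(2, 4)], [(3, 5), (11, 13)]]
labels = [classify_read(blocks, exons) for blocks in reads]
profile = expression_profile(reads, 1, 15)
```

The first read is exonic and the second spans the junction between the two exons; the profile is zero across the intron (positions 6 to 10) and non-zero within the exons.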
Exome sequencing
