Module-1 - Merged - Bioinformatics
1. Definition of gene
2. How does a gene work?
3. Definition of protein
4. Gene and protein relationship
5. Genetic fine structure
1. Gene
[Figure: DNA → (transcription) → mRNA → (translation at the ribosome) → protein (enzymes, structural proteins) → trait]
How does mRNA code for proteins?
DNA: TACGCACATTTACGTACGCGG (4 bases: A, T, C, G)
mRNA: AUGCGUGUAAAUGCAUGCGCC (4 bases: A, U, C, G)
protein: Met Arg Val Asn Ala Cys Ala
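The slide's example can be reproduced with a short Python sketch. It assumes the DNA strand shown is the template strand, ignores strand direction as the slide does, and uses a deliberately tiny codon table that covers only the seven codons in this example (a full table has 64 entries). The function names are illustrative only.

```python
# Codon table restricted to the codons that occur in the slide's example.
CODON_TABLE = {
    "AUG": "Met", "CGU": "Arg", "GUA": "Val", "AAU": "Asn",
    "GCA": "Ala", "UGC": "Cys", "GCC": "Ala",
}

# Base-pairing rules used when the template DNA strand is copied into mRNA.
DNA_TO_MRNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_dna: str) -> str:
    """Return the mRNA complementary to a template DNA strand."""
    return "".join(DNA_TO_MRNA[base] for base in template_dna)


def translate(mrna: str) -> list:
    """Read the mRNA three bases at a time and look up each codon."""
    usable = len(mrna) - len(mrna) % 3  # ignore an incomplete final codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]


mrna = transcribe("TACGCACATTTACGTACGCGG")
protein = translate(mrna)
print(mrna)               # AUGCGUGUAAAUGCAUGCGCC
print(" ".join(protein))  # Met Arg Val Asn Ala Cys Ala
```

Each group of three mRNA bases (a codon) selects one amino acid, which is why the 21-base example yields a 7-residue peptide.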
Prokaryote vs. Eukaryote genes
Prokaryotes            Eukaryotes
DNA in cytoplasm       DNA in nucleus
circular chromosome    linear chromosomes
naked DNA              DNA wound on histone proteins
no introns             introns vs. exons
Eukaryotic DNA: the introns come out!
intron = noncoding (in-between) sequence
exon = coding (expressed) sequence
Genomics Proteomics &
Bioinformatics
Module 1
Topic: Polymorphism & types
• Before we begin the discussion of genetic polymorphism, it is essential that we understand what a gene is.
• A gene can be defined as a segment of the DNA that specifies the sequence of amino acids in a
particular protein.
• For most of the life cycle of a cell, the DNA directs the cellular functions and exists in an uncoiled, granular form.
• At some point in the life cycle, however, the normal activities of the cell cease and the cell divides.
• Cell division results in the production of new cells.
• At this stage, the DNA is highly coiled and visible under a microscope as discrete structures called chromosomes.
• During the early stage of cell division, when the chromosomes become visible, each is made up of two strands (two DNA molecules) joined together at a constricted area called the centromere.
• Chromosomes are present in pairs.
• Humans have 46 chromosomes, or 23 pairs.
• Of these, 22 pairs are autosomes and the remaining pair consists of the sex chromosomes, X and Y. The sex chromosomes determine whether an individual is male or female.
• The autosomes are responsible for all physical characteristics of an individual except primary sex
determination.
• Human cells contain around 28000-30000 genes (Deloukas et al., 1998). These genes code for
important and necessary information that determine molecular traits that are passed on from parents to
their offspring.
• Genes encode various traits like hair color, eye color, skin color, hair texture, etc.
• A change in the DNA sequence changes the genetic information, which can in turn change the phenotypic expression and the associated biological function.
• A change in the DNA sequence is known as a mutation.
• Physical anthropologists are concerned with understanding visible human variation: they are interested not only in identifying the factors that produce visible physical variation but also in the underlying genetic determinants that dictate it.
• Genetic variation arises from differences between the DNA sequences found in a population and the wild type form.
• Every individual has two copies of the genome, one maternal and one paternal.
• Therefore, at each genetic location (locus), the maternal and paternal alleles can have either an identical DNA sequence or slightly differing DNA sequences.
• The wild type form in a population refers to individuals with normal phenotype. The wild type form is
usually possessed by the majority of the individuals in the population.
• In contrast, the mutant type refers to individuals with a phenotype that varies from the normal population.
• An individual is referred to as homozygous at a locus if the alleles on both chromosomes are identical, or heterozygous if they differ.
POLYMORPHISM
• The term “polymorphism” combines two Greek words, “poly” meaning multiple and “morph” meaning form. It can be defined as a Mendelian trait that exists in a population in at least two different forms. Ford (1940) defines genetic polymorphism as the occurrence together in the same habitat of two or more discontinuous forms or phases of a species in such proportions that the rarest of them cannot be maintained merely by recurrent mutation.
• In simpler words, genetic polymorphism refers to the occurrence in the same population of two or more alleles at the same locus, such that the frequency of the rarer allele is greater than one percent and the rarer allele is maintained in the population not merely by recurrent mutation (Cavalli-Sforza and Bodmer, 1971).
• In a nutshell, polymorphisms can be defined as variations in the DNA sequence that are present in the population at a frequency greater than 1 percent.
• In other words, if the frequency of a mutation is more than 1 percent in a population, it is a polymorphism. Insertion-deletion polymorphisms, single nucleotide polymorphisms, and restriction site polymorphisms (restriction fragment length polymorphisms) are some examples of genetic polymorphisms.
• Genetic polymorphism refers to the occurrence of multiple forms or variants of a particular gene within a
population. These variations are encoded in the DNA sequences and can manifest through differences in
nucleotide bases or changes in the structure and function of the gene.
• Genetic polymorphism can lead to diverse phenotypic characteristics and traits observed among individuals.
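The 1 percent criterion described above can be illustrated with a small Python sketch. It assumes the input is simply a list of the alleles observed at one locus across sampled chromosomes; the function names and the sample counts are hypothetical, chosen only for illustration.

```python
from collections import Counter

POLYMORPHISM_THRESHOLD = 0.01  # the 1 percent cut-off quoted in the text


def allele_frequencies(alleles):
    """Return the relative frequency of each allele observed at one locus."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_polymorphic(alleles, threshold=POLYMORPHISM_THRESHOLD):
    """True if at least two alleles each exceed the threshold frequency.

    This is equivalent to asking whether the rarer of the common alleles
    is present in more than 1 percent of the sampled chromosomes.
    """
    freqs = allele_frequencies(alleles)
    common = [allele for allele, f in freqs.items() if f > threshold]
    return len(common) >= 2


# Hypothetical samples: 200 and 1000 chromosomes typed at a single locus.
print(is_polymorphic(["A"] * 190 + ["G"] * 10))  # G at 5 percent: True
print(is_polymorphic(["A"] * 995 + ["G"] * 5))   # G at 0.5 percent: False
```

Note that the frequency test alone does not check the second part of the definition, namely that the rarer allele is maintained by something other than recurrent mutation.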
Definition:
Genetic polymorphism is a term used in genetics to describe multiple forms
of a single gene that exist in an individual or among a group of individuals
(Philips, 2016).
• ABO Blood Group System: The ABO blood group system is a classic example of genetic polymorphism. It
is determined by variations in the ABO gene, which results in the expression of different surface antigens
on red blood cells. The system includes four main blood types: A, B, AB, and O, with individuals having
different combinations of antigens, leading to diverse blood groups.
• Human Leukocyte Antigen (HLA) System: The HLA system is a highly polymorphic group of genes
involved in the immune response. Variations in HLA genes influence an individual's susceptibility to certain
autoimmune diseases, transplantation compatibility, and defense against infectious agents.
• Melanocortin-1 Receptor (MC1R) Gene: The MC1R gene is responsible for determining the production and
type of melanin, influencing hair and skin pigmentation. Genetic polymorphism in the MC1R gene
contributes to variations in hair color, ranging from red and blond to brown and black.
• CYP2D6 Gene: The CYP2D6 gene encodes an enzyme involved in drug metabolism. Genetic
polymorphism in this gene affects an individual's ability to metabolize certain medications, leading to
variations in drug response and potential adverse effects.
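To make the ABO example above concrete, the following Python sketch maps a pair of ABO alleles to a blood type. It assumes the standard three-allele simplification (A and B codominant, O recessive); the function name and the single-letter allele codes are illustrative choices, not notation from these notes.

```python
VALID_ALLELES = {"A", "B", "O"}


def abo_blood_type(allele_1: str, allele_2: str) -> str:
    """Map a pair of ABO alleles ('A', 'B' or 'O') to a blood type."""
    alleles = {allele_1, allele_2}
    if not alleles <= VALID_ALLELES:
        raise ValueError("alleles must be 'A', 'B' or 'O'")
    if alleles == {"A", "B"}:
        return "AB"  # A and B are codominant: both antigens are expressed
    if "A" in alleles:
        return "A"   # AA or AO: the O allele adds no antigen
    if "B" in alleles:
        return "B"   # BB or BO
    return "O"       # OO: neither antigen is present


for pair in [("A", "A"), ("A", "O"), ("B", "O"), ("A", "B"), ("O", "O")]:
    print(pair, "->", abo_blood_type(*pair))
```

Six possible genotypes collapse into the four blood types listed above, which is why the same blood group can hide different allele combinations.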
Causes of Genetic Polymorphism
• Mutation: Genetic mutations, such as point mutations, insertions, deletions, and chromosomal
rearrangements, are a fundamental source of genetic polymorphism. These mutations can
occur spontaneously or due to environmental factors, chemical exposure, or errors during DNA
replication.
• Genetic Drift: Random changes in allele frequency within small populations can lead to the
emergence and maintenance of genetic polymorphism. Genetic drift is particularly significant in
isolated populations or those with limited gene flow.
• Natural Selection: Environmental factors and selective pressures can favor certain alleles over
others, influencing the prevalence of genetic polymorphism. For instance, alleles that confer
advantages in adapting to specific environments or provide resistance to diseases may become
more common in a population over time.
• Gene Flow: Gene flow occurs when genetic material is exchanged between
different populations through migration and interbreeding. It can introduce new alleles into a
population, contributing to genetic polymorphism.
• Non-random Mating: Certain mating patterns, such as assortative mating (choosing partners
with similar traits) or disassortative mating (choosing partners with dissimilar traits), can lead to
genetic polymorphism by altering the distribution of alleles within a population.
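Of the causes listed above, genetic drift is the easiest to explore numerically. The sketch below is a minimal simulation under stated assumptions: a diploid population of constant size, one locus with two alleles, random mating, and no mutation, selection, or gene flow (a Wright-Fisher style model, which these notes do not name). The function and parameter names are hypothetical.

```python
import random


def simulate_drift(pop_size, start_freq, generations, seed=0):
    """Track the frequency of one allele under random sampling alone.

    Each generation, all 2 * pop_size gene copies are drawn at random
    from the previous generation's allele frequency.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    n_copies = 2 * pop_size
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        freq = count / n_copies
        history.append(freq)
    return history


small = simulate_drift(pop_size=10, start_freq=0.5, generations=50, seed=1)
large = simulate_drift(pop_size=1000, start_freq=0.5, generations=50, seed=1)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

Running it with different seeds typically shows much larger swings in the small population, in line with the statement that drift matters most in small or isolated populations.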
Types of polymorphism
SNPs may also occur in regulatory regions of genes. These SNPs are capable of changing the amount or timing of protein production.
Variable Number of Tandem Repeats (VNTR)
Nucleotide repeat polymorphism
• These are arrays of core units of 2 or more base pairs, located adjacent to each other in the non-coding region of the genome.
• Variable number of tandem repeats refers to a condition where the number of copies of the core unit is variable, that is, it differs between individuals.
• On the basis of the size of the core unit, they can be categorized as either minisatellites or microsatellites.
• a) Minisatellites are collections of moderately sized arrays of tandemly repeated DNA sequences, with core units usually 10-60 base pairs long, that are dispersed over considerable portions of the nuclear genome.
• For example: GATACCCCAAAG GATACCCCAAAG GATACCCCAAAG is an array of a 12-nucleotide repeat; such arrays range from 3 to 20 kbp.
• b) Microsatellites, or short tandem repeats (STRs), are small arrays of tandem repeats of a simple nucleotide sequence, usually less than 10 base pairs long. For example, GATA GATA GATA GATA GATA GATA has 4 bases repeated 6 times. TA TA TA TA TA TA is a dinucleotide repeat and TAT TAT TAT TAT TAT is a trinucleotide repeat. STRs were first used in the Persian Gulf War in 1991 for the identification of human remains.
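The repeat examples above can be checked with a short Python sketch that counts back-to-back copies of a core unit and assigns it to a size class. The size limits are the ones quoted in these notes (under 10 bp for microsatellites, 10-60 bp for minisatellites); the function names are hypothetical.

```python
def count_tandem_repeats(sequence: str, motif: str) -> int:
    """Return the longest run of back-to-back copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    sequence = sequence.replace(" ", "").upper()  # allow spaced slide notation
    motif = motif.upper()
    best = 0
    for start in range(len(sequence)):
        run = 0
        pos = start
        while sequence.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
    return best


def classify_core_unit(motif: str) -> str:
    """Assign a core unit to a size class using the ranges given above."""
    length = len(motif)
    if length < 10:
        return "microsatellite (STR)"
    if length <= 60:
        return "minisatellite"
    return "outside the size ranges given above"


print(count_tandem_repeats("GATA GATA GATA GATA GATA GATA", "GATA"))  # 6
print(count_tandem_repeats("TAT TAT TAT TAT TAT", "TAT"))             # 5
print(classify_core_unit("GATACCCCAAAG"))                             # minisatellite
```

Because individuals differ in the number of copies at a given locus, the repeat count returned for each locus is the quantity that is compared between samples in identification work.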
Another class of polymorphism is the simple sequence repeats, with core units of 1-4 bp.
Insertion and deletion polymorphism:
• Another category of gene polymorphism involves insertions or deletions. Insertions and deletions can be as small as 1 base, in which case they may also be classified as single nucleotide polymorphisms, but they can also consist of a few bases, one or more exons, or even a whole gene.
• These are genetic variations in which a sequence of DNA is either inserted into or deleted from the gene.
• Insertion-deletion polymorphisms occur only about one tenth as frequently as SNPs.
• Studies have shown that around 90 percent of insertions and deletions involve only 1-10 nucleotides.
• Around 9 percent involve 11 to 100 nucleotides, whereas only 1 percent involve more than 100 nucleotides (Mullaney et al., 2010).
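The size classes quoted above can be applied to individual variants with a small Python sketch. It assumes each variant is described by a reference allele and an alternative allele written as plain strings (a convention borrowed from variant files, not something stated in these notes); the function names are hypothetical.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a variant by comparing reference and alternative allele lengths."""
    if len(ref) == len(alt):
        return "substitution"
    return "insertion" if len(alt) > len(ref) else "deletion"


def indel_size_class(ref: str, alt: str) -> str:
    """Place an insertion or deletion in one of the size classes above."""
    size = abs(len(alt) - len(ref))
    if size == 0:
        raise ValueError("not an insertion or deletion")
    if size <= 10:
        return "1-10 nt"    # around 90 percent of indels
    if size <= 100:
        return "11-100 nt"  # around 9 percent
    return ">100 nt"        # around 1 percent


print(classify_variant("A", "ATG"), indel_size_class("A", "ATG"))
print(classify_variant("ATG", "A"), indel_size_class("ATG", "A"))
```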
• Among microsatellite (tandem repeat) variations, the common forms are the dinucleotide and trinucleotide repeats.
Topic: Biological database
Biological databases
• Biological databases are libraries of life sciences
information, collected from scientific experiments,
published literature, high-throughput experiment
technology, and computational analysis.
• They contain information from research areas
including genomics, proteomics, metabolomics, microarray
gene expression, and phylogenetics.
• Information contained in biological databases includes gene
function, structure, localization (both cellular and
chromosomal), clinical effects of mutations as well as
similarities of biological sequences and structures.
• Biological knowledge is distributed among many different
general and specialized databases. This sometimes makes it
difficult to ensure the consistency of information.
• Integrative bioinformatics is one field attempting to tackle this problem by providing unified access. One solution is for biological databases to cross-reference other databases by accession number, linking their related knowledge together.
• Relational database concepts from computer science and information retrieval concepts from digital libraries are important for understanding biological databases.
• Biological database design, development, and long-term management are a core area of the discipline of bioinformatics. Data contents include gene sequences, textual descriptions, attributes and ontology classifications, citations, and tabular data.
• These are often described as semi-structured data, and can be represented as tables, key-delimited records, and XML structures.
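The three representations mentioned above can be compared side by side with a short Python sketch. The record, its accession number, and its field names are invented for illustration and do not come from any real database; only the standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the accession and field names are made up.
record = {
    "accession": "EX000001",
    "organism": "Homo sapiens",
    "sequence": "ATGCGT",
}


def as_table_row(rec, columns):
    """One tab-separated row, as in a relational or tabular export."""
    return "\t".join(rec[column] for column in columns)


def as_key_delimited(rec):
    """One 'key: value' line per field, as in flat-file records."""
    return "\n".join(f"{key}: {value}" for key, value in rec.items())


def as_xml(rec):
    """A small XML document with one child element per field."""
    root = ET.Element("record")
    for key, value in rec.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


print(as_table_row(record, ["accession", "organism", "sequence"]))
print(as_key_delimited(record))
print(as_xml(record))
```

The same accession field appears in all three forms, which is what allows one database to cross-reference a record held in another.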
Classification:
• Biological databases can be broadly classified into
sequence, structure and functional databases.
• Nucleic acid and protein sequences are stored in sequence
databases and structure databases store solved structures
of RNA and proteins.
• Functional databases provide information on the
physiological role of gene products, for example enzyme
activities, mutant phenotypes, or biological pathways.
• Model Organism Databases are functional databases that
provide species-specific data.
• Databases are important tools in assisting scientists to
analyze and explain a host of biological phenomena from
the structure of biomolecules and their interaction, to the
whole metabolism of organisms and to understanding
the evolution of species.
• This knowledge helps in the fight against diseases, in the development of medications, in predicting certain genetic diseases, and in discovering basic relationships among species in the history of life.
This topic deals with how researchers throughout the years have
addressed the problem of how to sequence DNA, and the
characteristics that define each generation of methodologies for
doing so.
First-Generation Sequencing (Sanger Sequencing)
• Disrupt the cell structure: Break open the cells to create a lysate
• Separate the DNA: Separate the DNA from other components of the cell, like proteins and lipids
• Wash away contaminants: Wash away proteins and other contaminants from the matrix
The method used to extract DNA depends on the sample type and the downstream application, but some common methods include:
• Precipitation
• Purification
• Concentration
When choosing a DNA extraction method, it is important to consider the quality and quantity of DNA required, as well as time, cost, and other factors.
To ensure the best quality DNA, reduce the time between collecting and storing the sample to prevent enzymatic degradation.
Methods of preparing genomic DNA for sequencing
Genomics Proteomics &
Bioinformatics – 21BT54
Module 1
Topic: DNA sequencing methods
• DNA sequencing is the process of determining the exact order of nucleotides within a DNA molecule. This method is used to determine the order of the four bases, adenine (A), guanine (G), cytosine (C), and thymine (T), in a strand of DNA. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research.
• Maxam and Gilbert: chemical sequencing
• Sanger: chain termination sequencing
• These two are the conventional methods.
Purpose of DNA sequencing:
• Compare genes or specific sequences to find differences and similarities.
• Classify organisms and make disease diagnoses.
• Through DNA sequencing we can find out exactly where in a genome or a gene a mutation lies.
1. Sanger Sequencing:
• The enzymatic method is called the Sanger method. It was developed by Frederick Sanger and his colleagues in 1977. The development of this technique won Sanger the Nobel Prize in Chemistry in 1980.
• From the 1980s to the mid-2000s, Sanger sequencing was the dominant DNA sequencing platform, and it carried the Human Genome Project (HGP) to successful completion in 2003. Although this technique has largely been replaced by next generation sequencing methods, it is still used today for smaller-scale projects.
• In order to perform the sequencing, one must first convert double
stranded DNA into single stranded DNA. This can be done by denaturing
the double stranded DNA with NaOH.
• A Sanger reaction consists of the following components:
• Single stranded DNA fragment (ssDNA template): a DNA strand to be
sequenced (one of the single strands, which was denatured using NaOH).
• All four deoxyribonucleotide triphosphates: i.e. dATP, dGTP, dTTP and
dCTP
• NOTE: At least one of the four deoxyribonucleotide triphosphates should be radioactive in each of the four reaction mixture tubes in order to permit the autoradiographic development of DNA bands after gel electrophoresis.
• DNA polymerase: Each incubation tube will also carry
DNA polymerase enzyme (Sequenase) in order to copy
the DNA template by adding nucleotides to the primer
as the synthesis proceeds.
• NOTE: The Klenow fragment is an engineered form of E. coli DNA polymerase I that is used to copy the DNA template. It is obtained by removing the first 323 amino acids of the polypeptide, which carry the 5'→3' exonuclease activity. (The commercial enzyme Sequenase is a modified T7 DNA polymerase rather than a Klenow fragment.)
• DNA primers: The enzyme Sequenase needs a primer from which to start nucleotide synthesis.
• NOTE: A primer should have the following characteristics
• 1. The primer should be either a DNA restriction fragment or short DNA sequence
complementary to the single stranded DNA template.
Disadvantages:
More expensive.
MAXAM & GILBERT PROCEDURE
(Chemical Method)
• Allan Maxam and Walter Gilbert published a DNA sequencing method in 1977
based on chemical modification of DNA and subsequent cleavage at specific bases.
This method allows purified samples of double-stranded DNA to be used without
further cloning.
• Maxam-Gilbert sequencing requires radioactive labelling at 5' end or 3’ end of the
DNA followed by purification of the DNA fragment to be sequenced.
• Procedure (STEPS)
• 1. Radioactive labelling of one end (5' end or 3’ end) of the DNA fragment to be
sequenced by a kinase reaction using 32P.
• 2. Cut the DNA fragment with specific restriction enzyme, resulting in two unequal
DNA fragments
• 4. Cleave the DNA strand at specific positions using chemical reactions. For example, we can use one of two chemicals followed by the addition of piperidine. Dimethyl sulphate (DMS) selectively attacks purines (A and G), while hydrazine selectively attacks pyrimidines (C and T). This is called the modification step.
• 5. Chemical treatment generates breaks at the four
nucleotide bases in the four reaction mixtures (G, A+G, C, and
C+ T).
• Reagent mixtures
• 1. Reagent G: It breaks the DNA chain after guanine (G) base
• 2. Reagent A+G: It breaks the DNA chain after adenine (A) and
guanine (G) bases
• 3. Reagent C: It breaks the DNA chain after cytosine (C) base
• 4. Reagent C+T: It breaks the DNA chain after cytosine (C) and thymine (T) bases
• The concentration of the modifying chemicals is controlled to introduce,
on an average, one modification per DNA molecule. Thus a series of
labelled fragments is generated, starting from the radiolabeled end to the
first "cut" site in each molecule. As a result, we have several differently
sized DNA strands in four reaction tubes.
• Fragments are subjected to
electrophoresis in high-resolution
acrylamide gels for size-based
separation.
• To visualize the fragments, the gel is
exposed to X-ray film for
autoradiography, which yields a series
of dark bands, each corresponding to a
radiolabeled DNA fragment, from which
the nucleotide sequence may be
inferred. In the gel, the fragments are
ordered by size and, thus, we can
deduce the sequence of the DNA
molecule.
• DNA sequencing evaluation: Reading the gel
• 1. The gel is read from bottom to top
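The logic of reading the four lanes (G, A+G, C, C+T) from bottom to top can be sketched in Python. This is a simplified illustration, not a laboratory protocol: it assumes each band is given as a fragment length, that every position produces a band, and that a band at length n reports the base at position n counted from the labelled end. The function names are hypothetical.

```python
def read_gel(lanes):
    """Deduce a sequence from the band positions in the four lanes.

    lanes maps 'G', 'A+G', 'C' and 'C+T' to sets of fragment lengths.
    """
    positions = sorted(set().union(*lanes.values()))
    sequence = []
    for pos in positions:  # bottom of the gel (shortest fragment) first
        if pos in lanes["G"]:
            sequence.append("G")  # band in both G and A+G lanes
        elif pos in lanes["A+G"]:
            sequence.append("A")  # band in A+G only
        elif pos in lanes["C"]:
            sequence.append("C")  # band in both C and C+T lanes
        elif pos in lanes["C+T"]:
            sequence.append("T")  # band in C+T only
    return "".join(sequence)


def simulate_lanes(sequence):
    """Build the idealised band pattern that a known sequence would give."""
    lanes = {"G": set(), "A+G": set(), "C": set(), "C+T": set()}
    for pos, base in enumerate(sequence, start=1):
        if base in "GA":
            lanes["A+G"].add(pos)
        if base == "G":
            lanes["G"].add(pos)
        if base in "CT":
            lanes["C+T"].add(pos)
        if base == "C":
            lanes["C"].add(pos)
    return lanes


print(read_gel(simulate_lanes("GATTACA")))  # GATTACA
```

The key point is the subtraction between lanes: A is inferred from a band that appears in the A+G lane but not in the G lane, and T from a band in the C+T lane but not in the C lane.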
Steps:
Evaluate raw sequence data for quality and anomalies.
Trim adapters and low-quality bases using tools like Trimmomatic or Cutadapt.
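The idea behind trimming low-quality bases can be shown with a minimal Python sketch. It is not the algorithm used by Trimmomatic or Cutadapt; it simply removes bases from the 3' end of a read until one meets a quality cut-off. It assumes Phred+33 quality encoding, and the threshold of 20 and the function names are illustrative choices.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def trim_low_quality_tail(sequence, quality_string, min_quality=20):
    """Remove bases from the 3' end until one reaches min_quality."""
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string differ in length")
    scores = phred_scores(quality_string)
    end = len(sequence)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# 'I' encodes Phred 40; '#' encodes Phred 2; '!' encodes Phred 0.
print(trim_low_quality_tail("ACGTACGT", "IIIIII#!"))  # ('ACGTAC', 'IIIIII')
```

Real trimming tools also handle adapters, sliding windows, and paired reads, so this sketch covers only the simplest part of the step.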
2. Read Alignment/Mapping
Objective: Align short reads to a reference genome to determine their
original location.
Tools:
BWA (Burrows-Wheeler Aligner): Efficiently maps short reads to a reference.
HISAT2: Specialized for RNA-seq alignment but also usable for DNA.
Steps:
Align cleaned reads to the reference genome.
GATK (Genome Analysis Toolkit): Includes tools for base quality score
recalibration (BQSR) and indel realignment.
Steps:
Mark duplicates: Remove duplicate reads to prevent bias.
Steps:
Call variants using tools like GATK’s HaplotypeCaller or UnifiedGenotyper.
Steps:
Apply filtering to retain high-confidence variants.
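A filtering step of this kind can be sketched in Python as a simple threshold test. The record fields, the thresholds (quality 30, depth 10), and the example variants are all hypothetical; real pipelines choose their filters according to the data and the variant caller used.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Minimal variant record with hypothetical fields for illustration."""
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float  # variant quality score
    depth: int   # number of reads covering the site


def filter_variants(variants, min_qual=30.0, min_depth=10):
    """Keep only variants that pass both the quality and depth thresholds."""
    return [
        v for v in variants
        if v.qual >= min_qual and v.depth >= min_depth
    ]


calls = [
    Variant("chr1", 101, "A", "G", qual=55.0, depth=32),
    Variant("chr1", 250, "C", "T", qual=12.0, depth=40),  # low quality
    Variant("chr2", 77, "G", "GA", qual=60.0, depth=4),   # low depth
]
for variant in filter_variants(calls):
    print(variant.chrom, variant.pos, variant.ref, variant.alt)
```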
Steps:
Assess the impact of variants on gene function.
Steps:
Generate visual representations of alignments and variants.