Unit 8

Uploaded by

Janani K
Copyright
© © All Rights Reserved
Unit – 8

Bioinformatics
Introduction
• Bioinformatics is defined as an interdisciplinary field involving biology,
computer science, mathematics, and statistics to analyse biological
sequence data, genome content and arrangement, and to predict the
function and structure of macromolecules.
• Components of bioinformatics
• Development of new algorithms and statistics for assessing the relationships
among large sets of biological data
• Application of these tools for the analysis and interpretation of the various
biological data including nucleotide sequences, amino acid sequences, etc.
• Development of databases for efficient storage, access and
management of the large body of biological information.
cDNA LIBRARY
Definition
• A population of bacterial transformants or phage lysates in which
each mRNA isolated from an organism or tissue is represented as its
cDNA insertion in a plasmid or a phage vector
• Frequency of a specific cDNA in such a library would ordinarily
depend on the frequency of the concerned mRNA in the
tissue/organism in question
Isolation of mRNA
• Total RNA is first isolated from the desired organism/tissue
• The amount of desired mRNA is increased generally by any of the
following methods:
• Chromatography on poly-U sepharose or oligo-dT cellulose, which retains
mRNA molecules, since they have 3’ poly-A tails
• Density gradient centrifugation
• Antibodies prepared against the protein encoded by that mRNA are used to
precipitate polysomes containing the mRNA and primary transcript
Cloning of cDNAs
• Cloned in phage insertion vectors (since they afford high efficiency
packaging in vitro and give large number of cDNA clones)
• Typically 10^5-10^6 cDNA clones are obtained
Properties of cDNA and cDNA library
• cDNAs are free from intron sequences
• Smaller in size than genes
• A comparison of the cDNA sequence with the corresponding genome
sequence permits the delineation of introns/exons boundaries
• The contents of cDNA libraries from a single organism will vary widely
depending on the developmental stage and the cell type used for the
preparation of the library. A genomic library, by contrast, remains the
same regardless of stage or cell type
• A cDNA library will be enriched for abundant mRNA but may contain
only a few or no clones representing rare mRNAs
Applications
• Expression studies of eukaryotic genes in prokaryotes
• Generation of expressed sequence tags (ESTs), which is highly useful
in high throughput genome research
Expressed Sequence Tags (ESTs)
• Full-length cDNAs are desirable for accurate and detailed mapping of
genomic transcriptional units
• However, short cDNA sequences can also be used to identify specific genes
unambiguously, to map them on physical gene maps and to
provide information about their expression patterns.
• Expressed sequence tags (ESTs) are 200-300 bp long cDNA sequences,
generated by picking thousands of random clones from cDNA libraries and
subjecting them to single-pass sequencing
• Vast majority of cDNA sequences in databases are ESTs rather than full
length cDNAs or genomic clones
• dbEST is an EST database in the public domain
GENOMIC LIBRARY
• Collection of plasmid clones or phage lysates containing recombinant
DNA molecules so that the sum total of DNA inserts in this collection,
ideally, represents the entire genome of the concerned organism
• Despite the care taken in preparing genomic libraries, certain DNA
fragments should be expected to be under- or over-represented, or even
missing
• The reason might be due to
• Certain fragments might replicate slowly
• Certain fragments might have been altered by recombinational events during
cloning
Construction of Genomic Library
• Extraction of total genomic DNA
• DNA is broken into fragments by
• Mechanical shearing
• Sonication
• Use of appropriate restriction enzymes
• It is desirable to cut the DNA randomly so that the DNA fragments (=
clones) overlap one another and no sequence of the genome is
systematically excluded
• Single or mixed digestion with the enzymes AluI (AG/CT), HaeIII (GG/CC)
or Sau3AI (/GATC) has been used successfully
• Agarose gel electrophoresis of the partial digest is done
• The separated fragments are introduced in the appropriate vector for
cloning
• Transformation of host with recombinant vectors
• The minimum size of a genomic library depends on
• Complexity of genome
• Size (or length) of the DNA insert (smaller the fragment size, larger the no. of
clones)
• Probability of gene being represented in the library
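The three factors above are commonly combined in the Clarke-Carbon formula, N = ln(1 - P) / ln(1 - f), where f is the insert size divided by the genome size. The formula is not stated in the slides; the sketch below assumes it, and the genome and insert sizes are illustrative values only.

```python
import math

def clones_needed(genome_size_bp, insert_size_bp, probability):
    """Clarke-Carbon estimate of the number of clones N needed so that any
    given sequence is present in the library with the chosen probability."""
    # Fraction of the genome carried by a single clone
    fraction = insert_size_bp / genome_size_bp
    return math.ceil(math.log(1 - probability) / math.log(1 - fraction))

# Illustrative values: a 3.2 billion bp genome, 150 kb inserts, 99% probability
print(clones_needed(3.2e9, 150_000, 0.99))
# Smaller inserts require more clones for the same probability
print(clones_needed(3.2e9, 20_000, 0.99))
```

The second call illustrates the point made above: the smaller the fragment size, the larger the number of clones.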
Sequencing of Genome
Clone by clone sequencing
• Fragments (BAC clones) are first aligned into contigs
• The fragments are then used to create cosmid clones and plasmid
clones, to get smaller sized clones
• These clones are also arranged into contigs
• Sequencing of the contigs is done
Shotgun Sequencing
• Clones are sequenced until all clones in the genomic library are
analysed
• Assembler software organizes the nucleotide sequence information
so obtained into a genome sequence
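As a rough illustration of what assembler software does, the sketch below merges short reads by their longest suffix-prefix overlaps. It is a toy greedy approach chosen for brevity, not the algorithm of any particular assembler; the function names and the minimum overlap of 3 bases are assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b
    (0 if the overlap is shorter than min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left; remaining reads stay separate contigs
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))
```

Real genomes contain repeats and sequencing errors, which is why practical assemblers use far more elaborate methods than this sketch.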
Genome Sequence Compilation
• Genome sequencing necessitated the development of high-throughput
technologies that generate data at a very fast pace
• This led to the recruitment of computers to manage this
flood of information
• Hence, the field of bioinformatics developed
• Bioinformatics is used for
• Compilation of genome sequences
• Identification of genes
• Assigning functions to the identified genes
• Preparation of databases etc.
• In order to ensure that the nucleotide sequence of a genome is
complete and error-free, the genome is sequenced more than once
• For example, the Human Genome Project sequenced 3.2 billion base
pairs of the human genome a total of 12 times
Human Genome Project
At a glance
• The Human Genome
• Salient Features of Human Genome
• What was Human Genome Project(HGP)
• Milestones
• Goals of Human Genome Project
• Issues of concern
• Future Challenges
• Vectors for Large-Scale Genome Project
• Yeasts artificial chromosomes
• Bacterial artificial chromosome
• Difference between YAC and BAC
The Human Genome

• The human genome is the complete set of genetic information for humans
(Homo sapiens).
• The human genome is large and complex, although some organisms have larger genomes.
• Its size spans a length of about 6 feet of DNA, containing more than 30,000
genes.
• The DNA material is organized into a haploid chromosomal set of 22
(autosome) and one sex chromosome (X or Y).

Fig: Human genome sequencing, male and female: 22 autosomes + 2 sex chromosomes (from NCBI)
Salient Features of Human Genome:
❑ The human genome comprises the information of 24
chromosomes (22 autosomes + X chromosome + Y
chromosome); in Homo sapiens 2n = 2x = 46
❑ The human genome contains over 3 billion nucleotide pairs.
❑ The human genome is estimated to have about 30,000 genes.
❑ The average gene consists of 3000 bases, but gene sizes
vary greatly; the largest known human gene, encoding
dystrophin, contains 2.5 million base pairs.
❑ Only about 3% of the genome encodes amino acid
sequences of polypeptides; the rest is "junk" (repetitive
DNA).
❑ The functions are unknown for over 50% of the discovered
genes.
❑The repetitive sequences makeup very large portion of human
genome. Repetitive sequences have no direct coding function but they
shed light on the chromosome structure, dynamics and evolution.
❑Chromosome 1 has most genes (2968) and Y chromosome has the
lowest (231).
❑Almost all nucleotide bases are exactly the same in all people.
Genome sequences of different individuals differ in less than 0.2% of
base pairs.
❑Most of these differences occur in the form of single base differences
in the sequence. These single base differences are called single
nucleotide polymorphisms (SNPs).
❑One SNP occurs at every ~ 1,000 bp of human genome.
❑About 85% of all differences in human DNAs are due to SNPs.
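The SNP figures above can be checked with simple arithmetic, and single-base differences between two aligned sequences can be listed directly. The sketch below assumes two equal-length, already aligned sequences; the function name and the example sequences are illustrative.

```python
def snp_positions(seq_a, seq_b):
    """Positions (0-based) at which two equal-length sequences differ by a single base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# One SNP per ~1,000 bp in a genome of ~3 billion bp
GENOME_SIZE_BP = 3_000_000_000
SNP_SPACING_BP = 1_000
expected_snps = GENOME_SIZE_BP // SNP_SPACING_BP

print(expected_snps)                       # about 3 million SNPs
print(snp_positions("ACGTAC", "ACCTAG"))   # positions where the two sequences differ
```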
Fig: Human Chromosome 1 Genetic Map
What was Human Genome Project (HGP)
• The Human Genome Project was an international research
effort to determine the sequence of the human genome and
identify the genes that it contains.
• The US Human Genome Project was a 13 year effort,
coordinated by the
– Department of Energy (DOE) and
– National Institutes of Health (NIH).
Milestones
1986 The birth of the Human Genome Project.
1990 Project initiated as a joint effort of the US Department of
Energy and the National Institutes of Health.
1994 Genetic Privacy Act: to regulate collection, analysis,
storage and use of DNA samples and genetic information
is proposed.
1996 Wellcome Trust joins the project.
1998 Celera Genomics (a private company founded by Craig
Venter) formed to sequence much of the human genome in
3 years.
1999 Completion of the sequence of Chromosome 22-the first
human chromosome to be sequenced.
2000 Completion of the working draft of the entire human
genome.
2001 Analyses of the working draft are published.
2003 HGP sequencing is completed and Project is declared
Goals of Human Genome Project
1. To identify all the genes in human DNA.
2. To develop a genetic linkage map of human genome.
3. To obtain a physical map of human genome.
4. To develop technology for the management of human
genome information.
5. To know the function of genes.
6. Determine the sequences of the 3 billion chemical base
pairs that make up human DNA.
7. Store this information in public databases.
8. Develop tools for data analysis.
9. Transfer related technologies to the private sectors.
ISSUES OF CONCERN
Ethical, Legal and Social issues of the Human Genome
Project
• Fairness in the use of genetic information.
• Privacy and confidentiality of genetic information.
• Psychological impact, stigmatization, and discrimination.
• Reproductive issues.
• Clinical issues.
• Uncertainties associated with gene tests for susceptibilities and
complex conditions.
• Fairness in access to advanced genomic technologies.
• Conceptual and philosophical implications.
• Health and environmental issues.
• Commercialization of products.
• Education, Standards, and Quality control.
• Patent issues.
Future Challenges: What We Still Don’t Know

1. Gene number, exact locations, and functions


2. Gene regulation
3. Chromosomal structure and organization
4. Non-coding DNA types, amount, distribution, information
content, and functions
5. Coordination of gene expression, protein synthesis, Proteomes
and post-translational events
6. Predicted vs experimentally determined gene function
7. Evolutionary conservation among organisms
8. Disease-susceptibility prediction based on gene sequence
variation
9. Genes involved in complex traits and multigene diseases
10. Developmental genetics, genomics
Vectors for Large-Scale Genome Project
❑A vector is a DNA molecule that has the ability to replicate
in an appropriate host cell, and into which the DNA insert is
integrated for cloning.
❑A vector must have an origin of DNA replication (ori).
❑The vector is a vehicle or carrier which is used for cloning
foreign DNA in bacteria.
❑For genome sequencing, DNA fragments of the
genome must first be cloned in appropriate vectors. Two of the
most popular vectors are:
1. Yeast artificial chromosomes (YACs) and
2. Bacterial artificial chromosomes (BACs)
Yeast artificial chromosomes (YACs):
Yeast artificial chromosomes (YACs) are genetically engineered
chromosomes derived from the DNA of the yeast, Saccharomyces
cerevisiae, which is then ligated into a bacterial plasmid.
• YACs were very useful in mapping the human genome because they
could accommodate hundreds of kilobases each.
• YACs containing a megabase (1 million bases) or more are known as "mega
YACs."

A YAC can be considered a self-replicating element, because it includes
three specific DNA sequences:
1. TEL: The telomere, which is located at each chromosome end and
protects the chromosome's ends from degradation by nucleases.
2. CEN: The centromere, which is the attachment site for mitotic spindle
fibers and is necessary for segregation of sister chromatids to opposite
poles of the dividing yeast cell. The centromere is placed adjacent
to the left telomere, and a huge piece of human (or any other) DNA
3. ORI: Replication origin sequences, which are specific DNA sequences
that allow DNA replication.

It also contains a few other specific sequences, such as:
⮚Selectable markers (A and B) that allow the easy isolation of yeast
cells that have taken up the artificial chromosome.
⮚Recognition sites for the two restriction enzymes EcoRI and BamHI.
Cloning genomic DNA into a YAC
1. Genomic DNA is partially digested with a restriction enzyme (EcoRI).
2. The YAC is digested by the two restriction enzymes EcoRI and BamHI.
3. These two elements recombine at the EcoRI sites of the YAC and are
covalently linked by DNA ligase.
4. A recombinant YAC vector, a yeast artificial chromosome with genomic
DNA inserted, is produced. The YAC vector can then be introduced into
yeast cells, where it generates an unlimited number of copies.
Fig: Cloning of genomic DNA into a YAC
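The digestion step can be sketched in code. EcoRI recognises GAATTC and cuts after the first G (G/AATTC). The sketch below assumes a complete digest of one strand of a linear molecule, whereas the protocol above uses a partial digest; the function name and the example sequence are illustrative.

```python
ECORI_SITE = "GAATTC"  # EcoRI recognition sequence
CUT_OFFSET = 1         # EcoRI cuts between G and AATTC

def digest(sequence, site=ECORI_SITE, cut_offset=CUT_OFFSET):
    """Cut a linear sequence at every occurrence of the recognition site
    (complete digest) and return the fragments in order."""
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments

print(digest("AAGAATTCTTTGAATTCGG"))
```

In a partial digest only some of the sites are cut, which yields larger, overlapping fragments suited to large-insert vectors such as YACs.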
Bacterial artificial chromosome (BAC)
A bacterial artificial chromosome (BAC) is an engineered DNA
molecule used to clone DNA segments in bacterial cells (E. coli).
It is based on the well-known natural F plasmid, which inhabits E. coli
cells and allows conjugation between bacterial cells.
• Segments of an organism's DNA, ranging from 150 to
about 300 kilobase pairs, can be inserted into BACs.
• These vectors are stably maintained in
vivo and in vitro.
• Their copy number is about two per cell.
• They are extensively used in the analysis of large genomes, but the
main disadvantage of BAC vectors is the somewhat
laborious construction of BAC libraries.
Common gene components
The bacterial artificial chromosome is another cloning vector system in E. coli.
The vector pBAC108L, developed by Mel Simon and his colleagues in 1992, has:
❑ HindIII and BamHI: the cloning sites
❑ CmR: the chloramphenicol resistance gene, used as a selection tool.
❑ oriS: the origin of replication
❑ repE: for plasmid replication and regulation of copy number.
❑ ParA and ParB: the genes governing partition of plasmids to
daughter cells during division, which ensure stable maintenance of
the BAC.

Fig: Map of the BAC vector, pBAC108L


In some conjugation events, the F plasmid itself is
transferred from a donor F+ cell to a recipient F- cell,
converting the latter to an F+ cell.

In other events, a small piece of host DNA is transferred as
an insert in the F plasmid (which is called an F' plasmid if it has an
insert of foreign DNA). And in still other events, the F'
plasmid inserts into the host chromosome and mobilizes the
whole chromosome to pass from the donor cell to the
recipient cell. Thus, because the E. coli chromosome
contains over 4 million bp, the F plasmid can obviously
accommodate a large insert of DNA.
Cloning genomic DNA into a BAC
1. Genomic DNA is isolated from a desired source, and restriction
enzymes are used to cleave the target DNA into fragments.
2. The BAC is digested by restriction enzymes at the cloning sites
HindIII and BamHI.
3. These two elements are joined by DNA ligase and introduced into a
host bacterium.
4. As the bacterial cells grow and divide, they amplify the BAC DNA,
which can then be isolated and used in sequencing DNA.
Fig: BAC as a Cloning vector
Difference between YAC and BAC as vector of genome sequencing
Yeast artificial chromosomes (YACs) | Bacterial artificial chromosomes (BACs)
1. Genetically engineered chromosomes derived from the DNA of the yeast, Saccharomyces cerevisiae. | 1. An engineered DNA molecule used to clone DNA segments in bacterial cells (E. coli).
2. Used for cloning very large (1000-2000 kb) DNA segments. | 2. Used to clone DNA inserts of up to 300 kb.
3. They are less efficient. | 3. They are more efficient.
4. Unlike a BAC library, a YAC library is not so hard to construct. | 4. It is very hard to construct a BAC library.
5. They are unstable. | 5. They are more stable.
6. They tend to contain scrambled inserts, i.e. composites of DNA fragments from more than one site. | 6. They contain pure inserts.
7. The linear YACs tend to break under shearing forces. | 7. The circular, supercoiled BACs resist breakage.
8. They are hard to isolate from yeast cells. | 8. They are easy to isolate.
Rice Genome Project
• First crop genome to be sequenced
• The genome sequencing was performed collaboratively by multiple
nations
• Japan: Chromosomes 1, 6, 7 & 8
• US: Chromosomes 3 & 10
• China: Chromosome 4
• France: Chromosome 12
• Taiwan: Chromosome 5 etc.
• Rice was one of the last clone-by-clone, Sanger-sequenced genomes
• BAC/PAC clones were sequentially selected for sequencing,
independently assembled and then stitched together to form pseudo-
chromosomes.
• A rice chromosome was the first to have its centromere sequenced,
despite the centromere being highly repetitive (64 kbp of satellite repeat).
• Hence, rice has been a model for studies of centromere structure and
function
• The public rice genome, which took advantage of whole-genome
shotgun sequences made available by Monsanto in 2000
and Syngenta in 2002, was published in 2005 (International Rice
Genome Sequencing Project 2005).
• Annotation took place after that.
• Rice was one of the few genomes to have competing annotations
• Rice was chosen for the following reasons
• Smaller genome size
• Can be used as model for other cereal crops with larger genomes, such as
wheat and maize
Contribution of RICE GENOME SEQUENCING
Gene and QTL Cloning
• Gene cloning, and especially cloning of genes underlying QTL, can provide a
deeper molecular understanding of a trait and, in the case of breeding,
markers linked directly to the causative DNA mutation/changes
• In 1995, the first BAC library was published, which was eventually used for
cloning of the Xa-21 resistance gene
• RGS facilitated cloning of a gene underlying a QTL, Sub1, the gene that
confers submergence tolerance
• RGS also helped in exploring the diversity of other genomes across the Oryza
genus
Crop improvement
• The number of molecular markers increased greatly, their physical order was
understood and their proximity to annotated genes was useful to
predict gene-trait associations
• Sequence-based analysis of variation in cultivated and wild rice
allows breeders to better understand and exploit genetic variation.
• Molecular understanding of the genetic basis of traits such as N and P
use is allowing rice researchers to engineer GREEN SUPER RICE that
should help meet the challenge of the growing world population.
Nucleic Acid Databases

10/26/2023 9:53 AM
Biological databases: why?
• Need for storing and communicating large datasets has grown
• Make biological data available to scientists.
• To make biological data available in computer-readable form.

Different classifications of databases
• Type of data
• nucleotide sequences
• protein sequences
• proteins sequence patterns or motifs
• macromolecular 3D structure
• gene expression data
• metabolic pathways

Different classifications of databases….

Primary or derived databases


• Primary databases: experimental results directly into database
• Secondary databases: results of analysis of primary databases
• Aggregate of many databases
• Links to other data items
• Combination of data
• Consolidation of data

Different classifications of databases….

Technical design
• Flat-files
• Relational database (SQL)
• Exchange/publication technologies (FTP, HTML, CORBA, XML,...)

Different classifications of databases….

• Availability
• Publicly available, no restrictions
• Available, but with copyright
• Accessible, but not downloadable
• Academic, but not freely available
• Proprietary, commercial; possibly free for academics

Nucleotide sequence databases
• EMBL, GenBank, and DDBJ are the three primary nucleotide
sequence databases
• EMBL www.ebi.ac.uk/embl/
• GenBank www.ncbi.nlm.nih.gov/Genbank/
• DDBJ www.ddbj.nig.ac.jp

http://www3.oup.co.uk/nar/database/c/

Genbank
• An annotated collection of all publicly available nucleotide and
protein sequences
• Set up in 1979 at the LANL (Los Alamos).
• Maintained since 1992 by NCBI (Bethesda).
• http://www.ncbi.nlm.nih.gov

EMBL Nucleotide Sequence Database

• An annotated collection of all publicly available
nucleotide and protein sequences
• Created in 1980 at the European Molecular Biology
Laboratory in Heidelberg.
• Maintained since 1994 by EBI, Cambridge.
• http://www.ebi.ac.uk/embl.html
http://www3.ebi.ac.uk/Services/DBStats/

DDBJ–DNA Data Bank of Japan
• An annotated collection of all publicly available nucleotide and
protein sequences
• Started in 1984 at the National Institute of Genetics (NIG) in Mishima.
• Still maintained at this institute by a team led by Takashi Gojobori.
• http://www.ddbj.nig.ac.jp

•1984: NIG; the National Institute of Genetics was reorganized as
an Inter-University Research Institute.
DDBJ began to work at NIG.
•1986: DNA Database Advisory Committee organized.
•1987: DDBJ release 1 was provided. By this release, we regard
this year as official start of DDBJ operation.

Source: https://www.ddbj.nig.ac.jp/aboutus-e.html#history
Other NCBI nucleic acids DBs
• EST database: A collection of expressed sequence tags, or short, single-pass
sequence reads from mRNA (cDNA).
• GSS database: A database of genome survey sequences, or short, single-pass
genomic sequences.
• HomoloGene: A gene homology tool that compares nucleotide sequences
between pairs of organisms in order to identify putative orthologs
• Orthologs are genes in different species that evolved from a common ancestral
gene by speciation, and, in general, orthologs retain the same function during the
course of evolution.
• HTG database: A collection of high-throughput genome sequences from large-
scale genome sequencing centers, including unfinished and finished sequences
• SNPs database: A central repository for both single-base nucleotide
substitutions and short deletion and insertion polymorphisms.
• RefSeq: A database of non-redundant reference sequence standards,
including genomic DNA contigs, mRNAs, and proteins for known genes.
Multiple collaborations, both within NCBI and with external groups,
support data-gathering efforts.
• STS database: A database of sequence tagged sites, or short sequences that
are operationally unique in the genome.
• UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).
• UniGene: A collection of ESTs and full-length mRNA sequences organized
into clusters, each representing a unique known or putative human gene
annotated with mapping and expression information and cross-references
to other sources.
Sequence submission
• Submissions through the Internet:
• Web forms.
• Email.
• Sequences shared/exchanged between the 3 centers on a daily basis:
• The sequence content of the banks is identical.

Derived databases
• CUTG: Codon usage tabulated from GenBank
http://www.kazusa.or.jp/codon/
• Genetic Codes: Deviations from the standard genetic code in various
organisms and organelles
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
• TIGR Gene Indices: Organism-specific databases of EST and gene sequences
http://www.tigr.org/tdb/tgi.shtml
• UniGene: Unified clusters of ESTs and full-length mRNA sequences
http://www.ncbi.nlm.nih.gov/UniGene/
• ASAP: Alternative spliced isoforms
http://www.bioinformatics.ucla.edu/ASAP
• Intronerator: Introns and alternative splicing in C. elegans and C. briggsae
http://www.cse.ucsc.edu/~kent/intronerator/

Nucleic acid structure databases
• NDB: Nucleic acid-containing structures
http://ndbserver.rutgers.edu/
• NTDB: Thermodynamic data for nucleic acids
http://ntdb.chem.cuhk.edu.hk/
• RNABase: RNA-containing structures from PDB and NDB
http://www.rnabase.org/
• SCOR: Structural classification of RNA: RNA motifs by structure, function and
tertiary interactions
http://scor.lbl.gov/

PROTEIN SEQUENCE DATABASE

• A protein database can be a sequence database or a structure database.

• Protein sequence database:
• The protein sequence database was developed at the National Biomedical Research
Foundation (NBRF) at Georgetown University by Margaret Dayhoff in the 1960s.
• The protein sequence database was collaboratively maintained by PIR (Protein
Information Resource), JIPID (Japan International Protein Information Database)
and MIPS (Munich Information Centre for Protein Sequences)
PIR (PROTEIN INFORMATION RESOURCE) DATABASE:

• It is a main protein sequence database.
• This database is divided into 4 classes.
• PIR1: Classified and annotated entries
• PIR2: Preliminary entries
• PIR3: Unverified entries
• PIR4: Conceptual translations of sequences that are
not transcribed, that are genetically engineered, etc.
SWISS-PROT
• It is a protein sequence database maintained collaboratively
by the Department of Medical Biochemistry at the University of Geneva and the EMBL.
• The database endeavours to provide
• high-level annotation
• functional description of the protein
• structural description of the domains
• post-translational modifications
• variants and so on.
• Entries are interlinked to many sources and have minimal
redundancy.
TrEMBL:

It was created in 1996 as a computer-annotated supplement to Swiss-Prot.

The database contains translations of all coding sequences.

2 main sections:
SP-TrEMBL: contains entries that have not yet been annotated but will
eventually be incorporated into Swiss-Prot.

REM-TrEMBL: contains entries that will not be included in Swiss-Prot, e.g.
synthetic sequences.
NRL-3D
This database is produced by PIR from sequences extracted
from PDB.
• NRL-3D is used both for similarity searches and
keyword interrogation.
• The ATLAS retrieval system is used to access information
from NRL-3D.
Structural database:
• Structural databases store collections of three-dimensional biological
macromolecular structures of proteins.
• The long-established database for protein structures is the
Protein Data Bank (PDB)
PDB: It contains the following information
• Name of the protein
• The species
• Description of the structure determination
• Amino acid sequence
• Additional information

SCOP: (Structural Classification of Proteins)
• SCOP describes structural and evolutionary relationships between
proteins of known structure.
• Proteins are clustered into families with clear evolutionary relationships if
they have sequence identities of more than 30%.
• Proteins are suggested to have a common fold if they have the same
secondary structures in the same arrangement, whether or not they have a
common evolutionary origin.
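The 30% threshold refers to sequence identity between aligned proteins. The sketch below shows one way to compute it, assuming the two sequences are already aligned and that gapped positions are excluded from the count. Definitions of identity vary between tools, so this convention and the example sequences are assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the aligned, non-gap positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

identity = percent_identity("MKT-AYIA", "MKTQAYLA")
print(round(identity, 1), identity > 30)
```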
CATH DATABASE:
• Class, architecture, topology and homology
• Class is derived from gross secondary structure content and packing.
• Architecture describes the gross arrangement of secondary structures.
• Topology encompasses both overall shape and connectivity of
secondary structures.
• Homology groups domains that share more than 35% sequence
identity and are thought to share a common ancestor.
OTHER DATABASES:
DALI: Based on extraction of similar structures from distance matrices.
CE: Database of structural alignments.
Proteopedia: A collaborative 3D encyclopedia of proteins and other molecules.
OPM: Provides spatial positions of protein three-dimensional structures in the lipid bilayer.
CONSERVED DOMAIN DATABASE: A collection of sequence alignments and
profiles representing protein domains conserved in molecular evolution
Sequence Alignment
• Sequence alignment is used to determine the degree of similarity
between two sequences (pairwise alignment) or among three or more
sequences (multiple sequence alignment) of DNA, RNA or protein

PAIRWISE ALIGNMENT | MULTIPLE SEQUENCE ALIGNMENT
An alignment procedure comparing two biological sequences of either protein, DNA or RNA | An alignment procedure comparing three or more biological sequences of either protein, DNA or RNA
Pairwise alignments can generally be categorized as global or local alignment methods | MSA is generally a global multiple sequence alignment
A comparatively simple algorithm is used | A complex, sophisticated algorithm is used
A general global alignment technique is the Needleman-Wunsch algorithm. A general local alignment method is the Smith-Waterman algorithm | A technique called progressive alignment is employed. In this approach, a pairwise alignment algorithm is used iteratively, first to align the most closely related pair of sequences, then the next most similar one to that pair, and so on.
Examples of pairwise alignment tools: LALIGN, BLAST, EMBOSS Needle, EMBOSS Water | Examples of MSA tools: MUSCLE, T-Coffee, MAFFT, CLUSTALW
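The difference between global (Needleman-Wunsch) and local (Smith-Waterman) alignment can be seen in the dynamic programming recurrences. The sketch below computes only the optimal scores, without traceback, and assumes a simple scoring scheme (match +1, mismatch -1, linear gap -1) chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning a and b end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can start anywhere
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

print(global_score("ACGT", "ACT"))
print(local_score("TTTACGTTT", "GGACGGG"))
```

The only structural differences are the initialisation of the first row and column and the floor at zero in the local version.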
Differences between Local and Global
Alignment (Self Study)
https://www.majordifferences.com/2016/05/difference-between-global-and-local.html#.X6TDMWgzbDc
Applications of Pairwise Seq. Alignment
• Primarily to find out conserved regions between two
sequences
• Similarity searches in a database

Multiple Sequence Alignment
• To detect regions of variability or conservation in a family of proteins
• Phylogenetic analysis (inferring a tree, estimating rates of
substitution, etc.)
• Detection of homology between a newly sequenced gene and an
existing gene family, prediction of protein structure etc.
• Demonstration of homology in multigene families
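Detecting conserved and variable regions amounts to scoring each column of the alignment. The sketch below uses a deliberately simple measure, the fraction of sequences that share the most common residue in a column. The measure, the function name and the example alignment are assumptions for illustration, and gaps are counted like any other character.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores

alignment = ["MKVLA", "MKILA", "MRVLA", "MKVMA"]
print(column_conservation(alignment))
```

Columns scoring 1.0 are fully conserved; lower values mark variable positions.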
Comparison of biological sequences by
BLAST and FASTA
• BLAST and FASTA software compare amino acid and
nucleotide sequences of different species and look for
similarities
• Basic Local Alignment Search Tool (BLAST)
• FAST-A
• FAST-P
BLAST
• Developed in 1990
• Available on the NCBI website
• Input data is in FASTA format
• Output data can be obtained in plain text, HTML or XML
• BLAST works on the principle of searching for localized similarities
between the two sequences; after shortlisting the similar
sequences, it searches for neighbourhood similarities
• The software reports high-scoring matches once a threshold
value is reached.
• BLAST is used for many purposes:
• DNA mapping
• comparing two identical genes in different species
• creating phylogenetic tree
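FASTA format, the input format mentioned above, consists of a header line starting with ">" followed by one or more lines of sequence. A minimal parser can be sketched as follows; the function name and the example records are illustrative.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]        # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record
    return {name: "".join(parts) for name, parts in records.items()}

example = ">seq1 example\nATGC\nGGTA\n>seq2\nTTAA\n"
print(parse_fasta(example))
```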

Class | Amino acids | 1-letter code
Aliphatic | Glycine, Alanine, Valine, Leucine, Isoleucine | G, A, V, L, I
Hydroxyl or sulfur / selenium-containing | Serine, Cysteine, Selenocysteine, Threonine, Methionine | S, C, U, T, M
Cyclic | Proline | P
Aromatic | Phenylalanine, Tyrosine, Tryptophan | F, Y, W
Basic | Histidine, Lysine, Arginine | H, K, R
Acidic and their amides | Aspartate, Glutamate, Asparagine, Glutamine | D, E, N, Q
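The classification above can be expressed as a lookup table, for example to check whether a substitution in an alignment stays within the same class. The dictionary and function names below are illustrative, and treating same-class substitutions as conservative is a simplifying assumption rather than how BLAST scores matches.

```python
AMINO_ACID_CLASSES = {
    "Aliphatic": "GAVLI",
    "Hydroxyl or sulfur/selenium-containing": "SCUTM",
    "Cyclic": "P",
    "Aromatic": "FYW",
    "Basic": "HKR",
    "Acidic and their amides": "DENQ",
}

# Reverse lookup: 1-letter code -> class name
CLASS_OF = {code: name for name, codes in AMINO_ACID_CLASSES.items() for code in codes}

def same_class(residue_a, residue_b):
    """True if both residues belong to the same class in the table above."""
    return CLASS_OF[residue_a] == CLASS_OF[residue_b]

print(CLASS_OF["W"])
print(same_class("L", "I"), same_class("K", "D"))
```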

FASTA
• Developed in 1985 for comparing protein sequences only but was
later modified to conduct searches on DNA also
• FASTA software uses the principle of finding the similarity between
the two sequences statistically
• This software matches one sequence of DNA or protein with the other
by local sequence alignment method

• It searches only local regions for similarity and not the best match
between two sequences
• Since the software compares localized similarities, it can at times come up
with a mismatch
• In a sequence, FASTA takes small parts known as k-tuples, where k
can be from 1 to 6, and matches them with the k-tuples of the other sequence;
once a threshold value of matching is reached, it comes up with the result
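The k-tuple idea can be sketched as follows: index every k-tuple of the query, look up each k-tuple of the subject, and count the hits per diagonal (subject position minus query position). A diagonal with many hits suggests a region of local similarity. This is a simplified illustration of the first stage only, not the full FASTA program; the function names and sequences are assumptions.

```python
from collections import defaultdict

def ktuple_hits(query, subject, k=2):
    """All (query_pos, subject_pos) pairs where the two sequences share a k-tuple."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    hits = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits

def best_diagonal(hits):
    """Diagonal (subject_pos - query_pos) with the most hits, and its hit count."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    return max(counts.items(), key=lambda item: item[1]) if counts else None

hits = ktuple_hits("ACGTAC", "TTACGTAC", k=3)
print(hits)
print(best_diagonal(hits))
```

Larger values of k make the search faster but less sensitive, which is the trade-off behind the 1 to 6 range mentioned above.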
PCR
Denaturation at around 94°C :
During the denaturation, the double strand melts open to single
stranded DNA, all enzymatic reactions stop (for example the extension
from a previous cycle).
Annealing at around 54°C :
Hydrogen bonds are constantly formed and broken between the single
stranded primer and the single stranded template. If the primers
exactly fit the template, the hydrogen bonds are so strong that the
primer stays attached
Extension at around 72°C :
The bases (complementary to the template) are coupled to the primer
on the 3' side: the polymerase adds dNTPs in the 5' to 3' direction, reading the
template from 3' to 5'.
Fig: The different steps of PCR
Fig: Exponential increase of the number of copies during PCR
Fig: Verification of PCR product
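The exponential increase in copy number can be written as N = N0 x (1 + E)^n, where n is the number of cycles and E is the per-cycle efficiency (1.0 for perfect doubling). The efficiency parameter is an assumption added for illustration; the slides only describe ideal doubling.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Number of copies after the given number of PCR cycles.
    efficiency=1.0 means every template is doubled in each cycle."""
    return initial_copies * (1 + efficiency) ** cycles

print(pcr_copies(1, 30))                   # ideal doubling from a single template
print(pcr_copies(1, 30, efficiency=0.9))   # a less than perfect reaction
```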
Primer design

The most critical parameter for successful PCR is the design of primers

Primer selection critical variables
- primer length
- melting temperature (Tm)
- specificity
- complementary primer sequences
- G/C content
- 3’-end sequence
1. Primer length
- specificity and the temperature of annealing are at
least partly dependent on primer length
- oligonucleotides between 20 and 30 bases are highly sequence
specific
- primer length is inversely related to annealing efficiency: in
general, the longer the primer, the less efficient the annealing
- the primers should not be too short, as specificity decreases
2. Complementary primer sequences and
specificity
- primers need to be designed with absolutely no intra-primer
homology beyond 3 base pairs. If a primer has
such a region of self-homology, “snap back” can occur
- If the homology occurs at the 3' end of either primer,
primer dimer formation will occur
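A simple check for 3'-end complementarity between two primers can be sketched as follows. It tests whether the last few bases of one primer can pair with the last few bases of the other, which is the situation that favours primer dimers. The window of 4 bases follows from the "beyond 3 base pairs" rule above; the function names and example primers are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def three_prime_dimer(primer_a, primer_b, length=4):
    """True if the last `length` bases of primer_a can base-pair with
    the last `length` bases of primer_b (antiparallel pairing)."""
    return primer_a.upper()[-length:] == reverse_complement(primer_b[-length:])

print(three_prime_dimer("AAAAGATC", "TTTTGATC"))   # GATC is self-complementary
print(three_prime_dimer("AAAAGGGG", "TTTTGGGG"))
```

Passing the same primer twice checks it against itself.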
3. G/C content
- ideally a primer should have a near random mix
of nucleotides, a 50% GC content
- there should be no PolyG or PolyC stretches that
can promote non-specific annealing
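Both G/C criteria are easy to check programmatically. In the sketch below, the run length of 4 that counts as a "PolyG or PolyC stretch" is an assumed threshold, since the slides do not give a number; the function names and primers are illustrative.

```python
import re

def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * sum(base in "GC" for base in primer) / len(primer)

def has_poly_gc_run(primer, run_length=4):
    """True if the primer contains a run of G or of C at least run_length long."""
    pattern = r"G{%d,}|C{%d,}" % (run_length, run_length)
    return re.search(pattern, primer.upper()) is not None

print(gc_content("ATGCATGCAT"))
print(has_poly_gc_run("ATGGGGCAT"), has_poly_gc_run("ATGGCCAT"))
```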
4. Melting temperature (Tm)
- Annealing temperature of at least 50°C
- Use an annealing temperature that is 5°C lower
than the melting temperature
- Wallace formula:
Tm = 4 (G+C) + 2 (A+T) °C
- Both of the primers should have similar melting
temperatures.
If primers are mismatched in terms of Tm, amplification
will be less efficient or may not work: the primer with the
higher Tm will mis-prime at lower temperatures; the
primer with the lower Tm may not work at higher
temperatures.
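The Wallace formula and the 5°C rule can be applied directly. The function names and example primers below are illustrative; the Wallace rule is a rough estimate generally used for short oligonucleotides.

```python
def wallace_tm(primer):
    """Melting temperature in °C by the Wallace formula: 4(G+C) + 2(A+T)."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    at = primer.count("A") + primer.count("T")
    return 4 * gc + 2 * at

def annealing_temperature(primer):
    """Annealing temperature 5°C below the melting temperature."""
    return wallace_tm(primer) - 5

def tm_difference(forward, reverse):
    """Absolute Tm difference between the two primers of a pair."""
    return abs(wallace_tm(forward) - wallace_tm(reverse))

forward = "ATGCATGCATGCATGCATGC"   # 10 G/C and 10 A/T
reverse = "ATATATGCGCATATATGCGC"   # 8 G/C and 12 A/T
print(wallace_tm(forward), annealing_temperature(forward))
print(tm_difference(forward, reverse))
```

A small Tm difference indicates a well-matched pair, in line with the recommendation above.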