Unit 8
Bioinformatics
Introduction
• Bioinformatics is defined as an interdisciplinary field involving biology,
computer science, mathematics, and statistics to analyse biological
sequence data, genome content and arrangement, and to predict the
function and structure of macromolecules.
• Components of bioinformatics
• Development of new algorithms and statistics for assessing the relationships
among large sets of biological data
• Application of these tools for the analysis and interpretation of the various
biological data including nucleotide sequences, amino acid sequences, etc.
• Development of databases for efficient storage, access and
management of the large body of biological information.
cDNA LIBRARY
Definition
• A population of bacterial transformants or phage lysates in which
each mRNA isolated from an organism or tissue is represented as its
cDNA insertion in a plasmid or a phage vector
• Frequency of a specific cDNA in such a library would ordinarily
depend on the frequency of the concerned mRNA in the
tissue/organism in question
Isolation of mRNA
• Total RNA is first isolated from the desired organism/tissue
• The desired mRNA is then enriched, generally by any of the
following methods:
• Chromatography on poly-U sepharose or oligo-dT cellulose, which retain
mRNA molecules, since these have 3’ poly-A tails
• Density gradient centrifugation
• Antibodies are raised against the protein encoded by the desired mRNA. These
antibodies are then used to precipitate polysomes containing the mRNA and primary
transcript
Cloning of cDNAs
• cDNAs are cloned in phage insertion vectors (since these afford
high-efficiency packaging in vitro and give a large number of cDNA clones)
• Typically 10^5 to 10^6 cDNA clones are obtained
Properties of cDNA and cDNA library
• cDNAs are free from intron sequences
• Smaller in size than genes
• A comparison of the cDNA sequence with the corresponding genome
sequence permits the delineation of introns/exons boundaries
• The contents of cDNA libraries from a single organism will vary widely
depending on the developmental stage and the cell type used for the
preparation of the library. In contrast, a genomic library of that
organism remains the same whatever the tissue or stage
• A cDNA library will be enriched for abundant mRNA but may contain
only a few or no clones representing rare mRNAs
Applications
• Expression studies of eukaryotic genes in prokaryotes
• Generation of expressed sequence tags (ESTs), which is highly useful
in high throughput genome research
Expressed Sequence Tags (ESTs)
• Full-length cDNAs are desirable for accurate and detailed mapping of
genomic transcriptional units
• However, short cDNA sequences can also be used to identify specific
genes unambiguously, to map them on physical gene maps and to
provide information about their expression patterns.
• Expressed sequence tags (ESTs) are 200-300 bp long cDNA sequences,
generated by picking thousands of random clones from cDNA libraries and
subjecting them to single-pass sequencing
• Vast majority of cDNA sequences in databases are ESTs rather than full
length cDNAs or genomic clones
• dbEST is an EST database in the public domain
GENOMIC LIBRARY
• Collection of plasmid clones or phage lysates containing recombinant
DNA molecules so that the sum total of DNA inserts in this collection,
ideally, represents the entire genome of the concerned organism
• Despite the care taken in the preparation of genomic libraries, certain DNA
fragments should be expected to be under- or over-represented, or even
missing
• Possible reasons include:
• Certain fragments might replicate slowly
• Certain fragments might have been altered by recombinational events during
cloning
Construction of Genomic Library
• Extraction of total genomic DNA
• DNA is broken into fragments by
• Mechanical shearing
• Sonication
• Use of appropriate restriction enzymes
• It is desirable to cut the DNA randomly so that the DNA fragments (=
clones) overlap one another and no sequence of the genome is
systematically excluded
• Single or mixed digestion with the enzymes AluI (AG/CT), HaeIII (GG/CC)
or Sau3A (/GATC) has been used successfully
• Agarose gel electrophoresis of the partial digest is done
• The separated fragments are introduced in the appropriate vector for
cloning
• Transformation of host with recombinant vectors
• The minimum size of a genomic library depends on
• Complexity of genome
• Size (or length) of the DNA insert (the smaller the fragment size, the larger the
number of clones)
• Probability of gene being represented in the library
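The three factors above are commonly combined in the Clarke-Carbon estimate, N = ln(1 - P) / ln(1 - f), where f is the insert size divided by the genome size and P is the desired probability of finding a given sequence. The slides do not state this formula, so the sketch below is an illustration under that assumption; the function name and the example sizes are hypothetical.

```python
import math


def clones_needed(genome_size_bp: float, insert_size_bp: float,
                  probability: float = 0.99) -> int:
    """Estimate library size with the Clarke-Carbon formula (assumed here).

    N = ln(1 - P) / ln(1 - f), with f = insert size / genome size.
    """
    if not 0 < probability < 1:
        raise ValueError("probability must be strictly between 0 and 1")
    if not 0 < insert_size_bp < genome_size_bp:
        raise ValueError("insert must be smaller than the genome")
    f = insert_size_bp / genome_size_bp
    # log1p(-x) computes ln(1 - x) accurately when x is very small
    return math.ceil(math.log1p(-probability) / math.log1p(-f))


# Illustrative values only: a 3 x 10^9 bp genome cloned as 20 kb inserts
n_20kb = clones_needed(3e9, 20_000, probability=0.99)
n_40kb = clones_needed(3e9, 40_000, probability=0.99)
print(n_20kb, n_40kb)
```

Doubling the insert size roughly halves the number of clones required, which is the relationship the bullet on insert size describes.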
Sequencing of Genome
Clone by clone sequencing
• Fragments (BAC clones) are first aligned into contigs
• The fragments are then used to create cosmid clones and plasmid
clones, to get smaller sized clones
• These clones are also arranged into contigs
• The contigs are then sequenced
Shotgun Sequencing
• Clones are sequenced until all clones in the genomic library are
analysed
• Assembler software organizes the nucleotide sequence information
so obtained into a genome sequence
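What an assembler does can be illustrated with a toy greedy merge: repeatedly join the two reads that share the longest suffix-to-prefix overlap. This is a minimal sketch, not how production assemblers work; it assumes error-free reads from one strand, and the function names, the reads and the `min_len` threshold are all hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Merge reads pairwise by best overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more: separate contigs remain
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three made-up overlapping reads from a 15 bp "genome"
reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
print(greedy_assemble(reads))
```

Real genomes contain repeats and sequencing errors, which is why practical assemblers need far more elaborate methods than this greedy merge.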
Genome Sequence Compilation
• Genome sequencing necessitated the development of high-throughput
technologies that generate data at a very fast pace
• This has brought about recruitment of computers to manage this
flood of information
• Hence, the field of bioinformatics was developed
• Bioinformatics is used for
• Compilation of genome sequences
• Identification of genes
• Assigning functions to the identified genes
• Preparation of databases etc.
• In order to ensure that the nucleotide sequence of a genome is
complete and error-free, the genome is sequenced more than once
• For example, the Human Genome Project sequenced 3.2 billion base
pairs of the human genome a total of 12 times
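The arithmetic behind repeated sequencing is simple: total bases read = genome size x number of passes. The sketch below works that out for the figures quoted above; the 500 bp read length is a hypothetical value chosen only to show how a read count would follow, and the function names are illustrative.

```python
def total_bases_sequenced(genome_size_bp: int, fold_coverage: int) -> int:
    """Total number of bases that must be read for the given coverage."""
    return genome_size_bp * fold_coverage


def reads_required(genome_size_bp: int, fold_coverage: int,
                   read_length_bp: int) -> int:
    """Number of reads needed, rounding up to a whole read."""
    total = total_bases_sequenced(genome_size_bp, fold_coverage)
    return -(-total // read_length_bp)  # ceiling division on integers


genome = 3_200_000_000   # 3.2 billion bp, as quoted above
coverage = 12            # sequenced 12 times, as quoted above
print(total_bases_sequenced(genome, coverage))   # 38400000000
print(reads_required(genome, coverage, 500))     # assumes 500 bp reads
```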
Human Genome Project
At a glance
• The Human Genome
• Salient Features of Human Genome
• What was the Human Genome Project (HGP)
• Milestones
• Goals of Human Genome Project
• Issues of concern
• Future Challenges
• Vectors for Large-Scale Genome Project
• Yeast artificial chromosomes
• Bacterial artificial chromosome
• Difference between YAC and BAC
The Human Genome
• The human genome is the complete set of genetic information for humans
(Homo sapiens).
• The human genome is large and complex, although it is not the largest genome known.
• Its DNA spans a length of about 6 feet and contains more than 30,000
genes.
• The DNA is organized into a haploid chromosome set of 22
autosomes and one sex chromosome (X or Y).
Figure: Human Genome Sequencing. Male and female chromosome sets, 22 autosome pairs + 2 sex chromosomes (from NCBI)
Salient Features of Human Genome:
❑ The human genome comprises the information in 24 different
chromosomes (22 autosomes + one X chromosome + one Y
chromosome); in Homo sapiens 2n = 2x = 46
❑ The human genome contains over 3 billion nucleotide pairs.
❑ The human genome is estimated to have about 30,000 genes.
❑ The average gene consists of 3000 bases, but sizes of genes
vary greatly, with the largest known human gene, encoding
dystrophin, containing 2.5 million base pairs.
❑ Only about 3% of the genome encodes amino acid
sequences of polypeptides; the rest is non-coding, so-called
junk DNA (largely repetitive DNA).
❑ The functions are unknown for over 50% of the discovered
genes.
❑ Repetitive sequences make up a very large portion of the human
genome. Repetitive sequences have no direct coding function but they
shed light on chromosome structure, dynamics and evolution.
❑ Chromosome 1 has the most genes (2968) and the Y chromosome has the
fewest (231).
❑ Almost all nucleotide bases are exactly the same in all people.
Genome sequences of different individuals differ at less than 0.2% of
base pairs.
❑ Most of these differences occur in the form of single base differences
in the sequence. These single base differences are called single
nucleotide polymorphisms (SNPs).
❑ One SNP occurs about every 1,000 bp of the human genome.
❑ About 85% of all differences in human DNAs are due to SNPs.
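To make the idea of a single-base difference concrete, the sketch below lists the positions at which two already-aligned sequences of equal length differ, and applies the "one SNP per ~1,000 bp" figure to a genome size. The sequences and function names are made up for illustration; real SNP calling works on many individuals and must handle insertions, deletions and sequencing errors.

```python
def snp_positions(seq_a: str, seq_b: str):
    """Return (1-based position, base in a, base in b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b))
            if a != b]


def expected_snp_count(genome_size_bp: int, bp_per_snp: int = 1000) -> int:
    """Rough count implied by one SNP per bp_per_snp base pairs."""
    return genome_size_bp // bp_per_snp


# Two hypothetical individuals differing at a single base
print(snp_positions("ATGCCGTA", "ATGACGTA"))   # [(4, 'C', 'A')]
print(expected_snp_count(3_000_000_000))       # 3000000
```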
Human Chromosome 1 Genetic Map
What was the Human Genome Project (HGP)
• The Human Genome Project was an international research
effort to determine the sequence of the human genome and
identify the genes that it contains.
• The US Human Genome Project was a 13-year effort, coordinated by the
– Department of Energy (DOE) and
– National Institutes of Health (NIH).
Milestones
1986 The birth of the Human Genome Project.
1990 Project initiated as a joint effort of the US Department of
Energy and the National Institutes of Health.
1994 Genetic Privacy Act: to regulate collection, analysis,
storage and use of DNA samples and genetic information
is proposed.
1996 Wellcome Trust joins the project.
1998 Celera Genomics (a private company founded by Craig
Venter) formed to sequence much of the human genome in
3 years.
1999 Completion of the sequence of Chromosome 22-the first
human chromosome to be sequenced.
2000 Completion of the working draft of the entire human
genome.
2001 Analyses of the working draft are published.
2003 HGP sequencing is completed and Project is declared
Goals of Human Genome Project
1. To identify all the genes in human DNA.
2. To develop a genetic linkage map of human genome.
3. To obtain a physical map of human genome.
4. To develop technology for the management of human
genome information.
5. To know the function of genes.
6. Determine the sequences of the 3 billion chemical base
pairs that make up human DNA.
7. Store this information in public databases.
8. Develop tools for data analysis.
9. Transfer related technologies to the private sector.
ISSUES OF CONCERN
Ethical, Legal and Social issues of the Human Genome
Project
• Fairness in the use of genetic information.
• Privacy and confidentiality of genetic information.
• Psychological impact, stigmatization, and discrimination.
• Reproductive issues.
• Clinical issues.
• Uncertainties associated with gene tests for susceptibilities and
complex conditions.
• Fairness in access to advanced genomic technologies.
• Conceptual and philosophical implications.
• Health and environmental issues.
• Commercialization of products.
• Education, Standards, and Quality control.
• Patent issues.
Future Challenges: What We Still Don’t Know
10/26/2023 9:53 AM
Biological databases: why?
• The need for storing and communicating large datasets has grown
• To make biological data available to scientists
• To make biological data available in computer-readable form
Different classifications of databases
• Type of data
• nucleotide sequences
• protein sequences
• protein sequence patterns or motifs
• macromolecular 3D structure
• gene expression data
• metabolic pathways
• Technical design
• Flat-files
• Relational database (SQL)
• Exchange/publication technologies (FTP, HTML, CORBA, XML,...)
• Availability
• Publicly available, no restrictions
• Available, but with copyright
• Accessible, but not downloadable
• Academic, but not freely available
• Proprietary, commercial; possibly free for academics
Nucleotide sequence databases
• EMBL, GenBank, and DDBJ are the three primary nucleotide
sequence databases
• EMBL www.ebi.ac.uk/embl/
• GenBank www.ncbi.nlm.nih.gov/Genbank/
• DDBJ www.ddbj.nig.ac.jp
http://www3.oup.co.uk/nar/database/c/
GenBank
• An annotated collection of all publicly available nucleotide and
protein sequences
• http://www.ncbi.nlm.nih.gov
EMBL Nucleotide Sequence Database
• http://www.ebi.ac.uk/embl.html
• http://www3.ebi.ac.uk/Services/DBStats/
DDBJ–DNA Data Bank of Japan
• An annotated collection of all publicly available nucleotide and
protein sequences
• http://www.ddbj.nig.ac.jp
• 1984: The National Institute of Genetics (NIG) was reorganized as
an Inter-University Research Institute.
DDBJ began to work at NIG.
• 1986: DNA Database Advisory Committee organized.
• 1987: DDBJ release 1 was provided. DDBJ regards
this year as the official start of its operation.
Source: https://www.ddbj.nig.ac.jp/aboutus-e.html#history
Other NCBI nucleic acids DBs
• EST database: A collection of expressed sequence tags, or short, single-pass
sequence reads from mRNA (cDNA).
• GSS database: A database of genome survey sequences, or short, single-pass
genomic sequences.
• HomoloGene: A gene homology tool that compares nucleotide sequences
between pairs of organisms in order to identify putative orthologs
• Orthologs are genes in different species that evolved from a common ancestral
gene by speciation, and, in general, orthologs retain the same function during the
course of evolution.
• HTG database: A collection of high-throughput genome sequences from large-
scale genome sequencing centers, including unfinished and finished sequences.
• SNPs database: A central repository for both single-base nucleotide
substitutions and short deletion and insertion polymorphisms.
• RefSeq: A database of non-redundant reference sequence standards,
including genomic DNA contigs, mRNAs, and proteins for known genes.
Multiple collaborations, both within NCBI and with external groups,
support data-gathering efforts.
• STS database: A database of sequence tagged sites, or short sequences that
are operationally unique in the genome.
• UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).
• UniGene: A collection of ESTs and full-length mRNA sequences organized
into clusters, each representing a unique known or putative human gene
annotated with mapping and expression information and cross-references
to other sources.
Sequence submission
• Submissions through the Internet:
• Web forms.
• Email.
• Sequences shared/exchanged between the 3 centers on a daily basis:
• The sequence content of the banks is identical.
Derived databases
• CUTG: Codon usage tabulated from GenBank
http://www.kazusa.or.jp/codon/
• Genetic Codes: Deviations from the standard genetic code in various
organisms and organelles
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
• TIGR Gene Indices: Organism-specific databases of EST and gene sequences
http://www.tigr.org/tdb/tgi.shtml
• UniGene: Unified clusters of ESTs and full-length mRNA sequences
http://www.ncbi.nlm.nih.gov/UniGene/
• ASAP: Alternative spliced isoforms
http://www.bioinformatics.ucla.edu/ASAP
• Intronerator: Introns and alternative splicing in C. elegans and C. briggsae
http://www.cse.ucsc.edu/~kent/intronerator/
Nucleic acid structure databases
• NDB: Nucleic acid-containing structures
http://ndbserver.rutgers.edu/
PROTEIN SEQUENCE DATABASE
• A protein database can be a sequence database or a structure database.
• PIR2: Preliminary entries
• PIR3: Unverified entries
TrEMBL has 2 main sections:
• SP-TrEMBL: contains entries that have not yet been annotated but will
eventually be incorporated into Swiss-Prot.
• REM-TrEMBL: contains entries that are not included in Swiss-Prot, e.g.
synthetic sequences.
NRL-3D
• This database is produced by PIR from sequences extracted
from PDB.
• NRL-3D is used both for similarity searches and
keyword interrogation.
• Proteins are considered to have a common fold if they have the same
secondary structures in the same arrangement, whether or not they have a
common evolutionary origin.
CATH DATABASE:
PAIRWISE ALIGNMENT
• An alignment procedure comparing two biological sequences of either
protein, DNA or RNA
MULTIPLE SEQUENCE ALIGNMENT
• An alignment procedure comparing three or more biological sequences of
either protein, DNA or RNA
Multiple Sequence Alignment
• To detect regions of variability or conservation in a family of proteins
• Phylogenetic analysis (inferring a tree, estimating rates of
substitution, etc.)
• Detection of homology between a newly sequenced gene and an
existing gene family, prediction of protein structure etc.
• Demonstration of homology in multigene families
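The first use listed above, detecting conserved and variable regions, can be sketched once an alignment is available. The example below scores each column of an already-aligned set of sequences by the fraction of sequences that share the most common residue. It does not compute the alignment itself, the three short peptide sequences are invented, and the function names are hypothetical; gap characters, if present, are simply counted as another symbol.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common symbol, per column."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def fully_conserved_columns(alignment):
    """1-based positions where every sequence has the same symbol."""
    return [i + 1 for i, score in enumerate(column_conservation(alignment))
            if score == 1.0]


# Invented aligned peptides, for illustration only
aligned = ["MKTAY", "MKSAY", "MRTAY"]
print(column_conservation(aligned))
print(fully_conserved_columns(aligned))   # [1, 4, 5]
```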
Comparison of biological sequences by BLAST and FASTA
Class | Amino acids | 1-letter code
Cyclic | Proline | P
Aromatic | Phenylalanine, Tyrosine, Tryptophan | F, Y, W
Acidic and their amides | Aspartate, Glutamate, Asparagine, Glutamine | D, E, N, Q
FASTA
• Developed in 1985 for comparing protein sequences only but was
later modified to conduct searches on DNA also
• FASTA software uses the principle of finding the similarity between
the two sequences statistically
• This software matches one sequence of DNA or protein with the other
by local sequence alignment method
• It searches for local regions of similarity rather than the best
overall match between the two sequences
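Local alignment can be illustrated with the Smith-Waterman dynamic programming recurrence. This is not FASTA's own procedure, which speeds the search up with a heuristic based on short exact word matches; the sketch only shows what "best local region" means. The scoring values (match +2, mismatch -1, gap -2) and the example sequences are arbitrary assumptions.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below 0: a local alignment may start anywhere
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared core "ACGT" scores 4 matches x 2 = 8, flanks are ignored
print(local_alignment_score("GGACGTCC", "TTACGTAA"))   # 8
```

Because the score resets to zero instead of going negative, dissimilar flanking regions do not penalise a well-matching core, which is the difference from a global alignment.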
• Wallace formula:
Tm = 4 (G+C) + 2 (A+T) °C, where each letter is the number of that base in the primer
• Both of the primers should have similar melting
temperatures (Tm).
• If primers are mismatched in terms of Tm, amplification
will be less efficient or may not work: the primer with the
higher Tm will mis-prime at lower temperatures, while the
primer with the lower Tm may not work at higher
temperatures.
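The Wallace formula is easy to apply by counting bases. The sketch below computes Tm for each primer of a pair and the difference between them; the primer sequences are invented and the function names are hypothetical. The rule is a rough estimate generally used for short oligonucleotides, so treat the numbers as approximate.

```python
def wallace_tm(primer: str) -> int:
    """Melting temperature in °C by the Wallace rule: 4(G+C) + 2(A+T)."""
    seq = primer.upper()
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters in primer: {sorted(invalid)}")
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 4 * gc + 2 * at


def tm_difference(forward: str, reverse: str) -> int:
    """Absolute Tm difference between the two primers of a pair."""
    return abs(wallace_tm(forward) - wallace_tm(reverse))


# Invented 20-mer with 10 G/C and 10 A/T: 4*10 + 2*10 = 60
print(wallace_tm("ATGCATGCATGCATGCATGC"))   # 60
print(tm_difference("ATGC", "AATT"))        # 12 vs 8, difference 4
```

A small Tm difference between the forward and reverse primer is what the bullets above recommend; a large one suggests redesigning one of the primers.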