Unit 6 - Bioinformatics
INTRODUCTION
• Bioinformatics is an interdisciplinary field mainly involving
molecular biology and genetics, computer science, mathematics,
and statistics.
• It applies computational techniques to biological problems:
1. Data problems: representation (graphics), storage and retrieval
(databases), and analysis (statistics, artificial intelligence,
optimization, etc.)
2. Biology problems: sequence analysis, structure or function
prediction, data mining, etc.; this side is also called computational biology
• The National Center for Biotechnology Information (NCBI 2001)
defines bioinformatics as "the field of science in which biology,
computer science, and information technology merge into a single
discipline."
[Figure: the sequence ATGACAGATTACAGATTACAGATTACAGGATAG read in Frame 1, Frame 2, and Frame 3]
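The three forward reading frames of the sequence above can be listed with a short Python sketch. The function name `reading_frames` is illustrative and not from the slides; trailing bases that do not fill a complete codon are dropped.

```python
def reading_frames(seq):
    """Return the codons of the three forward reading frames of seq."""
    seq = seq.upper()
    frames = []
    for offset in range(3):
        # Step through the sequence three bases at a time, keeping only full codons.
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames


seq = "ATGACAGATTACAGATTACAGATTACAGGATAG"
frames = reading_frames(seq)
for number, codons in enumerate(frames, start=1):
    print(f"Frame {number}:", " ".join(codons))
```

In this example, Frame 1 begins with the start codon ATG and ends with the stop codon TAG, while Frame 2 begins with the stop codon TGA.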
Gene finding in prokaryotes
• Advantages
– Simple gene structure
– Small genomes (0.5 to 10 million bp)
– No introns
– Genes are called Open Reading Frames (ORFs)
– High coding density (>90%)
• Disadvantages
– Some genes overlap (nested)
– Some genes are quite short (<60 bp)
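Because prokaryotic genes have no introns, a naive ORF scan can be sketched as a search for a start codon followed by the first in-frame stop codon. This is a minimal sketch under stated assumptions, not a production gene finder: it scans the forward strand only, treats ATG as the only start codon, and uses a hypothetical `min_length` cutoff (in bases). The function name `find_orfs` is illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_length=6):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start and end are 0-based; end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_length:
                    orfs.append((start, end, frame + 1))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("ATGACAGATTACAGATTACAGATTACAGGATAG"))
```

The length cutoff illustrates the trade-off noted above: a high cutoff reduces false positives but misses short genes, and a scan that reports one ORF per start codon does not model overlapping genes well.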
Gene finding approaches
• Rule-based
– Less applicable: produces too many false positives
• Content-based Methods
– CpG islands, GC content, hexamer repeats, composition statistics, codon
frequencies
• Feature-based Methods
– donor sites, acceptor sites, promoter sites, start/stop codons, polyA signals,
feature lengths
• Similarity-based Methods
– sequence homology, EST (expressed sequence tags) searches
• Pattern-based
– HMMs, Artificial Neural Networks
• The most effective approach combines all of the above
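Two of the content-based statistics listed above, GC content and CpG enrichment, can be computed directly from a sequence. The sketch below uses the commonly cited observed/expected CpG ratio, (number of CG dinucleotides × sequence length) / (number of C × number of G); the function names are illustrative, and any window size or threshold for calling a CpG island would be a further assumption not given in the slides.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # ratio is undefined without both C and G; report 0
    return seq.count("CG") * len(seq) / (c * g)


example = "ATGACAGATTACAGATTACAGATTACAGGATAG"
print(round(gc_content(example), 3), round(cpg_obs_exp(example), 3))
```

In practice these statistics are usually computed over a sliding window along the genome rather than over a whole sequence.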
Gene prediction programs
• Rule-based programs
– Use an explicit set of rules to make decisions.
– Example: GeneFinder
• Neural Network-based programs
– Use a data set to build rules.
– Examples: Grail, GrailEXP
• Hidden Markov Model-based programs
– Use probabilities of states and transitions between
these states to predict features.
– Examples: Genscan, GenomeScan
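The idea of predicting features from state and transition probabilities can be illustrated with a toy two-state HMM decoded by the Viterbi algorithm. This is only a sketch: the states, the probabilities, and the function name `viterbi` are invented for illustration and are not the model used by Genscan or GenomeScan, which are far richer. The input is assumed to contain only A, C, G, and T.

```python
import math

STATES = ("coding", "noncoding")

# Toy parameters, invented for illustration only (not trained on real data).
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    if not seq:
        return []
    seq = seq.upper()
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best previous state for s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi("GCGCGCGCATATATAT"))
```

With these toy parameters, a GC-rich stretch is labelled coding and an AT-rich stretch noncoding; working in log space avoids numerical underflow on long sequences.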
Combined Methods
• GRAIL (http://compbio.ornl.gov/Grail-1.3/)
• FGENEH (http://www.bioscience.org/urllists/genefind.htm)
• HMMgene (http://www.cbs.dtu.dk/services/HMMgene/)
• GENSCAN (http://genes.mit.edu/GENSCAN.html)
• GenomeScan (http://genes.mit.edu/genomescan.html)
• Twinscan (http://ardor.wustl.edu/query.html)
Egpred: Prediction of Eukaryotic Genes
• Similarity Search
– First, BLASTX against the RefSeq database
• Combined method
Biological databases
Introduction
• Biological databases: libraries of life sciences information,
collected from scientific experiments, published literature,
high-throughput experiment technology, and computational analysis
• The first database was created shortly after the insulin protein
sequence became available in 1956.
• Around the mid-1960s, the first nucleic acid sequence, that of yeast tRNA
with 77 bases (the individual units of nucleic acids), was determined. During
this period, three-dimensional structures of proteins were studied, and
the well-known Protein Data Bank was developed as the first protein
structure database, with only 10 entries, in 1972.
• Databases in general can be classified into primary, secondary, or composite
databases.
• In primary databases, experimental results are submitted directly by
researchers, and the data are essentially archival in nature.
• Once given a database accession number, the data in primary databases are never
changed: they form part of the scientific record.
[Figure: primary nucleotide sequence databases linked through a collaborative meeting: GenBank (NCBI, USA), EMBL (Europe), and DDBJ (NIG/CIB, Japan); TrEMBL and NRDB are also shown]
Synonyms: archival database (primary); curated database, knowledgebase (secondary)
– Changeable
– Stable
– ENTREZ
• SRS:
– Sequence Retrieval System
– Developed by EBI
– Web-oriented system, accessed through HTML pages and Common Gateway
Interface (CGI) scripts
• ENTREZ:
– Developed at NCBI and accessible through the NCBI Entrez site
– Provides search facilities for a large number of databases and links between them
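Besides the web site, Entrez can be queried programmatically through the NCBI E-utilities. The slides mention only the web interface, so using E-utilities here is an assumption; the sketch below only builds an ESearch query URL and does not contact the server. The helper name `build_esearch_url` and the example search term are illustrative.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=5):
    """Build an ESearch query URL for the given Entrez database and search term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url("nucleotide", "insulin[Title] AND human[Organism]")
print(url)
```

The resulting URL could then be fetched with `urllib.request.urlopen` to obtain the matching record identifiers; changing the `db` parameter queries a different Entrez database.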