Bio in For Ma Tics
FUNDAMENTAL ISSUES:
How we describe, analyze, simulate and predict the dynamics of various biological
processes using IT tools.
COMPLEXITY: due to the massive amount of data obtained through numerous
biological experiments.
Managing and interpreting Biological data:
EMBL database : 17,807,926,047 nucleotides
Total entries : 15,851,373
BIOINFORMATICS
Biology, Computer Science & IT
Ultimate goal:- to enable the discovery of new biological insights and to create a global
perspective from which unifying principles of biology can be discerned.
Job prospects:-
• $300 billion pharmaceutical industry
• Fast-growing biotech industry to support it
• Drug design is becoming genomics-related
• Shortage of bioinformatics and molecular modeling specialists
• Global market share for bioinformatics is expected to cross US $ 60 billion by the year
2006.
Salaries range:-
US $ 80,000/- to US $ 200,000/- per annum.
BIOINFORMATICS
• Science of using information to understand biological phenomena
• Part of Computational Biology
Bioinformatics consists of:
APPLICATIONS:
•Sequence analysis
•Primer design – designing short sequences used to make many copies of a piece of
DNA
•Predict function of actual gene products
•Molecular modeling
•Crystallography – structural biology
•Genetic engineering
BIOINFORMATICS
(AN OVERVIEW)
THE CURRENT EXCITEMENT IN BIOLOGY
THE CHALLENGES OF
BIOINFORMATICS.
Shift of paradigm (diagram labels: mRNA, PROTEIN, STRUCTURE, FUNCTION)
THE AGE OF BIOCOMPUTING
THE OMICS
• TRANSCRIPTOMICS
• Protein: PROTEOMICS
• Phenotype (Function): FUNCTOMICS
THE RELATED DISCIPLINES
OF BIOINFORMATICS
Cheminformatics
Medical Informatics
Health Informatics
Medical computing
Nursing informatics
STRUCTURE PREDICTION : UNLOCKING
BIOLOGICAL SECRETS
EXPASY
SWISS MODEL
GENO 3D
ADME PROPERTIES
ABSORPTION
DISTRIBUTION
METABOLISM
ELIMINATION
ITS RELATIONSHIP USING BIOINFORMATICS TOOLS
REBASE
PROSITE
TRANSCRIPTION FACTOR DATABASE
TRANSFAC
EUKARYOTIC PROMOTER DATABASE
YAHOOO!!!! I AM A BIOINFORMATICIST: A HYBRID OF
BIOTECHNOLOGY CONCEPTS AND INFORMATION TECHNOLOGY TOOLS
THE EXTENDING SAGA OF
BIOINFORMATICS
WAAHH!!! MOUSEMAN... WONDER, HAVE THEY JUST SEQUENCED THE MOUSE GENOME MAP?
SEQUENCES
Quantitative / Qualitative
• SIMILAR
• ANALOGOUS / HOMOLOGOUS
– HOMOLOGOUS: ORTHOLOGOUS / PARALOGOUS
Concepts:
Example:
S= - - c a c - d b d v l
T= l t c a b d d b - - -
The two leading spaces at the left end of the alignment are free, as are the
three trailing spaces at the right-hand side.
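The end-space-free idea in the example above can be sketched in a few lines of Python. This is only an illustration: the function name `end_space_free_score` and the scoring values (match +2, mismatch -1, internal space -1) are assumptions, not values given in the slides.

```python
def end_space_free_score(s, t, match=2, mismatch=-1, space=-1):
    """Score a given alignment, charging nothing for leading/trailing spaces.

    s and t are the two aligned rows (equal length, '-' marks a space).
    The scoring values are hypothetical defaults chosen for illustration.
    """
    if len(s) != len(t):
        raise ValueError("aligned rows must have the same length")

    def letter_span(row):
        # First and last column of the row that holds a real character.
        idx = [i for i, ch in enumerate(row) if ch != "-"]
        return idx[0], idx[-1]

    s_first, s_last = letter_span(s)
    t_first, t_last = letter_span(t)
    # Columns outside [lo, hi] contain only end spaces and are free.
    lo, hi = max(s_first, t_first), min(s_last, t_last)

    score = 0
    for a, b in zip(s[lo:hi + 1], t[lo:hi + 1]):
        if a == "-" or b == "-":
            score += space          # internal space
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


S = "--cac-dbdvl"
T = "ltcabddb---"
print(end_space_free_score(S, T))
```

Under these assumed weights only the six middle columns are charged: four matches, one mismatch and one internal space.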
Gap penalty
Input: Two strings S & T, possibly of different lengths.
Q: Define a gap as any maximal consecutive run of spaces, and the length of
a gap as the number of indel operations (spaces) in it. What is the similarity
between the two strings, given a weight function for gaps?
Model for sequence analysis continued -
Example :
S=attc--ga-tggacc
T=a--cgtgatt---cc
Four gaps with a total of eight spaces.
The alignment would then be described as:
Seven matches
No mismatches
Eight spaces
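The counts in this example can be checked mechanically. The sketch below is a minimal illustration; the helper names and the weights in `gapped_similarity` (match +2, mismatch -1, a charge of 2 per gap plus 1 per space) are assumptions, since the slides do not fix a particular weight function.

```python
from itertools import groupby


def describe_alignment(s, t):
    """Return (matches, mismatches, spaces, gaps) for two aligned rows."""
    matches = mismatches = spaces = 0
    for a, b in zip(s, t):
        if a == "-" or b == "-":
            spaces += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    # A gap is a maximal run of '-' inside one row.
    gaps = sum(1 for row in (s, t) for ch, _ in groupby(row) if ch == "-")
    return matches, mismatches, spaces, gaps


def gapped_similarity(s, t, match=2, mismatch=-1, gap_weight=2, space_weight=1):
    """Similarity with a per-gap and per-space charge (hypothetical weights)."""
    m, x, sp, g = describe_alignment(s, t)
    return m * match + x * mismatch - (g * gap_weight + sp * space_weight)


S = "attc--ga-tggacc"
T = "a--cgtgatt---cc"
print(describe_alignment(S, T))   # (7, 0, 8, 4)
print(gapped_similarity(S, T))
```

Charging per gap as well as per space is one common choice of weight function; setting `gap_weight=0` reduces it to a plain per-space penalty.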
SIMILARITY vs DISTANCE
• SIMILARITY: local alignment; suited for comparing proteins
• DISTANCE: evolution & phylogeny; triangle inequality
GLOBAL Vs LOCAL SIMILARITY
Global algorithms are not sensitive to highly diverged sequences; for these, local similarity
is a better and faster method. The three most widely used local similarity algorithms are
Smith-Waterman, FASTA and BLAST.
Smith-Waterman:
• it is a rigorous dynamic programming approach
• it does not make use of heuristic shortcuts
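A minimal sketch of the Smith-Waterman recurrence is given below. It returns only the best local score (no traceback) and uses a linear gap penalty; the function name and the scoring values are assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b by full dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # first row/column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Every cell of the table is filled, which is what makes the method rigorous and also what makes it slower than the heuristic programs.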
FASTA: (developed by Lipman & Pearson in 1985)
• considers exact matches between short substrings of a given length (the word-length parameter, ktup)
• allows speed to be traded off against precision: the larger we choose the parameter, the smaller
is the number of exact matches
• a larger parameter makes the program faster but loses precision
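The exact-match step can be illustrated with a small word lookup. This is a simplified sketch of the idea only, not the FASTA program itself; the function names and the example sequences are hypothetical.

```python
from collections import defaultdict


def ktup_hits(query, subject, k):
    """Return all (i, j) with query[i:i+k] == subject[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def hits_per_diagonal(hits):
    """Count hits on each diagonal j - i; dense diagonals suggest similarity."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    return dict(counts)


query, subject = "ACGTACGT", "TTACGTAC"
for k in (2, 4):
    hits = ktup_hits(query, subject, k)
    print(k, len(hits), hits_per_diagonal(hits))
```

With the larger word length there are fewer hits to process, which is the speed-for-precision trade-off described in the list above.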
BLAST: (developed by Altschul et al. in 1990)
• it focuses on no-gap (ungapped) alignments of a certain, fixed length
• it uses a scoring function to measure similarity rather than distance
• it reports to the user all database entries that have a segment pair scoring higher than
the threshold parameter.
FASTA algorithm
BLAST (Basic Local Alignment Search Tool)
is a similarity search program developed by the research
staff at NCBI/GenBank. It is available as a free service over the
Internet and provides very fast and accurate database searching.
• If similar words are found, BLAST tries to expand the alignment to the
adjacent words.
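The expansion of a word hit can be sketched as an ungapped extension in both directions. The code below is a simplified illustration of this seed-and-extend idea, not the actual BLAST implementation; the function name, the scores (+1/-1) and the `x_drop` cutoff are assumptions.

```python
def extend_hit(query, subject, qi, si, w, match=1, mismatch=-1, x_drop=2):
    """Extend an exact word hit of length w without gaps.

    The hit starts at query[qi] and subject[si]. Extension in a direction
    stops once the running score falls more than x_drop below the best
    score seen so far. Returns (score, query_start, query_end).
    """
    score = best = w * match
    best_end = w
    k = w
    # Extend to the right of the word.
    while qi + k < len(query) and si + k < len(subject):
        score += match if query[qi + k] == subject[si + k] else mismatch
        k += 1
        if score > best:
            best, best_end = score, k
        elif best - score > x_drop:
            break
    # Extend to the left of the word, starting from the best score so far.
    score = best
    best_start = 0
    k = 1
    while qi - k >= 0 and si - k >= 0:
        score += match if query[qi - k] == subject[si - k] else mismatch
        if score > best:
            best, best_start = score, k
        elif best - score > x_drop:
            break
        k += 1
    return best, qi - best_start, qi + best_end


# Word "ACG" matches at position 2 in both sequences.
print(extend_hit("TTACGTAA", "CCACGTCC", 2, 2, 3))   # (4, 2, 6)
```

A segment pair found this way would be reported only if its score exceeds the chosen threshold.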
TTC
The other optimal alignment is:
ATT-
-TTC
Construction of the optimal alignment
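One way to construct an optimal alignment is to fill a global dynamic-programming table and then trace back from the bottom-right cell. The sketch below assumes a simple scoring scheme (match +1, mismatch -1, space -1) that is not taken from the slides, and it returns only one optimal alignment even when several exist.

```python
def global_alignment(s, t, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Traceback: walk from (n, m) to (0, 0), preferring the diagonal move.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))


print(global_alignment("ATT", "TTC"))   # (0, 'ATT-', '-TTC')
```

With these assumed scores the traceback recovers the alignment ATT- / -TTC shown above; a different scoring scheme or tie-breaking order may yield another optimal alignment.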
Hidden Markov Model
• HMMs derive from Markov chains, which concentrate only on the sequence of states.
• Since the early 1970s
– Applied in speech recognition research
• The early 1990s
– Introduced to the bioinformatics
community
– Sequence modeling, multiple alignment, protein
structure prediction and profiling
Markov Chains
• Markov Property of order 1
• Formally
$$P(X_0, X_1, \dots, X_t) = P(X_0)\,P(X_1 \mid X_0)\,P(X_2 \mid X_0, X_1) \cdots P(X_t \mid X_0, \dots, X_{t-1})$$
$$= P(X_0)\,P(X_1 \mid X_0)\,P(X_2 \mid X_1) \cdots P(X_t \mid X_{t-1})$$
• ( S0 ) P( S1 | S0 ) P( S1 | S1 ) S2
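Under the order-1 Markov property, the probability of a state path is the initial probability times a product of one-step transition probabilities. The sketch below illustrates this; the state names and all probability values are hypothetical and chosen only for the example.

```python
def chain_probability(states, initial, transition):
    """P(X_0, ..., X_t) for a first-order Markov chain.

    initial[s] is P(X_0 = s); transition[a][b] is P(X_{k+1} = b | X_k = a).
    """
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p


# Hypothetical three-state chain.
initial = {"S0": 1.0, "S1": 0.0, "S2": 0.0}
transition = {
    "S0": {"S0": 0.5, "S1": 0.5, "S2": 0.0},
    "S1": {"S0": 0.0, "S1": 0.4, "S2": 0.6},
    "S2": {"S0": 0.0, "S1": 0.0, "S2": 1.0},
}

# P(S0) * P(S1|S0) * P(S1|S1) * P(S2|S1)
print(chain_probability(["S0", "S1", "S1", "S2"], initial, transition))
```

Each factor depends only on the immediately preceding state, which is exactly the simplification from the first line of the formula to the second.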
• Sum-of-pairs method
• Star alignment
• Two-step method (Clustal and Pileup approaches)
• Automated tools (Macaw, Meme etc.)
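The sum-of-pairs method in the list above scores each column of a multiple alignment by adding up the scores of all pairs of characters in that column. The sketch below is a minimal illustration; the pair scores (match +1, mismatch -1, character against space -2, space against space 0) are assumed values, not ones given in the slides.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of characters (hypothetical scoring values)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column_score(column):
    """Sum of pair scores over all k(k-1)/2 unordered pairs in a column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment_score(rows):
    """Sum-of-pairs score of a whole multiple alignment (equal-length rows)."""
    return sum(sp_column_score(col) for col in zip(*rows))


rows = ["AC-", "ACT", "A-T"]
print([sp_column_score(col) for col in zip(*rows)])   # [3, -3, -3]
print(sp_alignment_score(rows))                       # -3
```

For k sequences `combinations(column, 2)` yields k(k-1)/2 pairs, so a column of four sequences contributes six pairwise scores.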
Global & Local MSA (multiple sequence alignment)
Example – SP (sum of pairs) method
• The sum of pairs function scores each position in the
protein, that is, each column, as the sum of the pairwise
scores. For k sequences, there are k(k-1)/2 unique pairwise
comparisons, excluding self comparisons. Here in
column three, the score would be