Bioinformatics

Bioinformatics is the application of computer science and information technology to understand and organize, on a large scale, the information associated with biological molecules such as DNA and proteins. It involves describing, analyzing, simulating and predicting biological processes using computational tools. The massive amounts of data from biological experiments make analysis complex. Bioinformatics aims to enable new biological insights and to discern unifying biological principles. It has applications in areas such as sequence analysis, structure prediction, and drug development.

Uploaded by

Kaveesh Dashora
Copyright
© Attribution Non-Commercial (BY-NC)

WHAT IS BIOINFORMATICS?

• Conceptualizing biology in terms of molecules and applying informatics techniques (applied mathematics, computer science, and statistics) to understand and organize the information associated with these molecules on a large scale.
• MIS (management information system) for molecular biology

FUNDAMENTAL ISSUES:
How to describe, analyze, simulate and predict the dynamics of various biological processes using IT tools.
COMPLEXITY: due to the massive amount of data obtained through numerous biological experiments.
Managing and interpreting biological data:
• EMBL database: 17,807,926,047 nucleotides
• Total entries: 15,851,373
BIOINFORMATICS
Biology, Computer Science & IT
Ultimate goal: to enable the discovery of new biological insights and to create a global perspective from which unifying principles of biology can be discerned.
Job prospects:
• $300 billion pharmaceutical industry
• Fast-growing biotech industry to support it
• Drug design is becoming genomics related
• Shortage of bioinformatics and molecular modeling specialists
• The global market for bioinformatics is expected to cross US $60 billion by the year 2006.
Salary range:
US $80,000 to US $200,000 per annum.
BIOINFORMATICS
• Science of using information to understand biological phenomena
• Part of computational biology
Bioinformatics consists of:

• DNA microarrays: technology to measure the relative number of copies of a genetic message (levels of gene expression) at different stages of development or disease.
• Functional genomics: large-scale ways of identifying gene functions and associations.
• Structural genomics: attempts to crystallize or predict the structures of all proteins.
• Comparative genomics: understanding the differences and similarities between all the genes of multiple species (evolution).
• Medical informatics: management of biomedical experimental data.
OBJECTIVES:
• Organizing data
• Developing tools and resources for the analysis of data
• Analysis and interpretation of results

APPLICATIONS:
• Sequence analysis
• Primer design: short sequences used to make many copies of a piece of DNA
• Predicting the function of actual gene products
• Molecular modeling
• Crystallography: structural biology
• Genetic engineering
BIOINFORMATICS
(AN OVERVIEW)
• THE CURRENT EXCITEMENT IN BIOLOGY
• THE CHALLENGES OF BIOINFORMATICS
• TODAY, TOMORROW AND THE NEAR FUTURE

THE THREE E's
EXTRACTING, ENVISAGING AND ELUCIDATING THE BIOLOGICAL DATA
CENTRAL DOGMA (SHIFT OF PARADIGM)

OF EARLY MOLECULAR BIOLOGY     OF BIOINFORMATICS
DNA                            SEQUENCE
mRNA                           STRUCTURE
PROTEIN                        FUNCTION
THE AGE OF BIOCOMPUTING
THE DNA COMPUTING
(The gene-protein assessment)
THE FUTURE NEXT OF BIOINFORMATICS: THE OMICS
• Genome: GENOMICS
• Protein: PROTEOMICS
• Biochemical pathways: METABOLITES (METABONOMICS ???)
• Phenotype (function): FUNCTOMICS
• TRANSCRIPTOMICS
THE RELATED DISCIPLINES
OF BIOINFORMATICS
Cheminformatics

Medical Informatics

Health Informatics

Medical computing

Nursing informatics
STRUCTURE PREDICTION: UNLOCKING BIOLOGICAL SECRETS
(GENE IDENTIFICATION TOOLS)
• ExPASy
• SWISS-MODEL
• Geno3D
• CPHmodels, etc.


In silico APPROACHES TO DRUG DEVELOPMENT

ADME PROPERTIES
• ABSORPTION
• DISTRIBUTION
• METABOLISM
• ELIMINATION

WHAT METABONOMICS IS ALL ABOUT, AND ITS RELATIONSHIP TO BIOINFORMATICS TOOLS

ANABOLISM + CATABOLISM --> METABOLISM

THE STUDY OF METABOLISM USING BIOINFORMATICS: METABONOMICS

PATTERN RECOGNITION AND DATABASES

• REBASE
• PROSITE
• TRANSCRIPTION FACTOR DATABASE
• TRANSFAC
• EUKARYOTIC PROMOTER DATABASE
BIOINFORMATICIST: "YAHOOO!!!! I AM A HYBRID"
BIOTECHNOLOGY CONCEPTS, INFORMATION TECHNOLOGY TOOLS

THE EXTENDING SAGA OF BIOINFORMATICS
WANT YOUR (SEQUENCE) MAP?
Human Genome Map for $250 only. REGISTER.
BUY ONE AND GET THE MOUSE GENOME TRANSCRIPTOME FREE.

LIVING WITH BIOINFORMATICS?
WAAHH!!! MOUSEMAN... WONDER, HAVE THEY JUST SEQUENCED THE MOUSE GENOME MAP?
SEQUENCES

Sequences: viewed as strings of characters, for convenience of understanding and of performing mathematical operations.

• Proteins and DNA may be similar with respect to their function, their structure, or their primary sequence of amino acids or nucleotides.
• Sequence determines shape; shape determines function.
• We study sequence similarity to discover similarity in shape and function.
Similarity in sequences

• Quantitative: a similarity measure, i.e. the degree of similarity that two sequences show.
• Qualitative: an alignment, i.e. a mutual arrangement of two sequences that shows where the two sequences are similar and where they differ.

Optimal alignment: the alignment that exhibits the most correspondences and the fewest differences.
BIOLOGICAL MOTIVATIONS OF SEQUENCE ANALYSIS
• A large variety of biological problems involve sequences.
• Sequence alignment is useful for discovering information related to function, structure and evolution.
Examples:
• Reconstructing long sequences of DNA from overlapping string fragments.
• Determining physical and genetic maps from probe data under various experimental protocols.
• Storing, retrieving and comparing DNA strings.
• Comparing two or more strings for similarities to find related proteins.
• Exploring frequently occurring patterns of nucleotides.
• Finding informative elements in protein and DNA sequences.
• Identifying an unknown sequence.
• Finding other members of multigene families.
Aim: learn the function and structure of a protein without performing experiments and without physically constructing the protein itself.

Basic idea: similar sequences produce similar proteins, so the characteristics of a protein can be predicted from its sequence data.

Example: suppose two protein sequences are identical at 25% of their positions. Such an association was found between cancer and uncontrolled cell growth, by comparing the sequence of a cancer-associated gene with the sequence of a protein that influences cell growth.
• The correlation was very high.
• This indicates a connection between the two.
IDENTICAL
SIMILAR
    ANALOGOUS
    HOMOLOGOUS
        ORTHOLOGOUS
        PARALOGOUS
Concepts:

Identical: when a corresponding character is shared between two species, that character is said to be identical.
Similar: the degree to which two species or populations share identities.
Homologous: characters that are similar due to common ancestry.
Analogous: characters that are similar due to convergent evolution.
Orthologous: homologous characters separated by speciation, typically with conserved function.
Paralogous: homologous characters separated by gene duplication, typically with divergent function.


SIMILARITY & DIFFERENCES
Similarity: the maximal sum of weights, where weights are assigned according to resemblance.

Differences arise from mutations that modify DNA sequences:
• Insertion of one or more letters into a sequence.
• Deletion of one or more letters from a sequence.
• Substitution of one letter by another.

The notion of distance: assign a weight to each mutation.
Distance: the minimal sum of weights over a set of mutations that transforms one sequence into the other.
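The definition of distance as a minimal sum of mutation weights can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_distance` and the default weights of 1 per insertion, deletion and substitution are assumptions, not values given in the slides.

```python
def weighted_distance(s, t, w_ins=1, w_del=1, w_sub=1):
    """Minimal total weight of insertions, deletions and substitutions
    that turn string s into string t (weights are hypothetical defaults)."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [j * w_ins for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * w_del]  # turning s[:i] into the empty string
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j - 1] + (0 if a == b else w_sub),  # keep or substitute
                prev[j] + w_del,                         # delete a from s
                curr[j - 1] + w_ins,                     # insert b into s
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(weighted_distance("ATT", "TTC"))
```

Raising `w_sub` relative to the indel weights makes the minimum prefer insertions and deletions over substitutions, which is one way different mutation types can be weighted differently.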
MODELS FOR SEQUENCE ANALYSIS
• Global alignment
Input: two strings S & T of roughly the same length.
Q: What is the difference (similarity) between the two?
The alignment is made across the entire sequence length, including the sequence ends, to include as many matches as possible.
• Local alignment / similarity (more meaningful)
Input: two strings S & T.
Q: What is the maximum similarity (minimum difference) between a substring of S and a substring of T?
Q: What are these most similar substrings?
Example: S = a b c x d e x
         T = x x c - d e
Each match is given the value 2 and each mismatch the value -1.
α = c x d e
β = c - d e
These two substrings have the optimal local alignment (three matches and one space; 2 + 2 + 2 - 1 = 5 if a space is scored like a mismatch).
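The local alignment question can be answered with the Smith-Waterman recurrence. The sketch below returns only the best local score; it assumes T is the ungapped string xxcde and that a space costs -1 like a mismatch, neither of which the slide states explicitly. The function name is hypothetical.

```python
def local_alignment_score(s, t, match=2, mismatch=-1, space=-1):
    """Best similarity between any substring of s and any substring of t
    (Smith-Waterman recurrence, score only, no traceback)."""
    best = 0
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            score = max(
                0,                                               # start a new local alignment
                prev[j - 1] + (match if a == b else mismatch),   # align a with b
                prev[j] + space,                                 # a against a space
                curr[j - 1] + space,                             # b against a space
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("abcxdex", "xxcde"))
```

Under these assumptions the best score for the example is 5, which is the value of aligning cxde with c-de.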
Models for sequence analysis, continued

• Ends-free space alignment
Input: two strings S & T of different lengths.
Q: What is the maximum similarity between substrings of S & T?
Given: at least one of these substrings must be a prefix of its original string, and one (not necessarily the other) must be a suffix.

Example:
S = - - c a c - d b d v l
T = l t c a b d d b - - -
The two leading spaces at the left end of the alignment are free, as are the three trailing spaces at the right end.
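A minimal sketch of how such an alignment could be scored once the end spaces are free. The scoring values (match 2, mismatch -1, interior space -1) are borrowed from the local alignment example and are an assumption here, as is the function name.

```python
def ends_free_score(s, t, match=2, mismatch=-1, space=-1):
    """Score two aligned strings, ignoring leading and trailing runs of spaces."""
    if len(s) != len(t):
        raise ValueError("aligned strings must have the same length")
    # Columns covered by a leading or trailing run of '-' in either string are free.
    lead = max(len(s) - len(s.lstrip("-")), len(t) - len(t.lstrip("-")))
    trail = max(len(s) - len(s.rstrip("-")), len(t) - len(t.rstrip("-")))
    score = 0
    for a, b in zip(s[lead:len(s) - trail], t[lead:len(t) - trail]):
        if a == "-" or b == "-":
            score += space      # interior spaces are still penalized
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    print(ends_free_score("--cac-dbdvl", "ltcabddb---"))
```

For the example alignment only the six interior columns (cac-db against cabddb) contribute to the score.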

• Gap penalty
Input: two strings S & T of different lengths.
Define a gap as any maximal consecutive run of spaces in a single string, and the length of a gap as the number of indel operations (spaces) in it.
Q: What is the similarity between the two strings, given a weight function for gaps?
Models for sequence analysis, continued
Example:
S = attc--ga-tggacc
T = a--cgtgatt---cc
There are four gaps with a total of eight spaces.
The alignment would then be described as:
• Seven matches
• No mismatches
• Eight spaces

• The length of a gap is the number of indel operations in it.
• The concept of a gap in an alignment is important in many biological applications.
• Mutational events create gaps of varying sizes.
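The counts in the example can be checked mechanically. The sketch below tallies matches, mismatches, spaces and gaps for two aligned strings, and then applies one possible gap weight function; the weights (a charge per gap plus a charge per space) and both function names are hypothetical, chosen only to illustrate the idea.

```python
import re


def alignment_stats(s, t):
    """Return (matches, mismatches, spaces, gaps) for two aligned strings."""
    if len(s) != len(t):
        raise ValueError("aligned strings must have the same length")
    matches = mismatches = spaces = 0
    for a, b in zip(s, t):
        if a == "-" or b == "-":
            spaces += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    # A gap is a maximal run of consecutive spaces within one string.
    gaps = len(re.findall(r"-+", s)) + len(re.findall(r"-+", t))
    return matches, mismatches, spaces, gaps


def gap_weighted_similarity(s, t, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Similarity under an assumed weight function: gap_open per gap, gap_extend per space."""
    matches, mismatches, spaces, gaps = alignment_stats(s, t)
    return match * matches + mismatch * mismatches + gap_open * gaps + gap_extend * spaces


if __name__ == "__main__":
    print(alignment_stats("attc--ga-tggacc", "a--cgtgatt---cc"))
```

Charging per gap as well as per space lets one long gap cost less than several short ones, which fits the observation that single mutational events create gaps of varying sizes.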


METHODS OF ALIGNMENT
• DOT MATRIX: useful for simple alignment; however, it does not show the aligned sequences or produce an optimal alignment.
• BRUTE FORCE: produces alignments without gaps and has N² complexity, where N is the length of the sequences.
• DYNAMIC PROGRAMMING: produces an optimal alignment by starting an alignment from one end (as in a dot matrix), then keeping track of all possible best alignments up to that point.
• HEURISTIC METHODS: fast computational, machine-based methods. They may not be as accurate as dynamic programming.

GRAPHIC SIMILARITY COMPARISONS:
(Figure: the sequence GGCTTGACCGG is slid along GGATTGACCCG, and along a second copy of GGCTTGACCGG, one position at a time.)


SIMILARITY VERSUS DISTANCE
1. The elements of the matrices specify the weight to assign to a given comparison by:
• the cost of replacing one residue with another (distance); or
• a measure of the similarity for the replacement.
2. Similarity is used for database searching.
3. Distance is more applicable to phylogenetic tree reconstruction.
4. Maximizing the similarity is fundamentally the same as minimizing a distance. Hence distance and similarity matrices are interconvertible by some mathematical transformation appropriate for the given application.

SIMILARITY                        DISTANCE
Local alignment                   Evolution & phylogeny
Suited for comparing proteins     Triangle inequality
GLOBAL Vs LOCAL SIMILARITY
Global algorithms are not sensitive for highly diverged sequences; local similarity is a better and faster method. The three most widely used local similarity algorithms are Smith-Waterman, FASTA and BLAST.
Smith-Waterman:
• It is a rigorous dynamic programming approach.
• It does not make use of heuristic shortcuts.
FASTA (developed by Lipman & Pearson in 1985):
• It considers exact matches between short substrings whose length is given by a parameter.
• It allows speed to be traded off against precision: the larger the parameter, the smaller the number of exact matches, which makes the program faster but less precise.
BLAST (developed by Altschul et al. in 1990):
• It focuses on ungapped alignments of a certain, fixed length.
• It uses a scoring function to measure similarity rather than distance.
• It reports to the user all database entries that have a segment pair scoring higher than the threshold parameter.
FASTA algorithm
BLAST (Basic Local Alignment Search Tool)
is a similarity search program developed by the research staff at NCBI/GenBank. It is available as a free service over the Internet and provides very fast, accurate database searching.

BLAST goes through the following 3 steps:

• It takes each word from the query sequence (3 amino acids or 11 nucleotides) and looks for similar words in the database sequences.

• If similar words are found, BLAST tries to extend the alignment to the adjacent words.

• After all words are tested, a set of HSPs (High-scoring Segment Pairs) is chosen for that database sequence.
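The three steps can be illustrated with a heavily simplified sketch. It is not the BLAST implementation: it uses exact word matches only (no neighborhood words or substitution matrix), a plain +1/-1 score, and extension to the end of the sequences rather than a drop-off rule. The function names and parameters are assumptions made for this illustration.

```python
def find_seed_hits(query, subject, word_size=3):
    """Step 1: list (query_pos, subject_pos) pairs where a query word occurs in subject."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


def extend_ungapped(query, subject, i, j, word_size=3, match=1, mismatch=-1):
    """Step 2: extend a seed in both directions without gaps.

    Returns (best_score, query_start, query_end) of the best-scoring segment pair.
    """
    score = best = word_size * match
    right = 0
    k = 0
    while i + word_size + k < len(query) and j + word_size + k < len(subject):
        score += match if query[i + word_size + k] == subject[j + word_size + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
    score = best
    left = 0
    k = 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        score += match if query[i - 1 - k] == subject[j - 1 - k] else mismatch
        k += 1
        if score > best:
            best, left = score, k
    return best, i - left, i + word_size + right


def high_scoring_pairs(query, subject, word_size=3, threshold=5):
    """Step 3: keep the distinct extended segments that score at least threshold."""
    found = {extend_ungapped(query, subject, i, j, word_size)
             for i, j in find_seed_hits(query, subject, word_size)}
    return sorted(hsp for hsp in found if hsp[0] >= threshold)


if __name__ == "__main__":
    print(high_scoring_pairs("GATTACA", "CCGATTACACC"))
```

In this toy run all five seed words extend to the same segment, so a single high-scoring pair covering the whole query is reported.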
BLAST algorithm
POPULAR SCORING MODELS FOR PROTEIN SEQUENCES
There are two popular scoring models for protein sequences: PAM and BLOSUM. PAM stands for Point Accepted Mutation (also expanded as Percent Accepted Mutation) and BLOSUM stands for BLOcks SUbstitution Matrix.
PAM is
• based on an explicit evolutionary model
• a representation of a specific evolutionary distance
• wide in range, from identical to completely random
BLOSUM is
• based on empirical frequencies
• always a blend of distances, as seen in the database and PROSITE
• narrower in range than the PAM matrices
Representation of dot plots
GRAPHIC SIMILARITY COMPARISONS
• Uses the power of the computer to present relationships between sequences.
• Similarity between two sequences can be detected as a diagonal on an identity matrix.
• To determine the similarity of sequences, we must compare all parts of one sequence with all parts of the other.
• The alignment with the greatest number of identities would be the optimal alignment.
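A small sketch of the identity (dot) matrix idea, using the two sequences that appear in the earlier graphic comparison slide. The function names are hypothetical, and no window or stringency filtering is applied, which real dot-plot programs usually add.

```python
def dot_matrix(s, t):
    """Identity matrix: cell [i][j] is 1 when s[i] == t[j], else 0."""
    return [[1 if a == b else 0 for b in t] for a in s]


def diagonal_identities(matrix, offset=0):
    """Count the dots on the diagonal where column - row == offset."""
    return sum(
        row[i + offset]
        for i, row in enumerate(matrix)
        if 0 <= i + offset < len(row)
    )


def render(s, t):
    """Return the dot plot as text, with '*' marking identities."""
    lines = ["  " + " ".join(t)]
    for a, row in zip(s, dot_matrix(s, t)):
        lines.append(a + " " + " ".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("GGCTTGACCGG", "GGATTGACCCG"))
    print(diagonal_identities(dot_matrix("GGCTTGACCGG", "GGATTGACCCG")))
```

Scanning `offset` over a range of values corresponds to sliding one sequence along the other; the offset with the most identities marks the strongest diagonal.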
Graphic similarity comparisons
Representation of scoring matrix
Representation of sequence s & t
METHODS FOR OPTIMAL ALIGNMENT
Global sequence alignment:
• Dynamic programming is used here; it is a method for breaking down the alignment of sequences into small parts.
• It is comparable to moving across a dot matrix and keeping track of all the matching pairs.
• Sequence alignment methods predate dot-matrix searches and all of the alignment methods in use today.
• Over the course of evolution, some positions undergo base or amino acid substitution, and bases or amino acids can be inserted or deleted.
Local alignment:
• The Smith-Waterman dynamic programming algorithm is used for local alignment.
• The algorithm gives the highest-scoring local match between two sequences.
• The alignment is arrived at by starting at the highest-scoring position in the scoring matrix and following a trace path back to a box that scores zero.
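EXAMPLE:
• Calculate a dynamic programming matrix and alignment for the sequences ATT and TTC. How many optimal alignments are there?
Matrix (rows: ATT, columns: TTC; each cell is the minimum number of edits):
        T  T  C
     0  1  2  3
  A  1  1  2  3
  T  2  1  1  2
  T  3  2  1  2
Alignment:
ATT

TTC
The other optimal alignment is:

ATT-

-TTC
There are therefore two optimal alignments, each with a distance of 2.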
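The matrix in this example can be reproduced with a short sketch that fills the full dynamic programming table and then counts the optimal traceback paths. It assumes unit costs for every substitution, insertion and deletion, which is consistent with the values in the matrix; the function names are hypothetical.

```python
def distance_matrix(s, t):
    """Full edit-distance table with unit costs; rows follow s, columns follow t."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = i
    for j in range(1, len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (s[i - 1] != t[j - 1]),  # match or substitution
                d[i - 1][j] + 1,                           # space in t
                d[i][j - 1] + 1,                           # space in s
            )
    return d


def count_optimal_alignments(s, t):
    """Number of distinct alignments whose cost equals the optimal distance."""
    d = distance_matrix(s, t)
    ways = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    ways[0][0] = 1
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            if i == 0 and j == 0:
                continue
            total = 0
            if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
                total += ways[i - 1][j - 1]
            if i > 0 and d[i][j] == d[i - 1][j] + 1:
                total += ways[i - 1][j]
            if j > 0 and d[i][j] == d[i][j - 1] + 1:
                total += ways[i][j - 1]
            ways[i][j] = total
    return ways[-1][-1]


if __name__ == "__main__":
    for row in distance_matrix("ATT", "TTC"):
        print(row)
    print(count_optimal_alignments("ATT", "TTC"))
```

Counting paths instead of listing them keeps the sketch short; a traceback that records the moves would recover the alignments themselves.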
Construction of the optimal alignment
Hidden Markov Model
• HMMs derive from Markov chains, which concentrate only on the sequence of states.
• Since the early 1970s
– Applied in speech recognition research
• The early 1990s
– Introduced to the bioinformatics community
– Sequence modeling, multiple alignment, protein structure prediction and profiling
Markov Chains
• Markov property of order 1
• Formally:
$$P(X_0, X_1, \ldots, X_t) = P(X_0)\,P(X_1 \mid X_0)\,P(X_2 \mid X_0, X_1) \cdots P(X_t \mid X_0, \ldots, X_{t-1})$$
$$= P(X_0)\,P(X_1 \mid X_0)\,P(X_2 \mid X_1) \cdots P(X_t \mid X_{t-1})$$

– State space = list of possible values for X
– Transition matrix = probabilities of moving from one value of X to another
– Initial distribution = probabilities of the initial value of X
• CS intuition
– Stochastic finite automaton (the slide's diagram shows three states: S0, S1 and S2)
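Markovian Sequence
• The states through which the chain passes form a sequence.
• Example: S0, S1, S1, S1, S0, S1, ...
$$P(\text{seq}) = P(S_0, S_1, S_1, S_1, S_0, S_1, \ldots) = \pi(S_0)\,P(S_1 \mid S_0)\,P(S_1 \mid S_1) \cdots$$
(The slide's state diagram over S0, S1 and S2 is labeled with the transition probabilities 0.5, 0.45 and 0.2.)

• Markov chain for generating a DNA sequence:

S = AGATCG
$$P(AGATCG) = \pi(A)\,P(G \mid A)\,P(A \mid G)\,P(T \mid A)\,P(C \mid T)\,P(G \mid C)$$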
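The product for a DNA sequence can be computed directly from an initial distribution and a transition matrix. The sketch below uses a uniform model (every probability 0.25) purely as a placeholder; those numbers and the function name are assumptions, not values from the slides.

```python
BASES = "ACGT"

# Hypothetical model: uniform initial distribution and uniform transitions.
INITIAL = {b: 0.25 for b in BASES}
TRANSITION = {a: {b: 0.25 for b in BASES} for a in BASES}


def sequence_probability(seq, initial, transition):
    """P(seq) under a first-order Markov chain: pi(x0) * product of P(x_i | x_{i-1})."""
    if not seq:
        return 1.0
    p = initial[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= transition[prev][curr]
    return p


if __name__ == "__main__":
    print(sequence_probability("AGATCG", INITIAL, TRANSITION))
```

With estimated rather than uniform transition probabilities, the same function distinguishes sequences that are typical of the model from those that are not.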
Hidden Markov Models (HMMs)
• The observed sequence is a probabilistic function of an underlying Markov chain.
– Example: HMM for a (noisy) DNA sequence
• The true state sequence is unknown, but the observation sequence gives us a clue.

Emission probabilities from each state
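Because the state sequence is hidden, the probability of an observed sequence is obtained by summing over all state paths, which the forward algorithm does efficiently. The two-state model below (an AT-rich and a GC-rich state) and all of its probabilities are invented for illustration only.

```python
STATES = ("AT_rich", "GC_rich")

# Hypothetical parameters, not taken from the slides.
INITIAL = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANSITION = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMISSION = {
    "AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}


def forward_probability(observations, states, initial, transition, emission):
    """P(observations) summed over all hidden state paths (forward algorithm)."""
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: initial[s] * emission[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emission[s][obs] * sum(alpha[r] * transition[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


if __name__ == "__main__":
    print(forward_probability("GGCA", STATES, INITIAL, TRANSITION, EMISSION))
```

For long sequences the products underflow, so practical implementations work with logarithms or rescale `alpha` at each step.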


MSA (MULTIPLE SEQUENCE ALIGNMENT)

It is a tool to determine levels of homology, and hence relatedness, between members of a series of globally related sequences.

Tools for MSA:

• Sum-of-pairs method
• Star alignment
• Two-step method (Clustal and Pileup approaches)
• Automated tools (Macaw, Meme etc.)
Global & Local MSA (multiple sequence alignment)
Example: SP (sum of pairs) method
• The sum-of-pairs function scores each position in the protein alignment, that is, each column, as the sum of the pairwise scores. For k sequences there are k(k - 1)/2 unique pairwise comparisons, excluding self-comparisons. Here, in column three, the score would be

SP-score(I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V)

where p(a, b) is the pairwise score of two amino acids.

Optimal alignment between k = 3 sequences, where k is the number of sequences.
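A sketch of the sum-of-pairs column score. The pairwise scoring function `pair_score` (1 for identical residues, 0 for a mismatch, -1 for a residue against a space) is a hypothetical stand-in for p(a, b); a real application would look the values up in a substitution matrix such as PAM or BLOSUM.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Hypothetical p(a, b): score for one pair of aligned symbols."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def n_pairs(k):
    """Number of unique pairwise comparisons among k sequences."""
    return k * (k - 1) // 2


def sp_column_score(column, score=pair_score):
    """Sum of the pairwise scores over all unique pairs in one alignment column."""
    return sum(score(a, b) for a, b in combinations(column, 2))


def sp_alignment_score(rows, score=pair_score):
    """SP score of a whole alignment given as equal-length strings."""
    return sum(sp_column_score(column, score) for column in zip(*rows))


if __name__ == "__main__":
    print(n_pairs(4), sp_column_score("I-IV"))
```

For the column (I, -, I, V) the six pairs are exactly the six terms of the SP-score formula.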


HMM for Multiple Alignment
• "Match" states are alignment sequence positions
• Position-specific deletion penalties
• Position-specific insertion frequencies
• A path through the states aligns a sequence to the model
Example of HMM Model

Transition probabilities (T) and emission probabilities (e)

Scoring in HMM model
• Score of aCCy along the path:
ln(0.4) + ln(0.3) + ln(0.46) + ln(0.6) + ln(0.97) + ln(0.5) + ln(0.015) + ln(0.73) + ln(0.01) + ln(1) = -13.25
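The score is the sum of the natural logarithms of the probabilities met along the path, which is the same as the logarithm of their product. The sketch below simply re-adds the ten values listed above; which of them are transition and which are emission probabilities is not recoverable from the slide, so they are kept as one flat list.

```python
import math

# The ten probabilities along the path for aCCy, in the order listed on the slide.
PATH_PROBABILITIES = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]


def path_log_score(probabilities):
    """Sum of natural logs of the transition and emission probabilities on a path."""
    return sum(math.log(p) for p in probabilities)


if __name__ == "__main__":
    print(round(path_log_score(PATH_PROBABILITIES), 2))
```

Working in log space avoids multiplying many small probabilities together, which would underflow for longer sequences.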
