Bioinformatics Note
Put simply, bioinformatics is the science of storing, retrieving and analysing large amounts of
biological information. It is a highly interdisciplinary field involving many different types of
specialists, including biologists, molecular life scientists, computer scientists and
mathematicians.
The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper to describe "the study
of informatic processes in biotic systems" and it found early use when the first biological
sequence data began to be shared. Whilst the initial analysis methods are still fundamental to
many large-scale experiments in the molecular life sciences, nowadays bioinformatics is
considered to be a much broader discipline, encompassing modelling and image analysis in
addition to the classical methods used for comparison of linear sequences or three-dimensional
structures.
The Nucleic Acid Database (NDB) was established in 1991 as a resource to assemble and distribute
structural information about nucleic acids. Over the years, the NDB has developed generalized
software for processing, archiving, querying and distributing structural data for nucleic
acid-containing structures.
The Gene Expression Database (GXD) is a community resource for gene expression information
from the laboratory mouse. GXD stores and integrates different types of expression data and
makes these data freely available in formats appropriate for comprehensive analysis. There is
particular emphasis on endogenous gene expression during mouse development.
EcoCyc describes the genome and the biochemical machinery of E. coli. It provides a
molecular and functional catalog of the E. coli cell to facilitate system-level
understanding. Its Pathway/Genome Navigator user interface visualizes the layout of
genes, of individual biochemical reactions, or of complete pathways. It also supports
computational studies of the metabolism, such as pathway design, evolutionary studies,
and simulations. A related metabolic database is MetaCyc.
UNIT-2
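What is alignment score?
An alignment score is a numerical measure of the quality of an alignment: the sum of the scores
assigned to each aligned pair of residues, less any penalties for gaps.
High-scoring Segment Pair (HSP): a local alignment with no gaps that achieves one of the
highest alignment scores in a given search.
Identity: the extent to which two (nucleotide or amino acid) sequences have the same residues
at the same positions in an alignment, often expressed as a percentage.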
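The two quantities above can be illustrated with a minimal Python sketch. The match, mismatch and gap values are arbitrary illustrative choices, not values from any particular tool, and percent identity is computed here over the full alignment length (other conventions for the denominator exist).

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # A gap character ("-") never counts as a match.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum of per-column scores under a simple match/mismatch/gap scheme (assumed values)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap        # one sequence has a gap in this column
        elif x == y:
            score += match      # identical residues
        else:
            score += mismatch   # different residues
    return score


# Toy alignment: 5 matches, 1 gap, 1 mismatch.
seq1 = "GATTACA"
seq2 = "GAT-ACT"
print(alignment_score(seq1, seq2))             # 5*1 + (-2) + (-1) = 2
print(round(percent_identity(seq1, seq2), 1))  # 5 of 7 columns, about 71.4
```

In practice the fixed match and mismatch values are replaced by entries looked up in a scoring matrix.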
Scoring Matrix
Scoring matrices are used to determine the relative score made by matching two characters in a
sequence alignment. These are usually log-odds of the likelihood of two characters being derived
from a common ancestral character. There are many flavors of scoring matrices for amino acid
sequences, nucleotide sequences, and codon sequences, and each is derived from the alignment
of "known" homologous sequences. These alignments are then used to determine the likelihood
of one character being at the same position in the sequence as another character.
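As a rough illustration of the log-odds idea, the sketch below compares how often two characters are observed aligned in trusted alignments with how often they would be aligned by chance. The function name, the frequencies and the scaling factor are assumptions made for this example; real matrices use frequencies estimated from large alignment collections.

```python
import math


def log_odds_score(q_ab: float, p_a: float, p_b: float, scale: float = 2.0) -> float:
    """Scaled log2 ratio of observed to expected pair frequency.

    q_ab : observed frequency of a aligned with b in homologous alignments
    p_a, p_b : background frequencies of the two characters
    scale : multiplier applied before rounding (2.0 gives half-bit units; assumed here)
    """
    expected = p_a * p_b  # chance of the pair if the positions were independent
    return scale * math.log2(q_ab / expected)


# Invented toy numbers: both characters have background frequency 0.1.
favoured = log_odds_score(q_ab=0.04, p_a=0.1, p_b=0.1)       # seen 4x more often than chance
disfavoured = log_odds_score(q_ab=0.0025, p_a=0.1, p_b=0.1)  # seen 4x less often than chance
print(round(favoured), round(disfavoured))  # prints: 4 -4
```

A positive score means the pair occurs more often in related sequences than expected by chance, a negative score means it occurs less often, and a score near zero means the pair is uninformative.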
PAM AND BLOSUM SERIES OF MATRICES
A point accepted mutation — also known as a PAM — is the replacement of a single amino
acid in the primary structure of a protein with another single amino acid, which is accepted by
the processes of natural selection. This definition does not include all point mutations in
the DNA of an organism. In particular, silent mutations are not point accepted mutations, nor are
mutations which are lethal or which are rejected by natural selection in other ways.
A PAM matrix is a matrix where each column and row represents one of the twenty standard
amino acids. In bioinformatics, PAM matrices are regularly used as substitution matrices to
score sequence alignments for proteins. Each entry in a PAM matrix indicates the likelihood of
the amino acid of that row being replaced with the amino acid of that column through a series of
one or more point accepted mutations during a specified evolutionary interval, rather than these
two amino acids being aligned due to chance. Different PAM matrices correspond to different
lengths of time in the evolution of the protein sequence.
PAM matrices were introduced by Margaret Dayhoff in 1978. The calculation of these matrices
was based on 1572 observed mutations in the phylogenetic trees of 71 families of closely
related proteins. The proteins to be studied were selected on the basis of having high similarity
with their predecessors. The protein alignments included were required to display at least 85%
identity.[6][8] As a result, it is reasonable to assume that any aligned mismatches were the result of
a single mutation event, rather than several at the same location.
Each PAM matrix has twenty rows and twenty columns — one representing each of the twenty
amino acids translated by the genetic code. The value in each cell of a PAM matrix is related to
the probability of a row amino acid before the mutation being aligned with a column amino acid
afterwards. From this definition, PAM matrices are an example of a substitution matrix.
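The link between PAM matrices and evolutionary time is usually described as repeated application of a one-step mutation probability matrix: PAMn is obtained by multiplying the PAM1 probability matrix by itself n times. The sketch below shows this on a hypothetical three-letter alphabet with invented probabilities (rows are the original residue, columns the replacement). It is not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1-like" matrix: each row sums to 1 and about 1% of residues change per step.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)

# Probability that the first residue type is still unchanged after 1 and after 250 steps.
print(TOY_PAM1[0][0], round(toy_pam250[0][0], 3))
```

The diagonal entries shrink as n grows, which is why high-numbered PAM matrices suit more distantly related sequences. The scoring matrix used in alignments is then derived from these probabilities as log-odds values.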
In bioinformatics, a BLOSUM (BLOcks SUbstitution Matrix) is a substitution
matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments
between evolutionarily divergent protein sequences. They are based on local alignments.
BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. They
scanned the BLOCKS database for very conserved regions of protein families (that do not have
gaps in the sequence alignment) and then counted the relative frequencies of amino acids and
their substitution probabilities. Then, they calculated a log-odds score for each of the 210
possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on
observed alignments; unlike the PAM matrices, they are not extrapolated from comparisons of
closely related proteins.
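The figure of 210 pairs, and the counting step that precedes the log-odds calculation, can be checked with a short sketch. The block of aligned segments below is invented, and the helper names are hypothetical. Sequence weighting and clustering are left out for simplicity.

```python
from collections import Counter
from itertools import combinations, combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def all_pairs():
    """All unordered amino acid pairs, including identical pairs such as (A, A)."""
    return list(combinations_with_replacement(AMINO_ACIDS, 2))


def count_column_pairs(block):
    """Count unordered residue pairs column by column in an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


print(len(all_pairs()))  # 190 mixed pairs + 20 identical pairs = 210

# Invented block of four aligned two-residue segments.
toy_block = ["AK", "AK", "AR", "SK"]
pair_counts = count_column_pairs(toy_block)
print(pair_counts[("A", "A")], pair_counts[("A", "S")], pair_counts[("K", "R")])  # 3 3 3
```

Dividing each count by the total number of pairs gives the observed pair frequencies that enter the log-odds score.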
Several sets of BLOSUM matrices exist using different alignment databases, named with
numbers. BLOSUM matrices with high numbers are designed for comparing closely related
sequences, while those with low numbers are designed for comparing distantly related sequences.
For example, BLOSUM80 is used for closely related alignments, and BLOSUM45 is used for
more distantly related alignments. The matrices were created by merging (clustering) all
sequences that were more similar than a given percentage into one single sequence and then
comparing only those merged sequences (which were all more divergent than the given percentage),
thus reducing the contribution of closely related sequences. The percentage used was appended
to the name, giving BLOSUM80, for example, where sequences that were more than 80%
identical were clustered.
BLOSUM r: the matrix built from blocks with less than r% similarity. For example, BLOSUM62 is
the matrix built using sequences with less than 62% similarity (sequences with ≥ 62% identity
were clustered). Note: BLOSUM62 is the default matrix for protein BLAST. Experimentation
has shown that the BLOSUM62 matrix is among the best for detecting most weak protein
similarities.
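The clustering step can be sketched as follows. This is a minimal illustration assuming single-linkage grouping (a segment joins a cluster if it reaches the threshold against any member). The toy block and function names are invented for the example.

```python
def identity(a: str, b: str) -> float:
    """Percent identity between two ungapped segments of equal length."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_block(seqs, threshold):
    """Group segments whose identity is >= threshold, using a small union-find."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


# Invented block: segments 1 and 2 are 83% identical, segment 3 is 67% identical to segment 2.
toy_block = ["AVLIKD", "AVLIKE", "AVMIRE", "GTWPRS"]
print(len(cluster_block(toy_block, 80)))  # 3 clusters at the 80% threshold
print(len(cluster_block(toy_block, 62)))  # 2 clusters at the 62% threshold
```

Lowering the threshold merges more segments, so fewer and more divergent clusters contribute to the counts. This is why low-numbered BLOSUM matrices reflect more distant relationships.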
Maximum likelihood
The maximum likelihood method uses standard statistical techniques for inferring probability
distributions to assign probabilities to particular possible phylogenetic trees. The method requires
a substitution model to assess the probability of particular mutations; roughly, a tree that requires
more mutations at interior nodes to explain the observed phylogeny will be assessed as having a
lower probability. This is broadly similar to the maximum-parsimony method, but maximum
likelihood allows additional statistical flexibility by permitting varying rates of evolution across
both lineages and sites. In fact, the method requires evolution at different sites and along
different lineages to be statistically independent. Maximum likelihood is thus well suited to
the analysis of distantly related sequences, but finding the optimal tree is believed to be
computationally intractable because the problem is NP-hard.
The "pruning" algorithm, a variant of dynamic programming, is often used to reduce the search
space by efficiently calculating the likelihood of subtrees. The method calculates the likelihood
for each site in a "linear" manner, starting at a node whose only descendants are leaves (that is,
the tips of the tree) and working backwards toward the "bottom" node in nested sets. However,
the trees produced by the method are only rooted if the substitution model is irreversible, which
is not generally true of biological systems. The search for the maximum-likelihood tree also
includes a branch length optimization component that is difficult to improve upon
algorithmically; general numerical optimization methods such as the Newton-Raphson method are
often used.
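The pruning idea can be sketched for a single site as below. The sketch assumes the Jukes-Cantor (JC69) substitution model, equal base frequencies at the root, and a hypothetical tree encoding in which a leaf is a base and an internal node is a tuple (left, left branch length, right, right branch length). None of these choices come from the text above. They are only one simple way to make the calculation concrete.

```python
import math

BASES = "ACGT"


def jc69_prob(i: str, j: str, t: float) -> float:
    """JC69 probability of base i becoming base j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial_likelihoods(node):
    """Likelihood of the data below a node, for each possible base at that node."""
    if isinstance(node, str):
        # Leaf: the observed base has likelihood 1, all others 0.
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    left, t_left, right, t_right = node
    l_left = partial_likelihoods(left)
    l_right = partial_likelihoods(right)
    result = {}
    for x in BASES:
        # Sum over the possible bases at each child, then combine the two subtrees.
        from_left = sum(jc69_prob(x, y, t_left) * l_left[y] for y in BASES)
        from_right = sum(jc69_prob(x, y, t_right) * l_right[y] for y in BASES)
        result[x] = from_left * from_right
    return result


def site_likelihood(tree) -> float:
    """Likelihood of one site: average the root partial likelihoods over the four bases."""
    root = partial_likelihoods(tree)
    return sum(0.25 * root[x] for x in BASES)


# Toy three-taxon tree with invented branch lengths: ((A, C), A).
toy_tree = (("A", 0.1, "C", 0.1), 0.05, "A", 0.2)
print(site_likelihood(toy_tree))
```

For a two-leaf tree, the result depends only on the total path length between the leaves, not on where the root sits along it. This is the behaviour of a reversible model that leaves the tree unrooted. The likelihood of a whole alignment is the product of the site likelihoods, usually computed as a sum of logarithms.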
Some tools that use maximum likelihood to infer phylogenetic trees from variant allele
frequency (VAF) data include AncesTree and CITUP.