Bioinformatics Note

Bioinformatics

Put simply, bioinformatics is the science of storing, retrieving and analysing large amounts of
biological information. It is a highly interdisciplinary field involving many different types of
specialists, including biologists, molecular life scientists, computer scientists and
mathematicians.

The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper to describe "the study
of informatic processes in biotic systems" and it found early use when the first biological
sequence data began to be shared. Whilst the initial analysis methods are still fundamental to
many large-scale experiments in the molecular life sciences, nowadays bioinformatics is
considered to be a much broader discipline, encompassing modelling and image analysis in
addition to the classical methods used for comparison of linear sequences or three-dimensional
structures.

Bioinformatics: Introduction and Applications


 With a large number of prokaryotic and eukaryotic genomes completely sequenced and
more forthcoming, access to the genomic information and synthesizing it for the
discovery of new knowledge have become central themes of modern biological research.
 Mining the genomic information requires the use of sophisticated computational tools.
 It therefore becomes imperative for the new generation of biologists to become familiar
with a field of study that is concerned with the careful storage, organization and indexing
of information in order to tackle the new challenges in the genomic era.
 Information science has been applied to biology to produce a field called bioinformatics.
 It is concerned with the state-of-the-art computational tools available to solve biological
research problems.
 Bioinformatics is an interdisciplinary field that develops methods and software tools
for understanding biological data.
 The development of bioinformatics as a field is the result of advances in both molecular
biology and computer science over the past 30–40 years.
 As an interdisciplinary field of science, bioinformatics combines biology, computer
science, information engineering, mathematics and statistics to analyze and interpret
biological data.
 The key areas of bioinformatics include biological databases, sequence alignment, gene
and promoter prediction, molecular phylogenetics, structural bioinformatics, genomics,
and proteomics.
Applications of Bioinformatics
Bioinformatics has not only become essential for basic genomic and molecular biology research,
but is also having a major impact on many areas of biotechnology and biomedical sciences. The main
uses of bioinformatics include:
 Bioinformatics plays a vital role in the areas of structural genomics, functional genomics,
and nutritional genomics.
 It covers emerging scientific research and the exploration of proteomes from the overall
level of intracellular protein composition (protein profiles), protein structure, protein-
protein interaction, and unique activity patterns (e.g. post-translational modifications).
 Bioinformatics is used for transcriptome analysis where mRNA expression levels can be
determined.
 Bioinformatics is used to identify and structurally modify a natural product, to design a
compound with the desired properties and to theoretically assess its therapeutic effects.
 Cheminformatics analysis includes analyses such as similarity searching, clustering,
QSAR modeling, virtual screening, etc.
 Bioinformatics is playing an increasingly important role in almost all aspects of drug
discovery and drug development.
 Bioinformatics tools are very effective in prediction, analysis and interpretation of
clinical and preclinical findings.

Biological Databases: Types and Importance


 One of the hallmarks of modern genomic research is the generation of enormous amounts
of raw sequence data.
 As the volume of genomic data grows, sophisticated computational methodologies are
required to manage the data deluge.
 Thus, the very first challenge in the genomics era is to store and handle the staggering
volume of information through the establishment and use of computer databases.
 A biological database is a large, organized body of persistent data, usually associated
with computerized software designed to update, query, and retrieve components of the
data stored within the system.
 A simple database might be a single file containing many records, each of which includes
the same set of information.
 The chief objective of the development of a database is to organize data in a set of
structured records to enable easy retrieval of information.
Example. A few popular databases are GenBank from NCBI (National Center for Biotechnology
Information), SwissProt from the Swiss Institute of Bioinformatics and PIR from the Protein
Information Resource.
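The "single file of records" idea described above can be sketched in a few lines of Python; the accession numbers, field names and sequences below are invented for illustration and do not follow a real GenBank schema:

```python
# A minimal sketch of a flat-file record database: each record carries the
# same set of fields, and a query scans for matching values.
# All field names and values here are made up for illustration.
records = [
    {"accession": "AB000001", "organism": "Homo sapiens", "sequence": "ATGGCG"},
    {"accession": "AB000002", "organism": "Mus musculus", "sequence": "ATGGTG"},
]

def query(records, field, value):
    """Retrieve every record whose given field equals the value."""
    return [r for r in records if r[field] == value]

hits = query(records, "organism", "Mus musculus")
print(hits[0]["accession"])  # AB000002
```

A real database system adds indexing so queries avoid a full scan, but the organize-then-retrieve objective is the same.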

The Nucleic Acid Database was established in 1991 as a resource to assemble and distribute
structural information about nucleic acids. Over the years, the NDB has developed generalized
software for processing, archiving, querying and distributing structural data for nucleic
acid-containing structures.

The Genome Sequence DataBase (GSDB) is a database of publicly available nucleotide
sequences and their associated biological and bibliographic information. Several notable changes
have occurred in the past year: GSDB stopped accepting data submissions from researchers;
ownership of data submitted to GSDB was transferred to GenBank; sequence analysis
capabilities were expanded to include Smith–Waterman and Frame Search; and Sequence
Viewer became available to Mac users. The content of GSDB remains up-to-date because
publicly available data is acquired from the International Nucleotide Sequence Database
Collaboration databases (IC) on a nightly basis. This allows GSDB to continue providing
researchers with the ability to analyze, query and retrieve nucleotide sequences in the database.

The Gene Expression Database (GXD) is a community resource for gene expression information
from the laboratory mouse. GXD stores and integrates different types of expression data and
makes these data freely available in formats appropriate for comprehensive analysis. There is
particular emphasis on endogenous gene expression during mouse development.

Gene Expression Query Forms


 The Gene Expression Literature Query Form allows you to search for references on
endogenous gene expression during development with search parameters such as genes
and ages analyzed, and assays used. To the best of our knowledge, you can query all
relevant publications from 1993 to the present for all pertinent journals and from 1990 to
the present for major developmental journals with this form.
 The Gene Expression Data Query Form allows you to search for detailed expression
results from RNA in situ, immunohistochemistry, in situ reporter (knock in), Northern
blot, Western blot, RT-PCR and RNAse and Nuclease S1 protection experiments using
many query parameters. Experimental results are described together with the probes,
specimens, and experimental conditions used and are complemented by digitized images
of the original data. Use its Differential Expression Search to query for genes expressed in
some anatomical structures or developmental stages but not in others.
 The Mouse Developmental Anatomy Browser allows you to navigate through the
extensive dictionary to locate specific anatomical structures and to obtain the expression
results associated with those structures.
 The RNA-Seq and Microarray Experiment Search allows you to quickly and reliably find
mouse expression studies of interest using GXD's standardized metadata annotations.
Search parameters include the age, anatomical structure, mutant alleles, strain and sex of
samples analyzed and the study type and key parameters of the experiments.
Data Acquisition
New expression data are made available on a weekly basis. These data are acquired from the
literature by our curatorial staff and via electronic submission. GXD encourages researchers to
submit their expression data to us. Data submission increases both the exposure of the work and
the amount of data available to the scientific community.
We have incorporated data from several large scale expression databases into GXD. Database
entries in GXD may link to the corresponding entries at the providers' websites. In this way,
users can take advantage of GXD's data integration and querying capabilities while having easy
access to additional resources at the providers' sites.
METABOLIC PATHWAYS DATABASES

BRENDA, the enzyme database, has comprehensive information on enzymes and
enzymatic reactions. It is one of several databases nested within the metabolic pathway
database set of the SRS5 sequence retrieval system at EBI.
KEGG Metabolic Pathways include graphical pathway maps for all known metabolic
pathways from various organisms. Ortholog group tables, containing conserved,
functional units in a molecular pathway or assembly as well as comparative lists of genes
for a given functional unit in different organisms, are also available.

The WIT Metabolic Reconstruction project produces metabolic reconstructions for
sequenced, or partially sequenced, genomes. It currently provides a set of over 25 such
reconstructions in varying states of completion. Over 2900 pathway diagrams are
available, associated with functional roles and linked to ORFs.

EcoCyc describes the genome and the biochemical machinery of E. coli. It provides a
molecular and functional catalog of the E. coli cell to facilitate system-level
understanding. Its Pathway/Genome Navigator user interface visualizes the layout of
genes, of individual biochemical reactions, or of complete pathways. It also supports
computational studies of the metabolism, such as pathway design, evolutionary studies,
and simulations. A related metabolic database is Metalgen.

Boehringer Mannheim - Biochemical Pathways is a searchable database of metabolic
pathways, enzymes, substrates and products. Based on a given search, it produces a
graphic representation of the relevant pathway(s) within the context of an enormous
metabolic map. Neighboring metabolic reactions can then be viewed through links to
adjacent maps.

UNIT-2
What is alignment score?
A High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the
highest alignment scores in a given search. Identity is the extent to which two (nucleotide or
amino acid) sequences have the same residues at the same positions in an alignment, often
expressed as a percentage.
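Percent identity as defined above is straightforward to compute for an ungapped alignment; the two short sequences below are invented examples:

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over an ungapped alignment of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must be the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

# 5 of the 7 aligned positions carry the same residue.
print(round(percent_identity("GATTACA", "GACTATA"), 1))  # 71.4
```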

Scoring Matrix
Scoring matrices are used to determine the relative score made by matching two characters in a
sequence alignment. These are usually log-odds of the likelihood of two characters being derived
from a common ancestral character. There are many flavors of scoring matrices for amino acid
sequences, nucleotide sequences, and codon sequences, and each is derived from the alignment
of "known" homologous sequences. These alignments are then used to determine the likelihood
of one character being at the same position in the sequence as another character.
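The scoring idea can be made concrete with a toy matrix: an ungapped alignment's score is the sum of the per-position log-odds entries. The matrix values below are invented for illustration, not taken from any published matrix:

```python
# Toy substitution matrix over two nucleotides; entries are illustrative
# log-odds scores (positive = the pair is seen together more often than
# expected by chance).
matrix = {("A", "A"): 4, ("A", "G"): -1, ("G", "A"): -1, ("G", "G"): 5}

def alignment_score(seq_a, seq_b, matrix):
    """Score an ungapped alignment by summing per-position matrix entries."""
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))

print(alignment_score("AGA", "AAA", matrix))  # 4 + (-1) + 4 = 7
```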
PAM AND BLOSUM SERIES OF MATRICES
A point accepted mutation — also known as a PAM — is the replacement of a single amino
acid in the primary structure of a protein with another single amino acid, which is accepted by
the processes of natural selection. This definition does not include all point mutations in
the DNA of an organism. In particular, silent mutations are not point accepted mutations, nor are
mutations which are lethal or which are rejected by natural selection in other ways.
A PAM matrix is a matrix where each column and row represents one of the twenty standard
amino acids. In bioinformatics, PAM matrices are regularly used as substitution matrices to
score sequence alignments for proteins. Each entry in a PAM matrix indicates the likelihood of
the amino acid of that row being replaced with the amino acid of that column through a series of
one or more point accepted mutations during a specified evolutionary interval, rather than these
two amino acids being aligned due to chance. Different PAM matrices correspond to different
lengths of time in the evolution of the protein sequence.
PAM matrices were introduced by Margaret Dayhoff in 1978. The calculation of these matrices
was based on 1,572 observed mutations in the phylogenetic trees of 71 families of closely
related proteins. The proteins studied were selected on the basis of having high similarity
with their predecessors. The protein alignments included were required to display at least 85%
identity. As a result, it is reasonable to assume that any aligned mismatches were the result of
a single mutation event, rather than several at the same location.
Each PAM matrix has twenty rows and twenty columns — one representing each of the twenty
amino acids translated by the genetic code. The value in each cell of a PAM matrix is related to
the probability of a row amino acid before the mutation being aligned with a column amino acid
afterwards. From this definition, PAM matrices are an example of a substitution matrix.
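The "different lengths of time" point can be illustrated numerically: a PAMn mutation-probability matrix is the PAM1 matrix multiplied by itself n times. In the sketch below, a toy two-letter alphabet stands in for the real 20x20 matrix, and the probabilities are invented:

```python
# Sketch: derive a higher-PAM mutation-probability matrix by repeated
# self-multiplication of a (toy, 2-state) PAM1 matrix.
def mat_mul(m, n):
    size = len(m)
    return [[sum(m[i][k] * n[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(m, exponent):
    result = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 identity
    for _ in range(exponent):
        result = mat_mul(result, m)
    return result

pam1 = [[0.99, 0.01],  # row: residue before; column: residue after
        [0.01, 0.99]]

pam100 = mat_pow(pam1, 100)
# After 100 PAM units the replacement probability has grown far beyond 1%.
print(round(pam100[0][1], 3))  # 0.434
```

Real PAM scores are then obtained by converting these probabilities into log-odds against the background amino-acid frequencies.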
In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution
matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments
between evolutionarily divergent protein sequences. They are based on local alignments.
BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. They
scanned the BLOCKS database for very conserved regions of protein families (that do not have
gaps in the sequence alignment) and then counted the relative frequencies of amino acids and
their substitution probabilities. Then, they calculated a log-odds score for each of the 210
possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on
observed alignments; they are not extrapolated from comparisons of closely related proteins like
the PAM Matrices.
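The counting step of the Henikoff procedure can be sketched directly: tally aligned residue pairs column by column within an ungapped block (the log-odds conversion would follow from these counts). The three-sequence block below is invented:

```python
# Count aligned residue pairs within each column of an ungapped block,
# as in the first step of building a BLOSUM matrix. The block is a toy.
block = ["AVA", "AVA", "TVA"]

def pair_counts(block):
    counts = {}
    for col in zip(*block):  # walk the alignment one column at a time
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                pair = tuple(sorted((col[i], col[j])))
                counts[pair] = counts.get(pair, 0) + 1
    return counts

counts = pair_counts(block)
# Column 1 gives AA once and AT twice; column 2 gives VV three times;
# column 3 gives AA three times.
print(counts)
```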

Several sets of BLOSUM matrices exist using different alignment databases, named with
numbers. BLOSUM matrices with high numbers are designed for comparing closely related
sequences, while those with low numbers are designed for comparing distantly related sequences.
For example, BLOSUM80 is used for closely related alignments, and BLOSUM45 is used for
more distantly related alignments. The matrices were created by merging (clustering) all
sequences that were more similar than a given percentage into one single sequence and then
comparing those sequences (that were all more divergent than the given percentage value) only;
thus reducing the contribution of closely related sequences. The percentage used was appended
to the name, giving BLOSUM80 for example where sequences that were more than 80%
identical were clustered.
BLOSUMr is the matrix built from blocks in which sequences sharing at least r% identity were
clustered; e.g., BLOSUM62 was built after clustering sequences with ≥ 62% identity.
Note: BLOSUM62 is the default matrix for protein BLAST. Experimentation
has shown that the BLOSUM62 matrix is among the best for detecting most weak protein
similarities.

In phylogenetics, maximum parsimony is an optimality criterion under which the phylogenetic
tree that minimizes the total number of character-state changes is to be preferred. Under the
maximum-parsimony criterion, the optimal tree will minimize the amount
of homoplasy (i.e., convergent evolution, parallel evolution, and evolutionary reversals). In other
words, under this criterion, the shortest possible tree that explains the data is considered best.
The principle is akin to Occam's razor, which states that—all else being equal—the simplest
hypothesis that explains the data should be selected. Some of the basic ideas behind maximum
parsimony were presented by James S. Farris in 1970 and Walter M. Fitch in 1971.
Maximum parsimony is an intuitive and simple criterion, and it is popular for this reason.
However, although it is easy to score a phylogenetic tree (by counting the number of character-
state changes), there is no algorithm to quickly generate the most-parsimonious tree. Instead, the
most-parsimonious tree must be found in "tree space" (i.e., amongst all possible trees). For a
small number of taxa (i.e., fewer than nine) it is possible to do an exhaustive search, in which
every possible tree is scored, and the best one is selected. For nine to twenty taxa, it will
generally be preferable to use branch-and-bound, which is also guaranteed to return the best tree.
For greater numbers of taxa, a heuristic search must be performed.
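The need for branch-and-bound and heuristics follows from how fast tree space grows: the number of distinct unrooted binary trees for n taxa is the double factorial (2n - 5)!!, which can be computed directly:

```python
# Number of distinct unrooted binary trees for n taxa: (2n - 5)!!,
# i.e. 1 * 3 * 5 * ... * (2n - 5).
def unrooted_tree_count(n_taxa):
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

print(unrooted_tree_count(5))   # 15
print(unrooted_tree_count(9))   # 135135 -- exhaustive search still feasible
print(unrooted_tree_count(20))  # ~2.2e20 -- heuristics are unavoidable
```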
Because the most-parsimonious tree is always the shortest possible tree, this means that—in
comparison to the "true" tree that actually describes the evolutionary history of the organisms
under study—the "best" tree according to the maximum-parsimony criterion will often
underestimate the actual evolutionary change that has occurred. In addition, maximum
parsimony is not statistically consistent. That is, it is not guaranteed to produce the true tree with
high probability, given sufficient data. As demonstrated in 1978 by Joe Felsenstein, maximum
parsimony can be inconsistent under certain conditions, such as long-branch attraction. Of
course, any phylogenetic algorithm could also be statistically inconsistent if the model it employs
to estimate the preferred tree does not accurately match the way that evolution occurred in that
clade. This is unknowable. Therefore, while statistical consistency is an interesting theoretical
property, it lies outside the realm of testability, and is irrelevant to empirical phylogenetic
studies.
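Scoring a given tree under parsimony is indeed easy: Fitch's small-parsimony algorithm counts the minimum number of state changes for one character in a single post-order pass. The tree shape and tip states below are invented for illustration:

```python
# Fitch's algorithm for one character on a fixed binary tree: at each
# internal node take the intersection of the child state sets if it is
# non-empty (no change needed), otherwise the union (one change).
def fitch(node, states):
    """node: leaf name or (left, right) pair; returns (state_set, changes)."""
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_changes = fitch(node[0], states)
    right_set, right_changes = fitch(node[1], states)
    total = left_changes + right_changes
    common = left_set & right_set
    if common:
        return common, total
    return left_set | right_set, total + 1

tree = (("human", "chimp"), ("mouse", "rat"))
site = {"human": "A", "chimp": "A", "mouse": "G", "rat": "A"}
_, changes = fitch(tree, site)
print(changes)  # 1: a single G<->A change explains this character
```

Summing this count over all characters gives the tree's parsimony score; the hard part, as noted above, is searching tree space for the tree that minimizes it.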

Maximum likelihood
The maximum likelihood method uses standard statistical techniques for inferring probability
distributions to assign probabilities to particular possible phylogenetic trees. The method requires
a substitution model to assess the probability of particular mutations; roughly, a tree that requires
more mutations at interior nodes to explain the observed phylogeny will be assessed as having a
lower probability. This is broadly similar to the maximum-parsimony method, but maximum
likelihood allows additional statistical flexibility by permitting varying rates of evolution across
both lineages and sites. In fact, the method requires that evolution at different sites and along
different lineages must be statistically independent. Maximum likelihood is thus well suited to
the analysis of distantly related sequences, but finding the maximum-likelihood tree is believed
to be computationally intractable because the underlying search problem is NP-hard.
The "pruning" algorithm, a variant of dynamic programming, is often used to reduce the search
space by efficiently calculating the likelihood of subtrees. The method calculates the likelihood
for each site in a "linear" manner, starting at a node whose only descendants are leaves (that is,
the tips of the tree) and working backwards toward the "bottom" node in nested sets. However,
the trees produced by the method are only rooted if the substitution model is irreversible, which
is not generally true of biological systems. The search for the maximum-likelihood tree also
includes a branch length optimization component that is difficult to improve upon
algorithmically; general global optimization tools such as the Newton-Raphson method are often
used.
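A minimal sketch of the pruning idea for a single site, under a toy two-state symmetric substitution model with P(same state) = (1 + e^(-2t))/2 for branch length t; the tree shape, branch lengths and tip states are all invented:

```python
import math

# Felsenstein-style pruning for one site: compute, at each node, the
# likelihood of the subtree below it conditional on each possible state,
# working from the tips toward the root.
STATES = (0, 1)

def p_transition(i, j, t):
    """Toy symmetric two-state model: P(same) = (1 + e^(-2t)) / 2."""
    same = 0.5 * (1.0 + math.exp(-2.0 * t))
    return same if i == j else 1.0 - same

def partial_likelihood(node, t_children=0.1):
    """Return [P(subtree | state 0), P(subtree | state 1)] at this node."""
    if isinstance(node, int):                 # a tip with an observed state
        return [1.0 if s == node else 0.0 for s in STATES]
    left, right = (partial_likelihood(child) for child in node)
    return [
        sum(p_transition(s, x, t_children) * left[x] for x in STATES)
        * sum(p_transition(s, y, t_children) * right[y] for y in STATES)
        for s in STATES
    ]

tree = ((0, 0), (0, 1))                       # observed tip states
L = partial_likelihood(tree)
likelihood = sum(0.5 * l for l in L)          # uniform prior at the root
print(round(likelihood, 4))  # ~0.0344
```

Repeating this per site and multiplying (or summing log-likelihoods) gives the tree's likelihood, which the branch-length and topology search described above then tries to maximize.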
Some tools that use maximum likelihood to infer phylogenetic trees from variant allelic
frequency data (VAFs) include AncesTree and CITUP.