
Introduction To Bioinformatics - BCHS 4214

Bioinformatics is a multidisciplinary field that utilizes computational tools for the storage, analysis, and visualization of biological data, integrating computer science, mathematics, and biology. Its objectives include developing algorithms for data analysis, creating databases for efficient data management, and applying these tools to understand biological processes and diseases. Key applications encompass sequence analysis, function analysis, structure analysis, and drug discovery, with significant contributions to genomics and microbial genome research.

Uploaded by

tohbriko

BCHS4214

Section 2: Introduction to Bioinformatics


Learning Objectives:
• Define bioinformatics and explain its relevance to biochemistry.
• Identify key applications of bioinformatics.

I- Introduction
Bioinformatics is the application of computational tools to the storage, retrieval, analysis,
interpretation, and visualization of biological data. It is a branch of science that integrates
computer science, mathematics and statistics, chemistry, and engineering with biology in order to
analyze, explore, integrate, and exploit biological data in research and development.

History of Bioinformatics
Bioinformatics emerged as a recognized discipline in the mid-1990s, but its foundations were laid
earlier:
• From 1965 to 1978, Margaret O. Dayhoff established the first database of protein sequences,
published as a series of volumes entitled "Atlas of Protein Sequence and Structure".
• From 1977, DNA sequences began to accumulate slowly in the literature, and it became more common
to predict protein sequences by translating sequenced genes than by sequencing proteins directly.
As a result, the number of uncharacterised proteins began to increase.
• By the early 1980s, there were enough DNA sequences to justify dedicated nucleotide sequence
databases. The EMBL Data Library was established in 1980 by the European Molecular Biology
Laboratory (EMBL), and is now maintained at the European Bioinformatics Institute (EBI); its aim
was to collect, organize and distribute nucleotide sequence data and related information. GenBank
followed in 1982 in the USA and is now maintained by the National Center for Biotechnology
Information (NCBI), which serves as a primary databank provider.
• In 1984, the National Biomedical Research Foundation (NBRF) established the Protein Information
Resource (PIR).
• In 1986, the DNA Data Bank of Japan (DDBJ) was established in Japan.
• All these data banks operate in close collaboration and regularly exchange data.
• Management and analysis of the rapidly accumulating sequence data required new computer
software and statistical tools. This attracted scientists from computer science and mathematics to
the fast-emerging field of bioinformatics.

II- Objectives of Bioinformatics


Some objectives of bioinformatics are as follows:
1. Development of new algorithms and statistics for assessing the relationships among large sets of
biological data.
2. Application of these tools for the analysis and interpretation of the various biological data.
3. Development of database for an efficient storage, access and management of the large body of
various biological information.
In other words, the first aim of bioinformatics is to store biological data in an organized form, as
a database. This gives researchers easy access to existing information and lets them submit new
entries. The data must be annotated to give them meaning or to assign functional characteristics.
The databases must also be able to correlate different hierarchies of information. For example:
GenBank for nucleotide and protein sequence information, Protein Data Bank for 3D macromolecular
structures, etc.
The second aim is to develop tools and resources that aid in the analysis of data. For example:
BLAST to find similar nucleotide/amino-acid sequences, ClustalW to align two or more
nucleotide/amino-acid sequences, Primer3 to design primers and probes for PCR techniques, etc.
The third and most important aim of bioinformatics is to use these computational tools to
analyze the biological data and interpret the results in a biologically meaningful manner.
The goal of bioinformatics is thus to provide scientists with a means to explain:
1. Normal biological processes
2. Malfunctions in these processes which lead to diseases
3. Approaches to improving drug discovery
To study how normal cellular activities are altered in different disease states, the biological data
must be combined to form a comprehensive picture of these activities. Therefore, the field of
bioinformatics has evolved such that the most pressing task now involves the analysis and
interpretation of various types of data, including nucleotide and amino acid sequences, protein
domains, and protein structures. The actual process of analyzing and interpreting data is referred
to as computational biology.
Important sub-disciplines within bioinformatics and computational biology include:
• Development and implementation of computer programs that enable efficient access to, use of, and
management of various types of information.
• Development of new algorithms (mathematical formulas) and statistical measures that assess
relationships among members of large data sets. For example, there are methods to locate a gene
within a sequence, to predict protein structure and/or function, and to cluster protein sequences
into families of related sequences.
The primary goal of bioinformatics is to increase the understanding of biological processes. What
sets it apart from other approaches, however, is its focus on developing and applying
computationally intensive techniques to achieve this goal. Examples include pattern recognition,
data mining, machine learning algorithms, and visualization. Major research efforts in the field
include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein
structure alignment, protein structure prediction, prediction of gene expression and
protein-protein interactions, genome-wide association studies, and the modeling of evolution and
cell division/mitosis. Bioinformatics now entails the creation and advancement of databases,
algorithms, computational and statistical techniques, and theory to solve formal and practical
problems arising from the management and analysis of biological data.
These tools are used in three areas:
• Molecular Sequence Analysis
• Molecular Structural Analysis
• Molecular Functional Analysis
Over the past few decades, rapid developments in genomic and other molecular research technologies,
together with developments in information technologies, have produced a tremendous amount of
information related to molecular biology. Bioinformatics is the name given to the mathematical and
computing approaches used to glean understanding of biological processes from this information.
Common activities in bioinformatics include mapping and analyzing DNA and protein sequences,
aligning DNA and protein sequences to compare them, and creating and viewing 3-D models of protein
structures. Bioinformatics draws on tools and techniques from three separate disciplines: molecular
biology (the source of the data to be analyzed), computer science (which supplies the hardware for
running analyses and the networks to communicate the results), and the data analysis algorithms
which strictly define bioinformatics. For this reason, a history of the field draws on events from
all three areas.

III- Applications of Bioinformatics


Bioinformatics brings together mathematics, statistics, computer science and information technology
to solve complex biological problems. These problems are usually at the molecular level and cannot
be solved by other means. The field has many applications and research areas, all of which are
carried out at the user level. The various bioinformatics applications can be categorized under the
following groups: Sequence Analysis, Function Analysis, Structure Analysis.

a) Sequence Analysis:
Sequence analysis covers all applications that analyze various types of sequence information and
compare similar types of information.
Sequence analysis uses sequencing information to identify the genes that encode regulatory sequences
or peptides. Many powerful tools and computers are available for analyzing the genomes of various
organisms. They also detect DNA mutations in an organism and identify sequences that are related.
Shotgun sequencing techniques are used for sequence analysis of numerous fragments of DNA: special
software detects the overlaps between fragments and assembles them.
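The overlap-and-assemble idea behind shotgun sequencing can be sketched in a few lines. This is a toy illustration only, assuming error-free fragments that are already in the right order; the function names, the `min_len` parameter and the example fragments are hypothetical, and real assemblers use far more robust methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, max(min_len, 1) - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge(a: str, b: str, min_len: int = 3) -> str:
    """Join two fragments on their overlap (plain concatenation if none)."""
    k = overlap(a, b, min_len)
    return a + b[k:]


# Hypothetical fragments, listed in the order in which they overlap.
fragments = ["ATGGCGTAC", "GTACCTTGA", "TTGAGGCTA"]

contig = fragments[0]
for fragment in fragments[1:]:
    contig = merge(contig, fragment)

print(contig)  # ATGGCGTACCTTGAGGCTA
```

Each fragment shares four bases with the next one, so the three reads collapse into a single 19-base contig.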
b) Function Analysis:
These applications analyze the function encoded within the sequences and help predict the
functional interactions between various proteins or genes. Expression analysis of various genes is
also a prime topic of current research.
c) Structure Analysis:
For RNA and proteins, structure plays a vital role in their interactions with other molecules. This
gave rise to a whole new branch termed Structural Bioinformatics, which is devoted to predicting the
structures of proteins and RNA and the possible roles of these structures.
d) Prediction of Protein Structure:
It is easy to determine the primary structure of a protein, that is its amino acid sequence, from
the DNA sequence that encodes it, but it is difficult to determine the secondary, tertiary or
quaternary structure. For this purpose, either experimental methods such as crystallography or
bioinformatics tools can be used to determine complex protein structures.
e) Comparative Genomics:
Comparative genomics is the branch of bioinformatics which determines the genomic structure and
function relation between different biological species. For this purpose, intergenomic maps are
constructed which enable the scientists to trace the processes of evolution that occur in genomes of
different species. These maps contain the information about the point mutations as well as the
information about the duplication of large chromosomal segments.
f) Health and Drug discovery:
The tools of bioinformatics are also helpful in drug discovery, diagnosis and disease management.
Complete sequencing of the human genome has enabled scientists to develop medicines and drugs which
can target more than 500 genes. Computational tools and newly identified drug targets have made
drug delivery easier and more specific, because only those cells which are diseased or mutated need
to be targeted. It is also easier to determine the molecular basis of a disease.
Drug development
At present all drugs on the market target only about 500 proteins. With an improved understanding
of disease mechanisms and using computational tools to identify and validate new drug targets,
more specific medicines that act on the cause, not merely the symptoms, of the disease can be
developed. These highly specific drugs promise to have fewer side effects than many of today's
medicines.
Microbial genome applications
Microorganisms are ubiquitous, that is they are found everywhere. They have been found surviving
and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are present in the
environment, our bodies, the air, food and water. Traditionally, use has been made of a variety of
microbial properties in the baking, brewing and food industries. The arrival of the complete genome
sequences and their potential to provide a greater insight into the microbial world and its capacities
could have broad and far reaching implications for environment, health, energy and industrial
applications. For these reasons, in 1994, the US Department of Energy (DOE) initiated the MGP
(Microbial Genome Project) to sequence genomes of bacteria useful in energy production,
environmental cleanup, industrial processing and toxic waste reduction. By studying the genetic
material of these organisms, scientists can begin to understand these microbes at a very fundamental
level and isolate the genes that give them their unique abilities to survive under extreme conditions.
Antibiotic resistance
Scientists have been examining the genome of Enterococcus faecalis, a leading cause of bacterial
infection among hospital patients. They have discovered a virulence region made up of a number of
antibiotic-resistance genes that may contribute to the bacterium's transformation from a harmless
gut bacterium to a menacing invader. The discovery of the region, known as a pathogenicity island,
could provide useful markers for detecting pathogenic strains and help to establish controls to
prevent the spread of infection in wards.

IV- Components of Bioinformatics


The three main components of bioinformatics are: Data, Databases and Database Mining Tools.
IV-1 Data
Data represents information or results from scientific experiments and other biological or
biochemical tests. These include: nucleic acid sequences (raw DNA sequences, genomic sequence tags
(GSTs), cDNA sequences, expressed sequence tags (ESTs), organellar DNA sequences, RNA sequences),
protein sequences, protein structures, metabolic pathways, gel pictures and literature.
IV-2 Databases
A database is a vast collection of data pertaining to a specific topic (e.g. nucleotide sequences,
protein sequences), held in an electronic environment. Databases are the heart of bioinformatics:
• They are computerized storehouses of data (records).
• They allow extraction of specified records.
• They allow adding, changing, removing, and merging of records.
• They use standardized formats.
Biological/Biochemical databases are libraries of life sciences information, collected from scientific
experiments, published literature, high-throughput experiment technology, and computational
analysis. They contain information from research areas including genomics, proteomics,
metabolomics, microarray gene expression, and phylogenetics. Information contained in biological
databases includes gene function, structure, localization (both cellular and chromosomal), clinical
effects of mutations as well as similarities of biological sequences and structures.
Why databases?
• They provide a means to handle and share large volumes of biological data.
• They support large-scale analysis efforts.
• They make data access easy and keep data up to date.
• They link knowledge obtained from various fields of biology and medicine.
Features
• Most databases have a web interface for searching data.
• The common mode of searching is by keywords.
• Users can choose to view the data or save it to their computer.
• Cross-references help to navigate easily from one database to another.
Biological databases can be broadly classified into sequence and structure databases. Nucleic acid
and protein sequences are stored in sequence databases, while structure databases store 3D
structures, mainly of proteins. These databases are important tools in assisting scientists to
analyze and explain a host of biological phenomena, from the structure of biomolecules and their
interactions, to the whole metabolism of organisms, to the evolution of species. This knowledge
helps in the fight against diseases, in the development of medications, in predicting certain
genetic diseases and in discovering basic relationships among species in the history of life.
Some examples of sequence databases: the main nucleic acid sequence databases are EMBL, GenBank and
DDBJ; the main protein sequence databases are SWISS-PROT, TrEMBL and GenPept. These databases are
often integrated with other databases.
Types of databases include: Sequence Databases, Structural Databases, Enzyme Databases, Microarray
Databases, Clinical Databases, Pathway Databases, Chemical Databases, Integrated Databases and
Bibliographic Databases.
For example, some nucleotide sequence databases are: NCBI-GenBank
(www.ncbi.nlm.nih.gov/GenBank), EMBL (www.ebi.ac.uk/embl), DDBJ (www.ddbj.nig.ac.jp). These three
databases are updated and exchange data on a daily basis, and their accession numbers are
consistent. There are no legal restrictions on the use of these databases. However, they do contain
some patented sequences.
For proteins, there are protein sequence databases, which serve as repositories of raw data and are
of two types (primary and secondary), and protein structure databases.
Some examples of primary protein databases are:
1. SWISS-PROT, which annotates the sequences: it describes protein functions, gives the domain
structures and indicates post-translational modifications. It provides a high level of annotation,
a minimum level of redundancy and a high level of integration with other databases.
2. TrEMBL, a computer-annotated supplement to SWISS-PROT that contains all the translations of EMBL
nucleotide entries not yet integrated in SWISS-PROT.
3. PIR: the Protein Information Resource (a division of NBRF in the US), maintained in collaboration
with the Munich Information Centre for Protein Sequences (MIPS) and the Japan International Protein
Information Database (JIPID). It supports sequence similarity searches and produces NRL-3D (a
database of sequences extracted from 3D structures in the Protein Data Bank).
Secondary databases compile and filter sequence data from different primary databases (db). These
db contain information derived from protein sequences and help the user determine whether a new
sequence belongs to a known protein family. Examples of secondary databases are:
1. PROSITE: a db of short protein sequence patterns and profiles that characterise biologically
significant sites in proteins. It is based on regular expressions describing characteristic sequences of
specific protein families and domains. It is part of SWISS-PROT, and maintained in the same way.
2. PRINTS: provides a compendium of protein fingerprints (groups of conserved motifs that
characterise a protein family). It now has a relational version, "PRINTS-S".
3. BLOCKS: a db of ungapped patterns in aligned protein families defined by PROSITE, found by
pattern searching and statistical sampling algorithms. The ungapped conserved segments are
determined automatically.
4. Pfam: a db of protein families defined as domains. For each domain, it contains a
multiple alignment of a set of defining sequences and the other sequences in SWISS-PROT and
TrEMBL that can be matched to the alignment.
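Because PROSITE patterns are essentially regular expressions, they can be translated into the regex syntax of a programming language and used to scan a sequence. The sketch below assumes the usual PROSITE conventions (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). The function name and the test sequence are hypothetical, and rarely used syntax such as anchors inside square brackets is not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):
        # {P} means "any residue except P"  ->  [^P]
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "2 to 4 residues"    ->  .{2,4}
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- and C-terminus
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


# The N-glycosylation site pattern: Asn, not Pro, Ser or Thr, not Pro.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)  # N[^P][ST][^P]

# Hypothetical protein fragment: "NGSA" matches, "NPTL" does not (Pro after Asn).
print(re.findall(regex, "MKNGSAANPTL"))  # ['NGSA']
```

Note that `re.findall` reports non-overlapping matches only; overlapping sites would need a lookahead-based search.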

Protein Structural Database


1. PDB (Protein Data Bank): the main db of 3D structures of biological macromolecules
(determined by X-ray crystallography and NMR). A PDB entry contains the atomic coordinates and
some structural parameters connected with the atoms or computed from the structures (secondary
structure). PDB is also the primary archive of all 3D structures of macromolecules such as
proteins, DNA, RNA and various complexes.
2. SCOP (Structural Classification of Proteins): this db was started with the objective of
classifying protein 3D structures in a hierarchical scheme of structural classes. It is based on
data in a primary db, but adds information through analysis and organization (such as
classification of 3D structures into a hierarchical scheme of folds, super-families and families).
3. CATH (Class, Architecture, Topology, Homologous super-family): CATH performs a hierarchical
classification of protein domain structures, clustering proteins at four major structural
levels.

Enzyme Database
- BRENDA (Braunschweig Enzyme Database) (http://www.brenda.uni-koeln.de/)
- ENZYME, a part of ExPASy (Expert Protein Analysis System, the proteomics server of the Swiss
Institute of Bioinformatics).

Clinical Databases
They generally contain information on humans.
- Human Gene Mutation Database, Cardiff, UK (http://www.hgmd.org): It registers known
mutations in the human genome and the diseases they cause.
- OMIM database (Online Mendelian Inheritance in Man, http://www.ncbi.nlm.nih.gov/Omim). The
OMIM database contains abstracts and texts describing genetic disorders to support genomics
efforts and clinical genetics. It provides gene maps and maps of known disorders in tabular listing
formats. It also supports keyword search.

Bibliographic Databases
They are used for searching for reference articles.
- PubMed, for example, enables users to do keyword searches, provides links to a selection of full
articles, and has text mining capabilities, e.g. it provides links to related articles and GenBank
entries, among others. It also contains more than 30 million abstracts of scientific publications.

IV-3 Database Mining Tools (Analysis Tools)


Using the various databases requires suitable search engines and analysis tools. These tools are
called database mining tools, and the process of using the data is known as database mining. One of
these tools, BLAST, is described here.
BLAST (Basic Local Alignment Search Tool) is used for homology analyses. It comes in several
variants:
• BLASTn: nucleotide query vs nucleotide database.
• BLASTp: protein query vs protein database.
• BLASTx: automatic 6-frame translation of nucleotide query vs protein database. If you have a DNA
sequence and you want to know what protein (if any) it encodes, you can perform a BLASTx search.
• tBLASTn: protein query vs automatic 6-frame translation of nucleotide database. You can use this
program to ask whether a DNA or EST db contains a nucleotide sequence encoding a protein that
matches your protein of interest.
• tBLASTx: automatic 6-frame translation of nucleotide query vs automatic 6-frame translation of
nucleotide database.
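The "automatic 6-frame translation" used by BLASTx, tBLASTn and tBLASTx can be illustrated with a minimal sketch: a DNA sequence is read in three frames on the forward strand and three on the reverse complement. The code assumes the standard genetic code; the function names and the frame labels (`+1` to `-3`) are illustrative choices, and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; '*' is a stop, 'X' an unknown codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Translate a DNA sequence in all six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
# +1 MA*
# +2 WP
# +3 GL
# -1 LGH
# -2 *A
# -3 RP
```

Only frame +1 of this short example starts with methionine and ends at a stop codon, which is the kind of signal a translated search exploits.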

V- Sequence Alignment
A sequence alignment is a way of arranging the sequences of DNA or protein to identify regions of
similarity that may be a consequence of functional, structural, or evolutionary relationships between
the sequences. Some essential terms are:
Similarity: The extent to which nucleotide or protein sequences are related. It is based upon
identity plus conservation.
Identity: The extent to which two sequences are invariant.
Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence
that preserve the physicochemical properties of the original residue.
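The relationship "similarity = identity + conservation" can be made concrete for two sequences that are already aligned. The sketch below is illustrative only: the grouping of amino acids by physicochemical property is a simplified assumption (real tools use substitution matrices such as BLOSUM or PAM), and skipping gap columns is just one of several conventions for the denominator.

```python
# Simplified, assumed grouping of amino acids by physicochemical property.
GROUPS = ["AVLIM", "FWY", "STNQ", "KRH", "DE", "C", "G", "P"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def identity_and_similarity(a: str, b: str) -> tuple:
    """Return (% identity, % similarity) for two aligned sequences.

    Columns containing a gap ('-') in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    identical = conserved = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            identical += 1
        elif x in GROUP_OF and GROUP_OF[x] == GROUP_OF.get(y):
            conserved += 1  # different residues with similar properties
    if compared == 0:
        return 0.0, 0.0
    identity = 100 * identical / compared
    similarity = 100 * (identical + conserved) / compared
    return identity, similarity


# Hypothetical aligned fragments: 3 identical, 2 conserved (K/R, D/E), 1 mismatch.
identity, similarity = identity_and_similarity("MKV-LDE", "MRVALEG")
print(f"identity {identity:.1f}%, similarity {similarity:.1f}%")
# identity 50.0%, similarity 83.3%
```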
There are two types of alignment: pairwise alignment and multiple alignment.
Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or
global alignments of two query sequences. Pairwise alignments can only be used between two
sequences at a time, but they are efficient to calculate and are often used for methods that do not
require extreme precision (such as searching a database for sequences with high similarity to a
query). The three primary methods of producing pairwise alignments are dot-matrix methods,
dynamic programming, and word methods.
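As an illustration of the dynamic programming approach, here is a minimal sketch of global pairwise alignment in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for the example; production tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment of two sequences with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    def pair_score(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)    # 1
print(top)     # ACGT
print(bottom)  # A-GT
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two sequence lengths, which is why this approach is practical for pairs of sequences but not for whole-database searches.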
Sequence analyses for relatedness:
• Homologs: similar sequences in different organisms derived from a common ancestor sequence.
• Orthologs: homologous sequences in different related species that arose from a common ancestral
gene during speciation. Orthologs are presumed to have similar biological function, e.g. human and
rat myoglobins both transport oxygen in muscle.
• Paralogs: homologous genes within the same organism, e.g. human α and β globins are paralogs.
Paralogs are the result of gene duplication events.
• Xenologs: similar sequences that have arisen out of horizontal transfer events (symbiosis,
viruses, etc.).

Multiple Sequence Alignment (MSA): A multiple sequence alignment (MSA) is a sequence
alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases,
the input set of query sequences are assumed to have an evolutionary relationship by which they
share a lineage and are descended from a common ancestor. From the resulting MSA, sequence
homology can be inferred and phylogenetic analysis can be conducted to assess the sequences'
shared evolutionary origins. Visual depictions of the alignment illustrate
mutation events such as point mutations (single amino acid or nucleotide changes) that appear as
differing characters in a single alignment column, and insertion or deletion mutations (indels or
gaps) that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence
alignment is often used to assess sequence conservation of protein domains, tertiary and secondary
structures, and even individual amino acids or nucleotides. Multiple sequence alignment also refers
to the process of aligning such a sequence set. Because three or more sequences of biologically
relevant length can be difficult and are almost always time-consuming to align by hand,
computational algorithms are used to produce and analyze the alignments. MSAs require more
sophisticated methodologies than pairwise alignment because they are more computationally
complex. Most multiple sequence alignment programs use heuristic methods rather than global
optimization because identifying the optimal alignment between more than a few sequences of
moderate length is prohibitively computationally expensive.
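Once a multiple alignment exists, assessing conservation amounts to looking down each column. The sketch below is a simplified illustration, assuming the alignment is given as equal-length strings; it reports the most common character in each column and its frequency. The function name and example alignment are hypothetical, and gaps are counted like any other character.

```python
from collections import Counter


def column_conservation(alignment: list) -> list:
    """For each alignment column, return (most common character, its fraction)."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):  # iterate over columns, not rows
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result


# Hypothetical alignment of three short DNA sequences.
alignment = ["ACG-T", "ACGAT", "ATGAT"]
profile = column_conservation(alignment)

consensus = "".join(residue for residue, _ in profile)
print(consensus)  # ACGAT
for position, (residue, fraction) in enumerate(profile, start=1):
    print(position, residue, f"{fraction:.2f}")
```

Columns 1, 3 and 5 are fully conserved here, column 2 carries a point mutation in one sequence, and column 4 contains a gap in one sequence.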

VI- PHYLOGENETIC ANALYSIS


How to construct a Phylogenetic tree?
• A phylogenetic tree is a visual representation of the relationship between different organisms,
showing the path through evolutionary time from a common ancestor to different descendants.
• Similarities and divergence among related biological sequences revealed by sequence alignment
often have to be rationalized and visualized in the context of phylogenetic trees. Thus, molecular
phylogenetics is a fundamental aspect of bioinformatics.
• Molecular phylogenetics is the branch of phylogeny that analyzes genetic, hereditary molecular
differences, predominantly in DNA sequences, to gain information on an organism's evolutionary
relationships.
• The similarity of biological functions and molecular mechanisms in living organisms strongly
suggests that species descended from a common ancestor. Molecular phylogenetics uses the structure
and function of molecules and how they change over time to infer these evolutionary relationships.
From these analyses, it is possible to determine the processes by which diversity among species has
been achieved. The result of a molecular phylogenetic analysis is expressed in a phylogenetic tree.

Steps in Phylogenetic Analysis: The basic steps in any phylogenetic analysis include:
1. Assemble and align a dataset
• The first step is to identify a protein or DNA sequence of interest and assemble a dataset
consisting of other related sequences.
• DNA sequences of interest can be retrieved using NCBI BLAST or similar search tools.
• Once sequences are selected and retrieved, a multiple sequence alignment is created.
• This involves arranging a set of sequences in a matrix to identify regions of homology.
• There are many websites and software programs, such as ClustalW, MSA, MAFFT, and T-Coffee,
designed to perform multiple sequence alignment on a given set of molecular data.
2. Build (estimate) phylogenetic trees from sequences using computational methods and stochastic
models
• To build phylogenetic trees, statistical methods are applied to determine the tree topology
and calculate the branch lengths that best describe the phylogenetic relationships of the aligned
sequences in a dataset.
• The most common computational methods applied include distance-matrix methods, and discrete data
methods, such as maximum parsimony and maximum likelihood.
• There are several software packages, such as PAUP, PAML and PHYLIP, that apply these most
popular methods.
3. Statistically test and assess the estimated trees
• Tree estimating algorithms generate one or more optimal trees.
• This set of possible trees is subjected to a series of statistical tests to evaluate
whether one tree is better than another, and whether the proposed phylogeny is reasonable.
• Common methods for assessing trees include the Bootstrap and Jackknife resampling methods, and
analytical methods, such as parsimony, distance, and likelihood.
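The starting point of the distance-matrix methods mentioned in step 2 is a table of pairwise evolutionary distances computed from the alignment. The sketch below is a minimal illustration, assuming aligned DNA sequences of equal length: it computes the proportion of differing sites (p-distance) and applies the Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3). The species names and sequences are invented for the example, and tree building itself (e.g. neighbour joining) is not shown.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences (gaps skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance; infinite when p reaches saturation (0.75)."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(sequences: dict) -> dict:
    """Corrected distance for every ordered pair of named sequences."""
    return {
        (m, n): jukes_cantor(p_distance(sequences[m], sequences[n]))
        for m in sequences
        for n in sequences
    }


# Invented aligned sequences for three hypothetical taxa.
sequences = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "mouse": "ACGAACCTAT",
}
distances = distance_matrix(sequences)
for (m, n), d in distances.items():
    if m < n:  # print each unordered pair once
        print(f"{m}-{n}: {d:.3f}")
```

In this toy dataset the human and chimp sequences differ at one site in ten while human and mouse differ at three, so a distance-based method would group human with chimp first.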

Bioinformatics Tools for Phylogenetic Analysis
• There are several bioinformatics tools and databases that can be used for phylogenetic analysis.
• These include PANTHER, P-Pod, Pfam, TreeFam, and the PhyloFacts structural phylogenomic
encyclopedia.
• Each of these databases uses different algorithms and draws on different sources for sequence
information, and therefore the trees estimated by PANTHER, for example, may differ significantly
from those generated by P-Pod or Pfam.
• As with all bioinformatics tools of this type, it is important to test different methods,
compare the results, then determine which database works best (according to consensus results) for
studies involving different types of datasets.
