
Bioinformatics

The document provides an introduction to bioinformatics including its definition, history, objectives, components like data and databases, and examples of nucleotide, protein, and structural databases. It discusses the emergence of bioinformatics as a field and the establishment of early biological databases to store accumulating sequence data.

Uploaded by

Ashutosh Gupta
Copyright
© © All Rights Reserved

Introduction to Bioinformatics

For Biochemistry Sem IV

Dr. Kusum Yadav
Assistant Professor
Department of Biochemistry
University of Lucknow, Lucknow
Mob-9452490044
Email-anukusum@gmail.com
Bioinformatics
Bioinformatics is a branch of science that integrates computer science, mathematics and statistics, chemistry and engineering for the analysis, exploration, integration and exploitation of biological sciences data in research and development.

Bioinformatics deals with the storage, retrieval, analysis and interpretation of biological data using computer-based software and tools.

Kusum Yadav, Department of Biochemistry, University of Lucknow


History of Bioinformatics
• Bioinformatics emerged as a field in the mid-1990s.
• From 1965 to 1978, Margaret O. Dayhoff established the first database of protein sequences, published annually as a series of volumes entitled "Atlas of Protein Sequence and Structure".
• From 1977, DNA sequences began to accumulate slowly in the literature, and it became more common to predict protein sequences by translating sequenced genes than by sequencing proteins directly.
• Thus the number of uncharacterised proteins began to increase.
• By the early 1980s there were enough DNA sequences to justify the establishment of nucleotide sequence databases. GenBank was established in the USA in 1982 and is now maintained by the National Center for Biotechnology Information (NCBI), which serves as a primary databank provider.
History of Bioinformatics (contd..)
• The EMBL Data Library was established by the European Molecular Biology Laboratory (EMBL) in 1980 and is now maintained by the European Bioinformatics Institute (EBI). The aim of this data library was to collect, organize and distribute nucleotide sequence data and related information.
• In 1984, the National Biomedical Research Foundation (NBRF) established the Protein Information Resource (PIR).
• In 1986, the DNA Data Bank of Japan (DDBJ) was established at the National Institute of Genetics, Japan.
• All these data banks operate in close collaboration and regularly exchange data.
• Management and analysis of the rapidly accumulating sequence data required new computer software and statistical tools.
• This attracted scientists from computer science and mathematics to the fast-emerging field of bioinformatics.



Objectives of Bioinformatics
1. Development of new algorithms and
statistics for assessing the relationships
among large sets of biological data.
2. Application of these tools for the analysis
and interpretation of the various biological
data.
3. Development of databases for efficient
storage, access and management of the large
body of biological information.



Components of Bioinformatics

Data

Database

Database Mining Tools



Data
• Nucleic acid sequences
– Raw DNA sequences
– Genomic sequence tags (GSTs)
– cDNA sequences
– Expressed sequence tags (ESTs)
– Organellar DNA sequences
– RNA sequences
• Protein sequences
• Protein structures
• Metabolic pathways
• Gel pictures
• Literature
Databases
A database is a vast collection of data pertaining to a
specific topic e.g. nucleotide sequence, protein
sequence etc., in an electronic environment.
• They are the heart of bioinformatics.
• A computerized storehouse of data (records).
• Allows extraction of specified records.
• Allows adding, changing, removing and merging of records.
• Uses standardized formats.
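
The record operations listed above (adding, changing, extracting and removing records) can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name, the column names and the two toy records are assumptions made for the example and do not reflect the schema of any real biological database.

```python
import sqlite3

# In-memory database; the table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (accession TEXT PRIMARY KEY, kind TEXT, residues TEXT)"
)

# Adding records (toy data, not real database entries).
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [("TOY0001", "nucleotide", "ATGGCC"), ("TOY0002", "protein", "MKWV")],
)

# Changing a record.
conn.execute(
    "UPDATE sequences SET residues = ? WHERE accession = ?",
    ("ATGGCCTAA", "TOY0001"),
)

# Extracting specified records.
proteins = conn.execute(
    "SELECT accession, residues FROM sequences WHERE kind = ?", ("protein",)
).fetchall()

# Removing a record, then listing what is left.
conn.execute("DELETE FROM sequences WHERE accession = ?", ("TOY0002",))
remaining = conn.execute("SELECT accession, residues FROM sequences").fetchall()
```

A fixed set of columns plays the role of the "standardized format": every record can be queried and updated in the same way.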



Databases: Types
Sequence Databases
Structural Databases
Enzyme Databases
Micro-array Databases
Clinical Databases
Pathway Databases
Chemical Databases
Integrated Databases
Bibliographic Databases
Nucleotide Sequence Databases

– NCBI - GenBank: (www.ncbi.nlm.nih.gov/GenBank)
– EMBL: (www.ebi.ac.uk/embl)
– DDBJ: (www.ddbj.nig.ac.jp)
The three databases are updated and exchanged on a daily basis, and the accession numbers are consistent across them.
There are no legal restrictions on the use of these databases. However, some sequences in the databases are patented.
Together they form the International Nucleotide Sequence Database Collaboration (INSDC).
National Center for Biotechnology Information (NCBI)



EMBL Database
European Molecular Biology Laboratory (EMBL):
• Maintained by the European Bioinformatics Institute (EBI)
• GSS (genome survey sequences)
• HTC (high-throughput cDNA sequences)
• HTG (high-throughput genomic sequences)
• EST (expressed sequence tags)
• Patents



European Bioinformatics Institute (EBI)



DDBJ (DNA Data Bank of Japan)

• Developed in 1986 as a collaboration with EMBL and GenBank.
• Produced, maintained and distributed by the National Institute of Genetics, Japan.
• Sequences are submitted via a web-based data submission tool.



GenomeNet, Japan



Other Databases
• ESTs - Expressed Sequence Tags
– dbEST (http://www.ncbi.nlm.nih.gov/dbEST)
• GenBank subset with additional EST-specific data
• Implemented in a Sybase relational database
• SNPs - Single Nucleotide Polymorphisms
– dbSNP (http://www.ncbi.nlm.nih.gov/SNP/)
• Very similar to dbEST in philosophy and implementation
• Many commercial databases
– Celera, Incyte, etc.
Protein Databases

Protein sequence databases
• Function as repositories of raw data; two types:
– Primary
– Secondary

Protein structure databases



Primary databases
1. SWISS-PROT: maintained by groups at the Swiss Institute of Bioinformatics (SIB).
• Annotates the sequences
• Describes protein functions
• Describes domain structures
• Describes post-translational modifications
• Provides a high level of annotation
• Minimal level of redundancy
• High level of integration with other databases
2. TrEMBL:
• Computer-annotated supplement to SWISS-PROT that contains all the translations of EMBL nucleotide entries not yet integrated into SWISS-PROT.
3. PIR: Protein Information Resource, a division of NBRF in the US.
• Collaborates with the Munich Information Centre for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID).
• One can search for entries
• One can run sequence similarity searches
• PIR also produces NRL-3D (a db of sequences extracted from 3D structures in PDB)
Swiss-Prot



Secondary databases
• Secondary db compile and filter sequence data from different primary db.
• These db contain information derived from protein sequences and help the user determine whether a new sequence belongs to a known protein family.
1. PROSITE:
• db of short protein sequence patterns and profiles that characterise biologically significant sites in proteins
• It is based on regular expressions describing characteristic sequences of specific protein families and domains.
• It is part of SWISS-PROT, and maintained in the same way
2. PRINTS:
• PRINTS provides a compendium of protein fingerprints (groups of conserved motifs that characterise a protein family)
• Now has a relational version, "PRINTS-S"
3. BLOCKS:
• Blocks are ungapped patterns in aligned protein families defined by PROSITE, found by pattern searching and statistical sampling algorithms.
• Automatically determined ungapped conserved segments
4. Pfam:
• db of protein families defined as domains
• For each domain, it contains a multiple alignment of a set of defining sequences and the other sequences in SWISS-PROT and TrEMBL that can be matched to the alignment.
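
To make the "regular expressions" behind PROSITE concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression. It covers only the common pattern elements (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) repeats, and the < and > terminus anchors). The function name prosite_to_regex is hypothetical, and this is not the scanning software used by PROSITE itself.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(2) or x(2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":               # x stands for any residue
            core = "."
        elif core.startswith("{"):    # {P} means any residue except P
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# The N-glycosylation site pattern, a widely quoted PROSITE example.
n_glyc = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(n_glyc, "AANGSAA")    # toy sequence for illustration
```

Here n_glyc is the string N[^P][ST][^P], and the toy sequence matches at the residues NGSA.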
Protein Structural Database
1. PDB (Protein Data Bank):
• Main db of 3D structures of biological macromolecules (determined by X-ray crystallography and NMR).
• PDB entries contain the atomic coordinates, and some structural parameters connected with the atoms or computed from the structures (secondary structure).
• PDB provides the primary archive of all 3D structures for macromolecules such as proteins, DNA, RNA and various complexes.

2. SCOP (Structural Classification of Proteins):
• This db was started with the objective of classifying protein 3D structures in a hierarchical scheme of structural classes.
• It is based on data in a primary db, but adds information through analysis and organization (such as classification of 3D structures into a hierarchical scheme of folds, super-families and families)

3. CATH (Class, Architecture, Topology, Homologous super-family):
• CATH performs hierarchical classification of protein domain structures.
• Clusters proteins at four major structural levels
Enzyme Database

• BRENDA (BRaunschweig ENzyme DAtabase)
• ENZYME, a part of ExPASy (Expert Protein Analysis System, the proteomic server of the Swiss Institute of Bioinformatics)



Clinical Databases
Generally contain information from the Human

Human Gene Mutation Database, Cardiff, UK:
http://www.hgmd.org
Registers known mutations in the human genome and the diseases they cause.

OMIM database
Online Mendelian Inheritance in Man
http://www.ncbi.nlm.nih.gov/Omim

The OMIM database contains abstracts and texts describing genetic disorders to support genomics efforts and clinical genetics. It provides gene maps, and known disorder maps in tabular listing formats. Contains keyword search.
Kyoto Encyclopedia of Genes and Genomes (KEGG)
www.genome.jp/kegg/
A database and associated software that integrates several databases, such as:
• Pathway database
• Genes database
• Genome database
• Drug database
• Reaction database
• Compound database
• KO database, etc.
Bibliographic Databases
Used for searching for reference articles

PubMed
1. It enables users to run keyword searches, provides links to a selection of full articles, and has text mining capabilities, e.g. it provides links to related articles and GenBank entries, among others.
2. It contains entries for more than 30 million abstracts of scientific publications.



Database Mining Tools (Analysis Tools)
Utilization of various databases requires the use of suitable search engines and analysis
tools. These tools are called Database mining tools and the process of data utilization is
known as database mining. Some Analysis Tools are as follows:
Analysis Tool                  Function
BLAST (NCBI, USA)              Used to analyse sequence information and detect homologous sequences
ENTREZ (NCBI, USA)             Used to access literature (abstracts), sequence and structure db
DNAPLOT (EBI, UK)              Sequence alignment tool
LOCUS LINK (NCBI, USA)         Accessing information on homologous genes
LIGAND (GenomeNet, Japan)      A chemical db; allows search for a combination of enzymes and links to all publicly accessible db
BRITE (GenomeNet, Japan)       Biomolecular relations information transmission and expression db; links to all publicly accessible db
TAXONOMY BROWSER (NCBI, USA)   Taxonomic classification of various species as well as genetic information
STRUCTURE                      Supports the Molecular Modelling Database (MMDB) and software tools for structure analysis
BLAST
(Basic Local Alignment Search Tool)
for Homology Analyses
• BLASTn
– nucleotide query vs nucleotide database
• BLASTp
– protein query vs protein database
• BLASTx
– automatic 6-frame translation of nucleotide query vs protein database
– If you have a DNA sequence and you want to know what protein (if any) it encodes, you can perform a BLASTx search.
• tBLASTn
– protein query vs automatic 6-frame translation of nucleotide database
– You can use this program to ask whether a DNA or EST db contains a nucleotide sequence encoding a protein that matches your protein of interest.
• tBLASTx
– automatic 6-frame translation of nucleotide query vs automatic 6-frame translation of nucleotide database
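
The "automatic 6-frame translation" used by BLASTx, tBLASTn and tBLASTx can be illustrated as follows. This is a minimal sketch, not BLAST's own code: it assumes the standard genetic code, an unambiguous A/C/G/T sequence, and hypothetical helper names (translate, six_frame_translation). Three frames are read from the given strand and three from its reverse complement.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")  # toy sequence
```

For the toy sequence, frame +1 reads ATG GCC TAA and gives "MA*". Comparing six query frames against a protein database gives 6 comparisons (BLASTx); comparing six query frames against six database frames gives 6 x 6 = 36 (tBLASTx).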
BLAST
(Basic Local Alignment Search Tool)
for Homology Analyses

Program    Input (query)   Database   Comparisons
BLASTn     DNA             DNA        1
BLASTp     protein         protein    1
BLASTx     DNA             protein    6
tBLASTn    protein         DNA        6
tBLASTx    DNA             DNA        36
SEQUENCE ALIGNMENT
What is Sequence Alignment ?
A sequence alignment is a way of arranging the sequences of DNA
or protein to identify regions of similarity that may be a
consequence of functional, structural, or evolutionary
relationships between the sequences.

Definitions

Similarity
The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.

Identity
The extent to which two sequences are invariant.

Conservation
Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
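
The three definitions above can be turned into numbers for an already aligned pair of sequences. The sketch below is an illustration under stated assumptions: the residue groups used to decide "conservation" are a simple hand-made grouping chosen for this example (real tools use substitution matrices), and percentages are taken over the aligned, non-gap columns, which is only one of several conventions.

```python
# Assumed physico-chemical groups (illustrative, not a standard scoring matrix).
GROUPS = ["AVLIM", "FWY", "STNQ", "KRH", "DE"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def identity_and_similarity(a: str, b: str, gap_chars: str = "-."):
    """Return (percent identity, percent similarity) of two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    identical = conserved = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in gap_chars or y in gap_chars:
            continue                      # skip columns containing a gap
        compared += 1
        if x == y:
            identical += 1                # identity: same residue
        elif x in GROUP_OF and GROUP_OF[x] == GROUP_OF.get(y):
            conserved += 1                # conservation: same property group
    if compared == 0:
        return 0.0, 0.0
    identity = 100 * identical / compared
    similarity = 100 * (identical + conserved) / compared
    return identity, similarity


# Eight aligned columns taken from the RBP / lactoglobulin example.
identity, similarity = identity_and_similarity("GTWYAMAK", "GTWYSLAM")
```

In this fragment five of eight columns are identical and one more (M/L) is conserved under the assumed grouping, so similarity (identity plus conservation) is higher than identity.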



Types of alignment

• Pairwise alignment
• Multiple Alignment



Pairwise alignment

• The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
• Pairwise sequence alignment is the most
fundamental operation of bioinformatics.
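
One classical way to line up two sequences is global alignment by dynamic programming (the Needleman-Wunsch algorithm). The slides do not name a specific method, so the sketch below is offered only as an example; the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions, whereas real protein alignments use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""

    def pair_score(x: str, y: str) -> int:
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),  # align residues
                score[i - 1][j] + gap,                                  # gap in b
                score[i][j - 1] + gap,                                  # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")  # toy sequences
```

For the toy pair, the best alignment has six matches and one gap. Several alignments can share the same best score; the traceback returns one of them.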



Pairwise alignment of retinol-binding protein 4
and β-lactoglobulin

  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
          |||  |         |         |     ||||  |
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
      | |   |   |        |  |    ||  |    ||        |
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
     || ||         |          ||||  |                |
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
      |       |     |      ||          | || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin

Legend: | marks an identical residue; dots within a sequence row mark gaps. Similarity marks (: and .) are not shown.



Pairwise alignment of retinol-binding protein
and β-lactoglobulin: identity

Identity (bar): a position where both sequences carry the same residue is marked with a bar (|) between the two rows.


Pairwise alignment of retinol-binding protein
and β-lactoglobulin: similarity

Very similar (two dots): marked with a colon (:) between the two rows.
Somewhat similar (one dot): marked with a single dot (.) between the two rows.


Pairwise alignment of retinol-binding protein
and β-lactoglobulin: gaps

Internal gap: a gap inside a sequence, with residues on both sides.
Terminal gap: a gap at the start or end of a sequence.
Sequence Analyses for relatedness
• Homologs: similar sequences in different organisms derived
from a common ancestor sequence.
• Orthologs : homologous sequences in different related species
that arose from a common ancestral gene during speciation.
Orthologs are presumed to have similar biological function.
e.g. human and rat myoglobins both transport oxygen in
muscle
• Paralogs: homologous genes within the same organism
e.g. human α and β globins are paralogs. Paralogs are the
result of gene duplication events
• Xenologs: similar sequences that have arisen out of horizontal
transfer events (symbiosis, viruses, etc)
Multiple sequence Alignment
• Partial or complete alignment of three or
more related protein or nucleotide sequences
• Conserved domain analysis
• Primer Designing
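
Conserved-domain analysis and primer design both start from finding the columns of a multiple alignment that barely vary. The sketch below assumes the alignment has already been produced by some alignment tool and is supplied as equal-length strings with "-" for gaps; the function names and the three toy sequences are made up for the example.

```python
from collections import Counter


def column_conservation(alignment, gap: str = "-"):
    """For each column, return (most common residue, fraction of rows carrying it)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != gap)
        if not counts:                       # column made only of gaps
            result.append((gap, 0.0))
            continue
        residue, count = counts.most_common(1)[0]
        result.append((residue, count / len(column)))
    return result


def conserved_positions(alignment, threshold: float = 1.0):
    """Zero-based indices of columns whose top residue reaches the threshold."""
    return [
        index
        for index, (_, fraction) in enumerate(column_conservation(alignment))
        if fraction >= threshold
    ]


toy_alignment = ["GTWYA", "GTWYS", "GSWYA"]   # toy data, not a real alignment
fully_conserved = conserved_positions(toy_alignment)
```

With the default threshold only columns identical in every sequence are reported; lowering the threshold (for example to 0.8) tolerates some variation.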



Tools of Multiple Alignment

• CLUSTALW
• T-Coffee
• MUSCLE
• KALIGN
• CLC & GCG WorkBench



Various categories of Analyses
1. Analysis of a single gene (protein) sequence
– Similarity with other known genes
– Phylogenetic trees; evolutionary relationships
– Identification of well-defined domains in the
sequence
– Sequence features (physical properties, binding
sites, modification sites)
– Prediction of sub-cellular localization
– Prediction of protein secondary and tertiary
structures
2. Analysis of whole genomes
– Location of various genes on the chromosomes,
correlation with function or evolution
– Expansion/duplication of gene families
– Which gene families are present, which
missing?
– Presence or absence of biochemical pathways
– Identification of "missing" enzymes
– Large-scale events in the evolution of organisms



3. Analysis of genes and genomes with respect
to function (Functional Annotation)
– Transcriptomics: expression analysis; microarray data (mRNA/transcript analyses)
– Proteomics: protein qualitative and quantitative analyses, covalent modifications
– Comparison and analysis of biochemical pathways
– Deletion or mutant genotypes vs phenotypes
– Identification of essential genes, or genes involved in specific processes



4. Comparative genomics
– Identifying pathogen-specific unique targets for designing novel drugs.



Phylogenetic Analysis
• Phylogenetic trees aim to reconstruct the history of successive divergence events that took place during evolution between the sequences considered and their common ancestor.
• Nucleic acid and protein sequences are used to infer phylogenetic relationships.
• Molecular phylogeny methods propose phylogenetic trees from a given set of aligned sequences.
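
Distance-based tree building is one family of molecular phylogeny methods, and its first step is a matrix of pairwise distances between the aligned sequences. The sketch below computes the proportion of differing sites (p-distance) and applies the Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3), for nucleotide data. The choice of this model is an assumption made for illustration, the three toy sequences are invented, and actual tree construction is left to dedicated phylogenetics software.

```python
import math
from itertools import combinations


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Proportion of differing sites among columns without gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance; infinite when p reaches 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(sequences: dict) -> dict:
    """Corrected distance for every pair of named, aligned sequences."""
    return {
        (first, second): jukes_cantor(p_distance(sequences[first], sequences[second]))
        for first, second in combinations(sequences, 2)
    }


toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}   # invented sequences
distances = distance_matrix(toy)
```

In the toy data A and B differ at one site out of eight, so they come out as the closest pair; a distance-based method would join them first.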



Phylogenetic Analysis Tools

• MEGA
• PHYLIP
• PAUP
• Treeview
• ODEN
• PHYLOWIN
• TREECON
• DENDRON
