Module 4 Merged
TOPIC: BIOINFORMATICS
RESOURCES: NCBI, EBI,
EXPASY, RCSB.
Selected resources
Broad Institute of Harvard and MIT
• NCBI, together with EBI and CIB, forms the International Nucleotide Sequence
Database Collaboration (INSDC), which acts as the chief working unit and
information centre. The three collaborating databases are:
• GenBank
• European Molecular Biology Laboratory (EMBL)
• DNA Data Bank of Japan (DDBJ)
A Science "Primer" yields access to general
definitions and introductory information
regarding the branches of science included
in bioinformatics.
Many bioinformatics terms are defined in this
section in a clear-cut and basic manner,
making this Primer an excellent first resource.
"Databases and Tools" is a complete and
well-ordered listing of accessible information.
EMBL's European Bioinformatics Institute (EMBL-EBI)
https://www.ebi.ac.uk/
EMBL-EBI makes the world’s public biological
data freely available to the scientific community
via a range of services and tools, performs basic
research and provides professional training in
bioinformatics.
It is part of the European Molecular Biology
Laboratory (EMBL), an international, innovative
and interdisciplinary research organization
funded by over 20 member states, prospect
and associate member states.
It is situated on the Wellcome Genome Campus in
Hinxton, Cambridge, UK, one of the world’s
largest concentrations of scientific and
technical expertise in genomics.
What they do….
provide freely available data and bioinformatics services to the scientific
community.
contribute to the advancement of biology through investigator-driven
research.
provide advanced bioinformatics training to scientists at all levels.
help disseminate cutting-edge technologies to industry.
support the coordination of biological data provision throughout Europe.
The European Nucleotide Archive and the protein sequence
resource UniProt (then known as Swiss-Prot–TrEMBL) were the original
EMBL-EBI databases. Since then, the EMBL-EBI has played a major part
in the bioinformatics revolution.
Tools & Data Resources
Clustal Omega
Multiple sequence alignment of DNA or protein sequences. Clustal Omega
replaces the older ClustalW alignment tools
InterProScan
InterProScan searches sequences against InterPro's predictive protein
signatures.
BLAST [protein]
Fast local similarity search tool for protein sequence databases.
BLAST [nucleotide]
Fast local similarity search tool for nucleotide sequence databases
HMMER
Fast sensitive protein homology searches using profile hidden Markov
models (HMMs) for querying against both sequence and HMM target
databases
Tools & Data Resources
Ensembl
Genome browser, API and database, providing access to reference genome annotation.
UniProt
A comprehensive resource for protein sequence and functional annotation
PDBe
The European resource for the collection, organisation and dissemination of 3D structural data (from
PDB and EMDB) on biological macromolecules and their complexes.
Europe PMC
A database to search the worldwide life sciences literature.
Expression Atlas
An added-value database that shows which genes/proteins are expressed under which conditions,
and how expression differs between conditions.
ChEMBL
An open data resource of binding, functional and ADMET bioactivity data.
Browse by type
DNA & RNA
Gene Expression
Proteins
Structures
Systems
Chemical biology
Ontologies
Literature
Cross domain
ExPASy SIB (https://www.expasy.org/)
Swiss Bioinformatics resource portal
About Expasy
Expasy is the bioinformatics resource portal of the SIB Swiss Institute of Bioinformatics.
It is an extensible and integrative portal which provides access to over 160 databases and
software tools, developed by SIB Groups and supporting a range of life science and clinical
research domains, from genomics, proteomics and structural biology, to evolution and
phylogeny, systems biology and medical chemistry.
The Expasy search engine
Expasy allows you to seamlessly
1) query in parallel a subset of SIB databases through a single search, and to
2) surface related information and knowledge from the complete set of >160 resources on the
portal. Expasy provides information that is automatically aligned with the most recent release
of each resource, thereby ensuring up-to-date information.
Some history
Expasy was created in August 1993 - the dawn of the internet
era. At that time, it was referred to as 'ExPASy, the Expert Protein
Analysis System' as proteins were its primary focus. It was the first
life science website - and among the 150 very first websites in the
world!
In June 2011, it became the SIB Expasy Bioinformatics Resources
Portal: a diverse catalogue of bioinformatics resources
developed by SIB Groups.
The current version of Expasy was released in July 2020 following
a massive user study and taking into account design, user
experience and architecture aspects: we thank all participants
for their help in shaping Expasy 3.0!
RCSB-PDB (https://www.rcsb.org/)
The Protein Data Bank (PDB) was established as the 1st open access digital data
resource in all of biology and medicine. It is today a leading global
resource for experimental data central to scientific discovery.
Through an internet information portal and downloadable data archive, the PDB provides
access to 3D structure data for large biological molecules (proteins, DNA, and RNA).
These are the molecules of life, found in all organisms on the planet.
Knowing the 3D structure of a biological macromolecule is essential for understanding its
role in human and animal health and disease, its function in plants and food and energy
production, and its importance to other topics related to global prosperity and
sustainability.
A Structural View of Biology
This resource is powered by the Protein Data Bank archive: information about the 3D
shapes of proteins, nucleic acids, and complex assemblies that helps students and
researchers understand all aspects of biomedicine and agriculture, from protein synthesis
to health and disease.
As a member of the wwPDB, the RCSB PDB curates and annotates PDB data.
The RCSB PDB builds upon the data by creating tools and resources for research and
education in molecular biology, structural biology, computational biology, and beyond.
MODULE-4
TOPIC: Databases , classifications and
file formats
What is a database?
• A database is a convenient system to
properly store, search and retrieve any
type of data.
• A database helps to easily handle and share
large amounts of data and supports large-
scale analysis through easy access and data
updating.
What is a biological database?
• Biological databases are libraries of life sciences
information, collected from scientific
experiments, published literature, high-
throughput experiment technology and
computational analysis.
• They contain information from genomics,
proteomics and microarray gene expression.
What is expected from a database?
• Sequence, functional, structural information,
related bibliography
• Well Structured and Indexed information
• Well cross-referenced (with other databases)
• Periodically updated
• Tools for analysis and visualization
Database Architecture
Information system (e.g. Google, Entrez, SRS)
Query system
Thus, the very first challenge in the genomics era is to store and
handle the staggering volume of information through the establishment
and use of computer databases.
• Examples
InterPro (protein families, motifs and domains)
• The European Molecular Biology Laboratory (EMBL) is a molecular biology research
institution supported by 22 member states, four prospect and two associate member states.
• EMBL was created in 1974 and is an intergovernmental organisation funded by public
research money from its member states.
• The Laboratory operates from five sites: the main laboratory in Heidelberg, and
outstations in Hinxton (the European Bioinformatics Institute (EBI), in England),
Grenoble (France), Hamburg (Germany), and Monterotondo (near Rome).
• EMBL groups and laboratories perform basic research in molecular biology and
molecular medicine as well as training for scientists, students and visitors.
• Israel is the only Asian state that has full membership.
• The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at
the European Bioinformatics Institute (EBI).
• It incorporates and distributes nucleotide sequences from public sources.
• The database is a part of an international collaboration with DDBJ (Japan) and GenBank
(USA).
• Data are exchanged between the collaborating databases on a daily
basis.
• The web-based tool, Webin, is the preferred system for individual submission
of nucleotide sequences, including Third Party Annotation (TPA) and
alignment data.
• Automatic submission procedures are used for submission of data from large-
scale genome sequencing.
• The latest data collection can be accessed via FTP, email and WWW
interfaces.
• The EBI's Sequence Retrieval System (SRS) integrates and links the main
nucleotide and protein databases as well as many other specialist molecular
biology databases.
• For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are
available that allow external users to compare their own sequences against
the data in the EMBL Nucleotide Sequence Database and other databases.
• All available resources can be accessed via the EBI home page at
http://www.ebi.ac.uk.
DDBJ (DNA Data Bank of Japan,
https://www.ddbj.nig.ac.jp/)
• DDBJ Center collects nucleotide sequence data as a member of
INSDC(International Nucleotide Sequence Database
Collaboration) and provides freely available nucleotide sequence
data and supercomputer system, to support research activities
in life science.
• Currently, DDBJ Center is in operation at the Research
Organization of Information and Systems, National Institute
of Genetics (NIG) in Mishima, Japan, with the endorsement
of MEXT, the Japanese Ministry of Education, Culture, Sports,
Science and Technology.
• DDBJ Center is reviewed and advised by its own advisory
board, DNA Database Advisory Committee (an outside
committee of NIG), and also by the advisory board to
INSDC, International Advisory Committee.
UniProt
PDB file format
• The Protein Data Bank (pdb) file format is a textual file format describing the
three-dimensional structures of molecules held in the Protein Data Bank.
• The pdb format accordingly provides for description and annotation of protein
and nucleic acid structures including atomic coordinates, secondary structure
assignments, as well as atomic connectivity.
• In addition experimental metadata are stored. PDB format is the legacy file
format for the Protein Data Bank which now keeps data on biological
macromolecules in the newer mmCIF file format.
• The PDB file format was invented in 1976 as a human-readable file that would
allow researchers to exchange protein coordinates through a database system.
• Its fixed-column width format is limited to 80 columns, which was based on
the width of the computer punch cards that were previously used to exchange
the coordinates.
• Through the years the file format has undergone many changes and revisions.
• HEADER, TITLE and AUTHOR records
provide information about the researchers who defined the
structure; numerous other types of records are available to provide
other types of information.
• REMARK records
can contain free-form annotation, but they also accommodate
standardized information; for example, the REMARK 350 BIOMT
records describe how to compute the coordinates of the
experimentally observed multimer from those of the explicitly
specified ones of a single repeating unit.
• SEQRES records
give the sequences of the peptide chains (named A, B, C, ...);
they are very short for small peptides but usually span multiple lines.
• ATOM records
describe the coordinates of the atoms that are part of
the protein. For example, an ATOM line for the alpha-N
atom of the first residue of peptide chain A, a proline
residue, names the atom, residue and chain; the first
three floating point numbers are its x, y and z
coordinates and are in units of Ångströms. The next
three columns are the occupancy, temperature factor,
and the element name, respectively.
• HETATM records
describe coordinates of hetero-atoms, that is those
atoms which are not part of the protein molecule.
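The fixed-column layout of ATOM and HETATM records can be read with simple string slicing. The sketch below is a minimal illustration, assuming the standard column positions of the legacy PDB format; the function name and the sample line (an alpha-N atom of a proline in chain A, with made-up coordinates) are hypothetical and not taken from a real entry.

```python
def parse_atom_record(line):
    """Parse one ATOM/HETATM line using fixed PDB column positions."""
    return {
        "record": line[0:6].strip(),       # columns 1-6: record name
        "serial": int(line[6:11]),         # columns 7-11: atom serial number
        "atom_name": line[12:16].strip(),  # columns 13-16: atom name
        "res_name": line[17:20].strip(),   # columns 18-20: residue name
        "chain_id": line[21],              # column 22: chain identifier
        "res_seq": int(line[22:26]),       # columns 23-26: residue number
        "x": float(line[30:38]),           # columns 31-38: x in Angstroms
        "y": float(line[38:46]),           # columns 39-46: y in Angstroms
        "z": float(line[46:54]),           # columns 47-54: z in Angstroms
        "occupancy": float(line[54:60]),   # columns 55-60
        "temp_factor": float(line[60:66]), # columns 61-66
        "element": line[76:78].strip(),    # columns 77-78: element symbol
    }

# Illustrative record, built field by field so the column widths are exact.
sample = (
    f"{'ATOM':<6}{1:>5} {' N':<4} {'PRO':>3} {'A'}{1:>4}    "
    f"{8.316:8.3f}{21.206:8.3f}{21.530:8.3f}{1.00:6.2f}{17.44:6.2f}"
    f"{'':10}{'N':>2}"
)

atom = parse_atom_record(sample)
print(atom["atom_name"], atom["res_name"], atom["chain_id"], atom["x"], atom["y"], atom["z"])
```

Slicing by column rather than splitting on whitespace matters because adjacent PDB fields may touch when values are wide.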
Protein Information Resource
(PIR format)
PIR format description
The fields in the atom section are atom ID, segment name, residue ID, residue name, atom name,
atom type, charge, mass, and an unused 0.
Module 4
Topic: Modular Nature
of proteins
Introduction - Domain
o A protein domain is a region of the protein's polypeptide chain that is self-
stabilizing and that folds independently from the rest.
o Each domain forms a compact folded three-dimensional structure.
o Many proteins consist of several domains. One domain may appear in a variety of
different proteins.
o Molecular evolution uses domains as building blocks and these may be
recombined in different arrangements to create proteins with different functions.
o In general, domains vary in length from about 50 to 250 amino acids.
Introduction – Domain contd….
The shortest domains, such as zinc fingers, are stabilized by metal ions
or disulfide bridges. Domains often form functional units, such as the
calcium-binding EF hand domain of calmodulin.
Because they are independently stable, domains can be "swapped"
by genetic engineering between one protein and another to make chimeric
proteins.
Proteins are composed of evolutionary units called domains.
A domain can either have an independent function or contribute to the function of a
multidomain protein in cooperation with other domains.
Once a domain has duplicated, it can evolve a new or
modified function.
Based on sequence, structural and functional evidence, domains are grouped into
superfamilies.
Background
The concept of the domain was first proposed in 1973 by Wetlaufer after X-ray crystallographic
studies of hen lysozyme and papain and by limited proteolysis studies of immunoglobulins.
Wetlaufer defined domains as stable units of protein structure that could fold autonomously.
In the past domains have been described as units of:
•compact structure
•function and evolution
•folding.
Domain swapping
Domain swapping is a mechanism for forming oligomeric
assemblies.
In domain swapping, a secondary or tertiary element of a
monomeric protein is replaced by the same element of
another protein.
Domain swapping can range from secondary structure
elements to whole structural domains.
It also represents a model of evolution for functional
adaptation by oligomerisation, e.g. oligomeric enzymes that
have their active site at subunit interfaces
Role of domains
Two principles
1. A domain can perform the same function, but in different protein contexts
(i.e. with different partner domains). E.g. sensory, regulatory and enzymatic
domains.
2. Some domains modify their function according to the partner
domain. E.g. WHD domain (Winged Helix Domain)
Module 4
Topic: Optimal Alignment Methods,
Sequence Alignment
Introduction
Shotgun sequencing
Accurate to 650 nucleotides
Sequence alignment is used to stitch the reads into the whole
length
Sequence assembly
Sequence comparison
DNA/RNA sequences
– strings composed of an alphabet of 4 letters
Protein sequences
– alphabet of 20 letters
A Quantitative Measure of Sequence
Similarity
To compare the nucleotides or amino acids
that appear at corresponding positions in two
or more sequences, we must first assign
those correspondences.
Sequence alignment is the identification of
residue-residue correspondences.
Orthologous and paralogous
Orthologous sequences differ because they are
found in different species (a speciation event)
Paralogous sequences differ due to a gene
duplication event
Sequences may be both orthologous and
paralogous
Pairwise Alignment
The alignment of two sequences (DNA or
protein) is a relatively straightforward
computational problem.
– There are lots of possible alignments.
Two sequences can always be aligned.
Sequence alignments have to be scored.
Often there is more than one solution with the
same score.
Methods of Alignment
By hand - slide sequences on two lines of a word
processor
Dot plot
– with windows
Rigorous mathematical approach
– Dynamic programming (slow, optimal)
Heuristic methods (fast, approximate)
– BLAST and FASTA
• Word matching and hash tables.
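The word matching and hash table idea behind the heuristic methods can be sketched as follows. This is only a simplified illustration of the seeding step: the function names and the word size k=3 are assumptions, and real BLAST and FASTA add further stages (scoring neighbouring words, extending seeds) that are not shown.

```python
from collections import defaultdict

def build_word_index(subject, k=3):
    """Hash table mapping every k-letter word of the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index

def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word is shared exactly."""
    index = build_word_index(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds

print(find_seeds("CTGA", "ACTGAT"))
```

Seeds that share the same offset (subject position minus query position) lie on one diagonal and are natural candidates for extension into a local alignment.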
Applications
Diagonal lines
of dots show
similarities
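A dot plot with windows can be sketched in a few lines. The function below is a minimal illustration; the name dot_plot and the default window and threshold values are assumptions chosen for the example.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return text rows with '*' where the windows starting at (i, j)
    agree in at least min_matches positions, '.' elsewhere."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append("*" if matches >= min_matches else ".")
        rows.append("".join(row))
    return rows

for row in dot_plot("GAATTCAGTTA", "GGATCGA", window=2, min_matches=2):
    print(row)
```

Raising the window size or the match threshold filters out isolated chance matches, so that only runs of dots along a diagonal remain visible.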
Dot Plots Sequence Alignments
An alignment can reflect the evolutionary
relationship between two or more homologs.
Three kinds of changes can occur at any
given position within a sequence
– Mutation
– Insertion
– Deletion
Many Possibilities
gctg-aa-cg
-ctataatc-
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
ACCTGAG-AG
ACGTG-GCAG
One mismatch (C/G) and two indels; 7 of 10 columns match: 70% identical
Affine Gap – Example (1)
+2 for a match
-2 for a gap
-1 for a mismatch
Gap Penalties
Linear gap penalty
– cost of a gap (length n) depends linearly on the gap-open
penalty (gi)
• f(g) = – n · gi
Affine gap penalty
– cost of gap depends on an initial gap-open penalty(gi) and
a subsequent gap-extension penalty(ge)
– based on the fact that a single biological mutational event
can insert or delete more than one residue
• f(g) = –[gi + (n – 1) ge]
+2 for a match
-1 for a mismatch
a gap open score of –2
a gap extension score of -1.
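The affine formula f(g) = –[gi + (n – 1) ge] with the scores listed above (match +2, mismatch –1, gap open –2, gap extension –1) can be checked with a short sketch. The function names are hypothetical, and the code scores an alignment that is already given; it does not search for the best one.

```python
def affine_gap_cost(n, gap_open=2, gap_extend=1):
    """Score contribution of one gap of length n: -(gi + (n - 1) * ge)."""
    return -(gap_open + (n - 1) * gap_extend)

def score_alignment(top, bottom, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Score two equal-length gapped strings under the affine gap scheme."""
    score = 0
    gap_in = None  # row holding the current run of '-' ("top", "bottom" or None)
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            # First column of a gap pays the open penalty, later ones the extension.
            score -= gap_extend if gap_in == row else gap_open
            gap_in = row
        else:
            score += match if x == y else mismatch
            gap_in = None
    return score

print(affine_gap_cost(3))               # one gap of length 3
print(score_alignment("AAAT", "A--T"))  # two matches and one gap of length 2
```

Under this scheme one gap of length 3 costs 4, whereas three separate single-residue gaps cost 6, reflecting that one mutational event can insert or delete several residues.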
Find the Best
Widely Used Substitution Matrices
– Empirically Derived
PAM: Point Accepted Mutations
– The PAM family (Dayhoff) is based on evolutionary
distance. The matrices were derived from closely related
sequences and the mutations seen in them.
A scoring system is a set of values quantifying the likelihood of one
residue being substituted by another in an alignment.
Transitions: purine to purine (A/G) or pyrimidine to pyrimidine (C/T)
Transversions: purine to pyrimidine or vice versa
Match score: +1
Mismatch score: +0
Gap penalty: –1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (–1)
Score = +11
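The arithmetic of this example can be reproduced by counting columns. The sketch below assumes the scoring values shown above (match +1, mismatch 0, gap –1 per gapped column); the function name is hypothetical.

```python
def linear_score(top, bottom, match=1, mismatch=0, gap=-1):
    """Count matches, mismatches and gap columns of a given alignment and score it."""
    counts = {"matches": 0, "mismatches": 0, "gaps": 0}
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            counts["gaps"] += 1
        elif x == y:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    score = (counts["matches"] * match
             + counts["mismatches"] * mismatch
             + counts["gaps"] * gap)
    return score, counts

score, counts = linear_score("ACGTCTGATACGCCGTATAGTCTATCT",
                             "----CTGATTCGC---ATCGTCTATCT")
print(score, counts)  # 11 {'matches': 18, 'mismatches': 2, 'gaps': 7}
```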
PAM (point accepted mutation) matrices are based on
global alignments [evolutionary model]
Difficulty: determining ancestral
relationships among sequences
For More Widely Divergent
Sequences
Matrices representing larger evolutionary
distances may be derived from the PAM1
matrix by matrix multiplication.
PAM250:
– Corresponding to ~20% identity
– The lowest sequence similarity for which we can
hope to produce a correct alignment
% identity 100 75 50 60 25 20
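Deriving matrices for larger distances by matrix multiplication can be illustrated with a toy example. The 3-letter mutation probability matrix below is invented for the illustration and is not the real Dayhoff PAM1 matrix; the sketch only shows the multiplication step, not the later conversion of probabilities into log-odds scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Multiply matrix m by itself n times, starting from the identity."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Hypothetical one-step mutation probabilities for a 3-letter alphabet
# (each row sums to 1; about 99% of residues stay unchanged per step).
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal entries are much smaller than in the one-step matrix, which mirrors the low percent identity associated with PAM250.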
PAM 250
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6
BLOSUM Matrices
Dynamic Programming
Global vs. Local Alignment
Global Alignment
Needleman-Wunsch 1970
Idea: Build up optimal alignment from optimal
alignments of subsequences
Three Steps of Dynamic
Programming
A simple scoring scheme is assumed where
– Si,j = 1 (match score); otherwise
– Si,j = 0 (mismatch score)
– w = 0 (gap penalty)
Three steps in dynamic programming
– Initialization
– Matrix fill (scoring)
– Traceback (alignment)
Initialization Step
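Sequence 1: GAATTCAGTTA
Sequence 2: GGATCGA
3 cases for each cell: a residue aligned with a residue (G over G),
a residue aligned with a gap (G over -), or a gap aligned with a residue (- over G)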
Matrix Fill Step
Traceback Step
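The three steps (initialization, matrix fill, traceback) can be sketched as below for the sequences GAATTCAGTTA and GGATCGA and the simple scheme given earlier (match 1, mismatch 0, gap 0). This is a minimal illustration of Needleman-Wunsch style global alignment; the function name and tie-breaking order in the traceback are assumptions, so other equally good alignments exist.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Global alignment by dynamic programming; 'gap' is the score added per
    gapped column (0 here, negative for a real penalty)."""
    n, m = len(a), len(b)
    # Initialization: first row and column hold cumulative gap scores.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap
    # Matrix fill: best of diagonal (match/mismatch), up (gap in b), left (gap in a).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Traceback: walk from the bottom-right cell back to the top-left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))

score, top, bottom = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)
print(top)
print(bottom)
```

With these scores the optimal value equals the number of matched columns, which is 6 for this pair.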
Z-score
Z = (Xs-Xt) /s
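One common way to obtain such a Z-score is by shuffling. The sketch below assumes that Xs is the score of the real alignment, Xt the mean score obtained when one sequence is randomly shuffled, and s the standard deviation of those shuffled scores; that reading of the symbols, the function names, the number of shuffles and the simple scoring function (match 1, mismatch 0, gap 0) are all assumptions made for illustration.

```python
import random
import statistics

def simple_alignment_score(a, b):
    """Optimal global alignment score with match=1, mismatch=0, gap=0."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(max(prev[j - 1] + (x == y), prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def z_score(a, b, n_shuffles=200, seed=0):
    """Z = (observed score - mean shuffled score) / std of shuffled scores."""
    rng = random.Random(seed)
    observed = simple_alignment_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        shuffled_scores.append(simple_alignment_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores do not vary; Z-score undefined")
    return (observed - mean) / sd

print(z_score("GAATTCAGTTAGGATCGA", "GAATTCAGTTAGGATCGA"))
```

A large positive Z means the observed score lies far above what sequences of the same composition give by chance.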
P-value
Module 4
Topic: BLAST
INTRODUCTION
An important goal of genomics and proteomics is to determine
if a particular sequence is like another sequence. This is
accomplished by comparing the new sequence with sequences
that have already been reported and stored in a database.
This process is principally one that uses alignment procedures
to uncover the “like” sequence in the database.
The alignment process will uncover those regions that are
identical or closely similar and those regions with little (or
no) similarity.
Two alignment types are used: global and local.
BLAST
BLAST stands for Basic Local Alignment Search Tool
BLAST was developed by Stephen Altschul, Warren
Gish, Webb Miller, Eugene Myers, and David J. Lipman at
NCBI in 1990.
It is a local alignment tool.
It helps to find regions of local similarity between sequences.
The program compares nucleotide or protein sequences to sequence
databases and calculates the statistical significance of matches.
BLAST can be used to infer functional and evolutionary
relationships between sequences as well as help identify members of
gene families.
NCBI HOMEPAGE
NCBI-BLAST HOMEPAGE
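TYPES OF BLAST
blastp: protein query against a protein database
blastn: nucleotide query against a nucleotide database
blastx: translated nucleotide query against a protein database
tblastn: protein query against a translated nucleotide database
tblastx: translated nucleotide query against a translated nucleotide database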
STEPS
Selecting Database
1 - This portion of each description links to the sequence record for a particular hit.
2 - Score or bit score is a value calculated from the number of gaps and substitutions
associated with each aligned sequence. The higher the score, the more significant the
alignment.
3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score
will occur in the database by chance. The smaller the E Value, the more significant the
alignment
4 - These links provide the user with direct access from BLAST results to related
entries in other databases. 'L' links to Locus Link records and 'S' links to structure
records in NCBI's Molecular Modelling Database.
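The link between bit score and E value can be made concrete with the standard relation E = m · n · 2^(–S'), where S' is the bit score, m the query length and n the database length. The sketch below is illustrative only: the function name and the example lengths are assumptions, and real BLAST uses effective (edge-corrected) lengths.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)

# Hypothetical search: 250-residue query against 10 million database residues.
for bits in (20, 40, 60):
    print(bits, expect_value(bits, 250, 10_000_000))
```

Each additional bit halves the E value, so the higher the score, the smaller the E value and the more significant the alignment.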
BLAST OUTPUT
C. ALIGNMENT
APPLICATIONS
MacIsaac KD, Fraenkel E (2006) Practical Strategies for Discovering Regulatory DNA Sequence Motifs. PLoS Comput
Biol 2(4): e36. doi:10.1371/journal.pcbi.0020036
http://journals.plos.org/ploscompbiol/article?id=info:doi/10.1371/journal.pcbi.0020036
MRLSFVPLLQLSRLVVSTQHSTKMSTVYRTCKMNEIALSLLAPTQPLDADQ
GVMSPMASSDQ
TTSIGDFRFLRTHHDKEERGLLVTSLTKGLAETSFPYR
YTSMCATICSITHSRADAAPAKQAH
What is Pattern Recognition?
Deterministic
Matches a given string or not.
Probabilistic
each sequence is given a probability that
this sequence is generated by a model.
The higher the probability, the better is the
match between sequence and pattern.
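The probabilistic case can be sketched with a position-specific model: each position has its own letter probabilities, and the probability of a sequence is the product over positions. The 3-position DNA model below is invented for illustration, as is the function name.

```python
# Hypothetical model: one probability table per position (each sums to 1).
MODEL = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

def sequence_probability(seq, model=MODEL):
    """Probability that the model generates seq (positions treated as independent)."""
    if len(seq) != len(model):
        raise ValueError("sequence length must equal model length")
    p = 1.0
    for letter, column in zip(seq, model):
        p *= column[letter]
    return p

for candidate in ("AGC", "AGT", "TTT"):
    print(candidate, sequence_probability(candidate))
```

A deterministic pattern would only answer yes or no for each candidate, whereas here the candidates are ranked by how well they fit the model.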
PROSITE
PROSITE is a protein database. It consists of entries describing
the protein families, domains and functional sites as well as
amino acid patterns,
signatures, and profiles in them, which are manually curated by a
team of the Swiss Institute of Bioinformatics and tightly
integrated into Swiss-Prot protein annotation.
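A PROSITE pattern is a deterministic pattern and can be translated into a regular expression. The sketch below handles only the basic syntax elements (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for sequence ends); the function name and the test sequences are hypothetical, and the pattern N-{P}-[ST]-{P} is used as an example of the N-glycosylation site motif.

```python
import re

def prosite_to_regex(pattern):
    """Translate basic PROSITE pattern syntax into a Python regular expression."""
    regex = pattern.rstrip(".")                         # patterns may end with a period
    regex = regex.replace("-", "")                      # '-' only separates positions
    regex = regex.replace("x", ".")                     # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 repeats
    regex = regex.replace("<", "^").replace(">", "$")   # sequence start / end anchors
    return regex

regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)
print(re.search(regex, "AANGSAA"))  # matches NGSA
print(re.search(regex, "AANPSAA"))  # no match: P follows N
```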
Character based methods:
A. Maximum parsimony
B. Maximum likelihood
1. Inaccurate evolutionary
history
2. The data used may be somewhat noisy
3. Problems arise when the analysis is
based on a single type of character
4. Homoplasy would be unlikely
from natural selection
5. Branch length does not necessarily
indicate the time passed
- Fields of study
1. Cladistics
2. Comparative phylogenetics
3. Computational phylogenetics
4. Evolutionary taxonomy
5. Evolutionary biology
6. Phylogenetics
Applications:
• Find out the evolutionary history.
1. Maximum parsimony method
2. Distance method
3. Maximum likelihood methods
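Distance methods start from a matrix of pairwise distances between aligned sequences. A minimal sketch of that first step is shown below, using the p-distance (the fraction of compared sites that differ); the function names and the three short sequences are hypothetical, and the tree-building step itself is not shown.

```python
def p_distance(a, b):
    """Fraction of ungapped aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(seqs):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    return {(p, q): p_distance(seqs[p], seqs[q]) for p in seqs for q in seqs}

# Hypothetical aligned sequences.
aligned = {"seq1": "ACGT", "seq2": "ACGA", "seq3": "TCGA"}
matrix = distance_matrix(aligned)
print(matrix[("seq1", "seq2")], matrix[("seq1", "seq3")], matrix[("seq2", "seq3")])
```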