Module 4 Merged

bioinformatics notes
MODULE-4

TOPIC: BIOINFORMATICS RESOURCES: NCBI, EBI, EXPASY, RCSB
Selected resources
 Broad Institute of Harvard and MIT

 DNA Databank of Japan (DDBJ)


 The European Bioinformatics Institute (EBI)


 ExPASy Bioinformatics Resource Portal


 National Center for Biotechnology Information (NCBI)



 Ingenuity Pathway Analysis (IPA)
 IPA is a web-based software application for the analysis, integration, and interpretation of data derived
from 'omics experiments. Access is provided to Tufts University and Tufts Medical Center researchers.
 Oncomine
 Cancer microarray database with more than 700 independent data sets and a set of analysis functions
that compute gene expression signatures, clusters and gene-set modules. Users must register for
username and password at site.
 Find More Resources
 Nucleic Acids Research Database Issue
 Each year, the journal Nucleic Acids Research devotes an issue to descriptions of new molecular biology
databases and updates of previously reviewed molecular biology databases.
 Nucleic Acids Research Database Summary Papers
 Searchable collection of databases that have been described in the journal, Nucleic Acids Research.
Databases are organized by category, and each entry has a description of the database and/or a link
to the review in Nucleic Acids Research.
 Nucleic Acids Research Web Server Issue
 Each year, the journal Nucleic Acids Research dedicates an issue to reports on web-based software
resources for analysis and visualization of molecular biology data.
 Online Bioinformatics Resources Collection (OBRC)
 The OBRC, created and maintained by the University of Pittsburgh Health Sciences Library System,
contains brief descriptions of and links for more than 2,400 bioinformatics databases and tools.
NCBI (https://www.ncbi.nlm.nih.gov/)
 The National Center for Biotechnology Information (NCBI) is part of the United States
National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH).
 It is approved and funded by the government of the United States. The NCBI is located
in Bethesda, Maryland and was founded in 1988.
 The NCBI houses a series of databases relevant to biotechnology and biomedicine and is
an important resource for bioinformatics tools and services. Major databases
include GenBank for DNA sequences and PubMed, a bibliographic database for
biomedical literature.
 Other databases include the NCBI Epigenomics database. All these databases are
available online through the Entrez search engine.
• A comprehensive website for biologists, including:
 biology-related databases,
 tools for viewing and analyzing data,
 automated systems for storing and retrieving data.

• NCBI, together with EMBL-EBI and DDBJ (at the Center for Information Biology, CIB), forms the
International Nucleotide Sequence Database Collaboration (INSDC), which acts as the chief working
unit and information centre. The three collaborating databases are:
• GenBank (NCBI)
• the European Molecular Biology Laboratory (EMBL) nucleotide sequence database
• the DNA Data Bank of Japan (DDBJ)
 A Science "Primer" gives access to general definitions and introductory information
regarding the branches of science included in bioinformatics.
 Many bioinformatics terms are defined in this section in a clear-cut and basic manner,
making this Primer an excellent first resource.
 "Databases and Tools" is a complete and well-ordered listing of accessible information.
EMBL's European Bioinformatics Institute (EMBL-EBI)
https://www.ebi.ac.uk/
 EMBL-EBI makes the world's public biological data freely available to the scientific community
via a range of services and tools, performs basic research and provides professional training in
bioinformatics.
 It is part of the European Molecular Biology Laboratory (EMBL), an international, innovative
and interdisciplinary research organization funded by over 20 member states, together with
prospect and associate member states.
 It is situated on the Wellcome Genome Campus in Hinxton, Cambridge, UK, one of the world's
largest concentrations of scientific and technical expertise in genomics.
What they do….
 provide freely available data and bioinformatics services to the scientific
community.
 contribute to the advancement of biology through investigator-driven
research.
 provide advanced bioinformatics training to scientists at all levels.
 help disseminate cutting-edge technologies to industry.
 support the coordination of biological data provision throughout Europe.
 The European Nucleotide Archive and the protein sequence
resource UniProt (then known as Swiss-Prot–TrEMBL) were the original
EMBL-EBI databases. Since then, the EMBL-EBI has played a major part
in the bioinformatics revolution.
Tools & Data Resources
Clustal Omega
 Multiple sequence alignment of DNA or protein sequences. Clustal Omega
replaces the older ClustalW alignment tools
InterProScan
 InterProScan searches sequences against InterPro's predictive protein
signatures.
BLAST [protein]
 Fast local similarity search tool for protein sequence databases.
BLAST [nucleotide]
 Fast local similarity search tool for nucleotide sequence databases
HMMER
 Fast sensitive protein homology searches using profile hidden Markov
models (HMMs) for querying against both sequence and HMM target
databases
Tools & Data Resources
Ensembl
 Genome browser, API and database, providing access to reference genome annotation.
UniProt
 A comprehensive resource for protein sequence and functional annotation
PDBe
 The European resource for the collection, organisation and dissemination of 3D structural data (from
PDB and EMDB) on biological macromolecules and their complexes.
Europe PMC
 A database to search the worldwide life sciences literature.
Expression Atlas
 An added-value database that shows which genes/proteins are expressed under which conditions,
and how expression differs between conditions.
ChEMBL
 An open data resource of binding, functional and ADMET bioactivity data.
Browse by type
 DNA & RNA
 Gene Expression
 Proteins
 Structures
 Systems
 Chemical biology
 Ontologies
 Literature
 Cross domain
ExPASy SIB (https://www.expasy.org/)
Swiss Bioinformatics resource portal
About Expasy
 Expasy is the bioinformatics resource portal of the SIB Swiss Institute of Bioinformatics.
 It is an extensible and integrative portal which provides access to over 160 databases and
software tools, developed by SIB Groups and supporting a range of life science and clinical
research domains, from genomics, proteomics and structural biology, to evolution and
phylogeny, systems biology and medical chemistry.
The Expasy search engine
Expasy allows you to seamlessly
1) query in parallel a subset of SIB databases through a single search, and
2) surface related information and knowledge from the complete set of >160 resources on the
portal.
Expasy provides information that is automatically aligned with the most recent release
of each resource, thereby ensuring up-to-date information.
Some history
 Expasy was created in August 1993 - the dawn of the internet
era. At that time, it was referred to as 'ExPASy, the Expert Protein
Analysis System' as proteins were its primary focus. It was the first
life science website - and among the 150 very first websites in the
world!
 In June 2011, it became the SIB Expasy Bioinformatics Resources
Portal: a diverse catalogue of bioinformatics resources
developed by SIB Groups.
 The current version of Expasy was released in July 2020 following
a massive user study and taking into account design, user
experience and architecture aspects: we thank all participants
for their help in shaping Expasy 3.0!
RCSB-PDB (https://www.rcsb.org/)
 The Protein Data Bank (PDB) was established as the first open access digital data
resource in all of biology and medicine. It is today a leading global
resource for experimental data central to scientific discovery.
 Through an internet information portal and downloadable data archive, the PDB provides
access to 3D structure data for large biological molecules (proteins, DNA, and RNA).
These are the molecules of life, found in all organisms on the planet.
 Knowing the 3D structure of a biological macromolecule is essential for understanding its
role in human and animal health and disease, its function in plants and food and energy
production, and its importance to other topics related to global prosperity and
sustainability.
A Structural View of Biology
 This resource is powered by the Protein Data Bank archive: information about the 3D
shapes of proteins, nucleic acids, and complex assemblies that helps students and
researchers understand all aspects of biomedicine and agriculture, from protein synthesis
to health and disease.
 As a member of the wwPDB, the RCSB PDB curates and annotates PDB data.
 The RCSB PDB builds upon the data by creating tools and resources for research and
education in molecular biology, structural biology, computational biology, and beyond.
MODULE-4
TOPIC: Databases, classifications and file formats
What is a database?
• A database is a convenient system to properly store, search and retrieve any
type of data.
• A database helps to easily handle and share large amounts of data and supports
large-scale analysis through easy access and data updating.
What is a biological database?
• Biological databases are libraries of life sciences information, collected from
scientific experiments, published literature, high-throughput experiment
technology and computational analysis.
• They contain information from genomics, proteomics and microarray gene
expression studies.
What is expected from a database..!!
• Sequence, functional, structural information,
related bibliography
• Well Structured and Indexed information
• Well cross-referenced (with other databases)
• Periodically updated
• Tools for analysis and visualization
Database Architecture
• Information system (the query system that takes your search keywords): Google, Entrez, SRS
• Storage system: Oracle, MySQL, PC binary files, Unix text files, bookshelves
• Data: GenBank flat file, PDB file, interaction record, title of a book, book
Biological Databases- Types and Importance
 One of the hallmarks of modern genomic research is the generation of
enormous amounts of raw sequence data.
 As the volume of genomic data grows, sophisticated computational
methodologies are required to manage the data deluge.
 Thus, the very first challenge in the genomics era is to store and
handle the staggering volume of information through the establishment
and use of computer databases.
 A biological database is a large, organized body of persistent data,
usually associated with computerized software designed to update,
query, and retrieve components of the data stored within the system.
 A simple database might be a single file containing many records, each
of which includes the same set of information.
 The chief objective of the development of a database is to organize
data in a set of structured records to enable easy retrieval of
information.
Types of Biological Databases
Based on their contents, biological databases can be roughly divided into
two categories:
1. Primary databases
 Primary databases are also called archival databases.
 They are populated with experimentally derived data such as
nucleotide sequence, protein sequence or macromolecular structure.
 Experimental results are submitted directly into the database by
researchers, and the data are essentially archival in nature.
 Once given a database accession number, the data in primary
databases are never changed: they form part of the scientific record.
 Examples: ENA, GenBank and DDBJ (nucleotide sequence)
 ArrayExpress Archive and GEO (functional genomics data)
 Protein Data Bank (PDB; coordinates of three-dimensional
macromolecular structures)
2. Secondary databases
 Secondary databases comprise data derived from the results of
analysing primary data.
 Secondary databases often draw upon information from numerous
sources, including other databases (primary and secondary),
controlled vocabularies and the scientific literature.
 They are highly curated, often using a complex combination of
computational algorithms and manual analysis and interpretation to
derive new knowledge from the public record of science.
• Examples
 InterPro (protein families, motifs and domains)
 UniProt Knowledgebase (sequence and functional information on
proteins)
 Ensembl (variation, function, regulation and more layered onto whole
genome sequences)
3. However, many data resources have both primary
and secondary characteristics. For
example, UniProt accepts primary sequences
derived from peptide sequencing experiments.
However, UniProt also infers peptide sequences
from genomic information, and it provides a wealth
of additional information, some derived from
automated annotation (TrEMBL), and even more
from careful manual analysis (Swiss-Prot).

4. There are also specialized databases, which
cater to a particular research interest. For
example, FlyBase, the HIV sequence database, and the
Ribosomal Database Project are databases that
specialize in a particular organism or a particular
type of data.
GenBank (Genetic Sequence Databank)

• GenBank® is the genetic sequence database at the National Center for
Biotechnology Information (NCBI).
• It was established in 1982 and is now maintained by the National Center for
Biotechnology Information (NCBI).
• DNA sequences can be submitted to GenBank using several different methods.
• It contains publicly available nucleotide sequences for more than 240 000 named
organisms, obtained primarily through submissions from individual laboratories and
batch submissions from large-scale sequencing projects.
• It has a flat file structure: an ASCII text file, readable and downloadable by
both humans and computers.
• There are two main ways of making batch sequence submissions to GenBank: NCBI's
Barcode Submission Tool (BarSTool) and Sequin.
EMBL

• The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution
supported by 22 member states, four prospect and two associate member states.
• EMBL was created in 1974 and is an intergovernmental organisation funded by public
research money from its member states.
• The Laboratory operates from five sites: the main laboratory in Heidelberg, and
outstations in Hinxton (the European Bioinformatics Institute (EBI), in England),
Grenoble (France), Hamburg (Germany), and Monterotondo (near Rome).
• EMBL groups and laboratories perform basic research in molecular biology and
molecular medicine as well as training for scientists, students and visitors.
• Israel is the only Asian state that has full membership.
• The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/), maintained at
the European Bioinformatics Institute (EBI), incorporates and distributes nucleotide
sequences from public sources.
• The database is a part of an international collaboration with DDBJ (Japan) and GenBank
(USA).
• Data are exchanged between the collaborating databases on a daily
basis.
• The web-based tool, Webin, is the preferred system for individual submission
of nucleotide sequences, including Third Party Annotation (TPA) and
alignment data.
• Automatic submission procedures are used for submission of data from large-
scale genome sequencing projects.
• The latest data collection can be accessed via FTP, email and WWW
interfaces.
• The EBI's Sequence Retrieval System (SRS) integrates and links the main
nucleotide and protein databases as well as many other specialist molecular
biology databases.
• For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are
available that allow external users to compare their own sequences against
the data in the EMBL Nucleotide Sequence Database and other databases.
• All available resources can be accessed via the EBI home page at
http://www.ebi.ac.uk.
DDBJ (DNA Data Bank of Japan,
https://www.ddbj.nig.ac.jp/)
• DDBJ Center collects nucleotide sequence data as a member of
INSDC (International Nucleotide Sequence Database
Collaboration) and provides freely available nucleotide sequence
data and a supercomputer system to support research activities
in life science.
• Currently, DDBJ Center is in operation at the Research
Organization of Information and Systems, National Institute
of Genetics (NIG) in Mishima, Japan, with the endorsement
of MEXT, the Japanese Ministry of Education, Culture, Sports,
Science and Technology.
• DDBJ Center is reviewed and advised by its own advisory
board, the DNA Database Advisory Committee (an outside
committee of NIG), and also by the advisory board to
INSDC, the International Advisory Committee.
UniProt

• UniProt is a freely accessible database of protein
sequence and functional information, many entries being
derived from genome sequencing projects.
• It contains a large amount of information about the
biological function of proteins derived from the research
literature.
• It is maintained by the UniProt consortium, which
consists of several European bioinformatics organisations
and a foundation from Washington, DC, United States.
• The UniProt consortium comprises the European
Bioinformatics Institute (EBI), the Swiss Institute of
Bioinformatics (SIB), and the Protein Information
Resource (PIR).
Organization of UniProt databases
• UniProtKB
• UniProt Knowledgebase (UniProtKB) is a protein database
partially curated by experts, consisting of two sections:
• UniProtKB/Swiss-Prot (containing reviewed, manually
annotated entries) and UniProtKB/TrEMBL (containing
unreviewed, automatically annotated entries).
• As of 19 March 2014, release "2014_03" of
UniProtKB/Swiss-Prot contains 542,782 sequence entries
(comprising 193,019,802 amino acids abstracted from
226,896 references) and release "2014_03" of
UniProtKB/TrEMBL contains 54,247,468 sequence entries
(comprising 17,207,833,179 amino acids)
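The release figures above can be turned into a few derived quantities. The following is a minimal sketch that only uses the four numbers quoted for release 2014_03; the variable and function names are illustrative and do not come from UniProt.

```python
# Figures quoted above for UniProtKB release 2014_03.
SWISS_PROT_ENTRIES = 542_782
SWISS_PROT_RESIDUES = 193_019_802
TREMBL_ENTRIES = 54_247_468
TREMBL_RESIDUES = 17_207_833_179


def mean_length(residues: int, entries: int) -> float:
    """Average number of amino acids per sequence entry."""
    return residues / entries


swiss_prot_mean = mean_length(SWISS_PROT_RESIDUES, SWISS_PROT_ENTRIES)
trembl_mean = mean_length(TREMBL_RESIDUES, TREMBL_ENTRIES)

# Share of all UniProtKB entries that are reviewed (Swiss-Prot).
reviewed_fraction = SWISS_PROT_ENTRIES / (SWISS_PROT_ENTRIES + TREMBL_ENTRIES)

print(f"Mean Swiss-Prot entry length: {swiss_prot_mean:.1f} residues")
print(f"Mean TrEMBL entry length:     {trembl_mean:.1f} residues")
print(f"Reviewed share of entries:    {reviewed_fraction:.2%}")
```

In this release the reviewed section is only about 1% of all entries, which illustrates why automatic annotation is needed alongside manual curation.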
UniProtKB/Swiss-Prot

• UniProtKB/Swiss-Prot is a manually annotated, non-redundant
protein sequence database.
• It combines information extracted from scientific
literature and biocurator-evaluated computational
analysis.
• The aim of UniProtKB/Swiss-Prot is to provide all known
relevant information about a particular protein.
• Annotation is regularly reviewed to keep up with current
scientific findings. The manual annotation of an entry
involves detailed analysis of the protein sequence and of
the scientific literature.
UniProtKB/TrEMBL
• UniProtKB/TrEMBL contains high-quality computationally
analyzed records, which are enriched with automatic
annotation.
• It was introduced in response to increased dataflow resulting
from genome projects, as the time- and labour-consuming
manual annotation process of UniProtKB/Swiss-Prot could
not be broadened to include all available protein sequences.
• The translations of annotated coding sequences in the EMBL-
Bank/GenBank/DDBJ nucleotide sequence database are
automatically processed and entered in UniProtKB/TrEMBL.
• UniProtKB/TrEMBL also contains sequences from PDB, and
from gene prediction, including Ensembl, RefSeq and CCDS.
The Protein Information Resource (PIR)
• The Protein Information Resource (PIR) produces the largest,
most comprehensive, annotated protein sequence database in the
public domain, the PIR-International Protein Sequence Database, in
collaboration with the Munich Information Center for Protein
Sequences (MIPS) and the Japan International Protein Sequence
Database (JIPID).
• The expanded PIR WWW site allows sequence similarity and text
searching of the Protein Sequence Database and auxiliary
databases.
• Several new web-based search engines combine searches of
sequence similarity and database annotation to facilitate the
analysis and functional identification of proteins.
• New capabilities for searching the PIR
sequence databases include annotation-sorted
search, domain search, combined global and
domain search, and interactive text searches.
• The PIR-International databases and search
tools are accessible on the PIR WWW site at
http://pir.georgetown.edu and at the MIPS
WWW site at
http://www.mips.biochem.mpg.de .
The database has the following distinguishing features.

• It is a comprehensive, annotated, and non-redundant protein sequence database,
containing over 142 000 sequences as of September 1999. Included are
sequences from the completely sequenced genomes of 16 prokaryotes, six
archaebacteria, 17 viruses and phages, >100 eukaryote organelles
and Saccharomyces cerevisiae.
• The collection is well organized with >99% of entries classified by protein
family and >57% classified by protein superfamily.
• PSD annotation includes concurrent cross-references to other sequence,
structure, genomic and citation databases, including the public nucleic acid
sequence databases ENTREZ, MEDLINE, PDB, GDB, OMIM, FlyBase,
MIPS/Yeast, SGD/Yeast, MIPS/Arabidopsis and TIGR.
• The PIR is the only sequence database to provide context cross-references
between its own database entries.
PIR-International sequence and auxiliary databases
Database | Description | Information
PSD | Annotated and classified protein sequences | http://pir.georgetown.edu/pirwww/dbinfo/textpsd.html
PATCHX | Sequences not yet in the PIR-International PSD | http://pir.georgetown.edu/pirwww/dbinfo/patchx.html
ARCHIVE | Sequences as originally reported in a publication or submission | http://pir.georgetown.edu/pirwww/dbinfo/archive.html
NRL_3D | Sequences from three-dimensional structure database PDB | http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html
FAMBASE | Representative sequences from each protein family | http://pir.georgetown.edu/pirwww/dbinfo/fambase.html
PIR-ALN | Sequence alignments of superfamilies, families and homology domains | http://pir.georgetown.edu/pirwww/dbinfo/piraln.html
RESID | Post-translational modifications with PSD feature information | http://pir.georgetown.edu/pirwww/dbinfo/resid.html
ProClass | Non-redundant sequences organized according to superfamilies and motifs | http://pir.georgetown.edu/gfserver/proclass.html
ProtFam | Sequence alignments of superfamilies | http://www.mips.biochem.mpg.de/proj/protfam/protfam
PIR - https://proteininformationresource.org/
FILE FORMATS
Flat file
• A flat-file database is a database stored
in a file called a flat file. Records follow a
uniform format, and there are no
structures for indexing or recognizing
relationships between records. The file is
simple. A flat file can be a plain text file, or
a binary file. Relationships can be inferred
from the data in the database, but the
database format itself does not make
those relationships explicit.
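To make the idea concrete, here is a minimal sketch of a flat file handled in Python. The tab-separated layout, the accession-like identifiers and the field order are all hypothetical and chosen only for illustration. Because there is no index, every query is a linear scan, and a relationship (two records sharing an organism) has to be inferred from the data itself.

```python
# A hypothetical flat file: one record per line, fields separated by tabs.
# Field order (an assumption for this sketch): identifier, organism, protein class.
FLAT_FILE = (
    "REC001\tHomo sapiens\tkinase\n"
    "REC002\tMus musculus\tkinase\n"
    "REC003\tHomo sapiens\tprotease\n"
)


def read_records(text):
    """Split the flat file into records; every record has the same fields."""
    return [tuple(line.split("\t")) for line in text.splitlines() if line]


def find_by_field(records, field_index, value):
    """Linear scan: with no index, every record must be inspected."""
    return [record for record in records if record[field_index] == value]


records = read_records(FLAT_FILE)

# The file does not say that REC001 and REC003 are related;
# the shared organism is only visible by comparing field values.
human_records = find_by_field(records, 1, "Homo sapiens")
print([record[0] for record in human_records])
```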
Flat File Storage Data Formats

• When GenBank, EMBL and DDBJ formed a
collaboration (1986), sequence databases had
moved to a defined flat file format with a shared
feature table format and annotation standards.
• The flat file formats from the sequence databases
are still used to access and display sequence and
annotation. They are also convenient for storage of
local copies.
GenBank flat file
• The GenBank format allows for the storage
of information in addition to a DNA/protein
sequence.
A GenBank entry shows various details: the first section includes the entry's
LOCUS, DEFINITION, ACCESSION and VERSION, and the final section, introduced by ORIGIN,
is the actual sequence. These five elements are
the essential parts of the GenBank format.
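The five elements named above can be pulled out of a record with a few lines of Python. This is a minimal sketch, not a full GenBank parser: the sample record is hypothetical (the identifier EXAMPLE01 is not a real accession), it assumes keywords occupy the first 12 columns, and it ignores multi-line fields and the feature table.

```python
# Hypothetical, heavily shortened record used only for illustration.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 20 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   EXAMPLE01
VERSION     EXAMPLE01.1
ORIGIN
        1 acgtacgtac gtacgtacgt
//
"""

HEADER_KEYWORDS = ("LOCUS", "DEFINITION", "ACCESSION", "VERSION")


def parse_genbank_minimal(text):
    """Return the four header fields plus the sequence found after ORIGIN."""
    fields = {}
    sequence_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines carry a position number and spaces; keep letters only.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        keyword = line[:12].strip()
        if keyword == "ORIGIN":
            in_origin = True
        elif keyword in HEADER_KEYWORDS:
            fields[keyword] = line[12:].strip()
    fields["SEQUENCE"] = "".join(sequence_parts)
    return fields


entry = parse_genbank_minimal(SAMPLE_RECORD)
print(entry["ACCESSION"], entry["VERSION"], len(entry["SEQUENCE"]))
```

For real records, an established parser is the safer choice; the sketch only shows how the flat file layout maps onto fields.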
FASTA format
• In bioinformatics and biochemistry, the FASTA format is a text-
based format for representing either nucleotide sequences or
amino acid (protein) sequences, in which nucleotides or amino
acids are represented using single-letter codes. The format also
allows for sequence names and comments to precede the
sequences.
• A sequence in FASTA format begins with a single-line
description, followed by lines of sequence data.
• The description line is distinguished from the sequence data by a
greater-than (">") symbol in the first column.
• It is recommended that all lines of text be shorter than 80
characters in length.
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDT
REMPFHVTKQESKPVQMMCMNNSFNVATLPAEKM
KILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTW
TNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGM
TDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIE
MAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTI
VYFGRYWSP
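The rules above (a ">" description line followed by one or more sequence lines) are enough to write a small reader. The following is a minimal sketch; the two records in the sample text are made up for illustration, and the function name is not taken from any library.

```python
# Two hypothetical records, used only to exercise the reader.
SAMPLE_FASTA = """\
>seq1 first example
ACGT
ACGT
>seq2 second example
MKV
"""


def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:]
            chunks = []
        elif description is not None:
            chunks.append(line)  # sequence data may span several lines
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


fasta_records = parse_fasta(SAMPLE_FASTA)
for description, sequence in fasta_records:
    print(description, len(sequence))
```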
Filename extension
There is no standard filename extension for a text file containing FASTA
formatted sequences. The table below shows each extension and its respective
meaning.
Extension | Meaning | Notes
fasta, fa | generic FASTA | Any generic FASTA file. The rows below list other common FASTA file extensions.
fna | FASTA nucleic acid | Used generically to specify nucleic acids.
ffn | FASTA nucleotide of gene regions | Contains coding regions for a genome.
faa | FASTA amino acid | Contains amino acid sequences. A multiple protein FASTA file can have the more specific extension mpfa.
frn | FASTA non-coding RNA | Contains non-coding RNA regions for a genome, in DNA alphabet, e.g. tRNA, rRNA.
Protein Data Bank (PDB) flat file

• The Protein Data Bank (pdb) file format is a textual file format describing the
three-dimensional structures of molecules held in the Protein Data Bank.
• The pdb format accordingly provides for description and annotation of protein
and nucleic acid structures including atomic coordinates, secondary structure
assignments, as well as atomic connectivity.
• In addition experimental metadata are stored. PDB format is the legacy file
format for the Protein Data Bank which now keeps data on biological
macromolecules in the newer mmCIF file format.
• The PDB file format was invented in 1976 as a human-readable file that would
allow researchers to exchange protein coordinates through a database system.
• Its fixed-column width format is limited to 80 columns, which was based on
the width of the computer punch cards that were previously used to exchange
the coordinates.
• Through the years the file format has undergone many changes and revisions.
• HEADER, TITLE and AUTHOR records
provide information about the researchers who defined the
structure; numerous other types of records are available to provide
other types of information.

• REMARK records
can contain free-form annotation, but they also accommodate
standardized information; for example, the REMARK 350 BIOMT
records describe how to compute the coordinates of the
experimentally observed multimer from those of the explicitly
specified ones of a single repeating unit.

• SEQRES records
give the sequences of the peptide chains. In the example entry there are three
chains (named A, B and C), which are very short there but usually span multiple lines.
• ATOM records
describe the coordinates of the atoms that are part of
the protein. For example, the first ATOM line of the example entry
describes the alpha-N atom of the first residue of
peptide chain A, which is a proline residue; the first
three floating point numbers are its x, y and z
coordinates and are in units of Ångströms. The next
three columns are the occupancy, temperature factor,
and the element name, respectively.
• HETATM records
describe coordinates of hetero-atoms, that is those
atoms which are not part of the protein molecule.
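Because the PDB format is fixed-width, an ATOM record is read by column position rather than by splitting on spaces. The sketch below assumes the commonly documented column layout of the legacy format (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54, occupancy 55-60, temperature factor 61-66, element 77-78). The coordinate values are invented for illustration, and the helper names are hypothetical.

```python
def build_atom_line(serial, atom_name, res_name, chain_id, res_seq,
                    x, y, z, occupancy, temp_factor, element):
    """Lay out one ATOM record in the assumed fixed-column format."""
    return (
        "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
        "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}          {:>2}"
    ).format("ATOM", serial, atom_name, "", res_name, chain_id, res_seq, "",
             x, y, z, occupancy, temp_factor, element)


def parse_atom_line(line):
    """Slice an ATOM record by column (Python slices are 0-based, end-exclusive)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# Illustrative values only: the backbone nitrogen of a proline, residue 1 of chain A.
atom_line = build_atom_line(1, " N", "PRO", "A", 1, 1.5, -2.25, 3.125, 1.0, 17.44, "N")
atom = parse_atom_line(atom_line)
print(atom_line)
print(atom["res_name"], atom["chain_id"], atom["x"], atom["y"], atom["z"])
```

Slicing by column is what keeps the parser correct when neighbouring fields touch, for example when a large negative coordinate fills its whole 8-character field.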
Protein Information Resource
(PIR format)
PIR format description

• A sequence in PIR format consists of:
o One line starting with
• a ">" (greater-than) sign, followed by
• a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX),
followed by
• a semicolon, followed by
• the sequence identification code (the database ID-code).
o One line containing a textual description of the sequence.
o One or more lines containing the sequence itself. The end of the sequence is
marked by a "*" (asterisk) character.
o Optionally, this can be followed by one or more lines describing the sequence.
Software that is supposed to read only the sequence should ignore these.
•A file in PIR format may comprise more than one sequence.
•The PIR format is also often referred to as the NBRF format.
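The description above translates directly into a small reader. This is a minimal sketch under the stated rules (a ">" line with type code and identifier, one description line, then sequence lines ending in "*"); it ignores any optional trailing text lines, and the sample text is shortened from the kind of entries shown in this section.

```python
# Shortened, illustrative PIR text: one plain sequence and one aligned structure entry.
SAMPLE_PIR = """\
>P1;test
sequence
ELRLRY
CAPAG*
>P1;1bbha-
structure:1bbha-
AGLS-PEEQ*
"""


def parse_pir(text):
    """Return a list of dicts with type code, identifier, description and sequence."""
    lines = text.splitlines()
    records = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if line.startswith(">") and ";" in line:
            seq_type, seq_id = line[1:].split(";", 1)
            description = lines[i + 1].strip() if i + 1 < len(lines) else ""
            i += 2
            chunks = []
            while i < len(lines):
                chunk = lines[i].strip()
                i += 1
                if "*" in chunk:
                    # The asterisk marks the end of the sequence.
                    chunks.append(chunk.split("*", 1)[0])
                    break
                chunks.append(chunk)
            records.append({
                "type": seq_type,
                "id": seq_id,
                "description": description,
                "sequence": "".join(chunks),
            })
        else:
            i += 1  # skip optional lines that follow a finished sequence
    return records


pir_records = parse_pir(SAMPLE_PIR)
for record in pir_records:
    print(record["type"], record["id"], record["sequence"])
```

Gap characters ("-") are kept as part of the sequence, since aligned PIR entries rely on them.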
PIR format example (for a sequence which doesn't have a structure)
>P1;test
sequence
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLM
NTTVTTGLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLT
VTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWCHFPSN
WKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETA
NLWFNCHGEFFYCKMDWFLNYLNNLTVDADHNECKNTSGT
KSGNKRAPGPCVQRTYVACHIRSVIIWLETISKKTYAPPREGHL
ECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLV
EITPIGFAPTEVRRYTGGHERQKRVPFV*
PIR format example (for sequence which has structure)
>P1;1bbha-
structure:1bbha-
AGLSPEEQIETRQAGYEFMGWNMGKIKANLEGEYNAAQVEAAANVIAAIANSGMGALYGPG
TDKNVGDVKTRVKPEFFQNMEDVGKIAREFVGAANTLAEVAATGEAEAVKTAFGDVGAACKS
CHEKYRAK-*
>P1;1cpq--
structure:1cpq--
--ADTKEVLEAREAYFKSLGGSMKAMTGVAKA-
DAEAAKVEAAKLEKILATDVAPLFPAGTSSTDLPG-
QTEAKAAIWANMDDFGAKGKAMHEAGGAVIAAANAGDGAAFGAALQKLGGTCKACHDDY
REED*
>P1;256bb-
structure:256bb-
---------ADLEDNMETLNDNLKVIEKAD----NAAQVKDALTKMRAAALD-AQKATPPKLE---------
DKSP-DSPEMKDFRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRNAYHQKYR---*
Protein structure file
• A PSF file, also called a protein structure file,
contains all of the molecule-specific information
needed to apply a particular force field to a
molecular system.
• The PSF file contains six main sections of
interest: atoms, bonds, angles, dihedrals,
impropers (dihedral force terms used to maintain
planarity), and cross-terms. The following is
taken from a PSF file for ubiquitin. First is the
title and atom records:
PSF CMAP
6 !NTITLE
REMARKS original generated structure x-plor psf file
REMARKS 2 patches were applied to the molecule.
REMARKS topology top_all27_prot_lipid.inp
REMARKS segment U { first NTER; last CTER; auto angles dihedrals }
REMARKS defaultpatch NTER U:1
REMARKS defaultpatch CTER U:76
1231 !NATOM
1 U 1 MET N NH3 -0.300000 14.0070 0
2 U 1 MET HT1 HC 0.330000 1.0080 0
3 U 1 MET HT2 HC 0.330000 1.0080 0
4 U 1 MET HT3 HC 0.330000 1.0080 0
5 U 1 MET CA CT1 0.210000 12.0110 0
6 U 1 MET HA HB 0.100000 1.0080 0
7 U 1 MET CB CT2 -0.180000 12.0110 0

The fields in the atom section are atom ID, segment name, residue ID, residue name, atom name,
atom type, charge, mass, and an unused 0.
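As an illustration of the field order just listed, the sketch below splits one atom record on whitespace. It assumes the whitespace-separated layout shown in the ubiquitin excerpt and a numeric residue ID; the names `Atom` and `parse_psf_atom` are hypothetical.

```python
from collections import namedtuple

# Field names follow the order described in the text.
Atom = namedtuple(
    "Atom",
    "atom_id segment res_id res_name atom_name atom_type charge mass unused",
)


def parse_psf_atom(line):
    """Split one PSF atom record into typed fields (whitespace-separated)."""
    f = line.split()
    return Atom(int(f[0]), f[1], int(f[2]), f[3], f[4], f[5],
                float(f[6]), float(f[7]), int(f[8]))


atom = parse_psf_atom("1 U 1 MET N NH3 -0.300000 14.0070 0")
print(atom.atom_name, atom.atom_type, atom.charge, atom.mass)
```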
Module 4
Topic: Modular Nature
of proteins
Introduction - Domain
o A protein domain is a region of the protein's polypeptide chain that is self-
stabilizing and that folds independently from the rest.
o Each domain forms a compact folded three-dimensional structure.
o Many proteins consist of several domains. One domain may appear in a variety of
different proteins.
o Molecular evolution uses domains as building blocks and these may be
recombined in different arrangements to create proteins with different functions.
o In general, domains vary in length from between about 50 amino acids up to 250
amino acids in length.
Introduction – Domain contd….
 The shortest domains, such as zinc fingers, are stabilized by metal ions
or disulfide bridges. Domains often form functional units, such as the
calcium-binding EF hand domain of calmodulin.
 Because they are independently stable, domains can be "swapped"
by genetic engineering between one protein and another to make chimeric
proteins.
Proteins are composed of evolutionary units called domains.
A domain can either have an independent function or contribute to the function of a
multidomain protein in cooperation with other domains.
Once a domain has duplicated, it can evolve a new or
modified function.
Based on sequence, structural and functional evidence, domains are grouped into
superfamilies.
Background
 The concept of the domain was first proposed in 1973 by Wetlaufer after X-ray crystallographic
studies of hen lysozyme and papain and by limited proteolysis studies of immunoglobulins.
 Wetlaufer defined domains as stable units of protein structure that could fold autonomously.
 In the past domains have been described as units of:
•compact structure
•function and evolution
•folding.
Domain swapping
Domain swapping is a mechanism for forming oligomeric
assemblies.
 In domain swapping, a secondary or tertiary element of a
monomeric protein is replaced by the same element of
another protein.
Domain swapping can range from secondary structure
elements to whole structural domains.
It also represents a model of evolution for functional
adaptation by oligomerisation, e.g. oligomeric enzymes that
have their active site at subunit interfaces
Role of domains

Acquiring new structures and functions by combination of domains
New domain combinations

☺ Formation of new domain combinations is an important mechanism
in protein evolution.
☺ Proteins contain several thousand different combinations of two
superfamilies.
☺ Duplication is one of the main sources for creation of new proteins.
☺ After duplication, a domain can evolve a new or modified function either by
sequence divergence or by combining with other domains to form
a multidomain protein with a new series of domains.
☺ Multidomain proteins are formed by duplication and
recombination, which shape their geometry and functional relationships.
☺ Supradomains are two- or three-domain combinations that occur
in different domain architectures with different N- and C-terminal
neighbours.
Overview of different aspects of multidomain proteins (figure):
{1} Domains belonging to the same superfamily are represented as rectangles of the same colour.
{2} Supradomains are two- or three-domain combinations that occur in different domain architectures.
{3} Domain combinations form different geometries with different functions.
These domains form a …
A few domain superfamilies are highly versatile and have neighbouring domains from
many superfamilies.
Each superfamily has its own feature.
* Some superfamilies are highly versatile, some are highly abundant and some
superfamilies are both.
* It depends on the structure and function of the domains and domain combinations
that determine the selection.
Contd…
• Important examples of the reuse of particular domains
come from signal transduction.
• For example, the SH3 and SH2 domains in signal transduction.
• Combination and addition of several domains
determine the versatility of the protein.
To have the same function, the sequential order of domains is conserved.
• If the same domain combination is observed in two different proteins, they
are closely related to each other phylogenetically.
* Such domain architectures have evolved from the same ancestor.
* E.g. the Rossmann fold
* Proteins sharing the same series of domains tend to have the same
function.
*The total number of defined domains is relatively small and is growing
only slowly. For example, the Pfam domain database defines about
18,000 domains in its current version (version 32).
* On the other hand, the number of known unique domain
arrangements - defined by the linear order of domains in an amino acid
sequence is much larger and growing rapidly .
* Accordingly, rearrangements of existing domains can help explain the
vast protein diversity we observe in nature
Geometry of domain combinations
~ The sequential order of domains is largely conserved.
~ The geometry of Rossmann domains and their partner domains is conserved when the partners belong to the same superfamily.
~ Proteins of unknown structure can be modelled on homologous polypeptide(s) of known structure.
* E.g. the yeast ribosome and exosome
~ The more similar the domain sequences, the more conserved the interaction of the protein domains.
Functional relationships of domains in multi-domain proteins
A domain-centric scheme emphasises domain function.
In this domain-centric functional classification scheme, domains are classified into several categories:
1. catalytic activity,
2. cofactor binding,
3. responsibility for subcellular localisation,
4. protein–protein interaction, etc.

Two principles:
1. A domain can perform the same function, but in different protein contexts
(i.e. with different partner domains). E.g. sensory, regulatory and enzymatic
domains.
2. Some domains modify their function according to the partner
domain. E.g. the WHD domain (Winged Helix Domain).
Module 4
Topic: Optimal Alignment Methods,
Sequence Alignment
Introduction
• The fundamental building blocks are linear sequences
• The heart of bioinformatics analysis is sequence comparison
• Gene repository in NCBI
• Pairwise sequence alignment
Evolutionary basis
• Molecular sequences undergo random changes
• Traces of evolution may still remain in certain
portions of the sequences to allow
identification of the common ancestry
• Functional and structural roles tend to be
preserved
• Patterns of conservation and variation can be
identified
• Evolutionary relationships between
sequences help to characterize the function
of unknown sequences
• Characterization into families, domains or
motifs
• Insertions, deletions or mutations
• Sequence homology vs sequence similarity
• Sequence similarity vs sequence identity
Sequencing a genome

 Shotgun sequencing
 Accurate to 650 nucleotides
 Sequence alignment used to stitch the whole
length
 Sequence assembly
Sequence comparison
• Sequence similarity can provide clues about function and evolutionary relationships
• Algorithms are used to search massive databases
• Two types of alignment: global and local
Global
• Sequences are generally similar over their entire length
• Finds the best possible alignment across the entire length
Local
• Finds local regions with the highest level of similarity
• Conserved patterns in DNA or protein sequences
• Motifs
• Protein domains
Pairwise Sequence
Alignment
Sequences

 DNA/RNA sequences
– strings composed of an alphabet of 4 letters
 Protein sequences
– alphabet of 20 letters

A Quantitative Measure of Sequence
Similarity
 To compare the nucleotides or amino acids
that appear at corresponding positions in two
or more sequences, we must first assign
those correspondences.
 Sequence alignment is the identification of
residue-residue correspondences.

Orthologous and paralogous
 Orthologous sequences differ because they are
found in different species (a speciation event)
 Paralogous sequences differ due to a gene
duplication event
 Sequences may be both orthologous and
paralogous
Pairwise Alignment
 The alignment of two sequences (DNA or
protein) is a relatively straightforward
computational problem.
– There are lots of possible alignments.

 Two sequences can always be aligned.
 Sequence alignments have to be scored.
 Often there is more than one solution with the
same score.
Methods of Alignment
 By hand - slide sequences on two lines of a word
processor
 Dot plot
– with windows
 Rigorous mathematical approach
– Dynamic programming (slow, optimal)
 Heuristic methods (fast, approximate)
– BLAST and FASTA
• Word matching and hash tables.
Applications

 The basic tool of bioinformatics


 Sequence similarity is an indicator of
homology
 Database queries
– Determining the function of a newly discovered
genetic sequence
 Annotation of genomes
– Involving assignment of structure and function to
as many genes as possible
Dot Plot – Example (1)
• Let's consider a dot plot between sperm whale and human myoglobins
Sperm whale myoglobin
GLSDGEWQLV LNVWGKVEAD IPGHGQEVLI RLFKGHPETL
EKFDKFKHLK SEDEMKASED LKKHGATVLT ALGGILKKKG
HHEAEIKPLA QSHATKHKIP VKYLEFISEC IIQVLQSKHP
GDFGADAQGA MNKALELFRK DMASNYKELG FQG
human myoglobin
VLSEGEWQLV LHVWAKVEAD VAGHGQDILI RLFKSHPETL
EKFDRFKHLK TEAEMKASED LKKHGVTVLT ALGAILKKKG
HHEAELKPLA QSHATKHKIP IKYLEFISEA IIHVLHSRHP
GDFGADAQGA MNKALELFRK DIAAKYKELG YQG
Dot Plot – Example (2)

 Diagonal lines
of dots show
similarities
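A windowed dot plot of this kind can be sketched in a few lines. The example below is only an illustration: the function name `dot_plot`, the window size of 3 and the use of the first ten residues of each myoglobin are choices made here, not part of the slides.

```python
def dot_plot(seq_a, seq_b, window=1, min_matches=1):
    """Return text rows; '*' marks windows with at least min_matches identities."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row += "*" if matches >= min_matches else "."
        rows.append(row)
    return rows


whale = "GLSDGEWQLV"   # first ten residues of sperm whale myoglobin
human = "VLSEGEWQLV"   # first ten residues of human myoglobin
plot = dot_plot(whale, human, window=3, min_matches=3)
for residue, row in zip(whale, plot):
    print(residue, row)
```

With a window of 3 and exact matching, the shared stretch GEWQLV appears as a run of dots on the main diagonal, while the isolated single-residue matches that clutter an unwindowed plot are filtered out.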

Dot Plots → Sequence Alignments
• An alignment can reflect the evolutionary
relationship between two or more homologs.
 Three kinds of changes can occur at any
given position within a sequence
– Mutation
– Insertion
– Deletion

Many Possibilities
• An uninformative alignment:
-------gctgaacg
ctataatc-------
• An alignment without gaps:
gctgaacg
ctataatc
• An alignment with gaps:
gctga-a--cg
--ct-ataatc
• And another:
gctg-aa-cg
-ctataatc-
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Percent Sequence Identity
• The extent to which two nucleotide or amino
acid sequences are invariant

A C C T G A G – A G
A C G T G – G C A G
(column 3 is a mismatch; columns 6 and 8 are indels)
7 of 10 aligned columns match: 70% identical
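The 70% figure can be checked with a short calculation. This sketch assumes identity is counted over all alignment columns, gaps included, which is one of several conventions in use; the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(a == b and a != gap for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


pid = percent_identity("ACCTGAG-AG", "ACGTG-GCAG")
print(pid)
```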
Affine Gap – Example (1)

+2 for a match
-2 for a gap
-1 for a mismatch

Gap Penalties
• Linear gap penalty
– cost of a gap (length n) depends linearly on the gap-open
penalty (gi)
• f(g) = –n × gi
• Affine gap penalty
– cost of a gap depends on an initial gap-open penalty (gi) and
a subsequent gap-extension penalty (ge)
– based on the fact that a single biological mutational event
can insert or delete more than one residue
• f(g) = –[gi + (n – 1) × ge]
• Currently there is no widely accepted theory for selecting gap costs; the choice is generally guided by trial and error
Affine Gap – Example (2)

+2 for a match
-1 for a mismatch
a gap open score of –2
a gap extension score of -1.

Find the Best
• Need a way to examine all possible alignments systematically
• Compute a score reflecting the quality of
each possible alignment
• Identify the alignment with the optimal
score
• Several different alignments may give the
same best score
• Many different scoring schemes exist
Scoring Matrices for Nucleotide
Sequence
• A mild penalty for transitions
– A ↔ G
– C ↔ T
• A severe penalty for transversions
– A ↔ C
– A ↔ T
– G ↔ C
– G ↔ T
Transition/Transversion Matrix
     a   g   t   c
a   20  10   5   5
g   10  20   5   5
t    5   5  20  10
c    5   5  10  20
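The transition/transversion matrix above can also be generated from the purine/pyrimidine classes rather than stored as a table. A small sketch follows; the function name `substitution_score` is hypothetical and the scores 20/10/5 are taken from the matrix.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def substitution_score(x, y):
    """Score a nucleotide pair: identity 20, transition 10, transversion 5."""
    x, y = x.lower(), y.lower()
    if x == y:
        return 20
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return 10 if same_class else 5


for pair in ("aa", "ag", "ct", "ac", "gt"):
    print(pair, substitution_score(pair[0], pair[1]))
```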
Scoring Matrices for Amino Acid
Sequence
 Based on observed chemical/physical
similarity
– Residue hydrophobicity, charge, and size
– Genetic code
 Based on observed substitution frequencies

Widely Used Substitution Matrices
– Empirically Derived
• PAM: Point Accepted Mutations
– The PAM family (Dayhoff) is based on evolutionary
distance. The matrices were derived from closely related
sequences and the mutations seen in them.
• BLOSUM: BLOcks SUbstitution Matrix
– The BLOSUM family (Henikoff and Henikoff) was derived
from more distantly related sequences. The number in
the matrix name is a percent identity.
• A scoring system is a set of values quantifying the likelihood of one
residue being substituted by another in an alignment.
• It is also known as a substitution matrix.
• The scoring matrix for nucleotides is relatively simple.
• A positive value or a high score is given for a match and a
negative value or a low score is given for a mismatch.
• Scoring matrices for amino acids are more complicated
because scoring has to reflect the physicochemical properties
of amino acid residues.
Identity matrix
Transition-Transversion
matrix
Transition --- substitutions in which a purine (A/G) is replaced by
another purine (A/G) or a pyrimidine (C/T) is replaced by
another pyrimidine (C/T).

Transversions --- substitutions in which a purine is replaced by a pyrimidine or vice versa:
(A/G) ↔ (C/T)
• Match score: +1
• Mismatch score: +0
• Gap penalty: –1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gaps: 7 × (–1)
Score = 18 + 0 – 7 = +11
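The counts in this example (18 matches, 2 mismatches, 7 gap columns, score +11) can be reproduced with the short sketch below. The function name `score_alignment` is hypothetical; the scores are those given on the slide.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Count matches, mismatches and gap columns, then total the score."""
    matches = mismatches = gaps = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    total = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, total


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
result = score_alignment(top, bottom)
print(result)
```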
PAM - Point Accepted Mutation, based on
global alignment [evolutionary model]

BLOSUM - BLOcks SUbstitution Matrix, based
on local alignments [similarity among
conserved sequences]
• PAM matrices were first given by Dayhoff, who compiled alignments of 71
groups of very closely related protein sequences.
• PAM stands for Point Accepted Mutation.
• PAM matrices were derived from the evolutionary
divergence between sequences of related proteins.
• Construction of the PAM1 matrix involves alignment of full
length sequences and subsequent construction of phylogenetic
trees using the parsimony principle.
• Ancestral sequence information is used to count the number of
substitutions along each branch of the tree.
• Positive scores in the matrix denote substitutions occurring
more frequently than expected among evolutionarily conserved
replacements.
• Negative scores correspond to substitutions which occur less
frequently than expected.
• A PAM is defined as a 1% amino acid change, or one mutation per
100 residues.
• Increasing PAM numbers correlate with increasing PAM
units and thus with the evolutionary distance between protein sequences.
• PAM matrices are constructed from phylogenetic
relationships established prior to scoring mutations;
• It is difficult to determine ancestral
relationships among sequences;
• They are based on a small set of closely related
proteins.
• BLOSUM is a series of blocks amino acid substitution matrices.
• It is derived from direct observation of every
possible amino acid substitution in multiple sequence
alignments.
• A sequence pattern is also called a block.
• The ungapped alignments are less than 60 amino acids in
length.
• The number in a BLOSUM matrix name is the percent identity
of the sequences selected for construction of the matrix.
• BLOSUM62 indicates that the sequences selected for
constructing the matrix share on average 62% identity.
• The BLOSUM score for a particular residue pair is derived
from the log ratio of the observed residue substitution frequency versus
the expected probability of that residue pair.
• The lower the BLOSUM number, the more divergent the sequences
it was built from.
• BLOSUM62 was
measured on pairs
of sequences with
an average of 62%
identical amino
acids.

Log-odds = log ( chance to see the pair in homologous proteins /
chance to see the pair in unrelated proteins by chance )
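The log-odds ratio can be illustrated numerically. In this sketch the two probabilities are hypothetical numbers chosen for the example, not values from any published matrix, and base 2 (bits) is an assumption; published matrices additionally scale and round the result.

```python
import math


def log_odds(p_homologous, p_chance):
    """Log (base 2) of how much more often a pair is seen than expected by chance."""
    return math.log2(p_homologous / p_chance)


# Hypothetical frequencies for one residue pair.
favoured = log_odds(p_homologous=0.02, p_chance=0.005)     # seen 4x more often
disfavoured = log_odds(p_homologous=0.001, p_chance=0.004)  # seen 4x less often
print(round(favoured, 3), round(disfavoured, 3))
```

A pair observed more often in homologous proteins than chance predicts gets a positive score, and a pair observed less often gets a negative one, which is the sign convention described for the substitution matrices above.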
PAM
› Based on a mutational model of evolution (Markov process)
› PAM1 is based on sequences of 85% similarity
› Designed to track the evolutionary origins of proteins
BLOSUM
› Based on the multiple alignment of blocks
› Good for comparing distant sequences
› Designed to find proteins' conserved domains
Measure of Sequence Divergence –
PAM
 1 PAM = 1 percent accepted mutation
 Two sequences 1 PAM apart have 99%
identical residues.
 Given amount of evolutionary time, how likely
one amino acid is to mutate to another.
 Collecting statistics from pairs of sequences
as closely related as 1 PAM to produce the
1PAM substitution matrix.

For More Widely Divergent
Sequences
 Matrices representing larger evolutionary
distances may be derived from the PAM1
matrix by matrix multiplication.
 PAM250:
– Corresponding to ~20% identity
– The lowest sequence similarity for which we can
hope to produce a correct alignment

PAM          0   30   80  110  200  250
% identity 100   75   50   40   25   20
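Deriving a larger-distance matrix from PAM1 by repeated matrix multiplication can be sketched as follows. The 3-letter alphabet and the probabilities in `pam1_toy` are invented for illustration (the real PAM1 mutation probability matrix is 20 × 20); the multiplication is applied to mutation probabilities, and the familiar integer scores are log-odds values computed from the result.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 1-PAM mutation probabilities for a 3-letter alphabet:
# each row sums to 1 and 99% of residues stay unchanged.
pam1_toy = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]
pam250_toy = mat_power(pam1_toy, 250)
for row in pam250_toy:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal entries are much smaller than 0.99, reflecting the low expected identity between sequences 250 PAM apart.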
PAM 250

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6
BLOSUM Matrices
• PAM matrices were based on only a small number of observed substitutions (~1500)
• BLOSUM matrices were designed to perform best in identifying distant
relationships
• Derived from the BLOCKS database (BLOSUM = BLOcks SUbstitution
Matrix)
• Blocks are regions of closely related proteins alignable
without gaps
• BLOSUM62 ≈ PAM150
• BLOSUM50 ≈ PAM250
Scoring/Substitution Matrices
BLOSUM62

The Blosum50 Scoring Matrix


Examples of Scoring Scheme
 For DNA sequences, CLUSTAL-W recommends
use of the identity matrix for substitution
– +1 for a match
– 0 for a mismatch
– Penalty 10 for gap open
– Penalty 0.1 for gap extension by one residue
 For protein sequences
– BLOSUM 62 matrix for substitution
– Penalty 11 for gap open
– Penalty 1 for gap extension by one residue

Dynamic Programming
• General algorithmic development technique
• Reuses the results of previous computations
– Store intermediate results in a table for reuse
• Look up in table for earlier result to build from
Global vs. Local Alignment

Global Alignment

 Needleman-Wunsch 1970
 Idea: Build up optimal alignment from optimal
alignments of subsequences

Three Steps of Dynamic
Programming
 A simple scoring scheme is assumed where
– Si,j = 1 (match score); otherwise
– Si,j = 0 (mismatch score)
– w = 0 (gap penalty)
 Three steps in dynamic programming
– Initialization
– Matrix fill (scoring)
– Traceback (alignment)

Initialization Step
• This example assumes there is no gap opening or gap extension penalty

GAATTCAGTTA
-----------

-------
GGATCGA

3 cases:
G-   -G   G
-G   G-   G
Matrix Fill Step

Traceback Step
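The three steps (initialization, matrix fill, traceback) can be sketched together for the example sequences GAATTCAGTTA and GGATCGA with the simple scheme given earlier (match 1, mismatch 0, gap 0). This is a minimal illustration, not a reference implementation; the function name and the tie-breaking order in the traceback are choices made here, so other equally optimal alignments exist.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1

    # Initialization: first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Matrix fill: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback: walk from the bottom-right cell back to the origin.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(best)
print(row_a)
print(row_b)
```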

Z-score
Z = (Xs – Xt) / s

Xs = alignment score with the real sequences
Xt = average of the distribution of scores with random sequences
s = SD of the distribution of scores with random sequences

Significance of the alignment:
Z < 3 not significant
3 < Z < 6 putatively significant
6 < Z < 10 possibly significant
Z > 10 significant
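A Z-score of this kind is the real alignment score minus the mean score obtained with randomized sequences, divided by the standard deviation of those random scores. The sketch below assumes the random scores have already been collected (for example by aligning shuffled copies of one sequence); the numbers are hypothetical, and assigning the boundary values 3, 6 and 10 to the upper category is a choice made here.

```python
import statistics


def z_score(real_score, random_scores):
    """Standardize a real alignment score against scores from random sequences."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    return (real_score - mean) / sd


def interpret(z):
    """Map a Z-score to the significance categories listed above."""
    if z < 3:
        return "not significant"
    if z < 6:
        return "putatively significant"
    if z < 10:
        return "possibly significant"
    return "significant"


# Hypothetical scores: one real alignment and five shuffled controls.
z = z_score(30, [10, 12, 8, 11, 9])
print(round(z, 2), interpret(z))
```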

P-value
• The probability that the observed match could
have happened by chance
• Optimal local alignment scores
for pairs of random amino acid
sequences of the same length
follow an extreme-value
distribution:
P(S < x) = exp[–K exp(–λx)]
P(S ≥ x) = 1 – exp[–K exp(–λx)]
• A p-value of 0.01 means that 1 in 100 matches giving this
score are to unrelated sequences.
E-value
• The expected number of pairs with score at
least S is given by the E-value for the score S:
E = Kmn exp(–λS)
where m and n are the lengths of the query and of the database.
• The E-value takes into account the size of the
database being scanned.
• The parameters K and lambda (λ) can be
thought of simply as natural scales for the
search space size and the scoring system
respectively.
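The E-value formula E = Kmn exp(–λS), and the standard conversion P = 1 – exp(–E) to the probability of at least one chance hit, can be sketched as below. The values of K, λ, m and n are illustrative assumptions only; in practice K and λ depend on the scoring matrix and gap penalties and are reported by the search program.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance hits scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit, given the expected number e."""
    return 1.0 - math.exp(-e)


# Illustrative parameters (assumed, not taken from the slides).
K, lam = 0.041, 0.267
m, n = 250, 1_000_000        # query length and database length
for s in (50, 100, 150):
    e = e_value(s, m, n, K, lam)
    print(s, e, p_value(e))
```

The E-value falls exponentially as the score rises and grows in proportion to the database size, which is why the same score is less significant in a larger database.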
Comparison of the Performance

 Compare the performance (execution time) of


the three programs
– SSEARCH
– FASTA
– BLAST

Module 4
Topic: BLAST
INTRODUCTION
 An important goal of genomics and proteomics is to determine
if a particular sequence is like another sequence. This is
accomplished by comparing the new sequence with sequences
that have already been reported and stored in a database.
 This process is principally one that uses alignment procedures
to uncover the “like” sequence in the database.
 The alignment process will uncover those regions that are
identical or closely similar and those regions with little (or
no) similarity.
 Two alignment types are used: global and local.
BLAST
• BLAST stands for Basic Local Alignment Search Tool.
• BLAST was developed by Stephen Altschul, Warren
Gish, Webb Miller, Eugene Myers, and David J. Lipman at
NCBI in 1990.
• It is a local alignment tool.
• It helps to find regions of local similarity between sequences.
• It is a program that compares nucleotide or protein sequences to sequence
databases and calculates the statistical significance of matches.
• BLAST can be used to infer functional and evolutionary
relationships between sequences as well as help identify members of
gene families.
NCBI HOMEPAGE
NCBI-BLAST HOMEPAGE
TYPES
BLAST programs by query type:
Amino acid sequence query
• blastp
• tblastn
DNA sequence query
• blastn
• blastx
• tblastx
STEPS

Specifying A Sequence Of Interest

Selecting BLAST Program

Selecting Database

Selecting Optional Parameters

Selecting Formatting Parameters


PROCESS
• The first step of the BLAST algorithm is to break the query
into short words of a specific length.
• For example, twelve amino acids near the amino terminal of the
Arabidopsis thaliana protein phosphoglucomutase sequence are:
NYLENFVQATFN
• This sequence is broken down into three-character words by
selecting the first three amino acid characters and then moving along one residue at a time:
NYL YLE LEN ENF NFV FVQ VQA QAT ATF TFN
• These words are then compared against a sequence in a
database.
• For example, word match with rabbit muscle phosphoglucomutase:
Query   ENF
Subject SSTNYAENTIQSIISTVEPAQR
• This search is performed for all words. Those words whose T
value is greater than 18 are used as seeds to extend the
alignment.
• For every pair of sequences (query and target) that have a
word or words in common, BLAST extends the alignment in
both directions to find alignments that score greater (are more
similar) until the alignment score decreases in value.
• For example, consider the following alignment between the A. thaliana
and rabbit muscle phosphoglucomutase:
Query   NYLENFVQATFNALTAEKV
        NY ENF+Q + + + +
Subject NYAENTIQSIISTVEPAQR
• Once this alignment process is completed for a query and each
subject sequence in the database, a report is generated. This
report provides a list of those alignments (default size of 50)
with a value greater than the S cutoff value.
• Those alignments whose score is above the cutoff are called
High Scoring Segment Pairs (HSPs).
• For each alignment reported, an Expect (E) value is reported.
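The first step of the process above, breaking the query into overlapping three-letter words, can be sketched as follows. The function name `query_words` is hypothetical; real BLAST additionally expands each word into a neighbourhood of similar words that score above the threshold T, which is not shown here.

```python
def query_words(query, k=3):
    """List every overlapping word of length k in the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


word_list = query_words("NYLENFVQATFN")
print(" ".join(word_list))
```

A query of length L yields L – k + 1 words, so the twelve-residue example gives ten words.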
BLAST OUTPUT
• The BLAST output is displayed in three formats.
A. Graphical display: shows where the query is similar to other
sequences.
B. Hit list: the sequences similar to the query, ranked by
similarity.
C. Alignment: every alignment between the query and the
reported hits.
BLAST OUTPUT
A. GRAPHICAL DISPLAY
• The query sequence is at the top, with a colour key for alignment scores.
• Each bar represents the portion of another sequence that is
similar to your query sequence:
– Red bars: most similar sequences.
– Pink bars: less good matches.
– Green bars: unimpressive matches.
– Blue bars: worst scores.
– Black bars: bad hits.
BLAST OUTPUT
B. HIT LIST
• 1 - This portion of each description links to the sequence record for a particular hit.
• 2 - Score or bit score is a value calculated from the number of gaps and substitutions
associated with each aligned sequence. The higher the score, the more significant the
alignment.
• 3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score
will occur in the database by chance. The smaller the E Value, the more significant the
alignment.
• 4 - These links provide the user with direct access from BLAST results to related
entries in other databases. 'L' links to LocusLink records and 'S' links to structure
records in NCBI's Molecular Modelling Database.
BLAST OUTPUT
C. ALIGNMENT
APPLICATIONS
• BLAST can be used for several purposes. These include:
– Identifying Species
– Establishing Phylogeny
– DNA Mapping
– Locating Domains
MULTIPLE SEQUENCE
ALIGNMENT
Module 4
Topic: Motifs and Patterns, PROSITE,
Hidden Markov Models (HMMs)
Motifs
• Defined as a nucleotide or amino acid
sequence pattern that is widespread and
is associated with a biological function.
– A sequence motif ≠ a structural motif.
– A sequence motif residing in the coding
region may encode a structural motif.
– Non-coding nucleotide motifs may have a
regulatory role, for example as recognition sites
for DNA binding proteins.
Motifs, profiles and patterns
• Conserved region of a DNA or protein –
Motif
• Qualitative expression of a motif – Pattern
– Regular Expression
– C[TA]TTG{X}
• Quantitative expression of a motif –
Profile
– Position Specific Scoring Matrices (PSSMs)
– Weight matrices
Motifs/Patterns
N{P}[ST]{P}
[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]
[] -> or (probability information is lost)
{} -> not
() -> repeated
^ -> beginning
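Patterns written in this notation map almost directly onto regular expressions. The converter below is a minimal sketch covering only the symbols listed above (plus `x` for any residue and optional `-` separators); the function names are hypothetical and the test sequence is invented for the example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.replace("-", "")                       # optional separators
    regex = regex.replace("{", "[^").replace("}", "]")     # {P} -> any residue except P
    regex = regex.replace("(", "{").replace(")", "}")      # x(3) -> .{3}
    return regex.replace("x", ".")                         # x -> any residue


def find_sites(sequence, pattern):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    lookahead = re.compile("(?=" + pattern_to_regex(pattern) + ")")
    return [m.start() for m in lookahead.finditer(sequence)]


print(pattern_to_regex("N{P}[ST]{P}"))
print(pattern_to_regex("[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]"))
sites = find_sites("NPSANGTA", "N{P}[ST]{P}")
print(sites)
```

In the test sequence the first N is followed by P and is therefore rejected, while the N at position 4 (0-based) satisfies the pattern.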
Profiles
• Quantitative representation.
• More useful for training
dataset.
Example alignment (the ATG is highlighted in the original figure):
TCTAGAAGATGGCAGTGGCGAAGA
TCTAGAAAATGACAGTGGCGAAGA
TCTAGAAAATGGCAGTAGCGAAGA
TCTACTAAATGATAGTAGCGAAGA
Profile (% of each base at the first eight positions):
A   0,   0,   0, 100,   0,  75, 100,  75
T 100,   0, 100,   0,   0,  25,   0,   0
G   0,   0,   0,   0,  75,   0,   0,  25
C   0, 100,   0,   0,  25,   0,   0,   0
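A profile of this kind is simply the per-column frequency of each base in the aligned sequences. The sketch below recomputes it for the four sequences shown on the slide, reading the fourth sequence without the stray spaces; the function name `profile` is hypothetical and no pseudocounts or log-odds conversion are applied.

```python
def profile(sequences, alphabet="ATGC"):
    """Percentage of each base at every column of equal-length sequences."""
    n = len(sequences)
    length = len(sequences[0])
    return {
        base: [100 * sum(seq[i] == base for seq in sequences) / n
               for i in range(length)]
        for base in alphabet
    }


sequences = [
    "TCTAGAAGATGGCAGTGGCGAAGA",
    "TCTAGAAAATGACAGTGGCGAAGA",
    "TCTAGAAAATGGCAGTAGCGAAGA",
    "TCTACTAAATGATAGTAGCGAAGA",
]
prof = profile(sequences)
for base in "ATGC":
    print(base, prof[base][:8])
```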
De novo prediction of Motifs
• MEME; EXTREME; AlignAce, Amadeus,
CisModule, FIRE, Gibbs Motif Sampler,
PhyloGibbs, SeSiMCMC, ChIPMunk
and Weeder. SCOPE, MotifVoter, and
Mprofiler

MEME (Multiple Expectation Maximization for Motif Elicitation)
Resources

MacIsaac KD, Fraenkel E (2006) Practical Strategies for Discovering Regulatory DNA Sequence Motifs. PLoS Comput
Biol 2(4): e36. doi:10.1371/journal.pcbi.0020036
http://journals.plos.org/ploscompbiol/article?id=info:doi/10.1371/journal.pcbi.0020036
MRLSFVPLLQLSRLVVSTQHSTKMSTVYRTCKMNEIALSLLAPTQPLDADQ
GVMSPMASSDQ
TTSIGDFRFLRTHHDKEERGLLVTSLTKGLAETSFPYR
YTSMCATICSITHSRADAAPAKQAH
What is Pattern Recognition?
• A technique to identify interesting patterns of events, such as amino acids,
nucleotides or gene expression levels, that appear a number of times in a
particular set of data.
Pattern Recognition in Molecular Biology
• Human Genome Project
• Protein analysis
• Gene Expression & DNA Micro Analysis
• Drug Discovery
Pattern Discovery in Proteins
• Three main steps
- Proteins related to a query sequence are found by searching the database for
similar sequences.
- Sequences revealed from this initial screen are then used as query sequences to
search other family members
- This process is repeated till exhaustion.
Tandem Repeats
• These are two or more contiguous, approximate copies of a pattern of nucleotides.
• These duplications occur as a result of mutational events in which an original segment
of DNA, the pattern, is converted into a sequence of individual copies.
• They have been linked to a number of different diseases.
• These might play a role in gene regulation and in the development of immune system
cells.
Types of Patterns

Deterministic
Matches a given string or not.
Probabilistic
each sequence is given a probability that
this sequence is generated by a model.
The higher the probability, the better is the
match between sequence and pattern.
PROSITE
PROSITE is a protein database. It consists of entries describing
the protein families, domains and functional sites, as well as the
amino acid patterns, signatures and profiles in them. The entries are
manually curated by a
team of the Swiss Institute of Bioinformatics and tightly
integrated into Swiss-Prot protein annotation.

PROSITE was created in 1988 by Amos Bairoch, who directed the
group for more than 20 years. Since July 2009 the director of the
PROSITE, Swiss-Prot and Vital-IT groups has been Ioannis
Xenarios.
PROSITE's uses include identifying possible functions of newly
discovered proteins and analysis of known proteins for
previously undetermined activity. Properties from well-studied
genes can be propagated to biologically related organisms, and
for different or poorly known genes biochemical functions can be
predicted from similarities.

PROSITE offers tools for protein sequence analysis and motif
detection. It is part of the ExPASy proteomics analysis servers.
HMM-BASED TOOLS
• GENSCAN (Burge 1997)
• FGENESH (Solovyev 1997)
• HMMgene (Krogh 1997)
• GENIE (Kulp 1996)
• GENMARK (Borodovsky & McIninch 1993)
• VEIL (Henderson, Salzberg, & Fasman 1997)
Module 4
Topic: Phylogenetic analysis
What is Phylogenetic Tree?
• A branching diagram
• Showing the inferred evolutionary relationships among
various biological species
• Based upon similarities and differences in their physical or
genetic characteristics
• Each node with descendants represents the inferred most
recent common ancestor of the descendants
History
• Early representations of "branching"
phylogenetic trees include a "paleontological
chart" showing the geological relationships
among plants and animals in the
book Elementary Geology, by Edward Hitchcock
in 1840.

• Charles Darwin in 1859 also produced one of the
first illustrations and crucially popularized the
notion of an evolutionary "tree" in his seminal
book The Origin of Species.
What does this tree look like?
What do the lines represent?
PHYLOGENETIC TREE
Phylogeny is the evolutionary history of a
kind of organism.
In phylogenetic studies, the most convenient
way to study the evolutionary relationships
among a group of organisms is through the
illustration of a phylogenetic tree.
DEFINITION: A phylogenetic tree is a two-dimensional
graph showing evolutionary
relationships between organisms, or genes
from various organisms.
Characteristics:
Nodes can be internal or external.
Each internal node represents the last common
ancestor of the two lineages.
External nodes (also termed terminal nodes or
leaves) represent the tips of the tree.
Nodes correspond to species, organisms or
sequences.
Similarly, branches can be internal or external.
Internal branches (internodes) connect two
nodes, whereas external branches connect a tip
and a node.
The branches of a phylogenetic tree can be either:
- Scaled
- Unscaled
With scaled branches, the lengths are
proportional to the amount of evolutionary change.
Example: phylogram.
With unscaled branches, the branch length is
not proportional to the number of changes.
Example: cladogram.
When constructing phylogenetic trees, researchers identify
homologous features that are shared by some species
but not by others.
This allows them to group species based on their shared
characteristics.
Historically, comparisons of morphological similarities and
differences have been used to construct evolutionary trees.
In this approach, species that share certain characteristics
(i.e., homologous traits) tend to be placed closer together on
the tree.
In 1963, Linus Pauling and Emile Zuckerkandl were the first
to suggest the use of molecular data to establish
evolutionary relationships.
• When comparing homologous genes in different species,
the DNA sequences from closely related species are more
similar to each other than are the sequences from
distantly related species.


Phylogenetic tree based on homology
Phylogenetic trees are now based on homology, which
refers to similarities among various species that occur
because the species are derived from a common
ancestor.
• Attributes that are the result of homology are said to
be homologous.
Phylogenetic tree reconstruction
• Phylogenetic trees are constructed:
- To reconstruct the evolutionary past.
- To develop an understanding of when and
which speciation events may have occurred to
give rise to the organisms seen today.
A phylogenetic analysis consists of four steps:
• SEQUENCE ALIGNMENT: Sequence
alignment is the essential preliminary to
tree reconstruction. The data used in the
reconstruction of a DNA-based phylogenetic
tree are obtained by comparing nucleotide
sequences.
These comparisons are made by aligning the
sequences so that nucleotide differences can
be scored.
• DETERMINING THE SUBSTITUTION MODEL
• TREE BUILDING
• TREE EVALUATION
Construction of phylogenetic tree
2 types of method:
1. Character based method
   A. Maximum parsimony
   B. Maximum likelihood
2. Distance based method
Character based method:
This method is also called the discrete
method and is based directly on the
sequence characters rather than on pairwise
distances.
The two most popular character based
methods are:
1. MAXIMUM PARSIMONY
2. MAXIMUM LIKELIHOOD
Maximum parsimony
The parsimony method is one of the pioneering
methods of phylogeny construction.
Parsimony groups taxa together in a way that
minimizes the number of changes.
It assumes that the best hypothesis is the one
that requires the fewest evolutionary changes;
hence it is sometimes loosely called a
minimum evolution method (not to be confused with the
distance based minimum evolution criterion).
It also states that the preferred hypothesis is
the one that is simplest.
EXAMPLE: If two species possess a tail, then
there are two hypotheses:
The first assumes that a tail arose once during
evolution and that both species have descended
from a common ancestor with a tail.
The second assumes that tails arose
twice during evolution and that the tails in the
two species are not due to descent from a
common ancestor.
The first hypothesis requires fewer changes, so it is the simplest one and is
accepted.
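The tail example can be turned into a small counting exercise. The sketch below uses Fitch's small-parsimony counting on a fixed tree, which is one common way to count the minimum number of changes (the notes do not name a specific algorithm). The four species, their tail states and the two candidate trees are hypothetical.

```python
def fitch(node, states):
    """Return (possible_states, min_changes) for the subtree rooted at node.

    A node is either a leaf name (str) or a tuple of two child nodes.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_changes = fitch(node[0], states)
    right_set, right_changes = fitch(node[1], states)
    common = left_set & right_set
    if common:
        # Children can agree on a state: no extra change needed here.
        return common, left_changes + right_changes
    # Children disagree: at least one change happened on this part of the tree.
    return left_set | right_set, left_changes + right_changes + 1

# Hypothetical character data: two species with a tail, two without.
tail = {"sp1": "tail", "sp2": "tail", "sp3": "no_tail", "sp4": "no_tail"}

# Candidate tree 1 groups the tailed species together.
tree_grouped = (("sp1", "sp2"), ("sp3", "sp4"))
# Candidate tree 2 separates the tailed species.
tree_split = (("sp1", "sp3"), ("sp2", "sp4"))

changes_grouped = fitch(tree_grouped, tail)[1]
changes_split = fitch(tree_split, tail)[1]
print(changes_grouped, changes_split)
```

Under these assumptions the tree that groups the tailed species needs one change and the other needs two, so parsimony prefers the first.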
Maximum likelihood approach
The maximum likelihood method presents
an additional opportunity to evaluate trees
when mutation rates vary among
different lineages.
The method can be used to explore
relationships among more diverse sequences
and under conditions that are not well handled by
maximum parsimony methods.
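To give a feel for what "likelihood" means here, the sketch below applies maximum likelihood to the simplest possible case: estimating the evolutionary distance between two aligned sequences. It assumes the Jukes-Cantor (JC69) substitution model, which the notes do not mention, and the site counts are made up. A full tree search is far more involved; this only illustrates the principle of picking the parameter value that makes the observed data most probable.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences under JC69.

    d is the distance in expected substitutions per site.
    """
    # Probability that a site differs between the two sequences.
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    same = (1.0 - p) / 4.0   # probability of one specific identical pair, e.g. A/A
    diff = p / 12.0          # probability of one specific differing pair, e.g. A/G
    return (n_sites - n_diff) * math.log(same) + n_diff * math.log(diff)

def ml_distance(n_sites, n_diff, step=0.001, max_d=3.0):
    """Grid search for the distance with the highest likelihood."""
    candidates = [i * step for i in range(1, int(max_d / step) + 1)]
    return max(candidates, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# Hypothetical data: 10 differences observed in 100 aligned sites.
best = ml_distance(100, 10)
# Known closed-form JC69 estimate, used here only as a cross-check.
closed_form = -0.75 * math.log(1.0 - (4.0 / 3.0) * (10 / 100))
print(round(best, 3), round(closed_form, 3))
```

The grid search and the closed-form JC69 estimate agree to within the grid step, which is the behaviour one would expect from a maximum likelihood estimate.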
Distance based method:
Distance methods are based on the amount of
dissimilarity (distance) between two aligned
sequences.
Such methods remain important when using
fossil data to build phylogenies for extinct
species; for living species it is more common
to use DNA sequences from the species being compared.
These methods assume that all sequences involved
are homologous and that tree branches are
additive, meaning the distance between
two taxa equals the sum of all branch
lengths connecting them.
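The additivity assumption can be sketched as follows. The tree, its branch lengths and the helper names are hypothetical; the code simply adds up the branch lengths on the path between two taxa.

```python
# Hypothetical tree stored as child -> (parent, branch length to parent).
parent = {
    "A": ("n1", 0.1),
    "B": ("n1", 0.2),
    "n1": ("root", 0.3),
    "C": ("root", 0.5),
}

def path_to_root(node):
    """Return {ancestor: distance from node}, including the node itself."""
    dist = {node: 0.0}
    total = 0.0
    while node in parent:
        up, length = parent[node]
        total += length
        dist[up] = total
        node = up
    return dist

def tree_distance(x, y):
    """Sum of branch lengths on the path connecting taxa x and y."""
    dx, dy = path_to_root(x), path_to_root(y)
    # The path turns around at the most recent common ancestor, which is the
    # shared ancestor with the smallest summed distance.
    return min(dx[n] + dy[n] for n in dx if n in dy)

d_ab = tree_distance("A", "B")   # 0.1 + 0.2
d_ac = tree_distance("A", "C")   # 0.1 + 0.3 + 0.5
print(d_ab, d_ac)
```

A distance method works in the opposite direction: it starts from the pairwise distances and looks for a tree whose branch lengths add up to them as closely as possible.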
Limitations of phylogenetic trees
- Limitations
1. The inferred evolutionary history
may be inaccurate
2. The data used can be noisy
3. Problems arise when a tree is based on a single type
of character
4. Homoplasy (similarity not inherited from
a common ancestor) can confound the tree
5. Branch length does not necessarily represent
the time elapsed
- Fields of study
1. Cladistics
2. Comparative phylogenetics
3. Computational phylogenetics
4. Evolutionary taxonomy
5. Evolutionary biology
6. Phylogenetics
Applications:
• Find out the evolutionary history of a group of organisms.
• Measure phylogenetic diversity using
phylogenetic trees.
• Search for natural products.
• Trace the evolutionary histories of infectious bacteria and viruses.
Applications (continued):
• Find out what trends taxa have undergone in their
history.
• Guide the search for new species.
• Find out how species spread geographically
during their evolution.
• Tell us when and where taxa originated.

Module 4
Topic: Clustal, PHYLIP &
Bootstrapping
Clustal Omega
• Purpose: Clustal Omega is a widely used tool for performing multiple
sequence alignments (MSA). It aligns protein or nucleotide sequences to
identify conserved regions and evolutionary relationships.
• How It Works:
- Uses a guide tree to align sequences progressively.
- Employs Hidden Markov Models (HMMs) for greater accuracy in large
datasets.
• Applications:
- Comparative genomics.
- Identifying functional domains.
- Evolutionary and phylogenetic studies.
• Advantages: Fast and scalable, handling thousands of sequences efficiently.
Phylogenetic Analysis using PHYLIP - Unrooted trees
Theory:
• PHYLIP is a complete phylogenetic analysis package which was
developed by Joseph Felsenstein at the University of Washington.
• PHYLIP is used to find the evolutionary relationships between
different organisms. Some of the methods available in this
package are the maximum parsimony, distance matrix
and likelihood methods.
• The data is presented to the program from a text file, which is
prepared by the user using a text editor or a word
processor (saved as plain text). Some sequence analysis programs, such
as ClustalW, can write data files in PHYLIP format.
• Most of the programs look for an input file called "infile".
• Phylogenetic analysis: Analyzes the evolutionary
relationships between different organisms; this analysis
helps to find out the changes that occurred in organisms
during evolution.
• Bootstrapping: A way to test how reliably the dataset supports the inferred tree.
• Query: The input given by the user. This can be
either a protein or a nucleotide sequence.
• Rooted tree: A tree that has a special node designated as the main
node, also called the root. A tree without a root is an
unrooted (free) tree.
• Tree topology: The branching arrangement of a
phylogenetic tree.
PHYLIP file format:
• The first line of the input file gives the number of
sequences and the number of characters (nucleotides or amino acids) in each.
• The sequence name is 10 characters long. Spaces can be
added to the end of shorter names to make them
10 characters long.
• Gaps can be represented as '-'.
• Missing data can be represented as '?'.
• Spaces within the aligned sequences are allowed, usually
after every 10 bases.
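The format rules above can be sketched as a small writer function. This is a minimal sketch, assuming the sequential PHYLIP layout in which the first line holds the number of sequences and the number of sites and each sequence name occupies the first 10 characters of its line. The function name to_phylip and the example sequences are hypothetical.

```python
def to_phylip(alignment):
    """Format {name: aligned_sequence} as sequential PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    # Header line: number of sequences, then number of sites.
    lines = [f" {len(alignment)} {n_sites}"]
    for name, seq in alignment.items():
        # Names are cut or padded with spaces to exactly 10 characters.
        padded = name[:10].ljust(10)
        # A space after every 10 bases is optional and only aids readability.
        blocks = " ".join(seq[i:i + 10] for i in range(0, len(seq), 10))
        lines.append(padded + blocks)
    return "\n".join(lines) + "\n"

# Hypothetical aligned sequences; '-' is a gap and '?' is missing data.
example = {
    "Human": "ACGTACGTACGTAC",
    "Chimp": "ACGTACGTAC-TAC",
    "Mouse": "ACGTTCG?ACGTAA",
}
text = to_phylip(example)
print(text)
```

The resulting text could be saved under the name "infile" so that a PHYLIP program picks it up by default.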
Methods involved in PHYLIP:
1. Maximum parsimony method
2. Distance method
3. Maximum likelihood method
• Maximum parsimony method: A character-based method
which infers a phylogenetic tree by minimizing the total number of
evolutionary steps, or total tree length, for a given set of data. It is
also referred to as a sequence based tree reconstruction method.
• Distance methods: Evolutionary distances are calculated between all pairs of
operational taxonomic units, and a tree is built in which the distances
between the operational taxonomic units match these distances as closely as possible.
• Maximum likelihood method: Uses a model of sequence
evolution to find the tree that gives the highest likelihood of the
observed data.
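The first step of a distance method, calculating a distance for every pair of operational taxonomic units, can be sketched as below. The sketch uses the simple proportion of differing sites (the p-distance) and skips gapped columns; both are assumptions made for illustration, since PHYLIP's own distance programs apply substitution models. The sequences are hypothetical.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns (one of several possible conventions)
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared

# Hypothetical aligned sequences for three OTUs.
aligned = {
    "sp1": "ACGTACGTAC",
    "sp2": "ACGTTCGTAC",
    "sp3": "ACG-TCGAAC",
}

# One distance per unordered pair of OTUs.
matrix = {
    (x, y): p_distance(aligned[x], aligned[y])
    for x, y in combinations(aligned, 2)
}
for pair, dist in matrix.items():
    print(pair, round(dist, 3))
```

A tree-building step would then look for the tree whose branch lengths best reproduce these pairwise values.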
Programs used in PHYLIP:
• The following are some of the programs available in the PHYLIP package.

• Dnapars: Estimates the phylogeny from nucleic acid sequences using the parsimony method.

• Dnamove: An interactive program for constructing phylogenies from nucleic acid sequences
using the parsimony method.

• Dnapenny: Finds the most parsimonious phylogenies for nucleic acid sequences using the branch and
bound method.

• Dnacomp: Estimates the phylogeny from nucleic acid sequences by searching for the largest number of sites that could have uniquely
evolved on the same tree.

• Dnainvar: Computes phylogenetic invariants from nucleic acid sequences, which test alternative tree topologies. The
program also tabulates (charts) the frequencies of occurrence of the different nucleotide patterns.

• Dnaml: Estimates phylogenies from nucleotide sequences by the maximum likelihood method without
assuming a molecular clock. A molecular clock assumes a constant rate of change over time, which allows the timing of evolutionary events to be calculated.

• Dnamlk: Estimates the phylogeny using the maximum likelihood method, assuming a molecular clock.
Bootstrap analysis
• It involves resampling one's own data, with replacement, to create a series of
bootstrap samples of the same size as the original data.
• In the case of nucleic acid (amino acid) sequences, the resampled data are the
nucleotide (amino acid) sites, i.e. the columns of the alignment, while the statistical significance of a
specific cluster is given by the fraction of trees, based on the resampled data,
containing that cluster.
• Bootstrapping can be considered a two-step process comprising the
generation of (many) new data sets from the original set and the
computation of a number that gives the proportion of times that a particular
branch (e.g., a group of taxa) appeared in the trees. That number is commonly
referred to as the bootstrap value.
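The first step, generating new data sets, can be sketched as follows. The sketch assumes, as is usual for sequence data, that the units being resampled are alignment columns (sites). The function name, the seed and the example alignment are hypothetical.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    Returns a new alignment with the same number of sequences and sites.
    """
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # Draw n_sites column indices; a column may be drawn several times or never.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}

# Hypothetical aligned sequences.
original = {
    "sp1": "ACGTACGTAC",
    "sp2": "ACGTTCGTAC",
    "sp3": "ACGATCGAAC",
}

rng = random.Random(42)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_alignment(original, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate keeps whole columns together, so the relationship between the sequences at a given site is preserved while the set of sites changes. A tree would then be built from every replicate.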
Bootstrapping
• Purpose: A statistical technique to measure the confidence of phylogenetic
tree branches.
• How It Works:
- Resamples the original data with replacement
to create multiple datasets.
- Builds a tree for each resampled dataset.
- Computes the frequency (bootstrap value) of branches across all trees.
• Applications:
- Validating phylogenetic trees.
- Ensuring reliability of inferred evolutionary relationships.
• Interpreting Bootstrap Values:
- Higher values (e.g., >70%) indicate strong
support for a branch.
- Lower values suggest weak or uncertain relationships.
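The last step, turning a collection of replicate trees into bootstrap values, can be sketched as a simple count. In this sketch each replicate tree is represented only by the set of groups (clades) it contains; that representation, the function name and the four replicate trees are hypothetical, and the tree-building step itself is not shown.

```python
from collections import Counter

def bootstrap_support(replicate_trees):
    """Percentage of replicate trees in which each group of taxa appears."""
    counts = Counter()
    for clades in replicate_trees:
        counts.update(set(clades))  # count each group once per tree
    total = len(replicate_trees)
    return {clade: 100.0 * n / total for clade, n in counts.items()}

# Hypothetical result of building a tree from each of four bootstrap replicates.
replicate_trees = [
    {frozenset({"A", "B"}), frozenset({"C", "D"})},
    {frozenset({"A", "B"}), frozenset({"C", "D"})},
    {frozenset({"A", "B"}), frozenset({"C", "D"})},
    {frozenset({"A", "C"}), frozenset({"B", "D"})},
]

support = bootstrap_support(replicate_trees)
for clade, value in support.items():
    print(sorted(clade), value)
```

With only four replicates the percentages are coarse; real analyses typically use hundreds or thousands of replicates before comparing values against a threshold such as 70%.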
