Bioinformatics Notes

The document outlines the course BT-247: Introductory Bioinformatics, covering its definition, history, scope, applications, and the role of computers in bioinformatics. It discusses the importance of operating systems, hardware, software, and biological databases, including primary and secondary databases like GenBank and SWISS-PROT. The course emphasizes the interdisciplinary nature of bioinformatics, integrating biology, computer science, and statistics for analyzing biological data.

Uploaded by

k7260827

Course No: BT-247

Course Title: Introductory Bioinformatics


Credits: 3(2+1)
Semester: IV Theory
UNIT I

Chapter 1 Introduction to bioinformatics; Definition, History

What is Bioinformatics?
A marriage between Biology and Computers
The term bioinformatics was coined by Paulien Hogeweg, a Dutch theoretical biologist, in
conversations with her colleague Ben Hesper at the beginning of the 1970s.
Margaret Oakley Dayhoff has been called the "mother and father of bioinformatics", as she was
a pioneer in applying mathematics and computational methods to biochemistry.
• It is an interdisciplinary field that develops methods and software tools for
understanding biological data.
• Bioinformatics is the application of computer technology to the management of
biological information.
• Computers are used to gather, store, analyze and integrate biological and genetic
information which can then be applied to gene-based drug discovery and development.
• It includes biology, computer science, information engineering, mathematics and
statistics to analyze and interpret biological data.
• Bioinformatics analyses are described as in silico ("in silicon", alluding to the mass use
of silicon for computer chips), meaning "performed on a computer".
• The term parallels in vivo, in vitro, and in situ, which are commonly used in biology and
refer to experiments done in living organisms, outside living organisms, and where they are
found in nature, respectively.

History of bioinformatics
• 1951 – Pauling and Corey propose the structure for the alpha-helix and beta-sheet.
• 1953 – Watson & Crick propose the double helix model for DNA based on X-ray data
obtained by Franklin & Wilkins.
• 1955 – The sequence of the first protein to be analysed, bovine insulin, is announced by
F. Sanger.
• 1970 – The Needleman-Wunsch algorithm for sequence comparison is published.
• 1973 – The Brookhaven Protein Data Bank is announced.
• 1981 – The Smith-Waterman algorithm for sequence alignment is published.
• 1985 – The FASTP/FASTN algorithm is published.
• 1988 – National Center for Biotechnology Information (NCBI) created at NIH/NLM.
• 1990 – The BLAST program (Altschul et al.) is implemented.
• 1990 – Human Genome Project initiated.
• 2003 – Human Genome Project completed, April 2003.
Chapter 2 Development and scope of bioinformatics

Scope of bioinformatics
1. Bioinformatics is often said to offer more opportunities than biotechnology because it is
tied to computing and opens up many career options, such as software developer, biological
data manager or analyst, drug developer, and wet-lab analyst.
2. In India, relatively few people work in bioinformatics, so trained candidates may find
jobs more quickly than biotechnology students because of this shortage of expertise.
3. The scope of a bioinformatics project (or lab) varies widely depending on the balance of
math/stats, computational/software/engineering, and biological/molecular training of
the researchers involved and on the data or subject matter they deal with, particularly DNA
sequences, gene expression, proteins, epigenetics, or networks and systems biology.

Chapter 3-4 Applications of computers in bioinformatics

Computers in Biology and medicine


• Millions of base pairs of DNA sequence are known and must be analysed.
• Computers have become necessary for data:
  • Acquisition
  • Retrieval
  • Manipulation
  • Analysis
• Computers are excellent for:
  • Data manipulation
  • Data collection, storage and retrieval
  • Nucleic acid and protein analysis
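
As a small illustration of the kind of nucleic acid analysis a computer makes routine, the sketch below computes the GC content and the reverse complement of a DNA string. The function names and the example sequence are made up for illustration and do not come from the course material.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


dna = "ATGCGCTA"  # hypothetical example sequence
print(gc_content(dna))          # fraction of G/C bases
print(reverse_complement(dna))  # sequence of the opposite strand, read 5' to 3'
```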

Application of Bioinformatics
 Medical
 Agriculture
 Evolutionary studies
 Genome Sequencing
 Drug discovery and drug development
 Gene therapy- used to treat, cure or even prevent disease by changing the expression of a
person’s genes
Chapter 05- Operating systems

Operating Systems:

 An operating System is an integrated set of programs that controls the resources (the
CPU, memory, I/O devices etc.) of a computer system and provides its users with an
interface or virtual machine that is more convenient to use than the bare machine.
 An operating system is the interface between the user and the architecture
 An Operating System, or OS, is low-level software that enables a user and higher-level
application software to interact with a computer’s hardware and the data and other
programs stored on the computer.
 An OS performs basic tasks, such as recognizing input from the keyboard, sending output
to the display screen, keeping track of files and directories on the disk, and controlling
peripheral devices such as printers. An Operating System, or OS, is a software program
that enables the computer hardware to communicate and operate with the computer
software.
 Without a computer Operating System, a computer would be useless.

Examples of Operating Systems

 Windows (GUI-based, PC)


 GNU/Linux (Personal, Workstations, ISP, File, and print server, Three-tier client/Server)
 macOS (Macintosh), used for Apple’s personal computers and workstations (MacBook,
iMac).
 Android (Google’s Operating System for smartphones/tablets/smartwatches)
 iOS (Apple’s OS for iPhone, iPad, and iPod Touch)

Types of Operating System (OS)


Following are the popular types of OS (Operating System):
 Batch Operating System
 Multitasking/Time Sharing OS
 Multiprocessing OS
 Real Time OS
 Distributed OS
 Network OS
 Mobile OS
Applications of Operating System:

• Following are some of the important activities that an Operating System performs −
• Security − By means of password and similar other techniques, it prevents unauthorized
access to programs and data.
• Control over system performance − Recording delays between request for a service and
response from the system.
• Job accounting − Keeping track of time and resources used by various jobs and users.
• Error detecting aids − Production of dumps, traces, error messages, and other
debugging and error detecting aids.
• Coordination between other software and users − Coordination and assignment of
compilers, interpreters, assemblers and other software to the various users of the
computer systems.

Main functions of an operating system –

• Booting the computer
• Managing system resources (CPU, memory, storage devices, printer, etc.)
• Managing files
• Handling input and output
• Executing and providing services for application software, etc.
Chapter 6-7 Hardware, Software

Computer hardware and software

Hardware – any physical device or equipment used in or with a computer system (anything you
can see and touch).

External hardware
• External hardware devices (peripherals) – any hardware device that is located outside the
computer.
• Input device – a piece of hardware used to enter information into a computer for
processing.
• Examples: keyboard, mouse, trackpad (or touchpad), touchscreen, joystick, microphone,
light pen, webcam, scanner, speech input, etc.

• Output device – a piece of hardware that receives information from a computer and
presents it to the user.
• Examples: monitor, printer, speaker, display screen (tablet, smartphone …),
projector, headphones, etc.
Internal hardware

 Internal hardware devices (or internal hardware components) – any piece of hardware device that
is located inside the computer.
 Examples: CPU, hard disk drive, ROM, RAM, etc.

Software:
• Software – a set of instructions or programs that tells a computer what to do or how to
perform a specific task (computer software runs on hardware).
• Software includes computer programs and the apps on your phone. Video games, photo
editors, and web browsers are just a few examples.
• Main types of software – system software and application software.

Software can be categorized into two types −


System software
Application software

• System software:
• It operates directly on the hardware devices of a computer. It provides a platform to run
applications and supports user functionality. Examples of system software
include operating systems such as Windows, Linux, Unix, etc.
• Application software:
• Application software is designed for the benefit of users, to perform one or more tasks.
Examples of application software include Microsoft Word, Excel, PowerPoint, Oracle,
etc.
Chapter 8-9 Internet, www resources, FTP

Internet:
• It is a global network of computer networks.
• It comprises millions of computing devices that carry and transfer volumes of
information from one device to another.
• The Internet is a massive network of networks.
• It connects millions of computers together globally, forming a network in which any
computer can communicate with any other computer as long as they are both connected
to the Internet.
• Information that travels over the Internet does so via a variety of languages known as
protocols.

WWW:

• The World Wide Web (WWW) is an Internet-based service which uses a common set of
rules, known as protocols, to distribute documents across the Internet in a standard way.
• The World Wide Web, or simply the Web, is a massive collection of digital pages used to
access information over the Internet.
• The Web uses the HTTP protocol to transmit data and allows applications to
communicate in order to exchange business logic.
• The Web also uses browsers, such as Internet Explorer, Google Chrome, etc., to access
Web pages that are linked to each other via hyperlinks.

FTP:

• File Transfer Protocol (FTP) is a standard protocol used to transfer files from
one host computer to another over a TCP-based network, such as the
Internet.
• To use an FTP server, users need to authenticate themselves with a username and
password, but they can connect anonymously if the server is configured to allow
it.
• It is an alternative to the HTTP protocol for downloading files from and uploading files to
servers.
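
The FTP workflow described above (connect, sign in, transfer a file) can be sketched with Python's standard `ftplib` module. This is only a minimal sketch: the helper names are hypothetical, and the host `ftp.example.org` and the file path are placeholders, not real resources from these notes.

```python
from ftplib import FTP
from urllib.parse import urlparse


def split_ftp_url(url: str) -> tuple[str, str]:
    """Split an ftp:// URL into (host, remote path)."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parts.hostname, parts.path


def download_via_ftp(url: str, local_path: str,
                     user: str = "anonymous", password: str = "") -> None:
    """Download one file over FTP; anonymous sign-in is used by default."""
    host, remote_path = split_ftp_url(url)
    with FTP(host) as ftp:                      # open the control connection
        ftp.login(user=user, passwd=password)   # authenticate (or sign in anonymously)
        with open(local_path, "wb") as handle:
            # RETR asks the server to send the file; each chunk is written to disk
            ftp.retrbinary(f"RETR {remote_path}", handle.write)


# Placeholder URL: replace with a server that actually allows anonymous access.
print(split_ftp_url("ftp://ftp.example.org/pub/data.txt"))
```

Calling `download_via_ftp` needs network access and a server configured to accept the chosen sign-in, so it is left uncalled here.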
Chapter 10- 12
Biological Databases and their classification; Primary databases:
Nucleotide sequence databases (GenBank, EMBL)

Database concept
• Data is a collection of facts, such as numbers, words, measurements, observations or just
descriptions of things.
• Data contents include gene sequences, textual descriptions, attributes and classifications,
citations, and tabular data.
• A database is an organized collection of data, generally stored and accessed electronically
from a computer system.
• Biological databases are libraries of life sciences information, collected from scientific
experiments, published literature, high-throughput experiment technology, and
computational analysis.
• Information contained in biological databases includes gene function, structure,
localization (both cellular and chromosomal), clinical effects of mutations as well as
similarities of biological sequences and structures.
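
To make the idea of "an organized collection of data" concrete, the sketch below stores a few sequence records in an in-memory SQLite database and queries them. The table layout, accession numbers and sequences are invented for illustration; real biological databases use far richer schemas.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences ("
    " accession TEXT PRIMARY KEY,"   # unique identifier of each record
    " organism  TEXT,"
    " sequence  TEXT)"
)

# Made-up records, used only to show storage and retrieval.
records = [
    ("DEMO0001", "Escherichia coli", "ATGGCC"),
    ("DEMO0002", "Homo sapiens", "ATGTTTGGA"),
    ("DEMO0003", "Escherichia coli", "ATGAAATAG"),
]
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", records)

# Retrieval: accession and sequence length of every human record.
rows = conn.execute(
    "SELECT accession, LENGTH(sequence) FROM sequences"
    " WHERE organism = ? ORDER BY accession",
    ("Homo sapiens",),
).fetchall()
print(rows)
```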

Primary databases: Nucleotide sequence databases


• Primary databases are also called archival databases.
• They take information directly from experimental, laboratory-derived data such as
nucleotide sequence, protein sequence or macromolecular structure.
• Experimental results are submitted directly into the database by researchers, and the data
are essentially archival in nature.
• Once given a database accession number, the data in primary databases are never
changed: they form part of the scientific record.
• They include sequences submitted directly by scientists and genome sequencing
groups, and sequences taken from literature and patents.
• There is comparatively little error checking and there is a fair amount of redundancy.
• E.g. EMBL, GenBank, DDBJ (nucleotide sequence)
1. GenBank
• GenBank was created in 1979 at the Los Alamos National Laboratory and was called
the Los Alamos Sequence Database.
• It was renamed GenBank in 1982 and became a public database.
• GenBank sequence database is an open access, annotated collection of all publicly
available nucleotide sequences and their protein translations.
• This database is produced and maintained by the National Center for Biotechnology
Information (NCBI; a part of the National Institutes of Health in the United States) as
part of the International Nucleotide Sequence Database Collaboration (INSDC).
• GenBank and its collaborators receive sequences produced in laboratories throughout the
world from more than 300,000 distinct organisms.
• GenBank is built by direct submissions from individual laboratories, as well as from
bulk submissions from large-scale sequencing centers.

EMBL (European Molecular Biology Laboratory)


• EMBL nucleotide sequence database is maintained by the European Bioinformatics
Institute (EBI) in Hinxton, Cambridge, UK.
• EMBL was created in 1974 and is an Intergovernmental organization.
• EMBL groups and laboratories perform basic research in molecular biology and
molecular medicine as well as training for scientists, students and visitors.
• The organization aids in the development of services, new instruments and methods,
and technology in its member states.
• EMBL-EBI serves the scientific community by
• Providing freely available bioinformatics resources,
• Promoting basic research,
• Providing training to scientists at all levels and technologies to the academic community
and industry.
Chapter 13-14
Protein sequence databases; Secondary databases:
SwissProt/TrEMBL, conserved domain database, Pfam

Secondary databases
• Secondary databases are also called curated databases.
• They perform quality control and sorting of information before making it accessible to the
public.
• Secondary databases contain information derived from primary databases.
• Secondary databases store information such as conserved sequences, active site
residues, and signature sequences.
• They are highly curated, often using a complex combination of computational
algorithms and manual analysis and interpretation to derive new knowledge from the
public record of science.

SWISS-PROT
• SWISS-PROT is a curated protein sequence database which strives to provide a high
level of annotation (such as the description of the function of a protein, its domain
structure, post-translational modifications, variants, etc.), a minimal level of redundancy
and a high level of integration with other databases.
• SWISS-PROT is an annotated protein sequence database established in 1986 and
maintained collaboratively, since 1987, by the Department of Medical Biochemistry of
the University of Geneva and the EMBL Data Library.
• SWISS-PROT contains the information about the name and origin of the protein,
protein attributes, general information, sequence annotation, amino acid sequence,
references, cross-references with sequence, structure and interaction databases and entry
information.
• The core data consists of the sequences entered in common single letter amino acid
code, and the related references and bibliography.
• The taxonomy of the organism from which the sequence was obtained also forms part of
this core information.

TrEMBL (Translated EMBL)
• TrEMBL is a very large protein database in SwissProt format generated by computer
translation of the genetic information from the EMBL Nucleotide Sequence
Database.
• Computer translation is not entirely perfect, so proteins predicted by the TrEMBL
database can be hypothetical, and many TrEMBL entries are poorly annotated.
• This contrasts with SwissProt, whose entries are manually curated and reviewed.
• TrEMBL, SwissProt and the Protein Information Resource (PIR) are combined in the
UniProt project.
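
The "computer translation" behind TrEMBL can be illustrated with a minimal sketch that translates a DNA coding sequence into a protein using the standard genetic code. This is not TrEMBL's actual pipeline: the function name, the single reading frame starting at position 0, and the example sequence are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA in frame 0 until the first stop codon (or the end)."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unrecognised codon
        if amino_acid == "*":                            # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))  # hypothetical coding sequence
```

A predicted protein of this kind is only as reliable as the gene prediction behind it, which is why such entries can remain hypothetical until they are reviewed.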

Conserved Domains Database (CDD)


• Conserved Domain Database is a database of well-annotated multiple sequence
alignment models and derived database search models, for ancient domains and full-
length proteins.
• Domains can be thought of as distinct functional and structural units of a protein.
• CDD provides annotation of domain footprints and conserved functional sites on
protein sequences.
• CDD includes manually curated domain models that make use of protein 3D structure
to refine domain models and provide insights into sequence/structure/function
relationships.

Pfam
• Pfam is a database of protein families that includes their annotations and multiple
sequence alignments generated using hidden Markov models.
• Pfam 32.0, released in September 2018, contains 17,929 families.
• The general purpose of the Pfam database is to provide a complete and accurate
classification of protein families and domains.
• The Pfam website allows users to submit protein or DNA sequences to search for
matches to families in the database.
• Pfam (Protein families database of alignments and HMMs) is a large collection of
multiple sequence alignments and Hidden Markov Models covering many common
protein domains and families.
• For each family in Pfam there is a possibility to look at multiple alignments, to view
protein domain architectures, to examine species distribution, to follow links to other
databases, and to view known protein structures or domain organization of proteins.

Differences between primary and secondary databases

Sr. No | Feature | Primary database | Secondary database
1 | Synonyms | Archival database | Curated database; knowledgebase
2 | Source of data | Direct submission of experimentally derived data from researchers | Results of analysis, literature research and interpretation, often of data in primary databases
3 | Examples | ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress and GEO (functional genomics data); Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures) | InterPro (protein families, motifs and domains); UniProt Knowledgebase (sequence and functional information on proteins); Ensembl (variation, function, regulation and more layered onto whole genome sequences)
Chapter 15-16
Structure databases: Protein Data Bank (PDB), MMDB, SCOP, CATH

Protein Data Bank (PDB)


• PDB, the single global repository of experimentally determined 3D structures of
biological macromolecules and their complexes, was established in 1971, becoming the
first open-access digital resource in the biological sciences.
• Protein Data Bank (PDB) is the single worldwide archive of structural data of biological
macromolecules.
• It includes data obtained by X-ray crystallography and nuclear magnetic
resonance (NMR) spectroscopy submitted by biologists and biochemists from all over the
world.
• Presently, PDB is under the purview of the Worldwide Protein Data Bank (wwPDB), a
network of four organizations – Research Collaboratory for Structural Bioinformatics
(RCSB) PDB (USA), PDB in Europe (PDBe) (Europe), PDB Japan (PDBj) (Japan), and
the Biological Magnetic Resonance Data Bank (BMRB) (USA) – whose mission is to
“maintain a single Protein Data Bank Archive of macromolecular structural data that is
freely and publicly available to the global community.”
• Currently, more than 83,900 biological macromolecular structures have been deposited in
PDB.
• PDB (Berman et al., 2000) is the most comprehensive repository of structure data for
biological macromolecules.
• The repository contains primary and secondary structure information along
with the atomic coordinates of the constituent atoms of each biomolecule.
• PDB entries contain information about secondary structure elements: helix, strand, coil,
and turn.
• PDB is a unique resource for experimentally determined structures of proteins and their
complexes.
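
The atomic coordinates mentioned above are stored in PDB-format files as fixed-width `ATOM` records. The sketch below reads one such record by column position. The sample line is hand-made with illustrative values (it is not copied from a real entry), and the function name is hypothetical.

```python
def parse_atom_line(line: str) -> dict:
    """Extract the main fields of a fixed-width PDB ATOM record."""
    return {
        "serial": int(line[6:11]),           # atom serial number (columns 7-11)
        "atom": line[12:16].strip(),         # atom name (columns 13-16)
        "residue": line[17:20].strip(),      # residue name (columns 18-20)
        "chain": line[21],                   # chain identifier (column 22)
        "residue_number": int(line[22:26]),  # residue sequence number (columns 23-26)
        "x": float(line[30:38]),             # orthogonal coordinates in angstroms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# Illustrative record: nitrogen atom of a methionine, residue 1 of chain A.
sample = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67"
atom = parse_atom_line(sample)
print(atom["atom"], atom["residue"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

Because the format is column-based, the spacing in each line matters; splitting on whitespace can fail when neighbouring fields touch.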

Molecular Modeling Database (MMDB)


• MMDB stands for Molecular Modeling Database.
• It is also referred to as the Entrez Structure database and is provided by the NCBI.
• It contains experimentally determined 3D biomolecular structures.
• Most 3D structure data are obtained by X-ray crystallography and NMR spectroscopy.
• MMDB provides information on the biological function, on mechanisms linked to the
function, and on the evolutionary history of and relationships between macromolecules.
• MMDB contains 3D macromolecular structures, including proteins and polynucleotides
structures.
• MMDB contains over 28,000 structures and is linked to the rest of the NCBI databases,
including sequences, bibliographic citations, taxonomic classifications, and sequence and
structure neighbors.
• MMDB as part of the Entrez system facilitates access to structure data by connecting
them with associated literature, protein and nucleic acid sequences, chemicals,
biomolecular interactions, and more.
• It holds all the structures in the PDB database, but in a different file format, specified in
the Abstract Syntax Notation 1 (ASN.1) data description language.
• This format allows files of structural data to be readily compressed and exchanged
between modern computers.
• The hope is that, with this transformed data, structural scientists can design
and create software tools that allow different kinds of data, such as
structural superposition and non-atomic three-dimensional models from electron
microscopy, to be seen in a single viewing environment.

SCOP (Structural Classification of Proteins)


• The SCOP database is freely accessible on the internet.
• It was created in 1994 in the Centre for Protein Engineering and the Laboratory of
Molecular Biology.
• Maintained in the Centre for Protein Engineering at the Laboratory of Molecular Biology
in Cambridge, England.
• The SCOP database is a largely manual classification of proteins based on similarities of
their structures and amino acid sequences.
• The SCOP database describes structural and evolutionary relationships between proteins of
known structure.
• It also provides for each entry links to co-ordinates, images of the structure, interactive
viewers, sequence data and literature references.
• Proteins having the same shape and some similarity of sequence and/or function are
placed in "families", and are assumed to have a closer common ancestor.
• The classification is hierarchical: the first two levels, family and superfamily,
describe near and distant evolutionary relationships; the third, fold, describes
geometrical relationships.
• SCOP organizes proteins in a hierarchy, from class down to fold, superfamily, and
family.
• A total of ten classes are defined (only the first four of which are considered here):
all alpha, all beta, alpha and beta (α/β), alpha plus beta (α+β), multidomain,
membrane and cell-surface proteins and peptides, small proteins, peptides, designed
proteins, and non-protein structures.
• SCOP classification essentially relies on visual inspection and comparison of structures;
some automation is used for the most routine tasks, such as clustering protein chains on
the basis of sequence similarity.
• The source of protein structures is the PDB.
• The unit of classification of structure in SCOP is the protein domain.
• The shapes of domains are called "folds" in SCOP.
• Domains belonging to the same fold have the same major secondary structures in the
same arrangement with the same topological connections.
• Structural similarities of proteins at the fold level often represent favorable packing
arrangements and chain topologies, although some distant evolutionary links may exist.
• Common ancestry (i.e. homology) is more clearly defined upon classification into super
families, where proteins with similar structure and/or functional features are believed to
share a common evolutionary origin.
• Proteins with similar sequences, or very similar structures and functions that imply a
solid evolutionary link, are grouped together as families.
• Thus, members of the same family or super family within SCOP share common ancestry.

CATH
• The CATH database is a free, publicly available online resource that provides
information on the evolutionary relationships of protein domains.
• It was created in the mid-1990s, at University College London.
• It consists of both phylogenetic and phenetic descriptors for protein domain relationships.
Domains are obtained from protein structures deposited in the PDB.
• It is a hierarchical classification of protein domain structures, which clusters
proteins at four major levels:
• Class (C): domains are assigned according to their secondary structure content, i.e. all
alpha, all beta, a mixture of alpha and beta, or little secondary structure.
• Architecture (A): information on the secondary structure arrangement in
three-dimensional space is used for assignment.
• Topology (T): information on how the secondary structure elements are connected and
arranged is used.
• Homologous superfamily (H): domains are grouped at this level if there is good evidence
that they are related by evolution, i.e. they are homologous.
• The CATH database provides hierarchical classification of protein domains based on
their folding patterns.
• Domains are obtained from protein structures deposited in the Protein Data Bank and
both domain identification and subsequent classification use manual as well as automated
procedures.
• CATH offers an important tool to researchers, as proteins with even very little sequence
similarity often are both structurally and functionally related.
• At the C-level, domains are grouped according to their secondary structure content into
four categories: mainly alpha, mainly beta, mixed alpha-beta, and a fourth category
which contains domains with only few secondary structures.
• The A-level groups domains according to the general orientations of their secondary
structures.
• At the T-level, the connectivity (i.e. the order) of the secondary structures is taken into
account.
• The grouping of domains at the H-level is based on a combination of both sequence
similarity and a measure of structural similarity obtained from the dynamic programming
algorithm SSAP.
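
The four-level hierarchy can be pictured as a dotted identifier of the form C.A.T.H, where each additional number narrows the grouping. The sketch below splits such an identifier into its levels. The function name is hypothetical, and the mapping of class numbers 1 to 4 onto the four categories listed above is an assumption about CATH's numbering; `3.40.50.300` is used only as an example of the identifier format.

```python
# Assumed numbering of the four C-level categories described in the text.
CATH_CLASSES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "mixed alpha-beta",
    4: "few secondary structures",
}


def parse_cath_code(code: str) -> dict:
    """Split a C.A.T.H identifier into its four nested levels."""
    c, a, t, h = (int(part) for part in code.split("."))
    return {
        "class": str(c),
        "class_name": CATH_CLASSES.get(c, "other"),
        "architecture": f"{c}.{a}",
        "topology": f"{c}.{a}.{t}",
        "homologous_superfamily": f"{c}.{a}.{t}.{h}",
    }


levels = parse_cath_code("3.40.50.300")
for name, value in levels.items():
    print(f"{name}: {value}")
```

Two domains share a level when the corresponding prefixes of their identifiers are equal, which is what makes the classification hierarchical.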

Chapter 17
Structure databases: Retrieving information from these databases.

Accessing protein structure from PDB


PROCEDURE
1. Open the PDB from the following URL: https://www.rcsb.org/ (Fig. 6.1).

2. Enter the query in the text box by typing a PDB ID, molecule name or author
name, and click on the search button (Fig. 6.2). When a protein name is provided in the
text box, results are displayed showing data such as Molecule Name, PDB text, Structural
domains, Ontology Terms, etc. Clicking on any of the results gives the particular
information about that molecule. If no information is obtained for the given query, advanced
search can be used. A query can be defined as a request that one uses to get information
from a database.
3. From the summary page click on PDB ID 7LYJ and download the macromolecular 3D
structure in PDB format (Fig. 6.3 and 6.4).
4. Using any one of the visualizing tools (PyMOL, RasMol or Swiss-PdbViewer), open the
structure file to visualize it. You will learn about these tools in exercise number 8 of this
course.

Chapter 20-21
Introduction to sequence alignment and its applications: Pair
wise and multiple sequence alignment

Sequence Alignment
• A sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to
identify regions of similarity that may be a consequence of functional, structural, or
evolutionary relationships between the sequences.
• Sequence Alignment is a process of aligning two sequences to achieve maximum levels
of identity between them.
• This helps to derive functional, structural and evolutionary relationships between them.
• Aligning sequences helps to assign functions to unknown proteins, determine the
evolutionary relatedness of organisms and make predictions about 3D
structures.
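
The "level of identity" between two aligned sequences is usually reported as percent identity. The sketch below shows one common way to compute it, counting identical residues over the aligned columns; other definitions use a different denominator. The function name and the example alignment are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = list(zip(aligned_a, aligned_b))
    # Columns where both rows are gaps carry no information and are skipped.
    columns = [(x, y) for x, y in pairs if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# "-" marks a gap introduced by the alignment.
print(round(percent_identity("GATTACA", "GA-TACA"), 1))
```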
Similarity search is necessary for:

 Family assignment
 Sequence annotation
 Construction of phylogenetic trees
 Learn about evolutionary relationships
 Classify sequences
 Identify functions
 Homology Modeling
Pairwise alignment
• Pairwise sequence alignment methods are concerned with finding the best-matching
piecewise local or global alignments of protein (amino acid) or DNA (nucleic acid)
sequences.
• Pairwise alignments can only be used between two sequences at a time, but they are
efficient to calculate and are often used for methods that do not require extreme precision
(such as searching a database for sequences with high similarity to a query).
• The three primary methods of producing pairwise alignments are dot-matrix methods,
dynamic programming, and word methods; however, multiple sequence alignment
techniques can also align pairs of sequences.
• The purpose of pairwise alignment is to find homologues (relatives) of a gene in a
database of known examples.
• This information is useful for answering a variety of biological questions, such as:
• The identification of sequences of unknown structure or function.
• The study of molecular evolution.

Multiple Sequence Alignment (MSA)


• MSA is the alignment of three or more biological sequences (protein or nucleic acid) of
similar length.
• MSA tries to achieve maximal matching between them.
• It helps to understand phylogeny (the evolutionary history of an organism).
• It also helps to find mutations.
• The goal of MSA is to arrange a set of sequences in such a way that as many characters
as possible from each sequence are matched according to some scoring function.
• MSA is helpful for determining certain structures or locations on sequences.
• MSAs are built from homologous sequences in order to perform
phylogenetic reconstruction, protein secondary and tertiary structure analysis, and protein
function prediction analysis.
• It is used to identify new members of protein families.
• Many software tools, such as Clustal, T-Coffee, PHYLIP, MSA and MUSCLE, are used for
obtaining multiple sequence alignments.
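
One widely used example of a scoring function for a multiple alignment is the sum-of-pairs score, which adds up the pairwise scores of every column. The sketch below is a minimal version with an assumed scoring scheme (match +1, mismatch -1, gap -2) and a made-up alignment; the tools listed above use more elaborate substitution matrices and gap penalties.

```python
from itertools import combinations


def sum_of_pairs(alignment: list[str], match: int = 1,
                 mismatch: int = -1, gap: int = -2) -> int:
    """Score a multiple alignment by summing pairwise scores column by column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):              # one column = one aligned position
        for x, y in combinations(column, 2):    # every pair of sequences
            if x == "-" and y == "-":
                continue                        # gap against gap is not scored
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


msa = ["ACGT",
       "A-GT",
       "ACCT"]
print(sum_of_pairs(msa))
```

An alignment program searches for the arrangement of gaps that makes a score of this kind as high as possible.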

Multiple sequence alignment tools


• ClustalW: multiple sequence alignment
• ClustalW2: multiple sequence alignment program
• DIALIGN: local multiple sequence alignment
• Kalign (EBI): fast and accurate multiple sequence alignment
• MAFFT (EBI): multiple sequence alignment
• MUSCLE: multiple alignment server
• T-Coffee: sequence and structure multiple alignments
• T-Coffee (EBI): multiple sequence alignment program

Concept of local and global alignment


 Global alignments, which attempt to align every residue in every sequence, are most
useful when the sequences in the query set are similar and of roughly equal size. (This
does not mean global alignments cannot end in gaps.)
 A general global alignment technique is called the Needleman-Wunsch algorithm and is
based on dynamic programming.
 Local alignments are more useful for dissimilar sequences that are suspected to contain
regions of similarity or similar sequence motifs within their larger sequence context.
 The Smith-Waterman algorithm is a general local alignment method also based on
dynamic programming.
 With sufficiently similar sequences, there is no difference between local and global
alignments.
Chapter 23-25
Algorithms: Dot Matrix method, dynamic programming methods (Needleman–
Wunsch and Smith–Waterman) application of these algorithms in different
biological problems.

Methods of sequence alignment

1. Dot plot method


 In computational biology, a dot plot is a graphical method for comparing two biological
sequences and identifying regions of close similarity.
 It was introduced by Gibbs and McIntyre in 1970.
 It is a type of recurrence plot (a graph with a horizontal and a vertical axis).
 The two sequences to be compared are written along the horizontal and vertical axes, and
each residue or nucleotide of one is compared with each residue or nucleotide of the other.
 A dot is placed within the 2D graph wherever a match is found, and the unmatched
positions are left blank.
 Once the dots are placed, many of them may line up to form continuous diagonal lines, which
indicate the best alignment between the two sequences.
 A break between two contiguous diagonal lines indicates an insertion or
deletion in the sequence alignment.
 In many cases we end up with parallel diagonal lines within the matrix (typically caused
by repeated regions).
 Dot plots are two-dimensional matrices that have the sequences being
compared along the vertical and horizontal axes.
 The principle used to generate the dot plot:
o The top (x) and left (y) axes of a rectangular array represent the two
sequences to be compared.
o Matrix layout: columns = residues of sequence 1; rows = residues of sequence 2.
o A dot is plotted at every co-ordinate where the two residues match.
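The principle above can be sketched in a few lines of Python. This is a minimal illustration, not a published tool: the function names dot_plot and print_dot_plot are made up for this example, and it marks exact single-residue matches only (real dot plot programs usually add a sliding window and a threshold to reduce noise).

```python
def dot_plot(seq1, seq2):
    """Return the dot matrix as a list of strings.

    Columns correspond to residues of seq1, rows to residues of seq2.
    A '*' marks a match, a blank marks a mismatch.
    """
    return ["".join("*" if a == b else " " for a in seq1) for b in seq2]


def print_dot_plot(seq1, seq2):
    """Print the matrix with seq1 along the top and seq2 down the left side."""
    print("  " + seq1)
    for residue, row in zip(seq2, dot_plot(seq1, seq2)):
        print(residue + " " + row)


if __name__ == "__main__":
    # Two hypothetical sequences that differ at one position.
    print_dot_plot("GATTACA", "GATCACA")
```

Identical sequences give an unbroken main diagonal; a substitution leaves a one-cell hole in it, and an insertion or deletion shifts the diagonal sideways.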
2. Dynamic programming method
 It is a problem-solving method for a class of problems that can be solved by breaking
them down into simpler sub-problems.
 It finds the alignment by assigning scores to matches and mismatches (scoring
matrices).
 This method is widely used in sequence alignment problems.
 However, when the number of sequences is more than two, multi-dimensional
dynamic programming is infeasible because of the large storage and computational
requirements.
 It is a highly computationally demanding method.
 It aligns two nucleotide/protein sequences, explores all possible alignments and chooses
the best alignment (the highest-scoring alignment) as the optimal alignment.
 It is based on alignment scores.
 It uses gaps to achieve the best alignment.
 Global alignment programs are based on the Needleman-Wunsch algorithm and local alignment
programs on the Smith-Waterman algorithm.
 Both algorithms are derived from the basic dynamic programming algorithm.
• The dynamic programming matrix is processed in three steps:
1. Initialization of the matrix with the possible scores.
2. Filling the matrix with maximum scores.
3. Tracing back through the matrix to recover the alignment.
• The advantage of dynamic programming is that it is guaranteed to find an optimal
alignment for a given scoring function.
• But it is slow due to the very large number of computational steps.
• Therefore, it is difficult to use the method for very long sequences.
• Two dynamic programming algorithms are used very frequently in sequence
alignment: the Needleman-Wunsch and Smith-Waterman algorithms.
• However, identifying a good scoring function is often an empirical rather than a
theoretical matter.
Needleman-Wunsch algorithm
 The Needleman–Wunsch algorithm is used
in bioinformatics to align protein or nucleotide sequences.
 It was one of the first applications of dynamic programming to compare biological
sequences.
 The algorithm was developed by Saul B. Needleman and Christian D. Wunsch in 1970.
 It finds the optimal global
alignment of two sequences by maximizing the number of matches and
minimizing the number of gaps necessary to align the two sequences.
 Global alignment is best suited to closely related sequences of roughly the same length.
 The alignment is carried out from the beginning to the end of the sequences to find
the best possible alignment.
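The three dynamic programming steps (initialization, matrix filling, traceback) can be sketched for global alignment as follows. This is a minimal illustration under assumed scoring values (match = 1, mismatch = -1, linear gap penalty = -2); the function name needleman_wunsch and its parameters are chosen for this example, and real programs use substitution matrices such as PAM or BLOSUM and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # Step 1: initialization. The first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Step 2: matrix filling. Each cell takes the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Step 3: traceback from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Because the traceback always starts at the last cell and ends at the first, every residue of both sequences appears in the result, which is what makes the alignment global.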

Smith-Waterman algorithm
 The concept of local alignment was introduced with the Smith-Waterman algorithm,
published by Temple F. Smith and Michael S. Waterman in 1981.
 It determines similar regions between two nucleotide or protein sequences.
 The algorithm was designed to sensitively detect regions of high similarity in highly diverged
sequences.
 The S-W algorithm uses dynamic programming: it considers
alignments of any length, at any location, in either sequence, and determines the
optimal local alignment.
 Scores or weights are assigned to each character-to-character
comparison: positive for exact matches and favourable substitutions, negative for
mismatches and for insertions/deletions.
 The scores along an alignment are added together and the highest-scoring alignment is
reported.
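A local alignment differs from the global case in two ways: no cell of the matrix is allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below illustrates this under assumed scoring values (match = 2, mismatch = -1, linear gap penalty = -2); the function name smith_waterman is chosen for this example only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)

    # Fill the matrix; negative values are reset to 0 so an alignment can start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Only the shared ACG motif is reported; the dissimilar flanks are ignored.
    print(smith_waterman("TTACGTT", "CCACGCC"))
```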
Chapter 26-27
Tools of MSA: ClustalW, TCoffee; Use of these tools for MSA of DNA and
protein sequences. Save output file in phylip format.

Multiple sequence alignment (MSA)

• In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA,
RNA, or protein to identify regions of similarity that may be a consequence of functional,
structural, or evolutionary relationships between the sequences.
• Multiple sequence alignment (MSA) is a fundamental step in studies of
evolutionary, structural and functional relationships.
• It is generally used to predict the function and structure of proteins from biological
sequences.
• As next-generation sequencing methods have developed, MSA has come to play a key role
in comparing function and structure across the resulting sequence data.
• In multiple sequence alignment (MSA) we try to align three or more related sequences so
as to achieve maximal matching between them.
• The goal of MSA is to arrange a set of sequences in such a way that as many characters
from each sequence as possible are matched according to some scoring function.
• Multiple sequence alignment is quite similar to pairwise sequence alignment, but it uses
three or more sequences instead of only two.

ClustalW
• ClustalW, like the other Clustal tools, is used for aligning multiple nucleotide or protein
sequences in an efficient manner.
• ClustalW was introduced by Julie D. Thompson, Desmond G. Higgins and Toby J. Gibson
(EMBL and EBI) in 1994.
• It uses a progressive alignment method: the most closely related sequences are aligned
first, and then additional sequences and groups of sequences are added, guided by the
initial alignments, working down to the least similar sequences until a global alignment
is created.
• The sequences are aligned sequentially, guided by the phylogenetic relationships indicated by
the guide tree.
• Gap penalties can be adjusted based on specific amino acid residues, regions of
hydrophobicity, proximity to other gaps, or secondary structure.
• ClustalW is faster than T-Coffee, but T-Coffee is more accurate, especially when
sequences share less than 30% identity.
CLUSTALW algorithm
1. Calculate all possible pairwise alignments; record the score for each pair.
2. Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining).
3. Find the two most closely related sequences and align them first; then add the remaining
sequences and groups of sequences in the order given by the guide tree.

Consensus Symbols:

"*" means that the residues or nucleotides in that column are identical in all sequences, in the
alignment.

":" means that conserved substitutions have been observed, according to the COLOUR table
below.
"." means that semi-conserved substitutions are observed, i.e., amino acids having similar shape.
Conserved means the amino acid is replaced by one having similar characteristics.
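The consensus line can be reproduced with a short Python sketch. The "strong" and "weak" residue groups below are assumed to match the groups listed in the ClustalW documentation and should be checked against the help text of the version in use; the function name consensus_line is chosen for this example. Columns containing a gap are left blank here, which is a simplification.

```python
# Residue groups assumed from the ClustalW documentation (verify for your version).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def consensus_line(aligned):
    """Return the Clustal-style consensus symbols for aligned protein sequences."""
    symbols = []
    for column in zip(*aligned):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")            # gap in the column: no symbol
        elif len(residues) == 1:
            symbols.append("*")            # identical in all sequences
        elif any(residues <= set(group) for group in STRONG):
            symbols.append(":")            # conserved substitution
        elif any(residues <= set(group) for group in WEAK):
            symbols.append(".")            # semi-conserved substitution
        else:
            symbols.append(" ")
    return "".join(symbols)


if __name__ == "__main__":
    alignment = ["MKSC", "MRTS", "MKAC"]   # hypothetical aligned sequences
    for seq in alignment:
        print(seq)
    print(consensus_line(alignment))
```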

TCoffee

• T-Coffee stands for Tree-based Consistency Objective Function for Alignment
Evaluation.
• It is a multiple sequence alignment (MSA) program that uses a progressive approach.
• It compares all the sequences two by two, producing a global alignment and a series of
local alignments (using Lalign) for each pair.
• These pairwise alignments form a library that guides the multiple sequence alignment.
• In the progressive strategy, a guide tree is first constructed from the sequences, and the
alignment is then built up following their order in the tree.
• All these alignments are then combined into a multiple alignment.
• T-Coffee can align up to about 200 sequences of about 1000 aa each in roughly 20 minutes.
• It can also combine multiple sequence alignments obtained previously and, in the latest
versions, can use structural information from PDB files (3D-Coffee).
• It has advanced features to evaluate the quality of alignments and some capacity for
identifying the occurrence of motifs.
• T-Coffee aligns nucleic acid (DNA and RNA) and protein sequences alike.
• T-Coffee can also use other types of information, such as secondary/tertiary structure
information (for protein or RNA sequences with a known or predicted structure), sequence
profiles and trees.
• It produces alignments in the aln (Clustal) format by default, but can also produce PIR,
MSF and FASTA formats.
• The most common input formats are supported (FASTA, PIR).
• Overall, T-Coffee is much more accurate than ClustalW.
• T-Coffee (default mode) is slower than ClustalW.
Chapter 28-30
Phylogeny; terminologies in phylogeny, applications, and methods of
phylogenetic analysis

What is Phylogeny?

• Phylogeny is the representation of the evolutionary history and relationships between groups of
organisms.
• It results in a phylogenetic tree that provides a visual output of relationships based on shared or
divergent physical and genetic characteristics.
• Biologists estimate that there are about 5 to 100 million species of organisms living on Earth
today.
• Evidence from morphological, biochemical, and gene sequence data suggests that all organisms
on Earth are genetically related, and the genealogical relationships of living things can be
represented by a vast evolutionary tree, the Tree of Life.
• The Tree of Life thus represents the phylogeny of organisms, i.e., the history of organism
lineages as they change through time.
• It implies that different species arise from previous forms via descent, and that all organisms,
from the smallest microbe to the largest plants and vertebrates,
are connected by the passage of genes along the branches of the phylogenetic tree that links all of
Life.

"Tree" Facts: Terminology

Terminology:

 Node: a node represents a taxonomic unit. This can be a taxon (an existing species) or an ancestor
(unknown species: represents the ancestor of 2 or more species).
 Branch: defines the relationship between the taxa in terms of descent and ancestry.
 Topology: is the branching pattern.
 Branch length: often represents the number of changes that have occurred in that branch.
 Root: is the common ancestor of all taxa.
 Distance scale : scale which represents the number of differences between sequences (e.g. 0.1
means 10 % differences between two sequences)
Methods of phylogenetic analysis

• There are two major groups of analyses to examine phylogenetic relationships between sequences
• Phenetic methods : trees are calculated by similarities of sequences and are based on distance
methods.
• The resulting tree is called a dendrogram.
• Distance methods compress all of the individual differences between pairs of sequences into a
single number.
• Cladistic methods: trees are calculated by considering the various possible pathways of
evolution and are based on parsimony or likelihood methods.
• The resulting tree is called a cladogram.
• Cladistic methods use each alignment position as evolutionary information to build a tree.
Phenetic methods based on distances:
 Starting from an alignment, pairwise distances are calculated between DNA sequences as the
sum of all base pair differences between two sequences (the most similar sequences are assumed
to be closely related). This creates a distance matrix.
 All base changes can be considered equally or a matrix of the possible replacements can be used.
 Insertions and deletions are given a larger weight than replacements. Insertions or deletions of
multiple bases at one position are given less weight than multiple independent insertions or
deletions.
 It is possible to correct for multiple substitutions at a single site.
 From the obtained distance matrix, a phylogenetic tree is calculated with clustering algorithms.
These cluster methods construct a tree by linking the least distant pair of taxa, followed by
successively more distant taxa.
 UPGMA clustering (Unweighted Pair Group Method using Arithmetic averages) : this is the
simplest method
 Neighbor joining: this method tries to correct the UPGMA method for its assumption that the
rate of evolution is the same in all taxa.
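The distance-based workflow above (alignment, distance matrix, clustering) can be sketched as follows. This is a simplified illustration: the distance is the plain proportion of differing positions with no correction for multiple substitutions and no special weighting of gaps, the tree is returned as nested tuples without branch lengths, and the function names p_distance, distance_matrix and upgma are chosen for this example.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing positions between two aligned sequences."""
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def distance_matrix(seqs):
    """Pairwise distance matrix for a list of aligned sequences."""
    return [[p_distance(a, b) for b in seqs] for a in seqs]


def upgma(names, matrix):
    """Cluster taxa by UPGMA; returns the tree topology as nested tuples."""
    clusters = [(name, 1) for name in names]   # (subtree, number of leaves)
    d = [row[:] for row in matrix]
    while len(clusters) > 1:
        n = len(clusters)
        # Link the least distant pair of clusters.
        pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
        i, j = min(pairs, key=lambda p: d[p[0]][p[1]])
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance to the merged cluster is the size-weighted arithmetic average.
        merged = [(d[i][k] * size_i + d[j][k] * size_j) / (size_i + size_j)
                  for k in keep]
        d = [[d[r][c] for c in keep] + [merged[pos]] for pos, r in enumerate(keep)]
        d.append(merged + [0.0])
        clusters = [clusters[k] for k in keep] + [((tree_i, tree_j), size_i + size_j)]
    return clusters[0][0]


if __name__ == "__main__":
    names = ["A", "B", "C"]
    seqs = ["AAAA", "AAAT", "TTTT"]            # hypothetical aligned sequences
    print(upgma(names, distance_matrix(seqs)))
```

In this toy data set A and B differ at one position out of four, so they are joined first, and C is attached afterwards.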

Cladistic methods based on Parsimony:


 For each position in the alignment, all possible trees are evaluated and are given a score based
on the number of evolutionary changes needed to produce the observed sequence changes.
 The most parsimonious tree is the one with the fewest evolutionary changes for all sequences
to derive from a common ancestor.
 This is a more time-consuming method than the distance methods.
Cladistic methods based on Maximum Likelihood:
 This method also uses each position in an alignment, evaluates all possible trees, and calculates
the likelihood for each tree using an explicit model of evolution (whereas parsimony just looks for
the fewest evolutionary changes).
 The likelihoods for each aligned position are then multiplied to give the likelihood of each tree.
 This is the slowest method of all but seems to give the best result and the most information about
the tree.

Maximum parsimony
• The maximum parsimony method minimizes the number of changes on a phylogenetic tree by
assigning character states to interior nodes on the tree.
• The character (or site) length is the minimum number of changes required for that site, whereas
the tree score is the sum of character lengths over all sites.
• Some sites are not useful for tree comparison by parsimony.
• For example, constant sites, for which the same nucleotide occurs in all species, have a character
length of zero on any tree.
• Singleton sites, at which only one of the species has a distinct nucleotide, whereas all others are
the same, can also be ignored, as the character length is always one.
• The parsimony-informative sites are those at which at least two distinct characters are observed,
each at least twice.
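The three kinds of site described above can be told apart by counting the characters in each alignment column. The sketch below is a minimal illustration; the function names classify_site and classify_sites and the category labels are chosen for this example.

```python
from collections import Counter


def classify_site(column):
    """Classify one alignment column for parsimony analysis."""
    counts = Counter(column)
    if len(counts) == 1:
        return "constant"                      # same character in all species
    if sum(1 for n in counts.values() if n >= 2) >= 2:
        return "informative"                   # two or more characters, each seen twice or more
    if len(counts) == 2 and min(counts.values()) == 1:
        return "singleton"                     # one species differs from all the others
    return "uninformative"                     # variable, but not parsimony-informative


def classify_sites(alignment):
    """Classify every column of a list of aligned sequences."""
    return [classify_site(column) for column in zip(*alignment)]


if __name__ == "__main__":
    alignment = ["AAAA",                       # hypothetical alignment of four species
                 "AAAC",
                 "ATTG",
                 "ATAT"]
    print(classify_sites(alignment))
```

Only the columns labelled "informative" can favour one tree topology over another under parsimony.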
Phylogenetic tree construction software

 Phylogeny.fr is a simple-to-use web service dedicated to reconstructing and analysing
phylogenetic relationships between molecular sequences. It includes multiple alignment
(MUSCLE, T-Coffee, ClustalW, ProbCons), tree viewers (Drawgram, Drawtree) and utility
programs (e.g. Gblocks to eliminate poorly aligned positions and divergent regions). It runs and
connects various bioinformatics programs to reconstruct a robust phylogenetic tree from a set of
sequences.
 MEGA (Molecular Evolutionary Genetics Analysis) is a bioinformatics tool used for the analysis
of molecular sequences to measure evolutionary distance for the construction of phylogenies.
 Clustal Omega is a multiple sequence alignment program for aligning three or more sequences
together in a computationally efficient and accurate manner. It produces biologically meaningful
multiple sequence alignments of divergent sequences. Evolutionary relationships can be seen by
viewing cladograms or phylograms.
 Blast-explorer helps you build datasets for phylogenetic analysis.

What are the applications of phylogenetic analysis?

• Phylogenetics has many applications in medical and biological fields, including:
• Forensic science,
• Conservation biology,
• Epidemiology,
• Drug discovery and drug design,
• Prediction of protein structure and function, and gene function prediction.
• To study the relationship between genomes of different species.
• To predict genes (gene finding), which means locating specific genetic regions along a genome.
• It can help identify closely related members of a species with pharmacological significance.
• To identify and classify various microorganisms, including bacteria.
Chapter 31-32
Introduction to BLAST and FASTA. Different BLAST Programmes: their
application in terms of nucleic acid and protein sequence. Significance of E
Value.

Introduction to BLAST
• BLAST is one of the most widely used bioinformatics programs for sequence searching.
• The algorithm it uses is much faster than other approaches, such as calculating an optimal
alignment.
• This emphasis on speed is vital to making the algorithm practical on the huge genome
databases currently available.
• Before BLAST, FASTA was developed by David J. Lipman and William R. Pearson in
1985.
• Before fast algorithms such as BLAST and FASTA were developed, database
searches for protein or nucleic acid sequences were very time consuming because a full
alignment procedure (e.g., Smith-Waterman) was used.
• Although BLAST is faster than any Smith-Waterman implementation in most cases, it cannot
"guarantee the optimal alignments of the query and database sequences" as the
Smith-Waterman algorithm does.
• BLAST is more time-efficient than FASTA because it searches only for the more significant
patterns in the sequences, yet with comparable sensitivity.
• Examples of questions that researchers use BLAST to answer are:
• Which bacterial species have a protein that is related in lineage to a certain protein with
known amino-acid sequence?
• What other genes encode proteins that exhibit structures or motifs such as ones that have
just been determined?

There are different BLAST programs available.


1. Nucleotide-nucleotide BLAST (blastn)
This program, given a DNA query, returns the most similar DNA sequences from the
DNA database.
2. Protein-protein BLAST (blastp)
This program, given a protein query, returns the most similar protein sequences from the
protein database.
3. Position-Specific Iterative BLAST (PSI-BLAST) (blastpgp)
This program is used to find distant relatives of a protein. First, a list of all closely related
proteins is created. These proteins are combined into a general "profile" sequence, which
summarises significant features present in these sequences. A query against the protein
database is then run using this profile, and a larger group of proteins is found. This larger
group is used to construct another profile, and the process is repeated.
4. Nucleotide 6-frame translation-protein (blastx)
This program compares the six-frame conceptual translation products of a nucleotide
query sequence (both strands) against a protein sequence database.
5. Large numbers of query sequences (megablast)
When comparing large numbers of input sequences via the command-line BLAST,
"megablast" is much faster than running BLAST multiple times. It concatenates many
input sequences together to form a large sequence before searching the BLAST database,
then post-analyzes the search results to glean individual alignments and statistical values.
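The six-frame conceptual translation that blastx applies to a nucleotide query can be illustrated with a short Python sketch. This is not BLAST code: the function names are chosen for this example, the standard genetic code is assumed, "*" marks a stop codon, and codons containing characters other than A, C, G and T are written as "X".

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built from the compact TCAG-ordered table above.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six conceptual translations, keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCATTGTAATGGGCCGCTGA").items():
        print(name, protein)
```

Each of the six protein strings would then be compared with the protein database, which is why blastx can find coding regions without knowing the strand or reading frame in advance.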

BLAST variants for different searches

Program   Query     Database   Comparison      Common use

blastn    DNA       DNA        DNA level       Seek identical DNA sequences and splicing patterns
blastp    Protein   Protein    Protein level   Find homologous proteins
blastx    DNA       Protein    Protein level   Analyze new DNA to find genes and seek homologous proteins
tblastn   Protein   DNA        Protein level   Search for genes in unannotated DNA
tblastx   DNA       DNA        Protein level   Discover gene structure
BLAST Programmes application in terms of nucleic acid and protein sequence
• BLAST can be used for several purposes. These include:
• Identifying species: with BLAST you can often identify a species or find homologous
species. This is useful, for example, when you are working with a DNA sequence
from an unknown species.
• Locating domains: when working with a protein sequence, you can input it into BLAST to
locate known domains within the sequence of interest.
• Establishing phylogeny: using the results received through BLAST, you can create a
phylogenetic tree using the BLAST web page. Phylogenies based on BLAST alone are less
reliable than other purpose-built computational phylogenetic methods, so they should only
be relied upon for "first pass" phylogenetic analyses.
• DNA mapping: when working with a known species and looking to sequence a gene at an
unknown location, BLAST can compare the chromosomal position of the sequence of interest
to relevant sequences in the database(s).
• Comparison: when working with genes, BLAST can locate common genes in two related
species and can be used to map annotations from one organism to another.
FASTA format
• In bioinformatics, the FASTA format is a text-based format for representing either
nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino
acids are represented using single-letter codes.
• The format also allows for sequence names and comments to precede the sequences.
• In the original format, a sequence was represented as a series of lines, each of which was
no longer than 120 characters and usually did not exceed 80 characters.
• Because it is text-based, it can be read and written with a text editor or word
processor.
• The description line of a FASTA record starts with the '>' symbol, followed by the gi and
accession number and then the description, all on a single line.
• The sequence starts on the next line, typically with 60 nucleotides/amino
acids per line.
• DNA and protein sequences are written in the one-letter IUPAC nucleotide and amino
acid codes.
• The FASTA program, from which the format takes its name, is a pairwise sequence alignment
tool that takes a nucleotide or protein sequence as input and compares it with the sequences
in existing databases.
• It can carry out a sequence similarity search of protein and
nucleotide sequences against the databases.
• It finds local similarity between the sequences and calculates the statistical
significance of matches.
• It can also be used to find functional and evolutionary relationships between the
sequences.
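The layout of a FASTA file (a '>' description line followed by sequence lines) is simple enough to read and write with a few lines of Python. The sketch below is a minimal illustration; the function names parse_fasta and format_fasta are chosen for this example, and the record shown is made up rather than taken from a database.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                           # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []      # drop the '>' symbol
        elif header is not None:
            chunks.append(line)                # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record with the sequence wrapped at `width` characters per line."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


if __name__ == "__main__":
    example = ">seq1 hypothetical example record\nATGGCCATTG\nTAATGGGCCG\n"
    print(parse_fasta(example))
    print(format_fasta("seq1 hypothetical example record", "ACGT" * 40))
```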
Variants of FastA
 FASTA: compares a DNA query sequence to a DNA database, or a protein query to a
protein database, detecting the sequence type automatically. Versions 2 and 3 are in
common use; version 3 has a highly improved score normalization method, which
significantly reduces the overlap between the score distributions.
 FASTX: compares a DNA query to a protein database. It may introduce gaps only
between codons.
 FASTY: compares a DNA query to a protein database, optimizing gap location, even
within codons.
 TFASTA: compares a protein query to a DNA database.
Significance of E Value
 The Expect value (E) is a parameter that describes the number of hits one can "expect" to
see by chance when searching a database of a particular size.
 It decreases exponentially as the Score (S) of the match increases.
 Essentially, the E value describes the random background noise.
 In principle, an E-value lower than 0.05 can be considered a statistically significant hit.
 In general, a lower E-value indicates a better-quality search/alignment/comparison.
 The smaller the E-value, the better the match.
 It is preferred over the score value because the E-value is less sensitive to sequence length.
 However, in practice one usually applies even more stringent E-value cut-offs.
 A hit may have a very low E-value and still be a false positive.
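The exponential relationship between score and E value can be made concrete with the formula given in the NCBI BLAST statistics tutorial, E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below is an illustration only: the function name expect_value and the example lengths are assumptions, and BLAST itself uses "effective" lengths that are somewhat shorter than the raw lengths.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score.

    Uses E = m * n * 2**(-S'), with raw lengths standing in for the
    effective lengths that BLAST actually uses.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: 300-residue query against a 10-million-residue database.
    for bits in (20, 30, 40, 50, 60):
        print(bits, expect_value(bits, 300, 10_000_000))
```

Each additional bit of score halves the E value, and searching a database twice as large doubles it, which is why the same alignment can be significant in a small database and not in a large one.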

Is a high E value good?


 The smaller the E-value, the better the match. BLAST hits with an E-value smaller than
1e-50 are database matches of very high quality. BLAST hits with an E-value smaller than
0.01 can still be considered good hits for homology matches.
Spell out

NCBI- National Center for Biotechnology Information
DDBJ- DNA Data Bank of Japan
EMBL- European Molecular Biology Laboratory
TrEMBL- Translated European Molecular Biology Laboratory
INSDC- International Nucleotide Sequence Database Collaboration
Pfam- Protein family database
PDB- Protein Data Bank
MMDB- Molecular Modeling Database
SCOP- Structural Classification of Proteins
CATH- Class, Architecture, Topology and Homology
ASN.1- Abstract Syntax Notation One
XML- Extensible Markup Language
BLAST- Basic Local Alignment Search Tool
FASTA- FAST-All
UniProt- Universal Protein Resource
SRS- Sequence Retrieval System
ARSA- All-round Retrieval of Sequence and Annotation
APSSP- Advanced Protein Secondary Structure Prediction
MSA- Multiple Sequence Alignment
ExPASy- Expert Protein Analysis System
