0% found this document useful (0 votes)
9 views61 pages

CSC215 Complete Note

The course GOU-CSC215: Algorithms for Bioinformatics aims to teach students key algorithms and computational techniques for analyzing biological data, focusing on DNA, RNA, and protein sequences. Students will learn about algorithm design, performance analysis, and the implementation of bioinformatics algorithms, with a curriculum covering topics like sequence alignment, phylogenetic trees, and machine learning in bioinformatics. By the end of the course, students will develop problem-solving skills and the ability to analyze biological data using these algorithms.

Uploaded by

Vanity538
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views61 pages

CSC215 Complete Note

The course GOU-CSC215: Algorithms for Bioinformatics aims to teach students key algorithms and computational techniques for analyzing biological data, focusing on DNA, RNA, and protein sequences. Students will learn about algorithm design, performance analysis, and the implementation of bioinformatics algorithms, with a curriculum covering topics like sequence alignment, phylogenetic trees, and machine learning in bioinformatics. By the end of the course, students will develop problem-solving skills and the ability to analyze biological data using these algorithms.

Uploaded by

Vanity538
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 61

COURSE GOU-CSC215: Algorithms for Bioinformatics

(First Semester 2024/2025 Academic Year)


Njoku Camillus E.
Introduction:
This course introduces key algorithms and computational techniques used in the analysis of
biological data, especially DNA, RNA, and protein sequences. The focus is on understanding how
computer science principles can solve biological problems, particularly in genomics, proteomics,
and systems biology. Students will learn about algorithmic design, performance analysis, and
biological sequence data handling, using fundamental bioinformatics algorithms.
Course Aim
The aim of the course is simple. The course aims to provide you with an understanding of
Algorithms for Bioinformatics.
Learning outcomes
By the end of this course, students should gain understanding of the following
1. Basic concepts of bioinformatics and its applications.
2. Learn key algorithms used for solving biological problems.
3. Implement bioinformatics algorithms in computational tools.
4. Analyze and interpret biological data using algorithms.
5. Develop problem-solving skills using algorithmic approaches in bioinformatics.
6: Machine learning in Bioinformatics

Course Outline
Module 1: Bioinformatics and Algorithms
1.1 Introduction to Algorithms for Bioinformatics
1.2 What is bioinformatics
1.3 Role of Computer Science in Bioinformatics
1.4 Software tools commonly used in bioinformatics (e.g., BLAST, FASTA)
1.5 Applications of Bioinformatics Algorithms
Module 2: DNA, RNA, and Protein Sequence Data
2.1 Overview of Biological Sequences
2.2 Representation of DNA, RNA, and Protein Sequences

1
2.3 Central Dogma of Molecular Biology
2.4 FASTA and Other Sequence File Formats
2.5 Overview of Nucleotide and Protein Databases
Module 3: Algorithmic Problem Solving
3.1 Introduction to algorithms: Design, analysis, and complexity
3.2 Algorithmic paradigms: Brute force, divide-and-conquer, greedy algorithms, dynamic
programming
3.3 Time and space complexity (Big O notation)
3.4 Application of algorithms to biological problems
Module 4: Sequence Alignment Algorithms
4.1 Biological motivation: Importance of sequence alignment in genomics
4.2 Global alignment (Needleman-Wunsch algorithm)
4.3 Local alignment (Smith-Waterman algorithm)
4.4 Scoring matrices: PAM, BLOSUM
Module 5: Multiple Sequence Alignment
5.1 Biological significance of multiple sequence alignment (MSA)
5.2 Progressive alignment algorithms (e.g., CLUSTLAW)
5.3 Iterative methods and refinement (e.g., MUSCLE)
5.4 Advanced sequence alignment Algorithms (protein sequence alignment)
Module 6: Heuristics for Sequence Alignment
6.1 Challenges in aligning large datasets
6.2 Heuristic methods for rapid alignment: BLAST and FASTA algorithms
6.3 Trade-offs between accuracy and performance in heuristic algorithms
6.4 Case studies: Using BLAST in genomic research
Module 7: Hidden Markov Models (HMM) in Bioinformatics
7.1 Overview of Hidden Markov Models, Markov processes and biological sequences
7.2 Components of HMMs: States, transitions, emissions
7.3 Probability estimation in HMM, finding maximum assignment
7.4 Application of HMMs in biological sequence modeling (gene prediction, protein family
classification)

2
Module 8: Phylogenetic Trees
8.1 Introduction to tree structures: Binary trees, rooted vs unrooted trees
8.2 Tree construction methods: Distance-based (UPGMA, Neighbor-Joining) and character-based
methods (maximum parsimony, maximum likelihood)
8.3 Algorithms for Phylogenetic Tree Construction (Hierarchical clustering algorithms
(UPGMA), Neighbor-joining algorithm)
8.4 Applications of phylogenetics in comparative genomics
Module 9: Machine Learning in Bioinformatics
9.1 Introduction to Deep Learning ((Basic concepts, MLP), CNN, Recurrent NN, LSTM, ResNet)
9.2 Deep Learning in genomics & Protein Structure

Textbooks:
“Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology” by
Dan Gusfield
“Bioinformatics Algorithms: An Active Learning Approach” by Phillip Compeau and Pavel
Pevzner

3
LECTURE ONE
MODULE 1: BIOINFORMATICS AND ALGORITHMS
1.1 Introduction to Algorithms for Bioinformatics
1.2 What is bioinformatics
1.3 Role of Computer Science in Bioinformatics
1.4 Software tools commonly used in bioinformatics (e.g., BLAST, FASTA)
1.5 Applications of Bioinformatics Algorithms
Learning Objectives:
• Understand the scope of bioinformatics and its importance in modern biology.
• Learn basic bioinformatics algorithmic and role of computer science in bioinformatics
• Explain areas where bioinformatics algorithms are applied.

1.1 Introduction to Algorithms for Bioinformatics


In bioinformatics, an algorithm is a step-by-step procedure or set of rules designed to solve
computational problems related to biological data. These algorithms analyze and interpret vast
amounts of biological information, such as DNA sequences, protein structures, or gene
expressions. Bioinformatics algorithms are computational methods that are essential for analyzing
biological data, particularly from genomics, proteomics, and other high-throughput biological
fields. Below is a list of common bioinformatics algorithms with examples of each:
1. Sequence Alignment Algorithms: These algorithms compare two or more sequences to identify
regions of similarity that may indicate functional, structural, or evolutionary relationships.
Examples:
i. Needleman-Wunsch Algorithm (Global Alignment): Used to align entire sequences.
Example: Aligning two DNA sequences to find the best match across their lengths.
ii. Smith-Waterman Algorithm (Local Alignment): Focuses on aligning the most similar
subsequences, for instance identifying conserved protein domains within larger sequences.
iii. BLAST (Basic Local Alignment Search Tool): Rapidly compares a query sequence
against a database to find similar sequences. Example: Searching a DNA sequence in
GenBank to find homologous sequences.
2. Hidden Markov Models (HMM): Used for pattern recognition in biological sequences, such as
identifying conserved sequence motifs. Examples:

4
i. HMMER: A software suite that uses HMMs for sequence analysis, particularly in detecting
protein families and domains. Example: Searching for homologous proteins across
different species.
ii. Profile HMMs: Used to align protein or DNA sequences to a probabilistic model of a
sequence family. Example: Finding similar protein motifs in different organisms.
3. Gene Prediction Algorithms: These algorithms predict the locations of genes within a DNA
sequence. Examples: GENSCAN (A program for the locations of protein-coding genes in human
DNA sequences), Augustus (gene annotation in plant geneomes).
4. Phylogenetic Tree Construction Algorithms: These algorithms infer evolutionary relationships
between sequences or species based on sequence data. Examples:
i. Neighbor-Joining Algorithm: A distance-based method for constructing phylogenetic
trees.
ii. Maximum Likelihood (ML) Method: Estimates the tree that best explains the observed
data. Example: Generating a phylogenetic tree for viral strains.
iii. UPGMA (Unweighted Pair Group Method with Arithmetic Mean): A simple clustering
method for creating phylogenetic trees. Example: Grouping bacterial strains based on
genetic similarity.
5. Clustering Algorithms: Group data points (such as gene expression data) into clusters based on
their similarities. Examples:
i. K-Means Clustering: Partitions data into k clusters based on similarities. Example:
Clustering genes with similar expression profiles in microarray data.
ii. Hierarchical Clustering: Builds a hierarchy of clusters. Example: Grouping gene
expression data to identify co-expressed genes in a dataset.
iii. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters data
points based on density. Example: Clustering single-cell RNA sequencing data to identify
cell subtypes.
6. Genome Assembly Algorithms: These algorithms reconstruct a genome from short DNA
sequence reads. Examples:
i. De Bruijn Graph Algorithms: Used to assemble large genomes from short-read sequencing
data. Example: Assembling the genome of a bacterial species from Illumina reads.

5
ii. Overlap-Layout-Consensus (OLC) Algorithms: Used for long-read sequencing data.
Example: Assembling a yeast genome from PacBio sequencing reads.
iii. SPAdes: A popular assembler that uses de Bruijn graphs for assembling bacterial genomes.
Example: Assembling a microbial genome from short sequencing reads.
7. Motif Finding Algorithm: Identifies recurring patterns (motifs) in sequences that may be
biologically significant, such as transcription factor binding sites. Examples:
i. MEME (Multiple EM for Motif Elicitation): Identifies conserved motifs in DNA or protein
sequences. Example: Finding regulatory motifs in promoter regions of co-expressed genes.
ii. Gibbs Sampling: A probabilistic method for motif finding. Example: Identifying
transcription factor binding sites in gene regulatory networks.
iii. AlignACE: A tool for identifying motifs shared by a set of DNA sequences. Example:
Discovering new motifs in bacterial promoters.
8. Gene Ontology (GO) Enrichment Algorithms: Analyzes gene lists to identify over-represented
biological terms or functions. Examples: GOseq (Corrects for gene length bias in RNA-Seq GO
enrichment analysis, eg. Identifying biological processes enriched in differentially expressed
genes).

1.2 What is Bioinformatics?


Bioinformatics is an interdisciplinary field that combines computer science, biology, and
mathematics to analyze biological data. With the growing volume of biological data from DNA
sequencing, protein studies, and genomics, bioinformatics helps make sense of these large datasets
through computational methods.
It combines biology, computer science, and mathematics to store, process, and analyze biological
information, particularly in genomics and proteomics.
Bioinformatics involves developing software and algorithms to manage and study complex
biological data for applications in fields like genetics, molecular biology, and drug discovery.

1.3 Role of Computer Science in Bioinformatics


Computer science plays a crucial role in bioinformatics by providing the tools, techniques, and
computational power needed to process and analyze large biological datasets. Here are the key
roles of computer science in bioinformatics:

6
1. Data Management and Storage
Biological research generates massive amounts of data, such as DNA sequences, protein structures,
and gene expression profiles. Computer science provides databases and data management systems
to store, organize, and retrieve this information efficiently. Examples include databases like
GenBank and Protein Data Bank (PDB).
2. Algorithm Development
Computer science develops algorithms to solve complex bioinformatics problems such as:
Sequence alignment: (e.g., BLAST) for comparing genetic sequences; Genome assembly which
are algorithms to reconstruct genomes from DNA fragments; and Protein structure prediction:
algorithms to understand protein folding and interactions.
3. Machine Learning and AI
Machine learning and artificial intelligence (AI) are used in bioinformatics to identify patterns and
make predictions from biological data. For example, AI can predict disease risk based on genetic
information, classify protein structures, or even assist in drug discovery by analyzing molecular
structures and biological pathways.
4. Data Analysis and Visualization
Bioinformatics involves analyzing complex data, such as gene expression data or protein
interactions. Computer science provides tools for statistical analysis and visualizations that help
researchers understand biological processes, such as creating heat maps, 3D protein models, or
phylogenetic trees.
5. High-Performance Computing (HPC)
Many bioinformatics tasks, such as genome sequencing or protein folding simulations, require
significant computational power. High-performance computing (HPC) systems, cloud computing,
and parallel processing frameworks allow bioinformaticians to process massive datasets and run
complex simulations in a reasonable amount of time.
6. Database Search and Retrieval
Computer science enables the development of fast and efficient search algorithms for querying
biological databases. For instance, searching for similar DNA or protein sequences in a large
database requires optimized search techniques, such as indexing and hashing algorithms.

7
7. Data Integration
Bioinformatics often requires integrating data from various sources, such as genomic, proteomic,
and clinical data. Computer science provides methods and frameworks for integrating
heterogeneous data types, ensuring that researchers can analyze comprehensive datasets for better
insights into biological systems.
8. Modeling and Simulation
Computer science helps in creating models and simulations of biological systems. For example:
Systems biology uses computational models to understand complex interactions within cells and
organisms. Drug interaction simulations predict how molecules interact with biological targets.
9. Software Development
Many bioinformatics tools and platforms are developed by computer scientists. Software packages
like MATLAB, Bioconductor, and R are widely used in bioinformatics for statistical analysis, data
mining, and visualization. Python and other programming languages are often used to develop
custom tools and pipelines.
10. Cybersecurity in Biomedical Data
With the increasing importance of patient genomic data and health records, computer science
provides cybersecurity solutions to protect sensitive biological information from unauthorized
access or breaches. This includes encryption, secure data sharing, and privacy-preserving
computation techniques.
11. Cloud Computing and Distributed Systems
Cloud computing allows researchers to access bioinformatics tools and datasets remotely,
providing scalable resources for large-scale analyses. Distributed systems enable collaboration
between research groups, allowing for the sharing of data and computational tasks across networks.

1.4 Software tools commonly used in bioinformatics


i. BLAST (Basic Local Alignment Search Tool): a tool for comparing an input sequence
(query) with a database of sequences to find regions of similarity. Widely used for
identifying homologous genes or proteins.
ii. FASTA: another alignment tool that compares nucleotide or protein sequences by finding
regions of local or global similarity.

8
iii. Clustal Omega: a tool for multiple sequence alignment, useful for aligning sequences to
study evolutionary relationships.
iv. SPAdes (St. Petersburg Assembler): a genome assembler particularly used for small
genomes such as bacterial genomes.
v. MAKER: a genome annotation pipeline used to predict genes and align them to known
data.
vi. Augustus: it is a gene prediction tool used to annotate genomes based on available evidence
or training.
vii. MEGA (Molecular Evolutionary Genetics Analysis): A tool for constructing phylogenetic
trees based on sequence data.
viii. MrBayes: it is a program for Bayesian inference of phylogeny that uses Markov Chain
Monte Carlo methods.
ix. RAxML (Randomized Axelerated Maximum Likelihood): software for fast and accurate
phylogenetic tree estimation.
x. PyMOL: A molecular visualization tool used for 3D visualization of protein structures.
xi. SWISS-MODEL: A tool for homology modeling, predicting the 3D structure of a protein
based on homologous templates.
xii. MODELLER: A program that generates 3D models of proteins using comparative
modeling techniques.
xiii. AutoDock: a suite of tools for performing molecular docking of small molecules to protein
receptors.
xiv. MOE (Molecular Operating Environment): software platform that integrates visualization,
modeling, and simulation of bio-molecular structures.
xv. GROMACS: molecular dynamics simulation tool that can be used to study protein-ligand
interactions.

1.5 Applications of Bioinformatics Algorithms

i. Sequence Alignment and Comparison

Example: The Needleman-Wunsch and Smith-Waterman algorithms are used to align DNA
or protein sequences, helping researchers identify similarities between species or genes, which
may indicate functional or evolutionary relationships.

9
ii. Gene Prediction

Example: Hidden Markov Models (HMMs) are used in gene prediction algorithms like
Genscan to identify coding regions (exons) in genomic sequences, aiding in the discovery of
new genes.

iii. Protein Structure Prediction

Example: Algorithms like PSIPRED predict secondary structures (alpha-helices and beta-sheets)
of proteins from their amino acid sequences, which is crucial for understanding protein function
and drug design.

iv. Phylogenetic Tree Construction

Example: The Neighbor-Joining algorithm constructs phylogenetic trees, allowing scientists to


infer evolutionary relationships between organisms based on genetic sequences.

v. Genome Assembly

Example: De Bruijn Graph algorithms are used in genome assembly tools like SPAdes to
reconstruct genomes from short sequencing reads, as in assembling bacterial genomes from next-
generation sequencing data.

vi. Motif Finding

Example: Gibbs Sampling and Expectation-Maximization (EM) algorithms are used to


identify conserved sequence motifs, such as transcription factor binding sites, in a set of DNA
sequences.

vii. Protein-Protein Interaction Prediction

Example: Graph-based algorithms analyze biological networks, such as predicting interactions


between proteins in a cell using protein-protein interaction (PPI) networks, like in STRING
database analyses.

viii. Gene Expression Analysis

Example: Clustering algorithms like K-means are used to group genes with similar expression
patterns from microarray or RNA-Seq data, helping in identifying co-regulated genes in a
biological pathway.

10
LECTURE TWO
MODULE 2: DNA, RNA, AND PROTEIN SEQUENCE DATA
2.1 Overview of Biological Sequences
2.2 Representation of DNA, RNA, and Protein Sequences
2.3 Central Dogma of Molecular Biology
2.4 FASTA and Other Sequence File Formats
2.5 Overview of Nucleotide and Protein Databases
Learning Objectives:
• Representation of biological sequences computationally.
• Explore standard file formats for bioinformatics and database resources.

2.1 Overview of Biological Sequences


A biological sequence is a single, continuous molecule of nucleic acid or protein. Biological
sequences represent the linear arrangement of nucleotides or amino acids that make up DNA,
RNA, and proteins. These sequences form the foundation of genetic information and are key to
understanding biological functions and evolutionary relationships. A typical biological sequence
involves four nucleotides for DNA (Adenine, Thymine, Cytosine, and Guanine), four nucleotides
for RNA (Adenine, Uracil, Cytosine, and Guanine), and twenty amino acids for proteins.
❖ DNA (Deoxyribonucleic Acid): DNA carries the hereditary material of an organism,
encoded as a sequence of nucleotides. It is structured as a double helix composed of two
complementary strands.
❖ RNA (Ribonucleic Acid): RNA is typically single-stranded and functions in various
cellular processes, including acting as an intermediary between DNA and proteins.
❖ Proteins: Proteins are macromolecules made of amino acids and perform a variety of
biological tasks, including catalyzing metabolic reactions and providing structural support
to cells.

2.2 Representation of DNA, RNA, and Protein Sequences


The representation of biological sequences can be done in both digital and textual forms. These
forms help bioinformaticians analyze, manipulate, and visualize sequence data for various
purposes.

11
DNA Sequences: DNA sequences are represented by the letters A, T, C, and G, corresponding to
the four nucleotides. Example: ‘ATCGGCTA’

Fig 2.1 example of a DNA structure.

RNA Sequences: RNA sequences are represented similarly but with U (Uracil) replacing T
(Thymine). Example: ‘AUCGGCUA’

Fig 2.2 example of an RNA structure


Protein Sequences: Proteins are represented by a one-letter code for each amino acid (e.g., A for
Alanine, R for Arginine, K for Lysine, etc.). Example: ‘MKTLLV’. The structure of these
sequences is pivotal to their function. For example, a single nucleotide change in DNA can result
in an altered protein, potentially leading to disease.

2.3 Central Dogma of Molecular Biology


The central dogma of molecular biology explains how genetic information flows within a
biological system. It describes the processes of transcription and translation, whereby DNA is first
transcribed into RNA and then translated into proteins.

12
1. Replication: DNA makes copies of itself. During replication, the double helix unwinds, and each
strand serves as a template for creating a new complementary strand, resulting in two identical
DNA molecules.
2. Transcription: The process of converting DNA into RNA. This RNA can be either mRNA,
tRNA, or rRNA, depending on its function. It is the process where a cell makes a copy of a specific
part of its DNA, which is called RNA. This RNA can be of different types, such as mRNA
(messenger RNA), which carries instructions to make proteins, tRNA (transfer RNA), which helps
build proteins, or rRNA (ribosomal RNA), which forms part of the machinery that assembles
proteins.
3. Translation: RNA is translated into proteins through the ribosomes, where each codon (triplet
of nucleotides) corresponds to an amino acid.
The central dogma can be summarized as: DNA→RNA→Protein
Mutations or errors in these processes can lead to dysfunctional proteins, which can cause various
diseases, including cancer and genetic disorders.

2.4 FASTA and Other Sequence File Formats


In bioinformatics, biological sequence data is stored and analyzed in specific file formats. These
formats standardize how sequences are represented in computers.
1. FASTA Format: One of the most widely used file formats for storing sequence data. A
sequence in FASTA format consists of:
One line starting with a ">" sign, followed by a sequence identification code.
It is optionally being followed by a textual description of the sequence. Since it is not part of the
official description of the format, software can choose to ignore this, when it is present.
One or more lines containing the sequence itself.
A file in FASTA format may comprise more than one sequence. The FASTA format is sometimes
also referred to as the "Pearson" format (after the author of the FASTA program and ditto format).
This format is simple and efficient, making it ideal for large datasets. Example:

>sequence_name
ATCGGCTAAGTC

13
2. GenBank Format: Another format used for nucleotide sequences. It contains both the sequence
and rich annotation details, such as gene structure and function.
3. PDB (Protein Data Bank): This format is used for representing 3D structures of proteins. Each
file contains atomic coordinates that describe the spatial arrangement of a molecule.
4. GFF (General Feature Format): Used for representing genomic features like genes, exons,
and coding regions.
These formats are crucial for data exchange and are widely supported by various bioinformatics
tools.

2.5 Overview of Nucleotide and Protein Databases


Several databases have been developed to store and organize DNA, RNA, and protein sequence
data. These databases enable researchers to access vast amounts of sequence information for
analysis and comparison.
a. GenBank: Managed by the National Center for Biotechnology Information (NCBI), GenBank
is one of the most comprehensive nucleotide databases. It stores DNA sequences from a variety of
organisms.
b. RefSeq: A subset of GenBank, providing curated and annotated reference sequences.
c. EMBL-EBI: European Molecular Biology Laboratory-European Bioinformatics Institute
(EMBL-EBI) offers several databases like ENA (European Nucleotide Archive), storing
nucleotide sequences.
d. UniProt: A well-known protein sequence database that contains functional information about
proteins. It includes protein sequences and their biological roles.
e. Swiss-Prot: A section of UniProt that offers manually annotated, reviewed protein sequences.
f. PDB: The Protein Data Bank contains 3D structures of proteins and nucleic acids, crucial for
structural biology research.
These databases allow for sequence alignment, homology searches, and functional predictions,
driving advancements in bioinformatics and molecular biology.

14
LECTURE THREE
MODULE 3: ALGORITHMIC PROBLEM SOLVING
3.1 Introduction to algorithms: Design, analysis, and complexity
3.2 Algorithmic paradigms: Brute force, divide-and-conquer, greedy algorithms, dynamic
programming
3.3 Time and space complexity (Big O notation)
3.4 Application of algorithms to biological problems
Learning objectives
• Understand the Basics of Algorithms
• Differentiate Algorithmic Paradigms
• Students will be able to evaluate the time and space complexity of algorithms using Big O
notation.

3.1 Introduction to algorithms: Design, analysis, and complexity


An algorithm is a set of commands that must be followed for a computer to perform calculations
or other problem-solving operations. According to its formal definition, an algorithm is a finite
set of instructions carried out in a specific order to perform a particular task. It is not the entire
program or code; it is simple logic to a problem represented as an informal description in the
form of a flowchart or pseudocode. In bioinformatics, algorithms play a pivotal role in solving
complex biological problems, particularly in data-intensive tasks such as sequence alignment, gene
prediction, and structural modeling.

Figure 3.1 Algorithm

❖ Problem: A problem can be defined as a real-world problem or real-world instance problem


for which you need to develop a program or set of instructions. An algorithm is a set of
instructions.

15
❖ Algorithm: An algorithm is defined as a step-by-step process that will be designed for a
problem.
❖ Input: After designing an algorithm, the algorithm is given the necessary and desired
inputs.
❖ Processing unit: The input will be passed to the processing unit, producing the desired
output.
❖ Output: The outcome or result of the program is referred to as the output
3.1.1 Algorithm Design
Algorithm design refers to the process of developing step-by-step procedures to solve
computational problems. In bioinformatics, the focus is on creating efficient and scalable
algorithms that handle large datasets like genome sequences, protein structures, and biological
networks. Key techniques used in designing algorithms include:
❖ Greedy Algorithms: These algorithms make locally optimal choices at each step, aiming to
find a global optimum. Examples in bioinformatics include motif finding and sequence
assembly.
❖ Divide and Conquer: This technique splits a problem into smaller sub-problems, solves
each sub-problem recursively, and combines their solutions. Applications include sequence
alignment and phylogenetic tree construction.
❖ Dynamic Programming: A method used for optimization problems, breaking them down
into overlapping sub-problems and solving each sub-problem once. This is widely used in
sequence alignment algorithms such as Needleman-Wunsch and Smith-Waterman.
❖ Graph-based Algorithms: Many bioinformatics problems can be represented using graphs
(e.g., protein interaction networks, gene regulatory networks). Algorithms such as
Dijkstra’s for shortest path and graph traversal methods (BFS, DFS) are crucial in
analyzing these biological networks.
3.1.2 Algorithm Analysis
Algorithm analysis involves evaluating the performance of an algorithm in terms of time and space
resources (i.e., computational complexity). The goal is to understand how efficiently an algorithm
performs, especially for large datasets typical in bioinformatics. Two critical measures in
algorithm analysis are time complexity and space complexity

16
1. Time Complexity: The amount of time an algorithm takes to solve a problem as a function of
the input size, often expressed in Big-O notation.
2. Space Complexity: The amount of memory an algorithm uses, also expressed in Big-O
notation.
3.1.3 Complexity Classes
Understanding the complexity of problems is vital in bioinformatics, where some tasks may be
computationally infeasible due to the large size of biological data. The complexity of problems is
often classified into:
a. P (Polynomial Time): These are problems that can be solved by an algorithm in polynomial
time, such as sequence alignment and finding the shortest path in a graph. Problems in P are
considered “efficiently solvable”.
b. NP (Nondeterministic Polynomial Time): Problems where a solution can be verified in
polynomial time, but finding the solution may not be possible in polynomial time. An example
from bioinformatics is protein folding prediction.
c. NP-Hard and NP-Complete: NP-hard problems are at least as hard as NP problems, and NP-
complete problems are both in NP and NP-hard. These problems are often intractable for large
inputs, and approximations or heuristics are used. Many bioinformatics problems, such as genome
assembly, are NP-hard.
3.1.4 Common Bioinformatics Algorithms and Their Complexity: Several bioinformatics
algorithms have become standard tools for analyzing biological data, each with distinct design and
complexity features:

❖ BLAST (Basic Local Alignment Search Tool): This heuristic algorithm is designed for
searching protein and nucleotide databases. It uses a fast local alignment approach but may
not guarantee the optimal alignment. Its time complexity is approximately \(O(mn)\),
where \(m\) and \(n\) are the lengths of the sequences being compared.
❖ Needleman-Wunsch Algorithm: A dynamic programming algorithm for global sequence
alignment. Its time complexity is \(O(mn)\), where \(m\) and \(n\) are the lengths of the
sequences being aligned.

17
❖ Smith-Waterman Algorithm: Another dynamic programming algorithm for local sequence
alignment. It also has a time complexity of \(O(mn)\), but it is computationally expensive
for large datasets.
❖ Hidden Markov Models (HMMs): Used for tasks like gene prediction, HMMs are
probabilistic models with algorithms such as the Viterbi algorithm having a time
complexity of \(O(TN^2)\), where \(T\) is the sequence length and \(N\) is the number of
hidden states.
3.1.5 Importance of Algorithm Optimization in Bioinformatics
Bioinformatics datasets are often massive (e.g., genome sequencing, proteomics data), and
processing them efficiently requires carefully designed and optimized algorithms. Optimizing
algorithms for:
Speed: Ensures that biological datasets are analyzed in reasonable time frames.
Memory Usage: Helps conserve computational resources, especially in constrained environments.
Scalability: Ensures algorithms can handle growing datasets as bioinformatics data increases
exponentially due to advancements in sequencing technologies.

3.2 Algorithmic paradigms: Brute force, divide-and-conquer, greedy algorithms, dynamic


programming
3.2.1 Introduction
Algorithmic paradigms are general strategies or approaches used to design algorithms to solve
computational problems. Each paradigm provides a different way of thinking about how to break
down and solve a problem. In bioinformatics, where problems often involve processing large
amounts of biological data (e.g., DNA sequences, protein structures), choosing the right
algorithmic approach is crucial for achieving efficiency and accuracy.
For this course will cover four major algorithmic paradigms: Brute Force, Divide-and-Conquer,
Greedy Algorithms, and Dynamic Programming, with a focus on how these paradigms can be
applied to bioinformatics problems.
3.2.1 Brute Force Paradigm
Brute force is the simplest algorithmic paradigm that involves trying all possible solutions to a
problem and choosing the best one. It explores all potential answers systematically and guarantees
correctness, but it is often inefficient for large inputs due to its high computational cost. Its

18
advantage is that it is Simple to implement and useful for small datasets or problems where
correctness is more important than efficiency, while on the other hand its disadvantages means
that it is impractical for large datasets due to long execution times and inefficient for complex
problems with many possibilities.
Characteristics
❖ Exhaustive Search: Brute force algorithms try every possible option.
❖ Simplicity: These algorithms are easy to implement but can be inefficient.
❖ Guaranteed Solution: Because every option is explored, the correct solution is guaranteed.
Example of Brute Force Paradigm in Bioinformatics:
a. Sequence Alignment: A brute force approach to aligning two DNA sequences would involve
trying all possible alignments and calculating the score for each one. This guarantees finding the
best alignment but is computationally infeasible for long sequences due to the exponential number
of possible alignments.
b. Time Complexity: Usually very high, often exponential (e.g., (O(2^n))) or factorial (e.g.,
(O(n!))).
3.2.2 Divide-and-Conquer Paradigm
Divide-and-conquer is an algorithmic paradigm that breaks a problem into smaller sub-problems,
solves each sub-problem recursively, and then combines their solutions to solve the original
problem. Its advantages are, they are more efficient for larger problems, and can exploit
parallelism, as sub-problems can be solved independently, and finally it often leads to logarithmic
reductions in time complexity. They disadvantage are it enquires recursive thinking, which can
make implementation complex, and merging solutions can sometimes be tricky or computationally
expensive.
Steps of divide and Conquer paradigm
1. Divide: Split the problem into smaller, similar sub-problems.
2. Conquer: Solve each sub-problem independently.
3. Combine: Merge the solutions of the sub-problems to form a solution for the original problem.
Example in Bioinformatics:
Phylogenetic Tree Construction:

19
a. Divide-and-conquer algorithms are used in phylogenetics to construct evolutionary trees by
breaking down the data into smaller groups of species, calculating relationships for each group,
and then merging them to form a complete tree.
b. Another common example is Merge Sort or Quick Sort applied to sorting biological data like
gene sequences or protein structures.
c. Time Complexity: Typically, O(nlogn) for many divide-and-conquer algorithms, which makes
them much more efficient than brute force for large inputs.
3.2.3. Greedy Algorithms Paradigm
A greedy algorithm makes a series of choices, each of which looks best at the moment (locally
optimal), with the hope that these local choices will lead to a globally optimal solution. It solves
the problem step by step by choosing the next step that provides the most immediate benefit.
Advantages:
a. Simple to implement and fast.
b. Suitable for problems where the locally optimal solution leads to a globally optimal solution.
Disadvantages:
a. May not always produce the optimal solution for all problems.
b. Greedy choices are irreversible, making it unsuitable for problems that require reconsideration
of earlier decisions.
Characteristics:
❖ Locally Optimal Choices: Decisions are made based on immediate benefits.
❖ No Backtracking: Once a choice is made, it cannot be undone.
❖ Fast: Greedy algorithms are usually fast and simple.
Examples of Greedy algorithms in Bioinformatics:
a. Genome Assembly: A greedy algorithm can be used to assemble DNA fragments by choosing
the pair of fragments with the largest overlap and merging them until a full sequence is constructed.
Although greedy methods are fast, they may not always give the best (optimal) solution in complex
cases.
b. Huffman Coding for compressing biological data sequences, where the algorithm constructs an
optimal prefix code based on the frequencies of nucleotides or amino acids.
c. Time Complexity: Often linear or polynomial in time, such as O(nlogn) for many greedy
algorithms like Huffman coding.

20
3.2.4. Dynamic Programming Paradigm
Dynamic programming is an optimization technique used to solve problems by breaking them
down into overlapping sub-problems, solving each sub-problem just once, and storing their
solutions in a table (memoization). This avoids redundant calculations and reduces time
complexity compared to brute force approaches.
Characteristics:
❖ Overlapping sub-problems: sub-problems are solved and reused.
❖ Optimal Substructure: A solution to the problem can be composed from solutions to its
sub-problems.
❖ Memoization or Tabulation: Results of solved sub-problems are stored for future
reference.
Examples of Dynamic Programming Paradigm in Bioinformatics:
a. Sequence Alignment (Dynamic Programming Approach): it is used to solve sequence alignment
problems (e.g., Needleman-Wunsch and Smith-Waterman algorithms) by building a matrix that
stores the optimal alignment scores for all possible prefixes of the sequences. This significantly
reduces computation time compared to a brute force approach.
b. RNA Secondary Structure Prediction where dynamic programming helps predict the most stable
structure by breaking down RNA sequence interactions and storing results of overlapping
substructures.
c. Time Complexity: Typically (O(n^2)) or (O(n^3)) for most bioinformatics applications like
sequence alignment.
Advantages:
a. More efficient than brute force due to reduced redundant calculations.
b. Guarantees finding the optimal solution for many problems.
Disadvantages:
a. Can require a lot of memory due to the storage of sub-problem solutions.
b. May be more complex to implement compared to simpler paradigms like greedy algorithms.

21
3.2.5 Comparison of Paradigms in Bioinformatics

In conclusion, understanding and applying the right algorithmic paradigm is essential in solving
bioinformatics problems efficiently. While brute force provides correctness, it is often impractical
for large data, and more advanced paradigms like divide-and-conquer, greedy algorithms, and
dynamic programming provide powerful tools for solving complex problems in a time-efficient
manner. Selecting the appropriate algorithmic approach based on the problem characteristics (e.g.,
overlapping sub-problems, need for global optimality) is key to developing efficient
bioinformatics solutions.

3.3 Time and space complexity (Big O notation)


3.3.1. Time Complexity: The amount of time an algorithm takes to solve a problem as a function
of the input size, often expressed in Big-O notation. For example, an algorithm with time
complexity (O(n^2)) will take time proportional to the square of the input size \(n\).
Common time complexity classes include:
Constant Time \(O(1)\): The execution time is constant and does not depend on the input size.
Logarithmic Time \(O(\log n)\): The execution time grows logarithmically with the input size.
Linear Time \(O(n)\): The execution time grows proportionally to the input size.
Quadratic Time \(O(n^2)\): The execution time grows with the square of the input size.
Exponential Time \(O(2^n)\): The execution time grows exponentially with the input size, often
becoming impractical for large inputs.

22
3.3.2. Space Complexity: The amount of memory an algorithm uses, also expressed in Big-O
notation. Memory efficiency is essential for bioinformatics applications handling large genomic
datasets.The general equation for space complexity is:

Where:
(S(n)) represents the space complexity.
(n) is the size of the input.
(f(n)) is a function that represents how much memory the algorithm uses as a function of (n).
(O(f(n))) is the Big-O notation describing the upper bound of space usage.
Space Complexity Example:
1. Constant Space: when an algorithm uses a fixed amount of space regardless of input size, the
space complexity is O(1) (constant space). Example is a function that only uses a few variables
and doesn’t depend on the input size.
2. Linear Space: if an algorithm’s space grows directly with the input size \( n \), the space
complexity is O(n). Example, Storing an array of size \( n \).
3. Recursive Function: for recursive algorithms, space complexity includes both the memory used
by variables and the additional space taken by the recursive call stack. Example is a recursive
function with depth (n) & using (O(1) ) space per function call would have space complexity O(n).

3.4 Application of algorithms to biological problems


In bioinformatics, algorithms play a crucial role in solving complex biological problems by
providing computational methods to analyze large volumes of biological data efficiently.
Biological data, such as DNA sequences, protein structures, and gene expressions, require precise,
efficient algorithms for tasks like sequence alignment, phylogenetic tree construction, and protein
structure prediction. Key Applications include:

1. Sequence Alignment:
Algorithms like Needleman-Wunsch (global alignment) and Smith-Waterman (local alignment)
are used to compare DNA, RNA, or protein sequences. By aligning sequences, researchers can
identify similarities, evolutionary relationships, and functional regions in genomes.

23
2. Genome Assembly:
Algorithms for assembling short DNA sequences into longer genomes (such as the de Bruijn
graph approach) help reconstruct entire genomes from fragmented DNA data, critical for genomics
research.
3. Phylogenetic Tree Construction:
Algorithms like neighbor-joining and maximum likelihood are used to infer evolutionary
relationships between species based on genetic data. These trees depict how species have diverged
from common ancestors.
4. Protein Structure Prediction:
Algorithms, including those using dynamic programming and machine learning techniques,
predict the three-dimensional structure of proteins based on their amino acid sequences. Accurate
structure predictions are important for understanding protein function and drug design.
5. Gene Expression Analysis:
Algorithms for clustering and classification help analyze gene expression data from technologies
like microarrays and RNA-seq, allowing researchers to identify genes with similar expression
patterns and study their regulation.

24
LECTURE FOUR
Module 4: Sequence Alignment Algorithms
4.1 Biological motivation: Importance of sequence alignment in genomics
4.2 Global alignment (Needleman-Wunsch algorithm)
4.3 Local alignment (Smith-Waterman algorithm)
4.4 Scoring matrices: PAM, BLOSUM
Learning Objectives:
• Understand dynamic programming-based algorithms for sequence alignment.
• Learn to use substitution matrices for scoring sequence alignments.

4.1 Introduction to Sequence Alignment


Sequence Alignment is a method used to arrange two or more sequences (of DNA, RNA, or
proteins) to identify regions of similarity that may indicate functional, structural, or evolutionary
relationships. The process of aligning sequences enables us to compare sequences by placing them
in a matrix where each sequence is aligned across the top and down the left side. The goal is to
identify the best match between the sequences based on a scoring system. There are two primary
types of sequence alignment:
Global Alignment: Attempts to align every residue in both sequences from beginning to end.
Local Alignment: Searches for the best matching region within two sequences, not necessarily
aligning the entire sequence.
Sequence alignment forms the foundation of many bioinformatics analyses, such as:
Phylogenetics: Studying evolutionary relationships.
Gene prediction: Identifying coding regions within a DNA sequence.
Functional genomics: Identifying functional regions within sequences.
4.1.1 Basic Concepts:
1. Substitution Matrix: A table used to score alignments between residue pairs in protein or
nucleotide sequences (e.g., PAM or BLOSUM matrices).
2. Gap Penalty: A score assigned to opening or extending gaps in alignments to penalize insertions
and deletions.
3. Scoring System: Helps evaluate how well sequences match by assigning scores to matches,
mismatches, and gaps.

25
4.2 Global Sequence Alignment: Needleman-Wunsch Algorithm
The Needleman-Wunsch Algorithm is a dynamic programming approach used for global sequence
alignment. It aligns sequences in their entirety, aiming to find the optimal alignment between the
two sequences over their entire length. The algorithm constructs a matrix where one sequence is
placed on the top (along columns) and the other along the side (along rows). It uses the following
steps:
1. Initialization: Create a scoring matrix where the first row and column are initialized with gap
penalties.
2. Matrix Filling: Fill in the matrix based on the scoring scheme, taking into account matches,
mismatches, and gap penalties.
3. Traceback: Starting from the bottom-right of the matrix, trace back to determine the optimal
alignment.
Algorithm Steps:
1. Input:
❖ Two sequences: A = A_1, A_2, ..., A_m and B = B_1, B_2, ..., B_n.
❖ Substitution matrix for match/mismatch scores.
❖ Gap penalty.
2. Initialization:
Let F (i, 0) = -d * i for all i = 0 to m (for row).
Let F (0, j) = -d * j for all j = 0 to n (for column).
3. Recurrence Relation:
F(i, j) = max {
F(i-1, j-1) + s(A_i, B_j), # match/mismatch
F(i-1, j) - d, # insertion (gap in B)
F(i, j-1) - d # deletion (gap in A)
}
N/b The function F(i, j) computes a score for aligning two sequences (let’s call them A and B).
The indices i and j refer to specific positions in these sequences:
• A_i: The i-th element of sequence A
• B_j: The j-th element of sequence B
The Components:

26
1. F(i-1, j-1) + s(A_i, B_j):
1. This part considers the case where you match or mismatch the characters at A_i
and B_j.
2. s(A_i, B_j) is a scoring function that gives a positive score for a match and a
negative score for a mismatch.
3. So, this expression takes the score from aligning the previous characters (i-1, j-
1) and adds the score from the current characters.
2. F (i-1, j) - d:
1. This part accounts for an insertion (a gap in sequence B).
2. It looks at the score for aligning the previous character of A with the current
character of B, but reduces the score by d (the penalty for introducing a gap).
3. F (i, j-1) - d:
1. This part considers a deletion (a gap in sequence A).
2. It checks the score for aligning the current character of A with the previous
character of B and also reduces the score by d for the gap.
Putting It All Together:
• F (i, j) takes the maximum of these three options:
1. Match/mismatch score (align both A_i and B_j)
2. Insertion (allowing a gap in B)
4. Traceback:
- Starting from `F(m, n)`, trace back to `F(0, 0)` to recover the optimal alignment.

27
4.3 Local Sequence Alignment: Smith-Waterman Algorithm
The Smith-Waterman Algorithm is another dynamic programming approach but for local sequence
alignment. Unlike the Needleman-Wunsch algorithm, it finds the best matching subsequence
between two sequences. This method is ideal when you want to align sequences that only partially
overlap. The algorithm constructs a similar matrix to Needleman-Wunsch but differs in the
following:
❖ Negative scores are replaced with zeros to ensure no penalty for misalignments outside the
local alignment.
❖ Traceback starts from the highest scoring cell and ends when a cell with a score of zero is
reached.
Algorithm Steps:
1. Input:
1. Two sequences: A = A_1, A_2, ..., A_m and B = B_1, B_2, ..., B_n.
2. Substitution matrix for match/mismatch scores.
3. Gap penalty.
2. Initialization:
F(i, 0) = 0 for all i.
F(0, j) = 0` for all j.

28
3. Recurrence Relation:
F(i, j) = max {

0,
F(i-1, j-1) + s(A_i, B_j), # match/mismatch
F(i-1, j) - d, # insertion (gap in B)
F(i, j-1) - d # deletion (gap in A)

4. Traceback:

Starting from the highest scoring cell in the matrix, trace back to a cell with a score of zero.

Example Code in Python:

python
def smith_waterman(seq1, seq2, match_score=1, mismatch_score=-1, gap_penalty=-2):
m, n = len(seq1), len(seq2)
score_matrix = [[0 for _ in range(n+1)] for _ in range(m+1)]
# Fill the score matrix
max_score = 0
max_pos = None
for i in range(1, m+1):
for j in range(1, n+1):
match = score_matrix[i-1][j-1] + (match_score if seq1[i-1] == seq2[j-1] else mismatch_score)
delete = score_matrix[i-1][j] + gap_penalty
insert = score_matrix[i][j-1] + gap_penalty
score_matrix[i][j] = max(0, match, delete, insert)
if score_matrix[i][j] > max_score:
max_score = score_matrix[i][j]
max_pos = (i, j)

# Traceback to get alignment

align1, align2 = '', ''


i, j = max_pos
while score_matrix[i][j] > 0:
current_score = score_matrix[i][j]
if current_score == score_matrix[i-1][j-1] + (match_score if seq1[i-1] == seq2[j-1] else mismatch_score):
align1 += seq1[i-1]
align2 += seq2[j-1]
i -= 1
j -= 1
elif current_score == score_matrix[i-1][j] + gap_penalty:
align1 += seq1[i-1]
align2 += '-'
i -= 1

29
else:
align1 += '-
align2 += seq2[j-1]
j -= 1
return align1[::-1], align2[::-1]
# Example Usage
seq1 = "GATTACA"
seq2 = "GCATGCU"
alignment = smith_waterman(seq1, seq2)
print("Alignment:", alignment)

4.4 Scoring matrices: PAM, BLOSUM


4.4.1 Introduction
In bioinformatics, scoring matrices are essential for comparing biological sequences, such as
proteins and DNA. They help quantify the similarity between sequences, especially when aligning
amino acids or nucleotide sequences. Scoring matrices assign scores for matches, mismatches, and
gaps in sequence alignment. Two of the most widely used scoring matrices in protein sequence
alignment are PAM (Point Accepted Mutation) and BLOSUM (Blocks Substitution Matrix).
Scoring matrices provide a way to measure the similarity or evolutionary relationship between two
sequences. When comparing protein sequences, the matrix assigns a score to each possible amino
acid pair in an alignment. Positive scores are typically given for more biologically favorable
substitutions, while negative scores are assigned to substitutions that are less likely. There are two
primary types of scoring matrices:
1. Substitution Matrices: Used for aligning amino acids.
2. Gap Penalties: Applied when there are gaps (insertions or deletions) in the alignment.
Key Functions of Scoring Matrices:
1. Quantifying similarity: They help compare the degree of similarity between sequences.
2. Guiding alignments: They provide a basis for alignment algorithms (e.g., Needleman-Wunsch,
Smith-Waterman).
3. Inferring homology: Matrices help infer evolutionary relationships between proteins.
4.4.2 PAM (Point Accepted Mutation) Matrices
PAM matrices are based on evolutionary distances between proteins. The PAM1 matrix, in
particular, measures the probability of an amino acid substitution after one accepted point mutation
per 100 residues (1% divergence). (it tells us or measures changes in the building blocks of protein
called amino acid over time.)

30
Development of PAM:
❖ The PAM matrix was developed by Margaret Dayhoff in the 1970s.
❖ It was constructed by comparing closely related protein sequences and determining how
frequently one amino acid changes into another over evolutionary time.
Characteristics:
❖ PAM1: Represents 1% of evolutionary change (1 mutation per 100 amino acids).
❖ PAM250: Represents 250% change (250 mutations per 100 amino acids), and so on.
❖ Usage: PAM matrices are most useful for comparing closely related sequences.
Example of a PAM1 Matrix:

Key Points:
PAM1 matrix is derived from direct observations of accepted mutations.
Higher PAM matrices (e.g., PAM250) are extrapolated for larger evolutionary distances.
4.4.3 BLOSUM (Blocks Substitution Matrix)
The BLOSUM matrix is based on blocks of conserved sequences, rather than evolutionary
distances. Unlike PAM, BLOSUM matrices are derived from comparisons of protein sequences
within conserved regions (blocks) of related proteins.
Development of BLOSUM:
❖ Developed by Steven Henikoff and Jorja Henikoff in 1992.
❖ It is based on blocks of sequences that are more distantly related than those used in PAM.
Characteristics:
a. BLOSUM62: The most widely used matrix, optimized for sequences with about 62% similarity.

31
b. BLOSUM matrices: Higher-numbered matrices (e.g., BLOSUM80) are for more closely related
sequences, while lower-numbered matrices (e.g., BLOSUM45) are for more distantly related
sequences.
Example of a BLOSUM62 Matrix:

Key Points:
❖ BLOSUM62 is the default matrix for most protein alignment tools.
❖ BLOSUM matrices are more versatile for finding local alignments between sequences of
varying similarity levels.
4.4.4 PAM vs. BLOSUM
PAM: Based on evolutionary models and best suited for aligning sequences with small
evolutionary distances.
BLOSUM: Based on empirical data from conserved blocks, better for local alignment of more
distantly related sequences.

32
4.4.5 Example Exercise:
1. Given the sequences:
Seq1: ATCG
Seq2: ATCC
Use a simple substitution matrix where matches score +1, mismatches score -1, and gaps are
penalized with -2. Align the two sequences and calculate the alignment score.
Solution: Let’s align the sequences with no gaps first:

A vs. A: match (+1)


T vs. T: match (+1)
C vs. C: match (+1)
G vs. C: mismatch (-1)
Total score: (1 + 1 + 1 - 1 = 2)
2. Exercise: Align the following two protein sequences using the BLOSUM62 matrix and calculate
the score.
Protein 1: AVFC
Protein 2: AVLC
Summary
Scoring matrices like PAM and BLOSUM are fundamental in bioinformatics for aligning protein
sequences and assessing evolutionary relationships. They allow researchers to detect similarities
between sequences that may not be immediately apparent, providing insights into protein function
and evolution.

33
LECTURE FIVE
Module 5: Multiple Sequence Alignment
5.1 Biological significance of multiple sequence alignment (MSA)
5.2 Progressive alignment algorithms (e.g., CLUSTLAW)
5.3 Iterative methods and refinement (e.g., MUSCLE)
5.4 Advanced sequence alignment Algorithms (protein sequence alignment)

5.1 Biological significance of multiple sequence alignment (MSA)


5.1.1 What is Multiple Sequence Alignment (MSA)?
Multiple Sequence Alignment (MSA) is a computational method used to align three or more
biological sequences usually DNA, RNA, or protein sequences to identify regions of similarity.
These similarities can reveal evolutionary relationships, functional and structural patterns, and
conserved regions essential to understanding biological processes. MSA is essential in
bioinformatics for analyzing genetic information and exploring the biological significance of
different sequences.
Why is MSA Important in Biology?
The goal of MSA is to identify homologous regions, or regions of similarity across sequences that
can signify:
1. Evolutionary Relationships: By aligning multiple sequences, scientists can observe
mutations, insertions, or deletions over time. MSA allows us to determine evolutionary
relationships, classify organisms, and even track changes in genes or proteins.
2. Functional Conservation: Identifying conserved regions (regions that stay the same
across sequences) can indicate essential functional areas in proteins or genes. These regions
are likely crucial for the organism’s survival and are often conserved through evolution.
3. Structural Insights: MSA provides insight into the structural aspects of proteins or RNA.
Conserved sequences often contribute to specific structural features, which can help predict
the structure and functionality of unknown proteins or RNA sequences.
Key Steps in Performing MSA
1. Sequence Collection: Gather sequences that need to be aligned. These can be from various
species or variants of a single species.

34
2. Alignment Algorithm: Use an algorithm (such as Clustal Omega, MUSCLE, or MAFFT)
to align the sequences.
3. Analyze Conserved Regions: Look for blocks of conserved sequences that may indicate
structural or functional significance.
4. Interpret Biological Meaning: Evaluate the alignment results for evolutionary
relationships, conserved functional domains, or structural predictions.
Practical Examples of MSA in Bioinformatics
1. Detecting Evolutionary Relationships: Suppose scientists have DNA sequences from
three different organisms: a human, a chimpanzee, and a gorilla. By aligning these
sequences, MSA reveals which parts of the sequence are conserved and which have
diverged over time. Example:
Human: ATGCTGAACCT
Chimpanzee: ATGCTGAACCA
Gorilla: ATGTTGAACCA
Here, the MSA shows slight differences in the sequences, with conserved regions across all three.
These similarities can infer that humans, chimpanzees, and gorillas share a common ancestor.
2. Predicting Protein Functionality: Imagine studying a protein sequence known for its
enzymatic activity in one species. By aligning this protein with similar sequences from
other species, researchers can locate conserved amino acids (building blocks of proteins)
likely crucial for the protein’s function.
Example:
Species A: MKVLLVGLQGS
Species B: MKVLLVGLQGD
Species C: MKVLLVGLQGS
In this example, the alignment shows that the sequence "MKVLLVGLQGS" is conserved across
species A and C but differs slightly in species B. This conserved sequence might represent a
functional domain crucial for enzymatic activity.
3. Identifying Mutations and Disease Links: MSA is commonly used to compare gene
sequences from healthy individuals with those who have a genetic disorder. Differences or
mutations identified through alignment can indicate mutations responsible for diseases.

35
Example:
Healthy Sequence: ATGCGTACTGAAC
Mutant Sequence: ATGCTTACTGAAC
The mutation from "C" to "T" in the MSA could potentially alter protein functionality, linking it
to disease if the region is functionally significant.
Tools Commonly Used in MSA
1. Clustal Omega: A widely used tool for multiple sequence alignment that efficiently
handles large numbers of sequences.
2. MUSCLE: Known for its speed and accuracy, it’s frequently used for protein sequences.
3. MAFFT: Suitable for large datasets, often used for complex alignments of nucleotide or
protein sequences.

5.2 Progressive alignment algorithms (e.g., CLUSTLAW)


Progressive alignment is a popular method for MSA in bioinformatics. It builds the alignment
progressively, adding one sequence at a time to a growing alignment. This technique, implemented
by tools like CLUSTALW, is widely used due to its balance of accuracy and computational
efficiency.
5.2.1 How it Works:
1. Distance Matrix Construction: Pairwise alignments are performed for all pairs of
sequences to calculate their similarity scores.
2. Guide Tree Construction: Based on the distance matrix, a guide tree (also called a
phylogenetic tree) is constructed. The sequences are arranged from the most to least similar
pairs.
3. Progressive Alignment: Sequences are aligned progressively according to the guide tree,
starting with the most similar pair and adding sequences one by one.
5.2.2 CLUSTALW Overview
CLUSTALW is a popular tool for performing progressive alignment. It uses the above method to
generate an MSA for DNA, RNA, or protein sequences.
• Input: A set of sequences in FASTA format.
• Output: An MSA file showing aligned sequences, with gaps introduced to maximize
alignment across all sequences.

36
Key Features:
1. Pairwise Alignment: Uses scoring matrices to perform pairwise alignments.
2. Guide Tree Construction: Utilizes the Neighbor-Joining method to build the tree.
3. Alignment: Uses the tree to progressively align sequences or groups of sequences.
5.2.3 Steps in CLUSTALW Progressive Alignment
1. Step 1: Compute Pairwise Distances: Calculate similarity scores for all pairs of sequences,
usually using a substitution matrix like PAM or BLOSUM (for proteins) or simple
match/mismatch scores (for DNA).
2. Step 2: Construct a Guide Tree: Using the computed distances, a guide tree is built using
hierarchical clustering or the Neighbor-Joining algorithm.
3. Step 3: Progressive Alignment: The sequences are aligned following the guide tree.
Initially, two closely related sequences are aligned, and then each subsequent sequence or
group of sequences is aligned to this initial alignment.
5.2.3 Example of Progressive Alignment Using CLUSTALW
Let’s align the following three DNA sequences using CLUSTALW:
• Seq1: ATCG
• Seq2: ATGG
• Seq3: ACGA
Step-by-Step Example:
Step 1: Pairwise Distance Calculation
Using a simple match (1 point) / mismatch (0 points) scoring system:
1. Seq1 vs. Seq2: 3 matches (AT-GG), score = 3
2. Seq1 vs. Seq3: 2 matches (A-CGA), score = 2
3. Seq2 vs. Seq3: 2 matches (A-GGA), score = 2
Step 2: Construct Guide Tree
Based on the scores, Seq1 and Seq2 are most similar, so they will be grouped first.
Step 3: Progressive Alignment
1. Align Seq1 and Seq2:
Seq1: ATCG
Seq2: ATGG
2. Align Seq3 to the alignment of Seq1 and Seq2:

37
Seq1: ATCG
Seq2: ATGG
Seq3: A-CGA
5.2.4 Applications of Progressive Alignment in Bioinformatics
❖ Phylogenetic Analysis: Helps determine evolutionary relationships.
❖ Conserved Region Identification: Identifies regions that remain unchanged across species,
hinting at crucial functional roles.
❖ Structure Prediction: Uses alignments to predict the 3D structure of proteins.
5.2.5 Advantages and Limitations of Progressive Alignment
Advantages:
❖ Computational Efficiency: Works well with larger datasets.
❖ Ease of Implementation: Requires relatively simple calculations for guide tree construction
and alignment.
Limitations:
❖ Error Propagation: Early alignment errors propagate through the entire alignment.
❖ Dependence on Guide Tree: Accuracy is highly dependent on the quality of the guide tree.

5.3 Iterative Methods and Refinement (e.g., MUSCLE)


In bioinformatics, iterative methods are techniques used to improve the accuracy of sequence
alignments by refining them over multiple steps or "iterations." These approaches are especially
useful when dealing with complex sequences, such as those in protein alignment, where initial
alignments may not capture subtle similarities. One popular iterative alignment tool is MUSCLE
(Multiple Sequence Comparison by Log-Expectation), widely used for multiple sequence
alignment due to its accuracy and efficiency.
How Iterative Methods Work
1. Initial Alignment: The algorithm starts by creating an initial alignment, often quickly but
with basic accuracy.
2. Refinement: The initial alignment is refined by repeatedly adjusting and realigning
portions of the sequence. With each iteration, the algorithm tries to improve the score (e.g.,
maximizing similarity or minimizing mismatches).

38
3. Convergence: The process continues until improvements become negligible, indicating
that the algorithm has likely reached the best alignment possible.
MUSCLE Algorithm
MUSCLE is particularly valued for both speed and accuracy, using three primary stages:
1. Draft Progressive Alignment: Builds a tree to represent the relationships between
sequences, then aligns them progressively based on this tree.
2. Tree Refinement: The algorithm builds a second, more accurate tree based on the initial
alignment and realigns the sequences progressively again, using this refined tree.
3. Iterative Refinement: MUSCLE refines the alignment iteratively by making small
adjustments, testing the alignment score after each change.
Advantages of MUSCLE
• High accuracy: MUSCLE achieves high alignment accuracy, especially for long protein
sequences.
• Efficiency: MUSCLE can quickly handle large datasets, making it popular for aligning
multiple protein sequences.

5.4 Advanced Sequence Alignment Algorithms (Protein Sequence Alignment)


Advanced sequence alignment algorithms aim to accurately align protein sequences by taking into
account factors like evolutionary relationships and structural similarities. Unlike DNA sequences,
protein sequences have more complex alignment needs due to their diverse chemical properties.
Protein sequence alignment algorithms often use scoring systems based on amino acid properties
to better reflect biological significance.

5.4.1 Types of Advanced Protein Alignment Algorithms


1. Dynamic Programming-Based Approaches:
Smith-Waterman Algorithm: This algorithm performs local sequence alignment, which is
particularly useful when only a specific region of two sequences shares similarity.
Needleman-Wunsch Algorithm: Suitable for global alignment, where the algorithm aligns entire
protein sequences from start to finish.

39
2. Hidden Markov Models (HMM):
HMMER: A widely-used tool that utilizes Hidden Markov Models for aligning protein sequences,
especially in finding patterns in protein families. HMMER is particularly effective for recognizing
distant relationships between sequences due to its probabilistic approach.
3. Profile-Based Algorithms:
PSI-BLAST (Position-Specific Iterated BLAST): Builds a profile based on the results of an initial
BLAST search and then aligns other sequences to this profile. This is useful for detecting weak
but biologically significant similarities between proteins.
4. Progressive Alignment with Refinement:
Many modern tools, including MUSCLE and MAFFT (Multiple Alignment using Fast Fourier
Transform), use progressive alignment followed by iterative refinement to handle large-scale
protein datasets with high accuracy.

5.4.2 Scoring Matrices


Scoring matrices like BLOSUM (Blocks Substitution Matrix) and PAM (Point Accepted
Mutation) are integral to protein alignment algorithms. They provide scores based on observed
substitution rates, allowing algorithms to better match amino acids that are functionally or
evolutionarily similar.
• BLOSUM62: Frequently used in BLAST and other alignment tools for proteins, as it is
well-suited for sequences with moderate levels of similarity.
• PAM250: Useful for more distantly related proteins, as it assumes a greater degree of
divergence between sequences.
Practical Example: Using PSI-BLAST for Protein Analysis
Imagine we have a protein sequence that we believe may be distantly related to others in a database,
potentially sharing functional domains with proteins of known function.
1. Input Sequence: Start with a query protein sequence in FASTA format.
2. Run PSI-BLAST:
Load the sequence into PSI-BLAST, choosing a suitable database like Swiss-Prot, which is
manually curated for high-quality annotations.

40
3. Output: PSI-BLAST will return sequences that are similar to the query and provide a
position-specific scoring matrix that can help identify conserved motifs or functional
domains.
4. Iterate: Re-run PSI-BLAST to refine the alignment and detect more distant relatives,
which can be informative for functional or structural predictions.

5.4.3 Comparing BLAST and FASTA

41
MODULE 6: HEURISTICS FOR SEQUENCE ALIGNMENT
6.1 Challenges in aligning large datasets
6.2 Heuristic methods for rapid alignment: BLAST and FASTA algorithms
6.3 Trade-offs between accuracy and performance in heuristic algorithms
6.4 Case studies: Using BLAST in genomic research

6.0 Introduction to Heuristics for sequence Alignment


Heuristics for sequence alignment are computational methods that speed up the alignment of
biological sequences, like DNA, RNA, or proteins, by approximating optimal alignments rather
than computing them exactly. Exact methods, like dynamic programming, can be very slow for
long sequences, so heuristic approaches are used to make the process faster, especially when
analyzing large datasets in genomics. Common heuristic algorithms for sequence alignment
include:
1. BLAST (Basic Local Alignment Search Tool): This is one of the most widely used
heuristics for sequence alignment. It works by first identifying small matching
subsequences, or "seeds," and then extending these matches to create high-scoring
alignments. BLAST is very fast and is suitable for finding regions of similarity.
2. FAST (Fast Alignment Search Tool): Similar to BLAST, FAST finds matching words
(short subsequences) between sequences. It arranges these in a graph and extends the high-
scoring words to form alignments.
3. PatternHunter: This algorithm uses spaced seeds, allowing matches with gaps and
helping to detect more distant relationships between sequences. It’s particularly useful in
genomic comparisons.

6.1 Challenges in aligning large datasets


Aligning biological sequences, such as DNA, RNA, or proteins, is a fundamental task in
bioinformatics, helping scientists compare sequences, find functional similarities, and understand
evolutionary relationships. However, as sequencing technology advances, the volume of sequence
data has grown exponentially, creating significant challenges in aligning large datasets.

42
6.1.1 Key Challenges in Large-Scale Sequence Alignment
1. Data Volume and Computational Resources: As datasets grow larger, aligning them
requires more computational resources. Exhaustive alignment methods, like dynamic
programming (e.g., Needleman-Wunsch and Smith-Waterman), become computationally
expensive and time-consuming for very large datasets.
2. Accuracy vs. Speed: Achieving high accuracy with traditional alignment algorithms
demands significant processing power, leading to slower runtimes. Optimizations are often
required to balance the trade-off between speed and accuracy.
3. Memory Usage: Aligning large sequences requires considerable memory to store matrices
and intermediary data. Managing memory efficiently becomes a crucial task.
4. Complexity of Biological Data: Biological sequences may contain insertions, deletions,
and substitutions, which complicates alignment. Some sequences might also be repetitive
or conserved across multiple species, adding additional complexity to the alignment
process.
5. Database Search Time: When searching for similar sequences in large databases, the time
to complete the search can be prohibitively long, especially when the database size is in
terabytes.
Example Problem
Imagine you have a dataset with one million DNA sequences, each about 1,000 bases long.
Traditional pairwise alignment using dynamic programming would require vast amounts of
computation and memory. Instead, we turn to heuristic methods like BLAST or FASTA to achieve
faster, approximate alignments.

6.2 Heuristic Methods for Rapid Alignment – BLAST and FASTA Algorithms
To address the challenges of aligning large datasets, bioinformatics researchers use heuristic
algorithms like BLAST (Basic Local Alignment Search Tool) and FASTA (Fast-All). These
methods quickly approximate alignments by focusing on local regions of similarity, rather than
exhaustively analyzing every possible alignment.
Key Concepts:
• Heuristic Algorithms: They make trade-offs between speed and optimality by focusing
on finding “good enough” solutions instead of the best possible alignment.

43
• Local Alignments: Instead of aligning entire sequences, they identify regions of high
similarity, which reduces computational time.
6.2.1 BLAST Algorithm (Basic Local Alignment Search Tool)
BLAST is one of the most widely used bioinformatics tools for finding regions of similarity
between biological sequences. It is faster than exhaustive methods because it uses a heuristic
approach to find local alignments.
How BLAST Works:
1. Word Matching: BLAST first identifies “words” or short subsequences within a query
sequence. Words are typically 11 bases for nucleotide sequences or 3 amino acids for
proteins.
2. Extension: After finding a matching word between the query and database sequences,
BLAST attempts to extend the alignment by matching additional bases or amino acids
around it.
3. Scoring and Filtering: BLAST scores alignments based on similarity. Only alignments
above a certain score threshold are considered, further reducing computation time.
4. Result Output: BLAST ranks results based on the similarity score, displaying the best
matches to the user.
Example: Suppose you have a query protein sequence and want to find similar sequences in a
protein database. Using BLAST, you can search the database quickly by comparing segments
rather than the entire sequences.
6.2.2 FASTA Algorithm (Fast-All)
FASTA was one of the first algorithms developed for rapid sequence alignment. Although BLAST
has largely supplanted it in popularity, FASTA remains a useful heuristic tool for rapid alignment.
How FASTA Works:
1. Identifying K-Tuples: FASTA breaks down sequences into “k-tuples” (short
subsequences of k length). The k-value depends on the sequence type (nucleotides or amino
acids).
2. Word Matching: Like BLAST, FASTA finds initial matching k-tuples between the query
and database sequences.
3. Scoring and Filtering: FASTA scores and filters alignments based on the similarity of
matched regions, focusing on those with high local alignment scores.

44
4. Final Alignment: FASTA refines high-scoring matches using dynamic programming to
ensure accuracy, providing alignments that are comparable to BLAST.
Example: FASTA can be used to find DNA sequence matches in a database.

6.3 Trade-offs Between Accuracy and Performance in Heuristic Algorithms


Heuristic algorithms are practical methods for solving problems, aiming to find sufficiently good
solutions quickly, especially for tasks like sequence alignment. Rather than exhaustively exploring
all alignments, they identify regions of similarity more efficiently, which is crucial given the size
of biological databases. However, this approach requires a balance between accuracy (finding
biologically meaningful alignments) and performance (speed and resource efficiency).
6.3.1 Key Trade-offs in Heuristic Algorithms
1. Accuracy vs. Performance:
❖ High Accuracy, Lower Performance: Exact algorithms like Needleman-Wunsch and
Smith-Waterman provide optimal alignments but are computationally intensive.
❖ Lower Accuracy, Higher Performance: Heuristic algorithms, like BLAST, achieve faster
processing by sacrificing some accuracy, ideal for large datasets.
2. Specific Trade-offs:
❖ Sensitivity vs. Speed: High-speed algorithms might miss low-similarity alignments.
❖ Memory Usage vs. Performance: Additional memory (e.g., hash tables) can boost speed
but may be a limitation with large datasets.
❖ Alignment Length vs. Resources: Longer alignments increase computational demands.
6.3.2 Strategies to Manage Trade-offs
1. Word Matching: Limiting search space by focusing on small sequence segments.
2. Greedy Extensions: Extending alignments based on promising matches.
3. Indexing Schemes: Rapid retrieval of sequences using data structures.
4. Parallel Processing: Using multiple processors to improve efficiency.

6.4 Case Studies: Using BLAST in Genomic Research


BLAST (Basic Local Alignment Search Tool), developed by Stephen Altschul and his team in
1990, is a key algorithm for aligning sequences quickly and effectively against large nucleotide or
protein databases, widely used in bioinformatics. How BLAST Works:

45
1. Word Matching: The query sequence is divided into smaller "words" or subsequences,
which are searched in the database for exact matches.
2. Extension: Each exact match is extended in both directions to form a high-scoring segment
pair (HSP).
3. Scoring and Significance: HSPs are scored with a scoring matrix, and their alignment
significance is statistically assessed.
4. Output: BLAST generates a ranked list of alignments, prioritizing the most significant
matches.
6.4.1 Case Studies
1. Identifying Gene Homologs:
❖ Objective: Understand evolutionary links by finding homologous genes across species.
❖ Approach: Using BLASTP, a human protein is compared with a mouse protein database.
❖ Outcome: Reveals conserved gene functions, advancing evolutionary biology.
2. Discovering Novel Genes:
❖ Objective: Identify new genes in a newly sequenced organism.
❖ Approach: BLASTX compares the new organism's nucleotide sequences to a known
protein database.
❖ Outcome: Suggests novel genes for faster genome annotation and further studies.
6.4.2 Advantages of BLAST in Genomic Research
❖ Speed and Efficiency: Supports rapid large-database queries.
❖ Flexibility: Different BLAST types allow for various sequence comparisons.
❖ User-Friendly: Available as command-line and web tools.
❖ Widespread Integration: Used broadly across bioinformatics pipelines.
6.4.3 Limitations of BLAST
❖ Heuristic Nature: May miss alignments lacking exact word matches.
❖ Database Dependency: Results rely on the database's quality.
❖ Computational Demand: Large datasets still require considerable processing power.

46
Module 7: Hidden Markov Models (HMM) in Bioinformatics
7.1 Overview of Hidden Markov Models, Markov processes and biological sequences
7.2 Components of HMMs: States, transitions, emissions
7.3 Application of HMMs in biological sequence modeling (gene prediction, protein family
classification)

7.1 Overview of Hidden Markov Models, Markov Processes, and Biological Sequences

A Markov chain is a model that tells us something about the probabilities of sequences of random
variables, states, each of which can take on values from some set. These sets can be words, or tags,
or symbols representing anything, like the weather. A Markov chain makes a very strong
assumption that if we want to predict the future in the sequence, all that matters is the current state.
The states before the current state have no impact on the future except via the current state. It’s as
if to predict tomorrow’s weather you could examine today’s weather but you weren’t allowed to
look at yesterday’s weather.
A hidden Markov model (HMM) allows us to talk about both observed events (like words that we
see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in
our probabilistic model. HMMs are statistical models used to describe systems that have
unobservable (hidden) states. These models rely on Markov processes, where the probability of
transitioning to a particular state depends only on the current state (not the history of states). The
HMM is based on augmenting the Markov chain.

Relevance to Bioinformatics:
Biological sequences (DNA, RNA, or proteins) often exhibit patterns that are not directly
observable but can be inferred using statistical models. For example:
• The identification of exons and introns in genes.
• Classifying protein domains within sequences.
Example:
• DNA Sequence Modeling: In a DNA sequence, each base (A, T, C, G) may correspond to
states like exon or intron. HMMs can infer which parts of the sequence are exons (coding)
or introns (non-coding).

47
7.2 Components of HMMs: States, Transitions, Emissions
1. States
States in an HMM represent the hidden, discrete conditions or categories of the system. These
states are not directly observable but influence the observable outputs.
• Characteristics of States:
Each state is discrete and part of a finite set.
States are hidden, meaning the actual state sequence is not directly known.
They are often denoted as S = {S1, S2, …, SN} where N is the total number of states.
• Examples:
i. In speech recognition, states might represent different phonemes.
ii. In weather modeling, states might represent weather conditions like "sunny," "rainy," or
"cloudy."
iii. In biological sequence analysis, states might represent DNA sequence regions (e.g.,
"coding" or "non-coding").
2. Transitions
Transitions describe the probabilities of moving from one state to another in the hidden sequence.
These probabilities are defined by a transition matrix.
• Transition Matrix (AAA):

Key Properties:
❖ Transition probabilities remain fixed for a given HMM.
❖ Initial state probabilities (π\piπ): The distribution that defines the likelihood of starting in
each state.
Examples:
i. In weather modeling, A [sunny, rainy], could be the probability of transitioning from a
sunny day to a rainy day.

48
ii. In part-of-speech tagging, transitions might define the probability of moving from a noun
to a verb.
3. Emissions

Emissions represent the observable outputs or symbols generated from each hidden state. These
are modeled by emission probabilities, which describe the likelihood of an observable output
given a particular hidden state.

Examples:
i. In speech recognition, emissions might be audio waveforms or spectral features derived
from spoken words.
ii. In weather modeling, emissions could be the observed weather patterns (e.g., temperature,
humidity).

7.3 Applications of HMMs in Biological Sequence Modeling


1. Gene Prediction

HMMs can distinguish between coding (exon) and non-coding (intron) regions in a genome.
Example: The Genscan software employs HMMs to predict genes in raw DNA sequences.

2. Protein Family Classification

HMMs classify proteins into families based on conserved motifs and patterns. Example: The Pfam
database uses HMMs to identify and annotate protein families.

3. Sequence Alignment

49
HMMs are employed in multiple sequence alignment by modeling conserved regions across
sequences. Example: The HMMER tool uses HMMs for searching sequence databases and
aligning multiple sequences.

4. Structural Prediction

HMMs predict secondary structures (alpha-helix, beta-sheet) in proteins.

Example: SAM-T08 software uses HMMs to predict protein structures based on sequence
similarity.

50
MODULE 8: PHYLOGENETIC TREES
8.1 Introduction to tree structures: Binary trees, rooted vs unrooted trees
Phylogenetic trees are diagrams that depict evolutionary relationships among various species or
other entities based on their genetic or physical traits. These relationships are inferred using
biological data, such as DNA sequences, and are represented in the form of a tree-like structure.
A. Binary Trees
A binary tree is a hierarchical structure in which each node has at most two children, typically
referred to as the "left" and "right" child. In phylogenetics, binary trees represent evolutionary
divergence, with each node signifying a common ancestor and its descendants.
Key Features:
1. Internal Nodes: Represent hypothetical common ancestors.
2. Leaf Nodes: Represent existing species or taxa.
3. Edges: Represent evolutionary paths.
Example:
Imagine three species: A, B, and C. A binary tree may show that species A and B share a more
recent common ancestor than they do with species C.

In this example:
i. The internal node "Ancestor1" is the common ancestor of all three species.
ii. "Ancestor2" is the shared ancestor of A and B only.
iii. A, B, and C are the leaf nodes.

B. Rooted vs. Unrooted Trees

51
Rooted Trees: A rooted tree has a designated root that represents the most recent common
ancestor of all entities in the tree. It shows the direction of evolutionary time, starting from the root
and diverging toward the tips.
Example:
A rooted tree with species A, B, C, and D might look like this:

Here:
❖ The root indicates the origin of all evolutionary relationships in the tree.
❖ The tree is "directional," reflecting evolutionary divergence over time.
Unrooted Trees:
An unrooted tree depicts the relationships between species without assuming a common ancestor
or evolutionary direction. It focuses solely on the genetic or physical similarities.
8.1.2 Applications of Tree Structures in Phylogenetics:
1. Binary Trees: Useful for modeling speciation events and calculating evolutionary
distances.
2. Rooted Trees: Ideal for tracing ancestry and evolutionary timelines.
3. Unrooted Trees: Helpful for identifying clusters of related species based on similarity
metrics.
8.1.3 Example Use Case:
Consider a study analyzing the evolutionary relationships among human, chimpanzee, gorilla,
and orangutan DNA sequences. A rooted binary tree can depict their divergence from a common
primate ancestor, while an unrooted tree can compare genetic similarity without suggesting
direct ancestry.

52
8.2 Tree construction methods: Distance-based (UPGMA, Neighbor-Joining) and character-
based methods (maximum parsimony, maximum likelihood)
Phylogenetic tree construction methods aim to represent evolutionary relationships among a set of
organisms, genes, or other biological units. These methods can be broadly categorized into
distance-based and character-based approaches. Below is a detailed explanation of each method,
including their principles and examples.
8.2.1 Distance-Based Methods
These methods rely on pairwise distance measures between sequences or species. They assume
that the evolutionary relationships can be inferred from the overall similarity between pairs of
sequences.
a. UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
Principle:
UPGMA is a hierarchical clustering method that assumes a constant rate of evolution (molecular
clock). It groups taxa (a group of one or more populations of an organism or organisms seen by
taxonomists to form a unit) based on their pairwise distances and produces a rooted tree.
Steps:
1. Compute a pairwise distance matrix.
2. Find the closest two taxa (smallest distance).
3. Merge the two taxa into a single cluster and compute the new distances to all other taxa.
4. Repeat until all taxa are joined in a single tree.
Example: Given four species (A, B, C, D) with the distance matrix:

• Merge A and B (distance 5).


• Recalculate distances for the new cluster (AB).

53
• Continue until all species are joined.
Advantages:
• Simple and fast.
• Suitable for datasets where the molecular clock assumption holds.
Limitations:
• Assumes a constant rate of evolution, which may not be realistic.

b. Neighbor-Joining (NJ)
Principle:
Neighbor-Joining relaxes the molecular clock assumption and is more flexible than UPGMA. It
identifies pairs of taxa that minimize the total branch length of the tree.
Steps:
1. Compute a distance matrix.
2. Calculate the neighbor-joining matrix to identify the closest pair of taxa.
3. Join the pair into a new cluster and update the distance matrix.
4. Repeat until all taxa are joined.
Advantages:
• Does not require a molecular clock assumption.
• Produces an unrooted tree, which can later be rooted if needed.
Limitations:
• May produce less accurate results with highly divergent sequences.

8.2.2 Character-Based Methods


These methods analyze the individual characters (e.g., nucleotide or amino acid positions) in the
sequences to infer evolutionary relationships.
a. Maximum Parsimony (MP)
Principle:
Maximum parsimony seeks the tree that requires the fewest evolutionary changes to explain the
observed data.
Steps:
1. Identify variable sites in the sequences.

54
2. Generate all possible tree topologies.
3. Calculate the number of changes (steps) required for each tree.
4. Select the tree with the minimum number of steps.

8.2.3 Comparison of Methods

8.3 Applications of phylogenetics in comparative genomics

Phylogenetics, the study of evolutionary relationships among biological entities, is an essential


tool in comparative genomics. By constructing and analyzing phylogenetic trees, researchers can
investigate genetic similarities and differences across species, understand evolutionary processes,
and predict functional elements in genomes. Below are the key applications of phylogenetics in
comparative genomics:

1. Identifying Orthologous and Paralogous Genes


Phylogenetics helps distinguish between orthologs (genes in different species derived from a
common ancestor) and paralogs (genes related by duplication within the same genome). This
distinction is crucial for understanding gene function and evolutionary relationships.

55
Example: In humans and mice, the gene PAX6, involved in eye development, is an ortholog.
Phylogenetic analysis confirms that these genes originated from a common ancestor and perform
similar functions in both species.

2. Inferring Ancestral Genomes


Using phylogenetics, scientists reconstruct ancestral genome sequences, providing insights into
the genomes of extinct species or early evolutionary stages.
Example: Reconstruction of the genome of the last universal common ancestor (LUCA) has
provided a framework to study the early evolution of life and essential genes conserved across all
domains of life.

3. Tracing Horizontal Gene Transfer (HGT)


Horizontal gene transfer, the exchange of genetic material between unrelated species, is traced
through phylogenetic incongruences where gene trees differ significantly from species trees.
Example: The discovery of antibiotic resistance genes in pathogenic bacteria like Escherichia coli
and Salmonella has been attributed to HGT from other bacterial species, supported by phylogenetic
evidence.

4. Studying Genome Evolution


Phylogenetics uncovers the mechanisms driving genome evolution, such as gene duplications,
inversions, transpositions, and deletions.
Example: Comparative studies in plant genomes, like wheat and maize, reveal large-scale
duplications and rearrangements using phylogenetic methods, shedding light on the evolution of
agronomic traits.

5. Predicting Gene Function


By analyzing the evolutionary conservation of genes across species, phylogenetics helps predict
unknown gene functions.
Example: The human gene BRCA1, known for its role in DNA repair, was initially characterized
using phylogenetic comparisons with yeast and other model organisms.

56
6. Identifying Conserved Non-Coding Regions
Phylogenetic comparisons highlight conserved non-coding DNA sequences, which are often
regulatory elements critical to gene expression.
Example: The HOX gene clusters, involved in body plan development, include highly conserved
non-coding regions, identified through phylogenetic alignment across vertebrates.

7. Understanding Pathogen Evolution


Phylogenetics plays a vital role in tracking the evolution and spread of pathogens, aiding in vaccine
development and disease control.
Example: Phylogenetic studies of SARS-CoV-2, the virus causing COVID-19, revealed its
evolutionary origins, mutational hotspots, and global transmission patterns, informing public
health strategies.

8. Drug Discovery and Resistance


Comparative phylogenetics helps identify potential drug targets and understand the evolution of
drug resistance.
Example: In malaria research, phylogenetics of Plasmodium species has helped identify genetic
markers of drug resistance and new targets for antimalarial drug development.

9. Insights into Speciation Events


Phylogenetic analysis allows researchers to pinpoint when and how speciation events occurred,
linking genomic changes to new species formation.
Example: The divergence between humans and chimpanzees approximately 6–7 million years ago
has been explored through comparative genomics and phylogenetic trees.

10. Evolutionary Rate Analysis


Phylogenetics is used to estimate the rates of molecular evolution, providing insights into selective
pressures acting on genes.
Example: Genes like FOXP2, associated with human speech, show accelerated evolution in
humans compared to other primates, identified using phylogenetic rate analyses.

57
MODULE 9: MACHINE LEARNING IN BIOINFORMATICS

9.1 Introduction to Deep Learning


Deep Learning is a subset of Machine Learning (ML) that utilizes artificial neural networks to
model and solve complex problems. It involves training algorithms to learn patterns and
representations from large amounts of data, often in an unsupervised or supervised manner. Deep
learning is widely used for tasks like image recognition, natural language processing, and time-
series forecasting. Some key concepts of Deep Learning architecture include
❖ Neurons and Layers: Inspired by biological neurons, artificial neurons (nodes) are
arranged in layers, including input, hidden, and output layers.
❖ Activation Functions: These introduce non-linearity to the network, enabling it to model
complex patterns. Common activation functions include ReLU (Rectified Linear Unit),
sigmoid, and tanh.
❖ Backpropagation: The algorithm used to minimize error in predictions by adjusting
weights in the network.
❖ Gradient Descent: An optimization algorithm to minimize the loss function.

9.1.1 Multi-Layer Perceptron (MLP)


An MLP is a class of feedforward neural networks composed of multiple layers of neurons. Each
layer is fully connected to the next, meaning every neuron in one layer connects to every neuron
in the next.
❖ Structure: Comprises an input layer, one or more hidden layers, and an output layer.
❖ Applications: Basic classification tasks like digit recognition (MNIST dataset).

9.1.2 Convolutional Neural Networks (CNNs)


CNNs are specialized for processing structured grid data like images. They use convolutional
layers to extract features like edges, shapes, and textures. It is applied in Image and video
recognition, object detection etc. Examples are seen in CNNs power facial recognition systems
by extracting unique facial features for identification.

58
Key Components:
i. Convolutional Layers: Apply filters to extract spatial features.
ii. Pooling Layers: Reduce spatial dimensions (e.g., max pooling).
iii. Fully Connected Layers: Connect features to output predictions.

9.1.3 Recurrent Neural Networks (RNNs)


RNNs are designed for sequential data where the order of inputs matters. They have feedback
connections that allow information to persist across time steps.
• Structure: Neurons have connections to themselves, enabling memory of previous inputs.
• Applications: Time-series forecasting, language modeling.

9.1.4 Long Short-Term Memory (LSTM)


LSTMs are a type of RNN designed to solve the problem of long-term dependency and vanishing
gradients in standard RNNs. Applications are seen in speech recognition, text generation. Key
Features include:
i. Cell State: Stores information over long periods.
ii. Gates: Mechanisms (forget, input, and output gates) that regulate information flow.
Example: Using an LSTM to translate sentences between languages (e.g., English to French).

9.1.5 Residual Networks (ResNet)


ResNets address the problem of vanishing gradients in very deep networks by introducing "skip
connections" or residual links.
• Key Concept: Instead of learning a direct mapping, ResNets learn the residuals or
differences between layers.
• Advantages: Enable the training of very deep networks without performance degradation.
• Applications: Image classification, object detection.
Example: A ResNet model trained on the ImageNet dataset achieves state-of-the-art accuracy in
recognizing thousands of objects.

9.2 Deep Learning in genomics & Protein Structure

59
Deep learning has revolutionized bioinformatics and computational biology, particularly in the
areas of genomics and protein structure analysis. These domains leverage deep neural networks to
uncover patterns, predict functions, and infer structures from complex biological data.

9.2.1 Deep Learning in Genomics

Genomics involves the study of the genome, which includes the sequencing, analysis, and
functional mapping of DNA. Deep learning has provided breakthroughs in processing large-scale
genomic data and deriving meaningful insights.

Key Applications:

1. Gene Expression Prediction: Predicting how genes are expressed in different cell types
under varying conditions. Example: Deep learning models like DeepSEA predict the impact
of noncoding variants on gene expression and chromatin accessibility. These models use
convolutional neural networks (CNNs) to analyze DNA sequences directly.

2. Variant Calling: Identifying genetic variants from raw sequencing data. Example: Tools
like DeepVariant use deep learning to enhance the accuracy of variant calling from next-
generation sequencing (NGS) data.

3. Enhancer-Promoter Interactions: Identifying interactions between distant genomic


regions to understand regulatory mechanisms. Example: Models like DeepHiC reconstruct
enhancer-promoter loops from Hi-C data, aiding in mapping the 3D genome.

9.2.2 Deep Learning in Protein Structure

Proteins are essential biological molecules, and understanding their 3D structure is crucial for drug
design and understanding cellular mechanisms. Deep learning has made remarkable advances in
protein structure prediction and function annotation.

Key Applications:

1. Protein Structure Prediction: Predicting a protein's 3D structure from its amino acid
sequence. Example: AlphaFold by DeepMind uses attention mechanisms and transformer
models to predict protein structures with near-experimental accuracy.

2. Protein-Protein Interaction Prediction: Predicting interactions between proteins that are


fundamental to cellular pathways. Example: Graph neural networks (GNNs) model
interactions based on protein sequence and structural information, aiding in understanding
complex pathways like signal transduction.

3. Functional Annotation of Proteins: Inferring the function of uncharacterized proteins


based on sequence and structure. Example: Models like ProtTrans apply transformer
architectures to annotate protein functions directly from sequence data.

60
Examples and Practical Impacts

1. COVID-19 Vaccine Development

2. CRISPR-Cas9 Targeting

3. Antimicrobial Resistance (AMR)

Challenges and Future Directions

❖ Data Quality and Diversity: Training data must be high-quality and representative of
diverse biological systems.

❖ Interpretability: Deep learning models often function as "black boxes," making biological
interpretation challenging.

❖ Integration with Experimental Biology: Bridging computational predictions with


experimental validation remains crucial.

61

You might also like