CSC215 Complete Note
Course Outline
Module 1: Bioinformatics and Algorithms
1.1 Introduction to Algorithms for Bioinformatics
1.2 What is bioinformatics
1.3 Role of Computer Science in Bioinformatics
1.4 Software tools commonly used in bioinformatics (e.g., BLAST, FASTA)
1.5 Applications of Bioinformatics Algorithms
Module 2: DNA, RNA, and Protein Sequence Data
2.1 Overview of Biological Sequences
2.2 Representation of DNA, RNA, and Protein Sequences
2.3 Central Dogma of Molecular Biology
2.4 FASTA and Other Sequence File Formats
2.5 Overview of Nucleotide and Protein Databases
Module 3: Algorithmic Problem Solving
3.1 Introduction to algorithms: Design, analysis, and complexity
3.2 Algorithmic paradigms: Brute force, divide-and-conquer, greedy algorithms, dynamic
programming
3.3 Time and space complexity (Big O notation)
3.4 Application of algorithms to biological problems
Module 4: Sequence Alignment Algorithms
4.1 Biological motivation: Importance of sequence alignment in genomics
4.2 Global alignment (Needleman-Wunsch algorithm)
4.3 Local alignment (Smith-Waterman algorithm)
4.4 Scoring matrices: PAM, BLOSUM
Module 5: Multiple Sequence Alignment
5.1 Biological significance of multiple sequence alignment (MSA)
5.2 Progressive alignment algorithms (e.g., CLUSTALW)
5.3 Iterative methods and refinement (e.g., MUSCLE)
5.4 Advanced sequence alignment Algorithms (protein sequence alignment)
Module 6: Heuristics for Sequence Alignment
6.1 Challenges in aligning large datasets
6.2 Heuristic methods for rapid alignment: BLAST and FASTA algorithms
6.3 Trade-offs between accuracy and performance in heuristic algorithms
6.4 Case studies: Using BLAST in genomic research
Module 7: Hidden Markov Models (HMM) in Bioinformatics
7.1 Overview of Hidden Markov Models, Markov processes and biological sequences
7.2 Components of HMMs: States, transitions, emissions
7.3 Probability estimation in HMM, finding maximum assignment
7.4 Application of HMMs in biological sequence modeling (gene prediction, protein family
classification)
Module 8: Phylogenetic Trees
8.1 Introduction to tree structures: Binary trees, rooted vs unrooted trees
8.2 Tree construction methods: Distance-based (UPGMA, Neighbor-Joining) and character-based
methods (maximum parsimony, maximum likelihood)
8.3 Algorithms for Phylogenetic Tree Construction (Hierarchical clustering algorithms
(UPGMA), Neighbor-joining algorithm)
8.4 Applications of phylogenetics in comparative genomics
Module 9: Machine Learning in Bioinformatics
9.1 Introduction to Deep Learning (basic concepts, MLP, CNN, recurrent NN, LSTM, ResNet)
9.2 Deep Learning in genomics & Protein Structure
Textbooks:
“Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology” by
Dan Gusfield
“Bioinformatics Algorithms: An Active Learning Approach” by Phillip Compeau and Pavel
Pevzner
LECTURE ONE
MODULE 1: BIOINFORMATICS AND ALGORITHMS
1.1 Introduction to Algorithms for Bioinformatics
1.2 What is bioinformatics
1.3 Role of Computer Science in Bioinformatics
1.4 Software tools commonly used in bioinformatics (e.g., BLAST, FASTA)
1.5 Applications of Bioinformatics Algorithms
Learning Objectives:
• Understand the scope of bioinformatics and its importance in modern biology.
• Learn basic bioinformatics algorithms and the role of computer science in bioinformatics.
• Explain areas where bioinformatics algorithms are applied.
2. Hidden Markov Model (HMM) Algorithms: Probabilistic models used to analyze biological sequences. Examples:
i. HMMER: A software suite that uses HMMs for sequence analysis, particularly in detecting protein families and domains. Example: Searching for homologous proteins across different species.
ii. Profile HMMs: Used to align protein or DNA sequences to a probabilistic model of a
sequence family. Example: Finding similar protein motifs in different organisms.
3. Gene Prediction Algorithms: These algorithms predict the locations of genes within a DNA
sequence. Examples: GENSCAN (a program for predicting the locations of protein-coding genes in human DNA sequences) and Augustus (gene annotation in plant genomes).
4. Phylogenetic Tree Construction Algorithms: These algorithms infer evolutionary relationships
between sequences or species based on sequence data. Examples:
i. Neighbor-Joining Algorithm: A distance-based method for constructing phylogenetic
trees.
ii. Maximum Likelihood (ML) Method: Estimates the tree that best explains the observed
data. Example: Generating a phylogenetic tree for viral strains.
iii. UPGMA (Unweighted Pair Group Method with Arithmetic Mean): A simple clustering
method for creating phylogenetic trees. Example: Grouping bacterial strains based on
genetic similarity.
5. Clustering Algorithms: Group data points (such as gene expression data) into clusters based on
their similarities. Examples:
i. K-Means Clustering: Partitions data into k clusters based on similarities. Example:
Clustering genes with similar expression profiles in microarray data.
ii. Hierarchical Clustering: Builds a hierarchy of clusters. Example: Grouping gene
expression data to identify co-expressed genes in a dataset.
iii. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters data
points based on density. Example: Clustering single-cell RNA sequencing data to identify
cell subtypes.
6. Genome Assembly Algorithms: These algorithms reconstruct a genome from short DNA
sequence reads. Examples:
i. De Bruijn Graph Algorithms: Used to assemble large genomes from short-read sequencing
data. Example: Assembling the genome of a bacterial species from Illumina reads.
ii. Overlap-Layout-Consensus (OLC) Algorithms: Used for long-read sequencing data.
Example: Assembling a yeast genome from PacBio sequencing reads.
iii. SPAdes: A popular assembler that uses de Bruijn graphs for assembling bacterial genomes.
Example: Assembling a microbial genome from short sequencing reads.
7. Motif Finding Algorithm: Identifies recurring patterns (motifs) in sequences that may be
biologically significant, such as transcription factor binding sites. Examples:
i. MEME (Multiple EM for Motif Elicitation): Identifies conserved motifs in DNA or protein
sequences. Example: Finding regulatory motifs in promoter regions of co-expressed genes.
ii. Gibbs Sampling: A probabilistic method for motif finding. Example: Identifying
transcription factor binding sites in gene regulatory networks.
iii. AlignACE: A tool for identifying motifs shared by a set of DNA sequences. Example:
Discovering new motifs in bacterial promoters.
8. Gene Ontology (GO) Enrichment Algorithms: Analyze gene lists to identify over-represented biological terms or functions. Example: GOseq corrects for gene length bias in RNA-Seq GO enrichment analysis, e.g., identifying biological processes enriched in differentially expressed genes.
1. Data Management and Storage
Biological research generates massive amounts of data, such as DNA sequences, protein structures,
and gene expression profiles. Computer science provides databases and data management systems
to store, organize, and retrieve this information efficiently. Examples include databases like
GenBank and Protein Data Bank (PDB).
2. Algorithm Development
Computer science develops algorithms to solve complex bioinformatics problems such as:
Sequence alignment: (e.g., BLAST) for comparing genetic sequences; Genome assembly which
are algorithms to reconstruct genomes from DNA fragments; and Protein structure prediction:
algorithms to understand protein folding and interactions.
3. Machine Learning and AI
Machine learning and artificial intelligence (AI) are used in bioinformatics to identify patterns and
make predictions from biological data. For example, AI can predict disease risk based on genetic
information, classify protein structures, or even assist in drug discovery by analyzing molecular
structures and biological pathways.
4. Data Analysis and Visualization
Bioinformatics involves analyzing complex data, such as gene expression data or protein
interactions. Computer science provides tools for statistical analysis and visualizations that help
researchers understand biological processes, such as creating heat maps, 3D protein models, or
phylogenetic trees.
5. High-Performance Computing (HPC)
Many bioinformatics tasks, such as genome sequencing or protein folding simulations, require
significant computational power. High-performance computing (HPC) systems, cloud computing,
and parallel processing frameworks allow bioinformaticians to process massive datasets and run
complex simulations in a reasonable amount of time.
6. Database Search and Retrieval
Computer science enables the development of fast and efficient search algorithms for querying
biological databases. For instance, searching for similar DNA or protein sequences in a large
database requires optimized search techniques, such as indexing and hashing algorithms.
7. Data Integration
Bioinformatics often requires integrating data from various sources, such as genomic, proteomic,
and clinical data. Computer science provides methods and frameworks for integrating
heterogeneous data types, ensuring that researchers can analyze comprehensive datasets for better
insights into biological systems.
8. Modeling and Simulation
Computer science helps in creating models and simulations of biological systems. For example:
Systems biology uses computational models to understand complex interactions within cells and
organisms. Drug interaction simulations predict how molecules interact with biological targets.
9. Software Development
Many bioinformatics tools and platforms are developed by computer scientists. Software packages
like MATLAB, Bioconductor, and R are widely used in bioinformatics for statistical analysis, data
mining, and visualization. Python and other programming languages are often used to develop
custom tools and pipelines.
10. Cybersecurity in Biomedical Data
With the increasing importance of patient genomic data and health records, computer science
provides cybersecurity solutions to protect sensitive biological information from unauthorized
access or breaches. This includes encryption, secure data sharing, and privacy-preserving
computation techniques.
11. Cloud Computing and Distributed Systems
Cloud computing allows researchers to access bioinformatics tools and datasets remotely,
providing scalable resources for large-scale analyses. Distributed systems enable collaboration
between research groups, allowing for the sharing of data and computational tasks across networks.
iii. Clustal Omega: a tool for multiple sequence alignment, useful for aligning sequences to
study evolutionary relationships.
iv. SPAdes (St. Petersburg Assembler): a genome assembler particularly used for small
genomes such as bacterial genomes.
v. MAKER: a genome annotation pipeline used to predict genes and align them to known
data.
vi. Augustus: it is a gene prediction tool used to annotate genomes based on available evidence
or training.
vii. MEGA (Molecular Evolutionary Genetics Analysis): A tool for constructing phylogenetic
trees based on sequence data.
viii. MrBayes: it is a program for Bayesian inference of phylogeny that uses Markov Chain
Monte Carlo methods.
ix. RAxML (Randomized Axelerated Maximum Likelihood): software for fast and accurate
phylogenetic tree estimation.
x. PyMOL: A molecular visualization tool used for 3D visualization of protein structures.
xi. SWISS-MODEL: A tool for homology modeling, predicting the 3D structure of a protein
based on homologous templates.
xii. MODELLER: A program that generates 3D models of proteins using comparative
modeling techniques.
xiii. AutoDock: a suite of tools for performing molecular docking of small molecules to protein
receptors.
xiv. MOE (Molecular Operating Environment): software platform that integrates visualization,
modeling, and simulation of bio-molecular structures.
xv. GROMACS: molecular dynamics simulation tool that can be used to study protein-ligand
interactions.
1.5 Applications of Bioinformatics Algorithms
i. Sequence Alignment
Example: The Needleman-Wunsch and Smith-Waterman algorithms are used to align DNA or protein sequences, helping researchers identify similarities between species or genes, which may indicate functional or evolutionary relationships.
ii. Gene Prediction
Example: Hidden Markov Models (HMMs) are used in gene prediction algorithms like
Genscan to identify coding regions (exons) in genomic sequences, aiding in the discovery of
new genes.
iii. Protein Structure Prediction
Example: Algorithms like PSIPRED predict secondary structures (alpha-helices and beta-sheets)
of proteins from their amino acid sequences, which is crucial for understanding protein function
and drug design.
v. Genome Assembly
Example: De Bruijn Graph algorithms are used in genome assembly tools like SPAdes to
reconstruct genomes from short sequencing reads, as in assembling bacterial genomes from next-
generation sequencing data.
vi. Gene Expression Analysis
Example: Clustering algorithms like K-means are used to group genes with similar expression
patterns from microarray or RNA-Seq data, helping in identifying co-regulated genes in a
biological pathway.
LECTURE TWO
MODULE 2: DNA, RNA, AND PROTEIN SEQUENCE DATA
2.1 Overview of Biological Sequences
2.2 Representation of DNA, RNA, and Protein Sequences
2.3 Central Dogma of Molecular Biology
2.4 FASTA and Other Sequence File Formats
2.5 Overview of Nucleotide and Protein Databases
Learning Objectives:
• Representation of biological sequences computationally.
• Explore standard file formats for bioinformatics and database resources.
DNA Sequences: DNA sequences are represented by the letters A, T, C, and G, corresponding to
the four nucleotides. Example: ‘ATCGGCTA’
RNA Sequences: RNA sequences are represented similarly but with U (Uracil) replacing T
(Thymine). Example: ‘AUCGGCUA’
1. Replication: DNA makes copies of itself. During replication, the double helix unwinds, and each
strand serves as a template for creating a new complementary strand, resulting in two identical
DNA molecules.
2. Transcription: The process by which a cell copies a specific segment of its DNA into RNA. Depending on its function, this RNA can be mRNA (messenger RNA), which carries the instructions for making proteins; tRNA (transfer RNA), which helps build proteins; or rRNA (ribosomal RNA), which forms part of the machinery that assembles proteins.
3. Translation: mRNA is translated into protein by the ribosomes, where each codon (triplet of nucleotides) corresponds to an amino acid.
The central dogma can be summarized as: DNA→RNA→Protein
Mutations or errors in these processes can lead to dysfunctional proteins, which can cause various
diseases, including cancer and genetic disorders.
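As a small computational illustration of the first two steps (not part of the original note), the sketch below complements a DNA strand and transcribes the coding strand into mRNA by replacing T with U; the toy sequence and function names are made up for the example.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna):
    # Return the reverse complement of a DNA strand (read 5' to 3')
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def transcribe(coding_strand):
    # Transcription of the coding strand into mRNA: thymine (T) becomes uracil (U)
    return coding_strand.replace("T", "U")

dna = "ATGGCTTAA"                     # toy coding-strand sequence
print("Template strand:", reverse_complement(dna))
print("mRNA:           ", transcribe(dna))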
1. FASTA Format: A simple, widely used text format in which each sequence is preceded by a header line beginning with '>', followed by the sequence itself on one or more lines.
2. GenBank Format: Another format used for nucleotide sequences. It contains both the sequence
and rich annotation details, such as gene structure and function.
3. PDB (Protein Data Bank): This format is used for representing 3D structures of proteins. Each
file contains atomic coordinates that describe the spatial arrangement of a molecule.
4. GFF (General Feature Format): Used for representing genomic features like genes, exons,
and coding regions.
These formats are crucial for data exchange and are widely supported by various bioinformatics
tools.
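For illustration, the short sketch below (an assumption, not part of the original note) reads a FASTA file into a dictionary; each record starts with a '>' header line and its sequence may be wrapped over several lines. The file name example.fasta is a hypothetical placeholder.

def read_fasta(path):
    # Return a dictionary mapping each FASTA header to its full sequence
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]          # drop the '>' character
                records[header] = []
            elif header is not None:
                records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

# Example usage (assuming example.fasta exists):
# for name, seq in read_fasta("example.fasta").items():
#     print(name, len(seq))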
LECTURE THREE
MODULE 3: ALGORITHMIC PROBLEM SOLVING
3.1 Introduction to algorithms: Design, analysis, and complexity
3.2 Algorithmic paradigms: Brute force, divide-and-conquer, greedy algorithms, dynamic
programming
3.3 Time and space complexity (Big O notation)
3.4 Application of algorithms to biological problems
Learning objectives
• Understand the Basics of Algorithms
• Differentiate Algorithmic Paradigms
• Students will be able to evaluate the time and space complexity of algorithms using Big O
notation.
❖ Algorithm: An algorithm is a step-by-step procedure designed to solve a problem.
❖ Input: Once the algorithm is designed, it is given the necessary inputs.
❖ Processing unit: The input is passed to the processing unit, which produces the desired output.
❖ Output: The outcome or result of the program is referred to as the output.
3.1.1 Algorithm Design
Algorithm design refers to the process of developing step-by-step procedures to solve
computational problems. In bioinformatics, the focus is on creating efficient and scalable
algorithms that handle large datasets like genome sequences, protein structures, and biological
networks. Key techniques used in designing algorithms include:
❖ Greedy Algorithms: These algorithms make locally optimal choices at each step, aiming to
find a global optimum. Examples in bioinformatics include motif finding and sequence
assembly.
❖ Divide and Conquer: This technique splits a problem into smaller sub-problems, solves
each sub-problem recursively, and combines their solutions. Applications include sequence
alignment and phylogenetic tree construction.
❖ Dynamic Programming: A method used for optimization problems, breaking them down
into overlapping sub-problems and solving each sub-problem once. This is widely used in
sequence alignment algorithms such as Needleman-Wunsch and Smith-Waterman.
❖ Graph-based Algorithms: Many bioinformatics problems can be represented using graphs
(e.g., protein interaction networks, gene regulatory networks). Algorithms such as
Dijkstra’s for shortest path and graph traversal methods (BFS, DFS) are crucial in
analyzing these biological networks.
3.1.2 Algorithm Analysis
Algorithm analysis involves evaluating the performance of an algorithm in terms of time and space
resources (i.e., computational complexity). The goal is to understand how efficiently an algorithm
performs, especially for large datasets typical in bioinformatics. Two critical measures in
algorithm analysis are time complexity and space complexity
1. Time Complexity: The amount of time an algorithm takes to solve a problem as a function of
the input size, often expressed in Big-O notation.
2. Space Complexity: The amount of memory an algorithm uses, also expressed in Big-O
notation.
3.1.3 Complexity Classes
Understanding the complexity of problems is vital in bioinformatics, where some tasks may be
computationally infeasible due to the large size of biological data. The complexity of problems is
often classified into:
a. P (Polynomial Time): These are problems that can be solved by an algorithm in polynomial
time, such as sequence alignment and finding the shortest path in a graph. Problems in P are
considered “efficiently solvable”.
b. NP (Nondeterministic Polynomial Time): Problems where a solution can be verified in
polynomial time, but finding the solution may not be possible in polynomial time. An example
from bioinformatics is protein folding prediction.
c. NP-Hard and NP-Complete: NP-hard problems are at least as hard as NP problems, and NP-
complete problems are both in NP and NP-hard. These problems are often intractable for large
inputs, and approximations or heuristics are used. Many bioinformatics problems, such as genome
assembly, are NP-hard.
3.1.4 Common Bioinformatics Algorithms and Their Complexity: Several bioinformatics
algorithms have become standard tools for analyzing biological data, each with distinct design and
complexity features:
❖ BLAST (Basic Local Alignment Search Tool): This heuristic algorithm is designed for
searching protein and nucleotide databases. It uses a fast local alignment approach but may
not guarantee the optimal alignment. Its time complexity is approximately O(mn), where m and n are the lengths of the sequences being compared.
❖ Needleman-Wunsch Algorithm: A dynamic programming algorithm for global sequence
alignment. Its time complexity is O(mn), where m and n are the lengths of the sequences being aligned.
❖ Smith-Waterman Algorithm: Another dynamic programming algorithm for local sequence alignment. It also has a time complexity of O(mn), but it is computationally expensive for large datasets.
❖ Hidden Markov Models (HMMs): Used for tasks like gene prediction, HMMs are probabilistic models with algorithms such as the Viterbi algorithm having a time complexity of O(TN^2), where T is the sequence length and N is the number of hidden states.
3.1.5 Importance of Algorithm Optimization in Bioinformatics
Bioinformatics datasets are often massive (e.g., genome sequencing, proteomics data), and
processing them efficiently requires carefully designed and optimized algorithms. Optimizing
algorithms for:
Speed: Ensures that biological datasets are analyzed in reasonable time frames.
Memory Usage: Helps conserve computational resources, especially in constrained environments.
Scalability: Ensures algorithms can handle growing datasets as bioinformatics data increases
exponentially due to advancements in sequencing technologies.
3.2.1 Brute Force Paradigm
A brute force algorithm solves a problem by systematically trying every possible solution. Its advantage is that it is simple to implement and useful for small datasets or problems where correctness is more important than efficiency; its disadvantage is that it is impractical for large datasets due to long execution times and inefficient for complex problems with many possibilities.
Characteristics
❖ Exhaustive Search: Brute force algorithms try every possible option.
❖ Simplicity: These algorithms are easy to implement but can be inefficient.
❖ Guaranteed Solution: Because every option is explored, the correct solution is guaranteed.
Example of Brute Force Paradigm in Bioinformatics:
a. Sequence Alignment: A brute force approach to aligning two DNA sequences would involve
trying all possible alignments and calculating the score for each one. This guarantees finding the
best alignment but is computationally infeasible for long sequences due to the exponential number
of possible alignments.
b. Time Complexity: Usually very high, often exponential (e.g., O(2^n)) or factorial (e.g., O(n!)).
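To make the exhaustive-search idea concrete, the toy sketch below (an illustration only, with an assumed +1/-1 scoring) tries every possible ungapped placement of a short query against a longer reference and keeps the best one; trying all gapped alignments would grow exponentially, which is exactly why brute force does not scale.

def brute_force_ungapped(reference, query, match=1, mismatch=-1):
    best_score, best_offset = float("-inf"), None
    # Exhaustively evaluate every possible starting offset of the query
    for offset in range(len(reference) - len(query) + 1):
        score = sum(match if reference[offset + i] == q else mismatch
                    for i, q in enumerate(query))
        if score > best_score:
            best_score, best_offset = score, offset
    return best_offset, best_score

print(brute_force_ungapped("ACGTACGTGACC", "ACGTG"))   # -> (4, 5)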
3.2.2 Divide-and-Conquer Paradigm
Divide-and-conquer is an algorithmic paradigm that breaks a problem into smaller sub-problems,
solves each sub-problem recursively, and then combines their solutions to solve the original
problem. Its advantages are that it is more efficient for larger problems, can exploit parallelism (since sub-problems can be solved independently), and often leads to logarithmic reductions in time complexity. Its disadvantages are that it requires recursive thinking, which can make implementation complex, and that merging solutions can sometimes be tricky or computationally expensive.
Steps of divide and Conquer paradigm
1. Divide: Split the problem into smaller, similar sub-problems.
2. Conquer: Solve each sub-problem independently.
3. Combine: Merge the solutions of the sub-problems to form a solution for the original problem.
Example in Bioinformatics:
Phylogenetic Tree Construction:
a. Divide-and-conquer algorithms are used in phylogenetics to construct evolutionary trees by
breaking down the data into smaller groups of species, calculating relationships for each group,
and then merging them to form a complete tree.
b. Another common example is Merge Sort or Quick Sort applied to sorting biological data like
gene sequences or protein structures.
c. Time Complexity: Typically, O(nlogn) for many divide-and-conquer algorithms, which makes
them much more efficient than brute force for large inputs.
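The sketch below shows the classic divide-and-conquer pattern using merge sort, here applied to sorting DNA fragments by length (an illustrative choice of data, not a phylogenetics algorithm): the list is divided, each half is conquered recursively, and the sorted halves are combined.

def merge_sort(items, key=len):
    if len(items) <= 1:                      # base case
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], key)      # divide and conquer the left half
    right = merge_sort(items[mid:], key)     # divide and conquer the right half
    merged, i, j = [], 0, 0                  # combine the two sorted halves
    while i < len(left) and j < len(right):
        if key(left[i]) <= key(right[j]):
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged

fragments = ["ATCGGA", "AT", "GATTACA", "CGG"]
print(merge_sort(fragments))                 # shortest fragment to longest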
3.2.3. Greedy Algorithms Paradigm
A greedy algorithm makes a series of choices, each of which looks best at the moment (locally
optimal), with the hope that these local choices will lead to a globally optimal solution. It solves
the problem step by step by choosing the next step that provides the most immediate benefit.
Advantages:
a. Simple to implement and fast.
b. Suitable for problems where the locally optimal solution leads to a globally optimal solution.
Disadvantages:
a. May not always produce the optimal solution for all problems.
b. Greedy choices are irreversible, making it unsuitable for problems that require reconsideration
of earlier decisions.
Characteristics:
❖ Locally Optimal Choices: Decisions are made based on immediate benefits.
❖ No Backtracking: Once a choice is made, it cannot be undone.
❖ Fast: Greedy algorithms are usually fast and simple.
Examples of Greedy algorithms in Bioinformatics:
a. Genome Assembly: A greedy algorithm can be used to assemble DNA fragments by choosing
the pair of fragments with the largest overlap and merging them until a full sequence is constructed.
Although greedy methods are fast, they may not always give the best (optimal) solution in complex
cases.
b. Huffman Coding for compressing biological data sequences, where the algorithm constructs an
optimal prefix code based on the frequencies of nucleotides or amino acids.
c. Time Complexity: Often linear or polynomial in time, such as O(nlogn) for many greedy
algorithms like Huffman coding.
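A simplified sketch of the greedy assembly idea in example (a) is given below: at every step it merges the pair of fragments with the largest suffix/prefix overlap. Real assemblers are far more sophisticated; the fragments here are invented for illustration.

def overlap(a, b):
    # Length of the longest suffix of a that equals a prefix of b
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(fragments):
    frags = list(fragments)
    while len(frags) > 1:
        # Greedy choice: the pair with the largest overlap right now
        size, a, b = max(((overlap(x, y), x, y)
                          for x in frags for y in frags if x != y),
                         key=lambda t: t[0])
        frags.remove(a); frags.remove(b)
        frags.append(a + b[size:])           # merge the chosen pair
    return frags[0]

print(greedy_assemble(["ATTAGAC", "GACCTG", "CTGAAA"]))   # ATTAGACCTGAAA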
3.2.4. Dynamic Programming Paradigm
Dynamic programming is an optimization technique used to solve problems by breaking them
down into overlapping sub-problems, solving each sub-problem just once, and storing their
solutions in a table (memoization). This avoids redundant calculations and reduces time
complexity compared to brute force approaches.
Characteristics:
❖ Overlapping sub-problems: sub-problems are solved and reused.
❖ Optimal Substructure: A solution to the problem can be composed from solutions to its
sub-problems.
❖ Memoization or Tabulation: Results of solved sub-problems are stored for future
reference.
Examples of Dynamic Programming Paradigm in Bioinformatics:
a. Sequence Alignment (Dynamic Programming Approach): it is used to solve sequence alignment
problems (e.g., Needleman-Wunsch and Smith-Waterman algorithms) by building a matrix that
stores the optimal alignment scores for all possible prefixes of the sequences. This significantly
reduces computation time compared to a brute force approach.
b. RNA Secondary Structure Prediction where dynamic programming helps predict the most stable
structure by breaking down RNA sequence interactions and storing results of overlapping
substructures.
c. Time Complexity: Typically O(n^2) or O(n^3) for most bioinformatics applications like sequence alignment.
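The small sketch below illustrates the memoization idea on a related problem, the edit distance between two DNA strings: each (i, j) sub-problem is solved once and cached. (The full Needleman-Wunsch and Smith-Waterman algorithms are treated in Module 4.)

from functools import lru_cache

def edit_distance(a, b):
    @lru_cache(maxsize=None)                 # memoize overlapping sub-problems
    def d(i, j):
        if i == 0:
            return j                         # insert the remaining j characters
        if j == 0:
            return i                         # delete the remaining i characters
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j - 1) + cost,   # substitute or match
                   d(i - 1, j) + 1,          # delete from a
                   d(i, j - 1) + 1)          # insert into a
    return d(len(a), len(b))

print(edit_distance("ATCG", "ATGG"))         # 1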
Advantages:
a. More efficient than brute force due to reduced redundant calculations.
b. Guarantees finding the optimal solution for many problems.
Disadvantages:
a. Can require a lot of memory due to the storage of sub-problem solutions.
b. May be more complex to implement compared to simpler paradigms like greedy algorithms.
3.2.5 Comparison of Paradigms in Bioinformatics
In conclusion, understanding and applying the right algorithmic paradigm is essential in solving
bioinformatics problems efficiently. While brute force provides correctness, it is often impractical
for large data, and more advanced paradigms like divide-and-conquer, greedy algorithms, and
dynamic programming provide powerful tools for solving complex problems in a time-efficient
manner. Selecting the appropriate algorithmic approach based on the problem characteristics (e.g.,
overlapping sub-problems, need for global optimality) is key to developing efficient
bioinformatics solutions.
3.3.2. Space Complexity: The amount of memory an algorithm uses, also expressed in Big-O notation. Memory efficiency is essential for bioinformatics applications handling large genomic datasets. The general equation for space complexity is:
S(n) = O(f(n))
Where:
S(n) represents the space complexity.
n is the size of the input.
f(n) is a function that represents how much memory the algorithm uses as a function of n.
O(f(n)) is the Big-O notation describing the upper bound of space usage.
Space Complexity Example:
1. Constant Space: When an algorithm uses a fixed amount of space regardless of input size, the space complexity is O(1) (constant space). Example: a function that only uses a few variables and does not depend on the input size.
2. Linear Space: If an algorithm's space grows directly with the input size n, the space complexity is O(n). Example: storing an array of size n.
3. Recursive Function: For recursive algorithms, space complexity includes both the memory used by variables and the additional space taken by the recursive call stack. Example: a recursive function with depth n that uses O(1) space per call has space complexity O(n).
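The toy functions below contrast these cases on a DNA string (the examples and sizes are illustrative): a running counter uses O(1) extra space, building a list of per-base flags uses O(n) space, and a recursion of depth n occupies O(n) stack space.

def gc_count_constant_space(seq):
    count = 0                                # O(1): a single counter, independent of len(seq)
    for base in seq:
        if base in "GC":
            count += 1
    return count

def gc_flags_linear_space(seq):
    return [base in "GC" for base in seq]    # O(n): one stored flag per base

def length_recursive(seq, i=0):
    if i == len(seq):                        # O(n) call stack: one frame per character
        return 0
    return 1 + length_recursive(seq, i + 1)

seq = "ATGCGC"
print(gc_count_constant_space(seq), sum(gc_flags_linear_space(seq)), length_recursive(seq))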
3.4 Application of Algorithms to Biological Problems
1. Sequence Alignment:
Algorithms like Needleman-Wunsch (global alignment) and Smith-Waterman (local alignment)
are used to compare DNA, RNA, or protein sequences. By aligning sequences, researchers can
identify similarities, evolutionary relationships, and functional regions in genomes.
2. Genome Assembly:
Algorithms for assembling short DNA sequences into longer genomes (such as the de Bruijn
graph approach) help reconstruct entire genomes from fragmented DNA data, critical for genomics
research.
3. Phylogenetic Tree Construction:
Algorithms like neighbor-joining and maximum likelihood are used to infer evolutionary
relationships between species based on genetic data. These trees depict how species have diverged
from common ancestors.
4. Protein Structure Prediction:
Algorithms, including those using dynamic programming and machine learning techniques,
predict the three-dimensional structure of proteins based on their amino acid sequences. Accurate
structure predictions are important for understanding protein function and drug design.
5. Gene Expression Analysis:
Algorithms for clustering and classification help analyze gene expression data from technologies
like microarrays and RNA-seq, allowing researchers to identify genes with similar expression
patterns and study their regulation.
LECTURE FOUR
Module 4: Sequence Alignment Algorithms
4.1 Biological motivation: Importance of sequence alignment in genomics
4.2 Global alignment (Needleman-Wunsch algorithm)
4.3 Local alignment (Smith-Waterman algorithm)
4.4 Scoring matrices: PAM, BLOSUM
Learning Objectives:
• Understand dynamic programming-based algorithms for sequence alignment.
• Learn to use substitution matrices for scoring sequence alignments.
4.2 Global Sequence Alignment: Needleman-Wunsch Algorithm
The Needleman-Wunsch Algorithm is a dynamic programming approach used for global sequence
alignment. It aligns sequences in their entirety, aiming to find the optimal alignment between the
two sequences over their entire length. The algorithm constructs a matrix where one sequence is
placed on the top (along columns) and the other along the side (along rows). It uses the following
steps:
1. Initialization: Create a scoring matrix where the first row and column are initialized with gap
penalties.
2. Matrix Filling: Fill in the matrix based on the scoring scheme, taking into account matches,
mismatches, and gap penalties.
3. Traceback: Starting from the bottom-right of the matrix, trace back to determine the optimal
alignment.
Algorithm Steps:
1. Input:
❖ Two sequences: A = A_1, A_2, ..., A_m and B = B_1, B_2, ..., B_n.
❖ Substitution matrix for match/mismatch scores.
❖ Gap penalty.
2. Initialization:
Let F (i, 0) = -d * i for all i = 0 to m (for row).
Let F (0, j) = -d * j for all j = 0 to n (for column).
3. Recurrence Relation:
F(i, j) = max {
F(i-1, j-1) + s(A_i, B_j), # match/mismatch
F(i-1, j) - d, # insertion (gap in B)
F(i, j-1) - d # deletion (gap in A)
}
N.B.: The function F(i, j) computes a score for aligning two sequences (let's call them A and B).
The indices i and j refer to specific positions in these sequences:
• A_i: The i-th element of sequence A
• B_j: The j-th element of sequence B
The Components:
1. F(i-1, j-1) + s(A_i, B_j):
1. This part considers the case where you match or mismatch the characters at A_i
and B_j.
2. s(A_i, B_j) is a scoring function that gives a positive score for a match and a
negative score for a mismatch.
3. So, this expression takes the score from aligning the previous characters (i-1, j-
1) and adds the score from the current characters.
2. F (i-1, j) - d:
1. This part accounts for an insertion (a gap in sequence B).
2. It looks at the score for aligning the previous character of A with the current
character of B, but reduces the score by d (the penalty for introducing a gap).
3. F (i, j-1) - d:
1. This part considers a deletion (a gap in sequence A).
2. It checks the score for aligning the current character of A with the previous
character of B and also reduces the score by d for the gap.
Putting It All Together:
• F(i, j) takes the maximum of these three options:
1. Match/mismatch score (align both A_i and B_j)
2. Insertion (allowing a gap in B)
3. Deletion (allowing a gap in A)
4. Traceback:
- Starting from `F(m, n)`, trace back to `F(0, 0)` to recover the optimal alignment.
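A minimal Python sketch of these steps is shown below; the linear gap penalty d = 2 and the +1/-1 match/mismatch scores are illustrative assumptions rather than values fixed by the algorithm.

def needleman_wunsch(A, B, match=1, mismatch=-1, d=2):
    m, n = len(A), len(B)
    # Initialization: first row and column filled with gap penalties
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -d * i
    for j in range(1, n + 1):
        F[0][j] = -d * j
    # Matrix filling using the recurrence above
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if A[i - 1] == B[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,    # match/mismatch
                          F[i - 1][j] - d,        # gap in B
                          F[i][j - 1] - d)        # gap in A
    # Traceback from F(m, n) to F(0, 0)
    alignA, alignB = "", ""
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and A[i - 1] == B[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            alignA, alignB = A[i - 1] + alignA, B[j - 1] + alignB
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            alignA, alignB = A[i - 1] + alignA, "-" + alignB
            i -= 1
        else:
            alignA, alignB = "-" + alignA, B[j - 1] + alignB
            j -= 1
    return alignA, alignB, F[m][n]

print(needleman_wunsch("ATCG", "ATCC"))      # ('ATCG', 'ATCC', 2)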
4.3 Local Sequence Alignment: Smith-Waterman Algorithm
The Smith-Waterman Algorithm is another dynamic programming approach but for local sequence
alignment. Unlike the Needleman-Wunsch algorithm, it finds the best matching subsequence
between two sequences. This method is ideal when you want to align sequences that only partially
overlap. The algorithm constructs a similar matrix to Needleman-Wunsch but differs in the
following:
❖ Negative scores are replaced with zeros to ensure no penalty for misalignments outside the
local alignment.
❖ Traceback starts from the highest scoring cell and ends when a cell with a score of zero is
reached.
Algorithm Steps:
1. Input:
1. Two sequences: A = A_1, A_2, ..., A_m and B = B_1, B_2, ..., B_n.
2. Substitution matrix for match/mismatch scores.
3. Gap penalty.
2. Initialization:
F(i, 0) = 0 for all i.
F(0, j) = 0 for all j.
3. Recurrence Relation:
F(i, j) = max {
0,
F(i-1, j-1) + s(A_i, B_j), # match/mismatch
F(i-1, j) - d, # insertion (gap in B)
F(i, j-1) - d # deletion (gap in A)
}
4. Traceback:
Starting from the highest scoring cell in the matrix, trace back to a cell with a score of zero.
Python implementation:
def smith_waterman(seq1, seq2, match_score=1, mismatch_score=-1, gap_penalty=-2):
    m, n = len(seq1), len(seq2)
    score_matrix = [[0 for _ in range(n+1)] for _ in range(m+1)]
    # Fill the score matrix and remember the highest-scoring cell
    max_score = 0
    max_pos = (0, 0)
    for i in range(1, m+1):
        for j in range(1, n+1):
            match = score_matrix[i-1][j-1] + (match_score if seq1[i-1] == seq2[j-1] else mismatch_score)
            delete = score_matrix[i-1][j] + gap_penalty
            insert = score_matrix[i][j-1] + gap_penalty
            score_matrix[i][j] = max(0, match, delete, insert)
            if score_matrix[i][j] > max_score:
                max_score = score_matrix[i][j]
                max_pos = (i, j)
    # Traceback from the highest-scoring cell until a zero score is reached
    align1, align2 = "", ""
    i, j = max_pos
    while i > 0 and j > 0 and score_matrix[i][j] > 0:
        diag = score_matrix[i-1][j-1] + (match_score if seq1[i-1] == seq2[j-1] else mismatch_score)
        up = score_matrix[i-1][j] + gap_penalty
        if score_matrix[i][j] == diag:          # match/mismatch move
            align1 += seq1[i-1]
            align2 += seq2[j-1]
            i -= 1
            j -= 1
        elif score_matrix[i][j] == up:          # gap in seq2
            align1 += seq1[i-1]
            align2 += '-'
            i -= 1
        else:                                   # gap in seq1
            align1 += '-'
            align2 += seq2[j-1]
            j -= 1
    return align1[::-1], align2[::-1]

# Example Usage
seq1 = "GATTACA"
seq2 = "GCATGCU"
alignment = smith_waterman(seq1, seq2)
print("Alignment:", alignment)
Development of PAM:
❖ The PAM matrix was developed by Margaret Dayhoff in the 1970s.
❖ It was constructed by comparing closely related protein sequences and determining how
frequently one amino acid changes into another over evolutionary time.
Characteristics:
❖ PAM1: Represents 1% evolutionary change (1 accepted mutation per 100 amino acids).
❖ PAM250: Represents 250 accepted mutations per 100 amino acids (a much larger evolutionary distance), and so on.
❖ Usage: Low-numbered PAM matrices (e.g., PAM1) are most useful for comparing closely related sequences, while higher-numbered matrices (e.g., PAM250) suit more distantly related ones.
Example of a PAM1 Matrix:
Key Points:
PAM1 matrix is derived from direct observations of accepted mutations.
Higher PAM matrices (e.g., PAM250) are extrapolated for larger evolutionary distances.
4.4.3 BLOSUM (Blocks Substitution Matrix)
The BLOSUM matrix is based on blocks of conserved sequences, rather than evolutionary
distances. Unlike PAM, BLOSUM matrices are derived from comparisons of protein sequences
within conserved regions (blocks) of related proteins.
Development of BLOSUM:
❖ Developed by Steven Henikoff and Jorja Henikoff in 1992.
❖ It is based on blocks of sequences that are more distantly related than those used in PAM.
Characteristics:
a. BLOSUM62: The most widely used matrix, optimized for sequences with about 62% similarity.
b. BLOSUM matrices: Higher-numbered matrices (e.g., BLOSUM80) are for more closely related
sequences, while lower-numbered matrices (e.g., BLOSUM45) are for more distantly related
sequences.
Example of a BLOSUM62 Matrix:
Key Points:
❖ BLOSUM62 is the default matrix for most protein alignment tools.
❖ BLOSUM matrices are more versatile for finding local alignments between sequences of
varying similarity levels.
4.4.4 PAM vs. BLOSUM
PAM: Based on evolutionary models and best suited for aligning sequences with small
evolutionary distances.
BLOSUM: Based on empirical data from conserved blocks, better for local alignment of more
distantly related sequences.
4.4.5 Example Exercise:
1. Given the sequences:
Seq1: ATCG
Seq2: ATCC
Use a simple substitution matrix where matches score +1, mismatches score -1, and gaps are
penalized with -2. Align the two sequences and calculate the alignment score.
Solution: Let's align the sequences with no gaps first. Positions 1-3 (A-A, T-T, C-C) are matches (+1 each) and position 4 (G-C) is a mismatch (-1), giving an alignment score of 3 - 1 = 2. Introducing gaps (penalty -2) would only lower the score, so this is the optimal alignment.
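A two-line check of this computation (under the stated scoring scheme) is sketched below; the helper name is arbitrary.

def score_ungapped(s1, s2, match=1, mismatch=-1):
    return sum(match if a == b else mismatch for a, b in zip(s1, s2))

print(score_ungapped("ATCG", "ATCC"))        # 3 matches - 1 mismatch = 2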
LECTURE FIVE
Module 5: Multiple Sequence Alignment
5.1 Biological significance of multiple sequence alignment (MSA)
5.2 Progressive alignment algorithms (e.g., CLUSTALW)
5.3 Iterative methods and refinement (e.g., MUSCLE)
5.4 Advanced sequence alignment Algorithms (protein sequence alignment)
2. Alignment Algorithm: Use an algorithm (such as Clustal Omega, MUSCLE, or MAFFT)
to align the sequences.
3. Analyze Conserved Regions: Look for blocks of conserved sequences that may indicate
structural or functional significance.
4. Interpret Biological Meaning: Evaluate the alignment results for evolutionary
relationships, conserved functional domains, or structural predictions.
Practical Examples of MSA in Bioinformatics
1. Detecting Evolutionary Relationships: Suppose scientists have DNA sequences from
three different organisms: a human, a chimpanzee, and a gorilla. By aligning these
sequences, MSA reveals which parts of the sequence are conserved and which have
diverged over time. Example:
Human: ATGCTGAACCT
Chimpanzee: ATGCTGAACCA
Gorilla: ATGTTGAACCA
Here, the MSA shows slight differences in the sequences, with conserved regions across all three. From these similarities it can be inferred that humans, chimpanzees, and gorillas share a common ancestor.
2. Predicting Protein Functionality: Imagine studying a protein sequence known for its
enzymatic activity in one species. By aligning this protein with similar sequences from
other species, researchers can locate conserved amino acids (building blocks of proteins)
likely crucial for the protein’s function.
Example:
Species A: MKVLLVGLQGS
Species B: MKVLLVGLQGD
Species C: MKVLLVGLQGS
In this example, the alignment shows that the sequence "MKVLLVGLQGS" is conserved across
species A and C but differs slightly in species B. This conserved sequence might represent a
functional domain crucial for enzymatic activity.
3. Identifying Mutations and Disease Links: MSA is commonly used to compare gene
sequences from healthy individuals with those who have a genetic disorder. Differences or
mutations identified through alignment can indicate mutations responsible for diseases.
Example:
Healthy Sequence: ATGCGTACTGAAC
Mutant Sequence: ATGCTTACTGAAC
The mutation from "C" to "T" in the MSA could potentially alter protein functionality, linking it
to disease if the region is functionally significant.
Tools Commonly Used in MSA
1. Clustal Omega: A widely used tool for multiple sequence alignment that efficiently
handles large numbers of sequences.
2. MUSCLE: Known for its speed and accuracy, it’s frequently used for protein sequences.
3. MAFFT: Suitable for large datasets, often used for complex alignments of nucleotide or
protein sequences.
Key Features:
1. Pairwise Alignment: Uses scoring matrices to perform pairwise alignments.
2. Guide Tree Construction: Utilizes the Neighbor-Joining method to build the tree.
3. Alignment: Uses the tree to progressively align sequences or groups of sequences.
5.2.3 Steps in CLUSTALW Progressive Alignment
1. Step 1: Compute Pairwise Distances: Calculate similarity scores for all pairs of sequences,
usually using a substitution matrix like PAM or BLOSUM (for proteins) or simple
match/mismatch scores (for DNA).
2. Step 2: Construct a Guide Tree: Using the computed distances, a guide tree is built using
hierarchical clustering or the Neighbor-Joining algorithm.
3. Step 3: Progressive Alignment: The sequences are aligned following the guide tree.
Initially, two closely related sequences are aligned, and then each subsequent sequence or
group of sequences is aligned to this initial alignment.
5.2.4 Example of Progressive Alignment Using CLUSTALW
Let’s align the following three DNA sequences using CLUSTALW:
• Seq1: ATCG
• Seq2: ATGG
• Seq3: ACGA
Step-by-Step Example:
Step 1: Pairwise Distance Calculation
Using a simple match (1 point) / mismatch (0 points) scoring system:
1. Seq1 vs. Seq2: 3 matches (AT-GG), score = 3
2. Seq1 vs. Seq3: 2 matches (A-CGA), score = 2
3. Seq2 vs. Seq3: 2 matches (A-GGA), score = 2
Step 2: Construct Guide Tree
Based on the scores, Seq1 and Seq2 are most similar, so they will be grouped first.
Step 3: Progressive Alignment
1. Align Seq1 and Seq2:
Seq1: ATCG
Seq2: ATGG
2. Align Seq3 to the alignment of Seq1 and Seq2:
Seq1: ATCG
Seq2: ATGG
Seq3: A-CGA
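The snippet below sketches Step 1 with a toy positional-identity count for the three sequences; CLUSTALW itself derives distances from full pairwise alignments (which may include gaps), so its exact values differ from this simplification, but Seq1 and Seq2 still come out as the closest pair.

from itertools import combinations

seqs = {"Seq1": "ATCG", "Seq2": "ATGG", "Seq3": "ACGA"}

def match_score(a, b):
    return sum(1 for x, y in zip(a, b) if x == y)   # 1 point per identical position

for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
    print(n1, "vs.", n2, "score =", match_score(s1, s2))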
5.2.5 Applications of Progressive Alignment in Bioinformatics
❖ Phylogenetic Analysis: Helps determine evolutionary relationships.
❖ Conserved Region Identification: Identifies regions that remain unchanged across species,
hinting at crucial functional roles.
❖ Structure Prediction: Uses alignments to predict the 3D structure of proteins.
5.2.6 Advantages and Limitations of Progressive Alignment
Advantages:
❖ Computational Efficiency: Works well with larger datasets.
❖ Ease of Implementation: Requires relatively simple calculations for guide tree construction
and alignment.
Limitations:
❖ Error Propagation: Early alignment errors propagate through the entire alignment.
❖ Dependence on Guide Tree: Accuracy is highly dependent on the quality of the guide tree.
3. Convergence: The process continues until improvements become negligible, indicating
that the algorithm has likely reached the best alignment possible.
MUSCLE Algorithm
MUSCLE is particularly valued for both speed and accuracy, using three primary stages:
1. Draft Progressive Alignment: Builds a tree to represent the relationships between
sequences, then aligns them progressively based on this tree.
2. Tree Refinement: The algorithm builds a second, more accurate tree based on the initial
alignment and realigns the sequences progressively again, using this refined tree.
3. Iterative Refinement: MUSCLE refines the alignment iteratively by making small
adjustments, testing the alignment score after each change.
Advantages of MUSCLE
• High accuracy: MUSCLE achieves high alignment accuracy, especially for long protein
sequences.
• Efficiency: MUSCLE can quickly handle large datasets, making it popular for aligning
multiple protein sequences.
2. Hidden Markov Models (HMM):
HMMER: A widely-used tool that utilizes Hidden Markov Models for aligning protein sequences,
especially in finding patterns in protein families. HMMER is particularly effective for recognizing
distant relationships between sequences due to its probabilistic approach.
3. Profile-Based Algorithms:
PSI-BLAST (Position-Specific Iterated BLAST): Builds a profile based on the results of an initial
BLAST search and then aligns other sequences to this profile. This is useful for detecting weak
but biologically significant similarities between proteins.
4. Progressive Alignment with Refinement:
Many modern tools, including MUSCLE and MAFFT (Multiple Alignment using Fast Fourier
Transform), use progressive alignment followed by iterative refinement to handle large-scale
protein datasets with high accuracy.
3. Output: PSI-BLAST will return sequences that are similar to the query and provide a
position-specific scoring matrix that can help identify conserved motifs or functional
domains.
4. Iterate: Re-run PSI-BLAST to refine the alignment and detect more distant relatives,
which can be informative for functional or structural predictions.
MODULE 6: HEURISTICS FOR SEQUENCE ALIGNMENT
6.1 Challenges in aligning large datasets
6.2 Heuristic methods for rapid alignment: BLAST and FASTA algorithms
6.3 Trade-offs between accuracy and performance in heuristic algorithms
6.4 Case studies: Using BLAST in genomic research
6.1.1 Key Challenges in Large-Scale Sequence Alignment
1. Data Volume and Computational Resources: As datasets grow larger, aligning them
requires more computational resources. Exhaustive alignment methods, like dynamic
programming (e.g., Needleman-Wunsch and Smith-Waterman), become computationally
expensive and time-consuming for very large datasets.
2. Accuracy vs. Speed: Achieving high accuracy with traditional alignment algorithms
demands significant processing power, leading to slower runtimes. Optimizations are often
required to balance the trade-off between speed and accuracy.
3. Memory Usage: Aligning large sequences requires considerable memory to store matrices
and intermediary data. Managing memory efficiently becomes a crucial task.
4. Complexity of Biological Data: Biological sequences may contain insertions, deletions,
and substitutions, which complicates alignment. Some sequences might also be repetitive
or conserved across multiple species, adding additional complexity to the alignment
process.
5. Database Search Time: When searching for similar sequences in large databases, the time
to complete the search can be prohibitively long, especially when the database size is in
terabytes.
Example Problem
Imagine you have a dataset with one million DNA sequences, each about 1,000 bases long.
Traditional pairwise alignment using dynamic programming would require vast amounts of
computation and memory. Instead, we turn to heuristic methods like BLAST or FASTA to achieve
faster, approximate alignments.
6.2 Heuristic Methods for Rapid Alignment – BLAST and FASTA Algorithms
To address the challenges of aligning large datasets, bioinformatics researchers use heuristic
algorithms like BLAST (Basic Local Alignment Search Tool) and FASTA (Fast-All). These
methods quickly approximate alignments by focusing on local regions of similarity, rather than
exhaustively analyzing every possible alignment.
Key Concepts:
• Heuristic Algorithms: They make trade-offs between speed and optimality by focusing
on finding “good enough” solutions instead of the best possible alignment.
• Local Alignments: Instead of aligning entire sequences, they identify regions of high
similarity, which reduces computational time.
6.2.1 BLAST Algorithm (Basic Local Alignment Search Tool)
BLAST is one of the most widely used bioinformatics tools for finding regions of similarity
between biological sequences. It is faster than exhaustive methods because it uses a heuristic
approach to find local alignments.
How BLAST Works:
1. Word Matching: BLAST first identifies “words” or short subsequences within a query
sequence. Words are typically 11 bases for nucleotide sequences or 3 amino acids for
proteins.
2. Extension: After finding a matching word between the query and database sequences,
BLAST attempts to extend the alignment by matching additional bases or amino acids
around it.
3. Scoring and Filtering: BLAST scores alignments based on similarity. Only alignments
above a certain score threshold are considered, further reducing computation time.
4. Result Output: BLAST ranks results based on the similarity score, displaying the best
matches to the user.
Example: Suppose you have a query protein sequence and want to find similar sequences in a
protein database. Using BLAST, you can search the database quickly by comparing segments
rather than the entire sequences.
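The toy sketch below mirrors the first two BLAST stages on invented sequences: build an index of database "words" (here k = 3, far smaller than real BLAST word sizes), look up exact word matches for the query, and extend each seed while the characters keep matching. Real BLAST uses substitution matrices, drop-off rules, and statistics for the later stages.

def build_word_index(db_seq, k=3):
    # Map every k-letter word in the database sequence to its positions
    index = {}
    for i in range(len(db_seq) - k + 1):
        index.setdefault(db_seq[i:i + k], []).append(i)
    return index

def extend_seed(query, db_seq, q_pos, d_pos, k):
    # Extend an exact word match left and right while the characters keep matching
    left = 0
    while q_pos - left - 1 >= 0 and d_pos - left - 1 >= 0 and \
            query[q_pos - left - 1] == db_seq[d_pos - left - 1]:
        left += 1
    right = 0
    while q_pos + k + right < len(query) and d_pos + k + right < len(db_seq) and \
            query[q_pos + k + right] == db_seq[d_pos + k + right]:
        right += 1
    return query[q_pos - left:q_pos + k + right]

query, db_seq = "GCATCGT", "TTGCATCGAA"
index = build_word_index(db_seq, k=3)
for q in range(len(query) - 3 + 1):
    for d in index.get(query[q:q + 3], []):                    # word matching (seeding)
        print(query[q:q + 3], "->", extend_seed(query, db_seq, q, d, 3))   # extension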
6.2.2 FASTA Algorithm (Fast-All)
FASTA was one of the first algorithms developed for rapid sequence alignment. Although BLAST
has largely supplanted it in popularity, FASTA remains a useful heuristic tool for rapid alignment.
How FASTA Works:
1. Identifying K-Tuples: FASTA breaks down sequences into “k-tuples” (short
subsequences of k length). The k-value depends on the sequence type (nucleotides or amino
acids).
2. Word Matching: Like BLAST, FASTA finds initial matching k-tuples between the query
and database sequences.
3. Scoring and Filtering: FASTA scores and filters alignments based on the similarity of
matched regions, focusing on those with high local alignment scores.
4. Final Alignment: FASTA refines high-scoring matches using dynamic programming to
ensure accuracy, providing alignments that are comparable to BLAST.
Example: FASTA can be used to find DNA sequence matches in a database.
1. Word Matching: The query sequence is divided into smaller "words" or subsequences,
which are searched in the database for exact matches.
2. Extension: Each exact match is extended in both directions to form a high-scoring segment
pair (HSP).
3. Scoring and Significance: HSPs are scored with a scoring matrix, and their alignment
significance is statistically assessed.
4. Output: BLAST generates a ranked list of alignments, prioritizing the most significant
matches.
6.4.1 Case Studies
1. Identifying Gene Homologs:
❖ Objective: Understand evolutionary links by finding homologous genes across species.
❖ Approach: Using BLASTP, a human protein is compared with a mouse protein database.
❖ Outcome: Reveals conserved gene functions, advancing evolutionary biology.
2. Discovering Novel Genes:
❖ Objective: Identify new genes in a newly sequenced organism.
❖ Approach: BLASTX compares the new organism's nucleotide sequences to a known
protein database.
❖ Outcome: Suggests novel genes for faster genome annotation and further studies.
6.4.2 Advantages of BLAST in Genomic Research
❖ Speed and Efficiency: Supports rapid large-database queries.
❖ Flexibility: Different BLAST types allow for various sequence comparisons.
❖ User-Friendly: Available as command-line and web tools.
❖ Widespread Integration: Used broadly across bioinformatics pipelines.
6.4.3 Limitations of BLAST
❖ Heuristic Nature: May miss alignments lacking exact word matches.
❖ Database Dependency: Results rely on the database's quality.
❖ Computational Demand: Large datasets still require considerable processing power.
Module 7: Hidden Markov Models (HMM) in Bioinformatics
7.1 Overview of Hidden Markov Models, Markov processes and biological sequences
7.2 Components of HMMs: States, transitions, emissions
7.3 Application of HMMs in biological sequence modeling (gene prediction, protein family
classification)
7.1 Overview of Hidden Markov Models, Markov Processes, and Biological Sequences
A Markov chain is a model that tells us something about the probabilities of sequences of random
variables, states, each of which can take on values from some set. These sets can be words, or tags,
or symbols representing anything, like the weather. A Markov chain makes a very strong
assumption that if we want to predict the future in the sequence, all that matters is the current state.
The states before the current state have no impact on the future except via the current state. It’s as
if to predict tomorrow’s weather you could examine today’s weather but you weren’t allowed to
look at yesterday’s weather.
A hidden Markov model (HMM) allows us to talk about both observed events (like words that we
see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in
our probabilistic model. HMMs are statistical models used to describe systems that have
unobservable (hidden) states. These models rely on Markov processes, where the probability of
transitioning to a particular state depends only on the current state (not the history of states). The
HMM is based on augmenting the Markov chain.
Relevance to Bioinformatics:
Biological sequences (DNA, RNA, or proteins) often exhibit patterns that are not directly
observable but can be inferred using statistical models. For example:
• The identification of exons and introns in genes.
• Classifying protein domains within sequences.
Example:
• DNA Sequence Modeling: In a DNA sequence, each base (A, T, C, G) may correspond to
states like exon or intron. HMMs can infer which parts of the sequence are exons (coding)
or introns (non-coding).
7.2 Components of HMMs: States, Transitions, Emissions
1. States
States in an HMM represent the hidden, discrete conditions or categories of the system. These
states are not directly observable but influence the observable outputs.
• Characteristics of States:
Each state is discrete and part of a finite set.
States are hidden, meaning the actual state sequence is not directly known.
They are often denoted as S = {S1, S2, …, SN} where N is the total number of states.
• Examples:
i. In speech recognition, states might represent different phonemes.
ii. In weather modeling, states might represent weather conditions like "sunny," "rainy," or
"cloudy."
iii. In biological sequence analysis, states might represent DNA sequence regions (e.g.,
"coding" or "non-coding").
2. Transitions
Transitions describe the probabilities of moving from one state to another in the hidden sequence.
These probabilities are defined by a transition matrix.
• Transition Matrix (A): A[i, j] gives the probability of moving from state Si to state Sj, and each row sums to 1.
Key Properties:
❖ Transition probabilities remain fixed for a given HMM.
❖ Initial state probabilities (π): The distribution that defines the likelihood of starting in each state.
Examples:
i. In weather modeling, A[sunny, rainy] could be the probability of transitioning from a sunny day to a rainy day.
ii. In part-of-speech tagging, transitions might define the probability of moving from a noun
to a verb.
3. Emissions
Emissions represent the observable outputs or symbols generated from each hidden state. These
are modeled by emission probabilities, which describe the likelihood of an observable output
given a particular hidden state.
Examples:
i. In speech recognition, emissions might be audio waveforms or spectral features derived
from spoken words.
ii. In weather modeling, emissions could be the observed weather patterns (e.g., temperature,
humidity).
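The sketch below ties the three components together with the Viterbi algorithm on a toy two-state model; the state names "exon"/"intron" and all probabilities are invented for illustration and do not come from a trained gene model.

states = ["exon", "intron"]
start = {"exon": 0.5, "intron": 0.5}
trans = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
emit = {"exon":   {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(observations):
    # V[t][s] = probability of the best state path ending in state s at position t
    V = [{s: start[s] * emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({}); back.append({})
        for s in states:
            prev, prob = max(((p, V[t-1][p] * trans[p][s]) for p in states),
                             key=lambda x: x[1])
            V[t][s] = prob * emit[s][observations[t]]
            back[t][s] = prev
    # Trace back the most probable state path
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi("GCGCATAT"))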
7.3 Application of HMMs in Biological Sequence Modeling
1. Gene Prediction
HMMs can distinguish between coding (exon) and non-coding (intron) regions in a genome. Example: The Genscan software employs HMMs to predict genes in raw DNA sequences.
2. Protein Family Classification
HMMs classify proteins into families based on conserved motifs and patterns. Example: The Pfam database uses HMMs to identify and annotate protein families.
3. Sequence Alignment
HMMs are employed in multiple sequence alignment by modeling conserved regions across
sequences. Example: The HMMER tool uses HMMs for searching sequence databases and
aligning multiple sequences.
4. Structural Prediction
Example: SAM-T08 software uses HMMs to predict protein structures based on sequence
similarity.
MODULE 8: PHYLOGENETIC TREES
8.1 Introduction to tree structures: Binary trees, rooted vs unrooted trees
Phylogenetic trees are diagrams that depict evolutionary relationships among various species or
other entities based on their genetic or physical traits. These relationships are inferred using
biological data, such as DNA sequences, and are represented in the form of a tree-like structure.
A. Binary Trees
A binary tree is a hierarchical structure in which each node has at most two children, typically
referred to as the "left" and "right" child. In phylogenetics, binary trees represent evolutionary
divergence, with each node signifying a common ancestor and its descendants.
Key Features:
1. Internal Nodes: Represent hypothetical common ancestors.
2. Leaf Nodes: Represent existing species or taxa.
3. Edges: Represent evolutionary paths.
Example:
Imagine three species: A, B, and C. A binary tree may show that species A and B share a more
recent common ancestor than they do with species C.
In this example:
i. The internal node "Ancestor1" is the common ancestor of all three species.
ii. "Ancestor2" is the shared ancestor of A and B only.
iii. A, B, and C are the leaf nodes.
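This tree can be written down directly; the sketch below represents it with a minimal node class and prints it in the standard Newick format, where ((A,B),C); encodes A and B joined under Ancestor2, which joins C under the root Ancestor1.

class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right

    def to_newick(self):
        if self.left is None and self.right is None:    # leaf node (an existing species)
            return self.name
        return "(" + self.left.to_newick() + "," + self.right.to_newick() + ")"

ancestor2 = Node("Ancestor2", Node("A"), Node("B"))     # internal node: ancestor of A and B
ancestor1 = Node("Ancestor1", ancestor2, Node("C"))     # root: ancestor of all three species
print(ancestor1.to_newick() + ";")                      # ((A,B),C);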
Rooted Trees: A rooted tree has a designated root that represents the most recent common
ancestor of all entities in the tree. It shows the direction of evolutionary time, starting from the root
and diverging toward the tips.
Example:
A rooted tree with species A, B, C, and D might look like this:
Here:
❖ The root indicates the origin of all evolutionary relationships in the tree.
❖ The tree is "directional," reflecting evolutionary divergence over time.
Unrooted Trees:
An unrooted tree depicts the relationships between species without assuming a common ancestor
or evolutionary direction. It focuses solely on the genetic or physical similarities.
8.1.2 Applications of Tree Structures in Phylogenetics:
1. Binary Trees: Useful for modeling speciation events and calculating evolutionary
distances.
2. Rooted Trees: Ideal for tracing ancestry and evolutionary timelines.
3. Unrooted Trees: Helpful for identifying clusters of related species based on similarity
metrics.
8.1.3 Example Use Case:
Consider a study analyzing the evolutionary relationships among human, chimpanzee, gorilla,
and orangutan DNA sequences. A rooted binary tree can depict their divergence from a common
primate ancestor, while an unrooted tree can compare genetic similarity without suggesting
direct ancestry.
8.2 Tree construction methods: Distance-based (UPGMA, Neighbor-Joining) and character-
based methods (maximum parsimony, maximum likelihood)
Phylogenetic tree construction methods aim to represent evolutionary relationships among a set of
organisms, genes, or other biological units. These methods can be broadly categorized into
distance-based and character-based approaches. Below is a detailed explanation of each method,
including their principles and examples.
8.2.1 Distance-Based Methods
These methods rely on pairwise distance measures between sequences or species. They assume
that the evolutionary relationships can be inferred from the overall similarity between pairs of
sequences.
a. UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
Principle:
UPGMA is a hierarchical clustering method that assumes a constant rate of evolution (molecular
clock). It groups taxa (a group of one or more populations of an organism or organisms seen by
taxonomists to form a unit) based on their pairwise distances and produces a rooted tree.
Steps:
1. Compute a pairwise distance matrix.
2. Find the closest two taxa (smallest distance).
3. Merge the two taxa into a single cluster and compute the new distances to all other taxa.
4. Repeat until all taxa are joined in a single tree.
Example: Given four species (A, B, C, D) with the distance matrix:
• Continue until all species are joined.
Advantages:
• Simple and fast.
• Suitable for datasets where the molecular clock assumption holds.
Limitations:
• Assumes a constant rate of evolution, which may not be realistic.
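A compact sketch of these UPGMA steps is shown below. It assumes distances are stored in a dictionary keyed by pairs of cluster labels, merges the closest pair, and recomputes distances as size-weighted arithmetic means; the four-taxon distance values are invented for illustration and are not the matrix from the example above.

def upgma(dist, sizes):
    # dist: {frozenset({x, y}): distance}, sizes: {label: number of taxa in the cluster}
    clusters = set(sizes)
    while len(clusters) > 1:
        # Steps 1-2: find the closest pair of current clusters
        a, b = min((pair for pair in dist if pair <= clusters), key=dist.get)
        new = "(" + a + "," + b + ")"
        # Step 3: size-weighted average distance from the merged cluster to the rest
        for c in clusters - {a, b}:
            dist[frozenset({new, c})] = (dist[frozenset({a, c})] * sizes[a] +
                                         dist[frozenset({b, c})] * sizes[b]) / (sizes[a] + sizes[b])
        sizes[new] = sizes[a] + sizes[b]
        clusters = (clusters - {a, b}) | {new}   # Step 4: repeat with the reduced set
    return clusters.pop()

d = {frozenset({"A", "B"}): 2, frozenset({"A", "C"}): 6, frozenset({"A", "D"}): 8,
     frozenset({"B", "C"}): 6, frozenset({"B", "D"}): 8, frozenset({"C", "D"}): 4}
print(upgma(d, {"A": 1, "B": 1, "C": 1, "D": 1}))   # e.g. ((A,B),(C,D))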
b. Neighbor-Joining (NJ)
Principle:
Neighbor-Joining relaxes the molecular clock assumption and is more flexible than UPGMA. It
identifies pairs of taxa that minimize the total branch length of the tree.
Steps:
1. Compute a distance matrix.
2. Calculate the neighbor-joining matrix to identify the closest pair of taxa.
3. Join the pair into a new cluster and update the distance matrix.
4. Repeat until all taxa are joined.
Advantages:
• Does not require a molecular clock assumption.
• Produces an unrooted tree, which can later be rooted if needed.
Limitations:
• May produce less accurate results with highly divergent sequences.
8.2.2 Character-Based Methods
a. Maximum Parsimony
Principle: Selects the tree topology that requires the fewest evolutionary changes (substitutions) to explain the observed sequence data.
Steps:
1. Identify the informative sites in the aligned sequences.
2. Generate all possible tree topologies.
3. Calculate the number of changes (steps) required for each tree.
4. Select the tree with the minimum number of steps.
5. Identifying Orthologs
Phylogenetic trees help distinguish orthologs (genes that diverged through speciation) from paralogs (genes that arose by duplication), which is essential for transferring functional knowledge between species.
Example: In humans and mice, the gene PAX6, involved in eye development, is an ortholog.
Phylogenetic analysis confirms that these genes originated from a common ancestor and perform
similar functions in both species.
6. Identifying Conserved Non-Coding Regions
Phylogenetic comparisons highlight conserved non-coding DNA sequences, which are often
regulatory elements critical to gene expression.
Example: The HOX gene clusters, involved in body plan development, include highly conserved
non-coding regions, identified through phylogenetic alignment across vertebrates.
MODULE 9: MACHINE LEARNING IN BIOINFORMATICS
Key Components:
i. Convolutional Layers: Apply filters to extract spatial features.
ii. Pooling Layers: Reduce spatial dimensions (e.g., max pooling).
iii. Fully Connected Layers: Connect features to output predictions.
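A minimal sketch wiring these three component types together is shown below, assuming PyTorch (the note does not prescribe a framework) and a one-hot encoded DNA window; all sizes (sequence length 100, 32 filters, kernel 8, 2 output classes) are illustrative.

import torch
import torch.nn as nn

class TinySeqCNN(nn.Module):
    def __init__(self, seq_len=100, n_filters=32, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size=8)    # 4 input channels: A, C, G, T
        self.pool = nn.MaxPool1d(kernel_size=4)                # reduce the spatial dimension
        self.fc = nn.Linear(n_filters * ((seq_len - 8 + 1) // 4), n_classes)  # features -> classes

    def forward(self, x):                     # x shape: (batch, 4, seq_len), one-hot DNA
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        x = x.flatten(start_dim=1)
        return self.fc(x)

model = TinySeqCNN()
dummy = torch.zeros(1, 4, 100)                # one 100-bp one-hot encoded sequence
print(model(dummy).shape)                     # torch.Size([1, 2])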
Deep learning has revolutionized bioinformatics and computational biology, particularly in the
areas of genomics and protein structure analysis. These domains leverage deep neural networks to
uncover patterns, predict functions, and infer structures from complex biological data.
Genomics involves the study of the genome, which includes the sequencing, analysis, and
functional mapping of DNA. Deep learning has provided breakthroughs in processing large-scale
genomic data and deriving meaningful insights.
Key Applications:
1. Gene Expression Prediction: Predicting how genes are expressed in different cell types
under varying conditions. Example: Deep learning models like DeepSEA predict the impact
of noncoding variants on gene expression and chromatin accessibility. These models use
convolutional neural networks (CNNs) to analyze DNA sequences directly.
2. Variant Calling: Identifying genetic variants from raw sequencing data. Example: Tools
like DeepVariant use deep learning to enhance the accuracy of variant calling from next-
generation sequencing (NGS) data.
Proteins are essential biological molecules, and understanding their 3D structure is crucial for drug
design and understanding cellular mechanisms. Deep learning has made remarkable advances in
protein structure prediction and function annotation.
Key Applications:
1. Protein Structure Prediction: Predicting a protein's 3D structure from its amino acid
sequence. Example: AlphaFold by DeepMind uses attention mechanisms and transformer
models to predict protein structures with near-experimental accuracy.
Examples and Practical Impacts
2. CRISPR-Cas9 Targeting
Challenges of Deep Learning in Bioinformatics:
❖ Data Quality and Diversity: Training data must be high-quality and representative of
diverse biological systems.
❖ Interpretability: Deep learning models often function as "black boxes," making biological
interpretation challenging.