Alignment of Whole Genomes

This document discusses genome alignment using MUMmer. MUMmer is a system that can align whole genomes in linear time using suffix trees, longest increasing subsequence algorithms, and Smith-Waterman alignment. The document outlines MUMmer version 1.0 and 2.0, describing their algorithms and improvements. MUMmer 1.0 finds maximal unique matches between genomes and uses longest increasing subsequence to output alignments, while MUMmer 2.0 clusters matches and includes tools like NUCmer for contig alignment and PROmer for protein alignment. The document provides examples of MUMmer's speed and ability to align diverse genomes.

Uploaded by

Samir Sabry

ALIGNMENT OF WHOLE GENOMES
Presented by: Hisham Adel Mohamed
Outline
 Genome
 Objective
 MUMmer
 MUMmer 1.0
   Algorithms
   Results
 MUMmer 2
   Algorithms
   NUCmer
   PROmer
   Results
Genome
 The genome is the total DNA in a cell.
 Whole genomes are millions of nucleotides long.
 Identifying regions of similarity and difference between genomes is important:
   Similarities suggest a similar function in the two organisms.
   Differences suggest a function that is needed in only one of the organisms.
Image source: http://www.scq.ubc.ca/wp-content/uploads/2006/08/molecular-machine.gif
Objective
 Align the whole genomes of two related organisms.
 Understand the functions encoded in genomes.
 Apply the results in healthcare, for example to find drugs for diseases.
Image source: http://www.scq.ubc.ca/wp-content/uploads/2006/08/applications.gif
MUMmer
 A system for aligning DNA and protein sequences in linear time.
 There are two versions:
   MUMmer 1.0, 1999.
   MUMmer 2.0, 2002.
MUMmer
 MUMmer is based mainly on three algorithms:
   Suffix trees.
   The longest increasing subsequence (LIS).
   Smith-Waterman alignment.
 The novelty of the system is that it integrates the three algorithms into one coherent system.
 Alignment using dynamic programming:
   Naive version: O(n^2) space and time.
   Hashing: faster, but still O(n^2).
   Some versions reduce the space to O(n) at the cost of more time.
 MUMmer:
   Suffix trees: construction takes O(n) time and space, and a search takes O(m), where n is the sequence length and m is the substring length.
   LIS: O(k log k), where k is the number of maximal unique matches (MUMs).
MUMmer 1.0, 1999
 Assumption: the sequences are closely related.
 Inputs: two DNA sequences and the minimum MUM length.
 Output: a base-to-base alignment of the whole sequences, highlighting the exact differences between the genomes.
MUMmer 1.0, 1999
 Step 1: Perform maximal unique match (MUM) identification using a suffix tree.
 Step 2: Sort the matches found and extract the longest set of matches using the longest increasing subsequence (LIS) algorithm.
 Step 3: Close the gaps in the alignment, performing local identification of large inserts, repeats, small mutated regions, tandem repeats and SNPs.
 Step 4: Output the results.
MUMmer 1.0: Suffix Tree
Suffixes:

1 gaaccgacct
2 aaccgacct
3 accgacct
4 ccgacct
5 cgacct
6 gacct
7 acct
8 cct
9 ct
10 t
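
The suffix listing above can be reproduced with a short Python sketch. The suffix trie here is a deliberately naive stand-in, built in O(n^2) time and space, and not the linear-time suffix tree construction that MUMmer uses. It only illustrates why a pattern of length m can be looked up in O(m) steps once the structure exists. The function names are illustrative assumptions.

```python
def suffixes(text):
    """Return (1-based start position, suffix) pairs, as on the slide."""
    return [(i + 1, text[i:]) for i in range(len(text))]


def build_suffix_trie(text):
    """Naive suffix trie as nested dicts: O(n^2), for illustration only."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Walk one edge per pattern character: O(m) for a pattern of length m."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


if __name__ == "__main__":
    for position, suffix in suffixes("gaaccgacct"):
        print(position, suffix)
    trie = build_suffix_trie("gaaccgacct")
    print(contains(trie, "ccga"), contains(trie, "gg"))
```

A real suffix tree compresses chains of single-child nodes into one edge, which is what brings the size down to O(n).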
MUMmer 1.0: Maximal Unique Matching subsequence (MUM)
 A MUM is a subsequence that occurs exactly once in genome A and exactly once in genome B, and is not contained in any longer such subsequence.
 All MUMs are found with a single scan of the suffix tree.
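
To make the MUM definition concrete, here is a brute-force Python sketch. It does not use a suffix tree and is far slower than MUMmer's single scan. It simply tests every substring against the definition, so it is only suitable for toy inputs. The function names and the `min_len` parameter name are assumptions for illustration.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len):
    """Return sorted (pos_in_a, pos_in_b, match) tuples for all MUMs."""
    unique = set()
    for i in range(len(a)):
        for j in range(i + min_len, len(a) + 1):
            s = a[i:j]
            in_b = count_occurrences(b, s)
            if in_b == 0:
                break  # no longer extension can occur in b either
            if in_b == 1 and count_occurrences(a, s) == 1:
                unique.add(s)
    # Keep only matches not contained in a longer unique match.
    maximal = [s for s in unique
               if not any(s != t and s in t for t in unique)]
    return sorted((a.find(s), b.find(s), s) for s in maximal)


if __name__ == "__main__":
    print(find_mums("AAACGTCC", "GGACGTTT", min_len=3))
```

In this toy input, "ACG" and "CGT" are unique in both strings but are contained in the longer unique match "ACGT", so only "ACGT" is reported.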
MUMmer 1.0: LIS
 Sort the MUMs according to their position in genome A.
 Select the longest set of MUMs whose sequences occur in ascending order in both genome A and genome B.

Before LIS:
A: 1 2 3 4 5 6 7
B: 1 3 2 6 4 5 7

After LIS:
A: 1 2 4 5 7
B: 1 2 4 5 7
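
The LIS step on the example above can be sketched in Python as follows. This is the standard O(k log k) patience-sorting approach with binary search. Whether MUMmer implements it in exactly this way is an assumption. MUMmer's variant also takes match lengths into account, which this sketch ignores.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq."""
    tail_values = []  # tail_values[p]: smallest tail of a run of length p + 1
    tail_index = []   # index in seq of that tail
    prev = [-1] * len(seq)  # predecessor links for reconstruction
    for i, x in enumerate(seq):
        pos = bisect_left(tail_values, x)
        if pos == len(tail_values):
            tail_values.append(x)
            tail_index.append(i)
        else:
            tail_values[pos] = x
            tail_index[pos] = i
        prev[i] = tail_index[pos - 1] if pos > 0 else -1
    # Walk the predecessor links back from the tail of the longest run.
    result = []
    i = tail_index[-1] if tail_index else -1
    while i != -1:
        result.append(seq[i])
        i = prev[i]
    return result[::-1]


if __name__ == "__main__":
    # MUMs are numbered by their order in genome A; this is their order in B.
    order_in_b = [1, 3, 2, 6, 4, 5, 7]
    print(longest_increasing_subsequence(order_in_b))  # [1, 2, 4, 5, 7]
```

MUMs 3 and 6 are dropped because they appear out of order in genome B.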
MUMmer 1.0: Closing the gaps
 A gap is an interruption in the MUM alignment.
 Long gaps: repeat the procedure using a shorter minimum MUM length.
 Short gaps: use Smith-Waterman alignment.
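
For the short-gap case, a minimal Smith-Waterman scoring sketch in Python is shown here. The scoring values (match 2, mismatch -1, linear gap penalty -2) are arbitrary assumptions for illustration, not MUMmer's parameters. The function returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of s and t, O(len(s) * len(t)) time."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGACGTCC", "TTACGTAA"))  # 8
```

The quadratic cost is acceptable here because it is only applied to the short stretches left between MUMs.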
Results
 Aligning two highly homologous strains of M. tuberculosis (4.4 million bp):
   Time: 5 s for suffix tree construction, 45 s for sorting MUMs, 5 s for Smith-Waterman alignments.
   Longest MUM: 24,563 bp; 249 MUMs > 5,000 bp; > 90% identical.
 Aligning two related bacteria, M. genitalium (580 kbp) and M. pneumoniae (816 kbp):
   Time: 6.5 s for the suffix tree, 0.02 s for finding the LIS, 116 s for alignments.
   Longest MUM: 281 bp; 16 MUMs > 100 bp; < 50% identical.
 Aligning two syntenic sequences from human chromosome 12 and mouse chromosome 6 (225 kbp):
   Time: 29 s in total, 1.6 s for the suffix tree.
   Longest MUM: 117 bp; 10 MUMs > 50 bp.
MUMmer 2.0, 2002
 Algorithm improvements:
   Memory
   Streaming query
   New model to cluster matches
 Able to align not only simple DNA sequences but also human chromosomes.
 Able to align incomplete genomes and protein sequences.
MUMmer 2.0: Streaming query
 A suffix link points from node u to node v if the string label from the root to v equals the label from the root to u with its first character removed.

[Figure: suffix tree for the string atgtgtgtc$ (positions 1 to 10), with the streaming string ...atgtcc... matched against it]
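
The suffix-link definition can be checked on the slide's example string with a small Python sketch. It does not build a suffix tree. Instead it relies on the fact that the internal nodes of a suffix tree correspond to the substrings that are followed by at least two different characters (given a unique terminator such as `$`). The function names are illustrative assumptions.

```python
from collections import defaultdict


def internal_node_labels(text):
    """Labels of the internal suffix tree nodes of text (root is '')."""
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n):
            followers[text[i:j]].add(text[j])  # text[i:j] is followed by text[j]
    labels = {s for s, nxt in followers.items() if len(nxt) >= 2}
    labels.add("")  # the root
    return labels


def suffix_links(text):
    """Map each non-root internal node label u to its link target u[1:]."""
    labels = internal_node_labels(text)
    return {u: u[1:] for u in labels if u}


if __name__ == "__main__":
    links = suffix_links("atgtgtgtc$")
    for u in sorted(links, key=len, reverse=True):
        print(repr(u), "->", repr(links[u]))
```

For atgtgtgtc$ this gives the chain tgtgt -> gtgt -> tgt -> gt -> t -> root. After a mismatch, a streaming query can follow such a link to drop the first matched character instead of restarting from the root.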
MUMmer 2.0: Cluster MUMs
 MUMmer 1 assumes that two complete sequences are to be aligned and computes a single longest alignment between them.
 The new version can align unfinished assemblies, which may need rearrangement.
 First, the system outputs a series of separate MUMs.
 Clustering is then performed by finding pairs of matches that are sufficiently close.
 Finally, an LIS is computed.
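
One way to picture the clustering step is the Python sketch below. The slide does not define "sufficiently close", so this sketch assumes two hypothetical criteria: the gap between two matches in genome A is at most `max_gap`, and their diagonals (position in A minus position in B) differ by at most `max_diag_diff`. Both parameters and their default values are assumptions chosen for illustration.

```python
def cluster_matches(matches, max_gap=90, max_diag_diff=5):
    """Group (pos_a, pos_b, length) matches into clusters of nearby matches."""
    parent = list(range(len(matches)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    order = sorted(range(len(matches)), key=lambda i: matches[i][0])
    for x, i in enumerate(order):
        a_i, b_i, len_i = matches[i]
        for j in order[x + 1:]:
            a_j, b_j, _ = matches[j]
            if a_j - (a_i + len_i) > max_gap:
                break  # later matches start even further away in A
            if abs((a_j - b_j) - (a_i - b_i)) <= max_diag_diff:
                parent[find(j)] = find(i)  # join the two clusters

    clusters = {}
    for i in order:
        clusters.setdefault(find(i), []).append(matches[i])
    return list(clusters.values())


if __name__ == "__main__":
    mums = [(0, 0, 20), (30, 31, 20), (500, 100, 20), (530, 130, 20)]
    for cluster in cluster_matches(mums):
        print(cluster)
```

The second cluster sits at a very different position in genome B, which is the kind of rearrangement a single genome-wide LIS would discard.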


MUMmer 2.0: NUCmer
 A multiple-contig alignment program that uses MUMmer 2.
 A contig (from "contiguous") is a set of overlapping DNA segments derived from a single genetic source.
 Input: two multi-FASTA files containing partial or complete assemblies.
 Steps:
   Create a map of all contig positions within each file.
   Concatenate the contigs in each file.
   Run MUMmer to find MUMs.
   Map the matches back to their contigs.
   Cluster the MUMs.
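
The position map, concatenation and map-back steps can be sketched in Python as follows. This is an illustration of the idea, not NUCmer's code. The separator character `x` and the function names are assumptions. The separator is chosen so that it is not a nucleotide, which keeps exact matches from spanning two contigs.

```python
from bisect import bisect_right


def concatenate_contigs(contigs, separator="x"):
    """contigs: list of (name, sequence). Returns (joined, names, starts)."""
    names, starts, parts = [], [], []
    offset = 0
    for name, seq in contigs:
        names.append(name)
        starts.append(offset)  # where this contig begins in the joined string
        parts.append(seq)
        offset += len(seq) + len(separator)
    return separator.join(parts), names, starts


def map_back(position, names, starts):
    """Translate a position in the joined string to (contig name, offset)."""
    idx = bisect_right(starts, position) - 1
    return names[idx], position - starts[idx]


if __name__ == "__main__":
    contigs = [("c1", "ACGT"), ("c2", "GGCC"), ("c3", "TT")]
    joined, names, starts = concatenate_contigs(contigs)
    print(joined, starts)
    print(map_back(6, names, starts))
```

Here position 6 of the joined string maps back to offset 1 of contig c2.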


MUMmer 2.0: PROmer
 Protein-based alignment program.
 Input: two multi-FASTA files (DNA).
 Technique:
   Translate the DNA into amino acids.
   Create an index that maps each protein to its source DNA.
   Filter the amino acid sequences to remove stop codons.
   Pass the amino acid sequences to MUMmer.
   Use the index to translate the matches back to DNA.
   Run the clustering steps.
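
The translate, index and map-back steps can be sketched in Python as follows. This is a simplified illustration: it handles a single forward reading frame with the standard genetic code, and it assumes that stop codons are removed by splitting the translation at each stop. How PROmer handles frames and stop codons in detail is not stated on the slide. The function names are illustrative assumptions.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_frame(dna, frame=0):
    """Translate one forward frame into (peptide, dna_start) pairs.

    The translation is split at stop codons, and dna_start is the index
    entry that records where each peptide begins in the source DNA.
    """
    peptides, current, start = [], [], None
    for i in range(frame, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3].upper(), "X")  # X: unknown codon
        if aa == "*":
            if current:
                peptides.append(("".join(current), start))
            current, start = [], None
        else:
            if start is None:
                start = i
            current.append(aa)
    if current:
        peptides.append(("".join(current), start))
    return peptides


def protein_to_dna(dna_start, aa_offset):
    """Map an amino acid offset within a peptide back to a DNA position."""
    return dna_start + 3 * aa_offset


if __name__ == "__main__":
    print(translate_frame("ATGGCCTAAATGTTT"))
    print(protein_to_dna(9, 1))
```

A match found at amino acid offset 1 of the second peptide maps back to DNA position 12, because each amino acid covers three nucleotides.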
Results
 Aligning P. yoelii and P. falciparum (size 25 Mb):
   PROmer: < 1 h
   BLAST: ~ weeks
 Aligning E. coli (4.7 Mb) and V. cholerae (3 Mb) on a 1 GHz desktop computer:
   MUMmer 1: 74 s, 293 MB
   MUMmer 2: 27 s, 100 MB
References
 Delcher et al. (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res., 30, 2478-2483.
 Delcher et al. (1999) Alignment of whole genomes. Nucleic Acids Res., 27, 2369-2376.
 http://www.iro.umontreal.ca/~csuros/IFT6299/A2004/materiel/Shakiba.ppt
Questions?
