0% found this document useful (0 votes)
72 views25 pages

Genomic Sequence Alignment

Bioinformatic Sequence Alignment
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
72 views25 pages

Genomic Sequence Alignment

Bioinformatic Sequence Alignment
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 25

Introduction to Bioinformatics

for Medical Research

Gideon Greenspan
[email protected]

Lecture 3
Genomic Sequence Alignment
Genomic Sequence Alignment
• Homology
• Sequence similarity
• Dot plots
• Needleman-Wunsch alignment
• Alignment Variations
• Complexity
• BLASTN tool
2
Homology
• Similarity between species
• Before Darwin, based on morphology
– Similar bones in bat’s wing and whale’s flipper
• Darwin’s evolutionary theory
– Common ancestry
• Genetic sequences
– Evolution made visible
• Beware of incidental similarity
3
Types of Homology

• Orthologs generated by speciation


• Paralogs by gene duplication 4
Homology Motivation
• Study evolution
– Compare different species
• Discover function
– Compare against known species
• Find crucial features
– Compare many examples
• Identify cause of disease
– Compare against healthy individuals
5
Homology via Sequences
• Study evolution
– Multiple alignment and phylogenetic trees
• Discover function
– BLAST searches against database
• Find crucial features
– Motif finding
• Identify cause of disease
– Detect variable sites in alignment
6
Sequence Similarity
• Molecular genetic evidence of homology
• Requires quantification
– Length of match
– Strength of match

GTCTTCGACGTCAGTCGGCACGACCTG
ACCGTACGTCTCAGTTAGCGTGATCAG

Strong but short match Long but weak match


7
Sequence Modifications
• Three types of mutation
– Substitution (point mutation)
– Insertion
Indel (replication slippage)
– Deletion

TCCGT
TCAGT TCGAGT TCAGT
TCGT 8
Scoring Similarity
• Assume independent mutation model
– Each site considered separately
• Score at each site
– Positive if the same
– Negative if different
• Sum to make final score
– Can be positive or negative
– Significance depends on sequence length
9
Substitutions Only
• Pretend there are no indels
– Sequences compared base-by-base
– Count the number of matches and mismatches

TTCGTCGTAGTCGGCTCGACCTG
GTACGTCTAGCGAGCGTGATCCT

9 matches +18 Total score +4


14 mismatches -14 A weak match
10
Including Indels
• Create an ‘alignment’
– Count matches within alignment
– Required if sequences are different length

TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
17 matches +34
2 mismatches - 2 Total score +24
8 indels - 8 A strong match
11
Choosing an Alignment
• Many different alignments are possible
– Should consider all possible
– Take the best score found

TT-CGTCGTAGTCG-GC-TCGACC-TG
+24
GTACGTC-TAG-CGAGCGT-GATCCT-
-TTCGT-CGTAGTC-GGCTCG-ACCTG
0
GTAC-GTCTA-GCGAGCGT-GATCC-T
12
Dot Plots
G T A G T C G G • Early method
T ® ®
• Sequences at
A ®
top and left
G ® ® ® ®
C ® • Dots indicate
G ® ® ® ® matched bases
A ® • Diagonal series
G ® ® ® ® show matched
C ® regions
13
Local Alignment
• Best score for aligning part of sequences
– Often beats global alignment score
• Similar algorithm: Smith-Waterman
– Table cells never score below zero

-ACGCTG Global: ACGCT Local:


CATG-T- +2 ATG-T +4

14
Gap Scores
• Example showed -1 score per indel
– So gap cost is proportional to its length
• Biologically, indels occur in groups
– We want our gap score to reflect this
• Standard solution: affine gap model
– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to algorithm
15
Complexity
• Complexity is determined by size of table
– Aligning a sequence of length m against one of
length n requires calculating (m ¥ n) cells
• Estimate: we calculate 108 cells per second
– Aligning two mRNA sequences of 8,000 bp
requires 64,000,000 cells fi 0.64 seconds
– Aligning an mRNA and a 107 bp chromosome
requires ~1011 cells fi 1,000 secs = 15 minutes

16
Complexity for GenBank
• GenBank contains 3 ¥ 1010 base pairs
– Searching an mRNA against GenBank requires
~2.5 ¥ 1014 cells fi 2.5 ¥ 106 secs = 1 month!
– So each computer could support just one
GenBank search per month
• We need to cut down on alignment
– Use a heuristic method to narrow down the part
of GenBank that could be of interest

17
Genomic Indexing
• Indexing can improve any kind of search
– Index need only be built once
– Accessing an index is instantaneous
• Keys in genomic index are short sequences
– Each key points to many sequences
• Search uses index to isolate candidates
– Alignment is applied only to likely matches
• FASTA and BLAST implementations
18
FASTA and BLAST
• FASTA (Lipman & Pearson 1985)
– First fast sequence searching algorithm
• Gapped BLAST (Altschul et al 1997)
– Originally without gaps (1990)
• BLAST benefits
– Search speed
– Ease of use
– Statistical rigor
19
BLAST Variations
Name Query type Database
blastn Genomic Genomic
blastp Protein Protein
blastx Translated genomic Protein
tblastn Protein Translated genomic
tblastx Translated genomic Translated genomic

• Genomic translations test all 6 possibilities


– 3x for codon frames, 2x for reverse complement
20
BLASTN Parameters
Query: bare sequence, FASTA format,
GenBank accession or identifier

Subsequence: only
use part of query

Database: nr = all
available, month = in last
Go! month, chromosome =
complete genomes, more…
21
BLASTN Results (1)
Query sequence
representation

Matched areas
of database
sequences

Multiple
matches on
sequence
22
BLASTN Results (2)
Sequence
Identifier

Sequence
description

Score and
E value
23
BLASTN Results (3)
Database
sequence

Score
breakdown

Sequence
positions

Matching
sites
24
BLASTN Scores
• Bit score
– Similar to alignment score
– Normalized to account for different schemes
– Higher means more significant
• E value
– Based on random database of similar size
– Number of hits of given score expected
– Lower means more significant
25

You might also like