Sequence Alignment
5/3/2025
Sequence alignment: Overview
seq1: CATTTATTTTC
seq2: AATTTGTA
(Figure labels: match, mismatch, indel)
• Match vs. mismatch.
• Gap: inserted to increase the number of matches; represents an insertion or deletion (indel).
Sequence alignment: Purpose
seq1: CATTTATTTTC
seq2: AATTTGTA
Assembled sequence:
The more that you read, the more things you will know.
5
Sequence homology, similarity and identity
• For DNA, sequence identity and similarity are usually the same measure. For proteins they differ: similarity also counts substitutions between residues with similar physicochemical properties.
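The difference between identity and similarity can be made concrete with a small calculation. The sketch below is only illustrative: dividing by the full alignment length is one of several conventions, and the residue groups in `SIMILAR_GROUPS` are a hypothetical simplification, not a published scheme.

```python
# Hypothetical groups of amino acids treated as "similar" (illustration only).
SIMILAR_GROUPS = [set("ILVM"), set("DE"), set("KRH"), set("ST"), set("FYW"), set("NQ")]

def percent_identity(aligned1, aligned2):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned1) != len(aligned2) or not aligned1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    same = sum(1 for a, b in zip(aligned1, aligned2) if a == b and a != "-")
    return 100.0 * same / len(aligned1)

def percent_similarity(aligned1, aligned2, groups=SIMILAR_GROUPS):
    """Percentage of columns that are identical or fall in the same residue group."""
    if len(aligned1) != len(aligned2) or not aligned1:
        raise ValueError("aligned sequences must be non-empty and of equal length")

    def similar(a, b):
        if a == "-" or b == "-":
            return False  # a gap column is never counted as similar
        return a == b or any(a in g and b in g for g in groups)

    hits = sum(1 for a, b in zip(aligned1, aligned2) if similar(a, b))
    return 100.0 * hits / len(aligned1)

# Protein: one identical column, but every column is identical or similar.
print(percent_identity("KVLE", "RILD"), percent_similarity("KVLE", "RILD"))
# DNA: with no similarity groups, the two measures coincide.
print(percent_identity("CATG", "CACG"), percent_similarity("CATG", "CACG", groups=[]))
```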
Sequence evolution
• Major changes:
• Substitution
• Insertion
• Deletion
Example (figure), starting from the sequence GACTGGA:
Substitution: G -> C gives CACTGGA
Deletion: C gives CATGGA
Speciation event
Substitution: G -> T gives CATGTA
Insertion: T gives CATGTTA
Resulting sequences: CATGTTA and CACTGGA
Sequence alignment: which alignment is the best?
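Scores shown on the slide: 0, -2, -1 (figure label: gap penalty)

Alignment without gaps:
C A T G T T A
| |         |
C A C T G G A

Alignment with gaps (first variant):
C A - T G T T A
| |   | |     |
C A C T G G - A

Alignment with gaps (second variant):
C A T - G T T A
| |     |     |
C A C T G G - A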
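Choosing the best alignment means scoring each candidate under one scheme. The sketch below scores the three alignments of CATGTTA and CACTGGA from this slide. The scheme (match +1, mismatch -1, gap -2) is an assumption, chosen because it yields the three values 0, -2 and -1 shown on the slide; the slide itself does not state it.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum column scores for two aligned rows of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # indel column
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # substitution
    return score

candidates = [
    ("CATGTTA", "CACTGGA"),    # no gaps
    ("CA-TGTTA", "CACTGG-A"),  # gaps, first variant
    ("CAT-GTTA", "CACTGG-A"),  # gaps, second variant
]
for top, bottom in candidates:
    print(top, bottom, score_alignment(top, bottom))
```

Under this assumed scheme the first gapped alignment scores highest, because its two gaps buy two extra matches.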
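Pairwise alignment: Global vs. local
[Figure: global vs. local alignment; labels 100, 50, 100, 250 and "consensus" (?)]
• Global alignment compares the sequences over their entire length (it assumes the two sequences are of similar size).
• Local alignment compares local regions only.
• Note: diagonal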
• Step 1: set up a matrix
• Step 2: fill in the matrix scores
• Step 3: trace back and identify the alignment
CACTGGA
CATGTTA
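The three steps above can be sketched as a small global alignment routine in the style of Needleman and Wunsch. This is a minimal sketch: the scoring values (match +1, mismatch -1, linear gap -2) are assumptions for illustration, and when several alignments tie for the best score the traceback returns only one of them.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, row1, row2)."""
    n, m = len(seq1), len(seq2)

    # Step 1: set up an (n+1) x (m+1) matrix; row 0 and column 0 hold leading gaps.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Step 2: fill each cell with the best of diagonal, up (gap in seq2), left (gap in seq1).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Step 3: trace back from the bottom-right corner to the top-left corner.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            row1.append(seq1[i - 1]); row2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            row1.append(seq1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(seq2[j - 1]); j -= 1
    return F[n][m], "".join(reversed(row1)), "".join(reversed(row2))

score, row1, row2 = needleman_wunsch("CATGTTA", "CACTGGA")
print(score)
print(row1)
print(row2)
```

The matrix has (n+1) x (m+1) cells, so time and memory both grow with the product of the two sequence lengths.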
Sequence alignment: dynamic programming method
[Figure: dynamic programming matrix with Sequence 1 (length n) on one axis and Sequence 2 (length m) on the other]
C A - T G T T A
C A C T G G - A
Scoring matrix
• A substitution matrix is a set of values that quantify the likelihood of one residue being substituted by another in an alignment.
• Scoring matrices for nucleotide sequences are relatively simple: a positive value (high score) is given for a match and a negative value (low score) for a mismatch.
• Scoring matrices for amino acids are more complicated because the scores reflect the physicochemical properties of the residues, as well as the likelihood of certain residues being substituted for one another among true homologous sequences.
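A simple nucleotide scoring matrix as described above can be written as a lookup table. The values +1 and -1 in this sketch are assumptions for illustration. Amino acid matrices (for example BLOSUM62 or PAM250) have the same lookup shape but hold 20 x 20 empirically derived values, so they are not reproduced here.

```python
def nucleotide_matrix(match=1, mismatch=-1, alphabet="ACGT"):
    """Build a match/mismatch scoring matrix as a dict keyed by residue pairs."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}

def score_ungapped(seq1, seq2, matrix):
    """Score two equal-length, ungapped sequences column by column."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))

matrix = nucleotide_matrix()
print(matrix[("A", "A")], matrix[("A", "G")])
print(score_ungapped("CATGTTA", "CACTGGA", matrix))
```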
Waterman wush
Scoring matrix
Residues with different properties are less likely to substitute for one another, so such substitutions are penalized more heavily.
Local alignment: Smith-Waterman algorithm
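A minimal sketch of local alignment in the Smith-Waterman style follows. It differs from global alignment in two ways: cell scores never drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions for illustration.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Local alignment by dynamic programming; returns (score, row1, row2)."""
    n, m = len(seq1), len(seq2)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay at zero
    best, best_pos = 0, (0, 0)

    # Fill the matrix; negative values are reset to 0 so an alignment can start anywhere.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    row1, row2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            row1.append(seq1[i - 1]); row2.append(seq2[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row1.append(seq1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(seq2[j - 1]); j -= 1
    return best, "".join(reversed(row1)), "".join(reversed(row2))

# The short sequence is found inside the longer one; flanking residues are ignored.
print(smith_waterman("TTACGTTT", "ACGT"))
```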
Sequence alignment: dot plots
• Seq1: GATTCTATCTAACTA
• Seq2: GTTCTATTCTAAC
G A T T C T A T - C T A A C T A
|   | | | | | |   | | | | |
G - T T C T A T T C T A A C - -
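A dot plot for the two sequences above can be sketched as a boolean matrix with one cell per residue pair. This minimal version marks every identical pair and applies no window or threshold filtering, which is a simplifying assumption; diagonal runs of marks correspond to matching regions.

```python
def dot_matrix(seq1, seq2):
    """matrix[i][j] is True when seq1[i] and seq2[j] are the same residue."""
    return [[a == b for b in seq2] for a in seq1]

def render_dot_plot(seq1, seq2):
    """Text rendering: seq2 along the top, seq1 down the side, '*' for a match."""
    lines = ["  " + " ".join(seq2)]
    for a, row in zip(seq1, dot_matrix(seq1, seq2)):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)

seq1 = "GATTCTATCTAACTA"
seq2 = "GTTCTATTCTAAC"
print(render_dot_plot(seq1, seq2))
```

A shift from one diagonal to a neighbouring one marks an indel, as with the A at position 2 of Seq1 that has no partner in Seq2.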
Database similarity searching: pairwise alignment on large scale
• Requirements:
• Sensitivity: the ability to find as many correct hits as possible
• Selectivity (specificity): the ability to exclude unrelated hits, returning as few of them as possible
Basic Local Alignment Search Tool (BLAST)
BLAST steps
1. Break the query sequence into words (e.g. 3 amino acids or 11 nucleotides)
2. Scan the database for matches to each word
3. Assume one of the words finds matches in the database
4. Calculate the sums of match scores based on a scoring matrix
5. Find the database sequence corresponding to the best word match and extend the alignment in both directions
6. Determine the high-scoring segment above a threshold (e.g., 22)
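The seed-and-extend idea in the steps above can be sketched as follows. This is a deliberately simplified illustration, not BLAST itself: it seeds on exact word matches only (protein BLAST also accepts similar words that score above a threshold under a scoring matrix), it extends without gaps, and the names `find_seed_hits`, `extend_ungapped` and the `x_drop` cutoff are hypothetical helpers introduced here.

```python
def split_into_words(query, k=3):
    """Step 1: every overlapping word of length k, with its start position."""
    return [(i, query[i:i + k]) for i in range(len(query) - k + 1)]

def find_seed_hits(query, subject, k=3):
    """Steps 2-3: index the subject's words, then look up each query word."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    return [(i, j) for i, word in split_into_words(query, k) for j in index.get(word, [])]

def extend_ungapped(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Steps 4-6: extend a word hit in both directions, keeping the best score seen.

    Extension in a direction stops once the running score falls more than
    x_drop below the best score. Returns (score, query_span, subject_span).
    """
    best = cur = k * match
    best_right = step = 0
    while i + k + step < len(query) and j + k + step < len(subject):
        cur += match if query[i + k + step] == subject[j + k + step] else mismatch
        step += 1
        if cur > best:
            best, best_right = cur, step
        elif best - cur > x_drop:
            break
    cur = best
    best_left = step = 0
    while i - step - 1 >= 0 and j - step - 1 >= 0:
        cur += match if query[i - step - 1] == subject[j - step - 1] else mismatch
        step += 1
        if cur > best:
            best, best_left = cur, step
        elif best - cur > x_drop:
            break
    return best, (i - best_left, i + k + best_right), (j - best_left, j + k + best_right)

query, subject = "AATTTGTA", "CATTTATTTTC"
hits = find_seed_hits(query, subject)
print(hits)
print(extend_ungapped(query, subject, *hits[0]))
```

A segment would then be reported only if its score exceeds the chosen threshold.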
BLAST results
Statistical significance of BLAST search results
E-value = m × n × P
where
m: total number of residues in the database
n: number of residues in the query sequence
P: probability that an alignment is the result of random chance
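The formula above is a plain product, so a worked example is short. The numbers below are made up for illustration and do not come from a real search.

```python
def e_value(db_residues, query_residues, p_random):
    """E-value = m x n x P, using the slide's definitions of m, n and P."""
    return db_residues * query_residues * p_random

# Hypothetical search: 1,000,000 database residues, a 100-residue query,
# and a per-alignment chance probability of 1e-10.
print(e_value(1_000_000, 100, 1e-10))

# The same P against a database 100 times larger gives a 100 times larger E-value.
print(e_value(100_000_000, 100, 1e-10))
```

Because the E-value scales with database size, the same alignment looks less significant when it is found in a larger database.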
Problems
1. Obtain the human HBA and HBB protein sequences. Perform a pairwise
alignment on the NCBI and EBI websites.
2. You have isolated a novel bacterial strain from a soil sample and submitted the
PCR product of its 16S rRNA gene for Sanger sequencing. Now that you have
the 16S rRNA gene sequence, use BLASTN on NCBI to identify your isolate.
Rule of thumb: identity > 97%: same species; identity > 94%: same genus.