Sequence Alignment

Chapter 2 discusses sequence alignments, which are used to infer relatedness, predict functions, and assemble sequences. It covers concepts such as homology, similarity, identity, and the methods for alignment including global and local alignments, dynamic programming, and algorithms like BLAST. Additionally, it emphasizes the importance of scoring matrices and the statistical significance of alignment results.

Uploaded by

CT Hương

Chapter 2.

Sequence Alignments

5/3/2025

Sequence alignment: Overview

• Sequence alignment allows the relatedness of the two sequences under study to be inferred.

seq1: CATTTATTTTC
seq2: AATTTGTA

• Each aligned position is either a match or a mismatch.
• A gap, added to increase the number of matches, represents an insertion or deletion (indel).

Sequence alignment: Purpose

• Predict the function of a sequence by inference from a well-characterized sequence.

seq1: CATTTATTTTC
seq2: AATTTGTA

• Infer evolutionary relationships between sequences: if two sequences share significant similarity, they are likely derived from a common evolutionary origin.
  Note: sequences can be compared to infer function because primary structure determines secondary, tertiary, and quaternary structure (for example, similar active-site regions).
• Predict structural and functional motifs: active sites, receptor sites.

Sequence alignment: Purpose

• Assembly of sequence reads into larger units such as contigs or genomes

Seq 1 The more that


Seq 2 that you read,
Seq 3 you read, the more things
Seq 4 things you will
Seq 5 will know.

Assembled sequence:
The more that you read, the more things you will know.

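The overlap-and-merge idea behind the read example can be sketched in a few lines of Python. This is only an illustration, assuming the reads are already in the right order and overlap exactly; the helper names `overlap` and `assemble` are hypothetical, and real assemblers also have to cope with unordered reads, sequencing errors, and repeats.

```python
def overlap(left, right):
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def assemble(reads):
    """Merge ordered reads into one contig by collapsing exact overlaps."""
    contig = reads[0]
    for read in reads[1:]:
        contig += read[overlap(contig, read):]
    return contig


reads = [
    "The more that",
    "that you read,",
    "you read, the more things",
    "things you will",
    "will know.",
]
print(assemble(reads))
```

Each read contributes only the part that extends past its overlap with the growing contig, so shared words such as "that" appear once in the result.
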
Sequence homology, similarity and identity

• Two sequences are homologous when they share a common ancestor. Homology is a qualitative term, not a quantitative one.
  Note: two homologs are two sequences that share one ancestor. Paralogs arise from gene duplication: duplicated sequences within the same organism. Orthologs are homologs in different species, separated by a speciation event.
• Sequence similarity is the percentage of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity. Similarity is a quantitative term.
• Sequence identity is the percentage of aligned residues that match exactly.
• For DNA, identity is the same as similarity; for proteins, the two differ.

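The difference between identity and similarity for proteins can be made concrete with a small Python sketch. The residue groups below are a hypothetical simplification chosen for illustration (real tools judge similarity with a substitution matrix), and dividing by the full alignment length, gaps included, is one convention among several.

```python
# Hypothetical physicochemical groups for amino acids, for illustration only.
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "STNQ", "AG"]


def is_similar(a, b):
    """True if two residues are identical or fall in the same group."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns with identical residues (gap columns never count)."""
    same = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


def percent_similarity(aligned_a, aligned_b):
    """Percentage of alignment columns with identical or similar residues."""
    similar = sum(
        a != "-" and b != "-" and is_similar(a, b)
        for a, b in zip(aligned_a, aligned_b)
    )
    return 100.0 * similar / len(aligned_a)


# Two aligned protein fragments of equal length; "-" marks a gap.
aligned_a = "KLVE-A"
aligned_b = "RLIDGA"
print(round(percent_identity(aligned_a, aligned_b), 1))
print(round(percent_similarity(aligned_a, aligned_b), 1))
```

In this made-up pair only two of six columns are identical, but five of six are identical or similar, which is why identity and similarity diverge for proteins.
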
Sequence evolution

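• Major changes:
  • Substitution
  • Insertion
  • Deletion

Example, starting from the ancestral sequence GACTGGA:
  Substitution G -> C:  CACTGGA
  Deletion of C:        CATGGA
  Substitution G -> T:  CATGTA
  Insertion of T:       CATGTTA

A speciation event separates the two lineages, whose present-day sequences are CATGTTA and CACTGGA.
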
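The chain of changes in the example can be replayed with three small string operations. This is a sketch only: the slide does not state where each change occurs, so the 0-based positions below are assumptions chosen to reproduce the intermediate sequences shown.

```python
def substitute(seq, pos, base):
    """Replace the residue at `pos` with `base`."""
    return seq[:pos] + base + seq[pos + 1:]


def delete(seq, pos):
    """Remove the residue at `pos`."""
    return seq[:pos] + seq[pos + 1:]


def insert(seq, pos, base):
    """Insert `base` before position `pos`."""
    return seq[:pos] + base + seq[pos:]


ancestor = "GACTGGA"
step1 = substitute(ancestor, 0, "C")  # substitution G -> C (assumed position 0)
step2 = delete(step1, 2)              # deletion of C (assumed position 2)
step3 = substitute(step2, 4, "T")     # substitution G -> T (assumed position 4)
step4 = insert(step3, 4, "T")         # insertion of T (assumed position 4)

for seq in (ancestor, step1, step2, step3, step4):
    print(seq)
```
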
Sequence alignment: which alignment is the best?

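Scoring: match +1, mismatch -1, gap -2 (gap penalty)

Alignment 1: 3 matches, 4 mismatches, score 3 - 4 = -1
C A T G T T A
| |         |
C A C T G G A

Alignment 2: 5 matches, 1 mismatch, 2 gaps, score 5 - 1 - 4 = 0
C A - T G T T A
| |   | |     |
C A C T G G - A

Alignment 3: 4 matches, 2 mismatches, 2 gaps, score 4 - 2 - 4 = -2
C A T - G T T A
| |     |     |
C A C T G G - A
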
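To compare the three candidate alignments, each column can be scored and the scores summed. The sketch below assumes the slide's scheme (match +1, mismatch -1, gap -2); the function name `score_alignment` is hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two gapped sequences of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


candidates = [
    ("CATGTTA", "CACTGGA"),
    ("CA-TGTTA", "CACTGG-A"),
    ("CAT-GTTA", "CACTGG-A"),
]
for top, bottom in candidates:
    print(top, bottom, score_alignment(top, bottom))
```

Under this scheme the three candidates score -1, 0, and -2, so the second alignment is the best of the three.
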
Pairwise alignment: Global vs. local

• In global alignment, the two sequences to be aligned are assumed to be generally similar over their entire length (that is, of roughly the same size).
• Global alignment is suited to closely related sequences.
• Local alignment does not assume similar length between the aligned sequences; it finds the local regions that share the highest level of similarity.
• Local alignment is used to search for conserved regions within sequences.

Sequence alignment: dynamic programming method

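• Global alignment: Needleman and Wunsch algorithm

Match: +1, mismatch: -1, gap: -2

• Step 1: set up a matrix
• Step 2: score the matrix (a diagonal move aligns two residues; a horizontal or vertical move introduces a gap)
• Step 3: trace back and identify the alignment

Sequences to align:
CACTGGA
CATGTTA
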
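The three steps can be sketched in Python as follows. This is a minimal illustration using a linear gap penalty and the slide's default scores; the function name is hypothetical, and when several optimal alignments exist the traceback returns only one of them.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_seq1, aligned_seq2) for one optimal global alignment."""
    n, m = len(seq1), len(seq2)

    # Step 1: set up an (n + 1) x (m + 1) matrix. The first row and column
    # hold the cost of aligning a prefix against gaps only.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Step 2: fill each cell with the best of the diagonal, up, and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq2
                score[i][j - 1] + gap,       # gap in seq1
            )

    # Step 3: trace back from the bottom-right cell to the top-left cell.
    aligned1, aligned2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diagonal_ok = False
        if i > 0 and j > 0:
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            diagonal_ok = score[i][j] == score[i - 1][j - 1] + pair
        if diagonal_ok:
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned1)), "".join(reversed(aligned2))


best_score, top, bottom = needleman_wunsch("CACTGGA", "CATGTTA")
print(best_score)
print(top)
print(bottom)
```

For these two sequences the optimal score is 0 with a gap score of -2. Passing `gap=-3` makes gaps too expensive to pay for themselves, and the ungapped alignment (score -1) becomes optimal.
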
Sequence alignment: dynamic programming method

[Figure: scoring matrix with Sequence 1 (length n) on one axis and Sequence 2 (length m) on the other]

Resulting alignment:
C A - T G T T A
C A C T G G - A

Sequence alignment: dynamic programming method

• Match: +1, mismatch: -1, gap: -3

Scoring matrix
• A substitution matrix is a set of values that quantify the likelihood of one residue being substituted by another in an alignment.
• Substitution matrices are derived from statistical analysis of residue substitution data from sets of reliable alignments of highly related sequences.
• Scoring matrices for nucleotide sequences are relatively simple: a positive value or high score is given for a match, and a negative value or low score for a mismatch.
• Scoring matrices for amino acids are more complicated because the scores reflect the physicochemical properties of amino acid residues, as well as the likelihood of certain residues being substituted among truly homologous sequences.

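One common way to turn substitution statistics into scores is a log-odds ratio: how often a residue pair is observed in trusted alignments compared with how often it would pair up by chance. The slides do not give a formula, so the sketch below is an assumption about the general approach; the function name, the scale factor, and the frequencies are all invented for illustration.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected by chance)."""
    expected = freq_a * freq_b
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical frequencies, not taken from any real matrix.
print(log_odds_score(pair_freq=0.02, freq_a=0.1, freq_b=0.1))   # pair seen more often than chance
print(log_odds_score(pair_freq=0.005, freq_a=0.1, freq_b=0.1))  # pair seen less often than chance
```

A pair observed more often than chance gets a positive score, and a pair observed less often gets a negative one, which matches the sign convention described above.
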
Scoring matrix

Note: residues with different properties are less likely to substitute for one another, so such substitutions are penalized more heavily.

Local alignment: Smith and Waterman algorithm

• Negative scores are replaced by 0.
• Traceback through the scoring matrix starts from the cell with the highest score and ends when a cell scoring 0 is reached.

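Both rules fit into a short Python sketch of local alignment. It assumes a linear gap penalty and simple match/mismatch scores (+1, -1, -2); the function name and the test sequences are hypothetical, and only one best local alignment is returned.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_seq1, aligned_seq2) for one best local alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # Negative scores are replaced by 0.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    aligned1, aligned2 = [], []
    i, j = best_cell
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(aligned1)), "".join(reversed(aligned2))


# A shared motif (ACGT) embedded in unrelated flanking sequence.
print(smith_waterman("GGGACGTCCC", "TTTACGTAAA"))
```

Because scores never drop below 0, the unrelated flanks cannot drag the alignment down, and only the shared region is reported.
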
Sequence alignment: dot plots
• Seq1: GATTCTATCTAACTA
• Seq2: GTTCTATTCTAAC

G A T T C T A T - C T A A C T A
|   | | | | | |   | | | | |
G - T T C T A T T C T A A C - -

• Put a dot wherever a match is found.
• Connect the dots along the diagonals.
• Drawback: high noise.
• Solution: use a sliding window with a threshold.

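A dot plot with an optional sliding window can be sketched as below. This is an illustrative version only: the function names are hypothetical, and the window of 3 with a threshold of 3 is an arbitrary choice to show the noise reduction.

```python
def dot_plot(seq1, seq2, window=1, threshold=1):
    """Return the (i, j) window start positions with at least `threshold` matches."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(seq1, seq2, dots):
    """Draw the plot as text: seq1 across the top, seq2 down the side."""
    lines = ["  " + seq1]
    for j, base in enumerate(seq2):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq1)))
        lines.append(base + " " + row)
    return "\n".join(lines)


seq1 = "GATTCTATCTAACTA"
seq2 = "GTTCTATTCTAAC"
raw = dot_plot(seq1, seq2)                               # one dot per matching residue pair
filtered = dot_plot(seq1, seq2, window=3, threshold=3)   # only exact 3-residue matches

print(render(seq1, seq2, raw))
print()
print(render(seq1, seq2, filtered))
```

With a window of 1 every chance match produces a dot. Requiring three consecutive matches removes isolated dots and leaves the diagonal runs easier to see.
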
Database similarity searching: pairwise alignment on large scale

• Database searching: a means of assigning putative functions to newly determined sequences.
• How: by pairwise alignment on a large scale, comparing a query sequence (the input sequence) against thousands of sequences in the database.

Database similarity searching: pairwise alignment on large scale

• Requirements:
  • Sensitivity: the ability to find as many correct hits as possible
  • Selectivity (specificity): the ability to exclude unrelated hits, returning as few of them as possible
  • Speed: the time it takes to get results
• Approaches:
  • Exhaustive type: dynamic programming (Smith and Waterman algorithm)
  • Heuristic type: takes shortcuts by reducing the search space

Basic Local Alignment Search Tool (BLAST)

• Developed by Stephen Altschul and colleagues at NCBI in 1990
• Became one of the most popular programs for sequence analysis
• Uses a heuristic approach to align a query sequence with all sequences in the database
• Objective: find high-scoring ungapped segments among related sequences.

BLAST steps
1. Break the query sequence into words (e.g., 3 amino acids or 11 nucleotides).
2. Scan the query one word at a time (e.g., every 3 residues) and use each word to search the word database.
3. Suppose one of the words finds matches in the database.
4. Calculate the sum of the match scores based on a scoring matrix.
5. Find the database sequence corresponding to the best word match and extend the alignment in both directions.
6. Determine the high-scoring segment whose score is above a threshold (e.g., 22).

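The seed-and-extend idea in these steps can be sketched as a toy Python example. It is a heavy simplification and not BLAST itself: it seeds on exact word matches only, scores with simple match/mismatch values instead of a scoring matrix, and stops extending when the score falls a fixed amount below the best seen. All function names, parameters, and sequences are hypothetical.

```python
def seed_hits(query, subject, word_size):
    """Find (query_pos, subject_pos) pairs where a query word occurs exactly in the subject."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


def extend(query, subject, i, j, word_size, match=1, mismatch=-1, drop=2):
    """Extend a seed without gaps in both directions.

    Returns (score, query_start, subject_start, length).
    """

    def walk(qi, sj, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > drop:
                break  # score fell too far below the best seen; stop extending
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + word_size, j + word_size, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    score = word_size * match + left_score + right_score
    return score, i - left_len, j - left_len, word_size + left_len + right_len


def best_segment(query, subject, word_size=4, threshold=6):
    """Return the highest-scoring ungapped segment at or above `threshold`, or None."""
    segments = [
        extend(query, subject, i, j, word_size)
        for i, j in seed_hits(query, subject, word_size)
    ]
    segments = [segment for segment in segments if segment[0] >= threshold]
    return max(segments, default=None)


query = "CCCACGTACGTGGG"
subject = "TTTACGTACGTAAA"
result = best_segment(query, subject)
print(result)
if result is not None:
    score, q_start, s_start, length = result
    print(query[q_start:q_start + length])
```

Only positions that share a whole word are ever examined, which is the shortcut that makes the heuristic faster than filling a full dynamic programming matrix.
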
BLAST results

Statistical significance of BLAST search results

• The E-value (expectation value) is the number of alignments with an equal or better score that are expected to arise by random chance in a database search. The lower the E-value, the more significant the match.

E-value = m × n × P
m: total number of residues in the database
n: number of residues in the query sequence
P: probability that an alignment is a result of random chance

E.g., E-value = 10^12 × 100 × 10^-20 = 10^-6

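The worked example can be checked with a one-line function. This sketch simply evaluates the formula as given on the slide; the function and parameter names are hypothetical.

```python
def e_value(db_residues, query_residues, p_random):
    """E-value = m x n x P, following the slide's definition."""
    return db_residues * query_residues * p_random


# m = 10^12 residues in the database, n = 100 residues in the query, P = 10^-20
e = e_value(db_residues=1e12, query_residues=100, p_random=1e-20)
print(f"{e:.0e}")
```

Because the E-value scales with m, the same alignment becomes less significant when it is found in a larger database.
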
BLAST results

Problems

1. Obtain the human HBA and HBB protein sequences. Perform a pairwise alignment on the NCBI and EBI websites.
2. You have isolated a novel bacterial strain from a soil sample and submitted the PCR product of its 16S rRNA gene for Sanger sequencing. Now that you have the 16S rRNA gene sequence, use blastn on NCBI to identify your isolate.
   Rule of thumb: identity > 97%: same species; identity > 94%: same genus.

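The identity thresholds noted for problem 2 can be written as a small helper for interpreting the top blastn hit. This is a sketch of the rule of thumb only; the function name and the label for hits at or below 94% are assumptions.

```python
def classify_16s(identity_percent):
    """Map a 16S rRNA percent identity to a rough taxonomic call (rule of thumb only)."""
    if identity_percent > 97:
        return "same species"
    if identity_percent > 94:
        return "same genus"
    return "no genus-level match"  # label assumed; the notes give no rule below 94%


for identity in (99.2, 95.5, 90.0):
    print(identity, classify_16s(identity))
```
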
