
Sequence Alignment vs. Database: Task: Given A Query Sequence and Millions of

The document describes the difference between sequence alignment and database searching. It discusses naive sequence alignment approaches and introduces heuristic algorithms like FASTA and BLAST that use word matching to quickly filter out irrelevant database records before applying computationally expensive alignment algorithms. These fast filtering approaches improve the speed of searching large databases for similar sequences.

Uploaded by

Andreas Rousalis
Copyright
© All Rights Reserved

Sequence Alignment vs. Database

Task: Given a query sequence and millions of database records, find the optimal alignment between the query and a record.

ACTTTTGGTGACTGTAC
Sequence Alignment vs. Database

Tool: Given two sequences, there exists an algorithm to find the best alignment.

Naive solution: Apply the algorithm to each of the records, one by one.
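
The naive solution can be sketched as follows. The slides do not name the exact algorithm, so this sketch assumes a Smith-Waterman local alignment score with simple match/mismatch/gap weights. The function names and the weights are illustrative choices, not part of the original material.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score (assumed scoring: linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def naive_search(query, records):
    """Run the exact algorithm on every record; return (best score, best record)."""
    best_record = max(records, key=lambda rec: local_alignment_score(query, rec))
    return local_alignment_score(query, best_record), best_record
```

Each call costs time proportional to the product of the two sequence lengths, and the scan repeats it once per record, which is what makes this approach impractical for millions of records.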
Sequence Alignment vs. Database
Problem: An exact algorithm is too slow to run millions of times (even a linear-time algorithm will run slowly on a huge DB).

Solution:
- Run in parallel (expensive).
- Use a fast (heuristic) method to discard irrelevant records, then apply the exact algorithm to the remaining few.
Sequence Alignment vs. Database

General Strategy of Heuristic Algorithms:

- Homologous sequences are expected to share at least some short un-gapped segments (possibly with substitutions, but without insertions or deletions).
- Preprocess the DB into a fast-access data structure of short segments.
FASTA Idea

Idea: a good alignment probably matches some identical words (ktups).
Example:
Database record:
ACTTGTAGATACAAAATGTG
Aligned query sequence:
A-TTGTCG-TACAA-ATCTGT
Matching words of size 4
Dictionaries of Words
ACTTGTAGATAC is translated to the dictionary:
ACTT,
CTTG,
TTGT,
TGTA,
...
Dictionaries of well-aligned sequences share words.
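
A minimal sketch of such a dictionary, assuming it maps every word of length k to the positions where it starts (the function name `word_dictionary` and the position lists are illustrative choices, not taken from the slides):

```python
from collections import defaultdict


def word_dictionary(seq, k=4):
    """Map each length-k word of seq to the list of 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return dict(index)


def shared_words(seq_a, seq_b, k=4):
    """Words of length k that occur in both sequences."""
    return set(word_dictionary(seq_a, k)) & set(word_dictionary(seq_b, k))
```

Storing positions rather than only the words makes it possible to locate where in the two sequences a shared word occurs.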
FASTA Stage I

Prepare a dictionary for each DB sequence (in advance).

Upon query:
- Prepare a dictionary for the query sequence.
- For each DB record:
  - Find matching words.
  - Search for long diagonal runs of matching words.
  - Init-1 score: longest run.
  - Discard the record if the score is low.

[Figure: dot plot of matching words (* = matching word); axes: position in query and position in DB record.]
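
Stage I can be sketched as below. This follows the slide literally and scores a run by its length in words. That scoring is a simplifying assumption, and the helper names are hypothetical. A word match at query position q and record position r lies on diagonal r - q, so a diagonal run is a set of consecutive query positions that all hit the same diagonal.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each length-k word to its 0-based start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def init1_score(query, record, k=4):
    """Length (in words) of the longest diagonal run of matching words."""
    record_index = kmer_positions(record, k)
    diagonals = defaultdict(set)  # diagonal (r - q) -> query positions with a hit
    for q in range(len(query) - k + 1):
        for r in record_index.get(query[q:q + k], ()):
            diagonals[r - q].add(q)

    best = 0
    for hits in diagonals.values():
        for q in hits:
            if q - 1 in hits:
                continue  # q is inside a run, not at its start
            length = 1
            while q + length in hits:
                length += 1
            best = max(best, length)
    return best
```

A record would then be discarded when `init1_score` falls below a chosen cutoff. The slides do not give a cutoff value.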
FASTA stage II

A good alignment path passes through many runs, with short connections between them.
- Assign weights to runs (+) and connections (-).
- Find a path of maximum weight.
- Init-n score: total path weight.
- Discard the record if the score is low.
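
One way to find the maximum-weight path is dynamic programming over the runs. The sketch below assumes each run is given as a tuple `(query_start, record_start, length, weight)` and that every connection costs a constant `join_penalty`. Both the representation and the constant penalty are assumptions made for illustration.

```python
def initn_score(runs, join_penalty=2):
    """Weight of the best chain of runs.

    A run may follow another only if it starts after the earlier run ends,
    in both the query and the record.
    """
    runs = sorted(runs)  # by query_start, so predecessors come first
    best = []            # best[j] = weight of the best chain ending with run j
    for j, (qj, rj, _, wj) in enumerate(runs):
        score = wj  # chain consisting of run j alone
        for i in range(j):
            qi, ri, li, _ = runs[i]
            if qi + li <= qj and ri + li <= rj:
                score = max(score, best[i] + wj - join_penalty)
        best.append(score)
    return max(best, default=0)
```

For example, `initn_score([(0, 0, 4, 4), (6, 7, 5, 5), (3, 20, 3, 3)])` chains the first two runs, since the third lies too far down the record to precede the second.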
FASTA Stage III

Improve Init-1: apply an exact algorithm around the Init-1 diagonal, within a band of given width.
- Init-1 -> Opt-score (the new weight).
- Discard the record if the score is low.
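
A banded alignment can be sketched as follows, assuming a Smith-Waterman local score with simple match/mismatch/gap weights (the slides do not specify the exact algorithm or the scoring). Only cells whose diagonal lies within `width` of the chosen diagonal are computed.

```python
def banded_local_score(query, record, diagonal, width,
                       match=1, mismatch=-1, gap=-1):
    """Local alignment score restricted to cells with |(r - q) - diagonal| <= width."""
    score = {}  # (q, r), 1-based -> best local score ending at that cell
    best = 0
    for q in range(1, len(query) + 1):
        lo = max(1, q + diagonal - width)
        hi = min(len(record), q + diagonal + width)
        for r in range(lo, hi + 1):
            sub = match if query[q - 1] == record[r - 1] else mismatch
            # Cells outside the band are never stored; reading them as 0
            # cannot beat the local-alignment floor of 0.
            cell = max(
                0,
                score.get((q - 1, r - 1), 0) + sub,
                score.get((q - 1, r), 0) + gap,
                score.get((q, r - 1), 0) + gap,
            )
            score[(q, r)] = cell
            best = max(best, cell)
    return best
```

The work is proportional to the query length times the band width, instead of the product of both sequence lengths.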
FASTA final stage

Apply an exact algorithm to the surviving records, computing the final alignment score.
BLAST (Basic Local Alignment Search Tool)
Approximate Matches

BLAST:
Words are allowed to match inexactly.
Example:
In the polypeptide sequence IHAVEADREAM, the 4-long word HAVE starting at position 2 may match
HAVE, RAVE, HIVE, HALE, ...
Approximate Matches

For each word from the DB, generate similar words (according to the substitution matrix) and store them in a look-up table.
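
A minimal sketch of the look-up table. It does not use a real substitution matrix: `toy_score` is a hypothetical scoring (identical letters score 2, a few hand-picked similar pairs score 1, everything else scores -1), chosen only so that the HAVE example can be reproduced. A real implementation would read the scores from a matrix such as BLOSUM62.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical "similar" pairs, for illustration only.
SIMILAR = {frozenset("HR"), frozenset("AI"), frozenset("VL")}


def toy_score(a, b):
    """Toy substitution score (an assumption, not a real matrix)."""
    if a == b:
        return 2
    return 1 if frozenset((a, b)) in SIMILAR else -1


def neighborhood(word, threshold):
    """All words whose total score against `word` is at least `threshold`."""
    results = []

    def extend(prefix, score):
        pos = len(prefix)
        if pos == len(word):
            results.append(prefix)
            return
        remaining_best = 2 * (len(word) - pos - 1)  # later letters all identical
        for letter in AMINO_ACIDS:
            s = score + toy_score(word[pos], letter)
            if s + remaining_best >= threshold:  # prune hopeless branches
                extend(prefix + letter, s)

    extend("", 0)
    return results


def build_lookup(sequence, k, threshold):
    """Map every neighbor word to the 0-based positions of the words it resembles."""
    table = {}
    for i in range(len(sequence) - k + 1):
        for neighbor in neighborhood(sequence[i:i + k], threshold):
            table.setdefault(neighbor, []).append(i)
    return table
```

With this toy scoring and a threshold of 7, the neighborhood of HAVE is HAVE, RAVE, HIVE and HALE. Positions in the table are 0-based, so the word starting at position 2 in the slide's example is stored at index 1.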
BLAST Stage I
Find approximately matching word pairs
Extend word pairs as much as possible,
i.e., as long as the total weight increases
Result: High-scoring Segment Pairs (HSPs)

THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEWASNINETEEN
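
The extension step can be sketched as an un-gapped extension in both directions from a word hit. The slide says to extend as long as the total weight increases. This sketch assumes a slightly more tolerant rule, in which extension stops once the score has fallen more than `x_drop` below the best value seen. The match/mismatch weights and the `x_drop` parameter are illustrative assumptions.

```python
def best_extension(pair_scores, x_drop):
    """Best cumulative gain over a stream of per-position scores, and its length."""
    best_gain = gain = length = 0
    for n, s in enumerate(pair_scores, start=1):
        gain += s
        if gain > best_gain:
            best_gain, length = gain, n
        elif best_gain - gain > x_drop:
            break  # fallen too far below the best score: stop extending
    return best_gain, length


def extend_hit(query, record, q, r, k, match=1, mismatch=-1, x_drop=2):
    """Extend the word hit query[q:q+k] / record[r:r+k] without gaps.

    Returns (score, query_start, record_start, length) of the segment pair.
    """
    def scores(pairs):
        return (match if a == b else mismatch for a, b in pairs)

    seed = sum(scores(zip(query[q:q + k], record[r:r + k])))
    right_gain, right = best_extension(
        scores(zip(query[q + k:], record[r + k:])), x_drop)
    left_gain, left = best_extension(
        scores(zip(reversed(query[:q]), reversed(record[:r]))), x_drop)
    return seed + left_gain + right_gain, q - left, r - left, k + left + right
```

Segment pairs whose score exceeds a chosen cutoff would be kept as HSPs. The slides do not state the cutoff.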
BLAST Stage II

Try to connect HSPs by aligning the sequences in between them:

THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEW___ASNINETEEN
