Introduction to Bioinformatics for Computer Scientists

Lecture 4
BLAST & Genome assembly

Alexey Kozlov
Staff Scientist
[Link]@[Link]

Exelixis Lab

October 30, 2023

Today's agenda
● Search methods for biological sequences
– Public sequence databases
– BLAST algorithm
– Alternative search algorithms
● Genome assembly
– De novo assembly
– By-reference assembly

Disclaimer
● Today: Basic algorithms
– Solving an idealized problem
● Real programs are much more complicated
– Deal with biological “fuzziness”
– Heuristics, optimizations...
● I’m not an expert
– Ask your advanced questions anyways :)
– Only basic stuff is relevant for the exam

Searching for similar sequences
● Assumption:
– sequence similarity ⇔ evolutionary and/or functional relatedness

[Figure: a query sequence Q (ACTTGCTA..., length m) is searched against an annotated sequence database S1...S5 (d sequences, each of length n); the query hits (results) are S4 and S3.]

Sequence databases: proteins
● UniProt
– extensive annotations (3D structure, functions etc.)
– 560K manually curated entries + 180M automatically annotated entries
● AlphaFold DB
– 200M protein structures → high-quality prediction
– [Link]

Sequence databases: GenBank
● NCBI GenBank/INSDC → “all” DNA + proteins
– > 246M “regular” sequences as of Aug. 2023
– ~ 2.6B whole-genome shotgun (WGS) sequences
– “primary” database → sequences submitted and annotated by the respective authors (biologists)

Sequence annotation in GenBank
[Figure: annotated GenBank record with the following callouts]
– entry name: usually includes organism and gene name
– accession number (~ sequence ID)
– organism name and taxonomy (= biological classification)
– sequence

Sequence databases: barcodes
● “Secondary” databases for barcoding genes
– quality filtering, curated taxonomy, sample geodata...
– useful for species identification/classification
● SILVA → [Link]
– 16S, >10M sequences, focus on Bacteria + Archaea
● BOLD → [Link]
– COI, >7M sequences, focus on Animals
● UNITE → [Link]
– ITS, 1.7M sequences, focus on Fungi

Searching for similar sequences
● Assumption:
– sequence similarity ⇔ evolutionary and/or functional relatedness

[Figure: the same search as before, now with annotations: the query Q is unknown ("??? ?????"), while the hits are annotated as p53 Human (S4) and p53 Gorilla (S3).]

Naïve approach
● Use the Smith-Waterman algorithm to build a pairwise alignment of the query sequence Q with every sequence in the database S1...Sd
● Sort alignments by similarity score
● Report best match(es)

[Figure: alignment scores S1 = 90, S2 = 70, S3 = 125, S4 = 140, S5 = 65; sorted query hits: S4 (140), S3 (125)]

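The naïve search can be sketched in a few lines of Python. This is a minimal illustration, not how a production tool works: the scoring values (+2 match, -3 mismatch, linear gap penalty of -5), the function names and the toy database are all assumptions made for this example.

```python
def smith_waterman_score(q, s, match=2, mismatch=-3, gap=-5):
    """Best local alignment score of q and s (assumed linear gap penalty)."""
    prev = [0] * (len(s) + 1)
    best = 0
    for i in range(1, len(q) + 1):
        curr = [0] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            sub = match if q[i - 1] == s[j - 1] else mismatch
            # Local alignment: a cell never drops below 0.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def naive_search(query, database, top=2):
    """Align the query against every database sequence and rank by score."""
    scores = {name: smith_waterman_score(query, seq) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]


# Toy database, made up for illustration only.
database = {
    "S1": "GGGGGGGG",
    "S2": "ACTTGCTAAC",
    "S3": "TTACTTGCAA",
}
hits = naive_search("ACTTGCTAAC", database)
print(hits)
```

In this toy run S2, which is identical to the query, ranks first. Note that every database sequence gets its own full DP matrix, which is exactly the cost the next slide is concerned with.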
Naïve approach: complexity
● Smith-Waterman has a complexity of O(m × n) for a query of length m and a database sequence of length n
● DP matrix for every sequence in the database:

[Figure: one DP matrix per pair (Q, S1), (Q, S2), (Q, S3), ..., (Q, Sd); database size d]

● Total cost of one search: O(d × m × n)
● NCBI GenBank: d = 200 million
➔ Do we really need to compute all matrices?

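To get a feeling for the numbers, here is a rough back-of-the-envelope estimate. Only d comes from the slide; the sequence lengths m and n and the throughput of one billion DP cells per second are assumptions picked for illustration, not measurements.

```python
m = 1_000                # assumed query length
n = 1_000                # assumed average database sequence length
d = 200_000_000          # database size from the slide
cells_per_second = 1e9   # assumed throughput of a single core

total_cells = d * m * n  # one m x n DP matrix per database sequence
seconds = total_cells / cells_per_second
print(f"{total_cells:.1e} DP cells, about {seconds / 86_400:.1f} days per query")
```

Under these assumptions a single query would keep one core busy for a couple of days, which motivates a heuristic that skips most of the matrices.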
BLAST
● Stands for Basic Local Alignment Search Tool
● Fast heuristic to find similar sequences
● One of the most widely used algorithms in bioinformatics
– Two main papers from the 1990s have ~180 000 citations [1]
– cf. RAxML: ~40 000, Human Genome Sequence: ~45 000
● Available both as a stand-alone application and as a web service
– BLAST on NCBI GenBank database: [Link]

[1] Altschul et al. (1990), Altschul et al. (1997)

BLAST: algorithm outline
● Idea: reduce search space by ignoring dissimilar regions
● Three-step heuristic:
– Seeding: find common subwords between query and database sequences → seeds
– Extension: starting from seeds, extend alignment in both directions → high-scoring segment pairs (HSPs)
– Evaluation: assess the statistical significance of each HSP

BLAST: seeding
● Pre-process the query sequence
● Build a list L1 of all subwords (factors) of length w in the query sequence
● Example with w := 5
Query: ACTTGCTAAC
L1: ACTTG, CTTGC, TTGCT, TGCTA, GCTAA, CTAAC

BLAST: seeding (2)
● For each subword Wi ∈ L1, build a neighborhood Ni that includes “similar” subwords
● Similarity is defined via a substitution matrix
– For proteins, BLOSUM matrices can be used
– For DNA, +2 for a match and -3 for a mismatch (or +5/-4)
● Only subwords with a similarity score above a threshold T are added to the neighborhood
● Example with T := 4
Wi:        T  T  G  C  T
Candidate: T  T  G  A  T
Score:    +2 +2 +2 -3 +2 = 5 > 4 → added to Ni

Wi:        T  T  G  C  T
Candidate: T  G  C  C  T
Score:    +2 -3 -3 +2 +2 = 0 < 4 → not added

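A brute-force sketch of the neighborhood construction for DNA, assuming the +2/-3 scoring and T = 4 from the example. The function names are made up for this illustration, and enumerating all 4^w candidate words is only feasible for small w; real implementations are considerably smarter.

```python
from itertools import product


def word_score(a, b, match=2, mismatch=-3):
    """Position-by-position similarity score of two equally long words."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighborhood(word, threshold, alphabet="ACGT"):
    """All words of the same length that score above the threshold against word."""
    candidates = ("".join(p) for p in product(alphabet, repeat=len(word)))
    return {c for c in candidates if word_score(word, c) > threshold}


N = neighborhood("TTGCT", threshold=4)
print(sorted(N))
```

With these parameters a single mismatch still scores 5 > 4, while two mismatches score 0, so the neighborhood of a 5-letter word consists of the word itself plus its 15 single-mismatch variants.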
BLAST: seeding (3)
● Combine the neighborhoods of all subwords of a single query sequence
L2 = ∪ Ni
● Build the final list by adding the subwords themselves
L = L1 ∪ L2
● Scan the database for exact matches of subwords in L
● Example:
L (excerpt): ACTTG, CTTGC, TTGCT, TGCTA, CTAAC, CTAAT, TTGAT
Database sequence: AGCTA[TTGAT]GACTG (seed in brackets)

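The scan for exact matches can be sketched with a set lookup over every window of the database sequence. The helper name find_seeds is hypothetical, and the word list is only the excerpt of L shown in the example, not the complete list.

```python
def find_seeds(words, db_seq):
    """Return (position, word) for every exact occurrence of a word from L."""
    word_set = set(words)
    w = len(next(iter(word_set)))  # all words are assumed to have the same length
    return [
        (i, db_seq[i:i + w])
        for i in range(len(db_seq) - w + 1)
        if db_seq[i:i + w] in word_set
    ]


L = ["ACTTG", "CTTGC", "TTGCT", "TGCTA", "CTAAC", "CTAAT", "TTGAT"]
seeds = find_seeds(L, "AGCTATTGATGACTG")
print(seeds)
```

Positions are 0-based, so the seed TTGAT is reported at position 5 of the database sequence.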
BLAST: extension
● Try to extend the alignment to the left and to the right of the seed
● Stop if the current total score drops by more than X below the maximum seen so far
● Trim the alignment back to the maximum score

[Figure: seed on the reference sequence, extended in both directions]

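The three rules can be sketched as an ungapped extension with a drop-off threshold X. This is a simplified illustration under assumed +2/-3 scoring; the function names are hypothetical, and each side is extended independently from the seed boundary. It is applied to the database sequence AGCTATTGATGACTG and the query CACTTGCTAAC with the seed TTGAT/TTGCT and X = 3.

```python
def extend_one_side(query, db, q_pos, d_pos, step, x_drop, match=2, mismatch=-3):
    """Extend from (q_pos, d_pos) in direction step (+1 right, -1 left).

    Returns the best score gain and the number of columns kept after trimming.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= q_pos < len(query) and 0 <= d_pos < len(db):
        score += match if query[q_pos] == db[d_pos] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length  # new maximum: remember where to trim
        elif best - score > x_drop:
            break                           # dropped by more than X: stop
        q_pos += step
        d_pos += step
    return best, best_len


def extend_seed(query, db, q_start, d_start, w, x_drop, match=2, mismatch=-3):
    """Return (score, query range, database range) of the extended segment pair."""
    seed_score = sum(
        match if query[q_start + k] == db[d_start + k] else mismatch for k in range(w)
    )
    right_gain, right_len = extend_one_side(
        query, db, q_start + w, d_start + w, +1, x_drop, match, mismatch)
    left_gain, left_len = extend_one_side(
        query, db, q_start - 1, d_start - 1, -1, x_drop, match, mismatch)
    return (
        seed_score + left_gain + right_gain,
        (q_start - left_len, q_start + w + right_len),
        (d_start - left_len, d_start + w + right_len),
    )


hsp = extend_seed("CACTTGCTAAC", "AGCTATTGATGACTG", q_start=3, d_start=5, w=5, x_drop=3)
print(hsp)
```

Ranges are 0-based and half-open. In this sketch the left extension is abandoned after two mismatches, while the right extension recovers from one mismatch and is kept.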
BLAST: extension (2)
● Example with X := 3 (seed in brackets)

DB sequence:  A G C T A [T T G A T] G A C T G
Query:            C A C [T T G C T] A A C
Seed scores:  +2 +2 +2 -3 +2
Current score: 5   Max score: 5   Diff: 0