
Lecture 3

The document outlines homework assignments related to bioinformatics, focusing on gene structure prediction, sequence alignment, and database searching using tools like BLAST. It includes exercises on identifying sequences and analyzing results from BLAST searches, as well as discussions on scoring parameters and alignment algorithms. The document serves as a guide for students in a bioinformatics course to complete their assignments effectively.

HOMEWORK Day 2

Extra point for:

- Revising your Homework 2 from Day 1.

Homework 2 Day 1
- Extract the FASTA sequence of the genomic region of your genes (from
Homework Day 1) and predict the gene structure of these DNA sequences using
one gene prediction program. Summarize the exons and introns from your
prediction, and write down your observations and conclusion.

- Find transcript information about a specific gene using NCBI & Ensembl
and compare it with the prediction from your bioinformatics program.

- Explore the genomic information of your genes (from Homework Day 1) using
Ensembl (see exercise 2 for details).

- Between Ensembl and NCBI, which one would you prefer when searching for
information on human genes? Why?

Homework Day 2
Which ORF is a real gene?

Homologous sequences in databank


Is my ORF similar to an already known protein ?

2
Bioinformatics
-----
Sequence Alignment and Database Search
Course Provider: PhD Tam Tran
Department of Life Sciences (LS) – USTH
Email: [email protected]

Master in Medical biotechnology - Plant biotechnology – Pharmacology – Year 2


Outline

• Database searching

• Sequence Alignment Algorithms

• Local Alignment with BLAST

4
Database searching

We compare our DNA/protein sequences to all known sequences in the database to find
any matches

One sequence (query) → Lots of sequences (GenBank, SwissProt, NR, DDBJ)

5
Why do we compare sequences?

Human vs. Chimpanzee

A similarity between 2 sequences may indicate

• a common biological function
• a similar 3D structure
• a common evolutionary origin → homology

6
Sequence Alignment Algorithms

7
What is an algorithm?
An algorithm is a step-by-step procedure to solve a problem

Input → Algorithm → Output

8
Algorithm for Pairwise Sequence Alignment

E.g.: Compare two sequences ATGCATGC and TGCATGCA

Without gaps:            With gaps:
ATGCATGC                 ATGCATGC-
TGCATGCA                 -TGCATGCA
no matching positions    seven matching positions

Key questions for alignment of two sequences:

Q1: What do we want to align?

Q2: How do we “score” an alignment?

Q3: How do we find the “best” alignment?

9
Q1: What Do We Want to Align?

• Global alignment: find best match of both sequences in their entirety

• Local alignment: find best subsequence match

• Semi-global alignment: find best match without penalizing gaps on the
start or end of the alignment
10
Q2: How Do We Score Alignments?

S1: TACG---A--TTCAGATACG
     |||   |  ||||
S2: AACGCTAACGTTCAATCGTC

Score(alignment) = Total cost of editing S1 into S2

• Cost of substitution (-1)
• Cost of insertion / deletion (-1)
• Reward of match (+2)
• Score for gaps (-1)

Edit operations:
Substitution   Insertion   Deletion
AATAAGC        AAT-AAGC    AATAAGC
AATTAAGC       AATTAAGC    AA-AAGC

We would score it by:

s(T,A) + m(A,A) + m(C,C) + m(G,G) + 3g + m(A,A) + 2g …

→ As printed, the alignment has 8 matches, 7 substitutions and 5 gap positions, so the score is (8 x 2) - (7 x 1) - (5 x 1) = 4
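
The column-by-column scoring described above can be sketched in a few lines of Python. This is only an illustration under the slide's costs (match +2, substitution -1, gap -1); the function name `score_alignment` is made up for this example.

```python
def score_alignment(row1, row2, match=2, substitution=-1, gap=-1):
    """Score two equal-length alignment rows column by column."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            total += gap            # insertion or deletion
        elif a == b:
            total += match          # identical letters
        else:
            total += substitution   # different letters
    return total


# The S1/S2 alignment from the slide
s1 = "TACG---A--TTCAGATACG"
s2 = "AACGCTAACGTTCAATCGTC"
print(score_alignment(s1, s2))
```

Changing the three keyword arguments is enough to try a different scoring scheme on the same alignment.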


11
Q3: How Do We Find the Best Alignment?

• Simple approach: compute & score all possible alignments

E.g. How many possible global alignments are there for 2 sequences of length 3 (TAT and TCT)?

12
• But: too many possible alignments
Possible global alignments for 2 sequences of length n

e.g. two sequences of length 100 have ~10^77 possible alignments

We need a smart algorithm
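
To get a feeling for how fast the number of alignments grows, here is a small sketch. It assumes one particular counting convention, which may differ from the one used for the figure quoted above: every alignment column is either a letter pair, a gap in the first sequence, or a gap in the second sequence. The function name `count_global_alignments` is hypothetical.

```python
def count_global_alignments(n, m):
    """Count global alignments of two sequences of length n and m.

    Assumption: each column is a pair of letters, a gap in sequence 1,
    or a gap in sequence 2, and every distinct column order is counted.
    """
    # counts[i][j] = number of ways to align the first i and first j letters
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            counts[i][j] = (counts[i - 1][j - 1]   # last column is a letter pair
                            + counts[i - 1][j]     # last column has a gap in sequence 2
                            + counts[i][j - 1])    # last column has a gap in sequence 1
    return counts[n][m]


for length in (1, 2, 3, 10, 100):
    print(length, count_global_alignments(length, length))
```

Under this convention the count grows exponentially with the sequence length, which is why scoring every alignment one by one is not practical.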

13
Dot plot

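• Maybe a dot plot will help

[Figure: dot plot with Sequence 1 on the vertical axis (letters G, A, T, A, C, T from top to bottom) and the Query (A C A T A G) on the horizontal axis]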
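
A dot plot simply marks every cell where the letter of one sequence equals the letter of the other. The sketch below is a minimal text version. It assumes Sequence 1 reads GATACT in the order the letters are listed on the vertical axis (the reading direction is not certain), and the function names are invented for this example.

```python
def dot_plot(vertical, horizontal):
    """Return a grid of booleans: True where the two letters are identical."""
    return [[a == b for b in horizontal] for a in vertical]


def render(vertical, horizontal):
    """Draw the dot plot as text, one row per letter of the vertical sequence."""
    grid = dot_plot(vertical, horizontal)
    lines = ["  " + " ".join(horizontal)]
    for letter, row in zip(vertical, grid):
        lines.append(letter + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


sequence_1 = "GATACT"   # assumption: letters as listed top to bottom
query = "ACATAG"
print(render(sequence_1, query))
```

Runs of `*` along a diagonal correspond to stretches where the two sequences match without gaps.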

14
The dot plot of the alignment of two different contigs to a reference sequence.

15
Needleman-Wunsch algorithm
(Needleman & Wunsch, Journal of Molecular Biology, 1970)
Key insight: Matrix representation of alignments
Goal: Find best path through the matrix

• Build a matrix
• S(i,j) = score of the best alignment of S1[1..i] and
S2[1..j]
• Systematically fill in the matrix and compute the
optimal score in S(i, j)
• Trace back from the optimal score, and find the
alignment solution.

16
Scoring parameters

Parameters so far:
• Match/mismatch
• Gap opening (insertion/deletion)
• Gap extending (insertion/deletion)

Alignment with one long gap:

CGATGCAGCAGCAGCATCG
||||||      |||||||
CGATGC------AGCATCG

(13 x 1) [matches] - 10 [gap opening] - (6 x 1) [gap extension] = -3
(single evolutionary event)

Alignment with five separate gaps:

CGATGCAGCAGCAGCATCG
|| || |||| ||  || |
CG-TG-AGCA-CA--AT-G

(13 x 1) [matches] - (5 x 10) [gap opening] - (6 x 1) [gap extension] = -43
(5 evolutionary events)

The first alignment is more likely than the second one.
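
The arithmetic above can be checked with a short sketch. It assumes the values implied by the slide (match +1, gap opening -10, gap extension -1 per gap position); the mismatch value is an assumption, since the example contains no mismatches, and the function name `affine_score` is made up here.

```python
def affine_score(top, bottom, match=1, mismatch=-1, gap_open=-10, gap_extend=-1):
    """Score an alignment with separate gap opening and gap extension costs."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have the same length")
    total = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            if not in_gap:
                total += gap_open    # paid once per run of gap positions
            total += gap_extend      # paid for every gap position
            in_gap = True
        else:
            total += match if a == b else mismatch
            in_gap = False
    return total


reference = "CGATGCAGCAGCAGCATCG"
print(affine_score(reference, "CGATGC------AGCATCG"))   # one long gap
print(affine_score(reference, "CG-TG-AGCA-CA--AT-G"))   # five separate gaps
```

For simplicity the sketch treats any consecutive gap columns as one run, even if the gap switches from one row to the other.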
DNA substitutions

• DNA:
• Purines (A, G) – dual ring
• Pyrimidines (C, T) – single ring

• Transitions: substitutions within the same type (purine ↔ purine, A ↔ G; pyrimidine ↔ pyrimidine, C ↔ T)

• Transversions: substitutions between the two types (purine ↔ pyrimidine)

• Transitions occur more frequently than transversions,
so we can score them higher in the scoring matrix
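
As an illustration, the classification above can be written as a small function. The score values in `dna_score` are hypothetical and only show the idea that a transition is penalized less than a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Return 'match', 'transition' or 'transversion' for two DNA bases."""
    a, b = a.upper(), b.upper()
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError("not a DNA base: " + base)
    if a == b:
        return "match"
    same_ring_type = (a in PURINES) == (b in PURINES)
    return "transition" if same_ring_type else "transversion"


def dna_score(a, b, match=2, transition=-1, transversion=-2):
    """Score a pair of bases; the three values are illustrative assumptions."""
    kind = classify_substitution(a, b)
    return {"match": match, "transition": transition, "transversion": transversion}[kind]


for pair in ("AG", "CT", "AC", "GT", "AA"):
    print(pair, classify_substitution(*pair), dna_score(*pair))
```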

18
Amino acid substitution matrices

Scoring matrices reflect:

• Probabilities of mutual substitutions
• Probability of occurrence of each amino acid

Two most commonly used models:

• BLOSUM
• PAM

19
BLOSUM 62 (BLOcks SUbstitution Matrix)

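Cluster proteins with identity greater than 62%

Estimate the frequency with which a is replaced by b

$S(a, b) = \log \dfrac{P_{ab}}{P_a P_b}$

where $P_{ab}$ is the observed frequency of the aligned pair (a, b) and $P_a P_b$ is the frequency expected by chance.

BLOSUM62 is the most frequently used scoring matrix for protein alignments and the
default for NCBI BLASTP
20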
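
A log-odds score of this form is easy to compute once the frequencies are known. The sketch below uses made-up frequencies and a base-2 logarithm, which is an assumption because the slide does not state the base; it is not the procedure used to build the published BLOSUM62 table.

```python
import math


def log_odds(p_ab, p_a, p_b):
    """Log-odds score: observed pair frequency over the frequency expected by chance."""
    return math.log2(p_ab / (p_a * p_b))


# Hypothetical frequencies, for illustration only
print(log_odds(0.02, 0.1, 0.1))    # pair seen more often than by chance: positive score
print(log_odds(0.005, 0.1, 0.1))   # pair seen less often than by chance: negative score
```

A positive score means the substitution is observed more often than chance would predict, a negative score means it is observed less often.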
Scoring parameters

• Match/mismatch

• Gap opening

• Gap extending

• Substitution matrix

21
BLAST
(Basic Local Alignment Search Tool)

22
NCBI BLAST

• The Basic Local Alignment Search Tool (BLAST) finds
regions of local similarity between sequences.

• The program compares nucleotide or protein sequences
to sequence databases and calculates the statistical
significance of matches.

• BLAST can be used to infer functional and evolutionary
relationships between sequences as well as help identify
members of gene families.

https://blast.ncbi.nlm.nih.gov/Blast.cgi

23
How BLAST works

The query is split into “words” (subsequences of the query seq)

Query words are compared to the database
(target sequences) and exact matches are
identified

For each word match, the alignment is extended
in both directions to find alignments that score
greater than some threshold (maximal
segment pairs, or MSPs)

(Schneider and La Rota 2000)
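
The word-match-then-extend idea can be sketched as follows. This is a strongly simplified toy, not the BLAST implementation: it uses exact words only, ungapped extension, and hypothetical parameters (`word_size=3`, match +1, mismatch -1, and an `x_drop` cut-off that stops an extension once its score falls too far below the best seen).

```python
def find_seeds(query, target, word_size):
    """Return (query_pos, target_pos) for every exact word match."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(target) - word_size + 1):
        for i in index.get(target[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, target, i, j, word_size, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps; return (score, query_start, target_start, length)."""
    def grow(qi, tj, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= tj < len(target):
            running += match if query[qi] == target[tj] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break                      # score dropped too far, stop extending
            qi += step
            tj += step
        return best, best_len

    right_score, right_len = grow(i + word_size, j + word_size, 1)
    left_score, left_len = grow(i - 1, j - 1, -1)
    score = word_size * match + left_score + right_score
    return score, i - left_len, j - left_len, left_len + word_size + right_len


def best_hsp(query, target, word_size=3):
    """Extend every seed and keep the highest-scoring segment pair."""
    hits = [extend_seed(query, target, i, j, word_size)
            for i, j in find_seeds(query, target, word_size)]
    return max(hits) if hits else None


print(best_hsp("GATTACA", "TTGATTACATT"))
```

Indexing the query words first means the target is scanned only once, which is the main reason this strategy is much faster than filling a full alignment matrix for every database sequence.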


BLAST @NCBI: easy !!

1. query (your sequence)

2. search space
(what you want to compare with)

3. submit !
Several versions of BLAST

Program    Query sequence       Database
blastn     nucleic              nucleic
blastp     protein              protein
blastx     nucleic → protein    protein
tblastn    protein              nucleic → protein
tblastx    nucleic → protein    nucleic → protein

(→ protein: the nucleic sequence is translated into protein before the comparison)
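
The table can also be read as a lookup from (query type, database type) to program name. The sketch below adds a rough guess of the sequence type from its letters; this is a heuristic assumption (a peptide made only of A, C, G, T, U, N would be misclassified), and all names are invented for this example.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")

# (query type, database type) -> BLAST programs from the table
BLAST_PROGRAMS = {
    ("nucleic", "nucleic"): ["blastn", "tblastx"],
    ("protein", "protein"): ["blastp"],
    ("nucleic", "protein"): ["blastx"],
    ("protein", "nucleic"): ["tblastn"],
}


def guess_sequence_type(sequence):
    """Heuristic: only nucleotide letters means 'nucleic', otherwise 'protein'."""
    letters = set(sequence.upper())
    return "nucleic" if letters <= NUCLEOTIDE_LETTERS else "protein"


def choose_blast(query_type, database_type):
    """Return the BLAST programs that fit the query and database types."""
    return BLAST_PROGRAMS[(query_type, database_type)]


query = "AAATGAGTTAATAGAATCTTTACAAATAAG"
print(guess_sequence_type(query), choose_blast(guess_sequence_type(query), "nucleic"))
```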
EXERCISE BREAK
Exercise 1: Identify sequences with BLAST
>Unknown_sequence 1
AAATGAGTTAATAGAATCTTTACAAATAAGAATATACACTTCTGCTTAGGATGATAATTGGAGGCAAGTGAAT
CCTGAGCGTGATTTGATAATGACCTAATAATGATGGGTTTTATTTCCAGACTTCACTTCTAATGGTGATTATG
GGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGA
TTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCAT
CAAAGCATGCCAACTAGAAGAGGTAAGAAACTATGTGAAAACTTTTTGATTATGCATATGAACCCTTCACACT
ACCCAAATTATATATTTGGCTCCATATTCAATCGGTTAGTCTACATATATTTATGTTTCCTCTATGGGTAAGC
TACTGTGAATGGATCAATTAATAAAACACATGACCTATGCTTTAAGAAGCTTGCAAACACATGAA

1. Select the appropriate type of BLAST for this unknown sequence.

2. Report the "Algorithm parameters" of your search.

3. What does BLAST tell you about this sequence?

27
BLAST results

1. summary of the query

2. graphical summary of results

3. detailed summary of results

4. alignments
BLAST results

Summary of query
which sequence was submitted ?

which database is queried ?

which program is used ?


BLAST results

2. Graphical summary

this red bar represents the
submitted sequence (length: 253 aa)

each colored line represents a local alignment
between the query sequence and a sequence
from the selected database
= HSP ("high scoring pair")

color → score
length → size of the alignment


BLAST results

List of alignments
Columns: identifier, description, score, E-value

each line in this summary
corresponds to a colored line
in the graphics
How to read a BLAST alignment ?

name, identifier, description and length of the similar sequence in the database

Query = user submitted sequence


Subject = similar sequence found in the database
How to read a BLAST alignment ?

3 lines in the alignment

• top line: query sequence
• bottom line: subject sequence
• middle line: AA if conserved between both sequences; "+" if score of
the substitution is positive
How to read a BLAST alignment ?

• Score (raw score)
• E-value
• Number of matches (%)
• Number of positive substitutions (%)
• Number of gaps (%)
Interpreting the E-value
What is the likelihood that this alignment represents a true
homology between both sequences?

E-value:   1 .................................................. 10^-10
           no homology | maybe ("twilight zone") | homology

false positives: both sequences are
aligned, but there is no homology. Example:

Query: "British forces at a gaelic football match during the war of …"
Items of interest in BLAST

• Max[imum] Score: the highest alignment score calculated from the sum of the
rewards for matched nucleotides or amino acids and penalties for mismatches
and gaps.
• Tot[al] Score: the sum of alignment scores of all segments from the same
subject sequence.
• Query Cover[age]: the percent of the query length that is included in the
aligned segments.
• E[xpect] Value: the number of alignments expected by chance with the
calculated score or better. The expect value is the default sorting metric; for
significant alignments the E value should be very close to zero.
• Ident[ity]: the highest percent identity for a set of aligned segments to the
same subject sequence.

36
BLAST must-know

• BLAST = very fast tool for local alignments, used to query a sequence
database with a query sequence
• output: list of HSPs (high scoring pairs = alignments) with
• % identity, % positives, % gaps
• score, E-value
• E-value = statistical measure: how many HSPs with a comparable score
would we expect to find by chance?
• E-value decreases when score increases: the smaller, the better!
E-value < 1e-10: very likely homology.
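
The relation between score and E-value can be made concrete with the commonly cited formula E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below is an approximation: BLAST itself uses effective lengths, so reported values differ somewhat, and the database size used here is hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2 ** (-bit_score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


query_length = 253            # residues in the query
database_length = 10 ** 9     # hypothetical database size in residues
for bit_score in (30, 50, 80):
    print(bit_score, expect_value(bit_score, query_length, database_length))
```

Each additional bit of score halves the E-value, and a larger database raises it, so the same score is less significant in a bigger search space.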
EXERCISE BREAK
Exercise 2: Carrying out a BLAST search of an unknown protein
>unknown_protein2
VVKSSGVRQPFDKEKIYKVLKWACDGHNIDVRAFLENVLELIRDGMTTKQIQRIAAIKYA
ADHISVKEPDWQYVASNLEMFALRKDVYGQFDPIPFYDHIVKMVEAGKYDKEILEKYSKQ
DIQVFERAIDHDKDFEFSYAGSQQLIGKYLVQDRDTGEIFETPQYAFMLIAMCLHQEETG
AQVTHIVDFYNAISDRKLSLPTPIMAGVRTPTRQFSSCVVIESGDSLGSLNAVTSAIKVY
ISQRAGIGVNAGHIRAMGSKIRGGEAVHTGVIPFWKIQTAVKSCSQGGVRGGAATLYYPF
WHLEVENLLVLKNNKGVEENRVRHLDYGVQLNQLMYKRLMNRDYITLFSPDVANDRLYDL

1. Which type of BLAST search should you use for this sequence?
2. Report the "Algorithm parameters" of your search.
3. Do you find any sequences that look like your input sequence?
4. What is the typical length of the hits (the alignment length)?
5. What is the typical % identity?
6. What is the range of the E-values?

38
HOMEWORK - DAY 3
1. Report the translated protein sequence, in FASTA format, of one of your three genes
from the gene list (Homework Day 1).

2. Use BLAST to find proteins similar to this protein in mouse, fly, and C. elegans.
a. Report the "Algorithm parameters" of your search.
b. Do you find any significant hits? How many? What is the % identity? What is the range of the E-values?
c. Do all the best hits belong to the same functional category?
d. Report the best hit for your sequence:
- What is the identifier (Accession)?
- What is the alignment score ("max score")?
- What is the percent identity and query coverage?
- What is the E-value?
- Are there any gaps in the alignment?

e. Give a summary of your BLAST search for this protein.

DEADLINE: 10am Thursday 22nd 2021


39
END

40
The NW algorithm calculates the scoring matrix for the alignment between two sequences. Each cell $S_{i,j}$ is computed from its three neighbours:

$S_{i-1,j-1}$    $S_{i-1,j}$

$S_{i,j-1}$      $S_{i,j}$

41
Example
Initialization: $S_{1,1} = 0$
First row: $S_{1,j} = S_{1,j-1} + g$
First column: $S_{i,1} = S_{i-1,1} + g$

Gap penalty = -1
Match = +2
Mismatch = -1

42
      -    A
_     0   -1
A    -1    2

$S_{i,j} = \max$ of:
$S_{i-1,j-1} + M_{i,j} = 0 + 2 = 2$
$S_{i,j-1} + g = -1 + (-1) = -2$
$S_{i-1,j} + g = -1 + (-1) = -2$

Gap penalty = -1
Match = +2
Mismatch = -1

43
      -    A    G
_     0   -1   -2
A    -1    2    1

$S_{i,j} = \max$ of:
$S_{i-1,j-1} + M_{i,j} = -1 + (-1) = -2$
$S_{i,j-1} + g = 2 + (-1) = 1$
$S_{i-1,j} + g = -2 + (-1) = -3$

Gap penalty = -1
Match = +2
Mismatch = -1

44
The traceback path
The traceback is performed on the completed matrix:

      -    A    G    C    A
_     0   -1   -2   -3   -4
A    -1    2    1    0   -1
C    -2    1    1    3    2
A    -3    0    0    2    5
A    -4   -1   -1    1    4   ← Traceback starts here
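
The completed matrix can be reproduced with a short sketch of the fill step, using the example's parameters (gap -1, match +2, mismatch -1). The function name `nw_matrix` is made up, and the code uses 0-based indices, so its cell `S[0][0]` corresponds to the top-left 0 of the matrix.

```python
def nw_matrix(top, left, match=2, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch scoring matrix (rows: left, columns: top)."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):              # first row: only gaps
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, rows):              # first column: only gaps
        S[i][0] = S[i - 1][0] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,   # diagonal: align the two letters
                          S[i][j - 1] + gap,        # from the left: gap in the left sequence
                          S[i - 1][j] + gap)        # from above: gap in the top sequence
    return S


for row in nw_matrix("AGCA", "ACAA"):
    print(row)
```

The bottom-right cell holds the score of the best global alignment.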

45
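The traceback path
diag – the letters from two sequences are aligned
left – a gap is introduced in the left sequence (S2)
up – a gap is introduced in the top sequence (S1)

[Figure: the matrix for S1 = AGCA (top) and S2 = ACAA (left) with the traceback path marked by diag, left and up moves, from the bottom-right cell ("Traceback starts here") to the top-left cell ("done")]

Resulting alignment:

S1: AGCA_

S2: A_CAA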
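
A traceback can be sketched by walking back from the bottom-right cell and checking which neighbour explains each score. The sketch below takes the completed matrix of the example as input. When several moves are possible it tries up, then diag, then left; this order is an arbitrary choice, and another order can give a different alignment with the same score. It prints gaps as `-`.

```python
def traceback(S, top, left, match=2, mismatch=-1, gap=-1):
    """Return (top_row, left_row) of one optimal alignment."""
    i, j = len(left), len(top)
    top_row, left_row = [], []
    while i > 0 or j > 0:
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            top_row.append("-")                 # up: gap in the top sequence
            left_row.append(left[i - 1])
            i -= 1
        elif (i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1]
              + (match if left[i - 1] == top[j - 1] else mismatch)):
            top_row.append(top[j - 1])          # diag: the two letters are aligned
            left_row.append(left[i - 1])
            i -= 1
            j -= 1
        else:
            top_row.append(top[j - 1])          # left: gap in the left sequence
            left_row.append("-")
            j -= 1
    return "".join(reversed(top_row)), "".join(reversed(left_row))


# Completed matrix of the example (top sequence AGCA, left sequence ACAA)
matrix = [
    [0, -1, -2, -3, -4],
    [-1, 2, 1, 0, -1],
    [-2, 1, 1, 3, 2],
    [-3, 0, 0, 2, 5],
    [-4, -1, -1, 1, 4],
]
top_row, left_row = traceback(matrix, "AGCA", "ACAA")
print(top_row)
print(left_row)
```

Removing the gap characters from each returned row gives back the original sequence, which is a quick sanity check for any traceback.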
