Lecture 3
- Find transcript information about a specific gene using NCBI and Ensembl, and compare it with your prediction from the bioinformatics program.
- Between Ensembl and NCBI, which one would you prefer when searching for information on human genes? Why?
Homework Day 2
Which ORF is a real gene?
Bioinformatics
-----
Sequence Alignment and Database Search
Course Provider: PhD Tam Tran
Department of Life Sciences (LS) – USTH
Email: [email protected]
• Database searching
Database searching
We compare our DNA/protein sequence to all known sequences in the database to find any matches.
One sequence → lots of sequences (GenBank, SwissProt, NR, DDBJ)
Why do we compare sequences?
Sequence Alignment Algorithms
What is an algorithm?
An algorithm is a step-by-step procedure to solve a problem
Algorithm for Pairwise Sequence Alignment
ATGCATGC
TGCATGCA
no matching positions

ATGCATGC-
-TGCATGCA
seven matching positions
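The effect of shifting one sequence by a single gap can be checked by counting identical letters column by column. The sketch below is only an illustration; the helper name `count_matches` is made up for this example.

```python
def count_matches(a: str, b: str) -> int:
    """Count columns where both aligned strings carry the same letter.

    Gap characters ("-") never count as a match.
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


# Ungapped comparison of the two sequences from the slide
print(count_matches("ATGCATGC", "TGCATGCA"))
# Same sequences after shifting by one position with gaps
print(count_matches("ATGCATGC-", "-TGCATGCA"))
```

Running it shows that the same two sequences go from no identical columns to a nearly perfect match once one gap is allowed at each end.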
Q1: What Do We Want to Align?
Local alignment: find the best subsequence match
Semi-global alignment: find the best match without penalizing gaps at the start or end of the alignment
Q2: How Do We Score Alignments?
S1: TACG---A--TTCAGATACG
|||| | ||||||||||
S2: AACGCTAACGTTCAATCGTC
E.g. How many possible global alignments are there for 2 sequences of length 3 (TAT and TCT)?
But: too many possible alignments
Possible global alignments for 2 sequences of length n
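To get a feeling for how fast the number of alignments grows, here is a minimal sketch that counts global alignments by recursion. It assumes that every distinct sequence of columns (letter/letter, letter/gap, gap/letter) counts as a different alignment; other counting conventions give different numbers, so the formula on the original slide may not match these values. The function name is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_global_alignments(n: int, m: int) -> int:
    """Number of global alignments of sequences of length n and m.

    Assumption: the last column is either letter/letter, letter/gap,
    or gap/letter, and each choice gives a distinct alignment.
    """
    if n == 0 or m == 0:
        # Only one way left: align the remaining letters to gaps
        return 1
    return (
        count_global_alignments(n - 1, m - 1)  # letter aligned to letter
        + count_global_alignments(n - 1, m)    # letter of seq 1 aligned to a gap
        + count_global_alignments(n, m - 1)    # letter of seq 2 aligned to a gap
    )


for n in (1, 2, 3, 10):
    print(n, count_global_alignments(n, n))
```

Under this convention, two sequences of length 3 such as TAT and TCT already have 63 possible global alignments, and two sequences of length 10 have more than 8 million, which is why trying every alignment is not practical.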
Dot plot
[Figure: dot plot. Vertical axis (Sequence 1, top to bottom): G A T A C T. Horizontal axis (Query): A C A T A G.]
The dot plot of the alignment of two different contigs to a reference sequence.
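A dot plot is simple to build: put one sequence on each axis and mark every cell where the two letters are identical. The sketch below uses the two short sequences from the axis labels of the dot plot slide; reading the vertical axis from top to bottom is an assumption, and the function name is made up for this example.

```python
def dot_plot(vertical: str, horizontal: str) -> str:
    """Return a text dot plot: '*' where the letters match, '.' elsewhere."""
    lines = ["  " + " ".join(horizontal)]  # header row with the horizontal sequence
    for v in vertical:
        row = ["*" if v == h else "." for h in horizontal]
        lines.append(v + " " + " ".join(row))
    return "\n".join(lines)


print(dot_plot("GATACT", "ACATAG"))
```

Runs of `*` along a diagonal indicate stretches that are similar in both sequences; real tools usually add a window size and a threshold to reduce noise, which this sketch leaves out.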
Needleman-Wunsch algorithm
Key insight: Matrix representation of alignments
• Build a matrix
• S(i,j) = score of the best alignment of S1[1..i] and S2[1..j]
(Needleman & Wunsch, Journal of Molecular Biology, 1970)
Scoring parameters
Parameters so far:
• Match/mismatch
• Insertion/deletion (gaps): gap opening, gap extending
CGATGCAGCAGCAGCATCG
||||||      |||||||
CGATGC------AGCATCG

CGATGCAGCAGCAGCATCG
|| || |||| ||  || |
CG-TG-AGCA-CA--AT-G
• DNA:
Purines (A,G) – dual ring
Amino acid substitution matrices
BLOSUM 62 (BLOcks SUbstitution Matrix)
BLOSUM62 is the most frequently used scoring matrix for protein similarity alignment and is the default for NCBI BLASTP.
Scoring parameters
• Match/mismatch
• Gap opening
• Gap extending
• Substitution matrix
BLAST
(Basic Local Alignment Search Tool)
NCBI BLAST
https://blast.ncbi.nlm.nih.gov/Blast.cgi
How BLAST works
2. Search space (what you want to compare with)
3. Submit!
Several versions of BLAST
Program   Query sequence       Database
blastn    nucleic              nucleic
blastx    nucleic → protein    protein
tblastn   protein              nucleic → protein
tblastx   nucleic → protein    nucleic → protein
(→ protein: translated into protein before comparison)
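The choice of program follows directly from the type of the query and of the database. As a small sketch, the rules can be written as a lookup function. The function name and the `compare_as_protein` flag are hypothetical; `blastp` (protein query against a protein database) is added for completeness although it is not listed in the table.

```python
def choose_blast_program(query: str, database: str,
                         compare_as_protein: bool = False) -> str:
    """Pick a BLAST program from the sequence types ("nucleic" or "protein").

    compare_as_protein only matters when both query and database are nucleic:
    it selects tblastx (both sides translated) instead of blastn.
    """
    if query == "protein" and database == "protein":
        return "blastp"   # not in the table above, added for completeness
    if query == "nucleic" and database == "protein":
        return "blastx"   # query is translated
    if query == "protein" and database == "nucleic":
        return "tblastn"  # database is translated
    if query == "nucleic" and database == "nucleic":
        return "tblastx" if compare_as_protein else "blastn"
    raise ValueError("sequence types must be 'nucleic' or 'protein'")


print(choose_blast_program("nucleic", "protein"))
print(choose_blast_program("nucleic", "nucleic", compare_as_protein=True))
```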
EXERCISE BREAK
Exercise 1: Identify sequences with BLAST
>Unknown_sequence 1
AAATGAGTTAATAGAATCTTTACAAATAAGAATATACACTTCTGCTTAGGATGATAATTGGAGGCAAGTGAAT
CCTGAGCGTGATTTGATAATGACCTAATAATGATGGGTTTTATTTCCAGACTTCACTTCTAATGGTGATTATG
GGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGA
TTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCAT
CAAAGCATGCCAACTAGAAGAGGTAAGAAACTATGTGAAAACTTTTTGATTATGCATATGAACCCTTCACACT
ACCCAAATTATATATTTGGCTCCATATTCAATCGGTTAGTCTACATATATTTATGTTTCCTCTATGGGTAAGC
TACTGTGAATGGATCAATTAATAAAACACATGACCTATGCTTTAAGAAGCTTGCAAACACATGAA
BLAST results
1. Summary of query: which sequence was submitted?
2. Graphical summary
3. List of alignments: identifier, description, score, E-value
4. Alignments: name, identifier, description and length of the similar sequence in the database; number of matches (%), number of positive substitutions (%), number of gaps (%)
Interpreting the E-value
What is the likelihood that this alignment represents a true homology between the two sequences?
E-value > 1: no homology
1e-10 < E-value < 1: maybe ("twilight zone")
E-value < 1e-10: homology
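The E-value scale can be turned into a rough rule of thumb. The sketch below uses the two cut-offs shown on the slide (1 and 1e-10); how the exact boundary values are assigned is an assumption, and such thresholds are a guide for a first reading of the results, not a proof of homology. The function name is made up for this example.

```python
def interpret_evalue(e_value: float) -> str:
    """Rough interpretation of a BLAST E-value using the lecture's cut-offs."""
    if e_value < 1e-10:
        return "homology"
    if e_value <= 1:
        # Assumption: exactly 1 is still counted as "maybe"
        return "maybe (twilight zone)"
    return "no homology"


for e in (3e-52, 0.004, 7.5):
    print(e, "->", interpret_evalue(e))
```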
Query: "British forces at a gaelic football match during the war of …"
Items of interest in BLAST
• Max[imum] Score: the highest alignment score, calculated from the sum of the rewards for matched nucleotides or amino acids and the penalties for mismatches and gaps.
• Tot[al] Score: the sum of the alignment scores of all segments from the same subject sequence.
• Query Cover[age]: the percentage of the query length that is included in the aligned segments.
• E[xpect] Value: the number of alignments expected by chance with the calculated score or better. The expect value is the default sorting metric; for significant alignments the E-value should be very close to zero.
• Ident[ity]: the highest percent identity for a set of aligned segments to the same subject sequence.
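To make the percentages concrete, here is a minimal sketch that computes percent identity, percent gaps and query coverage for a single aligned segment. It assumes that identity and gaps are expressed relative to the alignment length (number of columns) and coverage relative to the full query length. The function name and the query length of 25 are hypothetical; the aligned strings are reused from the gap example earlier in the lecture.

```python
def alignment_stats(query_aln: str, subject_aln: str, query_length: int) -> dict:
    """Percent identity, percent gaps and query coverage of one aligned segment."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")
    columns = len(query_aln)
    pairs = list(zip(query_aln, subject_aln))
    identical = sum(1 for q, s in pairs if q == s and q != "-")
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    aligned_query = sum(1 for q in query_aln if q != "-")  # query letters in the segment
    return {
        "identity": 100 * identical / columns,
        "gaps": 100 * gaps / columns,
        "query_cover": 100 * aligned_query / query_length,
    }


stats = alignment_stats("CGATGCAGCAGCAGCATCG", "CGATGC------AGCATCG", query_length=25)
print({name: round(value, 1) for name, value in stats.items()})
```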
BLAST must-know
BLAST = very fast tool for local alignments, used to search a sequence database with a query sequence
Output: list of HSPs (high-scoring segment pairs = alignments) with
% identity, % positives, % gaps
score, E-value
E-value = statistical measure: how many HSPs with a comparable score would we expect to find by chance?
The E-value decreases as the score increases: the smaller, the better!
E-value < 1e-10: very likely homology.
EXERCISE BREAK
Exercise 2: Carrying out a BLAST search of an unknown protein
>unknown_protein2
VVKSSGVRQPFDKEKIYKVLKWACDGHNIDVRAFLENVLELIRDGMTTKQIQRIAAIKYA
ADHISVKEPDWQYVASNLEMFALRKDVYGQFDPIPFYDHIVKMVEAGKYDKEILEKYSKQ
DIQVFERAIDHDKDFEFSYAGSQQLIGKYLVQDRDTGEIFETPQYAFMLIAMCLHQEETG
AQVTHIVDFYNAISDRKLSLPTPIMAGVRTPTRQFSSCVVIESGDSLGSLNAVTSAIKVY
ISQRAGIGVNAGHIRAMGSKIRGGEAVHTGVIPFWKIQTAVKSCSQGGVRGGAATLYYPF
WHLEVENLLVLKNNKGVEENRVRHLDYGVQLNQLMYKRLMNRDYITLFSPDVANDRLYDL
HOMEWORK - DAY 3
1. Report the translated protein sequence, in FASTA format, of one of your three genes from the gene list (Homework Day 1)
2. Use BLAST to find proteins similar to this protein in mouse, fly, and C. elegans
a. Report the "Algorithm parameters" of your search
b. Do you find any significant hits? How many are there? What are their % identity and the range of their E-values?
c. Do all the best hits belong to the same functional category?
d. Report the best hit for your sequence
- What is the identifier (Accession)?
- What is the alignment score ("max score")?
- What is the percent identity and query coverage?
- What is the E-value?
- Are there any gaps in the alignment?
The NW algorithm calculates the scoring matrix for the alignment between two sequences
Example
S(1,1) = 0
S(1,j) = S(1,j-1) + g
S(i,1) = S(i-1,1) + g
Gap penalty = -1
Match = +2
Mismatch = -1
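The matrix fill can be sketched in a few lines. The code below is a minimal illustration using the parameters of this example (gap = -1, match = +2, mismatch = -1) and the two sequences of the worked example, AGCA on top and ACAA on the left. The function and variable names are made up, and Python lists are indexed from 0, so cell `S[0][0]` corresponds to S(1,1) on the slide.

```python
def nw_matrix(top: str, side: str, match: int = 2, mismatch: int = -1,
              gap: int = -1) -> list:
    """Fill the Needleman-Wunsch scoring matrix (rows = side, columns = top)."""
    rows, cols = len(side) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    # First row and first column: only gaps
    for j in range(1, cols):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, rows):
        S[i][0] = S[i - 1][0] + gap
    # Every other cell: best of diagonal, up and left
    for i in range(1, rows):
        for j in range(1, cols):
            m = match if side[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + m,  # diagonal: align two letters
                          S[i - 1][j] + gap,    # up: gap in the top sequence
                          S[i][j - 1] + gap)    # left: gap in the left sequence
    return S


for row in nw_matrix("AGCA", "ACAA"):
    print(row)
```

The score of the best global alignment is the value in the bottom-right cell.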
      -    A
_     0   -1
A    -1
Cell (A, A), diagonal: S(i-1,j-1) + M(i,j) = 0 + 2 = 2
      -    A    G
_     0   -1   -2
A    -1    2
Cell (A, G), diagonal: S(i-1,j-1) + M(i,j) = -1 + (-1) = -2
The traceback path
The traceback is performed on the completed matrix:
      -    A    G    C    A
_     0   -1   -2   -3   -4
A    -1    2    1    0   -1
C    -2    1    1    3    2
A    -3    0    0    2    5
A    -4   -1   -1    1    4   <- traceback starts here
The traceback path
diag – the letters from the two sequences are aligned
left – a gap is introduced in the left sequence (S2)
up – a gap is introduced in the top sequence (S1)
Traceback matrix (columns, S1: - A G C A; rows, S2: _ A C A A), moves recorded in each row:
_  done
A  up diag
C  diag
A  left
A  diag   <- traceback starts here
S1: AGCA_
S2: A_CAA
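The traceback can be sketched as follows: start in the bottom-right cell and, at each step, move to the neighbouring cell from which the current score was obtained. The code is a minimal illustration with made-up names; AGCA is the top sequence and ACAA the left sequence. When several moves explain the same score, this sketch tries "up" first, then "diag", then "left". That tie-breaking order is an assumption: another order can return a different alignment with the same score.

```python
def needleman_wunsch(top: str, side: str, match: int = 2, mismatch: int = -1,
                     gap: int = -1):
    """Global alignment: returns (score, aligned top, aligned side)."""
    rows, cols = len(side) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, rows):
        S[i][0] = S[i - 1][0] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            m = match if side[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + m, S[i - 1][j] + gap, S[i][j - 1] + gap)

    # Traceback from the bottom-right cell to the top-left cell
    i, j = len(side), len(top)
    aligned_top, aligned_side = [], []
    while i > 0 or j > 0:
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            # up: gap in the top sequence
            aligned_top.append("_")
            aligned_side.append(side[i - 1])
            i -= 1
        elif i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + (
                match if side[i - 1] == top[j - 1] else mismatch):
            # diag: the two letters are aligned
            aligned_top.append(top[j - 1])
            aligned_side.append(side[i - 1])
            i -= 1
            j -= 1
        else:
            # left: gap in the left sequence
            aligned_top.append(top[j - 1])
            aligned_side.append("_")
            j -= 1
    # The alignment was built backwards, so reverse it
    return S[-1][-1], "".join(reversed(aligned_top)), "".join(reversed(aligned_side))


score, aligned_top, aligned_side = needleman_wunsch("AGCA", "ACAA")
print(score)
print(aligned_top)
print(aligned_side)
```

With this tie-breaking order the sketch returns the same pair of gapped strings as the slide, AGCA_ and A_CAA, with a score of 4.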