Introduction To Bioinformatics: Sequence Alignment

This document discusses protein sequence alignment tools like BLAST and PSI-BLAST, how they use substitution matrices like PAM and BLOSUM to score alignments based on amino acid similarities, and statistical measures like E-values to assess the significance of matches found in database searches. It also covers concepts like affine gap penalties, filtering of low-complexity sequences, and how PSI-BLAST can be used to find more distant homologs through iterative profile searches.

Uploaded by

ranjankumartiwari


Introduction to bioinformatics

Sequence Alignment
Part 3
WHAT'S TODAY?
• MORE BLAST ...
- Similarity scores for protein sequences
- Gaps
- Statistical significance (e-value)
Protein Sequence Alignment
Rule of thumb:
Proteins are likely homologous if at least 25% identical (over an aligned length > 100)
DNA sequences are likely homologous if at least 70% identical
Protein Pairwise Sequence Alignment
• The alignment tools are similar to the DNA alignment tools
• BLASTN for nucleotides
• BLASTP for proteins

• Main difference: instead of scoring match (+2) and mismatch (-1) we have similarity scores:
• Score s(i,j) > 0 if amino acids i and j have similar properties
• Score s(i,j) ≤ 0 otherwise

• How should we score s(i,j)?

The 20 Amino Acids
Chemical Similarities Between Amino Acids
Acids & Amides DENQ (Asp, Glu, Asn, Gln)

Basic HKR (His, Lys, Arg)

Aromatic FYW (Phe, Tyr, Trp)

Hydrophilic ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)

Hydrophobic ILMV (Ile, Leu, Met, Val)
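
As a rough illustration of the grouping above, the following Python sketch marks each aligned pair as an identity, a same-group pair, or neither. The names `GROUPS`, `match_symbol` and `match_line` are hypothetical, and the same-group rule is a simplification: BLAST marks "+" where the substitution matrix score is positive, not by group membership.

```python
# Chemical groups exactly as listed on the slide; all names here are illustrative.
GROUPS = {
    "acids_amides": "DENQ",
    "basic": "HKR",
    "aromatic": "FYW",
    "hydrophilic": "ACGPST",
    "hydrophobic": "ILMV",
}

# Map each one-letter amino acid code to the name of its group.
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def match_symbol(a: str, b: str) -> str:
    """Return '|' for identity, '+' for same chemical group, ' ' otherwise."""
    if a == "-" or b == "-":
        return " "  # a gap is neither an identity nor a similarity
    if a == b:
        return "|"
    if a in GROUP_OF and GROUP_OF[a] == GROUP_OF.get(b):
        return "+"
    return " "


def match_line(s1: str, s2: str) -> str:
    """Build the middle line of a pairwise alignment display."""
    return "".join(match_symbol(a, b) for a, b in zip(s1, s2))


if __name__ == "__main__":
    print("MGYDE")
    print(match_line("MGYDE", "MGYEE"))
    print("MGYEE")
```

A substitution matrix replaces this yes/no grouping with a graded score for every pair.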

Sequence Alignment based on AA similarity
TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
|| + |||| +|| ||| | +| | | | |
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS

RFSGSGSGTDFTLTINSLQPEDFATYYCQ---------------QSYSTPHFSQGTKLEI
| | | +| | | +|+ || || |+ + | | || | +
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL

---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN---------NFYPREAKVQWKVD
++||| | + ++ | | | + ||++|+|
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID

| = identity, + = similarity
Amino Acid Substitution Matrices
• When scoring protein sequence alignments it is common to use a matrix of 20 × 20, representing all pairwise comparisons:
Substitution Matrix
Given an alignment of closely related sequences we can score the relation between amino acids based on how frequently they substitute each other

M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E

In the fourth column, E & D are found in 7/8 of the sequences
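
The counting idea can be sketched in Python as follows. This is only an illustration using the eight sequences from the slide; `column_counts` and `pair_counts` are hypothetical helper names, and real matrices turn such counts into log-odds scores over much larger data sets.

```python
from collections import Counter
from itertools import combinations

# The eight aligned sequences from the slide.
ALIGNMENT = [
    "MGYDE", "MGYDE", "MGYEE", "MGYDE",
    "MGYQE", "MGYDE", "MGYEE", "MGYEE",
]


def column_counts(alignment, col):
    """Count how often each residue appears in one alignment column (0-based)."""
    return Counter(seq[col] for seq in alignment)


def pair_counts(alignment, col):
    """Count unordered residue pairs observed in one alignment column."""
    pairs = Counter()
    for a, b in combinations([seq[col] for seq in alignment], 2):
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


if __name__ == "__main__":
    counts = column_counts(ALIGNMENT, 3)
    print(counts["D"] + counts["E"], "of", len(ALIGNMENT), "residues are D or E")
    print(pair_counts(ALIGNMENT, 3))
```

Pairs that co-occur often in the same column, such as D and E here, are the ones that end up with high substitution scores.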
Amino Acid Matrices
Symmetric matrix of 20x20 entries: entry (i,j) = entry (j,i)
Entry (i,i) is greater than any entry (i,j), j≠i.
Entry (i,j): the score of aligning amino acid i against amino acid j.
PAM - Point Accepted Mutations
• Developed by Margaret Dayhoff, 1978.
• Analyzed very similar protein sequences
• Proteins are evolutionarily close.
• Alignment is easy.
• Point mutations - mainly substitutions
• Accepted mutations - by natural selection.
• Used global alignment.
• Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j)
• Found that common substitutions involved chemically similar amino acids.
PAM 250

• Similar amino acids are close to each other.

• Regions define conserved substitutions.
Example: Asp & Glu
[Figure: chemical structures of aspartate (Asp, D) and glutamate (Glu, E)]
Score = 3
Selecting a PAM Matrix
• Low PAM numbers: short sequences, strong local similarities.
• High PAM numbers: long sequences, weak similarities.
– PAM120 recommended for general use (40% identity)
– PAM60 for close relations (60% identity)
– PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices
– PAM40, PAM120, PAM250 recommended
BLOSUM
• Blocks Substitution Matrix
– Steven Henikoff and Jorja G. Henikoff (1992)
• Based on BLOCKS database (www.blocks.fhcrc.org)
– Families of proteins with identical function
– Highly conserved protein domains
• Ungapped local alignment to identify motifs
– Each motif is a block of local alignment
– Counts amino acids observed in same column
– Symmetrical model of substitution

AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
BLOSUM Matrices

• Different BLOSUMn matrices are calculated independently from BLOCKS
• BLOSUMn is based on sequences that are at most n percent identical (more similar sequences are clustered and counted as one).
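
The role of n can be illustrated with a small Python sketch that clusters the sequences of an ungapped block by percent identity. The function names and the toy block are assumptions made for illustration; the sketch does not reproduce the exact BLOCKS procedure or its weighting.

```python
def percent_identity(s1: str, s2: str) -> float:
    """Percent of positions with identical residues in two ungapped, equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences in a block must have equal length")
    same = sum(a == b for a, b in zip(s1, s2))
    return 100.0 * same / len(s1)


def cluster_by_identity(block, n):
    """Group sequences so that any sequence at least n% identical to a
    cluster member joins (and merges) that cluster."""
    clusters = []
    for seq in block:
        kept, merged = [], [seq]
        for cluster in clusters:
            if any(percent_identity(seq, member) >= n for member in cluster):
                merged.extend(cluster)  # seq links to this cluster
            else:
                kept.append(cluster)
        clusters = kept + [merged]
    return clusters


if __name__ == "__main__":
    toy_block = ["AAAA", "AAAT", "CCCC"]  # hypothetical block, not from BLOCKS
    print(cluster_by_identity(toy_block, 75))
    print(cluster_by_identity(toy_block, 80))
```

With a lower n, more sequences collapse into a single cluster, so the resulting matrix reflects more divergent comparisons.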
Selecting a BLOSUM Matrix
• For BLOSUMn, higher n suitable for
sequences which are more similar
– BLOSUM62 recommended for general use
– BLOSUM80 for close relations
– BLOSUM45 for distant relations
Summary:
• BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of the sequences, aligned without gaps
• PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions
Gap Scores
• Example showed -1 score per indel
– So gap cost is proportional to its length
• Biologically, indels occur in groups
– We want our gap score to reflect this
• Standard solution: affine gap model
– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to algorithm
Scoring system = Substitution Matrix + Gap Penalty
Gap penalty
• We expect to penalize gaps
• Scoring for gap opening & for extension
– Insertions and deletions are rare in evolution
– But once they are created, they are easy to extend
– Gap-extension penalty < gap-open penalty
• Default gap parameters are given for each matrix:
– PAM30: open=9, extension=1
– PAM250: open=14, extension=2
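
A minimal Python sketch of affine gap scoring follows. It assumes the first position of a gap costs the opening penalty and each further position costs the extension penalty; some tools instead charge open + extension × length, so treat the convention as an assumption. The toy substitution function reuses the match (+2) / mismatch (-1) scheme mentioned earlier rather than a real PAM matrix, and all function names are hypothetical.

```python
def affine_gap_cost(length: int, gap_open: int, gap_extend: int) -> int:
    """Cost of one gap: pay gap_open once, then gap_extend per extra position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def toy_substitution(a: str, b: str) -> int:
    """Stand-in for a substitution matrix: match +2, mismatch -1."""
    return 2 if a == b else -1


def score_alignment(s1, s2, substitution, gap_open, gap_extend):
    """Score two already-aligned sequences ('-' marks a gap position)."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_row = None  # which sequence the current gap run is in (1, 2 or None)
    for a, b in zip(s1, s2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            row = 1 if a == "-" else 2
            # Continuing a gap in the same sequence is cheaper than opening one.
            score -= gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += substitution(a, b)
            gap_row = None
    return score


if __name__ == "__main__":
    # PAM30 defaults from the slide: open=9, extension=1.
    print(score_alignment("AC--GT", "ACTTGT", toy_substitution, 9, 1))
```

Under this scheme one gap of length two costs less than two separate gaps of length one, which is the behaviour the affine model is meant to capture.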
Low Complexity Sequences
• AAAAAAAAAAA

• ATATATATATATA

• CAGCAGCAGCAG

Low-complexity sequences can produce significant hits that are not true homologues!

How does BLAST deal with low-complexity sequences?

By default, low-complexity sequences are filtered out and replaced by XXXXX
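
The masking idea can be sketched with a sliding-window Shannon entropy filter in Python. This is a simplified stand-in, not the filter BLAST actually runs (BLAST uses dedicated programs such as SEG for proteins and DUST for nucleotides); the window size, threshold and function names are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every window whose entropy is below the threshold with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    for example in ("AAAAAAAAAAAA", "CAGCAGCAGCAG", "MKTAYIAKQRQI"):
        print(example, "->", mask_low_complexity(example))
```

Repeats of one, two or three letters fall below the threshold and are masked, while a segment with a varied composition is left untouched.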
Statistical significance
E-value
• The number of hits (with the same or a higher similarity score) one can "expect" to see just by chance when searching the given string in a database of a particular size.
• Higher E-value, less significant similarity
– “sequences with E-value of less than 0.01 are almost always found to be homologous”
• The lower bound is normally 0 (we want to find the best)
Expectation Values
The E-value:
• Increases linearly with length of query sequence
• Increases linearly with length of database
• Decreases exponentially with score of alignment
• Bit score (S)
– Similar to alignment score
– Normalized
– Higher means more significant
• E value:
Number of hits of score ≥ S expected by chance
– Based on random database of similar size
– Lower means more significant
– Used to assess the statistical significance of the
alignment
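
These relationships follow from the standard Karlin-Altschul form E = K·m·n·e^(−λS), where m is the query length and n the database length. The Python sketch below is illustrative only: `lam` and `k` depend on the scoring system, and the numeric values used in the example are placeholders, not parameters taken from the slides.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score, comparable across scoring systems."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, query_len, db_len):
    """Same E-value, computed from the bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, k = 0.318, 0.134  # placeholder values for illustration
    for score in (40, 60, 80):
        e = e_value(score, query_len=100, db_len=1_000_000, lam=lam, k=k)
        print(score, round(bit_score(score, lam, k), 1), e)
```

Doubling the database length doubles E, while each additional bit of score halves it.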
Remote homologues
• Sometimes BLAST isn’t enough.
• Large protein family, and BLAST only
gives close members. We want more distant
members

PSI-BLAST
• Position Specific Iterated BLAST
Regular BLAST → Construct profile from BLAST results → BLAST profile search → Final results
PSI-BLAST
• Advantage: PSI-BLAST looks for sequences that are close to ours, and learns from them to extend the circle of friends
• Disadvantage: if we include a WRONG sequence, we will drift to unrelated sequences. This gets worse with each iteration
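
The "construct profile" step can be sketched in Python as a position-specific scoring matrix built from aligned hits. This is a toy version under stated assumptions: a uniform background frequency and a fixed pseudocount, with hypothetical function names. PSI-BLAST itself uses sequence weighting and more careful pseudocounts.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background (an assumption)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(len(aligned_hits[0])):
        counts = Counter(
            seq[col] for seq in aligned_hits if seq[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_with_profile(profile, segment):
    """Score an ungapped segment of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


if __name__ == "__main__":
    hits = ["MGYDE", "MGYDE", "MGYEE", "MGYDE",
            "MGYQE", "MGYDE", "MGYEE", "MGYEE"]
    profile = build_profile(hits)
    print(score_with_profile(profile, "MGYDE"))
    print(score_with_profile(profile, "MGYKE"))
```

Because each iteration rebuilds the profile from the current hits, a single unrelated sequence that slips in shifts these column scores, which is how the drift described above arises.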
