Bio 3
(3)
ALIGNMENT AND MATCHING
DR. IBRAHIM ZAGHLOUL
PROTEIN STRUCTURE
https://www.rcsb.org/structure/3ERT
2
MACROMOLECULAR STRUCTURE
• Primary structure of proteins
– Linear polymers linked by peptide bonds
– Sense of direction
3
SECONDARY STRUCTURE
• Polypeptide chains fold into regular local structures
– alpha helix, beta sheet, turn, loop
– based on energy considerations
4
ALPHA HELIX
5
BETA SHEET
anti-parallel parallel
schematic
6
TERTIARY STRUCTURE
• 3-d structure of a polypeptide sequence
– interactions between non-local and foreign atoms
– often separated into domains
7
QUATERNARY STRUCTURE
quaternary structure of Cro
human hemoglobin tetramer
8
ACTIVE SITE (BINDING SITE)
- Upon folding, the protein's active site is formed.
- It is the spot where molecules fit and interact.
- It is the major point of protein activity.
- It is usually a cleft, pocket, or cavity.
- It is called "active" because the interaction usually results in some chemical change or reaction.
- It is the basis of the lock-and-key model.
https://www.slideshare.net/MerlynH/protein-structure-function-46933802
http://www.chemeddl.org/collections/TSTS/Gellman/Gellmanpg5-8/Active%20Sites.html
9
ACTIVE SITE AND DRUG DESIGN: LOCK AND KEY
10
PROTEIN-LIGAND DOCKING
A computational method that mimics the binding of a ligand to a protein.
• Given: a target (protein), a binding site, and a ligand (or a set of ligands)
• Predicts:
– The pose of the molecule in the binding site
– The binding affinity, or a score representing the strength of binding
• Docking can be used for:
– Virtual screening: virtual testing of compounds
– Lead optimization: investigating specific compounds
– De novo design of ligands: designing new compounds
Image credit: Charaka Goonatilake, Glen Group, University of Cambridge.
http://www-ucc.ch.cam.ac.uk/research/cg369-research.html
11
VIDEO: MOLECULAR DOCKING USING GLIDE
Bioinformatics: A simple view
Biological data + Computational methods
13
Challenges of working in bioinformatics
14
Skill set
• Artificial intelligence
• Machine learning
• Statistics & probability
• Algorithms
• Databases
• Programming
15
Bioinformatics Topics Genome Sequence
16
Bioinformatics Topics Protein Sequence
• Sequence Alignment
– non-exact string matching, gaps
– How to align two strings optimally via Dynamic Programming
– Local vs Global Alignment
– Amino acid substitution scoring
• Patterns
– TM-helix finding
– Motifs
• Secondary Structure “Prediction”
– Assessing Secondary Structure Prediction
• Function Prediction
– Active site identification
• Tertiary Structure Prediction
– Ab initio
– Fold Recognition
– Threading
• Scoring schemes and Matching statistics
– How to tell if a given alignment or match is statistically significant
– A P-value (or an e-value)?
– Score Distributions (extreme value distribution)
– Low Complexity Sequences
• Evolutionary Issues
– Rates of mutation and change
• Relation of Sequence Similarity to Structural Similarity
17
Evolution
1
DNA Sequencers
DNA Sequencing
• DNA sequencing refers to the general laboratory technique for determining the exact
sequence of nucleotides, or bases, in a DNA molecule.
3
Sequence conservation implies function
3
Homology
GCTAGTCAGATCTGACGCTA
  | |||| ||||| |||
  TGGTCACATCTGCCGC
35
Sequence Alignment
VLSPADKTNVKAAWGKVGAHAGYEG
|||  |   |   | || |    ||
VLSEGDWQLVLHVWAKVEADVAGEG
3
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
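Given two strings x = x1x2...xM, y = y1y2...yN, an alignment is an assignment of gaps to positions 0, ..., M in x and 0, ..., N in y, so that each letter in one sequence lines up with either a letter or a gap in the other sequence.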
3
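The gapped rows on the Sequence Alignment slide can be checked mechanically. The following is a minimal sketch, not a method taken from the lecture: the function names `is_valid_alignment` and `match_line` are hypothetical, and the rule that no column may pair a gap with a gap is a common convention assumed here.

```python
def is_valid_alignment(x, y, row_x, row_y, gap="-"):
    """Return True if (row_x, row_y) is a gapped alignment of x and y."""
    # Both rows must have the same number of columns.
    if len(row_x) != len(row_y):
        return False
    # Removing the gaps must give back the original strings.
    if row_x.replace(gap, "") != x or row_y.replace(gap, "") != y:
        return False
    # Assumed convention: no column aligns a gap with a gap.
    return all(not (a == gap and b == gap) for a, b in zip(row_x, row_y))


def match_line(row_x, row_y, gap="-"):
    """Build the '|' line drawn between two aligned rows."""
    return "".join(
        "|" if a == b and a != gap else " " for a, b in zip(row_x, row_y)
    )


# The two DNA strings and their gapped rows from the slide.
x = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
y = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
row_x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"

print(is_valid_alignment(x, y, row_x, row_y))
print(row_x)
print(match_line(row_x, row_y))
print(row_y)
```

Printing the rows with the match line between them reproduces the kind of picture used on the Homology and Sequence Alignment slides.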
Sources of variation
• Nucleotide substitution
– Replication error
– Chemical reaction
• Insertions or deletions (indels)
– Unequal crossing over
– Replication slippage
• Duplication
– a single gene (complete gene duplication)
– part of a gene (internal or partial gene duplication)
• Domain duplication
• Exon shuffling
– part of a chromosome (partial polysomy)
38
A simple alignment
39
Scoring the alignments
• We need to have a scoring mechanism to evaluate alignments
– match score
– mismatch score
• We can have the total score as:
$\sum_{i=1}^{n} (\text{match or mismatch score at position } i)$
41
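The total score above can be sketched in a few lines. This is only an illustration: the values +1 for a match and -1 for a mismatch are assumptions (the slide does not fix them), the name `score_alignment` is hypothetical, and gaps are not handled because the slide lists only match and mismatch scores.

```python
def score_alignment(row_x, row_y, match=1, mismatch=-1):
    """Sum the match or mismatch score over all aligned positions."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    # One term per position i, as in the sum on the slide.
    return sum(match if a == b else mismatch for a, b in zip(row_x, row_y))


# The ungapped protein alignment from the Sequence Alignment slide.
top = "VLSPADKTNVKAAWGKVGAHAGYEG"
bottom = "VLSEGDWQLVLHVWAKVEADVAGEG"
print(score_alignment(top, bottom))  # 11 matches, 14 mismatches -> -3
```

A substitution matrix such as BLOSUM 62 replaces the two fixed values with a score that depends on the pair of residues at each position.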
BLOSUM 62 matrix
String Definitions
43
String Definitions
44
String Definitions
45
Exact matching
46
Exact Matching
47
Exact matching: naïve algorithm
48
Exact matching: naïve algorithm
49
Exact matching: naïve algorithm
50
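A minimal sketch of the naïve algorithm, assuming the usual formulation: try every alignment of the pattern P against the text T and compare character by character. The function name `naive_match` is hypothetical.

```python
def naive_match(p, t):
    """Return the offsets in t at which p occurs, trying every alignment."""
    occurrences = []
    # Try each possible alignment of p against t, left to right.
    for i in range(len(t) - len(p) + 1):
        matched = True
        # Compare p with t[i:i+len(p)] one character at a time.
        for j in range(len(p)):
            if t[i + j] != p[j]:
                matched = False
                break
        if matched:
            occurrences.append(i)
    return occurrences


P = "word"
T = "There would have been a time for such a word"
print(naive_match(P, T))  # [40]
```

With m = len(P) and n = len(T), this tries n - m + 1 alignments and makes up to m comparisons for each.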
Can we improve on the naïve algorithm?
P: word
T: There would have been a time for such a word
         word
P: word
T: There would have been a time for such a word
         word
          word skip!
           word skip!
            word
The u in T does not occur in P, so the next two alignments can be skipped.
51
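The skip on this slide can be sketched as a small change to the naïve algorithm. This is a simplified form of the bad-character idea used by Boyer-Moore-style algorithms, not the full algorithm. If the mismatched text character does not occur anywhere in P, every alignment that still covers that character is skipped. The name `match_with_skips` and the alignment counter are hypothetical additions for illustration.

```python
def match_with_skips(p, t):
    """Find occurrences of p in t, skipping alignments that cannot match.

    Returns (occurrences, alignments_tried).
    """
    chars_in_p = set(p)
    occurrences = []
    alignments_tried = 0
    i = 0
    while i <= len(t) - len(p):
        alignments_tried += 1
        shift = 1  # default: move to the next alignment, as in the naive method
        for j in range(len(p)):
            if t[i + j] != p[j]:
                # If the text character is absent from p, no alignment that
                # covers position i + j can match, so jump just past it.
                if t[i + j] not in chars_in_p:
                    shift = j + 1
                break
        else:
            # The inner loop finished without a mismatch: a full match.
            occurrences.append(i)
        i += shift
    return occurrences, alignments_tried


P = "word"
T = "There would have been a time for such a word"
found, tried = match_with_skips(P, T)
print(found, tried)
```

On the slide's example, the mismatch at the u of "would" gives a shift of 3, which skips the two alignments marked "skip!".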