
SECT 5 SL L1-Rev

This document provides information about an exam for the CSC8312 Bioinformatics Theory and Applications module. It discusses the exam format: 3 questions, of which 2 are answered, in 1.5 hours total (45 minutes per question). It provides some tips for the exam such as reading questions carefully, making a rough sketch before answering, not panicking, attempting all required questions, watching the time, and providing examples. It also lists the module overview with the topics covered in each lecture. Finally, it discusses example exam questions related to protein sequence divergence, homology, orthology, paralogy, and the PAM and BLOSUM scoring matrices used for sequence alignment.

Uploaded by

Uday Kiran
Copyright
© Attribution Non-Commercial (BY-NC)

Revision Lecture

CSC8312

Prof. A. Wipat
Exam format

▪ 3 questions – answer 2
▪ Total 1.5 hours – 45 minutes each

CSC8312 Bioinformatics Theory and Applications 2


Some tips
▪ RTQ – read the question carefully
▪ Make a quick rough sketch of your answer before writing it out – this helps you structure it
▪ Don't panic
▪ Don't spend too long on one answer – if you don't attempt the others, the examiner will not be able to give you any marks for them
• Always attempt – it is a trade-off of time; the first 20 minutes are the most productive
▪ Watch the time – if you run out, at least list the headings/themes you were going to write about, with some bullet points
▪ Give examples/diagrams, no matter how simple
▪ Learn as much as you can – revise properly and check you can write it down from memory
▪ Targeted revision is a gamble – best to learn it all!
▪ Everything covered in lectures and practicals is valid exam material
▪ Evidence of external reading impresses examiners – essential for a very good mark.


CSC8312 Module Overview

▪ Section 1: Lect 1: Molecular and Cellular Biology
▪ Section 1: Lect 2: Genetic Material and Genomes
▪ Section 1: Lect 3: Gene Expression, Transcriptomes and Proteomes
▪ Section 2: Lect 1: DNA and genome sequencing
▪ Section 2: Lect 2: Sequence data and annotation
▪ Section 3: Lect 1: Evolution
▪ Section 3: Lect 2: Sequence Similarity and Comparison
▪ Section 3: Lect 3: Sequence Similarity Algorithms
▪ Section 3: Lect 4: Multiple Sequence Alignment
▪ Section 4: Lect 1: Microarray Data Analysis
▪ Section 4: Lect 2: Data Standards and Ontologies
▪ Section 4: Lect 3: Proteomics
▪ Section 4: Lect 4: Protein structure
▪ Section 4: Lect 5: Protein Structure Prediction
▪ Section 4: Lect 6 & 7: Biological Networks


2004 paper

Example Questions
Protein sequence divergence

▪ How far can sequences diverge and still be related?

▪ As sequences diverge, the non-conservative mutations increase and the gaps (insertions/deletions) increase

▪ In some cases protein sequences can diverge to a great extent and still be related functionally, e.g. some globin sequences have only 8% amino acid identity

• This makes it hard to detect related proteins, i.e. homologues.
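The "% amino acid identity" figure quoted for the globins can be computed from a pairwise alignment. The following is a minimal sketch, assuming identity is measured over the columns where neither sequence has a gap (other denominators are also in use); the function name and the toy sequences are invented for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of gap-free alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Two toy aligned fragments (invented, not real globin sequences).
print(percent_identity("MKT-LLVA", "MRTALIVA"))
```

Here 5 of the 7 gap-free columns match, so the two fragments are roughly 71% identical. At 8% identity almost every column would differ, which is why such distant homologues are hard to detect from sequence alone.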


Homology, Orthology and Paralogy

▪ Different types of homologous sequences:

• Orthologous genes (orthologues or orthologs):

  • Homologous genes from different genomes ascribed to the same gene family
  • Evolved by speciation
  • Often encode proteins that perform the same function in different organisms


Homology, Orthology and Paralogy

• Paralogous genes (paralogues or paralogs):

  • Homologous genes within the same genome ascribed to the same gene family
  • Created by gene duplication
  • May perform different functions in the same host, as their function may change with evolution (lack of functional constraint)


Paralogs can possess different functions

From Biochemistry, Stryer



Paralog & Ortholog

▪ "Two genes are said to be paralogous if they are derived from a duplication event, but orthologous if they are derived from a speciation event."
W-H Li


Homology, Orthology and Paralogy

• Orphans
• Single copy genes without any homolog

• Strain-specific expansions
• Gene families of paralogs without any orthologs
• Thus confined to the same family

• Xenologs
• Homologous genes where one gene has been obtained by horizontal gene transfer (transfer of genetic material between organisms)
• Comparative analysis of bacterial, archaeal, and eukaryotic genomes indicates that a significant fraction of the genes in the prokaryotic genomes have been subject to horizontal transfer.


The PAM matrices

▪ Method is based on our knowledge of evolution:

▪ As sequences diverge they accumulate mutations

▪ The probability of a particular change can be derived from aligning homologous sequences

▪ e.g. we can count the number of S->T changes and calculate the relative probability of this happening

▪ The relative frequencies for all amino acid changes can be used to derive a scoring matrix – the PAM matrix
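The counting step can be sketched as below. This is a simplified illustration, not Dayhoff's actual procedure: it assumes we already have pairwise alignments of closely related sequences, treats S->T and T->S as the same unordered change, and uses invented toy sequences and function names.

```python
from collections import Counter


def count_substitutions(aligned_pairs):
    """Count unordered residue pairs in the mismatched columns of alignments."""
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        for a, b in zip(seq_a, seq_b):
            if a == "-" or b == "-" or a == b:
                continue  # ignore gaps and unchanged positions
            counts[frozenset((a, b))] += 1
    return counts


def relative_frequencies(counts):
    """Share of all observed changes accounted for by each residue pair."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Toy aligned pairs (invented for illustration).
pairs = [("SSTA", "STTA"), ("ASTS", "ATTT"), ("GSAV", "GTAI")]
counts = count_substitutions(pairs)
freqs = relative_frequencies(counts)
print(counts[frozenset("ST")], freqs[frozenset("ST")])
```

In this toy data S/T changes account for 4 of the 5 observed substitutions, so an S/T exchange would end up with a much more favourable score than a rarely seen change.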


The PAM matrices

▪ Problem: what if there have been multiple substitutions at the same site?
• This would bias the statistics
• Samples are therefore restricted to sequences that are sufficiently similar that this probably hasn't happened
▪ The PAM is a measure of sequence divergence
▪ 1 PAM = 1% accepted mutation (one accepted point mutation per 100 residues)
▪ Two sequences that are 1 PAM apart are 99% identical
▪ For example, a 1 PAM substitution matrix is produced by collecting statistics from sequences 1 PAM apart.


The PAM matrices

▪ However, we need to assert relatedness between more widely divergent sequences (99% similarity is not much use)
▪ How can we do this?
▪ We can extrapolate other matrices from the 1 PAM matrix
▪ Powers of the matrix can be taken to produce matrices more appropriate for more divergent sequences
• i.e. the matrix can be multiplied by itself.
▪ A range of PAM matrices has been derived in this way.
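The extrapolation by repeated multiplication can be illustrated on a toy example. The sketch below assumes an invented two-letter alphabet in which each residue has a 1% chance of changing per PAM unit; the real matrices are 20 x 20, and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply the matrix by itself `power` times, starting from the identity."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy 1 PAM mutation probability matrix: row = original residue,
# column = replacement residue, each row sums to 1.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)
print([round(p, 3) for p in pam250[0]])
```

After 250 steps the probability of seeing the original residue in this toy model falls to just over one half, the chance level for a two-letter alphabet. This shows how repeated changes at the same site are folded into the higher-numbered matrices.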


PAM Matrices contd..
▪ PAM matrices are named with the % divergence after the PAM designation.

▪ Thus the PAM250 matrix represents a level of 250% change expected (over 2500 million years)
• Because a site can change more than once, sequences at this level of divergence still have around 20% similarity

▪ The PAM250 matrix (corresponding to ~20% overall sequence similarity) represents the lowest similarity we can use for sequence analysis

▪ When used for protein comparison, the mutation probability (odds) matrix is normalized and the logarithm is taken (this lets us add the scores along a protein instead of multiplying the probabilities)

▪ The numbers are multiplied by ten and rounded to avoid decimal points

▪ The resulting matrix is a "log-odds" matrix.
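The conversion from probabilities to log-odds scores can be sketched as follows. This assumes the normalization divides the probability of a change by the background frequency of the replacement residue and that a base-10 logarithm is used; the function name and the example probabilities are invented for illustration.

```python
import math


def log_odds_score(p_change: float, background_freq: float) -> int:
    """10 * log10(observed probability / chance probability), rounded."""
    return round(10 * math.log10(p_change / background_freq))


print(log_odds_score(0.10, 0.05))   # twice as likely as chance: positive
print(log_odds_score(0.025, 0.05))  # half as likely as chance: negative
print(log_odds_score(0.05, 0.05))   # exactly as likely as chance: zero
```

A positive score therefore means the pair occurs in related sequences more often than chance would predict, and a negative score means it occurs less often. Because log(a x b) = log(a) + log(b), the per-position scores can simply be summed along an alignment.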


PAM 250 logarithm of odds matrix (S250 matrix)

C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   8
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -8  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   8
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W


The BLOSUM Matrices
▪ Developed by Henikoff & Henikoff (1992; PNAS 89:10915-10919)

▪ The Dayhoff matrices were superseded by the BLOSUM matrices when more sequence data became available

▪ Replaced the Dayhoff matrix with one that would perform better in identifying distant relationships

▪ Made use of the vast increase in data since Dayhoff's work

▪ BLOSUM stands for BLOcks SUbstitution Matrix

▪ They use the BLOCKS database to search for differences among sequences, but only among the very conserved regions of a protein family.


BLOCKS Database

From: http://www.psc.edu/general/software/packages/blocks/blocks.html ...

“Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database.

These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches.

It is these calibrated blocks that make up the BLOCKS database.”


The BLOSUM Matrices
▪ Multiple alignments of short regions of related sequences (without gaps) were collected.

▪ For each alignment, the sequences that are similar at some threshold value of percent identity are clustered into groups and averaged.

▪ Substitution frequencies for all pairs of amino acids were calculated between the groups; these were used to create the log-odds BLOSUM (BLOcks SUbstitution Matrix).

▪ Thus, BLOSUM62 means that the sequences clustered in this block are at least 62% identical.

▪ This allows detection of more distantly related sequences, as it downplays the role of the more related sequences in the block when building the matrix.

▪ In general, BLOSUM62 is less tolerant of substitutions to or from hydrophilic amino acids than PAM160 (its closest equivalent) and of cysteine and tryptophan mismatches
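The clustering step can be sketched as follows. This is a simplified illustration that assumes single-linkage grouping, where a segment joins a cluster if it is at least 62% identical to any member; the function names and the toy block are invented, and the later averaging within clusters is not shown.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions in two equal-length ungapped segments."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_block(segments, threshold=0.62):
    """Group segments so that each is >= threshold identical to some member
    of its own cluster; clusters linked by a new segment are merged."""
    clusters = []
    for seg in segments:
        linked = [c for c in clusters
                  if any(identity(seg, member) >= threshold for member in c)]
        merged = [seg]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


# Toy block of four ungapped segments (invented for illustration).
block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHLKV", "WYTSRQPNML"]
for cluster in cluster_block(block):
    print(cluster)
```

The three near-identical segments collapse into one cluster and so contribute as a single group when substitutions are counted between clusters. That is how the closely related sequences are downweighted.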


The BLOSUM 62 Matrix
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   6
D  -3   0  -1  -1  -2  -1   1   6
E  -4   0  -1  -1  -1  -2   0   2   5
Q  -3   0  -1  -1  -1  -2   0   0   2   5
H  -3  -1  -2  -2  -2  -2   1  -1   0   0   8
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5
K  -3   0  -1  -1  -1  -2   0  -1   1   1  -1   2   5
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5
I  -1  -2  -1  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4
L  -1  -2  -1  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4
V  -1  -2   0  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7
W  -2  -3  -2  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
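A log-odds matrix is used by looking up each aligned pair of residues and summing the scores. The sketch below scores a short ungapped alignment using a handful of entries copied from the matrix on this slide; the dictionary layout, function names and toy sequences are assumptions made for illustration.

```python
# A few entries copied from the matrix on this slide (the matrix is
# symmetric, so each pair is stored once).
BLOSUM62_SUBSET = {
    ("W", "W"): 11, ("C", "C"): 9, ("S", "S"): 4, ("S", "T"): 1,
    ("K", "K"): 5, ("K", "R"): 2, ("L", "L"): 4, ("L", "I"): 2,
    ("D", "D"): 6, ("D", "E"): 2, ("W", "C"): -2,
}


def pair_score(a, b, matrix):
    """Look up a residue pair in either order."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(seq_a, seq_b, matrix):
    """Sum the substitution scores over the columns of an ungapped alignment."""
    return sum(pair_score(a, b, matrix) for a, b in zip(seq_a, seq_b))


print(alignment_score("WSKD", "WTRE", BLOSUM62_SUBSET))  # 11 + 1 + 2 + 2
```

The conserved tryptophan dominates the total, while the conservative S/T, K/R and D/E exchanges still add small positive scores. A W/C mismatch would subtract 2.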


Interpreting Multiple Sequence Alignments for protein structure

▪ It is possible to infer information about protein structures from multiple sequence alignments.
▪ Protein sequences are aligned and the composition of conserved regions is examined


Conserved positions may be identified

▪ A multiple sequence alignment of the Drosophila chromatin proteins – showing a common protein sequence motif

[Figure: alignment with the conserved positions marked]

Inference of Structure from multiple sequence alignments (See Lesk pg. 188)

▪ The most highly conserved regions may correspond to an active site
▪ Regions rich in insertions and deletions probably correspond to surface loops.
▪ A position containing a conserved Gly or Pro may correspond to a turn
▪ A conserved pattern of hydrophobicity with spacing 2 (i.e. every other residue), with intervening residues more variable and including hydrophilic residues, suggests a β-strand on the surface
▪ A conserved pattern of hydrophobicity with spacing ~4 suggests a helix.
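The two hydrophobicity rules can be illustrated with a deliberately crude sketch. It assumes a single representative segment (for example a consensus of the conserved region), an assumed set of hydrophobic residues, and a simple "most common gap" test; the function names and sequences are invented, and real prediction methods are considerably more careful.

```python
from collections import Counter

HYDROPHOBIC = set("AVILMFWC")  # assumed grouping; other classifications exist


def dominant_spacing(segment: str):
    """Most common gap between successive hydrophobic positions, or None."""
    positions = [i for i, aa in enumerate(segment) if aa in HYDROPHOBIC]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return None
    return Counter(gaps).most_common(1)[0][0]


def suggest_structure(segment: str) -> str:
    """Apply the spacing rules of thumb from the slide."""
    spacing = dominant_spacing(segment)
    if spacing == 2:
        return "possible surface beta-strand"
    if spacing in (3, 4):
        return "possible helix"
    return "no clear pattern"


print(suggest_structure("VKVTVSV"))        # hydrophobic at every other position
print(suggest_structure("LKKELEEKLKKEL"))  # hydrophobic at every fourth position
```

The first toy segment alternates hydrophobic and hydrophilic residues, which matches the surface β-strand rule. The second places a leucine every fourth position, which matches the helix rule.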


MULTIPLE SEQUENCE ALIGNMENT

▪ Definition (from Attwood):

• “A multiple sequence alignment is a 2D table, in which the rows represent individual sequences, and the columns the residue positions”
• “Sequences are laid onto this grid in such a manner that
  • a) the relative positioning of residues within any one sequence is preserved
  • b) similar residues in all the sequences are brought into vertical register” (see example)


Multiple sequence alignments (MSAs) contd..

▪ Visual examination of multiple sequence alignments (MSAs) is very valuable
▪ Can be displayed in colour, with different colours for amino acids of different physicochemical types
▪ The alignment table can be summarised in a single line – a pseudo sequence called the consensus
▪ May be produced
• By hand, e.g. using alignment tools such as CINEMA
• Automatically, using implementations of algorithms such as ClustalW
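The consensus line can be derived from the alignment table column by column. The sketch below assumes a simple majority rule that ignores gaps; real tools often use thresholds or ambiguity symbols instead, and the function name and toy alignment are invented for illustration.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Most common non-gap residue in each column of the alignment.

    Ties go to the residue seen first when reading the column from top
    to bottom; a column made only of gaps yields a gap.
    """
    result = []
    for column in zip(*alignment):
        residues = [r for r in column if r != gap]
        if residues:
            result.append(Counter(residues).most_common(1)[0][0])
        else:
            result.append(gap)
    return "".join(result)


# Rows are sequences, columns are residue positions (toy data).
msa = ["MKV-LA",
       "MRVALA",
       "MKVALS"]
print(consensus(msa))
```

For this toy alignment the consensus is MKVALA: fully conserved columns are copied through, and variable columns take the majority residue.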


Conserved positions may be identified
▪ EXAMPLE: A multiple sequence alignment of the Drosophila chromatin proteins – showing a common protein sequence motif

http://www.russell.embl.de/aas

Ala,A   Cys,C   Asp,D   Glu,E   Phe,F
Gly,G   His,H   Ile,I   Lys,K   Leu,L
Met,M   Asn,N   Pro,P   Gln,Q   Arg,R
Ser,S   Thr,T   Val,V   Trp,W   Tyr,Y


Multiple sequence alignments

▪ MSAs are starting points for phylogenetic analysis
• Can be used to group sequences or sub-sequences into families
• Once an MSA is determined, each column in the alignment predicts the mutations that occurred at one site during the evolution of the sequence family
▪ Starting with an MSA it can be possible to determine the order of appearance of the sequences during evolution
▪ Multiple sequence alignments can be global or local
• Protein sequences may be conserved in their entirety through evolutionary change, or
• Functional domains in protein sequences may be conserved, whilst the remaining sequence diverges
▪ Used in common alignment programs such as ClustalW


Algorithms for calculating multiple sequence alignments

▪ Aligning multiple sequences is computationally expensive (even more than pairwise alignment)
▪ Only a few sequences can be aligned using an exact solution
▪ To align many sequences it is necessary to rely on heuristic strategies
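A rough calculation shows why exact solutions do not scale. The sketch below assumes the exact method is dynamic programming extended from two sequences to N sequences, which needs one table cell for every combination of positions; the function name is invented for illustration.

```python
def dp_cells(n_sequences: int, length: int) -> int:
    """Cells in the full dynamic-programming table for sequences of equal length."""
    return (length + 1) ** n_sequences


# Table size for sequences of 100 residues each.
for n in (2, 3, 5, 10):
    print(n, dp_cells(n, 100))
```

Two sequences of 100 residues need about ten thousand cells and three need about a million, but ten would need more than 10^20. This is why programs such as ClustalW build the alignment heuristically.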
