2-Substitution Matrices and Python - 2017

Here are the key points about if statements in Python: - Indentation is used instead of brackets to denote blocks of code for if, else, etc. Code under an if or else must be indented. - Conditions can be chained with and, or, not keywords like in other languages. - Variables and expressions are evaluated and must result in a boolean True or False. - The body of the if and else blocks will execute conditionally depending on the boolean result. - Optional else blocks can specify code to run if the condition is False. - Multiple elif (else if) blocks can be specified to check additional conditions. So in summary, if statements in Python allow conditional execution

Uploaded by

Areej Zafar

Practical Session 2

Table of contents
• Scoring matrices
– PAM
– BLOSUM
• Intro to Python
Aligning Protein Sequences
• Classification
• Clustering of families
• Annotations (functional and structural)
Aligning Protein Sequences
• Proteins are built from 20 different amino acids.

Task given: align two protein sequences.

• Can the previous alignment algorithms be used?
• How do amino acids differ from one another?
Aligning Protein Sequences
When evaluating the probability of one amino
acid mutating to another, we need to
consider:
• Mutational Distance
• Chemical properties - similarity/difference
• Evolutionary time
Mutational Distance
Assume we start with Methionine, which is
encoded by a single codon: ATG
Thr (Threonine) is encoded by AC[ACGT]
To mutate Met to Thr, a single point mutation (one nucleotide change) is enough:

ATG (Met) → ACG (Thr)
• Three point mutations are required to mutate Met (ATG) to His, which is encoded by CA[TC]
• Therefore, His is more distant from Met than Thr is.
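The codon comparison above can be checked with a short sketch. The codon sets are the ones quoted on the slide; the helper names `hamming` and `min_mutations` are illustrative and do not come from any library.

```python
from itertools import product

# Codons as quoted on the slide: Met = ATG, Thr = AC[ACGT], His = CA[TC].
CODONS = {
    'Met': ['ATG'],
    'Thr': ['AC' + base for base in 'ACGT'],
    'His': ['CA' + base for base in 'TC'],
}

def hamming(codon_a, codon_b):
    """Number of positions at which two equal-length codons differ."""
    return sum(1 for a, b in zip(codon_a, codon_b) if a != b)

def min_mutations(aa_from, aa_to):
    """Fewest point mutations turning some codon of aa_from into some codon of aa_to."""
    return min(hamming(a, b)
               for a, b in product(CODONS[aa_from], CODONS[aa_to]))

met_to_thr = min_mutations('Met', 'Thr')  # best case: ATG -> ACG
met_to_his = min_mutations('Met', 'His')  # ATG -> CAT or CAC
print(met_to_thr)
print(met_to_his)
```

This counts only nucleotide differences; it says nothing about how likely each change is to be accepted by selection.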
Amino acids’ chemical properties
• Size
• Structure
• Polarity
• Charge
• Acidity (pKa)
These properties affect
mutation probabilities
Amino acids’ chemical properties
Mutations that change the
functionality (chemical
properties) of the protein
should be less likely to occur.
Evolutionary time
• Time is another aspect which needs attention.
• Does a longer time permit fewer or more
mutations?
• How can that be included in the scoring
system?
Evolutionary Substitution Matrix
• A substitution matrix contains values proportional to
the probability that amino acid $i$ mutates into amino
acid $j$, for all pairs of amino acids.
• Based on empirical observations
• Assumption: frequent substitutions reflect “safe
variations” and thus should be given a higher score,
while infrequent substitutions are probably detrimental
and thus should be given a lower score.
• The two major types of substitution matrices are PAM
and BLOSUM.
PAM Matrices
PAM – Percent/Point Accepted Mutations.
• The first widely used
scoring scheme used for
amino acid alignment.
• Devised by Margaret
Oakley Dayhoff and
colleagues in 1978.
PAM – point accepted mutation
• Substitution of an amino acid in a protein with
another amino acid, which is accepted by the
process of natural selection.
• Silent or lethal mutations are not point
accepted mutations
PAM Matrices
• PAM matrices are denoted PAMn
• PAM1 represents the evolutionary time over which
we expect 1% of the amino acids to undergo
point accepted mutations
Constructing PAM Matrices
• Dayhoff's group examined 1572 substitutions in 71
families of proteins (71 phylogenetic trees)
• The protein sequences were at least 85%
identical
Constructing PAM Matrices
• $A_{ij}$ - the number of observed cases in which amino acid $i$
mutated to amino acid $j$
• $n_i$ is the number of appearances of amino acid $i$
• $\lambda$ is a constant

• $M_{ij}$, the probability of $i$ mutating to $j$, is:

$$M_{ij} = \lambda \, \frac{A_{ij}}{n_i} \qquad (i \neq j)$$


Constructing PAM Matrices

[Figure: mutation probability matrix after 1 PAM. For clarity, the values have been multiplied by 10000.]

• The diagonal represents the probability of still observing the same
residue after 1 PAM.
• The diagonal therefore accounts for the 99% of cases in which no
mutation occurs.

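Deriving PAMn matrices
• PAM1 represents the evolutionary time in which 1%
of amino acids mutated
• PAM250 represents the evolutionary time in which
250% of amino acids mutated (250 accepted mutations per 100 residues)
• PAM250 represents sequences of approximately 20%
sequence similarity
• How can that be?
• Each amino acid can mutate more than once
Deriving PAMn matrices
• The PAMn mutation probability matrix is the PAM1 matrix raised to the
n-th power: $M^{(n)} = \left(M^{(1)}\right)^n$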
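A PAMn matrix is obtained by applying the PAM1 mutation probabilities n times in a row, that is, by raising the PAM1 matrix to the n-th power. The sketch below illustrates this on a made-up 3-letter alphabet; `toy_pam1` and its values are hypothetical and are not Dayhoff's data (the real matrix is 20 x 20).

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result

# Hypothetical 3-letter alphabet. Each row sums to 1 and the diagonal is 0.99,
# mimicking "1% of residues change per PAM unit".
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.003, 0.007, 0.990],
]

toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(x, 3) for x in row])
```

Each row of the result still sums to 1, and the diagonal stays above zero: after 250 PAM units a fraction of positions still shows the original residue, because repeated and back mutations hit the same sites more than once.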
Constructing PAM Matrices
• An amino acid's ($i$) frequency: $f_i = \frac{n_i}{N}$
• $n_i$ is the number of appearances of amino acid $i$
• $N$ is the total length of the sequences (all alignments)
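The frequency of an amino acid is simply its count divided by the total number of residues. A minimal sketch, assuming a handful of made-up sequences (`sequences` is hypothetical toy data, not the alignments used for PAM):

```python
from collections import Counter

# Hypothetical toy sequences standing in for the aligned protein data.
sequences = ['TAHGK', 'YSDGD', 'TAHGD']

counts = Counter(''.join(sequences))   # appearances of each amino acid
total = sum(counts.values())           # total length of all sequences
freqs = {aa: counts[aa] / float(total) for aa in counts}

for aa in sorted(freqs):
    print(aa + ' ' + str(round(freqs[aa], 3)))
```

The `float(total)` conversion keeps the division exact under Python 2, where dividing two integers would otherwise truncate to 0.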
Dayhoff's group computed the matrix
in the 1970s.
In 1991 it was recomputed by Jones'
group, who used a much larger set
of proteins but still obtained very
similar values for the relative
frequencies of substitutions.
From probabilities to scores
• So far, we have obtained a probability matrix,
but we would like a scoring matrix.
• The score is the log of the ratio between the observed frequency and
the frequency expected by chance:

$$S_{ij} = \log \frac{M_{ij}}{f_j} = \log \frac{\text{observed frequency}}{\text{expected frequency by chance}}$$
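Turning the ratio of observed to expected frequency into a score takes one line of arithmetic. The sketch below assumes a base-10 logarithm scaled by 10 and rounded to an integer, which is the convention usually quoted for the Dayhoff matrices; the function name `log_odds` and the example frequencies are illustrative.

```python
import math

def log_odds(observed, expected, scale=10.0):
    """Scaled base-10 log-odds score, rounded to the nearest integer.

    The scale factor of 10 is an assumption of this sketch.
    """
    return int(round(scale * math.log10(observed / expected)))

more_than_chance = log_odds(0.02, 0.01)    # seen twice as often as expected
less_than_chance = log_odds(0.005, 0.05)   # seen ten times less often than expected
as_chance = log_odds(0.03, 0.03)           # seen exactly as often as expected
print(more_than_chance)
print(less_than_chance)
print(as_chance)
```

Substitutions seen more often than chance get a positive score, rarer ones a negative score, and a ratio of exactly 1 scores 0.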

• Using the log has convenient practical consequences:

• A positive score ($S_{ij} > 0$) characterizes accepted
mutations
• A negative score ($S_{ij} < 0$) characterizes unfavorable
mutations
• Another property of log-odds scores is that they
can be added to produce the score of an alignment:

T A H G K
Y S D G D

$$S = S_{TY} + S_{AS} + S_{HD} + S_{GG} + S_{KD}$$
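Because log-odds scores add, the score of an ungapped alignment is the sum of its per-column scores. A minimal sketch for the alignment above; the values in `toy_scores` are placeholders chosen for illustration, and real values would be looked up in a PAM or BLOSUM table.

```python
# Placeholder log-odds scores for the five column pairs of the example.
toy_scores = {
    ('T', 'Y'): -3,
    ('A', 'S'): 1,
    ('D', 'H'): 1,
    ('G', 'G'): 5,
    ('D', 'K'): 0,
}

def pair_score(a, b, scores):
    """Look up a symmetric score: (a, b) and (b, a) share one entry."""
    return scores[tuple(sorted((a, b)))]

def alignment_score(seq1, seq2, scores):
    """Sum the per-column scores of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError('sequences must have the same length')
    return sum(pair_score(a, b, scores) for a, b in zip(seq1, seq2))

total_score = alignment_score('TAHGK', 'YSDGD', toy_scores)
print(total_score)
```

Storing each pair once under its sorted key keeps the table symmetric without duplicating entries.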
Choosing the right PAM matrix
• Correspondence between the observed percent
of amino acid difference and the evolutionary
distance (in PAM)
Choosing the right PAM matrix
• PAM120 matrix is the most appropriate for
database searches
• PAM200 matrix is the most appropriate for
comparing two specific proteins with suspected
homology
• A higher n is more appropriate for more distant
proteins
The model’s assumptions
• Only substitutions are allowed; no indels.
• Sites evolve independently: a mutation at one
site has no effect on another.
• Evolution model (Markov property):
The next mutation depends on the current state
and is independent of previous mutations.
Problem
PAM matrices work quite well for closely related
sequences, especially over short evolutionary
times.

However, they seem to lack the ability to
represent more distant/divergent sequences, on
a larger evolutionary time scale.
BLOSUM
(BLOcks SUbstitution Matrix)
Devised by Henikoff & Henikoff in 1992.
• Used to score alignments of evolutionarily
divergent (different) sequences.
• As the name hints, the scores are extracted
from local “blocks” of conserved sequences.
• Unlike PAMn, the n in BLOSUMn represents the maximal
percent identity between the sequences used, and every
BLOSUM matrix is computed directly from observations
rather than extrapolated.
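BLOSUM
(BLOcks SUbstitution Matrix)
• BLOSUM62 is the default matrix for the
standard protein BLAST program
• BLOSUM62 is derived from blocks in which sequences
sharing at least 62% identity in the ungapped alignment
are clustered and counted as one, so it reflects
substitutions between sequences less than 62% identical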
Constructing BLOSUM

Henikoff and Henikoff developed a database of
>2,000 “blocks” based on sequences from
>500 groups of related proteins with shared
subsequences
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
Why blocks?
• Don’t want insertions and deletions to
complicate estimation of substitution
probabilities
• Interested in detecting conserved regions of
protein sequences, so restrict attention to
these regions when computing the scoring
matrix
Constructing BLOSUM

Intuitively, $S_{ij}$ is the (log of the) ratio of:

• The number of times amino acids $i$ and $j$ appear
together in the same column
• Divided by the expected number of times to
see such pairs in the same column if the placement
of amino acids $i$ and $j$ were random throughout
BLOCKS.
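The ratio described above can be sketched on a tiny made-up block. This is a simplified illustration, not the full Henikoff procedure: it skips the clustering of similar sequences, scores only pairs that were actually observed, and uses the common convention of 2 x log2 (half-bit units). The block and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical ungapped block: one string per sequence, columns are aligned.
block = ['ABC', 'ABC', 'ABC', 'BBC']

def pair_counts(block):
    """Count unordered residue pairs observed within each column."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts

def blosum_like_scores(block):
    """Log-odds of observed versus expected pair frequencies, in half-bits."""
    counts = pair_counts(block)
    total = float(sum(counts.values()))
    q = {pair: n / total for pair, n in counts.items()}  # observed pair frequencies
    p = Counter()                                        # residue frequencies implied by the pairs
    for (a, b), freq in q.items():
        p[a] += freq / 2
        p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        # Chance of seeing the pair if residues were placed at random.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = int(round(2 * math.log(freq / expected, 2)))
    return scores

scores = blosum_like_scores(block)
for pair in sorted(scores):
    print(pair[0] + pair[1] + ' ' + str(scores[pair]))
```

In this toy block the identical pairs score positive and the A/B pair scores negative, because A and B share a column less often than their overall frequencies would predict.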
BLOSUM62
Differences between PAM and BLOSUM
Intro to Python
Why Python?

*By CodeEval - a platform used by developers to showcase their skills. 


Why Python?
• Quick development
• Easy to learn
• Huge community
• Fast enough for most applications
• Capable of interacting with most of the other
languages and platforms
Strings
http://www.codeskulptor.org/
s = 'hi'
print s[1]          # i
print len(s)        # 2
print s + ' there'  # hi there
pi = 3.14
# does not work: raises TypeError (cannot concatenate a str and a float)
text = 'The value of pi is ' + pi
# works: convert the number to a string first
text = 'The value of pi is ' + str(pi)
s = 3
String Slices
With s = 'Hello':
• s[1:4]
– 'ell' -- chars starting at index 1 and extending up to but not including index 4
• s[1:]
– 'ello' -- omitting either index defaults to the start or end of the string
• s[:]
– 'Hello' -- omitting both always gives us a copy of the whole thing (this is the pythonic way to copy a
sequence like a string or list)
• s[1:100]
– 'ello' -- an index that is too big is truncated down to the string length
• s[-1]
– 'o' -- last char (1st from the end)
• s[-3:]
– 'llo' -- starting with the 3rd char from the end and extending to the end of the string.
If statement
if speed >= 80:
    print 'License and registration please'
    if mood == 'terrible' or speed >= 100:
        print 'You have the right to remain silent.'
    elif mood == 'bad' or speed >= 90:
        print "I'm going to have to write you a ticket."
        write_ticket()
    else:
        print "Let's try to keep it under 80 ok?"

• Indentation is very important!
• Note there are no {} or ;
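The slide's snippet relies on `speed`, `mood` and `write_ticket()` being defined elsewhere. A self-contained variant, assuming we collect the messages in a list instead of printing them directly (the function name `traffic_stop` is illustrative):

```python
def traffic_stop(speed, mood):
    """Return the officer's lines for a given speed and mood."""
    lines = []
    if speed >= 80:
        lines.append('License and registration please')
        # The nested block is only reached when the outer condition holds.
        if mood == 'terrible' or speed >= 100:
            lines.append('You have the right to remain silent.')
        elif mood == 'bad' or speed >= 90:
            lines.append("I'm going to have to write you a ticket.")
        else:
            lines.append("Let's try to keep it under 80 ok?")
    return lines

for line in traffic_stop(95, 'good'):
    print(line)
```

Returning the lines rather than printing them makes each branch easy to check with different inputs.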


Lists
• my_list = [1,2,3,4,5,6,7,8,9,10]
• my_list[1:5] # [2, 3, 4, 5]
• my_list[::2]
– [1, 3, 5, 7, 9]
• my_list[::-1]
– reverse [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
• Lists can contain different types of variables:
• pi = ['pi', 3.14159, True]
Lists are dynamic
• students = ['Itay',9255587, 'Alon',744554]
• students.append('Michal')
# ['Itay',9255587, 'Alon',744554, 'Michal']
• students[0:2] = ['Noa']
# ['Noa', 'Alon',744554, 'Michal']
Range
range(10)
# returns an ordered list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
range(0,10,2)
#[0, 2, 4, 6, 8]

## print the numbers from 0 through 99
for i in range(100):
    print i
List iteration
• squares = [1, 4, 9, 16]
  total = 0    ## named total so the built-in sum() is not shadowed
  for num in squares:
    total += num
  print total  ## 30
Dict
## Can build up a dict by starting with the empty dict {}
## and storing key/value pairs into the dict like this:
## d[key] = value-for-that-key
## (the name d avoids shadowing the built-in dict type)
d = {}
d['a'] = 'alpha'
d['g'] = 'gamma'
d['o'] = 'omega'

print d  ## {'a': 'alpha', 'o': 'omega', 'g': 'gamma'}

print d['a']     ## Simple lookup, returns 'alpha'

d['a'] = 6       ## Overwrite the value stored under an existing key
'a' in d         ## True
print d['z']                  ## Throws KeyError
if 'z' in d: print d['z']     ## Avoid KeyError
print d.get('z')  ## None (instead of KeyError)
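A common use of `.get()` is counting, where a missing key should behave like zero. A minimal sketch that counts the residues of a made-up sequence (the sequence and variable names are illustrative):

```python
sequence = 'TAHGKYSDGD'   # hypothetical toy sequence

counts = {}
for residue in sequence:
    # get() returns the default 0 the first time a residue is seen.
    counts[residue] = counts.get(residue, 0) + 1

for residue in sorted(counts):
    print(residue + ' ' + str(counts[residue]))

missing = counts.get('W')            # no default given: None
missing_as_zero = counts.get('W', 0) # explicit default
```

The second argument of `.get()` is the value returned for a missing key; without it the result is `None`.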
Dict
d = {'a': 'alpha', 'o': 'omega', 'g': 'gamma'}
## By default, iterating over a dict iterates over its keys.
## Note that the keys are in an arbitrary order.
for key in d:
    print key
## prints a, g and o, one per line, in arbitrary order

## Exactly the same as above
for key in d.keys():
    print key

## Get the .keys() list:
print d.keys()  ## e.g. ['a', 'o', 'g']

## Likewise, there's a .values() list of values
print d.values()  ## e.g. ['alpha', 'omega', 'gamma']
Dict
d = {'a': 'alpha', 'o': 'omega', 'g': 'gamma'}
## Common case -- loop over the keys in sorted order,
## accessing each key/value
for key in sorted(d.keys()):
    print key, d[key]

## .items() is the dict expressed as (key, value) tuples
print d.items()  ##  [('a', 'alpha'), ('o', 'omega'), ('g', 'gamma')]

## This loop syntax accesses the whole dict by looping
## over the .items() tuple list, accessing one (key, value)
## pair on each iteration.
for k, v in d.items(): print k, '>', v
## a > alpha    o > omega     g > gamma
Reading and Writing to a file is simple
# Print the contents of a file
f = open('foo.txt', 'r')
for line in f:   ## iterates over the lines of the file
    print line,  ## trailing , so print does not add an end-of-line char
                 ## since 'line' already includes the end-of-line.
f.close()

# Write to a file; write() does not add newlines, so include '\n' explicitly
f = open("testfile.txt", "w")
f.write("Hello World\n")
f.write("This is our new text file\n")
f.write("and this is another line.\n")
f.write("Why? Because we can.\n")
f.close()
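The same read and write steps can be written with a `with` block, which closes the file automatically even if an error occurs. A minimal sketch, assuming a temporary directory so that no existing file is overwritten (the path is created by the standard `tempfile` module):

```python
import os
import tempfile

# Write into a fresh temporary directory rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), 'testfile.txt')

with open(path, 'w') as f:
    f.write('Hello World\n')
    f.write('This is our new text file\n')

with open(path, 'r') as f:
    # Strip the trailing newline that each line read from a file carries.
    lines = [line.rstrip('\n') for line in f]

print(lines)
```

No explicit `f.close()` is needed: leaving the `with` block closes the file.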
Functions
# Function definition is here
def print_info(name, age=35):
    print "Name:", name
    print "Age", age

# Now you can call the print_info function
print_info(age=50, name="miki")
# Name: miki
# Age 50
print_info(name="miki")
# Name: miki
# Age 35
Classes
class MyClass(object):
    common = 10
    def __init__(self):
        self.my_variable = 3
    def my_function(self, arg1, arg2):
        return self.my_variable

# This is the class instantiation
class_instance = MyClass()
class_instance.my_function(1, 2)  # 3
class_instance2 = MyClass()
# This variable is shared by all instances
class_instance.common   # 10
class_instance2.common  # 10
MyClass.common = 30     # both instances now see 30
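The difference between a class attribute and an instance attribute shows up once one of them is reassigned. A short sketch reusing the `MyClass` definition from the slide; the instance names `first` and `second` are illustrative.

```python
class MyClass(object):
    common = 10                   # class attribute, shared by all instances

    def __init__(self):
        self.my_variable = 3      # instance attribute, one per object

first = MyClass()
second = MyClass()

MyClass.common = 30               # rebinding on the class is seen by every instance
seen_by_both = (first.common, second.common)

first.common = 99                 # assigning through an instance creates an
                                  # instance attribute that hides the class one
after_shadowing = (first.common, second.common, MyClass.common)

print(seen_by_both)
print(after_shadowing)
```

Attribute lookup checks the instance first and falls back to the class, which is why only `first` sees 99.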
Some tutorials
• https://developers.google.com/edu/python/
• http://www.pythonforbeginners.com/
• https://www.codecademy.com/learn/python
• http://www.learnpython.org/
Important Python Packages for the data
scientist
Biopython - collection of non-commercial
Python tools for computational biology and
bioinformatics
NumPy – mathematical package
SciPy – scientific package
Matplotlib – 2D plotting
Pandas – data structures and analysis
Development
• PyCharm - Free community version
Python 2 or Python 3?

Python 2.x is legacy, Python 3.x is the present
and future of the language
Main differences
• Python 2
print 'Hello, World!'
print('Hello, World!')
print 'text',
print 'print more text on the same line'

• Python 3

print('Hello, World!')
print("some text,", end="")
print(' print more text on the same line')
Python 2 vs. Python 3
• Many prefer to use Python 2 because of larger
library support
• Coding style is very similar – not hard to
transition from Python 2 to Python 3
• In the assignment you will use Python 2
Next practical session
• Blast
• Fasta
