PSI Blast and Position Specific Scoring Matrix
A multiple alignment is useful for grouping proteins, provided of course that they are related. Example: the whole family
of globins. They bind the heme group and oxygen, and the alpha and beta subunits share common features.
Some conserved positions can be spotted just by looking at the alignment, and so can, say, a
hydrophobic aa at a given position. But what can't be spotted by eye? Some rare aa may be an
important signal. There are many methods that turn the multiple alignment into a description of
the whole protein family. They also let me remove unnecessary information, e.g. when some of the
signal is masked by noise.
Ways I can extract these features: PSI-BLAST, already studied, uses PSSMs, i.e.
position-specific scoring matrices. It is the PSSM version of BLAST: BLAST uses predefined matrices such as
BLOSUM62, while PSI-BLAST uses these dynamic matrices. A PSSM is a scoring matrix that scores
an aa not by how often it has been substituted, but by the position in which it is found. I
must know which protein I'm considering, because position 3 of a globin is not the same as position 3 of another protein. I
build a matrix recording how much each aa is represented at each position. At this stage it is not a score but a
FREQUENCY. At one position A may occur 34 times and L 36 times, which
is about the same, while at another position F may occur 78 times, which is significant. I
count how many times the aa occurs at that position OUT OF the number of aligned sequences.
As we can see, C (position 2) is a stronger signal than A (position 1) because it occurs in 100 out of 100 sequences.
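The counting described above can be sketched in a few lines of Python. This is only a minimal illustration: the toy alignment is made up, the function name position_frequencies is hypothetical, and gaps are simply ignored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_frequencies(alignment):
    """Return one dict per alignment column: amino acid -> fraction of sequences."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    columns = []
    for pos in range(length):
        # Count how many sequences carry each residue in this column.
        counts = Counter(seq[pos] for seq in alignment)
        columns.append({aa: counts.get(aa, 0) / n_seqs for aa in AMINO_ACIDS})
    return columns

# Toy alignment (made up): 4 sequences, 3 columns.
toy_alignment = ["ACD", "LCD", "ACE", "LCD"]
freqs = position_frequencies(toy_alignment)

print(freqs[0]["A"], freqs[0]["L"])  # column 1: A and L share the position
print(freqs[1]["C"])                 # column 2: C is fully conserved
```

In this toy case column 2 behaves like the fully conserved C of the notes, while column 1 is split between A and L.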
A weight matrix is not the same as a consensus sequence. For example, if we compare the test seq with the consensus we might not
say it is part of the group, but if we compare it with the matrix we would say it is part of
the family. For each aa of the test seq we can read off a frequency (a percentage) saying how many times that aa occurs
in the family at that position. Log-odds score: S(i,a) = log10( q(i,a) / p(a) ), i.e. the ratio of observed to
expected, where q(i,a) is the observed frequency of aa a at position i and p(a) is its expected (background) frequency.
The bigger the score, the more that aa is over-represented at that position. PAM and
BLOSUM are not so different, because they are also built from observed/expected frequency ratios; the difference is the
shape of the matrix: 20 x 20 (aa vs aa) for PAM/BLOSUM, 20 x number of positions for a PSSM.
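As a small worked example of the log-odds idea (log10 of observed over expected), here is a sketch in Python. Assumptions that are not in the notes: a uniform background of 0.05 for each of the 20 aa, and a small pseudocount weight so that an aa never observed at a position does not give log(0). The names log_odds and pseudo_weight are hypothetical.

```python
import math

def log_odds(observed_freq, background_freq, pseudo_weight=0.1):
    """log10(observed / expected) for one amino acid at one position.

    The observed frequency is first blended with the background
    (an assumed pseudocount scheme) so that observed_freq == 0 is allowed.
    """
    smoothed = (observed_freq + pseudo_weight * background_freq) / (1 + pseudo_weight)
    return math.log10(smoothed / background_freq)

BACKGROUND = 0.05  # assumption: all 20 amino acids equally likely

conserved = log_odds(1.0, BACKGROUND)   # seen in every sequence
neutral = log_odds(0.05, BACKGROUND)    # seen exactly as often as expected
absent = log_odds(0.0, BACKGROUND)      # never seen at this position

print(conserved, neutral, absent)
```

A positive score means the aa is more frequent than expected at that position, a score near zero means it is as frequent as expected, and a negative score means it is rarer than expected.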
Now, if we have a PSSM, we can use it in place of a BLOSUM matrix, but with some
differences. The scores won't be the same: one is aa vs aa, the other is aa vs
position.
A PSSM is a matrix with the 20 aa on the Y axis and the position numbers on the
X axis. For each aa of the query I look up how much it scores at position n.
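To make the "aa vs position" lookup concrete, here is a minimal sketch in Python. The PSSM values are invented and only three aa are listed per column to keep it short; score_against_pssm is a hypothetical helper, not part of any BLAST tool.

```python
def score_against_pssm(sequence, pssm):
    """Sum the per-position scores of `sequence` (same length as the PSSM)."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    # For each residue, look up its score in the column of its own position.
    return sum(column[aa] for aa, column in zip(sequence, pssm))

# Made-up PSSM with 3 positions; one dict (column) per position.
toy_pssm = [
    {"A": 2, "L": 2, "C": -3},   # position 1: A and L both tolerated
    {"A": -3, "L": -3, "C": 6},  # position 2: C strongly conserved
    {"A": -1, "L": 1, "C": -3},  # position 3: weak preference for L
]

good = score_against_pssm("ACL", toy_pssm)
bad = score_against_pssm("CAA", toy_pssm)
print(good, bad)
```

Note that the same aa (A) scores 2 at position 1 and -3 at position 2, which a fixed aa vs aa matrix cannot express.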
PSI-BLAST calculates a PSSM starting from the result of a normal BLAST run, for which I give
a standard, generic matrix. Using the BLAST algorithm, it then builds a PSSM from the database
sequences it finds and refines it over several iterations. The last step produces the final PSSM,
and PSI-BLAST runs this new matrix against the same starting database. I don't repeat this many,
many times: not only because it is time-expensive, but because it is useless, since the same
database ends up giving the same result. PSI-BLAST is not good to run on the whole database: the
larger the database, the less precise the result. It is best used on families. The idea is to
converge on sequences that were not part of the starting set. Every time I run a PSI-BLAST cycle
the PSSM changes, and at each cycle the number of alignments in the family increases. We start
from a query sequence and we want to catch the whole family in a database. BLAST finds only a few
members, but PSI-BLAST is more sensitive because, over several cycles, it finds a larger family or
almost the whole family. Maybe 7-8 iterations are enough. There is no need to repeat 20-30 times,
because the results converge and we would just waste time.
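The iterate-until-it-converges idea can be illustrated with a toy loop in Python. This is a simplified stand-in, not the real PSI-BLAST algorithm: it uses ungapped sequences of equal length, a plain frequency profile instead of a log-odds PSSM, and a fixed score threshold instead of E-values. All names, sequences and the 0.65 threshold are made up.

```python
from collections import Counter

def build_profile(sequences):
    """Per-column residue frequencies for equal-length, ungapped sequences."""
    n = len(sequences)
    return [
        {aa: count / n for aa, count in Counter(column).items()}
        for column in zip(*sequences)
    ]

def profile_score(sequence, profile):
    """Mean frequency of the sequence's residues in their columns (0 to 1)."""
    return sum(col.get(aa, 0.0) for aa, col in zip(sequence, profile)) / len(profile)

def iterative_search(query, database, threshold=0.65, max_iterations=8):
    """Rebuild the profile from the current hits until the hit list stops changing."""
    hits, hits_per_round = [], []
    for _ in range(max_iterations):
        # Round 1 uses the query alone; later rounds add the hits found so far.
        profile = build_profile([query] + hits)
        new_hits = [s for s in database if profile_score(s, profile) >= threshold]
        hits_per_round.append(len(new_hits))
        if new_hits == hits:  # converged: another round would change nothing
            break
        hits = new_hits
    return hits, hits_per_round

toy_database = ["AAAAC", "AAACC", "AACCC", "DDDDD"]
family, hits_per_round = iterative_search("AAAAA", toy_database)
print(family)
print(hits_per_round)
```

In this toy run the first round finds one member, the second round finds a second, more distant one, and the third round finds nothing new, so the loop stops well before the maximum of 8 iterations.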
PSCAN
Going forward, we want a strategy to group alignments, i.e. to put proteins in relation to each other.
For proteins this often means alignment. We want to find similarities and patterns in proteins: a
specific aa followed by another, a group of 3 aa… PRINTS is a database of motifs and small
patterns, so we can check whether any of them occurs in our proteins.
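A pattern such as "a specific aa followed by another" can be checked with a regular expression, as in the sketch below. This is only an illustration with Python's standard re module: the motif and the protein sequences are invented, and this is not how PRINTS itself stores or scans its motifs.

```python
import re

# Hypothetical motif: G, then any residue, then K, then S or T.
MOTIF = re.compile(r"G.K[ST]")

def find_motif(sequence, pattern=MOTIF):
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence)]

# Made-up protein sequences.
toy_proteins = {
    "prot1": "MAGAKSLLV",
    "prot2": "MLLVVAAP",
    "prot3": "GGKTAAAA",
}

matches = {name: find_motif(seq) for name, seq in toy_proteins.items()}
for name, found in matches.items():
    print(name, found)
```

Here prot1 and prot3 carry the motif (at positions 3 and 1), while prot2 does not.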