The document provides information about BLAST and sequence alignment algorithms. It defines key BLAST programs and what types of queries and databases they can search. It also defines important scoring terms like E-value, percent identity, and substitution matrices like BLOSUM and PAM that are used in sequence alignments.
BLAST Lecture Notes
Bioinformatics
Dr Mohamed Abdelmoteleb
Lecturer of Microbiology - Bioinformatics
BLAST (The Basic Local Alignment Search Tool)
BLAST algorithm: Karlin-Altschul algorithm
Find a short pattern of characters common to the two sequences, use it as a core, and extend the alignment in both directions from that core.
BLAST Programs
How to select a program:
– What type of query sequence you have (nucleotide or protein)
– What type of database you want to search against (nucleotide or protein)
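The two questions above map onto the standard BLAST program names. The sketch below is only an illustration of that lookup: the program names are the usual BLAST ones, which do not survive in the text of these slides, and the function name `select_blast_programs` is hypothetical.

```python
# Candidate BLAST programs for each (query type, database type) pair.
# blastx, tblastn and tblastx compare sequences at the protein level
# by translating the nucleotide side(s) in all six reading frames.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def select_blast_programs(query_type, database_type):
    """Return the BLAST programs that accept this query/database combination."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError("types must be 'nucleotide' or 'protein'")
    return BLAST_PROGRAMS[key]


print(select_blast_programs("nucleotide", "protein"))  # ['blastx']
```

For a nucleotide query against a nucleotide database two programs are listed: blastn compares the sequences directly, while tblastx compares their translations.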
The raw alignment score depends on:
– the match or mismatch scores between nucleic acid bases or amino acid residues (BLOSUM or PAM scoring matrices for amino acids)
– the number of indels (insertions/deletions), penalized by the gap opening penalty
– the total length of indels, penalized by the gap extension penalty
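As a minimal sketch of how these components combine, the function below scores an alignment that is already given as two gapped strings. The numeric values (match +2, mismatch -3, gap opening 5, gap extension 2) are illustrative assumptions, not values from the slides, and the convention that a gap of length k costs opening + k × extension is one common choice among several.

```python
def raw_alignment_score(aligned_a, aligned_b, match=2, mismatch=-3,
                        gap_open=5, gap_extend=2):
    """Raw score of a pairwise alignment given as two equal-length gapped strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # sequence that currently holds an open gap: "a", "b" or None
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gap characters")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if side != gap_side:
                score -= gap_open   # charged once per indel
            score -= gap_extend     # charged for every gapped position
            gap_side = side
        else:
            score += match if x == y else mismatch
            gap_side = None
    return score


# 6 matches (6 * 2 = 12) minus one indel of length 2 (5 + 2 * 2 = 9)
print(raw_alignment_score("ACGTTAGC", "AC--TAGC"))  # 3
```

For protein alignments the match/mismatch term would be looked up in a substitution matrix instead of using two fixed values.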
The bit score S' is the raw score S normalized to a log2 scale: S' = (λS − ln K) / ln 2
Parameters λ and K depend on the substitution matrix and the gap penalties (Karlin-Altschul statistics)
Important definitions
P-value: the probability of obtaining by chance a score x at least equal to S
P-val (S) = P(x ≥ S)
E-value (Expectation value): a correction of the P-value for multiple testing. In the context of database searches, the E-value is the number of distinct alignments, with a score equal to or better than S, that are expected to occur in a database search by chance. The lower the E-value, the more significant the score.
E-val (S) = P-val (S) * N
where N is the size of the search space (N = n*m, where n is the length of the query sequence and m is the total length of the database).
Max score = highest alignment score (bit score) between the query sequence and a database sequence segment.
Total score = sum of the alignment scores of all segments from the same database sequence that match the query sequence. This score differs from the max score if several parts of the database sequence match different parts of the query sequence.
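The scoring terms defined so far can be tied together in a short sketch. It assumes the usual Karlin-Altschul forms: bit score S' = (λS − ln K) / ln 2 and E = n × m × 2^(−S'), where 2^(−S') plays the role of the per-position chance term that is multiplied by the search space N = n × m. The values of λ and K, the sequence lengths, and the raw scores below are placeholders chosen for illustration, not values from the slides; BLAST reports the actual λ and K for the chosen matrix and gap penalties in its output.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S to a bit score: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance alignment, given the E-value."""
    return 1.0 - math.exp(-e)


# Placeholder statistical parameters and search-space sizes (assumptions).
LAMBDA, K = 0.267, 0.041
QUERY_LEN, DB_LEN = 250, 5_000_000

# Hypothetical raw scores of two segments matching the same database sequence.
segment_raw_scores = [120, 45]
segment_bits = [bit_score(s, LAMBDA, K) for s in segment_raw_scores]

max_score = max(segment_bits)      # best single segment
total_score = sum(segment_bits)    # all segments of this database sequence

for bits in segment_bits:
    e = e_value(bits, QUERY_LEN, DB_LEN)
    print(f"bits={bits:.1f}  E={e:.3g}  P={p_value(e):.3g}")
print(f"max score={max_score:.1f}  total score={total_score:.1f}")
```

For small E the P-value and the E-value are almost equal, which is why BLAST output is usually read in terms of E-values alone.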
Query coverage = percent of the query length that is included in the aligned segments.
Percent Sequence Identity: percent of identical matches between nucleotides or amino acids in a pairwise sequence alignment
Percent Sequence Similarity:
Percent similarity counts conservative amino acid changes along with identical matches. Such changes tend to preserve the physico-chemical properties of the original residue:
– Polar to polar
• aspartate → glutamate
– Nonpolar to nonpolar
• leucine → valine
– Similar sized residues
• glycine → alanine
Classification of Amino acids
• Acidic amino acid residues:
– aspartic acid (D)
– glutamic acid (E)
• Basic (high pH) amino acid residues:
– arginine (R)
– lysine (K)
– to a lesser extent histidine (H)
• Other polar (hydrophilic)
– asparagine (N)
– glutamine (Q)
– serine (S)
– threonine (T)
• Alpha-helix breaker, rigid structure
– proline (P)
• Hydrophobic residues
– leucine (L)
– isoleucine (I)
– valine (V)
– methionine (M)
• Aromatic residues
– tryptophan (the largest residue, W)
– phenylalanine (F)
– tyrosine (Y)
• Small side chain
– glycine (the smallest residue, G)
– alanine (A)
• Disulphide bridge forming
– cysteine (C), see also selenocysteine
Scoring matrices (FYI)
• Amino acid substitution matrices
– PAM
– BLOSUM
• DNA substitution matrices
– As a rule, DNA sequences are much less conserved than protein sequences
– It is less effective to compare coding regions at the nucleotide level
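One reason coding regions compare poorly at the nucleotide level is that different codons can encode the same amino acid. The sketch below uses two short made-up coding sequences and a partial copy of the standard genetic code (only the codons needed here) to show nucleotide identity dropping while protein identity stays at 100%. The function names are hypothetical, and percent identity is computed over ungapped sequences of equal length, which is a simplifying assumption.

```python
# Partial standard genetic code: only the codons used in this example.
CODON_TABLE = {
    "ATG": "M",
    "CTG": "L", "TTA": "L",
    "GGT": "G", "GGC": "G",
    "GCT": "A", "GCC": "A",
    "AAA": "K", "AAG": "K",
}


def translate(dna):
    """Translate a coding sequence codon by codon using CODON_TABLE."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def percent_identity(seq_a, seq_b):
    """Percent of identical positions between two ungapped, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return 100.0 * matches / len(seq_a)


# Made-up sequences that differ only by synonymous codon changes.
dna_1 = "ATGCTGGGTGCTAAA"
dna_2 = "ATGTTAGGCGCCAAG"

print(translate(dna_1), translate(dna_2))                   # MLGAK MLGAK
print(round(percent_identity(dna_1, dna_2), 1))             # 66.7
print(percent_identity(translate(dna_1), translate(dna_2))) # 100.0
```

Translated searches such as blastx or tblastn exploit exactly this effect by comparing sequences at the protein level.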
FYI: For your information
PAM
Point accepted mutation
PAM matrices are amino acid substitution matrices that encode the expected evolutionary change at the amino acid level.
A PAM matrix is a matrix in which each row and each column represents one of the twenty standard amino acids. For any specific pair (Ai, Aj) of amino acids, the PAM matrix reflects the frequency at which Ai is expected to be replaced by Aj in two sequences that have diverged by n PAM units.
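The dependence on n can be illustrated with a toy model. In the PAM approach, the mutation probability matrix for n PAM units is obtained by multiplying the 1-PAM matrix by itself n times (for example PAM250 = PAM1 raised to the power 250). The sketch below uses a hypothetical two-letter alphabet instead of the twenty amino acids, with rows as the original residue and columns as the replacement, which is an assumed orientation; the 1% mutation rate reflects the definition of one PAM unit as one accepted point mutation per 100 residues.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Mutation probability matrix after n PAM units: pam1 raised to the power n."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Toy 1-PAM matrix for a two-letter alphabet: each residue has a 1% chance
# of being replaced by the other one (hypothetical numbers).
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

for n in (1, 2, 250):
    matrix = pam_n(PAM1_TOY, n)
    print(n, [round(p, 4) for p in matrix[0]])
```

The probability of seeing a replacement grows with n, which is why higher-numbered PAM matrices suit more distantly related sequences. The scoring matrices used in alignments are log-odds versions of these probability matrices.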
BLOSUM
Blocks Substitution Matrix
– Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins
– Matrix name indicates evolutionary distance
– BLOSUM62 was created using sequences sharing no more than 62% identity
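How observed frequencies become scores can be sketched as a log-odds calculation: the frequency with which a pair is observed aligned in the blocks is compared with the frequency expected if the two residues paired at random. The sketch below assumes half-bit units (scale factor 2 with log base 2, as used for BLOSUM62) and treats `pair_freq` as the frequency of the ordered pair; the input frequencies are made up for illustration.

```python
import math


def log_odds_score(pair_freq, freq_i, freq_j, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected by chance)."""
    expected = freq_i * freq_j
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical frequencies: both residues have a background frequency of 0.1,
# so a chance pairing is expected with frequency 0.01.
print(log_odds_score(0.04, 0.1, 0.1))    # 4: pair seen 4x more often than chance
print(log_odds_score(0.01, 0.1, 0.1))    # 0: pair seen as often as chance
print(log_odds_score(0.0025, 0.1, 0.1))  # -4: pair seen 4x less often than chance
```

Positive scores therefore mark substitutions that are tolerated in related proteins, and negative scores mark substitutions that are rarely observed.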
The BLOSUM62 scoring matrix: a brief summary of a large part of protein biochemistry
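A few entries are enough to see the biochemistry in the matrix. The dictionary below is a small excerpt of BLOSUM62 values, not the full 20 × 20 matrix, and the helper name `blosum62_score` is hypothetical. The conservative substitutions mentioned earlier in these notes (aspartate/glutamate, leucine/valine, glycine/alanine) score zero or higher, while substitutions between dissimilar residues score negative.

```python
# Excerpt of BLOSUM62 (half-bit units); the matrix is symmetric.
BLOSUM62_EXCERPT = {
    ("D", "E"): 2,    # acidic to acidic
    ("L", "V"): 1,    # hydrophobic to hydrophobic
    ("A", "G"): 0,    # small to small
    ("D", "L"): -4,   # acidic to hydrophobic
    ("G", "W"): -2,   # smallest to largest residue
    ("L", "L"): 4,    # common residue conserved
    ("C", "C"): 9,    # cysteine conserved
    ("W", "W"): 11,   # tryptophan conserved
}


def blosum62_score(a, b):
    """Look up a pair in the excerpt, in either order."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


for pair in [("D", "E"), ("L", "V"), ("G", "A"), ("D", "L"), ("W", "G")]:
    print(pair, blosum62_score(*pair))
```

The identity scores also differ: conserving a rare, structurally distinctive residue such as tryptophan or cysteine is rewarded more than conserving a common residue such as leucine.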