Sequence Alignment
Sequence alignment is the procedure of comparing two (pairwise alignment) or more
(multiple sequence alignment) DNA or protein sequences by searching for a series of
individual characters or character patterns that occur in the same order in the sequences.
Identical or similar characters are placed in the same column; non-identical characters are
placed either in the same column as a mismatch or opposite a gap in the other sequence.
Alignment is used to infer structural, functional and evolutionary relationships between the
sequences, and it measures the level of similarity between a query sequence and the
sequences available in a database. Exact alignment algorithms use dynamic programming,
which divides the problem into smaller subproblems and reuses their solutions. Alignments
are made quantitative by assigning scores. Alignment is also used to draw functional and
evolutionary inferences about a new protein from proteins that already exist in the database.
There are two types of pairwise alignment:
a. Global alignment: the entire sequences are aligned, using all sequence characters up to
both ends of each sequence. Sequences that are quite similar and approximately the same
length are suitable candidates for global alignment.
b. Local alignment: the stretches of sequence with the highest density of matches are
aligned, generating one or more islands of matches, or sub-alignments, in the aligned
sequences. This is suitable for sequences that share a conserved region or domain,
sequences that are similar along some of their length but dissimilar elsewhere, or
sequences that differ in length.
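The difference between the two types can be sketched with a small dynamic programming table. This is a minimal illustration, not the code of any particular tool: the function name `align_score` and the scores (match +1, mismatch -1, gap -2) are assumptions chosen for the example. The global variant fills the table in the style of Needleman-Wunsch, and the local variant clamps cells at zero in the style of Smith-Waterman.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the best alignment score of a and b by dynamic programming.

    local=False scores an end-to-end (global) alignment.
    local=True scores the best-matching pair of substrings (local).
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,   # align two characters
                       table[i - 1][j] + gap,        # gap in b
                       table[i][j - 1] + gap)        # gap in a
            if local:
                cell = max(cell, 0)                  # a local alignment may restart
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


# Two sequences that share only the island GATTACA.
print(align_score("TTTGATTACATTT", "CCGATTACACC", local=True))
print(align_score("TTTGATTACATTT", "CCGATTACACC", local=False))
```

The local score rewards only the shared island, while the global score is pulled down by the dissimilar ends, which is why local alignment suits sequences that share just a conserved region.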
The techniques available for alignment include dot matrix analysis, dynamic programming
algorithms and heuristic methods such as BLAST. BLAST is one of the most widely used
pairwise sequence alignment tools for comparing sequences. It was developed by Stephen
Altschul and colleagues at NCBI in 1990. There are different BLAST programs for different
comparisons, as shown in Table 1.
Nucleotide BLAST Programs:
BLASTn: The initial search is done for words of length ‘w’ that score at least a threshold
‘T’. The whole sequence is divided into overlapping words of length 11 for nucleic acids; a
sequence of length L yields at most L - w + 1 words, where w is the word length. The
BLASTn algorithm parses the query nucleotide sequence into 11-letter “words”, does the same
for every sequence in the database, and then identifies word matches between the query and
the database sequences. BLASTn searches for somewhat similar sequences.
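The word count L - w + 1 can be checked with a few lines of code. This is a small sketch; the helper name `split_into_words` and the example sequence are made up for illustration.

```python
def split_into_words(sequence, w=11):
    """Return every overlapping word of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


seq = "ATGGCGTACGCTTAGC"              # L = 16
words = split_into_words(seq, w=11)
print(len(words))                     # L - w + 1 = 16 - 11 + 1 = 6
print(words[0], words[-1])
```

A sequence shorter than the word size yields no words at all, which is one reason very short queries need a smaller word size.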
BLASTp: Finds the similarity between a query protein sequence and the protein sequences
available in a protein database. Like the other BLAST programs, BLASTp reports local
alignments rather than global ones. The BLASTp algorithm parses the query protein sequence
into 3-letter “words”, does the same for every sequence in the database, and then identifies
word matches between the query and the database sequences.
Scoring matrices:
Scoring matrices are used to assign a score to each pair of compared characters. There are
different types of scoring matrices:
Identity matrices: In this type of matrix every score is either 1 or 0, with the 1's lying
along the diagonal. The scoring scheme is based simply on matches and mismatches.
Unitary scoring matrices: These matrices also have only 0's and 1's as scores. The
difference is that they take into account transitions (changes within the purines or within
the pyrimidines) and transversions (changes between a purine and a pyrimidine).
PAM matrices: Margaret Dayhoff was the first to develop the PAM matrix. PAM stands for
Point Accepted Mutation. PAM matrices are calculated by observing the differences in
closely related proteins. One PAM unit (PAM1) specifies one accepted point mutation per
100 amino acid residues, i.e. 1% of residues change and 99% remain as they are.
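For nucleotides, the identity matrix and the transition/transversion distinction can be written down directly. The sketch below is illustrative only; the names `identity_matrix` and `substitution_type` are hypothetical and not taken from any BLAST code.

```python
BASES = "ACGT"
PURINES = {"A", "G"}          # C and T are the pyrimidines


def identity_matrix(alphabet=BASES):
    """Score 1 on the diagonal (match) and 0 everywhere else (mismatch)."""
    return {(x, y): int(x == y) for x in alphabet for y in alphabet}


def substitution_type(x, y):
    """Classify a pair of bases as a match, a transition or a transversion."""
    if x == y:
        return "match"
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


matrix = identity_matrix()
print(matrix[("A", "A")], matrix[("A", "G")])
print(substitution_type("A", "G"), substitution_type("A", "C"))
```

A matrix that treats transitions differently from transversions could be built by mapping the result of `substitution_type` to a score for each pair.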
Threshold: A boundary, expressed as a minimum or maximum value, that is used to filter out
words during comparison.
True homology: In BLAST, true homology refers to how similar a database sequence is to the
query sequence.
E-value: The expected value (E-value) is a parameter that describes the number of hits one
can expect to see just by chance when searching a database of a particular size. It decreases
exponentially as the score assigned to an alignment between two sequences increases.
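The exponential dependence on score is usually written with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n is the database length and S is the raw score. The sketch below assumes that formula. The values of K and lambda depend on the scoring system, so the defaults used here are placeholders for illustration and not values to rely on.

```python
import math


def expected_hits(score, query_len, db_len, k=0.041, lam=0.267):
    """Expected number of chance hits with at least this raw score.

    Implements E = K * m * n * exp(-lambda * S).
    k and lam are illustrative placeholder values (assumption).
    """
    return k * query_len * db_len * math.exp(-lam * score)


for s in (40, 60, 80):
    print(s, expected_hits(s, query_len=300, db_len=1_000_000))
```

Raising the score lowers E exponentially, while a larger database raises E in direct proportion to its size.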
Word size: The search takes words of a fixed size from the query sequence, compares them
with the database sequences, and assigns a score to each comparison. The default word size
is 11 for nucleic acids and 3 for proteins.
Putative conserved domains: These are regions of the query that match known conserved
domains; different domains have different functions.
Gap score or gap penalty: Dynamic programming algorithms use gap penalties to keep
alignments biologically meaningful. A penalty is subtracted from the alignment score for
each gap that is introduced, that is, for each insertion or deletion. There are two kinds of
gap penalty, gap opening and gap extension. The opening penalty is always applied at the
start of a gap, and each following position in the same gap is given a gap extension
penalty, which is smaller than the opening penalty. Typical values are –12 for gap opening
and –4 for gap extension.
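As a worked example of the opening and extension penalties, the total cost of one gap can be computed as follows. The function name `gap_penalty` is hypothetical. The convention follows the description above, where the first gap position is charged the opening penalty and every later position the extension penalty; some programs use a slightly different convention.

```python
def gap_penalty(length, gap_open=-12, gap_extend=-4):
    """Total penalty for a single gap spanning `length` positions."""
    if length <= 0:
        return 0
    # The first position opens the gap, the remaining ones extend it.
    return gap_open + (length - 1) * gap_extend


for n in (1, 2, 3):
    print(n, gap_penalty(n))   # -12, -16, -20
```

One gap of length 3 costs -20, whereas three separate gaps of length 1 cost -36, so the scheme favours a few long gaps over many short ones.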
The query sequence is taken and analyzed for low-complexity regions. Low-complexity regions
are regions that contain little information or variation, such as AAAAAAAA or ATATATAT.
A list of words of a certain word size is made. Usually the word size is 3 for proteins and
11 for DNA.
Scores are calculated for each pair of words (query sequence word and database word) using
substitution scoring matrices (like PAM or BLOSUM), and only the high-scoring words, i.e.
those above a threshold value or cutoff score, are taken for further alignment. A cutoff
score is selected to reduce the number of matches and so decrease the computation time.
This scoring and checking is repeated for all the words in the query sequence.
The remaining high-scoring words are organised into an efficient search tree and rapidly
compared to the database sequences. This is done to find exact matches.
If an exact or good match is found, the alignment is extended in both directions from the
position where the match occurred.
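The word-matching and extension steps above can be sketched in simplified form. This is not the BLAST implementation: it uses exact word matches and a +1/-1 score instead of a substitution matrix with threshold T, it does not allow gaps, and the names `find_seeds`, `extend_seed` and `x_drop` are assumptions made for the example.

```python
def find_seeds(query, subject, w=3):
    """Yield (query_pos, subject_pos) for every exact word match."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            yield i, j


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a word match without gaps in both directions.

    Extension in one direction stops once the running score falls more
    than x_drop below the best score seen in that direction.
    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def walk(qi, sj, step):
        score = best = best_len = steps = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            steps += 1
            if score > best:
                best, best_len = score, steps
            elif best - score > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + w, j + w, +1)
    left, left_len = walk(i - 1, j - 1, -1)
    return w * match + left + right, i - left_len, i + w + right_len


query, subject = "AAGATTACACC", "TTGATTACAGG"
seeds = list(find_seeds(query, subject))
score, start, end = extend_seed(query, subject, *seeds[0])
print(seeds[0], score, query[start:end])
```

Indexing the query words first and then scanning the database sequence once is what makes the search fast compared with full dynamic programming against every database sequence.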
BLAST Procedure
Step 1: Select the BLAST program, or search for a DNA or protein sequence and then click
“Run BLAST” in the right-hand panel.
Step 2: Enter a query sequence or upload a file containing the sequence.
Step 3: Select the database to search.
Step 4: Select the algorithm and the parameters of the algorithm for the search.
Step 5: Run the BLAST program.
The user has to specify the type of BLAST program, such as BLASTp, BLASTn, BLASTx,
tBLASTn or tBLASTx.
Enter a query sequence by pasting it into the query box or by uploading a FASTA file that
contains the sequence for the similarity search. This step is the same for all BLAST
programs. The user can give an accession number, a gi number or a raw FASTA sequence.
See the simulator tab for more on how to retrieve a query sequence.
Figure 1: Enter a query sequence or upload a file containing sequence
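A FASTA file is plain text in which each record starts with a ">" header line followed by one or more lines of sequence. A minimal reader might look like the sketch below; the function name `parse_fasta` and the example records are made up, and real files may need more careful handling.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]            # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line      # sequence may span several lines
    return records


example = ">seq1 example protein\nMKVLAAGI\nVGLSA\n>seq2\nMKTLL\n"
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```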
The user first has to know which databases are available and what types of sequences they
contain. A sequence similarity search looks for sequences similar to the query sequence in
the selected databases (Figure 2).
Step 4: Select the algorithm and the parameters of the algorithm for the search
Some BLAST programs offer a choice of algorithms, and the user has to specify which one to
use. Nucleotide BLAST offers MegaBLAST, which searches for highly similar sequences;
discontiguous MegaBLAST, which searches for more dissimilar sequences; and BLASTn, which
searches for somewhat similar sequences. Protein BLAST offers BLASTp, which searches for
similarity between a protein query and a protein database; PSI-BLAST, which performs a
position-specific search iteratively; PHI-BLAST, which searches the database for sequences
containing a particular pattern (the user has to enter the pattern in the PHI pattern box
provided); and DELTA-BLAST (Domain Enhanced Lookup Time Accelerated BLAST), which
searches multiple sequences and aligns them to find protein homology. The algorithm
parameters required to run the BLAST programs are target sequences, short queries, E-value,
word size, query range, scoring parameters (match/mismatch scores and gap penalties) and
filters (filter and mask). Default values are provided, but the user can adjust them as
needed, as shown in Figure 3.
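The same choices can also be made outside the web form. The sketch below assumes a local installation of the NCBI BLAST+ command-line tools, which this procedure does not otherwise use, and only builds the argument list without running it. The helper name `blast_command`, the file name `query.fasta` and the database name `nt` are hypothetical.

```python
def blast_command(program, query_file, database, evalue=10.0,
                  task=None, word_size=None, max_target_seqs=None):
    """Build an argument list for a BLAST+ program such as blastn or blastp."""
    cmd = [program, "-query", query_file, "-db", database,
           "-evalue", str(evalue)]
    if task is not None:
        # For blastn, for example: megablast, dc-megablast or blastn.
        cmd += ["-task", task]
    if word_size is not None:
        cmd += ["-word_size", str(word_size)]
    if max_target_seqs is not None:
        cmd += ["-max_target_seqs", str(max_target_seqs)]
    return cmd


print(" ".join(blast_command("blastn", "query.fasta", "nt",
                             task="megablast", word_size=28)))
```

The resulting list could be passed to `subprocess.run` on a machine where BLAST+ and the database are installed.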
The search is submitted by clicking the BLAST button at the end of the page. A screenshot
of the result is shown in Figure 4.
BLAST Result:
After the query sequence is submitted for a sequence similarity search, the result page
appears with information such as the query ID, description, molecule type, sequence length,
database name and BLAST program. It also shows the putative conserved domains that were
detected during the sequence similarity search.
The query sequence is represented as a numbered red bar below the color key. Database hits
are shown below the query (red) bar, ordered by alignment score, so the most closely related
sequences are placed nearest to the query sequence. The user can see a fuller description
of each alignment by moving the mouse over its colored bar, as shown in Figure 5.
Figure 5: BLAST result
Each alignment is preceded by the sequence identifier, along with the definition line and
the length of the matched sequence, followed by the score and E-value. The same line also
reports the number of identical residues in the alignment (identities), the number of
positives and the number of gaps used in the alignment. Finally, the actual alignment is
shown, with the query sequence on top and the database sequence below it. The numbers on
either side of the alignment indicate the positions of the amino acids or nucleotides in
each sequence, as shown in Figure 6.
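The identities and gaps reported for an alignment can be recomputed from the two aligned rows. The sketch below assumes both rows have the same length and use "-" for gaps; the function name `alignment_stats` is made up. Positives are left out because they depend on the scoring matrix in use.

```python
def alignment_stats(query_row, subject_row):
    """Count identities and gaps in a pairwise alignment."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    pairs = list(zip(query_row, subject_row))
    identities = sum(q == s and q != "-" for q, s in pairs)
    gaps = sum(q == "-" or s == "-" for q, s in pairs)
    length = len(pairs)
    return {
        "identities": identities,
        "gaps": gaps,
        "length": length,
        "percent_identity": 100 * identities / length,
    }


print(alignment_stats("GATT-ACA", "GACTTACA"))
```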
MULTIPLE SEQUENCE ALIGNMENT – ClustalX 2.1
Clustal X is a windows interface for the Clustal W multiple sequence alignment program. It
provides an integrated environment for performing multiple sequence and profile alignments
and analysing the results. The sequence alignment is displayed in a window on the screen. A
versatile coloring scheme has been incorporated allowing you to highlight conserved features
in the alignment. The pull-down menus at the top of the window allow you to select all the
options required for traditional multiple sequence and profile alignment.
You can cut-and-paste sequences to change the order of the alignment; you can select a subset
of sequences to be aligned; you can select a sub-range of the alignment to be realigned and
inserted back into the original alignment. Alignment quality analysis can be performed and
low-scoring segments or exceptional residues can be highlighted. ClustalX is available on
Linux, Mac and Windows.
SEQUENCE INPUT
Sequences and profiles (a term for pre-existing alignments) are input using the FILE menu.
Invalid options will be disabled. All sequences must be included in one file. Seven formats are
automatically recognised: NBRF/PIR, EMBL/SWISSPROT, Pearson (Fasta), Clustal (*.aln),
GCG/MSF (Pileup), GCG9 RSF and GDE flat file. All non-alphabetic characters (spaces,
digits, punctuation marks) are ignored except "-" which is used to indicate a GAP ("." in
MSF/RSF).
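The character-filtering rule can be mimicked in a few lines. This is only an illustration of the rule as stated, not Clustal X code, and the function name `clean_sequence` is hypothetical.

```python
def clean_sequence(raw, gap_chars="-"):
    """Keep letters and gap characters; drop spaces, digits and punctuation."""
    return "".join(ch for ch in raw if ch.isalpha() or ch in gap_chars)


print(clean_sequence("1 ATG-C GT 10!"))
print(clean_sequence("AT..GC 6", gap_chars="."))   # MSF/RSF style gaps
```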
ALIGNMENT DISPLAY
The alignment is displayed on the screen with the sequence names on the left hand side. The
sequence alignment is for display only; it cannot be edited here (except for changing the
sequence order by cutting-and-pasting on the sequence names).
A ruler is displayed below the sequences, starting at 1 for the first residue position (residue
numbers in the sequence input file are ignored). A line above the alignment is used to mark
strongly conserved positions. Three characters ("*", ":" and ".") are used:
"*" indicates positions which have a single, fully conserved residue.
":" indicates that one of the following 'strong' groups is fully conserved: STA, NEQK,
NHQK, NDEQ, QHRK, MILV, MILF, HY, FYW.
"." indicates that one of the following 'weaker' groups is fully conserved: CSA, ATV, SAG,
STNK, STPA, SGND, SNDEQK, NDEQHK, NEQHRK, FVLIM, HFY.
These are all the positively scoring groups that occur in the Gonnet Pam250 matrix. The
strong and weak groups are defined as strong score > 0.5 and weak score <= 0.5 respectively.
For profile alignments, secondary structure and gap penalty masks are displayed above the
sequences, if any data is found in the profile input file.
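The conservation line above the alignment can be reproduced with a short function. The sketch below is an illustration rather than Clustal X's own code. The group lists are the strong and weaker groups as given in the Clustal X documentation, and leaving any column that contains a gap unmarked is an assumption of this example.

```python
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_line(rows):
    """Return one mark per column: '*', ':', '.' or a space."""
    marks = []
    for column in zip(*rows):
        residues = set(column)
        if "-" in residues:
            marks.append(" ")                       # assumption: gaps are unmarked
        elif len(residues) == 1:
            marks.append("*")                       # single conserved residue
        elif any(residues <= set(g) for g in STRONG):
            marks.append(":")                       # within one strong group
        elif any(residues <= set(g) for g in WEAK):
            marks.append(".")                       # within one weaker group
        else:
            marks.append(" ")
    return "".join(marks)


alignment = ["MKSS-", "MRTGA", "MKAGA"]
print(repr(conservation_line(alignment)))
```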
Colors
Clustal X provides a versatile coloring scheme for the sequence alignment display. The
sequences (or profiles) are colored automatically, when they are loaded. Sequences can be
colored either by assigning a color to specific residues, or on the basis of an alignment
consensus. In the latter case, the alignment consensus is calculated automatically, and the
residues in each column are colored according to the consensus character assigned to that
column. In this way, you can choose to highlight, for example, conserved hydrophilic or
hydrophobic positions in the alignment.
QUALITY SCORES
6. A window will appear; note down the location where it saves the output. Click ‘ok’.
7. You will get a result like this. Take a screenshot of it and paste it into Word.
8. Now open the ‘aln’ file and trim it so that the starting point and end point are uniform
for all the sequences.
9.
10. Then save this file and upload the ‘aln’ file again in ClustalX 2.1. Click ‘alignment’
and select ‘do complete alignment’.
11. You will see a result like this. Take a screenshot and paste it into Word.
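The trimming in step 8 can also be done programmatically once the aligned rows have been read from the ‘aln’ file. The sketch below assumes the rows are already available as equal-length strings with "-" for gaps; reading the file itself is not shown, and the function name `trim_alignment` is hypothetical.

```python
def trim_alignment(rows):
    """Cut leading and trailing columns so no row starts or ends with a gap."""
    # First column at which every row has begun.
    start = max(len(r) - len(r.lstrip("-")) for r in rows)
    # One past the last column at which every row is still present.
    end = min(len(r.rstrip("-")) for r in rows)
    return [r[start:end] for r in rows]


rows = ["--ATGCA-",
        "GGATGCAT",
        "-GATGC--"]
for row in trim_alignment(rows):
    print(row)
```

Gaps inside the alignment are kept; only the ragged ends are removed, so every sequence begins and ends at the same column.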