Sequence Alignment

Uploaded by

juhiyaadav

PAIRWISE AND MULTIPLE SEQUENCE ALIGNMENT - BLAST

Sequence alignment is the procedure for comparing two (pairwise alignment) or more
(multiple sequence alignment) DNA or protein sequences by searching for a series of
individual characters or character patterns that occur in the same order in the sequences.
Identical or similar characters are placed in the same column; non-identical characters are
placed either in the same column as a mismatch or opposite a gap in the other sequence.
Alignment is used to infer structural, functional and evolutionary relationships between
sequences, for example to relate a new protein to proteins already present in a database. It
measures the level of similarity between the query sequence and the available database
sequences. Exact alignment algorithms use a dynamic programming approach, which breaks the
problem into smaller overlapping subproblems and reuses their solutions, and they quantify
the alignment by assigning scores.
There are two types of pairwise alignment:
a. Global alignment – aligns the entire sequences, using all sequence characters up to both
ends of each sequence. Sequences that are quite similar and approximately the same length
are suitable candidates for global alignment.
b. Local alignment – aligns the stretches of sequence with the highest density of matches,
generating one or more islands of matches (sub-alignments) in the aligned sequences. This is
suitable for sequences that share a conserved region or domain, sequences that are similar
along some of their length but dissimilar elsewhere, or sequences that differ in length.
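
The difference between the two types can be illustrated with a small dynamic programming sketch. The function name and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration; they are not taken from any particular tool. The global score follows the Needleman-Wunsch recurrence and the local score follows the Smith-Waterman recurrence.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global_score, local_score) for sequences a and b.

    Scoring values are illustrative assumptions, not tool defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    glob = [[0] * cols for _ in range(rows)]   # global: end gaps are penalised
    loc = [[0] * cols for _ in range(rows)]    # local: scores never drop below 0

    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap

    best_local = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + s,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])

    # Global score is read from the bottom-right cell; local score is the
    # best cell anywhere in the matrix.
    return glob[-1][-1], best_local


if __name__ == "__main__":
    print(alignment_scores("ACGT", "AGT"))
```

The global score must account for every character of both sequences, whereas the local score reports only the best-scoring island of matches.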

The techniques available for alignment include dot matrix analysis, dynamic programming
algorithms and heuristic methods such as BLAST. BLAST (Basic Local Alignment Search Tool)
is one of the most widely used pairwise sequence alignment tools for comparing a query with
database sequences. It was developed by Stephen Altschul and colleagues at NCBI and
published in 1990. There are different BLAST programs for different comparisons, as
shown in Table 1.
Nucleotide BLAST Programs:

BLASTN: The initial search is done for words of length ‘w’ that reach a threshold score ‘T’.
The whole query sequence is divided into overlapping words, 11 letters long for nucleic
acids. A sequence of length L yields at most L - w + 1 words (L = sequence length, w = word
length). The same is done for every sequence in the database, and word matches between the
query and the database sequences are identified. BLASTN searches for somewhat similar
sequences.
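
As a small illustration of the word count L - w + 1, the following sketch splits a sequence into overlapping words. The function name is hypothetical and the example sequence is made up.

```python
def split_into_words(sequence, w=11):
    """Return all overlapping words of length w (empty list if too short)."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


if __name__ == "__main__":
    seq = "ATGCGTACGTTAGCA"          # L = 15
    words = split_into_words(seq)    # w = 11, so 15 - 11 + 1 = 5 words
    print(len(words), words[0])
```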

Mega BLAST: Searches for highly similar sequences.

Discontiguous Mega BLAST: Searches for more dissimilar sequences.

Protein BLAST Programs:

BLASTp: Finds similarity between a query protein sequence and the protein sequences
available in the protein database. Like the other BLAST programs, BLASTp reports local
alignments, so it can detect regions of similarity even between proteins that differ
elsewhere. The BLASTp algorithm parses protein sequences into 3-letter “words”; the same is
done for every sequence in the database, and word matches are identified from the database.

PSI-BLAST: Position-Specific Iterated BLAST is the most sensitive BLAST program. It is
used to find very distantly related proteins or new members of a protein family. The
algorithm builds a position-specific scoring matrix (PSSM, or profile) from the alignment of
the sequences found in each iteration and searches again with that profile. Hits are
reported with E-values, and those below a threshold (default = 0.005) are included in the
next profile. The E-value decreases exponentially with the score that is assigned to a match
between two sequences.

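
The idea of a position-specific scoring matrix can be sketched as follows. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes uniform background frequencies and add-one pseudocounts, omits sequence weighting, and the function name is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Assumptions: equal-length aligned sequences, uniform background
    frequencies, simple additive pseudocounts.
    """
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in alphabet)  # gaps ignored
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background)
            for r in alphabet
        })
    return pssm


if __name__ == "__main__":
    profile = build_pssm(["MKVL", "MKIL", "MRVL"])
    print(round(profile[0]["M"], 2), round(profile[0]["A"], 2))
```

Residues seen often at a position receive positive scores there, and residues never seen receive negative scores, which is what lets a profile find more distant relatives than a single query sequence.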
PHI-BLAST: Pattern-Hit Initiated BLAST is used to find protein sequences that contain a
pattern specified by the user and are also similar to the query sequence. The similarity
requirement reduces the number of hits that contain the pattern but are unlikely to have any
true homology to the query. To run PHI-BLAST, enter the query into the Search box, and
enter the pattern into the PHI pattern box. Only one pattern can be used in one search.

Scoring matrices:

Scoring matrices are used to assign a score to each pair of characters that is compared.
There are different types of scoring matrices:

Identity Matrices: In this type of matrix, every score is either 1 or 0, with the 1's lying
along the diagonal. The scoring scheme is based purely on matches and mismatches.

Unitary scoring matrices: These matrices also have 0's and 1's as their scores. The
difference is that they take into account transitions (changes among purines or among
pyrimidines) and transversions (changes between a purine and a pyrimidine).
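
A minimal sketch of these two ideas for DNA follows. The function names are hypothetical; the identity matrix is stored as a dictionary keyed by character pairs, and the second function only classifies a substitution as a transition or a transversion (how each class is scored is left open here).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def identity_matrix(alphabet="ACGT"):
    """Score 1 on the diagonal (match) and 0 elsewhere (mismatch)."""
    return {(x, y): int(x == y) for x in alphabet for y in alphabet}


def substitution_type(x, y):
    """Classify a pair of nucleotides."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"      # purine<->purine or pyrimidine<->pyrimidine
    return "transversion"        # purine<->pyrimidine


if __name__ == "__main__":
    print(identity_matrix()[("A", "A")], substitution_type("A", "G"))
```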

PAM Matrices: Margaret Dayhoff was the first to develop the PAM matrices. PAM stands
for Point Accepted Mutation. PAM matrices are calculated by observing the differences in
closely related proteins. One PAM unit (PAM1) specifies one accepted point mutation per
100 amino acid residues, i.e. 1% of residues change and 99% remain unchanged.

BLOSUM: BLOcks SUbstitution Matrix, developed by Henikoff and Henikoff in 1992 using
conserved regions (blocks) of related proteins. The number in the name is a percentage
identity threshold used when the matrix is built: BLOSUM62, for example, is derived from
blocks in which sequences more than 62% identical are clustered and counted as one.

Parameters used in BLAST algorithm:

Threshold: A cutoff value used to filter out words during comparison. Word pairs that
score below the threshold are discarded.

True Homology: In BLAST, true homology means that a database sequence is genuinely related
to the query sequence by common ancestry, which is inferred from how similar the two
sequences are.

E-value: The expect value is a parameter that describes the number of hits one can expect
to see just by chance when searching a database of a particular size. It decreases
exponentially with the score that is assigned to an alignment between two sequences.
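
The exponential dependence on score can be sketched with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. The default values of lambda and K below are assumptions used only for illustration (values of this order are commonly quoted for ungapped BLOSUM62 scoring); real values depend on the scoring system, and the function name is hypothetical.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.3176, k=0.134):
    """E = K * m * n * exp(-lambda * S); lam and k are assumed values."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    for score in (30, 40, 50):
        print(score, expected_hits(score, query_len=250, db_len=1_000_000))
```

Each additional 10 score units lowers E by the same factor, and doubling the database size doubles E.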

Word size: The search proceeds by taking words of a fixed size from the query, comparing
them with the database sequences and assigning a score to each comparison. The word size is
11 for nucleic acids and 3 for proteins.

Putative conserved domains: These are regions of the query that resemble known conserved
domains, which often correspond to distinct functions of the protein.

Gap score or gap penalty: Dynamic programming algorithms use gap penalties so that the
alignment remains biologically meaningful. A penalty is subtracted for each gap that is
introduced, that is, wherever the alignment implies an insertion or deletion. There are two
kinds of gap penalty: gap open and gap extension. The open penalty is always applied at the
start of a gap, and each gap position that follows is given a gap extension penalty, which
is smaller than the open penalty. Typical values are –12 for gap opening and –4 for gap
extension.
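
As a worked illustration of the open and extension penalties, the sketch below charges the open penalty for the first position of a gap and the extension penalty for each further position, following the description above. Some programs instead charge the open penalty in addition to an extension for every position, so treat the formula and the hypothetical function name as assumptions.

```python
def affine_gap_penalty(gap_length, gap_open=-12, gap_extend=-4):
    """Penalty for one gap: open for the first position, extend for the rest."""
    if gap_length <= 0:
        return 0
    return gap_open + (gap_length - 1) * gap_extend


if __name__ == "__main__":
    # One gap of length 3 costs less than three separate gaps of length 1.
    print(affine_gap_penalty(3), 3 * affine_gap_penalty(1))
```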

Working of BLAST Algorithm:

• The query sequence is analyzed for low-complexity regions. Low-complexity regions are
regions that contain little information or variation, such as AAAAAAAA or ATATATAT.

• These low-complexity regions are masked with letters such as X or N.

• A list of words of a certain word size is made. Usually the word size is 3 for proteins
and 11 for DNA.

• Scores are calculated for each pair of words (query sequence word and database word) using
substitution scoring matrices (such as PAM or BLOSUM), and only the high-scoring words, i.e.
those above a threshold or cutoff score, are taken for further alignment. The cutoff score
is chosen to reduce the number of matches and so decrease the computation time.

• This scoring and checking are repeated for all the words in the query sequence.

• The remaining high-scoring words are organised into an efficient search tree and rapidly
compared with the database sequences. This is done to find the exact matches.

• If an exact or good match is found, an alignment is extended in both directions from the
position where the match occurred.
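
The word lookup and extension steps above can be sketched in a few lines. This is a toy illustration, not the BLAST implementation: it uses exact word matches only, a dictionary instead of a search tree, ungapped extension, and assumed scoring values (match +1, mismatch -3, drop-off 5). All function names are hypothetical.

```python
def build_word_index(query, w):
    """Map each word of length w in the query to its start positions."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return index


def extend_hit(query, subject, q_pos, s_pos, w, match=1, mismatch=-3, drop_off=5):
    """Extend an exact word hit in both directions without gaps.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def extend(step, q, s):
        score = best = best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > drop_off:
                break            # score fell too far below the best seen
            q += step
            s += step
        return best, best_len

    right_score, right_len = extend(1, q_pos + w, s_pos + w)
    left_score, left_len = extend(-1, q_pos - 1, s_pos - 1)
    total = w * match + left_score + right_score
    return total, q_pos - left_len, q_pos + w + right_len


def find_hsps(query, subject, w=4):
    """Return extended hits sorted by decreasing score."""
    index = build_word_index(query, w)
    hsps = set()
    for s_pos in range(len(subject) - w + 1):
        for q_pos in index.get(subject[s_pos:s_pos + w], []):
            hsps.add(extend_hit(query, subject, q_pos, s_pos, w))
    return sorted(hsps, reverse=True)


if __name__ == "__main__":
    print(find_hsps("ACGTACGT", "TTACGTACGTTT"))
```

Several seed words on the same diagonal extend to the same segment, which is why the results are collected in a set.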

BLAST Procedure

This is the common procedure for any BLAST program.

Step 1: Select the BLAST program or search for DNA or protein sequence and then click
“run” BLAST on the right side panel
Step 2: Enter a query sequence or upload a file containing sequence.
Step 3: Select the database to search.
Step 4: Select the algorithm and the parameters of the algorithm for the search.
Step 5: Run the BLAST program.

Step 1: Select the BLAST program

The user has to specify the type of BLAST program, such as BLASTp, BLASTn, BLASTx,
tBLASTn or tBLASTx.

Fig: BLAST HOME PAGE


Step 2: Enter a query sequence or upload a file containing sequence

Enter a query sequence by pasting the sequence in the query box or by uploading a FASTA
file that contains the sequence for the similarity search. This step is the same for all
BLAST programs. The user can give an accession number, a gi number or a raw FASTA
sequence. Go to simulator tab to know more about how to retrieve query sequence.
Figure 1: Enter a query sequence or upload a file containing sequence

Step 3: Select database to search

The user first has to know which databases are available and what types of sequences they
contain. A sequence similarity search looks for sequences similar to the query sequence in
the selected databases (Figure 2).

Figure 2: Select database to search

Step 4: Select the algorithm and the parameters of the algorithm for the
search

Some BLAST programs offer a choice of algorithms, and the user has to specify which one to
use. Nucleotide BLAST offers MegaBLAST, which searches for highly similar sequences;
discontiguous MegaBLAST, which searches for more dissimilar sequences; and BLASTn, which
searches for somewhat similar sequences. Protein BLAST offers BLASTp, which searches for
similarity between a protein query and a protein database; PSI-BLAST, which performs a
position-specific search iteratively; PHI-BLAST, which searches the database for sequences
that contain a particular pattern (the user has to enter the pattern in the PHI pattern box
provided); and DELTA-BLAST (Domain Enhanced Lookup Time Accelerated BLAST), which uses
alignments of conserved domains to build a profile for detecting protein homology. The
algorithm parameters are Target sequences, Short queries, E-value, Word size, Query range,
scoring parameters (Match/Mismatch scores and Gap penalties) and filters (Filter and Mask).
Default values are provided, but the user can adjust them as needed, as shown in Figure 3.

Figure 3: Algorithm and the parameters

Step 5: Run the BLAST program

The search is submitted by clicking the BLAST button at the end of the page. A screenshot
of the result is shown in Figure 4.

Figure 4: Run the BLAST program

BLAST Result:

After the query sequence is submitted for the similarity search, the result page appears
with information such as the Query id, Description, Molecule type, Length of sequence,
Database name and BLAST program. It also shows the putative conserved domains that were
detected during the sequence similarity search.
The query sequence is represented as a numbered red bar below the color key. Database hits
are shown below the query (red) bar according to their alignment score. Among the aligned
sequences, the most closely related ones are placed nearest to the query sequence. The user
can find more details about these alignments by moving the mouse over each colored bar, as
shown below in Figure 5.
Figure 5: BLAST result

Each alignment is preceded by the sequence identifier, the definition line and the length of
the matched sequence, followed by the score and E-value. This header also reports the number
of identical residues in the alignment (identities), the number of positives and the number
of gaps used in the alignment. Finally it shows the actual alignment, with the query
sequence on top and the database sequence below it. The numbers on either side of the
alignment indicate the positions of the amino acids/nucleotides in each sequence, as shown
in Figure 6.
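
The identities and gaps figures reported for an alignment can be reproduced from the two aligned strings. The sketch below is an illustration with a hypothetical function name; it assumes gaps are written as "-" and does not compute positives, which would require a substitution matrix.

```python
def alignment_summary(aligned_query, aligned_subject):
    """Count identities and gap columns in a pairwise alignment."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    pairs = list(zip(aligned_query, aligned_subject))
    identities = sum(1 for q, s in pairs if q == s and q != "-")
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    length = len(pairs)
    percent = 100.0 * identities / length if length else 0.0
    return {"length": length, "identities": identities,
            "gaps": gaps, "percent_identity": percent}


if __name__ == "__main__":
    print(alignment_summary("ACG-T", "ACGAT"))
```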
MULTIPLE SEQUENCE ALIGNMENT – ClustalX 2.1

A multiple sequence alignment (MSA) is a sequence alignment of three or more biological
sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences
is assumed to have an evolutionary relationship by which the sequences share a lineage and are
descended from a common ancestor. From the resulting MSA, sequence homology can be
inferred and phylogenetic analysis can be conducted to assess the sequences' shared
evolutionary origins. Visual depictions of the alignment
illustrate mutation events such as point mutations (single amino acid or nucleotide changes)
that appear as differing characters in a single alignment column, and insertion or deletion
mutations (gaps) that appear as hyphens in one or more of the sequences in the alignment.
Multiple sequence alignment is often used to assess sequence conservation of protein
domains, tertiary and secondary structures, and even individual amino acids or nucleotides.
The most widely used approach to multiple sequence alignments uses a heuristic search
known as progressive technique (also known as the hierarchical or tree method) developed
by Paulien Hogeweg and Ben Hesper in 1984. Progressive alignment builds up a final MSA
by combining pairwise alignments beginning with the most similar pair and progressing to
the most distantly related. All progressive alignment methods require two stages: a first stage
in which the relationships between the sequences are represented as a tree, called a guide tree,
and a second stage in which the MSA is built by adding the sequences sequentially to the
growing MSA according to the guide tree. The initial guide tree is determined by an
efficient clustering method such as neighbor-joining or UPGMA, and may use distances
based on the number of identical two-letter sub-sequences (as in FASTA) rather than a
dynamic programming alignment.
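
The first stage of progressive alignment can be sketched as follows. This is a simplified illustration, not the Clustal implementation: distances are based on shared two-letter sub-sequences, and clusters are joined by a naive average-linkage procedure in the spirit of UPGMA. The function names and the distance formula are assumptions.

```python
from itertools import combinations


def ktuple_distance(a, b, k=2):
    """1 - (shared distinct k-letter words / smaller word set size)."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not words_a or not words_b:
        return 1.0
    return 1.0 - len(words_a & words_b) / min(len(words_a), len(words_b))


def guide_tree_order(sequences, k=2):
    """Return the list of cluster merges, most similar pair first.

    `sequences` maps a name to its sequence; clusters are tuples of names.
    """
    clusters = [(name,) for name in sequences]
    merges = []

    def cluster_distance(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        total = sum(ktuple_distance(sequences[x], sequences[y], k) for x, y in pairs)
        return total / len(pairs)          # average linkage

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


if __name__ == "__main__":
    seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
    print(guide_tree_order(seqs))
```

The merge order is the order in which a progressive aligner would combine sequences and partial alignments in the second stage.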

CLUSTAL X (2.0 and above)

Clustal X is a graphical (windowed) interface for the Clustal W multiple sequence alignment program. It
provides an integrated environment for performing multiple sequence and profile alignments
and analysing the results. The sequence alignment is displayed in a window on the screen. A
versatile coloring scheme has been incorporated allowing you to highlight conserved features
in the alignment. The pull-down menus at the top of the window allow you to select all the
options required for traditional multiple sequence and profile alignment.

You can cut-and-paste sequences to change the order of the alignment; you can select a subset
of sequences to be aligned; you can select a sub-range of the alignment to be realigned and
inserted back into the original alignment. Alignment quality analysis can be performed and
low-scoring segments or exceptional residues can be highlighted. ClustalX is available on
Linux, Mac and Windows.

SEQUENCE INPUT
Sequences and profiles (a term for pre-existing alignments) are input using the FILE menu.
Invalid options will be disabled. All sequences must be in one file. Seven formats are
automatically recognised: NBRF/PIR, EMBL/SWISSPROT, Pearson (Fasta), Clustal (*.aln),
GCG/MSF (Pileup), GCG9 RSF and GDE flat file. All non-alphabetic characters (spaces,
digits, punctuation marks) are ignored except "-" which is used to indicate a GAP ("." in
MSF/RSF).
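
As an illustration of the input rule above (non-alphabetic characters ignored except "-"), here is a minimal reader for the Pearson (Fasta) format. It is a sketch with a hypothetical function name, not Clustal's own parser, and it handles only this one format.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text.

    Letters and '-' (gap) are kept; spaces, digits and punctuation are dropped.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Fall back to a generated name if the header line is empty.
            name = header[0] if header else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(
                "".join(ch for ch in line if ch.isalpha() or ch == "-"))
    return {n: "".join(parts).upper() for n, parts in records.items()}


if __name__ == "__main__":
    example = ">a first sequence\nAC GT\n12-A\n>b\ntt"
    print(parse_fasta(example))
```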

ALIGNMENT DISPLAY

The alignment is displayed on the screen with the sequence names on the left hand side. The
sequence alignment is for display only, it cannot be edited here (except for changing the
sequence order by cutting-and-pasting on the sequence names).

A ruler is displayed below the sequences, starting at 1 for the first residue position (residue
numbers in the sequence input file are ignored). A line above the alignment is used to mark
strongly conserved positions. Three characters ("*", ":" and ".") are used:

"*" indicates positions which have a single, fully conserved residue.

":" indicates that one of the 'strong' groups of similar residues is fully conserved.

"." indicates that one of the 'weaker' groups of similar residues is fully conserved.

These are all the positively scoring groups that occur in the Gonnet Pam250 matrix. The
strong and weak groups are defined as strong score > 0.5 and weak score <= 0.5 respectively.
For profile alignments, secondary structure and gap penalty masks are displayed above the
sequences, if any data is found in the profile input file.
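
The "*" marker can be reproduced with a short sketch. It covers only fully conserved columns; the ":" and "." markers depend on the residue groups from the Gonnet matrix, which are not reproduced here. The function name is hypothetical.

```python
def fully_conserved_marks(aligned):
    """Return a marker line with '*' under columns holding one identical residue."""
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


if __name__ == "__main__":
    alignment = ["ACG-T", "ACGAT", "TCG-T"]
    for row in alignment:
        print(row)
    print(fully_conserved_marks(alignment))
```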

Colors

Clustal X provides a versatile coloring scheme for the sequence alignment display. The
sequences (or profiles) are colored automatically, when they are loaded. Sequences can be
colored either by assigning a color to specific residues, or on the basis of an alignment
consensus. In the latter case, the alignment consensus is calculated automatically, and the
residues in each column are colored according to the consensus character assigned to that
column. In this way, you can choose to highlight, for example, conserved hydrophilic or
hydrophobic positions in the alignment.

QUALITY SCORES

Clustal X provides an indication of the quality of an alignment by plotting a 'conservation
score' for each column of the alignment. A high score indicates a well-conserved column; a
low score indicates low conservation. The quality curve is drawn below the alignment.
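
A per-column score of this kind can be sketched very simply. The measure below, the fraction of sequences that carry the most common residue in a column, is an assumption chosen for illustration and is not the formula Clustal X uses; the function name is hypothetical.

```python
from collections import Counter


def column_conservation(aligned):
    """Return one score in [0, 1] per column; gaps count against the score."""
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


if __name__ == "__main__":
    print(column_conservation(["ACG-T", "ACGAT", "TCG-T"]))
```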

The main applications of multiple sequence alignment are:


1- Structure Prediction: a multiple sequence alignment can greatly improve the prediction of
protein or RNA secondary structure, and sometimes it even helps with the 3D structure.
2- Protein Family: a multiple sequence alignment can help you decide whether or not your
protein is a member of a known protein family.
3- Pattern Identification: by looking at conserved regions or sites, you can identify which
region is responsible for a functional site.
4- Domain Identification: from a multiple sequence alignment you can extract profiles and
use them to search databases.
5- DNA Regulatory Elements: you can use multiple sequence alignments to locate DNA
regulatory elements such as binding sites.
6- Phylogenetic Analysis: by carefully picking related sequences, you can reconstruct a tree
from the sequences that you used in the multiple sequence alignment.

STEPS FOR MULTIPLE SEQUENCE ALIGNMENT


1. Open ClustalX 2.1.

2. Click ‘file’ and select ‘load sequence’.

3. Upload the Notepad file containing the FASTA sequences.

4. After loading, the Clustal window will look like this.

5. Then click ‘alignment’ and select ‘do complete alignment’.

6. A window will appear. Note down the location where it saves the output and click ‘ok’.

7. You will get a result like this. Take a screenshot of it and paste it into Word.

8. Now open the ‘aln’ file and trim it so that the starting point and end point are uniform
for all the sequences.

9. Then save this file and upload the ‘aln’ file again in ClustalX 2.1. Click ‘alignment’ and
select ‘do complete alignment’.

10. You will see a result like this. Take a screenshot and paste it into Word.
