Multiple Sequence Alignment Tools: Tutorials and Comparative Analysis
On,
Submitted to,
By,
Step 2: On the MultAlin home page you will see a large rectangle. Paste your sequences into it. Alternatively, give the name of your sequence file or select it with the Browse button.
Step 3: Set the parameters. Use the pop-up menus, or type in text or numbers where required. When you are ready, click the "submit data" button (you can use the button at either the top or the bottom of the page).
Step 4: The result is sent back to your internet browser as a GIF image (default), plain text, or a coloured HTML page.
Sample Output:
Figure : Alignment
Pros & Cons of MultAlin:
Pros:
1. It offers 7 scoring matrices and also accepts a user-defined matrix.
2. The output can be viewed as a text page, an HTML page or a GIF image, with colour indications, the alignment and a tree description.
3. It lets users specify their own gap creation and gap extension values rather than choosing a pre-defined value from a drop-down.
Cons:
1. It does not offer the PAM matrices.
2. It forces the use of a cost matrix, and there are minimal options for selecting the matrix.
3. It also forces the gap open and gap extension penalties.
4. It works better for short sequences than for longer ones.
User Manual for ClustalW
Step 1: Navigate to the site http://www.ebi.ac.uk/clustalw/
Step 2: From here, we can either use the tool hosted online or download and install it for local use.
Step 3: Either upload your protein sequences or paste them into the given text box, then hit the Run button to view the aligned output with default settings.
Step 4: Use the various output files to analyse the output.
Sample Outputs
Regions with high conservation amongst the five sample input sequences of Keratin protein
Phylogenetic analysis by means of two trees:
- Cladogram: a branching diagram (tree), taken as an estimate of a phylogeny, in which all branches are of equal length. Cladograms therefore show common ancestry but do not indicate the amount of evolutionary "time" separating taxa.
- Phylogram: a branching diagram (tree), taken as an estimate of a phylogeny, in which branch lengths are proportional to the amount of inferred evolutionary change.
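The difference between the two trees can be illustrated with the Newick text format, in which a branch length is written after a colon. The sketch below is only an illustration: the tree, its taxon names and its branch lengths are hypothetical and are not taken from the sample outputs. Removing the branch lengths from a phylogram leaves only the branching order, which is all a cladogram conveys.

```python
import re

def to_cladogram(newick):
    """Drop ':<length>' annotations so that only the branching order remains."""
    return re.sub(r":[0-9.eE+-]+", "", newick)

# Hypothetical phylogram: branch lengths stand for inferred evolutionary change.
phylogram = "((mouse:0.05,rat:0.04):0.10,human:0.20);"

# The same topology without lengths, as a cladogram would present it.
cladogram = to_cladogram(phylogram)
print(cladogram)
```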
Pros:
1. ClustalW has a simple, clean interface with various options that can be adjusted.
2. It offers a full or a fast alignment, so we can choose according to the trade-off between speed and accuracy.
3. It is the most commonly used algorithm and hence widely accepted.
4. It has self-explanatory Help and FAQ documentation.
Cons:
1. It does not provide a choice among the various BLOSUM and PAM matrices.
2. It could be extended with additional functionality.
3. It does not allow us to specify gap open or gap extension penalties.
includes an advanced and integrated suite of tools designed to support a wide variety of data formats. Geneious provides a long list of searchable databases, such as the Entrez Genome database, the PubMed database and the PopSet database. Its alignment is based on a time-efficient heuristic method called progressive alignment.
Progressive alignment algorithm
The most popular and time-efficient method of multiple sequence alignment is progressive pairwise alignment. The idea is simple. At each step, a pairwise alignment is performed. In the first step, two sequences are selected and aligned. The resulting pairwise alignment is added to the pool and the two sequences are removed from it. In each subsequent step, one of three things can happen:
- Another pair of sequences is aligned.
- A sequence is aligned with one of the intermediate alignments.
- A pair of intermediate alignments is aligned.
This process is repeated until a single alignment containing all of the sequences remains.
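The procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by Geneious or ClustalW: the scores (MATCH, MISMATCH, GAP), the linear gap penalty and the merge order (simply the input order, with no guide tree) are all assumptions made for the example. Every item in the pool is treated as an alignment, so a single sequence is just an alignment with one row, and one merge routine covers all three cases listed above.

```python
from itertools import product

MATCH, MISMATCH, GAP = 1, -1, -2  # illustrative scores, not any tool's defaults

def column_score(col_a, col_b):
    """Average pairwise score between two alignment columns."""
    total = 0
    for x, y in product(col_a, col_b):
        if x == "-" and y == "-":
            continue                      # gap against gap costs nothing
        if x == "-" or y == "-":
            total += GAP
        else:
            total += MATCH if x == y else MISMATCH
    return total / (len(col_a) * len(col_b))

def align_profiles(a, b):
    """Globally align two alignments (lists of equal-length strings)."""
    cols_a, cols_b = list(zip(*a)), list(zip(*b))
    n, m = len(cols_a), len(cols_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + GAP,
                score[i][j - 1] + GAP,
            )
    # Trace back from the bottom-right corner, rebuilding merged columns.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1])
        ):
            out_a.append(cols_a[i - 1]); out_b.append(cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            out_a.append(cols_a[i - 1]); out_b.append(("-",) * len(b))
            i -= 1
        else:
            out_a.append(("-",) * len(a)); out_b.append(cols_b[j - 1])
            j -= 1
    out_a.reverse(); out_b.reverse()
    rows_a = ["".join(col[k] for col in out_a) for k in range(len(a))]
    rows_b = ["".join(col[k] for col in out_b) for k in range(len(b))]
    return rows_a + rows_b

def progressive_align(sequences):
    """Merge the pool pairwise until a single alignment remains."""
    pool = [[s] for s in sequences]       # each sequence starts as a 1-row alignment
    while len(pool) > 1:
        first, second = pool.pop(0), pool.pop(0)
        pool.insert(0, align_profiles(first, second))
    return pool[0]

# Hypothetical toy input, not the keratin sequences.
for row in progressive_align(["GATTACA", "GATACA", "GTTACA"]):
    print(row)
```

Real tools usually choose which items to merge next from a guide tree built on pairwise distances; the fixed input order here keeps the sketch short.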
User Manual
1. Go to www.geneious.com.
2. Select Download Geneious Pro 2.0.10, at the lower right of the page.
3. After downloading, open the tool and enter the sequences. You can do this in two ways: create new sequences, which is useful if you want to store them locally within the tool, or import a file directly.
4. This is how the tool shows all the sequences:
5. We can now align the sequences by clicking the Alignment option. This is the screen for choosing the alignment options:
6. The tool lets us define the gap open penalty and the gap extension penalty; the default values are 12.0 and 3.0 respectively. The alignment type can be set to either Geneious Alignment or ClustalW. A cost matrix can be selected (the default is Blosum 62), and the number of refinement iterations can be set to any value from 1 upwards.
7. The newly aligned sequences can be viewed by opening the alignment.
8. A tree can be generated by enabling the build tree option. This is a screenshot of the tree obtained:
Pros:
1. It has a friendly UI with a detailed tutorial and is much easier to navigate than the existing applications.
2. It allows an alignment to be refined after it is done: it removes the sequences one at a time and re-aligns each removed sequence to a profile of the remaining sequences.
3. It has a unique feature of building a tree via the alignment, which speeds up the whole process of building a tree.
4. It also offers alignment using ClustalW, which is the most widely used algorithm.
5. New plugins can be written and incorporated with relative ease.
6. It provides numerous alignment options and a folder structure in which you can store your own local data, avoiding repeated imports every time you open the application.
7. It has a direct connection to databases.
Cons:
1. It uses a probabilistic approach to aligning the sequences rather than an optimal algorithm, as the tool does not analyze the sequences structurally and evolutionarily.
2. It does not let us define the k-tuple (word size).
3. It does not report the time taken to perform an alignment, which would be useful for analyzing complexity.
SEQUENCES USED
We searched NCBI for keratin protein sequences and selected 5 sequences that are largely similar to one another. The sequences are listed below in FASTA format:
>Homo sapiens
vtlartdlem qieglkeela ylrknheeem lalrgqtggd vnvemdaapg vdlsrilnem rdqyeqmaek nrrdaetwfl skteelnkev asnselvqss rsevtelrrv lqgleielqs qlstkaslen sleetkgryc mqlsqiqgli gsveeqlaql rcemeqqsqe yqilldvktr leheiatyrr llxgedahls sqqasgqsys srevftssss sssrqtrpil keqssssfsq gqss
>Mus musculus
matcsrqfts sssmkgscgi gggssrmssi laggscraps tcggmsvtss rfssggvcgi gggyggsfss ssfggglgsg fggrfdgfgg gfgaglgggl gggigdgllv gsekvtmqnl ndrlatyldk vraleeanrd levkirdwyq rqrpteikdy spyfktiedl kskiiiatqe naqftlqidn arlaaddfrt kyenelflrq svegdinglr kvldeltlsr adlemqienl reelaflkkn heeemlalrg qtggdvnvem daapgvdlsr ilnemrdqye qmaeknrrdv eawflrktee lnkevasnsd liqsnrseva elrrvfqgle ielqsqlsmk aslensleet kgrycmqlsq iqglissvee qlaqlrceme qqsqeynill dvktrleqei atyrrlldge nihsssqhss gqssgqsyss revfssssrq prsilkeqgs tsfsqsqsqs srd
>Rattus
matcsrqfts sssmksscgi gggssrmssv laggscraps tyggmsvtss rfssgaacgi gggysggfss ssfgggfggg lgggfggglg ggfggglgdg llvgsekvtm qnlndrlaty ldkvraleea nsdlevkird wyqrqrptei kdytpffrti edlqskivra kqenaqsvlq idnarlaand frtkydnets lrqlvesdin nlrrvldelt msradlemqi eslreelayl kknheeemla lrgqtggdvn vemdaapgvd lsrilnemrd qyeqmaeknr rdveawfqsk teelnqevas nheliqsgrs evselrrvfq gleielqsql smkaslensl eetkgrycvq lsqiqgligs leeqlaqlrc emeqqsqeyn illdvktrle qeiatyrrll dgenvhssss qhssgqsyss gevfssssrq prsilkeqgs tsfsqsqsqs sgy
>Bos taurus
mtttirhfss gsikgssgla ggssrscrvs gslgggscrl gsagglgsgl ggssysscys fgsgggygsg fggvdgllvg gekatmqnln drlasyldkv raleeantel elkirdwyqk qapgpapdys syfktiedlr nkihtatvdn anlllqidna rlaaddfrtk feteqalrvs veadinglrr vldeltlara dlemqienlk eelaylrknh eeemkalrgq vggeinvemd aapgvdlsri lnemrdqyek maeknrkdae dwffskteel nrevatnsel vqsgkseise lrrtlqalei elqsqlsmka slegslaete nrycmqlsqi qgligsveeq laqlrcemeq qnqeykilld vktrleqeia tyrrlleged ahltqyktke pvttrqvrti veevqdgrvi ssreqvhqts h
>Canis lupus familiaris
maatttsirq fstsgsvkgl cgpgggfspm ssvrvggacr apsllgvgsc gtmsvtssrf saglgggygg gytcslgggf gsgfgsgfga gfgvgfgsgf sssdallggs eketmqnlnd rlasyldkvr aleeanadle vkihdwykkq gpgpardysh ffktieelrn kilaatidna slvlqidnar laaddfrtky etelnlrmsv eadtnglrrv ldeltlarad lemqieslke elaylkknhe eemnalrgqv ggdvsvemda apgvdlsril nemrdqyekm aeknrkdaed wffskteeln revatnteal qssrteitel rrsvqnleie lqsqlsmkas legslaetea rygaqlaqlq glissieqql gelrcdmerq nqeyqvlldv ktrleqeiat yrrllegeda hlatqysssl isqptreatv ttrqvrtime evqdgkvvss rqvhrsth
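Because the sequences above are written in space-separated blocks of ten residues, they need light cleaning before they can be compared programmatically. The following is a minimal sketch of a FASTA reader, assuming each record starts with a ">" header line; the function name parse_fasta is hypothetical, and the two sample records are truncated to their first blocks purely for illustration.

```python
def parse_fasta(text):
    """Return {header: sequence}; whitespace inside sequences is dropped."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Join the ten-residue blocks into one continuous string.
            records[header] += "".join(line.split())
    return records

# Truncated sample: only the first blocks of two of the records above.
sample = """>homo sapien
vtlartdlem qieglkeela
>Rattus
matcsrqfts sssmksscgi
"""

for name, seq in parse_fasta(sample).items():
    print(name, len(seq), seq)
```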
Comparative Analysis:
In this section we compare all the tools discussed above by varying the parameters passed to the sequence alignment. The observed results are given below. We used the 5 keratin sequences listed above (human, mouse, rat, cattle (Bos taurus) and dog) to study these tools.
CLUSTALW
1. Impact of gap penalties
Gap Open   25      10      5       2       1
Gap Ext    10      5       2.5     1       0.5
Score      14452   15522   15908   16329   16658
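To make the two penalty parameters concrete, the sketch below totals the gap penalty of one fixed toy alignment under the five settings in the table. It rests on assumptions: the alignment is hypothetical, and the convention used (a run of k gap characters costs gap_open + gap_ext * k) is one common form of affine penalty, not necessarily the exact formula ClustalW applies. In a real run the aligner would also change where it places gaps as the penalties change.

```python
import re

def gap_penalty(rows, gap_open, gap_ext):
    """Total affine gap penalty over all rows of an alignment.

    Assumed convention: a run of k gap characters costs gap_open + gap_ext * k.
    """
    total = 0.0
    for row in rows:
        for run in re.findall(r"-+", row):
            total += gap_open + gap_ext * len(run)
    return total

# Hypothetical toy alignment with one two-residue gap run in each row.
toy = ["GATTACA--GA", "GA--ACATTGA"]

# (gap open, gap extension) pairs taken from the table above.
settings = [(25, 10), (10, 5), (5, 2.5), (2, 1), (1, 0.5)]
penalties = [gap_penalty(toy, o, e) for o, e in settings]

for (o, e), p in zip(settings, penalties):
    print(f"open={o} ext={e} total penalty={p}")
```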
Interpretation: As we decrease the penalty parameters, the alignment score increases. With higher penalties every gap that is created or extended costs more, so fewer gaps are introduced, fewer residues can be brought into matching columns, and the alignment score is lower.
2. Alignment scores when varying the number of sequences for ClustalW
No. of sequences   2      3      4       5
Score              2450   5922   11249   15522
Figure : Alignment Score
Interpretation: As the number of sequences increases, the complexity of the problem also increases. The plot above shows that the alignment score rises as sequences are added. This agrees with the fact that a larger set of sequences offers more pairs in which matched or aligned residues can be found.
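One way to put the four scores on a common footing is to divide each by the number of sequence pairs, n(n-1)/2. This is only a sketch: it assumes the reported score accumulates over pairs of sequences, which the ClustalW output shown here does not confirm. The scores are the ones from the table above.

```python
# Scores from the table above, keyed by the number of sequences aligned.
scores = {2: 2450, 3: 5922, 4: 11249, 5: 15522}

def pair_count(n):
    """Number of distinct sequence pairs among n sequences."""
    return n * (n - 1) // 2

# Assumption: the total score is a sum over sequence pairs.
per_pair = {n: s / pair_count(n) for n, s in scores.items()}

for n in sorted(per_pair):
    print(n, pair_count(n), round(per_pair[n], 1))
```

Under that assumption the score per pair falls as sequences are added, even though the total rises.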
MULTALIN
1. Observations made with MultAlin
Figure : MSF (two plots)
GENEIOUS
Experiment
We performed a global alignment of 3, 4 and 5 sequences, using Blosum 62 as the cost matrix, 12 as the gap open penalty and 3 as the gap extension penalty, and compared the similarities.
- The set of 3 sequences consists of the keratin sequences of mouse, rat and cattle (Bos taurus).
- The set of 4 sequences consists of the keratin sequences of dog, mouse, rat and cattle.
- The set of 5 sequences consists of the keratin sequences of mouse, rat, cattle, dog and human.
These were our observations:
No. of sequences    3    4    5
% of similarity     72   69   58
Gap open penalty    12   12   12
Gap extn penalty    3    3    3
Inferences made:
1. As the number of sequences increased, the similarity score gradually decreased.
2. Mouse, rat and cattle (Bos taurus) gave the best alignment. The human sequence was the worst match with the other four.
Impact of cost matrix: Geneious offers a wide range of cost matrices, so we analyzed how the similarity changes with the cost matrix for all 5 sequences. These were our observations (Blosum):
Cost matrix   Gap open penalty   Gap extension penalty   Similarity %   Identical sites
Blosum 45     12                 3                       58             150
Blosum 50     12                 3                       58             150
Blosum 55     12                 3                       58             150
Blosum 60     12                 3                       58             150
Blosum 65     12                 3                       58             150
Blosum 70     12                 3                       58             150
Blosum 75     12                 3                       58             150
Blosum 80     12                 3                       58             151
Blosum 85     12                 3                       58             150
Blosum 90     12                 3                       58             150
Inferences
1. The similarity % did not change with the cost matrix. Only the number of identical sites varied, and in one case only: it rose from 150 to 151 for Blosum 80 and returned to 150 for Blosum 85.
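The "Identical Sites" figure can be understood as a count of alignment columns in which every sequence carries the same residue. The sketch below shows that count on a hypothetical toy alignment. It assumes that columns containing a gap are never counted as identical; how Geneious itself treats gaps, and how it derives its "Similarity %", is not stated in this report.

```python
def identical_sites(rows):
    """Count columns in which every row has the same non-gap residue."""
    return sum(
        1 for col in zip(*rows)
        if "-" not in col and len(set(col)) == 1
    )

# Hypothetical toy alignment, not the keratin alignment.
toy = ["GATTACA", "GA-TACA", "GATTTCA"]

count = identical_sites(toy)
share = 100.0 * count / len(toy[0])
print(count, "identical sites out of", len(toy[0]), "columns")
```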