Lecture 4

The document outlines a course on Multiple Sequence Alignment provided by PhD Tam Tran at USTH, covering the importance, methods, and interpretation of multiple sequence alignments in bioinformatics. Key methods discussed include ClustalW and MUSCLE, along with exercises for practical application. The course also emphasizes the significance of alignments in structure prediction, genetic variation analysis, and understanding evolutionary relationships.

Bioinformatics

-----
Multiple Sequence Alignment
Course Provider: PhD Tam Tran
Department of Life Sciences (LS) – USTH
Email: [email protected]

Master in Medical biotechnology - Plant biotechnology – Pharmacology – Year 2


Outline

• Introduction
• Applications: Why do we do multiple sequence alignments?
• Multiple-sequence-alignment methods
  - ClustalW
  - MUSCLE

• Interpreting Multiple Sequence Alignment Results

From pairwise to multiple alignment

Pairwise (for 2 sequences):
F G K - G K G
F G K F G K G

MSA (for more than 2 sequences):
F G K - G K G
F G K F G K G
- G K Q G K G
- - K F G K G

Alignments help to analyze sequence data: organize and visualize.
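
The two-sequence case can be sketched with a small global alignment routine. This is a minimal Needleman-Wunsch illustration, not the method of any tool named in this lecture; the scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment with a toy scoring scheme (assumed values)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, best = needleman_wunsch("FGKGKG", "FGKFGKG")
print(aligned_a)
print(aligned_b)
print("score:", best)
```

Real tools score amino acid pairs with a substitution matrix and use separate gap-opening and gap-extension penalties; the flat scores here only keep the sketch short.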

Why do we do multiple sequence alignments?

1. Structure prediction (RNA, protein)


Target:      VTISCTGSSSNIGAG-NHVKWYQQLPG
Homologue 1: VTISCTGSSSNIGS--ITVNWYQQLPG
Homologue 2: LRLSCTGSGFIFSS--YAMYWYQQAPG
…            …
Homologue n: LSLTCTGSGTSFDD-QYYSTWYQQPPG

2. Genetic variations (Sequence similarity)


Analysis of residue substitutions: mutation or polymorphism?

Why do we do multiple sequence alignments?

3. Learn about evolutionary relationships


• Two sequences from different organisms are similar → they may have a
common ancestor.
• Needed for construction of phylogenetic trees

Multiple Alignment Methods

ClustalW
• ClustalW = multiple alignment tool
• The most commonly used program for making multiple sequence alignments
• 'W' stands for 'weighted' (sequences are weighted differently).

BLAST: pairwise comparison
Clustal: global alignment, "all against all"

ClustalW - Progressive Alignment

Pairwise distance matrix (the guide tree drawn next to it, with leaves A, B, C, D, is not reproduced):

     A   B   C   D
A    -   -   -   -
B    1   -   -   -
C    7   8   -   -
D   11   5   2   -

See Thompson et al. (1994) for an explanation of the three
stages of progressive alignment implemented in ClustalW
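
The step from a pairwise distance matrix to a guide tree can be sketched as follows. This is a hedged illustration only: it uses UPGMA (average-linkage clustering) because it is short, whereas ClustalW itself builds its guide tree by neighbor-joining. The function name and the dictionary layout are assumptions; the distances are the ones from the matrix on this slide.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree (as a nested string) by average-linkage clustering.

    dist maps frozenset({x, y}) to the distance between x and y.
    """
    clusters = {name: 1 for name in names}   # cluster label -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = f"({x},{y})"
        size_x, size_y = clusters.pop(x), clusters.pop(y)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_x * d[frozenset((x, other))]
                + size_y * d[frozenset((y, other))]
            ) / (size_x + size_y)
        clusters[merged] = size_x + size_y
    return next(iter(clusters))


# Distances taken from the slide's matrix.
slide_distances = {
    frozenset("AB"): 1, frozenset("AC"): 7, frozenset("BC"): 8,
    frozenset("AD"): 11, frozenset("BD"): 5, frozenset("CD"): 2,
}
print(upgma(["A", "B", "C", "D"], slide_distances))
```

In a progressive alignment the tree then fixes the order of work: the closest sequences are aligned first, and the resulting groups are aligned to each other going up the tree.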
MUSCLE

• MUSCLE = MUltiple Sequence Comparison by Log-Expectation
• A more recent and widely used MSA program
• Considered one of the most accurate MSA programs when it was released
• The basic idea: iterative progressive alignment

Edgar, R.C., 2004
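
An iterative method needs an objective function to decide whether a re-alignment is better than the previous one. A common choice is the sum-of-pairs score, sketched below with toy match/mismatch/gap values. The scoring values and function names are assumptions for illustration, and this is not MUSCLE's actual scoring code; the example alignment is the four-sequence MSA from the introduction, with the gap in the first row written as "-".

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-1):
    """Toy score for one pair of aligned symbols (assumed values)."""
    if x == "-" and y == "-":
        return 0            # two gaps facing each other are ignored
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment):
    """Sum the pair scores over every column and every pair of rows."""
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )


msa = [
    "FGK-GKG",
    "FGKFGKG",
    "-GKQGKG",
    "--KFGKG",
]
print(sum_of_pairs(msa))
```

A refinement loop would split the alignment into two groups, re-align them, and keep the result only if this score goes up.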


Finding your favorite alignment method

Multiple Sequence Alignment Resources over the Internet


Method    Description                                                          Address
ClustalW  The most commonly used program                                       https://www.genome.jp/tools-bin/clustalw
ClustalO  The latest addition to the Clustal family                            https://www.ebi.ac.uk/Tools/msa/clustalo/
MUSCLE    A fast and accurate sequence cruncher                                www.drive5.com/muscle/ (download)
                                                                               https://www.ebi.ac.uk/Tools/msa/muscle/
Tcoffee   Accurate combination of sequences and structures                     www.tcoffee.org
                                                                               https://www.ebi.ac.uk/Tools/msa/tcoffee/
Probcons  A Bayesian version of Tcoffee                                        probcons.stanford.edu/
Kalign    A fast sequence aligner                                              https://msa.sbc.su.se/cgi-bin/msa.cgi
MAFFT     A fast and accurate sequence cruncher using Fast Fourier Transforms  https://mafft.cbrc.jp/alignment/server/
Dialign   Ideal for sequences with local homology                              https://bibiserv.cebitec.uni-bielefeld.de/dialign/

EXERCISE BREAK
Exercise 1: Perform multiple sequence alignments

1. Download amino acid sequences for accession numbers P20472, P80079, P02626,
P02619, P43305, P32930, Q91482, P02620, P02622, P02586 from the NCBI protein
database.

2. Perform multiple sequence alignments with ClustalW/ClustalO and MUSCLE

3. What are the differences between the two outputs?

4. What do you think the symbols "*", ":" and "." under the alignment mean?

5. How many stretches of perfectly conserved sequence (of at least, say, 10 amino
acids) can you find? Write down the sequence(s) of the perfectly conserved
stretch(es).
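
For question 5, the search for perfectly conserved stretches can also be done programmatically once the alignment is available as equal-length strings. The sketch below is one possible approach; the function name and the toy two-sequence input are hypothetical and are not the parvalbumin sequences of the exercise.

```python
def conserved_stretches(alignment, min_len=10):
    """Return (start, sequence) for each run of identical, gap-free columns.

    start is a 0-based column index into the alignment.
    """
    conserved = [
        len(set(column)) == 1 and column[0] != "-"
        for column in zip(*alignment)
    ]
    stretches = []
    start = None
    # A trailing False closes a run that reaches the end of the alignment.
    for i, ok in enumerate(conserved + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                stretches.append((start, alignment[0][start:i]))
            start = None
    return stretches


# Hypothetical toy alignment, only to show the call.
toy = [
    "AAKLMNPQRSTVWYG",
    "CAKLMNPQRSTVWYA",
]
print(conserved_stretches(toy, min_len=10))
```

Lowering min_len is a quick way to see how the number of conserved blocks changes with the threshold.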

Interpreting Multiple Sequence Alignment

* = perfectly conserved residue


: = conservative substitution
. = semi-conservative substitution
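
The three symbols can be reproduced per column with a small function. The residue groups below are the "strong" and "weak" groups as recalled from the Clustal documentation; treat them as an assumption and check them against the version of Clustal you use. The function names are hypothetical.

```python
# Assumed Clustal "strong" groups (":") and "weak" groups (".").
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "                      # columns with gaps get no symbol
    if len(residues) == 1:
        return "*"                      # perfectly conserved
    if any(residues <= set(group) for group in STRONG):
        return ":"                      # conservative substitution
    if any(residues <= set(group) for group in WEAK):
        return "."                      # semi-conservative substitution
    return " "


def conservation_line(alignment):
    """Build the symbol line shown under a Clustal-style alignment."""
    return "".join(column_symbol(column) for column in zip(*alignment))


print(repr(conservation_line(["ASCA-", "ATSWA"])))
```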

How can you tell whether a block is good?

Functionally important amino acids (or nucleotides) are under strong selective constraint, so they rarely change

active site

Editing and Analyzing Multiple Sequence Alignments

Bioedit: https://bioedit.software.informer.com/7.2/

Jalview: https://www.jalview.org/

Mega: https://www.megasoftware.net/

Sequence logo

IGF1B_HUMAN APQTGIVDECCFRSCDLRRLEMYCAPLKPAKSAR
IGF1_PIG APQTGIVDECCFRSCDLRRLEMYCAPLKPAKSAR
IGF1_CANFA APQTGIVDECCFRSCDLRRLEMYCAPLKPAKSAR
IGF2_HORSE -RSRGIVEECCFRSCDLALLETYCATPAKSERDV
INS_CHIBR -----IVDQCCTSICTLYQLENYCN---------
INS_ORNAN -----IVEECCKGVCSMYQLENYCN---------
INS_AOTTR MQKRGVVDQCCTSICSLYQLQNYCN---------

• A sequence logo shows the frequency of each amino acid at each position
• The height of each letter stack represents the information content of that position
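
The two quantities behind a logo, residue frequency and information content, can be computed per column as sketched below. This is a simplified illustration: it assumes a 20-letter protein alphabet, ignores gaps, and omits the small-sample correction that logo tools usually apply. The function name is hypothetical; the sequences are the IGF/insulin alignment shown above.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Return (information in bits, {residue: letter height}) for one column."""
    residues = [r for r in column if r != "-"]   # gaps are ignored here
    if not residues:
        return 0.0, {}
    counts = Counter(residues)
    total = len(residues)
    freqs = {r: c / total for r, c in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    info = math.log2(alphabet_size) - entropy    # maximum is log2(20) bits
    # Each letter's height is its frequency times the column's information.
    heights = {r: f * info for r, f in freqs.items()}
    return info, heights


logo_alignment = [
    "APQTGIVDECCFRSCDLRRLEMYCAPLKPAKSAR",
    "APQTGIVDECCFRSCDLRRLEMYCAPLKPAKSAR",
    "APQTGIVDECCFRSCDLRRLEMYCAPLKPAKSAR",
    "-RSRGIVEECCFRSCDLALLETYCATPAKSERDV",
    "-----IVDQCCTSICTLYQLENYCN---------",
    "-----IVEECCKGVCSMYQLENYCN---------",
    "MQKRGVVDQCCTSICSLYQLQNYCN---------",
]
for position, column in enumerate(zip(*logo_alignment), start=1):
    info, heights = column_information(column)
    print(position, round(info, 2), heights)
```

Perfectly conserved columns, such as the cysteines, reach the maximum of log2(20) bits, which is why they appear as the tallest letters in the logo.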
EXERCISE BREAK
Exercise 2: Visualize consensus sequences with Sequence Logos

1. Identification of a consensus block in Exercise 1

2. Build their sequence logo

3. Interpreting a Sequence Logo

Tool (WebLogo): https://weblogo.berkeley.edu/logo.cgi


- Input: Multiple Sequence Alignment
- Output: Sequence logo

HOMEWORK - DAY 4
1. Get 10 homologous proteins with identity < 80% from the BLAST result in Homework Day 3
- Provide the list of protein sequences with their names and E-values, but NOT the full FASTA
format

2. Perform multiple alignment with your favorite alignment method


a. Overview and quality of the multiple alignment
- Can you confirm that these sequences are really homologs? Similar lengths?
- How many identical positions? How many positions with conservative substitutions? How many indels
(deletions and insertions)?
b. Identification of conserved blocks
- Identify conserved blocks and visualize consensus sequences with Sequence Logos
- What can you observe? What does it mean from a biological point of view?
- Bonus: Can you make a link between your observations and what we know about the protein?
(Are there any conserved amino acids that are known as active sites for this protein family? If
yes, position in alignment, function, activity?)
c. Write a conclusion

DEADLINE: 10am Thursday 06 May 2021


END
