15GN402L Final Bioinformatics Lab Manual
RECORD MANUAL
NAME :
REGISTER NUMBER :
BRANCH :
YEAR & SEMESTER :
ACADEMIC YEAR :
SCHOOL OF BIOENGINEERING
BONAFIDE CERTIFICATE
lab at the Dept. of Genetic Engineering, School of Bioengineering, SRM IST, Kattankulathur
– 603203.
Examiner 1 Examiner 2
Name : Name:
Signature : Signature:
Date : Date:
CONTENTS
Expt. No.   Date   Name of the Experiment   Date of Submission   Teacher’s Signature   Marks (20)
1 Biological Databases
3 Sequence Manipulation
5 BLAST
Total Marks
Experiment No: Date:
Biological Databases
Aim:
To retrieve data from various DNA and protein sequence databases and understand
the importance of these databases.
Introduction:
A database is an organized collection of data that models some part of reality (a
domain). The term can refer both to the data and to the way that data is organized.
Biological databases emerged in response to the huge volume of data generated by
low-cost DNA sequencing technologies. The data stored in biological databases is organized
for efficient analysis and is of two types: raw and curated (or annotated).
Data is submitted directly to biological databases for indexing, organization, and
optimization. These databases help researchers find relevant biological data by making it
available in a computer-readable format, and data mining tools make this information
readily accessible, saving time and resources.
Biological databases can be broadly classified as sequence and structure databases. Structure
databases are for protein structures, while sequence databases are for nucleic acid and protein
sequences. The data repositories more relevant to the biological sciences include: nucleotide
and protein sequences, genomes, bibliography, genetic expression and protein structures.
A sequence database is a collection of DNA or protein sequences with some extra relevant
information. The main nucleotide sequence databases are GenBank, DDBJ and EMBL. Originally,
they were just sequence collections, but they have grown into sets of heavily interconnected
biological databases with powerful interfaces for searching and browsing the stored
information. Biological databases can be further classified as primary, secondary, and
composite databases.
Primary databases are archival in nature. They consist of experimentally derived data such
as nucleotide sequence, protein sequence or macromolecular structure. Experimental results
are submitted directly into the database by researchers. Examples of primary biological
databases include:
Swiss-Prot and PIR for protein sequences
EMBL and DDBJ for nucleotide sequences
Protein Databank (PDB) for protein structures
Composite databases aim to amalgamate the information held in two or more of the primary
databases. This means that you can search one composite database rather than do multiple
searches on individual primary databases e.g.
OWL – is a composite of SWISS-PROT, PIR1-4, GenPept and NRL-3D.
NRDB (Non-Redundant DataBase) – is a composite of SWISS-PROT+TrEMBL.
Structure databases such as PDB (Protein Data Bank) hold the atomic co-ordinate data
for proteins whose structure has been determined by X-ray crystallography and/or NMR. In
addition, the MMDB (Molecular Modeling Database) at NCBI is a compilation of the PDB
entries as ASN.1 files. Other databases such as SCOP (Structural Classification of Proteins)
and CATH (Class, Architecture, Topology, Homology) hold information on the structural
relationships of proteins and their structural domains.
1. https://libraries.wm.edu/databases/by-subject/7
2. https://ansit.wordpress.com/2007/02/06/biological-databases/
3. https://www.ncbi.nlm.nih.gov/
Practice Questions:
1. Go to the “Gene” entry for Homo sapiens PTGS2 and find how many databases are
available.
5. What is the RefSeq accession number for the mRNA sequence of Homo sapiens
prostaglandin-endoperoxide synthase 2? __________________.
6. Open the entry, then choose “FASTA” from the pull-down menu. Copy the sequence
(including the title line designated by the “>” symbol) and paste it into a Word document.
7. Select the “Replace” tool under the EDIT menu. In the “find” box, type “^p” to find all
paragraph marks. Don’t type anything into the “replace” box. Then click “Replace All.” This
will eliminate all the paragraph marks in the document. If you still see white spaces in the
sequence, use the same procedure, but type “^w” in the “find” box to represent white spaces.
8. You now should add back a paragraph mark after the title line (that starts with “>”) and
before the sequence starts. Save the file as PTGS2rna.doc on your desktop.
9. What is the RefSeq accession number for the Homo sapiens PTGS2 protein sequence?
_________________. Open the entry. Follow the steps given above to save the sequence in
FASTA format as a Word document called PTGS2prot.doc on your desktop.
10. Go to the Expasy website and search for the Swiss-Prot entry for PTGS2. (Hint: use the
gene name to search and be sure to select the HUMAN protein from the search results).
14. What amino acid is acetylated by aspirin (amino acid type and number)?
15. What His residue is in the active site?
Evaluation Scheme:
S. No   Component   Max. Marks (20)   Marks Obtained
5 Results Presentation 4
6 Viva / MCQ 2
Total 20
Experiment No: Date:
Aim
To understand different file formats and their conversion in bioinformatics.
Introduction
Sequence information for DNA and proteins, and structural information for chemical
compounds, is stored in different file formats for use in different contexts. A file can be
converted from one format to another for easier access, sharing or analysis.
FASTA
The standard and most widely used sequence file format.
FASTA is a text-based format for storing nucleotide or protein sequences, in
which nucleotides or amino acids are represented using single-letter codes.
The first line of a FASTA record starts with a ">" (greater-than) symbol,
followed by the sequence description.
The following lines contain the bases or amino acids in single-letter code.
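The FASTA layout described above can be read with a few lines of Python. The following is only a minimal sketch using the standard library; the function name `parse_fasta` and the example records are hypothetical and are not part of any tool used in this manual.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example input with a wrapped nucleotide record and a protein record.
example = """>seq1 hypothetical example record
ATGGCC
TTAA
>seq2
MKV
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Joining the wrapped lines is the same clean-up that the Word "Replace" steps in Experiment 1 perform by hand.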
FASTQ
FASTQ format stores sequences and Phred qualities in a single file. It is concise and
compact.
A FASTQ file normally uses four lines per sequence.
Line 1 begins with a '@' character and is followed by a sequence identifier and
an optional description.
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same
sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain
the same number of symbols as letters in the sequence.
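The four-line FASTQ layout above makes the FASTQ to FASTA conversion used later in the procedure easy to sketch. The code below is an illustrative stand-in, not what EMBOSS Seqret does internally; it assumes unwrapped records of exactly four lines, and the function name `fastq_to_fasta` is hypothetical.

```python
def fastq_to_fasta(fastq_text):
    """Convert FASTQ text (four lines per record) to FASTA text."""
    lines = [ln for ln in fastq_text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        title, seq, plus, qual = lines[i:i + 4]
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record starting at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        out.append(">" + title[1:])  # '@' title line becomes a '>' title line
        out.append(seq)              # quality line is dropped
    return "\n".join(out) + ("\n" if out else "")


print(fastq_to_fasta("@read1 hypothetical\nACGT\n+\n!!!!\n"), end="")
```

The conversion is lossy: the quality values on Line 4 cannot be recovered from the FASTA output.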
ABI
ABI is a binary file format containing the sequence and trace data from Sanger sequencing.
The format is used by sequencing facilities and requires special readers capable of
reading the file format to view the trace data and extract the sequence.
The file format is difficult to parse given its binary nature and the complexity of the
spec.
Procedure
Protein or Nucleotide sequence format Conversion
1. Retrieve sequences from any sequence database in FASTA/GenBank/FASTQ format as
text or as a file.
2. Open EMBOSS Seqret (https://www.ebi.ac.uk/Tools/sfc/emboss_seqret/).
3. Submit the sequence in EMBOSS Seqret and do the following file format conversions.
FASTA to PIR
GenBank to FASTA
FASTQ to FASTA
4. Retrieve the 3D structure of Myoglobin in PDB format and convert it to FASTA format.
Chemical Compound format Conversion
1. Open PubChem Database (https://pubchem.ncbi.nlm.nih.gov/)
2. Select Compound
3. Retrieve the structure of aspirin in Canonical SMILES format.
4. Open Online SMILES Translator (https://cactus.nci.nih.gov/translate/)
5. Convert the structure into PDB and SDF format.
6. Open Open Babel
(http://www.cheminfo.org/Chemistry/Cheminformatics/FormatConverter/index.html)
7. Do the following format conversions.
Aspirin (PDB to SMILES)
Polynoxylin (SDF to MOL format)
Results
Additional Information:
1. https://www.ebi.ac.uk/Tools/sfc/emboss_seqret/
2. https://www.hiv.lanl.gov/content/sequence/FORMAT_CONVERSION/form.html
3. https://cactus.nci.nih.gov/translate/
4. http://www.cheminfo.org/Chemistry/Cheminformatics/FormatConverter/index.html
Practice Questions:
1. Find the differences between the GenBank and FASTA file formats for the same entry.
2. List the major differences between the SDF and PDB formats for the same entry.
3. What is the SMILES notation for CH3CH2OH, CH3COOH and C6H12O6?
Evaluation Scheme:
S. No   Component   Max. Marks (20)   Marks Obtained
4 Results Presentation 3
5 Viva / MCQ 2
Total 20
Experiment No: Date:
Sequence Manipulation
Aim:
To manipulate DNA and Protein sequence and structure data using various
bioinformatics tools.
Introduction:
Reverse Complement
A DNA sequence contains only four characters (A, C, G and T), referred to as bases or
nucleotides. DNA occurs as a double strand in which each A is paired with a T and each C
is paired with a G. The reverse complement of a DNA sequence is formed by reversing the
order of the letters and then interchanging A with T and C with G. Thus the
reverse complement of ACCTGAG is CTCAGGT.
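The three operations offered by the online tools (reverse, complement, and reverse complement) can be sketched in Python as follows. This is a minimal illustration for unambiguous DNA letters only; the function names are hypothetical, and IUPAC ambiguity codes are deliberately not handled.

```python
# Translation table that swaps A<->T and C<->G (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Return the sequence read backwards."""
    return seq[::-1]


def complement(seq):
    """Return the base-by-base complement, keeping the original order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the complement of the sequence, read backwards."""
    return complement(seq)[::-1]


print(reverse("ACCTGAG"))             # GAGTCCA
print(complement("ACCTGAG"))          # TGGACTC
print(reverse_complement("ACCTGAG"))  # CTCAGGT
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check on any tool's output.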
Tools
Reverse Complement (https://www.bioinformatics.org/sms/rev_comp.html)
Revseq (http://www.bioinformatics.nl/cgi-bin/emboss/revseq)
Revcomp (http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html)
Procedure
Retrieve a DNA sequence from nucleotide sequence database in fasta format.
Open the Reverse Complement tool and convert the DNA sequence into its
Reverse
Complement
Reverse-complement
Submit your query sequence and understand the difference and computational
mechanism behind reverse complement.
ORF Prediction
DNA (deoxyribonucleic acid) stores genetic information as a code written with the bases
adenine (A), guanine (G), cytosine (C) and thymine (T). Each base is bonded to a sugar and a
phosphate group to form a nucleotide. During transcription, DNA is transcribed to mRNA. A
set of three nucleotides that codes for a particular amino acid during translation is called
a codon.
The region of a nucleotide sequence from the start codon (ATG) to a stop codon (TAA,
TAG, TGA) is called an Open Reading Frame (ORF). Depending on the starting point, there
are six possible ways of translating any nucleotide sequence into an amino acid sequence.
This is called six-frame translation, and the six alternatives are called reading frames.
Three reading frames are possible from each strand of the DNA. The longest frame
uninterrupted by a stop codon is usually taken to be the correct frame.
By analyzing the ORF we can predict the possible amino acids that might be produced during
translation.
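The six-frame ORF search described above can be sketched as follows. This is a simplified illustration, not the algorithm used by NCBI ORFfinder: it assumes ATG as the only start codon and the standard stop codons, reports positions relative to the strand being scanned, and uses the hypothetical names `find_orfs` and `min_length`.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_length=0):
    """Return (frame, start, end, orf) tuples for all six reading frames.

    Frames +1..+3 are on the given strand, -1..-3 on its reverse complement.
    start/end are 0-based, end-exclusive positions on the strand scanned,
    and each ORF runs from ATG through the first in-frame stop codon.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            start = None
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    orf = seq[start:i + 3]
                    if len(orf) >= min_length:
                        orfs.append((strand + str(offset + 1), start, i + 3, orf))
                    start = None  # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGAAATAGGG")
print(orfs)
if orfs:
    # The longest ORF is the usual candidate for the coding frame.
    print(max(orfs, key=lambda o: len(o[3])))
```

Setting `min_length=75` would mimic the default minimal ORF length quoted in the procedure for NCBI ORFfinder.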
ORF Prediction Tools:
ORF Finder (http://www.bioinformatics.org/sms2/orf_find.html)
NCBI ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/)
getorf (http://www.bioinformatics.nl/cgi-bin/emboss/getorf)
Procedure
Retrieve a DNA sequence from nucleotide sequence database in fasta format.
Open NCBI ORFfinder
Enter your query sequence.
Optional Parameters
Enter coordinates for a sub range of the query sequence. The ORF search will
apply only to the residues in the range. (Default: 1 to length of the sequence).
Minimal ORF Length (nt). The search will be restricted to the ORFs with the
length equal or more than the selected value. (Default: 75)
Genetic Code table: Standard
ORF Start codon to use: ATG only.
Submit
Identify all the possible open reading frames in a sequence.
Number of ORF in each frame.
Find which frame contains the longest ORF, and report its length and location.
Additional Information:
1. http://www.bioinformatics.org/sms2/
2. https://www.biologicscorp.com/sms2/about.html
3. http://manuals.bioinformatics.ucr.edu/home/emboss
Practice Questions:
1. Perform a six-frame translation of any DNA sequence and do pairwise sequence
alignment between the six products obtained.
2. Compute the reverse complement of a given DNA sequence, align it with the
original DNA sequence and report your findings.
3. Retrieve any protein sequence from the UniProt database and assess its various properties
in Expasy under the primary sequence analysis category.
Evaluation Scheme:
S. No   Component   Max. Marks (20)   Marks Obtained
3 Practice Questions 6
4 Results Presentation 3
5 Viva / MCQ 2
Total 20
Experiment No: Date:
Aim:
To assemble sample NGS raw data using CodonCode Aligner.
Introduction:
DNA sequence assembly is a process through which short DNA sequence fragments (called
reads or samples) are merged into a longer DNA sequence to reconstruct the original DNA
sequence. The longer sequence resulting from sequence assembly is called a 'contig'.
A contig is a set of overlapping DNA segments that together represent a consensus region of
DNA.
During sequence assembly the short DNA fragments may also be aligned to a reference
sequence in order to see the differences between the contig sequence obtained and the
reference sequence.
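The core idea of merging overlapping reads into a contig can be illustrated with a toy example. The sketch below joins two reads on an exact suffix/prefix overlap. It is not how CodonCode Aligner or Phrap work internally: real assemblers tolerate sequencing errors, use quality values and consider reverse-complemented reads. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_reads(left, right, min_overlap=3):
    """Merge two reads into one contig, or return None if they do not overlap."""
    n = overlap_length(left, right, min_overlap)
    if n == 0:
        return None
    return left + right[n:]  # keep the shared region only once


print(merge_reads("ATGCGTAC", "GTACCTGA"))  # ATGCGTACCTGA
```

Repeating this merge greedily over many reads, always joining the pair with the longest overlap, gives a very rough picture of how contigs grow.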
Line 1 begins with a '@' character and is followed by a sequence identifier and
an optional description.
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence
identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the
same number of symbols as letters in the sequence.
Example
@SEQ_ID|Description
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
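The quality line can be decoded into numeric Phred scores. The sketch below assumes the Phred+33 (Sanger) ASCII offset, which this manual does not state explicitly; older Illumina files used an offset of 64. The function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred_scores(quality)
print(scores[:5])
print(sum(scores) / len(scores))  # mean quality of the read
print(error_probability(20))      # a score of 20 means a 1 in 100 error rate
```

A read whose mean score is low, or whose ends fall below a chosen threshold, is a candidate for the end clipping step that assembly programs perform.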
CodonCode Aligner
CodonCode Aligner is a Windows and Mac OS X based program for DNA sequence assembly,
sequence alignment and editing.
Url: https://www.codoncode.com/aligner/
Features of CodonCode Aligner:
Chromatogram editing, end clipping, and vector trimming, sequence assembly and
contig editing.
Aligning cDNA against genomic templates, sequence alignment and editing.
Alignment of contigs to each other with ClustalW, MUSCLE, or built-in algorithms
Mutation detection, including detection of heterozygous single-nucleotide
polymorphism, analysis of heterozygous insertions and deletions.
Restriction analysis (find and view restriction cut sites), trace sharpening, and support
for Phred, Phrap, ClustalW, and MUSCLE.
Procedure:
1. Start -> Programs -> CodonCode Aligner
On Windows, opening CodonCode Aligner will open the main application window
("root window"), which contains the application menu and all other windows opened by
Aligner.
2. Create a new project -> OK
Project Window options
1. Save a project with new
2. Import Samples
3. Import a folder of Samples
8. Click on Contigs to view assembly details, or Contig (left click) -> View contigs
Results:
Additional Information:
The assembled contig sequence can be searched against the NCBI nr database to annotate
it functionally.
Other Sequence Assembly Tools
CAP3 (http://doua.prabi.fr/software/cap3)
Trinity (https://github.com/trinityrnaseq/trinityrnaseq/wiki)
Practice Questions:
1. Download a Raw data file from NCBI and find out its quality and report its assembly
details.
2. For any three contigs perform annotation.
Evaluation Scheme:
S. No   Component   Max. Marks (20)   Marks Obtained
2 Contig Statistics 3
3 Practice Questions 6
4 Results Presentation 3
5 Viva / MCQ 3
Total 20
Experiment No: Date:
Aim:
To search for sequences similar to a given query sequence using the sequence similarity
search tool BLAST and its variants.
Introduction:
Similarity is a measure of how alike two sequences are, whereas homology is a conclusion
about the evolutionary relatedness of two sequences based on an assessment of their
similarity. Two sequences can be said to be 68% similar, but these same two sequences are
either homologous or not. There is no degree of homology: two sequences are either related
or not. At the next bioinformatics seminar you attend, you can correct the misinformed
graduate student who attempts to state that protein X is 23% homologous to protein Y.
The Basic Local Alignment Search Tool (BLAST) is a program that can detect sequence
similarity between a Query sequence and sequences within a database. The ability to detect
sequence homology allows us to identify putative genes in a novel sequence. It also allows us
to determine if a gene or a protein is related to other known genes or proteins. BLAST is
popular because it can quickly identify regions of local similarity between two sequences.
More importantly, BLAST uses a robust statistical framework that can determine if the
alignment between two sequences is statistically significant.
The measure of similarity between two sequences is captured by a scoring scheme in BLAST
which is based on scoring matrices. Scoring matrices are empirical weighting schemes that
are used in comparing sequences and capture information about residue conservation, residue
frequency, and evolutionary models. The two most commonly used substitution matrices are
the BLOSUM and PAM scoring matrices. The PAM (point accepted mutation) scoring
matrices are based on global alignments of closely related proteins. The PAM1 matrix is
calculated by looking at the amino acid substitutions that occur in proteins with no more than
1% divergence (1 change per 100 amino acids). In an effort to model evolutionary changes,
the other PAM matrices are extrapolated from PAM1 by matrix multiplication. The
BLOSUM (blocks substitution matrix) matrices are based on local alignments, and the
BLOSUM62 matrix is calculated from comparisons of sequences with <62% identity.
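The extrapolation of higher PAM matrices from PAM1 can be written compactly. The block below follows the Dayhoff formulation as it is commonly presented; the symbols $M_1$, $M_n$, $f_i$ and $S_n$ are notation introduced here for illustration and do not appear elsewhere in this manual.

```latex
M_n = (M_1)^n
\qquad
S_n(i,j) = 10 \,\log_{10}\!\left(\frac{M_n(i,j)}{f_i}\right)
```

Here $M_1(i,j)$ is the probability that residue $j$ is replaced by residue $i$ over 1 PAM of divergence, $f_i$ is the background frequency of residue $i$, and $S_n(i,j)$ is the log-odds score, which is rounded to an integer in practice. PAM250, for example, corresponds to $n = 250$.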
Results
Blastp
Blastn
Blastx
tBlastn
tBlastx
PSI-Blast
PHI-Blast
Additional Information:
Blast (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
Other Similarity Search tool: FASTA (https://www.ebi.ac.uk/Tools/sss/)
Practice Questions:
1. Comment on the conserved domain present in Q8NFM4.
2. Find the gene sequences of Mouse origin similar to U80226.1.
3. Write the function of C7AE31. Find its orthologous proteins.
4. Write the function of P80404. Find its paralogous proteins.
5. Find whether the given pattern is present in the following protein. Also find its
homologous proteins in the Swiss-Prot database that possess a similar pattern.
Pattern: [LIVMFYWCS]-[LIVMFYWCAH]-x -D-[ED]-[IVA]-x(2,3)-[GAT]-
[LIVMFAGCYN]-x(0,1)-[RSAC LIH]-x-[GSADEHRM]-x(10,16)- [DH]-[LIVMFCAG]-
[LIVMFYS TAR]-x(2)-[GSA]-K-x(2,3)- [GSTADNV]-[GSAC]
6. Find the structurally solved homologous proteins for P80404. Comment on the results.
Evaluation Scheme:
S. No   Component   Max. Marks (20)   Marks Obtained
5 Viva / MCQ 2
Total 20
Experiment No: Date:
Aim:
To align multiple sequences to find conserved regions and to understand evolutionary
relationship among these sequences by phylogenetic tree construction using different
methods.
Introduction:
Multiple Sequence Alignment:
A multiple sequence alignment (MSA) is a sequence alignment of three or more
biological sequences such as protein, DNA, or RNA. Typically it is implied that the set of
sequences share an evolutionary relationship, which means they are all descendants of a
common ancestor. Conserved regions in the alignment may correspond to functional,
structural, or evolutionary relationships between the sequences. Alignments can also reflect
the degree of evolutionary change between sequences that descend from a common ancestor.
Different tools follow different methods/algorithms to perform MSA. Commonly used MSA
tools include ClustalW, T-Coffee, MUSCLE, etc.
Phylogenetic Analysis:
A phylogenetic tree is an estimate of the relationships among taxa (or sequences) and
their hypothetical common ancestors. Originally, the purpose of most molecular phylogenetic
trees was to estimate the relationships among the species represented by those sequences,
now expanded to include understanding the relationships among the sequences themselves
without regard to the host species, inferring the functions of genes that have not been studied
experimentally, and elucidating mechanisms that lead to microbial outbreaks among many
others. Building a phylogenetic tree requires four distinct steps:
1. Identify and acquire a set of homologous DNA or protein sequences,
2. Align those sequences (multiple sequence alignment),
3. Estimate a tree from the aligned sequences, and
4. Present that tree in such a way as to clearly convey the relevant information to others.
The evolutionary history inferred from phylogenetic analysis is usually depicted as
branching, treelike diagrams that represent an estimated pedigree of the inherited
relationships among molecules (‘‘gene trees’’), organisms, or both. These trees can be
rooted / unrooted and scaled / unscaled trees. Various methods of phylogenetic tree
construction exist such as distance based methods (UPGMA, WPGMA, FM & NJ),
maximum parsimony method and maximum likelihood method. Numerous programs offer
construction of phylogeny trees using these methods. Most prominent among them are
Phylip, PAUP, MEGA, etc.
1. Download ten COI homologous nucleotide / protein sequences from different species
and save them in fasta format as filename.fasta.
2. Open MEGA application, Launch the Alignment Explorer by selecting Alignment ->
Alignment/CLUSTAL
3. A window will appear asking you either to a) Create a new alignment, b) Open a
saved alignment session, or c) Retrieve sequences from a file. Select the first option,
“create a new alignment”.
4. Copy and paste unaligned sequences from the text file to the Alignment Explorer.
5. In the Alignment Explorer highlight all the sequences by selecting Edit -> Select All.
6. Align the highlighted sequences by selecting Alignment -> Align by ClustalW.
7. Save the current alignment as an alignment session file by selecting Data -> Export
-> Save. This will allow the current alignment session to be restored for future editing
from a file with the extension “.mas”, e.g. coi_alignment.mas
8. Save the current alignment as a MEGA file by selecting Data -> Export -> MEGA
file. This will allow the current alignment to be analyzed by MEGA.
Compute the proportion of amino acid differences between each pair of sequences
1. Select Distance -> Compute Pairwise command to display the distance analysis
preferences dialog box.
2. In the Distance Options tab, click on the green box in the Models pulldown section
and then select the Amino Acid -> p-distance option.
3. Click “Compute” to begin the computation.
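The p-distance that MEGA computes in the steps above is simply the proportion of aligned sites at which two sequences differ. The sketch below is a minimal illustration, assuming the sequences are already aligned to equal length and that sites with a gap in either sequence are skipped (pairwise deletion); MEGA's own gap handling depends on the options chosen in its dialog. The function names and the example alignment are hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # pairwise deletion: skip sites with a gap
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def pairwise_distances(aligned):
    """Return {(name1, name2): p-distance} for every pair in the alignment."""
    names = list(aligned)
    return {
        (x, y): p_distance(aligned[x], aligned[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


# Hypothetical three-sequence protein alignment.
alignment = {"sp1": "MKVLA", "sp2": "MKILA", "sp3": "MK-LG"}
for pair, dist in pairwise_distances(alignment).items():
    print(pair, round(dist, 3))
```

The resulting pairwise values are the entries of the distance matrix from which the Neighbor-Joining tree is built in the next part of the procedure.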
Constructing Trees
1. Activate the data file that you want to analyze by clicking on it.
2. Select the Phylogeny -> Construct Tree -> Neighbor-Joining command to display
the analysis preferences dialog box.
3. In the Options Summary tab, click the Model pulldown (found in the Substitution
Model section) and then select the Amino Acid -> p-distance option. A progress
indicator will appear briefly, then the tree will be displayed in the Tree Explorer.
4. To select a branch, click on it with the left mouse button. If you click on a branch
with the right mouse button, you will get a small options menu that will let you flip
the branch and perform various other operations on it. To edit the OTU labels, double
click on them.
5. Change the branch style by selecting the View->Tree/Branch Style command from
the Tree Explorer menu.
6. At this time the cursor assumes a triangular shape instead of the diamond shape.
Press M and the mirror image of the original tree is displayed instantly. Press M
again and the tree reverts to its original shape.
7. Select the View -> Topology Only command from the Tree Explorer menu and the
branching pattern (without actual branch lengths) is displayed on the screen.
Results:
Additional Information:
PHYLIP - PHYLIP is a free package of programs for inferring phylogenies.
(http://evolution.genetics.washington.edu/phylip.html)
Clustal Omega - Clustal Omega is a new multiple sequence alignment program that uses
seeded guide trees and HMM profile-profile techniques to generate alignments between three
or more sequences. (https://www.ebi.ac.uk/Tools/msa/clustalo/)
Practice Questions:
Download the following sequences from the UniProt protein sequence database and report the
conserved sequences, domains and motifs present.
UniProt Entry: P04247, P02192, P02144, P02196, P68082, P04248 and P02173
Evaluation Scheme:
S. No   Component   Max. Marks (20)   Marks Obtained
2 Tabulation of data 1
8 Viva / MCQ 2
Total 20
Experiment No: Date:
GOR method:
The GOR Method was published by Garnier, Osguthorpe, and Robson in 1978 and
was one of the first successful methods to predict protein secondary structure from amino
acid sequence. The GOR method (http://fasta.bioch.virginia.edu/fasta_www/garnier.htm) is
also based on the “propensity” of each residue to be in one of the four secondary structural
conformational states, helix (H), strand (E), turn (T), and coil (C). However, unlike
Chou-Fasman, the GOR method takes into account not only the propensities of individual amino
acids to form particular secondary structures, but also the conditional probability of the
amino acid forming a secondary structure given that its immediate neighbours have already
formed that structure. It examines a window of seventeen residues at a time and sums up
propensity scores for all residues for each of the four states, resulting in four summed
values. The highest scoring state defines the conformational state for the center residue in
the window (ninth position). The GOR method has been shown to be more accurate than
Chou–Fasman because it takes the neighbouring effect of residues into consideration.
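The window-and-sum step described above can be sketched in Python. This is a toy illustration only: the score table `TOY_SCORES` is hypothetical and is NOT the published GOR parameter set, real GOR scores depend on the position of each residue within the window as well as its type, and shortening the window at the sequence ends is an assumption made here for simplicity.

```python
STATES = "HETC"  # helix, strand, turn, coil

# Hypothetical per-residue scores, for illustration only (not GOR parameters).
TOY_SCORES = {
    "A": {"H": 2, "E": -1, "T": -1, "C": 0},
    "V": {"H": -1, "E": 2, "T": -1, "C": 0},
    "G": {"H": -2, "E": -1, "T": 2, "C": 1},
    "P": {"H": -2, "E": -2, "T": 1, "C": 2},
}
NEUTRAL = {"H": 0, "E": 0, "T": 0, "C": 0}


def predict_states(sequence, scores=TOY_SCORES, window=17):
    """Assign each residue the state with the highest summed window score."""
    half = window // 2
    prediction = []
    for center in range(len(sequence)):
        # Window centred on the residue, truncated at the sequence ends.
        lo = max(0, center - half)
        hi = min(len(sequence), center + half + 1)
        totals = {state: 0 for state in STATES}
        for residue in sequence[lo:hi]:
            row = scores.get(residue, NEUTRAL)
            for state in STATES:
                totals[state] += row[state]
        # The highest of the four sums decides the state of the centre residue.
        prediction.append(max(STATES, key=lambda s: totals[s]))
    return "".join(prediction)


print(predict_states("AAAAAAAAAAGPGPVVVVVVVVVV"))
```

With a window of 17, residue 9 of the window is the centre, so each prediction draws on eight neighbours on either side wherever the sequence is long enough.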
The prediction method or algorithm is split into three stages: generating a sequence
profile, predicting initial secondary structure, and filtering the predicted structure.
Steps:
Step II: Generate patterns of window size 5 for first sequence-to-structure network
Step III: Run SNNS (Neural Network SNNS: Second level-Structure-to-structure net) and
analyze output
NNvPDB
NNvPDB is another neural network based secondary structure prediction server with
PDB validation, developed by SRM University. NNvPDB predicts three states (Helix, Sheet
and Coil) and reports percent accuracy by comparing the prediction with similar PDB structures.
NNvPDB Methodology: The input sequence is subjected to Blastp against the PDB database and
a sub-database is created from the top 5 BLAST hits and their secondary structures. Each
entry in the database is prepared and used to train the network. Once the network is trained,
the input sequence is prepared and queried against the trained network. The binary outputs
of the network are converted to secondary structure states based on a set of threshold
values. The structure assigned at each threshold value is validated against PDB, and the
structure with the highest accuracy is reported to the user with validation information.
Secondary Structure Prediction Servers:
GOR4
(https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_gor4.html)
SOPMA
(https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_sopma.html)
Psipred (http://bioinf.cs.ucl.ac.uk/psipred/)
JPred (http://www.compbio.dundee.ac.uk/jpred/)
NNvPDB (http://bit.srmuniv.ac.in/cgi-bin/bit/cfpdb/nnsecstruct.pl)
Procedure:
Predict the possible secondary structure states of human myoglobin using following
secondary structure prediction servers
1. Open the following links for performing secondary structure prediction
a. CFSSP
b. GOR4
c. Psipred
d. NNvPDB
Analyse the results by comparing the predictions from all the methods and report the best
method.
Results
Additional Information:
Chou PY, Fasman GD (1978). "Prediction of the secondary structure of proteins from their
amino acid sequence". Adv Enzymol Relat Areas Mol Biol. 47: 45–148.
doi:10.1002/9780470122921.ch2
Garnier J, Gibrat JF, Robson B. (1996). GOR method for predicting protein secondary
structure from amino acid sequence. Methods Enzymol 266:540-53
Practice question:
1. Predict the secondary structure of Topoisomerase IA and Topoisomerase IB and
report the secondary structural variation between the two topoisomerases.
2. What are the various classifications of amino acids and how are they crucial to
secondary structure formation?
Evaluation Scheme:
S. No   Component   Max. Marks (20)   Marks Obtained
6 Viva / MCQ 2
Total 20
Experiment No: Date:
Aim:
To model the tertiary structure of query protein sequence using homology modelling.
Introduction:
Homology modelling or comparative modelling employs the use of available
homologous protein structure(s) to predict the unknown structure of a related amino acid
sequence. The principle governing this approach is that if two proteins share a high sequence
similarity, they are more likely to have very similar three-dimensional structures.
Structural information is always of great assistance in the study of protein function,
dynamics, interactions with ligands and other proteins. The "low-resolution" structure
provided by homology modelling contains sufficient information about the spatial
arrangement of important residues in the protein and may guide the design of new
experiments, for example site-directed mutagenesis. Even within the pharmaceutical industry
homology modelling can be valuable in structure-based drug discovery and drug design.
2. Alignment correction
3. Backbone generation
4. Loop modelling
5. Side-chain modelling
6. Model optimization
7. Model validation
Computational methods:
EasyModeller 4.0
EasyModeller is a GUI based tool for homology modelling. The user provides a
predetermined sequence alignment of a template(s) and a target to allow the program to
calculate a model containing all of the heavy atoms (nonhydrogen atoms). The program
models the backbone using a homology derived restraint method, which relies on multiple
sequence alignment between target and template proteins to distinguish highly conserved
residues from less conserved ones. Conserved residues are given high restraints in copying
from the template structures. Less conserved residues, including loop residues, are given less
or no restraints, so that their conformations can be built in a more or less ab initio fashion.
The entire model is optimized by energy minimization and molecular dynamics procedures.
Tool Link: http://modellergui.blogspot.com/
Modelling using EasyModeller 4.0:
1. Go to the UniProt database; Search and download the amino acid sequence of the
protein human SST5 (Accession ID: P35346).
2. Perform a protein BLAST against Protein Data Bank proteins (pdb)
3. From the BLAST result page choose homologue which belongs to same family as
query and shows remarkable similarity score.
4. Download the respective 3D structure (.pdb) from PDB. Treat this structure as
template and BLAST query as target.
5. Open EasyModeller 4.0 (runs Python and Modeller in the backend for source code
and functioning), upload query and templates in corresponding boxes in first tab.
6. Switch to the next tab and click on align templates. Save the generated result as a .ps file.
7. Move to the next tab to align the query with the template. Save the result as a .ps file.
8. Go to ‘Build Model’ tab and click on ‘Generate Model’ button. A prompt window
will ask for, ‘Number of models to be generated’, ‘include heteroatom if present’,
‘automatically refine the model’.
9. Choose the parameters accordingly and click ‘OK’ button.
Swiss Model Server
SWISS-MODEL is a server for automated comparative modeling of three-dimensional (3D)
protein structures. SWISS-MODEL provides several levels of user interaction through its
web interface: in the ‘first approach mode’, only the amino acid sequence of a
protein is submitted to build a 3D model. Template selection, alignment and model building
are performed fully automatically by the server.
Modelling using Swiss Model Server:
1. Go to the Swiss Model Server - https://swissmodel.expasy.org/
2. Click on ‘Start Modelling’.
3. Paste the FASTA sequence in the space provided.
4. Click on ‘Search for Templates’
5. Select the best templates for modelling and build model.
6. Download the developed models.
PyMOL is a powerful tool used to visualize and analyze protein structures, DNA, and other
biological molecules. It is easy to use and has become very popular with
structural biologists.
When a molecule is loaded into PyMOL, it can be controlled with the three mouse
buttons: left, middle (the clickable scroll wheel), and right. The left button rotates the
molecule, the middle button moves (translates) it, and the right button moves it
along the z axis, that is, zooms in and out.
Protein Structure Visualization using PyMOL
1. Open PyMOL. PyMOL normally starts with two windows: the Viewer Window and the
External (Tcl/Tk) GUI Window.
2. Load the protein PDB file into PyMOL. By default, the loaded structure is
shown in line representation.
3. Beside each loaded object there is a row of buttons labelled A, S, H, L and C. These letters
stand for Action, Show, Hide, Label and Color respectively.
4. Go to Action → Preset → Pretty. The protein will be shown in cartoon representation, with each
helix in a different colour.
5. Load one of the templates used for developing the model into the same window.
Repeat step 4 with the template.
6. Go to Action → Align → To molecule → template.pdb. The RMSD
value between the query and the template will be displayed in the external window.
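The RMSD reported in step 6 is the root-mean-square distance between matched atoms of the two structures. A minimal sketch of that calculation is given below. It assumes the two coordinate lists are already superimposed and paired atom by atom, which PyMOL takes care of during alignment; the function name and the coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Illustrative coordinates in angstroms, not taken from any real structure.
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
template = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
print(rmsd(model, template))  # every atom is displaced by 1 angstrom, so this prints 1.0
```

A lower RMSD means the model backbone lies closer to the template after superposition.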
Ramachandran Plot:
The Ramachandran plot is a fundamental tool in the analysis of protein structures. The
Ramachandran plot is the 2D plot of the φ-ψ torsion angles of the protein backbone. It
provides a simple view of the conformation of a protein. The φ-ψ angles cluster into distinct
regions in the Ramachandran plot where each region corresponds to a particular secondary
structure.
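Each φ or ψ value is a torsion (dihedral) angle defined by four consecutive backbone atoms: φ uses C(i-1), N(i), CA(i), C(i) and ψ uses N(i), CA(i), C(i), N(i+1). The sketch below shows one common way to compute such an angle from four 3D points. It is an illustration written for this manual, not the code used by any particular validation server, and the test points are simple geometric examples rather than real atom coordinates.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees (-180 to 180) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Simple geometric check: the fourth point is rotated a quarter turn
# away from the first point about the central bond.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Computing φ and ψ for every residue and plotting ψ against φ gives the Ramachandran plot.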
Structure validation using Ramachandran Plot:
1. Go to the RAMPAGE Server - http://mordred.bioc.cam.ac.uk/~rapper/rampage.php
2. Load the protein PDB file and submit.
3. Ramachandran plot will be displayed. Analyse the results.
Result:
Modelling using EasyModeller 4.0:
Modelling using Swiss Model Server:
Additional information:
1. Protein Structure Modelling Server
a. I-TASSER (http://zhanglab.ccmb.med.umich.edu/I-TASSER/)
b. RaptorX (http://raptorx.uchicago.edu/)
c. ROBETTA (http://robetta.bakerlab.org/)
2. Protein Structure Visualisation Tools
a. RasMol (http://www.openrasmol.org/)
b. Chimera (https://www.cgl.ucsf.edu/chimera/)
c. VMD (http://www.ks.uiuc.edu/Research/vmd/)
Practice Questions:
1. What is protein folding?
2. What are the different methods used to predict a protein structure?
3. What is sequence similarity search? How do sequence similarity searches help in protein
structure prediction?
4. What is ab initio modelling?
5. What is the difference between RMSD and RMSF?
6. Which two amino acids are frequently found in the disallowed regions of the
Ramachandran plot? Why?
7. What are the different structure visualisation tools available?
8. What are Sequence-dependent vs. sequence-independent methods in structure
comparison?
Evaluation scheme:
S. No   Component          Max. Mark (20)   Marks Obtained
4       Model validation   2
6       Viva / MCQ         2
Total                      20
Experiment: Date:
Aim:
To perform protein-ligand and protein-protein molecular docking
Introduction:
Molecular docking is a key tool in structural molecular biology and computer-assisted drug
design. The term “docking” is mostly related to protein molecule interactions. There are
several types of molecular docking for protein interactions:
If a protein interacts with a ligand: protein-ligand interaction
If a protein interacts with another protein: protein-protein interaction
If a protein binds to DNA: protein-DNA interactions
Of these, protein-ligand docking is the most widely used technique. The
goal of protein-ligand docking is to predict the predominant binding mode(s) of a ligand with
a protein of known three-dimensional structure. Docking can be used to perform virtual
screening on large libraries of compounds, rank the results, and propose structural hypotheses
of how the ligands inhibit the target, which is invaluable in lead optimization.
Principle:
There are two basic components of molecular docking:
The search algorithm: A search algorithm explores candidate docking poses and looks for the
best one as measured by the scoring function. Since an exhaustive search is computationally
infeasible, most docking tools rely on heuristic sampling, usually treating the ligand as flexible.
The scoring function: A scoring function discriminates correct (experimentally
verified) docking poses from incorrect ones. It estimates the binding affinity between
ligand and receptor.
The aim of molecular docking is to evaluate the feasible binding geometries of a putative
ligand with a target whose 3D structure is known. The binding geometries, often called
binding modes or poses include, in principle, both the positioning of the ligand relative to the
receptor (ligand configuration) and the conformational state(s) of the ligand and the receptor.
The exploration of the configurational and conformational space (the sampling) and the
energetic evaluation of each discrete geometry (the scoring) are separable tasks.
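The separation between sampling and scoring can be illustrated with a toy example. The sketch below is not the algorithm used by SwissDock or any real docking engine: it treats the ligand as a single point, scores a pose with a Lennard-Jones-like term summed over receptor atoms, and samples poses at random inside a box. All names, parameters and coordinates are hypothetical.

```python
import math
import random


def score(pose, receptor_atoms, r_min=3.5, depth=1.0):
    """Toy scoring function: sum of Lennard-Jones-like terms (lower is better)."""
    total = 0.0
    for atom in receptor_atoms:
        r = max(math.dist(pose, atom), 0.5)  # clamp very short distances
        ratio = r_min / r
        total += depth * (ratio ** 12 - 2 * ratio ** 6)
    return total


def random_search(receptor_atoms, box, n_trials=5000, seed=0):
    """Toy search algorithm: sample random poses and keep the best-scoring one."""
    rng = random.Random(seed)
    best_pose, best_score = None, float("inf")
    for _ in range(n_trials):
        pose = tuple(rng.uniform(low, high) for low, high in box)
        current = score(pose, receptor_atoms)
        if current < best_score:
            best_pose, best_score = pose, current
    return best_pose, best_score


# Hypothetical receptor atoms and search box (angstroms).
receptor = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (2.0, 3.5, 0.0)]
box = [(-6.0, 8.0), (-6.0, 8.0), (-6.0, 6.0)]
pose, energy = random_search(receptor, box)
print(pose, energy)
```

Because the two functions are independent, either one could be replaced (for example, a genetic algorithm for the search, or an empirical energy function for the scoring) without changing the other.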
Computational methods:
Protein-ligand docking using SwissDock:
SwissDock is a web server dedicated to the docking of small molecules on target proteins. It is
based on the EADock DSS engine, combined with setup scripts for curating common
problems and for preparing both the target protein and the ligand input files. An efficient
Ajax/HTML interface was designed and implemented so that scientists can easily submit
dockings and retrieve the predicted complexes.
Procedure:
1. Download the protein PDB file of H1N1 neuraminidase from PDB
(http://www.rcsb.org).
2. Go to the SwissDock Server - http://www.swissdock.ch/docking and load the protein
PDB file in the ‘Target selection’ section.
3. Go to the PubChem database - https://pubchem.ncbi.nlm.nih.gov/.
4. In the search bar, type the PubChem ID: 60855 (Zanamivir) and download the 3D SDF
format file of the molecule.
5. Go to http://pasilla.health.unm.edu/tomcat/biocomp/convert - a server for converting
one chemical file format to another. Load the downloaded SDF file as input. For the
output, select the ‘mol2 – Tripos mol2’ format. Select convert and download the converted
file.
6. In the SwissDock window, load the converted ligand file in the ‘Ligand selection’ section.
7. Start docking.
8. After the run, download the prediction file and analyse the results.
Result:
Protein-ligand docking using SwissDock:
Additional information:
1. http://pymol.sourceforge.net/newman/userman.pdf
2. https://www.ebi.ac.uk/pdbe/docs/Tutorials/workshop_tutorials/PDBefold.pdf
3. http://www.proteinstructures.com/Structure/Structure/Ramachandran-plot.html
Practice Questions:
Download the following inhibitor molecules of Phosphoinositide-3-kinase: Wortmannin,
Copanlisib, and Alpelisib.
Identify the best inhibitor molecule and rank the inhibitors based on docking analysis.
Evaluation scheme:
S. No   Component    Max. Mark (20)   Marks Obtained
5       Viva / MCQ   2
Total                20