15GN402L Final Bioinformatics Lab Manual

Uploaded by

Sambit Nayak
15GN405L – BIOINFORMATICS LABORATORY

RECORD MANUAL

NAME :
REGISTER NUMBER :
BRANCH :
YEAR & SEMESTER :
ACADEMIC YEAR :

DEPARTMENT OF GENETIC ENGINEERING


SCHOOL OF BIOENGINEERING

COURSE COORDINATOR: DR. S. K. M. HABEEB

FACULTY HANDLING COURSE: DR. S. K. M. HABEEB


SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

DEPARTMENT OF GENETIC ENGINEERING

SCHOOL OF BIOENGINEERING

BONAFIDE CERTIFICATE

This is a bona-fide record of work done by

_____________________________________register number _________________________

of IV year B. Tech. Genetic Engineering for 15GN402L – Bioinformatics Laboratory during

the academic year 2019 – 2020.

Submitted for the University Examinations held on ______________________ in Hi- Tech

lab at the Dept. of Genetic Engineering, School of Bioengineering, SRM IST, Kattankulathur

– 603203.

Handling Faculty Head of the Department


Name : Name:
Signature : Signature:
Date : Date:

Examiner 1 Examiner 2
Name : Name:
Signature : Signature:
Date : Date:
CONTENTS

Expt. No.   Date   Name of the Experiment                               Date of Submission   Teacher's Signature   Marks (20)

1                  Biological Databases
2                  Sequence Format Conversion
3                  Sequence Manipulation
4                  DNA Sequence Assembly – Codon Code Aligner
5                  BLAST
6                  Multiple Sequence Alignment & Phylogeny
7                  Protein Secondary Structure Prediction
8                  Protein Tertiary Structure Prediction & Validation
9                  Protein – Ligand Molecular Docking

                   Total Marks
Experiment No: Date:

Biological Databases

Aim:
To retrieve data from various DNA and protein sequence databases and understand
the importance of these databases.

Introduction:
A database is an organized collection of data that models a part of the reality (a
domain). A database could refer both to the data and to the organization of that data.
Biological databases which contain biological data emerged as a response to the huge data
generated by low-cost DNA sequencing technologies. The data stored in biological databases
is organized for optimal analysis and consists of two types: raw and curated (or annotated).
Data is submitted directly to biological databases for indexing, organization, and data
optimization. They help researchers find relevant biological data by making it available in a
format that is readable on a computer. All biological information is readily accessible through
data mining tools that save time and resources.
Biological databases can be broadly classified as sequence and structure databases. Structure
databases are for protein structures, while sequence databases are for nucleic acid and protein
sequences. The data repositories more relevant to the biological sciences include: nucleotide
and protein sequences, genomes, bibliography, genetic expression and protein structures.
A sequence database is a collection of DNA or protein sequences with additional relevant
information. The main nucleotide sequence databases are GenBank, DDBJ and EMBL. Originally, they
were just sequence collections, but they have grown into heavily interconnected sets of
biological databases that provide powerful interfaces to search and browse the stored
information. Biological databases can be further classified as primary, secondary, and
composite databases.

Primary databases are archival in nature. They consist of experimentally derived data such
as nucleotide sequence, protein sequence or macromolecular structure. Experimental results
are submitted directly into the database by researchers. Examples of primary biological
databases include:
• Swiss-Prot and PIR for protein sequences
• GenBank, EMBL and DDBJ for nucleotide sequences
• Protein Data Bank (PDB) for protein structures

Secondary databases contain information derived from the analysis of primary data, such as
conserved sequences, active site residues, and signature sequences. Classifications derived
from Protein Data Bank structures are also stored in secondary databases. Examples include:
• SCOP
• CATH
• PROSITE
• Pfam
• OMIM

Composite databases aim to amalgamate the information held in two or more of the primary
databases. This means that you can search one composite database rather than do multiple
searches on individual primary databases, for example:
• OWL – a composite of SWISS-PROT, PIR1-4, GenPept and NRL-3D.
• NRDB (Non-Redundant DataBase) – a composite of SWISS-PROT+TrEMBL.

Structure databases such as PDB (Protein Data Bank) hold the atomic coordinate data
for proteins whose structure has been determined by X-ray crystallography and/or NMR. In
addition, the MMDB (Molecular Modeling Database) at NCBI is a compilation of the PDB
entries as ASN.1 files. Other databases such as SCOP (Structural Classification of Proteins)
and CATH (Class, Architecture, Topology, Homology) hold information on the structural
relationships of proteins and their structural domains.

Procedure to be followed in this experiment:

• Query and retrieve data from the databases listed below.
• Understand the file formats and cross-compare them with similar databases.
• Understand the purpose, organization and hierarchy of the data, and its scope for
research applications.
• Write your understanding of the formats and other observations as results for every
database, point-wise and in detail.
1. NCBI / ENTREZ
2. GENBANK
3. DDBJ
4. EMBL
5. UNIPROT
6. NRDB / OWL
7. INTERPROSCAN
8. PDB & PDBSUM
9. SCOP2
10. CATH
11. KEGG
12. OMIM
Additional Information:

1. https://libraries.wm.edu/databases/by-subject/7
2. https://ansit.wordpress.com/2007/02/06/biological-databases/
3. https://www.ncbi.nlm.nih.gov/

Practice Questions:

1. Go to the “Gene” entry for Homo sapiens PTGS2 and find how many databases are
available.

2. What is the gene name?

3. What is the GeneID number?

4. Where in the human genome is this gene located?

5. What is the RefSeq accession number for the mRNA sequence of Homo sapiens
prostaglandin-endoperoxide synthase 2? __________________.

6. Open the entry, then choose “FASTA” from the pull-down menu. Copy the sequence
(including the title line designated by the “>” symbol) and paste it into a word document.

7. Select the “Replace” tool under the EDIT menu. In the “find” box, type “^p” to find all
paragraph marks. Don’t type anything into the “replace” box. Then click “Replace All.” This
will eliminate all the paragraph marks in the document. If you still see white spaces in the
sequence, use the same procedure, but type “^w” in the “find” box to represent white spaces.
8. You now should add back a paragraph mark after the title line (that starts with “>”) and
before the sequence starts. Save the file as PTGS2rna.doc on your desktop.

9. What is the RefSeq accession number for the Homo sapiens PTGS2 protein sequence?
_________________. Open the entry. Follow the steps given above to save the sequence in
FASTA format as a Word document called PTGS2prot.doc on your desktop.

10. Go to the Expasy website and search for the Swiss-Prot entry for PTGS2. (Hint: use the gene name
to search and be sure to select the HUMAN protein from the search results).

11. Write at least three alternate names for this protein.

12. Where in the cell is this protein located?

13. What types of drugs target this protein?

14. What amino acid is acetylated by aspirin (amino acid type and number)?
15. What His residue is in the active site?

Evaluation Scheme:

S. No   Component                                    Max. Marks (20)   Marks Obtained
1       Understanding of the databases               2
2       Sequence Databases (6)                       6
3       Structure Databases (3)                      3
4       Specialized Databases / Search Engine (3)    3
5       Results Presentation                         4
6       Viva / MCQ                                   2
        Total                                        20
Experiment No: Date:

Sequence Format Conversion

Aim
To understand different file formats and their conversion in bioinformatics.

Introduction
DNA sequences, protein sequences and the structural information of chemical compounds are
stored in different file formats for use in different contexts. A file can be converted from
one format to another for easier access, sharing or analysis.

Sequence File formats:


GenBank
• Standard sequence file format used by NCBI.
• The file is plain text and thus can be read with a text editor. GenBank files often have
the file extension '.gb' or '.genbank'.
• It is quite flexible and allows sequence annotations, comments, and references to be
included within the file.

FASTA
• Standard and most widely used file format.
• FASTA format is a text-based format for storing nucleotide or protein sequences, in
which nucleotides or amino acids are represented using single-letter codes.
• The first line of each record in a FASTA file starts with a ">" (greater-than) symbol,
followed by the sequence description.
• The following lines contain the bases or amino acids in single-letter code.
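The FASTA layout described above can be read with a few lines of Python. This is only a minimal sketch, not a replacement for tools such as EMBOSS Seqret: the function name `parse_fasta` and the two example records are hypothetical and were made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example records (not real database entries).
example = """>seq1 hypothetical example record
ACGTAC
GTTA
>seq2 another hypothetical record
MKV
"""

for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```

The sequence lines are joined because a FASTA sequence is commonly wrapped over several lines after the title line.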

FASTQ
• FASTQ format stores sequences and Phred qualities in a single file. It is concise and
compact.
• A FASTQ file normally uses four lines per sequence.
  - Line 1 begins with a '@' character and is followed by a sequence identifier and
    an optional description.
  - Line 2 is the raw sequence letters.
  - Line 3 begins with a '+' character and is optionally followed by the same
    sequence identifier (and any description) again.
  - Line 4 encodes the quality values for the sequence in Line 2, and must contain
    the same number of symbols as letters in the sequence.
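Because a FASTQ record carries the same title and sequence as a FASTA record plus a quality line, the FASTQ to FASTA conversion amounts to dropping lines 3 and 4 of each record. The sketch below assumes the simple four-lines-per-record layout described above (no wrapped sequences); `fastq_to_fasta` and the example read are hypothetical names and data.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA text, discarding qualities."""
    lines = [ln for ln in fastq_text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        title, seq, plus, qual = lines[i:i + 4]
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record starting at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        out.append(">" + title[1:])  # '@' title line becomes a '>' title line
        out.append(seq)
    return "\n".join(out) + "\n"


# Hypothetical read used only for illustration.
example_fastq = "@read1 hypothetical read\nACGT\n+\nIIII\n"
print(fastq_to_fasta(example_fastq), end="")
```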
ABI
• ABI is a binary file format containing Sanger sequencing sequence and trace data.
• The format is used by sequencing facilities and requires special readers capable of
reading the file format to view the trace data and extract the sequence.
• The file format is difficult to parse given its binary nature and the complexity of the
specification.

PIR (Protein Information Resource)

• A sequence in PIR format consists of:
  - First line starting with a ">" sign, followed by a sequence identification code.
  - Second line containing a textual description of the sequence.
  - One or more lines containing the sequence itself. The end of the sequence is
    marked by a "*" character.
• A file in PIR format may comprise more than one sequence.
• The PIR format is also often referred to as the NBRF format.
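The three parts of a PIR record listed above can be assembled as follows. This is a minimal sketch: `to_pir` is a hypothetical helper, the identifier and sequence are invented, and the two-letter type code `P1` (complete protein) placed before the identifier is an assumption about the usual NBRF/PIR convention that the text above does not spell out.

```python
def to_pir(identifier, description, sequence, type_code="P1", width=60):
    """Format one sequence as a PIR/NBRF record.

    type_code is an assumption: 'P1' is the code commonly used for a
    complete protein sequence in NBRF/PIR files.
    """
    body = sequence + "*"  # the end of the sequence is marked by '*'
    wrapped = [body[i:i + width] for i in range(0, len(body), width)]
    lines = [">%s;%s" % (type_code, identifier), description] + wrapped
    return "\n".join(lines) + "\n"


# Hypothetical protein fragment used only for illustration.
print(to_pir("EX1", "hypothetical example protein", "MKVLA"), end="")
```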

Structure file formats


PDB (file extension .pdb)
• Protein Data Bank (PDB) format is a standard for files containing atomic coordinates.
It is used for structures in the Protein Data Bank.
• PDB format consists of lines of information in a text file. Each line of information in
the file is called a record. A PDB file generally contains several different types of
records, arranged in a specific order to describe a structure.
• The complete PDB file specification provides for a wealth of information, including
authors, literature references, and the method of structure determination.
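Since each PDB line is a fixed-column record, the amino acid sequence of a chain can be recovered from its ATOM records, which is essentially what a PDB to FASTA conversion does. The following is a rough sketch only: it assumes the standard ATOM column layout (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26), reads one residue per alpha-carbon (CA) atom, and ignores SEQRES records, alternate locations and modified residues. The function name and the four ATOM lines (including their coordinates) are hypothetical.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequence_from_pdb(pdb_text, chain="A"):
    """Build a one-letter sequence from the CA atoms of one chain."""
    residues, seen = [], set()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 27:
            continue  # only coordinate records are of interest here
        atom_name = line[12:16].strip()
        res_name = line[17:20]
        chain_id = line[21]
        res_id = line[22:27]  # residue number plus insertion code
        if atom_name != "CA" or chain_id != chain or res_id in seen:
            continue
        seen.add(res_id)
        residues.append(THREE_TO_ONE.get(res_name, "X"))  # X = unknown residue
    return "".join(residues)


# Hypothetical ATOM records; the coordinates are made up for illustration.
example_pdb = "\n".join([
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
    "ATOM      9  N   VAL A   2      26.913  26.639   3.531",
    "ATOM     10  CA  VAL A   2      27.886  26.463   4.263",
])
print(">example_chain_A")
print(sequence_from_pdb(example_pdb))
```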
MOLfile (file extension .mol)
An MDL Molfile is a file format for holding information about the atoms, bonds,
connectivity and coordinates of a molecule.
The molfile consists of some header information, the Connection Table (CT) containing atom
info, then bond connections and types, followed by sections for more complex information.
SDF (file extension .sdf)
• SDF is one of a family of chemical-data file formats developed by MDL.
• SDF stands for Structure-Data File.

SMILES (file extension .smi)

• Simplified Molecular-Input Line-Entry System (SMILES)
• SMILES is a specification in the form of a line notation for describing the structure of
chemical compounds using short ASCII strings.
• A SMILES string can be converted back into a two-dimensional
or three-dimensional model of the compound.
• Example: SMILES for carbon dioxide: O=C=O

File format conversion tools

EMBOSS Seqret (https://www.ebi.ac.uk/Tools/sfc/emboss_seqret/)
Format Converter
(https://www.hiv.lanl.gov/content/sequence/FORMAT_CONVERSION/form.html)
Online SMILES Translator (https://cactus.nci.nih.gov/translate/)
OPENBABEL (http://www.cheminfo.org/Chemistry/Cheminformatics/FormatConverter/index.html)

Procedure
Protein or Nucleotide sequence format Conversion
1. Retrieve sequences from any sequence database in FASTA/GenBank/FASTQ format, as
text or as a file.
2. Open EMBOSS Seqret (https://www.ebi.ac.uk/Tools/sfc/emboss_seqret/).
3. Submit the sequence in EMBOSS Seqret and do the following file format conversions:
   FASTA to PIR
   GenBank to FASTA
   FASTQ to FASTA
4. Retrieve the 3D structure of Myoglobin in PDB format and convert it to FASTA format.
Chemical Compound format Conversion
1. Open PubChem Database (https://pubchem.ncbi.nlm.nih.gov/).
2. Select Compound.
3. Retrieve the structure of aspirin in Canonical SMILES format.
4. Open Online SMILES Translator (https://cactus.nci.nih.gov/translate/).
5. Convert the structure into PDB and SDF format.
6. Open OPENBABEL
(http://www.cheminfo.org/Chemistry/Cheminformatics/FormatConverter/index.html).
7. Do the following format conversions:
   Aspirin (PDB to SMILES)
   Polynoxylin (SDF to MOL format)

Results
Additional Information:

1. https://www.ebi.ac.uk/Tools/sfc/emboss_seqret/
2. https://www.hiv.lanl.gov/content/sequence/FORMAT_CONVERSION/form.html
3. https://cactus.nci.nih.gov/translate/
4. http://www.cheminfo.org/Chemistry/Cheminformatics/FormatConverter/index.html

Practice Questions:

1. Find the differences between the GenBank and FASTA file formats for the same entry.
2. List the major differences between the SDF and PDB formats for the same entry.
3. What is the SMILES notation for CH3CH2OH, CH3COOH and C6H12O6?

Evaluation Scheme:

S. No   Component                            Max. Marks (20)   Marks Obtained
1       Understanding of the file formats    3
2       Sequence File Formats (6)            6
3       Structure File Formats (3)           6
4       Results Presentation                 3
5       Viva / MCQ                           2
        Total                                20
Experiment No: Date:
Sequence Manipulation

Aim:
To manipulate DNA and Protein sequence and structure data using various
bioinformatics tools.

Introduction:
Reverse Complement
A DNA sequence contains only four characters (A, C, G and T), each referred to as a base or
nucleotide. DNA occurs as a double strand where each A is paired with a T and vice versa,
and each C is paired with a G and vice versa. The reverse complement of a DNA sequence is
formed by reversing the letters, interchanging A and T and interchanging C and G. Thus the
reverse complement of ACCTGAG is CTCAGGT.
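The three operations used in this exercise (reverse, complement and reverse complement) can be written directly from the definition above. This is a small sketch with hypothetical function names; it handles only the four unambiguous bases and leaves any other character unchanged.

```python
# Translation table that swaps A<->T and C<->G (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Return the sequence read backwards."""
    return seq[::-1]


def complement(seq):
    """Return the base-by-base complement, in the same order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the complement of the sequence, read backwards."""
    return complement(seq)[::-1]


sequence = "ACCTGAG"  # example sequence from the introduction
print(reverse(sequence))
print(complement(sequence))
print(reverse_complement(sequence))
```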

Tools
• Reverse Complement (https://www.bioinformatics.org/sms/rev_comp.html)
• Revseq (http://www.bioinformatics.nl/cgi-bin/emboss/revseq)
• Revcomp (http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html)

Procedure
• Retrieve a DNA sequence from a nucleotide sequence database in FASTA format.
• Open the Reverse Complement tool and convert the DNA sequence into its:
  - Reverse
  - Complement
  - Reverse-complement
• Submit your query sequence and understand the difference between these outputs and the
computational mechanism behind the reverse complement.

ORF Prediction
DNA (deoxyribonucleic acid) stores genetic information as a genetic code using the bases adenine (A),
guanine (G), cytosine (C) and thymine (T). Each base is bonded to a sugar and a phosphate
group to form a nucleotide. During transcription, DNA is transcribed to mRNA. Three
consecutive nucleotides that code for a particular amino acid during translation are called a codon.
The region of a nucleotide sequence from the start codon (ATG) to a stop codon (TAA,
TAG, TGA) is called an Open Reading Frame (ORF). Depending on the starting point, there
are six possible ways of translating any nucleotide sequence into an amino acid sequence,
known as six-frame translation. These are called reading frames. Three reading frames are
possible from each strand of the DNA. The longest frame uninterrupted by a stop codon is
usually taken to be the correct frame.
By analyzing the ORF we can predict the possible amino acids that might be produced during
translation.
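The six-frame search described above can be sketched as a scan of each frame for an ATG followed by the first in-frame stop codon. This is an illustrative sketch only and not the algorithm of NCBI ORFfinder or getorf: the function names are hypothetical, only ATG is accepted as a start, ORFs that run off the end of the sequence without a stop are ignored, and positions on the reverse strand are reported relative to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_length=75):
    """Return (strand, frame, start, end, orf) tuples for all six frames.

    start and end are 0-based offsets on the scanned strand; end is
    exclusive and the ORF includes its stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_length:
                        orfs.append((strand, frame + 1, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


# Short hypothetical sequence; min_length is lowered so the toy ORF is reported.
for orf in find_orfs("CCATGAAATAGGG", min_length=6):
    print(orf)
```

The default `min_length=75` mirrors the default minimal ORF length quoted in the procedure below; the longest ORF can then be picked with `max(orfs, key=lambda o: len(o[4]))`.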
ORF Prediction Tools:
ORF Finder (http://www.bioinformatics.org/sms2/orf_find.html)
NCBI ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/)
getorf (http://www.bioinformatics.nl/cgi-bin/emboss/getorf)

Procedure
• Retrieve a DNA sequence from a nucleotide sequence database in FASTA format.
• Open NCBI ORFfinder.
• Enter your query sequence.
• Optional Parameters
  - Enter coordinates for a sub-range of the query sequence. The ORF search will
    apply only to the residues in the range. (Default: 1 to length of the sequence.)
  - Minimal ORF length (nt). The search will be restricted to ORFs with a
    length equal to or greater than the selected value. (Default: 75)
  - Genetic code table: Standard
  - ORF start codon to use: ATG only.
• Submit.
• Identify all the possible open reading frames in the sequence.
• Note the number of ORFs in each frame.
• Find out which frame contains the longest ORF, and report its length and location.

Translation (DNA to Protein Translation)


A gene is a segment of DNA that provides the instructions for making a protein. Proteins
have many different functions that influence our characteristics.
Tools:
EMBOSS Transeq (https://www.ebi.ac.uk/Tools/st/emboss_transeq/)
ExPASy Translate (https://web.expasy.org/cgi-bin/translate/dna2aa.cgi)
DNA to Protein Translation (http://bio.lundberg.gu.se/edu/translat.html)
Procedure
• Open the ExPASy Translate tool in a browser.
• Retrieve the gene sequence with accession number NM_203378.1 in FASTA format.
• Enter the retrieved query sequence in the ExPASy Translate tool.
• Optional Parameters
  o Output format
    - Verbose: Met, Stop, space between residues
    - Compact: M, -, no spaces
    - Includes nucleotide sequence
  o DNA Strands: forward and reverse
  o Genetic codes: Standard

Translate and retrieve the protein sequence in FASTA format.
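What a translation tool does with the standard genetic code can be sketched in a few lines. This is an illustrative sketch, not the ExPASy implementation: `translate` is a hypothetical function, the codon table is built from the standard code in the conventional T, C, A, G ordering, and stop codons are written as "-" to imitate the compact output format listed above.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0, stop_symbol="-"):
    """Translate one forward reading frame (frame = 0, 1 or 2)."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X = codon with unknown base
        protein.append(stop_symbol if aa == "*" else aa)
    return "".join(protein)


# Hypothetical short sequence: ATG AAA TAG
print(translate("ATGAAATAG"))
```

Translating the three frames of the reverse complement in the same way gives the remaining three frames of a six-frame translation.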


Results:

Additional Information:

1. http://www.bioinformatics.org/sms2/
2. https://www.biologicscorp.com/sms2/about.html
3. http://manuals.bioinformatics.ucr.edu/home/emboss

Practice Questions:
1. Perform a six-frame translation of any DNA sequence and do pairwise sequence
alignments between the six products obtained.
2. Generate the reverse complement of a given DNA sequence, align it with the
original DNA sequence and report your findings.
3. Retrieve any protein sequence from the UniProt database and assess its various properties
in Expasy under the primary sequence analysis category.
Evaluation Scheme:

S. No   Component                                             Max. Marks (20)   Marks Obtained
1       Understanding of different manipulations using DNA    6
2       ORF Prediction                                        3
3       Practice Questions                                    6
4       Results Presentation                                  3
5       Viva / MCQ                                            2
        Total                                                 20
Experiment No: Date:

DNA Assembly by Codon Code Aligner

Aim:
To assemble sample NGS raw data using CodonCode Aligner.

Introduction:
DNA sequence assembly is a process through which short DNA sequence fragments (called
reads or samples) are merged into a longer DNA sequence to reconstruct the original DNA
sequence. The longer sequence resulting from sequence assembly is called a 'contig' sequence.
A contig is a set of overlapping DNA segments that together represent a consensus region of
DNA.
During sequence assembly the short DNA fragments may also be aligned to a reference
sequence in order to see the differences between the contig sequence obtained and the
reference sequence.
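The idea of merging overlapping reads into a contig can be illustrated with a toy greedy assembler. This is a deliberately simplified sketch and is not the algorithm used by CodonCode Aligner: it assumes error-free reads that all come from the same strand, requires exact suffix-prefix overlaps, and the function names and the three reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical reads that overlap by four bases each.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Real assemblers additionally use base qualities, tolerate mismatches and consider reverse-complemented reads, which is why trimming and quality inspection precede assembly in the procedure below.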

Sequencing Output Format (FASTQ)


FASTQ format:
FASTQ format is a text-based format for storing both a biological sequence (usually
nucleotide sequence) and its corresponding quality scores.

A FASTQ file normally uses four lines per sequence.

• Line 1 begins with a '@' character and is followed by a sequence identifier and
an optional description.
• Line 2 is the raw sequence letters.
• Line 3 begins with a '+' character and is optionally followed by the same sequence
identifier (and any description) again.
• Line 4 encodes the quality values for the sequence in Line 2, and must contain the
same number of symbols as letters in the sequence.
Example

@SEQ_ID|Description
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
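The quality line of the example can be decoded into numeric Phred scores, which is what the base quality view of an assembly program displays. The sketch below assumes the Sanger convention (ASCII code minus 33), which the manual does not state explicitly; other encodings use a different offset. `phred_scores` is a hypothetical helper name.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes offset 33)."""
    return [ord(ch) - offset for ch in quality_line]


quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred_scores(quality)

# A Phred score Q corresponds to an error probability of 10^(-Q/10).
error_probabilities = [10 ** (-q / 10) for q in scores]

print("lowest score :", min(scores))
print("highest score:", max(scores))
print("mean score   :", round(sum(scores) / len(scores), 2))
```

Low-scoring bases at the ends of a read are the ones removed by the Trim step in the procedure.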

CodonCode Aligner
CodonCode Aligner is a Windows and Mac OS X based DNA sequence assembly, sequence
alignment and editing software.
URL: https://www.codoncode.com/aligner/
Features of CodonCode Aligner:
• Chromatogram editing, end clipping, and vector trimming, sequence assembly and
contig editing.
• Aligning cDNA against genomic templates, sequence alignment and editing.
• Alignment of contigs to each other with ClustalW, MUSCLE, or built-in algorithms.
• Mutation detection, including detection of heterozygous single-nucleotide
polymorphisms and analysis of heterozygous insertions and deletions.
• Restriction analysis (find and view restriction cut sites), trace sharpening, and support
for Phred, Phrap, ClustalW, and MUSCLE.

Procedure:
1. Start -> Programs -> CodonCode Aligner
On Windows, opening CodonCode Aligner will open the main application window
("root window"), which contains the application menu and all other windows opened by
Aligner.
2. Create a new project -> OK
Project Window options
   1. Save a project with new
   2. Import Samples
   3. Import a folder of Samples

Import files from Sample -> filename.fq or filename.fastq (Figure 1)

Figure 1. Raw reads in FastQ format.

Figure 2. Project loaded with samples. *Contents = No. of reads in file.

3. Menu -> View -> Bases

Figure 3. Read Base Window

4. Menu -> View -> Quality

Figure 4. Base Quality Window

5. Unassembled Samples (left click) -> Trim

Figure 5. Read Trim details

6. Unassembled Samples (left click) -> Trim Vector

7. Unassembled Samples (left click) -> Assemble

Figure 6. Steps in Sequence Assembly

Assembly Result

Figure 6. Assembly Result window

8. Click on Contigs to view assembly details or Contig (left click) -> View contigs

Figure 7. Reads Overlap and contig sequence

9. Contig -> Contig information

Figure 8. Contig Sequence Base statistics

10. Contig (left click) -> View Bases

Figure 9. Assembled Contig Sequence for further downstream analysis

Results:
Additional Information:
The assembled contig sequence can be searched against the NCBI nr database to annotate
it functionally.
Other Sequence Assembly Tools
CAP3 (http://doua.prabi.fr/software/cap3)
Trinity (https://github.com/trinityrnaseq/trinityrnaseq/wiki)

Practice Questions:
1. Download a Raw data file from NCBI and find out its quality and report its assembly
details.
2. For any three contigs perform annotation.

Evaluation Scheme:

S. No   Component                          Max. Marks (20)   Marks Obtained
1       Understanding of DNA assembly      5
2       Contig Statistics                  3
3       Practice Questions                 6
4       Results Presentation               3
5       Viva / MCQ                         3
        Total                              20
Experiment No: Date:

Sequence Similarity Search Using BLAST

Aim:
To search for sequences similar to a given query sequence using the sequence similarity
search tool BLAST and its variants.

Introduction:

Similarity is a measure of how alike two sequences are, whereas homology is a conclusion
about the evolutionary relatedness of two sequences based on an assessment of their
similarity. Two sequences can be said to be 68% similar, but these same two sequences are
either homologous or not. There is no degree of homology; two sequences are either related
or not. At the next bioinformatics seminar you attend, you can correct the misinformed
graduate student who attempts to state that protein X is 23% homologous to protein Y.

The Basic Local Alignment Search Tool (BLAST) is a program that can detect sequence
similarity between a Query sequence and sequences within a database. The ability to detect
sequence homology allows us to identify putative genes in a novel sequence. It also allows us
to determine if a gene or a protein is related to other known genes or proteins. BLAST is
popular because it can quickly identify regions of local similarity between two sequences.
More importantly, BLAST uses a robust statistical framework that can determine if the
alignment between two sequences is statistically significant.
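To make the notion of a local alignment score concrete, the sketch below computes the best local score between two sequences by Smith-Waterman dynamic programming. Note that this is not how BLAST works internally: BLAST uses a faster seed-and-extend heuristic, and Smith-Waterman is swapped in here only because it is the exact method that the heuristic approximates. The scoring values (match +2, mismatch -1, gap -2) and the example sequences are arbitrary assumptions, not BLAST defaults.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    cols = len(b) + 1
    prev = [0] * cols  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


# Hypothetical sequences sharing the local region "ACG".
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Replacing the fixed match and mismatch values with a lookup in a substitution matrix gives the protein scoring scheme discussed next.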

The measure of similarity between two sequences is captured in BLAST by a scoring scheme
based on scoring matrices. Scoring matrices are empirical weighting schemes that
are used in comparing sequences and capture information about residue conservation, residue
frequency, and evolutionary models. The two most commonly used families of substitution matrices are
the BLOSUM and PAM scoring matrices. The PAM (point accepted mutation) scoring
matrices are based on global alignments of closely related proteins. The PAM1 matrix is
calculated by looking at the amino acid substitutions that occur in proteins with no more than
1% divergence (1 change per 100 amino acids). In an effort to model evolutionary changes,
the other PAM matrices are extrapolated from PAM1 by matrix multiplication. The
BLOSUM (blocks substitution matrix) matrices are based on local alignments; for example, the
BLOSUM62 matrix is calculated from comparisons of sequences with no more than 62% identity.

There are five main BLAST programs to choose from, along with two specialized variants
(PSI-BLAST and PHI-BLAST).

• blastn: Search nucleotide database with nucleotide query
• blastp: Search protein database with protein query
• blastx: Search protein database with translated nucleotide query
• tblastn: Search translated nucleotide database with protein query
• tblastx: Search translated nucleotide database with translated nucleotide query
• PSI-BLAST (position-specific iterative BLAST) is an iterative version of BLAST. It
aligns the high-scoring hits from the initial round of BLAST to construct a profile for the
hits, then uses this profile for the next iteration of BLAST. PSI-BLAST is especially
useful for finding remote homologs and entire protein sequences.
• PHI-BLAST stands for pattern-hit initiated BLAST. The program uses an input
sequence and a defined pattern to query a protein database. The pattern is defined in
PROSITE format (http://ca.expasy.org/prosite/) and is used as the seed for the
alignment. The pattern is used instead of the words that are usually generated for
seeding alignments in BLASTP.
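A PROSITE-format pattern of the kind PHI-BLAST accepts can be understood by rewriting it as a regular expression. The sketch below covers only the common syntax elements (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, and `<` or `>` anchors at the ends of an element); `prosite_to_regex` is a hypothetical helper, the test sequence is invented, and rarer constructs are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    pattern = "".join(pattern.split()).rstrip(".")  # drop whitespace and final '.'
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):    # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)
# Hypothetical peptide used only to show a match.
print(bool(re.search(regex, "MKAGVAAAAKL")))
```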

There are also a number of database types available, including:

• swissprot and pdb (Protein Data Bank): protein databases
• est (expressed sequence tag): short sequence tags used to identify gene transcripts
• refseq: a set of reference sequences, including DNA, RNA, and protein products
• nr (non-redundant): all non-redundant GenBank CDS translations, RefSeq Proteins,
PDB, SwissProt, PIR and PRF

Procedure to be followed in this experiment:

• Download a nucleotide and a protein query sequence from a database in FASTA format,
save them and understand the queries you have downloaded.
• Visit the NCBI BLAST page (https://blast.ncbi.nlm.nih.gov/Blast.cgi).
• Search for sequences similar to your query sequence using all the types of BLAST
programs discussed above.
• Go to the PROSITE and PSSM websites and understand their patterns and matrices
respectively. Report on this on the working page of the manual.
• Perform PHI-BLAST and PSI-BLAST searches using appropriate query sequences.
• Use different scoring schemes / matrices for the same data and observe the differences.

Points to be reported as results

• E-value, Score, Query Coverage, Identity, alignment and all other related aspects of
all the BLAST programs used.
• Your understanding and inference of this exercise for each individual program.
• Use both default and user-defined settings to better understand the results.
• Explanation of the CDD (conserved domain) prediction for each BLAST type.

Results
Blastp
Blastn

Blastx
tBlastn

tBlastx
PSI-Blast

PHI-Blast
Additional Information:
Blast (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
Other Similarity Search tool: FASTA (https://www.ebi.ac.uk/Tools/sss/)

Practice Questions:
1. Comment on the conserved domain present in Q8NFM4.
2. Find the gene sequences of Mouse origin similar to U80226.1.
3. Write the function of C7AE31. Find its orthologous proteins.
4. Write the function of P80404. Find its paralogous proteins.
5. Find whether the given pattern is present in the following protein. Also find its
homologous proteins present in the SWISS-PROT database possessing a similar pattern.
Pattern: [LIVMFYWCS]-[LIVMFYWCAH]-x-D-[ED]-[IVA]-x(2,3)-[GAT]-
[LIVMFAGCYN]-x(0,1)-[RSACLIH]-x-[GSADEHRM]-x(10,16)-[DH]-[LIVMFCAG]-
[LIVMFYSTAR]-x(2)-[GSA]-K-x(2,3)-[GSTADNV]-[GSAC]
6. Find the structurally solved homologous proteins for P80404. Comment on the results.
Evaluation Scheme:

S. No   Component                                   Max. Marks (20)   Marks Obtained
1       Understanding of Blast Program              2
2       Execution of types of Blast programs (7)    7
3       Results of all Blast types (7)              7
4       Answers to practice questions               2
5       Viva / MCQ                                  2
        Total                                       20
Experiment No: Date:

Multiple Sequence Alignment & Phylogeny using MEGA

Aim:
To align multiple sequences to find conserved regions and to understand evolutionary
relationship among these sequences by phylogenetic tree construction using different
methods.

Introduction:
Multiple Sequence Alignment:
A multiple sequence alignment (MSA) is a sequence alignment of three or more
biological sequences such as protein, DNA, or RNA. Typically it is implied that the set of
sequences share an evolutionary relationship, which means they are all descendants of a
common ancestor. Conserved regions in the alignment may correspond to functional, structural, or
evolutionary relationships between the sequences. Alignments can reflect the degree of evolutionary change
between sequences that are descended from a common ancestor. There are different tools
following different methods/algorithms to perform MSA. Widely used MSA tools
include ClustalW, T-Coffee, MUSCLE, etc.
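Once sequences are aligned, conserved regions can be located by checking each alignment column. The following sketch assumes the alignment has already been produced (for example by ClustalW) and is supplied as equal-length strings with "-" for gaps; `conserved_columns` and the three toy sequences are hypothetical.

```python
def conserved_columns(aligned):
    """Return 0-based indices of columns where every sequence has the same residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for i in range(length):
        first = aligned[0][i]
        # A column made of gaps is not counted as conserved.
        if first != "-" and all(seq[i] == first for seq in aligned):
            conserved.append(i)
    return conserved


# Hypothetical three-sequence alignment.
alignment = [
    "ATG-CA",
    "ATGACA",
    "ATGTCT",
]
print(conserved_columns(alignment))
```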

Phylogenetic Analysis:
A phylogenetic tree is an estimate of the relationships among taxa (or sequences) and
their hypothetical common ancestors. Originally, the purpose of most molecular phylogenetic
trees was to estimate the relationships among the species represented by those sequences,
now expanded to include understanding the relationships among the sequences themselves
without regard to the host species, inferring the functions of genes that have not been studied
experimentally, and elucidating mechanisms that lead to microbial outbreaks among many
others. Building a phylogenetic tree requires four distinct steps:
1. Identify and acquire a set of homologous DNA or protein sequences,
2. Multiple sequences alignment,
3. Estimate a tree from the aligned sequences, and
4. Present that tree in such a way as to clearly convey the relevant information to others.
The evolutionary history inferred from phylogenetic analysis is usually depicted as
branching, treelike diagrams that represent an estimated pedigree of the inherited
relationships among molecules (“gene trees”), organisms, or both. These trees can be
rooted or unrooted, and scaled or unscaled. Various methods of phylogenetic tree
construction exist, such as distance-based methods (UPGMA, WPGMA, FM and NJ), the
maximum parsimony method and the maximum likelihood method. Numerous programs
construct phylogenetic trees using these methods; prominent among them are
PHYLIP, PAUP, MEGA, etc.
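
As an illustration of the distance-based category, the following is a minimal sketch of UPGMA clustering in Python. The function name `upgma`, the input layout (a dict of pairwise distances keyed by label pairs) and the toy distances are assumptions made for this example. It reports the tree topology only, without branch lengths, and is not a description of how MEGA or PHYLIP implement the method.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA and return the topology as a Newick string.

    names: list of taxon labels (hypothetical example input)
    dist:  dict mapping (label_a, label_b) -> pairwise distance
    """
    size = {n: 1 for n in names}                      # cluster label -> number of taxa
    d = {frozenset(k): v for k, v in dist.items()}    # order-independent lookup
    while len(size) > 1:
        # Join the two clusters separated by the smallest distance.
        a, b = min(combinations(sorted(size), 2), key=lambda p: d[frozenset(p)])
        new = f"({a},{b})"
        for c in size:
            if c not in (a, b):
                # Size-weighted average = arithmetic mean over all taxon pairs.
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size)) + ";"


toy = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
       ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
print(upgma(["A", "B", "C", "D"], toy))
```

UPGMA assumes a constant rate of evolution along all lineages; neighbor-joining, used later in this experiment, does not make that assumption.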

MEGA (Molecular Evolutionary Genetics Analysis):


MEGA is an integrated tool for conducting sequence alignment, inferring
phylogenetic trees, estimating divergence times, mining online databases, estimating rates of
molecular evolution, inferring ancestral sequences, and testing evolutionary hypotheses.
MEGA is used by biologists in a large number of laboratories for reconstructing the
evolutionary histories of species and inferring the extent and nature of the selective forces
shaping the evolution of genes and species.

Download the latest version of the MEGA application and install it on your computer.

Procedure to be followed in this experiment:


Creating Multiple Sequence Alignments

1. Download ten COI homologous nucleotide / protein sequences from different species
and save them in fasta format as filename.fasta.
2. Open the MEGA application and launch the Alignment Explorer by selecting Alignment ->
Alignment/CLUSTAL.
3. A window will appear asking you either to a) Create a new alignment, b) Open a
saved alignment session, or c) Retrieve sequences from a file. Select the first option,
“create a new alignment”.
4. Copy and paste unaligned sequences from the text file to the Alignment Explorer.
5. In the Alignment Explorer highlight all the sequences by selecting Edit -> Select All.
6. Align the highlighted sequences by selecting Alignment -> Align by ClustalW.
7. Save the current alignment as an alignment session file by selecting Data -> Export ->
Save. This stores the session in a file with the extension “.mas” (e.g. coi_alignment.mas)
so that it can be restored for future editing.
8. Save the current alignment as a MEGA file by selecting Data -> Export -> MEGA
file. This will allow the current alignment to be analyzed by MEGA.
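
Before pasting sequences into the Alignment Explorer it can help to check that the FASTA file from step 1 contains the expected number of records. The following is a minimal sketch of a FASTA reader in plain Python; the function name `parse_fasta` and the example records are assumptions for illustration, and the first word of each header line is taken as the record name.

```python
def parse_fasta(text):
    """Return {record_name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []            # start a new record
        elif name is not None:
            records[name].append(line)    # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


example = ">sp1 COI\nACGT\nAC\n>sp2\nTTGA\n"
for rec_name, seq in parse_fasta(example).items():
    print(rec_name, len(seq))
```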

Estimating Evolutionary Distances

Activating a MEGA file


1. Activate the MEGA file you just saved, by clicking on it.
2. Select the desired data file to activate.
3. The Sequence Data Explorer will open.

Compute the proportion of amino acid differences between each pair of sequences
1. Select Distance -> Compute Pairwise command to display the distance analysis
preferences dialog box.
2. In the Distance Options tab, click on the green box in the Models pulldown section
and then select the Amino Acid -> p-distance option.
3. Click “Compute” to begin the computation.
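
The p-distance is simply the proportion of compared sites at which two aligned sequences differ. A minimal sketch in Python follows; the function name `p_distance` is an assumption, and skipping any site where either sequence has a gap ("-") is one possible gap treatment chosen for this example, not necessarily the setting selected in MEGA.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            continue                  # ignore sites with a gap in either sequence
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


print(p_distance("MKT-LV", "MRTALI"))   # 2 differences over 5 compared sites
```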

Compute distances and compare them using other methods


1. Select Distance -> Compute Pairwise command. Use the Models pulldown to select
the Amino-acid -> Poisson correction method. Now click “OK” to begin the
computation.
2. Repeat steps 2-3 of the previous section, this time selecting the JTT model, to compute
the JTT model distance.
3. You now have open results windows containing the distances estimated by three
different methods, which you can now compare.
After you’ve compared the results, close each one of the windows displaying the
distance matrices.
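
To see why the corrected distances are larger than the p-distance, recall that the Poisson correction estimates the number of substitutions per site as d = -ln(1 - p), which accounts for multiple substitutions at the same site. A small sketch, with the function name `poisson_distance` assumed for illustration:

```python
import math


def poisson_distance(p):
    """Poisson-corrected distance from a proportion of differing sites p."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p)


for p in (0.05, 0.20, 0.40):
    print(p, round(poisson_distance(p), 4))
```

For small p the two values are nearly equal; the gap widens as the sequences diverge.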

Constructing Trees
1. Activate the data file that you want to analyze by clicking on it.
2. Select the Phylogeny -> Construct Tree -> Neighbor-Joining command to display
the analysis preferences dialog box.
3. In the Options Summary tab, click the Model pulldown (found in the Substitution
Model section) and then select the Amino Acid -> p-distance option. Click “Compute” to
begin the calculation. A progress indicator will appear briefly, then the tree will be
displayed in the Tree Explorer.
4. To select a branch, click on it with the left mouse button. If you click on a branch
with the right mouse button, you will get a small options menu that will let you flip
the branch and perform various other operations on it. To edit the OTU labels,
double-click on them.
5. Change the branch style by selecting the View->Tree/Branch Style command from
the Tree Explorer menu.
6. At this time the cursor assumes a triangular shape instead of the diamond shape.
Press M and the mirror image of the original tree is displayed instantly. Press M
again and the tree reverts to its original shape.
7. Select the View -> Topology Only command from the Tree Explorer menu and the
branching pattern (without actual branch lengths) is displayed on the screen.

Constructing a maximum parsimony tree by using the branch-&-bound search option.


1. Select Phylogeny -> Construct Tree -> Maximum Parsimony command. In the
resultant preferences window, choose the Max-Mini Branch-&-Bound Search option
in the MP Tree Search Options tab.
2. Click the “OK” button to accept the defaults for the other options and begin the
calculation. A progress window will appear briefly and the tree will be displayed in
Tree Explorer.
3. Now print this tree.
4. Close the Tree Explorer.
5. Compare the NJ and MP trees.

Test the reliability of a Tree Obtained.


1. Select the Phylogeny -> Bootstrap Test of Phylogeny->Neighbor-Joining from the
main application menu.
2. An analysis preferences dialog box appears. Use the Models pulldown to ensure that
Amino-Acid -> p-distance model is selected. Note that only the Amino-Acid
submenu is available.
3. Click “Compute” to accept the default values for the rest of the options and compute
the tree.
4. Once the computation is complete, the Tree Explorer appears and displays two tree
tabs. The first is the original Neighbor-Joining tree and the second is the Bootstrap
consensus tree.
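
The bootstrap test rebuilds the tree many times from pseudo-replicate alignments, each made by sampling alignment columns with replacement. The resampling step alone can be sketched as below; the function name `bootstrap_alignment` and the toy alignment are assumptions, and tree building on each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict of name -> aligned sequence (all the same length)
    rng:       a random.Random instance, so that runs can be reproduced
    """
    length = len(next(iter(alignment.values())))
    # Draw column indices with replacement; the same columns are used for every sequence.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


toy_alignment = {"sp1": "MKTAYL", "sp2": "MRTAYI", "sp3": "MKSAFL"}
replicate = bootstrap_alignment(toy_alignment, random.Random(42))
for name, seq in replicate.items():
    print(name, seq)
```

The bootstrap value on a branch is the percentage of replicate trees in which that grouping reappears.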

Points to be reported as results


• Tabulate the details of the sequence data taken up for study, such as accession number,
species name, length of sequence, etc.
• Highlight the conserved regions in the MSA and write an inference about them.
• Report the distance matrix table, variable sites, singleton sites, etc. and the inference
drawn from them.
• Construct at least one tree with each of the three methods (one in each category) and
display the final and bootstrap trees.
• Report the inference for each individual tree and a comparison between the three
trees.
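
The site categories to be reported can be cross-checked with a short script. The sketch below classifies alignment columns as conserved, variable or singleton. It assumes the usual definitions (a variable site has at least two residue types; a singleton site is a variable site in which at most one type occurs more than once) and ignores gap characters, which is a choice made for this example. The function name `classify_sites` is hypothetical.

```python
from collections import Counter


def classify_sites(sequences):
    """Return 0-based column indices for conserved, variable and singleton sites."""
    result = {"conserved": [], "variable": [], "singleton": []}
    for index, column in enumerate(zip(*sequences)):
        counts = Counter(residue for residue in column if residue != "-")
        if len(counts) <= 1:
            result["conserved"].append(index)
            continue
        result["variable"].append(index)
        repeated_types = sum(1 for n in counts.values() if n > 1)
        if repeated_types <= 1:
            result["singleton"].append(index)
    return result


aligned = ["ACGT", "ACGA", "ATGA", "ATCT"]
print(classify_sites(aligned))
```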

Results:
Additional Information:
PHYLIP - PHYLIP is a free package of programs for inferring phylogenies.
(http://evolution.genetics.washington.edu/phylip.html)
Clustal Omega - Clustal Omega is a new multiple sequence alignment program that uses
seeded guide trees and HMM profile-profile techniques to generate alignments between three
or more sequences. (https://www.ebi.ac.uk/Tools/msa/clustalo/)

Practice Questions:
Download the following sequences from the UniProt protein sequence database and report the
conserved sequences, domains and motifs present.
UniProt entries: P04247, P02192, P02144, P02196, P68082, P04248 and P02173
Evaluation Scheme:

S. No   Component                                                 Max. Marks (20)   Marks Obtained
1       Understanding of MSA & Phylogeny                          2
2       Tabulation of data                                        1
3       MSA and Conserved regions                                 1
4       Phylogeny tree construction one each in three methods     6
5       Inference on Bootstrap trees (3)                          3
6       Report on different sites, etc.                           3
7       Comparison of trees from different methods                2
8       Viva / MCQ                                                2
        Total                                                     20
Experiment No: Date:

Protein Secondary Structure Predictions


Aim:
To predict the secondary structure of proteins
Introduction:
Proteins are made up of long chains of amino acids that fold into a three-dimensional structure.
Amino acids are linked to each other by peptide bonds. A peptide bond is formed when the
carboxyl group of one amino acid is linked to the amino group of another through a
covalent bond; a molecule of water is released during this reaction. A short sequence of amino
acids held together by peptide bonds is called a peptide, and each amino acid in a peptide is
called a residue. Every peptide has an N-terminal residue at one end and a C-terminal residue
at the other. The N-terminus is the start of a protein and carries a free amine group (-NH2);
the C-terminus is the end of a protein and carries a free carboxyl group (-COOH).
Protein secondary structures are stable local conformations of a polypeptide chain.
They are critically important in maintaining a protein's three-dimensional structure. The highly
regular and repeated structural elements include α-helices and β-sheets. It has been estimated
that nearly 50% of the residues of a protein fold into either α-helices or β-strands. An α-helix
is a spiral-like structure with 3.6 amino acid residues per turn. The structure is stabilized by
hydrogen bonds between residues i and i + 4. Proline normally does not occur in the middle of
helical segments, but can be found at the end positions of α-helices. A β-sheet consists of
two or more β-strands having an extended zigzag conformation. The structure is stabilized by
hydrogen bonding between residues of adjacent strands, which may actually be long-range
interactions at the primary structure level. β-Strands at the protein surface show an alternating
pattern of hydrophobic and hydrophilic residues; buried strands tend to contain mainly
hydrophobic residues.
Protein secondary structure prediction plays a vital role as an intermediate step in
solving tertiary structures, which in turn provide insight into protein function. It refers to the
prediction of the conformational state of each amino acid residue of a protein sequence as one
of three possible states, namely helix, strand, or coil, denoted H, E, and C,
respectively. The prediction is based on the fact that secondary structures have a regular
arrangement of amino acids, stabilized by hydrogen bonding patterns. This structural
regularity serves as the foundation for prediction algorithms. Predicting protein secondary
structures has a number of applications. It can be useful for the classification of proteins and
for the separation of protein domains and functional motifs. Secondary structures are much
more conserved than sequences during evolution. As a result, correctly identifying secondary
structure elements (SSE) can help to guide sequence alignment or improve existing sequence
alignment of distantly related sequences. In addition, secondary structure prediction is an
intermediate step in tertiary structure prediction as in threading analysis.

Secondary Structure Prediction algorithms


Chou–Fasman algorithm:
The Chou-Fasman method was among the first secondary structure prediction algorithms
developed and relies predominantly on probability/propensity parameters determined from
the relative frequencies of each amino acid's appearance in each type of secondary structure.
The Chou-Fasman method (Chou and Fasman, 1978) combines statistics-based and rule-based
methods. It is 56-60% accurate in predicting secondary structures.

Helix Prediction Rules


1. A group of 4 helix-favoured residues in a window of 6 residues will nucleate a helix.
2. Extend the helix in both directions until a helix breaker (Pa<1) is reached.
3. Proline can occur among the first 3 N-terminal residues.
4. P, D, E, H and R can be incorporated at the terminal ends.
5. Any segment with Pa≥1.05 and Pa>Pb is predicted as helical.

Sheet Prediction Rules


1. A cluster of 3 beta formers out of 5 residues will nucleate a beta-sheet.
2. Extend the sheet in both directions until a beta-breaker (Pb<1.00) is reached.
3. Any segment with Pb≥1.05 and Pb>Pa is predicted as a beta-sheet.
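
The nucleation and extension rules can be sketched in a few lines of Python. The example below covers helix rules 1 and 2 only. The propensity table `HELIX_P` lists Pa values as they are commonly quoted from Chou and Fasman (1978) and should be verified against the original publication before any real use; treating "helix-favoured" as Pa > 1.00 and the function names are assumptions made for this sketch. Sheet prediction follows the same pattern with a Pb table, 3 formers and a window of 5.

```python
# Helix propensities (Pa); values as commonly tabulated, to be verified before use.
HELIX_P = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16, "F": 1.13, "Q": 1.11,
    "W": 1.08, "I": 1.08, "V": 1.06, "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83,
    "S": 0.77, "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}


def helix_nuclei(seq, table=HELIX_P, window=6, needed=4):
    """Rule 1: start indices of windows holding enough helix-favoured residues."""
    starts = []
    for i in range(len(seq) - window + 1):
        favoured = sum(1 for aa in seq[i:i + window] if table[aa] > 1.0)
        if favoured >= needed:
            starts.append(i)
    return starts


def extend_helix(seq, start, table=HELIX_P, window=6):
    """Rule 2: grow a nucleus in both directions until a breaker (Pa < 1) is met."""
    left, right = start, start + window - 1
    while left > 0 and table[seq[left - 1]] >= 1.0:
        left -= 1
    while right < len(seq) - 1 and table[seq[right + 1]] >= 1.0:
        right += 1
    return left, right          # inclusive 0-based segment boundaries


peptide = "GPGAELMKAGPG"
print(helix_nuclei(peptide))
print(extend_helix(peptide, 3))
```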

GOR method:
The GOR method was published by Garnier, Osguthorpe, and Robson in 1978 and
was one of the first successful methods to predict protein secondary structure from amino
acid sequence. The GOR method (http://fasta.bioch.virginia.edu/fasta_www/garnier.htm) is
also based on the “propensity” of each residue to be in one of four secondary structural
conformational states: helix (H), strand (E), turn (T), and coil (C). However, unlike
Chou-Fasman, the GOR method takes into account not only the propensities of individual amino
acids to form particular secondary structures, but also the conditional probability of the amino
acid to form a secondary structure given that its immediate neighbours have already formed
that structure. It examines a window of seventeen residues and sums up the propensity
scores of all residues for each of the four states, resulting in four summed values. The
highest-scoring state defines the conformational state of the centre residue in the window (ninth
position). The GOR method has been shown to be more accurate than Chou-Fasman because
it takes the neighbouring effect of residues into consideration.
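
The windowed summation described above can be sketched as follows. The real GOR parameter tables are not reproduced here: `toy_info` is a made-up table used only to show the mechanics, and the function name `gor_predict` and the data layout (state -> {(offset, residue): score}) are assumptions for this example.

```python
def gor_predict(seq, info, half_window=8):
    """Assign each residue the state with the highest summed window score.

    info: dict state -> {(offset, residue): score}; offsets run from
          -half_window to +half_window around the centre residue (17 positions).
    """
    states = sorted(info)
    prediction = []
    for i in range(len(seq)):
        totals = {}
        for state in states:
            total = 0.0
            for offset in range(-half_window, half_window + 1):
                j = i + offset
                if 0 <= j < len(seq):            # positions beyond the ends add nothing
                    total += info[state].get((offset, seq[j]), 0.0)
            totals[state] = total
        prediction.append(max(states, key=lambda s: totals[s]))
    return "".join(prediction)


# Hypothetical scores for illustration only, NOT the published GOR parameters.
toy_info = {
    "H": {(0, "A"): 1.0, (-1, "A"): 0.5, (1, "A"): 0.5},
    "E": {(0, "V"): 1.0, (-1, "V"): 0.5, (1, "V"): 0.5},
    "T": {},
    "C": {(0, "G"): 1.0},
}
print(gor_predict("AAAGVVV", toy_info))
```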

Neural Network (NN) based method:


PSI-PRED
PSI-BLAST based secondary structure PREDiction (PSIPRED) is a method for predicting
protein secondary structure that applies artificial neural networks, a machine learning
technique, to sequence profiles generated by PSI-BLAST.

The prediction method or algorithm is split into three stages: generating a sequence
profile, predicting initial secondary structure, and filtering the predicted structure.

Steps:

Step I: Get PSI-BLAST position specific matrices

Step II: Generate patterns of window size 5 for first sequence-to-structure network

Step III: Run SNNS (Neural Network SNNS: Second level-Structure-to-structure net) and
analyze output

Step IV: Filtering by second structure-to-structure network

Step V: Generate patterns of window size 5 for second structure-to-structure network

Step VI: Run SNNS and analyze output
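
Generating window patterns means sliding a fixed-size window along the profile and concatenating the rows inside it into one input vector per residue. A minimal sketch, assuming the profile is a list of per-residue score rows and that positions beyond the sequence ends are padded with zeros; the function name `window_patterns` and the tiny two-column profile are illustrative only (a real PSI-BLAST profile has 20 columns per residue).

```python
def window_patterns(profile, window=5):
    """Return one flattened input vector per residue from a sliding window."""
    half = window // 2
    width = len(profile[0])
    padding = [[0.0] * width for _ in range(half)]
    padded = padding + [list(row) for row in profile] + padding
    patterns = []
    for i in range(len(profile)):
        flat = []
        for row in padded[i:i + window]:   # the window is centred on residue i
            flat.extend(row)
        patterns.append(flat)
    return patterns


toy_profile = [[1, 0], [0, 1], [1, 1]]
for vector in window_patterns(toy_profile, window=3):
    print(vector)
```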


Figure 10. Neural Network Architecture.

NNvPDB

NNvPDB is another neural network based secondary structure prediction server with
PDB validation, developed by SRM University. NNvPDB predicts three states (Helix, Sheet
and Coil) and reports percent accuracy by comparing the prediction with similar PDB structures.
NNvPDB Methodology: The input sequence is subjected to Blastp against the PDB database and
a sub-database is created from the top 5 BLAST hits and their secondary structures. Each entry
in the sub-database is prepared and used to train the network. Once the network is trained, the
input sequence is prepared and queried against the trained network. The binary outputs of the
network are converted to secondary structure states based on a set of threshold values. The
structure assigned at each threshold value is validated against PDB, and the structure with the
highest accuracy is reported to the user along with the validation information.
Secondary Structure Prediction Servers:
• GOR4
(https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_gor4.html)
• SOPMA
(https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_sopma.html)
• Psipred (http://bioinf.cs.ucl.ac.uk/psipred/)
• JPred (http://www.compbio.dundee.ac.uk/jpred/)
• NNvPDB (http://bit.srmuniv.ac.in/cgi-bin/bit/cfpdb/nnsecstruct.pl)
Procedure:
Predict the possible secondary structure states of human myoglobin using the following
secondary structure prediction servers.
1. Open the following servers and perform secondary structure prediction:
a. CFSSP
b. GOR4
c. Psipred
d. NNvPDB

2. Analyse the results by comparing the predictions from all the methods and report the best
method.
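
When comparing servers, two simple numbers are useful: the percentage of residues assigned to each state by a method, and the fraction of positions at which two methods agree. A small sketch follows; the function names and the example strings are made up for illustration, and each server output is assumed to have been reduced to an H/E/C string of the same length as the sequence.

```python
from collections import Counter


def composition(prediction, states="HEC"):
    """Percentage of residues assigned to each state."""
    counts = Counter(prediction)
    return {s: 100.0 * counts[s] / len(prediction) for s in states}


def agreement(pred_a, pred_b):
    """Fraction of positions where two predictions give the same state."""
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must cover the same residues")
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)


method_1 = "HHHCCEE"   # hypothetical output of one server
method_2 = "HHCCCEE"   # hypothetical output of another server
print(composition(method_1))
print(agreement(method_1, method_2))
```

If an experimentally determined structure is available, the same `agreement` function applied to a prediction and the observed states gives the per-residue accuracy.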

Results:
Additional Information:
Chou PY, Fasman GD (1978). "Prediction of the secondary structure of proteins from their
amino acid sequence". Adv Enzymol Relat Areas Mol Biol. 47: 45–148.
doi:10.1002/9780470122921.ch2
Garnier J, Gibrat JF, Robson B. (1996). GOR method for predicting protein secondary
structure from amino acid sequence. Methods Enzymol 266:540-53
Practice question:
1. Predict the secondary structure of Topoisomerase IA and Topoisomerase IB and
report the secondary structural variation between the two topoisomerases.

2. What are the various classifications of amino acids and how are they crucial to
secondary structure formation?

Evaluation Scheme:

S. No   Component                                                         Max. Marks (20)   Marks Obtained
1       Secondary Structure Prediction using Chou-Fasman and GOR method   5
2       Secondary Structure Prediction using Neural Network methods       5
3       Comparison of Results                                             3
4       Interpretation of the results                                     3
5       Answers to practice questions                                     2
6       Viva / MCQ                                                        2
        Total                                                             20
Experiment No: Date:

Tertiary Structure Prediction of Proteins

Aim:
To model the tertiary structure of query protein sequence using homology modelling.

Introduction:
Homology modelling or comparative modelling employs the use of available
homologous protein structure(s) to predict the unknown structure of a related amino acid
sequence. The principle governing this approach is that if two proteins share a high sequence
similarity, they are more likely to have very similar three-dimensional structures.
Structural information is always of great assistance in the study of protein function,
dynamics, interactions with ligands and other proteins. The "low-resolution" structure
provided by homology modelling contains sufficient information about the spatial
arrangement of important residues in the protein and may guide the design of new
experiments, for example site-directed mutagenesis. Even within the pharmaceutical industry
homology modelling can be valuable in structure-based drug discovery and drug design.

Principle / Steps - Homology Modelling


The principle governing this approach is that if two proteins share a high sequence similarity,
they are more likely to have very similar three-dimensional structures. If one of the protein
sequences has a known structure, then this structure can be superimposed onto the unknown
protein with a high degree of confidence. Protein sequences are more conserved than DNA
sequences and therefore carry greater evolutionary significance.
Development of a homology model is a multi-step process that can be summarized as
follows:
1. Template recognition and initial alignment

2. Alignment correction

3. Backbone generation

4. Loop modelling
5. Side-chain modelling

6. Model optimization

7. Model validation

Computational methods:
EasyModeller 4.0
EasyModeller is a GUI-based tool for homology modelling. The user provides a
predetermined sequence alignment of a template(s) and a target to allow the program to
calculate a model containing all of the heavy atoms (nonhydrogen atoms). The program
models the backbone using a homology derived restraint method, which relies on multiple
sequence alignment between target and template proteins to distinguish highly conserved
residues from less conserved ones. Conserved residues are given high restraints in copying
from the template structures. Less conserved residues, including loop residues, are given less
or no restraints, so that their conformations can be built in a more or less ab initio fashion.
The entire model is optimized by energy minimization and molecular dynamics procedures.
Tool Link: http://modellergui.blogspot.com/
Modelling using EasyModeller 4.0:
1. Go to the UniProt database; search for and download the amino acid sequence of the
human protein SSTR5 (Accession ID: P35346).
2. Perform a protein BLAST against Protein Data Bank proteins (pdb).
3. From the BLAST result page, choose a homologue that belongs to the same family as the
query and shows a high similarity score.
4. Download the respective 3D structure (.pdb) from PDB. Treat this structure as the
template and the BLAST query as the target.
5. Open EasyModeller 4.0 (which runs Python and Modeller in the back end) and upload the
query and templates in the corresponding boxes in the first tab.
6. Switch to the next tab and click on align templates. Save the generated result in a .ps file.
7. Go to the next tab to align the query with the template. Save the result in a .ps file.
8. Go to the ‘Build Model’ tab and click on the ‘Generate Model’ button. A prompt window
will ask for ‘Number of models to be generated’, ‘include heteroatom if present’ and
‘automatically refine the model’.
9. Choose the parameters accordingly and click the ‘OK’ button.
Swiss Model Server
SWISS-MODEL is a server for automated comparative modeling of three-dimensional (3D)
protein structures. SWISS-MODEL provides several levels of user interaction through its
World Wide Web interface: in the ‘first approach mode’ only an amino acid sequence of a
protein is submitted to build a 3D model. Template selection, alignment and model building
are carried out fully automatically by the server.
Modelling using Swiss Model Server:
1. Go to the Swiss Model Server - https://swissmodel.expasy.org/
2. Click on ‘Start Modelling’.
3. Paste the FASTA sequence in the space provided.
4. Click on ‘Search for Templates’
5. Select the best templates for modelling and build model.
6. Download the developed models.

Protein Structure Visualization using PyMOL:

PyMOL is a powerful tool used to visualize and analyze protein structures, DNA, and other
biological molecules. It is well written and easy to use, and has become very popular with
structural biologists.
When a molecule is loaded into PyMOL it can be controlled with the three mouse
buttons: left, middle (the clickable scroll wheel) and right. The left button rotates the
molecule, the middle button moves (translates) it, and the right button moves it along
the z axis, that is, zooms in and out.
Protein Structure Visualization using PyMOL
1. Open PyMOL. PyMOL normally starts with two windows: The Viewer Window and the
External (Tcl/Tk) GUI Window.
2. Load the protein PDB file into PyMOL. By default, the loaded PDB file structure will be
shown with line representation in PyMOL.
3. The buttons labelled A, S, H, L and C, shown as columns in each row, stand for
Action, Show, Hide, Label and Color respectively.
4. Go to Action -> Preset -> Pretty. The protein will be shown in cartoon representation,
with each helix in a different colour.
5. Load any one of the templates used for developing the model into the same window.
Repeat step 4 with the template.
6. Go to Action -> Align -> to molecule -> template.pdb. The RMSD value between the
query and the template will be displayed in the external window.
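
The RMSD reported in step 6 is the root-mean-square distance between matched atoms of the two structures. The sketch below computes it for two equal-length coordinate lists that are assumed to be already superimposed and matched atom by atom; PyMOL itself performs the sequence alignment and superposition first, which this example does not attempt. The function name `rmsd` and the coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
template = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model, template))
```

A lower RMSD between model and template indicates that the modelled backbone stays closer to the template structure.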

Protein Structure Validation

Ramachandran Plot:
The Ramachandran plot is a fundamental tool in the analysis of protein structures. The
Ramachandran plot is the 2D plot of the φ-ψ torsion angles of the protein backbone. It
provides a simple view of the conformation of a protein. The φ-ψ angles cluster into distinct
regions in the Ramachandran plot where each region corresponds to a particular secondary
structure.
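
Each point on the plot comes from two dihedral (torsion) angles: φ is defined by the atoms C(i-1), N(i), CA(i), C(i) and ψ by N(i), CA(i), C(i), N(i+1). A minimal sketch of the dihedral calculation from four atomic positions is shown below; the function name `dihedral` and the test coordinates are assumptions made for illustration.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond, in the range -180 to 180."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```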
Structure validation using Ramachandran Plot:
1. Go to the RAMPAGE Server - http://mordred.bioc.cam.ac.uk/~rapper/rampage.php
2. Load the protein PDB file and submit.
3. Ramachandran plot will be displayed. Analyse the results.

Result:
Modelling using EasyModeller 4.0:
Modelling using Swiss Model Server:

Structure Validation Report

Additional information:
1. Protein Structure Modelling Servers
a. I-TASSER (http://zhanglab.ccmb.med.umich.edu/I-TASSER/)
b. RaptorX (http://raptorx.uchicago.edu/)
c. ROBETTA (http://robetta.bakerlab.org/)
2. Protein Structure Visualisation Tools
a. RasMol (http://www.openrasmol.org/)
b. Chimera (https://www.cgl.ucsf.edu/chimera/)
c. VMD (http://www.ks.uiuc.edu/Research/vmd/)

Practice Questions:
1. What is protein folding?
2. What are the different methods used to predict a protein structure?
3. What is a sequence similarity search? How do sequence similarity searches help in protein
structure prediction?
4. What is ab initio modelling?
5. What is the difference between RMSD and RMSF?
6. Which two amino acids are frequently found in the disallowed regions of the
Ramachandran plot? Why?
7. What are the different structure visualisation tools available?
8. What are Sequence-dependent vs. sequence-independent methods in structure
comparison?

Evaluation scheme:

S. No   Component                              Max. Marks (20)   Marks Obtained
1       BLAST and Template selection           3
2       Modelling using EasyModeller 4.0       4
3       Modelling using Swiss Model Server     4
4       Model validation                       2
5       Interpretation of the results          3
6       Answers to practice questions          2
7       Viva / MCQ                             2
        Total                                  20
Experiment: Date:

Protein – Ligand Molecular Docking

Aim:
To perform protein-ligand and protein-protein molecular docking
Introduction:
Molecular docking is a key tool in structural molecular biology and computer-assisted drug
design. The term “docking” mostly refers to the interaction of a protein with another
molecule. There are several types of molecular docking for protein interactions:
• If a protein interacts with a ligand: protein-ligand interaction
• If a protein interacts with another protein: protein-protein interaction
• If a protein binds to DNA: protein-DNA interaction

Of these, protein-ligand docking is the most widely used technique. The
goal of protein-ligand docking is to predict the predominant binding mode(s) of a ligand with
a protein of known three-dimensional structure. Docking can be used to perform virtual
screening on large libraries of compounds, rank the results, and propose structural hypotheses
of how the ligands inhibit the target, which is invaluable in lead optimization.
Principle:
There are two basic components of molecular docking:
• The search algorithm: A search algorithm looks for the best docking pose as measured by
the scoring function. Since an exhaustive search is computationally infeasible, most
docking tools sample only a subset of poses, usually treating the ligand as flexible.
• The scoring function: A scoring function discriminates correct (experimentally
verified) docking poses from incorrect ones. It estimates the binding affinity between
ligand and receptor.

The aim of molecular docking is to evaluate the feasible binding geometries of a putative
ligand with a target whose 3D structure is known. The binding geometries, often called
binding modes or poses include, in principle, both the positioning of the ligand relative to the
receptor (ligand configuration) and the conformational state(s) of the ligand and the receptor.
The exploration of the configurational and conformational space (the sampling) and the
energetic evaluation of each discrete geometry (the scoring) are separable tasks.
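
The separation between sampling and scoring can be illustrated with a toy example. In the sketch below the scoring function is a simple 12-6 Lennard-Jones-like term summed over all ligand-receptor atom pairs, and the search is a greedy random walk over rigid translations of the ligand. Every name, parameter and coordinate here is an assumption made for illustration; real docking engines such as those used later in this experiment use far richer scoring terms and also sample ligand rotations and conformations.

```python
import math
import random


def score(ligand, receptor, r_min=4.0, depth=1.0):
    """Toy scoring function: lower is better, minimum of -depth per pair at r_min."""
    total = 0.0
    for lig_atom in ligand:
        for rec_atom in receptor:
            r = max(math.dist(lig_atom, rec_atom), 1e-6)   # avoid division by zero
            ratio = r_min / r
            total += depth * (ratio ** 12 - 2 * ratio ** 6)
    return total


def random_search(ligand, receptor, rng, steps=2000, max_shift=1.0):
    """Toy search: try random rigid translations and keep only improvements."""
    best = list(ligand)
    best_score = score(best, receptor)
    for _ in range(steps):
        shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
        trial = [tuple(c + s for c, s in zip(atom, shift)) for atom in best]
        trial_score = score(trial, receptor)
        if trial_score < best_score:
            best, best_score = trial, trial_score
    return best, best_score


receptor_atoms = [(0.0, 0.0, 0.0)]
ligand_atoms = [(10.0, 0.0, 0.0)]
pose, pose_score = random_search(ligand_atoms, receptor_atoms, random.Random(0))
print(round(pose_score, 3))
```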
Computational methods:
Protein-ligand docking using SwissDock:
SwissDock is a web server dedicated to the docking of small molecules on target proteins. It is
based on the EADock DSS engine, combined with setup scripts for curating common
problems and for preparing both the target protein and the ligand input files. An efficient
Ajax/HTML interface was designed and implemented so that scientists can easily submit
dockings and retrieve the predicted complexes.
Procedure:
1. Download the protein PDB file of H1N1 neuraminidase from PDB
(http://www.rcsb.org).
2. Go to the SwissDock Server - http://www.swissdock.ch/docking and load the protein
PDB file in the ‘Target selection’ section.
3. Go to the PubChem database - https://pubchem.ncbi.nlm.nih.gov/.
4. In the search bar, type the PubChem ID 60855 (Zanamivir) and download the 3D SDF
format file of the molecule.
5. Go to http://pasilla.health.unm.edu/tomcat/biocomp/convert - a server for converting
one chemical format to another. Load the downloaded SDF file as input. For the
output, select the ‘mol2 – Tripos mol2’ format. Select convert. Download the converted
file.
6. In the SwissDock window, load the converted ligand file in the ‘Ligand selection’ section.
7. Start docking.
8. After the run completes, download the prediction file and analyse the results.

Protein-protein docking using ClusPro:


ClusPro 2.0 is a protein-protein docking server that has performed well in the
Critical Assessment of Prediction of Interactions (CAPRI) since its introduction. ClusPro has
been among the best-performing online servers for protein-protein docking. ClusPro
uses a correlation method called PIPER. PIPER calculates the docked conformation
energy on a grid using the fast Fourier transform (FFT) coupled with pairwise interaction
potentials. As a result, the docking results are improved considerably. Also, because the
pairwise interaction potential of PIPER is more accurate, fewer docked structures need to be
retained to capture the near-native ones. The retained structures are clustered using the
pairwise RMSD as the distance measure and are then optimized.
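
Clustering by pairwise RMSD can be sketched with a simple greedy scheme: pick the pose with the most neighbours within a chosen radius, report it with its neighbours as one cluster, remove them, and repeat. This is an illustrative simplification under stated assumptions; the function name `greedy_clusters`, the radius and the toy RMSD matrix are made up for the example and do not reproduce the server's actual settings.

```python
def greedy_clusters(rmsd_matrix, radius):
    """Group poses by repeatedly taking the pose with the most neighbours.

    rmsd_matrix: square list of lists of pairwise RMSD values between poses
    radius:      poses within this RMSD of the centre join its cluster
    Returns a list of (centre_index, member_indices) in order of discovery.
    """
    remaining = set(range(len(rmsd_matrix)))
    clusters = []
    while remaining:
        centre = max(
            sorted(remaining),
            key=lambda i: sum(1 for j in remaining if rmsd_matrix[i][j] <= radius),
        )
        members = sorted(j for j in remaining if rmsd_matrix[centre][j] <= radius)
        clusters.append((centre, members))
        remaining -= set(members)
    return clusters


toy_rmsd = [
    [0, 1, 2, 9, 9],
    [1, 0, 1, 9, 9],
    [2, 1, 0, 9, 9],
    [9, 9, 9, 0, 1],
    [9, 9, 9, 1, 0],
]
print(greedy_clusters(toy_rmsd, radius=1.5))
```

Larger clusters correspond to broad energy wells, which is why cluster size rather than the single lowest score is often used to rank the docked models.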
Procedure:
1. Go to the ClusPro Server - https://cluspro.bu.edu/login.php?redir=/home.php
2. Create an account and log in.
3. Upload the protein PDB file in the receptor section.
4. Go to the PDB database and download the PDB structure with ID 2MI1 (Somatostatin-14,
natural agonist of SSTR5). Load the file as the ligand input.
5. Select ‘Dock’.
6. Download and analyse the results.

Result:
Protein-ligand docking using SwissDock:

Protein-protein docking using ClusPro:

Additional information:
1. http://pymol.sourceforge.net/newman/userman.pdf
2. https://www.ebi.ac.uk/pdbe/docs/Tutorials/workshop_tutorials/PDBefold.pdf
3. http://www.proteinstructures.com/Structure/Structure/Ramachandran-plot.html

Practice Questions:
Download the following inhibitor molecules of Phosphoinositide-3-kinase: Wortmannin,
Copanlisib and Alpelisib.
Identify the best inhibitor molecule and rank the inhibitors based on docking analysis.
Evaluation scheme:

S. No   Component                                  Max. Marks (20)   Marks Obtained
1       Protein-ligand docking using SwissDock     5
2       Protein-protein docking using ClusPro      5
3       Interpretation of the results              3
4       Answers to practice questions              5
5       Viva / MCQ                                 2
        Total                                      20
