Manual

The document is a lab manual for a course on bioinformatics. It contains information about retrieving data from the National Center for Biotechnology Information (NCBI) database, including an overview of NCBI and descriptions of its various associated databases that contain information like genes, genomes, protein sequences, and published research articles. It also explains the GenBank data format for annotated DNA sequences stored in NCBI.

Uploaded by

GANESHAN S

BANNARI AMMAN INSTITUTE OF

TECHNOLOGY

SATHYAMANGALAM - 638401

DEPARTMENT OF BIOTECHNOLOGY

15BT704: BIOINFORMATICS LAB MANUAL

NAME : GANESHAN S

REGISTER NUMBER : 162BT121

SEMESTER : VII SEMESTER

BANNARI AMMAN INSTITUTE OF TECHNOLOGY
SATHYAMANGALAM - 638401

BONAFIDE CERTIFICATE

REG NO: 162BT121

Certified that this is a bonafide record of practical work done by ………… of
B.TECH-BIOTECHNOLOGY in the BIOINFORMATICS LAB during the
academic year 2019-2020, submitted for the practical examination held on
…………………………. in the BIOINFORMATICS LAB of the Department of
Biotechnology, Bannari Amman Institute of Technology, Sathyamangalam - 638401.

INTERNAL EXAMINER STAFF-IN-CHARGE

INDEX

Expt. No.    Date    Experiment Name    Page No.    Signature

EXPERIMENT 1: INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES

A. NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION (NCBI)

INTRODUCTION:

Bioinformatics is the application of computer technology to the management of biological
information. The need for Bioinformatics has arisen from the recent explosion of publicly available
genomic information, such as that resulting from the Human Genome Project. To address this, the
National Center for Biotechnology Information (NCBI) was established in 1988 as a national
resource for molecular biology information. The NCBI creates public-access databases, develops
software tools for analyzing genome data, and disseminates biomedical information - all for the
better understanding of molecular processes affecting human health and disease. The NCBI is a
virtual goldmine both in terms of available resources, and treasures yet to be discovered.

NCBI is one of the leading online resources for biological sequence information.
NCBI is a division of the National Library of Medicine (NLM) at the National
Institutes of Health (NIH) in the US. As a national resource for molecular biology
information, NCBI's mission is to develop new information technologies to aid in the
understanding of fundamental molecular and genetic processes that control health and disease.
NCBI is connected to various other sequence databases in order to answer sequence queries
more efficiently. User queries and sequence information are handled through NCBI’s search
and retrieval system, called “Entrez”.
Home Page:
NCBI has a simplified homepage from which the user can navigate to different resources. The left
side pane of the homepage has a site map followed by different categories, which help narrow down
the search for the right sequence. On the right side is a list of popular
resources, which is very useful for first-time users.
Searches can be made more precise by using Boolean operators such as AND, OR and NOT in
the search statement.
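
As an illustration of Boolean searching, the sketch below combines terms with AND, OR and NOT and wraps the result in a search URL. It assumes the NCBI E-utilities "esearch" endpoint, which this manual does not describe; the helper names boolean_query and esearch_url are made up for the example, and no network request is sent.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: the public E-utilities "esearch" endpoint (not covered in this manual).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def boolean_query(all_of=(), any_of=(), none_of=()):
    """Combine search terms with the Boolean operators AND, OR and NOT."""
    parts = []
    if all_of:
        parts.append(" AND ".join(all_of))
    if any_of:
        # Parentheses keep the OR group together when it is ANDed with other terms.
        parts.append("(" + " OR ".join(any_of) + ")")
    query = " AND ".join(parts)
    for term in none_of:
        query += " NOT " + term
    return query.strip()


def esearch_url(db, query, retmax=20):
    """Build (but do not send) a search URL for the given database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


query = boolean_query(all_of=["Bacillus pumilus", "DnaA"], none_of=["plasmid"])
url = esearch_url("gene", query)
print(query)
print(url)
```

The same query string can be typed directly into the search box on the homepage; the URL form is only useful when the search has to be repeated from a script.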

The associated databases included are as follows.

• Books: Bookshelf provides free access to search, retrieve and read books and journals in the
life sciences.

• CDD: The Conserved Domain Database is a collection of annotations of functional units in
proteins. It contains manually annotated domain models, which use 3D structure
information to define sequence/structure/function relationships.

• Gene: The Gene database comprises information about genes from various species, including their
nomenclature, associated pathways, RefSeqs, phenotypes and links to genomes.

• EST: The Expressed Sequence Tag database is a collection of EST records from GenBank. ESTs are
short single-pass sequence reads derived from cDNA, which act as a resource to evaluate gene
expression, find potential variation and annotate genes.

• Genome: The Genome database is a collection of genome information, including
sequences, maps, chromosomes and annotations.

• dbGaP: The database of Genotypes and Phenotypes is a library of results from studies
of the interaction of genotypes and phenotypes.

• GEO Datasets: The Gene Expression Omnibus (GEO) offers information on gene
expression datasets, their original series and Platform records. It also provides additional
information such as experimental details, cluster tools and differential expression queries.

• GEO Profiles: It allows browsing for individual expression profiles based on gene annotation or
pre-computed profile characteristics.

• GSS: The GSS nucleotide database provides the Genome Survey Sequence records
from GenBank.

• HomoloGene: It is a collection of homologs among the annotated genes of completely
sequenced eukaryotic organisms.

• MeSH: MeSH (Medical Subject Headings) is the NLM (National Library of Medicine)
controlled vocabulary used for browsing articles; it also acts as a thesaurus in biomedical
sciences for PubMed and MEDLINE.

• NLM Catalog: The NLM (United States National Library of Medicine) is the largest medical
library; its catalog offers access to books, journals, technical information, audiovisuals,
software and other resources.

• OMIM: It is a comprehensive resource database for human genes and genetic disorders. It
contains information about human genes and genetic phenotypes, and is updated daily.

• OMIA: Online Mendelian Inheritance in Animals is a resource for genes,
inherited disorders and traits in more than 135 animal species, authored by Professor Frank
Nicholas. It covers animal species other than human and mouse, for
which species-specific data are offered elsewhere.

• PopSet: A population study dataset is a set of DNA sequences collected to
study the evolutionary relatedness of a population. It can be accessed from the NCBI site.

• Probe: It is a collection of nucleic acid reagents. It also contains information on reagent
distributors, probe effectiveness and computed sequence similarities.

• Protein Sequence Database: It is a collection of sequences from GenBank, RefSeq, TPA,
SwissProt, PIR, PRF and PDB.

• PubChem BioAssay: It contains information on bioactivity screens of chemical substances
from PubChem.

• PubChem Compound: It contains compounds with their unique structures and biological
information from PubChem Substance.

• PubChem Substance: It is a collection of substance records submitted by depositors to the
system, with descriptions of samples and links to biological screening results that are
available in PubChem BioAssay.

• PubMed: PubMed is a freely accessible database search system for health information
which is developed and maintained by the National Center for Biotechnology Information
(NCBI) at the National Library of Medicine (NLM). It contains articles from MEDLINE
and other biomedical articles.

• PubMed Central: PubMed Central is a freely accessible digital archive of full-text articles
from biomedical and life science journals, which is linked to the PubMed database. It can be accessed
from the NCBI site.

• SNP: The SNP database contains information on single nucleotide polymorphisms and short
insertion and deletion polymorphisms.

• Structure: The Structure database contains information on the 3-dimensional structures of
proteins and polynucleotides.

• Taxonomy: Taxonomy contains information on all the organisms that are represented in the
genetic databases by a nucleotide or protein sequence.

• UniGene: It identifies transcripts from the same locus, analyses expression by tissue, age and
health status, and reports related proteins (protEST) and clone resources.

• BioSample: It is a collection of information on the different biological source materials used
in experimental assays.

The results of a query search can be displayed in different data formats, such as GenBank and FASTA.

• GenBank: GenBank is the NIH genetic sequence database, a collection of annotated DNA
sequences. The different fields of a GenBank record are explained
below.

• Locus name helps to group entries with similar sequences. The first 3 characters denote
the organism, the fourth and fifth characters give other group designations, such as gene
product, and the last character is one of a series of sequential integers.

• Sequence Length gives the number of nucleotide base pairs (or amino acid residues) in the
sequence record.

• Molecule Type shows the type of molecule that was sequenced.

• GenBank Division shows the GenBank division to which a record belongs, indicated
by a three-letter abbreviation.

1. PRI - primate sequences

2. ROD - rodent sequences

3. MAM - other mammalian sequences

4. VRT - other vertebrate sequences

5. INV - invertebrate sequences

6. PLN - plant, fungal, and algal sequences

7. BCT - bacterial sequences

8. VRL - viral sequences

9. PHG - bacteriophage sequences

10. SYN - synthetic sequences

11. UNA - unannotated sequences

12. EST - EST sequences (expressed sequence tags)

13. PAT - patent sequences

14. STS - STS sequences (sequence tagged sites)

15. GSS - GSS sequences (genome survey sequences)

16. HTG - HTG sequences (high-throughput genomic seq)

17. HTC - unfinished high-throughput cDNA sequencing

18. ENV - environmental sampling sequences

Modification Date shows the last date of modification.
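
The fields described so far (locus name, sequence length, molecule type, division and modification date) all sit on the first line of a GenBank record. The sketch below splits such a line into named fields and looks up the division code. The example line is reconstructed from the answers given later in this manual for accession KM658164 (1444 bp, linear DNA, ENV, 02-FEB-2015), so its exact spacing is an assumption, and the parser assumes the common eight-token layout rather than every variant of the format.

```python
from typing import NamedTuple

# Three-letter division codes as listed above.
DIVISIONS = {
    "PRI": "primate sequences", "ROD": "rodent sequences",
    "MAM": "other mammalian sequences", "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences", "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences", "VRL": "viral sequences",
    "PHG": "bacteriophage sequences", "SYN": "synthetic sequences",
    "UNA": "unannotated sequences", "EST": "EST sequences",
    "PAT": "patent sequences", "STS": "STS sequences",
    "GSS": "GSS sequences", "HTG": "HTG sequences",
    "HTC": "unfinished high-throughput cDNA sequencing",
    "ENV": "environmental sampling sequences",
}


class Locus(NamedTuple):
    name: str
    length: int
    unit: str
    molecule: str
    topology: str
    division: str
    date: str


def parse_locus(line):
    """Split a LOCUS line into its whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS":
        raise ValueError("unexpected LOCUS line layout")
    _, name, length, unit, molecule, topology, division, date = fields
    return Locus(name, int(length), unit, molecule, topology, division, date)


# Illustrative line rebuilt from values quoted in this manual (spacing assumed).
example = "LOCUS       KM658164                1444 bp    DNA     linear   ENV 02-FEB-2015"
locus = parse_locus(example)
print(locus)
print(DIVISIONS[locus.division])
```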


Definition is a brief description of sequence that includes information such as source organism,
gene name/protein name, or some description of the sequence's function.
Accession number indicates the unique identifier for a sequence record.

Records from the RefSeq database use prefixed accession numbers:

NT_123456 constructed genomic contigs
NM_123456 mRNAs
NP_123456 proteins
NC_123456 chromosomes
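
The prefix convention above can be checked mechanically. This is a minimal sketch that covers only the four prefixes listed here; the function name describe_accession is made up for the example, and other prefixes are simply reported as not listed.

```python
# Prefix meanings taken from the list above; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
}


def describe_accession(accession):
    """Return the record type implied by a RefSeq-style accession prefix."""
    prefix, separator, _ = accession.partition("_")
    if separator and prefix in REFSEQ_PREFIXES:
        return REFSEQ_PREFIXES[prefix]
    return "not one of the prefixes listed above"


for acc in ("NM_123456", "NP_123456", "KM658164"):
    print(acc, "->", describe_accession(acc))
```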

Version shows a nucleotide sequence identification number that represents a single, specific
sequence in the GenBank database.
GI ("GenInfo Identifier") is a sequence identification number for the nucleotide sequence.
Keywords lists words or phrases that describe the sequence.
Source contains free-format information including an abbreviated form of the organism
name, sometimes followed by a molecule type.
Organism gives the formal scientific name of the source organism and its lineage.
Reference lists publications by the authors of the sequence that discuss the data reported
in the record.
Authors contains the list of authors in the order in which they appear in the cited article.
Title gives the title of the published work or the tentative title of an unpublished work.
Journal gives the MEDLINE abbreviation of the journal name.
Pubmed gives the PubMed Identifier (PMID).
Features shows information about genes and gene products, as well as regions of biological
significance reported in the sequence.
Source (under Features) is a mandatory feature in each record that summarizes the length of the sequence,
the scientific name of the source organism, and the Taxon ID number. It can also include other
information such as map location, strain, clone, tissue type, etc., if provided by the submitter.
Taxon is a stable unique identification number for the taxon of the source organism.

CDS (coding sequence) represents the region of nucleotides that corresponds to the sequence
of amino acids in a protein.
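
To make the CDS-to-protein correspondence concrete, here is a short sketch that translates a coding sequence codon by codon. It assumes the standard genetic code (written in the usual T, C, A, G ordering) and stops at the first stop codon; alternative start codons and other translation tables are not handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    for i in range(0, usable, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 24 nucleotides of the Gene ID 31666699 sequence listed later in this manual.
print(translate("ATGGCTAAGACAAAATCAAAATTT"))
```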

FASTA:

It is a file format for representing nucleotide or protein sequences as text with a basic
tag or identifier, in which nucleotides or amino acids are represented by single-letter codes. A
FASTA record starts with a greater-than symbol (>), which marks the beginning of a new
sequence record; this first line is called the definition line (“def line”). On it, an accession number or version number is
followed by a description of the entry. The sequence, in either uppercase or lowercase letters, starts
on the next line. Sequence lines are usually wrapped at a fixed width, for example 60 characters per line.
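
The layout described above is simple enough to read and write with a few lines of code. The following is a minimal sketch, not a full parser: it assumes plain text input, treats everything after ">" as the definition line, and uses a made-up record (demo_1) as sample data.

```python
def parse_fasta(text):
    """Return a list of (definition line, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new def line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record with the sequence wrapped at a fixed line width."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


# Hypothetical sample record, used only to exercise the functions.
sample = ">demo_1 hypothetical example record\nATGGCTAAG\nACAAAATCA\n"
records = parse_fasta(sample)
header, sequence = records[0]
print(header.split()[0], len(sequence))
print(format_fasta(header, sequence * 8))
```

The first word of the definition line is the identifier, which is why header.split()[0] recovers the accession or version number.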

Experimental Methods used for Sequencing

The sequences stored in the database were obtained by different experimental
methods. The most commonly used methods for DNA sequencing are the Sanger method and the Maxam-
Gilbert method. Similarly, the Edman degradation method and mass spectrometry are used
for protein sequencing.
Sanger Method (dideoxy chain termination method): Four test tubes are labelled
A, T, G and C. Denatured (single-stranded) DNA is added to each of the test tubes.
Next a primer is added, which anneals to the template strand. DNA polymerase extends the 3' end of the
primer with deoxynucleotides (dNTPs) and, at random, with the dideoxynucleotide [ddNTP] specific to each tube.
When a ddNTP is incorporated into the growing chain, the chain
terminates, because the ddNTP lacks the 3'-OH group needed to form the phosphodiester bond with the next nucleotide.
Thus DNA fragments of different lengths are formed. Electrophoresis is done and the sequence order can be
obtained by analysing the bands in the gel, which are separated by fragment size. The primer or one of
the nucleotides can be radioactively or fluorescently labelled, so that the final product can be
detected easily in the gel and the sequence can be inferred.
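
The logic of reading a sequence from the four tubes can be sketched as a small simulation. This is an idealised model under stated assumptions: every possible termination product is present, primer length is ignored, and the template is written 3' to 5' so that the new strand is read 5' to 3'. The function names are made up for the example.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sanger_fragments(template):
    """Return, per ddNTP tube, the lengths of the terminated fragments.

    The template is given 3'->5'; the newly synthesised strand is its complement.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template)
    tubes = {base: [] for base in "ATGC"}
    for length, base in enumerate(new_strand, start=1):
        # A fragment of this length ends in the ddNTP of the matching tube.
        tubes[base].append(length)
    return tubes


def read_gel(tubes):
    """Read the bands from shortest to longest, as on a sequencing gel."""
    bands = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


tubes = sanger_fragments("TACGGATC")
print(tubes)
print(read_gel(tubes))
```

Each tube only reveals where one base occurs; the full sequence appears when the bands of all four lanes are ordered by fragment length.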

Maxam-Gilbert (chemical degradation method): This method requires a denatured DNA fragment
whose 5' end is radioactively labelled. This fragment is purified and then
subjected to base-specific chemical treatment, which results in a series of labelled fragments. Electrophoresis
arranges the fragments by size. To view the fragments, the
gel is exposed to X-ray film for autoradiography. A series of dark bands will appear, each
corresponding to a radiolabelled DNA fragment, from which the sequence can be inferred.

Edman Degradation reaction: The reaction finds the order of amino acids in a protein from the
N-terminus, by cleaving one amino acid at a time from the N-terminus without disturbing the other peptide bonds in the
protein. After each cleavage, chromatography or electrophoresis is done to identify the amino acid.

Mass Spectrometry: It is used to determine the mass of particles and the composition of molecules, and to
find the chemical structures of molecules such as peptides and other chemical compounds. Based
on the mass-to-charge ratio, one can identify the amino acids in a protein.

EXERCISE- NCBI

A. GENOME & GENE DATABASES

1. Go to the NCBI Home page, and choose GENOME from the databases list in the upper
left.

i) Retrieve the FASTA sequence of Bacillus pumilus.

ii) What is the shape of the organism?

Bacilli (rod-shaped)
iii) Retrieve any 2 protein sequences of Bacillus pumilus.

>WP_003152957.1 MULTISPECIES: 30S ribosomal protein S21 [Bacteria]
MSKTVVRKNESLEDALRRFKRSVSKTGTLQEARKREFYEKPSVKRKKKSEAARK
RKF
>WP_003153142.1 MULTISPECIES: stage III sporulation protein AC [Bacillales]
MGVDVNVIFQIAGVGIVVAFLHTILDQMGKKEYAQWVTLLGFIYILFMVATIVD
DLFKKIKAVFLFQG

S.No  Gene ID   Gene Description
1.    31666582  chromosomal replication initiator protein DnaA
2.    31666583  DNA polymerase III subunit beta
3.    31666582  chromosomal replication initiator protein DnaA
4.    31666584  S4 domain-containing protein YaaA
5.    31666585  DNA replication/repair protein RecF

vi) Download the FASTA sequences for the following gene IDs: 31666697, 31666698, 31666699 and 31666700

Gene ID 31666697:
>NZ_CP011007.1:105585-106676 Bacillus pumilus strain SH-B9, complete genome
ATGTCGCTCCAGCATTTTATTCAAGACGCTTTAAGTCAATGGATGAAACAAAAAGGA
CCAGAAAGTGACATTGTTCTAAGCAGTCGAATCAGGCTAGCACGAAACCTAGAGCA
TGTTCGTTTCCCAACTCAGTTTTCTCAGGAAGAGGCTGAAGCTGTTCTCCAGCAATTC
GAGCAAAAGTTTGCTAGCCAGGAAGTGAAGGACATTGGAAACTTTGTTCTAATTCA
ATGAATGAAACGCAGCCTTTAGCAAAGAGAGTACTTGTTGAAAAGCATTTGATCAG
CCCAAACTTAGCAGAATCAAGATTTGGCGGTTGTTTGCTTTCTGAAAATGAGGAAAT
CAGTGTGATGCTGAATGAAGAAGACCATATTCGAATCCAATGCTTATTCCCAGGCTT
CCAACTAGCCAATGCATTAAAAGCGGCTAACCAGATAGATGACTGGATTGAAGAGC
AAGTGGATTATGCGTTTTCTGAAAAGCGAGGATACTTAACAAGCTGTCCAACGAATT
AGGTACAGGTATTAGGGCTTCGGTCATGATGCATTTACCAGCTTTAGCCCTCACAAG
ACAAATGAATCGGATTATTCCGGCGATTAATCAATTAGGTCTTGTAGTCAGAGGAAT
TTATGGTGAAGGCAGCGAAGCAATAGGGAACATCTTTCAAATTTCAAATCAAATGA
CACTTGGTCAATCAGAAGAGGATATTGTAGATGATTTAAATAGTGTGACCGCTCAGC
TCATTGAACAAGAGCGATCTGCACGAAAAGCGTTATATCAAACATCTAAAATTGAA
CTTGAGGACAGAGTGTACCGTTCCTTGGGGATTTTGTCCAATTGTCGGATGATTGAA
TCAAAGGAAACAGCTAAGTGTTTGTCAGATGTGCGCCTTGGAATTGATTTAGGTATC
ATTAAGGGGCTTTCAAGTAATATACTGAATGAACTCATGATTTTGACACAGCCTGGC
TTTCTTCAACAATATTCTGGAGGAGCTTTGGAGCCAAATGAACGAGATATAAAACGA
GCAGCGATTATTAGAGAAAGGCTGCGTTTAGAAATGCATAGGAATGGACAGGAGGA
TGAAACGATATGA

Gene ID 31666698:
>NZ_CP011007.1:106673-109108 Bacillus pumilus strain SH-B9, complete genome
ATGATGTTTGGAAGATTCACTGAAAGAGCTCAAAAGGTATTAGCACTTGCACAAGA
AGAAGCCATTCGCCTAGGCCATAAGAACATTGGTACTGAACACATTTTACTTGGTCT
TGTACGCGAGGGTGAGGGTATTGCCGCAAAAGCGTTAGAAGCACTGGGCCTTGTTTC
AGATAAAATCCAAAAAGAAGTCGAAAGCTTGATTGGAAGAGGGCAAGAGGTGTCTA
AGCTATTCCTCATTATACGCCTAGAGCGAAGAAGGTCACTGAGCTTTCAATGGATGA
AGCAAGAAAGCTAGGTCATTCCTATGTAGGGACAGAACATATTCTATTAGGTCTTAT
TCGCGAGGGAGAGGGTGTAGCTGCCCGCGTTTTAAATAACCTCGGAGTGAGCTTAA
ATAAAGCACGTCAGCAAGTCCTGCAGCTGCTTGGCAGCAATGAAACAGGTGCATCT
GCCGCTGGCTCTAACAGCAATGCAAATACACCAACATTAGATAGCTTGGCAAGAGA
TTTAACAGCGATTGCGAAAGAAGACAGCTTGGACCCTGTCATTGGACGAAGCAAAG
AAATTCAGCGTGTCATTGAGGTCCTAAGCAGAAGAACAAAAAACAACCCTGTGCTG
ATTGGTGAGCCCGGTGTTGGTAAAACAGCCATCGCTGAAGGTCTTGCACAGCAAATT
ATTCATAATGAAGTGCCTGAAATTCTGCGGGATAAACGAGTGATGACGCTTGATATG
GGAACCGTTGTAGCAGGAACGAAATATCGTGGTGAATTTGAGGATCGTTTGAAAAA
AGTCATGGACGAAATTCGTCAGGCAGGAAATATCATTCTCTTCATTGATGAGCTTCA
TACACTGATTGGTGCTGGCGGAGCAGAAGGTGCGATTGACGCATCTAATATTCTCAA
ACCATCCTTAGCACGTGGAGAGCTTCAATGTATCGGGGCAACAACGTTAGATGAGT
ACCGTAAATATATTGAAAAGGATGCTGCGCTTGAACGACGTTTCCAGCCAATTCAAG
TAGATCAGCCATCCGTTGATGAAAGTATTCAAATCTTAAGAGGTCTTAGAGATCGTT

ATGAGGCACATCACCGTGTGTCCATCACAGATGAAGCGATTGAGGCGGCGGTGAAG
CTGTCTGACCGTTATATTTCTGATCGTTTCCTTCCAGATAAGGCGATTGATTTAATTG
ATGAGGCAGGTTCGAAAGTCCGCTTACGTTCTTTCACAACACCGCCTAACCTAAAAG
AACTAGAGCAAAAGTTGGATGAAGTACGCAAGGAAAAGGATGCGGCTGTTCAAAGT
CAGGAATTTGAAAAAGCAGCTTCTCTTCGCGATACAGAGCAGCGTTTACGTGAAAA
AGTAGAAGTCACAAAGAAATCTTGGAAAGAAAAGCAAGGTCAGGAGAATTCAGAG
GTATCAGTGGATGATATCGCAATGGTTGTCTCTAGCTGGACGGGAGTGCCTGTTTCA
AAAATTGCCCAAACAGAAACAGATAAGCTTCTGAATATGGAACAATTACTCCATTCT
CGTGTAATCGGGCAGGATGAAGCGGTTGTCGCTGTAGCAAAAGCTGTGAGACGTGC
GCGTGCTGGTCTAAAAGATCCAAAACGTCCAATCGGCTCCTTTATCTTCTTAGGCCC
AACAGGGGTTGGTAAAACGGAGCTTGCAAGAGCACTTGCGGAGTCTATTTTTGGTG
ATGAAGAAGCGATGATCCGTATCGATATGTCTGAATACATGGAGAAACATTCTACAT
CTAGACTTGTTGGGTCACCTCCAGGCTATGTTGGCTATGAAGAAGGCGGACAACTGA
CTGAAAAAGTGAGAAGAAAACCTTATTCTGTTGTGCTTTTAGACGAGATTGAAAAG
GCGCATCCAGATGTATTCAACATCTTACTGCAAGTATTAGAAGATGGTCGTCTGACG
GATTCTAAAGGGCGTACCGTTGACTTTAGAAATACGATTTTGATCATGACATCCAAC
GTTGGAGCTAGTGAACTGAAGCGAAATAAATATGTTGGCTTTAACGTGCAGGATGA
AGGTCAAAATTACAAAGATATGAAAGGCAAAGTGATGGGCGAGTTGAAACGTGCGT
TTAGACCAGAATTCATCAACCGTATTGATGAAATCATTGTCTTCCATTCACTTGAAA
AGAAACATTTAAAAGAGATCGTGTCTCTCATGTCTGATCAATTGACGAAACGATTAA
AAGAACAAGACCTTTCAATTGAATTGACAGAAGCAGCAAAAGCGAAGATTGCCGAC
GAAGGTGTAGACCTTGAGTACGGTGCGCGTCCGTTAAGAAGAGCGATTCAAAAGCA
TGTGGAGGATCGACTTTCTGAGGAGCTTCTAAAGGGTAATATTGAAAAAGGTCAAC
AAATCGTATTAGATGTGGAAGATGGAGAAATTGTCGTAAAAACGACGGCTGCTACG
AACTAA

Gene ID 31666699:
>NZ_CP011007.1:109202-110581 Bacillus pumilus strain SH-B9, complete genome
ATGGCTAAGACAAAATCAAAATTTATATGCCAATCGTGCGGTTATGAATCAGCCAA
ATGGATGGGGAAATGTCCAGGCTGCGGCACGTGGAACAGTATGACAGAAGAGGTCG
TTCGTAAAGAGCCGGTAAACCGTCGAAGCGCTTTTAATCATTCTGTCCAAACCATTC
AAAAACCTTCACCTATTTCTGCAATTGAAACATCAGATGAACCCCGAATCAAAACGA
ATTTAGAAGAATTTAACCGAGTATTAGGAAGTGGAATTGTCAAAGGCTCTCTTGTTC
TCATTGGCGGAGATCCTGGAATTGGGAAGTCCACATTATTATTACAAGTATCAGCAC
AGCTCTCAGACAAAAATCAGAATGTATTATACATATCCGGTGAGGAGTCCATTAAAC
AAACGAAGTTAAGAGCGGACCGCCTCGGCATAAAAAGCACCTCTTTACATGTTTTGG
CTGAAACCGATATGGAGTATATAACGTCTGCTATACAAGAGATGAAACCCGCTTTTG
TCGTGGTGGATTCGATTCAAACTGTTTATCAAAGCGATATTACGTCAGCTCCTGGGT
CTGTATCTCAAGTGAGAGAATGTACAGCGCAGCTGATGAAGATTGCCAAGACAAAT
GGGATTCCAATTTTTATTGTTGGTCACGTGACCAAAGAAGGCTCGATCGCAGGTCCA
CGTCTTTTAGAGCATATGGTGGACACGGTTCTTTATTTTGAAGGTGAGCGTCATCAT
ACGTTTCGTATCTTACGTGCAGTGAAAAACCGATTTGGTTCAACGAATGAACTAGGG
ATTTTTGAAATGAGAGAGGAAGGGCTCACGGAAGTATTAAACCCATCCGAAATTTTC
TTAGAAGAACGTTCGGCAGGTGTATCTGGTTCGTGTGTTGTTGCCTCAATGGAAGGA
CAAGACCTGTTCTTGTCGAGATACAAGCGCTAATTTCACCAACAAGCTTCGGAAATC
CGCGTAGAATGGCCACTGGTCTTGATCATAATCGAGTGTCACTGCTCATGGCGGTTT

TAGAAAAACGTGTTGGACTGCTGCTGCAAAACCAGGATGCGTATTTAAAGGTCGCA
GGCGGTGTGAAGCTTGACGAACCAGCAATTGACTTGGCCATTGCGGTCAGTATTGCC
TCCAGCTTTAAAGACGCAGCGCCGCATCAAGCGGATTGCTTTATAGGAGAGGTCGGT
CTGACGGGAGAAGTCAGAAGAGTATCAAGAATTGAACAGCGTGTGCAGGAAGCGG
CGAAACTAGGATTTAAGCGAATGTTTATTCCTCAGGCGAATATAGATGGATGGAAA
AAGCCGAGAGGGATTGAGTTAGTCGGTGTAGAAAATGTAGCGGAGGCACTTCGAAT
TTCACTAGGGGGATCATAG

Gene ID 31666700:
>NZ_CP011007.1:110584-111663 Bacillus pumilus strain SH-B9, complete genome
ATGGAAAAAGAGAAAAAAGGAGCGAGAGAACTCGATCTCTTAGATATCGTACAGTT
TGTGGCACCAGGGACACCTCTTCGGGCTGGGATCGAAAACGTCCTGAGGGCCAATA
CTGGCGGGCTTATTGTTGTTGGTTATAACGACAAGGTAAAAAGTGTTGTCGACGGAG
GATTTCATATTAATTCTGCCTTCTCCCCAGCACACTTATATGAATTGGCGAAGATGG
ATGGAGCCATTATTTTAAGCGATTCTGGTCAAAAGATCTTGTATGCGAATACACAGC
TCATGCCAGATGCAACCATTCATTCATCGGAAACGGGCATGAGGCACCGAACTGCT
GAACGTGTAGCGAAGCAGACAGGCTGTTTAATCATTGCGATTTCTGAACGGAGGAA
CGTCATTACCTTATACCAAGGGAATCGTCGTTATACGCTGAAAGATATTGGCTTTAT
TTTAACGAAGGCCAATCAAGCCATACAAACACTTGAGAAATATAAAACCATTTTAG
ATCATGCCATTTCTGCGTTAAGCGCCCTGGAATTTGAAGAGCTTGTGACCTTTGGTG
ATGTATTATCCGTCCTGCATCGTTACGAAATGGTGCTTAGAATTAAAAATGAAATCA
ATATGTATATCAAAGAGCTTGGAACAGAAGGACATTTGATTCGTCTGCAAGTCAATG
AACTGATTACAGATATGGAGCAGGAAGCGGCTTTATTTATTAAAGATTACGTGAAA
GAAAAGATTAAAGATCCATATGTTCTGCTTAAACAGCTCCAAGATATGTCTAGCTTT
GAGCTTTTAGATGATTCCATTCTGTACAAGTTACTTGGCTATCCAGCTTCTACAAATA
TTGACGAATATGTGTACACAAGAGGTTACAGACTGCTTCACAAAATCCCTAGACTGC
CTATGCCAATCGTCGAAAACGTGGTTGAAGCGTTCGGTGTGTTAGATCGAATTATGG
AAGCAGATGTACAAGATTTGGATGAAGTAGAAGGAATTGGAGAAGTAAGAGCGAA
AAAGATAAAAAAAGGATTAAAAAGGCTGCAAGAAAAACATTATATCGACCGACAG
CTGTAA
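
The definition lines of the four downloads carry the genome coordinates of each gene, so the expected sequence lengths can be checked without counting letters. The sketch below assumes headers of the form ">accession:start-end description", as shown above; it does not handle other coordinate notations.

```python
import re

# Matches ">ACCESSION:START-END " at the beginning of a definition line.
HEADER = re.compile(r"^>?(?P<acc>\S+?):(?P<start>\d+)-(?P<end>\d+)\s")


def region_length(defline):
    """Length in nucleotides of the region named in the definition line."""
    match = HEADER.match(defline)
    if match is None:
        raise ValueError("no start-end range found in definition line")
    start, end = int(match["start"]), int(match["end"])
    return end - start + 1  # coordinates are inclusive at both ends


headers = [
    ">NZ_CP011007.1:105585-106676 Bacillus pumilus strain SH-B9, complete genome",
    ">NZ_CP011007.1:106673-109108 Bacillus pumilus strain SH-B9, complete genome",
    ">NZ_CP011007.1:109202-110581 Bacillus pumilus strain SH-B9, complete genome",
    ">NZ_CP011007.1:110584-111663 Bacillus pumilus strain SH-B9, complete genome",
]
lengths = [region_length(h) for h in headers]
for header, n in zip(headers, lengths):
    # A coding region should contain a whole number of codons.
    print(header.split()[0], n, "nt, multiple of 3:", n % 3 == 0)
```

A downloaded sequence whose letter count differs from the computed length was probably truncated or merged with a neighbouring record during copying.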

v) Label the Missing data

S.No  Gene ID   Name of the protein product                     Length (amino acids)
1     31666582  chromosomal replication initiator protein DnaA  446
2     31666734  50S ribosomal protein L16                       144
3     31666735  50S ribosomal protein L29                       66
4     31666736  30S ribosomal protein S17                       87

vi) Use the genome annotation report of Bacillus pumilus and fill in the following details

S.No  Name of the organism  Strain  No. of genes  No. of proteins
1     Bacillus pumilus      15.1    3897          3797
2     Bacillus pumilus      JRS3    3871          3756
3     Bacillus pumilus      Bonn    3759          3593
4     Bacillus pumilus      MG84    3747          3603
5     Bacillus pumilus      MG52    3919          3770
6     Bacillus pumilus      124     3774          3623
7     Bacillus pumilus      Ps115   3776          3645

vii) Using the plasmid annotation report, fill in the following details

S.No  Name of the organism  Plasmid Name  No. of genes  No. of proteins
1     Bacillus pumilus      pSHB9         104           92
2     Bacillus pumilus      pPL10         6             6
3     Bacillus pumilus      pPL7065       6             6
4     Bacillus pumilus      pPZZ84        7             7
5     Bacillus pumilus      P576          51            49

ix) List any 2 characteristics of Bacillus pumilus.


This species is a naturally occurring, ubiquitous soil microorganism. Commonly found in a variety
of food commodities, some strains have developed an increased tolerance to gamma irradiation.
This bacterium colonizes the root zone of some plants, where it inhibits soil-borne fungal diseases
and nematodes.

x) List the Lineage of Bacillus pumilus

Bacteria[25418]; Firmicutes[3922]; Bacilli[1650]; Bacillales[1114]; Bacillaceae[498]; Bacillus[219]; Bacillus pumilus[1]
xi) Download the complete genome sequence for a bacterium of interest and repeat the steps above.

B. NUCLEOTIDE DATABASE

From the NCBI homepage (https://www.ncbi.nlm.nih.gov/), type in: Ochrobactrum
intermedium isolate CL6 in the “Query” box at the top of the page. Make sure the “Search in”
box states “Nucleotide”. Click on the “Search” button.

1. What is FASTA? Retrieve the FASTA sequence of the entry.


It is a file format used for representing nucleotide or protein sequences as a string with some basic tag or
identifier in which nucleotides or amino acids are represented as single letter codes.

AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAGCG
CCCTGCAAGGGGAGCGGCAGACGGGTGAGTAACGCGTGGGAACGTACCATTTGCTACGGAA
TAACTCAGGGAAACTTGTGCTAATACCGTATGAGCCCGAAAGGGGAAAGATTTATCGGCAA
ATGATCGGCCCGCGTTGGATTAGCTAGTTGGTGGGGTAAAGGCCTACCAAGGCGACGATCC
ATAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACG
GGAGGCAGCAGTGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGA
GTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTCACCGGTGAAGATAATGACGGTAACCGGA
GAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGT
TCGGATTTACTGGGCGTAAAGCGCACGTAGGCGGGCTAATAAGTCAGGGGTGAAATCCCGG
GGCTCAACCCCGGAACTGCCTTTGATACTGTTAGTCTTGAGTATGGTAGAGGTGAGTGGAAT
TCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCA
CTGGACCATTACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTG
GTAGTCCACGCCGTAAACGATGAATGTTAGCCGTTGGGGAGTTTACTCTTCGGTGGCGCAGC
TAACGCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGA
CGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTAC
CAGCCCTTGACATCCCGATCGCGGTTAGTGGAGACACTATCCTTCAGTTCGGCTGGATCGGA
GACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACG
AGCGCAACCCTCGCCCTTAGTTGCCAGCATTCAGTTGGGCACTCTAAGGGGACTGCCGGTGA
TAAGCCGAGAGGAAGGTGGGGGATGACGTCAAGTCCTCATGGCCCTTACGGGCTGGGCTAC
ACACGTGCTACAATGGTGGTGACAGTGGGCAGCGAGCACGCGAGTGTGAGCTAATCTCCAA
AAGCCATCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTTGGAATCGCTAGTAA
TCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACC
ATGGGAGTTGGTTTTACCCGAAGGCGCTGTGCTAACCGCAAGGAGGCAGGCGACCACGGTA
GGGTCAGCGACTGGGGTGAAGTCGTAACAAGGTA.
2. How many base pairs are present?
1444 bp
3. What kind of molecule is it?
Genomic DNA
4. What are the types of RNA present in a cell?
mRNA, tRNA and rRNA are the types of RNA present in a cell.
16S rRNA - present in the small (30S) ribosomal subunit of prokaryotes
18S rRNA - present in the small (40S) ribosomal subunit of eukaryotes
23S rRNA - present in the large (50S) ribosomal subunit of prokaryotes
5. Is this a linear or circular DNA?
It is a linear DNA.
6. What is the Accession number of the sequence?
KM658164 is the accession number of the sequence.
7. What is the keyword for this particular entry? What does it indicate?
ENV.
It indicates environmental sampling sequences.

8. What is the source of the particular sequence? Where is it indicated?
The source is an environmental sample, which is indicated under
‘Features’.
The source organism is Ochrobactrum intermedium, which is indicated in the definition and below the keywords.
9. Where can you get the author information from?
The author information is listed under ‘AUTHORS’ in the GenBank record.
10. Is the sequence published in any journal? No, it is not published in any journal.
11. What is the submission date for the particular sequence? 02-FEB-2015
12. Which method was used to sequence this data?
Commonly used methods for DNA sequencing are the Sanger method and the Maxam-Gilbert method.
The Sanger dideoxy sequencing method was used to sequence this data.
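
The record examined in this exercise can also be requested by accession number instead of through the web form. The sketch below only builds the request URL. It assumes the NCBI E-utilities "efetch" endpoint with its db, id, rettype and retmode parameters, none of which are described in this manual, and it does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities "efetch" endpoint (not covered in this manual).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build (but do not send) a URL that requests one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


fasta_url = efetch_url("KM658164")               # FASTA view of the record
genbank_url = efetch_url("KM658164", rettype="gb")  # GenBank flat-file view
print(fasta_url)
print(genbank_url)
```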
C. GENE DATABASE

From the NCBI homepage (https://www.ncbi.nlm.nih.gov/), type in: HK1 hexokinase 1 [Homo
sapiens (human)] in the “Query” box at the top of the page. Make sure the “Search in” box states
“Gene”. Click on the “Search” button.
1. What is the symbol for this entry?
HK1
2. Is this a protein coding gene?
Yes, it is protein coding gene.
3. From which organism is it derived?
Homo sapiens
4. What does the protein encoded by this gene do?
This gene encodes a ubiquitous form of hexokinase which localizes to the outer membrane
of mitochondria.
5. Where does the protein get expressed?

6. What is the location of the particular gene and what is the exon count?
The location of the particular gene is 10q22.1 and its exon count is 29.

7. In which chromosome is the gene present?


The gene is present in Chromosome 10.
8. What are the other names given for the protein?
Brain form hexokinase
Glycolytic enzyme
Hexokinase IR
Hexokinase type I

9. Download any 2 articles related to this gene.


D. PUBCHEM
Go to the NCBI Home page, and choose PubChem from the popular resources in the
upper right.
1. Calculate physical and chemical properties of the drugs:
i) CID 6419971
ii) CID 4055
iii) CID 6163
Note down the names and properties of the above drugs.

S.No  Property Name                 CID 6419971  CID 4055      CID 6163
1     Molecular Weight              875.1 g/mol  172.18 g/mol  154.19 g/mol
2     XLogP                         4.1          2.2           1.1
3     Hydrogen Bond Donor count     3            0             0
4     Hydrogen Bond Acceptor count  14           2             4
5     Heavy atom count              62           13            9

2) List any 2 uses of the above drugs.

CID 6419971 - Ivermectin, a veterinary drug.
CID 4055 - The synthetic water-soluble forms of vitamin K (menadione, menadiol) have long
been considered inferior to vitamin K1 (phytonadione) in the treatment of drug-induced
hypoprothrombinemia.
CID 6163 - Cleaning and furnishing care products; fabric, textile, and leather products not
covered elsewhere; paper products.
E. PROTEIN DATABASE

1. Retrieve the protein sequence in FASTA format corresponding to NCBI Reference
Sequence: XP_010321220.1. Comment on the function and taxonomy involved.
>XP_010321220.1 elastin [Solanum lycopersicum]
MAKKGILILVLSIFLVGSILGRKLAESDALTNGGHSPLKPLFTQKNGDLPGIPLLGGVPGSI
PGIPSIGGLPSNIPGIPSIGGLPGTISGIPLIGGLPGIIPGIPSIGGLPGTIPGVPSIGGVPGTIPGV
PSTGGGPGNIPGVPSIGGGPGNIPGVPSIGGVPGQGPGDVPVVPSVGGNTGPAYGGIPYIP
STGIPGLGYGGLPGIPWLGGGPGNIPGIPWLGGVPGGYGIGPGIVPFVGGGGGGGGGGGI
GGGLGGGVGGGVGGGIGGGV.

B. Introduction to UniProt Protein Resource


UniProt is the world’s most comprehensive catalogue of information on proteins. It is a central
repository of protein sequence and function created by joining the information contained in
Swiss-Prot, TrEMBL and PIR.
UniProt comprises four components. The UniProt Knowledgebase (UniProtKB) is the
central access point for extensive curated protein information, including function, classification
and cross-reference. The UniProt Reference Clusters (UniRef) databases combine closely related
sequences into a single record to speed up searches. The UniProt Archive (UniParc) is a
comprehensive repository, reflecting the history of all protein sequences. The UniProt
Metagenomic and Environmental Sequences (UniMES) database is a repository specifically
developed for metagenomic and environmental data. Using UniProt, we can uncover a lot of
information about a protein in addition to its sequence.
UniProtKB

The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional
information on proteins, with accurate, consistent and rich annotation. In addition to capturing the
core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or
description, taxonomic data and citation information), as much annotation information as possible
is added. This includes widely accepted biological ontologies, classifications and cross-references,
and clear indications of the quality of annotation in the form of evidence attribution of
experimental and computational data. The UniProt Knowledgebase consists of two sections: a
section containing manually-annotated records with information extracted from literature and
curator-evaluated computational analysis, and a section with computationally analyzed records
that await full manual annotation. For the sake of continuity and name recognition, the two sections
are referred to as "UniProtKB/Swiss-Prot" (reviewed, manually annotated) and
"UniProtKB/TrEMBL" (unreviewed, automatically annotated), respectively.

Where do the protein sequences come from?

About 98 % of the protein sequences provided by UniProtKB are derived from the translation of
the coding sequences (CDS) which have been submitted to the public nucleic acid databases, the
EMBL-Bank/GenBank/DDBJ databases (INSDC). All these sequences, as well as the related data
submitted by the authors, are automatically integrated into UniProtKB/TrEMBL.

UniProtKB/Swiss-Prot (‘gold star’) contains manually annotated records (reviewed) with
information extracted from the literature and curator-evaluated computational predictions;
UniProtKB/TrEMBL (‘grey star’) contains computationally annotated records (unreviewed).
Text search

Select the Search tab of the toolbar to search this site:

1. Select a data set from the Search in drop-down list.
2. Enter your query in the Query field.
3. Click the Search button.

Query syntax

Here is a brief overview of the supported query syntax (see also query fields for
UniProtKB):

Query                                 Result
human antigen                         All entries containing both terms.
human AND antigen
human && antigen
"human antigen"                       All entries containing both terms in the exact order.
human -antigen                        All entries containing the term human but not antigen.
human NOT antigen
human ! antigen
human OR mouse                        All entries containing either term.
human || mouse
antigen AND (human OR mouse)          Using parentheses to override boolean precedence rules.
anti*                                 All entries containing terms starting with anti. Asterisks can
                                      also be used at the beginning and within terms. Note: Terms
                                      starting with an asterisk or a single letter followed by an
                                      asterisk can slow down queries considerably.
author:Tiger*                         Citations that have an author whose name starts with Tiger.
                                      To search in a specific field of a dataset, you must prefix
                                      your search term with the field name and a colon. To
                                      discover what fields can be queried explicitly, observe the
                                      query hints that are shown after submitting a query or use
                                      the query builder (see below).
length:[100 TO *]                     All entries with a sequence of at least 100 amino acids.
citation:(author:Arai author:Chung)   All entries with a publication that was coauthored by two
                                      specific authors.

To use characters that have a special meaning in the query syntax literally in your query,
you must escape them with a backslash, e.g. use gene:L\(1\)2CB to search for the gene
name L(1)2CB. The current list of special characters is:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
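
The escaping rule can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `escape_query_term` is hypothetical, and it assumes that escaping each `&` and `|` individually is an acceptable way to cover the two-character operators `&&` and `||`.

```python
# Characters listed in the manual as special in the query syntax.
# Assumption of this sketch: "&&" and "||" are handled by escaping
# every single "&" and "|".
SPECIAL_CHARS = '+-&|!(){}[]^"~*?:\\'

def escape_query_term(term: str) -> str:
    """Put a backslash in front of every special character in `term`."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)

# The gene name from the text, written as a field query.
print("gene:" + escape_query_term("L(1)2CB"))
```

Only the search term is escaped here; the field prefix and its colon are added afterwards so that the colon keeps its special meaning.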

Query builder

To restrict terms to specific fields in advance, click the 'Advanced Search »' button. Depending on
the chosen data set and field, you can then enter some text or choose values from a drop-down list.
Then click the Add & Search button to add the new constraint and run the new query.

Anatomy of a UniProtKB/TrEMBL entry

Each UniProtKB/TrEMBL entry contains information associated with one protein sequence. 100%
identical protein sequences over the entire length of the protein and from the same species are
automatically merged together. The different sections of the entry report selected information
extracted from the original ENA/GenBank/DDBJ entry as well as additional high quality
automated annotation.

A UniProtKB/TrEMBL entry contains at least the following sections:

Entry information

Each entry is associated with a stable unique identifier: the primary accession number. When
citing an entry, always use the primary accession number. The ‘entry name’ is composed of the
primary accession number and a mnemonic species identification: it is not stable and will change
as soon as the entry is reviewed and integrated into UniProtKB/Swiss-Prot.

UniProtKB accession numbers

UniProtKB accession numbers consist of 6 or 10 alphanumerical characters in the format:

1          2      3          4          5          6      7      8          9          10
[O,P,Q]    [0-9]  [A-Z,0-9]  [A-Z,0-9]  [A-Z,0-9]  [0-9]
[A-N,R-Z]  [0-9]  [A-Z]      [A-Z,0-9]  [A-Z,0-9]  [0-9]
[A-N,R-Z]  [0-9]  [A-Z]      [A-Z,0-9]  [A-Z,0-9]  [0-9]  [A-Z]  [A-Z,0-9]  [A-Z,0-9]  [0-9]

The three patterns can be combined into the following regular expression:

[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}

Examples: A2BC19, P12345, A0A022YWF9
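
The regular expression above can be tried out directly. The following is a minimal Python sketch; the function name `is_uniprot_accession` is illustrative and is not part of any UniProt tool. It checks only the shape of the string, not whether the entry actually exists.

```python
import re

# The pattern quoted above. Non-capturing groups are used because only a
# yes/no answer is needed.
ACCESSION_RE = re.compile(
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

def is_uniprot_accession(text: str) -> bool:
    """Return True if the whole of `text` has the shape of an accession number."""
    return ACCESSION_RE.fullmatch(text) is not None

# The three examples from the text, one entry used in the exercises, and
# two strings that should be rejected (too short, lower case).
for candidate in ["A2BC19", "P12345", "A0A022YWF9", "Q15746", "P1234", "p12345"]:
    print(candidate, is_uniprot_accession(candidate))
```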

Entries can have more than one accession number. This can be due to two distinct mechanisms:

a) When two or more entries are merged, the accession numbers from all entries are kept.
The first accession number is referred to as the ‘Primary (citable) accession number’, while
the others are referred to as ‘Secondary accession numbers’. These are listed in
alphanumerical order.

b) If an existing entry is split into two or more entries (‘demerged’), new ‘primary’
accession numbers are attributed to all the split entries while all original accession numbers
are retained as ‘secondary’ accession numbers.

Names and origin

Protein name (‘submitted name’), synonyms, gene and locus names, and taxonomy information
are automatically extracted from the original ENA/GenBank/DDBJ entry.

Protein attributes

This section provides information on the protein sequence length, indicates if the protein sequence
is complete or a fragment (according to the original ENA/GenBank/DDBJ record). It also provides
the level of evidence that supports the existence of the protein. The vast majority of
UniProtKB/TrEMBL entries have the protein existence level ‘Predicted’, since they are derived from in
silico nucleotide translations.

In UniProtKB there are 5 types of evidence for the existence of a protein:

• Evidence at protein level
• Evidence at transcript level
• Inferred from homology
• Predicted
• Uncertain

The value ‘Evidence at protein level’ indicates that there is clear experimental evidence for the
existence of the protein. The criteria include partial or complete Edman sequencing, clear
identification by mass spectrometry, X-ray or NMR structure, good quality protein-protein
interaction or detection of the protein by antibodies.

The value ‘Evidence at transcript level’ indicates that the existence of a protein has not been
strictly proven but that expression data (such as existence of cDNA(s), RT-PCR or Northern
blots) indicate the existence of a transcript.

The value ‘Inferred from homology’ indicates that the existence of a protein is probable because
clear orthologs exist in closely related species.

The value ‘Predicted’ is used for entries without evidence at protein, transcript, or homology
levels.

The value ‘Uncertain’ indicates that the existence of the protein is unsure.

Sequences

More than 99 % of the protein sequences are obtained from the translation of annotated coding
sequences (CDS) in the ENA/GenBank/DDBJ databases and are automatically processed and
entered in UniProtKB/TrEMBL. The protein sequence displayed by default in the entry is the most
prevalent and/or the most similar to orthologous sequences. When the genomic sequence is
available, we generally display the protein sequence derived from genome translation.

References

This section contains published articles or submissions that were cited in the original
ENA/GenBank/DDBJ entry. Additional computationally mapped references can also be accessed
from this section.

Cross-references

This section is used to point to related information stored in other data collections, including the
links to the original ENA/GenBank/DDBJ submissions.

General annotation (Comments)

This section provides any useful information about the protein, mostly biological knowledge:
function, subcellular location, enzyme-specific information (catalytic activity, cofactors,
metabolic pathway), tissue expression and so on. Qualifiers (e.g. ‘By similarity’, ‘Probable’, ‘Potential’)
are used in the absence of direct experimental evidence.

Ontologies

This section provides a selection of UniProtKB keywords, which are terms from a controlled
vocabulary list that summarize the content of the entry, and a selection of Gene Ontology (GO)
terms.

Alternative products

This section lists the alternative protein sequences that can be generated from the same gene by
alternative promoter usage, alternative splicing, alternative initiation and/or ribosomal
frameshifting. In addition, this section provides relevant information for each alternative protein
isoform.

Sequence annotation (Features)

Over 30 feature keys are available for the description of regions or sites of interest in the protein
sequence, such as post-translational modifications (glycosylation, phosphorylation…), binding
sites, enzyme active sites, local secondary structure, or variants. Features can be either
experimentally proven in the literature or predicted in silico. Qualifiers (‘By similarity’, ‘Probable’
and ‘Potential’) indicate that the feature is supported only by indirect experimental evidence or by
computer prediction.

3D-structure databases

This section lists proteins, or parts of proteins, whose three-dimensional structure has been resolved
experimentally (for example by X-ray crystallography or NMR spectroscopy) and whose coordinates
are available in the PDB database.

Exercise 1: Searching UniProt using a Text search

From the UniProt homepage (http://www.uniprot.org/), type in: Myosin light chain kinase in the
“Query” box at the top of the page. Make sure the “Search in” box states “Protein Knowledgebase
UniProtKB”. Click on the “Search” button.

[1] How many results did you retrieve? - 2834

[2] Restrict the results to human by using the “Fields” box drop-down menu. Select “Human
[9606]”. - 46

[3] How many gold star and grey star entries do you get, and what is the difference between the
two?

GOLD STAR: Records with information extracted from literature and curator-evaluated
computational analysis.
GREY STAR: Records that await full manual annotation.
Number of GOLD STAR: 147; HUMAN [9606] = 38
Number of GREY STAR: 2687; HUMAN [9606] = 8

Exercise 2: Exploring the General Information of UniProt/SwissProt Entry

[1] Choose the UniProt entry Q15746 and identify (a) the protein name and (b) the total number of
amino acids, and find whether it is manually or computationally annotated.

Myosin light chain kinase, smooth muscle.

1914 amino acids; it is manually annotated.

[2] Identify its primary and secondary accession numbers. What is the difference
between primary and secondary accession numbers?

When two or more entries are merged, the accession numbers from all entries are kept. The
first accession number is referred to as the ‘Primary (citable) accession number’, while the
others are referred to as ‘Secondary accession numbers’. These are listed in alphanumerical
order. If an existing entry is split into two or more entries (‘demerged’), new ‘primary’
accession numbers are attributed to all the split entries while all original accession numbers
are retained as ‘secondary’ accession numbers.
Primary accession number: Q15746
Secondary accession numbers: B4DUE3, D3DN97, O95796, O95797, O95798, O95799,
Q14844, Q16794, Q17S15, Q3ZCP9, Q5MY99, Q5MYA0, Q6P2N0, Q7Z4J0, Q9C0L5,
Q9UBG5, Q9UBY6, Q9UIT9

[3] By what other names is this protein known?

Kinase-related protein
Telokin

[4] Is there any evidence that this protein exists at the protein level? Where can you find this
information?

Yes, there is evidence that this protein exists at the protein level. This information can be
found under ‘STATUS’.

[5] What is meant by Evidence at protein level, Evidence at transcript level and Inferred from
homology?

The value 'Experimental evidence at protein level' indicates that there is clear
experimental evidence for the existence of the protein. The criteria include partial or
complete Edman sequencing, clear identification by mass spectrometry, X-ray or NMR
structure, good quality protein-protein interaction or detection of the protein by
antibodies. The value 'Experimental evidence at transcript level' indicates that the
existence of a protein has not been strictly proven but that expression data (such as
existence of cDNA(s), RT-PCR or Northern blots) indicate the existence of a transcript. The
value 'Protein inferred by homology' indicates that the existence of a protein is probable
because clear orthologs exist in closely related species.

Exercise 3: Exploring a Sequence of UniProt/SwissProt Entry

[1] How many different isoforms are known for this protein? There are 11 isoforms for this
protein.
[2] What region is missing in the identifier Q15746-8? Which isoform is it, and by what name
is it known? Residues 1-1760 are missing. It is isoform 6, known by the name Telokin.
[3] Is isoform 6 catalytically active? It has no catalytic activity.
[4] What alternative initiation site does it use? Met-1761 as the initiator codon.
[5] What are the residue positions of the “Protein kinase” domain? 1464-1719 are the residue
positions of the protein kinase domain.

Exercise 4: Exploring General annotation of UniProt/SwissProt Entry

[1] What ligands might this protein bind according to the keywords list? Where can you get
this information? ATP-binding, Calcium, Magnesium, Metal-binding and Nucleotide-binding
are the ligand keywords for this protein; they can be found in the UniProtKB keywords.
[2] What “Biological process” is this protein involved in? Aorta smooth muscle tissue
morphogenesis, bleb assembly, cardiovascular system development, cellular hypotonic
response, muscle contraction, positive regulation of calcium ion transport, positive
regulation of cell migration, positive regulation of wound healing, protein phosphorylation,
smooth muscle contraction, tonic smooth muscle contraction.
[3] We already know this protein is an enzyme, but what is its function? Actin binding, ATP
binding, calmodulin binding, metal ion binding, myosin light chain kinase activity, protein
kinase activity. Calcium/calmodulin-dependent myosin light chain kinase implicated in
smooth muscle contraction via phosphorylation of myosin light chains (MLC). Also
regulates actin-myosin interaction through a non-kinase activity. Phosphorylates
PTK2B/PYK2 and myosin light-chains. Involved in the inflammatory response (e.g.
apoptosis, vascular permeability, and leukocyte diapedesis), cell motility and morphology,
airway hyperreactivity and other activities relevant to asthma.
[4] How do post-translational modifications affect this protein?
Can probably be down-regulated by phosphorylation. Tyrosine phosphorylation by ABL1
increases kinase activity, reverses MLCK-mediated inhibition of Arp2/3-mediated actin
polymerization, and enhances CTTN-binding. Phosphorylation by SRC at Tyr-464 and
Tyr-471 promotes CTTN binding. The C-terminus is deglutamylated by AGTPBP1/CCP1,
AGBL1/CCP4 and AGBL4/CCP6, leading to the formation of Myosin light chain kinase,
smooth muscle, deglutamylated form. The consequences of C-terminal deglutamylation
are unknown. Acetylated at Lys-608 by NAA10/ARD1 via calcium-dependent signaling;
this acetylation represses kinase activity and reduces tumor cell migration.

[5] What are helix, strand and turn? Identify the regions that form these secondary structures.

The alpha helix (α-helix) is a common motif in the secondary structure of proteins and
is a right-handed helix conformation in which every backbone N−H group
donates a hydrogen bond to the backbone C=O group of the amino acid located three or
four residues earlier along the protein sequence.
The β-sheet is a common motif of regular secondary structure in proteins. Beta sheets
consist of beta strands connected laterally by at least two or three backbone hydrogen
bonds, forming a generally twisted, pleated sheet.
A turn is an element of secondary structure in proteins where the polypeptide chain
reverses its overall direction.
Beta strand: 415-418; 423-425; 431-438; 444-449; 458-466; 469-476; 483-491; 494-504;
512-518; 523-526; 531-534; 536-541; 546-553; 560-562; 565-570; 581-586; 591-595;
598-601; 1239-1241; 1246-1250; 1255-1266; 1268-1277; 1281-1288; 1290-1297;
1306-1313; 1318-1320; 1323-1328.
Helix: 479-481; 1743-1760
Turn: 1302-1304

[6] What is the mass of Non-muscle isozyme? 21075,715Da


[7] Find the 3-dimensional protein structure IDs. Find whether each structure has been resolved
by X-ray crystallography or NMR spectroscopy.
2CQV NMR, 2K0F NMR, 2YR3 NMR, 5JQA X-ray, 5JTH X-ray, 6C6M X-ray

C- RETRIEVING STRUCTURAL DATA OF A PROTEIN USING PDB DATABASE

Objective
• To study proteins and their structure.
• To introduce PDB and its importance.
• To learn how to retrieve structural data of a protein using PDB database.
• To describe the PDB file format.

Proteins
Proteins are fundamental components of all living cells. They perform a large variety of biological tasks.
The structure of a protein is highly conserved and determines its function. The primary
structure of a protein is a linear sequence of amino acids. It is synthesized during the
translation of mRNA, which is itself transcribed from DNA. DNA (deoxyribonucleic acid) is the genetic
material that contains all the genetic information for the development and maintenance of functions in
all living organisms. The information is stored as a genetic code using four types of bases: adenine
(A), guanine (G), cytosine (C) and thymine (T). In the two strands of DNA, adenine always pairs with
thymine and guanine pairs with cytosine. Each base is bonded to a sugar and a
phosphate group to form a nucleotide. The base pairing of DNA results in a ladder-shaped
structure of the two strands, which is twisted into a double helix.
Intermolecular and intramolecular hydrogen bonding between the amide groups in the primary
structure of a protein forms its secondary structure. The attraction of a hydrogen atom towards an
electronegative atom (N, O, F etc.) within the same molecule is called intramolecular hydrogen bonding,
and that between two different molecules is called intermolecular hydrogen bonding. Alpha helices and
beta sheets are the two important secondary structures in proteins.
PDB
The Protein Data Bank (PDB) is a database containing information about the 3D structure of proteins. PDB
files contain experimentally determined 3D structures of biological macromolecules. The structural
information of a protein can be determined by X-ray crystallography or Nuclear Magnetic Resonance (NMR)
spectroscopy. In X-ray crystallography, X-rays are diffracted by the electrons of the atoms in a crystal,
resulting in patterns obtained as small spots on an X-ray film. These patterns are used to calculate
the coordinates of atoms in a protein. NMR spectroscopy is also
used for determining the structure of molecules. The nucleus of an atom that is located in a strong
magnetic field can absorb electromagnetic radiation of a particular frequency. Electromagnetic
radiation is a form of energy that contains both electric and magnetic fields. This type of radiation
includes X-rays, gamma rays, radio waves, visible light etc. PDB files also contain
information on the data collected, molecule name, primary and secondary structure, ligands, atomic
coordinates, crystallographic structure factors, NMR experimental data etc. The data are
submitted by scientists from all over the world. The PDB is maintained by the Worldwide Protein Data
Bank. Each entry in the PDB is provided with a unique identifier called the PDB ID.
It is a 4-character identifier consisting of alphanumeric characters.

All data in the PDB are accessible to the public. There are databases which contain data derived from
the PDB, for example Structural Classification of Proteins (SCOP), which groups different protein
structures; HSSP (Homology-Derived Secondary Structure of Proteins), for the 3D structure and 1D
sequence of proteins; and CATH (Class, Architecture, Topology and Homologous superfamily), for
protein structure classification according to evolution. The PDB allows users to search for
information regarding structure, sequence and function, and to visualize, download and assess
molecules.
Resolution: It is a measure of the quality of the data that has been collected on the crystal
containing the protein or nucleic acid. If all of the proteins in the crystal are aligned in an identical
way, forming a very perfect crystal, then all of the proteins will scatter X-rays the same way, and
the diffraction pattern will show the fine details of crystal. On the other hand, if the proteins in the
crystal are all slightly different, due to local flexibility or motion, the diffraction pattern will not
contain as much fine information. So resolution is a measure of the level of detail present in the
diffraction pattern and the level of detail that will be seen when the electron density map is
calculated. High-resolution structures, with resolution values of 1 Å or so, are highly ordered and
it is easy to see every atom in the electron density map. Lower resolution structures, with resolution
of 3 Å or higher, show only the basic contours of the protein chain, and the atomic structure must
be inferred. Most crystallographic-defined structures of proteins fall in between these two
extremes. As a general rule of thumb, we have more confidence in the location of atoms in
structures with resolution values that are small, called "high-resolution structures".

Ligand Protein Contacts : Residues forming contact with the ligand can be analyzed using
Ligand protein contact server (LPC). Different types of interactions such as hydrophobic, aromatic
and hydrogen bonds can be analyzed between contacting residues and the ligand.

PDB File Format

The PDB file format is the standard file format for protein structure files. It describes how
molecules are held together in the 3-D structure of a protein. The file contains hundreds or thousands
of lines called records, which describe the protein. Figure 1 shows certain parts of a PDB
formatted file for deoxyhemoglobin.

Figure 1. Certain parts of a PDB formatted file for deoxyhemoglobin

Each record provides a different set of information like:
• The HEADER record contains the classification of the molecule, the deposition date of the data
at PDB and the ID code (unique PDB identifier).
• The TITLE record contains title of the PDB entry.
• The COMPND record includes the protein name. The specification list describes the
molecular component.
• The SOURCE record contains the name of the organism from which the particular protein is
obtained.
• The KEYWDS record contains keywords that describe the protein. It includes
functional classification, metabolic role, biological or chemical activity and structural
classification.
• The EXPDTA record contains the method used for the protein structure experiment. E.g.
X-ray diffraction, electron crystallography etc.
• The AUTHOR record contains the names of the contributors who deposited the data in the PDB
database.
• The REVDAT record contains revision date of the data related to the protein. It includes
the date of modification and the type of modification.
• The JRNL record contains journal details of the literature that has been reported about the
protein.
• The REMARK record contains the reference to journal about the protein and other remarks
about the protein structure.
• The DBREF record contains the reference to the protein in the sequence databases. It
contains ID code of the entry, Chain identifier, Initial sequence number of the PDB
sequence segment, Initial insertion code of the PDB sequence segment, the ending
sequence number of the PDB sequence segment, ending insertion code of the PDB
sequence segment, sequence database name, sequence database accession code, sequence
database identification code, initial sequence number of the database segment, initial
residue of the segment for PDB reference, ending sequence number of the database
segment, insertion code of the ending residue of the segment for PDB reference.

• The SEQADV record contains the difference between named sequence database and the
PDB. It includes ID code of the entry, name of the PDB residue in conflict, PDB chain
identifier, PDB sequence number, PDB insertion code, sequence database accession
number, sequence database residue name, sequence database sequence number and
Conflict comment.
• The SEQRES record contains information about the amino acid sequence of protein. It
includes serial number of the SEQRES record for the current chain, chain identifier and
number of residues in the chain.
• The HET record contains details about the non protein substances in protein. It contains
HET identifier, chain identifier, sequence number, insertion code, the number of HETATM
records present in the entry and the text describing Het group.
• The HETNAM record contains the compound name of non standard residues. It contains
HET identifier and the chemical name.
• The HETSYN record contains synonyms for the compound names of non-standard residues.
• The FORMUL record contains the chemical formula of non-standard residues.
• The HELIX record identifies helical substructures. It includes Serial number
of the helix, Helix identifier, Name of the initial residue, Chain identifier for the chain
containing the helix, Sequence number of the initial residue, Insertion code of the initial
residue, Name of the terminal residue of the helix, Chain identifier for the chain containing
the helix, sequence number of the terminal residue, Insertion code of the terminal residue,
comment about this helix and Length of this helix.
• The LINK record identifies inter-residue bonds. It contains atom name,
alternate location indicator, residue name, chain identifier, residue sequence number,
insertion code, atom name, alternate location indicator, residue name, chain identifier,
residue sequence number, insertion code, symmetry operator atom 1, symmetry operator
atom 2 and link distance.
• The SITE record identifies groups of residues that form important sites. It shows the sequence
number, site name, number of residues that compose the site, residue name for first residue
that creates the site, chain identifier for first residue of site, residue sequence number for
first residue of the site, insertion code for first residue of the site, residue name for second
residue that creates the site, chain identifier for second residue of the site, residue sequence
number for second residue of the site, insertion code for second residue of the site, residue
name for third residue that creates the site, chain identifier for third residue of the site,
residue sequence number for third residue of the site, insertion code for third residue of the
site, residue name for fourth residue that creates the site, chain identifier for fourth residue
of the site, residue sequence number for fourth residue of the site, insertion code for fourth
residue of the site.
• The ORIGXn record shows the transformation from orthogonal coordinates to the
submitted coordinates.
• The SCALEn record shows the transformation from orthogonal coordinates to fractional
crystallographic coordinates.
• The ATOM record contains the atomic coordinates for the structure. It contains the atom serial
number, atom name, alternate location indicator, residue name, chain identifier, residue sequence
number, code for insertion of residues, orthogonal coordinates for X, Y and Z respectively in
angstroms, occupancy, temperature factor, element symbol and charge on the atom.
• The TER record indicates the termination of a series.
• The HETATM record contains the atomic coordinate records for non standard residues. It
includes the atom serial number, atom name, alternate location indicator, residue name,
chain identifier, residue sequence number, code for insertion of residues, orthogonal
coordinates for X, Y and Z respectively, occupancy, temperature factor, element symbol,
charge on the atom.
• The CONECT record contains the details about the bonds involved in non-protein atoms.
• The MASTER record contains number of REMARK records, number of HET records, number
of HELIX records, number of SHEET records, deprecated, number of SITE records,
number of coordinate transformation records, number of atomic coordinate records,
number of TER records, number of CONECT records and number of SEQRES records.
• END records represent the end of a file.
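
Because every field of an ATOM record sits in fixed columns, a record can be read by slicing the line. The sketch below is a minimal illustration in Python, assuming the standard PDB column layout; the function name `parse_atom_record` and the sample line are made up for the example and are not taken from a real entry.

```python
# A made-up ATOM line, assembled field by field so that the fixed-width
# columns are visible. Values are illustrative only.
SAMPLE_ATOM_LINE = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: unused
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "GLY"         # columns 18-20: residue name
    " "           # column 21: unused
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    " "           # column 27: insertion code
    "   "         # columns 28-30: unused
    "  -8.863"    # columns 31-38: X coordinate (angstroms)
    "  16.944"    # columns 39-46: Y coordinate
    "  14.289"    # columns 47-54: Z coordinate
    "  1.00"      # columns 55-60: occupancy
    " 21.88"      # columns 61-66: temperature factor
    "          "  # columns 67-76: unused
    " N"          # columns 77-78: element symbol
    "  "          # columns 79-80: charge
)

def parse_atom_record(line: str) -> dict:
    """Split one ATOM or HETATM line into named fields (0-based slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "alt_loc": line[16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "ins_code": line[26].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
        "charge": line[78:80].strip(),
    }

atom = parse_atom_record(SAMPLE_ATOM_LINE)
print(atom["res_name"], atom["chain"], atom["res_seq"], atom["x"], atom["y"], atom["z"])
```

The same slices apply to HETATM records, since the two record types share one layout.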
Exercise 1:
• Obtain the information such as primary citation, molecular description, source, related
PDB entries, and Ligand Chemical component of Human serum albumin.
primary citation:

molecular description: Human serum albumin (HSA) is an abundant plasma protein that
binds a remarkably wide range of drugs, thereby restricting their free, active concentrations.
Total Structure Weight: 134343.86, Atom Count: 8669, Residue Count: 1170, Unique
protein chains: 1
source: Homo sapiens
related PDB entries: 2BXP, 2BXO, 2BXN, 2BXM, 2BXI, 2BXH, 2BXD, 2BXC, 2BXK, 2BXQ, 2BXL,
2BXG, 2BXF, 2BXE, 2BXB, 2BXA
Ligand: AZAPROPAZONE
• Identify the molecule name for the PDB ID 4FG3, download the FASTA sequence and the
PDB file for the same. How is FASTA format different from bare sequence?

Molecule name: Crystal Structure Analysis of the Human Insulin

FASTA sequence:
>4FG3:A|PDBID|CHAIN|SEQUENCE
GIVEQCCTSICSLYQLENYCN
>4FG3:B|PDBID|CHAIN|SEQUENCE
FVNQHLCGSHLVEALYLVCGERGFFYTPKT
>4FG3:C|PDBID|CHAIN|SEQUENCE
GIVEQCCTSICSLYQLENYCN
>4FG3:D|PDBID|CHAIN|SEQUENCE
FVNQHLCGSHLVEALYLVCGERGFFYTPKT
A sequence in FASTA format begins with a single-line description, followed by lines
of sequence data. The description line is distinguished from the sequence data by a
greater-than (">") symbol in the first column. It is recommended that all lines of text be
shorter than 80 characters in length. Bare Sequence may be just lines of sequence data,
without the FASTA definition line. It can also be sequence interspersed with numbers
and/or spaces, such as the sequence portion of a GenBank/GenPept flatfile report.
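
The difference between the description line and the sequence lines can be seen by reading the FASTA text above with a short script. This is a minimal sketch; the function name `parse_fasta` is illustrative, and only chains A and B from the answer above are used.

```python
def parse_fasta(text: str) -> dict:
    """Map each description line (without the leading '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]        # description line starts a new record
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence data may span several lines
    return records

FASTA_4FG3 = """\
>4FG3:A|PDBID|CHAIN|SEQUENCE
GIVEQCCTSICSLYQLENYCN
>4FG3:B|PDBID|CHAIN|SEQUENCE
FVNQHLCGSHLVEALYLVCGERGFFYTPKT
"""

for description, sequence in parse_fasta(FASTA_4FG3).items():
    print(description, len(sequence))
```

A bare sequence would contain only the second kind of line, so a reader of that format has no description to attach to the residues.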

• Identify the residues that form secondary structures such as sheet and alpha helix.

• Which record in the PDB file contains the chemical formula of non-standard residues? Write
the chemical formula of the non-standard residues present in 4FG3.
The FORMUL record in the PDB file contains the chemical formula of non-standard residues; the
Heteroatom (HETATM) records hold their atomic coordinates. Chemical formula: Zn, Cl, C

Steps:
1.
➢ Open the link www.rcsb.org
➢ Enter the query name as Human serum albumin
➢ Retrieve primary citation, molecular description, source, related PDB entries, and Ligand
Chemical component
2.
➢ Enter the query name as 4FG3
➢ On the right side there is an option labelled “Download Files”
➢ Download the FASTA sequence
3.
➢ Download the PDB file
➢ Look for HELIX and SHEET record.
➢ Note down the residues that form helix and sheet
4.
➢ Download the PDB file
➢ Look for the Heteroatom (HETATM) records. These records contain details about the non-
protein substances, e.g. Zn, Cl etc.
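
Scanning a downloaded file for HETATM records can also be done with a short script. The sketch below is only an illustration: the function name `hetero_residue_counts` is hypothetical, and the sample lines are made up and truncated after the residue sequence number, so they do not reproduce the contents of 4FG3 or any other entry.

```python
def hetero_residue_counts(pdb_text: str) -> dict:
    """Count HETATM records per residue name (columns 18-20 of each line)."""
    counts = {}
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            res_name = line[17:20].strip()
            counts[res_name] = counts.get(res_name, 0) + 1
    return counts

# Made-up, truncated lines: record name, serial, atom name, residue name,
# chain identifier and residue sequence number only.
SAMPLE_PDB_TEXT = "\n".join([
    "ATOM  " + "    1" + " " + " N  " + " " + "GLY" + " " + "A" + "   1",
    "HETATM" + "  401" + " " + "ZN  " + " " + " ZN" + " " + "B" + " 101",
    "HETATM" + "  402" + " " + "CL  " + " " + " CL" + " " + "B" + " 102",
    "HETATM" + "  403" + " " + " O  " + " " + "HOH" + " " + "A" + " 201",
    "HETATM" + "  404" + " " + " O  " + " " + "HOH" + " " + "A" + " 202",
])

print(hetero_residue_counts(SAMPLE_PDB_TEXT))
```

On a real file, the text would be read with `open(path).read()`, and the residue name HOH in the result gives the number of water records.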
Exercise 2:
• Find out which structure the PDB ID “1cdw” denotes

• Where was this structure first published, and what does the molecule do?
This structure was first published in Proc. Natl. Acad. Sci. USA.
The TATA box-binding protein (TBP) is required by all three eukaryotic RNA polymerases
for correct initiation of transcription of ribosomal, messenger, small nuclear, and transfer
RNAs.

• What is the molecular composition of this structure? Which functional classification does
it belong to?
Amino acids
TRANSCRIPTION INITIATION, DNA BINDING, COMPLEX
(TRANSCRIPTION FACTOR/DNA), TRANSCRIPTION/DNA COMPLEX
• What host species was used to clone this gene?
Escherichia coli
• How many domains are observed for SCOP, CATH and pfam?
SCOP-2, CATH-2, pfam-0
• What is NDB? Find the NDB ID associated with this structure.
NDB stands for Nucleic Acid Database. It gives 3-dimensional structural information
about nucleic acids. NDB ID: PDT034
• From the downloaded PDB file find the positions of HOH. From which part of the record
can you collect the information?
280
STEPS
➢ Open the link www.rcsb.org
➢ Enter the query name as 1cdw
➢ Retrieve the information as mentioned in the question.
➢ To find the positions of HOH, download the PDB file and look in HETATM
Exercise 3:
• How many structures of TATA Binding Proteins have been resolved from humans only
(hint: use Boolean Operators)?
71
• Conduct a second search to look for TATA Binding Proteins that have been resolved from
species other than humans. How many did you find and what was the range of species
represented (hint: use Boolean Operators)?
81
STEPS
➢ Boolean operators include AND, OR, NOT
➢ Range of species → check in query refinements → Organism

Exercise 4: LIGAND PROTEIN CONTACTS
• Analyze the ligand-protein contacts for 4FG3. Note down the different ligands present
in the structure
Zinc ion, Chloride ion, Glycerol
• Find the residues that form contacts with Glycerol
GLN 4, HIS 5
• Find the residue that has minimum distance
GLN 4
• Find the residue that forms maximum number of contacts.
GLN 4 - 2 contacts
STEPS
➢ Go to links → Structure features → click Analysis of Ligand-Protein Contacts (LPC)
➢ Choose the respective ligand and click RUN
➢ Choose contacts sorted by residues
➢ To find the maximum number of contacts → choose contacts sorted by contact types and
find which residue forms maximum contacts
EXERCISE-PDB

Exercise 5:

A. Highlight the number of amino acid residues of PDB ID 2F51. Comment on the experimental data
for the above structure.
X-ray diffraction method
Method: Vapor Diffusion, Hanging Drop
pH: 4.9
Temp: 293 K
B. Show the HETATM molecules present in PDB ID 4FYU. Calculate the physical and
chemical properties of the above HETATM molecules.
C. Retrieve the FASTA sequence of the protein with PDB ID 4J56.
D. Download the PDB structure of the protein Human Thioredoxin (reduced form). Comment
on the experimental data for the above structure.
X ray diffraction method

EXPERIMENT -2-SEQUENCE ALIGNMENT

PAIRWISE SEQUENCE ALIGNMENT

• Pairwise sequence alignment is the most fundamental operation of bioinformatics.

• Sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify
regions of similarity that may be a consequence of functional, structural, or evolutionary
relationships between those sequences.

• Computational approaches to sequence alignment generally fall into two categories: global
alignments and local alignments.
GLOBAL ALIGNMENT

• Global alignments attempt to align every residue in every sequence. Those alignments are
most useful when the sequences in the query set are similar and of roughly equal size. (This
does not mean global alignments cannot end in gaps.)

• A general global alignment technique is the Needleman–Wunsch algorithm, which is based
on dynamic programming.
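The dynamic programming idea behind global alignment can be sketched in a few lines. This is an illustrative sketch only: the function name and the simple match/mismatch/linear-gap scores are assumptions made for the example, whereas tools such as EMBOSS Needle score residues with a substitution matrix and use separate gap opening and extension penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two residues, or put a gap in either sequence.
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Every residue of both sequences is aligned, so the answer is the last cell.
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # three matches and one gap
```

Because the score is read from the last cell of the table, every residue of both sequences takes part in the alignment, which is what makes the alignment global.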
LOCAL ALIGNMENT

• Local alignments are more useful for dissimilar sequences that are suspected to contain
regions of similarity or similar sequence motifs within their larger sequence context.

• The Smith–Waterman algorithm is a general local alignment method also based on
dynamic programming.
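For comparison, a local alignment score can be sketched with the same kind of table. Again this is a simplified illustration with assumed scores and a hypothetical function name, not the implementation used by EMBOSS Water: cell values are never allowed to fall below zero, and the result is the highest value found anywhere in the table.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 lets an alignment restart anywhere in either sequence.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# The shared motif ACG is found even though the flanking regions differ.
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```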
General idea of global and local alignment is given in Tables 1 and 2.

Global alignment
As already said, a global alignment tries to align sequences over their entire length using the
Needleman–Wunsch algorithm. Global alignment is ideal if you are searching for similarity
between two orthologous, paralogous or otherwise homologous genes, DNA regions or proteins.

• The submission page has three parts, each representing a step in performing the alignment.

• The first step is to input the sequences to be aligned. This can be done in several
different ways. The sequences can be entered in the dialog boxes as plain text representing
DNA/RNA or protein sequences, or in one of the supported formats (GCG, FASTA, EMBL,
GenBank, PIR, NBRF, Phylip or UniProtKB/SwissProt). Sequences can also be uploaded
in a file (which we will do!) in one of the supported formats.

• The second step is to set the alignment options.

• The third step is to submit the job. You just need to click on the Submit button.
OUTPUT
The output has three parts.
• The first part contains information about the program and algorithm. You can see which
program has been used, when the job was done, and how to perform the job using command
line on your computer.

• The second part is about your alignment. It lists the options used as alignment parameters
and the sequences used as input, and, most importantly, it contains the result summary.

• The third part of the alignment results is the alignment itself. It is composed of three lines.
The first and the third line represent the aligned sequences. The second line represents the
similarity between them.

• Vertical bars (|) are put at the positions in the alignment where the residues of the two
sequences are the same.

• Colons (:) mark positions in the alignment where the sequences do not have
the same amino acid, but the residues have similar physicochemical properties (for
example, glutamate and aspartate).

• Finally, a full stop (.) is put at the positions where the residues in the alignment are
neither identical nor closely similar.

• At the positions in the alignment where gaps are present in the sequences, a blank space
is left.
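The identity and similarity figures in the result summary can be recomputed from the three alignment lines described above. The sketch below assumes the markup conventions just listed (| for identical, : for similar) and uses a hypothetical function name and a made-up six-column alignment; it is not how EMBOSS itself computes the summary.

```python
def alignment_stats(seq1, markup, seq2):
    """Summarise a three-line pairwise alignment block."""
    if not (len(seq1) == len(markup) == len(seq2)):
        raise ValueError("all three lines must have the same length")
    length = len(markup)
    identical = markup.count("|")
    # Similarity counts identical positions plus similar (:) positions.
    similar = identical + markup.count(":")
    gaps = sum(1 for x, y in zip(seq1, seq2) if x == "-" or y == "-")
    return {
        "length": length,
        "identity": identical,
        "similarity": similar,
        "gaps": gaps,
        "identity_pct": round(100.0 * identical / length, 1) if length else 0.0,
        "similarity_pct": round(100.0 * similar / length, 1) if length else 0.0,
    }


# Made-up example: 3 identities, 1 similar pair, 1 gap in 6 columns.
print(alignment_stats("MKT-AE", "|:| .|", "MRTGSE"))
```

Percentages are taken over the alignment length, which matches the way values such as 134/275 are reported in the exercises.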
To do pairwise sequence alignment

• Go to the EBI main page (http://www.ebi.ac.uk/).

• Choose Services → Proteins → EMBOSS Tools

Step 1.
• The first step is to retrieve the sequences of the proteins.

• Go to the UniProtKB database and enter the respective ID in the search bar. Click on the
FASTA link in the “Sequence” section.
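Once a FASTA file has been saved, its contents can be read with a few lines of code. This is a minimal sketch with a hypothetical function name; it keeps only the first word of each header line as the record name and joins the sequence lines that follow it.

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the name is the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


example = ">seq1 example protein\nMKTAY\nIAKQR\n>seq2\nACGT\n"
print(parse_fasta(example))
```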
Step 2.
• Go to the EBI main page (http://www.ebi.ac.uk/).

• Choose Services → Proteins → EMBOSS Tools

On the Pairwise Sequence Alignment page, there are a variety of options. You can choose
between global and local alignment, each of them containing several different algorithms for
performing the alignment. In this exercise, we will first use the Needleman–Wunsch algorithm to
perform the alignment.
Step 3.
• Click on the Protein link by the Needle (EMBOSS) tag

Step 4.
• Upload the files (Saccharomyces_cerevisiae_pho5.fasta and
Candida_albicans_acid_phosphatase.fasta) with the sequences you want to align and perform the alignment by
pressing the Submit button

MULTIPLE SEQUENCE ALIGNMENT
Multiple sequence alignment (MSA) is a sequence alignment of three or more biological
sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is
assumed to have an evolutionary relationship by which they share a lineage and are descended
from a common ancestor. From the resulting MSA, sequence homology can be inferred and
phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.
Multiple sequence alignment is often used to assess sequence conservation of protein domains,
tertiary and secondary structures, and even individual amino acids or nucleotides.
Conserved domains (CD) in proteins play a crucial role in protein interactions, DNA binding,
enzyme activity, and other important cellular processes. Protein domains are often conserved
across many species, and as such, they offer an interesting dataset in how genomes maintain them
with relationship to other conserved domains, as well as to proteome size.
A phylogram is a branching diagram (tree) assumed to be an estimate of a phylogeny, in which branch lengths
are proportional to the amount of inferred evolutionary change. A cladogram is a branching
diagram (tree) assumed to be an estimate of a phylogeny where the branches are of equal length;
thus cladograms show common ancestry, but do not indicate the amount of evolutionary "time"
separating taxa.
ClustalW2 is a general purpose multiple sequence alignment program for DNA or proteins.
It attempts to calculate the best match for the selected sequences, and lines them up so that the
identities, similarities and differences can be seen.
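Once the sequences are lined up, fully conserved positions can be picked out column by column. The sketch below is an illustration with a hypothetical function name and a made-up three-sequence alignment; it reports only columns where every sequence has the same residue and no gap, which is stricter than the conservation symbols printed by Clustal programs.

```python
def conserved_columns(aligned):
    """Return 0-based indices of columns where all aligned sequences agree."""
    if len(set(map(len, aligned))) != 1:
        raise ValueError("aligned sequences must all have the same length")
    columns = []
    for index, column in enumerate(zip(*aligned)):
        # A column is conserved if it has no gap and a single distinct residue.
        if "-" not in column and len(set(column)) == 1:
            columns.append(index)
    return columns


print(conserved_columns(["MK-TA", "MKGTA", "MRGTA"]))
```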
Step 1 – Sequence
Sequence Input Window
Three or more sequences to be aligned can be entered directly into this form. Sequences can
be in GCG, FASTA, EMBL, PIR, NBRF or UniProtKB/Swiss-Prot format. Partially formatted
sequences are not accepted. Adding a return to the end of the sequence may help certain
applications understand the input.

• Sequence Type
Indicates if the sequences to align are protein or nucleotide (DNA/RNA).

Step 2 - Pairwise Alignment Options


• Alignment Type
The alignment method used to perform the pairwise alignments used to generate the guide tree.

Alignment Type   Description             Abbreviation
slow             Slow, but accurate      slow
fast             Fast, but approximate   fast

Default value is: slow

• Protein Weight Matrix (PW)


Slow pairwise alignment protein sequence comparison matrix series used to score alignment.

Matrix (Protein Only)   Description   Abbreviation
BLOSUM                                blosum
PAM                                   pam
Gonnet                                gonnet
ID                                    id

Default value is: Gonnet [gonnet]

• DNA Weight Matrix (PW)


Slow pairwise alignment nucleotide sequence comparison matrix used to score alignment.

Matrix (DNA Only)   Description   Abbreviation
IUB                               iub
ClustalW                          clustalw

Default value is: IUB [iub]

• Gap Open (PW)


Slow pairwise alignment score for the first residue in a gap.

Default value is: 10

• Gap Extension (PW)


Slow pairwise alignment score for each additional residue in a gap.

Default value is: 0.1
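Read literally, the two options above give a gap of n residues a cost of one opening penalty plus n - 1 extension penalties. The sketch below only illustrates that arithmetic with the default values quoted here; the function name is hypothetical, and ClustalW itself may adjust the penalties depending on position and residue type, which is not reproduced.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.1):
    """Cost of a single gap of the given length under an affine gap model."""
    if length <= 0:
        return 0.0
    # First residue pays the opening penalty; each further residue pays the extension.
    return gap_open + (length - 1) * gap_extend


for n in (1, 2, 5):
    print(n, gap_penalty(n))
```

With these defaults one long gap is much cheaper than several short ones, which is why alignments tend to group gaps together.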

• KTUP
Fast pairwise alignment word size used to find matches between the sequences. Decrease for
sensitivity; increase for speed.

Default value is: 1

• Window Length
Fast pairwise alignment window size for joining word matches. Decrease for speed; increase for
sensitivity.

Default value is: 5
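The effect of the word size can be seen by listing the words (k-tuples) that two sequences share. This is a small sketch with a hypothetical function name and made-up sequences; it shows only the word-matching idea, not the full fast pairwise method used by ClustalW.

```python
def shared_ktuples(a, b, k=1):
    """Return the sorted list of words of length k found in both sequences."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return sorted(words_a & words_b)


# A larger word size gives fewer, more specific matches.
print(shared_ktuples("ACGTAC", "TTACGG", k=2))
print(shared_ktuples("ACGTAC", "TTACGG", k=3))
```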

• Score Type
Fast pairwise alignment score type to output.

Score Type   Description   Abbreviation
percent                    percent
absolute                   absolute

Default value is: percent

• Top Diags
Number of match regions (top diagonals) used to create the fast pairwise alignment.
Decrease for speed; increase for sensitivity.

Default value is: 5

• Pair Gap
Fast pairwise alignment gap penalty for each gap created.

Default value is: 3

Step 3 - Multiple Sequence Alignment Options


• Protein Weight Matrix
Multiple alignment protein sequence comparison matrix series used to score the alignment.

Matrix (Protein Only) Description Abbreviation

BLOSUM blosum

PAM pam

Gonnet gonnet

ID id

Default value is: Gonnet [gonnet]

• DNA Weight Matrix
Multiple alignment nucleotide sequence comparison matrix used to score the alignment.

Matrix (DNA Only)   Description   Abbreviation
IUB                               iub
ClustalW                          clustalw

Default value is: IUB [iub]

• Gap Open
Multiple alignment penalty for the first residue in a gap.

Default value is: 10

• Gap Extension
Multiple alignment penalty for each additional residue in a gap.

Default value is: 0.20

• Gap Distances
Multiple alignment gaps that are closer together than this distance are penalised.

Default value is: 5

• No End Gaps
Multiple alignment option to disable the gap separation penalty when scoring gaps at the ends of the alignment.

No End Gaps   Description   Abbreviation
no                          false
yes                         true

Default value is: no [false]

• Iteration
Multiple alignment improvement iteration type

Iteration   Description                                   Abbreviation
none        No iteration                                  none
tree        Iteration at each step of alignment process   tree
alignment   Iteration only on final alignment             alignment

Default value is: none

• Num Iter
Maximum number of iterations to perform

Default value is: 1

• Clustering
Clustering type.

Clustering   Description                               Abbreviation
NJ           Neighbour-joining (Saitou and Nei 1987)   NJ
UPGMA        UPGMA clustering                          UPGMA

Default value is: NJ

• Output

Format for generated multiple sequence alignment.

Output Format         Description                                               Abbreviation
Clustal w/ numbers    Clustal alignment format with base/residue numbering      aln1
Clustal w/o numbers   Clustal alignment format without base/residue numbering   aln2
GCG MSF               GCG Multiple Sequence File (MSF) alignment format         gcg
PHYLIP                PHYLIP interleaved alignment format                       phylip
NEXUS                 NEXUS alignment format                                    nexus
NBRF/PIR              NBRF or PIR sequence format                               pir
GDE                   GDE sequence format                                       gde
Pearson/FASTA         Pearson or FASTA sequence format                          fasta

Default value is: Clustal w/ numbers [aln1]

• Order

The order in which the sequences appear in the final alignment

Order Description Abbreviation

aligned Determined by the guide tree aligned

input Same order as the input sequences input

Default value is: aligned

Step 4 - Submission

EXERCISE – EMBOSS, CLUSTAL OMEGA, KEGG BLAST & FASTA

A.EMBOSS

Go to the UniProt home page and type the protein name or ID in the search bar. Then
go to the EMBOSS tools and click NEEDLE and WATER for pairwise alignment.

Exercise 1:

1. Retrieve the following 2 protein sequences, Q63716 and O08807, using UniProt. Perform the
pairwise alignment between the above 2 sequences. Report the following values/observations
from the alignment.
1. Alignment Score 726;729
2. Alignment Length 275; 192
3. % and fraction identity 48.7%(134/275); 69.8%(134/192)
4. % and fraction similarity 58.2%(160/275); 82.8%(159/192)
2. Perform the Global and Local alignment between 2 sequences P29600 and P41363 using
EMBOSS tools.
The local and global alignments are given below:

3. Compare Savinase (P29600) to the human peptidase (P29144) by global alignment (Needle). Report
the following values/observations from the alignment.
1. Alignment Score: 25.0
2. Alignment Length: 1317
3. % and fraction identity (Identity includes perfect matches only): 58/1317 (4.4%)
4. % and fraction similarity (Similarity includes perfect matches and close matches): 87/1317
(6.6%)
4. Repeat the alignment again using local alignment algorithm (Water) and report the same results
as above.

1. Alignment Score: 53.5
2. Alignment Length: 196
3. % and fraction identity (Identity includes perfect matches only): 38/196 (19.4%)
4. % and fraction similarity (Similarity includes perfect matches and close matches): 60/196
(30.6%)

B. MULTIPLE SEQUENCE ALIGNMENT USING CLUSTAL OMEGA


Exercise 2:
Go to the Clustal Omega home page, paste the protein sequences in the input box and
click the Submit button for multiple sequence alignment.

1. Retrieve the following sequences and perform the multiple sequence alignment.
1. Albumin – Cannabis sativa

2. Albumin – Trifolium pratense

3. Albumin – Helianthus annuus

2. Retrieve the thioredoxin protein sequence from Rattus norvegicus, Lysobacter silvestris,
Handroanthus impetiginosus. Perform the multiple sequence alignment.

C. BLAST & FASTA


Exercise 3:
1. A sequence of unknown origin and unknown function is given below:
MTTCSRQFTSSSSMKGSCGIGGGIGAGSSRISSVLAGGSCRAPNTYGGGLSVSSSRFSSGG
AYGLGGGYGGFSSSSSSFGSGFGGGYGGGLGAGLGGGFGGGFAGGDGLLVGSEKVTM
QNLNDRLASYLDKVRALEEANADLEVKIRDWYQRQRPAEIKDYSPYFKTIEDLRNKILT
ATVDNANVLLQIDNARLAADDFRTKYETELNLRMSVEADINGLRRVLDELTLARADLE
MQIESLKEELAYLKKNHEEEMNALRGQVGGDVNVEMDAAPGVDLSRILNEMRDQYEK
MAEKNRKDAEEWFFTKTEELNREVATNSELVQSGKSEISELRRTMQNLEIELQSQLSMK
ASLENSLEETKGRYCMQLAQIQEMIGSVEEQLAQLRCEMEQQNQEYKILLDVKTRLEQE
IATYRRLLEGEDAHLSSSQFSSGSQSSRDVTSSSRQIRTKVMDVHDGKVVSTHEQVLRTK
N
Perform a Protein BLAST Search. Make an intelligent guess about the meaning of the Score, E-
value and % Identity.

i) What organism is this sequence from?

ii) What is the function and PTM involved in the protein?

iii) What other sequences are quite similar to this query sequence? What organisms are they
from, and what do they do?

iv) Download any 2 FASTA sequences of 90% identical proteins of above sequence.
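As background for interpreting the BLAST output, the E-value can be related to the bit score by the standard Karlin–Altschul relation E = m · n · 2^(-S'), where S' is the bit score and m and n are the query and database lengths. The sketch below uses a hypothetical function name and made-up lengths; BLAST itself uses effective (edge-corrected) lengths, so its reported values will differ somewhat.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Made-up search space: a 500-residue query against 1e8 database residues.
for bits in (30, 50, 100):
    print(bits, expect_value(bits, 500, 1e8))
```

Each additional bit of score halves the E-value, so a high score together with a very small E-value indicates a match that is unlikely to have arisen by chance.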

2. Measles morbillivirus, formerly measles virus (MeV), is a single-stranded, negative-sense,
enveloped, non-segmented RNA virus of the genus Morbillivirus within the
family Paramyxoviridae. It is the cause of measles. Retrieve any one protein sequence of Measles
morbillivirus and use BLASTP to search for similar sequences.

3. Perform a BLASTP search on O43790. Comment on the E-value, total score, maximum score
and query coverage. Provide a screenshot of the BLAST colour code key that yielded the answer.

4. Identify 10 homologous sequences of P68871 from various origins. Find the conserved region
existing between them and comment on it.

5. Perform a BLASTP search on myosin from Arabidopsis thaliana. Comment on the E-value, total
score, maximum score and query coverage. Provide a screenshot of the BLAST color code key
that yielded the answer.

6. Find the gene sequences of mouse origin similar to U80226.1

7. You have obtained a sequence from your sequencing experiment:
>q2
ACTGGCGCCGGCCGCGCTTAATGGCGCCGCTACAGGGCGCGTCCCATTCGCCATTCA
GGCTGCGCAACTGTTGGGAAGGGCGATCGGTCCGGGCCTCTTCGCTATTATCGCCAG
CTGGCGAAAGGGGGATGTGCTGCAAGGCGATTAAGTTGGGTAACGCCAGGGTTTTC
CCAGTCACGACGTTGTAAAACGACGGCCAGTGAATTGTAATACGACTCACTATAGG
GCGAATTGGAGCTCCACCGCGGTGGCGGCCGCTCTACACTAGTGGATCCCCCGACAT
TTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCT
GACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATG
AAGTTGGTGGTGAGGCCCTGGGC

BLAST this sequence against the nucleotide database (nucleotide collection (nr/nt)) with blastn.
Can you identify the gene that corresponds to this sequence? What is the source of the first 280
bases in the sequence?

D. KEGG
Exercise 4:

1. Go to the KEGG database and then go to the GENES database. Find the gene with ID 55697 (in
human, hsa).
2. Retrieve the amino acid and nucleotide sequences of the above gene ID.
>hsa:55697 K15305 vacuole morphology and inheritance protein 14 | (RefSeq) VAC14,
ArPIKfyve, TAX1BP2, TRX; VAC14 component of PIKFYVE complex
(A)MNPEKDFAPLTPNIVRALNDKLYEKRKVAALEIEKLVREFVAQNNTVQIKHVIQTLSQEFALSQHPHSRKGGLIGLAACSIAL
GKDSGLYLKELIEPVLTCFNDADSRLRYYACEALYNIVKVARGAVLPHFNVLFDGLSKLAADPDPNVKSGSELLDRLLKDIVTESN
KFDLVSFIPLLRERIYSNNQYARQFIISWILVLESVPDINLLDYLPEILDGLFQILGDNGKEIRKMCEVVLGEFLKEIKKNPSSVK
FAEMANILVIHCQTTDDLIQLTAMCWMREFIQLAGRVMLPYSSGILTAVLPCLAYDDRKKSIKEVANVCNQSLMKLVTPEDDELDE
LRPGQRQAEPTPDDALPKQEGTASGGPDGSCDSSFSSGISVFTAASTERAPVTLHLDGIVQVLNCHLSDTAIGMMTRIAVLKWLYH
LYIKTPRKMFRHTDSLFPILLQTLSDESDEVILKDLEVLAEIASSPAGQTDDPGPLDGPDLQASHSELQVPTPGRAGLLNTSGTKG
LECSPSTPTMNSYFYKFMINLLKRFSSERKLLEVRGPFIIRQLCLLLNAENIFHSMADILLREEDLKFASTMVHALNTILLTSTEL
FQLRNQLKDLKTLESQNLFCCLYRSWCHNPVTTVSLCFLTQNYRHAYDLIQKFGDLEVTVDFLAEVDKLVQLIECPIFTYLRLQLL
DVKNNPYLIKALYGLLMLLPQSSAFQLLSHRLQCVPNPELLQTEDSLKAAPKSQKADSPSIDYAELLQHFEKVQNKHLEVRHQRSG
RGDHLDRRVVL
>hsa:55697 K15305 vacuole morphology and inheritance protein 14 | (RefSeq) VAC14,
ArPIKfyve, TAX1BP2, TRX; VAC14 component of PIKFYVE complex
(N)atgaaccccgagaaggatttcgcgccgctcacgcctaacatcgtgcgcgccctcaatgacaagctgtacgaaaagcggaaggt
ggcagcgctggagatcgagaagctggtccgggagttcgtggcccagaacaataccgtgcaaatcaagcatgtgatccagaccctgt
cccaggagttgccctgtctagcacccccacagccggaaagggggcctcatcggcctggccgcctgctccatcgcactgggcaagga
ctcagggctctacctgaaggagctgatcgagccagtgctgacctgcttcaatgatgcagacagcaggctgcgctactatgcctgcg
aggccctctacaacatcgtcaaggtggcccggggcgctgtgctgccccacttcaacgtgctctttgacgggctgagcaagctggca
gccgacccagaccccaatgtgaaaagcggatctgagctcctagaccgccttttaaaggacattgtgactgagagcaacaagtttga
cctggtgagcttcatccccttgttgcgagagaggatttactccaacaaccagtatgcccggcagttcatcatctcctggatcctgg
ttctggagtcggtgccagacattaacctgctggattacctgccggagatcctggatggactcttccagatcctgggtgacaatggc
aaagagattcgcaaaatgtgtgaggttgttcttggagaattcttaaaagaaattaagaagaacccctccagtgtgaagtttgctga
gatggccaacatcctggtgatccactgccagacaacagatgacctcatccagctgacagccatgtgctggatgcgggagttcatcc
agctggcgggccgcgtcatgctgccttactcctccgggatctgactgctgtcttgccctgcttggcctacgatgaccgcaagaaaa
gcatcaaagaagtggccaacgtgtgcaaccagagcctgatgaagctggtcacccccgaggacgacgagctggatgagctgagacct
gggcagaggcaggcagagcccacccctgacgatgccctgccaaagcaggagggcacagccagtggaggtccagatggttcctgtga
ctccagcttcagtagcggcatcagtgtcttcactgcagccagcactgaaagagccccagtgacccttcacctcgacgggatcgtgc
aggtcctaaactgccacctcagtgacacggccattgggatgatgaccaggattgcagttctcaagtggctctaccacctctacata
aaactcctcggaagatgttccggcacacggacagcctctttcccatcctactgcagacgttatcggatgaatcggatgaggtgatc
ctgaaggacctggaggtgctggcagaaatcgcttcctcccccgcaggccagacggatgacccaggccccctcgatggccctgacct
ccaggccagccactcagagctccaggtgcccacccctggcagagccggcctactgaacacctctggtaccaaaggcttagaatgtt
ctccttcaactcccaccatgaattcttacttttataagttcatgatcaaccttctcaagagattcagcagcgaacggaagctcctg
gaggtcagaggccctttcatcatcaggcagctgtgcctcctgctaatgcggagaacatcttccactcaatggcagacatcctgctg
cgggaggaggacctcaagttcgcctcgaccatggtccacgccctcaacaccatcctgctgacctccacagagctcttccagctaag
gaaccagctgaaggacctgaagaccctggagagccagaacctgttctgctgcctgtaccgctcctggtgccacaacccagtcacca
cggtgtccctctgcttcctcacccagaactaccggcacgcctatgacctcatccagaagtttggggacctggaggtcaccgtggac
ttcctcgcagaggtggacaagctggtgcagctgattgagtgccccatcttcacatatctgcgcctgcagctgctggacgtgaagac
aacccctacctgatcaaggccctctacggcctgctcatgctcctgccgcagagcagcgccttccagctgctctcgcaccggctcca
gtgcgtgcccaaccctgagctgctgcagaccgaagacagtctaaaggcagcccccaagtcccagaaagctgactcccctagcatcg
actacgcagagctgctgcagcactttgagaaggtccagaacaagcacctggaagtgcggcaccagcggagcgggcgtggggaccac
ctggaccggagggttgtcctctga
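The same two sequences can also be requested programmatically through the KEGG REST interface instead of the web pages. The sketch below assumes the `get` operation with the `aaseq` and `ntseq` options at rest.kegg.jp; the function names are hypothetical, and the fetch function needs network access, so it is defined here but not called.

```python
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"


def kegg_sequence_url(entry, kind="aaseq"):
    """Build the KEGG REST URL for an amino acid (aaseq) or nucleotide (ntseq) sequence."""
    if kind not in ("aaseq", "ntseq"):
        raise ValueError("kind must be 'aaseq' or 'ntseq'")
    return f"{KEGG_REST}/get/{entry}/{kind}"


def fetch_kegg_sequence(entry, kind="aaseq"):
    """Download the FASTA text for a KEGG GENES entry (requires network access)."""
    with urlopen(kegg_sequence_url(entry, kind)) as response:
        return response.read().decode("utf-8")


print(kegg_sequence_url("hsa:55697", "aaseq"))
print(kegg_sequence_url("hsa:55697", "ntseq"))
```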
EXPERIMENT 3- PHYLOGENETIC TREES
AIM: To construct a phylogenetic tree using phylogeny.fr
DESCRIPTION: Phylogeny.fr is a free, simple to use web service dedicated to reconstructing
and analysing phylogenetic relationships between molecular sequences. It runs and connects
various bioinformatics programs to reconstruct a robust phylogenetic tree from a set of sequences.
PROCEDURE:
1. The sequence of interest and other related sequences were retrieved from their databases in
a single text file.

2. Open the phylogeny.fr website (www.phylogeny.fr)
3. Select the appropriate tool (One Click, Advanced or À la Carte) for phylogenetic analysis.
4. Upload the set of sequences in FASTA, EMBL or NEXUS format from a file.
5. On pressing the Submit button, the result page is displayed.
RESULT: Thus the phylogenetic tree was constructed using the phylogeny.fr tool, and the evolutionary
relationship between various species was studied.
EXERCISE-PHYLOGENETIC ANALYSIS

1. Identify 10 homologous sequences of P68871 from various origins. Find the conserved region
existing between them and comment on it. Comment on the evolutionary relationship between
the sequences.

EXPERIMENT 4: GENE ANNOTATION AND GENE FINDING

Aim: To identify and analyze gene structures in genomic DNA using a web-based gene finding tool.
Description:

Genome annotation is the process of identifying the locations of genes and all of the coding
regions in a genome and determining what those genes do. Eukaryotic nuclear genomes are much
larger than prokaryotic ones, with sizes ranging from 10 Mbp to 670 Gbp. They tend to have a very
low gene density. In humans, for instance, only about 3% of the genome codes for genes, about 1 gene per
100 kbp on average. The space between genes is often very large and rich in repetitive sequences
and transposable elements. Most importantly, eukaryotic genomes are characterized by a mosaic
organization in which a gene is split into pieces called exons by intervening non-coding sequences (introns).
The main issue in the prediction of eukaryotic genes is the identification of exons, introns and splice
sites. To date, numerous computer programs have been developed for identifying eukaryotic genes,
e.g. GENSCAN, FGENESH, and GRAIL.
GENSCAN is a program for predicting the locations and exon-intron structures of genes in genomic sequences
from a variety of organisms.

Procedure:
1. Type the following URL in your browser: http://hollywood.mit.edu/GENSCAN.html.
2. The first option asks for the organism type, with 3 options: vertebrate, Arabidopsis
thaliana and Zea mays.
3. The suboptimal exon cutoff may be specified in the next box. The options available are 1.00,
0.50, 0.25, 0.10, 0.05, 0.02 and 0.01.
4. The next step is to fill in the sequence name in the text box provided.
5. Next, specify the print option as either predicted peptides only or predicted CDS and peptides.
Choose the option predicted peptides only.
6. The sequence data can be pasted in the text box provided or uploaded as a data file using the Browse
option.
7. Click Run GENSCAN and note the results. Prepare a small report on the
following:
i) Number of exons predicted in the sequence submitted.
ii) The start and end points of each exon.
iii) Other information: poly-A sites, GC content and predicted peptides given by the GENSCAN
program.
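The GC content noted in the report can be cross-checked independently of the server. The following is a minimal sketch with a hypothetical function name; it counts only unambiguous bases (A, C, G, T), which is an assumption and may differ slightly from how GENSCAN treats ambiguous characters.

```python
def gc_content(sequence):
    """Percentage of G and C among the unambiguous bases of a DNA sequence."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


print(gc_content("ATGCGCTA"))
```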
RESULT:

Thus the locations and exon-intron structures of genes in genomic sequences were studied using
the GENSCAN web server.
EXPERIMENT 5A: DOCKING USING AUTODOCK TOOL
AIM: To perform a docking study of a protein with its known inhibitor using AutoDock and to analyze
the different interactions.

INTRODUCTION: AutoDock performs automated docking of flexible ligands to receptors. The
principle behind the docking program is to find the global minimum in the interaction between the
target protein and the substrate by exploring all available degrees of freedom for the system.
AutoDock uses Monte Carlo simulated annealing for configurational exploration along with
grid-based molecular affinity potentials for rapid energy evaluation.
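The idea of Monte Carlo simulated annealing can be illustrated on a toy problem. The sketch below is a generic illustration, not AutoDock's implementation: the function name, the one-dimensional "energy" function and all parameter values are assumptions chosen for the example. A random move is always accepted if it lowers the energy and is otherwise accepted with probability exp(-ΔE/T), while the temperature T is lowered at every cycle.

```python
import math
import random


def simulated_annealing(energy, start, step=0.5, temperature=1.0,
                        cooling=0.95, cycles=500, seed=0):
    """Minimise a 1-D energy function by Monte Carlo simulated annealing."""
    rng = random.Random(seed)
    current = best = start
    for _ in range(cycles):
        candidate = current + rng.uniform(-step, step)
        delta = energy(candidate) - energy(current)
        # Metropolis criterion: downhill moves always pass, uphill moves sometimes.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if energy(current) < energy(best):
                best = current
        temperature *= cooling  # cool the system a little at every cycle
    return best


# Toy energy landscape with its minimum at x = 3.
print(simulated_annealing(lambda x: (x - 3.0) ** 2, start=0.0))
```

In a docking run the single coordinate would be replaced by the ligand's position, orientation and torsion angles, and the energy would be looked up from the precomputed grid maps.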
PROCEDURE:

PREPARATION OF LIGAND:

• The ligand molecule is first loaded into AutoDock in PDB format.

• The charges for the ligand are calculated using the Gasteiger method.
• The root for the torsional rotations is detected and the resulting file is saved in PDBQT
format.

PREPARATION OF MACROMOLECULE:

• The input file is in either PDB or MOL format.
• Polar hydrogens are added to the macromolecule.
• Kollman charges are added and finally it is saved in PDBQT format.

SETTING THE GRID:


• Open the macromolecule and the ligand in PDBQT format.
• The grid box is set by adjusting its dimensions and position along the x, y and z coordinates.
• The grid can be centered on a single atom in the macromolecule by setting the position
of the grid or by selecting the pick-an-atom option in the grid parameters.
• The grid parameter file is saved as .gpf.

PREPARING THE DOCKING PARAMETER FILE (DPF):
• The macromolecule is set as rigid.
• The ligand for that macromolecule is selected and the ligand parameters are set.
• The docking parameters are set and the search algorithm is selected as Genetic
Algorithm.
• The docking is set up to use the Lamarckian Genetic Algorithm (LGA).
• The docking parameter file is saved as .dpf

RUNNING THE DPF:

• AutoGrid is launched first; this generates the grid maps for the atom types used by AutoDock.
• AutoDock is launched next; this runs the genetic algorithm and searches the
conformational space using the maps that have been generated.

RESULT:

• Clustering Histogram

• From the above histogram, it was found that conformation 4 had the minimum energy (-2.5)
and hence gives the best interaction possible.

Hydrogen Bond between Ligand and Target Protein

The above figure shows the interaction between the protein molecule and the ligand. Thus, the
protein molecule was docked with the ligand by determining the binding site through the
interactions calculated using autodock tool.

EXPERIMENT 5B: HOMOLOGY MODELING OF PROTEINS

AIM: To model a protein sequence using SWISS-MODEL.


DESCRIPTION:
SWISS-MODEL (http://swissmodel.expasy.org) is a server for automated comparative
modeling of three-dimensional (3D) protein structures. It pioneered the field of automated
modeling starting in 1993 and is the most widely-used free web-based automated modeling facility
today.
PROTOCOL:
1. Go to NCBI (http://www.ncbi.nlm.nih.gov/), click on the PubMed menu, activate a “Protein” search in
the menu and search for a particular protein.
2. Follow the links to the amino acid sequence and copy that sequence to the computer’s clipboard
for use in the program BLAST. (You might also want to save it to a file, as you will need it later.)
3. Go back to the NCBI home page and follow the “BLAST” link. Submit the sequence to
BLAST, selecting the PDB database alone. Click on the blue Format button to see the results.
4. The resulting hits are sequences deposited with a known 3D structure; the four-character
PDB code is next to the word “pdb”.
5. Record the PDB codes of the different known 3D structures which align with the sequence.
6. Go to the PDB (http://www.rcsb.org/pdb/) and either type in the four-character code of one (or more)
of the structures or use the “searchlight” functionality and search for “hslV”, for example, to see
many related files at once.
7. Check the different PDB codes to find the one whose structure was solved at the highest
resolution.

8. Download one or more of these structures from the PDB. The files may have an “.ent” file
extension. These are equivalent to “.pdb” files.
9. Run PyMOL or RasMol and view your protein.
10. Now that you know that a reasonable structure exists, submit the sequence to the SWISS-MODEL
web site (http://www.expasy.ch/swissmod/SWISS-MODEL.html). Submit the original sequence
with your e-mail address. The SWISS-MODEL server may take ~0.5-3 hours to return the results
of the modeling exercise. You don’t have to submit a PDB file for SWISS-MODEL to use; it will use
a mixture of all the top hits.
11. You will receive several e-mails from SWISS-MODEL containing some introductory messages and the
results of the modeling and PHD exercise.
12. Save the models to a file and view them with PyMOL.
RESULT:
Thus the structure of an unknown protein sequence was predicted using SWISS-MODEL.
EXPERIMENT 5C: TOOLS FOR CALCULATING PROPERTIES OF SMALL
MOLECULES:
SMALL MOLECULE DATABASES:
AIM: To view and use the various small molecule databases available on the World Wide Web.

S.NO   Database name     Link
1      ZINC Database     zinc.docking.org
2      ZINC15 database   zinc15.docking.org
3      ChEMBL            www.ebi.ac.uk
4      JChem for Excel   www.chemaxon.com
5      Binding MOAD      www.bindingmoad.org
6      PDBbind           sw16.im.med.umich.edu
7      STITCH            stitch.embl.de

1. CHEMICAL STRUCTURE REPRESENTATIONS:


AIM: To view and use the various tools for Chemical structure representations

S.NO   Tools name                          Link
1      ChemDraw                            www.cambridgesoft.com
2      Marvin Sketch                       www.chemaxon.com
3      ACD/ChemSketch                      www.acdlabs.com
4      Marvin molecule editor and viewer   www.chemaxon.com
5      ChemWriter                          chemwriter.com

2. ADME TOXICITY:
AIM: To view and use the various tools for ADME Toxicity.

S.NO   Tools name                 Link
1      ALOGPS                     www.vcclab.org
2      OSIRIS Property Explorer   www.openmolecules.org
3      SwissADME                  www.swissadme.ch
4      Metrabase                  www.metrabase.ch.cam.ac.uk
5      PreADMET Server            www.preadmet.bmdrc.kr

EXAMPLE - PreADMET Server:


PROCEDURE:
• Open Internet Explorer
• Go to the link preadmet.bmdrc.kr
• Access the ADME Prediction. Load your molecule in the text box provided by copying and pasting
the contents of a MOL file.
• Compile the detailed report of the small molecule. Properties such as buffer solubility,
CYP inhibition, Caco-2 permeability, MDCK, BBB and HIA data are displayed.

Home page of PreADMET Server
RESULT: Thus the tools used to calculate the properties of small molecules were studied.

Exercise: Molecular Modelling of Protein and Visualization

1. Perform homology modeling for the protein sequence [Accession Number: BAA23356.1]
(Homo sapiens) using SWISS-MODEL.

2. Perform a docking analysis using SwissDock and provide the best pose for the protein (4FYU)
and the ligand plumbagin. Represent any 2 of the best poses and the difference between them using
the PLIP tool.

EXERCISE: 6 SIMPLE PERL CODES FOR SEQUENCE ANALYSIS

AIM: To study and execute basic codes for sequence analysis using PERL

DESCRIPTION

Perl is an interpreted high-level programming language developed by Larry Wall.

(High-level languages are built on top of low-level languages and hide the complexity of low-level
languages from the programmers. All such complexities are handled by the interpreters or
compilers automatically.) Perl has long been a popular scripting language for writing scripts that
utilize the Common Gateway Interface (CGI). Perl is a language optimized for scanning arbitrary
text files, extracting information from those text files, and printing reports based on that
information. It’s also a good language for many system management tasks. The language is
intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant,
minimal). Perl combines some of the features of C, sed, awk, and sh along with some vestiges of
csh, Pascal, and even BASIC-PLUS. Expression syntax corresponds closely to C expression
syntax. Unlike most Unix utilities, Perl does not arbitrarily limit the size of your data–if you’ve
got the memory, Perl can slurp in your whole file as a single string. Recursion is of unlimited
depth. And the tables used by hashes (sometimes called "associative arrays") grow as necessary to
prevent degraded performance. Perl can use sophisticated pattern matching techniques to scan
large amounts of data quickly. Although optimized for scanning text, Perl can also deal with binary
data, and can make dbm files look like hashes. Setuid Perl scripts are safer than C programs
through a dataflow tracing mechanism that prevents many security holes.Perl does not enforce any
particular programming paradigm (procedural, object-oriented, functional, or others) or even
require the programmer to choose among them.
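
As an illustration of the hashes and pattern matching described above, the following is a minimal sketch rather than part of the original exercise. It counts each base in a DNA string with a hash and uses a regular expression to locate a motif. The variable names ($dna, %count, @positions) and the motif ACGG are chosen here only for illustration; the sequence is the one used in the programs later in this exercise.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Example sequence (the same string used in Programs 2 to 4)
my $dna = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# A hash maps each base to the number of times it occurs.
my %count;
$count{$_}++ for split //, $dna;

# Pattern matching: record the 0-based start of every ACGG motif.
# The motif is an arbitrary choice made for this sketch.
my $motif = 'ACGG';
my @positions;
while ($dna =~ /$motif/g) {
    push @positions, pos($dna) - length($motif);
}

print "$_: $count{$_}\n" for sort keys %count;
print "$motif starts at positions: @positions\n";
```

The hash grows as new keys are seen, so the same code works unchanged for sequences that contain other symbols such as N.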

RESULT:
Thus basic scripting in Perl was executed and studied.
Program: 1 - A program to concatenate DNA fragments.

#!/usr/bin/perl -w
use strict;
my ($fragment1, $fragment2, $fragment3);
$fragment1 = "TATCGTCAGCAGTCGT";
$fragment2 = "TAGCACTGACTATCGT";
print "Fragment1 is $fragment1\n";

print "Fragment2 is $fragment2\n";
$fragment3 = $fragment1.$fragment2;
print "Ligated together they are $fragment3\n";
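
The dot operator in Program 1 joins two strings. When there are more than two fragments, the built-in join function or the .= append operator may be more convenient. The sketch below is an illustrative variant, not part of the original program; the names @fragments, $ligated and $appended are introduced here for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The two fragments from Program 1, held in an array
my @fragments = ('TATCGTCAGCAGTCGT', 'TAGCACTGACTATCGT');

# join with an empty separator ligates any number of fragments
my $ligated = join '', @fragments;

# The .= operator appends to an existing string in place
my $appended = $fragments[0];
$appended .= $fragments[1];

print "Joined with join(): $ligated\n";
print "Joined with .=:     $appended\n";
```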

Program: 2
A program to store a DNA Sequence
#!/usr/bin/perl -w
# Storing DNA in a variable, and printing it out
# First we store the DNA in a variable called $DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Next, we print the DNA onto the screen
print $DNA;
# Finally, we'll specifically tell the program to exit
exit;

Program: 3
Transcribing DNA into RNA
#!/usr/bin/perl -w

# Transcribing DNA into RNA
# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";
# Transcribe the DNA to RNA by substituting all T's with U's.
$RNA = $DNA;
$RNA =~ s/T/U/g;
# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";
print "$RNA\n";
# Exit the program.
exit;

Program: 4
Calculating the reverse complement of a strand of DNA
#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA
# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";

print "$DNA\n\n";
$revcom = reverse $DNA;
# tr/// replaces each base with its complement, character by character
$revcom =~ tr/ACGTacgt/TGCAtgca/;
# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";
print "$revcom\n";
print "\nThis time it worked!\n\n";
exit;
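
The reverse-and-translate steps of Program 4 can be wrapped in a subroutine so that they can be reused for any sequence. This is a minimal sketch under that assumption; the subroutine name revcom and the variable names are chosen here and do not come from the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the reverse complement of a DNA string.
sub revcom {
    my ($dna) = @_;
    # Assigning to a scalar makes reverse work on the characters of the string
    my $rc = reverse $dna;
    # Swap each base for its complement, preserving upper and lower case
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

my $DNA    = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
my $revcom = revcom($DNA);

print "Starting DNA:       $DNA\n";
print "Reverse complement: $revcom\n";
```

Applying revcom twice should give back the original sequence, which is a quick way to check the subroutine.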

Program: 5
Reading protein sequence data from a file.
#!/usr/bin/perl -w
# Reading protein sequence data from a file
# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';
# First we have to "open" the file, and associate
# a "filehandle" with it. We choose the filehandle
# PROTEINFILE for readability.
open(PROTEINFILE, $proteinfilename) or die "Cannot open $proteinfilename: $!";
# Now we do the actual reading of the protein sequence data from the file,
# by using the angle brackets < and > to get the input from the
# filehandle. We store the data into our variable $protein.
# (Assigned to a scalar like this, <PROTEINFILE> reads only the first line.)
$protein = <PROTEINFILE>;

# Now that we've got our data, we can close the file.
close PROTEINFILE;
# Print the protein onto the screen
print "Here is the protein:\n\n";
print $protein;
exit;
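
Program 5 reads only the first line of the file. A sequence file usually spans several lines, so the sketch below reads every line into an array and joins them into one string. It is an illustrative variant, not the original program: the subroutine name read_sequence is chosen here, and the two-line peptide written to a temporary file is placeholder data so that the example runs without NM_021964fragment.pep being present.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read all lines of a sequence file and return them as one string.
sub read_sequence {
    my ($filename) = @_;
    # Three-argument open with a lexical filehandle; stop if the file is missing
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    my @lines = <$fh>;    # list context reads every line
    close $fh;
    chomp @lines;         # remove the trailing newlines
    return join '', @lines;
}

# Placeholder data: a made-up two-line peptide file for demonstration only
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "MNIDDKLEGL\nFLKCGGIDEM\n";
close $tmp_fh;

my $protein = read_sequence($tmp_name);
print "Here is the protein:\n\n$protein\n";
```

To use it with a real file, pass the filename (for example 'NM_021964fragment.pep') to read_sequence instead of the temporary file.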

Program: 6
Program to print each element of the array
# Here's one way to declare an array, initialized with a list of four scalar values.
@bases = ('A', 'C', 'G', 'T');
# Now we'll print each element of the array

print "Here are the array elements:";


print "\nFirst element: ";
print $bases[0];
print "\nSecond element: ";
print $bases[1];
print "\nThird element: ";
print $bases[2];

print "\nFourth element: ";
print $bases[3];

Program: 7
Program to take an element off the end of an array with pop
@bases = ('A', 'C', 'G', 'T');
$base1 = pop @bases;
print "Here's the element removed from the end: ";
print $base1, "\n\n";
print "Here's the remaining array of bases: ";
print "@bases";

Program: 8
Program to take a base off the beginning of an array with shift
@bases = ('A', 'C', 'G', 'T');
$base2 = shift @bases;
print "Here's an element removed from the beginning: ";
print $base2, "\n\n";

print "Here's the remaining array of bases: ";
print "@bases";

Program: 9
Program to put an element at the beginning of the array with unshift
@bases = ('A', 'C', 'G', 'T');
$base1 = pop @bases;
unshift (@bases, $base1);
print "Here's the element from the end put on the beginning: ";
print "@bases\n\n";
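
Programs 7 to 9 use pop, shift and unshift; the fourth function in this family is push, which adds an element to the end of an array. The sketch below is an illustrative summary of all four, not part of the original programs; the variable names $last and $first are chosen here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');

my $last = pop @bases;        # remove from the end
print "After pop:     @bases (removed $last)\n";

unshift @bases, $last;        # add to the beginning
print "After unshift: @bases\n";

my $first = shift @bases;     # remove from the beginning
print "After shift:   @bases (removed $first)\n";

push @bases, $first;          # add to the end
print "After push:    @bases\n";
```

Taken together, pop followed by unshift rotates the array one step to the right, and shift followed by push rotates it back.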
