Manual
TECHNOLOGY
SATHYAMANGALAM - 638401
DEPARTMENT OF BIOTECHNOLOGY
NAME : GANESHAN S
BANNARI AMMAN INSTITUTE OF TECHNOLOGY
SATHYAMANGALAM - 638401
BONAFIDE CERTIFICATE
academic year 2019 -2020 is submitted for the practical Examination held on
INDEX
Expt. Page
Date Experiment Name Signature
No. No.
EXPERIMENT 1: INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES
INTRODUCTION:
NCBI (National Center for Biotechnology Information) is one of the leading online resources for
biological sequence information. NCBI is part of the United States National Library of Medicine
(NLM), a branch of the National Institutes of Health (NIH). As a national resource for molecular
biology information, NCBI's mission is to develop new information technologies to aid in the
understanding of fundamental molecular and genetic processes that control health and disease.
NCBI is connected to various other sequence databases in order to answer sequence queries more
efficiently. User queries and sequence information are handled through NCBI’s search tool,
called “Entrez”.
Home Page:
NCBI has a simplified homepage from which the user can navigate to different resources. The left
side pane of the homepage has a site map followed by different categories, which narrow down
the search for the right sequence. On the right side is a list of popular resources, which is
very useful for first-time users.
Searches can be made more precise by using Boolean operators such as AND, OR or NOT in
the search statement.
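The Boolean operators can also be combined programmatically. The following is a minimal sketch, assuming NCBI's public E-utilities ESearch endpoint; the helper name build_esearch_url and the example search terms are illustrative and not part of this manual. It only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service (ESearch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, *terms, operator="AND"):
    """Join search terms with one Boolean operator and return an ESearch URL.

    build_esearch_url is a hypothetical helper written for this example.
    """
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    query = f" {operator} ".join(terms)
    # urlencode percent-encodes spaces and brackets for use in a URL.
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": query})


# Example: search the Gene database for human hexokinase entries.
url = build_esearch_url("gene", "hexokinase", "Homo sapiens[Organism]")
print(url)
```

Swapping operator="AND" for "OR" broadens the search, while "NOT" excludes entries matching the second term.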
The associated databases included are as follows.
• Books: Bookshelf provides free access to search, retrieve and read books and journals from
the life sciences.
• Gene: The Gene database comprises information about genes from various species, including
their nomenclature, associated pathways, RefSeqs, phenotypes and links to genomes.
• EST: The Expressed Sequence Tag database is a collection of data from GenBank. ESTs are
short sequences derived from cDNA, which act as a resource to evaluate gene
expression, find potential variation and annotate genes.
• dbGaP: The database of Genotypes and Phenotypes is an archive of results from studies
of the interaction of genotypes and phenotypes.
• GEO Datasets: The Gene Expression Omnibus (GEO) offers information on gene
expression datasets, their original series and Platform records. It also provides additional
information such as experimental details, cluster tools and differential expression queries.
• GEO Profiles: It allows browsing of expression profiles by gene annotation or by
pre-computed profile characteristics.
• GSS: The GSS nucleotide database provides Genome Survey Sequence records from
GenBank.
• MeSH: MeSH (Medical Subject Headings) is the NLM (National Library of Medicine)
controlled vocabulary used for browsing articles; it also acts as a thesaurus in biomedical
sciences for PubMed and MEDLINE.
• NLM Catalog: NLM (United States National Library of Medicine) is the largest medical
library, offering access to books, journals, technical information, audiovisuals,
software and other resources.
• OMIM: It is a comprehensive resource database for human genes and genetic disorders. It
contains information about human genes and genetic phenotypes, which is updated daily.
• PubChem Compound: It contains compounds with their unique structures and biological
information from PubChem substances.
• PubMed: PubMed is a freely accessible database search system for health information
which is developed and maintained by the National Center for Biotechnology Information
(NCBI) at the National Library of Medicine (NLM). It contains articles from MEDLINE
and other biomedical articles.
• PubMed Central: PubMed Central is a freely accessible digital archive of full-text articles
from biomedical and life science journals, which is linked to the PubMed database. It can be
accessed from the site
• SNP: The SNP database contains information of single nucleotide polymorphisms, short
insertion and deletion polymorphisms.
• Taxonomy: Taxonomy contains information of all the organisms that are included in the
genetic database with their nucleotide or protein sequence.
• UniGene: It identifies transcripts from the same locus, analyses expression by tissue, age
and health status, and reports related proteins (ProtEST) and clone resources.
The results of a query search are presented in different data formats such as GenBank and FASTA.
• Locus name helps to group entries with similar sequences. The first three characters denote
the organism, the fourth and fifth characters give other group designations, such as gene
product, and the last character is one of a series of sequential integers.
• Sequence Length gives the number of nucleotide base pairs (or amino acid residues) in the
sequence record.
• GenBank Division shows the GenBank division to which a record belongs, indicated
by a three-letter abbreviation.
4. VRT - other vertebrate sequences
Records from the RefSeq
NT_123456 constructed genomic contigs
NM_123456 mRNAs
NP_123456 proteins
NC_123456 chromosomes
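The prefix convention listed above can be checked mechanically. The sketch below is illustrative only: the function name refseq_record_type is hypothetical, and the lookup covers just the four prefixes given in this manual, so any other prefix (for example XP_) is reported as "unknown" rather than guessed.

```python
# Prefix meanings copied from the RefSeq list in this manual.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
}


def refseq_record_type(accession):
    """Return the record type implied by a RefSeq accession prefix.

    Hypothetical helper: accessions without an underscore, or with a
    prefix that is not in the table above, yield "unknown".
    """
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return "unknown"
    return REFSEQ_PREFIXES.get(prefix.upper(), "unknown")


for acc in ("NT_123456", "NM_123456", "NP_123456", "NC_123456", "KM658164"):
    print(acc, "->", refseq_record_type(acc))
```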
Version shows a nucleotide sequence identification number that represents a single, specific
sequence in the GenBank database.
GI "GenInfo Identifier" is a sequence identification number for the nucleotide sequence.
Keywords are words or phrases that describe the sequence.
Source indicates free-format information including an abbreviated form of the organism
name, sometimes followed by a molecule type.
Organism describes the formal scientific name for the source organism and its lineage.
Reference includes publications by the authors of the sequence that discuss the data reported
in the record.
Authors lists the authors in the order in which they appear in the cited article.
Title gives the title of the published work or the tentative title of an unpublished work.
CDS (coding sequence) represents the region of nucleotides that corresponds to the sequence
of amino acids in a protein.
FASTA:
It is a file format that represents nucleotide or protein sequences as text with a basic
tag or identifier, in which nucleotides or amino acids are written as single-letter codes. A
FASTA record starts with a greater-than symbol (>), which marks the beginning of a new
sequence record; this first line is called the definition line (“def line”). On the def line, an
accession number or version number is followed by a description of the entry. The sequence, in
either uppercase or lowercase letters, starts on the next line. Sequence lines typically contain
60 characters each.
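The layout of a FASTA record can be illustrated with a short sketch. This is not an official parser: parse_fasta and format_fasta are hypothetical helper names, the example record is invented, and the 60-character line width follows the convention mentioned in this section.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new def line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record with the sequence wrapped at `width` characters."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


# Invented two-line record, used only to exercise the functions.
example = ">demo_1 hypothetical example record\nATGAGGCACA\nTCACCGTGTG\n"
records = parse_fasta(example)
print(records)
print(format_fasta("demo_1", "ACGT" * 40))
```

The length of a parsed sequence, len(sequence), gives the number of bases or residues in the record.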
The sequences stored in the database were obtained by different experimental
methods. The most commonly used methods for DNA sequencing are the Sanger method and the
Maxam-Gilbert method. Similarly, the Edman degradation method and mass spectrometry are used
for protein sequencing.
Sanger Method (dideoxy chain termination method): Here four test tubes are taken, labelled
A, T, G and C. The DNA is added to each test tube in denatured form (single strands).
Next a primer is added, which anneals to one of the strands of the template. The 3' end of the
primer is extended with deoxynucleotides (dNTPs) and, at random, with the dideoxynucleotide
(ddNTP) specific to each tube. When a ddNTP is attached to the growing chain, the chain
terminates because the ddNTP lacks the 3'-OH group needed to form the phosphodiester bond
with the next nucleotide. Thus small strands of DNA of different lengths are formed.
Electrophoresis is then carried out, and the sequence order is obtained by analysing the bands
in the gel, which are separated by molecular weight. The primer or one of the nucleotides can
also be radioactively or fluorescently labelled, so that the final product can be detected
easily in the gel and the sequence can be inferred.
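The logic of reading a Sanger gel can be sketched in a few lines. This is a simplified model under stated assumptions, not a simulation of the chemistry: it assumes every possible terminated fragment is produced, represents each fragment only by its length, and uses the hypothetical function names sanger_fragments and read_gel.

```python
def sanger_fragments(synthesized_strand):
    """Return, per ddNTP tube, the lengths of the terminated fragments.

    Idealized model: in the tube containing ddA, a fragment ends at every
    position where the newly synthesized strand has an A, and so on.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence from the shortest band to the longest band."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = sanger_fragments("GATTACA")
print(lanes)
print(read_gel(lanes))
```

Because shorter fragments migrate further in the gel, ordering the bands by length across the four lanes recovers the order of bases in the synthesized strand.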
Edman Degradation reaction: The reaction determines the order of amino acids in a protein from
the N-terminus by cleaving one amino acid at a time from the N-terminus without disturbing the
other bonds in the protein. After each cleavage, chromatography or electrophoresis is done to
identify the amino acid.
Mass Spectrometry: It is used to determine the mass of particles and the composition of
molecules, and to find the chemical structures of molecules such as peptides and other chemical
compounds. Based on the mass-to-charge ratio, one can identify the amino acids in a protein.
EXERCISE- NCBI
1. Go to the NCBI home page, and choose GENOME from the databases list in the upper
left.
vi) Download the FASTA sequence for following gene ID- 31666697, 98, 99, & 31666700
ATGAGGCACATCACCGTGTGTCCATCACAGATGAAGCGATTGAGGCGGCGGTGAAG
CTGTCTGACCGTTATATTTCTGATCGTTTCCTTCCAGATAAGGCGATTGATTTAATTG
ATGAGGCAGGTTCGAAAGTCCGCTTACGTTCTTTCACAACACCGCCTAACCTAAAAG
AACTAGAGCAAAAGTTGGATGAAGTACGCAAGGAAAAGGATGCGGCTGTTCAAAGT
CAGGAATTTGAAAAAGCAGCTTCTCTTCGCGATACAGAGCAGCGTTTACGTGAAAA
AGTAGAAGTCACAAAGAAATCTTGGAAAGAAAAGCAAGGTCAGGAGAATTCAGAG
GTATCAGTGGATGATATCGCAATGGTTGTCTCTAGCTGGACGGGAGTGCCTGTTTCA
AAAATTGCCCAAACAGAAACAGATAAGCTTCTGAATATGGAACAATTACTCCATTCT
CGTGTAATCGGGCAGGATGAAGCGGTTGTCGCTGTAGCAAAAGCTGTGAGACGTGC
GCGTGCTGGTCTAAAAGATCCAAAACGTCCAATCGGCTCCTTTATCTTCTTAGGCCC
AACAGGGGTTGGTAAAACGGAGCTTGCAAGAGCACTTGCGGAGTCTATTTTTGGTG
ATGAAGAAGCGATGATCCGTATCGATATGTCTGAATACATGGAGAAACATTCTACAT
CTAGACTTGTTGGGTCACCTCCAGGCTATGTTGGCTATGAAGAAGGCGGACAACTGA
CTGAAAAAGTGAGAAGAAAACCTTATTCTGTTGTGCTTTTAGACGAGATTGAAAAG
GCGCATCCAGATGTATTCAACATCTTACTGCAAGTATTAGAAGATGGTCGTCTGACG
GATTCTAAAGGGCGTACCGTTGACTTTAGAAATACGATTTTGATCATGACATCCAAC
GTTGGAGCTAGTGAACTGAAGCGAAATAAATATGTTGGCTTTAACGTGCAGGATGA
AGGTCAAAATTACAAAGATATGAAAGGCAAAGTGATGGGCGAGTTGAAACGTGCGT
TTAGACCAGAATTCATCAACCGTATTGATGAAATCATTGTCTTCCATTCACTTGAAA
AGAAACATTTAAAAGAGATCGTGTCTCTCATGTCTGATCAATTGACGAAACGATTAA
AAGAACAAGACCTTTCAATTGAATTGACAGAAGCAGCAAAAGCGAAGATTGCCGAC
GAAGGTGTAGACCTTGAGTACGGTGCGCGTCCGTTAAGAAGAGCGATTCAAAAGCA
TGTGGAGGATCGACTTTCTGAGGAGCTTCTAAAGGGTAATATTGAAAAAGGTCAAC
AAATCGTATTAGATGTGGAAGATGGAGAAATTGTCGTAAAAACGACGGCTGCTACG
AACTAA
TAGAAAAACGTGTTGGACTGCTGCTGCAAAACCAGGATGCGTATTTAAAGGTCGCA
GGCGGTGTGAAGCTTGACGAACCAGCAATTGACTTGGCCATTGCGGTCAGTATTGCC
TCCAGCTTTAAAGACGCAGCGCCGCATCAAGCGGATTGCTTTATAGGAGAGGTCGGT
CTGACGGGAGAAGTCAGAAGAGTATCAAGAATTGAACAGCGTGTGCAGGAAGCGG
CGAAACTAGGATTTAAGCGAATGTTTATTCCTCAGGCGAATATAGATGGATGGAAA
AAGCCGAGAGGGATTGAGTTAGTCGGTGTAGAAAATGTAGCGGAGGCACTTCGAAT
TTCACTAGGGGGATCATAG
vi) Use the genome annotation report of Bacillus pumilus & fill the following details
B. NUCLEOTIDE DATABASE
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAGCG
CCCTGCAAGGGGAGCGGCAGACGGGTGAGTAACGCGTGGGAACGTACCATTTGCTACGGAA
TAACTCAGGGAAACTTGTGCTAATACCGTATGAGCCCGAAAGGGGAAAGATTTATCGGCAA
ATGATCGGCCCGCGTTGGATTAGCTAGTTGGTGGGGTAAAGGCCTACCAAGGCGACGATCC
ATAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACG
GGAGGCAGCAGTGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGA
GTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTCACCGGTGAAGATAATGACGGTAACCGGA
GAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGT
TCGGATTTACTGGGCGTAAAGCGCACGTAGGCGGGCTAATAAGTCAGGGGTGAAATCCCGG
GGCTCAACCCCGGAACTGCCTTTGATACTGTTAGTCTTGAGTATGGTAGAGGTGAGTGGAAT
TCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCA
CTGGACCATTACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTG
GTAGTCCACGCCGTAAACGATGAATGTTAGCCGTTGGGGAGTTTACTCTTCGGTGGCGCAGC
TAACGCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGA
CGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTAC
CAGCCCTTGACATCCCGATCGCGGTTAGTGGAGACACTATCCTTCAGTTCGGCTGGATCGGA
GACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACG
AGCGCAACCCTCGCCCTTAGTTGCCAGCATTCAGTTGGGCACTCTAAGGGGACTGCCGGTGA
TAAGCCGAGAGGAAGGTGGGGGATGACGTCAAGTCCTCATGGCCCTTACGGGCTGGGCTAC
ACACGTGCTACAATGGTGGTGACAGTGGGCAGCGAGCACGCGAGTGTGAGCTAATCTCCAA
AAGCCATCTCAGTTCGGATTGCACTCTGCAACTCGAGTGCATGAAGTTGGAATCGCTAGTAA
TCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACC
ATGGGAGTTGGTTTTACCCGAAGGCGCTGTGCTAACCGCAAGGAGGCAGGCGACCACGGTA
GGGTCAGCGACTGGGGTGAAGTCGTAACAAGGTA.
2. How many base pairs are present?
1444bp
3. What kind of molecule is it?
Genomic DNA
4. What are the types of RNA present in a cell?
mRNA, tRNA and rRNA are the types of RNA present in a cell.
16S rRNA - Present in the small (30S) ribosomal subunit of prokaryotes
18S rRNA - Present in the small (40S) ribosomal subunit of eukaryotes
23S rRNA - Present in the large (50S) ribosomal subunit of prokaryotes
5. Is this a linear or circular DNA?
It is a linear DNA.
6. What is the Accession number of the sequence?
KM658164 is the accession number of the sequence.
7. What is the keyword for this particular entry? What does it indicate?
ENV.
It indicates - environmental sampling sequences
From the NCBI homepage (https://www.ncbi.nlm.nih.gov/), type in: HK1 hexokinase 1 [Homo
sapiens (human)] in the “Query” box at the top of the page. Make sure the “Search in” box states
“Gene”. Click on the “Search” button.
1. What is the symbol for this entry?
HK1
2. Is this a protein coding gene?
Yes, it is protein coding gene.
3. From which organism is it derived?
Homo sapiens
4. What does the protein encoded by this gene do?
This gene encodes a ubiquitous form of hexokinase which localizes to the outer membrane
of mitochondria.
5. Where does the protein get expressed?
6. What is the location of the particular gene and what is the exon count?
The location of the particular gene is 10q22.1 and its exon count is 29.
1. Retrieve the PROTEIN sequence in FASTA format corresponding to NCBI Reference
Sequence: XP_010321220.1. Comment on the function and taxonomy involved.
>XP_010321220.1 elastin [Solanum lycopersicum]
MAKKGILILVLSIFLVGSILGRKLAESDALTNGGHSPLKPLFTQKNGDLPGIPLLGGVPGSI
PGIPSIGGLPSNIPGIPSIGGLPGTISGIPLIGGLPGIIPGIPSIGGLPGTIPGVPSIGGVPGTIPGV
PSTGGGPGNIPGVPSIGGGPGNIPGVPSIGGVPGQGPGDVPVVPSVGGNTGPAYGGIPYIP
STGIPGLGYGGLPGIPWLGGGPGNIPGIPWLGGVPGGYGIGPGIVPFVGGGGGGGGGGGI
GGGLGGGVGGGVGGGIGGGV.
The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional
information on proteins, with accurate, consistent and rich annotation. In addition to capturing the
core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or
description, taxonomic data and citation information), as much annotation information as possible
is added. This includes widely accepted biological ontologies, classifications and cross-references,
and clear indications of the quality of annotation in the form of evidence attribution of
experimental and computational data. The UniProt Knowledgebase consists of two sections: a
section containing manually-annotated records with information extracted from literature and
curator-evaluated computational analysis, and a section with computationally analyzed records
that await full manual annotation. For the sake of continuity and name recognition, the two sections
are referred to as "UniProtKB/Swiss-Prot" (reviewed, manually annotated) and
"UniProtKB/TrEMBL" (unreviewed, automatically annotated), respectively.
About 98 % of the protein sequences provided by UniProtKB are derived from the translation of
the coding sequences (CDS) which have been submitted to the public nucleic acid databases, the
EMBL-Bank/GenBank/DDBJ databases (INSDC). All these sequences, as well as the related data
submitted by the authors, are automatically integrated into UniProtKB/TrEMBL.
your search term with the field name and a colon. To
discover what fields can be queried explicitly, observe the
query hints that are shown after submitting a query or use
the query builder (see below).
length:[100 TO *]                      All entries with a sequence of at least 100 amino acids.
citation:(author:Arai author:Chung)    All entries with a publication that was coauthored by two specific authors.
6. To use characters that have a special meaning in the query syntax literally in your query,
you must escape them with a backslash, e.g. use gene:L\(1\)2CB to search for the gene
name L(1)2CB. The current list of special characters is:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
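The backslash-escaping rule can be applied automatically before a term is placed in a query. The sketch below is an illustration, with escape_query_term as a hypothetical helper name; it assumes that the two-character operators && and || are handled by escaping each & and | individually.

```python
# Special characters from the list above. "&&" and "||" are covered by
# escaping every "&" and "|" on its own (an assumption of this sketch).
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\')


def escape_query_term(term):
    """Prefix each special character in `term` with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)


# Reproduces the gene-name example given in the text.
print("gene:" + escape_query_term("L(1)2CB"))
```

Only the search term is escaped; the field name and its colon (gene:) are added afterwards so that the colon keeps its meaning as a field separator.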
Query builder
To restrict terms to specific fields in advance, click the 'Advanced Search »' button. Depending on
the chosen data set and field, you can then enter some text or choose values from a drop-down list.
Then click the Add & Search button to add the new constraint and run the new query.
Each UniProtKB/TrEMBL entry contains information associated with one protein sequence. 100%
identical protein sequences over the entire length of the protein and from the same species are
automatically merged together. The different sections of the entry report selected information
extracted from the original ENA/GenBank/DDBJ entry as well as additional high quality
automated annotation.
Entry information
Each entry is associated with a stable unique identifier: the primary accession number. When
citing an entry, always use the primary accession number. The ‘entry name’ is composed of the
primary accession number and a mnemonic species identification: it is not stable and will change
as soon as the entry is reviewed and integrated into UniProtKB/Swiss-Prot.
1          2      3          4          5          6      7      8          9          10
[O,P,Q]    [0-9]  [A-Z,0-9]  [A-Z,0-9]  [A-Z,0-9]  [0-9]
[A-N,R-Z]  [0-9]  [A-Z]      [A-Z,0-9]  [A-Z,0-9]  [0-9]
[A-N,R-Z]  [0-9]  [A-Z]      [A-Z,0-9]  [A-Z,0-9]  [0-9]  [A-Z]  [A-Z,0-9]  [A-Z,0-9]  [0-9]
The three patterns can be combined into the following regular expression:
[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}
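The regular expression above can be used directly to check whether a string has the shape of a UniProtKB accession number. The following is a minimal sketch using Python's re module; is_valid_accession is a hypothetical helper name, and a match only confirms the format, not that the entry exists in UniProtKB.

```python
import re

# Regular expression copied from the text above.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(text):
    """Return True if the whole string matches the accession pattern."""
    return ACCESSION_RE.fullmatch(text) is not None


# Q15746 and B4DUE3 appear later in this manual; A0A024R161 is used here
# only as an illustration of the 10-character pattern.
for candidate in ("Q15746", "B4DUE3", "A0A024R161", "Q1574", "q15746"):
    print(candidate, is_valid_accession(candidate))
```

fullmatch is used rather than search so that extra characters before or after a valid-looking core are rejected.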
Entries can have more than one accession number. This can be due to two distinct mechanisms:
a) When two or more entries are merged, the accession numbers from all entries are kept.
The first accession number is referred to as the ‘Primary (citable) accession number’, while
the others are referred to as ‘Secondary accession numbers’. These are listed in
alphanumerical order.
b) If an existing entry is split into two or more entries (‘demerged’), new ‘primary’
accession numbers are attributed to all the split entries while all original accession numbers
are retained as ‘secondary’ accession numbers.
Protein name (‘submitted name’), synonyms, gene and locus names, and taxonomy information
are automatically extracted from the original ENA/GenBank/DDBJ entry
Protein attributes
This section provides information on the protein sequence length, indicates if the protein sequence
is complete or a fragment (according to the original ENA/GenBank/DDBJ record). It also provides
the level of evidence that supports the existence of the protein. The vast majority of
UniProtKB/TrEMBL protein existences are considered as ‘Predicted’ since they derived from in
silico nucleotide translations
• Uncertain
The value ‘Evidence at protein level’ indicates that there is clear experimental evidence for the
existence of the protein. The criteria include partial or complete Edman sequencing, clear
identification by mass spectrometry, X-ray or NMR structure, good quality protein-protein
interaction or detection of the protein by antibodies.
The value ‘Evidence at transcript level’ indicates that the existence of a protein has not been
strictly proven but that expression data (such as existence of cDNA(s), RT-PCR or Northern
blots) indicate the existence of a transcript.
The value ‘Inferred by homology’ indicates that the existence of a protein is probable because
clear orthologs exist in closely related species.
The value ‘Predicted’ is used for entries without evidence at protein, transcript, or homology
levels.
The value ‘Uncertain’ indicates that the existence of the protein is unsure.
Sequences
More than 99 % of the protein sequences are obtained from the translation of annotated coding
sequences (CDS) in the ENA/GenBank/DDBJ databases and are automatically processed and
entered in UniProtKB/TrEMBL. The protein sequence displayed by default in the entry is the most
prevalent and/or the most similar to orthologous sequences. When the genomic sequence is
available, we generally display the protein sequence derived from genome translation
References
This section contains published articles or submissions that were cited in the original
ENA/GenBank/DDBJ entry. Additional computationally mapped references can also be accessed
from this section
Cross-references
This section is used to point to related information stored in other data collections, including the
links to the original ENA/GenBank/DDBJ submissions
This section provides any useful information about the protein, mostly biological knowledge
(function, subcellular location, enzyme-specific information (catalytic activity, cofactors,
metabolic pathway), tissue expression…). Qualifiers (e.g. ‘By similarity’, ‘Probable’, ‘Potential’)
are used in the absence of direct experimental evidence.
Ontologies
This section provides a selection of UniProtKB keywords, which are terms from a controlled
vocabulary list, which summarizes the content of the entry and a selection of Gene Ontology (GO)
terms
Alternative products
This section lists the alternative protein sequences that can be generated from the same gene by
alternative promoter usage, alternative splicing, alternative initiation and/or ribosomal
frameshifting. In addition, this section provides relevant information for each alternative protein
isoform.
Over 30 feature keys are available for the description of regions or sites of interest in the protein
sequence, such as post-translational modifications (glycosylation, phosphorylation…), binding
sites, enzyme active sites, local secondary structure, or variants. Features can be either
experimentally proven in the literature or predicted in silico. Qualifiers (‘By similarity’, ‘Probable’
and ‘Potential’) indicate the existence of indirect experimental evidence or the computer-
prediction of the feature
3D-structure databases
Protein, or part of a protein, whose three-dimensional structure has been resolved experimentally
(for example by X-ray crystallography or NMR spectroscopy) and whose coordinates are available
in the PDB database.
From the UniProt homepage (http://www.uniprot.org/), type in: Myosin light chain kinase in the
“Query” box at the top of the page. Make sure the “Search in” box states “Protein Knowledgebase
UniProtKB”. Click on the “Search” button.
[3] How many gold star and grey star entries do you get, and what is the difference between
these two?
GOLD STAR: Records with information extracted from literature and curator-evaluated
computational analysis.
GREY STAR: Records that await full manual annotation.
Number of GOLD STAR: 147; HUMAN [9606] = 38
Number of GREY STAR: 2687; HUMAN [9606] = 8
[1] Choose the UniProt entry Q15746 and identify the (a) protein name, (b) total number of
amino acids and find whether it is manually annotated or computationally annotated?
[2] Identify their primary and secondary accession numbers and what is the difference
between primary and secondary accession numbers
When two or more entries are merged, the accession numbers from all entries are kept. The
first accession number is referred to as the ‘Primary (citable) accession number’, while the
others are referred to as ‘Secondary accession numbers’. These are listed in alphanumerical
order. If an existing entry is split into two or more entries (‘demerged’), new ‘primary’
accession numbers are attributed to all the split entries while all original accession numbers
are retained as ‘secondary’ accession numbers.
Primary accession numbers: Q15746
Secondary accession numbers: B4DUE3, D3DN97, O95796, O95797, O95798, O95799,
Q14844, Q16794, Q17S15, Q3ZCP9, Q5MY99, Q5MYA0, Q6P2N0, Q7Z4J0, Q9C0L5,
Q9UBG5, Q9UBY6, Q9UIT9
Kinase-related protein
Telokin
[4] Is there any evidence that this protein exists at the protein level? Where can you find this
information?
Yes, there is evidence that this protein exists at the protein level. This information can
be found under ‘STATUS’.
[5] What is meant by Evidence at protein level, Evidence at transcript level and Inferred from
homology?
The value 'Experimental evidence at protein level' indicates that there is clear
experimental evidence for the existence of the protein. The criteria include partial or
complete Edman sequencing, clear identification by mass spectrometry, X-ray or NMR
structure, good quality protein-protein interaction or detection of the protein by
antibodies.
The value 'Experimental evidence at transcript level' indicates that the
existence of a protein has not been strictly proven but that expression data (such as
existence of cDNA(s), RT-PCR or Northern blots) indicate the existence of a transcript.
The value 'Protein inferred by homology' indicates that the existence of a protein is probable
because clear orthologs exist in closely related species.
[1] How many different isoforms are known for this protein? There are 11 isoforms for this
protein.
[2] What region is missing in the identifier Q15746-8? What is the isoform and by what name
is it known? Residues 1-1760 are missing. It is isoform 6, known by the name Telokin.
[3] Is isoform 6 catalytically active? It has no catalytic activity.
[4] What alternative initiation site does it use? Met-1761 as initiator codon.
[5] What are the residue positions of the “Protein kinase” domain? 1464-1719 are the residue
positions of the protein kinase domain.
[1] What ligands might this protein bind according to the keywords list? Where can you get
this information? ATP-binding, Calcium, Magnesium, Metal-binding and Nucleotide-binding
are the ligand-related keywords for this protein; they can be found in the UniProtKB keywords.
[2] What “Biological process” is this protein involved in? Aorta smooth muscle tissue
morphogenesis, bleb assembly, cardiovascular system development, cellular hypotonic
response, muscle contraction, positive regulation of calcium ion transport, positive
regulation of cell migration, positive regulation of wound healing, protein phosphorylation,
smooth muscle contraction, tonic smooth muscle contraction.
[3] We already know this protein is an enzyme, but what is its function? Actin binding, ATP
binding, calmodulin binding, metal ion binding, myosin light chain kinase activity, protein
kinase activity. Calcium/calmodulin-dependent myosin light chain kinase implicated in
smooth muscle contraction via phosphorylation of myosin light chains (MLC). Also
regulates actin-myosin interaction through a non-kinase activity. Phosphorylates
PTK2B/PYK2 and myosin light-chains. Involved in the inflammatory response (e.g.
apoptosis, vascular permeability, and leukocyte diapedesis), cell motility and morphology,
airway hyperreactivity and other activities relevant to asthma.
[4] How do post-translational modifications affect this protein?
Can probably be down-regulated by phosphorylation. Tyrosine phosphorylation by ABL1
increases kinase activity, reverses MLCK-mediated inhibition of Arp2/3-mediated actin
polymerization, and enhances CTTN-binding. Phosphorylation by SRC at Tyr-464 and
Tyr-471 promotes CTTN binding. The C-terminus is deglutamylated by AGTPBP1/CCP1,
AGBL1/CCP4 and AGBL4/CCP6, leading to the formation of Myosin light chain kinase,
smooth muscle, deglutamylated form. The consequences of C-terminal deglutamylation
are unknown. Acetylated at Lys-608 by NAA10/ARD1 via a calcium-dependent signaling;
this acetylation represses kinase activity and reduces tumor cell migration.
[5] What are helix, strand and turn? Identify the regions that make up these secondary structures.
The alpha helix (α-helix) is a common motif in the secondary structure of proteins and
is a right-handed helical conformation in which every backbone N−H group
donates a hydrogen bond to the backbone C=O group of the amino acid located three or
four residues earlier along the protein sequence.
The β-sheet is a common motif of regular secondary structure in proteins. Beta sheets
consist of beta strands connected laterally by at least two or three backbone hydrogen
bonds, forming a generally twisted, pleated sheet.
A turn is an element of secondary structure in proteins where the polypeptide chain
reverses its overall direction.
Beta strand: 415-418; 423-425; 431-438; 444-449; 458-466; 469-476; 483-491; 494-504;
512-518; 523-526; 531-534; 536-541; 546-553; 560-562; 565-570; 581-586; 591-595;
598-601; 1239-1241; 1246-1250; 1255-1266; 1268-1277; 1281-1288; 1290-1297;
1306-1313; 1318-1320; 1323-1328
Helix: 479-481; 1743-1760
Turn: 1302-1304
C- RETRIEVING STRUCTURAL DATA OF A PROTEIN USING PDB DATABASE
Objective
• To study proteins and their structure.
• To introduce PDB and its importance.
• To learn how to retrieve structural data of a protein using PDB database.
• To describe the PDB file format.
Proteins
Proteins are fundamental components of all living cells and perform a large variety of biological
tasks. Protein structures are highly conserved, and the structure of a protein determines its
function. The primary structure of a protein is a linear sequence of amino acids. It is
synthesized during translation of mRNA, which is itself transcribed from DNA. DNA
(deoxyribonucleic acid) is the genetic material that contains all the genetic information for
the development and maintenance of all functions in all living organisms. The information is
stored as a genetic code using four types of bases: adenine (A), guanine (G), cytosine (C) and
thymine (T). In the two strands of DNA, adenine always pairs with thymine and guanine pairs
with cytosine. Each base is bonded to a sugar and a phosphate group to form a nucleotide. The
base pairing of DNA results in a ladder-shaped structure of the two strands, which is called a
double helix.
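The base-pairing rule (A with T, G with C) is enough to derive one DNA strand from the other. The following sketch illustrates this; the function names complement and reverse_complement are chosen for this example, and only the four unambiguous bases are handled.

```python
# Watson-Crick base pairs as described in the paragraph above.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base that pairs with each position of `strand`."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the partner strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))
print(reverse_complement("ATGC"))
```

The reverse complement is the form usually reported by sequence tools, because the two strands of the double helix run in opposite directions.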
Intermolecular and intramolecular hydrogen bonding between the amide groups of the primary
structure of a protein gives rise to its secondary structure. The attraction of a hydrogen atom
towards an electronegative atom (N, F, O etc.) within the same molecule is called intramolecular
hydrogen bonding, and that between two different molecules is called intermolecular hydrogen
bonding. Alpha helices and beta sheets are the two important secondary structures in proteins.
PDB
The Protein Data Bank (PDB) is a protein database that contains information about the 3D
structures of proteins. PDB files contain experimentally determined 3D structures of biological
macromolecules. The structural information of a protein can be determined by X-ray
crystallography or Nuclear Magnetic Resonance (NMR) spectroscopy. In X-ray crystallography,
X-rays, whose wavelength is comparable to the size of an atom, are diffracted by the electrons
of the atoms in the crystal, resulting in patterns recorded as small spots on an X-ray film.
These patterns are used to calculate the coordinates of atoms in a protein. NMR spectroscopy is
also used for determining the structure of molecules. The nucleus of an atom placed in a strong
magnetic field can absorb electromagnetic radiation of a particular frequency. Electromagnetic
radiation is a form of energy that has both electric and magnetic field components. This type
of radiation includes X-rays, gamma rays, radio waves, visible light etc. PDB files also contain
information on the data collected, molecule name, primary and secondary structure, ligands,
atomic coordinates, crystallographic structure factors, NMR experimental data etc. The data are
submitted by scientists from all over the world. PDB is maintained by the Worldwide Protein
Data Bank. Each entry in the PDB is given a unique identifier called the PDB ID.
It is a four-character identifier consisting of alphanumeric characters.
All data in the PDB are accessible to the public. There are databases which contain data derived
from the PDB, for example SCOP (Structural Classification of Proteins), which groups different
protein structures; HSSP (Homology-Derived Secondary Structure of Proteins), for the 3D structure
and 1D sequence of proteins; and CATH (Class, Architecture, Topology and Homologous superfamily),
for protein structure classification according to evolution. The PDB allows users to search for
information on structure, sequence and function, and to visualize, download and assess
molecules.
Resolution: It is a measure of the quality of the data that has been collected on the crystal
containing the protein or nucleic acid. If all of the proteins in the crystal are aligned in an identical
way, forming a very perfect crystal, then all of the proteins will scatter X-rays the same way, and
the diffraction pattern will show the fine details of the crystal. On the other hand, if the proteins in the
crystal are all slightly different, due to local flexibility or motion, the diffraction pattern will not
contain as much fine information. So resolution is a measure of the level of detail present in the
diffraction pattern and the level of detail that will be seen when the electron density map is
calculated. High-resolution structures, with resolution values of 1 Å or so, are highly ordered and
it is easy to see every atom in the electron density map. Lower resolution structures, with resolution
of 3 Å or higher, show only the basic contours of the protein chain, and the atomic structure must
be inferred. Most crystallographic-defined structures of proteins fall in between these two
extremes. As a general rule of thumb, we have more confidence in the location of atoms in
structures with resolution values that are small, called "high-resolution structures".
Ligand Protein Contacts: Residues forming contacts with the ligand can be analyzed using the
Ligand Protein Contacts (LPC) server. Different types of interactions between the contacting
residues and the ligand, such as hydrophobic, aromatic and hydrogen-bond interactions, can be analyzed.
The PDB file format is the standard file format for protein structure files. It describes how the
atoms of the molecules are positioned in the 3-D structure of a protein. The file contains hundreds
or thousands of lines called records, each of which describes some aspect of the protein. Figure 1
shows certain parts of a PDB formatted file for deoxyhemoglobin.
Each record provides a different set of information like:
• The HEADER record contains, in order, the classification of the molecule, the date on which
the data were deposited at the PDB and the ID code (the unique PDB identifier).
• The TITLE record contains title of the PDB entry.
• The COMPND record includes the protein name. The specification list describes the
molecular components.
• The SOURCE record contains the name of the organism from which the particular protein
was obtained.
• The KEYWDS record contains keywords that describe the protein. These include its
functional classification, metabolic role, biological or chemical activity and structural
classification.
• The EXPDTA record contains the method used for the protein structure experiment. E.g.
X-ray diffraction, electron crystallography etc.
• The AUTHOR record contains the names of the contributors who deposited the data in the
PDB database.
• The REVDAT record contains the revision history of the data related to the protein. It
includes the date of modification and the type of modification.
• The JRNL record contains the journal details of the literature in which the protein structure
was reported.
• The REMARK record contains further journal references about the protein and other
remarks about the protein structure.
• The DBREF record contains the reference to the protein in the sequence databases. It
contains the ID code of the entry, chain identifier, initial sequence number of the PDB
sequence segment, initial insertion code of the PDB sequence segment, ending
sequence number of the PDB sequence segment, ending insertion code of the PDB
sequence segment, sequence database name, sequence database accession code, sequence
database identification code, initial sequence number of the database segment, insertion
code of the initial residue of the segment (if PDB is the reference), ending sequence number
of the database segment and insertion code of the ending residue of the segment (if PDB is
the reference).
• The SEQADV record lists the differences between the named sequence database and the
PDB sequence. It includes the ID code of the entry, name of the PDB residue in conflict,
PDB chain identifier, PDB sequence number, PDB insertion code, sequence database
accession number, sequence database residue name, sequence database sequence number
and a conflict comment.
• The SEQRES record contains information about the amino acid sequence of protein. It
includes serial number of the SEQRES record for the current chain, chain identifier and
number of residues in the chain.
• The HET record contains details about the non-protein groups in the structure. It contains
the HET identifier, chain identifier, sequence number, insertion code, the number of HETATM
records present for the group and text describing the HET group.
• The HETNAM record contains the compound name of non-standard residues. It contains
the HET identifier and the chemical name.
• The HETSYN record contains synonyms for the compound names of non-standard residues.
• The FORMUL record contains the chemical formula of non-standard residues.
• The HELIX record identifies helical substructures. It includes the serial number
of the helix, helix identifier, name of the initial residue, chain identifier for the chain
containing the helix, sequence number of the initial residue, insertion code of the initial
residue, name of the terminal residue of the helix, chain identifier for the chain containing
the helix, sequence number of the terminal residue, insertion code of the terminal residue,
a comment about the helix and the length of the helix.
• The LINK record identifies inter-residue bonds. For each of the two linked atoms it contains
the atom name, alternate location indicator, residue name, chain identifier, residue sequence
number and insertion code, followed by the symmetry operator for atom 1, the symmetry
operator for atom 2 and the link distance.
• The SITE record lists the residues that make up important sites in the structure. It gives the
sequence number, site name and number of residues that compose the site, followed, for each
of the first to fourth residues of the site, by the residue name, chain identifier, residue
sequence number and insertion code.
• The ORIGXn record shows the transformation from orthogonal coordinates to the
submitted coordinates.
• The SCALEn record shows the transformation from orthogonal coordinates to fractional
crystallographic coordinates.
• The ATOM record contains the atomic coordinates for the structure. It contains the atom
serial number, atom name, alternate location indicator, residue name, chain identifier, residue
sequence number, code for insertion of residues, orthogonal coordinates for X, Y and Z
respectively in angstroms, occupancy, temperature factor, element symbol and charge on the atom.
• The TER record indicates the end of a chain of residues.
• The HETATM record contains the atomic coordinate records for non standard residues. It
includes the atom serial number, atom name, alternate location indicator, residue name,
chain identifier, residue sequence number, code for insertion of residues, orthogonal
coordinates for X, Y and Z respectively, occupancy, temperature factor, element symbol,
charge on the atom.
• The CONECT record contains details about the bonds involving non-protein atoms.
• The MASTER record contains the number of REMARK records, number of HET records, number
of HELIX records, number of SHEET records, number of TURN records (deprecated), number of
SITE records, number of coordinate transformation records, number of atomic coordinate records,
number of TER records, number of CONECT records and number of SEQRES records.
• END records represent the end of a file.
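Because each record uses fixed column positions, an ATOM or HETATM line can be read by slicing. The following Python sketch illustrates this under stated assumptions: the function name `parse_atom_line` and the sample line are hypothetical, and the column positions are those of the standard PDB format as understood here.

```python
def parse_atom_line(line):
    """Split one ATOM or HETATM line into named fields using fixed columns."""
    line = line.ljust(80)  # pad short lines so that every slice is valid
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError("not a coordinate record: " + record)
    return {
        "record": record,
        "serial": int(line[6:11]),          # columns 7-11
        "name": line[12:16].strip(),        # columns 13-16
        "alt_loc": line[16].strip(),        # column 17
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain_id": line[21].strip(),       # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        "i_code": line[26].strip(),         # column 27
        "x": float(line[30:38]),            # columns 31-38, angstroms
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
        "occupancy": float(line[54:60]),    # columns 55-60
        "temp_factor": float(line[60:66]),  # columns 61-66
        "element": line[76:78].strip(),     # columns 77-78
        "charge": line[78:80].strip(),      # columns 79-80
    }


# Hypothetical sample line, assembled piece by piece so the field widths are visible.
SAMPLE = (
    "ATOM  " "    1" " " " N  " " " "VAL" " " "A" "   1" " " "   "
    "   6.204" "  16.869" "   4.854" "  1.00" " 49.05" "          " " N" "  "
)

atom = parse_atom_line(SAMPLE)
print(atom["res_name"], atom["chain_id"], atom["res_seq"], atom["x"], atom["y"], atom["z"])
```

Slicing by column is used instead of splitting on whitespace because neighbouring fields in a PDB file can touch each other without a separating space.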
Exercise 1:
• Obtain the information such as primary citation, molecular description, source, related
PDB entries, and Ligand Chemical component of Human serum albumin.
primary citation:
molecular description: Human serum albumin (HSA) is an abundant plasma protein that
binds a remarkably wide range of drugs, thereby restricting their free, active concentrations.
Total Structure Weight: 134343.86, Atom Count: 8669, Residue Count: 1170, Unique
protein chains: 1, source: Homo sapiens, related PDB entries:
2BXP, 2BXO, 2BXN, 2BXM, 2BXI, 2BXH, 2BXD, 2BXC, 2BXK, 2BXQ, 2BXL, 2BXG,
2BXF, 2BXE, 2BXB, 2BXA, Ligand: AZAPROPAZONE
• Identify the molecule name for the PDB ID 4FG3, download the fasta sequence and the
PDB file for the same. How is fasta format different from bare sequence?
• Identify the residues that form secondary structures such as sheet and alpha helix.
• Which record in PDB file contains the chemical formula of non standard residues and
write the chemical formula of non standard residues present in 4FG3 .
The FORMUL record in the PDB file contains the chemical formula of non-standard
residues; the heteroatom (HETATM) records hold their atomic coordinates. Chemical formula: Zn, Cl, C
Steps:
1.
➢ Open the link www.rcsb.org
➢ Enter the query name as Human serum albumin
➢ Retrieve primary citation, molecular description, source, related PDB entries, and Ligand
Chemical component
2.
➢ Enter the query name as 4FG3
➢ On the right side there is an option as “Download Files”
➢ Download the fasta sequence
3.
➢ Download the PDB file
➢ Look for HELIX and SHEET record.
➢ Note down the residues that form helix and sheet
4.
➢ Download the PDB file
➢ Look for the FORMUL record, which gives the chemical formula of each non-standard residue,
and the heteroatom (HETATM) records, which contain details about the non-protein
substances, e.g. Zn, Cl, etc.
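The difference between FASTA format and a bare sequence asked about in Exercise 1 can be shown with a short Python sketch: a FASTA record carries a header line starting with ">" followed by one or more sequence lines, whereas a bare sequence is the residue string alone. The function name `parse_fasta` and the example record are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record, for illustration only.
example = ">demo_protein example entry\nMKTAYIAK\nQRQISFVK\n"
for header, sequence in parse_fasta(example):
    print("header:", header)
    print("bare sequence:", sequence)
```

Joining the sequence lines and dropping the header is exactly what turns a FASTA record into a bare sequence.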
Exercise 2:
• Find out which structure the PDB ID “1cdw” denotes.
• Where was this structure first published, and what does the molecule do?
This structure was first published in the Proceedings of the National Academy of Sciences of the USA.
The TATA box-binding protein (TBP) is required by all three eukaryotic RNA polymerases
for correct initiation of transcription of ribosomal, messenger, small nuclear, and transfer
RNAs.
• What is the molecular composition of this structure? Which functional classification does
it belong to?
Amino acids
TRANSCRIPTION INITIATION, DNA BINDING, COMPLEX
(TRANSCRIPTION FACTOR/DNA), TRANSCRIPTION/DNA COMPLEX
• What host species was used to clone this gene?
Escherichia coli
• How many domains are observed for SCOP, CATH and pfam?
SCOP-2, CATH-2, pfam-0
• What is NDB? Find the NDB ID associated with this structure.
NDB stands for Nucleic Acid Database. It gives three-dimensional structural information
about nucleic acids. NDB ID: PDT034
• From the downloaded PDB file, find the positions of HOH. From which part of the record
can you collect the information?
280
STEPS
➢ Open the link www.rcsb.org
➢ Enter the query name as 1cdw
➢ Retrieve the information mentioned in the questions.
➢ To find the positions of HOH, download the PDB file and look in the HETATM records.
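The last step can also be done programmatically. The Python sketch below collects the chain and residue number of every HETATM line whose residue name is HOH; the function name `water_records` and the three sample lines are hypothetical and do not come from the 1cdw file.

```python
def water_records(pdb_text):
    """Return (chain, residue number) for each HETATM line whose residue is HOH."""
    waters = []
    for line in pdb_text.splitlines():
        # Residue name is in columns 18-20, chain in column 22, number in columns 23-26.
        if line.startswith("HETATM") and line[17:20] == "HOH":
            waters.append((line[21], int(line[22:26])))
    return waters


# Hypothetical HETATM lines: two water molecules and one zinc ion.
sample = "\n".join([
    "HETATM 1001  O   HOH A 301      10.000  20.000  30.000  1.00 20.00           O",
    "HETATM 1002 ZN    ZN A 201      11.000  21.000  31.000  1.00 15.00          ZN",
    "HETATM 1003  O   HOH A 302      12.000  22.000  32.000  1.00 25.00           O",
])

for chain, number in water_records(sample):
    print("HOH", chain, number)
print("number of water molecules:", len(water_records(sample)))
```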
Exercise 3:
• How many structures of TATA Binding Proteins have been resolved from humans only
(hint: use Boolean Operators)?
71
• Conduct a second search to look for TATA Binding Proteins that have been resolved from
species other than humans. How many did you find and what was the range of species
represented (hint: use Boolean Operators)?
81
STEPS
➢ Boolean operators include AND, OR, NOT
➢ Range of species → check in query refinements → Organism
Exercise 4: LIGAND PROTEIN CONTACTS
• Analyze the ligand protein contacts for 4FG3. Note down the different ligands present
in the structure.
Zinc ion, Chloride ion, Glycerol
• Find the residues that form contacts with Glycerol
GLN 4, HIS 5
• Find the residue that has minimum distance
GLN 4
• Find the residue that forms maximum number of contacts.
GLN 4 - 2 contacts
STEPS
➢ Go to links → Structure features → click Analysis of Ligand- Protein Contacts (LPC)
➢ Choose the respective ligand and click RUN
➢ Choose contacts sorted by residues
➢ To find the maximum number of contacts → choose contacts sorted by contact types and
find which residue forms maximum contacts
EXERCISE-PDB
Exercise 5:
A. Highlight the number of amino acid residues of PDB ID 2F51. Comment on the experimental
data for this structure.
X ray diffraction method
Method: Vapor Diffusion Hanging Drop
pH:4.9
Temp: 293
B. Show the HETATM molecules present in PDB ID 4FYU. Calculate the physical and
chemical properties of these HETATM molecules.
C. Retrieve the FASTA sequence of the protein with PDB ID 4J56.
D. Download the PDB structure of the protein Human Thioredoxin (reduced form). Comment
on the experimental data for this structure.
X ray diffraction method
EXPERIMENT -2-SEQUENCE ALIGNMENT
• Computational approaches to sequence alignment generally fall into two categories: global
alignments and local alignments.
GLOBAL ALIGNMENT
• Global alignments attempt to align every residue in every sequence. Those alignments are
most useful when the sequences in the query set are similar and of roughly equal size. (This
does not mean global alignments cannot end in gaps.)
• Local alignments are more useful for dissimilar sequences that are suspected to contain
regions of similarity or similar sequence motifs within their larger sequence context.
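A minimal Python sketch of local alignment scoring in the style of the Smith-Waterman algorithm is given below. The scoring values (match 2, mismatch -1, linear gap -2) and the function name `smith_waterman_score` are assumptions chosen for illustration; tools such as EMBOSS Water use a substitution matrix and separate gap open and extension penalties instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, so an alignment can start anywhere.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# Two otherwise dissimilar sequences that share the short motif ACG.
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

Resetting negative scores to zero is what lets the method pick out a similar region inside otherwise dissimilar sequences.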
Global alignment
As already said, a global alignment tries to align sequences over their entire length using the
Needleman–Wunsch algorithm. Global alignment is ideal when looking for similarity
between two orthologous, paralogous or otherwise homologous genes, DNA regions or proteins.
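The idea behind the Needleman–Wunsch algorithm can be sketched in Python as a dynamic programming table in which every residue of both sequences must be accounted for. The scoring values (match 1, mismatch -1, linear gap -1) and the function name `needleman_wunsch_score` are assumptions for illustration only; EMBOSS Needle uses a substitution matrix with gap open and extension penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of a and b (linear gap penalty)."""
    # First row: aligning a prefix of b against nothing costs one gap per residue.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(diagonal, previous[j] + gap, current[j - 1] + gap)
        previous = current
    # The score is read from the last cell because the whole length must be aligned.
    return previous[len(b)]


print(needleman_wunsch_score("ACGT", "AGT"))
```

Unlike a local alignment, scores here may become negative, since gaps and mismatches at the ends cannot simply be left out.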
• The submission page has three parts, each representing a step to perform the alignment.
• The first step is to input the sequences that should be aligned. This can be done in several
different ways. The sequences can be entered in the dialog boxes as plain text representing
DNA/RNA or protein sequences or in one of the supported formats (GCG, FASTA, EMBL,
GenBank, PIR, NBRF, Phylip or UniProtKB/SwissProt). Sequences can also be uploaded
in a file (which is what we will do) in one of the supported formats.
• The second step is to set the alignment options; the default values can be kept.
• The third step is the submission of the job. You just need to click on the Submit button.
OUTPUT
The output has three parts.
• The first part contains information about the program and algorithm. You can see which
program has been used, when the job was done, and how to perform the job using command
line on your computer.
• The second part is about your alignment. It contains the options used as alignment parameters,
the sequences used as input and, most importantly, the result summary.
• The third part of the alignment results is the alignment itself. It is composed of three lines.
The first and the third line represent the aligned sequences. The second line represents the
similarity between them.
• Vertical bars (|) are put on the positions in the alignment where the residues between two
sequences are the same.
• Colons (:) are used to mark positions in the alignment where the sequences do not have
the same amino acids, but the residues have similar physicochemical properties (for
example, glutamate and aspartate).
• Finally, a full stop (.) is put on the positions where the residues in the alignment are
neither identical nor similar.
• On the positions in the alignment where gaps are present in the sequences, a blank space
is left.
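The rules for the middle line of the alignment can be sketched in Python as follows. The similarity groups in `SIMILAR_GROUPS` are a small hypothetical set chosen for illustration; the actual tool decides similarity from the scores of its substitution matrix.

```python
# Assumption: an illustrative grouping of amino acids with similar properties.
SIMILAR_GROUPS = [set("DE"), set("KR"), set("ILVM"), set("FYW"), set("ST"), set("NQ")]


def markup_line(top, bottom):
    """Build the middle line of a pairwise alignment from two aligned sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            symbols.append(" ")   # gap: blank space
        elif x == y:
            symbols.append("|")   # identical residues
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            symbols.append(":")   # similar residues
        else:
            symbols.append(".")   # neither identical nor similar
    return "".join(symbols)


top, bottom = "MKDE-LA", "MRDDGLW"
print(top)
print(markup_line(top, bottom))
print(bottom)
```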
To do pairwise sequence alignment
• Go to the EBI main page (http://www.ebi.ac.uk/).
Step 1.
• The first step is to retrieve the sequences of the proteins.
• Go to the UniProtKB database, and enter the respective Id in the search bar. Click on the
FASTA link in the “Sequence” section.
Step 2.
• Go to the EBI main page (http://www.ebi.ac.uk/).
• On the submission page, follow the steps described above: input the sequences (here, by
uploading a file) and then submit the job with the Submit button.
Step 4.
• Upload the files (Saccharomyces_cerevisiae_pho5.fasta and
Candida_albicans_-acid_phosphatase.fasta) with the sequences you want to align and perform
the alignment by pressing the Submit button.
MULTIPLE SEQUENCE ALIGNMENT
Multiple sequence alignment (MSA) is a sequence alignment of three or more biological
sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are
assumed to have an evolutionary relationship by which they share a lineage and are descended
from a common ancestor. From the resulting MSA, sequence homology can be inferred and
phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.
Multiple sequence alignment is often used to assess sequence conservation of protein domains,
tertiary and secondary structures, and even individual amino acids or nucleotides.
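How conservation is read from an MSA can be sketched in Python: for every column, find the most common residue and the fraction of sequences that carry it. The function name `column_conservation` and the three aligned sequences are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column return (most common symbol, fraction of sequences having it)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, count / len(alignment)))
    return result


# Hypothetical alignment of three short sequences ("-" marks a gap).
alignment = ["MKV-L", "MKI-L", "MRVAL"]
conservation = column_conservation(alignment)
conserved = [i for i, (symbol, fraction) in enumerate(conservation)
             if fraction == 1.0 and symbol != "-"]
print(conservation)
print("fully conserved columns:", conserved)
```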
Conserved domains (CD) in proteins play a crucial role in protein interactions, DNA binding,
enzyme activity, and other important cellular processes. Protein domains are often conserved
across many species, and as such, they offer an interesting dataset in how genomes maintain them
with relationship to other conserved domains, as well as to proteome size.
A phylogram is a branching diagram (tree) assumed to be an estimate of a phylogeny, in which
branch lengths are proportional to the amount of inferred evolutionary change. A cladogram is a
branching diagram (tree) assumed to be an estimate of a phylogeny where the branches are of equal
length; thus cladograms show common ancestry, but do not indicate the amount of evolutionary
"time" separating taxa.
ClustalW2 is a general purpose multiple sequence alignment program for DNA or proteins.
It attempts to calculate the best match for the selected sequences, and lines them up so that the
identities, similarities and differences can be seen.
Step 1 – Sequence
Sequence Input Window
Three or more sequences to be aligned can be entered directly into this form. Sequences can
be in GCG, FASTA, EMBL, PIR, NBRF or UniProtKB/Swiss-Prot format. Partially formatted
sequences are not accepted. Adding a return to the end of the sequence may help certain
applications understand the input.
• Sequence Type
Indicates if the sequences to align are protein or nucleotide (DNA/RNA).
Matrix (Protein Only)    Abbreviation
BLOSUM                   blosum
PAM                      pam
Gonnet                   gonnet
ID                       id
DNA Weight Matrix        Abbreviation
IUB                      iub
ClustalW                 clustalw
• KTUP
Fast pairwise alignment word size used to find matches between the sequences. Decrease for
sensitivity; increase for speed.
• Window Length
Fast pairwise alignment window size for joining word matches. Decrease for speed; increase for
sensitivity.
• Score Type
Fast pairwise alignment score type to output.
Score Type    Abbreviation
percent       percent
absolute      absolute
• Top Diags
Fast pairwise alignment number of match regions used to create the pairwise alignment.
Decrease for speed; increase for sensitivity.
• Pair Gap
Fast pairwise alignment gap penalty for each gap created.
BLOSUM blosum
PAM pam
Gonnet gonnet
ID id
• DNA Weight Matrix
Multiple alignment nucleotide sequence comparison matrix used to score the alignment.
IUB iub
ClustalW clustalw
• Gap Open
Multiple alignment penalty for the first residue in a gap.
• Gap Extension
Multiple alignment penalty for each additional residue in a gap.
• Gap Distances
Multiple alignment gaps that are closer together than this distance are penalised.
• No End Gaps
Multiple alignment: disable the gap separation penalty when scoring gaps at the ends of the alignment.
no false
yes true
• Iteration
Multiple alignment improvement iteration type
• Num Iter
Maximum number of iterations to perform
• Clustering
Clustering type.
• Output
Output Format    Description                                          Abbreviation
GCG MSF          GCG Multiple Sequence File (MSF) alignment format    gcg
• Order
Step 4 - Submission
A.EMBOSS
Go to the UniProt home page and type the protein name or ID in the search bar to retrieve the
sequences. Then go to the EMBOSS tools and click NEEDLE or WATER for pairwise alignment.
Exercise 1:
1. Retrieve the following 2 protein sequences Q63716 and O08807 using UNIPROT. Perform the
pair wise alignment between the above 2 sequences. Report the following values/ observations
from the alignment.
1. Alignment Score 726;729
2. Alignment Length 275; 192
3. % and fraction identity 48.7%(134/275); 69.8%(134/192)
4. % and fraction similarity 58.2%(160/275); 82.8%(159/192)
2. Perform the Global and Local alignment between 2 sequences P29600 and P41363 using
EMBOSS tools.
The local and global alignments are shown below.
3. Compare Savinase (P29144) to the human peptidase by global alignment (Needle). Report
the following values /observations from the alignment.
1. Alignment Score - 25.0
2. Alignment Length - 1317
3. % and fraction identity (Identity includes perfect matches only) - 58/1317 ( 4.4%)
4. % and fraction similarity (Similarity includes perfect matches and close matches)- 87/1317
( 6.6%)
4. Repeat the alignment using the local alignment algorithm (Water) and report the same results
as above.
1. Alignment Score - 53.5
2. Alignment Length - 196
3. % and fraction identity (Identity includes perfect matches only) - 38/196 (19.4%)
4. % and fraction similarity (Similarity includes perfect matches and close matches) - 60/196
(30.6%)
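The percentages reported above are simply the fraction of identical (or similar) positions over the alignment length. The Python sketch below shows the calculation; the function name `percent_identity` and the short aligned pair are hypothetical, while the fractions in the loop are the ones reported in this exercise.

```python
def percent_identity(top, bottom):
    """Return (matches, alignment length, percent identity) for two aligned sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    # Gap positions count towards the length but never as matches.
    matches = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    return matches, len(top), round(100.0 * matches / len(top), 1)


print(percent_identity("ACG-T", "ACGAT"))

# Re-computing the percentages from the fractions reported in the exercise.
for count, length in [(134, 275), (134, 192), (58, 1317), (38, 196)]:
    print(f"{count}/{length} = {round(100.0 * count / length, 1)}%")
```

This also explains why the same 134 identical residues give 48.7% in the global alignment but 69.8% in the local one: the local alignment is shorter.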
1. Retrieve the following sequences and perform the multiple sequence alignment.
1. Albumin – Cannabis sativa
2. Retrieve the thioredoxin protein sequence from Rattus norvegicus, Lysobacter silvestris,
Handroanthus impetiginosus. Perform the multiple sequence alignment.
iii) What other sequences are quite similar to this query sequence? What organisms are they
from, and what do they do?
iv) Download any 2 FASTA sequences of 90% identical proteins of above sequence.
3. Perform the BLAST P search on O43790. Comment on E value, Total score, Maximum score
and Query coverage. Provide a screenshot of the BLAST colour code key that yielded the answer.
4. Identify 10 homologous sequences of P68871 from various origins. Find the conserved regions
among them and comment on them.
5. Perform the BLAST P search on myosin from Arabidopsis thaliana. Comment on E value, Total
score, Maximum score and Query coverage. Provide a screenshot of the BLAST color code key
that yielded the answer.
BLAST this sequence against the nucleotide database (nucleotide collection (nr/nt)) with blastn.
Can you identify the gene that corresponds to this sequence? What is the source of the first 280
bases in the sequence?
D. KEGG
Exercise 4:
1. Go to the KEGG database and then go to the GENES database. Find the gene of 55697 (in
human, hsa).
2. Retrieve the amino acid and nucleotide sequence of above gene id.
>hsa:55697 K15305 vacuole morphology and inheritance protein 14 | (RefSeq) VAC14, ArPIKfyve, TAX1BP2, TRX; VAC14 component of PIKFYVE complex (A)
MNPEKDFAPLTPNIVRALNDKLYEKRKVAALEIEKLVREFVAQNNTVQIKHVIQTLSQEFALSQHPHSRKGGLIGLAACSIAL
GKDSGLYLKELIEPVLTCFNDADSRLRYYACEALYNIVKVARGAVLPHFNVLFDGLSKLAADPDPNVKSGSELLDRLLKDIVTESN
KFDLVSFIPLLRERIYSNNQYARQFIISWILVLESVPDINLLDYLPEILDGLFQILGDNGKEIRKMCEVVLGEFLKEIKKNPSSVK
FAEMANILVIHCQTTDDLIQLTAMCWMREFIQLAGRVMLPYSSGILTAVLPCLAYDDRKKSIKEVANVCNQSLMKLVTPEDDELDE
LRPGQRQAEPTPDDALPKQEGTASGGPDGSCDSSFSSGISVFTAASTERAPVTLHLDGIVQVLNCHLSDTAIGMMTRIAVLKWLYH
LYIKTPRKMFRHTDSLFPILLQTLSDESDEVILKDLEVLAEIASSPAGQTDDPGPLDGPDLQASHSELQVPTPGRAGLLNTSGTKG
LECSPSTPTMNSYFYKFMINLLKRFSSERKLLEVRGPFIIRQLCLLLNAENIFHSMADILLREEDLKFASTMVHALNTILLTSTEL
FQLRNQLKDLKTLESQNLFCCLYRSWCHNPVTTVSLCFLTQNYRHAYDLIQKFGDLEVTVDFLAEVDKLVQLIECPIFTYLRLQLL
DVKNNPYLIKALYGLLMLLPQSSAFQLLSHRLQCVPNPELLQTEDSLKAAPKSQKADSPSIDYAELLQHFEKVQNKHLEVRHQRSG
RGDHLDRRVVL
>hsa:55697 K15305 vacuole morphology and inheritance protein 14 | (RefSeq) VAC14, ArPIKfyve, TAX1BP2, TRX; VAC14 component of PIKFYVE complex (N)
atgaaccccgagaaggatttcgcgccgctcacgcctaacatcgtgcgcgccctcaatgacaagctgtacgaaaagcggaaggt
ggcagcgctggagatcgagaagctggtccgggagttcgtggcccagaacaataccgtgcaaatcaagcatgtgatccagaccctgt
cccaggagttgccctgtctagcacccccacagccggaaagggggcctcatcggcctggccgcctgctccatcgcactgggcaagga
ctcagggctctacctgaaggagctgatcgagccagtgctgacctgcttcaatgatgcagacagcaggctgcgctactatgcctgcg
aggccctctacaacatcgtcaaggtggcccggggcgctgtgctgccccacttcaacgtgctctttgacgggctgagcaagctggca
gccgacccagaccccaatgtgaaaagcggatctgagctcctagaccgccttttaaaggacattgtgactgagagcaacaagtttga
cctggtgagcttcatccccttgttgcgagagaggatttactccaacaaccagtatgcccggcagttcatcatctcctggatcctgg
ttctggagtcggtgccagacattaacctgctggattacctgccggagatcctggatggactcttccagatcctgggtgacaatggc
aaagagattcgcaaaatgtgtgaggttgttcttggagaattcttaaaagaaattaagaagaacccctccagtgtgaagtttgctga
gatggccaacatcctggtgatccactgccagacaacagatgacctcatccagctgacagccatgtgctggatgcgggagttcatcc
agctggcgggccgcgtcatgctgccttactcctccgggatctgactgctgtcttgccctgcttggcctacgatgaccgcaagaaaa
gcatcaaagaagtggccaacgtgtgcaaccagagcctgatgaagctggtcacccccgaggacgacgagctggatgagctgagacct
gggcagaggcaggcagagcccacccctgacgatgccctgccaaagcaggagggcacagccagtggaggtccagatggttcctgtga
ctccagcttcagtagcggcatcagtgtcttcactgcagccagcactgaaagagccccagtgacccttcacctcgacgggatcgtgc
aggtcctaaactgccacctcagtgacacggccattgggatgatgaccaggattgcagttctcaagtggctctaccacctctacata
aaactcctcggaagatgttccggcacacggacagcctctttcccatcctactgcagacgttatcggatgaatcggatgaggtgatc
ctgaaggacctggaggtgctggcagaaatcgcttcctcccccgcaggccagacggatgacccaggccccctcgatggccctgacct
ccaggccagccactcagagctccaggtgcccacccctggcagagccggcctactgaacacctctggtaccaaaggcttagaatgtt
ctccttcaactcccaccatgaattcttacttttataagttcatgatcaaccttctcaagagattcagcagcgaacggaagctcctg
gaggtcagaggccctttcatcatcaggcagctgtgcctcctgctaatgcggagaacatcttccactcaatggcagacatcctgctg
cgggaggaggacctcaagttcgcctcgaccatggtccacgccctcaacaccatcctgctgacctccacagagctcttccagctaag
gaaccagctgaaggacctgaagaccctggagagccagaacctgttctgctgcctgtaccgctcctggtgccacaacccagtcacca
cggtgtccctctgcttcctcacccagaactaccggcacgcctatgacctcatccagaagtttggggacctggaggtcaccgtggac
ttcctcgcagaggtggacaagctggtgcagctgattgagtgccccatcttcacatatctgcgcctgcagctgctggacgtgaagac
aacccctacctgatcaaggccctctacggcctgctcatgctcctgccgcagagcagcgccttccagctgctctcgcaccggctcca
gtgcgtgcccaaccctgagctgctgcagaccgaagacagtctaaaggcagcccccaagtcccagaaagctgactcccctagcatcg
actacgcagagctgctgcagcactttgagaaggtccagaacaagcacctggaagtgcggcaccagcggagcgggcgtggggaccac
ctggaccggagggttgtcctctga
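The relationship between the two retrieved sequences can be checked with a short Python sketch that translates the beginning of the nucleotide sequence using the standard genetic code. The function name `translate` is hypothetical; the codon table is built from the conventional TCAG ordering of the standard code.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a nucleotide sequence codon by codon ('*' = stop, 'X' = unknown)."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


# First 30 bases of the nucleotide sequence retrieved above.
print(translate("atgaaccccgagaaggatttcgcgccgctc"))
```

The translated residues can be compared with the start of the amino acid sequence (MNPEKDFAPL...) to confirm that both records describe the same gene.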
EXPERIMENT 3- PHYLOGENETIC TREES
AIM: To construct a phylogenetic tree using phylogeny.fr
DESCRIPTION: Phylogeny.fr is a free, simple to use web service dedicated to reconstructing
and analysing phylogenetic relationships between molecular sequences. It runs and connects
various bioinformatics programs to reconstruct a robust phylogenetic tree from a set of sequences.
PROCEDURE:
1. The sequence of interest and other related sequences were retrieved from their databases in
a single text file.
2. Open the phylogeny.fr website (www.phylogeny.fr)
3. Select the appropriate mode (One Click, Advanced or A la Carte) for phylogenetic analysis.
4. Upload the set of sequences in FASTA, EMBL or NEXUS format from a file.
5. On pressing the Submit button, the result page is displayed.
RESULT: Thus the phylogenetic tree was constructed using the phylogeny.fr tool, and the
evolutionary relationship between various species was studied.
EXERCISE-PHYLOGENETIC ANALYSIS
1. Identify 10 homologous sequences of P68871 from various origins. Find the conserved regions
among them and comment on them. Comment on the evolutionary relationship between
the sequences.
Aim: To identify and analyze gene structures in genomic DNA using web based gene finding tool.
Description:
Genome annotation is the process of identifying the locations of genes and all of the coding
regions in a genome and determining what those genes do. Eukaryotic nuclear genomes are much
larger than prokaryotic ones, with sizes ranging from 10 Mbp to 670 Gbp. They tend to have a very
low gene density. In humans, for instance, only about 3% of the genome codes for genes, with about
1 gene per 100 kbp on average. The space between genes is often very large and rich in repetitive
sequences and transposable elements. Most importantly, eukaryotic genomes are characterized by a
mosaic organization in which a gene is split into pieces called exons by intervening non-coding
sequences (introns). The main issue in the prediction of eukaryotic genes is the identification of
exons, introns and splice sites. To date, numerous computer programs have been developed for
identifying eukaryotic genes, e.g. GENSCAN, FGENESH and GRAIL.
GENSCAN is a program for predicting the locations and exon-intron structures of genes in
genomic sequences from a variety of organisms.
Procedure:
1. Type the following URL in your browser: http://hollywood.mit.edu/GENSCAN.html.
2. The first box asks for the organism type, with 3 options: vertebrate, Arabidopsis
thaliana and Zea mays.
3. The suboptimal exon cutoff may be specified in the next box. The available options are 1.00,
0.50, 0.25, 0.10, 0.05, 0.02 and 0.01.
4. The next step is to fill in the sequence name in the text box provided.
5. Next, specify the print option as either predicted peptides only or predicted CDS and peptides.
Choose the option predicted peptides only.
6. The sequence data can be pasted into the text box provided, or the data file can be uploaded
using the browse option.
7. Click Run GENSCAN and note the results. Prepare a short report on the
following:
i) No of exons predicted in the sequence submitted.
ii) The start and end points for each exon.
iii) Note the other information given by the GENSCAN program: poly-A sites, GC content and
predicted peptides.
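The GC content in the report can be cross-checked independently of GENSCAN. The Python sketch below uses a hypothetical function name `gc_content` and assumes that only the unambiguous bases A, C, G and T are counted.

```python
def gc_content(sequence):
    """Return the percentage of G and C among the A, C, G and T bases of a sequence."""
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        return 0.0
    gc = sum(1 for base in bases if base in "GC")
    return 100.0 * gc / len(bases)


# Hypothetical short sequence, for illustration only.
print(round(gc_content("atgaaccccgagaaggat"), 1))
```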
RESULT:
Thus the locations and exon-intron structures of genes in genomic sequences were studied using
the GENSCAN web server.
EXPERIMENT 5A: DOCKING USING AUTODOCK TOOL
AIM : To perform docking study of a protein with its known inhibitor using Autodock and analysis
of different interactions.
PREPARATION OF LIGAND:
• The charges for the ligand are calculated using Gasteiger method.
• The root for the torsional rotations is detected and the resulting file is saved in PDBQT
format.
PREPARATION OF MACROMOLECULE:
PREPARING THE DOCKING PARAMETER FILE (DPF):
• The macromolecule is set as rigid.
• The ligand for that macromolecule is selected and the ligand parameters are set.
• The docking parameters are set and the search algorithm is selected as Genetic
Algorithm
• Scoring is done based on Lamarckian Genetic Algorithm.
• The docking parameter file is saved as .dpf
• AutoGrid is launched first; this sets the map types for AutoDock.
• AutoDock is launched next; it runs the genetic algorithm and searches the
conformational space for the map types that have been set.
RESULT:
• Clustering Histogram
• From the above histogram, it was found that conformation 4 had the minimum energy (-2.5)
and hence gives the best interaction possible.
The above figure shows the interaction between the protein molecule and the ligand. Thus, the
protein molecule was docked with the ligand by determining the binding site through the
interactions calculated using autodock tool.
8. Download one or more of these structures from the PDB. The files may have an “.ent” file
designation. These are equivalent to “.pdb” files
9. Run PYMOL or RASMOL and view your protein.
10. Now that you know that a reasonable structure exists, submit the sequence to the SWISS-MODEL
web site (http://www.expasy.ch/swissmod/SWISS-MODEL.html). Submit the original sequence
with your e-mail address. The SWISS-MODEL server may take ~0.5-3 hours to return the results
of the modeling exercise. You don’t have to submit a PDB file for SWISS-MODEL to use; it will use
a mixture of all the top hits.
11. You will receive several e-mails from SWISS-MODEL containing some introductory messages
and the results of the modeling and PHD exercise.
12. Save the models to a file and view with PYMOL.
RESULT:
Thus the structure of unknown protein sequence was predicted using SWISS-MODEL
EXPERIMENT 5C: TOOLS FOR CALCULATING PROPERTIES OF SMALL
MOLECULES:
SMALL MOLECULE DATABASES:
AIM: To view and use the various small molecule databases available on the World Wide Web.
3 ChEMBL www.ebi.ac.uk
4 JChemforExcel www.chemaxon.com
7 STITCH Stitch.embl.de
S.NO Tools name Link
3 ACD/Chemsketch www.acdlabs.com
5 ChemWriter Chemwriter.com
2. ADME TOXICITY:
AIM: To view and use the various tools for ADME Toxicity.
Home page of PreADMET Server
RESULT: Thus the tools used to calculate the properties of small molecules were studied.
1. Perform homology modeling for the protein sequence [Accession Number: BAA23356.1]
(Homo sapiens) using SWISS-MODEL.
2. Perform a docking analysis using SwissDock and provide the best pose for the protein (4FYU)
and the ligand plumbagin. Represent any two of the best poses and the difference between them using
the PLIP tool.
EXERCISE: 6 SIMPLE PERL CODES FOR SEQUENCE ANALYSIS
AIM: To study and execute basic codes for sequence analysis using PERL
DESCRIPTION
High-level languages are built on top of low-level languages and hide the complexity of low-level
languages from programmers; interpreters or compilers handle that complexity automatically.
Perl is a high-level language that has been widely used to write scripts that utilize the
Common Gateway Interface (CGI). It is optimized for scanning arbitrary text files, extracting
information from those text files, and printing reports based on that information. It is also a
good language for many system management tasks. The language is intended to be practical
(easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal).
Perl combines some of the features of C, sed, awk, and sh along with some vestiges of csh,
Pascal, and even BASIC-PLUS. Expression syntax corresponds closely to C expression syntax.
Unlike most Unix utilities, Perl does not arbitrarily limit the size of your data; if you have
the memory, Perl can slurp in your whole file as a single string. Recursion is of unlimited
depth, and the tables used by hashes (sometimes called "associative arrays") grow as necessary to
prevent degraded performance.
Perl can use sophisticated pattern matching techniques to scan large amounts of data quickly.
Although optimized for scanning text, Perl can also deal with binary data, and can make dbm
files look like hashes. Setuid Perl scripts are safer than C programs through a dataflow tracing
mechanism that prevents many security holes. Perl does not enforce any particular programming
paradigm (procedural, object-oriented, functional, or others) or even require the programmer to
choose among them.
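The description mentions pattern matching and hashes. The following minimal sketch combines the two to count the bases in a DNA string; the sub name count_bases is chosen here for illustration and is not part of the manual.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often each base occurs in a DNA string.
# The regular expression finds one base at a time; the hash keeps the tally.
sub count_bases {
    my ($dna) = @_;
    my %count;
    while ($dna =~ /([ACGT])/gi) {
        $count{ uc($1) }++;   # upper-case so 'a' and 'A' share one counter
    }
    return %count;
}

my %count = count_bases('ACGGGAGGACGGGAAAATTACTACGGCATTAGC');
for my $base (sort keys %count) {
    print "$base: $count{$base}\n";
}
```

The hash grows as new keys appear, so the same sub works unchanged for any alphabet matched by the pattern.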
RESULT:
Thus basic scripting in Perl was executed and studied.
Program: 1 - A program to concatenate DNA fragments.
#!/usr/bin/perl -w
use strict;
my ($fragment1, $fragment2, $fragment3);
$fragment1 = "TATCGTCAGCAGTCGT";
$fragment2 = "TAGCACTGACTATCGT";
print "Fragment1 is $fragment1\n";
print "Fragment2 is $fragment2\n";
$fragment3 = $fragment1.$fragment2;
print "Ligated together they are $fragment3\n";
Program: 2
A program to store a DNA Sequence
#!/usr/bin/perl -w
# Storing DNA in a variable, and printing it out
# First we store the DNA in a variable called $DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Next, we print the DNA onto the screen
print $DNA;
# Finally, we'll specifically tell the program to exit
exit;
Program: 3
Transcribing DNA into RNA
#!/usr/bin/perl -w
# Transcribing DNA into RNA
# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";
# Transcribe the DNA to RNA by substituting all T's with U's.
$RNA = $DNA;
$RNA =~ s/T/U/g;
# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";
print "$RNA\n";
# Exit the program.
exit;
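The same substitution can be wrapped in a reusable subroutine. This is a sketch only: the sub name transcribe is an assumption, and tr/// is used in place of s///g so that lower-case bases are handled as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the RNA version of a DNA string (T -> U, t -> u).
# The input string is copied first, so the caller's variable is not changed.
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ tr/Tt/Uu/;
    return $rna;
}

my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
print "DNA: $DNA\n";
print "RNA: ", transcribe($DNA), "\n";
```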
Program: 4
Calculating the reverse complement of a strand of DNA
#!/usr/bin/perl -w
# Calculating the reverse complement of a strand of DNA
# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";
$revcom = reverse $DNA;
# See the text for a discussion of tr///
$revcom =~ tr/ACGTacgt/TGCAtgca/;
# Print the reverse complement DNA onto the screen
print "Here is the reverse complement DNA:\n\n";
print "$revcom\n";
print "\nThis time it worked!\n\n";
exit;
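Program 4 uses tr/// rather than a series of s///g substitutions. The sketch below shows why: each s///g works on the output of the previous one, so bases that were already complemented are changed back. The sub names revcom_wrong and revcom are chosen here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrong: sequential substitutions overwrite each other.
sub revcom_wrong {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ s/A/T/g;
    $rc =~ s/T/A/g;   # also turns the T's made from A's back into A's
    $rc =~ s/G/C/g;
    $rc =~ s/C/G/g;   # same problem for G and C
    return $rc;
}

# Right: tr/// maps every character in a single pass.
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

my $dna = 'AACGT';
print "Sequential s///g: ", revcom_wrong($dna), "\n";
print "Single tr///:     ", revcom($dna), "\n";
```

For the input AACGT the sequential version returns a string containing only A and G, which cannot be a correct complement.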
Program: 5
Reading protein sequence data from a file.
#!/usr/bin/perl -w
# Reading protein sequence data from a file
# The filename of the file containing the protein sequence data
$proteinfilename = 'NM_021964fragment.pep';
# First we have to "open" the file, and associate
# a "filehandle" with it. We choose the filehandle
# PROTEINFILE for readability.
open(PROTEINFILE, $proteinfilename) or die "Cannot open $proteinfilename: $!\n";
# Now we do the actual reading of the protein sequence data from the file,
# by using the angle brackets < and > to get the input from the
# filehandle. We store the data into our variable $protein.
$protein = <PROTEINFILE>;
# Now that we've got our data, we can close the file.
close PROTEINFILE;
# Print the protein onto the screen
print "Here is the protein:\n\n";
print $protein;
exit;
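Program 5 reads only the first line of the file, because the angle brackets return one line when assigned to a scalar. A minimal sketch for reading every line follows. It assumes a plain text file with one stretch of sequence per line; the sub name read_sequence is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a sequence file and return its contents as one string
# with the line breaks removed.
sub read_sequence {
    my ($filename) = @_;
    open(my $fh, '<', $filename)
        or die "Cannot open $filename: $!\n";
    my @lines = <$fh>;   # list context: one array element per line
    close $fh;
    chomp @lines;        # strip the newline from every element
    return join '', @lines;
}

# The file name is the one used in Program 5; nothing is printed
# if the file is not present in the current directory.
my $file = 'NM_021964fragment.pep';
print read_sequence($file), "\n" if -e $file;
```

The three-argument form of open with a lexical filehandle ($fh) is used here instead of the bareword PROTEINFILE; both forms read the same data.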
Program: 6
Program to print each element of the array
# Here's one way to declare an array, initialized with a list of four scalar values.
@bases = ('A', 'C', 'G', 'T');
# Now we'll print each element of the array
print "First element: ";
print $bases[0];
print "\nSecond element: ";
print $bases[1];
print "\nThird element: ";
print $bases[2];
print "\nFourth element: ";
print $bases[3];
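Array indices start at 0, so the fourth element is $bases[3]. Instead of one print statement per element, a loop over the indices can be used. This is a sketch; the sub name element_lines is chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one "Element N: X" line per array element.
# $#items is the index of the last element.
sub element_lines {
    my @items = @_;
    my @lines;
    for my $i (0 .. $#items) {
        push @lines, "Element " . ($i + 1) . ": $items[$i]";
    }
    return @lines;
}

my @bases = ('A', 'C', 'G', 'T');
print "$_\n" for element_lines(@bases);
```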
Program: 7
Program to take an element off the end of an array with pop
@bases = ('A', 'C', 'G', 'T');
$base1 = pop @bases;
print "Here's the element removed from the end: ";
print $base1, "\n\n";
print "Here's the remaining array of bases: ";
print "@bases";
Program: 8
Program to take a base off the beginning of an array with shift
@bases = ('A', 'C', 'G', 'T');
$base2 = shift @bases;
print "Here's an element removed from the beginning: ";
print $base2, "\n\n";
print "Here's the remaining array of bases: ";
print "@bases";
Program: 9
Program to put an element at the beginning of the array with unshift
@bases = ('A', 'C', 'G', 'T');
$base1 = pop @bases;
unshift (@bases, $base1);
print "Here's the element from the end put on the beginning: ";
print "@bases\n\n";
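Programs 7 to 9 use pop, shift and unshift; the fourth function of this family is push, which adds an element to the end of an array. The sketch below pairs them to rotate an array in either direction. The sub names are hypothetical, and both subs work on a copy so the original array is left unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Move the last element to the front (pop + unshift), as in Program 9.
sub rotate_right {
    my @items = @_;
    unshift @items, pop @items if @items;
    return @items;
}

# Move the first element to the end (shift + push).
sub rotate_left {
    my @items = @_;
    push @items, shift @items if @items;
    return @items;
}

my @bases = ('A', 'C', 'G', 'T');
my @right = rotate_right(@bases);
my @left  = rotate_left(@bases);
print "Original:      @bases\n";
print "Rotated right: @right\n";
print "Rotated left:  @left\n";
```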