
Unit II Bioinformatics

Bioinformatics

Uploaded by

Bhavana Manimala
Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research.
BIOLOGICAL DATABASES CLASSIFICATION
Biological databases are libraries of life sciences information, collected from scientific experiments and published literature. They contain information from research areas including genomics, proteomics, microarray gene expression, and phylogenetics.
BIOLOGICAL DATABASES

SEQUENCE DATABASES
    NUCLEOTIDE SEQUENCE DB
        1. GenBank
        2. EMBL
        3. DDBJ
        4. Patents
        5. MetaGenomics
    PROTEIN SEQUENCE DB
        1. PIR
        2. SWISS-PROT
        3. SWISS-PROT + TrEMBL
        4. TrEMBL
        5. PROSITE

STRUCTURAL DATABASES
    Primary DB
        1. PDB
        2. NDB
        3. PDBSum
        4. PDBe
    Secondary DB
        1. SCOP
        2. CATH
        3. ProFunc

INTEGRATED or COMPOSITE DATABASES
    1. UniPROT
    2. NRDB
    3. OWL

The biological databases can be briefly classified into
I. Sequence,
II. Structural, and
III. Integrated/Composite Databases.
Diverse types of information may be stored in biological databases, i.e. sequences, images of 2D gels, structures of macromolecules and metabolic pathways. This includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations, as well as similarities of biological sequences and structures. A database of all the databases together is known as a metadatabase.
The information is stored in these databases either in the form of flat files or in a relational database. Depending on the nature of the data they hold, the databases can be categorized into
Primary databases - i.e. repositories of raw data, and
Secondary databases - compiled & filtered data from different databases.
NUCLEOTIDE SEQUENCE DATABASES
The databases EMBL, GenBank and DDBJ are the three primary nucleotide sequence databases.
They include sequences submitted directly by scientists and genome sequencing groups, and sequences taken from the literature and patents. The entries in the EMBL, GenBank and DDBJ databases are synchronized on a daily basis, and the accession numbers are managed in a consistent manner between these three centers.
There are no legal restrictions on the use of the data in these databases. However, there are some patented sequences in the databases.

1. GenBank/NCBI
The GenBank nucleotide database is maintained by the National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH), a federal agency of the US government.
There are approximately 106,533,156,756 bases in the traditional GenBank divisions. The complete data of GenBank is available on the NCBI website, i.e. www.ncbi.nlm.nih.gov.
Historically GenBank doubles in size every 18 months because of the enormous growth in data being deposited on a daily basis from various parts of the world.
NCBI works on the management of biological data as well as on tools to analyse the biological data. To publish sequence information, many journals require submission of the sequence to a database so that its accession number can be cited in the paper. GenBank facilitates the direct submission of sequences using submission tools like BankIt & Sequin. These submission tools can also be used to update a sequence once entered.
A GenBank flat file consists of various keywords as follows:
LOCUS - has sequence length given after accession number
DEFINITION - description of source organism
ACCESSION NUMBER - unique identifier of the sequence
VERSION - as to how many times the sequence has been modified
GI No. - GenInfo Identifier, sequence identifier number
SOURCE - organism name from which it is derived (scientific name)
REFERENCE - publication by authors
AUTHORS - list of authors in the same order as appeared in the publication
TITLE - title of the published or unpublished work
JOURNAL - MEDLINE abbreviation of journal name
FEATURES - information about gene and its products
GENE - gene length, gene name, its function
COMMENTS - points out the changes that occurred in the submitted sequence
and finally ends with the '//' sign.
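The keyword layout described above can be illustrated with a small Python sketch. The record below is a made-up, heavily shortened example (the accession DEMO0001 is hypothetical), and the function names are ours. The sketch assumes that top-level keywords start in the first column, that continuation lines are indented, and that the sequence follows an ORIGIN keyword, as in the real flat-file format; it is not a full GenBank parser.

```python
# Hypothetical, heavily shortened record used only for illustration.
SAMPLE_RECORD = """\
LOCUS       DEMO0001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Illustrative synthetic sequence,
            not a real GenBank entry.
ACCESSION   DEMO0001
VERSION     DEMO0001.1
SOURCE      synthetic construct
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_flat_file(text):
    """Group the lines of one flat-file record under their top-level keyword."""
    fields = {}
    keyword = None
    for line in text.splitlines():
        if line.startswith("//"):        # the '//' sign terminates the record
            break
        if line[:1].strip():             # a keyword starts in the first column
            keyword, _, rest = line.partition(" ")
            fields[keyword] = [rest.strip()]
        elif keyword is not None:        # indented continuation of the last keyword
            fields[keyword].append(line.strip())
    return {key: " ".join(parts).strip() for key, parts in fields.items()}


def extract_sequence(origin_text):
    """Drop position numbers and spaces from the ORIGIN block."""
    return "".join(ch for ch in origin_text if ch.isalpha()).upper()


record = parse_flat_file(SAMPLE_RECORD)
sequence = extract_sequence(record["ORIGIN"])
print(record["ACCESSION"], len(sequence), sequence)
```

Sub-keywords such as AUTHORS or TITLE, which are indented under REFERENCE, are simply folded into their parent keyword here; a real parser would treat them separately.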
Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, bibliographic references and a table of features that identifies coding regions and other sites of biological significance such as transcription units, repeat regions, sites of mutations or modifications and other sequence features. Revisions or updates to GenBank entries can be made by the submitters at any time and can be accepted through the Update option available.
The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore NCBI places no restrictions on the use or distribution of the GenBank data.
2. EMBL
The EMBL (European Molecular Biology Laboratory) nucleotide sequence database is maintained by the European Bioinformatics Institute (EBI) in Hinxton, Cambridge, UK.
The EMBL nucleotide sequence database forms part of the European Nucleotide Archive.
The European Bioinformatics Institute (EBI) is an outstation of EMBL in Heidelberg, Germany. The mission of EBI is the maintenance and provision of biological databases and other information services to support data deposition and free access by the scientific community.
The EMBL Nucleotide Sequence Database is Europe's primary nucleotide database and collaborates with the DNA Data Bank of Japan (DDBJ) and GenBank (USA).
Groups in the scientific community producing large volumes of sequence data are advised to contact the EBI database. (Webin is an interactive web-based submission tool into the EMBL database.)
Following requests from database users, a new subset of EMBL data, the EMBLCDS database, has been created during the last year.
I. PROTEIN SEQUENCE DATABASES
The two protein sequence databases SWISS-PROT/UniProt and PIR are different from the nucleotide databases in that they are both curated. This means that groups of designated curators (scientists) prepare the entries from the literature and/or from contacts with ...
II. PROTEIN STRUCTURAL DATABASES
The structural databases consist of information about the three-dimensional structure of biomolecules. These structures have been deciphered by research scientists using various experimental techniques and deposited in the databases. The following are a few important structural databases.

1. PDB

The Protein Data Bank was established in 1971 at Brookhaven National Laboratory and is the main primary database for 3D structures of biological macromolecules determined by X-ray crystallography and NMR. This database started with seven structures, and from the 1980s the number of deposited structures began to increase dramatically.
Knowing the 3D structure helps to understand the shape of a molecule, which in turn helps to understand how that molecule functions. This enables us to understand biological systems in a more comprehensive way.
Depositors to the PDB have varying expertise in the techniques of X-ray crystallography, NMR, cryo-electron microscopy and theoretical modeling.
A PDB record contains several lines, each line consisting of 80 columns. Every line is identified by its record name, which has a fixed length. Each record file is organized in a well-defined way. A few of the record types and the details they contain are listed below.
HEADER - provides classification of molecule based on molecule type, cellular location etc.
TITLE - title of the experiment
SOURCE - source of biological molecule
COMPND - macromolecular contents of an entry
AUTHOR - person responsible for information of entry
REMARK - experiment details, annotations, explanations and comments etc.
END - the end of the file is marked by "END".
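The fixed-column layout described above can be sketched in Python as follows. The entry is a made-up, shortened example rather than a real structure, the function names are ours, and the sketch assumes that the record name occupies the first six columns of each 80-column line.

```python
# Hypothetical, shortened entry used only for illustration.
SAMPLE_ENTRY = [
    "HEADER    EXAMPLE CLASSIFICATION",
    "TITLE     ILLUSTRATIVE ENTRY, NOT A REAL STRUCTURE",
    "COMPND    MOL_ID: 1;",
    "SOURCE    MOL_ID: 1;",
    "AUTHOR    A.N.AUTHOR",
    "REMARK   2 RESOLUTION. NOT APPLICABLE.",
    "END",
]

RECORD_NAME_WIDTH = 6   # assumption: the record name fills columns 1-6
LINE_WIDTH = 80         # every line is padded to 80 columns


def split_record(line):
    """Split one fixed-width line into (record name, remaining content)."""
    padded = line.ljust(LINE_WIDTH)[:LINE_WIDTH]
    name = padded[:RECORD_NAME_WIDTH].strip()
    content = padded[RECORD_NAME_WIDTH:].strip()
    return name, content


def group_by_record(lines):
    """Collect the content of every line under its record name, up to END."""
    grouped = {}
    for line in lines:
        name, content = split_record(line)
        grouped.setdefault(name, []).append(content)
        if name == "END":
            break
    return grouped


records = group_by_record(SAMPLE_ENTRY)
for name, contents in records.items():
    print(name, contents)
```

Because the record name has a fixed width, no tokenizing is needed: slicing by column is enough to tell the record types apart.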
The PDB data validation requires the following checks and summarization, i.e. covalent bond distances and angles, stereochemical validation, atom nomenclature, ligand and atom nomenclature, sequence comparison, etc.
A key goal of the PDB is to make the archive as consistent and error-free as possible. Current depositions are reviewed carefully by the staff before release. Errors found are communicated to the depositors, and updates are released into the PDB.
Biological databases
As biological data accumulate at larger scales and increase at exponential paces by higher-throughput and lower-cost DNA sequencing technologies, a number of biological databases have been developed to manage the data.
The major objectives of biological databases are not only to store, organize and share data in a structured and searchable manner with the aim to facilitate data retrieval and visualization, but also to provide web application programming interfaces (APIs) for computers to exchange and integrate data from various database resources in an automated manner.
Therefore, developing databases to deal with gigantic volumes of biological data is a fundamentally essential task in bioinformatics.

In short, biological databases integrate enormous amounts of omics data, serving as crucially important resources and becoming increasingly indispensable for scientists.
According to a report of the 2014 Molecular Biology Database Collection in the journal Nucleic Acids Research, there are a total of 1552 databases that are publicly accessible online.
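As an illustration of the web APIs mentioned above, the following Python sketch builds a request URL for the NCBI E-utilities "efetch" service. E-utilities is not named in these notes and is used here only as an assumed example of such an API; the accession DEMO0001 is hypothetical, and the sketch only builds the URL without sending the request.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service (an assumed example API).
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build a URL asking for one record as a GenBank-style flat file."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_EFETCH}?{query}"


url = build_efetch_url("DEMO0001")  # hypothetical accession number
print(url)
```

A program could pass such a URL to any HTTP client to retrieve the record, which is what allows data from different databases to be integrated automatically.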

Database classification:

Biological databases
    Primary
    Composite
    Secondary
    Structural
    Specialized

Biological databases are developed for diverse purposes, encompass various types of data at heterogeneous coverage and are curated at different levels with different methods, so that there are accordingly several different criteria applicable to database classification.
According to the scope of data coverage, biological databases can be classified as
1. comprehensive and 2. specialized databases.

Comprehensive databases cover different types of data from numerous species; typical examples are GenBank, European Molecular Biology Laboratory (EMBL), and DNA Data Bank of Japan (DDBJ).
These three databases were established as the International Nucleotide Sequence Database Collaboration in 1988 to collect and disseminate DNA and RNA sequences.

On the other hand, specialized databases contain specific types of data or data from specific organisms. For example, WormBase is for nematode biology and RiceWiki is for community curation of rice genes.

Level of biocuration

According to the level of data curation, biological databases can roughly fall into primary and secondary or derivative databases.
Primary databases contain raw data as an archival repository, such as the NCBI Sequence Read Archive (SRA), whereas secondary or derivative databases contain curated information as added value, e.g., NCBI RefSeq.

Method of biocuration

As a consequence of the explosive growth of data, curation increasingly requires collective intelligence for collaborative data integration and annotation. Therefore, biological databases can also be classified as (1) expert-curated databases, e.g., RefSeq and TAIR, and (2) community-curated databases, which are curated in a collective and collaborative manner by a number of researchers, e.g., LncRNAWiki and GeneWiki.

1. Primary databases
Primary databases are also called archival databases.
They are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure.
Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.
Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.
Examples
EMBL, GenBank and DDBJ (nucleotide sequence)
Protein Data Bank (PDB: coordinates of three-dimensional macromolecular structures)
2. Secondary databases
Secondary databases comprise data derived from the results of analysing primary data.
Secondary databases often draw upon information from numerous sources, including other databases (primary and secondary), controlled vocabularies and the scientific literature.

They are highly curated, often using a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from the public record of science.
Examples
InterPro (protein families, motifs and domains)
UniProt Knowledgebase (sequence and functional information on proteins)
Ensembl (variation, function, regulation and more layered onto whole genome sequences)
3. However, many data resources have both primary and secondary characteristics. For example, UniProt accepts primary sequences derived from peptide sequencing experiments. However, UniProt also infers peptide sequences from genomic information, and it provides a wealth of additional information, some derived from automated annotation (TrEMBL), and even more from careful manual analysis (SwissProt).
4. There are also specialized databases, which cater to a particular research interest. For example, FlyBase, the HIV sequence database, and the Ribosomal Database Project are databases that specialize in a particular organism or a particular type of data.
Importance of Databases
Databases act as a storehouse of information.
Databases are used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria.
A database allows knowledge discovery, which refers to the identification of connections between pieces of information that were not known when the information was first entered. This facilitates the discovery of new biological insights from raw data.
Secondary databases have become the molecular biologist's reference library over the past decade or so, providing a wealth of information on just about any gene or gene product that has been investigated by the research community.
A database helps to handle cases where many users want to access the same entries of data.
It allows the indexing of data.
It helps to remove redundancy of data.
C. DDBJ (DNA Data Bank of Japan)
It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is the only nucleotide sequence data bank in Asia. Although DDBJ mainly receives its data from Japanese researchers, it can accept data from contributors from any other country.
Secondary databases of nucleotide sequences
Many of the secondary databases are simply sub-collections of sequences culled from one or the other of the primary databases such as GenBank or EMBL.
There is also usually a great deal of value addition in terms of annotation, software, presentation of the information and the cross-references.
1. Omniome Database:
Omniome Database is a comprehensive microbial resource maintained by TIGR (The Institute for Genomic Research). It has not only the sequence and annotation of each of the completed genomes, but also associated information about the organisms (such as taxon and Gram stain pattern), the structure and composition of their DNA molecules, and many other attributes of the protein sequences predicted from the DNA sequences.
2. FlyBase Database:
A consortium sequenced the entire genome of the fruit fly D. melanogaster to a high degree of completeness and quality.
3. ACeDB:
It is a repository of not only the sequence but also the genetic map as well as phenotypic information about the C. elegans nematode worm.

RNA databases

It is well acknowledged that only a tiny proportion of the human genome is transcribed into mRNAs, whereas the vast majority of the genome is transcribed into "dark matter" non-coding RNAs (ncRNAs) that do not encode proteins, including microRNAs (miRNAs), small nucleolar RNAs (snoRNAs), Piwi-interacting RNAs (piRNAs), and long non-coding RNAs (lncRNAs).
Therefore, an increasing number of human RNA databases have been built for deciphering ncRNAs (e.g., GENCODE), in particular lncRNAs that attract rising interest (e.g., LncRNAWiki), and characterizing their functions and interactions (e.g., RNAcentral).
A representative example of an RNA database is RNAcentral. It provides unified access to the ncRNA sequence data supplied by multiple databases including Rfam, lncRNAdb, and miRBase.
Protein databases
Protein Databases - Types and Importance
Protein databases hold protein sequences and the 3D structural data produced by X-ray crystallography and macromolecular NMR.
The biological information of proteins is available as sequences and structures.
Sequences are represented in a single dimension, whereas the structure contains the three-dimensional data of sequences.
A protein database is one or more datasets about proteins, which could include a protein's amino acid sequence, conformation, structure, and features such as active sites.
Protein databases are compiled by the translation of DNA sequences from different gene databases and include structural information. They are an important resource because proteins mediate most biological functions.

Importance of Protein Databases

Huge amounts of data for protein structures, functions, and particularly sequences are being generated. Searching databases is often the first step in the study of a new protein. It has the following uses:
1. Comparison between proteins or between protein families provides information about the relationship between proteins within a genome or across different species, and hence offers much more information than can be obtained by studying only an isolated protein.
2. Secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions.
3. The use of multiple databases often helps researchers understand the structure and function of a protein.
Primary databases of Protein
The PRIMARY databases hold the experimentally determined protein sequences inferred from the conceptual translation of the nucleotide sequences. This, of course, is not experimentally derived information, but has arisen as a result of the interpretation of the nucleotide sequence information.
a. Protein Information Resource (PIR) - Protein Sequence Database (PIR-PSD):
The PIR-PSD is a collaborative endeavor between the PIR, the MIPS (Munich Information Centre for Protein Sequences, Germany) and the JIPID (Japan International Protein Information Database, Japan).
The PIR-PSD is now a comprehensive, non-redundant, expertly annotated database.
A unique characteristic of the PIR-PSD is its classification of protein sequences based on the superfamily concept.
The sequences in PIR-PSD are also classified based on homology domains and sequence motifs.
Homology domains may correspond to evolutionary building blocks, while sequence motifs represent functional sites or conserved regions.
The classification approach allows a more complete understanding of the sequence-function-structure relationship.

b. SWISS-PROT
The other well known and extensively used protein database is SWISS-PROT. Like the PIR-PSD, this curated protein sequence database also provides a high level of annotation.
The data in each entry can be considered separately as core data and annotation.
The core data consists of the sequences entered in the common single-letter amino acid code, and the related references and bibliography. The taxonomy of the organism from which the sequence was obtained also forms part of this core information.
The annotation contains information on the function or functions of the protein, post-translational modifications such as phosphorylation, acetylation, etc., functional and structural domains and sites, such as calcium binding regions, ATP-binding sites, zinc fingers, etc., known secondary structural features, for example alpha helix, beta sheet, etc., the quaternary structure of the protein, similarities to other proteins if any, and associated diseases. Sequence conflicts that may arise due to different authors publishing different sequences for the same protein, or due to mutations in different strains of an organism, are also described as part of the annotation.
TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database released as a supplement to SWISS-PROT.
It contains the translation of all coding sequences present in the EMBL Nucleotide database which have not been fully annotated. Thus it may contain the sequences of proteins that are never expressed and never actually identified in the organisms.
c. Protein Data Bank (PDB):

PDB is a primary protein structure database. It is a crystallographic database for the three-dimensional structure of large biological molecules, such as proteins.
In spite of the name, PDB archives the three-dimensional structures of not only proteins but also all biologically important molecules, such as nucleic acid fragments, RNA molecules, large peptides such as the antibiotic gramicidin, and complexes of proteins and nucleic acids.
The database holds data derived from mainly three sources: structures determined by X-ray crystallography, NMR experiments, and molecular modeling.

Secondary Databases of Protein

The secondary databases are so termed because they contain the results of analysis of the sequences held in primary databases. Many secondary protein databases are the result of looking for features that relate different proteins. Some commonly used secondary databases of sequence and structure are as follows:
a. PROSITE:
A set of databases collects together patterns found in protein sequences rather than the complete sequences. PROSITE is one such pattern database.
The protein motifs and patterns are encoded as "regular expressions".
The information corresponding to each entry in PROSITE is of two forms - the patterns and the related descriptive text.
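To make the idea of a pattern encoded as a regular expression concrete, here is a minimal Python sketch that translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The pattern N-{P}-[ST]-{P} is the familiar N-glycosylation site motif; the test sequence and the function name are made up for illustration, and only the most common syntax elements (x, [..], {..}, repetition counts and the < and > anchors) are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")            # patterns end with a period
    parts = []
    for element in pattern.split("-"):       # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):          # '<' anchors at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # '>' anchors at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):            # x(2) or x(2,4) repetition
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":                   # 'x' matches any residue
            core = "."
        elif element.startswith("{"):        # {ABC}: any residue except A, B, C
            core = "[^" + element[1:-1] + "]"
        else:                                # a single residue or an [ABC] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hits = [match.start() for match in re.finditer(regex, "MKNASPLNGTQ")]
print(regex, hits)
```

In the made-up sequence, the candidate site starting at the first N is rejected because a proline follows the serine, while the second site satisfies every position of the pattern.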
b. PRINTS:

In the PRINTS database, the protein sequence patterns are stored as 'fingerprints'. A fingerprint is a set of motifs or patterns rather than a single one.
The information contained in a PRINTS entry may be divided into three sections. In addition to the entry name, accession number and number of motifs, the first section contains cross-links to other databases that have more information about the characterized family.
The second section provides a table showing how many of the motifs that make up the fingerprint occur in how many of the sequences in that family.
The last section of the entry contains the actual fingerprints, which are stored as multiple aligned sets of sequences; the alignment is made without gaps. There is, therefore, one set of aligned sequences for each motif.
c. MHCPep:
MHCPep is a database comprising over 13000 peptide sequences known to bind the Major Histocompatibility Complex of the immune system.
Each entry in the database contains not only the peptide sequence, which may be 8 to 10 amino acids long, but in addition has information on the specific MHC molecules to which it binds, the experimental method used to assay the peptide, the degree of activity and the binding affinity observed, the source protein that, when broken down, gave rise to this peptide along with others, the positions along the peptide where it anchors on the MHC molecules, and references and cross-links to other information.
d. Pfam

Pfam contains profiles built using hidden Markov models (HMMs).
HMMs build the model of the pattern as a series of match, substitute, insert or delete states, with scores assigned for the alignment to go from one state to another.
Each family or pattern defined in Pfam consists of four elements. The first is the annotation, which has the information on the source used to make the entry, the method used and some numbers that serve as figures of merit.
The second is the seed alignment that is used to bootstrap the rest of the sequences into the multiple alignments and then the family.
The third is the HMM profile.
The fourth element is the complete alignment of all the sequences identified in that family.
The purpose of constructing protein databases includes collection of universal proteins (e.g., UniProt), identification of protein families and domains (e.g., Pfam), reconstruction of phylogenetic trees (e.g., TreeFam), and profiling of protein structures (e.g., PDB).
A representative example of a protein database is PDB, the main primary database for 3D structures of biological macromolecules determined by X-ray crystallography and NMR. Established in 1971, PDB contains 105,465 biological macromolecular structures as of 30 December 2014, of which 27,393 entries belong to human.
Another example is the Universal Protein Resource (UniProt). As a collaborative project between EMBL-EBI, Swiss Institute of Bioinformatics (SIB), and Protein Information Resource (PIR), UniProt provides a comprehensive, high-quality, and freely-accessible resource of protein sequence and functional information. Currently, UniProt includes three member databases: UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), and UniProt Archive (UniParc). In addition, UniProtKB consists of two sections: Swiss-Prot (containing a collection of 547,357 manually-annotated and -reviewed proteins as of January 2015) and TrEMBL (containing a collection of 89,451,166 un-reviewed proteins as of January 2015).

Expression databases
Expression databases can be used for various purposes, including archiving expression data (e.g., GEO [26]), detecting differential and baseline expression (e.g., Expression Atlas [27]), exploring tissue-specific gene expression and regulation (e.g., TiGER [28]), and profiling expression information based on both RNA and protein data (e.g., Human Protein Atlas [29]).
A representative case of an expression database is the Human Protein Atlas. As of 30 December 2014, it encompasses expression profiles for a large majority of human protein-coding genes based on both RNA (transcriptome analysis based on 213 tissue and cell line samples) and protein data (proteome analysis based on 24,028 antibodies) (http://www.proteinatlas.org).
Pathway databases
Pathway databases contain biological pathways for metabolic, signaling, and regulatory pathway analysis. A representative example is KEGG PATHWAY [30], a curated biological pathway resource on the molecular interaction and reaction networks. As the core of KEGG, KEGG PATHWAY integrates many entities that are stored in KEGG sibling databases, including genes, proteins, RNAs, chemical compounds, and chemical reactions (http://www.genome.jp/kegg/pathway.html).
Disease databases

There are at least 200 forms of cancer in the world, causing 14.6% of all human deaths (http://en.wikipedia.org/wiki/Cancer). Thus, obtaining complete cancer genomes and identifying molecular mutations and abnormal genes can provide new insights for cancer prevention, detection, and eventually, personalized treatment [31]. Toward this end, there are two well-known cancer projects, viz., The Cancer Genome Atlas (TCGA) [32] and International Cancer Genome Consortium (ICGC) [33]. TCGA, founded in 2006 by the National Cancer Institute and National Human Genome Research Institute at the National Institutes of Health, aims to collect a wide diversity of omics data (including exome, SNP, mRNA, miRNA, and methylation) for more than 20 different types of human cancer (http://cancergenome.nih.gov). Unlike TCGA, ICGC is a voluntary collaborative organization initiated in 2008 and open to all cancer and genomic researchers in the world. It aims to obtain a comprehensive description of genomic, transcriptomic, and epigenomic changes in 50 different tumor types and/or subtypes, which are of clinical and societal importance across the globe (http://icgc.org).
KEGG
Introduction
KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from genomic and molecular-level information.
It is a computer representation of the biological system, consisting of molecular building blocks of genes and proteins (genomic information) and chemical substances (chemical information) that are integrated with the knowledge on molecular wiring diagrams of interaction, reaction and relation networks (systems information).
It also contains disease and drug information (health information) as perturbations to the biological system.
Kyoto Encyclopedia of Genes and Genomes (KEGG):
"It is a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals."
Pathway database
Records networks of molecular interactions.
OBJECTIVES:
Computerize current knowledge of biological systems & provide consistent annotations.
Maintain gene catalogs for sequenced genomes.
Maintain catalogs of chemical reactions in living cells by LIGAND.
Provide new informatics technologies toward predicting biological systems and designing further experiments.
KEGG maintains 6 main databases:
KEGG Pathway (Atlas)
KEGG Genes
KEGG Genome
KEGG Ligand
KEGG BRITE
KEGG Cancer
KEGG PATHWAY
KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for:
Metabolism
Genetic Information Processing
Environmental Information Processing
Cellular Processes
Organismal Systems
Human Diseases
Drug Development
KEGG GENES:
Collection of gene catalogs for all complete genomes.
Four types:
MGENES (Meta Genomes)
VGENES (Viral Genomes)
DGENES (Draft Genomes)
EGENES (EST database)
KEGG GENOME:
Collection of KEGG organisms, which are the organisms with known complete genome sequence.
KEGG BRITE:
Collection of manually created hierarchical text files capturing functional hierarchies of various biological objects, especially those represented as KEGG objects.
Applications:
Used to detect functionally related enzyme clusters.
Automatic detection of conserved gene clusters in multiple genomes by graph comparison.
Modeling & Simulation
Browsing and retrieval of data
CONCLUSION:
The Kyoto Encyclopedia of Genes and Genomes is a vast library of information gathered from fully sequenced genomes, genes, proteins, pathways, and chemical compounds pertaining to over a hundred different species of both prokaryotes and eukaryotes.
GENOME ANNOTATION
DNA ahotation or genome annotatlon is the process of
ideaiving the locations ofgenesS and
al or the coding regions in a genomé'ard determining what those genes do.
An annotation
(irrespective of the context) is a note added by wav of explanation or comentary. Once a
genome is sequenced, it needs to be annotated to make sense of it.u
ror DNA annotation, apreviously unknown sequence representation of genetic material is
eniched with information relating qengmic position to intron-exon boundaries, requlatory
Sequences, repeats, gene names and protein products. This annotation is stored in nomiC
databases such as Mouse Genome Informatics, FVBase,and WormBase. Educational materials
on some aspects of biological annotation from the 2006 Gene Ontology annotation camp and
similar events are available at the Gene Ontology website.
The National Center for Biomedical Ontology (www.bioontology.org) develops tools for automated
annotation of database records based on the textual descriptions of those records.
As a general method, dcGO has an automated procedure for statistically inferring associations
between ontology terms aDd protein domains or combinations of domains from the existing
gene/protein-level annotations.
Genome annotation consists of three main steps:
1: identifying portions of the genome that do not code for proteins
2. identifying elements on the genome, a process called gene prediction, and
3. attaching biological information to these elements.
The simplest way to perform gene annotation relies on homology-based search tools, like BLAST,
to search for homologous genes in specific databases. The resulting information is then used to
annotate genes and genomes. The additional information allows manual annotators to
deconvolute discrepancies between genes that are given the same annotation. Some databases
use genome context information, similarity scores, experimental data, and integrations of other
resources to provide genome annotations through their Subsystems approach. Other databases
(e.g. Ensembl) rely on both curated data sources and a range of different software tools in
their automated genome annotation pipeline.
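The homology-based approach described above can be sketched in a few lines: take the hits of a BLAST search, keep the best hit per query gene, and copy that hit's description onto the query. The sketch assumes BLAST+ tabular output (`-outfmt 6`) with its default twelve columns. The hit lines, the subject identifiers, the description text and the E-value cutoff are all invented for illustration, and real pipelines apply further checks (coverage, identity, manual curation).

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query_id: row} with the highest-bitscore hit under the E-value cutoff."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        if float(row["evalue"]) > max_evalue:
            continue  # too weak to support annotation transfer
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


def transfer_annotations(hits, descriptions):
    """Copy the description of each best-hit subject onto the query gene."""
    return {query: descriptions.get(hit["sseqid"], "hypothetical protein")
            for query, hit in hits.items()}


# Made-up hits and descriptions, for illustration only.
EXAMPLE_HITS = "\n".join([
    "geneA\tsubj1\t92.5\t200\t15\t0\t1\t200\t1\t200\t1e-80\t350",
    "geneA\tsubj2\t60.0\t180\t72\t2\t5\t184\t3\t182\t1e-20\t120",
    "geneB\tsubj3\t30.0\t50\t35\t1\t1\t50\t10\t59\t0.5\t25",
])
EXAMPLE_DESCRIPTIONS = {"subj1": "DNA polymerase III subunit alpha"}

annotations = transfer_annotations(best_hits(EXAMPLE_HITS), EXAMPLE_DESCRIPTIONS)
print(annotations)
```

Here `geneB` receives no annotation because its only hit falls above the cutoff; this is the kind of case a manual annotator would examine.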
Structural annotation consists of the identification of genomic elements.
ORFs and their localization
gene structure
coding regions
location of regulatory motifs
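Locating ORFs, the first item in the list above, can be illustrated with a minimal six-frame scan. This is a teaching sketch and not how any particular annotation tool works: it assumes the standard start codon ATG and the three standard stop codons, ignores alternative start codons, and reports offsets on whichever strand was scanned. The function names and the `min_codons` threshold are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) for ATG...stop ORFs in all six frames.

    start/end are 0-based, end-exclusive offsets on the scanned strand;
    min_codons counts the stop codon as part of the ORF.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

On the short example sequence the only ORF found is `ATGAAATTTTAG` on the forward strand, starting at offset 2.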
Functional annotation consists of attaching biological information to genomic elements.
biochemical function
biological function
involved regulation and interactions
expression
These steps may involve both biological experiments and in silico analysis. Proteogenomics-based
approaches utilize information from expressed proteins, often derived from mass spectrometry, to
improve genomics annotations.
GENOME ANNOTATIONS
Genome annotation is the process of identifying the locations of genes and all of
the coding regions in a genome and determining what those genes do.
Once a genome is sequenced, it needs to be annotated to make sense of it by attaching
biological information to sequences.
Since the 1980s, molecular biology and bioinformatics have created the need for DNA
annotation.
DNA annotation involves enriching the genetic material with information
relating genomic position to intron-exon boundaries, regulatory
sequences, repeats, gene names and protein products.
> This annotation is stored in genomic databases such as Mouse Genome
Informatics, FlyBase, and WormBase.
> The National Center for Biomedical Ontology (www.bioontology.org) develops tools for
automated annotation.
Genome annotation consists of three main steps:
1. Identifying portions of the genome that do not code for proteins
2. Identifying elements on the genome, a process called gene prediction
3. Attaching biological information to these elements
> A simple method of gene annotation relies on homology-based search tools,
like BLAST.
> Using a BLAST search, the homologous genes in specific databases can be found and the
resulting information is then used to annotate genes and genomes.
> Automatic annotation tools try to perform all of this by computer analysis, as opposed to
manual annotation (a.k.a. curation). Ex. Genes in a eukaryotic genome can be annotated
using various annotation tools such as FINDER.
> Ideally both the approaches co-exist and complement each other in the same annotation
pipeline (process).
Structural annotation consists of the identification of genomic elements.
> ORFs and their localization
> Gene structure
> Coding regions
> Location of regulatory motifs
> Splice sites
> Non-coding regions
> Introns
Functional annotation consists of attaching biological information to genomic
elements.
> Biochemical function
> Biological function
> Involved regulation and interactions
> Expression

[Figure: functional annotation searches. Domains/motifs search (NCBI CDD, InterPro databases, Pfam, signal/targeting prediction); orthology search (KEGG KO groups, eggNOG); homology search (BLAST against NR, UniProt, Swiss-Prot). Outputs shown: domains, sites, families; pathways, reactions; GO terms; putative name.]

> These steps may involve both biological experiments and in silico
analysis.
Genome annotation remains a major challenge for scientists investigating the human
genome.
[Figure: flow chart with boxes labelled Genome sequence; General database search; Statistical gene prediction; Prediction of structural features; Gene/protein/RNA set; Specialized database search; Predicted gene functions; Context analysis, genome comparison; FB.]
A generalized flow chart of genome annotation
General database search: searching sequence databases (typically, NCBI NR) for sequence similarity,
usually using BLAST. Specialized database search: searching domain databases, such as Pfam, SMART,
and CDD, for conserved domains; genome-oriented databases, such as COGs, for identification of
orthologous relationships and refined functional prediction; metabolic databases, such as KEGG, for
metabolic pathway reconstruction; and, possibly, other database searches. Statistical gene prediction:
use of methods like GeneMark or Glimmer to predict protein-coding genes. Prediction of structural
features: prediction of signal peptides, transmembrane segments, coiled domains and other features in
putative protein functions.
