
Module-I

The document provides an overview of bioinformatics, highlighting its interdisciplinary nature that combines computer science and biological science. It discusses the central dogma of molecular biology, the structure and function of DNA, and the goals and applications of bioinformatics in various fields such as drug design and agriculture. Additionally, it addresses the limitations of bioinformatics, emphasizing the importance of high-quality data for accurate analysis.

Uploaded by

kpavankumar887

Bioinformatics

— Unit I—

Dr. Chandra Mohan D


Assistant Professor
Computer Science and Engineering Group
Indian Institute of Information Technology, Sri City

If you know your own DNA sequence, then you know everything about yourself

January 20, 2025 1


Outline

 Bioinformatics and Biological databases


 Introduction, Scope and applications (03-01-24)
 Central dogma of molecular biology
 DNA and protein sequence databases
 Sequence retrieval from databases
 Sequence formats
 DNA Sequencing, Submission of sequences to
databases
What is Bioinformatics?
 Bioinformatics is an interdisciplinary research area at the interface
between computer science and biological science
 Union of biology and informatics

 Bioinformatics involves technology that uses computers for the
 Storage, retrieval, manipulation, and distribution of information
 Related to biological macromolecules such as DNA, RNA, and proteins
 The emphasis is on the use of computers because
 Most of the tasks in genomic data analysis are highly repetitive or
mathematically complex
Bioinformatics and Computational Biology
 The use of computers is absolutely indispensable in mining genomes
for information gathering and knowledge building
 Bioinformatics differs from a related field known as computational
biology
 Bioinformatics refers to the study of large sets of biodata, biological
statistics, and results of scientific studies
 Bioinformatics is limited to
 Sequence, structural, and functional analysis of genes and genomes and
their corresponding products, and is often considered computational
molecular biology
 Bioinformatics is the development and application of computational
tools in managing all kinds of biological data
 Example: Prediction of protein function from DNA sequence and
structural information
Bioinformatics and Computational Biology
 Computational biology, by contrast, is concerned with solutions to
issues that have been raised by studies in bioinformatics.
 However, computational biology encompasses all biological areas that
involve computation
 Computational biology is more confined to the theoretical development
of algorithms used for bioinformatics
 Computational biology is useful in scientific research, including
 The examination of how proteins interact with each other
 Through the simulation of protein folding, motion, and interaction
 Bioinformatics tends to concern itself with the gathering and
collation of biodata, and computational biology with the practical
application of this biodata
 Though the two fields are interrelated, bioinformatics and
computational biology differ in the kinds of needs they address
Genetic Material
 DNA, or deoxyribonucleic acid, is the hereditary material in humans
and almost all other organisms
 Nearly every cell in a person’s body has the same DNA
 Most DNA is located in the cell nucleus
 The information in DNA is stored as a code made up of four
chemical bases:
 Adenine (A), Guanine (G),
 Cytosine (C), and Thymine (T)
 Human DNA consists of about 3 billion bases, and more than 99% of
those bases are the same in all people
 When DNA is transmitted from parents to children, it can determine
some of the children's characteristics,
 Such as their eye color or hair color
DNA Bases
 The sequence of these bases determines the
 Information available for building and maintaining an organism

 DNA bases pair up with each other, A with T and C with G, to form
units called base pairs
 Each base is also attached to a
sugar molecule and a phosphate molecule

 Together, a base, sugar, and phosphate are called a nucleotide


 Nucleotides are arranged in two long strands that form a spiral called
a double helix
 The structure of the double helix is somewhat like a ladder
 An important property of DNA is that it can replicate, or make
copies of itself
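The pairing rule above (A with T, C with G) is what makes copying possible: each strand fully determines its partner. A minimal Python sketch of that idea follows; the function names and the example sequence are illustrative assumptions, not part of the original slides.

```python
# Base pairing as stated on the slide: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]


example = "ATGCGT"  # hypothetical sequence, for illustration only
print(complement(example))              # TACGCA
print(reverse_complement(example))      # ACGCAT
print(complement(complement(example)))  # ATGCGT (the original strand is recovered)
```

Taking the complement twice gives back the starting strand, which is the sense in which either strand can serve as a pattern for the other.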
Gene expression
 Each strand of DNA in the double helix can serve as a pattern for
duplicating the sequence of bases
 This is critical when cells divide because each new cell needs to have
an exact copy of the DNA present in the old cell
 A gene is the basic physical and functional unit of heredity
 A DNA molecule isn't just a long, boring string of nucleotides
 Each chromosome contains many genes
 Some genes act as instructions to make
molecules called proteins
 However, many genes do not code for proteins
 In humans, genes vary in size from a few
hundred DNA bases to more than 2 million
bases
Gene expression
 The Human Genome Project, which set out to determine the sequence of
the human genome and identify the genes that it contains,
 Estimated that humans have between 20,000 and 25,000 genes
 Every person has two copies of each gene, one inherited from
each parent
 Most genes are the same in all people, but a small number of
genes (<1%) are slightly different between people
 Alleles are forms of the same gene with small differences in their
sequence of DNA bases
 These small differences contribute to each person’s unique physical
features


Gene expression
 DNA is divided up into functional units called genes
 Each gene provides instructions for a functional product, that is, a
molecule needed to perform a job in the cell
 In many cases, the functional product of a gene is a protein
 Mendel's flower color gene provides instructions for a protein
that helps make colored molecules (pigments) in flower petals
 The functional products of most known genes are proteins, or,
more accurately, polypeptides
 Polypeptide is just another word for a chain of amino acids


Gene expression
 Although many proteins consist of a single polypeptide, some are
made up of multiple polypeptides
 Genes that specify polypeptides are called protein-coding genes
 Many genes provide instructions for building polypeptides, but not all
genes specify polypeptides
 Instead, some provide instructions to build functional RNA
molecules, such as
 the transfer RNAs and
 ribosomal RNAs that play roles in translation
 How, exactly, does DNA direct the construction of a polypeptide?


Goals of Bioinformatics
 To better understand a living cell and how it functions at the molecular level
 Bioinformatics research can generate
 New insights and provide a “global” perspective of the cell
 By analyzing raw molecular sequence and structural data
 The functions of a cell can be better understood by analyzing
sequence data because
 The flow of genetic information is dictated by the “central dogma”
of biology, in which
 DNA is transcribed to RNA, which is translated to proteins
 Cellular functions are mainly performed by proteins whose capabilities
are ultimately determined by their sequences
 Therefore, solving functional problems using sequence and sometimes
structural approaches has proved to be a fruitful endeavor
Scope of Bioinformatics
 Bioinformatics consists of two subfields:
 The development of computational tools and databases and
 The application of these tools and databases in generating biological
knowledge to better understand living systems
 These two subfields are complementary to each other
 The tool development includes
 Writing software for sequence, structural, and functional analysis, as
well as the construction and curating of biological databases
 These tools are used in three areas of genomic and molecular biological
research: molecular sequence analysis, molecular structural analysis,
and molecular functional analysis
 The analyses of biological data often generate new problems and
challenges that in turn spur the development of new and better
computational tools
Scope of Bioinformatics
 The areas of sequence analysis include
 sequence alignment,
 sequence database searching,
 motif and pattern discovery,
 gene and promoter finding,
 reconstruction of evolutionary relationships, and
 genome assembly and comparison
 The structural analyses include
 protein and nucleic acid structure analysis, comparison,
classification, and prediction
 The functional analyses include
 gene expression profiling,
 protein–protein interaction prediction,
 protein subcellular localization prediction,
 metabolic pathway reconstruction and simulation


Applications of Bioinformatics
 Bioinformatics has not only become essential for basic genomic and
molecular biology research, but is having
 a major impact on many areas of biotechnology and biomedical
sciences
 It has applications in
 knowledge-based drug design,
 forensic DNA analysis, and
 agricultural biotechnology
 Computational studies of protein–ligand interactions provide a
rational basis for
 the rapid identification of novel leads for synthetic drugs
 Knowledge of the 3D structures of proteins allows
 molecules to be designed that are capable of binding to the receptor
site of a target protein with great affinity and specificity
Applications of Bioinformatics
 This informatics-based approach significantly reduces the time and cost
necessary to develop drugs with
 higher potency,
 fewer side effects, and
 less toxicity than using the traditional trial-and-error approach
 It is worth mentioning that genomics and bioinformatics are now poised
to revolutionize our healthcare system
 by developing personalized and customized medicine
 In forensics, results from molecular phylogenetic analysis have been
accepted as evidence in criminal courts
 Plant genome databases and gene expression profile analyses have
played an important role in
 the development of new crop varieties that have higher productivity and
more resistance to disease
Limitations of Bioinformatics
 In fact, bioinformatics has a number of inherent limitations
 Overreliance on poor-quality intelligence can yield costly mistakes if
not complete failures
 Bioinformatics depends on experimental science to produce raw data
for analysis
 The quality of bioinformatics predictions depends on the quality of data
and the sophistication of the algorithms being used
 Sequence data from high throughput analysis often contain errors
 If the sequences are wrong or annotations incorrect,
 the results from the downstream analysis are misleading as well



Central dogma of molecular biology
 Building a polypeptide from the information in a gene involves two
major steps: transcription and translation
 In transcription, the DNA sequence of a gene is copied to make an
RNA molecule
 Transcription involves rewriting, or transcribing,
 The DNA sequence into a similar RNA “alphabet”
 In eukaryotes, the RNA molecule must undergo processing to
become a mature messenger RNA (mRNA)
 In translation, the sequence of the mRNA is decoded to specify the
amino acid sequence of a polypeptide
 The name translation reflects that the nucleotide sequence of the
mRNA must be translated into
 the completely different "language" of amino acids


Central dogma of molecular biology
 Thus, during expression of a protein-coding gene, information flows
from DNA → RNA → protein
 This directional flow of information is known as the central
dogma of molecular biology
 Non-protein-coding genes (genes that specify functional RNAs) are
still transcribed to produce an RNA, but this RNA is not translated
into a polypeptide
 For either type of gene, the process of going from DNA to a
functional product is known as gene expression


RNA Transcription
 In transcription, the non-coding (template) strand acts as a template
for the synthesis of a matching (complementary) RNA strand by an enzyme
called RNA polymerase
 This RNA strand is the primary transcript
 The primary transcript carries the same sequence information as the
non-transcribed strand of DNA, also called the coding strand
 However, the primary transcript and the coding strand of DNA are not
identical
 One important difference is that RNA molecules do not include the
base thymine (T)
 Instead, they have the similar base uracil (U)
 Like thymine, uracil pairs with adenine
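The relationship between the two DNA strands and the transcript can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (no processing, whole sequence transcribed); the function names and the example sequences are hypothetical.

```python
def transcribe(coding_strand: str) -> str:
    """RNA carries the coding-strand sequence with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def transcribe_from_template(template_strand: str) -> str:
    """Pair each template base with its RNA partner (A-U, T-A, C-G, G-C).

    The template is given 3'->5', so the RNA comes out 5'->3'.
    """
    rna_partner = {"A": "U", "T": "A", "C": "G", "G": "C"}
    return "".join(rna_partner[base] for base in template_strand.upper())


coding = "ATGGCC"    # hypothetical coding strand, 5'->3'
template = "TACCGG"  # its complement, written 3'->5'
print(transcribe(coding))                  # AUGGCC
print(transcribe_from_template(template))  # AUGGCC
```

Both routes give the same RNA, which is why the transcript is said to carry the sequence of the coding strand even though it is synthesized against the template strand.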
Transcription and RNA processing: Eukaryotes vs.
bacteria
 In bacteria (prokaryote), the primary RNA transcript can directly
serve as a messenger RNA, or mRNA
 Messenger RNAs get their name because they act as messengers
between DNA and ribosomes
 The 5' and 3' designations refer to the numbered carbon atoms of the
sugar molecule (deoxyribose in DNA, ribose in RNA) to which the
phosphate groups bond


Transcription and RNA processing: Eukaryotes vs.
bacteria
 Ribosomes are RNA-and-protein structures in the cytosol where
proteins are actually made
 In eukaryotes, a primary transcript has to go through some extra
processing steps in order to become a mature mRNA
 During processing, a cap is added to the 5' end and a poly-A tail to
the 3' end of the RNA, and some pieces of it (introns) may be removed in
a process called splicing
 The location of transcription is also different between prokaryotes
and eukaryotes
 Eukaryotic transcription takes place in the nucleus, where the DNA
is stored, while protein synthesis takes place in the cytosol
 Because of this, a eukaryotic mRNA must be exported from the
nucleus before it can be translated into a polypeptide
 Prokaryotic cells, on the other hand, don't have a nucleus, so they
carry out both transcription and translation in the cytosol
Translation
 After transcription (and, in eukaryotes, after processing), an mRNA
molecule is ready to direct protein synthesis
 The process of using information in an mRNA to build a polypeptide
is called translation
The genetic code:
 During translation, the nucleotide sequence of an mRNA is translated
into the amino acid sequence of a polypeptide
 Specifically, the nucleotides of the mRNA are read in triplets
called codons
 There are 61 codons that specify amino acids
 One of these codons, AUG, is a "start" codon that indicates where to
start translation
 The start codon specifies the amino acid methionine, so most
polypeptides begin with this amino acid
Translation
 Three other “stop” codons (UAA, UAG, and UGA) signal the end of a
polypeptide
 These relationships between codons and amino acids are called
the genetic code
 Translation takes place inside structures known as ribosomes
 Ribosomes are molecular machines whose job is to build polypeptides


Translation
 Once a ribosome latches on to an mRNA and finds the "start" codon,
it will travel rapidly down the mRNA, one codon at a time
 As it goes, it will gradually build a chain of amino acids that exactly
mirrors the sequence of codons in the mRNA

How does the ribosome "know" which amino acid to add for each
codon?
 This matching is not done by the ribosome itself
 Instead, it depends on a group of specialized RNA molecules
called transfer RNAs (tRNAs)
 Each tRNA has three nucleotides sticking out at one end, which
can recognize just one or a few particular codons
 At the other end, the tRNA carries an amino acid, specifically the
amino acid that matches those codons
Translation
 There are many tRNAs floating around in a cell, but
 only a tRNA that matches the codon that's currently being read
can bind and deliver its amino acid cargo
 Once a tRNA is snugly bound to its matching codon in the ribosome,
its amino acid will be added to the end of the polypeptide chain
 This process repeats many times, with the ribosome moving down the
mRNA one codon at a time
Translation
 A chain of amino acids is built up one by one, with an amino acid
sequence that matches the sequence of codons found in the mRNA
 Translation ends when the ribosome reaches a stop codon and
releases the polypeptide
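The codon-by-codon reading described on the translation slides can be sketched as a table lookup. The following Python sketch assumes the standard genetic code and uses one-letter amino acid symbols; the function name and the example mRNA are illustrative, and real translation involves far more machinery than a dictionary.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in one-letter symbols, codons ordered UUU, UUC, UUA, UUG, UCU, ...
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon (or the end of the mRNA)."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the polypeptide
            break
        peptide.append(amino_acid)
    return "".join(peptide)


print(translate("GGAUGGCCUUUUAAGG"))  # MAF (Met-Ala-Phe), hypothetical mRNA
```

Counting the non-stop entries of `CODON_TABLE` gives 61, matching the number of amino-acid codons stated on the slide.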



Translation
What happens next?
Once the polypeptide is finished,
 it may be processed or modified,
 combine with other polypeptides, or
 be shipped to a specific destination inside or outside the cell
 Ultimately, it will perform a specific job needed by the cell or
organism
 perhaps as a signaling molecule, structural element, or enzyme!


Protein Synthesis Process

 Protein synthesis process animation link:
https://www.youtube.com/watch?v=gG7uCskUOrA
Introduction to Biological Databases
 One of the hallmarks of modern genomic research is the generation of
enormous amounts of raw sequence data
 As the volume of genomic data grows, sophisticated computational
methodologies are required to manage the data deluge
 The very first challenge in the genomics era is
 to store and handle the staggering volume of information through
the establishment and use of computer databases
 The development of databases to handle the vast amount of molecular
biological data is thus a fundamental task of bioinformatics
 There is a need to learn the basic concepts related to databases, in
particular,
 The types, designs, and architectures of biological databases
 Emphasis is on retrieving data from the main biological databases such
as GenBank
Databases and Types
 A database is a computerized archive used to store and organize data in
such a way that
 information can be retrieved easily via a variety of search criteria
 Databases are composed of computer hardware and software for data
management
 The objective of the development of a database is to organize data in a
set of structured records to enable easy retrieval of information
Database Types:
 Originally, databases all used a flat file format, which is a long text file
that contains
 many entries separated by a delimiter, a special character such as a
vertical bar (|)
 Within each entry are a number of fields separated by tabs or commas
 The text file can be considered a single table
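A small Python sketch of the flat file layout described above: entries separated by a vertical bar, fields separated by commas, and a search that has to scan every entry. The identifiers and field contents are made up for illustration and do not follow any real database format.

```python
# Hypothetical flat file: entries separated by '|', fields separated by commas.
FLAT_FILE = (
    "X001,insulin,Homo sapiens|"
    "X002,hemoglobin,Mus musculus|"
    "X003,actin,Homo sapiens"
)


def parse_flat_file(text):
    """Split the text into entries, then each entry into its fields."""
    return [entry.split(",") for entry in text.split("|") if entry]


def search(records, field_index, value):
    """Linear scan: every record is examined, whatever the query."""
    return [record for record in records if record[field_index] == value]


records = parse_flat_file(FLAT_FILE)
print(search(records, 2, "Homo sapiens"))
# [['X001', 'insulin', 'Homo sapiens'], ['X003', 'actin', 'Homo sapiens']]
```

Because there is no index, the cost of each search grows with the size of the whole file.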
Databases and Types
 Thus, to search a flat file for a particular piece of information,
 A computer has to read through the entire file, an obviously
inefficient process
 Searches through such files often cause crashes of the entire computer
system because of the memory-intensive nature of the operation
 To facilitate the access and retrieval of data, sophisticated computer
software programs
 for organizing, searching, and accessing data have been developed
 They are called database management systems (DBMS)
 These systems contain not only raw data records but also operational
instructions to help identify hidden connections among data records
 The purpose of establishing a data structure is
 for easy execution of the searches and to combine different records
to form final search reports
Databases and Types
 Depending on the types of data structures, these database management
systems can be classified into two types:
 Relational database management systems and
 Object-oriented database management systems
 Consequently, databases employing these management systems are
known as
 relational databases or
 object-oriented databases, respectively
Relational Databases:
 Instead of using a single table as in a flat file database, relational
databases use a set of tables to organize data
 Each table, also called a relation, is made up of columns and rows
 Columns represent individual fields
 Rows represent values in the fields of records


Databases and Types
 The columns in a table are indexed according to a common feature
called an attribute, so they can be cross-referenced in other tables
 To execute a query in a relational database, the system selects linked
data items from different tables and combines the information into one
report
 Therefore, specific information can be found more quickly from a
relational database than from a flat file database
 Relational databases can be created using a special programming
language called structured query language (SQL)
 The creation of this type of databases can take a great deal of planning
during the design phase
 After creation of the original database, a new data category can be
easily added
 without requiring all existing tables to be modified
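The idea of linked tables and a query that combines them can be sketched with Python's built-in `sqlite3` module. The table names, column names, and rows below are hypothetical and chosen only to illustrate cross-referencing through a shared attribute; they are not the schema of any real biological database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(
    """
    CREATE TABLE organism (
        organism_id INTEGER PRIMARY KEY,
        name        TEXT
    );
    CREATE TABLE sequence_entry (
        entry_id    INTEGER PRIMARY KEY,
        gene        TEXT,
        organism_id INTEGER REFERENCES organism(organism_id)
    );
    INSERT INTO organism VALUES (1, 'Homo sapiens');
    INSERT INTO organism VALUES (2, 'Mus musculus');
    INSERT INTO sequence_entry VALUES (1, 'insulin', 1);
    INSERT INTO sequence_entry VALUES (2, 'hemoglobin', 2);
    INSERT INTO sequence_entry VALUES (3, 'actin', 1);
    """
)

# The query links rows from both tables through the shared organism_id attribute.
rows = conn.execute(
    """
    SELECT s.gene, o.name
    FROM sequence_entry AS s
    JOIN organism AS o ON s.organism_id = o.organism_id
    WHERE o.name = ?
    ORDER BY s.gene
    """,
    ("Homo sapiens",),
).fetchall()
print(rows)  # [('actin', 'Homo sapiens'), ('insulin', 'Homo sapiens')]

# A new data category is a new table; the existing tables stay unchanged.
conn.execute("CREATE TABLE publication (entry_id INTEGER, title TEXT)")
```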



Databases and Types
 The subsequent database searching and data gathering for reports are
relatively straightforward
Problems with relational databases:
 The tables used do not describe complex hierarchical relationships
between data items
Object-Oriented Databases:
 To overcome the problem, object-oriented databases have been
developed that store data as objects
 In an object-oriented programming language, an object can be
considered as a unit that
 combines data and mathematical routines that act on the data
 The database is structured such that the objects are linked by a set of
pointers
 defining predetermined relationships between the objects
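As a rough illustration of objects that bundle data with routines and are linked by references, here is a Python sketch. The class names, attributes, and example values are hypothetical; an actual object-oriented database adds persistence and query facilities that plain Python objects do not provide.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    name: str
    sequence: str

    def length(self) -> int:
        """A routine that acts on the object's own data."""
        return len(self.sequence)


@dataclass
class Gene:
    name: str
    products: list = field(default_factory=list)  # references to Protein objects


@dataclass
class Organism:
    name: str
    genes: list = field(default_factory=list)  # references to Gene objects


def proteins_of(organism: Organism) -> list:
    """Search by following the references: organism -> genes -> proteins."""
    return [protein.name for gene in organism.genes for protein in gene.products]


# Hypothetical example data
insulin = Protein("insulin", "MALWMRLLPL")
ins_gene = Gene("INS", products=[insulin])
human = Organism("Homo sapiens", genes=[ins_gene])

print(proteins_of(human))  # ['insulin']
print(insulin.length())    # 10
```

The hierarchy (organism, gene, protein) is expressed directly by the references, which is the flexibility the slides attribute to this design.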
Databases and Types
 Searching the database involves navigating through the objects with the
aid of the pointers linking different objects
 Programming languages like C++ are used to create OODbs
 The object-oriented database system is more flexible; data can be
structured based on hierarchical relationships
 However, this type of database system lacks the rigorous mathematical
foundation of the relational databases
 There is also a risk that some of the relationships between objects
may be misrepresented
 Some current databases have therefore incorporated features of
 both types of database programming, creating the object–relational
database management system


Classification Scheme of Biological Databases

 Biological databases can be broadly classified into sequence, structure,
and pathway databases
 Sequence databases cover nucleic acid and protein sequences, whereas
structure databases hold mainly protein structures (the PDB also archives
nucleic acid structures)
Biological Databases
 Current biological databases use all three types of database structures:
 Flat files,
 Relational, and
 Object oriented
 Despite the obvious drawbacks of using flat files in database
management, many biological databases still use this format
 The justification for this is that this system involves
 Minimum amount of database design and
 The search output can be easily understood by working biologists
 Based on their contents, biological databases can be roughly divided
into three categories:
 primary databases,
 secondary databases, and
 specialized databases


Biological Databases
 Primary databases contain original biological data
 They are archives of raw sequence or structural data submitted by the
scientific community
 GenBank and the Protein Data Bank (PDB) are examples of primary databases
 Secondary databases contain computationally processed or manually
curated information, based on
 original information from primary databases
 Translated protein sequence databases containing functional annotation
belong to this category
Examples: SWISS-PROT and Protein Information Resource (PIR)
 Specialized databases are those that cater to a particular research
interest
Example: FlyBase, the HIV sequence database, and the Ribosomal Database
Project are databases that specialize in a particular organism or a
particular type of data
Biological Databases: Primary
Primary Databases:
There are three major public sequence databases that store raw nucleic
acid sequence data produced and submitted by researchers worldwide:
 GenBank,
 European Molecular Biology Laboratory (EMBL) database, and
 DNA Data Bank of Japan (DDBJ)
Most of the data in the databases are contributed directly by authors with
a minimal level of annotation
A small number of sequences, especially those published in the 1980s,
were entered manually from published literature by database management
staff
Presently, sequence submission to GenBank, EMBL, or DDBJ is a
precondition for publication in most scientific journals, to ensure that the
fundamental molecular data are made freely available
Biological Databases: Primary
 These three public databases closely collaborate and exchange new
data daily
 They together constitute the International Nucleotide Sequence
Database Collaboration
 This means that by connecting to any one of the three databases, one
should have access to the same nucleotide sequence data
 Although the three databases all contain the same sets of raw data,
 Each of the individual databases has a slightly different kind of
format to represent the data
 Fortunately, for the 3D structures of biological macromolecules, there is
only one centralized database, the PDB
 This database archives atomic coordinates of macromolecules (both
proteins and nucleic acids) determined
 by X-ray crystallography and NMR spectroscopy


[Figure: Protein Structures]
Biological Databases: Secondary
 The PDB uses a flat file format to represent protein name, authors,
experimental details, secondary structure, cofactors, and atomic
coordinates
 The web interface of PDB also provides viewing tools for simple image
manipulation
Example: NM_031959.3
Secondary Databases:
Need for Secondary Databases:
 Sequence annotation information in the primary database is often
minimal
 To turn the raw sequence information into more sophisticated
biological knowledge, much
 post-processing of the sequence information is needed


Biological Databases: Secondary
 This creates the need for secondary databases, which contain
 Computationally processed sequence information derived from the
primary databases
 The amount of computational processing work varies greatly among the
secondary databases
 Some are simple archives of translated sequence data from
identified open reading frames in DNA, whereas
 Others provide additional annotation and information related to
higher levels of information regarding structure and functions
 A prominent example of secondary databases is SWISS-PROT
 Provides detailed sequence annotation that includes
 Structure, the description of the function of a protein, and
 Protein family assignment
 Post-translational modifications, variants, etc.
Biological Databases: Secondary
Open reading frames:
 An open reading frame, as related to genomics, is a portion of a
DNA sequence that, read in a given frame, contains no stop codon;
it typically runs from a start codon to the next in-frame stop codon
 A metabolic pathway can be defined as a set of actions or
interactions between genes and their products that results in
 The formation or change of some component of the system,

essential for the correct functioning of a biological system
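The open reading frame definition on this slide can be made concrete with a short scan for ATG-to-stop stretches. This is a simplified Python sketch under stated assumptions: only the three forward frames are searched, the stop codons are those of the standard genetic code, and the function name, the `min_codons` threshold, and the example sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop stretch in the forward frames.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # close it and keep scanning this frame
    return orfs


print(find_orfs("CCATGGCCTTTTAAGG"))  # [(2, 14, 'ATGGCCTTTTAA')]
```

A fuller search would also scan the reverse complement strand, giving six reading frames in total.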


Biological Databases: Secondary
 The sequence data in SWISS-PROT are mainly derived from TrEMBL, a database of
translated nucleic acid sequences stored in the EMBL database
 The annotation of each entry is carefully curated by human experts and
thus is of good quality
 The protein annotation includes
 Function, domain structure, catalytic sites,
 Cofactor binding, post-translational modification,
 Metabolic pathway information, disease association, and
 Similarity with other sequences
 Other features such as
 Very low redundancy and high level of integration with other primary
and secondary databases
 Make SWISS-PROT very popular among biologists
Biological Databases: Secondary
 A recent effort to combine SWISS-PROT, TrEMBL, and PIR led to the
creation of the UniProt database
 There are also secondary databases that relate to protein family
classification according to functions or structures
 The Pfam and Blocks databases contain aligned protein sequence
information as well as derived motifs and patterns, which can be used
 for classification of protein families and inference of protein functions
 The DALI database is a secondary database of protein structures that is
 Vital for protein structure classification and
 Threading analysis to identify distant evolutionary relationships among
proteins


Biological Databases: Specialized
Specialized Databases
 Normally serve a specific research community or focus on a particular
organism
 The content of these databases may be sequences or other types of
information
 The sequences in these databases may overlap with a primary database,
but may also have new data submitted directly by authors
 Because they are often curated by experts in the field,
they may have unique organizations and additional annotations
associated with the sequences
 Many genome databases that are taxon-specific fall within this
category
Examples include FlyBase, WormBase, AceDB, and TAIR


Information Retrieval from Biological Databases
 A major goal in developing databases is to provide efficient and
user-friendly access to the data stored
 There are a number of retrieval systems for biological data
 The most popular retrieval systems for biological databases are
 Entrez and
 Sequence Retrieval System (SRS)
 These provide access to multiple databases for retrieval of
integrated search results
 The data include
 Annotated genetic sequence information, structural information,
 As well as citations and abstracts, full papers, and taxonomic data


Information Retrieval from Biological Databases
Entrez:
 The NCBI developed and maintains Entrez, a biological database retrieval
system
 It is a gateway that allows text-based searches for a wide variety of data
 The key feature of Entrez is its ability to integrate information, which
comes from cross-referencing between NCBI databases based on preexisting
and logical relationships between individual entries
 This is highly convenient: users do not have to visit multiple databases
located in disparate places
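As a minimal sketch of a text-based Entrez query, the Python snippet below only builds a search URL for NCBI's E-utilities, the programmatic interface to Entrez; it does not send any request. The helper name `build_esearch_url`, the choice of the `nucleotide` database, and the example query are illustrative assumptions, not part of the lecture material.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=5):
    """Return an ESearch URL for a text query against one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Hypothetical query: restrict by organism and by a word in the record title.
url = build_esearch_url("nucleotide", "Homo sapiens[Organism] AND insulin[Title]")
print(url)
```

Fetching the printed URL with a browser or any HTTP client would return the identifiers of matching records, which can then be retrieved in a second step.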



Information Retrieval from Biological Databases
GenBank
 GenBank is the most complete collection of annotated nucleic acid
sequence data for almost every organism
 The content includes genomic DNA, mRNA, cDNA, ESTs, high-throughput raw
sequence data, and sequence polymorphisms
 There is also a GenPept database for protein sequences, the majority of
which are conceptual translations from DNA sequences, although a small
number of the amino acid sequences are derived using peptide sequencing
techniques
 There are two ways to search for sequences in GenBank
 One is using text-based keywords, similar to a PubMed search
 The other is using molecular sequences to search by sequence similarity
using BLAST
Biological Databases: Characteristics
 The contents
 The ontology: the list of valid terms and their definitions
 The logical structure, or the expression of the inter-relationships among
the data, called schema
 The format of the data
 The routes for selective retrieval of data, and presentation of results, or
passing them on to a program for analysis
 Links to other resources: other databases, references to original
publications of data, tutorial background etc.



GenBank Sequence Format
 To search GenBank effectively using the text-based method requires an
understanding of the GenBank sequence format
 GenBank is a relational database
 However, the search output for sequence files is produced as flat files
for easy reading
 The resulting flat files contain three sections: Header, Features, and
Sequence entry
 There are many fields in the Header and Features sections
 Each field has a unique identifier for easy indexing by computer
software
 Understanding the structure of the GenBank files helps in designing
effective search strategies
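The three-section layout described above can be sketched in a few lines of Python. The record below is a made-up toy entry (its accession, organism, and sequence are hypothetical), and `split_flat_file` is an illustrative helper that assumes the Features section starts at the `FEATURES` keyword and the sequence section at `ORIGIN`.

```python
# Made-up toy record for illustration only; not a real GenBank entry.
SAMPLE_RECORD = """\
LOCUS       X00000                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy sequence for illustration only.
ACCESSION   X00000
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="synthetic construct"
ORIGIN
        1 atggcgtacg ttagcatgca ctga
//
"""


def split_flat_file(text):
    """Split a flat file into header, features, and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
        elif line.startswith("//"):
            break  # end-of-record marker
        sections[current].append(line)
    return sections


sections = split_flat_file(SAMPLE_RECORD)
for name, lines in sections.items():
    print(name, len(lines))
```

Because each field starts with its keyword in the first column, a text search can be narrowed to one section or one field instead of the whole record.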



GenBank Sequence Format
 The Header section describes the origin of the sequence, identification
of the organism, and unique identifiers associated with the record
 The top line of the Header section is the LOCUS, which contains a unique
database identifier for a sequence location in the database
 The identifier is followed by sequence length and molecule type (e.g.,
DNA or RNA)
 This is followed by a three-letter code for GenBank divisions
 There are 18 divisions in total, which were set up simply based on
convenience of data storage; examples include
 PLN for plant, fungal, and algal sequences
 PRI for primate sequences
 MAM for non-primate mammalian sequences
 BCT for bacterial sequences
 EST for EST sequences
 Next to the division is the date when the record was last modified
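The order of items on the LOCUS line can be illustrated with a small parser. This is a sketch for nucleotide records only: the example line is hypothetical, `parse_locus` is an illustrative name, and the `linear` topology token is an assumption about the record layout that the slides do not mention. Reading the division and date from the end of the line keeps the sketch working whether or not that token is present.

```python
def parse_locus(line):
    """Split a nucleotide LOCUS line into its whitespace-separated parts."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError("not a recognizable LOCUS line")
    return {
        "name": tokens[1],            # database identifier
        "length": int(tokens[2]),     # sequence length
        "unit": tokens[3],            # "bp" for nucleotide records
        "molecule_type": tokens[4],   # e.g. DNA or RNA
        "division": tokens[-2],       # three-letter division code
        "modified": tokens[-1],       # date of last modification
    }


# Hypothetical LOCUS line, not taken from a real record.
locus = parse_locus(
    "LOCUS       X00000                  24 bp    DNA     linear   SYN 01-JAN-2000"
)
print(locus)
```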
GenBank Sequence Format
 DEFINITION provides the summary information including the name of the
sequence, the gene/protein name or some description of the sequence's
function, the name and taxonomy of the source organism if known, and
whether the sequence is complete or partial
 The accession number for the sequence is a unique identifier assigned to
a piece of DNA when it is first submitted to GenBank, and it is
permanently associated with that sequence


GenBank Sequence Format
 An accession number contains letters followed by digits, such as a single
letter followed by five digits (e.g., U12345) or two letters followed by
six digits (e.g., AF123456)
 For a nucleotide sequence that has been translated into a protein
sequence, a new accession number is given
 In addition to the accession number, there is also a version number and
a GenInfo Identifier (gi) number, which is a sequence identification number
 A translated protein sequence also has a different gi number from the
DNA sequence it is derived from
 The ORGANISM field includes the source organism, with the scientific name
of the species and sometimes the tissue type
 Along with the scientific name, it contains the taxonomic classification
of the organism
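The two accession number layouts mentioned above (one letter plus five digits, two letters plus six digits) can be checked with a regular expression. The sketch below covers only those two layouts; the optional `.version` suffix is an assumption about how the version number is attached, and `parse_accession` is an illustrative name.

```python
import re

# One letter + five digits, or two letters + six digits,
# optionally followed by ".<version>".
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")


def parse_accession(text):
    """Return (accession, version) or None if the layout is not recognized."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        return None
    version = int(match.group(2)) if match.group(2) else None
    return match.group(1), version


for candidate in ["U12345", "AF123456", "AF123456.2", "U1234"]:
    print(candidate, parse_accession(candidate))
```

Other identifier layouts exist in practice, so a `None` result here only means the text does not match the two forms named on the slide.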
GenBank Sequence Format
 The REFERENCE field provides the publication citation related to the
sequence entry
 The REFERENCE part includes author and title information of the
published work
 The “JOURNAL” field includes the citation information as well as the
date of sequence submission
 The “Features” section includes annotation information about the gene
and gene product, as well as regions of biological significance reported
in the sequence, with identifiers and qualifiers
 The “Source” field provides the length of the sequence, the scientific
name of the organism, and the taxonomy identification number


GenBank Sequence Format
 The “gene” field gives information about the nucleotide coding sequence
and its name
 For DNA entries, there is a “CDS” field, which gives the boundaries of
the sequence that can be translated into amino acids
 For eukaryotic DNA, the CDS field also contains the locations of exons,
and the translated protein sequence is entered
 The third section of the flat file is the sequence itself, starting with
the label “ORIGIN”
 For DNA entries, there is a BASE COUNT report that includes the numbers
of A, G, C, and T in the sequence
 This section, for both DNA and protein sequences, ends with two forward
slashes (the “//” symbol)
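A base count like the one described above can be recomputed from the sequence section itself. The sketch below reads the lines between `ORIGIN` and `//`, drops the position numbers and spacing, and tallies A, G, C, and T. The input lines are a hypothetical toy sequence, and both function names are illustrative.

```python
from collections import Counter


def read_origin_sequence(lines):
    """Collect residues between the ORIGIN label and the closing '//'."""
    residues = []
    in_sequence = False
    for line in lines:
        if line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Keep letters only: this drops position numbers and spaces.
            residues.extend(ch for ch in line if ch.isalpha())
    return "".join(residues).upper()


def base_count(sequence):
    """Return the numbers of A, G, C, and T in a DNA sequence."""
    counts = Counter(sequence)
    return {base: counts.get(base, 0) for base in "AGCT"}


# Hypothetical sequence section of a flat file.
toy_lines = ["ORIGIN", "        1 atggcgtacg ttagcatgca ctga", "//"]
sequence = read_origin_sequence(toy_lines)
print(sequence, base_count(sequence))
```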



Alternative Sequence Formats: FASTA
 FASTA is one of the simplest and most popular sequence formats because
it contains plain sequence information that is readable by many
bioinformatics analysis programs
 It has a single definition line that begins with a right angle bracket
(>) followed by a sequence name
 Sometimes, extra information such as the gi number or comments can be
given, which are separated from the sequence name by a “|” symbol
 The extra information is considered optional and is ignored by sequence
analysis programs
 The plain sequence in standard one-letter symbols starts on the second
line
 Each line of sequence data is limited to 60 to 80 characters in width
 The drawback of this format is that much annotation information is lost
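The FASTA layout described above is simple enough to read and write by hand. The following sketch assumes the definition line holds a sequence name followed by optional `|`-separated extras, as stated on the slide; the record content is a hypothetical toy example, and `parse_fasta` and `format_fasta` are illustrative names.

```python
def parse_fasta(text):
    """Parse FASTA text into (name, extras, sequence) tuples."""
    records = []
    name, extras, chunks = None, [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, extras, "".join(chunks)))
            fields = [field.strip() for field in line[1:].split("|")]
            name, extras, chunks = fields[0], fields[1:], []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        records.append((name, extras, "".join(chunks)))
    return records


def format_fasta(name, sequence, width=60):
    """Write one record with sequence lines no longer than `width`."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + name] + lines) + "\n"


toy_text = ">toy_seq|hypothetical comment\nATGGCG\nTACG\n"
records = parse_fasta(toy_text)
print(records)
print(format_fasta("toy_seq", "A" * 70), end="")
```

Only the name, the optional extras, and the residues survive the round trip, which is the loss of annotation noted in the last bullet.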
Pitfalls of Biological Databases
 One of the problems associated with biological databases is overreliance
on sequence information and related annotations without understanding the
reliability of the information
 There are many errors in sequence databases
 All these types of errors can be passed on to other databases, causing
propagation of errors
 Most errors in nucleotide sequences are caused by sequencing errors
 Some of these errors cause frame shifts that make whole gene
identification difficult or protein translation impossible
 Sometimes, gene sequences are contaminated with sequences from
cloning vectors
 Errors are more common for sequences produced before the 1990s;
sequence quality has greatly improved since then
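The effect of a frame shift can be seen by splitting a sequence into codons before and after a single-base insertion. This is a toy sketch: the sequence and the position of the simulated error are hypothetical, and `codons` is an illustrative helper.

```python
def codons(sequence, frame=0):
    """Split a sequence into complete three-letter codons from `frame`."""
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]


original = "ATGGCGTACGTTAGC"
# Simulate a sequencing error that inserts one extra base after position 4.
with_error = original[:4] + "A" + original[4:]

print(codons(original))
print(codons(with_error))
```

Every codon downstream of the inserted base changes, and in this toy input the shifted frame happens to produce TAG, a stop codon, which illustrates why such errors can make translation of the full protein impossible.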
Pitfalls of Biological Databases
 Exceptional care should therefore be taken when dealing with older
sequences, which are more likely to contain such errors
Redundancy:
 There are also high levels of redundancy in the primary sequence
databases
 There is tremendous duplication of information in the databases, for
various reasons
 The causes of redundancy include
 Repeated submission of identical or overlapping sequences by the same
or different authors
 Revision of annotations
 Dumping of expressed sequence tag (EST) data, and
 Poor database management that fails to detect the redundancy
 This makes some primary databases excessively large and unwieldy for
information retrieval


Pitfalls of Biological Databases
The steps to reduce the redundancy
 The NCBI has now created a non-redundant database, called RefSeq, in
which
 Identical sequences from the same organism and associated sequence
fragments are merged into a single entry
 Protein sequences derived from the same DNA sequences are explicitly
linked as related entries
 Sequence variants from the same organism with very minor differences,
which may well be caused by sequencing errors, are treated as distinctly
related entries
 This carefully curated database can be considered a secondary database
 The SWISS-PROT database also has minimal redundancy for protein
sequences compared to most other databases
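The idea of merging identical sequences from the same organism into a single entry can be illustrated with a toy grouping step. This is not RefSeq's actual curation procedure; it is a minimal sketch with hypothetical accessions and organism names, and it only catches exact duplicates, not overlapping fragments or near-identical variants.

```python
from collections import defaultdict


def merge_identical(entries):
    """Group entries that share organism and exact sequence into one record."""
    merged = defaultdict(list)
    for accession, organism, sequence in entries:
        merged[(organism, sequence.upper())].append(accession)
    return [
        {"organism": org, "sequence": seq, "accessions": accs}
        for (org, seq), accs in merged.items()
    ]


# Hypothetical submissions: the first two duplicate each other.
entries = [
    ("X00001", "Organism A", "ATGC"),
    ("X00002", "Organism A", "atgc"),
    ("X00003", "Organism B", "ATGC"),
]
merged_entries = merge_identical(entries)
for record in merged_entries:
    print(record)
```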


Pitfalls of Biological Databases
 Another way to address the redundancy problem is to create
sequence-cluster databases such as UniGene that combine EST sequences
that are derived from the same gene
 The other common problem is erroneous annotations
 The same gene sequence is found under different names resulting in
multiple entries and confusion about the data
 Conversely, unrelated genes bearing the same name are found in the
databases
 To alleviate the problem of naming genes, re-annotation of genes and
proteins using a set of common controlled vocabulary to describe a
gene or protein is necessary
 The goal is to provide a consistent and unambiguous naming system for
all genes and proteins
 A prominent example of such systems is Gene Ontology
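How a controlled vocabulary removes naming ambiguity can be shown with a toy lookup table. The sketch below is not how Gene Ontology itself is organized; every gene name and approved symbol in it is hypothetical, and the function names are illustrative.

```python
# Hypothetical synonym table: each free-text name maps to one approved symbol.
APPROVED = {
    "geneA": "GENE_A",
    "gene-a": "GENE_A",
    "gA1": "GENE_A",
    "geneB": "GENE_B",
}


def normalize(name):
    """Return the approved symbol, or None if the name is not in the vocabulary."""
    return APPROVED.get(name)


def group_by_approved(names):
    """Group submitted names under their approved symbol."""
    groups = {}
    for name in names:
        groups.setdefault(normalize(name), []).append(name)
    return groups


groups = group_by_approved(["geneA", "gA1", "geneB", "mystery"])
print(groups)
```

Names that end up under `None` are the ones that still need re-annotation before they can be compared with other entries.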
Thank You
