
Lecture 3 Database

Uploaded by

Zahra.

Lecture 3

Bioinformatics Databases
Getting Knowledge from Information
Databases in Bioinformatics

• Why?
• The different types of databases
• Database language: identifiers
• Nucleotide sequence databases
• Protein sequence databases
• 3D structure databases
• Ontologies
Biological databases: Why do we need them?

• Make biological data available to scientists
– Consolidation of data (gather data from different sources)
– Provide access to large datasets that cannot be published
explicitly (genome, …)
• Make biological data available in computer-readable format
– Make data accessible for automated analysis

Bioinformatics: “a collective term for data compilation,
organisation, analysis and dissemination”
The different types of Databases in Bioinformatics
1) Data:

Type of data:
• nucleotide sequences
• protein sequences
• 3D structures
• gene expression data
• metabolic pathways
• …

Data entry and quality control:
• data deposited directly
• curators add and update data
• treatment of erroneous data: removed, or marked
• error checking
• consistency, updates
• …

Primary, or derived data:
• Primary databases: direct experimental results
• Secondary databases: result of analysis on primary databases
• Consolidation of many databases
• …
Growth in Available Bioinformatics Databases
Different Types of Databases in Bioinformatics
• Bibliographic Database
• Taxonomic Database
• Nucleotide Database
• Protein Database
• Microarray Database
• Many more…
Bibliographic Database

• MEDLINE
• PUBMED
• EMBASE
• BIOSIS
• ZOOLOGICAL
• CAB
• AGRICOLA, etc.
Nucleic Acids Research article lists 1512 public databases
(up from 719 in 2005, 1230 in 2010).

Contains more than 180 databases today.

First DB issue: April 1991, containing 18 articles.
Taxonomic Database
How to look at the number of taxa (e.g. species) in
GenBank; the most sequenced organisms; types of data;
and look at a particular example, the UniGene database
of expressed sequence tags (ESTs).

Try this link:

http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
Taxa represented in GenBank (at NCBI)
What does the genome data look like?
1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga
61 cccgccaccc acctcccggg gccgcgcagc ggcctctcgt ctactgccac catgaccgcc
121 aacggcacag ccgaggcggt gcagatccag ttcggcctca tcaactgcgg caacaagtac
181 ctgacggccg aggcgttcgg gttcaaggtg aacgcgtccg ccagcagcct gaagaagaag
241 cagatctgga cgctggagca gccccctgac gaggcgggca gcgcggccgt gtgcctgcgc
301 agccacctgg gccgctacct ggcggcggac aaggacggca acgtgacctg cgagcgcgag
361 gtgcccggtc ccgactgccg tttcctcatc gtggcgcacg acgacggtcg ctggtcgctg
421 cagtccgagg cgcaccggcg ctacttcggc ggcaccgagg accgcctgtc ctgcttcgcg
481 cagacggtgt cccccgccga gaagtggagc gtgcacatcg ccatgcaccc tcaggtcaac
541 atctacagtg tcacccgtaa gcgctacgcg cacctgagcg cgcggccggc cgacgagatc
601 gccgtggacc gcgacgtgcc ctggggcgtc gactcgctca tcaccctcgc cttccaggac
661 cagcgctaca gcgtgcagac cgccgaccac cgcttcctgc gccacgacgg gcgcctggtg
721 gcgcgccccg agccggccac tggctacacg ctggagttcc gctccggcaa ggtggccttc
781 cgcgactgcg agggccgtta cctggcgccg tcggggccca gcggcacgct caaggcgggc
841 aaggccacca aggtgggcaa ggacgagctc tttgctctgg agcagagctg cgcccaggtc
901 gtgctgcagg cggccaacga gaggaacgtg tccacgcgcc agggtatgga cctgtctgcc
961 aatcaggacg aggagaccga ccaggagacc ttccagctgg agatcgaccg cgacaccaaa
...

Multiply this by eighteen million.

What can you infer from these letters?
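
One property that can be read directly from the letters is base composition. The sketch below is only an illustration, not part of the lecture: it strips the position numbers and spaces from a GenBank-style line and computes the fraction of G and C bases. The function names `clean_genbank_sequence` and `gc_content` are hypothetical helpers; the input is the first line of the sequence shown above.

```python
def clean_genbank_sequence(text: str) -> str:
    """Drop position numbers and whitespace; return upper-case bases."""
    return "".join(ch for ch in text if ch.isalpha()).upper()


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence if base in "GC")
    return gc / len(sequence)


# First line of the record shown above, copied with its position number.
first_line = "1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga"
seq = clean_genbank_sequence(first_line)
print(len(seq), round(gc_content(seq), 2))
```

The same two helpers could be applied to the whole record by joining all of its lines before cleaning.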


GAME

https://www.dnalc.org/view/15891-DNA-sequencing-game-interactive-2D-animation.html
Assignment-
A. PAPER SEARCH?
B. TAXONOMY?
SEQUENCING
Nucleotide Sequence Database
NCBI Database Resources

www.ncbi.nlm.nih.gov
National Center for Biotechnology Information
(NCBI): organization
Identifiers and Accession numbers

• Identifier: a string of letters and digits that is generally
“understandable”
– Example: TPIS_CHICK (triose phosphate isomerase from
chicken, Gallus gallus) in Swiss-Prot
– The identifier can change (at the curator's discretion)
• Accession code: a string of letters and digits that
uniquely identifies an entry in its database.
– The accession number for TPIS_CHICK in Swiss-Prot is
P00940
– The accession number should not change!
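
The difference between the two kinds of label can be sketched in code. This is a minimal illustration under stated assumptions: the accession pattern follows the format UniProt documents for its accession numbers, while the entry-name pattern (up to five characters, an underscore, up to five characters) is a simplification of Swiss-Prot style names. The function `classify_uniprot_label` is a hypothetical helper.

```python
import re

# Accession format as documented by UniProt (6 or 10 characters).
ACCESSION_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

# Simplified Swiss-Prot style entry name: PROTEIN_SPECIES (assumption).
ENTRY_NAME_RE = re.compile(r"^[A-Z0-9]{1,5}_[A-Z0-9]{1,5}$")


def classify_uniprot_label(label: str) -> str:
    """Return 'accession', 'identifier', or 'unknown' for a label."""
    if ACCESSION_RE.match(label):
        return "accession"
    if ENTRY_NAME_RE.match(label):
        return "identifier"
    return "unknown"


for label in ("TPIS_CHICK", "P00940"):
    print(label, "->", classify_uniprot_label(label))
```

Because identifiers may be renamed, scripts and publications are usually better off storing the accession.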
Nucleotide
record
Centralized databases store DNA sequences
Potential Errors in GenBank
• Sequence errors estimated at between 0.37 and 35
(!) errors per 1000 bases
• Recombination
• Contamination
• Annotation errors - propagated misannotations
– Transfer by similarity is problematic
– Errors not always corrected in a timely way
– Genes with varying unrelated functions depending on
context
– Functional annotation is often unsystematic
• Name-function disconnect
Potential Errors in GenBank

• Naming conflicts
– One gene, many acronyms
– Many genes, shared acronym
– Spelling errors
– Cultural differences (US, UK)
– Representation of non-ASCII characters
Also known as
ACTR; AIB1; RAC3; SRC3; pCIP; AIB-1; CTG26; SRC-1; CAGH16;
KAT13B; TNRC14; TNRC16; TRAM-1; MGC141848
Many Databases available (NAR Summary List):
• Comparative Genomics
• Gene Expression
• Gene Identification & structure
• Genetic Maps
• Genomic Databases
• Intermolecular Interactions
• Metabolic Pathways and Cellular Regulation
• Mutation Databases
• Pathology
• Protein Databases
• Protein Sequence Motifs
• Proteome Resources
• Retrieval Systems & Database Structure
• RNA Sequences
• Structure
• Transgenics
• Varied Biomedical Content
Types of data and examples of databases
Genomic Databases
Intermolecular Interactions
Mutation Databases
Protein Databases
Protein Databases: Swiss-Prot
• Extremely well (manually) curated protein database
• Link to BLAST
• Powerful cross-references
• Est. 1986
• Maintained by the Department of Medical Biochemistry of the
University of Geneva and the EMBL Data Library
Proteome Resources: Proteome BKL
Structural Database
Varied Biomedical Content
Metabolic Pathways and Cellular Regulation
National Center for Biotechnology Information (NCBI):
A network of linked resources
• Database access: GenBank, structure, function, SNP, taxonomy...
• Literature (PubMed)
• Whole genomes
• Tools
• Contacts & research information
• FTP
NCBI resources
• Nucleotide databases
• Protein databases
• Structure databases
• Taxonomy databases
• Genome databases
• Expression databases
NCBI
Title Bar
UniGene database: clusters of EST sequences

B&FG 3e
Fig. 2-5
How to Access information?
NCBI includes databases (such as GenBank) that contain
information on DNA, RNA, or protein sequences.

You may want to acquire information beginning with a query
such as the name of a protein of interest, or the raw
nucleotides comprising a DNA sequence of interest.

DNA sequences and other molecular data are tagged with
accession numbers that are used to identify a sequence
or other record relevant to molecular data.
What is an accession number?
An accession number is a label used to identify a
sequence. It is a string of letters and/or numbers
that corresponds to a molecular sequence.

Examples:

Molecule  Accession      Record type
DNA       CH471100.2     GenBank genomic DNA sequence
DNA       NC_000001.10   Genomic contig
DNA       rs121434231    dbSNP (single nucleotide polymorphism)
RNA       AI687828.1     An expressed sequence tag (1 of 184)
RNA       NM_001206696   RefSeq DNA sequence (from a transcript)
Protein   NP_006138.1    RefSeq protein
Protein   CAA18545.1     GenBank protein
Protein   O14896         SwissProt protein
Protein   1KT7           Protein Data Bank structure record
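
Several of the examples above end in a dot followed by a number (e.g. NP_006138.1); in GenBank and RefSeq records this suffix is the sequence version. As a minimal sketch, assuming that "accession.version" layout, a label can be split into its two parts as follows. `split_accession` is a hypothetical helper name.

```python
def split_accession(label: str):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None when the label
    carries no numeric version suffix.
    """
    base, sep, version = label.partition(".")
    if sep and version.isdigit():
        return base, int(version)
    return label, None


for label in ("CH471100.2", "NP_006138.1", "O14896"):
    print(label, "->", split_accession(label))
```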
B&FG 3e
NCBI’s important RefSeq project:
best representative sequences
RefSeq (accessible via the main page of NCBI)
provides an expertly curated accession number that
corresponds to the most stable, agreed-upon “reference”
version of a sequence.

RefSeq identifiers include the following formats:

Complete genome      NC_######
Complete chromosome  NC_######
Genomic contig       NT_######
mRNA (DNA format)    NM_######  e.g. NM_006744
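
The prefix conventions listed above can be turned into a small lookup. This is a sketch only: the table covers just the prefixes mentioned in this lecture (NC_, NT_, NM_, plus NP_ for RefSeq proteins), and `refseq_type` is a hypothetical helper, not an NCBI function.

```python
# Prefixes taken from the lecture; RefSeq defines others not listed here.
REFSEQ_PREFIXES = {
    "NC_": "complete genome or chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA (DNA format)",
    "NP_": "protein",
}


def refseq_type(accession: str) -> str:
    """Describe a RefSeq accession by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "prefix not listed here")


for acc in ("NM_006744", "NP_000509", "P00940"):
    print(acc, "->", refseq_type(acc))
```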
Access to sequences: Gene resource at NCBI

NCBI Gene is a great starting point: it collects key
information on each gene/protein from major databases.
It covers all major organisms.

RefSeq provides a curated, optimal accession number for
each DNA (NM_000518 for beta globin DNA corresponding
to mRNA) or protein (NP_000509).

B&FG 3e
NCBI Gene: example of query for beta globin

B&FG 3e
Fig. 2.8
NCBI Gene: example of query for beta globin

B&FG 3e
Fig. 2.9
NCBI Protein: hemoglobin subunit beta

B&FG 3e
Fig. 2.10
NCBI Protein: hemoglobin subunit beta
in the FASTA format

B&FG 3e
Fig. 2.11
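
In the FASTA format each record is a header line starting with ">" followed by one or more lines of sequence. A minimal reader might look like the sketch below. The record used here is a made-up toy example, not a real database entry, and `parse_fasta` is a hypothetical helper; libraries such as Biopython provide more complete parsers.

```python
def parse_fasta(text: str) -> dict:
    """Return {header: sequence} for every record in FASTA text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record.
    return {h: "".join(parts) for h, parts in records.items()}


# Toy record for illustration only (hypothetical, not from NCBI).
example = """>toy_record hypothetical example
ACDEFGHIK
LMNPQRSTV
"""
print(parse_fasta(example))
```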
You Better Start…
preparing for your
course and survival in
the exam
Assignment-1a
1.

2. Write a ~100 bp DNA sequence for your practice using
bioinformatics tools.
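
If you prefer not to type a practice sequence by hand, one possible approach is to generate it. The sketch below is only a suggestion and assumes a uniformly random sequence is acceptable for practice; `random_dna` is a hypothetical helper name.

```python
import random


def random_dna(length: int = 100, seed=None) -> str:
    """Return a random DNA string of the given length.

    Passing a seed makes the result reproducible.
    """
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


practice = random_dna(100, seed=1)
print(len(practice))
print(practice)
```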
Genome browsers
• Versatile tools to visualize chromosomal
positions (typically on x-axis) with annotation
tracks (typically on y-axis).
• Useful to explore data related to some
chromosomal feature of interest such as a gene.
• Prominent browsers are at Ensembl, UCSC,
and NCBI.
• Many hundreds of specialized genome
browsers are available, some for particular
organisms or molecule types.
Genome Browsers: UCSC

Choose the group (e.g. mammal), genome (e.g. human),
assembly (e.g. GRCh37 or GRCh38), position and/or
search term (e.g. hbb).

A genome build or assembly (e.g. GRCh37 or GRCh38)
refers to a fixed, agreed-upon version of a reference
genome. Assemblies are typically updated every few
years (see Chapter 15 for more information).
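
The same choices can be encoded in a browser link. The sketch below assumes the UCSC `hgTracks` URL with its `db` and `position` parameters, and uses the UCSC database names hg19 and hg38 for GRCh37 and GRCh38. `ucsc_browser_url` is a hypothetical helper, and the site may resolve a search term differently from an explicit chromosome range.

```python
from urllib.parse import urlencode

# UCSC database names for the two human assemblies mentioned above.
UCSC_DB = {"GRCh37": "hg19", "GRCh38": "hg38"}


def ucsc_browser_url(assembly: str, position: str) -> str:
    """Build a UCSC Genome Browser link for an assembly and position.

    'position' may be a search term (e.g. a gene symbol) or a
    chromosome range.
    """
    query = urlencode({"db": UCSC_DB[assembly], "position": position})
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + query


print(ucsc_browser_url("GRCh38", "hbb"))
```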
Genome Browsers: UCSC

When you enter a query such as “hbb” you may have to
specify which entry you want, such as the RefSeq version
having accession NM_000518.
Genome Browsers: UCSC

Explore the browser! Begin with a favorite gene or
region. Zoom in to base pair level, then out to full
chromosome level. Explore the many tracks you can add.
Accessing sequence data for individual genes

When you search for information about a particular
gene, make sure you know the official gene symbol
(e.g. visit http://www.genenames.org) and choose
the appropriate species.

Some searches are particularly challenging. For
example, there are thousands of histones. Use
Boolean operators to limit the search results.
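
As an illustration of limiting a search with Boolean operators, the sketch below assembles an Entrez-style query string from a term, an organism, and words to exclude. It assumes the `[Title]` and `[Organism]` field tags; the excluded word in the usage line is only an example, and `entrez_query` is a hypothetical helper that builds text without contacting NCBI.

```python
def entrez_query(term: str, organism: str, exclude=()) -> str:
    """Combine a term, an organism and exclusions with AND / NOT."""
    query = f'{term}[Title] AND "{organism}"[Organism]'
    for word in exclude:
        query += f" NOT {word}[Title]"
    return query


print(entrez_query("histone", "Homo sapiens", exclude=["deacetylase"]))
```

The resulting string can be pasted into the NCBI search box.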

When searching for HIV-1 proteins, note that there are
vast numbers of protein and DNA results
(approaching 1 million entries!) but there is only
one RefSeq accession. This highlights the
usefulness of the RefSeq project.
