Lecture 3 Database
Bioinformatics Databases
Getting Knowledge from Information
Databases in Bioinformatics
• Why?
• The different types of databases
• Database language: identifiers
• Nucleotide sequence databases
• Protein sequence databases
• 3D structure databases
• Ontologies
Biological databases: why do we need them?
• MEDLINE
• PUBMED
• EMBASE
• BIOSIS
• ZOOLOGICAL
• CAB
• AGRICOLA, etc.
A Nucleic Acids Research article lists 1512 public databases
(up from 719 in 2005 and 1230 in 2010).
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
Taxa represented in GenBank (at NCBI)
What does the genome data look like?
1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga
61 cccgccaccc acctcccggg gccgcgcagc ggcctctcgt ctactgccac catgaccgcc
121 aacggcacag ccgaggcggt gcagatccag ttcggcctca tcaactgcgg caacaagtac
181 ctgacggccg aggcgttcgg gttcaaggtg aacgcgtccg ccagcagcct gaagaagaag
241 cagatctgga cgctggagca gccccctgac gaggcgggca gcgcggccgt gtgcctgcgc
301 agccacctgg gccgctacct ggcggcggac aaggacggca acgtgacctg cgagcgcgag
361 gtgcccggtc ccgactgccg tttcctcatc gtggcgcacg acgacggtcg ctggtcgctg
421 cagtccgagg cgcaccggcg ctacttcggc ggcaccgagg accgcctgtc ctgcttcgcg
481 cagacggtgt cccccgccga gaagtggagc gtgcacatcg ccatgcaccc tcaggtcaac
541 atctacagtg tcacccgtaa gcgctacgcg cacctgagcg cgcggccggc cgacgagatc
601 gccgtggacc gcgacgtgcc ctggggcgtc gactcgctca tcaccctcgc cttccaggac
661 cagcgctaca gcgtgcagac cgccgaccac cgcttcctgc gccacgacgg gcgcctggtg
721 gcgcgccccg agccggccac tggctacacg ctggagttcc gctccggcaa ggtggccttc
781 cgcgactgcg agggccgtta cctggcgccg tcggggccca gcggcacgct caaggcgggc
841 aaggccacca aggtgggcaa ggacgagctc tttgctctgg agcagagctg cgcccaggtc
901 gtgctgcagg cggccaacga gaggaacgtg tccacgcgcc agggtatgga cctgtctgcc
961 aatcaggacg aggagaccga ccaggagacc ttccagctgg agatcgaccg cgacaccaaa
...
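The record above is laid out in the GenBank style: each line starts with the position of its first base, followed by blocks of ten bases. As a minimal sketch (the function names are illustrative, and only the first two lines of the record are copied in), such lines can be joined into one continuous sequence like this:

```python
def parse_origin_lines(lines):
    """Join GenBank-style numbered sequence lines into one uppercase string."""
    pieces = []
    length = 0
    for line in lines:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip blank lines and markers such as "..."
        offset = int(fields[0])
        # The leading number must be the 1-based position of the next base.
        if offset != length + 1:
            raise ValueError(f"expected position {length + 1}, found {offset}")
        chunk = "".join(fields[1:]).upper()
        pieces.append(chunk)
        length += len(chunk)
    return "".join(pieces)


def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


# First two lines of the record shown above, plus the truncation marker.
RECORD_LINES = [
    "1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga",
    "61 cccgccaccc acctcccggg gccgcgcagc ggcctctcgt ctactgccac catgaccgcc",
    "...",
]

dna = parse_origin_lines(RECORD_LINES)
print(len(dna), round(gc_fraction(dna), 2))
```

The position check is a cheap way to notice a dropped or duplicated line when a record has been copied by hand.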
https://www.dnalc.org/view/15891-DNA-sequencing-game-interactive-2D-animation.html
Assignment-
A. PAPER SEARCH?
B. TAXONOMY?
SEQUENCING
Nucleotide Sequence Database
NCBI Database Resources
www.ncbi.nlm.nih.gov
National Center for Biotechnology Information (NCBI): organization
Identifiers and Accession numbers
• Naming conflicts
– One gene, many acronyms
– Many genes, shared acronym
– Spelling errors
– Cultural differences (US, UK)
– Representation of non-ASCII characters
Also known as
ACTR; AIB1; RAC3; SRC3; pCIP; AIB-1; CTG26; SRC-1; CAGH16;
KAT13B; TNRC14; TNRC16; TRAM-1; MGC141848
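The alias list above shows why databases rely on stable identifiers rather than names. A minimal sketch of an alias lookup follows; the identifier "EXAMPLE_ID_1" and the normalization rule are assumptions made for illustration, not the scheme of any particular database:

```python
import re

# Alias list copied from the slide above.
ALIASES = (
    "ACTR; AIB1; RAC3; SRC3; pCIP; AIB-1; CTG26; SRC-1; CAGH16; "
    "KAT13B; TNRC14; TNRC16; TRAM-1; MGC141848"
)


def normalize(symbol):
    """Uppercase and drop punctuation so spelling variants collapse."""
    return re.sub(r"[^A-Z0-9]", "", symbol.upper())


def build_alias_index(alias_text, stable_id):
    """Map every normalized alias to one stable identifier."""
    index = {}
    for alias in alias_text.split(";"):
        alias = alias.strip()
        if alias:
            index[normalize(alias)] = stable_id
    return index


# "EXAMPLE_ID_1" is a hypothetical stable identifier.
alias_index = build_alias_index(ALIASES, "EXAMPLE_ID_1")

print(alias_index.get(normalize("aib-1")))
print(alias_index.get(normalize("pcip")))
```

Normalization only helps with the "one gene, many acronyms" case. When several genes share an acronym, the name alone cannot be resolved, which is the reason accession numbers exist.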
Many databases available (NAR summary list):
• Comparative Genomics
• Gene Expression
• Gene Identification & structure
• Genetic Maps
• Genomic Databases
• Intermolecular Interactions
• Metabolic Pathways and Cellular Regulation
• Mutation Databases
• Pathology
• Protein Databases
• Protein Sequence Motifs
• Proteome Resources
• Retrieval Systems & Database Structure
• RNA Sequences
• Structure
• Transgenics
• Varied Biomedical Content
Types of data and examples of databases
Genomic Databases
Intermolecular Interactions
Mutation Databases
Protein Databases
Protein Databases: Swiss-Prot
• Extremely well (manually) curated protein database
• Link to BLAST
• Powerful cross-references
• Est. 1986
• Maintained by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library
Proteome Resources: Proteome BKL
Structural Database
Varied Biomedical Content
Metabolic Pathways and Cellular Regulation
National Center for Biotechnology Information (NCBI):
A network of linked resources
• Database access: GenBank, structure, function, SNP, taxonomy...
• Literature (PubMed)
• Whole genomes
• Tools
• Contacts & research information
• FTP
NCBI resources
• Nucleotide databases
• Protein databases
• Structure databases
• Taxonomy databases
• Genome databases
• Expression databases
NCBI
Title Bar
UniGene database: clusters of EST sequences
B&FG 3e
Fig. 2-5
How to Access information?
NCBI includes databases (such as GenBank) that
contain information on DNA, RNA, or protein
sequences.
You may want to acquire information beginning with a
query such as the name of a protein of interest, or the
raw nucleotides comprising a DNA sequence of
interest.
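Besides the web pages, NCBI offers programmatic access through its Entrez E-utilities. The sketch below only builds the request URLs and does not contact NCBI. The search term matches the beta globin example used in this lecture, and the accession NP_000509.1 (human hemoglobin subunit beta) is given as an assumed example identifier:

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=5):
    """URL for a text search (for example a protein or gene name)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_fasta_url(db, accession):
    """URL that returns one record as plain-text FASTA."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Start from a name: search the Gene database for "beta globin".
search_url = esearch_url("gene", "beta globin")

# Start from an identifier: fetch a protein record in FASTA format.
# NP_000509.1 is an assumed example accession, not taken from the slides.
fetch_url = efetch_fasta_url("protein", "NP_000509.1")

print(search_url)
print(fetch_url)
```

Opening either URL in a browser, or passing it to `urllib.request.urlopen`, would perform the actual query.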
B&FG 3e
NCBI Gene: example of query for beta globin
B&FG 3e
Fig. 2.8
NCBI Gene: example of query for beta globin
B&FG 3e
Fig. 2.9
NCBI Protein: hemoglobin subunit beta
B&FG 3e
Fig. 2.10
NCBI Protein: hemoglobin subunit beta
in the FASTA format
B&FG 3e
Fig. 2.11
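A FASTA record is a header line starting with ">" followed by one or more lines of sequence. As a minimal sketch, assuming a made-up two-record input (the identifiers and sequences below are toy data, not the real hemoglobin entry), a parser can look like this:

```python
def parse_fasta(text):
    """Return a list of records, each with id, description and sequence."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def _make_record(header, chunks):
    # The identifier is the text up to the first space in the header.
    identifier, _, description = header.partition(" ")
    return {
        "id": identifier,
        "description": description,
        "sequence": "".join(chunks),
    }


# Toy input: identifiers and sequences are invented for illustration.
TOY_FASTA = """>toy_id_1 made-up protein record
MKTAYIAKQR
QISFVKSHFS
>toy_id_2 second made-up record
GATTACA
"""

records = parse_fasta(TOY_FASTA)
for record in records:
    print(record["id"], len(record["sequence"]))
```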
You Better Start…
preparing for your
course and survival in
the exam
Assignment-1a
1.
bioinformatics tools…….
Genome browsers
• Versatile tools to visualize chromosomal positions (typically on x-axis) with annotation tracks (typically on y-axis).
• Useful to explore data related to some chromosomal feature of interest such as a gene.
• Prominent browsers are at Ensembl, UCSC, and NCBI.
• Many hundreds of specialized genome browsers are available, some for particular organisms or molecule types.
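Conceptually, a browser view is a query: given a chromosomal region, return every annotation on every track that overlaps it. A minimal sketch follows, assuming hypothetical feature names and coordinates (none of them are real annotations) and 1-based inclusive positions:

```python
from typing import NamedTuple


class Feature(NamedTuple):
    track: str
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def parse_region(region):
    """Split a region such as 'chr1:4,000-8,500' into chrom, start, end."""
    chrom, span = region.split(":")
    start_text, end_text = span.replace(",", "").split("-")
    return chrom, int(start_text), int(end_text)


def features_in_region(features, region):
    """Return the features that overlap the region, in input order."""
    chrom, start, end = parse_region(region)
    return [
        f for f in features
        if f.chrom == chrom and f.start <= end and f.end >= start
    ]


# Hypothetical annotations on two tracks.
FEATURES = [
    Feature("genes", "geneA", "chr1", 1_000, 5_000),
    Feature("genes", "geneB", "chr1", 8_000, 9_500),
    Feature("variants", "snp1", "chr1", 4_200, 4_200),
    Feature("genes", "geneC", "chr2", 1_000, 5_000),
]

hits = features_in_region(FEATURES, "chr1:4,000-8,500")
for hit in hits:
    print(hit.track, hit.name, hit.start, hit.end)
```

A linear scan is enough for a handful of features. Real browsers index their tracks so that the same overlap query stays fast across whole genomes.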
Genome Browsers: UCSC