Lecture 3 Database

Uploaded by

Zahra.


Lecture 3

Bioinformatics
Databases
Getting Knowledge
from Information
Databases in Bioinformatics

• Why?
• The different types of databases
• Database language: identifiers
• Nucleotide sequence databases
• Protein sequence databases
• 3D structure databases
• Ontologies
Biological databases: why do we need them?

• Make biological data available to scientists
– Consolidation of data (gather data from different sources)
– Provide access to large datasets that cannot be published explicitly (genome, …)
• Make biological data available in computer-readable format
– Make data accessible for automated analysis

Bioinformatics: “a collective term for data compilation, organisation, analysis and dissemination”
The different types of Databases in Bioinformatics
1) Data:

Type of data:

• nucleotide sequences
• protein sequences
• 3D structures
• gene expression data
• metabolic pathways
• …

Data entry and quality control:

• data deposited directly
• curators add and update data
• treatment of erroneous data: removed, or marked
• error checking
• consistency, updates
• …

Primary, or derived data:

• Primary databases: direct experimental results
• Secondary databases: result of analysis on primary databases
• Consolidation of many databases
• …
Growth in Available Bioinformatics Databases
Different Types of Databases in Bioinformatics
• Bibliographic Database
• Taxonomic Database
• Nucleotide Database
• Protein Database
• Microarray Database
• Many more…
Bibliographic Database

• MEDLINE
• PUBMED
• EMBASE
• BIOSIS
• ZOOLOGICAL
• CAB
• AGRICOLA, etc.
Nucleic Acids Research article lists 1512 public databases (up from 719 in 2005, 1230 in 2010).

Contains more than 180 databases today.

First DB issue: April 1991, containing 18 articles.
Taxonomic Database
How to look at the number of taxa (e.g. species) in GenBank; the most sequenced organisms; types of data; and look at a particular example, the UniGene database of expressed sequence tags (ESTs).

Try this link: http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
Taxa represented in GenBank (at NCBI)
What does the genome data look like?
1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga
61 cccgccaccc acctcccggg gccgcgcagc ggcctctcgt ctactgccac catgaccgcc
121 aacggcacag ccgaggcggt gcagatccag ttcggcctca tcaactgcgg caacaagtac
181 ctgacggccg aggcgttcgg gttcaaggtg aacgcgtccg ccagcagcct gaagaagaag
241 cagatctgga cgctggagca gccccctgac gaggcgggca gcgcggccgt gtgcctgcgc
301 agccacctgg gccgctacct ggcggcggac aaggacggca acgtgacctg cgagcgcgag
361 gtgcccggtc ccgactgccg tttcctcatc gtggcgcacg acgacggtcg ctggtcgctg
421 cagtccgagg cgcaccggcg ctacttcggc ggcaccgagg accgcctgtc ctgcttcgcg
481 cagacggtgt cccccgccga gaagtggagc gtgcacatcg ccatgcaccc tcaggtcaac
541 atctacagtg tcacccgtaa gcgctacgcg cacctgagcg cgcggccggc cgacgagatc
601 gccgtggacc gcgacgtgcc ctggggcgtc gactcgctca tcaccctcgc cttccaggac
661 cagcgctaca gcgtgcagac cgccgaccac cgcttcctgc gccacgacgg gcgcctggtg
721 gcgcgccccg agccggccac tggctacacg ctggagttcc gctccggcaa ggtggccttc
781 cgcgactgcg agggccgtta cctggcgccg tcggggccca gcggcacgct caaggcgggc
841 aaggccacca aggtgggcaa ggacgagctc tttgctctgg agcagagctg cgcccaggtc
901 gtgctgcagg cggccaacga gaggaacgtg tccacgcgcc agggtatgga cctgtctgcc
961 aatcaggacg aggagaccga ccaggagacc ttccagctgg agatcgaccg cgacaccaaa
...

Multiply this by eighteen million.

What can you infer from these letters?
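
One thing that can be read directly from the letters is the base composition. The sketch below takes the first two lines of the record shown above (copied verbatim), strips the position numbers and the spaces between blocks, and counts the bases. The function names are illustrative, not part of any NCBI tool.

```python
from collections import Counter

# First two lines of the record shown above, copied verbatim:
# a position number followed by six blocks of ten bases.
RECORD_LINES = """\
1 gcggagggtg cgtgcgggcc gcggcagccg aacaaaggag caggggcgcc gccgcaggga
61 cccgccaccc acctcccggg gccgcgcagc ggcctctcgt ctactgccac catgaccgcc
"""


def extract_sequence(text):
    """Drop the leading position numbers and the spaces between blocks."""
    blocks = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue  # skip blank lines and anything that is not a sequence line
        blocks.extend(parts[1:])
    return "".join(blocks).upper()


def base_composition(seq):
    """Count each base and compute the G+C fraction of the sequence."""
    counts = Counter(seq)
    gc = (counts["G"] + counts["C"]) / len(seq)
    return counts, gc


sequence = extract_sequence(RECORD_LINES)
counts, gc_fraction = base_composition(sequence)
print(len(sequence), dict(counts), round(gc_fraction, 3))
```

The same two functions work on the full record; only the input text changes.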

GAME

https://www.dnalc.org/view/15891-DNA-sequencing-game-interactive-2D-animation.html
Assignment-
A. PAPER SEARCH?
B. TAXONOMY?
SEQUENCING
Nucleotide Sequence Database
NCBI Database Resources

www.ncbi.nlm.nih.gov
National Center for Biotechnology Information
(NCBI): organization
Identifiers and Accession numbers

• Identifier: a string of letters and digits that is generally “understandable”
– Example: TPIS_CHICK (triose phosphate isomerase from chicken, Gallus gallus) in Swiss-Prot
– The identifier can change (based on the curator)
• Accession code: a string of letters and digits that uniquely identifies an entry in its database
– The accession number for TPIS_CHICK in Swiss-Prot is P00940
– The accession number should not change!
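
A practical consequence is that stored data should be keyed by the accession number, with the readable identifier kept only as a field. The sketch below assumes the PROTEIN_SPECIES layout seen in TPIS_CHICK; the helper name and the record dictionary are illustrative.

```python
def split_entry_name(entry_name):
    """Split an identifier such as 'TPIS_CHICK' into
    (protein mnemonic, species mnemonic)."""
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not of the form PROTEIN_SPECIES: {entry_name!r}")
    return protein, species


# Key records by the stable accession number; the identifier may change later.
records = {"P00940": {"identifier": "TPIS_CHICK"}}

protein, species = split_entry_name(records["P00940"]["identifier"])
print(protein, species)  # TPIS CHICK
```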
Nucleotide
record
Centralized databases store DNA sequences
Potential Errors in GenBank
• Sequence errors estimated at between 0.37 and 35
(!) errors per 1000 bases
• Recombination
• Contamination
• Annotation errors - propagated misannotations
– Transfer by similarity is problematic
– Errors not always corrected in a timely way
– Genes with varying unrelated functions depending on
context
– Functional annotation is often unsystematic
• Name-function disconnect
Potential Errors in GenBank

• Naming conflicts
– One gene, many acronyms
– Many genes, shared acronym
– Spelling errors
– Cultural differences (US, UK)
– Representation of non-ASCII characters
Also known as
ACTR; AIB1; RAC3; SRC3; pCIP; AIB-1; CTG26; SRC-1; CAGH16;
KAT13B; TNRC14; TNRC16; TRAM-1; MGC141848
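
The alias list above can be turned into a lookup table so that spelling variants resolve to one name. This is a minimal sketch: the official symbol NCOA3 is an assumption added here (the slide does not name the gene), and the normalization rule (uppercase, strip punctuation) is a deliberate simplification.

```python
import re

# Aliases exactly as listed on the slide.
ALIASES = ["ACTR", "AIB1", "RAC3", "SRC3", "pCIP", "AIB-1", "CTG26", "SRC-1",
           "CAGH16", "KAT13B", "TNRC14", "TNRC16", "TRAM-1", "MGC141848"]

# Assumption: the gene is not named on the slide; NCOA3 is supplied here.
OFFICIAL_SYMBOL = "NCOA3"


def normalize(name):
    """Uppercase and remove everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


ALIAS_INDEX = {normalize(alias): OFFICIAL_SYMBOL for alias in ALIASES}


def resolve(name):
    """Return the official symbol for a known alias, or None."""
    return ALIAS_INDEX.get(normalize(name))


print(resolve("aib1"), resolve("AIB-1"), resolve("HBB"))
```

Stripping punctuation merges "AIB1" and "AIB-1", but the same rule can wrongly merge names of different genes, so a real alias table needs curation per species.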
Many databases available (NAR summary list):
• Comparative Genomics
• Gene Expression
• Gene Identification & structure
• Genetic Maps
• Genomic Databases
• Intermolecular Interactions
• Metabolic Pathways and Cellular Regulation
• Mutation Databases
• Pathology
• Protein Databases
• Protein Sequence Motifs
• Proteome Resources
• Retrieval Systems & Database Structure
• RNA Sequences
• Structure
• Transgenics
• Varied Biomedical Content
Types of data and examples of databases
Genomic Databases
Intermolecular Interactions
Mutation Databases
Protein Databases
Protein Databases: Swiss-Prot
• Extremely well (manually) curated protein database
• Link to BLAST
• Powerful cross-references
• Est. 1986
• Maintained by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library
Proteome Resources: Proteome BKL
Structural Database
Varied Biomedical Content
Metabolic Pathways and Cellular Regulation
National Center for Biotechnology Information (NCBI): a network of linked resources
• Database access: GenBank, structure, function, SNP, taxonomy...
• Literature (PubMed)
• Whole genomes
• Tools
• Contacts & research information
• FTP
NCBI resources
• Nucleotide databases
• Protein databases
• Structure databases
• Taxonomy databases
• Genome databases
• Expression databases
NCBI
Title Bar
UniGene database: clusters of EST sequences

B&FG 3e
Fig. 2-5
How to Access information?
NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences.

You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest.

DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.
What is an accession number?
An accession number is a label used to identify a
sequence. It is a string of letters and/or numbers
that corresponds to a molecular sequence.
Examples:
CH471100.2      GenBank genomic DNA sequence
NC_000001.10    Genomic contig DNA
rs121434231     dbSNP (single nucleotide polymorphism)
AI687828.1      An expressed sequence tag (1 of 184)
NM_001206696    RNA: RefSeq DNA sequence (from a transcript)
NP_006138.1     RefSeq protein
CAA18545.1      GenBank protein
O14896          SwissProt protein
1KT7            Protein Data Bank structure record
B&FG 3e
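
The examples differ enough in shape that a program can often guess the source database from the accession alone. The sketch below uses simplified regular expressions that cover the examples listed; real accession formats have more variants, some strings are ambiguous between databases, and the function name is illustrative.

```python
import re

# Simplified, illustrative patterns, checked in order. They are not a
# complete description of any database's accession rules.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    ("dbSNP", re.compile(r"^rs\d+$")),
    ("PDB", re.compile(r"^\d[A-Za-z0-9]{3}$")),
    ("SwissProt", re.compile(
        r"^([OPQ]\d[A-Z0-9]{3}\d|[A-NR-Z]\d([A-Z][A-Z0-9]{2}\d){1,2})$")),
    ("GenBank protein", re.compile(r"^[A-Z]{3}\d{5}(\d{2})?(\.\d+)?$")),
    ("GenBank nucleotide", re.compile(
        r"^([A-Z]\d{5}|[A-Z]{2}\d{6}(\d{2})?)(\.\d+)?$")),
]


def guess_source(accession):
    """Return the name of the first pattern that matches, else 'unknown'."""
    for name, pattern in PATTERNS:
        if pattern.match(accession):
            return name
    return "unknown"


for acc in ["CH471100.2", "NC_000001.10", "rs121434231", "AI687828.1",
            "NM_001206696", "NP_006138.1", "CAA18545.1", "O14896", "1KT7"]:
    print(acc, "->", guess_source(acc))
```

The optional ".N" suffix is the version number; the part before the dot stays fixed while the version increases when the sequence is revised.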
NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

RefSeq identifiers include the following formats:

Complete genome        NC_######
Complete chromosome    NC_######
Genomic contig         NT_######
mRNA (DNA format)      NM_######   e.g. NM_006744
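
Because the prefix carries the molecule type, a RefSeq accession can be taken apart mechanically. This sketch covers only the prefixes in the table above; the function name and the returned dictionary layout are illustrative choices.

```python
# Prefix meanings taken from the table above; other prefixes exist.
REFSEQ_PREFIXES = {
    "NC": "complete genome or complete chromosome",
    "NT": "genomic contig",
    "NM": "mRNA (DNA format)",
}


def describe_refseq(accession):
    """Split 'PREFIX_NUMBER[.VERSION]' and look up what the prefix means."""
    prefix, sep, rest = accession.partition("_")
    if not sep or not rest:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    number, _, version = rest.partition(".")
    return {
        "prefix": prefix,
        "number": number,
        "version": version or None,  # None when no version suffix is given
        "kind": REFSEQ_PREFIXES.get(prefix, "prefix not listed in the table"),
    }


print(describe_refseq("NM_006744"))
print(describe_refseq("NC_000001.10"))
```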
Access to sequences: Gene resource at NCBI

NCBI Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.

RefSeq provides a curated, optimal accession number for each DNA (NM_000518 for beta globin DNA corresponding to mRNA) or protein (NP_000509).

B&FG 3e
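
Besides the web pages, records can be retrieved by accession through NCBI's E-utilities. The sketch below builds an efetch request for the two beta globin accessions mentioned above. The helper names are illustrative, and `fetch_fasta` needs network access, so it is defined but not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db):
    """Build an efetch URL that asks for one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def fetch_fasta(accession, db):
    """Download one record as FASTA text (requires network access)."""
    with urlopen(efetch_url(accession, db), timeout=30) as response:
        return response.read().decode("utf-8")


print(efetch_url("NM_000518", "nuccore"))  # nucleotide record
print(efetch_url("NP_000509", "protein"))  # protein record
```

For anything beyond occasional use, NCBI asks clients to limit their request rate; check the current E-utilities usage guidelines before scripting bulk downloads.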
NCBI Gene: example of query for beta globin

B&FG 3e
Fig. 2.8
NCBI Gene: example of query for beta globin

B&FG 3e
Fig. 2.9
NCBI Protein: hemoglobin subunit beta

B&FG 3e
Fig. 2.10
NCBI Protein: hemoglobin subunit beta
in the FASTA format

B&FG 3e
Fig. 2.11
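
In FASTA format each record is a header line starting with ">" followed by one or more lines of sequence. A minimal parser can be sketched as below; the two records are hypothetical toy data (not real database entries), and the function name is illustrative.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical toy records for illustration only.
EXAMPLE = """\
>toy_record_1 hypothetical header for illustration
GCGGAGGGTGCGTGCGGGCC
GCGGCAGCCGAACAAAGGAG
>toy_record_2 another hypothetical header
CCCGCCACCC
"""

for header, seq in parse_fasta(EXAMPLE):
    print(header.split()[0], len(seq))
```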
You better start preparing for your course and survival in the exam
Assignment-1a
1.

2. Write a ~100 bp DNA sequence to practise with bioinformatics tools.
Genome browsers
• Versatile tools to visualize chromosomal
positions (typically on x-axis) with annotation
tracks (typically on y-axis).
• Useful to explore data related to some
chromosomal feature of interest such as a gene.
• Prominent browsers are at Ensembl, UCSC,
and NCBI.
• Many hundreds of specialized genome
browsers are available, some for particular
organisms or molecule types.
Genome Browsers: UCSC

Choose the group (e.g. mammal), genome (e.g. human), assembly (e.g. GRCh37 or GRCh38), position and/or search term (e.g. hbb).

A genome build or assembly (e.g. GRCh37 or GRCh38) refers to a fixed, agreed-upon version of a reference genome. Assemblies are typically updated every few years (see Chapter 15 for more information).
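
The same choices can be written into a browser link. The sketch below assumes the UCSC `hgTracks` page with its `db` and `position` parameters, and uses the UCSC names hg19 and hg38 for GRCh37 and GRCh38; the helper name is illustrative.

```python
from urllib.parse import urlencode

# UCSC database names for the two human assemblies mentioned above.
UCSC_DB = {"GRCh37": "hg19", "GRCh38": "hg38"}


def ucsc_url(assembly, position):
    """Build a UCSC Genome Browser link for an assembly and a position
    (a coordinate range or a search term such as a gene symbol)."""
    params = {"db": UCSC_DB[assembly], "position": position}
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + urlencode(params)


print(ucsc_url("GRCh38", "hbb"))
print(ucsc_url("GRCh37", "hbb"))
```

Coordinates differ between assemblies, so a link or a result table should always record which assembly it refers to.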
Genome Browsers: UCSC

When you enter a query such as “hbb” you may have to specify which entry you want, such as the RefSeq version having accession NM_000518.
Genome Browsers: UCSC

Explore the browser! Begin with a favorite gene or region. Zoom in to base pair level, then out to full chromosome level. Explore the many tracks you can add.
Accessing sequence data for individual genes

When you search for information about a particular gene, make sure you know the official gene symbol (e.g. visit http://www.genenames.org) and choose the appropriate species.

Some searches are particularly challenging. For example, there are thousands of histones. Use Boolean operators to limit the search results.
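
A Boolean query combines field-tagged terms with AND, OR and NOT. The sketch below assembles such a query string for the histone example; the particular restriction chosen (title, organism, excluding "partial") is only an illustration, and the helper names are hypothetical.

```python
def tagged(term, tag):
    """Attach an Entrez field tag, quoting multi-word terms."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{tag}]"


def combine(operator, *clauses):
    """Join clauses with AND, OR or NOT and wrap the result in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator!r}")
    return "(" + f" {operator} ".join(clauses) + ")"


# Histones, limited to human, excluding titles marked as partial.
query = combine("AND", tagged("histone", "Title"),
                tagged("Homo sapiens", "Organism"))
query = combine("NOT", query, tagged("partial", "Title"))
print(query)
# ((histone[Title] AND "Homo sapiens"[Organism]) NOT partial[Title])
```

The resulting string can be pasted into the search box of the NCBI Protein or Nucleotide pages.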

When searching for HIV-1 proteins, note that there are vast numbers of protein and DNA results (approaching 1 million entries!) but there is only one RefSeq accession. This highlights the usefulness of the RefSeq project.
