Introduction to Databases
Daniela Puiu Applications Specialist Center for the Study of Biological Complexity, VCU [email protected] 804-827-0952
General Concepts
Database definition
Organized collection of logically related data
Data
Known facts
Types: text, graphics, images, sound, videos
Database management system (DBMS)
Software package for defining and managing a database
Database Examples
Class roster
Hospital patients
Literature (published articles in a certain field)
Genomic information
Protein structure
Taxonomy
Single nucleotide polymorphism
Example: Microbial Database
Data about the protein-coding regions in the microbial genomes sequenced so far.
Organism: name, accession number, genome size, GC%, release date, genome center, sequence
Gene (protein-coding region): name, accession number, organism, location on the chromosome (start, end), strand, size, product, sequence
Database Models
Flat files: '60s
Hierarchical: '60s
Network: '70s
Relational: '80s
Object oriented: '90s
Object relational: '90s
Web enabled: '90s
Database Types (cont.)
Type | Typical number of users | Typical architecture | Typical size
Personal | 1 | Desktop/Laptop/PDA | MB
Workgroup | 5-25 | Client/server: 2 tier | MB-GB
Department | 25-100 | Client/server: 3 tier | GB
Enterprise | >100 | Client/server: distributed | GB-TB
Internet | >1000 | Web server & application servers | MB-GB
Flat Files
Characteristics:
Data is stored as records in regular files
Records usually have a simple structure and a fixed number of fields
Fields in the records may be indexed for fast access
No mechanism for relating data between files
Special programs are needed to access and manipulate the data
Flat Files Example
Microbial database:
GenBank format:
Escherichia coli K12
Streptococcus pneumoniae R6
FASTA format: multiple files
Escherichia coli K12: genome, genes, gene positions
Streptococcus pneumoniae R6: genome, genes, gene positions
Data manipulation:
Sequence extraction, search
Indexing
Format conversion
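The "special programs" that flat files require can be quite small. The following is a minimal sketch, assuming a FASTA file where each record starts with a `>` header line. The function name `read_fasta` and the in-memory example file are hypothetical, and the sequences are only the short fragments used elsewhere in these slides.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record file standing in for a real genes file.
example = io.StringIO(">thrL\nMKRI\n>thrA\nMR\nVL\n")

# Indexing: map each record name to its sequence for fast lookup.
index = {name: seq for name, seq in read_fasta(example)}

# Sequence extraction by name.
print(index["thrA"])  # prints MRVL
```

Nothing in the file itself links a gene to its organism, so any such relation has to be maintained by the program, which is the limitation the relational model addresses.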
Relational Database
Characteristics:
Data is organized into tables: rows & columns
Each row represents an instance of an entity
Each column represents an attribute of an entity
Metadata describes each table column
Relationships between entities are represented by values stored in the columns of the corresponding tables (keys)
Accessible through Structured Query Language (SQL)
Enterprise data model
Graphical representation of the high-level entities
Example: Microbial database
Each organism has multiple corresponding genes: a one-to-many (1:m) relation between Organism (1) and Gene (m)
Metadata
Data that describes the properties or characteristics of other data
Does not include sample data
Allows database designers and users to understand the meaning of the data
Metadata & Data Table
Organism
Metadata:
Name | Type | Max Length | Description
Name | Alphanumeric | 100 | Organism name
Size | Integer | 10 | Genome length (bases)
Gc | Float | 5 | Percent GC
Accession | Alphanumeric | 10 | Accession number
Release | Date | 8 | Release date
Center | Alphanumeric | 100 | Genome center name
Sequence | Alphanumeric | Variable | Sequence
Data:
Name | Size | Gc | Accession | Release | Center | Sequence
Escherichia coli K12 | 4,640,000 | 50 | NC_000913 | 09/05/1997 | Univ. Wisconsin | AGCTTTTCATT
Streptococcus pneumoniae R6 | 2,040,000 | 40 | NC_003098 | 09/07/2001 | Eli Lilly and Company | TTGAAAGAAAA
Metadata & Data Table (cont.)
Gene
Metadata:
Name | Type | Max Length | Description
Name | Alphanumeric | 100 | Gene name
Accession | Alphanumeric | 10 | Gene accession number
OAccession | Alphanumeric | 10 | Organism accession number
Start | Integer | 10 | Gene start
End | Integer | 10 | Gene end
Strand | Character | 1 | Gene strand
Product | Alphanumeric | 1000 | Gene annotation
Sequence | Alphanumeric | Variable | Gene sequence
Data:
Name | Accession | OAccession | Start | End | Strand | Product | Sequence
thrL | 16127995 | NC_000913 | 190 | 255 | + | thr operon leader peptide | MKRI
thrA | 16127996 | NC_000913 | 337 | 2799 | + | homoserine dehydrogenase I | MRVL
transposase_A | 15902058 | NC_003098 | 20207 | 20554 | + | transposase | MWYN
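A DBMS stores metadata separately from the data, so it can be inspected even when a table is empty. The sketch below assumes SQLite through Python's standard `sqlite3` module, which is not one of the systems named in these slides, and SQLite type names (TEXT, INTEGER, REAL) stand in for the Alphanumeric, Integer and Float types above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# "Release" is quoted because RELEASE is also an SQL keyword.
conn.execute(
    'CREATE TABLE Organism ('
    ' Name TEXT, Size INTEGER, Gc REAL, Accession TEXT,'
    ' "Release" TEXT, Center TEXT, Sequence TEXT)'
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
metadata = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(Organism)")]
for name, col_type in metadata:
    print(name, col_type)

# The table holds no rows yet: metadata exists independently of the data.
row_count = conn.execute("SELECT COUNT(*) FROM Organism").fetchone()[0]
```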
Relationships
Used to connect tables
Field(s) that have the same value in the related tables: Organism.Accession = Gene.OAccession
Organism.Accession: unique; primary key
Gene.OAccession: not unique; secondary key, acting as a foreign key that references Organism.Accession
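The uniqueness of Organism.Accession and the link from Gene.OAccession can be declared so that the DBMS enforces them. This is a minimal sketch, assuming SQLite via Python's `sqlite3` and reduced two- and three-column versions of the tables. The accession `NC_999999` is a made-up value used only to trigger a violation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.execute("CREATE TABLE Organism (Accession TEXT PRIMARY KEY, Name TEXT)")
conn.execute(
    "CREATE TABLE Gene ("
    " Accession TEXT PRIMARY KEY, Name TEXT,"
    " OAccession TEXT REFERENCES Organism(Accession))"
)

conn.execute("INSERT INTO Organism VALUES ('NC_000913', 'Escherichia coli K12')")
# Two genes share the same OAccession: the key on the "many" side is not unique.
conn.executemany(
    "INSERT INTO Gene VALUES (?, ?, ?)",
    [("16127995", "thrL", "NC_000913"), ("16127996", "thrA", "NC_000913")],
)

# A second organism with the same accession violates the primary key.
try:
    conn.execute("INSERT INTO Organism VALUES ('NC_000913', 'duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A gene that points at a nonexistent organism violates the foreign key.
try:
    conn.execute("INSERT INTO Gene VALUES ('0', 'orphan', 'NC_999999')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

gene_count = conn.execute("SELECT COUNT(*) FROM Gene").fetchone()[0]
print(duplicate_rejected, orphan_rejected, gene_count)  # prints True True 2
```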
SQL
ANSI (American National Standards Institute) standard computer language for accessing and manipulating database systems. SQL statements are used to retrieve and update data in a database. Includes:
Data Manipulation Language (DML)
Data Definition Language (DDL)
Data Manipulation Language
Syntax for executing queries, updating, inserting, and deleting records.
SELECT - extracts data from one or more tables
INSERT INTO - inserts new data into a table
UPDATE - updates data in a table
DELETE FROM - deletes data from a table
DML Example
Select all Escherichia coli K12 genes that lie in the 1 Mb to 2 Mb region of the chromosome:
SELECT *
FROM Organism, Gene
WHERE Organism.Name = 'Escherichia coli K12'
AND Organism.Accession = Gene.OAccession
AND Gene.Start >= 1000000
AND Gene.End <= 2000000;
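To see the query return something, it can be run against a small in-memory copy of the two tables. This sketch assumes SQLite via Python's `sqlite3`. The rows `geneX` and `geneY` are hypothetical and were invented only so that one row matches and one is filtered out, since none of the genes shown in these slides falls in the 1 Mb to 2 Mb region.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Organism (Name TEXT, Accession TEXT PRIMARY KEY);
CREATE TABLE Gene (Name TEXT, Accession TEXT PRIMARY KEY, OAccession TEXT,
                   "Start" INTEGER, "End" INTEGER);
INSERT INTO Organism VALUES ('Escherichia coli K12', 'NC_000913');
INSERT INTO Organism VALUES ('Streptococcus pneumoniae R6', 'NC_003098');
INSERT INTO Gene VALUES ('thrL', '16127995', 'NC_000913', 190, 255);
INSERT INTO Gene VALUES ('thrA', '16127996', 'NC_000913', 337, 2799);
INSERT INTO Gene VALUES ('transposase_A', '15902058', 'NC_003098', 20207, 20554);
-- Hypothetical rows, not real annotations
INSERT INTO Gene VALUES ('geneX', 'X0000001', 'NC_000913', 1500000, 1500900);
INSERT INTO Gene VALUES ('geneY', 'X0000002', 'NC_003098', 1500000, 1500900);
""")

# The organism name is passed as a parameter instead of being pasted into the SQL.
rows = conn.execute(
    """
    SELECT Gene.Name, Gene."Start", Gene."End"
    FROM Organism, Gene
    WHERE Organism.Name = ?
      AND Organism.Accession = Gene.OAccession
      AND Gene."Start" >= 1000000
      AND Gene."End" <= 2000000
    """,
    ("Escherichia coli K12",),
).fetchall()
print(rows)  # prints [('geneX', 1500000, 1500900)]
```

The condition Organism.Accession = Gene.OAccession is what joins the two tables. Without it, every organism row would be paired with every gene row.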
DML Example (cont.)
INSERT INTO Gene (Name, Accession, OAccession, Start, End, Strand, Product, Sequence)
VALUES ('thrL', '16127995', 'NC_000913', 190, 255, '+', 'thr operon leader peptide', 'MKRI');
UPDATE Gene SET Start = 160 WHERE Accession = '16127995';
DELETE FROM Gene WHERE Accession = '16127995';
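The three statements can be tried in sequence on a scratch table, checking how many rows each one touches. This is a sketch assuming SQLite via Python's `sqlite3`. It filters on the gene accession number 16127995, which is an assumption about the intent of the example, since Gene.Accession holds gene accessions and not organism accessions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Gene (Name TEXT, Accession TEXT, OAccession TEXT,'
    ' "Start" INTEGER, "End" INTEGER, Strand TEXT, Product TEXT, Sequence TEXT)'
)

# INSERT: eight columns, eight values, passed as parameters.
conn.execute(
    'INSERT INTO Gene (Name, Accession, OAccession, "Start", "End",'
    ' Strand, Product, Sequence) VALUES (?, ?, ?, ?, ?, ?, ?, ?)',
    ("thrL", "16127995", "NC_000913", 190, 255, "+",
     "thr operon leader peptide", "MKRI"),
)

# UPDATE: rowcount reports how many rows were changed.
updated = conn.execute(
    'UPDATE Gene SET "Start" = 160 WHERE Accession = ?', ("16127995",)
).rowcount
start_after_update = conn.execute('SELECT "Start" FROM Gene').fetchone()[0]

# DELETE: remove the same row again.
deleted = conn.execute(
    "DELETE FROM Gene WHERE Accession = ?", ("16127995",)
).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM Gene").fetchone()[0]

print(updated, start_after_update, deleted, remaining)  # prints 1 160 1 0
```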
Data Definition Language
Syntax for creating, editing, and deleting:
Databases
Tables
Views
Indexes
Constraints
Users
Privileges
DDL Examples
CREATE DATABASE Microbial;
CREATE TABLE Organism (
  Name varchar(100),
  Size int(10),
  Gc decimal(5),
  Accession varchar(10),
  Release date,
  Center varchar(100)
);
ALTER TABLE Organism ADD Sequence text;
DROP TABLE Organism;
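The effect of each DDL statement can be observed by listing the columns and tables before and after it runs. This is a sketch assuming SQLite via Python's `sqlite3`. SQLite has no CREATE DATABASE statement, so opening a connection stands in for it here, and the helper names `columns` and `tables` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for CREATE DATABASE Microbial

def columns(table):
    """Return the column names of a table, in definition order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

def tables():
    """Return the names of all tables in the database."""
    query = "SELECT name FROM sqlite_master WHERE type = 'table'"
    return [row[0] for row in conn.execute(query)]

# CREATE TABLE: column definitions are separated by commas.
conn.execute(
    'CREATE TABLE Organism (Name varchar(100), Size int(10), Gc decimal(5),'
    ' Accession varchar(10), "Release" date, Center varchar(100))'
)
before = columns("Organism")

# ALTER TABLE: add a column to the existing table.
conn.execute("ALTER TABLE Organism ADD COLUMN Sequence text")
after = columns("Organism")

# DROP TABLE: remove the table and its data.
conn.execute("DROP TABLE Organism")
left = tables()

print(len(before), len(after), left)  # prints 6 7 []
```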
DBMS
Software package for defining and managing a database. Examples:
Proprietary: MS Access, MS SQL Server, DB2, Oracle, Sybase
Open source: MySQL, PostgreSQL
DBMS Advantages
Program-data independence
Minimal data redundancy
Improved data consistency & quality
Access control
Transaction control
Improved accessibility & data sharing
Increased productivity of application development
Enforced standards
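Transaction control means that a group of changes is applied completely or not at all. The sketch below assumes SQLite via Python's `sqlite3` and a reduced two-column Gene table. The second insert in the batch deliberately reuses an existing accession number so that the whole batch is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Gene (Name TEXT, Accession TEXT PRIMARY KEY)")
conn.execute("INSERT INTO Gene VALUES ('thrL', '16127995')")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception escapes it.
    with conn:
        conn.execute("INSERT INTO Gene VALUES ('thrA', '16127996')")
        # Reuses thrL's accession, so the primary key check fails.
        conn.execute("INSERT INTO Gene VALUES ('duplicate', '16127995')")
except sqlite3.IntegrityError:
    pass

names = [row[0] for row in conn.execute("SELECT Name FROM Gene ORDER BY Name")]
print(names)  # prints ['thrL']
```

The valid thrA row is discarded together with the failing one, so the table is left exactly as it was before the batch started.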
Web Databases
Data is accessible through the Internet
Have different underlying database models
Example: biological databases
Molecular data: NCBI, Swissprot, PDB, GO
Protein interaction: DIP, BIND
Organism specific: Mouse, Worm, Yeast
Literature: PubMed
Disease
CSBC Resources
Database and software list
Molecular databases: Genbank, EMBL, NR, NT, RefSeq, Swissprot
DBMS: MS Excel, MS Access, MySQL, PostgreSQL
Computer resources
watson.vcu.edu: 8 processor Sun server
medusa.vcu.edu: 64 processor Beowulf cluster