DATABASE
A database is an information set with a regular structure. Its front-end allows data access, searching and sorting
routines. Its back-end affords data inputting and updating. A database is usually but not necessarily stored in some
machine-readable format accessed by a computer. There are a wide variety of databases, from simple tables stored
in a single file to very large databases with many millions of records, stored in rooms full of disk drives or other
peripheral electronic storage devices.
Databases resembling modern versions were first developed in the 1960s. A pioneer in the field was Charles
Bachman.
The most useful way of classifying databases is by the programming model associated with the database. Several
models have been in wide use for some time. Historically, the hierarchical model was implemented first, then the
network model, and then the relational model became dominant, with the so-called flat model accompanying it for low-end usage.
The first two and the last were never formally theorised and were regarded as data models only by contrast with the
relational model; lacking conceptual underpinnings of their own, they arose simply out of physical constraints and
programming practice, not out of data models.
A database management system (DBMS) is a computer program (or more typically, a suite of them) designed to
manage a database, a large set of structured data, and run operations on the data requested by numerous users.
Typical examples of DBMS use include accounting, human resources and customer support systems. Originally found
only in large companies with the computer hardware needed to support large data sets, DBMSs have more recently
emerged as a fairly standard part of any company back office.
DBMSs contrast with the more general concept of a database application in that they are designed as the "engine"
of a multi-user system. To fill this role, DBMSs are typically built around a private multitasking kernel with built-
in networking support. A typical database application will not include these features internally, but may be able to
support similar functionality by relying on the operating system to provide them.
HISTORY
Databases have been in use since the earliest days of electronic computing, but the vast majority of these were
custom programs written to access custom databases. Unlike modern systems which can be applied to widely
Compiled By: Pankaj Kumar Bahety 1
Faculty of Management Studies Shri Shankaracharya Group of Institutions
different databases and needs, these systems were tightly linked to the database in order to gain speed at the price of
flexibility.
As computers grew in capability this tradeoff became increasingly unnecessary, as a number of general-purpose
database systems emerged, and by the mid-1960s there were a number of such systems in commercial use. Interest
in a standard started to grow, and Charles Bachman, author of one such product, IDS, founded the Database Task
Group within Codasyl, the group responsible for the creation and standardization of COBOL. In 1971 they delivered
their standard, which generally became known as the Codasyl approach, and soon there were a number of
commercial products based on it available.
The Codasyl approach was based on the "manual" navigation of a linked dataset which was formed into a large
network. When the database was first opened, the program was handed back a link to the first record in the database,
which also contained pointers to other pieces of data. To find any particular record the programmer had to step
through these pointers one at a time until the required record was returned. Simple queries like "find all the people in
Sweden" required the program to walk the entire data set and collect the matching results. There was, essentially, no
concept of "find" or "search". This might sound like a serious limitation today, but in an era when the data was most
often stored on magnetic tape such operations were too expensive to contemplate anyway.
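The pointer-walking style described above can be sketched in a few lines. This is an illustrative model only, not any real Codasyl system's API; the record fields and the "people in Sweden" query are taken from the example in the text.

```python
# Codasyl-style "manual navigation": each record holds a pointer (here, a
# Python reference) to the next record, and the only way to answer "find
# all the people in Sweden" is to walk every link yourself.

class Record:
    def __init__(self, name, country):
        self.name = name
        self.country = country
        self.next = None  # pointer to the next record in the chain

# Build a small linked data set.
first = Record("Anna", "Sweden")
first.next = Record("Bob", "USA")
first.next.next = Record("Carl", "Sweden")

# There is no "find" or "search": the program steps through the pointers
# one at a time until the whole data set has been visited.
def people_in(country, record):
    results = []
    while record is not None:
        if record.country == country:
            results.append(record.name)
        record = record.next
    return results

print(people_in("Sweden", first))  # -> ['Anna', 'Carl']
```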
IBM also had its own DBMS in 1968, known as IMS. IMS was a development of software written for the
Apollo program on the System/360. IMS was generally similar in concept to Codasyl, but used a strict hierarchy for its
model of data navigation instead of Codasyl's network model.
Both concepts later became known as navigational databases due to the way data was accessed, and Bachman's
1973 Turing Award lecture was titled The Programmer as Navigator. IMS is classified as a hierarchical
database. IDS and IDMS (both CODASYL databases), as well as Cincom's TOTAL database, are classified as
network databases.
Edgar Codd worked at IBM in San Jose, California, one of their offshoot offices that was primarily involved in the
development of hard disk systems. He was unhappy with the navigational model of the Codasyl approach, notably the
lack of a "search" facility which was becoming increasingly useful when the database was stored on disk instead of
tape. In 1970 he wrote a number of papers outlining a new approach to database construction, eventually culminating
in the groundbreaking A Relational Model of Data for Large Shared Data Banks.
In this paper he described a new system for storing and working with large databases. Instead of records being stored
in some sort of linked list of free-form records as in Codasyl, his concept was to use a "table" of fixed-length records.
Such a system would be very inefficient when storing "sparse" databases where some of the data for any one record
could be left empty. The relational model solved this by splitting the data into a series of tables, with optional elements
being moved out of the main table where they would take up room only if needed.
For instance, a common use of a database system is to track information about users, their name, login information,
various addresses and phone numbers. In the navigational approach all of this data would be placed in a single
record, and items that were not used would simply not be placed in the database. In the relational approach, the data
would be split into a user table, an address table and a phone number table (for instance). Only if the address or
phone numbers were provided would records be created in these optional tables.
Linking the information back together is the key to this system. In the relational model some bit of information was
used as a "key", uniquely defining a particular record. When information was being collected about a user, information
stored in the optional (or related) tables would be found by searching for this key. For instance, if the login name of a
user is unique, addresses and phone numbers for that user would be recorded with the login name as their key.
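The user/address/phone split and the key-based re-linking described above can be sketched with SQLite; the table and column names are illustrative, not from any particular system.

```python
import sqlite3

# The relational split: users, addresses and phone numbers each get
# their own table, linked by the login name acting as the key.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE user    (login TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE address (login TEXT, street TEXT);
    CREATE TABLE phone   (login TEXT, number TEXT);
""")
db.execute("INSERT INTO user VALUES ('bob', 'Bob Smith')")
# Only Bob's phone number is known, so only the phone table gets a row;
# no empty address record is ever stored.
db.execute("INSERT INTO phone VALUES ('bob', '555-0100')")

# Re-linking: information in the optional table is found by its key.
row = db.execute("""
    SELECT user.name, phone.number
    FROM user JOIN phone ON user.login = phone.login
""").fetchone()
print(row)  # -> ('Bob Smith', '555-0100')
```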
This "re-linking" of related data back into a single collection is something that traditional computer languages are not
designed for. Just as the navigational approach would require programs to loop in order to collect records, the
relational approach would require loops to collect information about any one record. Codd's solution to this problem
was to create a new language dedicated to just this problem, a suggestion that would later develop into the nearly
universal SQL of today.
Using a branch of mathematics known as tuple calculus, he demonstrated that such a system could support all the
operations of normal databases (inserting, updating etc.) as well as providing a simple system for finding and
returning sets of data in a single operation.
IBM started working on a prototype system based on Codd's concepts as System R in the early 1970s. The first
"quick" version was ready in 1974/75, and work then started on multi-table systems in which the data could be broken
down so that all of the data for a record (much of which is often optional) didn't have to be stored in a single large
"chunk". Follow-up multi-user versions were tested by customers in 1978 and 1979, by which time a standardized
computer language, SQL, had been added. By this time it had become clear that Codd's ideas were both workable
and superior to Codasyl, and IBM started working on true product versions of System R, known as SQL/DS and,
later, Database 2 (DB2).
Codd's paper was also picked up by two people at Berkeley, Eugene Wong and Michael Stonebraker. They started a
project known as INGRES using funding that had already been allocated for a geographical database project, using
student programmers to produce code. Starting in 1973, INGRES delivered its first test products in 1974 and was
generally ready for widespread use in 1979. During this time a number of people moved "through" the group;
perhaps as many as 30 people worked on the project, about five at a time. INGRES was similar to System R in a number
of ways, including the use of a "language" for data access, known as QUEL.
Many of the people involved with INGRES became convinced of the future commercial success of such systems, and
formed their own companies to commercialize the work. Sybase, Informix, NonStop SQL and eventually Ingres itself
were all being sold as offshoots to the original INGRES product in the 1980s. Even Microsoft SQL Server is actually a
re-built version of Sybase, and thus of INGRES. Only Larry Ellison's Oracle started from a different chain: based on
IBM's papers on System R, it beat IBM to market when its first version was released in 1978.
Codd's paper was also read in Sweden, where Mimer SQL was developed from the mid-1970s at Uppsala University, and
in 1984 this project was consolidated into an independent enterprise. In the early 1980s Mimer introduced transaction
handling for high robustness in applications, an idea that was subsequently implemented on most other DBMSs.
MULTIDIMENSIONAL DBMS
Pseudo-relational DBMS implementations (Oracle, Sybase, et al.) soon took over the entire database market, even
though they were not particularly faithful to Codd's original framework. As a result, these implementations' query
performance in fully normalized databases can be quite poor. For instance, to find the address of the user named
Bob, these implementations may look up Bob in the USER table, find his "primary key" (the login name), and then
search the ADDRESS table for that key. Although this appears to be a single operation to the user, in most
implementations it requires a complex and time consuming search through the tables.
In response, database programmers have turned to denormalization to help improve performance in these flawed
implementations, but such violations of database normalization carry a heavy cost (namely data
redundancy and the extra data-checking logic needed to keep the database consistent).
The multidimensional DBMS ignores the logical/physical independence tenet of the relational model and instead
exposes pointers to the programmer. Instead of finding Bob's address by looking up the "key" in the address table, the
multidimensional DBMS stores a pointer to the data in question. In fact, if the data is "owned" by the original record
(that is, no other records in USER point to it), it can be stored in the same physical location, thereby increasing the
speed at which it can be accessed.
This sort of physical data optimization can (and should) be done in current pseudo-relational systems while still
allowing for logical data independence: the programmer would still see the "key" value, but internally it would be
stored as a pointer.
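The contrast between the two lookup styles can be sketched in plain Python; this is an illustrative model, not any real product's storage format, and the data reuses the "Bob's address" example from the text.

```python
# Relational style: Bob's address is found by searching the address
# table for his key. Multidimensional style: the user record owns a
# direct pointer (here, a plain Python reference) to the address, so
# no search through the table is needed.

users = {"bob": {"name": "Bob"}}
addresses = [("bob", "1 Main St"), ("ann", "2 Oak Ave")]

def address_by_key(login):
    # Key-based lookup: scan the table for the matching key.
    for key, addr in addresses:
        if key == login:
            return addr

# Pointer-based storage: the address lives with (or is referenced by)
# the owning record and is reached without any search.
users["bob"]["address"] = "1 Main St"

print(address_by_key("bob"))        # -> 1 Main St
print(users["bob"]["address"])      # -> 1 Main St
```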
Due to poor timing (and generally poor implementations), the multidimensional system never became popular as a
general solution, although certain of its ideas have been picked up in object DBMSs.
OBJECT DBMS
Multidimensional DBMS did have one lasting impact on the market: they led directly to the development of the object
database systems. Based on the same general structure and concepts as the multidimensional systems, these new
systems allowed the user to store objects directly in the database. That is, the programming constructs being used in
the object oriented (OO) programming world could be used directly in the database, instead of first being converted to
some other format.
This could happen because of the multidimensional system's concept of ownership. In an OO program a particular
object will typically contain others; for example, the object representing Bob may contain a reference to a separate
object referring to Bob's home address. Adding support for various OO languages and polymorphism re-created the
multidimensional systems as object databases, which continue to serve a niche today.
DESCRIPTION
A DBMS can be an extremely complex set of software programs that controls the organization, storage and retrieval
of data (fields, records and files) in a database. It also controls the security and integrity of the database. The DBMS
accepts requests for data from the application program and instructs the operating system to transfer the appropriate
data.
When a DBMS is used, information systems can be changed much more easily as the organization's information
requirements change. New categories of data can be added to the database without disruption to the existing system.
Data security prevents unauthorised users from viewing or updating the database. Using passwords, users are
allowed access to the entire database or subsets of the database, called subschemas (pronounced "sub-skeema").
For example, an employee database can contain all the data about an individual employee, but one group of users
may be authorized to view only payroll data, while others are allowed access to only work history and medical data.
The DBMS can maintain the integrity of the database by not allowing more than one user to update the same record
at the same time. The DBMS can keep duplicate records out of the database; for example, no two customers with the
same customer numbers (key fields) can be entered into the database. See ACID properties for more information.
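The duplicate-key check described above can be demonstrated with SQLite; the customer table is illustrative.

```python
import sqlite3

# A PRIMARY KEY constraint makes the DBMS refuse a second customer with
# the same customer number, keeping duplicate records out of the database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (customer_no INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO customer VALUES (1001, 'Acme Ltd')")
try:
    db.execute("INSERT INTO customer VALUES (1001, 'Duplicate Ltd')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True   # the DBMS rejects the duplicate key field
print("duplicate rejected:", rejected)  # -> duplicate rejected: True
```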
Database query languages and report writers allow users to interactively interrogate the database and analyse its
data.
If the DBMS provides a way to interactively enter and update the database, as well as interrogate it, this capability
allows for managing personal databases. However, it may not leave an audit trail of actions or provide the kinds of
controls necessary in a multi-user organisation. These controls are available only when a set of application programs
is customised for each data entry and updating function.
A business information system is made up of subjects (customers, employees, vendors, etc.) and activities (orders,
payments, purchases, etc.). Database design is the process of deciding how to organize this data into record types
and how the record types will relate to each other. The DBMS should mirror the organization's data structure and
process transactions efficiently.
Organizations may use one kind of DBMS for daily transaction processing and then move the detail onto another
computer that uses another DBMS better suited for random inquiries and analysis. Overall systems design decisions
are performed by data administrators and systems analysts. Detailed database design is performed by database
administrators.
The three most common organizations are the hierarchical, network and relational models. A database management
system may provide one, two or all three methods. Inverted lists and other methods are also used. The most suitable
structure depends on the application and on the transaction rate and the number of inquiries that will be made.
The dominant model in use today is the relational model, usually used with the SQL query language. Many DBMSs
also support the Open Database Connectivity (ODBC) API, which provides a standard way for programmers to access the
DBMS.
Database servers are specially designed computers that hold the actual databases and run only the DBMS and
related software. Database servers are usually multiprocessor computers, with RAID disk arrays used for stable
storage. Connected to one or more servers via a high-speed channel, hardware database accelerators are also used
in large volume transaction processing environments.
A Relational Database Management System (RDBMS) is a database management system (DBMS) that is
based on the relational model as introduced by Edgar F. Codd.
HISTORY OF THE TERM
Codd introduced the term in his seminal paper A Relational Model of Data for Large Shared Data Banks. In
this paper and later papers he defined what he meant by relational. One well-known definition of what
constitutes a relational database system is Codd's 12 rules. However, many of the early implementations of
the relational model did not conform to all of Codd's rules, so the term gradually came to describe a broader
class of database systems. At a minimum, these systems:
presented the data to the user in tabular form (as a collection of tables, each table consisting of a set
of rows and columns)
provided operators to manipulate the data in tabular form
The first RDBMS that was a relatively faithful implementation of the relational model was the Multics
Relational Data Store, first sold in 1978. Others have been Berkeley Ingres QUEL and IBM BS12.
DATABASE MODELS
The flat (or table) model consists of a single, two-dimensional array of data elements, where all members of a given
column are assumed to be similar values, and all members of a row are assumed to be related to one another. For
instance, columns for name and password might be used as a part of a system security database. Each row would
have the specific password associated with a specific user. Columns of the table often have a type associated with
them, defining them as character data, date or time information, integers, or floating point numbers. This model is the
basis of the spreadsheet.
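The flat model's single two-dimensional array can be sketched directly; the name/password security table is the example from the text, with illustrative values.

```python
# The flat (table) model: one two-dimensional array. All members of a
# column hold similar values; all members of a row relate to one another.
security_table = [
    ("name",  "password"),   # column headers
    ("alice", "s3cret"),
    ("bob",   "hunter2"),
]

# Each row associates a specific password with a specific user.
for name, password in security_table[1:]:
    print(name, "->", password)
```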
The network model allows multiple tables to be used together through the use of pointers (or references). Some
columns contain pointers to different tables instead of data. Thus, the tables are related by references, which can be
viewed as a network structure. A particular subset of the network model, the hierarchical model, limits the
relationships to a tree structure, instead of the more general directed graph structure implied by the full network
model.
The relational model describes a database in terms of:
relations of n-tuples (tables of rows) of data elements (or attributes or columns); each n-tuple (row) is a
collection of data elements (attributes/columns) of the entity represented by that particular n-tuple (row);
a collection of operators, the relational algebra and calculus;
and a collection of integrity constraints, defining the set of consistent database states and changes of state.
The integrity constraints can be of four types: domain (also known as type), attribute, relvar and database
constraints.
Unlike the hierarchical and network models, there are no explicit pointers whatsoever in the data held in the relational
model. In the hierarchical and network models data is accessed by the programmer specifying an access path from
pointer to pointer embedded in the data. In the relational model data is accessed using relational algebra. Subsets of
n-tuples (rows) in different relations (tables) are joined in cross-products, intersected and differenced
using the values of any of the attributes (columns). This flexibility in relational databases allows users (and
programmers) to write queries that were not anticipated by the database designers. As a result, relational databases
can be used by multiple applications in ways the original designers did not foresee, which is especially important for
databases that might be used for decades. This has made relational databases very popular with businesses.
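The set operations described above can be sketched by modelling relations as Python sets of tuples (rows); the employee and department data are illustrative.

```python
# Relations as sets of n-tuples: union, intersection, difference,
# cross-product and a natural join, as plain set operations.
employees_2023 = {("Anna", "Sales"), ("Bob", "IT")}
employees_2024 = {("Anna", "Sales"), ("Carl", "IT")}

union        = employees_2023 | employees_2024   # rows from either year
intersection = employees_2023 & employees_2024   # rows in both years
difference   = employees_2023 - employees_2024   # rows only in 2023

departments = {("Sales", "London"), ("IT", "Oslo")}

# Cross-product: every employee row paired with every department row.
cross = {e + d for e in employees_2024 for d in departments}

# Join: keep only the combinations whose department values match.
join = {(name, dept, city)
        for (name, dept) in employees_2024
        for (d, city) in departments if d == dept}
print(sorted(join))  # -> [('Anna', 'Sales', 'London'), ('Carl', 'IT', 'Oslo')]
```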
Any number of declarative programming languages could be invented which would provide users with the means of
specifying the relational algebra necessary to access and manipulate the data in relational databases. The de facto
standard is Structured Query Language (SQL) although every RDBMS has its own dialect of this English-like
declarative programming language.
The relational model is an application of the relational algebra and set theory branches of mathematics to the
design and working of databases. Perhaps the most important pioneer in this field was Ted Codd. Although this model
is the basis for relational database management systems (RDBMSs), very few RDBMSs implement the
model entirely rigorously or completely, and many have extra features which, if used, violate the theory. Some so-called
RDBMSs are not relational enough to be worthy of the term: they are DBMSs with relational features.
A hierarchical database is a kind of database management system that links records together in a tree data
structure such that each record type has only one owner, e.g. an order is owned by only one customer. Hierarchical
structures were widely used in the first mainframe database management systems. However, due to their restrictions,
they often cannot be used to relate structures that exist in the real world.
Hierarchical relationships between different types of data can make it very easy to answer some questions, but very
difficult to answer others. If the one-to-many relationship is violated (e.g., a patient can have more than one physician),
then the hierarchy becomes a network.
TERMS
A network model database management system has a more flexible structure than the hierarchical model or
relational model, but pays for it in processing time and specialization of types. Some object-oriented database
systems use a general network model, but most have some hierarchical limitations.
The neural network is an important modern example of a network database - a large number of similar simple
processing units, analogous to neurons in the human brain, 'learn' the differences and similarities between a number
of inputs. At any one time the 'weights' assigned to different connections between layers of neuron-like processing
units constitute a set of assertions about what is most closely related to what.
This enables prototype models to be built or inferred from the weights, and possibly used to define formal typed link or
typed object schemas. Internet search engines like Google use a similar technique to determine whether a link to a
given page on the World Wide Web could normally be expected from a page containing the search terms.
As this example suggests, network models are particularly useful for text analysis. One approach, called singular value
decomposition, enables many-dimensional maps of large text corpora to be drawn and generates an
orthogonal basis of difference vectors between texts. This approach has yielded interesting results, such as the finding that
a given medical specialty typically contained 100-140 degrees of freedom in its text corpus - a number remarkably
similar to the number of human beings in a village of optimal size according to anthropology. This led to speculation
that tracking differences in implications of medical papers and tracking differences in behavior of one's neighbors
relied on the same mental module in the brain. It remains controversial, but if it were so, that would suggest that the
human brain's cognition resembles a network database more than the other models.
The relational model for management of a database is a data model based on predicate logic and set theory.
Other models are the hierarchical model and network model. Some systems using these older architectures are still in
use today in data centers with high data volume needs or where existing systems are so complex it would be cost
prohibitive to migrate to systems employing the relational model; also of note are newer object-oriented databases,
even though many of them are DBMS-construction kits, rather than proper DBMSs.
The relational model was the first formal database model. After it was defined, formal models were made to describe
hierarchical databases (the hierarchical model) and network databases (the network model). Hierarchical and
network databases existed before relational databases, but were only given formal models after the relational model
was defined.
The relational model was invented by Dr. Ted Codd and subsequently maintained and developed by Chris Date and
Hugh Darwen, as a general model of data. In The Third Manifesto (1995) they show how the relational model can be
extended with object oriented features without compromising its fundamental principles.
The standard language for relational databases, SQL, is only vaguely reminiscent of the mathematical model. Usually
it is adopted, despite its restrictions, because it is far and away more popular than any other database language.
The fundamental assumption of the relational model is that all data is represented as mathematical relations, i.e., a
subset of the Cartesian product of n sets. In the mathematical model (unlike SQL), reasoning about such data is done
in two-valued predicate logic (that is, without a null value), meaning there are two possible evaluations for each
proposition: either true or false. The data is operated upon by means of a relational calculus and algebra.
The relational data model permits the designer to create a consistent logical model of the information to be stored.
This logical model can be refined through a process of database normalization. A database built on the pure relational
model would be entirely normalized. The access plans and other implementation and operation details are handled by
the DBMS engine, and should not be reflected in the logical model. This contrasts with common practice for SQL
DBMSs in which performance tuning often requires changes to the logical model.
The basic relational building block is the domain, or data type. A tuple is a set of attributes, which are ordered pairs of
domain and value. A relvar (relation variable) is an unordered set of ordered pairs of domain and name, which serves
as the header for a relation. A relation is an unordered set of tuples. Although these relational concepts are
mathematically defined, they correspond loosely to traditional database concepts. A relation is similar to the traditional
concept of table. A tuple is similar to the concept of row.
The basic principle of the relational model is that all information is represented by data values in relations. Thus,
relvars are related to each other not by pointers but by attributes drawn from a shared domain appearing in several
relvars, and if one attribute is dependent on another, this dependency is enforced through referential integrity.
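The referential-integrity enforcement mentioned above can be sketched with SQLite; the customer/order schema is illustrative and uses the customer ID from the example that follows.

```python
import sqlite3

# A FOREIGN KEY constraint forces every order's Customer ID to be drawn
# from a candidate key in the customer relation.
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only on request
db.executescript("""
    CREATE TABLE customer (customer_id TEXT PRIMARY KEY);
    CREATE TABLE "order" (
        order_no INTEGER PRIMARY KEY,
        customer_id TEXT REFERENCES customer(customer_id)
    );
""")
db.execute("INSERT INTO customer VALUES ('1234567890')")
db.execute("""INSERT INTO "order" VALUES (1, '1234567890')""")  # key exists: accepted
try:
    db.execute("""INSERT INTO "order" VALUES (2, 'no-such-id')""")
    enforced = False
except sqlite3.IntegrityError:
    enforced = True   # unknown customer: the dependency is enforced
print("foreign key enforced:", enforced)  # -> foreign key enforced: True
```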
EXAMPLE DATABASE
An idealized, very simple example of a description of some relvars and their attributes:
Customer(Customer ID, Tax ID, Name, Address, City, State, Zip, Phone)
Order(Order No, Customer ID, Invoice No, Date Placed, Date Promised, Terms, Status)
In this design we have five relvars: Customer, Order, Order Line, Invoice, and Invoice Line; only the first two are
shown above. Customer ID (in Customer) and Order No (in Order) are candidate keys, while attributes such as
Customer ID in the Order relvar are foreign keys.
Usually one candidate key is arbitrarily chosen to be called the primary key and used in preference over the other
candidate keys, which are then called alternate keys.
A candidate key is a unique identifier ensuring that no tuple is duplicated; duplication would make the relation into
something else, namely a bag, violating the basic definition of a set. A key can be composite, that is, can be
composed of several attributes. Below is a tabular depiction of a relation of our example Customer relvar; a relation
can be thought of as a value that can be attributed to a relvar.
If we attempted to insert a new customer with the ID 1234567890, this would violate the design of the relvar since
Customer ID is a primary key and we already have a customer 1234567890. The DBMS must reject a transaction
such as this that would render the database inconsistent by a violation of an integrity constraint.
Foreign keys are integrity constraints enforcing that the value of the attribute set is drawn from a candidate key in
another relation, for example in the Order relation the attribute Customer ID is a foreign key. A join is the operation
that draws on information from several relations at once. By joining relvars from the example above we could query
the database for all of the Customers, Orders, and Invoices. If we only wanted the tuples for a specific customer, we
would specify this using a restriction condition.
If we wanted to retrieve all of the Orders for Customer 1234567890, we could query the database to return every row
in the Order table with Customer ID 1234567890 and join the Order table to the Order Line table based on Order No.
There is a flaw in our database design above. The Invoice relvar contains an Order No attribute. So, each tuple in the
Invoice relvar will have one Order No, which implies that there is precisely one Order for each Invoice. But in reality an
invoice can be created against many orders, or indeed for no particular order. Additionally the Order relvar contains
an Invoice No attribute, implying that each Order has a corresponding Invoice. But again this is not always true in the
Real World. An order is sometimes paid through several invoices, and sometimes paid without an invoice. In other
words there can be many Invoices per Order and many Orders per Invoice. This is a many-to-many relationship
between Order and Invoice (also called a non-specific relationship). To represent this relationship in the database a
new relvar should be introduced whose role is to specify the correspondence between Orders and Invoices:
OrderInvoice(Order No, Invoice No)
Now, the Order relvar has a one-to-many relationship to the OrderInvoice table, as does the Customer relvar. If we
want to retrieve every Invoice for a particular Order, we can query for all orders where Order No in the Order relation
equals the Order No in OrderInvoice, and where Invoice No in OrderInvoice equals the Invoice No in Invoice.
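The junction-relvar query described above can be sketched with SQLite, using the OrderInvoice design from the text; the specific order and invoice numbers are illustrative.

```python
import sqlite3

# The many-to-many relationship between Order and Invoice, resolved
# through the OrderInvoice junction relvar.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE "order"      (order_no INTEGER PRIMARY KEY);
    CREATE TABLE invoice      (invoice_no INTEGER PRIMARY KEY);
    CREATE TABLE orderinvoice (order_no INTEGER, invoice_no INTEGER);
""")
db.executemany('INSERT INTO "order" VALUES (?)', [(1,), (2,)])
db.executemany("INSERT INTO invoice VALUES (?)", [(10,), (11,)])
# Order 1 is paid through two invoices, and invoice 11 also covers
# order 2: many Invoices per Order and many Orders per Invoice.
db.executemany("INSERT INTO orderinvoice VALUES (?, ?)",
               [(1, 10), (1, 11), (2, 11)])

# Every invoice for a particular order, via the junction relvar.
invoices_for_order_1 = db.execute("""
    SELECT invoice.invoice_no
    FROM orderinvoice
    JOIN invoice ON orderinvoice.invoice_no = invoice.invoice_no
    WHERE orderinvoice.order_no = 1
    ORDER BY invoice.invoice_no
""").fetchall()
print(invoices_for_order_1)  # -> [(10,), (11,)]
```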
Database normalization is usually performed when designing a relational database, to improve the logical consistency
of the database design and the transactional performance.
There are two commonly used systems of diagramming to aid in the visual representation of the relational model: the
entity-relationship diagram (ERD), and the related IDEF diagram used in the IDEF1X method created by the U.S. Air
Force based on ERDs.
All of these kinds of database can take advantage of indexing to increase their speed. The most common kind of
index is a sorted list of the contents of some particular table column, with pointers to the row associated with the
value. An index allows a set of table rows matching some criterion to be located quickly. Various methods of indexing
are commonly used, including B-trees, hashes, and linked lists.
Relational DBMSs have the advantage that indexes can be created or dropped without changing existing applications,
because applications don't use the indexes directly. Instead, the database software decides on behalf of the
application which indexes to use. The database chooses between many different strategies based on which one it
estimates will run the fastest.
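The index transparency described above can be demonstrated with SQLite; the person table and query are illustrative.

```python
import sqlite3

# The application's query never mentions the index: the same SQL runs
# before and after the index exists, and only the DBMS's chosen access
# strategy changes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person (name TEXT, country TEXT)")
db.executemany("INSERT INTO person VALUES (?, ?)",
               [("Anna", "Sweden"), ("Bob", "USA"), ("Carl", "Sweden")])

query = "SELECT name FROM person WHERE country = 'Sweden' ORDER BY name"
before = db.execute(query).fetchall()

# Creating an index requires no change to the existing application query.
db.execute("CREATE INDEX idx_country ON person (country)")
after = db.execute(query).fetchall()

print(before == after)  # -> True: same answer, possibly a faster plan
```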
In recent years, the object-oriented paradigm has been applied to databases as well, creating a new programming
model known as object databases. These databases attempt to overcome some of the difficulties of using objects with
the SQL DBMSs. An object-oriented program allows objects of the same type to have different implementations and
behave differently, so long as they have the same interface (polymorphism). This doesn't fit well with a SQL database
where user-defined types are difficult to define and use, and where the Two Great Blunders prevail: the identification
of classes with tables (the correct identification is of classes with types, and of objects with values), and the usage of
pointers.
A variety of ways have been tried for storing objects in a database, but there is little consensus on how this should be
done. Implementing object databases undoes the benefits of the relational model by introducing pointers and making ad-hoc
queries more difficult. This is because they are essentially adaptations of obsolete network and hierarchical databases
to object-oriented programming. As a result, object databases tend to be used for specialized applications and
general-purpose object databases have not been very popular. Instead, objects are often stored in SQL databases
using complicated mapping software. At the same time, SQL DBMS vendors have added features to allow objects to
be stored more conveniently, drifting even further away from the relational model.
APPLICATIONS OF DATABASES
Databases are used in many applications, spanning virtually the entire range of computer software. Databases are the
preferred method of storage for large multiuser applications, where coordination between many users is needed. Even
individual users find them convenient, though, and many electronic mail programs and personal organizers are based
on standard database technology.
A database application is a type of computer application dedicated to managing a database. Database applications
span a huge variety of needs and purposes, from small user-oriented tools such as an address book, to huge
enterprise-wide systems for tasks like accounting.
The term "database application" usually refers to software providing a user interface to a database. The software that
actually manages the data is usually called a database management system (DBMS) or (if it is embedded) a database
engine.
Examples of database applications include MySQL, Microsoft Access, dBASE, FileMaker, Oracle, Informix, and (to
some degree) HyperCard.
In March 2004, AMR Research (as cited by CNET News.com) predicted
that open source database applications would come into wide acceptance in 2006.
In addition to their data model, most practical databases attempt to enforce a database transaction model that has
desirable data integrity properties. Ideally, the database software should enforce the ACID rules, summarised here:
Atomicity - either all or no operations are completed. (Transactions that can't be finished must be
completely undone.)
Consistency - all transactions must leave the database in a consistent state.
Isolation - transactions can't interfere with each other's work and incomplete work isn't visible to other
transactions.
Durability - successful transactions must persist through crashes.
In practice, many DBMSs allow most of these rules to be relaxed for better performance.
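The atomicity rule above can be demonstrated with SQLite; the account transfer and the simulated crash are illustrative.

```python
import sqlite3

# Atomicity: if a transfer fails midway, the rollback undoes the partial
# work, leaving the database as if the transaction never ran.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE account (owner TEXT PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
db.commit()

def transfer(amount, fail_midway=False):
    """Debit alice, credit bob; optionally 'crash' between the two steps."""
    try:
        db.execute("UPDATE account SET balance = balance - ? "
                   "WHERE owner = 'alice'", (amount,))
        if fail_midway:
            raise RuntimeError("simulated crash before crediting bob")
        db.execute("UPDATE account SET balance = balance + ? "
                   "WHERE owner = 'bob'", (amount,))
        db.commit()        # both operations completed: make them durable
    except RuntimeError:
        db.rollback()      # all-or-nothing: the debit is undone as well

transfer(60, fail_midway=True)
balances = dict(db.execute("SELECT owner, balance FROM account"))
print(balances)  # -> {'alice': 100, 'bob': 0}
```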
Concurrency control is a method used to ensure that transactions are executed safely and follow the
ACID rules. The DBMS must ensure that only serializable, recoverable schedules are allowed, and
that no actions of committed transactions are lost while undoing aborted transactions.