END213E Lecturenotes Week2

The document discusses the evolution and importance of database systems in various applications, highlighting how they facilitate user interaction and data management in enterprises. It outlines the major disadvantages of traditional file-processing systems, such as data redundancy, difficulty in accessing data, and security issues, which database systems aim to resolve. Additionally, it introduces different data models, including the relational model and entity-relationship model, that underpin modern database management systems.

Uploaded by alnadalsancak

Database-System Applications

Last time, we discussed some examples of database-system applications.


Databases form an essential part not only of every enterprise but also of much of a person's
daily activities.
The ways in which people interact with databases have changed over time. Early databases were
maintained as back-office systems with which users interacted via printed reports and paper forms
for input.
As database systems became more sophisticated, better languages were developed for
programmers to interact with the data, along with user interfaces that allowed end users within the
enterprise to query and update data.
Database-System Applications
As the support for programmer interaction with databases improved, and computer hardware
performance increased, more sophisticated applications emerged that brought database data into
more direct contact not only with end users within an enterprise but also with the general public.

Whereas once bank customers had to interact with a teller for every transaction, automated-teller
machines (ATMs) allowed direct customer interaction.

Today, virtually every enterprise employs web applications or mobile applications to allow its
customers to interact directly with the enterprise’s database, and, thus, with the enterprise itself.
Database-System Applications
For instance, when you read a social-media post, or access an online bookstore and browse
a book or music collection, you are accessing data stored in a database.
When you enter an order online, your order is stored in a database.
When you access a bank web site and retrieve your bank balance and transaction
information, the information is retrieved from the bank’s database system.
When you access a web site, information about you may be retrieved from a database to
select which advertisements you should see.
Almost every interaction with a smartphone results in some sort of database access.
Furthermore, data about your web accesses may be stored in a database.
Accessing databases forms an essential part of almost everyone’s life today.
Database-System Applications
Broadly speaking, there are two modes in which databases are used.
• The first mode is to support online transaction processing, where a large number of users use
the database, with each user retrieving relatively small amounts of data, and performing small
updates. This is the primary mode of use for the vast majority of users of database applications
such as those that we outlined earlier.
• The second mode is to support data analytics, that is, the processing of data to draw conclusions,
and infer rules or decision procedures, which are then used to drive business decisions.
Purpose of Database Systems
To understand the purpose of database systems, consider part of a university organization that keeps
information about all instructors, students, departments, and course offerings.
One way to keep the information on a computer is to store it in operating-system files. To allow users
to manipulate the information, the system has a number of application programs that manipulate
the files, including programs to:
• Add new students, instructors, and courses.
• Register students for courses and generate class rosters.
• Assign grades to students, compute grade point averages (GPA), and generate transcripts.

Programmers develop these application programs to meet the needs of the university.
Purpose of Database Systems
New application programs are added to the system as the need arises.
For example, suppose that a university decides to create a new major. As a result,
- The university creates a new department and creates new permanent files (or adds information to
existing files) to record information about all the instructors in the department, students in that
major, course offerings, degree requirements, and so on.
- The university may have to write new application programs to deal with rules specific to the new
major. New application programs may also have to be written to handle new rules in the university.

Thus, as time goes by, the system acquires more files and more application programs.
Purpose of Database Systems
This typical file-processing system is supported by a conventional operating system. The system
stores permanent records in various files, and it needs different application programs to extract
records from, and add records to, the appropriate files.
Purpose of Database Systems
Keeping organizational information in a file-processing system has a number of major disadvantages:
• Data redundancy and inconsistency
• Difficulty in accessing data
• Data isolation
• Integrity problems
• Atomicity problems
• Concurrent access anomalies
• Security problems

Database systems offer solutions to all of the above problems.


Major disadvantages of file-processing systems
Data redundancy and inconsistency. Since different programmers create the files and application
programs over a long period, the various files are likely to have different structures, and the
programs may be written in several programming languages. Moreover, the same information may
be duplicated in several places (files).

For example, if a student has a double major (say, music and mathematics), the address and
telephone number of that student may appear in a file that consists of student records of students
in the Music department and in a file that consists of student records of students in the
Mathematics department. This redundancy leads to higher storage and access cost. In addition, it
may lead to data inconsistency; that is, the various copies of the same data may no longer agree.
For example, a changed student address may be reflected in the Music department records but not
elsewhere in the system.
Major disadvantages of file-processing systems
Difficulty in accessing data. Suppose that one of the university clerks needs to find out the names of all
students who live within a particular postal-code area. The clerk asks the data-processing department to
generate such a list.
Because the designers of the original system did not anticipate this request, there is no application program on
hand to meet it. There is, however, an application program to generate the list of all students.

The university clerk now has two choices: either obtain the list of all students and extract the needed
information manually or ask a programmer to write the necessary application program. Both alternatives are
obviously unsatisfactory. Suppose that such a program is written and that, several days later, the same clerk
needs to trim that list to include only those students who have taken at least 60 credit hours. As expected, a
program to generate such a list does not exist. Again, the clerk has the preceding two options, neither of which
is satisfactory.
The point here is that conventional file-processing environments do not allow needed data to be retrieved in a
convenient and efficient manner. More responsive data-retrieval systems are required for general use.
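In a database system, both of the clerk's requests reduce to short declarative queries rather than new application programs. The following is a minimal sketch using Python's built-in sqlite3 module; the `student` table, its columns, and the data in it are hypothetical, chosen only to mirror the scenario above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, postal_code TEXT, tot_cred INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [("Zhang", "34469", 102),
                  ("Shankar", "34469", 32),
                  ("Brown", "06800", 80)])

# First request: names of all students in a particular postal-code area.
in_area = [r[0] for r in conn.execute(
    "SELECT name FROM student WHERE postal_code = '34469'")]

# Several days later: trim the list to students with at least 60 credit hours.
# One extra condition in the query -- not a new application program.
trimmed = [r[0] for r in conn.execute(
    "SELECT name FROM student WHERE postal_code = '34469' AND tot_cred >= 60")]

print(in_area)   # ['Zhang', 'Shankar']
print(trimmed)   # ['Zhang']
```

Neither request required a programmer: the second query is the first one with a single added condition.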
Major disadvantages of file-processing systems
Data isolation. Because data are scattered in various files, and files may be in different formats,
writing new application programs to retrieve the appropriate data is difficult.

Integrity problems. The data values stored in the database must satisfy certain types of consistency
constraints. Suppose the university maintains an account for each department, and records the
balance amount in each account. Suppose also that the university requires that the account balance
of a department may never fall below zero. Developers enforce these constraints in the system by
adding appropriate code in the various application programs. However, when new constraints are
added, it is difficult to change the programs to enforce them. The problem is compounded when
constraints involve several data items from different files.
Major disadvantages of file-processing systems
Atomicity problems. A computer system, like any other device, is subject to failure. In many
applications, it is crucial that, if a failure occurs, the data be restored to the consistent state that
existed prior to the failure.
Consider a banking system with a program to transfer $500 from account A to account B. If a system
failure occurs during the execution of the program, it is possible that the $500 was removed from
the balance of account A but was not credited to the balance of account B, resulting in an
inconsistent database state.
Clearly, it is essential to database consistency that either both the credit and debit occur, or that
neither occur. That is, the funds transfer must be atomic—it must happen in its entirety or not at all.
It is difficult to ensure atomicity in a conventional file-processing system.
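The transfer example above can be sketched with transactions in Python's sqlite3; the account names and balances are illustrative, and the "system failure" is simulated with an exception. Rolling back the transaction restores the consistent state that existed before the partial debit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 1000), ("B", 2000)])
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    """Debit src and credit dst atomically: both happen, or neither does."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        if fail_midway:
            raise RuntimeError("simulated failure between debit and credit")
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (amount, dst))
        conn.commit()      # both updates become permanent together
    except Exception:
        conn.rollback()    # undo the partial debit; prior state is restored

transfer(conn, "A", "B", 500, fail_midway=True)
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(balances)  # {'A': 1000, 'B': 2000} -- the failed transfer left no trace
```

In a file-processing system, the half-finished debit would already be on disk when the failure occurred, with no automatic way to undo it.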
Major disadvantages of file-processing systems
Concurrent-access anomalies. Many systems allow multiple users to update the data simultaneously. Indeed,
the largest internet retailers may have millions of accesses per day to their data by shoppers. In such an
environment, interaction of concurrent updates is possible and may result in inconsistent data.
Consider account A, with a balance of $10,000. If two bank clerks debit the account balance (by say $500 and
$100, respectively) of account A at almost exactly the same time, the result of the concurrent executions may
leave the account balance in an incorrect (or inconsistent) state.

Suppose that the programs executing on behalf of each withdrawal read the old balance, reduce that value by
the amount being withdrawn, and write the result back. If the two programs run concurrently, they may both
read the value $10,000, and write back $9500 and $9900, respectively. Depending on which one writes the
value last, the balance of account A may contain either $9500 or $9900, rather than the correct value of $9400.
To guard against this possibility, the system must maintain some form of supervision. But supervision is difficult
to provide because data may be accessed by many different application programs that have not been
coordinated previously.
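The lost update described above can be reproduced deterministically by interleaving the read-modify-write steps of the two withdrawals by hand; the balances are the ones from the text, and the step ordering is contrived to show the bad case:

```python
balance = 10_000                 # account A, shared by both withdrawal programs

# The bad interleaving: both programs read the old balance before either writes.
read_by_first = balance          # first program reads 10000
read_by_second = balance         # second program also reads 10000

balance = read_by_first - 500    # first program writes back 9500
balance = read_by_second - 100   # second writes back 9900, overwriting the first

print(balance)  # 9900 -- the $500 debit is lost; the correct balance is 9400
```

A DBMS prevents this interleaving with concurrency control (for example, locking), so that one withdrawal's read-modify-write completes before the other's begins.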
Major disadvantages of file-processing systems
Security problems. Not every user of the database system should be able to access all the data.
For example, in a university, payroll personnel need to see only that part of the database that has
financial information. They do not need access to information about academic records.
But since application programs are added to the file-processing system in an ad hoc manner,
enforcing such security constraints is difficult.
These difficulties, among others, prompted both the initial development of database systems and
the transition of file-based applications to database systems, back in the 1960s and 1970s.

Now, we will see the concepts and algorithms that enable database systems to solve the problems
with file-processing systems.
A data model is a collection of high-level data description constructs that hide many low-level
storage details.

A DBMS allows a user to define the data to be stored in terms of a data model.

Most database management systems today are based on the relational data model, which we will
focus on in this course.
While the data model of the DBMS hides many details, it is nonetheless closer to how the DBMS
stores data than to how a user thinks about the underlying enterprise.
A semantic data model is a more abstract, high-level data model that makes it easier for a user to
come up with a good initial description of the data in an enterprise.
These models contain a wide variety of constructs that help describe a real application scenario.
A database design in terms of a semantic model serves as a useful starting point and is
subsequently translated into a database design in terms of the data model the DBMS actually
supports.
A widely used semantic data model called the entity-relationship (ER) model allows us to pictorially
denote entities and the relationships among them.
Data Models

 A collection of conceptual tools for describing


• Data
• Data relationships
• Data semantics
• Data constraints
There are a number of different data models, which can be classified into four categories:
 Relational model
 Entity-Relationship data model (mainly for database design)
 Object-based data models (Object-oriented and Object-relational)
 Semi-structured data model (XML)
 Other older models:
• Network model
• Hierarchical model
Data Models

Hierarchical Model
The earliest databases followed the hierarchical model, which evolved from the file systems that
the databases replaced, with records arranged in a hierarchy much like an organization chart.
Each file from the flat file system became a record type, or node in hierarchical terminology.
Records were connected using pointers that contained the address of the related record. Pointers
told the computer system where the related record was physically located.
Each pointer establishes a parent-child relationship, also called a one-to-many relationship, in
which one parent can have many children, but each child can have only one parent.
The obvious problem with the hierarchical model is that some data does not exactly fit this strict
hierarchical structure, such as an order that must have the customer who placed the order as one
parent and the employee who accepted the order as another.
Data Models

Network Model
The network database model evolved at around the same time as the hierarchical database
model.
As with the hierarchical model, record types (or simply records) depict what would be separate
files in a flat file system, and those records are related using one-to-many relationships, called
owner-member relationships or sets in network model terminology.
Data Models

Relational Model
The relational model uses a collection of tables to represent both data and the relationships
among those data.
Each table has multiple columns, and each column has a unique name. Tables are also known as
relations.
The relational model is an example of a record-based model.
Record-based models are so named because the database is structured in fixed-format records of
several types.
Each table contains records of a particular type. Each record type defines a fixed number of fields,
or attributes. Each row in the table is called a tuple; a row contains all the information about one
instance of the object.
The columns of the table correspond to the attributes of the record type. The relational data
model is the most widely used data model, and a vast majority of current database systems are
based on the relational model.
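A relation, its attributes, and its tuples can be shown concretely. The sketch below uses Python's sqlite3 and the `instructor` schema that appears later in these notes; the two rows inserted are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The table (relation): each column is an attribute of the record type.
conn.execute("""CREATE TABLE instructor (
    ID        CHAR(5),
    name      VARCHAR(20),
    dept_name VARCHAR(20),
    salary    NUMERIC(8,2))""")

# Each inserted row is one tuple: all the information about one instructor.
conn.executemany("INSERT INTO instructor VALUES (?, ?, ?, ?)",
                 [("10101", "Srinivasan", "Comp. Sci.", 65000),
                  ("12121", "Wu", "Finance", 90000)])

cur = conn.execute("SELECT * FROM instructor")
columns = [d[0] for d in cur.description]  # the attributes
rows = cur.fetchall()                      # the tuples
print(columns)    # ['ID', 'name', 'dept_name', 'salary']
print(len(rows))  # 2
```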
Data Models

Entity-Relationship Model
The entity-relationship (E-R) data model uses a collection of basic objects, called entities, and
relationships among these objects.
An entity is a “thing” or “object” in the real world that is distinguishable from other objects.
The entity-relationship model is widely used in database design.
Data Models

Semi-structured Data Model


The semi-structured data model permits the specification of data where individual data items of
the same type may have different sets of attributes.
This is in contrast to the data models mentioned earlier, where every data item of a particular
type must have the same set of attributes.
JSON and Extensible Markup Language (XML) are widely used semi-structured data
representations.
Relational Model

 All the data is stored in various tables.


 Example of tabular data in the relational model

[Figure: a sample relational database, shown as a table with its columns and rows labeled.]
Data Abstraction
For the system to be usable, it must retrieve data efficiently.
The need for efficiency has led database system developers to use complex data structures to
represent data in the database.
Since many database-system users are not computer trained, developers hide the complexity
from users through several levels of data abstraction, to simplify users’ interactions with the
system.
Layers of Data Abstraction

Databases are unique in their ability to present multiple users with their own distinct views of the
data while storing the underlying data only once. These are collectively called user views.
A user in this context is any person or application that signs onto the database for the purpose of
storing and/or retrieving data.
An application is a set of computer programs designed to solve a particular business problem,
such as an order-entry system, a payroll-processing system, or an accounting system.
User views for an order entry system would include the online shopping web page, web pages
listing preferences and previous orders, printed invoices and packing slips, and so forth.
Layers of Data Abstraction

When an electronic spreadsheet application such as Microsoft Excel is used, all users must share
a common view of the data, and that view must match the way the data is physically stored in the
underlying data file.

If a user hides some columns in a spreadsheet, reorders the rows, and saves the spreadsheet for
future use, the next user who opens the spreadsheet will view the data in the manner in which
the first user saved it.

An alternative, of course, is for each user to save his or her copy in a separate physical file, but
then as one user applies updates, the other users’ data becomes out of date.

Database systems present each user a view of the same data, but the views can be tailored to the
needs of the individual users, even though they all come from one commonly stored copy of the
data. Because views store no actual data, they automatically reflect any data changes made to
the underlying database objects. This is all possible through layers of abstraction.
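The "single stored copy, many tailored views" idea can be sketched in sqlite3; the table, view, and data below are hypothetical. Because the view stores no data of its own, the change made to the underlying table appears in the view automatically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [(1, "Acme", 250.0, "shipped"), (2, "Zenith", 90.0, "open")])

# A view is a named query over the single stored copy -- no data is duplicated.
conn.execute("""CREATE VIEW open_orders AS
                SELECT id, customer FROM orders WHERE status = 'open'""")

print(conn.execute("SELECT customer FROM open_orders").fetchall())
# [('Zenith',)]

# A change to the underlying table is reflected in the view automatically.
conn.execute("UPDATE orders SET status = 'open' WHERE id = 1")
print(conn.execute("SELECT customer FROM open_orders ORDER BY id").fetchall())
# [('Acme',), ('Zenith',)]
```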
Data Abstraction

Physical level. The lowest level of abstraction describes how the data are actually stored. The
physical level describes complex low-level data structures in detail.
The physical layer contains the data files that hold all the data for the database. Nearly all
modern DBMSs allow the database to be stored in multiple data files, which are usually spread
over multiple physical disk drives. With this arrangement, the disk drives can work in parallel for
maximum performance. A notable exception among the DBMSs is Microsoft Access, which stores
the entire database in a single physical file. While it simplifies database use on a single-user
personal computer system, this arrangement limits the ability of the DBMS to scale to
accommodate many concurrent users of the database, making it inappropriate as a solution for
large enterprise systems.
Data Abstraction

Physical level.
The database user does not need to understand how the data is actually stored within the data
files or even which file contains the data item(s) of interest. In most organizations, a technician
known as a database administrator (DBA) handles the details of installing and configuring the
database software and data files and of making the database available to users. The DBMS works
with the computer’s operating system to manage the data files automatically, including all file
opening, closing, reading, and writing operations.
The database user should not be required to refer to physical data files when using a database,
which is in sharp contrast to spreadsheets and word processing, where the user must consciously
save the document(s) and choose filenames and storage locations. Many of the personal
computer–based DBMSs are exceptions to this tenet because the user is required to locate and
open a physical file as part of the process of signing onto the DBMS. Conversely, with enterprise
class DBMSs (such as Oracle, Sybase, Microsoft SQL Server, DB2, and MySQL), the physical files
are managed automatically, and the database user never needs to refer to them when using the
database.
Data Abstraction
Logical level. The next-higher level of abstraction describes what data are stored in the database,
and what relationships exist among those data. The logical level thus describes the entire
database in terms of a small number of relatively simple structures. Although implementation of
the simple structures at the logical level may involve complex physical-level structures, the user
of the logical level does not need to be aware of this complexity. This is referred to as physical
data independence. Database administrators, who must decide what information to keep in the
database, use the logical level of abstraction.

The logical layer or logical model comprises the first of two layers of abstraction in the database:
the physical layer has a concrete existence in the operating system files, whereas the logical layer
exists only as abstract data structures assembled from the physical layer as needed. The DBMS
transforms the data in the data files into a common structure. This layer is sometimes called the
schema, a term used for the collection of all the data items stored in a particular database or
belonging to a particular database user. Depending on the particular DBMS, this layer can contain
a set of two-dimensional tables, a hierarchical structure similar to a company’s organization
chart, or some other structure.
Data Abstraction
View level. The highest level of abstraction describes only part of the entire database. Even
though the logical level uses simpler structures, complexity remains because of the variety of
information stored in a large database. Many users of the database system do not need all this
information; instead, they need to access only a part of the database. The view level of
abstraction exists to simplify their interaction with the system. The system may provide many
views for the same database.

The external layer or external model is the second layer of abstraction in the database. This layer
is composed of the user views discussed earlier, which are collectively called the subschema. In
this layer, the database users (application programs as well as individuals) that access the
database connect and issue queries against the database. Ideally, only the DBA deals with the
physical and logical layers. The DBMS handles the transformation of selected items from one or
more data structures in the logical layer to form each user view. The user views in this layer can
be predefined and stored in the database for reuse, or they can be temporary items that are built
by the DBMS to hold the results of a single ad hoc database query until they are no longer needed
by the database user. An ad hoc query is a query that is not preconceived and that is not likely to
be reused.
Instances and Schemas

Databases change over time as information is inserted and deleted.


The collection of information stored in the database at a particular moment is called an instance of the
database.
The overall design of the database is called the database schema. A description of data in terms of a data
model is called a schema.
Instances and Schemas

Database systems have several schemas, partitioned according to the levels of abstraction.

The physical schema describes the database design at the physical level, while the logical schema
describes the database design at the logical level.
A database may also have several schemas at the view level, sometimes called subschemas, that
describe different views of the database.
Of these, the logical schema is by far the most important in terms of its effect on application
programs, since programmers construct applications by using the logical schema. The physical
schema is hidden beneath the logical schema and can usually be changed easily without affecting
application programs.
Instances and Schemas

 Similar to types and variables in programming languages


 Logical Schema – the overall logical structure of the database
• Example: The database consists of information about a set of customers and accounts in a bank and the
relationship between them
 Analogous to type information of a variable in a program
 Physical schema – the overall physical structure of the database
 Instance – the actual content of the database at a particular point in time
• Analogous to the value of a variable
Physical Data Independence

 Physical Data Independence – the ability to modify the physical schema without changing the logical
schema
• Applications depend on the logical schema
• In general, the interfaces between the various levels and components should be well defined so that
changes in some parts do not seriously influence others.

 Logical data independence. The ability to make changes to the logical layer without disrupting existing
users and processes is called logical data independence.
Layers of Data Abstraction

The Conceptual Layer


The conceptual layer was intended to be a layer
of abstraction above the logical layer that
described all the data in the database as well as
the relationships among the data, but in a
manner that was independent of any particular
type of database system.
A conceptual data model is a high-level model that captures
data and relationship concepts in a technology-independent
manner. Figure shows a simple conceptual model for a
fictitious company.
Each rectangle represents an entity, which is a person,
place, thing, event, or concept about which the organization
collects data.
You likely noticed that the CORPORATE CUSTOMER and
INDIVIDUAL CUSTOMER rectangles appear within the
CUSTOMER rectangle. This denotes a subtype, meaning an
entity that represents a subset of the things represented by
the containing entity, also called a supertype. In this case,
the two subtypes mean that a customer is either a corporate
customer or an individual customer—always one or the other,
and never both.
The line between the CUSTOMER and CUSTOMER
CONTACT entities is a relationship. For now, suffice it to say
that this relationship line and the symbols on it indicate that
one customer can have any number of customer contacts
(including none at all), but that each customer contact has
one and only one customer related to it.
A logical data model is a data model tailored to a particular
type of database management system, such as relational,
object-relational, object-oriented, hierarchical, or network.
Figure shows the logical data model. Be aware that there
are many variations in the way these diagrams are drawn,
so this is just one example.
Each large rectangle represents an entity, with the name of
the entity just above the rectangle. Within each large
rectangle are two smaller rectangles formed by the
horizontal line that divides the large rectangle into two. The
names inside the rectangles form the list of the attributes
that are included in the entity. An attribute is a fact that
characterizes or describes the entity in some way.
A physical data model is a data model that is tailored to the
features and constraints of a particular database
management system (DBMS), such as MySQL, Oracle, or
Microsoft SQL Server.
Figure 1-5 shows a very simple physical model in the form of
four relational database tables that correspond to the four
entities for our fictitious company.
Database Languages

A database system provides a data-definition language (DDL) to specify the database schema and
a data-manipulation language (DML) to express database queries and updates.
In practice, the data-definition and data-manipulation languages are not two separate languages;
instead they simply form parts of a single database language, such as the SQL language.
Data-Definition Language

We specify a database schema by a set of definitions expressed by a special language called a
data-definition language (DDL). The DDL is also used to specify additional properties of the data.

We specify the storage structure and access methods used by the database system by a set of
statements in a special type of DDL called a data storage and definition language. These
statements define the implementation details of the database schemas, which are usually hidden
from the users.

The data values stored in the database must satisfy certain consistency constraints. For example,
suppose the university requires that the account balance of a department must never be
negative. The DDL provides facilities to specify such constraints. The database system checks
these constraints every time the database is updated. In general, a constraint can be an arbitrary
predicate pertaining to the database. However, arbitrary predicates may be costly to test.
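The non-negative-balance rule can be declared once in the DDL, after which the system itself checks it on every update. A sketch in sqlite3, with an illustrative table name and balance:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The constraint is specified in the DDL, not re-coded in every application program.
conn.execute("""CREATE TABLE dept_account (
    dept_name TEXT PRIMARY KEY,
    balance   INTEGER CHECK (balance >= 0))""")
conn.execute("INSERT INTO dept_account VALUES ('Physics', 120000)")

try:
    # Checked by the database system every time the database is updated.
    conn.execute("UPDATE dept_account SET balance = balance - 200000 "
                 "WHERE dept_name = 'Physics'")
except sqlite3.IntegrityError as e:
    print("rejected:", e)   # the update would drive the balance below zero
```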
Data-Definition Language

Thus, database systems implement only those integrity constraints that can be tested with
minimal overhead:
 Domain Constraints. A domain of possible values must be associated with every attribute (for
example, integer types, character types, date/time types). Declaring an attribute to be of a
particular domain acts as a constraint on the values that it can take. Domain constraints are
the most elementary form of integrity constraint. They are tested easily by the system
whenever a new data item is entered into the database.
 Referential Integrity. There are cases where we wish to ensure that a value that appears in
one relation for a given set of attributes also appears in a certain set of attributes in another
relation (referential integrity). For example, the department listed for each course must be
one that actually exists in the university. More precisely, the dept_name value in a course
record must appear in the dept_name attribute of some record of the department relation.
Database modifications can cause violations of referential integrity. When a referential-
integrity constraint is violated, the normal procedure is to reject the action that caused the
violation.
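The course/department example can be sketched with a foreign-key constraint in sqlite3 (note that SQLite enforces referential integrity only after `PRAGMA foreign_keys = ON` is issued); the rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE department (dept_name TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE course (
    course_id TEXT PRIMARY KEY,
    dept_name TEXT REFERENCES department(dept_name))""")

conn.execute("INSERT INTO department VALUES ('Comp. Sci.')")
conn.execute("INSERT INTO course VALUES ('CS-101', 'Comp. Sci.')")  # dept exists: accepted

try:
    # The listed department does not exist, so the action is rejected.
    conn.execute("INSERT INTO course VALUES ('EE-181', 'Elec. Eng.')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)   # FOREIGN KEY constraint failed
```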
Data-Definition Language

Thus, database systems implement only those integrity constraints that can be tested with
minimal overhead:
 Authorization. We may want to differentiate among the users as far as the type of access they
are permitted on various data values in the database. These differentiations are expressed in
terms of authorization, the most common being: read authorization, which allows reading, but
not modification, of data; insert authorization, which allows insertion of new data, but not
modification of existing data; update authorization, which allows modification, but not
deletion, of data; and delete authorization, which allows deletion of data. We may assign the
user all, none, or a combination of these types of authorization.
Data-Definition Language
The processing of DDL statements, just like those of any other programming language, generates
some output.
The output of the DDL is placed in the data dictionary, which contains metadata—that is, data
about data.
The data dictionary is considered to be a special type of table that can be accessed and updated
only by the database system itself (not a regular user).
The database system consults the data dictionary before reading or modifying actual data.
Data Definition Language (DDL)
 Specification notation for defining the database schema
Example: create table instructor (
ID char(5),
name varchar(20),
dept_name varchar(20),
salary numeric(8,2))
 DDL compiler generates a set of table templates stored in a data dictionary
 Data dictionary contains metadata (i.e., data about data)
• Database schema
• Integrity constraints
 Primary key (ID uniquely identifies instructors)
• Authorization
 Who can access what
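The instructor table above and the data dictionary can be illustrated with sqlite3: the DDL's output lands in SQLite's catalog table sqlite_master, which plays the role of the data dictionary. This is a minimal sketch, not the full DDL-compiler pipeline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE instructor (
    ID        char(5) PRIMARY KEY,
    name      varchar(20),
    dept_name varchar(20),
    salary    numeric(8,2))""")

# The output of the DDL is placed in the data dictionary; in SQLite that
# catalog is the sqlite_master table, maintained by the system itself.
table_name, table_sql = conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'").fetchone()
```

Querying sqlite_master returns the stored metadata, including the table template the DDL compiler generated.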
Data-Manipulation Language
A data-manipulation language (DML) is a language that enables users to access or manipulate data as organized by the appropriate data model.
The types of access are:
 Retrieval of information stored in the database.
 Insertion of new information into the database.
 Deletion of information from the database.
 Modification of information stored in the database.
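The four types of access can be sketched with sqlite3; the instructor rows and amounts are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (ID TEXT, name TEXT, salary REAL)")

# Insertion of new information
conn.execute("INSERT INTO instructor VALUES ('10101', 'Srinivasan', 65000)")
conn.execute("INSERT INTO instructor VALUES ('12121', 'Wu', 90000)")

# Modification of information stored in the database
conn.execute("UPDATE instructor SET salary = salary + 3250 WHERE ID = '10101'")
raised = conn.execute(
    "SELECT salary FROM instructor WHERE ID = '10101'").fetchone()[0]

# Retrieval of information stored in the database
names = [r[0] for r in conn.execute("SELECT name FROM instructor ORDER BY name")]

# Deletion of information from the database
conn.execute("DELETE FROM instructor WHERE ID = '12121'")
remaining = conn.execute("SELECT count(*) FROM instructor").fetchone()[0]
```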
Data-Manipulation Language
There are basically two types of data-manipulation language:
• Procedural DMLs require a user to specify what data are needed and how to get those data.
• Declarative DMLs (also referred to as nonprocedural DMLs) require a user to specify what data
are needed without specifying how to get those data.
Declarative DMLs are usually easier to learn and use than are procedural DMLs. However, since a
user does not have to specify how to get the data, the database system has to figure out an
efficient means of accessing data.
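The contrast can be sketched as follows: a declarative SQL query states what is wanted, while a hand-written loop stands in for the procedural style that spells out how to get it. The sample rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (name TEXT, dept_name TEXT)")
conn.executemany("INSERT INTO instructor VALUES (?, ?)",
                 [("Srinivasan", "Comp. Sci."), ("Wu", "Finance"),
                  ("Katz", "Comp. Sci.")])

# Declarative: state WHAT data are needed; the system decides how to get them.
declarative = [r[0] for r in conn.execute(
    "SELECT name FROM instructor WHERE dept_name = 'Comp. Sci.'")]

# Procedural style: the program spells out HOW -- scan every row and
# filter by hand.
procedural = []
for name, dept in conn.execute("SELECT name, dept_name FROM instructor"):
    if dept == "Comp. Sci.":
        procedural.append(name)
```

Both produce the same answer, but only in the procedural version does the programmer take responsibility for the access strategy.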
Data-Manipulation Language
A query is a statement requesting the retrieval of information.
The portion of a DML that involves information retrieval is called a query language.
Although technically incorrect, it is common practice to use the terms query language and data-
manipulation language synonymously.
There are a number of database query languages in use, either commercially or experimentally.
We will study the most widely used query language, SQL.
Data Manipulation Language (DML)
 Language for accessing and updating the data organized by the appropriate data model
• DML also known as query language
 There are basically two types of data-manipulation language
• Procedural DML -- require a user to specify what data are needed and how to get those
data.
• Declarative DML -- require a user to specify what data are needed without specifying how
to get those data.
 Declarative DMLs are usually easier to learn and use than are procedural DMLs.
 Declarative DMLs are also referred to as non-procedural DMLs
 The portion of a DML that involves information retrieval is called a query language.
Database Access from Application Program
 Non-procedural query languages such as SQL are not as powerful as a universal Turing machine.
 SQL does not support actions such as input from users, output to displays, or
communication over the network.
 Such computations and actions must be written in a host language, such as C/C++,
Java or Python, with embedded SQL queries that access the data in the database.
 Application programs -- are programs that are used to interact with the database in
this fashion. Examples in a university system are programs that allow students to
register for courses, generate class rosters, calculate student GPA, generate payroll
checks, and perform other tasks.
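An application program of this kind can be sketched in a host language (Python here) with embedded SQL. The register_student function and the takes table layout are illustrative, not from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE takes (student_id TEXT, course_id TEXT)")

def register_student(student_id, course_id):
    """Host-language logic wrapped around an embedded SQL statement."""
    conn.execute("INSERT INTO takes VALUES (?, ?)", (student_id, course_id))
    # Producing output for the user is the host language's job, not SQL's.
    return f"Student {student_id} registered for {course_id}"

message = register_student("12345", "CS-101")
count = conn.execute("SELECT count(*) FROM takes").fetchone()[0]
```

The SQL statement accesses the data in the database, while input handling and output to the user live in the host language.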
Database Design
Database systems are designed to manage large bodies of information. These large bodies of
information do not exist in isolation. They are part of the operation of some enterprise whose
end product may be information from the database or may be some device or service for which
the database plays only a supporting role.
Database design mainly involves the design of the database schema. The design of a complete
database application environment that meets the needs of the enterprise being modeled
requires attention to a broader set of issues. In this text, we focus on the writing of database
queries and the design of database schemas.
Database Design
A high-level data model provides the database designer with a conceptual framework in which to
specify the data requirements of the database users and how the database will be structured to
fulfill these requirements.
The initial phase of database design, then, is to characterize fully the data needs of the
prospective database users. The database designer needs to interact extensively with domain
experts and users to carry out this task. The outcome of this phase is a specification of user
requirements.
Next, the designer chooses a data model, and by applying the concepts of the chosen data model,
translates these requirements into a conceptual schema of the database.
The schema developed at this conceptual-design phase provides a detailed overview of the
enterprise. The designer reviews the schema to confirm that all data requirements are indeed
satisfied and are not in conflict with one another. The designer can also examine the design to
remove any redundant features. The focus at this point is on describing the data and their
relationships, rather than on specifying physical storage details.
Database Design
In terms of the relational model, the conceptual-design process involves decisions on
- what attributes we want to capture in the database and
- how to group these attributes to form the various tables.
For the “how” part, there are principally two ways to tackle the problem. The first is to use the entity-relationship model; the other is to employ a set of algorithms (collectively known as normalization) that takes as input the set of all attributes and generates a set of tables.
Database Design
A fully developed conceptual schema indicates the functional requirements of the enterprise. In a
specification of functional requirements, users describe the kinds of operations (or transactions)
that will be performed on the data. Example operations include modifying or updating data,
searching for and retrieving specific data, and deleting data. At this stage of conceptual design,
the designer can review the schema to ensure it meets functional requirements.
The process of moving from an abstract data model to the implementation of the database
proceeds in two final design phases. In the logical-design phase, the designer maps the high-level
conceptual schema onto the implementation data model of the database system that will be
used. The designer uses the resulting system-specific database schema in the subsequent
physical-design phase, in which the physical features of the database are specified. These
features include the form of file organization and the internal storage structures.
Database Design
The process of designing the general structure of the database:
 Logical Design – Deciding on the database schema. Database design requires that
we find a “good” collection of relation schemas.
• Business decision – What attributes should we record in the database?
• Computer Science decision – What relation schemas should we have and how
should the attributes be distributed among the various relation schemas?
 Physical Design – Deciding on the physical layout of the database
Database Design
The database design process can be divided into six steps.
1. Requirements Analysis: The very first step in designing a database application is to understand
what data is to be stored in the database, what applications must be built on top of it, and what
operations are most frequent and subject to performance requirements. In other words, we must
find out what the users want from the database. This is usually an informal process that involves
discussions with user groups, a study of the current operating environment and how it is
expected to change, analysis of any available documentation on existing applications that are
expected to be replaced or complemented by the database, and so on.
Database Design
2. Conceptual Database Design: The information gathered in the requirements analysis step is
used to develop a high-level description of the data to be stored in the database, along with the
constraints known to hold over this data. This step is often carried out using the ER model and is
discussed in the following weeks. The ER model is one of several high-level, or semantic, data
models used in database design. The goal is to create a simple description of the data that closely
matches how users and developers think of the data (and the people and processes to be
represented in the data). This facilitates discussion among all the people involved in the design
process, even those who have no technical background. At the same time, the initial design must
be sufficiently precise to enable a straightforward translation into a data model supported by a
commercial database system (which, in practice, means the relational model).
Database Design
3. Logical Database Design: We must choose a DBMS to implement our database design, and
convert the conceptual database design into a database schema in the data model of the chosen
DBMS. We will consider only relational DBMSs, and therefore, the task in the logical design step is
to convert an ER schema into a relational database schema. The result is a conceptual schema,
sometimes called the logical schema, in the relational data model.
Database Design
4. Schema Refinement: The fourth step of database design is to analyze the collection of
relations in our relational database schema to identify potential problems, and to refine it. In
contrast to the requirements analysis and conceptual design steps, which are essentially
subjective, schema refinement can be guided by some elegant and powerful theory. We will
discuss the theory of normalizing relations-restructuring them to ensure some desirable
properties.
Database Design
5. Physical Database Design: In this step, we consider typical expected workloads that our
database must support and further refine the database design to ensure that it meets desired
performance criteria. This step may simply involve building indexes on some tables and clustering
some tables, or it may involve a substantial redesign of parts of the database schema obtained
from the earlier design steps.
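A physical-design decision of the simpler kind can be sketched with sqlite3: the workload searches instructors by name often, so an index is built to support that lookup. The index name and table layout are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (ID TEXT, name TEXT, salary REAL)")

# Physical-design step: build an index on the frequently searched attribute.
conn.execute("CREATE INDEX idx_instructor_name ON instructor(name)")

# The index, like the table, is recorded in the system catalog.
indexes = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
```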
Database Design
6. Application and Security Design: Any software project that involves a DBMS must consider
aspects of the application that go beyond the database itself. Briefly, we must identify the
entities (e.g., users, user groups, departments) and processes involved in the application. We
must describe the role of each entity in every process that is reflected in some application task, as
part of a complete workflow for that task. For each role, we must identify the parts of the
database that must be accessible and the parts of the database that must not be accessible, and
we must take steps to ensure that these access rules are enforced. A DBMS provides several
mechanisms to assist in this step.
Database Engine
 A database system is partitioned into modules that deal with each of the responsibilities of
the overall system.
 The functional components of a database system can be divided into
• The storage manager,
• The query processor component,
• The transaction management component.
Database Engine
The storage manager is important because databases typically require a large amount of storage
space. Corporate databases commonly range in size from hundreds of gigabytes to terabytes of
data. A gigabyte is approximately 1 billion bytes, or 1000 megabytes (more precisely, 1024
megabytes), while a terabyte is approximately 1 trillion bytes or 1 million megabytes (more
precisely, 1024 gigabytes). The largest enterprises have databases that reach into the multi-
petabyte range (a petabyte is 1024 terabytes).
Since the main memory of computers cannot store this much information, and since the contents
of main memory are lost in a system crash, the information is stored on disks. Data are moved
between disk storage and main memory as needed. Since the movement of data to and from disk
is slow relative to the speed of the central processing unit, it is imperative that the database
system structure the data so as to minimize the need to move data between disk and main
memory. Increasingly, solid-state disks (SSDs) are being used for database storage. SSDs are
faster than traditional disks but also more costly.
Database Engine
The query processor is important because it helps the database system to simplify and facilitate
access to data.
The query processor allows database users to obtain good performance while being able to work
at the view level and not be burdened with understanding the physical-level details of the
implementation of the system.
It is the job of the database system to translate updates and queries written in a nonprocedural
language, at the logical level, into an efficient sequence of operations at the physical level.
Database Engine
The transaction manager is important because it allows application developers to treat a sequence of database accesses as if they were a single unit that either happens in its entirety or not at all.
This permits application developers to think at a higher level of abstraction about the application
without needing to be concerned with the lower-level details of managing the effects of
concurrent access to the data and of system failures.
While database engines were traditionally centralized computer systems, today parallel
processing is key for handling very large amounts of data efficiently. Modern database engines
pay a lot of attention to parallel data storage and parallel query processing.
Storage Manager
 A program module that provides the interface between the low-level data stored in the database and the
application programs and queries submitted to the system.
 The storage manager is responsible for the following tasks:
• Interaction with the OS file manager
• Efficient storing, retrieving and updating of data
 The storage manager components include:
• Authorization and integrity manager, which tests for the satisfaction of integrity constraints and checks
the authority of users to access data.
• Transaction manager, which ensures that the database remains in a consistent (correct) state despite
system failures, and that concurrent transaction executions proceed without conflicts.
• File manager, which manages the allocation of space on disk storage and the data structures used to
represent information stored on disk.
• Buffer manager, which is responsible for fetching data from disk storage into main memory, and
deciding what data to cache in main memory. The buffer manager is a critical part of the database
system, since it enables the database to handle data sizes that are much larger than the size of main
memory.
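The buffer manager's caching behavior can be sketched as a toy least-recently-used page cache. The BufferManager class, its 3-page capacity, and the simulated disk are illustrative assumptions, not a real DBMS component.

```python
from collections import OrderedDict

class BufferManager:
    """Toy page cache: fetch pages from 'disk' into limited main memory,
    evicting the least recently used page when the buffer is full."""
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk             # simulated disk: page_id -> contents
        self.buffer = OrderedDict()  # pages currently cached in main memory
        self.disk_reads = 0

    def fetch(self, page_id):
        if page_id in self.buffer:           # hit: no disk I/O needed
            self.buffer.move_to_end(page_id)
        else:                                # miss: read the page from disk
            self.disk_reads += 1
            if len(self.buffer) >= self.capacity:
                self.buffer.popitem(last=False)  # evict least recently used page
            self.buffer[page_id] = self.disk[page_id]
        return self.buffer[page_id]

disk = {p: f"data-{p}" for p in range(10)}
bm = BufferManager(capacity=3, disk=disk)
for p in [0, 1, 2, 0, 1, 3, 0]:  # page 2 is evicted when page 3 arrives
    bm.fetch(p)
```

Repeated accesses to cached pages avoid disk reads entirely, which is how the buffer manager lets the database handle data much larger than main memory.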
Storage Manager (Cont.)
 The storage manager implements several data structures as part of the physical system
implementation:
• Data files -- store the database itself
• Data dictionary -- stores metadata about the structure of the database, in particular the
schema of the database.
• Indices -- can provide fast access to data items. A database index provides pointers to those
data items that hold a particular value.
Query Processor
 The query processor components include:
• DDL interpreter -- interprets DDL statements and records the definitions in the data
dictionary.
• DML compiler -- translates DML statements in a query language into an evaluation plan
consisting of low-level instructions that the query evaluation engine understands.
 The DML compiler performs query optimization; that is, it picks the lowest cost
evaluation plan from among the various alternatives.
• Query evaluation engine -- executes low-level instructions generated by the DML compiler.
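The optimizer's choice among evaluation plans can be observed with SQLite's EXPLAIN QUERY PLAN; the table and index names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (ID TEXT, dept_name TEXT)")

query = "SELECT * FROM instructor WHERE dept_name = 'Music'"

# Without an index, the chosen evaluation plan scans the whole table.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]

# After an index is created, the optimizer picks a cheaper plan that
# searches the index instead.
conn.execute("CREATE INDEX idx_dept ON instructor(dept_name)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]
```

The same declarative query yields two different low-level plans; the DML compiler, not the user, decides which to execute.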
Transaction Management
Often, several operations on the database form a single logical unit of work.
An example is a funds transfer, in which one account A is debited and another account B is
credited. Clearly, it is essential that either both the credit and debit occur, or that neither
occur. That is, the funds transfer must happen in its entirety or not at all.
This all-or-none requirement is called atomicity.
In addition, it is essential that the execution of the funds transfer preserves the consistency of
the database. That is, the value of the sum of the balances of A and B must be preserved. This
correctness requirement is called consistency.
Finally, after the successful execution of a funds transfer, the new values of the balances of
accounts A and B must persist, despite the possibility of system failure. This persistence
requirement is called durability.
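The funds-transfer example can be sketched with sqlite3 transactions. Account names A and B follow the text; the balances, the transfer amount, and the insufficient-funds rule are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A", 500.0), ("B", 300.0)])
conn.commit()

def transfer(amount):
    """Debit A and credit B as one all-or-nothing unit (atomicity)."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = 'A'",
                     (amount,))
        if conn.execute("SELECT balance FROM account WHERE name = 'A'"
                        ).fetchone()[0] < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = 'B'",
                     (amount,))
        conn.commit()
    except Exception:
        conn.rollback()  # undo the partial debit: all or nothing

transfer(1000.0)  # fails midway, so neither update persists
total = conn.execute("SELECT sum(balance) FROM account").fetchone()[0]
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

Because the failed transfer is rolled back, the sum of the balances of A and B is preserved, illustrating consistency as well as atomicity.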
Transaction Management
 A transaction is a collection of operations that performs a single logical function in a database application.
 Transaction-management component ensures that the database remains in a consistent
(correct) state despite system failures (e.g., power failures and operating system crashes)
and transaction failures.
 Concurrency-control manager controls the interaction among the concurrent transactions,
to ensure the consistency of the database.
Database Architecture
 Centralized databases
• One to a few cores, shared memory
 Client-server,
• One server machine executes work on behalf of multiple client machines.
 Parallel databases
• Many-core, shared memory
• Shared disk
• Shared nothing
 Distributed databases
• Geographical distribution
• Schema/data heterogeneity
Database Architecture
(Centralized/Shared-Memory)
Database Applications
Database applications are usually partitioned into two or three parts:
 Two-tier architecture -- the application resides at the client machine, where it invokes database
system functionality at the server machine
 Three-tier architecture -- the client machine acts as a front end and does not contain any direct
database calls.
• The client end communicates with an application server, usually through a forms interface.
• The application server in turn communicates with a database system to access data.
Two-tier and three-tier architectures
Database Users
Database Administrator
A person who has central control over the system is called a database administrator (DBA). Functions
of a DBA include:
 Schema definition
 Storage structure and access-method definition
 Schema and physical-organization modification
 Granting of authorization for data access
 Routine maintenance
 Periodically backing up the database
 Ensuring that enough free disk space is available for normal operations, and upgrading
disk space as required
 Monitoring jobs running on the database
History of Database Systems
 1950s and early 1960s:
• Data processing using magnetic tapes for storage
 Tapes provided only sequential access
• Punched cards for input
 Late 1960s and 1970s:
• Hard disks allowed direct access to data
• Network and hierarchical data models in widespread use
• Ted Codd defines the relational data model
 Would win the ACM Turing Award for this work
 IBM Research begins System R prototype
 UC Berkeley (Michael Stonebraker) begins Ingres prototype
 Oracle releases first commercial relational database
• High-performance (for the era) transaction processing
History of Database Systems (Cont.)
 1980s:
• Research relational prototypes evolve into commercial systems
 SQL becomes industrial standard
• Parallel and distributed database systems
 Wisconsin, IBM, Teradata
• Object-oriented database systems
 1990s:
• Large decision support and data-mining applications
• Large multi-terabyte data warehouses
• Emergence of Web commerce
History of Database Systems (Cont.)
 2000s
• Big data storage systems
 Google BigTable, Yahoo! PNUTS, Amazon
 “NoSQL” systems.
• Big data analysis: beyond SQL
 MapReduce and friends
 2010s
• SQL reloaded
 SQL front end to MapReduce systems
 Massively parallel database systems
 Multi-core main-memory databases
Class Discussions
 List applications you have used that most likely employed a database system to store
persistent data.
 Assume that two students are trying to register for a course in which there is only one open
seat. What component of a database system prevents both students from being given that
last seat?
 Describe tables that might be used to store information in a social-networking system such as Facebook.