Database II
OBJECT-ORIENTED DATABASE
An object-oriented database (OODBMS) or object database management system
(ODBMS) is a database that is based on object-oriented programming (OOP). The data is
represented and stored in the form of objects. OODBMS are also called object databases or
object-oriented database management systems.
A database is an organized store of data. A software system that is used to manage databases is
called a database management system (DBMS). There are many types of database
management systems such as hierarchical, network, relational, object-oriented, graph, and
document.
Object-Oriented Database
The idea of object databases originated in 1985, and support is now common in various OOP
languages, such as C++, Java, C#, Smalltalk, and LISP. Common examples are GemStone (based on
Smalltalk), Gbase (based on LISP), and Vbase (based on COP). Object databases are commonly used
in applications that require high performance, calculations, and faster results. Some of the
common applications that use object databases are real-time systems, architectural and
engineering 3D modelling, telecommunications, and scientific fields such as molecular science
and astronomy.
In a typical relational database, the program data is stored in rows and columns. To
store and read that data and convert it into program objects in memory requires reading data,
loading data into objects, and storing it in memory. Imagine creating a class in your program,
saving its objects as they are in a database, reading them back, and using them again. Object
databases bring permanent persistence to objects: objects can be stored in persistent storage forever.
Here is a list of some of the popular object databases and their features.
Caché -
InterSystems’s Caché is a high-performance object database. Caché database engine
is a set of services including data storage, concurrency management, transactions, and
process management. You can think of the Caché engine as a powerful database toolkit. It is
also a full-featured relational database. All the data within a Caché database is available as
true relational tables and can be queried and modified using standard SQL.
• The ability to model data as objects (each with an automatically created and
synchronized native relational representation) while eliminating the impedance
mismatch between databases and object-oriented application environments and
reducing the complexity of relational modelling
• A simpler, object-based concurrency model
• User-defined data types
• The ability to take advantage of methods and inheritance, including
polymorphism, within the database engine
• Object-extensions for SQL to handle object identity and relationships
• The ability to intermix SQL and object-based access within a single application,
using each for what they are best suited
• Control over the physical layout and clustering used to store data in order to
ensure the maximum performance for applications
• Automatic interoperability via Java, JDBC, ActiveX, .NET, C++, ODBC, XML, SOAP, Perl,
Python, and more
• Support for common Internet protocols: POP3, SMTP, MIME, FTP, and so on
• A reusable user portal for your end users
• Support for analyzing unstructured data
• Support for Business Intelligence (BI)
• Built-in testing facilities
ConceptBase -
Db4o -
Db4o is the world's leading open-source object database for Java and .NET. It offers
fast native object persistence, ACID transactions, query-by-example, the S.O.D.A. object query
API, automatic class schema evolution, and a small footprint.
ObjectDB/Object Database -
ObjectDatabase++
Objectivity/DB -
Objectivity/DB runs on 32- or 64-bit processors running Linux, Mac OS X, UNIX (e.g., Oracle Solaris),
or Windows. There are C++, C#, Java, and Python APIs. Data written by a C++ program on Linux can be
read by a C# program on Windows and by a Java program on Mac OS X.
Objectivity/DB generally runs on POSIX file systems, but there are plugins that can be
modified for other storage infrastructure. Objectivity/DB client programs can be configured
to run on a standalone laptop, networked workgroups, large clusters or in grids or clouds with
no changes to the application code.
ObjectStore -
ObjectStore is OO storage that directly integrates with Java or C++ applications and
treats memory and persistent storage as one – improving the performance of application logic
while fully maintaining ACID compliance against the transactional and distributed load.
Key Benefits
WakandaDB -
Object-relational Databases
PostgreSQL is the most popular pure ORDBMS. Some popular databases including
Microsoft SQL Server, Oracle, and IBM DB2 also support objects and can be considered as
ORDBMS.
A data model is an abstraction of the real world. It allows you to deal with the
complexity inherent in a real-world problem by focusing on the essential and interesting
features of the data an organization needs. An object-oriented model is built around objects,
just as the E-R model is built around entities. However, an object encapsulates both data and
behaviour, implying that we can use the object-oriented approach not only for data
modelling, but also to model system behaviour. To thoroughly represent any real-world
system, you need to model both the data and the processes and behaviour that act on the
data. By allowing you to capture them together within a common representation, and by
offering benefits such as inheritance and code reuse, the object-oriented modelling approach
provides a powerful environment for developing complex systems.
The object-oriented systems development cycle, depicted in the figure below, consists
of progressively and iteratively developing object representations through three phases—
analysis, design, and implementation—similar to the heart of the systems development life
cycle. In an iterative development model, the focus shifts from more abstract aspects of the
development process (analysis) to the more concrete ones over the lifetime of a project. Thus,
in the early stages of development, the model you develop is abstract, focusing on external
qualities of the system. As the model evolves, it becomes more and more detailed, the focus
shifting to how the system will be built and how it should function. The emphasis in modeling
should be on analysis and design, focusing on front-end conceptual issues rather than back-
end implementation issues that unnecessarily restrict design choices (Larman, 2004).
In the analysis phase, you develop a model of a real-world application, showing its
important properties. The model abstracts concepts from the application domain and
describes what the intended system must do, rather than how it will be done. It specifies the
functional behaviour of the system independent of concerns relating to the environment in
which it is to be finally implemented. Please note that during the analysis activities, your focus
should be on analysing and modeling the real-world domain of interest, not the internal
characteristics of the software system.
In the object-oriented design phase, you define how the analysis model focused on
the real world will be realized in the implementation environment. Therefore, your focus will
move to modeling the software system, which will be very strongly informed by the models
that you created during the analysis activities. Jacobson et al. (1992) cite three reasons for
using object-oriented design:
• The system must be adapted to the environment in which the system will actually be
implemented. To accomplish that, the analysis model has to be transformed into a
design model, considering different factors such as performance requirements, real-
time requirements and concurrency, the target hardware and systems software, the
DBMS and programming language to be adopted, and so forth.
• The analysis results can be validated using object-oriented design. At this stage, you
can verify whether the results from the analysis are appropriate for building the
system and make any necessary changes to the analysis model during the next
iteration of the development cycle.
To develop the design model, you must identify and investigate the consequences that
the implementation environment will have on the design. All strategic design decisions, such
as how the DBMS is to be incorporated, how process communications and error handling are
to be achieved, what component libraries are to be reused, and so on are made. Next, you
incorporate those decisions into a first-cut design model that adapts to the implementation
environment. Finally, you formalize the design model to describe how the objects interact
with one another for each conceivable scenario.
Within each iteration, the design activities are followed by implementation activities (i.e.,
implementing the design using a programming language and/or a database management
system). If the design was done well, translating it into program code is a relatively
straightforward process, given that the design model already incorporates the nuances
of the programming language and the DBMS.
Coad and Yourdon (1991) identify several motivations and benefits of object-oriented
modeling:
UML notation is useful for graphically depicting object-oriented analysis and design
models. It not only allows you to specify the requirements of a system and capture the design
decisions; it also promotes communication among key persons involved in the development
effort. A developer can use an analysis or design model expressed in the UML notation as a
means to communicate with domain experts, users, and other stakeholders.
To represent a complex system effectively, the model you develop must consist of a
set of independent views or perspectives. UML allows you to represent multiple perspectives
of a system by providing different types of graphical diagrams, such as the use-case diagram,
class diagram, state diagram, sequence diagram, component diagram, and deployment
diagram. If these diagrams are used correctly together in the context of a well-defined
modeling process, UML allows you to analyze, design, and implement a system based on one
consistent conceptual model. UML also offers the ability to treat entity sets as true classes,
with methods as well as data. The discussion below summarizes the common concepts, with
differing terminology, used by the E/R model and UML.
A class in UML is similar to an entity set in the E/R model. The notation for a class is
rather different, however. The following figure shows the class that corresponds to the E/R
entity set Movies.
The box for a class is divided into three parts. At the top is the name of the class. The
middle has the attributes, which are like instance variables of a class. In the Movies class, we
use the attributes title, year, length, and genre. The bottom portion is for methods. Neither
the E/R model nor the relational model provides methods. However, they are an important
concept, and one that actually appears in modern relational systems, called “object-relational”
DBMSs. We might have added an instance method lengthInHours(). The UML specification does not
tell us anything more about a method than the types of any arguments and the type of its return
value. Perhaps this method returns length/60.0, but we cannot know that from the design.
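As a rough sketch of how such a class maps to code (the attribute types, the unit of length, and the method body are assumptions, since the UML design does not specify them), the Movies class could be written in Python as follows:

# A minimal Python sketch of the UML class Movies; types and units are assumed.
class Movies:
    def __init__(self, title: str, year: int, length: int, genre: str):
        self.title = title    # attribute: movie title
        self.year = year      # attribute: release year
        self.length = length  # attribute: running time, assumed to be in minutes
        self.genre = genre    # attribute: genre label

    def length_in_hours(self) -> float:
        # One plausible body for lengthInHours(); the design itself does not say.
        return self.length / 60.0

movie = Movies("Gone With the Wind", 1939, 231, "drama")
print(movie.length_in_hours())  # 3.85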
The class diagram is one of the static diagrams in UML, addressing primarily the structural
characteristics of the domain of interest. The class diagram allows us also to capture the
responsibilities that classes can perform, without any specifics of the behaviors. Keep in mind
that a database system is usually part of an overall system, whose underlying model should
encompass all the different perspectives. It is important to note that the UML class diagrams
can be used for multiple purposes at various stages of the life cycle model.
Class is an entity type that has a well-defined role in the application domain about
which the organization wishes to maintain state, behavior, and identity, while an Object is an
instance of a class that encapsulates data and behavior. State is an object’s properties
(attributes and relationships) and the values those properties have. Behavior is the way in
which an object acts and reacts.
In the object-oriented approach, we model the world in objects. Before applying the
approach to a real-world problem, therefore, we need to understand what an object and
some related concepts really are. A class is an entity type that has a well-defined role in the
application domain about which the organization wishes to maintain state, behavior, and
identity. A class is a concept, an abstraction, or a thing that makes sense and matters in an
application context. A class could represent a tangible or visible entity type (e.g., a person,
place, or thing); it could be a concept or an event (e.g., Department, Performance, Marriage,
Registration); or it could be an artifact of the design process (e.g., User Interface, Controller,
Scheduler). An object is an instance of a class (e.g., a particular person, place, or thing) that
encapsulates the data and behavior we need to maintain about that object. A class of objects
shares a common set of attributes and behaviors.
Entity types in the E-R model can be represented as classes and entity instances as
objects in the object model. But, in addition to storing a state (information), an object also
exhibits behavior, through operations that can examine or change its state. The state of an
object encompasses its properties (attributes and relationships) and the values those
properties have, and its behavior represents how an object acts and reacts. Thus, an object’s
state is determined by its attribute values and links to other objects. An object’s behavior
depends on its state and the operation being performed. An operation is simply an action that
one object performs in order to give a response to a request. You can think of an operation
as a service provided by an object (supplier) to its clients. A client sends a message to a
supplier, which delivers the desired service by executing the corresponding operation.
Consider an example student class and a particular object in this class, Mary Jones.
The state of this object is characterized by its attributes, say, name, date of birth, year,
address, and phone, and the values these attributes currently have. For example, name is
“Mary Jones,” year is “junior,” and so on. The object’s behavior is expressed through
operations such as calcGpa, which is used to calculate a student’s current grade point average.
The Mary Jones object, therefore, packages its state and its behavior together. Every
object has a persistent identity; that is, no two objects are the same. For example, if there are
two Student instances with the same value of an identifier attribute, they are still two
different objects. Even if those two instances have identical values for all the identifying
attributes of the object, the objects maintain their separate identities. At the same time, an
object maintains its own identity over its life. For example, if Mary Jones gets married and the
values of the attributes name, address, and phone change for her, she will still be represented
by the same object.
This can be depicted graphically using a class diagram, as shown in the figure below. A class
diagram shows the static structure of an object-oriented model: the classes, their internal
structure, and the relationships in which they participate. In UML, a class is represented by a
rectangle with three compartments separated by horizontal lines. The class name appears in
the top compartment, the list of attributes in the middle compartment, and the list of
operations in the bottom compartment of a box. The figure shows two classes, Student and
Course, along with their attributes and operations.
A static object diagram, such as the one shown in the figure, is an instance of a class
diagram, providing a snapshot of the detailed state of a system at a point in time. In an object
diagram, an object is represented as a rectangle with two compartments. The names of the
object and its class are underlined and shown in the top compartment using the following
syntax:
objectname : classname
The object’s attributes and their values are shown in the second compartment. For
example, we have an object called Mary Jones that belongs to the Student class. The values
of the name, dateOfBirth, and year attributes are also shown. Attributes whose values are
not of interest to you may be suppressed; for example, we have not shown the address and
phone attributes for Mary Jones. If none of the attributes is of interest, the entire second
compartment may be suppressed. The name of the object may also be omitted, in which case
the colon should be kept with the class name as we have done with the instance of Course. If
the name of the object is shown, the class name, together with the colon, may be suppressed.
Types of Operations
Operations can be classified into four types, depending on the kind of service
requested by clients, they are:
• Constructor: A constructor operation creates a new instance of a class. For example,
the Student class can have a constructor operation, also called Student, that creates a
new student object and initializes its state. Such constructor operations are available
to all classes and are therefore not explicitly shown in the class diagram.
• Query: A query operation is an operation without any side effects; it accesses the state
of an object but does not alter the state. For example, the Student class can have an
operation called getYear, which simply retrieves the year (freshman, sophomore,
junior, or senior) of the Student object specified in the query. Consider, however, the
calcAge operation within Student. This is also a query operation because it does not
have any other effects. Note that the only argument for this query is the target Student
object. Such a query can be represented as a derived attribute.
• Update: An update operation alters the state of an object. For example, consider an
operation of Student called promoteStudent. The operation promotes a student to a
new class, thereby changing the Student object’s state (value of the attribute year).
Another example of an update operation is registerFor(course), which, when invoked,
has the effect of establishing a connection from a Student object to a specific Course
object. Again, in standard object-oriented programming terminology, the methods
that are used to change the value of an object’s internal attributes are called setter,
or mutator, methods.
• Class - Scope: A class-scope operation is an operation that applies to a class rather
than an object instance. For example, avgGpa for the Student class calculates the
average grade point average across all students. In object-oriented programming,
class-scope operations are implemented with class methods.
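A small Python sketch of the four operation types above on the Student class (the attribute names and the GPA bookkeeping are illustrative assumptions, not part of the original model):

from datetime import date

class Student:
    _all_gpas = []   # class-level data used by the class-scope operation avgGpa

    def __init__(self, name, date_of_birth, year, gpa):
        # Constructor operation: creates a new instance and initializes its state.
        self.name = name
        self.date_of_birth = date_of_birth
        self.year = year              # "freshman", "sophomore", "junior", or "senior"
        self.gpa = gpa
        Student._all_gpas.append(gpa)

    def get_year(self):
        # Query operation: reads state, has no side effects.
        return self.year

    def calc_age(self):
        # Query operation that acts like a derived attribute.
        today = date.today()
        years = today.year - self.date_of_birth.year
        had_birthday = (today.month, today.day) >= (self.date_of_birth.month, self.date_of_birth.day)
        return years if had_birthday else years - 1

    def promote_student(self, new_year):
        # Update (setter/mutator) operation: changes the object's state.
        self.year = new_year

    @classmethod
    def avg_gpa(cls):
        # Class-scope operation: applies to the class as a whole, not to one instance.
        return sum(cls._all_gpas) / len(cls._all_gpas)

mary = Student("Mary Jones", date(2004, 4, 14), "junior", 3.6)
mary.promote_student("senior")
print(mary.get_year(), Student.avg_gpa())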
Associations:
Unary Associations
Binary Associations
Ternary Associations
A derived attribute, association, or role is one that can be computed or derived from
other attributes, associations, and roles, respectively. A derived element (attribute,
association, or role) is typically shown by placing either a slash (/) or a stereotype of
<<Derived>> before the name of the element. For instance, in figure below, age is a derived
attribute of Student, because it can be calculated from the date of birth and the current date.
Because the calculation is a constraint on the class, the calculation is shown on the diagram
within {} above the Student class. Also, the Takes relationship between Student and Course is
derived, because it can be inferred from the Registers For and Scheduled For relationships.
By the same token, participants is a derived role because it can be derived from other roles.
Generalization
Consider the example shown in the figure below. There are three types of employees: hourly
employees, salaried employees, and consultants. The features that are shared by all
employees—empName, empNumber, address, dateHired, and printLabel—are stored in the
Employee superclass, whereas the features that are peculiar to a particular employee type
are stored in the corresponding subclass (e.g., hourlyRate and computeWages of Hourly
Employee). A generalization path is shown as a solid line from the subclass to the superclass,
with a hollow triangle at the end of, and pointing toward, the superclass. You can show a
group of generalization paths for a given superclass as a tree with multiple branches
connecting the individual subclasses, and a shared segment with a hollow triangle pointing
toward the superclass. In the other figure for instance, we have combined the generalization
paths from Outpatient to Patient, and from Resident Patient to Patient, into a shared segment
with a triangle pointing toward Patient. We also specify that this generalization is dynamic,
meaning that an object may change subtypes.
You can indicate the basis of a generalization by specifying a discriminator next to the
path. A discriminator shows which property of an object class is being abstracted by a
particular generalization relationship. You can discriminate on only one property at a time.
For example, we discriminate the Employee class on the basis of employment type (hourly,
salaried, consultant). For a group of generalization relationships, we need to
specify the discriminator only once. Although we discriminate the Patient class into two
subclasses, Outpatient and Resident Patient, based on residency, we show the discriminator
label only once next to the shared line. An instance of a subclass is also an instance of its
superclass; for example, an Outpatient instance is also a Patient instance. For that reason, a
generalization is also referred to as an is-a relationship. Also, a subclass inherits all the
features from its superclass. For example, in addition to its own special features—hourlyRate
and computeWages—the Hourly Employee subclass inherits empName, empNumber,
address, dateHired, and printLabel from Employee. An instance of Hourly Employee will store
values for the attributes of Employee and Hourly Employee, and, when requested, will apply
the printLabel and computeWages operations.
Advocates of the object-oriented approach claim that code reuse results in productivity gains
of several orders of magnitude.
Notice that in the figure, the Patient class is in italics, implying that it is an abstract class. An
abstract class is a class that has no direct instances but whose descendants may have direct
instances. A class that can have direct instances (e.g., Outpatient or Resident Patient) is called
a concrete class. In this example, therefore, Outpatient and Resident Patient can have direct
instances, but Patient cannot have any direct instances of its own.
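A minimal Python sketch of this generalization, with assumed attribute and method names: Patient is abstract (no direct instances), while Outpatient and ResidentPatient are concrete subclasses that inherit its shared features.

from abc import ABC, abstractmethod

class Patient(ABC):
    # Abstract superclass: shared features live here; no direct instances are created.
    def __init__(self, patient_id, name):
        self.patient_id = patient_id
        self.name = name

    @abstractmethod
    def care_setting(self):
        ...

class Outpatient(Patient):
    def __init__(self, patient_id, name, check_back_date):
        super().__init__(patient_id, name)          # inherited features
        self.check_back_date = check_back_date      # assumed subclass-specific attribute

    def care_setting(self):
        return "outpatient clinic"

class ResidentPatient(Patient):
    def __init__(self, patient_id, name, bed_number):
        super().__init__(patient_id, name)
        self.bed_number = bed_number                # assumed subclass-specific attribute

    def care_setting(self):
        return "hospital ward"

p = Outpatient("P001", "Aisha Bello", "2023-10-01")  # concrete class: allowed
# Patient("P002", "John Doe") would raise TypeError because Patient is abstract.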
Aggregation
Examples of aggregation
In the figure above, we can see the inheritance relationship and two association
relationships. The CDSalesReport class inherits from the Report class. A CDSalesReport is
associated with one CD, but the CD class doesn’t know anything about the CDSalesReport
class. The CD and the Band classes both know about each other, and both classes can be
associated to one or more of each other.
Activity Diagram
An activity diagram shows the procedural flow of control between two or more class
objects while processing an activity. In other words, it describes the business and operational
step-by-step workflows of components in a system.
An activity diagram shows the overall flow of control. It can be used to model higher-
level business processes at the business unit level, or to model low-level internal class actions.
Activity diagrams are best used to model higher-level processes, such as how the company is
currently doing business, or how it would like to do business, because they are "less technical"
in appearance than many other UML diagrams and are therefore easier for business-oriented
readers to follow.
Activity diagram, with two swim lanes to indicate control of activity by two
objects: the band manager, and the reporting tool
In our example activity diagram, we have two swim lanes because we have two objects
that control separate activities: a band manager and a reporting tool. The process starts with
the band manager electing to view the sales report for one of his bands. The reporting tool
then retrieves and displays all the bands that person manages and asks him to choose one.
After the band manager selects a band, the reporting tool retrieves the sales information and
displays the sales report. The activity diagram shows that displaying the report is the last step
in the process.
Use-case diagram
A use case is used to help development teams visualize the functional requirements
of a system, including the relationship of “actors” (human beings who will interact with the
system) to essential processes, as well as the relationships among different use cases. Use-
case diagrams generally show groups of use cases — either all use cases for the complete
system, or a breakout of a particular group of use cases with related functionality (e.g., all
security administration-related use cases). To show a use case on a use-case diagram, you
draw an oval in the middle of the diagram and put the name of the use case in the center of,
or below, the oval. To draw an actor (indicating a system user) on a use-case diagram, you
draw a stick person to the left or right of your diagram. Use simple lines to depict relationships
between actors and use cases.
Sequence diagram
A sequence diagram shows how objects communicate with each other in terms of a sequence of
messages. It also indicates the lifespans of objects relative to those messages. Sequence
diagrams show a detailed flow for a specific use case or even just part of a specific use case.
They are almost self-explanatory; they show the calls between the different objects in their
sequence and can show, at a detailed level, different calls to different objects.
A sequence diagram has two dimensions: The vertical dimension shows the sequence
of messages/calls in the time order that they occur; the horizontal dimension shows the
object instances to which the messages are sent.
A typical computer system has several different components in which data may be
stored. These components have data capacities ranging over at least seven orders of
magnitude and also have access speeds ranging over seven or more orders of magnitude. The
cost per byte of these components also varies, but more slowly, with perhaps three orders of
magnitude between the cheapest and most expensive forms of storage. Not surprisingly, the
devices with smallest capacity also offer the fastest access speed and have the highest cost
per byte as shown in the figure below:
Memory hierarchy
• Cache: A typical machine has a megabyte or more of cache storage. On-board cache
is found on the same chip as the microprocessor itself, and additional level-2 cache is
found on another chip. Data and instructions are moved to cache from main memory
when they are needed by the processor. Cached data can be accessed by the processor
in a few nanoseconds.
• Main Memory: In the center of the action is the computer’s main memory. We may
think of everything that happens in the computer — instruction executions and data
manipulations — as working on information that is resident in main memory (although
in practice, it is normal for what is used to migrate to the cache). Currently, machines are
commonly configured with many gigabytes of main memory. Typical times to
move data from main memory to the processor or cache are in the 10-100 nanosecond
range.
• Secondary Storage: Secondary storage is typically magnetic disk. Currently, there are
single disk units that have capacities of up to a terabyte or more, and one machine
can have several disk units. The time to transfer a single byte between disk and main
memory is around 10 milliseconds. However, large numbers of bytes can be
transferred at one time, so the matter of how fast data moves from and to disk is
somewhat complex.
• Tertiary Storage. As capacious as a collection of disk units can be, there are databases
much larger than what can be stored on the disk(s) of a single machine, or even several
machines. To serve such needs, tertiary storage devices have been developed to hold
data volumes measured in terabytes. Tertiary storage is characterized by significantly
higher read/write times than secondary storage, but also by much larger capacities
and smaller cost per byte than is available from magnetic disks. Many tertiary devices
involve robotic arms or conveyors that bring storage media such as magnetic tape or
optical disks (e.g., DVD’s) to a reading device. Retrieval takes seconds or minutes, but
capacities in the petabyte range are possible.
The path in the fig. above involving virtual memory represents the treatment of
conventional programs and applications. It does not represent the typical way data in a
database is managed, since a DBMS manages the data itself.
However, there is increasing interest in main-memory database systems, which do
indeed manage their data through virtual memory, relying on the operating system to bring
needed data into main memory through the paging mechanism. Main-memory database
systems, like most applications, are most useful when the data is small enough to remain in
main memory without being swapped out by the operating system.
Presently, the common secondary storage medium used to store data is the disk; before
disks, tape was used, and tape is now generally reserved for archival data. The storage medium
used in a disk drive is a disk pack, which is made up of a number of surfaces. Data is read and
written from the disk pack by means of transducers called read/write heads. The number of
read/write heads depends on the type of the disk drive. If we trace the projection of one head on
the surface associated with it as the disk rotates, we trace out a circular figure called a track.
The tracks at the same position on every surface of the disk form the surface of an imaginary
cylinder. In disk terminology, therefore, a cylinder consists of the tracks under the heads on
each of its surfaces.
A major factor that determines overall system performance is the response time for
data on secondary storage. This time depends not only on physical device characteristics, but
also on the data arrangement and request sequencing. In general, the response time has two
components: access time and data transfer time. Data transfer time is the time needed to
move data from the secondary storage device to processor memory; access time is the time
needed to position the read/write head at the required position. The data transfer time
depends on physical device characteristics and cannot be optimized. In the case of reading a
1 KB (kilobyte = 1024 bytes) block of data from a device that can transfer it at 100 KB/sec
(kilobytes per second), the data transfer time is 10 msec. The access time, by contrast, depends
on the distance between the current and target positions and therefore on the data organization.
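The 10 msec figure quoted above follows directly from the block size and the transfer rate; a quick Python check of the arithmetic:

block_size_kb = 1              # 1 KB block
transfer_rate_kb_per_s = 100   # device transfers 100 KB/sec
print(block_size_kb / transfer_rate_kb_per_s * 1000, "msec")  # 10.0 msec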
File system:
A file is a collection of records that are logically related to some object. Record values
can take many forms; for example, a student record may have values for Roll No, Name, and
Class. For arranging such data, we use files.
Examples: files of a bank’s customers, files of departments, files of stock records, etc.
Files are recorded on secondary storage such as magnetic disks, magnetic tapes, and
optical disks.
Types of files:
• Physical file:
o A physical file contains the actual data that is stored.
o It stores a description of how the data is represented.
• Logical file –
o Logical files do not contain data.
o They contain a description of records that are found in one or more physical files.
o A logical file is a view or representation of one or more physical files.
• Special character file:
o At the time of file creation, some special characters are inserted in the file.
o For example, Control + Z marks the end of a file; it has ASCII value 26.
According to the type of records they hold, files are of two types:
✓ Fixed length record file
✓ Variable length record file
1. Fixed length record file:
Every record in this file has the same size (in bytes), and each record is assigned a
memory block of that same fixed size. For example, if the record size is fixed at 30 bytes, every
record is stored in a 30-byte slot.
Advantage: records are stored at fixed offsets, so a particular record can be located
quickly.
Disadvantage: memory blocks are wasted when a record is smaller than the assigned
block. This unused space increases the size of the file.
2. Variable length record file:
Every record in this file may have a different size (in bytes), and the memory blocks
assigned to the records vary in size accordingly. Each record uses only as much memory as its
value actually requires.
Advantage: memory is used efficiently for storing records, since each record occupies
only the space it needs. Because such files are smaller, they can be moved, saved, or transferred
from one location to another more quickly.
Disadvantage: access to a record is slower than in a fixed-length record file, due to the
varying size of the records.
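A sketch of a 30-byte fixed-length record layout using Python's struct module (the field layout, a 4-byte roll number, a 20-byte name, and a 6-byte class, is an assumption for illustration). Because every record has the same size, the n-th record is found with a single seek; a variable-length layout would save space but would need a sequential scan or a separate index to do the same.

import struct

RECORD_FORMAT = "i20s6s"                      # roll no (4 bytes) + name (20) + class (6)
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 30 bytes per record

def pack_record(roll_no, name, klass):
    # Short values are padded, so every record occupies exactly RECORD_SIZE bytes.
    return struct.pack(RECORD_FORMAT, roll_no, name.encode(), klass.encode())

def read_nth_record(f, n):
    # Fixed size makes direct access easy: jump straight to offset n * RECORD_SIZE.
    f.seek(n * RECORD_SIZE)
    roll_no, name, klass = struct.unpack(RECORD_FORMAT, f.read(RECORD_SIZE))
    return roll_no, name.rstrip(b"\x00").decode(), klass.rstrip(b"\x00").decode()

with open("students.dat", "wb") as f:
    f.write(pack_record(1, "Mary Jones", "ND1"))
    f.write(pack_record(2, "Musa Ali", "ND2"))

with open("students.dat", "rb") as f:
    print(read_nth_record(f, 1))   # (2, 'Musa Ali', 'ND2')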
The B+ tree is similar to a binary search tree (BST), but it can have more than two
children. In this method, all the records are stored only at the leaf node. Intermediate nodes
act as a pointer to the leaf nodes. They do not contain any records.
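A highly simplified search sketch (not a full B+ tree; node splitting and balancing are omitted): internal nodes hold only routing keys and child pointers, and records are found only at the leaves, which are chained for range scans.

class BPlusInternal:
    def __init__(self, keys, children):
        self.keys = keys          # routing keys only; no records stored here
        self.children = children  # one more child than keys

class BPlusLeaf:
    def __init__(self, keys, records, next_leaf=None):
        self.keys = keys          # sorted search-key values
        self.records = records    # the actual records live only in leaves
        self.next = next_leaf     # leaves are chained for range scans

def bplus_search(node, key):
    # Follow routing keys down to the correct leaf, then scan that leaf.
    while isinstance(node, BPlusInternal):
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    for k, record in zip(node.keys, node.records):
        if k == key:
            return record
    return None

leaf1 = BPlusLeaf([5, 10], ["rec5", "rec10"])
leaf2 = BPlusLeaf([15, 20], ["rec15", "rec20"])
leaf1.next = leaf2
root = BPlusInternal([15], [leaf1, leaf2])
print(bplus_search(root, 15))   # "rec15"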
In cluster file organization, records from two or more frequently joined tables are stored
together in the same file based on a common cluster key. In the given example, we are retrieving
the records for only particular departments; this method cannot be used efficiently to retrieve
the records for every department.
In this method, we can directly insert, update, or delete any record. Data is sorted
based on the key with which searching is done. The cluster key is the key with which joining
of the tables is performed.
Types of Cluster file organization:
Cluster file organization is of two types:
• Indexed Clusters: In an indexed cluster, records are grouped based on the cluster key and
stored together. The EMPLOYEE and DEPARTMENT relationship above is an example of an
indexed cluster, where all the records are grouped based on the cluster key DEP_ID.
• Hash Clusters: It is similar to the indexed cluster. In a hash cluster, instead of storing the
records based on the cluster key itself, we generate a hash value for the cluster key and
store together the records with the same hash value.
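A small Python sketch of both cluster types described above, using a made-up EMPLOYEE table: an indexed cluster groups records directly by the cluster key DEP_ID, while a hash cluster groups them by a hash of that key.

from collections import defaultdict

employees = [
    {"EMP_ID": 1, "NAME": "John",  "DEP_ID": 10},
    {"EMP_ID": 2, "NAME": "Amara", "DEP_ID": 20},
    {"EMP_ID": 3, "NAME": "Musa",  "DEP_ID": 10},
]

# Indexed cluster: records stored together, grouped by the cluster key itself.
indexed_cluster = defaultdict(list)
for row in employees:
    indexed_cluster[row["DEP_ID"]].append(row)

# Hash cluster: records grouped by a hash of the cluster key instead.
NUM_BUCKETS = 4
hash_cluster = defaultdict(list)
for row in employees:
    hash_cluster[hash(row["DEP_ID"]) % NUM_BUCKETS].append(row)

# A join on DEP_ID now reads one group or bucket instead of scanning the whole file.
print(indexed_cluster[10])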
Pros of Cluster file organization
• The cluster file organization is used when there is a frequent request for joining the
tables with same joining condition.
• It provides the efficient result when there is a 1:M mapping between the tables.
Cons of Cluster file organization
• This method has low performance for very large databases.
• If there is any change in the joining condition, then this method cannot be used;
traversing the file with a different join condition takes a lot of time.
• This method is not suitable for tables with a 1:1 relationship.
The system catalogue is a collection of tables and views that contain important
information about a database. It is the place where a relational database management system
stores schema metadata, such as information about tables and columns, and internal
bookkeeping information. A system catalogue is available for each database. Information in
the system catalogue defines the structure of the database. For example, the DDL (data
definition language) for all tables in the database is stored in the system catalogue. Most
system catalogues are copied from the template database during database creation, and are
thereafter database-specific. A few catalogues are physically shared across all databases in an
installation; these are marked in the descriptions of the individual catalogues.
The system catalogue for a database is actually part of the database. Within the
database are objects, such as tables, indexes, and views. The system catalogue is basically a
group of objects that contain information that defines other objects in the database, the
structure of the database itself, and various other significant information.
The system catalogue may be divided into logical groups of objects to provide tables
that are accessible by not only the database administrator, but by any other database user as
well. A user typically queries the system catalogue to acquire information on the user’s own
objects and privileges, whereas the DBA needs to be able to inquire about any structure or
event within the database. In some implementations, there are system catalogue objects that
are accessible only to the database administrator.
The terms system catalogue and data dictionary have been used interchangeably in
most situations. In database management systems, a file defines the basic organisation of a
database. A data dictionary contains a list of all the files in the database, the number of
records in each file, and the names and types of each field. Most database management
systems keep the data dictionary hidden from users to prevent them from accidentally
destroying its contents.
The information stored in a catalogue of an RDBMS includes:
• the relation names,
• attribute names,
The description of the relational database schema in the Figure A above is shown as
the tuples (contents) of the catalogue relation in Figure B. This entry is called CAT_ENTRY.
All relation names should be unique, and all attribute names within a particular relation should
also be unique. Another catalogue relation can store information such as tuple size, current
number of tuples, number of indexes, and creator name for each relation.
Data dictionaries also include data on the secondary keys, indexes and views. The
above could also be extended to the secondary key, index as well as view information by
defining the secondary key, indexes, and views. Data dictionaries do not contain any actual
data from the database; they contain only book-keeping information for managing it. Without
a data dictionary, however, a database management system cannot access data from the
database.
The Database Library is built on a Data Dictionary, which provides a complete
description of record layouts and indexes of the database, for validation and efficient data
access. The data dictionary can be used for automated database creation, including building
tables, indexes, and referential constraints, and granting access rights to individual users and
groups. The database dictionary supports the concept of Attached Objects, which allow
database records to include compressed BLOBs (Binary Large Objects) containing images,
texts, sounds, video, documents, spreadsheets, or programmer-defined data types. The data
dictionary stores useful metadata, such as field descriptions, in a format that is independent
of the underlying database system. Some of the functions served by the Data Dictionary
include:
• Ensuring efficient data access, especially with regard to the utilisation of indexes,
• partitioning the database into both logical and physical regions,
• specifying validation criteria and referential constraints to be automatically
enforced,
• supplying pre-defined record types for Rich Client features, such as security and
administration facilities, attached objects, and distributed processing (i.e., grid and
cluster supercomputing).
The terms data dictionary and data repository are used to indicate a more general
software utility than a catalogue. A catalogue is closely coupled with the DBMS software; it
provides the information stored in it to users and the DBA, but it is mainly accessed by the
various software modules of the DBMS itself, such as DDL and DML compilers, the query
optimiser, the transaction processor, report generators, and the constraint enforcer. On the
other hand, a Data Dictionary is a data structure that stores meta-data, i.e., data about data.
The software package for a stand-alone data dictionary or data repository may
interact with the software modules of the DBMS, but it is mainly used by the designers, users,
and administrators of a computer system for information resource management. These
systems are used to maintain information on system hardware and software configurations,
documentation, applications, and users, as well as other information relevant to system
administration.
If a data dictionary system is used only by designers, users, and administrators, and
not by the DBMS software, it is called a passive data dictionary; otherwise, it is called an active
data dictionary or data directory. An active data dictionary is automatically updated as
changes occur in the database. A passive data dictionary must be manually updated. The data
dictionary consists of record types (tables) created in the database by system-generated
command files, tailored for each supported back-end DBMS.
Command files contain SQL statements for CREATE TABLE, CREATE UNIQUE INDEX,
ALTER TABLE (for referential integrity), etc., using the specific SQL statement required by that
type of database.
Data Dictionary Features
A comprehensive data dictionary product will include:
• Support for standard entity types (elements, records, files, reports, programs,
systems, screens, users, terminals, etc.), and their various characteristics (e.g., for
elements, the dictionary might maintain Business name, Business definition, name,
Data type, Size, Format, Range(s), Validation criteria, etc.)
• Support for user-designed entity types (this is often called the “extensibility” feature);
this facility is often exploited in support of data modelling, to record and cross-
reference entities, relationships, data flows, data stores, processes, etc.
• The ability to distinguish between versions of entities (e.g., test and production)
• enforcement of in-house standards and conventions.
• comprehensive reporting facilities, including both “canned” reports and a reporting
language for user-designed reports; typical reports include:
• detail reports of entities, summary reports of entities, component reports (e.g.,
record-element structures), cross-reference reports (e.g., element keyword indexes),
and where-used reports (e.g., element-record-program cross-references).
• a query facility, both for administrators and casual users, which includes the ability to
perform generic searches on business definitions, user descriptions, synonyms, etc.
• language interfaces, to allow, for example, standard record layouts to be
automatically incorporated into programs during the compile process.
• automated input facilities (e.g., to load record descriptions from a copy library).
• security features
• adequate performance tuning abilities
• support for DBMS administration, such as automatic generation of DDL (Data
Definition Language).
Data Dictionary Benefits
The benefits of a fully utilised data dictionary are substantial. A data dictionary has the
potential to:
• facilitate data sharing by enabling database classes to automatically handle multi-user
coordination, buffer layouts, data validation, and performance optimisations,
• improve the ease of understanding of data definitions,
• ensure that there is a single authoritative source of reference for all users,
• facilitate application integration by identifying data redundancies,
• reduce development lead times by simplifying documentation and automating
programming activities.
• reduce maintenance effort by identifying the impact of change as it affects:
• users,
• database administrators,
• programmers.
• improve the quality of application software by enforcing standards in the
development process
• ensure application system longevity by maintaining documentation beyond project
completions
• data dictionary information created under one database system can easily be used to
generate the same database layout on any of the other database systems that BFC
supports (Oracle, MS SQL Server, Access, DB2, Sybase, SQL Anywhere, etc.)
These benefits are maximised by a fully utilised data dictionary.
Disadvantages of Data Dictionary
• A DDS is a useful management tool, but it also has several disadvantages.
• It needs careful planning. We would need to define the exact requirements, design
its contents, and then test, implement, and evaluate it. The cost of a DDS includes not
only the initial price of its installation and any hardware requirements, but also the
cost of collecting the information, entering it into the DDS, keeping it up to date, and
enforcing standards. The use of a DDS requires management commitment, which is
not easy to achieve, particularly where the benefits are intangible and long term.
RELATIONAL ALGEBRA
Relational algebra is a procedural query language. It gives a step-by-step process to
obtain the result of a query. It uses operators to perform queries.
Select Operation:
The select operation selects tuples that satisfy a given predicate. It is denoted by sigma
(σ).
Notation: σ p(r)
Where σ denotes selection, r is a relation, and p is the selection predicate: a propositional
logic formula which may use connectives such as AND, OR, and NOT, together with comparison
operators such as =, ≠, ≥, <, >, ≤.
Input:
σ BRANCH_NAME="perryride" (LOAN)
Output:
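The output table is given as a figure in the original notes. As a complementary sketch, selection can be mimicked in Python by filtering a list of rows (the LOAN data below is invented for illustration):

loan = [
    {"LOAN_NO": "L-17", "BRANCH_NAME": "perryride", "AMOUNT": 1000},
    {"LOAN_NO": "L-23", "BRANCH_NAME": "downtown",  "AMOUNT": 2000},
]

# σ BRANCH_NAME="perryride" (LOAN): keep only the tuples satisfying the predicate.
selected = [t for t in loan if t["BRANCH_NAME"] == "perryride"]
print(selected)   # [{'LOAN_NO': 'L-17', 'BRANCH_NAME': 'perryride', 'AMOUNT': 1000}]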
Project Operation:
This operation shows the list of those attributes that we wish to appear in the result.
The rest of the attributes are eliminated from the table.
It is denoted by ∏.
Notation: ∏ A1, A2, …, An (r)
Where
A1, A2, …, An are attribute names of relation r.
Input:
∏ NAME, CITY (CUSTOMER)
Output:
Union Operation:
Suppose there are two relations R and S. The union operation contains all the tuples that
are either in R or in S or in both R and S. It eliminates duplicate tuples. It is denoted by ∪.
Notation: R ∪ S
A union operation must hold the following condition:
• R and S must have the same number of attributes.
• Duplicate tuples are eliminated automatically.
Example:
DEPOSITOR RELATION
BORROW RELATION
Input:
∏ CUSTOMER_NAME (BORROW) ∪ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
Set Intersection:
Suppose there are two relations R and S. The set intersection operation contains all tuples that
are in both R and S. It is denoted by ∩.
Notation: R ∩ S
Example: Using the above DEPOSITOR table and BORROW table
Input:
∏ CUSTOMER_NAME (BORROW) ∩ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
Set Difference:
Suppose there are two relations R and S. The set difference operation contains all tuples that
are in R but not in S.
It is denoted by minus (-).
Notation: R - S
Example: Using the above DEPOSITOR table and BORROW table
Input:
∏ CUSTOMER_NAME (BORROW) - ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
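The union, intersection, and difference of the projected CUSTOMER_NAME columns behave exactly like Python set operations; a sketch with invented names:

borrow_names    = {"Johnson", "Smith", "Hayes"}     # ∏ CUSTOMER_NAME (BORROW)
depositor_names = {"Johnson", "Jones", "Lindsay"}   # ∏ CUSTOMER_NAME (DEPOSITOR)

print(borrow_names | depositor_names)  # union: names in either relation, duplicates removed
print(borrow_names & depositor_names)  # intersection: names present in both relations
print(borrow_names - depositor_names)  # difference: names that borrow but do not deposit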
Cartesian product
The Cartesian product is used to combine each row in one table with each row in the
other table. It is also known as a cross product. It is denoted by X.
Notation: E X D
Example:
EMPLOYEE
Input:
EMPLOYEE X DEPARTMENT
Output:
Rename Operation:
The rename operation is used to rename the output relation. It is denoted by rho (ρ).
Example: We can use the rename operator to rename STUDENT relation to STUDENT1.
ρ(STUDENT1, STUDENT)
Join Operations:
A Join operation combines related tuples from different relations, if and only if a given
join condition is satisfied. It is denoted by ⋈.
Example:
EMPLOYEE
SALARY
Result:
Natural Join:
A natural join is the set of tuples of all combinations in R and S that are equal on their
common attribute names.
It is denoted by ⋈.
Example: Let's use the above EMPLOYEE table and SALARY table:
Input:
∏EMP_NAME, SALARY (EMPLOYEE ⋈ SALARY)
Output:
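A sketch of the natural join in Python, assuming EMP_NAME is the common attribute of the two tables (the rows below are invented), followed by the projection ∏ EMP_NAME, SALARY shown above:

employee = [
    {"EMP_NAME": "Stephan", "BRANCH": "Kazaure"},
    {"EMP_NAME": "Amina",   "BRANCH": "Kano"},
]
salary = [
    {"EMP_NAME": "Stephan", "SALARY": 50000},
    {"EMP_NAME": "Amina",   "SALARY": 60000},
]

# Natural join: combine tuples that agree on the common attribute EMP_NAME.
joined = [{**e, **s} for e in employee for s in salary if e["EMP_NAME"] == s["EMP_NAME"]]

# ∏ EMP_NAME, SALARY (EMPLOYEE ⋈ SALARY)
print([{"EMP_NAME": t["EMP_NAME"], "SALARY": t["SALARY"]} for t in joined])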
Outer Join:
The outer join operation is an extension of the join operation. It is used to deal with missing
information.
Example:
EMPLOYEE
FACT_WORKERS
Input:
(EMPLOYEE ⋈ FACT_WORKERS)
Output:
b. Right outer join: Right outer join contains all the tuples of the natural join of R and S,
together with the tuples in S that have no matching tuples in R (their missing attributes
are padded with nulls).
It is denoted by ⟖.
Example: Using the above EMPLOYEE table and FACT_WORKERS Relation
Input:
EMPLOYEE ⟖ FACT_WORKERS
Output:
c. Full outer join: Full outer join is like a left or right outer join except that it contains all
rows from both tables: the tuples of the natural join, together with the tuples in R that
have no matching tuples in S and the tuples in S that have no matching tuples in R.
It is denoted by ⟗.
Example: Using the above EMPLOYEE table and FACT_WORKERS table
Input:
EMPLOYEE ⟗ FACT_WORKERS
Output:
d. Equi join:
It is also known as an inner join. It is the most common join. It is based on matched data
as per the equality condition. The equi join uses the comparison operator (=).
Example:
CUSTOMER RELATION
PRODUCT
Input:
CUSTOMER ⋈ PRODUCT
Output:
Transaction
A transaction can be defined as a group of tasks. A single task is the minimum
processing unit which cannot be divided further.
ACID Properties
A transaction is a very small unit of a program and it may contain several low-level
tasks. A transaction in a database system must maintain Atomicity, Consistency, Isolation, and
Durability (commonly known as the ACID properties) in order to ensure accuracy,
completeness, and data integrity.
• Atomicity − This property states that a transaction must be treated as an atomic unit,
that is, either all of its operations are executed or none. There must be no state in a
database where a transaction is left partially completed. States should be defined
either before the execution of the transaction or after the execution/abortion/failure
of the transaction.
• Consistency − The database must remain in a consistent state after any transaction.
No transaction should have any adverse effect on the data residing in the database. If
the database was in a consistent state before the execution of a transaction, it must
remain consistent after the execution of the transaction as well.
• Durability − The database should be durable enough to hold all its latest updates even
if the system fails or restarts. If a transaction updates a chunk of data in a database
and commits, then the database will hold the modified data. If a transaction commits
but the system fails before the data could be written on to the disk, then that data will
be updated once the system springs back into action.
• Isolation − In a database system where more than one transaction is being executed
simultaneously and in parallel, the property of isolation states that all the transactions
will be carried out and executed as if each were the only transaction in the system. No
transaction will affect the existence of any other transaction.
• Serializability: When multiple transactions are executed by the operating
system in a multiprogramming environment, there are possibilities that instructions
of one transaction are interleaved with those of some other transaction.
✓ Schedule − A chronological execution sequence of a transaction is called a
schedule. A schedule can have many transactions in it, each comprising a
number of instructions/tasks.
✓ Serial Schedule − It is a schedule in which transactions are aligned in such a way
that one transaction is executed first. When the first transaction completes its
cycle, then the next transaction is executed. Transactions are ordered one after
the other. This type of schedule is called a serial schedule, as transactions are
executed in a serial manner.
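Before turning to schedules in more detail, the sketch below makes atomicity and durability concrete using Python's sqlite3 module (the account table and amounts are invented): the two updates inside the transaction either both take effect on commit, or neither does if the transaction is rolled back.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO account VALUES ('A', 100), ('B', 50)")
con.commit()

try:
    # One transaction: transfer 30 from A to B, all or nothing.
    con.execute("UPDATE account SET balance = balance - 30 WHERE name = 'A'")
    con.execute("UPDATE account SET balance = balance + 30 WHERE name = 'B'")
    con.commit()        # both updates become durable together
except sqlite3.Error:
    con.rollback()      # on any failure, neither update is applied

print(list(con.execute("SELECT name, balance FROM account")))  # [('A', 70), ('B', 80)]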
In a multi-transaction environment, serial schedules are considered a benchmark.
The execution sequence of an instruction in a transaction cannot be changed, but two
transactions can have their instructions executed in a random fashion. This execution does
no harm if two transactions are mutually independent and working on different segments of
data; but in case these two transactions are working on the same data, then the results may
vary. This ever-varying result may bring the database to an inconsistent state. To resolve this
problem, we allow parallel execution of a transaction schedule, if its transactions are either
serializable or have some equivalence relation among them.
Equivalence Schedules
An equivalence schedule can be of the following types −
o Result Equivalence: If two schedules produce the same result after execution, they
are said to be result equivalent. They may yield the same result for some value and
different results for another set of values. That's why this equivalence is not generally
considered significant.
o View Equivalence: Two schedules would be view equivalent if the transactions in
both the schedules perform similar actions in a similar manner.
For example −
o If T reads the initial data in S1, then it also reads the initial data in S2.
o If T reads the value written by J in S1, then it also reads the value written by J in S2.
o If T performs the final write on the data value in S1, then it also performs the final
write on the data value in S2.
Conflict Equivalence
Two operations would be conflicting if they have the following properties −
o They belong to separate transactions.
o They access the same data item.
o At least one of them is a "write" operation.
Two schedules having multiple transactions with conflicting operations are said to be conflict
equivalent if and only if −
o Both the schedules contain the same set of Transactions.
o The order of conflicting pairs of operations is maintained in both the schedules.
Note − View equivalent schedules are view serializable and conflict equivalent schedules are
conflict serializable. All conflict serializable schedules are view serializable too.
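The conflict rules above can be turned into a small serializability test: build a precedence graph with an edge Ti → Tj for every conflicting pair in which Ti's operation comes first, and declare the schedule conflict serializable if and only if that graph has no cycle. A sketch (the schedule encoding as (transaction, action, data item) tuples is an assumption):

def conflicting(op1, op2):
    # Operations conflict if they belong to different transactions,
    # access the same data item, and at least one of them is a write.
    t1, action1, item1 = op1
    t2, action2, item2 = op2
    return t1 != t2 and item1 == item2 and "W" in (action1, action2)

def is_conflict_serializable(schedule):
    # schedule: list of (transaction, action, data item), e.g. ("T1", "R", "X").
    edges = {t: set() for t, _, _ in schedule}
    for i in range(len(schedule)):
        for j in range(i + 1, len(schedule)):
            if conflicting(schedule[i], schedule[j]):
                edges[schedule[i][0]].add(schedule[j][0])   # earlier -> later

    visited, on_path = set(), set()
    def has_cycle(node):
        visited.add(node)
        on_path.add(node)
        for nxt in edges[node]:
            if nxt in on_path or (nxt not in visited and has_cycle(nxt)):
                return True
        on_path.discard(node)
        return False

    # Conflict serializable iff the precedence graph is acyclic.
    return not any(has_cycle(t) for t in edges if t not in visited)

schedule = [("T1", "R", "X"), ("T2", "W", "X"), ("T1", "W", "X")]
print(is_conflict_serializable(schedule))   # False: cycle T1 -> T2 -> T1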
Real-Time Transaction Systems
In systems with real-time constraints, correctness of execution involves both database
consistency and the satisfaction of deadlines. Real time systems are classified as:
• Hard: The task has zero value if it is completed after the deadline.
• Soft: The task has diminishing value if it is completed after the deadline.
A transactional workflow is an activity that involves the coordinated execution of multiple
tasks performed by different processing entities, e.g., bank loan processing, purchase order
processing.
Transaction Processing Monitors
TP monitors were initially developed as multithreaded servers to support large
numbers of terminals from a single process. They provide infrastructure for building and
administering complex transaction processing systems with a large number of clients and
multiple servers.
A transaction-processing monitor has components for the input queue, authorisation,
output queue, network, lock manager, recovery manager, log manager, application servers,
database manager, and resource managers.
A transaction in a database can be in one of the following states:
• Active − In this state, the transaction is being executed. This is the initial state of every
transaction.
• Partially Committed − When a transaction executes its final operation, it is said to be
in a partially committed state.
• Failed − A transaction is said to be in a failed state if any of the checks made by the
database recovery system fails. A failed transaction can no longer proceed further.
• Aborted − If any of the checks fails and the transaction has reached a failed state, then
the recovery manager rolls back all its write operations on the database to bring the
database back to its original state where it was prior to the execution of the
transaction. Transactions in this state are called aborted. The database recovery
module can select one of the two operations after a transaction aborts −
o Re-start the transaction
o Kill the transaction
• Committed − If a transaction executes all its operations successfully, it is said to be
committed. All its effects are now permanently established on the database system.
CONCURRENCY CONTROL
In a multiprogramming environment where multiple transactions can be executed
simultaneously, it is highly important to control the concurrency of transactions. We have
concurrency control protocols to ensure atomicity, isolation, and serializability of concurrent
transactions. Concurrency control protocols can be broadly divided into two categories −
• Lock based protocols
• Timestamp based protocols
• Two-Phase Locking (2PL): This locking protocol divides the execution phase of a
transaction into three parts. In the first part, when the transaction starts executing, it
seeks permission for the locks it requires. The second part is where the transaction
acquires all the locks. As soon as the transaction releases its first lock, the third phase
starts. In this phase, the transaction cannot demand any new locks; it only releases
the acquired locks.
Strict Two-Phase Locking (Strict-2PL) -
Deadlock Prevention: to prevent deadlocks, the DBMS inspects the operations that transactions
are about to execute and analyzes if they can create a deadlock situation. If it finds that a
deadlock situation might occur, then that transaction is never allowed to be executed.
There are deadlock prevention schemes that use the timestamp-ordering mechanism of
transactions in order to predetermine a deadlock situation.
Wait-Die Scheme
• In this scheme, if a transaction requests to lock a resource (data item) that is
already held with a conflicting lock by another transaction, then one of two
possibilities may occur:
• If TS(Ti) < TS(Tj) − that is, Ti, which is requesting a conflicting lock, is older than
Tj − then Ti is allowed to wait until the data item is available.
• If TS(Ti) > TS(Tj) − that is, Ti is younger than Tj − then Ti dies. Ti is restarted later
with a random delay but with the same timestamp.
This scheme allows the older transaction to wait but kills the younger one.
Wound-Wait Scheme
• In this scheme, if a transaction requests to lock a resource (data item) that is
already held with a conflicting lock by another transaction, one of two
possibilities may occur:
• If TS(Ti) < TS(Tj), then Ti forces Tj to be rolled back − that is, Ti wounds Tj. Tj is
restarted later with a random delay but with the same timestamp.
• If TS(Ti) > TS(Tj), then Ti is forced to wait until the resource is available.
This scheme allows the younger transaction to wait; but when an older transaction
requests an item held by a younger one, the older transaction forces the younger one to abort
and release the item.
In both cases, the transaction that enters the system at a later stage is aborted.
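Both schemes reduce to a timestamp comparison. The sketch below is illustrative only; it assumes a smaller timestamp means an older transaction, and returns the action taken when Ti requests an item that Tj holds with a conflicting lock:

    # Sketch of the wait-die and wound-wait decisions.
    # ts_i, ts_j are the timestamps of the requesting transaction Ti and the
    # holding transaction Tj; a smaller timestamp means an older transaction.

    def wait_die(ts_i, ts_j):
        if ts_i < ts_j:
            return "Ti waits"                      # older requester is allowed to wait
        return "Ti dies (restarted later with the same timestamp)"

    def wound_wait(ts_i, ts_j):
        if ts_i < ts_j:
            return "Tj is wounded (rolled back, restarted with the same timestamp)"
        return "Ti waits"                          # younger requester waits

    print(wait_die(ts_i=5, ts_j=9))    # Ti is older   -> Ti waits
    print(wait_die(ts_i=9, ts_j=5))    # Ti is younger -> Ti dies
    print(wound_wait(ts_i=5, ts_j=9))  # Ti is older   -> Tj is wounded
    print(wound_wait(ts_i=9, ts_j=5))  # Ti is younger -> Ti waits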
Deadlock Avoidance
Aborting a transaction is not always a practical approach. Instead, deadlock avoidance
mechanisms can be used to detect any deadlock situation in advance. Methods like the "wait-for
graph" are available, but they are suitable only for systems where transactions are
lightweight and hold fewer instances of resources. In a heavier system, deadlock prevention
techniques may work better.
Wait-for Graph
This is a simple method available to track whether any deadlock situation may arise. For each
transaction entering the system, a node is created. When a transaction Ti requests a
lock on an item, say X, which is held by some other transaction Tj, a directed edge is created
from Ti to Tj. If Tj releases item X, the edge between them is dropped and Ti locks the data
item.
The system maintains this wait-for graph for every transaction waiting for some data
items held by others. The system keeps checking if there's any cycle in the graph.
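A wait-for graph can be kept as a simple adjacency map, with a depth-first search used to look for a cycle whenever an edge is added. The following Python sketch is illustrative only:

    # Sketch of a wait-for graph with cycle detection (deadlock check).
    wait_for = {}   # maps a waiting transaction to the set of transactions it waits for

    def add_edge(ti, tj):
        wait_for.setdefault(ti, set()).add(tj)

    def remove_edge(ti, tj):
        wait_for.get(ti, set()).discard(tj)

    def has_deadlock():
        """Depth-first search for a cycle in the wait-for graph."""
        visiting, done = set(), set()

        def dfs(node):
            visiting.add(node)
            for nxt in wait_for.get(node, set()):
                if nxt in visiting:
                    return True                 # back edge -> cycle -> deadlock
                if nxt not in done and dfs(nxt):
                    return True
            visiting.discard(node)
            done.add(node)
            return False

        return any(dfs(t) for t in list(wait_for) if t not in done)

    add_edge("T1", "T2")   # T1 waits for T2
    add_edge("T2", "T3")   # T2 waits for T3
    print(has_deadlock())  # False
    add_edge("T3", "T1")   # T3 waits for T1 -> cycle
    print(has_deadlock())  # True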
But how is locking done when data items are organized in a hierarchy of granularities? Locking
may be done by using intention-mode locking.
Intention Lock Modes
In addition to the S and X lock modes, there are three additional lock modes used with multiple
granularity:
• Intention-Shared (IS): indicates explicit locking at a lower level of the tree, but only
with shared locks.
• Intention-Exclusive (IX): indicates explicit locking at a lower level with exclusive or
shared locks.
• Shared & Intention-Exclusive (SIX): the node is locked in shared mode, and some
lower-level node is locked in exclusive mode by the same transaction.
Intention locks allow a higher-level node to be locked in shared (S) or exclusive (X) mode
without having to check all descendant nodes. Thus, this locking scheme provides
more concurrency while lowering lock overhead.
Compatibility Matrix with Intention Lock Modes:
The following matrix shows which lock modes are compatible with one another (Y = the two
modes can be held on the same node by different transactions at the same time, N = they cannot):
            IS   IX   S    SIX  X
      IS    Y    Y    Y    Y    N
      IX    Y    Y    N    N    N
      S     Y    N    Y    N    N
      SIX   Y    N    N    N    N
      X     N    N    N    N    N
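For illustration, the matrix can be encoded as a lookup table, and a requested mode granted only if it is compatible with every mode already held on the node (a sketch only):

    # Sketch: the intention-lock compatibility matrix as a lookup table.
    COMPAT = {
        "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
        "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
        "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
        "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
        "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
    }

    def can_grant(requested, held_modes):
        """Grant the requested mode only if it is compatible with all held modes."""
        return all(COMPAT[requested][held] for held in held_modes)

    print(can_grant("IS", ["IX"]))     # True
    print(can_grant("S",  ["IX"]))     # False
    print(can_grant("X",  []))         # True (no other locks on the node)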
Index Concurrency Control
Nonserializable concurrent access to an index is acceptable as long as the accuracy
of the index is maintained. In particular, the exact values read in an internal node of a B+-tree
are irrelevant so long as we land up in the correct leaf node. There are index concurrency
protocols where locks on internal nodes are released early, and not in a two-phase fashion.
Example of index concurrency protocol: Use crabbing instead of two-phase locking on
the nodes of the B+-tree, as follows. During search/insertion/deletion:
• First lock the root node in shared mode.
• After locking all required children of a node in shared mode, release the lock
on the node.
• During insertion/deletion, upgrade leaf node locks to exclusive mode.
• When splitting or coalescing requires changes to a parent, lock the parent in
exclusive mode.
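A much-simplified sketch of crabbing during a search is shown below; the node structure and latch objects are hypothetical, and lock upgrades, splits and coalescing are omitted for brevity:

    import threading

    # Simplified sketch of crabbing during a B+-tree search: latch the child,
    # then release the parent's latch, so latches are not held two-phase.

    class Node:
        def __init__(self, keys, children=None, values=None):
            self.keys = keys
            self.children = children or []     # empty for leaf nodes
            self.values = values or {}         # leaf payload: key -> record
            self.latch = threading.Lock()      # shared/exclusive modes omitted here

        def is_leaf(self):
            return not self.children

    def crabbing_search(root, key):
        root.latch.acquire()                   # first latch the root
        node = root
        while not node.is_leaf():
            # pick the child to follow (simplified separator-key rule)
            idx = sum(1 for k in node.keys if key >= k)
            child = node.children[idx]
            child.latch.acquire()              # latch the child...
            node.latch.release()               # ...then release the parent
            node = child
        result = node.values.get(key)
        node.latch.release()
        return result

    leaf1 = Node(keys=[], values={5: "rec5", 8: "rec8"})
    leaf2 = Node(keys=[], values={12: "rec12"})
    root = Node(keys=[10], children=[leaf1, leaf2])
    print(crabbing_search(root, 8))    # rec8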
FAILURE CLASSIFICATION
A DBMS may encounter a failure. These failures may be of the following types:
• Transaction failure: An ongoing transaction may fail due to:
▪ Logical errors: the transaction cannot be completed due to some internal error
condition.
▪ System errors: the database system must terminate an active transaction due
to an error condition (e.g., deadlock).
• System crash: A power failure or other hardware or software failure causes the
system to crash. Under the fail-stop assumption, non-volatile storage contents are
assumed to be uncorrupted by a system crash.
• Disk failure: A head crash or similar disk failure destroys all or part of the disk storage
capacity. Destruction is assumed to be detectable: disk drives use checksums to detect failures.
All these failures can leave the database in an inconsistent state. Thus, we need a recovery
scheme in a database system; but before we discuss recovery, let us briefly define the storage
structure from a recovery point of view.
Storage Structure
There are various types of storage media:
Volatile storage
• Does not survive system crashes; examples: main memory, cache memory.
Non-volatile storage
• Survives system crashes; examples: disk, tape, flash memory, non-volatile (battery-backed) RAM.
Stable storage
• A mythical form of storage that survives all failures.
• Approximated by maintaining multiple copies on distinct non-volatile media.
Stable-Storage Implementation
A stable storage maintains multiple copies of each block on separate disks. Copies can
be kept at remote sites to protect against disasters such as fire or flooding. Failure during data
transfer can still result in inconsistent copies. A block transfer can result in:
• Successful completion
• Partial failure: the destination block has incorrect information
• Total failure: the destination block was never updated
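A common way to implement this is to write the copies one after the other, so that at most one copy can be left inconsistent, and to repair a damaged or differing copy from the good one during recovery. The Python sketch below is illustrative only; the file names, block layout and checksum scheme are invented for the example:

    import zlib

    # Sketch: stable-storage block write with two copies and a checksum.
    # A block is stored as 4 checksum bytes followed by the data.

    def write_copy(path, data):
        checksum = zlib.crc32(data).to_bytes(4, "big")
        with open(path, "wb") as f:
            f.write(checksum + data)
            f.flush()          # a real implementation would also fsync the file

    def read_copy(path):
        with open(path, "rb") as f:
            raw = f.read()
        checksum, data = raw[:4], raw[4:]
        ok = zlib.crc32(data).to_bytes(4, "big") == checksum
        return ok, data

    def stable_write(data):
        write_copy("block_copy1.bin", data)    # finish the first copy completely...
        write_copy("block_copy2.bin", data)    # ...before touching the second one

    def stable_recover():
        ok1, d1 = read_copy("block_copy1.bin")
        ok2, d2 = read_copy("block_copy2.bin")
        if ok1 and ok2 and d1 == d2:
            return d1                          # successful completion
        if ok1:                                # copy 2 failed partially or totally
            write_copy("block_copy2.bin", d1)
            return d1
        write_copy("block_copy1.bin", d2)      # copy 1 is the damaged one
        return d2

    stable_write(b"account balance = 950")
    print(stable_recover())                    # b'account balance = 950'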
RECOVERY ALGORITHMS
If the log on stable storage at the time of the crash contains the records shown in cases (a), (b)
and (c), then:
o (a) No redo action needs to be performed.
o (b) redo(T1) must be performed, since <T1 commit> is present.
o (c) redo(T1) must be performed, followed by redo(T2), since
<T1 commit> and <T2 commit> are present.
Please note that you can repeat this sequence of redo operations as suggested in (c)
any number of times; it will still bring the values of X, Y and Z to the same consistent redo
values. This property of the redo operation is called idempotence.
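Idempotence is easy to see in code: each redo record simply overwrites a data item with its new value, so replaying the redo log once or many times leaves the database in the same state. The following sketch is illustrative only; the log records and values are invented for the example:

    # Sketch: redo records set items to their new values, so replaying the
    # redo log any number of times produces the same final state (idempotence).

    redo_log = [
        ("T1", "X", 950),      # <T1, X, 950>  new value written by T1
        ("T1", "Y", 2050),     # <T1, Y, 2050>
        ("T2", "Z", 600),      # <T2, Z, 600>
    ]

    def redo(database, log):
        for _txn, item, new_value in log:
            database[item] = new_value         # overwrite, never increment
        return database

    db = {"X": 1000, "Y": 2000, "Z": 700}
    once  = redo(dict(db), redo_log)
    twice = redo(redo(dict(db), redo_log), redo_log)
    print(once == twice)    # True: redo is idempotent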
Shadow Paging
Shadow paging is an alternative to log-based recovery; this scheme is useful if
transactions are executed serially. In this, two page tables are maintained during the lifetime
of a transaction – the current page table, and the shadow page table. It stores the shadow
page table in non-volatile storage, in such a way that the state of the database prior to
transaction execution may be recovered (shadow page table is never modified during
execution). To start with, both the page tables are identical. Only the current page table is
used for data item accesses during execution of the transaction. Whenever any page is about
to be written for the first time a copy of this page is made on an unused page, the current
page table is then made to point to the copy and the update is performed on the copy.
Once the pointer to the shadow page table has been updated to point to the current page table,
the transaction is committed. No recovery is needed after a crash; new transactions can start
right away, using the shadow page table. Pages not pointed to from the current/shadow page
table should be freed (garbage collected).
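The copy-on-write behaviour described above can be sketched as follows (illustrative Python only; the page store and page tables are plain dictionaries, and commit is reduced to a single pointer swap):

    # Sketch of shadow paging: the shadow page table is never modified; pages
    # are copied on first write and the commit is a single pointer swap.

    pages = {"p1": "old A", "p2": "old B"}          # physical pages on disk
    shadow_table = {"A": "p1", "B": "p2"}           # state before the transaction
    current_table = dict(shadow_table)              # starts identical to the shadow table
    next_page_id = 3

    def write_item(item, value):
        """Copy-on-write: the first update of an item copies its page."""
        global next_page_id
        if current_table[item] == shadow_table[item]:      # first write to this page
            new_page = f"p{next_page_id}"
            next_page_id += 1
            pages[new_page] = pages[current_table[item]]   # copy to an unused page
            current_table[item] = new_page                 # current table points to the copy
        pages[current_table[item]] = value                 # update the copy only

    def commit():
        """Atomically make the current page table the new shadow page table."""
        global shadow_table
        shadow_table = dict(current_table)

    write_item("A", "new A")
    print(pages[shadow_table["A"]])    # 'old A' - the shadow view is untouched
    commit()
    print(pages[shadow_table["A"]])    # 'new A' - after the pointer swap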
Advantages of shadow-paging over log-based schemes:
• It has no overhead of writing log records,
• The recovery is trivial.
Disadvantages:
• Copying the entire page table is very expensive, it can be reduced by using a page table
structured like a B+-tree (no need to copy entire tree, only need to copy paths in the
tree that lead to updated leaf nodes).
• Commit overhead is high even with the above extension (Need to flush every updated
page, and page table).
• Data gets fragmented (related pages get separated on disk).
• After every transaction completion, the database pages containing old versions of
modified data need to be garbage collected/freed.
• Hard to extend algorithm to allow transactions to run concurrently (easier to extend
log based schemes).
BUFFER MANAGEMENT
When the database is updated, a lot of records are changed in the buffers allocated
to the log records and database records. Although buffer management is the job of the
operating system, the DBMS sometimes prefers buffer-management policies of its
own.
Log records are buffered in main memory, instead of being output directly to
stable storage. Log records are output to stable storage when a block of log records
in the buffer is full, or when a log force operation is executed. A log force is performed to commit a
transaction by forcing all its log records (including the commit record) to stable storage.
Several log records can thus be output using a single output operation, reducing the I/O cost.
• Log records are output to stable storage in the order in which they are created.
• Transaction Ti enters the commit state only when the log record <Ti commit> has
been output to stable storage.
• Before a block of data in the main memory is output to the database, all log records
pertaining to data in that block must be output to a stable storage.
• These rules are also called the write-ahead logging scheme.
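These rules can be summarised in a small sketch of a combined log/buffer manager (illustrative Python only; LSN stands for log sequence number, and the log-force calls in commit and in the block flush are the write-ahead logging rules in action):

    # Sketch of write-ahead logging: log records are forced to stable storage
    # before the commit completes and before any affected data block is flushed.

    log_buffer = []          # in-memory log records: (lsn, txn, description)
    stable_log = []          # simulated stable storage for the log
    next_lsn = 1

    def append_log(txn, description):
        global next_lsn
        log_buffer.append((next_lsn, txn, description))
        next_lsn += 1
        return next_lsn - 1

    def log_force(up_to_lsn):
        """Output buffered log records, in creation order, up to the given LSN."""
        global log_buffer
        stable_log.extend(r for r in log_buffer if r[0] <= up_to_lsn)
        log_buffer = [r for r in log_buffer if r[0] > up_to_lsn]

    def commit(txn):
        lsn = append_log(txn, "commit")
        log_force(lsn)                      # Ti commits only after <Ti commit> is stable

    def flush_data_block(block_last_lsn):
        log_force(block_last_lsn)           # write-ahead: log first, then the data block
        print("data block written to disk")

    lsn = append_log("T1", "X: 1000 -> 950")
    commit("T1")
    flush_data_block(lsn)
    print(stable_log)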
Database Buffering
The database maintains an in-memory buffer of data blocks. When a new block is
needed and the buffer is full, an existing block has to be removed from the buffer. If the block
chosen for removal has been updated, it must be output to disk. As per the
write-ahead logging scheme, before a block with uncommitted updates is output to disk, the log
records with undo information for those updates must be output to the log on stable storage.
No updates should be in progress on a block when it is output to disk. This can be ensured as
follows:
• Before writing a data item, the transaction acquires an exclusive lock on the block
containing the data item.
• The lock can be released once the write is completed. (Such locks, held for a short
duration, are called latches.)
• Before a block is output to disk, the system acquires an exclusive latch on the block
(ensures no update can be in progress on the block).
A database buffer can be implemented either in an area of real main memory
reserved for the database, or in virtual memory. Implementing the buffer in reserved main
memory has drawbacks: memory is partitioned beforehand between the database buffer and
applications, thereby limiting flexibility. Needs may change over time and, although the
operating system knows best how memory should be divided at any given moment, it cannot
change the partitioning of memory.
• Database Management System (DBMS) − It stores the data relevant to the decision
problem. Internal data are generated by a system such as TPS and MIS. External data come
from a variety of sources such as newspapers, online data services, and databases (financial,
marketing, human resources).
• Model Management System − It stores and accesses models that managers use to
make decisions. Such models are used for designing manufacturing facility, analyzing
the financial health of an organization, forecasting demand of a product or service,
etc.
• Support Tools − Support tools like online help, pull-down menus, user interfaces,
graphical analysis, and error-correction mechanisms facilitate user interaction with
the system.
Classification of DSS
There are several ways to classify DSS. Holsapple and Whinston classify DSS as follows −
• Text Oriented DSS − It contains textually represented information that could have a
bearing on decision. It allows documents to be electronically created, revised and
viewed as needed.
• Database Oriented DSS − Database plays a major role here; it contains organized and
highly structured data.
• Spreadsheet Oriented DSS − It contains information in spreadsheets that allows the
user to create, view and modify procedural knowledge and also instructs the system
to execute self-contained instructions. The most popular tools are Excel and Lotus 1-2-3.
• Solver Oriented DSS − It is based on a solver, which is an algorithm or procedure
written for performing certain calculations and particular program type.
• Rules Oriented DSS − Procedures are adopted in rules-oriented DSS. An expert system
is an example.
• Compound DSS − It is built by using two or more of the five structures explained above.
Types of DSS
Following are some typical DSSs −
• Status Inquiry System − It helps in taking operational, management level, or middle
level management decisions, for example daily schedules of jobs to machines or
machines to operators.
• Data Analysis System − It needs comparative analysis and makes use of formula or an
algorithm, for example cash flow analysis, inventory analysis etc.
• Information Analysis System − In this system data is analyzed and the information
report is generated. For example, sales analysis, accounts receivable systems, market
analysis etc.
• Accounting System − It keeps track of accounting and finance related information, for
example, final account, accounts receivables, accounts payables, etc. that keep track
of the major aspects of the business.
• Model Based System − Simulation models or optimization models used for decision-
making; they are used infrequently and create general guidelines for operations or
management.
DATA MINING
Data mining refers to a process that is used to turn raw data into meaningful information.
Many organizations follow the data mining process to transform data
into useful information. It helps organizations build more innovative strategies, increase
sales, generate revenue, and grow the business by reducing costs.
Data Mining Techniques
Below are the common data mining techniques:
• Classification analysis: This is used to assign data items to distinct classes. It is used
to retrieve important information about data and metadata.
• Association Rule Learning: This refers to the process of identifying relationships
between distinct variables in a large set of data.
• Outlier detection: Outlier detection refers to observations in a database that
do not match an expected pattern.
• Clustering Analysis: A 'cluster' is a collection of data objects that are similar
to one another within the same cluster.
• Regression Analysis: This is the process of analyzing and identifying the relationships
among different variables.
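As a small illustration of association rule learning, the support and confidence of a candidate rule can be computed directly from transaction data. The baskets below are invented for the example:

    # Sketch: support and confidence for the rule {bread} -> {butter}
    # computed from a tiny, made-up set of market-basket transactions.

    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "butter"},
        {"bread", "butter", "jam"},
    ]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(antecedent, consequent):
        return support(antecedent | consequent) / support(antecedent)

    print(support({"bread", "butter"}))          # 0.6  (3 of 5 baskets)
    print(confidence({"bread"}, {"butter"}))     # 0.75 (3 of the 4 bread baskets)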
Data Analysis
Data analysis is a method that can be used to investigate, analyze, and demonstrate
data to find useful information. There are several types of that, but usually, people think
about the quantitative data first. For example, the data comes after surveying, census data.
Let's understand the concept of data analysis with the help of a day-to-day example.
Suppose there is a retail shop like ShopRite, and some products always expire
before they are sold; this means a financial loss for the company. So how do you
minimize the loss?
Let's have a look at the available data.
The products can be categorized into various categories like food products, beverages, cloth
sections, etc. They can further categorize these products and eventually form a tree.
The retail shop manager has the list of products sold on each day, the peak hours of the
store, the products sold during different hour zones, the number of customers on each day, and
a lot of other related information. Based on all this information, they can figure out
which products sell at what time of the day. They can also split the data by season to see
which products sell during which season, and thereby find which products sell
very little.
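One simple way to carry out such an analysis is to group the sales records by product and by hour of the day. The sketch below uses pandas on a tiny, invented sample; the column names and figures are hypothetical:

    import pandas as pd

    # Sketch: find low-selling products and see what sells at which hour,
    # using a small, invented sample of sales records.

    sales = pd.DataFrame({
        "product":  ["milk", "bread", "milk", "yogurt", "bread", "yogurt"],
        "hour":     [9, 9, 18, 18, 12, 9],
        "quantity": [20, 15, 35, 2, 10, 1],
    })

    # Total quantity sold per product, lowest first: candidates for expiry losses.
    per_product = sales.groupby("product")["quantity"].sum().sort_values()
    print(per_product)

    # Quantity sold per product in each hour zone: when does each product sell?
    per_hour = sales.groupby(["hour", "product"])["quantity"].sum()
    print(per_hour)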
Methods of Data Analysis
There are two methods of data analysis: qualitative and quantitative.
• Qualitative research: Primarily, it describes the product characteristics. It does not
utilize any number. It emphasizes the quality of the product.
• Quantitative research: It is the inverse of qualitative research because its primary
focus is numbers. Quantitative research is all about quantity.
Data Mining VS Data Analysis
Data mining and data analysis are major steps in any project based on data-driven
decisions, and both need to be done efficiently to ensure the success of the project.
Nowadays, data analysis and strategy development play a vital role in collecting important
information from the available data sets.
First, all the data is kept in the data warehouse, and then it is used for the business
intelligence requirements. There are various concepts and views regarding data mining and
data analysis, but you can say that both terms are subsets of business intelligence. Data
mining and data analysis are similar, so finding the difference between them can be a little
difficult. In short, data mining focuses on discovering previously unknown patterns in large
data sets, while data analysis focuses on examining, cleaning, and modelling existing data in
order to test hypotheses and draw conclusions.
Data Warehouse
• Subject-Oriented: A data warehouse targets the modelling and analysis of data for
decision-makers. Therefore, data warehouses typically provide a concise and
straightforward view around a particular subject, such as customer, product, or sales,
instead of the global organization's ongoing operations. This is done by excluding data
that are not useful concerning the subject and including all data needed by the users
to understand the subject.
• Non-Volatile: A data warehouse is a separate physical store of data. It does not require
transaction processing, recovery, and concurrency control mechanisms, which allows for
substantial speedup of data retrieval. Non-volatile means that, once entered into the
warehouse, data should not change.
Benefits of data warehousing include:
• Queries that would be complex in many normalized databases can be easier to build
and maintain in data warehouses.
• Data warehousing is an efficient method to manage demand for lots of information
from lots of users.
• Data warehousing provides the capability to analyze a large amount of historical data.
BIG DATA
Big Data is different from, and goes beyond, what a standard database handles.
Standard relational databases are efficient for storing and processing structured data; they use
tables to store the data and Structured Query Language (SQL) to access and retrieve it.
Big Data also includes unstructured and semi-structured data. There is a class of databases
known as NoSQL databases, and several types of NoSQL databases and tools are available to
store and process Big Data. NoSQL databases are optimized for analytics over Big Data such as
text, images, and logos, and other data formats such as XML and JSON. Big Data is helpful for
developing data-driven intelligent applications.
Big Data is a term applied to data sets whose size or type is beyond the ability of
traditional relational databases: a traditional database is not able to capture, manage, and
process such a high volume of data with low latency. A database, by contrast, is a collection of
information that is organized so that it can be easily captured, accessed, managed and
updated.
Big Data refers to technologies and initiatives that involve data that is too diverse (in its
varieties), too rapidly changing, or too massive for conventional skills, technologies, and
infrastructure to address efficiently, whereas a database management system (DBMS) extracts
information from the database in response to queries, but only under restricted conditions.
Big Data can contain any variety of data, while a database is defined through a schema. Big Data
is difficult to store and process, while in databases such as SQL databases, data can be easily
stored and processed.
Why it is so popular?
The reason it is so popular is due to the following characteristics:
• Volume: Volume is probably the best-known characteristic of Big Data. It is often said
that almost 90% of today's data was created in the past couple of years. Volume plays
a major role when considering Big Data.
• Variety: When we talk about Big Data, we need to consider data in all formats, that is,
the handling of structured, semi-structured and unstructured data. We are capturing
all varieties of data, whether it is a PDF, an image, a website click, or a video. This
mix of data varieties is very difficult to store and analyze.
• Velocity: Velocity is the speed or rate at which data is being generated, clicked,
refreshed, produced and accessed. Facebook generates about 500 TB of data per day,
YouTube receives about 400 hours of video uploads per minute, and Google processes
billions of searches per day.
• Variability: The inconsistency shown by the data at times can slow down the process.
It arises because data comes in multiple dimensions from multiple sources.
• Veracity: It refers to the accuracy of your data: how accurate the data is and how
meaningful it is for the analysis based on it.
Is big data a database?
Google Maps tells you the fastest route and saves your time. Amazon knows what you
want to buy, and Netflix recommends a list of movies you may be interested in watching.
If Big Data is capable of all this today, just imagine what it will be capable of tomorrow. The
amount of data available to us is only going to increase, and analytics technology will become
more advanced. It will be the solution for a smarter, more advanced life; maybe you will get a
notification on your smartphone prescribing some medicines because you may soon
encounter a health issue. It is going to change life and the way we look at it. A database,
whether SQL or NoSQL, is a tool used to store, process and analyze Big Data; Big Data itself is
not a database.
Characteristics of Big Data
Big Data contains a large amount of data that cannot be processed by traditional
data storage or processing units. It is used by many multinational companies to process
data and run the business of many organizations. The data flow can exceed 150 exabytes per
day before replication.
There are five characteristics that explain Big Data; they are called the 5 V's of
Big Data.
• Volume: The name Big Data itself is related to an enormous size. Big Data is a vast
'volumes' of data generated from many sources daily, such as business processes,
machines, social media platforms, networks, human interactions, and many more.
Facebook alone can generate approximately a billion messages and 4.5 billion "Like"
clicks, and more than 350 million new posts are uploaded each day. Big Data
technologies can handle such large volumes of data.
• Variety: Big Data can be structured, unstructured, or semi-structured, collected from
different sources. In the past, data was only collected from databases and
spreadsheets, but these days data comes in many forms: PDFs, emails, audio, social
media posts, photos, videos, and logs, i.e., log files created and maintained by servers
that contain lists of activities.
• Veracity: Veracity means how reliable the data is. There are many ways to filter or
translate the data; veracity is about being able to handle and manage data
efficiently. Big Data is also essential in business development.
• Value: Value is an essential characteristic of Big Data. It is not just any data that we process
or store; it is valuable and reliable data that we store, process, and analyze.
• Velocity: Velocity plays an important role compared to the others. Velocity refers to the
speed at which data is created in real time. It covers the speed of incoming data
streams, the rate of change, and bursts of activity. A primary aspect of Big Data is to
provide the demanded data rapidly. Big Data velocity deals with the speed at which data
flows from sources like application logs, business processes, networks, social
media sites, sensors, mobile devices, etc.