
COM 322 Introduction to Database Design II Adamu Isah

OBJECT-ORIENTED DATABASE
An object-oriented database, managed by an object-oriented database management system (OODBMS or ODBMS), is a database based on the principles of object-oriented programming (OOP): the data is represented and stored in the form of objects. Such systems are also simply called object databases.
A database is an organized store of data, and the software system used to manage databases is called a database management system (DBMS). There are many types of database management systems, such as hierarchical, network, relational, object-oriented, graph, and document databases.

Object-Oriented Database

Object database management systems (ODBMSs) are based on the objects of object-oriented programming (OOP). In OOP, an entity is represented as an object, and objects are held in memory. Objects have members such as fields, properties, and methods, and they have a life cycle that includes creation, use, and deletion. OOP has three key characteristics: encapsulation, inheritance, and polymorphism. Today, there are many popular OOP languages such as C++, Java, C#, Ruby, Python, JavaScript, and Perl.

The idea of object databases originated in 1985, and support now exists for various common OOP languages, such as C++, Java, C#, Smalltalk, and LISP. Common examples are GemStone (based on Smalltalk), Gbase (based on LISP), and Vbase (based on COP). Object databases suit applications that need complex data, heavy calculations, and fast results. Some of the common applications that use object databases are real-time systems, architectural and engineering 3D modelling, telecommunications, scientific products, molecular science, and astronomy.

Advantages of Object Databases

An ODBMS provides persistent storage for objects: you can create objects in your program, save them as they are in the database, and read them back later.

In a typical relational database, program data is stored in rows and columns. To use that data, the program must read it, convert it into objects, and hold those objects in memory. With an object database, you can save a class instance as it is and start using it again as soon as it is read back. Object databases give objects permanent persistence; objects can be kept in persistent storage indefinitely.

With a typical RDBMS, a layer of object-relational mapping (ORM) is needed to map database schemas to the objects in code. With an object database, reading data into objects is direct, without an ORM tool, which gives faster data access and better performance. Some object databases can be used from multiple languages; for example, the GemStone database supports the C++, Smalltalk, and Java programming languages.


Drawbacks of Object Databases

• Object databases are not as popular as RDBMSs, and it is difficult to find object database developers.
• Not many programming languages support object databases.
• RDBMSs have SQL as a standard query language; object databases have no equivalent standard.
• Object databases are difficult for non-programmers to learn.

Popular Object Databases

Here is a list of some of the popular object databases and their features.

Caché -
InterSystems' Caché is a high-performance object database. The Caché database engine is a set of services, including data storage, concurrency management, transactions, and process management; you can think of the Caché engine as a powerful database toolkit. It is also a full-featured relational database: all the data within a Caché database is available as true relational tables and can be queried and modified using standard SQL.

Caché offers the following features:

• The ability to model data as objects (each with an automatically created and
synchronized native relational representation) while eliminating both the
impedance mismatch between databases and object-oriented application
environments and much of the complexity of relational modelling
• A simpler, object-based concurrency model
• User-defined data types
• The ability to take advantage of methods and inheritance, including
polymorphism, within the database engine
• Object-extensions for SQL to handle object identity and relationships
• The ability to intermix SQL and object-based access within a single application,
using each for what they are best suited
• Control over the physical layout and clustering used to store data in order to
ensure the maximum performance for applications

Caché offers a broad set of tools, which include:

• ObjectScript, the language in which most of Caché is written.


• Native implementations of SQL, MultiValue, and Basic.
• A well-developed, built-in security model
• A suite of technologies and tools that provide rapid development for database and
web applications
• Native, object-based XML and web services support
• Device support (such as files, TCP/IP, printers)


• Automatic interoperability via Java, JDBC, ActiveX, .NET, C++, ODBC, XML, SOAP, Perl,
Python, and more
• Support for common Internet protocols: POP3, SMTP, MIME, FTP, and so on
• A reusable user portal for your end users
• Support for analyzing unstructured data
• Support for Business Intelligence (BI)
• Built-in testing facilities

ConceptBase -

ConceptBase.cc is a multi-user deductive database system with an object-oriented data model (data, class, metaclass, meta-metaclass, etc.), which makes it a powerful tool for metamodeling and engineering of customized modeling languages. The system is accompanied by a highly configurable graphical user interface that builds upon the logic-based features of the ConceptBase.cc server.

ConceptBase.cc is developed by the ConceptBase Team at the University of Skövde (HIS) and the University of Aachen (RWTH). It is available for Linux, Windows, and Mac OS X. There is also a pre-configured virtual appliance that contains the executable system.

Db4o -

Db4o is the world's leading open-source object database for Java and .NET. It offers fast native object persistence, ACID transactions, query-by-example, the S.O.D.A. object query API, automatic class schema evolution, and a small footprint.

ObjectDB/Object Database -

ObjectDB is a powerful Object-Oriented Database Management System (ODBMS). It is


compact, reliable, easy to use and extremely fast. ObjectDB provides all the standard
database management services (storage and retrieval, transactions, lock management, query
processing, etc.) but in a way that makes development easier and applications faster.

ObjectDB Database Key Features

• 100% pure Java Object-Oriented Database Management System (ODBMS).


• No proprietary API - managed only by standard Java APIs (JPA 2 / JDO 2).
• Extremely fast - faster than any other JPA / JDO product.
• Suitable for database files ranging from kilobytes to terabytes.
• Supports both Client-Server mode and Embedded mode.
• Single JAR with no external dependencies.
• The database is stored as a single file.
• Advanced querying and indexing capabilities.
• Effective in heavily loaded multi-user environments.
• Can easily be embedded in applications of any type and size.
• Tested with Tomcat, Jetty, GlassFish, JBoss, and Spring.
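For illustration, here is a minimal sketch of storing and reading an object through the standard JPA API that ObjectDB exposes. The Student entity, the use of a database file name as the persistence-unit name, and the query are assumptions made for this example and should be checked against the ObjectDB documentation.

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;
import javax.persistence.TypedQuery;
import java.util.List;

// Hypothetical entity class; any JPA provider, ObjectDB included, can persist it.
@Entity
public class Student {
    @Id
    private long id;
    private String name;

    public Student() {}                      // JPA requires a no-argument constructor
    public Student(long id, String name) { this.id = id; this.name = name; }
    public String getName() { return name; }

    public static void main(String[] args) {
        // Assumed ObjectDB convention: the persistence unit names a database file.
        EntityManagerFactory emf =
                Persistence.createEntityManagerFactory("objectdb:students.odb");
        EntityManager em = emf.createEntityManager();

        // Store the object as it is; no hand-written row/column mapping is involved.
        em.getTransaction().begin();
        em.persist(new Student(1L, "Mary Jones"));
        em.getTransaction().commit();

        // Read the objects back with a JPQL query.
        TypedQuery<Student> q = em.createQuery("SELECT s FROM Student s", Student.class);
        List<Student> all = q.getResultList();
        all.forEach(s -> System.out.println(s.getName()));

        em.close();
        emf.close();
    }
}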


ObjectDatabase++

ObjectDatabase++ (ODBPP) is an embeddable object-oriented database designed for


server applications that require minimal external maintenance. It is written in C++ as a real-
time ISAM level database with the ability to auto recover from system crashes while
maintaining database integrity.

Objectivity/DB -

Objectivity/DB is a scalable, high performance, distributed Object Database (ODBMS).


It is extremely good at handling complex data, where there are many types of connections
between objects and many variants.

Objectivity/DB runs on 32- or 64-bit processors running Linux, Mac OS X, UNIX (Oracle Solaris), or Windows. There are C++, C#, Java, and Python APIs. Data written by a C++ program on Linux can be read by a C# program on Windows and by a Java program on Mac OS X.

Objectivity/DB generally runs on POSIX file systems, but there are plugins that can be
modified for other storage infrastructure. Objectivity/DB client programs can be configured
to run on a standalone laptop, networked workgroups, large clusters or in grids or clouds with
no changes to the application code.

ObjectStore -

ObjectStore is an enterprise object-oriented database management system for C++ and Java. It delivers a multi-fold performance improvement by eliminating the middleware requirement to map and convert application objects into flat relational rows, directly persisting the objects within an application into an object store.

ObjectStore eliminates the need to flatten complex data for consumption in your application logic, reducing the overhead of a translation layer that converts complex objects into flat ones, dramatically improving performance and often entirely eliminating the need to manage a relational database system.

ObjectStore is object-oriented storage that integrates directly with Java or C++ applications and treats memory and persistent storage as one, improving the performance of application logic while fully maintaining ACID compliance under transactional and distributed loads.

Versant Object Database -

Versant Object-Oriented Database is an object database that supports native object persistence and is used to build complex, high-performance data management systems.

Key Benefits

• Real-time analytical performance


• Big Data management
• Cut development time by up to 40%
• Significantly lower total ownership cost
• High availability

WakandaDB -

WakandaDB is an object database that provides a native REST API to access interconnected DataClasses defined in server-side JavaScript. WakandaDB is the server within Wakanda.

Object-relational Databases

An object-relational database (ORD), managed by an object-relational database management system (ORDBMS), is a database that supports both objects and relational database features. OR databases are relational database management systems with support for an object-oriented database model: entities are represented as objects and classes, and OOP features such as inheritance are supported in the database schemas and in the query language.

PostgreSQL is the most popular pure ORDBMS. Some popular databases, including Microsoft SQL Server, Oracle, and IBM DB2, also support objects and can be considered ORDBMSs.

Object-Oriented Data Modelling

A data model is an abstraction of the real world. It allows you to deal with the
complexity inherent in a real-world problem by focusing on the essential and interesting
features of the data an organization needs. An object-oriented model is built around objects,
just as the E-R model is built around entities. However, an object encapsulates both data and
behaviour, implying that we can use the object-oriented approach not only for data
modelling, but also to model system behaviour. To thoroughly represent any real-world
system, you need to model both the data and the processes and behaviour that act on the
data. By allowing you to capture them together within a common representation, and by
offering benefits such as inheritance and code reuse, the object-oriented modelling approach
provides a powerful environment for developing complex systems.

The object-oriented systems development cycle, depicted in the figure below consists
of progressively and iteratively developing object representation through three phases—
analysis, design, and implementation—similar to the heart of the systems development life
cycle. In an iterative development model, the focus shifts from more abstract aspects of the
development process (analysis) to the more concrete ones over the lifetime of a project. Thus,
in the early stages of development, the model you develop is abstract, focusing on external
qualities of the system. As the model evolves, it becomes more and more detailed, the focus
shifting to how the system will be built and how it should function. The emphasis in modeling
should be on analysis and design, focusing on front-end conceptual issues rather than back-
end implementation issues that unnecessarily restrict design choices (Larman, 2004).

In the analysis phase, you develop a model of a real-world application, showing its
important properties. The model abstracts concepts from the application domain and
describes what the intended system must do, rather than how it will be done. It specifies the
functional behaviour of the system independent of concerns relating to the environment in


which it is to be finally implemented. Please note that during the analysis activities, your focus
should be on analysing and modeling the real-world domain of interest, not the internal
characteristics of the software system.

In the object-oriented design phase, you define how the analysis model focused on
the real world will be realized in the implementation environment. Therefore, your focus will
move to modeling the software system, which will be very strongly informed by the models
that you created during the analysis activities. Jacobson et al. (1992) cite three reasons for
using object-oriented design:

• The analysis model is not formal enough to be implemented directly in a


programming language. Moving seamlessly into the source code requires refining
the objects by making decisions about what operations an object will provide, what
the communication between objects should look like, what messages are to be
passed, and so forth.

• The system must be adapted to the environment in which the system will actually be
implemented. To accomplish that, the analysis model has to be transformed into a
design model, considering different factors such as performance requirements, real-
time requirements and concurrency, the target hardware and systems software, the
DBMS and programming language to be adopted, and so forth.
• The analysis results can be validated using object-oriented design. At this stage, you
can verify whether the results from the analysis are appropriate for building the
system and make any necessary changes to the analysis model during the next
iteration of the development cycle.

To develop the design model, you must identify and investigate the consequences that
the implementation environment will have on the design. All strategic design decisions, such
as how the DBMS is to be incorporated, how process communications and error handling are
to be achieved, what component libraries are to be reused, and so on are made. Next, you
incorporate those decisions into a first-cut design model that adapts to the implementation
environment. Finally, you formalize the design model to describe how the objects interact
with one another for each conceivable scenario.


Within each iteration, the design activities are followed by implementation activities (i.e.,
implementing the design using a programming language and/or a database management
system). If the design was done well, translating it into program code is a relatively
straightforward process, given that the design model already incorporates the nuances
of the programming language and the DBMS.

Coad and Yourdon (1991) identify several motivations and benefits of object oriented
modeling:

• The ability to tackle more challenging problem domains


• Improved communication among the users, analysts, designers, and
programmers
• Increased consistency among analysis, design, and programming activities
• Explicit representation of commonality among system components
• Robustness of systems
• Reusability of analysis, design, and programming results
• Increased consistency among all the models developed during object-oriented
analysis, design, and programming

Unified Modeling Language (UML)

Unified Modeling Language (UML) is a set of graphical notations backed by a common metamodel that is widely used both for business modeling and for specifying, designing, and implementing software systems. It culminated from the efforts of three leading experts, Grady Booch, Ivar Jacobson, and James Rumbaugh, who defined this object-oriented modeling language that has become an industry standard. UML builds upon and unifies the semantics and notations of their earlier methods (Booch, OOSE, and OMT) and of other leading object-oriented approaches.

UML notation is useful for graphically depicting object-oriented analysis and design
models. It not only allows you to specify the requirements of a system and capture the design
decisions; it also promotes communication among key persons involved in the development
effort. A developer can use an analysis or design model expressed in the UML notation as a
means to communicate with domain experts, users, and other stakeholders.

To represent a complex system effectively, the model you develop must consist of a
set of independent views or perspectives. UML allows you to represent multiple perspectives
of a system by providing different types of graphical diagrams, such as the use-case diagram,
class diagram, state diagram, sequence diagram, component diagram, and deployment
diagram. If these diagrams are used correctly together in the context of a well-defined
modeling process, UML allows you to analyze, design, and implement a system based on one
consistent conceptual model. UML also offers the ability to treat entity sets as true classes, with methods as well as data. The discussion below summarizes the common concepts, under their differing terminology, used by the E/R model and UML.


A class in UML is similar to an entity set in the E/R model. The notation for a class is rather different, however. The following figure shows the class that corresponds to the E/R entity set Movies.

The box for a class is divided into three parts. At the top is the name of the class. The middle has the attributes, which are like instance variables of a class. In the Movies class, we use the attributes title, year, length, and genre. The bottom portion is for methods. Neither the E/R model nor the relational model provides methods. However, they are an important concept, and one that actually appears in modern relational systems, called "object-relational" DBMSs. We might have added an instance method lengthInHours(). The UML specification does not say anything more about a method than the types of its arguments and the type of its return value. Perhaps this method returns length/60.0, but we cannot know that from the design.
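The same class can be written directly in an object-oriented language. The short Java sketch below mirrors the three compartments of the UML box; the field types and the body of lengthInHours() are assumptions, since the diagram fixes only names and signatures.

// A plain Java rendering of the UML Movies class.
public class Movies {
    // Attributes (middle compartment of the UML class box)
    private String title;
    private int year;
    private int length;      // running time in minutes
    private String genre;

    public Movies(String title, int year, int length, String genre) {
        this.title = title;
        this.year = year;
        this.length = length;
        this.genre = genre;
    }

    // Method (bottom compartment). The UML diagram gives only the signature;
    // this body is one plausible implementation.
    public double lengthInHours() {
        return length / 60.0;
    }
}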

The class diagram is one of the static diagrams in UML, addressing primarily the structural characteristics of the domain of interest. The class diagram also allows us to capture the responsibilities that classes can perform, without any specifics of the behaviors. Keep in mind that a database system is usually part of an overall system, whose underlying model should encompass all the different perspectives. It is important to note that UML class diagrams can be used for multiple purposes at various stages of the life cycle model.


A class is an entity type that has a well-defined role in the application domain about which the organization wishes to maintain state, behavior, and identity, while an object is an instance of a class that encapsulates data and behavior. State is an object's properties (attributes and relationships) and the values those properties have. Behavior is the way in which an object acts and reacts.

Representing Objects and Classes

In the object-oriented approach, we model the world in objects. Before applying the
approach to a real-world problem, therefore, we need to understand what an object and
some related concepts really are. A class is an entity type that has a well-defined role in the
application domain about which the organization wishes to maintain state, behavior, and
identity. A class is a concept, an abstraction, or a thing that makes sense and matters in an
application context. A class could represent a tangible or visible entity type (e.g., a person,
place, or thing); it could be a concept or an event (e.g., Department, Performance, Marriage,
Registration); or it could be an artifact of the design process (e.g., User Interface, Controller,
Scheduler). An object is an instance of a class (e.g., a particular person, place, or thing) that
encapsulates the data and behavior we need to maintain about that object. A class of objects
shares a common set of attributes and behaviors.

Entity types in the E-R model can be represented as classes and entity instances as
objects in the object model. But, in addition to storing a state (information), an object also
exhibits behavior, through operations that can examine or change its state. The state of an
object encompasses its properties (attributes and relationships) and the values those
properties have, and its behavior represents how an object acts and reacts. Thus, an object’s
state is determined by its attribute values and links to other objects. An object’s behavior
depends on its state and the operation being performed. An operation is simply an action that
one object performs in order to give a response to a request. You can think of an operation
as a service provided by an object (supplier) to its clients. A client sends a message to a
supplier, which delivers the desired service by executing the corresponding operation.

Consider an example student class and a particular object in this class, Mary Jones.
The state of this object is characterized by its attributes, say, name, date of birth, year,
address, and phone, and the values these attributes currently have. For example, name is
“Mary Jones,” year is “junior,” and so on. The object’s behavior is expressed through
operations such as calcGpa, which is used to calculate a student’s current grade point average.

The Mary Jones object, therefore, packages its state and its behavior together. Every
object has a persistent identity; that is, no two objects are the same. For example, if there are
two Student instances with the same value of an identifier attribute, they are still two
different objects. Even if those two instances have identical values for all the identifying
attributes of the object, the objects maintain their separate identities. At the same time, an
object maintains its own identity over its life. For example, if Mary Jones gets married and the
values of the attributes name, address, and phone change for her, she will still be represented
by the same object.


This can be depicted graphically using class diagram as shown in figure below. A class
diagram shows the static structure of an object-oriented model: the classes, their internal
structure, and the relationships in which they participate. In UML, a class is represented by a
rectangle with three compartments separated by horizontal lines. The class name appears in
the top compartment, the list of attributes in the middle compartment, and the list of
operations in the bottom compartment of a box. The figure shows two classes, Student and
Course, along with their attributes and operations.

A static object diagram, such as the one shown in the figure, is an instance of a class
diagram, providing a snapshot of the detailed state of a system at a point in time. In an object
diagram, an object is represented as a rectangle with two compartments. The names of the
object and its class are underlined and shown in the top compartment using the following
syntax:

objectname : classname

The object’s attributes and their values are shown in the second compartment. For
example, we have an object called Mary Jones that belongs to the Student class. The values
of the name, dateOfBirth, and year attributes are also shown. Attributes whose values are
not of interest to you may be suppressed; for example, we have not shown the address and
phone attributes for Mary Jones. If none of the attributes is of interest, the entire second
compartment may be suppressed. The name of the object may also be omitted, in which case
the colon should be kept with the class name as we have done with the instance of Course. If
the name of the object is shown, the class name, together with the colon, may be suppressed.


An operation, such as calcGpa in Student, is a function or a service that is provided by all the instances of a class. Typically, other objects can access or manipulate the information
stored in an object only through such operations. The operations, therefore, provide an
external interface to a class; the interface presents the outside view of the class without
showing its internal structure or how its operations are implemented. This technique of hiding
the internal implementation details of an object from its external view is known as
encapsulation, or information hiding. So although we provide the abstraction of the behavior
common to all instances of a class in its interface, we hide within the class its structure and
the secrets of the desired behavior.

Types of Operations

Operations can be classified into four types, depending on the kind of service requested by clients (a Java sketch after this list illustrates all four):
• Constructor: A constructor operation creates a new instance of a class and initializes its state. For example, the Student class can have a constructor operation, also called Student, that creates and initializes a new Student object. Such constructor operations are available to all classes and are therefore not explicitly shown in the class diagram.
• Query: A query operation is an operation without any side effects; it accesses the state
of an object but does not alter the state. For example, the Student class can have an
operation called getYear, which simply retrieves the year (freshman, sophomore,
junior, or senior) of the Student object specified in the query. Consider, however, the
calcAge operation within Student. This is also a query operation because it does not
have any other effects. Note that the only argument for this query is the target Student
object. Such a query can be represented as a derived attribute.
• Update: An update operation alters the state of an object. For example, consider an
operation of Student called promoteStudent. The operation promotes a student to a
new class, thereby changing the Student object’s state (value of the attribute year).
Another example of an update operation is registerFor(course), which, when invoked,
has the effect of establishing a connection from a Student object to a specific Course
object. Again, in standard object-oriented programming terminology, the methods that are used to change the value of an object's internal attributes are called setter, or mutator, methods.
• Class - Scope: A class-scope operation is an operation that applies to a class rather
than an object instance. For example, avgGpa for the Student class calculates the
average grade point average across all students. In object-oriented programming,
class-scope operations are implemented with class methods.
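The sketch below shows one way the Student class could look in Java, with one operation of each kind. The attribute types, the GPA field, and the method bodies are assumptions made only for illustration.

import java.time.LocalDate;
import java.time.Period;
import java.util.ArrayList;
import java.util.List;

public class Student {
    private static final List<Student> allStudents = new ArrayList<>();

    private String name;
    private LocalDate dateOfBirth;
    private String year;                           // "freshman", "sophomore", ...
    private final List<String> courses = new ArrayList<>();
    private double gpa;

    // Constructor operation: creates a new instance and initializes its state.
    public Student(String name, LocalDate dateOfBirth, String year) {
        this.name = name;
        this.dateOfBirth = dateOfBirth;
        this.year = year;
        allStudents.add(this);
    }

    // Query operations: read the object's state without changing it.
    public String getYear() { return year; }

    public int calcAge() {                         // derived from dateOfBirth and today's date
        return Period.between(dateOfBirth, LocalDate.now()).getYears();
    }

    // Update operations: change the object's state.
    public void promoteStudent(String newYear) { this.year = newYear; }

    public void registerFor(String course) { courses.add(course); }

    // Class-scope operation: applies to the class as a whole, not to one instance.
    public static double avgGpa() {
        return allStudents.stream().mapToDouble(s -> s.gpa).average().orElse(0.0);
    }
}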


Associations:

Unary Associations

Binary Associations


Ternary Associations

Example of binary association relationships: (a) University

Example of binary association relationships: (b) Customer order


Example of Object Diagram for Customer order

Derived Attributes, Derived Associations, and Derived Roles

A derived attribute, association, or role is one that can be computed or derived from
other attributes, associations, and roles, respectively. A derived element (attribute,
association, or role) is typically shown by placing either a slash (/) or a stereotype of
<<Derived>> before the name of the element. For instance, in figure below, age is a derived
attribute of Student, because it can be calculated from the date of birth and the current date.
Because the calculation is a constraint on the class, the calculation is shown on the diagram
within {} above the Student class. Also, the Takes relationship between Student and Course is
derived, because it can be inferred from the Registers For and Scheduled For relationships.
By the same token, participants is a derived role because it can be derived from other roles.


Generalization

In a generalization relationship, a subclass inherits features from its superclass, and by


transitivity, from all its ancestors. In generalizing a set of object classes into a more general
class, we abstract not only the common attributes and relationships, but the common
operations as well. The attributes and operations of a class are collectively known as the
features of the class. The classes that are generalized are called subclasses, and the class they
are generalized into is called a superclass.

Consider the example shown in the figure below. There are three types of employees: hourly
employees, salaried employees, and consultants. The features that are shared by all
employees—empName, empNumber, address, dateHired, and printLabel—are stored in the
Employee superclass, whereas the features that are peculiar to a particular employee type
are stored in the corresponding subclass (e.g., hourlyRate and computeWages of Hourly
Employee). A generalization path is shown as a solid line from the subclass to the superclass,
with a hollow triangle at the end of, and pointing toward, the superclass. You can show a
group of generalization paths for a given superclass as a tree with multiple branches
connecting the individual subclasses, and a shared segment with a hollow triangle pointing
toward the superclass. In the other figure for instance, we have combined the generalization
paths from Outpatient to Patient, and from Resident Patient to Patient, into a shared segment
with a triangle pointing toward Patient. We also specify that this generalization is dynamic,
meaning that an object may change subtypes.

You can indicate the basis of a generalization by specifying a discriminator next to the
path. A discriminator shows which property of an object class is being abstracted by a
particular generalization relationship. You can discriminate on only one property at a time.
For example, we discriminate the Employee class on the basis of employment type (hourly,
salaried, consultant). For a group of generalization relationships, we need to specify the discriminator only once. Although we discriminate the Patient class into two
subclasses, Outpatient and Resident Patient, based on residency, we show the discriminator
label only once next to the shared line. An instance of a subclass is also an instance of its

superclass; for example, an Outpatient instance is also a Patient instance. For that reason, a generalization is also referred to as an is-a relationship. Also, a subclass inherits all the features of its superclass. For example, in addition to its own special features, hourlyRate and computeWages, the Hourly Employee subclass inherits empName, empNumber, address, dateHired, and printLabel from Employee. An instance of Hourly Employee will store values for the attributes of Employee and Hourly Employee and, when requested, will apply the printLabel and computeWages operations.

Generalization and inheritance are transitive across any number of levels of a


superclass/subclass hierarchy. For instance, we could have a subclass of Consultant called
Computer Consultant that would inherit the features of Employee and Consultant. An
instance of Computer Consultant would be an instance of Consultant.
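The Employee hierarchy can be sketched directly in Java. Only the features named in the text are shown; the field types, the consultant fee, and the wage formula are assumptions for illustration.

// Superclass holding the features shared by every kind of employee.
class Employee {
    protected String empName;
    protected String empNumber;
    protected String address;
    protected String dateHired;

    Employee(String empName, String empNumber, String address, String dateHired) {
        this.empName = empName;
        this.empNumber = empNumber;
        this.address = address;
        this.dateHired = dateHired;
    }

    // Defined once here and inherited by all subclasses.
    String printLabel() {
        return empName + " (" + empNumber + "), " + address;
    }
}

class HourlyEmployee extends Employee {
    private double hourlyRate;                       // feature peculiar to this subclass

    HourlyEmployee(String name, String number, String address,
                   String dateHired, double hourlyRate) {
        super(name, number, address, dateHired);     // reuse the inherited state
        this.hourlyRate = hourlyRate;
    }

    double computeWages(double hoursWorked) {
        return hourlyRate * hoursWorked;
    }
}

class Consultant extends Employee {
    private double contractFee;

    Consultant(String name, String number, String address,
               String dateHired, double contractFee) {
        super(name, number, address, dateHired);
        this.contractFee = contractFee;
    }
}

// Generalization is transitive: ComputerConsultant inherits from Consultant
// and therefore also from Employee.
class ComputerConsultant extends Consultant {
    ComputerConsultant(String name, String number, String address,
                       String dateHired, double contractFee) {
        super(name, number, address, dateHired, contractFee);
    }
}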

Examples of generalization, inheritance, and constraints:


(a) Employee superclass with three subclasses

Employee is an ancestor of Computer Consultant, while Computer Consultant is a


descendant of Employee; these terms are used to refer to generalization of classes across
multiple levels.
Inheritance is one of the major advantages of using the object-oriented model. It
allows code reuse: There is no need for a developer to design or write code that has already
been written for a superclass. The developer creates only code that is unique to the new,
refined subclass of an existing class. In actual practice, object-oriented developers typically
have access to large collections of class libraries in their respective domains. They identify
those classes that may be reused and refined to meet the demands of new applications.


Advocates of the object-oriented approach claim that code reuse results in productivity gains
of several orders of magnitude.

(b) Abstract Patient class with two concrete subclasses

Notice that in the figure, the Patient class is in italics, implying that it is an abstract class. An
abstract class is a class that has no direct instances but whose descendants may have direct
instances. A class that can have direct instances (e.g., Outpatient or Resident Patient) is called
a concrete class. In this example, therefore, Outpatient and Resident Patient can have direct
instances, but Patient cannot have any direct instances of its own.
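In Java the same distinction is expressed with the abstract keyword. A minimal sketch follows; the attributes and the assignedLocation() operation are invented for illustration, since the figure itself is not reproduced here.

// Abstract class: no direct instances are allowed.
abstract class Patient {
    protected String patientId;
    protected String name;

    Patient(String patientId, String name) {
        this.patientId = patientId;
        this.name = name;
    }

    // Every concrete subclass must supply its own implementation.
    abstract String assignedLocation();
}

class Outpatient extends Patient {
    private String checkbackDate;

    Outpatient(String patientId, String name, String checkbackDate) {
        super(patientId, name);
        this.checkbackDate = checkbackDate;
    }

    @Override
    String assignedLocation() { return "Outpatient clinic, return on " + checkbackDate; }
}

class ResidentPatient extends Patient {
    private String bedNumber;

    ResidentPatient(String patientId, String name, String bedNumber) {
        super(patientId, name);
        this.bedNumber = bedNumber;
    }

    @Override
    String assignedLocation() { return "Bed " + bedNumber; }
}

// new Patient(...) would not compile, because an abstract class has no direct
// instances; new Outpatient(...) and new ResidentPatient(...) are allowed.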

Aggregation

An aggregation expresses a part-of relationship between a component object and an


aggregate object. It is a stronger form of association relationship (with the added “part-of”
semantics) and is represented with a hollow diamond at the aggregate end. For example, the
following figure shows a personal computer as an aggregate of CPUs (up to four, for multiprocessors), hard disks, monitor, keyboard, and other parts. Note that aggregation involves a
set of distinct object instances, one of which contains or is composed of the others. For
example, an object in the Personal Computer class is related to (consists of) one to four CPU
objects, one of its parts. As shown, it is also possible for component objects to exist without
being part of a whole (e.g., there can be a Monitor that is not part of any PC). Further, it is
possible that the Personal Computer class has operations that apply to its parts; for example,
calculating the extended warranty cost for the PC involves an analysis of its component parts.
In contrast, generalization relates object classes: an object (e.g., Mary Jones) is
simultaneously an instance of its class (e.g., Undergrad Student) and its superclass (e.g.,
Student). Only one object (e.g., Mary Jones) is involved in a generalization relationship. This
is why multiplicities are indicated at the ends of aggregation lines, whereas there are no
multiplicities for generalization relationships.
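A minimal Java sketch of the part-of relationship follows; the component classes, the limit of four CPUs, and the warranty figures are assumptions chosen only to mirror the example.

import java.util.ArrayList;
import java.util.List;

// Component classes; a Monitor can also exist on its own, outside any PC.
class Cpu { }
class HardDisk { }
class Monitor { }

class PersonalComputer {
    private final List<Cpu> cpus = new ArrayList<>();       // one to four CPUs
    private final List<HardDisk> disks = new ArrayList<>();
    private Monitor monitor;                                 // may be absent

    void addCpu(Cpu cpu) {
        if (cpus.size() >= 4) {
            throw new IllegalStateException("at most four CPUs in this model");
        }
        cpus.add(cpu);
    }

    void attachMonitor(Monitor m) { this.monitor = m; }

    // An operation of the whole that works over its parts, like the extended
    // warranty calculation mentioned in the text (figures are invented).
    double extendedWarrantyCost() {
        return cpus.size() * 40.0 + disks.size() * 15.0 + (monitor != null ? 25.0 : 0.0);
    }
}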


Examples of aggregation


The object diagram

In the figure above, we can see an inheritance relationship and two association relationships. The CDSalesReport class inherits from the Report class. A CDSalesReport is associated with one CD, but the CD class doesn’t know anything about the CDSalesReport class. The CD and the Band classes both know about each other, and each can be associated with one or more instances of the other.

Activity Diagram
Activity diagrams show the procedural flow of control between two or more class objects while processing an activity. In other words, they describe the business and operational step-by-step workflows of components in a system.
An activity diagram shows the overall flow of control. It can be used to model a higher-level business process at the business-unit level, or to model low-level internal class actions. Activity diagrams are best used to model higher-level processes, such as how the company is currently doing business, or how it would like to do business. This is because activity diagrams are “less technical” in appearance compared to sequence diagrams, and business-minded people tend to understand them more quickly.
An activity diagram’s notation set is similar to that used in a state chart diagram. Like
a state chart diagram, the activity diagram starts with a solid circle connected to the initial
activity. The activity is modelled by drawing a rectangle with rounded edges, enclosing the
activity’s name. Activities can be connected to other activities through transition lines, or to
decision points that connect to different activities guarded by conditions of the decision point.
Activities that terminate the modelled process are connected to a termination point (just as
in a state chart diagram). Optionally, the activities can be grouped into swim lanes, which are
used to indicate the object that actually performs the activity.

Activity diagram, with two swim lanes to indicate control of activity by two
objects: the band manager, and the reporting tool

In our example activity diagram, we have two swim lanes because we have two objects
that control separate activities: a band manager and a reporting tool. The process starts with
the band manager electing to view the sales report for one of his bands. The reporting tool
then retrieves and displays all the bands that person manages and asks him to choose one.
After the band manager selects a band, the reporting tool retrieves the sales information and
displays the sales report. The activity diagram shows that displaying the report is the last step
in the process.


Use-case diagram
A use case is used to help development teams visualize the functional requirements
of a system, including the relationship of “actors” (human beings who will interact with the
system) to essential processes, as well as the relationships among different use cases. Use-
case diagrams generally show groups of use cases — either all use cases for the complete
system, or a breakout of a particular group of use cases with related functionality (e.g., all
security administration-related use cases). To show a use case on a use-case diagram, you
draw an oval in the middle of the diagram and put the name of the use case in the center of,
or below, the oval. To draw an actor (indicating a system user) on a use-case diagram, you
draw a stick person to the left or right of your diagram. Use simple lines to depict relationships
between actors and use cases.


Sequence diagram
A sequence diagram shows how objects communicate with each other in terms of a sequence of messages. It also indicates the lifespans of objects relative to those messages. Sequence diagrams show a detailed flow for a specific use case or even just part of a specific use case.


They are almost self-explanatory; they show the calls between the different objects in their
sequence and can show, at a detailed level, different calls to different objects.

A sequence diagram has two dimensions: The vertical dimension shows the sequence
of messages/calls in the time order that they occur; the horizontal dimension shows the
object instances to which the messages are sent.

The Memory Hierarchy

A typical computer system has several different components in which data may be
stored. These components have data capacities ranging over at least seven orders of
magnitude and also have access speeds ranging over seven or more orders of magnitude. The
cost per byte of these components also varies, but more slowly, with perhaps three orders of
magnitude between the cheapest and most expensive forms of storage. Not surprisingly, the
devices with smallest capacity also offer the fastest access speed and have the highest cost
per byte as shown in the figure below:


Memory hierarchy

• Cache: A typical machine has a megabyte or more of cache storage. On-board cache
is found on the same chip as the microprocessor itself, and additional level-2 cache is
found on another chip. Data and instructions are moved to cache from main memory
when they are needed by the processor. Cached data can be accessed by the processor
in a few nanoseconds.
• Main Memory: In the center of the action is the computer’s main memory. We may
think of everything that happens in the computer — instruction executions and data
manipulations — as working on information that is resident in main memory (although
in practice, it is normal for what is used to migrate to the cache). Currently, machines are commonly configured with many gigabytes of main memory. Typical times to
move data from main memory to the processor or cache are in the 10-100 nanosecond
range.
• Secondary Storage: Secondary storage is typically magnetic disk. Currently, there are
single disk units that have capacities of up to a terabyte or more, and one machine
can have several disk units. The time to transfer a single byte between disk and main
memory is around 10 milliseconds. However, large numbers of bytes can be
transferred at one time, so the matter of how fast data moves from and to disk is
somewhat complex.
• Tertiary Storage. As capacious as a collection of disk units can be, there are databases
much larger than what can be stored on the disk(s) of a single machine, or even several
machines. To serve such needs, tertiary storage devices have been developed to hold
data volumes measured in terabytes. Tertiary storage is characterized by significantly
higher read/write times than secondary storage, but also by much larger capacities
and smaller cost per byte than is available from magnetic disks. Many tertiary devices
involve robotic arms or conveyors that bring storage media such as magnetic tape or
optical disks (e.g., DVD’s) to a reading device. Retrieval takes seconds or minutes, but
capacities in the petabyte range are possible.

Transfer of Data Between Levels


Normally, data moves between adjacent levels of the hierarchy. At the secondary and
tertiary levels, accessing the desired data or finding the desired place to store data takes a
great deal of time, so each level is organized to transfer large amounts of data to or from the
level below, whenever any data at all is needed. Especially important for understanding the
operation of a database system is the fact that the disk is organized into disk blocks (or just
blocks, or as in operating systems, pages) of perhaps 4-64 kilobytes. Entire blocks are moved
to or from a continuous section of main memory called a buffer. Thus, a key technique for
speeding up database operations is to arrange data so that when one piece of a disk block is
needed, it is likely that other data on the same block will also be needed at about the same
time.
The same idea applies to other hierarchy levels. If we use tertiary storage, we try to
arrange so that when we select a unit such as a DVD to read, we need much of what is on that
DVD. At a lower level, movement between main memory and cache is by units of cache lines,
typically 32 consecutive bytes. The hope is that entire cache lines will be used together. For
example, if a cache line stores consecutive instructions of a program, we hope that when the
first instruction is needed, the next few instructions will also be executed immediately
thereafter.
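As an illustration of block-at-a-time transfer, here is a minimal Java sketch that reads the whole 4 KB block containing a requested record into a main-memory buffer, so neighbouring records on the same block arrive at the same time. The file name, block size, and record offset are assumptions for the example.

import java.io.IOException;
import java.io.RandomAccessFile;

public class BlockReader {
    static final int BLOCK_SIZE = 4 * 1024;    // assumed 4 KB disk block

    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[BLOCK_SIZE];  // the main-memory buffer

        try (RandomAccessFile disk = new RandomAccessFile("data.db", "r")) {
            long recordOffset = 123_456L;                  // byte address of the wanted record
            long blockNumber = recordOffset / BLOCK_SIZE;  // which block holds it
            disk.seek(blockNumber * BLOCK_SIZE);           // position at the block boundary
            int bytesRead = disk.read(buffer);             // transfer the entire block
            System.out.println("Read " + bytesRead + " bytes of block " + blockNumber);
        }
    }
}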
Volatile and Non-Volatile Storage
An additional distinction among storage devices is whether they are volatile or non-
volatile. A volatile device “forgets” what is stored in it when the power goes off. A non-volatile
device, on the other hand, is expected to keep its contents intact even for long periods when
the device is turned off or there is a power failure. The question of volatility is important,
because one of the characteristic capabilities of a DBMS is the ability to retain its data even
in the presence of errors such as power failures.
Magnetic and optical materials hold their data in the absence of power. Thus,
essentially all secondary and tertiary storage devices are non-volatile. On the other hand,
main memory is generally volatile (although certain types of more expensive memory chips,
such as flash memory, can hold their data after a power failure). A significant part of the
complexity in a DBMS comes from the requirement that no change to the database can be
considered final until it has migrated to non-volatile, secondary storage.
Virtual Memory-
Typical software executes in virtual memory, an address space that is typically 32 bits; i.e., there are 2^32 bytes, or 4 gigabytes, in a virtual memory. The operating system manages
virtual memory, keeping some of it in main memory and the rest on disk. Transfer between
memory and disk is in units of disk blocks (pages). Virtual memory is an artifact of the
operating system and its use of the machine’s hardware, and it is not a level of the memory
hierarchy.


The path in the figure above involving virtual memory represents the treatment of
conventional programs and applications. It does not represent the typical way data in a
database is managed, since a DBMS manages the data itself.
However, there is increasing interest in main-memory database systems, which do
indeed manage their data through virtual memory, relying on the operating system to bring
needed data into main memory through the paging mechanism. Main-memory database
systems, like most applications, are most useful when the data is small enough to remain in
main memory without being swapped out by the operating system.

STORAGE DEVICES CHARACTERISTICS

Presently, the common secondary storage medium used to store data is the disk; before the disk there was tape, which is now generally used for archival data. The storage medium used in a disk drive is a disk pack, which is made up of a number of surfaces. Data is read from and written to the disk pack by means of transducers called read/write heads. The number of read/write heads depends on the type of the disk drive. If we trace the projection of one head on the surface associated with it as the disk rotates, we obtain a circular figure called a track. The tracks at the same position on every surface of the disk form the surface of an imaginary cylinder. In disk terminology, therefore, a cylinder consists of the tracks under the heads on each of its surfaces.
A major factor that determines overall system performance is the response time for data on secondary storage. This time depends not only on the physical device characteristics, but also on the data arrangement and request sequencing. In general, the response cost has two components: access time and data transfer time. Data transfer time is the time needed to move data from the secondary storage device to processor memory; access time is the time needed to position the read/write head at the required position. The data transfer time depends on physical device characteristics and cannot be optimized: in the case of reading a 1 KB (kilobyte = 1024 bytes) block of data from a device that can transfer it at 100 KB/sec (kilobytes per second), the data transfer time is 10 msec. The access time, which is also influenced by physical characteristics, depends on the distance between the current and target positions and therefore on the data organization.
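As a quick check of that arithmetic, the short sketch below computes the transfer time for the figures used in the example.

public class TransferTime {
    public static void main(String[] args) {
        double blockSizeKb = 1.0;                 // 1 KB block (1024 bytes)
        double transferRateKbPerSec = 100.0;      // 100 KB/sec device

        double seconds = blockSizeKb / transferRateKbPerSec;
        System.out.printf("Transfer time = %.0f msec%n", seconds * 1000);  // prints 10 msec
    }
}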
File system:
A file is a collection of records that are logically related to some object. Record values can take many forms; for example, a student record may have values for roll number, name, and class. Files are used to arrange such data, e.g., files of a bank’s customers, files of departments, files of stock records, etc.
Files are recorded on secondary storage such as magnetic disks, magnetic tapes, and optical disks.
Types of files:
• Physical file:
o A physical file contains the actual data that is stored.
o It also stores a description of how the data is represented.
• Logical file:
o A logical file does not contain data.
o It contains a description of records that are found in one or more physical files.
o A logical file is a view or representation of one or more physical files.
• Special character file:
o At the time of file creation, special characters are inserted in the file.
o For example, Control + Z (ASCII value 26) marks the end of a file.
According to record type, there are two types of files:
✓ Fixed-length record file
✓ Variable-length record file
1. Fixed-length record file:
Every record in this file has the same size (in bytes), and each record is assigned a memory block of that same fixed size. For example, if 30 bytes are assigned to each record, then every record is stored in a 30-byte slot.
Advantage: records are stored at fixed distances within the memory blocks, so searching for a particular record is fast.
Disadvantage: memory is wasted when a record is smaller than its assigned block; this unused space increases the size of the file.
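The sketch below shows why fixed-length records allow fast direct access: record i always starts at byte i multiplied by the record size, so it can be read without scanning the file. The record size and file handling are assumptions for illustration.

import java.io.IOException;
import java.io.RandomAccessFile;

public class FixedLengthFile {
    static final int RECORD_SIZE = 30;   // every record occupies exactly 30 bytes

    static byte[] readRecord(RandomAccessFile file, int recordNumber) throws IOException {
        byte[] record = new byte[RECORD_SIZE];
        file.seek((long) recordNumber * RECORD_SIZE);  // jump straight to the record
        file.readFully(record);
        return record;
    }
}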
2. Variable-length record file:
Every record in this file can have a different size (in bytes), and the memory assigned to each record varies accordingly; memory blocks are used according to the actual size of each record's values.
Advantage: memory is used efficiently, since each record occupies only as much space as it needs; because less memory is used, records can be moved, saved, or transferred quickly.
Disadvantage: access to a record is slower than in a fixed-length record file, because record sizes vary.

Hash File Organization


Hash file organization uses the computation of a hash function on some fields of the records. The hash function's output determines the location of the disk block where the record is to be placed.
When a record has to be retrieved using the hash key columns, the address is generated and the whole record is read from that address. In the same way, when a new record has to be inserted, the address is generated using the hash key and the record is stored directly at that address. The same process applies to delete and update operations.
In this method, there is no need to search or sort the entire file; each record is stored at an apparently random location in memory determined by the hash function.
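The sketch below illustrates the idea in Java with an in-memory map from block numbers to records; the hash function, the number of blocks, and the record type are assumptions, not the internals of any particular DBMS.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashFile {
    static final int NUM_BLOCKS = 8;
    private final Map<Integer, List<String>> blocks = new HashMap<>();

    // The hash function's output determines the block (bucket) number.
    private int blockFor(int key) {
        return Math.floorMod(Integer.hashCode(key), NUM_BLOCKS);
    }

    void insert(int key, String record) {
        blocks.computeIfAbsent(blockFor(key), b -> new ArrayList<>()).add(record);
    }

    List<String> lookup(int key) {
        // Only the one block addressed by the hash value is examined;
        // no search or sort over the entire file is needed.
        return blocks.getOrDefault(blockFor(key), List.of());
    }
}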
B+ File Organization
B+ tree file organization is an advanced form of the indexed sequential access method. It uses a tree-like structure to store records in a file.
It uses the same key-index concept, where the primary key is used to sort the records; for each primary key, an index value is generated and mapped to the record.


The B+ tree is similar to a binary search tree (BST), but a node can have more than two children. In this method, all the records are stored only at the leaf nodes; the intermediate nodes act as pointers to the leaf nodes and do not contain any records.

The above B+ tree shows that (a lookup sketch follows the list):

• There is one root node of the tree, i.e., 25.
• There is an intermediary layer of nodes. They do not store the actual records; they hold only pointers to the leaf nodes.
• The node to the left of the root covers values below the root key and the node to the right covers values from the root key onward; their keys are 15 and 30 respectively.
• Only the leaf nodes hold the actual values, i.e., 10, 12, 17, 20, 24, 27 and 29.
• Searching for any record is easy because all the leaf nodes are at the same depth (the tree is balanced).
• Any record can be reached by traversing a single path from the root, so access is fast.
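Here is a minimal Java sketch of the root-to-leaf lookup just described. The node layout is greatly simplified, and a fourth leaf holding the value 31 is added only so that every internal key has a right-hand child; it is not part of the tree described above.

import java.util.List;

public class BPlusTreeSketch {

    static class Node {
        List<Integer> keys;        // routing keys (internal node) or stored values (leaf)
        List<Node> children;       // null for leaf nodes

        Node(List<Integer> keys, List<Node> children) {
            this.keys = keys;
            this.children = children;
        }

        boolean isLeaf() { return children == null; }
    }

    // Follow a single root-to-leaf path, then scan the leaf for the value.
    static boolean contains(Node root, int value) {
        Node node = root;
        while (!node.isLeaf()) {
            int i = 0;
            while (i < node.keys.size() && value >= node.keys.get(i)) {
                i++;                                   // choose the child to descend into
            }
            node = node.children.get(i);
        }
        return node.keys.contains(value);
    }

    public static void main(String[] args) {
        // The small tree described above: root key 25, internal keys 15 and 30,
        // and leaves holding the actual values.
        Node leaf1 = new Node(List.of(10, 12), null);
        Node leaf2 = new Node(List.of(17, 20, 24), null);
        Node leaf3 = new Node(List.of(27, 29), null);
        Node leaf4 = new Node(List.of(31), null);      // padding leaf, an assumption
        Node left = new Node(List.of(15), List.of(leaf1, leaf2));
        Node right = new Node(List.of(30), List.of(leaf3, leaf4));
        Node root = new Node(List.of(25), List.of(left, right));

        System.out.println(contains(root, 17));   // true: 25 -> 15 -> leaf [17, 20, 24]
        System.out.println(contains(root, 18));   // false: same path, value absent
    }
}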
Pros of B+ tree file organization
• Searching becomes very easy, as all the records are stored only in the leaf nodes and are sorted in a sequential linked list.
• Traversing the tree structure is easy and fast.
• The size of a B+ tree is not restricted, so the number of records can increase or decrease and the B+ tree structure grows or shrinks accordingly.
• It is a balanced tree structure, and inserts, updates, and deletes do not degrade its performance.
Cons of B+ tree file organization
• This method is inefficient for static tables.
Cluster file organization
• When records of two or more tables are stored in the same file, it is known as clustering. These files hold two or more tables in the same data block, and the key attributes used to map the tables together are stored only once.
• This method reduces the cost of searching for related records across different files.
• The cluster file organization is used when there is a frequent need for joining the
tables with the same condition. These joins will give only a few records from both

tables. In the given example, we retrieve records only for particular departments; this method is not suited to retrieving the records of the entire department table.

In this method, we can directly insert, update, or delete any record. Data is sorted based on the key with which searching is done; the cluster key is the key on which the joining of the tables is performed.
Types of Cluster file organization:
Cluster file organization is of two types:
• Indexed Clusters: In an indexed cluster, records are grouped based on the cluster key and stored together. The EMPLOYEE and DEPARTMENT relationship above is an example of an indexed cluster; here, all the records are grouped on the cluster key DEP_ID.


• Hash Clusters: This is similar to an indexed cluster. In a hash cluster, instead of storing the records based on the cluster key itself, we generate a hash value from the cluster key and store together the records that share the same hash value.
Pros of Cluster file organization
• The cluster file organization is used when there is a frequent request for joining the
tables with same joining condition.
• It provides the efficient result when there is a 1:M mapping between the tables.
Cons of Cluster file organization
• This method gives low performance for very large databases.
• If the joining condition changes, this method cannot be used; traversing the file with a different join condition takes a lot of time.
• This method is not suitable for tables with a 1:1 relationship.

DATABASE SYSTEM CATALOGUE

The system catalogue is a collection of tables and views that contain important
information about a database. It is the place where a relational database management system
stores schema metadata, such as information about tables and columns, and internal
bookkeeping information. A system catalogue is available for each database. Information in
the system catalogue defines the structure of the database. For example, the DDL (data definition language) for all tables in the database is stored in the system catalogue. Most
system catalogues are copied from the template database during database creation, and are
thereafter database-specific. A few catalogues are physically shared across all databases in an
installation; these are marked in the descriptions of the individual catalogues.
The system catalogue for a database is actually part of the database. Within the
database are objects, such as tables, indexes, and views. The system catalogue is basically a
group of objects that contain information that defines other objects in the database, the
structure of the database itself, and various other significant information.
The system catalogue may be divided into logical groups of objects to provide tables
that are accessible by not only the database administrator, but by any other database user as
well. A user typically queries the system catalogue to acquire information on the user’s own
objects and privileges, whereas the DBA needs to be able to inquire about any structure or
event within the database. In some implementations, there are system catalogue objects that
are accessible only to the database administrator.
The terms system catalogue and data dictionary have been used interchangeably in
most situations. In database management systems, a file defines the basic organisation of a
database. A data dictionary contains a list of all the files in the database, the number of
records in each file, and the names and types of each field. Most database management
systems keep the data dictionary hidden from users to prevent them from accidentally
destroying its contents.
The information stored in a catalogue of an RDBMS includes:
• the relation names,
• attribute names,


• attribute domains (data types),


• descriptions of constraints (primary keys, secondary keys, foreign keys, NULL/NOT
NULL, and other types of constraints), views, and storage structures and indexes
(index name, attributes on which index is defined, type of index etc).
Security and authorisation information is also kept in the catalogue, which describes:
• authorised user names and passwords,
• each user’s privilege to access specific database relations and views,
• the creator and owner of each relation. The privileges are granted using GRANT
command. A listing of such commands is given in Figure 1.
The system catalogue can also be used to store some statistical and descriptive information
about relations. Some such information can be:
• number of tuples in each relation,
• the different attribute values,
• storage and access methods used in relation.
All such information finds its use in query processing.
In relational DBMSs the catalogue is itself stored as relations, and the DBMS software is used for querying, updating, and maintaining it. The information stored in the catalogue can therefore be accessed by the DBMS routines, as well as by authorised users, with the help of a query language such as SQL.
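
As a concrete illustration, the catalogue of a real DBMS can be read with ordinary queries. The minimal Python sketch below uses SQLite, whose catalogue table sqlite_master and PRAGMA table_info are built-in features; the Student/Course schema itself is invented for illustration:

# Query a system catalogue: SQLite keeps schema metadata in sqlite_master,
# which can be read with ordinary SELECT statements, just like user tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (RollNo INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE Course  (CourseNo INTEGER PRIMARY KEY, CourseTitle TEXT)")

# Relation names and their DDL, as stored by the DBMS itself.
for name, sql in conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name, "->", sql)

# Attribute names, data types and key information for one relation.
for cid, col, coltype, notnull, default, pk in conn.execute(
        "PRAGMA table_info(Student)"):
    print(col, coltype, "primary key" if pk else "")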
Let us show a catalogue structure for base relation with the help of an example. The
figure A below shows a relational schema, and the other Figure B shows its catalogue. The
catalogue here has stored - relation names, attribute names, attribute type and primary key
as well as foreign key information.
Student
(RollNo, Name, Address, PhoneNo, DOB, email, CourseNo, DNo)
Course
(CourseNo, CourseTitle, Professor)
Dept
(DeptNo, DeptName, Location)
Grade
(RollNo, CourseNo, Grade)
Figure A – Sample Relation

The description of the relational database schema in the Figure A above is shown as
the tuples (contents) of the catalogue relation in Figure B. This entry is called CAT_ENTRY. All relation names should be unique, and all attribute names within a particular relation should also be unique. Another catalogue relation can store information such as tuple size, current
number of tuples, number of indexes, and creator name for each relation.


Figure B: One sample catalogue table

Data dictionaries also include data on the secondary keys, indexes and views. The
above could also be extended to the secondary key, index as well as view information by
defining the secondary keys, indexes and views. Data dictionaries do not contain any actual data from the database; they contain only book-keeping information for managing it. Without
a data dictionary, however, a database management system cannot access data from the
database.
The Database Library is built on a Data Dictionary, which provides a complete
description of record layouts and indexes of the database, for validation and efficient data
access. The data dictionary can be used for automated database creation, including building
tables, indexes, and referential constraints, and granting access rights to individual users and
groups. The database dictionary supports the concept of Attached Objects, which allow
database records to include compressed BLOBs (Binary Large Objects) containing images,
texts, sounds, video, documents, spreadsheets, or programmer-defined data types. The data
dictionary stores useful metadata, such as field descriptions, in a format that is independent
of the underlying database system. Some of the functions served by the Data Dictionary
include:
• Ensuring efficient data access, especially with regard to the utilisation of indexes,
• partitioning the database into both logical and physical regions,
• specifying validation criteria and referential constraints to be automatically
enforced,


• supplying pre-defined record types for Rich Client features, such as security and
administration facilities, attached objects, and distributed processing (i.e., grid and
cluster supercomputing).

DATA DICTIONARY VS DATA CATALOGUE

The terms data dictionary and data repository are used to indicate a more general
software utility than a catalogue. A catalogue is closely coupled with the DBMS software; it
provides the information stored in it to users and the DBA, but it is mainly accessed by the
various software modules of the DBMS itself, such as DDL and DML compilers, the query
optimiser, the transaction processor, report generators, and the constraint enforcer. On the
other hand, a Data Dictionary is a data structure that stores meta-data, i.e., data about data.
The software package for a stand-alone data dictionary or data repository may
interact with the software modules of the DBMS, but it is mainly used by the designers, users,
and administrators of a computer system for information resource management. These
systems are used to maintain information on system hardware and software configurations,
documentation, applications, and users, as well as other information relevant to system
administration.
If a data dictionary system is used only by designers, users, and administrators, and
not by the DBMS software, it is called a passive data dictionary; otherwise, it is called an active
data dictionary or data directory. An active data dictionary is automatically updated as
changes occur in the database. A passive data dictionary must be manually updated. The data
dictionary consists of record types (tables) created in the database by system-generated
command files, tailored for each supported back-end DBMS.
Command files contain SQL statements for CREATE TABLE, CREATE UNIQUE INDEX,
ALTER TABLE (for referential integrity), etc., using the specific SQL statement required by that
type of database.
Data Dictionary Features
A comprehensive data dictionary product will include:
• Support for standard entity types (elements, records, files, reports, programs,
systems, screens, users, terminals, etc.), and their various characteristics (e.g., for
elements, the dictionary might maintain Business name, Business definition, name,
Data type, Size, Format, Range(s), Validation criteria, etc.)
• Support for user-designed entity types (this is often called the “extensibility” feature);
this facility is often exploited in support of data modelling, to record and cross-
reference entities, relationships, data flows, data stores, processes, etc.
• The ability to distinguish between versions of entities (e.g., test and production)
• enforcement of in-house standards and conventions.
• comprehensive reporting facilities, including both “canned” reports and a reporting
language for user-designed reports; typical reports include:
• detail reports of entities, summary reports of entities, component reports (e.g., record-element structures), cross-reference reports (e.g., element keyword indexes) and where-used reports (e.g., element-record-program cross-references).


• a query facility, both for administrators and casual users, which includes the ability to
perform generic searches on business definitions, user descriptions, synonyms, etc.
• language interfaces, to allow, for example, standard record layouts to be
automatically incorporated into programs during the compile process.
• automated input facilities (e.g., to load record descriptions from a copy library).
• security features
• adequate performance tuning abilities
• support for DBMS administration, such as automatic generation of DDL (Data Definition Language).
Data Dictionary Benefits
The benefits of a fully utilised data dictionary are substantial. A data dictionary has the
potential to:
• facilitate data sharing by enabling database classes to automatically handle multi-user coordination, buffer layouts, data validation, and performance optimisations,
• improve the ease of understanding of data definitions,
• ensure that there is a single authoritative source of reference for all users,
• facilitate application integration by identifying data redundancies,
• reduce development lead times by simplifying documentation and automating programming activities.
• reduce maintenance effort by identifying the impact of change as it affects:
• users,
• database administrators,
• programmers.
• improve the quality of application software by enforcing standards in the
development process
• ensure application system longevity by maintaining documentation beyond project
completions
• data dictionary information created under one database system can easily be used to generate the same database layout on any of the other database systems BFC supports (Oracle, MS SQL Server, Access, DB2, Sybase, SQL Anywhere, etc.).
These benefits are maximised by a fully utilised data dictionary.
Disadvantages of Data Dictionary
• A DDS is a useful management tool, but it also has several disadvantages.
• It needs careful planning: we need to define the exact requirements, design its contents, and then test, implement and evaluate it. The cost of a DDS includes not only the initial price of its installation and any hardware requirements, but also the cost of collecting the information, entering it into the DDS, keeping it up-to-date and enforcing standards. The use of a DDS requires management commitment, which is not easy to achieve, particularly where the benefits are intangible and long term.


ROLE OF SYSTEM CATALOGUE IN DATABASE ADMINISTRATION

The database administration is a specialised database activity that is performed by a database administrator. The system catalogue has an important role to play in the database administration. Some of the key areas where the system catalogue helps the database administrator are defined below:
• Enforcement of Database Integrity: System catalogue is used to store information on
keys, constraints, referential integrity, business rules, triggering events etc. on various
tables and related objects. Thus, integrity enforcement would necessarily require the
use of a data dictionary.
• Enforcement of Security: The data dictionary also stores information on various users
of the database systems and their access rights. Thus, enforcement of any security
policy has to be processed through the data dictionary.
• Support for Database System Performance: The data dictionary contains information
on the indexes, statistics etc. Such information is very useful for query optimisation.
Also such information can be used by the database administrator to suggest changes
in the internal schema.
Data dictionary can also support the process of database application development and
testing (although this is not a direct relationship to database administration) as they contain
the basic documentation while the systems are in the process of being developed.

RELATIONAL ALGEBRA
Relational algebra is a procedural query language. It gives a step by step process to
obtain the result of the query. It uses operators to perform queries.

Types of Relational operation

Select Operation:
The select operation selects tuples that satisfy a given predicate. It is denoted by sigma
(σ).
Notation: σ p(r)
Where: σ is the selection operator, r is a relation, and p is a propositional logic formula (the selection predicate) which may use connectives such as AND, OR and NOT, together with relational operators such as =, ≠, ≥, <, >, ≤.


For example: LOAN Relation

Input:
σ BRANCH_NAME="perryride" (LOAN)
Output:

Project Operation:
This operation shows the list of those attributes that we wish to appear in the result.
Rest of the attributes are eliminated from the table.
It is denoted by ∏.
Notation: ∏ A1, A2, ..., An (r)
Where
A1, A2, ..., An are attribute names of relation r.


Example: CUSTOMER RELATION

Input:
∏ NAME, CITY (CUSTOMER)
Output:
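
To make the two operations concrete, the following minimal Python sketch applies σ and ∏ to relations modelled as lists of dictionaries; the sample tuples are invented for illustration:

# Relational SELECT (σ) and PROJECT (∏) over relations modelled as
# lists of dictionaries.
LOAN = [
    {"LOAN_NO": "L-17", "BRANCH_NAME": "Downtown",  "AMOUNT": 1000},
    {"LOAN_NO": "L-15", "BRANCH_NAME": "Perryride", "AMOUNT": 1500},
    {"LOAN_NO": "L-93", "BRANCH_NAME": "Perryride", "AMOUNT": 500},
]

def select(relation, predicate):
    # sigma_p(r): keep only the tuples satisfying the predicate.
    return [t for t in relation if predicate(t)]

def project(relation, *attributes):
    # pi_{A1..An}(r): keep only the named attributes, eliminating duplicates.
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attributes)
        if row not in seen:
            seen.add(row)
            result.append(dict(zip(attributes, row)))
    return result

print(select(LOAN, lambda t: t["BRANCH_NAME"] == "Perryride"))
print(project(LOAN, "BRANCH_NAME"))   # duplicates eliminated, as in ∏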

Union Operation:
Suppose there are two relations R and S. The union operation contains all the tuples that are either in R or in S or in both R and S. It eliminates duplicate tuples. It is denoted by ∪.
Notation: R ∪ S
A union operation must satisfy the following conditions:
• R and S must have the same number of attributes (with compatible domains).
• Duplicate tuples are eliminated automatically.
Example:


DEPOSITOR RELATION

BORROW RELATION
Input:
∏ CUSTOMER_NAME (BORROW) ∪ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:

Set Intersection:
Suppose there are two relations R and S. The set intersection operation contains all tuples that are in both R and S. It is denoted by ∩.
Notation: R ∩ S
Example: Using the above DEPOSITOR table and BORROW table


Input:
∏ CUSTOMER_NAME (BORROW) ∩ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:

Set Difference:
Suppose there are two relations R and S. The set difference operation contains all tuples that are in R but not in S.
It is denoted by minus (−).
Notation: R − S
Example: Using the above DEPOSITOR table and BORROW table
Input:
∏ CUSTOMER_NAME (BORROW) - ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
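
A small Python sketch of the three set operations on projected customer names follows; the sample names are invented, and Python sets eliminate duplicates automatically, mirroring the relational behaviour:

# Union, intersection and difference of projected customer names.
BORROW    = [{"CUSTOMER_NAME": "Johnson"}, {"CUSTOMER_NAME": "Smith"}]
DEPOSITOR = [{"CUSTOMER_NAME": "Smith"},   {"CUSTOMER_NAME": "Jones"}]

borrowers  = {t["CUSTOMER_NAME"] for t in BORROW}
depositors = {t["CUSTOMER_NAME"] for t in DEPOSITOR}

print(borrowers | depositors)   # union:        customers with a loan or an account
print(borrowers & depositors)   # intersection: customers with both
print(borrowers - depositors)   # difference:   customers with a loan but no account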

Cartesian product
The Cartesian product is used to combine each row in one table with each row in the
other table. It is also known as a cross product. It is denoted by X.
Notation: E X D
Example:
EMPLOYEE


Input:
EMPLOYEE X DEPARTMENT
Output:

Rename Operation:
The rename operation is used to rename the output relation. It is denoted by rho (ρ).
Example: We can use the rename operator to rename STUDENT relation to STUDENT1.
ρ(STUDENT1, STUDENT)


Join Operations:
A Join operation combines related tuples from different relations, if and only if a given
join condition is satisfied. It is denoted by ⋈.
Example:
EMPLOYEE

SALARY

Result:


Types of Join operations:

Natural Join:
A natural join is the set of tuples of all combinations in R and S that are equal on their
common attribute names.
It is denoted by ⋈.
Example: Let's use the above EMPLOYEE table and SALARY table:
Input:
∏EMP_NAME, SALARY (EMPLOYEE ⋈ SALARY)
Output:
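
A minimal Python sketch of the natural join follows. EMP_ID is assumed here as the common attribute name, and the sample tuples are invented for illustration:

# Natural join: combine tuples from EMPLOYEE and SALARY that agree on
# their common attribute names.
EMPLOYEE = [
    {"EMP_ID": 1, "EMP_NAME": "Anita"},
    {"EMP_ID": 2, "EMP_NAME": "Bashir"},
]
SALARY = [
    {"EMP_ID": 1, "SALARY": 50000},
    {"EMP_ID": 2, "SALARY": 30000},
]

def natural_join(r, s):
    common = set(r[0]) & set(s[0])            # shared attribute names
    out = []
    for t1 in r:
        for t2 in s:
            if all(t1[a] == t2[a] for a in common):
                out.append({**t1, **t2})      # merge the matching tuples
    return out

joined = natural_join(EMPLOYEE, SALARY)
print([(t["EMP_NAME"], t["SALARY"]) for t in joined])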

Outer Join:
The outer join operation is an extension of the join operation. It is used to deal with missing
information.
Example:
EMPLOYEE


FACT_WORKERS

Input:
(EMPLOYEE ⋈ FACT_WORKERS)
Output:

An outer join is basically of three types:


• Left outer join
• Right outer join
• Full outer join
a. Left outer join: The left outer join contains all combinations of tuples in R and S that are equal on their common attribute names, together with the tuples in R that have no matching tuples in S (padded with NULL values).
It is denoted by ⟕.
Example: Using the above EMPLOYEE table and FACT_WORKERS table
Input:
EMPLOYEE ⟕ FACT_WORKERS


b. Right outer join: The right outer join contains all combinations of tuples in R and S that are equal on their common attribute names, together with the tuples in S that have no matching tuples in R (padded with NULL values).
It is denoted by ⟖.
Example: Using the above EMPLOYEE table and FACT_WORKERS Relation
Input:
EMPLOYEE ⟖ FACT_WORKERS
Output:

c. Full outer join: The full outer join is like a left or right outer join except that it contains all rows from both tables: tuples in R that have no matching tuples in S and tuples in S that have no matching tuples in R are both included (padded with NULL values).
It is denoted by ⟗.
Example: Using the above EMPLOYEE table and FACT_WORKERS table
Input:
EMPLOYEE ⟗ FACT_WORKERS
Output:

d. Equi join:
An equi join is a type of inner join and the most common join. It matches data according to an equality condition, using the comparison operator (=).


Example:
CUSTOMER RELATION

PRODUCT

Input:
CUSTOMER ⋈ PRODUCT
Output:


Transaction
A transaction can be defined as a group of tasks. A single task is the minimum
processing unit which cannot be divided further.
ACID Properties
A transaction is a very small unit of a program and it may contain several low-level tasks. A transaction in a database system must maintain Atomicity, Consistency, Isolation, and Durability (commonly known as the ACID properties) in order to ensure accuracy, completeness, and data integrity.
• Atomicity − This property states that a transaction must be treated as an atomic unit,
that is, either all of its operations are executed or none. There must be no state in a
database where a transaction is left partially completed. States should be defined
either before the execution of the transaction or after the execution/abortion/failure
of the transaction.
• Consistency − The database must remain in a consistent state after any transaction.
No transaction should have any adverse effect on the data residing in the database. If
the database was in a consistent state before the execution of a transaction, it must
remain consistent after the execution of the transaction as well.
• Durability − The database should be durable enough to hold all its latest updates even
if the system fails or restarts. If a transaction updates a chunk of data in a database
and commits, then the database will hold the modified data. If a transaction commits
but the system fails before the data could be written on to the disk, then that data will
be updated once the system springs back into action.
• Isolation − In a database system where more than one transaction is being executed
simultaneously and in parallel, the property of isolation states that all the transactions
will be carried out and executed as if it is the only transaction in the system. No
transaction will affect the existence of any other transaction.
• Serializability: When multiple transactions are being executed by the operating system in a multiprogramming environment, there are possibilities that instructions of one transaction are interleaved with those of some other transaction.
✓ Schedule − A chronological execution sequence of transactions is called a schedule. A schedule can have many transactions in it, each comprising a number of instructions/tasks.
✓ Serial Schedule − It is a schedule in which transactions are aligned in such a way
that one transaction is executed first. When the first transaction completes its
cycle, then the next transaction is executed. Transactions are ordered one after
the other. This type of schedule is called a serial schedule, as transactions are
executed in a serial manner.
In a multi-transaction environment, serial schedules are considered as a benchmark.
The execution sequence of instructions within a transaction cannot be changed, but two transactions can have their instructions executed in a random fashion. This execution does no harm if the two transactions are mutually independent and work on different segments of data; but if they work on the same data, then the results may vary, and this ever-varying result may bring the database to an inconsistent state. To resolve this problem, we allow parallel execution of a transaction schedule if its transactions are either serializable or have some equivalence relation among them.
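
Before turning to schedule equivalence, here is a minimal sketch of the ACID guarantees in practice, using Python's sqlite3 module. The table, names and amounts are invented for illustration; a real system involves much more:

# Atomicity in practice: either both updates of a funds transfer are applied,
# or neither is. The explicit rollback undoes partial work when a check fails,
# and commit makes the completed transfer durable.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("X", 10000), ("Y", 8000)])
conn.commit()

def transfer(amount):
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = 'X'", (amount,))
        (bal,) = conn.execute("SELECT balance FROM account WHERE name = 'X'").fetchone()
        if bal < 0:                       # consistency check fails
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = 'Y'", (amount,))
        conn.commit()                     # durability: the change survives
    except Exception:
        conn.rollback()                   # atomicity: the partial update is undone

transfer(1000)    # succeeds: X = 9000, Y = 9000
transfer(999999)  # fails and is rolled back: balances unchanged
print(conn.execute("SELECT * FROM account ORDER BY name").fetchall())

Running the sketch leaves X and Y at 9000 each: the committed transfer survives and the failed one is undone as a whole.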
Equivalence Schedules
An equivalence schedule can be of the following types −
o Result Equivalence: If two schedules produce the same result after execution, they
are said to be result equivalent. They may yield the same result for some value and
different results for another set of values. That's why this equivalence is not generally
considered significant.
o View Equivalence: Two schedules would be view equivalent if the transactions in both the schedules perform similar actions in a similar manner.
For example −
o If T reads the initial data in S1, then it also reads the initial data in S2.
o If T reads the value written by J in S1, then it also reads the value written by J in S2.
o If T performs the final write on the data value in S1, then it also performs the final
write on the data value in S2.
Conflict Equivalence
Two operations would be conflicting if they have the following properties −
o Both belong to separate transactions.
o Both access the same data item.
o At least one of them is a "write" operation.
Two schedules having multiple transactions with conflicting operations are said to be conflict
equivalent if and only if −
o Both the schedules contain the same set of Transactions.
o The order of conflicting pairs of operation is maintained in both the schedules.
Note − View equivalent schedules are view serializable and conflict equivalent schedules are
conflict serializable. All conflict serializable schedules are view serializable too.
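One common way to test conflict serializability is to build a precedence graph from the conflicting pairs and check it for cycles. The minimal Python sketch below uses an assumed schedule encoding (transaction name, operation, data item) purely for illustration:

# Conflict-serializability test: add an edge Ti -> Tj whenever an operation of
# Ti conflicts with (and precedes) an operation of Tj; the schedule is conflict
# serializable iff the resulting precedence graph has no cycle.
from itertools import combinations

def conflict_serializable(schedule):
    # schedule: list of (txn, op, item), op in {'r', 'w'}, in execution order.
    edges = set()
    for (t1, op1, x1), (t2, op2, x2) in combinations(schedule, 2):
        if t1 != t2 and x1 == x2 and "w" in (op1, op2):
            edges.add((t1, t2))           # t1's operation comes first, so t1 -> t2

    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)

    def has_cycle(node, path):
        # Depth-first search looking for a path back into the current chain.
        for nxt in graph.get(node, ()):
            if nxt in path or has_cycle(nxt, path | {nxt}):
                return True
        return False

    return not any(has_cycle(t, {t}) for t in graph)

s1 = [("T1", "r", "A"), ("T2", "w", "A"), ("T1", "w", "A")]  # cycle: not serializable
s2 = [("T1", "r", "A"), ("T1", "w", "A"), ("T2", "w", "A")]  # only T1 -> T2
print(conflict_serializable(s1), conflict_serializable(s2))  # False True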
Real-Time Transaction Systems
In systems with real-time constraints, correctness of execution involves both database
consistency and the satisfaction of deadlines. Real time systems are classified as:
• Hard: The task has zero value if it is completed after the deadline.
• Soft: The task has diminishing value if it is completed after the deadline.
A transactional workflow is an activity that involves the coordinated execution of multiple tasks performed by different processing entities, e.g., bank loan processing and purchase order processing.
Transaction Processing Monitors
TP monitors were initially developed as multithreaded servers to support large
numbers of terminals from a single process. They provide infrastructure for building and
administering complex transaction processing systems with a large number of clients and
multiple servers.
A transaction-processing monitor has components for Input queue authorisation,
output queue, network, lock manager, recovery manager, log manager, application servers,
database manager and resource managers.


High-Performance Transaction Systems


High-performance hardware and parallelism help improve the rate of transaction processing. Such systems often use concurrency-control techniques that try to ensure correctness without full serialisability; for example, they use database consistency constraints to split the database into sub-databases on which concurrency can be managed separately.
States of Transactions
A transaction in a database can be in one of the following states –

• Active − In this state, the transaction is being executed. This is the initial state of every
transaction.
• Partially Committed − When a transaction executes its final operation, it is said to be
in a partially committed state.
• Failed − A transaction is said to be in a failed state if any of the checks made by the
database recovery system fails. A failed transaction can no longer proceed further.
• Aborted − If any of the checks fails and the transaction has reached a failed state, then
the recovery manager rolls back all its write operations on the database to bring the
database back to its original state where it was prior to the execution of the
transaction. Transactions in this state are called aborted. The database recovery
module can select one of the two operations after a transaction aborts −
o Re-start the transaction
o Kill the transaction
• Committed − If a transaction executes all its operations successfully, it is said to be
committed. All its effects are now permanently established on the database system.

CONCURRENCY CONTROL
In a multiprogramming environment where multiple transactions can be executed
simultaneously, it is highly important to control the concurrency of transactions. We have
concurrency control protocols to ensure atomicity, isolation, and serializability of concurrent
transactions. Concurrency control protocols can be broadly divided into two categories −
• Lock based protocols


• Time stamp based protocols


Lock-based Protocols
A lock-based mechanism is used to maintain consistency when more than one transaction is executed; it controls concurrent access to data items.
Lock requests are made to the concurrency-control manager, and a transaction can proceed only after its request is granted. The lock-compatibility matrix shows when a lock request will be granted (True).
Database systems equipped with lock-based protocols use a mechanism by which any
transaction cannot read or write data until it acquires an appropriate lock on it. Locks are of
two kinds −
o Binary Locks − A lock on a data item can be in two states; it is either locked or
unlocked.
o Shared/exclusive − This type of locking mechanism differentiates the locks based on
their uses. If a lock is acquired on a data item to perform a write operation, it is an
exclusive lock. Allowing more than one transaction to write on the same data item
would lead the database into an inconsistent state. Read locks are shared because no
data value is being changed.
There are three types of lock protocols available −
▪ Simplistic Lock Protocol: Simplistic lock-based protocols allow transactions to obtain
a lock on every object before a 'write' operation is performed. Transactions may
unlock the data item after completing the ‘write’ operation.
▪ Pre-claiming Lock Protocol: Pre-claiming protocols evaluate their operations and
create a list of data items on which they need locks. Before initiating an execution, the
transaction requests the system for all the locks it needs beforehand. If all the locks
are granted, the transaction executes and releases all the locks when all its operations
are over. If all the locks are not granted, the transaction rolls back and waits until all
the locks are granted.

• Two-Phase Locking 2PL: This locking protocol divides the execution phase of a
transaction into three parts. In the first part, when the transaction starts executing, it
seeks permission for the locks it requires. The second part is where the transaction
acquires all the locks. As soon as the transaction releases its first lock, the third phase
starts. In this phase, the transaction cannot demand any new locks; it only releases
the acquired locks.
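
A minimal Python sketch of a shared/exclusive lock table with the two-phase discipline described above follows. Blocking, lock upgrades and deadlock handling are deliberately omitted, and all names are illustrative:

# Two-phase locking sketch: a transaction acquires locks in a growing phase;
# once it releases any lock (shrinking phase) it may not acquire new ones.
class LockManager:
    def __init__(self):
        self.locks = {}   # item -> (mode, set of holding transactions)

    def acquire(self, txn, item, mode):
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":   # shared is compatible with shared
            holders.add(txn)
            return True
        return holders == {txn}                # already the sole holder

    def release(self, txn, item):
        mode, holders = self.locks[item]
        holders.discard(txn)
        if not holders:
            del self.locks[item]

class Transaction:
    def __init__(self, name, lm):
        self.name, self.lm, self.shrinking = name, lm, False

    def lock(self, item, mode):
        assert not self.shrinking, "2PL violated: lock requested after first unlock"
        return self.lm.acquire(self.name, item, mode)

    def unlock(self, item):
        self.shrinking = True                  # growing phase is over
        self.lm.release(self.name, item)

lm = LockManager()
t1 = Transaction("T1", lm)
print(t1.lock("A", "X"))   # True (growing phase)
t1.unlock("A")             # shrinking phase begins
# t1.lock("B", "S")        # would raise: no new locks after the first unlock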


Strict-2PL does not have cascading aborts as basic 2PL does.


• Timestamp-based Protocols: The most commonly used concurrency protocol is the
timestamp based protocol. This protocol uses either system time or logical counter as
a timestamp.
Lock-based protocols manage the order between the conflicting pairs among
transactions at the time of execution, whereas timestamp-based protocols start working as
soon as a transaction is created.
Every transaction has a timestamp associated with it, and the ordering is determined
by the age of the transaction. A transaction created at 0002 clock time would be older than
all other transactions that come after it. For example, any transaction 'y' entering the system
at 0004 is two seconds younger and the priority would be given to the older one.
In addition, every data item is given the latest read and write-timestamp. This lets the
system know when the last ‘read and write’ operation was performed on the data item.

How Timestamp Ordering Protocol works


The timestamp-ordering protocol ensures serializability among transactions in their
conflicting read and write operations. This is the responsibility of the protocol system that the
conflicting pair of tasks should be executed according to the timestamp values of the
transactions.
Each transaction is issued a timestamp when it enters the system. If an old transaction
Ti has time-stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that TS(Ti)
<TS(Tj). This protocol manages concurrent execution in such a manner that the time-stamps
determine the serialisability order. In order to assure such behaviour, the protocol needs to
maintain for each data Q two timestamp values:
• W-timestamp(Q) is the largest time-stamp of any transaction that executed write(Q)
successfully.
• R-timestamp(Q) is the largest time-stamp of any transaction that executed read(Q)
successfully.
The timestamp ordering protocol executes any conflicting read and write operations in
timestamp order. Suppose a transaction Ti issues a read(Q).
• If TS(Ti) < W-timestamp(Q), ⇒ Ti needs to read a value of Q that was already overwritten. Action: reject the read operation and roll back Ti.
• If TS(Ti)>= W-timestamp(Q), ⇒ It is OK. Action: Execute read operation and set R-
timestamp(Q) to the maximum of R-timestamp(Q) and TS(Ti).


Suppose that transaction Ti issues write (Q):


• If TS(Ti) < R-timestamp(Q), ⇒ the value of Q that Ti is producing was needed previously, and it was assumed that this value would never be produced. Action: Reject the write operation and roll back Ti.
• If TS(Ti) < W-timestamp(Q), ⇒ Ti is trying to write an obsolete value of Q. Action: Reject
write operation, and roll back Ti.
• Otherwise, execute write operation, and set W-timestamp(Q) to TS(Ti).
Suppose the following postulates for Ti
• The timestamp of transaction Ti is denoted as TS(Ti).
• Read time-stamp of data-item X is denoted by R-timestamp(X).
• Write time-stamp of data-item X is denoted by W-timestamp(X).
The timestamp ordering protocol works as follows (a minimal code sketch follows these rules) −
• If a transaction Ti issues a read(X) operation −
• If TS(Ti) < W-timestamp(X)
o Operation rejected.
• If TS(Ti) >= W-timestamp(X)
o Operation executed.
• All data-item timestamps updated.
• If a transaction Ti issues a write(X) operation −
If TS(Ti) < R-timestamp(X)
o Operation rejected.
• If TS(Ti) < W-timestamp(X)
o Operation rejected and Ti rolled back.
• Otherwise, operation executed.
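
The rules above can be sketched directly in Python. To keep the example short, rolled-back transactions are merely reported rather than restarted:

# Timestamp-ordering checks for read(X) and write(X). Each data item keeps its
# largest read and write timestamps; operations arriving "too late" cause the
# issuing transaction to be rejected (rolled back).
class Item:
    def __init__(self):
        self.r_ts = 0   # R-timestamp(X)
        self.w_ts = 0   # W-timestamp(X)

def read(ts, item):
    if ts < item.w_ts:                    # X was already overwritten by a younger txn
        return "reject / roll back"
    item.r_ts = max(item.r_ts, ts)
    return "read executed"

def write(ts, item):
    if ts < item.r_ts or ts < item.w_ts:  # a younger txn already read or wrote X
        return "reject / roll back"
    item.w_ts = ts
    return "write executed"

X = Item()
print(write(5, X))   # write executed, W-timestamp(X) = 5
print(read(3, X))    # reject: TS(Ti) = 3 < W-timestamp(X) = 5
print(read(7, X))    # read executed, R-timestamp(X) = 7
print(write(6, X))   # reject: TS(Ti) = 6 < R-timestamp(X) = 7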
Thomas' Write Rule
This rule states if TS(Ti) < W-timestamp(X), then the operation is rejected and Ti is
rolled back. Time-stamp ordering rules can be modified to make the schedule view
serializable. Instead of making Ti rolled back, the 'write' operation itself is ignored.
In a multi-process system, deadlock is an unwanted situation that arises in a shared
resource environment, where a process indefinitely waits for a resource that is held by
another process.
For example, assume a set of transactions {T0, T1, T2, ...,Tn}. T0 needs a resource X to
complete its task. Resource X is held by T1, and T1 is waiting for a resource Y, which is held by
T2. T2 is waiting for resource Z, which is held by T0. Thus, all the processes wait for each other
to release resources. In this situation, none of the processes can finish their task. This situation
is known as a deadlock.
Deadlocks are not healthy for a system. In case a system is stuck in a deadlock, the
transactions involved in the deadlock are either rolled back or restarted.
Deadlock Prevention
To prevent any deadlock situation in the system, the DBMS aggressively inspects all the operations that transactions are about to execute and analyzes whether they can create a deadlock situation. If it finds that a deadlock situation might
occur, then that transaction is never allowed to be executed.
There are deadlock prevention schemes that use timestamp ordering mechanism of
transactions in order to predetermine a deadlock situation.
Wait-Die Scheme
• In this scheme, if a transaction requests to lock a resource (data item) which is already held with a conflicting lock by another transaction, then one of two possibilities may occur. If TS(Ti) < TS(Tj) − that is, Ti, which is requesting the conflicting lock, is older than Tj − then Ti is allowed to wait until the data item is available.
• If TS(Ti) > TS(Tj) − that is, Ti is younger than Tj − then Ti dies. Ti is restarted later with a random delay but with the same timestamp.
This scheme allows the older transaction to wait but kills the younger one.
Wound-Wait Scheme
• In this scheme, if a transaction requests to lock a resource (data item) which is already held with a conflicting lock by some other transaction, one of two possibilities may occur. If TS(Ti) < TS(Tj), then Ti forces Tj to be rolled back − that is, Ti wounds Tj. Tj is restarted later with a random delay but with the same timestamp.
• If TS(Ti) > TS(Tj), then Ti is forced to wait until the resource is available.
This scheme allows the younger transaction to wait; but when an older transaction requests an item held by a younger one, the older transaction forces the younger one to abort and release the item.
In both the cases, the transaction that enters the system at a later stage is aborted.
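
The two schemes can be summarised in a few lines of Python. The timestamps and transaction names are invented for illustration, and a smaller timestamp means an older transaction:

# Wait-Die vs Wound-Wait: given the timestamps of the requesting transaction
# (Ti) and of the lock holder (Tj), decide whether Ti waits, dies, or wounds Tj.
def wait_die(ts_i, ts_j):
    return "Ti waits" if ts_i < ts_j else "Ti dies (restarted later, same timestamp)"

def wound_wait(ts_i, ts_j):
    return "Tj is wounded (rolled back)" if ts_i < ts_j else "Ti waits"

for ts_i, ts_j in [(1, 5), (5, 1)]:
    print(f"TS(Ti)={ts_i}, TS(Tj)={ts_j}:",
          "wait-die ->", wait_die(ts_i, ts_j), "|",
          "wound-wait ->", wound_wait(ts_i, ts_j))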
Deadlock Avoidance
Aborting a transaction is not always a practical approach. Instead, deadlock avoidance mechanisms can be used to detect any deadlock situation in advance. Methods like the "wait-for graph" are available, but they are suitable only for systems where transactions are lightweight and hold few instances of resources. In a bulky system, deadlock prevention techniques may work well.
Wait-for Graph
This is a simple method available to track if any deadlock situation may arise. For each
transaction entering into the system, a node is created. When a transaction Ti requests for a
lock on an item, say X, which is held by some other transaction Tj, a directed edge is created
from Ti to Tj. If Tj releases item X, the edge between them is dropped and Ti locks the data
item.
The system maintains this wait-for graph for every transaction waiting for some data
items held by others. The system keeps checking if there's any cycle in the graph.
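
A minimal Python sketch of a wait-for graph with cycle detection follows; the transaction names mirror the T0, T1, T2 example above:

# Wait-for graph: an edge Ti -> Tj means Ti is waiting for an item held by Tj.
# A cycle in the graph means the transactions involved are deadlocked.
def find_cycle(wait_for):
    # wait_for: dict mapping a transaction to the set of transactions it waits on.
    def visit(node, path):
        for nxt in wait_for.get(node, ()):
            if nxt in path:
                return path[path.index(nxt):] + [nxt]   # the deadlock cycle
            found = visit(nxt, path + [nxt])
            if found:
                return found
        return None

    for txn in wait_for:
        cycle = visit(txn, [txn])
        if cycle:
            return cycle
    return None

# T0 waits for T1, T1 waits for T2, T2 waits for T0 -> deadlock.
print(find_cycle({"T0": {"T1"}, "T1": {"T2"}, "T2": {"T0"}}))
# No cycle: T1 simply waits for T2.
print(find_cycle({"T1": {"T2"}}))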


Here, we can use any of the two following approaches −


• First, do not allow any request for an item, which is already locked by another
transaction. This is not always feasible and may cause starvation, where a
transaction indefinitely waits for a data item and can never acquire it.
• The second option is to roll back one of the transactions. It is not always feasible to roll back the younger transaction, as it may be more important than the older one. With the help of some relative algorithm, a transaction is chosen to be aborted. This transaction is known as the victim, and the process is known as victim selection.
Granularity
This refers to the size of the data item that is allowed to be locked.
Multiple Granularity
It can be defined as hierarchically breaking up the database into blocks which can be locked. The multiple granularity protocol enhances concurrency and reduces lock overhead. It keeps track of what to lock and how to lock, and it makes it easy to decide either to lock a data item or to unlock a data item. This type of hierarchy can be graphically represented as a tree. For example, consider a tree which has four levels of nodes:
• The first level or higher level shows the entire database.
• The second level represents a node of Files.
• The third level consists of children nodes which are known as Records. No record can be present in more than one file.
• Finally, each Record contains child nodes known as fields. The record has exactly those
fields that are its child nodes.
Hence, the levels of the tree starting from the top level are as follows: Database, File,
Record and Field.
The highest level in the example hierarchy is the entire database. The levels below are those of file, record and field.


But how is locking done in such situations? Locking may be done by using intention mode
locking.
Intention Lock Modes
In addition to S and X lock modes, there are three additional lock modes with multiple
Granularity, they are:
• Intention-shared (IS): It contains explicit locking at a lower level of the tree but only
with shared locks.
• Intention-Exclusive (IX): It contains explicit locking at a lower level with exclusive or
shared locks.
• Shared & Intention-Exclusive (SIX): In this lock, the node is locked in shared mode, and some descendant node is locked in exclusive mode by the same transaction.
Intention locks allow a higher-level node to be locked in share (S) or exclusive (X) mode
without having to check all descendent nodes. Thus, this locking scheme helps in providing
more concurrency but lowers the lock overheads.
Compatibility Matrix with Intention Lock Modes:
The compatibility matrix that governs when these locks can be granted is summarised below.
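
The following minimal Python sketch encodes the standard textbook compatibility matrix for the five modes (IS, IX, S, SIX, X); Y means the requested mode can be granted alongside the held mode, N means it cannot:

# Standard compatibility matrix for multiple-granularity lock modes.
# compatible[requested][held] is True when the requested lock can be granted
# while another transaction holds the lock in the 'held' mode.
MODES = ["IS", "IX", "S", "SIX", "X"]

compatible = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

# Print the matrix as a small table (rows = requested mode, columns = held mode).
print("req\\held " + " ".join(f"{m:>3}" for m in MODES))
for req in MODES:
    row = " ".join(f"{'Y' if compatible[req][held] else 'N':>3}" for held in MODES)
    print(f"{req:>8} {row}")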

CONCURRENCY IN INDEX STRUCTURES


Indices are unlike other database items in that their only job is to help in accessing
data. Index-structures are typically accessed very often, much more than other database
items. Treating index-structures like other database items leads to low concurrency. Two-
phase locking on an index may result in transactions being executed practically one-at-a-time.
It is acceptable to have non-serializable concurrent access to an index as long as the accuracy of the index is maintained. In particular, the exact values read in an internal node of a B+-tree are irrelevant so long as we end up in the correct leaf node. There are index concurrency
protocols where locks on internal nodes are released early, and not in a two-phase fashion.
Example of index concurrency protocol: Use crabbing instead of two-phase locking on
the nodes of the B+-tree, as follows. During search/insertion/deletion:
• First lock the root node in shared mode.
• After locking all required children of a node in shared mode, release the lock
on the node.
• During insertion/deletion, upgrade leaf node locks to exclusive mode.
• When splitting or coalescing requires changes to a parent, lock the parent in
exclusive mode.
FAILURE CLASSIFICATION
A DBMS may encounter a failure. These failures may be of the following types:
• Transaction failure. An ongoing transaction may fail due to:
▪ Logical errors: the transaction cannot be completed due to some internal error condition.
▪ System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock).
• System crash: a power failure or other hardware or software failure causes the system to crash. Under the fail-stop assumption, non-volatile storage contents are assumed to be uncorrupted by a system crash.
• Disk failure: a head crash or similar disk failure destroys all or part of the disk storage capacity. Destruction is assumed to be detectable: disk drives use checksums to detect failure.
All these failures can leave the database in an inconsistent state. Thus, we need a recovery scheme in a database system; but before we discuss recovery, let us briefly define the storage structure from a recovery point of view.
Storage Structure
There are various ways for storing information:
Volatile storage
• Does not survive system crashes, examples: main memory, cache memory
Non-volatile storage
• Survives system crashes, examples: disk, tape, flash memory, non-volatile (battery backed up) RAM.
Stable storage
• A mythical form of storage that survives all failures,
• Approximated by maintaining multiple copies on distinct non-volatile media.
Stable-Storage Implementation
A stable storage maintains multiple copies of each block on separate disks. Copies can
be kept at remote sites to protect against disasters such as fire or flooding. Failure during data
transfer can still result in inconsistent copies. A block transfer can result in:
• Successful completion

• Partial failure: destination block has incorrect information


• Total failure: destination block was never updated.
For protecting storage media from failure during data transfer we can execute output
operation as follows (assuming two copies of each block):
1. Write the information on the first physical block.
2. When the first write successfully completes, write the same information on the second
physical block.
3. The output is completed only after the second write is successfully completed.
Copies of a block may differ due to a failure during an output operation. To recover from such a failure, we first need to find the inconsistent blocks. One expensive solution is to compare the two copies of every disk block, but a better solution may be to:
• record in-progress disk writes on non-volatile storage (non-volatile RAM or a special area of disk),
• use this information during recovery to find the blocks that may be inconsistent, and only compare copies of these,
• this approach is used in hardware RAID systems.
If either copy of an inconsistent block is detected with an error (bad checksum),
overwrite it with the other copy. If both have no error, but are different, overwrite the second
block with the first block.
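
A minimal Python sketch of the two-copy output procedure and its recovery step follows; the file names and block contents are invented, and checksum handling is omitted for brevity:

# Stable-storage sketch: write a block to two copies in order, and on recovery
# repair any inconsistency by copying the first copy over the second.
from pathlib import Path

COPY1, COPY2 = Path("block_copy1.bin"), Path("block_copy2.bin")

def stable_write(block: bytes):
    COPY1.write_bytes(block)     # step 1: write the first physical block
    COPY2.write_bytes(block)     # step 2: only after step 1 completes successfully

def recover():
    a, b = COPY1.read_bytes(), COPY2.read_bytes()
    if a != b:                   # a failure happened between the two writes
        COPY2.write_bytes(a)     # overwrite the second copy with the first

stable_write(b"balance=9000")
recover()
print(COPY1.read_bytes() == COPY2.read_bytes())   # True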

RECOVERY ALGORITHMS

Recovery algorithms are techniques to ensure database consistency and transaction atomicity and durability despite failures. Recovery algorithms have two parts:
• actions taken during normal transaction processing to ensure that enough information exists to recover from failures,
• actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability.
Modifying the database without ensuring that the transaction will commit may leave the database in an inconsistent state. To ensure consistency despite failures, we have several recovery mechanisms; the most popular approaches are log-based recovery and shadow paging.
Log-Based Recovery
A log is maintained on a stable storage media. The log is a sequence of log records,
and maintains a record of update activities on the database. When transaction Ti starts, it
registers itself by writing a
<Ti start>log record.
Before Ti executes write(X), a log record <Ti, X, V1, V2> is written, where V1 is the
value of X before the write (undo value), and V2 is the value to be written to X (redo value).
• Log record notes that Ti has performed a write on data item X. X had value V1
before the write, and will have value V2 after the write.
• When Ti finishes its last statement, the log record <Ti commit> is written. We assume for now that log records are written directly to a stable storage media (that is, they are not buffered).


Two approaches for recovery using logs are:


• Deferred database modification.
• Immediate database modification.
Deferred Database Modification
The deferred database modification scheme records all the modifications to the log,
but defers all the writes to after partial commit. Let us assume that transactions execute
serially. A transaction starts by writing <Ti start> record to log. A write(X) operation results in
a log record <Ti, X, V > being written, where V is the new value for X. The write is not
performed on X at this time, but is deferred. When Ti partially commits, <Ti commit> is written
to the log. Finally, the log records are read and used to actually execute the previously
deferred writes. During recovery after a crash, a transaction needs to be redone if both <Ti
start> and<Ti commit> are there in the log. Redoing a transaction Ti (redoTi) sets the value of
all data items updated by the transaction to the new values. Crashes can occur while:
• the transaction is executing the original updates, or
• while recovery action is being taken.
Example:
Transactions T1 and T2 (T1 executes before T2):
T1: read(X); X = X − 1000; write(X); read(Y); Y = Y + 1000; write(Y)
T2: read(Z); Z = Z − 1000; write(Z)
The following figure shows the log as it appears at three instances of time (assuming that the initial balance in X is 10,000, Y is 8,000 and Z is 20,000):

If the log on stable storage at the time of the crash is as shown in (a), (b) or (c), then:
o (a) No redo action needs to be performed.
o (b) redo(T1) must be performed, since <T1 commit> is present.
o (c) redo(T1) must be performed, followed by redo(T2), since <T1 commit> and <T2 commit> are both present.
Please note that you can repeat this sequence of redo operations, as suggested in (c), any number of times; it will still bring the values of X, Y and Z to the same consistent redo values. This property of the redo operation is called idempotence.
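
The worked example can be reproduced with a small Python sketch of deferred-modification recovery; the log encoding is a simplification made for illustration:

# Deferred-modification recovery: redo a transaction only if both <Ti start>
# and <Ti commit> appear in the log; writes were deferred, so no undo is needed.
db = {"X": 10000, "Y": 8000, "Z": 20000}

log = [
    ("start", "T1"), ("write", "T1", "X", 9000), ("write", "T1", "Y", 9000),
    ("commit", "T1"),
    ("start", "T2"), ("write", "T2", "Z", 19000), ("commit", "T2"),
]

def recover(db, log):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:                       # forward pass: redo the new values
        if rec[0] == "write" and rec[1] in committed:
            _, _, item, new_value = rec
            db[item] = new_value

recover(db, log)
recover(db, log)    # redo is idempotent: repeating it changes nothing
print(db)           # {'X': 9000, 'Y': 9000, 'Z': 19000}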


Immediate Database Modification


The immediate database modification scheme allows database updates to be applied to the stored database even before a transaction commits. These updates are made as the writes are issued; since undoing may be needed, the update log records must contain both the old value and the new value. Update log records must be written before the database item itself is written (we assume the log record is output directly to stable storage; this can be relaxed to postpone log record output, as long as all log records corresponding to a data block Y are flushed to stable storage before an output(Y) operation is executed for that block).
Output of updated blocks can take place at any time before or after transaction commit. The order in which blocks are output can be different from the order in which they are written, for example:

The recovery procedure in such has two operations instead of one:


• undo(Ti) restores the value of all data items updated by Ti to their old values,
moving backwards from the last log record for Ti,
• redo(Ti) sets the value of all data items updated by Ti to the new values, moving
forward from the first log record for Ti.
Both operations are idempotent; that is, even if an operation is executed multiple times, the effect is the same as if it were executed once. (This is necessary because operations may have to be re-executed during recovery.)
When recovering after failure:
• Transaction Ti needs to be undone if the log contains the record <Ti start>, but
does not contain the record <Ti commit>.
• Transaction Ti needs to be redone if the log contains both the record <Ti start>
and the record <Ti commit>.
Undo operations are performed first, then redo operations


Example: Consider the log as it appears at three instances of time

Recovery actions in each case above are:

• undo (T1): Y is restored to 8000 and X to 10000.


• undo (T2) and redo (T1): Z is restored to 20000, and then X and Y are set to 9000
and 9000 respectively.
• redo (T1) and redo (T2): X and Y are set to 9000 and 9000 respectively. Then Z is
set to 19000
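
The second case above, undo(T2) followed by redo(T1), can be sketched in Python as follows; the log encoding is simplified, but it records both old and new values, as the scheme requires:

# Immediate-modification recovery: transactions with a commit record are redone
# (forwards); transactions with a start record but no commit are undone (backwards).
db = {"X": 9000, "Y": 9000, "Z": 19000}   # database state at crash time: updates
                                          # were applied immediately as issued

log = [
    ("start", "T1"), ("write", "T1", "X", 10000, 9000),
    ("write", "T1", "Y", 8000, 9000), ("commit", "T1"),
    ("start", "T2"), ("write", "T2", "Z", 20000, 19000),
    # crash before <T2 commit>
]

def recover(db, log):
    committed = {r[1] for r in log if r[0] == "commit"}
    started   = {r[1] for r in log if r[0] == "start"}
    for rec in reversed(log):                       # undo pass first (old values)
        if rec[0] == "write" and rec[1] in started - committed:
            _, _, item, old, _new = rec
            db[item] = old
    for rec in log:                                 # then redo pass (new values)
        if rec[0] == "write" and rec[1] in committed:
            _, _, item, _old, new = rec
            db[item] = new

recover(db, log)
print(db)   # {'X': 9000, 'Y': 9000, 'Z': 20000}  (T1 redone, T2 undone)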

Shadow Paging
Shadow paging is an alternative to log-based recovery; this scheme is useful if
transactions are executed serially. In this, two page tables are maintained during the lifetime
of a transaction – the current page table, and the shadow page table. It stores the shadow
page table in non-volatile storage, in such a way that the state of the database prior to
transaction execution may be recovered (shadow page table is never modified during
execution). To start with, both the page tables are identical. Only the current page table is
used for data item accesses during execution of the transaction. Whenever any page is about to be written for the first time, a copy of this page is made onto an unused page; the current page table is then made to point to the copy, and the update is performed on the copy.


A Sample page table

Shadow page table


To commit a transaction:
1. Flush all modified pages in main memory to disk
2. Output current page table to disk
3. Make the current page table the new shadow page table, as follows:
• keep a pointer to the shadow page table at a fixed (known) location on disk.
• to make the current page table the new shadow page table, simply update the
pointer to point at the current page table on disk.


Once the pointer to the shadow page table has been written, the transaction is committed. No recovery is needed after a crash: new transactions can start right away, using the shadow page table. Pages not pointed to from the current/shadow page table should be freed.
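
A minimal Python sketch of the copy-on-write and pointer-swap idea follows; the page contents, table layout and the root pointer are all invented for illustration:

# Shadow-paging sketch: copy pages on first write into a current page table,
# then commit by swapping a single root pointer to that table.
pages = {1: "A0", 2: "B0", 3: "C0"}          # physical page id -> contents
shadow_table  = {1: 1, 2: 2, 3: 3}           # logical page -> physical page
current_table = dict(shadow_table)           # identical at transaction start
db_root = "shadow"                           # the one pointer kept on disk

def write_page(logical, new_contents):
    # First write to a page: copy it to a fresh physical page, then update.
    new_physical = max(pages) + 1
    pages[new_physical] = new_contents
    current_table[logical] = new_physical    # only the current table changes

def commit():
    global shadow_table, db_root
    shadow_table = dict(current_table)       # current table becomes the new shadow
    db_root = "current"                      # a single atomic pointer update commits

write_page(2, "B1")
# A crash here loses nothing: the shadow table still maps page 2 to "B0".
commit()
print(pages[shadow_table[2]])                # "B1" after commit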
Advantages of shadow-paging over log-based schemes:
• It has no overhead of writing log records,
• The recovery is trivial.
Disadvantages:
• Copying the entire page table is very expensive, it can be reduced by using a page table
structured like a B+-tree (no need to copy entire tree, only need to copy paths in the
tree that lead to updated leaf nodes).
• Commit overhead is high even with the above extension (Need to flush every updated
page, and page table).
• Data gets fragmented (related pages get separated on disk).
• After every transaction completes, the database pages containing old versions of the modified data need to be garbage collected/freed.
• Hard to extend algorithm to allow transactions to run concurrently (easier to extend
log based schemes).

BUFFER MANAGEMENT
When the database is updated, a lot of records are changed in the buffers allocated to the log records and database records. Although buffer management is the job of the operating system, DBMSs sometimes prefer buffer-management policies of their own.

Log Record Buffering

Log records are buffered in the main memory, instead of being output directly to a
stable storage media. Log records are output to a stable storage when a block of log records
in the buffer is full, or a log force operation is executed. Log force is performed to commit a
transaction by forcing all its log records (including the commit record) to stable storage.
Several log records can thus be output using a single output operation, reducing the I/O cost.

The rules below must be followed if log records are buffered:

• Log records are output to stable storage in the order in which they are created.
• Transaction Ti enters the commit state only when the log record <Ti commit> has
been output to stable storage.
• Before a block of data in the main memory is output to the database, all log records
pertaining to data in that block must be output to a stable storage.
• These rules are also called the write-ahead logging scheme.

Database Buffering
The database maintains an in-memory buffer of data blocks. When a new block is needed and the buffer is full, an existing block must be removed from the buffer; if the block chosen for removal has been updated, it must be output to the disk. However, as per the write-ahead logging scheme, before a block with uncommitted updates is output to disk, the log records with undo information for those updates must be output to the log on stable storage.
No updates should be in progress on a block when it is output to disk. This can be ensured as
follows:

• Before writing a data item, the transaction acquires exclusive lock on block
containing the data item.
• Lock can be released once the write is completed. (Such locks held for short
duration are called latches).
• Before a block is output to disk, the system acquires an exclusive latch on the block
(ensures no update can be in progress on the block).
A database buffer can be implemented either in an area of real main memory reserved for the database, or in virtual memory. Implementing the buffer in reserved main memory has drawbacks: memory is partitioned beforehand between the database buffer and applications, limiting flexibility. Even though the operating system knows how memory should be divided at any time, it cannot change the partitioning of memory.

Database buffers are generally implemented in virtual memory in spite of some drawbacks. When an operating system needs to evict a page that has been modified, to make space for another page, the page is written to swap space on disk. When the database later decides to write a buffer page to disk, that page may be in swap space and may have to be read from swap space on disk and then output to the database on disk, resulting in extra I/O; this is known as the dual paging problem. Ideally, when swapping out a database buffer page, the operating system should hand control over to the database, which in turn outputs the page to the database instead of to swap space (making sure to output log records first); dual paging can thus be avoided, but common operating systems do not support such functionality.

DECISION SUPPORT SYSTEMS (DSS)


Decision support systems (DSS) are interactive software-based systems intended to
help managers in decision-making by accessing large volumes of information generated from
various related information systems involved in organizational business processes, such as office automation systems, transaction processing systems, etc.
A DSS uses summary information, exceptions, patterns, and trends derived from analytical models. A decision support system helps in decision-making but does not necessarily give a decision itself. The decision makers compile useful information from raw data, documents, personal knowledge, and/or business models to identify and solve problems and make decisions. There are two types of decisions: programmed and non-programmed decisions.
Programmed decisions are basically automated processes, general routine work, where −
• These decisions have been taken several times.
• These decisions follow some guidelines or rules.
For example, selecting a reorder level for inventories is a programmed decision (a small illustration follows).
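To make the idea of a programmed decision concrete, the small sketch below computes a reorder level from a common textbook rule of thumb (average daily demand × lead time + safety stock); the figures used are invented.

```python
def reorder_level(avg_daily_demand, lead_time_days, safety_stock=0):
    """Programmed decision rule: reorder when stock on hand falls to this level."""
    return avg_daily_demand * lead_time_days + safety_stock

# Example: 40 units sold per day, a 5-day supplier lead time, 50 units of safety stock.
print(reorder_level(40, 5, 50))   # 250 -> place a new order once stock drops to 250 units
```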
Non-programmed decisions occur in unusual and previously unaddressed situations, so −
• It would be a new decision.

• There will not be any rules to follow.


• These decisions are made based on the available information.
• These decisions are based on the manager's discretion, instinct, perception and
judgment.
For example, investing in a new technology is a non-programmed decision.
Decision support systems generally involve non-programmed decisions. Therefore,
there will be no exact report, content, or format for these systems. Reports are generated on
the fly.
Attributes of a DSS
• Adaptability and flexibility
• High level of Interactivity
• Ease of use
• Efficiency and effectiveness
• Complete control by decision-makers
• Ease of development
• Extendibility
• Support for modelling and analysis
• Support for data access
• Standalone, integrated, and Web-based
Characteristics of a DSS
• Support for decision-makers in semi-structured and unstructured problems.
• Support for managers at various managerial levels, ranging from top executive to line
managers.
• Support for individuals and groups. Less structured problems often require the involvement of several individuals from different departments and organizational levels.
• Support for interdependent or sequential decisions.
• Support for intelligence, design, choice, and implementation.
• Support for variety of decision processes and styles.
• DSSs are adaptive over time.
Benefits of DSS
• Improves efficiency and speed of decision-making activities.
• Increases the control, competitiveness, and future decision-making capability of the organization.
• Facilitates interpersonal communication.
• Encourages learning or training.
• Since it is mostly used for non-programmed decisions, it reveals new approaches and establishes new evidence for unusual decisions.
• Helps automate managerial processes.
Components of a DSS
Following are the components of the Decision Support System −
• Database Management System (DBMS) − To solve a problem, the necessary data may come from internal or external databases. In an organization, internal data are
generated by systems such as TPS and MIS. External data come from a variety of sources such as newspapers, online data services, and databases (financial, marketing, human resources).
• Model Management System − It stores and accesses the models that managers use to make decisions. Such models are used for designing a manufacturing facility, analyzing the financial health of an organization, forecasting demand for a product or service, etc.
• Support Tools − Support tools such as online help, pull-down menus, user interfaces, graphical analysis, and error-correction mechanisms facilitate user interaction with the system.
Classification of DSS
There are several ways to classify DSS. Holsapple and Whinston classify DSS as follows −
• Text Oriented DSS − It contains textually represented information that could have a bearing on decisions. It allows documents to be electronically created, revised, and viewed as needed.
• Database Oriented DSS − Database plays a major role here; it contains organized and
highly structured data.
• Spreadsheet Oriented DSS − It contains information in spreadsheets that allows the user to create, view, and modify procedural knowledge and also to instruct the system to execute self-contained instructions. The most popular tools are Excel and Lotus 1-2-3.
• Solver Oriented DSS − It is based on a solver, which is an algorithm or procedure written to perform certain computations for a particular type of program.
• Rules Oriented DSS − Procedures are adopted as rules in a rules oriented DSS. An expert system is an example.
• Compound DSS − It is built by using two or more of the five structures explained above.
Types of DSS
Following are some typical DSSs −
• Status Inquiry System − It helps in making operational-level or middle-level management decisions, for example, the daily scheduling of jobs to machines or of machines to operators.
• Data Analysis System − It needs comparative analysis and makes use of a formula or an algorithm, for example, cash flow analysis, inventory analysis, etc.
• Information Analysis System − In this system, data is analyzed and an information report is generated, for example, sales analysis, accounts receivable systems, market analysis, etc.
• Accounting System − It keeps track of accounting and finance related information, for example, final accounts, accounts receivable, accounts payable, etc., that track the major aspects of the business.
• Model Based System − Simulation models or optimization models used for decision-making are used infrequently and create general guidelines for operations or management.

DATA MINING
Data mining refers to the process of turning raw data into meaningful information. It is grounded in research, and many organizations follow the data mining process to transform data into useful information. It helps organizations build more innovative strategies, increase sales, generate revenue, and grow the business through cost reduction.
Data Mining Techniques
Below are some common data mining techniques:
• Classification analysis: This is used to classify data items into different classes. It is used to retrieve significant information related to data and metadata.
• Association Rule Learning: This refers to the process of identifying relations between distinct variables in a large data set (a small example follows this list).
• Outlier detection: Outlier detection refers to the observation of data in a database that does not match an expected pattern.
• Clustering Analysis: A 'cluster' is a collection of data objects that are similar to one another within the same cluster.
• Regression Analysis: This is the process of analyzing and identifying the relationship among different variables.
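As a small illustration of association rule learning, the sketch below computes the support and confidence of the rule {bread} → {butter} over a handful of invented transactions; it is not a full algorithm such as Apriori.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

antecedent, consequent = {"bread"}, {"butter"}

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)   # fraction of all transactions with bread AND butter
confidence = both / ante             # of the transactions with bread, how many also have butter

print(f"support={support:.2f}, confidence={confidence:.2f}")  # support=0.50, confidence=0.67
```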
Data Analysis
Data analysis is a method that can be used to investigate, analyze, and present data in order to find useful information. There are several types of data analysis, but usually people think of quantitative data first, for example, data that comes from surveys or a census.
Let's understand the concept of data analysis with the help of a day-to-day example. Suppose there is a retail shop like ShopRite, and some products always expire before they are sold, which means a financial loss for the company. So how do you minimize the loss?
Let's have a look at the available data. The products can be categorized into various categories such as food products, beverages, clothing, etc. These categories can be subdivided further, eventually forming a tree.
The retail shop manager has the list of products sold on each day, the peak hours of the store, the products sold during different hourly zones, the number of customers on each day, and a lot of other related information. Based on all this information, the manager can figure out which products sell at what time of day. The data can also be split by season to see which products sell during which season, and which products sell very little (an illustrative sketch of this follows).
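A minimal sketch of this kind of analysis using the pandas library; the product names, hours, and sales figures are invented purely for illustration.

```python
import pandas as pd

# Invented sales records: one row per product per hour of the day.
sales = pd.DataFrame({
    "product":  ["milk", "milk", "bread", "bread", "soda", "soda"],
    "category": ["food", "food", "food", "food", "beverage", "beverage"],
    "hour":     [9, 18, 9, 18, 13, 18],
    "units":    [30, 80, 50, 20, 10, 60],
})

# Which products sell at what time of the day?
by_hour = sales.groupby(["product", "hour"])["units"].sum()
print(by_hour)

# Which products sell the least overall (candidates for expiring unsold)?
print(sales.groupby("product")["units"].sum().sort_values().head(3))
```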
Methods of Data Analysis
There are two methods of data analysis: qualitative and quantitative.
• Qualitative research: Primarily, it describes the characteristics of the product. It does not rely on numbers; it emphasizes the quality of the product.
• Quantitative research: It is the inverse of qualitative research because its primary focus is on numbers. Quantitative research is all about quantity.
Data Mining VS Data Analysis
Data mining and data analysis are major steps in any project based on data-driven decisions, and they must be carried out efficiently to ensure the success of such projects.
Nowadays, data analysis and strategy development play a vital role in extracting important information from the available data sets. First, all the data is kept in a data warehouse, and then it is used for business intelligence requirements. There are various concepts and views regarding data mining and data analysis, but both terms can be regarded as subsets of business intelligence. Data mining and data analysis are similar, so finding the difference between them can be a little difficult.

Data Warehouse

A Data Warehouse is a relational database construct designed to meet the requirements of query and analysis rather than of transaction processing. It can be loosely described as any centralized data repository which can be queried for business benefit. It is a database that stores information oriented to satisfying decision-making requests. It is a group of decision support technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions. Data warehousing therefore supports architectures and tools that allow business executives to systematically organize, understand, and use their information to make strategic decisions.
A Data Warehouse environment contains an extraction, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, customer analysis tools, and other applications that handle the process of gathering information and delivering it to business users.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of management's decisions."
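A toy extract-transform-load (ETL) sketch using Python's built-in sqlite3 module: rows are extracted from a hypothetical operational source, transformed (names cleaned and conformed, revenue derived), and loaded into a warehouse fact table. The table and column names are invented for illustration.

```python
import sqlite3

# Extract: rows as they might come from an operational (OLTP) system.
source_rows = [
    ("2024-01-05", " Milk ", 3, 1.20),
    ("2024-01-05", "BREAD", 2, 0.80),
]

# Transform: clean and conform the data (trim/standardise product names, compute revenue).
clean_rows = [
    (date, name.strip().lower(), qty, qty * price)
    for date, name, qty, price in source_rows
]

# Load: append into a warehouse fact table (historical data is kept, never updated).
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_fact (sale_date TEXT, product TEXT, qty INTEGER, revenue REAL)")
dw.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", clean_rows)

# A simple analytical query over the warehouse.
for row in dw.execute("SELECT product, SUM(revenue) FROM sales_fact GROUP BY product"):
    print(row)
```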
A Data Warehouse can be viewed as a data system with the following attributes:
• It is a database designed for investigative tasks, using data from various applications.
• It supports a relatively small number of clients with relatively long interactions.
• It includes current and historical data to provide a historical perspective of
information.
• Its usage is read-intensive.
• It contains a few large tables.

Characteristics of Data Warehouse

• Subject-Oriented: A data warehouse targets the modelling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, rather than the organization's ongoing operations as a whole. This is done by excluding data that is not useful concerning the subject and including all data needed by the users to understand the subject.

• Integrated: A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online transaction records. Data cleaning and data integration must be performed during warehousing to ensure consistency in naming conventions, attribute types, etc., among the different data sources.

• Time-Variant: Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
• Non-Volatile: The data warehouse is a physically separate data store, transformed from the source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed. It usually requires only two procedures for data access: the initial loading of data and access to data. Therefore, the DW does not require transaction processing,
recovery, and concurrency-control capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, the data should not change.

Goals of Data Warehousing


• To help reporting as well as analysis
• Maintain the organization's historical information
• Be the foundation for decision making.
Need for Data Warehouse
Data Warehouse is needed for the following reasons:
• Business User: Business users require a data warehouse to view summarized data
from the past. Since these people are non-technical, the data may be presented to
them in an elementary form.
• Store historical data: A data warehouse is required to store time-variant data from the past, which is used for various purposes.
• Make strategic decisions: Some strategies may depend on the data in the data warehouse, so the data warehouse contributes to making strategic decisions.
• For data consistency and quality: By bringing data from different sources to a common place, the user can effectively ensure uniformity and consistency in the data.
• High response time: The data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and a quick response time.
Benefits of Data Warehouse
• Understand business trends and make better forecasting decisions.
• Data warehouses are designed to perform well with enormous amounts of data.
• The structure of data warehouses is more accessible for end-users to navigate,
understand, and query.

• Queries that would be complex in many normalized databases could be easier to build
and maintain in data warehouses.
• Data warehousing is an efficient method to manage demand for lots of information
from lots of users.
• Data warehousing provides the capability to analyze large amounts of historical data.

BIG DATA

Big Data differs from, and goes beyond, the data handled by a standard database. Standard relational databases are efficient for storing and processing structured data; they use tables to store the data and the Structured Query Language (SQL) to access and retrieve it. Big Data, in contrast, also includes unstructured and semi-structured data. Specific types of databases known as NoSQL databases exist, and several kinds of NoSQL databases and tools are available to store and process Big Data. NoSQL databases are optimized for data analytics over Big Data such as text, images, and logos, and other data formats such as XML and JSON. Big Data is helpful for developing data-driven intelligent applications.
Big Data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases; a traditional database is not able to capture, manage, and process such a high volume of data with low latency. A database, on the other hand, is a collection of information that is organized so that it can be easily captured, accessed, managed, and updated.
Big Data also refers to technologies and initiatives that involve data that is too diverse (i.e., of many varieties), too rapidly changing, or too massive for conventional skills, technologies, and infrastructure to address efficiently, whereas a database management system (DBMS) extracts information from a database in response to queries, but only under more restricted conditions. Big Data can contain any variety of data, while a database is defined through a fixed schema; Big Data is difficult to store and process, while data in SQL databases can be stored and processed easily.
Why is it so popular?
The reason it is so popular is due to the following characteristics:
• Volume: Volume is probably the best-known characteristic of Big Data. Almost 90% of today's data was created in the past couple of years, and volume plays a major role when considering Big Data.
• Variety: When we talk of Big Data, we need to consider data in all formats, i.e., the handling of structured, semi-structured, and unstructured data. We are capturing all varieties of data, whether it is a PDF, an image, a website click, or a video. This mix of data varieties is very difficult to store and analyze.
• Velocity: Velocity is the speed or rate at which data is being generated, clicked, refreshed, produced, and accessed. Facebook generates around 500 TB of data per day, YouTube users upload about 400 hours of video per minute, and Google processes billions of searches per day.
• Variability: The inconsistency that data shows at times can slow down processing. Variability arises from the multiple data dimensions that result from multiple data sources.

• Veracity: It refers to data accuracy: how accurate the data is and how meaningful it is for the analysis based on it.
Is big data a database?
Google Maps tells you the fastest route and saves your time. Amazon knows what you want to buy, and Netflix recommends a list of movies you may be interested in watching. If big data is capable of all this today, just imagine what it will be capable of tomorrow. The amount of data available to us is only going to increase, and analytics technology will become more advanced. It will be the solution for a smarter, more advanced life. Maybe you will get a notification on your smartphone suggesting some medicines because you may soon encounter health issues. It is going to change the way we look at life. Databases, whether SQL or NoSQL, are tools to store, process, and analyze Big Data.
Characteristics of Big Data
Big Data refers to large amounts of data that cannot be processed by traditional data storage or processing units. It is used by many multinational companies to process data and run the business of many organizations. The total data flow is said to exceed 150 exabytes per day before replication.
There are five V's of Big Data that explain its characteristics; they are called the 5 V's of Big Data.

• Volume: The name Big Data itself is related to an enormous size. Big Data involves vast volumes of data generated from many sources daily, such as business processes, machines, social media platforms, networks, human interactions, and many more. Facebook, for example, generates approximately a billion messages a day, records about 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts each day. Big Data technologies can handle such large amounts of data.

• Variety: Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in many forms, such as PDFs, emails, audio files, social media posts, photos, videos, etc.

The data is categorized as below (a short example contrasting structured and semi-structured data follows the 5 V's list):


• Structured data: Structured data has a fixed schema with all the required columns and is in tabular form. Structured data is stored in a relational database management system.
• Semi-structured: In semi-structured data, the schema is not rigidly defined, e.g., JSON, XML, CSV, TSV, and email. Such data carries some structure of its own (tags or key-value pairs) but does not fit neatly into the fixed tables of a relational system.
• Unstructured Data: All unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a great deal of data available, but they do not know how to derive value from it because the data is raw.
• Quasi-structured Data: This is textual data with inconsistent formats that can be brought into a usable form with effort, time, and the help of some tools. Example: web server
logs, i.e., a log file created and maintained by a server, containing a list of its activities.
• Veracity: Veracity means how reliable the data is. There are many ways to filter or translate the data, and veracity is about being able to handle and manage data efficiently. Big Data is also essential in business development.
• Value: Value is an essential characteristic of Big Data. What matters is not merely the data that we process or store, but the valuable and reliable data that we store, process, and analyze.

• Velocity: Velocity plays an important role compared to the other characteristics. It is the speed at which data is created, often in real time. It covers the speed of incoming data sets, the rate of change, and bursts of activity. A primary aspect of Big Data is to provide the data that is demanded rapidly. Big Data velocity deals with the speed at which data flows from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
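To make the structured/semi-structured contrast concrete, the short sketch below compares a fixed relational-style row (structured) with a JSON document parsed by Python's standard json module (semi-structured); the field names and values are invented.

```python
import json

# Structured: a fixed schema, every row has the same columns.
structured_row = ("C001", "Aisha", "Kano", 2500.00)   # (customer_id, name, city, balance)

# Semi-structured: self-describing keys, and different records may carry different fields.
doc = json.loads("""
{
  "customer_id": "C002",
  "name": "Sunday",
  "contacts": {"email": "sunday@example.com"},
  "recent_posts": ["great service", "fast delivery"]
}
""")

print(structured_row[1])          # access by position, schema known in advance
print(doc["contacts"]["email"])   # access by key, schema discovered from the data
```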
