MC4202 - Advanced Database Technologies

ADVANCED DATABASE TECHNOLOGY

UNIT: I

DISTRIBUTED SYSTEMS
WHAT IS DISTRIBUTED SYSTEM? (OR) DEFINE DISTRIBUTED SYSTEM. (PART A)
EXPLAIN BRIEFLY ABOUT DISTRIBUTED SYSTEMS (PART B)
INTRODUCTION
In a distributed database system, the database is stored on several computers. The computers in a distributed system communicate with one another through various communication media, such as high-speed private networks or the Internet.
They do not share main memory or disks. The
computers in a distributed system may vary in size and
function, ranging from workstations up to mainframe systems.
The computers in a distributed system are referred to by
different names, such as sites or nodes.
1. Each site is a database system in its own right;
2. It has its own local users;
3. It has its own local DBMS.

In a distributed database system, both data and transaction processing are divided between one or more computers (CPUs) connected by a network, with each computer playing a special role in the system.
A distributed database system allows applications to access data from local and remote databases.

Characteristics of Distributed Systems


WHAT ARE THE CHARACTERISTICS OF DISTRIBUTED SYSTEM? (PART A)
 Resource sharing - whether it’s the hardware, software or data that can be shared
 Openness - how openly the software is designed to be developed and shared with others
 Concurrency - multiple machines can process the same function at the same time
 Scalability - how do the computing and processing capabilities multiply when extended to many
machines
 Fault tolerance - how easy and quickly can failures in parts of the system be detected and recovered
 Transparency - how much access does one node have to locate and communicate with other nodes in
the system.

Modern distributed systems have evolved to include autonomous processes that might run on the same
physical machine, but interact by exchanging messages with each other.
Example: Internet, ATM (bank) machines

Advantages of Distributed Systems


WHAT ARE THE ADVANTAGES OF DISTRIBUTED SYSTEM? (PART A)
 All the nodes in the distributed system are connected to each other. So nodes can easily share data with
other nodes.
 More nodes can easily be added to the distributed system i.e. it can be scaled as required.
 Failure of one node does not lead to the failure of the entire distributed system. Other nodes can still
communicate with each other.
 Resources like printers can be shared with multiple nodes rather than being restricted to just one.
Disadvantages of Distributed Systems
WHAT ARE THE DISADVANTAGES OF DISTRIBUTED SYSTEM? (PART A)
 It is difficult to provide adequate security in distributed systems because the nodes as well as the
connections need to be secured.
 Some messages and data can be lost in the network while moving from one node to another.
 The database connected to the distributed systems is quite complicated and difficult to handle as
compared to a single user system.
 Overloading may occur in the network if all the nodes of the distributed system try to send data at once.
DISTRIBUTED SYSTEM ARCHITECTURES
EXPLAIN BRIEFLY ABOUT THE DISTRIBUTED SYSTEM ARCHITECTURE. (PART B)
In distributed architecture, components are presented on different platforms and several components can
cooperate with one another over a communication network in order to achieve a specific objective or goal.
 In this architecture, information processing is not confined to a single machine rather it is distributed
over several independent computers.
 A distributed system can be demonstrated by the client-server architecture which forms the base for
multi-tier architectures.
 There are several technology frameworks to support distributed architectures, including .NET, J2EE,
CORBA, .NET Web services, AXIS Java Web services, and Globus Grid services.
 Middleware is an infrastructure that appropriately supports the development and execution of
distributed applications. It provides a buffer between the applications and the network.
 It sits in the middle of the system and manages or supports the different components of a distributed
system. Examples are transaction processing monitors, data converters, communication controllers, etc.
Middleware as an infrastructure for distributed system

The basis of a distributed architecture is its transparency, reliability, and availability.


Advantages:
WHAT ARE THE ADVANTAGES OF MIDDLEWARE DISTRIBUTED SYSTEM? (PART A)
 Resource sharing − Sharing of hardware and software resources.
 Openness − Flexibility of using hardware and software of different vendors.
 Concurrency − Concurrent processing to enhance performance.
 Scalability − Increased throughput by adding new resources.
 Fault tolerance − The ability to continue in operation after a fault has occurred.

Disadvantages
WHAT ARE THE DISADVANTAGES OF MIDDLEWARE DISTRIBUTED SYSTEM?(PART A)
 Complexity − They are more complex than centralized systems.
 Security − More susceptible to external attack.
 Manageability − More effort required for system management.
 Unpredictability − Unpredictable responses depending on the system organization and network load.
Client-Server Architecture
EXPLAIN ABOUT CLIENT SERVER ARCHITECTURE DISTRIBUTED SYSTEM.(PART B)
WHAT IS THE ROLE OF CLIENT AND SERVER IN DISTRIBUTED SYSTEM? (PART A)
The client-server architecture is the most common distributed system architecture which decomposes the
system into two major subsystems or logical processes −
 Client − This is the first process that issues a request to the second process i.e. the server.
 Server − This is the second process that receives the request, carries it out, and sends a reply to the
client.
In this architecture, the application is modelled as a set of services that are provided by servers and a set of clients that use these services. The servers need not know about clients, but the clients must know the identity of servers, and the mapping of processors to processes is not necessarily 1:1.
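To make the request/reply pattern concrete, here is a minimal sketch in Python of a client process and a server process talking over a TCP socket; the port number and the echo-style "service" are illustrative assumptions, not part of any particular DBMS.

import socket

HOST, PORT = "localhost", 9090      # illustrative address, not a standard

def server():
    # Server process: receives a request, carries it out, sends a reply.
    with socket.create_server((HOST, PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            request = conn.recv(1024).decode()
            reply = "result of: " + request   # stand-in for real processing
            conn.sendall(reply.encode())

def client(query):
    # Client process: issues a request and waits for the server's reply.
    with socket.create_connection((HOST, PORT)) as sock:
        sock.sendall(query.encode())
        return sock.recv(1024).decode()

Run server() in one process and client("some query") in another; note that the client must know the server's identity (HOST, PORT), while the server needs no prior knowledge of its clients.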

Client-server Architecture can be classified into two models based on the functionality of the client −
Thin-client model
WHAT IS THIN-CLIENT MODEL? (PART A)
In the thin-client model, all the application processing and data management is carried out by the server. The client is simply responsible for running the presentation software.
 Used when legacy systems are migrated to client server architectures in which legacy system acts as a
server in its own right with a graphical interface implemented on a client
 A major disadvantage is that it places a heavy processing load on both the server and the network.
Thick/Fat-client model
WHAT IS THICK/FAT-CLIENT MODEL?(PART A)
In the thick-client model, the server is only in charge of data management. The software on the client implements the application logic and the interactions with the system user.
 Most appropriate for new C/S systems where the capabilities of the client system are known in advance.
 More complex than the thin-client model, especially for management: new versions of the application have to be installed on all clients.

THICK VS. THIN – A QUICK COMPARISON

GIVE A QUICK COMPARISON OF THE THICK-CLIENT VS THIN-CLIENT MODELS. (PART A)
THICK CLIENT
– More expensive to deploy and more work for IT to deploy
– Data verified by client, not server (immediate validation)
– Robust technology provides better uptime
– Only needs intermittent communication with the server
– Requires more resources but fewer servers
– Can store local files and applications
– Reduced server demands
– Increased security issues
THIN CLIENT
– Easy to deploy as they require no extra or specialized software installation
– Needs to validate with the server after data capture
– If the server goes down, data collection is halted as the client needs constant communication with the server
– Cannot be interfaced with other equipment (in plants or factory settings, for example)
– Clients run only and exactly as specified by the server
– More downtime
– Portability in that all applications are on the server, so any workstation can access them
– Opportunity to use older, outdated PCs as clients
– Reduced security threat

Multi-Tier Architecture (n-tier Architecture)


Write a note on Multi-Tier or n-tier Architecture. (PART A)
Multi-tier architecture is a client–server architecture in which the functions such as presentation,
application processing, and data management are physically separated. By separating an application into tiers,
developers obtain the option of changing or adding a specific layer, instead of reworking the entire application.
It provides a model by which developers can create flexible and reusable applications.

The most general use of multi-tier architecture is the three-tier architecture. A three-tier architecture is typically composed of a presentation tier, an application tier, and a data storage tier, each of which may execute on a separate processor.
Presentation Tier:
WHAT IS THE PRESENTATION TIER? (PART A)
The presentation layer is the topmost level of the application, which users can access directly, such as a web page or an operating system GUI (graphical user interface). The primary function of this layer is to translate tasks and results into something the user can understand. It communicates with the other tiers so that it places results in the browser/client tier and all other tiers in the network.
Application Tier (Business Logic, Logic Tier, or Middle Tier):
WHAT IS THE APPLICATION TIER? (PART A)
The application tier coordinates the application, processes commands, makes logical decisions and evaluations, and performs calculations. It controls an application's functionality by performing detailed processing. It also moves and processes data between the two surrounding layers.
Data Tier
NARRATE THE ROLE OF THE DATA TIER. (PART A)
In this layer, information is stored and retrieved from the database or file system. The information is
then passed back for processing and then back to the user. It includes the data persistence mechanisms
(database servers, file shares, etc.) and provides API (Application Programming Interface) to the application
tier which provides methods of managing the stored data.
Advantages
WHAT ARE THE ADVANTAGES OF THREE-TIER ARCHITECTURE? (PART A)
 Better performance than a thin-client approach, and simpler to manage than a thick-client approach.
 Enhances reusability and scalability − as demands increase, extra servers can be added.
 Provides multi-threading support and also reduces network traffic.
 Provides maintainability and flexibility.
Disadvantages
WHAT ARE THE DISADVANTAGES OF THREE-TIER ARCHITECTURE? (PART A)
 Unsatisfactory Testability due to lack of testing tools.
 More critical server reliability and availability.
DISTRIBUTED DATABASE
EXPLAIN BRIEFLY ABOUT DISTRIBUTED DATABASE.(PART B)
DEFINE DISTRIBUTED DATABASE. (2MARKS)
A distributed database is a set of interconnected databases that is distributed over a computer network or the internet. [Or]
A distributed database system is a database spread across several sites connected together via a communication network. Each site is typically managed by a DBMS that is capable of running independently of the other sites.
Properties of Distributed Database system:
WHAT ARE THE PROPERTIES OF DISTRIBUTED DATABASE
SYSTEM?(PART A)
1) Distributed Data Independence:-The user should be able to
access the database without having the need to know the location
of the data.
2) Distributed Transaction Atomicity:-The concept of atomicity should be distributed for the
operation taking place at the distributed sites.
Types of Distributed Databases:
LIST OUT THE TYPES OF DISTRIBUTED DATABASE(PART A)
a) Homogeneous Distributed Database is where the data stored across multiple sites is managed by
same DBMS software at all the sites.
b) Heterogeneous Distributed Database is where multiple sites which may be autonomous are under
the control of different DBMS software.

Architecture of Distributed Database System:


EXPLAIN THE ARCHITECTURE OF DISTRIBUTED DATABASE SYSTEM? (PART B)
There are 3 architectures:-
Client-Server:
 A Client – Server system has one or more client processes and one or more
server processes, and a client process can send a query to any one server
process. Clients are responsible for user-interface issues, and servers
manage data and execute transactions.
 Thus, a client process could run on a personal computer and send queries
to a server running on a mainframe.
 Advantages:-
o Simple to implement because of the centralized server and separation of functionality.
o Expensive server machines are not underutilized by simple user interactions, which are now pushed onto inexpensive client machines.
o The users can have a familiar and friendly client-side user interface rather than an unfamiliar and unfriendly server interface.
Collaborating Server:
Write a note on Collaborating Server. (PART A)
 In the client-server architecture, a single query cannot be split and executed across multiple servers, because the client process would have to be quite complex and intelligent enough to break a query into sub-queries to be executed at different sites and then put their results together, making the client's capabilities overlap with the server's. This makes it hard to distinguish between the client and the server.
 In a Collaborating Server system, we can have a collection of database servers, each capable of running transactions against local data, which cooperatively execute transactions spanning multiple servers.
 When a server receives a query that requires access to data at other servers, it generates appropriate sub-queries to be executed by other servers and puts the results together to compute the answer to the original query.
Middleware:
Explain about Middleware.(PART A)
 A Middleware system is a special server, a layer of software that coordinates the execution of queries and transactions across one or more independent database servers.
 The Middleware architecture is designed to allow a single query to span multiple servers, without requiring all database servers to be capable of managing such multi-site execution strategies. It is especially attractive when trying to integrate several legacy systems, whose basic capabilities cannot be extended.
 We need just one database server that is capable of managing queries and transactions spanning multiple servers; the remaining servers only need to handle local queries and transactions.
DISTRIBUTED DATA STORAGE
Explain about Distributed Data Storage. (PART B)
There are two ways in which data can be stored on different sites. These are:
 REPLICATION
DEFINE REPLICATION. (PART A)
Data replication refers to the storage of data copies at multiple
sites served by a computer network. Fragment copies can be stored
at several sites to serve specific information requirements. Because
the existence of fragment copies can enhance data availability and
response time, data copies can help to reduce communication and
total query costs. Hence, in replication, systems maintain copies of
data.
 Suppose database A is divided into two fragments, A1 and A2. Within a replicated distributed database,
the scenario depicted in the following Figure is possible: fragment A1 is stored at sites S1 and S2, while
fragment A2 is stored at sites S2 and S3.
Three replication scenarios exist:
WHAT ARE THE SCENARIOS IN REPLICATION (PART A)
A database can be fully replicated, partially replicated, or unreplicated.
• A Fully Replicated Database stores multiple copies of each database fragment at multiple sites. In
this case, all database fragments are replicated. A fully replicated database can be impractical due to the
amount of overhead it imposes on the system.
• A Partially Replicated Database stores multiple copies of some database fragments at multiple sites.
Most DDBMSs are able to handle the partially replicated database well.
• An Unreplicated Database stores each database fragment at a single site. Therefore, there are no
duplicate database fragments.
Several factors influence the decision to use data replication:
WHAT ARE THE SEVERAL FACTORS OF DATA REPLICATIONS? (PART A)
Database size:
The amount of data replicated will have an impact on the storage requirements and also on the
data transmission costs. Replicating large amounts of data requires a window of time and higher
network bandwidth that could affect other applications.
Usage frequency:
The frequency of data usage determines how frequently the data needs to be updated. Frequently used data needs to be updated more often, for example, than large data sets that are used only every quarter.
Costs:
These include the costs of performance, software overhead, and management associated with synchronizing transactions and their components, weighed against the fault-tolerance benefits associated with replicated data.
 FRAGMENTATION
What is Fragmentation? (PART A)
In this approach, the relations are fragmented (i.e., they're divided into smaller parts) and each of the fragments is stored at the different sites where it is required. It must be ensured that the fragments are such that the original relation can be reconstructed from them (i.e., there isn't any loss of data). Fragmentation is advantageous because it doesn't create copies of data, so consistency is not a problem.
Fragmentation of relations can be done in two ways:
Horizontal fragmentation – Splitting by rows – The relation is fragmented into groups of tuples so that each tuple is assigned to at least one fragment.
Vertical fragmentation – Splitting by columns – The schema of the relation is divided into smaller schemas. Each fragment must contain a common candidate key so as to ensure a lossless join.
In certain cases, an approach that is a hybrid of fragmentation and replication is used.
Mixed (Hybrid) Fragmentation – We can intermix the two types of fragmentation, yielding a mixed fragmentation.
Example: To illustrate the fragmentation strategies, let's use the CUSTOMER table of the XYZ Company. The table contains the attributes CUS_NUM, CUS_NAME, CUS_ADDRESS, CUS_STATE, CUS_LIMIT, CUS_BAL, CUS_RATING, and CUS_DUE.
Horizontal Fragmentation: Suppose that XYZ Company’s corporate management requires
information about its customers in all three states, but company locations in each state (TN, FL, and
GA) require data regarding local customers only. Based on such requirements, you decide to
distribute the data by state. Therefore, you define the horizontal fragments to conform to the
structure shown in the following table.

Vertical Fragmentation
You may also divide the CUSTOMER relation into vertical fragments that are composed of a collection
of attributes. For example, suppose that the company is divided into two departments: the service department
and the collections department. Each department is located in a separate building, and each has an interest in
only a few of the CUSTOMER table’s attributes. In this case, the fragments are defined as shown in the
following table.
Mixed Fragmentation
The XYZ Company's structure requires that the CUSTOMER data be fragmented horizontally to
accommodate the various company locations; within the locations, the data must be fragmented vertically to
accommodate the two departments (service and collection). In short, the CUSTOMER table requires mixed
fragmentation. Mixed fragmentation requires a two-step procedure. First, horizontal fragmentation is introduced
for each site based on the location within a state (CUS_STATE). The horizontal fragmentation yields the
subsets of customer tuples (horizontal fragments) that are located at each site. Because the departments are
located in different buildings, vertical fragmentation is used within each horizontal fragment to divide the
attributes, thus meeting each department’s information needs at each sub site. Mixed fragmentation yields the
results displayed in the following Table.
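As a rough illustration of this two-step procedure, the Python sketch below first fragments a few sample CUSTOMER rows horizontally by CUS_STATE and then fragments one of those pieces vertically; the sample values and the exact attribute split between the service and collections departments are assumptions for illustration, since the original tables are not reproduced here.

# Sample CUSTOMER tuples (illustrative values, not from the original table).
CUSTOMER = [
    {"CUS_NUM": 10, "CUS_NAME": "Alice", "CUS_ADDRESS": "12 Oak St",
     "CUS_STATE": "TN", "CUS_LIMIT": 1000, "CUS_BAL": 400,
     "CUS_RATING": 3, "CUS_DUE": 100},
    {"CUS_NUM": 11, "CUS_NAME": "Bob", "CUS_ADDRESS": "9 Elm St",
     "CUS_STATE": "FL", "CUS_LIMIT": 2000, "CUS_BAL": 900,
     "CUS_RATING": 2, "CUS_DUE": 250},
]

# Step 1: horizontal fragmentation -- one fragment of whole rows per state.
def horizontal_fragments(rows):
    frags = {}
    for row in rows:
        frags.setdefault(row["CUS_STATE"], []).append(row)
    return frags

# Step 2: vertical fragmentation -- each fragment keeps the key CUS_NUM
# plus the attributes one department needs (an assumed split).
SERVICE_ATTRS = ["CUS_NUM", "CUS_NAME", "CUS_ADDRESS", "CUS_STATE"]
COLLECTION_ATTRS = ["CUS_NUM", "CUS_LIMIT", "CUS_BAL", "CUS_RATING", "CUS_DUE"]

def vertical_fragment(rows, attrs):
    return [{a: row[a] for a in attrs} for row in rows]

by_state = horizontal_fragments(CUSTOMER)          # {'TN': [...], 'FL': [...]}
tn_service = vertical_fragment(by_state["TN"], SERVICE_ATTRS)
tn_collection = vertical_fragment(by_state["TN"], COLLECTION_ATTRS)

Because both vertical fragments keep the candidate key CUS_NUM, the original TN rows can be reconstructed by a lossless join on that key.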

Advantages of Fragmentation
WHAT ARE THE ADVANTAGES OF FRAGMENTATION? (PART A)
 Horizontal: –
o Allows parallel processing on a relation.
o Allows a global table to be split so that tuples are located where they are most frequently
accessed.
 Vertical: –
o Allows for further decomposition than can be achieved with normalization
o Tuple-id attribute allows efficient joining of vertical fragments.
o Allows parallel processing on a relation.
o Allows tuples to be split so that each part of the tuple is stored where it is most frequently
accessed.

DISTRIBUTED TRANSACTIONS

EXPLAIN BRIEFLY ABOUT DISTRIBUTED TRANACTION. (PART B)


WHAT IS DISTRIBUTED TRANSACTION? (PART A)
A distributed transaction is a transaction that involves two or more network hosts.
Generally, hosts provide resources, and a transaction manager is responsible for creating and handling
the transaction. Like any other transaction, a distributed transaction should include all four ACID properties
(atomicity, consistency, isolation, durability). Given the nature of the work, atomicity is important to
ensure an all-or-nothing outcome for the operations bundle (unit of work).
The local transactions are those that access and update data in only one local database; the global
transactions are those that access and update data in several local databases. However, for global
transactions, this task is much more complicated, since several sites may be participating in execution. The
failure of one of these sites, or the failure of a communication link connecting these sites, may result in
erroneous computations.
System Structure
Each site has its own local transaction manager, whose function is to ensure the ACID properties of those
transactions that execute at that site. The various transaction managers cooperate to execute global transactions.
To understand how such a manager can be implemented, consider an abstract model of a transaction system, in
which each site contains two subsystems:
 The transaction manager manages the execution of those transactions (or subtransactions) that access
data stored in a local site. Note that each such transaction may be either a local transaction (that is, a
transaction that executes at only that site) or part of a global transaction (that is, a transaction that
executes at several sites).
 The transaction coordinator coordinates the execution of the various transactions (both local and
global) initiated at that site.
The overall system architecture appears in the diagram below.
The structure of a transaction manager is similar in many respects to the structure of a centralized system. Each
transaction manager is responsible for:
 Maintaining a log for recovery purposes.
 Participating in an appropriate concurrency-control
scheme to coordinate the concurrent execution of the
transactions executing at that site.
As we shall see, we need to modify both the recovery
and concurrency schemes to accommodate the
distribution of transactions.
 The transaction coordinator subsystem is not
needed in the centralized environment, since a
transaction accesses data at only a single site. A
transaction coordinator, as its name implies, is
responsible for coordinating the execution of all the transactions initiated at that site. For each such
transaction, the coordinator is responsible for:
 Starting the execution of the transaction.
 Breaking the transaction into a number of sub transactions and distributing these sub transactions to
the appropriate sites for execution.
 Coordinating the termination of the transaction, which may result in the transaction being committed
at all sites or aborted at all sites.
System Failure Modes
A distributed system may suffer from the same types of failure that a centralized system does (for example,
software errors, hardware errors, or disk crashes). There are, however, additional types of failure with which we
need to deal in a distributed environment. The basic failure types are:
 Failure of a site.
 Loss of messages.
 Failure of a communication link.
 Network partition.
The loss or corruption of messages is always a possibility in a distributed system. The system uses transmission-control protocols, such as TCP/IP, to handle such errors; information about such protocols may be found in standard networking texts. However, if two sites A and B are not directly connected, messages from one to the other must be routed through a sequence of communication links. If a communication link fails, messages that would have been transmitted across the link must be rerouted. In some cases, it is possible to find another route through the network, so that the messages are able to reach their destination. In other cases, a failure may result in there being no connection between some pairs of sites. A system is partitioned if it has been split into two (or more) subsystems, called partitions, that lack any connection between them. Note that, under this definition, a partition may consist of a single node.
COMMIT PROTOCOLS
WHAT IS COMMIT PROTOCOLS (PART A)
In a local database system, for committing a transaction, the transaction manager has to only convey the
decision to commit to the recovery manager. However, in a distributed system, the transaction manager should
convey the decision to commit to all the servers in the various sites where the transaction is being executed
and uniformly enforce the decision. When processing is complete at each site, it reaches the partially
committed transaction state and waits for all other transactions to reach their partially committed states. When
it receives the message that all the sites are ready to commit, it starts to commit. In a distributed system, either
all sites commit or none of them does.
The different distributed commit protocols are −
 One-phase commit
 Two-phase commit
 Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a controlling
site and a number of slave sites where the transaction is being executed. The steps in distributed commit are −
 After each slave has locally completed its transaction, it sends a “DONE” message to the controlling
site.
 The slaves wait for “Commit” or “Abort” message from the controlling site. This waiting time is
called window of vulnerability.
 When the controlling site receives “DONE” message from each slave, it makes a decision to commit or
abort. This is called the commit point. Then, it sends this message to all the slaves.
 On receiving this message, a slave either commits or aborts and then sends an acknowledgement
message to the controlling site.

Distributed Two-phase Commit


Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The steps performed
in the two phases are as follows −
Phase 1: Prepare Phase
 After each slave has locally completed its transaction, it sends a “DONE” message to the controlling
site. When the controlling site has received “DONE” message from all slaves, it sends a “Prepare”
message to the slaves.
 The slaves vote on whether they still want to commit or not. If a slave wants to commit, it sends a
“Ready” message.
 A slave that does not want to commit sends a “Not Ready” message. This may happen when the slave
has conflicting concurrent transactions or there is a timeout.
Phase 2: Commit/Abort Phase
 After the controlling site has received “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the controlling site.
o When the controlling site receives “Commit ACK” message from all the slaves, it considers the
transaction as committed.
 After the controlling site has received the first “Not Ready” message from any slave −
o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send a “Abort ACK” message to the controlling site.
o When the controlling site receives “Abort ACK” message from all the slaves, it considers the
transaction as aborted.
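A minimal sketch of the two-phase commit decision logic in Python, with the controlling site and the slaves modelled as plain objects and messages as method calls; real systems would add logging, timeouts, and recovery, which are omitted here.

# Toy model of 2PC: the coordinator polls every participant in the
# prepare phase and commits globally only if all of them vote "Ready".

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # False models a "Not Ready" vote
        self.state = "active"

    def prepare(self):
        # Phase 1: vote Ready / Not Ready.
        return "Ready" if self.can_commit else "Not Ready"

    def commit(self):
        self.state = "committed"       # slave applies the transaction

    def abort(self):
        self.state = "aborted"         # slave rolls the transaction back

def two_phase_commit(participants):
    # Phase 1: Prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2: Global Commit only if every vote was "Ready".
    if all(v == "Ready" for v in votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

sites = [Participant("S1"), Participant("S2", can_commit=False)]
print(two_phase_commit(sites))   # -> "aborted": one "Not Ready" aborts all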
Distributed Three-phase Commit
The steps in distributed three-phase commit are as follows −
Phase 1: Prepare Phase
The steps are same as in distributed two-phase commit.
Phase 2: Prepare to Commit Phase
 The controlling site issues an “Enter Prepared State” broadcast message.
 The slave sites vote “OK” in response.
Phase 3: Commit / Abort Phase
The steps are same as two-phase commit except that “Commit ACK”/”Abort ACK” message is not required.
Concurrency Control
Concurrency Control is the management procedure that is required for controlling concurrent execution of the
operations that take place on a database.
But before knowing about concurrency control, we should know about concurrent execution.
Concurrent Execution in DBMS
 In a multi-user system, multiple users can access and use the same database at one time, which is known
as the concurrent execution of the database. It means that the same database is executed simultaneously
on a multi-user system by different users.
 While working on the database transactions, there occurs the requirement of using the database by
multiple users for performing different operations, and in that case, concurrent execution of the database
is performed.
 The thing is that the simultaneous execution that is performed should be done in an interleaved manner,
and no operation should affect the other executing operations, thus maintaining the consistency of the
database. Thus, on making the concurrent execution of the transaction operations, there occur several
challenging problems that need to be solved.
Problems with Concurrent Execution
In a database transaction, the two main operations are READ and WRITE. These two operations need to be managed during concurrent execution of transactions, because if they are performed in an uncontrolled interleaved manner, the data may become inconsistent. The following problems occur with concurrent execution of the operations:
Problem 1: Lost Update Problems (W - W Conflict)
The problem occurs when two different database transactions perform the read/write operations on the same
database items in an interleaved manner (i.e., concurrent execution) that makes the values of the items
incorrect hence making the database inconsistent.
For example:
Consider the below diagram where two transactions TX and TY, are performed on the same account A
where the balance of account A is $300.

 At time t1, transaction TX reads the value of account A, i.e., $300 (only read).
 At time t2, transaction TX deducts $50 from account A, which becomes $250 (only deducted and not updated/written).
 Alternately, at time t3, transaction TY reads the value of account A, which will still be $300, because TX didn't update the value yet.
 At time t4, transaction TY adds $100 to account A, which becomes $400 (only added but not updated/written).
 At time t6, transaction TX writes the value of account A, which will be updated as $250 only, as TY didn't update the value yet.
 Similarly, at time t7, transaction TY writes the value of account A, so it will write the value computed at time t4, i.e., $400. It means the value written by TX is lost, i.e., $250 is lost.
Hence the data becomes incorrect, and the database is left inconsistent.
Dirty Read Problems (W-R Conflict)
The dirty read problem occurs when one transaction updates an item of the database, and somehow the transaction fails, and before the data gets rolled back, the updated database item is accessed by another transaction. This creates a Read-Write conflict between the two transactions.
For example:
Consider two transactions TX and TY in the below diagram performing read/write operations on account
A where the available balance in account A is $300:
 At time t1, transaction TX reads the value of account A, i.e., $300.
 At time t2, transaction TX adds $50 to account A, which becomes $350.
 At time t3, transaction TX writes the updated value in account A, i.e., $350.
 Then at time t4, transaction TY reads account A, which will be read as $350.
 Then at time t5, transaction TX rolls back due to a server problem, and the value changes back to $300 (as initially).
 But the value for account A remains $350 for transaction TY as committed, which is the dirty read and therefore known as the Dirty Read Problem.
Unrepeatable Read Problem (R-W Conflict)
Also known as the Inconsistent Retrievals Problem, it occurs when, within a transaction, two different values are read for the same database item.
For example:
Consider two transactions, TX and TY, performing the read/write operations on account A, having an
available balance = $300. The diagram is shown below:

 At time t1, transaction TX reads the value from account A, i.e., $300.
 At time t2, transaction TY reads the value from account A, i.e., $300.
 At time t3, transaction TY updates the value of account A by adding $100 to the available balance, and then it becomes $400.
 At time t4, transaction TY writes the updated value, i.e., $400.
 After that, at time t5, transaction TX reads the available value of account A, which will be read as $400.
 It means that within the same transaction TX, it reads two different values of account A, i.e., $300 initially, and after the update made by transaction TY, it reads $400. It is an unrepeatable read and is therefore known as the Unrepeatable Read problem.
Thus, in order to maintain consistency in the database and avoid such problems that take place in concurrent
execution, management is needed, and that is where the concept of Concurrency Control comes into role.
Concurrency Control
Concurrency Control is the working concept that is required for controlling and managing the concurrent
execution of database operations and thus avoiding the inconsistencies in the database. Thus, for maintaining
the concurrency of the database, we have the concurrency control protocols.
Concurrency Control Protocols
The concurrency control protocols ensure the atomicity, consistency, isolation, durability and serializability of
the concurrent execution of the database transactions. Therefore, these protocols are categorized as:
 Lock Based Concurrency Control Protocol
 Time Stamp Concurrency Control Protocol
 Validation Based Concurrency Control Protocol
Lock-Based Protocol
In this type of protocol, any transaction cannot read or write data until it acquires an appropriate lock on it.
There are two types of lock:
1. Shared lock:
 It is also known as a Read-only lock. Under a shared lock, the data item can only be read by the transaction.
 It can be shared between transactions, because while a transaction holds a shared lock it can't update the data item.
2. Exclusive lock:
 Under an exclusive lock, the data item can be both read and written by the transaction.
 This lock is exclusive: it prevents multiple transactions from modifying the same data simultaneously.
Validation Based Protocol
The validation-based protocol is also known as the optimistic concurrency control technique. In this protocol, a transaction is executed in the following three phases:
1. Read phase: In this phase, the transaction T is read and executed. It is used to read the value of various
data items and stores them in temporary local variables. It can perform all the write operations on
temporary variables without an update to the actual database.
2. Validation phase: In this phase, the temporary variable values are validated against the actual data to see if they violate serializability.
3. Write phase: If the transaction passes validation, the temporary results are written to the database or system; otherwise, the transaction is rolled back.
Here each phase has the following different timestamps:
Start(Ti): It contains the time when Ti started its execution.
Validation (Ti): It contains the time when Ti finishes its read phase and starts its validation phase.
Finish(Ti): It contains the time when Ti finishes its write phase.
 This protocol determines the timestamp used to serialize a transaction from its validation phase, as that is the phase which actually decides whether the transaction will commit or roll back.
 Hence TS(T) = Validation(T).
 Serializability is determined during the validation process; it can't be decided in advance.
 While executing transactions, this protocol ensures a greater degree of concurrency and fewer conflicts.
 Thus it results in transactions that have fewer rollbacks.
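As a rough Python sketch of the optimistic scheme: writes are buffered locally during the read phase, and at validation time the transaction is checked against the write sets of transactions that committed after it started (one common way to realize the validation test; the exact rule used here is an assumption, since the notes do not spell it out).

# Toy optimistic concurrency control: buffer writes, validate, then apply.
DB = {"A": 300, "B": 100}
committed_write_sets = []   # (commit_time, items_written) of finished txns
CLOCK = 0

class Transaction:
    def __init__(self):
        global CLOCK
        CLOCK += 1
        self.start = CLOCK
        self.read_set, self.local = set(), {}    # buffered updates

    def read(self, item):                        # Read phase
        self.read_set.add(item)
        return self.local.get(item, DB[item])

    def write(self, item, value):                # still the read phase:
        self.local[item] = value                 # writes go to local variables

    def commit(self):
        global CLOCK
        CLOCK += 1
        # Validation phase: fail if a txn that committed after we started
        # wrote anything we read (our reads may be stale).
        for t, wset in committed_write_sets:
            if t > self.start and wset & self.read_set:
                return False                     # rolled back
        DB.update(self.local)                    # Write phase
        committed_write_sets.append((CLOCK, set(self.local)))
        return True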
There are four types of lock protocols available:
1. Simplistic lock protocol
It is the simplest way of locking data during a transaction. Simplistic lock-based protocols require every transaction to obtain a lock on the data before performing an insert, delete or update on it, and the data item is unlocked after the transaction completes.
2. Pre-claiming Lock Protocol
 Pre-claiming lock protocols evaluate the transaction to list all the data items on which it needs locks.
 Before initiating execution of the transaction, it requests the DBMS for locks on all of those data items.
 If all the locks are granted, this protocol allows the transaction to begin; when the transaction completes, it releases all the locks.
 If all the locks are not granted, the transaction rolls back and waits until all the locks are granted.
3. Two-phase locking (2PL)
 The two-phase locking protocol divides the execution phase of the transaction into three parts.
 In the first part, when the execution of the transaction starts, it seeks permission for the locks it requires.
 In the second part, the transaction acquires all the locks. The third phase starts as soon as the transaction releases its first lock.
 In the third phase, the transaction cannot demand any new locks; it only releases the acquired locks.
There are two phases of 2PL:
Growing phase: In the growing phase, a new lock on the data item may be acquired by the transaction, but
none can be released.
Shrinking phase: In the shrinking phase, existing lock held by the transaction may be released, but no new
locks can be acquired.
In the example below, if lock conversion is allowed, then the following conversions can happen:
1. Upgrading of lock (from S(a) to X (a)) is allowed in growing phase.
2. Downgrading of lock (from X(a) to S(a)) must be done in shrinking phase.
Example:
The following way shows how unlocking and locking work with 2-PL.
Transaction T1:
 Growing phase: from step 1-3
 Shrinking phase: from step 5-7
 Lock point: at 3
Transaction T2:
 Growing phase: from step 2-6
 Shrinking phase: from step 8-9
 Lock point: at 6
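The sketch below enforces the two-phase rule from the example above in Python: a transaction object acquires locks only in its growing phase, and the first release permanently switches it to the shrinking phase, after which any new lock request is an error. Lock-manager details (conflicts between transactions, waiting) are deliberately left out.

# Minimal enforcement of the two-phase locking rule for one transaction.
class TwoPhaseTxn:
    def __init__(self):
        self.locks = set()
        self.shrinking = False      # False = growing phase

    def lock(self, item):
        if self.shrinking:
            # Third-phase rule: no new locks after the first release.
            raise RuntimeError("2PL violation: lock after first unlock")
        self.locks.add(item)

    def unlock(self, item):
        self.shrinking = True       # the lock point has passed
        self.locks.remove(item)

t1 = TwoPhaseTxn()
t1.lock("a"); t1.lock("b")          # growing phase
t1.unlock("a")                      # shrinking phase begins
# t1.lock("c")                      # would raise: not allowed under 2PL
t1.unlock("b")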
4. Strict Two-phase locking (Strict-2PL)
 The first phase of Strict-2PL is similar to 2PL. In the first phase, after acquiring all the locks, the transaction continues to execute normally.
 The only difference between 2PL and Strict-2PL is that Strict-2PL does not release a lock after using it.
 Strict-2PL waits until the whole transaction commits, and then it releases all the locks at once.
 Strict-2PL does not have a shrinking phase of lock release.
It does not have cascading aborts, as 2PL does.

Timestamp Ordering Protocol
 The Timestamp Ordering Protocol is used to order transactions based on their timestamps. The order of transactions is nothing but the ascending order of transaction creation.
 An older transaction has higher priority, which is why it executes first. To determine the timestamp of a transaction, this protocol uses system time or a logical counter.
 The lock-based protocol manages the order between conflicting pairs among transactions at execution time, but timestamp-based protocols start working as soon as a transaction is created.
 Let's assume there are two transactions, T1 and T2. Suppose transaction T1 entered the system at time 007 and transaction T2 at time 009. T1 has the higher priority, so it executes first, as it entered the system first.
 The timestamp ordering protocol also maintains the timestamps of the last 'read' and 'write' operations on a data item.
Basic Timestamp ordering protocol works as follows:
1. Check the following conditions whenever a transaction Ti issues a Read(X) operation:
 If W_TS(X) > TS(Ti), then the operation is rejected.
 If W_TS(X) <= TS(Ti), then the operation is executed.
 The timestamps of the data item are updated.
2. Check the following conditions whenever a transaction Ti issues a Write(X) operation:
 If TS(Ti) < R_TS(X), then the operation is rejected.
 If TS(Ti) < W_TS(X), then the operation is rejected and Ti is rolled back; otherwise, the operation is executed.
Where,
TS(Ti) denotes the timestamp of transaction Ti.
R_TS(X) denotes the read timestamp of data item X.
W_TS(X) denotes the write timestamp of data item X.
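A direct transcription of these two checks into Python, assuming that rejection means the transaction must be rolled back and restarted; R_TS and W_TS are kept in a dictionary per data item.

# Basic timestamp-ordering checks for Read(X) and Write(X).
ts = {"X": {"R_TS": 0, "W_TS": 0}}    # last read/write timestamps per item

def read_ok(item, ts_ti):
    meta = ts[item]
    if meta["W_TS"] > ts_ti:          # a younger txn already wrote X
        return False                  # reject: Ti must roll back
    meta["R_TS"] = max(meta["R_TS"], ts_ti)
    return True

def write_ok(item, ts_ti):
    meta = ts[item]
    if ts_ti < meta["R_TS"] or ts_ti < meta["W_TS"]:
        return False                  # reject: a younger txn read/wrote X
    meta["W_TS"] = ts_ti
    return True

print(read_ok("X", 5), write_ok("X", 5))   # True True
print(write_ok("X", 3))                    # False: TS 3 < W_TS 5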
Advantages and Disadvantages of TO protocol:
 The TO protocol ensures serializability, since the precedence graph has edges only from older to younger transactions and is therefore acyclic.
 The TO protocol ensures freedom from deadlock, since no transaction ever waits.
 But the schedule may not be recoverable and may not even be cascade-free.

DISTRIBUTED QUERY PROCESSING


•A given SQL query is translated by the query processor into a low-level program called an execution plan. An execution plan is a program in a functional language: the physical relational algebra, which extends the relational algebra with primitives to search through the internal storage structures of the DBMS.
Or
 Translating a high-level query (relational calculus) into a sequence of database operators (relational algebra + communication operators).
 One high-level query can have many equivalent transformations; the main difficulty is to select the most efficient one.

Example –
Input: All players called "Muller" who are playing for a team.
QUERY: SELECT p.Name FROM Players p, Teams t WHERE p.TID = t.TID AND p.Name LIKE "Muller"

Basic Steps in Query Processing


•Parsing and translation
•Optimization
•Evaluation

1. Parsing and translation


a. Translate the query into its internal form. This is then translated into relational algebra.
b. Parser checks syntax, verifies relation.
2. Optimization
a. SQL is a very high-level language: the users specify what to search for, not how the search is actually done; the algorithms are chosen automatically by the DBMS.
b. For a given SQL query there may be many possible execution plans.
c. Amongst all equivalent plans choose the one with lowest cost.
d. Cost is estimated using statistical information from the database catalog.
3. Evaluation
a. The query evaluation engine takes a query evaluation plan, executes that plan and returns the
answer to that query.

In a distributed system, we must take into account several other matters, including:
 The cost of data transmission over the network.
 The potential gain in performance from having several sites process parts of the query in parallel.

Distributed Query Processing Architecture


In a distributed database system, processing a query comprises optimization at both the global and the local level. The query enters the database system at the client or controlling site. Here, the user is validated, and the query is checked, translated, and optimized at a global level.
The architecture can be represented as −
Mapping Global Queries into Local Queries
The process of mapping global queries to local ones can be realized as follows −
 The tables required in a global query have fragments distributed across multiple sites. The local
databases have information only about local data. The controlling site uses the global data dictionary to
gather information about the distribution and reconstructs the global view from the fragments.
 If there is no replication, the global optimizer runs local queries at the sites where the fragments are
stored. If there is replication, the global optimizer selects the site based upon communication cost,
workload, and server speed.
 The global optimizer generates a distributed execution plan so that the least amount of data transfer occurs across the sites. The plan states the location of the fragments, the order in which query steps need to be executed, and the processes involved in transferring intermediate results.
 The local queries are optimized by the local database servers. Finally, the local query results are merged
together through union operation in case of horizontal fragments and join operation for vertical
fragments.
For example, let us consider that the following PROJECT schema is horizontally fragmented according to City, the cities being New Delhi, Kolkata and Hyderabad:
PROJECT (PId, City, Department, Status)
Suppose there is a query to retrieve details of all projects whose status is “Ongoing”.
The global query will be −
$$\sigma_{status = \text{"ongoing"}}(\text{PROJECT})$$
Query in New Delhi's server will be −
$$\sigma_{status = \text{"ongoing"}}(\text{NewD\_PROJECT})$$
Query in Kolkata's server will be −
$$\sigma_{status = \text{"ongoing"}}(\text{Kol\_PROJECT})$$
Query in Hyderabad's server will be −
$$\sigma_{status = \text{"ongoing"}}(\text{Hyd\_PROJECT})$$
In order to get the overall result, we need to union the results of the three queries as follows −
$$\sigma_{status = \text{"ongoing"}}(\text{NewD\_PROJECT}) \cup \sigma_{status = \text{"ongoing"}}(\text{Kol\_PROJECT}) \cup \sigma_{status = \text{"ongoing"}}(\text{Hyd\_PROJECT})$$
Distributed Query Optimization
Distributed query optimization requires evaluation of a large number of query trees, each of which produces the required results of the query. This is primarily due to the presence of large amounts of replicated and fragmented data. Hence, the target is to find an optimal solution instead of the best solution.
The main issues for distributed query optimization are −
 Optimal utilization of resources in the distributed system.
 Query trading.
 Reduction of solution space of the query.
Optimal Utilization of Resources in the Distributed System
A distributed system has a number of database servers in the various sites to perform the operations pertaining
to a query. Following are the approaches for optimal resource utilization −
Operation Shipping − In operation shipping, the operation is run at the site where the data is stored and not at
the client site. The results are then transferred to the client site. This is appropriate for operations where the
operands are available at the same site. Example: Select and Project operations.
Data Shipping − In data shipping, the data fragments are transferred to the database server, where the
operations are executed. This is used in operations where the operands are distributed at different sites. This is
also appropriate in systems where the communication costs are low, and local processors are much slower than
the client server.
Hybrid Shipping − This is a combination of data and operation shipping. Here, data fragments are transferred
to the high-speed processors, where the operation runs. The results are then sent to the client site.

Query Trading
In query trading algorithm for distributed database systems, the controlling/client site for a distributed query is
called the buyer and the sites where the local queries execute are called sellers. The buyer formulates a number
of alternatives for choosing sellers and for reconstructing the global results. The target of the buyer is to achieve
the optimal cost.
The algorithm starts with the buyer assigning sub-queries to the seller sites. The optimal plan is created from
local optimized query plans proposed by the sellers combined with the communication cost for reconstructing
the final result. Once the global optimal plan is formulated, the query is executed.
Reduction of Solution Space of the Query
Optimal solution generally involves reduction of solution space so that the cost of query and data transfer is
reduced. This can be achieved through a set of heuristic rules, just as heuristics in centralized systems.
Following are some of the rules −
 Perform selection and projection operations as early as possible. This reduces the data flow over
communication network.
 Simplify operations on horizontal fragments by eliminating selection conditions which are not relevant
to a particular site.
 In case of join and union operations comprising of fragments located in multiple sites, transfer
fragmented data to the site where most of the data is present and perform operation there.
 Use semi-join operations to qualify tuples that are to be joined, as in the sketch after this list. This reduces the amount of data transfer, which in turn reduces communication cost.
 Merge the common leaves and sub-trees in a distributed query tree.
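To make the semi-join rule concrete, here is a hypothetical Python sketch of joining R at site 1 with S at site 2: site 1 ships only the join-column values to site 2, site 2 returns only its matching tuples, and the join finishes at site 1, so tuples of S that can never match are never transferred. The relation names and data are assumptions for illustration.

# Semi-join: ship join keys, not whole relations (illustrative data).
R = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]          # at site 1
S = [{"id": 2, "y": "p"}, {"id": 3, "y": "q"}]          # at site 2

# Step 1 (site 1 -> site 2): project the join column only.
keys = {r["id"] for r in R}

# Step 2 (site 2): semi-join -- keep only the S tuples that match R.
s_matching = [s for s in S if s["id"] in keys]

# Step 3 (site 2 -> site 1): ship the reduced S, then join locally.
result = [{**r, **s} for r in R for s in s_matching if r["id"] == s["id"]]
print(result)   # [{'id': 2, 'x': 'b', 'y': 'p'}]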
Possible Questions:

PART A
1. What is a distributed system?
2. What are the characteristics of distributed system?
3. What are the advantages of distributed system?
4. What are the disadvantages of distributed system?
5. What are the advantages and disadvantages of middleware distributed system?
6. What is thin-client model?
7. What is thick/fat-client model?
8. Give a quick comparison of the thick-client vs thin-client model.
9. Define distributed database
10. List out the types of distributed database
11. Explain about middleware.
12. What are the several factors of data replications?
13. What is fragmentation?
14. What are the advantages of fragmentation?
15. Define concurrency control.
PART B
1. Explain briefly about distributed systems.
2. Explain briefly about the distributed system architecture.
3. What is the role of client and server in distributed system?
4. Explain the architecture of distributed database system?
5. Explain about distributed data storage.
6. Explain briefly about distributed transaction.
7. Explain briefly about distributed query processing
8. Explain about commit protocols
UNIT II
NOSQL DATABASES
NOSQL

WHAT IS NoSQL? OR DEFINE NoSQL (PART A)


NoSQL, which stands for "not only SQL," is an approach to database design that provides flexible schemas for the storage and retrieval of data beyond the traditional table structures found in relational databases (used, for example, by Google or Facebook, which collect terabytes of data every day for their users).
WHY NOSQL? (PART A)
A NoSQL database is a non-relational data management system that does not require a fixed schema. It avoids joins, and it is easy to scale. The major purpose of using a NoSQL database is for distributed data stores with humongous data storage needs. NoSQL is used for Big Data and real-time web apps.
The advantages of NoSQL include:
 High scalability
 Distributed Computing
 Lower cost
 Schema flexibility
 Un/semi-structured data
 No complex relationships
Emergence of NOSQL Systems (PART B)
Consider a free e-mail application, such as Google Mail or Yahoo Mail or another similar service. This application can have millions of users, and each user can have thousands of e-mail messages. There is a need for a storage system that can manage all these e-mails; a structured relational SQL system may not be appropriate because:
(1) SQL systems offer too many services (powerful query language, concurrency control, etc.), which this application may not need;
(2) a structured data model such as the traditional relational model may be too restrictive.
Some of the organizations that were faced with these data management and storage applications decided to
develop their own systems:
■ Google developed a proprietary NOSQL system known as BigTable, which is used in many of
Google’s applications that require vast amounts of data storage, such as Gmail, Google Maps, and Web
site indexing. Apache Hbase is an open source NOSQL system based on similar concepts. Google’s
innovation led to the category of NOSQL systems known as column-based or wide column stores; they
are also sometimes referred to as column family stores.
■ Amazon developed a NOSQL system called DynamoDB that is available through Amazon's cloud
services. This innovation led to the category known as key-value data stores or sometimes key-tuple or
key-object data stores.
■ Facebook developed a NOSQL system called Cassandra, which is now open source and known as
Apache Cassandra. This NOSQL system uses concepts from both key-value stores and column-based
systems.
■ Other software companies started developing their own solutions and making them available to users who need these capabilities, for example, MongoDB and CouchDB, which are classified as document-based NOSQL systems or document stores.
■ Another category of NOSQL systems is the graph-based NOSQL systems, or graph databases;
these include Neo4J and GraphBase, among others.

Categories of NOSQL Systems(PART B)

Types of NoSQL Databases


NoSQL databases are mainly categorized into four types: key-value pair, column-oriented, graph-based and document-oriented. Every category has its unique attributes and limitations. None of the above databases is better at solving all problems; users should select a database based on their product needs.
Types of NoSQL Databases:
 Key-value Pair Based
 Column-oriented
 Graph-based
 Document-oriented

Key Value Pair Based


Data is stored in key/value pairs. It is designed in such a way to handle lots of data and heavy load.
Key-value pair storage databases store data as a hash table where each key is unique, and the value can
be a JSON, BLOB(Binary Large Objects), string, etc.
For example, a key-value pair may contain a key like "Age"
associated with a value like "75".
It is one of the most basic NoSQL database examples. This kind of NoSQL database is used for collections, dictionaries, associative arrays, etc. Key-value stores help the developer to store schema-less data. They work best for shopping cart contents.
Redis, Dynamo and Riak are some NoSQL examples of key-value store databases. They are all based on Amazon's Dynamo paper.
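A minimal sketch of the key-value idea in Python, using a dictionary as the store and a JSON string as the opaque value, as in the shopping-cart use case mentioned above; real stores such as Redis add persistence, replication and expiry on top of essentially this same get/put interface.

import json

# A toy key-value store: unique keys, opaque values (here JSON strings).
store = {}

def put(key, value):
    store[key] = value                 # overwrite-on-write, like a hash table

def get(key, default=None):
    return store.get(key, default)

# Shopping-cart style usage: the whole cart is one value under one key.
put("cart:user42", json.dumps({"items": [{"sku": "A1", "qty": 2}]}))
cart = json.loads(get("cart:user42"))
print(cart["items"][0]["qty"])         # -> 2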
Column-based
Column-oriented databases work on columns and are based on the BigTable paper by Google. Every column is treated separately, and the values of a single column are stored contiguously.
They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc., as the data is readily available in a column. Column-based NoSQL databases are widely used to manage data warehouses, business intelligence, CRM and library card catalogs.
HBase, Cassandra and Hypertable are NoSQL examples of column-based databases.
Document-Oriented:
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the value part is stored as a
document. The document is stored in JSON or XML formats. The value is understood by the DB and can be
queried.

Relational Vs. Document


In this diagram, on the left you can see rows and columns, and on the right a document database, which has a structure similar to JSON. For the relational database, you have to know what columns you have, and so on. However, in a document database you store data as a JSON-like object; you do not need to define the schema in advance, which makes it flexible.
The document type is mostly used for CMS systems, blogging platforms, real-time analytics and e-commerce applications. It should not be used for complex transactions that require multiple operations, or for queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular document-oriented DBMS systems.
Graph-Based
A graph-type database stores entities as well as the relations among those entities. An entity is stored as a node, with the relationships as edges. An edge gives a relationship between nodes. Every node and edge has a unique identifier.
Compared to a relational database, where tables are loosely connected, a graph database is multi-relational in nature. Traversing relationships is fast, as they are already captured in the DB and there is no need to calculate them.
Graph-based databases are mostly used for social networks, logistics and spatial data.
Neo4J, Infinite Graph, OrientDB and FlockDB are some popular graph-based databases.

What is the CAP Theorem? (PART A)


The CAP theorem is also called Brewer's theorem. It states that it is impossible for a distributed data store to offer more than two out of the following three guarantees:
1. Consistency
2. Availability
3. Partition Tolerance
Consistency:
The data should remain consistent even after the execution of an operation. This means once data is written, any
future read request should contain that data. For example, after updating the order status, all the clients should
be able to see the same data.
Availability:
The database should always be available and responsive. It
should not have any downtime.
Partition Tolerance:
Partition Tolerance means that the system should continue to
function even if the communication among the servers is not
stable. For example, the servers can be partitioned into
multiple groups which may not communicate with each other.
Here, if part of the database is unavailable, other parts are
always unaffected.
Eventual Consistency
The term "eventual consistency" means to have copies of data
on multiple machines to get high availability and scalability.
Thus, changes made to any data item on one machine have to be propagated to the other replicas.
Data replication may not be instantaneous: some copies will be updated immediately, while others will be updated in due course of time. These copies may be mutually inconsistent for a while, but in due course of time they become consistent. Hence the name eventual consistency.
BASE: Basically Available, Soft state, Eventual consistency
 Basically available means the DB is available all the time, as per the CAP theorem
 Soft state means that even without input, the system state may change
 Eventual consistency means that the system will become consistent over time

Document-Based NOSQL Systems and MongoDB


Document-based or document-oriented NOSQL systems typically store data as collections of similar documents. These types of systems are also sometimes known as document stores. The individual documents somewhat resemble complex objects, but a major difference between document-based systems and object, object-relational, and XML systems is that there is no requirement to specify a schema—rather, the documents are specified as self-describing data.
A popular language to specify documents in NOSQL systems is JSON (JavaScript Object Notation).
There are many document-based NOSQL systems, including MongoDB and CouchDB, among many others.

MongoDB Data Model


MongoDB documents are stored in BSON (Binary JSON) format, which is a variation of JSON with some additional data types and is more efficient for storage than JSON. Individual documents are stored in a collection.
MongoDB CRUD Operations (PART B)
CRUD operations create, read, update, and delete documents.

MongoDB: Creating a document in a collection (Create)

The command db.collection.insert() will perform an insert operation on a collection, adding the given document.
Let us insert a document into a student collection. You must be connected to a database before doing any insert. It is done as follows:
db.student.insert({
    regNo: "2KVYSAMCA01",
    name: "NATARAJAN S",
    course: {
        courseName: "MCA",
        duration: "2 Years"
    },
    address: {
        city: "SALEM",
        state: "TN",
        country: "India"
    }
})
Note that an entry has been made into the collection called student.
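For comparison, here is a minimal sketch of the same insert issued from Java through the MongoDB Java driver (3.x API; the database name test and the connection details are assumptions):

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class InsertStudent {
    public static void main(String[] args) {
        try (MongoClient mongo = new MongoClient("localhost", 27017)) {
            MongoCollection<Document> student =
                    mongo.getDatabase("test").getCollection("student");
            // Build the same nested document as the shell example above
            Document doc = new Document("regNo", "2KVYSAMCA01")
                    .append("name", "NATARAJAN S")
                    .append("course", new Document("courseName", "MCA")
                            .append("duration", "2 Years"))
                    .append("address", new Document("city", "SALEM")
                            .append("state", "TN")
                            .append("country", "India"));
            student.insertOne(doc);
        }
    }
}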

MongoDB: Querying a document from a collection (Read)

To retrieve (select) the inserted document, run db.student.find(). The find() command retrieves all the documents of the given collection.

NOTE: Please observe that the record retrieved contains an attribute called _id with a unique identifier value called ObjectId, which acts as the document identifier.
If a record is to be retrieved based on some criteria, the find() method should be called with parameters; the records will then be retrieved based on the attributes specified.
db.collection_name.find({"fieldname":"value"})
For example, let us retrieve the record from the student collection where the attribute regNo is "2KVYSAMCA01". The query for the same is shown below:
db.student.find({"regNo":"2KVYSAMCA01"})

MongoDB: Updating a document in a collection (Update)

In order to update specific field values of a document in a collection, run the below query.
db.collection_name.update()
The update() method specified above takes the selection criteria and the new value as arguments to update a document.
Let us update the attribute name of the student collection for the document with regNo "2KVYSAMCA01".
db.student.update({"regNo":"2KVYSAMCA01"},
    {$set:
        {"name":"NATZ"}
    })
The name field of the matching document is now updated.

MongoDB: Removing an entry from the collection (Delete)

Let us now look into deleting an entry from a collection. In order to delete an entry from a collection, run the command as shown below:
db.collection_name.remove({"fieldname":"value"})
For Example: db.student.remove({"regNo":"2KVYSAMCA01"})

Note that after running the remove() method, the entry has been deleted from the student collection.
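The read, update, and delete operations shown above can likewise be issued from Java; the sketch below (same assumed connection details as before) mirrors the three shell commands using the driver's Filters and Updates helpers:

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class StudentCrud {
    public static void main(String[] args) {
        try (MongoClient mongo = new MongoClient("localhost", 27017)) {
            MongoCollection<Document> student =
                    mongo.getDatabase("test").getCollection("student");

            // Read: find the document with the given regNo
            Document found = student.find(Filters.eq("regNo", "2KVYSAMCA01")).first();
            System.out.println(found);

            // Update: set a new value for the name field, as in the $set example
            student.updateOne(Filters.eq("regNo", "2KVYSAMCA01"),
                    Updates.set("name", "NATZ"));

            // Delete: remove the document again
            student.deleteOne(Filters.eq("regNo", "2KVYSAMCA01"));
        }
    }
}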

MongoDB: Indexing an entry from the collection (Index)

An index in MongoDB is a special data structure that holds the data of a few fields of the documents on which the index is created. Indexes improve the speed of search operations in the database because, instead of searching the whole document, the search is performed on the index, which holds only a few fields.
db.collection_name.createIndex({field_name: 1 or -1})
For example: db.student.createIndex({"title":1})

MongoDB – Finding the indexes in a collection

We can use the getIndexes() method to find all the indexes created on a collection. The syntax for this method is:
db.collection_name.getIndexes()
So, to get the indexes of the student collection, the command would be:
>db.student.getIndexes()
[
    {
        "v": 2,
        "key": { "_id": 1 },
        "name": "_id_",
        "ns": "test.student"
    },
    {
        "v": 2,
        "key": { "student_name": 1 },
        "name": "student_name_1",
        "ns": "test.student"
    }
]

MongoDB – Drop indexes in a collection

You can either drop a particular index or all the indexes.


Dropping a specific index:
For this purpose, the dropIndex() method is used.
db.collection_name.dropIndex({index_name: 1})
Let's drop the index that we created on the student_name field in the student collection. The command for this:
db.student.dropIndex({student_name: 1})

Dropping all the indexes:


To drop all the indexes of a collection, we use the dropIndexes() method.
Syntax of the dropIndexes() method:
db.collection_name.dropIndexes()
Let's say we want to drop all the indexes of the student collection.
db.student.dropIndexes()
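The same index operations are available from the Java driver; a hedged sketch follows (collection and field names match the examples above, connection details assumed):

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class StudentIndexes {
    public static void main(String[] args) {
        try (MongoClient mongo = new MongoClient("localhost", 27017)) {
            MongoCollection<Document> student =
                    mongo.getDatabase("test").getCollection("student");

            // Create an ascending index on student_name (same as {student_name: 1})
            student.createIndex(Indexes.ascending("student_name"));

            // List all indexes on the collection, like getIndexes()
            for (Document idx : student.listIndexes()) {
                System.out.println(idx.toJson());
            }

            // Drop the index by its generated name, like dropIndex()
            student.dropIndex("student_name_1");
        }
    }
}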
What is MongoDB Sharding? (PART B)
MongoDB achieves scaling through a technique known as “sharding”: the process of distributing data across different servers to spread the read and write load and the data storage requirements.
Sharding is the process of storing data records across multiple machines and it is MongoDB’s approach to
meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to
store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal
scaling. With sharding, you add more machines to support data growth and the demands of read and write
operations.

What is MongoDB Replication?


Unlike relational database servers, scaling NoSQL databases to meet increased demand on your
application is fairly simple — you drop in a new server, make a couple of config changes, and it connects to your
existing servers, enlarging the cluster. All existing databases and collections are automatically replicated and
synced with the other member nodes. A replication cluster works well when the entire data volume of your
database(s) is able to fit onto a single server. Each server in your replication cluster will host a full copy of your
databases.
Replica Sets are a great way to replicate MongoDB data across multiple servers and have the database
automatically failover in case of server failure. Read workloads can be scaled by having clients directly connect
to secondary instances. Note that master/slave MongoDB replication is not the same thing as a Replica Set, and
does not have automatic failover.

Using MongoDB with PHP / JAVA (PART B)

Before you start using MongoDB in your Java programs, you need to make sure that you have the MongoDB client and Java set up on the machine. You can check the Java tutorial for Java installation on your machine. Now, let us check how to set up the MongoDB client.
 You need to download the jar mongodb-driver-3.11.2.jar and its dependency mongodb-driver-core-3.11.2.jar. Make sure to download the latest release of these jar files.
 You need to include the downloaded jar files into your classpath.
Create a Collection
To create a collection, createCollection() method of com.mongodb.client.MongoDatabase class is used.
Following is the code snippet to create a collection −
import com.mongodb.client.MongoDatabase;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;

public class CreatingCollection {

    public static void main(String args[]) {

        // Creating a Mongo client
        MongoClient mongo = new MongoClient("localhost", 27017);

        // Creating Credentials
        MongoCredential credential;
        credential = MongoCredential.createCredential("sampleUser", "myDb",
                "password".toCharArray());
        System.out.println("Connected to the database successfully");

        // Accessing the database
        MongoDatabase database = mongo.getDatabase("myDb");

        // Creating a collection
        database.createCollection("sampleCollection");
        System.out.println("Collection created successfully");
    }
}
On compiling, the above program gives you the following result −
Connected to the database successfully
Collection created successfully

Features of MongoDB (PART A)

These are some important features of MongoDB:


1. Support ad hoc queries
2. Indexing
3. Replication
4. Duplication of data
5. Load balancing
6. Supports map reduce and aggregation tools.
7. Uses JavaScript instead of Procedures.
8. It is a schema-less database written in C++.
9. Provides high performance.
10. Stores files of any size easily without complicating your stack.
11. Easy to administer in the case of failures.
12. It also supports:
JSON data model with dynamic schemas
Auto-sharding for horizontal scalability
Built in replication for high availability

Cassandra (PART A)
Cassandra is an open-source, distributed, wide-column-store NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is written in Java and developed by the Apache Software Foundation.

Avinash Lakshman and Prashant Malik initially developed Cassandra at Facebook to power the Facebook inbox search feature. Facebook released Cassandra as an open-source project on Google Code in July 2008. In March 2009 it became an Apache Incubator project, and in February 2010 it became a top-level project. Due to its outstanding technical features, Cassandra has become very popular.

Cassandra - Data Model (PART B)


Cluster
Cassandra database is distributed over several machines that operate together. The outermost container is
known as the Cluster. For failure handling, every node contains a replica, and in case of a failure, the replica
takes charge. Cassandra arranges the nodes in a cluster, in a ring format, and assigns data to them.
Keyspace
Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace in Cassandra are

 Replication factor − It is the number of machines in the cluster that will receive copies of the same
data.
 Replica placement strategy − It is nothing but the strategy to place replicas in the ring. We have strategies such as simple strategy (rack-unaware strategy), old network topology strategy (rack-aware strategy), and network topology strategy (datacenter-shared strategy).
 Column families − Keyspace is a container for a list of one or more column families. A column family,
in turn, is a container of a collection of rows. Each row contains ordered columns. Column families
represent the structure of your data. Each keyspace has at least one and often many column families.
The syntax of creating a Keyspace is as follows −
CREATE KEYSPACE Keyspace name
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
Column Family
A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of
columns.
A Cassandra column family has the following attributes −
 keys_cached − It represents the number of locations to keep cached per SSTable.
 rows_cached − It represents the number of rows whose entire contents will be cached in memory.
 preload_row_cache − It specifies whether you want to pre-populate the row cache.
Note − Unlike relational tables, where the schema is fixed, a column family's schema is not fixed: Cassandra does not force individual rows to have all the columns.

Create Keyspace (PART A)


SYNTAX:
CREATE KEYSPACE KeyspaceName WITH replication = {'class': 'strategy name', 'replication_factor': number of replicas on different nodes};
Two types of strategy:
Simple Strategy: one data center.
Network Topology Strategy: more than one data center.
EXAMPLE:
CREATE KEYSPACE Vysya WITH replication = {'class':'SimpleStrategy', 'replication_factor': 3};

Verification:

SYNTAX:
DESCRIBE KEYSPACES;

Using a Keyspace:

Syntax: USE <identifier>
EXAMPLE:
USE Vysya;

Cassandra Alter Keyspace:

The ALTER KEYSPACE command is used to alter the replication factor.


Syntax:
ALTER KEYSPACE "KeySpace Name" WITH replication = {'class': 'Strategy name', 'replication_factor': 'No. of replicas'};
Main points while altering a Keyspace in Cassandra:
 Keyspace Name: The keyspace name cannot be altered in Cassandra.
 Strategy Name: The strategy name can be altered by specifying a new strategy name.
 Replication Factor: The replication factor can be altered by specifying a new replication factor.
 DURABLE_WRITES: The DURABLE_WRITES value can be altered by specifying true/false. By default, it is true. If set to false, no updates will be written to the commit log, and vice versa.
Example:
ALTER KEYSPACE Vysya WITH replication = {'class':'NetworkTopologyStrategy', 'replication_factor': 1};
Cassandra Drop Keyspace:
The DROP KEYSPACE command is used to drop a keyspace with all its data, column families, user-defined types and indexes from Cassandra.

Syntax:
DROP KEYSPACE KeyspaceName;
Cassandra Create Table:
The CREATE TABLE command is used to create a table. Here, a column family is used to store data, just like a table in RDBMS.

Syntax:
CREATE TABLE tablename(column1_name datatype PRIMARY KEY, column2_name datatype, column3_name datatype)
There are two types of primary keys:
 Single primary key: Use the following syntax for a single primary key.
Primary key (ColumnName)
 Compound primary key: Use the following syntax for a compound primary key.
Primary key (ColumnName1, ColumnName2, . . .)
Example:
Let's take an example to demonstrate the CREATE TABLE command.
Here, we are using already created Keyspace "Vysya".
CREATE TABLE MCA(Reg_No int PRIMARY KEY, Name text, Address text,Pincode varint,
phone varint);
The table is created now. You can check it by using the following command.
Example: SELECT * FROM MCA;
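The same CQL statements can also be run programmatically; the sketch below uses the DataStax Java driver (3.x API assumed; the contact point and the sample row are illustrative assumptions):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraDemo {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Run the keyspace and table DDL from the examples above
            session.execute("CREATE KEYSPACE IF NOT EXISTS Vysya WITH replication = "
                    + "{'class':'SimpleStrategy', 'replication_factor': 3};");
            session.execute("CREATE TABLE IF NOT EXISTS Vysya.MCA(Reg_No int PRIMARY KEY, "
                    + "Name text, Address text, Pincode varint, phone varint);");

            // Insert a sample row and read it back
            session.execute("INSERT INTO Vysya.MCA(Reg_No, Name, Address, Pincode, phone) "
                    + "VALUES (1, 'NATARAJAN S', 'SALEM', 636001, 9876543210);");
            ResultSet rs = session.execute("SELECT * FROM Vysya.MCA;");
            for (Row row : rs) {
                System.out.println(row.getInt("Reg_No") + " " + row.getString("Name"));
            }
        }
    }
}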
Cassandra Alter Table:
The ALTER TABLE command is used to alter a table after it has been created. You can use the ALTER command to perform two types of operations:
 Add a column
 Drop a column
Syntax:
ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>

Adding a Column

While adding a column, you have to be aware that the column name does not conflict with existing column names and that the table is not defined with the compact storage option.
Syntax:
ALTER TABLE table_name ADD new_column datatype;
Example:
ALTER TABLE MCA ADD email text;

Dropping a Column
Drop an existing column from a table by using ALTER command.

Syntax: ALTER TABLE table_name DROP column_name;


Example:
ALTER TABLE MCA DROP email;
Cassandra DROP table:
DROP TABLE command is used to drop a table.

Syntax: DROP TABLE <tablename>

Example:
DROP TABLE MCA;
You can use the DESCRIBE command to verify whether the table is deleted or not. Here the MCA table has been deleted; you will not find it in the column families list.
DESCRIBE COLUMNFAMILIES;
Cassandra Truncate Table
The TRUNCATE command is used to truncate a table. If you truncate a table, all the rows of the table are deleted permanently.
Syntax:
TRUNCATE <tablename>
Example:
TRUNCATE MCA;

You can verify it by using the SELECT command.

SELECT * FROM MCA;

What is HIVE? (PART A)


Hive is a data warehouse system which is used to analyze structured data. It is built on the top of
Hadoop. It was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which are internally converted into MapReduce jobs.

Features of Hive (PART A)

These are the following features of Hive:

o Hive is fast and scalable.


o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs) where user can provide its functionality.

Limitations of Hive

o Hive is not capable of handling real-time data.


o It is not designed for online transaction processing.
o Hive queries have high latency.

Hive Architecture
A query submitted to Hive flows through the following components.
Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It supports different
types of clients such as:-
o Thrift Server - It is a cross-language service provider platform that serves the request from all those
programming languages that supports Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver (a sketch of its use follows this list).
o ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.
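As a hedged illustration of the JDBC route, the sketch below connects to HiveServer2 and runs an HQL statement (the hive2 URL, port, empty credentials, and the driver class org.apache.hive.jdbc.HiveDriver are assumptions for a typical HiveServer2 setup):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                        "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // Hive translates the HQL into MapReduce/Spark jobs behind the scenes
             ResultSet rs = stmt.executeQuery("show databases")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}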

HIVE Data Types


Hive data types are categorized in numeric types, string types, misc types, and complex types. A list of Hive
data types is given below.

Integer Types

Type Size Range

TINYINT 1-byte signed integer -128 to 127

SMALLINT 2-byte signed integer -32,768 to 32,767

INT 4-byte signed integer -2,147,483,648 to 2,147,483,647

BIGINT 8-byte signed integer -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Decimal Type

Type Size Range

FLOAT 4-byte Single precision floating point number

DOUBLE 8-byte Double precision floating point


number
Date/Time Types

TIMESTAMP
o It supports traditional UNIX timestamp with optional nanosecond precision.
o As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
o As Floating point numeric type, it is interpreted as UNIX timestamp in seconds with decimal precision.
o As string, it follows java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal
place precision)
DATES
The Date value is used to specify a particular year, month and day, in the form YYYY-MM-DD. However, it does not provide the time of day. The range of the Date type lies between 0000-01-01 and 9999-12-31.

String Types

STRING
A string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
Varchar
The varchar is a variable-length type whose length lies between 1 and 65535, specifying the maximum number of characters allowed in the character string.
CHAR
The char is a fixed-length type whose maximum length is fixed at 255.

Complex Type

Type Description Example

Struct It is similar to a C struct or an object where fields are accessed using the "dot" notation. struct('James','Roy')

Map It contains key-value tuples where the fields are accessed using array notation. map('first','James','last','Roy')

Array It is a collection of values of a similar type that are indexable using zero-based integers. array('James','Roy')

Hive - Create Database

In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain multiple tables
within a database where a unique name is assigned to each table. Hive also provides a default database with a
name default.

o Initially, we check the default database provided by Hive. So, to check the list of existing databases,
follow the below command: -

hive> show databases;


Here, we can see the existence of a default database provided by Hive.

o Let's create a new database by using the following command: -

hive> create database demo;

So, a new database is created.

o Let's check the existence of a newly created database.

hive> show databases;

o Each database must contain a unique name. If we create two databases with the same name, an error stating that the database already exists is generated.

o If we want to suppress the error generated by Hive on creating a database with the same name, follow the below command: -

hive> create database if not exists demo;

Hive - Drop Database


In this section, we will see various ways to drop the existing database.

o Let's check the list of existing databases by using the following command: -

hive> show databases;

o Now, drop the database by using the following command.

hive> drop database demo;

o Let's check whether the database is dropped or not.

hive> show databases;

As we can see, the database demo is not present in the list. Hence, the database is dropped successfully.

o If we try to drop a database that doesn't exist, an error stating that the database is not found is generated.

o However, if we want to suppress the error generated by Hive on dropping a database that doesn't exist, follow the below command: -

hive> drop database if exists demo;


o In Hive, it is not allowed to drop a database that contains tables directly. In such a case, we can drop the database either by dropping its tables first or by using the cascade keyword with the command.
o Let's see the cascade command used to drop the database: -

hive> drop database if exists demo cascade;


Hive - Create Table

Internal Table

The internal tables are also called managed tables, as the lifecycle of their data is controlled by Hive. By default, these tables are stored in a subdirectory under the directory defined by hive.metastore.warehouse.dir (i.e., /user/hive/warehouse). The internal tables are not flexible enough to share with other tools like Pig. If we try to drop an internal table, Hive deletes both the table schema and the data.

o Let's create an internal table by using the following command:-

hive> create table demo.employee (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';

Here, the command also includes the information that the data is separated by ','.

o Let's see the metadata of the created table by using the following command:-

hive> describe demo.employee;


o Let's see the result when we try to create the existing table again.

In such a case, an exception occurs. If we want to ignore this type of exception, we can use the if not exists clause while creating the table.

hive> create table if not exists demo.employee (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';

o While creating a table, we can add the comments to the columns and can also define the table
properties.

hive> create table demo.new_employee (Id int comment 'Employee Id', Name string comment 'Employee Name', Salary float comment 'Employee Salary')
comment 'Table Description'
TBLProperties ('creator'='Gaurav Chawla', 'created_at' = '2019-06-06 11:00:00');
o Let's see the metadata of the created table by using the following command: -

hive> describe new_employee;

o Hive allows creating a new table by using the schema of an existing table.

hive> create table if not exists demo.copy_employee
like demo.employee;

Here, we can say that the new table is a copy of an existing table.
Hive - Drop Table

Hive facilitates us to drop a table by using the SQL drop table command. Let's follow the below steps to drop
the table from the database.

o Let's check the list of existing databases by using the following command: -
hive> show databases;

o Now select the database from which we want to delete the table by using the following command: -

hive> use demo;

o Let's check the list of existing tables in the corresponding database.

hive> show tables;

o Now, drop the table by using the following command: -

hive> drop table new_employee;

o Let's check whether the table is dropped or not.

hive> show tables;


As we can see, the table new_employee is not present in the list. Hence, the table is dropped successfully.

Hive - Alter Table

In Hive, we can perform modifications in the existing table like changing the table name, column name,
comments, and table properties. It provides SQL like commands to alter the table.

Rename a Table

If we want to change the name of an existing table, we can rename that table by using the following signature: -

Alter table old_table_name rename to new_table_name;


o Let's see the existing tables present in the current database.

o Now, change the name of the table by using the following command: -

Alter table emp rename to employee_data;

o Let's check whether the name has changed or not.


Here, we got the desired output.

Adding column

In Hive, we can add one or more columns in an existing table by using the following signature: -

Alter table table_name add columns(column_name datatype);


o Let's see the schema of the table.

o Let's see the data of columns exists in the table.

o Now, add a new column to the table by using the following command: -

Alter table employee_data add columns (age int);

o Let's see the updated schema of the table.


o Let's see the updated data of the table.

As we didn't add any data to the new column, Hive considers NULL as the value.

Change Column

In Hive, we can rename a column, change its type and position. Here, we are changing the name of the column
by using the following signature: -

Alter table table_name change old_column_name new_column_name datatype;


o Let's see the existing schema of the table.

o Now, change the name of the column by using the following command: -

Alter table employee_data change name first_name string;


o Let's check whether the column name has changed or not.

Delete or Replace Column

Hive allows us to delete one or more columns by replacing them with the new columns. Thus, we cannot drop
the column directly.

o Let's see the existing schema of the table.

o Now, drop a column from the table.

alter table employee_data replace columns( id string, first_name string, age int);
o Let's check whether the column has dropped or not.

Partitioning in Hive

The partitioning in Hive means dividing the table into some parts based on the values of a particular column
like date, course, city or country. The advantage of partitioning is that since the data is stored in slices, the
query response time becomes faster.

As we know that Hadoop is used to handle the huge amount of data, it is always required to use the best
approach to deal with it. The partitioning in Hive is the best example of it.

Let's assume we have a data of 10 million students studying in an institute. Now, we have to fetch the students
of a particular course. If we use a traditional approach, we have to go through the entire data. This leads to
performance degradation. In such a case, we can adopt the better approach i.e., partitioning in Hive and divide
the data among the different datasets based on particular columns.

The partitioning in Hive can be executed in two ways -

o Static partitioning
o Dynamic partitioning

Static Partitioning

In static or manual partitioning, it is required to pass the values of partitioned columns manually while loading
the data into the table. Hence, the data file doesn't contain the partitioned columns.

Example of Static Partitioning

o First, select the database in which we want to create a table.

hive> use test;


o Create the table and provide the partitioned columns by using the following command: -

hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

o Let's retrieve the information associated with the table.
hive> describe student;

o Load the data into the table and pass the values of partition columns with it by using the following
command: -

hive> load data local inpath '/home/codegyani/hive/student_details1' into table student
partition(course = "java");

Here, we are partitioning the students of an institute based on courses.

o Load the data of another file into the same table and pass the values of partition columns with it by
using the following command: -

hive> load data local inpath '/home/codegyani/hive/student_details2' into table student
partition(course = "hadoop");
The table student is now divided into two partitions: java and hadoop.

o Let's retrieve the entire data of the table by using the following command: -

hive> select * from student;

o Now, try to retrieve the data based on partitioned columns by using the following command: -

hive> select * from student where course = "java";


In this case, we are not examining the entire data. Hence, this approach improves query response time.

o Let's also retrieve the data of another partitioned dataset by using the following command: -

hive> select * from student where course = "hadoop";

Dynamic Partitioning

In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not required to pass
the values of partitioned columns manually.

o First, select the database in which we want to create a table.

hive> use test;

o Enable the dynamic partition by using the following commands: -

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

o Create a dummy table to store the data.

hive> create table stud_demo(id int, name string, age int, institute string, course string)
row format delimited
fields terminated by ',';
o Now, load the data into the table.

hive> load data local inpath '/home/codegyani/hive/student_details' into table stud_demo;

o Create a partition table by using the following command: -

hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

o Now, insert the data of dummy table into the partition table.

hive> insert into student_part
partition(course)
select id, name, age, institute, course
from stud_demo;
o The table student_part is now divided into two partitions: java and hadoop.

o Let's retrieve the entire data of the table by using the following command: -

hive> select * from student_part;


o Now, try to retrieve the data based on partitioned columns by using the following command: -

hive> select * from student_part where course = "java";

In this case, we are not examining the entire data. Hence, this approach improves query response time.

o Let's also retrieve the data of another partitioned dataset by using the following command:

hive> select * from student_part where course = "hadoop";

What Is OrientDB?

OrientDB is a multi-model database capable of efficiently storing and retrieving data like all traditional database systems, while it also supports new functionality adopted from graph and document databases. It is written in Java and belongs to the NoSQL database family.

In graph databases, the structure follows two classes: V as the base class for vertices and E as the base class for edges. OrientDB builds these classes automatically when you create the graph database. In the event that you don't have these classes, create them (see below).
Working with Vertex and Edge Classes

While you can build graphs using V and E class instances, it is strongly recommended that you create
custom types for vertices and edges.
To create a custom vertex class (or type) use the createVertexType(<name>):
// Create Custom Vertex Class
OrientVertexType account = graph.createVertexType("Account");
To create a vertex of that class Account, pass a string with the format "class:<name>":
// Add Vertex Instance
Vertex v = graph.addVertex("class:Account");
In Blueprints, edges have the concept of labels used in distinguishing between edge types. OrientDB binds the concept of edge labels to edge classes. There is a similar method for creating custom edge types, using createEdgeType(<name>):
// Create Graph Database Instance
OrientGraph graph = new OrientGraph("plocal:/tmp/db");

// Create Custom Vertex Classes


OrientVertexType accountVertex = graph.createVertexType("Account");
OrientVertexType addressVertex = graph.createVertexType("Address");

// Create Custom Edge Class


OrientEdgeType livesEdge = graph.createEdgeType("Lives");

// Create Vertices
Vertex account = graph.addVertex("class:Account");
Vertex address = graph.addVertex("class:Address");

// Create Edge
Edge e = account.addEdge("Lives", address);

Inheritance Tree

Classes can extend other classes. To create a class that extends a class different from V or E, pass the class name in the construction:
graph.createVertexType(<class>, <super-class>); // Vertex
graph.createEdgeType(<class>, <super-class>); // Edge
For instance, create the base class Account, then create two subclasses: Provider and Customer:
// Create Vertex Base Class
graph.createVertexType("Account");

// Create Vertex Subclasses


graph.createVertexType("Customer", "Account");
graph.createVertexType("Provider", "Account");

Retrieve Types

Classes are polymorphic. If you search for generic vertices, you also receive all custom vertex instances:

// Retrieve Vertices
Iterable<Vertex> allVertices = graph.getVertices();
To retrieve custom classes, use the getVertexType() and getEdgeType() methods. For instance, retrieving from the graph database instance:
OrientVertexType accountVertex = graph.getVertexType("Account");
OrientEdgeType livesEdge = graph.getEdgeType("Lives");

Drop Persistent Types


To drop a persistent class, use the dropVertexType() and dropEdgeType() methods. For instance, dropping
from the graph database instance:
graph.dropVertexType("Address");
graph.dropEdgeType("Lives");

OrientDB Enterprise Edition gives you all the features of our community edition plus:
 Incremental backups.
 Unmatched security.
 24x7 Support.
 Query Profiler.
 Distributed Clustering configuration.
 Metrics Recording.
 Live Monitor with configurable alerts.

Possible Questions:

PART A

1. What is NOSQL?
2. Define Sharding.
3. What are MongoDB CRUD Operations?
4. How to Insert a record in MONGODB?
5. How to Create a database in MONGODB, CASSANDRA, Hive, ORIENTDB?
6. How to Create a Table in MONGODB, CASSANDRA, Hive, ORIENTDB?
7. How to insert a record in MONGODB, CASSANDRA, Hive, ORIENTDB?
8. How to delete a record in MONGODB, CASSANDRA, Hive, ORIENTDB?
9. What is a CRUD Operation in MONGODB, CASSANDRA, Hive?
10. List out the CQL Types in Cassandra.
11. What is HIVE?
12. Define Partitioning
13. What is a Graph Database?
14. List out the features of OrientDB.

PART B

1. Explain Briefly about CAP theorem.


2. Write a note on MongoDB CRUD operations.
3. Write a Sample program using MONGODB with PHP/JAVA.
4. Explain briefly about Cassandra data model.
5. Write a note on Cassandra CRUD operations.
6. Give an Example for partitioning in Hive.
7. List out the data types supported by Hive in NOSQL
UNIT III

Object-Oriented Databases

An object-oriented database combines object-oriented programming with relational database concepts. There are various items created using object-oriented programming languages like C++ and Java which can be stored in relational databases, but object-oriented databases are better suited for those items.

An object-oriented database is organized around objects rather than actions, and data rather than logic. For
example, a multimedia record in a relational database can be a definable data object, as opposed to an
alphanumeric value.

The ODBMS, which is an abbreviation for object-oriented database management system, uses a data model in which data is stored in the form of objects, which are instances of classes. These classes and objects together make up an object-oriented data model.

Need for Complex Data Types

Complex types are nested data structures composed of primitive data types. These data structures can also be
composed of other complex types. Some examples of complex types include struct(row), array/list, map and
union. Complex types are supported by most programming languages including Python, C++ and Java.

Any data that does not fall into the traditional field structure (alpha, numeric, dates) of a relational DBMS.
Examples of complex data types are bills of materials, word processing documents, maps, time-series, images
and video.

Complex types are non-scalar properties of entity types that enable scalar properties to be organized within
entities. Like entities, complex types consist of scalar properties or other complex type properties. ... Complex
types cannot participate in associations and cannot contain navigation properties.

Complex data types: motivation
 Permit non-atomic domains (atomic means indivisible)
 Example of a non-atomic domain: a set of integers, or a set of tuples
 Allows more intuitive modeling for applications with complex data
 Intuitive definition: allow relations wherever we allow atomic (scalar) values, i.e. relations within relations
 Retains the mathematical foundation of the relational model
 Violates first normal form

Extensions to SQL to support complex types include:
 Collection and large object types (nested relations are an example of collection types)
 Structured types (nested record structures like composite attributes)
 Inheritance
 Object orientation (including object identifiers and references)

The Object-Oriented Data Model


A data model is a logical organization of real-world objects (entities), constraints on them, and the relationships among objects. A DB language is a concrete syntax for a data model. A DB system implements a data model.

The object-oriented data model is based upon real-world situations. These situations are represented as objects with different attributes, and all these objects have multiple relationships between them.
Elements of Object oriented data model
Objects

The real world entities and situations are represented as objects in the Object oriented database model.

Attributes and Method

Every object has certain characteristics. These are represented using Attributes. The behaviour of the objects is
represented using Methods.

Class

Similar attributes and methods are grouped together using a class. An object can be called an instance of the
class.

Inheritance

A new class can be derived from the original class. The derived class contains attributes and methods of the
original class as well as its own.
Example

An Example of the Object Oriented data model is −

Shape, Circle, Rectangle and Triangle are all objects in this model.

Circle has the attributes Center and Radius.

Rectangle has the attributes Length and Breadth.

Triangle has the attributes Base and Height.

The objects Circle, Rectangle and Triangle inherit from the object Shape.
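A minimal Java sketch of the Shape hierarchy just described (the area() method is added purely for illustration):

// Shape is the base class; Circle, Rectangle and Triangle inherit from it.
abstract class Shape {
    abstract double area();              // behaviour (method) shared by all shapes
}

class Circle extends Shape {
    double centerX, centerY, radius;     // attributes: Center and Radius
    Circle(double x, double y, double r) { centerX = x; centerY = y; radius = r; }
    double area() { return Math.PI * radius * radius; }
}

class Rectangle extends Shape {
    double length, breadth;              // attributes: Length and Breadth
    Rectangle(double l, double b) { length = l; breadth = b; }
    double area() { return length * breadth; }
}

class Triangle extends Shape {
    double base, height;                 // attributes: Base and Height
    Triangle(double b, double h) { base = b; height = h; }
    double area() { return 0.5 * base * height; }
}

Each derived class inherits the contract of Shape and adds its own attributes, which is exactly the inheritance relationship the object-oriented data model captures.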
The Object Oriented (OO) Data Model in DBMS
Increasingly complex real-world problems demonstrated a need for a data model that more closely represented
the real world.
In the object oriented data model (OODM), both data and their relationships are contained in a single structure
known as an object.
In turn, the OODM is the basis for the object-oriented database management system (OODBMS).

The Components of the Object Oriented Data Model


• An object is an abstraction of a real-world entity. In general terms, an object may be considered equivalent to
an ER model’s entity. More precisely, an object represents only one occurrence of an entity. (The object’s
semantic content is defined through several of the items in this list.)
• Attributes describe the properties of an object. For example, a PERSON object includes the attributes Name,
Social Security Number, and Date of Birth.
• Objects that share similar characteristics are grouped in classes. A class is a collection of similar objects with
shared structure (attributes) and behavior (methods). In a general sense, a class resembles the ER model’s entity
set. However, a class is different from an entity set in that it contains a set of procedures known as methods. A
class’s method represents a real-world action such as finding a selected PERSON’s name, changing a
PERSON’s name, or printing a PERSON’s address. In other words, methods are the equivalent of procedures in
traditional programming languages. In OO terms, methods define an object’s behavior.
• Classes are organized in a class hierarchy. The class hierarchy resembles an upside-down tree in which each
class has only one parent. For example, the CUSTOMER class and the EMPLOYEE class share a parent
PERSON class. (Note the similarity to the hierarchical data model in this respect.)
• Inheritance is the ability of an object within the class hierarchy to inherit the attributes and methods of the
classes above it. For example, two classes, CUSTOMER and EMPLOYEE, can be created as subclasses from
the class PERSON. In this case, CUSTOMER and EMPLOYEE will inherit all attributes and methods from
PERSON.

Object-Oriented Languages
It is used to structure a software program into simple, reusable pieces of code blueprints (usually called
classes), which are used to create individual instances of objects. There are many object-oriented programming
languages including JavaScript, C++, Java, and Python

List of object-oriented programming languages


This is a list of notable programming languages with object-oriented programming (OOP) features,
which are also listed in Category:Object-oriented programming languages. Note that, in some contexts, the
definition of an "object-oriented programming language" is not exactly the same as that of a "programming
language with object-oriented features".[1] For example, C++ is a multi-paradigm language including object-
oriented paradigm;[2] however, it is less object-oriented than some other languages such
as Python[3] and Ruby.[4] Therefore, some people consider C++ an OOP language, while others do not or refer
to it as a "semi-object-oriented programming language".

What is Object Oriented Programming?

Object Oriented programming (OOP) is a programming paradigm that relies on the concept
of classes and objects. It is used to structure a software program into simple, reusable pieces of code blueprints
(usually called classes), which are used to create individual instances of objects. There are many object-oriented
programming languages including JavaScript, C++, Java, and Python.

A class is an abstract blueprint used to create more specific, concrete objects. Classes often represent broad
categories, like Car or Dog that share attributes. These classes define what attributes an instance of this type
will have, like color, but not the value of those attributes for a specific object.

Classes can also contain functions, called methods available only to objects of that type. These functions are
defined within the class and perform some action helpful to that specific type of object.

Benefits of OOP

 OOP models complex things as reproducible, simple structures


 Reusable, OOP objects can be used across programs
 Allows for class-specific behavior through polymorphism
 Easier to debug, classes often contain all applicable information to them
 Secure, protects information through encapsulation

Spatial Databases:

Spatial data is associated with geographic locations such as cities, towns, etc. A spatial database is optimized to store and query data representing objects defined in a geometric space.

A common example of spatial data can be seen in a road map. A road map is a two-dimensional object that
contains points, lines, and polygons that can represent cities, roads, and political boundaries such as states or
provinces. ... A GIS is often used to store, retrieve, and render this Earth-relative spatial data.

Spatial Data Types

There are two major supported data types in SQL Server, namely the geometry data type and the geography data type.

1. Geometry spatial data type

It is substantially a two-dimensional rendering of an object, useful for data represented as points on a planar, or flat-earth, surface. A good example of it is (10, 2), where the first number, '10', identifies that point's position on the horizontal (x) axis and the number '2' represents the point's position on the vertical (y) axis.
A common use case of the Geometry type is for a three-dimensional object, such as a building

2. Geography spatial data types

These are represented as latitudinal and longitudinal degrees, as on a round-earth coordinate system.
The common use case of the Geography type is to store an application’s GPS data.
In SQL Server, both SQL data types have been implemented in the .NET common language runtime
(CLR)
Spatial data objects

This combines both special data types (geometry and geography). It supports a total of sixteen SQL data types
in which eleven can be utilized in the database. To be more specific, these objects have inherited a particular
property from their parent’s data types and this unique property distributes them as the object. Take the
examples of a Polygon or point or CircularString.

Among them, ten of the depicted data objects will available to Geometry and Geography data types. The ten
objects are respectively Point, MultiPoint, MultiLineString, CircularString, LineString, MultiLineString,
CompoundCurve, Polygon, MultiPolygon, CurvePolygon, and GeometryCollection. However, the FullGlobe is
utilized exclusively for the Geography SQL data types.

Spatial data types are divided into two groups:

1. Single geometries: a geometry that can be stored in the database in only one way

2. Geometry collections: as the name suggests, collections of data objects

The object types associated with a spatial data type form a relationship with each other: the object types of the Geometry and Geography data types are arranged in a single geometry hierarchy that includes both the geometry and geography data types.
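As an illustration of working with geometry objects in code, the sketch below uses the open-source JTS Topology Suite for Java (an assumption; SQL Server's own geometry type lives inside the database engine) to build a Point and a Polygon and test a spatial relationship:

import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;
import org.locationtech.jts.geom.Polygon;

public class SpatialDemo {
    public static void main(String[] args) {
        GeometryFactory gf = new GeometryFactory();

        // A point at x = 10, y = 2, as in the planar example above
        Point p = gf.createPoint(new Coordinate(10, 2));

        // A simple square polygon; the ring must be closed (first == last coordinate)
        Polygon square = gf.createPolygon(new Coordinate[] {
                new Coordinate(0, 0), new Coordinate(20, 0),
                new Coordinate(20, 20), new Coordinate(0, 20),
                new Coordinate(0, 0)
        });

        System.out.println("Polygon contains point? " + square.contains(p));
    }
}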

Spatial relationship:

A spatial relation specifies how some object is located in space in relation to some reference object. When the
reference object is much bigger than the object to locate, the latter is often represented by a point. The reference
object is often represented by a bounding box.

Knowledge of object categories and attributes allows children to mentally and physically organize things in
their world. Spatial awareness and spatial relations allow children to locate objects and navigate successfully
in their environments.

Spatial data structures:

Spatial data structures are structures that manipulate spatial data, that is, data that has geometric coordinates. They sit in the realm between algorithms, computational geometry, graphics, databases, and software design.

Spatial Data Structures

Search trees such as BSTs, AVL trees, splay trees, 2-3 Trees, B-trees, and tries are designed for searching on a
one-dimensional key. A typical example is an integer key, whose one-dimensional range can be visualized as a
number line. These various tree structures can be viewed as dividing this one-dimensional number line into
pieces.

Some databases require support for multiple keys. In other words, records can be searched for using any one of
several key fields, such as name or ID number. Typically, each such key has its own one-dimensional index,
and any given search query searches one of these independent indices as appropriate.

Multidimensional Keys

A multidimensional search key presents a rather different concept. Imagine that we have a database of city records, where each city has a name and an (x, y) coordinate. A BST or splay tree provides good performance for searches on city name, which is a one-dimensional key. Separate BSTs could be used to index the x and y coordinates. This would allow us to insert and delete cities, and locate them by name or by one coordinate. However, search on one of the two coordinates is not a natural way to view search in a two-dimensional space. Another option is to combine the x and y coordinates into a single key, say by concatenating the two coordinates, and index cities by the resulting key in a BST. That would allow search by coordinate, but would not allow for an efficient two-dimensional range query such as searching for all cities within a given distance of a specified point. The problem is that the BST only works well for one-dimensional keys, while a coordinate is a two-dimensional key where neither dimension is more important than the other.

Multidimensional range queries are the defining feature of a spatial application. Because a coordinate gives a
position in space, it is called a spatial attribute. To implement spatial applications efficiently requires the use of
a spatial data structure. Spatial data structures store data objects organized by position and are an important
class of data structures used in geographic information systems, computer graphics, robotics, and many other
fields.

A number of spatial data structures are used for storing point data in two or more dimensions. The kd tree is a natural extension of the BST to multiple dimensions. It is a binary tree whose splitting decisions alternate among the key dimensions. Like the BST, the kd tree uses object-space decomposition. The PR quadtree uses key-space decomposition and so is a form of trie. It is a binary tree only for one-dimensional keys (in which case it is a trie with a binary alphabet). For d dimensions it has 2^d branches. Thus, in two dimensions, the PR quadtree has four branches (hence the name "quadtree"), splitting space into four equal-sized quadrants at each branch. Two other variations on these data structures are the bintree and the point quadtree. In two dimensions, these four structures cover all four combinations of object- versus key-space decomposition on the one hand, and multi-level binary versus 2^d-way branching on the other.
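To make the kd tree's alternating-dimension splitting concrete, here is a toy two-dimensional Java sketch supporting insert and an axis-aligned range query (an illustrative simplification, not a production structure):

import java.util.ArrayList;
import java.util.List;

public class KdTree {
    static class Node {
        double[] pt;                   // the stored point: pt[0] = x, pt[1] = y
        Node left, right;
        Node(double[] pt) { this.pt = pt; }
    }

    private Node root;

    public void insert(double[] pt) { root = insert(root, pt, 0); }

    private Node insert(Node n, double[] pt, int depth) {
        if (n == null) return new Node(pt);
        int d = depth % 2;             // splitting dimension alternates: x, y, x, ...
        if (pt[d] < n.pt[d]) n.left = insert(n.left, pt, depth + 1);
        else                 n.right = insert(n.right, pt, depth + 1);
        return n;
    }

    // Collect all points inside the rectangle [lo[0], hi[0]] x [lo[1], hi[1]].
    public List<double[]> range(double[] lo, double[] hi) {
        List<double[]> out = new ArrayList<>();
        range(root, lo, hi, 0, out);
        return out;
    }

    private void range(Node n, double[] lo, double[] hi, int depth, List<double[]> out) {
        if (n == null) return;
        if (n.pt[0] >= lo[0] && n.pt[0] <= hi[0]
                && n.pt[1] >= lo[1] && n.pt[1] <= hi[1]) out.add(n.pt);
        int d = depth % 2;
        // Only descend into subtrees that can intersect the query rectangle
        if (lo[d] < n.pt[d]) range(n.left, lo, hi, depth + 1, out);
        if (hi[d] >= n.pt[d]) range(n.right, lo, hi, depth + 1, out);
    }

    public static void main(String[] args) {
        KdTree cities = new KdTree();
        cities.insert(new double[] {2, 3});
        cities.insert(new double[] {5, 4});
        cities.insert(new double[] {9, 6});
        // A true two-dimensional range query: 1 <= x <= 6 and 2 <= y <= 5
        for (double[] c : cities.range(new double[] {1, 2}, new double[] {6, 5}))
            System.out.println(c[0] + ", " + c[1]);
    }
}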

Spatial access methods:

Spatial Data Access Methods


 Attribute-based queries (phenomenon-based) This type of request selects features or records geographic
features that satisfy a statement expressing a set of conditions that forms the basis for the retrieval. ...
 Geometry-based queries (location-based) ...
 Topological relationships.

Spatial Access Methods

The main problem in the design of spatial access methods is that there is no total ordering among spatial data objects that preserves spatial proximity. Consider, for example, a user who wants to find the restaurants closest to her location. One attempt to answer this query is to build a one-dimensional index that contains the distances of all restaurants from the user's location, sorted in ascending order. To answer her query, we can return the first entries from the sorted index. However, this index cannot support a query issued by some other user at a different location. In order to answer the query of this new user, we would have to sort all the restaurants again in ascending order of their distances from this user.

Temporal Databases

A temporal database stores data relating to time instances. It offers temporal data types and stores information
relating to past, present and future time. Temporal databases could be uni-temporal, bi-temporal or tri-temporal.

More specifically the temporal aspects usually include valid time, transaction time or decision time.

 Valid time is the time period during which a fact is true in the real world.
 Transaction time is the time at which a fact was recorded in the database.
 Decision time is the time at which the decision was made about the fact.

For example, a conventional database cannot directly support historical queries about past status and cannot
represent inherently retroactive or proactive changes. Without built-in temporal table support from the
DBMS, applications are forced to use complex and often manual methods to manage and maintain temporal
information.

ACTIVE DATABASE:

An active database is a database consisting of a set of triggers. ... If the trigger is activated, then the DBMS evaluates the condition part and executes the action part only if the specified condition evaluates to true. It is possible to activate more than one trigger within a single statement.
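As a hedged sketch of this event-condition-action idea, the Java program below creates a trigger through JDBC (SQLite trigger syntax and the table names Orders/OrderLog are illustrative assumptions; trigger syntax varies by DBMS):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateTrigger {
    public static void main(String[] args) throws Exception {
        // Event: AFTER INSERT ON Orders; Condition: WHEN new.quantity > 100;
        // Action: log the large order into OrderLog.
        String ddl = "CREATE TRIGGER big_order AFTER INSERT ON Orders "
                + "FOR EACH ROW WHEN new.quantity > 100 "
                + "BEGIN INSERT INTO OrderLog VALUES (new.id, new.quantity); END";

        try (Connection con = DriverManager.getConnection("jdbc:sqlite:shop.db");
             Statement stmt = con.createStatement()) {
            stmt.execute(ddl);
        }
    }
}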

Deductive Database:

A deductive database is a database system that makes conclusions about its data based on a set of well-
defined rules and facts. This type of database was developed to combine logic programming with relational
database management systems. Usually, the language used to define the rules and facts is the logical
programming language Datalog.

Recursive Queries:

A recursive query is one that refers to itself.

In general, a recursive CTE has three parts: an initial query that returns the base result set of the CTE, called the anchor member; a recursive query that references the common table expression, therefore called the recursive member; and a termination condition that stops the recursion when the recursive member returns no more rows.

One of the most fascinating features of SQL is its ability to execute recursive queries. Like sub-queries,
recursive queries save us from the pain of writing complex SQL statements. In most of the situations, recursive
queries are used to retrieve hierarchical data. Let’s take a look at a simple example of hierarchical data.

The below Employee table has five columns: id, name, department, position, and manager. The rationale behind
this table design is that an employee can be managed by none or one person who is also the employee of the
organization. Therefore, we have a manager column in the table which contains the value from the id column of
the same table. This results in a hierarchical data where the parent of a record in a table exists in the same table.
Employee Table

From the Employee table, it can be seen that IT department has a manager David with id 1. David is the
manager of Suzan and John since both of them have 1 in their manager column. Suzan further manages Jacob in
the same IT department. Julia is the manager of the HR department. She has no manager but she manages
Wayne who is an HR supervisor. Wayne manages the office boy Zack. Finally we have Sophie, who manages
the Marketing department and she has two subordinates, Wickey and Julia.

We can retrieve a variety of data from this table. We can get the name of the manager of any employee, all the
employees managed by a particular manager, or the level/seniority of employee in the hierarchy of employees.
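Since the recursive query itself is SQL, a hedged Java sketch can show how it would be issued; the CTE below walks the Employee hierarchy described above, computing each employee's level (the JDBC URL and the PostgreSQL-style WITH RECURSIVE keyword are assumptions; SQL Server, for instance, omits the RECURSIVE keyword):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmployeeHierarchy {
    public static void main(String[] args) throws Exception {
        // Anchor member: employees with no manager (the roots of the hierarchy).
        // Recursive member: joins Employee back to the CTE to walk one level down.
        String sql =
                "WITH RECURSIVE hierarchy(id, name, manager, level) AS ( "
                + "  SELECT id, name, manager, 1 FROM Employee WHERE manager IS NULL "
                + "  UNION ALL "
                + "  SELECT e.id, e.name, e.manager, h.level + 1 "
                + "  FROM Employee e JOIN hierarchy h ON e.manager = h.id "
                + ") "
                + "SELECT id, name, level FROM hierarchy";

        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/hr");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getInt("level") + " " + rs.getString("name"));
            }
        }
    }
}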

Mobile Databases:

Mobile databases are separate from the main database and can easily be transported to various places.
Even though they are not connected to the main database, they can still communicate with the database to share
and exchange data.

The mobile database includes the following components −

 The main system database that stores all the data and is linked to the mobile database.
 The mobile database that allows users to view information even while on the move. It shares
information with the main database.
 The device that uses the mobile database to access data. This device can be a mobile phone, laptop etc.
 A communication link that allows the transfer of data between the mobile database and the main
database.

Currently, most mobile DBMSs provide only limited prepackaged SQL functions for the mobile application. It is expected that in the near future, mobile DBMSs will provide functionality matching that at the corporate site.

Some Mobile DBMSs

 Microsoft SQL Server CE


 Oracle Lite Edition

Location and Handoff Management


• The current point of attachment or location of a subscriber (mobile unit) is expressed in terms of the cell
or the base station to which it is presently connected.
• The mobile units (the called and calling subscribers) can continue to talk and move around in their respective
cells; but as soon as either unit moves to a different cell, the location management procedure is invoked to
identify the new location.
• The location management performs three fundamental tasks:
– (a) location update,
– (b) location lookup, and
– (c) paging.
• In location update, which is initiated by the mobile unit, the current location of the unit is recorded in
HLR and VLR databases.
• Location lookup is basically a database search to obtain the current location of the mobile unit; through
paging, the system informs the caller of the location of the called unit in terms of its current base
station.
• These two tasks are initiated by the MSC.
• The cost of update and paging increases as cell size decreases, which becomes quite significant for finer
granularity cells such as micro- or picocell clusters.
• The presence of frequent cell crossing, which is a common scenario in highly commuting zones, further
adds to the cost.
• The system creates location areas and paging areas to minimize the cost.
• A number of neighboring cells are grouped together to form a location area, and the paging area is
constructed in a similar way.
• Location Management
– Search: find a mobile user’s current location
– Update (Register): update a mobile user’s location
– Location info: maintained at various granularities (cell vs. a group of cells called a registration
area)
– Research Issue: organization of location databases
• Global Systems for Mobile (GSM) vs. Mobile IP vs. Wireless Mesh Networks (WMN)
• Handoff Management
– Ensuring that a mobile user remains connected while moving from one location (e.g., cell) to
another
– Packets or connection are routed to the new location
• Decide when to handoff to a new access point (AP)
• Select a new AP from among several APs
• Acquire resources such as bandwidth channels (GSM), or a new IP address (Mobile IP)
– Channel allocation is a research issue: goal may be to maximize channel usage, satisfy QoS, or
maximize revenue generated
• Inform the old AP to reroute packets and also to transfer state information to the new AP
• Packets are routed to the new AP

Mobile Transaction Models

Transaction

 A set of operations that takes a database from one consistent state to another
consistent state.

 That is, a computation is considered a transaction (a conventional transaction) if it
satisfies the ACID (Atomicity, Consistency, Isolation, and Durability) properties.

Transaction : Atomicity

 A transaction is an executable program, assumed to eventually terminate, with one initial state and
one final state.

 If the program reaches its final state, it is said to be committed;

 otherwise, if it is returned to the initial state after some execution steps, it is aborted or rolled back.

Transaction : Consistency

 If a program produces a consistent result, then it satisfies the consistency property and it will
reach its final state, i.e., commit.
 If the result is not consistent, then the transaction program must be returned to the initial state; in other
words, the transaction is aborted.

Transaction : Isolation

 If a program is the only program executing on the system, then it trivially satisfies the
isolation property.

 If there are several other processes on the system, then none of the intermediate states of this
program may be visible to them until it reaches its final state.

Transaction : Durability

 If a program reaches its final state and the result is made available to the outside world, then
this result is made permanent.

 Even a system failure cannot change this result.

 In other words, when a transaction commits, its state is durable.

Multimedia Databases.

• A multimedia database is a collection of interrelated multimedia data that includes text, graphics
(sketches, drawings), images, animations, video, audio, etc., often comprising vast amounts of
multisource multimedia data. The framework that manages different types of multimedia data, which
can be stored, delivered and utilized in different ways, is known as a multimedia database management
system. There are three classes of multimedia data: static media, dynamic
media and dimensional media.

• Content of Multimedia Database management system :

• Media data – The actual data representing an object.

• Media format data – Information such as sampling rate, resolution, encoding scheme etc. about
the format of the media data after it goes through the acquisition, processing and encoding phase.

• Media keyword data – Keyword descriptions relating to the generation of the data. It is also known
as content descriptive data. Example: date, time and place of recording.

• Media feature data – Content dependent data such as the distribution of colors, kinds of texture
and different shapes present in data.

• Types of multimedia applications based on data management characteristic are :

• Repository applications – A large amount of multimedia data as well as metadata (media
format data, media keyword data, media feature data) is stored for retrieval purposes, e.g., a
repository of satellite images, engineering drawings, or radiology scanned pictures.

• Presentation applications – They involve delivery of multimedia data subject to temporal
constraints. Optimal viewing or listening requires the DBMS to deliver data at a certain rate, offering
quality of service above a certain threshold. Here data is processed as it is delivered. Example:
annotation of video and audio data, real-time editing analysis.

• Collaborative work using multimedia information – It involves executing a complex task by


merging drawings, changing notifications. Example: Intelligent healthcare network.
• There are still many challenges to multimedia databases, some of which are :

• Modelling – Work in this area must reconcile database techniques with information retrieval techniques;
multimedia documents constitute a specialized area and deserve special consideration.

• Design – The conceptual, logical and physical design of multimedia databases has not yet been
addressed fully, as performance and tuning issues at each level are far more complex; the data come in
a variety of formats like JPEG, GIF, PNG and MPEG, which are not easy to convert from one
form to another.

• Storage – Storage of a multimedia database on any standard disk presents problems of
representation, compression, mapping to device hierarchies, archiving and buffering during input-
output operations. In a DBMS, a "BLOB" (Binary Large Object) facility allows untyped bitmaps to be
stored and retrieved.

• Performance – For an application involving video playback or audio-video synchronization,


physical limitations dominate. The use of parallel processing may alleviate some problems,
but such techniques are not yet fully developed. Apart from this, multimedia databases consume
a lot of processing time as well as bandwidth.

• Queries and retrieval – For multimedia data like images, video and audio, accessing data through
queries opens up many issues, such as efficient query formulation, query execution and optimization,
which still need to be worked on.

Possible Questions:

Part A

1) What is an Object Oriented Database?


2) List out the Object Oriented programming languages.
3) What are the benefits of OOP?
4) Why is a Spatial Database needed?
5) What is a Spatial Relationship?
6) What are Multidimensional Keys in a Spatial Database?
7) Write a note on the three different databases in a Temporal Database.
8) What is a Mobile Database?
9) List out the mobile transaction models in a Mobile Database.
10) Define Multimedia Database.

Part B

1) Why is there a need for complex data types in an Object Oriented Database?


2) Explain in detail about the Object Oriented Data Model.
3) Define Spatial Database and explain its data types.
4) Explain briefly about Temporal Databases.
5) What is a Mobile Database, and why are Location and Handoff Management needed?
6) What are the contents of a Multimedia Database system?
UNIT IV

XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding
documents in a format that is both human-readable and machine-readable.

• XML stands for eXtensible Markup Language


• XML is a markup language much like HTML
• XML was designed to store and transport data
• XML was designed to be self-descriptive

The XML language has no predefined tags.

• XML tags (like <to> and <from> in the note example below) are not defined in any XML standard.
These tags are "invented" by the author of the XML document.
• HTML works with predefined tags like <p>, <h1>, <table>, etc.
• With XML, the author must define both the tags and the document structure.

Books.xml

<?xml version="1.0" encoding="UTF-8"?>


<bookstore>
<book category="cooking">
<title lang="en">ADT</title>
<author>NATARAJAN S</author>
<year>2021</year>
<price>30.00</price>
</book>
</bookstore>

XML Documents Must Have a Root Element


• XML documents must contain one root element that is the parent of all other elements:
• <root>
<child>
<subchild>.....</subchild>
</child>
</root>
• In this example <note> is the root element:
• <?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
The XML Prolog
 This line is called the XML prolog:
 <?xml version="1.0" encoding="UTF-8"?>
 The XML prolog is optional. If it exists, it must come first in the document.
 XML documents can contain international characters, like Norwegian øæå or French êèé.
 To avoid errors, you should specify the encoding used, or save your XML files as UTF-8.
 UTF-8 is the default character encoding for XML documents.
 UTF-8 is also the default encoding for HTML5, CSS, JavaScript, PHP, and SQL.
 All XML Elements Must Have a Closing Tag
 In XML, it is illegal to omit the closing tag. All elements must have a closing tag:
 <p>This is a paragraph.</p>
<br />

XML Tags are Case Sensitive


• XML tags are case sensitive.
• The tag <Letter> is different from the tag <letter>.
• Opening and closing tags must be written with the same case:
• <message>This is correct</message>

XML Database Types


There are two major types of XML databases −

 XML- enabled
 Native XML (NXD)

XML - Enabled Database

An XML-enabled database is an extension of a conventional database that supports the storage and conversion of
XML documents. It is a relational database, where data is stored in tables consisting of rows and columns. The
tables contain sets of records, which in turn consist of fields.

Native XML Database

A native XML database is based on containers rather than a table format. It can store large amounts of XML
documents and data. A native XML database is queried by XPath expressions.

A native XML database has an advantage over an XML-enabled database: it is more capable of storing, querying
and maintaining XML documents.

Example

Following example demonstrates XML database −

<?xml version = "1.0"?>


<contact-info>
<contact1>
<name>TanmayPatil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</contact1>

<contact2>
<name>ManishaPatil</name>
<company>TutorialsPoint</company>
<phone>(011) 789-4567</phone>
</contact2>
</contact-info>

Here, a table of contacts is created that holds the records of contacts (contact1 and contact2), which in turn
consists of three entities − name, company and phone.
XML Schema

XML Schema is a language which is used for expressing constraints about XML documents. Several schema
languages are in use nowadays, for example RELAX NG and XSD (XML Schema Definition).

An XML schema is used to define the structure of an XML document. It is like a DTD but provides more control
over the XML structure.

Checking Validation
An XML document is called "well-formed" if it contains the correct syntax. A well-formed XML document is
valid if it has been validated against a schema.

Visit http://www.xmlvalidation.com, https://freeformatter.com or
https://www.liquid-technologies.com/online-xsd-validator to validate an XML file against a schema or DTD.

XML Schema Example


Let's create a schema file.

employee.xsd

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="employee">
<xs:complexType>
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
<xs:element name="email" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>

Let's see the xml file using XML schema or XSD file.

employee.xml

<?xml version="1.0"?>
<employee>
<firstname>Natarajan </firstname>
<lastname>Subramanian</lastname>
<email>[email protected]</email>
</employee>

XML Parsers
An XML parser is a software library or package that provides interfaces for client applications to work with an
XML document. The XML Parser is designed to read the XML and create a way for programs to use XML.

An XML parser validates the document and checks that it is well-formed.
Types of XML Parsers

These are the two main types of XML Parsers:

1. DOM
2. SAX

DOM (Document Object Model)

A DOM document is an object which contains all the information of an XML document. It is composed like a
tree structure. The DOM Parser implements a DOM API. This API is very simple to use.

Features of DOM Parser


A DOM Parser creates an internal structure in memory which is a DOM document object and the client
applications get information of the original XML document by invoking methods on this document object.

DOM Parser has a tree based structure.

Advantages
1) It supports both read and write operations and the API is very simple to use.

2) It is preferred when random access to widely separated parts of a document is required.

Disadvantages
1) It is memory inefficient. (consumes more memory because the whole XML document needs to loaded into
memory).

2) It is comparatively slower than other parsers.

SAX (Simple API for XML)

A SAX Parser implements SAX API. This API is an event based API and less intuitive.

Features of SAX Parser


It does not create any internal structure.

Clients do not decide which methods to call; they override the methods of the API and place their own code
inside those methods.

It is an event-based parser; it works like an event handler in Java.

Advantages
1) It is simple and memory efficient.
2) It is very fast and works for huge documents.

Disadvantages
1) It is event-based so its API is less intuitive.

2) Clients never know the full information because the data is broken into pieces.

What is XSL?

XSL is a language for expressing style sheets. An XSL style sheet is, like with CSS, a file that describes how to
display an XML document of a given type. XSL shares the functionality and is compatible with CSS2 (although
it uses a different syntax). It also adds:

 A transformation language for XML documents: XSLT. Originally intended to perform complex styling
operations, like the generation of tables of contents and indexes, it is now used as a general purpose
XML processing language. XSLT is thus widely used for purposes other than XSL, like generating
HTML web pages from XML data.
 Advanced styling features, expressed by an XML document type which defines a set of elements called
Formatting Objects and attributes (in part borrowed from CSS2 properties, with more complex
ones added).

How Does It Work?

Styling requires a source XML document, containing the information that the style sheet will display, and the
style sheet itself, which describes how to display a document of a given type.

The following shows a sample XML file and how it can be transformed and rendered.

The XML file

<scene>
<FX>General Road Building noises.</FX>
<speech speaker="Prosser">
Come off it Mr Dent, you can't win
you know. There's no point in lying
down in the path of progress.
</speech>
<speech speaker="Arthur">
I've gone off the idea of progress.
It's overrated
</speech>
</scene>

This XML file doesn't contain any presentation information, which is contained in the stylesheet. Separating the
document's content and the document's styling information allows displaying the same document on different
media (like screen, paper, cell phone), and it also enables users to view the document according to their
preferences and abilities, just by modifying the style sheet.

The Stylesheet

Here are two templates from the stylesheet used to format the XML file. The full stylesheet (which includes
extra information on pagination and margins) is available.

...
<xsl:template match="FX">
<fo:block font-weight="bold">
<xsl:apply-templates/>
</fo:block>
</xsl:template>

<xsl:template match="speech[@speaker='Arthur']">
<fo:block background-color="blue">
<xsl:value-of select="@speaker"/>:
<xsl:apply-templates/>
</fo:block>
</xsl:template>
...

The stylesheet can be used to transform any instance of the DTD it was designed for. The first rule says that an
FX element will be transformed into a block with a bold font. <xsl:apply-templates/> is a recursive call to
the template rules for the contents of the current element. The second template applies to all speech elements
that have the speaker attribute set to Arthur, and formats them as blue blocks within which the value of the
speaker attribute is added before the text.

What is XSLT

Before XSLT, we should first learn about XSL. XSL stands for eXtensible Stylesheet Language. It is a styling
language for XML, just like CSS is a styling language for HTML.

XSLT stands for XSL Transformation. It is used to transform XML documents into other formats (like
transforming XML into HTML).

What is XSL

In HTML documents, tags are predefined but in XML documents, tags are not predefined. World Wide Web
Consortium (W3C) developed XSL to understand and style an XML document, which can act as XML based
Stylesheet Language.

An XSL document specifies how a browser should render an XML document.

Main parts of XSL Document

 XSLT: It is a language for transforming XML documents into various other types of documents.
 XPath: It is a language for navigating in XML documents.
 XQuery: It is a language for querying XML documents.
 XSL-FO: It is a language for formatting XML documents.

How XSLT Works

The XSLT stylesheet is written in XML format. It is used to define the transformation rules to be applied to the
target XML document. The XSLT processor takes the XSLT stylesheet, applies the transformation rules to the
target XML document, and then generates a formatted document in XML, HTML, or text format. This output is
finally used by an XSLT formatter to generate the actual output, which is displayed to the end user.
Image representation:

Advantage of XSLT

A list of advantages of using XSLT:

 XSLT provides an easy way to merge XML data into presentation because it applies user defined
transformations to an XML document and the output can be HTML, XML, or any other structured
document.
 XSLT provides XPath to locate elements/attributes within an XML document. So it is a more convenient
way to traverse an XML document than the traditional way of using a scripting language.
 XSLT is template based. So it is more resilient to changes in documents than low level DOM and SAX.
 By using XML and XSLT, the application UI script will look clean and will be easier to maintain.
 XSLT templates are based on XPath pattern which is very powerful in terms of performance to process
the XML document.
 XSLT can be used as a validation language as it uses tree-pattern-matching approach.
 You can change the output simply by modifying the transformations in the XSL files.

XPath
 XPath is a major element in the XSLT standard.
 XPath can be used to navigate through elements and attributes in an XML document.
 XPath stands for XML Path Language
 XPath uses "path like" syntax to identify and navigate nodes in an XML document
 XPath contains over 200 built-in functions
 XPath is a major element in the XSLT standard
 XPath is a W3C recommendation

• XPath Path Expressions

• XPath uses path expressions to select nodes or node-sets in an XML document.

• These path expressions look very much like the path expressions you use
with traditional computer file systems.
XML and XQuery


• What is XQuery?
• XQuery is to XML what SQL is to databases.
• XQuery was designed to query XML data.
• XQuery Example
• for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title
• XQuery is the language for querying XML data
• XQuery for XML is like SQL for databases
• XQuery is built on XPath expressions
• XQuery is supported by all major databases
• XQuery is a W3C Recommendation

DATA WAREHOUSE – INTRODUCTION:


A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction
processing. It includes historical data derived from transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for
decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
 It is a database designed for investigative tasks, using data from various applications.
 It supports a relatively small number of clients with relatively long interactions.
 It includes current and historical data to provide a historical perspective of information.
 Its usage is read-intensive.
 It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of
management's decisions."
Goals of Data Warehousing
 To help reporting as well as analysis
 Maintain the organization's historical information
 Be the foundation for decision making.
Need for Data Warehouse
Data Warehouse is needed for the following reasons:

1. Business User: Business users require a data warehouse to


view summarized data from the past. Since these people are
non-technical, the data may be presented to them in an
elementary form.
2. Store historical data: Data Warehouse is required to store the
time variable data from the past. This input is made to be used
for various purposes.
3. Make strategic decisions: Some strategies may be depending upon the data in the data warehouse. So,
data warehouse contributes to making strategic decisions.
4. For data consistency and quality: Bringing the data from different sources at a commonplace, the user
can effectively undertake to bring the uniformity and consistency in data.
5. High response time: Data warehouse has to be ready for somewhat unexpected loads and types of
queries, which demands a significant degree of flexibility and quick response time.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data Warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand, and query.
4. Queries that would be complex in many normalized databases could be easier to build and maintain in
data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical data.
Multidimensional Data Modeling

A multidimensional model views data in the form of a data-cube. A data cube enables data to be modeled and
viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities concerning which an organization keeps records.
Now, suppose we want to view the sales data with a third dimension: for example, the data is organized according
to time and item, and additionally the location dimension is considered for the cities Chennai, Kolkata, Mumbai,
and Delhi. These 3D data are shown in the table, where the 3D data are represented as a series of 2D tables.

Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in the figure.

STAR SCHEMA:
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or
measured, such as a sale or a login. A dimension includes reference data
about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a
multidimensional data model. The star schema is the explicit data warehouse
schema. It is known as a star schema because the entity-relationship diagram
of this schema resembles a star, with points diverging from a central table.
The center of the schema consists of a large fact table, and the points of the
star are the dimension tables.

Fact Tables
A fact table in a star schema contains the facts and is connected to the dimensions. A fact table has two types of
columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of the
fact table is generally a composite key that is made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain
aggregated facts are often instead called summary tables). A fact table generally contains facts at the same
level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension
has no hierarchies and levels, it is called a flat dimension or list. The primary key of each dimension
table is part of the composite primary key of the fact table. Dimensional attributes help to define
the dimensional values. They are generally descriptive, textual values. Dimension tables are usually smaller in
size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic region (markets, cities),
clients, products, times, and channels.
Characteristics of Star Schema
The star schema is intensely suitable for data warehouse database design because of the following features:
 It creates a de-normalized database that can quickly provide query responses.
 It provides a flexible design that can be changed easily or added to throughout the development cycle,
and as the database grows.
 It provides a parallel in design to how end-users typically think of and use the data.
 It reduces the complexity of metadata for both developers and end-users.
Advantages of Star Schema
Star Schemas are easy for end-users and application to understand and navigate. With a well-designed schema,
the customer can instantly analyze large, multidimensional data sets.
The main advantage of star schemas in a decision-support environment are:

Query Performance
Because a star schema database has a limited number of tables and clear join paths, queries run faster than they
do against OLTP systems. Small single-table queries, frequently of a dimension table, are almost instantaneous.
Large join queries that contain multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table. When two
dimension tables are used in a query, only one join path, intersecting the fact table, exists between those two
tables. This design feature enforces accurate and consistent query results.
Load performance and administration
Structural simplicity also decreases the time required to load large batches of records into a star schema
database. By describing facts and dimensions and separating them into separate tables, the impact of a load
operation is reduced. Dimension tables can be populated once and occasionally refreshed. New facts can be
added regularly and selectively by appending records to the fact table.
Built-in referential integrity
A star schema has referential integrity built in when information is loaded. Referential integrity is enforced
because each record in the dimension tables has a unique primary key, and all keys in the fact table are
legitimate foreign keys drawn from the dimension tables. A record in the fact table which is not related
correctly to a dimension cannot be given the correct key value to be retrieved.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These
joins are more meaningful to the end-user because they represent the fundamental relationships between parts of
the underlying business. Customers can also browse dimension table attributes before constructing a query.
Disadvantage of Star Schema
Some situations cannot be modeled by a star schema; for example, the relationship between users and bank
accounts cannot be described as a star schema because the relationship between them is many-to-many.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension tables connected
to it for time, branch, item, and geographic locations.
The TIME table has a column for each day, month, quarter, and year. The ITEM table has columns for each
item_Key, item_name, brand, type, supplier_type. The BRANCH table has columns for each branch_key,
branch_name, branch_type. The LOCATION table has columns of geographic data, including street, city, state,
and country.
In this scenario, the SALES table contains only four columns with IDs from the dimension tables, TIME,
ITEM, BRANCH, and LOCATION, instead of four columns for time data, four columns for ITEM data, three
columns for BRANCH data, and four columns for LOCATION data. Thus, the size of the fact table is
significantly reduced. When we need to change an item, we need only make a single change in the dimension
table, instead of making many changes in the fact table.
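
To make this example concrete, the schema described above might be declared in SQL roughly as follows. This is only a sketch: the surrogate key columns and the measure columns (units_sold, amount) are assumptions, since the text does not list them explicitly.

CREATE TABLE TIME (            -- note: TIME is reserved in some dialects; quote or rename if needed
  time_key INT PRIMARY KEY,    -- assumed surrogate key
  day INT, month INT, quarter INT, year INT
);

CREATE TABLE ITEM (
  item_key INT PRIMARY KEY,
  item_name VARCHAR(50), brand VARCHAR(30), type VARCHAR(30), supplier_type VARCHAR(30)
);

CREATE TABLE BRANCH (
  branch_key INT PRIMARY KEY,
  branch_name VARCHAR(50), branch_type VARCHAR(30)
);

CREATE TABLE LOCATION (
  location_key INT PRIMARY KEY,
  street VARCHAR(50), city VARCHAR(30), state VARCHAR(30), country VARCHAR(30)
);

-- Fact table: one foreign key per dimension plus the measures
CREATE TABLE SALES (
  time_key INT REFERENCES TIME(time_key),
  item_key INT REFERENCES ITEM(item_key),
  branch_key INT REFERENCES BRANCH(branch_key),
  location_key INT REFERENCES LOCATION(location_key),
  units_sold INT,              -- assumed measure
  amount DECIMAL(10,2),        -- assumed measure
  PRIMARY KEY (time_key, item_key, branch_key, location_key)
);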
We can create even more complex star schemas by normalizing a dimension table into several tables. The
normalized dimension table is called a Snowflake.

What is Snowflake Schema?


A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more
dimension tables do not connect directly to the fact table but must join through other dimension tables."
The snowflake schema is an expansion of the star schema where each point of the star explodes into more
points. It is called a snowflake schema because its diagram resembles a snowflake.
Snowflaking is a method of normalizing the dimension tables in a star schema. When we normalize all the
dimension tables entirely, the resulting structure resembles a snowflake with the fact table in the middle.
Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each fact
surrounded by its associated dimensions, and those dimensions are related to other dimensions, branching out
into a snowflake pattern.
The snowflake schema consists of one fact table which is linked to many dimension tables, which can be linked
to other dimension tables through a many-to-one relationship. Tables in a snowflake schema are generally
normalized to third normal form. Each dimension table represents exactly one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each having three levels. A snowflake
schema can have any number of dimensions, and each dimension can have any number of levels.

Example: Figure shows a snowflake schema with a Sales fact table,


with Store, Location, Time, Product, Line, and Family dimension tables.
The Market dimension has two dimension tables with Store as the
primary dimension table, and Location as the outrigger dimension table.
The product dimension has three dimension tables with Product as the
primary dimension table, and the Line and Family table are the outrigger
dimension tables.

A star schema stores all attributes for a dimension in one denormalized
table. This needs more disk space than a more normalized snowflake
schema. Snowflaking normalizes the dimension by moving attributes with
low cardinality into separate dimension tables that relate to the core
dimension table by using foreign keys. Snowflaking for the sole purpose of minimizing disk space is not
recommended, because it can adversely impact query performance.
In a snowflake schema, tables are normalized to remove redundancy: the dimension tables are decomposed
into multiple dimension tables.
The figure shows a simple star schema for sales in a manufacturing
company. The sales fact table includes quantity, price, and other
relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME
are the dimension tables.

The star schema for sales, as shown above, contains only five
tables, whereas the normalized version extends to eleven tables.
We will notice that in the snowflake schema, the attributes with low
cardinality in each original dimension table are removed to form
separate tables. These new tables are connected back to the original
dimension table through artificial keys.

A snowflake schema is designed for flexible querying across more complex dimensions
and relationships. It is suitable for many-to-many and one-to-many relationships between
dimension levels.
Advantage of Snowflake Schema
1. The primary advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joins against smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and components.
3. There is no redundancy, so it is easier to maintain.
Disadvantage of Snowflake Schema
1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due to
the increased number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution times.
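
Continuing the illustrative SQL sketch from the star schema section, snowflaking the assumed ITEM dimension might move the low-cardinality supplier attribute into its own outrigger table (the table and column names are assumptions, not from the original text):

-- New outrigger table holding the low-cardinality attribute
CREATE TABLE SUPPLIER (
  supplier_key INT PRIMARY KEY,
  supplier_type VARCHAR(30)
);

-- ITEM now references SUPPLIER instead of storing supplier_type directly
CREATE TABLE ITEM (
  item_key INT PRIMARY KEY,
  item_name VARCHAR(50),
  brand VARCHAR(30),
  type VARCHAR(30),
  supplier_key INT REFERENCES SUPPLIER(supplier_key)
);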

OLAP Operations in the Multidimensional Data Model


In the multidimensional model, the records are organized into various dimensions, and each dimension includes
multiple levels of abstraction described by concept hierarchies. This organization gives users the flexibility to
view data from various perspectives. A number of OLAP data cube operations exist to materialize these different
views, allowing interactive querying and searching of the records at hand. Hence, OLAP supports a
user-friendly environment for interactive data analysis.
Consider the OLAP operations which are to be performed on multidimensional data. The figure shows data
cubes for the sales of a shop. The cube contains three dimensions, location, time, and item, where location is
aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated
with respect to item types.
Roll-Up
The roll-up operation (also known as drill-up or aggregation operation) performs aggregation on a data cube,
by climbing up a concept hierarchy or by dimension reduction. Roll-up is like zooming out on the data
cube. The figure shows the result of a roll-up operation performed on the dimension location. The hierarchy for
the location is defined as the order street < city < province or state < country. The roll-up operation aggregates
the data by ascending the location hierarchy from the level of the city to the level of the country.
When a roll-up is performed by dimension reduction, one or more dimensions are removed from the cube. For
example, consider a sales data cube having two dimensions, location and time. Roll-up may be performed by
removing the time dimension, resulting in an aggregation of the total sales by location rather than by
location and by time.
Example
Consider the following cubes illustrating temperature of certain days recorded weekly:
Temperature 64 65 68 69 70 71 72 75 80 81 83 85
Week1 1 0 1 0 1 0 0 0 0 0 1 0
Week2 0 0 0 1 0 0 1 2 0 1 0 0
Consider that we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in temperature from the above
cubes.
To do this, we have to group the columns and add up the values according to the concept hierarchy. This
operation is known as a roll-up.
By doing this, we obtain the following cube:
Temperature cool mild hot
Week1 2 1 1
Week2 2 1 1
The roll-up operation groups the information by levels of temperature.
The following diagram illustrates how roll-up works.
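
In relational terms, this roll-up could be expressed roughly as follows; this is an illustrative sketch only, assuming a hypothetical table weekly_temps(week, temperature, occurrences) holding the cube data above.

SELECT week,
       temp_level,
       SUM(occurrences) AS total
FROM (
    -- map each raw temperature to its level in the concept hierarchy
    SELECT week,
           occurrences,
           CASE
               WHEN temperature BETWEEN 64 AND 69 THEN 'cool'
               WHEN temperature BETWEEN 70 AND 75 THEN 'mild'
               WHEN temperature BETWEEN 80 AND 85 THEN 'hot'
           END AS temp_level
    FROM weekly_temps
) AS t
GROUP BY week, temp_level;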

Drill-Down
The drill-down operation (also called roll-down) is the reverse operation of roll-up. Drill-down is like
zooming-in on the data cube. It navigates from less detailed record to more detailed data. Drill-down can be
performed by either stepping down a concept hierarchy for a dimension or adding additional dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a concept hierarchy
which is defined as day, month, quarter, and year. Drill-down proceeds by descending the time hierarchy from
the level of the quarter to the more detailed level of the month.
Because a drill-down adds more details to the given data, it can also be performed by adding a new dimension
to a cube. For example, a drill-down on the central cubes of the figure can occur by introducing an additional
dimension, such as a customer group.
Example
Drill-down adds more details to the given data
Temperature cool mild hot
Day 1 0 0 0
Day 2 0 0 0
Day 3 0 0 1
Day 4 0 1 0
Day 5 1 0 0
Day 6 0 0 0
Day 7 1 0 0
Day 8 0 0 0
Day 9 1 0 0
Day 10 0 1 0
Day 11 0 1 0
Day 12 0 1 0
Day 13 0 0 1
Day 14 0 0 0
The following diagram illustrates how Drill-down works.

Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a dimension. For
example, a slice operation is executed when the customer wants a selection on one dimension of a three-
dimensional cube, resulting in a two-dimensional slice. So, the slice operation performs a selection on one
dimension of the given cube, thus resulting in a subcube.
For example, if we make the selection, temperature=cool we will obtain the following cube:
Temperature cool
Day 1 0
Day 2 0
Day 3 0
Day 4 0
Day 5 1
Day 6 1
Day 7 1
Day 8 1
Day 9 1
Day 11 0
Day 12 0
Day 13 0
Day 14 0
The following diagram illustrates how Slice works.

Here slice is performed on the dimension "time" using the criterion time = "Q1".
It forms a new subcube by selecting one or more dimension values.
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
For example, Implement the selection (time = day 3 OR time = day 4) AND (temperature = cool OR
temperature = hot) to the original cubes we get the following subcube (still two-dimensional)
Temperature cool hot
Day 3 0 1
Day 4 0 0
Consider the following diagram, which shows the dice operations.

The dice operation on the cubes based on the following selection criteria involves three dimensions.
 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item =" Mobile" or "Modem")
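
Against a relational representation of the cube (an illustrative sketch; the sales table and its column names are assumptions), this dice corresponds to a selection on all three dimensions:

SELECT location, time, item, SUM(amount) AS total_sales
FROM sales
WHERE location IN ('Toronto', 'Vancouver')   -- dice on location
  AND time IN ('Q1', 'Q2')                   -- dice on time
  AND item IN ('Mobile', 'Modem')            -- dice on item
GROUP BY location, time, item;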
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation which rotates the data axes in
view in order to provide an alternative presentation of the data. It may involve swapping the rows and columns
or moving one of the row dimensions into the column dimensions.

Consider the following diagram, which shows the pivot operation.

POSSIBLE QUESTIONS
Part A
1) What is XML Database?
2) What is the use of Prolog in XML?
3) Explain about XML Schema and How to validate the XML Schema.
4) What is XSLT and their parts of XSL Document.
5) Write a note on XPath?
6) Define XQuery and Write a program to implement XQuery.
7) What is a Fact Table?
Part B
1) What is XML and explain their data types?
2) Explain in detail about the Parser in XML.
3) What is XSL and how does it work on XML?
4) What is XSLT and how does it work? Explain with an image representation.
5) What is Data warehouse and explain their goals.
6) What is Multidimensional data modelling?
7) Explain in detail about Star Schema.
8) Explain in detail about the Snowflake Schema.
9) Explain the OLAP Operations in multidimensional Data model.
UNIT V
IR CONCEPTS:
The problem of IR

 Goal = find documents relevant to an information need from a large document set

Possible approaches

1. String matching (linear search in documents)


- Slow
- Difficult to improve
2. Indexing (*)
- Fast
- Flexible to further improvement

Introduction
 Text mining refers to data mining using text documents as data.

 Most text mining tasks use Information Retrieval (IR) methods to pre-process text documents.
 These methods are quite different from traditional data pre-processing methods used for relational
tables.
 Web search also has its root in IR.

Information Retrieval (IR)


 Conceptually, IR is the study of finding needed information. I.e., IR helps users find information that
matches their information needs.
 Expressed as queries
 Historically, IR is about document retrieval, emphasizing document as the basic unit.
 Finding documents relevant to user queries
 Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.

IR architecture

IR queries
 Keyword queries
 Boolean queries (using AND, OR, NOT)

 Phrase queries
 Proximity queries
 Full document queries
 Natural language questions

1. Keyword Queries :

 Simplest and most common queries.


 The user enters just keyword combinations to retrieve documents.
 These keywords are connected by logical AND operator.
 All retrieval models provide support for keyword queries.

2. Boolean Queries :

 Some IR systems allow using +, -, AND, OR, NOT, ( ), Boolean operators in combination of keyword
formulations.
 No ranking is involved because a document either satisfies such a query or does not satisfy it.
 A document is retrieved for boolean query if it is logically true as exact match in document.

3. Phrase Queries :
 When documents are represented using an inverted keyword index for searching, the relative order of
terms in a document is lost.
 To perform exact phrase retrieval, these phrases are encoded in the inverted index or implemented differently.
 This query consists of a sequence of words that make up a phrase.
 It is generally enclosed within double quotes.

4. Proximity Queries :

 Proximity refers to a search that accounts for how close within a record multiple terms should be to
each other.
 The most commonly used proximity search option is a phrase search that requires the terms to be in exact
order.
 Other proximity operators can specify how close terms should be to each other. Some will specify the
order of the search terms.
 Search engines use various operator names such as NEAR, ADJ (adjacent), or AFTER.
 However, providing support for complex proximity operators becomes expensive, as it requires time-
consuming pre-processing of documents, and so it is more suitable for smaller document collections
than for the Web.

5. Wildcard Queries :

 It supports regular expressions and pattern matching-based searching in text.


 Retrieval models do not directly support for this query type.
 In IR systems, certain kinds of wildcard search support may be implemented. Example: usually words
ending with trailing characters.

6. Natural Language Queries :

 There are only a few natural language search engines that aim to understand the structure and meaning
of queries written in natural language text, generally as question or narrative.
 The system tries to formulate answers for these queries from retrieved results.
 Semantic models can provide support for this query type.

Information retrieval models


 An IR model governs how a document and a query are represented and how the relevance of a
document to a user query is defined.

 Main models:
 Boolean model
 Vector space model

Boolean model
 Each document or query is treated as a “bag” of words or terms. Word sequence is not considered.
 Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinctive words/terms in the
collection. V is called the vocabulary.

 A weight wij > 0 is associated with each term ti of a document dj ∈ D. For a term that does not appear in
document dj, wij = 0. Each document dj is then represented as the vector
dj = (w1j, w2j, ..., w|V|j).
 Query terms are combined logically using the Boolean operators AND, OR, and NOT.
 E.g., ((data AND mining) AND (NOT text))
 Retrieval
 Given a Boolean query, the system retrieves every document that makes the query logically true.
 Called exact match.
The retrieval results are usually quite poor because term frequency is not considered

Vector space model


 Documents are also treated as a “bag” of words or terms.
 Each document is represented as a vector.
 However, the term weights are no longer 0 or 1. Each term weight is computed based on some
variations of TF or TF-IDF scheme.
 Term Frequency (TF) Scheme: the weight of a term ti in document dj is the number of times that ti
appears in dj, denoted by fij. Normalization may also be applied.
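
As a concrete illustration, one common variant of the TF-IDF scheme (a standard formulation, not necessarily the exact one intended here) computes the weight of term ti in document dj as

wij = (fij / max_k fkj) × log(N / dfi)

where fij / max_k fkj is the normalized term frequency, N is the total number of documents in the collection, and dfi is the number of documents containing term ti; the logarithmic factor is the inverse document frequency (IDF).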

Text pre-processing
Word (term) extraction: easy
 Stopwords removal
 Stemming
 Frequency counts and computing TF-IDF term weights.

1. Stopword Removal

Stopwords are very commonly used words in a language that play a major role in the formation of a
sentence but seldom contribute to the meaning of that sentence. Words that are expected to occur in
80 percent or more of the documents in a collection are typically referred to as stopwords, and they are
rendered potentially useless. Because of the commonness and function of these words, they do not
contribute much to the relevance of a document for a query search. Examples include words such as the, of,
to, a, and, in, said, for, that, was, on, he, is, with, at, by, and it. These words are presented here with
decreasing frequency of occurrence from a large corpus of documents called AP89. The first six of these
words account for 20 percent of all words in the listing, and the most frequent 50 words account for 40
percent of all text.

Removal of stopwords from a document must be performed before indexing. Articles, prepositions,
conjunctions, and some pronouns are generally classified as stopwords. Queries must also be preprocessed
for stopword removal before the actual retrieval process. Removal of stopwords results in elimination of
possible spurious indexes, thereby reducing the size of an index structure by about 40 percent or more.
However, doing so could impact the recall if the stopword is an integral part of a query (for example, a
search for the phrase ‘To be or not to be,’ where removal of stopwords makes the query inappropriate, as all
the words in the phrase are stopwords). Many search engines do not employ query stopword removal for
this reason.

2. Stemming

A stem of a word is defined as the word obtained after trimming the suffix and prefix of an original word.
For example, 'comput' is the stem word for computer, computing, and computation. These suffixes and
prefixes are very common in the English language for supporting the notion of verbs, tenses, and plural
forms. Stemming reduces the different forms of a word formed by inflection (due to plurals or tenses) and
derivation to a common stem.
A stemming algorithm can be applied to reduce any word to its stem. In English, the most famous stemming
algorithm is Martin Porter’s stemming algorithm. The Porter stemmer is a simplified version of Lovin’s
technique that uses a reduced set of about 60 rules (from 260 suffix patterns in Lovin’s technique) and
organizes them into sets; conflicts within one subset of rules are resolved before going on to the next. Using
stemming for preprocessing data results in a decrease in the size of the indexing structure and an increase in
recall, possibly at the cost of precision.

3. Utilizing a Thesaurus

A thesaurus comprises a precompiled list of important concepts and the main word that describes each
concept for a particular domain of knowledge. For each concept in this list, a set of synonyms and related
words is also compiled. Thus, a synonym can be converted to its matching concept during preprocessing.
This preprocessing step assists in providing a standard vocabulary for indexing and searching. Usage of a
thesaurus, also known as a collection of synonyms, has a substantial impact on the recall of information
systems. This process can be complicated because many words have different meanings in different
contexts.

UMLS is a large biomedical thesaurus of millions of concepts (called the Metathesaurus) and a semantic
network of meta concepts and relationships that organize the Metathesaurus (see Figure 27.3). The concepts
are assigned labels from the semantic network. This thesaurus of concepts contains synonyms of medical
terms, hierarchies of broader and narrower terms, and other relationships among words and concepts that
make it a very extensive resource for information retrieval of documents in the medical domain. Figure 27.3
illustrates part of the UMLS Semantic Network.
WordNet is a manually constructed thesaurus that groups words into strict synonym sets called synsets.
These synsets are divided into noun, verb, adjective, and adverb categories. Within each category, these
synsets are linked together by appropriate relationships such as class/subclass or “is-a” relationships for
nouns.

WordNet is based on the idea of using a controlled vocabulary for indexing, thereby eliminating
redundancies. It is also useful in providing assistance to users with locating terms for proper query
formulation.

4. Other Preprocessing Steps: Digits, Hyphens, Punctuation Marks, Cases

Digits, dates, phone numbers, e-mail addresses, URLs, and other standard types of text may or may not be
removed during preprocessing. Web search engines, however, index them in order to use this type of
information in the document metadata to improve precision and recall (see Section 27.6 for detailed
definitions of precision and recall).

Hyphens and punctuation marks may be handled in different ways. Either the entire phrase with the
hyphens/punctuation marks may be used, or they may be eliminated. In some systems, the character
representing the hyphen/punctuation mark may be removed, or may be replaced with a space. Different
information retrieval systems follow different rules of processing. Handling hyphens automatically can be
complex: it can either be done as a classification problem, or more commonly by some heuristic rules.

Most information retrieval systems perform case-insensitive search, converting all the letters of the text to
uppercase or lowercase. It is also worth noting that many of these text preprocessing steps are language
specific, such as involving accents and diacritics and the idiosyncrasies that are associated with a particular
language.

5. Information Extraction

Information extraction (IE) is a generic term used for extracting structured content from text. Text
analytic tasks such as identifying noun phrases, facts, events, people, places, and relationships are examples
of IE tasks. These tasks are also called named entity recognition tasks and use rule-based approaches with
either a thesaurus, regular expressions and grammars, or probabilistic approaches. For IR and search
applications, IE technologies are mostly used to identify contextually relevant features that involve text
analysis, matching, and categorization for improving the relevance of search systems. Language
technologies using part-of-speech tagging are applied to semantically annotate the documents with extracted
features to aid search relevance.
Web Search as a huge IR system

 A Web crawler (robot) crawls the Web to collect all the pages.
 Servers establish a huge inverted indexing database and other indexing databases
 At query (search) time, search engines conduct different types of vector query matching.
Inverted index
 The inverted index of a document collection is basically a data structure that
attaches each distinctive term to a list of all documents that contain the term.
 Thus, in retrieval, it takes constant time to
find the documents that contain a query term.
 Multiple query terms are also easy to handle, as we will see soon.

An example

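Since the slide's original figure is not reproduced here, the following small example (added for illustration) shows the idea. Given two documents

D1: "data mining and text mining"
D2: "mining massive data sets"

the inverted index attaches each distinct term to the list of documents containing it:

data -> {D1, D2}
mining -> {D1, D2}
text -> {D1}
massive -> {D2}
sets -> {D2}
and -> {D1} (usually removed as a stopword)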

Search using inverted index


Given a query q, search has the following steps:

 Step 1 (vocabulary search): find each term/word in q in the inverted index.


 Step 2 (results merging): Merge results to find documents that contain all or some of the words/terms
in q.
 Step 3 (Rank score computation): To rank the resulting documents/pages, using,
 content-based ranking
 link-based ranking

Evaluation Measures:
Evaluation measures for an information retrieval system are used to assess how well the search results
satisfy the user's query intent. Such metrics are often split into two kinds: online metrics look at users'
interactions with the search system, while offline metrics measure relevance; in other words, how likely each
result, or the search engine results page (SERP) as a whole, is to meet the information needs of the user.
The remaining topics are Web Search and Analytics, Ontology-based Search, and Current Trends.

Web: –
A huge, widely-distributed, highly heterogeneous, semistructured, interconnected, evolving,
hypertext/hypermedia information repository
•Main issues –
Abundance of information
•99% of all the information is not interesting to 99% of all users – the static Web is a very small part
of all the Web
•Dynamic Websites – to access the Web, users need to exploit Search Engines (SE)
•SEs must be improved
•To help people better formulate their information needs
•More personalization is needed

The 250 most frequent terms in the famous AOL query log!


Query analysis to evaluate user needs
•Informational – want to learn about something (~40% / 65%)
•Navigational – want to go to that page (~25% / 15%)
•Transactional – want to do something (web-mediated) (~35% / 20%)
–Access a service
–Downloads
–Shop
•Gray areas
–Find a good hub
–Exploratory search “see what’s there”

Ontology based Search


• In Philosophy:
“A science or study of being”
“First philosophy”
• In Knowledge Engineering:
“A formal, explicit specification of a shared conceptualization”

• WordNet
– A large lexical database organized in terms of meanings.
– Nouns, Adjectives, Adverbs, and Verbs
– Synonym words are grouped into synset
{car, auto, automobile, machine, motorcar}
{food, nutrient}
{police, police force, constabulary, law}
– Number of words, synsets, and senses

Trends in Computer Science Research


Current trends in hardware and software include the increasing use of reduced instruction-set
computing, migration to the UNIX operating system, the development of large software libraries,
microprocessor-based smart terminals that allow remote validation of data, speech synthesis and
recognition, application ...

 Artificial intelligence and robotics
 Big data analytics
 Computer-assisted education
 Bioinformatics
 Cyber security

Data and Analytics for 2020


 Trend 1: Smarter, faster, more responsible AI
 Trend 2: Decline of the dashboard
 Trend 3: Decision intelligence
 Trend 4: X analytics
 Trend 5: Augmented data management
 Trend 6: Cloud is a given
 Trend 7: Data and analytics worlds collide
 Trend 8: Data marketplaces and exchanges

Possible Questions:
Part A
1) Write a note on IR Concepts.
2) Explain the models of IR Concepts.
3) Write a note on
a. Stemming
b. Thesaurus
4) What is WordNet?
5) What is Ontology-based Search?
6) What are the current Trends in the IR Model?
Part B
1) Explain in detail about the Information Retrieval architecture.
2) What are the types of Queries in Information Retrieval?
3) Explain in detail about Text pre-processing.
4) Explain about Information Extraction.
5) What are the evaluation measures used to search text on the Web?

UNIT V COMPLETED

UNIT 1
DISTRIBUTED DATABASES
1. Elaborate how the data spread over multiple machines? Explain its architecture.
Distributed Systems
 Data spread over multiple machines (also referred to as sites or nodes).
 Network interconnects the machines
 Data shared by users on multiple machines
Distributed Database
 Homogeneous distributed databases
o Same software/schema on all sites, data may be partitioned among sites
o Goal: provide a view of a single database, hiding details of distribution

 Heterogeneous distributed databases


o Different software/schema on different sites
o Goal: integrate existing databases to provide useful functionality
 Differentiate between local and global transactions
o A local transaction accesses data in the single site at which the transaction was initiated.
o A global transaction either accesses data in a site different from the one at which the
transaction was initiated or accesses data in several different sites.

Trade-offs in Distributed Systems

 Sharing data – users at one site are able to access data residing at other sites.
 Autonomy - each site is able to retain a degree of control over data stored locally.
 Higher system availability through redundancy – data can be replicated at remote sites, and system
can function even if a site fails.
 Disadvantage: added complexity required to ensure proper coordination among sites.
o Software development cost.
o Greater potential for bugs.
o Increased processing overhead.

Implementation Issues for Distributed Databases

 Atomicity needed even for transactions that update data at multiple sites
 The two-phase commit protocol (2PC) is used to ensure atomicity
o Basic idea: each site executes the transaction until just before commit, and then leaves the final
decision to a coordinator


o Each site must follow the decision of the coordinator, even if there is a failure while waiting for
the coordinator's decision.

 2PC is not always appropriate: other transaction models based on persistent messaging, and
workflows, are also used
 Distributed concurrency control (and deadlock detection) required
 Data items may be replicated to improve data availability

Network Types
 Local-area networks (LANs) - composed of processors that are distributed over small geographical
areas, such as a single building or a few adjacent buildings.
 Wide-area networks (WANs) - composed of processors distributed over a large geographical area.

Storage-area network


Storage Area Network

A storage area network (SAN) is a high-speed network that provides access to data storage at the block
level. It connects servers with storage devices like disk arrays, RAID hardware, and tape libraries.

Tape library

A tape library is a storage system that contains multiple tape drives. It is essentially a collection of tapes
and tape drives that store information, usually for backup.

Network Types

 WANs with continuous connection (e.g., the Internet) are needed for implementing distributed
database systems
 Groupware applications such as Lotus Notes can work on WANs with discontinuous connection:
o Data is replicated.
o Updates are propagated to replicas periodically.
o Copies of data may be updated independently.

Database System Concept

 Heterogeneous and Homogeneous Databases


 Distributed Data Storage
 Distributed Transactions
 Commit Protocols
 Concurrency Control in Distributed Databases
 Availability
 Distributed Query Processing
 Heterogeneous Distributed Databases
 Directory Systems

Distributed Database System:


 A distributed database system consists of loosely coupled sites that share no physical components.
 Database systems that run on each site are independent of each other
 Transactions may access data at one or more sites

Homogeneous Distributed Databases


 In a homogeneous distributed database.
o All sites have identical software.
o Are aware of each other and agree to cooperate in processing user requests.
o Each site surrenders part of its autonomy in terms of right to change schemas or software.


o Appears to user as a single system


 In a heterogeneous distributed database
o Different sites may use different schemas and software
 Difference in schema is a major problem for query processing
 Difference in software is a major problem for transaction processing
o Sites may not be aware of each other and may provide only
limited facilities for cooperation in transaction processing.

2. How the relation is partitioned into several fragments and stored in distinct sites? Explain with
diagram.
Distributed Data Storage

 Assume relational data model.


 Replication
o System maintains multiple copies of data, stored in different sites, for faster retrieval and
fault tolerance.
 Fragmentation
o Relation is partitioned into several fragments stored in distinct sites
 Replication and fragmentation can be combined
o Relation is partitioned into several fragments: system maintains several identical replicas of
each such fragment.

Data Replication
 A relation or fragment of a relation is replicated, if it is stored redundantly in two or more sites.
 In the most extreme case, full replication, the relation is stored at all sites.
 Fully redundant databases are those in which every site contains a copy of the entire database.


Advantages of Replication
o Availability: failure of a site containing relation r does not result in unavailability of r if
replicas exist.
o Parallelism: queries on r may be processed by several nodes in parallel.
o Reduced data transfer: relation r is available locally at each site containing a replica of
r.
 Disadvantages of Replication
o Increased cost of updates: each replica of relation r must be updated.
o Increased complexity of concurrency control: concurrent updates to distinct replicas may
lead to inconsistent data unless special concurrency control mechanisms are
implemented.
 One solution: choose one copy as primary copy and apply concurrency control
operations on primary copy.
Data Fragmentation
 Division of relation r into fragments r1, r2, …, rn which contain sufficient information to
reconstruct relation r.
 Horizontal fragmentation: each tuple of r is assigned to one or more fragments


 Vertical fragmentation: the schema for relation r is split into several smaller schemas
 All schemas must contain a common candidate key (or superkey) to ensure lossless join property.
 A special attribute, the tuple-id attribute may be added to each schema to serve as a candidate
key.

Horizontal Fragmentation of account Relation

branch_name account_number balance

Hillside A-305 500


Hillside A-226 336
Hillside A-155 62

account1 = σ branch_name=“Hillside” (account)

branch_name account_number balance

Valleyview A-177 205
Valleyview A-402 10000
Valleyview A-408 1123
Valleyview A-639 750

account2 = σ branch_name=“Valleyview” (account)

Vertical Fragmentation of employee_info Relation

branch_name customer_name tuple_id


Hillside Lowman 1
Hillside Camp 2
Valleyview Camp 3
Valleyview Kahn 4
Hillside Kahn 5
Valleyview Kahn 6
Valleyview Green 7

deposit1 = Πbranch_name, customer_name, tuple_id (employee_info)


account_number balance tuple_id

A-305 500 1
A-226 336 2
A-177 205 3
A-402 10000 4
A-155 62 5
A-408 1123 6
A-639 750 7

deposit2 = Π account_number, balance, tuple_id (employee_info )
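To make the two kinds of fragmentation and their lossless reconstruction concrete, here is a minimal Python sketch (relations are modeled as lists of dicts, and all helper names are invented for illustration). Horizontal fragments are recombined by union; vertical fragments are rejoined on the tuple_id candidate key.

# Illustrative sketch only; these are not real DBMS APIs.

def horizontal_fragment(relation, predicate):
    # sigma_predicate(relation): keep whole tuples satisfying the predicate.
    return [t for t in relation if predicate(t)]

def vertical_fragment(relation, attributes):
    # pi_attributes(relation): keep only the listed attributes of each tuple.
    return [{a: t[a] for a in attributes} for t in relation]

def join_on_tuple_id(frag1, frag2):
    # Lossless reconstruction of vertical fragments via the tuple_id key.
    by_id = {t["tuple_id"]: t for t in frag2}
    return [{**t, **by_id[t["tuple_id"]]} for t in frag1]

account = [
    {"tuple_id": 1, "branch_name": "Hillside", "account_number": "A-305", "balance": 500},
    {"tuple_id": 3, "branch_name": "Valleyview", "account_number": "A-177", "balance": 205},
]

# Horizontal: account1 UNION account2 reconstructs account.
account1 = horizontal_fragment(account, lambda t: t["branch_name"] == "Hillside")
account2 = horizontal_fragment(account, lambda t: t["branch_name"] == "Valleyview")
assert sorted(map(str, account1 + account2)) == sorted(map(str, account))

# Vertical: joining on tuple_id reconstructs account.
v1 = vertical_fragment(account, ["tuple_id", "branch_name"])
v2 = vertical_fragment(account, ["tuple_id", "account_number", "balance"])
assert sorted(map(str, join_on_tuple_id(v1, v2))) == sorted(map(str, account))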

Advantages of Fragmentation

Horizontal:
 allows parallel processing on fragments of a relation
 allows a relation to be split so that tuples are located where they are most frequently
accessed
Vertical:
 allows tuples to be split so that each part of the tuple is stored where it is most
frequently accessed
 tuple-id attribute allows efficient joining of vertical fragments
 allows parallel processing on a relation
Vertical and horizontal fragmentation can be mixed.
 Fragments may be successively fragmented to an arbitrary depth

Data Transparency
 Data transparency: Degree to which system user may remain unaware of the details of how and where the
data items are stored in a distributed system
 Consider transparency issues in relation to:
o Fragmentation transparency
o Replication transparency
o Location transparency

Naming of Data Items - Criteria

1. Every data item must have a system-wide unique name.


2. It should be possible to find the location of data items efficiently.
3. It should be possible to change the location of data items transparently.
4. Each site should be able to create new data items autonomously.


Centralized Scheme - Name Server

 Structure:
o name server assigns all names
o each site maintains a record of local data items
o sites ask name server to locate non-local data item
 Advantages:
o satisfies naming criteria 1-3
 Disadvantages:
o does not satisfy naming criterion 4, since the name server must assign all names
o name server is a potential performance bottleneck
o name server is a single point of failure.

Use of Aliases

 Alternative to centralized scheme: each site prefixes its own site identifier to any name that it
generates, e.g., site17.account.
o Fulfills having a unique identifier, and avoids problems associated with central control.
o However, fails to achieve network transparency.
 Solution: Create a set of aliases for data items; Store the mapping of aliases to the real names at
each site.
 The user can be unaware of the physical location of a data item, and is unaffected if the data item
is moved from one site to another.

3. Name the protocols used to ensure atomicity across sites and explain how the Distributed
Transactions occurs.

Distributed Transactions and 2 Phase Commit

Distributed Transactions

 Transaction may access data at several sites.


 Each site has a local transaction manager responsible for:
o Maintaining a log for recovery purposes
o Participating in coordinating the concurrent execution of the transactions executing at that
site.
 Each site has a transaction coordinator, which is responsible for:
o Starting the execution of transactions that originate at the site.
o Distributing sub transactions at appropriate sites for execution.
o Coordinating the termination of each transaction that originates at the site, which may result
in the transaction being committed at all sites or aborted at all sites.


Transaction System Architecture

System Failure Modes


 Failures unique to distributed systems
o Failure of a site.
o Loss of messages
 Handled by network transmission control protocols such as TCP/IP
o Failure of a communication link
 Handled by network protocols, by routing messages via alternative links

o Network partition
 A network is said to be partitioned when it has been split into two or more
subsystems that lack any connection between them
- Note: a subsystem may consist of a single node
 Network partitioning and site failures are generally indistinguishable.

Commit Protocols

 Commit protocols are used to ensure atomicity across sites


o A transaction which executes at multiple sites must either be committed at all the sites, or
aborted at all the sites.
o Not acceptable to have a transaction committed at one site and aborted at another
 The two-phase commit (2PC) protocol is widely used
 The three-phase commit (3PC) protocol is more complicated and more expensive, but avoids some
drawbacks of two-phase commit protocol. This protocol is not used in practice.

Two Phase Commit Protocol (2PC)

 Assumes fail-stop model - failed sites simply stop working, and do not cause any other harm, such
as sending incorrect messages to other sites.
 Execution of the protocol is initiated by the coordinator after the last step of the transaction has been
reached.


 The protocol involves all the local sites at which the transaction executed
 Let T be a transaction initiated at site Si , and let the transaction coordinator at Si be Ci

Phase 1: Obtaining a Decision

 Coordinator asks all participants to prepare to commit transaction Ti .


o Ci adds the record <prepare T> to the log and forces the log to stable storage.
o sends prepare T messages to all sites at which T executed
 Upon receiving message, transaction manager at site determines if it can commit the transaction
o if not, add a record <no T> to the log and send abort T message to Ci
o if the transaction can be committed, then:
o add the record <ready T> to the log
o force all records for T to stable storage
o send ready T message to Ci

Phase 2: Recording the Decision

 T can be committed if Ci received a ready T message from all the participating sites; otherwise
T must be aborted.
 The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto
stable storage. Once the record reaches stable storage it is irrevocable (even if failures occur)
 Coordinator sends a message to each participant informing it of the decision (commit or abort)
 Participants take appropriate action locally.
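The message flow of the two phases can be summarized in a short Python simulation (purely illustrative: the class names and the way logs are modeled are invented for this sketch, not part of any real protocol library).

# Illustrative simulation of two-phase commit.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.log = name, can_commit, []

    def prepare(self, txn):
        # Phase 1: force <ready T> (or <no T>) to the log before replying.
        if self.can_commit:
            self.log.append(f"<ready {txn}>")
            return "ready"
        self.log.append(f"<no {txn}>")
        return "abort"

    def decide(self, txn, decision):
        # Phase 2: obey the coordinator's decision unconditionally.
        self.log.append(f"<{decision} {txn}>")

def two_phase_commit(coordinator_log, participants, txn):
    coordinator_log.append(f"<prepare {txn}>")        # forced to stable storage
    votes = [p.prepare(txn) for p in participants]    # collect ready/no votes
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    coordinator_log.append(f"<{decision} {txn}>")     # irrevocable once logged
    for p in participants:
        p.decide(txn, decision)
    return decision

sites = [Participant("S1"), Participant("S2"), Participant("S3", can_commit=False)]
print(two_phase_commit([], sites, "T"))               # -> abort (one site voted no)

Note how the coordinator's forced decision record is the commit point: once <commit T> reaches stable storage, every participant must eventually apply that decision.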

Handling of Failures - Site Failure

When site Sk recovers, it examines its log to determine the fate of transactions active at the time of the
failure.

 Log contains <commit T> record: txn had completed, nothing to be done
 Log contains <abort T> record: txn had completed, nothing to be done
 Log contains <ready T> record: site must consult Ci to determine the fate of T.
o If T committed, redo (T); write <commit T> record
o If T aborted, undo (T)


 The log contains no log records concerning T:


o Implies that Sk failed before responding to the prepare T message from Ci.
o Since the failure of Sk precludes the sending of such a response, coordinator Ci must abort
T
o Sk must execute undo (T)

Handling of Failures- Coordinator Failure

 If coordinator fails while the commit protocol for T is executing then participating sites must decide on T’s
fate:
1. If an active site contains a<commit T> record in its log, then T must be committed.
2. If an active site contains an <abort T> record in its log, then T must be aborted
3. If some active participating site does not contain a <ready T> record in its log, then the
failed coordinator Ci cannot have decided to commit T.
 Can therefore abort T; however, such a site must reject any subsequent <prepare
T> message from Ci
4. If none of the above cases holds, then all active sites must have a <ready T> record in their
logs, but no additional control records (such as <abort T> or <commit T>).

 In this case active sites must wait for Ci to recover, to find the decision.
 Blocking problem: active sites may have to wait for failed coordinator to recover.

Handling of Failures - Network Partition


 If the coordinator and all its participants remain in one partition, the failure has no effect on the commit
protocol.

 If the coordinator and its participants belong to several partitions:


 Sites that are not in the partition containing the coordinator think the coordinator has failed, and
execute the protocol to deal with failure of the coordinator.
o No harm results, but sites may still have to wait for decision from coordinator.
 The sites that are in the same partition as the coordinator think that the sites in the
other partition have failed, and follow the usual commit protocol.
o Again, no harm results

4. How the Recovery and Concurrency Control works? Explain 3 phase commit and Implementation
of Persistent Messaging.
Recovery and Concurrency Control

 In-doubt transactions have a <ready T>, but neither a <commit T> nor an <abort T> log record.
 The recovering site must determine the commit-abort status of such transactions by contacting other
sites; this can be slow and can potentially block recovery.
 Recovery algorithms can note lock information in the log.
o Instead of <ready T>, write out <ready T, L> L = list of locks held by T when the log is
written (read locks can be omitted).


o For every in-doubt transaction T, all the locks noted in the <ready T, L> log record are
reacquired.
 After lock reacquisition, transaction processing can resume; the commit or rollback of in-doubt
transactions is performed concurrently with the execution of new transactions.

Three Phase Commit (3PC)


 Assumptions:
 No network partitioning
 At any point, at least one site must be up.
 At most K sites (participants as well as coordinator) can fail
 Phase 1: Obtaining Preliminary Decision: Identical to 2PC Phase 1.
 Every site is ready to commit if instructed to do so
 Phase 2 of 2PC is split into 2 phases, Phase 2 and Phase 3 of 3PC
 In phase 2 coordinator makes a decision as in 2PC (called the pre-commit decision) and records
it in multiple (at least K) sites.
 In phase 3, coordinator sends commit/abort message to all participating sites,
 Under 3PC, knowledge of pre-commit decision can be used to commit despite coordinator failure
 Avoids blocking problem as long as < K sites fail
 Drawbacks:
 higher overheads
 assumptions may not be satisfied in practice.

Alternative Models of Transaction Processing


 The notion of a single transaction spanning multiple sites is inappropriate for many applications.
o E.g., a transaction crossing an organizational boundary
o No organization would like to permit an externally initiated transaction to block local
transactions for an indeterminate period
 Alternative models carry out transactions by sending messages
o Code to handle messages must be carefully designed to ensure atomicity and durability
properties for updates
 Isolation cannot be guaranteed, in that intermediate stages are visible, but code
must ensure no inconsistent states result due to concurrency
o Persistent messaging systems are systems that provide transactional properties to
messages
 Messages are guaranteed to be delivered exactly once
 Will discuss implementation techniques later

 Motivating example: funds transfer between two banks


o Two phase commit would have the potential to block updates on the accounts involved in
funds transfer
o Alternative solution:
 Debit money from source account and send a message to other site
 Site receives message and credits destination account
 Messaging has long been used for distributed transactions (even before computers
were invented!)


 Atomicity issue
 once the transaction sending a message is committed, the message must be guaranteed to be delivered
 Guarantee as long as destination site is up and reachable, code to handle undeliverable
messages must also be available
-e.g. credit money back to source account.
 If sending transaction aborts, message must not be sent

Error Conditions with Persistent Messaging


 Code to handle messages has to take care of variety of failure situations (even assuming guaranteed
message delivery)
o E.g., if the destination account does not exist, a failure message must be sent back to the source site
o When failure message is received from destination site, or destination site itself does not
exist, money must be deposited back in source account
 Problem if source account has been closed
- get humans to take care of problem
 User code executing transaction processing using 2PC does not have to deal with such failures
 There are many situations where extra effort of error handling is worth the benefit of absence of
blocking
o Eg. pretty much all transactions across organizations

Persistent Messaging and Workflows

 Workflows provide a general model of transactional processing involving multiple sites and
possibly human processing of certain steps
o E.g. when a bank receives a loan application, it may need to
 Contact external credit-checking agencies
 Get approvals of one or more managers and then respond to the loan application
 We study workflows in Chapter 25
 Persistent messaging forms the underlying infrastructure for workflows in a distributed environment

Implementation of Persistent Messaging

 Sending site protocol.


o When a transaction wishes to send a persistent message, it writes a record containing the
message in a special relation messages_to_send; the message is given a unique message
identifier.
o A message delivery process monitors the relation, and when a new message is found, it
sends the message to its destination
o The message delivery process deletes a message from the relation only after it receives
an acknowledgment from the destination site.
 If it receives no acknowledgment from the destination site, after some time it
sends the message again. It repeats this until an acknowledgment is received.
 If, after some period of time, the message is still undeliverable, exception handling
code provided by the application is invoked to deal with the failure.
 Writing the message to a relation and processing it only after the transaction commits ensures that
the message will be delivered if and only if the transaction commits.


 Receiving site protocol.


o When a site receives a persistent message, it runs a transaction that adds the message to a
received _messages relation
 provided the message identifier is not already present in the relation
 After the transaction commits, or if the message was already present in the relation, the
receiving site sends an acknowledgment back to the sending site.
 Note that sending the acknowledgment before the transaction commits is not safe,
since system failure may then result in loss of the message.

o In many messaging systems, it is possible for messages to get delayed arbitrarily, although
such delays are very unlikely.
 Each message is given a timestamp, and if the timestamp of a received message is
older than some cutoff, the message is discarded.
 All messages recorded in the received messages relation that are older than the
cutoff can be deleted.
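A minimal Python sketch of the two protocols above (in-memory structures stand in for the messages_to_send and received_messages relations, and nothing here is a real messaging API):

import uuid

messages_to_send = {}       # message_id -> message, at the sending site
received_messages = set()   # message ids already processed, at the receiving site

def send_persistent(message):
    # Written by the sending transaction; delivered only after it commits.
    message_id = str(uuid.uuid4())
    messages_to_send[message_id] = message
    return message_id

def delivery_process(deliver):
    # Monitors messages_to_send; deletes a message only after an acknowledgment.
    for message_id, message in list(messages_to_send.items()):
        if deliver(message_id, message):      # retried until acknowledged
            del messages_to_send[message_id]

def receive_persistent(message_id, message, apply_update):
    # Runs as a transaction at the receiving site; duplicates are ignored.
    if message_id not in received_messages:
        apply_update(message)
        received_messages.add(message_id)
    return True   # acknowledgment, sent only after the transaction commits

mid = send_persistent({"credit": 100})                            # exactly-once intent
delivery_process(lambda i, m: receive_persistent(i, m, print))    # delivers and acks
receive_persistent(mid, {"credit": 100}, print)                   # duplicate: no effect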

5. What is Concurrency Control? Explain in detail about Distributed lock manager approaches with
example.
Concurrency Control

 Modify concurrency control schemes for use in distributed environment.


 We assume that each site participates in the execution of a commit protocol to ensure global
transaction atomicity.
 We assume all replicas of any item are updated
o Will see how to relax this in case of site failures later

Concurrency control-Data distribution

 In centralized database system


o only one possibility
 In distributed database system
o Several possibilities
 Fragmented
 Replicated
 Hybrid


Concurrency control approaches

 Single lock manager approach


 Distributed lock manager approach
 Primary copy protocol
 Majority protocol
 Biased protocol
 Quorum consensus protocol


Single lock manager approach

 Dedicated lock manager site


o A Site is chosen from among all the sites as the lock manager site.
 All locks will be granted by the lock manager site.

Single lock manager- Advantages and Disadvantages

 Advantages
 Implementation is simple
 Deadlock handling is simple
 Disadvantages
 The lock manager site becomes a bottleneck
 Lock manager site is vulnerable to failure
Distributed lock manager approach
 Lock managers of all sites involved
 Local data items are controlled by local lock managers
 Variants
o Primary copy protocol
o Majority protocol
o Biased protocol
o Quorum consensus protocol

1 Distributed lock manager - Primary copy


 Primary copy and primary site
 One replica of the data item is the primary copy, and the site that holds the primary
copy is the primary site.

Primary copy protocol – Example


Primary copy protocol- Advantages and disadvantages


 Advantages
 Handling of concurrency control on replicated data is like un- replicated data.
 Only 3 messages are needed to handle lock and unlock requests (1 lock request, 1 grant, and 1
unlock message) for both read and write.
 Disadvantages
 Single point-of-failure

2 Distributed lock manager - Majority protocol


 Local lock managers are responsible for issuing locks
 Assume that we have the data item Q which is replicated in several sites;
o A transaction that needs a lock on Q has to request and lock Q at half+one of the
sites in which Q is replicated (i.e., a majority of the sites in which Q is replicated)
o The lock managers of all the sites in which Q is replicated are responsible for
handling lock and unlock requests locally and individually.
o Irrespective of the lock type (read or write, i.e., shared or exclusive), we need to
lock half+one sites.
2 Majority protocol - Example

2 Majority protocol - Advantages and disadvantages

 Advantages
 Replicated data is handled in a decentralized manner; hence, no single point-of-failure
problem
 Disadvantages
 (n/2+1) lock, unlock, and grant messages


 Shared and exclusive locks with same complexity

3 Distributed lock manager - Biased protocol

 Local lock managers are responsible for issuing locks.


 If a data item Q is replicated over n sites, then a read lock (shared lock) request message must be
sent to any one of the sites in which Q is replicated, and a write lock (exclusive lock) request
message must be sent to all the sites in which Q is replicated.

3 Biased protocol -Example - Read locks and write locks


4 Distributed lock manager - Quorum consensus protocol

It works as follows:
1. The protocol assigns each site that holds a replica a weight.

2. For any data item, the protocol assigns a read quorum Qr and a write quorum Qw. Here, Qr and Qw are
two integers (sums of weights of some sites). These two integers are chosen according to the
following conditions, taken together:
 Qr + Qw > S - this rule avoids read-write conflicts (i.e., two transactions cannot read and write
concurrently)
 2 * Qw > S - this rule avoids write-write conflicts (i.e., two transactions cannot write
concurrently)
 Here, S is the total weight of all sites in which the data item is replicated.
o The total weight of the sites locked for a read operation must be >= Qr, and for a write operation >= Qw.
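A few lines of Python make the two quorum conditions concrete; this merely validates a quorum configuration against the rules above (it is not a lock manager, and the weights are hypothetical):

# Check the quorum consensus conditions Qr + Qw > S and 2*Qw > S.
def valid_quorums(site_weights, q_read, q_write):
    s = sum(site_weights)                  # S: total weight of all replica sites
    no_rw_conflict = q_read + q_write > s  # every read quorum overlaps every write quorum
    no_ww_conflict = 2 * q_write > s       # any two write quorums overlap
    return no_rw_conflict and no_ww_conflict

weights = [1, 1, 1, 1, 1]                  # 5 equally weighted replicas, S = 5
print(valid_quorums(weights, q_read=3, q_write=3))  # True: behaves like the majority protocol
print(valid_quorums(weights, q_read=1, q_write=5))  # True: read one, write all
print(valid_quorums(weights, q_read=2, q_write=2))  # False: two writers could run concurrently

Note that choosing Qr and Qw as a majority reduces to the majority protocol, and Qr = 1, Qw = S reduces to the biased "read one, write all" scheme.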
Timestamping

 Timestamp based concurrency-control protocols can be used in distributed systems


 Each transaction must be given a unique timestamp
 Main problem: how to generate a timestamp in a distributed fashion
o Each site generates a unique local timestamp using either a logical counter or the local clock.
o A globally unique timestamp is obtained by concatenating the unique local timestamp with the
unique site identifier.

 A site with a slow clock will assign smaller timestamps


o Still logically correct: serializability is not affected
o But: it “disadvantages” transactions from such a site
 To fix this problem
o Define within each site Si a logical clock (LCi), which generates the unique local timestamp
o Require that Si advance its logical clock whenever a request is received from a transaction
Ti with timestamp <x, y> and x is greater than the current value of LCi.


o In this case, site Si advances its logical clock to the value x + 1
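The rule above is essentially a Lamport-style logical clock. A minimal sketch follows (the class name is invented; timestamps are pairs of local counter and site identifier):

class SiteClock:
    def __init__(self, site_id):
        self.site_id = site_id    # unique site identifier, used as tie-breaker
        self.lc = 0               # logical counter LCi

    def next_timestamp(self):
        # Local timestamp concatenated with the site id is globally unique.
        self.lc += 1
        return (self.lc, self.site_id)

    def observe(self, ts):
        # On seeing a request with timestamp <x, y> where x > LCi, advance to x + 1.
        x, _ = ts
        if x > self.lc:
            self.lc = x + 1

s1, s2 = SiteClock(1), SiteClock(2)
t = s2.next_timestamp()       # (1, 2)
s1.observe(t)                 # the slow site catches up
print(s1.next_timestamp())    # (3, 1): ordered after (1, 2)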


Replication with Weak Consistency
 Many commercial databases support replication of data with weak degrees of consistency (i.e., without
a guarantee of serializability)
 E.g., master-slave replication: updates are performed at a single master site, and propagated to slave sites.
o Propagation is not part of the update transaction: it is decoupled
 May be immediately after transaction commits
 May be periodic
o Data may only be read at slave sites, not updated
 No need to obtain locks at any remote site
o Particularly useful for distributing information
 E.g. from central office to branch-office
o Also useful for running read-only queries offline from the main database

 Replicas should see a transaction-consistent snapshot of the database


o That is, a state of the database reflecting all effects of all transactions up to some point in the
serialization order, and no effects of any later transactions.
 Eg. Oracle provides a create snapshot statement to create a snapshot of a relation or a set of relations
at a remote site
o snapshot refresh either by recomputation or by incremental update
o Automatic refresh (continuous or periodic) or manual refresh

Multimaster and Lazy Replication

 With multimaster replication (also called update-anywhere replication) updates are permitted at any
replica, and are automatically propagated to all replicas
o Basic model in distributed databases, where transactions are unaware of the details of
replication, and database system propagates updates as part of the same transaction
 Coupled with 2 phase commit
 Many systems support lazy propagation where updates are transmitted after transaction commits
o Allows updates to occur even if some sites are disconnected from the network, but at the
cost of consistency
Deadlock Handling
Consider the following two transactions and history, with item X and transaction T1 at site S1, and item Y
and transaction T2 at site S2:

T1: write(X)        T2: write(Y)
    write(Y)            write(X)


 Result: deadlock which cannot be detected locally at either site

Centralized Approach:
 A global wait-for graph is constructed and maintained in a single site; the deadlock-detection
coordinator
o Real graph: Real, but unknown, state of the system.
o Constructed graph: approximation generated by the controller during the execution of its
algorithm.
 the global wait-for graph can be constructed when:
o A new edge is inserted in or removed from one of the local wait-for graphs.
o A number of changes have occurred in a local wait-for graph.
o the coordinator needs to invoke cycle-detection.
 If the coordinator finds a cycle, it selects a victim and notifies all sites. The sites roll back the victim
transaction.

Local and Global Wait-For Graphs


Example Wait-For Graph for False Cycles

Initial state:


False Cycles

 Suppose that starting from the state shown in figure,


1. T2 releases resources at S1
 resulting in a remove T1 → T2 message from the transaction manager at site S1
to the coordinator
2. and then T2 requests a resource held by T3 at site S2
 resulting in an insert T2 → T3 message from S2 to the coordinator

 Suppose further that the insert message reaches the coordinator before the delete message
o this can happen due to network delays
 The coordinator would then find a false cycle
T1  T2  T3  T1
 The false cycle above never existed in reality.
 False cycles cannot occur if two-phase locking is used.

Unnecessary Rollbacks

 Unnecessary rollbacks may result when deadlock has indeed occurred and a victim has been picked,
and meanwhile one of the transactions was aborted for reasons unrelated to the deadlock.
 Unnecessary rollbacks can result from false cycles in the global wait-for graph; however, likelihood
of false cycles is low.
Availability
 High availability: time for which system is not fully usable should be extremely low (e.g. 99.99%
availability)
 Robustness: ability of the system to function in spite of failures of components
 Failures are more likely in large distributed systems
 To be robust, a distributed system must
o Detect failures
o Reconfigure the system so computation may continue
o Recovery/reintegration when a site or link is repaired
 Failure detection: distinguishing link failure from site failure is hard
 (Partial) solution: have multiple links; failure of multiple links likely indicates a site failure

Reconfiguration
 Reconfiguration:
o Abort all transactions that were active at a failed site
 Making them wait could interfere with other transactions since they may hold locks
on other sites
 However, in case only some replicas of a data item failed, it may be possible to
continue transactions that had accessed data at a failed site (more on this later)

o If replicated data items were at failed site, update system catalog to remove them from the
list of replicas.


 If a failed site was a central server for some subsystem, an election must be held to determine the
new server
o E.g. name server, concurrency coordinator, global deadlock detector

 Since network partition may not be distinguishable from site failure, the following situations must
be avoided
o Two or more central servers elected in distinct partitions
o More than one partition updates a replicated data item
 Updates must be able to continue even if some sites are down
 Solution: majority based approach
o The alternative of “read one write all available” is tantalizing but causes problems

Majority-Based Approach
 The majority protocol for distributed concurrency control can be modified to work even if some
sites are unavailable
o Each replica of each item has a version number which is updated when the replica is
updated, as outlined below
o A lock request is sent to at least half the sites at which item replicas are stored, and the operation
continues only when a lock is obtained on a majority of the sites
o Read operations look at all replicas locked, and read the value from the replica with largest
version number
 May write this value and version number back to replicas with lower version
numbers (no need to obtain locks on all replicas for this task)
 Majority protocol (Cont.)
o Write operations
 find highest version number like reads, and set new version number to old highest
version + 1
 Writes are then performed on all locked replicas and version number on these
replicas is set to new version number
o Failures (network and site) cause no problems as long as
 Sites at commit contain a majority of replicas of any updated data items
 During reads a majority of replicas are available to find version number
 Subject to above, 2 phase commit can be used to update replicas
o Note: reads are guaranteed to see latest version of data item
o Reintegration is trivial: nothing needs to be done
 Quorum consensus algorithm can be similarly extended
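A compact Python sketch of majority-based reads and writes with version numbers (illustrative only: locking and two-phase commit are elided, and a site that is down is modeled as None):

# Each replica is a (version, value) pair, or None if its site is down.

def available_majority(replicas):
    available = [r for r in replicas if r is not None]
    if len(available) <= len(replicas) // 2:
        raise RuntimeError("no majority of replicas available")
    return available

def read_majority(replicas):
    locked = available_majority(replicas)
    return max(locked)                     # value with the largest version number

def write_majority(replicas, new_value):
    version, _ = max(available_majority(replicas))
    for i, r in enumerate(replicas):
        if r is not None:                  # write only to the available replicas
            replicas[i] = (version + 1, new_value)   # old highest version + 1

replicas = [(3, "a"), (3, "a"), None, (2, "old"), (3, "a")]   # one site down
write_majority(replicas, "b")
print(read_majority(replicas))             # (4, 'b'): reads see the latest version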

Read One Write All (Available)


 Biased protocol is a special case of quorum consensus
o Allows reads to read any one replica but updates require all replicas to be available at
commit time (called read one write all)
 Read one write all available (ignoring failed sites) is attractive, but incorrect
o A failed link may come back up without a disconnected site ever being aware that it was
disconnected
o The site then has old values, and a read from that site would return an incorrect value


o If the site was aware of the failure, reintegration could have been performed, but there is no way to
guarantee this
o With network partitioning, sites in each partition may update same item concurrently

 believing sites in other partitions have all failed

Site Reintegration
 When a failed site recovers, it must catch up with all updates that it missed while it was down
o Problem: updates may be happening to items whose replica is stored at the site while the
site is recovering
o Solution 1: halt all updates on system while reintegrating a site
 Unacceptable disruption
o Solution 2: lock all replicas of all data items at the site, update to latest version, then
release locks
 Other solutions with better concurrency also available
Comparison with Remote Backup
 Remote backup (hot spare) systems (Section 17.10) are also designed to provide high availability
 Remote backup systems are simpler and have lower overhead
o All actions performed at a single site, and only log records shipped
o No need for distributed concurrency control, or 2 phase commit
 Using distributed databases with replicas of data items can provide higher availability by having
multiple (> 2) replicas and using the majority protocol
o Also avoid failure detection and switchover time associated with remote backup systems
Coordinator Selection
 Backup coordinators
o site which maintains enough information locally to assume the role of coordinator if the
actual coordinator fails
o executes the same algorithms and maintains the same internal state information as the actual
coordinator
o Allows fast recovery from coordinator failure but involves overhead during normal
processing.
 Election algorithms
o used to elect a new coordinator in case of failures
o Example: Bully Algorithm - applicable to systems where every site can send a message to
every other site.

Bully Algorithm
 If site Si sends a request that is not answered by the coordinator within a time interval T, assume that
the coordinator has failed; Si tries to elect itself as the new coordinator.
 Si sends an election message to every site with a higher identification number, Si then waits for any
of these processes to answer within T.


 If no response within T, assume that all sites with numbers greater than i have failed, and Si elects itself
the new coordinator.
 If answer is received Si begins time interval T’, waiting to receive a message that a site with a higher
identification number has been elected.
 If no message is sent within T’, assume the site with the higher number has failed; Si restarts the
algorithm.

 After a failed site recovers, it immediately begins execution of the same algorithm.

 If there are no active sites with higher numbers, the recovered site forces all processes with lower
numbers to let it become the coordinator site, even if there is a currently active coordinator with a
lower number.
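A toy, single-process sketch of the election logic (real implementations exchange election and answer messages with timeouts T and T', which are elided here):

def bully_election(site_ids, alive, initiator):
    # Return the id of the new coordinator, as determined from `initiator`.
    higher = [s for s in site_ids if s > initiator and alive[s]]
    if not higher:
        return initiator       # no live site with a higher id: elect self
    # Otherwise some higher live site answers and runs the same algorithm.
    return bully_election(site_ids, alive, min(higher))

sites = [1, 2, 3, 4, 5]
alive = {1: True, 2: True, 3: True, 4: False, 5: False}   # sites 4 and 5 failed
print(bully_election(sites, alive, initiator=2))          # -> 3, the highest live site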

6. What is Consistency? Elaborate CAP theorem.

Trading Consistency for Availability

Consistency in Databases (ACID):


o Database has a set of integrity constraints
o A consistent database state is one where all integrity constraints are satisfied
o Each transaction run individually on a consistent database state must leave the database in a
consistent state
 Consistency in distributed systems with replication
o Strong consistency: a schedule with read and write operations on a replicated object should
give results and final state equivalent to some schedule on a single copy of the object, with
order of operations from a single site preserved
o Weak consistency (several forms)

Availability

 Traditionally, availability of centralized server


 For distributed systems, availability of system to process requests
o For large system, at almost any point in time there is a good chance that
 a node is down, or even
 the network is partitioned

 Distributed consensus algorithms will block during partitions to ensure consistency


o Many applications require continued operation even during a network partition
 Even at cost of consistency


Brewer’s CAP Theorem

 Three properties of a system


 Consistency (all copies have same value)
 Availability (system can run even if parts have failed)
o Via replication
 Partitions (network can break into two or more parts, each with active systems that can’t talk to
other parts)

 Brewer’s CAP “Theorem”: You can have at most two of these three properties for any system
 Very large systems will partition at some point
 Choose one of consistency or availability
o Traditional database choose consistency
o Most Web applications choose availability
 Except for specific parts such as order processing
Replication with Weak Consistency

 Many systems support replication of data with weak degrees of consistency (i.e., without a guarantee
of serializability)
o i.e., Qr + Qw <= S or 2 * Qw <= S

o Usually only when not enough sites are available to ensure quorum
 But sometimes to allow fast local reads

 Tradeoff of consistency versus availability or latency

 Key issues:

 Reads may get old versions


 Writes may occur in parallel, leading to inconsistent versions
Question: how to detect, and how to resolve
- Version vector scheme.

Eventual Consistency

 When no updates occur for a long period of time, eventually all updates will propagate through the
system and all the nodes will be consistent

 For a given accepted update and a given node, eventually either the update reaches the node or the
node is removed from service. This is known as BASE (Basically Available, Soft state, Eventual
consistency), as opposed to ACID.

 Soft state: copies of a data item may be inconsistent


 Eventually consistent - copies become consistent at some later time if there are no more updates
to that data item

Availability vs Latency

 CAP theorem only matters when there is a partition

o Even if partitions are rare, applications may trade off consistency for latency

o E.g. PNUTS allows inconsistent reads to reduce latency


o Critical for many applications
o But update protocol (via master) ensures consistency over availability

o Thus, there are two questions:


 If there is a partition, how does the system trade off availability for consistency?
 Otherwise, how does the system trade off latency for consistency?

7. How the queries will be processed in distributed system? Explain in detail with Query Processing
Strategies.
Distributed Query Processing

 For centralized systems, the primary criterion for measuring the cost of a particular strategy is the
number of disk accesses.
 In a distributed system, other issues must be taken into account:
o The cost of a data transmission over the network.
o The potential gain in performance from having several sites process parts of the query in
parallel.

Query Transformation

 Translating algebraic queries on fragments.


o It must be possible to construct relation r from its fragments
 Replace relation r by the expression to construct relation r from its fragments

o Consider the horizontal fragmentation of the account relation into

account1 = σ branch_name = “Hillside” (account)


account2 = σ branch_name = “Valleyview” (account)

o The query σ branch_name = “Hillside” (account ) becomes


σ branch_name = “Hillside” (account1 ∪ account2)

which is optimized into


σ branch_name = “Hillside” (account1) ∪ σ branch_name = “Hillside” (account2)

 Since account1, has only tuples pertaining to the Hillside branch, we can eliminate the selection
operation.

 Apply the definition of account2 to obtain σ branch_name = “Hillside” (σ branch_name = “Valleyview”
(account))

 This expression is the empty set regardless of the contents of the account relation.

 The final strategy is for the Hillside site to return account1 as the result of the query.

Simple Join Processing


o Consider the following relational algebra expression in which the three relations are neither replicated
nor fragmented:
account ⋈ depositor ⋈ branch
o account is stored at site S1
o depositor at S2
o branch at S3
o For a query issued at site S1, the system needs to produce the result at site S1

Possible Query Processing Strategies

 Ship copies of all three relations to site S1, and choose a strategy for processing the entire query locally at
site S1
 Ship a copy of the account relation to site S2, and compute temp1 = account ⋈ depositor at S2. Ship
temp1 from S2 to S3, and compute temp2 = temp1 ⋈ branch at S3. Ship the result temp2 to S1
 Devise similar strategies, exchanging the roles S1, S2, S3
 Must consider following factors:
o amount of data being shipped
o cost of transmitting a data block between sites
o relative processing speed at each site

Semijoin Strategy

 Let r1 be a relation with schema R1 stored at site S1
 Let r2 be a relation with schema R2 stored at site S2
 Evaluate the expression r1 ⋈ r2 and obtain the result at S1.
1. Compute temp1 ← ∏R1 ∩ R2 (r1) at S1.
2. Ship temp1 from S1 to S2.
3. Compute temp2 ← r2 ⋈ temp1 at S2


4. Ship temp2 from S2 to S1.
5. Compute r1 ⋈ temp2 at S1. This is the same as r1 ⋈ r2.

Formal Definition

 The semijoin of r1 with r2 is denoted by:
r1 ⋉ r2
 It is defined by:
∏R1 (r1 ⋈ r2)
 Thus, r1 ⋉ r2 selects those tuples of r1 that contributed to r1 ⋈ r2.
 In step 3 above, temp2 = r2 ⋉ r1.
 For joins of several relations, the above strategy can be extended to a series of semijoin steps.
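A small Python sketch of the five steps (relations as lists of dicts; only the join column is "shipped" in step 2, and only the matching tuples in step 4):

def semijoin_strategy(r1, r2, join_attr):
    temp1 = {t[join_attr] for t in r1}                # step 1: project join column at S1
    # step 2: ship temp1 (small) from S1 to S2
    temp2 = [t for t in r2 if t[join_attr] in temp1]  # step 3: r2 semijoin temp1 at S2
    # step 4: ship temp2 (matching tuples only) back to S1
    by_key = {}                                       # step 5: r1 join temp2 at S1
    for t in temp2:
        by_key.setdefault(t[join_attr], []).append(t)
    return [{**t1, **t2} for t1 in r1 for t2 in by_key.get(t1[join_attr], [])]

r1 = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
r2 = [{"k": 2, "b": "p"}, {"k": 9, "b": "q"}]         # the k=9 tuple is never shipped
print(semijoin_strategy(r1, r2, "k"))                 # [{'k': 2, 'a': 'y', 'b': 'p'}]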

Join Strategies that Exploit Parallelism

Heterogeneous Distributed Databases

 Many database applications require data from a variety of preexisting databases located in a
heterogeneous collection of hardware and software platforms
 Data models may differ (hierarchical, relational, etc.)
 Transaction commit protocols may be incompatible
 Concurrency control may be based on different techniques (locking, timestamping, etc.)
 System-level details almost certainly are totally incompatible.
 A multidatabase system is a software layer on top of existing database systems, which is designed
to manipulate information in heterogeneous databases
o Creates an illusion of logical database integration without any physical database integration

Advantages

 Preservation of investment in existing


o Hardware
o system software
o Applications
 Local autonomy and administrative control
 Allows use of special-purpose DBMSs
 Step towards a unified homogeneous DBMS
o Full integration into a homogeneous DBMS faces
 Technical difficulties and cost of conversion
 Organizational/political difficulties
-Organizations do not want to give up control on their data
-Local databases wish to retain a great deal of autonomy

Unified View of Data

 Agreement on a common data model


o Typically the relational model
 Agreement on a common conceptual schema
o Different names for same relation/attribute
o Same relation/attribute name means different things
 Agreement on a single representation of shared data
o Eg. data types, precision,
o Character sets
 ASCII vs EBCDIC
 Sort order variations
 Agreement on units of measure
 Variations in names
o E.g. Köln vs Cologne, Mumbai vs Bombay

Query Processing

 Several issues in query processing in a heterogeneous database


 Schema translation
o Write a wrapper for each data source to translate data to a global schema
o Wrappers must also translate updates on global schema to updates on local schema
 Limited query capabilities
o Some data sources allow only restricted forms of selections
 E.g. web forms, flat file data sources
 Queries have to be broken up and processed partly at the source and partly at a different site
 Removal of duplicate information when sites have overlapping information
o Decide which sites to execute query
 Global query optimization


Mediator Systems

 Mediator systems are systems that integrate multiple heterogeneous data sources by providing an
integrated global view, and providing query facilities on global view
o Unlike full-fledged multidatabase systems, mediators generally do not bother about
transaction processing
o But the terms mediator and multidatabase are sometimes used interchangeably
o The term virtual database is also used to refer to mediator/multidatabase systems

Transaction Management in Multidatabases

 Local transactions are executed by each local DBMS, outside of the MDBS system control.

 Global transactions are executed under multidatabase control.


 Local autonomy - local DBMSs cannot communicate directly to synchronize global transaction
execution and the multidatabase has no control over local transaction execution.
o local concurrency control scheme needed to ensure that DBMS schedule is serializable
o in case of locking, the DBMS must be able to guard against local deadlocks.
o need additional mechanisms to ensure global serializability

Local vs. Global Serializability

 The guarantee of local serializability is not sufficient to ensure global serializability.


o As an illustration, consider two global transactions T1 and T2 , each of which accesses and
updates two data items, A and B, located at sites S1 and S2 respectively.
o It is possible to have a situation where, at site S1, T2 follows T1, whereas, at S2, T1 follows
T2, resulting in a non-serializable global schedule.
 If the local systems permit control of locking behavior and all systems follow two-phase locking
o the multidatabase system can ensure that global transactions lock in a two-phase manner
o The lock points of conflicting transactions would then define their global serialization order.

UNIT II
SPATIAL AND TEMPORAL DATABASES
SPATIAL AND TEMPORAL DATABASES INTRODUCTION:
What is spatial database in DBMS?
A spatial database is a general-purpose database (usually a relational database) that has been
enhanced to include spatial data that represents objects defined in a geometric space, along with
tools for querying and analyzing such data.
What is temporal database in DBMS?
A temporal database is a database that has certain features that support time-sensitive status
for entries. Where some databases are considered current databases and only support factual data
considered valid at the time of use, a temporal database can establish at what times certain entries
are accurate.
A temporal database stores data relating to time instances. It offers temporal data types and stores
information relating to past, present and future time.
A spatiotemporal database is a database that manages both space and time information.
What is active database system?
An active database is a database that includes an event-driven architecture (often in the form
of ECA rules) which can respond to conditions both inside and outside the database. Possible
uses include security monitoring, alerting, statistics gathering and authorization.
Which model is used to implement active databases?
The model that has been used to specify active database rules is referred to as the Event-
Condition-Action (ECA) model.
1. Explain in detail about active database model. or
List the three components in ECA Model and explain how to create trigger for a
relation.
ACTIVE DATABASE MODEL
The model that has been used to specify active database rules is referred to as the event-condition-
action (ECA) model. A rule in the ECA model has three components:
1. The event(s) that triggers the rule: These events are usually database update operations that are
explicitly applied to the database. However, in the general model, they could also be temporal
events or other kinds of external events.
2. The condition that determines whether the rule action should be executed: Once the triggering
event has occurred, an optional condition may be evaluated. If no condition is specified, the action
will be executed once the event occurs. If a condition is specified, it is first evaluated, and only if
it evaluates to true will the rule action be executed.
3. The action to be taken: The action is usually a sequence of SQL statements, but it could also be
a database transaction or an external program that will be automatically executed.
TRIGGER: A database trigger is procedural code that is automatically executed in response to
certain events on a particular table or view in a database. The trigger is mostly used for maintaining
the integrity of the information on the database.

 The basic events that can be specified for triggering the active rules are the standard SQL
update commands: INSERT, DELETE, and UPDATE. They are specified by the keywords
INSERT, DELETE, and UPDATE in Oracle notation.
 The keywords NEW and OLD are used in Oracle notation; NEW is used to refer to a newly
inserted or newly updated tuple, whereas OLD is used to refer to a deleted tuple or to a
tuple before it was updated.
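The ECA model can be illustrated with a toy rule engine in Python (all names here are invented for the sketch; real systems express the same idea with SQL trigger syntax, using NEW and OLD as described above):

# Toy ECA engine: a rule fires on an event, evaluates its optional
# condition, and only then runs its action. NEW/OLD tuple images are
# passed to both the condition and the action.

rules = []

def rule(event, condition=None):
    def register(action):
        rules.append((event, condition, action))
        return action
    return register

def signal(event, new=None, old=None):
    for ev, cond, action in rules:
        if ev == event and (cond is None or cond(new, old)):
            action(new, old)

@rule("UPDATE employee", condition=lambda new, old: new["salary"] > old["salary"] * 1.5)
def flag_large_raise(new, old):
    print(f"audit: suspicious raise for {new['name']}")

signal("UPDATE employee",
       new={"name": "Smith", "salary": 9000},
       old={"name": "Smith", "salary": 5000})   # condition true, so the action runs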

DESIGN AND IMPLEMENTATION ISSUES FOR ACTIVE DATABASES


The first issue concerns activation, deactivation, and grouping of rules. In addition to
creating rules, an active database system should allow users to activate, deactivate, and drop rules

by referring to their rule names. A deactivated rule will not be triggered by the triggering event.
This feature allows users to selectively deactivate rules for certain periods of time when they are
not needed. The activate command will make the rule active again. The drop command deletes the
rule from the system. Another option is to group rules into named rule sets, so the whole set of
rules can be activated, deactivated, or dropped. It is also useful to have a command that can trigger
a rule or rule set via an explicit PROCESS RULES command issued by the user.
The second issue concerns whether the triggered action should be executed before, after, instead
of, or concurrently with the triggering event. A before trigger executes the trigger before executing
the event that caused the trigger. It can be used in applications such as checking for constraint
violations. An after trigger executes the trigger after executing the event, and it can be used in
applications such as maintaining derived data and monitoring for specific events and conditions.
An instead of trigger executes the trigger instead of executing the event, and it can be used in
applications such as executing corresponding updates on base relations in response to an event that
is an update of a view.
Let us assume that the triggering event occurs as part of a transaction execution. We should first
consider the various options for how the triggering event is related to the evaluation of the rule’s
condition. The rule condition evaluation is also known as rule consideration, since the action is to
be executed only after considering whether the condition evaluates to true or false. There are three
main possibilities for rule consideration:
1. Immediate consideration: The condition is evaluated as part of the same transaction as the
triggering event and is evaluated immediately. This case can be further categorized into three
options: Evaluate the condition before executing the triggering event. Evaluate the condition after
executing the triggering event. Evaluate the condition instead of executing the triggering event.
2. Deferred consideration: The condition is evaluated at the end of the transaction that included
the triggering event. In this case, there could be many triggered rules waiting to have their
conditions evaluated.
3. Detached consideration: The condition is evaluated as a separate transaction, spawned from
the triggering transaction.
Most active systems use the first option. That is, as soon as the condition is evaluated, if it returns
true, the action is immediately executed.
Another issue concerning active database rules is the distinction between row-level rules and
statement-level rules. The SQL-99 standard and the Oracle system allow the user to choose which
of the options is to be used for each rule, whereas STARBURST uses statement-level semantics
only.
In STARBURST, the basic events that can be specified for triggering the rules are the standard
SQL update commands: INSERT, DELETE, and UPDATE. These are specified by the keywords
INSERTED, DELETED, and UPDATED in STARBURST notation. Second, the rule designer
needs to have a way to refer to the tuples that have been modified. The keywords INSERTED,
DELETED, NEW-UPDATED, and OLD-UPDATED are used in STARBURST notation to refer
to four transition tables (relations) that include the newly inserted tuples, the deleted tuples, the
updated tuples before they were updated, and the updated tuples after they were updated,
respectively.

In statement-level semantics, the rule designer can only refer to the transition tables as a whole
and the rule is triggered only once, so the rules must be written differently than for row-level
semantics, because multiple employee tuples may be inserted in a single insert statement.
POTENTIAL APPLICATIONS FOR ACTIVE DATABASES
One important application is to allow notification of certain conditions that occur. For example,
an active database may be used to monitor, say, the temperature of an industrial furnace.
Active rules can also be used to enforce integrity constraints by specifying the types of events
that may cause the constraints to be violated and then evaluating appropriate conditions that check
whether the constraints are actually violated by the event or not.
Other applications include the automatic maintenance of derived data, such as keeping materialized
views consistent whenever the base relations are modified.
2. Explain in detail about Temporal Databases.
TEMPORAL DATABASE
 Temporal databases, in the broadest sense, encompass all database applications that require
some aspect of time when organizing their information. Hence, they provide a good
example to illustrate the need for developing a set of unifying concepts for application
developers to use.
 Temporal database applications have been developed since the early days of database
usage. There are many examples of applications where some aspect of time is needed to
maintain the information in a database. These include healthcare, where patient histories
need to be maintained; insurance, where claims and accident histories are required as well
as information about the times when insurance policies are in effect; reservation systems
in general (hotel, airline, car rental, train, and so on), where information on the dates and
times when reservations are in effect are required; scientific databases, where data collected
from experiments includes the time when each data item was measured; and so on.
 A temporal relation is one where each tuple has an associated time when it is true; the
time may be either valid time or transaction time.
 Both valid time and transaction time can be stored, in which case the relation is said to be
a bitemporal relation.
TIME SPECIFICATION IN SQL:
 The SQL standard defines the types date, time, and timestamp.
 The type date contains four digits for the year (1–9999), two digits for the month (1–12),
and two digits for the date (1–31).
 The type time contains two digits for the hour, two digits for the minute, and two digits for
the second, plus optional fractional digits.
 The seconds field can go beyond 60, to allow for leap seconds that are added during some
years to correct for small variations in the speed of rotation of Earth.
 The type timestamp contains the fields of date and time, with six fractional digits for the
seconds field.
 The Universal Coordinated Time (UTC) is a standard reference point for specifying time,
with local times defined as offsets from UTC.
 SQL also supports two types, time with time zone, and timestamp with time zone, which
specify the time as a local time plus the offset of the local time from UTC.
 SQL supports a type called interval, which allows us to refer to a period of time such as “1
day” or “2 days and 5 hours,” without specifying a particular time when this period starts.
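A short sketch of these types in use (the policy table and its column names are made up for the
example):

CREATE TABLE policy (
  policy_id   INTEGER,
  valid_from  DATE,                      -- e.g., DATE '2024-01-01'
  valid_to    DATE,
  recorded_at TIMESTAMP WITH TIME ZONE   -- local time plus offset from UTC
);

-- Policies in effect for more than 30 days, using an interval value:
SELECT policy_id
FROM policy
WHERE valid_to > valid_from + INTERVAL '30' DAY;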
TEMPORAL QUERY LANGUAGES
 A database relation without temporal information is sometimes called a snapshot relation,
since it reflects the state in a snapshot of the real world. The snapshot operation on a
temporal relation gives the snapshot of the relation at a specified time (or the current time,
if the time is not specified).
 A temporal selection is a selection that involves the time attributes; a temporal projection
is a projection where the tuples in the projection inherit their times from the tuples in the
original relation.
 A temporal join is a join in which the time of a tuple in the result is the intersection of the
times of the tuples from which it is derived; if the times do not intersect, the tuple is removed
from the result (see the sketch after this list).
 The predicates precedes, overlaps, and contains can be applied on intervals; their meanings
should be clear. The intersect operation can be applied on two intervals, to give a single
(possibly empty) interval. However, the union of two intervals may or may not be a single
interval.
 A temporal functional dependency X → Y holds on a relation schema R if, for all legal
instances r of R, all snapshots of r satisfy the functional dependency X → Y.
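A minimal sketch of a temporal selection and a temporal join, using SQL's standard OVERLAPS
predicate (the emp_history/dept_history schemas are illustrative; GREATEST and LEAST, used here to
form the intersection interval, are widely available but not part of standard SQL):

-- Temporal selection: snapshot of the relation as of a given date.
SELECT * FROM emp_history
WHERE valid_from <= DATE '2024-01-01' AND valid_to > DATE '2024-01-01';

-- Temporal join: each result tuple carries the intersection of the input periods.
SELECT e.emp_name, d.dept_name,
       GREATEST(e.valid_from, d.valid_from) AS valid_from,
       LEAST(e.valid_to, d.valid_to)        AS valid_to
FROM emp_history e
JOIN dept_history d
  ON e.dept_id = d.dept_id
 AND (e.valid_from, e.valid_to) OVERLAPS (d.valid_from, d.valid_to);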
3. Explain Spatial Data Types, Spatial Operators and Queries with example.
SPATIAL DATABASE:
 Spatial data include geographic data, such as maps and associated information, and
computer-aided-design data, such as integrated circuit designs or building designs.
Applications of spatial data initially stored data as files in a file system, as did early-
generation business applications.
 Spatial databases incorporate functionality that provides support for databases that keep
track of objects in a multidimensional space.
 The systems that manage geographic data and related applications are known as geographic
information systems (GISs), and they are used in areas such as environmental applications,
transportation systems, emergency response systems, and battle management.
 Other databases, such as meteorological databases for weather information, are three-
dimensional, since temperatures and other meteorological information are related to three-
dimensional spatial points.
 In general, a spatial database stores objects that have spatial characteristics that describe
them and that have spatial relationships among them.
 The spatial relationships among the objects are important, and they are often needed when
querying the database.
 A spatial database is optimized to store and query data related to objects in space, including
points, lines and polygons. Satellite images are a prominent example of spatial data.
 Queries posed on these spatial data, where predicates for selection deal with spatial
parameters, are called spatial queries. For example, “What are the names of all bookstores
within five miles of the College of Computing building at Georgia Tech?” is a spatial query.
Common Types of Analysis for Spatial Data: (figure not reproduced)
SPATIAL DATA TYPES AND MODELS
Spatial data comes in three basic forms. These forms have become a de facto standard due to their
wide use in commercial systems.
 Map data includes various geographic or spatial features of objects in a map, such as an
object’s shape and the location of the object within the map. The three basic types of
features are points, lines, and polygons (or areas). Points are used to represent spatial
characteristics of objects whose locations correspond to a single 2-D coordinate (x, y, or
longitude/latitude) in the scale of a particular application. Lines represent objects having
length, such as roads or rivers, whose spatial characteristics can be approximated by a
sequence of connected lines. Polygons are used to represent spatial characteristics of
objects that have a boundary, such as countries, states, lakes, or cities.
 Attribute data is the descriptive data that GIS systems associate with map features. For
example, suppose that a map contains features that represent counties within a U.S. state
(such as Texas or Oregon); the attribute data for each county might include its name,
population, and county seat. Other attribute data could be included for other features in the
map, such as states, cities, congressional districts, census tracts, and so on.
 Image data includes data such as satellite images and aerial photographs, which are
typically created by cameras. Objects of interest, such as buildings and roads, can be
identified and overlaid on these images. Images can also be attributes of map features. One
can add images to other map features so that clicking on the feature would display the
image. Aerial and satellite images are typical examples of raster data.
Models of spatial information are sometimes grouped into two broad categories: field and object.
Field models are often used to model spatial data that is continuous in nature, such as terrain
elevation, temperature data, and soil variation characteristics, whereas object models have
traditionally been used for applications such as transportation networks, land parcels, buildings,
and other objects that possess both spatial and non-spatial attributes.
SPATIAL OPERATORS AND SPATIAL QUERIES
Spatial operators are used to capture all the relevant geometric properties of objects embedded in
the physical space and the relations between them, as well as to perform spatial analysis. Operators
are classified into three broad categories.
 Topological operators. Topological properties are invariant when topological
transformations are applied. Topological operators are hierarchically structured in several
levels, where the base level offers operators the ability to check for detailed topological
relations between regions with a broad boundary, and the higher levels offer more abstract
operators that allow users to query uncertain spatial data independent of the underlying
geometric data model. Examples include open (region), close (region), and inside (point,
loop).
 Projective operators. Projective operators, such as convex hull, are used to express
predicates about the concavity/convexity of objects as well as other spatial relations (for
example, being inside the concavity of a given object).
 Metric operators. Metric operators provide a more specific description of the object’s
geometry. They are used to measure some global properties of single objects (such as the
area, relative size of an object’s parts, compactness, and symmetry), and to measure the
relative position of different objects in terms of distance and direction. Examples include
length (arc) and distance (point, point).
 Dynamic Spatial Operators. Dynamic operations alter the objects upon which the
operations act. The three fundamental dynamic operations are create, destroy, and update.
SPATIAL QUERIES
Spatial queries are requests for spatial data that require the use of spatial operations. The following
categories illustrate three typical types of spatial queries:
 Nearness queries request objects that lie near a specified location. A query to find all
restaurants that lie within a given distance of a given point is an example of a nearness
query (see the sketch after this list). The nearest-neighbour query requests the object that is
nearest to a specified point. For example, we may want to find the nearest gasoline station.
 Region queries deal with spatial regions. Such a query can ask for objects that lie partially
or fully inside a specified region. A query to find all retail shops within the geographic
boundaries of a given town is an example.
 Spatial joins or overlays. Typically joins the objects of two types based on some spatial
condition, such as the objects intersecting or overlapping spatially or being within a certain
distance of one another. For example, find all townships located on a major highway
between two cities or find all homes that are within two miles of a lake.
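The first two query types above can be sketched in SQL using PostGIS-style spatial functions
(ST_DWithin, ST_Within, ST_MakePoint, and the geography cast are PostGIS names, not standard SQL;
the bookstore/shop/town tables are made up for the example):

-- Nearness query: bookstores within five miles (about 8 km) of a given point.
SELECT name
FROM bookstore
WHERE ST_DWithin(location::geography,
                 ST_SetSRID(ST_MakePoint(-84.3963, 33.7756), 4326)::geography,
                 8047);   -- distance in meters

-- Region query: retail shops inside a town's boundary polygon.
SELECT s.name
FROM shop s, town t
WHERE t.name = 'Springfield'
  AND ST_Within(s.location, t.boundary);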
SPATIAL DATA INDEXING
Indices are required for efficient access to spatial data. Traditional index structures, such as hash
indices and B-trees, are not suitable, since they deal only with one-dimensional data, whereas
spatial data are typically of two or more dimensions.
 k-d Trees
A tree structure called a k-d tree was one of the early structures used for indexing in
multiple dimensions. Each level of a k-d tree partitions the space into two. The partitioning
is done along one dimension at the node at the top level of the tree, along another dimension
in nodes at the next level, and so on, cycling through the dimensions. The partitioning
proceeds in such a way that, at each node, approximately one-half of the points stored in
the subtree fall on one side and one-half fall on the other. Partitioning stops when a node
has less than a given maximum number of points.
Figure 25.4 shows a set of points in two-dimensional space, and a k-d tree representation
of the set of points. Each line corresponds to a node in the tree, and the maximum number
of points in a leaf node has been set at 1. Each line in the figure (other than the outside box)
corresponds to a node in the k-d tree. The numbering of the lines in the figure indicates the
level of the tree at which the corresponding node appears. The k-d-B tree extends the k-d
tree to allow multiple child nodes for each internal node, just as a B-tree extends a binary
tree, to reduce the height of the tree. k-d-B trees are better suited for secondary storage than
k-d trees.
 Quadtrees
An alternative representation for two-dimensional data is a quadtree.
An example of the division of space by a quadtree appears in Figure 25.5. The set of points
is the same as that in Figure 25.4. Each node of a quadtree is associated with a rectangular
region of space. The top node is associated with the entire target space. Each non leaf node
in a quadtree divides its region into four equal-sized quadrants, and correspondingly each
such node has four child nodes corresponding to the four quadrants. Leaf nodes have
between zero and some fixed maximum number of points. Correspondingly, if the region
corresponding to a node has more than the maximum number of points, child nodes are
created for that node. In the example in Figure 25.5, the maximum number of points in a
leaf node is set to 1.
This type of quadtree is called a PR quadtree, to indicate that it stores points and that the
division of space is based on regions rather than on the actual set of points stored.
 R-Trees
 A storage structure called an R-tree is useful for indexing of objects such as points,
line segments, rectangles, and other polygons.
 An R-tree is a balanced tree structure with the indexed objects stored in leaf nodes,
much like a B+-tree. However, instead of a range of values, a rectangular bounding
box is associated with each tree node.
 The bounding box of a leaf node is the smallest rectangle parallel to the axes that
contains all objects stored in the leaf node.
 Each internal node stores the bounding boxes of the child nodes along with the
pointers to the child nodes.
 Each leaf node stores the indexed objects, and may optionally store the bounding
boxes of the objects; the bounding boxes help speed up checks for overlaps of the
rectangle with the indexed objects—if a query rectangle does not overlap with the
bounding box of an object, it cannot overlap with the object, either.
The R-tree itself is at the right side of Figure 25.6, which denotes the bounding box i as BBi.
Comparison with Quad-trees:
 Tiling level optimization is required in Quad-trees whereas in R-tree doesn’t require
any such optimization.
 Quad-trees can be implemented on top of an existing B-tree, whereas R-trees follow a
different structure from a B-tree.
 Spatial index creation in Quad-trees is faster as compared to R-trees.
 R-trees are faster than Quad-trees for Nearest Neighbour queries while for window
queries, Quad-trees are faster than R-trees.
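In practice, many systems expose such multidimensional indexes through SQL. In PostGIS, for
example, an R-tree-style spatial index is built over the GiST framework (the table and index names
here are illustrative):

CREATE INDEX shop_location_idx ON shop USING GIST (location);
-- The planner can then use this index for spatial predicates such as ST_DWithin or ST_Within.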
SPATIAL DATA MINING
Spatial data tends to be highly correlated. For example, people with similar characteristics,
occupations, and backgrounds tend to cluster together in the same neighbourhoods. The three
major spatial data mining techniques are spatial classification, spatial association, and spatial
clustering.
 Spatial classification. The goal of classification is to estimate the value of an attribute of
a relation based on the value of the relation’s other attributes. An example of the spatial
classification problem is determining the locations of nests in a wetland based on the value
of other attributes (for example, vegetation durability and water depth); it is also called the
location prediction problem. Similarly, where to expect hotspots in crime activity is also a
location prediction problem.
 Spatial association. Spatial association rules are defined in terms of spatial predicates
rather than items. A spatial association rule is of the form P1 ∧ P2 ∧ … ∧ Pn ⇒ Q1 ∧ Q2 ∧ … ∧ Qm,
where at least one of the Pi's or Qj's is a spatial predicate; for example,
is_a(X, house) ∧ close_to(X, beach) ⇒ is_expensive(X).
 Spatial clustering attempts to group database objects so that the most similar objects are
in the same cluster, and objects in different clusters are as dissimilar as possible. An
example of a spatial clustering algorithm is density-based clustering, which tries to find
clusters based on the density of data points in a region.
APPLICATIONS OF SPATIAL DATA
 Spatial data management is useful in many disciplines, including geography, remote
sensing, urban planning, and natural resource management.
 Spatial database management is playing an important role in the solution of challenging
scientific problems such as global climate change and genomics.
 Due to the spatial nature of genome data, GIS and spatial database management systems
have a large role to play in the area of bioinformatics.
 Some of the typical applications include pattern recognition, genome browser
development, and visualization maps.
 Another important application area of spatial data mining is the spatial outlier detection.
Detecting spatial outliers is useful in many applications of geographic information systems
and spatial databases.
 These application domains include transportation, ecology, public safety, public health,
climatology, and location-based services.
4. How are Location and Handoff Management handled in mobile databases? Explain with diagram.
MOBILE DATABASES:
Mobile computing has proved useful in many applications. Many business travelers use laptop
computers so that they can work and access data en route.
Wireless computing creates a situation where machines no longer have fixed locations and network
addresses. Location-dependent queries are an interesting class of queries that are motivated by
mobile computers; in such queries, the location of the user (computer) is a parameter of the query.
The value of the location parameter is provided by a global positioning system (GPS).
The mobile-computing environment consists of mobile computers, referred to as mobile hosts, and
a wired network of computers. Mobile hosts communicate with the wired network via computers
referred to as mobile support stations. Each mobile support station manages those mobile hosts
within its cell— that is, the geographical area that it covers. Mobile hosts may move between cells,
thus necessitating a handoff of control from one mobile support station to another. It is possible
for mobile hosts to communicate directly without the intervention of a mobile support station.
However, such communication can occur only between nearby hosts.
LOCATION AND HANDOFF MANAGEMENT:
Location management:
 Search: find a mobile user’s current location
 Update (Register): update a mobile user’s location
 Location info: maintained at various granularities (a single cell vs. a group of cells called a
registration area)
 Research issue: organization of location databases, e.g., Global Systems for Mobile (GSM) vs.
Mobile IP vs. Wireless Mesh Networks (WMN)
The location management procedure is invoked to identify a unit's new location; it involves three
tasks:
(a) location update,
(b) location lookup, and
(c) paging.
In location update, which is initiated by the mobile unit, the current location of the unit is
recorded in the HLR (Home Location Register) and VLR (Visitor Location Register) databases.
Location lookup is basically a database search to obtain the current location of the mobile unit,
and through paging the system informs the caller of the location of the called unit in terms of
its current base station.
 These two tasks are initiated by the MSC (Mobile Switching Center). The cost of update and
paging increases as cell size decreases, which becomes quite significant for finer granularity
cells such as micro- or picocell clusters.
 The presence of frequent cell crossing, which is a common scenario in highly commuting
zones, further adds to the cost.
 The system creates location areas and paging areas to minimize the cost.
 A number of neighboring cells are grouped together to form a location area, and the paging
area is constructed in a similar way.
HANDOFF MANAGEMENT
Handoff management ensures that a mobile user remains connected while moving from one location
(e.g., cell) to another; packets or connections are routed to the new location. Its main tasks are
to:
 Decide when to hand off to a new access point (AP)
 Select a new AP from among several APs
 Acquire resources such as bandwidth channels (GSM), or a new IP address (Mobile IP)
 Channel allocation is a research issue: goal may be to maximize channel usage, satisfy
QoS, or maximize revenue generated
 Inform the old AP to reroute packets and also to transfer state information to the new AP.
 Packets are routed to the new AP.
TRADEOFF IN LOCATION MANAGEMENT
 The network may only know the approximate location of a mobile user.
 By location update (or location registration), the network is informed of the location of a
mobile user.
 By location search (or terminal paging), the network finds the location of a mobile user.
 A tradeoff exists between location update and search:
 When the user is not called often (i.e., the service arrival rate is low), resources are wasted
by frequent updates.
 If updates are not done and a call comes, bandwidth and time are wasted in searching.
MOBILE TRANSACTION MODELS:
1. Report and Co-transaction model
 This model views a mobile transaction as a collection of subtransactions, following either the
nested or the open nested transaction model.
 In a nested transaction, a parent transaction spawns child transactions, which supports more
adaptability than a flat atomic transaction. Results are not shared between parent and child
transactions while they are executing. The model allows a hierarchy of transaction nesting
levels and follows a bottom-up approach toward the root.
 When a child transaction executes successfully, the objects changed by it can be easily
obtained by its parent; the changes are made permanent in the database only when the parent
(root) transaction executes successfully. This model arranges mobile transactions into the
following four types:
a. Atomic transactions:
These are associated with the usual events Begin, Commit, and Abort, and have the normal commit
and abort properties.
b. Non-compensatable transactions:
These are not linked with a compensating transaction. They can execute at any time, and the
parents of these transactions have the responsibility to commit or abort them.
c. Reporting transactions:
 A report can be regarded as a delegation of state between transactions. A reporting transaction
need not delegate all of its results to its parent transaction.
 It has only one receiver at any time during execution.
 The update becomes permanent only if the receiving parent transaction commits; if the receiving
parent transaction terminates unsuccessfully, the corresponding reporting transaction aborts.
d. Co-transactions:
These transactions execute like procedures: when one transaction executes, control passes from the
current transaction to the other while results are shared. Either both transactions commit or both
fail.
2. Kangaroo transaction model
 This model was proposed by Dunham to capture both the movement behaviour and the data
behaviour of a transaction as a mobile host moves from one cell to another in a static
network. It is named after the way a transaction hops from one base station to another in a
mobile environment.
 The model builds on the abstract idea of global and split transactions in a multidatabase
environment. A Data Access Agent (DAA) at each base station is used for accessing local and
global databases. The DAA accepts transaction requests from a mobile user and forwards the
requests to the corresponding database servers.
 These transactions are committed on the servers. The DAA acts as a mobile transaction manager
and data access coordinator.
 A Kangaroo transaction has a unique identification number composed of the base station number
and a unique sequence number within that base station. When the mobile unit changes location
from one cell to another, control of the Kangaroo transaction passes to a new DAA at the other
base station, which produces a new Joey transaction (a subtransaction of the Kangaroo
transaction).
a. Clustering model:
 This model was proposed by Pitoura; it assumes a fully distributed system and is considered an
open nested transaction model. The model is based on grouping semantically related or
physically close data together to form clusters. Clusters can be characterized statically or
dynamically.
 A transaction from a mobile host is composed of a set of weak and strict transactions, based on
the consistency requirements. A weak transaction consists only of weak read and weak write
operations, which can access data only within its cluster.
b. Isolation-only model:
 This model was proposed by Satyanarayanan and is used in the Coda file system. Coda is a
distributed file system that provides disconnected operation for mobile clients through file
hoarding and concurrency control.
 Here, transactions are chronological sequences of file access operations.
 Like the clustering model, transactions are arranged in two categories:
 First-class transactions, which do not contain any disconnected file accesses
 Second-class transactions, which are carried out under disconnection
 A first-class transaction commits without delay after being executed, whereas a second-class
transaction goes into a pending state and waits for validation. When reconnection becomes
possible, second-class transactions are validated according to the desired consistency
criteria. If validation is successful, results are integrated and committed; otherwise the
transactions enter a resolution state.
c. Two-Tier transaction model
 This model was proposed by Gray and is also called the Base-Tentative model. It is grounded
on a data replication scheme: for each object, there is a master data copy and various
replicated copies. As in the clustering and isolation-only models, transactions are arranged
in two categories: base and tentative. Base transactions operate on the master copy, whereas
tentative transactions access the replicated copies.
 While the mobile host is disconnected, tentative transactions modify the replicated data copy.
d. Multi database transaction model:
 This model is based on a framework for transaction submission from mobile hosts in a
multidatabase environment.
 Request messages from a mobile host to its coordinating site are handled asynchronously,
allowing the mobile host to disconnect.
 The coordinating node carries out the messages on behalf of the mobile unit, and it is possible
to query the status of the global transaction from mobile hosts.
 In the proposed Message and Queuing Facility (MQF), each mobile workstation has a message
queue and a transaction queue; request, reference, and information type messages (such as
requests for connection/reconnection of the mobile workstation) are exchanged through these
queues.
e. Pro-motion transaction model:
 This model was proposed by G. D. Walborn and P. K. Chrysanthis and is based on the nested
transaction model.
 The Pro-motion model places special emphasis on supporting disconnected transaction
processing, based on the client-server architecture.
 Mobile transactions are conceived as long-lived, nested transactions, where the top-level
transaction is executed at fixed hosts and subtransactions are executed at mobile hosts.
 Subtransaction processing at the mobile host is supported by the concept of compact objects.
f. Toggle transaction model:
 This model was proposed by Dirckze and Gruenwald and is similar to the multidatabase
transaction model.
 In this model, a mobile multidatabase system is defined as a collection of fixed and mobile
databases.
 A global transaction is defined as a set of operations, each of which is a legal operation
accepted by some service interface.
 Any subset of operations of a global transaction that access the same site may be executed
together and forms a logical unit called a site transaction.
 Site transactions are executed under the control of the respective DBMS.
5. Explain in detail about deductive database system with prolog notations and
examples.
DEDUCTIVE DATABASES:
 In a deductive database system, we typically specify rules through a declarative
language—a language in which we specify what to achieve rather than how to achieve it.
 An inference engine (or deduction mechanism) within the system can deduce new facts
from the database by interpreting these rules. The model used for deductive databases is
closely related to the relational data model, and particularly to the domain relational
calculus formalism.
 It is also related to the field of logic programming and the Prolog language.
 A variation of Prolog called Datalog is used to define rules declaratively in conjunction
with an existing set of relations, which are themselves treated as literals in the language.
Although the language structure of Datalog resembles that of Prolog, its operational
semantics—that is, how a Datalog program is executed—is still different.
 A deductive database uses two main types of specifications: facts and rules. Facts are
specified in a manner similar to the way relations are specified, except that it is not
necessary to include the attribute names.
 In a deductive database, the meaning of an attribute value in a tuple is determined solely
by its position within the tuple. Rules are somewhat similar to relational views. They
specify virtual relations that are not actually stored but that can be formed from the facts
by applying inference mechanisms based on the rule specifications. The main difference
between rules and views is that rules may involve recursion and hence may yield virtual
relations that cannot be defined in terms of basic relational views.
 The evaluation of Prolog programs is based on a technique called backward chaining,
which involves a top-down evaluation of goals.
 In the deductive databases that use Datalog, attention has been devoted to handling large
volumes of data stored in a relational database. Hence, evaluation techniques have been devised
that resemble those for a bottom-up evaluation.
Prolog/Datalog Notation:
 Prolog suffers from the limitation that the order of specification of facts and rules is
significant in evaluation; moreover, the order of literals within a rule is significant. The
execution techniques for Datalog programs attempt to circumvent these problems.
 The notation used in Prolog/Datalog is based on providing
predicates with unique names. A predicate has an implicit meaning, which is suggested
by the predicate name, and a fixed number of arguments. If the arguments are all constant
values, the predicate simply states that a certain fact is true. If, on the other hand, the
predicate has variables as arguments, it is either considered as a query or as part of a rule
or constraint. In our discussion, we adopt the Prolog convention that all constant values
in a predicate are either numeric or character strings; they are represented as identifiers (or
names) that start with a lowercase letter, whereas variable names always start with an
uppercase letter. Consider the example shown in Figure 26.11, which is based on the
relational database in Figure 3.6, but in a much simplified form. There are three predicate
names: supervise, superior, and subordinate.
 The SUPERVISE predicate is defined via a set of facts, each of which has two arguments:
a supervisor name, followed by the name of a direct supervisee (subordinate) of that
supervisor. These facts correspond to the actual data that is stored in the database, and they
can be considered as constituting a set of tuples in a relation SUPERVISE with two
attributes whose schema is
SUPERVISE(Supervisor, Supervisee)
Thus, SUPERVISE(X, Y) states the fact that X supervises Y. Notice the omission of the
attribute names in the Prolog notation. Attribute names are only represented by virtue of
the position of each argument in a predicate: the first argument represents the supervisor,
and the second argument represents a direct subordinate. The other two predicate names
are defined by rules. The main contributions of deductive databases are the ability to
specify recursive rules and to provide a framework for inferring new information based on
the specified rules. A rule is of the form head :– body, where :– is read as if. A
rule usually has a single predicate to the left of the :– symbol—called the head or left-hand
side (LHS) or conclusion of the rule—and one or more predicates to the right of the :–
symbol— called the body or right-hand side (RHS) or premise(s) of the rule. A predicate
with constants as arguments is said to be ground; we also refer to it as an instantiated
predicate. The arguments of the predicates that appear in a rule typically include a number
of variable symbols, although predicates can also contain constants as arguments. A rule
specifies that, if a particular assignment or binding of constant values to the variables in
the body (RHS predicates) makes all the RHS predicates true, it also makes the head (LHS
predicate) true by using the same assignment of constant values to variables. Hence, a rule
provides us with a way of generating new facts that are instantiations of the head of the
rule. These new facts are based on facts that already exist, corresponding to the
instantiations (or bindings) of predicates in the body of the rule. Notice that by listing
multiple predicates in the body of a rule we implicitly apply the logical AND operator to
these predicates. Hence, the commas between the RHS predicates may be read as meaning
and.
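Putting the notation together, the SUPERVISE/SUPERIOR example described above looks as follows
(the facts are reconstructed after the textbook example of Figure 26.11, so the specific employee
names are illustrative):

(Facts)
SUPERVISE(franklin, john). SUPERVISE(franklin, ramesh). SUPERVISE(franklin, joyce).
SUPERVISE(jennifer, alicia). SUPERVISE(jennifer, ahmad).
SUPERVISE(james, franklin). SUPERVISE(james, jennifer).
(Rules)
SUPERIOR(X, Y) :– SUPERVISE(X, Y).
SUPERIOR(X, Y) :– SUPERVISE(X, Z), SUPERIOR(Z, Y).
SUBORDINATE(X, Y) :– SUPERIOR(Y, X).
(Query)
SUPERIOR(james, Y)?   % retrieves all direct and indirect subordinates of james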
Datalog Notation:
In Datalog, as in other logic-based languages, a program is built from basic objects called
atomic formulas. It is customary to define the syntax of logic-based languages by describing the
syntax of atomic formulas and identifying how they can be combined to form a program. In
Datalog, atomic formulas are literals of the form p(a1, a2, … , an), where p is the predicate name
and n is the number of arguments for predicate p. Different predicate symbols can have different
numbers of arguments, and the number of arguments n of predicate p is sometimes called the arity
or degree of p. The arguments can be either constant values or variable names. As mentioned
earlier, we use the convention that constant values either are numeric or start with a lowercase
character, whereas variable names always start with an uppercase character. A literal is either an
atomic formula as defined earlier—called a positive literal—or an atomic formula preceded by
not. The latter is a negated atomic formula, called a negative literal.
Clausal Form and Horn Clauses:
Recall that a formula in the relational calculus is a condition that includes predicates called
atoms (based on relation names). Additionally, a formula can have quantifiers—namely, the
universal quantifier (for all) and the existential quantifier (there exists). In clausal form, a
formula must be transformed into another formula with the following characteristics:
 All variables in the formula are universally quantified. Hence, it is not necessary to include
the universal quantifiers (for all) explicitly; the quantifiers are removed, and all variables in
the formula are implicitly quantified by the universal quantifier.
 The formula is made up of a number of clauses, where each clause is composed of a number of
literals connected only by OR logical connectives. Hence, each clause is a disjunction of
literals.
 The clauses themselves are connected only by AND logical connectives to form a formula. Hence,
the clausal form of a formula is a conjunction of clauses.
A Horn clause is a clause that contains at most one positive literal; Datalog rules correspond to
Horn clauses.
Interpretations of Rules:
 There are two main alternatives for interpreting the theoretical meaning of rules: proof-
theoretic and model-theoretic.
 In the proof-theoretic interpretation of rules, we consider the facts and rules to be true
statements, or axioms. Ground axioms contain no variables. The facts are ground axioms that
are given to be true. Rules are called deductive axioms, since they can be used to deduce new
facts. The deductive axioms can be used to construct proofs that derive new facts from
existing facts. For example, Figure 26.12 shows how to prove the fact SUPERIOR(james, ahmad)
from the rules and facts given in Figure 26.11. The proof-theoretic interpretation gives us a
procedural or computational approach for computing an answer to a Datalog query. The process
of proving whether a certain fact (theorem) holds is known as theorem proving.
 The second type of interpretation is called the model-theoretic interpretation. Here, given a
finite or an infinite domain of constant values, we assign to a predicate every possible
combination of values as arguments. We must then determine whether the predicate is true or
false. In general, it is sufficient to specify the combinations of arguments that make the
predicate true, and to state that all other combinations make the predicate false. If this is
done for every predicate, it is called an interpretation of the set of predicates. For
example, consider the interpretation shown in Figure 26.13 for the predicates SUPERVISE and
SUPERIOR. This interpretation assigns a truth value (true or false) to every possible
combination of argument values (from a finite domain) for the two predicates.
 An interpretation is called a model for a specific set of rules if those rules are always true
under that interpretation; that is, for any values assigned to the variables in the rules, the
head of the rules is true when we substitute the truth values assigned to the predicates in
the body of the rule by that interpretation. Hence, whenever a particular substitution
(binding) to the variables in the rules is applied, if all the predicates in the body of a rule
are true under the interpretation, the predicate in the head of the rule must also be true. The
interpretation shown in Figure 26.13 is a model for the two rules shown, since it can never
cause the rules to be violated. Notice that a rule is violated if a particular binding of
constants to the variables makes all the predicates in the rule body true but makes the
predicate in the rule head false.
6. Discuss the challenges to multimedia databases and Areas where multimedia
database is applied
MULTIMEDIA DATABASE:
Multimedia databases provide features that allow users to store and query different types
of multimedia information, which includes images (such as photos or drawings), video clips (such
as movies, newsreels, or home videos), audio clips (such as songs, phone messages, or speeches),
and documents (such as books or articles).
Content of Multimedia Database management system :
1. Media data – The actual data representing an object.
2. Media format data – Information such as sampling rate, resolution, encoding scheme etc.
about the format of the media data after it goes through the acquisition, processing and
encoding phase.
3. Media keyword data – Keyword descriptions relating to the generation of the data. It is also
known as content descriptive data. Example: date, time, and place of recording.
4. Media feature data – Content dependent data such as the distribution of colors, kinds of
texture and different shapes present in data.
Types of multimedia applications based on data management characteristic are :
1. Repository applications – A large amount of multimedia data as well as metadata (media
format data, media keyword data, media feature data) is stored for retrieval purposes,
e.g., a repository of satellite images, engineering drawings, or radiology scanned pictures.
2. Presentation applications – They involve delivery of multimedia data subject to temporal
constraint. Optimal viewing or listening requires DBMS to deliver data at certain rate
offering the quality of service above a certain threshold. Here data is processed as it is
delivered. Example: Annotating of video and audio data, real-time editing analysis.
3. Collaborative work using multimedia information – It involves executing a complex
task by merging drawings, changing notifications. Example: Intelligent healthcare
network.
There are still many challenges to multimedia databases, some of which are:
1. Modelling – Work in this area can draw on both database and information-retrieval
techniques; documents constitute a specialized area and deserve special consideration.
2. Design – The conceptual, logical, and physical design of multimedia databases has not yet
been addressed fully, and performance and tuning issues at each level are far more complex
because multimedia comes in a variety of formats (JPEG, GIF, PNG, MPEG) that are not easy to
convert from one form to another.
3. Storage – Storage of multimedia database on any standard disk presents the problem of
representation, compression, mapping to device hierarchies, archiving and buffering during
input-output operations. In a DBMS, a “BLOB” (Binary Large Object) facility allows untyped
bitmaps to be stored and retrieved.
4. Performance – For an application involving video playback or audio-video
synchronization, physical limitations dominate. The use of parallel processing may
alleviate some problems but such techniques are not yet fully developed. Apart from this
multimedia database consume a lot of processing time as well as bandwidth.
5. Queries and retrieval –For multimedia data like images, video, audio accessing data
through query opens up many issues like efficient query formulation, query execution and
optimization which need to be worked upon.
Areas where multimedia database is applied are :
 Documents and record management : Industries and businesses that keep detailed
records and variety of documents. Example: Insurance claim record.
 Knowledge dissemination : Multimedia database is a very effective tool for knowledge
dissemination in terms of providing several resources. Example: Electronic books.
 Education and training : Computer-aided learning materials can be designed using
multimedia sources which are nowadays very popular sources of learning. Example:
Digital libraries.
 Marketing, advertising, retailing, entertainment and travel. Example: a virtual tour of cities.
 Real-time control and monitoring : Coupled with active database technology,
multimedia presentation of information can be very effective means for monitoring and
controlling complex tasks Example: Manufacturing operation control.
 The main types of database queries that are needed involve locating multimedia sources
that contain certain objects of interest. For example, one may want to locate all video clips
in a video database that include a certain person, say Michael Jackson. One may also want
to retrieve video clips based on certain activities included in them, such as video clips
where a soccer goal is scored by a certain player or team.
 The above types of queries are referred to as content-based retrieval, because the
multimedia source is being retrieved based on its containing certain objects or activities.
Hence, a multimedia database must use some model to organize and index the multimedia
sources based on their contents. Identifying the contents of multimedia sources is a difficult
and time-consuming task.
 There are two main approaches. The first is based on automatic analysis of the multimedia
sources to identify certain mathematical characteristics of their contents. This approach
uses different techniques depending on the type of multimedia source (image, video, audio,
or text). The second approach depends on manual identification of the objects and
activities of interest in each multimedia source and on using this information to index the
sources.
 An image is typically stored either in raw form as a set of pixel or cell values, or in
compressed form to save space. The image shape descriptor describes the geometric shape
of the raw image, which is typically a rectangle of cells of a certain width and height.
Hence, each image can be represented by an m by n grid of cells. Each cell contains a pixel
value that describes the cell content. In black and white images, pixels can be one bit. In
grayscale or colour images, a pixel is multiple bits. Because images may require large
amounts of space, they are often stored in compressed form. Compression standards, such
as GIF, JPEG, or MPEG, use various mathematical transformations to reduce the number
of cells stored but still maintain the main image characteristics.
 Applicable mathematical transforms include discrete Fourier transform (DFT), discrete
cosine transform (DCT), and wavelet transforms.
 To identify objects of interest in an image, the image is typically divided into homogeneous
segments using a homogeneity predicate. For example, in a color image, adjacent cells that
have similar pixel values are grouped into a segment.
 The homogeneity predicate defines conditions for automatically grouping those cells.
Segmentation and compression can hence identify the main characteristics of an image.
Automatic Analysis of Images:
 Analysis of multimedia sources is critical to support any type of query or search interface.
We need to represent multimedia source data such as images in terms of features that would
enable us to define similarity. The work done so far in this area uses low-level visual
features such as colour, texture, and shape, which are directly related to the perceptual
aspects of image content. These features are easy to extract and represent, and it is
convenient to design similarity measures based on their statistical properties.
 Colour is one of the most widely used visual features in content-based image retrieval
since it does not depend upon image size or orientation. Retrieval based on color similarity
is mainly done by computing a colour histogram for each image that identifies the
proportion of pixels within an image for the three colour channels (red, green, blue—
RGB).
 Texture refers to the patterns in an image that present the properties of homogeneity that
do not result from the presence of a single color or intensity value. Example of texture
classes are rough and silky. Examples of textures that can be identified include pressed calf
leather, straw matting, cotton canvas, and so on. Just as pictures are represented by arrays
of pixels (picture elements), textures are represented by arrays of texels (texture elements).
 Texture identification is primarily done by modelling it as a two-dimensional, grey-level
variation. The relative brightness of pairs of pixels is computed to estimate the degree of
contrast, regularity, coarseness, and directionality.
 Shape refers to the shape of a region within an image. It is generally determined by
applying segmentation or edge detection to an image. Segmentation is a region-based
approach that uses an entire region (sets of pixels), whereas edge detection is a boundary-
based approach that uses only the outer boundary characteristics of entities. Shape
representation is typically required to be invariant to translation, rotation, and scaling.
Object Recognition in Images:
 Object recognition is the task of identifying real-world objects in an image or a video
sequence. The system must be able to identify the object even when the images of the
object vary in viewpoints, size, scale, or even when they are rotated or translated. Some
approaches have been developed to divide the original image into regions based on
similarity of contiguous pixels.
 Thus, in a given image showing a tiger in the jungle, a tiger subimage may be detected
against the background of the jungle, and when compared with a set of training images, it
may be tagged as a tiger.
 The representation of the multimedia object in an object model is extremely important. One
approach is to divide the image into homogeneous segments using a homogeneous
predicate. For example, in a colored image, adjacent cells that have similar pixel values are
grouped into a segment.
 An important contribution to this field was made by Lowe, who used scale-invariant features
from images to perform reliable object recognition. This approach is called the scale-invariant
feature transform (SIFT).
 The SIFT features are invariant to image scaling and rotation, and partially invariant to
changes in illumination and 3D camera viewpoint.
 For image matching and recognition, SIFT features (also known as keypoint features) are first
extracted from a set of reference images and stored in a database. Object recognition is then
performed by comparing each feature from the new image with the features stored in the
database and finding candidate matching features based on the Euclidean distance of their
feature vectors. Since the keypoint features are highly distinctive, a single feature can be
correctly matched with good probability in a large database of features.
Semantic Tagging of Images:
 The notion of implicit tagging is an important one for image recognition and comparison.
Multiple tags may attach to an image or a subimage: for instance, in the example we
referred to above, tags such as “tiger,” “jungle,” “green,” and “stripes” may be associated
with that image.
 Most image search techniques retrieve images based on user-supplied tags that are often
not very accurate or comprehensive.
 To improve search quality, a number of recent systems aim at automated generation of
these image tags. In case of multimedia data, most of its semantics is present in its content.
These systems use image-processing and statistical-modeling techniques to analyze image
content to generate accurate annotation tags that can then be used to retrieve images by
content.
 Since different annotation schemes will use different vocabularies to annotate images, the
quality of image retrieval will be poor.
 To solve this problem, recent research techniques have proposed the use of concept
hierarchies, taxonomies, or ontologies using OWL (Web Ontology Language), in which
terms and their relationships are clearly defined. These can be used to infer higher-level
concepts based on tags.
 Concepts like “sky” and “grass” may be further divided into “clear sky” and “cloudy sky”
or “dry grass” and “green grass” in such a taxonomy. These approaches generally come
under semantic tagging and can be used in conjunction with the above feature-analysis and
object-identification strategies.
Analysis of Audio Data Sources:
 Audio sources are broadly classified into speech, music, and other audio data. Each of these
is significantly different from the others; hence different types of audio data are treated
differently.
 Audio data must be digitized before it can be processed and stored. Indexing and retrieval
of audio data is arguably the toughest among all types of media, because like video, it is
continuous in time and does not have easily measurable characteristics such as text.
 Clarity of sound recordings is easy to perceive humanly but is hard to quantify for machine
learning. Interestingly, speech data often uses speech recognition techniques to aid the
actual audio content, as this can make indexing this data a lot easier and more accurate.
This is sometimes referred to as text-based indexing of audio data.
 The speech metadata is typically content dependent, in that the metadata is generated from
the audio content; for example, the length of the speech, the number of speakers, and so
on. However, some of the metadata might be independent of the actual content, such as the
length of the speech and the format in which the data is stored.
 Music indexing, on the other hand, is done based on the statistical analysis of the audio
signal, also known as content-based indexing. Content-based indexing often makes use of
the key features of sound: intensity, pitch, timbre, and rhythm. It is possible to compare
different pieces of audio data and retrieve information from them based on the calculation
of certain features, as well as application of certain transforms.
ADVANCED DATABASE TECHNOLOGY
UNIT 3
NoSQL Databases
Problems with RDBMS:
 Should know the entire schema upfront
 Every record should have the same properties [rigid structure]
 Scalability is costly [transactions and joins are expensive when running on a distributed
database]
 Many Relational Databases do not provide out of the box support for scaling
 Normalization
 Fixed schemas make it hard to adjust to application needs
 Altering schema on a running database is expensive
 Application changes for any change in schema structure
 SQL was designed for running on single server systems.
 Horizontal Scalability is a problem
Why is this architecture not good enough? (architecture diagram not reproduced)
 Load
 Failure — Single point of failure
 Maintenance — Downtime during manual maintenance
Advantages
 Speed — Files are retrieved from the nearest location
 If one site fails, the system can still run
Disadvantages
 Time for synchronization of the multiple databases
 Data replication
The Benefits of NoSQL
 When compared to relational databases, NoSQL databases are more scalable and
provide superior performance, and their data model addresses several issues that the
relational model is not designed to address
 Large volumes of rapidly changing structured, semi-structured, and unstructured data
[schema-less]
 Mostly Open Source
 Object-oriented programming that is easy to use and flexible
 Running well on Clusters-Geographically distributed scale-out architecture instead of
expensive, monolithic architecture
NOSQL Categories
Most of the NOSQL products can be put into these categories:
 Key/Value Stores
 Document Databases
 Graph Databases
 Column Databases
Examples of Key Value Databases:
 Redis
 Riak
 Oracle NoSQL
Document Databases
Documents are composed of field-and-value pairs and have the following structure:
{ field1: value1, field2: value2, field3: value3, ... fieldN: valueN }
Documents can contain many different key-value pairs, or key-array pairs, or even nested
documents.
MongoDB stores BSON documents (i.e., data records) in collections, and the collections in
databases.
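For instance, from the mongo shell (the employee collection and its fields are made up for the
example):

db.employee.insertOne({
  name: "John",                    // key-value pair
  skills: ["databases", "java"],   // key-array pair
  address: { city: "Chennai" }     // nested document
})
db.employee.find({ "address.city": "Chennai" })   // query on a nested field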
Examples of Document Databases
 CouchDB - Apache Software Foundation
 Cosmos DB — Microsoft
 MongoDB— Mongo DB inc
Graph Databases
Graph databases are NoSQL databases which use the graph data model comprised of vertices,
which is an entity such as a person, place, object or relevant piece of data and edges, which
represent the relationship between two nodes.
Advantages
 Easy to represent connected data
 Very faster to retrieve, navigate and traverse connected data
 Can represent semi-structured data easily
 Does not require complex or costly joins to retrieve connected data
 It supports full ACID (Atomicity, Consistency, Isolation and Durability) rules
Let’s Convert a Relational Model to a Graph Data Model using an Example (figure not reproduced)
Column Family Data Model
 Column family databases are probably best known because of Google’s Bigtable
implementation.
 They are very similar on the surface to relational databases, but they are actually quite a
different beast.
 Some of the difference is storing data by rows (relational) vs. storing data by columns
(column family databases).
 But a lot of the difference is conceptual in nature. You can’t apply the same sort of
solutions that you used in a relational form to a column database.
Column Family Data Stores Example
 Bigtable
 Cassandra
 HBase
 Vertica
 Druid
 Accumulo
 Hypertable
CAP Theorem
 Eric Brewer’s CAP theorem says that if you want consistency, availability, and partition
tolerance, you have to settle for two out of three.
 For a distributed system, partition tolerance means the system will continue to work
unless there is a total network failure. A few nodes can fail and the system keeps going.
Base Property of Transaction
 Basically Available — Failure will not halt the system
 Soft state — State of the system will change over time
 Eventual consistency — Will become consistent over time
NEWSQL Databases
 NewSQL is a new approach to relational databases that aims to combine the transactional
ACID (atomicity, consistency, isolation, durability) guarantees of a good RDBMS with
the horizontal scalability of NoSQL.
 They maintain ACID guarantees
 They run on SQL
NEWSQL Databases support
 Partitioning and Sharding— Fragmentation is Supported
 Replication — Copies of database stored in a remote site
 Secondary Indexes — Accessing database records using a value other than the primary key
(see the sketch after this list)
 Concurrency Control — Data Integrity while executing simultaneous transactions
 Crash Recovery — Recovers to a consistent state
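As a small illustration of the secondary-index point above, in SQL terms it is simply an index on
a non-primary-key column (the orders table is a made-up example):

CREATE INDEX order_customer_idx ON orders (customer_id);
-- This lookup can now be served by the secondary index rather than a full scan:
SELECT * FROM orders WHERE customer_id = 42;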

MongoDB
Mongod vs mongos vs mongo
Mongod :
 Mongod is almost like an API. It’s the middleman between the application and the db.
 It handles data requests, manages data access, and performs background management operations.
Mongos :
 Mongos is also a middleman.
 The mongos instances route queries and write operations to the shards of a sharded cluster.
Could it be fair to say it does the same as mongod? What's the big difference?
When I run the command "mongo --port xxxx", I am connecting to the cluster/replica set itself and not starting a middleman.
MongoDB is divided into two components: server and client.
 The server is the main database component which stores and manages data. And, the
clients come in various flavours and connect to the server to perform various queries
and db operations.
 Here, Mongod is the server component. You start it, it runs, that’s it.
 By definition we also call it the primary daemon process for the MongoDB database
which handles data requests, manages data access, and performs background
management operations.
 Whereas Mongo is a default command line client. You start it, you connect to a server,
you enter commands, you exit out of it.
 You have to run mongod first, otherwise you have no database to interact with.
 Now, the question comes then what is Mongos?
 Mongos is a kind of query router, providing an interface between client applications
and the sharded cluster
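To make the roles concrete, a typical local workflow looks like this (a minimal sketch; the port number and data directory path are made up):

mongod --port 27017 --dbpath /data/db     (start the server daemon)
mongo --port 27017                        (connect to it with the shell client)

In a sharded cluster, clients connect to a mongos instance in the same way, and it routes each request to the appropriate shard.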
What is Replication?
 Replication is the process of synchronizing data across multiple servers.
 Replication provides redundancy and increases data availability with multiple copies of
data on different database servers.
Advantages
 High (24*7) availability of data
 Disaster recovery
 No downtime for maintenance (like backups, index rebuilds, compaction)
 Read scaling (extra copies to read from)
MongoDB - Replication
Primary Server —
 The primary server receives all write operations.
 A replica set can have only one primary capable of confirming writes
 The primary records all changes to its data sets in its operation log, i.e. oplog.
Secondary Server —
 The secondary servers replicate the primary oplog and apply the operations to their data
sets such that the secondaries’ data sets reflect the primary’s data set.
 If the primary is unavailable, an eligible secondary will hold an election to elect itself
the new primary
Replica set —
 A replica set is a group of mongod instances that maintain the same data set.
 A replica set contains several data bearing nodes.
 Of the data bearing nodes, one and only one member is deemed the primary node, while
the other nodes are deemed secondary nodes.
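A replica set can be initiated from the mongo shell; the sketch below assumes three mongod instances are already running on the listed (made-up) hosts:

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "db1.example.net:27017" },
    { _id: 1, host: "db2.example.net:27017" },
    { _id: 2, host: "db3.example.net:27017" }
  ]
})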
Oplog File
 The oplog (operations log) is a special capped collection that keeps a rolling record of all operations that modify the data stored in your databases.
 MongoDB applies database operations on the primary and then records the operations
on the primary’s oplog.

 The secondary members then copy and apply these operations in an asynchronous
process.
 All replica set members contain a copy of the oplog, in the local.oplog.rs collection,
which allows them to maintain the current state of the database.
On failure of a primary node
 When a primary does not communicate with the other members of the set for more than the configured electionTimeoutMillis period (10 seconds by default), an eligible secondary calls for an election to nominate itself as the new primary.
 The cluster attempts to complete the election of a new primary and resume normal
operations.
 The replica set cannot process write operations until the election completes successfully. The replica set can continue to serve read queries if such queries are configured to run on secondaries while the primary is offline.

What is Sharding?
 Sharding is a method for distributing data across multiple machines.
 MongoDB uses sharding to support deployments with very large data sets and high
throughput operations.
Why Sharding?
 Database systems with large data sets or high throughput applications can challenge
the capacity of a single server.
 For example, high query rates can exhaust the CPU capacity of the server.
 Working set sizes larger than the system's RAM stress the I/O capacity of disk drives.

Vertical vs Horizontal Scaling


 Vertical Scaling involves increasing the capacity of a single server, such as using a
more powerful CPU, adding more RAM, or increasing the amount of storage space.
 Limitations in available technology may restrict a single machine from being
sufficiently powerful for a given workload
 Horizontal Scaling involves dividing the system dataset and load over multiple
servers, adding additional servers to increase capacity as required.
 While the overall speed or capacity of a single machine may not be high, each machine handles a subset of the overall workload, potentially providing better efficiency than a single high-speed, high-capacity server.
 Expanding the capacity of the deployment only requires adding additional servers as
needed, which can be a lower overall cost than high-end hardware for a single
machine.
Sharded Cluster in Mongo DB
A MongoDB sharded cluster consists of the following components:
 shard: Each shard contains a subset of the sharded data. Each shard can be deployed
as a replica set.
 mongos: The mongos acts as a query router, providing an interface between client
applications and the sharded cluster.
 config servers: Config servers store metadata and configuration settings for the
cluster. As of MongoDB 3.4, config servers must be deployed as a replica set (CSRS).
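Once the components are running, sharding is enabled from the mongos router; a minimal sketch (the database, collection, and shard key names are made up):

sh.enableSharding("mydb")
sh.shardCollection("mydb.students", { studentId: "hashed" })

Here a hashed shard key is chosen so that documents are spread evenly across the shards.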
Cassandra
The design goal of Cassandra is to handle big data workloads across multiple nodes without
any single point of failure. Cassandra has peer-to-peer distributed system across its nodes,
and data is distributed among all the nodes in a cluster.
 All the nodes in a cluster play the same role. Each node is independent and at the
same time interconnected to other nodes.
 Each node in a cluster can accept read and write requests, regardless of where the data
is actually located in the cluster.
 When a node goes down, read/write requests can be served from other nodes in the
network.
Data Replication in Cassandra
In Cassandra, one or more nodes in a cluster act as replicas for a given piece of data. Cassandra uses the Gossip Protocol in the background to allow the nodes to communicate with each other and detect any faulty nodes in the cluster.

Components of Cassandra
The key components of Cassandra are as follows —
 Node — It is the place where data is stored.
 Data center — It is a collection of related nodes.
 Cluster — A cluster is a component that contains one or more data centers.
 Commit log — The commit log is a crash-recovery mechanism in Cassandra. Every
write operation is written to the commit log.
 Mem-table — A mem-table is a memory-resident data structure. After commit log,
the data will be written to the mem-table. Sometimes, for a single-column family,
there will be multiple mem-tables.
 SSTable — It is a disk file to which the data is flushed from the mem-table when its
contents reach a threshold value.
 Bloom filter — These are nothing but quick, nondeterministic algorithms for testing whether an element is a member of a set. It is a special kind of cache. Bloom filters are accessed after every query.
Cassandra Query Language
 Users can access Cassandra through its nodes using Cassandra Query Language
(CQL). CQL treats the database (Keyspace) as a container of tables. Programmers use
cqlsh (a prompt to work with CQL) or separate application language drivers.
 Clients approach any of the nodes for their read-write operations. That node
(coordinator) plays a proxy between the client and the nodes holding the data.
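For example, a keyspace and a table can be created and queried from cqlsh as follows (a minimal sketch; the keyspace, table, and column names are made up):

CREATE KEYSPACE school
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
USE school;
CREATE TABLE students (id int PRIMARY KEY, name text, marks int);
INSERT INTO students (id, name, marks) VALUES (1, 'Jay', 720);
SELECT * FROM students WHERE id = 1;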
Write Operations
 Every write activity of nodes is captured by the commit logs written in the nodes.
 Later the data will be captured and stored in the mem-table. Whenever the mem-table is full, data will be written into the SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically consolidates the SSTables, discarding unnecessary data.
Read Operations
 During read operations, Cassandra gets values from the mem-table and checks the bloom filter to find the appropriate SSTable that holds the required data.
Data Models of Cassandra and RDBMS
 RDBMS deals with structured data; Cassandra deals with unstructured data.
 RDBMS has a fixed schema; Cassandra has a flexible schema.
 In RDBMS, a table is an array of arrays (ROW x COLUMN); in Cassandra, a table is a list of "nested key-value pairs" (ROW x COLUMN key x COLUMN value).
 In RDBMS, the database is the outermost container that contains data corresponding to an application; in Cassandra, the keyspace is the outermost container.
 Tables are the entities of a database; tables or column families are the entities of a keyspace.
 A row is an individual record in an RDBMS; a row is a unit of replication in Cassandra.
 A column represents an attribute of a relation; a column is a unit of storage in Cassandra.
 RDBMS supports the concepts of foreign keys and joins; in Cassandra, relationships are represented using collections.

CQL Types
CQL provides a rich set of built-in data types, including collection types. Along with these
data types, users can also create their own custom data types. The following table provides a
list of built-in data types available in CQL.
Collection Types
Cassandra Query Language also provides collection data types. The following table provides a list of the collections available in CQL.

User-defined datatypes
Cqlsh provides users a facility of creating their own data types. Given below are the
commands used while dealing with user defined data types.
 CREATE TYPE − Creates a user-defined datatype.
 ALTER TYPE − Modifies a user-defined datatype.
 DROP TYPE − Drops a user-defined datatype.
 DESCRIBE TYPE − Describes a user-defined datatype.
 DESCRIBE TYPES − Describes user-defined datatypes.
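For example, a user-defined type can be created in cqlsh as follows (a minimal sketch; the type and field names are made up):

CREATE TYPE address (
  street text,
  city text,
  zip int
);

The new type can then be used as a column type, for example a column declared as frozen<address>.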

HIVE
Datatypes
All the data types in Hive are classified into four types, given as follows:
 Column Types
 Literals
 Null Values
 Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:
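Type        Postfix    Size
TINYINT     Y          1-byte signed integer
SMALLINT    S          2-byte signed integer
INT         -          4-byte signed integer
BIGINT      L          8-byte signed integer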
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing immutable arbitrary-precision values. The syntax and an example are as follows:
DECIMAL (precision,scale)
Decimal (10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance using create
union.
The syntax and example is as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Floating Point Types
Floating point types are nothing but numbers with decimal points. Generally, this type of data
is
composed of DOUBLE data type.

Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
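A table using all three complex types might be declared as follows (a minimal sketch; the table and column names are made up):

CREATE TABLE employee (
  name STRING,
  skills ARRAY<STRING>,
  phones MAP<STRING, BIGINT>,
  address STRUCT<street:STRING, city:STRING>
);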

OrientDB Graph database


Introduction
OrientDB is an Open Source NoSQL Database Management System, which contains the
features of traditional DBMS along with the new features of both Document and Graph
DBMS. It is written in Java and is amazingly fast. It can store 220,000 records per second on
commodity hardware. OrientDB is one of the best open-source, multi-model, next generation NoSQL products.
OrientDB is an Open Source NoSQL Database Management System. A NoSQL database provides a mechanism for storing and retrieving non-relational data, that is, data other than tabular data, such as document data or graph data. NoSQL databases
are increasingly used in Big Data and real-time web applications. NoSQL systems are also
sometimes called "Not Only SQL" to emphasize that they may support SQL-like query
languages.
OrientDB also belongs to the NoSQL family. OrientDB is a second-generation distributed graph database with the flexibility of documents in one product, open source under the Apache 2 license.
OrientDB is the first Multi-Model open source NoSQL DBMS that brings together the power
of graphs and flexibility of documents into a scalable high-performance operational database.
The main feature of OrientDB is to support multi-model objects, i.e. it supports different
models like Document, Graph, Key/Value and Real Object. It contains a separate API to
support all these four models.
Document Model
The terminology Document model belongs to NoSQL databases. It means the data is stored in documents, and a group of documents is called a collection. Technically, a document means a set of key/value pairs, also referred to as fields or properties.
OrientDB uses the concepts such as classes, clusters, and link for storing, grouping, and
analysing the documents.
The following table illustrates the comparison between the relational model, the document model, and the OrientDB document model.

Graph Model
A graph data structure is a data model that can store data in the form of Vertices (Nodes)
interconnected by Edges (Arcs). The idea of OrientDB graph database came from property
graph. The vertex and edge are the main artifacts of the Graph model. They contain the
properties, which can make these appear similar to documents.
The following table shows a comparison between graph model, relational data model, and
OrientDB graph model.

The Key/Value Model


The Key/Value model means that data can be stored in the form of key/value pair where the
values can be of simple and complex types. It can support documents and graph elements as
values.
The following table illustrates the comparison between relational model, key/value model,
and OrientDB key/value model.
The Object Model
This model has been inherited from object-oriented programming and supports inheritance between types (sub-types extend super-types), polymorphism when you refer to a base class, and direct binding from/to objects used in programming languages.
The following table illustrates the comparison between relational model, Object model, and
OrientDB Object model.

Following are some of the important terminologies in OrientDB.


Record
The smallest unit that you can load from and store in the database. Records can be stored in
four types.

Record ID
When OrientDB generates a record, the database server automatically assigns a unit identifier to the record, called the Record ID (RID). The RID looks like #<cluster>:<position>, where <cluster> is the cluster identification number and <position> is the absolute position of the record in the cluster (for example, #9:3 is the fourth record in cluster 9).
Documents
The Document is the most flexible record type available in OrientDB. Documents are softly typed and are defined by schema classes with defined constraints, but you can also insert documents without any schema, i.e. it supports schema-less mode too. Documents can be easily handled by export and import in JSON format. For example, take a look at the following JSON sample document. It defines the document details.
{
"id" :"1201",
"name" :"Jay",
"job" : "Developer",
"creations" :[
{
"name" : "Amiga",
"company" : "Commodore Inc."
},
{
"name" : "Amiga 500",
"company" : "CommodoreInc."
}
]
}
RecordBytes
Record Type is the same as BLOB type in RDBMS. OrientDB can load and store document
Record type along with binary data.
Vertex
OrientDB database is not only a Document database but also a Graph database. The new
concepts such as Vertex and Edge are used to store the data in the form of graph. In graph
databases, the most basic unit of data is node, which in OrientDB is called a vertex. The
Vertex stores information for the database.

Edge
There is a separate record type called the Edge that connects one vertex to another. Edges are
bidirectional and can only connect two vertices. There are two types of edges in OrientDB: regular and lightweight.
Class
The class is a type of data model and the concept drawn from the Object-oriented
programming paradigm. Based on the traditional document database model, data is stored in
the form of collection, while in the Relational database model data is stored in tables.
OrientDB follows the Document API along with the OOPs paradigm. As a concept, the class in OrientDB has the closest relationship with the table in relational databases, but (unlike tables) classes can be schema-less, schema-full, or mixed. Classes can inherit from other classes, creating trees of classes. Each class has its own cluster or clusters (created by default, if none are defined).
Cluster
Cluster is an important concept which is used to store records, documents, or vertices. In
simple words, Cluster is a place where a group of records are stored. By default, OrientDB
will create one cluster per class. All the records of a class are stored in the same cluster
having the same name as the class. You can create up to 32,767(2^15-1) clusters in a
database.
The CREATE CLUSTER command is used to create a cluster with a specific name. Once the cluster is created, you can use it to save records by specifying its name during the creation of any data model.
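For example, a class can be created and used with OrientDB SQL as follows (a minimal sketch; the class name and field values are made up):

CREATE CLASS Employee
INSERT INTO Employee SET name = 'Jay', job = 'Developer'
SELECT FROM Employee WHERE name = 'Jay'

The insert goes into the default cluster that was created along with the Employee class.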
Relationships
OrientDB supports two kinds of relationships: referenced and embedded. A referenced relationship stores a direct link to the target objects of the relationship. An embedded relationship stores the relationship within the record that embeds it. This relationship is stronger than the reference relationship.
Database
The database is an interface to access the real storage. It understands high-level concepts such as queries, schemas, metadata, indices, and so on. OrientDB also provides multiple database types. For more information on these types, see Database Types.
Unit : 4
XML DATABASES

Structured, Semi structured, and Unstructured Data:


Big Data includes huge volume, high velocity, and an extensible variety of data. There are three types:
Structured data,
Semi-structured data, and
Unstructured data.

1. Structured data

Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository that is typically a database. It concerns all data which can be stored in an SQL database in a table with rows and columns. Structured data items have relational keys and can easily be mapped into pre-designed fields. Today, this is the most processed and simplest form of data to manage.

Example: Relational data.

2. Semi-Structured data

Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, you can store it in a relational database (this can be very hard for some kinds of semi-structured data), but the semi-structured form eases that effort.

Example: XML data.

3. Unstructured data

Unstructured data is data which is not organized in a predefined manner or does not have a predefined data model; thus it is not a good fit for a mainstream relational database. For unstructured data, there are alternative platforms for storing and managing it. It is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications.

Example: Word, PDF, Text, Media logs.
Differences between Structured, Semi-structured and Unstructured data:

 Technology: structured data is based on a relational database table; semi-structured data is based on XML/RDF (Resource Description Framework); unstructured data is based on character and binary data.
 Transaction management: structured data has matured transaction and various concurrency techniques; for semi-structured data, transactions are adapted from the DBMS and are not matured; unstructured data has no transaction management and no concurrency.
 Version management: structured data supports versioning over tuples, rows, and tables; for semi-structured data, versioning over tuples or graphs is possible; unstructured data is versioned as a whole.
 Flexibility: structured data is schema dependent and less flexible; semi-structured data is more flexible than structured data but less flexible than unstructured data; unstructured data is more flexible and there is an absence of schema.
 Scalability: it is very difficult to scale a structured DB schema; scaling semi-structured data is simpler than structured data; unstructured data is more scalable.
 Robustness: structured data technology is very robust; semi-structured technology is new and not very widespread; (not specified for unstructured data).
 Query performance: structured query allows complex joining; for semi-structured data, queries over anonymous nodes are possible; for unstructured data, only textual queries are possible.
What is XML?

 XML stands for Extensible Markup Language. It is designed to be both human- and machine-readable.
 It does not contain any predefined tags; it lets the user define their own set of tags.
 It is a smaller version of SGML.
 It overcomes all the drawbacks of HTML.
 It is easy to understand and it is very much flexible than HTML
 It inherits the features of SGML and combines it with the features of HTML.
 XML document is just a pure information wrapped in tags.
 Someone must write a piece of software to send, receive or display it.
 It is used to exchange the information between organizations and systems.

Features of XML :

 XML is a metalanguage: it can describe all the other markup languages.
 An XML file can display the data in different formats.
 It can also help other applications with further processing.
 It is excellent for long-term data storage and reusability.
 It does not allow empty command declarations.
 Stylesheets allow transforming the structured data into different HTML views to display data on different browsers [XSLT].

How XML Works?
 XML is used for both storing and transferring data.
 For transferring, XML does not create a file and transfer it, but sends the content directly.
 To transfer data from a sender machine to a receiver machine, XML transfers the data as an object.
 The sender serializes the object, that is, converts the object to a byte stream. This stream can be formatted as XML, SOAP, JSON, and so on.
 The receiver receives the byte stream and deserializes it to an object. This object should be equivalent to the one the sender sent before, in that it holds the same data.

XML Hierarchical Data Model


XML data is hierarchical; relational data is represented in a model of logical relationships. An XML document contains information about the relationship of data items to each other in the form of the hierarchy. With the relational model, the only types of relationships that can be defined are parent table and dependent table relationships.
Comparison of the XML model and the relational model

When you design your databases, you must decide whether your data is better suited to the XML
model or the relational model. Take advantage of the hybrid nature of Db2® databases that
supports both relational and XML data in a single database.

While this discussion explains some of the main differences between the models and the factors
that apply to each, there are numerous factors that can determine the most suitable choice for your
implementation. Use this discussion as a guideline to assess the factors that can impact your
specific implementation.

Major differences between XML data and relational data:

XML data is hierarchical; relational data is represented in a model of logical relationships


An XML document contains information about the relationship of data items to each
other in the form of the hierarchy. With the relational model, the only types of
relationships that can be defined are parent table and dependent table relationships.
XML data is self-describing; relational data is not
An XML document contains not only the data, but also tagging for the data that explains
what it is. A single document can have different types of data. With the relational model,
the content of the data is defined by its column definition. All data in a column must have
the same type of data.
XML data has inherent ordering; relational data does not
For an XML document, the order in which data items are specified is assumed to be the
order of the data in the document. There is often no other way to specify order within the
document. For relational data, the order of the rows is not guaranteed unless you specify
an ORDER BY clause on one or more columns.
Factors influencing data model choice:
When you need maximum flexibility
Relational tables follow a fairly rigid model. For example, normalizing one table into
many or denormalizing many tables into one can be very difficult. If the data design
changes often, representing it as XML data is a better choice. XML schemas can be
evolved over time, for example.
When you need maximum performance for data retrieval
Some expense is associated with serializing (Serialization is the process of converting a
data object into a series of bytes that saves the state of the object in an easily transmittable
form.) and interpreting XML data. If performance is more of an issue than flexibility,
relational data might be the better choice.
When data is processed later as relational data
If subsequent processing of the data depends on the data being stored in a relational
database, it might be appropriate to store parts of the data as relational, using
decomposition. An example of this situation is when online analytical processing
(OLAP) is applied to the data in a data warehouse. Also, if other processing is required
on the XML document as a whole, then storing some of the data as relational as well as
storing the entire XML document might be a suitable approach in this case.
When data components have meaning outside a hierarchy
Data might be inherently hierarchical in nature, but the child components do not need the
parents to provide value. For example, a purchase order might contain part numbers. The
purchase orders with the part numbers might be best represented as XML documents.
However, each part number has a part description associated with it. It might be better to
include the part descriptions in a relational table, because the relationship between the
part numbers and the part descriptions is logically independent of the purchase orders in
which the part numbers are used.
When data attributes apply to all data, or to only a small subset of the data
Some sets of data have a large number of possible attributes, but only a small number of
those attributes apply to any particular data value. For example, in a retail catalog, there
are many possible data attributes, such as size, color, weight, material, style, weave,
power requirements, or fuel requirements. For any given item in the catalog, only a subset
of those attributes is relevant: power requirements are meaningful for a table saw, but not
for a coat. This type of data is difficult to represent and search with a relational model,
but relatively easy to represent and search with an XML model.
When the ratio of data complexity to volume is high
Many situations involve highly structured information in very small quantities.
Representation of that data with a relational model can involve complex star schemas in
which each dimension table is joined to many more dimension tables, and most of the
tables have only a few rows. A better way to represent this data is to use a single table
with an XML column, and to create views on that table, where each view represents a
dimension.
When referential integrity is required
XML columns cannot be defined as part of referential constraints. Therefore, if values in
XML documents need to participate in referential constraints, you should store the data
as relational data.
When the data needs to be updated often
You update XML data in an XML column only by replacing full documents. If you need
to frequently update small fragments of very large documents for a large number of rows,
it can be more efficient to store the data in non-XML columns. If, however, you are
updating small documents and only a few documents at a time, storing as XML can be
efficient as well.

XML – Documents
Structure of an XML

 The structure of XML consists of:


o Declaration of Xml
o Root node
o Node Attributes and its value
o Empty node

XML Document Example

A simple document is shown in the following example:

<?xml version = "1.0"?>


<contact-info>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</contact-info>
The following image depicts the parts of XML document.
Document Prolog Section

Document Prolog comes at the top of the document, before the root element. This section contains

 XML declaration
 Document type declaration

Document Elements Section

Document Elements are the building blocks of XML. These divide the document into a hierarchy
of sections, each serving a specific purpose. You can separate a document into multiple sections
so that they can be rendered differently, or used by a search engine. The elements can be containers,
with a combination of text and other elements.

XML Comments

 XML comments are similar to HTML comments.
 They are used to make the code more understandable for other developers.
 Comments are not considered XML code.
 These comments add notes or lines for the purpose of better understanding the XML code.
 XML is known as self-describing data, but sometimes XML comments are necessary.
 Comments begin with <!-- and end with -->
 XML comments are ignored by the parser and are not validated.

Syntax

An XML comment should be written as:

<!-- Write your comment-->

Rules for adding XML comments

 Do not use comments before an XML declaration.
 We can use these comments anywhere in an XML document except within an attribute value.
 Do not nest comments inside one another.

XML Vs HTML

 XML stands for Extensible Markup Language; HTML stands for Hypertext Markup Language.
 XML allows users to create their own tags and attributes; HTML tags and attributes are pre-determined and rigid.
 In XML, content and format are separate and formatting is applied by external stylesheets; in HTML, content and formatting can be placed together, for example <p><font="Arial">text</font>.
 XML designs represent the logical structure of a document; HTML designs represent the presentation structure of a document.
 XML is more effective for machine-machine interaction; HTML is more effective for machine-human interaction.
 XML provides a framework to define markup languages; HTML is a markup language.
 XML syntax is strictly defined, which means it is mandatory to close all tags; HTML syntax is loosely defined compared to XML, and it is not necessary to close all tags.
 XML is dynamic because it is used both to display and to transport data; HTML is static because it is used only to display data/content.
 XML preserves whitespace; HTML does not preserve whitespace.
 XML is case sensitive; HTML is case insensitive.

XML Elements
 XML elements are represented by tags. XML elements behave as containers to store all the text, other elements, attributes, and media objects.
 There is no limitation on the number of elements in XML.
 Elements usually consist of an opening tag and a closing tag, but together they are treated as a single element.
 Opening tags consist of <, followed by the element name, and ending with >.
 Closing tags are the same but have a forward slash inserted between the less-than symbol and the element name.

Syntax

<tag>Data</tag>

 Empty elements do not have a closing tag. They are closed by inserting a forward slash before the greater-than symbol.

Syntax for Empty Element

<tag/>

XML Elements Rules

 All XML elements must have a closing tag:
o In HTML, some tags don't need to be closed.
o In XML, however, you must close all tags except empty tags.
o The opening and closing tags must be identical except that the closing tag contains a forward slash before the element name.

Example for opening/closing tags

<child>Data</child>

Example for empty elements

<child attribute="value" />

 The XML elements are case sensitive:
o All tags must be written using the correct case. XML sees <tutorial> as a different
tag to <Tutorial>
 XML Elements Must Be Nested Properly
o You can place elements inside other elements, but you need to ensure each element is fully contained within its parent (tags must not overlap).

Wrong:
<Employee>
<Name> Mrs. Abi </Employee>
</Name>

Right:
<Employee>
<Name> Mrs. Abi </Name>
</Employee>

 XML Element Naming Rules


o Element names can contain letters, digits, hyphens, underscores, and periods.
o Element names must not contain spaces.
o Element names must not begin with a number or punctuation character (for example, a comma or semicolon).
o Element names must not start with the letters xml (whether lowercase, uppercase, or mixed case).
o Element names should not use a colon (:) because the colon is reserved for another purpose (namespaces).

XML Attributes

 Attributes are part of XML elements. An element can contain multiple attributes, all of which must be unique.
 By the use of attributes, we can add information about the element.
 XML attributes enhance the properties of the elements.
Syntax

<tag attributeName="attributeValue">

Xml attribute Example

<author bookType="Classical"> // Double quote

//else

<author bookType='Classical'> // Single quote


Attribute vs Sub-element or Elements
 Attributes are part of markup, while sub elements are part of the basic document contents.

In the below code bookType is an attribute

<author bookType="Classical">

In the below code bookType is an element

<author>

<bookType>Classical</bookType>

</author>

XML Attributes Rules

 An attribute name must appear only once in the same start-tag or empty-element tag.
 The value of an attribute must be within quotation marks, using either single or double quotes.
 Attributes must contain a value. Some HTML coders provide the attribute name without a value (it is then taken as true); this is not allowed in XML.
 The values must not contain direct or indirect entity references to external entities.
 Using an Attribute-List Declaration, an attribute must be declared in the Document Type
Definition (DTD).
XML Tree Structure

XML Attributes Drawbacks:

 Attributes cannot have multiple values, but child elements can have multiple values.
 Attributes cannot contain tree structures, but child elements can.
 Attributes are not easily expandable. If you want to change an attribute's values in the future, it may be complicated.
 Attributes cannot describe structure, but child elements can.
 Attributes are more difficult to manipulate by program code.
 Attribute values are not easy to test against a DTD, which is used to define the legal elements of an XML document.

How XML is Presented?
Benefits of XML - Business Benefits

 Information Sharing :
o XML defines data formats to build tools which help to read, write, and transform data between XML and other formats.

 Single application usage:
o In an application, it is not necessary to code the whole application in XML.
o The specific parts of an application that involve formatting or transferring data between applications can be coded in XML.

 Content Delivery :
o XML supports different users and channels, and also helps build more efficient applications.
o These channels have information delivery mechanisms such as digital TV,
phone, the Web, and multimedia/touchscreen kiosks.

Technological Benefits:

 XML is a text-based (Unicode) format.
o It takes less space to store data.
o It is more efficient for transferring data.
 One XML document can be displayed differently in different media.
o HTML, video, CD, DVD.
o You only need to change the XML document in order to change all the rest of the media.

 XML documents can be modularized.


 The XML can be reused

XML Namespace

 In XML, a namespace is used to prevent conflicts with element or attribute names.
o Because XML allows you to create your own element names, there is always the possibility of naming an element exactly the same as one in another XML document.
o That is OK if you never use both documents together.
o But if you want to combine the content of both documents, you would have a name conflict:
o two different elements, with different purposes, both with the same name. In this case we use a namespace for the elements to avoid the confusion.

Namespaces allow the browser to:
o Combine various sources of documents.
o Identify the source of elements or attributes.
o The Uniform Resource Locator (URL) contains the reference for a document or an HTML page on the Web.
Example Name Conflict

 Imagine we have an XML document containing a list of books. Something like this:

 And imagine we want to combine it with the following HTML page.
 We will encounter a problem if we try to combine the above documents. This is because
they both have an element called title. One is the title of the book, the other is the title of
the HTML page. We have a name conflict.
 To prevent this name conflict, we create a namespace for the XML document.
 Example for Namespace
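A namespace is declared with an xmlns attribute on an element; a minimal sketch (the prefixes and URIs are made up):

<root xmlns:b="http://example.com/books"
      xmlns:h="http://example.com/html">
  <b:title>Everyday Italian</b:title>
  <h:title>Welcome to my bookstore</h:title>
</root>

The two title elements no longer conflict, because each is qualified by its own prefix.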

XML CDATA

 CDATA means character data.


 CDATA is defined as blocks of text that are not parsed by the parser; their content is treated as plain character data rather than markup.
 The predefined entities such as <, >, and & require typing and they are generally difficult
to read in the markup.
 In such cases, a CDATA section can be used. Using a CDATA section, the user commands the parser that the particular section of the document contains no markup and should be treated as regular text.
Syntax

 The CDATA syntax contains three sections:


o CDATA Start section − CDATA begins with the nine-character delimiter
<![CDATA[
o CDATA Content section − This section may contain markup characters (<, >, and &), but they are ignored by the XML processor.
o CDATA End section − CDATA section ends with ]]> delimiter.

<![CDATA[

Character that contain markup

]]>

Example for CDATA
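A minimal sketch of a CDATA section (the element name and script content are made up):

<script>
<![CDATA[
if (a < b && b > c) {
  result = "ok";
}
]]>
</script>

Inside the CDATA section, the < and & characters are treated as plain text, so the parser does not report them as malformed markup.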

CDATA Rules

 A CDATA section cannot contain the string "]]>" anywhere in the document.
 Nesting of CDATA sections is not allowed.
XML Usage in applications

 The XML technology is commonly used for:


o Configuration files.
o Data exchange format between applications.
o Structured data sets.
o Simple file-based databases.

XML DTD :

 DTD stands for Document Type Definition.


 A DTD defines the structure and the legal elements and attributes of an XML document.
 An application can use a DTD to verify that XML data is valid.

Syntax

<!ELEMENT element_name content_model>

Example

<!ELEMENT tutorials (#PCDATA)>

This declares an element named tutorials whose content model is parsed character data.

Note: XML documents are commonly encoded in UTF-8 (Unicode, or Universal Coded Character Set, Transformation Format, 8-bit). UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.

Internal DTD

 The following example demonstrates Internal DTD.


o Doctype with DTD will be placed inside the XML
o Elements and tags will be accessed by the xml from DTD.

Syntax

<!DOCTYPE root-element [element-declarations]>

root-element is the name of root element and element-declarations is where you declare the
elements.
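A minimal sketch of an internal DTD (the element names are made up):

<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to, from, body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tanmay</to>
<from>Patil</from>
<body>Reminder</body>
</note>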
External DTD

 The following example demonstrates External DTD.


o Doctype with DTD will be placed as a separate file.
o Elements and tags will be accessed by the xml file from DTD file.

Syntax

<!DOCTYPE root-element SYSTEM "file-name">

file-name is the file with .dtd extension
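For example, if the element declarations above are saved in a separate file named note.dtd (a made-up file name), the XML document references it like this:

<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
<to>Tanmay</to>
<from>Patil</from>
<body>Reminder</body>
</note>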
XML Schema

An XML Schema describes the structure of an XML document, just like a DTD.

An XML document with correct syntax is called "Well Formed".

An XML document validated against an XML Schema is both "Well Formed" and "Valid".

XML Schema is an XML-based alternative to DTD:

<xs:element name="note">

<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>

</xs:element>

The Schema above is interpreted like this:

 <xs:element name="note"> defines the element called "note"


 <xs:complexType> the "note" element is a complex type
 <xs:sequence> the complex type is a sequence of elements
 <xs:element name="to" type="xs:string"> the element "to" is of type string (text)
 <xs:element name="from" type="xs:string"> the element "from" is of type string
 <xs:element name="heading" type="xs:string"> the element "heading" is of type string
 <xs:element name="body" type="xs:string"> the element "body" is of type string

XML Schemas are More Powerful than DTD

 XML Schemas are written in XML


 XML Schemas are extensible to additions
 XML Schemas support data types
 XML Schemas support namespaces

Why Use an XML Schema?

With XML Schema, your XML files can carry a description of its own format.

With XML Schema, independent groups of people can agree on a standard for interchanging data.

With XML Schema, you can verify data.

XML Schemas Support Data Types

One of the greatest strengths of XML Schemas is the support for data types:

 It is easier to describe document content


 It is easier to define restrictions on data
 It is easier to validate the correctness of data
 It is easier to convert data between different data types

XML Schemas use XML Syntax

Another great strength about XML Schemas is that they are written in XML:

 You don't have to learn a new language


 You can use your XML editor to edit your Schema files
 You can use your XML parser to parse your Schema files
 You can manipulate your Schemas with the XML DOM
 You can transform your Schemas with XSLT

XML Documents and Database


Different approaches for storing XML documents are as given below:

 An RDBMS or object-oriented database management system can be used to store an XML document in the form of text.
 A tree model is very useful for storing data elements located at leaf level in a tree structure.
 Large amounts of data are stored in the form of relational or object-oriented databases. Middleware software is used to manage communication between the XML document and the relational database.

XML Query:

XML query is based on two methods.

1. Xpath:

 Xpath is a syntax for defining parts or elements of XML documents.


 Xpath is used to navigate between elements and attributes in the XML documents.
 Xpath uses path of expressions to select nodes in XML documents.
 XPath contains over 200 built-in functions

There are functions for string values, numeric values, booleans, date and time comparison,
node manipulation, sequence manipulation, and much more.

XPath Terminology
Nodes

In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-
instruction, comment, and document nodes.

XML documents are treated as trees of nodes. The topmost element of the tree is called the root
element.

Look at the following XML document:

<?xml version="1.0" encoding="UTF-8"?>

<bookstore>
<book>
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>

Example of nodes in the XML document above:

<bookstore> (root element node)

<author>J K. Rowling</author> (element node)

lang="en" (attribute node)


Atomic values

Atomic values are nodes with no children or parent.

Example of atomic values:

J K. Rowling

"en"
Items

Items are atomic values or nodes.

Relationship of Nodes
Parent

Each element and attribute has one parent.

In the following example; the book element is the parent of the title, author, year, and price:

<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Children

Element nodes may have zero, one or more children.

In the following example; the title, author, year, and price elements are all children of the book
element:

<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Siblings

Nodes that have the same parent.

In the following example; the title, author, year, and price elements are all siblings:

<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Ancestors

A node's parent, parent's parent, etc.

In the following example; the ancestors of the title element are the book element and the
bookstore element:

<bookstore>

<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

</bookstore>
Descendants

A node's children, children's children, etc.

In the following example; descendants of the bookstore element are the book, title, author, year,
and price elements:

<bookstore>

<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

</bookstore>

XPath Syntax:

XPath uses path expressions to select nodes or node-sets in an XML document. The node is
selected by following a path or steps.

The XML Example Document

We will use the following XML document in the examples below.

<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book>
<title lang="en">Harry Potter</title>
<price>29.99</price>
</book>

<book>
<title lang="en">Learning XML</title>
<price>39.95</price>
</book>

</bookstore>

Selecting Nodes

XPath uses path expressions to select nodes in an XML document. The node is selected by
following a path or steps. The most useful path expressions are listed below:

Expression Description

nodename Selects all nodes with the name "nodename"

/ Selects from the root node

// Selects nodes in the document from the current node that match the selection
no matter where they are

. Selects the current node

.. Selects the parent of the current node

@ Selects attributes

In the table below we have listed some path expressions and the result of the expressions:

Path Expression Result

bookstore Selects all nodes with the name "bookstore"

/bookstore Selects the root element bookstore
Note: If the path starts with a slash ( / ) it always represents an
absolute path to an element!

bookstore/book Selects all book elements that are children of bookstore

//book Selects all book elements no matter where they are in the document

bookstore//book Selects all book elements that are descendant of the bookstore
element, no matter where they are under the bookstore element

//@lang Selects all attributes that are named lang

Predicates

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are always embedded in square brackets.

In the table below we have listed some path expressions with predicates and the result of the
expressions:

Path Expression Result

/bookstore/book[1] Selects the first book element that is the child of the
bookstore element.

Note: In IE 5,6,7,8,9 first node is[0], but according to


W3C, it is [1]. To solve this problem in IE, set the
SelectionLanguage to XPath:

In JavaScript:
xml.setProperty("SelectionLanguage","XPath");

/bookstore/book[last()] Selects the last book element that is the child of the
bookstore element

/bookstore/book[last()-1] Selects the last but one book element that is the child of
the bookstore element

/bookstore/book[position()<3] Selects the first two book elements that are children of the
bookstore element

//title[@lang] Selects all the title elements that have an attribute named
lang

//title[@lang='en'] Selects all the title elements that have a "lang" attribute
with a value of "en"

/bookstore/book[price>35.00] Selects all the book elements of the bookstore element


that have a price element with a value greater than 35.00

/bookstore/book[price>35.00]/title Selects all the title elements of the book elements of the
bookstore element that have a price element with a value
greater than 35.00

Selecting Unknown Nodes

XPath wildcards can be used to select unknown XML nodes.

Wildcard Description

* Matches any element node

@* Matches any attribute node

node() Matches any node of any kind

In the table below we have listed some path expressions and the result of the expressions:

Path Expression Result

/bookstore/* Selects all the child element nodes of the bookstore element

//* Selects all elements in the document

//title[@*] Selects all title elements which have at least one attribute of any kind

Selecting Several Paths

By using the | operator in an XPath expression you can select several paths.

In the table below we have listed some path expressions and the result of the expressions:

Path Expression Result

//book/title | //book/price Selects all the title AND price elements of all book elements

//title | //price Selects all the title AND price elements in the document

/bookstore/book/title | //price Selects all the title elements of the book element of the
bookstore element AND all the price elements in the document

XPath Operators

Below is a list of the operators that can be used in XPath expressions:

Operator Description Example

| Computes two node-sets //book | //cd

+ Addition 6+4

- Subtraction 6-4

* Multiplication 6*4

div Division 8 div 4

= Equal price=9.80

!= Not equal price!=9.80

< Less than price<9.80

<= Less than or equal to price<=9.80

> Greater than price>9.80

>= Greater than or equal to price>=9.80

or or price=9.80 or price=9.70

and and price>9.00 and price<9.90

mod Modulus (division remainder) 5 mod 2

XPath Examples

We will use the following XML document in the examples below.

"books.xml":

<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>

<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>

<book category="web">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>

</bookstore>
Loading the XML Document

Using an XMLHttpRequest object to load XML documents is supported in all modern browsers.

var xmlhttp = new XMLHttpRequest();

Selecting Nodes

Unfortunately, there are different ways of dealing with XPath in different browsers.

Chrome, Firefox, Edge, Opera, and Safari use the evaluate() method to select nodes:

xmlDoc.evaluate(xpath, xmlDoc, null, XPathResult.ANY_TYPE,null);

Internet Explorer uses the selectNodes() method to select nodes:

xmlDoc.selectNodes(xpath);

In our examples we have included code that should work with most major browsers.

Select all the titles

The following example selects all the title nodes:

Example
/bookstore/book/title
Select the title of the first book

The following example selects the title of the first book node under the bookstore element:

Example
/bookstore/book[1]/title
Select all the prices

The following example selects the text from all the price nodes:

Example
/bookstore/book/price/text()

Select price nodes with price>35

The following example selects all the price nodes with a price higher than 35:

Example
/bookstore/book[price>35]/price
Select title nodes with price>35

The following example selects all the title nodes with a price higher than 35:

Example
/bookstore/book[price>35]/title

2. Xquery:

 XQuery is to XML what SQL is to databases.

 XQuery is designed to query XML data.

 Xquery is a query and functional programming language. Xquery provides a facility to extract
and manipulate data from XML documents or any data source, like relational database.
 XQuery defines the FLWOR expression, which supports iteration and binding of variables to intermediate results.
FLWOR is an abbreviation of FOR, LET, WHERE, ORDER BY, RETURN, which are explained as follows:

 For - selects a sequence of nodes

 Let - binds a sequence to a variable

 Where - filters the nodes

 Order by - sorts the nodes

 Return - what to return (gets evaluated once for every node)

XQuery Basic Syntax Rules

Some basic syntax rules:

 XQuery is case-sensitive
 XQuery elements, attributes, and variables must be valid XML names
 An XQuery string value can be in single or double quotes
 An XQuery variable is defined with a $ followed by a name, e.g. $bookstore
 XQuery comments are delimited by (: and :), e.g. (: XQuery Comment :)

 Xquery comparisons:

The two methods for Xquery comparisons are as follows:

1. General comparisons: =, !=, <, <=, >, >=


Example:
In this example, the expression (query) returns true if any attribute has a value greater than or equal to 15000.
$TVStore//TV/price >= 15000

2. Value comparisons: eq, ne, lt, le, gt , ge


Example:
In this example, the expression (query) returns true if there is only one attribute returned by the expression and its value is equal to 15000.
$TVStore//TV/price eq 15000

Example:
Let's take an example to understand how to write an XQuery.

for $x in doc("student.xml")/studentinformation/marks
where $x/marks > 700
return <res>{ $x/Name }</res>

The query in the above example returns the names of the students only if the value (marks) is greater than 700.

The XML Example Document

We will use the following XML document in the examples below.

"books.xml":

<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>

<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>

<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>

</bookstore>

How to Select Nodes From "books.xml"?


Functions

XQuery uses functions to extract data from XML documents.

The doc() function is used to open the "books.xml" file:

doc("books.xml")
Path Expressions

XQuery uses path expressions to navigate through elements in an XML document.

The following path expression is used to select all the title elements in the "books.xml" file:

doc("books.xml")/bookstore/book/title

(/bookstore selects the bookstore element, /book selects all the book elements under the
bookstore element, and /title selects all the title elements under each book element)

The XQuery above will extract the following:

<title lang="en">Everyday Italian</title>
<title lang="en">Harry Potter</title>
<title lang="en">XQuery Kick Start</title>
<title lang="en">Learning XML</title>
Predicates

XQuery uses predicates to limit the extracted data from XML documents.

The following predicate is used to select all the book elements under the bookstore element that
have a price element with a value that is less than 30:

doc("books.xml")/bookstore/book[price<30]

The XQuery above will extract the following:

<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

How to Select Nodes From "books.xml" With FLWOR

Look at the following path expression:

doc("books.xml")/bookstore/book[price>30]/title

The expression above will select all the title elements under the book elements that are under the
bookstore element that have a price element with a value that is higher than 30.

The following FLWOR expression will select exactly the same as the path expression above:

for $x in doc("books.xml")/bookstore/book
where $x/price>30
return $x/title

The result will be:

<title lang="en">XQuery Kick Start</title>


<title lang="en">Learning XML</title>

With FLWOR you can sort the result:

for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title
The for clause selects all book elements under the bookstore element into a variable called $x.

The where clause selects only book elements with a price element with a value greater than 30.

The order by clause defines the sort order; here the results are sorted by the title element.

The return clause specifies what should be returned. Here it returns the title elements.

The result of the XQuery expression above will be:

<title lang="en">Learning XML</title>


<title lang="en">XQuery Kick Start</title>

UNIT-V
Introduction to Information Retrieval and Web Search

1. Why is information retrieval necessary? List the retrieval models and explain the Boolean
and vector space models with examples.
Information retrieval
 Process of retrieving documents from a collection in response to a query (search
request)
 Deals mainly with unstructured data
 Example: home buying contract documents
Unstructured information
 Does not have a well-defined formal model
 Based on an understanding of natural language
 Stored in a wide variety of standard formats
 The information retrieval field predates the database field
 Academic programs in Library and Information Science
 RDBMS vendors providing new capabilities to support various data types
 Extended RDBMSs or object-relational database management systems
 User’s information need is expressed as a free-form search request
 Keyword search query
 Characterizing an IR system
TYPES OF USERS
Users can vary greatly in their ability to interact with the computational environment.
 Expert
The user may be an expert user (for example, a curator or a librarian) who is searching
for specific information and forms relevant queries for the task.
 Layperson
Or the user may be a layperson with a generic information need (for example, students
trying to find information about a new topic, or researchers trying to assimilate
different points of view about a historical issue).
 Types of data
Search systems can be tailored to specific types of data.
For example, the problem of retrieving information about a specific topic may be
handled more efficiently by customized search systems that are built to collect and
retrieve only information related to that specific topic
 Domain specific
The information repository could be hierarchically organized based on a concept or
topic hierarchy. These topical domain-specific or vertical IR systems are not as large
as or as diverse as the generic World Wide Web, which contains information on all
kinds of topics.
 Types of information needs
In the context of Web search, users’ information needs may be defined as navigational,
informational, or transactional.
 Navigational search
Navigational search refers to finding a particular piece of information (such as the
Georgia Tech University Web site)
 Informational search
The goal of informational search is to find current information about a topic (such as
research activities).
 Transactional search
The goal of transactional search is to reach a site where further interaction happens,
resulting in some transactional event (such as joining a social network or shopping
for products).
 Enterprise search systems
Limited to an intranet
 Desktop search engine
Searches an individual computer system
 Databases have fixed schemas
An IR system has no fixed data model
Comparing Databases and IR Systems

Databases:
 Structured data
 Schema driven
 Relational (or object, hierarchical, and network) model is predominant
 Structured query model
 Rich metadata operations
 Query returns data
 Results are based on exact matching (always correct)

IR Systems:
 Unstructured data
 No fixed schema; various data models (e.g., vector space model)
 Free-form query models
 Rich data operations
 Search request returns a list of results or pointers to documents
 Results are based on approximate matching and measures of effectiveness (may be
imprecise and ranked)
A Brief History of IR
 Stone tablets and papyrus scrolls
 Printing press
 Public libraries
 Computers and automated storage systems
Inverted file organization, based on keywords and their weights, as the indexing method
 Search engine
 Crawler
 Challenge: provide high quality, pertinent, timely information
Modes of Interactions in IR Systems
 Primary modes of interaction
 Retrieval
Extract relevant information from a document repository
 Browsing
Exploratory activity based on user’s assessment of relevance
 Web search combines both interaction modes
The rank of a Web page measures its relevance to the query that generated the result set

Generic IR Pipeline
 Statistical approach
Documents analyzed and broken down into chunks of text
Each word or phrase is counted, weighted, and measured for relevance or importance
 Types of statistical approaches
 Boolean
 Vector space
 Probabilistic
 Semantic approaches
Use knowledge-based retrieval technique
 Rely on syntactic, lexical, sentential, discourse-based, and pragmatic levels of
knowledge understanding
 Also apply some form of statistical analysis
Retrieval Models
Boolean model - One of earliest and simplest IR models
In the Boolean retrieval model we can pose any query in the form of a Boolean expression
of terms, i.e., one in which terms are combined with the operators AND, OR, and NOT.
Example: Shakespeare
Brutus AND Caesar AND NOT Calpurnia
 Which plays of Shakespeare contain the words Brutus and Caesar, but not
Calpurnia?
 Naive solution: a linear scan through all the text (“grepping”). In this case it works OK
(Shakespeare’s collected works contain less than 1M words).
 But in the general case, with much larger text collections, we need to index.
 Indexing is an offline operation that collects data about which words occur in a
text, so that at search time you only have to access the precompiled index.
 Main idea: record for each document whether it contains each word out of all the
different words Shakespeare used (about 32K).

 Matrix element (t, d) is 1 if the play in column d contains the word in row t, and 0
otherwise.
 For much larger collections, however, the term-document incidence matrix is far too
large to build and store; this is what motivates the inverted index.
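As a minimal illustration of the Boolean model (the toy collection below is invented for the
example, not taken from the text), each document can be reduced to its set of terms, and the
query Brutus AND Caesar AND NOT Calpurnia becomes plain set operations:

docs = {
    "Antony and Cleopatra": {"antony", "brutus", "caesar", "cleopatra"},
    "Julius Caesar": {"antony", "brutus", "caesar", "calpurnia"},
    "Hamlet": {"caesar", "hamlet"},
}

# Invert the collection: term -> set of documents containing that term.
index = {}
for name, terms in docs.items():
    for term in terms:
        index.setdefault(term, set()).add(name)

# Brutus AND Caesar AND NOT Calpurnia, as set intersection and difference.
all_docs = set(docs)
result = index["brutus"] & index["caesar"] & (all_docs - index["calpurnia"])
print(result)  # {'Antony and Cleopatra'}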
Vector space model
 Weighting, ranking, and determining relevance are possible
 Uses individual terms as dimensions
 Each document represented by an n-dimensional vector of values
 Features
 Subset of terms in a document set that are deemed most relevant to an IR search for the
document set
 Different similarity assessment functions can be used
 Term frequency-inverse document frequency (TF-IDF)
 Statistical weight measure used to evaluate the importance of a document word in a
collection of documents
 A discriminating term must occur in only a few documents in the general population
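The sketch below shows one common TF-IDF formulation (term frequency times log(N/df);
textbooks and systems vary in the exact weighting) together with cosine similarity for ranking.
The toy documents are illustrative only:

import math
from collections import Counter

docs = [
    "web information retrieval and web search",
    "database systems and query processing",
    "information retrieval models",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df.get(t, 1)) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = tfidf("information retrieval".split())
for text, tokens in zip(docs, tokenized):
    print(round(cosine(query, tfidf(tokens)), 3), text)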
Probabilistic model
 Involves ranking documents by their estimated probability of relevance with respect
to the query and the document
 IR system must decide whether a document belongs to the relevant set or nonrelevant
set for a query
 Calculate probability that document belongs to the relevant set
 BM25: a popular ranking algorithm
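As a sketch, a standard formulation of the BM25 scoring function looks like the following
(k1 and b are the usual tuning parameters, typically k1 between 1.2 and 2.0 and b = 0.75; this
is the commonly published variant, not code from the textbook):

import math
from collections import Counter

def bm25(query_terms, doc, docs, k1=1.5, b=0.75):
    # doc and docs are tokenized: lists of terms.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N   # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in docs if term in d)          # document frequency
        if n == 0:
            continue
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # smoothed IDF
        num = tf[term] * (k1 + 1)
        den = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * num / den
    return score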
Semantic model
 Morphological analysis
Analyze roots and affixes to determine parts of speech of search words
 Syntactic analysis
Parse and analyze complete phrases in documents
 Semantic analysis
Resolve word ambiguities and generate relevant synonyms based on semantic
relationships. Uses techniques from artificial intelligence and expert systems.
2. Explain the Types of Queries in IR Systems.
Types of Queries in IR Systems.
Keyword queries
 Simplest and most commonly used
 Keyword terms implicitly connected by logical AND
Boolean queries
 Allow use of AND, OR, NOT, and other operators
 Exact matches returned
 No ranking possible
Phrase queries
 Sequence of words that make up a phrase
 Phrase enclosed in double quotes
 Each retrieved document must contain at least one instance of the exact phrase
Proximity queries
 How close within a record multiple search terms are to each other
 Phrase search is most commonly used proximity query
 Specify order of search terms
 NEAR, ADJ (adjacent), or AFTER operators
 Sequence of words with maximum allowed distance between them
 Computationally expensive
 Suitable for smaller document collections rather than the Web
Wildcard queries
 Supports regular expressions and pattern-based matching
 Example: ‘data*’ would retrieve data, database, dataset, etc.
 Not generally implemented by Web search engines
Natural language queries
 Definitions of textual terms or common facts
 Semantic models can support natural language queries
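As an illustration, one example query of each type (the operator syntax varies from system to
system; these query strings are invented examples, not from the textbook):

Keyword query: distributed database recovery
Boolean query: database AND distributed AND NOT relational
Phrase query: "distributed database systems"
Proximity query: database NEAR transaction
Wildcard query: data*
Natural language query: What is a distributed database?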
3. Discuss the different methods used in Text Preprocessing.
Text Preprocessing
 Stopword removal must be performed before indexing
 Words that are expected to occur in 80% or more of the documents of a collection
 Examples: the, of, to, a, and, said, for, that
 Do not contribute much to relevance
 Queries preprocessed for stopword removal before retrieval process
 Many search engines do not remove stopwords
Stemming
 Trims suffixes and prefixes
 Reduces the different forms of the word to a common stem
 Martin Porter’s stemming algorithm
Utilizing a thesaurus
 Important concepts and main words that describe each concept for a particular
knowledge domain
 Collection of synonyms
 Example: UMLS (Unified Medical Language System)
Other preprocessing steps
 Digits
May or may not be removed during preprocessing
 Hyphens and punctuation marks
Handled in different ways
 Cases
Most search engines use case-insensitive search
 Information extraction tasks
 Identifying noun phrases, facts, events, people, places, and relationships
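A minimal sketch of these preprocessing steps (the stopword list is abbreviated, and the
suffix-stripping is a toy stand-in for Porter's full stemming algorithm):

STOPWORDS = {"the", "of", "to", "a", "and", "said", "for", "that", "in", "is"}

def preprocess(text):
    # Case folding plus punctuation and digit removal.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS and not t.isdigit()]
    # Toy stemming: strip a few common suffixes (a real system
    # would use Porter's algorithm).
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ers", "er", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The searchers said that indexing improves retrieval."))
# ['search', 'index', 'improv', 'retrieval']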
Inverted Indexing
 Inverted index structure
 Vocabulary information
 Set of distinct query terms in the document set
 Document information
 Data structure that attaches distinct terms with a list of all documents that contain
the term
 Construction of an inverted index
Break documents into vocabulary terms by tokenizing, cleansing, removing stopwords,
stemming, and/or using a thesaurus
Collect document statistics
Store the statistics in a document lookup table
Invert the document-term stream into a term-document stream
Add additional information such as term frequencies, term positions, and term weights
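A minimal sketch of this construction (tokenization only; cleansing, stemming, term positions,
and term weights are omitted, and the documents and IDs are illustrative):

from collections import Counter, defaultdict

docs = {
    1: "xml data and xml query processing",
    2: "query processing in distributed databases",
}

inverted = defaultdict(dict)   # term -> {doc_id: term frequency}
doc_table = {}                 # document lookup table with simple statistics

for doc_id, text in docs.items():
    tokens = text.split()
    doc_table[doc_id] = {"length": len(tokens)}
    for term, freq in Counter(tokens).items():
        inverted[term][doc_id] = freq

print(inverted["query"])   # {1: 1, 2: 1}
print(inverted["xml"])     # {1: 2}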

 Searching for relevant documents from an inverted index involves:
 Vocabulary search
 Document information retrieval
 Manipulation of the retrieved information
Introduction to Lucene
 Lucene:
 open source indexing/search engine
 Indexing is primary focus
 Document composed of set of fields
 Chunks of untokenized text
 Series of processed lexical units called token streams
 Created by tokenization and filtering algorithms
 Highly-configurable search API
 Ease of indexing large, unstructured document collections
4. How is search relevance measured? Explain the evaluation measures of search
relevance in detail.
Evaluation Measures of Search Relevance
Topical relevance
 Measures result topic match to query topic
User relevance
 Describes ‘goodness’ of retrieved result with regard to user’s information need
Web information retrieval
 No binary classification made for relevance or nonrelevance
 Ranking of documents
Recall
 Number of relevant documents retrieved by a search divided by the total number of
actually relevant documents existing in the database
Precision
 Number of relevant documents retrieved by a search divided by total number of
documents retrieved by that search

Retrieved Versus Relevant Search Results


 TP: true positive
 FP: false positive
 TN: true negative
 FN: false negative
 Recall can be increased by presenting more results to the user
 May decrease the precision
The terms true positive, false positive, false negative, and true negative are generally used in any
type of classification tasks to compare the given classification of an item with the desired correct
classification. Using the term hits for the documents that truly or “correctly” match the user
request, we can define recall and precision as follows:
Recall = |Hits|/|Relevant|
Precision = |Hits|/|Retrieved|
Recall and precision can also be defined in a ranked retrieval setting. Assume that there is one
document at each rank position, and let d_i^q denote the retrieved document at position i for
query q. The recall at rank position i, denoted r(i), is the fraction of all relevant documents for
the query that appear among d_1^q, ..., d_i^q in the result set. Let S_i be the set of relevant
documents among d_1^q, ..., d_i^q, with cardinality |S_i|, and let |D_q| be the total number of
relevant documents for the query (so |S_i| <= |D_q|). Then:
Ranked_retrieval_recall: r(i) = |S_i| / |D_q|
The precision at rank position i, denoted p(i), is the fraction of documents from d_1^q to d_i^q
in the result set that are relevant:
Ranked_retrieval_precision: p(i) = |S_i| / i
Table 27.2 illustrates the p(i), r(i), and average precision (discussed in the next section) metrics. It
can be seen that recall can be increased by presenting more results to the user, but this approach
runs the risk of decreasing the precision. In the example, the number of relevant documents for
some query = 10.
The rank position and the relevance of an individual document are shown. The precision and recall
value can be computed at each position within the ranked list as shown in the last two columns.
As we see in Table 27.2, the ranked_retrieval_recall rises monotonically whereas the precision is
prone to fluctuation.

Average precision
 Computed based on the precision at each relevant document in the ranking
Recall/precision curve
 Based on the recall and precision values at each rank position
 x-axis is recall and y-axis is precision
F-score
 Harmonic mean of the precision (p) and recall (r) values: F = 2pr / (p + r)
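A small sketch computing these measures over a ranked result list (1 marks a relevant document
at that rank; the total of 10 relevant documents mirrors the example above and is otherwise
arbitrary):

relevance = [1, 0, 1, 1, 0]   # relevance at ranks 1..5
total_relevant = 10           # |D_q| for the query

hits = 0
for i, rel in enumerate(relevance, start=1):
    hits += rel                                # |S_i|
    r = hits / total_relevant                  # r(i) = |S_i| / |D_q|
    p = hits / i                               # p(i) = |S_i| / i
    f = 2 * p * r / (p + r) if p + r else 0.0  # F-score at rank i
    print(f"rank {i}: r(i)={r:.2f}  p(i)={p:.2f}  F={f:.2f}")

# Average precision: sum of p(i) at the ranks holding a relevant document,
# divided by the total number of relevant documents.
ap = sum(sum(relevance[:i]) / i
         for i, rel in enumerate(relevance, start=1) if rel) / total_relevant
print("average precision:", round(ap, 3))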
5. Explain web search and analysis in detail.
Search engines must crawl and index web sites and document collections
 Regularly update indexes
 Link analysis used to identify page importance
Vertical search engines
 Customized topic-specific search engines that crawl and index a specific collection of
documents on the Web
Metasearch engines
 Query different search engines simultaneously and aggregate information
Digital libraries
 Collections of electronic resources and services for the delivery of materials in a variety
of formats
Web analysis
 Applies data analysis techniques to discover and analyze useful information from the
Web
Goals of Web analysis
 Finding relevant information
 Personalization of the information
 Finding information of social value
Categories of Web analysis
 Web structure analysis
 Web content analysis
 Web usage analysis
Web structure analysis
 Hyperlink
 Destination page
 Anchor text
 Hub
PageRank ranking algorithm
 Used by Google
 Analyzes forward links and backlinks
 Highly linked pages are more important
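A minimal sketch of the PageRank iteration (simplified; the damping factor 0.85 is the commonly
cited value, and the link structure is invented for illustration):

links = {                    # page -> forward links
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
d = 0.85                     # damping factor
pr = {p: 1 / len(pages) for p in pages}

for _ in range(50):          # iterate until (approximately) converged
    new = {}
    for p in pages:
        # Contributions from each page q that links to p (p's backlinks),
        # spread evenly over q's forward links.
        incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
        new[p] = (1 - d) / len(pages) + d * incoming
    pr = new

print({p: round(v, 3) for p, v in pr.items()})
# C, with the most backlinks, ends up with the highest rank.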
 Web content analysis tasks
Structured data extraction
 Wrapper
Web information integration
 Web query interface integration
 Schema matching
 Ontology-based information integration
Building concept hierarchies
 Segmenting web pages and detecting noise
 Approaches to Web content analysis
Agent-based
 Intelligent Web agents
 Personalized Web agents
 Information filtering/categorization
Database-based
 Attempts to organize a Web site as a database
 Object Exchange Model
 Multilevel database
 Web query system
 Web usage analysis attempts to discover usage patterns from Web
data
Preprocessing
 Usage, content, structure
Pattern discovery
 Statistical analysis, association rules, cluster classification, sequential
patterns, dependency modeling
Pattern analysis
 Filter out patterns not of interest
 Practical applications of Web analysis
Web analytics
 Understand and optimize the performance of Web usage
Web spamming
 Deliberate activity to promote a page by manipulating search engine results
Web security
 Allow design of more robust Web sites
Web crawlers
 Programs that systematically browse and download Web pages for search engines to index
6. Discuss the Trends in Information Retrieval.
Trends in Information Retrieval
 Faceted search
Classifying content
 Social search
Collaborative social search
 Conversational information access
Intelligent agents perform intent extraction to provide information relevant to a
conversation
 Probabilistic topic modeling
Automatically organize large collections of documents into relevant themes
Question-answering systems
 Factoid questions
 List questions
 Definition questions
 Opinion questions
 Composed of question analysis, query generation, search, candidate answer
generation, and answer scoring
