
Karpaga Vinayaga College of Engineering and Technology

Master of Computer Applications


MC4202 - ADVANCED DATABASE TECHNOLOGY
UNIT I
DISTRIBUTED DATABASES
Distributed Systems – Introduction – Architecture – Distributed Database
Concepts – Distributed Data Storage – Distributed Transactions – Commit
Protocols – Concurrency Control – Distributed Query Processing

Distributed DBMS - Distributed Databases

A distributed database is a collection of multiple interconnected databases,
spread physically across various locations, that communicate via a computer
network.

Features

 Databases in the collection are logically interrelated with each other.
Often they represent a single logical database.
 Data is physically stored across multiple sites. Data in each site can be
managed by a DBMS independent of the other sites.
 The processors at the sites are connected via a network; they do not form
a multiprocessor configuration.
 A distributed database is not a loosely connected file system.
 A distributed database incorporates transaction processing, but it is not
synonymous with a transaction processing system.
Advantages of Distributed Databases
Following are the advantages of distributed databases over centralized
databases.

Modular Development − If the system needs to be expanded to new locations
or new units, centralized database systems require substantial effort and
disruption to the existing functioning. In distributed databases, however, the
work simply requires adding new computers and local data at the new site and
connecting them to the distributed system, with no interruption to current
functions.

More Reliable − In case of database failures, the total system of centralized
databases comes to a halt. In distributed systems, however, when a component
fails, the system continues to function, possibly at reduced performance.
Hence a DDBMS is more reliable.

Better Response − If data is distributed in an efficient manner, user
requests can be met from local data itself, providing a faster response. In
centralized systems, by contrast, all queries have to pass through the central
computer for processing, which increases the response time.

Lower Communication Cost − In distributed database systems, if data is
located locally where it is mostly used, the communication costs for data
manipulation can be minimized. This is not feasible in centralized systems.

Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and
heterogeneous distributed database environments, each with further
sub-divisions.

Homogeneous Distributed Databases

In a homogeneous distributed database, all the sites use identical DBMS and
operating systems. Its properties are −

 The sites use very similar software.
 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to
process user requests.
 The database is accessed through a single interface as if it is a single
database.
Types of Homogeneous Distributed Database

There are two types of homogeneous distributed database −

 Autonomous − Each database is independent and functions on its own.
The databases are integrated by a controlling application and use message
passing to share data updates.
 Non-autonomous − Data is distributed across the homogeneous nodes
and a central or master DBMS co-ordinates data updates across the sites.

Heterogeneous Distributed Databases

In a heterogeneous distributed database, different sites have different operating
systems, DBMS products and data models. Its properties are −

 Different sites use dissimilar schemas and software.


 The system may be composed of a variety of DBMSs like relational,
network, hierarchical or object oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation
in processing user requests.

Types of Heterogeneous Distributed Databases

 Federated − The heterogeneous database systems are independent in
nature and are integrated together so that they function as a single database
system.
 Un-federated − The database systems employ a central coordinating
module through which the databases are accessed.

Distributed DBMS Architectures

DDBMS architectures are generally developed depending on three parameters −


 Distribution − It states the physical distribution of data across the
different sites.
 Autonomy − It indicates the distribution of control of the database
system and the degree to which each constituent DBMS can operate
independently.
 Heterogeneity − It refers to the uniformity or dissimilarity of the data
models, system components and databases.

Architectural Models

Some of the common architectural models are −

 Client - Server Architecture for DDBMS


 Peer - to - Peer Architecture for DDBMS
 Multi - DBMS Architecture

Client - Server Architecture for DDBMS

This is a two-level architecture where the functionality is divided into servers
and clients. The server functions primarily encompass data management, query
processing, optimization and transaction management. Client functions mainly
include the user interface, but clients may also perform some functions such as
consistency checking and transaction management.

The two different client-server architectures are −

 Single Server Multiple Client
 Multiple Server Multiple Client

Peer-to-Peer Architecture for DDBMS

In these systems, each peer acts both as a client and a server for imparting
database services. The peers share their resource with other peers and co-
ordinate their activities.

This architecture generally has four levels of schemas −

 Global Conceptual Schema − Depicts the global logical view of data.


 Local Conceptual Schema − Depicts logical data organization at each
site.
 Local Internal Schema − Depicts physical data organization at each site.
 External Schema − Depicts user view of data.
Multi - DBMS Architectures

This is an integrated database system formed by a collection of two or more


autonomous database systems.

Multi-DBMS can be expressed through six levels of schemas −

 Multi-database View Level − Depicts multiple user views, each comprising
a subset of the integrated distributed database.
 Multi-database Conceptual Level − Depicts the integrated multi-database,
comprising global logical multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across
different sites and the multi-database to local data mapping.
 Local database View Level − Depicts the public view of local data.
 Local database Conceptual Level − Depicts local data organization at
each site.
 Local database Internal Level − Depicts physical data organization at
each site.
There are two design alternatives for multi-DBMS −

 Model with multi-database conceptual level.
 Model without multi-database conceptual level.
What are distributed databases?

 Distributed database is a system in which storage devices are not
connected to a common processing unit.
 The database is controlled by a Distributed Database Management System,
and data may be stored at the same location or spread over the interconnected
network. It is a loosely coupled system.
 Shared-nothing architecture is used in distributed databases.
 In a typical distributed database system, a communication channel is used
to communicate between the different locations, and every system has its own
memory and database.

Goals of Distributed Database system.

Reliability: In a distributed database system, if one system fails or stops
working for some time, another system can complete the task.
Availability: In a distributed database system, availability is maintained even
if a server fails; another system is available to serve the client request.
Performance: Performance is achieved by distributing the database over
different locations, so that data is available close to every location, which
also makes the system easier to maintain.

Types of distributed databases.

The two types of distributed systems are as follows:

1. Homogeneous distributed databases system:

 A homogeneous distributed database system is a network of two or more
databases (with the same type of DBMS software) which can be stored on
one or more machines.
 In this system, data can be accessed and modified simultaneously on
several databases in the network. Homogeneous distributed systems are
easy to handle.

Example: Consider three departments, each using Oracle-9i as the DBMS.
If changes are made in one department's database, they are propagated to the
other departments as well.
2. Heterogeneous distributed database system.

 A heterogeneous distributed database system is a network of two or more
databases with different types of DBMS software, which can be stored on
one or more machines.
 In this system, data is accessible to the several databases in the network
with the help of generic connectivity (ODBC and JDBC).

Example: Different DBMS software at different sites are made accessible to
each other using ODBC and JDBC.

The basic types of distributed DBMS are as follows:


1. Client-server architecture of Distributed system.

 A client-server architecture has a number of clients and a few servers
connected in a network.
 A client sends a query to one of the servers. The earliest available server
solves it and replies.
 A client-server architecture is simple to implement and execute due to
the centralized server system.

2. Collaborating server architecture.

 Collaborating server architecture is designed to run a single query on
multiple servers.
 Servers break a single query into multiple small queries and the result is
sent to the client.
 Collaborating server architecture has a collection of database servers.
Each server is capable of executing the current transactions across the
databases.
Distributed DBMS - Concepts

Database and Database Management System

A database is an ordered collection of related data that is built for a specific
purpose. A database may be organized as a collection of multiple tables, where
a table represents a real-world element or entity. Each table has several
fields that represent the characteristic features of the entity.

Examples of DBMS Application Areas

 Automatic Teller Machines


 Train Reservation System
 Employee Management System
 Student Information System

Examples of DBMS Packages

 MySQL
 Oracle
 SQL Server
 dBASE
 FoxPro
 PostgreSQL, etc.

Types of DBMS

There are four types of DBMS.

Hierarchical DBMS

In hierarchical DBMS, the relationships among data in the database are


established so that one data element exists as a subordinate of another. The data
elements have parent-child relationships and are modelled using the “tree” data
structure. These are very fast and simple.
Network DBMS

Network DBMS is one where the relationships among data in the database are
of type many-to-many, in the form of a network. The structure is generally
complicated due to the existence of numerous many-to-many relationships.
Network DBMS is modelled using the “graph” data structure.

Relational DBMS

In relational databases, the database is represented in the form of relations. Each
relation models an entity and is represented as a table of values. In the relation
or table, a row is called a tuple and denotes a single record. A column is called a
field or an attribute and denotes a characteristic property of the entity. RDBMS
is the most popular database management system.

For example, a Student relation may have fields such as Roll, Name, Year and
Stream.

Object Oriented DBMS

Object-oriented DBMS is derived from the model of the object-oriented
programming paradigm. It is helpful in representing both consistent data, as
stored in databases, and transient data, as found in executing programs.
It uses small, reusable elements called objects. Each object contains a data
part and a set of operations that work upon the data. The object and its
attributes are accessed through pointers instead of being stored in relational
table models.

For example, a simplified Bank Account object may combine account data
(number, balance) with operations such as deposit and withdraw.

Distributed DBMS

A distributed database is a set of interconnected databases that is distributed


over the computer network or internet. A Distributed Database Management
System (DDBMS) manages the distributed database and provides mechanisms
so as to make the databases transparent to the users. In these systems, data is
intentionally distributed among multiple nodes so that all computing resources
of the organization can be optimally used.

Operations on DBMS

The four basic operations on a database are Create, Retrieve, Update and Delete.

 CREATE database structure and populate it with data − Creation of a
database relation involves specifying the data structures, data types and the
constraints of the data to be stored.

Example − SQL command to create a student table −

CREATE TABLE STUDENT (
   ROLL INTEGER PRIMARY KEY,
   NAME VARCHAR2(25),
   YEAR INTEGER,
   STREAM VARCHAR2(10)
);
 Once the data format is defined, the actual data is stored in accordance
with the format in some storage medium.

Example SQL command to insert a single tuple into the student table −

INSERT INTO STUDENT (ROLL, NAME, YEAR, STREAM)
VALUES (1, 'ANKIT JHA', 1, 'COMPUTER SCIENCE');

 RETRIEVE information from the database − Retrieving information
generally involves selecting a subset of a table or displaying data from the
table after some computations have been done. It is done by querying the
table.

Example − To retrieve the names of all students of the Computer Science
stream, the following SQL query needs to be executed −

SELECT NAME FROM STUDENT
WHERE STREAM = 'COMPUTER SCIENCE';

 UPDATE information stored and modify database structure − Updating a
table involves changing old values in the existing table’s rows to new
values.

Example − SQL command to change stream from Electronics to
Electronics and Communications −

UPDATE STUDENT
SET STREAM = 'ELECTRONICS AND COMMUNICATIONS'
WHERE STREAM = 'ELECTRONICS';

 Modifying the database means changing the structure of the table.
However, modification of the table is subject to a number of restrictions.

Example − To add a new field or column, say address, to the Student
table, we use the following SQL command −

ALTER TABLE STUDENT
ADD (ADDRESS VARCHAR2(50));

 DELETE information stored or delete a table as a whole − Deletion of
specific information involves removal of selected rows from the table that
satisfy certain conditions.

Example − To delete all students who are currently in 4th year when they
are passing out, we use the SQL command −

DELETE FROM STUDENT
WHERE YEAR = 4;

 Alternatively, the whole table may be removed from the database.

Example − To remove the student table completely, the SQL command
used is −

DROP TABLE STUDENT;

Distributed Data Storage

Consider a relation r that is to be stored in the database. There are two
approaches to storing this relation in the distributed database:

• Replication. The system maintains several identical replicas (copies) of the
relation, and stores each replica at a different site. The alternative to
replication is to store only one copy of relation r.

• Fragmentation. The system partitions the relation into several fragments, and
stores each fragment at a different site.
Data Replication

Data replication is the process of storing separate copies of the database at two
or more sites. It is a popular fault tolerance technique of distributed databases.

Advantages of Data Replication

 Reliability − In case of failure of any site, the database system continues
to work since a copy is available at another site(s).
 Reduction in Network Load − Since local copies of data are available,
query processing can be done with reduced network usage, particularly
during prime hours. Data updating can be done at non-prime hours.
 Quicker Response − Availability of local copies of data ensures quick
query processing and consequently quick response time.
 Simpler Transactions − Transactions require fewer joins of tables located
at different sites and minimal coordination across the network. Thus, they
become simpler in nature.
Disadvantages of Data Replication

 Increased Storage Requirements − Maintaining multiple copies of data
is associated with increased storage costs. The storage space required is a
multiple of the storage required for a centralized system.
 Increased Cost and Complexity of Data Updating − Each time a data
item is updated, the update needs to be reflected in all the copies of the
data at the different sites. This requires complex synchronization
techniques and protocols.
 Undesirable Application-Database Coupling − If complex update
mechanisms are not used, removing data inconsistency requires complex
coordination at the application level. This results in undesirable
application-database coupling.

Some commonly used replication techniques are −

 Snapshot replication
 Near-real-time replication
 Pull replication

Fragmentation

Fragmentation is the task of dividing a table into a set of smaller tables. The
subsets of the table are called fragments. Fragmentation can be of three types:
horizontal, vertical, and hybrid (combination of horizontal and vertical).
Horizontal fragmentation can further be classified into two techniques: primary
horizontal fragmentation and derived horizontal fragmentation.

Advantages of Fragmentation

 Since data is stored close to the site of usage, efficiency of the database
system is increased.
 Local query optimization techniques are sufficient for most queries since
data is locally available.
 Since irrelevant data is not available at the sites, security and privacy of
the database system can be maintained.

Disadvantages of Fragmentation

 When data from different fragments are required, the access speeds may
be very low.
 In case of recursive fragmentations, the job of reconstruction will need
expensive techniques.
 Lack of back-up copies of data in different sites may render the database
ineffective in case of failure of a site.

Types of data replication


There are two types of data replication:

1. Synchronous Replication:
In synchronous replication, the replica is modified immediately after changes
are made to the original relation, so there is no difference between the
original data and the replica.

2. Asynchronous Replication:
In asynchronous replication, the replica is modified only after the transaction
commits on the original database, so the replica may briefly lag behind.
Replication Schemes
The three replication schemes are as follows:
1. Full Replication
In a full replication scheme, the entire database is available at almost every
location or user in the communication network.

Advantages of full replication

 High availability of data, as the database is available at almost every
location.
 Faster execution of queries.

Disadvantages of full replication

 Concurrency control is difficult to achieve in full replication.
 Update operations are slower.

2. No Replication
No replication means each fragment is stored at exactly one location.

Advantages of no replication

 Concurrency conflicts are minimized.
 Easy recovery of data.

Disadvantages of no replication

 Poor availability of data.
 Slower query execution, as multiple clients access the same server.
3. Partial replication
Partial replication means only some fragments of the database are replicated.

Advantages of partial replication

The number of replicas created for a fragment depends upon the importance of
the data in that fragment.
Vertical Fragmentation

In vertical fragmentation, the fields or columns of a table are grouped into
fragments. In order to maintain reconstructiveness, each fragment should
contain the primary key field(s) of the table. Vertical fragmentation can be used
to enforce privacy of data.

For example, let us consider that a University database keeps records of all
registered students in a Student table having the following schema.

STUDENT

Regd_No Name Course Address Semester Fees Marks

Now, the fees details are maintained in the accounts section. In this case, the
designer will fragment the database as follows −

CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;
Horizontal Fragmentation

Horizontal fragmentation groups the tuples of a table according to the values of
one or more fields. Horizontal fragmentation should also conform to the rule of
reconstructiveness. Each horizontal fragment must have all columns of the
original base table.

For example, in the student schema, if the details of all students of the
Computer Science course need to be maintained at the School of Computer
Science, the designer will horizontally fragment the database as follows −

CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE COURSE = 'Computer Science';

Hybrid Fragmentation

In hybrid fragmentation, a combination of horizontal and vertical fragmentation
techniques is used. This is the most flexible fragmentation technique since it
generates fragments with minimal extraneous information. However,
reconstruction of the original table is often an expensive task.

Hybrid fragmentation can be done in two alternative ways −

 First generate a set of horizontal fragments; then generate vertical
fragments from one or more of the horizontal fragments.
 First generate a set of vertical fragments; then generate horizontal
fragments from one or more of the vertical fragments.
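
To make the reconstructiveness rule concrete, here is a minimal sketch in
Python, using an in-memory SQLite database in place of a real distributed
DBMS. The fragment names STD_INFO and OTHER_STD, and the sample rows, are
assumptions made for this illustration; STD_FEES and COMP_STD follow the
examples above.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE STUDENT (
    Regd_No INTEGER PRIMARY KEY, Name TEXT, Course TEXT,
    Address TEXT, Semester INTEGER, Fees INTEGER, Marks INTEGER)""")
con.execute("INSERT INTO STUDENT VALUES (1, 'Asha', 'Computer Science', 'Chennai', 3, 50000, 88)")
con.execute("INSERT INTO STUDENT VALUES (2, 'Ravi', 'Electronics', 'Madurai', 5, 45000, 79)")

# Vertical fragmentation: every fragment keeps the primary key (Regd_No)
# so that the original relation can be rebuilt by a join.
con.execute("CREATE TABLE STD_FEES AS SELECT Regd_No, Fees FROM STUDENT")
con.execute("CREATE TABLE STD_INFO AS SELECT Regd_No, Name, Course, Address, Semester, Marks FROM STUDENT")

# Horizontal fragmentation: the rows are partitioned by a predicate on Course.
con.execute("CREATE TABLE COMP_STD AS SELECT * FROM STUDENT WHERE Course = 'Computer Science'")
con.execute("CREATE TABLE OTHER_STD AS SELECT * FROM STUDENT WHERE Course <> 'Computer Science'")

# Reconstruction: join the vertical fragments, union the horizontal ones.
rebuilt_v = con.execute("""SELECT i.Regd_No, i.Name, i.Course, i.Address, i.Semester, f.Fees, i.Marks
    FROM STD_INFO i JOIN STD_FEES f ON i.Regd_No = f.Regd_No
    ORDER BY i.Regd_No""").fetchall()
rebuilt_h = con.execute("SELECT * FROM COMP_STD UNION SELECT * FROM OTHER_STD ORDER BY Regd_No").fetchall()
original  = con.execute("SELECT * FROM STUDENT ORDER BY Regd_No").fetchall()
print(rebuilt_v == rebuilt_h == original)   # True: both fragmentations are lossless

Because each vertical fragment keeps the key and the horizontal predicates
partition the rows, the join and the union rebuild the original relation exactly.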

Distributed Transactions

A transaction is a program including a collection of database operations,
executed as a logical unit of data processing. The operations performed in a
transaction include one or more database operations such as insert, delete,
update or retrieve data. It is an atomic process that is either performed to
completion entirely or is not performed at all. A transaction involving only
data retrieval without any data update is called a read-only transaction.

Each high-level operation can be divided into a number of low-level tasks or
operations. For example, a data update operation can be divided into three
tasks −

 read_item() − reads the data item from storage to main memory.
 modify_item() − changes the value of the item in main memory.
 write_item() − writes the modified value from main memory to storage.

Database access is restricted to read_item() and write_item() operations.
Likewise, for all transactions, read and write form the basic database
operations.

Transaction Operations

The low level operations performed in a transaction are −

 begin_transaction − A marker that specifies the start of transaction
execution.
 read_item or write_item − Database operations that may be interleaved
with main memory operations as part of the transaction.
 end_transaction − A marker that specifies the end of the transaction.
 commit − A signal to specify that the transaction has been successfully
completed in its entirety and will not be undone.
 rollback − A signal to specify that the transaction has been unsuccessful
and so all temporary changes in the database are undone. A committed
transaction cannot be rolled back.

Transaction States

A transaction may go through a subset of five states: active, partially
committed, committed, failed and aborted.

 Active − The initial state where the transaction enters is the active state.
The transaction remains in this state while it is executing read, write or
other operations.
 Partially Committed − The transaction enters this state after the last
statement of the transaction has been executed.
 Committed − The transaction enters this state after successful
completion of the transaction and system checks have issued commit
signal.
 Failed − The transaction goes from partially committed state or active
state to failed state when it is discovered that normal execution can no
longer proceed or system checks fail.
 Aborted − This is the state after the transaction has been rolled back after
failure and the database has been restored to its state that was before the
transaction began.

The low-level transaction operations listed above cause the transitions
between these five states.
Desirable Properties of Transactions

Any transaction must maintain the ACID properties, viz. Atomicity,
Consistency, Isolation, and Durability.

 Atomicity − This property states that a transaction is an atomic unit of


processing, that is, either it is performed in its entirety or not performed at
all. No partial update should exist.
 Consistency − A transaction should take the database from one
consistent state to another consistent state. It should not adversely affect
any data item in the database.
 Isolation − A transaction should be executed as if it is the only one in the
system. There should not be any interference from the other concurrent
transactions that are simultaneously running.
 Durability − If a committed transaction brings about a change, that
change should be durable in the database and not lost in case of any
failure.

Example: Consider a transaction T consisting of T1 and T2. Account A holds
Rs 600 and account B holds Rs 300; T transfers Rs 100 from account A to
account B.

T1            T2
Read(A)       Read(B)
A := A - 100  B := B + 100
Write(A)      Write(B)
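
As a minimal sketch of atomicity, the transfer above can be run as a single
transaction. SQLite is used purely for illustration; the accounts table and the
transfer helper are assumptions made for this demo.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 600), ("B", 300)])
con.commit()

def transfer(con, src, dst, amount):
    try:
        con.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        # If a failure occurred here, the rollback below would restore A to 600:
        # either both writes happen or neither does (atomicity).
        con.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        con.commit()       # commit: the change becomes durable
    except Exception:
        con.rollback()     # rollback: all temporary changes are undone
        raise

transfer(con, "A", "B", 100)
print(con.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('A', 500), ('B', 400)] -- the total is still Rs 900 (consistency preserved)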


Schedule

A series of operations from one transaction to another transaction is known as
a schedule. It is used to preserve the order of the operations in each of the
individual transactions.

1. Serial Schedule

The serial schedule is a type of schedule where one transaction is executed
completely before starting another transaction. In the serial schedule, when the
first transaction completes its cycle, the next transaction is executed.

For example, suppose there are two transactions T1 and T2 with some
operations. If there is no interleaving of operations, there are the following
two possible outcomes:

1. Execute all the operations of T1 followed by all the operations of T2.
2. Execute all the operations of T2 followed by all the operations of T1.

Schedule A is the serial schedule where T1 is followed by T2, and Schedule B
is the serial schedule where T2 is followed by T1.

2. Non-serial Schedule

 If interleaving of operations is allowed, the schedule is non-serial.
 It contains many possible orders in which the system can execute the
individual operations of the transactions.
 Schedules C and D, which interleave operations of T1 and T2, are
examples of non-serial schedules.
3. Serializable schedule

 Serializability of schedules is used to find non-serial schedules that
allow transactions to execute concurrently without interfering with one
another.
 It identifies which schedules are correct when the executions of the
transactions have interleaving of their operations.
 A non-serial schedule is serializable if its result is equal to the result
of its transactions executed serially.
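
One standard way to test this is conflict serializability: build a precedence
graph from the conflicting operations and check it for cycles. The following is
a minimal Python sketch, assuming a schedule encoded as (transaction,
operation, item) triples; the encoding is an illustration, not part of the text
above.

def is_conflict_serializable(schedule):
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # Two operations conflict if they touch the same item, belong to
            # different transactions, and at least one of them is a write.
            if x == y and ti != tj and "w" in (op_i, op_j):
                edges.add((ti, tj))   # Ti must precede Tj in an equivalent serial order
    nodes = {t for t, _, _ in schedule}
    # The schedule is serializable iff the precedence graph is acyclic:
    # repeatedly remove nodes that have no incoming edge.
    while nodes:
        sources = {n for n in nodes if not any(dst == n for _, dst in edges)}
        if not sources:
            return False              # a cycle remains
        nodes -= sources
        edges = {(u, v) for (u, v) in edges if u in nodes and v in nodes}
    return True

# Interleaved schedule: r1(A) w1(A) r2(A) w2(A) r1(B) w1(B) r2(B) w2(B)
s = [("T1", "r", "A"), ("T1", "w", "A"), ("T2", "r", "A"), ("T2", "w", "A"),
     ("T1", "r", "B"), ("T1", "w", "B"), ("T2", "r", "B"), ("T2", "w", "B")]
print(is_conflict_serializable(s))    # True: equivalent to the serial order T1, T2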
Commit Protocols
When the controlling site receives “DONE” message from each slave, it
makes a decision to commit or abort. This is called the commit point. Then, it
sends this message to all the slaves. On receiving this message, a slave either
commits or aborts and then sends an acknowledgement message to the
controlling site

Two-Phase Commit Protocol

At the heart of every distributed system is a consensus algorithm. Consensus is
an act wherein a system of processes agrees upon a value or decision. Let us
look at two famous consensus protocols, namely the two-phase and three-phase
commits, widely in use with database servers.
Consensus has three characteristics:

Agreement − All nodes decide on the same value.

Validity − The value that is decided upon should have been proposed by some
process.

Termination − A decision should eventually be reached.

Two phase commit

This protocol requires a coordinator. The client contacts the coordinator and
proposes a value. The coordinator then tries to establish consensus among a
set of processes in two phases, hence the name.

1. In the first phase, the coordinator contacts all the participants, suggests
the value proposed by the client, and solicits their responses.

2. After receiving all the responses, the coordinator decides to commit if all
participants agreed upon the value, or to abort if someone disagrees.

3. In the second phase, the coordinator contacts all participants again and
communicates the commit or abort decision.

We can see that all the above-mentioned conditions are met. Agreement holds
because the participants only make a yes or no decision on the value proposed
by the coordinator and do not propose values. Validity holds because the same
decision to commit or abort is enforced by the coordinator on all participants.
Termination is guaranteed only if all participants communicate their responses
to the coordinator; hence the protocol is prone to failures.
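
The coordinator's side of the protocol is small enough to sketch directly. The
following Python sketch simulates participants with plain objects; a real
DDBMS would exchange these messages over the network and record them in
write-ahead logs.

def two_phase_commit(participants, value):
    # Phase 1 (voting): propose the value and collect the YES/NO votes.
    votes = [p.prepare(value) for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2 (decision): broadcast the same decision to every participant.
    for p in participants:
        p.finish(decision)
    return decision

class Participant:
    def __init__(self, will_agree=True):
        self.will_agree = will_agree
        self.state = "init"
    def prepare(self, value):
        self.state = "ready" if self.will_agree else "aborted"
        return self.will_agree            # the YES/NO vote
    def finish(self, decision):
        self.state = decision

print(two_phase_commit([Participant(), Participant()], 42))         # commit
print(two_phase_commit([Participant(), Participant(False)], 42))    # abort

If the coordinator crashes between the two phases, participants that voted yes
are left holding their locks, which is exactly the blocking problem discussed
later in the Two Marks section.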

When speaking about failures, what are the types of failures of a node?

Fail-Stop model: Nodes just crash and do not recover at all.

Fail-Recover model: Nodes crash, recover after a certain time, and continue
executing.

Three-phase commit

This is an extension of two-phase commit wherein the commit phase is split into
two phases as follows.

a. Prepare-to-commit: after unanimously receiving yes in the first phase of
2PC, the coordinator asks all participants to prepare to commit. During this
phase, all participants acquire locks and so on, but they do not actually
commit.

b. If the coordinator receives yes from all participants during the
prepare-to-commit phase, it then asks all participants to commit.

The pre-commit phase introduced above helps us to recover from the case where
a participant fails, or both coordinator and participant nodes fail, during
the commit phase. When a recovery coordinator takes over after a coordinator
failure during phase 2 of the previous 2PC run, the new pre-commit phase comes
in handy as follows. On querying participants, if it learns that some nodes are
in the commit phase, it assumes that the previous coordinator made the decision
to commit before crashing; hence it can shepherd the protocol to commit.
Similarly, if a participant says that it did not receive prepare-to-commit, the
new coordinator can assume that the previous coordinator failed even before it
started the prepare-to-commit phase. Hence it can safely assume that no other
participant has committed the changes, and safely abort the transaction.

Concurrency Control

Concurrency Control is the working concept that is required for controlling and
managing the concurrent execution of database operations and thus avoiding the
inconsistencies in the database. Thus, for maintaining the concurrency of the
database, we have the concurrency control protocols.

Concurrency Control Protocols

The concurrency control protocols ensure the atomicity, consistency, isolation,
durability and serializability of the concurrent execution of database
transactions. These protocols are categorized as:
 Lock Based Concurrency Control Protocol
 Time Stamp Concurrency Control Protocol
 Validation Based Concurrency Control Protocol

Lock-Based Protocol

In this type of protocol, a transaction cannot read or write data until it
acquires an appropriate lock on it. There are two types of locks:

1. Shared lock:

 It is also known as a read-only lock. Under a shared lock, the data item
can only be read by the transaction.
 It can be shared between transactions because when a transaction holds a
shared lock, it cannot update the data item.

2. Exclusive lock:

 Under an exclusive lock, the data item can be both read and written by
the transaction.
 This lock is exclusive: multiple transactions cannot modify the same data
simultaneously.

Two-phase locking (2PL)

 The two-phase locking protocol divides the execution phase of the
transaction into three parts.
 In the first part, when the execution of the transaction starts, it seeks
permission for the locks it requires.
 In the second part, the transaction acquires all the locks. The third part
starts as soon as the transaction releases its first lock.
 In the third part, the transaction cannot demand any new locks. It only
releases the acquired locks.

There are two phases of 2PL:

Growing phase: In the growing phase, a new lock on a data item may be
acquired by the transaction, but none can be released.

Shrinking phase: In the shrinking phase, existing locks held by the transaction
may be released, but no new locks can be acquired.

If lock conversion is allowed, then:
1. Upgrading a lock (from S(a) to X(a)) is allowed in the growing phase.
2. Downgrading a lock (from X(a) to S(a)) must be done in the shrinking
phase.

Example:

Consider a schedule in which two transactions T1 and T2 lock and unlock items
under 2PL. The lock point is the moment a transaction acquires its final lock.

Transaction T1:

 Growing phase: steps 1-3
 Shrinking phase: steps 5-7
 Lock point: at step 3

Transaction T2:

 Growing phase: steps 2-6
 Shrinking phase: steps 8-9
 Lock point: at step 6
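
A minimal sketch of the discipline in Python: locks may be acquired only
before the lock point, and the first unlock starts the shrinking phase. The
class and names are assumptions made for illustration.

class TwoPhaseTransaction:
    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False          # becomes True at the lock point

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError(f"{self.name}: cannot lock {item} after the lock point")
        self.locks.add(item)            # growing phase: acquire, never release

    def unlock(self, item):
        self.shrinking = True           # first release: shrinking phase begins
        self.locks.discard(item)

t1 = TwoPhaseTransaction("T1")
t1.lock("A"); t1.lock("B")              # growing phase
t1.unlock("A"); t1.unlock("B")          # shrinking phase
# t1.lock("C") would now raise: 2PL forbids acquiring locks after the first unlock.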

Time Stamp Concurrency Control Protocol

1. Whenever a transaction Ti issues a Read(X) operation, check the following
conditions:

 If W_TS(X) > TS(Ti), then reject the operation (a younger transaction has
already written X) and roll Ti back.
 If W_TS(X) <= TS(Ti), then execute the operation and set R_TS(X) to
max(R_TS(X), TS(Ti)).

2. Whenever a transaction Ti issues a Write(X) operation, check the following
conditions:

 If TS(Ti) < R_TS(X), then reject the operation and roll Ti back.
 If TS(Ti) < W_TS(X), then reject the operation and roll Ti back;
otherwise, execute the operation and set W_TS(X) to TS(Ti).

where

TS(Ti) denotes the timestamp of transaction Ti,

R_TS(X) denotes the read timestamp of data item X, and

W_TS(X) denotes the write timestamp of data item X.
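
These rules translate directly into code. A minimal sketch, assuming integer
timestamps and representing a rejected operation (which would roll Ti back and
restart it with a new timestamp) by returning False:

R_TS, W_TS = {}, {}    # read/write timestamps per data item, 0 when unset

def read(ti_ts, x):
    if W_TS.get(x, 0) > ti_ts:
        return False                        # a younger transaction already wrote X
    R_TS[x] = max(R_TS.get(x, 0), ti_ts)    # remember the youngest reader
    return True

def write(ti_ts, x):
    if ti_ts < R_TS.get(x, 0) or ti_ts < W_TS.get(x, 0):
        return False                        # a younger transaction already read or wrote X
    W_TS[x] = ti_ts
    return True

print(write(10, "X"))    # True:  W_TS(X) = 10
print(read(5, "X"))      # False: TS 5 must not see a value written at TS 10
print(read(20, "X"))     # True:  R_TS(X) = 20
print(write(15, "X"))    # False: TS 15 < R_TS(X) = 20, so the write is rejected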

Validation Based Protocol

The validation-based protocol is also known as the optimistic concurrency
control technique. In the validation-based protocol, a transaction executes in
the following three phases:

1. Read phase: In this phase, transaction T reads the values of the various
data items and stores them in temporary local variables. It performs all its
write operations on the temporary variables without updating the actual
database.
2. Validation phase: In this phase, the temporary variable values are
validated against the actual data to see whether they violate serializability.
3. Write phase: If the validation of the transaction succeeds, the temporary
results are written to the database; otherwise, the transaction is rolled back.

Each phase has an associated timestamp:

Start(Ti): the time when Ti started its execution.

Validation(Ti): the time when Ti finished its read phase and started its
validation phase.

Finish(Ti): the time when Ti finished its write phase.

 This protocol determines the timestamp used to serialize the transaction
from the time of the validation phase, as that is the phase which decides
whether the transaction will commit or roll back.
 Hence TS(T) = Validation(T).
 Serializability is determined during the validation process; it cannot be
decided in advance.
 While executing transactions, this protocol provides a greater degree of
concurrency with fewer conflicts.
 Thus it results in transactions with fewer rollbacks.
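
A minimal sketch of the validation test built on these timestamps. The
dictionary encoding of a transaction's phase times and read/write sets is an
assumption for illustration; T passes validation against each
earlier-validated transaction S if they did not overlap at all, or if S
finished writing before T's validation and wrote nothing that T read.

def validate(t, committed):
    for s in committed:
        if s["finish"] < t["start"]:
            continue        # S finished before T started: no overlap at all
        if s["finish"] < t["validation"] and not (s["write_set"] & t["read_set"]):
            continue        # nothing S wrote could have been read by T
        return False        # otherwise T must be rolled back
    return True

s = {"start": 1, "validation": 2, "finish": 3,
     "read_set": {"A"}, "write_set": {"A"}}
t = {"start": 2, "validation": 4, "finish": None,
     "read_set": {"B"}, "write_set": {"B"}}
print(validate(t, [s]))     # True: S's write set does not intersect T's read set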

Conflict Graphs

Another method is to create conflict graphs. For this, transaction classes are
defined. A transaction class contains two sets of data items, called the read
set and the write set. A transaction belongs to a particular class if the
transaction’s read set is a subset of the class’s read set and the
transaction’s write set is a subset of the class’s write set. In the read
phase, each transaction issues its read requests for the data items in its read
set. In the write phase, each transaction issues its write requests.

A conflict graph is created for the classes to which active transactions belong.
This contains a set of vertical, horizontal, and diagonal edges. A vertical edge
connects two nodes within a class and denotes conflicts within the class. A
horizontal edge connects two nodes across two classes and denotes a write-write
conflict among different classes. A diagonal edge connects two nodes across
two classes and denotes a write-read or a read-write conflict among two classes.

The conflict graphs are analyzed to ascertain whether two transactions within
the same class or across two different classes can be run in parallel.

Distributed databases - Query processing and Optimization

Query Processing

Query processing refers to the range of activities involved in extracting data
from a database. The basic steps are:

1. Parsing and translation.
2. Optimization.
3. Evaluation.
Suppose a user executes a query. As we have learned, there are various methods
of extracting data from the database. In SQL, suppose a user wants to fetch the
names of the employees whose salary is greater than 10000. For doing this, the
following query is issued:

select emp_name from Employee where salary > 10000;

To make the system understand the user query, it needs to be translated into
relational algebra. This query can be brought into relational algebra form in
more than one way, for example:

 π emp_name (σ salary>10000 (Employee))
 π emp_name (σ salary>10000 (π emp_name, salary (Employee)))

After translating the given query, we can execute each relational algebra
operation using different algorithms. In this way, query processing begins its
work.

Evaluation

For this, in addition to the relational algebra translation, it is required to
annotate the translated relational algebra expression with the instructions
used for specifying and evaluating each operation. Thus, after translating the
user query, the system executes a query evaluation plan.
Query Evaluation Plan

 In order to fully evaluate a query, the system needs to construct a query
evaluation plan.
 The annotations in the evaluation plan may refer to the algorithms to be
used for the particular index or the specific operations.
 Such relational algebra with annotations is referred to as Evaluation
Primitives. The evaluation primitives carry the instructions needed for
the evaluation of the operation.
 Thus, a query evaluation plan defines a sequence of primitive operations
used for evaluating a query. The query evaluation plan is also referred to
as the query execution plan.
 A query execution engine is responsible for generating the output of the
given query. It takes the query execution plan, executes it, and finally
makes the output for the user query.
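
As a small concrete illustration, SQLite exposes its chosen evaluation plan
through EXPLAIN QUERY PLAN, which can be inspected from Python; the Employee
table here mirrors the earlier salary example.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employee (emp_name TEXT, salary INTEGER)")
con.execute("INSERT INTO Employee VALUES ('A', 12000), ('B', 8000)")

query = "SELECT emp_name FROM Employee WHERE salary > 10000"
for row in con.execute("EXPLAIN QUERY PLAN " + query):
    print(row)     # a full table scan: there is no index on salary yet

con.execute("CREATE INDEX idx_salary ON Employee (salary)")
for row in con.execute("EXPLAIN QUERY PLAN " + query):
    print(row)     # now a search using idx_salary: a cheaper evaluation plan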

Optimization

 The cost of query evaluation can vary for different types of queries.
Since the system is responsible for constructing the evaluation plan, the
user does not need to write the query efficiently.
 Usually, a database system generates an efficient query evaluation plan,
which minimizes its cost. This type of task, performed by the database
system, is known as query optimization.
 For optimizing a query, the query optimizer should have an estimated
cost analysis of each operation, because the overall operation cost
depends on the memory allocated to the various operations, execution
costs, and so on.

Finally, after selecting an evaluation plan, the system evaluates the query and
produces the output of the query.

Each SQL query can itself be translated into a relational-algebra expression
in one of several ways. Furthermore, the relational-algebra representation of
a query specifies only partially how to evaluate the query; there are usually
several ways to evaluate relational-algebra expressions. As an illustration,
consider the query:

select salary
from instructor
where salary < 75000;
DDBMS processes and optimizes a query in terms of communication cost of
processing a distributed query and other parameters.

Various factors are considered while processing a query:

Costs of data transfer

 This is a very important factor in processing distributed queries.
Intermediate data is transferred to other locations for processing, and the
final result is sent back to the location where the query was issued.
 The cost of data transfer decreases if the locations are connected via
high-performance communication channels.
 DDBMS query optimization algorithms are used to minimize the cost of
data transfer.

Semi-join based query optimization

 A semi-join is used to reduce the number of tuples of a relation before
transferring it to another location.
 Only the joining columns are transferred first in this method.
 This method reduces the cost of data transfer.
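
A minimal sketch of semi-join reduction, with the two sites simulated as
Python data structures; the Emp and Dept relations and their contents are
assumptions made for illustration.

emp_site  = [(1, "Asha", 10), (2, "Ravi", 20), (3, "Mala", 30)]   # (id, name, dno) at site 1
dept_site = {10: "Sales", 20: "HR"}                               # dno -> dname at site 2

# Step 1: site 2 ships only its joining column (the dept numbers) to site 1.
join_column = set(dept_site)                                      # {10, 20}

# Step 2: site 1 ships back only the employee tuples that will actually join.
reduced_emp = [e for e in emp_site if e[2] in join_column]

# Step 3: the final join runs at site 2 over the reduced relation.
result = [(eid, name, dept_site[dno]) for (eid, name, dno) in reduced_emp]
print(result)    # [(1, 'Asha', 'Sales'), (2, 'Ravi', 'HR')]

Only the join column travels in step 1 and only the matching tuples travel in
step 2, which is where the saving in data transfer comes from.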

Cost based query optimization

 Query optimization involves many operations such as selection,
projection and aggregation.
 The cost of communication is considered in query optimization.
 In a centralized database system, information about relations at remote
locations is obtained from the server system catalogs.

Distributed Transactions

 A distributed database management system should be able to survive a
system failure without losing any data in the database.
 This property is provided by transaction processing.
 A local transaction works only on its own (local) location; a transaction
that accesses data at other locations is considered a global transaction.
 Transactions are assigned to a transaction monitor, which works as a
supervisor.
 Transaction processing is very useful for concurrent execution and
recovery of data.
Two Marks
1. What are the two phases of 2PC protocol?
Phase 1: Obtaining the decision – whether to commit or abort a transaction T.
Phase 2: Recording the decision – implement the decision taken in phase 1.
2. What is the major disadvantage of 2PC protocol?
The major disadvantage of two phase commit protocol is Blocking problem.
Blocking problem – This happens if the transaction coordinator fails. In case of
coordinator failure, 2PC uses various ways to complete the transaction (to abort or
commit). It can be done using the messages found in the log files of participating sites. If
all the participating sites have a <ready T> message and no other control messages (such as
<abort T> or <commit T>), then all the active sites participating in that
transaction have to wait for their transaction coordinator to recover. This indefinite
blocking with locks held is called the blocking problem.
3. Define replication
Creating a copy of existing data or table is called replication. It increases availability at the cost of
redundancy.
4. List down some advantages of DDBMSs.
 Reflects organizational structure
 Improved ability to share and local autonomy
 Improved availability
 Improved reliability
 Improved performance
 Economy in deploying software and hardware resources
 Modular growth of the system
 Integration

5. What are the messages used by the 2 Phase Commit protocol?
<prepare T>
<no T>
<ready T>
<abort T>
<commit T>
6. List various failures that are possible in a distributed DBMS.
Loss of message
Communication link failure
Site failure.
7. What is a distributed database?
A distributed database is a database that consists of two or more files located in different sites
either on the same network or on entirely different networks. Portions of the database are stored in
multiple physical locations and processing is distributed among multiple database nodes.

8. What is a DDBMS?
A distributed database management system (DDBMS) integrates data logically so it
can be managed as if it were all stored in the same location. The DDBMS synchronizes all the
data periodically and ensures that data updates and deletes performed at one location will be
automatically reflected in the data stored elsewhere.
Karpaga Vinayaga College of Engineering and Technology
Master of Computer Applications
MC4202 - ADVANCED DATABASE TECHNOLOGY

UNIT II
SPATIAL AND TEMPORAL DATABASES
Active Databases Model – Design and Implementation Issues – Temporal Databases –
Temporal Querying – Spatial Databases: Spatial Data Types, Spatial Operators and Queries –
Spatial Indexing and Mining – Applications – Mobile Databases: Location and Handoff
Management, Mobile Transaction Models – Deductive Databases – Multimedia Databases.

Active Databases

An active database is a database consisting of a set of triggers.

Triggers are executed when a specified condition occurs during
insert/delete/update operations. Triggers are actions that fire automatically
based on these conditions.

Features of Active Database:

1. It possesses all the concepts of a conventional database, i.e. data
modelling facilities, query language, etc.
2. It supports all the functions of a traditional database, such as data
definition, data manipulation, storage management, etc.
3. It supports the definition and management of ECA rules.
4. It detects event occurrences.
5. It must be able to evaluate conditions and to execute actions.

Event-Condition-Action (ECA) Model


 Triggers follow the Event-Condition-Action (ECA) model.
 Event:
 A database modification (e.g., insert, delete, update).
 Condition:
 Any true/false expression.
 Optional: if no condition is specified, the condition is always
true.
 Action:
 A sequence of SQL statements that will be automatically executed.
Trigger syntax:

CREATE TRIGGER <trigger name>

( AFTER | BEFORE ) <triggering events> ON <table name>

[ FOR EACH ROW ]

[ WHEN <condition> ]

<trigger actions> ;

<trigger event> ::= INSERT | DELETE | UPDATE [ OF <column
name> { , <column name> } ]

<trigger action> ::= <PL/SQL block>

Trigger Example
 When a new employee is added to a department, modify the Total_sal of
the department to include the new employee's salary. Logically this means
we CREATE a TRIGGER, call it Total_sal1, that executes AFTER INSERT ON
the Employee table, FOR EACH ROW, WHEN NEW.Dno is NOT NULL, and UPDATEs
DEPARTMENT by setting the new Total_sal to the sum of the old Total_sal
and NEW.Salary WHERE the Dno matches NEW.Dno:

R1: CREATE TRIGGER Total_sal1
AFTER INSERT ON EMPLOYEE
FOR EACH ROW
WHEN (NEW.Dno IS NOT NULL)
UPDATE DEPARTMENT
SET Total_sal = Total_sal + NEW.Salary
WHERE Dno = NEW.Dno;
CREATE or ALTER TRIGGER
 CREATE TRIGGER <name>
 Creates a trigger
 ALTER TRIGGER <name>
 Alters a trigger (assuming one exists)
 CREATE OR ALTER TRIGGER <name>
 Creates a trigger if one does not exist
 Alters a trigger if one does exist
 Works in both cases, whether a trigger exists or not
 AFTER
 Executes after the event
 BEFORE
 Executes before the event
 INSTEAD OF
 Executes instead of the event
 Note that event does not execute in this case
R3: CREATE TRIGGER Total_sal3
AFTER DELETE ON EMPLOYEE
FOR EACH ROW
WHEN (OLD.Dno IS NOT NULL)
UPDATE DEPARTMENT
SET Total_sal = Total_sal - OLD.Salary
WHERE Dno = OLD.Dno;
Design and Implementation Issues for Active Databases

The previous section gave an overview of some of the main concepts for
specifying active rules. In this section, we discuss some additional issues
concerning how rules are designed and implemented. The first issue concerns
activation, deactivation, and grouping of rules.

The second issue concerns whether the triggered action should be executed
before, after, instead of, or concurrently with the triggering event. A before
trigger executes the trigger before executing the event that caused the trigger. It
can be used in applications such as checking for constraint violations. An after
trigger executes the trigger after executing the event, and it can be used in
applications such as maintaining derived data and monitoring for specific events
and conditions. An instead of trigger executes the trigger instead of executing
the event, and it can be used in applications such as executing corresponding
updates on base relations in response to an event that is an update of a view.

The rule condition evaluation is also known as rule consideration, since the
action is to be executed only after considering whether the condition evaluates
to true or false. There are three main possibilities for rule consideration:

Immediate consideration. The condition is evaluated as part of the same
transaction as the triggering event, and is evaluated immediately. This case can
be further categorized into three options:

Evaluate the condition before executing the triggering event.
Evaluate the condition after executing the triggering event.
Evaluate the condition instead of executing the triggering event.

Deferred consideration. The condition is evaluated at the end of the
transaction that included the triggering event. In this case, there could be
many triggered rules waiting to have their conditions evaluated.

Detached consideration. The condition is evaluated as a separate
transaction.

The next set of options concerns the relationship between evaluating the rule
condition and executing the rule action. Here, again, three options are possible:
immediate, deferred, or detached execution. Most active systems use the first
option. That is, as soon as the condition is evaluated, if it returns true, the action
is immediately executed.

Temporal Database Concepts

Time Representation, Calendars, and Time Dimensions

Temporal Aspects

There are two different aspects of time in temporal databases.

 Valid time: the time period during which a fact is true in the real world,
provided to the system.
 Transaction time: the time when the information from a certain
transaction becomes valid in the database.
Temporal Relation

A temporal relation is one where each tuple has an associated time: valid time,
transaction time, or both.

 Uni-temporal relations: have one axis of time, either valid time or
transaction time.
 Bi-temporal relations: have both axes of time, valid time and transaction
time, including valid start time, valid end time, transaction start time
and transaction end time.

 Point events
 Single time point event
 E.g., bank deposit
 Series of point events can form a time series data
 Duration events
 Associated with specific time period
 Time period is represented by start time and end time
 Transaction time
 The time when the information from a certain transaction becomes
valid
 Bitemporal database
 Databases dealing with two time dimensions

Incorporating Time in Relational Databases Using Tuple Versioning

 Add to every tuple:
 Valid start time
 Valid end time
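
A minimal sketch of tuple versioning, using SQLite. The EMP_VT table, the
vst/vet column names, and the '9999-12-31' marker for "valid until changed"
are conventions assumed for this example.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EMP_VT (name TEXT, salary INTEGER, vst TEXT, vet TEXT)")
con.execute("INSERT INTO EMP_VT VALUES ('Smith', 25000, '2002-06-15', '9999-12-31')")

# A salary change does not overwrite history: it closes the current version
# and inserts a new one, so past states remain queryable.
con.execute("UPDATE EMP_VT SET vet = '2003-05-31' WHERE name = 'Smith' AND vet = '9999-12-31'")
con.execute("INSERT INTO EMP_VT VALUES ('Smith', 30000, '2003-06-01', '9999-12-31')")

# Temporal query: what was Smith's salary on 2003-01-01?
print(con.execute("""SELECT salary FROM EMP_VT
                     WHERE name = 'Smith'
                       AND vst <= '2003-01-01' AND '2003-01-01' <= vet""").fetchone())
# (25000,) -- the version valid on that date, not the current salary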


Temporal Data Types
MySQL provides data types for storing different kinds of temporal information.
In the following descriptions, the terms YYYY, MM, DD, hh, mm, and ss stand
for a year, month, day of month, hour, minute, and second value, respectively.

The following table summarizes the storage requirements and ranges for the
date and time data types.

Type        Storage Required    Range
DATE        3 bytes             '1000-01-01' to '9999-12-31'
TIME        3 bytes             '-838:59:59' to '838:59:59'
DATETIME    8 bytes             '1000-01-01 00:00:00' to '9999-12-31 23:59:59'
TIMESTAMP   4 bytes             '1970-01-01 00:00:00' to mid-year 2037
YEAR        1 byte              1901 to 2155 (for YEAR(4)), 1970 to 2069 (for YEAR(2))

Each temporal data type also has a "zero" value that is used when you attempt
to store an illegal value. The "zero" value is represented in a format
appropriate for the type, such as '0000-00-00' for DATE values and '00:00:00'
for TIME values.

The DATE, TIME, DATETIME, and YEAR Data Types

The DATE data type represents date values in 'YYYY-MM-DD' format. The
supported range of DATE values is '1000-01-01' to '9999-12-31'. You might be
able to use earlier dates than that, but it's better to stay within the supported
range to avoid unexpected behavior.

The TIME data type represents time values in 'hh:mm:ss' format. The range
of TIME columns is '-838:59:59' to '838:59:59'. This is outside the time-of-day
range of '00:00:00' to '23:59:59' because TIME columns can be used to
represent elapsed time. Thus, values might be larger than time-of-day values, or
even negative.

The DATETIME data type stores date-and-time values in 'YYYY-MM-DD hh:mm:ss'
format. It's similar to a combination of DATE and TIME values, but the TIME
part represents time of day rather than elapsed time and has a range limited to
'00:00:00' to '23:59:59'. The date part of DATETIME columns has the same range
as DATE columns; combined with the TIME part, this results in a DATETIME range
from '1000-01-01 00:00:00' to '9999-12-31 23:59:59'.

The YEAR data type represents year-only values. You can declare such
columns as YEAR(4) or YEAR(2) to obtain a four-digit or two-digit display
format. If you don't specify any display width, the default is four digits.
The TIMESTAMP Data Type
The TIMESTAMP type, like DATETIME, stores date-and-time values, but has
a different range and some special properties that make it especially suitable for
tracking data modification times.

MySQL displays TIMESTAMP values using the same format as DATETIME values; that
is, 'YYYY-MM-DD hh:mm:ss'.

To control the initialization and update behavior of a TIMESTAMP column,
you add either or both of the DEFAULT CURRENT_TIMESTAMP and ON UPDATE
CURRENT_TIMESTAMP attributes to the column definition when creating the table
with CREATE TABLE or changing it with ALTER TABLE.

The DEFAULT CURRENT_TIMESTAMP attribute causes the column to be initialized
with the current timestamp at the time the record is created. The ON UPDATE
CURRENT_TIMESTAMP attribute causes the column to be updated with the current
timestamp when the value of another column in the record is changed from its
current value.

mysql> CREATE TABLE ts_test1 (
    ->   ts1 TIMESTAMP,
    ->   ts2 TIMESTAMP,
    ->   data CHAR(30)
    -> );

mysql> DESCRIBE ts_test1;

mysql> INSERT INTO ts_test1 (data) VALUES ('original_value');

mysql> SELECT * FROM ts_test1;
+---------------------+---------------------+----------------+
| ts1                 | ts2                 | data           |
+---------------------+---------------------+----------------+
| 2005-01-04 14:45:51 | 0000-00-00 00:00:00 | original_value |
+---------------------+---------------------+----------------+

mysql> UPDATE ts_test1 SET data='updated_value';

mysql> SELECT * FROM ts_test1;
+---------------------+---------------------+---------------+
| ts1                 | ts2                 | data          |
+---------------------+---------------------+---------------+
| 2005-01-04 14:46:17 | 0000-00-00 00:00:00 | updated_value |
+---------------------+---------------------+---------------+
1 row in set (0.00 sec)

The same behavior occurs if you specify both DEFAULT CURRENT_TIMESTAMP and
ON UPDATE CURRENT_TIMESTAMP explicitly for the first TIMESTAMP column. It is
also possible to use just one of the attributes. The following example uses
DEFAULT CURRENT_TIMESTAMP, but omits ON UPDATE CURRENT_TIMESTAMP. The result
is that the column is initialized automatically, but not updated when the
record is updated:

mysql> CREATE TABLE ts_test2 (
    ->   created_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ->   data CHAR(30)
    -> );

mysql> INSERT INTO ts_test2 (data) VALUES ('original_value');

mysql> SELECT * FROM ts_test2;
+---------------------+----------------+
| created_time        | data           |
+---------------------+----------------+
| 2005-01-04 14:46:39 | original_value |
+---------------------+----------------+

mysql> UPDATE ts_test2 SET data='updated_value';

mysql> SELECT * FROM ts_test2;
+---------------------+---------------+
| created_time        | data          |
+---------------------+---------------+
| 2005-01-04 14:46:39 | updated_value |
+---------------------+---------------+

The next example demonstrates how to create a TIMESTAMP column that is not set
to the current timestamp when the record is created, but only when it is
updated. In this case, the column definition includes ON UPDATE
CURRENT_TIMESTAMP but omits DEFAULT CURRENT_TIMESTAMP.

mysql> CREATE TABLE ts_test3 (
    ->   updated_time TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    ->   data CHAR(30)
    -> );

mysql> INSERT INTO ts_test3 (data) VALUES ('original_value');

mysql> SELECT * FROM ts_test3;
+---------------------+----------------+
| updated_time        | data           |
+---------------------+----------------+
| 0000-00-00 00:00:00 | original_value |
+---------------------+----------------+

mysql> UPDATE ts_test3 SET data='updated_value';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> SELECT * FROM ts_test3;
+---------------------+---------------+
| updated_time        | data          |
+---------------------+---------------+
| 2005-01-04 14:47:10 | updated_value |
+---------------------+---------------+
Note that you can choose to use CURRENT_TIMESTAMP with neither, either, or both of the attributes for a single TIMESTAMP column, but in the MySQL versions shown here you cannot use DEFAULT CURRENT_TIMESTAMP with one column and ON UPDATE CURRENT_TIMESTAMP with another (this restriction was lifted in MySQL 5.6.5):

mysql> CREATE TABLE ts_test4 (
    ->   created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ->   updated TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    ->   data CHAR(30)
    -> );
ERROR 1293 (HY000): Incorrect table definition; there can be only one
TIMESTAMP column with CURRENT_TIMESTAMP in DEFAULT or ON UPDATE clause

mysql> CREATE TABLE ts_test5 (
    ->   created TIMESTAMP DEFAULT 0,
    ->   updated TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    ->   data CHAR(30)
    -> );

mysql> INSERT INTO ts_test5 (created, data)
    -> VALUES (NULL, 'original_value');

mysql> SELECT * FROM ts_test5;
+---------------------+---------------------+----------------+
| created             | updated             | data           |
+---------------------+---------------------+----------------+
| 2005-01-04 14:47:39 | 0000-00-00 00:00:00 | original_value |
+---------------------+---------------------+----------------+

... time passes ...

mysql> UPDATE ts_test5 SET data='updated_value';

mysql> SELECT * FROM ts_test5;
+---------------------+---------------------+---------------+
| created             | updated             | data          |
+---------------------+---------------------+---------------+
| 2005-01-04 14:47:39 | 2005-01-04 14:47:52 | updated_value |
+---------------------+---------------------+---------------+

Spatial Database Concepts

Spatial data is associated with geographic locations such as cities, towns, etc. A spatial database is optimized to store and query data representing objects that are defined in a geometric space.

A road map is a 2-dimensional object which contains points, lines, and polygons that can represent cities and roads.

What are the concepts used in spatial database?

The three basic types of features are points, lines, and polygons (or areas).
Points are used to represent spatial characteristics of objects whose locations
correspond to a single 2-d coordinate (x, y, or longitude/latitude) in the scale of
a particular application.

 Keep track of objects in a multi-dimensional space


 Maps
 Geographical Information Systems (GIS)
 Weather
 In general spatial databases are n-dimensional
 This discussion is limited to 2-dimensional spatial databases
 Typical Spatial Queries
 Range query: Finds objects of a particular type within a particular
distance from a given location
 E.g., Taco Bells in Pleasanton, CA
 Nearest Neighbor query: Finds objects of a particular type that is
nearest to a given location
 E.g., Nearest Taco Bell from an address in Pleasanton, CA
 Spatial joins or overlays: Joins objects of two types based on some
spatial condition (intersecting, overlapping, within certain distance,
etc.)
 E.g., All Taco Bells within 2 miles from I-680.
 R-trees
 What is R-tree in DBMS?

An index organizes access to data so that entries can be found quickly, without searching every row.

R-trees are tree data structures used for spatial access methods, i.e., for
indexing multi-dimensional information such as geographical
coordinates, rectangles or polygons.

 Technique for typical spatial queries
 Group objects close in spatial proximity on the same leaf nodes of a tree structured index
 Internal nodes define areas (rectangles) that cover all areas of the rectangles in its subtree.
 Quad trees

Quad trees are tree structures used to efficiently store point data in a two-dimensional space. In this tree, each node has at most four children.
Characteristics of Spatial Database

A spatial database system has the following characteristics

 It is a database system
 It offers spatial data types (SDTs) in its data model and query language.
 It supports spatial data types in its implementation, providing at least
spatial indexing and efficient algorithms for spatial join.

Example

A road map is a visualization of geographic information. A road map is a 2-dimensional object which contains points, lines, and polygons that can represent cities, roads, and political boundaries such as states or provinces.

In general, spatial data can be of two types −

 Vector data: This data is represented as discrete points, lines and polygons.
 Raster data: This data is represented as a matrix of square cells.

The spatial data in the form of points, lines, polygons, etc. is used by many different databases.

MySQL implements spatial extensions following the specification published by the Open Geospatial Consortium (OGC), as a subset of the SQL with Geometry Types environment. This term refers to an SQL environment that has been extended with a set of geometry types.
Features of MySQL Spatial Data Types
MySQL spatial extensions enable the generation, storage, and analysis of
geographic features:

 Data types for representing spatial values


 Functions for manipulating spatial values
 Spatial indexing for improved access times to spatial columns
MySQL supports a number of Spatial Data Types
MySQL has data types that correspond to OpenGIS classes. Some of these types
hold single geometry values:

 GEOMETRY
 POINT
 LINESTRING
 POLYGON
The other data types hold collections of values:

 MULTIPOINT
 MULTILINESTRING
 MULTIPOLYGON
 GEOMETRYCOLLECTION
Use the CREATE TABLE statement to create a table with a spatial column:
CREATE TABLE geotest (code int(5),descrip varchar(50), g GEOMETRY);

Sample Output:

MySQL> describe geotest;
+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| code    | int(5)      | YES  |     | NULL    |       |
| descrip | varchar(50) | YES  |     | NULL    |       |
| g       | geometry    | YES  |     | NULL    |       |
+---------+-------------+------+-----+---------+-------+
3 rows in set (0.01 sec)
Use the ALTER TABLE statement to add or drop a spatial column to or from
an existing table:
ALTER TABLE geotest ADD pt_loca POINT;

ALTER TABLE geotest DROP pt_loca ;


Point Type
A Point is a geometry which represents a single location in coordinate space.
Usage of Point
On a city map, a Point object could represent a rail station.
Point Properties

 X-coordinate value.
 Y-coordinate value.
 Point is defined as a zero-dimensional geometry.
 The boundary of a Point is the empty set.

Example
MySQL> SELECT X(POINT(18, 23));
+------------------+
| X(POINT(18, 23)) |
+------------------+
|               18 |
+------------------+
1 row in set (0.00 sec)

MySQL> SELECT X(GeomFromText('POINT(18 23)'));
+---------------------------------+
| X(GeomFromText('POINT(18 23)')) |
+---------------------------------+
|                              18 |
+---------------------------------+
Curve Type
A Curve is a one-dimensional geometry, in general represented by a sequence of points.
LineString Type
A LineString is a Curve with linear interpolation between points.
Usage of LineString
LineString objects could represent a river within a country map.
LineString Properties

 A LineString has coordinates of segments, defined by each consecutive pair of points.
 A LineString is a Line if it consists of exactly two points.
 A LineString is a LinearRing if it is both closed and simple.

Example
MySQL> SET @g = 'LINESTRING(0 0,1 2,2 4)';
Query OK, 0 rows affected (0.00 sec)

MySQL> INSERT INTO geotest VALUES (123,"Test Data",GeomFromText(@g));

Surface Type

 A Surface is a two-dimensional geometry. It is a noninstantiable class. Its only instantiable subclass is Polygon.

Polygon Type
 A Polygon is a planar Surface representing a multisided geometry. It is
defined by a single exterior boundary and zero or more interior
boundaries, where each interior boundary defines a hole in the Polygon.
Usage of Polygon
 The Polygon objects could represent districts, blocks and so on from a
state map.

Example
MySQL> SET @g = 'POLYGON((0 0,8 0,12 9,0 9,0 0),(5 3,4 5,7 9,3 7, 2
5))';
MySQL> INSERT INTO geotest VALUES (123,"Test
Data",GeomFromText(@g));
GeometryCollection Type
A GeometryCollection is a geometry that is a collection of one or more
geometries of any class.
Example
MySQL> SET @g = 'GEOMETRYCOLLECTION(POINT(3 2),LINESTRING(0 0,1 3,2 5,3 5,4 7))';

MySQL> INSERT INTO geotest VALUES (123,"Test Data",GeomFromText(@g));

MultiPoint Type
A MultiPoint is a geometry collection composed of Point elements. The points are not connected or ordered in any way.
Usage of MultiPoint
On a world map, a MultiPoint could represent a chain of small islands.
Example
MySQL> SET @g ='MULTIPOINT(0 0, 15 25, 45 65)';

MySQL> INSERT INTO geotest VALUES (123,"Multipoint",GeomFromText(@g));

MultiCurve Type
A MultiCurve is a geometry collection composed of Curve elements. MultiCurve is a noninstantiable class.

MultiLineString Type
A MultiLineString is a MultiCurve geometry collection composed of LineString
elements.

Usage of MultiLineString

 On a region map, a MultiLineString could represent a river system or a highway system.
Example
MySQL> SET @g ='MULTILINESTRING((12 12, 22 22), (19 19, 32 18))';

MySQL> INSERT INTO geotest VALUES (123,"Multistring",GeomFromText(@g));
MultiSurface Type
A MultiSurface is a geometry collection composed of surface elements.
MultiSurface is a noninstantiable class. Its only instantiable subclass is
MultiPolygon.
MultiPolygon Type
MultiPolygon is a MultiSurface object composed of Polygon elements.
Usage of MultiPolygon
A MultiPolygon could represent a system of lakes on a region map.
Example
MySQL> SET @g ='MULTIPOLYGON(((0 0,11 0,12 11,0 9,0 0)),((3 5,7 4,4 7,7 7,3 5)))';

MySQL> INSERT INTO geotest VALUES (123,"Multipolygon",GeomFromText(@g));

Spatial Indexing

A spatial index is a data structure that allows for accessing a spatial object
efficiently. It is a common technique used by spatial databases. Without
indexing, any search for a feature would require a "sequential scan" of every
record in the database, resulting in much longer processing time.
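In MySQL, for instance, a spatial index can be declared on a geometry column. The sketch below is illustrative (the table name geo_indexed is an assumption); note that MySQL requires the indexed geometry column to be declared NOT NULL:

CREATE TABLE geo_indexed (
  code INT,
  descrip VARCHAR(50),
  g GEOMETRY NOT NULL,
  SPATIAL INDEX(g)  -- R-tree-based index on the geometry column
);

-- or, for an existing table:
-- CREATE SPATIAL INDEX g_idx ON geo_indexed (g);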

1. Overview

As noted above, a spatial index allows spatial objects to be accessed efficiently. A variety of spatial operations need the support of a spatial index for efficient processing:

Range query: Finding objects containing a given point (point query) or overlapping with an area of interest (window query).

Spatial join: Finding pairs of objects that interact spatially with each other. Intersection, adjacency, and containment are common examples of spatial predicates used to perform spatial joins.

K-Nearest Neighbor (KNN): Finding the nearest K spatial objects in a defined neighborhood of a target object.

Figure 1. The point query test in MySQL 5.7.19 using 2017 TIGER national geodatasets. Time is measured in milliseconds, which is the precision of MySQL.

Figure 2. Minimum bounding rectangles of 2D objects (a) and 3D objects (b).

In a 2D plane, an MBR is defined by four coordinates, (xmin, ymin) and (xmax, ymax). These coordinates represent the following:

 xmin is the x-coordinate of the lower-left corner of the bounding box.
 ymin is the y-coordinate of the lower-left corner of the bounding box.
 xmax is the x-coordinate of the upper-right corner of the bounding box.
 ymax is the y-coordinate of the upper-right corner of the bounding box.

When performing a spatial operation on a collection of indexed spatial objects, we first use the MBR instead of the exact shape of each spatial object to test the relation. By filtering out spatial objects whose MBRs do not qualify, the time spent evaluating spatial predicates is reduced significantly. Then, in a second step, we use the actual shapes of the objects that passed the first step to test the spatial relation with the target object.
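As a sketch of this filter-and-refine idea in MySQL, reusing the geotest table from earlier in this chapter (the polygon is an arbitrary assumption; GeomFromText and MBRContains are the pre-8.0 function spellings, and the exact test ST_Contains requires MySQL 5.6 or later):

mysql> SET @area = GeomFromText('POLYGON((0 0,20 0,20 20,0 20,0 0))');

-- Step 1 (filter): cheap test against minimum bounding rectangles only
mysql> SELECT code, descrip FROM geotest WHERE MBRContains(@area, g);

-- Step 2 (refine): exact geometry test on the candidates from step 1
mysql> SELECT code, descrip FROM geotest
    -> WHERE MBRContains(@area, g) AND ST_Contains(@area, g);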

2. Spatial Index in Different Databases

Different data sources use different data structures and access methods. Here we list two well-known classes of spatial indices as well as databases which use them. The categorization proposed by Rigaux et al. (2002) is employed here for illustration:

Space-driven structures. These data structures are based on partitioning of the embedding 2D space into cells (or grids), mapping MBRs to the cells according to some spatial relationship (overlap or intersect).

Data-driven structures. These data structures are organized directly by a partition of the collection of spatial objects. Data objects are grouped using MBRs, adapting to their distribution in the embedding space. Commercial databases like Oracle and Informix, as well as some open-source databases such as PostGIS and MySQL, use these data structures.

3. Space-driven Structures

Space-driven structures decompose the embedding 2D plane into a set of cells using partitioning and mapping strategies, and can be used in spatial extensions with a B+-tree, which is dynamic and efficient in memory space and query time. Although some of these indices can handle arbitrary dimensions, we describe some common partitioning structures in the 2D case here.

Fixed grid index

A fixed grid index is an n×n array of equal-size cells. Each cell is associated with a list of spatial objects which intersect or overlap with it. Figure 3 depicts a fixed 4×4 grid indexing a collection of three spatial objects.
Figure 3. An example of a fixed grid structure.

Quadtree

Quadtree is a very popular spatial indexing technique. It is a specialized form of grid in which the resolution of the grid is varied according to the density of the spatial objects to be fitted.

Figure 6. A representation of a quadtree structure.

KD-tree

The general idea behind the KD-tree is that it is a binary tree, each of whose nodes represents an axis-aligned hyper-rectangle, as Figure 7 shows. Each node specifies an axis and splits the set of points based on whether their coordinate along that axis is greater than or less than a particular value (Rigaux, 2012; Maneewongvatana, 1999), such as the coordinate median.

Figure 7. A representation of a KD-tree structure.

The KD-tree can be used to index a set of k-dimensional points. Every non-leaf
node divides the space into two parts by a hyper-plane in the specific
dimension. Points in the left half-space are represented by the left subtree of that
node and points falling to the right half-space are represented by the right
subtree.
4. Data-driven Structures

The difference between space-driven structures and data-driven structures is that the latter are organized by spatial containment relationships among the objects rather than by a fixed partitioning of the space. These structures adapt themselves to the MBRs of the spatial objects, and they are part of the R-tree family.

R-tree

For a layer of geometries, an R-tree index consists of a hierarchical index on the MBRs of the geometries in the layer (Guttman, 1984). This hierarchical structure is based on the heuristic optimization of the area of the MBRs in each node in order to improve access efficiency.

Figure 9. R-Tree Hierarchical Index in Minimum Bounding Rectangles (MBRs)

In Figure 9, M4 through M9 are MBRs of spatial objects in a layer. They are the
leaf nodes of the R-tree index, and contain minimum bounding rectangles of
spatial objects, along with pointers to the spatial objects. M2 and M3 are parents
of the leaf nodes. M1 is the root, containing all the MBRs. This R-tree has a
depth of three.

Difference between Spatial and Temporal Data Mining

1. Spatial Data Mining :

Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from spatial databases. In spatial data mining, analysts use geographical or spatial information to produce business intelligence or other results.
2. Temporal Data Mining :

Temporal data mining refers to the extraction of implicit, non-trivial and potentially useful abstract information from large collections of temporal data. It is concerned with the analysis of temporal data and with finding temporal patterns and regularities in sets of temporal data. The tasks of temporal data mining are:

 Data Characterization and Comparison


 Cluster Analysis
 Classification
 Association rules
 Prediction and Trend Analysis
 Pattern Analysis

Difference between Spatial and Temporal Data Mining :

1. Spatial data mining requires space. Temporal data mining requires time.

2. Spatial mining is the extraction of knowledge (spatial relationships and other interesting measures that are not explicitly stored in the spatial database). Temporal mining is the extraction of knowledge about the occurrence of events, whether they follow cyclic, random or seasonal variations, etc.

3. Spatial data mining deals with spatial (location, geo-referenced) data. Temporal data mining deals with implicit or explicit temporal content, from large quantities of data.

4. Spatial data mining retrieves spatial objects derived from spatial data types and the spatial associations among such objects. Temporal data mining comprises the subject as well as its utilization in the modification of fields.

5. Spatial data mining includes finding characteristic rules, discriminant rules, association rules, evaluation rules, etc. Temporal data mining aims at mining new and unknown knowledge which takes the temporal aspects of data into account.

6. Spatial data mining is the method of identifying unusual and unexplored but useful models from spatial databases. Temporal data mining deals with useful knowledge from temporal data.

7. Example (spatial): determining hotspots and unusual locations. Example (temporal): an association rule such as "Any person who buys a car also buys a steering lock" becomes, with the temporal aspect, "Any person who buys a car also buys a steering lock after that".

What is Spatial Data Mining?

A spatial database saves a huge amount of space-related data, including maps, preprocessed remote sensing or medical imaging records, and VLSI chip design data. Spatial databases have several features that distinguish them from relational databases.

The primitives of spatial data mining are as follows −

Rules − There are several types of rules that can be found from databases in
general. For example characteristic rules, discriminant rules, association rules,
or deviation and evaluation rules can be mined.

A spatial characteristic rule is a general representation of the spatial data. For instance, a rule defining the general cost range of houses in several geographic areas in a city is a spatial characteristic rule.

A discriminant rule is the usual representation of the features discriminating or contrasting a class of spatial records from other classes, like the comparison of cost ranges of houses in several geographical areas.
Thematic Maps − Thematic map is a map generally designed to display a
theme, an individual spatial distribution, or a pattern, using a definite map type.
These maps display the distribution of features over limited geographical
regions. Each map represents a partitioning of the area into a group of closed
and disjoint areas; each contains all the points with a similar feature value.

Applications

 Spatial Information and Data Mining Applications.


 Spatial Binning for Detection of Regional Patterns.
 Materializing Spatial Correlation.
 Colocation Mining.
 Spatial Clustering.
 Location Prospecting.

Spatial Information and Data Mining Applications

ODM (Oracle Data Mining) allows automatic discovery of knowledge from a database. Its techniques include discovering hidden associations between different data attributes, classification of data based on some samples, and clustering to identify intrinsic patterns.

Spatial Binning for Detection of Regional Patterns

Spatial binning (spatial discretization) discretizes the location values into a small number of groups associated with geographical areas. The assignment of a location to a group can be done by any of several methods.

Materializing Spatial Correlation

Spatial correlation (or neighborhood influence) refers to the phenomenon of the location of a specific object in an area affecting some nonspatial attribute of the object. For example, the value (nonspatial attribute) of a house at a given address (geocoded to give a spatial attribute) is largely determined by the value of other houses in the neighborhood.

Colocation Mining

Colocation is the presence of two or more spatial objects at the same location or at significantly close distances from each other. Colocation patterns can indicate interesting associations among spatial data objects with respect to their nonspatial attributes.

Spatial Clustering

Spatial clustering returns cluster geometries for a layer of data. An example of spatial clustering is the clustering of crime location data.

Location Prospecting

Location prospecting can be performed by using thematic layers to compute aggregates for a layer, and choosing the locations that have the maximum values for the computed aggregates.

Mobile Databases

Mobile databases are separate from the main database and can easily be
transported to various places. Even though they are not connected to the main
database, they can still communicate with the database to share and exchange
data.

The mobile database includes the following components −

 The main system database that stores all the data and is linked to the
mobile database.
 The mobile database that allows users to view information even while on
the move. It shares information with the main database.
 The device that uses the mobile database to access data. This device can
be a mobile phone, laptop etc.
 A communication link that allows the transfer of data between the mobile
database and the main database.

Advantages of Mobile Databases

Some advantages of mobile databases are −

 The data in a database can be accessed from anywhere using a mobile database. It provides wireless database access.
 The database systems are synchronized using mobile databases and multiple users can access the data with a seamless delivery process.
 Mobile databases require very little support and maintenance.
 The mobile database can be synchronized with multiple devices such as mobiles, computer devices, laptops etc.

Disadvantages of Mobile Databases

Some disadvantages of mobile databases are −

 The mobile data is less secure than data that is stored in a conventional
stationary database. This presents a security hazard.
 The mobile unit that houses a mobile database may frequently lose power
because of limited battery. This should not lead to loss of data in
database.

Location and Handoff Management

Location Management: In cellular systems a mobile unit is free to move around within the entire area of coverage; its movement is random.
Thus, the entire process of the mobility management component of the cellular
system is responsible for two tasks:

(a) location management - identification of the current geographical location or current point of attachment of a mobile unit, which is required by the MSC (Mobile Switching Center) to route the call.

(b) handoff - transferring (handing off) the current (active) communication session to the next base station.

The location management performs three fundamental tasks:

(a) location update,

(b) location lookup, and

(c) paging.

Location update - initiated by the mobile unit; the current location of the unit is recorded in the HLR and VLR databases.

Location lookup - a database search to obtain the current location of the mobile unit.

Paging - the system informs the caller of the location of the called unit in terms of its current base station. The last two tasks are initiated by the MSC.

A mobile unit can freely move around in

(a) active mode, (b) doze mode, or (c) power down mode.

In active mode, the mobile unit actively communicates with another subscriber, and it may continue to move within the cell or may encounter a handoff which may interrupt the communication. It is the task of the location manager to find the new location and resume the communication.

In doze mode, a mobile unit does not actively communicate with other subscribers but continues to listen to the base station and monitors the signal levels around it.

In power down mode, the unit is not functional at all.

The location management module uses a two-tier scheme for location-related tasks. The first tier provides a quick location lookup, and the second tier search is initiated only when the first tier search fails.

Location Lookup :A location lookup finds the location of the called party to
establish the communication session. It involves searching VLR and possibly
HLR.

Location Search Using Forward Pointers: A user in the "Source" registration area wants to communicate with a user in the "Destination" area.

MOBILE TRANSACTION MODEL

The conventional ACID transaction model was unable to satisfactorily manage mobile data processing tasks. Some of the important reasons were: the presence of handoff, which is unpredictable; the presence of doze mode, disconnected mode, and forced disconnection; lack of necessary resources such as memory and wireless channels; the presence of location-dependent data; etc. To manage data processing in the presence of these new issues, a more powerful transaction model or ACID transaction execution model that can handle mobility during data processing was highly desirable.

Two approaches to manage mobile databases were proposed; that is, there are basically two ways to handle transactional requests on an MDS: (a) an execution model based on the ACID transaction framework, and (b) a mobile transaction model and its execution. The first approach creates an execution model based on the ACID transaction framework. In the second approach a user query is mapped to a mobile transaction model and executed under mobile ACID constraints.
Types: (i) HiCoMo: High Commit Mobile Transaction Model
Although it has been presented as a mobile transaction model, in reality it is a mobile transaction execution model. The execution model is mainly for processing aggregate data stored in a data warehouse which resides in mobile units.

This conversion is done by a Transaction Transformation Function, which works as follows:

Conflict detection: A conflict is identified among other HiCoMo transactions and between HiCoMo and base transactions. If there is a conflict between HiCoMo transactions, then the transaction which is being considered for transformation is aborted.

Base transaction generation: In the absence of a conflict, initial base transactions are generated and executed as subtransactions on the base database at the server. The type of base transaction depends upon the HiCoMo transactions.

Alternate base transaction generation: It is possible that some of these subtransactions may violate integrity constraints (may be outside the error margin) and, therefore, are aborted. These updates are tried again by redistribution of the error margin. In the worst-case scenario the original HiCoMo transactions are aborted. If there is no integrity violation, then the base transactions are committed.

(ii) Moflex Transaction Model
A mobile transaction model called Moflex is based on a flexible transaction model. The structure of a Moflex transaction has seven components and can be defined as T = {M, S, F, D, H, J, G}, where M = {t1, t2, ..., tn} and each ti is a compensable or noncompensable subtransaction. Every compensable ti is associated with a corresponding compensating transaction.

(iii) Kangaroo Transaction Model
Kangaroo transaction processing: A KT (Kangaroo Transaction), when initiated by a mobile unit, is assigned a unique identity. The initial base station immediately creates a JT (Joey Transaction) with a unique identity and becomes responsible for its execution. There is one JT per base station. When the mobile unit encounters a handoff (i.e., moves to a different cell), the KT is split into two transactions, JT1 and JT2. If a failure occurs (when a JT fails) in the default split mode, then no new local or global transaction is created from the KT and previously committed JTs are not compensated; as a result, KTs are serializable in compensating mode but may not be in split mode.

Transaction Execution Model
An execution model called Multidatabase Transaction Processing Manager (MDSTPM) supports transaction initiation from mobile units. The model uses message and queuing facilities to establish the necessary communication among mobile and stationary (base station) units.

The MDSTPM has the following components:

Global Communication Manager (GCM): This module manages message communication among transaction processing units. It maintains a message queue for handling this task.

Global Transaction Manager (GTM): This module coordinates the initiation of transactions and their subtransactions. It acts as a Global Scheduling Submanager (GSS), which schedules global transactions and subtransactions. It can also act as a Global Concurrency Submanager (GCS), which is responsible for the execution of these transactions and subtransactions.

Local Transaction Manager (LTM): This module is responsible for local transaction execution and database recovery.

Global Recovery Manager (GRM): This module is responsible for managing global transaction commit and recovery in the event of failure.

Global Interface Manager (GIM): This module serves as a link between the MDSTPM and the local database managers.

These transaction models did address most of the important issues of mobility; however, no single model captured or incorporated these issues in one place. In the Kangaroo model a transaction issued by a user at one mobile unit can be fragmented and executed at multiple mobile units.
Mobilaction - A Mobile Transaction Model
Mobilaction is capable of processing location-dependent data in the presence of spatial replication. It is composed of a set of subtransactions, also called Execution Fragments, and each fragment is itself a Mobilaction. Mobilaction is based on the framework of the ACID model. To manage location-based processing, a new fundamental property called "location (L)" is incorporated, extending the ACID model to ACIDL. The "location (L)" property is managed by a location mapping function.

Atomicity for Mobilaction
The purpose of atomicity is to ensure the consistency of the data. However, in a mobile environment we have two types of consistency. Certainly, atomicity at the execution fragment level is needed to ensure spatial consistency. However, transaction atomicity is not: we could have some fragments execute and others not.

Isolation for Mobilaction
Transaction isolation ensures that a transaction does not interfere with the execution of another transaction. Isolation is normally enforced by some concurrency control mechanism. As with atomicity, isolation is needed to ensure that consistency is preserved. Thus we need to reevaluate isolation when spatial consistency is present. As with consistency, isolation at the transaction level is too strict. The important thing is to ensure that execution fragments satisfy isolation at the execution fragment level.

Overview of Deductive Databases

A deductive database is a database system that can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database.

Deductive databases are an extension of relational databases which support more complex data modeling. In this section we will see how simple examples of deductive databases can be represented in Prolog, and we will see more of the limitations of Prolog. A standard example in Prolog is a genealogy database.

 Declarative Language
 Language to specify rules
 Inference Engine (Deduction Machine)
 Related to logic programming
 Prolog language (Prolog => Programming in logic)
 Uses backward chaining to evaluate

Top-down application of the rules

 Specification consists of:
 Facts
 Similar to relation specification without the necessity of including attribute names
 Rules
 Similar to relational views (virtual relations that are not stored)
 Predicate has
 a name
 a fixed number of arguments
 Convention:
 Constants are numeric or character strings
 Variables start with upper case letters
 E.g., SUPERVISE(Supervisor, Supervisee)
 States that Supervisor SUPERVISE(s) Supervisee
 Example based on the company database

3 predicate names: supervise, superior, subordinate

a) the supervise predicate:

defined via a set of facts, each with two arguments: (supervisor name, supervisee name)
SUPERVISE(SUPERVISOR, SUPERVISEE)
Attributes are represented by position: the 1st argument is the supervisor, the 2nd argument is the supervisee. supervise(X,Y) states the fact that X supervises Y.

b) the superior, subordinate predicates:

defined via a set of rules (superior allows us to express the idea of non-
direct supervision)

Rule

 Is of the form head :- body
 where :- is read as "if"
 E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)
 E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)
 Query
 Involves a predicate symbol followed by some variable arguments to answer the question
 E.g., SUPERIOR(james,Y)?
 E.g., SUBORDINATE(james,X)?
 (a) Prolog notation (b) Supervisory tree
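To make the notation concrete, here is a minimal Prolog sketch of the company example (the specific supervise facts are illustrative assumptions; the second superior clause adds the usual recursion for the non-direct supervision described above):

% Facts: supervise(Supervisor, Supervisee)
supervise(james, franklin).
supervise(james, jennifer).
supervise(franklin, john).

% Rules
superior(X, Y) :- supervise(X, Y).                 % direct supervision
superior(X, Y) :- supervise(X, Z), superior(Z, Y). % non-direct supervision (recursive)
subordinate(Y, X) :- supervise(X, Y).

% Example queries:
% ?- superior(james, Y).    gives Y = franklin ; Y = jennifer ; Y = john
% ?- subordinate(X, james). gives X = franklin ; X = jennifer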


Multimedia Database Concepts

Multimedia Databases

The multimedia databases are used to store multimedia data such as images,
animation, audio, video along with text. This data is stored in the form of
multiple file types like .txt(text), .jpg(images), .swf(videos), .mp3(audio) etc.

Contents of the Multimedia Database

The multimedia database stores the multimedia data and information related to it. This is given in detail as follows −
Media data

This is the multimedia data that is stored in the database such as images, videos,
audios, animation etc.

Media format data

The Media format data contains the formatting information related to the media
data such as sampling rate, frame rate, encoding scheme etc.

Media keyword data

This contains the keyword data related to the media in the database. For an
image the keyword data can be date and time of the image, description of the
image etc.

Media feature data

The Media feature data describes the features of the media data. For an image, feature data can be the colours of the image, textures in the image etc.

Challenges of Multimedia Database

There are many challenges to implementing a multimedia database. Some of these are:

 Multimedia databases contain data in a large variety of formats such as .txt(text), .jpg(images), .swf(videos), .mp3(audio) etc. It is difficult to convert one type of data format to another.
 The multimedia database requires a large size as the multimedia data is quite large and needs to be stored successfully in the database.
 It takes a lot of time to process multimedia data, so multimedia databases are slow.

 In the years ahead multimedia information systems are expected to


dominate our daily lives.
 Our houses will be wired for bandwidth to handle interactive
multimedia applications.
 Our high-definition TV/computer workstations will have access to
a large number of databases, including digital libraries, image and
video databases that will distribute vast amounts of multisource
multimedia content.
 Types of multimedia data are available in current systems
 Text: May be formatted or unformatted. For ease of parsing
structured documents, standards like SGML and variations such as
HTML are being used.
 Graphics: Examples include drawings and illustrations that are
encoded using some descriptive standards (e.g. CGM, PICT,
postscript).
 Types of multimedia data are available in current systems (contd.)
 Images: Includes drawings, photographs, and so forth, encoded in
standard formats such as bitmap, JPEG, and MPEG. Compression
is built into JPEG and MPEG.
 These images are not subdivided into components. Hence
querying them by content (e.g., find all images containing
circles) is nontrivial.
 Animations: Temporal sequences of image or graphic data.
 Video: A set of temporally sequenced photographic data for
presentation at specified rates– for example, 30 frames per second.
 Structured audio: A sequence of audio components comprising
note, tone, duration, and so forth.
 Audio: Sample data generated from aural recordings in a string of
bits in digitized form. Analog recordings are typically converted
into digital form before storage.
 Types of multimedia data are available in current systems (contd.)
 Composite or mixed multimedia data: A combination of
multimedia data types such as audio and video which may be
physically mixed to yield a new storage format or logically mixed
while retaining original types and formats. Composite data also
contains additional control information describing how the
information should be rendered.
 Nature of Multimedia Applications:
 Multimedia data may be stored, delivered, and utilized in many
different ways.

Multimedia Database Applications:

1. Documents and record management: Industries which keep a lot of


documentation and records. Ex: Insurance claim industry.
2. Knowledge dissemination: Multimedia database is an extremely
efficient tool for knowledge dissemination and providing several
resources. Ex: electronic books
3. Education and training: Multimedia sources can be used to create
resources useful in education and training. These are popular sources of
learning in recent days. Ex: Digital libraries.
4. Real-time monitoring and control: Multimedia presentation when
coupled with active database technology can be an effective means for
controlling and monitoring complex tasks. Ex: Manufacture control
5. Marketing
6. Advertisement
7. Retailing
8. Entertainment
9. Travel
Two Marks
1. Why do we need a temporal database?
Temporal databases enable you to see the data as it was seen in the past, while also enabling you to
update even the past in the future.
2. What do you understand by temporal data?
Temporal data is simply data that represents a state in time, such as the land-use patterns of Hong
Kong in 1990, or total rainfall in Honolulu on July 1, 2009. Temporal data is collected to analyze
weather patterns and other environmental variables, monitor traffic conditions, study demographic
trends, and so on
3. What is spatial and temporal data?
Spatiotemporal data are data that relate to both space and time. Spatiotemporal data mining refers to
the process of discovering patterns and knowledge from spatiotemporal data.
4. What is the difference between spatial and temporal analysis?
Spatial analysis is how we understand the space around us and the world. It helps us solve complex
location-analytics problems. A temporal understanding of data helps us see if patterns are consistent
over time and to detect unusual patterns if any
5. What is the difference between spatial and temporal resolution?
The spatial resolution is the amount of spatial detail in an observation, and the temporal
resolution is the amount of temporal detail in an observation.
6. What are the 4 types of resolution?
There are four types of resolution to consider for any dataset—radiometric, spatial, spectral, and
temporal. Radiometric resolution is the amount of information in each pixel, that is, the number of bits
representing the energy recorded.
7. What is an example of spatial resolution?
Spatial resolution refers to the size of one pixel on the ground. For example 15 meters means that one
pixel on the image corresponds to a square of 15 by 15 meters on the ground.
8. What is spatial database in DBMS?
A spatial database is a general-purpose database (usually a relational database) that has been enhanced
to include spatial data that represents objects defined in a geometric space, along with tools for
querying and analyzing such data.
9. What are types of spatial database?
Spatial types in databases
 Dameng.
 IBM Db2.
 Microsoft SQL Server.
 Oracle.
 PostgreSQL.
 SAP HANA.
 SQLite.
 Teradata Vantage.
10. What are the two spatial data models?
There are two broad categories of spatial data models. These are vector data model and raster
data models.
11. What are spatial data types in DBMS?
Spatial data, also known as geospatial data, is a particular type of information about either the physical object or the physical location of data, represented by numerical values in a geographic coordinate system.
12. Why spatial data is important?
Spatial analysis allows you to solve complex location-oriented problems and better understand
where and what is occurring in your world. It goes beyond mere mapping to let you study the
characteristics of places and the relationships between them. Spatial analysis lends new perspectives to
your decision-making
13. What are the sources of spatial data?
The most common general sources for spatial data are: hard copy maps; aerial photographs; remotely-sensed imagery; point data, samples from surveys; and existing digital data files.

14. What are spatial objects?


A spatial object is the digital representation of a geographical entity or phenomenon, which forms the basis for data management and analysis. A spatial relationship is the connection between spatial objects when geometric properties are considered.
15. Who uses spatial data?
Geospatial data is often used in scientific or government administration contexts, but it has an
increasing number of commercial uses as well. From retail to investment to insurance, here are 10
scenarios where you can make use of geospatial data
16. What is a spatial function?
Spatial functions allow you to perform advanced spatial analysis and combine spatial files with
data in other formats like text files or spreadsheets. For example, you might have a spatial file of
city council districts, and a text file containing latitude and longitude coordinates of reported potholes
17. What is a deductive database system?
A deductive database is a database system that makes conclusions about its data based on a set of
well-defined rules and facts. This type of database was developed to combine logic programming
with relational database management systems.
18. What are the 4 types of database?
Four types of database management systems: hierarchical database systems, network database systems, relational database systems, and object-oriented database systems.


19. What are the different types of multimedia databases?
There are three classes of the multimedia database which includes static media, dynamic media and
dimensional media.
20. Why do we need multimedia database?
Why is a database needed? It is for storing and retrieving data more efficiently. Multimedia
information (e.g., text, graphics, audio, video, etc.) has to be managed differently depending on the
type of data. However, efficient retrieval of data depends on the database system.
21. What is multimedia database explain?
A multimedia database (MMDB) is a collection of related multimedia data. The multimedia data include one or more primary media data types such as text, images, graphic objects (including drawings, sketches and illustrations), animation sequences, audio and video.
22. What are the characteristics of multimedia database management system?
Characteristics of MDBMS

Comprehensive search methods: During a search in the database, an entry, given in the form of text
or a graphical image, is found using different search queries and the corresponding search methods.
Format independent interface: database queries should be independent of media format.
23. What is multimedia database describe any two image databases?
Multimedia database systems are increasingly common owing to the popular use of audio-video equipment, digital cameras, CD-ROMs, and the Internet. Examples of multimedia database systems include NASA's EOS (Earth Observation System), various kinds of image and audio-video databases, and Internet databases.

UNIT III NOSQL DATABASES

NoSQL – CAP Theorem – Sharding – Document based – MongoDB Operation: Insert, Update, Delete, Query, Indexing, Application, Replication, Sharding – Cassandra: Data Model, Key Space, Table Operations, CRUD Operations, CQL Types – HIVE: Data types, Database Operations, Partitioning – HiveQL – OrientDB Graph database – OrientDB Features

3.1 NoSQL: Introduction to NoSQL databases

What is NoSQL?
NoSQL Database is a non-relational Data Management System, that does not require
a fixed schema. It avoids joins, and is easy to scale. The major purpose of using a NoSQL
database is for distributed data stores with humongous data storage needs. NoSQL is used for
Big data and real-time web apps. For example, companies like Twitter, Facebook and Google
collect terabytes of user data every single day.
NoSQL database stands for "Not Only SQL" or "Not SQL." Though a better term would be "NoREL," the name NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.

Traditional RDBMSs use SQL syntax to store and retrieve data for further insights. A NoSQL database system, instead, encompasses a wide range of database technologies that can store structured, semi-structured, unstructured and polymorphic data.

Why NoSQL?


The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response time
becomes slow when you use RDBMS for massive volumes of data.

To resolve this problem, we could "scale up" our systems by upgrading our existing hardware, but this process is expensive. The alternative is to "scale out" by distributing the database load over multiple hosts as the load increases, which is the approach NoSQL databases are designed for.

Brief History of NoSQL Databases

 1998 - Carlo Strozzi uses the term NoSQL for his lightweight, open-source relational database
 2000 - Graph database Neo4j is launched
 2004 - Google BigTable is launched
 2005 - CouchDB is launched
 2007 - The research paper on Amazon Dynamo is released
 2008 - Facebook open-sources the Cassandra project
 2009 - The term NoSQL is reintroduced

Features of NoSQL
Non-relational

 NoSQL databases never follow the relational model


 Never provide tables with flat fixed-column records
 Work with self-contained aggregates or BLOBs
 Doesn’t require object-relational mapping and data normalization
 No complex features like query languages, query planners, referential integrity joins, ACID

Schema-free

 NoSQL databases are either schema-free or have relaxed schemas


 Do not require any sort of definition of the schema of the data
 Offers heterogeneous structures of data in the same domain


Advantages of NoSQL

 Can be used as Primary or Analytic Data Source


 Big Data Capability
 No Single Point of Failure
 Easy Replication
 No Need for Separate Caching Layer
 It provides fast performance and horizontal scalability.
 Can handle structured, semi-structured, and unstructured data with equal effect
 Object-oriented programming which is easy to use and flexible
 NoSQL databases don’t need a dedicated high-performance server
 Support Key Developer Languages and Platforms
 Simple to implement than using RDBMS
 It can serve as the primary data source for online applications.
 Handles big data which manages data velocity, variety, volume, and complexity
 Excels at distributed database and multi-data center operations
 Eliminates the need for a specific caching layer to store data
 Offers a flexible schema design which can easily be altered without downtime or
service disruption

Disadvantages of NoSQL

 No standardization rules
 Limited query capabilities
 RDBMS databases and tools are comparatively mature
 It does not offer any traditional database capabilities, like consistency when multiple
transactions are performed simultaneously.
 When the volume of data increases, it becomes difficult to maintain unique keys
 Doesn’t work as well with relational data
 The learning curve is stiff for new developers
 Open source options so not so popular for enterprises.

3.2 CAP Theorem

What is the CAP Theorem?


The CAP theorem is also called Brewer's theorem. It states that it is impossible for a distributed data store to offer more than two out of the following three guarantees:

1. Consistency
2. Availability


3. Partition Tolerance

Consistency:

The data should remain consistent even after the execution of an operation. This means once
data is written, any future read request should contain that data. For example, after updating
the order status, all the clients should be able to see the same data.

Availability:

The database should always be available and responsive. It should not have any downtime.

Partition Tolerance:

Partition Tolerance means that the system should continue to function even if the
communication among the servers is not stable. For example, the servers can be partitioned
into multiple groups which may not communicate with each other. Here, if part of the
database is unavailable, other parts are always unaffected.

Eventual Consistency
The term "eventual consistency" means keeping copies of data on multiple machines to get high availability and scalability. Thus, changes made to any data item on one machine have to be propagated to the other replicas.

Data replication may not be instantaneous, as some copies will be updated immediately while others will be updated in due course of time. These copies may be mutually inconsistent for a while, but in due course of time they become consistent. Hence the name eventual consistency.

BASE: Basically Available, Soft state, Eventual consistency


 Basically available means the DB is available all the time, as per the CAP theorem
 Soft state means that even without an input, the system state may change
 Eventual consistency means that the system will become consistent over time

3.3 Sharding

What is Sharding in MongoDB?


Sharding is a concept in MongoDB, which splits large data sets into small data sets across
multiple MongoDB instances.

Sometimes the data within MongoDB will be so huge, that queries against such big data sets
can cause a lot of CPU utilization on the server. To tackle this situation, MongoDB has a
concept of Sharding, which is basically the splitting of data sets across multiple MongoDB
instances.

The collection which could be large in size is actually split across multiple collections or
Shards as they are called. Logically all the shards work as one collection.

How to Implement Sharding


Shards are implemented by using clusters which are nothing but a group of MongoDB
instances.

The components of a Shard include

1. A Shard – This is the basic thing, and this is nothing but a MongoDB instance which
holds the subset of the data. In production environments, all shards need to be part of
replica sets.


2. Config server – This is a mongodb instance which holds metadata about the cluster,
basically information about the various mongodb instances which will hold the shard
data.
3. A Router – This is a mongodb instance which is basically responsible for redirecting the commands sent by the client to the right servers.

How sharding works


When dealing with high throughput applications or very large databases, the underlying
hardware becomes the main limitation. High query rates can stress the CPU, RAM, and I/O
capacity of disk drives resulting in a poor end-user experience.
To mitigate this problem, there are two types of scaling methods.

Vertical scaling
Vertical scaling is the traditional way of increasing the hardware capabilities of a single
server. The process involves upgrading the CPU, RAM, and storage capacity. However,
upgrading a single server is often challenged by technological limitations and cost
constraints.

Horizontal scaling
This method divides the dataset into multiple servers and distributes the database load among
each server instance. Distributing the load reduces the strain on the required hardware
resources and provides redundancy in case of a failure.

However, horizontal scaling increases the complexity of the underlying architecture. MongoDB supports horizontal scaling through sharding—one of its major benefits, as we’ll see below.

MongoDB sharding basics


MongoDB sharding works by creating a cluster of MongoDB instances consisting of at least
three servers. That means sharded clusters consist of three main components:

 The shard
 Mongos
 Config servers


Shard
A shard is a single MongoDB instance that holds a subset of the sharded data. Shards can be
deployed as replica sets to increase availability and provide redundancy. The combination of
multiple shards creates a complete data set. For example, a 2 TB data set can be broken down
into four shards, each containing 500 GB of data from the original data set.

Mongos
Mongos act as the query router providing a stable interface between the application and the
sharded cluster. This MongoDB instance is responsible for routing the client requests to the
correct shard.

Config Servers
Configuration servers store the metadata and the configuration settings for the whole cluster.

Components illustrated
The following diagram from the official MongoDB docs explains the relationship between
each component:

1. The application communicates with the routers (mongos) about the query to be executed.
2. The mongos instance consults the config servers to check which shard contains the
required data set to send the query to that shard.
3. Finally, the result of the query will be returned to the application.
It’s important to remember that the config servers also work as replica sets.


Sharding benefits & limitations


Now that we’ve got the concept down, let’s look at benefits and limitations of sharding in
MongoDB:

Benefits

 In traditional replication scenarios, the primary node handles the bulk of write
operations, while the secondary servers are limited to read-only operations or
maintaining the backup of the data set. However, as sharding utilizes shards with replica
sets, all queries are distributed among all the nodes in the cluster.
 As each shard consists of a subset of the complete data set, simply adding additional
shards will increase the cluster’s storage capacity without having to do complex
hardware restructuring.
 Replication requires vertical scaling when handling large data sets. This requirement can
lead to hardware limitations and prohibitive costs compared to the horizontal scaling
approach. But, because MongoDB utilizes horizontal scaling, the workload is
distributed. When the need arises, additional servers can be added to a cluster.
 In sharding, both read and write performance directly correlates to the number of server
nodes in the cluster. This process provides a quick method to increase the cluster’s
performance by simply adding additional nodes.
 A sharded cluster can continue to operate even if a single or multiple shards are
unavailable. While the data on those shards are unavailable, the client application can
still access all the other available shards within the cluster without any downtime. In
production environments, all individual shards deploy as replica sets, further increasing
the availability of the cluster.

Limitations

 Because of the complexity involved, sharding requires careful planning and ongoing maintenance of the sharded cluster.
 When you shard a MongoDB collection, there is no way to unshard the sharded
collection.
 The shard key directly impacts the overall performance of the underlying cluster, as it is
used to identify all the documents within the collections.
 There are some operational restrictions within a MongoDB sharded environment. For
example, the geoSearch command is not supported within a sharded environment.
 In an instance where a shard key or a prefix of compound shard key is not present,
Mongo will perform a broadcast operation that queries all the shards in the cluster,
which can result in long-running query tasks.
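As a concrete sketch, enabling sharding from the mongos shell looks like the following (the database name testdb, the collection users, and the shard key userId are assumptions chosen for illustration):

// run against the mongos router
sh.enableSharding("testdb")                        // allow sharding for this database
sh.shardCollection("testdb.users", { userId: 1 })  // range-shard the collection on userId
sh.status()                                        // inspect shards and chunk distribution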

3.4 Document based

A document-based (document-oriented) NoSQL database stores each record as a self-describing document, typically in JSON/BSON format, and groups documents into collections; documents in the same collection need not share an identical structure. MongoDB, covered next, is the most widely used document database.


3.5 MongoDB Operation: Insert, Update, Delete, Query, Indexing, Application, Replication, Sharding

Creating a database using the “use” command

Creating a database in MongoDB is as simple as issuing the “use” command. The following example shows how this can be done.
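The original screenshot is missing here; a minimal shell session (the database name EmployeeDB is an assumption, chosen to match the Employee collection used below) looks like this:

> use EmployeeDB
switched to db EmployeeDB
> db
EmployeeDB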

Creating a Collection/Table using insert()

The easiest way to create a collection is to insert a record (which is nothing but a document consisting of field names and values) into a collection. If the collection does not exist, a new one will be created.

The following example shows how this can be done.

db.Employee.insert
(
{
"Employeeid" : 1,
"EmployeeName" : "Martin"
}
)


Basic document updates


MongoDB provides the update() command to update the documents of a collection. To
update only the documents you want to update, you can add a criteria to the update statement
so that only selected documents are updated.

The basic parameters in the command are a condition identifying which document needs to be
updated, followed by the modification which needs to be performed.

The following example shows how this can be done.

Step 1) Issue the update command

Step 2) Choose the condition which you want to use to decide which document needs to be
updated. In our example, we want to update the document which has the Employeeid 1.

Step 3) Use the set command to modify the Field Name

Step 4) Choose which Field Name you want to modify and enter the new value accordingly.

db.Employee.update(
{"Employeeid" : 1},
{$set: { "EmployeeName" : "NewMartin"}});

If the command is executed successfully, the following Output will be shown

Output:
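
A representative result (exact counters depend on your data) looks like this:

WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })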

The output clearly shows that one record matched the condition and hence the relevant field
value was modified.

Example #1 – deleteOne
Code:


db.code.deleteOne({"name":"malti"})

Output:
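
A representative result, assuming exactly one document matched the filter:

{ "acknowledged" : true, "deletedCount" : 1 }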

Explanation:

 Here we attempt to delete a single record that matches the mentioned key-value
pair. To start with, code is our collection in which the document might or might not
exist. Then we have our method, deleteOne, and then we have the filter
mentioned inside. Here, our filter looks for a document that has the key
“name” with a value matching “malti”.
 Upon finding a document which matches the filter, the method will delete the
document. As you can see, after we implemented the deleteOne method and then
listed the whole collection, we no longer have any record or document with the
name malti.

Example #2 – deleteMany
Code:

db.code.find()
db.code.deleteMany({"city":"Pune"})

Output:
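
A representative result, assuming the two matching documents described in the explanation below:

{ "acknowledged" : true, "deletedCount" : 2 }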

Explanation:

 Starting with db and the collection name, we call the deleteMany method, which will
delete multiple documents in the code collection. It relies on the filter mentioned to
delete these documents. Our filter is “{“city”: “Pune”}”, meaning it will delete every
document that has the city key matching the value Pune.


 Executing this query, every document present in the collection “code” with Pune as
the city will be deleted at once. As you can see, we implemented the deleteMany
method with a filter and then returned the whole collection, which is now empty.
Initially, we had two documents with the city Pune, but there are no documents
with the city Pune after executing our query. This is how deleteMany deletes every
record that matches the filter.

Example #3 – Complete Deletion


The deleteMany method can also delete every single record available in the collection at once. By
simply not specifying any filter, we attempt to delete every single record stored in the
collection.

Code:

db.code.find().count()
db.code.deleteMany( {} )
db.code.find().count()

Output:
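
Representative results, assuming the 195 records mentioned in the explanation below:

195
{ "acknowledged" : true, "deletedCount" : 195 }
0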

Explanation:

 As you can see in the above output, we first checked the total count of records
in the collection, which was 195. Then we executed the deleteMany query with a
blank filter, which deleted every single record available.
 This resulted in emptying the whole collection. Later, upon checking the count,
we get 0 as a result, meaning no records remain. That’s how deleteMany with no filter works.

MongoDB - Query Document


In this chapter, we will learn how to query documents from a MongoDB collection.

The find() Method

To query data from MongoDB collection, you need to use MongoDB's find() method.
Syntax
The basic syntax of find() method is as follows −
>db.COLLECTION_NAME.find()


find() method will display all the documents in a non-structured way.


Example
Assume we have created a collection named mycol as −
> use sampleDB
switched to db sampleDB
> db.createCollection("mycol")
{ "ok" : 1 }
>
And inserted two documents in it using the insert() method as shown below −
> db.mycol.insert([
{
title: "MongoDB Overview",
description: "MongoDB is no SQL database",
by: "tutorials point",
url: "http://www.tutorialspoint.com",
tags: ["mongodb", "database", "NoSQL"],
likes: 100
},
{
title: "NoSQL Database",
description: "NoSQL database doesn't have tables",
by: "tutorials point",
url: "http://www.tutorialspoint.com",
tags: ["mongodb", "database", "NoSQL"],
likes: 20,
comments: [
{
user:"user1",
message: "My first comment",
dateCreated: new Date(2013,11,10,2,35),
like: 0
}
]
}
])
Following method retrieves all the documents in the collection −
> db.mycol.find()
{ "_id" : ObjectId("5dd4e2cc0821d3b44607534c"), "title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database", "by" : "tutorials point", "url" :
"http://www.tutorialspoint.com", "tags" : [ "mongodb", "database", "NoSQL" ], "likes" : 100
}

{ "_id" : ObjectId("5dd4e2cc0821d3b44607534d"), "title" : "NoSQL Database",


"description" : "NoSQL database doesn't have tables", "by" : "tutorials point", "url" :
"http://www.tutorialspoint.com", "tags" : [ "mongodb", "database", "NoSQL" ], "likes" : 20,


"comments" : [ { "user" : "user1", "message" : "My first comment", "dateCreated" :


ISODate("2013-12-09T21:05:00Z"), "like" : 0 } ] }
>

The pretty() Method

To display the results in a formatted way, you can use pretty() method.
Syntax
>db.COLLECTION_NAME.find().pretty()
Example
Following example retrieves all the documents from the collection named mycol and arranges
them in an easy-to-read format.
> db.mycol.find().pretty()
{
"_id" : ObjectId("5dd4e2cc0821d3b44607534c"),
"title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 100
}
{
"_id" : ObjectId("5dd4e2cc0821d3b44607534d"),
"title" : "NoSQL Database",
"description" : "NoSQL database doesn't have tables",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 20,
"comments" : [
{
"user" : "user1",
"message" : "My first comment",
"dateCreated" : ISODate("2013-12-09T21:05:00Z"),
"like" : 0
}
]
}


The findOne() method

Apart from the find() method, there is the findOne() method, which returns only one document.
Syntax
>db.COLLECTION_NAME.findOne()
Example
Following example retrieves the document with title MongoDB Overview.
> db.mycol.findOne({title: "MongoDB Overview"})
{
"_id" : ObjectId("5dd6542170fb13eec3963bf0"),
"title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 100
}

Understanding the Impact of Indexes


Even though, as we saw in the introduction, indexes are good for queries, having too many
indexes can slow down other operations such as the Insert, Delete and Update operations.

If there are frequent insert, delete and update operations carried out on documents, then the
indexes would need to change that often, which would just be an overhead for the collection.

The example below shows what field values could constitute an index in a
collection. An index can either be based on just one field in the collection, or it can be based
on multiple fields in the collection.

In the example below, the Employeeid “1” and EmployeeCode “AA” are used to index the
documents in the collection. So when a query search is made, these indexes will be used to
quickly and efficiently find the required documents in the collection.

So even if the search query is based on the EmployeeCode “AA”, that document would be
returned.


How to Create Indexes: createIndex()


Creating an Index in MongoDB is done by using the “createIndex” method.

The following example shows how to add an index to a collection. Let’s assume that we have our
same Employee collection which has the Field names of “Employeeid” and
“EmployeeName”.

db.Employee.createIndex({Employeeid:1})

Code Explanation:

1. The createIndex method is used to create an index based on the “Employeeid” of the
document.
2. The ‘1’ parameter indicates that when the index is created with the “Employeeid”
Field values, they should be sorted in ascending order. Please note that this is different
from the _id field (The id field is used to uniquely identify each document in the
collection) which is created automatically in the collection by MongoDB. The
documents will now be sorted as per the Employeeid and not the _id field.

If the command is executed successfully, the following Output will be shown:

Output:
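
A representative result, consistent with the explanation below:

{
    "createdCollectionAutomatically" : false,
    "numIndexesBefore" : 1,
    "numIndexesAfter" : 2,
    "ok" : 1
}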


1. The numIndexesBefore: 1 indicates the number of Field values (The actual fields in
the collection) which were there in the indexes before the command was run.
Remember that each collection has the _id field which also counts as a Field value to
the index. Since the _id index field is part of the collection when it is initially created,
the value of numIndexesBefore is 1.
2. The numIndexesAfter: 2 indicates the number of Field values which were there in the
indexes after the command was run.
3. Here the “ok: 1” output specifies that the operation was successful, and the new index
is added to the collection.

How to Find Indexes: getIndexes()


Finding an Index in MongoDB is done by using the “getIndexes” method.

The following example shows how this can be done;

db.Employee.getIndexes()
Code Explanation:

 The getIndexes method is used to find all of the indexes in a collection.

If the command is executed successfully, the following Output will be shown:


Output:
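
A representative result (the index version field v varies by MongoDB release):

[
    { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_" },
    { "v" : 2, "key" : { "Employeeid" : 1 }, "name" : "Employeeid_1" }
]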

 The output returns a document which shows that there are 2 indexes in the
collection: one is the _id field, and the other is the Employeeid field. The :1
indicates that the field values in the index are created in ascending order.

How to Drop Indexes: dropIndex()


Removing an Index in MongoDB is done by using the dropIndex method.

The following example shows how this can be done;

db.Employee.dropIndex({Employeeid:1})
Code Explanation:

 The dropIndex method takes the required Field values which needs to be removed
from the Index.

If the command is executed successfully, the following Output will be shown:

Output:
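
A representative result, consistent with the explanation below:

{ "nIndexesWas" : 3, "ok" : 1 }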


1. The nIndexesWas: 3 indicates the number of Field values which were there in the
indexes before the command was run. Remember that each collection has the _id field
which also counts as a Field value to the index.
2. The ok: 1 output specifies that the operation was successful, and the “Employeeid”
field is removed from the index.

To remove all of the indexes at once in the collection, one can use the dropIndexes command.

The following example shows how this can be done.

db.Employee.dropIndexes()
Code Explanation:

 The dropIndexes method will drop all of the indexes except for the _id index.

If the command is executed successfully, the following Output will be shown:

Output:
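
A representative result, consistent with the explanation below:

{
    "nIndexesWas" : 2,
    "msg" : "non-_id indexes dropped for collection",
    "ok" : 1
}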


1. The nIndexesWas: 2 indicates the number of Field values which were there in the
indexes before the command was run.
2. Remember again that each collection has the _id field which also counts as a Field
value to the index, and that will not be removed by MongoDB and that is what this
message indicates.
3. The ok: 1 output specifies that the operation was successful.

3.6 Cassandra: Data Model, Key Space, Table Operations, CRUD Operations, CQL Types

3.6.1 Cassandra - Data Model


The data model of Cassandra is significantly different from what we normally see in an
RDBMS. This chapter provides an overview of how Cassandra stores its data.

Cluster

Cassandra database is distributed over several machines that operate together. The outermost
container is known as the Cluster. For failure handling, every node contains a replica, and in
case of a failure, the replica takes charge. Cassandra arranges the nodes in a cluster, in a ring
format, and assigns data to them.
Keyspace
Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace
in Cassandra are −
 Replication factor − It is the number of machines in the cluster that will receive
copies of the same data.
 Replica placement strategy − It is nothing but the strategy to place replicas in the
ring. We have strategies such as simple strategy (rack-aware strategy), old network
topology strategy (rack-aware strategy), and network topology strategy (datacenter-
shared strategy).
 Column families − Keyspace is a container for a list of one or more column families.
A column family, in turn, is a container of a collection of rows. Each row contains
ordered columns. Column families represent the structure of your data. Each keyspace
has at least one and often many column families.
The syntax of creating a Keyspace is as follows −


CREATE KEYSPACE <keyspace name>
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};

3.6.2 Cassandra Create Keyspace

Cassandra Query Language (CQL) facilitates developers to communicate with Cassandra.


The syntax of Cassandra query language is very similar to SQL.

What is Keyspace?

A keyspace is an object that is used to hold column families and user-defined types. A keyspace
is like an RDBMS database: it contains column families, indexes, user-defined types, data
center awareness, the strategy used in the keyspace, the replication factor, etc.

In Cassandra, "Create Keyspace" command is used to create keyspace.

Syntax:

CREATE KEYSPACE <identifier> WITH <properties>

Or

CREATE KEYSPACE KeyspaceName WITH replication = {'class': <strategy name>,
'replication_factor': <number of replicas on different nodes>};

Different components of Cassandra Keyspace

Strategy: There are two types of strategy declaration in Cassandra syntax:

o Simple Strategy: Simple strategy is used in the case of one data center. In this
strategy, the first replica is placed on the selected node and the remaining nodes are
placed in clockwise direction in the ring without considering rack or node location.
o Network Topology Strategy: This strategy is used in the case of more than one data
centers. In this strategy, you have to provide replication factor for each data center
separately.

Replication Factor: The replication factor is the number of replicas of data placed on different
nodes. More than two replicas are good to attain no single point of failure, so 3 is a
good replication factor.

Example:

Let's take an example to create a keyspace named "javatpoint".


CREATE KEYSPACE javatpoint
WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};

Keyspace is created now.

Verification:

To check whether the keyspace is created or not, use the "DESCRIBE" command. By using
this command you can see all the keyspaces that are created.
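
A minimal sketch (the list of system keyspaces shown is representative and varies by Cassandra version):

cqlsh> DESCRIBE keyspaces;

javatpoint  system  system_auth  system_schema  system_distributed  system_traces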

There is another property of CREATE KEYSPACE in Cassandra.

Durable_writes

By default, the durable_writes property of a table is set to true; you can also set this
property to false. However, this property cannot be used with SimpleStrategy.

Example:


Let's take an example to see the usage of the durable_writes property.

CREATE KEYSPACE sssit
WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 }
AND DURABLE_WRITES = false;

Verification:

To check whether the keyspace is created or not, use the "DESCRIBE" command. By using
this command you can see all the keyspaces that are created.

Using a Keyspace

To use the created keyspace, you have to use the USE command.

Syntax:

USE <identifier>

See this example:

Here, we are using javatpoint keyspace.
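
A minimal sketch — note how the cqlsh prompt changes to show the keyspace in use:

cqlsh> USE javatpoint;
cqlsh:javatpoint>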


3.6.3 Operations on Table in Cassandra

Creating a table – Register:


First, we are going to create a table named Register, in which id, name, email, and
city are the fields. Let’s have a look.

Create table Register


(
id uuid primary key,
name text,
email text,
city text
);
Inserting data into the Register table:
After creating the table, we are going to insert some data into the Register table. Let’s have
a look.
Insert into Register (id, name, email, city)
values(uuid(), 'Ashish', '[email protected]', 'delhi');

Insert into Register (id, name, email, city)


values(uuid(), 'abi', '[email protected]', 'mumbai');

Insert into Register (id, name, email, city)


values(uuid(), 'rana', '[email protected]', 'bangalore');
Verify the results:
To verify the results, use the following CQL query given below. Let’s have a look.
select *
from Register;


Cassandra Crud Operation – Create, Update, Read & Delete


In this section, we will learn about the Cassandra CRUD operations: Create, Update, Read & Delete.
Moreover, we will cover the syntax and an example of each CRUD operation in Cassandra.
So, let’s start with the Cassandra CRUD operations.


What is Cassandra CRUD Operation?

Cassandra CRUD Operation stands for Create, Read, Update and Delete (or Drop). These
operations are used to manipulate data in Cassandra. Apart from this, with CRUD operations in
Cassandra, a user can also verify the command or the data.
a. Create Operation

A user can insert data into the table using the Cassandra Create operation. The data is stored in
the columns of a row in the table. Using the INSERT command with appropriate values, a user can
perform this operation.
A Syntax of Create Operation-
INSERT INTO <table name>

(<column1>,<column2>....)


VALUES (<value1>,<value2>...)

USING <option>

INPUT:

cqlsh:keyspace1> INSERT INTO student(en, name, branch, phone, city)

VALUES(001, 'Ayush', 'Electrical Engineering', 9999999999, 'Boston');

cqlsh:keyspace1> INSERT INTO student(en, name, branch, phone, city)

VALUES(002, 'Aarav', 'Computer Engineering', 8888888888, 'New York City');

cqlsh:keyspace1> INSERT INTO student(en, name, branch, phone, city)

VALUES(003, 'Kabir', 'Applied Physics', 7777777777, 'Philadelphia');

Cassandra Crud Operation – OUTPUT After Verification (READ operation)


EN NAME BRANCH PHONE CITY

001 Ayush Electrical Engineering 9999999999 Boston

002 Aarav Computer Engineering 8888888888 New York City

003 Kabir Applied Physics 7777777777 Philadelphia

b. Update Operation

The second operation in the Cassandra CRUD operation is the UPDATE operation. A user
can use UPDATE command for the operation. This operation uses three keywords while
updating the table.
 Where: This keyword will specify the location where data is to be updated.
 Set: This keyword will specify the updated value.
 Must: This keyword includes the columns composing the primary key.
Furthermore, at the time of updating the rows, if a row is unavailable, then Cassandra has a
feature to create a fresh row for the same.
A Syntax of Update Operation-
UPDATE <table name>

SET <column name>=<new value><column name>=<value>...

WHERE <condition>


Let’s change a few details in the table ‘student’. In this example, we will update Aarav’s city
from ‘New York City’ to ‘San Francisco’.
INPUT:
cqlsh:keyspace1> UPDATE student SET city='San Francisco'

WHERE en=002;

Cassandra Crud Operation – OUTPUT After Verification


EN NAME BRANCH PHONE CITY

001 Ayush Electrical Engineering 9999999999 Boston

002 Aarav Computer Engineering 8888888888 San Francisco

003 Kabir Applied Physics 7777777777 Philadelphia

c. Read Operation

This is the third Cassandra CRUD operation – the Read operation. A user has a choice to read
either the whole table or a single column. To read data from a table, a user can use the
SELECT clause. This command is also used for verifying the table after every operation.

SYNTAX to read the whole table-

SELECT * FROM <table name>;

To read the whole table ‘student’.


INPUT:
cqlsh:keyspace1> SELECT * FROM student;
Cassandra Crud Operation – OUTPUT After Verification

EN NAME BRANCH PHONE CITY

001 Ayush Electrical Engineering 9999999999 Boston

002 Aarav Computer Engineering 8888888888 San Francisco

003 Kabir Applied Physics 7777777777 Philadelphia

SYNTAX to read selected columns-

SELECT <column name1>,<column name2>.... FROM <table name>;
To read the columns name and city from table ‘student’.
INPUT:

cqlsh:keyspace1> SELECT name, city FROM student;

Table.4 Cassandra Crud Operation – OUTPUT After Verification

NAME CITY

Ayush Boston

Aarav San Francisco

Kabir Philadelphia

d. Delete Operation

The delete operation is the last Cassandra CRUD operation; it allows a user to delete data from a
table. The user can use the DELETE command for this operation.
A Syntax of Delete Operation-

DELETE <identifier> FROM <table name> WHERE <condition>;


In the ‘student’ table, let us delete the ‘phone’ value (phone number) from the row with en=003.

cqlsh:keyspace1> DELETE phone FROM student WHERE en=003;


Table.6 Cassandra Crud Operation – OUTPUT After Verification
EN NAME BRANCH PHONE CITY

001 Ayush Electrical Engineering 9999999999 Boston

002 Aarav Computer Engineering 8888888888 San Francisco

003 Kabir Applied Physics Null Philadelphia

SYNTAX for deleting the entire row-

DELETE FROM <table name> WHERE <condition>;


In the ‘student’ table, let us delete the entire third row.
cqlsh:keyspace1> DELETE FROM student WHERE en=003;
Cassandra Crud Operation – OUTPUT After Verification
EN NAME BRANCH PHONE CITY

001 Ayush Electrical Engineering 9999999999 Boston

002 Aarav Computer Engineering 8888888888 San Francisco

Cassandra - CQL Datatypes


CQL provides a rich set of built-in data types, including collection types. Along with these
data types, users can also create their own custom data types. The following table provides a
list of built-in data types available in CQL.

Data Type   Constants          Description

ascii       Strings            Represents ASCII character string

bigint      Bigints            Represents 64-bit signed long

blob        Blobs              Represents arbitrary bytes

boolean     Booleans           Represents true or false

counter     Integers           Represents counter column

decimal     Integers, floats   Represents variable-precision decimal

double      Integers           Represents 64-bit IEEE-754 floating point

float       Integers, floats   Represents 32-bit IEEE-754 floating point

inet        Strings            Represents an IP address, IPv4 or IPv6

int         Integers           Represents 32-bit signed int

text        Strings            Represents UTF-8 encoded string

timestamp   Integers, strings  Represents a timestamp

timeuuid    UUIDs              Represents type 1 UUID

uuid        UUIDs              Represents type 1 or type 4 UUID

varchar     Strings            Represents UTF-8 encoded string

varint      Integers           Represents arbitrary-precision integer

Collection Types

Cassandra Query Language also provides a collection data types. The following table
provides a list of Collections available in CQL.

Collection Description

List A list is a collection of one or more ordered elements.

Map A map is a collection of key-value pairs.

Set A set is a collection of one or more elements.

User-defined datatypes

Cqlsh provides users the facility of creating their own data types. Given below are the
commands used while dealing with user-defined datatypes.
 CREATE TYPE − Creates a user-defined datatype.
 ALTER TYPE − Modifies a user-defined datatype.
 DROP TYPE − Drops a user-defined datatype.
 DESCRIBE TYPE − Describes a user-defined datatype.
 DESCRIBE TYPES − Describes user-defined datatypes.

3.7 HIVE: Data Types, Database Operations, Partitioning


What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It


resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up
and developed it further as open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Features of Hive

 It stores schema in a database and processed data into HDFS.


 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible

Data types in Hive


Data types are very important elements in the Hive query language and in data modeling. For
defining the table column types, we must know about the data types and their usage.

The following gives a brief overview of some data types present in Hive:

 Numeric Types
 String Types
 Date/Time Types
 Complex Types

Hive Numeric Data Types:


Type       Memory allocation

TINYINT    1-byte signed integer (-128 to 127)

SMALLINT   2-byte signed integer (-32,768 to 32,767)

INT        4-byte signed integer (-2,147,483,648 to 2,147,483,647)

BIGINT     8-byte signed integer

FLOAT      4-byte single-precision floating point number

DOUBLE     8-byte double-precision floating point number

DECIMAL    We can define precision and scale in this type


Hive String Data Types:


Type      Length

CHAR      255

VARCHAR   1 to 65535

STRING    We can define length here (no limit)

Hive Date/Time Data Types:


Type        Usage

TIMESTAMP   Supports traditional Unix timestamp with optional nanosecond precision

DATE        It’s in YYYY-MM-DD format. The range of values supported for the Date type is
            0000-01-01 to 9999-12-31, dependent on support by the primitive Java Date type

Hive Complex Data Types:


Type      Usage

Arrays    ARRAY<data_type> (negative values and non-constant expressions are not allowed)

Maps      MAP<primitive_type, data_type> (negative values and non-constant expressions are not allowed)

Structs   STRUCT<col_name : data_type, ...>

Union     UNIONTYPE<data_type, data_type, ...>

Database operation:
Following are the steps on how to create and drop databases in Hive.

Step 1: Create Database in Hive

For creating a database in Hive shell, we have to use the command as shown in the syntax
below:-

Syntax:

Create database <DatabaseName>


Example: -Create database “guru99”
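
A minimal sketch of the shell session this example refers to (the OK lines are representative Hive CLI output):

hive> create database guru99;
OK
hive> show databases;
OK
default
guru99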


From the above example, we are doing two things

 Creating database “guru99” in Hive


 Displaying existing databases using the “show” command
 In the same output, the database name “guru99” is displayed at the end when we execute
the show command, which means database “guru99” was successfully created.

Step 2: Drop Database in Hive

For Dropping database in Hive shell, we have to use the “drop” command as shown in the
syntax below:-

Syntax:

Drop database <DatabaseName>


Example: -Drop database guru99
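
A minimal sketch of the corresponding shell session:

hive> drop database guru99;
OK
hive> show databases;
OK
default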

In the above example, we are doing two things

We are dropping database ‘guru99’ from Hive

Cross-checking the same with the “show” command

After checking the databases with the show command, database “guru99” no longer appears
inside Hive.

So we can confirm now that database “guru99” is dropped.


3. Creating a table
Syntax:
create table database_name.table_name(columns);
Example:
create table geeksportal.geekdata(id int,name string);
Here id and name are the two columns.
Output :

4. Display Database
Syntax:
show databases;
Output: Display the databases created.

5. Describe Database
Syntax:
describe database database_name;
Example:
describe database geeksportal;
Output: Display the HDFS path of a particular database.

Partitioning in Hive

The partitioning in Hive means dividing the table into some parts based on the values
of a particular column like date, course, city or country. The advantage of partitioning is that
since the data is stored in slices, the query response time becomes faster.

As we know that Hadoop is used to handle the huge amount of data, it is always required to
use the best approach to deal with it. The partitioning in Hive is the best example of it.

Let's assume we have data of 10 million students studying in an institute. Now, we have to
fetch the students of a particular course. If we use a traditional approach, we have to go


through the entire data. This leads to performance degradation. In such a case, we can adopt
the better approach i.e., partitioning in Hive and divide the data among the different datasets
based on particular columns.

The partitioning in Hive can be executed in two ways -

o Static partitioning
o Dynamic partitioning

Static Partitioning

In static or manual partitioning, it is required to pass the values of partitioned columns


manually while loading the data into the table. Hence, the data file doesn't contain the
partitioned columns.

Example of Static Partitioning

o First, select the database in which we want to create a table.

hive> use test;


o Create the table and provide the partitioned columns by using the following
command: -

hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

o Let's retrieve the information associated with the table.

hive> describe student;


o Load the data into the table and pass the values of partition columns with it by using
the following command: -

hive> load data local inpath '/home/codegyani/hive/student_details1' into table student


partition(course= "java");

Here, we are partitioning the students of an institute based on courses.

o Load the data of another file into the same table and pass the values of partition
columns with it by using the following command: -

hive> load data local inpath '/home/codegyani/hive/student_details2' into table student


partition(course= "hadoop");


In the following screenshot, we can see that the table student is divided into two categories.

o Let's retrieve the entire data of the table by using the following command: -

hive> select * from student;

o Now, try to retrieve the data based on partitioned columns by using the following
command: -

hive> select * from student where course="java";

In this case, we are not examining the entire data. Hence, this approach improves query
response time.


Dynamic Partitioning

In dynamic partitioning, the values of partitioned columns exist within the table. So, it is
not required to pass the values of partitioned columns manually.

o First, select the database in which we want to create a table.

hive> use test;

o Enable the dynamic partition by using the following commands: -

hive> set hive.exec.dynamic.partition=true;


hive> set hive.exec.dynamic.partition.mode=nonstrict;
o Create a dummy table to store the data.

hive> create table stud_demo(id int, name string, age int, institute string, course string)
row format delimited

fields terminated by ',';

o Now, load the data into the table.

hive> load data local inpath '/home/codegyani/hive/student_details' into table stud_demo;

o Create a partition table by using the following command: -

hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited

fields terminated by ',';


o Now, insert the data of dummy table into the partition table.

hive> insert into student_part

partition(course)
select id, name, age, institute, course

from stud_demo;

o In the following screenshot, we can see that the table student_part is divided into two

categories.


o Let's retrieve the entire data of the table by using the following command: -
hive> select * from student_part;

o Now, try to retrieve the data based on partitioned columns by using the following
command: -

hive> select * from student_part where course= "java";

In this case, we are not examining the entire data. Hence, this approach improves query
response time.

o Let's also retrieve the data of another partitioned dataset by using the following
command: -


hive> select * from student_part where course= "hadoop";

3.8 HiveQL

What is Hive Query Language (HiveQL)?


Hive Query Language (HiveQL) is a query language in Apache Hive for processing
and analyzing structured data. It shields users from the complexity of MapReduce
programming. It reuses common concepts from relational databases, such as tables, rows,
columns, and schema, to ease learning. Hive provides a CLI for writing queries in the
Hive Query Language (HiveQL).
Most interactions tend to take place over a command line interface (CLI). Generally,
HiveQL syntax is similar to the SQL syntax that most data analysts are familiar with. Hive
supports four file formats: TEXTFILE, SEQUENCEFILE, ORC and RCFILE
(Record Columnar File).

Hive uses the Derby database for single-user metadata storage; for multi-user or
shared metadata, Hive uses MySQL.
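
As a small illustration (reusing the student table from the partitioning examples above), a typical HiveQL query reads just like SQL, while Hive compiles it into MapReduce jobs behind the scenes:

hive> SELECT course, COUNT(*) FROM student GROUP BY course;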

3.9 OrientDB Graph database

OrientDB is the first Multi-Model Open Source NoSQL DBMS that combines the
power of graphs and the flexibility of documents into one scalable, high-performance
operational database.


Gone are the days when your database only supported a single data model. As a direct
response to polyglot persistence, multi-model databases acknowledge the need for multiple
data models, combining them to reduce operational complexity and maintain data
consistency. Though graph databases have grown in popularity, most NoSQL products are
still used to provide scalability to applications sitting on a relational DBMS. Advanced
2nd generation NoSQL products like OrientDB are the future: providing more
functionality and flexibility, while being powerful enough to replace your operational
DBMS.

Speed
OrientDB was engineered from the ground up with performance as a key specification. It is
fast on both read and write operations, storing up to 120,000 records per second.

 No more Joins: relationships are physical links to the records.


 Better RAM use.
 Traverses parts of or entire trees and graphs of records in milliseconds.
 Traversing speed is not affected by the database size.

Enterprise

While most NoSQL DBMSs are used as secondary databases, OrientDB is powerful and
flexible enough to be used as an operational DBMS. Though OrientDB Community
Edition is free for commercial use, robust applications need enterprise level functionalities
to guarantee data security and flawless performance. OrientDB Enterprise Edition gives
you all the features of our community edition plus:

 Incremental backups
 Unmatched security
 24x7 Support
 Query Profiler
 Distributed Clustering configuration
 Metrics Recording
 Live Monitor with configurable alerts

Record
The smallest unit that you can load from and store in the database. Records can be stored in
four types.

 Document
 Record Bytes
 Vertex
 Edge
The SQL Reference of the OrientDB database provides several commands to create, alter,
and drop databases.
The following statement is a basic syntax of Create Database command.


CREATE DATABASE <database-url> [<user> <password> <storage-type> [<db-type>]]
Following are the details about the options in the above syntax.
<database-url> − Defines the URL of the database. URL contains two parts, one is
<mode> and the second one is <path>.
<mode> − Defines the mode, i.e. local mode or remote mode.
<path> − Defines the path to the database.
<user> − Defines the user you want to connect to the database.
<password> − Defines the password for connecting to the database.
<storage-type> − Defines the storage types. You can choose between PLOCAL and
MEMORY.

Example

You can use the following command to create a local database named demo.
Orientdb> CREATE DATABASE PLOCAL:/opt/orientdb/databases/demo
If the database is successfully created, you will get the following output.
Database created successfully.
Current database is: plocal: /opt/orientdb/databases/demo

orientdb {db = demo}>


Alter database
A database has different attributes that you can modify
as per your requirements.
The following statement is the basic syntax of the Alter Database command.
ALTER DATABASE <attribute-name> <attribute-value>
Where <attribute-name> defines the attribute that you want to modify and <attribute-value>
defines the value you want to set for that attribute.

Example

From version 2.2 of OrientDB, a new SQL parser was added which will not allow the
regular syntax in some cases. Therefore, we have to disable the new SQL parser (StrictSQL)
in some cases. You can use the following Alter Database command to disable the StrictSQL
parser.

orientdb> ALTER DATABASE custom strictSQL = false


If the command is executed successfully, you will get the following output.
Database updated successfully.

Drop database


Similar to RDBMS, OrientDB provides the feature to drop a database. Drop database refers
to removing a database completely.
The following statement is the basic syntax of the Drop database command.
DROP DATABASE [<database-name> <server-username> <server-user-password>]
Following are the details about the options in the above syntax.
<database-name> − Database name you want to drop.
<server-username> − Username of the database who has the privilege to drop a database.
<server-user-password> − Password of the particular user.

Example

There are two ways to drop a database: one is to drop the currently open database, and the
second is to drop a particular database by providing its name.
In this example, we will use the same database named ‘demo’ that we created in an earlier
section. You can use the following command to drop the database demo.

orientdb {db = demo}> DROP DATABASE


If this command is successfully executed, you will get the following output.
Database 'demo' deleted successfully
OR
You can use another command to drop a database as follows.
orientdb> DROP DATABASE PLOCAL:/opt/orientdb/databases/demo admin admin
If this command is successfully executed, you will get the following output.
Database 'demo' deleted successfully

Components of OrientDB graph database


Vertex
An OrientDB database is not only a document database but also a graph database. New
concepts such as Vertex and Edge are used to store the data in the form of a graph. In graph
databases, the most basic unit of data is the node, which in OrientDB is called a vertex. The
vertex stores information for the database.
Edge
There is a separate record type called the Edge that connects one vertex to another. Edges are
bidirectional and can only connect two vertices. There are two types of edges in OrientDB:
regular and lightweight.
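
A minimal sketch in OrientDB SQL (the class and property names Person, FriendOf, and name are illustrative) that creates two vertices and connects them with an edge:

CREATE CLASS Person EXTENDS V
CREATE CLASS FriendOf EXTENDS E
CREATE VERTEX Person SET name = 'Alice'
CREATE VERTEX Person SET name = 'Bob'
CREATE EDGE FriendOf FROM (SELECT FROM Person WHERE name = 'Alice') TO (SELECT FROM Person WHERE name = 'Bob')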

Download OrientDB from the following URL:


https://orientdb.com/download

Unzip it on your FileSystem and open a shell in the directory.

Now type

cd orientdb-3.0.0
cd bin
and then, if you are on Linux/OSX, you can start the server with

./server.sh
if you are on Windows, start the server with

server.bat
You will see OrientDB starting

(OrientDB prints its ASCII-art logo banner here, ending with the words GRAPH DATABASE and orientdb.com, followed by the startup log:)

2017-08-14 14:11:12:824 INFO Loading configuration from:


/Users/luigidellaquila/temp/orient/orientdb-community-3.0.0m2/config/orientdb-server-
config.xml... [OServerConfigurationLoaderXml]
2017-08-14 14:11:12:932 INFO OrientDB Server v3.0.0 (build
4abea780acc12595bad8cbdcc61ff96980725c3b) is starting up... [OServer]
2017-08-14 14:11:12:951 INFO OrientDB auto-config DISKCACHE=12.373MB
(heap=1.963MB direct=524.288MB os=16.384MB) [orientechnologies]
2017-08-14 14:11:12:994 INFO Databases directory:
/Users/luigidellaquila/temp/orient/orientdb-community-3.0.0m2/databases [OServer]
2017-08-14 14:11:13:017 INFO Creating the system database 'OSystem' for current server
[OSystemDatabase]


2017-08-14 14:11:14:457 INFO Listening binary connections on 0.0.0.0:2424 (protocol


v.37, socket=default) [OServerNetworkListener]
2017-08-14 14:11:14:459 INFO Listening http connections on 0.0.0.0:2480 (protocol
v.10, socket=default) [OServerNetworkListener]

+---------------------------------------------------------------+
| WARNING: FIRST RUN CONFIGURATION |
+---------------------------------------------------------------+
| This is the first time the server is running. Please type a |
| password of your choice for the 'root' user or leave it blank |
| to auto-generate it. |
| |
| To avoid this message set the environment variable or JVM |
| setting ORIENTDB_ROOT_PASSWORD to the root password to use. |
+---------------------------------------------------------------+

Root password [BLANK=auto generate it]: *


The first time you start the server, you will be asked to enter a root password (twice). You
can choose the password you prefer; just make sure to remember it, as you will need it later.

3.10 OrientDB Features

Features

 Quick installation. OrientDB can be installed and running in less than 60 seconds
 Fully transactional: supports ACID transactions guaranteeing that all database
transactions are processed reliably and in the event of a crash all pending documents are
recovered and committed.
 Graph structured data model: native management of graphs. Fully compliant with
the Apache TinkerPop Gremlin (previously known as Blueprints) open source graph
computing framework.
 SQL: supports SQL queries with extensions to handle relationships without SQL join,
manage trees, and graphs of connected documents.
 Web technologies: natively supports HTTP, RESTful protocol, and JSON without requiring
additional libraries or components.
 Distributed: full support for multi-master replication including geographically distributed
clusters.
 Run anywhere: implemented using pure Java allowing it to be run on Linux, OS
X, Windows, or any system with a compliant JVM.
 Embeddable: local mode to use the database bypassing the Server. Perfect for scenarios
where the database is embedded.
 Apache 2 License: always free for any usage. No fees or royalties required to use it.
 Full server has a footprint of about 512 MB.
 Commercial support is available from OrientDB.
 Pattern matching: Introduced in version 2.2, the Match statement queries the database in
a declarative manner, using pattern matching.


 Security features introduced in OrientDB 2.2 provide an extensible framework for adding
external authenticators, password validation, LDAP import of database roles and users,
advanced auditing capabilities, and syslog support. OrientDB Enterprise Edition
provides Kerberos (protocol) authentication with full browser SPNEGO support. When it
comes to database encryption, starting with version 2.2, OrientDB can encrypt records on
disk. This prevents unauthorized users from accessing database content or even from
bypassing OrientDB security.
 Teleporter: Allows relational databases to be quickly imported into OrientDB in a few
simple steps.
 Cloud ready: OrientDB can be deployed in the cloud and supports the following
providers: Amazon Web Services, Microsoft Azure, CenturyLink Cloud, Jelastic,
DigitalOcean


UNIT IV XML DATABASES

Structured, Semi structured, and Unstructured Data – XML Hierarchical Data Model –
XML Documents – Document Type Definition – XML Schema – XML Documents and
Databases – XML Querying – XPath – XQuery

4.1 Structured, Semi structured, and Unstructured data

Big Data includes huge volume, high velocity, and an extensible variety of data. There are 3 types:
structured data, semi-structured data, and unstructured data.

1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been
organized into a formatted repository that is typically a database. It concerns all data which can
be stored in an SQL database in a table with rows and columns. Such data have relational keys and can
easily be mapped into pre-designed fields. Today, structured data is the most processed kind in
application development and the simplest way to manage information. Example: Relational data.

2. Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but has
some organizational properties that make it easier to analyze. With some processing, you can
store it in a relational database (though this can be very hard for some kinds of semi-structured data),
but semi-structured formats exist to ease storage. Example: XML data.

3. Unstructured data –
Unstructured data is data which is not organized in a predefined manner or does not have a
predefined data model; thus it is not a good fit for a mainstream relational database. So, for
unstructured data, there are alternative platforms for storing and managing it. It is increasingly
prevalent in IT systems and is used by organizations in a variety of business intelligence and
analytics applications. Example: Word, PDF, Text, Media logs.

4.2 XML Hierarchical Data Model

XML Hierarchical (Tree) Data Model

We now introduce the data model used in XML. The basic object in XML is the XML
document. Two main structuring concepts are used to construct an XML
document: elements and attributes. It is important to note that the term attribute in XML is not used
in the same manner as is customary in database terminology, but rather as it is used in document


description languages such as HTML and SGML. Attributes in XML provide additional information
that describes elements, as we will see. There are additional concepts in XML, such as entities,
identifiers, and references, but first we concentrate on describing elements and attributes to show the
essence of the XML model.

Figure 12.3 shows an example of an XML element called <Projects>. As in HTML, elements are
identified in a document by their start tag and end tag. The tag names are enclosed between angled
brackets < ... >, and end tags are further identified by a slash, </ ... >.

Figure 12.3 A complex XML element called <Projects>
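
A sketch of such a document, consistent with the element names discussed below (the data values are illustrative):

<?xml version="1.0" standalone="yes"?>
<Projects>
   <Project>
      <Name>ProductX</Name>
      <Number>1</Number>
      <Location>Bellaire</Location>
      <Dept_no>5</Dept_no>
      <Worker>
         <Ssn>123456789</Ssn>
         <Last_name>Smith</Last_name>
         <Hours>32.5</Hours>
      </Worker>
      <Worker>
         <Ssn>453453453</Ssn>
         <First_name>Joyce</First_name>
         <Hours>20.0</Hours>
      </Worker>
   </Project>
</Projects>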

Complex elements are constructed from other elements hierarchically, whereas simple
elements contain data values. A major difference between XML and HTML is that XML tag names
are defined to describe the meaning of the data elements in the document, rather than to describe how
the text is to be displayed. This makes it possible to process the data elements in the XML document
automatically by computer programs. Also, the XML tag (element) names can be defined in another
document, known as the schema document, to give a semantic meaning to the tag names that can be


exchanged among multiple users. In HTML, all tag names are predefined and fixed; that is why they
are not extendible.

It is straightforward to see the correspondence between the XML textual representation


shown in Figure 12.3 and the tree structure shown in Figure 12.1. In the tree representation, internal
nodes represent complex elements, whereas leaf nodes represent simple elements. That is why the
XML model is called a tree model or a hierarchical model. In Figure 12.3, the simple elements are
the ones with the tag names <Name>, <Number>, <Location>, <Dept_no>, <Ssn>, <Last_name>,
<First_name>, and <Hours>. The complex elements are the ones with the tag
names <Projects>, <Project>, and <Worker>. In general, there is no limit on the levels of nesting of
elements.

It is possible to characterize three main types of XML documents:

Data-centric XML documents. These documents have many small data items that follow a
specific structure and hence may be extracted from a structured database. They are formatted as XML
documents in order to exchange them over or display them on the Web. These usually follow
a predefined schema that defines the tag names.

Document-centric XML documents. These are documents with large amounts of text, such as
news articles or books. There are few or no structured data elements in these documents.

Hybrid XML documents. These documents may have parts that contain structured data and
other parts that are predominantly textual or unstructured. They may or may not have a predefined
schema.

XML documents that do not follow a predefined schema of element names and corresponding
tree structure are known as schemaless XML documents. It is important to note that
data-centric XML documents can be considered either as semistructured data or as structured data as
defined in Section 12.1. If an XML document conforms to a predefined XML schema or DTD (see
Section 12.3), then the document can be considered as structured data. On the other hand, XML
allows documents that do not conform to any schema; these would be considered as semistructured
data and are schemaless XML documents. When the value of the standalone attribute in an XML
document is yes, as in the first line in Figure 12.3, the document is standalone and schemaless.

XML attributes are generally used in a manner similar to how they are used in HTML (see
Figure 12.2), namely, to describe properties and characteristics of the elements (tags) within which
they appear. It is also possible to use XML attributes to hold the values of simple data elements;
however, this is generally not recommended. An exception to this rule is in cases that need
to reference another element in another part of the XML document. To do this, it is common to use
attribute values in one element as the references. This resembles the concept of foreign keys in


relational databases, and is a way to get around the strict hierarchical model that the XML tree model
implies. We discuss XML attributes further in Section 12.3 when we discuss XML schema and DTD.
4.3 XML Documents

Two types of XML

 Well-Formed XML
 Valid XML

Well-formed XML Document

An XML document is said to be well-formed if it adheres to the following rules −


 Non-DTD XML files must use the predefined character entities for amp(&), apos(single
quote), gt(>), lt(<), quot(double quote).
 It must follow the ordering of the tags, i.e., the inner tag must be closed before closing the outer
tag.
 Each of its opening tags must have a closing tag, or it must be a self-ending
tag (<title>....</title> or <title/>).
 An attribute may appear only once in a start tag, and its value must be quoted.
 Entities other than amp(&), apos(single quote), gt(>), lt(<), quot(double quote) must
be declared.
Example
Following is an example of a well-formed XML document −

<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>


<!DOCTYPE address
[
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
]>

<address>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</address>
The above example is said to be well-formed as −
 It defines the type of document. Here, the document type is element type.
 It includes a root element named as address.
 Each of the child elements among name, company and phone is enclosed in its self
explanatory tag.
 Order of the tags is maintained.

4.4 Document Type Definition

XML DTD

An XML document with correct syntax is called "Well Formed".


An XML document validated against a DTD is both "Well Formed" and "Valid".

What is a DTD?

DTD stands for Document Type Definition.

A DTD defines the structure and the legal elements and attributes of an XML document

Valid XML Documents

A "Valid" XML document is "Well Formed", as well as it conforms to the rules of a DTD:

<?xml version="1.0" encoding="UTF-8"?>


<!DOCTYPE note SYSTEM "Note.dtd">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

The DOCTYPE declaration above contains a reference to a DTD file. The content of the DTD file is
shown and explained below.

XML DTD

The purpose of a DTD is to define the structure and the legal elements and attributes of an XML
document:

Note.dtd:
<!DOCTYPE note
[
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>

The DTD above is interpreted like this:

 !DOCTYPE note - Defines that the root element of the document is note
 !ELEMENT note - Defines that the note element must contain the elements: "to, from,
heading, body"
 !ELEMENT to - Defines the to element to be of type "#PCDATA"
 !ELEMENT from - Defines the from element to be of type "#PCDATA"
 !ELEMENT heading - Defines the heading element to be of type "#PCDATA"
 !ELEMENT body - Defines the body element to be of type "#PCDATA"

Tip: #PCDATA means parseable character data.

Using DTD for Entity Declaration

A DOCTYPE declaration can also be used to define special characters or strings, used in the
.document:

Example
<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE note [
<!ENTITY nbsp "&#xA0;">
<!ENTITY writer "Writer: adhiparasakthi.">
<!ENTITY copyright "Copyright: apec.">
]>

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<footer>&writer;&nbsp;&copyright;</footer>
</note>

4.5 XML Schema

XML XSD (XML Schema Definition)

XML schema is an alternative to DTD. An XML document is considered both "well formed" and
"valid" if it is successfully validated against an XML schema. The extension of a schema file is .xsd.
XSD stands for XML Schema Definition and it is a way to describe the structure of an XML
document. It defines the rules for all the attributes and elements in an XML document. It can also be
used to generate XML documents, and it checks the vocabulary of the document. XSD checks for
the correctness of the structure of the XML file. XSD was first published in 2001, and a second
edition was published in 2004.

What is XML schema?

 XML schema defines the structure of an XML document.


 A schema defines, list of elements, attributes used in XML documents and data type of these
elements.
 It is known as XSD.

XSD vs DTD

 XSD defines the list, order, and data types of elements and attributes; DTD defines only the
list and order of elements and attributes.
 XSD provides control over the elements and attributes used in XML documents; DTD does
not provide such control.
 XSD allows creating customized data types; DTD does not allow creating customized data types.
 The syntax of XSD is itself XML, whereas the syntax of DTD is different from an XML document.
 XSD allows defining restrictions on data (for example, restricting the content of an element
to the integer data type); DTD does not allow defining restrictions on data.

Declaring Elements in XSD

Syntax:
<xsd:element name="elementname" type="datatype" minOccurs="nonNegativeInteger"
maxOccurs="nonNegativeInteger | unbounded"/>

Where,
name: Defines the element name

type: Defines data type.

minOccurs: If its value is zero, the use of element is optional and if its value is greater than zero,
the use of element is compulsory, and should occur at least for specified number of times.

maxOccurs: If value is set as unbounded, the use of element can appear any number of times in
the XML document without any limitation.
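
For example, a declaration such as the following (the element name is illustrative) makes a phone
element optional and repeatable any number of times:

<xsd:element name="phone" type="xsd:string" minOccurs="0" maxOccurs="unbounded"/>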

XML Schema Example

Let's create a schema file.

employee.xsd

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.javatpoint.com"
xmlns="http://www.javatpoint.com"
elementFormDefault="qualified">

<xs:element name="employee">
<xs:complexType>
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
<xs:element name="email" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>

</xs:schema>

Let's see the xml file using XML schema or XSD file.

employee.xml

<?xml version="1.0"?>
<employee
xmlns="http://www.javatpoint.com"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.javatpoint.com employee.xsd">
<firstname>vimal</firstname>
<lastname>jaiswal</lastname>
<email>[email protected]</email>
</employee>
Description of XML Schema

<xs:element name="employee"> : It defines the element name employee.

<xs:complexType> : It defines that the element 'employee' is complex type.

<xs:sequence> : It defines that the complex type is a sequence of elements.

<xs:element name="firstname" type="xs:string"/> : It defines that the element 'firstname' is of
string/text type.

<xs:element name="lastname" type="xs:string"/> : It defines that the element 'lastname' is of
string/text type.

<xs:element name="email" type="xs:string"/> : It defines that the element 'email' is of string/text
type.

XML Schema Data types

There are two types of data types in XML schema.

1. simpleType
2. complexType

simpleType

A simple element is an XML element that can contain only text. It cannot contain any other
elements or attributes.

However, the "only text" restriction is quite misleading. The text can be of many different types. It
can be one of the types included in the XML Schema definition (boolean, string, date, etc.), or it can
be a custom type that you can define yourself.

You can also add restrictions (facets) to a data type in order to limit its content, or you can require the
data to match a specific pattern.

Defining a Simple Element

The syntax for defining a simple element is:

<xs:element name="xxx" type="yyy"/>
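
For example, a simple element constrained by restriction facets might be declared as follows (a
minimal sketch with illustrative names and bounds):

<xs:element name="age">
<xs:simpleType>
<xs:restriction base="xs:integer">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="120"/>
</xs:restriction>
</xs:simpleType>
</xs:element>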

complexType

There are four kinds of complex elements:

 empty elements
 elements that contain only other elements
 elements that contain only text
 elements that contain both other elements and text

A complex XML element, "description", which contains both elements and text:

<description>
It happened on <date lang="norwegian">03.03.99</date> ....
</description>

4.6 XML Documents and Databases

An XML database is used to store a huge amount of information in the XML format. As the use
of XML is increasing in every field, it is required to have a secured place to store the XML
documents. The data stored in the database can be queried using XQuery, serialized, and exported
into a desired format.

XML Database Types

There are two major types of XML databases −

 XML- enabled
 Native XML (NXD)

XML - Enabled Database

An XML-enabled database is a conventional database with extensions for storing and converting
XML documents. It is a relational database, where data is stored in tables consisting of rows and
columns. The tables contain sets of records, which in turn consist of fields.
Native XML Database
A native XML database is based on containers rather than a table format. It can store large amounts of
XML documents and data. A native XML database is queried with XPath expressions.
A native XML database has an advantage over the XML-enabled database: it is more capable of
storing, querying, and maintaining XML documents than an XML-enabled database.
Example
Following example demonstrates XML database −

<?xml version = "1.0"?>


<contact-info>
<contact1>
<name>Mr.arunkumar</name>
<company>apec</company>
<phone>(011) 123-4567</phone>
</contact1>

<contact2>
<name>Manisha Patil</name>
<company>mec</company>
<phone>(011) 789-4567</phone>
</contact2>
</contact-info>
Here, a table of contacts is created that holds the records of contacts (contact1 and contact2), each of
which in turn consists of three entities − name, company and phone.

4.7 XML Querying

4.8 XPath

XPath is a major element in the XSLT standard.

XPath can be used to navigate through elements and attributes in an XML document.

An XPath expression returns a collection of element nodes that satisfy certain patterns
specified in the expression.

The names in the XPath expression are node names in the XML document tree that are either tag
(element) names or attribute names, possibly with additional qualifier conditions to further restrict
the nodes that satisfy the pattern.

XPath Path Expressions

XPath uses path expressions to select nodes or node-sets in an XML document.

These path expressions look very much like the path expressions you use with traditional computer
file systems.

Selecting Nodes

XPath uses path expressions to select nodes in an XML document. The node is selected by following
a path or steps. The most useful path expressions are listed below:

Expression Description

nodename Selects all nodes with the name "nodename"

/ Selects from the root node

// Selects nodes in the document from the current node that match the selection no
matter where they are

. Selects the current node

.. Selects the parent of the current node

@ Selects attributes

In the table below we have listed some path expressions and the result of the expressions:

Path Expression Result

bookstore Selects all nodes with the name "bookstore"

/bookstore Selects the root element bookstore

bookstore/book Selects all book elements that are children of bookstore

//book Selects all book elements no matter where they are in the document

bookstore//book Selects all book elements that are descendant of the bookstore element, no matter
where they are under the bookstore element

//@lang Selects all attributes that are named lang

Note: If the path starts with a slash ( / ) it always represents an absolute path to an element!
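
XPath also supports predicates, written in square brackets, to select specific nodes. A few standard
examples (using a bookstore document such as the one in the XQuery section below):

/bookstore/book[1] Selects the first book element that is the child of the bookstore element

//title[@lang='en'] Selects all title elements that have a lang attribute with the value 'en'

/bookstore/book[price>35.00] Selects all book elements that have a price element with a value
greater than 35.00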

Absolute XPath: This is an XPath expression that starts from the root node, i.e., with '/'.
For example, /SoftwareTestersList/softwareTester/@name="T1".

Relative XPath: If the XPath expression starts from the selected context node, then it is considered
a relative XPath. For example, if softwareTester is the currently selected node, then @name="T1"
is a relative XPath.

Important features of XPath

o XPath defines structure: XPath is used to define the parts of an XML document i.e.
element, attributes, text, namespace, processing-instruction, comment, and document nodes.
o XPath provides path expression: XPath provides powerful path expressions, select nodes,
or list of nodes in XML documents.
o XPath is a core component of XSLT: XPath is a major element in XSLT standard and must
be followed to work with XSLT documents.
o XPath is a standard function: XPath provides a rich library of standard functions to
manipulate string values, numeric values, date and time comparison, node and QName
manipulation, sequence manipulation, Boolean values etc.
o XPath is a W3C recommendation.

4.9 XQuery

What is XQuery

XQuery is a functional query language used to retrieve information stored in XML format. XQuery is
to XML what SQL is to databases. It was designed to query XML data.

XQuery is built on XPath expressions. It is a W3C recommendation which is supported by all major
databases.

The XML Example Document

We will use the following XML document in the examples below.

"books.xml":

<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>

<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>

<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>

</bookstore>

How to Select Nodes From "books.xml"?


Functions

XQuery uses functions to extract data from XML documents.

The doc() function is used to open the "books.xml" file:

doc("books.xml")

XQuery FLWOR Expressions


What is FLWOR?

FLWOR (pronounced "flower") is an acronym for "For, Let, Where, Order by, Return".

 For - selects a sequence of nodes


 Let - binds a sequence to a variable
 Where - filters the nodes
 Order by - sorts the nodes
 Return - what to return (gets evaluated once for every node)

The XML Example Document

We will use the "books.xml" document in the examples below (same XML file as in the previous
chapter).

How to Select Nodes From "books.xml" With FLWOR

Look at the following path expression:

doc("books.xml")/bookstore/book[price>30]/title

The expression above will select all the title elements under the book elements that are under the
bookstore element that have a price element with a value that is higher than 30.

The following FLWOR expression will select exactly the same as the path expression above:

for $x in doc("books.xml")/bookstore/book
where $x/price>30
return $x/title

The result will be:

<title lang="en">XQuery Kick Start</title>


<title lang="en">Learning XML</title>

With FLWOR you can sort the result:

for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title

The for clause selects all book elements under the bookstore element into a variable called
$x.

The where clause selects only book elements with a price element with a value greater than
30.

The order by clause defines the sort order. In this case, the result will be sorted by the title element.

The return clause specifies what should be returned. Here it returns the title elements.

The result of the XQuery expression above will be:

<title lang="en">Learning XML</title>


<title lang="en">XQuery Kick Start</title>

What does it do

XQuery is a functional language which is responsible for finding and extracting elements and
attributes from XML documents.

It can be used for following things:

o To extract information to use in a web service.
o To generate summary reports.
o To transform XML data to XHTML.
o To search Web documents for relevant information.

Advantages of XQuery

o XQuery can be used to retrieve both hierarchical and tabular data.


o XQuery can also be used to query tree and graphical structures.
o XQuery can be used to query web pages.
o XQuery is best for XML-based databases and object-based databases. Object databases are
much more flexible and powerful than purely tabular databases.
o XQuery can be used to transform XML documents into XHTML documents.


UNIT V INFORMATION RETRIEVAL AND WEB SEARCH

IR concepts – Retrieval Models – Queries in IR system – Text Preprocessing – Inverted Indexing –
Evaluation Measures – Web Search and Analytics – Current trends.

5.1 IR concepts

Information retrieval (IR) may be defined as a software program that deals with the
organization, storage, retrieval and evaluation of information from document repositories
particularly textual information. The system assists users in finding the information they
require but it does not explicitly return the answers of the questions. It informs the existence
and location of documents that might consist of the required information. The documents that
satisfy user’s requirement are called relevant documents. A perfect IR system will retrieve
only relevant documents.
The process of information retrieval (IR) can be understood as follows: a user who needs
information formulates a request in the form of a query in natural language. The IR system then
responds by retrieving the relevant output, in the form of documents, about the required
information.

Classical Problem in Information Retrieval (IR) System

The main goal of IR research is to develop a model for retrieving information from
the repositories of documents. Here, we are going to discuss a classical problem, named ad-
hoc retrieval problem, related to the IR system.
In ad-hoc retrieval, the user must enter a query in natural language that describes the required
information. Then the IR system will return the required documents related to the desired
information. For example, suppose we are searching something on the Internet and it gives
some exact pages that are relevant as per our requirement but there can be some non-relevant
pages too. This is due to the ad-hoc retrieval problem.

5.2 Retrieval Models

Information Retrieval (IR) Model

Mathematically, models are used in many scientific areas having objective to


understand some phenomenon in the real world. A model of information retrieval predicts
and explains what a user will find in relevance to the given query. IR model is basically a
pattern that defines the above-mentioned aspects of retrieval procedure and consists of the
following −
 A model for documents.
 A model for queries.
 A matching function that compares queries to documents.
Mathematically, a retrieval model consists of −
D − Representation for documents.
Q − Representation for queries.
F − The modeling framework for D and Q, along with the relationship between them.
R(q, di) − A similarity function which orders the documents with respect to the query. It is
also called ranking.

Types of Information Retrieval (IR) Model

An information retrieval (IR) model can be classified into the following three models −
Classical IR Model
It is the simplest and easy to implement IR model. This model is based on mathematical
knowledge that was easily recognized and understood as well. Boolean, Vector and
Probabilistic are the three classical IR models.
Non-Classical IR Model
It is completely opposite to classical IR model. Such kind of IR models are based on
principles other than similarity, probability, Boolean operations. Information logic model,
situation theory model and interaction models are the examples of non-classical IR model.
Alternative IR Model
It is an enhancement of the classical IR model, making use of some specific techniques from
some other fields. The cluster model, fuzzy model and latent semantic indexing (LSI) models are
examples of the alternative IR model.

Design features of Information retrieval (IR) systems

Let us now learn about the design features of IR systems −


Inverted Index
The primary data structure of most of the IR systems is in the form of inverted index.
We can define an inverted index as a data structure that list, for every word, all documents
that contain it and frequency of the occurrences in document. It makes it easy to search for
‘hits’ of a query word.

Stop Word Elimination


Stop words are those high frequency words that are deemed unlikely to be useful for
searching. They have less semantic weights. All such kind of words are in a list called stop
list. For example, articles “a”, “an”, “the” and prepositions like “in”, “of”, “for”, “at” etc. are
the examples of stop words. The size of the inverted index can be significantly reduced by
stop list. As per Zipf’s law, a stop list covering a few dozen words reduces the size of
inverted index by almost half. On the other hand, sometimes the elimination of stop word
may cause elimination of the term that is useful for searching. For example, if we eliminate
the alphabet “A” from “Vitamin A” then it would have no significance.
Stemming
Stemming, the simplified form of morphological analysis, is the heuristic process of
extracting the base form of words by chopping off the ends of words. For example, the words
laughing, laughs, laughed would be stemmed to the root word laugh.
In our subsequent sections, we will discuss about some important and useful IR models.

The Boolean Model

It is the oldest information retrieval (IR) model. The model is based on set theory and
the Boolean algebra, where documents are sets of terms and queries are Boolean expressions
on terms. The Boolean model can be defined as −
 D − A set of words, i.e., the indexing terms present in a document. Here, each term is
either present (1) or absent (0).
 Q − A Boolean expression, where terms are the index terms and operators are logical
products − AND, logical sum − OR and logical difference − NOT
 F − Boolean algebra over sets of terms as well as over sets of documents
If we talk about the relevance feedback, then in Boolean IR model the Relevance
prediction can be defined as follows −
 R − A document is predicted as relevant to the query expression if and only if it
satisfies the query expression, for example −
((text ∨ information) ∧ retrieval ∧ ¬theory)
We can explain this model by a query term as an unambiguous definition of a set of
documents.
For example, the query term “economic” defines the set of documents that are indexed with
the term “economic”.
Now, what would be the result after combining terms with Boolean AND Operator? It will
define a document set that is smaller than or equal to the document sets of any of the single
terms. For example, the query with terms “social” and “economic” will produce the
documents set of documents that are indexed with both the terms. In other words, document
set with the intersection of both the sets.
Now, what would be the result after combining terms with Boolean OR operator? It
will define a document set that is bigger than or equal to the document sets of any of the
single terms. For example, the query with terms “social” or “economic” will produce the
documents set of documents that are indexed with either the term “social” or “economic”. In
other words, document set with the union of both the sets.
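
To make the set-based view concrete, here is a minimal sketch in Python (the documents, ids, and
index terms are hypothetical, not from the text):

# Boolean model sketch: each document is represented as the set of its index terms.
docs = {
    "d1": {"social", "economic", "policy"},
    "d2": {"economic", "growth"},
    "d3": {"social", "network"},
}

def matching(term):
    # Return the set of document ids indexed with the given term.
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# AND corresponds to set intersection, OR to set union, NOT to set difference.
print(matching("social") & matching("economic"))   # {'d1'}
print(matching("social") | matching("economic"))   # {'d1', 'd2', 'd3'}
print(matching("economic") - matching("social"))   # {'d2'}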

Advantages of the Boolean Model

The advantages of the Boolean model are as follows −


 The simplest model, which is based on sets.
 Easy to understand and implement.
 It only retrieves exact matches
 It gives the user, a sense of control over the system.

Disadvantages of the Boolean Model

The disadvantages of the Boolean model are as follows −


 The model’s similarity function is Boolean. Hence, there would be no partial matches.
This can be annoying for the users.
 In this model, the Boolean operator usage has much more influence than a critical
word.
 The query language is expressive, but it is complicated too.
 No ranking for retrieved documents.

Vector Space Model

Due to the above disadvantages of the Boolean model, Gerard Salton and his
colleagues suggested a model, which is based on Luhn’s similarity criterion. The similarity
criterion formulated by Luhn states, “the more two representations agreed in given elements
and their distribution, the higher would be the probability of their representing similar
information.”
Consider the following important points to understand more about the Vector Space Model −
 The index representations (documents) and the queries are considered as vectors
embedded in a high dimensional Euclidean space.
 The similarity measure of a document vector to a query vector is usually the cosine of
the angle between them.
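
As an illustration, a minimal Python sketch of the cosine measure over simple term-frequency
vectors (the vectors shown are hypothetical):

import math

def cosine(d, q):
    # Cosine of the angle between document vector d and query vector q.
    dot = sum(w * q.get(t, 0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q)

doc = {"social": 2, "economic": 1}
query = {"economic": 1, "policy": 1}
print(cosine(doc, query))   # ~0.316: the document partially matches the query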

5.3 Queries in IR system

During the process of indexing, many keywords are associated with document set
which contains words, phrases, date created, author names, and type of document. They are
used by an IR system to build an inverted index which is then consulted during the search.
The queries formulated by users are compared to the set of index keywords. Most IR
systems also allow the use of Boolean and other operators to build a complex query. The
query language with these operators enriches the expressiveness of a user’s information
need.
The Information Retrieval (IR) system finds the relevant documents from a large
data set according to the user query. Queries submitted by users to search engines might be
ambiguous, concise and their meaning may change over time. Some of the types of Queries
in IR systems are –

1. Keyword Queries :
 Simplest and most common queries.
 The user enters just keyword combinations to retrieve documents.
 These keywords are connected by logical AND operator.
 All retrieval models provide support for keyword queries.
2. Boolean Queries :
 Some IR systems allow using +, -, AND, OR, NOT, ( ), Boolean operators in
combination of keyword formulations.
 No ranking is involved because a document either satisfies such a query or does not
satisfy it.
 A document is retrieved for boolean query if it is logically true as exact match in
document.
3. Phrase Queries :
 When documents are represented using an inverted keyword index for searching, the
relative order of the terms in a document is lost.
 To perform exact phrase retrieval, these phrases are encoded in the inverted index or
implemented differently.
 This query consists of a sequence of words that make up a phrase.
 It is generally enclosed within double quotes.
4. Proximity Queries :
 Proximity refers to a search that accounts for how close within a record multiple terms
should be to each other.
 The most commonly used proximity search option is a phrase search that requires terms
to be in exact order.
 Other proximity operators can specify how close the terms should be to each other. Some
will specify the order of the search terms.
 Search engines use various operator names such as NEAR, ADJ (adjacent), or AFTER.
 However, providing support for complex proximity operators becomes expensive as it
requires time-consuming preprocessing of documents, and so it is suitable for smaller
document collections rather than for the Web.
5. Wildcard Queries :
 It supports regular expressions and pattern matching-based searching in text.
 Retrieval models do not directly support for this query type.
 In IR systems, certain kinds of wildcard search support may be implemented. Example:
usually words ending with trailing characters.
6. Natural Language Queries :
 There are only a few natural language search engines that aim to understand the
structure and meaning of queries written in natural language text, generally as question
or narrative.
 The system tries to formulate answers for these queries from retrieved results.
 Semantic models can provide support for this query type.

5.4 Text Preprocessing

In this section we review the commonly used text preprocessing techniques that are
part of the text processing task.

1. Stopword Removal
Stopwords are very commonly used words in a language that play a major role in the
formation of a sentence but which seldom contribute to the meaning of that sentence. Words
that are expected to occur in 80 percent or more of the documents in a collection are typically
referred to as stopwords, and they are rendered potentially useless. Because of the
commonness and function of these words, they do not contribute much to the relevance of a
document for a query search. Examples include words such
as the, of, to, a, and, in, said, for, that, was, on, he, is, with, at, by, and it. These words are
presented here with decreasing frequency of occurrence from a large corpus of documents
called AP89. The first six of these words account for 20 percent of all words in the listing, and
the most frequent 50 words account for 40 percent of all text.

Removal of stopwords from a document must be performed before indexing. Articles,


prepositions, conjunctions, and some pronouns are generally classified as stopwords. Queries
must also be preprocessed for stopword removal before the actual retrieval process. Removal
of stopwords results in elimination of possible spurious indexes, thereby reducing the size of
an index structure by about 40 percent or more. However, doing so could impact the recall if
the stopword is an integral part of a query (for example, a search for the phrase ‘To be or not
to be,’ where removal of stopwords makes the query inappropriate, as all the words in the
phrase are stopwords). Many search engines do not employ query stopword removal for this
reason.
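
A minimal Python sketch of stopword removal (the stop list here is a small illustrative subset, not
the full AP89 list):

stop_list = {"the", "of", "to", "a", "and", "in", "for", "that", "was", "on", "is", "at", "by", "it"}

def remove_stopwords(text):
    # Tokenize on whitespace and drop stopwords before indexing.
    return [w for w in text.lower().split() if w not in stop_list]

print(remove_stopwords("The removal of stopwords is performed before indexing"))
# ['removal', 'stopwords', 'performed', 'before', 'indexing']
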
2. Stemming
A stem of a word is defined as the word obtained after trimming the suffix and prefix
of an original word. For example, ‘comput’ is the stem word for computer, computing,
and computation. These suffixes and prefixes are very common in the English language for
supporting the notion of verbs, tenses, and plural forms. Stemming reduces the different
forms of the word formed by inflection (due to plurals or tenses) and derivation to a common
stem.
A stemming algorithm can be applied to reduce any word to its stem. In English, the
most famous stemming algorithm is Martin Porter’s stemming algorithm. The Porter
stemmer is a simplified version of Lovin’s technique that uses a reduced set of about 60 rules
(from 260 suffix patterns in Lovin’s technique) and organizes them into sets; conflicts within
one subset of rules are resolved before going on to the next. Using stemming for
preprocessing data results in a decrease in the size of the indexing structure and an increase in
recall, possibly at the cost of precision.
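
As a small illustration, the Porter stemmer is available in the NLTK library for Python; a minimal
sketch (assuming NLTK is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["laughing", "laughs", "laughed", "computation"]:
    print(word, "->", stemmer.stem(word))
# laughing -> laugh, laughs -> laugh, laughed -> laugh, computation -> comput
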
3. Utilizing a Thesaurus
A thesaurus comprises a precompiled list of important concepts and the main word
that describes each concept for a particular domain of knowledge. For each concept in this
list, a set of synonyms and related words is also compiled. Thus, a synonym can be converted
to its matching concept during preprocessing. This preprocessing step assists in providing a
standard vocabulary for indexing and searching. Usage of a thesaurus, also known as
a collection of synonyms, has a substantial impact on the recall of information systems. This
process can be complicated because many words have different meanings in different
contexts.

UMLS is a large biomedical thesaurus of millions of concepts (called the Metathesaurus) and
a semantic network of meta concepts and relationships that organize the Metathesaurus. The
concepts are assigned labels from the semantic network. This thesaurus of concepts contains
synonyms of medical terms, hierarchies of broader and narrower terms, and other relationships
among words and concepts that make it a very extensive resource for information retrieval of
documents in the medical domain.

WordNet is a manually constructed thesaurus that groups words into strict synonym
sets called synsets. These synsets are divided into noun, verb, adjective, and adverb
categories. Within each category, these synsets are linked together by appropriate
relationships such as class/subclass or “is-a” relationships for nouns.

WordNet is based on the idea of using a controlled vocabulary for indexing, thereby
eliminating redundancies. It is also useful in providing assistance to users with locating terms
for proper query formulation.
4. Other Preprocessing Steps: Digits, Hyphens, Punctuation Marks, Cases
Digits, dates, phone numbers, e-mail addresses, URLs, and other standard types of
text may or may not be removed during preprocessing. Web search engines, however, index
them in order to use this type of information in the document metadata to improve
precision and recall (see Section 27.6 for detailed definitions of precision and recall).

Hyphens and punctuation marks may be handled in different ways. Either the entire phrase
with the hyphens/punctuation marks may be used, or they may be eliminated. In some
systems, the character representing the hyphen/punctuation mark may be removed, or may be
replaced with a space. Different information retrieval systems follow different rules of
processing. Handling hyphens automatically can be complex: it can either be done as a
classification problem, or more commonly by some heuristic rules.

Most information retrieval systems perform case-insensitive search, converting all the
letters of the text to uppercase or lowercase. It is also worth noting that many of these text
preprocessing steps are language specific, such as involving accents and diacritics and the
idiosyncrasies that are associated with a particular language.

5. Information Extraction
Information extraction (IE) is a generic term used for extracting structured content
from text. Text analytic tasks such as identifying noun phrases, facts, events, people, places,
and relationships are examples of IE tasks. These tasks are also called named entity
recognition tasks and use rule-based approaches with either a thesaurus, regular expressions
and grammars, or probabilistic approaches. For IR and search applications, IE technologies
are mostly used to identify contextually relevant features that involve text analysis, matching,
and categorization for improving the relevance of search systems. Language technologies
using part-of-speech tagging are applied to semantically annotate the documents with
extracted features to aid search relevance.

5.5 Inverted Indexing

Inverted Indexing

The simplest way to search for occurrences of query terms in text collections can be
performed by sequentially scanning the text. This kind of online searching is only appropriate
when text collections are quite small. Most information retrieval systems process the text
collections to create indexes and operate upon the inverted index data. An inverted index
structure comprises vocabulary and document information. Vocabulary is a set of distinct
query terms in the document set. Each term in a vocabulary set has an associated collection of
information about the documents that contain the term, such as document id, occurrence
count, and offsets within the document where the term occurs. The simplest form of
vocabulary terms consists of words or individual tokens of the documents. In some cases,
these vocabulary terms also consist of phrases, n-grams, entities, links, names, dates, or
manually assigned descriptor terms from documents and/or Web pages. For each term in the
vocabulary, the corresponding document ids, occurrence locations of the term in each
document, number of occurrences of the term in each document, and other relevant
information may be stored in the document information section.
Weights are assigned to document terms to represent an estimate of the usefulness of the
given term as a descriptor for distinguishing the given document from other documents in the
same collection. A term may be a better descriptor of one document than of another by the
weighting process (see Section 27.2).

An inverted index of a document collection is a data structure that attaches distinct


terms with a list of all documents that contain the term. The process of inverted index
construction involves a series of extraction and processing steps. Acquired text
is first preprocessed and the documents are represented with the vocabulary terms.
Documents’ statistics are collected in document lookup tables. Statistics generally include
counts of vocabulary terms in individual documents as well as different collections, their
positions of occurrence within the documents, and the lengths of the documents. The
vocabulary terms are weighted at indexing time according to different criteria for collections.
For example, in some cases terms in the titles of the documents may be weighted more
heavily than terms that occur in other parts of the documents.

One of the most popular weighting schemes is the TF-IDF (term frequency-inverse
document frequency) metric that we described in Section 27.2. For a given term this
weighting scheme distinguishes to some extent the documents in which the term occurs more
often from those in which the term occurs very little or never. These weights are normalized
to account for varying document lengths, further ensuring that longer documents with
proportionately more occurrences of a word are not favored for retrieval over shorter
documents with proportionately fewer occurrences. These processed document-term streams
(matrices) are then inverted into term-document streams (matrices) for further IR steps.
The different steps involved in inverted index construction can be summarized as follows (a
short code sketch appears after the list):
1. Break the documents into vocabulary terms by tokenizing, cleansing, stopword
removal, stemming, and/or use of an additional thesaurus as vocabulary.
2. Collect document statistics and store the statistics in a document lookup table.
3. Invert the document-term stream into a term-document stream along with
additional information such as term frequencies, term positions, and term weights.
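
A minimal sketch of steps 1-3 in Python (stemming and term weighting are omitted for brevity;
the documents and stop list are illustrative):

from collections import defaultdict

stop_list = {"the", "of", "a", "and", "in"}

def build_inverted_index(docs):
    # Step 1: tokenize, cleanse, and remove stopwords.
    # Step 2: collect document statistics in a lookup table (here, document lengths).
    # Step 3: invert the document-term stream into term -> {doc_id: [positions]}.
    index = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = [w for w in text.lower().split() if w not in stop_list]
        doc_lengths[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index, doc_lengths

docs = {"d1": "the inverted index of a collection", "d2": "index construction steps"}
index, lengths = build_inverted_index(docs)
print(index["index"])   # {'d1': [1], 'd2': [0]}
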
Searching for relevant documents from the inverted index, given a set of query terms, is
generally a three-step process.
Vocabulary search. If the query comprises multiple terms, they are separated and
treated as independent terms. Each term is searched in the vocabulary. Various data
structures, like variations of B+-tree or hashing, may be
used to optimize the search process. Query terms may also be ordered in lexicographic order
to improve space efficiency.
Document information retrieval. The document information for each term is retrieved.
Manipulation of retrieved information. The document information vector for each
term obtained in step 2 is now processed further to incorporate various forms of query logic.
Various kinds of queries like prefix, range, context, and proximity queries are processed in
this step to construct the final result based on the document collections returned in step 2.

5.6 Evaluation Measures

Without proper evaluation techniques, one cannot compare and measure the relevance
of different retrieval models and IR systems in order to make improvements.
Evaluation techniques of IR systems measure the topical
relevance and user relevance. Topical relevance measures the extent to which the topic of a
result matches the topic of the query. Mapping one’s information need with “perfect” queries
is a cognitive task, and many users are not able to effectively form queries that would retrieve
results more suited to their information need. Also, since a major chunk of user queries are
informational in nature, there is no fixed set of right answers to show to the user. User
relevance is a term used to describe the “goodness” of a retrieved result with regard to the
user’s information need. User relevance includes other implicit factors, such as user
perception, context, timeliness, the user’s environment, and current task needs. Evaluating
user relevance may also involve subjective analysis and study of user retrieval tasks to
capture some of the properties of implicit factors involved in accounting for users’ bias for
judging performance.

In Web information retrieval, no binary classification decision is made on whether a


document is relevant or nonrelevant to a query (whereas the Boolean (or binary) retrieval
model uses this scheme, as we discussed in Section 27.2.1). Instead, a ranking of the
documents is produced for the user. Therefore, some evaluation measures focus on
comparing different rankings produced by IR systems. We discuss some of these measures
next.
1. Recall and Precision
Recall and precision metrics are based on the binary relevance assumption (whether
each document is relevant or nonrelevant to the query). Recall is defined as the number of
relevant documents retrieved by a search divided by the total number of existing relevant
documents. Precision is defined as the number of relevant documents retrieved by a search
divided by the total number of documents retrieved by that search. Comparing the documents
retrieved by a search against the set of relevant documents partitions the results into four
different sets of documents.

The notation for these four sets is as follows:


TP: true positive
FP: false positive
FN: false negative
TN: true negative
The terms true positive, false positive, false negative, and true negative are generally
used in any type of classification tasks to compare the given classification of an item with the
desired correct classification. Using the term hits for the documents that truly or “correctly”
match the user request, we can define:
Recall = |Hits|/|Relevant|
Precision = |Hits|/|Retrieved|
Recall and precision can also be defined in a ranked retrieval setting. Let d_i^q denote the
retrieved document at rank position i for query q. The recall at rank position i, denoted r(i), is
the fraction of all relevant documents that appear among d_1^q to d_i^q in the result set for the
query. Let S_i be the set of relevant documents from d_1^q to d_i^q, with cardinality |S_i|, and
let |D_q| be the total number of relevant documents for the query (so |S_i| ≤ |D_q|). Then:

Recall r(i) = |S_i| / |D_q|

The precision at rank position i, denoted p(i), is the fraction of documents from d_1^q to d_i^q
in the result set that are relevant:

Precision p(i) = |S_i| / i
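
A minimal sketch in Python computing r(i) and p(i) down a ranked result list (the ranking and
relevance judgments below are hypothetical):

def recall_precision_at_ranks(ranked_ids, relevant_ids):
    # Walk down the ranked list; hits is |S_i|, growing whenever a relevant document appears.
    hits = 0
    for i, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
        yield i, hits / len(relevant_ids), hits / i   # (i, r(i), p(i))

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
for rank, r, p in recall_precision_at_ranks(ranked, relevant):
    print(rank, round(r, 2), round(p, 2))
# At rank 4: recall = 2/3 ≈ 0.67 and precision = 2/4 = 0.5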

5.7 Web Search and Analytics

What is web analytics?


Web analytics is the process of analyzing the behavior of visitors to a website. This
involves tracking, reviewing and reporting data to measure web activity, including the use of
a website and its components, such as webpages, images and videos.

Web Search and Analysis


The emergence of the Web has brought millions of users to search for information,
which is stored in a very large number of active sites. To make this information accessible,
search engines such as Google and Yahoo! have to crawl and index these sites and document
collections in their index databases. Moreover, search engines have to regularly update their
indexes given the dynamic nature of the Web as new Web sites are created and current ones
are updated or deleted. Since there are many millions of pages available on the Web on
different topics, search engines have to apply many sophisticated techniques such as link
analysis to identify the importance of pages.
There are other types of search engines besides the ones that regularly crawl the Web
and create automatic indexes: these are human-powered, vertical search engines or
metasearch engines. These search engines are developed with the help of computer-assisted
systems to aid the curators with the process of assigning indexes. They consist of manually
created specialized Web directories that are hierarchically organized indexes to guide user
navigation to different resources on the Web. Vertical search engines are customized topic-
specific search engines that crawl and index a specific collection of documents on the Web
and provide search results from that specific collection. Metasearch engines are built on top
of search engines: they query different search engines simultaneously and aggregate and
provide search results from these sources.
1. Web Analysis and Its Relationship to Information Retrieval
In addition to browsing and searching the Web, another important activity closely
related to information retrieval is to analyze or mine information on the Web for new
information of interest. (We discuss mining of data from files and databases in Chapter 28.)
Application of data analysis techniques for discovery and analysis of useful information from
the Web is known as Web analysis. Over the past few years the World Wide Web has
emerged as an important repository of information for many day-to-day applications for
individual consumers, as well as a significant platform for e-commerce and for social
networking. These properties make it an interesting target for data analysis applications. The
Web mining and analysis field is an integration of a wide range of fields spanning
information retrieval, text analysis, natural language processing, data mining, machine
learning, and statistical analysis.

The goals of Web analysis are to improve and personalize search results relevance
and to identify trends that may be of value to various businesses and organizations. We
elaborate on these goals next.

Finding relevant information. People usually search for specific information on the Web
by entering keywords in a search engine or browsing information portals and using services.
Search services are constrained by search relevance problems since they have to map and
approximate the information need of millions of users as an a priori task.

Personalization of the information. Different people have different content and


presentation preferences. By collecting personal information and then generating user-
specific dynamic Web pages, the pages are personalized for the user.

Finding information of commercial value. This problem deals with finding interesting
patterns in users’ interests, behaviors, and their use of products and services, which may be of
commercial value. For example, businesses such as the automobile industry, clothing, shoes,
and cosmetics may improve their services by identifying patterns such as usage trends and
user preferences using various Web analysis techniques.
2. Searching the Web
The World Wide Web is a huge corpus of information, but locating resources that are
both high quality and relevant to the needs of the user is very difficult. The set of Web pages
taken as a whole has almost no unifying structure, with variability in authoring style and
content, thereby making it more difficult to precisely locate needed information. Index-based
search engines have been one of the prime tools by which users search for information on the
Web. Web search engines crawl the Web and create an index to the Web for searching
purposes.
3. Analyzing the Link Structure of Web Pages
The goal of Web structure analysis is to generate structural summary about the
Website and Web pages. It focuses on the inner structure of documents and deals with the
link structure using hyperlinks at the interdocument level. The structure and content of Web
pages are often combined for information retrieval by Web search engines. Given a collection
of interconnected Web documents, interesting and informative facts describing their
connectivity in the Web subset can be discovered. Web structure analysis is also used to
reveal the structure of Web pages, which helps with navigation and makes it possible to
compare/integrate Web page schemes. This aspect of Web structure analysis facilitates Web
document classification and clustering on the basis of structure.
4. Web Content Analysis
As mentioned earlier, Web content analysis refers to the process of discovering
useful information from Web content/data/documents. The Web content data consists of
unstructured data such as free text from electronically stored documents, semi-structured data
typically found as HTML documents with embedded image data, and more structured data
such as tabular data, and pages in HTML, XML, or other markup languages generated as
output from databases. More generally, the term Web content refers to any real data in the
Web page that is intended for the user accessing that page. This usually consists of but is not
limited to text and graphics.
5. Approaches to Web Content Analysis
The two main approaches to Web content analysis are
(1) agent based (IR view) and
(2) database based (DB view).

The agent-based approach involves the development of sophisticated artificial intelligence


systems that can act autonomously or semi-autonomously on behalf of a particular user, to
discover and process Web-based information.
The database-based approach aims to infer the structure of the Website or to transform a
Web site to organize it as a database so that better information management and querying on
the Web become possible. This approach of Web content analysis primarily tries to model the
data on the Web and integrate it so that more sophisticated queries than keyword-based
search can be performed.

6. Web Usage Analysis


Web usage analysis is the application of data analysis techniques to discover
usage patterns from Web data, in order to understand and better serve the needs of Web-
based applications. This activity does not directly contribute to information retrieval; but it is
important to improve or enhance the users’ search experience. Web usage data describes the
pattern of usage of Web pages, such as IP addresses, page references, and the date and time
of accesses for a user, user group, or an application. Web usage analysis typically consists of
three main phases: preprocessing, pattern discovery, and pattern analysis.
7. Practical Applications of Web Analysis
Web Analytics. The goal of web analytics is to understand and optimize the
performance of Web usage. This requires collecting, analyzing, and performance monitoring
of Internet usage data. On-site Web analytics measures the performance of a Website in a
commercial context. This data is typically compared against key performance indicators to
measure effectiveness or performance of the Website as a whole, and can be used to improve
a Website or improve the marketing strategies.
5.8 Current trends.

Trends in Information Retrieval


In this section we review a few concepts that are being considered in more recent
research work in information retrieval.
1. Faceted Search
Faceted Search is a technique that allows for integrated search and navigation
experience by allowing users to explore by filtering available information. This search
technique is used often in ecommerce Websites and applications enabling users to navigate a
multi-dimensional information space. Facets are generally used for handling three or more
dimensions of classification. This allows the faceted classification scheme to classify an
object in various ways based on different taxonomical criteria. For example, a Web page may
be classified in various ways: by content (airlines, music, news, ...); by use (sales,
information, registration, ...); by location; by language used (HTML, XML, ...) and in other
ways or facets. Hence, the object can be classified in multiple ways based on multiple
taxonomies.

A facet defines properties or characteristics of a class of objects. The properties should be


mutually exclusive and exhaustive. For example, a collection of art objects might be
classified using an artist facet (name of artist), an era facet (when the art was created), a type
facet (painting, sculpture, mural, ...), a country of origin facet, a media facet (oil, watercolor,
stone, metal, mixed media, ...), a collection facet (where the art resides), and so on.

Faceted search uses faceted classification that enables a user to navigate information along
multiple paths corresponding to different orderings of the facets. This contrasts with
traditional taxonomies in which the hierarchy of categories is fixed and unchanging.
University of California, Berkeley’s Flamenco project is one of the earlier examples of a
faceted search system.

2. Social Search
The traditional view of Web navigation and browsing assumes that a single user is
searching for information. This view contrasts with previous research by library scientists
who studied users’ information seeking habits. This research demonstrated that additional
individuals may be valuable information resources during information search by a single
user. More recently, research indicates that there is often direct user cooperation during Web-
based information search. Some studies report that significant segments of the user
population are engaged in explicit collaboration on joint search tasks on the Web. Active
collaboration by multiple parties also occurs in certain cases (for example, enterprise settings);
at other times, and perhaps for a majority of searches, users often interact with others
remotely, asynchronously, and even involuntarily and implicitly.

Socially enabled online information search (social search) is a new phenomenon


facilitated by recent Web technologies. Collaborative social search involves different ways
for active involvement in search-related activities such as co-located search, remote
collaboration on search tasks, use of social network for search, use of expertise networks,
involving social data mining or collective intelligence to improve the search process and even
social interactions to facilitate information seeking and sense making. This social search
activity may be done synchronously, asynchronously, co-located or in remote shared
workspaces. Social psychologists have experimentally validated that the act of social
discussions has facilitated cognitive performance. People in social groups can provide
solutions (answers to questions), pointers to databases or to other people (meta-knowledge),
validation and legitimization of ideas, and can serve as memory aids and help with problem
reformulation. Guided participation is a process in which people co-construct knowledge in
concert with peers in their community. Information seeking is mostly a solitary activity on
the Web today. Some recent work on collaborative search reports several interesting findings
and the potential of this technology for better information access.

3. Conversational Search
Conversational Search (CS) is an interactive and collaborative information
finding interaction. The participants engage in a conversation and perform a social search
activity that is aided by intelligent agents. The collaborative search activity helps the agent
learn about conversations with interactions and feedback from participants. It uses the
semantic retrieval model with natural language understanding to provide the users with faster
and relevant search results. It moves search from being a solitary activity to being a more
participatory activity for the user. The search agent performs multiple tasks of finding
relevant information and connecting the users together; participants provide feedback to the
agent during the conversations that allows the agent to perform better.
