
MC4202 ADVANCED DATABASE TECHNOLOGY

UNIT – I DISTRIBUTED DATABASES

What Are Distributed Systems?

A distributed system is a computing environment in which various components are spread
across multiple computers (or other computing devices) on a network. These devices split up
the work, coordinating their efforts to complete the job more efficiently than if a single device
had been responsible for the task.

How do distributed systems communicate with each other?

Distributed System Architecture
Distributed systems must have a network that connects all components (machines, hardware,
or software) together so they can transfer messages to communicate with each other. That
network could be connected with an IP address or use cables or even a circuit board.

One of the major disadvantages of distributed systems is the complexity of the underlying
hardware and software arrangements. This arrangement is generally known as a topology or
an overlay. This is what provides the platform for distributed nodes to communicate and
coordinate with each other as needed.

1. What is a Distributed System
2. Distributed System Architectures
3. Architectural Styles
4. System Level Architecture
5. A Comparison Between Client Server and Peer to Peer Architectures
6. Middleware in Distributed Applications
7. Centralized vs. Decentralized Architectures
8. Summary on Structured and Unstructured P2P Systems

1) What is a Distributed System?
A distributed system is a software system that interconnects a collection of heterogeneous,
independent computers, where coordination and communication between computers happen
only through message passing, with the intention of working towards a common goal. The
idea behind distributed systems is to provide the viewpoint of a single coherent system to the
outside world. The set of independent computers or nodes is interconnected through a Local
Area Network (LAN) or a Wide Area Network (WAN).

2) Distributed System Architectures
In this blog, I would like to talk about the available Distributed System architectures that we
see today and how they are being utilized in our day to day applications. Distributed system
architectures are bundled up with components and connectors. Components can be individual
nodes or important components in the architecture, whereas connectors are the ones that
connect each of these components.
 Component: A modular unit with well-defined interfaces; replaceable; reusable
 Connector: A communication link between modules which mediates coordination or
cooperation among components
So the idea behind distributed architectures is to have these components presented on
different platforms, where components can communicate with each other over a
communication network in order to achieve specific objectives.

3) Architectural Styles
There are four different architectural styles, plus the hybrid architecture, when it comes to
distributed systems. The basic idea is to organize logically different components and
distribute them over the various machines.
 Layered Architecture
 Object Based Architecture
 Data-centered Architecture
 Event Based Architecture
 Hybrid Architecture

Layered Architecture
The layered architecture separates layers of components from each other, giving it a much
more modular approach. A well-known example of this is the OSI model, which incorporates a
layered architecture when interacting with each of the components. Each interaction is
sequential: a layer will contact the adjacent layer, and this process continues until the request
has been catered to. In certain cases, the implementation can be made so that some layers are
skipped, which is called cross-layer coordination. Through cross-layer coordination, one can
obtain better results due to the increase in performance.
The layers on the bottom provide a service to the layers on the top. The request flows from top
to bottom, whereas the response is sent from bottom to top. The advantage of using this
approach is that the calls always follow a predefined path, and that each layer can easily be
replaced or modified without affecting the entire architecture.

Object Based Architecture
This architecture style is based on a loosely coupled arrangement of objects. It has no specific
structure like layers, and there is no sequential set of steps that needs to be carried out for a
given call. Each of the components is referred to as an object, and each object can interact
with other objects through a given connector or interface. Interactions are much more direct,
where all the different components can interact directly with other components through a
direct method call.

As shown in the above image, communication between objects happens as method invocations.
These are generally called Remote Procedure Calls (RPC). Some popular examples are Java
RMI, Web Services and REST API calls. This style has the following properties.
 This architecture style is less structured.
 component = object
 connector = RPC or RMI
When decoupling these processes in space, people wanted the components to be anonymous
and replaceable, and the synchronization process needed to be asynchronous, which has led to
Data Centered Architectures and Event Based Architectures.
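As a small illustrative sketch (not part of the source notes), the following Python fragment
mimics the object-based style with the standard library's XML-RPC modules instead of Java
RMI; the Calculator class, port 8000 and the add method are assumptions made only for this
example.

from xmlrpc.server import SimpleXMLRPCServer   # server side of the remote object

class Calculator:
    def add(self, a, b):
        return a + b

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_instance(Calculator())   # expose the object's methods to remote callers
# server.serve_forever()                 # uncomment to actually run the server

# Client side: the connector is an RPC proxy, and the call reads like a local method call.
# import xmlrpc.client
# calc = xmlrpc.client.ServerProxy("http://localhost:8000")
# print(calc.add(2, 3))                  # executed on the remote object, returns 5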
Data Centered Architecture
 As the title suggests, this architecture is based on a data center, where the primary
communication happens via a central data repository.
 This common repository can be either active or passive.
 This is more like a producer consumer problem.
 The producers produce items to a common data store, and the consumers can request
data from it.
 This common repository could even be a simple database. But the idea is that the
communication between objects happens through this shared common storage.
 This supports different components (or objects) by providing a persistent storage space
for those components (such as a MySQL database).
 All the information related to the nodes in the system is stored in this persistent
storage. In event-based architectures, data is only sent and received by those
components that have already subscribed.
Some popular examples are distributed file systems, producer-consumer systems, and web
based data services.

Event Based Architecture
 The entire communication in this kind of a system happens through events. When an
event is generated, it will be sent to the bus system. With this, everyone else will be
notified that such an event has occurred. So, if anyone is interested, that node can pull
the event from the bus and use it. Sometimes these events could be data, or even URLs
to resources, so the receiver can access whatever information is given in the event and
process it accordingly.
 These events occasionally carry data. An advantage of this architectural style is that
components are loosely coupled, so it is easy to add, remove and modify components in
the system.
 One major advantage is that these heterogeneous components can contact the bus
through any communication protocol. An ESB, or a specific bus, has the capability to
handle any type of incoming request and process it accordingly.
This architectural style is based on the publisher-subscriber architecture. Between each node
there is no direct communication or coordination. Instead, objects which are subscribed to the
service communicate through the event bus.
The event based architecture supports several communication styles:
 Publisher-subscriber
 Broadcast
 Point-to-Point
The major advantage of this architecture is that the components are decoupled in space -
loosely coupled.
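To make the publisher-subscriber idea concrete, here is a minimal illustrative Python sketch
of an in-process event bus; the topic name and the handlers are invented for the example, and
a real deployment would use a message broker or an ESB rather than a single process.

from collections import defaultdict

class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # Components register interest in a topic; they never call each other directly.
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The publisher only knows the bus; every subscribed component gets notified.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
bus.subscribe("order.created", lambda e: print("billing saw", e))
bus.subscribe("order.created", lambda e: print("shipping saw", e))
bus.publish("order.created", {"order_id": 42})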
4) System Level Architecture
The two major system level architectures that we use today are Client-server and Peer-to-
peer (P2P). We use these two kinds of services in our day to day lives, but the difference
between these two is often misinterpreted.
Client Server Architecture
The client server architecture has two major components:
 The client and
 The server.
 The Server is where all the processing, computing and data handling happens,
whereas the Client is where the user can access the services and resources given by the
Server (remote server).
 The clients can make requests to the Server, and the Server will respond accordingly.
 Generally, there is only one server that handles the remote side. But to be on the safe
side, we do use multiple servers with load balancing techniques.


As one common design feature, the Client Server architecture has a centralized security
database. This database contains security details like credentials and access details. Users
can't log in to a server without the security credentials. So, it makes this architecture a bit
more stable and secure than Peer to Peer. The stability comes from the fact that the security
database can allow resource usage in a much more meaningful way. But on the other hand,
the system might slow down, as the server can only handle a limited amount of workload at a
given time.
Advantages:
 Easier to Build and Maintain
 Better Security
 Stable
Disadvantages:
 Single point of failure
 Less scalable
Peer to Peer (P2P)
The general idea behind peer to peer is that there is no central control in a distributed
system. The basic idea is that each node can be either a client or a server at a given time. If
the node is requesting something, it can be known as a client, and if some node is providing
something, it can be known as a server. In general, each node is referred to as a Peer.

In this network, any new node has to first join the network. After joining in, it can either
request a service or provide a service. The initiation phase of a node (the joining of a node) can
vary according to the implementation of the network. There are two ways in which a new node
can get to know what other nodes are providing.
 Centralized Lookup Server - The new node has to register with the centralized look
up server and mention the services it will be providing on the network. So, whenever
you want to have a service, you simply have to contact the centralized look up server
and it will direct you to the relevant service provider.
 Decentralized System - A node desiring specific services must broadcast and ask
every other node in the network, so that whoever is providing the service will respond.
5) A Comparison between Client Server and Peer to Peer Architectures

6) Middleware in Distributed Applications
If we look at distributed systems today, they lack uniformity and consistency. Various
heterogeneous devices have taken over the world, and distributed systems cater to all these
devices in a common way. One way distributed systems can achieve uniformity is through a
common layer to support the underlying hardware and operating systems. This common layer
is known as middleware, and it provides services beyond what is already provided by the
operating systems, to enable the various features and components of a distributed system to
enhance its functionality. This layer provides certain data structures and operations that
allow processes and users on far-flung machines to inter-operate and work together in a
consistent way. The image given below depicts the usage of a middleware to inter-connect
various kinds of nodes together.

7) Centralized vs Decentralized Architectures
The two main structures that we see within distributed system overlays are Centralized and
Decentralized architectures. The centralized architecture can be explained by a simple client-
server architecture where the server acts as a central unit. This can also be considered as a
centralized look up table with the following characteristics:
 Low overhead
 Single point of failure
 Easy to Track
 Additional Overhead

When it comes to distributed systems, we are more interested in studying the overlay and
unstructured network topologies that we can see today. In general, the peer to peer systems
that we see today can be separated into three unique sections.
 Structured P2P: nodes are organized following a specific distributed data structure
 Unstructured P2P: nodes have randomly selected neighbors
 Hybrid P2P: some nodes are appointed special functions in a well-organized fashion
Structured P2P Architecture
The meaning of the word structured is that the system already has a predefined structure that
other nodes will follow. Every structured network inherently suffers from poor scalability, due
to the need for structure maintenance. In general, the nodes in a structured overlay network
are formed in a logical ring, with nodes being connected to this ring. In this ring, certain
nodes are responsible for certain services.
A common approach that can be used to tackle the coordination between nodes is to use
distributed hash tables (DHTs). A traditional hash function converts a unique key into a hash
value that will represent an object in the network. The hash function value is used to insert an
object in the hash table and to retrieve it.
In a DHT, each key is assigned to a unique hash, where the random hash value needs to be of a
very large address space in order to ensure uniqueness. A mapping function is used to assign
objects to nodes based on the hash function value. A lookup based on the hash function value
returns the network address of the node that stores the requested object.
 Hash Function: takes a key and produces a unique hash value
 Mapping Function: maps the hash value to a specific node in the system
 Lookup table: returns the network address of the node represented by the unique hash
value
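As a minimal illustrative sketch of these three pieces (the node addresses and the use of SHA-1
over a sorted ring are assumptions made for the example, not details taken from the source),
the following Python fragment shows how a key can be hashed, mapped and looked up:

import hashlib
import bisect

class SimpleDHT:
    # Toy DHT: places nodes on a logical ring and maps each key to a node.
    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        # Hash function: takes a key and produces a unique hash value.
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        # Mapping function + lookup table: the node at the first ring position
        # at or after hash(key) is responsible for the object.
        positions = [p for p, _ in self.ring]
        i = bisect.bisect_left(positions, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

dht = SimpleDHT(["10.0.0.1:5000", "10.0.0.2:5000", "10.0.0.3:5000"])
print(dht.lookup("employee:1042"))   # network address of the node storing this object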

Unstructured P2P Systems
There is no specific structure in these systems, hence the name "unstructured networks". Due
to this reason, the scalability of unstructured P2P systems is very high. These systems rely
on randomized algorithms for constructing an overlay network. Unlike in structured P2P
systems, there is no specific path to a certain node. It is generally random, where every
unstructured system tries to maintain a random path. Due to this reason, finding a certain
file or node is never guaranteed in unstructured systems.
The basic principle is that each node is required to randomly select another node and contact
it.
 Let each peer maintain a partial view of the network, consisting of n other nodes
 Each node P periodically selects a node Q from its partial view
 P and Q exchange information and exchange members from their respective partial
views
Hybrid P2P Systems
Hybrid systems are often based on both client server architectures and P2P networks. A
famous example is BitTorrent, which we use every day. The torrent search engines provide a
client server architecture, where the trackers provide a structured P2P overlay. The rest of the
nodes, which are also known as leechers and seeders, become the unstructured overlay of the
network, allowing it to scale as needed.

Summary on Structured and Unstructured P2P Systems


DISTRIBUTED DATABASE CONCEPTS

WHAT IS DISTRIBUTED DATABASE?

A Distributed database is defined as a logically related collection of data that is shared,
which is physically distributed over a computer network on different sites. The Distributed
DBMS is defined as the software that allows for the management of the distributed database
and makes the distributed data available to the users.

A database is an ordered collection of related data. A DBMS is a software package to work
upon a database.
The three topics covered are database schemas, types of databases and operations on
databases.

Database and Database Management System

A database is an ordered collection of related data that is built for a specific purpose. A
database may be organized as a collection of multiple tables, where a table represents a real
world element or entity. Each table has several different fields that represent the
characteristic features of the entity.
For example, a company database may include tables for projects, employees, departments,
products and financial records. The fields in the Employee table may be Name, Company_Id,
Date_of_Joining, and so forth.
A database management system is a collection of programs that enables creation and
maintenance of a database. DBMS is available as a software package that facilitates
definition, construction, manipulation and sharing of data in a database. Definition of a
database includes description of the structure of a database. Construction of a database
involves actual storing of the data in any storage medium. Manipulation refers to
retrieving information from the database, updating the database and generating reports.
Sharing of data facilitates data to be accessed by different users or programs.
Examples of DBMS Application Areas
 Automatic Teller Machines
 Train Reservation System
 Employee Management System
 Student Information System
Examples of DBMS Packages
 MySQL
 Oracle
 SQL Server
 dBASE
 FoxPro
 PostgreSQL, etc.

Database Schemas

A database schema is a description of the database which is specified during database design
and subject to infrequent alterations. It defines the organization of the data, the relationships
among them, and the constraints associated with them.
Databases are often represented through the three-schema architecture or ANSI/SPARC
architecture. The goal of this architecture is to separate the user application from the
physical database. The three levels are −
 Internal Level having Internal Schema − It describes the physical structure, details
of internal storage and access paths for the database.


 Conceptual Level having Conceptual Schema − It describes the structure of the
whole database while hiding the details of physical storage of data. This illustrates the
entities, attributes with their data types and constraints, user operations and
relationships.
 External or View Level having External Schemas or Views − It describes the
portion of a database relevant to a particular user or a group of users while hiding the
rest of the database.

Types of DBMS

There are four types of DBMS.

1. Hierarchical DBMS
In hierarchical DBMS, the relationships among data in the database are established so that
one data element exists as a subordinate of another. The data elements have parent-child
relationships and are modelled using the “tree” data structure. These are very fast and simple.

2. Network DBMS
Network DBMS is one where the relationships among data in the database are of type many-
to-many in the form of a network. The structure is generally complicated due to the existence
of numerous many-to-many relationships. Network DBMS is modelled using the “graph” data
structure.

3. Relational DBMS
In relational databases, the database is represented in the form of relations. Each relation
models an entity and is represented as a table of values. In the relation or table, a row is called
a tuple and denotes a single record. A column is called a field or an attribute and denotes a
characteristic property of the entity. RDBMS is the most popular database management
system.
For example − A Student Relation −

4. Object Oriented DBMS
Object-oriented DBMS is derived from the model of the object-oriented programming
paradigm. They are helpful in representing both consistent data as stored in databases, as
well as transient data, as found in executing programs. They use small, reusable elements
called objects. Each object contains a data part and a set of operations which works upon the
data. The object and its attributes are accessed through pointers instead of being stored in
relational table models.
For example − A simplified Bank Account object-oriented database −

Distributed DBMS

A distributed database is a set of interconnected databases that is distributed over the
computer network or internet. A Distributed Database Management System (DDBMS)
manages the distributed database and provides mechanisms so as to make the databases
transparent to the users. In these systems, data is intentionally distributed among multiple
nodes so that all computing resources of the organization can be optimally used.

Operations on DBMS

The four basic operations on a database are Create, Retrieve, Update and Delete.
 CREATE database structure and populate it with data − Creation of a database
relation involves specifying the data structures, data types and the constraints of the
data to be stored.
Example − SQL command to create a student table −
CREATE TABLE STUDENT (
ROLL INTEGER PRIMARY KEY,
NAME VARCHAR2(25),
YEAR INTEGER,
STREAM VARCHAR2(10)
);
 Once the data format is defined, the actual data is stored in accordance with the
format in some storage medium.
Example − SQL command to insert a single tuple into the student table −
INSERT INTO STUDENT ( ROLL, NAME, YEAR, STREAM)
VALUES ( 1, 'ANKIT JHA', 1, 'COMPUTER SCIENCE');
 RETRIEVE information from the database − Retrieving information generally involves
selecting a subset of a table or displaying data from the table after some computations
have been done. It is done by querying upon the table.
Example − To retrieve the names of all students of the Computer Science stream, the
following SQL query needs to be executed −
SELECT NAME FROM STUDENT
WHERE STREAM = 'COMPUTER SCIENCE';
 UPDATE information stored and modify database structure − Updating a table
involves changing old values in the existing table’s rows with new values.
Example − SQL command to change stream from Electronics to Electronics and
Communications −
UPDATE STUDENT
SET STREAM = 'ELECTRONICS AND COMMUNICATIONS'
WHERE STREAM = 'ELECTRONICS';
 Modifying the database means changing the structure of the table. However, modification
of the table is subject to a number of restrictions.
Example − To add a new field or column, say address, to the Student table, we use the
following SQL command −
ALTER TABLE STUDENT
ADD ( ADDRESS VARCHAR2(50) );
 DELETE information stored or delete a table as a whole − Deletion of specific
information involves removal of selected rows from the table that satisfy certain
conditions.
Example − To delete all students who are in the 4th year when they are passing out, we
use the SQL command −
DELETE FROM STUDENT
WHERE YEAR = 4;
 Alternatively, the whole table may be removed from the database.
Example − To remove the student table completely, the SQL command used is −
DROP TABLE STUDENT;

A distributed database is a collection of multiple interconnected databases, which are
spread physically across various locations and communicate via a computer network.

Features

 Databases in the collection are logically interrelated with each other. Often they
represent a single logical database.
 Data is physically stored across multiple sites. Data in each site can be managed by a
DBMS independent of the other sites.
 The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
 A distributed database is not a loosely connected file system.
 A distributed database incorporates transaction processing, but it is not synonymous
with a transaction processing system.

Distributed Database Management System

A distributed database management system (DDBMS) is a centralized software system that
manages a distributed database in a manner as if it were all stored in a single location.

Features
 It is used to create, retrieve, update and delete distributed databases.
 It synchronizes the database periodically and provides access mechanisms by virtue of
which the distribution becomes transparent to the users.
 It ensures that the data modified at any site is universally updated.
 It is used in application areas where large volumes of data are processed and accessed
by numerous users simultaneously.
 It is designed for heterogeneous database platforms.
 It maintains confidentiality and data integrity of the databases.

Factors Encouraging DDBMS

The following factors encourage moving over to DDBMS −
 Distributed Nature of Organizational Units − Most organizations in the current
times are subdivided into multiple units that are physically distributed over the globe.
Each unit requires its own set of local data. Thus, the overall database of the
organization becomes distributed.
 Need for Sharing of Data − The multiple organizational units often need to
communicate with each other and share their data and resources. This demands
common databases or replicated databases that should be used in a synchronized
manner.
 Support for Both OLTP and OLAP − Online Transaction Processing (OLTP) and
Online Analytical Processing (OLAP) work upon diversified systems which may have
common data. Distributed database systems aid both of these by providing synchronized
data.
 Database Recovery − One of the common techniques used in DDBMS is replication of
data across different sites. Replication of data automatically helps in data recovery if
the database in any site is damaged. Users can access data from other sites while the
damaged site is being reconstructed. Thus, database failure may become almost
inconspicuous to users.
 Support for Multiple Application Software − Most organizations use a variety of
application software, each with its specific database support. DDBMS provides a uniform
functionality for using the same data among different platforms.

Advantages of Distributed Databases


Following are the advantages of distributed databases over centralized databases.
Modular Development − If the system needs to be expanded to new locations or new units,
in centralized database systems the action requires substantial effort and disruption to the
existing functioning. However, in distributed databases, the work simply requires adding new
computers and local data to the new site and finally connecting them to the distributed
system, with no interruption in current functions.
More Reliable − In case of database failures, the total system of centralized databases comes
to a halt. However, in distributed systems, when a component fails, the functioning of the
system continues, possibly at reduced performance. Hence DDBMS is more reliable.
Better Response − If data is distributed in an efficient manner, then user requests can be met
from local data itself, thus providing a faster response. On the other hand, in centralized
systems, all queries have to pass through the central computer for processing, which increases
the response time.
Lower Communication Cost − In distributed database systems, if data is located locally or
where it is mostly used, then the communication costs for data manipulation can be
minimized. This is not feasible in centralized systems.

Adversities of Distributed Databases

Following are some of the adversities associated with distributed databases.
 Need for complex and expensive software − DDBMS demands complex and often
expensive software to provide data transparency and co-ordination across the several
sites.
 Processing overhead − Even simple operations may require a large number of
communications and additional calculations to provide uniformity in data across the
sites.
 Data integrity − The need for updating data in multiple sites poses problems of data
integrity.
 Overheads for improper data distribution − Responsiveness of queries is largely
dependent upon proper data distribution. Improper data distribution often leads to very
slow response to user requests.

Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and heterogeneous
distributed database environments, each with further sub-divisions, as shown in the following
illustration.

Homogeneous Distributed Databases
In a homogeneous distributed database, all the sites use identical DBMS and operating
systems. Its properties are −
 The sites use very similar software.
 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to process user
requests.
 The database is accessed through a single interface as if it is a single database.
Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database −
 Autonomous − Each database is independent and functions on its own. They are
integrated by a controlling application and use message passing to share data updates.
 Non-autonomous − Data is distributed across the homogeneous nodes and a central
master DBMS co-ordinates data updates across the sites.
Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different operating systems,
DBMS products and data models. Its properties are −
 Different sites use dissimilar schemas and software.
 The system may be composed of a variety of DBMSs like relational, network,
hierarchical or object oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites, and so there is limited co-operation in processing
user requests.
Types of Heterogeneous Distributed Databases
 Federated − The heterogeneous database systems are independent in nature and are
integrated together so that they function as a single database system.
 Un-federated − The database systems employ a central coordinating module through
which the databases are accessed.

Distributed DBMS Architectures

DDBMS architectures are generally developed depending on three parameters −
 Distribution − It states the physical distribution of data across the different sites.
 Autonomy − It indicates the distribution of control of the database system and the
degree to which each constituent DBMS can operate independently.
 Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system
components and databases.

Architectural Models

Some of the common architectural models are −


 Client - Server Architecture for DDBMS
 Peer - to - Peer Architecture for DDBMS
 Multi - DBMS Architecture


Client - Server Architecture for DDBMS
This is a two-level architecture where the functionality is divided into servers and clients. The
server functions primarily encompass data management, query processing, optimization and
transaction management. Client functions mainly include the user interface. However, clients
also have some functions like consistency checking and transaction management.

The two different client - server architectures are −
 Single Server Multiple Client
 Multiple Server Multiple Client (shown in the following diagram)

Peer-to-Peer Architecture for DDBMS

In these systems, each peer acts both as a client and a server for imparting database services.
The peers share their resources with other peers and co-ordinate their activities.
This architecture generally has four levels of schemas −
 Global Conceptual Schema − Depicts the global logical view of data.
 Local Conceptual Schema − Depicts logical data organization at each site.
 Local Internal Schema − Depicts physical data organization at each site.
 External Schema − Depicts user view of data.

Multi - DBMS Architectures

This is an integrated database system formed by a collection of two or more autonomous
database systems.
Multi-DBMS can be expressed through six levels of schemas −
 Multi-database View Level − Depicts multiple user views comprising subsets of the
integrated distributed database.
 Multi-database Conceptual Level − Depicts the integrated multi-database that
comprises global logical multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across different sites
and multi-database to local data mapping.
 Local database View Level − Depicts the public view of local data.
 Local database Conceptual Level − Depicts local data organization at each site.
 Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for multi-DBMS −
 Model with multi-database conceptual level.
 Model without multi-database conceptual level.

DISTRIBUTED DATA STORAGE

Distributed database storage is managed in two ways: replication and fragmentation. In
database replication, the systems store copies of data on different sites. If an entire database
is available on multiple sites, it is a fully redundant database.

Distributed databases are used for horizontal scaling, and they are designed to meet the
workload requirements without having to make changes in the database application or
vertically scale a single machine.

Distributed databases resolve various issues, such as availability, fault tolerance,
throughput, latency, scalability, and many other problems that can arise from using a single
machine and a single database.

Distributed Database Definition


A distributed database represents multiple interconnected databases spread out across
several sites connected by a network. Since the databases are all connected, they appear as a
single database to the users.

Distributed databases utilize multiple nodes. They scale horizontally and develop a distributed
system. More nodes in the system provide more computing power, offer greater availability,
and resolve the single point of failure issue.

Different parts of the distributed database are stored in several physical locations, and the
processing requirements are distributed among processors on multiple database nodes.

A centralized distributed database management system (DDBMS) manages the distributed
data as if it were stored in one physical location. DDBMS synchronizes all data operations
among databases and ensures that the updates in one database automatically reflect on
databases in other sites.

Distributed Database Features

Some general features of distributed databases are:


 Location independency - Data is physically stored at multiple sites and managed by
an independent DDBMS.
 Distributed query processing - Distributed databases answer queries in a
distributed environment that manages data at multiple sites. High-level queries are
transformed into a query execution plan for simpler management.
 Distributed transaction management - Provides a consistent distributed database
through commit protocols, distributed concurrency control techniques, and distributed
recovery methods in case of many transactions and failures.
 Seamless integration - Databases in a collection usually represent a single logical
database, and they are interconnected.
 Network linking - All databases in a collection are linked by a network and
communicate with each other.
 Transaction processing - Distributed databases incorporate transaction processing,
which is a program including a collection of one or more database operations.
Transaction processing is an atomic process that is either entirely executed or not at
all.

Distributed Database Types : There are two types of distributed databases:

 Homogenous
 Heterogenous

Homogeneous :

A homogenous distributed database is a network of identical databases stored on multiple
sites. The sites have the same operating system, DDBMS, and data structure, making them
easily manageable.

Homogenous databases allow users to access data from each of the databases seamlessly.

The following diagram shows an example of a homogeneous database:

Heterogeneous :

A heterogeneous distributed database uses different schemas, operating systems, DDBMS,
and different data models.

In the case of a heterogeneous distributed database, a particular site can be completely
unaware of other sites, causing limited cooperation in processing user requests. This limitation
is why translations are required to establish communication between sites.

The following diagram shows an example of a heterogeneous database:

Distributed Database Storage


Distributed database storage is managed in two ways:

 Replication
 Fragmentation

Replication

In database replication, the systems store copies of data on different sites. If an entire
database is available on multiple sites, it is a fully redundant database.

The advantage of database replication is that it increases data availability on different sites
and allows for parallel query requests to be processed.

However, database replication means that data requires constant updates and
synchronization with other sites to maintain an exact database copy. Any changes made on
one site must be recorded on other sites, or else inconsistencies occur.

Constant updates cause a lot of server overhead and complicate concurrency control, as a lot
of concurrent queries must be checked in all available sites.

Fragmentation

When it comes to fragmentation of distributed database storage, the relations are
fragmented, which means they are split into smaller parts. Each of the fragments is stored
on a different site, where it is required.

The prerequisite for fragmentation is to make sure that the fragments can later be
reconstructed into the original relation without losing data.

The advantage of fragmentation is that there are no data copies, which prevents data
inconsistency.

There are two types of fragmentation:

 Horizontal fragmentation - The relation schema is fragmented into groups of rows,
and each group of tuples is assigned to one fragment.
 Vertical fragmentation - The relation schema is fragmented into smaller schemas,
and each fragment contains a common candidate key to guarantee a lossless join.
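The following is a small illustrative Python sketch of both kinds of fragmentation and of
lossless reconstruction; the EMPLOYEE rows, the column names and the site names are
invented for the example.

employee = [
    {"emp_id": 1, "name": "Asha",  "dept": "CS",  "site": "CHENNAI"},
    {"emp_id": 2, "name": "Ravi",  "dept": "ECE", "site": "MADURAI"},
    {"emp_id": 3, "name": "Meena", "dept": "CS",  "site": "CHENNAI"},
]

# Horizontal fragmentation: groups of rows, one fragment per site.
chennai_fragment = [r for r in employee if r["site"] == "CHENNAI"]
madurai_fragment = [r for r in employee if r["site"] == "MADURAI"]
reconstructed = chennai_fragment + madurai_fragment            # union of the fragments

# Vertical fragmentation: smaller schemas, each keeping the candidate key emp_id.
personal = [{"emp_id": r["emp_id"], "name": r["name"]} for r in employee]
work = [{"emp_id": r["emp_id"], "dept": r["dept"], "site": r["site"]} for r in employee]
# A join on emp_id (the common candidate key) rebuilds the original relation losslessly.
rejoined = [{**p, **w} for p in personal for w in work if p["emp_id"] == w["emp_id"]]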

Distributed Database Advantages and Disadvantages

Advantages                      Disadvantages
Modular development             Costly software
Reliability                     Large overhead
Lower communication costs       Data integrity
Better response                 Improper data distribution

 What is a Distributed Transaction?

A distributed transaction is a set of operations on data that is performed
across two or more data repositories (especially databases). It is typically
coordinated across separate nodes connected by a network, but may also span
multiple databases on a single server.

 There are two possible outcomes: 1) all operations successfully complete, or 2) none
of the operations are performed at all due to a failure somewhere in the system. In
the latter case, if some work was completed prior to the failure, that work will be
reversed to ensure no net work was done. This type of operation is in compliance
with the “ACID” (atomicity-consistency-isolation-durability) principles of databases
that ensure data integrity. ACID is most commonly associated with transactions on a
single database server, but distributed transactions extend that guarantee across
multiple databases.
 The operation known as a “two-phase commit” (2PC) is a form of a distributed
transaction. “XA transactions” are transactions using the XA protocol, which is one
implementation of a two-phase commit operation.


A distributed transaction spans multiple databases and guarantees data integrity.

How Do Distributed Transactions Work?

 Distributed transactions have the same processing completion requirements as
regular database transactions, but they must be managed across multiple resources,
making them more challenging to implement for database developers. The multiple
resources add more points of failure, such as the separate software systems that run
the resources (e.g., the database software), the extra hardware servers, and network
failures. This makes distributed transactions susceptible to failures, which is why
safeguards must be put in place to retain data integrity.
 For a distributed transaction to occur, transaction managers coordinate the
resources (either multiple databases or multiple nodes of a single database). The
transaction manager can be one of the data repositories that will be updated as part
of the transaction, or it can be a completely independent separate resource that is
only responsible for coordination. The transaction manager decides whether to
commit a successful transaction or roll back an unsuccessful transaction, the latter of
which leaves the database unchanged.
 First, an application requests the distributed transaction to the transaction
manager. The transaction manager then branches to each resource, which will have
its own “resource manager” to help it participate in distributed transactions.
Distributed transactions are often done in two phases to safeguard against partial
updates that might occur when a failure is encountered. The first phase involves
acknowledging intent to commit, or a “prepare-to-commit” phase. After all resources
acknowledge, they are then asked to run a final commit, and then the transaction is
completed.

COMMIT PROTOCOLS

In a local database system, for committing a transaction, the transaction manager has to only
convey the decision to commit to the recovery manager. However, in a distributed system, the
transaction manager should convey the decision to commit to all the servers in the various
sites where the transaction is being executed and uniformly enforce the decision. When
processing is complete at each site, it reaches the partially committed transaction state and
waits for all other transactions to reach their partially committed states. When it receives the
message that all the sites are ready to commit, it starts to commit. In a distributed system,
either all sites commit or none of them does.
The different distributed commit protocols are −
 One-phase commit
 Two-phase commit
 Three-phase commit

 Distributed One-phase Commit

Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a
controlling site and a number of slave sites where the transaction is being executed. The steps
in distributed commit are −
 After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site.
 The slaves wait for a “Commit” or “Abort” message from the controlling site. This waiting
time is called the window of vulnerability.
 When the controlling site receives the “DONE” message from each slave, it makes a
decision to commit or abort. This is called the commit point. Then, it sends this message
to all the slaves.
 On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.

 Distributed Two-phase Commit

Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The
steps performed in the two phases are as follows −

Phase 1: Prepare Phase
 After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site. When the controlling site has received the “DONE” message from all
slaves, it sends a “Prepare” message to the slaves.
 The slaves vote on whether they still want to commit or not. If a slave wants to commit,
it sends a “Ready” message.
 A slave that does not want to commit sends a “Not Ready” message. This may happen
when the slave has conflicting concurrent transactions or there is a timeout.

Phase 2: Commit/Abort Phase
 After the controlling site has received the “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the
controlling site.
o When the controlling site receives the “Commit ACK” message from all the slaves, it
considers the transaction as committed.
 After the controlling site has received the first “Not Ready” message from any slave −

o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send an “Abort ACK” message to the
controlling site.
o When the controlling site receives the “Abort ACK” message from all the slaves, it
considers the transaction as aborted.

 Distributed Three-phase Commit

The steps in distributed three-phase commit are as follows −

Phase 1: Prepare Phase
The steps are the same as in distributed two-phase commit.

Phase 2: Prepare to Commit Phase
 The controlling site issues an “Enter Prepared State” broadcast message.
 The slave sites vote “OK” in response.

Phase 3: Commit / Abort Phase
The steps are the same as in two-phase commit, except that the “Commit ACK”/“Abort ACK”
message is not required.
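The following is an illustrative Python sketch of the two-phase commit decision logic at the
controlling site, as described above; the Slave class and its vote/acknowledgement methods are
simplifications invented for this example (a real implementation must also log each step and
handle timeouts).

def two_phase_commit(slaves):
    # Phase 1 (prepare): after the DONE messages, ask every slave site to vote.
    votes = [slave.prepare() for slave in slaves]         # "Ready" or "Not Ready"
    if all(vote == "Ready" for vote in votes):
        # Phase 2 (commit): send Global Commit and collect the Commit ACKs.
        acks = [slave.global_commit() for slave in slaves]
        return "committed" if all(acks) else "unknown"
    # Any "Not Ready" vote forces a Global Abort followed by Abort ACKs.
    for slave in slaves:
        slave.global_abort()
    return "aborted"

class Slave:
    def __init__(self, ready=True):
        self.ready = ready
    def prepare(self):
        return "Ready" if self.ready else "Not Ready"
    def global_commit(self):
        return True       # Commit ACK
    def global_abort(self):
        return True       # Abort ACK

print(two_phase_commit([Slave(), Slave()]))          # committed
print(two_phase_commit([Slave(), Slave(False)]))     # aborted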

CONCURRENCY CONTROL
Concurrency control in a distributed system is achieved by a program called the scheduler.
The scheduler helps to order the operations of transactions in such a way that the resulting
logs are serializable. There are two types of concurrency control: the locking approach and the
non-locking approach.

VARIOUS APPROACHES FOR CONCURRENCY CONTROL

 LOCKING BASED CONCURRENCY CONTROL PROTOCOLS

Locking-based concurrency control protocols use the concept of locking data items. A lock is a
variable associated with a data item that determines whether read/write operations can be
performed on that data item. Generally, a lock compatibility matrix is used which states
whether a data item can be locked by two transactions at the same time.
Locking-based concurrency control systems can use either one-phase or two-phase locking
protocols.
1. One-phase Locking Protocol: In this method, each transaction locks an item
before use and releases the lock as soon as it has finished using it. This locking
method provides for maximum concurrency but does not always enforce
serializability.
2. Two-phase Locking Protocol: In this method, all locking operations precede
the first lock-release or unlock operation. The transaction comprises two
phases. In the first phase, a transaction only acquires all the locks it needs and
does not release any lock. This is called the expanding or the growing phase. In
the second phase, the transaction releases the locks and cannot request any
new locks. This is called the shrinking phase.
Every transaction that follows the two-phase locking protocol is guaranteed to be serializable.
However, this approach provides low parallelism between two conflicting transactions.
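A minimal illustrative Python sketch of the growing and shrinking phases for a single
transaction follows; the shared lock table and the item names are assumptions made for the
example.

class TwoPhaseLockingTxn:
    def __init__(self, lock_table):
        self.lock_table = lock_table      # shared dict: data item -> owning transaction
        self.held = set()
        self.shrinking = False

    def lock(self, item):
        # Growing phase: locks may only be acquired before the first release.
        if self.shrinking:
            raise RuntimeError("cannot acquire locks in the shrinking phase")
        if self.lock_table.get(item) not in (None, self):
            raise RuntimeError(item + " is locked by another transaction")
        self.lock_table[item] = self
        self.held.add(item)

    def unlock_all(self):
        # Shrinking phase: once unlocking starts, no new locks may be requested.
        self.shrinking = True
        for item in self.held:
            del self.lock_table[item]
        self.held.clear()

locks = {}
t1 = TwoPhaseLockingTxn(locks)
t1.lock("A"); t1.lock("B")    # growing phase
t1.unlock_all()               # shrinking phase; t1.lock("C") would now raise an error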

 Timestamp Concurrency Control Algorithms: Timestamp-based concurrency control
algorithms use a transaction’s timestamp to coordinate concurrent access to a data item to
ensure serializability. A timestamp is a unique identifier given by the DBMS to a transaction
that represents the transaction’s start time.

These algorithms ensure that transactions commit in the order dictated by their timestamps.
An older transaction should commit before a younger transaction, since the older transaction
enters the system before the younger one.
Timestamp-based concurrency control techniques generate serializable schedules such that
the equivalent serial schedule is arranged in order of the age of the participating transactions.

Optimistic Concurrency Control Algorithm : In systems with low conflict rates, the task
of validating every transaction for serializability may lower performance. In these cases, the
test for serializability is postponed to just before commit. Since the conflict rate is low, the
probability of aborting transactions which are not serializable is also low. This approach is
called the optimistic concurrency control technique.

In this approach, a transaction’s life cycle is divided into the following three phases −
 Execution Phase − A transaction fetches data items to memory and performs
operations upon them.
 Validation Phase − A transaction performs checks to ensure that committing its
changes to the database passes the serializability test.
 Commit Phase − A transaction writes back the modified data items in memory to the
disk.
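As an illustrative Python sketch of these three phases (the version counters and the balance
item are assumptions invented for the example), a validation test can be as simple as checking
that the item's version has not changed since it was read:

def run_optimistic_txn(db, versions, key, update_fn):
    # Execution phase: fetch the data item into memory and compute the change locally.
    seen_version = versions[key]
    new_value = update_fn(db[key])

    # Validation phase: just before commit, check no other transaction changed the item.
    if versions[key] != seen_version:
        return False                       # abort; the caller may retry

    # Commit phase: write the modified item back and advance its version.
    db[key] = new_value
    versions[key] = seen_version + 1
    return True

db, versions = {"balance": 100}, {"balance": 0}
print(run_optimistic_txn(db, versions, "balance", lambda v: v + 50))   # True, commits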
QUERY PROCESSING IN DISTRIBUTED DBMS
Query processing in a distributed database management system requires the transmission
of data between the computers in a network. A distribution strategy for a query is the
ordering of data transmissions and local data processing in a database system.

Distributed query processing is the procedure of answering queries (which means mainly
read operations on large data sets) in a distributed environment where data is
managed at multiple sites in a computer network. Query processing involves the
transformation of a high-level query (e.g., formulated in SQL) into a query execution plan
(consisting of lower-level query operators in some variation of relational algebra) as well as
the execution of this plan. The goal of the transformation is to produce a plan which is
equivalent to the original query (returning the same result) and efficient, i.e., to minimize
resource consumption like total costs or response time.
1. Costs (Transfer of data) of Distributed Query processing:
In distributed query processing, the data transfer cost is the cost of transferring intermediate
files to other sites for processing, together with the cost of transferring the final result file to
the site where the result is required. Commonly, the data transfer cost is calculated in terms
of the size of the messages. The data transfer cost can be calculated using the formula below:
Data transfer cost = C * Size
Where C refers to the cost per byte of data transferring and Size is the no. of bytes
transmitted.
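As a quick illustration of the formula (the relation sizes, the result size and the cost per byte
below are assumed values, not figures from the source), suppose site S1 holds an EMPLOYEE
relation of 1,000,000 bytes, site S2 holds a DEPARTMENT relation of 3,500 bytes, and the join
result of 50,000 bytes is needed at S1:

C = 1   # assumed cost per byte transferred

# Strategy 1: ship DEPARTMENT from S2 to S1 and perform the join locally at S1.
cost_strategy_1 = C * 3_500

# Strategy 2: ship EMPLOYEE from S1 to S2, join at S2, then ship the result back to S1.
cost_strategy_2 = C * (1_000_000 + 50_000)

print(cost_strategy_1, cost_strategy_2)   # 3500 versus 1050000, so strategy 1 is far cheaper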


MC4202 ADVANCED DATABASE TECHNOLOGY


UNIT – II SPATIAL AND TEMPORAL DATABASES

SPATIAL DATABASE

Spatial data is any type of data that directly or indirectly references a specific geographical area
or location. Sometimes called geospatial data or geographic information, spatial data can also
numerically represent a physical object in a geographic coordinate system. However, spatial data
is much more than a spatial component of a map.

Users can save spatial data in a variety of different formats, as it can also contain more than
location-specific data. Analyzing this data provides a better understanding of how each variable
impacts individuals, communities, populations, etc.

There are several spatial data types, but the two primary kinds of spatial data are
 Geometric data and
 Geographic data.

Geometric data is a spatial data type that is mapped on a two-dimensional flat surface. An
example is the geometric data in floor plans. Google Maps is an application that uses geometric
data to provide accurate direction. In fact, it is one of the simplest examples of spatial data in
action.

Geographic data is information mapped around a sphere. Most often, the sphere is planet earth.
Geographic data highlights the latitude and longitude relationships to a specific object or
location. A familiar example of geographic data is a global positioning system.
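As an illustrative Python sketch of the difference (the city coordinates below are approximate
and chosen only for the example), distance over geographic data is computed on the sphere,
whereas geometric data uses a flat Euclidean measure:

import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Great-circle distance between two latitude/longitude points on a round earth.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def euclidean(x1, y1, x2, y2):
    # Planar (geometric) distance on a flat two-dimensional surface such as a floor plan.
    return math.hypot(x2 - x1, y2 - y1)

print(haversine_km(13.0827, 80.2707, 9.9252, 78.1198))   # Chennai to Madurai, roughly 420 km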

Georeferencing and Geocoding

Similar processes, georeferencing and geocoding, are important aspects of geospatial analysis.
Both geocoding and georeferencing involve fitting data into the real world by using
appropriate coordinates, but that is where the similarity ends.

Georeferencing concentrates on assigning data coordinates to vectors or rasters. This approach
helps accurately model the planet's surface.

Geocoding, on the other hand, provides address and location descriptors. These can include
information about cities, states, countries, and so on. Each exact coordinate references a specific
location on the earth's surface.

Vector and raster

Vector and raster are common data formats used to store geospatial data.

Vectors are graphical representations of the real world. There are three main types of vector
data: points, lines and polygons. The points help create lines, and the connecting lines form
enclosed areas or polygons. Vectors often represent the generalization of features or objects on
the planet's surface. Vector data is usually stored in shapefiles, sometimes referred to as .shp
files.

Raster represents information presented in a pixel grid. Each pixel stored within a raster has a
value. This can be anything from a unit of measurement, color or information about a specific
element. Typically, raster refers to imagery, but in spatial analysis it frequently refers to an
orthoimage or the photos taken from aerial devices or satellites.

There is also something called an attribute. Whenever spatial data contains additional
information or non-spatial data, it is called an attribute. Spatial data can have any number of
attributes about a location. For example, this may be a map, photographs, historical information
or anything else that may be deemed necessary.


SPATIAL DATA TYPES:

SQL Server supports two spatial data types:

1. The geometry data type and
2. The geography data type.

 The geometry type represents data in a Euclidean (flat) coordinate system.
 The geography type represents data in a round-earth coordinate system.
 The geography spatial data type, geography, is implemented as a .NET common language runtime (CLR) data type in SQL Server. This type represents data in a round-earth coordinate system. The SQL Server geography data type stores ellipsoidal (round-earth) data, such as GPS latitude and longitude coordinates.
 Geometry: Stores data based on a flat (Euclidean) coordinate system. The data type is often used to store the X and Y coordinates that represent lines, points, and polygons in two-dimensional spaces.
 Geography: Stores data based on a round-earth coordinate system.
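For illustration only, here is a minimal T-SQL sketch of storing a geography value; the Landmark table and its row are hypothetical and are not part of any SQL Server sample database:

-- hypothetical table with a geography column and a clustered primary key
CREATE TABLE Landmark
(
LandmarkId INT PRIMARY KEY CLUSTERED,
Name VARCHAR(50),
Position GEOGRAPHY
);

-- geography::Point(latitude, longitude, SRID) builds a round-earth point
INSERT INTO Landmark (LandmarkId, Name, Position)
VALUES (1, 'City Center', geography::Point(47.8315, -121.626, 4326));

The clustered primary key is included deliberately: as explained below, a spatial index on the Position column would require one.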
Spatial Indexes

SQL Server supports spatial data and spatial indexes. A spatial index is a type of extended index that allows you to index a spatial column. A spatial column is a table column that contains data of a spatial data type, such as geometry or geography.

What is a spatial index?

A spatial index is another special index type that is built to accommodate adding indexes on columns created using the spatial data types geography and geometry. As was the case with an XML index, spatial indexes also require that the database table that you are creating the spatial index on also has a clustered primary key index defined.

Why use a spatial index?

When it comes to spatial data you are generally testing if two points and/or areas intersect or are within a certain distance. The benefit of having a spatial index created on your column is that it allows the query to easily prune/skip over column values where there is no chance of intersection. When creating the index there are options available that increase the accuracy of the index, but with that comes the drawback that the index will use more space.

How to create a spatial index?

In order to create a spatial index we will use one of the tables in our sample database and create the index using the CREATE SPATIAL INDEX command. Note that this is not just a clause of the "CREATE INDEX" command. It's actually a command with many other options specific to spatial indexes.

CREATE SPATIAL INDEX IX_Address_SpatialLocation ON Person.Address (SpatialLocation);

Confirm Index Usage

The following query should use the spatial index in order to determine the 7 closest points to the specified location. Here is the TSQL.

DECLARE @g geography = 'POINT (-121.626 47.8315)';

SELECT TOP (7) SpatialLocation.ToString(), City FROM Person.Address
WHERE SpatialLocation.STDistance(@g) IS NOT NULL
ORDER BY SpatialLocation.STDistance(@g);

Decomposing Indexed Space into a Grid Hierarchy

In SQL Server, spatial indexes are built using B-trees, which means that the indexes must represent the 2-dimensional spatial data in the linear order of B-trees. Therefore, before reading data into a spatial index, SQL Server implements a hierarchical uniform decomposition of space. The index-creation process decomposes the space into a four-level grid hierarchy. These levels are referred to as level 1 (the top level), level 2, level 3, and level 4.

Each successive level further decomposes the level above it, so each upper-level cell contains a complete grid at the next level. On a given level, all the grids have the same number of cells along both axes (for example, 4x4 or 8x8), and the cells are all one size.

The following illustration shows the decomposition for the upper-right cell at each level of the grid hierarchy into a 4x4 grid. In reality, all the cells are decomposed in this way. Thus, for


example, decomposing a space into four levels of 4x4 grids actually produces a total of 65,536 level-four cells.

Tessellation

After decomposition of an indexed space into a grid hierarchy, the spatial index reads the data from the spatial column, row by row. After reading the data for a spatial object (or instance), the spatial index performs a tessellation process for that object. The tessellation process fits the object into the grid hierarchy by associating the object with a set of grid cells that it touches (touched cells). Starting at level 1 of the grid hierarchy, the tessellation process proceeds breadth first across the level. Potentially, the process can continue through all four levels, one level at a time.

The output of the tessellation process is a set of touched cells that are recorded in the spatial index for the object. By referring to these recorded cells, the spatial index can locate the object in space relative to other objects in the spatial column that are also stored in the index.

Tessellation Rules

To limit the number of touched cells that are recorded for an object, the tessellation process applies several tessellation rules. These rules determine the depth of the tessellation process and which of the touched cells are recorded in the index.

These rules are as follows:

 The covering rule:

If the object completely covers a cell, that cell is said to be covered by the object. A covered cell is counted and is not tessellated. This rule applies at all levels of the grid hierarchy. The covering rule simplifies the tessellation process and reduces the amount of data that a spatial index records.

 The cells-per-object rule:

This rule enforces the cells-per-object limit, which determines the maximum number of cells that can be counted for each object, except on level 1. At lower levels, the cells-per-object rule controls the amount of information that can be recorded about the object.

 The deepest-cell rule:

The deepest-cell rule generates the best approximation of an object by recording only the bottom-most cells that have been tessellated for the object. Parent cells do not contribute to the cells-per-object count, and they are not recorded in the index.

Spatial Data Mining

What Does Spatial Data Mining Mean?

Spatial data mining is the application of data mining to spatial models. In spatial data mining, analysts use geographical or spatial information to produce business intelligence or other results. This requires specific techniques and resources to get the geographical data into relevant and useful formats.

Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. Such mining demands the unification of data mining with spatial database technologies. It can be used for learning spatial records, discovering spatial relationships and relationships among spatial and nonspatial records, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries.

Deductive database

A deductive database is a database system that can make deductions (i.e. conclude additional facts) based on rules and facts stored in the (deductive) database. Datalog is the language typically used to specify facts, rules and queries in deductive databases.

A deductive database is a finite collection of facts and rules. By applying the rules of a deductive database to the facts in the database, it is possible to infer additional facts, i.e. facts that are implicitly true but are not explicitly represented in the database.

Deductive Database

 Logic Programming.
 Relational Database.
 Database Systems.
 Query Languages.
 Closed World Assumption.
 Recursive Query.
 Theorem Prover.
 Datalog.
 Deductive database systems can be seen as bringing together mainstream data models with logic programming languages for querying and analyzing database data.
 A deductive (relational) database is a Datalog program. A Datalog program consists of an extensional database and an intensional database.


 An extensional database contains a set of facts of the form:
 p(c1,…,cm)
 Where p is a predicate symbol and c1,…,cm are constants. Each predicate symbol with a given arity in the extensional database can be seen as analogous to a relational table.
 An intensional database contains a set of rules of the form:
 p(x1,…,xm) :− q1(x11,…,x1k), …, qj(xj1,…,xjp)
 Where p and qi are predicate symbols, and all argument positions are occupied by variables or constants.
 Some additional terminology is required before examples can be given of Datalog queries and rules. A term is either a constant or a variable—variables are traditionally written with an initial capital letter. An atom p(t1,…,tm) consists of a predicate symbol and a list of arguments, each of which is a term. A literal is an atom or a negated atom ¬p(t1,…,tm). A Datalog query is a conjunction of literals.
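As a minimal illustrative sketch (the parent and grandparent predicates are hypothetical and only show the pattern), a Datalog program with an extensional and an intensional part could be:

parent(tom, bob).                                  % facts (extensional database)
parent(bob, ann).
grandparent(X, Y) :- parent(X, Z), parent(Z, Y).   % rule (intensional database)
?- grandparent(tom, G).                            % query

The fact grandparent(tom, ann) is never stored explicitly; it is deduced by applying the rule to the stored facts.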
Spatial data

Spatial data is associated with geographic locations such as cities, towns etc. A spatial database is optimized to store and query data representing objects. These are the objects which are defined in a geometric space.

Characteristics of Spatial Database

 A spatial database system has the following characteristics
 It is a database system
 It offers spatial data types (SDTs) in its data model and query language.
 It supports spatial data types in its implementation, providing at least spatial indexing and efficient algorithms for spatial join.

Example

A road map is a visualization of geographic information. A road map is a 2-dimensional object which contains points, lines, and polygons that can represent cities, roads, and political boundaries such as states or provinces.

In general, spatial data can be of two types −
 Vector data: This data is represented as discrete points, lines and polygons
 Raster data: This data is represented as a matrix of square cells

Spatial Operators:

This chapter describes the operators that you can use when working with the spatial object data type. For an overview of spatial operators, including how they differ from spatial procedures and functions, see Spatial Operators, Procedures, and Functions. The following table lists the main operators.

 Table 18-1 Main Spatial Operators

Operator Description

 SDO_FILTER - Specifies which geometries may interact with a given geometry.
 SDO_JOIN - Performs a spatial join based on one or more topological relationships.
 SDO_NN - Determines the nearest neighbor geometries to a geometry.
 SDO_NN_DISTANCE - Returns the distance of an object returned by the SDO_NN operator.
 SDO_POINTINPOLYGON - Takes a set of rows whose first column is a point's x-coordinate value and the second column is a point's y-coordinate value, and returns those rows that are within a specified polygon geometry.
 SDO_RELATE - Determines whether or not two geometries interact in a specified way.
 SDO_WITHIN_DISTANCE - Determines if two geometries are within a specified distance from one another.

 Operators for SDO_RELATE Operations

 Operator Description
 SDO_ANYINTERACT - Checks if any geometries in a table have the ANYINTERACT topological relationship with a specified geometry.
 SDO_CONTAINS - Checks if any geometries in a table have the CONTAINS topological relationship with a specified geometry.
 SDO_COVEREDBY - Checks if any geometries in a table have the COVEREDBY topological relationship with a specified geometry.
 SDO_COVERS - Checks if any geometries in a table have the COVERS topological relationship with a specified geometry.
 SDO_EQUAL - Checks if any geometries in a table have the EQUAL topological relationship with a specified geometry.
 SDO_INSIDE - Checks if any geometries in a table have the INSIDE topological relationship with a specified geometry.
 SDO_ON - Checks if any geometries in a table have the ON topological relationship with a specified geometry.
 SDO_OVERLAPBDYDISJOINT - Checks if any geometries in a table have the OVERLAPBDYDISJOINT topological relationship with a specified geometry.
 SDO_OVERLAPBDYINTERSECT - Checks if any geometries in a table have the OVERLAPBDYINTERSECT topological relationship with a specified geometry.
 SDO_OVERLAPS - Checks if any geometries in a table overlap (that is, have the OVERLAPBDYDISJOINT or OVERLAPBDYINTERSECT topological relationship with) a specified geometry.
 SDO_TOUCH - Checks if any geometries in a table have the TOUCH topological relationship with a specified geometry.

Spatial queries:- Spatial query refers to the process of retrieving a data subset from a map layer by working directly with the map features. In a spatial database, data are stored in attribute tables and feature/spatial tables.

There are mainly three types of spatial queries as given below.
 Nearness queries: These request objects that lie near a specified location.
 Region queries: These deal with spatial regions.
 Union/Intersection: In this type of queries, we may also request intersections and unions of regions.

Mobile database

While Wikipedia defines a mobile database as "either a stationary database that can be connected to by a mobile computing device […] over a mobile network, or a database which is actually stored by the mobile device," we solely refer to databases that run on the mobile device itself.


Why use a mobile database?

 There are some advantages associated with using a mobile database:
 full offline mode for apps that depend on stored data
 frugal on bandwidth for apps that depend on stored data
 stable and predictable performance independent from network availability
 (personal data can be stored with the user, where some say they belong)

Mobile transaction model

A mobile transaction model has been defined addressing the movement behavior of transactions. Mobile transactions are named as Kangaroo Transactions, which incorporate the property that the transactions in a mobile environment hop from one base station to another as the mobile unit moves.

Multimedia Database

Multimedia database is a collection of multimedia data which includes text, images, graphics (drawings, sketches), animations, audio, and video, among others. These databases have extensive amounts of data which can be multimedia and multisource.

Multimedia database is the collection of interrelated multimedia data that includes text, graphics (sketches, drawings), images, animations, video, audio etc and has vast amounts of multisource multimedia data. The framework that manages different types of multimedia data which can be stored, delivered and utilized in different ways is known as a multimedia database management system. There are three classes of the multimedia database which includes
1. Static media,
2. Dynamic media and
3. Dimensional media.

Content of Multimedia Database management system:

1. Media data – The actual data representing an object.
2. Media format data – Information such as sampling rate, resolution, encoding scheme etc. about the format of the media data after it goes through the acquisition, processing and encoding phase.
3. Media keyword data – Keywords description relating to the generation of data. It is also known as content descriptive data. Example: date, time and place of recording.
4. Media feature data – Content dependent data such as the distribution of colors, kinds of texture and different shapes present in data.

Types of multimedia applications based on data management characteristic are:
1. Repository applications – A large amount of multimedia data as well as meta-data (media format data, media keyword data, and media feature data) that is stored for retrieval purpose, e.g., repository of satellite images, engineering drawings, radiology scanned pictures.
2. Presentation applications – They involve delivery of multimedia data subject to temporal constraint. Optimal viewing or listening requires the DBMS to deliver data at a certain rate offering the quality of service above a certain threshold. Here data is processed as it is delivered. Example: Annotating of video and audio data, real-time editing analysis.
3. Collaborative work using multimedia information – It involves executing a complex task by merging drawings, changing notifications. Example: Intelligent healthcare network.

There are still many challenges to multimedia databases, some of which are:
1. Modeling – Working in this area can improve database versus information retrieval techniques; thus, documents constitute a specialized area and deserve special consideration.
2. Design – The conceptual, logical and physical design of multimedia databases has not yet been addressed fully, as performance and tuning issues at each level are far more complex because the data consist of a variety of formats like JPEG, GIF, PNG, MPEG which are not easy to convert from one form to another.
3. Storage – Storage of multimedia database on any standard disk presents the problem of representation, compression, mapping to device hierarchies, archiving and buffering during input-output operation. In DBMS, a "BLOB" (Binary Large Object) facility allows untyped bitmaps to be stored and retrieved.
4. Performance – For an application involving video playback or audio-video synchronization, physical limitations dominate. The use of parallel processing may alleviate some problems but such techniques are not yet fully developed. Apart from this, multimedia databases consume a lot of processing time as well as bandwidth.
5. Queries and retrieval – For multimedia data like images, video and audio, accessing data through a query opens up many issues like efficient query formulation, query execution and optimization which need to be worked upon.

Areas where multimedia database is applied are:

 Documents and record management: Industries and businesses that keep detailed records and a variety of documents. Example: Insurance claim record.
 Knowledge dissemination: Multimedia database is a very effective tool for knowledge dissemination in terms of providing several resources. Example: Electronic books.
 Education and training: Computer-aided learning materials can be designed using multimedia sources which are nowadays very popular sources of learning. Example: Digital libraries.
 Marketing, advertising, retailing, entertainment and travel. Example: a virtual tour of cities.
 Real-time control and monitoring: Coupled with active database technology, multimedia presentation of information can be a very effective means for monitoring and controlling complex tasks. Example: Manufacturing operation control.

Database Models

What is Database Models?

A Database model defines the logical design and structure of a database and defines how data will be stored, accessed and updated in a database management system. While the Relational Model is the most widely used database model, there are other models too:
 Hierarchical Model
 Network Model


 Entity-relationship Model
 Relational Model

Relational Model : Relational Model represents how data is stored in Relational Databases. A relational database stores data in the form of relations (tables). Consider a relation STUDENT with attributes ROLL_NO, NAME, ADDRESS, PHONE and AGE shown in the table below.

ROLL_NO NAME ADDRESS PHONE AGE
1 KAVITHA COIMBATORE 9455123451 18
2 NIVEATHA MADURAI 9652431543 18
3 SURESH TRICHY 9156253131 20
4 DINESH KERALA 9842654756 18

Hierarchical Model

A hierarchical database model is a data model in which the data are organized into a tree-like structure. The data are stored as records which are connected to one another through links. A record is a collection of fields, with each field containing only one value.

Network Model

The network model was created to represent complex data relationships more effectively when compared to hierarchical models, to improve database performance and network standards. It has entities which are organized in a graphical representation and some entities are accessed through several paths.

Entity Relationship Model

An Entity–relationship model (ER model) describes the structure of a database with the help of a diagram, which is known as Entity Relationship Diagram (ER Diagram). An ER model is a design or blueprint of a database that can later be implemented as a database. The main components of the E-R model are: entity set and relationship set.

Design and implementation issues: When a nonlocal memory word is referenced, a chunk of memory containing the word is fetched from its current location and put on the machine making the reference. An important design issue is how big the chunk should be: a word, block, page or segment.

ISSUES TO DESIGN AND IMPLEMENTATION OF DSM
 Granularity
 Structure of shared memory space
 Memory coherence and access synchronization
 Data location and access
 Replacement strategy
 Thrashing
 Heterogeneity

Structure of shared memory space: Structure refers to the design of the shared data in the memory. The structure of the shared memory space of a DSM system is regularly dependent on the sort of applications that the DSM system is intended to support.

Memory coherence and access synchronization: In the DSM system the shared data items ought to be accessible by different nodes simultaneously in the network. The fundamental issue in this system is data irregularity. To solve this problem in the DSM system we need to utilize some synchronization primitives, semaphores, event count, and so on.

Data location and access: To share the data in the DSM system it ought to be possible to locate and retrieve the data as accessed by clients or processors. A data block finding mechanism serves data to meet the requirement of the memory coherence semantics being utilized.

Advantages:
 Performance
 Sharing data
 Reliability

Disadvantages:
 Software development cost
 Greater potential for bugs
 Increased processing overhead


UNIT III NOSQL DATABASES

NoSQL – CAP Theorem – Sharding - Document based – MongoDB Operation: Insert, Update, Delete, Query, Indexing, Application, Replication, Sharding – Cassandra: Data Model, Key Space, Table Operations, CRUD Operations, CQL Types – HIVE: Data types, Database Operations, Partitioning – HiveQL – OrientDB Graph database – OrientDB Features.

CAP theorem

It is very important to understand the limitations of NoSQL databases. NoSQL can not provide consistency and high availability together. This was first expressed by Eric Brewer in the CAP Theorem.

CAP theorem or Eric Brewer's theorem states that we can only achieve at most two out of three guarantees for a database: Consistency, Availability and Partition Tolerance.

Here Consistency means that all nodes in the network see the same data at the same time.

Availability is a guarantee that every request receives a response about whether it was successful or failed. However it does not guarantee that a read request returns the most recent write. The more users a system can cater to, the better is the availability.

Partition Tolerance is a guarantee that the system continues to operate despite arbitrary message loss or failure of part of the system. In other words, even if there is a network outage in the data center and some of the computers are unreachable, the system still continues to perform.

What is CAP theorem in NoSQL databases?

CAP theorem or Eric Brewer's theorem states that we can only achieve at most two out of three guarantees for a database: Consistency, Availability and Partition Tolerance. Here Consistency means that all nodes in the network see the same data at the same time.

What Is Database Sharding? Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows for larger datasets to be split into smaller chunks and stored in multiple data nodes, increasing the total storage capacity of the system.

What is difference between sharding and partitioning?

Sharding and partitioning are both about breaking up a large data set into smaller subsets. The difference is that sharding implies the data is spread across multiple computers while partitioning does not. Partitioning is about grouping subsets of data within a single database instance.

What are the types of sharding?

Sharding Architectures

 Key Based Sharding. This technique is also known as hash-based sharding (a MongoDB sketch follows this list).
 Horizontal or Range Based Sharding. In this method, we split the data based on the ranges of a given value inherent in each entity.
 Vertical Sharding.
 Directory-Based Sharding.
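As a hedged illustration of key (hash) based sharding using MongoDB's shell (the school database and students collection are hypothetical), the shard key is hashed so that documents spread evenly across the shards of a cluster:

// enable sharding for the database, then shard the collection on a hashed key
sh.enableSharding("school")
sh.shardCollection("school.students", { student_id: "hashed" })

With a hashed key the target shard is derived from the hash of student_id, which is the essence of key based sharding; range based sharding would instead use { student_id: 1 } and split on value ranges.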
NoSQL

NoSQL Database is used to refer to a non-SQL or non relational database.

It provides a mechanism for storage and retrieval of data other than the tabular relations model used in relational databases. A NoSQL database doesn't use tables for storing data. It is generally used to store big data and real-time web applications.

Databases can be divided in 3 types:

1. RDBMS (Relational Database Management System)
2. OLAP (Online Analytical Processing)
3. NoSQL (recently developed database)

Advantages of NoSQL

o It supports query language.
o It provides fast performance.
o It provides horizontal scalability.

What is MongoDB?

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.

MongoDB is a document-oriented database. It is an open source product, developed and supported by a company named 10gen.

"MongoDB is a scalable, open source, high performance, document-oriented database." - 10gen

MongoDB was designed to work with commodity servers. Now it is used by companies of all sizes, across all industries.

MongoDB Advantages

o MongoDB is schema less. It is a document database in which one collection holds different documents.
o There may be differences between the number of fields, content and size of the documents from one to the other.
o Structure of a single object is clear in MongoDB.
o There are no complex joins in MongoDB.


o MongoDB provides the facility of deep query because it supports a powerful dynamic query on documents.
o It is very easy to scale.
o It uses internal memory for storing working sets and this is the reason of its fast access.

Distinctive features of MongoDB

o Easy to use
o Light Weight
o Extremely faster than RDBMS

Where MongoDB should be used

o Big and complex data
o Mobile and social infrastructure
o Content management and delivery
o User data management
o Data hub

MongoDB Create Database

There is no create database command in MongoDB. Actually, MongoDB does not provide any command to create a database.

How and when to create database

If there is no existing database, the following command is used to create a new database.

Syntax:

use DATABASE_NAME

We are going to create a database "javatpointdb":

>use javatpointdb

To check the currently selected database, use the command db:

>db

To check the database list, use the command show dbs:

>show dbs

Insert at least one document into it to display the database.

MongoDB insert documents

In MongoDB, the db.collection.insert() method is used to add or insert new documents into a collection in your database.

>db.movie.insert({"name":"javatpoint"})

MongoDB Drop Database

The dropDatabase command is used to drop a database. It also deletes the associated data files. It operates on the current database.

Syntax:

db.dropDatabase()

This syntax will delete the selected database. In the case you have not selected any database, it will delete the default "test" database.

If you want to delete the database "javatpointdb", use the dropDatabase() command as follows:

>db.dropDatabase()

MongoDB Create Collection

In MongoDB, db.createCollection(name, options) is used to create a collection. But usually you don't need to create a collection. MongoDB creates collections automatically when you insert some documents. It will be explained later. First see how to create a collection:

Syntax:

db.createCollection(name, options)

Name: is a string type, specifies the name of the collection to be created.

Options: is a document type, specifies the memory size and indexing of the collection. It is an optional parameter.

To check the created collection, use the command "show collections".

>show collections


How does MongoDB create collection automatically

MongoDB creates collections automatically when you insert some documents. For example: Insert a document named seomount into a collection named SSSIT. The operation will create the collection if the collection does not currently exist.

>db.SSSIT.insert({"name" : "seomount"})
>show collections
SSSIT

MongoDB update documents

In MongoDB, the update() method is used to update or modify the existing documents of a collection.

Syntax:

db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)

Example

Consider an example which has a collection name javatpoint. Insert the following document in the collection:

db.javatpoint.insert(
{
course: "java",
details: {
duration: "6 months",
Trainer: "Sonoo jaiswal"
},
Batch: [ { size: "Small", qty: 15 }, { size: "Medium", qty: 25 } ],
category: "Programming language"
}
)

Update the existing course "java" into "android":

>db.javatpoint.update({'course':'java'},{$set:{'course':'android'}})
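To read documents back (the Query operation listed in the unit outline), the find() method is used. A minimal sketch against the same javatpoint collection (the filter value is only illustrative):

// return every document in the collection, formatted for readability
>db.javatpoint.find().pretty()

// return only the documents whose course field is "android"
>db.javatpoint.find({"course":"android"})

find() takes an optional query document as a filter; with no argument it behaves like SELECT * in SQL.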
MongoDB insert multiple documents

If you want to insert multiple documents in a collection, you have to pass an array of documents to the db.collection.insert() method.

Create an array of documents

Define a variable named Allcourses that holds an array of documents to insert.

var Allcourses =
[
{
Course: "Java",
details: { Duration: "6 months", Trainer: "Sonoo Jaiswal" },
Batch: [ { size: "Medium", qty: 25 } ],
category: "Programming Language"
},
{
Course: ".Net",
details: { Duration: "6 months", Trainer: "Prashant Verma" },
Batch: [ { size: "Small", qty: 5 }, { size: "Medium", qty: 10 } ],
category: "Programming Language"
},
{
Course: "Web Designing",
details: { Duration: "3 months", Trainer: "Rashmi Desai" },
Batch: [ { size: "Small", qty: 5 }, { size: "Large", qty: 10 } ],
category: "Programming Language"
}
];

Insert the documents

Pass this Allcourses array to the db.collection.insert() method to perform a bulk insert.

> db.javatpoint.insert( Allcourses );

MongoDB Delete documents

In MongoDB, the db.collection.remove() method is used to delete documents from a collection. The remove() method works on two parameters.

1. Deletion criteria: With the use of its syntax you can remove the documents from the collection.
2. JustOne: It removes only one document when set to true or 1.


Syntax:

db.collection_name.remove(DELETION_CRITERIA)
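For example, a small illustrative command (the filter value is hypothetical) that removes a single matching document from the javatpoint collection used earlier:

// remove only the first document whose course is "android" (JustOne = 1)
>db.javatpoint.remove({'course':'android'}, 1)

Passing 1 (or true) as the second argument applies the JustOne behaviour described above; omitting it removes every matching document.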


Remove all documents

If you want to remove all documents from a collection, pass an empty query document {} to the remove() method. The remove() method does not remove the indexes.

db.javatpoint.remove({})

Indexing in MongoDB :

MongoDB uses indexing in order to make query processing more efficient. If there is no indexing, then MongoDB must scan every document in the collection and retrieve only those documents that match the query. Indexes are special data structures that store some information related to the documents such that it becomes easy for MongoDB to find the right data file. The indexes are ordered by the value of the field specified in the index.

Creating an Index :

MongoDB provides a method called createIndex() that allows the user to create an index.

Syntax db.COLLECTION_NAME.createIndex({KEY:1})

Example

db.mycol.createIndex({"age":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}

In order to drop an index, MongoDB provides the dropIndex() method.

Syntax

db.NAME_OF_COLLECTION.dropIndex({KEY:1})

The dropIndex() method can only delete one index at a time. In order to delete (or drop) multiple indexes from the collection, MongoDB provides the dropIndexes() method that takes multiple indexes as its parameters.

Syntax –

db.NAME_OF_COLLECTION.dropIndexes({KEY1:1, KEY2:1})

Applications of MongoDB

These are some important features of MongoDB:

1. Support ad hoc queries: In MongoDB, you can search by field, range query and it also supports regular expression searches.

2. Indexing: You can index any field in a document.

3. Replication: MongoDB supports Master Slave replication. A master can perform Reads and Writes and a Slave copies data from the master and can only be used for reads or back up (not writes).

4. Duplication of data: MongoDB can run over multiple servers. The data is duplicated to keep the system up and also keep its running condition in case of hardware failure.

5. Load balancing: It has an automatic load balancing configuration because of data placed in shards.

6. Supports map reduce and aggregation tools.

7. Uses JavaScript instead of Procedures.

8. It is a schema-less database written in C++.

9. Provides high performance.

10. Stores files of any size easily without complicating your stack.

11. Easy to administer in the case of failures.

12. It also supports:

o JSON data model with dynamic schemas
o Auto-sharding for horizontal scalability

o Built in replication for high availability

Nowadays many companies are using MongoDB to create new types of applications, improve performance and availability.

MongoDB Replication Methods

The MongoDB replication methods are used to replicate the members to the replica sets.

Syntax:

rs.add(host, arbiterOnly)

The add method adds a member to the specified replica set. We are required to connect to the primary of the replica set to use this method. The connection to the shell will be terminated if the method triggers an election for primary, for example if we try to add a new member with a higher priority than the primary. An error will be reflected by the mongo shell even if the operation succeeds.

Example:

In the following example we will add a new secondary member with default vote.

rs.add( { host: "mongodbd4.example.net:27017" } )

MongoDB Sharding Commands

Sharding is a method to distribute the data across different machines. Sharding can be used by MongoDB to support deployment on very huge scale data sets and high throughput operations.

MongoDB sh.addShard(<url>) command

A shard replica set is added to a sharded cluster using this command. If we add it among the shards of a cluster, it affects the balance of chunks. It starts transferring chunks to balance the cluster.

<replica_set>/<hostname><:port>,<hostname><:port>, ...

sh.addShard("<replica_set>/<hostname><:port>")

It will add a shard by specifying the name of the replica set and the hostname of at least one member of the replica set.

Example:

sh.addShard("repl0/mongodb3.example.net:27327")

Output:
member of the replica set.


Cassandra

What is Cassandra?

Apache Cassandra is a highly scalable, high performance, distributed NoSQL database. Cassandra is designed to handle huge amounts of data across many commodity servers, providing high availability without a single point of failure.

Cassandra is a NoSQL database

A NoSQL database is a non-relational database. It is also called Not Only SQL. It is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data.

Important Points of Cassandra

o Cassandra is a column-oriented database.
o Cassandra is scalable, consistent, and fault-tolerant.
o Cassandra was created at Facebook. It is totally different from relational database management systems.
o Cassandra is being used by some of the biggest companies like Facebook, Twitter, Cisco, Rackspace, ebay, Netflix, and more.

Cassandra Data Model

The data model in Cassandra is totally different from what we normally see in an RDBMS. Let's see how Cassandra stores its data.

Cluster

Cassandra database is distributed over several machines that are operated together. The outermost container is known as the Cluster, which contains different nodes. Every node contains a replica, and in case of a failure, the replica takes charge. Cassandra arranges the nodes in a cluster, in a ring format, and assigns data to them.

Keyspace

Keyspace is the outermost container for data in Cassandra. Following are the basic attributes of Keyspace in Cassandra:

o Replication factor: It specifies the number of machines in the cluster that will receive copies of the same data.
o Replica placement Strategy: It is a strategy which specifies how to place replicas in the ring. There are three types of strategies such as:

1) Simple strategy (rack-unaware strategy)
2) old network topology strategy (rack-aware strategy)
3) network topology strategy (datacenter-shared strategy)

Cassandra Create Keyspace

Cassandra Query Language (CQL) facilitates developers to communicate with Cassandra. The syntax of Cassandra query language is very similar to SQL.

What is Keyspace?

A keyspace is an object that is used to hold column families and user defined types. A keyspace is like an RDBMS database which contains column families, indexes, user defined types, data center awareness, strategy used in keyspace, replication factor, etc.

In Cassandra, the "Create Keyspace" command is used to create a keyspace.

Syntax:

CREATE KEYSPACE <identifier> WITH <properties>

Different components of Cassandra Keyspace

Strategy: There are two types of strategy declaration in Cassandra syntax:

o Simple Strategy: Simple strategy is used in the case of one data center. In this strategy, the first replica is placed on the selected node and the remaining nodes are placed in clockwise direction in the ring without considering rack or node location.
o Network Topology Strategy: This strategy is used in the case of more than one data center. In this strategy, you have to provide the replication factor for each data center separately.

Replication Factor: Replication factor is the number of replicas of data placed on different nodes. More than two replicas are good to attain no single point of failure. So, 3 is a good replication factor.

Example:

Let's take an example to create a keyspace named "javatpoint".

CREATE KEYSPACE javatpoint


WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};

The keyspace is created now.

Using a Keyspace

To use the created keyspace, you have to use the USE command.

Syntax:

USE <identifier>

Cassandra Alter Keyspace

The "ALTER keyspace" command is used to alter the replication factor, strategy name and durable writes properties in a created keyspace in Cassandra.

Syntax:

ALTER KEYSPACE <identifier> WITH <properties>

Cassandra Drop Keyspace

In Cassandra, the "DROP Keyspace" command is used to drop keyspaces with all the data, column families, user defined types and indexes from Cassandra.

Syntax:

DROP keyspace KeyspaceName ;

Cassandra Create Table

In Cassandra, the CREATE TABLE command is used to create a table. Here, a column family is used to store data just like a table in RDBMS.

So, you can say that the CREATE TABLE command is used to create a column family in Cassandra.

Syntax:

CREATE TABLE tablename(
column1 name datatype PRIMARY KEY,
column2 name data type,
column3 name data type
)

There are two types of primary keys:

Single primary key: Use the following syntax for a single primary key.

Primary key (ColumnName)

Compound primary key: Use the following syntax for a compound primary key.

Primary key(ColumnName1, ColumnName2 . . .)

Example:

Let's take an example to demonstrate the CREATE TABLE command.

Here, we are using the already created Keyspace "javatpoint".

CREATE TABLE student(
student_id int PRIMARY KEY,
student_name text,
student_city text,
student_fees varint,
student_phone varint
);

SELECT * FROM student;
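Insert is the remaining CRUD operation for this table; a minimal CQL sketch against the student table above (the row values are only illustrative):

-- add one row to the student table (values are illustrative)
INSERT INTO student (student_id, student_name, student_city, student_fees, student_phone)
VALUES (1, 'Kavitha', 'Coimbatore', 5000, 9455123451);

After the insert, the SELECT * FROM student; query shown above returns the new row.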
Cassandra Alter Table

The ALTER TABLE command is used to alter the table after creating it. You can use the ALTER command to perform two types of operations:

o Add a column
o Drop a column

Syntax:

ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>

Adding a Column

You can add a column in the table by using the ALTER command. While adding a column, you have to be aware that the column name is not conflicting with the existing column names and that the table is not defined with the compact storage option.

Syntax:

ALTER TABLE table name
ADD new column datatype;

After using the following command:

ALTER TABLE student
ADD student_email text;

A new column is added. You can check it by using the SELECT command.

Dropping a Column

You can also drop an existing column from a table by using the ALTER command. You should check that the table is not defined with the compact storage option before dropping a column from a table.

Syntax:

ALTER table name
DROP column name;

Example:

After using the following command:

ALTER TABLE student
DROP student_email;

Now you can see that the column named "student_email" is dropped.

If you want to drop multiple columns, separate the column names by ",".

Cassandra DROP table

The DROP TABLE command is used to drop a table.

Syntax:

DROP TABLE <tablename>

Example:

After using the following command:

DROP TABLE student;

The table named "student" is dropped now. You can use the DESCRIBE command to verify if the table is deleted or not. Here the student table has been deleted; you will not find it in the column families list.

Cassandra Truncate Table

The TRUNCATE command is used to truncate a table. If you truncate a table, all the rows of the table are deleted permanently.

Syntax:

TRUNCATE <tablename>

Example:

Cassandra Batch

In Cassandra, BATCH is used to execute multiple modification statements (insert, update, delete) simultaneously. It is very useful when you have to update some columns as well as delete some of the existing ones.

Syntax:

BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH

Use of WHERE Clause

The WHERE clause is used with the SELECT command to specify the exact location from where we have to fetch data.

Syntax:

SELECT FROM <table name> WHERE <condition>;

SELECT * FROM student WHERE student_id=2;

Cassandra Update Data

The UPDATE command is used to update data in a Cassandra table. If you see no result after updating the data, it means data is successfully updated; otherwise an error will be returned. While updating data in a Cassandra table, the following keywords are commonly used:

o Where: The WHERE clause is used to select the row that you want to update.
o Set: The SET clause is used to set the value.
o Must: It is used to include all the columns composing the primary key.

Syntax:

UPDATE <tablename>
SET <column name> = <new value>
<column name> = <value>....
WHERE <condition>
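For instance, a small illustrative CQL statement (the values are hypothetical) that updates one row of the student table created earlier:

-- change two columns of the row whose primary key is 2
UPDATE student
SET student_city = 'Chennai',
student_fees = 7000
WHERE student_id = 2;

The WHERE clause must identify the row by its primary key; re-running SELECT * FROM student WHERE student_id=2; shows the modified values.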


Cassandra DELETE Data

The DELETE command is used to delete data from a Cassandra table. You can delete the complete table or a selected row by using this command.

Syntax:

DELETE FROM <identifier> WHERE <condition>;

Delete an entire row

To delete the entire row of the student_id "3", use the following command:

DELETE FROM student WHERE student_id=3;

Delete a specific column name

Example:

Delete the student_fees where student_id is 4.

DELETE student_fees FROM student WHERE student_id=4;

HAVING Clause in SQL

The HAVING clause places the condition on the groups defined by the GROUP BY clause in the SELECT statement.

This SQL clause is implemented after the 'GROUP BY' clause in the 'SELECT' statement.

This clause is used in SQL because we cannot use the WHERE clause with the SQL aggregate functions. Both WHERE and HAVING clauses are used for filtering the records in SQL queries.

Syntax of HAVING clause in SQL

SELECT column_Name1, column_Name2, ....., column_NameN aggregate_function_name(column_Name) FROM TableName GROUP BY column_Name1, ....., column_NameN HAVING condition;

SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City;

The following query uses the HAVING clause in SQL:

SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City
HAVING SUM(Emp_Salary)>12000;

MIN Function with HAVING Clause:

If you want to show each department and the minimum salary in each department, you have to write the following query:

SELECT MIN(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;

MAX Function with HAVING Clause:

SELECT MAX(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;

AVERAGE CLAUSE:

SELECT AVG(Emp_Salary), Emp_Dept FROM Employee_Dept GROUP BY Emp_Dept;

SQL ORDER BY Clause

o Whenever we want to sort the records based on the columns stored in the tables of the SQL database, then we consider using the ORDER BY clause in SQL.
o The ORDER BY clause in SQL will help us to sort the records based on the specific column of a table. This means that all the values stored in the column on which we are applying the ORDER BY clause will be sorted, and the corresponding column values will be displayed in the sequence in which we have obtained the values in the earlier step.

Syntax to sort the records in ascending order:

SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY ColumnName ASC;

Syntax to sort the records in descending order:

SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY ColumnName DESC;

Syntax to sort the records in ascending order without using ASC keyword:

SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY ColumnName;

Index | Cassandra | MongoDB
1) | Cassandra is a high performance distributed database system. | MongoDB is a cross-platform document-oriented database system.
2) | Cassandra is written in Java. | MongoDB is written in C++.
3) | Cassandra stores data in tabular form like SQL format. | MongoDB stores data in JSON format.
4) | Cassandra is licensed by Apache. | MongoDB is licensed by AGPL and drivers by Apache.
5) | Cassandra is mainly designed to handle large amounts of data across many commodity servers. | MongoDB is designed to deal with JSON-like documents and access applications easier and faster.
6) | Cassandra provides high availability with no single point of failure. | MongoDB is easy to administer in the case of failure.


What is HIVE?

Hive is a data warehouse system which is used to analyze structured data. It is built on the top of Hadoop. It was developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL like queries called HQL (Hive query language) which get internally converted to MapReduce jobs.

Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDF).

Features of Hive

o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs) where the user can provide its functionality.

HIVE Data Types

Hive data types are categorized in numeric types, string types, misc types, and complex types. A list of Hive data types is given below.

Integer Types

Type | Size | Range
TINYINT | 1-byte signed integer | -128 to 127
SMALLINT | 2-byte signed integer | -32,768 to 32,767
INT | 4-byte signed integer | -2,147,483,648 to 2,147,483,647
BIGINT | 8-byte signed integer | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Hive Decimal Type

Type | Size | Range
FLOAT | 4-byte | Single precision floating point number
DOUBLE | 8-byte | Double precision floating point number

Date/Time Types

TIMESTAMP

o It supports traditional UNIX timestamp with optional nanosecond precision.
o As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
o As Floating point numeric type, it is interpreted as UNIX timestamp in seconds with decimal precision.
o As string, it follows java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal place precision)

DATES

The Date value is used to specify a particular year, month and day, in the form YYYY-MM-DD. However, it doesn't provide the time of the day. The range of the Date type lies between 0000-01-01 to 9999-12-31.

String Types

STRING

The string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").

Varchar

The varchar is a variable length type whose range lies between 1 and 65535, which specifies the maximum number of characters allowed in the character string.

CHAR

The char is a fixed-length type whose maximum length is fixed at 255.


Complex Type

Type | Description | Example
Struct | It is similar to a C struct or an object where fields are accessed using the "dot" notation. | struct('James','Roy')
Map | It contains the key-value tuples where the fields are accessed using array notation. | map('first','James','last','Roy')
Array | It is a collection of similar types of values that are indexable using zero-based integers. | array('James','Roy')
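As a hedged illustration (the movies table and its columns are hypothetical, not part of the course examples), these complex types can appear together in a Hive table definition:

hive> create table movies (
title string,
actors array<string>,
ratings map<string,int>,
release struct<year:int, month:int>
)
row format delimited
fields terminated by ',';

Individual fields are then read with the notations shown in the table, e.g. release.year, ratings['imdb'] and actors[0].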

Hive - Create Database

In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain multiple tables within a database where a unique name is assigned to each table. Hive also provides a default database with the name default.

o Initially, we check the default database provided by Hive. So, to check the list of existing databases, follow the below command: -

hive> show databases;

o Let's create a new database by using the following command: -

hive> create database demo

Hive - Drop Database

In this section, we will see various ways to drop the existing database.

Drop the database by using the following command.

hive> drop database demo;

Hive - Create Table

In Hive, we can create a table by using conventions similar to SQL. It supports a wide range of flexibility where the data files for tables are stored. It provides two types of table: -

o Internal table
o External table

Internal Table

The internal tables are also called managed tables as the lifecycle of their data is controlled by Hive. By default, these tables are stored in a subdirectory under the directory defined by hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). The internal tables are not flexible enough to share with other tools like Pig. If we try to drop the internal table, Hive deletes both table schema and data.

o Let's create an internal table by using the following command:-

hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;

Let's see the metadata of the created table by using the following command:-

hive> describe demo.employee

External Table

The external table allows us to create and access a table and a data externally. The external keyword is used to specify the external table, whereas the location keyword is used to determine the location of loaded data.

As the table is external, the data is not present in the Hive directory. Therefore, if we try to drop the table, the metadata of the table will be deleted, but the data still exists.

Let's create an external table using the following command: -

hive> create external table emplist (Id int, Name string , Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';

We can use the following command to retrieve the data: -

select * from emplist;

Hive - Load Data

Once the internal table has been created, the next step is to load the data into it. So, in Hive, we can easily load data from any file to the database.

o Let's load the data of the file into the database by using the following command: -


load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;

Hive - Drop Table

Hive facilitates us to drop a table by using the SQL drop table command. Let's follow the below steps to drop the table from the database.

o Let's check the list of existing databases by using the following command: -

hive> show databases;
hive> use demo;
hive> show tables;
hive> drop table new_employee;

Hive - Alter Table

In Hive, we can perform modifications in the existing table like changing the table name, column name, comments, and table properties. It provides SQL like commands to alter the table.

Rename a Table

If we want to change the name of an existing table, we can rename that table by using the following signature: -

Alter table old_table_name rename to new_table_name;

o Now, change the name of the table by using the following command: -

Alter table emp rename to employee_data;

Adding column

In Hive, we can add one or more columns in an existing table by using the following signature:

Alter table table_name add columns(column_name datatype);

o Now, add a new column to the table by using the following command: -

Alter table employee_data add columns (age int);

Change Column

In Hive, we can rename a column, change its type and position. Here, we are changing the name of the column by using the following signature: -

Alter table table_name change old_column_name new_column_name datatype;

o Now, change the name of the column by using the following command: -

Alter table employee_data change name first_name string;

Delete or Replace Column

Hive allows us to delete one or more columns by replacing them with the new columns. Thus, we cannot drop the column directly.

o Let's see the existing schema of the table.
o Now, drop a column from the table.

alter table employee_data replace columns( id string, first_name string, age int);

Partitioning in Hive

The partitioning in Hive means dividing the table into some parts based on the values of a particular column like date, course, city or country. The advantage of partitioning is that since the data is stored in slices, the query response time becomes faster.

The partitioning in Hive can be executed in two ways –

o Static partitioning
o Dynamic partitioning

Static Partitioning

In static or manual partitioning, it is required to pass the values of partitioned columns manually while loading the data into the table. Hence, the data file doesn't contain the partitioned columns.

Example of Static Partitioning

o First, select the database in which we want to create a table.

hive> use test;

o Create the table and provide the partitioned columns by using the following command: -


hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

hive> describe student;

o Load the data into the table and pass the values of partition columns with it by using the following command: -

hive> load data local inpath '/home/codegyani/hive/student_details1' into table student
partition(course= "java");

Here, we are partitioning the students of an institute based on courses.

o Load the data of another file into the same table and pass the values of partition columns with it by using the following command: -

hive> load data local inpath '/home/codegyani/hive/student_details2' into table student
partition(course= "hadoop");

hive> select * from student;

o Now, try to retrieve the data based on partitioned columns by using the following command: -

hive> select * from student where course="java";

Dynamic Partitioning

In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not required to pass the values of partitioned columns manually.

o First, select the database in which we want to create a table.

hive> use show;

o Enable the dynamic partition by using the following commands: -

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

o Create a dummy table to store the data.

hive> create table stud_demo(id int, name string, age int, institute string, course string)
row format delimited
fields terminated by ',';

o Now, load the data into the table.

hive> load data local inpath '/home/codegyani/hive/student_details' into table stud_demo;

o Create a partition table by using the following command: -

hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

o Now, insert the data of the dummy table into the partition table.

hive> insert into student_part
partition(course)
select id, name, age, institute, course
from stud_demo;
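As a brief, hedged illustration of querying the partitioned table built above (the results depend on the data files that were loaded), a HiveQL query that touches only one partition might be:

hive> select name, age from student_part where course = 'java';

Because course is the partition column, Hive reads only the course=java slice instead of scanning the whole table, which is the benefit of partitioning described earlier.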

OrientDB Graph database

What is Graph?

A graph is a pictorial representation of objects which are connected by some pair of links. A graph contains two elements: Nodes (vertices) and relationships (edges).

What is Graph database

A graph database is a database which is used to model the data in the form of a graph. It stores any kind of data using:

o Nodes
o Relationships
o Properties

Nodes: Nodes are the records/data in graph databases. Data is stored as properties and properties are simple name/value pairs.

Relationships: They are used to connect nodes. They specify how the nodes are related.

o Relationships always have direction.

OrientDB Graph database

What is Graph?

A graph is a pictorial representation of objects which are connected by some pair of links. A graph contains two elements: Nodes (vertices) and relationships (edges).

What is Graph database

A graph database is a database which is used to model the data in the form of a graph. It stores any kind of data using:

o Nodes
o Relationships
o Properties

Nodes: Nodes are the records/data in graph databases. Data is stored as properties and properties are simple name/value pairs.

Relationships: Relationships are used to connect nodes. They specify how the nodes are related.

o Relationships always have direction.
o Relationships always have a type.
o Relationships form patterns of data.

Properties: Properties are named data values.

Popular Graph Databases

Neo4j is the most popular Graph Database. Other Graph Databases are

o Oracle NoSQL Database
o OrientDB
o HyperGraphDB
o GraphBase
o InfiniteGraph
o AllegroGraph etc.

Graph Database vs. RDBMS

Differences between Graph database and RDBMS:

1. In a graph database, data is stored in graphs. In an RDBMS, data is stored in tables.
2. In a graph database there are nodes. In an RDBMS, there are rows.
3. In a graph database there are properties and their values. In an RDBMS, there are columns and data.
4. In a graph database the connected nodes are defined by relationships. In an RDBMS, constraints are used instead.
5. In a graph database traversal is used instead of join. In an RDBMS, join is used instead of traversal.
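To make the node/relationship model concrete, the following is a small illustrative sketch of how vertices and edges can be created in OrientDB's SQL dialect. The class names Person and FriendOf and the sample values are assumptions chosen for the example, not part of the syllabus text:

orientdb {db = demo}> CREATE CLASS Person EXTENDS V
orientdb {db = demo}> CREATE CLASS FriendOf EXTENDS E
orientdb {db = demo}> CREATE VERTEX Person SET name = 'satish'
orientdb {db = demo}> CREATE VERTEX Person SET name = 'krishna'
orientdb {db = demo}> CREATE EDGE FriendOf FROM (SELECT FROM Person WHERE name = 'satish') TO (SELECT FROM Person WHERE name = 'krishna')

Here the Person vertices play the role of nodes, and the FriendOf edges play the role of directed, typed relationships.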


MongoDB vs OrientDB

MongoDB and OrientDB contain many common features but the engines are fundamentally different. MongoDB is a pure Document database and OrientDB is a hybrid Document database with a graph engine.

Relationships − MongoDB: uses RDBMS-style JOINs to create relationships between entities; this has a high runtime cost and does not scale as the database grows. OrientDB: embeds and connects documents like a relational database, using direct, super-fast links taken from the graph database world.
Fetch Plan − MongoDB: costly JOIN operations. OrientDB: easily returns the complete graph with interconnected documents.
Transactions − MongoDB: doesn't support ACID transactions, but it supports atomic operations. OrientDB: supports ACID transactions as well as atomic operations.
Query language − MongoDB: has its own language based on JSON. OrientDB: query language is built on SQL.
Indexes − MongoDB: uses the B-Tree algorithm for all indexes. OrientDB: supports three different indexing algorithms so that the user can achieve the best performance.
Storage engine − MongoDB: uses a memory mapping technique. OrientDB: uses the storage engines named LOCAL and PLOCAL.

The following table illustrates the comparison between the relational model, the document model, and the OrientDB document model −

Relational Model − Document Model − OrientDB Document Model
Table − Collection − Class or Cluster
Row − Document − Document
Column − Key/value pair − Document field
Relationship − Not available − Link

The SQL Reference of the OrientDB database provides several commands to create, alter, and drop databases.

Create database

The following statement is a basic syntax of the Create Database command.
CREATE DATABASE <database-url> [<user> <password> <storage-type> [<db-type>]]
Following are the details about the options in the above syntax.
<database-url> − Defines the URL of the database. URL contains two parts, one is <mode> and the second one is <path>.
<mode> − Defines the mode, i.e. local mode or remote mode.
<path> − Defines the path to the database.
<user> − Defines the user you want to connect to the database.
<password> − Defines the password for connecting to the database.
<storage-type> − Defines the storage types. You can choose between PLOCAL and MEMORY.

Example

You can use the following command to create a local database named demo.
orientdb> CREATE DATABASE PLOCAL:/opt/orientdb/databases/demo
If the database is successfully created, you will get the following output.
Database created successfully.
Current database is: plocal:/opt/orientdb/databases/demo
orientdb {db = demo}>

Alter database

The following statement is the basic syntax of the Alter Database command.
ALTER DATABASE <attribute-name> <attribute-value>
Where <attribute-name> defines the attribute that you want to modify and <attribute-value> defines the value you want to set for that attribute.
orientdb> ALTER DATABASE custom strictSQL = false
If the command is executed successfully, you will get the following output.
Database updated successfully

Connect to a database

The following statement is the basic syntax of the Connect command.
CONNECT <database-url> <user> <password>
Following are the details about the options in the above syntax.
<database-url> − Defines the URL of the database. URL contains two parts, one is <mode> and the second one is <path>.
<mode> − Defines the mode, i.e. local mode or remote mode.
<path> − Defines the path to the database.
<user> − Defines the user you want to connect to the database.
<password> − Defines the password for connecting to the database.

Example

We have already created a database named 'demo' in the previous chapters. In this example, we will connect to that using the user admin.
You can use the following command to connect to the demo database.
orientdb> CONNECT PLOCAL:/opt/orientdb/databases/demo admin admin
If it is successfully connected, you will get the following output −
Connecting to database [plocal:/opt/orientdb/databases/demo] with user 'admin'…OK
orientdb {db = demo}>

The following statement is the basic syntax of the info command.
LIST DATABASES

Drop database

The following statement is the basic syntax of the Drop Database command.
DROP DATABASE [<database-name> <server-username> <server-user-password>]
Following are the details about the options in the above syntax.
<database-name> − Database name you want to drop.
<server-username> − Username of the database user who has the privilege to drop a database.
<server-user-password> − Password of the particular user.

Example

In this example, we will use the same database named 'demo' that we created in an earlier chapter. You can use the following command to drop the database demo.
orientdb {db = demo}> DROP DATABASE
If this command is successfully executed, you will get the following output.
Database 'demo' deleted successfully


Insert Record

The following statement is the basic syntax of the Insert Record command.
INSERT INTO [class:]<class>|cluster:<cluster>|index:<index>
[(<field>[,]*) VALUES (<expression>[,]*)[,]*]|
[SET <field> = <expression>|<sub-command>[,]*]|
[CONTENT {<JSON>}]
[RETURN <expression>]
[FROM <query>]
Following are the details about the options in the above syntax.
SET − Defines each field along with the value.
CONTENT − Defines JSON data to set field values. This is optional.
RETURN − Defines the expression to return instead of the number of records inserted. The most common use cases are −
 @rid − Returns the Record ID of the new record.
 @this − Returns the entire new record.
FROM − Where you want to insert the record or a result set.

The following command is to insert the first record into the Customer table.
INSERT INTO Customer (id, name, age) VALUES (01,'satish', 25)
The following command is to insert the second record into the Customer table.
INSERT INTO Customer SET id = 02, name = 'krishna', age = 26
The following command is to insert the next two records into the Customer table.
INSERT INTO Customer (id, name, age) VALUES (04,'javeed', 21), (05,'raja', 29)

SELECT COMMAND

The following statement is the basic syntax of the SELECT command.
SELECT [ <Projections> ] [ FROM <Target> [ LET <Assignment>* ] ]
[ WHERE <Condition>* ]
[ GROUP BY <Field>* ]
[ ORDER BY <Fields>* [ ASC|DESC ] * ]
[ UNWIND <Field>* ]
[ SKIP <SkipRecords> ]
[ LIMIT <MaxRecords> ]
[ FETCHPLAN <FetchPlan> ]
[ TIMEOUT <Timeout> [ <STRATEGY> ] ]
[ LOCK default|record ]
[ PARALLEL ]
[ NOCACHE ]
Following are the details about the options in the above syntax.
<Projections> − Indicates the data you want to extract from the query as the result record set.
FROM − Indicates the object to query. This can be a class, cluster, single Record ID, or set of Record IDs. You can specify all these objects as target.
WHERE − Specifies the condition to filter the result-set.
LET − Indicates the context variables which are used in projections, conditions or sub-queries.
GROUP BY − Indicates the field to group the records.
ORDER BY − Indicates the field to arrange a record in order.
UNWIND − Designates the field on which to unwind the collection of records.
SKIP − Defines the number of records you want to skip from the start of the result-set.
LIMIT − Indicates the maximum number of records in the result-set.
FETCHPLAN − Specifies the strategy defining how you want to fetch results.
TIMEOUT − Defines the maximum time in milliseconds for the query.
LOCK − Defines the locking strategy. DEFAULT and RECORD are the available lock strategies.
PARALLEL − Executes the query against 'x' concurrent threads.
NOCACHE − Defines whether you want to use the cache or not.

Example

Method 1 − You can use the following query to select all records from the Customer table.
orientdb {db = demo}> SELECT FROM Customer
orientdb {db = demo}> SELECT FROM Customer WHERE name LIKE 'k%'
orientdb {db = demo}> SELECT FROM Customer WHERE name.left(1) = 'k'
orientdb {db = demo}> SELECT id, name.toUpperCase() FROM Customer
orientdb {db = demo}> SELECT FROM Customer WHERE age in [25,29]
orientdb {db = demo}> SELECT FROM Customer WHERE ANY() LIKE '%sh%'
orientdb {db = demo}> SELECT FROM Customer ORDER BY age DESC
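Combining the clauses above, a small illustrative sketch of projection, ordering and paging on the same Customer table (the field names come from the earlier INSERT examples; the exact rows returned depend on the data that was inserted):

orientdb {db = demo}> SELECT name, age FROM Customer ORDER BY age DESC SKIP 1 LIMIT 2

This should skip the oldest customer and return the next two, showing only the name and age fields.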


UPDATE QUERY

The Update Record command is used to modify the value of a particular record. SET is the basic command to update a particular field value.
The following statement is the basic syntax of the Update command.
UPDATE <class>|cluster:<cluster>|<recordID>
[SET|INCREMENT|ADD|REMOVE|PUT <field-name> = <field-value>[,]*] |[CONTENT|MERGE <JSON>]
[UPSERT]
[RETURN <returning> [<returning-expression>]]
[WHERE <conditions>]
[LOCK default|record]
[LIMIT <max-records>] [TIMEOUT <timeout>]
Following are the details about the options in the above syntax.
SET − Defines the field to update.
INCREMENT − Increments the specified field value by the given value.
ADD − Adds a new item in the collection fields.
REMOVE − Removes an item from the collection field.
PUT − Puts an entry into a map field.
CONTENT − Replaces the record content with the JSON document content.
MERGE − Merges the record content with a JSON document.
LOCK − Specifies how to lock the records between load and update. We have two options to specify, Default and Record.
UPSERT − Updates a record if it exists or inserts a new record if it doesn't. It helps in executing a single query in the place of executing two queries.
RETURN − Specifies an expression to return instead of the number of records.
LIMIT − Defines the maximum number of records to update.
TIMEOUT − Defines the time you want to allow the update to run before it times out.

Try the following query to update the age of a customer 'Raja'.
orientdb {db = demo}> UPDATE Customer SET age = 28 WHERE name = 'Raja'

Truncate

The Truncate Record command is used to delete the values of a particular record.
The following statement is the basic syntax of the Truncate command.
TRUNCATE RECORD <rid>*
Where <rid>* indicates the Record ID to truncate. You can use multiple RIDs separated by commas to truncate multiple records. It returns the number of records truncated.
Try the following query to truncate the record having Record ID #11:4.
orientdb {db = demo}> TRUNCATE RECORD #11:4

DELETE

The Delete Record command is used to delete one or more records completely from the database.
The following statement is the basic syntax of the Delete command.
DELETE FROM <Class>|cluster:<cluster>|index:<index>
[LOCK <default|record>]
[RETURN <returning>]
[WHERE <Condition>*]
[LIMIT <MaxRecords>]
[TIMEOUT <timeout>]
Following are the details about the options in the above syntax.
LOCK − Specifies how to lock the records between load and delete. We have two options to specify, Default and Record.
RETURN − Specifies an expression to return instead of the number of records.
LIMIT − Defines the maximum number of records to delete.
TIMEOUT − Defines the time you want to allow the operation to run before it times out.
Note − Don't use DELETE to remove Vertices or Edges because it affects the integrity of the graph.
Try the following query to delete the record having id = 4.
orientdb {db = demo}> DELETE FROM Customer WHERE id = 4
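For graph content, OrientDB provides dedicated commands that keep vertices and edges consistent. A brief illustrative sketch (the Person class is the one assumed in the earlier graph example; the record IDs are placeholders):

orientdb {db = demo}> DELETE VERTEX Person WHERE name = 'krishna'
orientdb {db = demo}> DELETE EDGE FROM #11:2 TO #11:3

DELETE VERTEX also removes the edges attached to the deleted vertex, which is why it is preferred over a plain DELETE for graph classes.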
throughput is not limited by a single server. Global throughput is the sum of the throughput
DELETE FROM <Class>|cluster:<cluster>|index:<index>
of all the servers.
[LOCK <default|record>]
[RETURN <returning>]
 Multi-Master + Sharded architecture
[WHERE <Condition>*]
 Elastic Linear Scalability
[LIMIT <MaxRecords>]
 estore the database content using WAL
[TIMEOUT <timeout>]
Following are the details about the options in the above syntax.  OrientDB Community is free for commercial use.
 Comes with an Apache 2 Open Source License.
LOCK − Specifies how to lock the records between load and update. We have two options to
specify Default and Record.  Eliminates the need for multiple products and multiple licenses.

RETURN − Specifies an expression to return instead of the number of records.


LIMIT − Defines the maximum number of records to update.
TIMEOUT − Defines the time you want to allow the update run before it times out.
Note − Don’t use DELETE to remove Vertices or Edges because it effects the integrity of the
graph.
Try the following query to delete the record having id = 4.
orientdb {db = demo}> DELETE FROM Customer WHERE id = 4


MC4202 ADVANCED DATABASE TECHNOLOGY

UNIT – IV XML DATABASES

STRUCTURED, SEMI STRUCTURED AND UNSTRUCTURED DATA


What Is Data?

 Data is a set of facts such as descriptions, observations, and numbers used in decision
making.
 We can classify data as structured, unstructured, or semi-structured data.
1) What is structured data?

 Structured data is generally tabular data that is represented by columns and rows in a database.
 Databases that hold tables in this form are called relational databases.
 The mathematical term "relation" refers to an organized set of data held as a table.
 In structured data, every row in a table has the same set of columns.
 SQL (Structured Query Language) is the programming language used for structured data.

2) What is Semi structured Data

 Semi-structured data is information that doesn't consist of structured data (relational database) but still has some structure to it.
 Semi-structured data consists of documents held in JavaScript Object Notation (JSON) format. It also includes key-value stores and graph databases.

3) What is Unstructured Data

 Unstructured data is information that either is not organized in a pre-defined manner or does not have a pre-defined data model.
 Unstructured information is typically text-heavy but may contain data such as numbers, dates, and facts as well.
 Videos, audio, and binary data files might not have a specific structure. They are referred to as unstructured data.

Structured Data vs Unstructured Data vs Semi-Structured:

Structured data is stored in a predefined format and is highly specific, whereas unstructured data is a collection of many varied data types which are stored in their native formats, while semi-structured data does not follow the tabular data structure models associated with relational databases or other data table forms.
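As an illustration of the semi-structured case, a single JSON document can mix fixed fields with fields that other records may not have (the field names and values below are invented for the example):

{
 "id": 101,
 "name": "Gaurav",
 "department": "Sales",
 "skills": ["SQL", "XML"],
 "contact": { "email": "gaurav@example.com" }
}

A second document in the same collection could omit "skills" or add new fields without any change to a schema, which is exactly what a fixed relational table would not allow.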


Pros and Cons of Structured Data

Pros:
o Requires less processing in comparison to unstructured data and is easier to manage.
o Machine algorithms can easily crawl and use structured data, which simplifies querying.
o As an older format of data, there are several tools available for structured data that simplify usage, management, and analysis.

Cons:
o Limited usability because of its pre-defined structure/format.
o Structured data is stored in data warehouses, which are built for space saving but are difficult to change and not very scalable/flexible.

Pros and Cons of Unstructured Data

Pros:
o Variety of native formats facilitates a greater number of use-cases and applications.
o As there is no need to predefine data, unstructured data is collected quickly and easily.
o Unstructured data is stored in on-premises or cloud data lakes which are highly scalable.
o Although challenging, the greater volume of unstructured data provides better insights and more opportunities to turn your data into a competitive advantage.

Cons:
o The greater number of formats makes it equally challenging to analyze and leverage unstructured data.
o The large volume and undefined formats make data management a challenge and specialized tools a necessity.

Use cases for Structured data

Examples of structured data include names, dates, addresses, credit card numbers, stock information, geolocation, and more.
Structured data is highly organized and easily understood by machine language. Those working within relational databases can input, search, and manipulate structured data relatively quickly using a relational database management system (RDBMS).

Use cases for Semi-Structured data

The use of semi-structured data enables us to integrate data from various sources or exchange data between different systems. Applications and systems need to evolve with time, but if we work purely with structured data, this is not possible. Let's consider web forms. You may want to modify forms and capture different data for different users. If you are using a traditional relational database, the database schema needs to be changed each time a new field is needed, and fields cannot be left empty. Semi-structured data can allow you to capture any data in any structure without making changes to the database schema or coding. Adding or removing data does not impact functionality or dependencies.

Use cases for unstructured data

Here are a few examples where unstructured data is being used in analytics today.
Classifying image and sound. Using deep learning, a system can be trained to recognize images and sounds. The systems learn from labeled examples in order to accurately classify new images or sounds.
As input to predictive models. Text analytics, using natural language processing (NLP) or machine learning, is being used to structure unstructured text.
Chatbots in customer experience. Chatbots have been in the market for a number of years, but the newer ones have a better understanding of language and are more interactive.

Characteristics Of Structured (Relational) and Unstructured (Non-Relational) Data

Relational Data

 Relational databases provide undoubtedly the most well-understood model for holding data.
 The simplest structure of columns and tables makes them very easy to use initially, but the inflexible structure can cause some problems.
 We can communicate with relational databases using Structured Query Language (SQL).
 SQL allows the joining of tables using a few lines of code, with a structure most beginner employees can learn very fast.
 Examples of relational databases:
o MySQL
o PostgreSQL
o Db2

Non-Relational Data

 Non-relational databases permit us to store data in a format that more closely meets the original structure.
 A non-relational database is a database that does not use the tabular schema of columns and rows found in most traditional database systems.


 It uses a storage model that is enhanced for the specific requirements of the type of data being stored.
 In a non-relational database the data may be stored as JSON documents, as simple key/value pairs, or as a graph consisting of edges and vertices.
 Examples of non-relational databases:
o Redis
o JanusGraph
o MongoDB
o RabbitMQ

Structured data works on the basis of relational database tables. Semi-structured data works on the basis of the Resource Description Framework (RDF) or XML. Unstructured data works on the basis of binary data and the available characters. The data depends a lot on the schema.

XML HIERARCHICAL DATA MODEL

XML data is hierarchical; relational data is represented in a model of logical relationships. An XML document contains information about the relationship of data items to each other in the form of the hierarchy. With the relational model, the only types of relationships that can be defined are parent table and dependent table relationships.

A hierarchical database model is a data model in which the data are organized into a tree-like structure. The data are stored as records which are connected to one another through links. A record is a collection of fields, with each field containing only one value. The type of a record defines which fields the record contains.

The hierarchical database model mandates that each child record has only one parent, whereas each parent record can have one or more child records. In order to retrieve data from a hierarchical database, the whole tree needs to be traversed starting from the root node. This model is recognized as the first database model, created by IBM in the 1960s.

Examples of hierarchical data represented as relational tables

An organization could store employee information in a table that contains attributes/columns such as employee number, first name, last name, and department number. The organization provides each employee with computer hardware as needed, but computer equipment may only be used by the employee to which it is assigned. The organization could store the computer hardware information in a separate table that includes each part's serial number, type, and the employee that uses it. The tables might look like this:

employee table
EmpNo   First Name   Last Name   Dept. Num
100     Almukhtar    Khan        10-L
101     Gaurav       Soni        10-L
102     Siddhartha   Soni        20-B
103     Siddhant     Soni        20-B

computer table
Serial Num    Type      User EmpNo
3009734-4     Computer  100
3-23-283742   Monitor   100
2-22-723423   Monitor   100
232342        Printer   100

In this model, the employee data table represents the "parent" part of the hierarchy, while the computer table represents the "child" part of the hierarchy. In contrast to tree structures usually found in computer software algorithms, in this model the children point to the parents. As shown, each employee may possess several pieces of computer equipment, but each individual piece of computer equipment may have only one employee owner.

Consider the following structure:
EmpNo   Designation      ReportsTo
10      Director
20      Senior Manager   10
30      Typist           20
40      Programmer       20

In this, the "child" is the same type as the "parent". The hierarchy stating that EmpNo 10 is the boss of 20, and that 30 and 40 each report to 20, is represented by the "ReportsTo" column. In relational database terms, the ReportsTo column is a foreign key referencing the EmpNo column. If the "child" data type were different, it would be in a different table, but there would still be a foreign key referencing the EmpNo column of the employees table.
referencing the EmpNo column of the employees table.


XML DOCUMENT STRUCTURE

An XML (EXtensible Markup Language) Document contains declarations, elements, text, and attributes. It is made up of entities (storing units) and it tells us the structure of the data it refers to. It is used to provide a standard format of data transmission. As it helps in message delivery, it is not always stored physically, i.e. on a disk, but generated dynamically; its structure, however, always remains the same.

XML STANDARD STRUCTURE AND ITS RULES:

Rule 1: Its standard format consists of an XML prolog, which contains both the XML Declaration and the XML DTD (Document Type Definition), and the body. If the XML prolog is present, it should always be at the beginning of the document. The XML version, by default, is 1.0, and including only this forms the shortest XML Declaration. UTF-8 is the default character encoding and is one of seven character-encoding schemes. If it is not present, it can result in some encoding errors.
Syntax of XML Declaration:
<?xml version="1.0" encoding="UTF-8"?>
Syntax of DTD:
<!DOCTYPE root-element [<!element-declarations>]>
Example:
<!DOCTYPE website [
<!ELEMENT website (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
]>
<website>
<name>GeeksforGeeks</name>
<company>GeeksforGeeks</company>
<phone>011-24567981</phone>
</website>

Rule 2: XML Documents must have a root element (the supreme parent element) and its child elements (sub-elements). To have a better view of the hierarchy of the data elements, XML follows the XML tree structure, which comprises one single root (parent) and multiple leaves (children).
Source Code of the above diagram:
<?xml version="1.0" encoding="UTF-8"?>
<website>
<company category="geeksforgeeks">
<title>Machine learning</title>
<author>aarti majumdar</author>
<year>2022</year>
</company>
<company category="geeksforgeeks">
<title>Web Development</title>
<author>aarti majumdar</author>
<year>2022</year>
</company>
<company category="geeksforgeeks">
<title>XML</title>
<author>aarti majumdar</author>
<year>2022</year>
</company>
</website>

Rule 3: All XML Elements are required to have closing and opening tags (similar to HTML).
<message>Welcome to GeeksforGeeks</message>
Rule 4: The opening and closing tags are case-sensitive.
For example, <Message> is different from <message> in the above example.
Rule 5: Values of XML attributes are required to have quotations:
<website category="open source">
<company>geeksforgeeks</company>


</website>
Rule 6: White-spaces are retained and maintained in XML.
<message>welcome to geeksforgeeks</message>
Rule 7: Comments can be defined in XML enclosed between <!-- and --> tags.
<!-- XML Comments are defined like this -->
Rule 8: XML elements must be nested properly.
<message>
<company>GeeksforGeeks</company>
</message>

An XML document is a well-organized collection of components and associated markup. An XML document can hold a wide range of information, for instance a database of numbers, a mathematical equation, etc.
Example: A simple document can be created as –
<?xml version = "1.0"?>
<Student-info>
<name>Tanya Bajaj</name>
<Organization>GeeksForGeeks</Organization>
<contactNumber>(+91)-9966778909</contactNumber>
</Student-info>
An XML document has 2 sections:
 Document Prolog: It contains the XML and document type declarations (the first 2 lines in the above XML doc).
 Document Elements: The building components of XML are Document Elements. These sections divide the document into a hierarchy of sections, each with its own utility.

Can the structure of the XML Document be checked somehow?
A tool for creating rules that regulate how documents are constructed is included in XML. These are referred to as DTDs (Document Type Definitions) in jargon. You may set up a DTD to check XML documents automatically in a variety of ways. Here are a couple of such examples:
 An optional title, a given name, and a surname make up a person's name.
 One or more channels can be found on a television schedule. There are one or more time slots on each channel. There is a program title and an optional description for each time slot.
These effects can be achieved by identifying the element types that you want to employ in your document and indicating the structural order in which they can appear in the document type. A utility called an XML Parser is then able to test whether or not the document meets the prescribed rules.

Checking the structure of an XML Doc using an XML Parser

An XML parser is a software library or package that provides client applications with interfaces for working with XML documents. In other words, the XML parser is a program that reads XML and allows a program to use it. The XML parser checks the document and ensures that it is properly formatted. XML parsers are included in most modern browsers.

What is an XML file?

An XML file contains XML code and ends with the file extension ".xml". It contains tags that define not only how the document should be structured but also how it should be stored and transported over the internet.
Let's look at a basic example of an XML file below.


As you can see, this file consists of plain text and tags. The plain text is shown in black and the tags are shown in green.

Plain text is the actual data being stored. In this example, the XML is storing student names as well as test scores associated with each student.

While plain text represents the data, tags indicate what the data is. Each tag represents a type of data, like "first name," "last name," or "score," and tells the computer what to do with the plain text data inside of it. Tags aren't supposed to be seen by users, only the software itself.

XML Hierarchy

Each instance of an XML tag is called an element. In an XML file, elements are arranged in a hierarchy, which means that elements can contain other elements.

The topmost element is called the "root" element, and contains all other elements, which are called "child" elements.

In the example above, "studentsList" is the root element. It contains two "student" elements. Each "student" element contains the elements "firstName," "lastName," "scores," etc. The beginning and end of each element are represented by a starting tag (e.g., "<firstName>") and a closing tag (e.g., "</firstName>") respectively.

Also, you'll often see XML code formatted such that each level of element is indented, as is true in our example. This makes the file easier for humans to read, and does not affect how computers process the code.

XML Language

XML, short for "eXtensible Markup Language," was published by the World Wide Web Consortium (W3C) in 1998 to meet the challenges of large-scale electronic publishing. Since then, it has become one of the most widely used formats for sharing structured information among people, computers, and networks.

Since XML can be read and interpreted by people as well as computer software, it is known as human- and machine-readable.

The primary purpose of XML, however, is to store data in a way that can be easily read by and shared between software applications. Since its format is standardized, XML can be shared across systems or platforms, both locally and over the internet, and the recipient will still be able to parse the data.

It's important to understand that XML doesn't do anything with the data other than store it, like a database. Another piece of software must be created or used to send, receive, store, or display the data.

At this point, you might be thinking XML sounds a lot like another markup language, the Hypertext Markup Language (HTML). Let's take a closer look at the differences between these languages below.

Besides their purpose, there's one other key difference between XML and HTML tags.

When programming in HTML, a developer must use tags from the HTML tag library, or a standardized set of tags. While you can do a lot with these tags, there is a limited number available. That means there are only so many ways you can structure content on a web page.

XML does not have this limitation, as there is no preset library of XML tags. Instead, developers can create an unlimited number of custom tags to fit their data needs. This extensive customization is the "X" in XML.

To create custom tags, a developer writes a Document Type Definition (DTD), which is XML's version of a tag library. An XML file's DTD is indicated at the top of the file, and tells the software what each tag means and what to do with it.

For instance, an XML file containing info for a reservation system might have a custom "<res_start>" tag to define a time when a reservation begins. By reading the DTD, a program processing this file will know what the code "<res_start>7:00 PM PST</res_start>" means, and can use the information within the tag accordingly. This could mean sending this data in a confirmation email or storing it in another database.


To summarize: An XML file is a file used to store data in the form of hierarchical elements. Data stored in XML files can be read by computer programs with the help of custom tags, which indicate the type of element.

What is an XML file used for?

Since XML files are plain text documents, they are easy to create, store, transport, and interpret by computers and humans alike. This is why XML is one of the most commonly used languages on the internet. Many web-based software applications store information and send information to other apps in XML format.

Here are the most common uses of XML today:

Transporting Digital Information

The text-based format of XML files makes them highly portable, and therefore widely used for transferring information between web servers. Certain APIs, namely SOAP APIs and REST APIs, send information to other applications packaged in XML files.

Web Searching

Since XML defines the type of information contained in a document, it's easier and more effective to search the web with than HTML, for example.

Let's say you want to search for songs by Taylor Swift. Using HTML, you'd likely get back search results including interviews and articles that mention her songs. Using XML, search results would be restricted to songs only.

Computer Applications

XML files allow computer apps to easily structure and fetch the data that they need. After retrieving data from the file, programs can decide what to do with the data. This could mean storing in another database, using it in the program backend, or displaying it on the screen.

Additionally, some popular file formats are built with XML. Consider the Microsoft Office file extensions .docx (for Word documents), .xlsx (for Excel spreadsheets), and .pptx (for PowerPoint presentations). The "x" at the end of these file extensions stands for XML.

Websites and Web Apps

Websites and web apps can pull content for their pages from XML files. This is a common example of how the markup languages XML and HTML work together.
XML code modules might even appear within an HTML file in order to help display content on the page. This makes XML especially applicable to interactive websites and pages whose content changes dynamically. Depending on the user or screen size, an HTML file can choose to display only certain elements in the XML code, providing visitors with a personalized browsing experience.

How to Open an XML File

Since XML files are text files, you can open them in a few different ways. If you're occasionally viewing XML files, you can open them directly in your favorite browser. If you're frequently viewing, editing, and reformatting XML files, use an online XML editor or a text editor on your computer.

In this section, I'll cover how to open XML files with each of these programs.

How to Open XML Files With a Web Browser

All modern web browsers allow you to read XML files right in the browser window. Like with the menu example from earlier, you can select an XML file from your device and choose to open it with your web browser. Here's how a file looks in Google Chrome:

While the appearance of the text will differ by browser, you should be able to easily parse the contents of the file, and you might also be able to hide and reveal specific elements.

If there's an error in the file, your browser will tell you with an error window. Google Chrome will display an error message like the following:

Note that your browser won't let you edit the file this way. To change the file, you'll need to use a specialized tool.

How to Open XML Files With an Online XML Editor

You can use a free online text file editor to view your XML files, change their contents, or convert them to other file formats. We recommend Code Beautify's XML Viewer for this purpose.
In the tool, click Browse to upload a file from your computer. Once uploaded, you can edit the file on the left and view the hierarchy of the XML contents on the right.
Once finished editing, click Save & Share to create a fresh XML file.
Code Beautify also offers many free conversion tools to convert your XML files to other popular data storage formats like JSON and CSV.

How to Open XML Files With a Text Editor

As with any text file, you can open XML files in any text editor. However, common editors like Notepad and Word probably won't display your XML files with colors or indentation. This makes the files less readable, as seen in the example below.


You'll want to opt for a specialized text editor that will detect the .xml format and display your files accordingly. For PCs, Notepad++ is a popular option. For Macs, try Xmplify or Eclipse. Alternatively, you can use a simple text editor and apply indentation to your files with a free online XML formatter.

If any of your systems implement XML files, they will almost certainly write all of these files for you. If you want to practice writing your own basic XML files, you can do so in a text editor. Let's walk through how to create an XML file below.

How to Create an XML File

1. Open your text editor of choice.
2. On the first line, write an XML declaration.
3. Set your root element below the declaration.
4. Add your child elements within the root element.
5. Review your file for errors.
6. Save your file with the .xml file extension.
7. Test your file by opening it in the browser window.

1. Open your text editor of choice.

I'll use Sublime Text for this demo since it's free and works on macOS, Linux, and Microsoft operating systems.

2. On the first line, write an XML declaration.

This declaration tells the application running the file that the language is XML.

3. Set your root element below the declaration.

Every XML file has one root element, which contains all other child elements. The root element is written below the declaration.

In this example file, "<root_element>" is the starting tag for the root element, and "</root_element>" is the closing element. All other elements will go between these tags.
You can substitute "root_element" in both tags with a name relevant to the information you're storing.

4. Add your child elements within the root element.

Next, add your child elements between the starting and closing tag of the root element. You can nest a child element within another child element.
Like the root element, each child element needs a starting tag and a closing tag. After adding child tags, your file will look something like this:
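The editor screenshots are not reproduced here; a minimal sketch of the file at this stage, using the placeholder names from the walkthrough (the content strings are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<root_element>
 <child_element_1>Content 1</child_element_1>
 <child_element_2>Content 2</child_element_2>
</root_element>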


Instances of "root_element", "child_element", and "Content" can be swapped with names that make more sense for your file.

5. Review your file for errors.

Time to review. Are there any missing closing tags? Any rogue ampersands? Does the document type declaration appear after the first element in the document? These are just a few possible errors.
Notice that line 5 is highlighted below. That's because the closing tag of the "child_element_2" is missing a bracket.

6. Save your file with the .xml file extension.

As said above, an XML file ends with the file extension ".xml". So make sure to save your file with that extension.

7. Test your file by opening it in the browser window.

Finally, test that your file is working by dragging and dropping it into a new browser tab or window.


XML SCHEMA

What is an XML Schema?

An XML Schema describes the structure of an XML document.
The XML Schema language is also referred to as XML Schema Definition (XSD).

XSD Example

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>

The purpose of an XML Schema is to define the legal building blocks of an XML document:
 the elements and attributes that can appear in a document
 the number of (and order of) child elements
 data types for elements and attributes
 default and fixed values for elements and attributes

Why Learn XML Schema?
In the XML world, hundreds of standardized XML formats are in daily use.
Many of these XML standards are defined by XML Schemas.
XML Schema is an XML-based (and more powerful) alternative to DTD.
XML documents can have a reference to a DTD or to an XML Schema.

XML Schemas Support Data Types

One of the greatest strengths of XML Schemas is the support for data types.
 It is easier to describe allowable document content
 It is easier to validate the correctness of data
 It is easier to define data facets (restrictions on data)
 It is easier to define data patterns (data formats)
 It is easier to convert data between different data types

XML Schemas use XML Syntax

Another great strength of XML Schemas is that they are written in XML.
 You don't have to learn a new language
 You can use your XML editor to edit your Schema files
 You can use your XML parser to parse your Schema files
 You can manipulate your Schema with the XML DOM
 You can transform your Schema with XSLT
XML Schemas are extensible, because they are written in XML.
With an extensible Schema definition you can:
 Reuse your Schema in other Schemas
 Create your own data types derived from the standard types
 Reference multiple schemas in the same document

XML Schemas Secure Data Communication

When sending data from a sender to a receiver, it is essential that both parties have the same "expectations" about the content.
With XML Schemas, the sender can describe the data in a way that the receiver will understand.
A date like "03-11-2004" will, in some countries, be interpreted as 3 November and in other countries as 11 March.
However, an XML element with a data type like this:
<date type="date">2004-03-11</date>
ensures a mutual understanding of the content, because the XML data type "date" requires the format "YYYY-MM-DD".

A well-formed XML document is a document that conforms to the XML syntax rules, like:
 it must begin with the XML declaration
 it must have one unique root element
 start-tags must have matching end-tags
 elements are case sensitive
 all elements must be closed
 all elements must be properly nested
 all attribute values must be quoted
 entities must be used for special characters
Even if documents are well-formed they can still contain errors, and those errors can have serious consequences.

A Simple XML Document

Look at this simple XML document called "note.xml":

<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

A DTD File

The following example is a DTD file called "note.dtd" that defines the elements of the XML document above ("note.xml"):

<!ELEMENT note (to, from, heading, body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>

<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>

The first line defines the note element to have four child elements: "to, from, heading, body".
Lines 2-5 define the to, from, heading, body elements to be of type "#PCDATA".

An XML Schema

The following example is an XML Schema file called "note.xsd" that defines the elements of the XML document above ("note.xml"):

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="https://www.w3schools.com"
xmlns="https://www.w3schools.com"
elementFormDefault="qualified">
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>

The note element is a complex type because it contains other elements. The other elements (to, from, heading, body) are simple types because they do not contain other elements.

XQUERY EXAMPLE

The XML Example Document

We will use the following XML document in the examples below.
"books.xml":
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>

How to Select Nodes From "books.xml"?

Functions

XQuery uses functions to extract data from XML documents.
The doc() function is used to open the "books.xml" file:
doc("books.xml")

Path Expressions

XQuery uses path expressions to navigate through elements in an XML document.
The following path expression is used to select all the title elements in the "books.xml" file:
doc("books.xml")/bookstore/book/title
(/bookstore selects the bookstore element, /book selects all the book elements under the bookstore element, and /title selects all the title elements under each book element)
The XQuery above will extract the following:
<title lang="en">Everyday Italian</title>
<title lang="en">Harry Potter</title>
<title lang="en">XQuery Kick Start</title>
<title lang="en">Learning XML</title>
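Beyond plain path expressions, XQuery also has FLWOR expressions (for-let-where-order by-return). A small sketch against the same "books.xml" document (standard XQuery syntax, not taken from the text above):

for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title

Given the sample data, this should return the title elements of the two books priced above 30 (Learning XML and XQuery Kick Start), sorted alphabetically by title.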


Predicates

XQuery uses predicates to limit the extracted data from XML documents.
The following predicate is used to select all the book elements under the bookstore element that have a price element with a value that is less than 30:
doc("books.xml")/bookstore/book[price<30]
The XQuery above will extract the following:
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

XPATH SYNTAX:

XPath uses path expressions to select nodes or node-sets in an XML document. The node is selected by following a path or steps.

The XML Example Document
We will use the following XML document in the examples below.
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="en">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>

Selecting Nodes
XPath uses path expressions to select nodes in an XML document. The node is selected by following a path or steps. The most useful path expressions are listed below:

Expression − Description
nodename − Selects all nodes with the name "nodename"
/ − Selects from the root node
// − Selects nodes in the document from the current node that match the selection no matter where they are
. − Selects the current node
.. − Selects the parent of the current node
@ − Selects attributes

In the table below we have listed some path expressions and the result of the expressions:

Path Expression − Result
bookstore − Selects all nodes with the name "bookstore"
/bookstore − Selects the root element bookstore. Note: If the path starts with a slash ( / ) it always represents an absolute path to an element!
bookstore/book − Selects all book elements that are children of bookstore
//book − Selects all book elements no matter where they are in the document
bookstore//book − Selects all book elements that are descendants of the bookstore element, no matter where they are under the bookstore element
//@lang − Selects all attributes that are named lang

Predicates
Predicates are used to find a specific node or a node that contains a specific value.
Predicates are always embedded in square brackets.
In the table below we have listed some path expressions with predicates and the result of the expressions:

Path Expression − Result
/bookstore/book[1] − Selects the first book element that is the child of the bookstore element. Note: In IE 5,6,7,8,9 the first node is [0], but according to W3C, it is [1]. To solve this problem in IE, set the SelectionLanguage to XPath. In JavaScript: xml.setProperty("SelectionLanguage","XPath");
/bookstore/book[last()] − Selects the last book element that is the child of the bookstore element
/bookstore/book[last()-1] − Selects the last but one book element that is the child of the bookstore element
/bookstore/book[position()<3] − Selects the first two book elements that are children of the bookstore element
//title[@lang] − Selects all the title elements that have an attribute named lang
//title[@lang='en'] − Selects all the title elements that have a "lang" attribute with a value of "en"
/bookstore/book[price>35.00] − Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00

Downloaded by Jasvan Sundar ([email protected]) Downloaded by Jasvan Sundar ([email protected])


lOMoARcPSD|13657851

/bookstore/book[price>35.00]/title Selects all the title elements of the book elements of the
bookstore element that have a price element with a value
greater than 35.00

25 | P a g e

Downloaded by Jasvan Sundar ([email protected])


lOMoARcPSD|13657851 lOMoARcPSD|13657851

MC4202 ADVANCED DATABASE TEHNOLOGY


UNIT V INFORMATION RETRIEVAL AND WEB SEARCH
Types of retrieval model:
Information Retrieval (IR) can be defined as a software program that deals with the
 Classical IR Model. It is the simplest and easy to implement IR model. ...
organization, storage, retrieval, and evaluation of information from document repositories,
particularly textual information. Information Retrieval is the activity of obtaining material  Non-Classical IR Model. It is completely opposite to classical IR model. ...
that can usually be documented on an unstructured nature i.e. usually text which satisfies  Alternative IR Model. ...
an information need from within large collections which is stored on computers. For  Inverted Index. ...
example, Information Retrieval can be when a user enters a query into the system.  Stop Word Elimination. ...
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the document that is required by the  Stemming. ...
user or the user has asked for in the form of a query. The documents and the queries are  Term Weighting. ...
represented in a similar manner, so that document selection and ranking can be formalized  Term Frequency (tfij)
by a matching function that returns a retrieval status value (RSV) for each document in
the collection. Many of the Information Retrieval systems represent document contents by a
set of descriptors, called terms, belonging to a vocabulary V. An IR model determines the
query-document matching function according to four main approaches: TYPES OF QUERIES IN IR SYSTEMS:

During the process of indexing, many keywords are associated with document set
which contains words, phrases, date created, author names, and type of document. They are
used by an IR system to build an inverted index which is then consulted during the search.
The queries formulated by users are compared to the set of index keywords. Most IR systems
also allow the use of Boolean and other operators to build a complex query. The query
language with these operators enriches the expressiveness of a user’s information need.
1. Keyword Queries:
 Simplest and most common queries.
 The user enters just keyword combinations to retrieve documents.
 These keywords are connected by logical AND operator.
 All retrieval models provide support for keyword queries.
Retrieval Models
2. Boolean Queries:
It is the simplest and easy to implement IR model. This model is based on  Some IR systems allow using +, -, AND, OR, NOT, ( ), Boolean operators in combination
mathematical knowledge that was easily recognized and understood as well. Boolean, of keyword formulations.
Vector and Probabilistic are the three classical IR models. These are the three main statistical  No ranking is involved because a document either satisfies such a query or does not
models—Boolean, vector space, and probabilistic—and the semantic model. satisfy it.
 A document is retrieved for Boolean query if it is logically true as exact match in
document.
3. Phase Queries:
 When documents are represented using an inverted keyword index for searching, the
relative order of items in document is lost.
 To perform exact phase retrieval, these phases are encoded in inverted index or
implemented differently.
 This query consists of a sequence of words that make up a phase.
 It is generally enclosed within double quotes.
4. Proximity Queries:
 Proximity refers ti search that accounts for how close within a record multiple items
should be to each other.
 Most commonly used proximity search option is a phase search that requires terms to
be in exact order.

1|Page 2|Page

Downloaded by Jasvan Sundar ([email protected]) Downloaded by Jasvan Sundar ([email protected])


lOMoARcPSD|13657851 lOMoARcPSD|13657851

 Other proximity operators can specify how close terms should be to each other. Some
will specify the order of search terms. Stemming and Lemmatization are Text Normalization (or sometimes called Word
 Search engines use various operators’ names such as NEAR, ADJ (adjacent), or Normalization) techniques in the field of Natural Language Processing that are used to
AFTER. prepare text, words, and documents for further processing.
 However, providing support for complex proximity operators becomes expensive as it
requires time-consuming pre-processing of documents and so it is suitable for smaller
document collections rather than for web.
5. Wildcard Queries:
 It supports regular expressions and pattern matching-based searching in text.
 Retrieval models do not directly support for this query type.
 In IR systems, certain kinds of wildcard search support may be implemented.
Example: usually words ending with trailing characters.
6. Natural Language Queries:
 There are only a few natural language search engines that aim to understand the Stop words removal: Stop word removal is one of the most commonly used
structure and meaning of queries written in natural language text, generally as question preprocessing steps across different NLP applications. The idea is simply removing the
or narrative. words that occur commonly across all the documents in the corpus. Typically, articles and
 The system tries to formulate answers for these queries from retrieved results. pronouns are generally classified as stop words.
 Semantic models can provide support for this query type.

TEXT PREPROCESSING: Text preprocessing is an initial phase in text mining. There are various preprocessing techniques used to prepare text documents for categorization, such as filtering, splitting of sentences, stemming, stop-word removal and token frequency counting. Filtering applies a set of rules for removing duplicate strings and irrelevant text.
The various text preprocessing steps are:
1. Tokenization.
2. Lower casing.
3. Stop words removal.
4. Stemming.
5. Lemmatization.
In text preprocessing, tokenization is the step that breaks a stream of raw text into smaller units called tokens (typically words, phrases or sentences); these tokens become the basic units for all subsequent processing.
Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in the field of Natural Language Processing that are used to prepare text, words, and documents for further processing.
Stop words removal: Stop-word removal is one of the most commonly used preprocessing steps across different NLP applications. The idea is simply to remove the words that occur commonly across all the documents in the corpus; typically, articles and pronouns are classified as stop words.
Preprocessing the text data is an essential step, since it prepares the raw text for mining. If it is not applied, the data remain inconsistent and cannot produce good analytics results.
Text pre-processing is used to clean up text data: convert words to their roots (in other words, lemmatize) and filter out unwanted digits, punctuation, and stop words.
Some of the common text preprocessing / cleaning steps are:
• Lower casing.
• Removal of punctuation.
• Removal of stop words.
• Removal of frequent words.
• Removal of rare words.
• Stemming.
• Lemmatization.
• Removal of emojis.
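A minimal, dependency-free sketch of such a pipeline is shown below; the tiny stop-word list and the crude suffix-stripping "stemmer" are simplified stand-ins for what a real system would take from a library such as NLTK or spaCy:

import re
import string

STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}  # tiny illustrative list

def crude_stem(token):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()                                                 # lower casing
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    text = re.sub(r"\d+", " ", text)                                    # remove digits
    tokens = text.split()                                               # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]                 # stop-word removal
    return [crude_stem(t) for t in tokens]                              # stemming

print(preprocess("The databases are indexed and searched in 2022!"))
# -> ['databas', 'index', 'search']  (crude stems, as expected from suffix stripping)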
EVALUATION MEASURES
Evaluation measures for an information retrieval system are used to assess how well the search results satisfy the user's query intent. The field of information retrieval has used various types of quantitative metrics for this purpose, based either on observed user behavior or on scores from prepared benchmark test sets. Besides benchmarking with this type of measure, an evaluation of an information retrieval system should also include a validation of the measures used, i.e. an assessment of how well they measure what they are intended to measure and how well the system fits its intended use case.[1] Metrics are often split into two types: online metrics look at users' interactions with the search system, while offline metrics measure theoretical relevance, in other words how likely each result, or the search engine results page (SERP) as a whole, is to meet the information needs of the user.

Online metrics
Online metrics are generally created from search logs. The metrics are often used to determine the success of an A/B test.
Session abandonment rate
Session abandonment rate is the ratio of search sessions which do not result in a click.
Click-through rate
Click-through rate (CTR) is the ratio of users who click on a specific link to the number of total users who view a page, email, or advertisement. It is commonly used to measure the success of an online advertising campaign for a particular website as well as the effectiveness of email campaigns.[2]
Session success rate
Session success rate measures the ratio of user sessions that lead to a success. Defining "success" is often dependent on context, but for search a successful result is often measured using dwell time as a primary factor along with secondary user interaction; for instance, the user copying the result URL is considered a successful result, as is copy/pasting from the snippet.
Zero result rate
Zero result rate (ZRR) is the ratio of Search Engine Results Pages (SERPs) which returned zero results. The metric either indicates a recall issue, or that the information being searched for is not in the index.
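A small sketch of how these online metrics could be computed from a toy search log; the field names and all numbers are invented for illustration:

# Each record is one search session: how many results it returned and
# whether the user clicked anything.
sessions = [
    {"results": 10, "clicked": True},
    {"results": 0,  "clicked": False},
    {"results": 5,  "clicked": False},
    {"results": 8,  "clicked": True},
]

total = len(sessions)
abandonment_rate = sum(not s["clicked"] for s in sessions) / total   # sessions with no click
zero_result_rate = sum(s["results"] == 0 for s in sessions) / total  # ZRR

# Click-through rate for a single result link: clicks / impressions.
impressions, clicks = 1000, 37
ctr = clicks / impressions

print(f"abandonment rate: {abandonment_rate:.2f}")  # 0.50
print(f"zero result rate: {zero_result_rate:.2f}")  # 0.25
print(f"CTR: {ctr:.3f}")                            # 0.037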
Offline metrics
Offline metrics are generally created from relevance judgment sessions where the judges score the quality of the search results. Both binary (relevant/non-relevant) and multi-level (e.g., relevance from 0 to 5) scales can be used to score each document returned in response to a query. In practice, queries may be ill-posed, and there may be different shades of relevance.
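One standard offline measure that can be computed from binary judgments of this kind is precision at k; it is not named explicitly above, so the sketch below is an illustrative addition rather than part of the original notes:

def precision_at_k(judged_results, k):
    """Fraction of the top-k returned documents judged relevant (binary scale)."""
    top_k = judged_results[:k]
    return sum(top_k) / k

# Binary judgments (1 = relevant, 0 = non-relevant) for one query's ranked results (invented).
judgments = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

print(precision_at_k(judgments, 5))    # 3 relevant in the top 5  -> 0.6
print(precision_at_k(judgments, 10))   # 4 relevant in the top 10 -> 0.4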
WEB SEARCH
A web search engine is a specialized computer server that searches for data on the Web. The search results of a user query are returned as a list (known as hits). The hits can include web pages, images, and other types of files. Various search engines also search and return data available in public databases or open directories. Search engines differ from web directories in that web directories are maintained by human editors, whereas search engines work algorithmically or by a combination of algorithmic and human input.
Web search engines are large data mining applications. Several data mining techniques are used in all components of a search engine, ranging from crawling (e.g., deciding which pages should be crawled and how frequently), to indexing (e.g., selecting the pages to be indexed and determining to what extent the index should be constructed), to searching (e.g., determining how pages should be ranked, which advertisements should be added, and how the search results can be customized or made "context aware").

ANALYTICS
Analytics is the systematic computational analysis of data or statistics.[1] It is used for the discovery, interpretation, and communication of meaningful patterns in data. It also entails applying data patterns toward effective decision-making. It can be valuable in areas rich with recorded information; analytics relies on the simultaneous application of statistics, computer programming, and operations research to quantify performance.
Organizations may apply analytics to business data to describe, predict, and improve business performance. Specifically, areas within analytics include descriptive analytics, diagnostic analytics, predictive analytics, prescriptive analytics, and cognitive analytics.[2] Analytics may apply to a variety of fields such as marketing, management, finance, online systems, information security, and software services. Since analytics can require extensive computation (see big data), the algorithms and software used for analytics harness the most current methods in computer science, statistics, and mathematics.

CURRENT TRENDS IN WEB SEARCH
1. Voice search will become even more relevant
Voice search is already an integral part of our daily lives: we ask Siri where the closest gas station is or say "Hey Google, which Thai restaurant is the highest rated in my town?" At the moment, optimizing for these kinds of voice searches is recommended especially for ecommerce sites or websites whose users are likely to have their hands full. For example, if you run a recipe blog, you want your users to find the answer on how long to let the dough rest without having to type with their potentially dirty hands on the phone.
2. Your site search can no longer offer zero-results pages
A zero-results page for your user means a lost client for you. But what seems like a problem can be a great opportunity to increase your revenue. Let's go back to our example: in this case, you cannot offer your user Ralph Lauren winter shoes, but you can show them results for other relevant products such as summer shoes by Ralph Lauren or winter shoes by other brands.
3. Search will become more personalized than ever
With personalization, you can offer relevant results for each user based on their preferences and prior search behavior. Going back to our example, an HR person might have already downloaded a PDF targeted towards HR managers on the website. Based on their behavior, they would be assessed as a B2B user and can get more B2B-oriented results in their search.
4. Site search will feel less like search and more intuitive
A good site search is one you do not even think about as a user. You use it so intuitively that you don't need to assess what you are doing – you just do it. In 2022, site search will look even less like classical search.