MC4202 - Advanced Database Technologies
UNIT: I
DISTRIBUTED SYSTEMS
WHAT IS DISTRIBUTED SYSTEM? (OR) DEFINE DISTRIBUTED SYSTEM. (PART A)
EXPLAIN BRIEFLY ABOUT DISTRIBUTED SYSTEMS (PART B)
INTRODUCTION
In a distributed database system, the database is stored on several computers. The computers in a distributed system communicate with one another through various communication media, such as high-speed private networks or the Internet.
They do not share main memory or disks. The computers in a distributed system may vary in size and function, ranging from workstations up to mainframe systems. The computers in a distributed system are referred to by different names, such as sites or nodes.
1. Each site is a database system in its own right
2. It has its own local user;
3. Its own local DBMS;
In a distributed database system, both data and transaction processing are divided between one or more computers (CPUs) connected by a network, with each computer playing a specific role in the system.
A distributed database system allows applications to access data from local and remote databases.
Modern distributed systems have evolved to include autonomous processes that might run on the same
physical machine, but interact by exchanging messages with each other.
Example: Internet, ATM (bank) machines
Disadvantages
WHAT ARE THE DISADVANTAGES OF MIDDLEWARE DISTRIBUTED SYSTEM? (PART A)
Complexity − They are more complex than centralized systems.
Security − More susceptible to external attack.
Manageability − More effort required for system management.
Unpredictability − Unpredictable responses depending on the system organization and network load.
Client-Server Architecture
EXPLAIN ABOUT CLIENT-SERVER ARCHITECTURE DISTRIBUTED SYSTEM. (PART B)
WHAT IS THE ROLE OF CLIENT AND SERVER IN DISTRIBUTED SYSTEM? (PART A)
The client-server architecture is the most common distributed system architecture which decomposes the
system into two major subsystems or logical processes −
Client − This is the first process that issues a request to the second process i.e. the server.
Server − This is the second process that receives the request, carries it out, and sends a reply to the
client.
In this architecture, the application is modelled as a set of services that are provided by servers and a set of
clients that use these services. The servers need not know about clients, but the clients must know the identity
of servers, and the mapping of processors to processes is not necessarily 1 : 1
Client-server Architecture can be classified into two models based on the functionality of the client −
Thin-client model
WHAT IS THIN-CLIENT MODEL? (PART A)
In the thin-client model, all the application processing and data management is carried out by the server. The client is simply responsible for running the presentation software.
It is used when legacy systems are migrated to client-server architectures, in which the legacy system acts as a server in its own right, with a graphical interface implemented on a client.
A major disadvantage is that it places a heavy processing load on both the server and the network.
Thick/Fat-client model
WHAT IS THICK/FAT-CLIENT MODEL?(PART A)
In the thick-client model, the server is only in charge of data management. The software on the client implements the application logic and the interactions with the system user.
Most appropriate for new C/S systems where the capabilities of the client system are known in advance
More complex than a thin-client model, especially for management: new versions of the application have to be installed on all clients.
The most general use of multi-tier architecture is the three-tier architecture. A three-tier architecture is typically composed of a presentation tier, an application tier, and a data storage tier, each of which may execute on a separate processor.
Presentation Tier:
WHAT IS PRESENTATION TIER? (PART A)
The presentation layer is the topmost level of the application, which users can access directly, such as a web page or an operating system GUI (Graphical User Interface). The primary function of this layer is to translate tasks and results into something the user can understand. It communicates with the other tiers so that it places the results in the browser/client tier and all other tiers in the network.
Application Tier (Business Logic, Logic Tier, or Middle Tier):
WHAT IS APPLICATION TIER? (PART A)
The application tier coordinates the application, processes commands, makes logical decisions and evaluations, and performs calculations. It controls an application’s functionality by performing detailed processing. It also moves and processes data between the two surrounding layers.
Data Tier
NARRATE THE ROLE OF DATA TIER. (PART A)
In this layer, information is stored and retrieved from the database or file system. The information is
then passed back for processing and then back to the user. It includes the data persistence mechanisms
(database servers, file shares, etc.) and provides API (Application Programming Interface) to the application
tier which provides methods of managing the stored data.
Advantages
WHAT ARE THE ADVANTAGES OF THREE-TIER ARCHITECTURE? (PART A)
Better performance than a thin-client approach, and simpler to manage than a thick-client approach.
Enhances reusability and scalability − as demands increase, extra servers can be added.
Provides multi-threading support and also reduces network traffic.
Provides maintainability and flexibility.
Disadvantages
WHAT ARE THE DISADVANTAGES OF THREE-TIER ARCHITECTURE? (PART A)
Unsatisfactory Testability due to lack of testing tools.
More critical server reliability and availability.
DISTRIBUTED DATABASE
EXPLAIN BRIEFLY ABOUT DISTRIBUTED DATABASE. (PART B)
DEFINE DISTRIBUTED DATABASE. (2 MARKS)
A distributed database is a set of interconnected databases that is distributed over a computer network or the Internet. [Or]
A distributed database system is a collection of databases spread across several sites connected together via a communication network. Each site is typically managed by a DBMS that is capable of running independently of the other sites.
Properties of Distributed Database system:
WHAT ARE THE PROPERTIES OF DISTRIBUTED DATABASE SYSTEM? (PART A)
1) Distributed Data Independence: The user should be able to access the database without needing to know the location of the data.
2) Distributed Transaction Atomicity: The concept of atomicity should extend to operations taking place at the distributed sites.
Types of Distributed Databases:
LIST OUT THE TYPES OF DISTRIBUTED DATABASE(PART A)
a) Homogeneous Distributed Database is where the data stored across multiple sites is managed by
same DBMS software at all the sites.
b) Heterogeneous Distributed Database is where multiple sites which may be autonomous are under
the control of different DBMS software.
Vertical Fragmentation
You may also divide the CUSTOMER relation into vertical fragments that are composed of a collection
of attributes. For example, suppose that the company is divided into two departments: the service department
and the collections department. Each department is located in a separate building, and each has an interest in
only a few of the CUSTOMER table’s attributes. In this case, the fragments are defined as shown in the
following table.
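The fragment definitions can be sketched in SQL; the attribute names here are illustrative, and the key CUS_NUM is repeated in both fragments so that the original relation can be reconstructed by a join:
CREATE TABLE CUST_V1 AS -- service department's fragment
SELECT CUS_NUM, CUS_NAME, CUS_ADDRESS, CUS_STATE
FROM CUSTOMER;
CREATE TABLE CUST_V2 AS -- collections department's fragment
SELECT CUS_NUM, CUS_LIMIT, CUS_BALANCE, CUS_RATING, CUS_DUE
FROM CUSTOMER;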
Mixed Fragmentation
The XYZ Company’s structure requires that the CUSTOMER data be fragmented horizontally to
accommodate the various company locations; within the locations, the data must be fragmented vertically to
accommodate the two departments (service and collection). In short, the CUSTOMER table requires mixed
fragmentation. Mixed fragmentation requires a two-step procedure. First, horizontal fragmentation is introduced
for each site based on the location within a state (CUS_STATE). The horizontal fragmentation yields the
subsets of customer tuples (horizontal fragments) that are located at each site. Because the departments are
located in different buildings, vertical fragmentation is used within each horizontal fragment to divide the
attributes, thus meeting each department’s information needs at each sub site. Mixed fragmentation yields the
results displayed in the following Table.
Advantages of Fragmentation
WHAT ARE THE ADVANTAGES OF FRAGMENTATION? (PART A)
Horizontal: –
o Allows parallel processing on a relation.
o Allows a global table to be split so that tuples are located where they are most frequently
accessed.
Vertical: –
o Allows for further decomposition than can be achieved with normalization
o Tuple-id attribute allows efficient joining of vertical fragments.
o Allows parallel processing on a relation.
o Allows tuples to be split so that each part of the tuple is stored where it is most frequently
accessed.
DISTRIBUTED TRANSACTIONS
The TS (timestamp-ordering) protocol ensures freedom from deadlock, meaning no transaction ever waits. But the schedule may not be recoverable and may not even be cascade-free.
Example –
Input: All players called "Muller" who are playing for a team.
QUERY: SELECT p.Name FROM Players p, Teams t WHERE p.TID = t.TID AND p.Name LIKE 'Muller';
In a distributed system, we must take into account several other matters, including:
The cost of data transmission over the network.
The potential gain in performance from having several sites process parts of the query in parallel.
Query Trading
In query trading algorithm for distributed database systems, the controlling/client site for a distributed query is
called the buyer and the sites where the local queries execute are called sellers. The buyer formulates a number
of alternatives for choosing sellers and for reconstructing the global results. The target of the buyer is to achieve
the optimal cost.
The algorithm starts with the buyer assigning sub-queries to the seller sites. The optimal plan is created from
local optimized query plans proposed by the sellers combined with the communication cost for reconstructing
the final result. Once the global optimal plan is formulated, the query is executed.
Reduction of Solution Space of the Query
Optimal solution generally involves reduction of solution space so that the cost of query and data transfer is
reduced. This can be achieved through a set of heuristic rules, just as heuristics in centralized systems.
Following are some of the rules −
Perform selection and projection operations as early as possible. This reduces the data flow over
communication network.
Simplify operations on horizontal fragments by eliminating selection conditions which are not relevant
to a particular site.
In case of join and union operations comprising fragments located at multiple sites, transfer fragmented data to the site where most of the data is present and perform the operation there.
Use semi-join operation to qualify tuples that are to be joined. This reduces the amount of data transfer
which in turn reduces communication cost.
Merge the common leaves and sub-trees in a distributed query tree.
Possible Questions:
PART A
1. What is a distributed system?
2. What are the characteristics of a distributed system?
3. What are the disadvantages of a distributed system?
4. What are the advantages and disadvantages of a middleware distributed system?
5. What is the thin-client model?
6. What is the thick/fat-client model?
7. Give a quick comparison of the thin-client vs thick-client model.
8. Define distributed database.
9. List out the types of distributed database.
10. Explain about middleware.
11. What are the several factors of data replication?
12. What is fragmentation?
13. What are the advantages of fragmentation?
14. Define concurrency control.
PART B
1. Explain briefly about distributed systems.
2. Explain briefly about the distributed system architecture.
3. What is the role of client and server in distributed system?
4. Explain the architecture of distributed database system?
5. Explain about distributed data storage.
6. Explain briefly about distributed transaction.
7. Explain briefly about distributed query processing
8. Explain about commit protocols
UNIT II
NOSQL DATABASES
NOSQL
To retrieve (Select) the inserted document, run the below command. The find() command will retrieve all the
documents of the given collection.
NOTE: Please observe that the record retrieved contains an attribute called _id with some unique identifier
value called ObjectId which acts as a document identifier.
If a record is to be retrieved based on some criteria, the find() method should be called passing parameters,
then the record will be retrieved based on the attributes specified.
db.collection_name.find({"fieldname":"value"})
For Example: Let us retrieve the record from the student collection where the attribute regNo is "2KVYSAMCA01". The query for the same is as shown below:
db.student.find({"regNo":"2KVYSAMCA01"})
In order to update specific field values of a collection in MongoDB, run the below query.
db.collection_name.update()
update() method specified above will take the fieldname and the new value as argument to update a document.
Let us update the attribute name of the collection student for the document with regNo "2KVYSAMCA01".
db.student.update({"regNo":"2KVYSAMCA01" },
{$set:
{"name":"NATZ"}
})
You will see the following in the Command Prompt:
Let us now look into the deleting an entry from a collection. In order to delete an entry from a collection, run
the command as shown below:
db.collection_name.remove({"fieldname":"value"})
For Example: db.student.remove({"regNo":"2KVYSAMCA01"})
Note that after running the remove() method, the entry has been deleted from the student collection.
An index in MongoDB is a special data structure that holds the data of a few fields of the documents on which the index is created. Indexes improve the speed of search operations in the database because, instead of searching the whole document, the search is performed on the indexes that hold only a few fields.
db.collection_name.createIndex({field_name: 1 or -1})
For Example: db.student.createIndex({"student_name":1})
We can use getIndexes() method to find all the indexes created on a collection. The syntax for this method is:
db.collection_name.getIndexes()
So to get the indexes of the student collection, the command would be:
>db.student.getIndexes()
[
{
"v":2,
"key":{
"_id":1
},
"name":"_id_",
"ns":"test.student"
},
{
"v":2,
"key":{
"student_name":1
},
"name":"student_name_1",
"ns":"test.student"
}
]
Before you start using MongoDB in your Java programs, you need to make sure that you have the MongoDB Java client (driver) and Java set up on the machine. You can check the Java tutorial for Java installation on your machine. Now, let us check how to set up the MongoDB client.
You need to download the jar mongodb-driver-3.11.2.jar and its dependency mongodb-driver-core-3.11.2.jar. Make sure to download the latest release of these jar files.
You need to include the downloaded jar files into your classpath.
Create a Collection
To create a collection, createCollection() method of com.mongodb.client.MongoDatabase class is used.
Following is the code snippet to create a collection −
import com.mongodb.client.MongoDatabase;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;

public class CreatingCollection {
   public static void main(String args[]) {
      // Creating a MongoClient instance connected to the local server
      MongoClient mongo = new MongoClient("localhost", 27017);
      // Creating Credentials
      MongoCredential credential;
      credential = MongoCredential.createCredential("sampleUser", "myDb", "password".toCharArray());
      System.out.println("Connected to the database successfully");
      // Accessing the database
      MongoDatabase database = mongo.getDatabase("myDb");
      // Creating a collection
      database.createCollection("sampleCollection");
      System.out.println("Collection created successfully");
   }
}
On compiling, the above program gives you the following result −
Connected to the database successfully
Collection created successfully
Cassandra (PART A)
Cassandra is an open-source, distributed, wide-column-store NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is written in Java and developed by the Apache Software Foundation.
Avinash Lakshman and Prashant Malik initially developed Cassandra at Facebook to power the Facebook inbox search feature. Facebook released Cassandra as an open-source project on Google Code in July 2008. In March 2009 it became an Apache Incubator project, and in February 2010 it became a top-level project. Due to its outstanding technical features, Cassandra has become very popular.
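Creating a Keyspace:
A minimal CQL sketch for creating a keyspace, assuming the keyspace Vysya used in the examples below (the replication settings are illustrative):
CREATE KEYSPACE Vysya
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};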
Verification:
SYNTAX:
DESCRIBE KEYSPACES;
Using a Keyspace:
Syntax:USE <identifier>
EXAMPLE:
USE Vysya;
Dropping a Keyspace:
Syntax:
DROP KEYSPACE KeyspaceName;
Cassandra Create Table:
CREATE TABLE command is used to create a table. Here, column family is used to store data just like
table in RDBMS.
Syntax:
CREATE TABLE tablename(column1_name datatype PRIMARY KEY, column2_name datatype, column3_name datatype);
There are two types of primary keys:
Single primary key: Use the following syntax for a single primary key.
1. Primary key (ColumnName)
Compound primary key: Use the following syntax for a compound primary key.
1. Primary key (ColumnName1, ColumnName2, . . .)
Example:
Let's take an example to demonstrate the CREATE TABLE command.
Here, we are using already created Keyspace "Vysya".
CREATE TABLE MCA(Reg_No int PRIMARY KEY, Name text, Address text,Pincode varint,
phone varint);
The table is created now. You can check it by using the following command.
Example: SELECT * FROM MCA;
Cassandra Alter Table:
ALTER TABLE command is used to alter the table after creating it. You can use the ALTER command to
perform two types of operations:
Add a column
Drop a column
Syntax:
1. ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>
Adding a Column
While adding a column, you have to be aware that the column name does not conflict with the existing column names and that the table is not defined with the compact storage option.
Syntax:
1. ALTER TABLE tablename ADD new_column datatype;
Example:
ALTER TABLE MCA ADD email text;
Dropping a Column
Drop an existing column from a table by using the ALTER command.
Example:
ALTER TABLE MCA DROP email;
Cassandra Drop Table
DROP TABLE command is used to drop a table permanently.
Example:
DROP TABLE MCA;
You can use the DESCRIBE command to verify whether the table has been deleted. Here the MCA table has been deleted; you will not find it in the column families list.
DESCRIBE COLUMNFAMILIES;
Cassandra Truncate Table
TRUNCATE command is used to truncate a table. If you truncate a table, all the rows of the table are
deleted permanently.
Syntax:
TRUNCATE <tablename>
Example:
TRUNCATE MCA;
Limitations of Hive
Hive Architecture
The following architecture explains the flow of query submission into Hive.
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different
types of clients such as:-
o Thrift Server - It is a cross-language service provider platform that serves the request from all those
programming languages that supports Thrift.
o JDBC Driver - It is used to establish a connection between hive and Java applications. The JDBC Driver
is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.
Integer Types
Decimal Type
TIMESTAMP
o It supports traditional UNIX timestamp with optional nanosecond precision.
o As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
o As Floating point numeric type, it is interpreted as UNIX timestamp in seconds with decimal precision.
o As string, it follows java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal
place precision)
DATES
The Date value is used to specify a particular year, month and day, in the form YYYY-MM-DD. However, it does not provide the time of day. The range of the Date type lies between 0000-01-01 and 9999-12-31.
String Types
STRING
The string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
Varchar
The varchar is a variable-length type whose range lies between 1 and 65535, which specifies the maximum number of characters allowed in the character string.
CHAR
The char is a fixed-length type whose maximum length is fixed at 255.
Complex Type
Map − It contains key-value tuples where the fields are accessed using array notation. Example: map('first','James','last','Roy').
In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain multiple tables
within a database where a unique name is assigned to each table. Hive also provides a default database named default.
o Initially, we check the default database provided by Hive. So, to check the list of existing databases,
follow the below command: -
o Each database must contain a unique name. If we create two databases with the same name, the
following error generates: -
o If we want to suppress the warning generated by Hive on creating the database with the same name,
follow the below command: -
o Let's check the list of existing databases by using the following command: -
As we can see, the database demo is not present in the list. Hence, the database is dropped successfully.
o If we try to drop the database that doesn't exist, the following error generates:
o However, if we want to suppress the error generated by Hive on dropping a database that doesn't exist, follow the below command: -
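A minimal HiveQL sketch of this whole sequence, assuming a database named demo:
1. hive> show databases;
2. hive> create database demo;
3. hive> create database demo; -- fails: demo already exists
4. hive> create database if not exists demo; -- suppresses the error
5. hive> drop database demo;
6. hive> drop database if exists demo; -- avoids the error if demo is absent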
Internal Table
The internal tables are also called managed tables, as the lifecycle of their data is controlled by Hive. By default, these tables are stored in a subdirectory under the directory defined by hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). The internal tables are not flexible enough to share
with other tools like Pig. If we try to drop the internal table, Hive deletes both table schema and data.
1. hive> create table demo.employee (Id int, Name string , Salary float)
2. row format delimited
3. fields terminated by ',' ;
Here, the command also includes the information that the data is separated by ','.
o Let's see the metadata of the created table by using the following command:-
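A sketch using the standard describe statement:
1. hive> describe demo.employee;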
If we try to create a table that already exists, an exception occurs. If we want to ignore this type of exception, we can use the if not exists clause while creating the table.
1. hive> create table if not exists demo.employee (Id int, Name string , Salary float)
2. row format delimited
3. fields terminated by ',' ;
o While creating a table, we can add the comments to the columns and can also define the table
properties.
1. hive> create table demo.new_employee (Id int comment 'Employee Id', Name string comment 'Employee Name', Salary float comment 'Employee Salary')
2. comment 'Table Description'
3. TBLProperties ('creator'='Gaurav Chawla', 'created_at' = '2019-06-06 11:00:00');
o Let's see the metadata of the created table by using the following command: -
o Hive allows creating a new table by using the schema of an existing table.
Here, we can say that the new table is a copy of an existing table.
Hive - Drop Table
Hive facilitates us to drop a table by using the SQL drop table command. Let's follow the below steps to drop
the table from the database.
o Let's check the list of existing databases by using the following command: -
1. hive> show databases;
o Now select the database from which we want to delete the table by using the following command: -
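A sketch of these steps in standard HiveQL, assuming the demo.employee table created earlier:
1. hive> use demo;
2. hive> show tables;
3. hive> drop table employee;
4. hive> show tables;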
In Hive, we can perform modifications in the existing table like changing the table name, column name,
comments, and table properties. It provides SQL like commands to alter the table.
Rename a Table
If we want to change the name of an existing table, we can rename that table by using the following signature: -
o Now, change the name of the table by using the following command: -
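A sketch following standard HiveQL; the table names here are illustrative:
1. alter table old_table_name rename to new_table_name;
For instance:
1. hive> alter table employee rename to employee_data;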
Adding column
In Hive, we can add one or more columns in an existing table by using the following signature: -
o Now, add a new column to the table by using the following command: -
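A sketch in standard HiveQL, consistent with the employee_data table used below:
1. alter table table_name add columns(column_name datatype);
For instance:
1. hive> alter table employee_data add columns(age int);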
As we didn't add any data to the new column, Hive considers NULL as the value.
Change Column
In Hive, we can rename a column, change its type and position. Here, we are changing the name of the column
by using the following signature: -
o Now, change the name of the column by using the following command: -
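A sketch in standard HiveQL, consistent with the first_name column that appears in the replace-columns example below:
1. alter table table_name change old_column new_column datatype;
For instance:
1. hive> alter table employee_data change name first_name string;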
Hive allows us to delete one or more columns by replacing them with the new columns. Thus, we cannot drop
the column directly.
1. alter table employee_data replace columns( id string, first_name string, age int);
o Let's check whether the column has dropped or not.
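A describe statement serves here:
1. hive> describe employee_data;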
Partitioning in Hive
The partitioning in Hive means dividing the table into some parts based on the values of a particular column
like date, course, city or country. The advantage of partitioning is that since the data is stored in slices, the
query response time becomes faster.
As Hadoop is used to handle huge amounts of data, it is always required to use the best approach to deal with it. Partitioning in Hive is a good example of this.
Let's assume we have data on 10 million students studying in an institute. Now, we have to fetch the students
of a particular course. If we use a traditional approach, we have to go through the entire data. This leads to
performance degradation. In such a case, we can adopt the better approach i.e., partitioning in Hive and divide
the data among the different datasets based on particular columns.
There are two types of partitioning in Hive:
o Static partitioning
o Dynamic partitioning
Static Partitioning
In static or manual partitioning, it is required to pass the values of partitioned columns manually while loading
the data into the table. Hence, the data file doesn't contain the partitioned columns.
1. hive> create table student (id int, name string, age int, institute string)
2. partitioned by (course string)
3. row format delimited
4. fields terminated by ',';
o Let's retrieve the information associated with the table.
1. hive> describe student;
o Load the data into the table and pass the values of partition columns with it by using the following
command: -
o Load the data of another file into the same table and pass the values of partition columns with it by
using the following command: -
o Let's retrieve the entire data of the table by using the following command: -
o Now, try to retrieve the data based on partitioned columns by using the following command: -
o Let's also retrieve the data of another partitioned dataset by using the following command: -
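A minimal sketch of these load and select steps in standard HiveQL, assuming two illustrative data files for the java and hadoop courses:
1. hive> load data local inpath '/home/user/student_java.csv' into table student partition(course = 'java');
2. hive> load data local inpath '/home/user/student_hadoop.csv' into table student partition(course = 'hadoop');
3. hive> select * from student;
4. hive> select * from student where course = 'java';
5. hive> select * from student where course = 'hadoop';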
Dynamic Partitioning
In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not required to pass
the values of partitioned columns manually.
1. hive> create table stud_demo(id int, name string, age int, institute string, course string)
2. row format delimited
3. fields terminated by ',';
o Now, load the data into the table.
1. hive> create table student_part (id int, name string, age int, institute string)
2. partitioned by (course string)
3. row format delimited
4. fields terminated by ',';
o Now, insert the data of dummy table into the partition table.
o Let's retrieve the entire data of the table by using the following command: -
In this case, we are not examining the entire data. Hence, this approach improves query response time.
o Let's also retrieve the data of another partitioned dataset by using the following command:
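A minimal sketch of the dynamic-partitioning steps in standard HiveQL (the file path is illustrative); dynamic partitioning must first be enabled:
1. hive> set hive.exec.dynamic.partition=true;
2. hive> set hive.exec.dynamic.partition.mode=nonstrict;
3. hive> load data local inpath '/home/user/student_details.csv' into table stud_demo;
4. hive> insert into table student_part partition(course) select id, name, age, institute, course from stud_demo;
5. hive> select * from student_part;
6. hive> select * from student_part where course = 'java';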
What Is OrientDB?
OrientDB is a multi-model database capable of efficiently storing and retrieving data like all
traditional database systems while it also supports new functionality adopted from graph and
document databases. It is written in Java and belongs to the NoSQL database family.
In graph databases, the structure follows two classes: V as the base class for vertices and E as the base class for edges. OrientDB builds these classes automatically when you create the graph database. In the event that you don't have these classes, create them (see below).
Working with Vertex and Edge Classes
While you can build graphs using V and E class instances, it is strongly recommended that you create
custom types for vertices and edges.
To create a custom vertex class (or type), use the createVertexType(<name>) method:
// Create Custom Vertex Class
OrientVertexType account = graph.createVertexType("Account");
To create a vertex of that class Account, pass a string with the format "class:<name>":
// Add Vertex Instance
Vertex v = graph.addVertex("class:Account");
In Blueprints, edges have the concept of labels used in distinguishing between edge types. OrientDB binds the concept of edge labels to edge classes. There is a similar method for creating custom edge types, using createEdgeType(<name>):
// Create Graph Database Instance
OrientGraph graph = new OrientGraph("plocal:/tmp/db");
// Create Custom Edge Class
OrientEdgeType lives = graph.createEdgeType("Lives");
// Create Vertices
Vertex account = graph.addVertex("class:Account");
Vertex address = graph.addVertex("class:Address");
// Create Edge of that class
Edge e = account.addEdge("Lives", address);
Inheritance Tree
Classes can extend other classes. To create a class that extends a class different from V or E, pass the class name in the constructor:
graph.createVertexType(<class>, <super-class>); // Vertex
graph.createEdgeType(<class>, <super-class>); // Edge
For instance, create the base class Account, then create two subclasses: Provider and Customer:
// Create Vertex Base Class
graph.createVertexType("Account");
// Create Vertex Subclasses extending Account
graph.createVertexType("Provider", "Account");
graph.createVertexType("Customer", "Account");
Retrieve Types
Classes are polymorphic. If you search for generic vertices, you also receive all custom vertex instances:
// Retrieve Vertices
Iterable<Vertex> allVertices = graph.getVertices();
To retrieve custom classes, use the getVertexType() and getEdgeType() methods. For instance, retrieving from the graph database instance:
OrientVertexType accountVertex = graph.getVertexType("Account");
OrientEdgeType livesEdge = graph.getEdgeType("Lives");
OrientDB Enterprise Edition gives you all the features of our community edition plus:
Incremental backups.
Unmatched security.
24x7 Support.
Query Profiler.
Distributed Clustering configuration.
Metrics Recording.
Live Monitor with configurable alerts.
Possible Questions:
PART A
1. What is NOSQL?
2. Define Sharding.
3. What are Mongo CRUD operations?
4. How to Insert a record in MONGODB?
5. How to Create a database in MONGODB, CASSANDRA, Hive, ORIENTDB?
6. How to Create a Table in MONGODB, CASSANDRA, Hive, ORIENTDB?
7. How to insert a record in MONGODB, CASSANDRA, Hive, ORIENTDB?
8. How to delete a record in MONGODB, CASSANDRA, Hive, ORIENTDB?
9. What is a CRUD operation in MONGODB, CASSANDRA, Hive?
10. List out the CQL Types in Cassandra.
11. What is HIVE?
12. Define Partitioning
13. What is a Graph Database?
14. List out the features of Orientdb.
PART B
Object-Oriented Databases
An object-oriented database combines object-oriented programming with relational database concepts. Various items created using object-oriented programming languages like C++ and Java can be stored in relational databases, but object-oriented databases are better suited for such items.
An object-oriented database is organized around objects rather than actions, and data rather than logic. For
example, a multimedia record in a relational database can be a definable data object, as opposed to an
alphanumeric value.
Complex types are nested data structures composed of primitive data types. These data structures can also be
composed of other complex types. Some examples of complex types include struct(row), array/list, map and
union. Complex types are supported by most programming languages including Python, C++ and Java.
Any data that does not fall into the traditional field structure (alpha, numeric, dates) of a relational DBMS.
Examples of complex data types are bills of materials, word processing documents, maps, time-series, images
and video.
Complex types are non-scalar properties of entity types that enable scalar properties to be organized within
entities. Like entities, complex types consist of scalar properties or other complex type properties. ... Complex
types cannot participate in associations and cannot contain navigation properties.
Complex Data Types
Motivation: permit non-atomic domains (atomic = indivisible). An example of a non-atomic domain is a set of integers, or a set of tuples. This allows more intuitive modeling for applications with complex data. The intuitive definition is to allow relations wherever we allow atomic (scalar) values, i.e. relations within relations. This retains the mathematical foundation of the relational model but violates first normal form.
Extensions to SQL to support complex types include:
o Collection and large object types (nested relations are an example of collection types)
o Structured types (nested record structures like composite attributes)
o Inheritance
o Object orientation (including object identifiers and references)
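A small sketch of these extensions in SQL:1999-style syntax, as presented in standard texts (actual support and syntax vary by DBMS):
-- a structured type with nested attributes
CREATE TYPE Person AS (name VARCHAR(20), address VARCHAR(20)) NOT FINAL;
-- inheritance: a subtype adds attributes to its supertype
CREATE TYPE Student UNDER Person AS (degree VARCHAR(20)) NOT FINAL;
-- a typed table whose rows are Person objects
CREATE TABLE people OF Person;
-- a collection-valued (array) attribute
CREATE TABLE contacts (id INT, phone VARCHAR(11) ARRAY[5]);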
The object-oriented data model is based upon real-world situations. These situations are represented as objects with different attributes. All these objects have multiple relationships between them.
Elements of Object oriented data model
Objects
The real world entities and situations are represented as objects in the Object oriented database model.
Every object has certain characteristics. These are represented using Attributes. The behaviour of the objects is
represented using Methods.
Class
Similar attributes and methods are grouped together using a class. An object can be called an instance of the class.
Inheritance
A new class can be derived from the original class. The derived class contains attributes and methods of the
original class as well as its own.
Example
Shape, Circle, Rectangle and Triangle are all objects in this model.
The objects Circle, Rectangle and Triangle inherit from the object Shape.
The Object Oriented (OO) Data Model in DBMS
Increasingly complex real-world problems demonstrated a need for a data model that more closely represented
the real world.
In the object oriented data model (OODM), both data and their relationships are contained in a single structure
known as an object.
In turn, the OODM is the basis for the object-oriented database management system (OODBMS).
Object-Oriented Languages
Object Oriented programming (OOP) is a programming paradigm that relies on the concept
of classes and objects. It is used to structure a software program into simple, reusable pieces of code blueprints
(usually called classes), which are used to create individual instances of objects. There are many object-oriented
programming languages including JavaScript, C++, Java, and Python.
A class is an abstract blueprint used to create more specific, concrete objects. Classes often represent broad
categories, like Car or Dog that share attributes. These classes define what attributes an instance of this type
will have, like color, but not the value of those attributes for a specific object.
Classes can also contain functions, called methods available only to objects of that type. These functions are
defined within the class and perform some action helpful to that specific type of object.
Benefits of OOP
Spatial Databases:
Spatial data is associated with geographic locations such as cities, towns, etc. A spatial database is optimized to store and query data representing objects defined in a geometric space.
A common example of spatial data can be seen in a road map. A road map is a two-dimensional object that
contains points, lines, and polygons that can represent cities, roads, and political boundaries such as states or
provinces. ... A GIS is often used to store, retrieve, and render this Earth-relative spatial data.
There are two major supported data types in SQL Server, namely the geometry data type and the geography data type. These are represented as latitudinal and longitudinal degrees, as on a round-earth coordinate system.
The common use case of the Geography type is to store an application’s GPS data.
In SQL Server, both spatial data types have been implemented in the .NET common language runtime (CLR).
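For instance, a minimal T-SQL sketch of storing an application's GPS point with the geography type (the coordinates are illustrative):
DECLARE @g geography;
SET @g = geography::STGeomFromText('POINT(80.2707 13.0827)', 4326); -- WKT point, SRID 4326
SELECT @g.Lat AS Latitude, @g.Long AS Longitude;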
Spatial data objects
This combines both spatial data types (geometry and geography). A total of sixteen spatial data objects are supported, of which eleven can be instantiated in the database. To be more specific, these objects inherit particular properties from their parent data types, and these unique properties distinguish them as objects. Take the examples of a Polygon, a Point, or a CircularString.
Among them, ten of the depicted data objects are available to both the Geometry and Geography data types. The ten objects are Point, MultiPoint, LineString, MultiLineString, CircularString, CompoundCurve, Polygon, MultiPolygon, CurvePolygon, and GeometryCollection. However, FullGlobe is utilized exclusively for the Geography SQL data type.
The object types associated with a spatial data type form a relationship with each other. For example, the object types of the Geometry SQL data type are related to each other in a geometry hierarchy in which the geometry and geography data types are included; dark grey represents the geometry and geography types.
Spatial relationship:
A spatial relation specifies how some object is located in space in relation to some reference object. When the
reference object is much bigger than the object to locate, the latter is often represented by a point. The reference
object is often represented by a bounding box.
Spatial data structures are structures that manipulate spatial data, that is, data that has geometric coordinates.
Search trees such as BSTs, AVL trees, splay trees, 2-3 Trees, B-trees, and tries are designed for searching on a
one-dimensional key. A typical example is an integer key, whose one-dimensional range can be visualized as a
number line. These various tree structures can be viewed as dividing this one-dimensional number line into
pieces.
Some databases require support for multiple keys. In other words, records can be searched for using any one of
several key fields, such as name or ID number. Typically, each such key has its own one-dimensional index,
and any given search query searches one of these independent indices as appropriate.
Multidimensional Keys
A multidimensional search key presents a rather different concept. Imagine that we have a database of city records, where each city has a name and an (x, y) coordinate. A BST or splay tree provides good performance for searches on city name, which is a one-dimensional key. Separate BSTs could be used to index the x and y coordinates. This would allow us to insert and delete cities, and locate them by name or by one coordinate. However, search on one of the two coordinates is not a natural way to view search in a two-dimensional space. Another option is to combine the x and y coordinates into a single key, say by concatenating the two coordinates, and index cities by the resulting key in a BST. That would allow search by coordinate, but would not allow for an efficient two-dimensional range query such as searching for all cities within a given distance of a specified point. The problem is that the BST only works well for one-dimensional keys, while a coordinate is a two-dimensional key where neither dimension is more important than the other.
Multidimensional range queries are the defining feature of a spatial application. Because a coordinate gives a
position in space, it is called a spatial attribute. To implement spatial applications efficiently requires the use of
a spatial data structure. Spatial data structures store data objects organized by position and are an important
class of data structures used in geographic information systems, computer graphics, robotics, and many other
fields.
A number of spatial data structures are used for storing point data in two or more dimensions. The kd tree is a natural extension of the BST to multiple dimensions. It is a binary tree whose splitting decisions alternate among the key dimensions. Like the BST, the kd tree uses object-space decomposition. The PR quadtree uses key-space decomposition and so is a form of trie. It is a binary tree only for one-dimensional keys (in which case it is a trie with a binary alphabet). For d dimensions it has 2^d branches. Thus, in two dimensions, the PR quadtree has four branches (hence the name "quadtree"), splitting space into four equal-sized quadrants at each branch. Two other variations on these data structures are the bintree and the point quadtree. In two dimensions, these four structures cover all four combinations of object- versus key-space decomposition on the one hand, and multi-level binary versus 2^d-way branching on the other.
The main problem in the design of spatial access methods is that there is no total ordering among spatial data objects that preserves spatial proximity. Consider, for example, a user who wants to find the restaurants closest to her location. One attempt to answer this query is to build a one-dimensional index that contains the distances of all restaurants from the user's location, sorted in ascending order. To answer her query, we can return the first few entries from the sorted index. However, this index cannot support a query issued by some other user at a different location. In order to answer the query of this new user, we would have to sort all the restaurants again in ascending order of their distances from this user.
Temporal Databases
A temporal database stores data relating to time instances. It offers temporal data types and stores information
relating to past, present and future time. Temporal databases could be uni-temporal, bi-temporal or tri-temporal.
More specifically the temporal aspects usually include valid time, transaction time or decision time.
Valid time is the time period during which a fact is true in the real world.
Transaction time is the time at which a fact was recorded in the database.
Decision time is the time at which the decision was made about the fact.
For example, a conventional database cannot directly support historical queries about past status and cannot
represent inherently retroactive or proactive changes. Without built-in temporal table support from the
DBMS, applications are forced to use complex and often manual methods to manage and maintain temporal
information.
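A minimal sketch of a transaction-time (system-versioned) table in SQL Server's SQL:2011-style syntax (table and column names are illustrative; syntax varies by DBMS):
CREATE TABLE Employee_Temporal (
EmpId INT PRIMARY KEY,
Name VARCHAR(50),
SysStart DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
SysEnd DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
PERIOD FOR SYSTEM_TIME (SysStart, SysEnd)
) WITH (SYSTEM_VERSIONING = ON);
-- historical query: the state of the table as of a past instant
SELECT * FROM Employee_Temporal FOR SYSTEM_TIME AS OF '2021-01-01';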
ACTIVE DATABASE:
Deductive Database:
A deductive database is a database system that makes conclusions about its data based on a set of well-defined rules and facts. This type of database was developed to combine logic programming with relational database management systems. Usually, the language used to define the rules and facts is the logic programming language Datalog.
Recursive Queries:
In general, a recursive CTE has three parts: an initial query that returns the base result set of the CTE (the initial query is called the anchor member); a recursive query that references the common table expression (therefore called the recursive member); and a termination condition that stops the recursion.
One of the most fascinating features of SQL is its ability to execute recursive queries. Like sub-queries,
recursive queries save us from the pain of writing complex SQL statements. In most of the situations, recursive
queries are used to retrieve hierarchical data. Let’s take a look at a simple example of hierarchical data.
The below Employee table has five columns: id, name, department, position, and manager. The rationale behind
this table design is that an employee can be managed by none or one person who is also the employee of the
organization. Therefore, we have a manager column in the table which contains the value from the id column of
the same table. This results in hierarchical data where the parent of a record exists in the same table.
Employee Table
From the Employee table, it can be seen that IT department has a manager David with id 1. David is the
manager of Suzan and John since both of them have 1 in their manager column. Suzan further manages Jacob in
the same IT department. Julia is the manager of the HR department. She has no manager but she manages
Wayne who is an HR supervisor. Wayne manages the office boy Zack. Finally we have Sophie, who manages
the Marketing department and she has two subordinates, Wickey and Julia.
We can retrieve a variety of data from this table. We can get the name of the manager of any employee, all the
employees managed by a particular manager, or the level/seniority of employee in the hierarchy of employees.
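A minimal sketch of such a recursive CTE over the Employee table just described (column names follow the text; PostgreSQL and MySQL require WITH RECURSIVE instead of WITH):
WITH emp_hierarchy AS (
-- anchor member: employees with no manager
SELECT id, name, manager, 0 AS level
FROM Employee
WHERE manager IS NULL
UNION ALL
-- recursive member: employees managed by someone already in the CTE
SELECT e.id, e.name, e.manager, h.level + 1
FROM Employee e
JOIN emp_hierarchy h ON e.manager = h.id
)
SELECT id, name, level FROM emp_hierarchy;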
Mobile Databases:
Mobile databases are separate from the main database and can easily be transported to various places.
Even though they are not connected to the main database, they can still communicate with the database to share
and exchange data.
A mobile database environment typically involves the following components:
The main system database that stores all the data and is linked to the mobile database.
The mobile database that allows users to view information even while on the move. It shares
information with the main database.
The device that uses the mobile database to access data. This device can be a mobile phone, laptop etc.
A communication link that allows the transfer of data between the mobile database and the main
database.
Currently, most mobile DBMSs only provide limited prepackaged SQL functions for the mobile application. It is expected that in the near future, mobile DBMSs will provide functionality matching that at the corporate site.
Mobile Transaction Models
Transaction
A set of operations that transforms a database from one consistent state to another consistent state.
Transaction: Atomicity
A transaction is an executable program (assumed to finally terminate) with one initial state and one final state. If the program reaches its final state, the transaction is committed; otherwise, if it is back at the initial state after some execution steps, it is aborted or rolled back.
Transaction: Consistency
If a program produces a consistent result, then it satisfies the consistency property and will be at the final state, i.e. committed. If the result is not consistent, then the transaction program should be at the initial state; in other words, the transaction is aborted.
Transaction: Isolation
If a program is executing and it is the only program on the system, then it satisfies the isolation property. If there are several other processes on the system, then none of the intermediate states of this program is viewable until it reaches its final state.
Transaction: Durability
If a program reaches its final state and the result is made available to the outside world, then this result is made permanent.
Multimedia Databases.
• Multimedia database is the collection of interrelated multimedia data that includes text, graphics
(sketches, drawings), images, animations, video, audio, etc., and has vast amounts of multisource
multimedia data. The framework that manages different types of multimedia data which can be
stored, delivered and utilized in different ways is known as multimedia database management
system. There are three classes of the multimedia database which includes static media, dynamic
media and dimensional media.
• Media format data – Information such as sampling rate, resolution, encoding scheme etc. about
the format of the media data after it goes through the acquisition, processing and encoding phase.
• Media keyword data – Keywords description relating to the generation of data. It is also known
as content descriptive data. Example: date, time and place of recording.
• Media feature data – Content dependent data such as the distribution of colors, kinds of texture
and different shapes present in data.
• Modelling – Working in this area can improve database versus information retrieval techniques
thus, documents constitute a specialized area and deserve special consideration.
• Design –The conceptual, logical and physical design of multimedia databases has not yet been
addressed fully as performance and tuning issues at each level are far more complex as they
consist of a variety of formats like JPEG, GIF, PNG, MPEG which is not easy to convert from one
form to another.
• Storage – Storage of multimedia database on any standard disk presents the problem of
representation, compression, mapping to device hierarchies, archiving and buffering during input-
output operation. In DBMS, a "BLOB" (Binary Large Object) facility allows untyped bitmaps to be
stored and retrieved.
• Queries and retrieval –For multimedia data like images, video, audio accessing data through
query opens up many issues like efficient query formulation, query execution and optimization
which need to be worked upon.
Possible Questions:
Part A
Part B
XML
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding
documents in a format that is both human-readable and machine-readable.
• The tags in the example above (like <to> and <from>) are not defined in any XML standard.
These tags are "invented" by the author of the XML document.
• HTML works with predefined tags like <p>, <h1>, <table>, etc.
• With XML, the author must define both the tags and the document structure.
Books.xml
There are two major types of XML databases:
XML-enabled
Native XML (NXD)
XML enabled database is nothing but the extension provided for the conversion of XML document. This is a
relational database, where data is stored in tables consisting of rows and columns. The tables contain set of
records, which in turn consist of fields.
Native XML database is based on the container rather than table format. It can store large amount of XML
document and data. Native XML database is queried by the XPath-expressions.
Native XML database has an advantage over the XML-enabled database. It is highly capable to store, query and
maintain the XML document than XML-enabled database.
Example
<contact-info>
<contact1>
...
</contact1>
<contact2>
<name>Manisha Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 789-4567</phone>
</contact2>
</contact-info>
Here, a table of contacts is created that holds the records of contacts (contact1 and contact2), which in turn
consists of three entities − name, company and phone.
XML Schema
XML schema is a language which is used for expressing constraints about XML documents. There are many schema languages in use nowadays, for example Relax-NG and XSD (XML Schema Definition).
An XML schema is used to define the structure of an XML document. It is like DTD but provides more control
on XML structure.
Checking Validation
An XML document is called "well-formed" if it contains the correct syntax. A well-formed and valid XML document is one which has been validated against a schema.
employee.xsd
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="employee">
<xs:complexType>
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
<xs:element name="email" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Let's see the xml file using XML schema or XSD file.
employee.xml
<?xml version="1.0"?>
<employee>
<firstname>Natarajan </firstname>
<lastname>Subramanian</lastname>
<email>[email protected]</email>
</employee>
XML Parsers
An XML parser is a software library or package that provides interfaces for client applications to work with an
XML document. The XML parser is designed to read the XML and create a way for programs to use XML. An XML parser validates the document and checks that the document is well-formed.
Let's understand the working of an XML parser by the figure given below:
Types of XML Parsers
1. DOM
2. SAX
A DOM document is an object which contains all the information of an XML document. It is composed like a
tree structure. The DOM Parser implements a DOM API. This API is very simple to use.
Advantages
1) It supports both read and write operations and the API is very simple to use.
Disadvantages
1) It is memory inefficient (consumes more memory because the whole XML document needs to be loaded into memory).
A SAX parser implements the SAX API. This API is an event-based API and less intuitive.
Clients do not know what methods to call; they just override the methods of the API and place their own code inside those methods.
Advantages
1) It is simple and memory efficient.
2) It is very fast and works for huge documents.
Disadvantages
1) It is event-based so its API is less intuitive.
2) Clients never know the full information because the data is broken into pieces.
What is XSL?
XSL is a language for expressing style sheets. An XSL style sheet is, like with CSS, a file that describes how to
display an XML document of a given type. XSL shares the functionality and is compatible with CSS2 (although
it uses a different syntax). It also adds:
A transformation language for XML documents: XSLT. Originally intended to perform complex styling
operations, like the generation of tables of contents and indexes, it is now used as a general purpose
XML processing language. XSLT is thus widely used for purposes other than XSL, like generating
HTML web pages from XML data.
Advanced styling features, expressed by an XML document type which defines a set of elements called Formatting Objects, and attributes (in part borrowed from CSS2 properties, adding more complex ones).
Styling requires a source XML document, containing the information that the style sheet will display, and the style sheet itself, which describes how to display a document of a given type.
The following shows a sample XML file and how it can be transformed and rendered.
<scene>
<FX>General Road Building noises.</FX>
<speech speaker="Prosser">
Come off it Mr Dent, you can't win
you know. There's no point in lying
down in the path of progress.
</speech>
<speech speaker="Arthur">
I've gone off the idea of progress.
It's overrated
</speech>
</scene>
This XML file doesn't contain any presentation information, which is contained in the stylesheet. Separating the
document's content and the document's styling information allows displaying the same document on different
media (like screen, paper, cell phone), and it also enables users to view the document according to their
preferences and abilities, just by modifying the style sheet.
The Stylesheet
Here are two templates from the stylesheet used to format the XML file. The full stylesheet (which includes
extra information on pagination and margins) is available.
...
<xsl:template match="FX">
<fo:block font-weight="bold">
<xsl:apply-templates/>
</fo:block>
</xsl:template>
<xsl:template match="speech[@speaker='Arthur']">
<fo:block background-color="blue">
<xsl:value-of select="@speaker"/>:
<xsl:apply-templates/>
</fo:block>
</xsl:template>
...
The stylesheet can be used to transform any instance of the DTD it was designed for. The first rule says that an
FX element will be transformed into a block with a bold font. <xsl:apply-templates/> is a recursive call to
the template rules for the contents of the current element. The second template applies to all speech elements
that have the speaker attribute set to Arthur, and formats them as blue blocks within which the value of the speaker attribute is added before the text.
What is XSLT
Before XSLT, first we should learn about XSL. XSL stands for EXtensible Stylesheet Language. It is a styling language for XML, just like CSS is a styling language for HTML.
XSLT stands for XSL Transformation. It is used to transform XML documents into other formats (like
transforming XML into HTML).
What is XSL
In HTML documents, tags are predefined but in XML documents, tags are not predefined. World Wide Web
Consortium (W3C) developed XSL to understand and style an XML document, which can act as XML based
Stylesheet Language.
XSLT: It is a language for transforming XML documents into various other types of documents.
XPath: It is a language for navigating in XML documents.
XQuery: It is a language for querying XML documents.
XSL-FO: It is a language for formatting XML documents.
The XSLT stylesheet is written in XML format. It is used to define the transformation rules to be applied on the
target XML document. The XSLT processor takes the XSLT stylesheet and applies the transformation rules on
the target XML document, and then it generates a formatted document in the form of XML, HTML, or text. At the end, this is used by the XSLT formatter to generate the actual output, which is displayed to the end-user.
Advantage of XSLT
XSLT provides an easy way to merge XML data into presentation because it applies user defined
transformations to an XML document and the output can be HTML, XML, or any other structured
document.
XSLT provides XPath to locate elements/attributes within an XML document. So it is a more convenient way to traverse an XML document than the traditional way of using a scripting language.
XSLT is template based. So it is more resilient to changes in documents than low level DOM and SAX.
By using XML and XSLT, the application UI script will look clean and will be easier to maintain.
XSLT templates are based on XPath pattern which is very powerful in terms of performance to process
the XML document.
XSLT can be used as a validation language as it uses tree-pattern-matching approach.
You can change the output simply modifying the transformations in XSL files.
XPath
XPath is a major element in the XSLT standard.
XPath can be used to navigate through elements and attributes in an XML document.
XPath stands for XML Path Language.
XPath uses "path like" syntax to identify and navigate nodes in an XML document.
XPath contains over 200 built-in functions.
XPath is a W3C recommendation.
These path expressions look very much like the path expressions you use with traditional computer
file systems:
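For instance, against the scene document shown earlier (illustrative expressions, not taken from the original notes):
/scene/speech                 selects all speech children of the scene element
//speech[@speaker='Arthur']   selects every speech element, anywhere in the document, whose speaker attribute is 'Arthur'
/scene/FX/text()              selects the text content of the FX element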
MULTIDIMENSIONAL DATA MODEL:
A multidimensional model views data in the form of a data cube. A data cube enables data to be modeled and
viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities with respect to which an organization keeps records.
If we want to view the sales data with a third dimension - for example, data organized by time and item, with
location added for the cities Chennai, Kolkata, Mumbai, and Delhi - the 3D data can be shown in a table, and
such 3D data can be represented as a series of 2D tables.
STAR SCHEMA:
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or
measured, such as a sale or a login. A dimension contains reference data
about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a
multidimensional data model. It is the simplest data warehouse schema. It is
known as a star schema because the entity-relationship diagram of this schema
resembles a star, with points diverging from a central table. The center of the
schema consists of a large fact table, and the points of the star are the
dimension tables.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to the dimensions. A fact table has
two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The
primary key of the fact table is generally a composite key made up of all of its foreign keys.
A fact table may hold facts at the detail level or facts that have been aggregated (fact tables that hold
aggregated facts are often called summary tables instead). A fact table generally contains facts at the same
level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension
has no hierarchies and levels, it is called a flat dimension or list. The primary key of each dimension table is
part of the composite primary key of the fact table. Dimensional attributes help to define the dimensional
values; they are generally descriptive, textual values. Dimension tables are usually much smaller than fact
tables.
Fact tables store data about sales, while dimension tables store data about the geographic region (markets,
cities), clients, products, times, and channels.
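As a minimal sketch of these ideas in SQL (the table and column names here are illustrative assumptions, not taken from the original notes), a sales star schema can be declared as follows; note the fact table's composite primary key made up of its foreign keys:

CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,
    day      INTEGER,
    month    INTEGER,
    quarter  VARCHAR(2),
    year     INTEGER
);

CREATE TABLE item_dim (
    item_key  INTEGER PRIMARY KEY,
    item_name VARCHAR(50),
    brand     VARCHAR(50)
);

-- Fact table: foreign keys to the dimensions plus the measured facts
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    units_sold   INTEGER,
    dollars_sold DECIMAL(10,2),
    PRIMARY KEY (time_key, item_key)  -- composite key of the foreign keys
);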
Characteristics of Star Schema
The star schema is particularly well suited for data warehouse database design because of the following features:
It creates a denormalized database that can quickly provide query responses.
It provides a flexible design that can be changed easily or added to throughout the development cycle,
and as the database grows.
It parallels in design the way end users typically think of and use the data.
It reduces the complexity of metadata for both developers and end users.
Advantages of Star Schema
Star schemas are easy for end users and applications to understand and navigate. With a well-designed schema,
users can instantly analyze large, multidimensional data sets.
The main advantages of star schemas in a decision-support environment are:
Query Performance
Because a star schema database has a small number of tables and clear join paths, queries run faster than they
do against OLTP systems. Small single-table queries, frequently against a dimension table, are almost
instantaneous. Large join queries that involve multiple tables take only seconds or minutes to run.
In a star schema design, the dimensions are connected only through the central fact table. When two dimension
tables are used in a query, only one join path, through the fact table, exists between those two tables. This
design feature enforces accurate and consistent query results.
Load Performance and Administration
Structural simplicity also reduces the time required to load large batches of records into a star schema
database. By defining facts and dimensions and separating them into different tables, the impact of a load
operation is reduced. Dimension tables can be populated once and occasionally refreshed. New facts can be
added regularly and selectively by appending records to the fact table.
Built-in Referential Integrity
A star schema has referential integrity built in when data is loaded. Referential integrity is enforced
because each record in a dimension table has a unique primary key, and all keys in the fact table are legitimate
foreign keys drawn from the dimension tables. A record in the fact table that is not related correctly to a
dimension cannot be given the correct key value to be retrieved.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These
joins are meaningful to the end user because they represent the fundamental relationships between parts of the
underlying business. Users can also browse dimension table attributes before constructing a query.
Disadvantage of Star Schema
Some situations cannot be modeled by a star schema. For example, the relationship between users and bank
accounts cannot be described as a star schema, because the relationship between them is many-to-many.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension tables connected
to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table has columns for item_key,
item_name, brand, type, and supplier_type. The BRANCH table has columns for branch_key, branch_name,
and branch_type. The LOCATION table has columns of geographic data, including street, city, state,
and country.
In this scenario, the SALES table contains only four columns with IDs from the dimension tables, TIME,
ITEM, BRANCH, and LOCATION, instead of four columns for time data, four columns for ITEM data, three
columns for BRANCH data, and four columns for LOCATION data. Thus, the size of the fact table is
significantly reduced. When we need to change an item, we need only make a single change in the dimension
table, instead of making many changes in the fact table.
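A typical decision-support query then joins the fact table to its dimensions through the foreign keys (a hedged sketch continuing the hypothetical time_dim and item_dim tables introduced earlier):

SELECT t.quarter, i.brand, SUM(s.dollars_sold) AS total_sales
FROM   sales_fact s
JOIN   time_dim t ON s.time_key = t.time_key
JOIN   item_dim i ON s.item_key = i.item_key
WHERE  t.year = 2024
GROUP BY t.quarter, i.brand;

Because every dimension connects only through the fact table, there is exactly one join path, which is what makes such star-join queries fast and their results consistent.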
We can create even more complex star schemas by normalizing a dimension table into several tables. The
normalized dimension table is called a snowflake.
A star schema stores all attributes for a dimension in one denormalized
table. This needs more disk space than a more normalized snowflake
schema. Snowflaking normalizes the dimension by moving attributes with
low cardinality into separate dimension tables that relate to the core
dimension table through foreign keys. Snowflaking for the sole purpose of minimizing disk space is not
recommended, because it can adversely impact query performance.
In a snowflake schema, tables are normalized to remove redundancy, and dimension tables are decomposed
into multiple dimension tables.
The figure shows a simple STAR schema for sales in a manufacturing
company. The sales fact table includes quantity, price, and other
relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME
are the dimension tables.
The STAR schema for sales, as shown above, contains only five
tables, whereas the normalized (snowflake) version extends to eleven tables.
Notice that in the snowflake schema, the attributes with low
cardinality in each original dimension table are removed to form
separate tables. These new tables are connected back to the original
dimension table through artificial keys.
Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is like
zooming in on the data cube. It navigates from less detailed data to more detailed data. Drill-down can be
performed either by stepping down a concept hierarchy for a dimension or by adding additional dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a concept
hierarchy defined as day, month, quarter, and year. Drill-down occurs by descending the time hierarchy from
the level of quarter to the more detailed level of month.
Because a drill-down adds more detail to the given data, it can also be performed by adding a new dimension
to a cube. For example, a drill-down on the central cube of the figure can be performed by introducing an
additional dimension, such as customer group.
Example
Drill-down adds more details to the given data
Temperature cool mild hot
Day 1 0 0 0
Day 2 0 0 0
Day 3 0 0 1
Day 4 0 1 0
Day 5 1 0 0
Day 6 0 0 0
Day 7 1 0 0
Day 8 0 0 0
Day 9 1 0 0
Day 10 0 1 0
Day 11 0 1 0
Day 12 0 1 0
Day 13 0 0 1
Day 14 0 0 0
The following diagram illustrates how Drill-down works.
Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a dimension. For
example, a slice operation is performed when the user wants a selection on one dimension of a three-
dimensional cube, resulting in a two-dimensional slice. So the slice operation performs a selection on one
dimension of the given cube, resulting in a subcube.
For example, if we make the selection temperature = cool, we obtain the following cube:
Temperature cool
Day 1 0
Day 2 0
Day 3 0
Day 4 0
Day 5 1
Day 6 1
Day 7 1
Day 8 1
Day 9 1
Day 11 0
Day 12 0
Day 13 0
Day 14 0
The following diagram illustrates how Slice works.
Here, slice is performed on the dimension "time" using the criterion time = "Q1".
It forms a new subcube by selecting one or more dimensions.
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR
temperature = hot) to the original cube, we get the following subcube (still two-dimensional):
Temperature cool hot
Day 3 0 1
Day 4 0 0
Consider the following diagram, which shows the dice operations.
The dice operation on the cube, based on the following selection criteria, involves three dimensions (a SQL sketch of slice and dice follows the criteria):
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item =" Mobile" or "Modem")
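In SQL terms, slice and dice correspond to WHERE-clause selections over the star schema. A minimal sketch, reusing the hypothetical sales_fact, time_dim, and item_dim tables from the earlier sketch and assuming an additional location_dim dimension (loc_key, city) referenced by a loc_key column on the fact table:

-- Slice: select on a single dimension (time = 'Q1')
SELECT s.*
FROM   sales_fact s
JOIN   time_dim t ON s.time_key = t.time_key
WHERE  t.quarter = 'Q1';

-- Dice: select on three dimensions at once
SELECT s.*
FROM   sales_fact s
JOIN   time_dim t     ON s.time_key = t.time_key
JOIN   item_dim i     ON s.item_key = i.item_key
JOIN   location_dim l ON s.loc_key  = l.loc_key
WHERE  (l.city = 'Toronto' OR l.city = 'Vancouver')
  AND  (t.quarter = 'Q1' OR t.quarter = 'Q2')
  AND  (i.item_name = 'Mobile' OR i.item_name = 'Modem');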
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates the data axes in
order to provide an alternative presentation of the data. It may involve swapping the rows and columns, or
moving one of the row dimensions into the column dimensions.
POSSIBLE QUESTIONS
Part A
1) What is an XML database?
2) What is the use of the prolog in XML?
3) Explain XML Schema and how to validate an XML document against it.
4) What is XSLT? What are the parts of an XSL document?
5) Write a note on XPath.
6) Define XQuery and write a program to implement an XQuery.
7) What is a fact table?
Part B
1) What is XML? Explain its data types.
2) Explain in detail about parsers in XML.
3) What is XSL and how does it work on XML?
4) What is XSLT and how does it work? Explain with a diagram.
5) What is a data warehouse? Explain its goals.
6) What is multidimensional data modelling?
7) Explain in detail about the star schema.
8) Explain in detail about the snowflake schema.
9) Explain the OLAP operations in the multidimensional data model.
UNIT V
IR CONCEPTS:
The problem of IR
Goal = find documents relevant to an information need from a large document set
Possible approaches
Introduction
Text mining refers to data mining using text documents as data.
Most text mining tasks use Information Retrieval (IR) methods to pre-process text documents.
These methods are quite different from traditional data pre-processing methods used for relational
tables.
Web search also has its root in IR.
IR architecture
IR queries
Keyword queries
Boolean queries (using AND, OR, NOT)
Phrase queries
Proximity queries
Full document queries
Natural language questions
1. Keyword Queries :
These are the simplest and most commonly used queries: the user enters keyword combinations, and
documents are retrieved and ranked based on the presence of the query keywords.
2. Boolean Queries :
Some IR systems allow the use of +, -, AND, OR, NOT, and parentheses as Boolean operators in combination
with keyword formulations.
No ranking is involved, because a document either satisfies such a query or does not satisfy it.
A document is retrieved for a Boolean query if the query is logically true as an exact match in the document.
3. Phrase Queries :
When documents are represented using an inverted keyword index for searching, the relative order of
terms in a document is lost.
To perform exact phrase retrieval, phrases are encoded in the inverted index or implemented differently.
This query consists of a sequence of words that make up a phrase.
It is generally enclosed within double quotes.
4. Proximity Queries :
Proximity refers to a search that accounts for how close within a record multiple terms should be to each
other.
The most commonly used proximity search option is a phrase search that requires terms to be in exact order.
Other proximity operators can specify how close terms should be to each other; some also specify the
order of the search terms.
Search engines use various operator names such as NEAR, ADJ (adjacent), or AFTER.
However, providing support for complex proximity operators becomes expensive, as it requires time-
consuming pre-processing of documents, and so it is suitable for smaller document collections rather
than for the web.
5. Natural Language Queries :
Only a few natural language search engines aim to understand the structure and meaning of queries
written in natural language text, generally as a question or narrative.
The system tries to formulate answers to these queries from the retrieved results.
Semantic models can provide support for this query type.
Main models:
Boolean model
Vector space model
Boolean model
Each document or query is treated as a "bag" of words or terms. Word sequence is not considered.
Given a collection of documents D, let V = {t_1, t_2, ..., t_|V|} be the set of distinctive words/terms in the
collection. V is called the vocabulary.
A weight w_ij > 0 is associated with each term t_i of a document d_j ∈ D. For a term that does not appear in
document d_j, w_ij = 0.
Each document d_j is thus represented as the vector d_j = (w_1j, w_2j, ..., w_|V|j).
Query terms are combined logically using the Boolean operators AND, OR, and NOT.
E.g., ((data AND mining) AND (NOT text))
Retrieval
Given a Boolean query, the system retrieves every document that makes the query logically true.
Called exact match.
The retrieval results are usually quite poor because term frequency is not considered
Text pre-processing
Word (term) extraction: easy
Stopwords removal
Stemming
Frequency counts and computing TF-IDF term weights.
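TF-IDF, mentioned in the last item, is not defined elsewhere in these notes; the standard weighting (stated here for completeness) is

w_ij = tf_ij * log(N / df_i)

where tf_ij is the frequency of term t_i in document d_j, df_i is the number of documents that contain t_i, and N is the total number of documents in the collection. A term that occurs often in a document but rarely in the collection thus receives a high weight.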
1. Stopword Removal
Stopwords are very commonly used words in a language that play a major role in the formation of a
sentence but seldom contribute to the meaning of that sentence. Words that are expected to occur in
80 percent or more of the documents in a collection are typically referred to as stopwords, and they are
rendered potentially useless for retrieval. Because of the commonness and function of these words, they do not
contribute much to the relevance of a document for a query search. Examples include words such as the, of,
to, a, and, in, said, for, that, was, on, he, is, with, at, by, and it. These words are presented here in
decreasing frequency of occurrence from a large corpus of documents called AP89. The first six of these
words account for 20 percent of all words in the listing, and the most frequent 50 words account for 40
percent of all text.
Removal of stopwords from a document must be performed before indexing. Articles, prepositions,
conjunctions, and some pronouns are generally classified as stopwords. Queries must also be preprocessed
for stopword removal before the actual retrieval process. Removal of stopwords results in elimination of
possible spurious indexes, thereby reducing the size of an index structure by about 40 percent or more.
However, doing so could impact the recall if the stopword is an integral part of a query (for example, a
search for the phrase ‘To be or not to be,’ where removal of stopwords makes the query inappropriate, as all
the words in the phrase are stopwords). Many search engines do not employ query stopword removal for
this reason.
2. Stemming
A stem of a word is the word obtained after trimming the suffix and prefix of an original word.
For example, 'comput' is the stem for computer, computing, and computation. Suffixes and
prefixes are very common in the English language for supporting the notion of verbs, tenses, and plural
forms. Stemming reduces the different forms of a word formed by inflection (due to plurals or tenses) and
derivation to a common stem.
A stemming algorithm can be applied to reduce any word to its stem. In English, the most famous stemming
algorithm is Martin Porter's stemming algorithm. The Porter stemmer is a simplified version of Lovins's
technique that uses a reduced set of about 60 rules (down from 260 suffix patterns in Lovins's technique) and
organizes them into sets; conflicts within one subset of rules are resolved before going on to the next. Using
stemming for preprocessing results in a decrease in the size of the indexing structure and an increase in
recall, possibly at the cost of precision.
3. Utilizing a Thesaurus
A thesaurus comprises a precompiled list of important concepts and the main word that describes each
concept for a particular domain of knowledge. For each concept in this list, a set of synonyms and related
words is also compiled. Thus, a synonym can be converted to its matching concept during preprocessing.
This preprocessing step assists in providing a standard vocabulary for indexing and searching. Usage of a
thesaurus, also known as a collection of synonyms, has a substantial impact on the recall of information
systems. This process can be complicated because many words have different meanings in different
contexts.
UMLS is a large biomedical thesaurus of millions of concepts (called the Metathesaurus) and a semantic
network of meta-concepts and relationships that organize the Metathesaurus (see Figure 27.3). The concepts
are assigned labels from the semantic network. This thesaurus of concepts contains synonyms of medical
terms, hierarchies of broader and narrower terms, and other relationships among words and concepts that
make it a very extensive resource for information retrieval of documents in the medical domain. Figure 27.3
illustrates part of the UMLS Semantic Network.
WordNet is a manually constructed thesaurus that groups words into strict synonym sets called synsets.
These synsets are divided into noun, verb, adjective, and adverb categories. Within each category, these
synsets are linked together by appropriate relationships such as class/subclass or "is-a" relationships for
nouns.
WordNet is based on the idea of using a controlled vocabulary for indexing, thereby eliminating
redundancies. It is also useful in providing assistance to users with locating terms for proper query
formulation.
4. Other Preprocessing Steps: Digits, Hyphens, Punctuation Marks, Cases
Digits, dates, phone numbers, e-mail addresses, URLs, and other standard types of text may or may not be
removed during preprocessing. Web search engines, however, index them in order to use this type of
information in the document metadata to improve precision and recall (see Section 27.6 for detailed
definitions of precision and recall).
Hyphens and punctuation marks may be handled in different ways. Either the entire phrase with the
hyphens/punctuation marks may be used, or they may be eliminated. In some systems, the character
representing the hyphen/punctuation mark may be removed, or may be replaced with a space. Different
information retrieval systems follow different rules of processing. Handling hyphens automatically can be
complex: it can either be done as a classification problem, or more commonly by some heuristic rules.
Most information retrieval systems perform case-insensitive search, converting all the letters of the text to
uppercase or lowercase. It is also worth noting that many of these text preprocessing steps are language
specific, such as involving accents and diacritics and the idiosyncrasies that are associated with a particular
language.
5. Information Extraction
Information extraction (IE) is a generic term used for extracting structured content from text. Text
analytic tasks such as identifying noun phrases, facts, events, people, places, and relationships are examples
of IE tasks. These tasks are also called named entity recognition tasks and use rule-based approaches with
either a thesaurus, regular expressions and grammars, or probabilistic approaches. For IR and search
applications, IE technologies are mostly used to identify contextually relevant features that involve text
analysis, matching, and categorization for improving the relevance of search systems. Language
technologies using part-of-speech tagging are applied to semantically annotate the documents with extracted
features to aid search relevance.
Web Search as a huge IR system
A Web crawler (robot) crawls the Web to collect all the pages.
Servers establish a huge inverted indexing database and other indexing databases
At query (search) time, search engines conduct different types of vector query matching.
Inverted index
The inverted index of a document collection is basically a data structure that
associates each distinctive term with a list of all documents that contain the term.
Thus, in retrieval, it takes constant time to find the documents that contain a query term.
Multiple query terms are also easy to handle, as we will see soon.
An example:
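For instance (a toy collection assumed for illustration), given three documents
d1 = "data mining", d2 = "text mining", d3 = "web data"
the inverted index is:
data   -> {d1, d3}
mining -> {d1, d2}
text   -> {d2}
web    -> {d3}
A query such as data AND mining is answered by intersecting the lists for data and mining, yielding {d1}.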
Evaluation Measures:
Evaluation measures for an information retrieval system are used to assess how well the search results
satisfy the user's query intent. Such metrics are often split into two kinds: online metrics look at users'
interactions with the search system, while offline metrics measure relevance, in other words, how likely each
result, or the search engine results page (SERP) as a whole, is to meet the information needs of the user.
Topics: Web Search and Analytics - Ontology-based Search - Current trends.
Web:
A huge, widely distributed, highly heterogeneous, semistructured, interconnected, evolving
hypertext/hypermedia information repository.
Main issues:
Abundance of information: 99% of all the information is not interesting to 99% of all users, and the
static Web is only a very small part of the whole Web; much of the Web is dynamic.
To access the Web, users need to exploit search engines (SE).
Search engines must be improved:
To help people better formulate their information needs
More personalization is needed
• WordNet
– A large lexical database organized in terms of meanings.
– Nouns, Adjectives, Adverbs, and Verbs
– Synonym words are grouped into synset
{car, auto, automobile, machine, motorcar}
{food, nutrient}
{police, police force, constabulary, law}
– Number of words, synsets, and senses
Possible Questions:
Part A
1) Write a note on IR concepts.
2) Explain the models of IR concepts.
3) Write a note on
a. Stemming
b. Thesaurus
4) What is WordNet?
5) What is ontology-based search?
6) What are the current trends in IR models?
Part B
1) Explain in detail about the information retrieval architecture.
2) What are the types of queries in information retrieval?
3) Explain in detail about text pre-processing.
4) Explain about information extraction.
5) What are the evaluation measures for searching text on the Web?
UNIT V COMPLETED
I MCA Unit 1 MC4202 - ADVANCED DATABASE TECHNOLOGY
UNIT 1
DISTRIBUTED DATABASES
1. Elaborate how the data spread over multiple machines? Explain its architecture.
Distributed Systems
Data spread over multiple machines (also referred to as sites or nodes).
Network interconnects the machines
Data shared by users on multiple machines
Distributed Database
Homogeneous distributed databases
o Same software/schema on all sites, data may be partitioned among sites
o Goal: provide a view of a single database, hiding details of distribution
Sharing data - users at one site are able to access data residing at other sites.
Autonomy - each site is able to retain a degree of control over data stored locally.
Higher system availability through redundancy – data can be replicated at remote sites, and system
can function even if a site fails.
Disadvantage: added complexity required to ensure proper coordination among sites.
o Software development cost.
o Greater potential for bugs.
o Increased processing overhead.
Atomicity is needed even for transactions that update data at multiple sites
The two-phase commit protocol (2PC) is used to ensure atomicity
o Basic idea: each site executes the transaction until just before commit, and then leaves the final
decision to a coordinator
o Each site must follow the decision of the coordinator, even if there is a failure while waiting for
the coordinator's decision.
2PC is not always appropriate: other transaction models based on persistent messaging, and
workflows, are also used
Distributed concurrency control (and deadlock detection) required
Data items may be replicated to improve data availability
Network Types
Local-area networks (LANs) - composed of processors that are distributed over small geographical
areas, such as a single building or a few adjacent buildings.
Wide-area networks (WANs) - composed of processors distributed over a large geographical area.
Storage-area network
A storage area network (SAN) is a high-speed network that provides access to data storage at the block
level. It connects servers with storage devices like disk arrays, RAID hardware, and tape libraries.
Tape library
A tape library is a storage system that contains multiple tape drives. It is essentially a collection of tapes
and tape drives that store information, usually for backup.
Network Types
WANs with continuous connection (e.g., the Internet) are needed for implementing distributed
database systems
Groupware applications such as Lotus notes can work on WANs with discontinuous connection:
o Data is replicated.
o Updates are propagated to replicas periodically.
o Copies of data may be updated independently.
2. How the relation is partitioned into several fragments and stored in distinct sites? Explain with
diagram.
Distributed Data Storage
Data Replication
A relation or fragment of a relation is replicated if it is stored redundantly at two or more sites.
In the most extreme case we have full replication, where the relation is stored at all sites.
Fully redundant databases are those in which every site contains a copy of the entire database.
Advantages of Replication
o Availability: failure of a site containing relation r does not result in unavailability of r if
replicas exist.
o Parallelism: queries on r may be processed by several nodes in parallel.
o Reduced data transfer: relation r is available locally at each site containing a replica of
r.
Disadvantages of Replication
o Increased cost of updates: each replica of relation r must be updated.
o Increased complexity of concurrency control: concurrent updates to distinct replicas may
lead to inconsistent data unless special concurrency control mechanisms are
implemented.
One solution: choose one copy as primary copy and apply concurrency control
operations on primary copy.
Data Fragmentation
Division of relation r into fragments r1, r2, …, rn which contain sufficient information to
reconstruct relation r.
Horizontal fragmentation: each tuple of r is assigned to one or more fragments
Vertical fragmentation: the schema for relation r is split into several smaller schemas
All schemas must contain a common candidate key (or superkey) to ensure lossless join property.
A special attribute, the tuple-id attribute may be added to each schema to serve as a candidate
key.
account_number  balance  tuple_id
A-305             500        1
A-226             336        2
A-177             205        3
A-402           10000        4
A-155              62        5
A-408            1123        6
A-639             750        7
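As a hedged SQL sketch of the two kinds of fragments described above (relation, column, and branch names are illustrative assumptions), horizontal fragments are selections and vertical fragments are projections that both retain the tuple_id:

-- Horizontal fragmentation: each fragment holds the tuples of one branch
CREATE VIEW account1 AS
    SELECT * FROM account WHERE branch_name = 'Hillside';
CREATE VIEW account2 AS
    SELECT * FROM account WHERE branch_name = 'Valleyview';

-- Vertical fragmentation: split the schema; tuple_id appears in both
-- fragments so that account can be reconstructed by joining on it
CREATE VIEW account_v1 AS
    SELECT branch_name, customer_name, tuple_id FROM account;
CREATE VIEW account_v2 AS
    SELECT account_number, balance, tuple_id FROM account;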
Advantages of Fragmentation
Horizontal:
allows parallel processing on fragments of a relation
allows a relation to be split so that tuples are located where they are most frequently
accessed
Vertical:
allows tuples to be split so that each part of the tuple is stored where it is most
frequently accessed
tuple-id attribute allows efficient joining of vertical fragments
allows parallel processing on a relation
Vertical and horizontal fragmentation can be mixed.
Fragments may be successively fragmented to an arbitrary depth
Data Transparency
Data transparency: Degree to which system user may remain unaware of the details of how and where the
data items are stored in a distributed system
Consider transparency issues in relation to:
o Fragmentation transparency
o Replication transparency
o Location transparency
Structure:
o name server assigns all names
o each site maintains a record of local data items
o sites ask name server to locate non-local data item
Advantages:
o satisfies naming criteria 1-3
Disadvantages:
o does not satisfy naming criterion 4
o the name server is a potential performance bottleneck
o the name server is a single point of failure.
Use of Aliases
Alternative to the centralized scheme: each site prefixes its own site identifier to any name that it
generates, e.g., site17.account.
o Fulfills having a unique identifier, and avoids problems associated with central control.
o However, fails to achieve network transparency.
Solution: Create a set of aliases for data items; Store the mapping of aliases to the real names at
each site.
The user can be unaware of the physical location of a data item, and is unaffected if the data item
is moved from one site to another.
3. Name the protocols used to ensure atomicity across sites and explain how the Distributed
Transactions occurs.
Distributed Transactions
o Network partition
A network is said to be partitioned when it has been split into two or more
subsystems that lack any connection between them
- Note: a subsystem may consist of a single node
Network partitioning and site failures are generally indistinguishable.
Commit Protocols
Assumes fail-stop model - failed sites simply stop working, and do not cause any other harm, such
as sending incorrect messages to other sites.
Execution of the protocol is initiated by the coordinator after the last step of the transaction has been
reached.
The protocol involves all the local sites at which the transaction executed.
Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci.
T can be committed if Ci has received a <ready T> message from all the participating sites; otherwise
T must be aborted.
The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto
stable storage. Once the record is on stable storage, the decision is irrevocable (even if failures occur).
The coordinator sends a message to each participant informing it of the decision (commit or abort).
Participants take appropriate action locally.
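For example (site names assumed for illustration): coordinator C1 sends <prepare T> to participants S2 and S3; both force <ready T> to their logs and reply ready. C1 then forces <commit T> to stable storage and sends the commit decision to S2 and S3, which commit T locally. Had either participant replied abort (or failed to reply), C1 would instead have logged <abort T> and sent the abort decision.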
When site Si recovers, it examines its log to determine the fate of transactions active at the time of the
failure:
Log contains <commit T> record: the transaction had committed; nothing more needs to be done.
Log contains <abort T> record: the transaction had aborted; nothing more needs to be done.
Log contains <ready T> record: the site must consult Ci to determine the fate of T.
o If T committed, redo (T); write <commit T> record
o If T aborted, undo (T)
If the coordinator fails while the commit protocol for T is executing, then the participating sites must decide on T's
fate:
1. If an active site contains a <commit T> record in its log, then T must be committed.
2. If an active site contains an <abort T> record in its log, then T must be aborted.
3. If some active participating site does not contain a <ready T> record in its log, then the
failed coordinator Ci cannot have decided to commit T.
Sites can therefore abort T; however, such a site must reject any subsequent <prepare
T> message from Ci.
4. If none of the above cases holds, then all active sites must have a <ready T> record in their
logs, but no additional control records (such as <abort T> or <commit T>).
In this case, active sites must wait for Ci to recover to find the decision.
Blocking problem: active sites may have to wait for the failed coordinator to recover.
4. How the Recovery and Concurrency Control works? Explain 3 phase commit and Implementation
of Persistent Messaging.
Recovery and Concurrency Control
In-doubt transactions have a <ready T> log record, but neither a <commit T> nor an <abort T> log record.
The recovering site must determine the commit-abort status of such transactions by contacting other
sites; this can be slow and can potentially block recovery.
Recovery algorithms can note lock information in the log.
o Instead of <ready T>, write out <ready T, L>, where L = the list of locks held by T when the log
record is written (read locks can be omitted).
o For every in-doubt transaction T, all the locks noted in the <ready T, L> log record are
reacquired.
After lock reacquisition, transaction processing can resume; the commit or rollback of in-doubt
transactions is performed concurrently with the execution of new transactions.
Atomicity issue
o Once the transaction sending a message commits, the message is guaranteed to be delivered.
o The guarantee holds as long as the destination site is up and reachable; code to handle
undeliverable messages must also be available (e.g., credit money back to the source account).
o If the sending transaction aborts, the message must not be sent.
Workflows provide a general model of transactional processing involving multiple sites and
possibly human processing of certain steps
o E.g. when a bank receives a loan application, it may need to
Contact external credit-checking agencies
Get approvals of one or more managers and then respond to the loan application
We study workflows in Chapter 25
Persistent messaging forms the underlying infrastructure for workflows in a distributed environment
o In many messaging systems, it is possible for messages to get delayed arbitrarily, although
such delays are very unlikely.
Each message is given a timestamp, and if the timestamp of a received message is
older than some cutoff, the message is discarded.
All messages recorded in the received messages relation that are older than the
cutoff can be deleted.
5. What is Concurrency Control? Explain in detail about Distributed lock manager approaches with
example.
Concurrency Control
Advantages
Implementation is simple
Deadlock handling is simple
Disadvantages
Lock manager site becomes a bottleneck
Lock manager site is vulnerable to failure
Distributed lock manager approach
Lock managers of all sites involved
Local data items are controlled by local lock managers
Variants
o Primary copy protocol
o Majority protocol
o Biased protocol
o Quorum consensus protocol
Advantages
It works as follows:
1. The protocol assigns each site that has a replica a weight.
2. For any data item, the protocol assigns a read quorum Qr and a write quorum Qw. Here, Qr and Qw are
two integers (sums of the weights of some sites), and these two integers are chosen according to the
following conditions, taken together:
Qr + Qw > S - this rule avoids read-write conflicts (i.e., two transactions cannot read and write
concurrently)
2 * Qw > S - this rule avoids write-write conflicts (i.e., two transactions cannot write
concurrently)
Here, S is the total weight of all sites at which the data item is replicated.
o The total weight of the locks acquired for a read operation must be >= Qr, and for a write operation >= Qw.
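For example, suppose a data item is replicated at four sites, each with weight 1, so S = 4. Choosing Qr = 2 and Qw = 3 satisfies both rules: Qr + Qw = 5 > 4 and 2 * Qw = 6 > 4. Any read quorum then overlaps every write quorum, and any two write quorums overlap, so reads always see the latest committed write and two writes cannot proceed concurrently.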
Timestamping
With multimaster replication (also called update-anywhere replication) updates are permitted at any
replica, and are automatically propagated to all replicas
o Basic model in distributed databases, where transactions are unaware of the details of
replication, and database system propagates updates as part of the same transaction
Coupled with 2 phase commit
Many systems support lazy propagation where updates are transmitted after transaction commits
o Allows updates to occur even if some sites are disconnected from the network, but at the
cost of consistency
Deadlock Handling
Consider the following two transactions and history, with item X and transaction T1 at site 1, and item Y
and transaction T2 at site 2:
Centralized Approach:
A global wait-for graph is constructed and maintained in a single site; the deadlock-detection
coordinator
o Real graph: Real, but unknown, state of the system.
o Constructed graph: Approximation generated by the controller during the execution of its
algorithm.
the global wait-for graph can be constructed when:
o A new edge is inserted in or removed from one of the local wait-for graphs.
o A number of changes have occurred in a local wait-for graph.
o the coordinator needs to invoke cycle-detection.
If the coordinator finds a cycle, it selects a victim and notifies all sites. The sites roll back the victim
transaction.
[Figure: global wait-for graph - initial state]
False Cycles
Suppose that an insert message for an edge reaches the coordinator before the corresponding delete message
o this can happen due to network delays
The coordinator would then find a false cycle
T1 → T2 → T3 → T1
The false cycle above never existed in reality.
False cycles cannot occur if two-phase locking is used.
Unnecessary Rollbacks
Unnecessary rollbacks may result when deadlock has indeed occurred and a victim has been picked,
and meanwhile one of the transactions was aborted for reasons unrelated to the deadlock.
Unnecessary rollbacks can result from false cycles in the global wait-for graph; however, likelihood
of false cycles is low.
Availability
High availability: time for which system is not fully usable should be extremely low (e.g. 99.99%
availability)
Robustness: ability of the system to function in spite of failures of components
Failures are more likely in large distributed systems
To be robust, a distributed system must
o Detect failures
o Reconfigure the system so computation may continue
o Recovery/reintegration when a site or link is repaired
Failure detection: distinguishing link failure from site failure is hard
(partial) solution: have multiple links, multiple link failure is likely a site failure
Reconfiguration
Reconfiguration:
o Abort all transactions that were active at a failed site
Making them wait could interfere with other transactions since they may hold locks
on other sites
However, in case only some replicas of a data item failed, it may be possible to
continue transactions that had accessed data at a failed site (more on this later)
o If replicated data items were at failed site, update system catalog to remove them from the
list of replicas.
If a failed site was a central server for some subsystem, an election must be held to determine the
new server
o E.g. name server, concurrency coordinator, global deadlock detector
Since network partition may not be distinguishable from site failure, the following situations must
be avoided
o Two or more central servers elected in distinct partitions
o More than one partition updates a replicated data item
Updates must be able to continue even if some sites are down
Solution: majority based approach
o Alternative of “read one write all available” is tantalizing but causes problems
Majority-Based Approach
The majority protocol for distributed concurrency control can be modified to work even if some
sites are unavailable
o Each replica of each item has a version number which is updated when the replica is
updated, as outlined below
o A lock request is sent to at least 1/2 the sites at which item replicas are stored and operation
continues only when a lock is obtained on a majority of the sites
o Read operations look at all replicas locked, and read the value from the replica with largest
version number
May write this value and version number back to replicas with lower version
numbers (no need to obtain locks on all replicas for this task)
Majority protocol (Cont.)
o Write operations
find highest version number like reads, and set new version number to old highest
version + 1
Writes are then performed on all locked replicas and version number on these
replicas is set to new version number
o Failures (network and site) cause no problems as long as
Sites at commit contain a majority of replicas of any updated data items
During reads a majority of replicas are available to find version number
Subject to above, 2 phase commit can be used to update replicas
o Note: reads are guaranteed to see latest version of data item
o Reintegration is trivial: nothing needs to be done
Quorum consensus algorithm can be similarly extended
o If site was aware of failure reintegration could have been performed, but no way to
guarantee this
o With network partitioning, sites in each partition may update same item concurrently
Site Reintegration
When a failed site recovers, it must catch up with all updates that it missed while it was down
o Problem: updates may be happening to items whose replica is stored at the site while the
site is recovering
o Solution 1: halt all updates on system while reintegrating a site
Unacceptable disruption
o Solution 2: lock all replicas of all data items at the site, update to latest version, then
release locks
Other solutions with better concurrency also available
Comparison with Remote Backup
Remote backup (hot spare) systems (Section 17.10) are also designed to provide high availability
Remote backup systems are simpler and have lower overhead
o All actions performed at a single site, and only log records shipped
o No need for distributed concurrency control, or 2 phase commit
Using distributed databases with replicas of data items can provide higher availability by having
multiple (> 2) replicas and using the majority protocol
o Also avoid failure detection and switchover time associated with remote backup systems
Coordinator Selection
Backup coordinators
o a site which maintains enough information locally to assume the role of coordinator if the
actual coordinator fails
o executes the same algorithms and maintains the same internal state information as the actual
coordinator
o allows fast recovery from coordinator failure but involves overhead during normal
processing.
Election algorithms
o used to elect a new coordinator in case of failures
o Example: Bully Algorithm - applicable to systems where every site can send a message to
every other site.
Bully Algorithm
If site Si sends a request that is not answered by the coordinator within a time interval T, Si assumes that
the coordinator has failed and tries to elect itself as the new coordinator.
Si sends an election message to every site with a higher identification number; Si then waits for any
of these sites to answer within T.
If there is no response within T, Si assumes that all sites with numbers greater than i have failed, and Si elects
itself the new coordinator.
If an answer is received, Si begins a time interval T', waiting to receive a message that a site with a higher
identification number has been elected.
If no message is sent within T', Si assumes the site with the higher number has failed, and Si restarts the
algorithm.
After a failed site recovers, it immediately begins execution of the same algorithm.
If there are no active sites with higher numbers, the recovered site forces all sites with lower
numbers to let it become the coordinator site, even if there is a currently active coordinator with a
lower number.
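For example (site numbers assumed for illustration): suppose sites S1 through S4, with S4 as coordinator. If S4 fails and S2 times out waiting for it, S2 sends election messages to S3 and S4. S3 answers, so S2 stops and waits; S3 in turn sends an election message to S4 only, receives no response within T, and elects itself the new coordinator, informing the other sites.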
Availability
Brewer's CAP "Theorem": of the three properties consistency, availability, and partition tolerance, a system
can have at most two.
Very large systems will partition at some point, so one must choose between consistency and availability:
o Traditional databases choose consistency.
o Most Web applications choose availability,
except for specific parts such as order processing.
Replication with Weak Consistency
Many systems support replication of data with weak degrees of consistency (i.e., without a guarantee
of serializability)
o i.e., Qr + Qw <= S or 2 * Qw <= S
o Usually only when not enough sites are available to ensure a quorum,
but sometimes to allow fast local reads.
Key issues:
Eventual Consistency
When no updates occur for a long period of time, eventually all updates will propagate through the
system and all the nodes will be consistent.
For a given accepted update and a given node, eventually either the update reaches the node or the
node is removed from service. This is known as BASE (Basically Available, Soft state, Eventual
consistency), as opposed to ACID.
Eventually consistent - copies become consistent at some later time if there are no more updates
to that data item.
Availability vs Latency
o Even if partitions are rare, applications may trade off consistency for latency
7. How the queries will be processed in distributed system? Explain in detail with Query Processing
Strategies.
Distributed Query Processing
For centralized systems, the primary criterion for measuring the cost of a particular strategy is the
number of disk accesses.
In a distributed system, other issues must be taken into account:
o The cost of a data transmission over the network.
o The potential gain in performance from having several sites process parts of the query in
parallel.
Query Transformation
Since account1 has only tuples pertaining to the Hillside branch, we can eliminate the selection
operation.
This expression is the empty set regardless of the contents of the account relation.
The final strategy is for the Hillside site to return account1 as the result of the query.
o account at S1
o depositor at S2
o branch at S3
o For a query issued at site S1, the system needs to produce the result at site S1.
Ship copies of all three relations to site S1, and choose a strategy for processing the entire query locally at
site S1.
Ship a copy of the account relation to site S2, and compute temp1 = account ⋈ depositor at S2. Ship
temp1 from S2 to S3, and compute temp2 = temp1 ⋈ branch at S3. Ship the result temp2 to S1.
Devise similar strategies, exchanging the roles of S1, S2, S3.
Must consider following factors:
o amount of data being shipped
o cost of transmitting a data block between sites
o relative processing speed at each site
Semijoin Strategy
Formal Definition
Many database applications require data from a variety of preexisting databases located in a
heterogeneous collection of hardware and software platforms
Data models may differ (hierarchical, relational, etc.)
Transaction commit protocols may be incompatible
Concurrency control may be based on different techniques (locking, timestamping, etc.)
System-level details almost certainly are totally incompatible.
A multidatabase system is a software layer on top of existing database systems, which is designed
to manipulate information in heterogeneous databases
o Creates an illusion of logical database integration without any physical database integration
Advantages
o Hardware
o system software
o Applications
Local autonomy and administrative control
Allows use of special-purpose DBMSs
Step towards a unified homogeneous DBMS
o Full integration into a homogeneous DBMS faces
Technical difficulties and cost of conversion
Organizational/political difficulties
-Organizations do not want to give up control on their data
-Local databases wish to retain a great deal of autonomy
Query Processing
Mediator Systems
Mediator systems are systems that integrate multiple heterogeneous data sources by providing an
integrated global view, and providing query facilities on global view
o Unlike full-fledged multidatabase systems, mediators generally do not bother about
transaction processing
o But the terms mediator and multidatabase are sometimes used interchangeably
o The term virtual database is also used to refer to mediator/multidatabase systems
Local transactions are executed by each local DBMS, outside of the MDBS system control.
UNIT II
SPATIAL AND TEMPORAL DATABASES
SPATIAL AND TEMPORAL DATABASES INTRODUCTION:
What is spatial database in DBMS?
A spatial database is a general-purpose database (usually a relational database) that has been
enhanced to include spatial data that represents objects defined in a geometric space, along with
tools for querying and analyzing such data.
What is temporal database in DBMS?
A temporal database is a database that has certain features that support time-sensitive status
for entries. Where some databases are considered current databases and only support factual data
considered valid at the time of use, a temporal database can establish at what times certain entries
are accurate.
A temporal database stores data relating to time instances. It offers temporal data types and stores
information relating to past, present and future time.
A spatiotemporal database is a database that manages both space and time information.
What is active database system?
An active database is a database that includes an event-driven architecture (often in the form
of ECA rules) which can respond to conditions both inside and outside the database. Possible
uses include security monitoring, alerting, statistics gathering and authorization.
Which model is used to implement active databases?
The model that has been used to specify active database rules is referred to as the Event-
Condition-Action (ECA) model.
1. Explain in detail about active database model. or
List the three components in ECA Model and explain how to create trigger for a
relation.
ACTIVE DATABASE MODEL
The model that has been used to specify active database rules is referred to as the event-condition-
action (ECA) model. A rule in the ECA model has three components:
1. The event(s) that triggers the rule: These events are usually database update operations that are
explicitly applied to the database. However, in the general model, they could also be temporal
events or other kinds of external events.
2. The condition that determines whether the rule action should be executed: Once the triggering
event has occurred, an optional condition may be evaluated. If no condition is specified, the action
will be executed once the event occurs. If a condition is specified, it is first evaluated, and only if
it evaluates to true will the rule action be executed.
3. The action to be taken: The action is usually a sequence of SQL statements, but it could also be
a database transaction or an external program that will be automatically executed.
TRIGGER:A database trigger is procedural code that is automatically executed in response to
certain events on a particular table or view in a database. The trigger is mostly used for maintaining
the integrity of the information on the database.
The basic events that can be specified for triggering the active rules are the standard SQL
update commands: INSERT, DELETE, and UPDATE. They are specified by the keywords
INSERT, DELETE, and UPDATE in Oracle notation.
The keywords NEW and OLD are used in Oracle notation; NEW is used to refer to a newly
inserted or newly updated tuple, whereas OLD is used to refer to a deleted tuple or to a
tuple before it was updated.
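As a minimal sketch in Oracle-style notation (the employee and salary_log tables and their columns are assumptions for illustration, not from the original notes), the following row-level trigger fires after each update of an employee's salary and records the old and new values using the OLD and NEW references described above:

CREATE OR REPLACE TRIGGER salary_audit
AFTER UPDATE OF salary ON employee
FOR EACH ROW
BEGIN
    -- :OLD holds the tuple before the update, :NEW the tuple after it
    INSERT INTO salary_log (emp_id, old_salary, new_salary)
    VALUES (:OLD.emp_id, :OLD.salary, :NEW.salary);
END;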
Rules can be deactivated, activated, or dropped by referring to their rule names. A deactivated rule will not be triggered by the triggering event.
This feature allows users to selectively deactivate rules for certain periods of time when they are
not needed. The activate command will make the rule active again. The drop command deletes the
rule from the system. Another option is to group rules into named rule sets, so the whole set of
rules can be activated, deactivated, or dropped. It is also useful to have a command that can trigger
a rule or rule set via an explicit PROCESS RULES command issued by the user.
The second issue concerns whether the triggered action should be executed before, after, instead
of, or concurrently with the triggering event. A before trigger executes the trigger before executing
the event that caused the trigger. It can be used in applications such as checking for constraint
violations. An after trigger executes the trigger after executing the event, and it can be used in
applications such as maintaining derived data and monitoring for specific events and conditions.
An instead of trigger executes the trigger instead of executing the event, and it can be used in
applications such as executing corresponding updates on base relations in response to an event that
is an update of a view.
Let us assume that the triggering event occurs as part of a transaction execution. We should first
consider the various options for how the triggering event is related to the evaluation of the rule’s
condition. The rule condition evaluation is also known as rule consideration, since the action is to
be executed only after considering whether the condition evaluates to true or false. There are three
main possibilities for rule consideration:
1. Immediate consideration: The condition is evaluated as part of the same transaction as the
triggering event and is evaluated immediately. This case can be further categorized into three
options: Evaluate the condition before executing the triggering event. Evaluate the condition after
executing the triggering event. Evaluate the condition instead of executing the triggering event.
2. Deferred consideration: The condition is evaluated at the end of the transaction that included
the triggering event. In this case, there could be many triggered rules waiting to have their
conditions evaluated.
3. Detached consideration: The condition is evaluated as a separate transaction, spawned from
the triggering transaction.
Most active systems use the first option. That is, as soon as the condition is evaluated, if it returns
true, the action is immediately executed.
Another issue concerning active database rules is the distinction between row-level rules and
statement-level rules. The SQL-99 standard and the Oracle system allow the user to choose which
of the options is to be used for each rule, whereas STARBURST uses statement-level semantics
only.
In STARBURST, the basic events that can be specified for triggering the rules are the standard
SQL update commands: INSERT, DELETE, and UPDATE. These are specified by the keywords
INSERTED, DELETED, and UPDATED in STARBURST notation. Second, the rule designer
needs to have a way to refer to the tuples that have been modified. The keywords INSERTED,
DELETED, NEW-UPDATED, and OLD-UPDATED are used in STARBURST notation to refer
to four transition tables (relations) that include the newly inserted tuples, the deleted tuples, the
updated tuples before they were updated, and the updated tuples after they were updated,
respectively.
In statement-level semantics, the rule designer can only refer to the transition tables as a whole,
and the rule is triggered only once; hence, the rules must be written differently than for row-level
semantics. This is because multiple employee tuples may be inserted by a single INSERT statement, so a statement-level rule must handle all the affected tuples at once.
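As a hedged sketch of the contrast, the same derived-data maintenance can be written as a statement-level trigger in SQL-99 style, referring to the transition table of all tuples inserted by one statement (the EMPLOYEE and DEPARTMENT schema is assumed, as in the sketch above):

CREATE TRIGGER Total_sal_stmt
AFTER INSERT ON EMPLOYEE
REFERENCING NEW TABLE AS N
FOR EACH STATEMENT
  -- N is the transition table holding every tuple inserted by the statement
  UPDATE DEPARTMENT
  SET    Total_sal = Total_sal +
         (SELECT SUM(Salary) FROM N WHERE N.Dno = DEPARTMENT.Dno)
  WHERE  Dno IN (SELECT Dno FROM N);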
POTENTIAL APPLICATIONS FOR ACTIVE DATABASES
One important application is to allow notification of certain conditions that occur. For example,
an active database may be used to monitor, say, the temperature of an industrial furnace.
Active rules can also be used to enforce integrity constraints by specifying the types of events
that may cause the constraints to be violated and then evaluating appropriate conditions that check
whether the constraints are actually violated by the event or not.
Other applications include the automatic maintenance of derived data, such as maintaining the
consistency of materialized views whenever the base relations are modified.
2. Explain in detail about Temporal Databases.
TEMPORAL DATABASE
Temporal databases, in the broadest sense, encompass all database applications that require
some aspect of time when organizing their information. Hence, they provide a good
example to illustrate the need for developing a set of unifying concepts for application
developers to use.
Temporal database applications have been developed since the early days of database
usage. There are many examples of applications where some aspect of time is needed to
maintain the information in a database. These include healthcare, where patient histories
need to be maintained; insurance, where claims and accident histories are required as well
as information about the times when insurance policies are in effect; reservation systems
in general (hotel, airline, car rental, train, and so on), where information on the dates and
times when reservations are in effect are required; scientific databases, where data collected
from experiments includes the time at which each data item was measured; and so on.
A temporal relation is one where each tuple has an associated time when it is true; the
time may be either valid time (the time during which the fact is true in the real world) or
transaction time (the time during which the fact is recorded in the database).
Both valid time and transaction time can be stored, in which case the relation is said to be
a bitemporal relation.
TIME SPECIFICATION IN SQL:
The SQL standard defines the types date, time, and timestamp.
The type date contains four digits for the year (1–9999), two digits for the month (1–12),
and two digits for the date (1–31).
The type time contains two digits for the hour, two digits for the minute, and two digits for
the second, plus optional fractional digits.
The seconds field can go beyond 60, to allow for leap seconds that are added during some
years to correct for small variations in the speed of rotation of Earth.
The type timestamp contains the fields of date and time, with six fractional digits for the
seconds field.
The Universal Coordinated Time (UTC) is a standard reference point for specifying time,
with local times defined as offsets from UTC.
SQL also supports two types, time with time zone, and timestamp with time zone, which
specify the time as a local time plus the offset of the local time from UTC.
SQL supports a type called interval, which allows us to refer to a period of time such as “1
day” or “2 days and 5 hours,” without specifying a particular time when this period starts.
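A small illustrative sketch of these types in SQL (the shipment table and its columns are invented for this example):

CREATE TABLE shipment (
  id        INTEGER,
  ship_date DATE,                  -- e.g., DATE '2024-03-15'
  pickup    TIME WITH TIME ZONE,   -- local time plus its offset from UTC
  placed_at TIMESTAMP(6)           -- date and time, six fractional digits
);

-- interval arithmetic: two days and five hours after the order was placed
SELECT placed_at + INTERVAL '2' DAY + INTERVAL '5' HOUR
FROM   shipment;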
TEMPORAL QUERY LANGUAGES
A database relation without temporal information is sometimes called a snapshot relation,
since it reflects the state in a snapshot of the real world. The snapshot operation on a
temporal relation gives the snapshot of the relation at a specified time (or the current time,
if the time is not specified).
A temporal selection is a selection that involves the time attributes; a temporal projection
is a projection where the tuples in the projection inherit their times from the tuples in the
original relation.
A temporal join is a join, with the time of a tuple in the result being the intersection of the
times of the tuples from which it is derived. If the times do not intersect, the tuple is
removed from the result.
The predicates precedes, overlaps, and contains can be applied on intervals; their meanings
should be clear. The intersect operation can be applied on two intervals, to give a single
(possibly empty) interval. However, the union of two intervals may or may not be a single
interval.
A temporal functional dependency X → Y holds on a relation schema R if, for all legal
instances r of R, all snapshots of r satisfy the functional dependency X → Y.
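To make these operations concrete, here is a rough SQL sketch over an invented valid-time relation emp_dept(Emp, Dno, Valid_from, Valid_to) and a matching dept_mgr relation; GREATEST and LEAST are available in many SQL dialects but are not part of the standard:

-- snapshot at a point in time (a temporal selection)
SELECT Emp, Dno
FROM   emp_dept
WHERE  Valid_from <= DATE '2004-01-01' AND Valid_to > DATE '2004-01-01';

-- temporal join: the result period is the intersection of the input periods
SELECT e.Emp, m.Mgr,
       GREATEST(e.Valid_from, m.Valid_from) AS Valid_from,
       LEAST(e.Valid_to, m.Valid_to)        AS Valid_to
FROM   emp_dept e
JOIN   dept_mgr m ON e.Dno = m.Dno
WHERE  GREATEST(e.Valid_from, m.Valid_from) < LEAST(e.Valid_to, m.Valid_to);

Tuples whose periods do not intersect fail the final predicate and are removed from the result, exactly as described above.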
3. Explain Spatial Data Types, Spatial Operators and Queries with example.
SPATIAL DATABASE:
Spatial data include geographic data, such as maps and associated information, and
computer-aided-design data, such as integrated circuit designs or building designs.
Applications of spatial data initially stored data as files in a file system, as did early-
generation business applications.
Spatial databases incorporate functionality that provides support for databases that keep
track of objects in a multidimensional space.
The systems that manage geographic data and related applications are known as geographic
information systems (GISs), and they are used in areas such as environmental applications,
transportation systems, emergency response systems, and battle management.
Other databases, such as meteorological databases for weather information, are three-
dimensional, since temperatures and other meteorological information are related to three-
dimensional spatial points.
In general, a spatial database stores objects that have spatial characteristics that describe
them and that have spatial relationships among them.
The spatial relationships among the objects are important, and they are often needed when
querying the database.
A spatial database is optimized to store and query data related to objects in space, including
points, lines and polygons. Satellite images are a prominent example of spatial data.
Queries posed on these spatial data, where predicates for selection deal with spatial
parameters, are called spatial queries. For example, “What are the names of all bookstores
within five miles of the College of Computing building at Georgia Tech?” is a spatial query.
Common Types of Analysis for Spatial Data:
Topological operators. Topological properties are invariant when topological
transformations are applied. Topological operators are hierarchically structured in several
levels, where the base level offers operators the ability to check for detailed topological
relations between regions with a broad boundary, and the higher levels offer more abstract
operators that allow users to query uncertain spatial data independent of the underlying
geometric data model. Examples include open (region), close (region), and inside (point,
loop).
Projective operators. Projective operators, such as convex hull, are used to express
predicates about the concavity/convexity of objects as well as other spatial relations (for
example, being inside the concavity of a given object).
Metric operators. Metric operators provide a more specific description of the object’s
geometry. They are used to measure some global properties of single objects (such as the
area, relative size of an object’s parts, compactness, and symmetry), and to measure the
relative position of different objects in terms of distance and direction. Examples include
length (arc) and distance (point, point).
Dynamic Spatial Operators. Dynamic operations alter the objects upon which the
operations act. The three fundamental dynamic operations are create, destroy, and update.
SPATIAL QUERIES
Spatial queries are requests for spatial data that require the use of spatial operations. The following
categories illustrate three typical types of spatial queries:
Nearness queries request objects that lie near a specified location. A query to find all
restaurants that lie within a given distance of a given point is an example of a nearness
query. The nearest-neighbour query requests the object that is nearest to a specified point.
For example, we may want to find the nearest gasoline station.
Region queries deal with spatial regions. Such a query can ask for objects that lie partially
or fully inside a specified region. A query to find all retail shops within the geographic
boundaries of a given town is an example.
Spatial joins or overlays typically join the objects of two types based on some spatial
condition, such as the objects intersecting or overlapping spatially or being within a certain
distance of one another. For example, find all townships located on a major highway
between two cities or find all homes that are within two miles of a lake.
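The three query types map naturally onto the operators of a spatial SQL extension. A hedged sketch in PostGIS-flavoured SQL (the restaurants, shops, and towns tables, their columns, and the coordinates are all illustrative):

-- nearness query: restaurants within five miles (about 8047 m) of a point
SELECT name
FROM   restaurants
WHERE  ST_DWithin(geom::geography,
                  ST_SetSRID(ST_MakePoint(-84.397, 33.776), 4326)::geography,
                  8047);

-- region query: retail shops inside a town boundary
SELECT s.name
FROM   shops s
JOIN   towns t ON ST_Within(s.geom, t.boundary)
WHERE  t.name = 'Decatur';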
SPATIAL DATA INDEXING
Indices are required for efficient access to spatial data. Traditional index structures, such as hash
indices and B-trees, are not suitable, since they deal only with one-dimensional data, whereas
spatial data are typically of two or more dimensions.
k-d Trees
A tree structure called a k-d tree was one of the early structures used for indexing in
multiple dimensions. Each level of a k-d tree partitions the space into two. The partitioning
is done along one dimension at the node at the top level of the tree, along another dimension
in nodes at the next level, and so on, cycling through the dimensions. The partitioning
proceeds in such a way that, at each node, approximately one-half of the points stored in
the subtree fall on one side and one-half fall on the other. Partitioning stops when a node
has less than a given maximum number of points.
Figure 25.4 shows a set of points in two-dimensional space, and a k-d tree representation
of the set of points. Each line corresponds to a node in the tree, and the maximum number
of points in a leaf node has been set at 1. Each line in the figure (other than the outside box)
corresponds to a node in the k-d tree. The numbering of the lines in the figure indicates the
level of the tree at which the corresponding node appears. The k-d-B tree extends the k-d
tree to allow multiple child nodes for each internal node, just as a B-tree extends a binary
tree, to reduce the height of the tree. k-d-B trees are better suited for secondary storage than
k-d trees.
Quadtrees
An alternative representation for two-dimensional data is a quadtree.
An example of the division of space by a quadtree appears in Figure 25.5. The set of points
is the same as that in Figure 25.4. Each node of a quadtree is associated with a rectangular
region of space. The top node is associated with the entire target space. Each non-leaf node
in a quadtree divides its region into four equal-sized quadrants, and correspondingly each
such node has four child nodes corresponding to the four quadrants. Leaf nodes have
between zero and some fixed maximum number of points. Correspondingly, if the region
corresponding to a node has more than the maximum number of points, child nodes are
created for that node. In the example in Figure 25.5, the maximum number of points in a
leaf node is set to 1.
This type of quadtree is called a PR quadtree, to indicate that it stores points and that the
division of space is based on regions, rather than on the actual set of points stored.
R-Trees
A storage structure called an R-tree is useful for indexing of objects such as points,
line segments, rectangles, and other polygons.
An R-tree is a balanced tree structure with the indexed objects stored in leaf nodes,
much like a B+-tree. However, instead of a range of values, a rectangular bounding
box is associated with each tree node.
The bounding box of a leaf node is the smallest rectangle parallel to the axes that
contains all objects stored in the leaf node.
Each internal node stores the bounding boxes of the child nodes along with the
pointers to the child nodes.
Each leaf node stores the indexed objects, and may optionally store the bounding
boxes of the objects; the bounding boxes help speed up checks for overlaps of the
rectangle with the indexed objects—if a query rectangle does not overlap with the
bounding box of an object, it cannot overlap with the object, either.
The R-tree itself is shown at the right side of Figure 25.6, which denotes the bounding box
i as BBi.
Comparison with Quad-trees:
Tiling-level optimization is required in Quad-trees, whereas R-trees do not require
any such optimization.
A Quad-tree can be implemented on top of an existing B-tree, whereas an R-tree follows
a different structure from a B-tree.
Spatial index creation in Quad-trees is faster as compared to R-trees.
R-trees are faster than Quad-trees for Nearest Neighbour queries while for window
queries, Quad-trees are faster than R-trees.
SPATIAL DATA MINING
Spatial data tends to be highly correlated. For example, people with similar characteristics,
occupations, and backgrounds tend to cluster together in the same neighbourhoods. The three
major spatial data mining techniques are spatial classification, spatial association, and spatial
clustering.
Spatial classification. The goal of classification is to estimate the value of an attribute of
a relation based on the value of the relation’s other attributes. An example of the spatial
classification problem is determining the locations of nests in a wetland based on the value
of other attributes (for example, vegetation durability and water depth); it is also called the
location prediction problem. Similarly, where to expect hotspots in crime activity is also a
location prediction problem.
Spatial association. Spatial association rules are defined in terms of spatial predicates
rather than items. A spatial association rule is of the form
P1 ∧ P2 ∧ … ∧ Pn ⇒ Q1 ∧ Q2 ∧ … ∧ Qm, where at least one of the Pi's or Qj's is a spatial predicate.
Spatial clustering attempts to group database objects so that the most similar objects are
in the same cluster, and objects in different clusters are as dissimilar as possible. An
example of a spatial clustering algorithm is density-based clustering, which tries to find
clusters based on the density of data points in a region.
APPLICATIONS OF SPATIAL DATA
Spatial data management is useful in many disciplines, including geography, remote
sensing, urban planning, and natural resource management.
Spatial database management is playing an important role in the solution of challenging
scientific problems such as global climate change and genomics.
Due to the spatial nature of genome data, GIS and spatial database management systems
have a large role to play in the area of bioinformatics.
Some of the typical applications include pattern recognition, genome browser
development, and visualization maps.
Another important application area of spatial data mining is the spatial outlier detection.
Detecting spatial outliers is useful in many applications of geographic information systems
and spatial databases.
These application domains include transportation, ecology, public safety, public health,
climatology, and location-based services.
4. Explain in detail about Mobile Databases.
MOBILE DATABASE:
The mobile-computing environment consists of mobile computers, referred to as mobile hosts, and
a wired network of computers. Mobile hosts communicate with the wired network via computers
referred to as mobile support stations. Each mobile support station manages those mobile hosts
within its cell— that is, the geographical area that it covers. Mobile hosts may move between cells,
thus necessitating a handoff of control from one mobile support station to another. It is possible
for mobile hosts to communicate directly without the intervention of a mobile support station.
However, such communication can occur only between nearby hosts.
HANDOFF MANAGEMENT
Ensuring that a mobile user remains connected while moving from one location (e.g., cell) to
another.
Packets or connections are routed to the new location. The system must decide when to hand off
to a new access point (AP).
Select a new AP from among several APs
Acquire resources such as bandwidth channels (GSM), or a new IP address (Mobile IP)
Channel allocation is a research issue: the goal may be to maximize channel usage, satisfy
QoS, or maximize the revenue generated
Inform the old AP to reroute packets and also to transfer state information to the new AP.
Packets are routed to the new AP.
TRADEOFF IN LOCATION MANAGEMENT
Network may only know approximate location
By location update (or location registration):
Network is informed of the location of a mobile user
By location search or terminal paging:
Network is finding the location of a mobile user
A tradeoff exists between location update and search
When the user is not called often (that is, when the call arrival rate is low), resources are wasted
by frequent updates.
If updates are not done and a call arrives, bandwidth and time are wasted in searching for the user.
d. Co-transactions:
These are cooperating transactions that execute like coroutines: when one transaction executes,
control passes from the current transaction to the other at the points where results are shared.
In the end, either both transactions commit successfully or both fail.
2.Kangaroo transaction model
This model, proposed by Dunham, represents the movement behaviour and the data
behaviour of a transaction as a mobile host changes its position from one mobile cell
to another in the static network. It is named so because, in a mobile environment, the
transaction hops from one base station to another.
This transaction model builds on the abstract idea of global and split transactions in a
multidatabase environment. In this model, a Data Access Agent (DAA) at each base
station is used for accessing local and global databases. The DAA accepts transaction
requests from a mobile user and forwards each request to the corresponding database
servers.
These transactions will be committed on servers. DAA acts as a Mobile Transaction
Manager and data access coordinator.
A kangaroo transaction has a unique identification number composed of the base station
number and a unique sequence number within that base station. When the mobile unit
changes location from one cell to another, control of the kangaroo transaction passes to
a new DAA at the new base station. The DAA at the new base station produces a new
Joey transaction (a subtransaction of the kangaroo transaction).
a. Clustering model:
This model, proposed by Pitoura, assumes a fully distributed system and is considered
an open nested transaction model. The model is based on grouping semantically related
or physically nearby data together to form clusters. Clusters can be characterized statically
or dynamically.
A transaction from a mobile host is composed of a set of weak and strict transactions,
based on the consistency requirements. A weak transaction consists only of weak read
and weak write operations, which can access data only within a single cluster.
b. Isolation –only model:
This model, proposed by Satyanarayanan, is used in the Coda file system. Coda is a distributed
file system that provides disconnected operation for mobile clients by using file hoarding
and concurrency control.
Here, a transaction is a chronological sequence of file access operations.
Like the clustering model, transactions are arranged in two categories:
First-class transactions, which do not involve any disconnected (partitioned) file accesses.
Second-class transactions, which are carried out under disconnection.
A first-class transaction commits without delay after being executed, whereas a second-class
transaction first goes into a pending state and waits for validation. When reconnection
becomes possible, second-class transactions are validated according to the desired
consistency criteria. If validation is successful, the results are integrated and committed;
otherwise, the transaction enters a resolution state.
When the mobile host is disconnected, tentative transactions update the locally replicated copy of the data.
5. Explain in detail about deductive database system with prolog notations and
examples.
DEDUCTIVE DATABASES:
In a deductive database system, we typically specify rules through a declarative
language—a language in which we specify what to achieve rather than how to achieve it.
An inference engine (or deduction mechanism) within the system can deduce new facts
from the database by interpreting these rules. The model used for deductive databases is
closely related to the relational data model, and particularly to the domain relational
calculus formalism.
It is also related to the field of logic programming and the Prolog language.
A variation of Prolog called Datalog is used to define rules declaratively in conjunction
with an existing set of relations, which are themselves treated as literals in the language.
Although the language structure of Datalog resembles that of Prolog, its operational
semantics—that is, how a Datalog program is executed—is still different.
A deductive database uses two main types of specifications: facts and rules. Facts are
specified in a manner similar to the way relations are specified, except that it is not
necessary to include the attribute names.
In a deductive database, the meaning of an attribute value in a tuple is determined solely
by its position within the tuple. Rules are somewhat similar to relational views. They
specify virtual relations that are not actually stored but that can be formed from the facts
by applying inference mechanisms based on the rule specifications. The main difference
between rules and views is that rules may involve recursion and hence may yield virtual
relations that cannot be defined in terms of basic relational views.
The evaluation of Prolog programs is based on a technique called backward chaining,
which involves a top-down evaluation of goals.
In the deductive databases that use Datalog, attention has been devoted to handling large
volumes of data stored in a relational database. Hence, evaluation techniques have been
devised that resemble those for a bottom-up evaluation.
Prolog/Datalog Notation :
Prolog suffers from the limitation that the order of specification of facts and rules is
significant in evaluation; moreover, the order of literals within a rule is significant. The
execution techniques for Datalog programs attempt to circumvent these problems.
The notation used in Prolog/Datalog is based on providing
predicates with unique names. A predicate has an implicit meaning, which is suggested
by the predicate name, and a fixed number of arguments. If the arguments are all constant
values, the predicate simply states that a certain fact is true. If, on the other hand, the
predicate has variables as arguments, it is either considered as a query or as part of a rule
or constraint. In our discussion, we adopt the Prolog convention that all constant values
in a predicate are either numeric or character strings; they are represented as identifiers (or
names) that start with a lowercase letter, whereas variable names always start with an
uppercase letter. Consider the example shown in Figure 26.11, which is based on the
relational database in Figure 3.6, but in a much simplified form. There are three predicate
names: supervise, superior, and subordinate.
The SUPERVISE predicate is defined via a set of facts, each of which has two arguments:
a supervisor name, followed by the name of a direct supervisee (subordinate) of that
supervisor. These facts correspond to the actual data that is stored in the database, and they
can be considered as constituting a set of tuples in a relation SUPERVISE with two
attributes whose schema is
SUPERVISE(Supervisor, Supervisee)
Thus, SUPERVISE(X, Y) states the fact that X supervises Y. Notice the omission of the
attribute names in the Prolog notation. Attribute names are only represented by virtue of
the position of each argument in a predicate: the first argument represents the supervisor,
and the second argument represents a direct subordinate. The other two predicate names
are defined by rules. The main contributions of deductive databases are the ability to
specify recursive rules and to provide a framework for inferring new information based on
the specified rules. A rule is of the form head :– body, where :– is read as if. A
rule usually has a single predicate to the left of the :– symbol—called the head or left-hand
side (LHS) or conclusion of the rule—and one or more predicates to the right of the :–
symbol— called the body or right-hand side (RHS) or premise(s) of the rule. A predicate
with constants as arguments is said to be ground; we also refer to it as an instantiated
predicate. The arguments of the predicates that appear in a rule typically include a number
of variable symbols, although predicates can also contain constants as arguments. A rule
specifies that, if a particular assignment or binding of constant values to the variables in
the body (RHS predicates) makes all the RHS predicates true, it also makes the head (LHS
predicate) true by using the same assignment of constant values to variables. Hence, a rule
provides us with a way of generating new facts that are instantiations of the head of the
rule. These new facts are based on facts that already exist, corresponding to the
instantiations (or bindings) of predicates in the body of the rule. Notice that by listing
multiple predicates in the body of a rule we implicitly apply the logical AND operator to
these predicates. Hence, the commas between the RHS predicates may be read as meaning
and.
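A small reconstruction in Datalog notation, in the spirit of Figure 26.11 (the particular facts are illustrative, since the figure itself is not reproduced here):

% Facts (ground predicates)
supervise(franklin, john).
supervise(franklin, ramesh).
supervise(franklin, joyce).
supervise(jennifer, alicia).
supervise(jennifer, ahmad).
supervise(james, franklin).
supervise(james, jennifer).

% Rules: superior is the transitive closure of supervise
superior(X, Y) :- supervise(X, Y).
superior(X, Y) :- supervise(X, Z), superior(Z, Y).
subordinate(X, Y) :- superior(Y, X).

% A query: who are the direct or indirect subordinates of james?
superior(james, Y)?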
Datalog Notation:
In Datalog, as in other logic-based languages, a program is built from basic objects called
atomic formulas. It is customary to define the syntax of logic-based languages by describing the
syntax of atomic formulas and identifying how they can be combined to form a program. In
Datalog, atomic formulas are literals of the form p(a1, a2, … , an), where p is the predicate name
and n is the number of arguments for predicate p. Different predicate symbols can have different
numbers of arguments, and the number of arguments n of predicate p is sometimes called the arity
or degree of p. The arguments can be either constant values or variable names. As mentioned
earlier, we use the convention that constant values either are numeric or start with a lowercase
character, whereas variable names always start with an uppercase character. A literal is either an
atomic formula as defined earlier—called a positive literal—or an atomic formula preceded by
not. The latter is a negated atomic formula, called a negative literal.
Clausal Form and Horn Clauses:
Recall from Section 6.6 that a formula in the relational calculus is a condition that includes
predicates called atoms (based on relation names). Additionally, a formula can have quantifiers,
namely, the universal quantifier (for all) and the existential quantifier (there exists). In clausal
form, a formula must be transformed into another formula with the following characteristics:
All variables in the formula are universally quantified. Hence, it is not necessary to include the
universal quantifiers (for all) explicitly; the quantifiers are removed, and all variables in the
formula are implicitly quantified by the universal quantifier.
In clausal form, the formula is made up of a number of clauses, where each clause is composed
of a number of literals connected by OR logical connectives only. Hence, each clause is a
disjunction of literals.
The clauses themselves are connected by AND logical connectives only, to form a formula.
Hence, the clausal form of a formula is a conjunction of clauses.
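As a quick example (reusing the supervisory predicates from the Datalog sketch above), the rule superior(X, Y) :– supervise(X, Y) is equivalent to the clause
superior(X, Y) ∨ ¬supervise(X, Y)
A clause with at most one positive literal, as here, is called a Horn clause; every Datalog rule, having a single head predicate, corresponds to a Horn clause, and a fact such as supervise(james, franklin) is a Horn clause with one positive literal and no negative literals.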
Interpretations of Rules:
There are two main alternatives for interpreting the theoretical meaning of rules: proof-
theoretic and model-theoretic.
In the proof-theoretic interpretation of rules, we consider the facts and rules to be true
statements, or axioms. Ground axioms contain no variables. The facts are ground axioms
that are given to be true. Rules are called deductive axioms, since they can be used to
deduce new facts. The deductive axioms can be used to construct proofs that derive new
facts from existing facts. For example, Figure 26.12 shows how to prove the fact
SUPERIOR(james, ahmad) from the rules and facts given in Figure 26.11. The proof-
theoretic interpretation gives us a procedural or computational approach for computing an
answer to the Datalog query. The process of proving whether a certain fact (theorem) holds
is known as theorem proving. The second type of interpretation is called the model-
theoretic interpretation. Here, given a finite or an infinite domain of constant values, we
assign to a predicate every possible combination of values as arguments. We must then
determine whether the predicate is true or false. In general, it is sufficient to specify the
combinations of arguments that make the predicate true, and to state that all other
combinations make the predicate false. If this is done for every predicate, it is called an
interpretation of the set of predicates. For example, consider the interpretation shown in
Figure 26.13 for the predicates SUPERVISE and SUPERIOR. This interpretation assigns
a truth value (true or false) to every possible combination of argument values (from a finite
domain) for the two predicates.
An interpretation is called a model for a specific set of rules if those rules are always true
under that interpretation; that is, for any values assigned to the variables in the rules, the
head of the rules is true when we substitute the truth values assigned to the predicates in
the body of the rule by that interpretation. Hence, whenever a particular substitution
(binding) to the variables in the rules is applied, if all the predicates in the body of a rule
are true under the interpretation, the predicate in the head of the rule must also be true. The
interpretation shown in Figure 26.13 is a model for the two rules shown, since it can never
cause the rules to be violated. Notice that a rule is violated if a particular binding of
constants to the variables makes all the predicates in the rule body true but makes the
predicate in the rule head false.
6. Discuss the challenges to multimedia databases and Areas where multimedia
database is applied
MULTIMEDIA DATABASE:
Multimedia databases provide features that allow users to store and query different types
of multimedia information, which includes images (such as photos or drawings), video clips (such
as movies, newsreels, or home videos), audio clips (such as songs, phone messages, or speeches),
and documents (such as books or articles).
Content of Multimedia Database management system :
1. Media data – The actual data representing an object.
2. Media format data – Information such as sampling rate, resolution, encoding scheme etc.
about the format of the media data after it goes through the acquisition, processing and
encoding phase.
3. Media keyword data – Keyword descriptions relating to the generation of the data. It is also
known as content-descriptive data. Example: the date, time, and place of recording.
4. Media feature data – Content dependent data such as the distribution of colors, kinds of
texture and different shapes present in data.
Types of multimedia applications based on data management characteristic are :
1. Repository applications – A large amount of multimedia data as well as metadata (media
format data, media keyword data, media feature data) is stored for retrieval purposes,
e.g., a repository of satellite images, engineering drawings, and radiology scanned pictures.
2. Presentation applications – They involve the delivery of multimedia data subject to temporal
constraints. Optimal viewing or listening requires the DBMS to deliver data at a certain rate,
offering a quality of service above a certain threshold. Here, data is processed as it is
delivered. Example: annotating video and audio data, real-time editing analysis.
3. Collaborative work using multimedia information – It involves executing a complex
task by merging drawings, changing notifications. Example: Intelligent healthcare
network.
There are still many challenges to multimedia databases, some of which are :
1. Modelling – Work in this area can draw on both database and information-retrieval
techniques; documents in particular constitute a specialized area and deserve special consideration.
2. Design – The conceptual, logical, and physical design of multimedia databases has not yet
been addressed fully, as performance and tuning issues at each level are far more complex;
multimedia data come in a variety of formats like JPEG, GIF, PNG, and MPEG, which are
not easy to convert from one form to another.
3. Storage – Storage of a multimedia database on any standard disk presents problems of
representation, compression, mapping to device hierarchies, archiving, and buffering during
input-output operations. In a DBMS, a "BLOB" (Binary Large Object) facility allows untyped
bitmaps to be stored and retrieved.
4. Performance – For an application involving video playback or audio-video
synchronization, physical limitations dominate. The use of parallel processing may
alleviate some problems, but such techniques are not yet fully developed. Apart from this,
multimedia databases consume a lot of processing time as well as bandwidth.
5. Queries and retrieval – For multimedia data like images, video, and audio, accessing the
data through queries opens up many issues, such as efficient query formulation, query
execution, and optimization, which still need to be worked on.
Areas where multimedia database is applied are :
Documents and record management : Industries and businesses that keep detailed
records and variety of documents. Example: Insurance claim record.
Knowledge dissemination : Multimedia database is a very effective tool for knowledge
dissemination in terms of providing several resources. Example: Electronic books.
Education and training : Computer-aided learning materials can be designed using
multimedia sources which are nowadays very popular sources of learning. Example:
Digital libraries.
Marketing, advertising, retailing, entertainment and travel. Example: a virtual tour of cities.
Real-time control and monitoring : Coupled with active database technology,
multimedia presentation of information can be very effective means for monitoring and
controlling complex tasks Example: Manufacturing operation control.
The main types of database queries that are needed involve locating multimedia sources
that contain certain objects of interest. For example, one may want to locate all video clips
in a video database that include a certain person, say Michael Jackson. One may also want
to retrieve video clips based on certain activities included in them, such as video clips
where a soccer goal is scored by a certain player or team.
The above types of queries are referred to as content-based retrieval, because the
multimedia source is being retrieved based on its containing certain objects or activities.
Hence, a multimedia database must use some model to organize and index the multimedia
sources based on their contents. Identifying the contents of multimedia sources is a difficult
and time-consuming task.
There are two main approaches. The first is based on automatic analysis of the multimedia
sources to identify certain mathematical characteristics of their contents. This approach
uses different techniques depending on the type of multimedia source (image, video, audio,
or text). The second approach depends on manual identification of the objects and
activities of interest in each multimedia source and on using this information to index the
sources.
An image is typically stored either in raw form as a set of pixel or cell values, or in
compressed form to save space. The image shape descriptor describes the geometric shape
of the raw image, which is typically a rectangle of cells of a certain width and height.
Hence, each image can be represented by an m by n grid of cells. Each cell contains a pixel
value that describes the cell content. In black and white images, pixels can be one bit. In
grayscale or colour images, a pixel is multiple bits. Because images may require large
amounts of space, they are often stored in compressed form. Compression standards, such
as GIF, JPEG, or MPEG, use various mathematical transformations to reduce the number
of cells stored but still maintain the main image characteristics.
Applicable mathematical transforms include discrete Fourier transform (DFT), discrete
cosine transform (DCT), and wavelet transforms.
To identify objects of interest in an image, the image is typically divided into homogeneous
segments using a homogeneity predicate. For example, in a color image, adjacent cells that
have similar pixel values are grouped into a segment.
The homogeneity predicate defines conditions for automatically grouping those cells.
Hence, segmentation and compression can identify the main characteristics of an image.
Automatic Analysis of Images:
Analysis of multimedia sources is critical to support any type of query or search interface.
We need to represent multimedia source data such as images in terms of features that would
enable us to define similarity. The work done so far in this area uses low-level visual
features such as colour, texture, and shape, which are directly related to the perceptual
aspects of image content. These features are easy to extract and represent, and it is
convenient to design similarity measures based on their statistical properties.
Colour is one of the most widely used visual features in content-based image retrieval
since it does not depend upon image size or orientation. Retrieval based on color similarity
is mainly done by computing a colour histogram for each image that identifies the
proportion of pixels within an image for the three colour channels (red, green, blue—
RGB).
Texture refers to the patterns in an image that present the properties of homogeneity that
do not result from the presence of a single color or intensity value. Example of texture
classes are rough and silky. Examples of textures that can be identified include pressed calf
leather, straw matting, cotton canvas, and so on. Just as pictures are represented by arrays
of pixels (picture elements), textures are represented by arrays of texels (texture elements).
Texture identification is primarily done by modelling it as a two-dimensional, grey-level
variation. The relative brightness of pairs of pixels is computed to estimate the degree of
contrast, regularity, coarseness, and directionality.
Shape refers to the shape of a region within an image. It is generally determined by
applying segmentation or edge detection to an image. Segmentation is a region-based
approach that uses an entire region (sets of pixels), whereas edge detection is a boundary-
based approach that uses only the outer boundary characteristics of entities. Shape
representation is typically required to be invariant to translation, rotation, and scaling.
Object Recognition in Images:
Object recognition is the task of identifying real-world objects in an image or a video
sequence. The system must be able to identify the object even when the images of the
object vary in viewpoints, size, scale, or even when they are rotated or translated. Some
approaches have been developed to divide the original image into regions based on
similarity of contiguous pixels.
Thus, in a given image showing a tiger in the jungle, a tiger subimage may be detected
against the background of the jungle, and when compared with a set of training images, it
may be tagged as a tiger.
The representation of the multimedia object in an object model is extremely important. One
approach is to divide the image into homogeneous segments using a homogeneity
predicate. For example, in a coloured image, adjacent cells that have similar pixel values are
grouped into a segment.
An important contribution to this field was made by Lowe, who used scale-invariant
features from images to perform reliable object recognition. This approach is called the
scale-invariant feature transform (SIFT).
The SIFT features are invariant to image scaling and rotation, and partially invariant to
change in illumination and 3D camera viewpoint.
For image matching and recognition, SIFT features (also known as keypoint features) are
first extracted from a set of reference images and stored in a database. Object recognition
is then performed by comparing each feature from the new image with the features stored
in the database and finding candidate matching features based on the Euclidean distance of
their feature vectors. Since the keypoint features are highly distinctive, a single feature can
be correctly matched with good probability in a large database of features.
Semantic Tagging of Images:
The notion of implicit tagging is an important one for image recognition and comparison.
Multiple tags may attach to an image or a subimage: for instance, in the example we
referred to above, tags such as “tiger,” “jungle,” “green,” and “stripes” may be associated
with that image.
Most image search techniques retrieve images based on user-supplied tags that are often
not very accurate or comprehensive.
To improve search quality, a number of recent systems aim at automated generation of
these image tags. In case of multimedia data, most of its semantics is present in its content.
These systems use image-processing and statistical-modeling techniques to analyze image
content to generate accurate annotation tags that can then be used to retrieve images by
content.
Since different annotation schemes will use different vocabularies to annotate images, the
quality of image retrieval will be poor.
To solve this problem, recent research techniques have proposed the use of concept
hierarchies, taxonomies, or ontologies using OWL (Web Ontology Language), in which
terms and their relationships are clearly defined. These can be used to infer higher-level
concepts based on tags.
Concepts like “sky” and “grass” may be further divided into “clear sky” and “cloudy sky”
or “dry grass” and “green grass” in such a taxonomy. These approaches generally come
under semantic tagging and can be used in conjunction with the above feature-analysis and
object-identification strategies.
Analysis of Audio Data Sources:
Audio sources are broadly classified into speech, music, and other audio data. Each of these
is significantly different from the others; hence different types of audio data are treated
differently.
Audio data must be digitized before it can be processed and stored. Indexing and retrieval
of audio data is arguably the toughest among all types of media, because like video, it is
continuous in time and does not have easily measurable characteristics such as text.
The clarity of sound recordings is easy for humans to perceive but hard to quantify for machine
learning. Interestingly, indexing of speech data often applies speech recognition techniques to
transcribe the actual audio content, since this can make indexing the data a lot easier and more
accurate. This is sometimes referred to as text-based indexing of audio data.
The speech metadata is typically content dependent, in that the metadata is generated from
the audio content; for example, the number of speakers, and so on. However, some of the
metadata might be independent of the actual content, such as the length of the speech and
the format in which the data is stored.
Music indexing, on the other hand, is done based on the statistical analysis of the audio
signal, also known as content-based indexing. Content-based indexing often makes use of
the key features of sound: intensity, pitch, timbre, and rhythm. It is possible to compare
different pieces of audio data and retrieve information from them based on the calculation
of certain features, as well as application of certain transforms.
ADVANCED DATABASE TECHNOLOGY
UNIT 3
NoSQL Databases
Problems with RDBMS:
Should know the entire schema upfront
Every record should have the same properties [rigid structure]
Scalability is costly [transactions and joins are expensive when running on a distributed
database]
Many Relational Databases do not provide out of the box support for scaling
Normalization
Fixed schemas make it hard to adjust to application needs
Altering schema on a running database is expensive
Application changes for any change in schema structure
SQL was designed for running on single server systems.
Horizontal Scalability is a problem
Advantages
Speed — Files are retrieved from the nearest location
If one site fails, the system can still run
Disadvantages
Time for synchronization of the multiple databases
Data replication
The Benefits of NoSQL
When compared to relational databases, NoSQL databases are more scalable and
provide superior performance, and their data model addresses several issues that the
relational model is not designed to address
Large volumes of rapidly changing structured, semi-structured, and unstructured
data:[Schema-less]
Mostly Open Source
Object-oriented programming that is easy to use and flexible
Running well on clusters: a geographically distributed scale-out architecture instead of an
expensive, monolithic architecture
NOSQL Categories
Most of the NOSQL products can be put into these categories:
Key/Value Stores
Document Databases
Graph Databases
Column Databases
Examples of Key Value Databases:
Redis
Riak.
Oracle NoSQL
Document Databases
Documents are composed of field-and-value pairs and have the following structure:
Documents can contain many different key-value pairs, or key-array pairs, or even nested
documents
{ field1: value1, field2: value2, field3: value3, ...
fieldN: valueN }
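A hedged example of such a document and a query against it, in MongoDB's shell notation (the people collection and its fields are invented):

{
  _id: 1,
  name: { first: "Ada", last: "Lovelace" },   // nested document
  skills: [ "math", "programming" ],          // key-array pair
  born: 1815
}

// dot notation reaches into nested documents when querying:
db.people.find({ "name.last": "Lovelace" })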
Graph Databases
Graph databases are NoSQL databases that use the graph data model, composed of vertices
(each an entity such as a person, place, object, or relevant piece of data) and edges, which
represent the relationships between vertices.
Advantages
Easy to represent connected data
Much faster to retrieve, navigate, and traverse connected data
Can represent semi-structured data easily
Does not require complex or costly joins to retrieve connected data
Supports full ACID (Atomicity, Consistency, Isolation, and Durability) rules
Let’s Convert a Relational Model to a Graph Data Model using an Example
Column family databases are probably best known because of Google's BigTable
implementation.
They are very similar on the surface to relational databases, but they are actually quite a
different beast.
Some of the difference lies in storing data by rows (relational) vs. storing data by columns
(column family databases).
But a lot of the difference is conceptual in nature. You cannot apply the same sort of
solutions that you used in a relational model to a column database.
BASE Properties of Transactions
Basically Available — Failure will not halt the system
Soft state — State of the system will change over time
Eventual consistency — Will become consistent over time
NEWSQL Databases
NewSQL is a new approach to relational databases that aims to combine the transactional
ACID (atomicity, consistency, isolation, durability) guarantees of a good RDBMS with
the horizontal scalability of NoSQL.
They maintain ACID Guarantees
They run on SQL
NewSQL databases support:
Partitioning and Sharding — Fragmentation is supported
Replication — Copies of the database are stored at a remote site
Secondary Indexes — Accessing database records using a value other than a primary
key
Concurrency Control — Data Integrity while executing simultaneous transactions
Crash Recovery — Recovers to a consistent state
MongoDB
Mongod vs mongos vs mongo
Mongod:
Mongod is almost like an API: it is the middleman between the application and the db.
It handles data requests, manages data access, and performs background management
operations.
Mongos:
Mongos is also a middleman.
The mongos instances route queries and write operations to the shards of a sharded cluster.
Is it fair to say that it does the same as mongod? What is the big difference? When I run the
command "mongo --port xxxx", I am connecting to the cluster/replica set itself and not
starting a middleman.
MongoDB is divided into two components: server and client.
The server is the main database component which stores and manages data. And, the
clients come in various flavours and connect to the server to perform various queries
and db operations.
Here, Mongod is the server component. You start it, it runs, that’s it.
By definition we also call it the primary daemon process for the MongoDB database
which handles data requests, manages data access, and performs background
management operations.
Mongo, by contrast, is the default command-line client. You start it, you connect to a server,
you enter commands, you exit out of it.
You have to run mongod first, otherwise you have no database to interact with.
Now the question is: what is Mongos?
Mongos is a kind of query router, providing an interface between client applications
and the sharded cluster
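Putting the three together, a minimal command-line sketch (port numbers, paths, and host names are illustrative):

# start the server daemon
mongod --port 27017 --dbpath /data/db

# connect the command-line client to it
mongo --port 27017

# start a query router for a sharded cluster (needs a config server replica set)
mongos --configdb cfgrs/cfg1.example.net:27019 --port 27018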
What is Replication?
Replication is the process of synchronizing data across multiple servers.
Replication provides redundancy and increases data availability with multiple copies of
data on different database servers.
Advantages
High (24*7) availability of data
Disaster recovery
No downtime for maintenance (like backups, index rebuilds, compaction)
Read scaling (extra copies to read from)
MongoDB - Replication
Primary Server —
The primary server receives all write operations.
A replica set can have only one primary capable of confirming writes
The primary records all changes to its data sets in its operation log, i.e. oplog.
Secondary Server —
The secondary servers replicate the primary oplog and apply the operations to their data
sets such that the secondaries’ data sets reflect the primary’s data set.
If the primary is unavailable, an eligible secondary will hold an election to elect itself
the new primary
Replica set —
A replica set is a group of mongod instances that maintain the same data set.
A replica set contains several data bearing nodes.
Of the data bearing nodes, one and only one member is deemed the primary node, while
the other nodes are deemed secondary nodes.
Oplog File
The oplog (operations log) is a special capped collection that keeps a rolling record of
all operations that modify the data stored in your databases.
MongoDB applies database operations on the primary and then records the operations
on the primary’s oplog.
The secondary members then copy and apply these operations in an asynchronous
process.
All replica set members contain a copy of the oplog, in the local.oplog.rs collection,
which allows them to maintain the current state of the database.
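A minimal sketch of creating a replica set from the mongo shell, assuming three mongod instances are already running on the (illustrative) hosts below:

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongodb0.example.net:27017" },
    { _id: 1, host: "mongodb1.example.net:27017" },
    { _id: 2, host: "mongodb2.example.net:27017" }
  ]
})
rs.status()   // shows which member is currently the primary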
On failure of a primary node
When a primary does not communicate with the other members of the set for more
than the configured electionTimeoutMillis period (10 seconds by default), an eligible
secondary calls for an election to nominate itself as the new primary.
The cluster attempts to complete the election of a new primary and resume normal
operations.
The replica set cannot process write operations until the election completes
successfully. The replica set can continue to serve read queries if such queries are
configured to run on secondaries while the primary is offline.
What is Sharding?
Sharding is a method for distributing data across multiple machines.
MongoDB uses sharding to support deployments with very large data sets and high
throughput operations.
Why Sharding?
Database systems with large data sets or high throughput applications can challenge
the capacity of a single server.
For example, high query rates can exhaust the CPU capacity of the server.
Working set sizes larger than the system's RAM stress the I/O capacity of disk drives.
Sharded Cluster in Mongo DB
A MongoDB sharded cluster consists of the following components:
shard: Each shard contains a subset of the sharded data. Each shard can be deployed
as a replica set.
mongos: The mongos acts as a query router, providing an interface between client
applications and the sharded cluster.
config servers: Config servers store metadata and configuration settings for the
cluster. As of MongoDB 3.4, config servers must be deployed as a replica set (CSRS).
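A hedged sketch of enabling sharding from the mongo shell connected to a mongos (the shop database and shard key are invented):

sh.enableSharding("shop")
// hash-partition the orders collection on customerId
sh.shardCollection("shop.orders", { customerId: "hashed" })
sh.status()   // reports shards, chunks, and balancer state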
Cassandra
The design goal of Cassandra is to handle big data workloads across multiple nodes without
any single point of failure. Cassandra has peer-to-peer distributed system across its nodes,
and data is distributed among all the nodes in a cluster.
All the nodes in a cluster play the same role. Each node is independent and at the
same time interconnected to other nodes.
Each node in a cluster can accept read and write requests, regardless of where the data
is actually located in the cluster.
When a node goes down, read/write requests can be served from other nodes in the
network.
Data Replication in Cassandra
Cassandra uses the Gossip Protocol in the background to allow the nodes to communicate
with each other and detect any faulty nodes in the cluster.
Components of Cassandra
The key components of Cassandra are as follows —
Node — It is the place where data is stored.
Data center — It is a collection of related nodes.
Cluster — A cluster is a component that contains one or more data centers.
Commit log — The commit log is a crash-recovery mechanism in Cassandra. Every
write operation is written to the commit log.
Mem-table — A mem-table is a memory-resident data structure. After commit log,
the data will be written to the mem-table. Sometimes, for a single-column family,
there will be multiple mem-tables.
SSTable — It is a disk file to which the data is flushed from the mem-table when its
contents reach a threshold value.
Bloom filter — These are nothing but quick, nondeterministic algorithms for testing
whether an element is a member of a set. It is a special kind of cache. Bloom filters
are accessed after every query.
Cassandra Query Language
Users can access Cassandra through its nodes using Cassandra Query Language
(CQL). CQL treats the database (keyspace) as a container of tables. Programmers use
cqlsh, a prompt to work with CQL, or separate application-language drivers.
Clients approach any of the nodes for their read-write operations. That node
(coordinator) acts as a proxy between the client and the nodes holding the data.
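A small CQL sketch of a keyspace and a table (the names, replication settings, and the uuid() generator call are illustrative):

CREATE KEYSPACE shop
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE shop;

CREATE TABLE orders (
  order_id uuid PRIMARY KEY,
  customer text,
  total    decimal
);

INSERT INTO orders (order_id, customer, total)
VALUES (uuid(), 'Ravi', 499.00);

SELECT * FROM orders;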
Write Operations
Every write activity of nodes is captured by the commit logs written in the nodes.
Later the data will be captured and stored in the mem-table. Whenever the mem-table
is full, data will be written into the SSTable data file. All writes are automatically
partitioned and replicated throughout the cluster. Cassandra periodically consolidates
the SSTables, discarding unnecessary data.
Read Operations
During read operations, Cassandra gets values from the mem-table and checks the
bloom filter to find the appropriate SSTable that holds the required data.
Data Models of Cassandra and RDBMS
RDBMS deals with structured data; Cassandra deals with unstructured data.
In RDBMS, a table is an array of arrays (ROW x COLUMN); in Cassandra, a table is a list of
"nested key-value pairs" (ROW x COLUMN key x COLUMN value).
In RDBMS, the database is the outermost container that contains data corresponding to an
application; in Cassandra, the keyspace is the outermost container.
Tables are the entities of a database; tables or column families are the entities of a keyspace.
RDBMS supports the concepts of foreign keys and joins; in Cassandra, relationships are
represented using collections.
CQLTypes
CQL provides a rich set of built-in data types, including collection types. Along with these
data types, users can also create their own custom data types. The following table provides a
list of built-in data types available in CQL.
Collection Types
Cassandra Query Language also provides a collection data types. The following table
provides a list of Collections available in CQL.
User-defined datatypes
Cqlsh provides users a facility of creating their own data types. Given below are the
commands used while dealing with user defined data types.
CREATE TYPE − Creates a user-defined datatype.
ALTER TYPE − Modifies a user-defined datatype.
DROP TYPE − Drops a user-defined datatype.
DESCRIBE TYPE − Describes a user-defined datatype.
DESCRIBE TYPES − Describes user-defined datatypes.
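For example, a user-defined type could be created and managed as sketched below (the type name address and its fields are hypothetical):

CREATE TYPE school.address (
   street text,
   city text,
   zip int
);

ALTER TYPE school.address ADD country text;
DESCRIBE TYPE school.address;
DROP TYPE school.address;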
HIVE
Datatypes
All the data types in Hive are classified into four groups, given as follows:
Column Types
Literals
Null Values
Complex Types
Column Types
Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using the integral data type INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
The integer data types are: TINYINT (1-byte signed integer, postfix Y), SMALLINT (2-byte signed integer, postfix S), INT (4-byte signed integer), and BIGINT (8-byte signed integer, postfix L).
Dates
DATE values are described in year/month/day format, written as YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is used for representing immutable arbitrary-precision decimal numbers. The syntax and an example are as follows:
DECIMAL (precision,scale)
Decimal (10,0)
Union Types
A union is a collection of heterogeneous data types. You can create an instance using the create_union UDF. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Floating Point Types
Floating point types are numbers with decimal points. Generally, this type of data is represented by the DOUBLE data type.
Decimal Type
Decimal type data is a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to C structs; each field can carry an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
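A table combining these complex types might be declared as sketched below (the table and column names are invented for illustration):

CREATE TABLE employee (
   name STRING,
   skills ARRAY<STRING>,
   phones MAP<STRING, STRING>,
   address STRUCT<street:STRING, city:STRING COMMENT 'City name'>
);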
ORIENTDB
OrientDB is the first multi-model open source NoSQL DBMS that brings together the power of graphs and the flexibility of documents in a scalable, high-performance operational database. The main feature of OrientDB is its support for multi-model objects: it supports the Document, Graph, Key/Value, and Object models, and it provides a separate API for each of these four models.
Document Model
The term Document model comes from NoSQL databases. It means the data is stored in documents, and a group of documents is called a collection. Technically, a document is a set of key/value pairs, also referred to as fields or properties.
OrientDB uses concepts such as classes, clusters, and links for storing, grouping, and analysing the documents.
The following table illustrates the comparison between the relational model, the document model, and the OrientDB document model:
Relational model    Document model    OrientDB document model
Table               Collection        Class or Cluster
Row                 Document          Document
Column              Key/value pair    Document field
Relationship        not available     Link
Graph Model
A graph data structure is a data model that can store data in the form of vertices (nodes) interconnected by edges (arcs). The idea of the OrientDB graph database came from the property graph. The vertex and the edge are the main artifacts of the graph model. They contain properties, which can make them appear similar to documents.
The following table shows a comparison between the graph model, the relational data model, and the OrientDB graph model:
Relational model    Graph model              OrientDB graph model
Table               Vertex and Edge class    Class that extends "V" (vertex) or "E" (edge)
Row                 Vertex                   Vertex
Column              Vertex/edge property     Vertex/edge property
Relationship        Edge                     Edge
The Object Model
This model is inherited from object-oriented programming and supports inheritance between types (sub-types extend super-types), polymorphism when you refer to a base class, and direct binding from/to the objects used in programming languages.
The following table illustrates the comparison between the relational model, the object model, and the OrientDB object model:
Relational model    Object model      OrientDB object model
Table               Class             Class or Cluster
Row                 Object            Document or Vertex
Column              Object property   Document field or Vertex/Edge property
Relationship        Pointer           Link
Record ID
When OrientDB generates a record, the database server automatically assigns a unique identifier to the record, called the Record ID (RID). The RID looks like #<cluster>:<position>, where <cluster> is the cluster identification number and <position> is the absolute position of the record in the cluster.
Documents
The Document is the most flexible record type available in OrientDB. Documents are softly typed and are defined by schema classes with defined constraints, but you can also insert documents without any schema, i.e. the schema-less mode is supported too. Documents can easily be handled by export and import in JSON format. For example, take a look at the following JSON sample document:
{
  "id": "1201",
  "name": "Jay",
  "job": "Developer",
  "creations": [
    {
      "name": "Amiga",
      "company": "Commodore Inc."
    },
    {
      "name": "Amiga 500",
      "company": "Commodore Inc."
    }
  ]
}
RecordBytes
The RecordBytes record type is the same as the BLOB type in an RDBMS. OrientDB can load and store this record type along with binary data.
Vertex
OrientDB database is not only a Document database but also a Graph database. The new
concepts such as Vertex and Edge are used to store the data in the form of graph. In graph
databases, the most basic unit of data is node, which in OrientDB is called a vertex. The
Vertex stores information for the database.
Edge
There is a separate record type called the Edge that connects one vertex to another. Edges are bidirectional and can only connect two vertices. There are two types of edges in OrientDB: regular and lightweight.
Class
A class is a type of data model, a concept drawn from the object-oriented programming paradigm. In the traditional document database model, data is stored in collections, while in the relational database model data is stored in tables. OrientDB follows the Document API along with the OOP paradigm. As a concept, the class in OrientDB has the closest relationship with the table in relational databases, but (unlike tables) classes can be schema-less, schema-full, or mixed. Classes can inherit from other classes, creating trees of classes. Each class has its own cluster or clusters (created by default, if none are defined).
Cluster
Cluster is an important concept which is used to store records, documents, or vertices. In
simple words, Cluster is a place where a group of records are stored. By default, OrientDB
will create one cluster per class. All the records of a class are stored in the same cluster
having the same name as the class. You can create up to 32,767(2^15-1) clusters in a
database.
The CREATE CLUSTER command is used to create a cluster with a specific name. Once the cluster is created, you can use it to save records by specifying its name during the creation of any data model.
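A minimal sketch in OrientDB SQL, assuming a hypothetical Employee class and employee_recent cluster:

CREATE CLASS Employee
CREATE CLUSTER employee_recent
ALTER CLASS Employee ADDCLUSTER employee_recent
INSERT INTO Employee SET name = 'Jay', job = 'Developer'
SELECT FROM Employee WHERE name = 'Jay'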
Relationships
OrientDB supports two kinds of relationships: referenced and embedded. A referenced relationship stores a direct link to the target objects of the relationship. An embedded relationship stores the relationship within the record that embeds it; this relationship is stronger than the referenced relationship.
Database
The database is an interface to access the real storage. It understands high-level concepts such as queries, schemas, metadata, indices, and so on. OrientDB also provides multiple database types.
UNIT: IV
XML DATABASES
1. Structured data
Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository, typically a database. It concerns all data that can be stored in a SQL database in a table with rows and columns. Structured data items have relational keys and can easily be mapped into pre-designed fields. Today, structured data is the most processed kind of data and the simplest to manage.
2. Semi-Structured data
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, you can store it in a relational database, though this can be very hard for some kinds of semi-structured data.
3. Unstructured data
Unstructured data is data that is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database. For unstructured data there are alternative platforms for storing and managing it. It is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications.
Differences between Structured, Semi-structured and Unstructured data:
Transaction management
Structured: matured transactions and various concurrency techniques.
Semi-structured: transaction support adapted from the DBMS, not matured.
Unstructured: no transaction management and no concurrency.
Query performance
Structured: structured queries allow complex joining.
Semi-structured: queries over anonymous nodes are possible.
Unstructured: only textual queries are possible.
What is XML?
XML stands for Extensible Markup Language. It is designed to be both human- and machine-readable.
It does not contain predefined tags; users define their own set of tags.
It is a smaller version of SGML.
It overcomes the drawbacks of HTML.
It is easy to understand and much more flexible than HTML.
It inherits the features of SGML and combines them with the features of HTML.
An XML document is just pure information wrapped in tags; someone must write a piece of software to send, receive, or display it.
It is used to exchange information between organizations and systems.
Features of XML :
XML is used both for storing and for transferring data.
For transferring, XML does not create a file and ship it; the content is sent directly.
When transferring data from a sender machine to a receiver machine, the data is transferred as an object.
The sender serializes the object, that is, converts the object to a byte stream. This stream can be formatted as XML, SOAP, JSON, or any other format.
The receiver receives the byte stream and deserializes it back to an object. This object should be equivalent to the one the sender sent, in that it holds the same data.
When you design your databases, you must decide whether your data is better suited to the XML
model or the relational model. Take advantage of the hybrid nature of Db2® databases that
supports both relational and XML data in a single database.
While this discussion explains some of the main differences between the models and the factors
that apply to each, there are numerous factors that can determine the most suitable choice for your
implementation. Use this discussion as a guideline to assess the factors that can impact your
specific implementation.
When the data design changes often
Relational tables follow a fairly rigid model. For example, normalizing one table into many or denormalizing many tables into one can be very difficult. If the data design changes often, representing it as XML data is a better choice. XML schemas can be evolved over time, for example.
When you need maximum performance for data retrieval
Some expense is associated with serializing (Serialization is the process of converting a
data object into a series of bytes that saves the state of the object in an easily transmittable
form.) and interpreting XML data. If performance is more of an issue than flexibility,
relational data might be the better choice.
When data is processed later as relational data
If subsequent processing of the data depends on the data being stored in a relational
database, it might be appropriate to store parts of the data as relational, using
decomposition. An example of this situation is when online analytical processing
(OLAP) is applied to the data in a data warehouse. Also, if other processing is required
on the XML document as a whole, then storing some of the data as relational as well as
storing the entire XML document might be a suitable approach in this case.
When data components have meaning outside a hierarchy
Data might be inherently hierarchical in nature, but the child components do not need the
parents to provide value. For example, a purchase order might contain part numbers. The
purchase orders with the part numbers might be best represented as XML documents.
However, each part number has a part description associated with it. It might be better to
include the part descriptions in a relational table, because the relationship between the
part numbers and the part descriptions is logically independent of the purchase orders in
which the part numbers are used.
When data attributes apply to all data, or to only a small subset of the data
Some sets of data have a large number of possible attributes, but only a small number of
those attributes apply to any particular data value. For example, in a retail catalog, there
are many possible data attributes, such as size, color, weight, material, style, weave,
power requirements, or fuel requirements. For any given item in the catalog, only a subset
of those attributes is relevant: power requirements are meaningful for a table saw, but not
for a coat. This type of data is difficult to represent and search with a relational model,
but relatively easy to represent and search with an XML model.
When the ratio of data complexity to volume is high
Many situations involve highly structured information in very small quantities.
Representation of that data with a relational model can involve complex star schemas in
which each dimension table is joined to many more dimension tables, and most of the
tables have only a few rows. A better way to represent this data is to use a single table
with an XML column, and to create views on that table, where each view represents a
dimension.
When referential integrity is required
XML columns cannot be defined as part of referential constraints. Therefore, if values in
XML documents need to participate in referential constraints, you should store the data
as relational data.
When the data needs to be updated often
You update XML data in an XML column only by replacing full documents. If you need
to frequently update small fragments of very large documents for a large number of rows,
it can be more efficient to store the data in non-XML columns. If, however, you are
updating small documents and only a few documents at a time, storing as XML can be
efficient as well.
XML – Documents
Structure of an XML
Document Prolog Section
Document Prolog comes at the top of the document, before the root element. This section contains
XML declaration
Document type declaration
Document Elements are the building blocks of XML. These divide the document into a hierarchy
of sections, each serving a specific purpose. You can separate a document into multiple sections
so that they can be rendered differently, or used by a search engine. The elements can be containers,
with a combination of text and other elements.
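A minimal sketch of a complete XML document, showing a prolog (XML declaration and document type declaration) followed by the document elements (the note.dtd file name and the element names are invented for illustration):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
  <body>Don't forget the meeting!</body>
</note>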
XML Comments
Syntax
<!-- Your comment here -->
We can use comments anywhere in an XML document except within attribute values.
Do not nest one comment inside another.
XML Vs HTML
XML stands for Extensible Markup Language; HTML stands for Hypertext Markup Language.
XML allows users to create their own tags and attributes; HTML tags and attributes are pre-determined and rigid.
In XML, content and format are separate and formatting is applied by external stylesheets; in HTML, content and formatting can be placed together, for example <p><font="Arial">text</font>.
XML is more effective for machine-to-machine interaction; HTML is more effective for machine-to-human interaction.
XML syntax is strictly defined, so it is mandatory to close all tags; HTML syntax is loosely defined compared to XML, and it is not necessary to close all tags.
XML is dynamic because it is used both to display and to transport data; HTML is static because it is used only to display data/content.
XML Elements
XML elements are represented by tags. XML elements behave as containers that store text, other elements, attributes, and media objects.
There is no limit on the number of elements that can be used in XML.
Elements usually consist of an opening tag and a closing tag, but together they are considered a single element.
Opening tags consist of <, followed by the element name, and ending with >.
Closing tags are the same but have a forward slash inserted between the less-than symbol and the element name.
Syntax
<tag>Data</tag>
Empty elements do not have a closing tag. They are closed by inserting a forward slash before the greater-than symbol:
<tag/>
An element can also contain child elements, for example:
<child>Data</child>
o All tags must be written using the correct case. XML sees <tutorial> as a different tag from <Tutorial>.
XML Elements Must Be Nested Properly
o You can place elements inside other elements, but you need to ensure each inner element is closed before its enclosing element is closed.
Wrong:
<Employee>
<Name> Mrs. Abi </Employee>
</Name>
Right:
<Employee>
<Name> Mrs. Abi </Name>
</Employee>
XML Attributes
Attributes are part of XML elements. An element can contain multiple attributes, and each attribute name must be unique within its element.
By the use of attributes we can add information about the element.
XML attributes enhance the properties of the elements.
Syntax
<tag attributeName="attributeValue">
Example
<author bookType="Classical">
The attribute value may use either single or double quotes:
<author bookType='Classical'>
The same information can also be modelled as a child element instead of an attribute:
<author bookType="Classical">
versus
<author>
<bookType>Classical</bookType>
</author>
An attribute name must appear only once in the same start-tag or empty-element tag.
The value of an attribute must be within quotation marks; either single or double quotes may be used.
Attributes must contain a value. Some HTML coders provide an attribute name without a value, implying true; this is not allowed in XML.
The values must not contain direct or indirect entity references to external entities.
Using an attribute-list declaration, an attribute must be declared in the Document Type Definition (DTD).
XML Tree Structure
Attributes cannot hold multiple values, but child elements can.
Attributes cannot contain tree structures, but child elements can.
Attributes are not easily expandable: if attribute values must change in the future, the change may be complicated.
Attributes cannot describe structure, but child elements can.
Attributes are more difficult to manipulate from program code.
Attribute values are not easy to test against a DTD, which is used to define the legal elements of an XML document.
Benefits of XML - Business Benefits
Information Sharing :
o XML defines data formats for building tools that read, write, and transform data between XML and other formats.
Content Delivery :
o XML supports different users and channels, and helps build more efficient applications.
o These channels include information delivery mechanisms such as digital TV, phone, the Web, and multimedia/touchscreen kiosks.
Technological Benefits:
XML Namespace
In XML, a namespace is used to prevent conflicts between element or attribute names.
o Because XML allows you to create your own element names, there is always the possibility of naming an element exactly the same as one in another XML document.
o That is fine if you never use both documents together.
o But if you want to combine the content of the two documents, you would have a name conflict: two different elements, with different purposes, both with the same name.
o In this case we use a namespace for the element to avoid the confusion.
Example Name Conflict
Imagine we have one XML document containing a list of books and another XML document describing an HTML page.
We will encounter a problem if we try to combine the two documents, because both have an element called title: one is the title of the book, the other is the title of the HTML page. We have a name conflict.
When we prevent this name conflict we want to create a namespace for the XML document.
Example for Namespace
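A sketch of how namespace prefixes resolve such a conflict (the books URI is a made-up placeholder; namespace URIs are identifiers, not addresses, so only their uniqueness matters):

<root xmlns:b="http://example.org/books"
      xmlns:h="http://www.w3.org/1999/xhtml">
  <b:book>
    <b:title>Harry Potter</b:title>
  </b:book>
  <h:head>
    <h:title>My Book Page</h:title>
  </h:head>
</root>

The two title elements no longer clash, because each is qualified by its own namespace.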
XML CDATA
CDATA stands for Character Data. A CDATA section is used to escape a block of text containing characters that would otherwise be interpreted as markup (such as < and &).
Syntax
<![CDATA[
characters that the parser should not interpret as markup
]]>
CDATA Rules
A CDATA section cannot contain the string "]]>", and CDATA sections cannot be nested.
XML DTD :
An XML Document Type Definition (DTD) defines the legal building blocks (elements and attributes) of an XML document.
Internal DTD
Syntax
<!DOCTYPE root-element [ element-declarations ]>
root-element is the name of the root element and element-declarations is where you declare the elements.
External DTD
Syntax
<!DOCTYPE root-element SYSTEM "file-name.dtd">
file-name is the name of the external file containing the element declarations.
XML Schema
An XML Schema describes the structure of an XML document, just like a DTD.
An XML document validated against an XML Schema is both "Well Formed" and "Valid".
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
With XML Schema, your XML files can carry a description of their own format.
With XML Schema, independent groups of people can agree on a standard for interchanging data.
One of the greatest strengths of XML Schemas is the support for data types:
Another great strength about XML Schemas is that they are written in XML:
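For illustration, an instance document that would be valid against the note schema above could look like this (the element content is invented):

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>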
XML Query:
1. XPath:
XPath is used to navigate through the elements and attributes of an XML document. It provides functions for string values, numeric values, booleans, date and time comparison, node manipulation, sequence manipulation, and much more.
XPath Terminology
Nodes
In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes.
XML documents are treated as trees of nodes. The topmost element of the tree is called the root
element.
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
Examples of nodes in the XML document above:
<bookstore> (root element node)
<author>J K. Rowling</author> (element node)
lang="en" (attribute node)
Atomic values
Atomic values are nodes with no children or parent. Examples of atomic values:
J K. Rowling
"en"
Items
Items are atomic values or nodes.
Relationship of Nodes
Parent
In the following example; the book element is the parent of the title, author, year, and price:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Children
In the following example; the title, author, year, and price elements are all children of the book
element:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Siblings
In the following example; the title, author, year, and price elements are all siblings:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Ancestors
In the following example; the ancestors of the title element are the book element and the
bookstore element:
<bookstore>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
Descendants
In the following example; descendants of the bookstore element are the book, title, author, year,
and price elements:
<bookstore>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
XPath Syntax:
XPath uses path expressions to select nodes or node-sets in an XML document. The node is
selected by following a path or steps.
The XML Example Document
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="en">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>
Selecting Nodes
XPath uses path expressions to select nodes in an XML document. The node is selected by
following a path or steps. The most useful path expressions are listed below:
Expression Description
nodename Selects all nodes with the name "nodename"
/ Selects from the root node
// Selects nodes in the document from the current node that match the selection no matter where they are
. Selects the current node
.. Selects the parent of the current node
@ Selects attributes
In the table below we have listed some path expressions and the result of the expressions:
/bookstore Selects the root element bookstore
Note: If the path starts with a slash ( / ) it always represents an
absolute path to an element!
//book Selects all book elements no matter where they are in the document
bookstore//book Selects all book elements that are descendant of the bookstore
element, no matter where they are under the bookstore element
Predicates
Predicates are used to find a specific node or a node that contains a specific value.
In the table below we have listed some path expressions with predicates and the result of the
expressions:
/bookstore/book[1] Selects the first book element that is the child of the bookstore element.
Note: In older versions of Internet Explorer the first node is [0], while according to W3C it is [1]. To follow the W3C standard in IE, set the selection language in JavaScript: xml.setProperty("SelectionLanguage","XPath");
/bookstore/book[last()] Selects the last book element that is the child of the bookstore element
/bookstore/book[last()-1] Selects the last but one book element that is the child of the bookstore element
/bookstore/book[position()<3] Selects the first two book elements that are children of the bookstore element
//title[@lang] Selects all the title elements that have an attribute named
lang
//title[@lang='en'] Selects all the title elements that have a "lang" attribute
with a value of "en"
/bookstore/book[price>35.00]/title Selects all the title elements of the book elements of the
bookstore element that have a price element with a value
greater than 35.00
Wildcard Description
* Matches any element node
@* Matches any attribute node
node() Matches any node of any kind
In the table below we have listed some path expressions and the result of the expressions:
Path Expression Result
/bookstore/* Selects all the child element nodes of the bookstore element
//title[@*] Selects all title elements which have at least one attribute of any kind
By using the | operator in an XPath expression you can select several paths.
In the table below we have listed some path expressions and the result of the expressions:
Path Expression Result
//book/title | //book/price Selects all the title AND price elements of all book elements
//title | //price Selects all the title AND price elements in the document
/bookstore/book/title | //price Selects all the title elements of the book element of the
bookstore element AND all the price elements in the document
XPath Operators
Operator Description Example
+ Addition 6 + 4
- Subtraction 6 - 4
* Multiplication 6 * 4
= Equal price=9.80
or or price=9.80 or price=9.70
XPath Examples
"books.xml":
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="web">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
Loading the XML Document
Using an XMLHttpRequest object to load XML documents is supported in all modern browsers.
Selecting Nodes
Unfortunately, there are different ways of dealing with XPath in different browsers.
Chrome, Firefox, Edge, Opera, and Safari use the evaluate() method to select nodes:
xmlDoc.evaluate(xpath, xmlDoc, null, XPathResult.ANY_TYPE, null);
Internet Explorer instead uses the selectNodes() method:
xmlDoc.selectNodes(xpath);
In our examples we have included code that should work with most major browsers.
Select all the titles
The following example selects all the title nodes:
Example
/bookstore/book/title
Select the title of the first book
The following example selects the title of the first book node under the bookstore element:
Example
/bookstore/book[1]/title
Select all the prices
The following example selects the text from all the price nodes:
Example
/bookstore/book/price/text()
The following example selects all the price nodes with a price higher than 35:
Example
/bookstore/book[price>35]/price
Select title nodes with price>35
The following example selects all the title nodes with a price higher than 35:
Example
/bookstore/book[price>35]/title
2. XQuery:
XQuery is a query and functional programming language. XQuery provides facilities to extract and manipulate data from XML documents or from any data source that can be viewed as XML, such as a relational database.
XQuery defines the FLWOR expression, which supports iteration and the binding of variables to intermediate results.
FLWOR is an abbreviation of FOR, LET, WHERE, ORDER BY, RETURN, which are explained as follows:
XQuery is case-sensitive
XQuery elements, attributes, and variables must be valid XML names
An XQuery string value can be in single or double quotes
An XQuery variable is defined with a $ followed by a name, e.g. $bookstore
XQuery comments are delimited by (: and :), e.g. (: XQuery Comment :)
XQuery comparisons:
XQuery supports general comparisons (=, !=, <, <=, >, >=) and value comparisons (eq, ne, lt, le, gt, ge).
Example:
Let us take an example to understand how to write an XQuery comparison. The query sketched below returns a value (the names of the students) only if the value of marks is greater than 700.
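A minimal sketch of such a query, assuming a hypothetical students.xml whose <student> elements contain <name> and <marks> children:

for $s in doc("students.xml")/students/student
where $s/marks > 700
return $s/name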
"books.xml":
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
The doc() function is used to open the "books.xml" file:
doc("books.xml")
Path Expressions
The following path expression is used to select all the title elements in the "books.xml" file:
doc("books.xml")/bookstore/book/title
(/bookstore selects the bookstore element, /book selects all the book elements under the
bookstore element, and /title selects all the title elements under each book element)
Result:
<title lang="en">Everyday Italian</title>
<title lang="en">Harry Potter</title>
<title lang="en">XQuery Kick Start</title>
<title lang="en">Learning XML</title>
Predicates
XQuery uses predicates to limit the extracted data from XML documents.
The following predicate is used to select all the book elements under the bookstore element that
have a price element with a value that is less than 30:
doc("books.xml")/bookstore/book[price<30]
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
doc("books.xml")/bookstore/book[price>30]/title
The expression above will select all the title elements under the book elements that are under the
bookstore element that have a price element with a value that is higher than 30.
The following FLWOR expression will select exactly the same as the path expression above:
for $x in doc("books.xml")/bookstore/book
where $x/price>30
return $x/title
With an order by clause the result can also be sorted:
for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title
The for clause selects all book elements under the bookstore element into a variable called $x.
The where clause selects only book elements with a price element whose value is greater than 30.
The order by clause defines the sort order; the result will be sorted by the title element.
The return clause specifies what should be returned; here it returns the title elements.
UNIT: V
Introduction to Information Retrieval and Web Search
1. Why information retrieval is necessary? List the Retrieval Models and explain Boolean
and Vector space model with example.
Information retrieval
Process of retrieving documents from a collection in response to a query (search request)
Deals mainly with unstructured data
Example: home buying contract documents
Unstructured information
Does not have a well-defined formal model
Based on an understanding of natural language
Stored in a wide variety of standard formats
Information retrieval field predates database
Academic programs in Library and Information Science
RDBMS vendors providing new capabilities to support various data types
Extended RDBMSs or object-relational database management systems
User’s information need expressed as free-form search request
Keyword search query
Characterizing an IR system
TYPES OF USERS
Users can vary greatly in their ability to interact with a computational environment.
Expert
The user may be an expert user (for example, a curator or a librarian) who is searching for specific information and forms relevant queries for the task.
Layperson
Or the user may be a layperson with a generic information need (for example, students trying to find information about a new topic, or researchers trying to assimilate different points of view about a historical issue).
Types of data
search systems can be tailored to specific types of data
For example, the problem of retrieving information about a specific topic may be
handled more efficiently by customized search systems that are built to collect and
retrieve only information related to that specific topic
Domain specific
The information repository could be hierarchically organized based on a concept or
topic hierarchy. These topical domain-specific or vertical IR systems are not as large
as or as diverse as the generic World Wide Web, which contains information on all
kinds of topics.
Types of information needs
In the context of Web search, users’ information needs may be defined as navigational,
informational, or transactional.
Navigational search
Navigational search refers to finding a particular piece of information (such as the Georgia Tech University Web site).
Informational search
The goal of informational search is to find current information about a topic (such as research activities at a university).
Transactional search
The goal of transactional search is to reach a site where further interaction happens, resulting in some transactional event (such as joining a social network or shopping for products).
Enterprise search systems
Limited to an intranet
Desktop search engine
Searches an individual computer system
Databases have fixed schemas
IR system has no fixed data model
Comparing Databases and IR Systems
Databases:
Structured data
Schema driven
Relational (or object, hierarchical, and network) model is predominant
Structured query model
Rich metadata operations
Query returns data
Results are based on exact matching (always correct)
IR Systems:
Unstructured data
No fixed schema; various data models (e.g., vector space model)
Free-form query models
Rich data operations
Search request returns list or pointers to documents
Results are based on approximate matching and measures of effectiveness (may be imprecise and ranked)
A Brief History of IR
Stone tablets and papyrus scrolls
Printing press
Public libraries
Computers and automated storage systems
Inverted file organization based on keywords and their weights as indexing method
Search engine
Crawler
Challenge: provide high quality, pertinent, timely information
Modes of Interactions in IR Systems
Primary modes of interaction
Retrieval
Extract relevant information from document repository
Browsing
Exploratory activity based on user’s assessment of relevance
Web search combines both interaction modes
Rank of a web page measures its relevance to query that generated the result set
Generic IR Pipeline
Statistical approach
Documents analyzed and broken down into chunks of text
Each word or phrase is counted, weighted, and measured for relevance or importance
Types of statistical approaches
Boolean
Vector space
Probabilistic
Semantic approaches
Use knowledge-based retrieval technique
Rely on syntactic, lexical, sentential, discourse-based, and pragmatic levels of
knowledge understanding
Also apply some form of statistical analysis
Retrieval Models
Boolean model - One of earliest and simplest IR models
In the Boolean retrieval model we can pose any query in the form of a Boolean expression
of terms i.e., one in which terms are combined with the operators and, or, and not.
Example: Shakespeare
Brutus AND Caesar AND NOT Calpurnia
Which plays of Shakespeare contain the words Brutus and Caesar, but not
Calpurnia?
Naive solution: a linear scan through all the text ("grepping"). In this case it works OK (Shakespeare's collected works contain less than 1M words).
But in the general case, with much larger text collections, we need to index.
Indexing is an offline operation that collects data about which words occur in a text, so that at search time you only have to access the precompiled index.
Main idea: record for each document whether it contains each word out of all the different words Shakespeare used (about 32K).
Matrix element (t, d) is 1 if the play in column d contains the word in row t, and 0 otherwise.
For much larger collections, however, we can't build the full term-document incidence matrix; it would be too large and extremely sparse, which is why an inverted index is used instead.
Vector space model
Weighting, ranking, and determining relevance are possible
Uses individual terms as dimensions
Each document represented by an n-dimensional vector of values
Features
Subset of terms in a document set that are deemed most relevant to an IR search for the
document set
Different similarity assessment functions can be used.
Term frequency-inverse document frequency (TF-IDF)
Statistical weight measure used to evaluate the importance of a document word in a
collection of documents
A discriminating term must occur in only a few documents in the general population
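One common formulation of the TF-IDF weight of a term t in a document d (a standard textbook version, not the only variant) is:

w(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the frequency of term t in document d, N is the total number of documents in the collection, and df(t) is the number of documents that contain t. A term that appears in only a few documents gets a high IDF factor and is therefore more discriminating.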
Probabilistic model
Involves ranking documents by their estimated probability of relevance with respect
to the query and the document
IR system must decide whether a document belongs to the relevant set or nonrelevant
set for a query
Calculate probability that document belongs to the relevant set
BM25: a popular ranking algorithm
Semantic model
Morphological analysis
Analyze roots and affixes to determine parts of speech of search words
Syntactic analysis
Parse and analyze complete phrases in documents
Semantic analysis
Resolve word ambiguities and generate relevant synonyms based on semantic relationships.
Uses techniques from artificial intelligence and expert systems
2. Explain the Types of Queries in IR Systems.
Types of Queries in IR Systems.
Keyword queries
Simplest and most commonly used
Keyword terms implicitly connected by logical AND
Boolean queries
Allow use of AND, OR, NOT, and other operator
Exact matches returned
No ranking possible
Phrase queries
Sequence of words that make up a phrase
Phrase enclosed in double quotes
Each retrieved document must contain at least one instance of the exact phrase
Proximity queries
How close within a record multiple search terms are to each other
Phrase search is most commonly used proximity query
Specify order of search terms
NEAR, ADJ (adjacent), or AFTER operator
Sequence of words with maximum allowed distance between them
Computationally expensive
Suitable for smaller document collections rather than the Web
Wildcard queries
Supports regular expressions and pattern-based matching
Example ‘data*’ would retrieve data, database, dataset, etc.
Not generally implemented by Web search engines
Natural language queries
Definitions of textual terms or common facts
Semantic models can support
3. Discuss the different methods used in Text Preprocessing.
Text Preprocessing
Stopword removal must be performed before indexing
Words that are expected to occur in 80% or more of the documents of a collection
Examples: the, of, to, a, and, said, for, that
Do not contribute much to relevance
Queries preprocessed for stopword removal before retrieval process
Many search engines do not remove stopwords
Stemming
Trims suffix and prefix
Reduces the different forms of the word to a common stem
Martin Porter’s stemming algorithm
Utilizing a thesaurus
Important concepts and main words that describe each concept for a particular
knowledge domain
Collection of synonyms
UMLS (Unified Medical Language System)
Other preprocessing steps
Digits
May or may not be removed during preprocessing
Hyphens and punctuation marks
Handled in different ways
Cases
Most search engines use case-insensitive search
Information extraction tasks
Identifying noun phrases, facts, events, people, places, and relationships
Inverted Indexing
Inverted index structure
Vocabulary information
Set of distinct query terms in the document set
Document information
Data structure that attaches distinct terms with a list of all documents that contain
the term
Construction of an inverted index
Break documents into vocabulary terms by
tokenizing, cleansing, removing stopwords, stemming, and/or using a thesaurus
Collect document statistics
Store statistics in document lookup table
Invert the document-term stream into a term-document stream
Add additional information such as term frequencies, term positions, and term weights
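A small worked example with invented documents: given
D1 = "database search systems", D2 = "database design", D3 = "web search engines"
the inverted index maps each vocabulary term to the list of documents containing it:
database → {D1, D2}
search → {D1, D3}
systems → {D1}
design → {D2}
web → {D3}
engines → {D3}
A query such as database AND search is then answered by intersecting the lists for database and search, giving {D1}.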
4. Explain how the effectiveness of retrieval is evaluated (precision, recall, and the F-score).
Average precision
Computed based on the precision at each relevant document in the ranking
Recall/precision curve
Based on the recall and precision values at each rank position
x-axis is recall and y-axis is precision
F-score
Harmonic mean of the precision (p) and recall (r) values
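The standard definitions behind these measures, with a small invented example:

precision p = (number of relevant documents retrieved) / (total number of documents retrieved)
recall r = (number of relevant documents retrieved) / (total number of relevant documents in the collection)
F = 2pr / (p + r)

For instance, if a query retrieves 10 documents, 8 of which are relevant, while the collection contains 20 relevant documents in total, then p = 8/10 = 0.8, r = 8/20 = 0.4, and F = 2(0.8)(0.4)/(0.8 + 0.4) ≈ 0.53.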
5. Explain in detail about web search and analysis.
Search engines must crawl and index web sites and document collections
Regularly update indexes
Link analysis used to identify page importance
Vertical search engines
Customized topic-specific search engines that crawl and index a specific collection of
documents on the Web
Metasearch engines
Query different search engines simultaneously and aggregate information
Digital libraries
Collections of electronic resources and services for the delivery of materials in a variety
of formats
Web analysis
Applies data analysis techniques to discover and analyze useful information from the
Web
Goals of Web analysis
Finding relevant information
Personalization of the information
Finding information of social value
Categories of Web analysis
Web structure analysis
Web content analysis
Web usage analysis
Web structure analysis
Hyperlink
Destination page
Anchor text
Hub
PageRank ranking algorithm
Used by Google
Analyzes forward links and backlinks
Highly linked pages are more important
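In the original formulation by Brin and Page, the PageRank of a page A is:

PR(A) = (1 − d) + d × (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

where T1 … Tn are the pages that link to A, C(Ti) is the number of outgoing links on page Ti, and d is a damping factor, commonly set to about 0.85. A page is ranked highly when many pages, themselves highly ranked, link to it.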
Web content analysis tasks
Structured data extraction
Wrapper
Web information integration
Web query interface integration
Schema matching
Ontology-based information integration
Building concept hierarchies
Segmenting web pages and detecting noise
Approaches to Web content analysis
Agent-based
Intelligent Web agents
Personalized Web agents
Information filtering/categorization
Database-based
Attempts to organize a Web site as a database
Object Exchange Model
Multilevel database
Web query system
Web usage analysis attempts to discover usage patterns from Web data
Preprocessing
Usage, content, structure
Pattern discovery
Statistical analysis, association rules, cluster classification, sequential
patterns, dependency modeling
Pattern analysis
Filter out patterns not of interest
Practical applications of Web analysis
Web analytics
Understand and optimize the performance of Web usage
Web spamming
Deliberate activity to promote a page by manipulating search engine results
Web security
Allow design of more robust Web sites
Web crawlers
6. Discuss the Trends in Information Retrieval.
Trends in Information Retrieval
Faceted search
Classifying content
Social search
Collaborative social search
Conversational information access
Intelligent agents perform intent extraction to provide information relevant to a
conversation
Probabilistic topic modeling
Automatically organize large collections of documents into relevant themes
Question-answering systems
Factoid questions
List questions
Definition questions
Opinion questions
Composed of question analysis, query generation, search, candidate answer
generation, and answer scoring