MC4202 - Advanced Database Technology
Features
In a homogeneous distributed database, all the sites use identical DBMS and
operating systems. Its properties are −
Architectural Models
In these systems, each peer acts both as a client and a server for providing
database services. The peers share their resources with other peers and
coordinate their activities.
Example: Consider that we have three departments, all using Oracle 9i as their
DBMS. If a change is made in one department, it is propagated to the other
departments as well.
2. Heterogeneous distributed database system.
MySQL
Oracle
SQL Server
dBASE
FoxPro
PostgreSQL, etc.
Types of DBMS
Hierarchical DBMS
Network DBMS is one where the relationships among data in the database are
of type many-to-many in the form of a network. The structure is generally
complicated due to the existence of numerous many-to-many relationships.
Network DBMS is modelled using “graph” data structure.
Relational DBMS
Distributed DBMS
Operations on DBMS
The four basic operations on a database are Create, Retrieve, Update and Delete.
Example SQL command to insert a single tuple into the student table −
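The INSERT statement itself is not shown here; a minimal sketch, assuming the
STUDENT table has columns such as ROLL_NO, NAME, STREAM and YEAR (illustrative
names, not taken from the original schema):

INSERT INTO STUDENT (ROLL_NO, NAME, STREAM, YEAR)
VALUES (101, 'ARUN', 'ELECTRONICS', 2);

Example − To change the stream 'ELECTRONICS' to 'ELECTRONICS AND
COMMUNICATIONS' for all students, we use the SQL command −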
UPDATE STUDENT
SET STREAM = 'ELECTRONICS AND COMMUNICATIONS'
WHERE STREAM = 'ELECTRONICS';
Example − To delete the records of all students who are currently in the 4th
year (when they pass out), we use the SQL command −
DELETE FROM STUDENT
WHERE YEAR = 4;
• Fragmentation. The system partitions the relation into several fragments, and
stores each fragment at a different site.
Data Replication
Data replication is the process of storing separate copies of the database at two
or more sites. It is a popular fault tolerance technique of distributed databases.
Snapshot replication
Near-real-time replication
Pull replication
Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The
subsets of the table are called fragments. Fragmentation can be of three types:
horizontal, vertical, and hybrid (combination of horizontal and vertical).
Horizontal fragmentation can further be classified into two techniques: primary
horizontal fragmentation and derived horizontal fragmentation.
Advantages of Fragmentation
Since data is stored close to the site of usage, efficiency of the database
system is increased.
Local query optimization techniques are sufficient for most queries since
data is locally available.
Since irrelevant data is not available at the sites, security and privacy of
the database system can be maintained.
Disadvantages of Fragmentation
When data from different fragments are required, the access speeds may
be very low.
In case of recursive fragmentations, the job of reconstruction will need
expensive techniques.
Lack of back-up copies of data in different sites may render the database
ineffective in case of failure of a site.
1. Synchronous Replication:
In synchronous replication, the replica will be modified immediately after some
changes are made in the relation table. So there is no difference between
original data and replica.
2. Asynchronous replication:
In asynchronous replication, the replica is updated some time after the commit
is fired on the database, so it may temporarily lag behind the original data.
Replication Schemes
The three replication schemes are as follows:
1. Full Replication
In the full replication scheme, a complete copy of the database is stored at
every site in the communication network, so the data is available to almost
every location or user.
2. No Replication
No replication means, each fragment is stored exactly at one location.
Advantages of no replication
Disadvantages of no replication
For example, let us consider that a University database keeps records of all
registered students in a Student table having the following schema.
STUDENT
Now, the fees details are maintained in the accounts section. In this case, the
designer will fragment the database as follows −
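The fragment definitions themselves are not reproduced here; a minimal sketch
of the vertical fragmentation, assuming the STUDENT table has columns such as
Regd_No, Name, Course, Address and Fees (illustrative, since the schema is not
shown here):

CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;

The STD_FEES fragment, containing only the registration number and fees, can
then be stored at the accounts section's site, while the remaining attributes
stay at the main site.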
For example, in the student schema, if the details of all students of the
Computer Science course need to be maintained at the School of Computer
Science, then the designer will horizontally fragment the database as follows −
CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE COURSE = 'Computer Science';
Hybrid Fragmentation
Distributed Transactions
Each high-level operation can be divided into a number of low-level tasks or
operations. For example, a data update operation can be divided into three tasks −
read_item() − reads the data item from storage into main memory.
modify_item() − changes the value of the item in main memory.
write_item() − writes the modified value from main memory back to storage.
Transaction Operations
Transaction States
Active − The initial state where the transaction enters is the active state.
The transaction remains in this state while it is executing read, write or
other operations.
Partially Committed − The transaction enters this state after the last
statement of the transaction has been executed.
Committed − The transaction enters this state after successful
completion of the transaction and system checks have issued commit
signal.
Failed − The transaction goes from partially committed state or active
state to failed state when it is discovered that normal execution can no
longer proceed or system checks fail.
Aborted − This is the state after the transaction has been rolled back after
failure and the database has been restored to its state that was before the
transaction began.
The following state transition diagram depicts the states in the transaction and
the low-level transaction operations that cause changes in states.
Desirable Properties of Transactions
T1                     T2
Read(A)                Read(B)
A := A - 100           B := B + 100
Write(A)               Write(B)
Schedule
1. Serial Schedule
For example: Suppose there are two transactions T1 and T2 which have some
operations. If it has no interleaving of operations, then there are the following
two possible outcomes:
1. Execute all the operations of T1, followed by all the operations of T2.
2. Execute all the operations of T2, followed by all the operations of T1.
In the given (a) figure, Schedule A shows the serial schedule where T1
followed by T2.
In the given (b) figure, Schedule B shows the serial schedule where T2
followed by T1.
2. Non-serial Schedule
Validity — The value that’s decided upon should have been proposed by some
process
This protocol requires a coordinator. The client contacts the coordinator and
proposes a value. The coordinator then tries to establish the consensus among a
set of processes in two phases, hence the name.
1. In the first phase, the coordinator contacts all the participants, suggests the
value proposed by the client, and solicits their responses.
2. After receiving all the responses, the coordinator makes a decision to commit
if all participants agreed upon the value or abort if someone disagrees.
When speaking about failures, what are the types of failures of a node?
Fail-recover model − nodes crash, recover after a certain time, and continue
executing.
This is an extension of two-phase commit wherein the commit phase is split into
two phases as follows.
a. Prepare to commit − after unanimously receiving yes in the first phase of 2PC,
the coordinator asks all participants to prepare to commit. During this phase, all
participants acquire locks, etc., but they do not actually commit.
b. If the coordinator receives yes from all participants during the prepare-to-
commit phase, then it asks all participants to commit.
The pre-commit phase introduced above helps us recover from a participant
failure, or from a combined coordinator and participant failure, during the
commit phase. When a recovery coordinator takes over after the coordinator fails
during phase 2 of the previous 2PC, the new pre-commit phase comes in handy as
follows. On querying the participants, if it learns that some nodes are in the
commit phase, it assumes that the previous coordinator made the decision to
commit before crashing, and hence it can shepherd the protocol to commit.
Similarly, if a participant reports that it never received prepare-to-commit,
the new coordinator can assume that the previous coordinator failed before it
even started the prepare-to-commit phase. Hence it can safely assume that no
other participant has committed the changes and safely abort the transaction.
Concurrency Control
Concurrency Control is the working concept that is required for controlling and
managing the concurrent execution of database operations and thus avoiding the
inconsistencies in the database. Thus, for maintaining the concurrency of the
database, we have the concurrency control protocols.
Lock-Based Protocol
In this type of protocol, a transaction cannot read or write data until it
acquires an appropriate lock on it. There are two types of lock:
1. Shared lock:
It is also known as a read-only lock. Under a shared lock, the data item can
only be read by the transaction.
It can be shared between transactions because a transaction holding a shared
lock cannot update the data item.
2. Exclusive lock:
Under an exclusive lock, the data item can be both read and written by the
transaction.
This lock is exclusive: while it is held, multiple transactions cannot modify
the same data item simultaneously.
Growing phase: In the growing phase, a new lock on the data item may be
acquired by the transaction, but none can be released.
Shrinking phase: In the shrinking phase, existing locks held by the transaction
may be released, but no new locks can be acquired.
In the below example, if lock conversion is allowed then the following phase
can happen:
1. Upgrading of lock (from S(a) to X (a)) is allowed in growing phase.
2. Downgrading of lock (from X(a) to S(a)) must be done in shrinking
phase.
Example:
The following way shows how unlocking and locking work with 2-PL.
Transaction T1:
Transaction T2:
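The lock and unlock steps themselves are not reproduced here; a minimal
illustrative sequence (the data items and the order of operations are
assumptions for this sketch):

Transaction T1: lock-X(A); read(A); lock-X(B); read(B); write(A); write(B); unlock(A); unlock(B)
Transaction T2: lock-S(A); read(A); lock-S(B); read(B); unlock(A); unlock(B)

In both transactions every lock is acquired before any lock is released, so each
transaction has a growing phase followed by a shrinking phase, which is exactly
what 2-PL requires.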
Where
Validation (Ti): It contains the time when Ti finishes its read phase and starts
its validation phase.
Conflict Graphs
Another method is to create conflict graphs. For this transaction classes are
defined. A transaction class contains two set of data items called read set and
write set. A transaction belongs to a particular class if the transaction’s read set
is a subset of the class’ read set and the transaction’s write set is a subset of the
class’ write set. In the read phase, each transaction issues its read requests for
the data items in its read set. In the write phase, each transaction issues its write
requests.
A conflict graph is created for the classes to which active transactions belong.
This contains a set of vertical, horizontal, and diagonal edges. A vertical edge
connects two nodes within a class and denotes conflicts within the class. A
horizontal edge connects two nodes across two classes and denotes a write-write
conflict among different classes. A diagonal edge connects two nodes across
two classes and denotes a write-read or a read-write conflict among two classes.
The conflict graphs are analyzed to ascertain whether two transactions within
the same class or across two different classes can be run in parallel.
Query Processing
Thus, to make the system understand the user query, it needs to be translated
into the form of relational algebra. We can bring the query into relational
algebra form as:
After translating the given query, we can execute each relational algebra
operation by using different algorithms. So, in this way, a query processing
begins its working.
Evaluation
Optimization
The cost of query evaluation can vary for different types of queries.
Since the system is responsible for constructing the evaluation plan,
the user need not write the query efficiently.
Usually, a database system generates an efficient query evaluation plan,
which minimizes the cost. This type of task, performed by the database
system, is known as query optimization.
For optimizing a query, the query optimizer should have an estimated
cost analysis of each operation. It is because the overall operation cost
depends on the memory allocations to several operations, execution costs,
and so on.
Finally, after selecting an evaluation plan, the system evaluates the query and
produces the output of the query.
Distributed Transactions
7. What is DDBMS?
A distributed database management system (DDBMS) integrates data logically so that it
can be managed as if it were all stored in the same location. The DDBMS synchronizes all the
data periodically and ensures that data updates and deletes performed at one location will be
automatically reflected in the data stored elsewhere.
Karpaga Vinayaga College of Engineering and Technology
Master of Computer Applications
MC4202 - ADVANCED DATABASE TECHNOLOGY
UNIT II
SPATIAL AND TEMPORAL DATABASES
Active Databases Model – Design and Implementation Issues - Temporal Databases -
Temporal Querying - Spatial Databases: Spatial Data Types, Spatial Operators and Queries –
Spatial Indexing and Mining – Applications -– Mobile Databases: Location and Handoff
Management, Mobile Transaction Models – Deductive Databases - Multimedia Databases.
Active Databases
[ WHEN <condition> ]
<trigger actions> ;
Trigger Example
When a new employee is added to a department, modify the Total_sal of
the department to include the new employee's salary
Logically this means that we will CREATE a TRIGGER, let us call
the trigger Total_sal1
This trigger will execute AFTER INSERT ON Employee
table
It will do the following FOR EACH ROW
WHEN NEW.Dno is NOT NULL
The trigger will UPDATE DEPARTMENT
By Setting the new Total_sal to be the sum of
old Total_sal and NEW.Salary
WHERE the Dno matches the NEW.Dno;
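Written out in SQL trigger syntax, the rule described above would look roughly
as follows (a sketch based on the standard company schema used in this example;
it mirrors trigger R3 shown later for deletions):

CREATE TRIGGER Total_sal1
AFTER INSERT ON EMPLOYEE
FOR EACH ROW
WHEN (NEW.Dno IS NOT NULL)
UPDATE DEPARTMENT
SET Total_sal = Total_sal + NEW.Salary
WHERE Dno = NEW.Dno;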
CREATE or ALTER TRIGGER
CREATE TRIGGER <name>
Creates a trigger
ALTER TRIGGER <name>
Alters a trigger (assuming one exists)
CREATE OR ALTER TRIGGER <name>
Creates a trigger if one does not exist
Alters a trigger if one does exist
Works in both cases, whether a trigger exists or not
AFTER
Executes after the event
BEFORE
Executes before the event
INSTEAD OF
Executes instead of the event
Note that event does not execute in this case
R3: CREATE TRIGGER TOTAL_SAL3
AFTER DELETE ON EMPLOYEE
FOR EACH ROW
WHEN (OLD.Dno IS NOT NULL)
UPDATE DEPARTMENT
SET Total_sal = Total_sal - OLD.Salary
WHERE Dno = OLD.Dno;
Design and Implementation Issues for Active Databases
The previous section gave an overview of some of the main concepts for
specifying active rules. In this section, we discuss some additional issues
concerning how rules are designed and implemented. The first issue concerns
activation, deactivation, and grouping of rules.
The second issue concerns whether the triggered action should be executed
before, after, instead of, or concurrently with the triggering event. A before
trigger executes the trigger before executing the event that caused the trigger. It
can be used in applications such as checking for constraint violations. An after
trigger executes the trigger after executing the event, and it can be used in
applications such as maintaining derived data and monitoring for specific events
and conditions. An instead of trigger executes the trigger instead of executing
the event, and it can be used in applications such as executing corresponding
updates on base relations in response to an event that is an update of a view.
The rule condition evaluation is also known as rule consideration, since the
action is to be executed only after considering whether the condition evaluates
to true or false. There are three main possibilities for rule consideration:
The next set of options concerns the relationship between evaluating the rule
condition and executing the rule action. Here, again, three options are possible:
immediate, deferred, or detached execution. Most active systems use the first
option. That is, as soon as the condition is evaluated, if it returns true, the action
is immediately executed.
Temporal Aspects
Valid Time: Time period during which a fact is true in real world,
provided to the system.
Transaction time
o The time when the information from a certain transaction is recorded
in (becomes current in) the database
Temporal Relation
Temporal Relation is one where each tuple has associated time; either valid time
or transaction time or both associated with it.
Point events
Single time point event
E.g., bank deposit
Series of point events can form a time series data
Duration events
Associated with specific time period
Time period is represented by start time and end time
Transaction time
The time when the information from a certain transaction is recorded in
(becomes current in) the database
Bitemporal database
Databases dealing with two time dimensions
The following table summarizes the storage requirements and ranges for the
date and time data types.
The DATE data type represents date values in 'YYYY-MM-DD' format. The
supported range of DATE values is '1000-01-01' to '9999-12-31'. You might be
able to use earlier dates than that, but it's better to stay within the supported
range to avoid unexpected behavior.
The TIME data type represents time values in 'hh:mm:ss' format. The range
of TIME columns is '-838:59:59' to '838:59:59'. This is outside the time-of-day
range of '00:00:00' to '23:59:59' because TIME columns can be used to
represent elapsed time. Thus, values might be larger than time-of-day values, or
even negative.
The YEAR data type represents year-only values. You can declare such
columns as YEAR(4) or YEAR(2) to obtain a four-digit or two-digit display
format. If you don't specify any display width, the default is four digits.
The TIMESTAMP Data Type
The TIMESTAMP type, like DATETIME, stores date-and-time values, but has
a different range and some special properties that make it especially suitable for
tracking data modification times.
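A minimal sketch of how TIMESTAMP is typically used for this purpose (the table
and column names are illustrative):

CREATE TABLE t_log (
  id INT,
  ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

With such a definition, MySQL fills in the current date and time automatically
whenever a row is inserted or updated. The error shown below arises because
older MySQL versions allow only one TIMESTAMP column per table to be maintained
automatically in this way.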
Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings:
0
ERROR 1293 (HY000): Incorrect table definition; there can be only one
TIMESTAMP column with CURRENT_TIMESTAMP in DEFAULT or ON
UPDATE clause
A road map is a 2-dimensional object which contains points, lines, and polygons
that can represent cities, roads.
The three basic types of features are points, lines, and polygons (or areas).
Points are used to represent spatial characteristics of objects whose locations
correspond to a single 2-d coordinate (x, y, or longitude/latitude) in the scale of
a particular application.
R-trees are tree data structures used for spatial access methods, i.e., for
indexing multi-dimensional information such as geographical
coordinates, rectangles or polygons.
Quad trees are trees used to efficiently store data of points on a two-
dimensional space. In this tree, each node has at most four children.
Characteristics of Spatial Database
It is a database system
It offers spatial data types (SDTs) in its data model and query language.
It supports spatial data types in its implementation, providing at least
spatial indexing and efficient algorithms for spatial join.
Example
The spatial data in the form of points, lines, polygons etc. is used by many
different databases as shown above.
GEOMETRY
POINT
LINESTRING
POLYGON
The other data types hold collections of values:
MULTIPOINT
MULTILINESTRING
MULTIPOLYGON
GEOMETRYCOLLECTION
Use the CREATE TABLE statement to create a table with a spatial column:
CREATE TABLE geotest (code int(5),descrip varchar(50), g GEOMETRY);
Sample Output:
X-coordinate value.
Y-coordinate value.
Point is defined as a zero-dimensional geometry.
The boundary of a Point is the empty set.
Example
MySQL> SELECT X(POINT(18, 23));
+------------------+
| X(POINT(18, 23)) |
+------------------+
|               18 |
+------------------+
1 row in set (0.00 sec)
Example
MySQL> SET @g = 'LINESTRING(0 0,1 2,2 4)';
Surface Type
Polygon Type
A Polygon is a planar Surface representing a multisided geometry. It is
defined by a single exterior boundary and zero or more interior
boundaries, where each interior boundary defines a hole in the Polygon.
Usage of Polygon
The Polygon objects could represent districts, blocks and so on from a
state map.
Example
MySQL> SET @g = 'POLYGON((0 0,8 0,12 9,0 9,0 0),(5 3,4 5,7 9,3 7,2 5))';
MySQL> INSERT INTO geotest VALUES (123, "Test Data", GeomFromText(@g));
GeometryCollection Type
A GeometryCollection is a geometry that is a collection of one or more
geometries of any class.
Example
MySQL> SET @g = 'GEOMETRYCOLLECTION(POINT(3 2),LINESTRING(0 0,1 3,2 5,3 5,4 7))';
MultiPoint Type
A MultiPoint is a geometry collection composed of Point elements. The points
are not connected or ordered in any way.
Usage of MultiPoint
On a world map, a MultiPoint could represent a chain of small islands.
MultiCurve Type
A MultiCurve is a geometry collection composed of Curve elements.
MultiCurve is a noninstantiable class.
MySQL> SET @g ='MULTIPOINT(0 0, 15 25, 45 65)';
MultiLineString Type
A MultiLineString is a MultiCurve geometry collection composed of LineString
elements.
Usage of MultiLineString
Spatial Indexing
A spatial index is a data structure that allows for accessing a spatial object
efficiently. It is a common technique used by spatial databases. Without
indexing, any search for a feature would require a "sequential scan" of every
record in the database, resulting in much longer processing time.
1. Overview
Spatial Index is a data structure that allows for accessing a spatial object
efficiently. It is a common technique used by spatial databases. A variety of
spatial operations needs the support from spatial index for efficient processing:
Figure 1. The point query test in MySQL 5.7.19 using 2017 TIGER national
geodatasets. Time is measured in milliseconds, which is the precision of MySQL.
Different data sources use different data structures and access methods. Here we
list two well-known spatial indices as well as databases which use them. The
categorization system proposed by Riguax et al. (2002) is employed here for
illustration:
3. Space-driven Structures
Fixed grid index is an n×n array of equal-size cells. Each cell is associated with
a list of spatial objects which intersect or overlap with it. Figure 3 depicts
a fixed 4×4 grid indexing a collection of three spatial objects.
Figure 3. An example of a fixed grid structure.
Quadtree
KD-tree
The general idea behind KD-tree is that it is a binary tree, each of its nodes
represents an axis-aligned hyper-rectangle as Figure 7 shows. Each node
specifies an axis and splits the set of points based on whether their coordinate
along that axis is greater than or less than a particular value (Rigaux, 2012;
Maneewongvatana, 1999), such as the coordinate median.
The KD-tree can be used to index a set of k-dimensional points. Every non-leaf
node divides the space into two parts by a hyper-plane in the specific
dimension. Points in the left half-space are represented by the left subtree of that
node and points falling to the right half-space are represented by the right
subtree.
4. Data-driven Structures
R-tree
In Figure 9, M4 through M9 are MBRs of spatial objects in a layer. They are the
leaf nodes of the R-tree index, and contain minimum bounding rectangles of
spatial objects, along with pointers to the spatial objects. M2 and M3 are parents
of the leaf nodes. M1 is the root, containing all the MBRs. This R-tree has a
depth of three.
Examples −
An association rule which looks like "Any person who buys a car also buys a
steering lock"; with a temporal aspect, this rule would be "Any person who buys
a car also buys a steering lock after that".
Determining hotspots and unusual locations.
Rules − There are several types of rules that can be found from databases in
general. For example characteristic rules, discriminant rules, association rules,
or deviation and evaluation rules can be mined.
Applications
Colocation Mining
Colocation is the presence of two or more spatial objects at the same location or
at significantly close distances from each other. Colocation patterns can indicate
interesting associations among spatial data objects with respect to their
nonspatial attributes
Spatial Clustering
Location Prospecting
Mobile Databases
Mobile databases are separate from the main database and can easily be
transported to various places. Even though they are not connected to the main
database, they can still communicate with the database to share and exchange
data.
The main system database that stores all the data and is linked to the
mobile database.
The mobile database that allows users to view information even while on
the move. It shares information with the main database.
The device that uses the mobile database to access data. This device can
be a mobile phone, laptop etc.
A communication link that allows the transfer of data between the mobile
database and the main database.
The mobile data is less secure than data that is stored in a conventional
stationary database. This presents a security hazard.
The mobile unit that houses a mobile database may frequently lose power
because of limited battery. This should not lead to loss of data in
database.
(c) paging.
Location update − initiated by the mobile unit; the current location of the unit
is recorded in the HLR and VLR databases.
Location lookup- a database search to obtain the current location of the mobile
unit.
Paging − the system informs the caller of the location of the called unit in terms
of its current base station. These two tasks are initiated by the MSC.
(a) active mode, (b) doze mode, or (c) power down mode.
In active mode, the mobile actively communicates with other subscribers, and it
may continue to move within the cell or may encounter a handoff which may
interrupt the communication. It is the task of the location manager to find the
new location and resume the communication.
In doze mode a mobile unit does not actively communicate with other
subscribers but continues to listen to the base station and monitors the signal
levels around it
In Power down mode the unit is not functional at all.
The location management module uses a two-tier scheme for location- related
tasks. The first tier provides a quick location lookup, and the second tier search
is initiated only when the first tier search fails.
Location Lookup :A location lookup finds the location of the called party to
establish the communication session. It involves searching VLR and possibly
HLR.
Declarative Language
Language to specify rules
Inference Engine (Deduction Machine)
Related to logic programming
Prolog language (Prolog => Programming in logic)
Uses backward chaining to evaluate
defined via a set of rules (superior allows us to express the idea of non-
direct supervision)
Rule
Multimedia Databases
The multimedia databases are used to store multimedia data such as images,
animation, audio, video along with text. This data is stored in the form of
multiple file types like .txt(text), .jpg(images), .swf(videos), .mp3(audio) etc.
The multimedia database stores the multimedia data and information related to
it. This is given in detail as follows −
Media data
This is the multimedia data that is stored in the database such as images, videos,
audios, animation etc.
The Media format data contains the formatting information related to the media
data such as sampling rate, frame rate, encoding scheme etc.
This contains the keyword data related to the media in the database. For an
image the keyword data can be date and time of the image, description of the
image etc.
The media feature data describes the features of the media data. For an image,
feature data can be the colours of the image, textures in the image, etc.
Comprehensive search methods: During a search in the database, an entry, given in the form of text
or a graphical image, is found using different search queries and the corresponding search methods.
Format independent interface: database queries should be independent of media format.
23. What is a multimedia database? Describe any two image databases.
Multimedia database systems are increasingly common owing to the popular use of audio-video
equipment, digital cameras, CD-ROMs, and the Internet. Examples of multimedia database
systems include NASA's EOS (Earth Observation System), various kinds of image and
audio-video databases, and Internet databases.
MC4202 – ADVANCED DATABASE TECHNOLOGY UNIT - III
3.1 NoSQL
What is NoSQL?
NoSQL Database is a non-relational Data Management System that does not require
a fixed schema. It avoids joins, and is easy to scale. The major purpose of using a NoSQL
database is for distributed data stores with humongous data storage needs. NoSQL is used for
Big data and real-time web apps. For example, companies like Twitter, Facebook and Google
collect terabytes of user data every single day.
NoSQL database stands for "Not Only SQL" or "Not SQL." Though a better term
would be "NoREL", NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.
Traditional RDBMS uses SQL syntax to store and retrieve data for further insights.
Instead, a NoSQL database system encompasses a wide range of database technologies that
can store structured, semi-structured, unstructured and polymorphic data. Let’s understand
about NoSQL with a diagram in this NoSQL database tutorial:
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, Amazon, etc. who deal with huge volumes of data. The system response time
becomes slow when you use RDBMS for massive volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading our existing
hardware. This process is expensive.
1998- Carlo Strozzi used the term NoSQL for his lightweight, open-source relational
database
2000- Graph database Neo4j is launched
2004- Google BigTable is launched
2005- CouchDB is launched
2007- The research paper on Amazon Dynamo is released
2008- Facebook open-sources the Cassandra project
2009- The term NoSQL was reintroduced
Features of NoSQL
Non-relational
Schema-free
Advantages of NoSQL
Disadvantages of NoSQL
No standardization rules
Limited query capabilities
RDBMS databases and tools are comparatively mature
It does not offer any traditional database capabilities, like consistency when multiple
transactions are performed simultaneously.
When the volume of data increases, it becomes difficult to maintain unique
values as keys
Doesn't work as well with relational data
The learning curve is steep for new developers
Open-source options are not yet so popular for enterprises.
1. Consistency
2. Availability
3. Partition Tolerance
Consistency:
The data should remain consistent even after the execution of an operation. This means once
data is written, any future read request should contain that data. For example, after updating
the order status, all the clients should be able to see the same data.
Availability:
The database should always be available and responsive. It should not have any downtime.
Partition Tolerance:
Partition Tolerance means that the system should continue to function even if the
communication among the servers is not stable. For example, the servers can be partitioned
into multiple groups which may not communicate with each other. Here, if part of the
database is unavailable, other parts are always unaffected.
Eventual Consistency
The term “eventual consistency” means to have copies of data on multiple machines to get
high availability and scalability. Thus, changes made to any data item on one machine has to
be propagated to other replicas.
Data replication may not be instantaneous, as some copies will be updated immediately while
others are updated in due course of time. These copies may be mutually inconsistent for a
while, but in due course of time they become consistent. Hence, the name eventual consistency.
Basically available means the DB is available all the time, as per the CAP theorem
Soft state means that even without an input, the system state may change
Eventual consistency means that the system will become consistent over time
3.3 Sharding
Sometimes the data within MongoDB will be so huge, that queries against such big data sets
can cause a lot of CPU utilization on the server. To tackle this situation, MongoDB has a
concept of Sharding, which is basically the splitting of data sets across multiple MongoDB
instances.
The collection which could be large in size is actually split across multiple collections or
Shards as they are called. Logically all the shards work as one collection.
1. A Shard – This is the basic thing, and this is nothing but a MongoDB instance which
holds the subset of the data. In production environments, all shards need to be part of
replica sets.
2. Config server – This is a mongodb instance which holds metadata about the cluster,
basically information about the various mongodb instances which will hold the shard
data.
3. A Router – This is a mongodb instance which is basically responsible for redirecting
the commands sent by the client to the right servers.
Vertical scaling
Vertical scaling is the traditional way of increasing the hardware capabilities of a single
server. The process involves upgrading the CPU, RAM, and storage capacity. However,
upgrading a single server is often challenged by technological limitations and cost
constraints.
Horizontal scaling
This method divides the dataset into multiple servers and distributes the database load among
each server instance. Distributing the load reduces the strain on the required hardware
resources and provides redundancy in case of a failure.
The shard
Mongos
Config servers
Shard
A shard is a single MongoDB instance that holds a subset of the sharded data. Shards can be
deployed as replica sets to increase availability and provide redundancy. The combination of
multiple shards creates a complete data set. For example, a 2 TB data set can be broken down
into four shards, each containing 500 GB of data from the original data set.
Mongos
Mongos act as the query router providing a stable interface between the application and the
sharded cluster. This MongoDB instance is responsible for routing the client requests to the
correct shard.
Config Servers
Configuration servers store the metadata and the configuration settings for the whole cluster.
Components illustrated
The following diagram from the official MongoDB docs explains the relationship between
each component:
1. The application communicates with the routers (mongos) about the query to be executed.
2. The mongos instance consults the config servers to check which shard contains the
required data set to send the query to that shard.
3. Finally, the result of the query will be returned to the application.
It’s important to remember that the config servers also work as replica sets.
Benefits
In traditional replication scenarios, the primary node handles the bulk of write
operations, while the secondary servers are limited to read-only operations or
maintaining the backup of the data set. However, as sharding utilizes shards with replica
sets, all queries are distributed among all the nodes in the cluster.
As each shard consists of a subset of the complete data set, simply adding additional
shards will increase the cluster’s storage capacity without having to do complex
hardware restructuring.
Replication requires vertical scaling when handling large data sets. This requirement can
lead to hardware limitations and prohibitive costs compared to the horizontal scaling
approach. But, because MongoDB utilizes horizontal scaling, the workload is
distributed. When the need arises, additional servers can be added to a cluster.
In sharding, both read and write performance directly correlates to the number of server
nodes in the cluster. This process provides a quick method to increase the cluster’s
performance by simply adding additional nodes.
A sharded cluster can continue to operate even if a single or multiple shards are
unavailable. While the data on those shards are unavailable, the client application can
still access all the other available shards within the cluster without any downtime. In
production environments, all individual shards deploy as replica sets, further increasing
the availability of the cluster.
Limitations
db.Employee.insert
(
{
"Employeeid" : 1,
"EmployeeName" : "Martin"
}
)
The basic parameters in the command are a condition for which the document needs to be
updated, and the modification which needs to be performed.
Step 2) Choose the condition which you want to use to decide which document needs to be
updated. In our example, we want to update the document which has the Employeeid of 1.
Step 4) Choose which Field Name you want to modify and enter the new value accordingly.
db.Employee.update(
{"Employeeid" : 1},
{$set: { "EmployeeName" : "NewMartin"}});
Output:
The output clearly shows that one record matched the condition and hence the relevant field
value was modified.
Example #1 – deleteOne
Code:
db.code.deleteOne({"name":"malti"})
Output:
Explanation:
Here we attempt to delete a single record, which matches with mentioned key-value
pair. To start with, code is our collection in which the document might or might not
exist. Then we have our method which is deleteOne, and then we have the filter
mentioned inside. Here, our filter should look for a document that has the key as
“name” and the value must match to “malti”.
Upon finding a document which matches the filter, the method will delete the
document. As you can see, we implemented the deleteOne method and then when we
listed the whole collection, we now don’t have any record or a document with the
name as malti.
Example #2 – deleteMany
Code:
db.code.find()
db.code.deleteMany({"city":"Pune"})
Output:
Explanation:
Started with db, with collection name, we have our deleteMany method, which will
delete multiple documents in the code collection. It will rely on the filter mentioned to
delete these documents. Our filter is “{“city”: “Pune”}”, meaning it will delete every
document that has the city key, matching with the value of Pune.
Executing this query, every document present in the collection “code” will be deleted
at once with the Pune as a city. As you can see, we implemented the deleteMany
method with filter and then returned the whole collection, which is now empty.
Initially, we had two documents with the city as Pune, but there are no documents
with the city as Pune after executing our query. This is how deleteMany deletes every
record that matches the filter.
Code:
db.locs.deleteMany( {} )
db.code.find().count()
Output:
Explanation:
As you can see in the above screenshot, we firstly checked the total count of records
in the collection, which was 195. Then we executed the deleteMany query with a
blank filter, which deleted every single record available.
This resulted in emptying the whole collection. Later, upon checking the count,
we get 0 as a result, meaning no records. That's how deleteMany with no filter works.
To query data from MongoDB collection, you need to use MongoDB's find() method.
Syntax
The basic syntax of find() method is as follows −
>db.COLLECTION_NAME.find()
To display the results in a formatted way, you can use pretty() method.
Syntax
>db.COLLECTION_NAME.find().pretty()
Example
Following example retrieves all the documents from the collection named mycol and arranges
them in an easy-to-read format.
> db.mycol.find().pretty()
{
"_id" : ObjectId("5dd4e2cc0821d3b44607534c"),
"title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 100
}
{
"_id" : ObjectId("5dd4e2cc0821d3b44607534d"),
"title" : "NoSQL Database",
"description" : "NoSQL database doesn't have tables",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 20,
"comments" : [
{
"user" : "user1",
"message" : "My first comment",
"dateCreated" : ISODate("2013-12-09T21:05:00Z"),
"like" : 0
}
]
}
Apart from the find() method, there is findOne() method, that returns only one document.
Syntax
>db.COLLECTIONNAME.findOne()
Example
Following example retrieves the document with title MongoDB Overview.
> db.mycol.findOne({title: "MongoDB Overview"})
{
"_id" : ObjectId("5dd6542170fb13eec3963bf0"),
"title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 100
}
If there are frequent insert, delete and update operations carried out on documents, then the
indexes would need to change that often, which would just be an overhead for the collection.
The below example shows an example of what field values could constitute an index in a
collection. An index can either be based on just one field in the collection, or it can be based
on multiple fields in the collection.
In the example below, the Employeeid “1” and EmployeeCode “AA” are used to index the
documents in the collection. So when a query search is made, these indexes will be used to
quickly and efficiently find the required documents in the collection.
So even if the search query is based on the EmployeeCode “AA”, that document would be
returned.
The following example shows how add index to collection. Let’s assume that we have our
same Employee collection which has the Field names of “Employeeid” and
“EmployeeName”.
db.Employee.createIndex({Employeeid:1})
Code Explanation:
1. The createIndex method is used to create an index based on the “Employeeid” of the
document.
2. The ‘1’ parameter indicates that when the index is created with the “Employeeid”
Field values, they should be sorted in ascending order. Please note that this is different
from the _id field (The id field is used to uniquely identify each document in the
collection) which is created automatically in the collection by MongoDB. The
documents will now be sorted as per the Employeeid and not the _id field.
Output:
1. The numIndexesBefore: 1 indicates the number of Field values (The actual fields in
the collection) which were there in the indexes before the command was run.
Remember that each collection has the _id field which also counts as a Field value to
the index. Since the _id index field is part of the collection when it is initially created,
the value of numIndexesBefore is 1.
2. The numIndexesAfter: 2 indicates the number of Field values which were there in the
indexes after the command was run.
3. Here the “ok: 1” output specifies that the operation was successful, and the new index
is added to the collection.
db.Employee.getIndexes()
Code Explanation:
Output:
The output returns a document which just shows that there are 2 indexes in the
collection which is the _id field, and the other is the Employee id field. The :1
indicates that the field values in the index are created in ascending order.
db.Employee.dropIndex({Employeeid:1})
Code Explanation:
The dropIndex method takes the required Field values which needs to be removed
from the Index.
Output:
1. The nIndexesWas: 3 indicates the number of Field values which were there in the
indexes before the command was run. Remember that each collection has the _id field
which also counts as a Field value to the index.
2. The ok: 1 output specifies that the operation was successful, and the “Employeeid”
field is removed from the index.
To remove all of the indexes at once in the collection, one can use the dropIndexes command.
db.Employee.dropIndexes()
Code Explanation:
The dropIndexes method will drop all of the indexes except for the _id index.
Output:
1. The nIndexesWas: 2 indicates the number of Field values which were there in the
indexes before the command was run.
2. Remember again that each collection has the _id field which also counts as a Field
value to the index, and that will not be removed by MongoDB and that is what this
message indicates.
3. The ok: 1 output specifies that the operation was successful.
Cluster
Cassandra database is distributed over several machines that operate together. The outermost
container is known as the Cluster. For failure handling, every node contains a replica, and in
case of a failure, the replica takes charge. Cassandra arranges the nodes in a cluster, in a ring
format, and assigns data to them.
Keyspace
Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace
in Cassandra are −
Replication factor − It is the number of machines in the cluster that will receive
copies of the same data.
Replica placement strategy − It is nothing but the strategy to place replicas in the
ring. We have strategies such as simple strategy (rack-unaware strategy), old network
topology strategy (rack-aware strategy), and network topology strategy (datacenter-
shared strategy).
Column families − Keyspace is a container for a list of one or more column families.
A column family, in turn, is a container of a collection of rows. Each row contains
ordered columns. Column families represent the structure of your data. Each keyspace
has at least one and often many column families.
The syntax of creating a Keyspace is as follows −
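The syntax block itself is not reproduced here; in CQL the statement has the
general form (the keyspace name and replication settings are illustrative):

CREATE KEYSPACE <keyspace_name>
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};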
What is Keyspace?
A keyspace is an object that is used to hold column families, user defined types. A keyspace
is like RDBMS database which contains column families, indexes, user defined types, data
center awareness, strategy used in keyspace, replication factor, etc.
Syntax:
Or
o Simple Strategy: Simple strategy is used in the case of one data center. In this
strategy, the first replica is placed on the selected node and the remaining nodes are
placed in clockwise direction in the ring without considering rack or node location.
o Network Topology Strategy: This strategy is used in the case of more than one data
centers. In this strategy, you have to provide replication factor for each data center
separately.
Replication Factor: Replication factor is the number of replicas of data placed on different
nodes. A replication factor greater than two is good for avoiding a single point of failure,
so 3 is a good replication factor.
Example:
Verification:
To check whether the keyspace is created or not, use the "DESCRIBE" command. By using
this command you can see all the keyspaces that are created.
Durable_writes
By default, the durable_writes property of a keyspace is set to true; you can also set this
property to false. However, this property should not be set to false for keyspaces that use
the simple strategy.
Example:
Verification:
To check whether the keyspace is created or not, use the "DESCRIBE" command. By using
this command you can see all the keyspaces that are created.
Using a Keyspace
To use the created keyspace, you have to use the USE command.
Syntax:
1. USE <identifier>
we will learn about the Cassandra CRUD Operation: Create, Update, Read & Delete.
Moreover, we will cover the syntax and example of each CRUD operation in Cassandra.
So, let’s start with Cassandra CRUD Operation.
Cassandra CRUD operations stand for Create, Update, Read and Delete or Drop. These
operations are used to manipulate data in Cassandra. Apart from this, with CRUD operations
in Cassandra, a user can also verify the command or the data.
a. Create Operation
A user can insert data into the table using the Cassandra CRUD create operation. The data is
stored in the columns of a row in the table. Using the INSERT command with proper values, a
user can perform this operation.
A Syntax of Create Operation-
INSERT INTO <table name>
(<column1>,<column2>....)
VALUES (<value1>,<value2>...)
USING<option>
INPUT:
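The input statement itself is not reproduced here; a sketch consistent with the
student table used in the later examples (the column names and types are
assumptions):

cqlsh:keyspace1> INSERT INTO student (en, name, phone, city, branch)
VALUES (1, 'Ayush', 9999999999, 'Boston', 'Electrical Engineering');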
b.Update Operation
The second operation in the Cassandra CRUD operation is the UPDATE operation. A user
can use UPDATE command for the operation. This operation uses three keywords while
updating the table.
Where: This keyword will specify the location where data is to be updated.
Set: This keyword will specify the updated value.
Must: This keyword includes the columns composing the primary key.
Furthermore, at the time of updating the rows, if a row is unavailable, then Cassandra has a
feature to create a fresh row for the same.
A Syntax of Update Operation-
UPDATE <table name>
SET <column name> = <new value>
WHERE <condition>
Let's change a few details in the table 'student'. In this example, we will update Aarav's
city from 'New York City' to 'San Fransisco'.
INPUT:
cqlsh:keyspace1> UPDATE student SET city='San Fransisco'
WHERE en=002;
c. Read Operation
This is the third Cassandra CRUD Operation – Read Operation. A user has a choice to read
either the whole table or a single column. To read data from a table, a user can use SELECT
clause. This command is also used for verifying the table after every operation.
Ayush Boston
Kabir Philadelphia
d. Delete Operation
The delete operation is the last Cassandra CRUD operation; it allows a user to delete data
from a table. The user can use the DELETE command for this operation.
A Syntax of Delete Operation-
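The syntax block itself is not reproduced here; the usual CQL form is:

DELETE FROM <table name> WHERE <condition>;

For example, a row could be removed with a statement such as
DELETE FROM student WHERE en = 3; (the key value here is illustrative). The
student table used in these examples contains rows such as the following: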
001  Ayush  9999999999  Boston         Electrical Engineering
002  Aarav  8888888888  San Fransisco  Computer Engineering
UUID
Collection Types
Cassandra Query Language also provides collection data types. The following table
provides a list of the collections available in CQL.
Collection Description
User-defined datatypes
Cqlsh provides users a facility of creating their own data types. Given below are the
commands used while dealing with user defined datatypes.
CREATE TYPE − Creates a user-defined datatype.
ALTER TYPE − Modifies a user-defined datatype.
DROP TYPE − Drops a user-defined datatype.
DESCRIBE TYPE − Describes a user-defined datatype.
DESCRIBE TYPES − Describes user-defined datatypes.
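A minimal sketch of defining and inspecting a user-defined type (the type name
and fields are illustrative):

cqlsh:keyspace1> CREATE TYPE card_details (num int, pin int, name text);
cqlsh:keyspace1> DESCRIBE TYPE card_details;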
Partitioning
What is Hive
Features of Hive
The following gives brief overview of some data types present in Hive:
Numeric Types
String Types
Date/Time Types
Complex Types
CHAR − maximum length 255
VARCHAR − maximum length 1 to 65535
ARRAY<data_type> − arrays; negative values and non-constant expressions not allowed
MAP<primitive_type, data_type> − maps; negative values and non-constant expressions not allowed
Database operation:
Following are the steps on how to create and drop databases in Hive.
For creating a database in Hive shell, we have to use the command as shown in the syntax
below:-
Syntax:
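The command itself is not reproduced here; the standard form is:

hive> create database <database_name>;

For example: hive> create database guru99;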
For Dropping database in Hive shell, we have to use the “drop” command as shown in the
syntax below:-
Syntax:
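The command itself is not reproduced here; the standard form is:

hive> drop database <database_name>;

For example: hive> drop database guru99;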
In the same screen, after checking databases with the show command, the database
"guru99" does not appear inside Hive.
3. Creating a table
Syntax:
create table database.tablename(columns);
Example:
create table geeksportal.geekdata(id int,name string);
Here id and string are the two columns.
Output :
4. Display Database
Syntax:
show databases;
Output: Display the databases created.
5. Describe Database
Syntax:
describe database database_name;
Example:
describe database geeksportal;
Output: Display the HDFS path of a particular database.
Partitioning in Hive
The partitioning in Hive means dividing the table into some parts based on the values
of a particular column like date, course, city or country. The advantage of partitioning is that
since the data is stored in slices, the query response time becomes faster.
As we know, Hadoop is used to handle huge amounts of data, so it is always required to
use the best approach to deal with it. The partitioning in Hive is the best example of it.
Let's assume we have a data of 10 million students studying in an institute. Now, we have to
fetch the students of a particular course. If we use a traditional approach, we have to go
through the entire data. This leads to performance degradation. In such a case, we can adopt
the better approach i.e., partitioning in Hive and divide the data among the different datasets
based on particular columns.
o Static partitioning
o Dynamic partitioning
Static Partitioning
hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
o Load the data into the table and pass the values of partition columns with it by using
the following command: -
o Load the data of another file into the same table and pass the values of partition
columns with it by using the following command: -
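The load commands themselves are not reproduced here; a sketch of what they
typically look like (the file paths and partition values are illustrative):

hive> load data local inpath '/home/user/student_java.csv'
into table student partition(course = 'java');
hive> load data local inpath '/home/user/student_hadoop.csv'
into table student partition(course = 'hadoop');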
In the following screenshot, we can see that the table student is divided into two categories.
o Let's retrieve the entire data of the table by using the following command: -
o Now, try to retrieve the data based on partitioned columns by using the following
command: -
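The queries themselves are not reproduced here; they typically look like this
(the partition value is illustrative):

hive> select * from student;
hive> select * from student where course = 'java';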
In this case, we are not examining the entire data. Hence, this approach improves query
response time.
Dynamic Partitioning
In dynamic partitioning, the values of partitioned columns exist within the table. So, it is
not required to pass the values of partitioned columns manually.
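Before inserting data with dynamic partitions, Hive normally requires dynamic
partitioning to be enabled; a sketch of the usual settings:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;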
hive> create table stud_demo(id int, name string, age int, institute string, course string)
row format delimited
fields terminated by ',';
hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
o Now, insert the data of the dummy table into the partition table: -
hive> insert into table student_part partition(course)
select id, name, age, institute, course
from stud_demo;
o In the following screenshot, we can see that the table student_part is divided into two
categories.
o Let's retrieve the entire data of the table by using the following command: -
hive> select * from student_part;
o Now, try to retrieve the data based on partitioned columns by using the following
command: -
In this case, we are not examining the entire data. Hence, this approach improves query
response time.
o Let's also retrieve the data of another partitioned dataset by using the following
command: -
3.8 HiveQL
Hive uses the Derby database for single-user metadata storage; for the multi-user or
shared-metadata case, Hive uses MySQL.
OrientDB is the first Multi-Model Open Source NoSQL DBMS that combines the
power of graphs and the flexibility of documents into one scalable, high-performance
operational database.
Gone are the days where your database only supports a single data model. As a direct
response to polyglot persistence, multi-model databases acknowledge the need for multiple
data models, combining them to reduce operational complexity and maintain data
consistency. Though graph databases have grown in popularity, most NoSQL products are
still used to provide scalability to applications sitting on a relational DBMS. Advanced
2nd generation NoSQL products like OrientDB are the future: providing more
functionality and flexibility, while being powerful enough to replace your operational
DBMS.
Speed
OrientDB was engineered from the ground up with performance as a key specification. It’s
fast on both read and write operations. Stores up to 120,000 records per second
Enterprise
While most NoSQL DBMSs are used as secondary databases, OrientDB is powerful and
flexible enough to be used as an operational DBMS. Though OrientDB Community
Edition is free for commercial use, robust applications need enterprise level functionalities
to guarantee data security and flawless performance. OrientDB Enterprise Edition gives
you all the features of our community edition plus:
Incremental backups
Unmatched security
24x7 Support
Query Profiler
Distributed Clustering configuration
Metrics Recording
Live Monitor with configurable alerts
Record
The smallest unit that you can load from and store in the database. Records can be stored in
four types.
Document
Record Bytes
Vertex
Edge
The SQL Reference of the OrientDB database provides several commands to create, alter,
and drop databases.
The following statement is a basic syntax of Create Database command.
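The syntax block itself is not reproduced here; in the OrientDB console it has
roughly the following form (optional parts in brackets):

CREATE DATABASE <database-url> [<user> <password> <storage-type> [<db-type>]]

where <database-url> gives the location of the database (for example a plocal:
or remote: URL), <storage-type> is plocal or memory, and <db-type> is document
or graph.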
Example
You can use the following command to create a local database named demo.
orientdb> CREATE DATABASE PLOCAL:/opt/orientdb/databases/demo
If the database is successfully created, you will get the following output.
Database created successfully.
Current database is: plocal: /opt/orientdb/databases/demo
Example
From OrientDB version 2.2 onwards, a new SQL parser is added which will not allow the
regular syntax in some cases. Therefore, we have to disable the new SQL parser (StrictSQL)
in some cases. You can use the following Alter Database command to disable the StrictSQL
parser.
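The command itself is not shown here; the statement typically used for this (an
assumption based on the usual OrientDB console syntax) is:

orientdb> ALTER DATABASE custom strictSQL=false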
Drop database
Similar to RDBMS, OrientDB provides the feature to drop a database. Drop database refers
to removing a database completely.
The following statement is the basic syntax of the Drop database command.
DROP DATABASE [<database-name> <server-username> <server-user-password>]
Following are the details about the options in the above syntax.
<database-name> − Database name you want to drop.
<server-username> − Username of the database who has the privilege to drop a database.
<server-user-password> − Password of the particular user.
Example
There are two ways to drop a database, one is drop a currently open database and second is
drop a particular database by providing the particular name.
In this example, we will use the same database named ‘demo’ that we created in an earlier
chapter. You can use the following command to drop a database demo.
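The command itself is not reproduced here; with the demo database currently
open, it is typically:

orientdb {db=demo}> DROP DATABASE

Dropping the currently open database requires no further arguments; to drop a
different database, you would supply its name, the server user name and the
password, as in the syntax shown earlier.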
Download and extract OrientDB from https://orientdb.com/download. Then type
cd orientdb-3.0.0
cd bin
and then, if you are on Linux/OSX, you can start the server with
./server.sh
if you are on Windows, start the server with
server.bat
You will see OrientDB starting
(The OrientDB ASCII-art startup banner is printed here, ending with the words
GRAPH DATABASE and orientdb.com.)
+---------------------------------------------------------------+
| WARNING: FIRST RUN CONFIGURATION |
+---------------------------------------------------------------+
| This is the first time the server is running. Please type a |
| password of your choice for the 'root' user or leave it blank |
| to auto-generate it. |
| |
| To avoid this message set the environment variable or JVM |
| setting ORIENTDB_ROOT_PASSWORD to the root password to use. |
+---------------------------------------------------------------+
Features
Quick installation. OrientDB can be installed and running in less than 60 seconds
Fully transactional: supports ACID transactions guaranteeing that all database
transactions are processed reliably and in the event of a crash all pending documents are
recovered and committed.
Graph structured data model: native management of graphs. Fully compliant with
the Apache TinkerPop Gremlin (previously known as Blueprints) open source graph
computing framework.
SQL: supports SQL queries with extensions to handle relationships without SQL joins, and to
manage trees and graphs of connected documents.
Web technologies: natively supports HTTP, RESTful protocol, and JSON without requiring
additional libraries or components.
Distributed: full support for multi-master replication including geographically distributed
clusters.
Run anywhere: implemented using pure Java allowing it to be run on Linux, OS
X, Windows, or any system with a compliant JVM.
Embeddable: local mode to use the database bypassing the Server. Perfect for scenarios
where the database is embedded.
Apache 2 License: always free for any usage. No fees or royalties required to use it.
Full server has a footprint of about 512 MB.
Commercial support is available from OrientDB.
Pattern matching: Introduced in version 2.2, the Match statement queries the database in
a declarative manner, using pattern matching.
Security features introduced in OrientDB 2.2 provide an extensible framework for adding
external authenticators, password validation, LDAP import of database roles and users,
advanced auditing capabilities, and syslog support. OrientDB Enterprise Edition
provides Kerberos (protocol) authentication and full browser SPNEGO support. When it
comes to database encryption, starting with version 2.2, OrientDB can encrypt records on
disk. This prevents unauthorized users from accessing database content or even from
bypassing OrientDB security.
Teleporter: Allows relational databases to be quickly imported into OrientDB in a few
simple steps.
Cloud ready: OrientDB can be deployed in the cloud and supports the following
providers: Amazon Web Services, Microsoft Azure, CenturyLink Cloud, Jelastic,
DigitalOcean
Structured, Semi structured, and Unstructured Data – XML Hierarchical Data Model –
XML Documents – Document Type Definition – XML Schema – XML Documents and
Databases – XML Querying – XPath – XQuery
Big Data includes huge volume, high velocity, and an extensible variety of data. There are three types:
structured data, semi-structured data, and unstructured data.
1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been
organized into a formatted repository that is typically a database. It concerns all data that can
be stored in an SQL database, in a table with rows and columns. Such data have relational keys and can
easily be mapped into pre-designed fields. Today, structured data is the most easily processed
and the simplest form of data to manage. Example: relational data.
2. Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but that has
some organizational properties that make it easier to analyze. With some processing, you can
store it in a relational database (though this can be very hard for some kinds of semi-structured
data); the semi-structured form exists to ease this. Example: XML data.
3. Unstructured data –
Unstructured data is data that is not organized in a predefined manner or does not have a
predefined data model, so it is not a good fit for a mainstream relational database. For
unstructured data, there are alternative platforms for storing and managing it. It is increasingly
prevalent in IT systems and is used by organizations in a variety of business intelligence and
analytics applications. Example: Word, PDF, text, media logs.
We now introduce the data model used in XML. The basic object in XML is the XML
document. Two main structuring concepts are used to construct an XML
document: elements and attributes. It is important to note that the term attribute in XML is not used
in the same manner as is customary in database terminology, but rather as it is used in document
description languages such as HTML and SGML. Attributes in XML provide additional information
that describes elements, as we will see. There are additional concepts in XML, such as entities,
identifiers, and references, but first we concentrate on describing elements and attributes to show the
essence of the XML model.
Figure 12.3 shows an example of an XML element called <Projects>. As in HTML, elements are
identified in a document by their start tag and end tag. The tag names are enclosed between angled
brackets < ... >, and end tags are further identified by a slash, </ ... >.
Complex elements are constructed from other elements hierarchically, whereas simple
elements contain data values. A major difference between XML and HTML is that XML tag names
are defined to describe the meaning of the data elements in the document, rather than to describe how
the text is to be displayed. This makes it possible to process the data elements in the XML document
automatically by computer programs. Also, the XML tag (element) names can be defined in another
document, known as the schema document, to give a semantic meaning to the tag names that can be
exchanged among multiple users. In HTML, all tag names are predefined and fixed; that is why they
are not extendible.
In Figure 12.3, the simple elements are the ones with tag names such as <First_name> and <Hours>.
The complex elements are the ones with the tag names <Projects>, <Project>, and <Worker>.
In general, there is no limit on the levels of nesting of elements.
Data-centric XML documents. These documents have many small data items that follow a
specific structure and hence may be extracted from a structured database. They are formatted as XML
documents in order to exchange them over or display them on the Web. These usually follow
a predefined schema that defines the tag names.
Document-centric XML documents. These are documents with large amounts of text, such as
news articles or books. There are few or no structured data elements in these documents.
Hybrid XML documents. These documents may have parts that contain structured data and
other parts that are predominantly textual or unstructured. They may or may not have a predefined
schema.
XML documents that do not follow a predefined schema of element names and corresponding
tree structure are known as schemaless XML documents. It is important to note that
data-centric XML documents can be considered either as semistructured data or as structured data as
defined in Section 12.1. If an XML document conforms to a predefined XML schema or DTD (see
Section 12.3), then the document can be considered as structured data. On the other hand, XML
allows documents that do not conform to any schema; these would be considered as semistructured
data and are schemaless XML documents. When the value of the standalone attribute in an XML
document is yes, as in the first line in Figure 12.3, the document is standalone and schemaless.
XML attributes are generally used in a manner similar to how they are used in HTML (see
Figure 12.2), namely, to describe properties and characteristics of the elements (tags) within which
they appear. It is also possible to use XML attributes to hold the values of simple data elements;
however, this is generally not recommended. An exception to this rule is in cases that need
to reference another element in another part of the XML document. To do this, it is common to use
attribute values in one element as the references. This resembles the concept of foreign keys in
relational databases, and is a way to get around the strict hierarchical model that the XML tree model
implies. We discuss XML attributes further in Section 12.3 when we discuss XML schema and DTD.
4.3 XML Documents
Well-Formed XML
Valid XML
<address>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</address>
The above example is said to be well-formed as −
It defines the type of document. Here, the document type is element type.
It includes a root element named address.
Each of the child elements among name, company and phone is enclosed in its self-explanatory
tag.
Order of the tags is maintained.
XML DTD
An XML document validated against a DTD is both "Well Formed" and "Valid".
What is a DTD?
A DTD defines the structure and the legal elements and attributes of an XML document
A "Valid" XML document is "Well Formed", as well as it conforms to the rules of a DTD:
The DOCTYPE declaration above contains a reference to a DTD file. The content of the DTD file is
shown and explained below.
XML DTD
The purpose of a DTD is to define the structure and the legal elements and attributes of an XML
document:
Note.dtd:
<!DOCTYPE note
[
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
!DOCTYPE note - Defines that the root element of the document is note
!ELEMENT note - Defines that the note element must contain the elements: "to, from,
heading, body"
!ELEMENT to - Defines the to element to be of type "#PCDATA"
!ELEMENT from - Defines the from element to be of type "#PCDATA"
!ELEMENT heading - Defines the heading element to be of type "#PCDATA"
!ELEMENT body - Defines the body element to be of type "#PCDATA"
A DOCTYPE declaration can also be used to define special characters or strings used in the
document:
Example
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note [
<!ENTITY nbsp " ">
<!ENTITY writer "Writer: adhiparasakthi.">
<!ENTITY copyright "Copyright: apec.">
]>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<footer>&writer; &copyright;</footer>
</note>
XML schema is an alternative to DTD. An XML document is considered “well formed” and “valid” if
it is successfully validated against XML Schema. The extension of Schema file is .xsd.
XSD stands for XML Schema Definition and it is a way to describe the structure of a XML
document. It defines the rules for all the attributes and elements in a XML document. It can also be
used to generate the XML documents. It also checks the vocabulary of the document. It doesn’t
require processing by a parser. XSD checks for the correctness of the structure of the XML file.
XSD was first published in 2001, and a second edition was published in 2004.
XSD vs DTD
XSD defines the list, order, and data types of elements and attributes; DTD defines only the list and
order of elements and attributes.
XSD provides control over the elements and attributes used in XML documents; DTD does not
provide such control.
XSD allows creating customized data types; DTD does not allow customized data types.
The syntax of XSD is similar to an XML document; the syntax of DTD is different from XML.
XSD allows defining restrictions on data (for example, restricting the content of an element to the
integer data type); DTD does not allow such restrictions.
Syntax:
<xsd:element name="elementname" type="datatype" minOccurs="nonNegativeInteger"
maxOccurs="nonNegativeInteger | unbounded"/>
Where,
name: Defines the element name
minOccurs: If its value is zero, the use of element is optional and if its value is greater than zero,
the use of element is compulsory, and should occur at least for specified number of times.
maxOccurs: If value is set as unbounded, the use of element can appear any number of times in
the XML document without any limitation.
employee.xsd
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.javatpoint.com"
xmlns="http://www.javatpoint.com"
elementFormDefault="qualified">
<xs:element name="employee">
<xs:complexType>
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
<xs:element name="email" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Let's see the XML file that uses the above XML Schema (XSD) file.
employee.xml
<?xml version="1.0"?>
<employee
xmlns="http://www.javatpoint.com"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.javatpoint.com employee.xsd">
<firstname>vimal</firstname>
<lastname>jaiswal</lastname>
<email>[email protected]</email>
</employee>
Description of XML Schema
1. simpleType
2. complexType
simpleType
A simple element is an XML element that can contain only text. It cannot contain any other
elements or attributes.
However, the "only text" restriction is quite misleading. The text can be of many different types. It
can be one of the types included in the XML Schema definition (boolean, string, date, etc.), or it can
be a custom type that you can define yourself.
You can also add restrictions (facets) to a data type in order to limit its content, or you can require the
data to match a specific pattern.
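As an illustration, a hypothetical age element restricted to integer values between 0 and 120 could be declared with standard XSD facets as follows:
<xs:element name="age">
  <xs:simpleType>
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="0"/>
      <xs:maxInclusive value="120"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>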
complexType
A complex element is an XML element that can contain other elements and/or attributes. There are
four kinds of complex elements:
empty elements
elements that contain only other elements
elements that contain only text
elements that contain both other elements and text
A complex XML element, "description", which contains both elements and text:
<description>
It happened on <date lang="norwegian">03.03.99</date> ....
</description>
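One way to declare such a mixed-content element in XSD is sketched below; the mixed="true" setting is what permits text between child elements, and the lang attribute mirrors the example above:
<xs:element name="description">
  <xs:complexType mixed="true">
    <xs:sequence>
      <xs:element name="date">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute name="lang" type="xs:string"/>
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>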
An XML database is used to store a huge amount of information in XML format. As the use
of XML is increasing in every field, it is required to have a secure place to store XML
documents. The data stored in the database can be queried using XQuery, serialized, and exported
into a desired format.
XML-enabled
Native XML (NXD)
An XML-enabled database is nothing but an extension provided for the conversion of XML documents.
It is a relational database, where data is stored in tables consisting of rows and columns. The tables
contain sets of records, which in turn consist of fields.
Native XML Database
A native XML database is based on containers rather than a table format. It can store a large amount of
XML documents and data. A native XML database is queried using XPath expressions.
A native XML database has an advantage over an XML-enabled database: it is more capable of storing,
querying, and maintaining XML documents than an XML-enabled database.
Example
The following example demonstrates an XML database −
<contact-info>
<contact1>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</contact1>
<contact2>
<name>Manisha Patil</name>
<company>mec</company>
<phone>(011) 789-4567</phone>
</contact2>
</contact-info>
Here, a table of contacts is created that holds the records of contacts (contact1 and contact2), which in
turn consists of three entities − name, company and phone.
4.8 XPath
XPath can be used to navigate through elements and attributes in an XML document.
An XPath expression returns a collection of element nodes that satisfy certain patterns
specified in the expression.
The names in the XPath expression are node names in the XML document tree that are either tag
(element) names or attribute names, possibly with additional qualifier conditions to further restrict
the nodes that satisfy the pattern.
These path expressions look very much like the path expressions you use with traditional computer
file systems:
Selecting Nodes
XPath uses path expressions to select nodes in an XML document. The node is selected by following
a path or steps. The most useful path expressions are listed below:
Expression : Description
// : Selects nodes in the document from the current node that match the selection, no matter where
they are
@ : Selects attributes
In the table below we have listed some path expressions and the result of the expressions:
//book : Selects all book elements no matter where they are in the document
bookstore//book : Selects all book elements that are descendants of the bookstore element, no matter
where they are under the bookstore element
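A few more illustrative expressions of the same kind, assuming a hypothetical bookstore document such as the books.xml used in the XQuery section:
/bookstore/book[1] : Selects the first book element that is the child of the bookstore element
/bookstore/book[price>35.00] : Selects all book elements of the bookstore element that have a price element with a value greater than 35.00
//title[@lang='en'] : Selects all title elements that have a lang attribute with the value "en"
//@lang : Selects all attributes that are named lang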
Absolute XPath: This is an XPath expression that starts from the root node, i.e., with '/'.
For example, /SoftwareTestersList/softwareTester/@name="T1".
Relative XPath: If the XPath expression starts from the selected context node, it is considered
a Relative XPath. For example, if softwareTester is the currently selected node, then
@name="T1" is considered a Relative XPath.
o XPath defines structure: XPath is used to define the parts of an XML document i.e.
element, attributes, text, namespace, processing-instruction, comment, and document nodes.
o XPath provides path expressions: XPath provides powerful path expressions to select nodes
or lists of nodes in XML documents.
o XPath is a core component of XSLT: XPath is a major element in XSLT standard and must
be followed to work with XSLT documents.
o XPath is a standard function: XPath provides a rich library of standard functions to
manipulate string values, numeric values, date and time comparison, node and QName
manipulation, sequence manipulation, Boolean values etc.
o XPath is a W3C recommendation.
4.9 XQuery
What is XQuery
XQuery is a functional query language used to retrieve information stored in XML format. It is
for XML what SQL is for databases. It was designed to query XML data.
XQuery is built on XPath expressions. It is a W3C recommendation which is supported by all major
databases.
"books.xml":
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
doc("books.xml")
FLWOR (pronounced "flower") is an acronym for "For, Let, Where, Order by, Return".
We will use the "books.xml" document in the examples below (same XML file as in the previous
chapter).
doc("books.xml")/bookstore/book[price>30]/title
The expression above will select all the title elements under the book elements that are under the
bookstore element that have a price element with a value that is higher than 30.
The following FLWOR expression will select exactly the same as the path expression above:
for $x in doc("books.xml")/bookstore/book
where $x/price>30
return $x/title
With an order by clause added, the results can also be sorted:
for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title
The for clause selects all book elements under the bookstore element into a variable called
$x.
The where clause selects only book elements with a price element with a value greater than
30.
The order by clause defines the sort order; here the results are sorted by the title element.
The return clause specifies what should be returned. Here it returns the title elements.
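FLWOR expressions can also wrap the results in new elements using standard XQuery element constructors. A minimal sketch against the same books.xml:
<ul>
{
for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return <li>{data($x/title)}</li>
}
</ul>
This returns an HTML list with one li entry per matching title.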
What does it do
XQuery is a functional language which is responsible for finding and extracting elements and
attributes from XML documents.
Advantages of XQuery
5.1 IR concepts
Information retrieval (IR) may be defined as a software program that deals with the
organization, storage, retrieval and evaluation of information from document repositories
particularly textual information. The system assists users in finding the information they
require, but it does not explicitly return the answers to their questions. Instead, it informs the
user of the existence and location of documents that might contain the required information.
The documents that satisfy the user's requirement are called relevant documents. A perfect IR system will retrieve
only relevant documents.
The process of information retrieval (IR) can be summarized as follows: a user who needs
information formulates a request in the form of a query in natural language. The IR system then
responds by retrieving the relevant output, in the form of documents, about the required
information.
The main goal of IR research is to develop a model for retrieving information from
the repositories of documents. Here, we are going to discuss a classical problem, named ad-
hoc retrieval problem, related to the IR system.
In ad-hoc retrieval, the user must enter a query in natural language that describes the required
information. Then the IR system will return the required documents related to the desired
information. For example, suppose we are searching for something on the Internet; it returns
some pages that are exactly relevant to our requirement, but there may be some non-relevant
pages too. This is the ad-hoc retrieval problem.
An information retrieval (IR) model can be classified into the following three types −
Classical IR Model
It is the simplest and easiest IR model to implement. This model is based on mathematical
knowledge that is easily recognized and understood. Boolean, Vector and
Probabilistic are the three classical IR models.
Non-Classical IR Model
It is completely opposite to the classical IR model. Such IR models are based on
principles other than similarity, probability, and Boolean operations. The information logic model,
situation theory model, and interaction model are examples of non-classical IR models.
Alternative IR Model
It is an enhancement of the classical IR model that makes use of some specific techniques from
other fields. The cluster model, fuzzy model, and latent semantic indexing (LSI) model are
examples of alternative IR models.
It is the oldest information retrieval (IR) model. The model is based on set theory and
the Boolean algebra, where documents are sets of terms and queries are Boolean expressions
on terms. The Boolean model can be defined as −
D − A set of words, i.e., the indexing terms present in a document. Here, each term is
either present (1) or absent (0).
Q − A Boolean expression, where terms are the index terms and operators are logical
products − AND, logical sum − OR and logical difference − NOT
F − Boolean algebra over sets of terms as well as over sets of documents
If we talk about the relevance feedback, then in Boolean IR model the Relevance
prediction can be defined as follows −
R − A document is predicted as relevant to the query expression if and only if it
satisfies the query expression as −
((text ∨ information) ∧ retrieval ∧ ¬theory)
In this model, a query term can be seen as an unambiguous definition of a set of
documents.
For example, the query term “economic” defines the set of documents that are indexed with
the term “economic”.
Now, what would be the result after combining terms with Boolean AND Operator? It will
define a document set that is smaller than or equal to the document sets of any of the single
terms. For example, the query with terms "social" and "economic" will produce the set of
documents that are indexed with both terms; in other words, the intersection of both
document sets.
Now, what would be the result after combining terms with Boolean OR operator? It
will define a document set that is bigger than or equal to the document sets of any of the
single terms. For example, the query with terms "social" or "economic" will produce the set of
documents that are indexed with either the term "social" or the term "economic"; in other
words, the union of both document sets.
Due to the disadvantages of the Boolean model (for example, it supports only exact matching
and provides no ranking of results), Gerard Salton and his
colleagues suggested a model based on Luhn's similarity criterion. The similarity
criterion formulated by Luhn states, “the more two representations agreed in given elements
and their distribution, the higher would be the probability of their representing similar
information.”
Consider the following important points to understand more about the Vector Space Model −
The index representations (documents) and the queries are considered as vectors
embedded in a high dimensional Euclidean space.
The similarity measure of a document vector to a query vector is usually the cosine of
the angle between them.
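In symbols, for a document vector d and a query vector q, this similarity is usually computed as cos(d, q) = (d · q) / (|d| |q|), the dot product of the two vectors divided by the product of their lengths; vectors pointing in the same direction score close to 1, while vectors sharing no terms score 0.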
During the process of indexing, many keywords are associated with each document in the
document set; these include words, phrases, creation date, author names, and type of document. They are
used by an IR system to build an inverted index which is then consulted during the search.
The queries formulated by users are compared to the set of index keywords. Most IR
systems also allow the use of Boolean and other operators to build a complex query. The
query language with these operators enriches the expressiveness of a user’s information
need.
The Information Retrieval (IR) system finds the relevant documents from a large
data set according to the user query. Queries submitted by users to search engines might be
ambiguous, concise and their meaning may change over time. Some of the types of Queries
in IR systems are –
1. Keyword Queries :
Simplest and most common queries.
The user enters just keyword combinations to retrieve documents.
These keywords are connected by logical AND operator.
All retrieval models provide support for keyword queries.
2. Boolean Queries :
Some IR systems allow using +, -, AND, OR, NOT, ( ), Boolean operators in
combination of keyword formulations.
No ranking is involved because a document either satisfies such a query or does not
satisfy it.
A document is retrieved for a Boolean query if the query evaluates to logically true as an exact
match in the document.
3. Phrase Queries :
When documents are represented using an inverted keyword index for searching, the
relative order of terms in the document is lost.
To perform exact phrase retrieval, these phrases are encoded in the inverted index or
implemented differently.
This query consists of a sequence of words that make up a phrase.
It is generally enclosed within double quotes.
4. Proximity Queries :
Proximity refers to a search that accounts for how close within a record multiple items
should be to each other.
The most commonly used proximity search option is a phrase search that requires terms to be
in exact order.
Other proximity operators can specify how close terms should be to each other. Some
will specify the order of search terms.
Search engines use various operator names such as NEAR, ADJ (adjacent), or AFTER.
However, providing support for complex proximity operators becomes expensive as it
requires time-consuming pre-processing of documents, and so it is suitable for smaller
document collections rather than for the Web.
5. Wildcard Queries :
It supports regular expressions and pattern matching-based searching in text.
Retrieval models do not directly provide support for this query type.
In IR systems, certain kinds of wildcard search support may be implemented, for example,
searching for words ending with trailing wildcard characters.
6. Natural Language Queries :
There are only a few natural language search engines that aim to understand the
structure and meaning of queries written in natural language text, generally as a question
or narrative.
The system tries to formulate answers for these queries from retrieved results.
Semantic models can provide support for this query type.
In this section we review the commonly used text preprocessing techniques that are
part of the text processing task.
1. Stopword Removal
Stopwords are very commonly used words in a language that play a major role in the
formation of a sentence but which seldom contribute to the meaning of that sentence. Words
that are expected to occur in 80 percent or more of the documents in a collection are typically
referred to as stopwords, and they are rendered potentially useless. Because of the
commonness and function of these words, they do not contribute much to the relevance of a
document for a query search. Examples include words such
as the, of, to, a, and, in, said, for, that, was, on, he, is, with, at, by, and it. These words are
presented here with decreasing frequency of occurrence from a large corpus of documents
called AP89. The first six of these words account for 20 percent of all words in the listing, and
the most frequent 50 words account for 40 percent of all text.
3. Utilizing a Thesaurus
A thesaurus provides a standard vocabulary for indexing and searching. Usage of a thesaurus, also known as
a collection of synonyms, has a substantial impact on the recall of information systems. This
process can be complicated because many words have different meanings in different
contexts.
UMLS is a large biomedical thesaurus of millions of concepts (called the Metathesaurus) and
a semantic network of meta concepts and relationships that organize the Metathesaurus (see
Figure 27.3). The concepts are assigned labels from the semantic network. This thesaurus of
concepts contains synonyms of medical terms, hierarchies of broader and narrower terms, and
other relationships among words and concepts that make it a very extensive resource for
information retrieval of documents in the medical domain. Figure 27.3 illustrates part of the
UMLS Semantic Network.
WordNet is a manually constructed thesaurus that groups words into strict synonym
sets called synsets. These synsets are divided into noun, verb, adjective, and adverb
categories. Within each category, these synsets are linked together by appropriate
relationships such as class/subclass or “is-a” relationships for nouns.
WordNet is based on the idea of using a controlled vocabulary for indexing, thereby
eliminating redundancies. It is also useful in providing assistance to users with locating terms
for proper query formulation.
4. Other Preprocessing Steps: Digits, Hyphens, Punctuation Marks, Cases
Digits, dates, phone numbers, e-mail addresses, URLs, and other standard types of
text may or may not be removed during preprocessing. Web search engines, however, index
them in order to use this type of information in the document metadata to improve
precision and recall (see Section 27.6 for detailed definitions of precision and recall).
Hyphens and punctuation marks may be handled in different ways. Either the entire phrase
with the hyphens/punctuation marks may be used, or they may be eliminated. In some
systems, the character representing the hyphen/punctuation mark may be removed, or may be
replaced with a space. Different information retrieval systems follow different rules of
processing. Handling hyphens automatically can be complex: it can either be done as a
classification problem, or more commonly by some heuristic rules.
Most information retrieval systems perform case-insensitive search, converting all the
letters of the text to uppercase or lowercase. It is also worth noting that many of these text
preprocessing steps are language specific, such as involving accents and diacritics and the
idiosyncrasies that are associated with a particular language.
5. Information Extraction
Information extraction (IE) is a generic term used for extracting structured content
from text. Text analytic tasks such as identifying noun phrases, facts, events, people, places,
and relationships are examples of IE tasks. These tasks are also called named entity
recognition tasks and use rule-based approaches with either a thesaurus, regular expressions
and grammars, or probabilistic approaches. For IR and search applications, IE technologies
are mostly used to identify contextually relevant features that involve text analysis, matching,
and categorization for improving the relevance of search systems. Language technologies
using part-of-speech tagging are applied to semantically annotate the documents with
extracted features to aid search relevance.
Inverted Indexing
The simplest way to search for occurrences of query terms in text collections can be
performed by sequentially scanning the text. This kind of online searching is only appropriate
when text collections are quite small. Most information retrieval systems process the text
collections to create indexes and operate upon the inverted index data. An inverted index
structure comprises vocabulary and document information. Vocabulary is a set of distinct
query terms in the document set. Each term in a vocabulary set has an associated collection of
information about the documents that contain the term, such as document id, occurrence
count, and offsets within the document where the term occurs. The simplest form of
vocabulary terms consists of words or individual tokens of the documents. In some cases,
these vocabulary terms also consist of phrases, n-grams, entities, links, names, dates, or
manually assigned descriptor terms from documents and/or Web pages. For each term in the
vocabulary, the corresponding document ids, occurrence locations of the term in each
document, number of occurrences of the term in each document, and other relevant
information may be stored in the document information section.
Weights are assigned to document terms to represent an estimate of the usefulness of the
given term as a descriptor for distinguishing the given document from other documents in the
same collection. The weighting process captures the fact that a term may be a better descriptor of
one document than of another (see Section 27.2).
One of the most popular weighting schemes is the TF-IDF (term frequency-inverse
document frequency) metric that we described in Section 27.2. For a given term this
weighting scheme distinguishes to some extent the documents in which the term occurs more
often from those in which the term occurs very little or never. These weights are normalized
to account for varying document lengths, further ensuring that longer documents with
proportionately more occurrences of a word are not favored for retrieval over shorter
documents with proportionately fewer occurrences. These processed document-term streams
(matrices) are then inverted into term-document streams (matrices) for further IR steps.
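In its most common form (one of several variants), the TF-IDF weight of a term t in a document d is w(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the frequency of t in d, N is the total number of documents in the collection, and df(t) is the number of documents that contain t.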
The different steps involved in inverted index construction can be summarized as follows:
1. Break the documents into vocabulary terms by tokenizing, cleansing, stopword
removal, stemming, and/or use of an additional thesaurus as vocabulary.
2. Collect document statistics and store the statistics in a document lookup table.
3. Invert the document-term stream into a term-document stream along with
additional information such as term frequencies, term positions, and term weights.
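As a small illustration of these steps with hypothetical documents: for d1 = "distributed database systems" and d2 = "distributed information retrieval", the inverted index maps distributed → {d1, d2}, database → {d1}, systems → {d1}, information → {d2}, and retrieval → {d2}, with term frequencies and positions optionally attached to each posting.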
Searching for relevant documents from the inverted index, given a set of query terms, is
generally a three-step process.
Vocabulary search. If the query comprises multiple terms, they are separated and
treated as independent terms. Each term is searched in the vocabulary. Various data
structures, like variations of B+-trees or hashing, may be
used to optimize the search process. Query terms may also be ordered in lexicographic order
to improve space efficiency.
Document information retrieval. The document information for each term is retrieved.
Manipulation of retrieved information. The document information vector for each
term obtained in step 2 is now processed further to incorporate various forms of query logic.
Various kinds of queries like prefix, range, context, and proximity queries are processed in
this step to construct the final result based on the document collections returned in step 2.
5.6 Evaluation Measures
Without proper evaluation techniques, one cannot compare and measure the relevance
of different retrieval models and IR systems in order to make improvements.
Evaluation techniques of IR systems measure the topical
relevance and user relevance. Topical relevance measures the extent to which the topic of a
result matches the topic of the query. Mapping one’s information need with “perfect” queries
is a cognitive task, and many users are not able to effectively form queries that would retrieve
results more suited to their information need. Also, since a major chunk of user queries are
informational in nature, there is no fixed set of right answers to show to the user. User
relevance is a term used to describe the “goodness” of a retrieved result with regard to the
user’s information need. User relevance includes other implicit factors, such as user
perception, context, timeliness, the user’s environment, and current task needs. Evaluating
user relevance may also involve subjective analysis and study of user retrieval tasks to
capture some of the properties of implicit factors involved in accounting for users’ bias for
judging performance.
The precision at rank position i, denoted p(i), is the fraction of the documents ranked from
position 1 to position i in the result set that are relevant:
Precision p(i) = |S_i| / i
where S_i is the set of relevant documents among the top i retrieved documents.
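For example, if three of the top five retrieved documents are relevant, then p(5) = 3/5 = 0.6.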
5.7 Web Search and Analysis
Analyzing and extracting useful information from the Web is known as Web analysis. Over the past few years the World Wide Web has
emerged as an important repository of information for many day-to-day applications for
individual consumers, as well as a significant plat-form for e-commerce and for social
networking. These properties make it an interesting target for data analysis applications. The
Web mining and analysis field is an integration of a wide range of fields spanning
information retrieval, text analysis, natural language processing, data mining, machine
learning, and statistical analysis.
The goals of Web analysis are to improve and personalize search results relevance
and to identify trends that may be of value to various businesses and organizations. We
elaborate on these goals next.
Finding relevant information. People usually search for specific information on the Web
by entering keywords in a search engine or browsing information portals and using services.
Search services are constrained by search relevance problems since they have to map and
approximate the information need of millions of users as an a priori task.
Finding information of commercial value. This problem deals with finding interesting
patterns in users’ interests, behaviors, and their use of products and services, which may be of
commercial value. For example, businesses such as the automobile industry, clothing, shoes,
and cosmetics may improve their services by identifying patterns such as usage trends and
user preferences using various Web analysis techniques.
2. Searching the Web
The World Wide Web is a huge corpus of information, but locating resources that are
both high quality and relevant to the needs of the user is very difficult. The set of Web pages
taken as a whole has almost no unifying structure, with variability in authoring style and
content, thereby making it more difficult to precisely locate needed information. Index-based
search engines have been one of the prime tools by which users search for information on the
Web. Web search engines crawl the Web and create an index to the Web for searching
purposes.
3. Analyzing the Link Structure of Web Pages
The goal of Web structure analysis is to generate structural summary about the
Website and Web pages. It focuses on the inner structure of documents and deals with the
link structure using hyperlinks at the interdocument level. The structure and content of Web
pages are often combined for information retrieval by Web search engines. Given a collection
of interconnected Web documents, interesting and informative facts describing their
connectivity in the Web subset can be discovered. Web structure analysis is also used to
reveal the structure of Web pages, which helps with navigation and makes it possible to
compare/integrate Web page schemes. This aspect of Web structure analysis facilitates Web
document classification and clustering on the basis of structure.
4. Web Content Analysis
As mentioned earlier, Web content analysis refers to the process of discovering
useful information from Web content/data/documents. The Web content data consists of
unstructured data such as free text from electronically stored documents, semi-structured data
typically found as HTML documents with embedded image data, and more structured data
such as tabular data, and pages in HTML, XML, or other markup languages generated as
output from databases. More generally, the term Web content refers to any real data in the
Web page that is intended for the user accessing that page. This usually consists of but is not
limited to text and graphics.
5. Approaches to Web Content Analysis
The two main approaches to Web content analysis are
(1) agent based (IR view) and
(2) database based (DB view).
6. Web Usage Analysis
Web usage analysis is the application of data analysis techniques to discover usage patterns from Web
data, typically Web traffic data collected in a commercial context. This data is typically compared against key performance indicators to
measure the effectiveness or performance of the Website as a whole, and can be used to improve
the Website or its marketing strategies.
5.8 Current Trends
1. Faceted Search
Faceted search uses faceted classification that enables a user to navigate information along
multiple paths corresponding to different orderings of the facets. This contrasts with
traditional taxonomies in which the hierarchy of categories is fixed and unchanging.
University of California, Berkeley’s Flamenco project is one of the earlier examples of a
faceted search system.
2. Social Search
The traditional view of Web navigation and browsing assumes that a single user is
searching for information. This view contrasts with previous research by library scientists
who studied users’ information seeking habits. This research demonstrated that additional
individuals may be valuable information resources during information search by a single
user. More recently, research indicates that there is often direct user cooperation during Web-
based information search. Some studies report that significant segments of the user
population are engaged in explicit collaboration on joint search tasks on the Web. Active
collaboration by multiple parties also occur in certain cases (for example, enterprise settings);
at other times, and perhaps for a majority of searches, users often interact with others
remotely, asynchronously, and even involuntarily and implicitly.
3. Conversational Search
Conversational Search (CS) is an interactive and collaborative information
finding interaction. The participants engage in a conversation and perform a social search
activity that is aided by intelligent agents. The collaborative search activity helps the agent
learn about conversations with interactions and feedback from participants. It uses the
semantic retrieval model with natural language understanding to provide users with faster
and more relevant search results. It moves search from being a solitary activity to being a more
participatory activity for the user. The search agent performs multiple tasks of finding
relevant information and connecting the users together; participants provide feedback to the
agent during the conversations that allows the agent to perform better.