Unit 3 Notes UDS23201J Query Processing
Syllabus Content
MongoDB Introduction
Replication in MongoDB, Indexing in MongoDB
Distributed Query Optimization Algorithm
Query Processing
Query Processing Problem
Layers of Query Processing
Query Processing in Centralized Systems
Parsing
Translation
Optimization
Code generation
Example Query Processing in Distributed Systems
Mapping global query to local
Optimization of Distributed Queries
Centralized Query Optimization
Data localization
Fragmented query ordering
Update Document in MongoDB
Bulk write operations in MongoDB
Delete documents in MongoDB
HDFS Commands-Hadoop
MongoDB
MongoDB is a document database. It stores data in a type of JSON format called BSON.
MongoDB is an open-source, nonrelational database management system (DBMS) that
uses flexible documents instead of tables and rows to process and store various forms of
data. It has a flexible data model that enables you to store unstructured data, and it provides
full indexing support, and replication with rich and intuitive APIs.
• The document model maps to the objects in your application code, making data easy to work with.
• Ad hoc queries, indexing, and real-time aggregation provide powerful ways to access and analyze your data.
• MongoDB is a distributed database at its core, so high availability, horizontal scaling, and geographic distribution are built in and easy to use.
MongoDB is free to use. Versions released prior to October 16, 2018 are published
under the AGPL. All versions released after October 16, 2018, including patch
fixes for prior versions, are published under the Server Side Public License (SSPL)
v1.
Key features:
• Indexing
• Replication (duplication of data across servers)
• Load balancing
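As a quick illustration of indexing and replication, the mongosh sketch below creates a single-field index and checks replica-set status; the collection and field names follow the student example used later in these notes:
// Create an ascending index on the Marks field of the student collection.
db.student.createIndex({ Marks: 1 })
// List the collection's indexes to confirm the index was created.
db.student.getIndexes()
// On a replica set, report the state of each member.
rs.status()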
MongoDB uses sharding to support deployments with very large data sets and high throughput operations. Sharding means distributing data across multiple servers: a large amount of data is partitioned into data chunks using the shard key, and these data chunks are evenly distributed across shards that reside on many physical servers.
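A minimal mongosh sketch of setting up sharding, assuming a running sharded cluster (the commands go to a mongos router) and the gfg database used later in these notes:
// Enable sharding for the database.
sh.enableSharding("gfg")
// Shard the student collection on a hashed _id key so chunks spread evenly.
sh.shardCollection("gfg.student", { _id: "hashed" })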
MongoDB documents, and collections of documents, are the basic units of data. Formatted as BSON (Binary JSON, a binary form of JavaScript Object Notation), these documents can store various types of data and be distributed across multiple systems. Since MongoDB employs a dynamic schema design, users have great flexibility when creating data records, querying document collections through MongoDB aggregation, and analyzing large amounts of information.
SQL databases are considered relational databases. They store related data in separate
tables. When data is needed, it is queried from multiple tables to join the data back
together.
MongoDB stores data in flexible documents. Instead of having multiple tables you can
simply keep all of your related data together. This makes reading your data very fast.
You can still have multiple groups of data too. In MongoDB, instead of tables these are
called collections.
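For instance, data that a relational design would split across parent and child tables can live in a single document; the field names below are illustrative:
// One document keeps related data together instead of joining tables.
db.student.insertOne({
  Name: "Akshay",
  Marks: 500,
  courses: [ { title: "DBMS", grade: "A" }, { title: "Networks", grade: "B" } ]
})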
MongoDB Examples
In the following examples, we are working with:
Database: gfg
Collection: student
Document: none yet; we want to insert documents containing a student name and student marks.
Insert the document whose name is Akshay and marks is 500
Here, we insert a document into the student collection whose name is Akshay and marks is 500, using the insert() method.
db.student.insert({Name: "Akshay", Marks: 500})
Output (typical mongosh acknowledgment; the generated ObjectId will differ):
{ acknowledged: true, insertedIds: { '0': ObjectId("...") } }
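To verify the insert, the collection can be queried; a minimal sketch:
// Fetch every student whose Marks exceed 400.
db.student.find({ Marks: { $gt: 400 } })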
Query processing includes the translation of high-level queries into low-level expressions that can be used at the physical level of the file system, query optimization, and the actual execution of the query to get the result.
Query processing is the activity performed in extracting data from the database. It takes various steps to fetch the data from the database, as described below.
Query processing involves a series of activities for data retrieval. The user's query, written in a high-level database language such as SQL, is translated into expressions that can be used at the physical level of the file system; a variety of query-optimizing transformations then take place, followed by the actual evaluation of the query. A language such as SQL is the most suitable choice for humans, but it is not suitable as the system's internal representation of the query; relational algebra is well suited for that purpose. When a user executes a query, the parser in the system checks the syntax of the query, verifies the names of the relations in the database, the tuples, and finally the required attribute values. The parser creates a tree of the query, known as the 'parse tree', which is then translated into a relational algebra expression; any views used in the query are replaced by their definitions at this stage.
Suppose a user executes a query. As we have learned, there are various methods of extracting data from the database. In SQL, suppose a user wants to fetch the records of the employees whose salary is greater than or equal to 10000. For this, a query such as the following is undertaken (table and column names are illustrative):
SELECT * FROM Employee WHERE salary >= 10000;
Thus, to make the system understand the user query, it needs to be translated into relational algebra, for instance:
σsalary>=10000 (Employee)
After translating the given query, we can execute each relational algebra operation by using different algorithms. So, in this way, query processing begins its work.
Evaluation
For this, in addition to the relational algebra translation, the translated relational algebra expression is annotated with instructions that specify how each operation is to be evaluated. The annotated expression is called a query evaluation plan, which the system executes after translating the user query.
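MongoDB exposes its own query evaluation plan through explain(); a sketch using the student example from earlier in these notes:
// Show the winning plan and execution statistics for a query.
db.student.find({ Marks: { $gt: 400 } }).explain("executionStats")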
Optimization
o The cost of query evaluation can vary for different evaluation plans. The system is responsible for constructing the evaluation plan, so the user need not write the query efficiently.
o Usually, a database system generates an efficient query evaluation plan that minimizes its cost. This task, performed by the database system, is known as query optimization.
o To optimize a query, the query optimizer needs an estimated cost for each operation. This is because the overall operation cost depends on the memory allocated to the operations, execution costs, and so on.
[Figure: detailed diagram of the query-processing steps — parsing and translation, optimization, and evaluation]
Data fragmentation
One challenge of query optimization in a distributed DBMS is data fragmentation, which refers to dividing relations into smaller fragments stored at different nodes. The optimizer must identify which fragments are relevant to a query and combine the partial results correctly, which adds communication and coordination overhead.
Data localization
Another challenge of query optimization in a distributed DBMS is data localization, which
refers to the degree to which the data needed for a query is available at the node where the
query is initiated. Data localization can affect the performance of queries, as it may reduce
or increase the amount of data transfers, communication, or synchronization. For example,
if a query is initiated at a node that has most or all of the data needed for the query, it can
be executed locally without much overhead. On the other hand, if a query is initiated at a
node that has little or none of the data needed for the query, it may need to send requests to
other nodes or fetch data from them, which can incur more costs and delays. Therefore,
query optimization in a distributed DBMS needs to consider the data localization factor
and choose the best node or nodes to initiate the query.
Data replication
A third challenge of query optimization in a distributed DBMS is data replication, which
refers to the process of creating and maintaining copies of data at different nodes. Data
replication can improve the availability, reliability, and scalability of the distributed
DBMS, as it can provide backup, load balancing, and fault tolerance. However, data
replication can also introduce complexity and overhead for query optimization, as it may
create inconsistency, redundancy, or conflicts. For example, if a query involves data that is
replicated at multiple nodes, it may need to decide which copy of the data to use, how to
ensure that the copies are consistent, and how to handle updates or transactions that affect
the replicated data. Therefore, query optimization in a distributed DBMS needs to consider
the data replication strategy and choose the best node or nodes to access or update the data.
Network heterogeneity
A fourth challenge of query optimization in a distributed DBMS is network heterogeneity,
which refers to the variation in the network characteristics among the nodes. Network
heterogeneity can affect the performance of queries, as it may cause different levels of
latency, bandwidth, reliability, or congestion. For example, if a query involves data that is
stored at nodes that have different network speeds, distances, or qualities, it may need to
account for the network delays, costs, or failures. Therefore, query optimization in a
distributed DBMS needs to consider the network heterogeneity factor and choose the best
node or nodes to communicate or transfer data.
1. Query Decomposition
2. Data Localization
3. Global Query Optimization
4. Distributed Query Execution
Query Decomposition
The first layer decomposes the calculus query into an algebraic query on global relations.
The information needed for this transformation is found in the global conceptual schema
describing the global relations.
I. Normalization
II. Analysis
III. Simplification
IV. Restructuring
• First, the calculus query is rewritten in a normalized form that is suitable for subsequent
manipulation. Normalization of a query generally involves the manipulation of the query
quantifiers and of the query qualification by applying logical operator priority.
• Second, the normalized query is analyzed semantically so that incorrect queries are
detected and rejected as early as possible. Techniques to detect incorrect queries exist only
for a subset of relational calculus. Typically, they use some sort of graph that captures the
semantics of the query.
• Third, the correct query (still expressed in relational calculus) is simplified. One way to simplify a query is to eliminate redundant predicates. Note that redundant predicates are likely to arise when a query is the result of system transformations applied to the user query; such transformations are used for performing semantic data control (views, protection, and semantic integrity control).
• Fourth, the calculus query is restructured as an algebraic query. The traditional way to do this transformation toward a "better" algebraic specification is to start with an initial algebraic query and transform it in order to find a good one.
• The algebraic query generated by this layer is good in the sense that the worst executions are typically avoided.
Query (the standard textbook example this translation illustrates):
SELECT salary FROM instructor WHERE salary < 75000;
This query can be translated into either of the following relational-algebra expressions:
σsalary<75000 (Πsalary (instructor))
Πsalary (σsalary<75000 (instructor))
Data Localization
• The input to the second layer is an algebraic query on global relations. The main role of
the second layer is to localize the query's data using data distribution information in the
fragment schema.
• This layer determines which fragments are involved in the query and transforms the
distributed query into a query on fragments.
• A global relation can be reconstructed by applying the fragmentation rules and then deriving a program of relational algebra operators, called a localization program, which acts on fragments.
• First, the query is mapped into a fragment query by substituting each relation by its
reconstruction program (also called materialization program).
• Second, the fragment query is simplified and restructured to produce another "good"
query.
Global Query Optimization
• The input to the third layer is an algebraic query on fragments. The goal of query optimization is to find an execution strategy for the query which is close to optimal.
• The previous layers have already optimized the query, for example, by eliminating
redundant expressions. However, this optimization is independent of fragment
characteristics such as fragment allocation and cardinalities.
• Query optimization consists of finding the "best" ordering of operators in the query, including communication operators, that minimizes a cost function.
• The output of the query optimization layer is an optimized algebraic query on fragments with communication operators included. It is typically represented and saved (for future executions) as a distributed query execution plan.
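A commonly used total cost function for such a plan, following the classical distributed query-processing literature (a sketch; the coefficients are system-specific weights):

Total\;cost = T_{CPU} \cdot \#insts + T_{I/O} \cdot \#I/Os + T_{MSG} \cdot \#msgs + T_{TR} \cdot \#bytes

where T_{CPU} is the time of a CPU instruction, T_{I/O} the time of a disk I/O, T_{MSG} the fixed time to initiate and receive a message, and T_{TR} the time to transmit one byte between two sites.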
Distributed Query Execution
• The last layer is performed by all the sites that have fragments involved in the query.
• Each subquery executing at one site, called a local query, is then optimized using the local schema of the site and executed.
• Execution at this layer can also be organized to increase parallelism across the participating sites.
Query processing is the process of compiling and executing a query specification. It
consists of a compile-time phase and a runtime phase.
Query processing activities include:
• Translation of high-level language (HLL) queries into operations at the physical file level
• Evaluation of queries
The four main phases of query processing are: Decomposition, Optimization, Code
generation, Execution
Parsing of a query is the process of determining, for a given query, the different ways in which it can be executed. Every query must be parsed at least once. The parsing of a query is performed within the database by the optimizer component.
In query processing, parsing is the first step and involves checking the syntax of a
query. The database parses a statement when instructed by the application.
Query processing is the process of translating a high-level query, such as SQL, into a low-
level query that can be executed by the database system.
In query processing, translation is the process of converting a high-level query into a low-
level query that the database system can execute. The translation process is similar to the
parser of a query.
Validating: Checking the query's syntax and verifying the name of the relation in
the database, the tuple, and finally the required attribute value
1. Parser and translator: The first step in query processing is parsing and translation. The parser, just like a parser in compilers, checks the syntax of the query and whether the relations mentioned are present in the database. A high-level query language such as SQL is suitable for human use but totally unsuitable as the system's internal representation, so translation is required. The internal representation can be an extended form of relational algebra.
2. Optimization: A SQL query can be written in many different ways, and a query can have several equivalent relational-algebra expressions (as in the instructor example above). Which expression is better depends, among other things, on how the data is stored in the file organization, so the optimizer chooses among the alternatives.
Code Generation:
In the database world, human readable SQL queries can be compiled to native code, which
executes faster and is more efficient when compared to the alternative of interpreted
execution
Code generation is a technique for efficient program execution and data processing. In
query processing, code generation involves:
Extracting parameters from the query
For example: we have a table named students, and we want to find the records of those students whose marks are greater than 90%.
Input:
A user wants to fetch the records of the students whose percentage is greater than 90%. For this, the user writes the SQL query below (table and column names are illustrative):
SELECT * FROM students WHERE percentage > 90;
When the database finds, via the shared pool check, that the query has already been processed by some other session, it skips the next two steps of query processing (optimization and row source generation); this is known as soft parsing. If the query cannot be found in the already-processed pool, then all of the steps must be performed; this is known as hard parsing.
The process of mapping global queries to local ones can be realized as follows −
The tables required in a global query have fragments distributed across multiple
sites. The local databases have information only about local data. The controlling
site uses the global data dictionary to gather information about the distribution and
reconstructs the global view from the fragments.
If there is no replication, the global optimizer runs local queries at the sites where
the fragments are stored. If there is replication, the global optimizer selects the site
based upon communication cost, workload, and server speed.
The global optimizer generates a distributed execution plan so that the least amount of data transfer occurs across the sites. The plan states the location of the fragments, the order in which query steps need to be executed, and the processes involved in transferring intermediate results.
The local queries are optimized by the local database servers. Finally, the local
query results are merged together through union operation in case of horizontal
fragments and join operation for vertical fragments.
For example, let us consider that the following PROJECT schema is horizontally fragmented according to City, the cities being New Delhi, Kolkata, and Hyderabad.
PROJECT (with attributes such as PId, City, and Status; the exact schema is illustrative)
Suppose there is a query to retrieve details of all projects whose status is "Ongoing". A local query runs on each fragment, and the overall result is the union of the three results; with illustrative fragment names:
SELECT * FROM PROJECT_NEWDELHI WHERE Status = 'Ongoing'
UNION
SELECT * FROM PROJECT_KOLKATA WHERE Status = 'Ongoing'
UNION
SELECT * FROM PROJECT_HYDERABAD WHERE Status = 'Ongoing';
A distributed system has a number of database servers in the various sites to perform the
operations pertaining to a query. Following are the approaches for optimal resource
utilization −
Operation Shipping − In operation shipping, the operation is run at the site where the data
is stored and not at the client site. The results are then transferred to the client site. This is
appropriate for operations where the operands are available at the same site. Example:
Select and Project operations.
Data Shipping − In data shipping, the data fragments are transferred to the database
server, where the operations are executed. This is used in operations where the operands
are distributed at different sites. This is also appropriate in systems where the
communication costs are low, and local processors are much slower than the client server.
Hybrid Shipping − This is a combination of data and operation shipping. Here, data
fragments are transferred to the high-speed processors, where the operation runs. The
results are then sent to the client site.
Distributed database vs. Centralized database
Distributed Database System:
A distributed database management system is basically a set of multiple, logically interrelated databases distributed over a network. A single database is divided into sub-fragments, each controlled by an individual local DBMS while remaining integrated with the others. It provides a mechanism that helps users access the data transparently, without knowing where it is stored. Distributed database management systems are commonly used, for example in data warehouses, to access and process the databases of many clients at the same time.
Centralized Database System:
A centralized database is another type of database system, one that is located, maintained, and stored in a single location such as a mainframe computer. Data stored in the centralized DBMS is accessed by computers across the network. It includes a set of records that can easily be accessed from any location over a network connection such as a WAN or LAN. Centralized database systems are commonly used in organizations such as banks, schools, and colleges to manage all their data in an appropriate manner.
Difference:
Advantages of a centralized database:
By using a centralized database system, individuals and teams can easily share their ideas with each other. It becomes easy for the organization to coordinate work among team members and achieve their business goals.
Most organizations prefer a centralized database to reduce conflicts within the organization. Sharing information with each other leads to a happier working environment.
Disadvantages of a centralized database:
Organizations may face issues while using a centralized database due to heavy workload requirements.
While using a centralized database system, organizations may have to spend more money to manage and store the data.
Advantages of a distributed database:
By using one particular site, users can access the data stored at different sites easily and effectively.
Disadvantages of a distributed database:
A distributed database system increases the complexity and cost for the organization.
It becomes difficult for the organization to maintain and manage the local database management systems, and organizations may face difficulty establishing a network between the sites.
While using a distributed database system, organizations cannot use static SQL.
Fragmentation in Distributed DBMS
Fragmentation is the process of dividing the whole database into various subtables or sub-relations so that data can be stored in different systems. The small pieces, sub-relations, or subtables are called fragments; these are logical data units stored at various sites. Fragmentation must be done in such a way that the fragments can be used to reconstruct the original relation (i.e., without any loss of data). The restoration can be done using UNION or JOIN operations.
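In relational-algebra terms (a sketch, with R the original relation, p_i the fragmentation predicates, and attrs_i the attribute sets):

Horizontal: R = R_1 \cup R_2 \cup \dots \cup R_n, where R_i = \sigma_{p_i}(R)
Vertical: R = R_1 \bowtie_{key} R_2, where R_i = \Pi_{attrs_i \cup \{key\}}(R)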
Horizontal Fragmentation
It divides a table horizontally into a group of rows to create multiple fragments or subsets of a
table. These fragments can then be assigned to different sites in the database. Reconstruction is
done using UNION or JOIN operations. In relational algebra, it is represented as σp(T) for any
given table(T).
Example
In this example, we are going to see how the horizontal fragmentation looks in a table.
Input :
STUDENT
id name age salary
1 aman 21 20000
2 naman 22 25000
3 raman 23 35000
4 sonam 24 36000
Example
SELECT * FROM student WHERE salary<35000;
SELECT * FROM student WHERE salary>35000;
Output (fragment 1):
id name age salary
1 aman 21 20000
2 naman 22 25000
Output (fragment 2):
id name age salary
4 sonam 24 36000
Note that with these predicates the row with salary = 35000 (raman) falls into neither fragment; fragmentation predicates must cover all values (for example, salary <= 35000 and salary > 35000) so that the original table can be reconstructed.
There are three types of horizontal fragmentation: primary, derived, and complete horizontal fragmentation.
Vertical Fragmentation
It divides a table vertically into a group of columns to create multiple fragments or subsets of a table. These fragments can then be assigned to different sites in the database. Reconstruction is done using a full outer join operation; in practice each fragment also keeps the primary key so the original table can be rebuilt.
Example
This example shows how the Select statement is used to do the fragmentation and to provide the
output.
Input Table :
STUDENT
id name age salary
1 aman 21 20000
2 naman 22 25000
3 raman 23 35000
4 sonam 24 36000
Example
SELECT name FROM student; #fragmentation 1
SELECT age FROM student; #fragmentation 2
Output
name
aman
naman
raman
sonam
age
21
22
23
24
Mixed or Hybrid Fragmentation
It is done by performing both horizontal and vertical partitioning together, yielding a fragment that is a group of rows and columns of a relation.
Example
This example shows how a Select statement with a where clause performs both kinds of fragmentation at once (an illustrative query on the STUDENT table above):
SELECT name, age FROM student WHERE salary<35000;
Create Collection in MongoDB
Syntax:
>>db.createCollection(name, options)
Here,
name: the name of the collection to be created (a string).
options: a document that specifies options such as the memory size and indexing of the collection. It is an optional parameter.
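Example (a minimal sketch):
db.createCollection("student")
// Typical mongosh result: { ok: 1 }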
In MongoDB, the db.collection.insert() method is used to add or insert new documents into
a collection in your database.
Syntax
>>db.Collection_Name.insert(document)
Example (the original transcript was truncated; the document fields shown here are illustrative):
// Record 1 Inserted
srmdb> db.students.insert({
... name: "akshay",
... age: "15"
... })
{
acknowledged: true,
insertedIds: { '0': ObjectId("...") }
}
// Record 2 Inserted: a second insert() call returns the same acknowledgment shape.
insertOne() Function:
This function is used to insert only a single record (document) into MongoDB.
Syntax:
>>db.students.insertOne({record to be inserted})
Example (the inserted document was truncated in the original transcript; the fields are illustrative):
srmdb> db.students.insertOne({ name: "akshay", age: "15" })
{
acknowledged: true,
insertedId: ObjectId("65d22b7504c8cb7bb974c36e")
}
Syntax:
>>db.students.insertMany([{record 1}, {record 2}, {record 3}])
Example (completed with an illustrative second document; the original transcript was truncated):
srmdb> db.students.insertMany([
... { name: "akshay",
... age: "15"
... },
... { name: "raman",
... age: "16"
... }
... ])
{
acknowledged: true,
insertedIds: {
'0': ObjectId("65d2dc6904c8cb7bb974c37a"),
'1': ObjectId("65d2dc6904c8cb7bb974c37b")
}
}
MongoDB's update() and save() methods are used to update documents in a collection. The update() method updates the values in an existing document, while the save() method replaces the existing document with the document passed to the save() method.
Syntax
The basic syntax of update() method is as follows −
>>db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)
Example
Consider a collection named javatpoint. Insert the following document into the collection (the original example's fields were truncated; this field is illustrative):
db.javatpoint.insert({ course: "java" })
Now update the document with a typical update() call (hypothetical values):
db.javatpoint.update({ course: "java" }, { $set: { course: "android" } })
Output:
{
acknowledged: true,
matchedCount: 1,
modifiedCount: 1,
upsertedCount: 0
}
Every document in the collection has an “_id” field that uniquely identifies the document in a particular collection; it acts as the primary key for the documents in the collection. The “_id” field can be of any format, and the default format is ObjectId.
An ObjectId is a 12-byte field of BSON type:
The first 4 bytes represent the Unix timestamp of the document.
The next 3 bytes are the machine id on which the MongoDB server is running.
The next 2 bytes are the process id.
The last 3 bytes are an incrementing counter value.
Format of ObjectId:
ObjectId(<hexadecimal>)
ObjectId() accepts one optional parameter: a hexadecimal ObjectId as a string.
We can give our own ObjectId to a document, but it must be unique.
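A quick mongosh sketch (the generated id will differ; the document field is illustrative):
// Create a new ObjectId and extract its embedded creation time.
const oid = ObjectId()
oid.getTimestamp()   // returns the Date encoded in the first 4 bytes
// Insert a document with an explicit, user-supplied _id.
db.students.insertOne({ _id: oid, name: "demo" })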
MongoDB also allows you to delete a particular document or multiple documents.
MongoDB provides three methods for deleting documents:
1. db.collection.deleteOne()
2. db.collection.remove()
3. db.collection.deleteMany()
db.collection.deleteOne()
This method is used to delete only a single document, even when more than one document matches the criteria. Here is an example of using the db.collection.deleteOne() method to delete a single document; it operates on the students collection created above:
srmdb> db.students.deleteOne({name:"akshay"})
{ acknowledged: true, deletedCount: 1 }   // typical mongosh result
db.collection.deleteMany()
This method is used to delete all the documents that match the given criteria. A minimal sketch using the students collection (the filter value is illustrative; the result shows the typical mongosh output shape):
srmdb> db.students.deleteMany({ age: "15" })
{ acknowledged: true, deletedCount: 2 }