Unit 3 Notes UDS23201J Query Processing


Unit 3 (Query Processing)

Syllabus Content

 MongoDB Introduction
 Replication in MongoDB, Indexing in MongoDB
 Distributed Query Optimization Algorithm
 Query Processing
 Query Processing Problem
 Layers of Query Processing
 Query Processing in Centralized Systems
 Parsing
 Translation
 Optimization
 Code generation
 Example Query Processing in Distributed Systems
 Mapping global query to local queries
 Optimization of Distributed Queries
 Centralized Query Optimization
 Data localization
 Fragmented query ordering
 Update Document in Mongodb
 Bulk write operations in MongoDB
 Delete documents in MongoDB
 HDFS Commands-Hadoop

MongoDB

MongoDB is a cross-platform, document-oriented database program. It is a NoSQL database product that uses JSON-like documents with optional schemas.

MongoDB is a document database. It stores data in a type of JSON format called BSON.
MongoDB is an open-source, non-relational database management system (DBMS) that
uses flexible documents instead of tables and rows to process and store various forms of
data. It has a flexible data model that enables you to store unstructured data, and it provides
full indexing support and replication with rich and intuitive APIs.

 The document model maps to the objects in your application code, making data
easy to work with
 Ad hoc queries, indexing, and real-time aggregation provide powerful ways to
access and analyze your data
 MongoDB is a distributed database at its core, so high availability, horizontal
scaling, and geographic distribution are built in and easy to use
 MongoDB is free to use. Versions released prior to October 16, 2018 are published
under the AGPL. All versions released after October 16, 2018, including patch
fixes for prior versions, are published under the Server Side Public License (SSPL)
v1.

Some features of MongoDB include:


 Supports ad hoc queries

 Indexing

 Replication

 Duplication of data

 Load balancing

 Supports MapReduce and aggregation tools

 Uses JavaScript instead of stored procedures

 It is a schema-less database written in C++

MongoDB uses sharding to support deployments with very large data sets and high-throughput operations. Sharding means distributing data across multiple servers: a large amount of data is partitioned into data chunks using the shard key, and these data chunks are evenly distributed across shards that reside on many physical servers.
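For illustration, a minimal mongosh sketch of setting up sharding (the database and collection names here are illustrative, and a running sharded cluster is assumed):

// Enable sharding for a database, then shard a collection on a hashed key
sh.enableSharding("gfg")
sh.shardCollection("gfg.student", { _id: "hashed" })
// Inspect how the chunks ended up distributed across the shards
db.student.getShardDistribution()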
MongoDB documents, and collections of documents, are the basic units of data. Formatted as Binary JSON (BSON, a binary form of JavaScript Object Notation), these documents can store various types of data and be distributed across multiple systems. Since MongoDB employs a dynamic schema design, users have great flexibility when creating data records, querying document collections through MongoDB aggregation, and analyzing large amounts of information.

SQL vs Document Databases

SQL databases are considered relational databases. They store related data in separate
tables. When data is needed, it is queried from multiple tables to join the data back
together.

MongoDB is a document database, which is often referred to as a non-relational database. This does not mean that relational data cannot be stored in document databases; it means that relational data is stored differently. A better way to refer to it is as a non-tabular database.

MongoDB stores data in flexible documents. Instead of having multiple tables you can
simply keep all of your related data together. This makes reading your data very fast.

You can still have multiple groups of data too. In MongoDB, instead of tables these are
called collections.
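As a sketch of this idea (the field names are assumed for illustration), a single document can embed related data that a relational design would spread across several tables:

db.student.insertOne({
  name: "Akshay",
  marks: 500,
  // related data kept together instead of living in a separate "subjects" table
  subjects: [
    { title: "DBMS", score: 90 },
    { title: "Networks", score: 85 }
  ]
})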

MongoDB Examples

In the following examples, we are working with:
Database: gfg
Collection: student
Document: none yet; we want to insert documents in the form of a student name and
student marks.
Insert the document whose name is Akshay and marks is 500
Here, we insert a document into the student collection whose name is Akshay and marks is
500, using the insert() method.
db.student.insert({Name: "Akshay", Marks: 500})
Output (legacy mongo shell):
WriteResult({ "nInserted" : 1 })

Insert multiple documents in the collection


Here, we insert multiple documents into the collection by passing an array of documents to
the insert() method.
db.student.insert([{Name: "Bablu", Marks: 550},
{Name: "Chintu", Marks: 430},
{Name: "Devanshu", Marks: 499}
])
Output (legacy mongo shell, abbreviated):
BulkWriteResult({ "nInserted" : 3, ... })

Insert a document with _id field


Here, we insert a document in the student collection with _id field.
db.student.insert({_id: 102,Name: "Anup", Marks: 400})
Output (legacy mongo shell):
WriteResult({ "nInserted" : 1 })
Query Processing

Query processing includes the translation of high-level queries into low-level expressions
that can be used at the physical level of the file system, query optimization, and the actual
execution of the query to get the result.

Query processing is the activity performed in extracting data from the database. It takes
several steps to fetch the data from the database. The steps involved are:

1. Parsing and translation


2. Optimization
3. Evaluation
The query processing works in the following way:

Parsing and Translation

Query processing begins with translation. User queries are written in a high-level database
language such as SQL, which is well suited for humans but not for the system's internal
representation of the query; relational algebra is well suited for that internal representation.
So, before processing a query, the system must translate it from the human-readable
language into relational algebra expressions that can be used at the physical level of the
file system; the actual evaluation of the query and a variety of query-optimizing
transformations then take place. The translation process in query processing works
together with the parser. When a user executes a query, the parser in the system checks the
syntax of the query and verifies the name of the relation in the database, the tuple, and
finally the required attribute value, in order to generate the internal form of the query. The
parser creates a tree of the query, known as the 'parse tree', which is then translated into
relational algebra. Along the way, it replaces all uses of views appearing in the query.

Suppose a user executes a query. As we have learned, there are various methods of
extracting data from the database. In SQL, suppose a user wants to fetch the records of the
employees whose salary is greater than 10000. For doing this, the following query is
undertaken:

select emp_name from Employee where salary>10000;

Thus, to make the system understand the user query, it needs to be translated into
relational algebra. We can bring this query into relational algebra form as, for example:

o Πemp_name (σsalary>10000 (Employee))

o Πemp_name (σsalary>10000 (Πemp_name, salary (Employee)))

After translating the given query, each relational algebra operation can be executed using
different algorithms. In this way, query processing begins its work.
Evaluation

In addition to the relational algebra translation, it is required to annotate the translated
relational algebra expression with the instructions used for specifying and evaluating each
operation. Thus, after translating the user query, the system executes a query evaluation
plan.

Query Evaluation Plan


o In order to fully evaluate a query, the system needs to construct a query evaluation
plan.
o The annotations in the evaluation plan may refer to the algorithms to be used for the
particular index or the specific operations.
o Such relational algebra with annotations is referred to as Evaluation Primitives.
The evaluation primitives carry the instructions needed for the evaluation of the
operation.
o Thus, a query evaluation plan defines a sequence of primitive operations used for
evaluating a query. The query evaluation plan is also referred to as the query
execution plan.
o A query execution engine is responsible for generating the output of the given
query. It takes the query execution plan, executes it, and finally produces the output
for the user query.

Optimization

o The cost of query evaluation can vary for different types of queries. Although
the system is responsible for constructing the evaluation plan, the user need not
write the query efficiently.
o Usually, a database system generates an efficient query evaluation plan that
minimizes its cost. This task, performed by the database system, is known as
Query Optimization.
o For optimizing a query, the query optimizer needs an estimated cost for each
operation, because the overall operation cost depends on the memory allocated to
the operations, execution costs, and so on.
[Detailed diagram: overall flow of query processing – parser (syntax, semantic, and shared pool checks), optimizer, row source generator, and execution engine]

It is done in the following steps:


Step-1:
Parser: During the parse call, the database performs the following checks: syntax check,
semantic check, and shared pool check, after converting the query into relational algebra.
The parser performs the following checks (refer to the detailed diagram):

1. Syntax check – checks SQL syntactic validity. Example:


SELECT * FORM employee

Here the error of the wrong spelling of FROM is reported by this check.

2. Semantic check – determines whether the statement is meaningful or not.
Example: a query containing a table name which does not exist is caught by
this check.
3. Shared pool check – Every query possesses a hash code during its
execution. This check determines whether the hash code of the query
already exists in the shared pool; if it does, the database does not take the
additional optimization and execution steps.
Hard Parse and Soft Parse –
If a fresh query arrives and its hash code does not exist in the shared pool, the
query has to pass through the additional steps known as hard parsing; otherwise,
if the hash code exists, the query skips the additional steps and passes directly to
the execution engine (refer to the detailed diagram). This is known as soft parsing.
A hard parse includes the following steps: optimization and row source generation.
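As a toy JavaScript sketch of the shared pool check (a simplified assumption, not Oracle's actual implementation; the normalized query text stands in for the real hash code):

// Stand-in for the optimizer + row source generator (the hard parse work)
function optimizeAndGenerateRowSource(sql) {
  return { sql, plan: "full table scan" }; // placeholder plan
}

const sharedPool = new Map(); // simplified shared pool, keyed by normalized text

function parse(sql) {
  const key = sql.trim().toLowerCase(); // stand-in for the real hash code
  if (sharedPool.has(key)) return sharedPool.get(key); // soft parse: reuse plan
  const plan = optimizeAndGenerateRowSource(sql);      // hard parse: extra steps
  sharedPool.set(key, plan);
  return plan;
}

parse("SELECT * FROM employee"); // hard parse (first execution)
parse("SELECT * FROM employee"); // soft parse (hash already in the pool)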
Step-2:
Optimizer: During the optimization stage, the database must perform a hard parse at least
once for every unique DML statement, and it performs optimization during this parse. The
database never optimizes DDL unless it includes a DML component, such as a subquery,
that requires optimization.
Optimization is the process in which multiple execution plans for satisfying a query are
examined and the most efficient plan is selected for execution.
The database stores the candidate execution plans, and the optimizer passes the lowest-cost
plan on for execution.

Row Source Generation –

The row source generator is software that receives the optimal execution plan from the
optimizer and produces an iterative execution plan that is usable by the rest of the database.
The iterative plan is a binary program that, when executed by the SQL engine, produces
the result set.
Step-3:
Execution Engine: Finally runs the query and displays the required result.

Query Processing Problem


One problem in a distributed system, however, is the efficient processing of queries.
Database management systems must consider that the transmission of data over
communication lines introduces substantial time delays and that distributed systems
process and transmit data in parallel at separate points in the network.
Query processing in a distributed database management system (DBMS) is a complex
procedure that tackles issues of transaction management, data distribution, optimization,
and fault tolerance. The performance, scalability, and dependability of distributed database
systems depend on effective concurrency management, optimization, and query
decomposition techniques.

What are the main challenges of query optimization in a distributed DBMS?


1. Data fragmentation
2. Data localization
3. Data replication
4. Network heterogeneity
5. Query decomposition and allocation

Data fragmentation

One of the challenges of query optimization in a distributed DBMS is data fragmentation,
which refers to how the data is partitioned and allocated among the nodes. Data
fragmentation can affect the performance of queries, as it may require more data transfers,
joins, or replication. For example, if a query involves data that is horizontally fragmented,
meaning that different rows of a table are stored at different nodes, it may need to access
multiple nodes to retrieve the relevant data. On the other hand, if a query involves data that
is vertically fragmented, meaning that different columns of a table are stored at different
nodes, it may need to join the data from different nodes to reconstruct the table. Therefore,
query optimization in a distributed DBMS needs to consider the data fragmentation scheme
and choose the best node or nodes to process the query.

Data localization
Another challenge of query optimization in a distributed DBMS is data localization, which
refers to the degree to which the data needed for a query is available at the node where the
query is initiated. Data localization can affect the performance of queries, as it may reduce
or increase the amount of data transfers, communication, or synchronization. For example,
if a query is initiated at a node that has most or all of the data needed for the query, it can
be executed locally without much overhead. On the other hand, if a query is initiated at a
node that has little or none of the data needed for the query, it may need to send requests to
other nodes or fetch data from them, which can incur more costs and delays. Therefore,
query optimization in a distributed DBMS needs to consider the data localization factor
and choose the best node or nodes to initiate the query.

Data replication
A third challenge of query optimization in a distributed DBMS is data replication, which
refers to the process of creating and maintaining copies of data at different nodes. Data
replication can improve the availability, reliability, and scalability of the distributed
DBMS, as it can provide backup, load balancing, and fault tolerance. However, data
replication can also introduce complexity and overhead for query optimization, as it may
create inconsistency, redundancy, or conflicts. For example, if a query involves data that is
replicated at multiple nodes, it may need to decide which copy of the data to use, how to
ensure that the copies are consistent, and how to handle updates or transactions that affect
the replicated data. Therefore, query optimization in a distributed DBMS needs to consider
the data replication strategy and choose the best node or nodes to access or update the data.

Network heterogeneity
A fourth challenge of query optimization in a distributed DBMS is network heterogeneity,
which refers to the variation in the network characteristics among the nodes. Network
heterogeneity can affect the performance of queries, as it may cause different levels of
latency, bandwidth, reliability, or congestion. For example, if a query involves data that is
stored at nodes that have different network speeds, distances, or qualities, it may need to
account for the network delays, costs, or failures. Therefore, query optimization in a
distributed DBMS needs to consider the network heterogeneity factor and choose the best
node or nodes to communicate or transfer data.

Query decomposition and allocation

A fifth challenge of query optimization in a distributed DBMS is query decomposition and
allocation, which refers to the process of breaking down a query into subqueries and
assigning them to different nodes for execution. Query decomposition and allocation can
improve the performance of queries, as it can exploit the parallelism, concurrency, and
locality of the distributed DBMS. However, query decomposition and allocation can also
pose challenges for query optimization, as it may involve trade-offs, dependencies, or
coordination. For example, if a query is decomposed into subqueries that are allocated to
different nodes, it may need to balance the workload, minimize the data transfers, and
synchronize the results. Therefore, query optimization in a distributed DBMS needs to
consider the query decomposition and allocation problem and choose the best subqueries
and nodes to execute them.
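A schematic JavaScript sketch of decomposition and allocation (the runAt() helper and the site list are hypothetical; real systems also weigh costs when allocating subqueries):

// Hypothetical helper: ships a subquery to one site and returns its rows
async function runAt(site, subquery) {
  return []; // a network call in a real system; stubbed here
}

async function distributedSelect(sites, predicate) {
  // Decompose: run the same selection on each site's local fragment in parallel
  const partials = await Promise.all(
    sites.map(site => runAt(site, { op: "select", where: predicate }))
  );
  // Coordinate: merge (union) the partial results at the coordinating site
  return partials.flat();
}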
Layers of Query Processing

Query processing has 4 layers:

1. Query Decomposition
2. Data Localization
3. Global Query Optimization
4. Distributed Query Execution

Query Decomposition
The first layer decomposes the calculus query into an algebraic query on global relations.
The information needed for this transformation is found in the global conceptual schema
describing the global relations.

Query decomposition can be viewed as four successive steps.

I. Normalization
II. Analysis
III. Simplification
IV. Restructuring

• First, the calculus query is rewritten in a normalized form that is suitable for subsequent
manipulation. Normalization of a query generally involves the manipulation of the query
quantifiers and of the query qualification by applying logical operator priority.

• Second, the normalized query is analyzed semantically so that incorrect queries are
detected and rejected as early as possible. Techniques to detect incorrect queries exist only
for a subset of relational calculus. Typically, they use some sort of graph that captures the
semantics of the query.

•Third, the correct query (still expressed in relational calculus) is simplified. One way to
simplify a query is to eliminate redundant predicates. Note that redundant queries are likely
to arise when a query is the result of system transformations applied to the user query. such
transformations are used for performing semantic data control (views, protection, and
semantic integrity control).

•Fourth, the calculus query is restructured as an algebraic query. The traditional way to do
this transformation toward a "better" algebraic specification is to start with an initial
algebraic query and transform it in order to find a "good" one.

•The algebraic query generated by this layer is good in the sense that the worst
executions are typically avoided.

Query Processing Example

Query:

select salary from instructor where salary < 75000;

This query can be translated into either of the following relational-algebra expressions:
 σsalary<75000 (Πsalary (instructor))
 Πsalary (σsalary<75000 (instructor))

Data Localization

• The input to the second layer is an algebraic query on global relations. The main role of
the second layer is to localize the query's data using data distribution information in the
fragment schema.

• This layer determines which fragments are involved in the query and transforms the
distributed query into a query on fragments.

• A global relation can be reconstructed by applying the fragmentation rules, and then
deriving a program, called a localization program, of relational algebra operators, which
then act on fragments.

Generating a query on fragments is done in two steps

• First, the query is mapped into a fragment query by substituting each relation by its
reconstruction program (also called materialization program).

• Second, the fragment query is simplified and restructured to produce another "good"
query.
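For example, if a global relation EMP is horizontally fragmented into EMP1 and EMP2, the reconstruction program is a union, and simplification then pushes the selection down onto each fragment:

$$EMP = EMP_1 \cup EMP_2$$

$$\sigma_{salary>10000}(EMP) \Rightarrow \sigma_{salary>10000}(EMP_1) \cup \sigma_{salary>10000}(EMP_2)$$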

Global Query Optimization

• The input to the third layer is an algebraic query on fragments. The goal of query
optimization is to find an execution strategy for the query which is close to optimal.

• The previous layers have already optimized the query, for example, by eliminating
redundant expressions. However, this optimization is independent of fragment
characteristics such as fragment allocation and cardinalities.

• Query optimization consists of finding the "best" ordering of operators in the query,
including communication operators that minimize a cost function.
• The output of the query optimization layer is an optimized algebraic query, with
communication operators included, on fragments. It is typically represented and saved (for
future executions) as a distributed query execution plan.

Distributed Query Execution

• The last layer is performed by all the sites having fragments involved in the query.

• Each subquery executing at one site, called a local query, is then optimized using the
local schema of the site and executed.

Query Processing in Centralized Systems

Query processing in a centralized system aims to:


 Minimize the response time of queries

 Maximize the system's throughput

 Reduce the amount of storage and memory required for processing

 Increase parallelism
Query processing is the process of compiling and executing a query specification. It
consists of a compile-time phase and a runtime phase.
Query processing activities include:
 Translation of high-level languages (HLL) queries into operations at physical file
level

 Query optimization transformations

 Evaluation of queries
The four main phases of query processing are: decomposition, optimization, code
generation, and execution.

Parsing in Query Processing

Parsing of a query is the process of deciding, for a given query, in how many ways the
query can be run. Every query must be parsed at least once. The parsing of a query is
performed within the database using the Optimizer component.

In query processing, parsing is the first step and involves checking the syntax of a
query. The database parses a statement when instructed by the application.

The parsing stage involves:


 Separating the pieces of a SQL statement into a data structure

 Breaking down the query into different tokens

 Removing white spaces and comments

 Checking the syntax of the query

 Verifying the name of the relation in the database

 Verifying the tuple

 Verifying the required attribute value
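As a toy JavaScript sketch of the tokenizing step (real SQL parsers are far more elaborate; this only strips line comments and splits tokens):

function tokenize(sql) {
  return sql
    .replace(/--.*$/gm, "")              // remove SQL line comments
    .split(/\s+|(?=[(),;])|(?<=[(),;])/) // split on whitespace and punctuation
    .filter(tok => tok.length > 0);      // drop empty pieces left by splitting
}

tokenize("select emp_name from Employee where salary > 10000;");
// ["select", "emp_name", "from", "Employee", "where", "salary", ">", "10000", ";"]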

Query processing is the process of translating a high-level query, such as SQL, into a low-
level query that can be executed by the database system.

Translation in Query Processing

In query processing, translation is the process of converting a high-level query into a low-
level query that the database system can execute. The translation process is similar to the
parser of a query.

The query processing process includes:


 Parsing: Checking the syntax of the query and verifying the name of the relation in
the database, the tuple, and finally the required attribute value

 Validating: Checking that the referenced relations and attributes exist and that the
requested operations on them are valid

 Optimizing: Choosing an efficient evaluation strategy for the query

 Generating a query execution plan

 Actual execution: Executing the query to get the actual result

Optimization in Query Processing

There are three major steps involved in query processing:

1. Parser and translator: The first step in query processing is parsing and translation.
The parser, just like a parser in compilers, checks the syntax of the query and whether
the relations mentioned are present in the database or not. A high-level query language
such as SQL is suitable for human use, but it is unsuitable as the system's internal
representation; therefore, translation is required. The internal representation can be an
extended form of relational algebra.
2. Optimization: A SQL query can be written in many different ways. An optimized
query also depends on how the data is stored in the file organization. A query can also
have several corresponding relational algebra expressions.

So a query can be written in two or more forms of relational algebra, and which one is
better depends entirely on the implementation of the file system.

3. Execution plan: A systematic step-by-step execution of primitive operations for
fetching data from the database is termed a query evaluation plan. Different evaluation
plans for a particular query have different query costs. The cost may include the
number of disk accesses, CPU time for executing the query, and, in the case of
distributed databases, communication time.
Purpose of SQL Query Optimization
The major purposes of SQL Query optimization are:
1. Reduce Response Time: The major goal is to enhance performance by reducing the
response time. The time difference between users requesting data and getting
responses should be minimized for a better user experience.
2. Reduced CPU execution time: The CPU execution time of a query must be
reduced so that faster results can be obtained.
3. Improved Throughput: The number of resources accessed to fetch all
necessary data should be minimized. The rows needed by a particular query should be
fetched in the most efficient manner, so that the least number of resources
are used.
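In MongoDB terms, the plan chosen by the optimizer can be inspected with explain() (the collection and field names follow the earlier gfg examples and are assumptions here):

// Compare plans before and after adding an index on Marks
db.student.find({ Marks: { $gt: 450 } }).explain("executionStats")
db.student.createIndex({ Marks: 1 })
db.student.find({ Marks: { $gt: 450 } }).explain("executionStats")
// the winningPlan typically changes from COLLSCAN to IXSCAN,
// with far fewer documents examined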

Code Generation:

Code generation is the process of creating code from a high-level representation. In
computing, code generation is part of the compiler process chain and converts the source
code's intermediate representation into a form that the target system can execute.
The goals of a good code generator are correctness and efficiency.
Code generation tools parse input data to produce target code in programming
languages. The algorithm for code generation is split into four parts:
 Register descriptor set-up
 Basic block generation
 Instruction generation for operations on registers
 Ending the basic block with a jump statement or return command

In the database world, human-readable SQL queries can be compiled to native code, which
executes faster and is more efficient than the alternative of interpreted execution.
Code generation is a technique for efficient program execution and data processing. In
query processing, code generation involves:
 Extracting parameters from the query

 Converting the normalized query into a format tailored to the system

 Writing out the actual access routines to be executed

 Inserting all generated functions into a new C source file

 Invoking the compiler to compile the source file
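As a toy JavaScript sketch of the idea of compiling a query predicate into a directly executable function instead of interpreting it per row (an illustration, not the C-source pipeline described above):

// A tiny "code generator": emits a closure for one comparison
function compilePredicate(field, op, value) {
  const ops = {
    ">":  row => row[field] > value,
    "<":  row => row[field] < value,
    "==": row => row[field] === value,
  };
  return ops[op];
}

const pred = compilePredicate("salary", ">", 10000);
[{ salary: 9000 }, { salary: 12000 }].filter(pred); // [{ salary: 12000 }]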


Examples of Query Processing

For example: we have a table named Students, and we want to find the records of those
students whose percentage is greater than 90.

For our discussion, let us take this scenario:

Input:

A user wants to fetch the records of the students whose percentage is greater than 90. For
this, the user writes the SQL query below.

Select student_name from Students where percentage > 90;


Hard Parse Vs Soft Parse

When the DB finds, via the shared pool check, that the query has already been processed
by some other session, it skips the next two steps of query processing in the DBMS, i.e.
optimization and row source generation; this is known as soft parsing. If the query cannot
be found in the already-processed pool, then all of the steps must be performed; this is
known as hard parsing.

Distributed Query Processing Architecture

In a distributed database system, processing a query comprises optimization at both the
global and the local level. The query enters the database system at the client or controlling
site. Here, the user is validated, and the query is checked, translated, and optimized at a
global level.

The architecture can be represented as −

[Figure: distributed query processing architecture – global query checking, translation, and optimization at the controlling/client site, followed by local optimization and execution at the data sites]


Mapping Global Queries into Local Queries

The process of mapping global queries to local ones can be realized as follows −

 The tables required in a global query have fragments distributed across multiple
sites. The local databases have information only about local data. The controlling
site uses the global data dictionary to gather information about the distribution and
reconstructs the global view from the fragments.
 If there is no replication, the global optimizer runs local queries at the sites where
the fragments are stored. If there is replication, the global optimizer selects the site
based upon communication cost, workload, and server speed.
 The global optimizer generates a distributed execution plan so that the least amount
of data transfer occurs across the sites. The plan states the location of the fragments,
the order in which query steps need to be executed, and the processes involved in
transferring intermediate results.
 The local queries are optimized by the local database servers. Finally, the local
query results are merged together through union operation in case of horizontal
fragments and join operation for vertical fragments.

For example, let us consider that the following Project schema is horizontally fragmented
according to City, the cities being New Delhi, Kolkata and Hyderabad.

PROJECT

PId City Department Status

Suppose there is a query to retrieve details of all projects whose status is “Ongoing”.

The global query will be −

$$\sigma_{status="Ongoing"}(PROJECT)$$

Query in New Delhi's server will be −

$$\sigma_{status="Ongoing"}(NewD\_PROJECT)$$

Query in Kolkata's server will be −

$$\sigma_{status="Ongoing"}(Kol\_PROJECT)$$

Query in Hyderabad's server will be −

$$\sigma_{status="Ongoing"}(Hyd\_PROJECT)$$

In order to get the overall result, we need to take the union of the results of the three
queries, as follows −

$$\sigma_{status="Ongoing"}(NewD\_PROJECT) \cup \sigma_{status="Ongoing"}(Kol\_PROJECT) \cup \sigma_{status="Ongoing"}(Hyd\_PROJECT)$$

Optimal Utilization of Resources in the Distributed System

A distributed system has a number of database servers in the various sites to perform the
operations pertaining to a query. Following are the approaches for optimal resource
utilization −

Operation Shipping − In operation shipping, the operation is run at the site where the data
is stored and not at the client site. The results are then transferred to the client site. This is
appropriate for operations where the operands are available at the same site. Example:
Select and Project operations.
Data Shipping − In data shipping, the data fragments are transferred to the database
server, where the operations are executed. This is used in operations where the operands
are distributed at different sites. This is also appropriate in systems where the
communication costs are low, and local processors are much slower than the client server.
Hybrid Shipping − This is a combination of data and operation shipping. Here, data
fragments are transferred to the high-speed processors, where the operation runs. The
results are then sent to the client site.
Distributed database vs. Centralized database
Distributed Database System:
A distributed database management system is basically a set of multiple, logically
interrelated databases distributed over a network. It includes a single database that is
further divided into sub-fragments. Each fragment is integrated with the others and is
controlled by an individual database. It provides a mechanism that helps users distribute
data transparently. Distributed database management systems are mostly used in
warehouses to access and process the databases of clients at a single time.

Centralized Database Management System:

A centralized database is another type of database system, which is located, maintained,
and stored in a single location such as a mainframe computer. Data stored in the
centralized DBMS is accessed across the network computers. It includes a set of
records that can easily be accessed from any location using an internet connection
such as a WAN or LAN. Centralized database systems are commonly used in
organizations such as banks, schools, and colleges to manage all their data in an
appropriate manner.
Difference:

Centralized DBMS | Distributed DBMS

Data is stored only on one site | Data is stored on different sites
Data stored on a single computer can be used by multiple users | Data is stored over different sites which are connected with each other
If the centralized system fails, the entire system is halted | If one of the sites fails, the user can access the data from other sites
Centralized DBMS is less reliable and reactive | Distributed DBMS is more reliable and reactive
Centralized DBMS is less complex | Distributed DBMS is more complex

Advantages and disadvantages of Centralized DBMS:

Advantages:

 With the help of a centralized database management system, organizations can easily
communicate with each other in less time. This approach basically allows the team
members of an organization to work on cross-functional projects. It becomes
easy for the team members to analyse the data and complete their tasks with good
quality.

 By using centralized database system, individuals and teams can easily share their
ideas with each other. It becomes easy for the organization to co-ordinate their
work with the team members and achieve their business goals.

 Centralized database system also provides high level of security.

 Most of the organizations prefer to use centralized database to reduce the conflicts
within the organization. Sharing the information with each other leads to a happier
working environment.

Disadvantages of Centralized database:

 Organizations may face issues while using centralized database due to heavy
workload requirements.
 While using centralized database system, organizations may have to spend more
money to manage and store the data.

Advantages and disadvantages of Distributed database system:


Advantages:
 Distributed database system reflects the organizational structure in an appropriate
manner. It becomes easy to access the organizational data in an effective manner.

 By using one particular site, users can access the data stored at different sites easily
and effectively.

 Distributed database system also improves the availability, reliability and
performance of the organization. Failure of one site allows the users to access the
information from other sites easily.

 Distributed database system helps the organizations to handle their growth
expansion. An increase in the database size can easily be managed by using a
distributed database system.

Disadvantage:

 Distributed database system increases the complexity and cost of the organization.
It becomes difficult for the organization to maintain and manage the local database
management system due to which organizations may face difficulty to establish a
network between the sites.

 In a distributed database system, it becomes difficult for the organizations to control
the replicated data.

 While using a distributed database system, organizations may face difficulty in
maintaining database integrity. Organizations have to spend more on
communication and processing costs to enforce the integrity constraints.

 While using distributed database system, organizations cannot use static SQL.
Fragmentation in Distributed DBMS
Fragmentation is the process of dividing the whole database into various subtables or
sub-relations so that data can be stored in different systems. The smaller pieces, or
sub-relations or subtables, are called fragments. These fragments are logical data units
and are stored at various sites. It must be ensured that the fragments are such that they
can be used to reconstruct the original relation (i.e., there is no loss of data). The
restoration can be done using UNION or JOIN operations.

Database fragmentation is of three types: horizontal fragmentation, vertical fragmentation,
and mixed or hybrid fragmentation.

Horizontal Fragmentation

It divides a table horizontally into a group of rows to create multiple fragments or subsets of a
table. These fragments can then be assigned to different sites in the database. Reconstruction is
done using UNION or JOIN operations. In relational algebra, it is represented as σp(T) for any
given table(T).

Example

In this example, we are going to see how the horizontal fragmentation looks in a table.

Input :
STUDENT
id name age salary
1 aman 21 20000
2 naman 22 25000
3 raman 23 35000
4 sonam 24 36000
Example
SELECT * FROM student WHERE salary<35000;
SELECT * FROM student WHERE salary>=35000;

Output
id name age salary
1 aman 21 20000
2 naman 22 25000

id name age salary
3 raman 23 35000
4 sonam 24 36000

There are three types of Horizontal fragmentation: Primary, Derived, and Complete Horizontal
Fragmentation

A: Primary Horizontal Fragmentation: It is a process of segmenting a single table in a


row−wise manner using a set of conditions.
Example

This example shows how the Select statement is used with a condition to provide output.

SELECT * FROM student WHERE salary<30000;


Output
id name age salary
1 aman 21 20000
2 naman 22 25000
B: Derived Horizontal Fragmentation: fragmentation that is derived from the primary
relation.
Example

This example shows how the Select statement is used with the where clause to provide output.

SELECT * FROM student WHERE age=21 AND salary<30000;


Output
id name age salary
1 aman 21 20000
C: Complete horizontal fragmentation: it guarantees that every row of the table is placed
in at least one fragment.
Vertical Fragmentation

It divides a table vertically into a group of columns to create multiple fragments or subsets of a
table. These fragments can then be assigned to different sites in the database. Reconstruction is
done using full outer join operation.
Example

This example shows how the Select statement is used to do the fragmentation and to provide the
output.

Input Table :
STUDENT
id name age salary
1 aman 21 20000
2 naman 22 25000
3 raman 23 35000
4 sonam 24 36000
Example
SELECT name FROM student; #fragmentation 1
SELECT age FROM student; #fragmentation 2
Output
name
aman
naman
raman
sonam
age
21
22
23
24
Mixed or Hybrid Fragmentation

It is done by performing both horizontal and vertical partitioning together. It is a group of rows
and columns in relation.

Example

This example shows how the Select statement is used with the where clause to provide the output.

SELECT name, age FROM student WHERE age=22;


Output
name age
naman 22
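In relational algebra (matching the σp(T) notation used above for horizontal fragments), this hybrid fragment is a projection applied on top of a selection:

$$\Pi_{name,\,age}(\sigma_{age=22}(STUDENT))$$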
MongoDB Create Collection

In MongoDB, db.createCollection(name, options) is used to create a collection. Usually,
however, you do not need to create a collection yourself: MongoDB creates one
automatically when you insert documents into it. First, see how to create a collection:

Syntax:

>>db.createCollection(name, options)

Here,

Name: a string; specifies the name of the collection to be created.

Options: a document; specifies the memory size and indexing of the collection. It is
an optional parameter.
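For example (the collection names and option values here are illustrative):

>>db.createCollection("student")
>>db.createCollection("log", { capped: true, size: 1048576, max: 1000 })

The second call creates a capped collection holding at most roughly 1 MB and 1000 documents; once the limit is reached, the oldest documents are overwritten.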

Writing Operation in MongoDB


How to Insert Document into MongoDB

In MongoDB, the db.collection.insert() method is used to add or insert new documents into
a collection in your database.

Syntax

>>db.Collection_Name.insert(document)

Example:
// Record 1 Inserted

srmdb> db.students.insert({

... name: "aaradhya Singh",

... age: "09"

... })

{
  acknowledged: true,
  insertedIds: { '0': ObjectId("65d229c304c8cb7bb974c36c") }
}
// Record 2 Inserted

srmdb> db.students.insert({ name: "vidisha choudhary", age: "09 months" })

{
  acknowledged: true,
  insertedIds: { '0': ObjectId("65d22a3604c8cb7bb974c36d") }
}

insertOne() Function:
This function is used to insert a single record into MongoDB.

Syntax:
>>db.students.insertOne({record to be inserted})

Example:

srmdb> db.students.insertOne({ name:" ravi ranjan", age: "22", address: "Delhi"})

{
  acknowledged: true,
  insertedId: ObjectId("65d22b7504c8cb7bb974c36e")
}

Bulk write operations in MongoDB

In MongoDB, the insertMany() method inserts multiple documents into a collection in a
single operation. The method takes an array of documents to insert into the specified
collection. It inserts documents in the order specified until an exception occurs.
To insert documents, add your Document objects to a List and pass that List as an argument
to insertMany(). You can also specify additional options in the options object passed as the
second parameter of the insertMany() method.

Syntax:
>>db.students.insertMany([{record 1}, {record 2}, {record 3}])

Example

srmdb> db.students.insertMany([
... { name: "akshay", age: "15" },
... { name: "saurav Mishra", age: "24" }
... ])

Comparison of insert(), insertOne(), and insertMany():

insert()
 Pymongo equivalent command is insert()
 Deprecated in newer versions of the mongo engine
 Throws WriteResult.writeConcernError and WriteResult.writeError for write-concern and non-write-concern errors respectively
 Compatible with db.collection.explain()
 If ordered is set to true and any document reports an error, the remaining documents are not inserted; if ordered is set to false, the remaining documents are inserted even if an error occurs
 Returns an object that contains the status of the operation

insertOne()
 Pymongo equivalent command is insert_one()
 Used in newer versions of the mongo engine
 Throws either a writeError or writeConcernError exception
 Not compatible with db.collection.explain()
 If an error is reported for the document, it is not inserted into the database
 Returns the insert_id of the document inserted

insertMany()
 Pymongo equivalent command is insert_many()
 Used in newer versions of the mongo engine
 Throws a BulkWriteError exception
 Not compatible with db.collection.explain()
 If ordered is set to true and any document reports an error, the remaining documents are not inserted; if ordered is set to false, the remaining documents are inserted even if an error occurs
 Returns the insert_ids of the documents inserted

Update Document in Mongodb

MongoDB's update() and save() methods are used to update documents in a
collection. The update() method updates the values in the existing document, while
the save() method replaces the existing document with the document passed to
save().

Syntax
The basic syntax of update() method is as follows −
>>db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)

Example
In the earlier examples, we inserted documents into the students collection. The following
call updates one of them using updateOne() with the $set operator:

srmdb> db.students.updateOne({name:"vidisha choudhary"},{$set:{age:"11 months"}})

Output:

{
  acknowledged: true,
  insertedId: null,
  matchedCount: 1,
  modifiedCount: 1,
  upsertedCount: 0
}
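Similarly, updateMany() applies the change to every document that matches the filter (the filter and values here are assumptions based on the earlier examples):

srmdb> db.students.updateMany({ address: "Delhi" }, { $set: { address: "New Delhi" } })

The output has the same shape as for updateOne(), with matchedCount and modifiedCount reporting how many documents were matched and changed.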

What is ObjectId in MongoDB

Every document in a collection has an "_id" field that is used to uniquely identify the
document within that collection; it acts as the primary key for the documents in the
collection. The "_id" field can be of any format, and the default format is the ObjectId of
the document.
An ObjectId is a 12-byte field of BSON type:
 The first 4 bytes represent the Unix timestamp of the document
 The next 3 bytes are the machine id on which the MongoDB server is running
 The next 2 bytes are the process id
 The last 3 bytes are an incrementing counter
Format of ObjectId:
ObjectId(<hexadecimal>)
ObjectId accepts one optional parameter: a hexadecimal ObjectId as a string.
We can give our own ObjectId to the document but it must be unique.
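For example, in mongosh the embedded timestamp can be read back from an ObjectId:

srmdb> ObjectId("65d229c304c8cb7bb974c36c").getTimestamp()
// returns an ISODate derived from the leading 4 timestamp bytes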

Delete Document in MongoDB

MongoDB also allows you to delete a particular document or multiple documents from a
collection, using one of three methods. The three methods provided by MongoDB for
deleting documents are:

1. db.collection.deleteOne()
2. db.collection.remove()
3. db.collection.deleteMany()

db.collection.deleteOne()

This method is used to delete only a single document, even when more than one document
matches the criteria. Here is an example of using the db.collection.deleteOne() method to
delete a single document. To perform this process, we have created a database and saved
all the data separately.

srmdb> db.students.deleteOne({name:"akshay"})

{ acknowledged: true, deletedCount: 1 }

db.collection.deleteMany()

MongoDB allows you to delete multiple documents using the db.collection.deleteMany()
method. This method deletes all documents that match the criteria mentioned in its
parameter. To see db.collection.deleteMany() in action, use the method the same way as
before:
srmdb> db.students.insertMany([ { name: "rani", age: "15" }, { name: "himanshi",
age: "15" }])

{
  acknowledged: true,
  insertedIds: {
    '0': ObjectId("65d2dc6904c8cb7bb974c37a"),
    '1': ObjectId("65d2dc6904c8cb7bb974c37b")
  }
}

srmdb> db.students.deleteMany({age: "15"})

{ acknowledged: true, deletedCount: 3 }

MongoDB Comparison Query Operators

MongoDB provides comparison operators for filtering documents in find() queries:

 $lt: Less than
 $lte: Less than or equal
 $gt: Greater than
 $gte: Greater than or equal

The following methods show common ways to use these operators:
Method 1: Greater Than Query
db.myCollection.find({field1: {$gt:25}})

Method 2: Less Than Query


db.myCollection.find({field1: {$lt:25}})

Method 3: Greater Than and Less Than Query


db.myCollection.find({field1: {$gt:25, $lt:32}})

Method 4: Greater Than or Less Than Query


db.myCollection.find({ "$or": [ {"field1": {$gt: 30}}, {"field1": {$lt:
20}} ] })
