
International Journal of Scientific and Research Publications, Volume 3, Issue 5, May 2013

ISSN 2250-3153

Join Query Optimization in Distributed Databases


Pawandeep Kaur*, Jaspreet Kaur Sahiwal**

* M.Tech Student, CSE Department, Lovely Professional University, Phagwara, India
** Assistant Professor, CSE Department, Lovely Professional University, Phagwara, India

Abstract- Query optimization means choosing the best plan for a query so that the performance of the query improves. Query optimization is more difficult in distributed databases than in centralized databases. Queries in distributed databases are affected by factors such as the method used to insert data into the remote server and the transmission time between servers. The response time of a query depends on the transmission time and the local processing speed.

I. INTRODUCTION

As the volume of data increases day by day, it becomes more difficult to store all of it on a single site. Data on a single site also suffers from problems such as storage limitations and site failure. Therefore, a distributed database is required to distribute and store the data on multiple sites. A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network [5]. Data distributed on different sites is accessed with the help of queries.

Benefits of using distributed databases are:

i. Improved Performance: Because the data is stored on multiple sites, the load on any one machine decreases, which improves performance.
ii. Localization: The data is kept close to the site where it is needed, therefore it can be accessed in less time and data transfer time is reduced.
iii. Availability and Reliability: In distributed database systems, the availability of the data increases because replicas of the data are distributed at different sites. Reliability also increases because if one site fails, the data can be accessed from another site where its replica is present. So, in a distributed environment, failure of one site does not result in unavailability of the data.
iv. Reduced communication overhead: Communication overhead is reduced in a distributed environment because a relation containing replicas of the data is available locally at each site.
v. Easier System Expansion: The capacity of the distributed database can be increased easily by adding computers to the network [7]. Because the system can associate and coordinate a number of small machines, it gains power comparable to that of a supercomputer.

II. QUERY OPTIMIZATION

Query optimization means executing a query in a different way so that it gives the same result but retrieves the data faster. Queries should be efficient so that data can be retrieved in less time. There are alternative ways to perform a query that give the same result; the chosen way should return the result in minimum time and so increase the performance of the query. In distributed systems, the cost of a query plan is given by the sum of the transmission cost and the local processing cost [5]. The transmission cost depends on the speed of transferring data from one machine to another, and the local processing cost is measured in terms of CPU cycles and disk I/O. The query optimizer:

• Enumerates the alternative plans
• Estimates the cost of every plan using a cost model
• Selects the plan with the lowest cost

A join query in distributed databases is used to join data from multiple sites. Optimization of a join query is simpler in centralized databases than in distributed databases. More work has been done on join queries in centralized databases, and more optimization is required in distributed databases.

III. RELATED WORK ON JOIN QUERY OPTIMIZATION IN DISTRIBUTED DATABASES

There are different methods to optimize queries in databases. These methods improve the performance of the query and decrease its cost. The optimizer determines the order in which the operations of a query (e.g. joins, selects, and projects) should be executed. One approach from related work on join query optimization in distributed databases is to calculate the size of the data on two different machines, send the smaller table to the other site, and then perform the join query there [5].

Figure 1: Minimize size of transmission data [5]
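The plan selection described in Section II and the smaller-table strategy from Section III can be sketched together. This is a minimal illustration rather than the method of [5]: the `Plan` structure, the table sizes, the bandwidth, and the local processing figures are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Plan:
    """One candidate way to run the join (hypothetical structure)."""
    name: str
    bytes_shipped: int   # data that must cross the network
    local_cost_s: float  # assumed CPU + disk I/O time at the join site


def plan_cost(plan: Plan, bandwidth_bytes_per_s: float) -> float:
    """Cost of a plan = transmission cost + local processing cost."""
    transmission_s = plan.bytes_shipped / bandwidth_bytes_per_s
    return transmission_s + plan.local_cost_s


def choose_plan(plans: Iterable[Plan], bandwidth_bytes_per_s: float) -> Plan:
    """Select the plan with the lowest estimated cost."""
    return min(plans, key=lambda p: plan_cost(p, bandwidth_bytes_per_s))


# Assumed sizes: SUPPLY is the larger relation, SUPPLIER the smaller one.
size_supply = 40_000_000
size_supplier = 5_000_000
bandwidth = 1_000_000  # bytes per second, assumed

plans = [
    Plan("ship SUPPLY to site 2", size_supply, 2.0),
    Plan("ship SUPPLIER to site 1", size_supplier, 2.0),
]
best = choose_plan(plans, bandwidth)
print(best.name, plan_cost(best, bandwidth))
```

With equal local processing costs, the cheaper plan is the one that ships the smaller relation, which matches the strategy described above.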

Another method for join queries in distributed databases is parallel query processing. Parallel processing does not focus on minimizing the quantity of transmitted data but rather on maximizing the number of simultaneous transmissions [7].

Figure 2: Parallel processing of join query

The client sends requests for the data to server 1 and server 2 by means of queries. Server 1 then sends the SUPPLY data and server 2 sends the SUPPLIER data to the client. The client inserts the data into its database and performs the join query on the data from the two servers.

If server 1 contains the SUPPLY relation as:
SUPPLY(SUPPLY_NO, FROM_PLACE, TO_PLACE)

and server 2 contains the SUPPLIER relation as:
SUPPLIER(SUPPLY_NO, S_NAME, S_ADDRESS)

and the client wants the join of the SUPPLY and SUPPLIER relations from server 1 and server 2 respectively, it performs the query Q.
Q: SELECT * FROM SUPPLY s, SUPPLIER sr WHERE s.SUPPLY_NO = sr.SUPPLY_NO

In distributed databases, query Q can be divided into three parts:
1. SELECT * FROM SUPPLY
2. SELECT * FROM SUPPLIER
3. SELECT * FROM SUPPLY s, SUPPLIER sr WHERE s.SUPPLY_NO = sr.SUPPLY_NO

Queries 1 and 2 select the data from the two source tables. Because each of these queries runs on the remote machine where its table resides, executing them does not itself require data transmission. Query 3 is the join query, which cannot be executed until the data from the remote sites has been transferred to the same site.

IV. OBJECTIVES OF JOIN QUERY IN DISTRIBUTED DATABASES

To perform a distributed join query that accesses data from remote sites, the objectives for the optimization concern:

• Size of transmitted data: This is the amount of data that is to be transmitted. The size of the transmitted data should be small so that less time is required for transmission.
• Transmission speed: This depends on the network speed. In wide area networks, transmission speed has a greater effect on the query.
• Local processing costs: These consist of CPU cost and I/O cost. Local processing costs vary with the processing speed of the machine.

To increase the performance of a join query, these costs should be low and operations should be performed efficiently.

V. FACTORS FOR PARALLEL PROCESSING OF JOIN QUERY

Parallel processing of a join query in a distributed database depends on the following factors:

Time for query execution for requesting the data: the time for the client to send the requests to the servers for the data that is placed at those servers.

Time for transmitting data: The transmission time increases as the quantity of transmitted data increases. It depends on the network speed between client and server. Response time to transmit the data is given by:
Max(size(SUPPLY), size(SUPPLIER))
Because the two transfers run in parallel, the larger relation determines the transmission time.

Time for inserting data: the time taken to insert the data from the servers into the client database. Different insertion methods can be used:
• Row-by-row insertion
• Bulk insertion
Bulk insertion is preferable to row-by-row insertion. Other insertion methods can also be used to optimize the insertion of data.

Time for join execution: the time taken to perform the join query at the client side, joining the tables from server 1 and server 2. Optimization of the join query can be done by using:
• different join orders
• an alternative “where” clause that gives the same result
• different join methods
The join also depends on the local processing cost of the query, such as CPU cost and I/O cost.

To perform a distributed join query that accesses data from remote sites, costs should be low so that the performance of the join query increases. If a machine wants the result of a join over data that is present at different machines, then the transmission costs and the insertion costs are very important. Insertion of the servers' data into the client database takes more time than the transmission of the data. Therefore, the important objective is to improve the performance of a join query that accesses data from two different machines by using insertion methods that take less time.
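The two insertion methods named in Section V can be illustrated with a short sketch. It assumes an in-memory SQLite database standing in for the client database (the paper does not name a DBMS), and the sample SUPPLIER rows are generated for the example. How much faster bulk insertion is depends on the DBMS, the driver, and the storage.

```python
import sqlite3

# Hypothetical SUPPLIER rows as they might arrive from server 2.
rows = [(i, f"name-{i}", f"address-{i}") for i in range(1000)]


def make_client_db() -> sqlite3.Connection:
    """Create an empty client-side SUPPLIER table (SQLite is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SUPPLIER (SUPPLY_NO INTEGER, S_NAME TEXT, S_ADDRESS TEXT)"
    )
    return conn


def insert_row_by_row(conn: sqlite3.Connection, data) -> None:
    """Row-by-row insertion: one statement and one commit per row."""
    for row in data:
        conn.execute("INSERT INTO SUPPLIER VALUES (?, ?, ?)", row)
        conn.commit()


def insert_bulk(conn: sqlite3.Connection, data) -> None:
    """Bulk insertion: the whole batch in a single transaction."""
    with conn:  # commits once when the block succeeds
        conn.executemany("INSERT INTO SUPPLIER VALUES (?, ?, ?)", data)


def row_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM SUPPLIER").fetchone()[0]


slow_db = make_client_db()
insert_row_by_row(slow_db, rows)

fast_db = make_client_db()
insert_bulk(fast_db, rows)

print(row_count(slow_db), row_count(fast_db))
```

Both functions load the same rows; they differ only in how many statements and commits the client database has to process, which is where the insertion cost discussed above comes from.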

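The time components listed in Section V can be combined into a rough response-time estimate. The sketch below is an assumption layered on the paper's list of factors: it treats the four components as additive, assumes both servers share the same bandwidth, and uses invented sizes and timings. The function name `response_time` is hypothetical.

```python
def response_time(request_s, sizes_bytes, bandwidth_bytes_per_s,
                  insert_s, join_s, parallel=True):
    """Estimate total response time for the client-side join.

    With parallel transfers the slowest (largest) transfer dominates,
    mirroring Max(size(SUPPLY), size(SUPPLIER)); otherwise the
    transfers are summed.
    """
    transfers = [size / bandwidth_bytes_per_s for size in sizes_bytes]
    transmit_s = max(transfers) if parallel else sum(transfers)
    return request_s + transmit_s + insert_s + join_s


# Assumed figures for illustration only.
sizes = [8_000_000, 2_000_000]  # SUPPLY, SUPPLIER
bandwidth = 1_000_000           # bytes per second

parallel_total = response_time(0.5, sizes, bandwidth, insert_s=12.0, join_s=1.5)
sequential_total = response_time(0.5, sizes, bandwidth, insert_s=12.0, join_s=1.5,
                                 parallel=False)
print(parallel_total, sequential_total)
```

In this made-up example the insertion term is larger than the transmission term, which is the situation the paper describes when it singles out insertion time as the main target for improvement.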
VI. PROPOSED WORK

The first method for the join query is to transfer the data from the servers to the client, insert the data into the client database, and then produce the results by performing the join query at the client site on the data in its own database. The time for this method equals the sum of the time to fetch the data from the server sites, the time to insert the data into the client database, and the join query processing time.

The second method, which is the proposed work, produces the results by performing the join query at the client site without inserting the data into the client database. The join query in the proposed work takes the data directly from the server sites. The time for the proposed method depends on the time to fetch the data from the server sites and the join query execution time.

Both methods are then compared on the basis of their performance. In the proposed method, the insertion time of data into the client database is eliminated. Therefore, the join query will be optimized in distributed databases.

VII. CONCLUSION

This paper presents join query optimization in distributed databases. One method for the join query is to transfer the data from the servers to the client site, insert the data into the client database, and then perform the join query. The proposed method performs the join query directly at the client site after fetching the data from the server sites, and it does not insert the data into the client database. With the proposed method, the insertion time of data into the client database is eliminated. So, this method will optimize the join query in distributed databases.

REFERENCES

[1] Aljanaby Alaa, Abuelrub Emad, and Odeh Mohammed, “A Survey of Distributed Query Optimization”, The International Arab Journal of Information Technology, Vol. 2, No. 1.
[2] Mullins Craig S., “Distributed Query Optimization”, Technical Support.
[3] Nicoleta Iacob, “Distributed Query Optimization”, PhD Student, University of Piteşti, Issue 4/2010.
[4] Ioannidis Y. E. and Kang Y. C., “Randomized Algorithms for Optimizing Large Join Queries”, in Proceedings of the ACM SIGMOD Conference on Management of Data, Atlantic City, USA, pp. 312-321.
[5] Ghaemi Reza, Amin Milani Fard, Tabatabaee Hamid, and Sadeghizadeh Mahdi, “Evolutionary Query Optimization for Heterogeneous Distributed Database Systems”, World Academy of Science, Engineering and Technology 19.
[6] Mor Jyoti, Kashyap Indu, Rathy R. K., “Analysis of Query Optimization Techniques in Databases”, International Journal of Computer Applications (0975 – 888) Volume 47– No.15.
[7] Jiang Shun, “Optimizing Join Query in Distributed Database”, University of North Carolina, Wilmington.
[8] Valduriez Patrick, “Join Indices”, ACM Transactions on Database Systems, Vol. 12, No. 2.
[9] Ioannidis Y. E. and Kang Y. C., “Left-Deep vs. Bushy Trees: An Analysis of Strategy Spaces and its Implications for Query Optimization”, SIGMOD Conference 1991: 168-177.
[10] Sukheja Deepak, Singh Umesh Kumar (July 2011), “A Novel Approach of Query Optimization for Distributed Database Systems”, IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 4, No 1.

AUTHORS

First Author – Pawandeep Kaur, M.Tech Student, CSE Department, Lovely Professional University, Phagwara, India. email: [email protected]
Second Author – Jaspreet Kaur Sahiwal, Assistant Professor, CSE Department, Lovely Professional University, Phagwara, India. email: [email protected]

www.ijsrp.org
