International Journal of Cloud Applications and Computing
Volume 12 • Issue 1
RDF Query Path Optimization Using
Hybrid Genetic Algorithms:
Semantic Web vs. Data-Intensive Cloud Computing
Qazi Mudassar Ilyas, King Faisal University, Saudi Arabia
https://orcid.org/0000-0003-4238-8093
Muneer Ahmad, University of Malaya, Malaysia
Sonia Rauf, COMSATS University Islamabad, Abbottabad, Pakistan
Danish Irfan, COMSATS University Islamabad, Abbottabad, Pakistan
ABSTRACT
Resource description framework (RDF) inherently supports data mergers from various resources
into a single federated graph that can become very large even for an application of modest size. This
results in severe performance degradation in the execution of RDF queries. As every RDF query
essentially traverses a graph to find the output of the query, an efficient path traversal reduces the
execution time of RDF queries. Hence, query path optimization is required to reduce the execution
time as well as the cost of a query. Query path optimization is an NP-hard problem, for which no
polynomial-time algorithm is known. Genetic algorithms have proven to be very useful in optimization problems.
The authors propose a hybrid genetic algorithm for query path optimization. The proposed algorithm
selects an initial population using iterative improvement, thus reducing the initial solution space
for the genetic algorithm. The proposed algorithm makes significant improvements in the overall
performance. They show that the overall number of joins for complex queries is reduced considerably,
resulting in reduced cost.
Keywords
Cloud Computing, Genetic Algorithm, Information Retrieval, Query Path Optimization, Resource Description
Framework, SPARQL
INTRODUCTION
Cloud computing is a relatively new paradigm that provides extensive services to a wide range of customers
(Ziebell et al., 2019). Data-intensive applications running on the Cloud may benefit from
machine-understandable Semantic Web technologies (Hossain et al., 2019).
Semantic Web technologies have recently gained attention as a source of viable solutions
to Cloud-computing-related problems (Elzein et al., 2018). The machine-understandable representation
of information has opened new paradigms (Olakanmi & Dada, 2019) and has given researchers and
scientists new ways of managing large amounts of data using recent archiving (Ali, 2019), processing,
and execution mechanisms (Bellini et al., 2015; Lee et al., 2013; Silva et al., 2013).
DOI: 10.4018/IJCAC.2022010101
Copyright © 2022, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
The exponential growth of big data has complicated efficient data management (Acosta et al., 2017;
Siow et al., 2017; X. Wang et al., 2015). Current Cloud resources seem insufficient to manage large
data repositories and extract knowledge from them (Herzfeldt et al., 2019). Although companies
provide excellent services for data archiving, application deployment, and fact finding from existing
data, the current explosion of data calls for more robust and resilient solutions for managing Cloud
data (Destefano et al., 2016; Wu et al., 2019). Semantic Web technologies provide one possible
solution to the problem, and several researchers have used these technologies to solve similar
problems (Fang et al., 2016; Niknia & Mirtaheri, 2015; Srinivasulu et al., 2015).
Cyber-physical system concepts have introduced new platforms in the form of the Industry 4.0
revolution (S. Wang et al., 2016). The need to integrate Cloud capabilities with Industry 4.0
standards has been practically realized (AlZu’bi et al., 2020; Bhushan & Gupta, 2019; D. Li et al.,
2019; Tewari & Gupta, 2020; H. Wang et al., 2020). In addition, the influx of big data generated by
industry-related objects poses new challenges for researchers in devising robust query control and
management mechanisms (Liao et al., 2016; Samanthula et al., 2015; Verginadis et al., 2017). The
smart cities concept likewise enables smart devices to generate many queries related to each smart
city notion. At times, queries that demand immediate attention must be prioritized, for instance,
queries related to fire, temperature, explosive detection, and natural disaster alarms. Such
high-priority queries are mixed with low-priority queries generated by other smart city objects
(Liu et al., 2016; Rady et al., 2019; Ye et al., 2018). Since smart devices are controlled through
network resources and managed by Cloud applications, Semantic Web technologies may help in
identifying high-priority queries and separating them from low-priority ones (Kaoutar et al., 2018;
Niknia & Mirtaheri, 2015; Srinivasulu et al., 2015).
Resource Description Framework (RDF) is a simple data model. Its basic building block is a
subject-predicate-object triple, called a statement. A statement consists of a resource, a property,
and a value. Values can be either resources or string literals. These statements can be visualized as
an RDF graph, also called a semantic network. It is a directed graph with labeled nodes and arcs; each
arc runs from the resource (the subject of the statement) to the value (the object of the statement)
and is labeled with the predicate.
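The triple model described above can be sketched in a few lines of Python. This is only an illustration: the `ex:` resource and property names are hypothetical, and `build_graph` and `match` are helper names introduced here, not part of any RDF library.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple.
# The "ex:" names below are hypothetical and serve only as an illustration.
triples = [
    ("ex:PK14", "ex:departs", "0715"),
    ("ex:PK14", "ex:arrives", "1500"),
    ("ex:PK14", "ex:origin", "ex:Islamabad"),
    ("ex:Islamabad", "ex:country", "ex:Pakistan"),
]


def build_graph(statements):
    """Return a directed, labeled graph: subject -> list of (predicate, object)."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return dict(graph)


def match(statements, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in statements
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


graph = build_graph(triples)
departures = match(triples, s="ex:PK14", p="ex:departs")
```

A triple pattern with wildcards, as in `match`, is the basic unit that a query engine joins with other patterns, which is why the order of those joins matters for cost.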
An RDF query is used to retrieve and manipulate data stored in RDF format. It can be visualized
as a tree in which leaf nodes represent inputs (sources) and internal nodes specify relational algebra
operations, enabling a user to specify basic retrieval requests on these sources (Stuckenschmidt et
al., 2005). To retrieve the required data, the nodes can be ordered in numerous ways, all producing
the same result. These alternative orderings are known as query plans or query paths. An RDF query
is solved in the same manner as a query in a relational database, i.e., by dividing the query into
several sub-queries. The results of all sub-queries are then combined with the help of join operators;
such queries are referred to as chain queries. In an RDF data store, bushy and right-deep query trees
are usually preferred (Steinbrunn et al., 1997). In bushy trees, base relations can be joined with the
results of earlier joins. Right-deep trees, which are a subset of bushy trees, require the left-hand
join operands to be base relations, so that one operand of each join is always a base relation.
Figures 1 and 2 show an example depicting concepts (R1, R2, R3, R4, R5, R6, R7) by bushy and
right-deep trees.
Query optimization is the process of identifying the access plan with the minimum cost (the time
taken to get all the answers).
The cost of a query depends on the order in which its sub-paths are joined. An optimal join order
gives the minimum cost by reducing the overall response time.
A query is divided into sub-parts, whose results are then combined with the help of join operators.
Join ordering is an NP-hard problem, so it cannot be solved efficiently with traditional exact
algorithms (Zemánek et al., 2008). Dynamic programming, with a worst-case complexity of $O(2^n)$
(where n represents the number of joins), can give an optimal solution for smaller queries with up to
approximately ten joins, but it becomes infeasible for larger queries.
Figure 1.
Genetic algorithms are a popular technique for solving optimization problems. At an abstract
level, a genetic algorithm comprises formulating chromosomes, defining a fitness function and a
stopping criterion, generating an initial population of solutions, testing the proposed solutions
using the fitness function, producing a new and stronger generation using elitism, crossover, and
mutation, and stopping when a “good enough” solution is found.
We propose a hybrid genetic algorithm for RDF query path optimization. An Iterative Improvement
algorithm is used as a pre-processor to reduce solution space for genetic algorithms. This step helps
the genetic algorithm converge more quickly and optimize an RDF query to a satisfactory level.
The rest of the paper is organized as follows. Section 2 presents some critical schemes proposed
for optimizing the RDF query path. Sections 3 and 4 discuss randomized and evolutionary algorithms,
respectively. The proposed hybrid genetic algorithm is presented in Section 5, and the concluding
remarks are given in Section 6.
RELATED WORK
The world is facing new challenges in managing massive data generated from several sources, e.g.,
industry, medical applications, social media, and other terrestrial objects (Atlam et al., 2018;
Wolfert et al., 2017; Yang et al., 2017). In this era of data explosion (Aouzal et al., 2019), devices
and objects communicate by sharing information. Such sharing of information is generally guided
by the patterns of existing data. For instance, business intelligence stakeholders require a
multidimensional view of information emerging from different data sources (Larson & Chang, 2016;
Ransbotham et al., 2017). Semantic Web capabilities have been used to exploit RDF for optimization
of a query path (Straccia & Straccia, 2019). The literature demonstrates the adoption of various
algorithms for applying RDF in robust solutions for efficient query management (Ristoski et al., 2019).
Figure 2.
Traditional database queries are optimized using dynamic programming (Han et al., 2019),
Iterative Improvement (II) (Myung et al., 2010), Simulated Annealing (SA) (Yong et al., 2020),
Two-Phase Optimization (2PO) (Kumar Yadav & Rizvi, 2020), and genetic algorithms (Ye & Peng,
2018). P. K. Yadav and Rizvi (2018) presented an excellent comparative study of these techniques.
Andrade et al. (2020) proposed a Biased Random-Key Genetic Algorithm hybridized with
path-relinking that uses multiple parents during the mating process. Abdi and Feizi-Derakhshi (2020)
proposed a framework for hybridizing meta-heuristics to improve the performance of optimization
methods, along with its extension to multi-objective problems. A. Hogenboom et al. (2013) proposed
an ant colony optimization algorithm suited to the Semantic Web environment and reported tentative
results on optimizing RDF chain queries on large RDF data sources. M. Wang et al. (2019) proposed
a dynamic programming methodology aimed at selection rate estimation. Zouaghi et al. (2020) aimed
to strengthen relational databases by proposing logical query structures that describe graph-based
execution, defining new statistical methods that better capture the original graph’s dependencies,
and redefining the execution plan based on these logical structures. Peng et al. (2019) proposed a
heuristic to optimize the evaluation of multiple queries, together with a rewriting algorithm with a
bounded approximation ratio. Ge et al. (2019) presented multiple-query optimization in federated
RDF systems and proposed a rewriting-based query approach, along with a method for the
inter-connection topology among SPARQL endpoints.
Meimaris et al. (2017) highlighted the limitations of RDF indexing schemes in answering complex
queries with comparatively smaller datasets and proposed a new indexing scheme (ECS index) that
aims to accelerate the processing of queries with multi-chain-star patterns.
Stuckenschmidt et al. (2005) solved the RDF query path optimization problem using the two-phase
optimization (2PO) algorithm. As its name suggests, 2PO consists of two phases. In the first phase,
the II algorithm, a local search technique, is used. It randomly selects a neighbor (sub-query) and
compares its cost with the current cost. However, since it does not check all its neighbors, it cannot
guarantee an optimal solution. For this reason, the SA algorithm is applied in the second phase. The
results obtained from II are used as the starting state for SA. The SA technique is inspired by the
annealing process in metallurgy, in which crystals are formed through heating followed by slow
cooling of a liquid. The technique always retains solutions that make improvements. However, a
small proportion of solutions that make no improvement is also accepted with a small probability.
This probability increases with the temperature of the system, which decreases with each iteration.
A. Hogenboom et al. (2008) solved the same problem using a genetic algorithm. They considered an
RDF model of the CIA World Factbook, comprising 100,000 statements with data about 250 countries,
generated using Qmap (F. Hogenboom et al., 2008). The authors claim that the genetic algorithm is
more efficient than 2PO for RDF query processing on a single source. Because information retrieval
from a single RDF source is not much more efficient, it must be performed using multiple sources.
Randomized Algorithms
In these algorithms, a random move is initially generated, which is an edge between two nodes. After
that, a random walk is performed along the edges according to specific rules based upon different
algorithms (M. Li & Wang, 2017).
Evolutionary Algorithms
Evolutionary algorithms are based on the biological evolution process, in which only the fittest
solutions are retained from the whole population of individuals (Mirjalili, 2019a). These
algorithms start with a randomly initialized population and generate offspring through random
crossover and mutation. Only the fittest individuals of the population (according to the cost
function) are allowed to reproduce and form the next generation of solutions. The process continues
until no further improvement is observed in the solution space or the required number of generations
is completed. The fittest member of the last population is taken as the solution.
Randomized Algorithms
In randomized algorithms, the potential solution to be generated and tested next is chosen randomly
(according to some rules) from the entire solution space. Three randomized methods are described below.
Iterative Improvement (II)
This technique is a local search in which the neighbor state to be visited is selected
randomly (Swami & Gupta, 1988). Neighbors are defined as the states that can be reached from the
current state in one move. Two types of moves are used:
• Swap: A swap is performed on two randomly selected relations.
• 3Cycle: Three relations (R1, R2, R3) are selected randomly. After applying right rotation, the
result would be (R3, R1, R2).
A slightly modified version of the iterative improvement algorithm proposed by (Dom, 1995)
is given in Algorithm 1.
II is time-consuming, and it does not search all of its neighbors. It keeps track of just one state
(the current state) and retains no memory of past states. Because it examines only a limited number
of neighbors, the solution found by II is not always optimal, and the results depend on the randomly
chosen initial state.
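A minimal Python sketch of iterative improvement with the Swap and 3Cycle moves may help make the procedure concrete. It rests on several assumptions that are not in the paper: a state is a join order (a list of relation names), the relation sizes and the uniform selectivity in `toy_cost` are hypothetical, and the inner loop stops after `max_fail` consecutive non-improving moves instead of using the cost threshold Ct of Algorithm 1.

```python
import random

# Hypothetical relation sizes and a uniform join selectivity (illustration only).
SIZES = {"R1": 100, "R2": 10, "R3": 50, "R4": 5}
SELECTIVITY = 0.1


def toy_cost(order):
    """Sum of intermediate result sizes when joining relations in the given order."""
    size = SIZES[order[0]]
    total = 0.0
    for name in order[1:]:
        size = size * SIZES[name] * SELECTIVITY
        total += size
    return total


def swap(order, rng):
    """Swap move: exchange two randomly selected relations."""
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return new


def three_cycle(order, rng):
    """3Cycle move: right-rotate three randomly selected relations."""
    i, j, k = sorted(rng.sample(range(len(order)), 3))
    new = list(order)
    # (R1, R2, R3) at positions i, j, k becomes (R3, R1, R2).
    new[i], new[j], new[k] = order[k], order[i], order[j]
    return new


def iterative_improvement(starts, cost, max_fail=50, seed=0):
    """Run a local search from each start state and return the best state found."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for start in starts:
        current, current_cost = list(start), cost(start)
        failures = 0
        while failures < max_fail:
            move = rng.choice((swap, three_cycle))
            candidate = move(current, rng)
            candidate_cost = cost(candidate)
            if candidate_cost < current_cost:  # accept only improving moves
                current, current_cost = candidate, candidate_cost
                failures = 0
            else:
                failures += 1
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost


start = ["R1", "R2", "R3", "R4"]
best, best_cost = iterative_improvement([start], toy_cost)
```

Because only improving moves are accepted, the returned cost can never exceed the cost of the start state, but nothing guarantees that it is the global minimum.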
Simulated Annealing
The major drawback of the iterative improvement algorithm is that it accepts only moves that decrease
the cost. In simulated annealing, by contrast, moves that increase the cost are also accepted with a
certain probability, which depends on the increase in cost and on a parameter called temperature. The
exponential form of the probability function is derived from the mathematical model of the annealing
process. The advantage of SA is that it can escape local minima. The drawback is that it depends
on the initialization and reduction of the temperature.
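The acceptance rule can be illustrated with the standard exponential (Metropolis-style) form, exp(-ΔC / T). The paper does not print the formula, so this exact form, and the function names below, are assumptions made for illustration.

```python
import math
import random


def acceptance_probability(delta_cost, temperature):
    """Probability of accepting a move that changes the cost by delta_cost.

    Improving moves (delta_cost <= 0) are always accepted; worsening moves
    are accepted with probability exp(-delta_cost / temperature).
    The temperature is assumed to be strictly positive.
    """
    if delta_cost <= 0:
        return 1.0
    return math.exp(-delta_cost / temperature)


def accept(delta_cost, temperature, rng=random):
    """Draw a random number and decide whether the move is accepted."""
    return rng.random() < acceptance_probability(delta_cost, temperature)
```

For a fixed cost increase, the probability shrinks as the temperature is lowered, which is why the cooling schedule has such a strong effect on the result.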
Algorithm 1. Iterative Improvement (II)
Input: S: Set of states, Cf: Cost function, Ct: Minimum cost threshold
Output: Smin: State with minimum cost
Cmin ← ∞
While (S is not empty)
    R ← A randomly selected element removed from S
    C ← Cf(R)
    While (C > Ct)
        R′ ← State after a random move from R
        C′ ← Cf(R′)
        If (C′ < C)
            R ← R′
            C ← C′
        End if
    End while
    If (C < Cmin)
        Smin ← R
        Cmin ← C
    End if
End while
Return Smin
Two-Phase Optimization Algorithm
Both algorithms discussed above, II and SA, have their own strengths and weaknesses. Building on
the strengths of the two, a hybrid algorithm with two phases, 2PO, was proposed. In the first phase,
Iterative Improvement is run for a short period of time. In the second phase, the resulting solution
is used as the starting state for Simulated Annealing. In the SA pass, a low initial temperature is
used, which limits the search to some extent. According to Ioannidis and Kang (1990), Two-Phase
Optimization outperforms the two original algorithms in both running time and solution quality.
Evolutionary Algorithms
To overcome the drawbacks of randomized algorithms, one of the evolutionary algorithms, i.e., a
genetic algorithm is applied to query optimization.
Genetic Algorithm
The genetic algorithm, which A. Hogenboom et al. (2008) found to be the most efficient of the
algorithms discussed, is an optimization algorithm inspired by the biological evolution process and
based on the principle of survival of the fittest. It starts with a set of solutions (represented
by chromosomes) called a population. The solutions of one population are subjected to evolution,
consisting of selection (where individual chromosomes are chosen according to their fitness),
crossover (creating new offspring by combining selected chromosomes), and mutation (randomly
altering some chromosomes). Because offspring are selected according to their fitness, the new
population is expected to be better than the previous one. The fitter the solutions are, the more
chances they have to reproduce. The evolution process is repeated until either the maximum number
of iterations is reached or no improvement is achieved over several generations. The three basic
steps of a genetic algorithm in the context of RDF query path optimization are discussed below.
Encoding
To represent the individuals (solutions) of a population, genetic algorithms support various encoding
schemes, including binary encoding, value encoding, permutation encoding, and tree encoding
(Goldberg & Holland, 1988). For join optimization, solutions are processing trees, either bushy or
right-deep, which are represented with a simple value encoding scheme as proposed by Zemánek et al.
(2008). Value encoding is not only an efficient representation but also facilitates crossover
operations. In this scheme, a join is applied to two concepts selected from the ordered list of
concepts, and the result of the join is stored at the position of the first of the two concepts.
Consider the ordered list of concepts (R1, R2, R3, R4, R5, R6). Joining the fifth and sixth concepts
results in (R1, R2, R3, R4, R5R6). A join between the first and second concepts then yields
(R1R2, R3, R4, R5R6). Joining the second and third concepts gives (R1R2, R3R4, R5R6), joining the
second and third concepts again gives (R1R2, R3R4R5R6), and a final join of the first and second
concepts produces the desired output (R1R2R3R4R5R6). The encoded values for these joins are
((5,6), (1,2), (2,3), (2,3), (1,2)).
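The decoding of such a chromosome can be traced with a short Python sketch. The function name `decode` and the use of nested tuples to represent join trees are choices made here for illustration; the positions are 1-based, as in the text.

```python
def decode(concepts, joins):
    """Apply 1-based (i, j) join pairs to an ordered list of concepts.

    Each join replaces the concept at position i with the pair
    (concept_i, concept_j) and removes position j, so the result of the
    join is stored at the position of the first concept.
    """
    items = list(concepts)
    for i, j in joins:
        if not 1 <= i < j <= len(items):
            raise ValueError(f"invalid join pair ({i}, {j}) for {len(items)} items")
        items[i - 1] = (items[i - 1], items[j - 1])
        del items[j - 1]
    return items


concepts = ["R1", "R2", "R3", "R4", "R5", "R6"]
chromosome = [(5, 6), (1, 2), (2, 3), (2, 3), (1, 2)]
tree = decode(concepts, chromosome)
```

After all five joins, `tree` holds a single nested tuple that corresponds to the bushy tree (R1R2)((R3R4)(R5R6)). A pair is valid at a given step whenever it refers to two distinct positions that still exist at that step.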
Selection
Genetic algorithms provide numerous selection methods, such as roulette-wheel selection, Boltzmann
selection (Sohn et al., 2013), tournament selection (Xie & Zhang, 2013), rank selection (Valem &
Pedronette, 2019), and steady-state selection (S. L. Yadav & Sohal, 2017). The choice among these
methods depends on the requirements of the fitness function. In RDF query path optimization, the
fitness of a solution depends on its cost. Solutions (query paths) with lower cost are better, and
those with higher cost are worse. The issue is that an individual with a relatively high fitness
value will be selected every time, while those with lower fitness will be neglected. This problem is
known as crowding (Swami & Gupta, 1988). The rank selection method overcomes it by first ranking
the whole population and then assigning each chromosome a fitness value according to its rank. The
solution with the lowest cost is assigned the highest rank, while the solution with the highest cost
is given the lowest rank.
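Rank selection can be sketched as follows. The linear weighting (the worst solution gets weight 1, the best gets weight n) is one common choice and is an assumption here; the paper does not specify how ranks are turned into selection probabilities. The function names are introduced for illustration.

```python
import random


def rank_weights(costs):
    """Assign rank-based fitness: the lowest cost receives the highest rank."""
    n = len(costs)
    # Indices ordered from the highest cost (worst) to the lowest cost (best).
    order = sorted(range(n), key=lambda idx: costs[idx], reverse=True)
    weights = [0] * n
    for rank, idx in enumerate(order, start=1):
        weights[idx] = rank  # worst solution gets 1, best gets n
    return weights


def rank_select(population, costs, k, rng=random):
    """Draw k parents with probability proportional to their rank."""
    return rng.choices(population, weights=rank_weights(costs), k=k)


# Example with three candidate plans and their costs.
plans = ["plan_a", "plan_b", "plan_c"]
costs = [11, 6, 13]
weights = rank_weights(costs)
parents = rank_select(plans, costs, k=2, rng=random.Random(0))
```

Because the weights depend only on the ordering of the costs, a solution that is far cheaper than the rest is still only one rank ahead of the runner-up, which limits how strongly it can dominate selection.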
Crossover and Mutation
After the best parents have been selected using the rank selection method, a crossover operator is
applied. It produces better solutions by recombining parts of reasonably good solutions. The choice
of crossover depends directly on the encoding scheme. Here we have used the subsequence exchange and
subset exchange methods. Mutation is applied for diversity; it brings in new features that are not
present in any member of the population. For details about crossover and mutation techniques related
to our goal, we refer to Husa and Kalkreuth (2018), Kumar et al. (2020), and Mirjalili (2019b).
Table 1 presents a brief comparison of salient existing techniques.
Proposed Solution: Hybrid Genetic Algorithm
We propose a Hybrid Genetic Algorithm, which combines iterative improvement with a genetic
algorithm. In the first step of a conventional genetic algorithm, the population is selected randomly
from the entire solution space. In the proposed solution, the initial population is instead set using
the II algorithm discussed above. The output of II is used as the input to the genetic algorithm, and
the objective function is evaluated. In this case, the objective is to minimize cost, i.e., the
sub-query with minimum cost is selected. Finally, the crossover and mutation operations discussed
above are applied to approach an optimal result.
The main advantage of the proposed algorithm is that the initial solution space for the genetic
algorithm is limited by the execution of Algorithm 2. Because of the limited search space, the
genetic algorithm can arrive at a final solution in less time.
The proposed mechanism is inspired by the iterative procedure of iterative improvement and by the
possibilities of the genetic algorithm in the context of encoding, selection, crossover, and mutation.
The objective function described above favors the sub-queries with minimum cost in order to produce
an optimized query control mechanism. The population size refers to the total number of individuals
in a particular population. In the next section, we present a case study scenario to demonstrate
RDF’s potential in conjunction with genetic algorithm capabilities.
Table 1. Comparison of existing techniques
Technique            Exhaustive search   Scalable   Short running time   Complex queries
II                   NO                  YES        YES                  NO
SA                   NO                  NO         YES                  NO
2PO                  NO                  NO         YES                  NO
Genetic algorithms   NO                  YES        YES                  YES
Algorithm 2. Hybrid Genetic Algorithm – II
Input: G: RDF graph, Q: Set of queries, Qt: Minimum query cost threshold
Output: Qsub: Subset of queries after applying the II algorithm
Qsub ← {}
For each q in Q
    qmin ← II(G, q, Qt)
    Qsub ← Qsub ∪ {qmin}
End for
Return Qsub
Algorithm 3. Hybrid Genetic Algorithm – GA
Input: G: RDF graph, Qsub: Set of queries, C: Minimum acceptable query cost
Output: q: Query with minimum cost
QGA ← GA representation of Qsub
Qcur ← Initial set of queries selected randomly from Qsub
For each q in Qcur
    If Cost(q) ≤ C
        Return q
    Else
        Qel ← Elite solutions from Qcur
        Qco ← Solutions generated by performing crossover
        Qmu ← Mutated solutions
        Qcur ← Qcur ∪ Qel ∪ Qco ∪ Qmu
    End if
End for
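The two stages can be combined in a compact Python sketch. It is a toy illustration under stated assumptions, not the authors' implementation: the relation sizes and the uniform selectivity are hypothetical, the II stage uses single-gene changes as its moves, subsequence exchange is read as swapping the genes at the same positions of two parents, and a fixed number of generations replaces the cost threshold of Algorithm 3. Chromosomes use the pair encoding described in the Encoding section.

```python
import random

# Hypothetical relation sizes and uniform join selectivity (illustration only).
SIZES = {"R1": 100, "R2": 10, "R3": 50, "R4": 5, "R5": 20}
SELECTIVITY = 0.1
NAMES = sorted(SIZES)


def cost(chromosome):
    """Sum of intermediate result sizes for a list of 1-based join pairs."""
    sizes = [SIZES[name] for name in NAMES]
    total = 0.0
    for i, j in chromosome:
        joined = sizes[i - 1] * sizes[j - 1] * SELECTIVITY
        sizes[i - 1] = joined  # result is stored at the first position
        del sizes[j - 1]
        total += joined
    return total


def random_gene(step, rng):
    """Random valid pair for a step; len(NAMES) - step items remain."""
    remaining = len(NAMES) - step
    i, j = sorted(rng.sample(range(1, remaining + 1), 2))
    return (i, j)


def random_chromosome(rng):
    return [random_gene(step, rng) for step in range(len(NAMES) - 1)]


def mutate(chromosome, rng):
    """Replace one gene with a random valid pair for the same step."""
    child = list(chromosome)
    step = rng.randrange(len(child))
    child[step] = random_gene(step, rng)
    return child


def crossover(a, b, rng):
    """Exchange a subsequence; genes stay valid because bounds depend on position."""
    lo, hi = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:lo] + b[lo:hi] + a[hi:]


def improve(chromosome, rng, max_fail=30):
    """Iterative improvement used as a pre-processor for the initial population."""
    current, current_cost = chromosome, cost(chromosome)
    failures = 0
    while failures < max_fail:
        candidate = mutate(current, rng)
        candidate_cost = cost(candidate)
        if candidate_cost < current_cost:
            current, current_cost, failures = candidate, candidate_cost, 0
        else:
            failures += 1
    return current


def hybrid_ga(pop_size=20, generations=30, elite=2, seed=0):
    rng = random.Random(seed)
    # Stage 1: seed the population with locally improved solutions.
    population = [improve(random_chromosome(rng), rng) for _ in range(pop_size)]
    # Stage 2: evolve with elitism, rank selection, crossover, and mutation.
    for _ in range(generations):
        population.sort(key=cost)
        weights = list(range(pop_size, 0, -1))  # best-ranked gets the largest weight
        children = [list(c) for c in population[:elite]]
        while len(children) < pop_size:
            a, b = rng.choices(population, weights=weights, k=2)
            child = crossover(a, b, rng)
            if rng.random() < 0.2:
                child = mutate(child, rng)
            children.append(child)
        population = children
    best = min(population, key=cost)
    return best, cost(best)


best, best_cost = hybrid_ga()
```

Since the elite solutions are copied unchanged into each new generation, the best cost can only stay the same or decrease from the II-seeded starting point.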
CASE STUDY
For an online reservation system, we have considered the website http://www.onlineresrv.com/weeklyschd.
As an example, we consider a single weekly schedule for all flights from Islamabad (Pakistan) to
London Heathrow (United Kingdom). There are three flights, namely PK14, PK15, and PK16. Table 2
presents the information for these three flights, including departure time, arrival time,
availability on days of the week, and number of stopovers. An RDF graph for this case study is given
in Figure 3. This figure depicts the resources and their relationships for flight PK14 only. PK15
and PK16 have similar resources and connections.
Consider a query in which we want to retrieve the departure and arrival time for all flights having
available seats on Monday with no stopovers from Islamabad to London (Heathrow).
This query can be solved in many ways depending on the order of the join operators. The primary
aim of query path optimization is to select the query path that results in minimum cost. Here we
have written the above query in SPARQL (the most popular query language for RDF) in three different
ways to get the same output. We also discuss the cost-effectiveness of Query A, Query B, and Query C
with the help of the results obtained from their join operators.
Table 2. Flight schedule
Flight   Departure   Arrival   Mon   Tue   Wed   Thu   Fri   Sat   Sun   Stops
PK14     0715        1500      Y     N     N     N     Y     N     N     1
PK15     1050        1400      N     N     N     N     Y     N     N     0
PK16     1150        1500      Y     Y     N     Y     N     N     N     0
Figure 3 presents the RDF graph for an online reservation system. We further elaborate it with
the help of different queries below.
Query A
PREFIX Resrv: <http://www.onlineresrv/weeklysched>
SELECT ?Flights ?Depart ?Arrives
WHERE
{
  ?Flights Resrv:FNO ?F_Num .
  ?F_Num Resrv:Mon ?Monday .
  ?Monday Resrv:Seat "Yes" .
  ?Monday Resrv:Stops "0" .
  ?Monday Resrv:Depart ?Depart .
  ?Monday Resrv:Arrives ?Arrives
}
Figure 3.
Query B
PREFIX Resrv: <http://www.onlineresrv/weeklysched>
SELECT ?Flights ?Depart ?Arrives
WHERE
{
  ?Monday Resrv:Stops "0" .
  ?Monday Resrv:Seat "Yes" .
  ?Monday Resrv:Depart ?Depart .
  ?Monday Resrv:Arrives ?Arrives .
  ?Monday Resrv:Mon ?Flights
}
Query C
PREFIX Resrv: <http://www.onlineresrv/weeklysched>
SELECT ?Flights ?Depart ?Arrives
WHERE
{
  {
    ?Flights Resrv:FNO ?PK14 .
    ?PK14 Resrv:Mon ?Monday .
    ?Monday Resrv:Seat "Yes" .
    ?Monday Resrv:Stops "0" .
    ?Monday Resrv:Depart ?Depart .
    ?Monday Resrv:Arrives ?Arrives
  }
  UNION
  {
    ?Flights Resrv:FNO ?PK15 .
    ?PK15 Resrv:Mon ?Monday .
    ?Monday Resrv:Seat "Yes" .
    ?Monday Resrv:Stops "0" .
    ?Monday Resrv:Depart ?Depart .
    ?Monday Resrv:Arrives ?Arrives
  }
  UNION
  {
    ?Flights Resrv:FNO ?PK16 .
    ?PK16 Resrv:Mon ?Monday .
    ?Monday Resrv:Seat "Yes" .
    ?Monday Resrv:Stops "0" .
    ?Monday Resrv:Depart ?Depart .
    ?Monday Resrv:Arrives ?Arrives
  }
}
In Query A, the join operator is first applied to the first two concepts, ?Flights and ?F_Num,
which gives three resources, namely PK14, PK15, and PK16. Each of these flights is then joined with
the concept ?Monday, and again three resources are selected. Next, seat availability for Monday is
checked, and two results are returned. After the join with stops, a single output is returned.
Finally, joins are taken for the departure and arrival times. The sum of the results produced by all
join operators is 11. In Query B, the join of the last two concepts is taken first, i.e., ?Monday
with Stops "0", which produces two results. These two results are joined to check the availability
of seats, giving a single output. The departure and arrival times of that available seat are then
selected.
Finally, a join is taken to select the corresponding flight. The sum over the join operators for
this query is 6. In Query C, three subqueries are combined using the UNION operator. The first
subquery, for flight PK14, returns 3 results. The subquery for PK15 returns 2 results, and the one
for PK16 returns 6 results. Two additional operations are required for the UNION, so the sum is 13.
We conclude that Query B has the least cost, i.e., 6, and is the most efficient. This shows that the
order of the join operators directly affects the cost and execution time of a query.
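The cost comparison reduces to simple sums, which the following Python snippet reproduces. The text gives the counts 3, 3, 2, and 1 for Query A and 2 and 1 for Query B explicitly; the remaining per-join counts of 1 (for the departure, arrival, and flight joins) are inferred here so that the totals match the stated values of 11 and 6. The function name `plan_cost` is introduced for illustration.

```python
def plan_cost(intermediate_results, extra_operations=0):
    """Cost of a plan: sum of the join result counts plus any extra operations."""
    return sum(intermediate_results) + extra_operations


plans = {
    # flights, Monday, seat, stops, departure, arrival
    "Query A": plan_cost([3, 3, 2, 1, 1, 1]),
    # stops, seat, departure, arrival, flight
    "Query B": plan_cost([2, 1, 1, 1, 1]),
    # PK14, PK15, PK16 subqueries plus two UNION operations
    "Query C": plan_cost([3, 2, 6], extra_operations=2),
}

cheapest = min(plans, key=plans.get)
```

Applying the most selective pattern first, as Query B does with Stops "0", keeps every later intermediate result small, which is where the saving comes from.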
CONCLUSION
This paper has identified the strengths and weaknesses of existing approaches for RDF query path
optimization. We propose a hybrid genetic algorithm that combines iterative improvement and genetic
algorithms and reduces the number of joins significantly for complex queries in a cloud computing
environment, resulting in improved performance. The improvement is significant for small queries.
As the web is distributed, and many practical applications will require a considerable number of queries
to be executed, the overall improvement for most scenarios would be manifold. We have considered a
relatively small query to explain the concept. A full-scale implementation of the proposed algorithm is
required to test the proposed approach in a real-life scenario and identify its strengths and limitations.
Conflict of Interest
The authors state that they don’t have any conflict of interest in this research.
REFERENCES
Abdi, Y., & Feizi-Derakhshi, M. R. (2020). Hybrid multi-objective evolutionary algorithm based on Search
Manager framework for big data optimization problems. Applied Soft Computing, 87, 105991. Advance online
publication. doi:10.1016/j.asoc.2019.105991
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. (2017). Enhancing answer completeness of SPARQL
queries via crowdsourcing. Journal of Web Semantics, 45, 41–62. Advance online publication. doi:10.1016/j.
websem.2017.07.001
Ali, M. B. (2019). Multiple Perspective of Cloud Computing Adoption Determinants in Higher Education a
Systematic Review. International Journal of Cloud Applications and Computing, 9(3), 89–109. Advance online
publication. doi:10.4018/IJCAC.2019070106
AlZu’bi, S., Shehab, M., Al-Ayyoub, M., Jararweh, Y., & Gupta, B. (2020). Parallel implementation for 3D
medical volume fuzzy segmentation. Pattern Recognition Letters, 130, 312–318. Advance online publication.
doi:10.1016/j.patrec.2018.07.026
Andrade, C. E., Toso, R. F., Gonçalves, J. F., & Resende, M. G. C. (2020). The Multi-Parent Biased Random-
Key Genetic Algorithm with Implicit Path-Relinking and its real-world applications. European Journal of
Operational Research. Advance online publication. doi:10.1016/j.ejor.2019.11.037
Aouzal, K., Hafiddi, H., & Dahchour, M. (2019). Policy-Driven Middleware for Multi-Tenant SaaS Services
Configuration. International Journal of Cloud Applications and Computing, 9(4), 86–106. Advance online
publication. doi:10.4018/IJCAC.2019100105
Atlam, H. F., Walters, R. J., & Wills, G. B. (2018). Fog computing and the internet of things: A review. In Big
Data and Cognitive Computing. doi:10.3390/bdcc2020010
Bellini, P., Bruno, I., Nesi, P., & Rauch, N. (2015). Graph databases methodology and tool supporting index/
store versioning. Journal of Visual Languages and Computing, 31, 222–229. Advance online publication.
doi:10.1016/j.jvlc.2015.10.018
Bhushan, K., & Gupta, B. B. (2019). Distributed denial of service (DDoS) attack mitigation in software defined
network (SDN)-based cloud computing environment. Journal of Ambient Intelligence and Humanized Computing,
10(5), 1985–1997. Advance online publication. doi:10.1007/s12652-018-0800-9
Destefano, R. J., Tao, L., & Gai, K. (2016). Improving Data Governance in Large Organizations through Ontology
and Linked Data. Proceedings - 3rd IEEE International Conference on Cyber Security and Cloud Computing,
CSCloud 2016 and 2nd IEEE International Conference of Scalable and Smart Cloud, SSC 2016.
doi:10.1109/CSCloud.2016.47
Dom, J. (1995). Iterative improvement methods for knowledge-based scheduling. AI Communications, 8(1),
20–34. doi:10.3233/AIC-1995-8102
Elzein, N. M., Majid, M. A., Hashem, I. A. T., Yaqoob, I., Alaba, F. A., & Imran, M. (2018). Managing big RDF
data in clouds: Challenges, opportunities, and solutions. Sustainable Cities and Society, 39, 375–386. Advance
online publication. doi:10.1016/j.scs.2018.02.019
Fang, Y., Jiaming, Z., Yaohui, L., & Mei, G. (2016). Semantic description and link construction of smart tourism
linked data based on big data. Proceedings of 2016 IEEE International Conference on Cloud Computing and
Big Data Analysis, ICCCBDA 2016. doi:10.1109/ICCCBDA.2016.7529530
Ge, Q., Peng, P., Xu, Z., Zou, L., & Qin, Z. (2019). FMQO: A federated RDF system supporting multi-query
optimization. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence
and Lecture Notes in Bioinformatics), 11642 LNCS, 397–401. doi:10.1007/978-3-030-26075-0_30
Goldberg, D. E., & Holland, J. H. (1988). Genetic algorithms and machine learning. Academic Press.
Han, M., Kim, H., Gu, G., Park, K., & Han, W. S. (2019). Efficient subgraph matching: Harmonizing dynamic
programming, adaptive matching order, and failing set together. Proceedings of the ACM SIGMOD International
Conference on Management of Data. doi:10.1145/3299869.3319880
Herzfeldt, A., Floerecke, S., Ertl, C., & Krcmar, H. (2019). Examining the Antecedents of Cloud Service
Profitability. International Journal of Cloud Applications and Computing, 9(4), 37–65. Advance online
publication. doi:10.4018/IJCAC.2019100103
Hogenboom, A., Frasincar, F., & Kaymak, U. (2013). Ant colony optimization for RDF chain queries for decision
support. Expert Systems with Applications, 40(5), 1555–1563. Advance online publication.
doi:10.1016/j.eswa.2012.08.074
Hogenboom, A., Milea, V., Frasincar, F., & Kaymak, U. (2008). Genetic algorithms for RDF query path
optimization. CEUR Workshop Proceedings, 419(May).
Hogenboom, F., Hogenboom, A., Van Gelder, R., Milea, V., Frasincar, F., & Kaymak, U. (2008). QMap: An
RDF-based Queryable World Map. Third International Conference on Knowledge Management in Organizations
(KMO 2008), 99–110.
Hossain, K., Rahman, M., & Roy, S. (2019). IoT Data Compression and Optimization Techniques in Cloud
Storage: Current Prospects and Future Directions. International Journal of Cloud Applications and Computing,
9(2), 43–59. Advance online publication. doi:10.4018/IJCAC.2019040103
Husa, J., & Kalkreuth, R. (2018). A comparative study on crossover in cartesian genetic programming. Lecture
Notes in Computer Science. Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics. doi:10.1007/978-3-319-77553-1_13
Ioannidis, Y. E., & Kang, Y. (1990). Randomized algorithms for optimizing large join queries. SIGMOD Record,
19(2), 312–321. doi:10.1145/93605.98740
Kaoutar, L., Ghadi, A., & Kudagba, F. K. (2018). Big data: Methods, prospects, techniques. Lecture Notes in
Networks and Systems. doi:10.1007/978-3-319-74500-8_28
Kumar, M., Husain, M., Upreti, N., & Gupta, D. (2020). Genetic Algorithm: Review and Application. SSRN
Electronic Journal. doi:10.2139/ssrn.3529843
Kumar Yadav, P., & Rizvi, S. (2020). Analysis of Two Phase Query Optimization Algorithm for Generating
Optimal Query Plan using Randomized Algorithm. SSRN Electronic Journal. doi:10.2139/ssrn.3579179
Larson, D., & Chang, V. (2016). A review and future direction of agile, business intelligence, analytics and
data science. International Journal of Information Management, 36(5), 700–710. Advance online publication.
doi:10.1016/j.ijinfomgt.2016.04.013
Lee, K., Liu, L., Tang, Y., Zhang, Q., & Zhou, Y. (2013). Efficient and customizable data partitioning framework
for distributed big RDF data processing in the Cloud. IEEE International Conference on Cloud Computing.
doi:10.1109/CLOUD.2013.63
Li, D., Deng, L., Bhooshan Gupta, B., Wang, H., & Choi, C. (2019). A novel CNN based security guaranteed
image watermarking generation scenario for smart city applications. Information Sciences, 479, 432–447.
Advance online publication. doi:10.1016/j.ins.2018.02.060
Li, M., & Wang, D. (2017). Insights into randomized algorithms for neural networks: Practical issues and common
pitfalls. Information Sciences, 382-383, 170–178. Advance online publication. doi:10.1016/j.ins.2016.12.007
Liao, Y. T., Zhou, J., Lu, C. H., Chen, S. C., Hsu, C. H., Chen, W., Jiang, M. F., & Chung, Y. C. (2016). Data
adapter for querying and transformation between SQL and NoSQL database. Future Generation Computer
Systems, 65, 111–121. Advance online publication. doi:10.1016/j.future.2016.02.002
Liu, J. K., Au, M. H., Huang, X., Lu, R., & Li, J. (2016). Fine-grained two-factor access control for web-based
cloud computing services. IEEE Transactions on Information Forensics and Security, 11(3), 484–497. Advance
online publication. doi:10.1109/TIFS.2015.2493983
Meimaris, M., Papastefanatos, G., Mamoulis, N., & Anagnostopoulos, I. (2017). Extended characteristic sets:
Graph indexing for SPARQL Query Optimization. Proceedings - International Conference on Data Engineering,
497–508. doi:10.1109/ICDE.2017.106
Mirjalili, S. (2019a). Evolutionary Algorithms and Neural Networks. Soft Computing and Intelligent Systems.
doi:10.1007/978-3-319-93025-1
Mirjalili, S. (2019b). Genetic algorithm. Studies in Computational Intelligence. doi:10.1007/978-3-319-93025-1_4
Myung, J., Yeon, J., & Lee, S. G. (2010). SPARQL basic graph pattern processing with iterative MapReduce.
ACM International Conference Proceeding Series. doi:10.1145/1779599.1779605
Niknia, M., & Mirtaheri, S. L. (2015). Mapping a decade of linked data progress through co-word analysis.
Webology.
Olakanmi, O. O., & Dada, A. (2019). An Efficient Privacy-preserving Approach for Secure Verifiable Outsourced
Computing on Untrusted Platforms. International Journal of Cloud Applications and Computing, 9(2), 79–98.
Advance online publication. doi:10.4018/IJCAC.2019040105
Peng, P., Ge, Q., Zou, L., Ozsu, M. T., Xu, Z., & Zhao, D. (2019). Optimizing Multi-Query Evaluation in
Federated RDF Systems. IEEE Transactions on Knowledge and Data Engineering, 14(8), 1.
doi:10.1109/TKDE.2019.2947050
Rady, M., Abdelkader, T., & Ismail, R. (2019). Integrity and Confidentiality in Cloud Outsourced Data. Ain
Shams Engineering Journal. doi:10.1016/j.asej.2019.03.002
Ransbotham, S., Kiron, D., Gerbert, P., & Reeves, M. (2017). Reshaping Business With Artificial Intelligence.
MIT Sloan Management Review and The Boston Consulting Group.
Ristoski, P., Rosati, J., Di Noia, T., De Leone, R., & Paulheim, H. (2019). RDF2Vec: RDF graph embeddings
and their applications. Semantic Web, 10(4), 721–752. Advance online publication. doi:10.3233/SW-180317
Samanthula, B. K., Elmehdwi, Y., Howser, G., & Madria, S. (2015). A secure data sharing and query processing
framework via federation of cloud computing. Information Systems, 48, 196–212. Advance online publication.
doi:10.1016/j.is.2013.08.004
Silva, T., Wuwongse, V., & Sharma, H. N. (2013). Disaster mitigation and preparedness using linked open
data. Journal of Ambient Intelligence and Humanized Computing, 4(5), 591–602. Advance online publication.
doi:10.1007/s12652-012-0128-9
Siow, E., Tiropanis, T., & Hall, W. (2017). Eywa: An interoperable fog computing infrastructure with RDF stream
processing. Lecture Notes in Computer Science. Including Subseries Lecture Notes in Artificial Intelligence
and Lecture Notes in Bioinformatics. doi:10.1007/978-3-319-70284-1_20
Sohn, K., Zhou, G., Lee, C., & Lee, H. (2013). Learning and selecting features jointly with point-wise gated
Boltzmann machines. 30th International Conference on Machine Learning, ICML 2013.
Srinivasulu, S., Sakthivel, P., & Ramya, V. (2015). A novel semantic cloud computing interoperability model
for platform as a service system. International Review on Computers and Software. doi:10.15866/irecos.v10i4.5669
Steinbrunn, M., Moerkotte, G., & Kemper, A. (1997). Heuristic and randomized optimization for the join
ordering problem. The VLDB Journal, 6(3), 191–208. Advance online publication. doi:10.1007/s007780050040
Straccia, U., & Straccia, U. (2019). Web Ontology Language OWL. Foundations of Fuzzy Logic and Semantic
Web Languages. doi:10.1201/b15460-5
Stuckenschmidt, H., Vdovjak, R., Broekstra, J., & Houben, G. J. (2005). Towards distributed processing of
RDF path queries. International Journal of Web Engineering and Technology, 2(2/3), 207. Advance online
publication. doi:10.1504/IJWET.2005.008484
Swami, A., & Gupta, A. (1988). Optimization of large join queries. Proceedings of the ACM SIGMOD
International Conference on Management of Data, 1988-June, 8–17. doi:10.1145/50202.50203
Tewari, A., & Gupta, B. B. (2020). Security, privacy and trust of different layers in Internet-of-Things (IoTs)
framework. Future Generation Computer Systems, 108, 909–920. Advance online publication.
doi:10.1016/j.future.2018.04.027
Valem, L. P., & Pedronette, D. C. G. (2019). An unsupervised genetic algorithm framework for rank selection and
fusion on image retrieval. ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia
Retrieval. doi:10.1145/3323873.3325022
Verginadis, Y., Michalas, A., Gouvas, P., Schiefer, G., Hübsch, G., & Paraskakis, I. (2017). PaaSword: A Holistic
Data Privacy and Security by Design Framework for Cloud Services. Journal of Grid Computing, 15(2), 219–234.
Advance online publication. doi:10.1007/s10723-017-9394-2
Wang, H., Li, Z., Li, Y., Gupta, B. B., & Choi, C. (2020). Visual saliency guided complex image retrieval. Pattern
Recognition Letters, 130, 64–72. Advance online publication. doi:10.1016/j.patrec.2018.08.010
Wang, M., Fu, H., & Xu, F. (2019). RDF multi-query optimization algorithm for query rewriting using common
subgraphs. ACM International Conference Proceeding Series. doi:10.1145/3331453.3361278
Wang, S., Wan, J., Zhang, D., Li, D., & Zhang, C. (2016). Towards smart factory for industry 4.0: A self-organized
multi-agent system with big data based feedback and coordination. Computer Networks, 101, 158–168. Advance
online publication. doi:10.1016/j.comnet.2015.12.017
Wang, X., Yang, T., Chen, J., He, L., & Du, X. (2015). RDF partitioning for scalable SPARQL query processing.
Frontiers of Computer Science, 9(6), 919–933. Advance online publication. doi:10.1007/s11704-015-4104-3
Wolfert, S., Ge, L., Verdouw, C., & Bogaardt, M. J. (2017). Big Data in Smart Farming – A review. Agricultural
Systems. doi:10.1016/j.agsy.2017.01.023
Wu, C., Horiuchi, S., & Tayama, K. (2019). A resource design framework to realize intent-based cloud
management. Proceedings of the International Conference on Cloud Computing Technology and Science,
CloudCom. doi:10.1109/CloudCom.2019.00018
Xie, H., & Zhang, M. (2013). Parent selection pressure auto-tuning for tournament selection in genetic
programming. IEEE Transactions on Evolutionary Computation, 17(1), 1–19. Advance online publication.
doi:10.1109/TEVC.2011.2182652
Yadav, P. K., & Rizvi, S. (2018). Query optimization: Issues and challenges in mining of distributed data.
Advances in Intelligent Systems and Computing. doi:10.1007/978-981-10-6620-7_67
Yadav, S. L., & Sohal, A. (2017). Comparative Study of Different Selection Techniques in Genetic Algorithm.
International Journal of Engineering, Science and Mathematics.
Yang, C., Yu, M., Hu, F., Jiang, Y., & Li, Y. (2017). Utilizing Cloud Computing to address big geospatial data
challenges. Computers, Environment and Urban Systems, 61, 120–128. Advance online publication.
doi:10.1016/j.compenvurbsys.2016.10.010
Ye, S., & Peng, Y. (2018). An Optimization for Distributed Database Multi-join Query Based on Improved Genetic
Algorithm. DEStech Transactions on Computer Science and Engineering. doi:10.12783/dtcse/iceiti2017/18832
Ye, Y., Hu, T., Zhang, C., & Luo, W. (2018). Design and development of a CNC machining process knowledge
base using cloud technology. International Journal of Advanced Manufacturing Technology, 94(9-12), 3413–3425.
Advance online publication. doi:10.1007/s00170-016-9338-1
Yong, W., Zhou, J., Jahed Armaghani, D., Tahir, M. M., Tarinejad, R., Pham, B. T., & Van Huynh, V. (2020). A
new hybrid simulated annealing-based genetic programming technique to predict the ultimate bearing capacity
of piles. Engineering with Computers. Advance online publication. doi:10.1007/s00366-019-00932-9
Zemánek, J., Schenk, S., & Svátek, V. (2008). Optimizing SPARQL queries over disparate RDF data sources
through distributed semi-joins. CEUR Workshop Proceedings.
Ziebell, R.-C., Albors-Garrigos, J., Schoeneberg, K.-P., & Marin, M. R. P. (2019). Adoption and Success of
e-HRM in a Cloud Computing Environment. International Journal of Cloud Applications and Computing, 9(2),
1–27. Advance online publication. doi:10.4018/IJCAC.2019040101
Zouaghi, I., Mesmoudi, A., Galicia, J., Bellatreche, L., & Aguili, T. (2020). Query optimization for large scale
clustered RDF data. CEUR Workshop Proceedings, 2572, 56–65.