Complex Matrices For The Approximate Evaluation of Probabilistic Queries
ABSTRACT
This paper studies the evaluation of probabilistic SPARQL queries. The evaluation of such queries is, in
principle, conceptually simple and straightforward. However, it incurs a time cost that must be taken into
account. It is thus expedient to devise methods that lower this cost. We propose and explain such a method, which
makes use of complex matrices. The idea of resorting to the complex numbers is inspired by the rapidly
expanding field of unconventional computing and, in particular, by quantum computing. This novel proposal
simplifies and speeds up the calculation of products of probabilities. Therefore, it is particularly promising in those
cases where the same answer set can be obtained either by employing exact computations or by employing
suitable approximations, as is indeed the case in many probabilistic queries.
Keywords - RDF graph, SPARQL query, probabilistic SPARQL query, complex matrices
Date of Submission: 28-10-2020 Date of Acceptance: 09-11-2020
I. INTRODUCTION
During the last two decades numerous researchers have focused their attention on all things related to the Web. This tremendous effort has produced state-of-the-art technologies and has literally embedded the World Wide Web into everyone’s life. The resounding success of the whole endeavor can be, at least in part, attributed to the adherence to well-designed standards. Linked Open Data [1] provides guidelines regarding the storage and communication of data on the Web. The Resource Description Framework (RDF) is the most prominent standard governing the storage of information, whereas SPARQL deals with retrieving this information, i.e., querying the data.
RDF promotes the use of directed graphs [2] as convenient and effective structures for keeping information. The data contained in the RDF graph obey an explicit syntactic pattern: subject predicate object. The idea behind this is simple and functional: the predicate relates the subject with the object. The rules by which one can query such datasets are stipulated by SPARQL [3]. This work is concerned with probabilistic Regular Path Queries, which have the potential to specify a path of adjacent nodes in the underlying graph. This is achieved through transitive predicates. To see how this works, let us start by picturing predicates as labels on the directed edges of the RDF graph. In this setting, a predicate R is defined to be transitive if two triples (x, R, y) and (y, R, z) lead to the inference of the triple (x, R, z). Hence, the existence of transitive predicates makes possible the formation of paths resembling those typically found in standard directed graphs. The paths start from an initial node, follow adjacent directed edges labeled by R, and finally terminate at some other node. It is implicitly assumed that such a path expresses some meaningful property and that this property can be formulated by SPARQL.
Let us now shift the emphasis to the fact that in many situations the data contained in the dataset are approximations, estimations, or beliefs, i.e., they lack certainty. The world is uncertain and, in many respects, probabilistic. Therefore, the data stored in an RDF graph may not be certainly correct, but only probably correct. In domains where uncertainty is prevalent, we may adopt a new perspective, namely that the triples are assigned a nonnegative real value that indicates the probability of being accurate. It is expedient to assume that this number takes values in the interval [0, 1], as expected from a probability measure. Alternatively, one may well view this number as a weight or even as a degree of certainty. There are many domains of practical importance where probability and uncertainty arise. One important example worth mentioning is biological data, where connections among biological concepts have a probabilistic nature and where, additionally, the links themselves are statistically independent [4]. It is, thus, evident that tools and techniques must be developed to address the issue of uncertainty.
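The transitive inference rule and the probabilistic reading of triples described in this section can be illustrated with a minimal Python sketch. The triples, node names, and probabilities below are hypothetical and are not taken from the paper. The sketch assumes that links are statistically independent, so that the probability of an inferred triple is the product of the probabilities along the path; keeping the most probable path when several paths yield the same triple is a further simplifying assumption.

```python
# Hypothetical probabilistic triples: (subject, predicate, object) -> probability.
triples = {
    ("a", "R", "b"): 0.9,
    ("b", "R", "c"): 0.8,
}


def infer_once(triples, predicate):
    """Apply (x, R, y), (y, R, z) => (x, R, z) once for a transitive predicate R."""
    inferred = {}
    for (x, p1, y), prob1 in triples.items():
        for (y2, p2, z), prob2 in triples.items():
            if p1 == p2 == predicate and y == y2:
                # Assumption: independent links, so the probabilities multiply.
                candidate = prob1 * prob2
                key = (x, predicate, z)
                # Assumption: if several paths yield the same triple,
                # keep the most probable one.
                inferred[key] = max(inferred.get(key, 0.0), candidate)
    return inferred


inferred = infer_once(triples, "R")
print(inferred)
```

Each inferred triple costs one multiplication of real numbers, which is the operation the method proposed in this paper aims to avoid.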
II. MOTIVATION
In this section, we shall present an extensive example to motivate the reader and explain the idea behind this work.
Example 1. Suppose we are given the RDF graph shown in Figure 1a. The predicate R is a transitive predicate and the numbers are probabilities.
The matrix represents all the information contained in the RDF graph about predicate R and is a representation of obj1. The element of stores the probability value for the triplet . A value of zero indicates the absence of such a triplet. In this matrix-vector representation we can compute the products , and , which will give us the following results.
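As a rough illustration of the conventional matrix-vector computation, the following Python sketch uses a hypothetical three-node graph; its values are not those of Figure 1a. It assumes that entry A[i][j] holds the probability of the triple (node i, R, node j), with zero meaning the triple is absent, and that an object is represented by an indicator vector. Both conventions are assumptions made for this sketch.

```python
# Hypothetical graph n0 -R-> n1 -R-> n2; the probabilities are illustrative only.
A = [
    [0.0, 0.9, 0.0],
    [0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0],
]

# Assumed representation of the object n2 as an indicator vector.
obj = [0.0, 0.0, 1.0]


def mat_vec(matrix, vector):
    """Conventional real matrix-vector product (one multiplication per entry)."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


one_step = mat_vec(A, obj)        # subjects reaching n2 through an R-path of length 1
two_steps = mat_vec(A, one_step)  # subjects reaching n2 through an R-path of length 2
print(one_step, two_steps)
```

In this small graph each entry corresponds to a single path, so every nonzero value is a plain product of probabilities. With several paths between the same pair of nodes the entries would be sums of such products and would need a more careful interpretation.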
reasonable approach. Of course, the correction coefficient must be chosen carefully. Due to the non-linearity of the trigonometric functions, it is impossible to use one correction coefficient for the entire interval [0, 1]. The correction coefficient suitable for a specific subinterval must be validated experimentally. Table 1 below shows that to get a good approximation for the probability subinterval [0.8, 0.9] we may use the formula with . This formula gives for the approximate value 0.77, which is close to the accurate value 0.81 (a relative error of about 5%). ▲
Table 1. The coefficient for the interval [0.8, 0.9].
III. MAIN RESULTS
In this section, we shall present the ideas outlined in Example 1 in a general and formal way.
Definition 1. Let G be an RDF graph fragment containing a transitive predicate R, and let , be an arbitrary enumeration of its nodes. We associate to G the matrix whose elements are defined as follows:
Definition 2. The cut-point is a positive real number, understood to express radians, such that any complex number of the form with , is simply replaced by (zero).
Remark 1. The choice of the cut-point is of great importance because it has the potential to significantly facilitate computations by providing a strict criterion for consistently omitting very small probabilities. This policy has an immediate and concrete practical impact on the measurable speed-up of the evaluation of probabilistic SPARQL queries. Obviously, datasets of different nature, or increased accuracy requirements, must be taken into account when determining the cut-point. An aggressive approach for increased performance could be to set the cut-point , corresponding to probability . Such an approach might also modify the original complex matrix . This is due to the fact that in the above Definition 1, the case that the triple does not exist includes the case where the probability attributed to a triple is below the adopted cut-point. Then, even when the matrix coefficient is the nonzero complex number (expressing the initial degree of confidence about the corresponding triple), if it happens that , then must be replaced by .
In the process of answering queries, starting from the known given matrix , it is inevitable that some matrix power will be computed. The coefficients of have the obvious interpretation. If , then with approximate probability there exists an inferred triple through an R-path of length or, equivalently, with approximate probability there exists an R-path of length from node to node . If , then there exists no inferred triple through an R-path of length or, equivalently, there exists no R-path of length from node to node . Remark 1 also applies in this case, i.e., if it happens that , where is the cut-point, then must be replaced by , something that will add one more zero to the matrix . Alternatively, it is worth considering the scenario where , which means that represents a probability too large to be ignored. In such a case, it is expedient to address the relative error. We caution the reader here that there is no error in the coefficients of , since they express explicit facts taken from the RDF graph itself. The errors occur when we compute the powers , where . As we have explained in detail in Example 1, the following very simple formula (1) can be used for this purpose.
It is more accurate to approximate the probability that corresponds to the triple by formula (1). The complex number contains the information regarding , so it only remains to estimate the correction coefficient . The non-linearity of the and functions precludes the possibility of one correction coefficient for all cases. Instead, the probability interval [0, 1] must be partitioned into smaller subintervals. This can be achieved through extensive experimental tests. Tables 2a-2d give the suggested correction coefficients for the subintervals , , , and , respectively. The Tables also contain indicative examples of probability values within the respective interval, along with the value of before the correction of formula (1) is applied.
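The role of angle addition, of the cut-point, and of the approximation error can be made concrete with a small Python sketch. The only fact used with certainty is that multiplying two unit complex numbers adds their angles. Everything else is an assumption made for illustration: the encoding of a probability p as the unit complex number with angle arccos(p), the read-back through the cosine, the direction of the cut-point test (large angles, i.e., small probabilities, are zeroed), and the function names. The sketch does not implement formula (1) or any correction coefficient.

```python
import cmath
import math


def encode(p):
    """Assumed encoding: probability p in [0, 1] -> unit complex number with angle acos(p)."""
    return cmath.exp(1j * math.acos(p))


def decode(z):
    """Assumed read-back: zero stays zero, otherwise take the cosine of the angle."""
    return 0.0 if z == 0 else math.cos(cmath.phase(z))


def apply_cut_point(z, cut_point):
    """Assumed rule: replace z by zero when its angle (in radians) exceeds the cut-point."""
    if z == 0:
        return 0j
    return 0j if cmath.phase(z) > cut_point else z


z1, z2 = encode(0.9), encode(0.8)
product = z1 * z2

# Multiplying unit complex numbers adds their angles.
angle_sum = math.acos(0.9) + math.acos(0.8)
print(cmath.phase(product), angle_sum)

# Uncorrected read-back versus the exact product of probabilities.
print(decode(product), 0.9 * 0.8)

# With a hypothetical cut-point of 1.0 radians the product is dropped.
print(apply_cut_point(product, 1.0))
```

Under these assumptions the uncorrected value underestimates the exact product, which is the kind of discrepancy that a correction coefficient chosen per probability subinterval is meant to reduce.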
Table 2b. The coefficient for the interval [0.5, 0.6].
Table 2c. The coefficient for the interval [0.4, 0.5].
Table 2d. The coefficient for the interval [0.2, 0.4].
The biggest benefit of the proposed approach is the fact that the computation of the matrix powers involves only the operation of addition (of angles) and forgoes the operation of multiplication of real numbers that the conventional approach would entail. Using additions instead of multiplications is always preferable, as the compound computational cost scales down significantly with the size of the problem at hand. Hence, this method is undoubtedly better than the conventional one. Of course, there is a trade-off, which in this case takes the form of arithmetic errors. There are two ways to remedy this situation. One is the systematic use of a cut-point, a well-established technique in probabilistic scenarios, that can safely and rapidly dismiss inferred facts of small
IV. CONCLUSION
This paper advocates the evaluation of probabilistic SPARQL queries via unconventional means, namely the use of complex matrices. The extensive example we have presented in Section II describes in detail the proposed methodology. This novel approach has an unquestionable advantage, and that is that the computation proceeds via additions instead of multiplications. Thus, the potential to scale down the computational cost is real and pragmatic. Finally, we have suggested two ways to handle the numerical errors that will come up efficiently, either by utilizing a cut-point or by using an approximation formula, which has been thoroughly tested experimentally.
REFERENCES
[1] LOD Project, 2014. Linking Open (LOD) Data Project, http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData.
[2] Resource Description Framework (RDF), https://www.w3.org/TR/2015/NOTE-rdfa-primer-20150317/.
[3] SPARQL 1.1 Query Language. Tech. rep., W3C (2013), https://www.w3.org/TR/2013/REC-sparql11-query-20130321/.
[4] Huang, H., Liu, C.: Query evaluation on probabilistic RDF databases. In: International Conference on Web Information Systems Engineering, pp. 307–320, Springer (2009).
[5] Hartig, O.: An overview on execution strategies for Linked Data queries, Datenbank-Spektrum 13(2), 89–99 (2013).
[6] Zhang, X., Feng, Z., Wang, X., Rao, G., Wu, W.: Context-free path queries on RDF graphs. In: International Semantic Web Conference, pp. 632–648, Springer (2016).
[7] Sistla, A.P., Hu, T., Chowdhry, V.: Similarity based retrieval from sequence databases using automata as queries. In: Proceedings of the eleventh international conference on Information and knowledge management, pp. 237–244, ACM (2002).
[8] Giannakis K., Andronikos T., “Querying Linked Data and Büchi Automata”, IEEE Proceedings of the 9th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), 6-7 November, Corfu, Greece, pp. 110-114, 2014.
[9] Giannakis K., Theocharopoulou G., Papalitsas C., Andronikos T., Vlamos P., “Associating ω-automata to Path Queries on Webs of Linked Data”, Engineering Applications of Artificial Intelligence, Elsevier, Volume 51, May 2016, pages 115-123.
[10] Wang, X., Ling, J., Wang, J., Wang, K., Feng, Z.: Answering provenance-aware regular path queries on RDF graphs using an automata-based algorithm. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 395–396, ACM (2014).
[11] Schmidt M., Meier M., Lausen G.: Foundations of SPARQL Query Optimization. In: Proceedings of the 13th International Conference on Database Theory (ICDT '10), pp. 4–33, Lausanne, Switzerland, 2010.
[12] Andronikos, T., “Classification of SPARQL queries into equivalence classes of relevant queries”, International Journal of Advanced Research in Computer Science, December 2017, Volume 8, No. 9, pages 152-159.
[13] Hua, M., Pei, J.: Probabilistic path queries in road networks: traffic uncertainty aware path selection. In: Proceedings of the 13th International Conference on Extending Database Technology, pp. 347–358, ACM (2010).
[14] Andronikos T., Stefanidakis M., Papadakis I., “Adding Temporal Dimension to Ontologies via OWL Reification”, Proceedings of the 13th Panhellenic Conference on Informatics - PCI 2009 Conference, 10-12 September, Corfu, Greece, IEEE Computer Society, pp. 19-22, 2009.
[15] Lian, X., Chen, L., Wang, G.: Quality-aware subgraph matching over inconsistent probabilistic graph databases. IEEE Transactions on Knowledge and Data Engineering 28(6), 1560–1574 (2016).
[16] Reynolds, D.: Position paper: Uncertainty reasoning for Linked Data. In: Workshop, vol. 14 (2014).
[17] Schoenfisch, J.: Querying probabilistic ontologies with SPARQL, GI-Edition/Proceedings 232, 2245–2256 (2014).
[18] Krompaß, D., Nickel, M., Tresp, V.: Querying factorized probabilistic triple databases. In: International Semantic Web Conference, pp. 114–129. Springer (2014).
[19] Khan, A., Chen, L.: On uncertain graphs modeling and queries, Proceedings of the VLDB Endowment 8(12), 2042–2043 (2015).
[20] H. Fang and X. Zhang, “pSPARQL: a querying language for probabilistic RDF (extended abstract),” in Proceedings of the ISWC’16, Posters, 2016.
[21] Fang, H. pSPARQL: A Querying Language for Probabilistic RDF Data, Complexity, Hindawi Limited, 2019, 1-7.
[22] Andronikos T., Singh A., Giannakis K., Sioutas S., “Computing probabilistic queries in the presence of uncertainty via probabilistic automata”, Algorithmic Aspects of Cloud Computing, Third International Workshop, ALGOCLOUD 2017, Vienna, Austria, 5 September, 2017, Revised Selected Papers. Springer Theoretical Computer Science and General Issues, Volume 10739, pp. 106-122, ISBN: 978-3-319-74874-0 (Print) 978-3-319-74875-7 (Online), 2018.
[23] Andronikos T., Singh A., Giannakis K., Sioutas S., “Computing probabilistic queries in the presence of uncertainty via probabilistic automata”, Proceedings of the 3rd International Workshop on Algorithmic Aspects of Cloud Computing (ALGOCLOUD 2017), 4-8 September, Vienna, Austria, 2017.
[24] Papalitsas C., Giannakis K., Andronikos T., Theotokis D., Sifaleras A., “Initialization methods for the TSP with Time Windows using Variable Neighborhood Search”, 6th International Conference on Information, Intelligence, Systems and Applications (IISA), 6-8 July, Corfu, Greece, 2015.
[25] Papalitsas C., Andronikos T., Karakostas P., “Studying the impact of perturbation methods on the efficiency of GVNS for the ATSP”, Proceedings of the 6th International Conference on Variable Neighborhood Search (ICVNS 2018), 4-7 October, Sithonia, Halkidiki, Greece, 2018.
[26] Papalitsas C., Andronikos T., Karakostas P., “Studying the impact of perturbation methods on the efficiency of GVNS for the ATSP”, 6th International Conference, ICVNS 2018, Sithonia, Greece, October 4–7, 2018, Revised Selected Papers. Springer Theoretical Computer Science and General Issues, Volume 11328, pp. 287-302, ISBN: 978-3-030-15842-2 (Print) 978-3-030-15843-9 (Online), 2019.
[27] Papalitsas C., Andronikos T., “Unconventional GVNS for Solving the Garbage Collection Problem with Time Windows”, (MDPI - Open Access Publishing), Technologies 2019, 7(3), 61; https://doi.org/10.3390/technologies7030061.
[28] Papalitsas C., Karakostas P., Andronikos T., “A Performance Study of the Impact of