
Uncertain Location Based Range Aggregates in A Multi-Dimensional Space

The document proposes a filtering-and-verification framework to efficiently process uncertain location-based range aggregate queries in multi-dimensional space. It introduces two filtering techniques: the Statistical Threshold Filtering technique that bounds appearance probabilities using statistical inequalities, and applying existing Probabilistically Constrained Regions techniques. The framework first filters entries in an R-tree index using these techniques to prune or validate entries, then performs verification on any remaining uncertain entries to compute the aggregate.

Uploaded by

manai raghav
Copyright
© Attribution Non-Commercial (BY-NC)

Uncertain Location based Range Aggregates in a multi-dimensional space

Ying Zhang #, Xuemin Lin #, Yufei Tao *, Wenjie Zhang #

# The University of New South Wales
{yingz,lxue,zhangw}@cse.unsw.edu.au

* Chinese University of Hong Kong
[email protected]

Abstract: Uncertain data are inherent in many applications such as environmental surveillance and quantitative economics research. Recently, considerable research effort has been put into the field of analysing uncertain data. In this paper, we study the problem of processing uncertain location based range aggregates in a multi-dimensional space. We first formally introduce the problem, then propose a general filtering-and-verification framework to solve it. Two filtering techniques, named STF and PCR respectively, are proposed to significantly reduce the verification cost.

I. INTRODUCTION

Uncertain data are inherent in many applications such as environmental surveillance, market analysis, information extraction, moving object management and quantitative economics research. The uncertainty in those applications is generally caused by data randomness and incompleteness, limitations of measuring equipment, delayed data updates, etc. With the rapid development of various optical, infrared and radar sensors and GPS techniques, a huge amount of uncertain data is collected and accumulated every day. How to efficiently analyse large collections of uncertain data has therefore become a great concern in many areas [1], [2].

An important operation in those applications is the range query. Although the study of range queries on spatial databases has a long history, it is only very recently that the community started to investigate this problem against uncertain data [3], [4], [5], [6], [7]. In this paper, we focus on distance based range aggregate computation where the location of the query point is uncertain while the target data are conventional points (i.e., certain points). In general, an uncertain location based query, denoted by $Q$, is a multi-dimensional point which might appear at any location $x$ within a region denoted by $Q.region$, subject to a probability density function $pdf(x)$. Then, for a given set of data points $P$ and query distance $\gamma$, we want to retrieve the aggregate information from the data points which are within distance $\gamma$ of $Q$ with probability at least $\theta$.

There are many applications for the problem studied in this paper. One application is to estimate the extent of damage a missile attack might cause [8]. Even the most advanced laser-guided missile cannot hit the aim point with a 100 percent guarantee. So the commander cannot simply predict the effect of the missile attack by issuing a conventional distance based range aggregate query centred at the aim point to count the number of military targets (e.g., buildings, missile wells, mines, parked fighters) being covered. Instead, it is more reasonable to consider the likelihood of being destroyed for each target point. The distribution of the falling point of various missiles has been extensively studied and different probability density functions (PDFs) have been proposed, of which the bivariate normal distribution is the simplest [8]. Therefore, the commander can predict the effect of the attack by counting the number of target points which might be destroyed with likelihood at least $\theta$, which may depend on the confidence level of the commander. Moreover, if the target points have different military values, the evaluation can be based on the sum of the values of those target points.

A straightforward solution to this problem is to compute the appearance probability of each point $p \in P$ for $Q$ (for presentation simplicity, we say a point $p$ appears with respect to query point $q$ if the distance between $q$ and $p$ is not larger than $\gamma$), and then do the aggregate computation on the points which appear in $Q$ with probability at least $\theta$. Usually the appearance probability computation is expensive because it requires costly numerical evaluation of a complex integral. So the key to the problem is how to efficiently disqualify a point $p$, or validate it as a result, based on some pre-computed information. That is, we need to filter as many data points as possible to reduce the number of appearance probability computations.

In the paper, we first propose a filtering-and-verification framework to solve the problem based on filtering techniques. Then we propose a distance based filtering technique, named STF. The basic idea of the STF technique is to bound the appearance probability of the points by applying some well known statistical inequalities, where only a few statistics about the uncertain location based query $Q$ are required. The STF technique is simple and space efficient (only $d + 2$ float numbers are required), and experiments show that it has a decent filtering capacity. We also investigate how to apply the existing probabilistically constrained regions (PCR) technique [5] to our problem.

The remainder of the paper is organized as follows. In Section II, we formally define the problem. Section III proposes a general filtering-and-verification framework and two filtering techniques. Section IV evaluates the proposed techniques with experiments. Then Section V concludes the paper.

II. PROBLEM DEFINITION

A data point $p \in P$ or query instance $q \in Q$ referred to in this paper is, by default, in a $d$-dimensional numerical space. The distance between two points $x$ and $y$ is denoted by $|x - y|$. An object $o$ in the paper has an arbitrary shape which might enclose a set of data points (in this paper, $o$ corresponds to the MBR of an entry in the R-tree). $|o_1 - o_2|_{min}$ denotes $\min(\{|x_i - y_j|\})$ for $x_i \in o_1$ and $y_j \in o_2$; $|o_1 - o_2|_{max}$ is defined similarly. Note that the Euclidean distance is employed as the distance metric in the paper. Nevertheless, our technique can be easily extended to other distance metrics as long as

the triangle inequality holds. For presentation simplicity, we use uncertain query to represent uncertain location based query. The uncertain query $Q$ is defined below for both the continuous and the discrete case.

Definition 1 (Uncertain Query Q (continuous)): Uncertain query $Q$ is described by a probability density function $Q.pdf$. Let $Q.region$ represent the region where $Q$ might appear; then $\int_{x \in Q.region} Q.pdf(x)\,dx = 1$.

Definition 2 (Uncertain Query Q (discrete)): The uncertain query $Q$ consists of a set of instances $\{q_1, q_2, \ldots, q_n\}$ where $q_i$ appears with probability $P_{q_i}$ and $\sum_{q \in Q} P_q = 1$.

For a point $p$, we use $P_{app}(Q, p, \gamma)$ to represent the probability that $p$ is located within distance $\gamma$ of uncertain query $Q$. For presentation simplicity, it is called the appearance probability of $p$ regarding uncertain query $Q$ and query distance $\gamma$. The appearance probability of $p$ is formally defined below for the continuous and discrete cases respectively.

For the continuous case,

$$P_{app}(Q, p, \gamma) = \int_{x \in Q.region \,\wedge\, |x - p| \le \gamma} Q.pdf(x)\,dx \qquad (1)$$

As to the discrete case,

$$P_{app}(Q, p, \gamma) = \sum_{q \in Q \,\wedge\, |q - p| \le \gamma} P_q \qquad (2)$$

Specially, we have $P_{app}(Q, p, \gamma) = 0$ for any $\gamma < 0$. Note that when there is no ambiguity, we use $P_{app}(p, \gamma)$ to replace $P_{app}(Q, p, \gamma)$, and $Q$ and $pdf$ are employed to represent $Q.region$ and $Q.pdf$ respectively. It is immediate that $P_{app}(p, \gamma)$ is a monotonic (non-decreasing) function with respect to distance $\gamma$.

Problem Statement. In this paper we investigate the problem of uncertain location based range aggregate queries on spatial data; it is formally defined below.

Definition 3 (Uncertain Range Aggregate Query): Given a set of points $P$, an uncertain query $Q$, query distance $\gamma$ and probabilistic threshold $\theta$, we want to compute the aggregate information against points $p \in Q_{\theta,\gamma}(P)$, where $Q_{\theta,\gamma}(P)$ denotes the set of points $p \in P$ with $P_{app}(p, \gamma) \ge \theta$.

In the paper, we employ the count as the aggregate operation. That is, we want to efficiently compute $|Q_{\theta,\gamma}(P)|$. Nevertheless, our technique can be easily extended to other aggregates (e.g., sum, avg, max and min).

III. FILTERING-AND-VERIFICATION ALGORITHM

In this section, we first introduce a general framework for the filtering-and-verification algorithm based on filtering techniques. Then we propose a simple statistical filtering technique. We also investigate how to apply the PCR technique [5] to tackle our problem.

A. A framework for filtering-and-verification Algorithm

For the given uncertain query $Q$, probability threshold $\theta$ and distance $\gamma$, the naive way is to compute $P_{app}(p, \gamma)$ for each data point $p \in P$ based on Equation 1, then count the number of points with $P_{app}(p, \gamma) \ge \theta$. Clearly, this is inefficient as we need to visit every point $p \in P$ and the integral computation is expensive. To reduce the number of verifications, it is desirable to apply filtering techniques to prune or validate the data points.

Suppose $P$ is organized by an aggregate R-tree $R_P$ and a filter $F$ on $Q$ is available. Algorithm 1 describes how to apply a filter for the aggregate query processing in a branch-and-bound fashion. Note that the filter should support intermediate entries so that a group of data points can be filtered at the same time.

Algorithm 1: Filtering-and-Verification($R_P$, $Q$, $F$, $\gamma$, $\theta$)
Input: $R_P$: an aggregate R-tree on data set $P$; $Q$: uncertain query; $F$: filter; $\gamma$: query distance; $\theta$: probabilistic threshold
Output: $|Q_{\theta,\gamma}(P)|$
Stack := ∅; S := 0; C := ∅;
insert root of $R_P$ into Stack;
/* filtering */
while Stack ≠ ∅ do
    remove top entry e from Stack;
    load entry e from disk;
    for each child entry $e_i$ of e do
        status := F.check($e_i$);
        switch status do
            case pruned: break;
            case validated: S := S + $|e_i|$; break;
            case unknown:
                if $e_i$ is data entry then C := C ∪ {$e_i$};
                else put $e_i$ into Stack;
                break;
/* verification */
for each data entry e in C do
    if $P_{app}(Q, e, \gamma) \ge \theta$ then S := S + 1;
return S
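The branch-and-bound loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' C++ implementation: the `Entry` class, the `point` and `group` helpers and `make_distance_filter` are hypothetical names, the tree lives in memory instead of on disk pages, the query is discrete so that verification uses Equation (2), and the filter is a plain minimum/maximum distance test against the query's bounding rectangle, chosen only because it is easy to state. The threshold $\theta$ is assumed to be positive.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Entry:
    lo: tuple            # lower corner of the MBR
    hi: tuple            # upper corner of the MBR
    count: int           # aggregate value: number of data points under this entry
    children: list = field(default_factory=list)  # empty list => data entry (a point)

def point(*coords):
    """Hypothetical helper: a data entry whose MBR is a single point."""
    return Entry(tuple(coords), tuple(coords), 1)

def group(*children):
    """Hypothetical helper: an intermediate entry covering its children."""
    d = len(children[0].lo)
    lo = tuple(min(c.lo[i] for c in children) for i in range(d))
    hi = tuple(max(c.hi[i] for c in children) for i in range(d))
    return Entry(lo, hi, sum(c.count for c in children), list(children))

def p_app(query, p, gamma):
    """Equation (2): total probability of the query instances within gamma of p."""
    return sum(pr for q, pr in query if math.dist(q, p) <= gamma)

def box_min_max_dist(a_lo, a_hi, b_lo, b_hi):
    """Minimal and maximal Euclidean distance between two boxes, in O(d)."""
    dmin = dmax = 0.0
    for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi):
        gap = max(bl - ah, al - bh, 0.0)           # 0 if the intervals overlap
        far = max(abs(bh - al), abs(ah - bl))      # farthest pair of interval ends
        dmin += gap * gap
        dmax += far * far
    return math.sqrt(dmin), math.sqrt(dmax)

def make_distance_filter(query, gamma):
    """Build a check(entry) callable from the bounding box of the query instances."""
    d = len(query[0][0])
    q_lo = tuple(min(q[i] for q, _ in query) for i in range(d))
    q_hi = tuple(max(q[i] for q, _ in query) for i in range(d))
    def check(e):
        dmin, dmax = box_min_max_dist(q_lo, q_hi, e.lo, e.hi)
        if dmin > gamma:
            return "pruned"       # no instance can be within gamma (valid for theta > 0)
        if dmax <= gamma:
            return "validated"    # every instance is within gamma
        return "unknown"
    return check

def filter_and_verify(root, query, check, gamma, theta):
    stack, total, candidates = [root], 0, []
    while stack:                                   # filtering phase
        e = stack.pop()
        for child in e.children:
            status = check(child)
            if status == "pruned":
                continue
            if status == "validated":
                total += child.count               # whole subtree counted at once
            elif child.children:
                stack.append(child)                # intermediate entry: descend later
            else:
                candidates.append(child)           # data entry: needs exact check
    for c in candidates:                           # verification phase
        if p_app(query, c.lo, gamma) >= theta:
            total += 1
    return total
```

A query is passed as a list of `(instance, probability)` pairs, for example `[((0.0, 0.0), 0.5), ((2.0, 0.0), 0.3), ((10.0, 0.0), 0.2)]`. Any other filter with the same three-valued `check` interface can be plugged in without touching the traversal.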

An immediate filtering technique is based on the distance between the entry and the uncertain query. Clearly, for any $\theta$ we can safely prune an entry $e$ with $|Q - e|_{min} > \gamma$, or validate it if $|Q - e|_{max} \le \gamma$. We refer to this as the maximal/minimal distance based filtering technique, named MMD. The MMD technique is time efficient as only $O(d)$ time is required to compute the minimal and maximal distance between $Q.MBR$ and $e.MBR$, where $Q.MBR$ is the minimal bounding rectangle of $Q$. However, the MMD technique does not make use of $\theta$, which inherently limits its filtering capacity. In the sequel, we introduce two filtering techniques which can enhance the filtering capacity with some pre-computed information.

B. Statistical Filter
Fig. 1. Motivation Example. Fig. 2. Proof of Upper bound.

In this subsection, we propose a statistical filtering technique, named STF. As shown in Figure 1, for a given $\theta = 0.5$ we cannot prune $p$ for uncertain query $Q_1$ based on the MMD technique, although intuitively $p$ should be pruned. Similarly, we cannot validate $p$ for $Q_2$ either. This motivates us to develop a new filtering technique which is as simple as MMD, but can exploit $\theta$ to enhance the filtering capacity. We first introduce three statistics and a lemma which are employed by the STF technique.

Definition 4 (Geometric Centroid ($g_Q$)): Informally, the geometric centroid is the average of all points of an object. Let $g_Q$ denote the geometric centroid of uncertain query $Q$; we have $g_Q = \int_{x \in Q} x \cdot pdf(x)\,dx$ and $g_Q = \sum_{q \in Q} q \cdot P_q$ for the continuous and discrete case respectively.

Based on $g_Q$, we have two definitions, named $\eta_Q$ and $\sigma_Q$ respectively, which describe the variance of the distribution of uncertain query $Q$. $\eta_Q$ represents the weighted average distance to $g_Q$, with $\eta_Q = \int_{x \in Q} |x - g_Q| \cdot pdf(x)\,dx$ and $\eta_Q = \sum_{q \in Q} |q - g_Q| \cdot P_q$ for the continuous and discrete case respectively. Similarly, $\sigma_Q$ denotes the variance of $Q$, with $\sigma_Q = \int_{x \in Q} |x - g_Q|^2 \cdot pdf(x)\,dx$ and $\sigma_Q = \sum_{q \in Q} |q - g_Q|^2 \cdot P_q$ for the continuous and discrete case respectively.

Cantelli's inequality [9], described by Lemma 1, is employed in our statistical filtering technique; it is a one-sided version of the Chebyshev inequality.

Lemma 1: Let $X$ be a random variable with expected value $\mu$ and finite variance $\sigma^2$. Then for any real number $k > 0$, $Pr(X - \mu \ge k\sigma) \le \frac{1}{1 + k^2}$.

Theorem 1 indicates that we can further enhance the filtering capacity based on some simple statistics of $Q$.

Theorem 1: For the uncertain query $Q$ and distance $\gamma$, suppose the geometric centroid $g_Q$, weighted average distance $\eta_Q$ and variance $\sigma_Q$ of $Q$ are available. Then for a given point $p$, we have

1) If $\gamma > \mu_1$, $P_{app}(p, \gamma) \ge 1 - \frac{1}{1 + \frac{(\gamma - \mu_1)^2}{\sigma_1^2}}$, where $\mu_1 = |g_Q - p| + \eta_Q$ and $\sigma_1^2 = \sigma_Q - \eta_Q^2 + 4\eta_Q |g_Q - p|$.

2) If $\gamma < |g_Q - p| - \eta_Q - \epsilon$, $P_{app}(p, \gamma) \le \frac{1}{1 + \frac{(\gamma' - \mu_2)^2}{\sigma_2^2}}$, where $\mu_2 = \Delta + \eta_Q$, $\sigma_2^2 = \sigma_Q - \eta_Q^2 + 4\eta_Q \Delta$, $\Delta = \gamma + \gamma' + \epsilon - |p - g_Q|$ and $\gamma' > 0$. Here $\epsilon$ represents an infinitely small positive value.

Proof: As uncertain query $Q$ can be regarded as a random variable which takes $x \in Q$ with probability $pdf(x)$, we construct another random variable $Y$ as follows: for each $x \in Q$, there is a $y \in Y$ such that $y = |x - p|$ and $Y.pdf(y) = Q.pdf(x)$. Then we have $P_{app}(p, \gamma) = Pr(Y \le \gamma)$ according to Equation 1. Based on the triangle inequality, we have $|x - p| \le |x - g_Q| + |p - g_Q|$ and $|x - p| \ge \big|\,|x - g_Q| - |p - g_Q|\,\big|$ for any $x \in Q$. Let $\mu$ and $\sigma$ denote the expectation and standard deviation of random variable $Y$ respectively; then

$$\mu = \int_{y \in Y} y \cdot Y.pdf(y)\,dy = \int_{x \in Q} |x - p| \cdot pdf(x)\,dx \le |g_Q - p| + \eta_Q = \mu_1.$$

Similarly, we have $\mu \ge |g_Q - p| - \eta_Q$. Moreover,

$$\sigma^2 = E(Y^2) - E^2(Y) \le \int_{x \in Q} (|g_Q - p| + |x - g_Q|)^2 \cdot pdf(x)\,dx - (|g_Q - p| - \eta_Q)^2 = \sigma_Q - \eta_Q^2 + 4\eta_Q |g_Q - p| = \sigma_1^2.$$

Then, based on Lemma 1, let $k = \frac{\gamma - \mu}{\sigma}$; if $\gamma > \mu_1$ we have

$$Pr(Y \ge \gamma) = Pr(Y - \mu \ge k\sigma) \le \frac{1}{1 + \left(\frac{\gamma - \mu}{\sigma}\right)^2} \le \frac{1}{1 + \frac{(\gamma - \mu_1)^2}{\sigma_1^2}}.$$

Then it is immediate that

$$Pr(Y \le \gamma) \ge 1 - Pr(Y \ge \gamma) \ge 1 - \frac{1}{1 + \frac{(\gamma - \mu_1)^2}{\sigma_1^2}} \qquad (3)$$

As to the upper bound, as illustrated in Figure 2, let $p'$ be the dummy point on the line $p\,g_Q$ with $|p' - p| = \gamma + \gamma' + \epsilon$. Let $\Delta = |p' - g_Q|$; then we have

$$\Delta = \gamma + \gamma' + \epsilon - |p - g_Q| \qquad (4)$$

According to Inequality 3 and Equation 4, when $\gamma < |p - g_Q| - \eta_Q - \epsilon$ (which by Equation 4 is equivalent to $\gamma' > \mu_2$, the condition Inequality 3 requires for $p'$) we have $P_{app}(p', \gamma') \ge 1 - \frac{1}{1 + \frac{(\gamma' - \mu_2)^2}{\sigma_2^2}}$. Since for any $x \in Q$ with $|x - p'| \le \gamma'$ (striped area in Figure 2) we have $|x - p| > \gamma$, it implies that $P_{app}(p, \gamma) \le 1 - P_{app}(p', \gamma') \le \frac{1}{1 + \frac{(\gamma' - \mu_2)^2}{\sigma_2^2}}$.

The following extension is immediate based on the rationale of Theorem 1.

Extension 1. Suppose $o$ is an object with arbitrary shape; we can simply use $|o - g_Q|_{min}$ and $|o - g_Q|_{max}$ to replace $|p - g_Q|$ in Theorem 1 for the lower and upper probabilistic bound computation respectively.

Based on Extension 1, we can compute the upper and lower bound of $P_{app}(e, \gamma)$ to prune or validate entries in Algorithm 1. Since $g_Q$, $\eta_Q$ and $\sigma_Q$ are pre-computed, the only dominant cost in the filtering phase is the distance computation between $e$ and $g_Q$, which only costs $O(d)$ time. Following a similar rationale, another statistical filter can be proposed based on the popular Markov's inequality. We omit this part due to the space limitation.

C. PCR based Filter

Fig. 3. Transform query. Fig. 4. Choose PCRs.

Although Tao et al. [5], [6] do not address the problem studied in this paper, the PCR technique can be employed as a filter in Algorithm 1. In Figure 3, let $C_{p,\gamma}$ represent the circle (sphere) centred at $p$ with radius $\gamma$. Then we can regard the uncertain query $Q$ as an uncertain object, while $C_{p,\gamma}$ serves as a query. Because the PCR technique only works for rectangular queries, as suggested in [6], we can use $R_1$ and $R_2$ in Figure 3 to represent $C_{p,\gamma}$ for pruning and validation purposes respectively. A similar transformation can be done for intermediate entries as well.

Suppose a finite set of PCRs is pre-computed. For a given $\theta$ which is not selected for pre-computation, we can carefully choose the two closest existing PCRs for pruning and validation, as illustrated in Figure 4. Since $\theta$ is fixed during the query, we can choose these two PCRs for all data points before processing the query. Then the pruning/validating rules in [5] can be applied, which are very time efficient: only $O(d)$ time is required for each entry test. In order to further improve the performance of the filter, a more sophisticated approach from [6] is applied in our implementation, and the worst-case filtering time for each entry is $O(m + d \log m)$ where $m$ is the number of PCRs.

IV. EXPERIMENT

We present results of a comprehensive performance study to evaluate the efficiency of the techniques proposed in the paper. Following the framework of Algorithm 1, three different filtering techniques (MMD, STF and PCR) have been implemented and evaluated. All algorithms are implemented in C++ and compiled by GNU GCC. Experiments are conducted on PCs with an Intel P4 2.8GHz CPU and 2G memory under Debian Linux.

The spatial dataset, US, is employed as the target dataset; it contains 1 million 2-dimensional points representing locations in the United States (footnote 3). All of the dimensions are normalised to the domain [0, 10000]. To evaluate the performance of the algorithms, we also generate a synthetic dataset, Uniform, with 3 dimensions, in which points are uniformly distributed. The domain size is [0, 10000] for each dimension.
All of the datasets are organized by aggregate R-trees with a page size of 4096 bytes. A workload consists of 200 uncertain queries in our experiment. The uncertain regions of the uncertain queries are circles or spheres with radius $q_r$, which varies from 200 to 1000. The centres of the queries are randomly generated within the domain and the Normal distribution is employed to describe the PDF of the uncertain queries. Moreover, in order to avoid favouring a particular $\theta$ value, we randomly choose the probabilistic threshold $\theta$ between 0 and 1 for each uncertain query. We measure the performance of the techniques by means of IO cost and candidate size during the computation. The IO cost is the number of pages visited from $R_P$, while the candidate size is the number of data points which need exact probabilistic computation.

In the experiments, we evaluate the impact of query distance $\gamma$ on the performance of the filtering techniques in terms of candidate size and IO cost against the US and 3d Uniform datasets. Figure 5 reports the candidate size of MMD, STF and PCR when query distance $\gamma$ grows from 400 to 2000. Clearly, a larger $\gamma$ results in more candidate data points for verification. It is interesting that with only a few statistics, the STF can achieve a great saving on the candidate size compared with the MMD. With more space, PCR can further reduce the candidate size.
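Why a handful of statistics can beat the MMD test is easiest to see on a toy case. The sketch below evaluates the two bounds of Theorem 1 for a discrete query; it is an illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the free parameter of the upper bound is exposed as `delta` ($\Delta$) with a default of 0 (which places the dummy point $p'$ on $g_Q$), and `eps` stands in for the infinitely small $\epsilon$.

```python
import math

def query_stats(query):
    """g_Q, eta_Q and sigma_Q of a discrete query [(instance, prob), ...]."""
    d = len(query[0][0])
    g = tuple(sum(pr * q[i] for q, pr in query) for i in range(d))
    eta = sum(pr * math.dist(q, g) for q, pr in query)          # weighted avg distance
    sigma = sum(pr * math.dist(q, g) ** 2 for q, pr in query)   # variance around g_Q
    return g, eta, sigma

def cantelli(gap, var):
    """1 / (1 + gap^2 / var), the one-sided Chebyshev tail for a deviation gap > 0."""
    return 0.0 if var <= 0.0 else 1.0 / (1.0 + gap * gap / var)

def stf_lower_bound(stats, p, gamma):
    """Theorem 1, case 1: lower bound of P_app(p, gamma); 0 if the case does not apply."""
    g, eta, sigma = stats
    dist = math.dist(g, p)
    mu1 = dist + eta
    if gamma <= mu1:
        return 0.0
    var1 = sigma - eta ** 2 + 4.0 * eta * dist
    return 1.0 - cantelli(gamma - mu1, var1)

def stf_upper_bound(stats, p, gamma, delta=0.0, eps=1e-9):
    """Theorem 1, case 2: upper bound of P_app(p, gamma); 1 if the case does not apply."""
    g, eta, sigma = stats
    dist = math.dist(g, p)
    if gamma >= dist - eta - eps:
        return 1.0
    gamma_prime = delta + dist - gamma - eps      # Equation (4) solved for gamma'
    mu2 = delta + eta
    var2 = sigma - eta ** 2 + 4.0 * eta * delta
    # gamma_prime - mu2 equals dist - eta - gamma - eps, so it is positive here
    return cantelli(gamma_prime - mu2, var2)

def stf_check(stats, p, gamma, theta):
    """Three-valued filter decision for a single point."""
    if stf_upper_bound(stats, p, gamma) < theta:
        return "pruned"
    if stf_lower_bound(stats, p, gamma) >= theta:
        return "validated"
    return "unknown"

def p_app(query, p, gamma):
    """Exact appearance probability for the discrete case."""
    return sum(pr for q, pr in query if math.dist(q, p) <= gamma)
```

As a usage example, take the hypothetical query `[((0.0, 0.0), 0.98), ((-10.0, 0.0), 0.01), ((10.0, 0.0), 0.01)]` with $\gamma = 6$ and $\theta = 0.5$. Its bounding rectangle spans $[-10, 10]$ on the first axis, so the points `(15.0, 0.0)` and `(3.0, 0.0)` are both undecided by the MMD test (minimal distance 5 and maximal distance 13 respectively). With `stats = query_stats(query)`, `stf_check` prunes the first point, whose upper bound is roughly 0.025, and validates the second, whose lower bound is roughly 0.64, using only $g_Q$, $\eta_Q$ and $\sigma_Q$.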
Fig. 5. Candidate size (k) vs. $\gamma$ for MMD, STF and PCR: (a) US, (b) 3d uniform.

Fig. 6. # node accesses vs. $\gamma$ for MMD, STF and PCR: (a) US, (b) 3d uniform.

We evaluate the IO cost of the techniques and report the results in Figure 6. As expected, PCR still ranks first on both datasets.

V. CONCLUSIONS

In this paper, we formally define the problem of uncertain location based range aggregates in a multi-dimensional space; it covers a wide spectrum of applications. To efficiently process such a query, we propose a general filtering-and-verification framework and two filtering techniques, named STF and PCR respectively, such that the expensive computation cost of verification can be significantly reduced. As demonstrated in the experiments, the STF filtering technique can achieve a decent filtering capacity based on a few pre-computed statistics about the uncertain location based query. Moreover, it is very fast and space efficient due to its simplicity. The PCR technique is quite efficient when more space is available.

Acknowledgement. The work was supported by ARC Grant (DP0881035 and DP0666428) and Google Research Award. The third author was supported by Grant CUHK 4161/07 from HKRGC.

REFERENCES
[1] N. N. Dalvi and D. Suciu, "Management of probabilistic data: foundations and challenges," in PODS, 2007.
[2] M. A. Soliman, I. F. Ilyas, and K. C.-C. Chang, "Top-k query processing in uncertain databases," in ICDE, 2007.
[3] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, "Evaluating probabilistic queries over imprecise data," in SIGMOD, 2003.
[4] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter, "Efficient indexing methods for probabilistic threshold queries over uncertain data," in VLDB, 2004.
[5] Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar, "Indexing multi-dimensional uncertain data with arbitrary probability density functions," in VLDB, 2005.
[6] Y. Tao, X. Xiao, and R. Cheng, "Range search on multidimensional uncertain data," ACM Trans. Database Syst., vol. 32, no. 3, 2007.
[7] J. Chen and R. Cheng, "Efficient evaluation of imprecise location-dependent queries," in ICDE, 2007.
[8] G. M. Siouris, Missile Guidance and Control Systems, 2004.
[9] R. Meester, A Natural Introduction to Probability Theory, 2004.

3 Available at http://www.census.gov/geo/www/tiger/
