
The VLDB Journal (2022) 31:629–647

https://doi.org/10.1007/s00778-021-00715-z

REGULAR PAPER

Span-reachability querying in large temporal graphs


Dong Wen1 · Bohua Yang2 · Ying Zhang2 · Lu Qin2 · Dawei Cheng3 · Wenjie Zhang1

Received: 25 February 2021 / Revised: 22 August 2021 / Accepted: 27 October 2021 / Published online: 23 November 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract
Reachability is a fundamental problem in graph analysis. In applications such as social networks and collaboration networks,
edges are always associated with timestamps. Most existing works on reachability queries in temporal graphs assume that
two vertices are related if they are connected by a path with non-decreasing timestamps (time-respecting) of edges. This
assumption fails to capture the relationship between entities involved in the same group or activity with no time-respecting
path connecting them. In this paper, we define a new reachability model, called span-reachability, designed to relax the time
order dependency and identify the relationship between entities in a given time period. We adopt the idea of two-hop cover
and propose an index-based method to answer span-reachability queries. Several optimizations are also given to improve the
efficiency of index construction and query processing. We conduct extensive experiments on eighteen real-world datasets to
show the efficiency of our proposed solution.

Keywords Temporal graph · Reachability · Dynamic graph

1 The University of New South Wales, Kensington, Australia
2 AAII, University of Technology Sydney, Ultimo, Australia
3 Department of Computer Science and Technology, Tongji University, Shanghai, China

1 Introduction

Computing the reachability between vertices is a fundamental problem in network analysis. A true
result is returned if there exists a path connecting two query vertices. Extensive studies have been
done to answer reachability queries in graphs [2,9,11,13,19,24,27,29,31,35,36], a problem which has
applications across a wide range of domains such as road networks, social networks, collaboration
networks, PPI (protein-protein interaction) networks, XML and RDF databases.

In real-world applications, edges in graphs are often associated with temporal information. For
example, in collaboration networks, each vertex is a researcher, and an edge represents the
co-authorship of two researchers at a time. In social networks, an edge with a timestamp t
represents a communication (sending a message or leaving a comment) between two users at t. Due to
the widespread temporal information in entity relationships, research problems in temporal graphs
have recently drawn a lot of attention.

Motivation. In this paper, we study the vertex reachability problem in temporal graphs. An existing
method to model the temporal reachability is based on the concept of time-respecting paths
[17,18,21]. Specifically, a vertex u reaches v if there exists a path connecting u and v such that
the times on the path follow a non-decreasing order. For example, in the temporal graph G of Fig. 1,
v6 reaches v10 since there exists a path {⟨v6, v2, 5⟩, ⟨v2, v1, 6⟩, ⟨v1, v10, 8⟩} connecting them
and the times 5, 6, 8 are in a non-decreasing order. Semertzidis et al. [25] also model the temporal
reachability such that two vertices u, v are reachable if there exists a path connecting them and
the times of all edges in the path are consistent, i.e., u, v are reachable in a snapshot of the
temporal graph at a given time.
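The time-respecting condition described in the Motivation paragraph can be checked mechanically. The following is a minimal sketch, not code from the paper: the function name `is_time_respecting` and the tuple layout `(source, target, timestamp)` are assumptions made for illustration. It verifies that consecutive edges chain head-to-tail and that timestamps never decrease, using the path v6 → v2 → v1 → v10 quoted in the text.

```python
from typing import List, Tuple

# Assumed representation: a temporal edge is (source, target, timestamp).
TemporalEdge = Tuple[str, str, int]


def is_time_respecting(path: List[TemporalEdge]) -> bool:
    """Return True if consecutive edges are connected and the
    timestamps along the path are non-decreasing."""
    for (_, v_prev, t_prev), (u_next, _, t_next) in zip(path, path[1:]):
        if v_prev != u_next:   # edges must chain head-to-tail
            return False
        if t_next < t_prev:    # timestamps must not decrease
            return False
    return True


# The path quoted in the text for Fig. 1: v6 -> v2 -> v1 -> v10.
example_path = [("v6", "v2", 5), ("v2", "v1", 6), ("v1", "v10", 8)]
print(is_time_respecting(example_path))
```

A path whose edges are connected but whose timestamps decrease at some step fails this check, which is the kind of relationship the span-reachability model is meant to capture.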


Fig. 1 A temporal graph G where each number represents the timestamp of the edge below

In many scenarios of temporal graph mining, we may only focus on the relationship between vertices
in the projected graph of a small time interval without addressing any order limitation in the edge
sequence. Here, the projected graph is the static graph containing all edges at times falling in the
interval. For example, Gurukar et al. [16] compute the communication motifs in temporal graphs and
show that two edges sharing a common vertex are related if the difference of their timestamps is
very small. Authors in [22,28] compute the community structures called Δ-clique and
(θ, k)-persistent-core, respectively, in temporal graphs. Their models require that the resulting
subgraph satisfies some structural properties (e.g., vertex degree threshold) in the projected graph
of a time interval. The aforementioned two reachability models are too strict and might fail to
capture entity relationships in these scenarios.

Span-reachability. In this paper, we define a span-reachability model. Given a temporal graph and a
time interval I, we say a vertex u span-reaches v if u reaches v in the projected graph of I. We
investigate the problem of efficiently answering span-reachability queries for an arbitrary pair of
vertices and a time interval.

Example 1 In the temporal graph G of Fig. 1, we have v1 span-reaches v8 in the time interval [3, 5],
since there exists a path {⟨v1, v5, 5⟩, ⟨v5, v8, 4⟩} from v1 to v8 in the projected graph of [3, 5].

Applications. Using this model, we can effectively analyze the potential relationship between
entities by focusing on the item interactions in a specific period. Several real-world applications
can benefit from this study. We provide several examples as follows.

- Biology analysis. In PPI networks, it is important to identify whether two proteins participate in
a common biological process or molecular function. According to [20], proteins or RNA can be
described as vocabulary terms. The relationships between vocabulary terms can be modeled as a gene
ontology (GO) directed acyclic graph (DAG) in which each vertex is a concept or vocabulary term. In
monitoring the protein activities, our model can be used to identify the relationship between
proteins based on GO DAG.

- Security assessment and recommendation. In the context of assessing security, we need to
understand whether certain persons are related to a known terrorist [5]. In organizing a terrorist
activity, there may exist several phone calls among the suspects within a short period. We may not
be able to find a time-respecting path from the known terrorist to others, especially when not all
people in the organization take orders from this terrorist. Our model can be used to capture the
related suspects of a targeted terrorist. Similarly, in social networks, our model can be used to
detect whether two users are involved in a group in the period of big social events, such as the
FIFA World Cup and the Olympic Games.

- Money transaction monitor. In e-commerce platforms and bank systems, we have a graph in which each
vertex represents a user account and each edge with a timestamp represents a money transaction
between two user accounts. In monitoring money transactions, or some other illegal financial
activities, such as money laundering and fake transactions, it is crucial to detect whether there
exists a path between two user accounts. Normally, a series of money transactions should follow an
increasing order of timestamps. However, some skilled users may borrow some untraceable money to
finish the transfer and try to dodge any monitoring. For example, an account in the transaction path
may transfer the money to the next account in advance and receive the money from the prior account
later. The existing order-dependent reachability model cannot capture this activity, but our model
can be used here by setting a specified time interval.

Based on the concept of span-reachability, we also study a θ-reachability problem, which is a
generalized version of the span-reachability. Given a time interval I and a length threshold θ, two
vertices are θ-reachable in I if they are span-reachable in a θ-length subinterval of I. Taking the
above application of monitoring money transactions, a more general task is to identify whether there
exists a transaction chain between two accounts finished in a short period over a long monitoring
period. Note that when the length of the query interval equals θ, θ-reachability is equivalent to
span-reachability. When θ is 1, it is equivalent to the disjunctive historical reachability model
studied in [25].

Online solution. Given a time interval I, a straightforward method to answer span-reachability
queries is to perform a modified bidirectional breadth-first search between two query vertices. We
only scan the edges in the query interval and return true if a common vertex is found in the
searches of two query vertices. This method works but incurs high computational cost, especially
when the graph is very large.

Index-based solution. To process span-reachability queries efficiently and scalably, we propose an
index-based solution based on the concept of two-hop cover (sometimes called hop


labeling) [1,3]. The index is called Time Interval Labeling (TILL-Index). Specifically, for each
vertex u in the temporal graph, we maintain an out-label set L_out(u) and an in-label set L_in(u).
Each item in L_out(u) (resp. L_in(u)) is a triplet ⟨w, t_s, t_e⟩ which means that u span-reaches
(resp. is span-reachable from) w in the interval [t_s, t_e]. Given a query interval [t1, t2], we
answer the span-reachability from a vertex u to a vertex v by checking L_out(u) and L_in(v). We have
u span-reaches v if there exists a common vertex w such that u span-reaches w in a subinterval of
[t1, t2] and v is also span-reachable from w in a subinterval of [t1, t2].

Efficiently computing a small-size TILL-Index is not a trivial task. We construct the index in n
iterations where n is the number of vertices in the temporal graph. In each iteration, we pick a
vertex u to compute all its reachable vertices with corresponding time intervals and add u to the
in-label or out-label set of other vertices if necessary. This index construction algorithm
incorporates several optimizations. First, we use a priority queue to explore the reachable vertices
of the picked vertex u in each iteration. Based on the priority queue, our first step is always to
process the vertex with the shortest time interval that is reachable from u. This guarantees that
each found vertex reachable from u with a corresponding interval is never dominated by others and
significantly reduces unnecessary visits. In addition, by studying the dominance relationship
between different intervals, we stop exploring neighbors of a visited reachable vertex if a specific
condition is satisfied. This pruning rule significantly reduces the search space for each vertex.
Real-world temporal graphs are highly dynamic. We also propose algorithms to maintain our TILL-Index
when inserting new edges and deleting old edges.

Note that even though the concept of the two-hop cover has been studied or used in several existing
works [1,3,13,30], our method is not a naive extension of existing techniques. Unlike the previous
studies, our method exploits the characteristics of temporal graphs. The proposed optimizations for
index construction center mainly on the relationships between different time intervals, such as
containment and intersection. We also propose several optimization techniques to improve the
efficiency of query processing.

Contributions. We summarize the main contributions in this paper as follows.

– A novel reachability model in temporal graphs. We define a span-reachability model to capture the
  interactions between entities in a specific period of a temporal graph. In addition, we further
  study the θ-reachability problem which is a generalized version of the span-reachability.
– A two-hop index-based solution. We exploit the characteristics of the span-reachability model and
  adopt the idea of two-hop cover to propose an index-based method to answer both research problems.
– Several optimizations to improve the efficiency of index construction and query processing. We
  propose two optimizations to improve the efficiency of index construction. We also use a sliding
  window-like method to improve the efficiency of θ-reachability query processing.
– Efficient index maintenance in dynamic temporal graphs. We propose algorithms to maintain the
  index when inserting new edges and removing out-of-date edges, respectively.
– Extensive performance studies on more than ten real-world datasets. We conduct experiments on
  eighteen real-world datasets from different categories. The results demonstrate the effectiveness
  of our optimizations and the efficiency of our proposed solutions.

Organization. The rest of this paper is organized as follows. Section 2 introduces some background
knowledge and defines the problem. Section 3 gives an overview of our index-based solution.
Section 4 studies the index construction algorithms. Section 5 studies the query processing
algorithms. Section 6 introduces algorithms for the index maintenance. Section 7 reports the
experimental results. Section 8 introduces related works, and Sect. 9 concludes the paper. The paper
is extended from a conference version [32]. We additionally propose algorithms to maintain the index
for dynamic temporal graphs and conduct corresponding experiments. We omit the detailed proofs for
several lemmas and theorems when they are straightforward.

2 Preliminary

Let G(V, E) be a directed temporal graph, where V and E denote the set of vertices and the set of
temporal edges, respectively. Each temporal edge e ∈ E is a triplet ⟨u, v, t⟩, where u, v are
vertices in V and t is the connection time from u to v. Without loss of generality, we assume t is
an integer since the timestamp in real-world applications is normally an integer. Note that there
may exist multiple edges connecting the same pair of vertices at different times. We use n = |V| and
m = |E| to denote the number of vertices and the number of temporal edges, respectively. Given a
vertex u ∈ V, the out-neighbor set of u is defined as N_out(u) = {⟨v, t⟩ | (u, v, t) ∈ E}, and the
in-neighbor set N_in(u) is defined similarly. The out-degree (resp. in-degree) of u is denoted as
degr_out(u) = |N_out(u)| (resp. degr_in(u) = |N_in(u)|). Given a time interval [t_s, t_e], the
projected graph of G in [t_s, t_e], denoted by G_[t_s,t_e], is the static graph with
V(G_[t_s,t_e]) = V and E(G_[t_s,t_e]) = {(u, v) | (u, v, t) ∈ E, t ∈ [t_s, t_e]}. The length or
width of an interval [t_s, t_e] is the number of timestamps in the interval, i.e., t_e − t_s + 1.
Given the temporal graph G in Fig. 1, its projected graph in the interval [2, 4] is given in Fig. 2.
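To make the projected graph and the interval length concrete, here is a small sketch under stated assumptions. The helper names `project`, `interval_length`, and `reaches` are hypothetical, and `SAMPLE_EDGES` contains only the five edges of Fig. 1 that the text quotes explicitly (it is not the full graph). The sketch builds the projected graph for an interval and answers span-reachability, as introduced in Sect. 1, with a plain forward breadth-first search.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

TemporalEdge = Tuple[str, str, int]  # assumed layout: (u, v, t)

# Only the edges of Fig. 1 that are quoted in the text; NOT the full graph.
SAMPLE_EDGES = [
    ("v6", "v2", 5), ("v2", "v1", 6), ("v1", "v10", 8),
    ("v1", "v5", 5), ("v5", "v8", 4),
]


def interval_length(ts: int, te: int) -> int:
    """Number of timestamps in [ts, te]."""
    return te - ts + 1


def project(edges: Iterable[TemporalEdge], ts: int, te: int) -> Dict[str, Set[str]]:
    """Adjacency sets of the projected graph: keep edges with ts <= t <= te."""
    adj: Dict[str, Set[str]] = defaultdict(set)
    for u, v, t in edges:
        if ts <= t <= te:
            adj[u].add(v)
    return adj


def reaches(adj: Dict[str, Set[str]], source: str, target: str) -> bool:
    """Plain BFS reachability in a static directed graph."""
    if source == target:
        return True
    seen = {source}
    queue = deque([source])
    while queue:
        w = queue.popleft()
        for nxt in adj.get(w, ()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Example 1 of the text: v1 span-reaches v8 in [3, 5].
print(reaches(project(SAMPLE_EDGES, 3, 5), "v1", "v8"))
```

Note that the path v1 → v5 → v8 uses timestamps 5 and then 4, so it is not time-respecting, yet both edges fall inside [3, 5] and the projected graph still connects the two vertices.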


Fig. 2 The projected static graph of G in the time interval [2, 4]

Based on the concept of the projected graph, we define the span-reachability as follows.

Definition 1 (Span-Reachability) Given a temporal graph G, two vertices u, v and a time interval
[t_s, t_e], u span-reaches v in [t_s, t_e], denoted as u ⇝_[t_s,t_e] v, if u reaches v in the
projected graph G_[t_s,t_e].

Considering the temporal graph G in Fig. 1, we have v1 ⇝_[2,4] v3 since v1 reaches v3 in the
projected graph of [2, 4] in Fig. 2. We define the first problem studied in this paper based on
Definition 1 as follows.

Problem 1 Given a temporal graph G, two vertices u, v and a time interval I, we aim to efficiently
answer whether u span-reaches v in the interval I.

In addition to identifying the span-reachability, we further define a generalized reachability model
in a temporal graph G. Given two intervals [t_s′, t_e′] and [t_s, t_e], we have
[t_s′, t_e′] ⊆ [t_s, t_e] iff t_s′ ≥ t_s and t_e′ ≤ t_e.

Definition 2 (θ-Reachability) Given a temporal graph G, two vertices u, v, a parameter θ and a time
interval [t_s, t_e] s.t. t_e − t_s + 1 ≥ θ, u θ-reaches v if there exists an interval
[t_s′, t_e′] ⊆ [t_s, t_e] such that t_e′ − t_s′ + 1 = θ and u reaches v in G_[t_s′,t_e′].

Example 2 Given the temporal graph G in Fig. 1, let θ = 3. We have v1 3-reaches v8 in the interval
[1, 5] since there exists an interval [3, 5] ⊆ [1, 5] such that the length of [3, 5] is 3 and v1
reaches v8 in the projected graph G_[3,5].

Relationship of two reachability models. Given an arbitrary pair of vertices u, v, a threshold θ and
a time interval I, we also study the issue of computing θ-reachability from u to v in I, denoted by
Problem 2. Definition 1 is a special case of Definition 2 when θ is equal to the length of the input
interval. We also see a growing strictness from Definition 1 to Definition 2, which is shown in the
following lemma.

Lemma 1 Given two vertices u, v and an interval I, u span-reaches v in I if u θ-reaches v in I.

For ease of presentation, we assume the input temporal graph is a directed graph, and our proposed
techniques can easily handle undirected graphs. We omit the proofs of several lemmas and theorems
when they are straightforward due to space limitation.

3 Solution overview

We give an overview of our solution in this section. We start by presenting a straightforward online
algorithm for our research problems and then introduce several basic ideas of our index-based
method.

3.1 A straightforward online approach

Given a time interval [t_s, t_e], the span-reachability of two vertices u and v in [t_s, t_e] can be
answered by a modified bidirectional breadth-first search. Specifically, we begin by alternately
picking one of u and v in each round and exploring the unvisited vertices that are reachable from u
or can reach v. We conclude that u reaches v once the search scopes of the two vertices intersect.
The detailed pseudocode of this approach is given in Algorithm 1. Note that we assume u ≠ v in all
proposed algorithms to answer the reachability queries in this paper. Otherwise, we directly return
true without the algorithm invocation.

Algorithm 1: Online-Reach()
   Input: a temporal graph G, two vertices u and v, and an interval [t1, t2]
   Output: the span-reachability of u and v in [t1, t2]
 1 R_u ← {u}, R_v ← {v};
 2 Q_u ← a queue containing u;
 3 Q_v ← a queue containing v;
 4 toggle ← v;
 5 while Q_u ∪ Q_v ≠ ∅ do
 6     if toggle = v ∧ Q_u ≠ ∅ then
 7         toggle ← u;
 8         l ← |Q_u|;
 9         for 1 ≤ i ≤ l do
10             w ← Q_u.pop();
11             foreach ⟨w′, t⟩ ∈ N_out(w) : t ∈ [t1, t2] do
12                 if w′ ∈ R_v then return true;
13                 if w′ ∉ R_u then
14                     Q_u.push(w′);
15                     R_u ← R_u ∪ {w′};
16     else
17         repeat lines 7–15 to search the vertices that reach v by toggling between u and v, and
           replacing the subscript out with in
18 return false;

In line 1, R_u and R_v are used to collect all vertices that u can reach and all vertices that can
reach v, respectively. In line 5, Q_u ∪ Q_v = ∅ means there does not exist any unexplored vertex for
both u and v. The variable toggle initialized in line 4 represents the processed vertex in the last
iteration, and we process u in lines 7–15 if toggle = v. We explore the out-neighbors of all
vertices in the queue in lines 9–15. In line 11, we only access edges whose time falls into the
input interval. We return true if a common vertex of R_u and R_v is found in line 12, or push the
newly found vertex into the queue in line 14. The algorithm essentially performs a bidirectional BFS
in the projected graph G_[t1,t2]. The time complexity of Algorithm 1 is given as follows.

Lemma 2 The running time of Algorithm 1 is bounded by O(m + n).

Problem 2 can be answered by invoking Algorithm 1 as a subroutine. We can sequentially check each
possible θ-length subinterval in the given query interval [t1, t2] and return true immediately if u
reaches v in any one of them. In the worst case, the time complexity of this algorithm is bounded by
O((t2 − t1 − θ) · (n + m)).

Even though the bidirectional search method can successfully answer span-reachability queries and
θ-reachability queries, the algorithms suffer from poor scalability since the whole graph may be
visited during query processing. To improve query efficiency, we propose an index-based method in
the following section.

3.2 The time interval labeling index

We introduce our index structure called Time Interval Labeling (TILL-Index) in this section.
TILL-Index adopts the idea of two-hop cover (or two-hop labeling) [1,3]. In a nutshell, for each
vertex u, we maintain an in-label set L_in(u) and an out-label set L_out(u). Each item in L_in(u) is
a triplet ⟨w, t_s, t_e⟩ which means that w reaches u in the projected graph G_[t_s,t_e]. Each item
in L_out(u) is a triplet ⟨w, t_s, t_e⟩ which means that u reaches w in G_[t_s,t_e]. A triplet is
called a w-triplet if the first item of the triplet is w. We call ⟨u, v, t_s, t_e⟩ a reachability
tuple if u ⇝_[t_s,t_e] v, and we say a vertex w covers a reachability tuple ⟨u, v, t_s, t_e⟩ if
u ⇝_[t_s,t_e] w and w ⇝_[t_s,t_e] v. For ease of presentation, we focus mainly on Problem 1 now.
Problem 2 can also be solved based on the TILL-Index, and Sect. 5 will discuss its solution in
detail by extending the techniques in answering Problem 1. Given two vertices u and v, u
span-reaches v in an interval [t1, t2] if any one of the following conditions holds:

1. ∃⟨v, t_s, t_e⟩ ∈ L_out(u): [t_s, t_e] ⊆ [t1, t2];
2. ∃⟨u, t_s, t_e⟩ ∈ L_in(v): [t_s, t_e] ⊆ [t1, t2];
3. ∃⟨w, t_s, t_e⟩ ∈ L_out(u), ⟨w′, t_s′, t_e′⟩ ∈ L_in(v): w = w′ ∧ [t_s, t_e] ⊆ [t1, t2] ∧
   [t_s′, t_e′] ⊆ [t1, t2].

Based on the above conditions, a TILL-Index is a minimal index that can be used to answer correctly
all possible span-reachability queries in G. Here, by minimal, we mean that after removing any item,
the index can no longer correctly determine all possible span-reachability in the graph. An example
of a TILL-Index of the temporal graph G in Fig. 1 is given in Table 1.

Table 1 A Time Interval Labeling of G

Label set     Triplets
L_in(v2)      ⟨v1, 2, 2⟩ ⟨v1, 7, 7⟩
L_out(v2)     ⟨v1, 6, 6⟩
L_in(v3)      ⟨v1, 2, 4⟩ ⟨v1, 4, 5⟩ ⟨v2, 3, 4⟩
L_in(v4)      ⟨v1, 1, 4⟩ ⟨v1, 4, 5⟩ ⟨v2, 3, 5⟩ ⟨v2, 1, 4⟩ ⟨v3, 1, 1⟩ ⟨v3, 5, 5⟩ ⟨v3, 6, 8⟩
L_out(v4)     ⟨v3, 4, 4⟩
L_in(v5)      ⟨v1, 2, 3⟩ ⟨v1, 5, 5⟩ ⟨v2, 3, 3⟩
L_out(v5)     ⟨v3, 4, 4⟩
L_out(v6)     ⟨v1, 5, 6⟩ ⟨v2, 5, 5⟩ ⟨v4, 6, 9⟩
L_in(v7)      ⟨v1, 7, 7⟩
L_out(v7)     ⟨v3, 3, 6⟩
L_in(v8)      ⟨v1, 1, 3⟩ ⟨v1, 2, 4⟩ ⟨v1, 4, 5⟩ ⟨v2, 1, 3⟩ ⟨v2, 3, 4⟩ ⟨v3, 8, 8⟩ ⟨v5, 1, 1⟩ ⟨v5, 4, 4⟩
              ⟨v6, 9, 9⟩
L_out(v8)     ⟨v3, 4, 6⟩ ⟨v4, 6, 6⟩
L_in(v9)      ⟨v1, 1, 1⟩ ⟨v1, 3, 7⟩ ⟨v2, 1, 4⟩ ⟨v3, 1, 1⟩ ⟨v7, 3, 3⟩
L_out(v9)     ⟨v3, 6, 6⟩
L_in(v10)     ⟨v1, 8, 8⟩
L_out(v10)    ⟨v1, 9, 9⟩
L_out(v11)    ⟨v1, 3, 3⟩
L_out(v12)    ⟨v1, 6, 9⟩ ⟨v10, 6, 6⟩

Example 3 Assume that we aim to answer the span-reachability from v6 to v3 in the time interval
[4, 8]. We first locate the out-label set of v6 in Table 1, which is
L_out(v6) = {⟨v1, 5, 6⟩, ⟨v2, 5, 5⟩, ⟨v4, 6, 9⟩}. The in-label set of v3 is
L_in(v3) = {⟨v1, 2, 4⟩, ⟨v1, 4, 5⟩, ⟨v2, 3, 4⟩}. We can see that there is a common vertex v1 such
that the intervals of both ⟨v1, 5, 6⟩ ∈ L_out(v6) and ⟨v1, 4, 5⟩ ∈ L_in(v3) fall in the query
interval [4, 8]. Therefore, the answer of this query is true.

Even though the idea of two-hop cover is simple, it is non-trivial to efficiently compute a small
TILL-Index and answer the reachability queries based on the index. We give the details about index
construction and query processing in Sect. 4 and Sect. 5, respectively.

Remark 1 One may consider using some existing techniques (e.g., transitive closure) of reachability
in static graphs to construct the index for span-reachability. However, this idea is impractical
since we may have an extremely large number of possible query time spans and each time span
corresponds to a static graph. It is not acceptable to index all possible static graphs.

4 Index construction

4.1 The labeling framework

We begin by presenting several basic concepts before introducing the details of the index
construction.

Definition 3 (Dominance and Skyline Reachability Tuple) Given two vertices u and v, a reachability
tuple ⟨u, v, t_s, t_e⟩ dominates ⟨u, v, t_s′, t_e′⟩ if [t_s, t_e] ⊂ [t_s′, t_e′]. A reachability
tuple ⟨u, v, t_s, t_e⟩ is a skyline (or non-dominated) reachability tuple (SRT) if it is not
dominated by other tuples.
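The three query conditions of Sect. 3.2 and the dominance test of Definition 3 can be sketched in a few lines. This is an illustrative sketch only, not the paper's query algorithm (which is given in Sect. 5): the names `within`, `dominates`, and `span_reaches`, and the dictionary layout of the label sets, are assumptions. The label data are the two label sets quoted in Example 3.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, int, int]          # assumed layout: (w, t_s, t_e)
Labels = Dict[str, List[Triplet]]

# Label sets quoted in Example 3 (a fragment of Table 1).
L_OUT: Labels = {"v6": [("v1", 5, 6), ("v2", 5, 5), ("v4", 6, 9)]}
L_IN: Labels = {"v3": [("v1", 2, 4), ("v1", 4, 5), ("v2", 3, 4)]}


def within(ts: int, te: int, t1: int, t2: int) -> bool:
    """True if [ts, te] is a subinterval of [t1, t2]."""
    return t1 <= ts and te <= t2


def dominates(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """Definition 3 on intervals: a dominates b if a is strictly inside b."""
    return b[0] <= a[0] and a[1] <= b[1] and a != b


def span_reaches(l_out: Labels, l_in: Labels, u: str, v: str,
                 t1: int, t2: int) -> bool:
    """Check the three label conditions for a query (u, v, [t1, t2])."""
    out_u = [x for x in l_out.get(u, []) if within(x[1], x[2], t1, t2)]
    in_v = [x for x in l_in.get(v, []) if within(x[1], x[2], t1, t2)]
    if any(w == v for w, _, _ in out_u):      # condition 1: v in L_out(u)
        return True
    if any(w == u for w, _, _ in in_v):       # condition 2: u in L_in(v)
        return True
    hubs = {w for w, _, _ in out_u}           # condition 3: common vertex w
    return any(w in hubs for w, _, _ in in_v)


# Example 3: v6 span-reaches v3 in [4, 8] through the common vertex v1.
print(span_reaches(L_OUT, L_IN, "v6", "v3", 4, 8))
```

The linear scans over both label sets keep the sketch short. With labels sorted by vertex rank, the common-vertex test of condition 3 could instead be done with a merge-style pass.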


Given a vertex u, we also use the term skyline in Definition 3 for the triplets in L_out(u) (resp.
L_in(u)) since a triplet ⟨w, t_s, t_e⟩ ∈ L_out(u) represents a reachability tuple
⟨u, w, t_s, t_e⟩. In constructing TILL-Index, we only need to compute labels that can cover all SRTs
since a vertex covering an SRT also covers all tuples that the SRT dominates. Therefore, our
research task in the index construction is to cover all SRTs in the graph with the total index size
as small as possible.

The minimum two-hop cover. [13] studies the two-hop cover for the shortest distance and reachability
queries in general graphs. They proved that computing the minimum two-hop cover is NP-hard and can
be transformed to a minimum cost set cover problem [12]. They use a greedy algorithm to compute a
two-hop cover and achieve an O(log n) approximation factor. The proposed algorithm is inefficient
since a procedure of densest subgraph computation is invoked every time they select a vertex to
cover several reachability (or shortest distance) vertex pairs.

Hierarchical two-hop cover. The aforementioned theoretical results also hold in our scenario, and we
omit the detailed proof. Due to the difficulty of the optimal cover computation, we adopt a
hierarchical labeling approach [1,3] which follows a strict total order on the vertices in G, and we
will prove the minimality of our TILL-Index under the total order constraint. We use O to denote the
vertex order. We say the rank of a vertex u is higher than that of a vertex v if O(u) < O(v). By the
total order, we mean to sequentially process each vertex in O. Once we process a vertex w, we add w
and corresponding intervals to the labels of u and v for all uncovered reachability tuples
containing u, v covered by w. Intuitively, a vertex playing an important role in G should be put at
the front of the order. Next, we adopt the ordering method in [19]. Given each vertex u, we use the
formula (degr_in(u) + 1) × (degr_out(u) + 1) as the importance of u. We sort the vertices in a
decreasing order of their importance and break ties by selecting the vertex with the smaller ID.
Given the total vertex order, we immediately have the following lemmas for our TILL-Index.

Lemma 3 Given an arbitrary vertex u, for every triplet ⟨w, ∗, ∗⟩ in L_out(u) ∪ L_in(u),
O(w) < O(u).

Lemma 4 Given an SRT ⟨u, v, t_s, t_e⟩ in G, let w be the first vertex (the highest rank) in O that
can cover ⟨u, v, t_s, t_e⟩, where w ≠ u and w ≠ v. There exists a triplet ⟨w, t_s′, t_e′⟩ ∈ L_out(u)
such that [t_s′, t_e′] ⊆ [t_s, t_e] and a triplet ⟨w, t_s″, t_e″⟩ ∈ L_in(v) such that
[t_s″, t_e″] ⊆ [t_s, t_e].

Without loss of generality, we maintain only skyline triplets in labels of TILL-Index since a
dominated triplet can always be replaced by a corresponding skyline triplet without affecting the
correctness of query answers. We define an important concept in computing TILL-Index as follows.

Definition 4 (Canonical Reachability Tuple) A reachability tuple ⟨u, v, t_s, t_e⟩ is a canonical
reachability tuple (CRT) if (i) ⟨u, v, t_s, t_e⟩ is a skyline reachability tuple, and (ii) there
does not exist a vertex w such that u ⇝_[t_s,t_e] w, w ⇝_[t_s,t_e] v, O(w) < O(u) and O(w) < O(v).

Given a vertex order O and a vertex u, we say a tuple is an SRT (resp. CRT) of u if the tuple is an
SRT (resp. CRT) containing u and the rank of u is higher in the tuple. We have the following lemmas
based on Definition 4.

Lemma 5 Given an arbitrary vertex u and any (skyline) triplet ⟨w, t_s, t_e⟩ in L_out(u) (resp.
L_in(u)), ⟨u, w, t_s, t_e⟩ (resp. ⟨w, u, t_s, t_e⟩) is a CRT.

Lemma 6 For each CRT ⟨u, v, t_s, t_e⟩ in G, there is a triplet ⟨u, t_s, t_e⟩ in L_in(v) if
O(u) < O(v). If this is not the case, there is a triplet ⟨v, t_s, t_e⟩ in L_out(u).

Example 4 The labels in Table 1 are computed following the total alphabetical order of the vertices
in G of Fig. 1. For the in-labels of v8, we can find that all vertices v1, v2, v3, v5 and v6
appearing in L_in(v8) have ranks higher than v8. For any triplet in L_in(v8), e.g., ⟨v2, 3, 4⟩,
there does not exist any vertex with a higher rank than v8 and v2 that can cover the reachability
tuple ⟨v2, v8, 3, 4⟩.

Based on Lemmas 5 and 6, there is a one-to-one correspondence between CRTs and triplets in
TILL-Index. It now follows that we can construct TILL-Index by computing all CRTs. A framework to
construct TILL-Index is presented in Algorithm 2.

Algorithm 2: A Framework of Index Construction
1 for 1 ≤ i ≤ n do
2     u_i ← the i-th vertex in the order O;
3     compute all SRTs of u_i;
4     compute all CRTs by refining the computed SRTs;
5     add corresponding triplet of each CRT to in-labels or out-labels of other vertices;

In the framework, we process each vertex sequentially in the vertex order. In line 3, the SRTs of
u_i can be computed in two phases. One computes all vertices and corresponding time intervals that
are reachable from u_i, while the other computes those that can reach u_i. Taking the first one as
an example, a basic implementation uses a queue to maintain the discovered reachable triplets of
u_i. To be specific, the queue is initialized with a special triplet containing u_i. We iteratively
pop a triplet ⟨v, t_s, t_e⟩, which means u_i can reach v in [t_s, t_e]. For each out-neighbor
⟨v′, t⟩ of v, we expand ⟨v, t_s, t_e⟩ to ⟨v′, min(t_s, t), max(t_e, t)⟩, which means u_i reaches v′
in the interval [min(t_s, t), max(t_e, t)]. We mark this new triplet
⟨v′, min(t_s, t), max(t_e, t)⟩ as discovered and push it into the


queue if it is not dominated by other discovered triplet, and 4.3 Implementation


remove all its dominating discovered triplets. In line 4, for
every SRT computed in line 3, we check whether there exists The basic implementation incurs high computational cost.
a vertex with a higher rank that can cover the SRT based on We discuss several techniques to efficiently compute SRTs
Definition 4. This can be done by performing a query process- and CRTs as follows.
ing procedure based on the labels computed by higher-rank
vertices. The details of query processing will be given in 4.3.1 Efficient SRT computation
Sect. 5. If yes, we omit such SRT, and derive all CRTs when
all SRTs are checked. We propose a priority queue based method to efficiently com-
pute all SRTs of a given vertex. A key idea of this method is
4.2 Theoretical analysis given in the following lemma.

Lemma 7 Given a vertex u and a set of known SRTs S con-


We prove the correctness and the minimality of TILL-Index taining u, a reachability tuple u, v, ts , te  is an SRT if (i)
computed by Algorithm 2. u, v, ts , te  is not dominated by any other SRT in S, and (ii)
the length of [ts , te ] is the smallest among those of all tuples
Theorem 1 (Correctness) The span-reachability query of that are not in S.
any pair of vertices can be correctly answered (any one of
three conditions presented in Sect. 3.2 holds) based on the Example 5 We consider the temporal graph G in Fig. 1.
index computed by Algorithm 2. Assume that we aim to compute SRTs of v5 . For ease of
presentation, we only consider the SRTs starting from v5 . Ini-
Proof The theorem can be easily derived according to Defi- tially, S = ∅ and we have several reachability tuples with the
nition 4, Lemma 5 and Lemma 6. 
 smallest interval length. They are v5 , v3 , 4, 4, v5 , v8 , 1, 1
and v5 , v8 , 4, 4, and all of them are SRTs. Now we have S =
Theorem 2 (Minimality) For any vertex u and any triplet {v5 , v3 , 4, 4, v5 , v8 , 1, 1, v5 , v8 , 4, 4}. v5 , v8 , 4, 8 is
w, ts , te  in Lin (u) or Lout (u) of the index computed by not an SRT since it is dominated by v5 , v8 , 4, 4 in S, and
Algorithm 2, there exists a pair of vertices u  , v  and a corre- v5 , v4 , 4, 5 is an SRT since its interval length is smallest
sponding interval [ts , te ] such that the span-reachability of u  among all possible reachability tuples except the SRTs in S.
and v  in [ts , te ] cannot be correctly answered after removing
w, ts , te . Based on Lemma 7, to compute all non-dominated reach-
ability triplets (a target and the corresponding time interval)
from a vertex u, we preserve all discovered reachability
Proof Given a triplet w, ts , te  ∈ Lout (u), we prove that
triplets in a priority queue and always pop the triplets with the
after removing w, ts , te , the span-reachability from u to w
smallest time interval length in the priority queue. According
in [ts , te ] cannot be correctly answered. If this query can
to Lemma 7, a popped triplet v, ts , te  must be an SRT if it
be correctly answered, then at least one of the following
is not dominated by any previously found SRT. We compute
two conditions holds: (i) there exists a triplet u, ts , te  in
the new interval of each neighbor of v that can be reached
Lin (w) such that [ts , te ] ⊆ [ts , te ]; (ii) there exists a triplet
from v, ts , te  and push the corresponding new triplet into
v, ts , te  ∈ Lout (u) and a triplet v, ts , te  ∈ Lin (w) such
the priority queue if necessary. Following this, we compute
that [ts , te ] ⊆ [ts , te ] and [ts , te ] ⊆ [ts , te ].
all SRTs when the priority queue is empty. A detailed pseu-
Given that w, ts , te  ∈ Lout (u), we have O(w) < O(u)
docode of our final algorithm will be given in the following
according to Lemma 3, and a triplet containing u cannot
section.
appear in Lin (w) or Lout (w). Therefore, condition i cannot
hold. Condition ii holds if v covers the reachability tuple
4.3.2 Efficient CRT computation
u, w, ts , te  and the rank of v is higher than those of u and
w. This contradicts Lemma 5 that u, w, ts , te  is a CRT. This
We reduce the CRT checks by making use of the transi-
completes the proof of the theorem. 

tive property of the dominance relationship. The following
lemma provides an early termination condition in the search
As we shown earlier, computing the minimum two-hop of SRT computation.
cover for both shortest distance and reachability is NP-hard
according to [13]. The property still holds for computing the Lemma 8 Given a reachability tuple u, v, ts , te  and a ver-
two-hop cover for span-reachability. The proof is similar to tex w, for any reachability tuple u, v  , ts , te , we have w
that in [13] and is done by transforming the problem into the covers u, v  , ts , te  if (i) w covers u, v, ts , te , (ii) [ts , te ] ⊆
minimum cost set cover problem [12]. [ts , te ], and (iii) v span-reaches v  in [ts , te ].
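The priority-queue search behind Lemma 7 can be illustrated with a short sketch. It is not the authors' implementation: the adjacency format (`out_nbrs` maps a vertex to a list of `(neighbor, timestamp)` pairs), the function name and the toy graph in the usage note are all assumptions made for illustration, and the coverage pruning of Lemma 8 is deliberately left out.

```python
import heapq

def skyline_triplets(out_nbrs, source):
    """Return {target: [(ts, te), ...]} holding the non-dominated
    reachability triplets from `source` (hypothetical helper)."""
    found = {}   # target -> accepted (ts, te) intervals
    heap = []    # entries: (interval length, vertex, ts, te)
    for v, t in out_nbrs.get(source, []):
        heapq.heappush(heap, (0, v, t, t))
    while heap:
        _, v, ts, te = heapq.heappop(heap)
        if v == source:
            continue  # a cycle back to the source cannot yield a shorter interval
        # Triplets are popped by increasing interval length, so any interval
        # contained in [ts, te] for the same target was accepted earlier.
        if any(ts <= a and b <= te for a, b in found.get(v, [])):
            continue  # dominated or duplicate
        found.setdefault(v, []).append((ts, te))
        for w, t in out_nbrs.get(v, []):
            ns, ne = min(ts, t), max(te, t)   # widen the interval by the new edge
            heapq.heappush(heap, (ne - ns, w, ns, ne))
    return found
```

On a made-up graph with edges a→b at time 4, a→c at times 1 and 4, b→d at time 5 and c→d at time 6, the search from a keeps the interval [4, 5] for d and discards [4, 6] and [1, 6], which both contain it.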


Given the i-th vertex u_i in O, assume that we have detected a vertex v that u_i can reach in an interval [t_s, t_e], and the corresponding tuple ⟨u_i, v, t_s, t_e⟩ has been covered. Based on Lemma 8, we immediately terminate any further exploration of v since all other vertices that are reachable from ⟨v, t_s, t_e⟩ must have been covered too. By adopting this pruning technique, we not only avoid a large number of CRT checks but also reduce the search scope in SRT computation. We give the pseudocode of the final algorithm for the index construction by combining the two optimization techniques in Algorithm 3.

Algorithm 3: TILL-Construct∗()
Input: a temporal graph G(V, E), a vertex order O and a parameter ϑ
Output: the TILL-Index of G
 1 foreach u ∈ V do
 2     L_in(u), L_out(u) ← ∅;
 3 for 1 ≤ i < n do
 4     u_i ← the i-th vertex in O;
 5     Q ← an empty priority queue;
 6     Q.push(⟨u_i, +∞, −∞⟩);
 7     while Q is not empty do
 8         ⟨v, t_s, t_e⟩ ← Q.pop();
 9         if u_i ≠ v then
10             if u_i ⇝^L_[t_s, t_e] v then continue;
11             else L_in(v) ← L_in(v) ∪ {⟨u_i, t_s, t_e⟩};
12         foreach ⟨v′, t⟩ ∈ N_out(v) do
13             if O(v′) ≤ O(u_i) then continue;
14             t_s′ ← min(t_s, t), t_e′ ← max(t_e, t);
15             if t_e′ − t_s′ + 1 > ϑ then continue;
16             else Q.push(⟨v′, t_s′, t_e′⟩);
17     repeat lines 6–16 to construct L_out of each vertex by toggling between the subscripts in and out;

In Algorithm 3, we use a parameter ϑ to achieve a practical trade-off between the index size and the index coverage. ϑ represents the largest interval length of a span-reachability query that TILL-Index can support. In most applications, users may be only interested in span-reachability queries over a short interval. We will show the index size and its construction time under different ϑ selections in Sect. 7.

Lines 4–16 of Algorithm 3 compute all reachable vertices and corresponding intervals from u_i. As discussed in Section 4.3.1, we always pop a triplet ⟨v, t_s, t_e⟩ with the smallest value of t_e − t_s in line 8. Based on Lemma 8, we check if the reachability tuple ⟨u_i, v, t_s, t_e⟩ has been covered in line 10. Here, u_i ⇝^L_[t_s, t_e] v means the answer of the span-reachability query from u_i to v in [t_s, t_e] is true according to the current TILL-Index L (L includes the in-label L_in and out-label L_out of every vertex). Note that L dynamically increases during the execution process of the algorithm. We omit this tuple and stop further exploration of it if it is covered by the previously computed index (line 10). Lemmas 7 and 8 guarantee that ⟨u_i, v, t_s, t_e⟩ must be a CRT, and we safely add u_i with the corresponding interval to the in-labels of v in line 11. Lines 12–16 explore the out-neighbors of v. We omit neighbors with a higher rank in line 13 since their reachability tuples have been covered in previous iterations. We compute the updated reachability interval for each neighbor v′ in line 14. We push the triplet into the priority queue in line 16 if the interval length is not larger than the threshold ϑ.

Example 6 We give a running example of Algorithm 3. The default value of the parameter ϑ is +∞. Given a graph G in Fig. 1 and an alphabetical order, assume that we have processed the first 4 vertices. We have i = 5 in line 3 and u_i = v5 in line 4. The priority queue is initialized with one special element ⟨v5, +∞, −∞⟩. We pop ⟨v5, +∞, −∞⟩ in line 8 and scan out-neighbors of v5 including ⟨v3, 4⟩, ⟨v8, 1⟩ and ⟨v8, 4⟩. We omit the out-neighbor ⟨v3, 4⟩ since O(v3) < O(v5) in line 13, and push ⟨v8, 1, 1⟩ and ⟨v8, 4, 4⟩ into Q. Assume the next popped triplet in line 8 is ⟨v8, 1, 1⟩. v8 has only one out-neighbor ⟨v4, 6⟩ and we have t_s′ = 1, t_e′ = 6 in line 14. We push ⟨v4, 1, 6⟩ into Q. In the next round, we pop ⟨v8, 4, 4⟩ and push ⟨v4, 4, 6⟩ into Q. Now, Q contains two triplets, ⟨v4, 4, 6⟩ and ⟨v4, 1, 6⟩. We do not push any new triplet into Q in the following rounds since both ⟨v4, 4, 6⟩ and ⟨v4, 1, 6⟩ are covered by v3, and the condition in line 10 holds. Till now, we have computed all CRTs of v5 which start from v5.

Let l be the number of all CRTs. Based on Lemmas 5 and 6, it is straightforward to see that l is also the number of all labels, and the index size is bounded by l. Let l_q = max_{u∈V} max(|L_in(u)|, |L_out(u)|) and d be the largest out-degree or in-degree of vertices in the graph, i.e., d = max_{u∈V} max(degr_out(u), degr_in(u)). The time complexity of Algorithm 3 is given as follows.

Theorem 3 The running time of Algorithm 3 is bounded by O(ld(log ld + l_q)).

Proof We first focus on one iteration of line 3. Based on Lemmas 5 and 6, line 11 is performed O(l) times. We scan the out-neighbors of v whenever line 11 is executed. Therefore, lines 13–16 are performed O(l · d) times, and the total number of items appended to the priority queue is bounded by O(l · d). In line 10, we check whether ⟨u_i, v, t_s, t_e⟩ is covered by prior vertices. This can be done by sequentially scanning the existing out-label of u_i and in-label of v and returning true if there is a common vertex in the interval [t_s, t_e]. The running time can be bounded by O(|L_out(u_i)| + |L_in(v)|) or O(l_q). In lines 8 and 16, it takes O(log(l · d)) time to push a new item or pop the top item in the priority queue. By combining the results, we have the total time complexity O(ld(log ld + l_q)). □

Undirected graphs. In undirected graphs, we only need to maintain one label set for each vertex. Therefore, we omit line 17 of Algorithm 3 when constructing the index of an undirected graph.
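To make the control flow of Algorithm 3 concrete, the following is a minimal Python sketch under several assumptions that are not taken from the paper: edges are given as `(u, v, t)` triples, labels are plain lists of `(hub, t_s, t_e)` entries, the coverage test of line 10 is a linear scan rather than the grouped binary search of Sect. 5, and the forward and reverse searches are both run for each root before moving on to the next one. The names `till_construct`, `covered` and `max_len` (standing in for ϑ) are hypothetical.

```python
import heapq
from collections import defaultdict

def covered(l_out, l_in, u, v, ts, te):
    """True if the labels built so far already answer 'u span-reaches v in [ts, te]'."""
    outs = {w for w, a, b in l_out[u] if ts <= a and b <= te} | {u}
    ins = {w for w, a, b in l_in[v] if ts <= a and b <= te} | {v}
    return bool(outs & ins)

def till_construct(edges, order, max_len=float("inf")):
    nbrs = {"out": defaultdict(list), "in": defaultdict(list)}
    for u, v, t in edges:
        nbrs["out"][u].append((v, t))
        nbrs["in"][v].append((u, t))
    rank = {v: i for i, v in enumerate(order)}   # smaller value = higher rank
    l_in, l_out = defaultdict(list), defaultdict(list)
    for root in order:
        for direction in ("out", "in"):          # the "in" pass plays the role of line 17
            heap = [(0, root, float("inf"), float("-inf"))]
            while heap:
                _, v, ts, te = heapq.heappop(heap)
                if v != root:
                    if direction == "out":       # root reaches v
                        if covered(l_out, l_in, root, v, ts, te):
                            continue             # prune: do not expand v (Lemma 8)
                        l_in[v].append((root, ts, te))
                    else:                        # v reaches root
                        if covered(l_out, l_in, v, root, ts, te):
                            continue
                        l_out[v].append((root, ts, te))
                for w, t in nbrs[direction][v]:
                    if rank[w] <= rank[root]:
                        continue                 # higher-ranked vertices were handled earlier
                    ns, ne = min(ts, t), max(te, t)
                    if ne - ns + 1 > max_len:
                        continue                 # interval longer than the supported length
                    heapq.heappush(heap, (ne - ns, w, ns, ne))
    return l_in, l_out
```

On the made-up edge list `[("a","b",4), ("a","c",1), ("a","c",4), ("b","d",5), ("c","d",6)]` with alphabetical order, the in-label of d ends up with one entry per hub a, b and c, and the wider triplets reaching d from a are pruned by the coverage test.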

5 Query processing

We study the index-based query processing strategies in this section. We discuss the algorithm to answer the span-reachability query, followed by a full discussion of the algorithm for the θ-reachability query.

5.1 Span-reachability query processing

Our first step is to present several basic pruning strategies to check span-reachability. Given a vertex u, let t_min(N_out(u)) (resp. t_max(N_out(u))) be the smallest (resp. largest) timestamp in out-neighbors of u. t_min(N_in(u)) and t_max(N_in(u)) are defined similarly. We have the following lemmas.

Lemma 9 A vertex u span-reaches a vertex v in [t_1, t_2] only if there exist a neighbor ⟨w, t⟩ ∈ N_out(u) and ⟨w′, t′⟩ ∈ N_in(v) such that t ∈ [t_1, t_2] and t′ ∈ [t_1, t_2].

Lemma 10 A vertex u span-reaches a vertex v in [t_1, t_2] only if t_2 ≥ max(t_min(N_out(u)), t_min(N_in(v))) and t_1 ≤ min(t_max(N_out(u)), t_max(N_in(v))).

We can check the conditions in the above two lemmas simply by scanning the neighbors of each query vertex. If the conditions do not hold, we immediately return false and do not invoke any query processing procedure.

Given a pair of query vertices u, v and an interval [t_s, t_e], a straightforward method to answer the span-reachability of u and v is to scan L_out(u) and L_in(v). Let L_out(u)_[t_s, t_e] (resp. L_in(u)_[t_s, t_e]) be the set of all triplets in L_out(u) (resp. L_in(u)) falling in the interval [t_s, t_e]. We answer true if there exists a common vertex in L_out(u)_[t_s, t_e] ∪ {u} and L_in(v)_[t_s, t_e] ∪ {v}. Otherwise, we return false. This can be done by using a hash table to preserve the vertices.

To improve the query efficiency, we group the triplets in the out-label or in-label of each vertex by their target vertices (the first item in the triplet). Let V(L_out(u)) be the set of vertices in the reachability triplets of L_out(u), i.e., V(L_out(u)) = {v ∈ V | ⟨v, t_s, t_e⟩ ∈ L_out(u)}. Given a vertex w in V(L_out(u)), we use L_out(u)_w to denote the intervals in which u can reach w in L_out(u), i.e., L_out(u)_w = {[t_s, t_e] | ⟨w, t_s, t_e⟩ ∈ L_out(u)}. We check the span-reachability in two phases. In the first one, we check if there exists a common vertex in {u} ∪ V(L_out(u)) and {v} ∪ V(L_in(v)). This can be done in a merge-sort-like strategy by arranging the vertices in the label of each vertex by their ranks. Once finding a common vertex w, we further check if there exist intervals falling in the query interval in L_out(u)_w and L_in(v)_w, respectively. If yes, we immediately return true. Otherwise, we resume the search and look for the next common vertex. Recall that in Algorithm 3, the triplets appended to the out-label or in-label of each vertex follow the order of the vertex rank. Therefore, the group operation can be done naturally in the index construction without incurring extra cost.

To check whether there exists an interval falling in the query interval, we sort the intervals of each vertex in chronological order. So, given two intervals [t_s, t_e] and [t_s′, t_e′], [t_s, t_e] is prior to [t_s′, t_e′] if (i) t_s < t_s′, or (ii) t_s = t_s′ ∧ t_e < t_e′. Therefore, given a query interval [t_1, t_2] and an arbitrary interval [t_s, t_e], if an interval [t_s∗, t_e∗] ⊆ [t_1, t_2] exists, [t_s∗, t_e∗] must appear after [t_s, t_e] if t_s < t_1 or appear before [t_s, t_e] if t_e > t_2. This sorting task can be done at the end of Algorithm 3 after all labels are completely computed, which would not increase the total time complexity of Theorem 3.

Fig. 3 The data structure of L_in(v4) and L_out(v6)

Example 7 Fig. 3 shows the data structure used to store the labels of each vertex. We take L_in(v4) and L_out(v6) as examples. All triplets in these two label sets can be found in Table 1. Two arrays are used to store the triplets in the label of each vertex. One interval array stores the intervals for each vertex in the label, and the other vertex array stores all vertices in the label and the start position of their intervals in the interval array. For L_in(v4) in Fig. 3, the intervals of v1, v2 and v3 are marked by white, light gray and dark gray, respectively. The intervals of v2 in L_in(v4) in the interval array start from the position of v2 (i.e., 2) and end at the position of the next vertex v3 in the vertex array (i.e., 4).

A complete pseudocode to process the span-reachability query is presented in Algorithm 4, which is self-explanatory. In lines 5, 6 and 9, we use the binary search method described above to find a subinterval of [t_1, t_2]. We provide a running example as follows.

Example 8 Assume that we aim to answer the span-reachability from v6 to v4 in [3, 5]. We scan the vertex array of L_out(v6) and L_in(v4) to look for a common vertex. We first find a common vertex v1. However, there does not exist a subinterval of [3, 5] of v1 in the interval array of L_out(v6). We continue to search the next common vertex and find v2. We find there exists a subinterval [5, 5] of v2 in L_out(v6) and a subinterval [3, 5] of v2 in L_in(v4). Therefore, we return true for this query.
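The two-phase check described in this subsection (merge the hub lists by rank, then binary-search the intervals of each common hub) can be sketched as follows. This is an illustrative sketch, not the paper's code: vertices are represented by their integer ranks, each label is a list of `(hub, intervals)` pairs sorted by hub, and the intervals of a hub are assumed to be chronologically sorted and pairwise non-dominated, so their end times increase with their start times. Unlike the pseudocode of Algorithm 4, the two direct cases are tested before the merge rather than inside the loop, and the label contents in the usage note are invented.

```python
from bisect import bisect_left

def has_subinterval(intervals, t1, t2):
    """intervals: sorted, pairwise non-dominated (ts, te) pairs."""
    k = bisect_left(intervals, (t1, t1))      # first interval with ts >= t1
    return k < len(intervals) and intervals[k][1] <= t2

def span_reach(l_out_u, l_in_v, u, v, t1, t2):
    # Direct cases: v is a hub of u, or u is a hub of v.
    if any(w == v and has_subinterval(iv, t1, t2) for w, iv in l_out_u):
        return True
    if any(w == u and has_subinterval(iv, t1, t2) for w, iv in l_in_v):
        return True
    # Two-hop case: merge the two hub lists, both sorted by rank.
    i = j = 0
    while i < len(l_out_u) and j < len(l_in_v):
        w, out_iv = l_out_u[i]
        w2, in_iv = l_in_v[j]
        if w < w2:
            i += 1
        elif w > w2:
            j += 1
        else:
            if has_subinterval(out_iv, t1, t2) and has_subinterval(in_iv, t1, t2):
                return True
            i += 1
            j += 1
    return False
```

With the invented labels `l_out_6 = [(1, [(5, 6)]), (2, [(5, 5)])]` and `l_in_4 = [(1, [(1, 4), (4, 5)]), (2, [(3, 5)]), (3, [(4, 4)])]`, the query from vertex 6 to vertex 4 in [3, 5] fails on hub 1 and succeeds on hub 2, mirroring the shape of Example 8.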


Algorithm 4: Span-Reach()
Input: TILL-Index of G, two vertices u and v, and an interval [t_1, t_2]
Output: the span-reachability of u and v in [t_1, t_2]
 1 i, i′ ← 1;
 2 while i ≤ |V(L_out(u))| ∧ i′ ≤ |V(L_in(v))| do
 3     w ← the i-th vertex in V(L_out(u));
 4     w′ ← the i′-th vertex in V(L_in(v));
 5     if w = v ∧ ∃[t_s, t_e] ∈ L_out(u)_w : [t_s, t_e] ⊆ [t_1, t_2] then return true;
 6     else if w′ = u ∧ ∃[t_s, t_e] ∈ L_in(v)_w′ : [t_s, t_e] ⊆ [t_1, t_2] then return true;
 7     else if O(w) < O(w′) then i ← i + 1;
 8     else if O(w) > O(w′) then i′ ← i′ + 1;
 9     else if ∃[t_s, t_e] ∈ L_out(u)_w : [t_s, t_e] ⊆ [t_1, t_2] ∧ ∃[t_s′, t_e′] ∈ L_in(v)_w : [t_s′, t_e′] ⊆ [t_1, t_2] then
10         return true;
11     else i ← i + 1, i′ ← i′ + 1;
12 return false;

Theorem 4 Given two query vertices u and v, the running time of Algorithm 4 is bounded by O(|L_out(u)| + |L_in(v)|).

5.2 θ-Reachability

Based on the idea for the span-reachability query processing, we study the θ-reachability query in this subsection. Given two vertices u, v, a threshold θ and an interval [t_1, t_2], a straightforward idea to answer the θ-reachability query is to invoke Algorithm 4 for every possible interval (from [t_1, t_1 + θ − 1] to [t_2 − θ + 1, t_2]). The time complexity of this method is O((t_2 − t_1 − θ) · (|L_out(u)| + |L_in(v)|)). We improve the time complexity to O(|L_out(u)| + |L_in(v)|) by taking a sliding window based approach. Before discussing the details of the algorithm, we show that u θ-reaches v in [t_1, t_2] if one of the following conditions holds:

1. ∃⟨v, t_s, t_e⟩ ∈ L_out(u): [t_s, t_e] ⊆ [t_1, t_2] ∧ t_e − t_s + 1 ≤ θ;
2. ∃⟨u, t_s, t_e⟩ ∈ L_in(v): [t_s, t_e] ⊆ [t_1, t_2] ∧ t_e − t_s + 1 ≤ θ;
3. ∃⟨w, t_s, t_e⟩ ∈ L_out(u), ⟨w′, t_s′, t_e′⟩ ∈ L_in(v): w = w′ ∧ [t_s, t_e] ⊆ [t_1, t_2] ∧ [t_s′, t_e′] ⊆ [t_1, t_2] ∧ max(t_e, t_e′) − min(t_s, t_s′) + 1 ≤ θ.

Based on the conditions above, we can follow the same framework of Algorithm 4. We add the limitation t_e − t_s + 1 ≤ θ in line 5 and line 6 of Algorithm 4, respectively, to check the first two conditions. To check the third condition after finding a common vertex w in V(L_out(u)) and V(L_in(v)), we first filter out all intervals in L_out(u)_w and L_in(v)_w that are not contained in [t_1, t_2]. We then slide a window whose length is always θ. Recall that the intervals in each label are sorted in chronological order. The initial start time of the window is the smallest start time of the remaining intervals in the labels. If both the first intervals of two labels fall in the sliding window, we return true. Otherwise, we filter out the interval with the smallest start time and move the sliding window forward to the next smallest start time of the intervals. This step is repeated until no interval remains.

Algorithm 5: ES-Reach∗()
Input: TILL-Index of G, a parameter θ, two vertices u and v and an interval [t_1, t_2]
Output: the θ-reachability of u and v in [t_1, t_2]
 1 i, i′ ← 1;
 2 while i ≤ |V(L_out(u))| ∧ i′ ≤ |V(L_in(v))| do
 3     w ← the i-th vertex in V(L_out(u));
 4     w′ ← the i′-th vertex in V(L_in(v));
 5     if w = v ∧ ∃[t_s, t_e] ∈ L_out(u)_w : [t_s, t_e] ⊆ [t_1, t_2] ∧ t_e − t_s + 1 ≤ θ then return true;
 6     else if w′ = u ∧ ∃⟨w′, t_s, t_e⟩ ∈ L_in(v) : [t_s, t_e] ⊆ [t_1, t_2] ∧ t_e − t_s + 1 ≤ θ then return true;
 7     else if O(w) < O(w′) then i ← i + 1;
 8     else if O(w) > O(w′) then i′ ← i′ + 1;
 9     else if ∃[t_s, t_e] ∈ L_out(u)_w : [t_s, t_e] ⊆ [t_1, t_2] ∧ ∃[t_s′, t_e′] ∈ L_in(v)_w : [t_s′, t_e′] ⊆ [t_1, t_2] then
10         k ← the position of the first interval [t_s, t_e] ∈ L_out(u)_w s.t. [t_s, t_e] ⊆ [t_1, t_2];
11         k′ ← the position of the first interval [t_s′, t_e′] ∈ L_in(v)_w s.t. [t_s′, t_e′] ⊆ [t_1, t_2];
12         while k ≤ |L_out(u)_w| ∧ k′ ≤ |L_in(v)_w| do
13             [t_s, t_e] ← the k-th interval in L_out(u)_w;
14             [t_s′, t_e′] ← the k′-th interval in L_in(v)_w;
15             if [t_s, t_e] ⊈ [t_1, t_2] ∨ [t_s′, t_e′] ⊈ [t_1, t_2] then
16                 break;
17             else if max(t_e, t_e′) − min(t_s, t_s′) + 1 ≤ θ then
18                 return true;
19             else if t_e − t_s + 1 > θ ∨ t_s < t_s′ then
20                 k ← k + 1;
21             else k′ ← k′ + 1;
22         i ← i + 1, i′ ← i′ + 1;
23     else i ← i + 1, i′ ← i′ + 1;
24 return false;

The pseudocode to answer the θ-reachability query is given in Algorithm 5. Lines 5 and 6 correspond to the θ-reachability conditions 1 and 2, respectively. Lines 9–22 correspond to condition 3. In lines 10 and 11, we use a binary search to locate the first interval falling in [t_1, t_2]. The condition of line 15 holds if all intervals of L_out(u)_w (or L_in(v)_w) in [t_1, t_2] are scanned, and we break the loop. Line 17 holds if we find a pair of intervals falling in the same sliding window. In lines 19 and 21, we move the sliding window with a new start time of min(t_s, t_s′).

Theorem 5 Given a pair of vertices u and v, the running time of Algorithm 5 is bounded by O(|L_out(u)| + |L_in(v)|).

Example 9 Given a query interval [1, 8] and θ = 3, assume that we aim to answer 3-reachability from v6 to v4. The out-label of v6 and the in-label of v4 are given in Fig. 3. In line 9 of Algorithm 5, we find a common vertex v1 in V(L_out(v6)) and V(L_in(v4)). We have [t_s, t_e] = [5, 6] in line 13 and [t_s′, t_e′] = [1, 4] in line 14. The conditions in lines 15, 17 and 19 do not hold. As a result, line 21 is executed. In the next iteration, we have [t_s′, t_e′] = [4, 5] and [t_s, t_e] is kept constant. The condition in line 17 holds, and true is returned.

6 TILL-Index maintenance

Many real-world temporal graphs are updated incrementally and continuously as edge streams. In this section, we extend the priority queue-based search technique in Sect. 4.3.1 to maintain TILL-Index in dynamic temporal streams. Section 6.1 investigates the problem of incremental TILL-Index maintenance given a set of new edges. Section 6.2 provides a method to prune TILL-Index for expiring edges.

The problem of maintaining hop-labeling-based index for shortest distance queries in unlabeled simple graphs has been studied in an existing work [4]. Unlike [4], our techniques for TILL-Index maintenance are built around relationships between time intervals in the index and are extended from the priority-queue-based search proposed in Sect. 4.3.1. In addition, [4] gives up the support of deleting outdated label entries due to the poor efficiency. However, in the context of temporal graphs, we will show in Sect. 6.2 that the outdated labels can be pruned efficiently and the updated index is guaranteed to be minimal. Following [4], we assume the vertex order is fixed when edges update. Note that we only consider edge updates in the paper. This is because insertions or deletions of vertices can be expressed using a set of edge updates.

6.1 Incremental index maintenance

Let t_max be the latest time in the current temporal graph G. Given a set of new edges whose times are later than t_max inserted into G, we aim to update the TILL-Index to support the queries for the latest time. The main technical challenges in designing algorithms for maintaining TILL-Index are to guarantee its completeness and minimality. For ease of presentation, we first assume that the time of each edge is unique unless otherwise stated. The assumption is crucial to guarantee the minimality. We will lift the restriction later and discuss the case that multiple new edges arrive with the same time.

Assume that an edge ⟨u_t, v_t, t⟩ is inserted with t > t_max. Due to the equivalence of CRTs and TILL-Index, the basic idea of TILL-Index maintenance is to monitor the changes of CRTs after inserting ⟨u_t, v_t, t⟩.

Lemma 11 Let T and T+ be the set of all CRTs before and after the insertion of ⟨u_t, v_t, t⟩, respectively. We have T ⊂ T+.

Proof Given that the times of all edges are earlier than t, the insertion of ⟨u_t, v_t, t⟩ must generate at least one new CRT ending at t. Therefore, we have T ≠ T+. Next, we prove T ⊆ T+. Assume that there exists a CRT ⟨u, v, t_s, t_e⟩ in T and not in T+. ⟨u, v, t_s, t_e⟩ must be dominated by a new CRT in T+ \ T. This contradicts that any new CRT must end at t with t > t_e. □

Based on Lemma 11, all existing CRTs are still in the updated index, and we only need to find all new CRTs produced by the insertion of ⟨u_t, v_t, t⟩. Then, we update the index accordingly. Recall that given a new CRT ⟨u, v, t_s, t_e⟩, ⟨u, t_s, t_e⟩ is added to the in-label of v if O(u) < O(v). Otherwise, ⟨v, t_s, t_e⟩ is added to the out-label of u. For simplicity, we mainly discuss computing all new CRTs ⟨u, v, t_s, t_e⟩ with O(u) < O(v) and completing in-labels of each vertex. The idea for updating out-labels is similar.

Let ⟨u, v, t_s, t_e⟩ be an arbitrary new CRT generated by the insertion of ⟨u_t, v_t, t⟩. We immediately have t_e = t, which can be easily proved based on the definition of CRT. It is also straightforward to derive that u ⇝_[t_s, t_e] u_t and v_t ⇝_[t_s, t_e] v. Intuitively, the search space to find all new CRTs can be very large since there may exist many vertices reaching u_t and reached from v_t. We refine it by making use of the existing TILL-Index, which is shown in the following lemma.

Lemma 12 Given a new edge ⟨u_t, v_t, t⟩ and an arbitrary new CRT ⟨u, v, t_s, t_e⟩ with O(u) < O(v), we have O(u) < O(v_t) and u ∈ {u_t} ∪ L_in(u_t).

Proof Given the new CRT ⟨u, v, t_s, t_e⟩, there exists a path from u to v over the interval [t_s, t_e]. The path is via ⟨u_t, v_t, t⟩, and t_e = t. Given O(u) < O(v), the rank of u is the highest in the path. The path from u to u_t corresponds to a CRT. Given the equivalence of CRTs and the labeling index, we have u ∈ {u_t} ∪ L_in(u_t). □

Based on Lemma 12, we compute all new CRTs starting from every vertex in {u_t} ∪ L_in(u_t). Recall that in Algorithm 3, we compute CRTs by performing a priority queue-based search from each root vertex. Given a root vertex u ∈ {u_t} ∪ L_in(u_t), instead of searching from scratch, we can reuse the intermediate searching result from u to u_t and resume the search from u_t to all new vertices reached by u. This is because any valid path of the new CRT ⟨u, v, t_s, t_e⟩ must pass (u_t, v_t). We show the following two lemmas to support our detailed algorithms.

Lemma 13 Let T_{u,∗} and T_{∗,u} be the sets of CRTs reached from u and reaching u, respectively. Let T+_{u,∗} and T+_{∗,u} be their counterparts after inserting ⟨u_t, v_t, t⟩. We have T_{∗,u_t} = T+_{∗,u_t} and T_{v_t,∗} = T+_{v_t,∗}.

Lemma 14 Given a new edge ⟨u_t, v_t, t⟩ and an arbitrary new CRT ⟨u, v, t_s, t_e⟩ with O(u) < O(v) and u ≠ u_t, let [t_s′, t_e′] be the last (latest) interval in L_in(u_t)_u. We have [t_s′, t_e′] ⊆ [t_s, t_e].
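Lemmas 12 and 14 together pin down where the search for new in-label entries can start: only hubs already present in L_in(u_t) that rank above v_t qualify, and each contributes a single seed built from its latest interval. The sketch below illustrates that selection under assumptions of our own: `l_in_ut` is a list of `(hub, intervals)` pairs sorted by hub rank with chronologically sorted intervals, `rank` maps a vertex to its position in the order (smaller means higher rank), `max_len` stands in for the length threshold ϑ, and the function name is hypothetical. It covers only the in-label direction; the out-label direction would be symmetric.

```python
def insertion_seeds(l_in_ut, rank, u_t, v_t, t, max_len=float("inf")):
    """Seed tuples (u, v_t, ts, t) from which the search for new CRTs resumes."""
    seeds = []
    for u, intervals in l_in_ut:          # hubs in rank order
        if rank[u] >= rank[v_t]:
            break                         # Lemma 12: only hubs ranked above v_t matter
        ts, _te = intervals[-1]           # Lemma 14: the latest interval suffices
        if t - ts + 1 > max_len:
            continue                      # the widened interval would exceed the threshold
        seeds.append((u, v_t, ts, t))     # widen [ts, te] to [ts, t] via the new edge
    if rank[u_t] < rank[v_t]:
        seeds.append((u_t, v_t, t, t))    # the new edge itself, with u_t as root
    return seeds
```

For instance, with ranks a < b < c < d, a new edge from d to c at time 9, and `l_in_ut = [("a", [(4, 5)]), ("b", [(5, 5)]), ("c", [(6, 6)])]`, the seeds are `("a", "c", 4, 9)` and `("b", "c", 5, 9)`; the scan stops at hub c because it does not rank above v_t.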

Based on Lemma 13, we can safely use the existing CRTs from a root vertex u to u_t. Based on Lemma 14, we resume the search of u from only one tuple, which is ⟨u_t, t_s′, t_e′⟩. Searching from other tuples from u to u_t maintained in L_in(u_t)_u would produce results dominated by those from ⟨u_t, t_s′, t_e′⟩.

Fig. 4 An example of single edge insertion

Algorithm 6: TILL-Insert()
Input: TILL-Index of G, a parameter ϑ, a vertex order O and a new edge ⟨u_t, v_t, t⟩
Output: the updated index
 1 T_in, T_out ← ∅;
 2 foreach u ∈ V(L_in(u_t)) do
 3     if O(u) ≥ O(v_t) then break;
 4     [t_s, t_e] ← the last interval in L_in(u_t)_u;
 5     if t − t_s + 1 > ϑ then continue;
 6     T_in ← T_in ∪ {⟨u, v_t, t_s, t⟩};
 7 foreach v ∈ V(L_out(v_t)) do
 8     if O(v) ≥ O(u_t) then break;
 9     [t_s, t_e] ← the last interval in L_out(v_t)_v;
10     if t − t_s + 1 > ϑ then continue;
11     T_out ← T_out ∪ {⟨u_t, v, t_s, t⟩};
12 T ← perform a binary merge sort on T_in and T_out;
13 add ⟨u_t, v_t, t, t⟩ to the end of T;
14 foreach ⟨u, v, t_s, t_e⟩ ∈ T do
15     Q ← an empty priority queue;
16     if O(u) < O(v) then
           // search following edge directions and complete in-labels
17         Q.push(⟨v, t_s, t_e⟩);
18         perform lines 7–16 in Algorithm 3 by replacing u_i with u;
19     else
           // search following reverse edge directions and add out-labels
20         Q.push(⟨u, t_s, t_e⟩);
21         perform line 17 in Algorithm 3 by replacing u_i with v;

The Algorithm. We now present the algorithm to incrementally maintain TILL-Index in Algorithm 6. Lines 1–13 prepare all CRTs as initial states of the priority-queue based search (lines 14–21). Lines 2–6 prepare CRTs to complete in-labels of the index. Based on Lemma 12, we only consider the vertices that can reach u_t and have lower ranking values (line 2). Note that vertices in V(L_in(u_t)) have been arranged following the total order. As a result, we terminate the iteration once finding a vertex ranking lower than v_t in line 3. Based on Lemma 14, we derive the last time interval from u to u_t. We extend the interval via the edge ⟨u_t, v_t, t⟩ and generate the CRT ⟨u, v_t, t_s, t⟩ since t > t_e. Similarly, lines 7–11 prepare CRTs for out-labels.

Lines 12–13 organize all reachability tuples in non-decreasing order of their smallest vertex ranking values, which is crucial to guarantee the index minimality and improve the updating efficiency. We simply perform a merge sort in line 12 since tuples in T_in and T_out are sorted, respectively, during the construction. In addition, the ranks of u in line 6 and v in line 11 are higher than those of u_t and v_t, which allows us to simply append ⟨u_t, v_t, t, t⟩ to the end of T in line 13. Lines 14–21 resume the priority queue-based search from each given starting reachability tuple, which is self-explanatory. The completeness and minimality of the index can be easily derived based on the lemmas in Sect. 6.1, and we omit the detailed proofs.

We analyze the running time of Algorithm 6 below, where l^new is the number of all new labels. The definitions of l_q and d are the same as those in Theorem 3.

Theorem 6 The running time of Algorithm 6 is bounded by O(l^new d(log l^new d + l_q)).

Proof The proof is similar to that of Theorem 3, and we omit the details. □

Note that Algorithm 6 can be extended to handle out-of-order edge insertions (i.e., the time of the new edge is not the largest). Let t be the time of the new edge. Instead of picking the last interval in line 4, we derive the union of [t, t] and each interval in L_in(u_t)_u and add them to T_in (line 6). We revise lines 9–11 similarly. The revised algorithm computes the complete index to answer any possible query, but the index may not be minimal for out-of-order insertions.

Handling simultaneous edges. When multiple edges arrive with the same new time, iteratively processing each edge by invoking Algorithm 6 still works but may not guarantee the index minimality. We consider an example shown in Fig. 4. The ranks of three vertices are O(w) < O(u) < O(v). Before the insertion of ⟨u, w, 9⟩, we have ⟨w, 6, 7⟩ ∈ L_in(v), ⟨u, 6, 9⟩ ∈ L_in(v), and u cannot reach w. Note that ⟨u, 6, 9⟩ is added to the in-label of v due to the insertion of another edge at the time 9. When ⟨u, w, 9⟩ is inserted, we add ⟨w, 9, 9⟩ to the out-label of u by Algorithm 6. As a result, w covers the reachability tuple ⟨u, v, 6, 9⟩, and ⟨u, 6, 9⟩ in L_in(v) is redundant.

Even though such a case is not common in most datasets according to our performance studies, we extend Algorithm 6 to guarantee the index minimality theoretically. We make the following observation based on the example above.


Lemma 15 Given a new CRT ⟨u, v, t_s, t_e⟩ by the insertion of an edge ⟨u_t, v_t, t⟩, ⟨u, v, t_s, t_e⟩ becomes redundant when inserting an edge ⟨u_t′, v_t′, t′⟩ with t′ ≥ t only if t′ = t.

Lemma 15 reveals that the index redundancy only happens for the new edges coming at the same time. Therefore, instead of processing each single edge, we process all new edges coming at the same time together. Specifically, we first perform lines 1–13 of Algorithm 6 for every edge and merge reachability tuples of all edges as one sorted list T. Assume that u is the vertex with the highest rank in all tuples of T. Unlike single edge insertion, there may exist several tuples starting (or reaching) from u in T. Accordingly, we modify the phase of priority-queue based search (lines 14–21). Instead of only pushing one tuple to the priority queue Q (line 17 and line 20), we push all tuples starting from u to Q and perform line 18. We push all tuples ending at v to Q and perform line 21. In this way, each derived CRT can never be dominated in future iterations.

6.2 Edge deletion

In certain applications, the index may never need to support the query intervals starting earlier than a given time. We propose an algorithm to dynamically prune the index in this subsection. Let t_min be the earliest time of all edges in the temporal graph. We start by considering the case of removing all edges at t_min. The updating method is simple and efficient based on the following lemma.

Lemma 16 Let T− be the set of all CRTs not starting from t_min. T− is exactly the set of all CRTs in the graph after deleting all edges at t_min.

Proof Let G− be the temporal graph after deleting all edges at t_min. Based on Definition 4, deleting an edge at t_min would not break any CRT starting after t_min. Therefore, every CRT in T− is still valid in G−. Next we show the completeness of T−. Assume that there exists a CRT c of G− not in T−. Given that T− is the set of all CRTs not starting from t_min, c must be dominated by a tuple starting from t_min, which contradicts that c starts after t_min. The proof is finished. □

Based on Lemma 16, we only need to remove all labels containing t_min to finish updating the index. We give the pseudocode for the edge deletion in Algorithm 7. The parameter t represents the earliest supported time of the updated index. For the case of removing all edges at t_min, we set t = t_min + 1 in Algorithm 7. Note that in lines 3 and 7, intervals have been sorted chronologically. It is clear to see that the index returned by Algorithm 7 is still minimal. The time complexities for deleting edges are given as follows.

Algorithm 7: TILL-Delete()
Input: TILL-Index of G and the earliest supported time t of the index
Output: the updated index
 1 foreach u ∈ V do
 2     foreach v ∈ V(L_in(u)) do
 3         foreach [t_s, t_e] ∈ L_in(u)_v do
 4             if t_s ≥ t then break;
 5             remove [t_s, t_e];
 6     foreach v ∈ V(L_out(u)) do
 7         foreach [t_s, t_e] ∈ L_out(u)_v do
 8             if t_s ≥ t then break;
 9             remove [t_s, t_e];

Lemma 17 The running time to delete all edges at the earliest time t_min is bounded by O(Σ_{u∈V} (|V(L_in(u))| + |V(L_out(u))|)).

Lemma 18 The running time of Algorithm 7 for an arbitrary parameter t is bounded by O(l), where l is the number of all labels.

7 Experiments

We conducted extensive experiments to evaluate the performance of our proposed algorithms, summarized as follows:

– Online-Reach: Algorithm 1.
– Span-Reach: Algorithm 4.
– ES-Reach: a naive method to answer θ-reachability by invoking several runs of Span-Reach(). More details can be found in Sect. 5.2.
– ES-Reach∗: Algorithm 5.
– TILL-Construct: a basic implementation of Algorithm 2. We use a queue to compute all SRTs and get CRTs by checking whether every SRT can be covered by existing labels. More details can be found in Sect. 4.1.
– TILL-Construct∗: Algorithm 3.
– TILL-Insert: the algorithm for edge insertion.
– TILL-Delete: the algorithm for edge deletion.

Table 2 Network statistics

Dataset        M   |V|         |E|           ϑ_G
CollegeMsg     D   1899        59,835        16,736,181
Chess          D   7301        65,053        99
Slashdot       D   51,083      140,778       1,157,361,660
MathOverflow   D   24,818      506,500       203,068,736
Facebook_f     U   63,731      817,035       1,232,231,923
Epinions       D   131,828     841,372       944
Facebook_wp    D   46,952      876,993       134,873,285
AskUbuntu      D   159,316     964,437       225,834,442
Enron          D   87,273      1,148,072     1,401,187,797
SuperUser      D   194,085     1,443,339     239,614,928
Digg           D   279,630     1,731,653     1,247,032,805
Wiki           U   118,100     2,917,785     239,001,193
Prosper        D   89,269      3,394,979     2142
Arxiv          U   28,093      4,596,803     3649
Youtube        U   3,223,589   9,375,374     225
DBLP           U   1,314,050   18,986,618    76
Flickr         D   2,302,925   33,140,017    197
DBLP_p         U   2,828,689   156,773,140   10

Fig. 5 Time of span-reachability query processing (running time in µs of Online-Reach and Span-Reach on each dataset)
2

1
-2
10
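To make the TILL-Delete entry in the list above more concrete, the pruning idea behind Lemma 16 can be sketched in a few lines of Python. This is only an illustration and not the paper's Algorithm 7 (which is implemented in C++): the index layout (a dictionary from each vertex to its hub vertices, each holding a list of `(start, end)` intervals sorted by start time) and the name `prune_labels` are assumptions made for this sketch. Because the intervals are assumed to be sorted chronologically, all intervals that start before `t` form a prefix of each list and can be dropped together.

```python
from bisect import bisect_left


def prune_labels(labels, t):
    """Remove every label interval that starts before time t.

    labels: dict vertex -> dict hub -> list of (start, end) tuples,
            each list sorted by start time (hypothetical layout).
    Returns the number of intervals removed.
    """
    removed = 0
    for hubs in labels.values():
        for hub in list(hubs):
            intervals = hubs[hub]
            # (t,) sorts before any (t, end), so `cut` counts the
            # intervals whose start time is strictly smaller than t.
            cut = bisect_left(intervals, (t,))
            removed += cut
            if cut == len(intervals):
                del hubs[hub]          # no interval of this hub survives
            elif cut:
                hubs[hub] = intervals[cut:]
    return removed
```

Under these assumptions, removing all edges at t_min corresponds to calling `prune_labels(index, t_min + 1)` once for the in-labels and once for the out-labels; each stored interval is touched at most once, in the spirit of the O(l) bound of Lemma 18.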
All algorithms were implemented in C++ and compiled using a g++ compiler at the -O3 optimization level. All the experiments were conducted on a Linux Server with an Intel Xeon 2.7GHz CPU and 180GB RAM.

Fig. 6 Average time of span-reachability query processing (true cases) (bar chart: running time in µs, log scale, of Online-Reach and Span-Reach on each dataset)
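For readers who want a feel for the two baselines in the algorithm list, the following Python sketch may help. It rests on assumptions that are not spelled out in this section: that u span-reaches v in [t1, t2] when v can be reached from u using only edges whose timestamps lie in [t1, t2] (in any order), that timestamps are integers, and that θ-reachability asks for some window of θ consecutive time units inside the query interval. The function names are hypothetical, the online search is a plain BFS rather than the paper's Algorithm 1, and the naive θ check calls that BFS where ES-Reach would call the index-based Span-Reach.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """edges: iterable of directed temporal edges (u, v, t)."""
    adj = defaultdict(list)
    for u, v, t in edges:
        adj[u].append((v, t))
    return adj


def span_reach_online(adj, u, v, t1, t2):
    """BFS from u that only follows edges with timestamp in [t1, t2]."""
    if u == v:
        return True
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y, t in adj.get(x, ()):
            if t1 <= t <= t2 and y not in seen:
                if y == v:
                    return True
                seen.add(y)
                queue.append(y)
    return False


def theta_reach_naive(adj, u, v, t1, t2, theta):
    """Try every window of theta time units inside [t1, t2]."""
    for start in range(t1, t2 - theta + 2):
        if span_reach_online(adj, u, v, start, start + theta - 1):
            return True
    return False
```

When θ equals the number of time units in [t1, t2], the loop runs exactly once over the whole interval, so the naive θ check reduces to a single span-reachability test.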
Fig. 7 Average time of span-reachability query processing (false cases) (bar chart: running time in µs, log scale, of Online-Reach and Span-Reach on each dataset)

Datasets. We conducted experiments on eighteen real-world graphs. The detailed statistics of these datasets are summarized in Table 2. M denotes the type of each dataset, where D represents a directed graph and U represents an undirected graph. ϑ_G denotes the number of atomic time units between the smallest timestamp and the largest timestamp. DBLP_p is a graph generated from the data of DBLP from 2011 to 2020. Each vertex is a publication. Two vertices are connected by an edge if they have a common author. The time of the edge is the later publication year of the two vertices. All other networks and corresponding detailed descriptions can be found in SNAP¹ and KONECT².
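The construction of DBLP_p described above can be illustrated with a short Python sketch. The input format (a list of `(publication_id, year, authors)` records) and the function names are assumptions for illustration only; the authors' actual preprocessing is not given here. The helper `time_span` counts time units inclusively, which is consistent with Table 2 listing ϑ_G = 10 for the years 2011 to 2020.

```python
from collections import defaultdict
from itertools import combinations


def build_publication_graph(publications):
    """publications: iterable of (pub_id, year, authors).

    Returns sorted undirected temporal edges (p, q, t) with p < q, where
    p and q share at least one author and t is the later publication year.
    """
    year_of = {}
    by_author = defaultdict(list)
    for pub_id, year, authors in publications:
        year_of[pub_id] = year
        for author in set(authors):
            by_author[author].append(pub_id)
    edges = set()
    for pubs in by_author.values():
        for p, q in combinations(sorted(pubs), 2):
            edges.add((p, q, max(year_of[p], year_of[q])))
    return sorted(edges)


def time_span(edges):
    """Number of atomic time units between the smallest and largest timestamp."""
    times = [t for _, _, t in edges]
    return max(times) - min(times) + 1
```

Using a set for the edges means that two publications sharing several authors are still connected by a single edge.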
The rest of this section is organized as follows. Section 7.1 provides the performance of answering span-reachability queries. Section 7.2 evaluates the index construction algorithms. Section 7.3 reports the performance of answering θ-reachability queries. Section 7.4 reports the performance of index maintenance. Section 7.5 reports the performance for continuous query processing.

Fig. 8 Percentage of true cases in 1000 queries (bar chart: percentage, 0% to 100%, on each dataset)

7.1 Span-reachability query processing

We evaluate the performance of span-reachability query processing. To generate input queries, we randomly pick 100 vertex pairs in each graph G. For each vertex pair, we randomly generate subintervals of [1, ϑ_G] and only keep intervals if the conditions in Lemmas 9 and 10 are satisfied. We repeat this step until 10 intervals are found. This strategy works because the query algorithm is only invoked if the conditions in Lemmas 9 and 10 hold. As a result, we prepare 1000 span-reachability queries in total. We report the running time of Span-Reach for these 1000 queries with Online-Reach as a comparison in Fig. 5. In addition to the overall query time, we report the average time of all cases in which the result is true in Fig. 6 and the average time of all false cases in Fig. 7. The percentage of true cases in 1000 queries for each dataset is reported in Fig. 8.

We can see that the running time of Span-Reach is at least two orders of magnitude smaller than that of Online-Reach in all datasets in the experiment. For example, in the largest dataset Flickr, Online-Reach takes over 30 seconds, while our Span-Reach algorithm takes only about 1.4 ms (1 s = 10³ ms = 10⁶ μs). For each dataset, the average running time of true cases is smaller than that of false cases since we need to scan all labels if the result is false.

7.2 Index Construction

This section is devoted to evaluating the performance of index construction algorithms.

¹ http://snap.stanford.edu/data/index.html
² http://konect.cc/

Fig. 9 Index size (bar chart: graph size and index size in KB, log scale, on each dataset)

Fig. 10 Indexing Time (bar chart: running time in s, log scale, of TILL-Construct and TILL-Construct∗ on each dataset)

7.2.1 Index size

We report the index size of all datasets in Fig. 9 and also add the size of datasets as a comparison. We can find that in several large datasets, the index size is smaller than the graph size. For example, in Flickr, the dataset takes about 400 MB while the index takes only about 350 MB.

7.2.2 Indexing Time

The running time of TILL-Construct∗ for all datasets is reported with TILL-Construct as a comparison in Fig. 10. Note that the running time of TILL-Construct on several datasets is not given because the algorithm cannot finish within twenty-four hours. Across all reported times of TILL-Construct, TILL-Construct∗ is clearly at least two orders of magnitude faster. For example, in Flickr, TILL-Construct∗ takes about 1.5 hours to compute TILL-Index. TILL-Construct∗ takes about 1 second on Chess, which is the shortest of all reported times. By contrast, the running time of TILL-Construct on Chess is about 20 minutes.

7.2.3 Varying ϑ

The running times and index sizes of TILL-Construct∗ are presented in Fig. 11 by varying the input parameter ϑ from 20% to 100% of ϑ_G for each dataset G. Note that ϑ = ϑ_G is equivalent to the default setting ϑ = +∞. Due to limited space here, Fig. 11 shows only the results of four datasets: Enron, Youtube, DBLP and Flickr. The results for other datasets display similar trends.

We can see from the figures (a)–(d) that the growth of the running time slows down as both the vertex and edge sampling ratios increase. For example, the running time of TILL-Construct∗ on Flickr is about 14 minutes when the edge sampling ratio is 20%. It reaches 22 minutes, 35 minutes and 73 minutes when the edge sampling ratio is 40%, 60% and 80%, respectively. Finally, at the ratio of 100%, the time reaches about 90 minutes. The increasing trends for the index size in figures (e)–(h) are similar and even more gentle.

Fig. 11(a)–(d) reports the running times. We can see that the increases on both Enron and DBLP are not obvious (they do not exceed 20 seconds) from 20% to 100%. The lines are almost linear in Youtube and Flickr, which start from about 500 seconds and 25 minutes, ending at about 750 seconds and 1.5 hours, respectively. Fig. 11(e)–(h) reports the index size. The change on all reported datasets is very small. The group of figures shows that the index size and indexing time remain bounded even though we do not set any interval length limitation (ϑ = +∞) in TILL-Construct∗.

Fig. 11 Varying ϑ of TILL-Construct∗ (panels (a)–(d): running time in s on Enron, Youtube, DBLP and Flickr; panels (e)–(h): index size in MB on the same datasets; ϑ from 20% to 100% of ϑ_G)

7.2.4 Scalability

This experiment tests the scalability of our index construction algorithm, which is shown in Fig. 12. We only report results for four real-world graph datasets as representatives: Enron,

Youtube, DBLP and Flickr. The results on other datasets show similar trends. For each dataset, we vary the graph size and graph density by randomly sampling vertices and edges from 20% to 100%. When sampling vertices, we derive the induced subgraph of the sampled vertices, and when sampling edges, we select the incident vertices of the edges as the vertex set.

Fig. 12 Scalability of index construction (panels (a)–(d): running time in s, log scale, under vertex sampling and edge sampling on Enron, Youtube, DBLP and Flickr; panels (e)–(h): index size in MB, log scale, on the same datasets; sampling ratio from 20% to 100%)

Fig. 13 Performance of θ-reachability query processing (panels (a)–(d): running time in µs, log scale, on Enron, Youtube, DBLP and Flickr; θ from 10% to 90% of the interval length)

Table 3 Performance of index maintenance (μs)

Dataset        Insertion   Deletion   Q online   Q index
CollegeMsg     126         0.14       6          0.12
Chess          10          0.08       8          0.09
Slashdot       67          0.58       80         0.12
MathOverflow   577         0.24       27         0.13
Facebook_f     6511        0.61       93         0.25
Epinions       604         0.02       715        1.42
Facebook_wp    288         0.41       505        1.33
AskUbuntu      306         0.32       675        1.60
Enron          384         0.16       374        1.94
SuperUser      417         0.39       1027       2.13
Digg           1701        0.70       1525       2.12
Wiki           3394        1.37       268        0.51
Prosper        9           0.10       3140       1.81
Arxiv          159         0.27       588        1.04
Youtube        2345        0.99       2849       4.54
DBLP           12          0.05       2613       4.81
Flickr         302         0.01       38,879     1.48
DBLP_p         10          0.01       150,414    2.59
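The two sampling schemes used in the scalability test of Sect. 7.2.4 (an induced subgraph for vertex sampling, and the incident vertices of the chosen edges for edge sampling) can be sketched in Python as follows. The function names, the `(u, v, t)` edge format and the seeded random generator are assumptions made for this illustration; the paper does not describe its sampling code.

```python
import random


def sample_vertices(edges, ratio, seed=0):
    """Keep a random fraction of vertices and the subgraph they induce."""
    rng = random.Random(seed)
    vertices = sorted({x for u, v, _ in edges for x in (u, v)})
    kept = set(rng.sample(vertices, round(len(vertices) * ratio)))
    # Induced subgraph: an edge survives only if both endpoints were kept.
    sub_edges = [(u, v, t) for u, v, t in edges if u in kept and v in kept]
    return kept, sub_edges


def sample_edges(edges, ratio, seed=0):
    """Keep a random fraction of edges; the vertex set is their endpoints."""
    rng = random.Random(seed)
    edge_list = list(edges)
    sub_edges = rng.sample(edge_list, round(len(edge_list) * ratio))
    kept = {x for u, v, _ in sub_edges for x in (u, v)}
    return kept, sub_edges
```

Note that vertex sampling at a given ratio usually keeps far fewer than that fraction of the edges, since an edge needs both of its endpoints to be selected, whereas edge sampling fixes the edge count directly.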

7.3 θ-Reachability query processing

We evaluate the performance of θ-reachability query processing in this subsection. To prepare the input queries, we adopt the same strategy described in Sect. 7.1 and randomly pick 100 vertex pairs and 10 intervals for each vertex pair. For each interval, we set θ as a fraction of its length and adjust the fraction from 10% to 90%. The running time of ES-Reach∗ on four representative datasets is given in Fig. 13, with ES-Reach as a comparison.

We can see from Fig. 13 that ES-Reach∗ is faster than ES-Reach on all parameter settings. Their running times converge as θ increases, since the two algorithms are equivalent when θ is the length of the query interval. For the performance of ES-Reach∗, it is clear that all lines present roughly downward trends.

7.4 Index maintenance

We report the practical performance of index maintenance algorithms. For the edge insertion, we pick the latest ten percent of all edges and insert them into the temporal graph formed by the first ninety percent. We record the average processing time for each edge. For the edge deletion, we pick the earliest

ten percent of edges and delete them from the original temporal graph. Similarly, we record the average processing time. The results are shown in Table 3. For comparison, we also report the average times of an online span-reachability query and an index-based span-reachability query, respectively. We can see that the edge deletion is extremely fast due to its lightweight processing strategy. In most datasets, the processing time of an edge insertion is smaller than that of an online query. Given that the index-based query time is almost negligible compared with the online query time, the results support that our index-based solution still works well for dynamic temporal graphs.

7.5 Continuous query processing

We simulate a real scenario by continuously maintaining a time window for two representative datasets Enron and DBLP. For each dataset, we initially pick the first 50% of the edges and construct a temporal graph. Then, we slide the time window by adding 10% new edges and removing the oldest 10% of the edges each time.

Figure 14 reports the performance of query processing in different time windows. Span-Reach represents the query algorithm where the index is constructed from scratch for the current time window. Span-Reach (update) represents the query algorithm where the index is updated from previous windows. Note that vertices always follow the degree order of the initial window (0% – 50%) when updating the index. We can see that the query times of Span-Reach and Span-Reach (update) are almost the same in all windows.

Figures 15 and 16 report the indexing time and the index size, respectively, when sliding the time window. The index size is almost the same even though we use the same degree order of the original temporal graph. Note that the indexing time for the initial window (0% – 50%) is relatively large since more times are included in the first 50% of the edges.

Fig. 14 Query time by sliding time window (running time in µs, log scale, of Online-Reach, Span-Reach and Span-Reach (update) for windows 0%–50% to 50%–100%; panels (a) Enron, (b) DBLP)

Fig. 15 Indexing time by sliding time window (running time in s, log scale, of Compute from Scratch and Dynamic Update for windows 0%–50% to 50%–100%; panels (a) Enron, (b) DBLP)

Fig. 16 Index size by sliding time window (index size in MB of Compute from Scratch and Dynamic Update for windows 0%–50% to 50%–100%; panels (a) Enron, (b) DBLP)

8 Related works

Reachability in temporal graphs. The time-respecting path is defined in [21] to model the reachability problem in temporal graphs. A similar concept is also studied using the terms journey [14,34] or non-decreasing path [10]. Based on the time-respecting path, an index-based algorithm to efficiently answer the reachability problem in temporal graphs is studied in [33] and is improved in [39] for the distributed environment. The time-respecting model only requires a non-decreasing order of edge times in each valid path. Accordingly, each temporal graph can be transformed to an unlabeled directed graph without breaking the correctness of any time-respecting reachability query. However, the definition of the span-reachability model is totally different, and existing indexing techniques cannot be applied. The historical reachability problem is studied in [25]. Given an interval [t1, t2] and a pair of vertices u, v, the conjunctive historical reachability of u, v is true if for each possible t ∈ [t1, t2], there exists a path connecting u, v and all timestamps in the path are t. The disjunctive historical reachability of u, v is true if there exists a timestamp t ∈ [t1, t2] and a path connecting u, v in which all timestamps in the path are t [25]. Other mining problems in temporal graphs can be found in surveys [8,18,23].

Reachability in static graphs and dynamic graphs. A large number of works have been done to design an index for answering the reachability query in static graphs [2,9,11,13,15,19,24,27,29,31,35,36]. These works only focus on the topological structure of graphs and ignore the temporal information. Distributed algorithms for reachability testing are also studied in [40]. Interested readers can find more details in

surveys [6,38]. Several works study the index maintenance in dynamic graphs [7,24,37,41]. Estimating reachability based on random walks is studied in [26].

9 Conclusion

In this paper, we define a span-reachability model to capture entity relationships in a specific period of temporal graphs. We propose an index-based method based on the concept of two-hop cover to answer the span-reachability query for any pair of vertices and time intervals. Several optimizations are given to improve the efficiency of index construction. We also study the problem of θ-reachability, which is a generalized version of span-reachability. Index maintenance algorithms are proposed for dynamic temporal graphs. We conduct extensive experiments on eighteen real-world datasets to show the efficiency of our proposed algorithms.

Acknowledgements Ying Zhang is supported by ARC FT170100128 and ARC DP210101393. Lu Qin is supported by ARC FT200100787 and DP210101347. Dawei Cheng is supported by the National Science Foundation of China under grant no 62102287. Wenjie Zhang is supported by ARC DP180103096 and ARC DP200101116.

References

1. Abraham, I., Delling, D., Goldberg, A.V., Werneck, R.F.: Hierarchical hub labelings for shortest paths. In: ESA, pp. 24–35 (2012)
2. Agrawal, R., Borgida, A., Jagadish, H.V.: Efficient management of transitive relationships in large data and knowledge bases. SIGMOD 18, 253–262 (1989)
3. Akiba, T., Iwata, Y., Yoshida, Y.: Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In: SIGMOD, pp. 349–360 (2013)
4. Akiba, T., Iwata, Y., Yoshida, Y.: Dynamic and historical shortest-path distance queries on large evolving networks by pruned landmark labeling. In: Chung, C., Broder, A.Z., Shim, K., Suel, T. (eds.) WWW, pp. 237–248. ACM (2014)
5. Anyanwu, K., Sheth, A.: ρ-queries: enabling querying for semantic associations on the semantic web. In: WWW, pp. 690–699 (2003)
6. Bonifati, A., Fletcher, G., Voigt, H., Yakovets, N.: Querying graphs. Synth. Lect. Data Manag. 10(3), 1–184 (2018)
7. Bramandia, R., Choi, B., Ng, W.K.: On incremental maintenance of 2-hop labeling of graphs. In: WWW, pp. 845–854 (2008)
8. Casteigts, A., Flocchini, P., Quattrociocchi, W., Santoro, N.: Time-varying graphs and dynamic networks. Int. J. Parallel Emerg. Distrib. Syst. 27(5), 387–408 (2012)
9. Chen, Y., Chen, Y.: An efficient algorithm for answering graph reachability queries. In: ICDE, pp. 893–902 (2008)
10. Cheng, E., Grossman, J.W., Lipman, M.J.: Time-stamped graphs and their associated influence digraphs. Discrete Appl. Math. 128(2–3), 317–335 (2003)
11. Cheng, J., Huang, S., Wu, H., Fu, A.W.-C.: Tf-label: a topological-folding labeling scheme for reachability querying in a large graph. In: SIGMOD, pp. 193–204 (2013)
12. Chvatal, V.: A greedy heuristic for the set-covering problem. Math. Operations Res. 4(3), 233–235 (1979)
13. Cohen, E., Halperin, E., Kaplan, H., Zwick, U.: Reachability and distance queries via 2-hop labels. SIAM J. Comput. 32(5), 1338–1355 (2003)
14. Ferreira, A.: On models and algorithms for dynamic communication networks: The case for evolving graphs. In: Proc. ALGOTEL (2002)
15. Gao, Y., Zhang, T., Qiu, L., Linghu, Q., Chen, G.: Time-respecting flow graph pattern matching on temporal graphs. IEEE Trans. Knowl. Data Eng. 33, 3453–3467 (2020)
16. Gurukar, S., Ranu, S., Ravindran, B.: Commit: A scalable approach to mining communication motifs from dynamic networks. In: SIGMOD, pp. 475–489 (2015)
17. Holme, P., Edling, C.R., Liljeros, F.: Structure and time evolution of an internet dating community. Soc. Netw. 26(2), 155–174 (2004)
18. Holme, P., Saramäki, J.: Temporal networks. Phys. Rep. 519(3), 97–125 (2012)
19. Jin, R., Xiang, Y., Ruan, N., Fuhry, D.: 3-hop: a high-compression indexing scheme for reachability query. In: SIGMOD, pp. 813–826 (2009)
20. Jin, R., Xiang, Y., Ruan, N., Wang, H.: Efficiently answering reachability queries on very large directed graphs. In: SIGMOD, pp. 595–608 (2008)
21. Kempe, D., Kleinberg, J., Kumar, A.: Connectivity and inference problems for temporal networks. J. Comput. Syst. Sci. 64(4), 820–842 (2002)
22. Li, R.-H., Su, J., Qin, L., Yu, J.X., Dai, Q.: Persistent community search in temporal networks. In: ICDE, pp. 797–808 (2018)
23. Michail, O.: An introduction to temporal graphs: an algorithmic perspective. Internet Math. 12(4), 239–280 (2016)
24. Schenkel, R., Theobald, A., Weikum, G.: Efficient creation and incremental maintenance of the hopi index for complex xml document collections. In: ICDE, pp. 360–371 (2005)
25. Semertzidis, K., Pitoura, E., Lillis, K.: Timereach: Historical reachability queries on evolving graphs. In: EDBT, pp. 121–132 (2015)
26. Sengupta, N., Bagchi, A., Ramanath, M., Bedathur, S.: Arrow: Approximating reachability using random walks over web-scale graphs. In: ICDE, pp. 470–481 (2019)
27. Su, J., Zhu, Q., Wei, H., Yu, J.X.: Reachability querying: can it be even faster? TKDE 29(3), 683–697 (2016)
28. Viard, T., Latapy, M., Magnien, C.: Computing maximal cliques in link streams. Theor. Comput. Sci. 609, 245–252 (2016)
29. Wang, H., He, H., Yang, J., Yu, P.S., Yu, J.X.: Dual labeling: Answering graph reachability queries in constant time. In: ICDE, p. 75 (2006)
30. Wang, S., Lin, W., Yang, Y., Xiao, X., Zhou, S.: Efficient route planning on public transportation networks: A labelling approach. In: SIGMOD, pp. 967–982 (2015)
31. Wei, H., Yu, J.X., Lu, C., Jin, R.: Reachability querying: An independent permutation labeling approach. PVLDB 7(12), 1191–1202 (2014)
32. Wen, D., Huang, Y., Zhang, Y., Qin, L., Zhang, W., Lin, X.: Efficiently answering span-reachability queries in large temporal graphs. In: ICDE, pp. 1153–1164. IEEE (2020)
33. Wu, H., Huang, Y., Cheng, J., Li, J., Ke, Y.: Reachability and time-based path queries in temporal graphs. In: ICDE, pp. 145–156 (2016)
34. Xuan, B.B., Ferreira, A., Jarry, A.: Computing shortest, fastest, and foremost journeys in dynamic networks. Int. J. Found. Comput. Sci. 14(02), 267–285 (2003)
35. Yano, Y., Akiba, T., Iwata, Y., Yoshida, Y.: Fast and scalable reachability queries on graphs by pruned labeling with landmarks and paths. In: CIKM, pp. 1601–1606 (2013)

36. Yıldırım, H., Chaoji, V., Zaki, M.J.: Grail: a scalable index for reachability queries in very large graphs. VLDBJ 21(4), 509–534 (2012)
37. Yildirim, H., Chaoji, V., Zaki, M.J.: Dagger: A scalable index for reachability queries in large dynamic graphs. arXiv preprint arXiv:1301.0977 (2013)
38. Yu, J.X., Cheng, J.: Graph reachability queries: a survey. In: Managing and Mining Graph Data, pp. 181–215 (2010)
39. Zhang, T., Gao, Y., Chen, L., Guo, W., Pu, S., Zheng, B., Jensen, C.S.: Efficient distributed reachability querying of massive temporal graphs. VLDBJ, pp. 1–26 (2019)
40. Zhang, T., Gao, Y., Li, C., Ge, C., Guo, W., Zhou, Q.: Distributed reachability queries on massive graphs. DASFAA 11448, 406–410 (2019)
41. Zhu, A.D., Lin, W., Wang, S., Xiao, X.: Reachability queries on large dynamic graphs: a total order approach. In: SIGMOD, pp. 1323–1334 (2014)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
