Keyword Search On External Memory Data Graphs: Bhavana Dalvi Meghana Kshirsagar
E.g. organizational, government, scientific, medical
Often no schema or partially defined schema
Lowest common denominator model, across relational, HTML, XML, RDF
Much recent work on extracting and integrating data into a graph model
Keyword search is a natural way to query such data graphs, esp. in the absence of schema
Normalization (implicit/explicit) splits related data across multiple nodes
To answer a keyword query we need to find a (closely) connected set of entities that together match all given keywords
Query: set of keywords
Answer: rooted directed tree connecting keyword nodes (e.g. BANKS)
Answer relevance based on
[Figure: example answer tree; labels: writes, writes, author, Soumen C.]
Problem: what if graph size > memory?
Motivation: Web crawl graphs, social networks, Wikipedia, data generated by IE from the Web
Algorithm alternatives:
Alternative 1: Virtual memory
-ve: thrashing (experimental results later)
Alternative 2: SQL
-ve: for relational data only
-ve: not good for top-k answer generation
Our proposal: use an in-memory graph summary to focus search on relevant parts of the graph and avoid IO for the rest of the graph
Related Work
Idea: avoid search at query time, use only inverted list merge
Drawbacks include high space overhead (ObjectRank, EKSO)
Several algorithms (Nodine, Buchsbaum, etc.) that give worst-case guarantees, but require excessive replication
Several algorithms (Shekhar, Chang, etc.), but all depend on properties specific to road networks (large diameter, near planarity, etc.)
For visualization (Lieserson, Buchsbaum, etc.)
For web graph computations (Raghavan and Garcia-M.)
Hierarchical clustering
Supernode Graph
Inner node
First-Attempt Algorithm:
Expand all supernodes from supernode results
Phase 2: Search on this expanded component of the graph to get final top-k results
Top-k on the expanded component may not be top-k on the full graph
Experiments show poor recall
The multi-granular graph contains:
inner nodes from expanded supernodes
unexpanded supernodes
edges between these nodes
The multi-granular graph evolves as execution proceeds and supernodes get expanded
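The definition above can be illustrated with a small in-memory sketch. This is not the authors' implementation: the class name `MultiGranularGraph`, its fields, and the `expand` method are hypothetical, edge weights are ignored, and the full edge list is held in memory only to keep the example short (a real system would read a supernode's contents from disk on expansion).

```python
class MultiGranularGraph:
    """Toy multi-granular graph (illustrative names, unweighted edges)."""

    def __init__(self, members, inner_edges):
        # members: supernode id -> inner node ids it summarizes (assumed input)
        # inner_edges: (u, v) edges of the full graph; on disk in a real system
        self.members = {s: set(ns) for s, ns in members.items()}
        self.owner = {n: s for s, ns in self.members.items() for n in ns}
        self.inner_edges = list(inner_edges)
        self.expanded = set()

    def expand(self, supernode):
        # Placeholder for the disk read that fetches the supernode's contents.
        self.expanded.add(supernode)

    def visible(self, inner):
        # An inner node appears as itself only once its supernode is expanded.
        s = self.owner[inner]
        return inner if s in self.expanded else s

    def edges(self):
        # Project each full-graph edge onto the current granularity,
        # dropping edges that collapse inside one unexpanded supernode.
        out = set()
        for u, v in self.inner_edges:
            a, b = self.visible(u), self.visible(v)
            if a != b:
                out.add((a, b))
        return out
```

Calling `expand` on a supernode replaces its S-S edges with S-I and I-I edges the next time `edges()` is computed, which is the sense in which the graph evolves during execution.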
Multi-Granular Graph
[Figure: multi-granular graph with supernodes S1, S2, S3, S4. Key: expanded supernode, I-I edge, S-I edge, S-S edge]
Edge weights, supernode to inner node:
wt(S → j):
wt(j → S):
Expand supernodes in top answers
Any in-memory search algorithm can be used
Iteration will terminate
What if too many nodes are expanded?
Evict expanded nodes from cache, but retain them in the logical MG graph and re-fetch as required
Significantly reduces IO compared to search using virtual memory
BUT: high CPU cost due to multiple iterations, with each iteration starting the search from scratch
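The iterative loop described above can be sketched as follows. The function name `iterative_search` and its callback parameters (`search`, `expand`, `is_supernode`) are assumptions made for illustration; the slides only state that any in-memory search algorithm can be plugged in and that supernodes in top answers get expanded.

```python
def iterative_search(search, expand, is_supernode, k):
    """Repeat an in-memory top-k search until no top answer contains a supernode.

    search()        -> list of answers (best first), each an iterable of node ids
    expand(s)       -> expands supernode s in the multi-granular graph
    is_supernode(n) -> True if n is an unexpanded supernode
    All three callbacks are hypothetical hooks, not part of the original work.
    """
    iterations = 0
    while True:
        iterations += 1
        answers = search()[:k]  # each iteration restarts the search from scratch
        to_expand = {n for ans in answers for n in ans if is_supernode(n)}
        if not to_expand:
            return answers, iterations  # answers contain inner nodes only
        for s in to_expand:
            expand(s)
```

Each pass either returns or expands at least one supernode, and there are finitely many supernodes, so the loop terminates. The restart on every pass is the CPU cost noted above.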
Incremental Search
Motivation: repeated restarts of search in Iterative Search
Basic idea:
Search on the multi-granular graph
Expand supernode(s) in the top answer
Unlike Iterative Search: update the state of the search algorithm when a supernode is expanded, and continue the search instead of restarting
[Figure: SPI tree example; labels: writes, authors, Soumen C., Byron Dom]
One instance of Dijkstra's algorithm per keyword
Explored nodes: nodes for which the shortest path has already been found
Fringe nodes: unexplored nodes adjacent to explored nodes
Shortest-Path Iterator Tree (SPI-Tree): tree containing explored and fringe nodes, with an edge u → v if the (current) shortest path from u to the keyword passes through v
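A minimal sketch of one shortest-path iterator, assuming a plain weighted directed graph held in a dictionary. The class name `ShortestPathIterator`, the `in_edges` layout, and the method names are hypothetical; the sketch covers only the standard Dijkstra part and not the incremental update on supernode expansion.

```python
import heapq
import itertools


class ShortestPathIterator:
    """Dijkstra from one keyword node, run backwards over incoming edges."""

    def __init__(self, in_edges, keyword_node):
        self.in_edges = in_edges            # v -> list of (u, weight) for edges u -> v
        self.dist = {keyword_node: 0.0}     # best known path cost to the keyword
        self.parent = {keyword_node: None}  # SPI-tree edge: u -> parent[u]
        self.explored = set()
        self._tie = itertools.count()       # tie-breaker so node ids are never compared
        self.heap = [(0.0, next(self._tie), keyword_node)]

    def fringe(self):
        # Nodes reached so far whose shortest path is not yet final.
        return set(self.dist) - self.explored

    def next_node(self):
        """Move the closest fringe node to the explored set and return it."""
        while self.heap:
            d, _, v = heapq.heappop(self.heap)
            if v in self.explored:
                continue                    # stale heap entry
            self.explored.add(v)
            for u, w in self.in_edges.get(v, []):
                if u not in self.explored and d + w < self.dist.get(u, float("inf")):
                    self.dist[u] = d + w
                    self.parent[u] = v      # current shortest path from u goes via v
                    heapq.heappush(self.heap, (d + w, next(self._tie), u))
            return v
        return None
```

Following `parent` pointers from any explored or fringe node leads to the keyword node, which is the tree structure the SPI-Tree definition describes. A backward search would keep one such iterator per keyword.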
Find the next best answer on the current multi-granular graph
If the answer has supernodes, expand the supernode(s)
Update the state of backward search, i.e. all SPI trees, to reflect the state change of the multi-granular graph due to expansion
[Figure: SPI tree update when supernode S1 is expanded]
1. Affected nodes get detached
2. Inner nodes get attached (as fringe nodes) to adjacent explored nodes based on shortest path to K1
3. Affected nodes get attached (as fringe nodes) to adjacent explored nodes based on shortest path to K1
Path costs of explored nodes may increase
Explored nodes may become fringe nodes
Incremental expansion: path costs may increase or decrease
Invariant
SPI trees reflect shortest paths for explored nodes in current multi-granular graph
Heuristics
Thrashing control: stop supernode expansion when the cache is full, and use only the parts of the graph already expanded for further search (details in paper)
Recall at or close to 100% for relevant answers, with heuristics, in our experiments (see paper for details)
Experimental Setup
Clustering algorithm: orthogonal to our work
Experiments use edge-prioritized BFS (details in paper)
Ongoing work: develop better clustering techniques
echo 3 > /proc/sys/vm/drop_caches

Dataset   Original Graph Size   Supernode Graph Size   Edges   Superedges
DBLP      99MB                  17MB                   8.5M    1.4M
IMDB      94MB                  33MB                   8M      2.8M

Cache sizes: 1024 (7MB), 3510 (24MB), 5851 (40MB)
Algorithms Compared
Use the same clustering as for the supernode graph
Fetch a cluster into cache whenever a node is accessed
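The fetch-on-access behaviour above can be sketched as a small cluster cache that counts misses. The LRU eviction policy, the class name `ClusterCache`, and the `load_cluster` / `cluster_of` callbacks are assumptions for illustration; the slides do not say which eviction policy was used, and capacity here is counted in clusters for simplicity.

```python
from collections import OrderedDict


class ClusterCache:
    """Cache of clusters with LRU eviction (policy assumed, not from the slides)."""

    def __init__(self, capacity, load_cluster, cluster_of):
        self.capacity = capacity          # max number of clusters held in memory
        self.load_cluster = load_cluster  # cluster id -> {node id: node data}
        self.cluster_of = cluster_of      # node id -> cluster id
        self.cache = OrderedDict()
        self.misses = 0

    def get_node(self, node):
        c = self.cluster_of(node)
        if c in self.cache:
            self.cache.move_to_end(c)           # mark as most recently used
        else:
            self.misses += 1                    # stands in for one disk fetch
            self.cache[c] = self.load_cluster(c)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used cluster
        return self.cache[c][node]
```

Counting `misses` this way gives a simple proxy for the IO that the experiments compare across algorithms.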
Sparse
SQL-based approach from Hristidis et al. [VLDB03]
Not applicable to graphs without schema
[Figure: results for all queries; series: All VM, All Incr.]
Note: Graphs in paper used wrong cache sizes for VM queries on IMDB (Q8, Q9, Q10 and Q12). Graph above shows corrected results, but there are no significant differences.
Conclusions
Graph summarization coupled with a multi-granular graph representation shows promise for external memory graph search
Ongoing/future work:
Applications in distributed memory graph search
Improved clustering techniques
Extending Incremental to bidirectional search and other graph search algorithms
Testing on really large graphs
The End
Queries?
1024 (7MB)
1536 (10.5MB)
2048 (14MB)
3510 (24MB)
5851 (40MB)
4023 (27.5MB)
6363 (43.5MB)
4535 (31MB)
6875 (47MB)
For IMDB queries Q8-Q10 and Q12, in the case of VMSearch, cache sizes from DBLP were inadvertently used earlier instead of the cache sizes shown above. Queries were rerun with the correct cache sizes, but there were no changes in the relative performance of Incremental versus VMSearch, on cache misses as well as time taken.