Keyword Search On External Memory Data Graphs: Bhavana Dalvi Meghana Kshirsagar

This document proposes techniques for performing keyword search on large graph-structured data that exceeds memory size. It presents a multi-granular graph representation where the original graph is summarized into a supernode graph in memory, and supernodes are incrementally expanded and cached as needed. An iterative search algorithm expands supernodes in top answers, while an incremental search algorithm updates search state when supernodes expand. Experimental results on DBLP and IMDB graphs show the incremental approach outperforms alternatives by reducing I/O and processing time for keyword queries.


Keyword Search on External Memory Data Graphs

Bhavana Dalvi*, Meghana Kshirsagar#, S. Sudarshan
Indian Institute of Technology, Bombay

*: Current affiliation: Google Inc.
#: Current affiliation: Yahoo Labs.

Keyword Search on Graph Data

Motivation: querying of data from (possibly) multiple data sources
E.g. organizational, government, scientific, medical
Often no schema or partially defined schema

Graph data model
Lowest common denominator model, across relational, HTML, XML, RDF, ...
Much recent work on extracting and integrating data into a graph model

Keyword search is a natural way to query such data graphs, esp. in the absence of schema
This is the focus of this paper

Keyword Search on Graph-Structured Data

[Figure: example data graph with paper nodes ("BANKS: Keyword search", "Focused Crawling"), "writes" nodes, and author nodes (Sudarshan, Soumen C., Byron Dom)]

E.g. query: soumen byron

Key differences from IR/Web Search:
Normalization (implicit/explicit) splits related data across multiple nodes
To answer a keyword query we need to find a (closely) connected set of entities that together match all given keywords

Query/Answer Models on Graph Data

Query: set of keywords
Answer: rooted directed tree connecting keyword nodes (e.g. BANKS)

Answer relevance based on
node prestige
1/(tree edge weight)

Several closely related ranking models

[Figure: answer tree for the query "soumen byron": paper "Focused Crawling", connected through "writes" to authors Soumen C. and Byron Dom]

Keyword Search on Graphs

Goal: efficiently find top-k answers to keyword query

Several algorithms proposed earlier
Backward expanding search
Bidirectional search
DPBF, BLINKS, Spark, ...

All above algorithms assume graph fits in memory

External Memory Graph Search

Problem: what if graph size > memory?
Motivation: Web crawl graphs, social networks, Wikipedia, data generated by IE from Web

Algorithm Alternatives:
Alternative 1: Virtual Memory
-ve: thrashing (experimental results later)
Alternative 2: SQL
-ve: for relational data only
-ve: not good for top-k answer generation

Our proposal: use in-memory graph summary
to focus search on relevant parts of the graph
avoid IO for rest of graph

Related Work

Keyword querying on graphs using precomputed info
Idea: avoid search at query time, use only inverted list merge
Drawbacks include high space overhead (ObjectRank, EKSO)

External memory graph traversal
Several algorithms (Nodine, Buchsbaum, etc.) that give worst case guarantees, but require excessive replication

Shortest path computation in external memory graphs
Several algorithms (Shekhar, Chang, etc.)
But all depend on properties specific to road networks (large diameter, near planarity, etc.)

Hierarchical clustering
For visualization (Lieserson, Buchsbaum, etc.)

2-level graph clustering
For web graph computations (Raghavan and Garcia-M.)

Supernode Graph

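[Figure: supernode graph; the nodes of the original graph (inner nodes) are grouped into supernodes]

Edge weights: $wt(S_1 \to S_2) = \min\{\, wt(i \to j) : i \in S_1,\ j \in S_2 \,\}$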
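The edge-weight rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_supernode_graph`, the dictionary-based graph encoding, and the toy clustering are all assumptions made for the example. It also assumes that edges inside a single supernode are simply left out of the summary.

```python
def build_supernode_graph(edges, cluster):
    """Summarize a directed weighted graph into a supernode graph.

    edges:   {(i, j): weight} for each directed edge i -> j (hypothetical encoding)
    cluster: {node: supernode id}
    Returns {(S1, S2): weight}, where the weight is the minimum over all
    edges i -> j with i in S1 and j in S2.
    """
    super_edges = {}
    for (i, j), w in edges.items():
        s1, s2 = cluster[i], cluster[j]
        if s1 == s2:
            continue  # assumption: intra-supernode edges are not kept as self-loops
        if w < super_edges.get((s1, s2), float("inf")):
            super_edges[(s1, s2)] = w
    return super_edges


# Toy graph: nodes a, b form supernode S1; nodes c, d form supernode S2.
edges = {("a", "b"): 1.0, ("a", "c"): 4.0, ("b", "c"): 2.0,
         ("c", "d"): 3.0, ("d", "a"): 5.0}
cluster = {"a": "S1", "b": "S1", "c": "S2", "d": "S2"}
supernode_graph = build_supernode_graph(edges, cluster)
print(supernode_graph)
```

Taking the minimum means a path through the supernode graph never costs more than the corresponding path in the full graph, which is what lets the summary guide the search.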

Strawman: 2-Phase Search

First-Attempt Algorithm:
Phase 1: Search on supernode graph to get top-k results (containing supernodes)
Using any search algorithm
Expand all supernodes from supernode results
Phase 2: Search on this expanded component of graph to get final top-k results

Doesn't quite work:
Top-k on expanded component may not be top-k on full graph
Experiments show poor recall

Multi-Granular Graph Representation

Original supernode graph is in-memory

Some supernodes are expanded
i.e. their contents are fetched into cache

Multi-granular graph: a logical graph view containing
inner nodes from expanded supernodes
unexpanded supernodes
edges between these nodes

Multi-granular graph evolves as execution proceeds, and supernodes get expanded

Search runs on resultant multi-granular graph

Multi-Granular Graph
[Figure: multi-granular graph containing supernodes S1, S2 and S4. Key: supernode (unexpanded), inner node, expanded supernode; I-I edge, S-I edge, S-S edge]

Edge weights, supernode-inner node:
$wt(S \to j) = \min\{\, wt(i \to j) : i \in S \,\}$
$wt(j \to S)$: symmetric to above
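One way to picture the multi-granular view is to map every original node to a "representative": the node itself if its supernode has been expanded, otherwise the supernode. The sketch below follows that idea. The function name `multi_granular_edges`, the data layout, and the choice to hide edges that fall inside an unexpanded supernode are assumptions for illustration only; node ids and supernode ids are assumed not to collide.

```python
def multi_granular_edges(edges, cluster, expanded):
    """Build the edge set of the multi-granular graph.

    edges:    {(i, j): weight} of the original graph (hypothetical encoding)
    cluster:  {node: supernode id}
    expanded: set of supernode ids whose contents are currently expanded
    """
    def rep(n):
        # An inner node is visible only if its supernode is expanded.
        return n if cluster[n] in expanded else cluster[n]

    view = {}
    for (i, j), w in edges.items():
        u, v = rep(i), rep(j)
        if u == v:
            continue  # assumption: edge hidden inside an unexpanded supernode
        # I-I, S-I and S-S edges all take the minimum weight of the edges they stand for.
        if w < view.get((u, v), float("inf")):
            view[(u, v)] = w
    return view


edges = {("a", "b"): 1.0, ("a", "c"): 4.0, ("b", "c"): 2.0,
         ("c", "d"): 3.0, ("d", "a"): 5.0}
cluster = {"a": "S1", "b": "S1", "c": "S2", "d": "S2"}
mg = multi_granular_edges(edges, cluster, expanded={"S1"})
print(mg)
```

With an empty `expanded` set the view reduces to the supernode graph, and with every supernode expanded it is the original graph.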

Iterative Expansion Search

Explore: generate top-k answers on current MG graph, using any in-memory search method

Top-k answers pure?
Yes: output
No: expand supernodes in top answers, then explore again

Edges in top-k answers
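The explore/expand loop can be sketched as follows. To stay short, the example makes several simplifying assumptions that are not from the slides: k = 1, an "answer" is just the cheapest path from a node matching one keyword to a node matching another (rather than a rooted answer tree), an answer is treated as pure when it contains no supernode, and expansion is a pure in-memory operation with no cache limit. All function names are hypothetical.

```python
import heapq


def mg_view(edges, cluster, expanded):
    """Return (rep, adj) for the current multi-granular graph."""
    def rep(n):
        return n if cluster[n] in expanded else cluster[n]
    adj = {}
    for (i, j), w in edges.items():
        u, v = rep(i), rep(j)
        if u != v:
            nbrs = adj.setdefault(u, {})
            nbrs[v] = min(w, nbrs.get(v, float("inf")))
    return rep, adj


def best_path(adj, sources, targets):
    """Dijkstra from a set of sources; returns (cost, path) to the nearest target."""
    dist = {s: 0.0 for s in sources}
    parent = {s: None for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u in targets:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return d, path[::-1]
        for v, w in adj.get(u, {}).items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(heap, (d + w, v))
    return None


def iterative_expansion_search(edges, cluster, kw_a, kw_b):
    expanded, supernodes = set(), set(cluster.values())
    while True:
        # Explore: restart the search from scratch on the current MG graph.
        rep, adj = mg_view(edges, cluster, expanded)
        result = best_path(adj, {rep(n) for n in kw_a}, {rep(n) for n in kw_b})
        if result is None:
            return None, expanded
        impure = {n for n in result[1] if n in supernodes}
        if not impure:
            return result, expanded  # pure answer: output it
        expanded |= impure           # expand supernodes in the top answer


edges = {("a", "c"): 2.0, ("c", "f"): 2.0, ("a", "g"): 5.0, ("g", "f"): 5.0,
         ("b", "d"): 1.0, ("d", "e"): 1.0, ("e", "f"): 1.0}
cluster = {"a": "S1", "b": "S1", "c": "S2", "d": "S2",
           "e": "S3", "f": "S3", "g": "S4"}
answer, expanded = iterative_expansion_search(edges, cluster, {"a"}, {"f"})
print(answer, sorted(expanded))
```

In this toy run the first pass finds the path S1, S2, S3 with cost 2.0, all three supernodes are expanded, and the second pass returns the pure path a, c, f with cost 4.0. Supernode S4 is never expanded, which is the IO saving the approach is after.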

Iterative Expansion (Cont.)

Any in-memory search algorithm can be used
Iteration will terminate

What if too many nodes are expanded?
Eviction of expanded nodes from MG graph
Can lead to non-convergence
Evict expanded nodes from cache, but retain in logical MG graph, re-fetch as required
Can cause thrashing (thrashing control possible)

Performance Evaluation (details later)
Significantly reduces IO compared to search using virtual memory
BUT: High CPU cost due to multiple iterations, with each iteration starting search from scratch

Incremental Search

Motivation
Repeated restarts of search in iterative search

Basic Idea
Search on multi-granular graph
Expand supernode(s) in top answer

Unlike Iterative Search
Update the state of the search algorithm when a supernode is expanded, and
Continue search instead of restarting

State update depends on search algorithm

We present state update for backward expanding search (BANKS, ICDE02/VLDB05)

Backward Expanding Search

[Figure: backward expanding search for the query "soumen byron" on the paper "Focused Crawling" and its authors Soumen C. and Byron Dom, with an SPI tree highlighted]

Based on Dijkstra's single-source shortest path algorithm

One instance of Dijkstra's algorithm per keyword
Explored nodes: nodes for which shortest path already found
Fringe nodes: unexplored nodes adjacent to explored nodes

Shortest-Path Iterator Tree (SPI-Tree):
Tree containing explored and fringe nodes
Edge u → v if (current) shortest path from u to keyword passes through v

More details in paper
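A much-simplified sketch of the idea follows: one Dijkstra run per keyword over reversed edges, where the `nxt` map plays the role of the SPI tree (an entry `nxt[u] = v` means the current shortest path from u to the keyword passes through v). The names, the graph encoding, and the scoring are assumptions for illustration. In particular, this sketch runs each iterator to completion and scores a candidate root by the sum of its per-keyword path costs, whereas the slides describe a search that stops once it has the top-k answers, and the node-prestige part of the relevance model is ignored.

```python
import heapq


def backward_dijkstra(rev_adj, keyword_nodes):
    """Shortest paths from every node to the nearest keyword node.

    Returns (dist, nxt): dist[u] is the path cost from u to a keyword node,
    nxt[u] is the next node on that path (the SPI-tree edge u -> nxt[u]).
    """
    dist = {n: 0.0 for n in keyword_nodes}
    nxt = {n: None for n in keyword_nodes}
    heap = [(0.0, n) for n in keyword_nodes]
    heapq.heapify(heap)
    explored = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in explored:
            continue
        explored.add(v)
        for u, w in rev_adj.get(v, []):  # original edge u -> v, walked backwards
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                nxt[u] = v
                heapq.heappush(heap, (d + w, u))
    return dist, nxt


def top_k_roots(edges, keyword_sets, k):
    """Rank nodes that can reach every keyword, cheapest total path cost first."""
    rev_adj = {}
    for (u, v), w in edges.items():
        rev_adj.setdefault(v, []).append((u, w))
    runs = [backward_dijkstra(rev_adj, ks) for ks in keyword_sets]
    common = set.intersection(*(set(dist) for dist, _ in runs))
    scored = sorted((sum(dist[r] for dist, _ in runs), r) for r in common)
    return scored[:k]


# Toy version of the running example; node names are made up.
edges = {("focused_crawling", "writes_1"): 1.0, ("writes_1", "soumen"): 1.0,
         ("focused_crawling", "writes_2"): 1.0, ("writes_2", "byron"): 1.0,
         ("banks_paper", "writes_3"): 1.0, ("writes_3", "soumen"): 1.0}
result = top_k_roots(edges, [{"soumen"}, {"byron"}], k=1)
print(result)
```

Only `focused_crawling` can reach both keyword nodes, so it is the single answer root, with a total path cost of 4.0.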

Incremental Backward Search

Backward search run on multi-granular graph

repeat
    Find next best answer on current multi-granular graph
    If answer has supernodes
        expand supernode(s)
        Update the state of backward search, i.e. all SPI trees, to reflect state change of multi-granular graph due to expansion
until top-k answers on current multi-granular graph are pure answers

State Update on Supernode Expansion

[Figure: a result containing supernodes; supernode S1 is to be expanded; the SPI tree containing S1, with the nodes affected by its deletion and the nodes that get attached]

1. Affected nodes get detached
2. Inner-nodes get attached (as fringe nodes) to adjacent explored nodes based on shortest path to K1
3. Affected nodes get attached (as fringe nodes) to adjacent explored nodes based on shortest path to K1

Effect of Supernode Expansion

Differences from Dijkstra's shortest-path algorithm:

For Explored nodes:
Path-costs of explored nodes may increase
Explored nodes may become fringe nodes

For Fringe nodes:
Incremental Expansion: Path-costs may increase or decrease

Invariant
SPI trees reflect shortest paths for explored nodes in current multi-granular graph

Theorem: Incremental backward expanding search generates correct top-k answers

Heuristics

Thrashing Control:
Stop supernode expansion on cache full
Use only parts of the graph already expanded for further search

Intra-supernode edge weight
details in paper

Heuristics can affect recall
Recall at or close to 100% for relevant answers, with heuristics, in our experiments (see paper for details)

Experimental Setup

Clustering algorithm to create supernodes
Orthogonal to our work
Experiments use Edge prioritized BFS (details in paper)
Ongoing work: develop better clustering techniques

All experiments done on cold cache
echo 3 > /proc/sys/vm/drop_caches

Dataset   Original Graph Size   Supernode Graph Size   Edges   Superedges
DBLP      99MB                  17MB                   8.5M    1.4M
IMDB      94MB                  33MB                   8M      2.8M

Default Cache Size (Incr/Iter): 1024 (7MB)
Default Cache Size (VM, DBLP): 3510 (24MB)
Default Cache Size (VM, IMDB): 5851 (40MB)

Algorithms Compared

Iterative

Incremental

Virtual Memory (VM) Search
Use same clustering as for supernode graph
Fetch cluster into cache whenever a node is accessed, evicting LRU cluster if required
Search code unaware of clustering/caching: gets "Virtual Memory" view

Sparse
SQL-based approach from Hristidis et al. [VLDB03]
Not applicable to graphs without schema
used for comparison, on graphs derived from relational schema
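The VM Search baseline amounts to a cluster-granularity LRU cache sitting underneath an otherwise unmodified search. A minimal sketch of such a cache is shown below. The class name `ClusterCache`, the `load_cluster` callback standing in for a disk read, and the adjacency-list layout are assumptions for illustration, not the authors' code.

```python
from collections import OrderedDict


class ClusterCache:
    """LRU cache of clusters; the search only ever calls neighbors()."""

    def __init__(self, capacity, cluster, load_cluster):
        self.capacity = capacity          # maximum number of clusters held in memory
        self.cluster = cluster            # {node: cluster id}
        self.load_cluster = load_cluster  # hypothetical disk read: cluster id -> {node: [neighbors]}
        self.cache = OrderedDict()        # least recently used cluster first
        self.misses = 0

    def neighbors(self, node):
        cid = self.cluster[node]
        if cid in self.cache:
            self.cache.move_to_end(cid)   # mark as most recently used
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict the LRU cluster
            self.cache[cid] = self.load_cluster(cid)
        return self.cache[cid].get(node, [])


# Toy "disk": three clusters, with room for only two in the cache.
disk = {"S1": {"a": ["b", "c"], "b": ["c"]},
        "S2": {"c": ["d"], "d": ["a"]},
        "S3": {"e": []}}
cluster = {"a": "S1", "b": "S1", "c": "S2", "d": "S2", "e": "S3"}
vm = ClusterCache(capacity=2, cluster=cluster, load_cluster=disk.__getitem__)
for node in ["a", "c", "b", "e", "c"]:
    vm.neighbors(node)
print(vm.misses, list(vm.cache))
```

The access sequence above causes four misses out of five accesses, a small-scale picture of the thrashing that occurs when a search hops between more clusters than the cache can hold.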

Query Execution Time (top 10 results)

[Figure: bar chart of query execution time in seconds; bars are Iterative, Incremental and VM respectively]

Query Execution Time (Last Relevant Result)

[Figure: bar chart of query execution time in seconds; bars are Iterative, Incremental, VM and Sparse respectively]

Cache Misses for Different Cache Sizes

[Figure: cache misses at different cache sizes, for VM and for Incremental]

Note: Graphs in paper used wrong cache sizes for VM queries on IMDB (Q8, Q9, Q10 and Q12). Graph above shows corrected results, but there are no significant differences.

Conclusions

Graph summarization coupled with a multi-granular graph representation shows promise for external memory graph search

Ongoing/Future work
Applications in distributed memory graph search
Improved clustering techniques
Extending Incremental to bidirectional search and other graph search algorithms
Testing on really large graphs

The End

Queries?

Minor Correction to Paper

Cache Size (Incr/Iter)   1024 (7MB)    1536 (10.5MB)   2048 (14MB)
Cache Size (VM, DBLP)    3510 (24MB)   4023 (27.5MB)   4535 (31MB)
Cache Size (VM, IMDB)    5851 (40MB)   6363 (43.5MB)   6875 (47MB)

For IMDB queries Q8-Q10 and Q12, in the case of VMSearch, cache sizes from DBLP were inadvertently used earlier instead of the cache sizes shown above. Queries were rerun with the correct cache sizes, but there were no changes in the relative performance of Incremental versus VMSearch, on cache misses as well as time taken.
