Keyword Searching and Browsing in
Databases using BANKS
Gaurav Bhalotia, Arvind Hulgeri,
Charuta Nakhe,
Soumen Chakrabarti, S. Sudarshan
I.I.T. Bombay
11/25/2018 1
Motivation
Keyword search of documents on the Web has
been enormously successful
Simple and intuitive, no need to learn any query
language
Database querying using keywords is desirable
SQL is not appropriate for casual users
Form interfaces cumbersome:
Require separate form for each type of query — confusing for
casual users of Web information systems
Not suitable for ad hoc queries
Motivation
Many Web documents are dynamically generated
from databases
E.g. Catalog data
Keyword querying of generated Web documents
May miss answers that need to combine information
on different pages
Suffers from duplication overheads
Examples of Keyword Queries
On a railway reservation database
“mumbai bangalore”
On an e-store database
“camcorder panasonic”
On a book store database
“sudarshan databases”
Differences from IR/Web Search
Related data split across multiple tuples due to
normalization
E.g. Paper (paper-id, title, journal),
Author (author-id, name),
Writes (author-id, paper-id, position),
Cites (citing-paper-id, cited-paper-id)
Different keywords may match tuples from
different relations
What joins are to be computed can only be decided on
the fly
Connectivity
Tuples may be connected by
Foreign key
Implicit links (shared words), etc.
Tuples belonging to the same relation
Would like to find sets of (closely) connected
tuples that match all given keywords
Basic Model
Database: modeled as a graph
Nodes = tuples
Edges = references between tuples
foreign key, other kind of relationships
Edges are directed.
[Figure: example database graph. Paper tuples ("BANKS: Keyword search…", "MultiQuery Optimization") and author tuples (Charuta, S. Sudarshan, Prasan Roy) are connected through writes tuples.]
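The graph model can be sketched in a few lines. This is a minimal illustration, not the BANKS implementation: the tuple ids (`p1`, `a1`, `w1`, ...) and the `build_graph` helper are hypothetical names, and the data is a toy version of the paper/writes/author example.

```python
from collections import defaultdict

# Hypothetical toy data: tuple id -> (relation name, displayed text).
tuples = {
    "p1": ("paper", "MultiQuery Optimization"),
    "a1": ("author", "S. Sudarshan"),
    "a2": ("author", "Prasan Roy"),
    "w1": ("writes", ""),
    "w2": ("writes", ""),
}

# Foreign-key references as (referencing tuple, referenced tuple).
references = [("w1", "a1"), ("w1", "p1"), ("w2", "a2"), ("w2", "p1")]

def build_graph(references):
    """Return forward and backward adjacency sets for the tuple graph."""
    forward = defaultdict(set)   # node -> nodes it references
    backward = defaultdict(set)  # node -> nodes that reference it
    for src, dst in references:
        forward[src].add(dst)
        backward[dst].add(src)
    return forward, backward

forward, backward = build_graph(references)
```

Keeping both adjacency maps makes it cheap to walk a foreign key in either direction, which the later search slides rely on.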
Answer Example
Query: sudarshan roy
[Figure: answer tree rooted at the paper tuple "MultiQuery Optimization", with two writes tuples linking it to the author tuples S. Sudarshan and Prasan Roy.]
Edge Directionality
Some popular tuples are connected to many
other tuples
E.g. Students -> departments -> university
Popular tuples would create misleading shortcuts
from every tuple to every other
E.g. every student would be closely linked with every
other student via the department/university
Solution: define different forward and backward
edge weights
Forward edges: In the direction of the foreign key
reference
Edge Weight
Weight of forward edge based on schema
e.g. citation link weights > writes link weights
Weight of backward edge = indegree of edges
pointing to the node
[Figure: example graph with edges labeled by weights 3 and 1.]
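A small sketch of the two weight rules may help. It follows the slide literally (backward weight = indegree of the referenced node); the function name `raw_edge_weights`, the `schema_weight` table, and the toy tuple ids are assumptions made for illustration. It also assumes no pair of tuples references each other in both directions.

```python
from collections import Counter

def raw_edge_weights(references, schema_weight):
    """references: list of (src, dst, link_type) foreign-key references.

    Returns a dict mapping each directed edge (u, v) to its raw weight.
    """
    # Indegree = number of references pointing at each node.
    indegree = Counter(dst for _, dst, _ in references)
    weights = {}
    for src, dst, link_type in references:
        # Forward edge: weight taken from the schema, per link type.
        weights[(src, dst)] = schema_weight[link_type]
        # Backward edge: weight grows with the popularity of the target.
        weights[(dst, src)] = indegree[dst]
    return weights

# Three writes tuples referencing the same paper.
refs = [("w1", "p1", "writes"), ("w2", "p1", "writes"), ("w3", "p1", "writes")]
weights = raw_edge_weights(refs, {"writes": 1})
```

In this toy case each forward edge gets weight 1 and each backward edge out of the paper gets weight 3, so paths that pass through a popular tuple become more expensive.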
Edge Weight Scaling
Problem: Some backward edges have unduly
large weights
Scale edge weights by using log(1+raw-edgeweight)
total-edge-weight = Σ edge-weights
Edge score E = 1 / total-edge-weight
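As a worked sketch of the scaling step, the function below sums the log-scaled weights of a tree's edges and inverts the total. The name `edge_score` is hypothetical, and rejecting a tree with no edges is a choice made here, since the slide does not say how a single-node answer is scored.

```python
import math

def edge_score(raw_weights):
    """Edge score E of an answer tree, given the raw weights of its edges."""
    if not raw_weights:
        # Assumption: single-node answers are handled elsewhere.
        raise ValueError("edge_score needs at least one edge")
    # Scale each raw weight with log(1 + w), then sum over the tree.
    total = sum(math.log(1 + w) for w in raw_weights)
    return 1.0 / total

# A tree with two forward edges of raw weight 1.
score = edge_score([1, 1])
```

Because of the log, a backward edge with raw weight 1000 costs only about ten times as much as an edge with raw weight 1, rather than a thousand times.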
Node Weight
Nodes have prestige weights too
Observation: nodes with intuitively greater prestige
tend to have greater indegree
Set node weight = indegree
Problem: Nodes with many in-edges result in
skewed answers
Subdue extreme node weights by using
log(1+indegree)
Node score N =
root-node-weight + Σ leaf-node-weights
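The node score can be sketched the same way. The helper names `node_weight` and `node_score` are assumptions; the formula is the one on the slide, with log(1 + indegree) as the weight of each node.

```python
import math

def node_weight(indegree):
    """Prestige weight of a node, subdued with a log."""
    return math.log(1 + indegree)

def node_score(root_indegree, leaf_indegrees):
    """Node score N: weight of the root plus the weights of the leaves."""
    return node_weight(root_indegree) + sum(node_weight(d) for d in leaf_indegrees)

# Root with indegree 2 and two leaves with indegree 1 each.
n = node_score(2, [1, 1])
```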
Combining Scores
Problem: how to combine two independent
metrics: node weight and edge weight
Normalize each to 0-1
Combine using weighting factor
Additive: (1 - λ) E + λ N
Multiplicative: E × N^λ
Performance study to compare alternatives and
to find reasonable values for λ
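The two combination rules are easy to compare side by side. In this sketch the weighting factor is called `lam`, and normalizing by dividing by the maximum is an assumption: the slide only says that each metric is normalized to the 0-1 range.

```python
def normalize(values):
    """Scale non-negative scores to 0-1 by dividing by the largest one."""
    top = max(values)
    return [v / top if top > 0 else 0.0 for v in values]

def additive(e, n, lam):
    """Additive combination of edge score e and node score n."""
    return (1 - lam) * e + lam * n

def multiplicative(e, n, lam):
    """Multiplicative combination of edge score e and node score n."""
    return e * n ** lam

# With lam = 0.2 the edge score dominates in both variants.
a = additive(1.0, 0.5, 0.2)
m = multiplicative(0.5, 1.0, 0.2)
```

Setting `lam` to 0 ignores prestige entirely in both rules, which makes the factor a convenient knob for a performance study.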
The BANKS Answer Model
Query: set of keywords {k1, k2, .., kn}
Each keyword ki matches set of nodes Si
Answer: rooted, directed tree connecting
nodes, with one node from each Si
Root node (also referred to as the information node) has
special significance, may be restricted to some
relations
E.g. relations representing entities, not relationships
May include intermediate nodes not in any Si,
hence a Steiner tree
Multiple answers
Ranking based on proximity + prestige
Finding Answer Trees
Computation of minimum weight Steiner
Trees: NP complete
Backward Expanding Search Algorithm:
Intuition: find vertices from which a forward path
exists to at least one node from each Si.
Run concurrent single source shortest path algorithm
from each node matching a keyword
Create an iterator for each node matching a keyword
Traverse the graph edges in reverse direction
Output a node whenever it is on the intersection of the sets of
nodes reached from each keyword
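The steps above can be sketched as one shared priority queue that interleaves a Dijkstra-style search from every keyword node, following edges in reverse. This is a simplified illustration under stated assumptions, not the BANKS code: the function name, the edge dictionary layout, and the toy weights are hypothetical, and the sketch only reports candidate root nodes instead of building full answer trees.

```python
import heapq
from collections import defaultdict

def backward_expanding_search(edges, keyword_sets):
    """edges: dict (u, v) -> weight of directed edge u -> v.
    keyword_sets: list of sets S_i of nodes matching keyword i.
    Returns nodes that reach at least one node of every S_i,
    in the order in which the search discovers them.
    """
    incoming = defaultdict(list)  # v -> [(u, weight)] for each edge u -> v
    for (u, v), w in edges.items():
        incoming[v].append((u, w))

    # One "iterator" per keyword node; all share one heap so the
    # globally closest frontier node is expanded first.
    heap = []
    for i, nodes in enumerate(keyword_sets):
        for origin in sorted(nodes):
            heapq.heappush(heap, (0.0, origin, i, origin))

    settled = set()              # (origin, keyword index, node) already expanded
    reached = defaultdict(set)   # node -> keyword indexes that reached it
    roots = []
    while heap:
        dist, node, i, origin = heapq.heappop(heap)
        if (origin, i, node) in settled:
            continue
        settled.add((origin, i, node))
        before = len(reached[node])
        reached[node].add(i)
        # Output the node once it lies in the intersection of all searches.
        if before < len(keyword_sets) == len(reached[node]):
            roots.append(node)
        for pred, w in incoming[node]:   # traverse edges in reverse
            if (origin, i, pred) not in settled:
                heapq.heappush(heap, (dist + w, pred, i, origin))
    return roots

# Toy graph: forward edges (weight 1) plus backward edges (weight = indegree).
edges = {
    ("w1", "a1"): 1, ("w1", "p1"): 1, ("w2", "a2"): 1, ("w2", "p1"): 1,
    ("a1", "w1"): 1, ("p1", "w1"): 2, ("a2", "w2"): 1, ("p1", "w2"): 2,
}
roots = backward_expanding_search(edges, [{"a1"}, {"a2"}])
```

In this toy graph the paper node `p1` is the first node reached from both author nodes, so it is the first candidate root; the other nodes follow with longer paths.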
Finding Answer Trees
For each vertex v visited, maintain a nodelist v.Li
for each search term ti.
Update the ith nodelist when the search starting
from a vertex u ∈ Si reaches the vertex v.
The new result trees produced correspond to the
cross product {u} × ∏ v.Lj, taken over all j ≠ i
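The nodelist bookkeeping amounts to a cross product that pins position i to the newly arrived origin. The sketch below assumes the nodelists of a vertex are stored as a plain list of lists; the function name `new_answer_trees` is hypothetical.

```python
from itertools import product

def new_answer_trees(node_lists, i, u):
    """Called when the search from origin u (matching term i) reaches a vertex.

    node_lists[j] holds the origins for term j that already reached the vertex.
    Returns the new origin combinations, one origin per term, then records u.
    """
    # Position i is fixed to u; every other position ranges over its nodelist.
    factors = [[u] if j == i else lst for j, lst in enumerate(node_lists)]
    combos = list(product(*factors))
    node_lists[i].append(u)
    return combos

# Vertex already reached from b1 and b2 (term 1); now a1 (term 0) arrives.
lists = [[], ["b1", "b2"]]
first = new_answer_trees(lists, 0, "a1")
second = new_answer_trees(lists, 1, "b3")
```

If any other nodelist is still empty, the product is empty, so no tree is produced until every term has reached the vertex.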
Backward Expanding Search
Query: sudarshan roy
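[Figure: backward expanding search for this query over the example graph: author tuples S. Sudarshan and Prasan Roy, the writes tuples, and the paper tuple "MultiQuery Optimization".]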
Result Ordering
Answer trees may not be generated in relevance
order
Solution:
Best-first search across all iterators, based on path
length
Output answers to a buffer
Eliminate duplicates: Isomorphic Trees
Output highest ranked answer from buffer to user
when buffer is full
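The buffering scheme can be sketched with a bounded heap. Several details here are assumptions made for illustration: an answer is represented as a (root, edge list) pair, two trees count as duplicates when they have the same nodes and the same edges ignoring direction, and the names `tree_key` and `ordered_answers` are hypothetical.

```python
import heapq
from itertools import count

def tree_key(root, edges):
    """Direction-insensitive signature used to detect duplicate trees."""
    nodes = {root}.union(*edges)
    return frozenset(nodes), frozenset(frozenset(e) for e in edges)

def ordered_answers(generated, buffer_size):
    """generated: iterable of (score, (root, edges)) in generation order.
    Yields (score, (root, edges)), releasing the best buffered answer
    each time the buffer overflows, then draining the rest in rank order.
    """
    heap, seen, tie = [], set(), count()
    for score, (root, edges) in generated:
        key = tree_key(root, edges)
        if key in seen:
            continue  # duplicate of an answer generated earlier
        seen.add(key)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(heap, (-score, next(tie), (root, edges)))
        if len(heap) > buffer_size:
            neg, _, best = heapq.heappop(heap)
            yield -neg, best
    while heap:
        neg, _, best = heapq.heappop(heap)
        yield -neg, best

generated = [
    (0.2, ("a", [])),
    (0.9, ("b", [])),
    (0.5, ("p", [("p", "w")])),
    (0.4, ("w", [("w", "p")])),   # same tree as the previous one, other root
    (0.7, ("c", [])),
]
scores = [s for s, _ in ordered_answers(generated, buffer_size=2)]
```

A larger buffer gives an order closer to the true ranking at the cost of a longer wait before the first answer is shown.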
The BANKS System
BANKS provides keyword search coupled with
extensive browsing facilities
Schema browsing + data browsing
Graphical display of data
Implemented using Java + servlets
Keyword search response times typically 1 to 3
seconds on
DBLP database with 100,000 tuples/300,000 edges
P3 600 MHz, 512 MB RAM
Try it out at www.cse.iitb.ac.in/banks/
The BANKS Architecture
[Figure: User ↔ (HTTP) ↔ BANKS (Web server + servlets) ↔ (JDBC) ↔ Database]
Connects to any database using JDBC
JDBC metadata features used to provide schema
browsing
No programming needed for customization
Minimal preprocessing of database to create indices and give
weights to links
Extensive set of browsing features
Browsing Features
Hyperlinks are automatically added to all
displayed results
Template facilities to do a variety of tasks
Browsing data by grouping and creating crosstabs
e.g., theses grouped by department and year
Hierarchical views of data
Nested XML style, even on relational data
Graphical displays
Bar charts, pie charts, etc
Example of Browsing in BANKS
BANKS Query Result Example
Result of “Soumen Sunita”
Anecdotes
“Mohan”
Returns C. Mohan at top based on prestige (number of
papers written)
“Transaction”
Returns Jim Gray’s classic paper and textbook as top
answers based on prestige (number of citations)
“Sunita Seltzer”
No common papers, but both have papers with
Stonebraker: system finds this connection
Effect of Parameters
Log scaling of edge weights worked well
(1 - λ) E + λ N versus E × N^λ: made little difference
Best with λ = 0.2 (subdue node weights but not entirely)
Related Work
DataSpot (DTL)/Mercado Intuifind [VLDB 98]
Based on patent by Palmon (filed 1995, granted 1998)
Similar answer model to ours
Differences: our model of backward link weights and prestige
Proximity Search [VLDB98]
Different model of proximity
No edge weights, prestige, different evaluation algorithm
Information units (linked Web pages) [WWW10]
No directionality, only studied in Web context
Microsoft DBExplorer
No ranking, based on SQL generation
Addresses efficient construction of text indexes
Some Extensions to BANKS
Searching for similar results: Template Search
define the notion of similarity between two result trees
perform the restricted search
Efficiently handling meta-data queries
starting the search from each of the tuples in a table is
too costly
Template Search
Feedback in terms of result tree
Type of a result tree defined in terms of
type of nodes
the table to which the node belongs
type of edges :
the type of nodes which it connects
the link information e.g. ‘cites’ and ‘cited’ link between two
papers.
Which nodes to start the search from
only the chosen nodes
all the nodes corresponding to a particular keyword
Template Search
Start the backward search only from allowed set
of nodes
Follow the edges as defined by the result type
Example : Consider Query “sudarshan database”
Two types of results for above query
papers written by professor sudarshan
papers cited by papers written by professor sudarshan
Two result types distinguished by whether to
follow the cites/cited link from a paper node.
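One way to picture the restriction is as a filter on which links the backward search may follow. The sketch below is only an illustration: the adjacency layout, the link type labels, and the function name `template_predecessors` are assumptions, not part of BANKS.

```python
def template_predecessors(incoming, node, allowed_link_types):
    """incoming: dict node -> list of (predecessor, link_type).
    Returns only the predecessors reachable over links the result type allows.
    """
    return [pred for pred, link in incoming.get(node, []) if link in allowed_link_types]

# Hypothetical neighbourhood of one paper node.
incoming = {"paper1": [("w1", "writes"), ("paper2", "cites")]}

# Result type 1: papers written by the author, so citation links are not followed.
written = template_predecessors(incoming, "paper1", {"writes"})
# Result type 2: also follow citation links from the paper node.
cited = template_predecessors(incoming, "paper1", {"writes", "cites"})
```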
Metadata Keyword Queries
Metadata keywords : match all the tuples of
a relation.
Too costly to start the search from each of
the tuples of a table
First cut approach: start the forward search from
the information node for the non-metadata
keywords
selectively choose the nodes from where to
start the forward search
Example of Metadata Query
Consider the query “sudarshan paper”
[Figure: the search from the node matching "sudarshan" reaches the writes table nodes, then continues to the paper table by forward search.]
Conclusions and Future Work
The next big wave: keyword searching and
browsing of databases?
Future work:
Keyword queries on XML
Disambiguating queries by selecting
Nodes: G.W.Bush: “Bush Jr” or “Bush Sr”
Tree structure: “coauthors” or “cites”
Boolean queries, stemming, thesaurus
Metadata: column/relation names
Thank You