Score: Context-Oriented Structured and Unstructured Information Integration
Roy, P., Mohania, M., Bamba, B., and Raman, S. Towards Automatic Association of Relevant
Unstructured Content with Structured Query Results. In CIKM (2005).
Bamba, B., Roy, P., and Mohania, M. OSQR: Overlapping Clustering of Query Results. In CIKM (2005).
Structured and Unstructured Information
Information content in an enterprise can be structured or unstructured
– Structured Content: payroll, sales orders, invoices, customer profiles, etc.
– Unstructured Content: emails, reports, web-pages, complaints, information on sales, customers,
competitors, products, suppliers and people, etc.
Historically, the structured and unstructured data management technologies have evolved
separately
Artificial separation between these two “kinds” of information
– DBMS: <20% of enterprise data
– Content MS: >80% of enterprise data
Enterprises are realizing the need to bridge this separation, and are demanding integrated
retrieval, management and analysis of both the structured and unstructured content
Structured and Unstructured Information Integration:
A Brief Background on Existing Solutions
Existing solutions can be classified in terms of the query paradigm used:
SQL Query Based Solutions (SQL LIKE predicate, DB2 NetSearch Extender)
• Text data exposed to relational engine as virtual tables with text columns
• Query both structured and unstructured information using SQL
• Provide SQL primitives to search text in table columns using a set of
keywords
Keyword Query Based Solution: DB2 ESE, DbXplorer/BANKS
[Figure: keyword queries posed to DbXplorer/BANKS over a relational table (columns C1, C2, C3) and to a search engine.]
Keyword Query Based Solutions: Summary
Advantage: Simplicity!
Disadvantages
• Less expressive (as compared to SQL)
• How to ask for the information related to the five best performing stocks in
the past week?
• Need to specify a set of keywords that succinctly encodes the information need
• Not always easy
SQL Query Based Solution:
Standard SQL LIKE Predicate
SELECT stocks.price, docs.text
FROM stocks, docs
WHERE (stocks.name = 'IBM'
       AND docs.text LIKE '% IBM %')
   OR (stocks.name = 'ORCL'
       AND docs.text LIKE '% ORCL %')
[Figure: Information Integrator evaluates the query over the relational table (columns C1, C2, C3).]
SQL Query Based Solution:
Net Search Extender
SELECT stocks.price, docs.text
FROM stocks, docs
WHERE (stocks.name = 'IBM'
       AND CONTAINS(docs.text, '"IBM"') = 1)
   OR (stocks.name = 'ORCL'
       AND CONTAINS(docs.text, '"ORCL"') = 1)
[Figure: Information Integrator evaluates the query over the relational table (columns C1, C2, C3); the CONTAINS(…) predicate is handled by Net Search Extender.]
SQL Query Based Solutions: Summary
Advantages:
• More expressive – can specify more involved and sophisticated queries
Disadvantages:
• The unstructured data is still queried using keywords
• Need to specify a set of keywords that succinctly encodes the information need
• Not always easy
• The SQL query and the embedded keyword query encode the same information
need
• Redundant effort
• Association of documents with tuples (local context), not with the entire result
(global context)
• Same documents get attached to “IBM” when “IBM” is queried with “ORCL”
as when “IBM” is queried with “DELL”
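The local-context limitation can be reproduced with a small sketch. It uses Python's built-in sqlite3 module and the LIKE-based query form shown earlier; the rows in stocks and docs are made-up toy data, and docs_for is a hypothetical helper, not part of any product mentioned here.

```python
import sqlite3

# Toy data (hypothetical); table and column names follow the slide's query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stocks (name TEXT, price REAL);
    CREATE TABLE docs (text TEXT);
    INSERT INTO stocks VALUES ('IBM', 90.0), ('ORCL', 13.0), ('DELL', 40.0);
    INSERT INTO docs VALUES
        ('Analysts expect IBM to expand its services business'),
        ('Database vendors IBM and ORCL compete on price'),
        ('PC makers such as DELL report strong demand');
""")

def docs_for(names):
    """Attach documents to each stock tuple using one LIKE predicate per name."""
    where = " OR ".join(
        "(stocks.name = ? AND docs.text LIKE ?)" for _ in names
    )
    params = [p for n in names for p in (n, f"% {n} %")]
    sql = f"SELECT stocks.name, docs.text FROM stocks, docs WHERE {where}"
    attached = {}
    for name, text in conn.execute(sql, params):
        attached.setdefault(name, set()).add(text)
    return attached

with_orcl = docs_for(["IBM", "ORCL"])
with_dell = docs_for(["IBM", "DELL"])

# The documents attached to IBM do not depend on the rest of the result.
print(with_orcl["IBM"] == with_dell["IBM"])
```

Each tuple is matched against the documents on its own, so the rest of the result has no influence on which documents are attached to IBM.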
SCORE Overview
“Get the 3 companies with max price variation” (keywords not required)
SELECT name, max(price) - min(price)
FROM stocks
GROUP BY name
ORDER BY 2 DESC
FETCH FIRST 3 ROWS ONLY
“Doctype:Patents” (optional directive)
[Figure: SCORE passes the SQL query to the RDBMS and returns the query result and related documents.]
Database Group, IRL
Motivating Example
Content Management System:
– Report.doc: “… Report on Database Research at IRL … Policy Definition …”
– New.txt: “Some other document”
– Another.pdf: “Expense Reimbursement Policy at IRL”
DocID | Author | Year
Report.doc | Mukesh | 2000
Another.pdf | Manish | 2002
New.txt | Prasan | 2003
Directives: Year < 2003
Result: Report.doc additionally retrieved based on the context.
Note: No mention of CORTEX in document
Database Group, IRL
FILLED IN BY THE USER
Restrict to documents containing “Report” and at least one group name
Automatically synthesize the “context” of the SQL query from its result and
the known semantic dependencies in the structured data
Use this context and the directives to retrieve the unstructured data
Computing the Context
Formalize the criterion for selecting terms as a part of the context of a given SQL query
Develop an algorithm, based on the above formalization, that computes the context by analyzing the
result of the SQL query as well as the related data in the underlying database
– Determine the relevant tables to explore without any user intervention
– Handle numeric-valued columns in addition to string-valued columns
Develop algorithms to refine the context
– Ontology based expansion of the set of keywords forming the context
– Exploit the column name information maintained with each keyword
– The column names are mapped to tag names that were used to tag the documents
(exploits Criollo to store the column name document tag mapping)
– Aggregated context of the session so far
Consider the keyword Apple picked from the column named Company.
Simply using Apple to retrieve documents would retrieve documents related to the company Apple as well as the fruit Apple!
Suppose the documents were tagged using the vocabulary Fruit and Organization.
– Doc1: “I had an <Fruit>apple</Fruit> for breakfast”
– Doc2: “Steve Jobs founded <Organization>Apple</Organization>”
SCORE can disambiguate the keyword Apple by associating with it the tag Organization that
corresponds to the column name Company
– Uses Criollo to get the mapping from column names to document tags
disambiguated keywords -> more precise context -> more accurate integration
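The Apple example can be sketched in a few lines. This is only an illustration under stated assumptions: COLUMN_TO_TAG stands in for the column-name-to-tag mapping that SCORE keeps via Criollo (whose interface is not shown in these slides), and matching_docs and untagged_matches are hypothetical helper names.

```python
import re

# ASSUMED stand-in for the column name -> document tag mapping.
COLUMN_TO_TAG = {"Company": "Organization"}

DOCS = {
    "Doc1": "I had an <Fruit>apple</Fruit> for breakfast",
    "Doc2": "Steve Jobs founded <Organization>Apple</Organization>",
}

def matching_docs(keyword, column, docs=DOCS):
    """Ids of documents where keyword occurs inside the tag mapped to column."""
    tag = re.escape(COLUMN_TO_TAG[column])
    pattern = re.compile(
        rf"<{tag}>\s*{re.escape(keyword)}\s*</{tag}>", re.IGNORECASE
    )
    return [doc_id for doc_id, text in docs.items() if pattern.search(text)]

def untagged_matches(keyword, docs=DOCS):
    """Plain keyword search over the text with tags stripped, for comparison."""
    word = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [doc_id for doc_id, text in docs.items()
            if word.search(re.sub(r"<[^>]+>", " ", text))]

print(untagged_matches("Apple"))          # both documents match
print(matching_docs("Apple", "Company"))  # only the Organization-tagged one
```

The plain keyword search returns both documents, while the tag-restricted search keeps only the one about the company.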
Architecture
[Figure: SCORE architecture. Components: User Interface/Application; SCORE (Query Handler, Context Handler, Metadata Repository); RDBMS over the Structured Data Source; Search Engine over the Unstruct Data Source (CM); Annotation Metadata. Flow labels: Query + Directives; Query Result + Relevant Documents; Query Context; Modified Query; Modified Query Result; Context + Directives; Metadata mapping.]
Features
Enables unstructured data to be “queried” using SQL
– Dynamically associates documents based on the global context of the SQL query
Seamlessly associates even external text data available through a keyword search
interface
– E.g. opinions at Epinions.com, finance articles at TheStreet.com
Does not need any additional skills for deployment and maintenance
– No new concepts or languages need to be learned by the DBA/end-users
Business Scenario: Customer Relationship Management
Structured Data
– Customer information related to credit cards, loans, insurance, etc.
Unstructured Information
– Applications, Complaints, Surveys, other communications
Scenario
– Some customers have cancelled their credit cards.
Business Scenario: Investment Information Agent
Structured Data
– Stock ticker (summarized and archived over past one week)
– Company information
– Sector (IT, Pharma, Banking, etc.)
– Institutional investors, promoters, etc.
Unstructured Information
– Analyst reports, related advisories, news stories/tickers, etc.
Scenario
– User has identified a “watch-list” of some target stocks (e.g. 10 top performing stocks over
the past week)
– This can be done in most online investing tools available today.
Business Scenario: E-Commerce (eBay)
Structured Data
– For each customer:
– Item history (watching + bidding + won + didn't win)
– For each seller:
– Item history (selling + sold + unsold)
– For each item
– Profile (product category, price)
Unstructured Data
– Profiles of items currently on sale
– Customer can search items based on keywords
– Profiles of registered trading assistants
– Seller can search based on category and additional keywords
Scenarios
– For Customer: Recommending items similar to “Items I Didn’t Win”
– Use SCORE to identify keywords characterizing items not won
– Search for similar items based on these keywords
– For Seller: Recommending relevant trading assistants
– Use SCORE to identify categories and additional keywords characterizing unsold items
– Search for trading assistants based on these keywords
ROI in terms of better user experience and higher revenue
Outline
SCORE Overview
Technical Details
– Computing the Context: Algorithm and Implementation
Experimental Study
Context of a SQL Query
The set of terms in the database that the query is focused on.
That is, the set of terms that:
a) Are popular in the query result, and
b) Differentiate the query result from the underlying database
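Q1: SELECT C1, C2 FROM R
WHERE C3 >= 1 AND C3 <= 4
Rows of R selected by Q1:
C1 C2 C3
A X 1
A X 2
A Y 3
B X 4
Context = {A}
Where each term t in column A is weighted by TW(Q, A, t) = POP(Q, A, t) * DIFF(Q, A, t)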
Computing the Context: Term Weights
Q1: SELECT C1, C2 FROM R
WHERE C3 >= 1 AND C3 <= 4
R:
C1 C2 C3
A X 1
A X 2
A Y 3
B X 4
A X 5
B Y 6
B X 7
B X 8
For A, NR = 4, NQ = 3
– POP(Q1, C1, A) = 3
– DIFF(Q1, C1, A) = log((1+4)/(1+1)) = 0.398
– TW(Q1, C1, A) = 3*0.398 = 1.194
For B, NR = 4, NQ = 1
– POP(Q1, C1, B) = 1
– DIFF(Q1, C1, B) = log((1+4)/(1+3)) = 0.097
– TW(Q1, C1, B) = 1*0.097 = 0.097
For X, NR = 6, NQ = 3
– POP(Q1, C2, X) = 3
– DIFF(Q1, C2, X) = log((1+4)/(1+3)) = 0.097
– TW(Q1, C2, X) = 3*0.097 = 0.291
For Y, NR = 2, NQ = 1
– POP(Q1, C2, Y) = 1
– DIFF(Q1, C2, Y) = log((1+4)/(1+1)) = 0.398
– TW(Q1, C2, Y) = 1*0.398 = 0.398
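The arithmetic above can be checked with a short sketch. Two points are assumptions rather than statements from the slides: the logarithm is taken base 10 (which matches the printed values), and the numerator 1+4 is read as one plus the number of rows of R outside the result. In this example the result size is also 4, so the numbers alone cannot distinguish the two readings. The helper name term_weights is hypothetical.

```python
import math
from collections import Counter

# Relation R from the slides, as (C1, C2, C3) rows.
R = [("A", "X", 1), ("A", "X", 2), ("A", "Y", 3), ("B", "X", 4),
     ("A", "X", 5), ("B", "Y", 6), ("B", "X", 7), ("B", "X", 8)]
COLS = {"C1": 0, "C2": 1, "C3": 2}

def term_weights(rows, result, column):
    """Return {term: (POP, DIFF, TW)} for one column.

    ASSUMPTION: DIFF = log10((1 + rows outside the result) /
                             (1 + occurrences of the term outside the result)).
    `result` must be a subset of `rows`.
    """
    i = COLS[column]
    n_r = Counter(row[i] for row in rows)     # NR: occurrences in R
    n_q = Counter(row[i] for row in result)   # NQ: occurrences in the result
    outside = len(rows) - len(result)
    weights = {}
    for term, pop in n_q.items():
        diff = math.log10((1 + outside) / (1 + n_r[term] - pop))
        weights[term] = (pop, diff, pop * diff)
    return weights

# Q1: SELECT C1, C2 FROM R WHERE C3 >= 1 AND C3 <= 4
q1 = [row for row in R if 1 <= row[2] <= 4]
c1 = term_weights(R, q1, "C1")
c2 = term_weights(R, q1, "C2")

for col, weights in (("C1", c1), ("C2", c2)):
    for term, (pop, diff, tw) in weights.items():
        print(col, term, pop, round(diff, 3), round(tw, 3))
```

Rounded to three decimals, the term weights agree with the slide: 1.194 for A, 0.097 for B, 0.291 for X and 0.398 for Y.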
Computing the Context from the SQL Query Result
A straightforward algorithm:
Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let K = { <t, A, w> | t ∈ A, A ∈ cols(Q), w = TW(Q, A, t) }
2. Return top N elements in K based on w
Limitations:
– The query result contains only a limited amount of information
– More information can be extracted from the “neighborhood” of the
query result in the underlying database
Need to look beyond the input query!
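A minimal sketch of the straightforward procedure, run on the example relation R and query Q1 used in these slides. The weight formula is an assumption (POP times a base-10 log ratio of rows outside the result), and query_context is a hypothetical name; rows are modelled as plain dictionaries rather than fetched from a database.

```python
import math
from collections import Counter

def query_context(table, result, columns, top_n):
    """Top-N <term, column, weight> triples, highest weight first.

    table  : all rows of the queried relation
    result : rows selected by the query (a subset of table)
    """
    outside = len(table) - len(result)
    triples = []
    for col in columns:
        n_r = Counter(row[col] for row in table)
        n_q = Counter(row[col] for row in result)
        for term, pop in n_q.items():
            # ASSUMED weight: TW = POP * log10((1 + outside) / (1 + NR - NQ))
            tw = pop * math.log10((1 + outside) / (1 + n_r[term] - pop))
            triples.append((term, col, tw))
    triples.sort(key=lambda k: k[2], reverse=True)
    return triples[:top_n]

R = [dict(C1=a, C2=b, C3=c) for a, b, c in [
    ("A", "X", 1), ("A", "X", 2), ("A", "Y", 3), ("B", "X", 4),
    ("A", "X", 5), ("B", "Y", 6), ("B", "X", 7), ("B", "X", 8)]]

# Q1: SELECT C1, C2 FROM R WHERE C3 >= 1 AND C3 <= 4
q1_rows = [r for r in R if 1 <= r["C3"] <= 4]
context = query_context(R, q1_rows, ["C1", "C2"], top_n=1)
print([term for term, _, _ in context])
```

With N = 1 the sketch selects the single term A, matching Context = {A} in the earlier example.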
Computing the Context: Looking Beyond the Input Query
Computing the Context: Including Projected-out Columns
Modified algorithm:
Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
3. Return top N elements in K based on w
Computing the Context: Exploring the Neighboring Tables
For each row in the query result, follow the foreign-key “pointers” to rows in
tables not included in the query
Amounts to augmenting the input query with foreign-key joins with other
tables
Solution:
– An iterative greedy heuristic that picks the most relevant augmentation in
each iteration
Computing the Context: Exploring the Neighboring Tables
Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let S = { A | A is a FK, A ∈ cols(AQ) }
3. Let n = 0
4. While S ≠ ∅ and n < MAXAUG
5.     Let n = n + 1
6.     Let F = “best” FK ∈ S
7.     Let S = S – { F }
8.     Let R be the relation referred to by F
9.     Let AQ = AQ JOIN_F R
10.    For each FK A ∈ cols(R)
11.        Let S = S ∪ { A }
12. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
13. Return top N elements in K based on w
How to find the “best” foreign key column in S?
Computing the Context: Comparing Foreign Keys
Given FK columns F1, F2 ∈ cols(Q) in the query result, the query Q is more
focused on column F1 than column F2 if there exists a term t ∈ F1 such that
Q’s focus on t is more than its focus on any term in F2
For a given query Q, define the column weight of the foreign-key column F ∈
cols(Q) in the query result as:
CW(Q, F) = max_{t ∈ F} TW(Q, F, t)
Then,
Given FK columns F1, F2 ∈ cols(Q) in the query result, F1 is “better” than F2
if CW(Q, F1) > CW(Q, F2)
Computing the Context: Final Algorithm
Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let S = { A | A is a FK, A ∈ cols(AQ) }
3. For each A ∈ S
4.     Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.     Let n = n + 1
8.     Let F = argmax_{A ∈ S} CW(A)
9.     Let S = S – { F }
10.    Let R be the relation referred to by F
11.    Let AQ = AQ JOIN_F R
12.    For each FK A ∈ cols(R)
13.        Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
14.        Let S = S ∪ { A }
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
16. Return top N elements in K based on w
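The greedy procedure can be sketched over in-memory tables. Several things here are assumptions made for illustration only: rows are dictionaries with globally unique column names, a foreign-key column shares its name with the key it references, term weights are measured against the same query without its selection (joined the same way), and the weight formula is POP times a base-10 log ratio. All names and the toy data (trades, companies, sectors) are hypothetical.

```python
import math
from collections import Counter

def term_weights(base, result, col):
    """ASSUMED TW = NQ * log10((1 + |base| - |result|) / (1 + NR - NQ))."""
    n_r = Counter(r[col] for r in base)
    n_q = Counter(r[col] for r in result)
    outside = len(base) - len(result)
    return {t: q * math.log10((1 + outside) / (1 + n_r[t] - q))
            for t, q in n_q.items()}

def fk_join(rows, fk_col, ref_rows, ref_key):
    """Follow fk_col from each row to the row it references (must exist)."""
    index = {r[ref_key]: r for r in ref_rows}
    return [{**row, **index[row[fk_col]]} for row in rows]

def query_context(base, result, fks, tables, max_aug, top_n):
    """base: query without selection; result: AQ rows (a subset of base)."""
    if not result:
        return []

    def cw(col):  # column weight: max term weight in the column
        return max(term_weights(base, result, col).values())

    pending = {c: cw(c) for c in result[0] if c in fks}   # S with CW values
    n = 0
    while pending and n < max_aug:
        n += 1
        best = max(sorted(pending), key=pending.get)      # argmax CW
        del pending[best]
        ref_table, ref_key = fks[best]
        known = set(result[0])
        base = fk_join(base, best, tables[ref_table], ref_key)
        result = fk_join(result, best, tables[ref_table], ref_key)
        for col in result[0]:
            if col not in known and col in fks:           # newly added FKs
                pending[col] = cw(col)
    triples = [(t, col, w) for col in result[0]
               for t, w in term_weights(base, result, col).items()]
    triples.sort(key=lambda k: k[2], reverse=True)
    return triples[:top_n]

trades = [{"trade": i, "company": c, "gain": g} for i, (c, g) in enumerate(
    [("c1", 9), ("c2", 8), ("c3", 7), ("c4", 1),
     ("c4", 2), ("c5", 1), ("c5", 3), ("c6", 2)], start=1)]
tables = {
    "companies": [{"company": c, "sector": s} for c, s in [
        ("c1", "IT"), ("c2", "IT"), ("c3", "IT"),
        ("c4", "Pharma"), ("c5", "Banking"), ("c6", "Pharma")]],
    "sectors": [{"sector": s, "outlook": o} for s, o in [
        ("IT", "growth"), ("Pharma", "stable"), ("Banking", "growth")]],
}
fks = {"company": ("companies", "company"), "sector": ("sectors", "sector")}
result = [t for t in trades if t["gain"] >= 7]
context = query_context(trades, result, fks, tables, max_aug=2, top_n=2)
print([(term, col) for term, col, _ in context])
```

In this toy data the selected trades share nothing visible in their own columns, but after following the foreign keys the sector IT emerges as the strongest term, which is the kind of information the augmentation is meant to surface.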
Implementation Approaches
Brute-Force Approach
– Simple-minded implementation: Uses exact query results to compute
term weights
Histogram-Based Approach
– Uses histogram-based estimates instead of exact query results to
compute term weights
Brute-Force Approach
Simple-minded implementation: Use exact query results
Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints   [Evaluate AQ]
2. Let S = { A | A is a FK, A ∈ cols(AQ) }
3. For each A ∈ S
4.     Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.     Let n = n + 1
8.     Let F = argmax_{A ∈ S} CW(A)
9.     Let S = S – { F }
10.    Let R be the relation referred to by F
11.    Let AQ = AQ JOIN_F R   [Evaluate AQ]
12.    For each FK A ∈ cols(R)
13.        Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
14.        Let S = S ∪ { A }
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
16. Return top N elements in K based on w
Total MAXAUG+1 evaluations: too expensive!
MAXAUG is the maximum number of augmentations allowed
N is the maximum number of terms allowed in the context
Histogram-Based Approach
Use histogram-based estimates instead of exact query results
Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints   [Evaluate AQ; compute the histograms for FK cols in AQ’s result]
2. Let S = { A | A is a FK, A ∈ cols(AQ) }
3. For each A ∈ S
4.     Let CW(A) = max_{t ∈ A} TW(AQ, A, t)   [CW estimated based on the column histograms]
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.     Let n = n + 1
8.     Let F = argmax_{A ∈ S} CW(A)
9.     Let S = S – { F }
10.    Let R be the relation referred to by F
11.    Let AQ = AQ JOIN_F R   [Estimate histograms for the columns added to AQ]
12.    For each FK A ∈ cols(R)
13.        Let CW(A) = max_{t ∈ A} TW(AQ, A, t)   [CW estimated based on the column histograms]
14.        Let S = S ∪ { A }
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }   [TW estimated based on the column histograms; computed context may have irrelevant terms!]
16. Return top N elements in K based on w
Total 1 evaluation: inexpensive … but unsafe!
MAXAUG is the maximum number of augmentations allowed
N is the maximum number of terms allowed in the context
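Why histogram estimates can introduce irrelevant terms is easiest to see on a toy case. The slides do not say which kind of histogram or estimation formula is used, so everything below is an assumption for illustration: a bucketed count histogram over a numeric foreign-key column, with each bucket's count spread uniformly over the referenced keys in that bucket. All names and data are hypothetical.

```python
from collections import Counter

def bucket_histogram(values, bucket_of):
    """Count values per bucket; the individual values are forgotten."""
    return Counter(bucket_of(v) for v in values)

def estimate_term_counts(hist, ref_rows, key, col, bucket_of):
    """Estimate term frequencies of `col` after an FK join, from `hist` only.

    ASSUMPTION: each bucket's count is spread uniformly over the
    referenced keys that fall into that bucket.
    """
    rows_in_bucket = {}
    for row in ref_rows:
        rows_in_bucket.setdefault(bucket_of(row[key]), []).append(row)
    estimate = Counter()
    for bucket, count in hist.items():
        rows = rows_in_bucket.get(bucket, [])
        for row in rows:
            estimate[row[col]] += count / len(rows)
    return estimate

companies = [
    {"id": 1, "sector": "IT"}, {"id": 2, "sector": "IT"},
    {"id": 3, "sector": "Pharma"}, {"id": 4, "sector": "Pharma"},
    {"id": 5, "sector": "Banking"}, {"id": 6, "sector": "Banking"},
]

def bucket_of(company_id):
    return (company_id - 1) // 3     # two buckets: ids 1-3 and ids 4-6

fk_values = [1, 2, 2]                # FK column of the query result
hist = bucket_histogram(fk_values, bucket_of)
estimated = estimate_term_counts(hist, companies, "id", "sector", bucket_of)

# Exact counts, which would require evaluating the join.
by_id = {c["id"]: c["sector"] for c in companies}
exact = Counter(by_id[v] for v in fk_values)

print(dict(estimated), dict(exact))
```

The exact join yields only IT, while the estimate also assigns weight to Pharma, a term that does not occur in the true result. This is the sense in which a purely histogram-based context may contain irrelevant terms.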
Modified-Histogram Approach (Actually Implemented)
Judiciously use histogram-based estimates as well as exact query results
Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints   [Evaluate AQ; compute the histograms for FK cols in AQ’s result]
2. Let S = { A | A is a FK, A ∈ cols(AQ) }
3. For each A ∈ S
4.     Let CW(A) = max_{t ∈ A} TW(AQ, A, t)   [CW estimated based on the column histograms]
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.     Let n = n + 1
8.     Let F = argmax_{A ∈ S} CW(A)
9.     Let S = S – { F }
10.    Let R be the relation referred to by F
11.    Let AQ = AQ JOIN_F R   [Estimate histograms for the columns added to AQ]
12.    For each FK A ∈ cols(R)
13.        Let CW(A) = max_{t ∈ A} TW(AQ, A, t)   [CW estimated based on the column histograms]
14.        Let S = S ∪ { A }
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }   [Evaluate AQ; TW and CW computed using the exact result of AQ]
16. Return top N elements in K based on w
Total 2 evaluations: inexpensive … and safe
MAXAUG is the maximum number of augmentations allowed
N is the maximum number of terms allowed in the context
Conclusion
SCORE provides a novel way of integrating structured and unstructured information
– Enables unstructured data to be “queried” using SQL
– Works even when the source of the unstructured data is external, with restricted access
through a keyword search interface (e.g. Epinions.com, TheStreet.com)
– Achieves high accuracy with very reasonable execution overheads
Non-disruptive deployment
– Works with the existing database system/search engine in a non-intrusive manner,
interacting using only standard interfaces
High ROI
– Provides the infrastructure to create applications that provide better user experience,
effectively manage customer relationships, make better investment decisions, etc.
Addresses an important gap between end-user needs and the capabilities offered by
existing structured and unstructured information integration approaches