Score: Context-Oriented Structured and Unstructured Information Integration

The document discusses different approaches for integrating structured and unstructured information, including keyword query based solutions that expose relational data to a search engine, and SQL query based solutions that expose text data to a relational engine. It notes limitations of both approaches and introduces SCORE as a new solution that allows querying both types of data using SQL without requiring specification of keywords.

Uploaded by

Vaibhav Jindal
SCORE

Context-Oriented Structured and Unstructured Information Integration

Roy, P., Mohania, M., Bamba, B., and Raman, S. Towards Automatic Association of Relevant Unstructured Content with Structured Query Results. In CIKM (2005).
Bamba, B., Roy, P., and Mohania, M. OSQR: Overlapping Clustering of Query Results. In CIKM (2005).

Structured and Unstructured Information
• Information content in an enterprise can be structured or unstructured
– Structured content: payroll, sales orders, invoices, customer profiles, etc.
– Unstructured content: emails, reports, web pages, complaints, information on sales, customers, competitors, products, suppliers and people, etc.

• Historically, structured and unstructured data management technologies have evolved separately
• The result is an artificial separation between these two “kinds” of information

[Figure: two separate stacks. Structured data management: DB2 query → DBMS (<20% of enterprise data) → DB2 result. Unstructured data management: CM query → content management system (>80% of enterprise data) → CM result.]

• Enterprises are realizing the need to bridge this separation, and are demanding integrated retrieval, management and analysis of both structured and unstructured content

Structured and Unstructured Information Integration:
A Brief Background on Existing Solutions
Existing solutions can be classified in terms of the query paradigm used:

 Keyword Query Based Solutions (DB2 ESE, DbXplorer/BANKS [ICDE02])


• Relational data exposed to search engine as virtual text documents
• Query both structured and unstructured information using keywords

 SQL Query Based Solutions (SQL LIKE predicate, DB2 NetSearch Extender)
• Text data exposed to relational engine as virtual tables with text columns
• Query both structured and unstructured information using SQL
• Provide SQL primitives to search text in table columns using a set of
keywords
Keyword Query Based Solution: DB2 ESE

[Figure: Keyword Query → DB2 Enterprise Search Extender]

Keyword Query Based Solution: DbXplorer/BANKS [ICDE02]

[Figure: Keyword Query → DbXplorer/BANKS over a relational table (columns C1, C2, C3); Keyword Query → Search Engine]

Keyword Query Based Solutions: Summary

 Advantage: Simplicity!

 Disadvantages
• Less expressive (as compared to SQL)
• How to ask for the information related to the five best performing stocks in
the past week?
• Need to specify a set of keywords that succinctly encodes the information need
• Not always easy
SQL Query Based Solution: Standard SQL LIKE Predicate

SELECT stocks.price, docs.text
FROM stocks, docs
WHERE (stocks.name = 'IBM'
       AND docs.text LIKE '% IBM %')
   OR (stocks.name = 'ORCL'
       AND docs.text LIKE '% ORCL %')

[Figure: the query is submitted to Information Integrator, which runs it over the relational table (columns C1, C2, C3)]

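The LIKE-based query can be tried out end to end with a small in-memory database. The sketch below is only an illustration: it uses Python's sqlite3 module rather than DB2, and the rows in stocks and docs (prices, document text) are made-up sample data, not taken from the slides.

```python
import sqlite3

# In-memory database with hypothetical sample rows (not from the slides).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stocks (name TEXT, price REAL);
    CREATE TABLE docs (text TEXT);
    INSERT INTO stocks VALUES ('IBM', 80.0), ('ORCL', 12.5), ('DELL', 40.0);
    INSERT INTO docs VALUES
        ('Analysts expect IBM to gain'),
        ('A note on ORCL licensing'),
        ('Unrelated text');
""")

# Same shape as the slide's query: each stock name is repeated as a keyword.
rows = conn.execute("""
    SELECT stocks.name, stocks.price, docs.text
    FROM stocks, docs
    WHERE (stocks.name = 'IBM'  AND docs.text LIKE '% IBM %')
       OR (stocks.name = 'ORCL' AND docs.text LIKE '% ORCL %')
    ORDER BY stocks.name
""").fetchall()

for row in rows:
    print(row)
# ('IBM', 80.0, 'Analysts expect IBM to gain')
# ('ORCL', 12.5, 'A note on ORCL licensing')
```

Note that every company name has to be written twice, once in the structured predicate and once as a keyword.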
SQL Query Based Solution: Net Search Extender

SELECT stocks.price, docs.text
FROM stocks, docs
WHERE (stocks.name = 'IBM'
       AND CONTAINS(docs.text, "IBM"))
   OR (stocks.name = 'ORCL'
       AND CONTAINS(docs.text, "ORCL"))

[Figure: the query is submitted to Information Integrator; the CONTAINS(…) predicates are evaluated by Net Search Extender]

SQL Query Based Solutions: Summary

 Advantages:
• More expressive – can specify more involved and sophisticated queries

 Disadvantages:
• The unstructured data is still queried using keywords
• Need to specify a set of keywords that succinctly encodes the information need
• Not always easy
• The SQL query and the embedded keyword query encode the same information
need
• Redundant effort
• Association of documents with tuples (local context), not with the entire result
(global context)
• Same documents get attached to “IBM” when “IBM” is queried with “ORCL”
as when “IBM” is queried with “DELL”
SCORE Overview

User request: “Get the 3 companies with max price variation” (keywords not required)

SELECT name, max(price) - min(price)
FROM stocks
GROUP BY name
ORDER BY 2 DESC
FETCH FIRST 3 ROWS ONLY

Optional directive: “Doctype:Patents”

[Figure: SCORE passes the SQL query to the RDBMS, derives the context keywords “IBM”, “ORCL”, “MSFT”, “Database”, “Data Cloud Services”, and sends them together with the directive “Doctype:Patents” to the search engine. The user receives the query result and related documents.]

Database Group, IRL

Motivating Example

DBMS

PROJECTS
Proj | Name | Description | Group
1 | CORTEX | Policy Management | 6

ORG
Org | Org Name | Address
5 | IRL | Hauz Khas, New Delhi

PROJEMPS
Proj | Emp
1 | 7
1 | 4

EMPLOYEES
Emp | Name | Group
4 | Mukesh | 6
7 | Manish | 6

GROUPS
Group | Name | Manager | Org
6 | Databases | 4 | 5

Content Management System

Documents
Report.doc: “… Report on Database Research at IRL … Policy Definition …”
New.txt: “Some other document”
Another.pdf: “Expense Reimbursement Policy at IRL”

Document metadata
DocID | Author | Year
Report.doc | Mukesh | 2000
Another.pdf | Manish | 2002
New.txt | Prasan | 2003

Query
Select Name, Description
From PROJECTS
Where Name = 'CORTEX'

Directives
Year < 2003

Result
Name | Description
CORTEX | Policy Management
Report.doc: “… Report on Database Research at IRL … Policy Definition …”

Report.doc is additionally retrieved based on the context. Note: the document does not mention CORTEX.

Filled in by the user:

CORE QUERY

Select GROUPS.Name, PROJECTS.Description
From PROJECTS, PROJEMPS, EMPLOYEES, GROUPS
Where PROJECTS.Proj# = PROJEMPS.Proj#
And PROJEMPS.Emp# = EMPLOYEES.Emp#
And EMPLOYEES.Grp# = GROUPS.Grp#
And PROJECTS.Name = 'CORTEX'

SEARCH DIRECTIVES

Select ARTICLES.id
From ARTICLES
Where ARTICLES.contains("Report AND SET-OR($GROUPS.Name)")
And ARTICLES.year > 2001
ORDER BY RELEVANCE
USING THRESHOLD = 0.8

– contains(…): restrict to documents containing “Report” and at least one group name
– year > 2001: restrict to documents published after 2001
– ORDER BY RELEVANCE USING THRESHOLD = 0.8: order by relevance to the structured query, with a cutoff threshold of 0.8
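
One way to read SET-OR($GROUPS.Name) is as a placeholder that gets replaced by an OR over the distinct GROUPS.Name values in the core query result. The sketch below follows that reading. The function name expand_set_or, the row format, and the output syntax of the keyword query are assumptions made for illustration; the slides do not specify how SCORE performs the substitution.

```python
import re

def expand_set_or(directive: str, result_rows: list[dict]) -> str:
    """Replace each SET-OR($TABLE.Column) with an OR over the distinct values
    of that column in the core query result (assumed semantics)."""
    def substitute(match: re.Match) -> str:
        column = match.group(1)  # e.g. "GROUPS.Name"
        values = sorted({row[column] for row in result_rows})
        return "(" + " OR ".join(f'"{v}"' for v in values) + ")"
    return re.sub(r"SET-OR\(\$([\w.]+)\)", substitute, directive)

# Core query result for the CORTEX example: one group, "Databases".
rows = [{"GROUPS.Name": "Databases", "PROJECTS.Description": "Policy Management"}]
print(expand_set_or("Report AND SET-OR($GROUPS.Name)", rows))
# Report AND ("Databases")
```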
Main Idea

 Specify information need in terms of SQL over the structured database


– Additional information needs specified using “directives” (optional)

 Automatically synthesize the “context” of the SQL query from its result and
the known semantic dependencies in the structured data

 Use this context and the directives to retrieve the unstructured data

13
Computing the Context
• Formalize the criterion for selecting terms as a part of the context of a given SQL query
• Develop an algorithm, based on the above formalization, that computes the context by analyzing the result of the SQL query as well as the related data in the underlying database
– Determine the relevant tables to explore without any user intervention
– Handle numeric-valued columns in addition to string-valued columns
• Develop algorithms to refine the context
– Ontology based expansion of the set of keywords forming the context
– Exploit the column name information maintained with each keyword
– The column names are mapped to tag names that were used to tag the documents (exploits Criollo to store the column name → document tag mapping)
– Disambiguated keywords → more precise context → more accurate integration
– Aggregated context of the session so far

Example: disambiguating a keyword by its column name
Consider the keyword Apple picked from the column named Company. Simply using Apple to retrieve documents would retrieve documents related to the company Apple as well as the fruit Apple!
Suppose the documents were tagged using the vocabulary Fruit and Organization.
– Doc1: “I had an <Fruit>apple</Fruit> for breakfast”
– Doc2: “Steve Jobs founded <Organization>Apple</Organization>”
SCORE can disambiguate the keyword Apple by associating with it the tag Organization that corresponds to the column name Company
– Uses Criollo to get the mapping from column names to document tags
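
The Apple example can be sketched in a few lines. The dictionary COLUMN_TO_TAG stands in for the column name → document tag mapping (kept in Criollo according to the slide); the dictionary itself, the function names, and the regex-based matching are assumptions for illustration, not SCORE's actual retrieval mechanism.

```python
import re

# Hypothetical stand-in for the column-name -> document-tag mapping.
COLUMN_TO_TAG = {"Company": "Organization"}

DOCS = {
    "Doc1": "I had an <Fruit>apple</Fruit> for breakfast",
    "Doc2": "Steve Jobs founded <Organization>Apple</Organization>",
}

def naive_docs(keyword, docs=DOCS):
    """Plain keyword match that ignores the tags."""
    return [d for d, text in docs.items() if keyword.lower() in text.lower()]

def matching_docs(keyword, column, docs=DOCS):
    """Match the keyword only inside the tag mapped to the source column."""
    tag = COLUMN_TO_TAG[column]
    pattern = re.compile(
        rf"<{tag}>\s*{re.escape(keyword)}\s*</{tag}>", re.IGNORECASE
    )
    return [d for d, text in docs.items() if pattern.search(text)]

print(naive_docs("Apple"))                # ['Doc1', 'Doc2']
print(matching_docs("Apple", "Company"))  # ['Doc2']
```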
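Architecture

[Figure: SCORE architecture.
– The user interface/application sends Query + Directives to SCORE and receives Query Result + Relevant Documents.
– Inside SCORE, the Query Handler passes Query Result + Query Context to the Context Handler and gets back Relevant Documents; a Metadata Repository holds the metadata mappings.
– The Query Handler sends the Modified Query to the RDBMS (structured data source) and receives the Modified Query Result.
– The Context Handler sends Context + Directives to the Search Engine (unstructured data sources, CM), which also uses annotation metadata.]
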
Features
 Enables unstructured data to be “queried” using SQL
– Dynamically associates documents based on the global context of the SQL query

 Seamlessly associates even external text data available through a keyword search
interface
– E.g. opinions at Epinions.com, finance articles at TheStreet.com

 Does not need any additional skills for deployment and maintenance
– No new concepts, languages need to be learned by the DBA/end-users

 Does not need any changes in existing infrastructure for deployment


– Does not need a common schema, or specialized indexes at any source
– Does not need static association between the tuples and docs
– Does not need any external semantic information (e.g. ontologies)
– But can use if available

 Does not disrupt existing applications


– Works in application space using standard interfaces

16
Business Scenario: Customer Relationship Management

 Structured Data
– Customer information related to credit cards, loans, insurance, etc.

 Unstructured Information
– Applications, Complaints, Surveys, other communications

 Scenario
– Some customers have cancelled their credit cards.

SCORE enables the management to:


– Analyze these customers’ relationship with the organization (by looking at common
services, etc.) and find “features” that differentiate them from other customers
– Use the result of this analysis to prioritize the complaints in order to avoid losing any more
customers

ROI in terms of effective problem determination and low customer attrition

17
Business Scenario: Investment Information Agent
 Structured Data
– Stock ticker (summarized and archived over past one week)
– Company information
– Sector (IT, Pharma, Banking, etc.)
– Institutional investors, promoters, etc.

 Unstructured Information
– Analyst reports, related advisories, news stories/tickers, etc.

 Scenario
– User has identified a “watch-list” of some target stocks (e.g. 10 top performing stocks over
the past week)
– This can be done in most online investing tools available today.

SCORE enables the user to:


– Analyze the company information to find the common “features” among these
companies that differentiate them from other stocks (e.g. their sector, common
institutional investors, etc.)
– Use this information to retrieve related reports, advisories, news stories/tickers, etc.

ROI in terms of time saved in analysis, leading to better investment decisions

18
Business Scenario: E-Commerce (eBay)
 Structured Data
– For each customer:
– Item history (watching + bidding + won + didn't win)
– For each seller:
– Item history (selling + sold + unsold)
– For each item
– Profile (product category, price)

 Unstructured Data
– Profiles of items currently on sale
– Customer can search items based on keywords
– Profiles of registered trading assistants
– Seller can search based on category and additional keywords

 Scenarios
– For Customer: Recommending items similar to “Items I Didn’t Win”
– Use SCORE to identify keywords characterizing items not won
– Search for similar items based on these keywords
– For Seller: Recommending relevant trading assistants
– Use SCORE to identify categories and additional keywords characterizing unsold items
– Search for trading assistants based on these keywords
ROI in terms of better user experience and higher revenue

19
Outline

 SCORE Overview

 Technical Details
– Computing the Context: Algorithm and Implementation

 Experimental Study

20
Context of a SQL Query

• The set of terms in the database that the query is focused on. That is, the set of terms that:
a) are popular in the query result, and
b) differentiate the query result from the underlying database

Relation R
C1 | C2 | C3
A | X | 1
A | X | 2
A | Y | 3
B | X | 4
A | X | 5
B | Y | 6
B | X | 7
B | X | 8

Q1: SELECT C1, C2 FROM R
WHERE C3 >= 1 AND C3 <= 4
Context = {A}

Q2: SELECT C1, C2 FROM R
WHERE C3 >= 5 AND C3 <= 8
Context = {B}
Computing the Context: Term Weights
• For a given query Q on a single relation R, define the term weight of the term t present in column $A \in \mathrm{cols}(Q)$ in the query result as:

$$\mathrm{TW}(Q, A, t) = \mathrm{POP}(Q, A, t) \cdot \mathrm{DIFF}(Q, A, t)$$

where

$$\mathrm{POP}(Q, A, t) = |\sigma_{A = t}(Q)| = N_Q$$

$$\mathrm{DIFF}(Q, A, t) = \log \frac{1 + N_R}{1 + N_R - N_Q}$$

Here $N_Q = |\sigma_{A = t}(Q)|$ is the number of times t appears in column A in Q, and $N_R = |\sigma_{A = t}(R)|$ is the number of times t appears in column A in R.

Based on the “TF-IDF” metrics popular in the information-retrieval community. However:
• The “TF” is computed over the query result
• The “IDF” is computed over the remainder of the underlying relation

Computing the Context: Term Weights (Example)

Q1: SELECT C1, C2 FROM R
WHERE C3 >= 1 AND C3 <= 4

R is the eight-row relation shown under “Context of a SQL Query”. Logarithms are base 10, and DIFF = log((1+NR)/(1+NR-NQ)).

• For A, NR = 4, NQ = 3
– POP(Q1, C1, A) = 3
– DIFF(Q1, C1, A) = log((1+4)/(1+1)) = 0.398
– TW(Q1, C1, A) = 3*0.398 = 1.194
• For B, NR = 4, NQ = 1
– POP(Q1, C1, B) = 1
– DIFF(Q1, C1, B) = log((1+4)/(1+3)) = 0.097
– TW(Q1, C1, B) = 1*0.097 = 0.097
• For X, NR = 6, NQ = 3
– POP(Q1, C2, X) = 3
– DIFF(Q1, C2, X) = log((1+6)/(1+3)) = 0.243
– TW(Q1, C2, X) = 3*0.243 = 0.729
• For Y, NR = 2, NQ = 1
– POP(Q1, C2, Y) = 1
– DIFF(Q1, C2, Y) = log((1+2)/(1+1)) = 0.176
– TW(Q1, C2, Y) = 1*0.176 = 0.176
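
The term-weight formula can be checked mechanically. The sketch below applies TW = NQ · log10((1+NR)/(1+NR-NQ)) to relation R and query Q1, counting NQ and NR directly from the rows. The function name term_weights and the tuple layout of R are choices made for this illustration.

```python
import math
from collections import Counter

# Relation R from the slides, as (C1, C2, C3) rows.
R = [("A", "X", 1), ("A", "X", 2), ("A", "Y", 3), ("B", "X", 4),
     ("A", "X", 5), ("B", "Y", 6), ("B", "X", 7), ("B", "X", 8)]
COLS = {"C1": 0, "C2": 1, "C3": 2}

def term_weights(result_rows, relation_rows, column):
    """TW = N_Q * log10((1 + N_R) / (1 + N_R - N_Q)) for each term in the column."""
    i = COLS[column]
    n_q = Counter(row[i] for row in result_rows)    # occurrences in the query result
    n_r = Counter(row[i] for row in relation_rows)  # occurrences in the relation
    return {t: n_q[t] * math.log10((1 + n_r[t]) / (1 + n_r[t] - n_q[t]))
            for t in n_q}

q1 = [row for row in R if 1 <= row[2] <= 4]  # rows selected by Q1
weights = {(col, t): w
           for col in ("C1", "C2")
           for t, w in term_weights(q1, R, col).items()}
for key in sorted(weights):
    print(key, round(weights[key], 3))
```

The term A in column C1 receives the largest weight, which agrees with Context = {A} for Q1.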

23
Computing the Context from the SQL Query Result

• A straightforward algorithm:

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let K = { <t, A, w> | t ∈ A, A ∈ cols(Q), w = TW(Q, A, t) }
2. Return top N elements in K based on w

N is the maximum number of terms allowed in the context
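
A minimal sketch of this two-step procedure, run on relation R with the queries Q1 and Q2 from the earlier slide. Rows are plain dictionaries and the query result is computed in Python instead of by an RDBMS; both are simplifications for illustration.

```python
import heapq
import math
from collections import Counter

def term_weights(result, relation, column):
    """TW for every term of `column` that occurs in the query result."""
    n_q = Counter(row[column] for row in result)
    n_r = Counter(row[column] for row in relation)
    return {t: n_q[t] * math.log10((1 + n_r[t]) / (1 + n_r[t] - n_q[t]))
            for t in n_q}

def query_context(result, relation, columns, n):
    """Step 1: build K = {<t, A, w>}; step 2: return the top n by weight."""
    k = [(t, a, w)
         for a in columns
         for t, w in term_weights(result, relation, a).items()]
    return heapq.nlargest(n, k, key=lambda entry: entry[2])

R = [dict(C1=a, C2=b, C3=c) for a, b, c in [
    ("A", "X", 1), ("A", "X", 2), ("A", "Y", 3), ("B", "X", 4),
    ("A", "X", 5), ("B", "Y", 6), ("B", "X", 7), ("B", "X", 8)]]

q1 = [r for r in R if 1 <= r["C3"] <= 4]
q2 = [r for r in R if 5 <= r["C3"] <= 8]
print([t for t, _, _ in query_context(q1, R, ["C1", "C2"], n=1)])  # ['A']
print([t for t, _, _ in query_context(q2, R, ["C1", "C2"], n=1)])  # ['B']
```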

 Limitations:
– The query result contains only a limited amount of information
– More information can be extracted from the “neighborhood” of the
query result in the underlying database
 Need to look beyond the input query!

24
Computing the Context: Looking Beyond the Input Query

 Including the projected-out columns


– Typically, attributes projected out for convenience, but may carry valuable
information

 Exploring the neighboring tables


– Related information may be spread across multiple tables due to normalization
– These related tables are connected by foreign-key relationships
– Exploit the relationships in the “forward” direction (FK → PK)
• Commonplace:
Star/Snowflake schemas,
Relational schemas for storing hierarchies (directories, XML), etc.

25
Computing the Context: Including Projected-out Columns

• Remove projection constraints from the input query

• Modified algorithm:

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
3. Return top N elements in K based on w

N is the maximum number of terms allowed in the context
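
The effect of dropping the projection can be seen on relation R. In this sketch a query is modelled as a selection predicate plus an optional projection list; that representation, and the helper name evaluate, are assumptions for illustration. The query projects only C1, so terms of C2 become visible to the context only through AQ.

```python
import math
from collections import Counter

R = [dict(C1=a, C2=b, C3=c) for a, b, c in [
    ("A", "X", 1), ("A", "X", 2), ("A", "Y", 3), ("B", "X", 4),
    ("A", "X", 5), ("B", "Y", 6), ("B", "X", 7), ("B", "X", 8)]]

def evaluate(relation, predicate, projection=None):
    """Apply the selection; projection=None keeps every column (this is AQ)."""
    rows = [r for r in relation if predicate(r)]
    if projection is None:
        return rows
    return [{c: r[c] for c in projection} for r in rows]

def term_weights(result, relation, column):
    n_q = Counter(row[column] for row in result)
    n_r = Counter(row[column] for row in relation)
    return {t: n_q[t] * math.log10((1 + n_r[t]) / (1 + n_r[t] - n_q[t]))
            for t in n_q}

def selects_first_four(row):
    return row["C3"] <= 4

q = evaluate(R, selects_first_four, projection=["C1"])  # Q: SELECT C1 ...
aq = evaluate(R, selects_first_four)                    # AQ: all columns kept

print(sorted(q[0]))   # ['C1']
print(sorted(aq[0]))  # ['C1', 'C2', 'C3']

# Candidate context terms over cols(AQ), including the projected-out C2 and C3.
context = {(t, a): w for a in aq[0] for t, w in term_weights(aq, R, a).items()}
print(round(context[("X", "C2")], 3))  # 0.729
```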

26
Computing the Context: Exploring the Neighboring Tables

 For each row in the query result, follow the foreign-key “pointers” to rows in
tables not included in the query

 Amounts to augmenting the input query with foreign-key joins with other
tables

 Problem: For scalability, need to limit the number of augmentations


– How to select the subset of possible augmentations efficiently?

 Solution:
– An iterative greedy heuristic that picks the most relevant augmentation in
each iteration

27
Computing the Context: Exploring the Neighboring Tables
Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let S = { A | A is a FK ∈ cols(AQ) }
3. Let n = 0
4. While S ≠ ∅ and n < MAXAUG
5.     Let n = n + 1
6.     Let F = “best” FK ∈ S
7.     Let S = S – { F }
8.     Let R be the relation referred by F
9.     Let AQ = AQ JOIN_F R
10.     For each FK A ∈ cols(R)
11.         Let S = S ∪ { A }
12. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
13. Return top N elements in K based on w

Open question (step 6): how to find the “best” foreign key column in S?

MAXAUG is the maximum number of augmentations allowed
N is the maximum number of terms allowed in the context

28
Computing the Context: Comparing Foreign Keys

• Given FK columns F1, F2 ∈ cols(Q) in the query result, the query Q is more focused on column F1 than on column F2 if there exists a term t ∈ F1 such that Q’s focus on t is greater than its focus on any term in F2

• For a given query Q, define the column weight of the foreign-key column F ∈ cols(Q) in the query result as:

$$\mathrm{CW}(Q, F) = \max_{t \in F} \mathrm{TW}(Q, F, t)$$

• Then, given FK columns F1, F2 ∈ cols(Q) in the query result, F1 is “better” than F2
if CW(Q, F1) > CW(Q, F2)
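
The column-weight comparison can be sketched as follows. The relation with foreign-key columns dept and site is hypothetical sample data invented for this example, and best_fk is an illustrative helper name.

```python
import math
from collections import Counter

def term_weights(result, relation, column):
    n_q = Counter(row[column] for row in result)
    n_r = Counter(row[column] for row in relation)
    return {t: n_q[t] * math.log10((1 + n_r[t]) / (1 + n_r[t] - n_q[t]))
            for t in n_q}

def column_weight(result, relation, fk):
    """CW(Q, F): the largest term weight among the terms of column F."""
    return max(term_weights(result, relation, fk).values())

def best_fk(result, relation, fk_columns):
    """The FK column the query is most focused on."""
    return max(fk_columns, key=lambda f: column_weight(result, relation, f))

# Hypothetical relation with two foreign-key columns.
relation = [{"dept": d, "site": s}
            for d, s in zip([1, 1, 1, 2, 2, 2, 3, 3],
                            [10, 20, 10, 20, 10, 20, 10, 20])]
result = relation[:3]  # the query selects exactly the rows of dept 1

print(round(column_weight(result, relation, "dept"), 3))  # 1.806
print(round(column_weight(result, relation, "site"), 3))  # 0.444
print(best_fk(result, relation, ["dept", "site"]))        # dept
```

The query result covers all rows of dept 1 but only a mix of sites, so dept is the better column to follow.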

29
Computing the Context: Final Algorithm
Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let S = { A | A is a FK ∈ cols(AQ) }
3. For each A ∈ S
4.     Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.     Let n = n + 1
8.     Let F = argmax_{A ∈ S} CW(A)
9.     Let S = S – { F }
10.     Let R be the relation referred by F
11.     Let AQ = AQ JOIN_F R
12.     For each FK A ∈ cols(R)
13.         Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
14.         Let S = S ∪ { A }
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
16. Return top N elements in K based on w

MAXAUG is the maximum number of augmentations allowed
N is the maximum number of terms allowed in the context
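
The greedy augmentation loop can be sketched over in-memory tables. Several things here are assumptions rather than SCORE's documented behaviour: tables are lists of dictionaries, foreign keys are described by a hypothetical fks mapping, and NR is counted over the unfiltered base relation augmented with the same joins as AQ (so that NQ never exceeds NR). The EMPLOYEES rows beyond Mukesh and Manish and the Systems group are made-up sample data.

```python
import heapq
import math
from collections import Counter

def term_weights(result, background, col):
    n_q = Counter(row[col] for row in result)
    n_r = Counter(row[col] for row in background)
    return {t: n_q[t] * math.log10((1 + n_r[t]) / (1 + n_r[t] - n_q[t]))
            for t in n_q}

def column_weight(result, background, col):
    return max(term_weights(result, background, col).values(), default=0.0)

def fk_join(rows, fk_col, target_rows, pk_col):
    """Follow FK -> PK: extend each row with the columns of the row it points to."""
    index = {t[pk_col]: t for t in target_rows}
    return [{**r, **index[r[fk_col]]} for r in rows if r[fk_col] in index]

def query_context(result, background, tables, fks, max_aug, n):
    """result: rows of AQ; background: unfiltered base rows (assumption, see above);
    fks: {fk_column: (referenced_table, pk_column)} (hypothetical schema description)."""
    followed = set()
    s = {a for a in fks if result and a in result[0]}
    n_aug = 0
    while s and n_aug < max_aug:
        n_aug += 1
        # Pick the FK column with the largest column weight (sorted for determinism).
        f = max(sorted(s), key=lambda a: column_weight(result, background, a))
        s.discard(f)
        followed.add(f)
        table, pk = fks[f]
        result = fk_join(result, f, tables[table], pk)
        background = fk_join(background, f, tables[table], pk)
        s |= {a for a in fks if a in tables[table][0] and a not in followed}
    k = [(t, a, w)
         for a in (result[0] if result else [])
         for t, w in term_weights(result, background, a).items()]
    return heapq.nlargest(n, k, key=lambda entry: entry[2])

employees = [
    {"Emp": 4, "Name": "Mukesh", "Grp": 6},
    {"Emp": 7, "Name": "Manish", "Grp": 6},
    {"Emp": 8, "Name": "Prasan", "Grp": 9},  # hypothetical row
    {"Emp": 9, "Name": "Asha", "Grp": 9},    # hypothetical row
]
tables = {
    "GROUPS": [{"Grp": 6, "GName": "Databases", "Org": 5},
               {"Grp": 9, "GName": "Systems", "Org": 5}],  # Systems is hypothetical
    "ORG": [{"Org": 5, "OrgName": "IRL"}],
}
fks = {"Grp": ("GROUPS", "Grp"), "Org": ("ORG", "Org")}

aq = [e for e in employees if e["Emp"] in (4, 7)]
context = query_context(aq, employees, tables, fks, max_aug=1, n=2)
print(sorted(str(t) for t, _, _ in context))  # ['6', 'Databases']
```

With one augmentation the group name Databases enters the context even though the query only touched EMPLOYEES.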

30
Implementation Approaches

 Brute-Force Approach
– Simple-minded implementation: Uses exact query results to compute
term weights

 Histogram-Based Approach
– Uses histogram-based estimates instead of exact query results to
compute term weights

 Modified Histogram Approach


– Uses histogram-based estimates instead of exact query results to
compute initial term weights for augmentation
– Uses the exact augmented query result to compute final term weights
for context computation

31
Brute-Force Approach
• Simple-minded implementation: use exact query results

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints    [evaluate AQ]
2. Let S = { A | A is a FK ∈ cols(AQ) }
3. For each A ∈ S
4.     Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.     Let n = n + 1
8.     Let F = argmax_{A ∈ S} CW(A)
9.     Let S = S – { F }
10.     Let R be the relation referred by F
11.     Let AQ = AQ JOIN_F R    [evaluate AQ]
12.     For each FK A ∈ cols(R)
13.         Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
14.         Let S = S ∪ { A }
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
16. Return top N elements in K based on w

Total: MAXAUG+1 evaluations. Too expensive!

MAXAUG is the maximum number of augmentations allowed
N is the maximum number of terms allowed in the context
Histogram-Based Approach
• Use histogram-based estimates instead of exact query results

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints    [evaluate AQ; compute the histograms for FK cols in AQ’s result]
2. Let S = { A | A is a FK ∈ cols(AQ) }
3. For each A ∈ S
4.     Let CW(A) = max_{t ∈ A} TW(AQ, A, t)    [CW estimated based on the column histograms]
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.     Let n = n + 1
8.     Let F = argmax_{A ∈ S} CW(A)
9.     Let S = S – { F }
10.     Let R be the relation referred by F
11.     Let AQ = AQ JOIN_F R    [estimate histograms for the columns added to AQ]
12.     For each FK A ∈ cols(R)
13.         Let CW(A) = max_{t ∈ A} TW(AQ, A, t)    [CW estimated based on the column histograms]
14.         Let S = S ∪ { A }
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }    [TW estimated based on the column histograms → computed context may have irrelevant terms!]
16. Return top N elements in K based on w

Total: 1 evaluation. Inexpensive, but unsafe!

MAXAUG is the maximum number of augmentations allowed
N is the maximum number of terms allowed in the context
Modified-Histogram Approach (Actually Implemented)
• Judiciously use histogram-based estimates as well as exact query results

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints    [evaluate AQ; compute the histograms for FK cols in AQ’s result]
2. Let S = { A | A is a FK ∈ cols(AQ) }
3. For each A ∈ S
4.     Let CW(A) = max_{t ∈ A} TW(AQ, A, t)    [CW estimated based on the column histograms]
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.     Let n = n + 1
8.     Let F = argmax_{A ∈ S} CW(A)
9.     Let S = S – { F }
10.     Let R be the relation referred by F
11.     Let AQ = AQ JOIN_F R    [estimate histograms for the columns added to AQ]
12.     For each FK A ∈ cols(R)
13.         Let CW(A) = max_{t ∈ A} TW(AQ, A, t)    [CW estimated based on the column histograms]
14.         Let S = S ∪ { A }
[evaluate the augmented AQ]
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }    [TW and CW computed using the exact result of AQ]
16. Return top N elements in K based on w

Total: 2 evaluations. Inexpensive and safe.

MAXAUG is the maximum number of augmentations allowed
N is the maximum number of terms allowed in the context
Conclusion
 SCORE provides a novel way of integrating structured and unstructured information
– Enables unstructured data to be “queried” using SQL
– Works even when the source of the unstructured data is external, with restricted access
through a keyword search interface (e.g. Epinions.com, TheStreet.com)
– Achieves high accuracy with very reasonable execution overheads

 Shallow learning curve


– No new language or concept needs to be learned to use SCORE

 Non-disruptive deployment
– Works with the existing database system/search engine in a non-intrusive manner,
interacting using only standard interfaces

 High ROI
– Provides the infrastructure to create applications that provide better user experience,
effectively manage customer relationships, make better investment decisions, etc.

 Addresses an important gap between end-user needs and the capabilities offered by
existing structured and unstructured information integration approaches

35

You might also like