Score: Context-Oriented Structured and Unstructured Information Integration

The document discusses different approaches for integrating structured and unstructured information, including keyword query based solutions that expose relational data to a search engine, and SQL query based solutions that expose text data to a relational engine. It notes limitations of both approaches and introduces SCORE as a new solution that allows querying both types of data using SQL without requiring specification of keywords.

Uploaded by

Vaibhav Jindal

SCORE

Context-Oriented Structured and Unstructured Information Integration

Roy, P., Mohania, M., Bamba, B., and Raman, S. Towards Automatic Association of Relevant
Unstructured Content with Structured Query Results. In CIKM (2005)
Bamba, B., Roy, P., and Mohania, M. OSQR: Overlapping Clustering of Query Results. In CIKM (2005).
Structured and Unstructured Information
• Information content in an enterprise can be structured or unstructured
  – Structured content: payroll, sales orders, invoices, customer profiles, etc.
  – Unstructured content: emails, reports, web pages, complaints, information on sales, customers, competitors, products, suppliers and people, etc.

• Historically, structured and unstructured data management technologies have evolved separately
• The result is an artificial separation between these two "kinds" of information

[Figure: two separate stacks. Structured data management: a DB2 query goes to the DBMS (<20% of enterprise data) and returns a DB2 result. Unstructured data management: a CM query goes to the Content MS (>80% of enterprise data) and returns a CM result.]

• Enterprises are realizing the need to bridge this separation, and are demanding integrated retrieval, management and analysis of both structured and unstructured content
Structured and Unstructured Information Integration:
A Brief Background on Existing Solutions
Existing solutions can be classified in terms of the query paradigm used:

• Keyword Query Based Solutions (DB2 ESE, DbXplorer/BANKS [ICDE02])
  – Relational data is exposed to the search engine as virtual text documents
  – Both structured and unstructured information are queried using keywords

• SQL Query Based Solutions (SQL LIKE predicate, DB2 Net Search Extender)
  – Text data is exposed to the relational engine as virtual tables with text columns
  – Both structured and unstructured information are queried using SQL
  – SQL primitives are provided to search text in table columns using a set of keywords
Keyword Query Based Solution: DB2 ESE

[Figure: a keyword query is posed to the DB2 Enterprise Search Extender.]

Keyword Query Based Solution:
DbXplorer/BANKS [ICDE02]

[Figure: a keyword query is posed to DbXplorer/BANKS over a relational table (columns C1, C2, C3), alongside a keyword query posed to a search engine.]
Keyword Query Based Solutions: Summary

• Advantage: Simplicity!

• Disadvantages
  – Less expressive (as compared to SQL)
    · How to ask for the information related to the five best performing stocks in the past week?
  – Need to specify a set of keywords that succinctly encodes the information need
    · Not always easy
SQL Query Based Solution:
Standard SQL LIKE Predicate

SELECT stocks.price, docs.text
FROM stocks, docs
WHERE (stocks.name = 'IBM'
       AND docs.text LIKE '% IBM %')
   OR (stocks.name = 'ORCL'
       AND docs.text LIKE '% ORCL %')

[Figure: the query is submitted to Information Integrator, which evaluates it over the relational tables.]
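The LIKE-based query above can be tried out with a small in-memory database. This is only a sketch using Python's built-in sqlite3 module rather than Information Integrator; the table contents are invented for illustration. It shows how each ticker has to be written twice, once as a structured predicate and once as a keyword pattern.

```python
import sqlite3

# In-memory database with made-up rows (not from the slides).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stocks (name TEXT, price REAL);
    CREATE TABLE docs (text TEXT);
    INSERT INTO stocks VALUES ('IBM', 120.5), ('ORCL', 80.0), ('DELL', 40.0);
    INSERT INTO docs VALUES
        ('Analysts expect IBM to grow'),
        ('Shares of ORCL fell today'),
        ('Weather report for Delhi');
""")

# Same shape as the slide's query: every ticker appears once in the
# structured predicate and once more inside the LIKE pattern.
rows = conn.execute("""
    SELECT stocks.name, stocks.price, docs.text
    FROM stocks, docs
    WHERE (stocks.name = 'IBM'  AND docs.text LIKE '% IBM %')
       OR (stocks.name = 'ORCL' AND docs.text LIKE '% ORCL %')
    ORDER BY stocks.name
""").fetchall()

for row in rows:
    print(row)
```

Note that the pattern '% IBM %' only matches when the keyword is surrounded by spaces, so a document that starts or ends with the ticker would be missed.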
SQL Query Based Solution:
Net Search Extender

SELECT stocks.price, docs.text
FROM stocks, docs
WHERE (stocks.name = 'IBM'
       AND CONTAINS(docs.text, '"IBM"') = 1)
   OR (stocks.name = 'ORCL'
       AND CONTAINS(docs.text, '"ORCL"') = 1)

[Figure: the query is submitted to Information Integrator; the CONTAINS(…) predicates over the text column are handled by the Net Search Extender.]
SQL Query Based Solutions: Summary

• Advantages:
  – More expressive: can specify more involved and sophisticated queries

• Disadvantages:
  – The unstructured data is still queried using keywords
    · Need to specify a set of keywords that succinctly encodes the information need
    · Not always easy
  – The SQL query and the embedded keyword query encode the same information need
    · Redundant effort
  – Documents are associated with individual tuples (local context), not with the entire result (global context)
    · The same documents get attached to "IBM" when "IBM" is queried with "ORCL" as when "IBM" is queried with "DELL"
SCORE Overview

[Figure: SCORE sits between the user, the RDBMS and the search engine.
 – The user submits a SQL query (keywords not required) and, optionally, a directive such as "Doctype:Patents".
 – SCORE passes the SQL query to the RDBMS and receives the query result.
 – SCORE sends context keywords (in the figure: "IBM", "ORCL", "MSFT", "Database", "Data Cloud Services") together with the directive to the search engine.
 – The user receives the query result and the related documents.]

Example: "Get the 3 companies with max price variation"

SELECT name, max(price) - min(price)
FROM stocks
GROUP BY name
ORDER BY 2 DESC
FETCH FIRST 3 ROWS ONLY

Database Group, IRL
Motivating Example

PROJECTS
Proj | Name   | Description       | Group
1    | CORTEX | Policy Management | 6

ORG
Org | Org Name | Address
5   | IRL      | Hauz Khas, New Delhi

PROJEMPS
Proj | Emp
1    | 7
1    | 4

EMPLOYEES
Emp | Name   | Group
4   | Mukesh | 6
7   | Manish | 6

GROUPS
Group | Name      | Manager | Org
6     | Databases | 4       | 5

Query (against the DBMS):
Select Name, Description
From PROJECTS
Where Name = "CORTEX"

Directives: Year < 2003

Result:
Name   | Description
CORTEX | Policy Management

Report.doc: "… Report on Database Research at IRL … Policy Definition …"
Report.doc is additionally retrieved based on the context.
Note: the document makes no mention of CORTEX.
Filled in by the user:

CORE QUERY

Select GROUPS.Name, PROJECTS.Description
From PROJECTS, PROJEMPS, EMPLOYEES, GROUPS
Where PROJECTS.Proj# = PROJEMPS.Proj#
And PROJEMPS.Emp# = EMPLOYEES.Emp#
And EMPLOYEES.Grp# = GROUPS.Grp#
And PROJECTS.Name = "CORTEX"

SEARCH DIRECTIVES

Select ARTICLES.id
From ARTICLES
Where ARTICLES.contains("Report AND SET-OR($GROUPS.Name)")
And ARTICLES.year > 2001
ORDER BY RELEVANCE
USING THRESHOLD = 0.8

Annotations on the directives:
• ARTICLES.contains(…): restrict to documents containing "Report" and at least one group name
• ARTICLES.year > 2001: restrict to documents published after 2001
• ORDER BY RELEVANCE USING THRESHOLD = 0.8: order by relevance to the structured query, with a cutoff threshold of 0.8
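One way to read the SET-OR($GROUPS.Name) directive is as a placeholder that is expanded into a disjunction over the values of GROUPS.Name in the core query result. The sketch below illustrates that reading only; the function name expand_set_or, the quoting of values and the output syntax are assumptions, not SCORE's actual directive processing.

```python
import re

def expand_set_or(directive: str, result_columns: dict) -> str:
    """Replace each SET-OR($Table.Column) with an OR over that column's distinct values.

    Hypothetical helper: the slides give the directive syntax but not how it is expanded.
    """
    def repl(match):
        values = result_columns[match.group(1)]
        distinct = list(dict.fromkeys(values))  # drop duplicates, keep first-seen order
        return "(" + " OR ".join(f'"{v}"' for v in distinct) + ")"

    return re.sub(r"SET-OR\(\$([\w.]+)\)", repl, directive)

# GROUPS.Name values as they might come back from the core query (illustrative).
core_result = {"GROUPS.Name": ["Databases", "Networks", "Databases"]}
keyword_query = expand_set_or("Report AND SET-OR($GROUPS.Name)", core_result)
print(keyword_query)
```

With the single group from the motivating example, the expansion would reduce to "Report" combined with "Databases".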
Main Idea

• Specify the information need in terms of SQL over the structured database
  – Additional information needs are specified using "directives" (optional)

• Automatically synthesize the "context" of the SQL query from its result and the known semantic dependencies in the structured data

• Use this context and the directives to retrieve the unstructured data
Computing the Context

• Formalize the criterion for selecting terms as a part of the context of a given SQL query

• Develop an algorithm, based on the above formalization, that computes the context by analyzing the result of the SQL query as well as the related data in the underlying database
  – Determine the relevant tables to explore without any user intervention
  – Handle numeric-valued columns in addition to string-valued columns

• Develop algorithms to refine the context
  – Ontology based expansion of the set of keywords forming the context
  – Exploit the column name information maintained with each keyword
    · The column names are mapped to tag names that were used to tag the documents (exploits Criollo to store the column name → document tag mapping)
    · Disambiguated keywords → more precise context → more accurate integration
  – Aggregated context of the session so far

Example: exploiting the column name

Consider the keyword Apple picked from the column named Company. Simply using Apple to retrieve documents would retrieve documents related to the company Apple as well as the fruit apple!

Suppose the documents were tagged using the vocabulary Fruit and Organization.
  – Doc1: "I had an <Fruit>apple</Fruit> for breakfast"
  – Doc2: "Steve Jobs founded <Organization>Apple</Organization>"

SCORE can disambiguate the keyword Apple by associating with it the tag Organization, which corresponds to the column name Company.
  – Uses Criollo to get the mapping from column names to document tags
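The Apple example can be sketched in a few lines. The dictionary COLUMN_TO_TAG below stands in for the column name to document tag mapping (which the slides say is stored via Criollo), and the regular-expression matching is an illustrative assumption rather than SCORE's retrieval mechanism.

```python
import re

# Hypothetical stand-in for the column name -> document tag mapping.
COLUMN_TO_TAG = {"Company": "Organization"}

# The two tagged documents from the example.
DOCS = {
    "Doc1": "I had an <Fruit>apple</Fruit> for breakfast",
    "Doc2": "Steve Jobs founded <Organization>Apple</Organization>",
}

def naive_docs(keyword, docs=DOCS):
    """Plain keyword match: ignores where the keyword came from."""
    return [doc_id for doc_id, text in docs.items() if keyword.lower() in text.lower()]

def disambiguated_docs(keyword, column, docs=DOCS):
    """Match the keyword only inside the tag that corresponds to its source column."""
    tag = COLUMN_TO_TAG[column]
    pattern = re.compile(rf"<{tag}>\s*{re.escape(keyword)}\s*</{tag}>", re.IGNORECASE)
    return [doc_id for doc_id, text in docs.items() if pattern.search(text)]

print(naive_docs("Apple"))                     # both documents match
print(disambiguated_docs("Apple", "Company"))  # only the Organization-tagged one
```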
Architecture

[Figure: SCORE architecture. Recoverable labels:
 – User Interface/Application: sends Query + Directives; receives Query Result + Relevant Documents
 – SCORE components: Query Handler, Context Handler, Metadata Repository (metadata mapping)
 – Between Query Handler and Context Handler: Query Result + Metadata, Query Context, Relevant Documents
 – RDBMS over the structured data source: Modified Query, Modified Query Result, Metadata
 – Search Engine over the unstructured data source (CM): Context + Directives
 – Annotation Metadata]
Features
• Enables unstructured data to be "queried" using SQL
  – Dynamically associates documents based on the global context of the SQL query

• Seamlessly associates even external text data available through a keyword search interface
  – E.g. opinions at Epinions.com, finance articles at TheStreet.com

• Does not need any additional skills for deployment and maintenance
  – No new concepts or languages need to be learned by the DBA/end-users

• Does not need any changes in existing infrastructure for deployment
  – Does not need a common schema, or specialized indexes at any source
  – Does not need static association between the tuples and docs
  – Does not need any external semantic information (e.g. ontologies)
    · But can use it if available

• Does not disrupt existing applications
  – Works in application space using standard interfaces
Business Scenario: Customer Relationship Management

• Structured Data
  – Customer information related to credit cards, loans, insurance, etc.

• Unstructured Information
  – Applications, complaints, surveys, other communications

• Scenario
  – Some customers have cancelled their credit cards.

SCORE enables the management to:
  – Analyze these customers' relationships with the organization (by looking at common services, etc.) and find "features" that differentiate them from other customers
  – Use the result of this analysis to prioritize the complaints in order to avoid losing any more customers

ROI in terms of effective problem determination and low customer attrition
Business Scenario: Investment Information Agent
• Structured Data
  – Stock ticker (summarized and archived over the past week)
  – Company information
    · Sector (IT, Pharma, Banking, etc.)
    · Institutional investors, promoters, etc.

• Unstructured Information
  – Analyst reports, related advisories, news stories/tickers, etc.

• Scenario
  – The user has identified a "watch-list" of some target stocks (e.g. the 10 top performing stocks over the past week)
    · This can be done in most online investing tools available today.

SCORE enables the user to:
  – Analyze the company information to find the common "features" among these companies that differentiate them from other stocks (e.g. their sector, common institutional investors, etc.)
  – Use this information to retrieve related reports, advisories, news stories/tickers, etc.

ROI in terms of time saved in analysis, leading to better investment decisions
Business Scenario: E-Commerce (eBay)
• Structured Data
  – For each customer:
    · Item history (watching + bidding + won + didn't win)
  – For each seller:
    · Item history (selling + sold + unsold)
  – For each item:
    · Profile (product category, price)

• Unstructured Data
  – Profiles of items currently on sale
    · Customer can search items based on keywords
  – Profiles of registered trading assistants
    · Seller can search based on category and additional keywords

• Scenarios
  – For Customer: recommending items similar to "Items I Didn't Win"
    · Use SCORE to identify keywords characterizing items not won
    · Search for similar items based on these keywords
  – For Seller: recommending relevant trading assistants
    · Use SCORE to identify categories and additional keywords characterizing unsold items
    · Search for trading assistants based on these keywords

ROI in terms of better user experience and higher revenue
Outline

• SCORE Overview

• Technical Details
  – Computing the Context: Algorithm and Implementation

• Experimental Study
Context of a SQL Query

• The set of terms in the database that the query is focused on. That is, the set of terms that:
  a) are popular in the query result, and
  b) differentiate the query result from the underlying database

Relation R:
C1 | C2 | C3
A  | X  | 1
A  | X  | 2
A  | Y  | 3
B  | X  | 4
A  | X  | 5
B  | Y  | 6
B  | X  | 7
B  | X  | 8

Q1: SELECT C1, C2 FROM R
    WHERE C3 >= 1 AND C3 <= 4
    Context = {A}

Q2: SELECT C1, C2 FROM R
    WHERE C3 >= 5 AND C3 <= 8
    Context = {B}
Computing the Context: Term Weights
• For a given query Q on a single relation R, define the term weight of the term t present in column A ∈ cols(Q) in the query result as:

  $TW(Q, A, t) = POP(Q, A, t) \cdot DIFF(Q, A, t)$

  where

  $POP(Q, A, t) = |\sigma_{A=t}(Q)| = N_Q$

  $DIFF(Q, A, t) = \log\left(\frac{1 + N_R}{1 + N_R - N_Q}\right)$

  Here $N_Q = |\sigma_{A=t}(Q)|$ is the number of times t appears in column A in Q, and $N_R = |\sigma_{A=t}(R)|$ is the number of times t appears in column A in R.

Based on the "TF-IDF" metrics popular in the Information-Retrieval community

However:
• The "TF" is computed over the query result
• The "IDF" is computed over the remainder of the underlying relation
Computing the Context: Term Weights (Worked Example)

Relation R:
C1 | C2 | C3
A  | X  | 1
A  | X  | 2
A  | Y  | 3
B  | X  | 4
A  | X  | 5
B  | Y  | 6
B  | X  | 7
B  | X  | 8

Q1: SELECT C1, C2 FROM R
    WHERE C3 >= 1 AND C3 <= 4

Logarithms are base 10.

• For A, NR = 4, NQ = 3
  – POP(Q1, C1, A) = 3
  – DIFF(Q1, C1, A) = log((1+4)/(1+1)) = 0.398
  – TW(Q1, C1, A) = 3*0.398 = 1.194
• For B, NR = 4, NQ = 1
  – POP(Q1, C1, B) = 1
  – DIFF(Q1, C1, B) = log((1+4)/(1+3)) = 0.097
  – TW(Q1, C1, B) = 1*0.097 = 0.097
• For X, NR = 6, NQ = 3
  – POP(Q1, C2, X) = 3
  – DIFF(Q1, C2, X) = log((1+6)/(1+3)) = 0.243
  – TW(Q1, C2, X) = 3*0.243 = 0.729
• For Y, NR = 2, NQ = 1
  – POP(Q1, C2, Y) = 1
  – DIFF(Q1, C2, Y) = log((1+2)/(1+1)) = 0.176
  – TW(Q1, C2, Y) = 1*0.176 = 0.176
Computing the Context from the SQL Query Result

• A straightforward algorithm:

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let K = { <t, A, w> | t ∈ A, A ∈ cols(Q), w = TW(Q, A, t) }
2. Return top N elements in K based on w

N is the maximum number of terms allowed in the context

• Limitations:
  – The query result contains only a limited amount of information
  – More information can be extracted from the "neighborhood" of the query result in the underlying database
• Need to look beyond the input query!
Computing the Context: Looking Beyond the Input Query

• Including the projected-out columns
  – Typically, attributes are projected out for convenience, but they may carry valuable information

• Exploring the neighboring tables
  – Related information may be spread across multiple tables due to normalization
  – These related tables are connected by foreign-key relationships
  – Exploit the relationships in the "forward" direction (FK → PK)
    · Commonplace: star/snowflake schemas, relational schemas for storing hierarchies (directories, XML), etc.
Computing the Context: Including Projected-out Columns

• Remove projection constraints from the input query

• Modified algorithm:

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
3. Return top N elements in K based on w

N is the maximum number of terms allowed in the context
Computing the Context: Exploring the Neighboring Tables

• For each row in the query result, follow the foreign-key "pointers" to rows in tables not included in the query

• This amounts to augmenting the input query with foreign-key joins with other tables

• Problem: for scalability, need to limit the number of augmentations
  – How to select the subset of possible augmentations efficiently?

• Solution:
  – An iterative greedy heuristic that picks the most relevant augmentation in each iteration
Computing the Context: Exploring the Neighboring Tables

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let S = { A | A is a FK ∈ cols(AQ) }
3. Let n = 0
4. While S ≠ ∅ and n < MAXAUG
5.   Let n = n + 1
6.   Let F = "best" FK ∈ S
7.   Let S = S – { F }
8.   Let R be the relation referred to by F
9.   Let AQ = AQ JOIN_F R
10.  For each FK A ∈ cols(R)
11.    Let S = S ∪ { A }
12. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
13. Return top N elements in K based on w

Open question at step 6: how to find the "best" foreign-key column in S?

MAXAUG is the maximum number of augmentations allowed

N is the maximum number of terms allowed in the context
Computing the Context: Comparing Foreign Keys

• Given FK columns F1, F2 ∈ cols(Q) in the query result, the query Q is more focused on column F1 than on column F2 if there exists a term t ∈ F1 such that Q's focus on t is more than its focus on any term in F2

• For a given query Q, define the column weight of the foreign-key column F ∈ cols(Q) in the query result as:

  $CW(Q, F) = \max_{t \in F} TW(Q, F, t)$

• Then, given FK columns F1, F2 ∈ cols(Q) in the query result, F1 is "better" than F2 if CW(Q, F1) > CW(Q, F2)
Computing the Context: Final Algorithm

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let S = { A | A is a FK ∈ cols(AQ) }
3. For each A ∈ S
4.   Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.   Let n = n + 1
8.   Let F = argmax_{A ∈ S} CW(A)
9.   Let S = S – { F }
10.  Let R be the relation referred to by F
11.  Let AQ = AQ JOIN_F R
12.  For each FK A ∈ cols(R)
13.    Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
14.    Let S = S ∪ { A }
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
16. Return top N elements in K based on w

MAXAUG is the maximum number of augmentations allowed

N is the maximum number of terms allowed in the context
Implementation Approaches

• Brute-Force Approach
  – Simple-minded implementation: uses exact query results to compute term weights

• Histogram-Based Approach
  – Uses histogram-based estimates instead of exact query results to compute term weights

• Modified Histogram Approach
  – Uses histogram-based estimates instead of exact query results to compute initial term weights for augmentation
  – Uses the exact augmented query result to compute final term weights for context computation
Brute-Force Approach
• Simple-minded implementation: use exact query results

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let S = { A | A is a FK ∈ cols(AQ) }
3. For each A ∈ S
4.   Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.   Let n = n + 1
8.   Let F = argmax_{A ∈ S} CW(A)
9.   Let S = S – { F }
10.  Let R be the relation referred to by F
11.  Let AQ = AQ JOIN_F R
12.  For each FK A ∈ cols(R)
13.    Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
14.    Let S = S ∪ { A }
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
16. Return top N elements in K based on w

Annotations:
• AQ is evaluated once at the start and again in every iteration of the loop
• Total: MAXAUG + 1 evaluations. Too expensive!

MAXAUG is the maximum number of augmentations allowed
N is the maximum number of terms allowed in the context
Histogram-Based Approach
• Use histogram-based estimates instead of exact query results

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let S = { A | A is a FK ∈ cols(AQ) }
3. For each A ∈ S
4.   Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.   Let n = n + 1
8.   Let F = argmax_{A ∈ S} CW(A)
9.   Let S = S – { F }
10.  Let R be the relation referred to by F
11.  Let AQ = AQ JOIN_F R
12.  For each FK A ∈ cols(R)
13.    Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
14.    Let S = S ∪ { A }
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
16. Return top N elements in K based on w

Annotations:
• At the start: evaluate AQ and compute the histograms for the FK columns in AQ's result
• Before and inside the loop: CW is estimated based on the column histograms
• After each join: estimate histograms for the columns added to AQ
• Final step: TW is estimated based on the column histograms, so the computed context may have irrelevant terms!
• Total: 1 evaluation. Inexpensive, but unsafe!

MAXAUG is the maximum number of augmentations allowed
N is the maximum number of terms allowed in the context
Modified-Histogram Approach (Actually Implemented)
• Judiciously use histogram-based estimates as well as exact query results

Procedure QueryContext(Q)
Input: Query Q
Output: Context of the query Q
1. Let AQ = Q without the projection constraints
2. Let S = { A | A is a FK ∈ cols(AQ) }
3. For each A ∈ S
4.   Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
5. Let n = 0
6. While S ≠ ∅ and n < MAXAUG
7.   Let n = n + 1
8.   Let F = argmax_{A ∈ S} CW(A)
9.   Let S = S – { F }
10.  Let R be the relation referred to by F
11.  Let AQ = AQ JOIN_F R
12.  For each FK A ∈ cols(R)
13.    Let CW(A) = max_{t ∈ A} TW(AQ, A, t)
14.    Let S = S ∪ { A }
15. Let K = { <t, A, w> | t ∈ A, A ∈ cols(AQ), w = TW(AQ, A, t) }
16. Return top N elements in K based on w

Annotations:
• At the start: evaluate AQ and compute the histograms for the FK columns in AQ's result
• Before and inside the loop: CW is estimated based on the column histograms
• After each join: estimate histograms for the columns added to AQ
• After the augmentations: evaluate AQ again; TW and CW are computed using the exact result of AQ
• Total: 2 evaluations. Inexpensive, and safe

MAXAUG is the maximum number of augmentations allowed
N is the maximum number of terms allowed in the context
Conclusion
• SCORE provides a novel way of integrating structured and unstructured information
  – Enables unstructured data to be "queried" using SQL
  – Works even when the source of the unstructured data is external, with restricted access through a keyword search interface (e.g. Epinions.com, TheStreet.com)
  – Achieves high accuracy with very reasonable execution overheads

• Shallow learning curve
  – No new language or concept needs to be learned to use SCORE

• Non-disruptive deployment
  – Works with the existing database system/search engine in a non-intrusive manner, interacting using only standard interfaces

• High ROI
  – Provides the infrastructure to create applications that provide better user experience, effectively manage customer relationships, make better investment decisions, etc.

• Addresses an important gap between end-user needs and the capabilities offered by existing structured and unstructured information integration approaches
35

DO 27 S 2019 PDF
No ratings yet
DO 27 S 2019 PDF
258 pages
ETL Testing - PPT
No ratings yet
ETL Testing - PPT
77 pages
Birthday Girl PDF
No ratings yet
Birthday Girl PDF
1 page
BAED-AI2121-2322S-Written Work 2 4th Quarter Grade 12
100% (1)
BAED-AI2121-2322S-Written Work 2 4th Quarter Grade 12
5 pages
Procurement Process
100% (1)
Procurement Process
14 pages
Wms Requirements Template
100% (1)
Wms Requirements Template
13 pages
REHS0970 - Cross Reference For Electrical Connectors
No ratings yet
REHS0970 - Cross Reference For Electrical Connectors
115 pages
Programming in Java: Biyani's Think Tank
No ratings yet
Programming in Java: Biyani's Think Tank
103 pages
Merge
No ratings yet
Merge
370 pages
Week-3 Schema Matching and Mapping
No ratings yet
Week-3 Schema Matching and Mapping
26 pages
Business Mathematics - Syllabus
No ratings yet
Business Mathematics - Syllabus
3 pages
Bandwidth Part (BWP) in 5G-NR
No ratings yet
Bandwidth Part (BWP) in 5G-NR
18 pages
Keyword Search in Structured Databases: Vagelis Hristidis
No ratings yet
Keyword Search in Structured Databases: Vagelis Hristidis
58 pages
DBMS Theory Book
No ratings yet
DBMS Theory Book
149 pages
Database1 Final Revision ٠٤٥٢٢٤
100% (1)
Database1 Final Revision ٠٤٥٢٢٤
14 pages
DBMS60
No ratings yet
DBMS60
130 pages
15.2 Database Services and ELK
No ratings yet
15.2 Database Services and ELK
42 pages
Database Systems
No ratings yet
Database Systems
9 pages
Lecture 3: Business Intelligence: OLAP, Data Warehouse, and Column Store
No ratings yet
Lecture 3: Business Intelligence: OLAP, Data Warehouse, and Column Store
119 pages
Relational Database Model
No ratings yet
Relational Database Model
45 pages
Database Management Systems All Weeks
No ratings yet
Database Management Systems All Weeks
77 pages
Review
No ratings yet
Review
18 pages
Advanced Database Design and Implementation: Lesson 04 SQL
No ratings yet
Advanced Database Design and Implementation: Lesson 04 SQL
82 pages
Lecture 3 - 2 - OLAP, Data Warehouse, and Column Store
No ratings yet
Lecture 3 - 2 - OLAP, Data Warehouse, and Column Store
60 pages
SYM400 T1 S2 Dilts Slides
No ratings yet
SYM400 T1 S2 Dilts Slides
64 pages
Schema Diagram
No ratings yet
Schema Diagram
37 pages
Oracle
No ratings yet
Oracle
103 pages
State Machines
No ratings yet
State Machines
6 pages
Lesson 2
No ratings yet
Lesson 2
50 pages
Chapter 03
No ratings yet
Chapter 03
43 pages
Elective-I Advanced Database Management Systems
No ratings yet
Elective-I Advanced Database Management Systems
67 pages
2 Data Mining Terms & Concepts
No ratings yet
2 Data Mining Terms & Concepts
44 pages
Lecture 6 Database Primer
No ratings yet
Lecture 6 Database Primer
50 pages
The Relational Database Model: Database Systems: Design, Implementation, and Management
No ratings yet
The Relational Database Model: Database Systems: Design, Implementation, and Management
52 pages
ADB - CH2 - Advanced SQL
No ratings yet
ADB - CH2 - Advanced SQL
60 pages
Chapter 4 Supply Management Integration For Competitive Advantage
No ratings yet
Chapter 4 Supply Management Integration For Competitive Advantage
81 pages
Data Warehousing: Data Models and OLAP Operations
No ratings yet
Data Warehousing: Data Models and OLAP Operations
41 pages
CS121 Lec 04
No ratings yet
CS121 Lec 04
44 pages
Image Compression
No ratings yet
Image Compression
15 pages
Lecture 2.1.1
No ratings yet
Lecture 2.1.1
21 pages
Data Warehouse Schemas For Decision Support
No ratings yet
Data Warehouse Schemas For Decision Support
13 pages
Data Warehousing & DATA MINING (SE-409) : Lecture-2
No ratings yet
Data Warehousing & DATA MINING (SE-409) : Lecture-2
36 pages
A Relational Model of Data For Large Shared Data Banks
100% (1)
A Relational Model of Data For Large Shared Data Banks
35 pages
PDF Document BIDA 2
No ratings yet
PDF Document BIDA 2
21 pages
001 - OpenEdge Getting Started Database Essentials Gsdbe
No ratings yet
001 - OpenEdge Getting Started Database Essentials Gsdbe
142 pages
Lecture 6
No ratings yet
Lecture 6
26 pages
The Relational Database Model
No ratings yet
The Relational Database Model
34 pages
Pokemon Black Cheats & Cheat Codes For Nintendo DS - Cheat Code Central
No ratings yet
Pokemon Black Cheats & Cheat Codes For Nintendo DS - Cheat Code Central
55 pages
SQL Query
No ratings yet
SQL Query
14 pages
AIX Disk Queue Depth Tuning For Performance UnixMANTRA
No ratings yet
AIX Disk Queue Depth Tuning For Performance UnixMANTRA
9 pages
Information Integration: Existing Methods and Solutions
No ratings yet
Information Integration: Existing Methods and Solutions
25 pages
SQL Final Notes
No ratings yet
SQL Final Notes
9 pages
Infromation System1
No ratings yet
Infromation System1
47 pages
Week 3
No ratings yet
Week 3
29 pages
VLB Janakiammal College of Engineering and Technology
No ratings yet
VLB Janakiammal College of Engineering and Technology
54 pages
CSE301 Lec6
No ratings yet
CSE301 Lec6
11 pages
Lecture 1: Part I: Emerging Database Technology, Research and Applications
No ratings yet
Lecture 1: Part I: Emerging Database Technology, Research and Applications
11 pages
Two Marks: Unit: 1
No ratings yet
Two Marks: Unit: 1
62 pages
DBMS SRP
No ratings yet
DBMS SRP
13 pages
Database System
No ratings yet
Database System
14 pages
SQL Basics
No ratings yet
SQL Basics
6 pages
Lecture 1 Overview: Two Things in Backend
No ratings yet
Lecture 1 Overview: Two Things in Backend
21 pages
Data Modeling and T-SQL: Meetings / Methodology
No ratings yet
Data Modeling and T-SQL: Meetings / Methodology
13 pages
Web Mining
No ratings yet
Web Mining
20 pages
ADBMS Lec1
No ratings yet
ADBMS Lec1
46 pages
Infoman Finals
No ratings yet
Infoman Finals
5 pages
Bajwa A C
No ratings yet
Bajwa A C
4 pages
Cs 614
No ratings yet
Cs 614
10 pages
SQL Interview Questions 1725044566
No ratings yet
SQL Interview Questions 1725044566
4 pages
Access Finals
No ratings yet
Access Finals
5 pages
Data Warehouse - Logical Design
No ratings yet
Data Warehouse - Logical Design
40 pages
Introduction To Databases: DB2 Tutorial:-What Is Data?
No ratings yet
Introduction To Databases: DB2 Tutorial:-What Is Data?
16 pages
Chapter 8 Revision
No ratings yet
Chapter 8 Revision
15 pages
Gene Expression: Quantification of Information Molecules and Their Applications
No ratings yet
Gene Expression: Quantification of Information Molecules and Their Applications
146 pages
L Lpi3 A4
No ratings yet
L Lpi3 A4
29 pages
Week 5
No ratings yet
Week 5
64 pages
Acronyms
No ratings yet
Acronyms
6 pages
Dayananda Sagar College of Engineering: Shavige Malleshwara Hills, Kumaraswamy Layout, Bangalore-560078
No ratings yet
Dayananda Sagar College of Engineering: Shavige Malleshwara Hills, Kumaraswamy Layout, Bangalore-560078
12 pages
Saliola Assunta 201406 MSC Thesis
No ratings yet
Saliola Assunta 201406 MSC Thesis
80 pages
Strange and Beautiful Numbers 2
No ratings yet
Strange and Beautiful Numbers 2
10 pages
Tiv PDF
No ratings yet
Tiv PDF
1 page
Recap Through Exercise
No ratings yet
Recap Through Exercise
37 pages
Debugging 9
No ratings yet
Debugging 9
16 pages
Indian Addresses Matching
No ratings yet
Indian Addresses Matching
12 pages
HW 7 Solutions
No ratings yet
HW 7 Solutions
9 pages
Ticketdirect 1214508382
No ratings yet
Ticketdirect 1214508382
2 pages
Python - How To Draw A Heart With Pylab - Stack Overflow
No ratings yet
Python - How To Draw A Heart With Pylab - Stack Overflow
5 pages
NASA Science Mission Directorate Knowledge Graph Discovery
No ratings yet
NASA Science Mission Directorate Knowledge Graph Discovery
6 pages
Resizing Partitions (For Android)
No ratings yet
Resizing Partitions (For Android)
2 pages
Track Schedule (ICRTSET 2025)
No ratings yet
Track Schedule (ICRTSET 2025)
3 pages
Futaba - Tbs - CRT As9106
No ratings yet
Futaba - Tbs - CRT As9106
2 pages
The Tech Interview Playbook: From DSA to System Design
From Everand
The Tech Interview Playbook: From DSA to System Design
Chinmoy Mukherjee
No ratings yet
Blowfish Cipher Tutorials - Herong's Tutorial Examples
From Everand
Blowfish Cipher Tutorials - Herong's Tutorial Examples
Herong Yang
No ratings yet
IGNOU BCA Data and File Structure Previous Year Unsolved Papers MCS 021
From Everand
IGNOU BCA Data and File Structure Previous Year Unsolved Papers MCS 021
Manish Soni
No ratings yet
IGNOU BCA Introduction to Algorithm Design Previous Year Unsolved Papers BCS 042
From Everand
IGNOU BCA Introduction to Algorithm Design Previous Year Unsolved Papers BCS 042
Manish Soni
No ratings yet

Score: Context-Oriented Structured and Unstructured Information Integration

Uploaded by

Score: Context-Oriented Structured and Unstructured Information Integration

Uploaded by

SCORE

Context-Oriented Structured and Unstructured Information Integration

DB2 Query CM Query

DB2 Result CM Result

Structured Data Management Unstructured Data Management

 Keyword Query Based Solutions (DB2 ESE, DbXplorer/BANKS [ICDE02])

Keyword Query DB2

SELECT name, max(price) - min(price)

“IBM” “ORCL” “MSFT” Search Engine

Proj | Name | Description | Group Org | Org Name | Address

Proj | Emp Emp | Name | Group Group | Name | Manager | Org

Select Name, Description Name | Description

Select GROUPS.Name, PROJECTS.Description Select ARTICLES.id

 Specify information need in terms of SQL over the structured database

 Does not need any changes in existing infrastructure for deployment

 Does not disrupt existing applications

SCORE enables the management to:

ROI in terms of effective problem determination and low customer attrition

SCORE enables the user to:

ROI in terms of time saved in analysis, leading to better investment decisions

Q2: SELECT C1, C2 FROM R A X 5

TW(Q, A, t) = POP(Q, A, t) . DIFF(Q, A, t)

POP(Q, A, t) = |σA = t(Q)| = NQ

Here NQ = No. of times t appears in column A in Q = |σA = t(Q)|

Based on the “TF-IDF” metrics popular in the Information-Retrieval community

N is the maximum number of terms allowed in the context

 Including the projected-out columns

 Exploring the neighboring tables

 Remove projection constraints from the input query

N is the maximum number of terms allowed in the context

 Problem: For scalability, need to limit the number of augmentations

MAXAUG is the maximum number of augmentations allowed

MAXAUG is the maximum number of augmentations allowed

 Modified Histogram Approach

 Shallow learning curve

You might also like