Deep Web

The document describes a system for exploring and querying deep web data sources. The system has six modules: schema mining, schema matching, query planning, query reuse, incremental plan generation, and plan execution. It achieves parallelism through task parallelism by querying independent data sources simultaneously, thread parallelism by submitting multiple queries, and pipeline parallelism by processing outputs while data sources process new queries. The system is designed with querier and data pools to implement parallel query execution.

Uploaded by

Refi Rasheed
Copyright
© Attribution Non-Commercial (BY-NC)

DEEP WEB

INTRODUCTION
The deep web is the content that resides in searchable databases, the results from which can only be discovered by a direct query.
Most of the web's information is on dynamically generated sites, and standard search engines never find it.

The deep web is made up of hundreds of thousands of publicly accessible databases and is approximately 500 times bigger than the surface web.

Deep web search technology automates the process of making dozens of direct queries simultaneously using multi-threading. We propose a system for exploring and querying deep web data sources. This system is able to automatically mine deep web data source schemas.

CONTENTS
An overview of deep web search
System structure
Schema Mining
Schema Matching
Query Planning
Query Reuse
Incremental Plan Generation
Plan Execution
System design
System details
Exploiting Parallelism

An overview of deep web search


When a user submits a query by filling in a query form, data related to the query in the back-end database is returned dynamically in an HTML page.
No single database can provide all the information a user requests, and the output of some databases needs to be the input for querying another database.
For example, consider a query that asks for the amino acids occurring at the corresponding position in the orthologous genes of non-human mammals with respect to a particular gene, such as ERCC6 [1].

ref[1]:F. Wang and G. Agrawal, SEEDEEP: A System for Exploring and Querying Scientific Deep web Data Sources
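The ERCC6 example above chains two sources: the first returns orthologous genes, and those results feed the query to the second. A minimal sketch of that chaining, using two stand-in functions with illustrative placeholder data (the gene IDs, positions, and amino acid values here are invented for demonstration, not real biological results):

```python
from typing import Dict, List

# Hypothetical stand-ins for two deep web sources. A real deployment
# would submit HTTP form queries; here each source is a small dict.
def query_ortholog_db(gene: str) -> List[str]:
    # Source 1: human gene name -> orthologous gene IDs (placeholder data).
    fake_db = {"ERCC6": ["ERCC6_mouse", "ERCC6_dog"]}
    return fake_db.get(gene, [])

def query_sequence_db(gene_id: str, position: int) -> str:
    # Source 2: (gene ID, position) -> amino acid (placeholder data).
    fake_db = {("ERCC6_mouse", 10): "L", ("ERCC6_dog", 10): "L"}
    return fake_db.get((gene_id, position), "?")

def amino_acids_at(gene: str, position: int) -> Dict[str, str]:
    # The output of the first source becomes the input of the second.
    return {g: query_sequence_db(g, position)
            for g in query_ortholog_db(gene)}
```

This illustrates why a single form query cannot answer such questions: the answer requires composing two sources whose schemas must first be matched.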

System structure
The system structure comprises two components: the exploring component and the querying component.
There are six individual modules: the Schema Mining Module (SMi), Schema Matching Module (SMa), Query Planning Module (QP), Query Reuse Module (QR), Incremental Plan Generation Module (IPG), and Plan Execution Module (PE).

System structure[2]

ref[2]:F. Wang and G. Agrawal, SEEDEEP: A System for Exploring and Querying Scientific Deep web Data Sources

Schema Mining
SMi is used to automatically mine the schemas
of deep web data sources and build data source models to describe the data content hidden behind the data sources. The schema mining module aims to find the complete output schemas of deep web data sources.

Schema Matching
To construct a data source dependence model, we need to consider the input-output schema matching problem for different data sources. Schema matching discovers the semantic correspondences among the attributes in web interfaces.
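As a minimal sketch of the idea, attribute correspondences can be approximated by string similarity between attribute names (real schema matching uses richer evidence such as data values and types; the threshold here is an arbitrary assumption):

```python
from difflib import SequenceMatcher

def match_schemas(outputs, inputs, threshold=0.6):
    """Pair each output attribute of one source with the most similar
    input attribute of another source. A crude name-based proxy for
    semantic matching; pairs below `threshold` are left unmatched."""
    matches = {}
    for out_attr in outputs:
        # Find the input attribute with the highest similarity ratio.
        best = max(inputs, key=lambda a: SequenceMatcher(
            None, out_attr.lower(), a.lower()).ratio())
        score = SequenceMatcher(None, out_attr.lower(), best.lower()).ratio()
        if score >= threshold:
            matches[out_attr] = best
    return matches
```

For example, an output attribute `gene_name` would be matched to an input attribute `gene`, while dissimilar names stay unmatched.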

Query Planning
The query planning module takes a query and the system models as input, and generates a query plan to answer the query.
We generate a simple query plan for each individual query.
The query plans for all individual queries in the original query are combined to form the final query plan.

Query Reuse
The query reuse module takes a query-plan-driven strategy to generate query plans based on previously cached query plans, which are stored in the plan base. Effective data reuse can greatly reduce the execution time of a query plan.
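A minimal sketch of a plan base, assuming a simple normalization (an order-insensitive set of query terms) as the cache key — the real system's keying and plan representation may differ:

```python
class PlanBase:
    """Cache query plans keyed by a normalized query signature,
    so that equivalent later queries reuse the cached plan."""

    def __init__(self):
        self._plans = {}
        self.hits = 0  # number of reused plans, for illustration

    @staticmethod
    def _signature(query):
        # Assumed normalization: lowercase, order-insensitive term set.
        return frozenset(query.lower().split())

    def get_or_build(self, query, build_plan):
        key = self._signature(query)
        if key in self._plans:
            self.hits += 1          # plan reuse: skip planning work
        else:
            self._plans[key] = build_plan(query)
        return self._plans[key]
```

With this keying, `"find ERCC6 orthologs"` and `"orthologs ERCC6 find"` map to the same cached plan, so the second query pays no planning cost.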

Incremental Plan Generation


When a query plan is executed, the remote data sources and communication links are subject to congestion and failures, which can cause significant and unpredictable delays. The incremental plan generation module handles such issues gracefully.
The problem of unavailable resources can also be avoided by this module.

Plan Execution
The query plan is executed by the plan execution module.
The plan execution module achieves three types of parallelism — task parallelism, thread parallelism, and pipeline parallelism — to accelerate the execution of a query plan.

System design
A query plan can be modeled by a directed graph
G = (V, E), where the node set V is the set of data sources involved in the plan, and the edge set E captures the interdependences between these data sources. To save the time spent waiting for results from data sources, we can query multiple data sources in parallel.
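One way to realize this (a sketch, not necessarily the system's exact scheduling) is to group the graph's nodes into topological "waves": every source in a wave has all of its parents satisfied, so all sources in one wave can be queried in parallel:

```python
from collections import defaultdict

def parallel_levels(edges, nodes):
    """Group the data sources of a query plan G = (V, E) into waves.
    `edges` is a list of (parent, child) dependencies; each returned
    level contains sources whose dependencies are all satisfied, so
    they can be queried concurrently (level-by-level topological sort)."""
    indeg = {n: 0 for n in nodes}
    children = defaultdict(list)
    for parent, child in edges:
        indeg[child] += 1
        children[parent].append(child)
    levels = []
    ready = [n for n in nodes if indeg[n] == 0]   # no parents: query now
    while ready:
        levels.append(sorted(ready))
        nxt = []
        for n in ready:
            for c in children[n]:
                indeg[c] -= 1
                if indeg[c] == 0:                 # all parents done
                    nxt.append(c)
        ready = nxt
    return levels
```

For a plan where sources A and B both feed C, and C feeds D, the waves are [A, B], then [C], then [D]: A and B are queried in parallel.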

System details
In our design, a querier pool and a data pool are built to implement parallel query plan execution.
Each querier corresponds to a data source. Its responsibilities include monitoring the activity at the parents of the data source.
The synchronization and communication between the main threads for the data sources is supported by the data pool.

Querier and data pools[3]

ref[3]:F. Wang and G. Agrawal, SEEDEEP: A System for Exploring and Querying Scientific Deep web Data Sources

The querier pool comprises a querier corresponding to each data source. A querier is implemented as an interactive two-stage multi-threading system.
The first stage contains a thread, which is used to implement the communication and synchronization with other queriers.
The second stage is composed of a set of querying threads, which are initiated by the first stage. Each querying thread requests data from the data source and sends results back to the main thread.

Two stage multi-threading system[4]

ref[4]:F. Wang and G. Agrawal, SEEDEEP: A System for Exploring and Querying Scientific Deep web Data Sources
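The two-stage design above can be sketched with Python threads. This is a minimal illustration, not the system's implementation: the input queue stands in for work arriving from parent queriers, `fetch` stands in for a form submission to the data source, and the results queue stands in for the data pool:

```python
import queue
import threading

def run_querier(inputs, fetch, results):
    """Sketch of one querier. The first-stage thread pulls input
    instances from `inputs` and launches one second-stage querying
    thread per instance; each querying thread requests data via
    `fetch` and reports back through `results` (the data pool)."""

    def querying_thread(item):
        results.put(fetch(item))           # request data, send result back

    def first_stage():
        workers = []
        while True:
            item = inputs.get()
            if item is None:               # sentinel: no more inputs
                break
            t = threading.Thread(target=querying_thread, args=(item,))
            t.start()                      # second stage runs in parallel
            workers.append(t)
        for t in workers:
            t.join()

    main = threading.Thread(target=first_stage)
    main.start()
    return main
```

Because each input instance gets its own querying thread, multiple requests to the same source overlap (thread parallelism), and the first-stage thread stays free to accept new inputs while earlier requests are still in flight.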

Exploiting Parallelism
The system has three kinds of parallelism:
Task parallelism: data sources having no interdependence can be queried in parallel.
Thread parallelism: multiple input instances can be submitted to a data source in parallel.
Pipeline parallelism: the output of a data source can be processed by its children data sources while the data source processes new input queries.

CONCLUSION
This paper has described and evaluated a system that exploits parallelization for accelerating search over multiple deep web data sources. An interactive, two-stage multi-threading system design enables us to achieve task parallelism, thread parallelism, and pipeline parallelism.

References
www.cse.ohio-state.edu

www.portal.acm.com
F. Wang and G. Agrawal, SEEDEEP: A System for Exploring and Querying Scientific Deep Web Data Sources, Statistical and Scientific Database Management Conference.

QUESTIONS..?

THANK YOU...
