Deep Web

The document describes a system for exploring and querying deep web data sources. The system has six modules: schema mining, schema matching, query planning, query reuse, incremental plan generation, and plan execution. It achieves parallelism through task parallelism by querying independent data sources simultaneously, thread parallelism by submitting multiple queries, and pipeline parallelism by processing outputs while data sources process new queries. The system is designed with querier and data pools to implement parallel query execution.

Uploaded by

Refi Rasheed
Copyright
© Attribution Non-Commercial (BY-NC)

DEEP WEB

INTRODUCTION
The deep web is the content that resides in searchable databases, the results from which can only be discovered by a direct query.
Most of the web's information is on dynamically generated sites, and standard search engines never find it.

The deep web is made up of hundreds of thousands of publicly accessible databases and is approximately 500 times bigger than the surface web.

Deep web search technology automates the process of making dozens of direct queries simultaneously using multi-threading. We propose a system for exploring and querying deep web data sources. This system is able to automatically mine deep web data source schemas.

CONTENTS
An overview of deep web search
System structure
Schema Mining
Schema Matching
Query Planning
Query Reuse
Incremental Plan Generation
Plan Execution
System design
System details
Exploiting Parallelism

An overview of deep web search


When a user submits a query by filling in a query form, data related to the query in the back-end database is returned dynamically in an HTML page.
No single database can provide all the information a user requests, and the output of some databases needs to be the input for querying another database.
For example, consider a query that asks for the amino acids occurring at the corresponding position in the orthologous genes of non-human mammals with respect to a particular gene, such as ERCC6 [1].

ref[1]:F. Wang and G. Agrawal, SEEDEEP: A System for Exploring and Querying Scientific Deep web Data Sources
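The ERCC6 example above chains two sources: the first returns orthologous genes, and those results feed the query to the second. A minimal sketch of that chaining, using two stand-in functions with illustrative placeholder data (the gene IDs, positions, and amino acid values here are invented for demonstration, not real biological results):

```python
from typing import Dict, List

# Hypothetical stand-ins for two deep web sources. A real deployment
# would submit HTTP form queries; here each source is a small dict.
def query_ortholog_db(gene: str) -> List[str]:
    # Source 1: human gene name -> orthologous gene IDs (placeholder data).
    fake_db = {"ERCC6": ["ERCC6_mouse", "ERCC6_dog"]}
    return fake_db.get(gene, [])

def query_sequence_db(gene_id: str, position: int) -> str:
    # Source 2: (gene ID, position) -> amino acid (placeholder data).
    fake_db = {("ERCC6_mouse", 10): "L", ("ERCC6_dog", 10): "L"}
    return fake_db.get((gene_id, position), "?")

def amino_acids_at(gene: str, position: int) -> Dict[str, str]:
    # The output of the first source becomes the input of the second.
    return {g: query_sequence_db(g, position)
            for g in query_ortholog_db(gene)}
```

This illustrates why a single form query cannot answer such questions: the answer requires composing two sources whose schemas must first be matched.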

System structure
The system structure comprises two components: the exploring component and the querying component.
There are six individual modules: the Schema Mining Module (SMi), Schema Matching Module (SMa), Query Planning Module (QP), Query Reuse Module (QR), Incremental Plan Generation Module (IPG), and Plan Execution Module (PE).

System structure[2]

ref[2]:F. Wang and G. Agrawal, SEEDEEP: A System for Exploring and Querying Scientific Deep web Data Sources

Schema Mining
SMi is used to automatically mine the schemas
of deep web data sources and build data source models to describe the data content hidden behind the data sources. The schema mining module aims to find the complete output schemas of deep web data sources.

Schema Matching
To construct a data source dependence model, we need to consider the input-output schema matching problem for different data sources. Schema matching discovers the semantic correspondences among the attributes in web interfaces.
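As a minimal sketch of the idea, attribute correspondences can be approximated by string similarity between attribute names (real schema matching uses richer evidence such as data values and types; the threshold here is an arbitrary assumption):

```python
from difflib import SequenceMatcher

def match_schemas(outputs, inputs, threshold=0.6):
    """Pair each output attribute of one source with the most similar
    input attribute of another source. A crude name-based proxy for
    semantic matching; pairs below `threshold` are left unmatched."""
    matches = {}
    for out_attr in outputs:
        # Find the input attribute with the highest similarity ratio.
        best = max(inputs, key=lambda a: SequenceMatcher(
            None, out_attr.lower(), a.lower()).ratio())
        score = SequenceMatcher(None, out_attr.lower(), best.lower()).ratio()
        if score >= threshold:
            matches[out_attr] = best
    return matches
```

For example, an output attribute `gene_name` would be matched to an input attribute `gene`, while dissimilar names stay unmatched.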

Query Planning
The query planning module takes a query and the system models as input, and generates a query plan to answer the query.
We generate a simple query plan for each individual query.
The query plans for all individual queries in the original query are combined to form the final query plan.

Query Reuse
The query reuse module takes a query-plan-driven strategy to generate query plans based on previously cached query plans, which are stored in the plan base. Effective data reuse can greatly reduce the execution time of a query plan.
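A minimal sketch of a plan base, assuming a simple normalization (an order-insensitive set of query terms) as the cache key — the real system's keying and plan representation may differ:

```python
class PlanBase:
    """Cache query plans keyed by a normalized query signature,
    so that equivalent later queries reuse the cached plan."""

    def __init__(self):
        self._plans = {}
        self.hits = 0  # number of reused plans, for illustration

    @staticmethod
    def _signature(query):
        # Assumed normalization: lowercase, order-insensitive term set.
        return frozenset(query.lower().split())

    def get_or_build(self, query, build_plan):
        key = self._signature(query)
        if key in self._plans:
            self.hits += 1          # plan reuse: skip planning work
        else:
            self._plans[key] = build_plan(query)
        return self._plans[key]
```

With this keying, `"find ERCC6 orthologs"` and `"orthologs ERCC6 find"` map to the same cached plan, so the second query pays no planning cost.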

Incremental Plan Generation


When a query plan is executed, the remote data sources and communication links are subject to congestion and failures, which can cause significant and unpredictable delays. The incremental plan generation module handles such issues gracefully.
The problem of unavailable resources can also be avoided by this module.

Plan Execution
The query plan is executed by the plan execution module.
The plan execution module achieves three types of parallelism — task parallelism, thread parallelism, and pipeline parallelism — to accelerate the execution of a query plan.

System design
A query plan can be modeled by a directed graph
G = (V, E), where the node set V is the set of data sources involved in the plan, and the edge set E captures the interdependences between these data sources. To save the time spent waiting for results from data sources, we can query multiple data sources in parallel.
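One way to realize this (a sketch, not necessarily the system's exact scheduling) is to group the graph's nodes into topological "waves": every source in a wave has all of its parents satisfied, so all sources in one wave can be queried in parallel:

```python
from collections import defaultdict

def parallel_levels(edges, nodes):
    """Group the data sources of a query plan G = (V, E) into waves.
    `edges` is a list of (parent, child) dependencies; each returned
    level contains sources whose dependencies are all satisfied, so
    they can be queried concurrently (level-by-level topological sort)."""
    indeg = {n: 0 for n in nodes}
    children = defaultdict(list)
    for parent, child in edges:
        indeg[child] += 1
        children[parent].append(child)
    levels = []
    ready = [n for n in nodes if indeg[n] == 0]   # no parents: query now
    while ready:
        levels.append(sorted(ready))
        nxt = []
        for n in ready:
            for c in children[n]:
                indeg[c] -= 1
                if indeg[c] == 0:                 # all parents done
                    nxt.append(c)
        ready = nxt
    return levels
```

For a plan where sources A and B both feed C, and C feeds D, the waves are [A, B], then [C], then [D]: A and B are queried in parallel.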

System details
In our design, a querier pool and a data pool are built to implement parallel query plan execution.
Each querier corresponds to a data source. Its responsibilities include monitoring the activity at the parents of the data source.
The synchronization and communication between the main threads for the data sources is supported by the data pool.

Querier and data pools[3]

ref[3]:F. Wang and G. Agrawal, SEEDEEP: A System for Exploring and Querying Scientific Deep web Data Sources

The querier pool comprises a querier corresponding to each data source. A querier is implemented as an interactive two-stage multi-threading system.
The first stage contains a thread, which is used to implement the communication and synchronization with other queriers.
The second stage is composed of a set of querying threads, which are initiated by the first stage. Each querying thread requests data from the data source and sends results back to the main thread.

Two stage multi-threading system[4]

ref[4]:F. Wang and G. Agrawal, SEEDEEP: A System for Exploring and Querying Scientific Deep web Data Sources
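The two-stage design above can be sketched with Python threads. This is a minimal illustration, not the system's implementation: the input queue stands in for work arriving from parent queriers, `fetch` stands in for a form submission to the data source, and the results queue stands in for the data pool:

```python
import queue
import threading

def run_querier(inputs, fetch, results):
    """Sketch of one querier. The first-stage thread pulls input
    instances from `inputs` and launches one second-stage querying
    thread per instance; each querying thread requests data via
    `fetch` and reports back through `results` (the data pool)."""

    def querying_thread(item):
        results.put(fetch(item))           # request data, send result back

    def first_stage():
        workers = []
        while True:
            item = inputs.get()
            if item is None:               # sentinel: no more inputs
                break
            t = threading.Thread(target=querying_thread, args=(item,))
            t.start()                      # second stage runs in parallel
            workers.append(t)
        for t in workers:
            t.join()

    main = threading.Thread(target=first_stage)
    main.start()
    return main
```

Because each input instance gets its own querying thread, multiple requests to the same source overlap (thread parallelism), and the first-stage thread stays free to accept new inputs while earlier requests are still in flight.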

Exploiting Parallelism
The system has three kinds of parallelism:
Task parallelism: data sources having no interdependence can be queried in parallel.
Thread parallelism: multiple input instances can be submitted to a data source in parallel.
Pipeline parallelism: the output of a data source can be processed by its children data sources while the data source processes new input queries.

CONCLUSION
This paper has described and evaluated a system that exploits parallelization for accelerating search over multiple deep web data sources. An interactive, two-stage multi-threading system design enables us to achieve task parallelism, thread parallelism, and pipeline parallelism.

References
www.cse.ohio-state.edu

www.portal.acm.com
F. Wang and G. Agrawal, SEEDEEP: A System for Exploring and Querying Scientific Deep Web Data Sources, Statistical and Scientific Database Management Conference.

QUESTIONS..?

THANK YOU...
