A Motivating Problem: Wrapper Induction: Thai Restaurants in L.A. A-Rated by The L.A. County Health Depart

This document discusses active learning techniques for wrapper induction, which is the task of automatically generating extraction rules to extract structured data from web pages. It introduces three multi-view active learning algorithms - Co-Testing, Co-EMT, and Adaptive View Validation - that can learn accurate wrappers from only a few labeled examples by exploiting redundancy across different representations or views of the data. Co-Testing and Co-EMT actively select the most informative examples to label by considering disagreement between views, while Adaptive View Validation predicts whether a new task is suitable for multi-view learning based on prior tasks.

Uploaded by

Srinivas

Active Learning with Multiple Views

detecting the most informative examples, while also exploiting the remaining unlabeled examples. Second, we discuss Adaptive View Validation (Muslea et al., 2002b), which is a meta-learner that uses the experience acquired while solving past learning tasks to predict whether multi-view learning is appropriate for a new, unseen task.

A Motivating Problem: Wrapper Induction

Information agents such as Ariadne (Knoblock et al., 2001) integrate data from pre-specified sets of Web sites so that they can be accessed and combined via database-like queries. For example, consider the agent in Figure 1, which answers queries such as the following:

Show me the locations of all Thai restaurants in L.A. that are A-rated by the L.A. County Health Department.

To answer this query, the agent must combine data from several Web sources:

• From Zagat’s, it obtains the name and address of all Thai restaurants in L.A.
• From the L.A. County Web site, it gets the health rating of any restaurant of interest.
• From the Geocoder, it obtains the latitude/longitude of any physical address.
• From Tiger Map, it obtains the plot of any location, given its latitude and longitude.

Figure 1. An information agent that combines data from the Zagat’s restaurant guide, the L.A. County Health Department, the ETAK Geocoder, and the Tiger Map service

Information agents typically rely on wrappers to extract the useful information from the relevant Web pages. Each wrapper consists of a set of extraction rules and the code required to apply them. As manually writing the extraction rules is a time-consuming task that requires a high level of expertise, researchers designed wrapper induction algorithms that learn the rules from user-provided examples (Muslea et al., 2001).

In practice, information agents use hundreds of extraction rules that have to be updated whenever the format of the Web sites changes. As manually labeling examples for each rule is a tedious, error-prone task, one must learn high accuracy rules from just a few labeled examples. Note that both the small training sets and the high accuracy rules are crucial to the successful deployment of an agent. The former minimizes the amount of work required to create the agent, thus making the task manageable. The latter is required in order to ensure the quality of the agent’s answer to each query: when the data from multiple sources is integrated, the errors of the corresponding extraction rules get compounded, thus affecting the quality of the final result; for instance, if only 90% of the Thai restaurants and 90% of their health ratings are extracted correctly, the result contains only 81% (90% x 90% = 81%) of the A-rated Thai restaurants.
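
The compounding arithmetic in the 90% example can be checked with a short sketch. It assumes, as the multiplication in the text implies, that the extraction errors of different rules are independent; the function name `compounded_accuracy` and the four-source case are hypothetical illustrations, not figures from the article.

```python
from math import prod


def compounded_accuracy(step_accuracies):
    """Fraction of answers that survive every extraction step.

    Assumes the errors of the individual extraction rules are independent,
    so the per-step accuracies simply multiply.
    """
    return prod(step_accuracies)


# The case from the text: restaurants and health ratings, each 90% accurate.
overall = compounded_accuracy([0.90, 0.90])
print(f"{overall:.0%}")  # prints 81%

# Hypothetical extension: all four sources of the agent at 90% each.
four_sources = compounded_accuracy([0.90] * 4)
print(f"{four_sources:.0%}")  # prints 66%
```

Under the same independence assumption, every additional source lowers the overall figure further, which is why each individual rule needs to be highly accurate.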

We use wrapper induction as the motivating problem for this article because, despite the practical importance of learning accurate wrappers from just a few labeled examples, there has been little work on active learning for this task. Furthermore, as explained in Muslea (2002), existing general-purpose active learners cannot be applied in a straightforward manner to wrapper induction.
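
To make the notion of a wrapper concrete, here is a minimal sketch of "a set of extraction rules and the code required to apply them". It assumes a deliberately simple rule form, a pair of literal start and end landmarks; this is an illustration only, not the rule language of the cited wrapper induction systems. The `ExtractionRule` class, the field names, and the sample page are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractionRule:
    """Hypothetical rule: the field is the text between two literal landmarks."""
    start_landmark: str
    end_landmark: str

    def apply(self, page: str) -> Optional[str]:
        # Locate the start landmark; give up if the page lacks it.
        start = page.find(self.start_landmark)
        if start == -1:
            return None
        start += len(self.start_landmark)
        # The field ends at the first end landmark after the start.
        end = page.find(self.end_landmark, start)
        if end == -1:
            return None
        return page[start:end].strip()


# In this sketch a wrapper is a mapping from field names to rules.
wrapper = {
    "name": ExtractionRule("<b>Name:</b>", "<br>"),
    "address": ExtractionRule("<b>Address:</b>", "<br>"),
}

# Made-up page fragment standing in for a restaurant listing.
page = "<b>Name:</b> Thai Garden <br><b>Address:</b> 12 Main St., Los Angeles <br>"

record = {field: rule.apply(page) for field, rule in wrapper.items()}
print(record)

# If the site changes its markup, the old rules stop matching.
changed_page = page.replace("<b>", "<strong>").replace("</b>", "</strong>")
broken = {field: rule.apply(changed_page) for field, rule in wrapper.items()}
print(broken)
```

The second half of the sketch shows why rules have to be updated whenever the page format changes: once the landmarks no longer occur in the page, every rule returns `None`. A wrapper induction algorithm would learn such rules from labeled examples rather than have them written by hand.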

MAIN THRUST

In the context of wrapper induction, we intuitively describe three novel algorithms: Co-Testing, Co-EMT,
