Populating The Semantic Web
[Figure: labeled wrapper examples, DataPro, patterns (sample entries: ford, lincoln, ALPHA; 20, 3DIGIT; 31, 3DIGIT; 2DIGIT, 3DIGIT)]
skip this step and provide several pages of the same type to the algorithm. Autowrap, the automatic data extraction algorithm, is described in Section Data Extraction. Finally, we use the patterns learned on the training examples to assign semantic labels to the automatically extracted records, as described in Section Labeling. We validated the ADEL system on the Used Cars domain. Results of this work are presented in the Results section.

Modeling Data Content
The data modeling step is used to learn the structure of data fields from examples. We represent the structure of data by patterns of tokens and token types. In previous work, we developed a flexible pattern language and presented an efficient algorithm, DataPro, for learning patterns from examples of a field (Lerman & Minton 2000; Lerman, Minton, & Knoblock 2003). The pattern language contains specific tokens and general token types. Specific tokens refer to unique text strings, such as “California” or “sea”, while the general types describe the syntactic category to which the token’s characters belong, such as numeric, alphabetic, etc. The token types are organized in a hierarchy, which allows for multi-level generalization.¹ The pattern language can be extended to include other syntactic types or domain-specific semantic types. In addition to patterns, we also remember the mean length of a field and its variance.
The DataPro algorithm finds patterns that describe many of the examples of a field and are highly unlikely to describe a random token sequence. As an example, names can be represented as a set of patterns such as “capitalized word followed by an initial” and “capitalized word followed by a capitalized word,” whereas addresses can be represented as “number followed by two capitalized words followed by the word Blvd,” etc.
The symbolic representation of content by patterns of tokens and token types is very flexible and general. In our previous research we found that a set of patterns describing how a field begins and ends allowed us to monitor a wrapper’s accuracy or locate examples of the field on new pages with considerable accuracy (Lerman, Minton, & Knoblock 2003). In Section Labeling we show how to apply patterns to recognize known data fields.

¹ A text token is a punctuation mark (PUNCT) or an alphanumeric token (ALNUM). If it is alphanumeric, it could be of alphabetic type (ALPHA) or a number (NUMBER). If alphabetic, it could also be a capitalized word (CAPS) or an all-capitalized word (ALLCAPS). The number category is further subdivided into 1DIGIT to 5DIGIT numbers.

Data Extraction
As we discussed above, many Web sites that present information contained in databases follow a de facto convention in displaying information to the users and allowing them to navigate it. This convention affects how the Web site is organized, and gives us additional information we can leverage for information extraction. Such Web sites generate list and detail pages dynamically from templates and fill them with results of database queries.
Consider a typical list page from a Web site. As the server constructs the page in response to a query, it generates a header, followed in many cases by an advertisement, then possibly a summary of the results, such as “Displaying 1-10 of 214 records.”, a table header and footer, followed by some concluding remarks, such as copyright information or navigation aids. We call this part of the page the page template. The page template of a list page contains data that is shared by all list pages and is invariant from page to page. The page template can also be thought of as the grammar that generates the pages. Given two, or preferably more, example list pages from a site, the Autowrap algorithm can derive the grammar used to generate the pages and use it to extract data from them.
Autowrap uses an ad hoc method to induce a grammar for the Document Object Model (DOM) trees of the pages. The kinds of grammars it induces allow us to extract single data items as well as lists, where the rows of a list contain data items or nested lists.
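As a rough illustration of the template idea this section develops, the sketch below keeps only the structure that two trees share by taking a longest common subsequence (LCS) of child labels at each level, then lists the subtrees that fall outside that shared structure. The nested-tuple tree encoding and the names lcs, tree_template, slots and show are assumptions made for this example, not Autowrap's implementation. The two trees are the ones drawn in Figure 2.

```python
def lcs(xs, ys, key=lambda x: x):
    """Longest common subsequence of two lists, as (index_in_xs, index_in_ys) pairs."""
    n, m = len(xs), len(ys)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if key(xs[i]) == key(ys[j]):
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if key(xs[i]) == key(ys[j]):
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def tree_template(t1, t2):
    """Shared structure of two trees (label, children) whose roots have the same label."""
    label, kids1 = t1
    _, kids2 = t2
    matched = lcs(kids1, kids2, key=lambda node: node[0])
    return (label, [tree_template(kids1[i], kids2[j]) for i, j in matched])


def show(tree):
    """Render a tree in the r(a(b) c) notation used in the text."""
    label, kids = tree
    return label + ("(" + " ".join(show(k) for k in kids) + ")" if kids else "")


def slots(tree, template):
    """Subtrees of `tree` that are not part of the template, in document order."""
    _, kids = tree
    _, tkids = template
    matched = dict(lcs(kids, tkids, key=lambda node: node[0]))
    found = []
    for i, kid in enumerate(kids):
        if i in matched:
            found.extend(slots(kid, tkids[matched[i]]))
        else:
            found.append(show(kid))
    return found


# r(a(b c(d) e) f(g)) and r(a(s b t(d) e) f(u) v)
t1 = ("r", [("a", [("b", []), ("c", [("d", [])]), ("e", [])]),
            ("f", [("g", [])])])
t2 = ("r", [("a", [("s", []), ("b", []), ("t", [("d", [])]), ("e", [])]),
            ("f", [("u", [])]),
            ("v", [])])

template = tree_template(t1, t2)
rows = [slots(t1, template), slots(t2, template)]
```

Here template holds the shared nodes r, a, b, e and f, and rows holds the leftover data of each tree (c(d), g for the first and s, t(d), u, v for the second). Matching children by label alone is a simplification: it can misalign nodes when labels repeat among siblings.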
r(a(b c(d) e) f(g))        r(a(s b t(d) e) f(u) v)
Figure 2: Example trees

The induction algorithm has two main stages: finding repeating sub-structures and merging grammars into a more general one. Both stages use ideas based on templates.
A template is a sequence of alternating slots and stripes, where the stripes are the common sub-structures among all the pages and slots are the placeholders for pieces of data that go in between the stripes. One way to find the template of a set of pages is to find the longest common subsequence (LCS) of all the pages. The LCS immediately gives the stripes of the template and, with a little bookkeeping, the slots can also be found.
The template idea can easily be extended to trees, and in particular to the DOM structure. Given a set of sequences of DOM elements, we find the LCS and then, for each element in the LCS, we recursively apply the algorithm to the set of child elements. For example, consider two trees r(a(b c(d) e) f(g)) and r(a(s b t(d) e) f(u) v), where the parentheses group the children of the preceding element, as shown in Figure 2. First we find the LCS of the single-element sequences [r] and [r] and then proceed down to the child sequences [a f] and [a f v]. The LCS of these two is [a f], so we first recurse down to the child sequences of the two a nodes and then to those of the two f nodes to get [b e] and []. Since there are no more child sequences, we combine the LCSs to get a tree template: r(a(b e) f()).
Once we have the template, we can use it to extract the slots by finding the nodes in the original DOM structures that are not in the stripes of the template. This gives us .(.(. c(d) .) .(g)) and .(.(s . t(d) .) .(u) v). With a little more work, we can align the data in columns and represent the data in a table:
     c(d)   g
s    t(d)   u    v
The template can also be used as a similarity measure between two DOM sub-structures, since similar structures will have templates with bigger stripes than those that are not similar. The first step of the induction algorithm uses this similarity measure to find repeating sub-structures among the children of a DOM node. In particular, the algorithm looks for consecutive sequences of nodes whose similarity measure is above a threshold. The sequences, which are assumed to be the rows of a list, are used to induce a row template. As the algorithm traverses up from the leaves towards the root, it replaces the repeating sub-structures with the row templates. For example, starting with the tree r(a(b(c 1) b(c 2)) d a(b(c 3) b(c 4) b(c 5)) d), the algorithm first finds row templates for the inner lists within the children of the a nodes: r(a([b(c)]*) d a([b(c)]*) d), where the notation [...]* represents a row template. Next, the algorithm examines the children of r (in the intermediate tree). With repeating sub-structures represented as templates, the algorithm can detect the similarity between the two rows of the outer list and find a row template for the children of r: r([a([b(c)]*) d]*).
The second step of the induction algorithm merges the templates that are induced for each DOM tree. The merging algorithm is the same as the template-finding algorithm, except the row templates are treated slightly differently: the order of the nodes within row templates may be rotations of one another even though the lists represented by the templates are very similar. For example, r(a b c a b c a b c) and r(b c a b c a b c) will give r([a b c]*) and r([b c a]*) as templates. The merging algorithm handles this special case and chooses the best alignment among the rotations.

Labeling
When provided with several list pages, Autowrap extracts all tables from these pages. These include the one with the data we are interested in extracting, as well as tables containing extraneous information. The next step in the process is to label the columns of every table and output the correct table of data, which we define to be the one that has the most labeled columns.
We use learned patterns to map columns to data fields. The basic premise is to check how well a field describes a column of data, given a list of patterns that describe the data field and its mean length. We have developed a set of heuristics to score how well a field describes a column. The column is assigned the field with the highest score. Factors that increase a field’s score include
• Number of patterns that match examples in the column
• How close examples are in length to the field’s mean length
• Pattern weight, where the more specific patterns are given higher weight

Results
We validated the ADEL system on the Used Cars domain. We wrapped two used cars sites, Anaheim Lincoln Mercury and Mercedes Benz of Laguna Niguel, and collected on the order of 250 records from these sites. We normalized all data by lowercasing it. We then ran the DataPro algorithm on the records to learn descriptions of the fields. The resulting patterns and field lengths are displayed in Table 1. The Mileage and Price fields had many specific patterns that we do not display in the table.
Next, we attempted to extract and label data from three new sites in the Used Cars domain.
We manually collected three list pages from each new site. Autowrap automatically induced the template for each set of three pages and extracted all data from them. Autowrap found, on average, six tables of data on each page. These tables were processed further. Again, we normalized data by lowercasing it. In addition, we fixed some of the segmentation errors Autowrap made. For instance, Autowrap does not recognize the sequence “\r\n” or a comma as a field delimiter. Thus, “Marina del Rey, CA” is extracted as a single field, rather than two. These problems will eventually be
fixed in the Autowrap source. For now, we manually segmented fields on “\r\n” and “,” (except for numbers, in which case we did not segment them on the comma).

Field      Length           Patterns
Year       < 1.0 ± 0.0 >    [1999] [2002] [2000] [2003] [2001] [4DIGIT]
Color      < 1.12 ± 0.32 >  [smoke silver] [desert silver] [silver] [black] [grey] [ALPHA]
Bodystyle  < 3.06 ± 0.24 >  [4 dr sedan] [4 dr ALPHA ALPHA] [2 dr coupe]
Make       < 1.31 ± 0.46 >  [mercedes, benz] [ford] [mercury] [lincoln]
Mileage    < 2.38 ± 0.92 >  [47K] [21 , 3DIGIT] [20 , 3DIGIT] [2DIGIT , 3DIGIT]
Engine     < 3.83 ± 0.38 >  [5 . 0l v8] [4 . 3l v8] [3 . 2l 6cyl] [3 . 2l ALNUM] [2 . 3l] [2 . ALNUM]
Model      < 1.88 ± 0.76 >  [s430] [ranger 2wd] [mountaineer ALNUM] [gr marquis ls] [ls v6] [ls v8] [navigator 2wd] [navigator ALNUM] [ALPHA lx] [ALPHA ALPHA]
Price      < 4.0 ± 0.0 >    [$ 2 , 988] [$ 25 , 3DIGIT] [$ 15 , 988] ... [$ 29 , 3DIGIT] [$ 30 , 988] [$ 31 , 3DIGIT] [$ 2DIGIT , 995] [$ 2DIGIT , 990] [$ 2DIGIT , 900]
VIN        < 1.0 ± 0.0 >    [ALNUM]
Table 1: Patterns learned for data fields in the Used Cars domain

The columns were scored against the patterns in Table 1 according to the criteria described in Section Labeling. The column was assigned the label of the highest scoring field. The table with the most labeled columns was output in comma-delimited spreadsheet format. Table 2 shows a section of a table of labeled data for one site.
In all, Autowrap extracted 27 columns of data from the three sites. In one site, the algorithm made a mistake in the list template, causing it to improperly concatenate every two rows, thereby producing a table with 8 columns, rather than 4. The problem of correct list segmentation is addressed in a different work (Lerman et al. 2004), which shows how to improve Autowrap’s results with additional information from detail pages. In this work we are concerned with labeling columns rather than extracting structured records.
Nineteen of the 27 columns were correctly labeled, 9 were incorrectly labeled, and two were unlabeled, resulting in precision and recall of P = 0.64 and R = 0.89.² However, five of the columns that were incorrectly labeled were not in the schema we created: e.g., “Transmission”, “Stock Number”, “Warranty” and “Dealer” in Table 2. When these are excluded from the incorrectly labeled columns count, precision increases to P = 0.80.

² We define True Positives (TP) as correctly labeled columns; False Positives (FP) as incorrectly labeled columns; and False Negatives (FN) as unlabeled columns; therefore, P = TP/(TP + FP) and R = TP/(TP + FN).

Discussion
We have described the ADEL system that automatically extracts data from HTML pages and labels it. Although the system is still under development, initial results show good performance on test sites in the Used Cars domain.
ADEL can extract data from HTML pages containing single records of data, lists of records, and even nested lists. It does so by inducing the grammar of the pages and using the grammar to extract data. In order to label the data, we first need to learn the structural descriptions of the data fields. We do this by accumulating labeled data from some sites in the domain, and use the learned descriptions to label data from new sites in the same domain.
The ADEL system has shown good performance on labeling data on three new Web sites in the Used Cars domain. Of the 27 columns of data extracted from pages in these sites, 19 were correctly labeled and 2 were not labeled by the system. Five of the incorrectly labeled columns contained data that was not in the schema we created for the site, such as “Transmission”, “Stock Number”, “Warranty” and “Dealer”. We can expand the schema by allowing the user to specify patterns describing new data fields. We did this for the field “Transmission.” The patterns for the field were [automatic] and [5 speed manual]. With the manually coded patterns, we were able to label the transmission field on the site. Our algorithm also consistently mislabeled the “Price” field. This was because without the $, the “Price” and “Mileage” fields are very similar (a two-digit number followed by a three-digit number). Autowrap did not always extract the $ in the price field.
The most significant challenge for automatic labeling appears to be inconsistent data formats. For example, we failed to correctly label the “Engine” field in Table 2 because it appeared on this site as “3.2 liter 6 cyl.”, whereas we learned a description from sites that described engines as “3.2l 6cyl.” Likewise, some sites presented Body Style as “4dr sedan”, while others as “sedan, 4dr”. Conceivably, we could incorporate such data transformation rules into the labeling algorithm, thereby significantly improving performance of the system.
Unlike Arlotta et al. (2003), we do not rely on the column labels provided in the HTML page to label the schema. Even though each site in the Used Cars domain provides similar information, the schema labels used were not always consistent with other sites. For example, one site applied the label “Vehicle” to data composed of fields “Year Make Model”, and “Information” for all the details about the car, such as body style, color, engine, etc. Other differences included using “MSRP” rather than “Price” and “Exterior” rather than “Color.” A scheme based on extracting column labels from HTML tables will not allow one to recognize that a field “MSRP” on one site provides the same information as the field “Price” on another site. Reading column labels will be useful for augmenting the schema by discovering new fields and labels.
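A quick arithmetic check of the precision and recall defined in footnote 2 can be sketched as follows. The counts are an assumption rather than figures quoted from the text: 16 correctly labeled, 9 incorrectly labeled and 2 unlabeled columns sum to the 27 extracted columns and reproduce all three reported values (P = 0.64, R = 0.89, and P = 0.80 once the five out-of-schema columns are excluded). With 19 correct columns, the same formulas would give roughly 0.68 and 0.90 instead.

```python
def precision(tp, fp):
    """P = TP / (TP + FP), as defined in footnote 2."""
    return tp / (tp + fp)


def recall(tp, fn):
    """R = TP / (TP + FN), as defined in footnote 2."""
    return tp / (tp + fn)


# Assumed counts (not stated in this form in the text): they sum to 27 columns.
tp, fp, fn = 16, 9, 2
out_of_schema = 5  # incorrectly labeled columns whose field is not in the schema

p_all = round(precision(tp, fp), 2)
r_all = round(recall(tp, fn), 2)
p_in_schema = round(precision(tp, fp - out_of_schema), 2)
```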
- Model Color Color Bodystyle - VIN Mileage Mileage Model
1998 acura 2 . 5tl automatic silver 4 dr sedan 2.5 liter 5 cyl. jh4ua265xwc007394 60 , 913 14 , 976 norm reeves
2001 acura 3 . 2cl automatic red 2 dr coupe 3.2 liter 6 cyl. 19uya42741a003823 76 , 674 19 , 934 norm reeves
2000 acura 3 . 2tl automatic white 4 dr sedan 3.2 liter v-6 19uua5668ya054953 41 , 421 19 , 595 norm reeves
2000 acura integra gs 5 speed manual silver 2 dr hatchback 1.8 liter 4 cyl. jh4dc4368ys014324 36 , 866 16 , 895 norm reeves
1996 acura integra ls 5 speed manual purple 4 dr sedan 1.8 liter 4 cyl. dohc jh4db7553ts011459 62 , 723 9 , 595 norm reeves
1998 chevrolet astro automatic orange mini van 4.3 liter v-6 1gndm19w0wb180216 83 , 604 8 , 903 cerritos ford
2001 chevrolet camaro automatic blue 2 dr coupe 3.8 liter 6 cyl. 2g1fp22k412109568 22 , 249 13 , 975 norm reeves
Table 2: Results
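The column labels in Table 2 come from scoring each column against the learned field descriptions, as outlined in Section Labeling. The sketch below is one hypothetical way to combine the three factors listed there (matching patterns, closeness to the mean length, and pattern specificity). The scoring formula, the prefix-matching rule, the tokenizer, and all function names are assumptions for illustration, not the heuristics ADEL uses. The patterns and mean lengths are a subset of Table 1.

```python
import re

# General token types from the token-type footnote; any other pattern element
# is treated as a specific token (a literal string).
GENERAL_TYPES = {"PUNCT", "ALNUM", "ALPHA", "NUMBER", "CAPS", "ALLCAPS",
                 "1DIGIT", "2DIGIT", "3DIGIT", "4DIGIT", "5DIGIT"}


def tokenize(text):
    """Split text into alphanumeric runs and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def token_types(tok):
    """The token itself plus every general type it belongs to."""
    types = [tok]
    if not tok.isalnum():
        return types + ["PUNCT"]
    if tok.isdigit():
        if len(tok) <= 5:
            types.append(f"{len(tok)}DIGIT")
        types.append("NUMBER")
    elif tok.isalpha():
        if tok.isupper():
            types.append("ALLCAPS")
        elif tok[0].isupper():
            types.append("CAPS")
        types.append("ALPHA")
    return types + ["ALNUM"]


def starts_with(pattern, tokens):
    """Assumed matching rule: the example must begin with the pattern."""
    if len(pattern) > len(tokens):
        return False
    return all(p in token_types(t) for p, t in zip(pattern, tokens))


def pattern_weight(pattern):
    """Assumed weight: 1 for an all-general pattern, up to 2 for an all-specific one."""
    specific = sum(1 for p in pattern if p not in GENERAL_TYPES)
    return 1 + specific / len(pattern)


def score(field, column):
    """Hypothetical score: summed weights of matching patterns plus length closeness."""
    total = 0.0
    for example in column:
        tokens = tokenize(example.lower())
        matched = sum(pattern_weight(p) for p in field["patterns"]
                      if starts_with(p, tokens))
        closeness = 1 / (1 + abs(len(tokens) - field["mean_len"]))
        total += matched + closeness
    return total / len(column)


def label_column(column, fields):
    """Assign the column the field with the highest score."""
    return max(fields, key=lambda name: score(fields[name], column))


FIELDS = {  # subset of Table 1
    "Year": {"mean_len": 1.0, "patterns": [["4DIGIT"]]},
    "Mileage": {"mean_len": 2.38, "patterns": [["21", ",", "3DIGIT"],
                                               ["20", ",", "3DIGIT"],
                                               ["2DIGIT", ",", "3DIGIT"]]},
    "Price": {"mean_len": 4.0, "patterns": [["$", "25", ",", "3DIGIT"],
                                            ["$", "2DIGIT", ",", "995"],
                                            ["$", "2DIGIT", ",", "990"]]},
    "VIN": {"mean_len": 1.0, "patterns": [["ALNUM"]]},
}

mileage_label = label_column(["60 , 913", "76 , 674", "41 , 421"], FIELDS)
bare_price_label = label_column(["14 , 976", "19 , 934"], FIELDS)
price_label = label_column(["$ 14 , 995", "$ 19 , 990"], FIELDS)
vin_label = label_column(["jh4ua265xwc007394"], FIELDS)
```

Under these assumptions, a price column extracted without its $ sign scores highest against Mileage, which mirrors the Price and Mileage confusion described in the Discussion.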