
Web Page DOM Node Characterization and its Application to Page Segmentation

Gujjar Vineel
Computing and Decision Sciences Lab
GE Research, INDIA
Email: [email protected]

Abstract—Web pages are generally organized in terms of visually distinct segments, such as Navigation bars, Advertisement banners, Headers, Portlets and Widgets. Despite the apparent structured layout, web pages are considered a source of unstructured data from an information extraction point of view. Hence, as a step towards interpreting the organization of web data, web page segmentation attempts to identify cohesive regions of a page.

In this paper, we present a novel DOM tree mining approach for page segmentation. We first characterize the nodes of the DOM tree structure based on their Content Size and Entropy. While the Content Size of a node indicates the amount of textual content contributed by its subtree, Entropy measures the strength of local "patterns" exhibited therein. In other words, a node manifesting highly repetitive patterns begets a high Entropy as per our formulation. Based on this characterization of DOM nodes, we then develop an unsupervised algorithm to automatically identify the segments of a given web page.

Index Terms—Web Information Extraction, Web Page Segmentation, Document Object Model, Entropy

I. INTRODUCTION

Most web pages exhibit a structured layout, wherein page content is organized into well-defined segments. Some segments, such as Navigation bars, Headers and Footers, improve the usability of websites and impart a uniform look and feel to them. Others, like Advertisement banners, are considered a distraction [1] or a source of revenue generation, depending on the viewpoint. In general, page segments enable streamlined content management by allowing pages to be composed in terms of reusable User Interface (UI) components.

Despite the apparent organization in the layout of web pages, extracting information from them is a non-trivial task. This is chiefly due to the orientation of the HTML markup language towards formatting the content of web pages, rather than describing it in a machine-readable form. Consequently, web IR systems adopt heuristic and other methods to interpret the page structure and extract segments of interest from it. Besides information extraction, web page segmentation has numerous other applications in the areas of Mobile Internet [2], [3], [4], web search [5], filtering advertisement banners [1], extracting product information [6] and site topic hierarchy extraction [7].

In this paper, we analyze the Document Object Model (DOM) tree representation of web pages in order to identify their segments. DOM trees, also called tag trees, are a hierarchical representation of the nested tag structures of HTML pages. For instance, consider the fragment of HTML code shown in figure 1 (a). When rendered by a browser, it appears as shown in figure 1 (b). A closer look at the code fragment reveals a hierarchical nesting of HTML tags, represented in the form of a tree graph in figure 1 (c). Here, each HTML tag corresponds to a node in the tree graph. The text contents appear as leaf nodes. Also, each node in the DOM tree has two attributes, name and content.

• The name of a node is the string representation of its corresponding HTML tag. For example, the nodes in figure 1 (c) are labeled with their names. Note that node names are not unique: more than one node might have the same name. Also, leaf nodes corresponding to the page text do not have an HTML tag associated with them; they are simply referred to as Text nodes.

• The content of a node is the text that its tree (rooted at the node itself) manifests in the rendered web page. Thus, the content of a Text node is simply the text string that it represents in the HTML form. The node content of an interior node is, recursively, the concatenation of the node content of its child nodes.
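To make these two attributes concrete, the following minimal Java sketch models a DOM node and the recursive definition of its content. It is illustrative only; the class shape and helper names are our own, not the implementation described in section V.

import java.util.ArrayList;
import java.util.List;

// Illustrative DOM node: interior nodes carry an HTML tag name,
// while Text nodes carry a text string and no tag name.
class Node {
    String name;   // tag name, e.g. "table"; null for Text nodes
    String text;   // text string; non-null only for Text nodes
    List<Node> children = new ArrayList<>();

    static Node tag(String name, Node... kids) {
        Node n = new Node();
        n.name = name;
        for (Node k : kids) n.children.add(k);
        return n;
    }

    static Node text(String s) {
        Node n = new Node();
        n.text = s;
        return n;
    }

    // The content of a Text node is its own string; the content of an
    // interior node is, recursively, the concatenation of its children's.
    String content() {
        if (text != null) return text;
        StringBuilder sb = new StringBuilder();
        for (Node c : children) sb.append(c.content()).append(' ');
        return sb.toString().trim();
    }
}

For the first table row of figure 1, for instance, Node.tag("tr", Node.tag("td", Node.tag("b", Node.text("Fruits"))), Node.tag("td", Node.text("Oranges")), Node.tag("td", Node.text("Mangoes"))).content() evaluates to "Fruits Oranges Mangoes".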
While the discussion so far focused on a sample DOM tree, the corresponding structure for a typical page can be complex. A typical web page can contain a few thousand nodes in its DOM tree. Our approach to web page segmentation is as follows. Section III introduces our characterization of DOM tree nodes. As an application of our node characterization method, we develop a segmentation algorithm in section IV. Section V describes our experimental setup. Sections VI and VII contain the results and conclusions respectively.

II. RELATED WORK

Web Data Extraction. Research in web data extraction has largely focused on wrapper learning [8]. Here, web page content is regarded as the output of a template, which transforms underlying database records into web page segments. The problem of wrapper induction is to learn the template from the web pages. The template extraction can be unsupervised [9], or can be automated by exploiting regularities in the page [10], [11], [12]. There are also visual tools [13], [14] which define their own wrapper programming languages in order to aid the data extraction.

Specific Segment Extraction. Another line of research [15], [16], [17] focuses on specific page segments of interest.



<html>
  <h1>Fruits and Vegetables</h1>
  <table>
    <tr>
      <td><b>Fruits</b></td>
      <td>Oranges</td>
      <td>Mangoes</td>
    </tr>
    <tr>
      <td><b>Vegetables</b></td>
      <td>Carrots</td>
      <td>Tomatoes</td>
    </tr>
  </table>
</html>

(a) An HTML code fragment. (b) Rendered view (an image, omitted here). (c) DOM tree: html has children h1 and table; h1 has a text child; table has two tr children; each tr has three td children, of which the first contains b with a text child, and the other two contain text leaves.

Fig. 1. A simple HTML fragment, its rendered view and its DOM tree

For example, Debnath et al. [16] compare various methods for extracting text portions of web pages. Similarly, Kushmerick [1] applies page segmentation for advertisement banner detection and removal.

Template Removal. Another body of work is characterized by the analysis of multiple pages of a site to detect common segments. Subsequent to the early work by Bar-Yossef and Rajagopalan [18] in this direction, Vieira et al. [19] propose efficient methods for matching similar nodes across DOM trees to identify the template content. Gibson et al. [20] report a steady growth in template content on the web. Lin and Ho [15] and Yi et al. [17] use HTML tags such as <table> as cues to construct style trees that aid the extraction of informative blocks of web pages, while eliminating noisy content. More recently, Chakrabarti et al. [21] present a method wherein each node of the DOM tree is assigned a templateness score, based on which a classifier can remove template content.

Full Page Segmentation. Early work in this direction adopted a Vision-based Page Segmentation (VIPS) approach [22], wherein rendered pages are viewed as images from which cohesive regions are to be identified. Similarly, Song et al. [23] adopt learning methods based on features extracted using VIPS and other techniques. Some research in this direction is motivated by its application to the mobile Internet. Baluja [2] proposed a decision tree type method using visual cues for page segmentation. Similarly, Hattori et al. [4] carry out the segmentation of a page based on the concept of "content distances". Ramaswamy et al. [24] present a Shingling algorithm for detecting page segments (fragments). Here, the authors focus on detecting fragments that are shared across pages. Kao et al. [25] derive features from a web page and apply a greedy algorithm on the features to segment the page.

In the context of the related work discussed so far, this paper presents a DOM tree analysis based unsupervised method for full page segmentation, using page-level information alone.

III. NODE CHARACTERIZATION

In the DOM tree representation of web pages, nodes whose content appears as page segments are herein referred to as page segment nodes. Our challenge is to identify node features that distinguish the page segment nodes from others. In particular, we propose the following two features:

Node Content Size: Node content, as defined earlier in section I, is the text manifested by its subtree in the rendered web page. Accordingly, the Content Size of a node is the number of words of text in its contents. Content size provides a direct hint of whether a node is a page segment node: nodes with extreme content sizes are unlikely to be page segment nodes.

Node Entropy: Page segments tend to exhibit strong local "patterns" in their contents. For example, Navigation bars contain a list of hyperlinks, each of which is similarly formatted. Similarly, a page segment might contain a "top stories" list of items. Such lists appear in the document tree as repetitive node structures. We propose an Entropy based formulation to quantify this property of page segment nodes.

A. Definition

Formally, the Content Size of a node can be defined as follows. For a Text node (which is a leaf node), t, we define the function W(t), which counts the total number of (possibly non-distinct) words in its text. The function evaluates to zero for non-text nodes. Now, if C_v is the set of child nodes of node v, and L is the set of all leaf nodes in the document tree, then the Content Size of the node, Size(v), is:

$$Size(v) = \begin{cases} W(v) & , v \in L \\ \sum_{c \in C_v} Size(c) & , \text{otherwise} \end{cases}$$
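This definition translates directly into a post-order recursion. Below is a minimal, self-contained sketch; the record shape mirrors the illustrative Node sketch from section I and is not the paper's code.

import java.util.List;

public class ContentSize {
    // Illustrative node shape, as in the sketch of section I.
    record Node(String name, String text, List<Node> children) {}

    // Content Size per the equation above: W(t) counts the words of a
    // Text node t; interior nodes sum the sizes of their children.
    static int size(Node v) {
        if (v.text() != null) {                        // v is in L, a Text node
            String t = v.text().trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;   // W(v)
        }
        int total = 0;
        for (Node c : v.children()) total += size(c);  // sum over c in C_v
        return total;
    }

    public static void main(String[] args) {
        Node text = new Node(null, "Fruits and Vegetables", List.of());
        Node h1 = new Node("h1", null, List.of(text));
        System.out.println(size(h1));                  // prints 3
    }
}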
Now we turn our attention to quantifying the second feature of a page segment, the node Entropy. For each node in the tree structure, we first compute the distribution of Node Names in its subtree. We hypothesize that in the presence of repetitive patterns in a tree structure, its node names tend to occur at similar frequencies, due to which the corresponding distribution is close to uniform. For example, consider the table node in the DOM tree of figure 1. It contains a set of descendants with names tr, td, text and b, occurring at frequencies of 2, 6, 6 and 2 respectively. This distribution is much closer to uniform than the distribution for (say) the html node, wherein the frequencies are 2, 6, 7, 2, 1 and 1 for tr, td, text, b, table and h1 respectively. Hence, in this example, the table node is more likely to be a page segment node than the html node. After experimenting with several metrics for assessing the uniformity of a distribution, we found the Entropy function to be the most effective. A few notable points about our use of the function are worth discussing here.

• The Entropy function attains its maximum value when the node name frequencies are all equal, and it has low values when the distribution is highly concentrated at a few names.

• The Entropy is calculated based on the node names alone, and therefore its value depends only on the formatting idiosyncrasies of a page. The textual contents of the page have no bearing on the formulation. Moreover, the name of the node whose Entropy is to be determined is itself excluded from the calculation.

We now mathematically express the node Entropy. Considering a node v, if f_i is the frequency with which the node name i appears in its subtree, then we can denote the "probability" of name i as

$$p_i = \frac{f_i}{n} \quad \ldots (i)$$

where n is the total number of descendant nodes of node v. The Entropy of node v is

$$E(v) = -\frac{\sum_{i \in V} p_i \log(p_i)}{\log(|V|)} \quad \ldots (ii)$$

where V is the set of unique node names that appear in the subtree of node v. The log(|V|) factor is included to normalize the output within the [0, 1] interval.
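The short fragment below implements equations (i) and (ii) and reproduces the comparison of the table and html nodes above. It is an illustrative sketch; the natural logarithm is our arbitrary choice of base, which the normalization by log(|V|) makes immaterial.

import java.util.Map;

public class NodeEntropy {
    // Equations (i) and (ii): p_i = f_i / n over the descendant-name
    // frequency table (the node's own name excluded), normalized by
    // log|V| so that the result lies in [0, 1].
    static double entropy(Map<String, Integer> nameFreq) {
        int n = nameFreq.values().stream().mapToInt(Integer::intValue).sum();
        int distinct = nameFreq.size();              // |V|
        if (distinct <= 1) return 0.0;               // degenerate distribution
        double e = 0.0;
        for (int f : nameFreq.values()) {
            double p = (double) f / n;               // equation (i)
            e -= p * Math.log(p);                    // -p_i log(p_i)
        }
        return e / Math.log(distinct);               // equation (ii)
    }

    public static void main(String[] args) {
        // Descendant-name frequencies from the figure 1 example above.
        System.out.printf("table: %.3f%n", entropy(
            Map.of("tr", 2, "td", 6, "text", 6, "b", 2)));   // prints ~0.906
        System.out.printf("html:  %.3f%n", entropy(
            Map.of("tr", 2, "td", 6, "text", 7, "b", 2,
                   "table", 1, "h1", 1)));                   // prints ~0.846
    }
}

As expected, the more uniform distribution of the table node begets the higher value.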
In summary, we have proposed a two parameter characterization of a DOM node: a node Size parameter, which measures the relative "importance" of the node in terms of the amount of text contained in it, and an Entropy parameter, which quantifies the strength of regular patterns in the subtree. The features can be represented in a two dimensional Feature Space, whose properties are described in the following subsection.

B. Node Feature Space

The feature space of DOM nodes, defined by node Entropy and Size, appears in figure 2. Points in the space correspond to nodes of DOM trees computed from the Amazon homepage¹ and a Wikipedia article².

[Figure: a scatter plot of Node Size (log scale, 1 to 5000) against Node Entropy (0.5 to 1.0) for the two pages.]
Fig. 2. DOM Node Feature Space

In the feature space, nodes closer to the root appear in the upper left corner, as they tend to have lower Entropy and higher sizes. Similarly, snippets of text appear in the lower regions of the plot due to their low content sizes. Besides small size, some other snippets like Breadcrumb Trails (recall the "Home > Section Page > Subsection Page" set of hyperlinks) also have high Entropy due to their fairly regular structure. They occupy the extreme lower right corner of the feature space. We propose the page segment nodes to be separable in this space.

A notable property of the feature space is a linear trend between the two axes. On average, as the node sizes decrease, the Entropy seems to increase. We observed this consistently in all the pages that were analyzed. We find that, with certain assumptions, the trend is explainable. It occurs due to the dependence of both features on the node depth. A formal explanation of this property follows.

[Figure: a synthetic tree in which the root V0 (depth 0) has degree g, every node at depth d has the name Vd, and the leaves VH lie at depth H.]
Fig. 3. An example synthetic tree of height H, constant degree g, and homogeneous node names with respect to depth

As depicted in figure 3, consider a simple tree graph of height H with a constant degree g at all its nodes. Now, without loss of generality, if we further assume that each leaf node has unit size, then as we ascend from the leaf nodes to the root node, the average node content size increases by a factor of g. Hence, the node content size, S_d, at a depth d can be expressed as:

$$S_d = g^{(H-d)}$$

$$\log(S_d) = H \log(g) - d \log(g) \quad \ldots (iii)$$

From (iii) it is clear that, on a logarithmic scale, the size of a node is a linear function of its depth, and the slope of the line indicates the node degree.

¹ http://www.amazon.com, accessed on 15th January 2008
² http://en.wikipedia.org/wiki/Information, accessed on 15th January 2008
Now, for calculating the Entropy of a node as a function of its depth, a further assumption is required. We assume that a node name is a function of its depth. In other words, all nodes at a depth d have the node name V_d. The assumption is representative because most HTML tags, by their function, are suitable at a certain depth in the tree. For instance, the <a> tags, meant to define hyperlinks, appear close to the leaf nodes. Similarly, the <table> tags, often used for defining regions of a web page, appear much closer to the root node. We believe that this tendency of HTML tags to appear at certain depths only gives rise to decreasing Entropy with decreasing depth, and ultimately with increasing size.

Now, at depth H, all the nodes are leaf nodes. Consequently, the Entropy of a node at this depth is undefined. At depth (H-1) however, a node has g child nodes, each with the name V_H. Hence, the set of node names can be represented as <V_H> and its corresponding frequencies as <g>. Consequently, from (i) and (ii), the Entropy for a node at this depth is

$$E_{H-1} = -\frac{g}{g} \log\left(\frac{g}{g}\right) = 0$$

Similarly, if we ascend one more level to depth (H-2), the frequencies are <g, g^2>, corresponding to the names <V_{H-1}, V_H>. And now the Entropy of a node at this depth is

$$E_{H-2} = -\left[\frac{g}{g+g^2} \log\left(\frac{g}{g+g^2}\right) + \frac{g^2}{g+g^2} \log\left(\frac{g^2}{g+g^2}\right)\right]$$

From the above, it is clear that for any node at depth d, the Entropy can be expressed as

$$E_d = -\sum_{i=1}^{H-d} \frac{g^i}{g+\ldots+g^{H-d}} \log\left(\frac{g^i}{g+\ldots+g^{H-d}}\right), \quad d < H$$

Using the Geometric Progression relation

$$g + g^2 + \ldots + g^m = \frac{g(1-g^m)}{1-g}$$

the node Entropy simplifies to

$$E_d = -\sum_{i=1}^{H-d} \frac{g^{i-1}(1-g)}{1-g^{H-d}} \log\left(\frac{g^{i-1}(1-g)}{1-g^{H-d}}\right) \quad \ldots (iv)$$

Upon evaluating the equations (iii) and (iv) for various values of d and g, we find that the node Size-Entropy relation appears close to linear. The results are shown in figure 4. We assumed a constant document size of 2000 words to compute the plot. Also, the slope of the line is indicative of the degree of the nodes in the tree.

[Figure: Node Size (log scale) against Node Entropy (0.3 to 1.0) for the synthetic tree, with one line per node degree, g = 2, 2.5, 3 and 4.]
Fig. 4. The feature space for the synthetic tree

Note that while most real-world DOM trees do not exhibit a constant degree (an assumption that was made in the foregoing analysis), we are only explaining their tendencies. If a DOM tree did have a constant degree, then both features would be strongly correlated, rendering the information contributed by one of the features redundant.
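The following throwaway sketch evaluates equations (iii) and (iv) numerically. Note one assumption on our part: we additionally apply the log|V| normalization of equation (ii), so that the values fall in the [0, 1] range plotted in figure 4.

public class SyntheticTree {
    // Equation (iii): at depth d, the content size is g^(H-d).
    static double sizeAt(int H, double g, int d) {
        return Math.pow(g, H - d);
    }

    // Equation (iv), using the name frequencies <g, g^2, ..., g^(H-d)>,
    // normalized here by log|V| as in equation (ii) (our assumption).
    static double entropyAt(int H, double g, int d) {
        int k = H - d;                       // |V|: number of distinct names
        if (k <= 1) return 0.0;              // matches E_{H-1} = 0
        double total = 0.0;                  // g + g^2 + ... + g^k
        for (int i = 1; i <= k; i++) total += Math.pow(g, i);
        double e = 0.0;
        for (int i = 1; i <= k; i++) {
            double p = Math.pow(g, i) / total;
            e -= p * Math.log(p);
        }
        return e / Math.log(k);
    }

    public static void main(String[] args) {
        int H = 10;
        for (double g : new double[] {2, 2.5, 3, 4}) {
            System.out.println("degree g = " + g);
            for (int d = 0; d < H; d++)      // one line per depth
                System.out.printf("  d=%d  log(size)=%6.2f  entropy=%.3f%n",
                        d, Math.log(sizeAt(H, g, d)), entropyAt(H, g, d));
        }
    }
}

Plotting log(size) against the entropy so obtained reproduces the near-linear, degree-dependent trend of figure 4.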
In summary, we suggest that there is a correlation between the Size and Entropy of nodes that depends on the node degree. In other words, for a given tree, a node can have a higher or lower Entropy by virtue of its size. Hence, if the Size-Entropy transfer function can be learned or otherwise estimated, then it can serve to distinguish between page segment nodes and other nodes. Based on this observation, we developed a segmentation algorithm, presented in the next section.

IV. WEB PAGE SEGMENTATION

The segmentation algorithm developed herein is intended as a demonstration of the usefulness of the DOM tree characterization. It exemplifies one of several potential applications of the characterization.

A. Data Cleaning

The first step in segmentation is to parse the HTML into a DOM tree. Since the HTML language does not enforce well-formedness of its tags, it gives rise to ambiguity in interpreting the tree structure. Web browsers have built-in rules to handle the situation. For instance, they assume that any open <td> tag (which marks the beginning of a table cell) is implicitly closed when the next <tr> tag is encountered. Software libraries that efficiently encode such rules are available in the open source domain³, and can be leveraged to obtain a DOM tree from any HTML page.

B. Node Feature Computation

Using a depth-first traversal of the tree, each node computes a frequency table of the node names in its subtree. The Entropy of the node is then computed from the frequency table. Node sizes are also simultaneously computed. The recursive nature of the tree structure is leveraged by computing the features based on the calculations at the child nodes.

³ http://htmlcleaner.sourceforge.net/
C. Segmentation Algorithm

Based on the feature space introduced in figure 2, the aim of the segmentation algorithm is to prune the filtered DOM tree so that the leaf nodes in the resultant tree represent page segments. The segmentation algorithm traverses down the tree till it encounters a node which satisfies pre-specified constraints on node size and entropy. The tree is then pruned at this node, and the subtree under the node is not traversed. The tree traversal continues at the other nodes. In our implementation, the constraints on the node size and entropy are viewed as a region, called the target region, in the feature space (see figure 5).

Now, it is possible that the algorithm ends up at a leaf node without encountering any node in the target region. In such cases we use various heuristics to prune the tree. In particular, if a node has already been identified as a page segment node, then all its siblings belonging to the same parent can also be treated as page segment nodes. Thus, sibling nodes of an already identified page segment node are considered as page segment nodes.
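A compact sketch of this traversal appears below. It is illustrative only: the record shape, the threshold constants and the exact form of the sibling heuristic are our assumptions, not the paper's tuned implementation.

import java.util.ArrayList;
import java.util.List;

public class Segmenter {
    // A node with its precomputed features from section IV-B.
    record Node(String name, int size, double entropy, List<Node> children) {}

    // The target region of figure 5: placeholder size and entropy bounds.
    static boolean inTargetRegion(Node v) {
        return v.size() >= 10 && v.size() <= 500 && v.entropy() >= 0.8;
    }

    // Top-down traversal: prune (emit a segment) at the first node that
    // falls in the target region; otherwise keep descending.
    static void segment(Node v, List<Node> segments) {
        if (inTargetRegion(v)) {
            segments.add(v);                 // prune; do not enter the subtree
            return;
        }
        boolean siblingIsSegment = false;
        for (Node c : v.children()) {
            if (siblingIsSegment) {          // sibling heuristic: once one child
                segments.add(c);             // is a segment, so are the rest
            } else {
                int before = segments.size();
                segment(c, segments);
                siblingIsSegment = segments.size() == before + 1
                        && segments.get(before) == c;   // c itself was pruned
            }
        }
    }

    public static void main(String[] args) {
        Node nav = new Node("ul", 40, 0.95, List.of());
        Node story = new Node("div", 120, 0.85, List.of());
        Node root = new Node("body", 160, 0.60, List.of(nav, story));
        List<Node> segments = new ArrayList<>();
        segment(root, segments);
        segments.forEach(s -> System.out.println(s.name()));  // prints ul, div
    }
}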
[Figure: the feature space with a separating line; the target region is bounded by upper and lower Node Size thresholds and extends towards the point <1, 0>, while the root node lies in the upper left.]
Fig. 5. Target region for page segment nodes

V. EXPERIMENTAL SETUP

In order to experiment with various node features and evaluate the effectiveness of our approach, a Java implementation of the algorithm was built. The implementation analyzed input HTML pages and generated output HTML by marking the page segments using CSS techniques. The output HTML page, when viewed in a browser, visually shows the page segments marked in the form of dashed-line boundaries.

In order to have a representative mix of web pages to evaluate on, we randomly sampled page URLs from the Open Directory Initiative, a directory of websites. We stratified our random sampling process to ensure that our test corpus had representation from categories such as index pages, content pages, blogs, news, shopping, entertainment and sports. The performance of segmentation was manually evaluated, by human judgement, in terms of precision and recall values. Due to the labor intensive nature of the evaluation process, we were able to evaluate a total of only 400 web pages.

VI. RESULTS

The segmentation algorithm is effective. It exhibited around 90% precision at 80% recall. A sample output of the segmentation algorithm appears in figure 6. From the figure, it appears that two segments, including the "Sponsored Advertisements" segment, have not been identified by the algorithm. On closer inspection, we found that both these segments comprised content generated by web browser scripts. Hence, they were unavailable as nodes in the DOM tree.

[Figure: a screenshot of the segmented page.]
Fig. 6. Segmented page from the about.com page. Our implementation creates an HTML output containing thick dashed lines around the identified page segments.

Also, the feature space for the about.com page appears in figure 7. The solid rectangular points in figure 7 correspond to the dashed segments in figure 6. The figure illustrates the ability of the algorithm to identify seemingly non-separable page segment nodes.

Other than the issue of browser scripts, we found that segmentation is difficult for article pages that have minimal formatting. Such pages⁴ had one-level, "flat" DOM trees, wherein the root node had a large number of leaves (Text nodes) corresponding to each paragraph of text in the page.

Finally, in order to quantify the benefit of using the Entropy feature for page segmentation, we applied our algorithm using the node Size feature alone. We found that the algorithm exhibited surprisingly good performance: it achieved a precision and recall of 80% and 75% respectively. In this case, errors often arose when an entire collection of page segments met the one-dimensional target size criteria, after which the algorithm did not descend the tree further (to the actual page segment nodes).

⁴ See http://beust.com/belize-200405.html for an example
[Figure: a scatter plot of Size (log scale, 1 to 500) against Entropy (0.6 to 1.0) for the about.com page.]
Fig. 7. Feature space for the about.com example

VII. SUMMARY AND CONCLUSIONS

In this paper, we defined two node features, Content Size and Entropy, which intuitively represent orthogonal properties of a page segment. Based on the feature space defined by these two properties, we also developed an algorithm for page segmentation. The Ascend stage of the algorithm is particularly useful in correcting errors in page segmentation. A supervised version of the algorithm can be easily conceived in order to improve its performance.

The chief drawback of the algorithm arises from the assumption that every page segment corresponds to a single node in the DOM tree. In cases wherein web pages are organized in a flat hierarchy, the assumption causes our method to fail. A clustering based approach that allows merging a subset of sibling nodes might resolve the issue.

REFERENCES

[1] N. Kushmerick, "Learning to remove internet advertisements," in AGENTS '99. ACM, 1999, pp. 175–181.
[2] S. Baluja, "Browsing on small screens: recasting web-page segmentation into an efficient machine learning framework," in WWW '06. ACM, 2006, pp. 33–42.
[3] Y. Chen, W.-Y. Ma, and H.-J. Zhang, "Detecting web page structure for adaptive viewing on small form factor devices," in WWW '03. ACM, 2003, pp. 225–233.
[4] G. Hattori, K. Hoashi, K. Matsumoto, and F. Sugaya, "Robust web page segmentation for mobile terminal using content-distances and page layout information," in WWW '07. ACM, 2007, pp. 361–370.
[5] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, "Block-based web search," in ACM SIGIR '04. ACM, 2004, pp. 456–463.
[6] C. W. G. Z. G. Xu, "A web page segmentation algorithm for extracting product information," in IEEE Conference on Information Acquisition, Aug. 2006, pp. 1374–1379.
[7] S. Chakrabarti, M. Joshi, and V. Tawde, "Enhanced topic distillation using text, markup tags, and hyperlinks," in SIGIR '01. ACM, 2001, pp. 208–216.
[8] N. Kushmerick, "Wrapper induction for information extraction," Ph.D. dissertation, 1997, chairperson: Daniel S. Weld.
[9] J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and R. Aranha, "Extracting semistructured information from the web," in Proceedings of the Workshop on Management of Semistructured Data, 1997.
[10] A. Arasu and H. Garcia-Molina, "Extracting structured data from web pages," in ACM SIGMOD '03. ACM, 2003, pp. 337–348.
[11] V. Crescenzi, G. Mecca, and P. Merialdo, "RoadRunner: Towards automatic data extraction from large web sites," in VLDB '01, 2001, pp. 109–118.
[12] K. Lerman, L. Getoor, S. Minton, and C. Knoblock, "Using the structure of web sites for automatic segmentation of tables," in ACM SIGMOD '04. ACM, 2004, pp. 119–130.
[13] R. Baumgartner, S. Flesca, and G. Gottlob, "Visual web information extraction with Lixto," in VLDB '01, 2001, pp. 119–128.
[14] Z. Liu, W. K. Ng, F. Li, and E.-P. Lim, "A visual tool for building logical data models of websites," in CIKM Workshop on WIDM '02, McLean, Virginia, USA, November 8, 2002.
[15] S.-H. Lin and J.-M. Ho, "Discovering informative content blocks from web documents," in ACM SIGKDD '02. ACM, 2002, pp. 588–593.
[16] S. Debnath, P. Mitra, N. Pal, and C. L. Giles, "Automatic identification of informative sections of web pages," IEEE Trans. on Knowledge and Data Engg., vol. 17, no. 9, pp. 1233–1246, 2005.
[17] L. Yi, B. Liu, and X. Li, "Eliminating noisy information in web pages for data mining," in ACM SIGKDD '03. ACM, 2003, pp. 296–305.
[18] Z. Bar-Yossef and S. Rajagopalan, "Template detection via data mining and its applications," in WWW '02. ACM, 2002, pp. 580–591.
[19] K. Vieira, A. S. da Silva, N. Pinto, E. S. de Moura, J. M. B. Cavalcanti, and J. Freire, "A fast and robust method for web page template detection and removal," in CIKM '06. ACM, 2006, pp. 258–267.
[20] D. Gibson, K. Punera, and A. Tomkins, "The volume and evolution of web page templates," in Poster at WWW '05. ACM, 2005, pp. 830–839.
[21] D. Chakrabarti, R. Kumar, and K. Punera, "Page-level template detection via isotonic smoothing," in WWW '07. ACM, 2007, pp. 61–70.
[22] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, "VIPS: a vision-based page segmentation algorithm," Microsoft Research, Tech. Rep. MSR-TR-2003-79, 2003.
[23] R. Song, H. Liu, J.-R. Wen, and W.-Y. Ma, "Learning block importance models for web pages," in WWW '04. ACM, 2004, pp. 203–211.
[24] L. Ramaswamy, A. Iyengar, L. Liu, and F. Douglis, "Automatic detection of fragments in dynamically generated web pages," in WWW '04. ACM, 2004, pp. 443–454.
[25] H.-Y. Kao, J.-M. Ho, and M.-S. Chen, "WISDOM: Web intrapage informative structure mining based on document object model," IEEE Trans. on Knowledge and Data Engg., vol. 17, no. 5, pp. 614–627, 2005.
