
Web Page DOM Node Characterization and its Application to Page Segmentation

Gujjar Vineel
Computing and Decision Sciences Lab
GE Research, INDIA
Email: [email protected]

Abstract—Web pages are generally organized in terms of visually distinct segments, such as Navigation bars, Advertisement banners, Headers, Portlets and Widgets. Despite the apparent structured layout, web pages are considered a source of unstructured data from an information extraction point of view. Hence, as a step towards interpreting the organization of web data, web page segmentation attempts to identify cohesive regions of a page.

In this paper, we present a novel DOM tree mining approach for page segmentation. We first characterize the nodes of the DOM tree structure based on their Content Size and Entropy. While the Content Size of a node indicates the amount of textual content contributed by its subtree, Entropy measures the strength of local "patterns" exhibited therein. In other words, a node manifesting highly repetitive patterns begets a high Entropy as per our formulation. Based on this characterization of DOM nodes, we then develop an unsupervised algorithm to automatically identify the segments of a given web page.

Index Terms—Web Information Extraction, Web Page Segmentation, Document Object Model, Entropy

I. INTRODUCTION

Most web pages exhibit a structured layout, wherein page content is organized into well-defined segments. Some segments, such as Navigation bars, Headers and Footers, improve the usability of websites and impart a uniform look and feel to them. Others, like Advertisement banners, are considered a distraction [1] or a source of revenue generation, depending on the viewpoint. In general, page segments enable streamlined content management by allowing pages to be composed in terms of reusable User Interface (UI) components.

Despite the apparent organization in the layout of web pages, extracting information from them is a non-trivial task. This is chiefly due to the orientation of the HTML markup language towards formatting the content of web pages, rather than describing it in a machine-readable form. Consequently, web IR systems adopt heuristic and other methods to interpret the page structure and extract segments of interest from it. Besides information extraction, web page segmentation has numerous other applications in the areas of Mobile Internet [2], [3], [4], web search [5], filtering advertisement banners [1], extracting product information [6] and site topic hierarchy extraction [7].

In this paper, we analyze the Document Object Model (DOM) tree representation of web pages in order to identify their segments. DOM trees, also called tag trees, are a hierarchical representation of the nested tag structures of HTML pages. For instance, consider the fragment of HTML code shown in figure 1 (a). When rendered by a browser, it appears as shown in figure 1 (b). A closer look at the code fragment reveals a hierarchical nesting of HTML tags, represented in the form of a tree graph in figure 1 (c). Here, each HTML tag corresponds to a node in the tree graph. The text contents appear as leaf nodes. Also, each node in the DOM tree has two attributes, name and content.

• The name of a node is the string representation of its corresponding HTML tag. For example, the nodes in figure 1 (c) are labeled with their names. Note that node names are not unique: more than one node might have the same name. Also, leaf nodes corresponding to the page text do not have an HTML tag associated with them; they are simply referred to as Text nodes.

• The content of a node is the text that its tree (rooted at the node itself) manifests in the rendered web page. Thus, the content of a Text node is simply the text string that it represents in the HTML form. The node content of an interior node is, recursively, the concatenation of the node content of its child nodes.
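To make these two attributes concrete, the following minimal Java sketch models a DOM node and the recursive definition of its content. It is illustrative only; the class shape and helper names are our own, not the implementation described in section V.

import java.util.ArrayList;
import java.util.List;

// Illustrative DOM node: interior nodes carry an HTML tag name,
// while Text nodes carry a text string and no tag name.
class Node {
    String name;   // tag name, e.g. "table"; null for Text nodes
    String text;   // text string; non-null only for Text nodes
    List<Node> children = new ArrayList<>();

    static Node tag(String name, Node... kids) {
        Node n = new Node();
        n.name = name;
        for (Node k : kids) n.children.add(k);
        return n;
    }

    static Node text(String s) {
        Node n = new Node();
        n.text = s;
        return n;
    }

    // The content of a Text node is its own string; the content of an
    // interior node is, recursively, the concatenation of its children's.
    String content() {
        if (text != null) return text;
        StringBuilder sb = new StringBuilder();
        for (Node c : children) sb.append(c.content()).append(' ');
        return sb.toString().trim();
    }
}

For the first table row of figure 1, for instance, Node.tag("tr", Node.tag("td", Node.tag("b", Node.text("Fruits"))), Node.tag("td", Node.text("Oranges")), Node.tag("td", Node.text("Mangoes"))).content() evaluates to "Fruits Oranges Mangoes".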
While the discussion so far focused on a sample DOM tree, the corresponding structure for a typical page can be complex. A typical web page can contain a few thousand nodes in its DOM tree. Our approach to web page segmentation is as follows. Section III introduces our characterization of DOM tree nodes. As an application of our node characterization method, we develop a segmentation algorithm in section IV. Section V describes our experimental setup. Sections VI and VII contain the results and conclusions respectively.

II. RELATED WORK

Web Data Extraction. Research in web data extraction has largely focused on wrapper learning [8]. Here, web page content is regarded as the output of a template, which transforms underlying database records into web page segments. The problem of wrapper induction is to learn the template from the web pages. The template extraction can be unsupervised [9], or can be automated by exploiting regularities in the page [10], [11], [12]. There are also visual tools [13], [14] which define their own wrapper programming languages in order to aid the data extraction.

Specific Segment Extraction. Another line of research [15], [16], [17] focuses on specific page segments of interest.



<html>
  <h1>Fruits and Vegetables</h1>
  <table>
    <tr>
      <td><b>Fruits</b></td>
      <td>Oranges</td>
      <td>Mangoes</td>
    </tr>
    <tr>
      <td><b>Vegetables</b></td>
      <td>Carrots</td>
      <td>Tomatoes</td>
    </tr>
  </table>
</html>

(a) An HTML code fragment. (b) Rendered view (an image, omitted here). (c) DOM tree: html has children h1 and table; h1 has a text child; table has two tr children; each tr has three td children, of which the first contains b with a text child, and the other two contain text leaves.

Fig. 1. A simple HTML fragment, its rendered view and its DOM tree

For example, Debnath et al. [16] compare various methods for extracting text portions of web pages. Similarly, Kushmerick [1] applies page segmentation for advertisement banner detection and removal.

Template Removal. Another body of work is characterized by the analysis of multiple pages of a site to detect common segments. Subsequent to the early work by Bar-Yossef and Rajagopalan [18] in this direction, Vieira et al. [19] propose efficient methods for matching similar nodes across DOM trees to identify the template content. Gibson et al. [20] report a steady growth in template content on the web. Lin and Ho [15] and Yi et al. [17] use HTML tags such as <table> as cues to construct style trees that aid the extraction of informative blocks of web pages, while eliminating noisy content. More recently, Chakrabarti et al. [21] present a method wherein each node of the DOM tree is assigned a templateness score, based on which a classifier can remove template content.

Full Page Segmentation. Early work in this direction adopted a Vision-based Page Segmentation (VIPS) approach [22], wherein rendered pages are viewed as images from which cohesive regions are to be identified. Similarly, Song et al. [23] adopt learning methods based on features extracted using VIPS and other techniques. Some research in this direction is motivated by its application to the mobile Internet. Baluja [2] proposed a decision tree type method using visual cues for page segmentation. Similarly, Hattori et al. [4] carry out the segmentation of a page based on the concept of "content distances". Ramaswamy et al. [24] present a Shingling algorithm for detecting page segments (fragments). Here, the authors focus on detecting fragments that are shared across pages. Kao et al. [25] derive features from a web page and apply a greedy algorithm on the features to segment the page.

In the context of the related work discussed so far, this paper presents a DOM tree analysis based unsupervised method for full page segmentation, using page-level information alone.

III. NODE CHARACTERIZATION

In the DOM tree representation of web pages, nodes whose content appears as page segments are herein referred to as page segment nodes. Our challenge is to identify node features that distinguish the page segment nodes from others. In particular, we propose the following two features:

Node Content Size: Node content, as defined earlier in section I, is the text manifested by its subtree in the rendered web page. Accordingly, the Content Size of a node is the number of words of text in its contents. Content size provides a direct hint of whether a node is a page segment node: nodes with extreme content sizes are unlikely to be page segment nodes.

Node Entropy: Page segments tend to exhibit strong local "patterns" in their contents. For example, Navigation bars contain a list of hyperlinks, each of which is similarly formatted. Similarly, a page segment might contain a "top stories" list of items. Such lists appear in the document tree as repetitive node structures. We propose an Entropy based formulation to quantify this property of page segment nodes.

A. Definition

Formally, the Content Size of a node can be defined as follows. For a Text node (which is a leaf node), t, we define the function W(t), which counts the total number of (possibly non-distinct) words in its text. The function evaluates to zero for non-text nodes. Now, if C_v is the set of child nodes of node v, and L is the set of all leaf nodes in the document tree, then the Content Size of the node, Size(v), is:

$$Size(v) = \begin{cases} W(v) & , v \in L \\ \sum_{c \in C_v} Size(c) & , \text{otherwise} \end{cases}$$
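This definition translates directly into a post-order recursion. Below is a minimal, self-contained sketch; the record shape mirrors the illustrative Node sketch from section I and is not the paper's code.

import java.util.List;

public class ContentSize {
    // Illustrative node shape, as in the sketch of section I.
    record Node(String name, String text, List<Node> children) {}

    // Content Size per the equation above: W(t) counts the words of a
    // Text node t; interior nodes sum the sizes of their children.
    static int size(Node v) {
        if (v.text() != null) {                        // v is in L, a Text node
            String t = v.text().trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;   // W(v)
        }
        int total = 0;
        for (Node c : v.children()) total += size(c);  // sum over c in C_v
        return total;
    }

    public static void main(String[] args) {
        Node text = new Node(null, "Fruits and Vegetables", List.of());
        Node h1 = new Node("h1", null, List.of(text));
        System.out.println(size(h1));                  // prints 3
    }
}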
Now we turn our attention to quantifying the second feature of a page segment, the node Entropy. For each node in the tree structure, we first compute the distribution of Node Names in its subtree. We hypothesize that in the presence of repetitive patterns in a tree structure, its node names tend to occur at similar frequencies, due to which the corresponding distribution is close to uniform. For example, consider the table node in the DOM tree of figure 1. It contains a set of descendants with names tr, td, text and b, occurring at frequencies of 2, 6, 6 and 2 respectively. This distribution is much closer to uniform than the distribution for (say) the html node, wherein the frequencies are 2, 6, 7, 2, 1 and 1 for tr, td, text, b, table and h1 respectively. Hence, in this example, the table node is more likely to be a page segment node than the html node. After experimenting with several metrics for assessing the uniformity of a distribution, we found the Entropy function to be the most effective. A few notable points about our use of the function are worth discussing here.

• The Entropy function attains its maximum value when the node name frequencies are all equal, and it has low values when the distribution is highly concentrated at a few names.

• The Entropy is calculated based on the node names alone, and therefore its value depends only on the formatting idiosyncrasies of a page. The textual contents of the page have no bearing on the formulation. Moreover, the name of the node whose Entropy is to be determined is itself excluded from the calculation.

We now mathematically express the node Entropy. Considering a node v, if f_i is the frequency with which the node name i appears in its subtree, then we can denote the "probability" of name i as

$$p_i = \frac{f_i}{n} \quad \ldots (i)$$

where n is the total number of descendant nodes of node v. The Entropy of node v is

$$E(v) = -\frac{\sum_{i \in V} p_i \log(p_i)}{\log(|V|)} \quad \ldots (ii)$$

where V is the set of unique node names that appear in the subtree of node v. The log(|V|) factor is included to normalize the output within the [0, 1] interval.
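The short fragment below implements equations (i) and (ii) and reproduces the comparison of the table and html nodes above. It is an illustrative sketch; the natural logarithm is our arbitrary choice of base, which the normalization by log(|V|) makes immaterial.

import java.util.Map;

public class NodeEntropy {
    // Equations (i) and (ii): p_i = f_i / n over the descendant-name
    // frequency table (the node's own name excluded), normalized by
    // log|V| so that the result lies in [0, 1].
    static double entropy(Map<String, Integer> nameFreq) {
        int n = nameFreq.values().stream().mapToInt(Integer::intValue).sum();
        int distinct = nameFreq.size();              // |V|
        if (distinct <= 1) return 0.0;               // degenerate distribution
        double e = 0.0;
        for (int f : nameFreq.values()) {
            double p = (double) f / n;               // equation (i)
            e -= p * Math.log(p);                    // -p_i log(p_i)
        }
        return e / Math.log(distinct);               // equation (ii)
    }

    public static void main(String[] args) {
        // Descendant-name frequencies from the figure 1 example above.
        System.out.printf("table: %.3f%n", entropy(
            Map.of("tr", 2, "td", 6, "text", 6, "b", 2)));   // prints ~0.906
        System.out.printf("html:  %.3f%n", entropy(
            Map.of("tr", 2, "td", 6, "text", 7, "b", 2,
                   "table", 1, "h1", 1)));                   // prints ~0.846
    }
}

As expected, the more uniform distribution of the table node begets the higher value.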
In summary, we have proposed a two parameter characterization of a DOM node: a node Size parameter, which measures the relative "importance" of the node in terms of the amount of text contained in it, and an Entropy parameter, which quantifies the strength of regular patterns in the subtree. The features can be represented in a two dimensional Feature Space, whose properties are described in the following subsection.

B. Node Feature Space

The feature space of DOM nodes, defined by node Entropy and Size, appears in figure 2. Points in the space correspond to nodes of DOM trees computed from the Amazon homepage¹ and a Wikipedia article².

[Figure: a scatter plot of Node Size (log scale, 1 to 5000) against Node Entropy (0.5 to 1.0) for the two pages.]
Fig. 2. DOM Node Feature Space

In the feature space, nodes closer to the root appear in the upper left corner, as they tend to have lower Entropy and higher sizes. Similarly, snippets of text appear in the lower regions of the plot due to their low content sizes. Besides small size, some other snippets like Breadcrumb Trails (recall the "Home > Section Page > Subsection Page" set of hyperlinks) also have high Entropy due to their fairly regular structure. They occupy the extreme lower right corner of the feature space. We propose the page segment nodes to be separable in this space.

A notable property of the feature space is a linear trend between the two axes. On average, as the node sizes decrease, the Entropy seems to increase. We observed this consistently in all the pages that were analyzed. We find that, with certain assumptions, the trend is explainable. It occurs due to the dependence of both features on the node depth. A formal explanation of this property follows.

[Figure: a synthetic tree in which the root V0 (depth 0) has degree g, every node at depth d has the name Vd, and the leaves VH lie at depth H.]
Fig. 3. An example synthetic tree of height H, constant degree g, and homogeneous node names with respect to depth

As depicted in figure 3, consider a simple tree graph of height H with a constant degree g at all its nodes. Now, without loss of generality, if we further assume that each leaf node has unit size, then as we ascend from the leaf nodes to the root node, the average node content size increases by a factor of g. Hence, the node content size, S_d, at a depth d can be expressed as:

$$S_d = g^{(H-d)}$$

$$\log(S_d) = H \log(g) - d \log(g) \quad \ldots (iii)$$

From (iii) it is clear that, on a logarithmic scale, the size of a node is a linear function of its depth, and the slope of the line indicates the node degree.

¹ http://www.amazon.com, accessed on 15th January 2008
² http://en.wikipedia.org/wiki/Information, accessed on 15th January 2008
Now, for calculating the Entropy of a node as a function of its depth, a further assumption is required. We assume that a node name is a function of its depth. In other words, all nodes at a depth d have the node name V_d. The assumption is representative because most HTML tags, by their function, are suitable at a certain depth in the tree. For instance, the <a> tags, meant to define hyperlinks, appear close to the leaf nodes. Similarly, the <table> tags, often used for defining regions of a web page, appear much closer to the root node. We believe that this tendency of HTML tags to appear at certain depths only gives rise to decreasing Entropy with decreasing depth, and ultimately with increasing size.

Now, at depth H, all the nodes are leaf nodes. Consequently, the Entropy of a node at this depth is undefined. At depth (H-1) however, a node has g child nodes, each with the name V_H. Hence, the set of node names can be represented as <V_H> and its corresponding frequencies as <g>. Consequently, from (i) and (ii), the Entropy for a node at this depth is

$$E_{H-1} = -\frac{g}{g} \log\left(\frac{g}{g}\right) = 0$$

Similarly, if we ascend one more level to depth (H-2), the frequencies are <g, g^2>, corresponding to the names <V_{H-1}, V_H>. And now the Entropy of a node at this depth is

$$E_{H-2} = -\left[\frac{g}{g+g^2} \log\left(\frac{g}{g+g^2}\right) + \frac{g^2}{g+g^2} \log\left(\frac{g^2}{g+g^2}\right)\right]$$

From the above, it is clear that for any node at depth d, the Entropy can be expressed as

$$E_d = -\sum_{i=1}^{H-d} \frac{g^i}{g+\ldots+g^{H-d}} \log\left(\frac{g^i}{g+\ldots+g^{H-d}}\right), \quad d < H$$

Using the Geometric Progression relation

$$g + g^2 + \ldots + g^m = \frac{g(1-g^m)}{1-g}$$

the node Entropy simplifies to

$$E_d = -\sum_{i=1}^{H-d} \frac{g^{i-1}(1-g)}{1-g^{H-d}} \log\left(\frac{g^{i-1}(1-g)}{1-g^{H-d}}\right) \quad \ldots (iv)$$

Upon evaluating the equations (iii) and (iv) for various values of d and g, we find that the node Size-Entropy relation appears close to linear. The results are shown in figure 4. We assumed a constant document size of 2000 words to compute the plot. Also, the slope of the line is indicative of the degree of the nodes in the tree.

[Figure: Node Size (log scale) against Node Entropy (0.3 to 1.0) for the synthetic tree, with one line per node degree, g = 2, 2.5, 3 and 4.]
Fig. 4. The feature space for the synthetic tree

Note that while most real-world DOM trees do not exhibit a constant degree (an assumption that was made in the foregoing analysis), we are only explaining their tendencies. If a DOM tree did have a constant degree, then both features would be strongly correlated, rendering the information contributed by one of the features redundant.
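The following throwaway sketch evaluates equations (iii) and (iv) numerically. Note one assumption on our part: we additionally apply the log|V| normalization of equation (ii), so that the values fall in the [0, 1] range plotted in figure 4.

public class SyntheticTree {
    // Equation (iii): at depth d, the content size is g^(H-d).
    static double sizeAt(int H, double g, int d) {
        return Math.pow(g, H - d);
    }

    // Equation (iv), using the name frequencies <g, g^2, ..., g^(H-d)>,
    // normalized here by log|V| as in equation (ii) (our assumption).
    static double entropyAt(int H, double g, int d) {
        int k = H - d;                       // |V|: number of distinct names
        if (k <= 1) return 0.0;              // matches E_{H-1} = 0
        double total = 0.0;                  // g + g^2 + ... + g^k
        for (int i = 1; i <= k; i++) total += Math.pow(g, i);
        double e = 0.0;
        for (int i = 1; i <= k; i++) {
            double p = Math.pow(g, i) / total;
            e -= p * Math.log(p);
        }
        return e / Math.log(k);
    }

    public static void main(String[] args) {
        int H = 10;
        for (double g : new double[] {2, 2.5, 3, 4}) {
            System.out.println("degree g = " + g);
            for (int d = 0; d < H; d++)      // one line per depth
                System.out.printf("  d=%d  log(size)=%6.2f  entropy=%.3f%n",
                        d, Math.log(sizeAt(H, g, d)), entropyAt(H, g, d));
        }
    }
}

Plotting log(size) against the entropy so obtained reproduces the near-linear, degree-dependent trend of figure 4.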
In summary, we suggest that there is a correlation between the Size and Entropy of nodes that depends on the node degree. In other words, for a given tree, a node can have a higher or lower Entropy by virtue of its size. Hence, if the Size-Entropy transfer function can be learned or otherwise estimated, then it can serve to distinguish between page segment nodes and other nodes. Based on this observation, we developed a segmentation algorithm, presented in the next section.

IV. WEB PAGE SEGMENTATION

The segmentation algorithm developed herein is intended as a demonstration of the usefulness of the DOM tree characterization. It exemplifies one of several potential applications of the characterization.

A. Data Cleaning

The first step in segmentation is to parse the HTML into a DOM tree. Since the HTML language does not enforce well-formedness of its tags, it gives rise to ambiguity in interpreting the tree structure. Web browsers have built-in rules to handle the situation. For instance, they assume that any open <td> tag (which marks the beginning of a table cell) is implicitly closed when the next <tr> tag is encountered. Software libraries that efficiently encode such rules are available in the open source domain³, and can be leveraged to obtain a DOM tree from any HTML page.

B. Node Feature Computation

Using a depth-first traversal of the tree, each node computes a frequency table of the node names in its subtree. The Entropy of the node is then computed from the frequency table. Node sizes are also simultaneously computed. The recursive nature of the tree structure is leveraged by computing the features based on the calculations at the child nodes.

³ http://htmlcleaner.sourceforge.net/
C. Segmentation Algorithm

Based on the feature space introduced in figure 2, the aim of the segmentation algorithm is to prune the filtered DOM tree so that the leaf nodes in the resultant tree represent page segments. The segmentation algorithm traverses down the tree till it encounters a node which satisfies pre-specified constraints on node size and entropy. The tree is then pruned at this node, and the subtree under the node is not traversed. The tree traversal continues at the other nodes. In our implementation, the constraints on the node size and entropy are viewed as a region, called the target region, in the feature space (see figure 5).

Now, it is possible that the algorithm ends up at a leaf node without encountering any node in the target region. In such cases we use various heuristics to prune the tree. In particular, if a node has already been identified as a page segment node, then all its siblings belonging to the same parent can also be treated as page segment nodes. Thus, sibling nodes of an already identified page segment node are considered as page segment nodes.
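A compact sketch of this traversal appears below. It is illustrative only: the record shape, the threshold constants and the exact form of the sibling heuristic are our assumptions, not the paper's tuned implementation.

import java.util.ArrayList;
import java.util.List;

public class Segmenter {
    // A node with its precomputed features from section IV-B.
    record Node(String name, int size, double entropy, List<Node> children) {}

    // The target region of figure 5: placeholder size and entropy bounds.
    static boolean inTargetRegion(Node v) {
        return v.size() >= 10 && v.size() <= 500 && v.entropy() >= 0.8;
    }

    // Top-down traversal: prune (emit a segment) at the first node that
    // falls in the target region; otherwise keep descending.
    static void segment(Node v, List<Node> segments) {
        if (inTargetRegion(v)) {
            segments.add(v);                 // prune; do not enter the subtree
            return;
        }
        boolean siblingIsSegment = false;
        for (Node c : v.children()) {
            if (siblingIsSegment) {          // sibling heuristic: once one child
                segments.add(c);             // is a segment, so are the rest
            } else {
                int before = segments.size();
                segment(c, segments);
                siblingIsSegment = segments.size() == before + 1
                        && segments.get(before) == c;   // c itself was pruned
            }
        }
    }

    public static void main(String[] args) {
        Node nav = new Node("ul", 40, 0.95, List.of());
        Node story = new Node("div", 120, 0.85, List.of());
        Node root = new Node("body", 160, 0.60, List.of(nav, story));
        List<Node> segments = new ArrayList<>();
        segment(root, segments);
        segments.forEach(s -> System.out.println(s.name()));  // prints ul, div
    }
}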
[Figure: the feature space with a separating line; the target region is bounded by upper and lower Node Size thresholds and extends towards the point <1, 0>, while the root node lies in the upper left.]
Fig. 5. Target region for page segment nodes

V. EXPERIMENTAL SETUP

In order to experiment with various node features and evaluate the effectiveness of our approach, a Java implementation of the algorithm was built. The implementation analyzed input HTML pages and generated output HTML by marking the page segments using CSS techniques. The output HTML page, when viewed in a browser, visually shows the page segments marked in the form of dashed-line boundaries.

In order to have a representative mix of web pages to evaluate on, we randomly sampled page URLs from the Open Directory Initiative, a directory of websites. We stratified our random sampling process to ensure that our test corpus had representation from categories such as index pages, content pages, blogs, news, shopping, entertainment and sports. The performance of segmentation was manually evaluated, by human judgement, in terms of precision and recall values. Due to the labor intensive nature of the evaluation process, we were able to evaluate a total of only 400 web pages.

VI. RESULTS

The segmentation algorithm is effective. It exhibited around 90% precision at 80% recall. A sample output of the segmentation algorithm appears in figure 6. From the figure, it appears that two segments, including the "Sponsored Advertisements" segment, have not been identified by the algorithm. On closer inspection, we found that both these segments comprised content generated by web browser scripts. Hence, they were unavailable as nodes in the DOM tree.

[Figure: a screenshot of the segmented page.]
Fig. 6. Segmented page from the about.com page. Our implementation creates an HTML output containing thick dashed lines around the identified page segments.

Also, the feature space for the about.com page appears in figure 7. The solid rectangular points in figure 7 correspond to the dashed segments in figure 6. The figure illustrates the ability of the algorithm to identify seemingly non-separable page segment nodes.

Other than the issue of browser scripts, we found that segmentation is difficult for article pages that have minimal formatting. Such pages⁴ had one-level, "flat" DOM trees, wherein the root node had a large number of leaves (Text nodes) corresponding to each paragraph of text in the page.

Finally, in order to quantify the benefit of using the Entropy feature for page segmentation, we applied our algorithm using the node Size feature alone. We found that the algorithm exhibited surprisingly good performance: it achieved a precision and recall of 80% and 75% respectively. In this case, errors often arose when an entire collection of page segments met the one-dimensional target size criteria, after which the algorithm did not descend the tree further (to the actual page segment nodes).

⁴ See http://beust.com/belize-200405.html for an example
[Figure: a scatter plot of Size (log scale, 1 to 500) against Entropy (0.6 to 1.0) for the about.com page.]
Fig. 7. Feature space for the about.com example

VII. SUMMARY AND CONCLUSIONS

In this paper, we defined two node features, Content Size and Entropy, which intuitively represent orthogonal properties of a page segment. Based on the feature space defined by these two properties, we also developed an algorithm for page segmentation. The Ascend stage of the algorithm is particularly useful in correcting errors in page segmentation. A supervised version of the algorithm can be easily conceived in order to improve its performance.

The chief drawback of the algorithm arises from the assumption that every page segment corresponds to a single node in the DOM tree. In cases wherein web pages are organized in a flat hierarchy, the assumption causes our method to fail. A clustering based approach that allows merging a subset of sibling nodes might resolve the issue.

REFERENCES

[1] N. Kushmerick, "Learning to remove internet advertisements," in AGENTS '99. ACM, 1999, pp. 175–181.
[2] S. Baluja, "Browsing on small screens: recasting web-page segmentation into an efficient machine learning framework," in WWW '06. ACM, 2006, pp. 33–42.
[3] Y. Chen, W.-Y. Ma, and H.-J. Zhang, "Detecting web page structure for adaptive viewing on small form factor devices," in WWW '03. ACM, 2003, pp. 225–233.
[4] G. Hattori, K. Hoashi, K. Matsumoto, and F. Sugaya, "Robust web page segmentation for mobile terminal using content-distances and page layout information," in WWW '07. ACM, 2007, pp. 361–370.
[5] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, "Block-based web search," in ACM SIGIR '04. ACM, 2004, pp. 456–463.
[6] C. W. G. Z. G. Xu, "A web page segmentation algorithm for extracting product information," in IEEE Conference on Information Acquisition, Aug. 2006, pp. 1374–1379.
[7] S. Chakrabarti, M. Joshi, and V. Tawde, "Enhanced topic distillation using text, markup tags, and hyperlinks," in SIGIR '01. ACM, 2001, pp. 208–216.
[8] N. Kushmerick, "Wrapper induction for information extraction," Ph.D. dissertation, 1997, chairperson: Daniel S. Weld.
[9] J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and R. Aranha, "Extracting semistructured information from the web," in Proceedings of the Workshop on Management of Semistructured Data, 1997.
[10] A. Arasu and H. Garcia-Molina, "Extracting structured data from web pages," in ACM SIGMOD '03. ACM, 2003, pp. 337–348.
[11] V. Crescenzi, G. Mecca, and P. Merialdo, "RoadRunner: Towards automatic data extraction from large web sites," in VLDB '01, 2001, pp. 109–118.
[12] K. Lerman, L. Getoor, S. Minton, and C. Knoblock, "Using the structure of web sites for automatic segmentation of tables," in ACM SIGMOD '04. ACM, 2004, pp. 119–130.
[13] R. Baumgartner, S. Flesca, and G. Gottlob, "Visual web information extraction with Lixto," in VLDB '01, 2001, pp. 119–128.
[14] Z. Liu, W. K. Ng, F. Li, and E.-P. Lim, "A visual tool for building logical data models of websites," in CIKM Workshop on WIDM '02, McLean, Virginia, USA, November 8, 2002.
[15] S.-H. Lin and J.-M. Ho, "Discovering informative content blocks from web documents," in ACM SIGKDD '02. ACM, 2002, pp. 588–593.
[16] S. Debnath, P. Mitra, N. Pal, and C. L. Giles, "Automatic identification of informative sections of web pages," IEEE Trans. on Knowledge and Data Engg., vol. 17, no. 9, pp. 1233–1246, 2005.
[17] L. Yi, B. Liu, and X. Li, "Eliminating noisy information in web pages for data mining," in ACM SIGKDD '03. ACM, 2003, pp. 296–305.
[18] Z. Bar-Yossef and S. Rajagopalan, "Template detection via data mining and its applications," in WWW '02. ACM, 2002, pp. 580–591.
[19] K. Vieira, A. S. da Silva, N. Pinto, E. S. de Moura, J. M. B. Cavalcanti, and J. Freire, "A fast and robust method for web page template detection and removal," in CIKM '06. ACM, 2006, pp. 258–267.
[20] D. Gibson, K. Punera, and A. Tomkins, "The volume and evolution of web page templates," in Poster at WWW '05. ACM, 2005, pp. 830–839.
[21] D. Chakrabarti, R. Kumar, and K. Punera, "Page-level template detection via isotonic smoothing," in WWW '07. ACM, 2007, pp. 61–70.
[22] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, "VIPS: a vision-based page segmentation algorithm," Microsoft Research, Tech. Rep. MSR-TR-2003-79, 2003.
[23] R. Song, H. Liu, J.-R. Wen, and W.-Y. Ma, "Learning block importance models for web pages," in WWW '04. ACM, 2004, pp. 203–211.
[24] L. Ramaswamy, A. Iyengar, L. Liu, and F. Douglis, "Automatic detection of fragments in dynamically generated web pages," in WWW '04. ACM, 2004, pp. 443–454.
[25] H.-Y. Kao, J.-M. Ho, and M.-S. Chen, "WISDOM: Web intrapage informative structure mining based on document object model," IEEE Trans. on Knowledge and Data Engg., vol. 17, no. 5, pp. 614–627, 2005.
