E Mine

A large collection of documents, images, text files and other forms of data in structured, semi structured and unstructured forms are available on the web. It has become increasingly difficult to identify relevant pieces of information since the pages are often cluttered with irrelevant content. This paper proposes a novel and an effective method, eMine, to mine the Data Region from a web page automatically.

Uploaded by

mycatalysts
Copyright
© Attribution Non-Commercial (BY-NC)

E-MINE: A NOVEL WEB MINING APPROACH

ABSTRACT

In recent years, government agencies and industrial enterprises have been using the web as a medium of publication. Hence, a large collection of documents, images, text files and other forms of data in structured, semi-structured and unstructured forms is available on the web. It has become increasingly difficult to identify relevant pieces of information, since pages are often cluttered with irrelevant content such as advertisements and copyright notices surrounding the main content. Thus, we propose a technique that mines the relevant data regions from a web page. This technique is based on three important observations about data regions on the web.

Introduction

Extracting regularly structured data records from web pages is an important problem, and several attempts have been made to deal with it. The main disadvantage of the existing automatic approaches is their assumption that the relevant information of a data record is contained in a contiguous segment of HTML code, which is not always true. Thus, we propose a more effective method to mine the data region in a web page. The algorithm, eMine, finds the data regions formed by all types of tags using visual cues.

Related Work

Related work, mainly in the area of mining data records in a web page, is MDR (Mining Data Records). MDR is a well-known approach which exploits the regularities in the HTML tag structure directly. The MDR algorithm makes use of the HTML tag tree of the web page to extract data records from the page. However, an incorrect tag tree may be constructed due to the misuse of HTML tags, which in turn makes it impossible to extract data records correctly.

MDR automatically mines all data records formed by table- and form-related tags, i.e., <TABLE>, <FORM>, <TR>, <TD>, etc., assuming that a large majority of web data records are formed by them. It has several other limitations, which are discussed in the latter half of this paper. The algorithm is based on two observations:
(a) A group of data records is always presented in a contiguous region of the web page and is formatted using similar HTML tags. Such a region is called a Data Region.
(b) The nested structure of the HTML tags in a web page usually forms a tag tree, and a set of similar data records is formed by some child sub-trees of the same parent node.
The MDR system is freeware and can be downloaded at:
http://www.cs.uic.edu/~liub/WebDataExtraction/MDR-download.html

The Proposed Technique

We propose a novel and effective method, eMine, to mine the data region from a web page automatically. The basic criterion that eMine uses is the location on the screen at which each tag is rendered, i.e., visual information.
This visual information helps the system in three ways:
a) It enables the system to identify the gaps that separate records, which helps to segment data records correctly, because the gaps within a data record (if any) are typically smaller than the gaps between data records.
b) The visual information also contains information about the hierarchical structure of the tags.
c) By observing a web page, it can be seen that the relevant data region occupies the major central part of the page.
The system model of the eMine technique is shown in Fig. 1.

Fig. 1: System model

It consists of the following three components:
* Largest Rectangle Identifier
* Container Identifier
* Filter
The output of each component is the input of the next component.
The eMine technique is based on three observations:
a) A group of data records that contain descriptions of a set of similar objects is typically presented in a contiguous region of a page.
b) The area covered by the rectangle that bounds the data region (see Definition 1 below) is larger than the area covered by the rectangles bounding other regions, e.g., advertisements and links.
c) The height of an irrelevant data record within a collection of data records is less than the average height of the relevant data records within that region.
Definition 1: A data region is defined as the most relevant portion of a web page.
Definition 2: A data record is defined as a collection of data. It is a meaningful independent entity. E.g., a product listed inside a data region on a product-related web site is a data record.
Fig. 2 illustrates an example, a segment of a web page (www.amazon.com) that shows a data region. The full description of each book is a data record.
International Conference on Systemics,
Cybernetics and Informatics
Fig 2: An example of a page showing a data region and data records (shown from eMine.exe)

Definition 3: For each tag, there exists an associated rectangular area on the screen. This rectangle is called the bounding rectangle for the particular tag.

The overall algorithm of the proposed technique is as follows:
Algorithm eMine
Input: The HTML source of the web page.
1 Determine the height and width of all the bounding rectangles in the HTML document.
2 Calculate the areas of all the bounding rectangles.
3 Identify the Maximum Rectangle from all the bounding rectangles.
4 Identify the container within the Maximum Rectangle obtained from step 3.
5 Identify the data region in the container obtained from step 4.
6 Filter the data region obtained after step 5 to remove some more irrelevant data.

4. How the Algorithm works?
The algorithm takes the HTML source of the web page as input. Steps 1 and 2 scan the HTML document for tags and determine the height and width of every bounding rectangle, which gives the area of each bounding rectangle. Step 3 finds the largest rectangle out of all the bounding rectangles. Step 4 identifies the container, which holds most of the relevant data region (and also some irrelevant regions). Steps 5 and 6 identify the actual relevant data region by filtering out the irrelevant regions.
The following sections provide more details about the individual modules associated with the algorithm.

4.1 Determining the height and width of all bounding rectangles
In the first step of the proposed technique, we determine the dimensions of all the bounding rectangles in the web page. A <table> tag in a web page is typically associated with specific height and width attributes, which we extract. If they are not specified, the MSHTML parsing and rendering engine of Microsoft Internet Explorer 6.0 can be used; this engine gives us the coordinates of a bounding rectangle. We scan the HTML file for tags. For each tag encountered, we determine the coordinates of the bounding rectangle of the corresponding tag and plot it.
Fig. 3 shows a sample web page of a product-related website, which contains a number of books and their descriptions; these form the data records inside the data region.

Fig 4: Bounding rectangles for <TD> tags corresponding to the web page in Fig. 3
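
The attribute-extraction part of Section 4.1 can be illustrated with a minimal Python sketch. It is not the authors' eMine.exe implementation: it uses Python's standard html.parser module instead of MSHTML, and the names DimensionCollector and collect_dimensions are hypothetical. It only reads explicit pixel width and height attributes; tags without them are listed separately, since they would still need a rendering engine to obtain their bounding rectangles.

```python
from html.parser import HTMLParser


class DimensionCollector(HTMLParser):
    """Collects tags that declare both width and height in pixels."""

    def __init__(self):
        super().__init__()
        self.rects = []       # (tag, width, height) for tags with explicit sizes
        self.unresolved = []  # tags whose size would need a rendering engine

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        width = self._pixels(attr_map.get("width"))
        height = self._pixels(attr_map.get("height"))
        if width is not None and height is not None:
            self.rects.append((tag, width, height))
        else:
            self.unresolved.append(tag)

    @staticmethod
    def _pixels(value):
        # Accept plain pixel counts such as "600" or "600px".
        # Relative sizes such as "80%" cannot be resolved without rendering.
        if value is None:
            return None
        value = value.strip().lower()
        if value.endswith("px"):
            value = value[:-2]
        return int(value) if value.isdigit() else None


def collect_dimensions(html_source):
    """Return (rects, unresolved) for the given HTML source string."""
    collector = DimensionCollector()
    collector.feed(html_source)
    return collector.rects, collector.unresolved


sample = '<table width="600" height="400"><tr><td width="50%">Book</td></tr></table>'
rects, unresolved = collect_dimensions(sample)
areas = [(tag, width * height) for tag, width, height in rects]
```

In this sample only the <table> tag yields a rectangle; the <tr> and <td> tags end up in the unresolved list, which is where a rendering engine would have to supply the coordinates.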

Fig 3: A sample web page of a product-related website, shown in eMine.exe

Fig. 4 shows the bounding rectangles for the <td> tags of the web page shown in Fig. 3.

4.2 Identification of the largest rectangle
Based on the height and width of the bounding rectangles obtained in the previous step, we determine the area of the bounding rectangle of each child of the <body> tag. We then determine the largest rectangle among these bounding rectangles. The reason for doing this is the assumption, which we consider sensible, that the largest bounding rectangle will always contain the most relevant data in that web page. The procedure followed to accomplish this task is as follows:
Procedure getMaxRect
Input: <body> of the HTML source
for each child of <body> tag
begin
  Find the coordinates of the bounding rectangle for the child
  If the area of the bounding rectangle > area of Maximum Rectangle
  then Maximum Rectangle = child
  endif
end
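
The getMaxRect procedure above can be sketched in Python as follows. This is an illustrative sketch only: the Rect class and its field names are assumptions introduced here, and the coordinates are taken as already known (for example from a rendering engine) rather than computed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rect:
    """Bounding rectangle of one tag (hypothetical structure for illustration)."""
    tag: str
    left: int
    top: int
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


def get_max_rect(children: Iterable[Rect]) -> Optional[Rect]:
    """Return the child of <body> whose bounding rectangle has the largest area."""
    max_rect = None
    for child in children:
        # Strict comparison, as in the procedure: the first of equal areas wins.
        if max_rect is None or child.area > max_rect.area:
            max_rect = child
    return max_rect


# Made-up layout: a header strip, a large central table and a footer strip.
body_children = [
    Rect("div", 0, 0, 800, 90),
    Rect("table", 0, 90, 800, 1200),
    Rect("div", 0, 1290, 800, 60),
]
largest = get_max_rect(body_children)
```

With these made-up numbers the central table is selected, which matches the observation that the relevant data region occupies the major central part of the page.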

4.3 Identification of the container within the largest rectangle
Once we have obtained the largest rectangle, we form a set of all the bounding rectangles contained in it. The rationale behind this is that the most important data of the web page must occupy a significant portion of the web page. Again, we determine the bounding rectangle having the largest area in the set. The reason for determining the largest rectangle within this set is that only the largest rectangle will contain data records. Thus a container (see Definition 4 below) is obtained which 'holds' the data region and possibly also some irrelevant data.

Definition 4:
A container is a superset of the data region which may or may not contain irrelevant data. For example, the irrelevant data contained in the container may include advertisements at the bottom of the page, followed by search bars or links to other sites. Fig. 5 shows the container identified from the web page shown in Fig. 3.

Fig 5: The container within the Largest Rectangle identified from the sample web page in Fig 3

The procedure getContainer identifies the container in the web page, which contains the relevant data region along with some irrelevant data. It is as follows:

Procedure getContainer
Input: The Largest Rectangle out of all Bounding Rectangles.
List_of_Children = depth-first listing of all the children of the tag associated with Maximum Rectangle.
for each tag in List_of_Children
begin
  if area of bounding rectangle of the tag > half the area of Maximum Rectangle
  then container = tag
  endif
end

Fig. 6 shows the regions extracted from the container shown in Fig. 5. We note that there is some irrelevant data at the bottom of the actual data region containing the data records.

Fig 6: The extracted regions from the container shown in Fig 5. The irrelevant portion to be filtered is highlighted.

4.4 Identification of data region containing data records within the container
To remove the irrelevant data from the container, we use a filter. The filter determines the average height of the children within the container. Those children whose heights are less than the average height are identified as irrelevant and are discarded. Fig. 7 shows a filter applied to the container in Fig. 6 in order to obtain the data region.

Fig 7: Data region obtained after filtering the container in Fig 6.

The procedure Filter filters the irrelevant data from the container and gives the actual data region as the output. It is as follows:
Procedure Filter
Input: The container obtained from the previous step.
totalHeight = 0
for each child tag within container
  totalHeight += height of the bounding rectangle of child
end for
averageHeight = totalHeight / no. of children of container
for each child within container
  if height of child's bounding rectangle < averageHeight
  then Discard child from container
  endif
end for

Thus, the eMine technique, as described above, is able to mine the relevant data region containing data records from the given web page efficiently.

5. MDR Vs eMine
In this section we evaluate the proposed technique and compare it with MDR. The evaluation consists of three aspects, discussed in the following:
1. Data Region Extraction:
We compare the first step of MDR with our system for identifying the data regions. MDR depends on certain tags such as <table>, <tbody>, etc. for identifying the data region. However, a data region need not always be contained within specific tags like <table> or <tbody>. A data region may also be contained within tags other than table-related tags, such as <P>, <li>, <form>, etc. In the proposed eMine system, data region identification is independent of specific tags and forms. Unlike MDR, where an incorrect tag tree may be constructed due to the misuse of HTML tags, there is no such possibility of erroneous tag tree construction in eMine, because the hierarchy of tags is constructed based on the visual cues on the web page. In MDR, the entire tag tree needs to be scanned in order to mine data regions, whereas eMine scans only the largest child of the <body> tag. Hence, this improves the time complexity compared to MDR.
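
To make the tag-independent data region identification concrete, the getContainer and Filter procedures of Sections 4.3 and 4.4 can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the Node class, its fields and the function names are hypothetical, bounding-rectangle sizes are assumed to be known already, and falling back to the Maximum Rectangle when no descendant exceeds half its area is a choice made here, since the procedure does not specify that case.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """A tag with its bounding-rectangle size and child tags (hypothetical)."""
    tag: str
    width: int
    height: int
    children: List["Node"] = field(default_factory=list)

    @property
    def area(self) -> int:
        return self.width * self.height


def depth_first(node: Node) -> Iterator[Node]:
    """Yield every descendant of node in depth-first (pre-order) order."""
    for child in node.children:
        yield child
        yield from depth_first(child)


def get_container(max_rect: Node) -> Node:
    """Follow getContainer: the last tag in the depth-first listing whose area
    exceeds half the area of the Maximum Rectangle becomes the container."""
    container = max_rect  # assumption: fall back to the Maximum Rectangle itself
    for tag in depth_first(max_rect):
        if tag.area > max_rect.area / 2:
            container = tag
    return container


def filter_region(container: Node) -> List[Node]:
    """Follow Filter: discard children shorter than the average child height."""
    if not container.children:
        return []
    total_height = sum(child.height for child in container.children)
    average_height = total_height / len(container.children)
    return [c for c in container.children if c.height >= average_height]


# Made-up sizes: three book records and one shorter advertisement block.
records = [Node("td", 800, 300), Node("td", 800, 320),
           Node("td", 800, 310), Node("td", 800, 170)]
listing = Node("tr", 800, 1100, records)
max_rect = Node("table", 800, 1200, [listing, Node("tr", 800, 100)])

container = get_container(max_rect)
data_region = filter_region(container)
```

Every comparison in this sketch is between numbers (areas and heights), and no tag name is inspected, which is the property the comparison with MDR relies on.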
2. Data Record Extraction:
MDR identifies the data records based on keyword search (e.g. "$"). eMine, in contrast, does not make use of any text or content mining. This proves to be very advantageous, as it avoids the additional overhead of performing keyword search on the web page. MDR not only identifies the relevant data region containing the search result records but also extracts records from all the other sections of the page, e.g. some advertisement records, which are irrelevant.
In MDR, comparison of generalized nodes is based on string comparison using the normalized edit distance method. However, this method is slow and inefficient compared to eMine, where the comparison is purely numeric. As a result, eMine scales well across web pages.

3. Overall Time Complexity
The existing algorithm MDR has a complexity of the order O(nk) without considering string comparison, where n is the total number of nodes in the tag tree and k is the maximum number of tag nodes that a generalized node can have (which is normally a small number, < 10). Our algorithm eMine has a complexity of the order of O(n), where n is the number of tag comparisons made.

6. Conclusion
In this paper, we have proposed a new approach to extract structured data from web pages. Although the problem has been studied by several researchers, existing techniques make many strong assumptions. eMine is a purely visual-structure-oriented method that can correctly identify the data regions. Most of the current algorithms fail to correctly determine the data region when the data region consists of only one data record. Also, most of the approaches fail in the case where a series of data records is separated by an advertisement, followed again by a single data record. eMine works correctly for the above case. Further, the comparisons are made on numbers, unlike other methods where strings or trees are compared. Thus eMine overcomes the drawbacks of existing methods and performs significantly better than existing methods.

7. Scope for future work:
Extraction of the data fields from the data records contained in these mined data regions can be considered in future work, also taking into account complexities such as web pages featuring dynamic HTML. The extracted data can be put in some suitable format and eventually stored back into a relational database. Thus, data extracted from each web page can then be integrated into a single collection. This collection of data can be further used for various knowledge discovery applications, e.g., making a comparative study of products from various companies, smart shopping, etc.

References:
[1] Bing Liu, Robert Grossman, Yanhong Zhai, Mining Web Pages for Data Records.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques.
[3] Arun K. Pujari, Data Mining Techniques.
[4] Pieter Adriaans, Dolf Zantinge, Data Mining.
[5] George M. Marakas, Modern Data Warehousing, Mining, and Visualization: Core Concepts, 2003.
[6] J. Hammer, H. Garcia-Molina, J. Cho, and A. Crespo, Extracting semi-structured information from the web.
[7] A. Arasu, H. Garcia-Molina, Extracting structured data from web pages.
