E Mine
Definition 4:
A container is a superset of the data region that may also contain
irrelevant data. For example, the irrelevant data in a container may
include advertisements at the bottom of the page, followed by search
bars or links to other sites. Fig. 5 shows the container identified
from the web page shown in Fig. 3.
The procedure getContainer identifies the container in a web page, that
is, the region holding the relevant data region along with some
irrelevant data. It is as follows:
Procedure getContainer
Input: The Largest Rectangle out of all Bounding Rectangles.
List_of_Children = depth first listing of all the children of the
    tag associated with Maximum Rectangle.
for each tag in List_of_Children

Fig 6: The extracted regions from the container shown in Fig 5. The
irrelevant portion to be filtered is highlighted.

4.4 Identification of data region containing data records within the
container
To remove the irrelevant data from the container, we use a filter. The
filter determines the average height of the children within the
container. Those children whose heights are less than the average
height are identified as irrelevant and are discarded. Fig. 7 shows the
filter applied to the container in Fig. 6 in order to obtain the data
region.

Fig 7: Data region obtained after filtering the container in Fig 6.

Thus, the eMine technique, as described above, is able to mine the
relevant data region containing data records from the given web page
efficiently.
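The height-based filter of this section can be sketched in Python as follows. This is only a minimal illustration, assuming the heights of the children's bounding rectangles have already been obtained from a renderer; the `Child` type, the function name `filter_container`, and the sample tags and heights are hypothetical and do not come from the eMine implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Child:
    """A child tag of the container (hypothetical type for illustration)."""
    tag: str       # illustrative label only
    height: float  # height of the child's bounding rectangle, e.g. in pixels


def filter_container(children: List[Child]) -> List[Child]:
    """Discard children whose height is below the average child height."""
    if not children:
        return []
    average_height = sum(c.height for c in children) / len(children)
    # Children strictly shorter than the average are treated as irrelevant.
    return [c for c in children if c.height >= average_height]


# Assumed sample container: one tall result block and three short extras.
container = [
    Child("div#results", 900.0),
    Child("div#ad", 60.0),
    Child("form#search", 40.0),
    Child("p#links", 30.0),
]
data_region = filter_container(container)
print([c.tag for c in data_region])  # prints ['div#results']
```

In this sample the average height is 257.5, so only the tall result block survives. Note that a container whose children all have the same height is left unchanged, because no child is strictly below the average.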
The procedure Filter removes the irrelevant data from the container and
gives the actual data region as the output. It is as follows:
Procedure Filter
Input: The container obtained from the previous step.
totalHeight = 0
for each child tag within container
    totalHeight += height of the bounding rectangle of child
end for
averageHeight = totalHeight / no of children of container
for each child within container
    if height of child’s bounding rectangle < averageHeight
    then Discard child from container
    endif
end for

5. MDR Vs eMine
In this section we evaluate the proposed technique and compare it with
MDR. The evaluation consists of three aspects, discussed in the
following:
1. Data Region Extraction:
We compare the first step of MDR with our system for identifying the
data regions. MDR depends on certain tags, such as <table> and <tbody>,
for identifying the data region. However, a data region need not always
be contained within such tags. It may also be contained within tags
other than table-related tags, such as <P>, <li>, <form>, etc. In the
proposed eMine system, the data region identification is independent of
specific tags and forms. Unlike MDR, where an incorrect tag tree may be
constructed due to the misuse of HTML tags, there is no such
possibility of erroneous tag tree construction in eMine, because the
hierarchy of tags is constructed based on the visual cues on the web
page. In MDR, the entire tag tree needs to be scanned in order to mine
data regions, whereas eMine scans only the largest child of the <body>
tag. Hence, this improves the time complexity compared to MDR.
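To illustrate why restricting the scan to the largest child of the <body> tag is cheap, the sketch below selects that child in a single pass. It assumes that "largest" means the largest bounding-rectangle area and that the rectangle sizes are already known; the `Rect` type, the function name `largest_child`, and the sample values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rect:
    """Bounding rectangle of a tag (hypothetical type for illustration)."""
    tag: str
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def largest_child(children: List[Rect]) -> Optional[Rect]:
    """Return the child with the largest bounding-rectangle area, or None."""
    return max(children, key=lambda r: r.area, default=None)


# Assumed sample children of <body>.
body_children = [
    Rect("div#header", 1000.0, 80.0),
    Rect("div#main", 1000.0, 1200.0),
    Rect("div#footer", 1000.0, 60.0),
]
print(largest_child(body_children).tag)  # prints div#main
```

Only the subtree under the selected child would then be examined further, rather than every node of the tag tree.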
2. Data Record Extraction:
MDR identifies the data records based on keyword search (e.g. “$”).
eMine, by contrast, does not make use of any text or content mining.
This is advantageous, as it avoids the additional overhead of
performing a keyword search on the web page. Moreover, MDR not only
identifies the relevant data region containing the search result
records but also extracts records from all the other sections of the
page, e.g. advertisement records, which are irrelevant.
In MDR, the comparison of generalized nodes is based on string
comparison using the normalized edit distance method. This method is
slow compared to eMine, where the comparison is purely numeric. The
numeric comparison scales well across web pages.

3. Overall Time Complexity
The existing algorithm MDR has a complexity of the order O(nk), without
considering string comparison, where n is the total number of nodes in
the tag tree and k is the maximum number of tag nodes that a
generalized node can have (normally a small number, less than 10). Our
algorithm eMine has a complexity of the order O(n), where n is the
number of tag comparisons made.

6. Conclusion
In this paper, we have proposed a new approach to extract structured
data from web pages. Although the problem has been studied by several
researchers, existing techniques make many strong assumptions. eMine is
a purely visual, structure-oriented method that can correctly identify
the data regions. Most of the current algorithms fail to correctly
determine the data region when the data region consists of only one
data record. Also, most of the approaches fail in the case where a
series of data records is separated by an advertisement, followed again
by a single data record. eMine works correctly in this case. Further,
the comparisons are made on numbers, unlike other methods where strings
or trees are compared. Thus eMine overcomes the drawbacks of existing
methods and performs significantly better than them.

7. Scope for future work:
Extraction of the data fields from the data records contained in the
mined data regions can be considered in future work, also taking into
account complexities such as web pages featuring dynamic HTML. The
extracted data can be put in a suitable format and eventually stored in
a relational database. Data extracted from each web page can then be
integrated into a single collection. This collection can be further
used for various knowledge discovery applications, e.g., making a
comparative study of products from various companies, smart shopping,
etc.

References:
[1] Bing Liu, Robert Grossman, Yanhong Zhai, Mining Web Pages for Data
Records.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques.
[3] Arun K. Pujari, Data Mining Techniques.
[4] Pieter Adriaans, Dolf Zantinge, Data Mining.
[5] George M. Marakas, Modern Data Warehousing, Mining, and
Visualization: Core Concepts, 2003.
[6] J. Hammer, H. Garcia-Molina, J. Cho, and A. Crespo, Extracting
semi-structured information from the web.
[7] A. Arasu, H. Garcia-Molina, Extracting structured data from web
pages.