
Wrapper Learning Algorithm

Web mining aims to discover useful information from web pages, links, and usage data. There are three types of web mining: web usage mining, which involves analyzing user interactions on websites; web content mining, which extracts and integrates useful data from web page contents using techniques like wrappers and landmarks; and web structure mining, which analyzes the link structure of websites.

Uploaded by

rob amiel
Copyright
© Attribution Non-Commercial (BY-NC)

Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data.

Types of Web Mining:

Web Usage Mining
Web Content Mining
Web Structure Mining

Web Content Mining

The mining, extraction, and integration of useful data, information, and knowledge from Web page contents.
Wrapper: a program for extracting structured data.

Extraction from page


A Web page can be seen as a sequence of tokens (e.g., words, numbers, and HTML tags). Extraction is done using a tree structure called the EC tree (embedded catalog tree), which models how the data is embedded in an HTML page. Each node is extracted using two rules, the start rule and the end rule. The start rule identifies the beginning of the node, and the end rule identifies the end of the node.
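The token view of a page can be illustrated with a minimal Python sketch. The tokenization pattern, the `tokenize` function, and the `ECNode` class are assumptions made for illustration; the slides do not specify how tokens are split or how an EC tree node is represented.

```python
import re
from dataclasses import dataclass, field

# Assumed token definition: an HTML tag, a run of word characters,
# or any single non-space symbol.
TOKEN_RE = re.compile(r"</?[^>]+>|\w+|[^\w\s]")

def tokenize(page):
    """Split a page into a flat sequence of tokens."""
    return TOKEN_RE.findall(page)

@dataclass
class ECNode:
    """Hypothetical EC tree node: a named item with its two extraction rules."""
    name: str
    start_rule: list          # landmark tokens that locate the beginning
    end_rule: list            # landmark tokens that locate the end
    children: list = field(default_factory=list)

page = "Name: Joels <p> Phone: <i> (310) 777-1111 </i><p>"
tokens = tokenize(page)
phone = ECNode("Phone", start_rule=["<i>"], end_rule=["</i>"])
```

With this pattern, the sample line splits into 15 tokens, with `<p>`, `<i>`, and `</i>` each kept as a single token.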

Extraction from page


The extraction rules are based on the idea of landmarks. A landmark is a sequence of consecutive tokens that is used to locate the beginning or the end of a target item.

Sample
Extract the phone number from the following HTML code.
Name: Joels <p> Phone: <i> (310) 777-1111 </i><p>

R1: SkipTo(<i>)
This rule means that the system should start from the beginning of the page and skip all tokens until it sees the first <i> tag. Here <i> is a landmark.

Similarly, to identify the end of the text to be extracted, we can use:
R2: SkipTo(</i>)
R1 is called the start rule and R2 is called the end rule.
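Rules R1 and R2 can be tried out with a short Python sketch. The helper names `tokenize`, `skip_to`, and `extract` are hypothetical, as is the tokenization pattern. The slides do not say from where the end rule scans; this sketch assumes it scans forward from the point where the start rule stopped.

```python
import re

# Assumed token definition: an HTML tag, a word, or a single symbol.
TOKEN_RE = re.compile(r"</?[^>]+>|\w+|[^\w\s]")

def tokenize(page):
    """Return (text, start, end) triples so the original substring can be recovered."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(page)]

def skip_to(tokens, landmark, start=0):
    """Return the token index just past the first match of landmark, or None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if [t[0] for t in tokens[i:i + n]] == landmark:
            return i + n
    return None

def extract(page, start_landmark, end_landmark):
    """Apply a start rule and an end rule; return the text between the landmarks."""
    tokens = tokenize(page)
    begin = skip_to(tokens, start_landmark)            # R1: SkipTo(<i>)
    if begin is None:
        return None
    after_end = skip_to(tokens, end_landmark, begin)   # R2: SkipTo(</i>)
    if after_end is None:
        return None
    end = after_end - len(end_landmark)  # first token of the end landmark
    if begin == end:
        return ""                        # landmarks are adjacent, nothing between them
    return page[tokens[begin][1]:tokens[end - 1][2]]

page = "Name: Joels <p> Phone: <i> (310) 777-1111 </i><p>"
phone = extract(page, ["<i>"], ["</i>"])
print(phone)  # (310) 777-1111
```

A landmark is passed as a list of tokens, so a multi-token landmark such as `["Phone", ":", "<i>"]` works the same way and returns the same phone number on this page.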

