
Web Structure Mining

Web mining is the application of data mining techniques to the Web, notably in search engines. Data mining is the process of discovering useful knowledge from data sources. Web mining automatically discovers and extracts information from Web documents. Web structure mining discovers useful data from hyperlinks. Credit for this presentation goes to my friend Blessy; it is uploaded with her permission.
Copyright
© Attribution Non-Commercial (BY-NC)

WEB STRUCTURE MINING

SUBMITTED BY:
BLESSY JOHN
R7A
ROLL NO:18
INTRODUCTION
• Web mining is the application of data mining techniques to the Web, notably in search engines.
• Data mining is the process of discovering useful knowledge from data sources.
• Web mining automatically discovers and extracts information from Web documents.
• Web structure mining discovers useful data from hyperlinks.
WEB MINING
• Extraction of useful patterns from WWW resources
• The WWW is a widely distributed, global information service centre that constitutes a rich source for data mining
• Employs techniques from data mining, information retrieval, etc.
NEED FOR WEB MINING
• Aims at finding and extracting relevant information that is hidden in web-related data
• The challenge is to bring back the semantics of hypertext documents
• The goal is to turn web data into web knowledge
CLASSIFICATION
WEB MINING
• WEB CONTENT MINING
• WEB STRUCTURE MINING
• WEB USAGE MINING
WEB STRUCTURE MINING
• Generates a structural summary of the Web site and its Web pages
• Uses graph theory to analyse the node and connection structure of a web site
• Analyses the link structure of the web; its purpose is to identify more preferable documents
WEB STRUCTURE MINING cont.
• Discovers the nature of the hierarchy of hyperlinks in the website and its structure
• A hyperlink indicates the author's endorsement of the other web page
• Retrieves information about the relevance and the quality of the web page
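The graph view described above can be sketched in a few lines: pages are nodes, hyperlinks are directed edges, and the number of incoming links gives a first, rough signal of endorsement. This is a minimal illustration only; the three-page site and the helper name `degree_summary` are hypothetical and not part of the presentation.

```python
from collections import defaultdict

# Hypothetical site: each page maps to the pages it links to.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
}

def degree_summary(graph):
    """Return {page: (in_degree, out_degree)} for a directed link graph."""
    in_deg = defaultdict(int)
    for source, targets in graph.items():
        for target in set(targets):  # count each distinct link once
            in_deg[target] += 1
    pages = set(graph) | set(in_deg)
    return {p: (in_deg[p], len(set(graph.get(p, [])))) for p in pages}

summary = degree_summary(links)
for page in sorted(summary):
    print(page, summary[page])
```

A page with many incoming links has been "endorsed" by many authors; HITS and PageRank refine this raw count.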
Page Layout and Link Analysis for Web Images
WEB BASICS
• The Web is a huge collection of documents linked together by references.
• References from one document to another are based on hypertext and embedded in HTML.
• HTML describes how the document should be displayed in the browser window.
• Each web document has a web address, called a URL, that identifies it uniquely.
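As a rough illustration of how references are embedded in HTML, the sketch below pulls the `href` targets out of a page and resolves them to absolute URLs using Python's standard `html.parser` and `urllib.parse`. The class name `LinkExtractor`, the sample markup, and the example.com addresses are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the absolute URLs of all <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative references against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical page content and address.
page = '<p>See <a href="/docs/intro.html">intro</a> and <a href="http://example.org/">this site</a>.</p>'
found = extract_links(page, "http://example.com/index.html")
print(found)
```

The extracted links are the edges that web structure mining works on.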
WEB CRAWLERS
• Collect “all” web documents by browsing the Web systematically and exhaustively
• The region of the web to be crawled can be specified by using the URL structure
• Used by a search engine to provide local access to the most recent versions of possibly all web pages
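A crawler of this kind can be sketched as a breadth-first traversal that only follows links whose URL starts with a chosen prefix. To keep the example self-contained, the "web" here is a hypothetical in-memory dictionary (`FAKE_WEB`) standing in for real HTTP fetches; a real crawler would also need politeness rules, error handling, and revisits.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs it links to.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://other.org/"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://other.org/": ["http://example.com/b"],
}

def crawl(seed, prefix, fetch_links):
    """Visit every page reachable from seed whose URL starts with prefix."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            # The URL structure restricts the region of the web that is crawled.
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("http://example.com/", "http://example.com/",
                lambda url: FAKE_WEB.get(url, []))
print(visited)
```

The `seen` set is what makes the crawl systematic: every page in the region is visited once, even when pages link back to each other.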
INDEXING AND KEYWORD SEARCH
• There are two types of data: structured and unstructured
• Structured data have keys associated with each data item that reflect its content
• Keyword search is content-based access to unstructured data that does not consider the meaning of the content
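One common way to support keyword search over unstructured text is an inverted index, which maps each term to the documents that contain it. The slides do not name a specific index structure, so the following is an assumed, minimal sketch with hypothetical documents and helper names (`build_index`, `search`).

```python
from collections import defaultdict

# Hypothetical document collection: id -> text.
docs = {
    1: "web mining extracts knowledge from web data",
    2: "link analysis ranks web pages",
    3: "data mining discovers patterns",
}

def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index = build_index(docs)
print(search(index, "web mining"))
print(search(index, "data mining"))
```

The search matches strings only: it has no notion of what "mining" means, which is exactly the limitation the last bullet describes.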
DOCUMENT REPRESENTATION
• To facilitate the process of matching keywords and documents, some preprocessing steps are taken first:
  • Documents are tokenized
  • Characters are converted to upper or lower case
  • Words are reduced to canonical form
  • Stopwords are usually removed
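The preprocessing steps above can be sketched as a small pipeline. The stopword list and the suffix-stripping rule (`crude_stem`) are deliberately tiny, hypothetical stand-ins; a real system would use a proper stemmer or lemmatizer and a fuller stopword list.

```python
import re

# Assumed, very small stopword list for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def crude_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Case folding, then tokenization on runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stopword removal, then reduction to a canonical form.
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

terms = preprocess("The crawlers are indexing the pages of the Web")
print(terms)
```

Here stopwords are dropped before stemming so that the stopword list can stay in plain, unstemmed form; either order is possible.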
ALGORITHMS
• There are two main algorithms used in web structure mining:
  1. HITS (Hypertext-Induced Topic Search)
  2. PageRank
HITS (Hypertext-Induced Topic Search)
• Link analysis algorithm
• Rates web pages
• Developed by Jon Kleinberg
• Determines two values for a page:
  • Authority: estimates the value of the content of the page
  • Hub: estimates the value of its links to other pages
Hubs and Authorities
• Hub pages contain links that point to authorities (relevant pages)
• Authorities are the targets of hub pages
HITS cont.
• Authority and hub values are defined in terms of one another in a mutual recursion
• It is executed at query time, with the associated hit on performance
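The mutual recursion can be made concrete with a short iterative sketch: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, with both vectors rescaled after each round. The five-page graph, the fixed iteration count, and the function name `hits` are assumptions for illustration; in practice the graph would be the query-specific set of pages.

```python
import math

def hits(graph, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages that link to each page.
        auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: sum the authority scores of the pages each page links to.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: two hub-like pages and one page "c" link to pages "a" and "b".
links = {"hub1": ["a", "b"], "hub2": ["a", "b"], "c": ["a"], "a": [], "b": []}
hub, auth = hits(links)
for page in sorted(auth, key=auth.get, reverse=True):
    print(page, round(auth[page], 3), round(hub[page], 3))
```

Because the whole iteration runs on the pages retrieved for a query, its cost is paid at query time, which is the performance hit mentioned above.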
PageRank
• Link analysis algorithm
• Assigns a numerical weight to each element of a hyperlinked set of documents
• The weight of an element E is denoted PR(E)
• Relies on the uniquely democratic nature of the Web
• A link from page A to page B is a vote, by page A, for page B
PageRank cont.
• If A is itself important, its vote helps to make B important
• PageRank is also a probability distribution: it represents the probability that a click on a link arrives at any particular page
• A PageRank of 0.5 means a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank
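The voting and probability ideas can be sketched with the usual power-iteration form of PageRank. The damping factor of 0.85, the even redistribution of rank from pages without outgoing links, the four-page graph, and the function name `pagerank` are all assumptions of this sketch and do not come from the slides.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Return {page: rank}; the ranks form a probability distribution."""
    pages = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share (the "random jump" term).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                # A page splits its vote equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: A is voted for by both C and D.
links = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
```

The ranks sum to 1, so each value can be read as the chance of landing on that page; A ranks highest because it receives two votes, and its vote in turn lifts B.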
APPLICATIONS
• Information retrieval in social networks
• Finding the relevance of each Web page
• Measuring the completeness of Web sites
• Used in search engines to find relevant information
CONCLUSION
• Search engines use web structure mining to find information.
• We can create new knowledge out of the available information.
• Web content mining can be added to it to enhance the performance of search engines.
Thank You!
Questions?
