WEB STRUCTURE
MINING
SUBMITTED BY:
BLESSY JOHN
R7A
ROLL NO:18
INTRODUCTION
Web mining is the application of data
mining techniques to Web data, as used in search engines.
Data mining - process of discovering
useful knowledge from data sources
Web mining automatically discovers and
extracts information from Web documents.
Web structure mining discovers useful
data from hyperlinks.
WEB MINING
Extraction of useful patterns from WWW
resources
The WWW is a widely distributed, global
information service centre that
constitutes a rich source for data
mining
Employs techniques from data
mining, information retrieval, etc.
NEED FOR WEB MINING
Aims at finding and extracting relevant
information that is hidden in web-
related data.
The challenge is to bring back the
semantics of hypertext documents
To turn web data into web knowledge
CLASSIFICATION
WEB MINING
WEB CONTENT MINING
WEB STRUCTURE MINING
WEB USAGE MINING
WEB STRUCTURE
MINING
Generate structural summary about the
Web site and Web page
Use graph theory to analyse node and
connection structure of a web site
Analyses the link structure of the
web; its purpose is to identify
the more preferable documents
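As a minimal sketch of the graph view described above, the snippet below models a site as a directed graph and counts in-links and out-links per page. The page names and the `links` dictionary are hypothetical, made up purely for illustration.

```python
from collections import Counter

# Hypothetical site: each key is a page, each value lists the pages it links to.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
}

def degree_summary(graph):
    """Return (in_degree, out_degree) counters for a directed link graph."""
    out_degree = Counter({page: len(targets) for page, targets in graph.items()})
    in_degree = Counter()
    for targets in graph.values():
        in_degree.update(targets)
    return in_degree, out_degree

in_deg, out_deg = degree_summary(links)
```

Pages with many in-links are candidates for the "more preferable documents"; this simple count is only a starting point compared with the link analysis algorithms covered later.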
WEB STRUCTURE
MINING (cont.)
Discovering the nature of the hierarchy
of hyperlinks in the website and its
structure
A hyperlink signals the author’s
endorsement of the other web page
Retrieving information about the
relevance and the quality of the web
page.
Page Layout and Link
Analysis for Web
Images
WEB BASICS
The Web is a huge collection of documents
linked together by references.
References from one document to another
are based on hypertext and embedded in
HTML
HTML describes how the document
should be displayed in the browser window
Each web document has a web address,
called a URL, that identifies it uniquely.
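To illustrate how references are embedded in HTML, here is a small sketch that pulls the hyperlinks out of a page using Python's standard `html.parser` and `urllib.parse` modules. The `LinkExtractor` class, the `extract_links` helper, the sample markup and the example.com addresses are all assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical page content and address.
page = '<p>See <a href="/docs/intro.html">intro</a> and <a href="http://example.org/">other</a>.</p>'
found = extract_links(page, "http://example.com/index.html")
```

The extracted URLs are the edges of the Web graph that structure mining works on.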
WEB CRAWLERS
Collects “all” web documents by
browsing the Web systematically and
exhaustively
Region of the web to be crawled can be
specified by using the URL structure.
Used by a search engine to provide
local access to the most recent versions
of possibly all web pages
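The crawling idea above can be sketched as a breadth-first traversal that only follows URLs with a given prefix. To keep the example self-contained, it walks a hypothetical in-memory link table (`FAKE_WEB`) instead of fetching pages over the network; a real crawler would download and parse each page.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs that page links to.
# A real crawler would fetch each URL over HTTP and parse its links.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://other.org/"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://other.org/": ["http://example.com/b"],
}

def crawl(seed, prefix, fetch_links):
    """Breadth-first crawl from seed, visiting only URLs that start with prefix."""
    visited = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for target in fetch_links(url):
            if target.startswith(prefix) and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited

order = crawl("http://example.com/", "http://example.com/",
              lambda u: FAKE_WEB.get(u, []))
```

The `prefix` argument plays the role of the URL structure that limits the crawled region; the `seen` set prevents a page from being visited twice.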
INDEXING AND
KEYWORD SEARCH
There are two types of data:
structured and unstructured
Structured data have keys associated
with each data item that reflect its
content
Content-based access to unstructured
data without considering the meaning is
the keyword search approach
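One common way to support keyword search over unstructured text is an inverted index, which the slides do not name explicitly; the sketch below is therefore an assumption about how such a search could be organised. The `docs` collection and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical document collection: id -> text.
docs = {
    1: "web mining uses data mining",
    2: "web crawlers collect web documents",
    3: "data sources hold useful knowledge",
}

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query word (boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(docs)
```

The search matches words purely as strings, which is exactly the "without considering the meaning" limitation noted above.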
DOCUMENT
REPRESENTATION
To facilitate the process of matching
keywords and documents, some
preprocessing steps are taken first:
Documents are tokenized
Characters are converted to upper or
lower case
Words are reduced to canonical form (stemming)
Stopwords are usually removed
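The preprocessing steps above can be sketched as a short pipeline. The stopword list is a tiny illustrative sample, and `crude_stem` is a deliberately rough stand-in for a real stemming algorithm; both are assumptions, not part of the original material. Stopwords are removed before stemming here, which is one reasonable ordering among several.

```python
import re

# Small illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)           # tokenize
    tokens = [t.lower() for t in tokens]                 # normalize case
    tokens = [t for t in tokens if t not in STOPWORDS]   # remove stopwords
    return [crude_stem(t) for t in tokens]               # reduce to canonical form

terms = preprocess("The crawlers are indexing linked Documents")
```

After this step, a query such as "index" and a document containing "indexing" map to the same term.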
ALGORITHMS
There are two main algorithms used in
web structure mining
1. HITS (Hypertext-Induced Topic
Search)
2. PageRank algorithm
HITS (Hypertext-Induced Topic
Search)
Link analysis algorithm
Rates web pages
Developed by Jon Kleinberg
Determines two values for a page
Authority: estimates the value of the
content of the page
Hub: estimates the value of its links to
other pages
Hubs and Authorities
Hub pages contain links that point to authorities (relevant
pages)
Authorities are the targets of hub pages
HITS (cont.)
Authority and hub values are defined in
terms of one another in a mutual
recursion
It is executed at query time, with the
associated hit on performance
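The mutual recursion between hub and authority values can be sketched as an iterative update. This is a minimal illustration under assumptions: the fixed iteration count, the Euclidean normalization and the toy `graph` are choices made here, not details taken from the slides.

```python
import math

def hits(graph, iterations=50):
    """Iteratively compute (authority, hub) scores for a directed link graph.

    graph maps each page to the list of pages it links to.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages linking to p.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for t in targets:
                new_auth[t] += hub[source]
        # Hub update: sum of the new authority scores of the pages p links to.
        new_hub = {p: sum(new_auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded (Euclidean norm of 1).
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub

# Hypothetical graph: two hub pages pointing at candidate authorities.
graph = {"hub1": ["a", "b"], "hub2": ["a"], "a": [], "b": []}
auth, hub = hits(graph)
```

In this toy graph, page `a` ends up with the highest authority score because both hubs link to it, and `hub1` gets the highest hub score because it links to both authorities.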
PageRank
Link analysis algorithm
Assigns a numerical weight to each
element of a hyperlinked set of
documents
The weight of an element E is denoted by PR(E)
Relies on the uniquely democratic nature of the Web
Link from page A to page B is a vote,
by page A, for page B
PageRank (cont.)
If A is itself important, its vote weighs
more and helps to make B important
PageRank is also a probability distribution:
it represents the probability that a person
randomly clicking on links arrives at any particular page
PageRank of 0.5 -> 50% chance that a
person clicking on a random link will be directed
to the document with the 0.5 PageRank
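The voting and probability ideas above can be sketched with a simple power iteration. The damping factor of 0.85, the handling of pages without out-links and the three-page `graph` are assumptions taken from the commonly used formulation, not from the slides themselves.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank; returns scores that sum to 1.

    graph maps each page to the list of pages it links to.
    """
    pages = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source in pages:
            targets = graph.get(source, [])
            for t in targets:
                # Each link passes on an equal share of the source's rank.
                new_rank[t] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical three-page web.
graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
pr = pagerank(graph)
```

Because the scores sum to 1, each value can be read as the probability of landing on that page; here `B` scores highest since it receives votes from both `A` and `C`.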
APPLICATIONS
Information retrieval in social networks.
To find out the relevancy of each Web
page
Measuring completeness of the Web
sites
Used in search engines to find out
relevant information
CONCLUSION
Search engines use web structure
mining to find information.
We can create new knowledge out of
the available information
Web Content mining can be added to it
to enhance the performance of search
engines.
Thank You!
Questions ?