Unit 5 DM
Web mining
□ Web mining - data mining techniques to
automatically discover and extract information
from Web documents/services.
□ Web mining research integrates work
from several research communities, such
as:
Database (DB)
Information retrieval (IR)
The sub-areas of machine learning (ML)
Natural language processing (NLP)
Mining the World-Wide Web
The WWW is a huge, widely distributed, global
information source for:
– Information services: news, advertisements,
consumer information, financial management,
education, government, e-commerce, etc.
– Hyperlink information
– Access and usage information
– Web site contents and organization
Mining the World-Wide Web
□ Growing and changing very rapidly
– Broad diversity of user communities
□ Only a small portion of the information on
the Web is truly relevant or useful to Web
users
– How to find high-quality Web pages on
a specified topic?
□ WWW provides rich sources for data
mining
Challenges on WWW Interactions
□ Finding Relevant Information
□ Creating knowledge from the information
available
□ Personalization of the information
□ Learning about customers / individual
users.
Web Mining can play an important role!
Web Mining: more challenging
□ Searches for:
– Web access patterns
– Web structures
– Regularity and dynamics of Web contents
□ Problems:
– The “abundance” problem
– Limited coverage of the Web: hidden Web
sources, majority of data in DBMS
– Limited query interface based on
keyword-oriented search
– Limited customization to individual users
– Dynamic and semi-structured content
Web Mining: Subtasks
□ Resource Finding
– Task of retrieving intended web-documents
□ Information Selection & Pre-processing
– Automatic selection and pre-processing of specific
information from retrieved web resources
□ Generalization
– Automatic Discovery of patterns in web sites
□ Analysis
– Validation and / or interpretation of mined
patterns
Web Mining Taxonomy
[Taxonomy figure: Web mining is commonly divided into Web content mining, Web structure mining, and Web usage mining]
PageRank example: consider two pages A and B that link to each
other. Each page has exactly one outgoing link, so C(A) = 1 and
C(B) = 1.
d (damping factor) = 0.85
PR(A) = (1 – d) + d * (PR(B)/1)
PR(B) = (1 – d) + d * (PR(A)/1)
Solving these equations (or iterating until convergence):
PR(A) = 0.15 + 0.85 * 1 = 1
PR(B) = 0.15 + 0.85 * 1 = 1
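A minimal Python check of this example (not from the slides): iterating the two equations from arbitrary starting ranks converges to the same fixed point, PR(A) = PR(B) = 1.

d = 0.85
pr_a, pr_b = 0.5, 0.5                   # any starting ranks work
for _ in range(100):                    # iterate the two equations above
    pr_a, pr_b = (1 - d) + d * pr_b, (1 - d) + d * pr_a
print(round(pr_a, 6), round(pr_b, 6))   # both converge to 1.0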
PageRank algorithm
The formula used for calculating the PageRank of a page is recursive:
start from any initial set of ranks and iterate the computation until it converges.
The algorithm for PageRank is as follows:
Step 1. Initialise the rank value of each page to 1/n, where n is
the total number of pages to be ranked.
Step 2. Choose a damping factor d such that
0 < d < 1, e.g. 0.85.
Step 3. Let PR be an array holding the PageRank of each web page,
and A the ranks from the previous iteration. Then repeat for each node i, 0 ≤ i < n:
PR[i] = 1 – d
For every page Q that has an inward link to i, compute
PR[i] = PR[i] + d * A[Q] / Qn
where Qn = the out-degree of Q (its number of outgoing links).
Step 4. Update the value of A[i] = PR[i] for 0 ≤ i < n.
Repeat from Step 3 until the PR values converge, i.e. two
consecutive iterations give (nearly) the same values.
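A runnable Python sketch of Steps 1-4, assuming the link structure is supplied as a dictionary (the function name and input format are illustrative, not part of the slides):

def pagerank(links, d=0.85, tol=1e-6, max_iter=200):
    # links: page -> list of pages it links out to (illustrative format)
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    A = {p: 1.0 / n for p in pages}                # Step 1: initial rank 1/n
    for _ in range(max_iter):
        PR = {p: 1 - d for p in pages}             # Step 3: PR[i] = 1 - d
        for q, outs in links.items():
            for p in outs:
                PR[p] += d * A[q] / len(outs)      # add d * A[Q] / Qn
        if all(abs(PR[p] - A[p]) < tol for p in pages):
            return PR                              # consecutive iterations agree
        A = PR                                     # Step 4: A[i] = PR[i]
    return A

# The two mutually linking pages from the example above:
print(pagerank({"A": ["B"], "B": ["A"]}))          # both ranks approach 1.0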
Authority and Hub
□ A page that is referenced by a lot of important pages (has
many back links) is more important (an Authority)
■ A page referenced by a single important page may be more
important than that referenced by five unimportant pages
□ A page that references a lot of important pages is also
important (Hub)
□ “Importance” can be propagated
■ Your importance is the weighted sum of the importance
conferred on you by the pages that refer to you
■ The importance you confer on a page may be inversely
proportional to how many other pages you refer to (cite),
since your importance is shared among all of them
□ (Also what you say about them when you cite them!)
□ A page is a good authoritative page with respect to a
given query if it is referenced (i.e., pointed to) by many
(good hub) pages that are related to the query.
□ A page is a good hub page with respect to a given query
if it points to many good authoritative pages with
respect to the query.
□ Good authoritative pages (authorities) and good hub
pages (hubs) reinforce each other.
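In standard HITS notation (added here for clarity, not spelled out on the slide), this mutual reinforcement is a pair of update rules for the authority score a(p) and hub score h(p) of a page p:
a(p) = Σ_{q→p} h(q) and h(p) = Σ_{p→q} a(q),
with both score vectors normalised after every round so that the iteration converges.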
HITS (Hyperlink Induced Topic Search)
□ HITS is a link analysis algorithm, proposed by Jon Kleinberg,
that rates web pages on the concepts of authority and hub.
It is based on the principle that a document has a high
authority weight if it is pointed to by many documents with
high hub weight, and a high hub weight if it points to many
documents with high authority weight.
The steps in this algorithm are:
1. A root subgraph R is created from the query (keywords)
by the algorithm.
2. R is expanded to a base set B of pages that link to,
or are linked from, pages in R.
3. The hub and authority weights of each node in B are
computed iteratively (see the sketch below).
4. Pages with the highest ranks in B are returned.
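A Python sketch of the iterative weight computation on the base set B, assuming its link structure is given as a dictionary (function and variable names are illustrative, not from the slides):

import math

def hits(graph, iterations=50):
    # graph: page -> list of pages it links to (the base set B)
    pages = set(graph) | {p for outs in graph.values() for p in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of pages that point to p
        auth = {p: sum(hub[q] for q, outs in graph.items() if p in outs)
                for p in pages}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {p: a / norm for p, a in auth.items()}
        # Hub of p: sum of authority scores of pages that p points to
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {p: h / norm for p, h in hub.items()}
    return auth, hub

# Two pages pointing at a common target: C becomes the authority, A and B the hubs.
auth, hub = hits({"A": ["C"], "B": ["C"], "C": []})
print(auth["C"], hub["A"], hub["B"])

Pages with the highest authority scores are returned as answers to the query; the highest hub scores identify good "directory" pages.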
HITS
HITS produced good results. However, it is not stable on the
class of authority-connected graphs and did not work well in a
few cases:
a) Sometimes a set of documents on one host points to a
single document on another host, or a single document on
one host points to a set of documents on another host. These
situations can give a misleading picture of a good hub or a
good authority.
b) Links generated automatically by authoring tools may likewise
give a wrong notion of a good hub or a good authority.
c) Sometimes pages point to other pages that are not relevant
to the query topic. This can lead to wrong results for hubs and authorities.
SOCIAL NETWORK ANALYSIS
The basic premise here is that if a web page links to another web
page, then the former is endorsing the importance of the latter in some
sense.
Also, if there exists a link from one node to another and back from the
latter to the former, it signifies some kind of mutual reinforcement, while
links from one node to several other nodes indicate co-citation.
Ranking pages
Kleinberg discusses a heuristic method of giving weights
to the links.
A link is said to be transverse if it is between
pages with different domain names, and intrinsic if it is between
pages with the same domain name.
Intrinsic links convey less information about a page's authority than
transverse links, so they are not taken into account and are deleted from the graph.
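A small Python sketch of this filtering step, comparing URL hosts to separate intrinsic from transverse links (the example edge list is made up for illustration):

from urllib.parse import urlparse

def is_transverse(src, dst):
    # Transverse: the two endpoints live on different hosts/domains
    return urlparse(src).netloc != urlparse(dst).netloc

edges = [
    ("http://a.example.com/p1", "http://a.example.com/p2"),    # intrinsic: dropped
    ("http://a.example.com/p1", "http://b.example.org/home"),  # transverse: kept
]
edges = [(s, t) for (s, t) in edges if is_transverse(s, t)]
print(edges)  # only the cross-domain link survives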