Mining The Web Graph: Technical Seminar Presentation On
K. Lokesh Acharaya
Roll No. 0701229274
MINING THE WEB GRAPH
Web Mining
•Web Content Mining
•Web Structure Mining
•Web Usage Mining
•Database view:
Model the data on the Web.
Integrate it to support more sophisticated queries.
WEB AS A GRAPH
HITS ALGORITHM
• HITS: an algorithm for identifying good hub and authority pages for a
query. Each page is associated with a hub score and an authority score.
• Scores are computed based on graph structure of the Web.
• Mutual reinforcement of hubs and authorities is exploited with an
iterative algorithm.
• Hub score h(p): updated with the sum of the authority weights of all
pages that p points to:
$h(p) = \sum_{(p,q) \in E} a(q)$
• Authority score a(p): updated with the sum of the hub weights of all
pages that point to p:
$a(p) = \sum_{(q,p) \in E} h(q)$
• Drawback: the hub and authority scores are computed iteratively from
the query result, so they must be computed at query time rather than
precomputed offline.
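The two update rules above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names `hits` and `normalize`, the edge-list input format, the fixed iteration count, and the L2 normalization step are all assumptions that the slides do not specify.

```python
import math


def normalize(scores):
    """Scale scores to unit L2 norm (assumed normalization; keeps values bounded)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(edges, iterations=50):
    """Run HITS on directed edges (p, q), where page p links to page q.

    Returns two dicts: hub scores h(p) and authority scores a(p).
    """
    pages = {page for edge in edges for page in edge}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # a(p) = sum of h(q) over all edges (q, p)
        new_auth = {page: 0.0 for page in pages}
        for q, p in edges:
            new_auth[p] += hub[q]

        # h(p) = sum of a(q) over all edges (p, q)
        new_hub = {page: 0.0 for page in pages}
        for p, q in edges:
            new_hub[p] += new_auth[q]

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


if __name__ == "__main__":
    # Hypothetical query result: pages "a" and "b" link to "c"; "a" also links to "d".
    hub, auth = hits([("a", "c"), ("b", "c"), ("a", "d")])
    print("best hub:", max(hub, key=hub.get))
    print("best authority:", max(auth, key=auth.get))
```

In this small example, page "c" is pointed to by both hubs and so collects the largest authority score, while "a" points to both authorities and collects the largest hub score. This is the mutual reinforcement described above.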
PAGE RANK
• Page rank pr(p):
$pr(p) = (1-d)\frac{1}{N} + d \sum_{(q,p) \in E} \frac{pr(q)}{o(q)}$
• Where:
o(p): out-degree of page p
d: damping factor (0.85)
N: total number of pages
• Page rank prefers pages that have:
a large in-degree.
predecessors with a large page rank.
predecessors with a small out-degree.
• The page rank values form a probability distribution over all pages.
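As a rough illustration of the page rank definition, the sketch below evaluates pr(p) = (1-d)/N + d Σ pr(q)/o(q) by repeated substitution. The function name `pagerank`, the edge-list input, the fixed iteration count, and the uniform redistribution of rank from pages with no out-links are assumptions made for this example and are not taken from the slides.

```python
def pagerank(edges, d=0.85, iterations=100):
    """Iteratively compute page rank for directed edges (q, p), where q links to p."""
    pages = sorted({page for edge in edges for page in edge})
    n = len(pages)

    # o(q): out-degree of each page
    out_degree = {page: 0 for page in pages}
    for q, _ in edges:
        out_degree[q] += 1

    # Start from the uniform distribution 1/N
    pr = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Assumption: rank held by pages with no out-links is spread uniformly,
        # so the values keep summing to 1.
        dangling = sum(pr[page] for page in pages if out_degree[page] == 0)
        new_pr = {page: (1 - d) / n + d * dangling / n for page in pages}

        # Each predecessor q passes pr(q) / o(q) to every page p it links to
        for q, p in edges:
            new_pr[p] += d * pr[q] / out_degree[q]

        pr = new_pr

    return pr


if __name__ == "__main__":
    # Hypothetical graph: a cycle a -> b -> c -> a, plus d -> a.
    ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
    for page, score in sorted(ranks.items()):
        print(page, round(score, 4))
```

In this example, page "a" has the largest in-degree and ends up with the highest rank, while "d" has no predecessors and keeps only the base value (1-d)/N.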
TYPES OF CLUSTERING
•Hierarchical clustering.
•Partitional clustering.
•Probabilistic clustering.
•Graph based clustering.
•Fuzzy clustering.
•Neural network based clustering.
WEB CACHING
•Caching can improve network traffic on the Web by reducing:
the bandwidth consumption
the network latency perceived by the client
the server load.
•Caching can improve the network reliability perceived by the
client.
•Evaluation measures are:
Hit Rate: the ratio of requests fulfilled by the cache (and therefore
not handled by the web servers) to all requests.
Weighted Hit Rate: the ratio of bytes served to the client by the
cache to all bytes served.
Latency: the time that an end user waits to retrieve a
resource.
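The first two measures can be computed from a request log, as in the sketch below. The log format (a list of `(size_in_bytes, served_from_cache)` tuples) and the function name `cache_metrics` are hypothetical and chosen only for illustration.

```python
def cache_metrics(requests):
    """Compute (hit rate, weighted hit rate) from a request log.

    `requests` is a list of (size_in_bytes, served_from_cache) tuples;
    this log format is an assumption made for the example.
    """
    total_requests = len(requests)
    total_bytes = sum(size for size, _ in requests)

    hits = sum(1 for _, from_cache in requests if from_cache)
    hit_bytes = sum(size for size, from_cache in requests if from_cache)

    # Guard against an empty log or zero-byte traffic
    hit_rate = hits / total_requests if total_requests else 0.0
    weighted_hit_rate = hit_bytes / total_bytes if total_bytes else 0.0
    return hit_rate, weighted_hit_rate


if __name__ == "__main__":
    # Two small cached responses and two large uncached ones
    log = [(100, True), (900, False), (200, True), (800, False)]
    hit_rate, weighted_hit_rate = cache_metrics(log)
    print("hit rate:", hit_rate)
    print("weighted hit rate:", weighted_hit_rate)
```

With this log, half of the requests are cache hits but they account for only 300 of the 2000 bytes served, which shows why the two measures can differ considerably.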
National Institute of Science & Technology
THANK YOU