Link Mining

The document discusses advanced topics in data mining, focusing on link-based ranking methods such as the HITS (Hyperlink-Induced Topic Search) algorithm. It explains the concepts of authority and hub scores, how they are calculated, and their significance in ranking web pages based on hyperlinks. Additionally, it highlights the importance of these methods in addressing issues of content similarity in search engines and their applications in various domains.

Uploaded by

20i0863 Maryam
Data Mining

Advanced Topics in Data Mining


Link Based Ranking
Graph-based Representation
• Directed / Undirected
• Weighted / Unweighted
• Graph - Adjacency Matrix
• Degree of a node
• In-degree / Out-degree
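As a minimal sketch of the representation listed above, a small directed, unweighted graph can be stored as an adjacency matrix, with the out-degree of a node read off as a row sum and the in-degree as a column sum. The four-node graph and the helper names (`adjacency_matrix`, `out_degree`, `in_degree`) are illustrative assumptions, not part of the slides.

```python
# Illustrative sketch: the node labels and edges are made up for this example.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]  # directed: (source, target)


def adjacency_matrix(nodes, edges):
    """Return matrix M where M[i][j] = 1 if there is an edge nodes[i] -> nodes[j]."""
    index = {name: k for k, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix


def out_degree(matrix, i):
    """Out-degree of node i: number of 1s in row i."""
    return sum(matrix[i])


def in_degree(matrix, j):
    """In-degree of node j: number of 1s in column j."""
    return sum(row[j] for row in matrix)


M = adjacency_matrix(nodes, edges)
for k, name in enumerate(nodes):
    print(name, "out:", out_degree(M, k), "in:", in_degree(M, k))
```

For an undirected graph the matrix is symmetric, and for a weighted graph the 1s would be replaced by edge weights.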
Ranking

• Teams and Player Ranking
• Student Ranking
• Web Pages Ranking
• Expert Ranking
• Scholars and Academic Entities Ranking
• My interest
• Think of anything and you can rank it
• Content
• Link
Introduction – From Content to ?

• Early search engines focused on content:
• they compared the content similarity of the query and the indexed pages,
• using information retrieval methods such as cosine similarity and TF-IDF.
• From 1996, it became clear that content similarity alone was no longer sufficient.
• The number of pages grew rapidly in the mid-to-late 1990s.
• This growth was exponential [Internet Statistics].
• How can the top 10-40 pages be ranked and shown to the user?

• Issues
• Content similarity is easily spammed.
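To illustrate why content similarity alone is easy to spam, here is a minimal sketch that scores pages against a query with cosine similarity over raw term counts. It is a simplification (no TF-IDF weighting, no stemming), and the query, the two page texts and the function name `cosine_similarity` are made up for the example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw term counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and pages, chosen only to show the effect.
query = "cheap flights"
honest_page = "compare cheap flights to paris and book hotels"
spam_page = "cheap flights cheap flights cheap flights cheap flights"

honest_score = cosine_similarity(query, honest_page)
spam_score = cosine_similarity(query, spam_page)
print(round(honest_score, 3), round(spam_score, 3))
```

The keyword-stuffed page scores about 1.0 against the query while the genuine page scores 0.5, so a ranking based only on this measure would put the spam page first.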
Links

• Starting around 1996, researchers began to work on the problem and turned to hyperlinks.
• Web pages, on the other hand, are connected through hyperlinks,
• which carry important information.
Ranking Algorithm

• HITS (Hyperlink-Induced Topic Search)
• e.g. AltaVista
• Developed by Jon Kleinberg.


Short Introduction

• HITS (Hyperlink-Induced Topic Search)


• A link-based ranking method for information retrieval, like PageRank
• It tries to find key pages for specific web communities.

• HITS focuses on finding


• Authorities
• Hubs
Introduction HITS

• Authority
• A page with many in-links.
• page may have good or authoritative content on a topic
• Hub
• Page with many out-links.
• Page serves as an organizer of information on a topic
• The key idea
• A good hub points to many good authorities.
• A good authority is pointed to by many good hubs.
Authorities and Hubs

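• Let Ai be the authority score for page i,
• and let Hi be the hub score for page i.
• Calculation procedure
• Initialize both scores to 1 for every page,
• then iterate the following two updates until convergence is achieved (the numbers settle down):
• Ai = sum of Hj over all pages j that link to page i
• Hi = sum of Aj over all pages j that page i links to
• Normalise the scores after each update so that they sum to 1.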
H and A calculation

a(1) = h(2) + h(3) + h(4)   (pages 2, 3 and 4 link to page 1)
h(1) = a(5) + a(6) + a(7)   (page 1 links to pages 5, 6 and 7)


Authorities and Hubs example

• Graph (as implied by the update rules below): a links to b, c, d; b links to a, c, d; c links to d; d has no out-links.
• Initially Ha = Hb = Hc = Hd = 1

1. Aa = Hb = 1; Ab = Ha = 1; Ac = Ha + Hb = 2; Ad = Ha + Hb + Hc = 3
Normalise: Aa = 0.143 ; Ab = 0.143; Ac = 0.286; Ad = 0.429

Ha = Ab + Ac + Ad = 0.858; Hb = Aa + Ac + Ad = 0.858; Hc = 0.429; Hd = 0


Normalise: Ha = 0.4; Hb = 0.4; Hc = 0.2; Hd = 0
Example

2. Aa = Hb = 0.4; Ab = Ha = 0.4; Ac = Ha + Hb = 0.8; Ad = Ha + Hb + Hc = 1

Normalise: Aa = 0.154; Ab = 0.154; Ac = 0.308; Ad = 0.385

Ha = Ab + Ac + Ad = 0.847; Hb = Aa + Ac + Ad = 0.847; Hc = Ad = 0.385; Hd = 0

Normalise: Ha = 0.407; Hb = 0.407; Hc = 0.185; Hd = 0
Example

3. Aa = Hb = 0.407; Ab = Ha = 0.407; Ac = Ha + Hb = 0.814; Ad = Ha + Hb + Hc = 0.999

Normalise: Aa = 0.155; Ab = 0.155; Ac = 0.310; Ad = 0.380

Ha = Ab + Ac + Ad = 0.845; Hb = Aa + Ac + Ad = 0.845; Hc = Ad = 0.380; Hd = 0

Normalise: Ha = 0.408; Hb = 0.408; Hc = 0.184; Hd = 0
Example

4. Aa = Hb = 0.408; Ab = Ha = 0.408; Ac = Ha + Hb = 0.816; Ad = Ha + Hb + Hc = 1

Normalise: Aa = 0.155; Ab = 0.155; Ac = 0.310; Ad = 0.380

Ha = Ab + Ac + Ad = 0.845; Hb = Aa + Ac + Ad = 0.845; Hc = Ad = 0.380; Hd = 0

Normalise: Ha = 0.408; Hb = 0.408; Hc = 0.184; Hd = 0

Each step uses the rounded values of the previous step. To three decimals the scores have settled: A = (0.155, 0.155, 0.310, 0.380) and H = (0.408, 0.408, 0.184, 0).
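The iterations above can be checked with a short script. This is a minimal sketch rather than the slides' own code: the edge list is inferred from the update rules of the example (for instance, Ac = Ha + Hb implies that a and b both link to c), and the names `out_links`, `normalise` and `hits` are illustrative.

```python
# Edges inferred from the update rules in the worked example
# (e.g. "Ad = Ha + Hb + Hc" means a, b and c all link to d).
out_links = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["d"],
    "d": [],
}


def normalise(scores):
    """Scale scores so they sum to 1 (the normalisation used in the example)."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)  # nothing to scale
    return {page: value / total for page, value in scores.items()}


def hits(out_links, iterations):
    """Run a fixed number of HITS iterations; return (authority, hub) dicts."""
    hub = {page: 1.0 for page in out_links}
    auth = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking in.
        auth = {page: 0.0 for page in out_links}
        for src, targets in out_links.items():
            for dst in targets:
                auth[dst] += hub[src]
        auth = normalise(auth)
        # Hub update: sum the new authority scores of the pages linked to.
        hub = normalise({src: sum(auth[dst] for dst in targets)
                         for src, targets in out_links.items()})
    return auth, hub


authority, hub = hits(out_links, iterations=4)
print({p: round(v, 3) for p, v in authority.items()})
print({p: round(v, 3) for p, v in hub.items()})
```

After four iterations the hub scores round to about 0.408, 0.408, 0.184 and 0, in line with the example. The script carries unrounded values between steps, so intermediate digits can differ slightly from a hand calculation that rounds at every step.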
Example
Exercise (Optional)
(Compute hub and authority scores for the following graph until convergence)
The HITS algorithm: Formal way

• Given a broad search query, q, HITS collects a set of pages as follows:
• It sends the query q to a search engine.
• It then collects the t (usually t = 200) highest-ranked pages. This set is called the root set W.
• It then grows W by including any page pointed to by a page in W and any page that points to a page in W. This gives a larger set S, called the base set.
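The growth of the root set W into the base set S can be sketched as follows. This is only an illustration under stated assumptions: `out_links` is a hypothetical in-memory dictionary standing in for a real link index, and `build_base_set` is a made-up helper name.

```python
def build_base_set(root_set, out_links):
    """Grow the root set W into the base set S.

    root_set  : iterable of page ids returned by the search engine (W)
    out_links : dict mapping page id -> list of page ids it links to
                (hypothetical stand-in for a real link index)
    """
    root = set(root_set)
    base = set(root)
    for page, targets in out_links.items():
        if page in root:
            base.update(targets)   # pages pointed to by a page in W
        elif any(t in root for t in targets):
            base.add(page)         # pages that point to a page in W
    return base


# Toy link structure: p1 is the only page in the root set.
toy_links = {
    "p1": ["p2", "p3"],
    "p2": [],
    "p4": ["p1"],
    "p5": ["p6"],
}
print(sorted(build_base_set(["p1"], toy_links)))
```

In the toy data, p2 and p3 join S because p1 links to them, p4 joins because it links to p1, and p5 and p6 stay out because they have no link to or from the root set.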
The link graph G

• HITS works on the pages in S, and assigns every page in S an authority score and a hub score.
• Let the number of pages in S be n.
• We again use G = (V, E) to denote the hyperlink graph of S.
• We use L to denote the adjacency matrix of the graph.
The HITS algorithm

• Let the authority score of page i be a(i), and the hub score of page i be h(i).
• The mutually reinforcing relationship of the two scores is represented as follows:
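In the notation above, with E the edge set of G and L its adjacency matrix, the relationship can be written as below. This is the standard form of the HITS updates, given as a reconstruction rather than a copy of the slide; it assumes that L_{ij} = 1 when page i links to page j and 0 otherwise.

```latex
\[
a(i) = \sum_{(j,i) \in E} h(j), \qquad
h(i) = \sum_{(i,j) \in E} a(j)
\]

% Matrix form, with column vectors
% a = (a(1), ..., a(n))^T and h = (h(1), ..., h(n))^T:
\[
\mathbf{a} = L^{T}\mathbf{h}, \qquad
\mathbf{h} = L\,\mathbf{a}
\]
```

The first sum runs over the pages j that link to page i, the second over the pages j that page i links to.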
The HITS algorithm
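A minimal sketch of the iteration in matrix form, assuming L[i][j] = 1 when page i links to page j, that both score vectors are normalised to sum to 1 after every step, and that iteration stops once the scores change by less than a small tolerance. The function name `hits_power_iteration` and the defaults `tol` and `max_iter` are illustrative choices, not values from the slides.

```python
def hits_power_iteration(L, tol=1e-8, max_iter=1000):
    """HITS on adjacency matrix L (L[i][j] = 1 if page i links to page j).

    Returns (a, h): authority and hub score lists, each summing to 1.
    tol and max_iter are illustrative defaults.
    """
    n = len(L)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(max_iter):
        # a = L^T h : a[i] sums h[j] over the pages j that link to i.
        new_a = [sum(L[j][i] * h[j] for j in range(n)) for i in range(n)]
        # h = L a : h[i] sums a[j] over the pages j that i links to.
        new_h = [sum(L[i][j] * new_a[j] for j in range(n)) for i in range(n)]
        # Normalise so each vector sums to 1 (guard against all-zero vectors).
        sum_a, sum_h = sum(new_a) or 1.0, sum(new_h) or 1.0
        new_a = [x / sum_a for x in new_a]
        new_h = [x / sum_h for x in new_h]
        # Total absolute change of both vectors since the previous step.
        change = (sum(abs(x - y) for x, y in zip(new_a, a))
                  + sum(abs(x - y) for x, y in zip(new_h, h)))
        a, h = new_a, new_h
        if change < tol:
            break
    return a, h


# Four-page graph (rows and columns in the order a, b, c, d), with edges
# a -> b, c, d; b -> a, c, d; c -> d; d has no out-links.
L_example = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
authority_scores, hub_scores = hits_power_iteration(L_example)
print([round(x, 3) for x in authority_scores])
print([round(x, 3) for x in hub_scores])
```

Stopping on a tolerance rather than a fixed number of steps matches the "iterate until convergence" description; other normalisations (for example, unit Euclidean length) are also common and give the same ranking.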
How is HITS used

• HITS is search-query dependent.
• When the user issues a search query, HITS first expands the list of relevant pages returned by a search engine and then produces two rankings of the expanded set of pages: an authority ranking and a hub ranking.
• Which ranking should be considered?
How is HITS used

• Finding communities on the Web
• Community detection: an important research domain
• Applications in all sorts of research domains
• Web
• Marketing
• Social Issues
TASKS [Basics before Implementation] (Optional)
• First think of a scenario where the basic idea of HITS applies
• Recall the basic idea
• Incoming links are as important as outgoing links
• Purely connected graph, not sparse

• You need to define and find any one scenario
• Entities [Vertices]
• Relationships [Edges]
• What type of relationship
• Why incoming
• Why outgoing

• Make it a sample graph
• Define the sets of vertices and edges
• Define hubs and authorities
Issues of HITS

• What is the main flaw in HITS?
• Out-links are not as important as in-links
• What is a possible solution?
• How can in-links be given more importance?


Any Questions?
