
Link Analysis 1
EE412: Foundation of Big Data Analytics
Jaemin Yoo

Announcements
• Homeworks:
  • HW2 (due: 11/08)
  • HW3 (will be posted on 11/06)
  • Note: Each homework has its own claim session.
• Textbook vs. slides:
  • Prioritize the slides over the textbook.

Recap
• UV Decomposition
• UV Decomposition: Computation
• UV Decomposition: Variants
[Figure: R (m × n) ≈ U (m × k) × Vᵀ (k × n); first-order approximation f(y) + ∇f(y) used in the computation]

Outline
1. Web Search as a Graph
2. PageRank
3. PageRank: Implementation
Graphs
• Data structure that represents connections and relationships.
• Consists of nodes and edges.
• Can be directed or undirected.
• Represented as a sparse adjacency matrix (sketch below).
[Figure: example graph]
Source: GeeksforGeeks

Graph Data: Social Networks
[Figure: social network graph]
Source: [Backstrom et al., 2011]
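Since the slides mention representing a graph as a sparse adjacency matrix, here is a minimal sketch of one way to do it. The use of scipy.sparse and the tiny example graph are my own assumptions, not something specified in the lecture.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Hypothetical directed edges (source, destination) of a tiny graph.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
n = 3  # number of nodes

rows = [dst for _, dst in edges]
cols = [src for src, _ in edges]
data = np.ones(len(edges))

# Sparse adjacency matrix with A[dst, src] = 1 iff there is an edge src -> dst;
# only the nonzero entries are stored.
A = csc_matrix((data, (rows, cols)), shape=(n, n))
print(A.toarray())
```

Storing only the nonzeros is what makes web-scale graphs tractable, which is also why the PageRank implementation later in these slides keeps M sparse.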

Graph Data: Communication
[Figure: communication network of routers connecting domain1, domain2, and domain3]
Source: Stanford CS246

Web as a Graph
• Web is represented as a directed graph.
• Each web page is a node.
• There is an edge if there is a hyperlink from page p_i to page p_j.
[Figure: example pages as nodes ("I teach a class on Networks.", CS224W, Computer Science Department, Stanford University, Gates building) connected by hyperlinks]
Source: Stanford CS246


Challenges in Web Search
Two challenges of web search:
1. Who to trust?
  • Web contains many sources of information.
  • Idea: Trustworthy pages may point to each other.
2. What is the best answer to each query?
  • No single right answer.
  • Idea: Pages that know about 𝑋 might be pointing to many pages about 𝑋.

Early Search Engines
• Many search engines before Google used an inverted index (sketch below).
  • Data structure that makes it easy to find all pages containing a term.
  • Given a search query, pages with those terms are extracted and ranked.
  • Page is more relevant if a term occurs frequently.
[Figure: inverted index mapping terms such as "cat" and "dog" through buckets to documents like "...the cat is fat...", "...raining cats and dogs...", "...the dog is eating..."]
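As a rough illustration of the inverted-index idea, the sketch below builds a term-to-documents mapping over the toy documents from the figure and answers a query by intersecting posting lists. The data structures and function names are illustrative assumptions, not the actual design of any early search engine.

```python
from collections import defaultdict

docs = {
    1: "the cat is fat",
    2: "raining cats and dogs",
    3: "the dog is eating",
}

# Inverted index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    result = index[terms[0]].copy() if terms else set()
    for term in terms[1:]:
        result &= index[term]
    return result

print(search("cat"))     # {1}
print(search("the is"))  # {1, 3}
```

Ranking by term frequency, as the slide describes, would then sort the retrieved documents by how often the query terms occur in them.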

Early Search Engines
• Unethical people started to fool search engines.
  • For example, add the term "cat" thousands of times.
  • Make the term the same color as the background.
  • Or search for "cat," copy a highly ranked page, and make the copied text invisible.
[Figure: a spam page stuffed with the term "cat" entering the inverted index alongside legitimate documents]
Source: Stanford CS246

Ranking Nodes on the Graph
• Observation: Not all web pages are equally important.
  • There is large diversity in node connectivity in web graphs.
• Google introduced PageRank: Let's rank the pages by their links.
Outline
1. Web Search as a Graph
2. PageRank
3. PageRank: Implementation

Intuition 1: Links as Votes
• Idea: Consider links as "votes" for importance.
• A page is more important if it has more incoming links.
  • www.stanford.edu has 23,400 in-links.
  • www.joe-schmoe.com has 1 in-link.
• Are all in-links equal?
  • Recursive question: Links from important pages count more.
  • PageRank is the converged state of page importance.

Intuition 2: Random Surfing
• Web pages are important if people visit them a lot.
• However, we can't watch everybody using the Web.
• A good surrogate is the random surfer model (simulated in the sketch below):
  • Start at a random page and follow random out-links repeatedly.
  • Assume that people follow links randomly.
  • PageRank is the probability of being at a page at any time.

Example: PageRank Scores
[Figure: example web graph with PageRank scores A = 3.3, B = 38.4, C = 34.3, D = 3.9, E = 8.1, F = 3.9, and five peripheral pages with score 1.6 each]
Source: Stanford CS246
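To make the random-surfer intuition concrete, here is a small simulation sketch on the y/a/m example graph that appears later in these slides; the step count and seed are arbitrary choices of mine. The fraction of time the surfer spends on each page approximates its PageRank.

```python
import random

# The y/a/m example used later in the slides: node -> list of out-links.
graph = {
    "y": ["y", "a"],
    "a": ["y", "m"],
    "m": ["a"],
}

def simulate_surfer(graph, steps=100_000, seed=0):
    """Estimate visit frequencies by repeatedly following random out-links."""
    rng = random.Random(seed)
    counts = {node: 0 for node in graph}
    current = rng.choice(list(graph))
    for _ in range(steps):
        counts[current] += 1
        current = rng.choice(graph[current])  # follow a random out-link
    return {node: c / steps for node, c in counts.items()}

print(simulate_surfer(graph))  # roughly {'y': 0.4, 'a': 0.4, 'm': 0.2}
```

The same distribution falls out of the flow equations and power iteration introduced in the next slides.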


Recursive Formulation
• Each link's vote is proportional to the importance of its source page.
• If page j with importance r_j has n out-links, each link gets r_j / n votes.
• Page j's own importance is the sum of the votes on its in-links.
[Figure: page j receives r_i / 3 from page i and r_k / 4 from page k, so r_j = r_i / 3 + r_k / 4; each of j's three out-links carries r_j / 3]
Source: Stanford CS246

Matrix Formulation
• Define a transition matrix M of size n × n from the graph (construction sketched below).
  • Let page i have d_i out-links.
  • If i → j, then M_ji = 1 / d_i; otherwise M_ji = 0.
• M is a column-stochastic matrix.
  • Every entry is nonnegative, and each column sums to 1.
• Define a rank vector r of size n.
  • r_i: Importance score of page i, where Σ_i r_i = 1.
• The recursive flow equation can be written as r = Mr.
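A minimal sketch of building the column-stochastic transition matrix M from an edge list. NumPy and the variable names are my own choices; the y/a/m node numbering matches the example on the next slide.

```python
import numpy as np

def transition_matrix(edges, n):
    """Build M with M[j, i] = 1 / d_i if i -> j, and 0 otherwise."""
    M = np.zeros((n, n))
    out_degree = np.zeros(n)
    for src, _ in edges:
        out_degree[src] += 1
    for src, dst in edges:
        M[dst, src] = 1.0 / out_degree[src]
    return M

# The y/a/m example: y = 0, a = 1, m = 2.
edges = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)]
M = transition_matrix(edges, 3)
print(M)               # matches the matrix on the next slide
print(M.sum(axis=0))   # each column sums to 1
```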

Example: Matrix Formulation
• Graph: y links to y and a; a links to y and m; m links to a.

        y    a    m
   y    ½    ½    0
   a    ½    0    1
   m    0    ½    0

• Flow equations (r = Mr):
  r_y = r_y / 2 + r_a / 2
  r_a = r_y / 2 + r_m
  r_m = r_a / 2
Source: Stanford CS246

Eigenvector Formulation
• Observation: Rank vector r is an eigenvector of M (numerical check below).
  • r = Mr matches the definition of an eigenpair (Ax = λx) with λ = 1.
  • r is the dominant (or principal) eigenvector, since every eigenvalue of a stochastic matrix has absolute value at most 1.
• Can find the dominant eigenvector with power iteration!
  • Power iteration works only for diagonalizable matrices.
  • Stochastic matrices are diagonalizable in most cases (discussed later).
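A quick numerical check of the eigenvector view on the y/a/m example above. NumPy's dense eigendecomposition is used only because the example is tiny; this is not how PageRank is computed at scale.

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

eigvals, eigvecs = np.linalg.eig(M)
k = np.argmax(eigvals.real)      # index of the dominant eigenvalue (λ = 1)
r = eigvecs[:, k].real
r = r / r.sum()                  # rescale so the entries sum to 1
print(eigvals.real[k])           # ~1.0
print(r)                         # ~[0.4, 0.4, 0.2] for pages y, a, m
```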


Power Iteration
• Power iteration finds the dominant eigenvector as follows (sketch below):
  • Initialize r^(0) = (1/N, 1/N, ⋯, 1/N).
  • Iterate r^(t+1) = M r^(t) for t = 0, ⋯, T.
  • Stop when ||r^(t+1) − r^(t)|| is small enough.
• Can be seen as modeling the movement of random surfers.
  • Start from any stochastic vector r^(0).
  • The limit M(M(⋯ M(M r^(0)))) is the long-term distribution of the surfers.
  • If r is the limit of MM⋯Mu, then r satisfies the equation r = Mr.

Why Power Iteration Works?
• Define a sequence r^(0), r^(1), ⋯, r^(k) of rank vectors as follows:
  r^(1) = M r^(0)
  r^(2) = M r^(1) = M² r^(0)
  ⋯
  r^(k) = M^k r^(0)
• Claim: The sequence approaches the dominant eigenvector of M.
• Proof: See the next page.
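A minimal power-iteration sketch matching the pseudocode above, run on the y/a/m example; the tolerance, iteration cap, and names are my own.

```python
import numpy as np

def power_iteration(M, tol=1e-10, max_iters=1000):
    """Iterate r <- M r from the uniform vector until r stops changing."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iters):
        r_next = M @ r
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))  # ~[0.4, 0.4, 0.2]
```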

Why Power Iteration Works?
• Assume that M has n linearly independent eigenvectors.
  • x_1, ⋯, x_n with corresponding eigenvalues λ_1, ⋯, λ_n, where λ_t > λ_{t+1}.
  • This is true when M is a diagonalizable matrix.
• Vectors x_1, x_2, ⋯, x_n form a basis and thus we can write
  r^(0) = c_1 x_1 + c_2 x_2 + ⋯ + c_n x_n.
• Then,
  M r^(0) = M(c_1 x_1 + c_2 x_2 + ⋯ + c_n x_n)
          = c_1 M x_1 + c_2 M x_2 + ⋯ + c_n M x_n
          = c_1 λ_1 x_1 + c_2 λ_2 x_2 + ⋯ + c_n λ_n x_n.
• If we repeat: M^k r^(0) = c_1 λ_1^k x_1 + c_2 λ_2^k x_2 + ⋯ + c_n λ_n^k x_n.

Why Power Iteration Works?
• M^k r^(0) = c_1 λ_1^k x_1 + c_2 λ_2^k x_2 + ⋯ + c_n λ_n^k x_n
            = λ_1^k (c_1 x_1 + (λ_2/λ_1)^k c_2 x_2 + ⋯ + (λ_n/λ_1)^k c_n x_n)
• Since λ_1 > λ_2 > ⋯ > λ_n, all fractions λ_2/λ_1, ⋯, λ_n/λ_1 are in (−1, +1).
• Since (λ_i/λ_1)^k → 0 as k → ∞, we conclude M^k r^(0) → c_1 λ_1^k x_1.
• May not converge if λ_1 = λ_2 (discussed later).
PageRank for Undirected Graphs
• Given an undirected graph with n nodes and m edges.
  • Nodes are pages and edges are hyperlinks.
• Claim: For any node v, r_v = d_v / 2m is a solution (numerical check after the outline below).
• Proof: Substitute r_u with d_u / 2m in the equation r = Mr.

Outline
1. Web Search as a Graph
2. PageRank
3. PageRank: Implementation
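Returning to the claim above (r_v = d_v / 2m for undirected graphs), here is a small numerical check on a made-up graph, with each undirected edge treated as two directed links; NumPy and the specific edges are my own choices.

```python
import numpy as np

# Hypothetical undirected graph with n = 4 nodes and m = 4 edges.
undirected_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n, m = 4, len(undirected_edges)

# Each undirected edge becomes two directed links.
directed = undirected_edges + [(v, u) for u, v in undirected_edges]
deg = np.zeros(n)
for u, _ in directed:
    deg[u] += 1

M = np.zeros((n, n))
for u, v in directed:
    M[v, u] = 1.0 / deg[u]

r = deg / (2 * m)             # claimed solution: r_v = d_v / 2m
print(np.allclose(M @ r, r))  # True: r satisfies r = Mr
```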

Two Problems in PageRank
• Dead ends: Some pages have no out-links.
  • Random walker has nowhere to go.
  • Such pages cause importance to leak out.
• Spider traps: All out-links are within the group.
  • Random walker gets stuck in a trap.
  • Eventually, spider traps absorb all importance.

Problem: Dead Ends
• Power iteration (demonstrated in the sketch below):
  • Set r_j = 1/N.
  • Update r_j ← Σ_{i→j} r_i / d_i iteratively.
• Example: y, a, m, where m is a dead end (its column is all zeros).

        y    a    m
   y    ½    ½    0
   a    ½    0    0
   m    0    ½    0

  r_y = r_y / 2 + r_a / 2
  r_a = r_y / 2
  r_m = r_a / 2

  r_y   1/3   2/6   3/12   5/24       0
  r_a = 1/3   1/6   2/12   3/24   ⋯   0
  r_m   1/3   1/6   1/12   2/24       0

• Here the PageRank leaks out since the matrix is not stochastic.
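A quick sketch reproducing the leak on this dead-end example: because the column for m is all zeros, the total mass Σ_i r_i shrinks toward 0 instead of staying at 1.

```python
import numpy as np

# y, a, m; m is a dead end, so its column is all zeros (M is not stochastic).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])

r = np.full(3, 1 / 3)
for _ in range(5):
    r = M @ r
    print(r, "sum =", r.sum())  # r_y follows 2/6, 3/12, 5/24, ... and the total keeps shrinking
```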


Solution: Always Teleport
• Follow random teleport links with probability 1 from dead ends.
• Adjust the transition matrix accordingly.
• Example: the all-zero column of the dead end m is replaced by 1/3 in every entry.

        y    a    m              y    a    m
   y    ½    ½    0         y    ½    ½    ⅓
   a    ½    0    0    →    a    ½    0    ⅓
   m    0    ½    0         m    0    ½    ⅓

Problem: Spider Traps
• Power iteration:
  • Set r_j = 1/N.
  • Update r_j ← Σ_{i→j} r_i / d_i iteratively.
• Example: m is a spider trap (its only out-link points to itself).

        y    a    m
   y    ½    ½    0
   a    ½    0    0
   m    0    ½    1

  r_y = r_y / 2 + r_a / 2
  r_a = r_y / 2
  r_m = r_a / 2 + r_m

  r_y   1/3   2/6   3/12    5/24       0
  r_a = 1/3   1/6   2/12    3/24   ⋯   0
  r_m   1/3   3/6   7/12   16/24       1

• All PageRank scores are trapped in node m.

Solution: Probabilistically Teleport
• Teleports: At each step, the random surfer has two options:
  • With probability β, follow a link at random.
  • With probability 1 − β, jump to some random page.
  • β is typically in the range 0.8 to 0.9.
• Surfer will teleport out of a trap within a few time steps.
[Figure: the y/a/m example graph, shown before and after adding random teleport links]

Google's Solution to Both Problems
• Google's solution:
  • Fill in the "empty" columns of M with 1/N.
  • Random surfer has two options at each step:
    • With β, follow a link at random; with 1 − β, jump to some random page.
• PageRank equation [Brin and Page, 1998] (implementation sketch below):

  r_j = Σ_{i→j} ( β r_i / d_i ) + (1 − β) / N

  • d_i is the out-degree of node i.
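A minimal sketch of the teleported PageRank update r = βMr + (1 − β)/N, applied to the y/a/m spider-trap example with β = 0.8 so it can be compared against the numbers on the next slide; names and tolerances are mine.

```python
import numpy as np

def pagerank(M, beta=0.8, tol=1e-10, max_iters=1000):
    """Iterate r <- beta * M r + (1 - beta)/N until convergence."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iters):
        r_next = beta * (M @ r) + (1.0 - beta) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# y/a/m example where m is a spider trap (m links only to itself).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank(M))  # ~[0.212, 0.152, 0.636] = (7/33, 5/33, 21/33)
```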


The Google Matrix
• We have the Google matrix A:

  A = βM + (1 − β) [1/N]_{N×N}

  where [1/N]_{N×N} is the N × N matrix with all entries equal to 1/N.
• M is preprocessed to be column-stochastic.
• In practice, β = 0.8 or 0.9 (surfer jumps every 5 to 10 steps).
• Note: A is stochastic, diagonalizable, and satisfies λ_1 > λ_2.
  • λ_1 and λ_2 are the two largest eigenvalues.

Random Teleports (β = 0.8)
• On the y/a/m spider-trap example:

              M                  [1/N]_{N×N}
        1/2  1/2   0            1/3  1/3  1/3
  0.8 × 1/2   0    0   + 0.2 ×  1/3  1/3  1/3
         0   1/2   1            1/3  1/3  1/3

                  A
    y   7/15   7/15    1/15
    a   7/15   1/15    1/15
    m   1/15   7/15   13/15

[Figure: the y/a/m graph with each transition of A labeled by its probability (7/15, 1/15, 13/15, ...)]

• Power iteration converges to r = (7/33, 5/33, 21/33):

    y     1/3   0.33   0.28   0.26        7/33
    a  =  1/3   0.20   0.20   0.18   ⋯    5/33
    m     1/3   0.46   0.52   0.56       21/33
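A sketch constructing the Google matrix for this example and reproducing the numbers above; NumPy and the variable names are my own.

```python
import numpy as np

beta = 0.8
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
N = M.shape[0]

# Google matrix: follow a link with probability beta, teleport with 1 - beta.
A = beta * M + (1 - beta) * np.ones((N, N)) / N
print(A * 15)  # entries (in fifteenths): [[7, 7, 1], [7, 1, 1], [1, 7, 13]]

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r
print(r * 33)  # ~[7, 5, 21], i.e., r = (7/33, 5/33, 21/33)
```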

Computation of PageRank
• For computation, it is inefficient to explicitly create A from M.
  • A is a dense matrix, while M is a sparse matrix.
  • Creating a dense matrix for the entire web is almost impossible.
• Rearrange the PageRank equation into

  r = βMr + [(1 − β)/N]_N

  where the second term is a vector whose N entries all equal (1 − β)/N.
• Core operation is the sparse matrix-vector multiplication.

Sparse Matrix Encoding
• Encode the sparse matrix βM using only the nonzero entries.
• Space is roughly proportional to the number of links.
• Still won't fit in memory for a large graph, but will fit on disk.
[Figure: per-source encoding listing, for each source node, its out-degree and its destination nodes]
Source: Stanford CS246


Basic Algorithm
• Assume that we have enough memory to store r^new.
  • Store the previous rank vector r^old and the matrix M on disk.
• For each source page, update r^new of all destination pages.

  source   degree   destinations
     0        3      1, 5, 6
     1        4      17, 64, 113, 117
     2        2      13, 23

[Figure: entries of r^old (indexed 0–6) are read from disk and pushed into the corresponding entries of r^new]

Basic Algorithm (for r = βMr + (1 − β)/N)
• Each step of the power iteration is (runnable sketch below):

  Initialize all entries of r^new to (1 − β)/N
  For each page i (with out-degree d_i):
      Read into memory: i, d_i, dest_1, ⋯, dest_{d_i}, r^old(i)
      For j = 1, ⋯, d_i:
          r^new(dest_j) += β r^old(i) / d_i

• Only one full disk access is required for each iteration.
• Still slow? MapReduce was originally designed for PageRank.
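A minimal in-memory sketch of this update loop. The per-source adjacency data mirrors the encoding above, but the node ids and links are hypothetical so the example stays self-contained, and the disk reads and MapReduce variant are left out.

```python
import numpy as np

beta, N = 0.8, 7

# Per-source encoding: source node -> list of destination nodes.
# (The out-degree d_i is just the length of each list.)
links = {0: [1, 5, 6], 1: [2, 3, 4], 2: [0, 6],
         3: [0], 4: [2], 5: [0], 6: [1, 2]}

r_old = np.full(N, 1.0 / N)
for _ in range(50):
    r_new = np.full(N, (1.0 - beta) / N)      # initialize with the teleport term
    for i, dests in links.items():
        share = beta * r_old[i] / len(dests)  # beta * r_old(i) / d_i
        for j in dests:
            r_new[j] += share
    r_old = r_new

print(r_old, r_old.sum())  # PageRank scores; total stays 1 (no dead ends here)
```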

Summary
1. Web Search as a Graph
2. PageRank
• Recursive formulation
• Power iteration
3. PageRank: Implementation
• Spider traps
• Dead ends
• Random teleports
• Sparse matrix computation

