Lecture 7

Information Resource Management

Lecture 7: Web Search (I)

Dr. YI Cheng (易成)


School of Economics and Management
Apr 15, 2024
Previous Lectures
• Information Organization
– Metadata
– Information Categorization and Subject Analysis
– Controlled Vocabularies
– Classification
• Information Retrieval
– Text Processing and Document Indexing
– Classical Retrieval Models
– Evaluation and Relevance Feedback

Dr. Yi, C., Tsinghua SEM 2


Outline of Lecture 7
• Web search
– Link analysis
• HITS
• PageRank



How to Organize the Web?
• First try: Human-curated Web directories (for browsing)
– Yahoo, DMOZ, LookSmart
• Second try: Web search
– Classical information retrieval finds relevant documents in a small, trusted collection
• Articles, patents, etc.
– But the Web is huge and full of untrusted documents, random content, web spam, etc.

Basic Web Search Engine
[Figure not recovered: diagram of a basic web search engine with numbered components.]


1. Web Crawler
• Web crawler (figure not recovered)


2. Indexes for Web Search Engines
❖Indexer: parses pages and separates text (to the forward index), links (to anchors), and essential text info (to the inverted index)
❖Text in an anchor is very relevant for the destination page
❖Font and placement in the page make some terms extra relevant
❖Forward index: docid → list of terms appearing in docid
❖Inverted index: term → list of docids containing term
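The two index definitions can be illustrated with a minimal Python sketch. The function name `build_indexes`, the whitespace tokenizer, and the two sample documents are assumptions made for illustration; a real indexer would also handle anchors, fonts, and term positions.

```python
from collections import defaultdict

def build_indexes(docs):
    """docs maps docid -> text. Returns (forward_index, inverted_index)."""
    forward_index = {}                  # docid -> list of terms appearing in docid
    inverted_index = defaultdict(list)  # term -> list of docids containing term
    for docid in sorted(docs):
        # Assumed tokenizer: lowercase and split on whitespace.
        terms = docs[docid].lower().split()
        forward_index[docid] = terms
        # Add each distinct term of this document once to its posting list.
        for term in dict.fromkeys(terms):
            inverted_index[term].append(docid)
    return forward_index, dict(inverted_index)

# Hypothetical sample documents.
docs = {1: "web search engines", 2: "web crawler"}
forward_index, inverted_index = build_indexes(docs)
print(forward_index)
print(inverted_index)
```

Answering a query term is then a single lookup in `inverted_index`, which is why the inverted index is the structure used at query time.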
2. Indexes for Web Search Engines
❖Partition the indexes across different machines
❖Each machine handles different parts of the data
❖Duplicate the data across many machines
❖Queries are distributed among the machines
❖Most do a combination of these



Search Engine Querying
In this example (figure not recovered), the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
• Each row can handle 120 queries per second
• Each column can handle 7M pages
• To handle more queries, add another row.
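The capacity arithmetic of this grid can be sketched as follows. The per-row and per-column figures come from the slide; the function name `cluster_capacity` and the 3 x 4 grid in the example are assumptions for illustration.

```python
def cluster_capacity(rows, columns, qps_per_row=120, pages_per_column=7_000_000):
    """Return (machines, queries per second, pages indexed) for a rows x columns grid.

    Each column holds one partition of the pages; each row is one copy of
    every partition and serves its own share of the query load.
    """
    machines = rows * columns
    total_qps = rows * qps_per_row
    total_pages = columns * pages_per_column
    return machines, total_qps, total_pages

# Hypothetical grid: 3 rows, 4 columns.
print(cluster_capacity(3, 4))
# Adding a row raises query throughput but not the number of pages covered.
print(cluster_capacity(4, 4))
```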



3. Ranking Algorithms for Web Search
• Standard IR models apply but are not
sufficient
– Different information needs
– Documents have additional information (content +
structure)
– Information quality varies a lot
• Major extensions
– Exploiting links (i.e., structure) to improve scoring
– Exploiting clickthroughs for massive implicit
feedback
3. Ranking Algorithms for Web Search
❖Varies by search engine
❖Pretty messy in many cases
❖Details usually proprietary and fluctuating
❖Combining subsets of:
❖Term frequencies
❖Term proximities
❖Term position (title, top of page, etc.)
❖Term characteristics (boldface, capitalized, etc.)
❖Link analysis information
❖Category information
❖Popularity information
Links
• Links are a key component of the Web
• Important for navigation, but also for search
– e.g., <a href="http://example.com">Example website</a>
– “Example website” is the anchor text
– “http://example.com” is the destination link
– both are used by search engines
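As a rough illustration of how a crawler might pull both pieces out of a page, here is a minimal sketch using Python's standard `html.parser` module. The class name `AnchorExtractor` is an assumption, and the sketch ignores nested or malformed tags.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (destination link, anchor text) pairs from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, anchor_text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and repeated spaces inside the anchor text.
            anchor_text = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor_text))
            self._href = None

parser = AnchorExtractor()
parser.feed('<a href="http://example.com" >Example\nwebsite</a>')
print(parser.links)
```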



Anchor Text
• Anchor text tends to be short, descriptive, and
similar to query text
• Used as a description of the content of the
destination page
• Retrieval experiments have shown that anchor
text has significant impact on effectiveness for
some types of queries
– Home pages for a particular topic, person, or
organization



Links
• When page A links to page B, this means
– A’s author thinks that B’s content is interesting or
important
– So a link from A to B adds to B’s reputation
• But not all links are equal…
– If A is very important, then A → B counts more!
– If A is not important, then A → B counts less
• Two algorithms based on this idea
– PageRank (Brin and Page, Oct 1998)
– HITS (Kleinberg, Apr 1998)
Link Analysis
• Link analysis using hubs and authorities
– HITS (hyperlink-induced topic search)
• Link analysis based on random surfer model
– PageRank



Link Analysis using Hubs and
Authorities
• Voting by inlinks
– Collect a large sample (e.g., 200) of relevant pages on a
topic (i.e., root set) and other pages (e.g., 50) that link to
them (i.e., base set)
– Purpose: find the highest-quality pages
– Use inlinks (votes) to assess the authority of a page
• How can we make deeper use of the network structure than just counting in-links?
– What if there is not necessarily a single, intuitively “best” answer?



Is there a “best” answer?
E.g., a broad-topic query such as “newspapers”, or a search for particular products to purchase



Link Analysis using Hubs and
Authorities
• Another kind of useful answer to a broad-topic query: pages that compile lists of resources relevant to the topic
– Score these pages as lists
– A page’s value as a list is equal to the sum of the votes received by all pages that it voted for
• Authorities: answer pages to the queries
• Hubs: lists that point to answer pages



E.g., finding good lists for the query “newspapers” (figure not recovered)
• Each page’s value as a list is written as a number inside it: the sum of the votes received by all pages it voted for



Link Analysis using Hubs and
Authorities
• A list-finding technique
– If pages scoring well as lists actually have a better sense of where the good results are, then we should weight their votes more heavily
– Give each page’s vote a weight equal to its value as a list



E.g., re-weighting votes for the query “newspapers” (figure not recovered)
• Each labeled page’s new score is equal to the sum of the values of all lists that point to it.



Link Analysis using Hubs and
Authorities
• Repeated improvement
– If we have better votes on the right-hand-side
pages, use these to get still more refined values
for the quality of the left-hand-side lists
– Reweight the votes of the right-hand-side pages
again



Link Analysis using Hubs and
Authorities
• For each page p, we assign it two numerical
scores: auth(p) and hub(p)
• Authority Update Rule: For each page p,
update auth(p) to be the sum of the hub
scores of all pages that point to it.
• Hub Update Rule: For each page p, update
hub(p) to be the sum of the authority scores
of all pages that it points to.



Link Analysis using Hubs and
Authorities
• Apply these rules in alternating fashion
– We start with all hub scores and all authority scores
equal to 1.
– We choose a number of steps k.
– We then perform a sequence of k hub-authority
updates.
• First apply the Authority Update Rule to the current set of
scores.
• Then apply the Hub Update Rule to the resulting set of
scores.
– Normalize the scores
• Divide each authority (hub) score by the sum of all authority (hub) scores
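The alternating procedure above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `hits`, the dictionary graph format, and the tiny example graph are assumptions, and the guard against an all-zero total is an added safety measure not mentioned on the slide.

```python
def hits(out_links, k):
    """Run k hub-authority updates, then normalize.

    out_links maps each page to the list of pages it points to.
    Returns (hub, auth) dictionaries of normalized scores.
    """
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    # Start with all hub scores and all authority scores equal to 1.
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority Update Rule: sum of the hub scores of all pages pointing to p.
        auth = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub Update Rule: sum of the authority scores of all pages p points to.
        hub = {p: sum(auth[q] for q in out_links.get(p, ())) for p in pages}
    # Normalize: divide each score by the sum of all scores of its kind.
    auth_total = sum(auth.values()) or 1.0  # assumed guard for graphs without links
    hub_total = sum(hub.values()) or 1.0
    auth = {p: s / auth_total for p, s in auth.items()}
    hub = {p: s / hub_total for p, s in hub.items()}
    return hub, auth

# Hypothetical graph: two list pages (h1, h2) pointing to two answer pages.
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]}, k=1)
print(hub)
print(auth)
```

For large k the raw scores grow quickly, so a practical variant would normalize after every step rather than only at the end; the normalized values are the same either way.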
E.g., normalizing votes for the query “newspapers” (figure not recovered)



What happens when k gets larger and larger?
The normalized values stabilize (as k goes to infinity)
E.g., limiting hub and authority values for the query “newspapers” (figure not recovered)



Examples: HITS
[Worked-example figures on slides 29 to 52 not recovered; the last slide shows the scores as k goes to infinity.]


Link Analysis using Hubs and
Authorities
• Strengths
– Able to rank pages according to the query topic (for
broad-topic queries)
• Weaknesses
– Authority and hub rankings are query dependent,
hence minor changes to the web could significantly
change the scores (computed online, not offline)
– Cannot detect advertisements
– Can be easily spammed (can easily add out-links to a
good site)



Link Analysis
• Link analysis using hubs and authorities
– HITS (hyperlink-induced topic search)
• Link analysis based on random surfer model
– PageRank



Link Analysis: PageRank
• The intuition behind hubs and authorities:
pages play multiple roles in the network
– For queries with a commercial aspect
• But in other settings, “endorsement” is best
viewed as passing directly from one page to
another
– Dominant mode of endorsement among academic
or governmental pages, personal pages, scientific
literature, etc.
– Forms the basis for PageRank



Link Analysis: PageRank
• PageRank of a page is the probability that a
“random surfer” will be looking at that page
– Links from popular pages will increase PageRank of
pages they point to
– Assumption: If the pages pointing to this page are
good, then this is also a good page
• Why does this work?
– The official Toyota site will be linked to by lots of
other official (or high-quality) sites
– The best Toyota fan-club site probably also has many
links pointing to it
– Lower-quality sites do not have as many high-quality sites linking to them
Link Analysis: PageRank

• PageRank (PR) of page C = PR(A)/2 + PR(B)/1
• More generally,
$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$$
– where B_u is the set of pages that point to u, and L_v is the number of outgoing links from page v (not counting duplicate links)
Link Analysis: PageRank
• Don’t know PageRank values at start
• Assume equal values (1/3 in this case), then
repeated improvement:
– first iteration: PR(A) = 0.33, PR(B) = 0.17, PR(C) =
0.33/2 + 0.33 = 0.5
– second: PR(A) = 0.5, PR(B) = 0.17, PR(C) = 0.33/2 +
0.17 = 0.33
– third: PR(A) = 0.33, PR(B) = 0.25, PR(C) = 0.42
• Converges to PR(A) = 0.4, PR(B) = 0.2, PR(C) = 0.4 (total PR in the network remains constant)
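The iterations above can be reproduced with a short Python sketch. The link structure (A → B, A → C, B → C, C → A) is inferred from the update PR(C) = PR(A)/2 + PR(B)/1 and the listed values, since the slide's figure is not available; treat it and the name `pagerank_step` as assumptions.

```python
# Assumed link structure, inferred from the iteration values on the slide.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank_step(pr, out_links):
    """One update: each page v passes PR(v) / L_v to every page it links to."""
    new_pr = {page: 0.0 for page in pr}
    for v, targets in out_links.items():
        share = pr[v] / len(targets)
        for u in targets:
            new_pr[u] += share
    return new_pr

# Start from equal values (1/3 each) and print the first three iterations.
pr = {page: 1 / 3 for page in out_links}
for i in range(1, 4):
    pr = pagerank_step(pr, out_links)
    print(i, {page: round(score, 2) for page, score in pr.items()})

# Keep iterating until the values settle.
for _ in range(100):
    pr = pagerank_step(pr, out_links)
print({page: round(score, 2) for page, score in pr.items()})
```

Because every page here has at least one outgoing link, each step only redistributes PageRank, so the total stays at 1.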
Exercise

• All pages start out with a PageRank of 1/8, then:



Exercise
• Equilibrium PageRank values for the network



PageRank: Spider Traps



PageRank: Dead Ends



PageRank Problems
• Getting stuck on pages that
– do not have links (dead ends)
• May also link to pages that have not yet been crawled
• Called dangling links
• Such pages cause PageRank to “leak out”
– have links forming a loop (spider traps)
• Eventually spider traps absorb all PageRank
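Both failure modes can be seen on tiny graphs with the basic update rule (no random jumps). The two graphs and the function name `iterate` below are hypothetical examples chosen for illustration.

```python
def iterate(out_links, steps):
    """Apply the basic PageRank update `steps` times, starting from equal values."""
    pages = list(out_links)
    pr = {p: 1 / len(pages) for p in pages}
    for _ in range(steps):
        new_pr = {p: 0.0 for p in pages}
        for v, targets in out_links.items():
            for u in targets:
                new_pr[u] += pr[v] / len(targets)
        pr = new_pr
    return pr

# Dead end: B has no outgoing links, so whatever reaches B is lost.
dead_end = iterate({"A": ["B"], "B": []}, steps=2)
print(sum(dead_end.values()))   # total PageRank left in the network

# Spider trap: B and C link only to each other, so nothing flows back to A.
trap = iterate({"A": ["B"], "B": ["C"], "C": ["B"]}, steps=10)
print(trap["A"], trap["B"] + trap["C"])
```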



Solution: Random Surfer Model
• Browse the Web using the following
algorithm:
– Choose a random number r between 0 and 1
– If r < λ:
• Go to a random page (i.e., teleport)
– If r ≥ λ:
• Click a link at random on the current page
– Start again
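The browsing procedure can be simulated directly. In this sketch the function name `simulate_surfer`, the fixed seed, and the example graph are assumptions; teleporting whenever the current page has no links is an added assumption to keep the walk going, since the slide does not say what the surfer does at a dead end.

```python
import random

def simulate_surfer(out_links, steps, lam=0.15, seed=0):
    """Estimate each page's visit frequency under the random surfer model."""
    rng = random.Random(seed)
    pages = list(out_links)
    current = rng.choice(pages)
    visits = {p: 0 for p in pages}
    for _ in range(steps):
        r = rng.random()  # random number r between 0 and 1
        if r < lam or not out_links[current]:
            # Teleport: go to a random page (also used at dead ends, by assumption).
            current = rng.choice(pages)
        else:
            # Click a link at random on the current page.
            current = rng.choice(out_links[current])
        visits[current] += 1
    return {p: count / steps for p, count in visits.items()}

# Hypothetical three-page graph.
freq = simulate_surfer({"A": ["B", "C"], "B": ["C"], "C": ["A"]}, steps=100_000)
print(freq)
```

The visit frequencies approximate the probability that the surfer is looking at each page, which is the quantity PageRank is meant to capture.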



Random Surfer Model
• Taking the random page jump into account: there is a 1/3 chance of going to any page when r < λ, i.e.,
PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1)
• More generally,
$$PR(u) = \frac{\lambda}{N} + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$$
– where N is the number of pages and λ is typically 0.15 (roughly 5 link clicks, then a jump, assuming no dead ends)
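A minimal sketch of this update, iterated until the values settle, is shown below. The function name `pagerank`, the fixed iteration count, and the three-page graph (A → B, A → C, B → C, C → A, inferred from the PR(C) expression) are assumptions; like the slide, it assumes there are no dead ends.

```python
def pagerank(out_links, lam=0.15, iterations=100):
    """PageRank with random jumps: each page gets lam/N plus (1 - lam) times
    the PageRank passed along its incoming links."""
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new_pr = {p: lam / n for p in pages}   # share from random jumps
        for v, targets in out_links.items():
            for u in targets:
                new_pr[u] += (1 - lam) * pr[v] / len(targets)
        pr = new_pr
    return pr

# Assumed three-page graph.
pr = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print({page: round(score, 3) for page, score in pr.items()})
```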



Test of Your Knowledge
• In PageRank, which is NOT a benefit of introducing random jumping?
– A: Otherwise PageRank will favor nodes with fewer incoming links
– B: Otherwise a disconnected page always has zero PageRank
– C: Otherwise zero-outlink nodes will receive all the PageRank



Test of Your Knowledge
• Consider this mini-internet: (figure not recovered)
• Rank the pages in order of importance without doing any calculation.
• Find the PageRank of each of the pages.



Test of Your Knowledge
• Consider this mini-internet: (figure not recovered)
• Rank the pages in order of importance without doing any calculation.
• Find the PageRank of each of the pages.
Test of Your Knowledge
• Can PageRank work if the network is
disconnected?



Random Surfer Model
• Note that in the original Google paper (Brin and Page, 1998), PageRank is defined as:
$$PR(u) = \lambda + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$$
– where λ is typically 0.15
• Average PR of each page = 1
• Some examples using this definition follow
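A quick way to see the "average PR = 1" property is to iterate this definition on a small graph. The function name `pagerank_original` and the three-page graph below are assumptions for illustration, and the graph is chosen to have no dead ends.

```python
def pagerank_original(out_links, lam=0.15, iterations=100):
    """PageRank in the original paper's scaling: every page gets lam
    (not lam/N) plus (1 - lam) times the PageRank passed along its inlinks."""
    pr = {p: 1.0 for p in out_links}
    for _ in range(iterations):
        new_pr = {p: lam for p in out_links}
        for v, targets in out_links.items():
            for u in targets:
                new_pr[u] += (1 - lam) * pr[v] / len(targets)
        pr = new_pr
    return pr

# Hypothetical three-page graph without dead ends.
pr = pagerank_original({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print({page: round(score, 3) for page, score in pr.items()})
print(sum(pr.values()) / len(pr))   # average PR per page
```

With this scaling the scores sum to the number of pages rather than to 1, and no page can fall below λ.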



Example 1
• Every page has a minimum value λ



Example 2
• Looping or extensive interlinking (fully meshed) makes pages of equal importance



Example 3
• Dangling links lead to wasted PR (dead ends!)



Example 4
• Linking external sites back into home page
preserves PR and increases the PR of home page



Example 5
• A simple hierarchy without dangling links
• A hierarchy concentrates votes and PR into
one page



Example 6
• Does a variant of hierarchy work better?
• No… C and D are not helping.
• The more internal links, the more evenly the PR spreads out between the pages



Link Analysis in Modern Web
Search
• Link analysis plays an integral role in the
ranking functions of Web search engines
• But has been extended and generalized
considerably
• The importance of PageRank as a feature in Google’s ranking function has declined over time
• Combining links, text, and usage data
– E.g., weight the contributions of links with highly relevant anchor text more heavily



Test of Your Knowledge
• HITS and PageRank only use the inter-document links when calculating a document’s score, without considering the content of the document.
– True
– False



PageRank
• Strengths
– Gives global ranking of importance, can be
computed offline and hence lower response time
– Robust against spam as it is not easy for a website
to add inlinks from other important pages
• Weaknesses
– Query independent, ignores topic relevance
– May score high on advertising pages (i.e., does not
distinguish navigation, advertising or function links)
– Favors older pages



Combining PageRank and Topic
Relevance

[Formula not recovered. Its two annotated factors are:]
• The importance of page i in the topic k set
• How relevant the query q is to topic k
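One common way to combine these two factors, used in topic-sensitive variants of PageRank, is a relevance-weighted sum over topics. The formula below is a sketch under that assumption and may differ from the one on the original slide; the symbols Score, P(k | q), and PR_k are assumed notation.

$$\mathrm{Score}(q, i) = \sum_{k} P(k \mid q) \cdot PR_k(i)$$

Here PR_k(i) stands for the importance of page i in the topic k set, and P(k | q) for how relevant the query q is to topic k. The PR_k values can be computed offline, one PageRank vector per topic, while the weights P(k | q) are computed at query time.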
Link Quality
• Link quality is affected by spam and other
factors
– Link spam: web owners create useless links to
improve their placement on search engines
– E.g. trackback links in blogs can create loops



Example: Trackback Links



Link Quality
• Link quality is affected by spam and other
factors
– trackback links in blogs can create loops
– links from comments section of popular blogs
• Solution: Blog services modify comment links to contain the rel=nofollow attribute
• e.g., “Come visit my <a rel=nofollow href="http://www.page.com">web page</a>.”
• Search engines will ignore these links during indexing
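How a link-analysis pipeline might skip such links can be sketched with Python's standard `html.parser` module. The class name `FollowedLinkParser` and the sample comment text are assumptions; real search engines apply their own, more detailed rules.

```python
from html.parser import HTMLParser

class FollowedLinkParser(HTMLParser):
    """Separate links that count for link analysis from rel=nofollow links."""

    def __init__(self):
        super().__init__()
        self.followed = []   # hrefs that would contribute to link analysis
        self.ignored = []    # hrefs marked rel=nofollow

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens, e.g. rel="nofollow ugc".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.ignored.append(href)
        else:
            self.followed.append(href)

parser = FollowedLinkParser()
parser.feed(
    'Come visit my <a rel=nofollow href="http://www.page.com">web page</a>. '
    'See also <a href="http://example.com">Example website</a>.'
)
print(parser.followed)
print(parser.ignored)
```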



Link Quality
• Link farm: concentrating links from many spam pages onto one page may raise its PR
• But this is the wrong way to improve PR
– Search engines may ignore the page



Summary
• Link information is very useful
– Anchor text
– PageRank
– HITS
• Both PageRank and HITS have many
applications in analyzing other graphs or
networks



Course Schedule
Week | Date | Lesson Topics
1 | Feb 26 | Introduction
2 | Mar 4 | Metadata and subject analysis (metadata schemes, controlled vocabularies)
3 | Mar 11 | Information categorization; Computational classification: text processing basics
4 | Mar 18 | Computational classification: decision tree; Information retrieval: inverted indexes
5-6 | Mar 25, Apr 1 | Information retrieval: models (Boolean, vector space, probabilistic) and evaluation
7 | Apr 8 | Project presentation
8-9 | Apr 15, 22 | Web search (link analysis, paid search)
10 | Apr 29 | Test 1; Guest lecture
11-12 | May 6, 13 | Information and social network (information cascades, social network analysis)
13-14 | May 20, 27 | Social and ethical issues (pricing of information, information goods market, IP issues); Review
15 | Jun 3 | Test 2
