
IR on Web

Search Engines

Reference of slides taken from Dr Haddawy's material


IR on the Web
● Search engines use well-known techniques from IR.
● But IR algorithms were developed for relatively small and coherent
  collections of documents, e.g. newspaper articles.
● The Web is massive, much less coherent, changes rapidly, and is
  spread over geographically distributed computers.
● Selectivity problem: traditional techniques measure the similarity of
  the query text to document texts. But the tiny queries over vast
  collections that are typical of Web search engines prevent
  similarity-based approaches from filtering enough irrelevant pages
  out of the search results.
Challenges for Web Searching
● Distributed data
● Volatile data: 40% of the web changes every month
● Exponential growth
● Unstructured and redundant data: 30% of web pages
are near duplicates
● Unedited data
● Multiple formats
● Many different kinds of users
Challenges for Web Searching
● Web search queries are SHORT
● ~2 - 3 words on average
● User Expectations are quite high
● Many say “the first item shown should be what
I want to see”!
Web is a complex graph

[Diagram: several sites (Site 1, Site 2, Site 3, Site 5, Site 6), each with
multiple pages, connected to one another by hyperlinks, forming a large
directed graph.]
Search Engine Architecture

Spider
– Crawls the web to find pages; follows hyperlinks

Indexer
– Produces data structures for fast searching of all words in the pages

Retriever
– Query interface
– Database lookup to find hits
  (scale: ~2 billion documents; 4 TB of RAM and many terabytes of disk)
– Ranking
Typical Search Engine
Architecture
[Diagram: crawlers fetch pages from the Web into a page repository; the
indexer builds text and structure indexes from it; the query engine and
ranking module use these indexes to answer user queries and return
results.]
Manpower and Hardware:
Google
85 people
– 50% technical; 14 with a Ph.D. in Computer Science

Equipment
– 2,500 Linux machines
– 80 terabytes of spinning disks
– 30 new machines installed daily

Reported by Larry Page, Google, March 2000

At that time, Google was handling 5.5 million searches per day,
growing at 20% per month.
By fall 2002, Google had grown to over 400 people and 10,000 Linux
servers (the world's largest Linux cluster).
Crawlers (Spiders, Bots)


Main idea:
– Start with known sites
– Record information for these sites
– Follow the links from each site
– Record information found at new sites
– Repeat
Web Crawlers
● Start with an initial page P0. Find URLs on P0 and add
them to a queue.
● When done with P0, pass it to an indexing program, get a page P1
from the queue, and repeat (a minimal sketch of this loop follows after
the issues below).

Issues
– Which page to look at next?
– Avoid overloading a site
– How deep within a site to go (drill-down)?
– How frequently to visit pages?
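
Below is a minimal sketch of this crawl loop in Python (my illustration, not
from the slides); the link-extraction regex and the page limit are simplifying
assumptions. Using a FIFO queue gives breadth-first visit order; swapping it
for a stack (popping from the end) would give depth-first order.

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def fetch_and_extract_links(url):
    """Download a page and return its text plus the absolute URLs it links to."""
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    links = [urljoin(url, href) for href in re.findall(r'href="([^"]+)"', html)]
    return html, links

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: start with known sites, record them, follow their links."""
    queue = deque(seed_urls)           # pages waiting to be visited
    visited = set(seed_urls)           # avoid fetching the same URL twice
    pages = {}                         # url -> page text, to be handed to the indexer
    while queue and len(pages) < max_pages:
        url = queue.popleft()          # FIFO queue -> breadth-first visit order
        try:
            text, links = fetch_and_extract_links(url)
        except OSError:
            continue                   # skip pages that fail to download
        pages[url] = text
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return pages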
Page Visit Order

Animated examples of breadth-first vs depth-first search on trees:
– http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html

[Diagram: an example tree/link structure to be traversed]
Indexing
• Arrangement of data to permit fast searching
• Which list is easier to search?
unsorted: sow fox pig eel yak hen ant cat dog hog
sorted: ant cat dog eel fox hen hog pig sow yak
• Sorting helps in searching
– You probably use this when looking something up in the
telephone book or dictionary. For instance, "cold fusion" is
probably near the front, so you open maybe 1/4 of the way in.
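
As a side illustration of why sorting helps (not from the slides): on the
sorted list, binary search repeatedly halves the range left to examine, much
like opening a dictionary part of the way in rather than scanning from the front.

def binary_search(sorted_words, target):
    """Return the index of target in sorted_words, or -1 if it is absent."""
    lo, hi = 0, len(sorted_words) - 1
    while lo <= hi:
        mid = (lo + hi) // 2            # look near the middle of the remaining range
        if sorted_words[mid] == target:
            return mid
        elif sorted_words[mid] < target:
            lo = mid + 1                # target can only be in the upper half
        else:
            hi = mid - 1                # target can only be in the lower half
    return -1

words = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]
print(binary_search(words, "hen"))      # prints 5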
Inverted Files
A file is a list of words by position:
– The first entry is the word in position 1 (the first word)
– Entry 4562 is the word in position 4562 (the 4562nd word)
– The last entry is the last word

An inverted file is a list of positions by word:

  a         (1, 4, 40)
  entry     (11, 20, 31)
  file      (2, 38)
  list      (5, 41)
  position  (9, 16, 26)
  positions (44)
  word      (14, 19, 24, 29, 35, 45)
  words     (7)
  4562      (21, 27)
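
A minimal sketch (not from the slides) of building such a word-to-positions
map for a single document:

from collections import defaultdict

def build_inverted_file(text):
    """Map each word to the list of (1-based) positions where it occurs."""
    inverted = defaultdict(list)
    for pos, word in enumerate(text.lower().split(), start=1):
        inverted[word].append(pos)
    return dict(inverted)

doc = "a file is a list of words by position"
print(build_inverted_file(doc))
# {'a': [1, 4], 'file': [2], 'is': [3], 'list': [5], 'of': [6],
#  'words': [7], 'by': [8], 'position': [9]}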
Inverted Files for Multiple Documents
For a collection, the inverted file has two parts:
– A lexicon (word index): for each WORD, the number of documents it
  appears in (NDOCS) and a pointer (PTR) into the postings.
– Postings: for each word, one entry per document, giving the document
  id (DOCID), the number of occurrences in that document (OCCUR), and
  the positions POS 1, POS 2, ...

Example: "jezebel" appears in 20 documents; it occurs
  6 times in document 34 (positions 1, 118, 2087, 3922, 3981, 5002),
  3 times in document 44 (positions 215, 2291, 3010),
  4 times in document 56 (positions 5, 22, 134, 992), ...

Other lexicon entries shown: jezer (3 docs), jezerit (1), jeziah (1),
jeziel (1), jezliah (1), jezoar (1), jezrahliah (1), jezreel (39 docs).
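
A sketch of the same structure for a whole collection (my illustration, not
from the slides; the example documents are made up): each word maps to
per-document postings, from which NDOCS and OCCUR fall out as lengths.

from collections import defaultdict

def build_index(docs):
    """docs: {docid: text}. Returns {word: {docid: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for docid, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word][docid].append(pos)
    return index

docs = {34: "jezebel spoke to jezebel", 44: "the house of jezebel"}
index = build_index(docs)
postings = index["jezebel"]
print(len(postings))                              # NDOCS: documents containing the word
print({d: len(p) for d, p in postings.items()})   # OCCUR in each document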
Ranking (Scoring) Hits

Hits must be presented in some order.

What order?
– Relevance, recency, popularity, reliability?

Some ranking methods (a simple scoring sketch follows after this list):
– Presence of keywords in the title of the document
– Closeness of keywords to the start of the document
– Frequency of the keyword in the document
– Link popularity (how many pages point to this one)
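
A toy scoring function combining these signals (my illustration, not from the
slides; the weights and page fields are arbitrary assumptions):

def score(page, keyword, inlink_count):
    """Combine simple ranking signals into a single number (weights are made up)."""
    words = page["body"].lower().split()
    kw = keyword.lower()
    in_title = kw in page["title"].lower()                    # keyword present in title
    first_pos = words.index(kw) + 1 if kw in words else None  # closeness to start of document
    freq = words.count(kw)                                    # keyword frequency in document
    s = 3.0 if in_title else 0.0
    s += 2.0 / first_pos if first_pos else 0.0
    s += 1.0 * freq
    s += 0.5 * inlink_count                                   # link popularity
    return s

page = {"title": "Cold fusion research",
        "body": "cold fusion is a proposed type of nuclear fusion"}
print(score(page, "fusion", inlink_count=12))                 # prints 12.0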
Ranking: Google

1. Vector space ranking with corrections for document length
2. Extra weighting for specific fields, e.g., title, URLs, etc.
3. PageRank

The balance between 1, 2, and 3 is not made public.
Google’s PageRank Algorithm

Assumption: a link in page A to page B is a recommendation of page B
by the author of A (we say B is a successor of A).
 The "quality" of a page is related to the number of links that
  point to it (its in-degree).

Apply recursively: the quality of a page is related to
– its in-degree, and to
– the quality of the pages linking to it.
 PageRank algorithm (Brin & Page, 1998)

SOURCE: GOOGLE
PageRank

Consider the following infinite random walk (surfing):
– Initially the surfer is at a random page
– At each step, the surfer proceeds
  • to a randomly chosen web page with probability d, or
  • to a randomly chosen successor of the current page with probability 1 − d

SOURCE: GOOGLE
PageRank Formula
PageRank(p) = d/n + (1 − d) · Σ_{(q,p)∈E} PageRank(q) / outdegree(q)

where n is the total number of nodes in the graph

Google uses d ≈ 0.85

PageRank is a probability distribution over web pages.

SOURCE: GOOGLE
PageRank Example

[Diagram: pages A and B both link to page P; A has 4 outgoing links, B has 3.]

PageRank of P is
  d/n + (1 − d) · [PageRank(A)/4 + PageRank(B)/3]
SOURCE: GOOGLE
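
A small sketch (not from the slides) of computing this iteratively over a toy
link graph; d is the random-jump probability as used in the slide's formula,
and the graph and values below are made up for illustration.

def pagerank(graph, d=0.15, iterations=50):
    """graph: {page: [successor pages]}; an edge (q, p) means q links to p.

    Repeatedly applies PageRank(p) = d/n + (1 - d) * sum over links (q, p)
    of PageRank(q) / outdegree(q) until the ranks settle.
    """
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}            # start from the uniform distribution
    for _ in range(iterations):
        new_rank = {}
        for p in graph:
            incoming = sum(rank[q] / len(graph[q])
                           for q in graph if p in graph[q])
            new_rank[p] = d / n + (1 - d) * incoming
        rank = new_rank
    return rank

# Toy graph: A and B both link to P; A has outdegree 4, B has outdegree 3.
graph = {
    "A": ["P", "X", "Y", "Z"],
    "B": ["P", "X", "Y"],
    "P": ["A"], "X": ["A"], "Y": ["B"], "Z": ["B"],
}
print(pagerank(graph))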
Robot Exclusion

You may want certain pages left out of the index while still viewable
in browsers, so you can't simply protect the directory.

Some crawlers conform to the Robot Exclusion Protocol.
Compliance is voluntary; one way to enforce exclusion is a firewall.

They look for the file robots.txt at the highest directory level of the
domain. If the domain is www.ecom.cmu.edu, robots.txt goes in
www.ecom.cmu.edu/robots.txt

A specific document can be shielded from a crawler by adding the line
<META NAME="ROBOTS" CONTENT="NOINDEX"> to its HTML <head>.
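
For illustration (the paths and crawler name below are made-up examples):
a robots.txt placed at the top of the domain asks compliant crawlers to stay
out of listed directories, and Python's standard urllib.robotparser can check
it before fetching a URL.

# Example robots.txt at e.g. www.ecom.cmu.edu/robots.txt:
#   User-agent: *
#   Disallow: /private/

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.ecom.cmu.edu/robots.txt")   # robots.txt at the top of the domain
rp.read()                                          # download and parse the file
print(rp.can_fetch("MyCrawler", "http://www.ecom.cmu.edu/private/page.html"))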
