IR On Web Search Engines: Reference of Slides Taken From DR Haddawy's Material

The document discusses some of the key challenges in web search including the massive scale and dynamic nature of the web compared to traditional document collections. It also provides an overview of the major components of a typical search engine architecture, including crawlers to discover pages, indexers to create searchable indexes, and ranking algorithms like PageRank to order search results. The document also notes some of the technical resources like hardware and personnel required at large scale search engines like Google.

Uploaded by

Ali Hasan


IR on Web

Search Engines

Reference of slides taken from Dr Haddawy's material

IR on the Web
● Search engines use well-known techniques from IR.
● But IR algorithms were developed for relatively small and coherent collections of documents, e.g. newspaper articles.
● The Web is massive, much less coherent, changes rapidly, and is spread over geographically distributed computers.
● Selectivity Problem: Traditional techniques measure the similarity of the query text with document texts. But the tiny queries over vast collections that are typical of Web search prevent similarity-based approaches from filtering enough irrelevant pages out of the search results.
Challenges for Web Searching
● Distributed data
● Volatile data: 40% of the web changes every month
● Exponential growth
● Unstructured and redundant data: 30% of web pages
are near duplicates
● Unedited data
● Multiple formats
● Many different kinds of users
Challenges for Web Searching
● Web search queries are SHORT
– ~2-3 words on average
● User expectations are quite high
– Many say “the first item shown should be what I want to see”!
Web is a complex graph

[Figure: the Web drawn as a graph of sites (Site 1, Site 2, Site 5 shown), each containing numbered pages (Page 1 to Page 5) joined by links]

[Figure: search engine architecture. Components shown: Web, Crawlers, Page Repository, Indexer, Indexes (Structure, Text), Query Engine, Ranking. Queries go in, Results come out.]
Manpower and Hardware: Google
● 85 people
– 50% technical, 14 with a Ph.D. in Computer Science
● Equipment
– 2,500 Linux machines
– 80 terabytes of spinning disks
– 30 new machines installed daily

Reported by Larry Page, Google, March 2000

At that time, Google was handling 5.5 million searches per day.
The increase rate was 20% per month.
By fall 2002, Google had grown to over 400 people and 10,000 Linux servers (the world's largest Linux cluster).
Crawlers (Spiders, Bots)

● Main idea:
– Start with known sites
– Record information for these sites
– Follow the links from each site
– Record information found at new sites
– Repeat
Web Crawlers
● Start with an initial page P0. Find URLs on P0 and add them to a queue.
● When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat.
● Issues
– Which page to look at next?
– Avoid overloading a site
– How deep within a site to go (drill-down)?
– How frequently to visit pages?
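The crawl loop described above can be sketched as follows. This is a minimal illustration, not a production crawler: the web is simulated by a hypothetical in-memory dictionary `LINKS` that maps each page to the URLs found on it, so nothing is fetched over the network, and `max_pages` is an assumed stand-in for the politeness and depth limits listed under Issues.

```python
from collections import deque

# Hypothetical toy web: each page maps to the URLs found on it.
LINKS = {
    "p0": ["p1", "p2"],
    "p1": ["p2", "p3"],
    "p2": ["p0"],
    "p3": [],
}

def crawl(start, links, max_pages=100):
    """Visit pages reachable from `start` in queue (first-in, first-out) order."""
    queue = deque([start])
    seen = {start}      # URLs already queued, so no page is visited twice
    indexed = []        # stands in for "pass the page to an indexing program"
    while queue and len(indexed) < max_pages:
        page = queue.popleft()
        indexed.append(page)
        for url in links.get(page, []):
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return indexed

print(crawl("p0", LINKS))
```

The `seen` set matters because the web graph contains cycles (here `p2` links back to `p0`); without it the loop would never terminate.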
Page Visit Order
● Animated examples of breadth-first vs depth-first search on trees:
– http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html

[Figure: structure to be traversed]
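The difference between the two visit orders comes down to how the frontier of unvisited nodes is managed. The sketch below uses a hypothetical six-node tree (`TREE` is made up for illustration): taking the oldest entry first gives breadth-first order, taking the newest first gives depth-first order.

```python
from collections import deque

# Hypothetical tree: each node maps to its children, left to right.
TREE = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def visit_order(root, tree, breadth_first=True):
    """Return nodes in visiting order; only the frontier discipline differs."""
    frontier = deque([root])
    order = []
    while frontier:
        # Queue (oldest first) is breadth-first; stack (newest first) is depth-first.
        node = frontier.popleft() if breadth_first else frontier.pop()
        order.append(node)
        children = tree[node]
        # Reversed for the stack so the leftmost child is still visited first.
        frontier.extend(children if breadth_first else reversed(children))
    return order

print(visit_order("A", TREE, breadth_first=True))
print(visit_order("A", TREE, breadth_first=False))
```

Because the structure here is a tree, no visited set is needed; on a general web graph with cycles the traversal would also have to remember which pages it has already seen.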
Indexing
• Arrangement of data to permit fast searching
• Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
• Sorting helps in searching
– You probably use this when looking something up in the
telephone book or dictionary. For instance, "cold fusion" is
probably near the front, so you open maybe 1/4 of the way in.
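To make the point about sorting concrete, the sketch below searches the two word lists from this slide. The function names are illustrative; `bisect_left` is the standard-library binary search and only works because the second list is sorted.

```python
from bisect import bisect_left

unsorted_words = "sow fox pig eel yak hen ant cat dog hog".split()
sorted_words = sorted(unsorted_words)

def linear_search(words, target):
    """Scan entries one by one; returns (index or -1, comparisons made)."""
    for i, w in enumerate(words):
        if w == target:
            return i, i + 1
    return -1, len(words)

def binary_search(words, target):
    """Requires a sorted list; returns the index of target or -1."""
    i = bisect_left(words, target)
    return i if i < len(words) and words[i] == target else -1

print(linear_search(unsorted_words, "hog"))
print(binary_search(sorted_words, "hog"))
```

On the unsorted list, finding "hog" means looking at all ten entries; on the sorted list, binary search halves the remaining range at each step, so the work grows with the logarithm of the list length rather than the length itself.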
Inverted Files
FILE (POS is the position of the first word on each line)
POS 1   A file is a list of words by position
POS 10  – First entry is the word in position 1 (first word)
POS 20  – Entry 4562 is the word in position 4562 (4562nd word)
POS 30  – Last entry is the last word
POS 36  An inverted file is a list of positions by word!

INVERTED FILE
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (43)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)
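The inversion itself is a single pass over the file. The sketch below applies it to the text of this slide; the tokenisation rule (lowercase, keep only runs of letters and digits) is an assumption made for the example.

```python
import re
from collections import defaultdict

TEXT = (
    "A file is a list of words by position "
    "First entry is the word in position 1 (first word) "
    "Entry 4562 is the word in position 4562 (4562nd word) "
    "Last entry is the last word "
    "An inverted file is a list of positions by word!"
)

def invert(text):
    """Map each lowercased word to the 1-based positions where it occurs."""
    index = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for position, word in enumerate(words, start=1):
        index[word].append(position)
    return dict(index)

index = invert(TEXT)
print(index["file"])   # [2, 38]
print(index["word"])   # [14, 19, 24, 29, 35, 45]
```

Looking up a word is now a dictionary access instead of a scan through the whole file.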
Inverted Files for Multiple Documents
LEXICON
WORD        NDOCS  PTR
jezebel     20     (points into the word index)
jezer       3
jezerit     1
jeziah      1
jeziel      1
jezliah     1
jezoar      1
jezrahliah  1
jezreel     39

WORD INDEX
DOCID  OCCUR  POS 1  POS 2  ...
34     6      1  118  2087  3922  3981  5002
44     3      215  2291  3010
56     4      5  22  134  992
...
566    3      203  245  287
67     1      132
...
107    4      322  354  381  405
232    6      15  195  248  1897  1951  2192
677    1      481
713    3      42  312  802

“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
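A lexicon and word index of this shape can be built as sketched below. The three documents in `DOCS` are hypothetical, and the PTR column is replaced by an ordinary dictionary lookup, which is a simplification of whatever on-disk layout a real engine would use.

```python
import re
from collections import defaultdict

# Hypothetical documents keyed by DOCID.
DOCS = {
    34: "jezebel said that jezebel would go",
    44: "the city of jezreel",
    56: "jezebel went down to jezreel",
}

def build_index(docs):
    """Return (lexicon, word_index).

    word_index[word] is a list of (docid, occurrences, positions) rows;
    lexicon[word] is NDOCS, the number of documents containing the word.
    """
    positions = defaultdict(lambda: defaultdict(list))
    for docid, text in docs.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for pos, word in enumerate(words, start=1):
            positions[word][docid].append(pos)
    word_index = {
        word: [(docid, len(plist), plist) for docid, plist in sorted(by_doc.items())]
        for word, by_doc in positions.items()
    }
    lexicon = {word: len(rows) for word, rows in word_index.items()}
    return lexicon, word_index

lexicon, word_index = build_index(DOCS)
print(lexicon["jezebel"], word_index["jezebel"])
```

Storing positions as well as counts is what lets an engine answer phrase and proximity queries, at the cost of a larger index.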
Ranking (Scoring) Hits
● Hits must be presented in some order
● What order?
– Relevance, recentness, popularity, reliability?
● Some ranking methods
– Presence of keywords in title of document
– Closeness of keywords to start of document
– Frequency of keyword in document
– Link popularity (how many pages point to this one)
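As an illustration of how the four signals listed above might be combined into one score, here is a toy sketch. The weights, the document fields (`title`, `body`, `inlinks`) and the exact form of each signal are all hypothetical; no real engine's scoring is being described.

```python
def score(doc, keyword, weights=(3.0, 1.0, 1.0, 0.5)):
    """Combine the four signals using hypothetical weights."""
    w_title, w_early, w_freq, w_links = weights
    words = doc["body"].lower().split()
    kw = keyword.lower()
    # Presence of the keyword in the title.
    in_title = 1.0 if kw in doc["title"].lower().split() else 0.0
    # Closeness to the start: 1.0 for the first word, falling toward 0.
    early = 1.0 / (words.index(kw) + 1) if kw in words else 0.0
    # Frequency of the keyword, relative to document length.
    freq = words.count(kw) / len(words) if words else 0.0
    # Link popularity: number of pages pointing to this one.
    return (w_title * in_title + w_early * early
            + w_freq * freq + w_links * doc["inlinks"])

# Hypothetical documents.
doc_a = {"title": "Cold fusion basics", "body": "fusion research on cold fusion", "inlinks": 2}
doc_b = {"title": "Cooking tips", "body": "a recipe with fusion cuisine", "inlinks": 1}

ranked = sorted([doc_a, doc_b], key=lambda d: score(d, "fusion"), reverse=True)
print([d["title"] for d in ranked])
```

In practice the hard part is choosing the weights, which is why the balance between such signals is treated as a trade secret.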
Ranking: Google

1. Vector space ranking with corrections for document length
2. Extra weighting for specific fields, e.g., title, URLs, etc.
3. PageRank
The balance between 1, 2, and 3 is not made public.
Google’s PageRank Algorithm
● Assumption: A link in page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
→ The “quality” of a page is related to the number of links that point to it (its in-degree)

● Apply recursively: Quality of a page is related to
– its in-degree, and to
– the quality of pages linking to it
→ PageRank Algorithm (Brin & Page, 1998)

SOURCE: GOOGLE
PageRank
● Consider the following infinite random walk (surfing):
– Initially the surfer is at a random page
– At each step, the surfer proceeds
  ● to a randomly chosen web page with probability d
  ● to a randomly chosen successor of the current page with probability 1-d

SOURCE: GOOGLE
PageRank Formula
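$$\mathrm{PageRank}(p) = \frac{d}{n} + (1 - d) \sum_{(q,p) \in E} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$

where n is the total number of nodes in the graph and E is the set of links, so the sum runs over all pages q that link to p

● Google uses a damping factor of about 0.85, which is the probability of following a link; in this formula that is 1 − d ≈ 0.85, i.e. d ≈ 0.15
● PageRank is a probability distribution over web pages
SOURCE: GOOGLE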
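The formula can be evaluated by repeated substitution, starting from a uniform distribution. The sketch below is one simple way to do that, not Google's implementation. The three-page `GRAPH` is hypothetical, the code assumes every page has at least one outgoing link, and the default d = 0.15 is an assumption chosen so that links are followed with probability 0.85.

```python
def pagerank(out_links, d=0.15, iterations=50):
    """Iterate PageRank(p) = d/n + (1-d) * sum(PageRank(q) / outdegree(q)).

    `out_links` maps each page to the list of pages it links to.
    `d` is the random-jump probability, as in the formula.
    Assumes every page has at least one out-link (no dangling pages).
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # uniform starting distribution
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}    # random-jump term
        for q, successors in out_links.items():
            share = (1 - d) * rank[q] / len(successors)
            for p in successors:                # q passes an equal share to each successor
                new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical three-page graph.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(GRAPH)
print(ranks)
```

Because each page gives away exactly its own rank and the jump terms add up to d, the values always sum to 1, which is what makes the result a probability distribution over pages.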
PageRank Example

[Figure: pages A and B, each linking to page P; labels: d, d]

PageRank of P is
(1-d)∗[(PageRank of A)/4 + (PageRank of B)/3] + d/n

PAGERANK CALCULATOR
SOURCE: GOOGLE
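Plugging numbers into the expression for P shows how each linking page's rank is split across its outgoing links (the divisors 4 and 3). All input values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def pagerank_of_p(pr_a, pr_b, d, n):
    """PageRank of P when its only in-links come from A (out-degree 4) and B (out-degree 3)."""
    return (1 - d) * (pr_a / 4 + pr_b / 3) + d / n

# Hypothetical inputs: 10 pages in total, A and B each hold rank 0.1, d = 0.15.
value = pagerank_of_p(pr_a=0.1, pr_b=0.1, d=0.15, n=10)
print(round(value, 4))
```

With these inputs, A contributes 0.025 and B contributes about 0.0333 before the (1-d) scaling, so a link from a page with fewer outgoing links is worth more.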
Robot Exclusion
● You may want certain pages to stay viewable by browsers but not be indexed, so you can't simply protect the directory.
● Some crawlers conform to the Robot Exclusion Protocol. Compliance is voluntary. One way to enforce exclusion: a firewall.
● Compliant crawlers look for the file robots.txt at the highest directory level in the domain. If the domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
● A specific document can be shielded from a crawler by adding the line: <META NAME="ROBOTS" CONTENT="NOINDEX">
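A compliant crawler checks robots.txt before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `/private/` path and the agent name `ExampleBot` are hypothetical, and the rules are parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path is made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="ExampleBot"):
    """True if a compliant crawler identifying as `agent` may fetch `url`."""
    return parser.can_fetch(agent, url)

print(allowed("http://www.ecom.cmu.edu/index.html"))
print(allowed("http://www.ecom.cmu.edu/private/notes.html"))
```

Nothing here stops a crawler that chooses to ignore the file, which is why compliance is described as voluntary.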
