Building Fast Search Engines

This document summarizes how search engines work by indexing web documents, processing user queries, and returning relevant results. It discusses key aspects like indexing data with inverted indexes, approximating relevance using statistical algorithms like Okapi BM25, compressing indexes and documents for fast retrieval, and architectures that divide indexes and documents across multiple servers. The document concludes by outlining current research at RMIT's Search Engine Group on fast search techniques and applications to other domains.

Building Fast Search Engines

Hugh E. Williams ([email protected])

School of Computer Science and Information Technology, RMIT
Overview

• Users’ Information Needs

• Why users use search engines

• How users query with search engines

• Answers
• What is a good answer?

• How search engines provide a search service

• Indexing data

• Index design

• Architecture of a commercial search engine

• Research
• Fast searching and emerging technologies
Queries

• Search engines are one tool used to answer information needs
• Users express their information needs as queries
• Queries are usually expressed informally as two or three words (we call this a ranked query)
• A recent study showed the mean query length was 2.4 words per query with a median of 2
• Around 48.4% of users submit just one query in a session, 20.8% submit two, and about 31% submit three or more
• Fewer than 5% of queries use Boolean operators (AND, OR, and NOT), and around 5% contain quoted phrases

Queries...

• About 1.28 million different words were used in queries in the Excite log studied (which contained 1.03 million queries)
• Around 75 words account for 9% of all words used in queries. The top-ten non-trivial words occurring in 531,000 queries are “sex” (10,757), “free” (9,710), “nude” (7,047), “pictures” (5,939), “university” (4,383), “pics” (3,815), “chat” (3,515), “adult” (3,385), “women” (3,211), and “new” (3,109)
• 16.9% of the queries were about entertainment, 16.8% about sex, pornography, or preferences, and 13.3% concerned commerce, travel, employment, and the economy
Answers

• What is a good answer to a query?
• One that is relevant to the user’s information need!
• Search engines typically return ten answers per page, where each answer is a short summary of a web document
• Likely relevance to an information need is approximated by statistical similarity between web documents and the query
• Users favour search engines that have high precision, that is, those that return relevant answers in the first page of results
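Precision on the first results page can be made concrete with a small sketch. The function name `precision_at_k`, the document identifiers, and the relevance judgments below are illustrative assumptions and do not come from the slides.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the first k answers that are judged relevant."""
    if k <= 0:
        return 0.0
    # Count how many of the first k ranked answers appear in the relevant set.
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hypothetical ranking returned by an engine, and hypothetical judgments.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

# Two of the first five answers ("d3" and "d1") are relevant.
p_at_5 = precision_at_k(ranked, relevant, k=5)
```

With k set to the page size (ten in the slides), this measures how many answers on the first page are relevant.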
An Example Query
Top-ten Answers
Approximating Relevance
• Statistical similarity is used to estimate the relevance of an answer to a query
• Consider the query “Richardson Richmond Football”
• A good answer contains all three words, and the more frequently the better; we call this term frequency (TF)
• Some query terms are more important (have better discriminating power) than others. For example, an answer containing only “Richardson” is likely to be better than an answer containing only “Football”, because “Richardson” occurs in fewer documents; we call this inverse document frequency (IDF)
• A popular, state-of-the-art statistical ranking function that incorporates these ideas is Okapi BM25
Okapi BM25 Function

• The Okapi ranking function is as follows:

$$\sum_{T \in Q} w \times \frac{(k_1 + 1)\,tf}{K + tf} \times \frac{(k_3 + 1)\,qtf}{k_3 + qtf}$$

• Q is a query that contains the words T
• $k_1$, $b$, and $k_3$ are constant parameters ($k_1 = 1.2$ and $b = 0.75$ work well; $k_3$ is 7 or 1000)
• K is: $k_1((1 - b) + b \cdot dl / avdl)$
• tf is the term frequency of the term within a document
• qtf is the term frequency in the query
• w is: $\log \frac{N - n + 0.5}{n + 0.5}$
• N is the number of documents, n is the number containing the term
• dl and avdl are the document length and average document length
• Overall: ranking uses the number of times a word occurs in a document, the number of documents containing the term, and the document length
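The formula can be transcribed almost term by term. The following is a minimal sketch, assuming that term statistics are already available as plain dictionaries; the function name `bm25_score`, its argument names, and the example numbers are assumptions made for illustration.

```python
import math


def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, doc_freq, num_docs,
               k1=1.2, b=0.75, k3=7.0):
    """Okapi BM25 score of one document for one query.

    query_terms: list of query words (repeats give qtf > 1)
    doc_tf:      word -> term frequency in this document (tf)
    doc_freq:    word -> number of documents containing the word (n)
    """
    # K normalises term frequency by document length.
    K = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)

    # qtf: how often each word occurs in the query itself.
    qtf_counts = {}
    for term in query_terms:
        qtf_counts[term] = qtf_counts.get(term, 0) + 1

    score = 0.0
    for term, qtf in qtf_counts.items():
        tf = doc_tf.get(term, 0)
        n = doc_freq.get(term, 0)
        if tf == 0 or n == 0:
            continue  # the term contributes nothing to this document
        w = math.log((num_docs - n + 0.5) / (n + 0.5))
        score += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score


# Hypothetical collection statistics: "richardson" is rare, "football" is common.
doc_freq = {"richardson": 10, "football": 500}
query = ["richardson", "football"]
rare_only = bm25_score(query, {"richardson": 1}, 100, 100, doc_freq, 1000)
common_only = bm25_score(query, {"football": 1}, 100, 100, doc_freq, 1000)
```

In this example the document containing only the rare word scores higher than the one containing only the common word, which is the IDF effect described earlier. Note that w, as written on the slide, becomes zero or negative once a term occurs in half or more of the documents.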
More on Ranking...

• Other techniques are used to improve the accuracy of search engines:
• Google Inc. use their patented PageRank(tm) technology. Google ranks a page higher if it links to pages that are an authoritative source, and a link from an authoritative source to a page ranks that page higher
• Relevance feedback is a technique that adds words to a query based on a user selecting a “more like this” option
• Query expansion adds words to a query using thesaural or other techniques
• Searching within categories or groups to narrow a search
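The second half of the PageRank bullet (a link from an authoritative page raises the rank of the page it points to) can be illustrated with a textbook power iteration. This is a simplified sketch and not Google's production ranking; the damping factor of 0.85, the iteration count, and the three-page link graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links: page -> list of pages it links to.
    Returns page -> rank, with ranks summing to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "a" and "b" both link to "c"; "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Here "c" ends up with the highest rank because two pages link to it, "a" comes next because the highly ranked "c" links to it, and "b" is lowest because nothing links to it.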
How Search Engines Work
• Search engines work as follows:
• They retrieve (spider or crawl) documents from the Web
• Documents are stored as a collection in a centralised repository
• The collection is indexed to allow fast ranking to find answers
• A web interface is provided for entering queries and presenting answers
• Document summarisation is used to present short answers to the user for judging relevance
• Documents are updated and re-indexed regularly
Indexing Data

• All search engines use inverted indexes to support fast searching
• An inverted index consists of two components:
• A searchable in-memory vocabulary of all words in the collection; stored with each word is the IDF and a pointer to the inverted list for that word
• An on-disk inverted list for each word in the collection. This list contains:
• the documents that contain the word
• the term frequency of the word in each document
• the offset or offsets of the word in each document (this is optional, and is used for proximity and phrase queries)
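The two components can be sketched in a few lines. In this minimal sketch an in-memory Python list stands in for the on-disk inverted lists, the "pointer" is simply an index into that list, and the IDF is computed with the w term of the Okapi formula; the function name `build_inverted_index`, the whitespace tokenisation, and the sample documents are assumptions.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Build (vocabulary, inverted_lists) from doc_id -> text.

    vocabulary:     word -> (idf, pointer), where pointer indexes inverted_lists
    inverted_lists: list of inverted lists, each a list of
                    (doc_id, term_frequency, [offsets]) entries
    """
    postings = defaultdict(dict)  # word -> {doc_id: [word offsets]}
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split()):
            postings[word].setdefault(doc_id, []).append(offset)

    num_docs = len(docs)
    vocabulary = {}
    inverted_lists = []
    for word in sorted(postings):
        entries = [(doc_id, len(offsets), offsets)
                   for doc_id, offsets in sorted(postings[word].items())]
        n = len(entries)  # number of documents containing the word
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))
        vocabulary[word] = (idf, len(inverted_lists))
        inverted_lists.append(entries)
    return vocabulary, inverted_lists


# Hypothetical two-document collection.
vocabulary, inverted_lists = build_inverted_index({
    1: "the cat sat on the mat",
    2: "the hat",
})
idf, pointer = vocabulary["the"]
the_list = inverted_lists[pointer]  # entries for documents 1 and 2
```

Storing the offsets makes phrase and proximity queries possible at the cost of longer lists, which is why the slides mark them as optional.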
Indexing Data
Resolving Queries

• Queries are resolved using the inverted index
• Consider the example query “Cat Mat Hat”. This is evaluated as follows:
• Select a word from the query (say, “Cat”)
• Retrieve the inverted list from disk for the word
• Process the list. For each document the word occurs in, add weight to an accumulator for that document based on the TF, IDF, and document length
• Repeat for each word in the query
• Find the documents with the highest accumulated weights
• Look up those documents in the mapping table
• Retrieve and summarise the documents, and present them to the user
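The accumulator steps above can be sketched as follows. This is a simplified illustration that assumes the inverted lists are already in memory as a dictionary, uses the Okapi weight with the query-term factor left out (equivalent to every word occurring once in the query), and omits the mapping table and summarisation steps. The function name `resolve_query` and the five-document collection are assumptions.

```python
import heapq
import math
from collections import defaultdict


def resolve_query(query, inverted_lists, doc_lengths, top_k=10, k1=1.2, b=0.75):
    """Rank documents for a query using one accumulator per document.

    inverted_lists: word -> list of (doc_id, term_frequency)
    doc_lengths:    doc_id -> document length in words
    """
    num_docs = len(doc_lengths)
    avdl = sum(doc_lengths.values()) / num_docs
    accumulators = defaultdict(float)

    for word in set(query.lower().split()):
        postings = inverted_lists.get(word, [])  # "retrieve the inverted list"
        if not postings:
            continue
        n = len(postings)
        w = math.log((num_docs - n + 0.5) / (n + 0.5))
        for doc_id, tf in postings:
            K = k1 * ((1 - b) + b * doc_lengths[doc_id] / avdl)
            accumulators[doc_id] += w * (k1 + 1) * tf / (K + tf)

    # Keep only the top_k highest accumulated weights.
    return heapq.nlargest(top_k, accumulators.items(), key=lambda item: item[1])


# Hypothetical collection of five documents.
lists = {
    "cat": [(1, 1), (2, 1)],
    "mat": [(1, 1)],
    "hat": [(2, 2), (3, 1)],
}
lengths = {1: 6, 2: 4, 3: 5, 4: 5, 5: 5}
answers = resolve_query("Cat Mat Hat", lists, lengths)
```

Document 1 ranks first in this example because it contains “mat”, the rarest of the three words; documents 4 and 5 never receive an accumulator because none of the query words occur in them.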
Fast Search Engines

• There are many well-known principles for building a fast search engine
• Perhaps the most important is compression:
• Inverted lists are stored in a compressed format. This allows more information per second to be retrieved from disk, and it lowers disk head seek times
• As long as decompression is fast, there is a beneficial trade-off in time
• Documents are stored in a compressed format for the same reason
• Different compression schemes are used for lists (which are integers) and documents (which are multimedia, but mostly text)
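One common way to compress an inverted list of integers is to store the gaps between successive document numbers and encode each gap with a variable-byte code. The slides do not say which scheme is used, so the sketch below is only one possibility; all function names are assumptions.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order bits first.

    The high bit is set only on the last byte of each number.
    """
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("variable-byte coding needs non-negative integers")
        while n >= 128:
            out.append(n & 0x7F)  # more bytes follow: high bit clear
            n >>= 7
        out.append(n | 0x80)      # final byte: high bit set
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:           # last byte of this number
            numbers.append(n)
            n, shift = 0, 0
        else:
            shift += 7
    return numbers


def compress_postings(doc_ids):
    """Store sorted document numbers as variable-byte coded gaps."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the document numbers by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


doc_ids = [3, 7, 8, 300, 100000]
compressed = compress_postings(doc_ids)
restored = decompress_postings(compressed)
```

Gaps are small for frequent words, so most of them fit in a single byte, compared with four bytes for a fixed-width 32-bit integer. Decoding needs only shifts and masks, which keeps decompression fast.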
Fast Search Engines...

• Average query times and index sizes for 25,000 queries on 10 gigabytes of indexed Web data
[Figure: two bar charts comparing compressed and uncompressed indexes. Left: index size (% of collection size). Right: average query time (seconds). The bar values are not recoverable from the text.]
Fast Search Engines...

• Other principles of fast searching:
• Sort disk accesses to minimise disk head movement when retrieving lists or documents
• Use hash tables in memory to store the vocabulary; avoid slow hash functions that use modulo
• Pre-calculate and store constants in ranking formulae
• Carefully choose integer compression schemes
• Organise inverted lists so that the information frequently needed is at the start of the list
• Use heap structures when partial sorting is required
• Develop a query plan for each query
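The heap principle in the list above applies when only the best few answers are needed, so a full sort of every scored document is wasted work. A minimal sketch, assuming scores arrive as (document, score) pairs; the function name `top_k` and the sample scores are illustrative.

```python
import heapq


def top_k(scored_docs, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first.

    A min-heap of at most k entries is kept, so each new score is compared
    only with the weakest answer retained so far.
    """
    if k <= 0:
        return []
    heap = []  # entries are (score, doc_id); heap[0] is the weakest kept
    for doc_id, score in scored_docs:
        entry = (score, doc_id)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)  # drop the weakest, keep the new one
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]


# Hypothetical accumulator values for four documents.
best_two = top_k([(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.7)], k=2)
```

For N scored documents this costs on the order of N log k comparisons instead of the N log N needed for a full sort, which matters when N is in the millions and k is ten.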

Search Engine Architecture
Search Engine Architecture...
• The inverted lists are divided amongst a number of servers, where each is known as a shard
• Each shard holds the inverted lists for a particular range of words; when an inverted list is required, the shard server responsible for that word is contacted
• Each shard server can be replicated as many times as required; each server in a shard is identical
• Documents are also divided amongst a number of servers
• Again, if a document is required within a particular range, then the appropriate document server is contacted
• Each document server can also be replicated as many times as required
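Routing a word to the shard that holds its inverted list can be sketched with a sorted list of range boundaries. The class name `ShardRouter`, the boundary words, the host names, and the random choice of replica are all assumptions made for illustration; the slides do not describe how a replica is selected.

```python
import bisect
import random


class ShardRouter:
    """Maps a word to the shard covering its range, then to one replica."""

    def __init__(self, boundaries, replicas):
        # boundaries: sorted words; shard i holds words below boundaries[i],
        # and the last shard holds everything from the final boundary upwards.
        # replicas: one list of identical servers per shard.
        if len(replicas) != len(boundaries) + 1:
            raise ValueError("need exactly one replica list per shard")
        self.boundaries = boundaries
        self.replicas = replicas

    def shard_for(self, word):
        # Binary search over the boundaries finds the shard number.
        return bisect.bisect_right(self.boundaries, word.lower())

    def server_for(self, word):
        # Any replica will do, since servers within a shard are identical.
        return random.choice(self.replicas[self.shard_for(word)])


# Hypothetical layout: three shards covering a-g, h-o, and p-z.
router = ShardRouter(
    boundaries=["h", "p"],
    replicas=[
        ["index-a1", "index-a2"],
        ["index-h1", "index-h2"],
        ["index-p1"],
    ],
)
shards = [router.shard_for(word) for word in ("Cat", "Mat", "Hat", "zebra")]
```

Adding replicas to a shard increases the number of queries that range can serve at once, while adding shards reduces how much index each server has to hold. The same routing idea applies to the document servers, with document numbers in place of words.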
What we’re working on...

• The Search Engine Group here at RMIT specialises in research into fast search engines and applications of search technology to other domains
• We are currently investigating:
• Fast phrase querying using new index structures
• Answer summarisation
• Index design
• Fast vocabulary searching and accumulation
• Index construction
• DNA and protein search engines
• Image and video management and retrieval
• General-purpose compression of collections
• Our new research testbed search engine will be released under the GPL later this year
Pointers (& advertising!)
• The Search Engine Group, http://goanna.cs.rmit.edu.au/~jz/seg/
• My home page, http://www.cs.rmit.edu.au/~hugh/
• Witten, Moffat, and Bell, “Managing Gigabytes”, 2nd edition, Morgan Kaufmann, 1999.
• Spink, Wolfram, Jansen, and Saracevic, “Searching the web: The public and their queries”, Journal of the American Society for Information Science, 52(3), 226-234, 2001. Queries are available from: http://www.mds.rmit.edu.au/~hugh/queries/
• Williams and Zobel, “Compressing Integers for Fast File Access”, The Computer Journal, 42(3), 193-201, 1999.
• Moffat, Zobel, and Sharman, “Text compression for dynamic document databases”, IEEE Transactions on Knowledge and Data Engineering, 9(2), 302-313, March-April 1997.
• Zobel and Moffat, “Adding compression to a full text retrieval system”, Software-Practice and Experience, 25(8), 891-903, 1995.
• Zobel, Heinz, and Williams, “In-memory Hash Tables for Accumulating Text Vocabularies”, Information Processing Letters. To appear.
