Chapter 2

The document discusses the architecture and processes involved in building and operating a search engine. It describes the key components as indexing, which involves acquiring text documents, transforming the text into index terms, and creating indexes to support fast searching; and query processing, which involves supporting user queries, ranking results, and evaluating system performance. Each component is then discussed in more detail regarding its purpose and implementation considerations for large-scale search engines.

Uploaded by

RAUSHAN


Search Engines and Information Retrieval

Chapter 2

Architecture of a Search Engine

Full Credit: Croft et al. - http://www.search-engines-book.com/

In association with and

Search Engine Architecture

● A software architecture consists of software components, the interfaces provided by those components (APIs), and the relationships between them
○ describes a system at a particular level of abstraction

● Architecture of a search engine is determined by two requirements
○ effectiveness (quality of results, aka relevance) and efficiency (response time and throughput)

[Figure: Indexing Process diagram; the only surviving label is "WWW"]

Indexing Process
● Text acquisition
○ identifies and stores documents for indexing

● Text transformation
○ transforms documents into index terms or features

● Index creation
○ takes index terms and creates data structures (indexes) to support fast searching

[Figure: Query Process diagram, annotated with query suggest, query refinements, spell correction, user clicks, and mouse tracking]

Query Process
● User interaction
○ supports creation and refinement of query, display of results

● Ranking
○ uses query and indexes to generate ranked list of documents

● Evaluation
○ monitors and measures effectiveness and efficiency (primarily offline)

Details: Text Acquisition
● Crawler
○ Identifies and acquires documents for search engine

○ Many types: web, enterprise (internal, inside the firewall), desktop

○ Web crawlers follow links to find documents

■ Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)

■ Single site crawlers for site search

■ Topical or focused crawlers for vertical search

● Documents on a topic link to other documents on the topic

● Use text classification to classify topic

○ Document crawlers for enterprise and desktop search

■ Follow links and scan directories
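
The link-following behaviour described above can be sketched as a breadth-first traversal over a frontier of URLs. This is a minimal sketch, not a production crawler: the `WEB` dictionary, its URLs, and the `crawl` function are hypothetical stand-ins for real HTTP fetching and link extraction, and politeness and freshness policies are left out.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the links found on that page.
WEB = {
    "a.example/": ["a.example/1", "a.example/2"],
    "a.example/1": ["a.example/2", "b.example/"],
    "a.example/2": ["a.example/"],
    "b.example/": [],
}


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl from `seed`; returns URLs in the order fetched."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid refetching
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


order = crawl("a.example/", lambda url: WEB.get(url, []))
print(order)
```

A focused crawler would add a classifier check before appending a link to the frontier; a single-site crawler would drop links whose host differs from the seed.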

Text Acquisition

● Feeds
○ Real-time streams of documents
■ e.g., web feeds for news, blogs, video, radio, TV
○ RSS (Really Simple Syndication) is a common standard
■ RSS “reader” can provide new XML documents to search engine
■ e.g., Feedly is an RSS reader for users
● Conversion
○ Convert variety of documents into a consistent text plus metadata format
■ e.g., HTML, XML, Word, PDF, etc. → XML
○ Convert text encoding for different languages
■ Using a Unicode encoding such as UTF-8
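
Both ideas on this slide can be illustrated with the Python standard library. The sketch below assumes a made-up RSS 2.0 document (`RSS`) and hypothetical helper names (`feed_items`, `to_utf8`); a real feed reader would also poll the feed and track which items are new.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed used only for illustration.
RSS = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>First story</title><link>http://news.example/1</link></item>
  <item><title>Second story</title><link>http://news.example/2</link></item>
</channel></rss>"""


def feed_items(rss_text):
    """Extract title and link of every <item> in an RSS document."""
    root = ET.fromstring(rss_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]


def to_utf8(raw_bytes, source_encoding):
    """Re-encode bytes from a known source encoding into UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


items = feed_items(RSS)
converted = to_utf8(b"caf\xe9", "latin-1")  # "café" stored as Latin-1 bytes
```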

Text Acquisition
● Document data store
○ Stores text, metadata, and other related content for documents

■ Metadata is information about document such as type and creation date

■ Other content includes title, links, anchor text

○ Provides fast access to document contents for search engine components

■ e.g. result list generation

○ Could use relational database system

■ Not designed for document storage (designed for structured data, e.g. numbers, dates etc.)

■ More typically, a simpler, more efficient storage system is used due to the huge numbers of documents
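
A "simpler, more efficient storage system" might look like a key-value store keyed by document ID. The sketch below is an in-memory illustration only; the `DocumentStore` class is hypothetical, and the use of JSON plus zlib compression is an assumption made for the example, not something the slides prescribe.

```python
import json
import zlib


class DocumentStore:
    """Toy key-value document store: doc_id -> compressed text + metadata."""

    def __init__(self):
        self._blobs = {}

    def put(self, doc_id, text, metadata):
        record = json.dumps({"text": text, "metadata": metadata})
        self._blobs[doc_id] = zlib.compress(record.encode("utf-8"))

    def get(self, doc_id):
        raw = zlib.decompress(self._blobs[doc_id])
        return json.loads(raw.decode("utf-8"))


store = DocumentStore()
store.put("d1", "Tropical fish", {"type": "html", "title": "Fish"})
record = store.get("d1")
```

Lookup by ID is the only access path, which is all that result list generation needs.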

Text Transformation

● Parser and Tokenizer

○ Parser processes the sequence of text tokens in the document to recognize structural elements

■ e.g., titles, links, headings, etc.

○ Tokenizer recognizes “words” in the text

■ must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators, e.g., [c++], [AT&T]

○ Markup languages such as HTML, XML often used to specify structure

■ Tags used to specify document elements

● E.g., <h2> Overview </h2>

■ Document parser uses syntax of markup language (or other formatting) to identify structure
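
The two roles above can be sketched with the standard library. The regular expression `TOKEN_RE` and the `HeadingParser` class are illustrative assumptions: they show one way to keep tokens such as "c++" and "AT&T" intact and to pick out `<h2>` elements, not the rules any particular search engine uses.

```python
import re
from html.parser import HTMLParser

# Letters/digits, optionally joined by & + ' or -, plus trailing "+" (for "c++").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[&+'\-][a-z0-9]+)*\+*")


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return TOKEN_RE.findall(text.lower())


class HeadingParser(HTMLParser):
    """Collects the text of every <h2> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.headings[-1] += data


tokens = tokenize("AT&T hired C++ programmers; it's state-of-the-art.")

parser = HeadingParser()
parser.feed("<h2> Overview </h2><p>Body text</p>")
parser.close()
titles = [h.strip() for h in parser.headings]
```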

Text Transformation
● Stopping - stop words
○ Remove common words

■ e.g., “and”, “or”, “the”, “in”

○ Some impact on efficiency and effectiveness

○ Can be a problem for some queries, e.g., [to be or not to be]

● Stemming
○ Group words derived from a common stem

■ e.g., “computer”, “computers”, “computing”, “compute”

○ Usually effective, but not for all queries, e.g., [transformers] is stemmed to [transformer]

○ Benefits vary for different languages
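
Stopping and stemming can be sketched in a few lines. The stop list below extends the slide's examples with "to", "be", and "not" as an assumption, so that the problem query becomes visible. `crude_stem` is a toy suffix stripper written for this example; it is not the Porter stemmer or any other published algorithm.

```python
# Slide examples plus "to", "be", "not" (added here as an assumption).
STOP_WORDS = {"and", "or", "the", "in", "to", "be", "not"}

# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")


def remove_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {crude_stem(w) for w in ["computer", "computers", "computing", "compute"]}
hamlet = remove_stop_words("to be or not to be".split())
```

All four example words collapse to the single stem "comput", while the query "to be or not to be" is reduced to an empty list, which is the failure case noted above.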

Text Transformation
● Link Analysis
○ Makes use of links and anchor text in web pages

○ Link analysis identifies popularity and community information

■ e.g., PageRank (a particular link analysis algorithm)

○ Anchor text can significantly enhance the representation of pages pointed to by links

○ Significant impact on web search

■ Less important in other applications

● Because the web is link-heavy
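
As an illustration of link analysis, PageRank can be approximated by power iteration over the link graph. This is a minimal sketch: the graph `LINKS` is made up, the damping factor 0.85 and 50 iterations are conventional choices assumed here, and dangling pages are handled by spreading their rank evenly.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> rank."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: distribute its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Made-up four-page graph: C is linked to by A, B, and D.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
```

Page C, with the most incoming links, ends up with the highest rank; page D, which nothing links to, ends up with the lowest.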

Text Transformation

● Information Extraction
○ Identify classes of index terms that are important for some applications

○ e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.

● Classifier
○ Identifies class-related metadata for documents

■ i.e., assigns labels to documents

■ e.g., topics, reading levels, sentiment, genre

○ Use depends on application


Index Creation

● Document Statistics
○ Gathers counts and positions of words and other features

○ Used in ranking algorithm

● Weighting
○ Computes weights for index terms (how salient is a term for a document?)

○ Used in ranking algorithm

○ e.g., tf.idf weight

■ Combination of term frequency in the document (tf) and inverse document frequency in the collection (idf)

■ tf: frequency of the term in the document

■ idf: in its simplest form, 1 / number of documents the term occurs in (document frequency, or df)
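
A tf.idf weight can be computed directly from term counts. The sketch below uses the common logarithmic variant, idf = log(N / df), which is an assumption of this example rather than the bare 1/df form; the three documents in `DOCS` are made up.

```python
import math
from collections import Counter

# Made-up three-document collection.
DOCS = {
    "d1": "tropical fish include fish found in tropical environments",
    "d2": "fish live in water",
    "d3": "tropical storms form over warm water",
}


def tf_idf(docs):
    """Return doc_id -> {term: tf * log(N / df)} for every term in the doc."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))          # count each term once per document
    weights = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


weights = tf_idf(DOCS)
```

Here "fish" occurs twice in d1 but appears in two of the three documents, so its weight in d1 is 2 · log(3/2); "include" occurs once but only in d1, giving log(3).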

Index Creation
● Inversion

○ Core of indexing process

[Figure: term-by-document count matrix (terms w1, w2, ..., w100000000000000; documents d1, d2, ...; sample counts 0 and 3)]

○ Converts document-term information to term-document for indexing

■ Difficult for very large numbers of documents

● w1 -> d23:2, d456789:17, d23598237459:1, d823475293874:25

● w23457823 -> d23, d1234

○ Format of inverted file is designed for fast query processing

■ Must also handle updates

■ Compression used for efficiency
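
Inversion turns "document -> terms" into "term -> documents". The sketch below builds posting lists of (document ID, count) pairs in memory and shows gap (delta) encoding of document IDs, one common first step toward compression. The function names and the data in `DOCS` are hypothetical, and the in-memory approach sidesteps the large-scale difficulties noted above.

```python
from collections import Counter, defaultdict

# Made-up collection: doc_id -> list of index terms.
DOCS = {
    1: ["fish", "tropical", "fish"],
    2: ["fish", "water"],
    3: ["tropical", "water", "water"],
}


def invert(docs):
    """Return term -> [(doc_id, count), ...] with postings sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, count in sorted(Counter(docs[doc_id]).items()):
            index[term].append((doc_id, count))
    return dict(index)


def doc_gaps(postings):
    """Replace each doc_id by its difference from the previous doc_id."""
    ids = [doc_id for doc_id, _ in postings]
    return [b - a for a, b in zip([0] + ids, ids)]


index = invert(DOCS)
gaps = doc_gaps(index["tropical"])
```

Gaps are small numbers for frequent terms, which is what makes variable-length integer codes effective on posting lists.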

Index Creation

● Index Distribution
○ Distributes indexes across multiple computers and/or multiple sites

○ Essential for fast query processing with large numbers of documents

○ Many variations

■ Document distribution, term distribution, replication

○ P2P and distributed IR involve search across multiple sites
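
Document distribution can be sketched by assigning each document to a shard and building a separate inverted index per shard. Assigning by `doc_id % n_shards` is an assumption made for this example; `partition_by_document` and `DOCS` are hypothetical names and data.

```python
from collections import defaultdict

# Made-up collection: doc_id -> list of index terms.
DOCS = {
    1: ["fish", "tropical"],
    2: ["fish", "water"],
    3: ["tropical", "water"],
    4: ["fish"],
}


def partition_by_document(docs, n_shards):
    """Build one small inverted index (term -> doc_ids) per shard."""
    shards = [defaultdict(list) for _ in range(n_shards)]
    for doc_id in sorted(docs):
        shard = shards[doc_id % n_shards]   # each document lives on one shard
        for term in sorted(set(docs[doc_id])):
            shard[term].append(doc_id)
    return [dict(shard) for shard in shards]


shards = partition_by_document(DOCS, n_shards=2)
```

With term distribution, by contrast, each shard would hold the complete posting lists for a subset of the terms.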


User Interaction

● Query input
○ Provides interface and parser for query language

○ Most web queries are very simple, other applications may use forms

○ Query language used to describe more complex queries and results of query transformation

■ e.g., Boolean queries, Indri and Galago query languages

■ similar to SQL language used in database applications

■ IR query languages also allow content and structure specifications, but focus on content

User Interaction
● Query transformation
○ Improves initial query, both before and after initial search

○ Includes text transformation techniques used for documents

○ Spell checking and query suggestion provide alternatives to original query

○ Query expansion and relevance feedback modify the original query with additional terms
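
Two of these transformations, spell correction and query expansion, can be sketched with the standard library. The vocabulary, the synonym table, and the 0.8 similarity cutoff are assumptions made for this example; real systems typically derive such data from query logs and the collection itself.

```python
import difflib

# Hypothetical index vocabulary and hand-written expansion table.
VOCABULARY = ["tropical", "fish", "aquarium", "water", "tank"]
SYNONYMS = {"fish": ["fishes"], "tank": ["aquarium"]}


def correct(term, vocabulary=VOCABULARY):
    """Return the closest vocabulary word, or the term itself if none is close."""
    if term in vocabulary:
        return term
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else term


def transform_query(query):
    """Spell-correct each term, then append expansion terms."""
    terms = [correct(t) for t in query.lower().split()]
    expanded = list(terms)
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


result = transform_query("tropcal fish tank")
```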

User Interaction

● Results output
○ Constructs the display of ranked documents for a query

○ Generates snippets to show how queries match documents

○ Highlights important words and passages

○ Retrieves appropriate advertising in many applications

○ May provide clustering and other visualization tools
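
Snippet generation with highlighting can be sketched as a window of words around the first query-term match. The `snippet` function, the window width, and the `**...**` highlight markers are assumptions made for this example; production systems usually score several candidate passages instead of taking the first hit.

```python
import re


def snippet(text, query_terms, width=5):
    """Return a window of words around the first match, with matches marked."""
    words = text.split()
    terms = {t.lower() for t in query_terms}

    def norm(word):
        # Strip punctuation so that "fish," still matches "fish".
        return re.sub(r"\W+", "", word).lower()

    hits = [i for i, w in enumerate(words) if norm(w) in terms]
    if not hits:
        return " ".join(words[: 2 * width])
    start = max(0, hits[0] - width)
    end = min(len(words), hits[0] + width + 1)
    marked = [f"**{w}**" if norm(w) in terms else w for w in words[start:end]]
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(words) else ""
    return prefix + " ".join(marked) + suffix


TEXT = ("Many aquarium owners keep tropical fish, which need warm water "
        "and regular feeding.")
result = snippet(TEXT, ["tropical", "fish"], width=2)
```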


Ranking
● Scoring
○ Calculates scores for documents using a ranking algorithm

○ Core component of search engine

○ Basic form of score is $\sum_i q_i d_i$

■ $q_i$ and $d_i$ are the query and document term weights for term $i$

○ Many variations of ranking algorithms and retrieval models
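
The basic score is a dot product between query and document term weights. The sketch below uses made-up weights and hypothetical function names; in practice the weights would come from a retrieval model such as tf.idf.

```python
def score(query_weights, doc_weights):
    """Sum of q_i * d_i over the query terms (missing terms count as 0)."""
    return sum(q * doc_weights.get(term, 0.0) for term, q in query_weights.items())


def rank(query_weights, docs):
    """Return (score, doc_id) pairs, highest score first."""
    scored = [(score(query_weights, weights), doc_id) for doc_id, weights in docs.items()]
    return sorted(scored, reverse=True)


# Made-up term weights for three documents and one query.
DOCS = {
    "d1": {"tropical": 0.5, "fish": 1.0},
    "d2": {"fish": 0.5, "water": 1.0},
    "d3": {"tropical": 1.0, "water": 0.5},
}
QUERY = {"tropical": 1.0, "fish": 1.0}

ranking = rank(QUERY, DOCS)
```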

Ranking

● Performance optimization
○ Designing ranking algorithms for efficient processing

■ Term-at-a-time vs. document-at-a-time processing

■ Safe vs. unsafe optimizations

● Distribution
○ Processing queries in a distributed environment

○ Query broker distributes queries and assembles results

○ Caching is a form of distributed searching
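
Term-at-a-time processing and a query broker can be sketched together. Each shard scores its own documents by walking one posting list at a time into score accumulators, and the broker merges the per-shard top-k lists. The shard contents and function names are hypothetical, and the shards are assumed to be document-partitioned, so each document is scored on exactly one shard.

```python
import heapq
from collections import defaultdict


def term_at_a_time(index, query_terms, k):
    """index: term -> [(doc_id, weight)]. Returns top-k (score, doc_id) pairs."""
    accumulators = defaultdict(float)
    for term in query_terms:                 # finish one posting list ...
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight   # ... before starting the next
    return heapq.nlargest(k, ((s, d) for d, s in accumulators.items()))


def broker(shards, query_terms, k):
    """Send the query to every shard and merge the partial top-k results."""
    partial = [term_at_a_time(shard, query_terms, k) for shard in shards]
    return heapq.nlargest(k, (hit for hits in partial for hit in hits))


# Two made-up document-partitioned shards.
SHARDS = [
    {"fish": [(2, 1.0), (4, 2.0)], "water": [(2, 1.5)]},
    {"fish": [(1, 3.0)], "tropical": [(1, 1.0), (3, 1.5)]},
]

top = broker(SHARDS, ["fish", "tropical"], k=2)
```

Document-at-a-time processing would instead advance through all query-term posting lists in parallel and finish scoring one document before moving to the next.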


Evaluation

● Logging
○ Logging user queries and interaction is crucial for improving search effectiveness and efficiency

○ Query logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other components

● Ranking analysis
○ Measuring and tuning ranking effectiveness

● Performance analysis
○ Measuring and tuning system efficiency
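
Ranking analysis needs a number to tune against. The sketch below computes two widely used effectiveness measures, precision at k and reciprocal rank, for a single query. The ranked list and the relevance judgments are made up; in practice the judgments might come from assessors or be inferred from clickthrough logs.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# Made-up system output and relevance judgments for one query.
RANKED = ["d3", "d1", "d7", "d2", "d9"]
RELEVANT = {"d1", "d2", "d5"}

p3 = precision_at_k(RANKED, RELEVANT, 3)
p5 = precision_at_k(RANKED, RELEVANT, 5)
rr = reciprocal_rank(RANKED, RELEVANT)
```

Averaging such per-query values over a logged query sample gives a single figure that can be compared before and after a ranking change.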

How Does It Really Work?

● This course explains these components of a search engine in more detail

● There are often many possible approaches and techniques for a given component
○ Focus is on the most important alternatives

○ i.e., explain a small number of approaches in detail rather than many approaches

○ “Importance” based on research results and use in actual search engines

○ Alternatives described in references

See you next time
