Topic 2 W2 - SDR - Edited - March2023

The document discusses the key components and processes involved in building a search engine, including indexing documents, transforming text, handling user queries, ranking results, and evaluating system performance.

Uploaded by

VISALINI VIJAYAN

Search Engines

Information Retrieval in Practice

Search Engine Architecture
• A software architecture consists of software
components, the interfaces provided by those
components, and the relationships between
them
– describes a system at a particular level of abstraction
• Architecture of a search engine determined by 2
requirements
– effectiveness (quality of results) and efficiency
(response time)
Indexing Process
• Text acquisition
– identifies and stores documents for indexing
• Text transformation
– transforms documents into index terms or
features
• Index creation
– takes index terms and creates data structures
(indexes) to support fast searching
Query Process
• User interaction
– supports creation and refinement of query, display
of results
• Ranking
– uses query and indexes to generate ranked list of
documents
• Evaluation
– monitors and measures effectiveness and
efficiency (primarily offline)
Details: Text Acquisition
• Crawler
– Identifies and acquires documents for search
engine
– Many types – web, enterprise, desktop
– Web crawlers follow links to find documents
• Must efficiently find huge numbers of web pages
(coverage) and keep them up-to-date (freshness)
• Single site crawlers for site search
• Topical or focused crawlers for vertical search
– Document crawlers for enterprise and desktop
search
• Follow links and scan directories
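The link-following behaviour of a web crawler can be sketched as a breadth-first traversal. This is only a minimal illustration: the `WEB` dictionary is a hypothetical in-memory stand-in for real page fetching, and the names `crawl` and `max_pages` are assumptions made for this example.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the links found on that page.
WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def crawl(seed, max_pages=100):
    """Breadth-first crawl from a seed URL; returns pages in visit order."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # avoids queuing the same URL twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)                # a real crawler would fetch and store here
        for link in WEB.get(url, []):      # stand-in for link extraction
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl("a.example/"))
```

The `seen` set is what keeps the crawler from looping on pages that link back to each other; coverage and freshness policies would sit on top of this loop.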
Text Acquisition
• Feeds
– Real-time streams of documents
• e.g., web feeds for news, blogs, video, radio, TV
– RSS is a common standard
• RSS “reader” can provide new XML documents to search
engine (See: https://edition.cnn.com/services/rss/)
• Conversion
– Convert variety of documents into a consistent text
plus metadata format
• e.g. HTML, XML, Word, PDF, etc. → XML
– Convert text encoding for different languages
• Using a Unicode encoding such as UTF-8
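As a rough sketch of the conversion step, the following turns an HTML page in a legacy encoding into a uniform text-plus-metadata record. It uses Python's standard `html.parser`; the record layout (`title`, `text`) and the names `TextExtractor` and `convert` are assumptions for illustration, not a format prescribed by the slides.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the title and the visible text of an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())

def convert(raw_bytes, source_encoding):
    """Decode bytes from the source encoding and return a uniform record."""
    html = raw_bytes.decode(source_encoding)
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "text": " ".join(parser.chunks)}

# A Latin-1 encoded page is decoded once, then handled as Unicode text.
raw = "<html><title>Café</title><body><p>Menu du jour</p></body></html>".encode("latin-1")
record = convert(raw, "latin-1")
print(record)
print(record["title"].encode("utf-8"))  # bytes as they would be stored in UTF-8
```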
Text Acquisition
• Document data store
– Stores text, metadata, and other related content
for documents
• Metadata is information about document such as type
and creation date
• Other content includes links, anchor text
– Provides fast access to document contents for
search engine components
• e.g. result list generation
– Could use relational database system
• More typically, a simpler, more efficient storage system
is used due to huge numbers of documents
Text Transformation
• Parser
– Processing the sequence of text
tokens in the document to recognize
structural elements
• e.g., titles, links, headings, etc.
– Tokenizer recognizes “words” in the
text
• must consider issues like
capitalization, hyphens,
apostrophes, non-alpha
characters, separators
– Markup languages such as HTML,
XML often used to specify structure
• Tags used to specify document
elements
– E.g., <h2> Overview </h2>
• Document parser uses syntax of
markup language (or other
formatting) to identify structure
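The tokenizer issues listed above (capitalization, hyphens, apostrophes, separators) can be made concrete with a small regular-expression tokenizer. This is one possible set of choices, assumed for illustration: lowercase everything, and keep internal hyphens and straight apostrophes inside a token.

```python
import re

# A token is a run of letters/digits, optionally joined by single
# hyphens or apostrophes (so "e-mail" and "o'connor's" stay whole).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("O'Connor's state-of-the-art e-mail, sent 24/7!"))
```

Applied to raw markup such as `<h2> Overview </h2>`, this tokenizer would also emit the tag names as tokens, which is why the document parser needs to identify structure before or alongside tokenization.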
Text Transformation
• Stopping
– Remove common words
• e.g., “and”, “or”, “the”, “in”
– Some impact on efficiency and effectiveness
– Can be a problem for some queries
• Stemming
– Group words derived from a common stem
• e.g., “computer”, “computers”, “computing”, “compute”
– Usually effective, but not for all queries
– Benefits vary for different languages
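Stopping and stemming can be sketched in a few lines. The stopword list below is just the four examples from the slide, and `crude_stem` is a deliberately naive suffix stripper written for this example; it is not the Porter stemmer or any other published algorithm.

```python
STOPWORDS = {"and", "or", "the", "in"}           # the examples from the slide
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")  # crude, illustrative only

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    """Strip the first matching suffix, keeping at least 4 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token

tokens = "the computer and the computers in computing".split()
print([crude_stem(t) for t in remove_stopwords(tokens)])
```

A query such as "the who" shows why stopping can hurt: after stopword removal only "who" remains, so the band name can no longer be matched as a phrase.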
Text Transformation
• Link Analysis
– Makes use of links and anchor text in web pages
– Link analysis identifies popularity and community
information
• e.g., PageRank
– Anchor text can significantly enhance the representation
of pages pointed to by links
– Significant impact on web search
• Less importance in other applications
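To show how link structure can be turned into a popularity score, here is a minimal power-iteration sketch of PageRank. It follows one common textbook formulation; the damping factor of 0.85, the fixed iteration count, and the handling of pages without outlinks are assumptions of this example, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph, page C is linked to by both A and B, so it ends up with the highest score; B has no incoming links and ends up lowest.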
Text Transformation
• Information Extraction
– Identify classes of index terms that are important
– e.g., named entity recognizers identify classes
such as people, locations, companies, dates, etc.
• Classifier
– Identifies class-related metadata for documents
• i.e., assigns labels to documents
• e.g., topics, reading levels, sentiment, genre
Index Creation
• Document Statistics
– Gathers counts and positions of words and other
features
– Used in ranking algorithm
• Weighting
– Computes weights for index terms
– Used in ranking algorithm
– e.g., tf.idf weight
• Combination of term frequency in document and inverse
document frequency in the collection
• to reflect how important a word is to a document in a
collection or corpus
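A small worked version of tf.idf weighting may help. Many variants exist; this sketch assumes the simplest one, raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name and toy collection are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: doc id -> list of tokens. Returns doc id -> {term: weight}."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)            # term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

docs = {
    "d1": ["fish", "fish", "tank"],
    "d2": ["fish", "food"],
    "d3": ["tropical", "fish"],
}
w = tf_idf(docs)
print(w["d1"])
```

Here "fish" occurs in every document, so its weight is zero everywhere, while "tank" occurs in only one document and gets the weight log(3) in d1.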
Index Creation
• Inversion
– Core of indexing process
– Converts document-term information to term-
document for indexing
• Difficult for very large numbers of documents
– Format of inverted file is designed for fast query
processing
• Must also handle updates
• Compression used for efficiency
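Inversion can be illustrated by turning a document-to-terms mapping into a term-to-documents mapping with word positions. The posting layout used here, a sorted list of `(doc_id, positions)` pairs, and the gap-based `delta_encode` helper are assumptions chosen for this sketch; real inverted file formats differ.

```python
from collections import defaultdict

def invert(docs):
    """docs: doc id -> list of tokens. Returns term -> sorted [(doc id, [positions])]."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return {term: sorted(postings.items()) for term, postings in index.items()}

def delta_encode(doc_ids):
    """Store gaps between sorted doc ids; small gaps compress well."""
    if not doc_ids:
        return []
    gaps = [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return [doc_ids[0]] + gaps

index = invert({1: ["tropical", "fish"], 2: ["fish", "tank", "fish"]})
print(index["fish"])
print(delta_encode([3, 7, 8, 15]))
```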
Index Creation
• Index Distribution
– Distributes indexes across multiple computers
and/or multiple sites
– Essential for fast query processing with large
numbers of documents
– Many variations
• Document distribution, term distribution, replication
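Document distribution can be sketched by assigning each document to a shard with a stable hash. The use of `zlib.crc32` and the function names are assumptions for this example; the same `shard_for` helper could be applied to terms instead of document ids to illustrate term distribution.

```python
import zlib

def shard_for(key, n_shards):
    """Stable shard number for a string key (document id or term)."""
    return zlib.crc32(key.encode("utf-8")) % n_shards

def distribute_documents(docs, n_shards):
    """Split a doc id -> tokens mapping into n_shards smaller mappings."""
    shards = [dict() for _ in range(n_shards)]
    for doc_id, tokens in docs.items():
        shards[shard_for(doc_id, n_shards)][doc_id] = tokens
    return shards

docs = {f"doc{i}": ["term"] for i in range(10)}
shards = distribute_documents(docs, 3)
print([len(s) for s in shards])
```

Each shard can then be indexed and searched on its own machine; replication would keep additional copies of the same shard.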
User Interaction
• Query input
– Provides interface and parser for query language
– Most web queries are very simple; other
applications may use forms
– Query language used to describe more complex
queries and results of query transformation
• e.g., Boolean queries, Indri and Galago query languages
• similar to SQL language used in database applications
• IR query languages also allow content and structure
specifications, but focus on content
User Interaction
• Query transformation
– Improves initial query, both before and after initial
search
– Includes text transformation techniques used for
documents
– Spell checking and query suggestion provide
alternatives to original query
– Query expansion and relevance feedback modify
the original query with additional terms
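Two of the query transformations above, spelling suggestion and query expansion, can be sketched with the standard library. The vocabulary and synonym table are hypothetical toy data, and using `difflib.get_close_matches` with a cutoff of 0.8 is an assumption of this example; production systems typically rely on query logs instead.

```python
import difflib

VOCABULARY = ["search", "engine", "index", "ranking", "retrieval"]  # hypothetical
SYNONYMS = {"engine": ["system"]}                                   # hypothetical

def suggest(term):
    """Return the closest vocabulary term, or the term itself if none is close."""
    matches = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else term

def transform(query):
    """Spell-correct each query term, then append any known synonyms."""
    corrected = [suggest(t) for t in query.lower().split()]
    expanded = list(corrected)
    for t in corrected:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(transform("serch engin"))
```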
User Interaction
• Results output
– Constructs the display of ranked documents for a
query
– Generates snippets to show how queries match
documents
– Highlights important words and passages
– Retrieves appropriate advertising in many
applications
– May provide clustering and other visualization
tools
[Figure: results page annotated with snippets, highlights, ads, and clusters]
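Snippet generation with highlighting can be sketched as picking a window of words around the first query-term match. The window size, the `**...**` highlight markers, and the function name are assumptions of this example, not how any particular search engine does it.

```python
import re

def snippet(text, query_terms, width=6):
    """Return a window of words around the first match, with matches marked."""
    words = text.split()
    terms = {t.lower() for t in query_terms}

    def norm(word):
        # Strip punctuation and case so "fish," matches "fish".
        return re.sub(r"\W+", "", word).lower()

    hits = [i for i, w in enumerate(words) if norm(w) in terms]
    if not hits:
        return " ".join(words[:width])      # fall back to the document start
    start = max(0, hits[0] - width // 2)
    window = words[start:start + width]
    return " ".join(f"**{w}**" if norm(w) in terms else w for w in window)

text = "Tropical fish are popular aquarium fish, kept in tanks around the world."
print(snippet(text, ["aquarium", "tanks"]))
```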
Ranking
• Scoring
– Calculates scores for documents using a ranking
algorithm
– Core component of search engine
– Basic form of score is $\sum_i q_i d_i$
• $q_i$ and $d_i$ are query and document term weights for
term $i$
– Many variations of ranking algorithms and
retrieval models
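The basic score form, a sum over terms of the query weight times the document weight, is a dot product over the terms that the query and the document share. The sketch below assumes weights are held in plain dictionaries; the weights themselves are made-up toy values.

```python
def score(query_weights, doc_weights):
    """Sum of q_i * d_i over query terms; absent terms contribute 0."""
    return sum(q * doc_weights.get(term, 0.0) for term, q in query_weights.items())

def rank(query_weights, docs):
    """Return (score, doc id) pairs, highest score first."""
    scored = [(score(query_weights, w), doc_id) for doc_id, w in docs.items()]
    return sorted(scored, reverse=True)

query = {"tropical": 1.0, "fish": 1.0}
docs = {
    "d1": {"tropical": 2.0, "fish": 1.0},
    "d2": {"fish": 0.5, "tank": 1.5},
    "d3": {"tank": 2.0},
}
print(rank(query, docs))
```

Different retrieval models mostly differ in how the `q_i` and `d_i` weights are computed, not in this outer summation.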
Ranking
• Performance optimization
– Designing ranking algorithms for efficient
processing
• Distribution
– Processing queries in a distributed environment
– Query broker distributes queries and assembles
results
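The query broker's role can be sketched as sending the query to each index shard and merging the partial result lists. The shards here are in-memory dictionaries and the scoring is a simple count of query-term occurrences, both assumptions made to keep the example self-contained.

```python
import heapq

def search_shard(shard, query_terms):
    """Score each document in one shard by counting query-term occurrences."""
    results = []
    for doc_id, tokens in shard.items():
        s = sum(tokens.count(t) for t in query_terms)
        if s > 0:
            results.append((s, doc_id))
    return results

def broker(shards, query_terms, k=3):
    """Send the query to every shard and merge the partial results."""
    partial = [search_shard(shard, query_terms) for shard in shards]
    merged = [hit for results in partial for hit in results]
    return heapq.nlargest(k, merged)

shards = [
    {"d1": ["fish", "tank"], "d2": ["fish", "fish", "food"]},
    {"d3": ["tropical", "fish", "fish", "fish"], "d4": ["cat"]},
]
print(broker(shards, ["fish"], k=2))
```

In a real deployment the calls to `search_shard` would be remote requests issued in parallel, with the broker waiting for the replies before merging.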
Evaluation
• Logging
– Logging user queries and interaction is crucial for
improving search effectiveness and efficiency
– Query logs and clickthrough data used for query
suggestion, spell checking, query caching, ranking,
advertising search, and other components
• Ranking analysis
– Measuring and tuning ranking effectiveness
• Performance analysis
– Measuring and tuning system efficiency
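As a rough illustration of ranking analysis and performance analysis, the sketch below computes precision at rank k from relevance judgments and the mean latency from a query log. Precision at k is one common effectiveness measure, chosen here as an assumption; the log format with a `latency_ms` field is hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are judged relevant."""
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k

def mean_latency_ms(log):
    """Average response time over logged queries."""
    return sum(entry["latency_ms"] for entry in log) / len(log)

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
log = [
    {"query": "fish", "latency_ms": 12.0},
    {"query": "tank", "latency_ms": 18.0},
]
print(precision_at_k(ranked, relevant, 2))
print(mean_latency_ms(log))
```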
• END

All slides ©Addison Wesley, 2008
