L01

The document discusses the evolution and applications of Information Retrieval (IR) systems, emphasizing that search engines like Google were not the original creators of search technology. It outlines various types of data, search applications, and the challenges faced in IR, including the difficulty of capturing semantics and the need for efficient indexing. Additionally, it highlights the importance of search engines in accessing unstructured data and the competitive landscape of the search industry.

Uploaded by

Stephen Chow


Search Engine was not Created by Google!

It has many names:

• Information retrieval (IR): dates back to the 1950s as one
of the major applications of computers
• Document retrieval: “Information” could mean many
things; “document” refers to natural language texts
organized in some predefined structures (books,
reports, letters)
• Text retrieval: Texts are strings of characters with
little or no structure; no images or videos
Applications

• Digital libraries: All materials in digital form,
accessible and searchable digitally
• Web search: Search anything accessible on the Web;
include non-text content, although this course
focuses on texts (HTML pages)
• Vertical search: Search in a particular domain, e.g.,
image, video, news, product (e-commerce) search
– If we consider web search as “horizontal” search, vertical
search focuses on a particular segment, topic or data type
and provides better search functions for its focus compared
to a general web search engine
Types of Data

• Unformatted or unstructured data (as opposed to
relational database)
– Textual data: papers, technical reports, newspaper articles
– Completely untagged, plain-text data

• Semi-structured data
– Web pages (HTML and XML files)
– Email messages

• Non-textual/multimedia data
– images, graphics, video
Examples of IR Systems :

• Examples of famous search engines are Google, Bing,
Baidu (GBB), …
– Stand-alone search engines (i.e., interact directly with users
and search is the only function they provide)
Other Examples of IR Systems

• Most people use IR in embedded forms
– In Windows 10, search is one of the many functions of the
operating system
– Search is provided to users as a function or service offered
by the application (e.g., in a library system) instead of as a
standalone search engine
Library systems

• Books: http://ustlib.ust.hk/ (HKUST library)

[Figure: library search page; callout: Federated search]
Result Page has more Functions

• Unlike Google, libraries have more structured data (fields / facets)

Search in Different Applications
• Vertical Search: A search engine for one type of data or one focus area
– Data could be maintained on multiple sites in the vertical search or
aggregated from multiple external sites
• E.g., Job search, News search, Movie search, …
• Site Search: A search engine for one site (or group of related sites)
– hsbc.com.hk, ust.hk, …
• Custom Search: A search frontend to a (big) backend search engine
to narrow search to a small set of websites
– Ust.hk/search-engine?...
Both Google Site Search and Google Custom
Search are Google products, but the idea is
applicable to other search engines

• Enterprise Search: A search engine for a corporate intranet

– Multiple types of data (databases, Office documents, emails, …)
– Different user roles (sales vs technical support vs CEO …)
– Security, security, security, …
Embedded Search Engines on Devices

• Search engines embedded on portable devices (mobile phones, USB
thumb drives, CD ROMs)
• Search engines are tailored for the data on device
– E.g., Electronic encyclopaedia, product catalogues, corporate reports, etc.
– Don’t forget that a CD could hold 600 Mbytes of text!
• Special requirements:
– No installation needed; built-in and executable
– Provide adequate interface (e.g., web-based)
– Fast and resource sensitive (running on small devices)
File Search on UNIX/LINUX

– UNIX grep commands (grep, egrep, agrep, etc.)

$ grep comp4321 input-file1 input-file2 …

[Figure: Input files → grep (Query = comp4321) → Matched lines in input files]

– man -k keyword
• Search UNIX man pages

– These are simple "search engines", although their search functions are
extremely simple and primitive!
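
To make the grep idea concrete, here is a minimal Python sketch of the same line-matching search. It is not how grep is implemented; the function name `grep_lines` and the in-memory stand-ins for input-file1 and input-file2 are assumptions made for illustration.

```python
import re

def grep_lines(pattern, named_texts):
    """Return (name, line_number, line) for every line that matches pattern."""
    regex = re.compile(pattern)
    hits = []
    for name, text in named_texts.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((name, line_number, line))
    return hits

# Hypothetical in-memory stand-ins for input-file1 and input-file2.
sample_files = {
    "input-file1": "comp4321 lecture notes\nunrelated line",
    "input-file2": "nothing here\nsee comp4321 lab 1",
}

grep_hits = grep_lines("comp4321", sample_files)
for name, line_number, line in grep_hits:
    print(f"{name}:{line_number}:{line}")
```

Every query rescans every line of every file, which is why this style of search is simple but does not scale; an index avoids the rescan.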
How do you Search for Files on Windows?

• Search for files: plain text, MS Office files, email, etc.

• Specify filenames, dates, file types, etc.
• Windows built-in search function, Yahoo Desktop, Google
Desktop, Windows Desktop, etc.
Desktop Search Examples

• Windows desktop search has been integrated into Windows
• Copernic is still available
• Google Desktop has long been discontinued

[Figure: search result]
Index/Search on Windows 10

• Windows 10 Index Option allows you to specify:

– Folders to index
– Index encrypted files or not
– To index properties only or properties plus content for different file types
– Rebuild index at any time
Web Search Engines (GBB: Google/Bing/Baidu)

• World Wide Web search engines, or web search
– Most popular IR application nowadays, e.g., Google, Bing, Baidu
• Other niche search engines: DuckDuckGo, Yandex, etc.

[Figure: web search architecture. Components: web pages, spider/crawler, local copy (cache), index, search engine, queries, browser, user. A real implementation has many other things that are not shown here.]
Federated and Meta Search

• Several search engines are used to return complete
search results across all search engines
• A query will be dispatched to all search engines and
search results are sent back to the originator, which
will integrate the results

[Figure: two diagrams, "Federated Search" and "Meta Search", each showing queries dispatched to several search engines (SE #1, SE #2, …) and results returned]
Federated vs Meta Search
• Federated Search
– Each node is a full-function SE for its own collection
– Search engines agree to join the Federated Search
– Agree to use the same standard for query/result representation and
API (e.g., ANSI/NISO Z39.50 for libraries)
– SEs can collaborate to perform a search
– E.g., HKALL (HK Academic Lib Link)
• Meta Search
– Meta-search node passes queries and search results between users
and underlying SEs; itself is not a SE
– Agreement is not needed; unwilling SEs can block search requests
– Queries and results of underlying SEs can have different formats;
meta-search performs query transformation and data aggregation
– Participating SEs do not collaborate with each other
– E.g., Dogpile.com
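
The meta-search side of this comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `engine_one`, `engine_two`, their tiny corpora, the example.com-style URLs, and the "count how many engines returned the URL" merge rule are all hypothetical, not taken from any real meta-search engine.

```python
def engine_one(query):
    # Hypothetical engine that returns (url, score) tuples.
    corpus = {
        "http://a.example/ir": "information retrieval intro",
        "http://a.example/db": "relational database basics",
    }
    return [(url, 1.0) for url, text in corpus.items() if query in text]

def engine_two(query):
    # Hypothetical engine that returns dicts, i.e. a different result format.
    corpus = {
        "http://b.example/search": "web search and information retrieval",
        "http://a.example/ir": "information retrieval intro",
    }
    return [{"link": url, "rank": 1} for url, text in corpus.items() if query in text]

def meta_search(query):
    """Dispatch the query to every engine, normalize formats, merge results."""
    votes = {}
    for url, _score in engine_one(query):
        votes[url] = votes.get(url, 0) + 1
    for hit in engine_two(query):
        votes[hit["link"]] = votes.get(hit["link"], 0) + 1
    # Assumed merge rule: URLs returned by more engines come first;
    # ties are broken alphabetically so the order is deterministic.
    return sorted(votes, key=lambda url: (-votes[url], url))

merged_results = meta_search("retrieval")
print(merged_results)
```

The meta-search node holds no index of its own: it only transforms each engine's result format into a common one and aggregates, which matches the "itself is not a SE" point above.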
Differences from Web Search (GBB)
• Technologies for all these different forms of search are
more or less the same, but in enterprise or product search
– Data are more structured:
• Data are grouped into “collections”, e.g., products, press releases,
news, manuals, records dumped from database tables
• Search can be applied to a subset of the collections
– Query format:
• Standard AND/OR, phrase, etc.
• Search on fields: titles, authors, within date range, etc.
– Result page: Grouped by document types, ranked by date or
relevance, etc.
• Example: search on amazon.com; what search features
are most useful to you that are available on GBB?
Why is IR Important? Needed Everywhere!

• Most information available is in textual form and has no
predefined format (e.g., emails and newsgroup articles)
– You may think businesses store data in structured databases, but >80%
of business information is unstructured and mostly in text
• Text retrieval capability is integrated into most relational database
systems. SQL already supports limited search capability such as
wildcard pattern matching with LIKE:
– select * from Employee where Name like '%Lee%'
• Increasing number of online documentation systems
(no more hardcopy!)
• Of course, the boom of the World Wide Web
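
The SQL query on this slide can be tried with Python's built-in sqlite3 module. The sketch below is illustrative only: the Employee rows are made-up data, and the Salary column is an assumption added to make the table look realistic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Amy Lee", 120000), ("Ann Chan", 90000), ("Leeson Wong", 80000)],  # made-up rows
)

# Same query as on the slide; % matches any run of characters.
rows = conn.execute(
    "SELECT Name FROM Employee WHERE Name LIKE ? ORDER BY Name", ("%Lee%",)
).fetchall()
matching_names = [row[0] for row in rows]
print(matching_names)
conn.close()
```

Note that "Leeson Wong" also matches: LIKE is a substring test with no notion of words or relevance, which is why the slide calls this capability limited.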
Why is IR Difficult? Size!

• The size of the web is doubling every year:
– 50 million pages in November 1995
– 320 million pages in December 1997
– 800 million pages in February 1999
– 1 billion pages in 2000
– 3.5 billion in 2003 (openfind.com)
– 8 billion in 2004 (google.com)
– 20+ billion in 2005 (yahoo.com)
• Google stopped releasing the size
– 130 trillion in 2016
• Imagine you need to spend “just one second more” on each page!
– Renders Natural Language Processing methods infeasible
• Huge amount of data (e.g., WWW) dictates efficiency,
effectiveness and user-friendliness
• Slide from a Google Presentation
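
The "one second more on each page" remark can be turned into a quick back-of-the-envelope calculation. The sketch below assumes a single machine processing pages one after another and a 365-day year; the page counts are the ones listed on this slide.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year

def years_at_one_second_per_page(page_count):
    """Years needed if one sequential worker spends one extra second per page."""
    return page_count / SECONDS_PER_YEAR

years_2000 = years_at_one_second_per_page(1_000_000_000)        # 1 billion pages
years_2016 = years_at_one_second_per_page(130_000_000_000_000)  # 130 trillion pages

print(f"2000: about {years_2000:.1f} years")
print(f"2016: about {years_2016 / 1e6:.1f} million years")
```

At the 2000 figure this is roughly three decades of sequential work, and at the 2016 figure it is on the order of millions of years, so any per-page processing has to be very cheap, heavily parallel, or both.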

Dik Lun LEE Department of Computer Science and Engineering, HKUST Slide 1
Why is IR Difficult? Semantics!

• Unstructured data: difficult to capture semantics in documents. Compare:
– “select * from Employee where Salary > 100,000”
– “retrieve all news items about corporate takeover”

• Why is the second query more difficult to answer? The following
query is even more difficult:
– “retrieve all news items about corporate takeover involving an
internet company”
– Note: syntactic → semantic → real-world knowledge

• Documents have unrestricted subject domains
– it is hard to predefine or pre-categorize the subject domains of
documents
Why is IR a Difficult Problem? Diversity!

• Diversified user base: expert to casual users

– a system may be clumsy for an expert user but difficult to use for a
casual user
– a system may return information too general to be useful for an
expert in the subject but too narrow for a general user

• Intention of information and user query is hard to capture

– compare a README file and a user manual
– compare a summary versus an in-depth report

One size cannot fit all!

Indexing by Professionals (Librarians/Authors)

• High labor cost of trained human indexers

• Inconsistency in selecting index terms and judging relevance
– thesauri created by two indexers in a given subject domain have only
60% of index terms in common
– indexes obtained by two indexers from the same document with the
same thesaurus have only 30% in common
– documents obtained from two persons searching the same document
set with the same question have only 40% in common
– relevance judgments obtained by two users on the same set of
documents and the same topic have only 60% in common
• Ref: Olson, Hope A., and Dietmar Wolfram. "Indexing consistency and its implications for
information architecture: A pilot study." IA Summit (2006).
Why is IR a Difficult Problem?

• Distributed and interlinked (e.g., Hypertext and WWW)
– Where to start a search? In a centralized database, by contrast, you
have only one (or a few) database(s) to search.
– How is the information related?

• Efficiency (how fast) vs. effectiveness (how good)
– With limited resources, one can only improve efficiency and
effectiveness to a certain degree.
– Improving efficiency often means degrading effectiveness,
and vice versa.
Document Retrieval Model

[Figure: document retrieval model. Labels: user's information need, query formulation, modification, formal language, document retrieval, retrieved documents, relevance feedback, indexing, representation, documents]

• Document: a long string of characters contained in a single file

• Index: a list of important keywords from the documents,
stored in some efficient file structure
• Query: Boolean (A and B or C), list of words, natural language
• Relevance feedback: try “similar pages” in Google
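
As a rough illustration of the document, index and Boolean query notions above, here is a minimal inverted-index sketch in Python. It makes simplifying assumptions that are not in the slides: every whitespace-separated word is indexed (not only "important keywords"), the index is a plain in-memory dictionary rather than an efficient file structure, and the three sample documents are invented.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def lookup(index, term):
    # .get avoids adding empty entries to the defaultdict for unknown terms.
    return index.get(term.lower(), set())

# Invented sample documents.
sample_docs = {
    1: "google bing baidu web search",
    2: "library catalog search",
    3: "enterprise search security",
}
sample_index = build_index(sample_docs)

# Boolean query of the form (A and B) or C: (web AND search) OR library
boolean_result = (lookup(sample_index, "web") & lookup(sample_index, "search")) \
    | lookup(sample_index, "library")
print(sorted(boolean_result))
```

Set intersection and union stand in for AND and OR; the query is answered from the index alone, without rereading the documents.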
Evolution of Search Technologies

• Zeroth-generation search (1960 -)

– Libraries, collections of electronic documents (legal documents,
Lexis/Nexis, scientific databases)
– Individual documents organized in folders or databases
– Keyword-based search (looking for keywords)
– Search on fields (title, author, date) in addition to search on full
text body
– Boolean (title=“computer” AND body contains “IBM”)
– E.g., IBM STAIRS
– 0.5 generation: adding statistics to Boolean search (e.g., how often
does a keyword appear in a document and where?)
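
A small Python sketch can show how a fielded Boolean query and a simple statistic fit together. It is an assumption-laden toy, not the method of any system named above: the three documents are invented, the title test is "title contains the word" rather than exact equality, and the only statistic used is how often the body term occurs.

```python
# Invented documents with two fields each.
toy_documents = [
    {"id": "d1", "title": "computer history",
     "body": "IBM built early machines and IBM sold mainframes"},
    {"id": "d2", "title": "computer market",
     "body": "IBM competes with other vendors"},
    {"id": "d3", "title": "gardening tips",
     "body": "IBM is not discussed here"},
]

def fielded_search(docs, title_term, body_term):
    """Boolean filter on two fields, then order matches by body term frequency."""
    title_term = title_term.lower()
    body_term = body_term.lower()
    matches = []
    for doc in docs:
        title_words = doc["title"].lower().split()
        body_words = doc["body"].lower().split()
        if title_term in title_words and body_term in body_words:
            matches.append((doc["id"], body_words.count(body_term)))
    # Higher frequency first; ties broken by id for a stable order.
    return sorted(matches, key=lambda match: (-match[1], match[0]))

ranked = fielded_search(toy_documents, "computer", "IBM")
print(ranked)
```

The Boolean part decides which documents qualify at all; the frequency count only orders the qualifying ones, which is the step from pure Boolean retrieval toward ranked results.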
Evolution of Search Technologies (Cont.)

• First-generation search engines (web-based, 1993 -)

– Statistical keyword match
• Traditional search methods (mostly vector space model, which we will
learn next) applied to web
– Add a spider / crawler to download web pages
– Earlier versions:
• AltaVista (started by Digital Equipment Corporation, then the 2nd largest
computer company; sold to Yahoo!)
• Infoseek (founded in 1994; sold to Disney in 1998; Infoseek engineer
Li Yanhong later returned to China and founded Baidu)
• Lycos (started by CMU in 1994)
• etc.
Evolution of Search Technologies (Cont.)

• Second-generation search engines (1997 - )

– In addition to keyword matching, relying heavily on link analysis
(thus capitalizing on the special property of web)
– Use links to measure the quality of web pages, thus
fundamentally expanding the dimensions of ranking
– Google, Fast (sold to Microsoft), etc.
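
The slides do not say which link-analysis method is meant, so the following is only one well-known possibility: a simplified PageRank-style power iteration in Python. The four-page `toy_web` graph, the damping factor of 0.85, and the fixed 50 iterations are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = sorted(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank evenly over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web; every link target is also a key.
toy_web = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home"],
    "orphan": ["home"],
}

link_scores = pagerank(toy_web)
for page in sorted(link_scores, key=link_scores.get, reverse=True):
    print(f"{page}: {link_scores[page]:.3f}")
```

The score depends only on who links to whom, not on page text, so it adds a ranking signal that keyword matching alone cannot provide: "home" scores highest because every other page links to it, and "orphan" lowest because nothing does.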
Evolution of Search Technologies (Cont.)

• Third-generation search engines (2001- )

– Incorporate advanced search features, e.g., automatic categorization

Challengers:
• Teoma (acquired by ask.com)
• Wisenut (acquired by Looksmart)
• Vivisimo (owned clusty.com;
started by CMU in 2000;
acquired by IBM)
• Powerset (acquired by Microsoft
in 2008 for a reported US$100m)
• Companies that you will start!
The Search Industry (and our Job Market)
• GBB: Global web search engines attract billions of searches every
day; advertisement is the major source of revenue; technological
competitiveness is a must (winner takes all!)
• Enterprise search: Companies deploy their own search engines to
enhance productivity; vendors include Endeca (Oracle), Microsoft
(SharePoint), and Google (Site/Custom Search)
• Various vertical search: Business directories, recruitment and
travel websites; advertisement is the largest source of revenue
• Search engine marketing (SEM): Marketing via search engine ad
placements
• Search engine optimization (SEO): Companies helping websites
to rank high in GBB
Take Home Messages
• Search engines are rooted in “information retrieval”, a term used by academics
• IR existed even before computers were invented (e.g., manual
catalogs in libraries, manual keyword extraction)
• Search engine does NOT just mean web search (Google.com and
Bing.com); it also includes intranet and enterprise search engines
• Search engines can search structured information (as in library
systems); how is structured information represented in HTML?
• Search is difficult because it has to “understand” what the user
wants through a few query keywords and retrieve 10 best pages out
of billions of pages based on the semantic content of the pages
• In addition to sophistication of search, scaling up remains important
• High quality ranking at sub-second speed => Great User eXperience
