
L01

The document discusses the evolution and applications of Information Retrieval (IR) systems, emphasizing that search engines like Google were not the original creators of search technology. It outlines various types of data, search applications, and the challenges faced in IR, including the difficulty of capturing semantics and the need for efficient indexing. Additionally, it highlights the importance of search engines in accessing unstructured data and the competitive landscape of the search industry.

Uploaded by

Stephen Chow
Copyright
© All Rights Reserved

Search Engine was not Created by Google!

It has many names:


• Information retrieval (IR): dates back to the 1950s as one
of the major applications of computers
• Document retrieval: “Information” could mean many
things; “document” refers to natural language texts
organized in some predefined structures (books,
reports, letters)
• Text retrieval: Texts are strings of characters with
little or no structure; no images or videos
Applications

• Digital libraries: All materials in digital form,
accessible and searchable digitally
• Web search: Search anything accessible on the Web;
includes non-text content, although this course
focuses on texts (HTML pages)
• Vertical search: Search in a particular domain, e.g.,
image, video, news, product (e-commerce) search
– If we consider web search as “horizontal” search, vertical
search focuses on a particular segment, topic or data type
and provides better search functions for its focus compared
to a general web search engine
Types of Data

• Unformatted or unstructured data (as opposed to
relational databases)
– Textual data: papers, technical reports, newspaper articles
– Completely untagged, plain-text data

• Semi-structured data
– Web pages (HTML and XML files)
– Email messages

• Non-textual/multimedia data
– images, graphics, video
Examples of IR Systems:

• Examples of famous search engines are Google, Bing,
Baidu (GBB), …
– Stand-alone search engines (i.e., they interact directly with users
and search is the only function they provide)
Other Examples of IR Systems

• Most people use IR in some embedded way
– In Windows 10, search is one of the many functions of the
operating system
– Search is provided to users as a function or service offered
by the application (e.g., in a library system) instead of as a
standalone search engine
Library systems

• Books: http://ustlib.ust.hk/ (HKUST library)

[Screenshot label: Federated search]
Result Page has more Functions

• Unlike Google, libraries have more structured data (fields / facets)


Search in Different Applications
• Vertical Search: A search engine for one type of data or one focus area
– Data could be maintained on multiple sites in the vertical search or
aggregated from multiple external sites
• E.g., Job search, News search, Movie search, …
• Site Search: A search engine for one site (or group of related sites)
– hsbc.com.hk, ust.hk, …
• Custom Search: A search frontend to a (big) backend search engine
to narrow search to a small set of websites
– Ust.hk/search-engine?...
Both Google Site Search and Google Custom
Search are Google products, but the idea is
applicable to other search engines

• Enterprise Search: A search engine for a corporate intranet


– Multiple types of data (databases, Office documents, emails, …)
– Different user roles (sales vs technical support vs CEO …)
– Security, security, security, …
Embedded Search Engines on Devices

• Search engines embedded on portable devices (mobile phones, USB
thumb drives, CD-ROMs)
• Search engines are tailored for the data on the device
– E.g., Electronic encyclopaedia, product catalogues, corporate reports, etc.
– Don’t forget that a CD could hold 600 Mbytes of text!
• Special requirements:
– No installation needed; built-in and executable
– Provide adequate interface (e.g., web-based)
– Fast and resource sensitive (running on small devices)
File Search on UNIX/LINUX

– UNIX grep commands (grep, egrep, agrep, etc.)

$ grep comp4321 input-file1 input-file2 …

[Diagram: Input files → grep (Query = comp4321) → Matched lines in input files]

– man -k keyword
• Search UNIX man pages

– These are simple "search engines", although their search functions are
extremely simple and primitive!
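To make the "grep as a primitive search engine" idea concrete, here is a minimal sketch in Python. It only mimics the basic behaviour of the command above (print the lines that match the query); the function name `grep_lines` and the in-memory "files" are assumptions made for illustration, not part of any real tool.

```python
import re
from typing import Iterable, Iterator, Tuple


def grep_lines(pattern: str,
               named_texts: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, int, str]]:
    """Yield (name, line_number, line) for every line that matches the pattern."""
    regex = re.compile(pattern)
    for name, text in named_texts:
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield name, number, line


# Hypothetical in-memory stand-ins for input-file1 and input-file2.
files = [
    ("input-file1", "comp4321 is a search engine course\nnothing here"),
    ("input-file2", "no match\nwelcome to comp4321"),
]

matches = list(grep_lines("comp4321", files))
for name, number, line in matches:
    print(f"{name}:{number}:{line}")
```

Note that every query rescans every file from start to end. There is no index and no ranking, which is why this style of search does not scale to large collections.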
How do you Search for Files on Windows?

• Search for files: plain text, MS Office files, email, etc.


• Specify filenames, dates, file types, etc.
• Windows built-in search function, Yahoo Desktop, Google
Desktop, Windows Desktop, etc.
Desktop Search Examples

• Windows desktop search has been integrated into Windows
• Copernic is still available
• Google Desktop has long been discontinued

[Screenshot label: Search result]
Index/Search on Windows 10

• Windows 10 Indexing Options allow you to specify:


– Folders to index
– Whether to index encrypted files
– Whether to index properties only, or properties plus content, for different file types
– Rebuild index at any time
Web Search Engines (GBB: Google/Bing/Baidu)

• World Wide Web search engines, or web search
– Most popular IR application nowadays, e.g., Google, Bing, Baidu
• Other niche search engines: DuckDuckGo, Yandex, etc.

[Diagram components: Web pages; Spider / Crawler; Local copy (Cache); Index; Search engine; Browser; User queries. Note on the diagram: as in a real implementation (yet many other things are not shown here).]
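The components named in the diagram (crawler, local copy, index, search engine, user queries) can be sketched end to end on a toy scale. This is only an illustration under strong assumptions: the three-page `WEB` dictionary stands in for the real web, and the names `crawl`, `build_index` and `search` are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical three-page "web": URL -> (page text, outgoing links).
WEB = {
    "a.html": ("search engines index web pages", ["b.html", "c.html"]),
    "b.html": ("a crawler downloads web pages", ["a.html"]),
    "c.html": ("users send queries from a browser", []),
}


def crawl(seed):
    """Breadth-first crawl from a seed URL; returns the local copy {url: text}."""
    local_copy, queue = {}, deque([seed])
    while queue:
        url = queue.popleft()
        if url in local_copy or url not in WEB:
            continue  # already downloaded, or a dead link
        text, links = WEB[url]
        local_copy[url] = text
        queue.extend(links)
    return local_copy


def build_index(local_copy):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in local_copy.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    return sorted(set.intersection(*(index.get(w, set()) for w in words)))


index = build_index(crawl("a.html"))
print(search(index, "web pages"))
print(search(index, "browser"))
```

The key design point is that queries are answered from the index built over the local copy, never by visiting the web pages at query time.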
Federated and Meta Search

• Several search engines are used to return complete
search results across all search engines
• A query is dispatched to all search engines and the
search results are sent back to the originator, which
integrates the results
[Diagrams: Federated Search, where a query is passed among the participating search engines (SE #1, SE #2, …) and results are passed back; Meta Search, where a meta-search node sends the query to each SE and collects their results.]
Federated vs Meta Search
• Federated Search
– Each node is a full-function SE for its own collection
– Search engines agree to join Federated Search
– Agree to use the same standard for query/result representation and API (e.g., ANSI/ISO Z39.50 for libraries)
– SEs can collaborate to perform a search
– E.g., HKALL (HK Academic Lib Link)
• Meta Search
– Meta-search node passes queries and search results between users and underlying SEs; it is not itself an SE
– Agreement is not needed; unwilling SEs can block search requests
– Queries and results of underlying SEs can have different formats; meta-search performs query transformation and data aggregation
– Participating SEs do not collaborate with each other
– E.g., Dogpile.com
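The meta-search side of this comparison (query dispatch, format transformation, result aggregation) can be sketched as follows. Everything here is assumed for illustration: the two engines are stubs with made-up result formats and URLs, and summing reciprocal ranks is just one simple way to merge ranked lists, not the method of any particular meta-search engine.

```python
from collections import defaultdict


# Two hypothetical underlying engines that return results in different formats.
def engine_one(query):
    """Returns a ranked list of URLs, best first."""
    return ["a.example/ir", "b.example/search", "c.example/web"]


def engine_two(query):
    """Returns a ranked list of dicts, best first."""
    return [{"link": "b.example/search"}, {"link": "d.example/index"}]


def meta_search(query):
    """Dispatch the query, normalize result formats, and merge the rankings."""
    ranked_lists = [
        engine_one(query),
        [hit["link"] for hit in engine_two(query)],  # format transformation
    ]
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, url in enumerate(ranked, start=1):
            scores[url] += 1.0 / rank  # assumed merge rule: sum of reciprocal ranks
    # Highest merged score first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


merged = meta_search("information retrieval")
print(merged)
```

A URL returned by both engines rises to the top, even though the engines never communicate with each other. The meta-search node does all the integration.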
Differences from Web Search (GBB)
• Technologies for all these different forms of search are
more or less the same, but in enterprise or product search
– Data are more structured:
• Data are grouped into “collections”, e.g., products, press releases,
news, manuals, records dumped from database tables
• Search can be applied to a subset of the collections
– Query format:
• Standard AND/OR, phrase, etc.
• Search on fields: titles, authors, within date range, etc.
– Result page: Grouped by document types, ranked by date or
relevance, etc.
• Example: search on amazon.com; what search features
are most useful to you that are not available on GBB?
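The features listed above (collections, AND queries, search on fields, date ranges, results ranked by date) can be sketched over a handful of records. The records, field names and the `field_search` function are all hypothetical, chosen only to mirror the bullets.

```python
from datetime import date

# Hypothetical records, each belonging to one collection.
RECORDS = [
    {"collection": "manuals", "title": "Router setup guide",
     "author": "Lee", "date": date(2020, 5, 1)},
    {"collection": "news", "title": "Router recall notice",
     "author": "Chan", "date": date(2021, 3, 9)},
    {"collection": "manuals", "title": "Printer setup guide",
     "author": "Lee", "date": date(2019, 1, 15)},
]


def field_search(records, collection=None, title_words=(),
                 author=None, start=None, end=None):
    """Filter by collection and fields; all title words must match (AND)."""
    hits = []
    for record in records:
        if collection and record["collection"] != collection:
            continue
        title = record["title"].lower().split()
        if not all(word.lower() in title for word in title_words):
            continue
        if author and record["author"] != author:
            continue
        if start and record["date"] < start:
            continue
        if end and record["date"] > end:
            continue
        hits.append(record)
    # Rank by date, newest first.
    return sorted(hits, key=lambda record: record["date"], reverse=True)


manual_hits = field_search(RECORDS, collection="manuals", title_words=("setup",))
recent_hits = field_search(RECORDS, title_words=("router",), start=date(2021, 1, 1))
print([r["title"] for r in manual_hits])
print([r["title"] for r in recent_hits])
```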
Why is IR Important? Needed Everywhere!

• Most information available is in textual form and has no
predefined format (e.g., emails and newsgroup articles)
– You may think businesses store data in structured databases, but >80%
of business information is unstructured and mostly in text
• Integration of text retrieval capability in most relational database
systems. SQL already supports limited search capability such as
search based on wildcard pattern matching:
– select * from Employee where Name like '%Lee%'
• Increasing number of online documentation systems
(no more hardcopy!)
• Of course, the boom of the World Wide Web
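The SQL query in the list above can be tried with Python's built-in `sqlite3` module. The `Employee` table comes from the slide; its columns beyond `Name` and the three rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Ann Lee", 90000), ("Bob Leeson", 120000), ("Carol Wong", 110000)],  # made-up rows
)

# Same query as on the slide: every name containing the substring "Lee".
rows = [
    name
    for (name,) in conn.execute(
        "SELECT Name FROM Employee WHERE Name LIKE ? ORDER BY Name", ("%Lee%",)
    )
]
print(rows)
conn.close()
```

The pattern also matches "Bob Leeson", which shows why this counts as only a limited search capability: it is substring matching with no notion of words or relevance.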
Why is IR Difficult? Size!

• The size of the web is doubling every year:
– 50 million pages in November 1995
– 320 million pages in December 1997
– 800 million pages in February 1999
– 1 billion pages in 2000
– 3.5 billion in 2003 (openfind.com)
– 8 billion in 2004 (google.com)
– 20+ billion in 2005 (yahoo.com)
• Google stopped releasing the size
– 130 trillion in 2016
• Imagine you need to spend “just one second more” on each page!
This renders Natural Language Processing methods infeasible
• Huge amount of data (e.g., WWW) dictates efficiency,
effectiveness and user-friendliness
• Slide from a Google Presentation
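The “just one second more” remark can be turned into a quick back-of-the-envelope calculation. The sketch below assumes strictly sequential processing on a single machine, which is an assumption made only to show the order of magnitude.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def extra_days(pages, seconds_per_page=1.0):
    """Extra processing time in days, assuming one machine working sequentially."""
    return pages * seconds_per_page / SECONDS_PER_DAY


def extra_years(pages, seconds_per_page=1.0):
    """Same quantity expressed in 365-day years."""
    return extra_days(pages, seconds_per_page) / 365


# Page counts taken from the list above.
for label, pages in [("1995", 50e6), ("2000", 1e9), ("2004", 8e9)]:
    print(f"{label}: {extra_days(pages):,.0f} days ({extra_years(pages):.1f} years)")
```

For the 1 billion pages of 2000, one extra second per page already adds more than three decades of sequential work, so any per-page processing has to be extremely cheap or heavily parallelized.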

Dik Lun LEE Department of Computer Science and Engineering, HKUST Slide 1
Why is IR Difficult? Semantics!

• Unstructured data: difficult to capture semantics in documents.
Compare:
– “select * from Employee where Salary > 100000”
– “retrieve all news items about corporate takeover”

• Why is the second query more difficult to answer? The following
query is even more difficult:
– “retrieve all news items about corporate takeover involving an
internet company”
– Note: syntactic → semantic → real-world knowledge

• Documents have unrestricted subject domains


– it is hard to predefine or pre-categorize the subject domains of
documents
Why is IR a Difficult Problem? Diversity!

• Diversified user base: expert to casual users


– a system that is easy for a casual user may be clumsy for an expert
user, while one built for experts may be difficult for a casual user
– a system may return information too general to be useful for an
expert in the subject but too narrow for a general user

• Intention of information and user query is hard to capture


– compare a README file and a user manual
– compare a summary versus an in-depth report

One size cannot fit all!


Indexing by Professionals (Librarians/Authors)

• High labor cost of trained human indexers


• Inconsistency in selecting index terms and judging relevance
– thesauri created by two indexers in a given subject domain have only
60% of index terms in common
– indexes obtained by two indexers from the same document with the
same thesaurus have only 30% in common
– documents obtained from two persons searching the same document
set with the same question have only 40% in common
– relevance judgments obtained by two users on the same set of
documents and the same topic have only 60% in common
• Ref: Olson, Hope A., and Dietmar Wolfram. "Indexing consistency and its implications for
information architecture: A pilot study." IA Summit (2006).
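To see what “terms in common” can mean numerically, here is a small sketch that compares the index terms chosen by two indexers. The measure used (shared terms divided by all distinct terms, i.e. Jaccard overlap) is an assumption for illustration; the cited study may define consistency differently, and the term sets are made up.

```python
def overlap(terms_a, terms_b):
    """Share of distinct terms that both indexers chose (Jaccard overlap)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty indexes agree trivially
    return len(a & b) / len(a | b)


# Made-up index terms assigned to the same document by two indexers.
indexer_1 = {"search", "index", "ranking"}
indexer_2 = {"search", "index", "crawler"}

consistency = overlap(indexer_1, indexer_2)
print(f"{consistency:.0%} of the index terms are in common")
```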
Why is IR a Difficult Problem?

• Distributed and interlinked (e.g., Hypertext and WWW)


– Where to start a search? In a centralized database, by contrast, you
have only one (or a few) database(s) to search.
– How are the pieces of information related?

• Efficiency (how fast) vs. effectiveness (how good)


– With limited resources, one can only improve efficiency and
effectiveness to a certain degree.
– Improving efficiency often means degrading effectiveness,
and vice versa.
Document Retrieval Model
[Diagram labels: user's information need; query formulation; formal language; retrieval; document representation; indexing; documents; retrieved documents; relevance feedback; modification]

• Document: a long string of characters contained in a single file


• Index: a list of important keywords from the documents,
stored in some efficient file structure
• Query: Boolean (A and B or C), list of words, natural language
• Relevance feedback: try “similar pages” in Google
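The index and Boolean query described above can be sketched with an inverted index, one common “efficient structure” for this purpose. The stop-word list, the sample documents and the reading of “A and B or C” as (A AND B) OR C are assumptions made for this illustration.

```python
from collections import defaultdict

STOPWORDS = {"a", "the", "of", "and", "is", "in"}  # assumed tiny stop list


def build_inverted_index(docs):
    """Map each non-stop-word keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOPWORDS:
                index[word].add(doc_id)
    return index


def a_and_b_or_c(index, a, b, c):
    """Evaluate (a AND b) OR c with set intersection and union."""
    return (index.get(a, set()) & index.get(b, set())) | index.get(c, set())


# Made-up documents, each a short string of characters.
DOCS = {
    1: "the history of information retrieval",
    2: "information retrieval and web search",
    3: "web crawler design",
}

inverted = build_inverted_index(DOCS)
result = sorted(a_and_b_or_c(inverted, "information", "web", "crawler"))
print(result)
```

Each Boolean operator becomes a set operation on posting sets, so the documents themselves are never rescanned at query time.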
Evolution of Search Technologies

• Zeroth-generation search (1960 -)


– Libraries, collections of electronic documents (legal documents,
Lexis/Nexis, scientific databases)
– Individual documents organized in folders or databases
– Keyword-based search (looking for keywords)
– Search on fields (title, author, date) in addition to search on full
text body
– Boolean (title=“computer” AND body contains “IBM”)
– E.g., IBM STAIRS
– 0.5 generation: adding statistics to Boolean search (e.g., how often
does a keyword appear in a document and where?)
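The “0.5 generation” idea, Boolean matching plus simple statistics on how often and where a keyword appears, might look like the sketch below. The scoring rule and the `title_weight` of 2.0 are assumptions for illustration, not a description of any historical system.

```python
def boolean_tf_score(doc, keywords, title_weight=2.0):
    """Boolean AND filter, then a score from keyword counts (title counts weigh more)."""
    title = doc["title"].lower().split()
    body = doc["body"].lower().split()
    words = [keyword.lower() for keyword in keywords]
    if not all(word in title or word in body for word in words):
        return 0.0  # fails the Boolean AND condition
    return sum(title_weight * title.count(word) + body.count(word) for word in words)


# Made-up documents with a title field and a body field.
DOCS = {
    "d1": {"title": "IBM computer", "body": "IBM builds computer systems"},
    "d2": {"title": "History", "body": "the computer was sold by IBM"},
    "d3": {"title": "Cooking", "body": "no match here"},
}

scores = {doc_id: boolean_tf_score(doc, ["computer", "IBM"]) for doc_id, doc in DOCS.items()}
ranked = [doc_id for doc_id in sorted(scores, key=lambda d: -scores[d]) if scores[doc_id] > 0]
print(scores)
print(ranked)
```

Pure Boolean search would return d1 and d2 as an unordered set; the added statistics are what allow d1 to be ranked ahead of d2.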
Evolution of Search Technologies (Cont.)

• First-generation search engines (web-based, 1993 -)


– Statistical keyword match
• Traditional search methods (mostly the vector space model, which we will
learn next) applied to the web
– Add a spider / crawler to download web pages
– Early examples:
• AltaVista (started by Digital Equipment Corporation, then the 2nd largest
computer company; later sold to Yahoo!)
• Infoseek (founded in 1994; sold to Disney in 1998; Infoseek engineer
Li Yanhong returned to China and founded Baidu)
• Lycos (started by CMU in 1994)
• etc.
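As a preview of the vector space model mentioned above, the sketch below represents the query and each document as a vector of raw term counts and ranks documents by cosine similarity. Using raw counts without any term weighting is a simplifying assumption, and the documents are made up.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as a bag-of-words vector of raw term counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up documents and query.
DOCS = {
    "d1": "web search engine",
    "d2": "library catalog search",
    "d3": "cooking recipes",
}
query = vectorize("web search")

similarity = {doc_id: cosine(query, vectorize(text)) for doc_id, text in DOCS.items()}
ranked = sorted(similarity, key=lambda d: (-similarity[d], d))
print(ranked)
```

Unlike Boolean search, d2 is still retrieved although it matches only one of the two query words; it simply ranks lower than d1.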
Evolution of Search Technologies (Cont.)

• Second-generation search engines (1997 - )


– In addition to keyword matching, relying heavily on link analysis
(thus capitalizing on the special property of the web)
– Using links to measure the quality of a web page, thus
fundamentally expanding the dimensions of ranking
– Google, Fast (sold to Microsoft), etc.
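One well-known form of link analysis is PageRank, the algorithm associated with Google. The slides only say “link analysis”, so take the following as a simplified sketch of the general idea rather than what any engine actually runs; the four-page link graph, the damping factor of 0.85 and the fixed iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to;
    every link target must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Made-up link graph: pages a, b and d all link to c; c links back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page c ranks highest because many pages link to it, and a outranks b and d because it is linked from the highly ranked c. This quality signal is independent of the keywords on the page, which is the new ranking dimension the slide refers to.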
Evolution of Search Technologies (Cont.)

• Third-generation search engines (2001- )


– Incorporate advanced search features, e.g., automatic categorization

Challengers:
• Teoma (acquired by ask.com)
• Wisenut (acquired by Looksmart)
• Vivisimo (owned clusty.com; started by CMU in 2000; acquired by IBM)
• Powerset (acquired by Microsoft in 2008 for a reported US$100m)
• Companies that you will start!
The Search Industry (and our Job Market)
• GBB: Global web search engines attract billions of searches every
day; advertisement is the major source of revenue; technological
competitiveness is a must (winner takes all!)
• Enterprise search: Companies deploy their own search engines to
enhance productivity; vendors include Endeca (Oracle), Microsoft
(SharePoint), and Google (Site/Custom Search)
• Various vertical search engines: Business directories, recruitment and
travel websites; advertisement is the largest source of revenue
• Search engine marketing (SEM): Marketing via search engine ad
placements
• Search engine optimization (SEO): Companies helping websites
to rank high in GBB
Take Home Messages
• Search engines are rooted in “information retrieval”, the term used by academics
• IR existed even before computers were invented (e.g., manual
catalogs in libraries, manual keyword extraction)
• A search engine does NOT just mean web search (Google.com and
Bing.com); it includes intranet and enterprise search engines
• A search engine can search structured information (as in library
systems); how is structured information represented in HTML?
• Search is difficult because it has to “understand” what the user
wants through a few query keywords and retrieve 10 best pages out
of billions of pages based on the semantic content of the pages
• In addition to sophistication of search, scaling up remains important
• High quality ranking at sub-second speed => Great User eXperience
