L01
• Semi-structured data
– Web pages (HTML and XML files)
– Email messages
• Non-textual/multimedia data
– images, graphics, video
Examples of IR Systems:
• grep
– Input: input files and a query (e.g., Query = comp4321)
– Output: matched lines in input files
• man -k keyword
– Search UNIX man pages
• Index/Search on Windows 10
• Federated search
– Result page has more functions
[Figure: web search, showing web pages, a local copy, a cache, and the search result. In a real implementation, many other things are not shown here.]
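The grep example listed above is the simplest kind of IR system: scan every input line and keep the ones that contain the query. A minimal sketch of that idea follows; the function name `grep_lines` and the sample lines are made up for illustration, and real grep matches regular expressions rather than plain substrings.

```python
def grep_lines(query, lines):
    """Return (line_number, line) pairs for the lines that contain the query."""
    return [(n, line) for n, line in enumerate(lines, start=1) if query in line]


# Hypothetical contents of an input file, used only for illustration.
sample_lines = [
    "comp4321 covers search engines",
    "comp2011 covers programming",
    "Project for comp4321 is due soon",
]

matches = grep_lines("comp4321", sample_lines)
for n, line in matches:
    print(f"{n}: {line}")
```

Every line is scanned for every query, which is acceptable for a few files but is the reason larger systems build an index first.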
Federated and Meta Search
Federated vs Meta Search
• Federated Search
– Each node is a full-function SE for its own collection
– Search engines agree to join the federated search
– SEs agree to use the same standard for query/result representation and API (e.g., ANSI/ISO Z39.50 for libraries)
– SEs can collaborate to perform a search
– E.g., HKALL (HK Academic Lib Link)
• Meta Search
– The meta-search node passes queries and search results between users and underlying SEs; it is not itself an SE
– Agreement is not needed; unwilling SEs can block search requests
– Queries and results of underlying SEs can have different formats; the meta-search node performs query transformation and data aggregation
– Participating SEs do not collaborate with each other
– E.g., Dogpile.com
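The two jobs of a meta-search node named above, query transformation and data aggregation, can be sketched as follows. Both underlying engines (`engine_a`, `engine_b`), their query and result formats, the example URLs, and the score-summing merge rule are assumptions made for illustration; they do not describe any real search engine.

```python
def engine_a(query_string):
    # Hypothetical SE A: expects "term1+term2", returns (url, score) tuples.
    index = {
        "information+retrieval": [
            ("http://a.example/ir", 0.9),
            ("http://shared.example/ir", 0.7),
        ],
    }
    return index.get(query_string, [])


def engine_b(terms):
    # Hypothetical SE B: expects a list of terms, returns dicts with a rank.
    index = {
        ("information", "retrieval"): [
            {"link": "http://shared.example/ir", "rank": 1},
            {"link": "http://b.example/ir", "rank": 2},
        ],
    }
    return index.get(tuple(terms), [])


def meta_search(query):
    terms = query.lower().split()
    # Query transformation: rewrite the user query into each SE's own format.
    results_a = engine_a("+".join(terms))
    results_b = engine_b(terms)
    # Data aggregation: map both result formats to url -> score and merge.
    merged = {}
    for url, score in results_a:
        merged[url] = merged.get(url, 0.0) + score
    for item in results_b:
        # Assumed rule: turn a rank into a score with 1/rank.
        merged[item["link"]] = merged.get(item["link"], 0.0) + 1.0 / item["rank"]
    # Highest combined score first; ties broken by URL for a stable order.
    return sorted(merged, key=lambda u: (-merged[u], u))


print(meta_search("Information Retrieval"))
```

The meta-search node holds no index of its own: it only rewrites queries and merges whatever the underlying SEs return, which is why it is not itself an SE.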
Differences from Web Search (GBB)
• Technologies for all these different forms of search are
more or less the same, but in enterprise or product search
– Data are more structured:
• Data are grouped into “collections”, e.g., products, press releases,
news, manuals, records dumped from database tables
• Search can be applied to a subset of the collections
– Query format:
• Standard AND/OR, phrase, etc.
• Search on fields: titles, authors, within date range, etc.
– Result page: Grouped by document types, ranked by date or
relevance, etc.
• Example: search on amazon.com; what search features
are most useful to you that are available on GBB?
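The enterprise-search features listed above (searching a subset of collections, searching on fields, restricting to a date range, ranking by date) can be sketched over a small in-memory record list. The records, field names, and the `search` function are hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical records; each one belongs to a collection and has fields.
DOCS = [
    {"collection": "manuals", "title": "Router setup guide",
     "author": "Chan", "date": date(2020, 5, 1)},
    {"collection": "news", "title": "New router released",
     "author": "Lee", "date": date(2021, 3, 15)},
    {"collection": "news", "title": "Quarterly results",
     "author": "Chan", "date": date(2019, 11, 2)},
]


def search(docs, collections=None, title_terms=(), author=None,
           start=None, end=None):
    """Filter by collection, title terms (AND), author, and date range."""
    hits = []
    for d in docs:
        if collections is not None and d["collection"] not in collections:
            continue
        title_words = d["title"].lower().split()
        if not all(t.lower() in title_words for t in title_terms):
            continue
        if author is not None and d["author"] != author:
            continue
        if start is not None and d["date"] < start:
            continue
        if end is not None and d["date"] > end:
            continue
        hits.append(d)
    # Rank by date, newest first (one of the orderings the slide mentions).
    return sorted(hits, key=lambda d: d["date"], reverse=True)


for hit in search(DOCS, collections={"news"}, title_terms=["router"]):
    print(hit["title"], hit["date"])
```

Because every filter is optional, the same function covers a plain keyword query and a fully fielded query such as "author Chan, since 2020".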
Why is IR Important? Needed Everywhere!
Dik Lun LEE Department of Computer Science and Engineering, HKUST Slide 1
Why is IR Difficult? Semantics!
Challengers:
• Teoma (acquired by ask.com)
• Wisenut (acquired by Looksmart)
• Vivisimo (owned clusty.com;
started by CMU researchers in 2000;
acquired by IBM)
• Powerset (acquired by Microsoft
in 2008 for a reported US$100m)
• Companies that you will start!
The Search Industry (and our Job Market)
• GBB: Global web search engines attract billions of searches every
day; advertisement is the major source of revenue; technological
competitiveness is a must (winner takes all!)
• Enterprise search: Companies deploy their own search engines to
enhance productivity; vendors include Endeca (Oracle), Microsoft
(SharePoint), and Google (Site/Custom Search)
• Various vertical search: Business directories, recruitment and
travel websites; advertisement is the largest source of revenue
• Search engine marketing (SEM): Marketing via search engine ad
placements
• Search engine optimization (SEO): Companies helping websites
to rank high in GBB
Take Home Messages
• Search engines are rooted in “information retrieval”, the term used by academics
• IR existed even before computers were invented (e.g., manual
catalogs in libraries, manual keyword extraction)
• Search engine does NOT just mean web search (Google.com and
Bing.com); it also includes intranet and enterprise search engines
• Search engine could search structured information (as in library
systems); how is structured information represented in HTML?
• Search is difficult because it has to “understand” what the user
wants through a few query keywords and retrieve 10 best pages out
of billions of pages based on the semantic content of the pages
• In addition to sophistication of search, scaling up remains important
• High quality ranking at sub-second speed => Great User eXperience