• Vocabulary mismatch between queries and documents
• Queries are ambiguous
• Content representation may be
inadequate and incomplete
• The user is the ultimate judge, but we
don’t know how the judge judges.
Challenges in IR
• 1. Apache Lucene
• 2. Sphinx
• 3. Whoosh
• 4. Carrot2
Apache Lucene Core
• Apache Lucene is a high-performance, full-featured text search engine library written entirely in
Java. It is a technology suitable for nearly any application that requires full-text search, especially
cross-platform.
• Powerful features through a simple API:
• • Scalable, High-Performance Indexing
• • Over 150GB/hour on modern hardware
• • small RAM requirements -- only 1MB heap
• • incremental indexing as fast as batch indexing
• • index size roughly 20-30% the size of text indexed
• • Powerful, Accurate and Efficient Search Algorithms
• • ranked searching -- best results returned first
• • many powerful query types: phrase queries, wildcard queries, proximity queries, range queries
and more
• • fielded searching (e.g. title, author, contents)
• • sorting by any field
• • multiple-index searching with merged results
• • allows simultaneous update and searching
• • flexible faceting, highlighting, joins and result grouping
• • fast, memory-efficient and typo-tolerant suggesters
• • pluggable ranking models, including the Vector Space Model and Okapi BM25
• • configurable storage engine (codecs)
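To make the "pluggable ranking models" point concrete, here is a minimal Python sketch of Okapi BM25 scoring. It is not Lucene's implementation: the function name `bm25_scores`, the whitespace tokenizer, the toy corpus, and the parameter defaults `k1=1.2` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, made up for this example.
corpus = [
    "lucene is a search engine library written in java",
    "whoosh is a search library written in python",
    "carrot2 clusters search results",
]
docs = [tokenize(text) for text in corpus]
scores = bm25_scores(tokenize("python search library"), docs)
# Document indices ordered from best to worst match.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

The second document matches all three query terms, including the rare term "python", so it ranks first. Swapping in a different scoring function while keeping the same index is what "pluggable" means here.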
Sphinx
• Sphinx is a full-text search engine, publicly distributed under GPL version 2. Technically, Sphinx is a standalone
software package that provides fast and relevant full-text search functionality to client applications.
• It was specially designed to integrate well with SQL databases storing the data, and to be easily accessed by
scripting languages.
• However, Sphinx does not depend on nor require any specific database to function.
• Applications can access the Sphinx search daemon (searchd) using any of three access methods:
a) via Sphinx's own implementation of the MySQL network protocol (using a small SQL subset called SphinxQL),
b) via the native search API (SphinxAPI), or
c) via a MySQL server with a pluggable storage engine (SphinxSE).
• Starting from version 1.10-beta, Sphinx supports two different indexing backends:
a) "Disk" index backend: disk indexes support online full-text index rebuilds, but online updates can only be
done on non-text (attribute) data.
b) "Realtime" (RT) index backend: RT indexes additionally allow for online full-text index updates. Previous
versions only supported disk indexes.
• Sphinx features are:
• high indexing and searching performance;
• advanced indexing and querying tools;
• advanced result set post-processing (SELECT with expressions, WHERE, ORDER BY, GROUP BY, HAVING etc over text
search results);
• proven scalability up to billions of documents, terabytes of data, and thousands of queries per second;
• easy integration with SQL and XML data sources, and SphinxQL, SphinxAPI, or SphinxSE search interfaces;
• easy scaling with distributed searches.
Whoosh
• Whoosh was created by Matt Chaput. It started as a quick and dirty search server
for the online documentation of the Houdini 3D animation software package.
• Whoosh is a fast, featureful full-text indexing and searching library implemented
in pure Python.
• Programmers can use it to easily add search functionality to their applications
and websites.
• Every part of how Whoosh works can be extended or replaced to meet your
needs exactly. Whoosh’s features include:
• • Pythonic API.
• • Pure-Python. No compilation or binary packages needed, no mysterious
crashes.
• • Fielded indexing and search.
• • Fast indexing and retrieval – faster than any other pure-Python, scoring, full-text
search solution
• • Pluggable scoring algorithm (including BM25F), text analysis, storage, posting
format, etc.
• • Powerful query language.
• • Pure Python spell-checker
Carrot²
• 1989: Tim Berners-Lee, a British computer scientist, proposed the concept of the World Wide
Web while working at CERN (the European Organization for Nuclear Research).
• 1990: The concept was successfully tested, and the first website was created.
• 1991: The World Wide Web was publicly released, allowing people outside of CERN to
use and access web pages. This marked the beginning of the modern internet era.
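• It was called the World Wide Web (WWW).
• The WWW is built on three core technologies:
• • HTML (the markup language for pages)
• • HTTP (the transfer protocol)
• • URLs (the addressing scheme)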
IR on web
• Keyword queries
• Boolean Queries
• Phrase queries
• Proximity queries
• Full document queries
• Natural language questions
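Several of these query types can be answered from one positional inverted index. The Python sketch below illustrates keyword/Boolean AND, phrase, and proximity matching. All function names and the three toy documents are assumptions for illustration, and the tokenizer is a plain whitespace split; real engines add stemming, stop-word handling, and compressed postings.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, terms):
    """Boolean AND: documents that contain every term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def phrase(index, terms):
    """Phrase query: the terms must appear as consecutive words."""
    hits = set()
    for doc_id in boolean_and(index, terms):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


def proximity(index, first, second, max_gap):
    """Proximity query: `second` occurs at most `max_gap` words after `first`."""
    hits = set()
    for doc_id in boolean_and(index, [first, second]):
        starts = index[first][doc_id]
        ends = index[second][doc_id]
        if any(0 < e - s <= max_gap for s in starts for e in ends):
            hits.add(doc_id)
    return hits


# Toy documents, made up for this example.
docs = {
    1: "information retrieval on the web",
    2: "retrieval of web information",
    3: "the web changed information seeking and retrieval",
}
index = build_index(docs)
print(boolean_and(index, ["information", "retrieval"]))   # all three documents
print(phrase(index, ["information", "retrieval"]))        # only document 1
print(proximity(index, "information", "retrieval", 3))    # documents 1 and 3
```

Storing positions is what separates phrase and proximity queries from plain Boolean ones: Boolean AND only needs the document ids, while the other two compare word offsets inside each matching document.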
Web challenges on IR
• Artificial Intelligence
• • Study of how to construct intelligent machines and systems that can simulate or extend the development
of human intelligence
• In the 1980s, the AI and IR communities started to cooperate, and the term "intelligent information retrieval" was coined for AI applications in IR.
• The integration of Artificial Intelligence and Information Retrieval has led to the following developments:
• • Methods to learn users' information needs.
• • Extraction of information based on what has been learned.
• • Representation of the semantics of information.
• In the 1990s, information retrieval saw a shift from set-based Boolean retrieval models to ranking systems.
What are Intelligent IR Systems?
• The concept of 'intelligent' information retrieval was first
suggested in the late 1970s.
• It was not pursued by the IR community until the early 1990s.
• An intelligent IR system can simulate the human thinking
process on information processing and intelligence
activities to achieve information and knowledge storage,
retrieval and reasoning, and to provide intelligence support.
• In an Intelligent IR system, the functions of the human
intermediary are performed by a program, interacting with
the human user.
• Intelligent IR is performed by a computer program
(intelligent agent), which acts on (minimal or no explicit)
instructions from a human user, retrieves and presents
information to the user without any other interaction.
How to introduce AI into IR systems?
• Natural language processing
• Reasoning under uncertainty
• Knowledge representation
• Cognitive theory
• Machine learning
• Computer vision
AI applied to IR
• Information characterization
• Search formulation in information seeking
• System integration
• Support functions
Web Search vs IR
• Traditional IR systems normally index a closed
collection of documents, which are mainly text-
based and usually offer little linkage between
documents.
• Traditional IR systems are often referred to as
full-text retrieval systems.
• Libraries were among the first to adopt IR to
index their catalogs and later, to search through
information which was typically imprinted onto
CD-ROMs.
• The main aim of traditional IR was to return
relevant documents that satisfy the user’s
information need.
• Satisfying the user’s need is still the central issue
in web IR (or web search).
Components of a Search engine
• Characteristics
• Measuring the Internet, and in particular the Web, is a difficult task due
to its highly dynamic nature.
• How many different institutions (not Web servers) maintain Web data?
• • This number is smaller than the number of servers, because many
places have multiple servers.
• • The exact number is unknown, but should be larger than 40% of the number of
Web servers.
• More recent studies on the size of search engines estimated that there were over
20 billion pages in 2005, and that the size of the static Web is roughly doubling
every eight months.
• Nowadays, the Web is infinite for practical purposes, as we can generate
an infinite number of dynamic pages (e.g., consider an online calendar).
• The most popular formats of Web documents are HTML, followed by GIF
and JPG (both images), ASCII text, and PDF, in that order.
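As a rough illustration of what "doubling every eight months" implies, the Python sketch below extrapolates from the 2005 estimate. Using 20 billion pages as the starting size of the static Web, and assuming the doubling rate stays constant, are simplifications made for this example; the function name `projected_pages` is hypothetical.

```python
def projected_pages(initial_pages, months, doubling_months=8.0):
    """Exponential growth: the size doubles every `doubling_months` months."""
    return initial_pages * 2 ** (months / doubling_months)


# Assumed starting point: the 2005 estimate of 20 billion pages.
pages_2005 = 20e9
for months in (8, 16, 24):
    billions = projected_pages(pages_2005, months) / 1e9
    print(f"after {months:2d} months: about {billions:.0f} billion pages")
```

Under these assumptions the estimate grows eightfold in two years, which shows why a crawler's coverage of the Web goes stale quickly.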