Lecture 1: IR Introduction
Outline
• Introduction
• Information retrieval using the Boolean model
• The dictionary and postings lists
• Tolerant retrieval
• Scoring and term weighting
• Vector space retrieval
• IR Evaluation
• Text Classification
• Relevance Feedback and Query Expansion
• Web Mining & Link Analysis
Text Book
Christopher D. Manning, "An Introduction to
Information Retrieval", Cambridge University
Press, Cambridge, England, 2009
Ricardo Baeza-Yates, Berthier Ribeiro-Neto, "Modern
Information Retrieval: The Concepts and Technology
behind Search", 2/E, 2011
Overview
• Introduction to IR
• IR vs DB
• IR and search engines
• The IR process
What is information retrieval
Information retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfy an information need
from within large collections (usually on local
computer servers or on the internet).
What is information retrieval
• Gathering information from one or more sources based on
an information need, usually expressed as a query
– Major assumption - that the information need can be
specified
– Broad definition of information
• Sources of information
– Other people
– Archived information (libraries, maps, etc.)
– Radio, TV, etc.
– Web
– Nature
What IR is usually not about
• Not about structured data (databases)
– Why?
– Growth of structured data?
• Retrieval from databases is usually not considered
– Database querying assumes that the data is in a
standardized format
– Transforming all information, news articles, web sites
into a database format is difficult for large data
collections
Unstructured (text) vs. structured (database) data in 1996
[Figure: bar chart comparing unstructured and structured data by data volume and market cap; vertical axis from 0 to 160]
Unstructured (text) vs. structured (database) data in 2006
[Figure: bar chart comparing unstructured and structured data by data volume and market cap; vertical axis from 0 to 160]
Information Retrieval vs. Data Retrieval
                 Data Retrieval    Information Retrieval
Matching         Exact match       Partial match, best match
Inference        Deduction         Induction
Model            Deterministic     Probabilistic
Classification   Monothetic        Polythetic
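The "Matching" row can be illustrated with a minimal sketch. The toy documents, the query, and the function names are assumptions made up for this example, and a simple count of shared terms stands in for a real scoring function.

```python
# Toy collection; the documents are illustrative assumptions.
DOCS = {
    "d1": "information retrieval with the boolean model",
    "d2": "database systems and structured query processing",
    "d3": "ranking documents for information needs",
}

def exact_match(query_terms, docs):
    """Data-retrieval style: keep only documents containing every query term."""
    return sorted(
        doc_id for doc_id, text in docs.items()
        if all(term in text.split() for term in query_terms)
    )

def best_match(query_terms, docs):
    """IR style: rank documents by how many query terms they contain."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.split())
        score = sum(1 for term in query_terms if term in words)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

query = ["information", "retrieval", "ranking"]
print(exact_match(query, DOCS))
print(best_match(query, DOCS))
```

No document contains all three query terms, so exact matching returns nothing, while best matching still returns the partially matching documents in ranked order.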
• Search engines
• Summarization
• Text mining
• Question answering
• Categorization - Document classification
What is a Search Engine
Search engines are the key to finding specific information on the vast expanse of the World
Wide Web. Without sophisticated search engines, it would be virtually impossible to locate
anything on the Web without knowing a specific URL. When people use the term search
engine in relation to the Web, they are usually referring to the search forms that
search through databases of HTML documents.
• Changing your web pages has no effect on your listing. Things that are useful for
improving a listing with a search engine have nothing to do with improving a listing in a
directory. The only exception is that a good site, with good content, might be more
likely to get reviewed for free than a poor site.
3) Hybrid search engine. In the web's early days, it used to be that a search engine either
presented crawler-based results or human-powered listings. Today, it is extremely
common for both types of results to be presented. Usually, a hybrid search engine will
favor one type of listings over another. For example, MSN Search is more likely to
present human-powered listings from LookSmart. However, it does also present
crawler-based results (as provided by Inktomi), especially for more obscure queries.
Web Crawling
• A Web crawler is a piece of software for downloading pages from the Web
• Also known as Web Spider, Web Robot, or simply Bot
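As a rough sketch of what a crawler does, the following breadth-first traversal visits pages starting from a seed URL. To stay self-contained it does not touch the network: `TOY_WEB` is a hypothetical link graph standing in for the Web, and `fetch_links` is an assumed callback that a real crawler would implement by downloading and parsing each page.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page URL -> outgoing links.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl from a seed URL, visiting each page at most once."""
    frontier = deque([seed])   # URLs discovered but not yet downloaded
    seen = {seed}              # URLs already queued, to avoid re-downloading
    crawled = []               # URLs downloaded and parsed, in visit order
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled

order = crawl("http://example.org/", lambda url: TOY_WEB.get(url, []))
print(order)
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other.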
[Figure: crawling progress, showing URLs crawled and parsed and the unseen Web]
[Figure: a typical Web search engine, with the components Web, Crawler, Indexer, Interface, and Users]
Web search engine
• A web search engine is designed to search
for information on the World Wide Web.
• The search results are usually presented in a
list of results and are commonly called hits.
The information may consist of web pages, images, and other types of files.
1. Web crawling
2. Indexing
3. Searching
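The indexing and searching steps can be sketched with a small inverted index. This is a minimal illustration under stated assumptions: `PAGES` stands in for the output of the crawling step, the function names are made up for this example, and queries are treated as a plain conjunction of terms.

```python
from collections import defaultdict

# Stand-in for the output of the crawling step: page id -> page text (assumed data).
PAGES = {
    "page1": "web search engines crawl the web",
    "page2": "an index maps terms to pages",
    "page3": "search engines build an index",
}

def build_index(pages):
    """Indexing: map each term to the sorted list of pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Searching: return the pages that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set(index.get(terms[0], []))
    for term in terms[1:]:
        hits &= set(index.get(term, []))
    return sorted(hits)

index = build_index(PAGES)
print(search(index, "search engines"))
print(search(index, "index"))
```

Building the index once lets each query be answered by looking up a few term lists instead of scanning every page.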
Crawling
• Web search engines work by storing
information about many web pages, which
they retrieve from the HTML of the pages themselves.
• Page Rank
• Anchor Text
• Scalability
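PageRank is only named in the list above. As a hedged illustration of the general idea (a page is important if important pages link to it), here is a minimal power-iteration sketch. The three-page graph, the damping factor of 0.85, and the iteration count are assumptions for the example, and the code assumes every link target also appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by repeated redistribution of rank."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, C is linked to by both other pages and ends up with the highest score.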
Impact of search engines
• Make the web scale!
• Unbelievable access to information
– Implications are only just being understood
– Democratization of humankind’s knowledge
• The online world
– I "googled" him just to see …
– Search is a crucial part of everyday life for many people and the second most
popular online activity after email.
– Social interactions
• The death of anonymity/privacy
– Nearly everyone is searchable
• Choicepoint
• Facebook
Finding Out About (FOA)
• Three phases:
– Asking of a question (the Information Need)
– Construction of an answer (IR proper)
– Assessment of the answer (Evaluation)
• Part of an iterative process
IR is an Iterative Process
[Figure: iterative process linking goals, workspace, and repositories]
IR Problem
• The key goal of an IR system is to retrieve all the
items that are relevant to a user query, while
retrieving as few nonrelevant items as possible
• The notion of relevance is of central importance
in IR
• The IR system must rank the information items
according to a degree of relevance to the user
query
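The goal stated above, retrieving all relevant items while retrieving few nonrelevant ones, is commonly quantified with recall and precision. These two terms are standard IR measures rather than something defined in these slides. The sketch below computes both for a made-up result list; the document ids and the function name are assumptions for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set  # relevant items that were retrieved
    # Precision: fraction of retrieved items that are relevant.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant items that were retrieved.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical system output and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found. Retrieving more documents can raise recall but usually lowers precision.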
[Figure: IR process diagram with the components user's information need, text input, parse, query, collections, pre-process, index, rank or match, and query reformulation]
IR is usually a dialog
Structure of an IR System
[Figure: information storage and retrieval system. Search line: interest profiles and queries. Storage line: documents and data. Output: potentially relevant documents.]
Is Information Retrieval?
• discovering new knowledge
• capturing existing knowledge
• sharing knowledge with others
• applying knowledge
Should we really be studying knowledge
retrieval?
Data, information, knowledge
• Data - Facts, observations, or perceptions.
• Information - Subset of data, only including those data that
possess context, relevance, and purpose.
• Knowledge - A more simplistic view considers knowledge as
being at the highest level in a hierarchy with data (at the lowest
level) and information (at the middle level).