Lect 1 IRIntroduction

Information Retrieval

Uploaded by

maramabdo26124

Information Retrieval

Outline
• Introduction
• Information retrieval using the Boolean model
• The dictionary and postings lists
• Tolerant retrieval
• Scoring and term weighting
• Vector space retrieval
• IR Evaluation
• Text Classification
• Relevance Feedback and Query Expansion
• Web Mining & Link Analysis
Text Book
Christopher D. Manning, "An Introduction to
Information Retrieval", Cambridge University
Press, Cambridge, England, 2009
Ricardo Baeza-Yates, Berthier Ribeiro-Neto, "Modern
Information Retrieval: The Concepts and Technology
behind Search", 2/E, 2011
Overview
• Introduction to IR
• IR vs DB
• IR and search engines
• The IR process
What is information retrieval
Information retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfy an information need
from within large collections (usually on local
computer servers or on the internet).
What is information retrieval
• Gathering information from a source(s) based on
an information need usually from a query
– Major assumption - that the information need can be
specified
– Broad definition of information
• Sources of information
– Other people
– Archived information (libraries, maps, etc.)
– Radio, TV, etc.
– Web
– Nature
What IR is usually not about
• Not about structured data (databases)
– Why?
– Growth of structured data?
• Retrieval from databases is usually not considered
– Database querying assumes that the data is in a
standardized format
– Transforming all information, news articles, web sites
into a database format is difficult for large data
collections
Unstructured (text) vs. structured (database) data in 1996
[Figure: bar chart comparing unstructured and structured data by data volume and market cap; vertical axis from 0 to 160]
Unstructured (text) vs. structured (database) data in 2006
[Figure: bar chart comparing unstructured and structured data by data volume and market cap; vertical axis from 0 to 160]
Information Retrieval vs. Data Retrieval
                      Data Retrieval    Information Retrieval
Matching              Exact match       Partial match, best match
Inference             Deduction         Induction
Model                 Deterministic     Probabilistic
Classification        Monothetic        Polythetic
Query language        Artificial        Natural
Query specification   Complete          Incomplete
Error response        Sensitive         Insensitive
Items wanted          Matching          Relevant

IR vs. databases:
Unstructured vs. Structured data
• “Structured data” tends to refer to
information in “tables”
Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000
Typically allows numerical range and exact match
(for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
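As a minimal sketch, the example query can be run over the three-row table in Python. The list-of-dictionaries representation and the lowercase field names are assumptions for illustration; a real database would use a query language such as SQL.

```python
# The employee table from the slide, as a list of rows (illustrative representation)
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]

# Salary < 60000 AND Manager = Smith:
# a numerical range condition combined with an exact text match
matches = [
    row["employee"]
    for row in employees
    if row["salary"] < 60000 and row["manager"] == "Smith"
]
print(matches)
```

Every row either satisfies both conditions or is excluded; there is no notion of a partially matching row.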
How much information is there?
• Soon most everything will be recorded and indexed
• Most bytes will never be seen by humans.
• Data summarization and trend detection are key technologies
[Figure: data-size scale (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta) with examples ranging from A Book, A Photo and A Movie through All books (words) and All Books MultiMedia up to Everything Recorded]
Ideal Information Retrieval
• The answer should be:
– what is actually needed (relevant)
– available when you want it
– available where you want it
– tailored to the user (personalization)
– your information needs anticipated
• IR is very concerned with relevance
What is relevance?
• An answer(s) that fits your need.
What is relevance?
• In IR relevance is everything
• Relevant information is information that is suited to your
information need.
• Dependent on
– User
– Space/time
– Group
– Context
What an IR system should do
• Store/archive information
• Provide access to that information
• Answer queries with relevant information
• Stay current
• Future list
– Understand the user’s queries
– Understand the user’s need
– Act as an assistant
How good is the IR system
Measures of performance based on what the system
returns:
• Relevance
• Coverage
• Recency
• Functionality (e.g. query syntax)
• Speed
• Availability
• Usability
• Time/ability to satisfy user requests
How IR systems work
Algorithms implemented in software
• Gathering of information
• Storage of information
• Indexing
• Interaction
• Evaluation
How is IR accomplished?
• Ask someone
• Search
– Search for someone to ask
– Search for needed information
– Use a search engine
• Process of IR - queries or questions
What is SEARCH?
• The activity of looking thoroughly in order to find something or
someone
• To request the electronic retrieval of documents based on the
presence of specific terms and within other restrictions established
(e.g., subject, date, journal, etc.).
• Search results list: the list of documents retrieved as a result of a
search request submitted.
• Intelligently seeking answers to a known or
unknown question, often as part of solving a
larger problem
Existing Popular IR System:
IR applications

• Search engines

• Summarization

• Text mining

• Question answering
• Categorization - Document classification
What is a Search Engine
Search engines are the key to finding specific information on the vast expanse of the World
Wide Web. Without sophisticated search engines, it would be virtually impossible to locate
anything on the Web without knowing a specific URL. When people use the term search
engine in relation to the Web, they are usually referring to the actual search forms that
search through databases of HTML documents.

There are basically three types of search engines:


1) Crawler-based search engines (Spiders) are those that use automated software agents
(called crawlers) that visit a Web site, read the information on the actual site, read the site's
meta tags, and also follow the links that the site connects to, indexing all
linked Web sites as well. The crawler returns all that information to a central
repository, where the data is indexed. The crawler will periodically return to the sites to
check for any information that has changed. The frequency with which this happens is
determined by the administrators of the search engine.
Types of search engines
2) Human-powered search engines rely on humans to submit information that is
subsequently indexed and catalogued. Only information that is submitted is put into the
index. A human-powered directory, such as the Open Directory, depends on humans for
its listings. You submit a short description to the directory for your entire site, or editors
write one for sites they review. A search looks for matches only in the descriptions
submitted.

• Changing your web pages has no effect on your listing. Things that are useful for
improving a listing with a search engine have nothing to do with improving a listing in a
directory. The only exception is that a good site, with good content, might be more
likely to get reviewed for free than a poor site.

3) Hybrid search engine. In the web's early days, a search engine either
presented crawler-based results or human-powered listings. Today, it is extremely
common for both types of results to be presented. Usually, a hybrid search engine will
favor one type of listings over another. For example, MSN Search is more likely to
present human-powered listings from LookSmart. However, it does also present
crawler-based results (as provided by Inktomi), especially for more obscure queries.
Web Crawling
• A Web Crawler is software for downloading pages from the Web
• Also known as Web Spider, Web Robot, or simply Bot
Cycle of a Web crawling process:
• The crawler starts by downloading a set of seed pages, which are parsed and
scanned for new links
• The links to pages that have not yet been downloaded are added to a
central queue for download later
• Next, the crawler selects a new page for download and the process is
repeated until a stop criterion is met
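The crawling cycle above can be sketched as follows. This is a minimal illustration under stated assumptions, not a production crawler: the Web is replaced by a hypothetical in-memory link graph (`TOY_WEB`), `fetch_links` stands in for downloading and parsing a page, and the stop criterion is a simple page limit (`max_pages`).

```python
from collections import deque

# Hypothetical in-memory "Web": page URL -> links found on that page
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Download pages starting from the seeds until the queue is empty
    or max_pages (the stop criterion assumed here) is reached."""
    frontier = deque(seeds)   # central queue of pages waiting for download
    seen = set(seeds)         # URLs already queued or downloaded
    downloaded = []
    while frontier and len(downloaded) < max_pages:
        url = frontier.popleft()          # select the next page
        downloaded.append(url)            # "download" it
        for link in fetch_links(url):     # parse and scan for new links
            if link not in seen:          # only queue pages not seen before
                seen.add(link)
                frontier.append(link)
    return downloaded

order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
print(order)
```

A first-in, first-out queue gives a breadth-first crawl; other selection policies would change only how the next page is chosen from the frontier.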
Crawling picture
[Figure: the Web during a crawl, showing seed pages, URLs crawled and parsed, the URLs frontier, and the unseen Web]
What are indices?
Indices are giant databases of information that is collected and stored and
subsequently searched
How search engines work?
[Figure: A Typical Web Search Engine, with the components Users, Interface, Query Engine, Index, Indexer, Crawler, and Web]
Web search engine
• A web search engine is designed to search
for information on the World Wide Web.
• The search results are usually presented in a
list of results and are commonly called hits.
The information may consist of web pages, images, information and
other types of files.
• Search engines operate algorithmically or
are a mixture of algorithmic and human
input.
How they work..
• A search engine operates, in the following
order:

1. Web crawling

2. Indexing

3. Searching
Crawling
• Web search engines work by storing
information about many web pages, which
they retrieve from the HTML of the pages themselves.
• These pages are retrieved by a Web
crawler (sometimes also known as a spider),
an automated Web browser which
follows every link on the site.
Indexing
• Data about web pages are stored in an index
database for use in later queries.

• The purpose of an index is to allow
information to be found as quickly as
possible.
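One common way to make lookups fast is an inverted index, which maps each term to the set of pages that contain it. The sketch below is illustrative only: the two pages are made up, and the tokenization (lowercasing and splitting on whitespace) is a simplifying assumption.

```python
from collections import defaultdict

def build_index(pages):
    """Map each term to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():   # naive tokenization (assumption)
            index[term].add(page_id)
    return index

# Hypothetical pages, for illustration only
pages = {
    "p1": "Web search engines crawl the Web",
    "p2": "An index makes search fast",
}
index = build_index(pages)
print(sorted(index["search"]))
```

Finding the pages for a term is then a single dictionary lookup instead of a scan over every stored page.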
Searching
• A query can be a single word.
• When a user enters a query into a search
engine (typically by using key words), the
engine examines its index and provides a
listing of best-matching web pages
according to its criteria.
• Each result is usually shown with a short summary
containing the document's title and
sometimes parts of the text.
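A minimal sketch of this lookup step follows. The page contents, the requirement that a page contain all query words, and the fixed-length snippet are assumptions for illustration; actual engines apply their own matching criteria and summary generation.

```python
def tokenize(text):
    # Naive tokenization (assumption): lowercase, drop periods, split on spaces
    return text.lower().replace(".", "").split()

def build_index(pages):
    """Map each term to the set of page ids containing it."""
    index = {}
    for page_id, page in pages.items():
        for term in tokenize(page["text"]):
            index.setdefault(term, set()).add(page_id)
    return index

def search(query, index, pages, snippet_len=40):
    """Return pages containing every query word, each with title and snippet."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    hits = set.intersection(*postings)   # pages that contain all the terms
    return [
        {"id": page_id,
         "title": pages[page_id]["title"],
         "snippet": pages[page_id]["text"][:snippet_len]}
        for page_id in sorted(hits)
    ]

# Hypothetical pages, for illustration only
pages = {
    "p1": {"title": "Web crawling",
           "text": "A crawler downloads pages from the Web and follows links."},
    "p2": {"title": "Indexing",
           "text": "An index allows information to be found as quickly as possible."},
}
index = build_index(pages)
for result in search("index", index, pages):
    print(result["title"], "|", result["snippet"])
```

The query is answered from the index alone; the stored page text is consulted only to build the summary shown to the user.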
Which search engine is more useful?

• The usefulness of a search engine depends
on the relevance of the result set it gives
back.
• While there may be millions of web pages
that include a particular word or phrase,
some pages may be more relevant, popular,
or authoritative than others.
Ranking
• Most search engines employ methods
to rank the results to provide the "best"
results first.
• They compute a numeric score on how
well each result matches the query
• How a search engine decides which pages
are the best matches, and what order the
results should be shown in, varies widely
from one engine to another.
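As a hedged illustration of score-based ranking, the sketch below scores each document by counting occurrences of the query words and sorts by that score. This simple count is an assumption chosen for clarity; it is not the scoring method of any particular search engine.

```python
def score(query, text):
    """Numeric score: total occurrences of the query words in the text."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

def rank(query, docs):
    """Return (score, doc_id) pairs, best score first, omitting non-matches."""
    scored = [(score(query, text), doc_id) for doc_id, text in docs.items()]
    matching = [(s, doc_id) for s, doc_id in scored if s > 0]
    # Sort by descending score; ties are broken by document id
    return sorted(matching, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical documents, for illustration only
docs = {
    "d1": "search engines rank search results",
    "d2": "a search engine returns hits",
    "d3": "cooking recipes",
}
ranking = rank("search results", docs)
print(ranking)
```

Swapping in a different `score` function changes the order of the results without changing the rest of the pipeline, which is one reason rankings differ so much between engines.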
Let's go back in time! 1998
• What exactly is Google!
1993 Aliweb Launch
1994 WebCrawler Launch
1994 Infoseek Launch
1994 Lycos Launch
1995 AltaVista Launch
1995 Excite Launch
1996 Dogpile Launch
1996 Inktomi Founded
1996 Ask Jeeves Founded
1997 Northern Light Launch
1998 Google Launch
1999 AlltheWeb Launch
2000 Teoma Founded
2003 Objects Search Launch
2004 Yahoo! Search Final launch (first original results)
2004 MSN Search Beta launch
2005 MSN Search Final launch
2005 Kosmix Beta Launch
Google Search Engine
• The Google web search engine is the company's most
popular service.
• According to market research in November 2009, Google
was the dominant search engine in the US market.
• Google indexes billions of Web pages, so that users can
search for the information they desire, through the use
of keywords and operators, although at any given time it
will only return a maximum of 1,000 results for any
specific search query.
What made Google so great?
• PageRank
• Anchor Text
• Scalability
Impact of search engines
• Make the web scale!
• Unbelievable access to information
– Implications are only just being understood
– Democratization of humankind’s knowledge
• The online world
– I "googled" him just to see …
– Search is a crucial part of everyday life for many people and the 2nd most
popular online activity after email.
– Social interactions
• The death of anonymity/privacy
– Nearly everyone is searchable
• Choicepoint
• Facebook
Finding Out About (FOA)
• Three phases:
– Asking of a question (the Information Need)
– Construction of an answer (IR proper)
– Assessment of the answer (Evaluation)
• Part of an iterative process
IR is an Iterative Process
[Figure: iterative cycle linking Goals, Workspace, and Repositories]
IR Problem
• The key goal of an IR system is to retrieve all the
items that are relevant to a user query, while
retrieving as few nonrelevant items as possible
• The notion of relevance is of central importance
in IR
• The IR system must rank the information items
according to a degree of relevance to the user
query
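The two halves of this goal are commonly measured by recall (how many of the relevant items were retrieved) and precision (how many of the retrieved items are relevant). These measure names are standard IR terminology rather than terms used on this slide, and the document ids below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / all retrieved;
    recall = relevant retrieved / all relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and relevance judgments
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving everything maximizes recall at the cost of precision, so an IR system has to balance the two.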
[Figure: the IR process. The user's information need is given as text input and parsed into a query; collections are pre-processed and indexed; the query and the index meet in a Rank or Match step, followed by Query Reformulation.]
IR is usually a dialog
– The exchange doesn’t end with first answer
– User can recognize elements of a useful answer
– Questions and understanding change as the process
continues.
Information Retrieval
• Revised Goal Statement:
Build a system that retrieves documents that users
are likely to find relevant to their queries.
• This set of assumptions underlies the field
of Information Retrieval.
Structure of an IR System
[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19.
Search line: interest profiles & queries → formulating query in terms of descriptors → storage of profiles (Store1: profiles/search requests).
Storage line: documents & data → indexing (descriptive and subject) → storage of documents (Store2: document representations).
Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language).
Comparison/matching of Store1 and Store2 → potentially relevant documents.]
Is Information Retrieval about:
• discovering new knowledge
• capturing existing knowledge
• sharing knowledge with others
• applying knowledge
Should we really be studying knowledge
retrieval?
Data, information, knowledge
• Data - Facts, observations, or perceptions.
• Information - Subset of data, only including those data that
possess context, relevance, and purpose.
• Knowledge - A more simplistic view considers knowledge as
being at the highest level in a hierarchy with data (at the lowest
level) and information (at the middle level).

• Data refers to bare facts void of context.
– A telephone number.
• Information is data in context.
– A phone book.
• Knowledge is information that facilitates action.
– Recognizing that a phone number belongs to a good client,
who needs to be called once per week to get his orders.
