Lect 1 IRIntroduction

Information retrieval

Uploaded by

maramabdo26124

Information Retrieval

Outline
• Introduction
• Information retrieval using the Boolean model
• The dictionary and postings lists
• Tolerant retrieval
• Scoring and term weighting
• Vector space retrieval
• IR Evaluation
• Text Classification
• Relevance Feedback and Query Expansion
• Web Mining & Link Analysis
Text Book
• Christopher D. Manning, “An Introduction to
Information Retrieval”, Cambridge University
Press, Cambridge, England, 2009
• Ricardo Baeza-Yates, Berthier Ribeiro-Neto, “Modern
Information Retrieval: The Concepts and
Technology behind Search”, 2/E, 2011
Overview
• Introduction to IR
• IR vs DB
• IR and search engines
• The IR process
What is information retrieval
Information retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfy an information need
from within large collections (usually on local
computer servers or on the internet).
What is information retrieval
• Gathering information from one or more sources based on
an information need, usually expressed as a query
– Major assumption - that the information need can be
specified
– Broad definition of information
• Sources of information
– Other people
– Archived information (libraries, maps, etc.)
– Radio, TV, etc.
– Web
– Nature
What IR is usually not about
• Not about structured data (databases)
– Why?
– Growth of structured data?
• Retrieval from databases is usually not considered
– Database querying assumes that the data is in a
standardized format
– Transforming all information, news articles, web sites
into a database format is difficult for large data
collections
Unstructured (text) vs. structured
(database) data in 1996
[Figure: bar chart comparing unstructured and structured data on two measures, data volume and market cap]
Unstructured (text) vs. structured
(database) data in 2006
[Figure: bar chart comparing unstructured and structured data on two measures, data volume and market cap]
Information Retrieval Vs. Data Retrieval
                      Data Retrieval    Information Retrieval
Matching              Exact match       Partial match, best match
Inference             Deduction         Induction
Model                 Deterministic     Probabilistic
Classification        Monothetic        Polythetic
Query language        Artificial        Natural
Query specification   Complete          Incomplete
Error response        Sensitive         Insensitive
Items wanted          Matching          Relevant

IR vs. databases:
Unstructured vs. Structured data
• “Structured data” tends to refer to
information in “tables”
Employee Manager Salary
Smith Jones 50000
Chang Smith 60000
Ivy Smith 50000

Typically allows numerical range and exact match
(for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
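The example query can be sketched in Python as a filter over rows. This is only an illustration, assuming the three-row table from the slide is held as a list of dictionaries; the names `employees` and `select` are hypothetical and not part of the lecture.

```python
# Rows of the example table; the list-of-dicts layout is an assumption.
employees = [
    {"Employee": "Smith", "Manager": "Jones", "Salary": 50000},
    {"Employee": "Chang", "Manager": "Smith", "Salary": 60000},
    {"Employee": "Ivy", "Manager": "Smith", "Salary": 50000},
]


def select(rows, predicate):
    """Return the rows for which the predicate holds (exact-match semantics)."""
    return [row for row in rows if predicate(row)]


# Salary < 60000 AND Manager = Smith
result = select(employees, lambda r: r["Salary"] < 60000 and r["Manager"] == "Smith")
print([r["Employee"] for r in result])  # prints ['Ivy']
```

Every row either satisfies the condition or does not; there is no notion of a row being a "better" or "worse" answer, which is the contrast with IR drawn above.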
How much information is there?
• Soon most everything will be recorded and indexed
• Most bytes will never be seen by humans.
• Data summarization and trend detection are key technologies
[Figure: scale of data sizes from Kilo, Mega, Giga, Tera, Peta, Exa and Zetta up to Yotta, with markers for a book, a photo, a movie, all books (words), all books (multimedia), and everything recorded]
Ideal Information Retrieval
• The answer should be:
– what is actually needed (relevant)
– available when you want it
– available where you want it
– tailored to the user (personalization)
– your information needs anticipated
• IR is very concerned with relevance
What is relevance?
• An answer(s) that fits your need.
What is relevance?
• In IR relevance is everything
• Relevant information is that which suits your
information need.
• Dependent on
– User
– Space/time
– Group
– Context
What an IR system should do
• Store/archive information
• Provide access to that information
• Answer queries with relevant information
• Stay current
• Future list
– Understand the user’s queries
– Understand the user’s need
– Act as an assistant
How good is the IR system
Measures of performance based on what the system
returns:
• Relevance
• Coverage
• Recency
• Functionality (e.g. query syntax)
• Speed
• Availability
• Usability
• Time/ability to satisfy user requests
How IR systems work
Algorithms implemented in software
• Gathering of information
• Storage of information
• Indexing
• Interaction
• Evaluation
How is IR accomplished?
• Ask someone
• Search
– Search for someone to ask
– Search for needed information
– Use a search engine
• Process of IR - queries or questions
What is SEARCH?
• The activity of looking thoroughly in order to find something or
someone
• To request the electronic retrieval of documents based on the
presence of specific terms and within other restrictions established
(e.g., subject, date, journal, etc.). A search results list is the list of
documents retrieved as a result of a submitted search request.
• Intelligently seeking answers to a known or
unknown question, often as part of solving a
larger problem
Existing Popular IR System:
IR applications
• Search engines
• Summarization
• Text mining
• Question answering
• Categorization - Document classification
What is a Search Engine
Search engines are the key to finding specific information on the vast expanse of the World
Wide Web. Without sophisticated search engines, it would be virtually impossible to locate
anything on the Web without knowing a specific URL. When people use the term search
engine in relation to the Web, they are usually referring to the actual search forms that
search through databases of HTML documents.

There are basically three types of search engines:

1) Crawler-based search engines (Spiders) are those that use automated software agents
(called crawlers) that visit a Web site, read the information on the actual site, read the site's
meta tags, and also follow the links that the site connects to, indexing all
linked Web sites as well. The crawler returns all that information back to a central
repository, where the data is indexed. The crawler will periodically return to the sites to
check for any information that has changed. The frequency with which this happens is
determined by the administrators of the search engine.
Types of search engines
2) Human-powered search engines rely on humans to submit information that is
subsequently indexed and catalogued. Only information that is submitted is put into the
index. A human-powered directory, such as the Open Directory, depends on humans for
its listings. You submit a short description to the directory for your entire site, or editors
write one for sites they review. A search looks for matches only in the descriptions
submitted.

• Changing your web pages has no effect on your listing. Things that are useful for
improving a listing with a search engine have nothing to do with improving a listing in a
directory. The only exception is that a good site, with good content, might be more
likely to get reviewed for free than a poor site.

3) Hybrid search engine. In the web's early days, it used to be that a search engine either
presented crawler-based results or human-powered listings. Today, it is extremely
common for both types of results to be presented. Usually, a hybrid search engine will
favor one type of listings over another. For example, MSN Search is more likely to
present human-powered listings from LookSmart. However, it does also present
crawler-based results (as provided by Inktomi), especially for more obscure queries.
Web Crawling
• A Web Crawler is software for downloading pages from the Web
• Also known as Web Spider, Web Robot, or simply Bot

Cycle of a Web crawling process:

• The crawler starts by downloading a set of seed pages, which are parsed and
scanned for new links
• The links to pages that have not yet been downloaded are added to a
central queue for download later
• Next, the crawler selects a new page for download and the process is
repeated until a stop criterion is met
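The cycle above can be sketched as a breadth-first loop over a queue. This is a minimal sketch, not a production crawler: it assumes a toy in-memory "Web" instead of real HTTP downloads, uses a page limit as the stop criterion, and the names `TOY_WEB`, `crawl` and `toy_fetch_links` are hypothetical.

```python
from collections import deque

# A toy "Web": each URL maps to the links found on that page (made-up data).
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/", "a.example/1"],
    "c.example/": [],
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and parsing it for links."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages=100):
    """Download pages starting from the seeds, queueing links not yet seen."""
    frontier = deque(seeds)  # central queue of URLs awaiting download
    seen = set(seeds)        # URLs already queued or downloaded
    downloaded = []
    while frontier and len(downloaded) < max_pages:  # stop criterion
        url = frontier.popleft()
        downloaded.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return downloaded


print(crawl(["a.example/"], toy_fetch_links))
# prints ['a.example/', 'a.example/1', 'b.example/', 'c.example/']
```

A real crawler would replace `toy_fetch_links` with an HTTP download plus HTML parsing, and would typically add politeness delays and revisit scheduling, none of which is modelled here.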
Crawling picture
[Figure: the Web during a crawl, showing seed URLs, URLs crawled and parsed, the URL frontier, and unseen Web pages]
What are indices?
Indices are giant databases of information that is collected and stored and
subsequently searched
How search engines work?
[Figure: architecture diagram with the components Users, Interface, Query Engine, Index, Indexer, Crawler and Web]
A Typical Web Search Engine
Web search engine
• A web search engine is designed to search
for information on the World Wide Web.
• The search results are usually presented in a
list of results and are commonly called hits.
The information may consist of web pages, images, information and
other types of files.

• Search engines operate algorithmically or
are a mixture of algorithmic and human
input.
How they work..
• A search engine operates in the following
order:
1. Web crawling
2. Indexing
3. Searching
Crawling
• Web search engines work by storing
information about many web pages, which
they retrieve from the HTML of the pages themselves.
• These pages are retrieved by a Web
crawler (sometimes also known as a spider),
an automated Web browser that
follows every link on the site.
Indexing
• Data about web pages are stored in an index
database for use in later queries.
• The purpose of an index is to allow
information to be found as quickly as
possible.
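One common way to make lookups fast is an inverted index, which maps each term to the documents that contain it. The sketch below assumes whitespace tokenization and three made-up documents; `build_index` and `lookup_all` are hypothetical helper names, and real engines store far more than document IDs.

```python
from collections import defaultdict

# Made-up documents keyed by document ID.
docs = {
    1: "web search engines index web pages",
    2: "a crawler downloads pages from the web",
    3: "an index makes search fast",
}


def build_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def lookup_all(index, query):
    """Return IDs of documents containing every query term (Boolean AND)."""
    term_sets = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*term_sets)) if term_sets else []


index = build_index(docs)
print(index["web"])                       # prints [1, 2]
print(lookup_all(index, "search index"))  # prints [1, 3]
```

Answering a query touches only the postings of the query terms rather than scanning every document, which is why the index is built ahead of time.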
Searching
• A query can be a single word.
• When a user enters a query into a search
engine (typically by using key words), the
engine examines its index and provides a
listing of best-matching web pages
according to its criteria.
• Each result is usually shown with a short summary
containing the document's title and
sometimes parts of the text.
Which search engine is more useful?
• The usefulness of a search engine depends
on the relevance of the result set it gives
back.
• While there may be millions of web pages
that include a particular word or phrase,
some pages may be more relevant, popular,
or authoritative than others.
Ranking
• Most search engines employ methods
to rank the results to provide the "best"
results first.
• They compute a numeric score on how
well each result matches the query
• How a search engine decides which pages
are the best matches, and what order the
results should be shown in, varies widely
from one engine to another.
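As a rough illustration of scoring and ranking, the sketch below scores each document by how often the query terms occur in it and returns the best matches first. The scoring rule is an assumption chosen for simplicity and is not how any particular search engine ranks; `score`, `rank` and the sample documents are hypothetical.

```python
from collections import Counter


def score(query, text):
    """Score a document by how many times the query terms occur in it."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())


def rank(query, documents):
    """Return (doc_id, score) pairs with a nonzero score, best match first."""
    scored = [(doc_id, score(query, text)) for doc_id, text in documents.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    # Sort by descending score; ties are broken by ascending document ID.
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "d1": "information retrieval finds relevant information",
    "d2": "database retrieval uses exact match",
    "d3": "search engines rank results",
}
print(rank("information retrieval", docs))  # prints [('d1', 3), ('d2', 1)]
```

Unlike the exact-match database query, a document that matches only part of the query (`d2`) is still returned, just lower in the list.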
Let's go back in time! 1998
• What exactly is Google!
1993 Aliweb Launch
1994 WebCrawler Launch
1994 Infoseek Launch
1994 Lycos Launch
1995 AltaVista Launch
1995 Excite Launch
1996 Dogpile Launch
1996 Inktomi Founded
1996 Ask Jeeves Founded
1997 Northern Light Launch
1998 Google Launch
1999 AlltheWeb Launch
2000 Teoma Founded
2003 Objects Search Launch
2004 Yahoo! Search Final launch (first original results)
2004 MSN Search Beta launch
2005 MSN Search Final launch
2005 Kosmix Beta Launch
Google Search Engine
• The Google web search engine is the company's most
popular service.
• According to market research in November 2009, Google
is the dominant search engine in the US market.
• Google indexes billions of Web pages, so that users can
search for the information they desire, through the use
of keywords and operators, although at any given time it
will only return a maximum of 1,000 results for any
specific search query.
What made Google so great?
• PageRank
• Anchor Text
• Scalability
Impact of search engines
• Make the web scale!
• Unbelievable access to information
– Implications are only just being understood
– Democratization of humankind’s knowledge
• The online world
– I “googled” him just to see …
– Search is a crucial part of everyday life for many people and the 2nd most
popular online activity after email.
– Social interactions
• The death of anonymity/privacy
– Nearly everyone is searchable
• Choicepoint
• Facebook
Finding Out About (FOA)
• Three phases:
– Asking of a question (the Information Need)
– Construction of an answer (IR proper)
– Assessment of the answer (Evaluation)
• Part of an iterative process
IR is an Iterative Process
[Figure: iterative cycle linking Goals, Workspace and Repositories]
IR Problem
• The key goal of an IR system is to retrieve all the
items that are relevant to a user query, while
retrieving as few nonrelevant items as possible
• The notion of relevance is of central importance
in IR
• The IR system must rank the information items
according to a degree of relevance to the user
query
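The two halves of this goal are usually quantified with the standard IR measures recall (the share of relevant items that were retrieved) and precision (the share of retrieved items that are relevant). Those names do not appear on this slide; the sketch below assumes that relevance judgements are available as a set of item IDs, and `precision_recall` and the sample IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query from two collections of IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ["d1", "d2", "d3", "d4"]  # what the system returned
relevant = ["d1", "d3", "d5"]         # what the user actually needed
p, r = precision_recall(retrieved, relevant)
print(p, r)  # precision is 2/4, recall is 2/3
```

Retrieving everything would push recall to 1 while precision collapses, so the goal statement asks for both at once.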
[Figure: the IR process, built up over three slides. Components: user’s information need, text input, parse, query, collections, pre-process, index, rank or match, and query reformulation]
IR is usually a dialog
– The exchange doesn’t end with the first answer
– User can recognize elements of a useful answer
– Questions and understanding change as the process
continues.
Information Retrieval
• Revised Goal Statement:
Build a system that retrieves documents that users
are likely to find relevant to their queries.
• This set of assumptions underlies the field
of Information Retrieval.
Structure of an IR System
[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19. Two lines share the “rules of the game”, which are the rules for subject indexing plus a thesaurus (consisting of a lead-in vocabulary and an indexing language). Search line: interest profiles and queries; formulating the query in terms of descriptors; storage of profiles (Store 1: profiles/search requests). Storage line: documents and data; indexing (descriptive and subject); storage of documents (Store 2: document representations). Comparison/matching of the two stores yields potentially relevant documents.]
Is Information Retrieval?
• discovering new knowledge
• capturing existing knowledge
• sharing knowledge with others
• applying knowledge
Should we really be studying knowledge
retrieval?
Data, information, knowledge
• Data - Facts, observations, or perceptions.
• Information - Subset of data, only including those data that
possess context, relevance, and purpose.
• Knowledge - A more simplistic view considers knowledge as
being at the highest level in a hierarchy with data (at the lowest
level) and information (at the middle level).

• Data refers to bare facts void of context.
– A telephone number.
• Information is data in context.
– A phone book.
• Knowledge is information that facilitates action.
– Recognizing that a phone number belongs to a good client,
who needs to be called once per week to get his orders.
