Web Information Retrieval
• Information need:
- The topic about which the user desires to obtain information, satisfying a conscious
or unconscious need.
- Differentiated from (but expressed as) a query.
• Query:
- What the user communicates to the system in an attempt to express the
information need in words (or another format).
• Relevant information resource:
- Retrieved information that the user perceives as valuable with respect to his/her
information need.
• Collection of resources:
- In the case of text documents, it is referred to as a corpus, but it can refer to a collection of
any sort of unstructured data (text, images, videos, audio, etc.).
- Often the resources themselves are not kept or stored directly in the IR system, but are
instead represented in the system by surrogates or metadata.
IR Model
Structured vs. Unstructured Data
Database Management:
• Focused on structured data stored in relational tables rather than free-form text.
• Focused on efficient processing of well-defined queries in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data (XML) brings it closer to IR.
Artificial Intelligence:
• Focused on the representation of knowledge, reasoning, and intelligent action.
• Formalisms for representing knowledge and queries: – First-order Predicate Logic –
Bayesian Networks
• Recent work on web ontologies and intelligent information agents brings it closer to
IR.
Natural Language Processing:
• Focused on the syntactic, semantic, and pragmatic analysis of natural language text
and discourse.
• Ability to analyze syntax (phrase structure) and semantics could allow retrieval based
on meaning rather than keywords.
• Methods for determining the sense of an ambiguous word based on context (word sense
disambiguation).
• Methods for identifying specific pieces of information in a document (information extraction).
• Methods for answering specific NL questions from document corpora or structured data
Machine Learning:
• Focused on the development of computational systems that improve their performance with
experience.
• Automated classification of examples based on learning concepts from labeled training
examples (supervised learning).
• Automated methods for clustering unlabeled examples into meaningful groups (unsupervised
learning).
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining
• Learning to Rank
What makes WIR specific?
• Much larger than traditional information resources
• Presence of hyperlinks
• Data is semi-structured
• Content evolves significantly over time
• Multiple content types (text, images, and even tables) and applications
• Highly variable document quality
Boolean Model:
– Retrieval based on Boolean algebra
– Binary concept of relevance (yes/no)
• No ranking!
– Queries use Boolean operators (AND, OR, NOT)
• 𝑀 × 𝑁 term-document matrix (𝑀 terms, 𝑁 documents): entry (𝑖, 𝑗) is 1 iff term 𝑖 occurs in document 𝑗
Alternative View using T-D-Matrix
• Single term query:
– Result: row in T-D-Matrix
• Combination:
– Bit operations on rows
• Example:
– coffee AND tea
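The bit operations on rows can be sketched in Python; the toy term-document incidence matrix below (terms and contents) is made up for illustration:

```python
# Toy term-document incidence matrix: each row lists, per document,
# whether the term occurs in it (1) or not (0).
terms = {"coffee": [1, 0, 1, 1],
         "tea":    [1, 1, 0, 1],
         "milk":   [0, 1, 1, 0]}

def answer_and(term1, term2):
    """Evaluate `term1 AND term2` by AND-ing the two rows bitwise."""
    return [a & b for a, b in zip(terms[term1], terms[term2])]

# coffee AND tea -> documents 0 and 3 contain both terms
print(answer_and("coffee", "tea"))  # [1, 0, 0, 1]
```

OR and NOT work the same way, with `|` on two rows and `1 - bit` on a single row.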
Inverted Index
To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The
major steps in this are:
1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing
terms.
4. Index the documents that each term occurs in by creating an inverted index, consisting of a
dictionary and postings.
DOCID - Within a document collection, we assume that each document has a unique serial
number, known as the document identifier (docID).
• BUT operator
– Binary operator
– Defined as: 𝑞1 BUT 𝑞2 = 𝑞1 AND(NOT 𝑞2 )
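Assuming postings lists are sorted lists of docIDs, the BUT operator can be sketched as a set difference (the lists below are illustrative):

```python
def boolean_but(postings1, postings2):
    """q1 BUT q2 = q1 AND (NOT q2): keep docIDs from postings1
    that do not occur in postings2."""
    excluded = set(postings2)
    return [d for d in postings1 if d not in excluded]

print(boolean_but([1, 2, 4, 7], [2, 5, 7]))  # [1, 4]
```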
Phrase queries
Biword indexes
One approach to handling phrases is to consider every pair of consecutive terms in a document
as a phrase. For example, the text Friends, Romans, Countrymen would generate the biwords:
friends romans
romans countrymen
In this model, we treat each of these biwords as a vocabulary term. Being able to process two-
word phrase queries is immediate. Longer phrases can be processed by breaking them down.
The query stanford university palo alto can be broken into the Boolean query on biwords:
“stanford university” AND “university palo” AND “palo alto”
Without examining the documents themselves, we cannot verify that the documents matching
the above Boolean query actually contain the phrase (false positives are possible).
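Biword generation from a token sequence is a one-liner; a sketch using the examples above:

```python
def biwords(tokens):
    """Treat every pair of consecutive tokens as one vocabulary term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']

# A longer phrase query becomes a conjunction of its biwords:
print(biwords(["stanford", "university", "palo", "alto"]))
# ['stanford university', 'university palo', 'palo alto']
```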
Issues for biword indexes
• False positives, as noted before
• Index blowup due to bigger dictionary
– Infeasible for more than biwords, big even for them
• Biword indexes are not the standard solution (for all biwords) but can be part of a
compound strategy
Positional indexes
Here, for each term in the vocabulary, we store postings of the form docID: ⟨position1, position2, . . .⟩,
where each position is a token index in the document. Each posting will also usually record the term
frequency.
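A minimal sketch of a positional index and of answering a two-word phrase query with it (the documents are made up; real systems merge position lists more carefully):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: docID -> list of tokens.
    Returns term -> {docID: [positions]}; len(positions) is the tf."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token][doc_id].append(pos)
    return index

def phrase_query(index, w1, w2):
    """Docs in which w2 occurs at the position right after w1."""
    hits = []
    for doc_id, positions in index[w1].items():
        later = set(index[w2].get(doc_id, []))
        if any(p + 1 in later for p in positions):
            hits.append(doc_id)
    return sorted(hits)

docs = {1: "to be or not to be".split(),
        2: "be quick or be dead".split()}
idx = build_positional_index(docs)
print(phrase_query(idx, "to", "be"))  # [1]
print(phrase_query(idx, "or", "be"))  # [2]
```

Unlike the biword approach, positions let us verify the phrase exactly, and longer phrases or proximity queries (w1 within k tokens of w2) follow the same pattern.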