Web Information Retrieval
• Information need:
- The topic about which the user desires to obtain information, satisfying a conscious
or unconscious need.
- Differentiated from (but expressed as) a query.
• Query:
- What the user communicates to the system in an attempt to express the
information need in words (or another format).
• Relevant information resource:
- Retrieved information that the user perceives as valuable with respect to his/her
information need.
• Collection of resources:
- In the case of text documents, it is referred to as a corpus, but it can refer to a collection of
any sort of unstructured data (text, images, videos, audio, etc.).
- Often the resources themselves are not kept or stored directly in the IR system, but are
instead represented in the system by surrogates or metadata.
IR Model
Structured vs. Unstructured Data
Database Management:
• Focused on structured data stored in relational tables rather than free-form text.
• Focused on efficient processing of well-defined queries in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data (XML) brings it closer to IR.
Artificial Intelligence:
• Focused on the representation of knowledge, reasoning, and intelligent action.
• Formalisms for representing knowledge and queries: – First-order Predicate Logic –
Bayesian Networks
• Recent work on web ontologies and intelligent information agents brings it closer to
IR.
Natural Language Processing:
• Focused on the syntactic, semantic, and pragmatic analysis of natural language text
and discourse.
• Ability to analyze syntax (phrase structure) and semantics could allow retrieval based
on meaning rather than keywords.
• Methods for determining the sense of an ambiguous word based on context (word sense
disambiguation).
• Methods for identifying specific pieces of information in a document (information extraction).
• Methods for answering specific NL questions from document corpora or structured data
Machine Learning:
• Focused on the development of computational systems that improve their performance with
experience.
• Automated classification of examples based on learning concepts from labeled training
examples (supervised learning).
• Automated methods for clustering unlabeled examples into meaningful groups (unsupervised
learning).
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining
• Learning to Rank
What makes WIR specific?
• Much larger than traditional information resources
• Presence of hyperlinks
• Data is semi-structured
• Content evolves significantly over time
• Multiple content types (text, images, and even tables) and applications
• Highly variable document quality
Boolean Model:
– Retrieval based on Boolean algebra
– Binary concept of relevance (yes/no)
• No ranking!
– Queries use Boolean operators (AND, OR, NOT)
• 𝑀 × 𝑁 term-document matrix (𝑀 terms, 𝑁 documents): entry (𝑖, 𝑗) is 1 iff term 𝑖 occurs in document 𝑗
Alternative View using T-D-Matrix
• Single term query:
– Result: row in T-D-Matrix
• Combination:
– Bit operations on rows
• Example:
– coffee AND tea
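The bit operations on rows can be sketched in Python; the toy term-document incidence matrix below (terms and contents) is made up for illustration:

```python
# Toy term-document incidence matrix: each row lists, per document,
# whether the term occurs in it (1) or not (0).
terms = {"coffee": [1, 0, 1, 1],
         "tea":    [1, 1, 0, 1],
         "milk":   [0, 1, 1, 0]}

def answer_and(term1, term2):
    """Evaluate `term1 AND term2` by AND-ing the two rows bitwise."""
    return [a & b for a, b in zip(terms[term1], terms[term2])]

# coffee AND tea -> documents 0 and 3 contain both terms
print(answer_and("coffee", "tea"))  # [1, 0, 0, 1]
```

OR and NOT work the same way, with `|` on two rows and `1 - bit` on a single row.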
Inverted Index
To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The
major steps in this are:
1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing
terms.
4. Index the documents that each term occurs in by creating an inverted index, consisting of a
dictionary and postings.
DOCID - Within a document collection, we assume that each document has a unique serial
number, known as the document identifier (docID).
• BUT operator
– Binary operator
– Defined as: 𝑞1 BUT 𝑞2 = 𝑞1 AND(NOT 𝑞2 )
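Assuming postings lists are sorted lists of docIDs, the BUT operator can be sketched as a set difference (the lists below are illustrative):

```python
def boolean_but(postings1, postings2):
    """q1 BUT q2 = q1 AND (NOT q2): keep docIDs from postings1
    that do not occur in postings2."""
    excluded = set(postings2)
    return [d for d in postings1 if d not in excluded]

print(boolean_but([1, 2, 4, 7], [2, 5, 7]))  # [1, 4]
```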
Phrase queries
Biword indexes
One approach to handling phrases is to consider every pair of consecutive terms in a document
as a phrase. For example, the text Friends, Romans, Countrymen would generate the biwords:
friends romans
romans countrymen
In this model, we treat each of these biwords as a vocabulary term. Being able to process two-
word phrase queries is immediate. Longer phrases can be processed by breaking them down.
The query stanford university palo alto can be broken into the Boolean query on biwords:
“stanford university” AND “university palo” AND “palo alto”
Without examining the documents themselves, we cannot verify that the documents matching
the above Boolean query actually contain the phrase (false positives are possible).
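Biword generation from a token sequence is a one-liner; a sketch using the examples above:

```python
def biwords(tokens):
    """Treat every pair of consecutive tokens as one vocabulary term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']

# A longer phrase query becomes a conjunction of its biwords:
print(biwords(["stanford", "university", "palo", "alto"]))
# ['stanford university', 'university palo', 'palo alto']
```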
Issues for biword indexes
• False positives, as noted before
• Index blowup due to bigger dictionary
– Infeasible for more than biwords, big even for them
• Biword indexes are not the standard solution (for all biwords) but can be part of a
compound strategy
Positional indexes
Here, for each term in the vocabulary, we store postings of the form docID: ⟨position1, position2, . . .⟩,
where each position is a token index in the document. Each posting will also usually record the term
frequency.
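A minimal sketch of a positional index and of answering a two-word phrase query with it (the documents are made up; real systems merge position lists more carefully):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: docID -> list of tokens.
    Returns term -> {docID: [positions]}; len(positions) is the tf."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token][doc_id].append(pos)
    return index

def phrase_query(index, w1, w2):
    """Docs in which w2 occurs at the position right after w1."""
    hits = []
    for doc_id, positions in index[w1].items():
        later = set(index[w2].get(doc_id, []))
        if any(p + 1 in later for p in positions):
            hits.append(doc_id)
    return sorted(hits)

docs = {1: "to be or not to be".split(),
        2: "be quick or be dead".split()}
idx = build_positional_index(docs)
print(phrase_query(idx, "to", "be"))  # [1]
print(phrase_query(idx, "or", "be"))  # [2]
```

Unlike the biword approach, positions let us verify the phrase exactly, and longer phrases or proximity queries (w1 within k tokens of w2) follow the same pattern.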