Web Information Retrieval

The document discusses web information retrieval. It begins by defining information retrieval as finding relevant information resources within a collection to satisfy an information need. It then describes the key elements of an information retrieval system as the information need, relevant resources, and the collection of resources. The document also discusses how web information retrieval differs from traditional information retrieval due to features of the web like hyperlinks and semi-structured data. Finally, it provides an overview of how a basic web search engine works through crawling the web, preprocessing data, indexing documents, and retrieving documents in response to queries.

Uploaded by

Bani
Copyright
© All Rights Reserved
Web Information Retrieval

Data Retrieval – Obtaining data from a database.
• Using a query language (e.g. SQL)
– Data is structured and free of ambiguity.

Information retrieval, as a field of study, is concerned with finding relevant information
resources that satisfy an information need from within a collection of resources.
The elements of an information retrieval system:
– Information need.
– Relevant information resource.
– Collection of resources.

• Information need:
– The topic about which the user desires to obtain information, satisfying a conscious
or unconscious need.
– Is distinct from (but expressed as) a query.
• Query:
– What the user communicates to the computer in an attempt to express the
information need in words (or another format).
• Relevant information resource:
– The retrieved information that the user perceives as valuable with respect to his/her
information need.
• Collection of resources:
– For text documents it is referred to as a corpus, but it can be a collection of
any sort of unstructured data (text, images, videos, audio, etc.).
– Often the resources themselves are not stored directly in the IR system, but are
instead represented in the system by surrogates or metadata.
IR Model
Structured vs. Unstructured Data

Database Management:
• Focused on structured data stored in relational tables rather than free-form text.
• Focused on efficient processing of well-defined queries in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data (XML) brings it closer to IR.

Library and Information Science:
• Focused on the human user aspects of information retrieval (human-computer
interaction, user interface, visualization).
• Concerned with effective categorization of human knowledge.
• Concerned with citation analysis and bibliometrics (structure of information).
• Recent work on digital libraries brings it closer to CS & IR.

Artificial Intelligence:
• Focused on the representation of knowledge, reasoning, and intelligent action.
• Formalisms for representing knowledge and queries:
– First-order Predicate Logic
– Bayesian Networks
• Recent work on web ontologies and intelligent information agents brings it closer to
IR.

Natural Language Processing:
• Focused on the syntactic, semantic, and pragmatic analysis of natural language text
and discourse.
• Ability to analyze syntax (phrase structure) and semantics could allow retrieval based
on meaning rather than keywords.

• Methods for determining the sense of an ambiguous word based on context (word sense
disambiguation).
• Methods for identifying specific pieces of information in a document (information extraction).
• Methods for answering specific NL questions from document corpora or structured data.

Machine Learning:
• Focused on the development of computational systems that improve their performance with
experience.
• Automated classification of examples based on learning concepts from labeled training
examples (supervised learning).
• Automated methods for clustering unlabeled examples into meaningful groups (unsupervised
learning).
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining
• Learning to Rank
1 Evaluation:
What makes WIR specific?
• Larger than traditional information resources
• Presence of hyperlinks
• Data is semi-structured
• Evolving significantly
• Multiple content types (text, images, and even tables) + applications
• Quality of documents varies

Differences between Web Information Retrieval and traditional Information Retrieval

How a web search engine works
• Web corpus collection (Crawling)
• Preprocessing
• Indexing
• Document retrieval
Crawling the web
• Start from an initial page
• Retrieve all linked pages
• Iterate on new pages
• Do not visit the same page twice
• Avoid conflicts and overlap when crawling with parallel machines.
• Crawl important pages (avoid missing important pages)
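
The crawling steps above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the link graph `LINKS` and the function name `crawl` are hypothetical stand-ins, no network access takes place, and the parallel-crawling and page-importance points are not covered.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["a"],
    "e": ["a"],  # never reached when starting from "a"
}


def crawl(seed, fetch_links):
    """Visit every page reachable from `seed` exactly once, breadth-first."""
    visited = {seed}          # pages already queued or fetched
    frontier = deque([seed])  # pages waiting to be fetched
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for target in fetch_links(page):  # retrieve all linked pages
            if target not in visited:     # do not visit the same page twice
                visited.add(target)
                frontier.append(target)
    return order


order = crawl("a", lambda page: LINKS.get(page, []))
print(order)
```

The `visited` set is what enforces the "do not visit the same page twice" rule; a real crawler would typically replace the lambda with an HTTP fetch plus link extraction.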
Indexing
• Is the key to a search engine's efficiency.
– Retrieving relevant results quickly.
• It avoids linearly scanning the texts for each query.
Evaluation of Information Systems
• General measures for software systems
– Completeness, covering all requirements
– Efficient use of resources (runtime, RAM, disk space, bandwidth)
– Usability
• Measures for database systems
– Runtime of indexing
– Runtime of querying
– Max number of parallel users
Boolean Model
The Boolean retrieval model is a model for information retrieval in which we can pose
any query which is in the form of a Boolean expression of terms, that is, in which terms are
combined with the operators AND, OR, and NOT. The model views each document as just a set
of words.

Boolean:
– Retrieval based on boolean algebra
– Binary concept of relevance (yes/no)
• No ranking!
– Queries use boolean operators

• Corpus: $D = \{d_1, d_2, \dots, d_N\}$
• Vocabulary: $V = \{t_1, t_2, \dots, t_M\}$
• Representation of documents:
– Of interest: is a given term present or not?
– Document as a vector in $\{0,1\}^M$

• Term-document matrix of size $M \times N$ (one row per term, one column per document)
Alternative View using T-D-Matrix
• Single term query:
– Result: row in T-D-Matrix
• Combination:
– Bit operations on rows

• Example:
– coffee AND tea
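
As a rough sketch of the term-document-matrix view, each row can be held as a bit vector and Boolean queries answered with bit operations. The four documents below are invented for illustration, and storing a row as a Python integer is an assumption made here for brevity, not something the notes prescribe.

```python
# Toy corpus; the document texts are invented for illustration.
docs = {
    "d1": "coffee and tea are served",
    "d2": "coffee only",
    "d3": "green tea with honey",
    "d4": "tea or coffee",
}
doc_ids = sorted(docs)


def build_td_matrix(docs, doc_ids):
    """Map each term to a row of the matrix; bit i is set if doc_ids[i] contains the term."""
    rows = {}
    for i, doc_id in enumerate(doc_ids):
        for term in set(docs[doc_id].lower().split()):
            rows[term] = rows.get(term, 0) | (1 << i)
    return rows


def decode(bits, doc_ids):
    """Turn a bit vector back into the list of matching document ids."""
    return [doc_id for i, doc_id in enumerate(doc_ids) if (bits >> i) & 1]


rows = build_td_matrix(docs, doc_ids)
all_docs = (1 << len(doc_ids)) - 1            # mask with one bit per document

coffee_and_tea = rows["coffee"] & rows["tea"]  # AND: bitwise and of two rows
coffee_or_tea = rows["coffee"] | rows["tea"]   # OR: bitwise or
not_tea = all_docs & ~rows["tea"]              # NOT: complement within the corpus

print(decode(coffee_and_tea, doc_ids))
```

A single-term query is just one row lookup; every combination reduces to bit operations on rows, which is why the result is a yes/no set with no ranking.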

Inverted Index
To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The
major steps in this are:
1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing
terms.
4. Index the documents that each term occurs in by creating an inverted index, consisting of a
dictionary and postings.

docID – Within a document collection, we assume that each document has a unique serial
number, known as the document identifier (docID).

• Data structure consisting of
– Lookup terms (row vectors)
• Search tree
– Posting list of non-zero entries in vector
• Linked list of postings
– Posting: reference to a document
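
A minimal sketch of such an index follows. It simplifies the structure described above: a Python dict stands in for the search tree and a sorted list for the linked list of postings. The docIDs and texts are invented for illustration.

```python
from collections import defaultdict

# Toy collection keyed by docID; the texts are invented for illustration.
docs = {
    1: "coffee and tea",
    2: "coffee only",
    3: "green tea",
    4: "tea or coffee",
}


def build_inverted_index(docs):
    """Return {term: sorted list of docIDs}, i.e. a dictionary plus postings."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization
            postings[term].add(doc_id)     # each posting is a reference to a document
    # Keep terms in sorted order and every posting list in ascending docID order.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(docs)
print(index["coffee"])
print(index["tea"])
```

Each posting list holds only the non-zero entries of the corresponding term row, so the index stays small even when the term-document matrix is very sparse.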


Inverted index construction

Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
• Deal with “John’s”, “a state-of-the-art solution”
• Normalization
– Map text and query terms to the same form
• You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
• authorize, authorization
• Stop words
– We may omit very common words (or not)
• the, a, to, of
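
The four stages can be sketched as one small pipeline. Everything here is a deliberately crude assumption for illustration: the suffix list is a toy stand-in for a real stemmer, the stop list contains only the words named above, and the function names are hypothetical.

```python
STOP_WORDS = {"the", "a", "to", "of"}  # the examples listed above
SUFFIXES = ("ization", "ize", "s")     # toy suffix list, not a real stemmer


def normalize(token):
    """Lowercase, drop periods (U.S.A. -> usa) and strip a possessive 's."""
    token = token.lower().replace(".", "")
    if token.endswith("'s"):
        token = token[:-2]
    return token


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize on whitespace, normalize, drop stop words, then stem."""
    terms = []
    for raw in text.split():
        token = normalize(raw.strip(",;:!?\"()"))
        if token and token not in STOP_WORDS:
            terms.append(stem(token))
    return terms


print(preprocess("John's U.S.A. trip to authorize the authorization of a state-of-the-art USA"))
```

The same `preprocess` function would have to be applied to queries as well, otherwise query terms and index terms would not map to the same form.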
Implementing Boolean Retrieval

• Search for $q = q_1 \text{ AND } q_2$
• Intersect the result lists (posting lists)
• Example: query "coffee AND tea"

• BUT operator
– Binary operator
– Defined as: $q_1 \text{ BUT } q_2 = q_1 \text{ AND } (\text{NOT } q_2)$
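
One common way to intersect two posting lists is a linear merge, sketched below under the assumption that both lists are sorted by docID. The posting lists for "coffee" and "tea" are invented for illustration, and `but` is a hypothetical helper name for the BUT operator defined above.

```python
def intersect(p1, p2):
    """q1 AND q2: walk both sorted posting lists in step, keeping common docIDs."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def but(p1, p2):
    """q1 BUT q2 = q1 AND (NOT q2): docIDs in p1 that do not occur in p2."""
    i = j = 0
    result = []
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            result.append(p1[i])  # p1[i] cannot appear later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                # present in both, so excluded
            j += 1
        else:
            j += 1
    return result


coffee = [1, 2, 4]  # invented posting lists
tea = [1, 3, 4]
print(intersect(coffee, tea))
print(but(coffee, tea))
```

Implementing BUT directly as a merge avoids materializing NOT q2, which would otherwise be a list of almost every document in the corpus.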

Phrase queries

Biword indexes
One approach to handling phrases is to consider every pair of consecutive terms in a document
as a phrase. For example, the text Friends, Romans, Countrymen would generate the biwords:
friends romans
romans countrymen
In this model, we treat each of these biwords as a vocabulary term. Being able to process two-
word phrase queries is immediate. Longer phrases can be processed by breaking them down.
The query stanford university palo alto can be broken into the Boolean query on biwords:
“stanford university” AND “university palo” AND “palo alto”
Without the docs, we cannot verify that the docs matching the above Boolean query do contain
the phrase.
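
The following sketch shows a biword index and the false-positive problem just described. The documents are invented for illustration, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def biwords(text):
    """Every pair of consecutive terms, e.g. 'friends romans'."""
    t = terms(text)
    return [f"{a} {b}" for a, b in zip(t, t[1:])]


def build_biword_index(docs):
    """Treat each biword as a vocabulary term mapped to the set of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """AND together the postings of all biwords in the phrase."""
    sets = [index.get(biword, set()) for biword in biwords(phrase)]
    return set.intersection(*sets) if sets else set()


docs = {
    1: "Stanford University Palo Alto campus",
    2: "Stanford University: the university Palo Alto shuttle",
    3: "Friends, Romans, Countrymen",
}
index = build_biword_index(docs)
print(biwords(docs[3]))
print(phrase_candidates(index, "stanford university palo alto"))
```

Document 2 contains all three biwords but not the four-word phrase itself, so it is returned as a false positive; only a check against the document text (or a positional index) can rule it out.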
Issues for biword indexes
• False positives, as noted before
• Index blowup due to bigger dictionary
– Infeasible for more than biwords, big even for them
• Biword indexes are not the standard solution (for all biwords) but can be part of a
compound strategy

Positional indexes
Here, for each term in the vocabulary, we store postings of the form docID: ⟨position1, position2, …⟩,
where each position is a token index in the document. Each posting will also usually record the term
frequency.
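
A positional index and a phrase lookup over it can be sketched as follows. This is a simplified illustration: the documents are invented, the nested-dict layout and function names are assumptions, and the term frequency is read off as the length of the position list rather than stored separately.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Return {term: {docID: [position, ...]}} with token positions starting at 0."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return docIDs in which the phrase words occur at consecutive positions."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word k of the phrase must sit exactly k positions after the first word.
            if all(
                start + k in index[word].get(doc_id, [])
                for k, word in enumerate(words[1:], 1)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    1: "stanford university palo alto campus",
    2: "stanford university the university palo alto shuttle",
}
index = build_positional_index(docs)
print(index["university"])
print(phrase_query(index, "stanford university palo alto"))
```

Unlike the biword approach, the position check rejects document 2, because "palo" does not directly follow the first "university" there.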

Vector Space Model


In a collection of N documents we obtain N vectors: each document has a vector, and each vector is of
length |V| (the cardinality of V). V is the dictionary, so the length of the vectors corresponds to the
number of words in the dictionary.
Terms are the axes of the space.
Documents are points or vectors in this space.
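
As a small sketch of this representation, each document below is turned into a vector with one component per dictionary term. Using raw term counts as the component values is an assumption made here for illustration; the notes do not say which weighting is used. The two documents are invented.

```python
from collections import Counter

# Toy collection; the texts are invented for illustration.
docs = {
    "d1": "coffee tea coffee",
    "d2": "tea honey",
}

# The dictionary V: every distinct term, in a fixed order (one axis per term).
vocabulary = sorted({term for text in docs.values() for term in text.split()})


def to_vector(text, vocabulary):
    """Represent a document as a vector of length |V| (raw term counts assumed)."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]


vectors = {doc_id: to_vector(text, vocabulary) for doc_id, text in docs.items()}
print(vocabulary)
print(vectors)
```

With N documents this yields N vectors of length |V|, most of whose components are zero for any realistic dictionary.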
