Part B
• Data retrieval is used for finding exact matches using stringent queries on structured data, often in a Relational Database Management System (RDBMS).
" IR is used for assessing human interests, i.e., IR selects and ranks documents based on the
likelihood of relevance to the user's needs. DR is different; answers to users' queries are
exact matches which do not impose any ranking.
" Data retrieval involves the selection ofa fixed set of data based on a well-defined query.
Information retrieval involves the retrieval of documents of natural language.
" IR systems do not support transactional updates whereas database systems support
structured data, with schemas that define the data organization. IR systems deal with some
querying issues not generally addressed by database systems and approximate searching by
keywords.
1.3.1 Difference between Data Retrieval and Information Retrieval
Parameters      Data retrieval        Information retrieval
Example         Database query        WWW search
" An information retrieval system is an information system, which is used to store items of
information that need to be processed, searched, retrieved, and disseminated to various user
populations.
TECHNICAL PUBLICATIONS - An up thrust for knowledge
Information Retrieval Techniques (1 - 6) Introduction
" Information retrieval is the process of searching some collection of documents, in order to
identify those documents which deal with a particular subject. Any system that is designed
to facilitate this literature searching may legitimately be called an information retrieval
system.
Conceptually, IR is the study of finding needed information. It helps users to find
information that matches their information needs. Historically, IR is about document
retrieval, emphasizing document as the basic unit.
" Information retrieval locates relevant documents, on the basis of user input such as
keywords or example documents, for example: Find documents containing the words
"database systems".
" Fig. 1.4.1 shows information retrieval system block diagram. It consists of three
components : Input, processor and output.
Feedback
Query
Output
Processor
Input
Document
" The computer-based retrieval systems store only a representation of the document or query
which means that the text of a document is lost once it has been processed for the purpose
of generating its representation.
" The process may involve structuring the information, such as classifying it. It will also
involve performing the actual retrieval function that is executing the search strategy in
response to a query.
• A text document is the output of an information retrieval system. Web search engines are the most familiar example of IR systems.
1.4.1 Process of Information Retrieval
" Information retrieval is often a continuous process during which you will consider,
reconsider and refine your research problem, use various different information resources,
information retrieval techniques and library services and evaluate the information you find.
" Fig. 1.4.2 shows that the stages follow each other during the process, but in reality they are
often active simultaneously and you usually will repeat some stages during the same
information retrieval process.
Fig. 1.4.2 : Stages of the information retrieval process (problem/topic, information retrieval, locating publications, evaluating the results)
Q : A Boolean expression. The terms are index terms and the operators are AND, OR, and NOT.
F : Boolean algebra over sets of terms and sets of documents.
R : A document is predicted as relevant to a query expression if and only if it satisfies the query expression, e.g.
((text ∨ information) ∧ retrieval ∧ ¬ theory)
Each query term specifies a set of documents containing the term :
a. AND (∧) : The intersection of two sets
b. OR (∨) : The union of two sets
c. NOT (¬) : Set inverse, or really set difference.
Boolean relevance example :
((text ∨ information) ∧ retrieval ∧ ¬ theory)
Evaluated over the following document titles :
"Information Retrieval"
"Information Theory"
"Modern Information Retrieval: Theory and Practice"
"Text Compression"
> Implementing the Boolean Model :
• First, consider purely conjunctive queries (t₁ ∧ t₂ ∧ t₃).
• Such a query is only satisfied by a document containing all three terms.
• If D(tᵢ) = {d | tᵢ ∈ d}, then the maximum possible size of the retrieved set is the size of the smallest D(tᵢ).
• |D(tᵢ)| is the length of the inverted list for tᵢ.
" For instance, the query social AND economic will produce the set of documents that are
indexed both with the term social and the term economic, i.e. the intersection of both sets.
" Combining terms with the OR operator will define a document set that is bigger than or
equal to the document sets of any of the single terms.
" So, the query social OR political will produce the set of documents that are indexed with
either the term social or the term political or both, i.e. the union of both sets.
This is visualized in the Venn diagrams of Fig. 2.1.1, in which each set of documents is visualized by a disc.
" The intersections of these discs and their complements divide the document collection into
8 non-overlapping regions, the unions of which give 256 diferent Boolean combinations of
"social, political and economic documents". In Fig. 2.1.1the retrieved sets are visualized by
the shaded areas.
" Term weights are used to compute the degree of similarity between documents and User
They considered the index representations and the query as vectors embedded in a high
dimensional Euclidean space, where each term is assigned a separate dimension. The
similarity measure is usually the cosine of the angle that separates the two vectors d and g
" The cosine of an angle is 0 if the vectors are orthogonal in the multidimensional space and
1 if the angle is 0 degrees. The cosine formula is given by
score (, )
" The metaphor of angles between vectors in a multidimensional space makes it easy to
explain the implications of the model to non-experts. Up to three dimensions, one can
easily visualise the document and query vectors. Fig. 2.1.2 shows a query and document
representation in the vector space model.
Fig. 2.1.2 : Query and document representation in the vector space model (axes : social, political and economic)
" Measuring the cosine of the angle between vectors i equivalent with normalizing the
vectors to unit length and taking the vector inner product. If index representations and
queries are properly normalised, then the vector product measure of equation I does have a
strong theoretical motivation. The formula then becomes:
score (. 4) - ,n4)-n4)
where n(V)
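Both formulations of the cosine score can be sketched in a few lines; the term-weight vectors below are illustrative (one dimension per term, as in the three-axis figure).

```python
import math

def cosine_score(d, q):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(dk * qk for dk, qk in zip(d, q))
    norm_d = math.sqrt(sum(dk * dk for dk in d))
    norm_q = math.sqrt(sum(qk * qk for qk in q))
    return dot / (norm_d * norm_q)

def normalize(v):
    """n(v): scale a vector to unit length."""
    norm = math.sqrt(sum(vk * vk for vk in v))
    return [vk / norm for vk in v]

d = [2.0, 0.0, 1.0]   # document vector (social, political, economic)
q = [1.0, 0.0, 1.0]   # query vector

# The two formulations agree: the cosine equals the inner product
# of the unit-length (normalized) vectors.
s1 = cosine_score(d, q)
s2 = sum(a * b for a, b in zip(normalize(d), normalize(q)))
```

Orthogonal vectors score 0 and identical directions score 1, matching the description above.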
We think of the documents as a collection C of objects and think of the user query as a
specification of a set A of objects. In this scenario, the IR problem can be reduced to the
problem of determining which documents are in the set A and which ones are not (i.e. the IR problem can be viewed as a clustering problem).
1. Intra-cluster : One needs to determine what are the features which better describe the
objects in the set A.
2. Inter-cluster : One needs to determine what are the features which better distinguish
the objects in the set A.
" t,:Inter-clustering similarity is quantified by measuring the raw frequency of aterm
inside a document d, , such tem frequency is usually referred to as the t, factor and
provides one measure of how well that term describes the document contents.
" idf: Inter-clustering similarity is quantified by measuring the inverse of the frequency of a
term k. among the documents in the collection. This frequency is often referred to as the
inverse document frequency
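The two factors combine into the usual tf-idf weight. The sketch below assumes one common idf variant, log(N/nᵢ), and an illustrative three-document corpus; other variants exist.

```python
import math

docs = [
    "information retrieval ranks documents",
    "database systems store structured data",
    "information retrieval and information theory",
]

N = len(docs)
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    """Raw frequency of the term inside one document (the tf factor)."""
    return doc_tokens.count(term)

def idf(term):
    """log(N / n_i), where n_i is the number of documents containing the term."""
    n_i = sum(1 for toks in tokenized if term in toks)
    return math.log(N / n_i) if n_i else 0.0

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

# "information" appears twice in the third document but also occurs in the
# first, so its weight is moderated by the idf factor.
w = tf_idf("information", tokenized[2])
```

A term frequent inside a document but common across the collection gets a smaller weight than a term that is both frequent and rare collection-wide.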
> Advantages:
1. Its term-weighting scheme improves retrieval performance.
2. Its partial matching strategy allows retrieval of documents that approximate the query
conditions.
3. Its cosine ranking formula sorts the documents according to their degree of similarity to
the query.
> Disadvantages :
1. The assumption of mutual independence between index terms.
" Support Vector Machines (SVMs) are a set of supervised learning methods which learn
from the dataset and used for classification. SVM is a classifier derived from statistical
learning theory by Vapnik and Chervonenkis.
" An SVM is a kind of large-margin classifier : it is a vector space based machine learning
method where the goal is to find a decision boundary between two classes that is maximally
far from any point in the training data.
Given a set of training examples, each marked as belonging to one of two classes, an SVM algorithm builds a model that predicts whether a new example falls into one class or the other. Simply speaking, we can think of an SVM model as representing the examples such that the separate classes are divided by a gap that is as wide as possible.
" New examples are then mapped into the same space and classified to belong to the class
based on which side of the gap they fall on.
" SVM are primarily two-class classifiers with the distinct characteristic that they aim to find
the optimal hyperplane such that the expected generalization error is minimized. Instead of
directly minimizing the empirical risk calculated from the training data, SVMs perform
structural risk minimization to achieve good generalization.
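The large-margin idea can be sketched by computing the geometric margin of a candidate separating hyperplane on toy 2-D data. The points and the two hyperplanes below are illustrative; this is not a trained SVM, which would search for the margin-maximizing hyperplane.

```python
import math

# Toy linearly separable data: ((x1, x2), label in {-1, +1}).
points = [((1.0, 1.0), -1), ((2.0, 0.5), -1),
          ((4.0, 4.0), +1), ((5.0, 3.0), +1)]

def geometric_margin(w, b, data):
    """Smallest signed distance y * (w.x + b) / ||w|| over the data;
    positive means the hyperplane separates the two classes."""
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm
               for x, y in data)

# An SVM picks the (w, b) maximizing this margin; here we merely
# compare two candidate hyperplanes.
m1 = geometric_margin((1.0, 1.0), -4.0, points)   # line x1 + x2 = 4
m2 = geometric_margin((1.0, 0.0), -3.0, points)   # line x1 = 3
```

Both candidates separate the data (positive margin), but the first leaves a wider gap, so a large-margin learner would prefer it.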
The empirical risk is the average loss of an estimator for a finite set of data drawn from P. The idea of risk minimization is not only to measure the performance of an estimator by its risk, but to actually search for the estimator that minimizes risk over the distribution P. Because we don't know the distribution P, we instead minimize empirical risk over a training dataset drawn from P. This general learning technique is called empirical risk minimization.
Fig. 3.3.5 shows empirical risk.
Fig. 3.3.5 : Empirical risk (expected risk, confidence and empirical risk plotted from low to high against the complexity of the function set, from small to large)
P(H|E) = P(E|H) · P(H) / P(E)
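As a quick numeric sketch of the rule, with illustrative probabilities (the values below are not from the text), the posterior follows directly:

```python
# Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E), with illustrative numbers.
p_h = 0.3            # prior P(H)
p_e_given_h = 0.8    # likelihood P(E|H)
p_e = 0.5            # evidence P(E)

p_h_given_e = p_e_given_h * p_h / p_e   # 0.8 * 0.3 / 0.5 = 0.48
```

Observing evidence E raises the belief in H from the prior 0.3 to the posterior 0.48.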
• Matrix factorization (MF) models are based on the latent factor model. The MF approach is the most accurate approach to reduce the problem arising from high levels of sparsity in RS databases.
• Matrix factorization is a simple embedding model. Given the feedback matrix A ∈ ℝ^(m × n), where m is the number of users (or queries) and n is the number of items, the model learns :
1. A user embedding matrix U ∈ ℝ^(m × d), where row i is the embedding for user i.
2. An item embedding matrix V ∈ ℝ^(n × d), where row j is the embedding for item j.
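A minimal sketch of learning the two embedding matrices so that the dot products of user and item rows approximate the observed entries of A. The toy matrix, embedding size d, learning rate and epoch count are all illustrative choices, and plain SGD on squared error stands in for whatever optimizer a real system would use.

```python
import random

random.seed(0)

# Toy feedback matrix A (0 = unobserved), m = 3 users, n = 4 items.
A = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
]
m, n, d = 3, 4, 2
lr, epochs = 0.01, 2000

# Random init of user embeddings U (m x d) and item embeddings V (n x d).
U = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(m)]
V = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n)]

def predict(i, j):
    """Dot product of user row i and item row j."""
    return sum(U[i][k] * V[j][k] for k in range(d))

# SGD on squared error over the observed entries only.
for _ in range(epochs):
    for i in range(m):
        for j in range(n):
            if A[i][j] == 0:
                continue
            err = A[i][j] - predict(i, j)
            for k in range(d):
                U[i][k], V[j][k] = (U[i][k] + lr * err * V[j][k],
                                    V[j][k] + lr * err * U[i][k])
```

After training, predict(i, j) also yields scores for the unobserved entries, which is what a recommender uses to rank items for each user.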
5.6.1 Singular Value Decomposition (SVD)
" SVD is a matrix factorization technique that is usually used to reduce the number of
features of a data set by reducing space dimensions from N to K where K < N.
The matrix factorization is done on the user-item ratings matrix.
X      =      U      ×      S      ×      Vᵀ
(m × n)      (m × r)      (r × r)      (r × n)
" The matrix S is a diagonal matrix containing the singular values of the matrix X. There are
exactly r singular values, where r is the rank of X.
The rank of a matrix is the number of linearly independent rows or columns in the matrix. Recall that two vectors are linearly independent if they cannot be written as the sum or scalar multiple of any other vectors in the space.
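The decomposition and the rank-K truncation can be sketched with NumPy; the ratings matrix below is illustrative, and `full_matrices=False` asks for the compact factors.

```python
import numpy as np

# Toy user-item ratings matrix X (3 users x 4 items).
X = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])

# Compact SVD: X = U @ diag(s) @ Vt, with singular values in s
# sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-K approximation (K < N): keep only the K largest singular values.
K = 2
X_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]
```

X_k has the same shape as X but lives in a K-dimensional latent space, which is the dimensionality reduction described above.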
Incremental SVD Algorithm (SVD++)
The idea is borrowed from the Latent Semantic Indexing (LSI) world to handle dynamic
databases.
" LSI is a conceptual indexing technique which uses the SVD to estimate the underlying
latent semantic structure of the word to document association.
Fig. : Content-based filtering architecture (item descriptions are converted into a structured representation; profiles for the users are learned from user feedback; the filtering component matches the represented items against the profiles to produce recommended items)
Fig. (a) : Relevance feedback (the user's information need is expressed as a query against an index of documents; user feedback on the retrieved results produces revised results)
" The rule-based approach applies association rule discovery algorithms to find
association between co-purchased items and then generates item recommendation based
on the strength of the association between items
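One such association can be sketched with the usual support and confidence measures over co-purchase baskets; the baskets and item names below are illustrative.

```python
# Toy transaction baskets of co-purchased items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent):
    """Strength of the rule antecedent -> consequent:
    support of the combined itemset relative to the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} -> {milk}: how strongly does buying bread predict milk?
conf_bread_milk = confidence({"bread"}, {"milk"})
```

Rules whose confidence clears a chosen threshold drive the recommendation: a buyer of the antecedent items is offered the consequent.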
Advantages
1. Scalability : Most models resulting from model-based algorithms are much smaller than
the actual dataset, so that even for very large datasets, the model ends up being small
enough to be used efficiently. This imparts scalability to the overall system.
2. Prediction speed : Model-based systems are also likely to be faster, at least in comparison to memory-based systems, because the time required to query the model is usually much smaller than that required to query the whole dataset.
3. Avoidance of over-fitting : If the dataset over which we build our model is representative enough of real-world data, it is easier to avoid over-fitting with model-based systems.
Disadvantages
1. Inflexibility : Because building a model is often a time- and resource-consuming
process, it is usually more difficult to add data to model-based systems, making them
inflexible.