
1.3 Information versus Data Retrieval


" An information retrieval system is software that has the features and functions required to
manipulate "information" items versus a DBMS that is optimized to handle "structured"
data.
" Information retrieval and Data Retrieval (DR) are often viewed as two mutually exclusive
means to perform different tasks, IR being used for finding relevant documents among a
collection of unstructured/semi-structured documents.

Data retrieval being used for finding exact matches using stringent queries on structured
data, often in a Relational Database Management System (RDBMS).
" IR is used for assessing human interests, i.e., IR selects and ranks documents based on the
likelihood of relevance to the user's needs. DR is different; answers to users' queries are
exact matches which do not impose any ranking.
" Data retrieval involves the selection ofa fixed set of data based on a well-defined query.
Information retrieval involves the retrieval of documents of natural language.
" IR systems do not support transactional updates whereas database systems support
structured data, with schemas that define the data organization. IR systems deal with some
querying issues not generally addressed by database systems and approximate searching by
keywords.
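To make the contrast concrete, here is a minimal Python sketch (the records, documents and scoring are invented for illustration; a real IR system would use an inverted index and proper term weighting):

```python
# Data retrieval: exact match over structured records, like a SQL WHERE clause.
employees = [
    {"id": 1, "dept": "sales"},
    {"id": 2, "dept": "engineering"},
]
exact = [r for r in employees if r["dept"] == "sales"]  # exact, unranked

# Information retrieval: rank free-text documents by a toy relevance score.
docs = {
    "d1": "database systems and query processing",
    "d2": "information retrieval ranks documents by relevance",
}
query = {"information", "retrieval"}
scores = {d: len(query & set(text.split())) for d, text in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # best match first
```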
1.3.1 Difference between Data Retrieval and Information Retrieval
Parameters          | Data retrieval | Information retrieval
--------------------|----------------|---------------------------
Example             | Database query | WWW search
Matching            | Exact match    | Partial match, best match
Inference           | Deduction      | Induction
Model               | Deterministic  | Probabilistic
Query language      | Artificial     | Natural
Query specification | Complete       | Incomplete
Items wanted        | Matching       | Relevant
Error response      | Sensitive      | Insensitive
Classification      | Monothetic     | Polythetic
1.4 The IR System AU : Dec.-16, 17

" An information retrieval system is an information system, which is used to store items of
information that need to be processed, searched, retrieved, and disseminated to various user
populations.

" Information retrieval is the process of searching some collection of documents, in order to
identify those documents which deal with a particular subject. Any system that is designed
to facilitate this literature searching may legitimately be called an information retrieval
system.
Conceptually, IR is the study of finding needed information. It helps users to find
information that matches their information needs. Historically, IR is about document
retrieval, emphasizing document as the basic unit.
" Information retrieval locates relevant documents, on the basis of user input such as
keywords or example documents, for example: Find documents containing the words
"database systems".
" Fig. 1.4.1 shows information retrieval system block diagram. It consists of three
components : Input, processor and output.
[Fig. 1.4.1 : IR block diagram — documents and queries enter as input, the processor executes the search, the output is returned, and feedback refines the query]


a) Input : Stores only a representation of the document or query, which means that the text of a document is lost once it has been processed for the purpose of generating its representation.
b) A document representative could be a list of extracted words considered to be significant.
c) Processor : Performs the actual retrieval function, executing the search strategy in response to a query.
d) Feedback : Improves the subsequent run using the results of a sample retrieval.
e) Output : A set of document numbers. A minimal sketch of this flow follows.
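The sketch below mirrors the input, processor and output stages in a few lines of Python; the function names and the scan over the index are invented for illustration (a real processor would search an inverted index):

```python
def represent(text):
    """Input stage: keep only a bag-of-words representation of the text."""
    return set(text.lower().split())

def process(query, index):
    """Processor stage: execute a simple search strategy over the index."""
    q = represent(query)
    return [doc_id for doc_id, words in index.items() if q & words]

# Output stage: a set of document numbers.
index = {1: represent("Database systems overview"),
         2: represent("Information retrieval basics")}
print(process("information retrieval", index))  # [2]
```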
" Information retrieval locates relevant documents, on the basis of user input such as
keywords or example documents.

" The computer-based retrieval systems store only a representation of the document or query
which means that the text of a document is lost once it has been processed for the purpose
of generating its representation.
" The process may involve structuring the information, such as classifying it. It will also
involve performing the actual retrieval function that is executing the search strategy in
response to a query.
Text document is the output of information retrieval system. Web search engines are the
most familiar example of IR systems.
1.4.1 Process of Information Retrieval
" Information retrieval is often a continuous process during which you will consider,
reconsider and refine your research problem, use various different information resources,
information retrieval techniques and library services and evaluate the information you find.
" Fig. 1.4.2 shows that the stages follow each other during the process, but in reality they are
often active simultaneously and you usually will repeat some stages during the same
information retrieval process.

[Fig. 1.4.2 : Stages of the IR process — a cycle: problem/topic → information retrieval plan → information retrieval → evaluating the results → locating publications → using and evaluating the information]


" The different stages of the information retrieval process are :
1. Problem / Topic : An information need occurs when more information is required to
solve a problem
2. Information retrieval plan : Define your information need and choose your information
resources, retrieval techniques and search terms

3. Information retrieval : Perform your planned information retrieval (information retrieval techniques).
4. Evaluating the results : Evaluate the results of your information retrieval (number and relevance of search results).
5. Locating publications : Find out where and how the required publication, e.g. an article, can be acquired.
6. Using and evaluating the information : Evaluate the final results of the process (critical and ethical evaluation of the information and information resources).
2.1.2 Boolean Model
" The Boolean model is the first model of information retrieval and probably also the most
criticized model. It is based on set theory and Boolean algebra.
" It is based on a binary decision criterion without any notion of a grading scale. Boolean
expressions have precise semantics. It is not simple to translate an information need into a
Boolean expression. It can be represented as a disjunction of conjunction vectors(in
disjunctive normal form-DNF).
D: Set of words (indexing terms) present in a document. Each term is either present (1)
or absent (0).


Q : A Boolean expression. The terms are index terms and the operators are AND, OR, and NOT.
F : Boolean algebra over sets of terms and sets of documents.
R : A document is predicted as relevant to a query expression if and only if it satisfies the query expression, e.g.
((text ∨ information) ∧ retrieval ∧ ¬theory)
Each query term specifies a set of documents containing the term :
a. AND (∧) : The intersection of two sets.
b. OR (∨) : The union of two sets.
c. NOT (¬) : Set inverse, or really set difference.
Boolean relevance example : consider the query
((text ∨ information) ∧ retrieval ∧ ¬theory)
over a collection containing the documents "Information Retrieval", "Information Theory", "Modern Information Retrieval: Theory and Practice" and "Text Compression". Only "Information Retrieval" satisfies the query : the other documents either lack the term retrieval or contain the excluded term theory.
> Implementing the Boolean Model :
• First, consider purely conjunctive queries (t1 ∧ t2 ∧ t3).
• Such a query is satisfied only by a document containing all three terms.
• If D(t) = {d | t ∈ d}, then the maximum possible size of the retrieved set is the size of the smallest D(ti).
• |D(ti)| is the length of the inverted list for ti.
• For instance, the query social AND economic will produce the set of documents that are indexed both with the term social and the term economic, i.e. the intersection of both sets.
• Combining terms with the OR operator will define a document set that is bigger than or equal to the document sets of any of the single terms.
• So, the query social OR political will produce the set of documents that are indexed with either the term social or the term political or both, i.e. the union of both sets.
• This is visualized in the Venn diagrams of Fig. 2.1.1, in which each set of documents is visualized by a disc; a short code sketch of these set operations follows the figure.

" The intersections of these discs and their complements divide the document collection into
8 non-overlapping regions, the unions of which give 256 diferent Boolean combinations of
"social, political and economic documents". In Fig. 2.1.1the retrieved sets are visualized by
the shaded areas.

Social Political Social Political Social Political

Economic Economic Economic

Social and economic Social or political (Social or Political)


and not (Social and economic)
Fig. 2.1.1 : Boolean comblnations of sets visualized as Venn dlagrams
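Because each query term denotes a set of documents, the Boolean model maps directly onto set operations. A minimal sketch (the postings and document IDs are invented for illustration):

```python
# Toy inverted index: each term maps to the set of documents containing it.
postings = {
    "social":    {1, 2, 3},
    "political": {2, 4},
    "economic":  {1, 3, 5},
}

social_and_economic = postings["social"] & postings["economic"]   # {1, 3}
social_or_political = postings["social"] | postings["political"]  # {1, 2, 3, 4}
# (social OR political) AND NOT (social AND economic)
result = social_or_political - social_and_economic                # {2, 4}
```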
> Advantages :
a. Clean formalism.
b. Simplicity.
c. It is very precise in nature. The user gets exactly what is specified.
d. The Boolean model is still widely used in small-scale searches like searching emails or files on local hard drives, or in a mid-sized library.
> Disadvantages :
a. It is not simple to translate an information need into a Boolean expression.
b. Exact matching may lead to retrieval of too few or too many documents.
c. The retrieved documents are not ranked.
d. The model does not use term weights.
2.1.3 Vector Model
" Assign non-binary weights to index terms in queries and in documents. Compute the
similarity between documents and query. The index terms in the query are also weighted

" Term weights are used to compute the degree of similarity between documents and User

query. Then, retrieved documents are sorted in decreasing order.


• The index representations and the query are considered as vectors embedded in a high-dimensional Euclidean space, where each term is assigned a separate dimension. The similarity measure is usually the cosine of the angle that separates the two vectors d and q.
• The cosine of an angle is 0 if the vectors are orthogonal in the multidimensional space and 1 if the angle is 0 degrees. The cosine formula is given by

score(d, q) = (d · q) / (|d| |q|)

" The metaphor of angles between vectors in a multidimensional space makes it easy to
explain the implications of the model to non-experts. Up to three dimensions, one can
easily visualise the document and query vectors. Fig. 2.1.2 shows a query and document
representation in the vector space model.
[Fig. 2.1.2 : Query and document representation in the vector space model — vectors drawn over the axes social, political and economic]
" Measuring the cosine of the angle between vectors i equivalent with normalizing the
vectors to unit length and taking the vector inner product. If index representations and
queries are properly normalised, then the vector product measure of equation I does have a
strong theoretical motivation. The formula then becomes:
score (. 4) - ,n4)-n4)
where n(V)
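A hedged sketch of the cosine computation over sparse term-weight vectors (the weights are invented for illustration):

```python
import math

def cosine(d, q):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

doc = {"social": 0.5, "economic": 0.8}
query = {"social": 1.0}
print(cosine(doc, query))  # ~0.53: a partial match still receives a score
```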

• We think of the documents as a collection C of objects and of the user query as a specification of a set A of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set A and which ones are not (i.e. the IR problem can be viewed as a clustering problem).
1. Intra-cluster : One needs to determine which features better describe the objects in the set A.
2. Inter-cluster : One needs to determine which features better distinguish the objects in the set A from the remaining objects in the collection C.
" t,:Inter-clustering similarity is quantified by measuring the raw frequency of aterm
inside a document d, , such tem frequency is usually referred to as the t, factor and
provides one measure of how well that term describes the document contents.
" idf: Inter-clustering similarity is quantified by measuring the inverse of the frequency of a
term k. among the documents in the collection. This frequency is often referred to as the
inverse document frequency
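A short tf-idf sketch combining the two factors (the toy corpus is invented for illustration; many weighting variants exist):

```python
import math

docs = {
    "d1": "information retrieval retrieval systems",
    "d2": "database systems",
}
N = len(docs)

def tf_idf(term, text):
    words = text.split()
    tf = words.count(term) / len(words)                 # describes the document
    df = sum(term in d.split() for d in docs.values())  # document frequency
    idf = math.log(N / df) if df else 0.0               # discriminates documents
    return tf * idf

print(tf_idf("retrieval", docs["d1"]))  # high: frequent here, rare elsewhere
```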
> Advantages:
1. Its term-weighting scheme improves retrieval performance.
2. Its partial matching strategy allows retrieval of documents that approximate the query
conditions.

3. Its cosine ranking formula sorts the documents according to their degree of similarity to
the query.
> Disadvantages :
1. The assumption of mutual independence between index terms.

3.3.4 SVM Classifier

" Support Vector Machines (SVMs) are a set of supervised learning methods which learn
from the dataset and used for classification. SVM is a classifier derived from statistical
learning theory by Vapnik and Chervonenkis.
" An SVM is a kind of large-margin classifier : it is a vector space based machine learning
method where the goal is to find a decision boundary between two classes that is maximally
far from any point in the training data.
Given a set of training examples, each marked as belonging to one of two classes, an SVM
algorithm builds a model that predicts whether a new example falls into one class or the
other. Simply speaking, we can think of an SVM model as representing the examples of the
separate classes are divided by a gap that is as wide as possible.
" New examples are then mapped into the same space and classified to belong to the class
based on which side of the gap they fall on.
" SVM are primarily two-class classifiers with the distinct characteristic that they aim to find
the optimal hyperplane such that the expected generalization error is minimized. Instead of
directly minimizing the empirical risk calculated from the training data, SVMs perform
structural risk minimization to achieve good generalization.
The empirical risk is the average loss of an estimator for a finite set of data drawn from P.
The idea of risk minimization is not only measure the performance of an estimator by its
risk, but to actually search for the estimator that minimizes risk over distribution P. Because
we don't know distribution P we instead minimize empirical risk over a training dataset
drawn from P. This general learning technique is called enmpirical risk minimization.
Fig. 3.3.5 shows empirical risk.
[Fig. 3.3.5 : Empirical risk — risk (low to high) plotted against the complexity of the function set (small to large) : empirical risk decreases with complexity while the confidence term grows, and expected risk combines the two]



> Key Properties of Support Vector Machines
1. They use a single hyperplane which subdivides the space into two half-spaces, one occupied by Class 1 and the other by Class 2.
2. They maximize the margin of the decision boundary using quadratic optimization techniques which find the optimal hyperplane.
3. They are able to handle large feature spaces.
4. Overfitting can be controlled by the soft margin approach.
5. When used in practice, SVM approaches frequently map the examples to a higher dimensional space and find margin-maximal hyperplanes in the mapped space, obtaining decision boundaries which are not hyperplanes in the original space.
6. The most popular versions of SVMs use non-linear kernel functions and map the attribute space into a higher dimensional space to facilitate finding "good" linear decision boundaries in the modified space.
> Soft Margin SVM
• For the very high dimensional problems common in text classification, the data are sometimes linearly separable. But in the general case they are not, and even if they are, we might prefer a solution that better separates the bulk of the data while ignoring a few weird noise documents.
• What if the training set is not linearly separable ? Slack variables can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
• A soft margin allows a few variables to cross into the margin or over the hyperplane, allowing misclassification.
• We penalize the crossover by looking at the number and distance of the misclassifications. This is a trade-off between the hyperplane violations and the margin size. The slack variables are bounded by some set cost. The farther they are from the soft margin, the less influence they have on the prediction.
• All observations have an associated slack variable :
1. Slack variable = 0 : the point lies on or outside the margin (correctly classified).
2. Slack variable > 0 : the point lies inside the margin or on the wrong side of the hyperplane.
3. C is the trade-off between the slack variable penalty and the margin; a sketch using this parameter follows.
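A minimal soft-margin SVM sketch for text classification, assuming scikit-learn is available (the corpus and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["cheap pills offer", "meeting agenda attached",
         "win money now", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# C trades the slack-variable penalty off against the margin size:
# small C = softer margin, large C = closer to a hard margin.
clf = LinearSVC(C=1.0).fit(X, labels)
print(clf.predict(vec.transform(["win cheap money"])))  # likely ['spam']
```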
> Limitations of SVM
1. It is sensitive to noise.
2. The biggest limitation of SVM lies in the choice of the kernel.
3. Another limitation is speed and size.
4. The optimal design for multiclass SVM classifiers is also a research area.
What is the Naive Bayes Algorithm ?

P(H|E) = (P(E|H) * P(H)) / P(E)

Here P(E|H) is the likelihood of the evidence given that the hypothesis is true, P(H) is the prior probability of the hypothesis, P(E) is the prior probability of the evidence, and P(H|E) is the probability of the hypothesis given that the evidence is true.

It is an algorithm that learns the probability of every object, its features, and which groups they belong to. It is also known as a probabilistic classifier. The Naive Bayes algorithm comes under supervised learning and is mainly used to solve classification problems.
Probability
Probability helps to predict an event's occurrence out of all the potential outcomes. The mathematical equation for probability is as follows :

Probability of an event = Number of favorable events / Total number of outcomes

0 <= probability of an event <= 1. The favorable outcome denotes the event that results from the probability. Probability is always between 0 and 1, where 0 means there is no chance of the event happening and 1 means the event is certain to happen.

For better understanding, you can also consider a case where you predict a fruit based on its color and texture. Here are some possible assumptions that you can make. You can either choose the correct fruit that you have in mind or get confused with similar fruits and make mistakes. Either way, the probability of choosing the right fruit is 50 %.
Bayes Theory
Bayes theory works on coming to a hypothesis (H) from a given set of evidence (E). It relates two things : the probability of the hypothesis before the evidence, P(H), and the probability after the evidence, P(H|E). Bayes theory is explained by the following equation :

P(H|E) = (P(E|H) * P(H)) / P(E)

In the above equation :
• P(H|E) denotes how often event H happens when event E takes place.
• P(E|H) represents how often event E happens when event H takes place first.
• P(H) represents the probability of event H happening on its own.
• P(E) represents the probability of event E happening on its own.

The Bayes rule is a method for determining P(H|E) from P(E|H). In short, it provides a way of calculating the probability of a hypothesis from the provided evidence.
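As a hedged sketch, the classifier below applies Bayes' rule naively (assuming word independence) with Laplace smoothing; the training data is invented for illustration:

```python
import math
from collections import Counter, defaultdict

train = [("buy cheap pills", "spam"), ("lunch meeting today", "ham"),
         ("cheap money offer", "spam"), ("project meeting notes", "ham")]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in train:
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    scores = {}
    for c in class_counts:
        # log prior P(H) plus a log likelihood P(E|H) per word,
        # with add-one (Laplace) smoothing for unseen words.
        score = math.log(class_counts[c] / len(train))
        total = sum(word_counts[c].values())
        for w in text.split():
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("cheap meeting"))  # the class with the highest posterior
```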
5.6 Matrix Factorization Models

• Matrix factorization (MF) models are based on the latent factor model. The MF approach is the most accurate approach for reducing the problems caused by the high levels of sparsity in an RS database.
• Matrix factorization is a simple embedding model. Given the feedback matrix A ∈ R^(m×n), where m is the number of users (or queries) and n is the number of items, the model learns :
1. A user embedding matrix U ∈ R^(m×d), where row i is the embedding for user i.
2. An item embedding matrix V ∈ R^(n×d), where row j is the embedding for item j.
5.6.1 Singular Value Decomposition (SVD)
" SVD is a matrix factorization technique that is usually used to reduce the number of
features of a data set by reducing space dimensions from N to K where K < N.
The matrix factorization is done on the user-item ratings matrix.
X
T12 T|n S11 0.
T21 T22

:
Uml Unr Urn
Tml Tmn
Tn X r r X
mX Xr


" The matrix S is a diagonal matrix containing the singular values of the matrix X. There are
exactly r singular values, where r is the rank of X.
The rank of a matrix is the number of linearly independent rows or columns in the matrix.
Recall that two vectors are linearly independent if they can not be written as the sum or
scalar multiple of any other vectors in the space.
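A hedged sketch of this factorization with NumPy on a toy ratings matrix (ratings invented for illustration; 0 marks an unrated item):

```python
import numpy as np

X = np.array([[5, 3, 0],
              [4, 0, 0],
              [1, 1, 5]], dtype=float)

# Full SVD of the ratings matrix: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values (a rank-k approximation),
# the usual basis for SVD-based rating prediction.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(X_k, 2))
```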
Incremental SVD Algorithm (SVD++)
• The idea is borrowed from the Latent Semantic Indexing (LSI) world to handle dynamic databases.
• LSI is a conceptual indexing technique which uses the SVD to estimate the underlying latent semantic structure of the word-to-document association.
• Projection of additional users provides a good approximation to the complete model.
• SVD-based recommender systems have the following limitations :
a. SVD cannot be applied on sparse data.
b. SVD does not have regularization.
A content-based recommender produces recommendations by matching the user profile with each item.
5.4.1 High Level Architecture of Content-based Recommender Systems
• Fig. 5.4.1 shows the high level architecture of content-based recommender systems.
[Fig. 5.4.1 : High level architecture of content-based recommender systems — the content analyzer turns item descriptions from the information source into structured represented items; the profile learner builds a profile for each user U from training examples and feedback; the filtering component matches profiles against represented items to recommend items to the active user Ua]



> 1. Content Analyzer
• Extracts features (keywords, n-grams) from the source.
• Converts unstructured items to structured items.
• The data is stored in the Represented Items repository.
> 2. Profile Learner
• Builds the user profile.
• Updates the profile using the data in the Feedback repository.
> 3. Filtering Component
• Matches the user profile against the actual items to be recommended (a short sketch of this matching follows).
• Uses different strategies.
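A minimal sketch of the filtering component: score each represented item against the user profile and rank (the features and weights are invented for illustration):

```python
profile = {"action": 0.9, "comedy": 0.1}           # learned user profile

represented_items = {                               # output of the content analyzer
    "movie_a": {"action": 1.0, "drama": 0.5},
    "movie_b": {"comedy": 1.0},
}

def score(features, profile):
    """Weighted overlap between item features and the user profile."""
    return sum(w * profile.get(f, 0.0) for f, w in features.items())

ranked = sorted(represented_items,
                key=lambda i: score(represented_items[i], profile),
                reverse=True)
print(ranked)  # ['movie_a', 'movie_b']
```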
5.4.2 Relevance Feedback
• Users have no detailed knowledge of the collection makeup and the retrieval environment. Most users often need to reformulate their queries to obtain the results of their interest.
• Thus, the first query formulation should be treated as an initial attempt to retrieve relevant information. Documents initially retrieved could be analyzed for relevance and used to improve the initial query.
• Fig. 5.4.2 shows relevance feedback on the initial query.

[Fig. 5.4.2 : (a) Relevance feedback — the user's information need becomes a query run against the document index, and feedback on the results produces revised results; (b) relevance feedback on the initial query — known relevant and known non-relevant documents are combined into a revised query]
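The classic way to build such a revised query is the Rocchio formula, not named in the text above but shown here as a hedged sketch (the vectors and the alpha/beta/gamma weights are invented for illustration):

```python
import numpy as np

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Revise a query vector using judged relevant/non-relevant documents."""
    q = alpha * query
    if relevant:
        q += beta * np.mean(relevant, axis=0)       # pull toward relevant docs
    if non_relevant:
        q -= gamma * np.mean(non_relevant, axis=0)  # push away from non-relevant
    return np.maximum(q, 0.0)                       # keep term weights non-negative

query = np.array([1.0, 0.0, 0.0])
relevant = [np.array([0.8, 0.6, 0.0])]
non_relevant = [np.array([0.0, 0.0, 1.0])]
print(rocchio(query, relevant, non_relevant))  # revised query vector
```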
Item-based CF is successfully applied in large scale recommender systems (e.g., by Amazon.com).
5.5.2 Collaborative Filtering Algorithms
> 1. Memory-based algorithms :
• Operate over the entire user-item database to make predictions.
• Statistical techniques are employed to find the neighbors of the active user, whose preferences are then combined to produce a prediction.
• Memory-based algorithms utilize the entire user-item database to generate a prediction. These systems employ statistical techniques to find a set of users, known as neighbors, that have a history of agreeing with the target user.
• Once a neighborhood of users is formed, these systems use different algorithms to combine the preferences of neighbors to produce a prediction or top-N recommendation for the active user. These techniques, also known as nearest-neighbor or user-based collaborative filtering, are more popular and widely used in practice (a sketch follows the disadvantages list below).
• Dynamic structure.
Advantages
1. The quality of predictions is rather good.
2. This is a relatively simple algorithm to implement for any situation.
3. It is very easy to update the database, since it uses the entire database every time it makes a prediction.
Disadvantages
1. It uses the entire database every time it makes a prediction, so the database needs to be in memory; this is very, very slow.
2. Even when in memory, it uses the entire database every time it makes a prediction, so it is very slow.
3. It can sometimes not make a prediction for certain active users/items. This can occur if the active user has no items in common with all the people who have rated the target item.
4. It overfits the data. It takes all random variability in people's ratings as causation, which can be a real problem. In other words, memory-based algorithms do not generalize the data at all.
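A minimal user-based (memory-based) CF sketch: find neighbors by cosine similarity over co-rated items and take a similarity-weighted average (the ratings are invented for illustration):

```python
import math

ratings = {
    "alice": {"item1": 5, "item2": 3},
    "bob":   {"item1": 4, "item2": 2, "item3": 4},
    "carol": {"item1": 1, "item3": 5},
}

def sim(u, v):
    """Cosine similarity between two users over their co-rated items."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(r * r for r in ratings[u].values()))
    nv = math.sqrt(sum(r * r for r in ratings[v].values()))
    return dot / (nu * nv)

def predict(user, item):
    """Similarity-weighted average of neighbors' ratings for the item."""
    pairs = [(sim(user, v), ratings[v][item])
             for v in ratings if v != user and item in ratings[v]]
    den = sum(s for s, _ in pairs)
    return sum(s * r for s, r in pairs) / den if den else None

print(predict("alice", "item3"))
```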

> 2. Model-based algorithms :
• Input the user database to estimate or learn a model of user ratings, then run new data through the model to get a predicted output.
• A prediction is computed through the expected value of a user rating, given his/her ratings on other items.
• Static structure. In dynamic domains the model could soon become inaccurate.
• Model-based collaborative filtering algorithms provide item recommendations by first developing a model of user ratings. Algorithms in this category take a probabilistic approach and envision the collaborative filtering process as computing the expected value of a user prediction, given his/her ratings on other items.
• The model building process is performed by different machine learning algorithms such as Bayesian networks, clustering and rule-based approaches. The Bayesian network model formulates a probabilistic model for the collaborative filtering problem.
• The clustering model treats collaborative filtering as a classification problem and works by clustering similar users into the same class, estimating the probability that a particular user is in a particular class C, and from there computing the conditional probability of ratings (see the sketch below).
• The rule-based approach applies association rule discovery algorithms to find associations between co-purchased items, and then generates item recommendations based on the strength of the association between items.
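As a hedged illustration of the clustering variant, the sketch below groups users with k-means (assuming scikit-learn is available) and predicts a rating from the cluster's mean; the ratings matrix is invented for this example:

```python
import numpy as np
from sklearn.cluster import KMeans

R = np.array([[5, 4, 0, 1],     # rows = users, columns = items, 0 = unrated
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# Model building: cluster users by their rating vectors.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(R)

# Prediction: expected rating of item 2 for user 0, taken from the
# observed ratings of that item within the user's cluster.
user, item = 0, 2
members = R[model.labels_ == model.labels_[user]]
rated = members[:, item] > 0
print(members[rated, item].mean() if rated.any() else 0.0)
```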

Advantages
1. Scalability : Most models resulting from model-based algorithms are much smaller than the actual dataset, so that even for very large datasets the model ends up being small enough to be used efficiently. This imparts scalability to the overall system.
2. Prediction speed : Model-based systems are also likely to be faster, at least in comparison to memory-based systems, because the time required to query the model is usually much smaller than that required to query the whole dataset.
3. Avoidance of overfitting : If the dataset over which we build our model is representative enough of real-world data, it is easier to avoid overfitting with model-based systems.
Disadvantages
1. Inflexibility : Because building a model is often a time- and resource-consuming process, it is usually more difficult to add data to model-based systems, making them inflexible.
