
Information Retrieval

Chapter 2
Retrieval Model
• A retrieval model specifies the details of:
– Document representation
– Query representation
– Retrieval function

• Determines a notion of relevance.

• The notion of relevance can be binary or continuous (i.e., ranked retrieval).
Classes of Retrieval Models
• Boolean models (set theoretic)

• Vector space models (statistical/algebraic)


Boolean Model
• A document is represented as a set of keywords.

• Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope.

• Output: a document is either relevant or not. No partial matches or ranking.
Boolean Retrieval Model
• Popular retrieval model because:
– Easy to understand for simple queries.
– Clean formalism.

• Boolean models can be extended to include ranking.

• Reasonably efficient implementations are possible for normal queries.
Boolean Models – Problems
• Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
• Difficult to control the number of documents retrieved.
– All matched documents will be returned.
• Difficult to rank output.
– All matched documents logically satisfy the query.
• Difficult to perform relevance feedback.
– If a document is identified by the user as relevant or irrelevant, how should the query be modified?
Consider a small document collection:

Document ID | Document Content
D1 | Artificial intelligence is transforming industries.
D2 | Machine learning and artificial intelligence are related.
D3 | Data science uses machine learning techniques.
D4 | Artificial intelligence is used in data science.

Example Boolean queries:
• "artificial AND intelligence"
• "machine OR data"
• "artificial AND NOT science"
Statistical Models
• A document is typically represented by a bag of words (unordered words with frequencies).

• Bag = a set that allows multiple occurrences of the same element.
Statistical Retrieval
• Retrieval is based on similarity between the query and documents.
• Output documents are ranked according to their similarity to the query.
• Similarity is based on occurrence frequencies of keywords in the query and document.
• Automatic relevance feedback can be supported (see the sketch below):
– Relevant documents are “added” to the query.
– Irrelevant documents are “subtracted” from the query.
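The add/subtract idea in the last bullet is the essence of the Rocchio relevance-feedback formula. A minimal sketch, assuming tf-idf style document vectors; the alpha/beta/gamma values are common illustrative defaults, not values from the slides:

```python
import numpy as np

# Rocchio-style feedback: move the query vector toward the centroid of
# relevant documents and away from the centroid of irrelevant ones.
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant) > 0:
        q = q + beta * np.mean(relevant, axis=0)
    if len(irrelevant) > 0:
        q = q - gamma * np.mean(irrelevant, axis=0)
    return q

# Toy usage: a 3-term query vector and one document of each kind.
print(rocchio([1.0, 0.0, 0.5], relevant=[[0.8, 0.2, 0.4]], irrelevant=[[0.0, 0.9, 0.1]]))
```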
The vector space model

• Documents and queries are assumed to be part of an n-dimensional vector space, where n is the number of index terms.

• A document Di is represented by a vector of index-term weights:
Di = (di1, di2, …, din), where dij is the weight of the jth term.

• A query Q is represented by a vector of n weights:
Q = (q1, q2, …, qn), where qj is the weight of the jth term in the query.

• A document collection containing N documents can be represented as an N × n matrix of term weights (a small sketch follows).
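As a concrete picture of that N × n matrix, here is a small sketch using the three documents from the tf-idf example later in the chapter; raw term counts stand in for the weights purely for illustration:

```python
import numpy as np

# N = 3 documents, n = index terms: each row is a document vector Di,
# each entry dij a term weight (raw counts here, not tf-idf).
documents = [
    "the game of life is a game of everlasting learning",
    "the unexamined life is not worth living",
    "never stop learning",
]
terms = sorted({w for doc in documents for w in doc.split()})

matrix = np.array([[doc.split().count(t) for t in terms] for doc in documents])
print(matrix.shape)  # (3, 14): one row per document, one column per index term
```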
Issues for the Vector Space Model
• How to determine important words in a document?
– Word sense?
– Word n-grams (and phrases, idioms, …) as terms?
• How to determine the degree of importance of a term within a document and within the entire collection?
• How to determine the degree of similarity between a document and the query?
• In the case of the web, what is the collection and what are the effects of links, formatting information, etc.?
Boolean and Vector-Space Retrieval Models
• Boolean Retrieval (BR) and the Vector Space Model (VSM) are very popular methods in information retrieval for creating an inverted index and querying terms.

• The BR method returns the exact matches of a textual query, without ranking the results.

• The VSM method both searches and ranks the results.
Term Weighting
• tf-idf stands for Term frequency-inverse document
frequency.
• The tf-idf weight is a weight often used in information
retrieval and text mining.
• Variations of the tf-idf weighting scheme are often used by
search engines in scoring and ranking a document’s
relevance given a query.
• This weight is a statistical measure used to evaluate how
important a word is to a document in a collection or corpus.
• The importance increases proportionally to the number of
times a word appears in the document but is offset by the
frequency of the word in the corpus (data-set).
Term Frequency (TF)
• A major reason for Google's success is its PageRank algorithm.

• PageRank determines how trustworthy and reputable a given website is.

• But there is also another part: the input query entered by the user should be used to match the relevant documents and score them.
Consider the three documents:

Document 1: The game of life is a game of everlasting learning

Document 2: The unexamined life is not worth living

Document 3: Never stop learning

Let us imagine that you are doing a search on these documents with the following query: life learning

The query is a free text query: a query in which the terms are typed freeform into the search interface, without any connecting search operators.
Step 1: Term Frequency (TF)
• Term Frequency, also known as TF, measures the number of times a term (word) occurs in a document.

• Given in the next slide are the terms and their frequencies in each of the documents.

• In reality, each document will be of a different size.

• In a large document, the frequency of the terms will be much higher than in a smaller one.

• Hence we need to normalize the document based on its size.

• A simple trick is to divide the term frequency by the total number of terms.

• For example, in Document 1 the term game occurs two times and the total number of terms in the document is 10. Hence the normalized term frequency is 2 / 10 = 0.2.

• Given below are the normalized term frequencies for all the documents.
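A minimal sketch of this normalization (the function and variable names are ours, not a library API):

```python
# Step 1 sketch: normalized term frequency = count / document length.
docs = {
    "Document1": "The game of life is a game of everlasting learning",
    "Document2": "The unexamined life is not worth living",
    "Document3": "Never stop learning",
}

def normalized_tf(text):
    tokens = text.lower().split()
    return {term: tokens.count(term) / len(tokens) for term in set(tokens)}

tf = {name: normalized_tf(text) for name, text in docs.items()}
print(tf["Document1"]["game"])  # 2 / 10 = 0.2
print(tf["Document1"]["life"])  # 1 / 10 = 0.1
```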
Step 2: Inverse Document Frequency (IDF)
• The main purpose of doing a search is to find relevant documents matching the query.
• In the first step, all terms are considered equally important. In fact, certain terms that occur too frequently have little power in determining relevance.
• We need a way to weigh down the effects of too frequently occurring terms.
• Also, the terms that occur less often can be more relevant.
• We need a way to weigh up the effects of less frequently occurring terms. The logarithm helps to achieve this.
Computing IDF for the term game:

IDF(game) = 1 + log_e(Total Number Of Documents / Number Of Documents with term game in it)

There are 3 documents in all: Document1, Document2, Document3.

The term game appears only in Document1, so:

IDF(game) = 1 + log_e(3 / 1) = 1 + 1.098612289 = 2.098612289
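The same computation in a short sketch, using the slide's definition IDF(t) = 1 + log_e(N / df(t)):

```python
import math

# Step 2 sketch: IDF(t) = 1 + ln(N / df), where df is the number of
# documents containing the term.
doc_token_sets = [
    set("the game of life is a game of everlasting learning".split()),
    set("the unexamined life is not worth living".split()),
    set("never stop learning".split()),
]

def idf(term):
    df = sum(1 for tokens in doc_token_sets if term in tokens)
    return 1 + math.log(len(doc_token_sets) / df)

print(idf("game"))  # 1 + ln(3/1) ≈ 2.098612
print(idf("life"))  # 1 + ln(3/2) ≈ 1.405465
```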
IDF for terms occurring in multiple documents

Since the terms the, life, is, and learning each occur in 2 out of the 3 documents, they have a lower score compared to the terms that appear in only one document.
Step 3: TF * IDF
• Remember we are trying to find the relevant documents for the query: life learning

• For each term in the query, multiply its normalized term frequency by its IDF in each document.

• In Document1, the normalized term frequency of life is 0.1 and its IDF is 1.405465108.

• Multiplying them together we get 0.140546511 (0.1 * 1.405465108).

• Given in the next slide are the TF * IDF calculations for life and learning in all the documents.
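Combining the two earlier snippets gives the Step 3 weights; this sketch reuses the `tf` dictionary, the `idf` function, and the `docs` collection defined in the TF and IDF sketches above:

```python
# Step 3 sketch: tf-idf weight of each query term in each document.
query_terms = ["life", "learning"]
for name in docs:
    weights = {t: tf[name].get(t, 0.0) * idf(t) for t in query_terms}
    print(name, weights)
# Document1: life ≈ 0.1405, learning ≈ 0.1405
# Document2: life ≈ 0.2008 (1/7 * 1.405465), learning = 0.0
# Document3: life = 0.0, learning ≈ 0.4685 (1/3 * 1.405465)
```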
Step 4: Vector Space Model – Cosine Similarity
• The query entered by the user can also be represented as a vector.
• Calculate the TF * IDF for the query.
• Now calculate the cosine similarity of the query and Document1:

Cosine Similarity(Query, Document1) = Dot product(Query, Document1) / (||Query|| * ||Document1||)

Dot product(Query, Document1) = (0.702732554 * 0.140546511) + (0.702732554 * 0.140546511) = 0.197533217

||Query|| = sqrt(0.702732554^2 + 0.702732554^2) = 0.993813908

||Document1|| = sqrt(0.140546511^2 + 0.140546511^2) = 0.198762782

Cosine Similarity(Query, Document1) = 0.197533217 / (0.993813908 * 0.198762782) = 0.197533217 / 0.197533217 = 1
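A self-contained sketch of this final step, with the weights from above plugged in as literals:

```python
import math

# Step 4 sketch: cosine similarity = dot(u, v) / (||u|| * ||v||).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

query = [0.702732554, 0.702732554]      # tf-idf of (life, learning) in the query
document1 = [0.140546511, 0.140546511]  # tf-idf of (life, learning) in Document1
print(cosine(query, document1))  # ≈ 1.0: the two vectors point in the same direction
```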
The similarity scores for all the documents and the query

• Document1 has the highest score of 1.

• This is not surprising, as it contains both the terms life and learning.
Preprocessing
• Preprocessing is an important task and a critical step in text mining, Natural Language Processing (NLP), and information retrieval (IR).
• Before information is retrieved from the documents, preprocessing techniques are applied to the target data set to reduce its size, which increases the effectiveness of the IR system.
• Common preprocessing methods for text documents include tokenization, stop word removal, and stemming.
Tokenization
• Tokenization is a simple process that takes raw data and
converts it into a useful data string.
• Tokenization is used in natural language processing to split
paragraphs and sentences into smaller units that can be
more easily assigned meaning.
• The first step of the NLP process is gathering the data (a
sentence) and breaking it into understandable parts
(words).
• An example of a string of data:
• “What restaurants are nearby?”
Tokenization
• In order for this sentence to be understood by a machine,
tokenization is performed on the string to break it into individual
parts. With tokenization, we’d get something like this:
• ‘what’ ‘restaurants’ ‘are’ ‘nearby’
• This may seem simple, but breaking a sentence into its parts
allows a machine to understand the parts as well as the whole.
• This will help the program understand each of the words by
themselves, as well as how they function in the larger text.
• This is especially important for larger amounts of text as it
allows the machine to count the frequencies of certain words as
well as where they frequently appear.
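A minimal sketch of that tokenization step (a production system would use a proper library tokenizer such as NLTK's `word_tokenize`; lower-casing and splitting is enough for this example):

```python
# Tokenization sketch: lower-case the sentence, strip the trailing
# punctuation, and split on whitespace.
sentence = "What restaurants are nearby?"
tokens = sentence.lower().rstrip("?").split()
print(tokens)  # ['what', 'restaurants', 'are', 'nearby']
```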
Stop Word Removal
• Stop word removal is one of the most commonly used preprocessing steps across different
NLP applications.

• The idea is simply removing the words that occur commonly across all the documents in
the corpus.

• Articles and pronouns are typically classified as stop words.

• These words have no significance in some of the NLP tasks like information retrieval and classification, which means these words are not very discriminative.

• Conversely, in some NLP applications stop word removal has very little impact.

• Most of the time, the stop word list for a given language is a carefully hand-curated list of the words that occur most commonly across corpora.
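A sketch with a tiny hand-picked stop list (real systems use curated lists such as NLTK's stopwords corpus):

```python
# Stop word removal sketch: filter out common function words.
stop_words = {"the", "of", "is", "a", "are", "not"}

tokens = "the game of life is a game of everlasting learning".split()
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)  # ['game', 'life', 'game', 'everlasting', 'learning']
```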
Stemming
• Stemming, also called suffix stripping, is a technique used to reduce text dimensionality. Stemming is also a type of text normalization that enables you to standardize some words into specific expressions, also called stems.

• In other words, stemming is a technique used to extract the base form of words by removing affixes from them. It is just like cutting down the branches of a tree to its stem.

• For example, the stem of the words eating, eats, eaten is eat.
Stemming
• Search engines use stemming for indexing the words.

• That's why, rather than storing all forms of a word, a search engine can store only the stems.

• In this way, stemming reduces the size of the index and increases retrieval accuracy.
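A sketch using NLTK's Porter stemmer, a widely used suffix-stripping algorithm (requires `pip install nltk`). Note that a rule-based stemmer does not always match the idealized example above: Porter maps eating and eats to eat but leaves the irregular form eaten unchanged.

```python
# Stemming sketch with the Porter algorithm from NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["eating", "eats", "eaten", "learning", "industries"]:
    print(word, "->", stemmer.stem(word))
# eating -> eat, eats -> eat, eaten -> eaten,
# learning -> learn, industries -> industri
```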
Types of Language Models:
• There are primarily two types of language models:

1. Statistical Language Models
• Statistical models include the development of probabilistic models that are able to predict the next word in a sequence, given the words that precede it. A number of statistical language models are already in use (a minimal bigram sketch is given after this list).

2. Neural Language Models
• These language models are based on neural networks and are often considered an advanced approach to executing NLP tasks. Neural language models overcome the shortcomings of classical models such as n-grams and are used for complex tasks such as speech recognition or machine translation.
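As a concrete example of the statistical approach mentioned above, here is a minimal bigram model sketch (the toy corpus and the helper names are illustrative): it estimates P(next word | previous word) from pair counts.

```python
from collections import Counter, defaultdict

# Bigram language model sketch: count adjacent word pairs, then predict
# the most probable next word given the previous one.
corpus = "the game of life is a game of everlasting learning".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def most_likely_next(prev):
    counts = bigram_counts[prev]
    word, count = counts.most_common(1)[0]
    return word, count / sum(counts.values())

print(most_likely_next("game"))  # ('of', 1.0): "game" is always followed by "of" here
```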
Some Common Examples of Language Models

Language models are the cornerstone of Natural Language Processing (NLP) technology. We have
been making the best of language models in our routine, without even realizing it. Let’s take a
look at some of the examples of language models:

1. Speech Recognition
• Voice assistants such as Siri and Alexa are examples of how language models help machines in
processing speech audio.
2. Machine Translation
• Google Translate and Microsoft Translator are examples of how NLP models can help in
translating one language to another.
3. Sentiment Analysis
• This helps in analyzing the sentiments behind a phrase. This use case of NLP models is used in
products that allow businesses to understand a customer’s intent behind opinions or attitudes
expressed in the text. HubSpot’s Service Hub is an example of how language models can help
in sentiment analysis.
4. Text Suggestions
• Google services such as Gmail or Google Docs use language models to help users get text
suggestions while they compose an email or create long text documents, respectively.
5. Parsing Tools
• Parsing involves analyzing sentences or words that comply with syntax or grammar rules. Spell
checking tools are perfect examples of language modelling and parsing.
