
Information Retrieval

Chapter 2
Retrieval Model
• A retrieval model specifies the details of:
– Document representation
– Query representation
– Retrieval function

• Determines a notion of relevance.

• The notion of relevance can be binary or continuous (i.e., ranked retrieval).
Classes of Retrieval Models
• Boolean models (set theoretic)

• Vector space models (statistical/algebraic)


Boolean Model
• A document is represented as a set of keywords.

• Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope.

• Output: a document is either relevant or not. No partial matches or ranking.
Boolean Retrieval Model
• Popular retrieval model because:
– Easy to understand for simple queries.
– Clean formalism.

• Boolean models can be extended to include ranking.

• Reasonably efficient implementations are possible for normal queries.
Boolean Models – Problems
• Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
• Difficult to control the number of documents retrieved.
– All matched documents will be returned.
• Difficult to rank output.
– All matched documents logically satisfy the query.
• Difficult to perform relevance feedback.
– If a document is identified by the user as relevant or irrelevant, how should the query be modified?
Consider a small document collection:

Document ID | Document Content
D1 | Artificial intelligence is transforming industries.
D2 | Machine learning and artificial intelligence are related.
D3 | Data science uses machine learning techniques.
D4 | Artificial intelligence is used in data science.

Example Boolean queries:
• "artificial AND intelligence"
• "machine OR data"
• "artificial AND NOT science"
Statistical Models
• A document is typically represented by a bag of words (unordered words with frequencies).

• Bag = a set that allows multiple occurrences of the same element.
Statistical Retrieval
• Retrieval is based on similarity between the query and documents.
• Output documents are ranked according to their similarity to the query.
• Similarity is based on occurrence frequencies of keywords in the query and document.
• Automatic relevance feedback can be supported (see the sketch below):
– Relevant documents are “added” to the query.
– Irrelevant documents are “subtracted” from the query.
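The add/subtract idea in the last bullet is the essence of the Rocchio relevance-feedback formula. A minimal sketch, assuming tf-idf style document vectors; the alpha/beta/gamma values are common illustrative defaults, not values from the slides:

```python
import numpy as np

# Rocchio-style feedback: move the query vector toward the centroid of
# relevant documents and away from the centroid of irrelevant ones.
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant) > 0:
        q = q + beta * np.mean(relevant, axis=0)
    if len(irrelevant) > 0:
        q = q - gamma * np.mean(irrelevant, axis=0)
    return q

# Toy usage: a 3-term query vector and one document of each kind.
print(rocchio([1.0, 0.0, 0.5], relevant=[[0.8, 0.2, 0.4]], irrelevant=[[0.0, 0.9, 0.1]]))
```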
The vector space model

• Documents and queries are assumed to be part of an n-dimensional vector space, where n is the number of index terms.

• A document Di is represented by a vector of index-term weights:
Di = (di1, di2, …, din), where dij is the weight of the jth term.

• A query Q is represented by a vector of n weights:
Q = (q1, q2, …, qn), where qj is the weight of the jth term in the query.

• A document collection containing N documents can be represented as an N × n matrix of term weights (a small sketch follows).
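As a concrete picture of that N × n matrix, here is a small sketch using the three documents from the tf-idf example later in the chapter; raw term counts stand in for the weights purely for illustration:

```python
import numpy as np

# N = 3 documents, n = index terms: each row is a document vector Di,
# each entry dij a term weight (raw counts here, not tf-idf).
documents = [
    "the game of life is a game of everlasting learning",
    "the unexamined life is not worth living",
    "never stop learning",
]
terms = sorted({w for doc in documents for w in doc.split()})

matrix = np.array([[doc.split().count(t) for t in terms] for doc in documents])
print(matrix.shape)  # (3, 14): one row per document, one column per index term
```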
Issues for the Vector Space Model
• How to determine important words in a document?
– Word sense?
– Word n-grams (and phrases, idioms, …) as terms?
• How to determine the degree of importance of a term within a document and within the entire collection?
• How to determine the degree of similarity between a document and the query?
• In the case of the web, what is the collection and what are the effects of links, formatting information, etc.?
Boolean and Vector-Space Retrieval Models
• Boolean Retrieval (BR) and the Vector Space Model (VSM) are very popular methods in information retrieval for creating an inverted index and querying terms.

• The BR method returns the exact matches of a textual query, without ranking the results.

• The VSM method both searches and ranks the results.
Term Weighting
• tf-idf stands for Term frequency-inverse document
frequency.
• The tf-idf weight is a weight often used in information
retrieval and text mining.
• Variations of the tf-idf weighting scheme are often used by
search engines in scoring and ranking a document’s
relevance given a query.
• This weight is a statistical measure used to evaluate how
important a word is to a document in a collection or corpus.
• The importance increases proportionally to the number of
times a word appears in the document but is offset by the
frequency of the word in the corpus (data-set).
Term Frequency (TF)
• A major reason for Google's success is its PageRank algorithm.

• PageRank determines how trustworthy and reputable a given website is.

• But there is also another part: the input query entered by the user should be used to match the relevant documents and score them.
Consider the three documents:

Document 1: The game of life is a game of everlasting learning

Document 2: The unexamined life is not worth living

Document 3: Never stop learning

Let us imagine that you are doing a search on these documents with the following query: life learning

The query is a free text query: a query in which the terms are typed freeform into the search interface, without any connecting search operators.
Step 1: Term Frequency (TF)
• Term Frequency, also known as TF, measures the number of times a term (word) occurs in a document.

• Given in the next slide are the terms and their frequencies in each of the documents.

• In reality, each document will be of a different size.

• In a large document, the frequency of the terms will be much higher than in a smaller one.

• Hence we need to normalize the document based on its size.

• A simple trick is to divide the term frequency by the total number of terms.

• For example, in Document 1 the term game occurs two times and the total number of terms in the document is 10. Hence the normalized term frequency is 2 / 10 = 0.2.

• Given below are the normalized term frequencies for all the documents.
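A minimal sketch of this normalization (the function and variable names are ours, not a library API):

```python
# Step 1 sketch: normalized term frequency = count / document length.
docs = {
    "Document1": "The game of life is a game of everlasting learning",
    "Document2": "The unexamined life is not worth living",
    "Document3": "Never stop learning",
}

def normalized_tf(text):
    tokens = text.lower().split()
    return {term: tokens.count(term) / len(tokens) for term in set(tokens)}

tf = {name: normalized_tf(text) for name, text in docs.items()}
print(tf["Document1"]["game"])  # 2 / 10 = 0.2
print(tf["Document1"]["life"])  # 1 / 10 = 0.1
```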
Step 2: Inverse Document Frequency (IDF)
• The main purpose of doing a search is to find relevant documents matching the query.
• In the first step, all terms are considered equally important. In fact, certain terms that occur too frequently have little power in determining relevance.
• We need a way to weigh down the effects of too frequently occurring terms.
• Also, the terms that occur less often can be more relevant.
• We need a way to weigh up the effects of less frequently occurring terms. The logarithm helps to achieve this.
Computing IDF for the term game:

IDF(game) = 1 + log_e(Total Number Of Documents / Number Of Documents with term game in it)

There are 3 documents in all: Document1, Document2, Document3.

The term game appears only in Document1, so:

IDF(game) = 1 + log_e(3 / 1) = 1 + 1.098612289 = 2.098612289
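The same computation in a short sketch, using the slide's definition IDF(t) = 1 + log_e(N / df(t)):

```python
import math

# Step 2 sketch: IDF(t) = 1 + ln(N / df), where df is the number of
# documents containing the term.
doc_token_sets = [
    set("the game of life is a game of everlasting learning".split()),
    set("the unexamined life is not worth living".split()),
    set("never stop learning".split()),
]

def idf(term):
    df = sum(1 for tokens in doc_token_sets if term in tokens)
    return 1 + math.log(len(doc_token_sets) / df)

print(idf("game"))  # 1 + ln(3/1) ≈ 2.098612
print(idf("life"))  # 1 + ln(3/2) ≈ 1.405465
```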
IDF for terms occurring in multiple documents

Since the terms the, life, is, and learning each occur in 2 out of the 3 documents, they have a lower score compared to the terms that appear in only one document.
Step 3: TF * IDF
• Remember we are trying to find the relevant documents for the query: life learning

• For each term in the query, multiply its normalized term frequency by its IDF in each document.

• In Document1, the normalized term frequency of life is 0.1 and its IDF is 1.405465108.

• Multiplying them together we get 0.140546511 (0.1 * 1.405465108).

• Given in the next slide are the TF * IDF calculations for life and learning in all the documents.
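Combining the two earlier snippets gives the Step 3 weights; this sketch reuses the `tf` dictionary, the `idf` function, and the `docs` collection defined in the TF and IDF sketches above:

```python
# Step 3 sketch: tf-idf weight of each query term in each document.
query_terms = ["life", "learning"]
for name in docs:
    weights = {t: tf[name].get(t, 0.0) * idf(t) for t in query_terms}
    print(name, weights)
# Document1: life ≈ 0.1405, learning ≈ 0.1405
# Document2: life ≈ 0.2008 (1/7 * 1.405465), learning = 0.0
# Document3: life = 0.0, learning ≈ 0.4685 (1/3 * 1.405465)
```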
Step 4: Vector Space Model – Cosine Similarity
• The query entered by the user can also be represented as a vector.
• Calculate the TF * IDF for the query.
• Now calculate the cosine similarity of the query and Document1:

Cosine Similarity(Query, Document1) = Dot product(Query, Document1) / (||Query|| * ||Document1||)

Dot product(Query, Document1) = (0.702732554 * 0.140546511) + (0.702732554 * 0.140546511) = 0.197533217

||Query|| = sqrt(0.702732554^2 + 0.702732554^2) = 0.993813908

||Document1|| = sqrt(0.140546511^2 + 0.140546511^2) = 0.198762782

Cosine Similarity(Query, Document1) = 0.197533217 / (0.993813908 * 0.198762782) = 0.197533217 / 0.197533217 = 1
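A self-contained sketch of this final step, with the weights from above plugged in as literals:

```python
import math

# Step 4 sketch: cosine similarity = dot(u, v) / (||u|| * ||v||).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

query = [0.702732554, 0.702732554]      # tf-idf of (life, learning) in the query
document1 = [0.140546511, 0.140546511]  # tf-idf of (life, learning) in Document1
print(cosine(query, document1))  # ≈ 1.0: the two vectors point in the same direction
```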
The similarity scores for all the documents and the query

• Document1 has the highest score of 1.

• This is not surprising, as it contains both the terms life and learning.
Preprocessing
• Preprocessing is an important task and a critical step in text mining, Natural Language Processing (NLP), and information retrieval (IR).
• Before information is retrieved from the documents, preprocessing techniques are applied to the target data set to reduce its size, which increases the effectiveness of the IR system.
• Common preprocessing methods for text documents include tokenization, stop word removal, and stemming.
Tokenization
• Tokenization is a simple process that takes raw data and
converts it into a useful data string.
• Tokenization is used in natural language processing to split
paragraphs and sentences into smaller units that can be
more easily assigned meaning.
• The first step of the NLP process is gathering the data (a
sentence) and breaking it into understandable parts
(words).
• An example of a string of data:
• “What restaurants are nearby?”
Tokenization
• In order for this sentence to be understood by a machine,
tokenization is performed on the string to break it into individual
parts. With tokenization, we’d get something like this:
• ‘what’ ‘restaurants’ ‘are’ ‘nearby’
• This may seem simple, but breaking a sentence into its parts
allows a machine to understand the parts as well as the whole.
• This will help the program understand each of the words by
themselves, as well as how they function in the larger text.
• This is especially important for larger amounts of text as it
allows the machine to count the frequencies of certain words as
well as where they frequently appear.
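A minimal sketch of that tokenization step (a production system would use a proper library tokenizer such as NLTK's `word_tokenize`; lower-casing and splitting is enough for this example):

```python
# Tokenization sketch: lower-case the sentence, strip the trailing
# punctuation, and split on whitespace.
sentence = "What restaurants are nearby?"
tokens = sentence.lower().rstrip("?").split()
print(tokens)  # ['what', 'restaurants', 'are', 'nearby']
```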
Stop Word Removal
• Stop word removal is one of the most commonly used preprocessing steps across different
NLP applications.

• The idea is simply removing the words that occur commonly across all the documents in
the corpus.

• Articles and pronouns are typically classified as stop words.

• These words have no significance in some of the NLP tasks like information retrieval and classification, which means these words are not very discriminative.

• Conversely, in some NLP applications stop word removal has very little impact.

• Most of the time, the stop word list for a given language is a carefully hand-curated list of the words that occur most commonly across corpora.
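A sketch with a tiny hand-picked stop list (real systems use curated lists such as NLTK's stopwords corpus):

```python
# Stop word removal sketch: filter out common function words.
stop_words = {"the", "of", "is", "a", "are", "not"}

tokens = "the game of life is a game of everlasting learning".split()
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)  # ['game', 'life', 'game', 'everlasting', 'learning']
```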
Stemming
• Stemming, also called suffix stripping, is a technique used to reduce text dimensionality. Stemming is also a type of text normalization that enables you to standardize some words into specific expressions, also called stems.

• In other words, stemming is a technique used to extract the base form of words by removing affixes from them. It is just like cutting down the branches of a tree to its stem.

• For example, the stem of the words eating, eats, eaten is eat.
Stemming
• Search engines use stemming for indexing the words.

• That's why, rather than storing all forms of a word, a search engine can store only the stems.

• In this way, stemming reduces the size of the index and increases retrieval accuracy.
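A sketch using NLTK's Porter stemmer, a widely used suffix-stripping algorithm (requires `pip install nltk`). Note that a rule-based stemmer does not always match the idealized example above: Porter maps eating and eats to eat but leaves the irregular form eaten unchanged.

```python
# Stemming sketch with the Porter algorithm from NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["eating", "eats", "eaten", "learning", "industries"]:
    print(word, "->", stemmer.stem(word))
# eating -> eat, eats -> eat, eaten -> eaten,
# learning -> learn, industries -> industri
```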
Types of Language Models:
• There are primarily two types of language models:

1. Statistical Language Models
• Statistical models include the development of probabilistic models that are able to predict the next word in a sequence, given the words that precede it. A number of statistical language models are already in use (a minimal bigram sketch is given after this list).

2. Neural Language Models
• These language models are based on neural networks and are often considered an advanced approach to executing NLP tasks. Neural language models overcome the shortcomings of classical models such as n-grams and are used for complex tasks such as speech recognition or machine translation.
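As a concrete example of the statistical approach mentioned above, here is a minimal bigram model sketch (the toy corpus and the helper names are illustrative): it estimates P(next word | previous word) from pair counts.

```python
from collections import Counter, defaultdict

# Bigram language model sketch: count adjacent word pairs, then predict
# the most probable next word given the previous one.
corpus = "the game of life is a game of everlasting learning".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def most_likely_next(prev):
    counts = bigram_counts[prev]
    word, count = counts.most_common(1)[0]
    return word, count / sum(counts.values())

print(most_likely_next("game"))  # ('of', 1.0): "game" is always followed by "of" here
```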
Some Common Examples of Language Models

Language models are the cornerstone of Natural Language Processing (NLP) technology. We have
been making the best of language models in our routine, without even realizing it. Let’s take a
look at some of the examples of language models:

1. Speech Recognition
• Voice assistants such as Siri and Alexa are examples of how language models help machines in
processing speech audio.
2. Machine Translation
• Google Translate and Microsoft Translator are examples of how NLP models can help in
translating one language to another.
3. Sentiment Analysis
• This helps in analyzing the sentiments behind a phrase. This use case of NLP models is used in
products that allow businesses to understand a customer’s intent behind opinions or attitudes
expressed in the text. HubSpot’s Service Hub is an example of how language models can help
in sentiment analysis.
4. Text Suggestions
• Google services such as Gmail or Google Docs use language models to help users get text
suggestions while they compose an email or create long text documents, respectively.
5. Parsing Tools
• Parsing involves analyzing sentences or words that comply with syntax or grammar rules. Spell
checking tools are perfect examples of language modelling and parsing.
