0% found this document useful (0 votes)

36 views61 pages

Introduction To Information Retrieval: Courtesy

This document introduces information retrieval (IR) and the main components of an IR system. It discusses the IR problem of finding relevant documents from a large collection based on a user's information need. The key processes in an IR system are indexing documents and queries, retrieving documents based on the query, and evaluating system performance. Document indexing represents documents as weighted keywords to facilitate retrieval. Stopwords are removed and stemming is applied before indexing. Retrieval involves comparing the query representation to document representations to calculate matching scores. Different IR models approach this differently, such as the Boolean and vector space models. System evaluation assesses precision and recall to measure effectiveness.

Uploaded by

Tamizharasi A

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

36 views61 pages

Introduction To Information Retrieval: Courtesy

Uploaded by

Tamizharasi A

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

You are on page 1/ 61

Introduction to Information

Retrieval

Courtesy:
Jian-Yun Nie
University of Montreal
Canada

1
Outline
 What is the IR problem?
 How to organize an IR system? (Or the
main processes in IR)
 Indexing
 Retrieval
 System evaluation
 Some current research topics
2
The problem of IR
 Goal = find documents relevant to an information
need from a large document set Info.
need

Query
IR
Retrieval system
Document Answer list
collection

3
Example

Google

Web

4
IR problem
 First applications: in libraries (1950s)
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation,
analysis, and retrieval of information by computer
Editor: Addison-Wesley
Date: 1989
Content: <Text>
 external attributes and internal attribute (content)
 Search by external attributes = Search in DB
 IR: search by content
5
Possible approaches
1. String matching (linear search in
documents)
- Slow
- Difficult to improve
2. Indexing (*)
- Fast
- Flexible to further improvement

6
Indexing-based IR
Document Query

indexing indexing
(Query analysis)
Representation Representation
(keywords) Query (keywords)
evaluation

7
Main problems in IR
 Document and query indexing
 How to best represent their contents?
 Query evaluation (or retrieval process)
 To what extent does a document correspond
to a query?
 System evaluation
 How good is a system?
 Are the retrieved documents relevant?
(precision)
 Are all the relevant documents retrieved?
(recall)
8
Document indexing
 Goal = Find the important meanings and create an
internal representation
 Factors to consider:
 Accuracy to represent meanings (semantics)
 Exhaustiveness (cover all the contents)
 Facility for computer to manipulate
 What is the best representation of contents?
 Char. string (char trigrams): not precise enough
 Word: good coverage, not precise
 Phrase: poor coverage, more precise
 Concept: poor coverage, precise

Coverage Accuracy
(Recall) String Word Phrase Concept (Precision)
9
Keyword selection and weighting
 How to select important keywords?
Simple method: using middle-frequency words



Frequency/Informativity

frequency informativity

Max.

Min.
123… Rank
10
tf*idf weighting schema
 tf = term frequency
 frequency of a term/keyword in a document
The higher the tf, the higher the importance (weight) for the doc.
 df = document frequency
 no. of documents containing the term
 distribution of the term
 idf = inverse document frequency
 the unevenness of term distribution in the corpus
 the specificity of term to a document
The more the term is distributed evenly, the less it is specific to a
document
weight(t,D) = tf(t,D) * idf(t)
11
Some common tf*idf schemes
 tf(t, D)=freq(t,D) idf(t) = log(N/n)
 tf(t, D)=log[freq(t,D)] n = #docs containing t
 tf(t, D)=log[freq(t,D)]+1 N = #docs in corpus
 tf(t, D)=freq(t,d)/Max[f(t,d)]

weight(t,D) = tf(t,D) * idf(t)

 Normalization: Cosine normalization, /max, …

12
Document Length
Normalization
 Sometimes, additional normalizations e.g.
length:
weight (t , D)
pivoted (t , D) 
slope
1 normalized _ weight (t , D)
(1  slope)  povot

Probability
of relevance
slope
pivot
Probability of retrieval

Doc. length
13
Stopwords / Stoplist
 function words do not bear useful information for IR
of, in, about, with, I, although, …
 Stoplist: contain stopwords, not to be used as index
 Prepositions
 Articles
 Pronouns
 Some adverbs and adjectives
 Some frequent words (e.g. document)

 The removal of stopwords usually improves IR

effectiveness
 A few “standard” stoplists are commonly used.

14
Stemming
 Reason:
 Different word forms may bear similar meaning (e.g. search,
searching): create a “standard” representation for them
 Stemming:
 Removing some endings of word
computer
compute
computes
computing
computed
computation
comput

15
Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix stripping,
Program, 14(3) :130-137)
 Step 1: plurals and past participles
 SSES -> SS caresses -> caress
 (*v*) ING -> motoring -> motor
 Step 2: adj->n, n->v, n->adj, …
 (m>0) OUSNESS -> OUS callousness -> callous
 (m>0) ATIONAL -> ATE relational -> relate
 Step 3:
 (m>0) ICATE -> IC triplicate -> triplic
 Step 4:
 (m>1) AL -> revival -> reviv
 (m>1) ANCE -> allowance -> allow
 Step 5:

(m>1) E -> probate -> probat

(m > 1 and *d and *L) -> single letter controll -> control
16
Lemmatization
 transform to standard form according to syntactic
category.
E.g. verb + ing  verb
noun + s  noun
 Need POS tagging
 More accurate than stemming, but needs more resources

 crucial to choose stemming/lemmatization rules

noise v.s. recognition rate
 compromise between precision and recall

light/no stemming severe stemming

-recall +precision +recall -precision

17
Result of indexing
 Each document is represented by a set of weighted
keywords (terms):
D1  {(t1, w1), (t2,w2), …}

e.g. D1  {(comput, 0.2), (architect, 0.3), …}

D2  {(comput, 0.1), (network, 0.5), …}

 Inverted file:
comput  {(D1,0.2), (D2,0.1), …}
Inverted file is used during retrieval for higher efficiency.

18
Retrieval
 The problems underlying retrieval
 Retrieval model
 How is a document represented with the
selected keywords?
 How are document and query representations
compared to calculate a score?
 Implementation

19
Cases
 1-word query:
The documents to be retrieved are those that
include the word
- Retrieve the inverted list for the word

- Sort in decreasing order of the weight of the word

 Multi-word query?
- Combining several lists
- How to interpret the weight?

(IR model)

20
IR models
 Matching score model
 Document D = a set of weighted keywords
 Query Q = a set of non-weighted keywords
 R(D, Q) = i w(ti , D)
where ti is in Q.

21
Boolean model
 Document = Logical conjunction of keywords
 Query = Boolean expression of keywords
 R(D, Q) = D Q

e.g. D = t 1  t 2  …  tn
Q = (t1  t2)  (t3  t4)
D Q, thus R(D, Q) = 1.

Problems:
 R is either 1 or 0 (unordered set of documents)
 many documents or few documents
 End-users cannot manipulate Boolean operators correctly
E.g. documents about kangaroos and koalas

22
Extensions to Boolean model
(for document ordering)
 D = {…, (ti, wi), …}: weighted keywords
 Interpretation:
 D is a member of class ti to degree wi.
 In terms of fuzzy sets: ti(D) = wi
A possible Evaluation:
R(D, ti) = ti(D);
R(D, Q1  Q2) = min(R(D, Q1), R(D, Q2));
R(D, Q1  Q2) = max(R(D, Q1), R(D, Q2));
R(D, Q1) = 1 - R(D, Q1).

23
Vector space model
 Vector space = all the keywords encountered
<t1, t2, t3, …, tn>
 Document
D= < a1, a2, a3, …, an>
ai = weight of ti in D
 Query
Q= < b1, b2, b3, …, bn>
bi = weight of ti in Q
 R(D,Q) = Sim(D,Q)
24
Matrix representation
Document space t1 t2 t3 … tn Term vector
space
D1 a11 a12 a13 … a1n
D2 a21 a22 a23 … a2n
D3 a31 a32 a33 … a3n
…
Dm am1 am2 am3 … amn
Q b1 b2 b3 … bn
25
Some formulas for Sim
Dot product Sim( D, Q)   ( ai * bi )
t1
 (a * b ) i i D
Sim( D, Q)  i
Cosine  ai *  bi
2 2 Q
i i
t2
2 (ai * bi )
Dice Sim( D, Q)  i

 ai   bi
2 2

i i

 (a * b ) i i
Jaccard Sim( D, Q)  i

 a   b   (a * b )
2 2
i i i i
i i i
26
Implementation (space)
 Matrix is very sparse: a few 100s terms for
a document, and a few terms for a query,
while the term space is large (~100k)

 Stored as:
D1  {(t1, a1), (t2,a2), …}

t1  {(D1,a1), …}

27
Implementation (time)
 The implementation of VSM with dot product:
 Naïve implementation: O(m*n)
 Implementation using inverted file:

Given a query = {(t1,b1), (t2,b2)}:

1. find the sets of related documents through inverted file for
t1 and t2
2. calculate the score of the documents to each weighted term
(t1,b1)  {(D1,a1 *b1), …}
3. combine the sets and sum the weights ()
 O(|Q|*n)

28
Other similarities
 Cosine:
 (a * b ) i i
ai bi
Sim( D, Q)  i

 a *b a b
2 2 2 2
i
i i j j
j j j j

- use j j
a
2
and j j to normalize the
b
2

weights after indexing

- Dot product
(Similar operations do not apply to Dice and
Jaccard)
29
Probabilistic model
 Given D, estimate P(R|D) and P(NR|D)
 P(R|D)=P(D|R)*P(R)/P(D) (P(D), P(R) constant)
 P(D|R)
1 present
D = {t1=x1, t2=x2, …} xi  
 0 absent
 P( D | R)   P(t
( ti  xi )D
i  xi | R)

  P (ti  1 | R) xi P (ti  0 | R ) (1 xi )   pi i (1  pi ) (1 xi )

ti ti

P ( D | NR)   P (ti  1 | NR ) xi P(ti  0 | NR ) (1 xi )   qi i (1  qi ) (1 xi )

ti ti

30
Prob. model (cont’d)
For document ranking
 i (1 xi )
x
p i
(1  pi )
P( D | R) ti
Odd ( D)  log  log
 i (1 xi )
x
P ( D | NR ) q i
(1  qi )
ti

pi (1  qi ) 1  pi
  xi log   log
ti qi (1  pi ) ti 1  qi
pi (1  qi )
  xi log
ti qi (1  pi )

31
Prob. model (cont’d)
ri ni-ri ni
 How to estimate pi and qi? Rel. doc. Irrel.doc. Doc.
with ti with ti with ti
 A set of N relevant and Ri-ri N-Ri–n+ri N-ni
irrelevant samples: Rel. doc. Irrel.doc. Doc.
without ti without ti without ti
ri ni  ri
pi  qi  Ri N-Ri N
Ri N  Ri Rel. doc Irrel.doc. Samples

32
Prob. model (cont’d)
pi (1  qi )
Odd ( D )   xi log
ti qi (1  pi )
ri ( N  Ri  ni  ri )
  xi
ti ( Ri  ri )(ni  ri )
 Smoothing (Robertson-Sparck-Jones formula)

(ri  0.5)( N  Ri  ni  ri  0.5)

Odd ( D)   xi   wi
ti ( Ri  ri  0.5)(ni  ri  0.5) ti D

 When no sample is available:

pi=0.5,
qi=(ni+0.5)/(N+0.5)ni/N
 May be implemented as VSM 33
BM25
(k1  1)tf (k3  1)qtf avdl  dl
Score ( D, Q)   w  k2 | Q |
tQ K  tf k3  qtf avdl  dl
dl
K  k1 ((1  b)  b )
avdl  dl

 k1, k2, k3, d: parameters

 qtf: query term frequency
 dl: document length
 avdl: average document length
34
(Classic) Presentation of results
 Query evaluation result is a list of documents,
sorted by their similarity to the query.
 E.g.
doc1 0.67
doc2 0.65
doc3 0.54
…

35
System evaluation
 Efficiency: time, space
 Effectiveness:
 How is a system capable of retrieving relevant
documents?
 Is a system better than another one?
 Metrics often used (together):
 Precision = retrieved relevant docs / retrieved docs
 Recall = retrieved relevant docs / relevant docs
retrieved relevant

relevant retrieved
36
General form of precision/recall
Precision
1.0

Recall
1.0

-Precision change w.r.t. Recall (not a fixed point)

-Systems cannot compare at one Precision/Recall point
-Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0) 37
An illustration of P/R
calculation
Precision
List Rel? 1.0 - * (0.2, 1.0)

Doc1 Y 0.8 - * (0.6, 0.75)

Doc2 0.6 -
* (0.4, 0.67)
* (0.6, 0.6)

Doc3 Y * (0.2, 0.5)

38
MAP (Mean Average Precision)
1 1 j
MAP   
n Qi | Ri | D j Ri rij

 rij = rank of the j-th relevant document for Qi

 |Ri| = #rel. doc. for Qi
 n = # test queries
 E.g. Rank: 1 4 1st rel. doc.
5 8 2nd rel. doc.
10 3rd rel. doc.
1 1 1 2 3 1 1 2
MAP  [ (   )  (  )]
2 3 1 5 10 2 4 8 39
Some other measures
 Noise = retrieved irrelevant docs / retrieved docs
 Silence = non-retrieved relevant docs / relevant docs
 Noise = 1 – Precision; Silence = 1 – Recall
 Fallout = retrieved irrel. docs / irrel. docs
 Single value measures:
 F-measure = 2 P * R / (P + R)
 Average precision = average at 11 points of recall
 Precision at n document (often used for Web IR)
 Expected search length (no. irrelevant documents to read
before obtaining n relevant doc.)

40
Test corpus
 Compare different IR systems on the same
test corpus
 A test corpus contains:
 A set of documents
 A set of queries

 Relevance judgment for every document-query pair

(desired answers for each query)

 The results of a system is compared with the
desired answers.

41
An evaluation example
(SMART)
Run number: 1 2 Average precision for all points
Num_queries: 52 52 11-pt Avg: 0.2859 0.3092
Total number of documents over all % Change: 8.2
queries
Recall:
Retrieved: 780 780
Exact: 0.4139 0.4166
Relevant: 796 796
at 5 docs: 0.2373 0.2726
Rel_ret: 246 229
at 10 docs: 0.3254 0.3572
Recall - Precision Averages:
at 15 docs: 0.4139 0.4166
at 0.00 0.7695 0.7894
at 30 docs: 0.4139 0.4166
at 0.10 0.6618 0.6449
Precision:
at 0.20 0.5019 0.5090
Exact: 0.3154
at 0.30 0.3745 0.3702 0.2936
at 0.40 0.2249 0.3070 At 5 docs: 0.4308 0.4192
at 0.50 0.1797 0.2104 At 10 docs: 0.3538 0.3327
at 0.60 0.1143 0.1654 At 15 docs: 0.3154 0.2936
at 0.70 0.0891 0.1144 At 30 docs: 0.1577 0.1468
at 0.80 0.0891 0.1096
at 0.90 0.0699 0.0904
at 1.00 0.0699 0.0904

42
The TREC experiments
 Once per year
 A set of documents and queries are distributed
to the participants (the standard answers are
unknown) (April)
 Participants work (very hard) to construct, fine-
tune their systems, and submit the answers
(1000/query) at the deadline (July)
 NIST people manually evaluate the answers
and provide correct answers (and classification
of IR systems) (July – August)
 TREC conference (November)
43
TREC evaluation methodology
 Known document collection (>100K) and query set
(50)
 Submission of 1000 documents for each query by each
participant
 Merge 100 first documents of each participant ->
global pool
 Human relevance judgment of the global pool
 The other documents are assumed to be irrelevant
 Evaluation of each system (with 1000 answers)

 Partial relevance judgments

 But stable for system ranking

44
Tracks (tasks)
 Ad Hoc track: given document collection, different
topics
 Routing (filtering): stable interests (user profile),
incoming document flow
 CLIR: Ad Hoc, but with queries in a different language
 Web: a large set of Web pages
 Question-Answering: When did Nixon visit China?
 Interactive: put users into action with system
 Spoken document retrieval
 Image and video retrieval
 Information tracking: new topic / follow up
45
CLEF and NTCIR
 CLEF = Cross-Language Experimental Forum
 for European languages
 organized by Europeans
 Each per year (March – Oct.)
 NTCIR:
 Organized by NII (Japan)
 For Asian languages
 cycle of 1.5 year

46
Impact of TREC
 Provide large collections for further experiments
 Compare different systems/techniques on realistic
data
 Develop new methodology for system evaluation

 Similar experiments are organized in other areas

(NLP, Machine translation, Summarization, …)

47
Some techniques to
improve IR effectiveness
 Interaction with user (relevance feedback)
- Keywords only cover part of the contents
- User can help by indicating relevant/irrelevant
document
 The use of relevance feedback
 To improve query expression:
Qnew = *Qold + *Rel_d - *Nrel_d
where Rel_d = centroid of relevant documents
NRel_d = centroid of non-relevant documents

48
Effect of RF

2nd retrieval
1st retrieval

* * *
* *
* x * x x
* * * x x
** * * R* Q * NR x
Qnew
**
* x * x x
* * x

49
Modified relevance feedback
 Users usually do not cooperate (e.g.
AltaVista in early years)
 Pseudo-relevance feedback (Blind RF)
 Using the top-ranked documents as if they
are relevant:
Select m terms from n top-ranked documents
 One can usually obtain about 10% improvement

50
Query expansion
 A query contains part of the important words
 Add new (related) terms into the query
 Manually constructed knowledge base/thesaurus
(e.g. Wordnet)
 Q = information retrieval
 Q’ = (information + data + knowledge + …)
(retrieval + search + seeking + …)
 Corpus analysis:
 two terms that often co-occur are related (Mutual
information)
 Two terms that co-occur with the same words are
related (e.g. T-shirt and coat with wear, …)
51
Global vs. local context analysis
 Global analysis: use the whole document
collection to calculate term relationships
 Local analysis: use the query to retrieve a
subset of documents, then calculate term
relationships
 Combine pseudo-relevance feedback and term co-
occurrences
 More effective than global analysis

52
Some current research topics:
Go beyond keywords
 Keywords are not perfect representatives of concepts
 Ambiguity:
table = data structure, furniture?
 Lack of precision:
“operating”, “system” less precise than “operating_system”
 Suggested solution
 Sense disambiguation (difficult due to the lack of contextual
information)
 Using compound terms (no complete dictionary of
compound terms, variation in form)
 Using noun phrases (syntactic patterns + statistics)
 Still a long way to go
53
Theory …
 Bayesian networks
 P(Q|D)
D1 D2 D3 … Dm

t1 t2 t3 t4 …. tn

c1 c2 c3 c4 … cl

Inference Q revision

 Language models
54
Logical models
 How to describe the relevance relation
as a logical relation?
D => Q
 What are the properties of this relation?
 How to combine uncertainty with a
logical framework?
 The problem: What is relevance?

55
Related applications:
Information filtering
 IR: changing queries on stable document collection
 IF: incoming document flow with stable interests
(queries)
 yes/no decision (in stead of ordering documents)
 Advantage: the description of user’s interest may be
improved using relevance feedback (the user is more willing
to cooperate)
 Difficulty: adjust threshold to keep/ignore document
 The basic techniques used for IF are the same as those for
IR – “Two sides of the same coin”
keep
… doc3, doc2, doc1 IF
ignore
User profile 56
IR for (semi-)structured
documents
 Using structural information to assign weights
to keywords (Introduction, Conclusion, …)
 Hierarchical indexing
 Querying within some structure (search in
title, etc.)
 INEX experiments
 Using hyperlinks in indexing and retrieval
(e.g. Google)
 …
57
PageRank in Google
I1
PR( I i )
A B PR ( A)  (1  d )  d 
I2 i C(Ii )

 Assign a numeric value to each page

 The more a page is referred to by important pages, the more this
page is important

 d: damping factor (0.85)

 Many other criteria: e.g. proximity of query words

 “…information retrieval …” better than “… information … retrieval …”

58
IR on the Web
 No stable document collection (spider,
crawler)
 Invalid document, duplication, etc.
 Huge number of documents (partial
collection)
 Multimedia documents
 Great variation of document quality
 Multilingual problem
 …
59
Final remarks on IR
 IR is related to many areas:
 NLP, AI, database, machine learning, user
modeling…
 library, Web, multimedia search, …
 Relatively week theories
 Very strong tradition of experiments
 Many remaining (and exciting) problems
 Difficult area: Intuitive methods do not
necessarily improve effectiveness in practice

60
Why is IR difficult
 Vocabularies mismatching
 Synonymy: e.g. car v.s. automobile
 Polysemy: table
 Queries are ambiguous, they are partial specification of
user’s need
 Content representation may be inadequate and
incomplete
 The user is the ultimate judge, but we don’t know how
the judge judges…
 The notion of relevance is imprecise, context- and user-
dependent

 But how much it is rewarding to gain 10% improvement!

Unit V Easy To Learn
No ratings yet
Unit V Easy To Learn
21 pages
Lecture 3-Term Vocabulary and Posting Lists
No ratings yet
Lecture 3-Term Vocabulary and Posting Lists
26 pages
NLP - Module 5
No ratings yet
NLP - Module 5
58 pages
Module 5 - Information Retrieval and Lexical Resources
0% (1)
Module 5 - Information Retrieval and Lexical Resources
80 pages
CS8080 INFORMATION RETRIEVAL TECHNIQUES II INTERNAL EXAMINATION - Google Forms
No ratings yet
CS8080 INFORMATION RETRIEVAL TECHNIQUES II INTERNAL EXAMINATION - Google Forms
420 pages
mod4 nlp
No ratings yet
mod4 nlp
53 pages
mod 4
No ratings yet
mod 4
35 pages
IR-Berhampore-sukomalPal
No ratings yet
IR-Berhampore-sukomalPal
82 pages
Week 2 - Information Retrieval Basics
No ratings yet
Week 2 - Information Retrieval Basics
74 pages
IR Models: Chapter Five
100% (1)
IR Models: Chapter Five
26 pages
module 7
No ratings yet
module 7
53 pages
1 Overview
No ratings yet
1 Overview
44 pages
Module1PartBInformationRetrievalWebdocuments
No ratings yet
Module1PartBInformationRetrievalWebdocuments
49 pages
IR Models
No ratings yet
IR Models
65 pages
Module 3 Indexing Part A
No ratings yet
Module 3 Indexing Part A
46 pages
ISE Information Retrieval Mod-V (Uploaded by Snaptricks.in)
No ratings yet
ISE Information Retrieval Mod-V (Uploaded by Snaptricks.in)
48 pages
2T-Inverted Index
No ratings yet
2T-Inverted Index
54 pages
Information Retrieval
No ratings yet
Information Retrieval
72 pages
Apznzazcghor Yfaefzxic8mtoyxh4styndoxb7gk17qpn3jvxdvqw0hldfkvr9zqdwdlqlvv Bxxsh9ypo05o9bu2vf7xntq6 Pzji8yata6ieq9uptrduksav3o g6fx5brv Epaefr Ehdghr7renjhhptsx6dxy3fundzb1nwwcrmbvg5lggbaw6m2gzk5rudbp31dnn8w
No ratings yet
Apznzazcghor Yfaefzxic8mtoyxh4styndoxb7gk17qpn3jvxdvqw0hldfkvr9zqdwdlqlvv Bxxsh9ypo05o9bu2vf7xntq6 Pzji8yata6ieq9uptrduksav3o g6fx5brv Epaefr Ehdghr7renjhhptsx6dxy3fundzb1nwwcrmbvg5lggbaw6m2gzk5rudbp31dnn8w
61 pages
F-IR
No ratings yet
F-IR
30 pages
Information Retrieval - 1
No ratings yet
Information Retrieval - 1
47 pages
4_IRModels
No ratings yet
4_IRModels
46 pages
IR Lec03 Vocabulary Postings List
No ratings yet
IR Lec03 Vocabulary Postings List
28 pages
NLP Ir
No ratings yet
NLP Ir
24 pages
IR Chapter 2
No ratings yet
IR Chapter 2
37 pages
2 Introduction To Information Retrieval
No ratings yet
2 Introduction To Information Retrieval
38 pages
CS583 Info Retrieval
No ratings yet
CS583 Info Retrieval
34 pages
ISR chap..1
No ratings yet
ISR chap..1
27 pages
4-IR Models
No ratings yet
4-IR Models
33 pages
4-IR Models
No ratings yet
4-IR Models
33 pages
CCS369 - TSS-Unit 3
No ratings yet
CCS369 - TSS-Unit 3
55 pages
chap2part2
No ratings yet
chap2part2
20 pages
IR Chap4
100% (1)
IR Chap4
32 pages
L02-IR Models MMN
No ratings yet
L02-IR Models MMN
27 pages
bulu
No ratings yet
bulu
47 pages
ISE Information Retrieval Mod-V
No ratings yet
ISE Information Retrieval Mod-V
48 pages
Unit 1 Notes
No ratings yet
Unit 1 Notes
10 pages
Chapter 4 - Processing Text
No ratings yet
Chapter 4 - Processing Text
7 pages
Unit 1 Notes-1
No ratings yet
Unit 1 Notes-1
10 pages
Information Retrievalpdf
No ratings yet
Information Retrievalpdf
7 pages
Chapter 4 IR Models
No ratings yet
Chapter 4 IR Models
34 pages
IR Chap4
100% (1)
IR Chap4
32 pages
4_IRModels
No ratings yet
4_IRModels
46 pages
Introduction IR
No ratings yet
Introduction IR
61 pages
Lec2 2
No ratings yet
Lec2 2
17 pages
IR Models: - Why IR Models? - Boolean IR Model - Vector Space IR Model - Probabilistic IR Model
No ratings yet
IR Models: - Why IR Models? - Boolean IR Model - Vector Space IR Model - Probabilistic IR Model
46 pages
Boolean and Vector Space Retrieval Models
No ratings yet
Boolean and Vector Space Retrieval Models
31 pages
Irt Ans
No ratings yet
Irt Ans
9 pages
L03
No ratings yet
L03
16 pages
NLP Mod-V Q - A (Uploaded by Snaptricks - In)
No ratings yet
NLP Mod-V Q - A (Uploaded by Snaptricks - In)
7 pages
Tamrakar 2015
No ratings yet
Tamrakar 2015
6 pages
Lifting Categorisation Flow Chart Onshore
No ratings yet
Lifting Categorisation Flow Chart Onshore
3 pages
Chapter 2: Modeling: Advanced Topics in Information Retrieval
No ratings yet
Chapter 2: Modeling: Advanced Topics in Information Retrieval
28 pages
Explain Item Normalization?
No ratings yet
Explain Item Normalization?
7 pages
Performance Enhancement and Customization of Information Storage and Retrieval System
No ratings yet
Performance Enhancement and Customization of Information Storage and Retrieval System
32 pages
Boolean and Vector Space Retrieval Models
No ratings yet
Boolean and Vector Space Retrieval Models
27 pages
Chapter 1: Boolean Retrieval
No ratings yet
Chapter 1: Boolean Retrieval
9 pages
AUTOCAD 2020: Presented By: Prakriti Design
No ratings yet
AUTOCAD 2020: Presented By: Prakriti Design
13 pages
Logcat Prev CSC Log
No ratings yet
Logcat Prev CSC Log
282 pages
Introduction To Information Retrieval: Jian-Yun Nie University of Montreal Canada
No ratings yet
Introduction To Information Retrieval: Jian-Yun Nie University of Montreal Canada
61 pages
IR Problem: Introduction To Information Retrieval Outline
No ratings yet
IR Problem: Introduction To Information Retrieval Outline
11 pages
Railway Curves 7
No ratings yet
Railway Curves 7
21 pages
Septic Tank and Soak Pit
No ratings yet
Septic Tank and Soak Pit
4 pages
Unit 3 Basic Processing Unit
No ratings yet
Unit 3 Basic Processing Unit
42 pages
Unit 5 - NEW
No ratings yet
Unit 5 - NEW
20 pages
Virtual Memory
No ratings yet
Virtual Memory
20 pages
23 OT II JEEA TD Paper 1 Sol
No ratings yet
23 OT II JEEA TD Paper 1 Sol
20 pages
Unit I HCI Fundamentals
No ratings yet
Unit I HCI Fundamentals
74 pages
Presentation 2
No ratings yet
Presentation 2
52 pages
Mapping Functions
No ratings yet
Mapping Functions
23 pages
Elliptic Curve Cryptography
No ratings yet
Elliptic Curve Cryptography
21 pages
SRS RPC
No ratings yet
SRS RPC
64 pages
ADC Project Report
No ratings yet
ADC Project Report
7 pages
Java Classes: Introduction and
No ratings yet
Java Classes: Introduction and
38 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
32 pages
Af201 Final Exam Revision Package
No ratings yet
Af201 Final Exam Revision Package
12 pages
Chapter 4: Query Languages: Baeza-Yates, 1999 Modern Information Retrieval
No ratings yet
Chapter 4: Query Languages: Baeza-Yates, 1999 Modern Information Retrieval
29 pages
Abhijith Dushyant EllipticCurveCryptography
No ratings yet
Abhijith Dushyant EllipticCurveCryptography
23 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
48 pages
WBS Assignment
No ratings yet
WBS Assignment
3 pages
Asyg 12 Ltca
100% (1)
Asyg 12 Ltca
22 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
31 pages
AJK Board 10th Class Solved Numericals of Chapter 15, Ilmkidunya
0% (1)
AJK Board 10th Class Solved Numericals of Chapter 15, Ilmkidunya
5 pages
Microwave Lab 1
No ratings yet
Microwave Lab 1
11 pages
CBSE Class 12 Chemistry 2015 Foreign Re Evaluation Subjects Set 1
No ratings yet
CBSE Class 12 Chemistry 2015 Foreign Re Evaluation Subjects Set 1
15 pages
Cs8494 Software Engineering: Course Instructor Mrs.A.Tamizharasi, AP/CSE, RMDEC Mrs. Remya Rose S, AP/CSE, RMDEC
No ratings yet
Cs8494 Software Engineering: Course Instructor Mrs.A.Tamizharasi, AP/CSE, RMDEC Mrs. Remya Rose S, AP/CSE, RMDEC
23 pages
O2 Control Manual
No ratings yet
O2 Control Manual
9 pages
Panti Ramos, Darío. Trabajo de Estadistica Descriptiva e Inferencial
No ratings yet
Panti Ramos, Darío. Trabajo de Estadistica Descriptiva e Inferencial
13 pages
Lecture10 Efficient Scoring
No ratings yet
Lecture10 Efficient Scoring
19 pages
Unit Iii
No ratings yet
Unit Iii
9 pages
Abrasive Grinding Wheels
No ratings yet
Abrasive Grinding Wheels
136 pages
Smart Ambulance
No ratings yet
Smart Ambulance
26 pages
Lab Report. 09: HY - 1104 Hysics Aboratory
No ratings yet
Lab Report. 09: HY - 1104 Hysics Aboratory
6 pages
R.M.D. Engineering College Department of Computer Science and Engineering Cs8079 - Human Computer Interaction Unit Iv Mobile Hci
No ratings yet
R.M.D. Engineering College Department of Computer Science and Engineering Cs8079 - Human Computer Interaction Unit Iv Mobile Hci
11 pages
Inventory Routing Problems: Martin Savelsbergh
No ratings yet
Inventory Routing Problems: Martin Savelsbergh
0 pages
Variables:-: Research in Architecture - Assignment 3
No ratings yet
Variables:-: Research in Architecture - Assignment 3
6 pages
MM Price Ing Issue
No ratings yet
MM Price Ing Issue
2 pages
Chapter 7 More About Polynomials
No ratings yet
Chapter 7 More About Polynomials
15 pages
R.M.D. Engineering College Department of Computer Science and Engineering Cs8079 - Human Computer Interaction Unit-V Web Interface Design
No ratings yet
R.M.D. Engineering College Department of Computer Science and Engineering Cs8079 - Human Computer Interaction Unit-V Web Interface Design
9 pages
Tsurumi KTZ/KTZE Pump
No ratings yet
Tsurumi KTZ/KTZE Pump
2 pages
T8890 Depliant ENG PDF
No ratings yet
T8890 Depliant ENG PDF
2 pages
Solving Sudoku
100% (2)
Solving Sudoku
11 pages
Practice Handout: Polynomial End Behavior
No ratings yet
Practice Handout: Polynomial End Behavior
2 pages
Apps Reviewer
No ratings yet
Apps Reviewer
14 pages
Algorithms and Data Structures: An Easy Guide to Programming Skills
From Everand
Algorithms and Data Structures: An Easy Guide to Programming Skills
Rigdon Jonathan
No ratings yet
Mastering Data Structures and Algorithms in C and C++
From Everand
Mastering Data Structures and Algorithms in C and C++
Sachin Naha
No ratings yet

Introduction To Information Retrieval: Courtesy

Uploaded by

Introduction To Information Retrieval: Courtesy

Uploaded by

Introduction to Information

weight(t,D) = tf(t,D) * idf(t)

 Normalization: Cosine normalization, /max, …

 The removal of stopwords usually improves IR

 crucial to choose stemming/lemmatization rules

light/no stemming severe stemming

e.g. D1  {(comput, 0.2), (architect, 0.3), …}

- Sort in decreasing order of the weight of the word

Given a query = {(t1,b1), (t2,b2)}:

weights after indexing

  P (ti  1 | R) xi P (ti  0 | R ) (1 xi )   pi i (1  pi ) (1 xi )

P ( D | NR)   P (ti  1 | NR ) xi P(ti  0 | NR ) (1 xi )   qi i (1  qi ) (1 xi )

(ri  0.5)( N  Ri  ni  ri  0.5)

 When no sample is available:

 k1, k2, k3, d: parameters

-Precision change w.r.t. Recall (not a fixed point)

Doc1 Y 0.8 - * (0.6, 0.75)

Doc3 Y * (0.2, 0.5)

 rij = rank of the j-th relevant document for Qi

 Relevance judgment for every document-query pair

(desired answers for each query)

 Partial relevance judgments

 Similar experiments are organized in other areas

 Assign a numeric value to each page

 d: damping factor (0.85)

 Many other criteria: e.g. proximity of query words

 But how much it is rewarding to gain 10% improvement!

You might also like