KEN2570-5-Search and IR
Information Retrieval

• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
- These days we frequently think first of web search, but there are many other cases:
  - E-mail search
  - Searching your laptop
  - Corporate knowledge bases
  - Legal information retrieval

The classic search model

User task: Get rid of mice in a politically correct way
  | (Misconception?)
Info need: Info about removing mice without killing them
  | (Misformulation?)
Query: how trap mice alive
  |
Search engine -> Results over the Collection, with query refinement looping back to the query
Challenges & Characteristics

• Dynamically generated content
• New pages get added all the time
- The size of the web (or textual content in general) doubles every few minutes
• Users (usually) revise and revisit their queries
• Queries are not extremely long
- They used to be very short (up to 2 words)
• Probably a large number of typos
• A small number of popular queries
- A long tail of infrequent ones
• Almost no use of advanced query operators

Text Retrieval Is Hard!

• Under/over-specified query
- Ambiguous: "buying CDs" (money or music?)
- Incomplete: what kind of CDs?
- What if "CD" is never mentioned in documents?
• Vague semantics of documents
- Ambiguity: e.g., word-sense, structural
- Incomplete: inferences required
• Hard even for people!
- Only ~80% agreement in human judgments of relevant results
Term-document incidence matrix: 1 if the play contains the word, 0 otherwise.

             Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony       1                      1               0             0        0         1
Brutus       1                      1               0             1        0         0
Caesar       1                      1               0             1        1         1
Calpurnia    0                      1               0             0        0         0
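The incidence matrix above supports Boolean retrieval directly: each term is a 0/1 vector over the plays, and a Boolean query becomes elementwise logic on those vectors. A minimal sketch for the classic query "Brutus AND Caesar AND NOT Calpurnia":

```python
# Boolean retrieval over the term-document incidence matrix above.
# Each term maps to a 0/1 vector over the six plays; Boolean queries
# become elementwise logical operations on those vectors.

PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

INCIDENCE = {
    "brutus":    [1, 1, 0, 1, 0, 0],
    "caesar":    [1, 1, 0, 1, 1, 1],
    "calpurnia": [0, 1, 0, 0, 0, 0],
}

def brutus_and_caesar_not_calpurnia():
    hits = []
    for i, play in enumerate(PLAYS):
        if (INCIDENCE["brutus"][i] and INCIDENCE["caesar"][i]
                and not INCIDENCE["calpurnia"][i]):
            hits.append(play)
    return hits

print(brutus_and_caesar_not_calpurnia())
# → ['Antony and Cleopatra', 'Hamlet']
```

Real systems store the same information as an inverted index (a sorted postings list per term) and intersect postings rather than scanning dense vectors.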
• Generally we want to penalize long documents, but there are two reasons a document can be long:
- A doc is long because it uses more words
- A doc is long because it has more content
• One option is to normalize term frequency by document length:

  tf(t, d) = freq(t, d) / Σ_t' freq(t', d)

• Dividing a vector by its L2 norm makes it a unit (length) vector
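The length-normalized term frequency above is just each term's count divided by the total number of tokens in the document; a minimal sketch:

```python
from collections import Counter

def normalized_tf(doc_tokens):
    """Term frequency divided by document length, so long documents
    are not favored just for repeating words."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())  # = len(doc_tokens)
    return {term: c / total for term, c in counts.items()}

tf = normalized_tf(["to", "be", "or", "not", "to", "be"])
# "to" and "be" each get 2/6, "or" and "not" each get 1/6
```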
• idf has no effect on ranking one-term queries
- idf affects the ranking of documents for queries with at least two terms
- For the query "capricious person", idf weighting makes occurrences of the rare term capricious count for much more in the final ranking than occurrences of the common term person

Example tf-idf weights (partial matrix):

             Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Caesar       8.59                   2.54            0             1.51     0.25      0
Calpurnia    0                      1.54            0             0        0         0
Cleopatra    2.85                   0               0             0        0         0
mercy        1.51                   0               1.9           0.12     5.25      0.88
worser       1.37                   0               0.11          4.15     0.25      1.95
• Cosine similarity of vectors x and y:

  cos(x, y) = ( Σ_{i=1..|V|} x_i · y_i ) / ( sqrt(Σ_{i=1..|V|} x_i²) · sqrt(Σ_{i=1..|V|} y_i²) )

• These are very sparse vectors – most entries are zero
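Because the vectors are sparse, they are usually stored as term→weight maps, and the dot product only iterates over terms that actually occur. A sketch of the cosine formula above on that representation:

```python
import math

def cosine(x, y):
    """Cosine similarity for sparse vectors stored as {term: weight}
    dicts; only terms present in both vectors contribute to the dot
    product, so most of the (zero) entries are never touched."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    if nx == 0.0 or ny == 0.0:
        return 0.0  # convention for empty documents
    return dot / (nx * ny)
```

Identical vectors score 1.0; vectors with no terms in common score 0.0.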
• BM25 ranking function:

  RSV_BM25 = Σ_{i∈q} log(N / df_i) · ( (k1 + 1) · tf_i ) / ( k1 · ((1 − b) + b · (dl / avdl)) + tf_i )

  where dl = document length (|d|) and avdl = average document length in the whole collection

• Example query with BM25: "president lincoln"
- "president" is in 40,000 documents in the collection (df_president = 40000)
- "lincoln" is in 300 documents in the collection (df_lincoln = 300)
- The document length is 90% of the average length (dl/avdl = 0.9)
- Let's assume k1 = 1.2, b = 0.75

  tf_president,d   tf_lincoln,d   Score(q, d)
       15               25          20.66
       15                1          12.74
       15                0           5.00
        1               25          18.2
        0               25          15.66

  The low-df term plays a bigger role.

• What could retrieval further take into account?
- Taking into account the meaning of the words used.
  - Any external info, e.g. WordNet?
  - Word vectors?
- Taking into account the order of words in the query.
- Adapting to the user based on direct or indirect feedback.
- Taking into account the authority of the source.
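A sketch of the BM25 formula above applied to the "president lincoln" example. The collection size N is not stated on the slide, so N = 500,000 is an assumed value; exact scores depend on N and on the idf variant used, but the ranking of the five example documents comes out in the same order as in the table:

```python
import math

def bm25_score(query_terms, tf, df, N, dl_over_avdl, k1=1.2, b=0.75):
    """BM25 as in the formula above: for each query term, idf times a
    saturating tf component with document-length normalization."""
    K = k1 * ((1 - b) + b * dl_over_avdl)
    score = 0.0
    for t in query_terms:
        if tf.get(t, 0) == 0:
            continue  # absent terms contribute nothing
        idf = math.log(N / df[t])
        score += idf * ((k1 + 1) * tf[t]) / (K + tf[t])
    return score

# Slide's example: df_president = 40000, df_lincoln = 300, dl/avdl = 0.9.
# N = 500,000 is an assumption, not given on the slide.
df = {"president": 40000, "lincoln": 300}
rows = [(15, 25), (1, 25), (0, 25), (15, 1), (15, 0)]  # table order, best first
scores = [bm25_score(["president", "lincoln"],
                     {"president": p, "lincoln": l}, df,
                     N=500_000, dl_over_avdl=0.9)
          for p, l in rows]
```

The saturation in the tf component is the key point: raising tf_lincoln from 1 to 25 adds far more than raising tf_president the same way, because the low-df term carries a much larger idf weight.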
Integrating multiple features to determine relevance

• Modern systems – especially on the Web – use a great number of features:
- Arbitrary useful features – not a single unified model
- Log frequency of query word in anchor text?
- Query word in color on page?
- # of images on page?
- # of (out) links on page?
- PageRank of page?
- URL length?
- URL contains "~"?
- Page edit recency?
- Page length?
• Google is using over 200 such features for the rankings and constantly updates the algorithm (SEO, paid advertisements, etc.)

How to combine features to assign a relevance score to a document?

• Given lots of relevant features…
• You can continue to hand-engineer retrieval scores
• Or, you can build a classifier to learn weights for the features
- Requires: labeled training data
- This is the "learning to rank" approach, which has become a hot area in recent years (esp. with deep models)
- We only provide an elementary introduction here
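A hand-engineered score is simply a fixed weighted sum of such features. The feature names and weights below are illustrative inventions, not any search engine's actual values:

```python
# A hand-engineered relevance score: a fixed weighted sum of features.
# The feature names and weights are illustrative, not a real system's.

def hand_engineered_score(features):
    weights = {
        "bm25": 1.0,           # text-match score
        "pagerank": 0.5,       # link-based authority
        "anchor_log_tf": 0.3,  # log frequency of query word in anchor text
        "url_length": -0.01,   # mild penalty for long URLs
    }
    return sum(weights[f] * features.get(f, 0.0) for f in weights)
```

The learning-to-rank alternative replaces these hand-picked weights with weights fit to labeled (query, document, relevant?) data.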
• Train a machine learning model to predict the class r (relevant R vs. non-relevant N) of a document-query pair

[Figure: labeled R/N training examples plotted in a two-dimensional feature space, cosine score (vertical axis, 0 to 0.025) against term proximity ω (horizontal axis, 2 to 5)]
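The two-feature setup in the figure can be sketched as a logistic regression trained from scratch on labeled relevant (1) / non-relevant (0) pairs. The training points below are made up for illustration; the learned model then assigns a relevance probability to any new (cosine score, proximity) pair:

```python
import math

# Learning to rank in its simplest form: logistic regression over two
# features (cosine score, term proximity), trained on labeled
# relevant (1) / non-relevant (0) document-query pairs.
# The training points are invented for illustration.

train = [  # (cosine_score, term_proximity, label)
    (0.020, 5, 1), (0.022, 4, 1), (0.018, 5, 1), (0.015, 4, 1),
    (0.005, 2, 0), (0.008, 3, 0), (0.003, 2, 0), (0.010, 2, 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w1 = w2 = b = 0.0
lr = 0.5
for _ in range(2000):  # plain stochastic gradient descent on log-loss
    for x1, x2, y in train:
        g = sigmoid(w1 * x1 + w2 * x2 + b) - y  # prediction error
        w1 -= lr * g * x1
        w2 -= lr * g * x2
        b -= lr * g

def relevance_prob(cosine_score, proximity):
    return sigmoid(w1 * cosine_score + w2 * proximity + b)
```

Production learning-to-rank systems use pairwise or listwise objectives and far richer models, but the core idea is the same: weights come from labeled data, not hand-tuning.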
Privacy in IR

• Personalization is an important topic in information retrieval; after all, we'd like our search results to be relevant to us and our interests.
• Let's google "marguerite". What is the first search result? Would you expect another person - say, someone in the USA - to get the same search result?
- Think of other examples of personalization based on location, search and browsing history, or social media.
• What are potential benefits and risks of getting personalized searches? Is it okay that search engines are using our data to personalize our searches? Or is there a limit to what kind of data should be okay for search engines to use?
• In 2009, the French government signed the "Charter of good practices on the right to be forgotten on social networks and search engines."
- Do you think people should have the right to remove information about themselves from the web (the right to be forgotten)?
- Do you think Google should be required to remove information about an individual upon request?

Recap

• Information Retrieval Challenges
• Retrieval models
- Boolean
- TF-IDF
- BM25
• Evaluation metrics for retrieval tasks
References

• IR Chapters: 1, 2.1-2.2, 6.2-6.4.3, 8.1-8.5
• Bias in IR:
- https://www.theverge.com/2022/5/11/23064883/google-ai-skin-tone-measure-monk-scale-inclusive-search-results
- https://blogs.bing.com/search-quality-insights/february-2018/toward-a-more-intelligent-search-bing-multi-perspective-answers
- https://www.tandfonline.com/doi/abs/10.1080/00913367.1990.10673179
- https://journals.sagepub.com/doi/10.1177/002193479902900303
- https://psycnet-apa-org.stanford.idm.oclc.org/fulltext/2020-42793-001.html
- https://journals.sagepub.com/doi/abs/10.1177/1090198120957949