Mizan Tepi University

Tepi campus
Department of IT

Information Storage and Retrieval

ITec3081

Chapter 2
Text/Document Operations
and Automatic Indexing
Document Pre-processing
• Not all words are equally significant for representing the semantics of a document.
• Some words carry more meaning than others, so it is worthwhile to preprocess the text of the documents in the collection to determine the terms to be used as index terms.
Statistical Properties of Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a corpus?
• Such factors affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.
• A few words are very common.
  • The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
• Most words are very rare.
  • About half the words in a corpus appear only once; these are called hapax legomena (Greek for “read only once”).
Zipf’s distributions
Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
• f is the frequency with which w appears
• r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf:
• Frequency * Rank = constant
That is, if the words w in a collection are ranked by their frequency f, with rank r, they roughly fit the relation: r * f = c
Note: Different collections have different constants c.
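The rank-frequency relation can be checked on any token list. The following is a minimal sketch, assuming whitespace tokenization and a toy sentence (both are illustrative assumptions; the helper name `rank_frequency` is hypothetical). On such a tiny text the product r * f is not expected to be constant; the law only shows up roughly on large corpora.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Toy text, assumed for illustration only.
text = "the cat and the dog and the bird saw the cat"
table = rank_frequency(text.split())
for row in table:
    print(row)
```

On a real collection, plotting log(f) against log(r) is a common way to see whether the points fall near a straight line.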
Statistical Properties of Text (cont.)
• Three well-known results describe the statistical properties of words in a text:
  • Zipf’s Law: models the distribution of words in a text corpus
  • Luhn’s idea: measures word significance
  • Heaps’ Law: shows how vocabulary size grows with corpus size
• Such properties of a text collection greatly affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.
Example: Zipf's Law (figure)
Methods that Build on Zipf's Law
• Stop lists:
  • Ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words:
  • Take the words between the most frequent (upper cut-off) and the least frequent (lower cut-off).
• Term weighting:
  • Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
Word significance: Luhn’s Ideas
• Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely uncommon words are not very useful for indexing.
• For this, Luhn specified two cutoff points, an upper and a lower cutoff, on the basis of which non-significant words are excluded:
  • Words exceeding the upper cutoff are considered too common.
  • Words below the lower cutoff are considered too rare.
  • In both cases they do not contribute significantly to the content of the text.
• The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cutoffs.
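Luhn's two cutoffs can be sketched as a simple frequency filter. The cutoff values and the token list below are assumptions chosen for illustration; in practice the cutoffs are tuned per collection, and the function name `significant_words` is hypothetical.

```python
from collections import Counter

def significant_words(tokens, lower_cutoff, upper_cutoff):
    """Keep words whose frequency f satisfies lower_cutoff <= f <= upper_cutoff."""
    counts = Counter(tokens)
    return {w for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff}

tokens = ("the index the term the index the query term the "
          "retrieval index zebra").split()
# Frequencies: the=5, index=3, term=2, query=1, retrieval=1, zebra=1
selected = significant_words(tokens, lower_cutoff=2, upper_cutoff=4)
print(sorted(selected))  # ['index', 'term']
```

Here "the" is dropped as too common and the words that occur once are dropped as too rare.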
Text Operations
• Text operations are the transformations that turn text into a logical representation.
• The main operations for selecting index terms, i.e. for choosing the words/stems (or groups of words) to be used as indexing terms, are:
  • Lexical analysis/tokenization of the text: handling digits, hyphens, punctuation marks, and the case of letters
  • Elimination of stop words: filter out words which are not useful in the retrieval process
  • Stemming: remove affixes (prefixes and suffixes)
  • Construction of term categorization structures, such as a thesaurus, to capture relationships that allow the original query to be expanded with related terms
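The first two operations can be sketched as a small pipeline. This is only one possible set of choices, assumed for illustration: lowercase everything, treat hyphens and punctuation as separators, drop digits, and filter against a tiny hand-made stop list (`STOPWORDS`, `tokenize` and `index_terms` are hypothetical names).

```python
import re

# Tiny illustrative stop list; real lists are much longer and language specific.
STOPWORDS = {"a", "an", "the", "of", "in", "is", "are", "and", "to"}

def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def index_terms(text):
    """Tokenize the text and drop the tokens found in the stop list."""
    return [t for t in tokenize(text) if t not in STOPWORDS]

terms = index_terms("The Retrieval of Text-based documents, in 2024!")
print(terms)  # ['retrieval', 'text', 'based', 'documents']
```

Whether digits are kept or a hyphenated word stays as one token is a design decision of the lexical analysis step; the regular expression above is where such a decision would be changed.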
Elimination of Stopwords
• Stopwords are extremely common words across document collections that have no discriminatory power.
  • They may occur in 80% of the documents in a collection.
  • They are of little value in helping select documents that match a user need, so they need to be filtered out from the potential index terms.
• Examples of stopwords are articles, pronouns, prepositions, conjunctions, etc.:
  • articles (a, an, the)
  • pronouns (I, he, she, it, their, his)
  • some prepositions (on, of, in, about, besides, against, over)
  • conjunctions/connectors (and, but, for, nor, or, so, yet)
  • verbs (is, are, was, were), adverbs (here, there, out, because, soon, after) and adjectives (all, any, each, every, few, many, some) can also be treated as stopwords
• Stopwords are language dependent.
Stopwords
• Intuition:
  • Stopwords take up about 50% of the text. Removing them therefore reduces document size drastically and makes it possible to build smaller indices for information retrieval.
  • Good compression techniques for indices: the 30 most common words account for 30% of the tokens in written text.
  • Better approximation of importance for classification, summarization, etc.
How to detect a stopword?
• One method: sort terms in decreasing order of document frequency and take the most frequent ones.
  • In a collection about insurance practices, “insurance” would be a stop word.
• Another method: build a stop word list that contains a set of articles, pronouns, etc.
• Why do we need stop lists? With a stop list, we can compare against it and exclude the commonest words from the index terms entirely.
• With stopwords removed, we get a better approximation of importance for classification, summarization, etc.
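The first detection method (sort by document frequency, take the top) can be sketched as follows. The three toy documents and the choice of `top_k` are assumptions for illustration, and `stopword_candidates` is a hypothetical name; ties are broken alphabetically so the result is deterministic.

```python
from collections import Counter

def stopword_candidates(docs, top_k):
    """Return the top_k terms with the highest document frequency."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each term once per document
    ranked = sorted(df, key=lambda term: (-df[term], term))
    return ranked[:top_k]

docs = [
    "the insurance policy covers fire",
    "the insurance claim was denied",
    "the premium of the insurance rose",
]
candidates = stopword_candidates(docs, top_k=2)
print(candidates)  # ['insurance', 'the']
```

As on the slide, the domain word "insurance" ends up next to "the" because it occurs in every document of this collection.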
Stemming/Morphological analysis
• Stemming reduces tokens to their “root” form in order to recognize morphological variation.
  • The process involves removal of affixes (i.e. prefixes and suffixes) with the aim of reducing variants to the same stem.
• Stemming often removes the inflectional and derivational morphology of a word.
  • Inflectional morphology: varies the form of a word to express grammatical features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
  • Derivational morphology: makes new words from old ones. E.g. creation is formed from create, but they are two separate words. Likewise, destroy → destruction.
• Stemming is language dependent.
  • Correct stemming is language specific and can be complex.
Stemming
• The final output from a conflation algorithm is a set of classes, one for each stem detected.
  • A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or suffixes).
  • Example: ‘connect’ is the stem for {connected, connecting, connection, connections}.
  • Likewise, {automate, automatic, automation} all reduce to ‘automat’.
• A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
• A document representative then becomes a list of class names, which are often referred to as the document's index terms/keywords.
• Queries are handled in the same way.
Ways to implement stemming
There are basically two ways to implement stemming.
• The first approach is to create a big dictionary that maps words to their stems.
  • The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain it as new words appear.
• The second approach is to use a set of rules that extract stems from words.
  • The advantages of this approach are that the code is typically small and that it handles new words gracefully; the disadvantage is that it occasionally makes mistakes.
  • Since stemming is imperfectly defined anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
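A rule-based stemmer can be as small as a list of suffixes tried longest first. The sketch below is a deliberately naive suffix stripper, not the Porter algorithm or any published stemmer; the suffix list, the `min_stem` guard and the name `strip_suffix` are assumptions for illustration.

```python
# Suffix rules, longest first so that "ions" is tried before "ion" and "s".
SUFFIXES = ["ations", "ation", "ions", "ion", "ing", "ed", "s"]

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["connected", "connecting", "connection", "connections", "connect"]:
    print(w, "->", strip_suffix(w))  # every variant maps to "connect"

print(strip_suffix("news"))  # "new": an example of an occasional mistake
```

The code is short and handles words it has never seen, but the last line shows the trade-off: the rule that strips a final "s" also conflates "news" with "new".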
Stemming: challenges
• May produce unusual stems that are not English words:
  • Removing ‘UAL’ from FACTUAL and EQUAL gives FACT (a word) but EQ (not a word).
• May conflate (reduce to the same token) words that are actually distinct:
  • “computer”, “computational”, “computation” are all reduced to the same token “comput”.
• May not recognize all morphological derivations.

Term extraction (Term weighting and similarity measures)
• Terms are usually stems. Terms can also be phrases, such as “Information Retrieval”, “World Wide Web”, etc.
• Documents and queries are represented as vectors or “bags of words” (BOW).
  • Each vector holds a place for every term in the collection.
  • Position 1 corresponds to term 1, position 2 to term 2, position n to term n:
    $D_i = (w_{d_i 1}, w_{d_i 2}, \ldots, w_{d_i n})$
    $Q = (w_{q1}, w_{q2}, \ldots, w_{qn})$
    with $w = 0$ if a term is absent.
• Documents are represented by binary or non-binary weighted vectors of terms.
Document Collection
• A collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or simply does not occur in it.

      T1    T2    …    Tt
D1    w11   w21   …    wt1
D2    w12   w22   …    wt2
:     :     :          :
Dn    w1n   w2n   …    wtn
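A matrix with this layout (one row per document, one column per term) can be built directly from tokenized documents. The sketch below uses raw counts as the weights and three toy documents; both are assumptions for illustration, and `term_document_matrix` is a hypothetical name.

```python
def term_document_matrix(docs):
    """Return (vocab, matrix); matrix[j][i] is the count of term i in document j."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    matrix = [[doc.split().count(term) for term in vocab] for doc in docs]
    return vocab, matrix

docs = ["text retrieval text", "database retrieval", "text database database"]
vocab, matrix = term_document_matrix(docs)
print(vocab)  # ['database', 'retrieval', 'text']
for row in matrix:
    print(row)
```

A zero entry means the term does not occur in that document; any other weighting scheme would replace the raw counts cell by cell.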
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal relevance.
• It can be useful when frequency is not important.

docs   t1   t2   t3
D1     1    0    1
D2     1    0    0
D3     0    1    1
D4     1    0    0
D5     1    1    1
D6     1    1    0
D7     0    1    0
D8     0    1    0
D9     0    0    1
D10    0    1    1
D11    1    0    1

• Binary weights formula:
  $w_{ij} = \begin{cases} 1 & \text{if } freq_{ij} > 0 \\ 0 & \text{if } freq_{ij} = 0 \end{cases}$
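The binary formula is a threshold on the raw counts. As a small check, the sketch below applies it to the raw frequencies of D1 to D11 taken from the term-frequency table that appears later in these slides; the variable names are assumptions for illustration.

```python
# Raw term frequencies for D1..D11 over (t1, t2, t3), from the TF table in these slides.
freq = [
    [2, 0, 3], [1, 0, 0], [0, 4, 7], [3, 0, 0], [1, 6, 3], [3, 5, 0],
    [0, 8, 0], [0, 10, 0], [0, 0, 1], [0, 3, 5], [4, 0, 1],
]

# Binary weight: 1 if the term occurs at all, 0 otherwise.
binary = [[1 if f > 0 else 0 for f in row] for row in freq]
for name, row in zip((f"D{j}" for j in range(1, 12)), binary):
    print(name, row)
```

The result reproduces the binary table above, which shows what is lost: D8 (ten occurrences of t2) and D7 (eight occurrences) become indistinguishable from any document that mentions t2 once.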

Why use term weighting?
• Binary weights are too limiting.
  • Terms are either present or absent.
  • They do not allow documents to be ordered according to their level of relevance for a given query.
• Non-binary weights make it possible to model partial matching.
  • Partial matching allows retrieval of docs that approximate the query.
• Term weighting improves the quality of the answer set.
  • Term weighting enables ranking of the retrieved documents, so that the best-matching documents are ordered at the top as they are more relevant than the others.
Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term occurs in a document.
  fij = frequency of term i in document j

docs   t1   t2   t3
D1     2    0    3
D2     1    0    0
D3     0    4    7
D4     3    0    0
D5     1    6    3
D6     3    5    0
D7     0    8    0
D8     0    10   0
D9     0    0    1
D10    0    3    5
D11    4    0    1

• The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of the topic.
  • If used alone, it favors common words and long documents.
  • It gives too much credit to words that appear more frequently.
• May want to normalize term frequency (tf) by the largest term frequency in the same document:
  tfij = fij / max{fkj}, where the maximum is taken over all terms k in document j
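A minimal sketch of this normalization, assuming the maximum is taken over the terms of the same document (the reading used by the worked TF*IDF example later in these slides); `normalize_tf` is a hypothetical helper name.

```python
def normalize_tf(freqs):
    """Divide each raw frequency by the largest frequency in the same document."""
    peak = max(freqs)
    if peak == 0:
        return [0.0 for _ in freqs]  # empty document: all weights stay zero
    return [f / peak for f in freqs]

d5 = normalize_tf([1, 6, 3])  # row D5 of the table
print(d5)                     # t2 gets 1.0, t3 gets 0.5, t1 gets 1/6
```

After normalization every document's most frequent term has tf = 1, so a long document no longer outweighs a short one simply by repeating words.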
Document Normalization
• Long documents have an unfair advantage:
  • They use a lot of terms, so they get more matches than short documents.
  • They use the same words repeatedly, so they have much higher term frequencies.
• Normalization seeks to remove these effects:
  • It is related in some way to the maximum term frequency.
  • It is also sensitive to the number of terms.
• If we don't normalize, short documents may not be recognized as relevant.
Document Frequency
• DF (document frequency) is defined as the number of documents in the collection that contain a term.
  • The count is taken over the whole collection of documents.
  • The less frequently a term appears in the whole collection, the more discriminating it is.
  dfi = document frequency of term i = number of documents containing term i
Inverse Document Frequency (IDF)
• IDF measures the rarity of a term in the collection. It is a measure of the general importance of the term.
  • It inverts the document frequency.
• It diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.
  • Gives full weight to terms that occur in one document only.
  • Gives the lowest weight to terms that occur in all documents.
  • Terms that appear in many different documents are less indicative of the overall topic.
  idfi = inverse document frequency of term i = log2(N / dfi)   (N: total number of documents)
Inverse Document Frequency: Example
• E.g.: given a collection of 1000 documents and the document frequency of each word, compute the IDF of each word.

Word    N      DF     IDF
the     1000   1000   0
some    1000   100    3.322
car     1000   10     6.644
merge   1000   1      9.966

• IDF provides high values for rare words and low values for common words.
• IDF is an indication of a term's discrimination power.
• The log is used to dampen the effect relative to tf.
• Exercise: what is the difference between document frequency and corpus frequency?
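The table can be reproduced with one line of arithmetic per word. The sketch below uses the base-2 logarithm from the IDF definition in these slides; the function name `idf` and the rounding to three decimals are assumptions for illustration.

```python
import math

def idf(n_docs, df):
    """Inverse document frequency: log2(N / df)."""
    return math.log2(n_docs / df)

table = {word: round(idf(1000, df), 3)
         for word, df in [("the", 1000), ("some", 100), ("car", 10), ("merge", 1)]}
print(table)  # {'the': 0.0, 'some': 3.322, 'car': 6.644, 'merge': 9.966}
```

Each tenfold drop in document frequency adds the same amount (log2 10, about 3.32) to the IDF, which is the dampening effect of the logarithm.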
TF*IDF Weighting
• The most widely used term-weighting scheme is tf*idf:
  wij = tfij * idfi = tfij * log2(N / dfi)
• A term occurring frequently in the document but rarely in the rest of the collection is given a high weight.
• The tf-idf value for a term is always greater than or equal to zero.
• Experimentally, tf*idf has been found to work well.
• It is often used in the vector space model together with cosine similarity to determine the similarity between two documents.
TF*IDF weighting
• When does TF*IDF register a high weight? When a term t occurs many times within a small number of documents.
  • The highest tf*idf is reached when a term has a high term frequency (in the given document) and a low document frequency (in the whole collection of documents).
  • The weights hence tend to filter out common terms,
  • thus lending high discriminating power to those documents.
• A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents,
  • thus offering a less pronounced relevance signal.
• The lowest TF*IDF is registered when the term occurs in virtually all documents.
Computing TF-IDF: An Example
Assume a collection contains 10,000 documents and statistical analysis shows that the document frequencies (DF) of three terms are A(50), B(1300), C(250), and that the term frequencies (TF) of these terms in a document are A(3), B(2), C(1). Compute TF*IDF for each term, normalizing TF by the largest term frequency (3).
A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.322; tf*idf = 1.774
• The query vector is typically treated as a document and is also tf-idf weighted.
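The three weights can be recomputed in a few lines. This sketch assumes, as the worked example does, that tf is the raw count divided by the largest count in the document and that idf uses log base 2; the variable names are illustrative.

```python
import math

N = 10_000                               # documents in the collection
df = {"A": 50, "B": 1300, "C": 250}      # document frequencies
raw_tf = {"A": 3, "B": 2, "C": 1}        # raw term counts in one document

max_tf = max(raw_tf.values())
weights = {t: (raw_tf[t] / max_tf) * math.log2(N / df[t]) for t in df}
for term, w in weights.items():
    print(term, round(w, 3))  # A 7.644, B 1.962, C 1.774
```

Note that B's weight comes out as 1.962 only when tf is kept as the exact fraction 2/3; multiplying the rounded 0.67 by 2.943 gives a slightly different figure.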
More Example
• Consider a document containing 100 words wherein the word technology appears 3 times. Now, assume we have 10 million documents and technology appears in one thousand of these.
• The term frequency (TF) for technology, here normalized by document length: 3/100 = 0.03
• The inverse document frequency is log2(10,000,000 / 1,000) = log2(10,000) = 13.288
• The TF*IDF score is the product of these two values: 0.03 * 13.288 = 0.3986
Similarity Measure
• A similarity measure is a function that computes the degree of similarity or distance between a document vector and a query vector.
• Using a similarity measure between the query and each document:
  • It is possible to rank the retrieved documents in the order of presumed relevance.
  • It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
• Documents that are “close together” in the vector space talk about the same things and are more similar to each other than to other documents.
  1. If d1 is near d2, then d2 is near d1.
  2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
  3. No document is closer to d than d itself.
• A similarity measure attempts to compute the distance between the document vector dj and the query vector q.
  • The assumption here is that documents whose vectors are close to the query vector are more relevant to the query than documents whose vectors are far from it.
Similarity Measure: Techniques
• Euclidean distance
  • It is the most common distance measure. Euclidean distance is the square root of the sum of squared differences between the corresponding term weights of a document and a query.
• Dot product
  • The dot product is also known as the scalar product or inner product.
  • It is defined as the sum of the products of the corresponding term weights of the query and document vectors.
• Cosine similarity (or normalized inner product)
  • It projects the document and query vectors into a term space and calculates the cosine of the angle between them.
Euclidean distance
• The distance between the vectors for the document dj and the query q can be computed as:
  $sim(d_j, q) = |d_j - q| = \sqrt{\sum_{i=1}^{n} (w_{ij} - w_{iq})^2}$
  where wij is the weight of term i in document j and wiq is the weight of term i in the query.
• Example: Determine the Euclidean distance between the document 1 vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0). A 0 means the corresponding term is not found in the document or query.
  $\sqrt{(0-2)^2 + (3-7)^2 + (2-1)^2 + (1-0)^2 + (10-0)^2} = \sqrt{122} \approx 11.05$
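The same calculation as a short function, using the document and query vectors from the example; `euclidean_distance` is an illustrative name, and the vectors are assumed to have equal length.

```python
import math

def euclidean_distance(d, q):
    """Square root of the sum of squared differences between term weights."""
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

dist = euclidean_distance([0, 3, 2, 1, 10], [2, 7, 1, 0, 0])
print(round(dist, 2))  # 11.05
```

Because this is a distance, smaller values mean a closer match; a ranking built on it sorts documents in ascending order.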
Inner Product
• Similarity between the vectors for the document dj and the query q can be computed as the vector inner product:
  $sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{n} w_{ij} \, w_{iq}$
  where wij is the weight of term i in document j and wiq is the weight of term i in the query q.
• For binary vectors, the inner product is the number of matched query terms in the document (size of the intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.
Inner Product -- Examples
• Binary weight:
  • Size of vector = size of vocabulary = 7

     Retrieval  Database  Term  Computer  Text  Manage  Data
D    1          1         1     0         1     1       0
Q    1          0         1     0         0     1       1

  sim(D, Q) = 3

• Term weighted:

     Retrieval  Database  Architecture
D1   2          3         5
D2   3          7         1
Q    1          0         2

  sim(D1, Q) = 2*1 + 3*0 + 5*2 = 12
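Both examples can be checked with a one-line inner product. The vectors are the ones from the tables above; the function name `inner_product` is illustrative, and the value for D2 is computed here the same way as for D1.

```python
def inner_product(d, q):
    """Sum of the products of corresponding term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example: counts the query terms that also occur in the document.
binary_sim = inner_product([1, 1, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1])

# Weighted example.
q = [1, 0, 2]
sim_d1 = inner_product([2, 3, 5], q)
sim_d2 = inner_product([3, 7, 1], q)
print(binary_sim, sim_d1, sim_d2)  # 3 12 5
```

D1 ranks above D2 for this query even though D2 has the single largest weight (7), because that weight belongs to a term the query does not use.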

Inner Product: Example 2
(Figure: documents d1 to d7 positioned relative to the index terms k1, k2, k3.)

     k1   k2   k3   q · dj
d1   1    0    1    2
d2   1    0    0    1
d3   0    1    1    2
d4   1    0    0    1
d5   1    1    1    3
d6   1    1    0    2
d7   0    1    0    1

q    1    1    1
Cosine similarity
• Measures the similarity between two vectors d1 and d2 by the cosine of the angle x between them. For a document dj and a query q:
  $sim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{n} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{n} w_{i,q}^2}}$
• Or, for two documents dj and dk:
  $sim(d_j, d_k) = \frac{d_j \cdot d_k}{|d_j| \, |d_k|} = \frac{\sum_{i=1}^{n} w_{i,j} \, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$
• The denominator involves the lengths of the vectors:
  $\text{Length } |d_j| = \sqrt{\sum_{i=1}^{n} w_{i,j}^2}$
• So the cosine measure is also known as the normalized inner product.
Example: Computing Cosine Similarity
• Let's say we have the query vector Q = (0.4, 0.8) and the document D1 = (0.2, 0.7). Compute their cosine similarity.
  $sim(Q, D_1) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.424}} \approx \frac{0.64}{0.651} \approx 0.98$
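The example can be verified with a small function. This is a sketch using plain Python lists; the name `cosine_similarity` and the convention of returning 0.0 for a zero-length vector are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Inner product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (norm_a * norm_b)

sim = cosine_similarity([0.4, 0.8], [0.2, 0.7])
print(round(sim, 2))  # 0.98
```

Scaling either vector by a positive constant leaves the result unchanged, which is why the cosine measure is less sensitive to document length than the raw inner product.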

Questions, Ambiguities, Doubts, … ???
