2 & 3: Text Operations
• Such properties of a text collection greatly affect the performance of an IR system and can
be used to select suitable term weights and other aspects of the system.
Word Distribution
• A few words are very common: the 2 most frequent words (e.g. “the”, “of”) can account
for about 10% of word occurrences.
• Most words are very rare: half the words in a corpus appear only once, called
“read only once”.
Word distribution: Zipf's Law
• Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf
(1902–1950), attempts to capture the distribution of the frequencies (i.e., number of
occurrences) of the words within a text.
• For all the words in a collection of documents, for each word w:
  f : the frequency with which w appears
  r : the rank of w in order of frequency (the most frequently occurring word has
  rank 1, etc.)
[Figure: Zipf's distribution — a rank-frequency plot of the sorted word frequencies; each word w has rank r and frequency f, and the curve follows Zipf's law.]
Word distribution: Zipf's Law
• Zipf's Law states that when the distinct words in a text are arranged in decreasing
order of their frequency of occurrence (most frequent words first), the occurrence
characteristics of the vocabulary can be characterized by the constant rank-
frequency law of Zipf.
• If the words w in a collection are ranked by their frequency, with rank r and
frequency f, they roughly fit the relation:
      r * f = c
  – Different collections have different constants c.
• The table shows the most frequently occurring words from a 336,310-document corpus
containing 125,720,891 total words, of which 508,209 are unique.
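As an illustrative sketch (not part of the original slides), the following Python snippet ranks the words of a small placeholder text by frequency and prints r * f for each rank; if the text followed Zipf's law, the product would stay roughly constant.

```python
from collections import Counter

# Toy corpus: any plain-text string could be substituted here (placeholder text).
text = """the quick brown fox jumps over the lazy dog the dog barks
and the fox runs over the hill and the dog sleeps"""

# Count word frequencies and sort them in decreasing order of frequency.
counts = Counter(text.lower().split())
ranked = counts.most_common()

# Zipf's law predicts that r * f is approximately constant across ranks.
for rank, (word, freq) in enumerate(ranked, start=1):
    print(f"rank={rank:2d}  word={word:6s}  f={freq:2d}  r*f={rank * freq}")
```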
More Example: Zipf’s Law
• Illustration of the rank-frequency law. Let the total number of word
occurrences in the sample be N = 1,000,000.

    Rank (R)   Term   Frequency (F)   R·(F/N) = C
       1       the        69,971         0.070
       2       of         36,411         0.073
       3       and        28,852         0.086
       4       to         26,149         0.104
       5       a          23,237         0.116
       6       in         21,341         0.128
       7       that       10,595         0.074
       8       is         10,099         0.081
       9       was         9,816         0.088
      10       he          9,543         0.095
Zipf’s law: modeling word distribution
• Given that the most frequent word occurs f1 times, the collection frequency of the
ith most common term is proportional to 1/i:

      fi ∝ 1/i
– If the most frequent term occurs f1 times, then the second most frequent term
has half as many occurrences, the third most frequent term has a third as
many, etc
• Zipf's Law states that the frequency of the ith most frequent word is
1/i^θ times that of the most frequent word.
  – The occurrence of some event (P), as a function of the rank (i) when the rank is
  determined by the frequency of occurrence, is a power-law function Pi ~ 1/i^θ
  with the exponent θ close to unity.
Methods that Build on Zipf's Law
• Stop lists:
  – Ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words:
  – Take words in between the most frequent (upper cut-off) and least
  frequent words (lower cut-off).
• Term weighting:
  – Give differing weights to terms based on their frequency, with the most
  frequent words weighted less. Used by almost all ranking methods.
Word significance: Luhn’s Ideas
• Luhn Idea (1958): the frequency of word occurrence in a text
furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely
uncommon words were not very useful for indexing.
• For this, Luhn specified two cut-off points, an upper and a lower cut-off, based on
which non-significant words are excluded:
  – Words above the upper cut-off were considered to be common.
  – Words below the lower cut-off were considered to be rare.
  – Hence neither group contributes significantly to the content of the text.
  – The ability of words to discriminate content reaches a peak at a rank-order
  position halfway between the two cut-offs.
• Let f be the frequency of occurrence of words in a text and r their rank in
decreasing order of word frequency; a plot relating f and r yields the following curve.
Luhn’s Ideas
[Figure: Luhn's curve — word frequency f vs. rank r, with upper and lower cut-offs. Luhn (1958) suggested that both extremely common and extremely uncommon words are not very useful for document representation and indexing.]
Vocabulary Growth: Heaps’ Law
• How does the size of the overall vocabulary (number of unique words)
grow with the size of the corpus?
– This determines how the size of the inverted index will scale with the size of the
corpus.
• Heaps' law estimates the number of distinct words (the vocabulary size) in a given corpus.
  – The vocabulary size grows as O(n^β), where β is a constant between 0 and 1.
  – If V is the size of the vocabulary and n is the length of the corpus in words, Heaps'
  law provides the following equation:
        V = K * n^β
  where the constants are typically:
    – K ≈ 10–100
    – β ≈ 0.4–0.6 (approx. square root)
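A minimal sketch, assuming K = 50 and β = 0.5 (values inside the typical ranges quoted above), of how Heaps' law predicts vocabulary growth:

```python
def heaps_vocabulary(n_words: int, K: float = 50.0, beta: float = 0.5) -> int:
    """Predict vocabulary size V = K * n^beta for a corpus of n_words tokens."""
    return int(K * n_words ** beta)

# Example: predicted number of distinct words for corpora of increasing size.
for n in (10_000, 1_000_000, 100_000_000):
    print(f"corpus of {n:>11,} words -> ~{heaps_vocabulary(n):,} distinct words")
```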
Heaps’ distribution
• Distribution of the size of the vocabulary vs. the total number of terms
extracted from a text corpus.
  – Reducing noise means reducing the number of words that can be used to refer to the document.
• Text operation is the task of preprocessing text documents to control the size of the
vocabulary, i.e. the number of distinct words used as index terms:
  – Stemming words: remove affixes (prefixes and suffixes) and group together word
  variants with similar meaning, e.g. swimming → swim.
  – Construction of term categorization structures such as a thesaurus, to capture relationships
  among words and allow expansion of the original query with related terms.
Generating Document Representatives
• Text Processing System
– Input text – full text, abstract or title
– Output – a document representative adequate for use in an automatic retrieval system.
• The document representative consists of a list of class names, each name
representing a class of words occurring in the total input text.
• A document will be indexed by a name if one of its significant words occurs as a member
of that class.
[Diagram: free text → text processing → index terms]
Lexical Analysis/Tokenization of Text
• Tokenization is one of the steps used to convert the text of documents into a
sequence of words, w1, w2, …, wn, to be adopted as index terms.
  – It is the process of demarcating and possibly classifying sections of a string of input
  characters into words.
– For example,
• The quick brown fox jumps over the lazy dog
• Another method: build a stop-word list that contains a set of articles,
pronouns, etc.
  – Why do we need stop lists? With a stop list, we can compare each word against the
  list and exclude the commonest words from the index terms entirely.
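A minimal sketch of tokenization followed by stop-word removal; the regular expression and the tiny stop list are illustrative assumptions, not part of the slides:

```python
import re

# A tiny illustrative stop list; real systems use much larger lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "over"}

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The quick brown fox jumps over the lazy dog")
print(tokens)                      # all tokens
print(remove_stop_words(tokens))   # candidate index terms
```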
• A class name is assigned to a document if and only if one of its members occurs as a
significant word in the text of the document.
  – A document representative then becomes a list of class names, which are often
  referred to as the document's index terms/keywords.
• Queries are handled in the same way.
Ways to implement stemming
There are basically two ways to implement stemming.
–The first approach is to create a big dictionary that maps words to their stems.
• The advantage of this approach is that it works perfectly (insofar as the stem
of a word can be defined perfectly); the disadvantages are the space
required by the dictionary and the investment required to maintain the
dictionary as new words appear.
–The second approach is to use a set of rules that extract stems from words.
• The advantages of this approach are that the code is typically small, and it
can gracefully handle new words; the disadvantage is that it occasionally
makes mistakes.
• But, since stemming is imperfectly defined, anyway, occasional mistakes
are tolerable, and the rule-based approach is the one that is generally
chosen.
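A sketch of the second, rule-based approach, using a deliberately tiny and assumed set of suffix-stripping rules (real systems use a full algorithm such as Porter's stemmer). Note that it stems "swimming" to "swimm" rather than "swim", illustrating the occasional mistakes mentioned above:

```python
# A deliberately tiny rule-based stemmer: strips a few common English suffixes.
# Real stemmers (e.g. Porter's algorithm) use many more rules and conditions.
SUFFIXES = ["ational", "ation", "ment", "ness", "ing", "ers", "er", "ed", "es", "s"]

def stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["swimming", "computer", "computational", "computation"]:
    print(w, "->", stem(w))
# swimming -> swimm (an occasional mistake), the other three -> comput
```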
Stemming: challenges
• May produce unusual stems that are not English words:
– Removing ‘UAL’ from FACTUAL and EQUAL
• May conflate (reduce to the same token) words that are actually distinct:
  – “computer”, “computational”, “computation” are all reduced to the same
  token “comput”.
Terms
• Terms are usually stems. Terms can be also phrases, such as “Information
Technology”, “World Wide Web”, etc.
• Documents and queries are represented as vectors or “bags of words”
(BOW).
– Each vector holds a place for every term in the collection.
  – Position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n,
  so document Di is represented by the vector of weights:
        Di = (wdi1, wdi2, ..., wdin)
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal relevance.
• It can be useful when frequency is not important.
• Binary Weights Formula:

      wij = 1  if freqij > 0
            0  if freqij = 0

    docs  t1  t2  t3
    D1     1   0   1
    D2     1   0   0
    D3     0   1   1
    D4     1   0   0
    D5     1   1   1
    D6     1   1   0
    D7     0   1   0
    D8     0   1   0
    D9     0   0   1
    D10    0   1   1
    D11    1   0   1
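A small sketch deriving binary weights from raw term counts; the counts are the first three rows of the term-frequency table shown on a later slide:

```python
# Binary weighting: w_ij = 1 if term i appears in document j, else 0.
raw_counts = {
    "D1": {"t1": 2, "t2": 0, "t3": 3},
    "D2": {"t1": 1, "t2": 0, "t3": 0},
    "D3": {"t1": 0, "t2": 4, "t3": 7},
}

binary_weights = {
    doc: {term: 1 if freq > 0 else 0 for term, freq in terms.items()}
    for doc, terms in raw_counts.items()
}

print(binary_weights)
# {'D1': {'t1': 1, 't2': 0, 't3': 1}, 'D2': {'t1': 1, ...}, 'D3': {...}}
```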
Why use term weighting?
• Binary weights are too limiting:
  – Terms are either present or absent.
  – They do not allow ordering documents according to their level of relevance for a given
  query.
  – Term weighting enables ranking of retrieved documents, so that the best-matching
  documents are ordered at the top, as they are more relevant than others.
Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term occurs in a document.
      fij = frequency of term i in document j
• The more times a term t occurs in document d, the more likely it is that t is
relevant to the document, i.e. more indicative of the topic.
  – If used alone, it favors common words and long documents.
  – It gives too much credit to words that appear more frequently.
• May want to normalize term frequency (tf) across the entire corpus:
      tfij = fij / max{fij}

    docs  t1  t2  t3
    D1     2   0   3
    D2     1   0   0
    D3     0   4   7
    D4     3   0   0
    D5     1   6   3
    D6     3   5   0
    D7     0   8   0
    D8     0  10   0
    D9     0   0   1
    D10    0   3   5
    D11    4   0   1
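A sketch of maximum-frequency normalization using the table above; it assumes the max in tfij = fij / max{fij} is taken over the terms of each document, which is one common reading of the formula:

```python
# Term-frequency table from the slide: rows = documents, columns = terms t1..t3.
freq = {
    "D1": [2, 0, 3], "D2": [1, 0, 0], "D3": [0, 4, 7], "D4": [3, 0, 0],
    "D5": [1, 6, 3], "D6": [3, 5, 0], "D7": [0, 8, 0], "D8": [0, 10, 0],
    "D9": [0, 0, 1], "D10": [0, 3, 5], "D11": [4, 0, 1],
}

# Normalize each document's counts by its own maximum term frequency.
normalized = {
    doc: [round(f / max(counts), 2) for f in counts]
    for doc, counts in freq.items()
}

print(normalized["D1"])  # [0.67, 0.0, 1.0]
print(normalized["D5"])  # [0.17, 1.0, 0.5]
```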
Document Normalization
• Long documents have an unfair advantage:
– They use a lot of terms
• So they get more matches than short documents
– And they use the same words repeatedly
• So they have much higher term frequencies
• Normalization seeks to remove these effects:
– Related somehow to maximum term frequency.
– But also sensitive to the number of terms.
Document Frequency
• It is defined to be the number of documents in the collection that
contain a term
DF = document frequency
Inverse Document Frequency (IDF)
• IDF measures rarity of the term in collection.
• The IDF is a measure of the general importance of the term
– Inverts the document frequency.
• It diminishes the weight of terms that occur very frequently in the
collection and increases the weight of terms that occur rarely.
– Gives full weight to terms that occur in one document only.
– Gives lowest weight to terms that occur in all documents.
– Terms that appear in many different documents are less indicative of
overall topic.
    idfi = inverse document frequency of term i = log2(N / dfi)   (N: total number of documents)
Inverse Document Frequency
• E.g.: given a collection of 1,000 documents and the document frequencies below,
compute the IDF of each word.

    Word    N     DF    IDF
    the    1000  1000   0
    some   1000   100   3.322
    car    1000    10   6.644
    merge  1000     1   9.966
• IDF provides high values for rare words and low values for common words.
• IDF is an indication of a term’s discrimination power.
• A term occurring frequently in the document but rarely in the rest of the
collection is given high weight.
–The tf-idf value for a term will always be greater than or equal to zero.
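A quick sketch that reproduces the IDF values in the table above using idfi = log2(N / dfi):

```python
import math

N = 1000  # total number of documents in the collection
df = {"the": 1000, "some": 100, "car": 10, "merge": 1}

for word, d in df.items():
    idf = math.log2(N / d)
    print(f"{word:6s} df={d:5d}  idf={idf:.3f}")
# the: 0.000, some: 3.322, car: 6.644, merge: 9.966
```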
TF*IDF weighting
• When does TF*IDF register a high weight? When a term t occurs many times within
a small number of documents.
  – The highest tf*idf values are obtained when a term has a high term frequency (in the
  given document) and a low document frequency (in the whole collection of
  documents);
  – the weights hence tend to filter out common terms,
  – thus lending high discriminating power to those documents.
• A lower TF*IDF is registered when the term occurs fewer times in a document,
or occurs in many documents,
  – thus offering a less pronounced relevance signal.
• The lowest TF*IDF is registered when the term occurs in virtually all documents.
Computing TF-IDF: An Example
• Assume the collection contains 10,000 documents, and statistical analysis shows
that the document frequencies (DF) of three terms are: A(50), B(1300), C(250).
The term frequencies (TF) of these terms are: A(3), B(2), C(1).
Compute TF*IDF for each term.
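A sketch of the requested computation, applying the idfi = log2(N / dfi) formula from the earlier slide; the printed numbers follow from that formula rather than from the original slides:

```python
import math

N = 10_000  # documents in the collection
terms = {           # term: (term frequency in the document, document frequency)
    "A": (3, 50),
    "B": (2, 1300),
    "C": (1, 250),
}

for term, (tf, df) in terms.items():
    idf = math.log2(N / df)
    print(f"{term}: tf={tf}  idf={idf:.3f}  tf*idf={tf * idf:.3f}")
# A: idf≈7.644, tf*idf≈22.93;  B: idf≈2.943, tf*idf≈5.89;  C: idf≈5.322, tf*idf≈5.32
```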
More Example
• Consider a document containing 100 words in which the word cow appears
3 times. Now assume we have 10 million documents and cow appears in
one thousand of them.
  – The term frequency (TF) for cow: 3/100 = 0.03
Concluding remarks
• Hence IDF is incorporated, which diminishes the weight of terms that occur
very frequently in the collection and increases the weight of terms that
occur rarely.
  – This leads to the use of TF*IDF as a better weighting technique.
Similarity Measure
Desiderata for proximity
1. If d1 is near d2, then d2 is near d1.
2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
3. No document is closer to d than d itself.
– Sometimes it is a good idea to determine the maximum possible similarity as the
“distance” between a document d and itself
• Euclidean distance:

      sim(dj, q) = |dj − q| = sqrt( Σi=1..n (wij − wiq)² )

  where wij is the weight of term i in document j and wiq is the weight of
  term i in the query.
• Example: Determine the Euclidean distance between the document 1
vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0).
  – 0 means the corresponding term was not found in the document or query.

      sqrt( (0 − 2)² + (3 − 7)² + (2 − 1)² + (1 − 0)² + (10 − 0)² ) = sqrt(122) ≈ 11.05
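A sketch that reproduces the Euclidean distance computed above:

```python
import math

doc   = [0, 3, 2, 1, 10]
query = [2, 7, 1, 0, 0]

# Euclidean distance: sqrt of the sum of squared differences of term weights.
distance = math.sqrt(sum((d - q) ** 2 for d, q in zip(doc, query)))
print(round(distance, 2))  # 11.05
```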
Inner Product
• Similarity between the vectors for a document dj and a query q can be
computed as the vector inner product:

      sim(dj, q) = dj • q = Σi=1..n (wij · wiq)

  where wij is the weight of term i in document j and wiq is the weight of
  term i in the query q.
• For binary vectors, the inner product is the number of matched query terms
in the document (size of intersection).
• For weighted term vectors, it is the sum of the products of the weights of the
matched terms.
Properties of Inner Product
Inner Product -- Examples
• Binary weights:
  – Size of vector = size of vocabulary = 7

         Retrieval  Database  Term  Computer  Text  Manage  Data
    D        1          1       1       0       1      1      0
    Q        1          0       1       0       0      1      1

    sim(D, Q) = 1·1 + 1·0 + 1·1 + 0·0 + 1·0 + 1·1 + 0·1 = 3

• Term weighted:

         Retrieval  Database  Architecture
    D1       2          3          5
    D2       3          7          1
    Q        1          0          2
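A sketch computing the inner products for the two examples above; the weighted similarities printed at the end (12 and 5) follow from the formula rather than from the slides:

```python
def inner_product(d: list[float], q: list[float]) -> float:
    """Sum of the products of matching term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example (7-term vocabulary).
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))  # 3

# Weighted example (terms: Retrieval, Database, Architecture).
D1, D2, Qw = [2, 3, 5], [3, 7, 1], [1, 0, 2]
print(inner_product(D1, Qw), inner_product(D2, Qw))  # 12 5
```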
Inner Product: Example 1
[Figure: documents d1–d7 positioned relative to the index terms k1, k2, k3]

         k1  k2  k3   q • dj
    d1    1   0   1      2
    d2    1   0   0      1
    d3    0   1   1      2
    d4    1   0   0      1
    d5    1   1   1      3
    d6    1   1   0      2
    d7    0   1   0      1
    q     1   1   1
Inner Product: Exercise
[Figure: documents d1–d7 positioned relative to the index terms k1, k2, k3]

         k1  k2  k3   q • dj
    d1    1   0   1      ?
    d2    1   0   0      ?
    d3    0   1   1      ?
    d4    1   0   0      ?
    d5    1   1   1      ?
    d6    1   1   0      ?
    d7    0   1   0      ?
    q     1   2   3
Cosine similarity
• Measures the similarity between dj and q, captured by the cosine of the
angle θ between them:

      sim(dj, q) = (dj • q) / (|dj| · |q|)
                 = Σi=1..n (wi,j · wi,q) / ( sqrt(Σi=1..n wi,j²) · sqrt(Σi=1..n wi,q²) )

• Or, between two documents:

      sim(dj, dk) = (dj • dk) / (|dj| · |dk|)
                  = Σi=1..n (wi,j · wi,k) / ( sqrt(Σi=1..n wi,j²) · sqrt(Σi=1..n wi,k²) )
[Figure: two-dimensional plot of the document vectors D1, D2 and the query Q, with cos θ1 ≈ 0.74 and cos θ2 ≈ 0.98]

    Terms      D1     D2     D3
    affection  0.996  0.993  0.847
    jealous    0.087  0.120  0.466
    gossip     0.017  0.000  0.254
Cosine Similarity vs. Inner Product
• Cosine similarity measures the cosine of the angle between two
vectors.
• It is the inner product normalized by the vector lengths:

      CosSim(dj, q) = (dj • q) / (|dj| · |q|)
                    = Σi=1..t (wij · wiq) / ( sqrt(Σi=1..t wij²) · sqrt(Σi=1..t wiq²) )

      InnerProduct(dj, q) = dj • q
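A sketch comparing cosine similarity with the plain inner product, using the affection/jealous/gossip weights from the earlier table; the printed similarity values follow from the formulas:

```python
import math

def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Inner product divided by the product of the vector lengths.
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norm

# Term weights (affection, jealous, gossip) for D1, D2, D3 from the table above.
D1 = [0.996, 0.087, 0.017]
D2 = [0.993, 0.120, 0.000]
D3 = [0.847, 0.466, 0.254]

print(round(cosine_similarity(D1, D2), 3))  # ≈ 0.999 (D1 and D2 are very similar)
print(round(cosine_similarity(D1, D3), 3))  # ≈ 0.889 (less similar)
```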