
Chapter Two

Text Operations and Automatic Indexing


Statistical Properties of Text

• Text operations refer to a range of processes and techniques used to manipulate, analyze, and transform text data. These operations are fundamental in fields like information retrieval, natural language processing (NLP), and text analytics.
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a corpus?
• There are three well-known researchers who defined statistical properties of words in a text:
  – Zipf's Law: models word distribution in a text corpus
  – Luhn's idea: measures word significance
  – Heaps' Law: shows how vocabulary size grows with the growth of the corpus size
• Such properties of a text collection greatly affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.
Word Distribution
• A few words are very common. The 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences.
• Most words are very rare. Half the words in a corpus appear only once (words "read only once").
Word distribution: Zipf's Law
• Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950), attempts to capture the distribution of the frequencies (i.e., number of occurrences) of the words within a text.
• For all the words in a collection of documents, for each word w:
  – f: the frequency with which w appears
  – r: the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: Zipf's distribution — sorted word frequencies plotted as frequency f versus rank r; a word w has rank r and frequency f.]
Word distribution: Zipf's Law
• Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf.
• If the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation:
    r * f = c
  – Different collections have different constants c.
• The table shows the most frequently occurring words from a 336,310-document corpus containing 125,720,891 total words, out of which 508,209 are unique words.
More Example: Zipf's Law
• Illustration of the Rank-Frequency Law. Let the total number of word occurrences in the sample be N = 1,000,000.

  Rank (R)   Term   Frequency (F)   R*(F/N) = C
  1          the        69,971        0.070
  2          of         36,411        0.073
  3          and        28,852        0.086
  4          to         26,149        0.104
  5          a          23,237        0.116
  6          in         21,341        0.128
  7          that       10,595        0.074
  8          is         10,099        0.081
  9          was         9,816        0.088
  10         he          9,543        0.095
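A minimal Python sketch (not from the slides) that recomputes C = R*(F/N) from the counts in the table above:

```python
# Check of the rank-frequency law using the counts from the table above:
# R * (F / N) should stay roughly constant (here around 0.07-0.13).
N = 1_000_000  # total word occurrences in the sample

table = [
    (1, "the", 69_971), (2, "of", 36_411), (3, "and", 28_852),
    (4, "to", 26_149), (5, "a", 23_237), (6, "in", 21_341),
    (7, "that", 10_595), (8, "is", 10_099), (9, "was", 9_816),
    (10, "he", 9_543),
]

for rank, term, freq in table:
    c = rank * freq / N
    print(f"{rank:2d}  {term:5s}  {freq:6d}  C = {c:.3f}")
```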
Zipf's law: modeling word distribution
• Given that the most frequent word occurs f1 times, the collection frequency of the i-th most common term is proportional to 1/i:

    f_i ∝ 1/i

  – If the most frequent term occurs f1 times, then the second most frequent term has half as many occurrences, the third most frequent term has a third as many, etc.
• Zipf's Law states that the frequency of the i-th most frequent word is 1/i^θ times that of the most frequent word.
  – The occurrence of some event (P), as a function of the rank (i) when the rank is determined by the frequency of occurrence, is a power-law function P_i ~ 1/i^θ with the exponent θ close to unity.
Methods that Build on Zipf's Law
• Stop lists:
  – Ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words:
  – Take words in between the most frequent (upper cut-off) and least frequent words (lower cut-off), as in the sketch below.
• Term weighting:
  – Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
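A minimal sketch of applying the two cut-offs to keep only "significant" words, assuming a simple in-memory word count and illustrative threshold values (real systems tune these per collection):

```python
from collections import Counter

def select_significant_terms(tokens, upper_cutoff=0.01, lower_cutoff=2):
    """Keep terms between the two cut-offs: drop the most frequent
    words (candidate stop words) and the very rare ones.
    upper_cutoff: drop a word if it accounts for more than this fraction
                  of all token occurrences (illustrative threshold).
    lower_cutoff: drop a word occurring fewer than this many times."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        w for w, f in counts.items()
        if f >= lower_cutoff and f / total <= upper_cutoff
    }

# Example usage with a toy token stream
tokens = "the cat sat on the mat the cat ran the dog barked once".split()
print(select_significant_terms(tokens, upper_cutoff=0.2, lower_cutoff=2))  # {'cat'}
```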
Word significance: Luhn's Ideas
• Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely uncommon words were not very useful for indexing.
• For this, Luhn specified two cut-off points, an upper and a lower cut-off, based on which non-significant words are excluded.
  – The words exceeding the upper cut-off were considered to be common.
  – The words below the lower cut-off were considered to be rare.
  – Hence they do not contribute significantly to the content of the text.
  – The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cut-offs.
• Let f be the frequency of occurrence of words in a text, and r their rank in decreasing order of word frequency; then a plot relating f and r yields the following curve.
Luhn’s Ideas

 Luhn (1958) suggested that both extremely common and extremely uncommon words
were not very useful for document representation & indexing.
Vocabulary Growth: Heaps' Law
• How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
  – This determines how the size of the inverted index will scale with the size of the corpus.
• Heaps' law estimates the vocabulary size of a given corpus:
  – The vocabulary size grows by O(n^β), where β is a constant between 0 and 1.
  – If V is the size of the vocabulary and n is the length of the corpus in words, Heaps' law provides the following equation:

    V = K * n^β

  where the constants are typically:
  – K ≈ 10–100
  – β ≈ 0.4–0.6 (approx. square-root)
Heaps' distribution
• Distribution of the size of the vocabulary vs. the total number of terms extracted from a text corpus.
• Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Can you agree? (See the sketch below.)
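A minimal sketch of a Heaps' law estimate, assuming illustrative values K = 50 and β = 0.5 from the ranges above (the actual constants are collection dependent):

```python
def heaps_vocabulary(n_tokens, K=50, beta=0.5):
    """Heaps' law estimate V = K * n^beta.
    K and beta are illustrative values from the typical ranges above."""
    return K * n_tokens ** beta

# e.g. the 125,720,891-token collection from the Zipf example
print(int(heaps_vocabulary(125_720_891)))    # ~560,000 (the real collection had 508,209 unique words)
print(int(heaps_vocabulary(1_000_000_000)))  # ~1.6 million with K=50, beta=0.5
```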
Text Operations
• Not all words in a document are equally significant for representing the contents/meaning of the document.
  – Some words carry more meaning than others.
  – Nouns are the most representative of a document's content.
• Therefore, we need to preprocess the text of the documents in a collection to select the terms to be used as index terms.
• Using the set of all words in a collection to index documents creates too much noise for the retrieval task.
  – Reducing noise means reducing the number of words that can be used to refer to the document.
• Text operation is the task of preprocessing text documents to control the size of the vocabulary, i.e. the number of distinct words used as index terms.
  – Preprocessing will lead to an improvement in information retrieval performance.
• However, some search engines on the Web omit preprocessing.
  – Every word in the document is an index term.
Text Operations
• Text operations are the process of transforming text into logical representations.
• There are 5 main operations for selecting index terms, i.e. for choosing the words/stems (or groups of words) to be used as indexing terms:
  – Lexical analysis/tokenization of the text: generate a set of words from the text collection.
  – Elimination of stop words: filter out words which are not useful in the retrieval process.
    • e.g. original sentence: "The quick brown fox jumps over the lazy dog"
    • After elimination: "quick brown fox jumps lazy dog"
  – Stemming of words: remove affixes (prefixes and suffixes) and group together word variants with similar meaning.
    • e.g. swimming → swim
  – Construction of term categorization structures, such as a thesaurus, to capture relationships among words, allowing the expansion of the original query with related terms.
Generating Document Representatives
• Text processing system
  – Input: text — full text, abstract or title
  – Output: a document representative adequate for use in an automatic retrieval system.
• The document representative consists of a list of class names, each name representing a class of words occurring in the total input text.
• A document will be indexed by a name if one of its significant words occurs as a member of that class.

  Document corpus (free text) → Tokenization → Stop word removal → Stemming → Thesaurus → Index terms
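A minimal sketch of this pipeline in Python, assuming a tiny hand-made stop list and a crude stand-in stemmer (a real system would use a full stop list and a proper stemmer such as Porter's):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "over", "is", "and"}  # tiny illustrative list

def simple_stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def document_representative(text):
    """Tokenize -> remove stop words -> stem, yielding candidate index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())          # lexical analysis
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word elimination
    return [simple_stem(t) for t in tokens]               # stemming

print(document_representative("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
```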
Lexical Analysis/Tokenization of Text
• Tokenization is one of the steps used to convert the text of the documents into a sequence of words, w1, w2, …, wn, to be adopted as index terms.
  – It is the process of demarcating and possibly classifying sections of a string of input characters into words.
  – For example:
    • The quick brown fox jumps over the lazy dog
• The objective of tokenization is identifying the words in the text.
  – What counts as a word?
    • Is it a sequence of alphabetic, numeric, or alphanumeric characters?
  – How do we identify the set of words that exist in a text document?
• Tokenization issues:
  – numbers, hyphens, punctuation marks, apostrophes, …
Issues in Tokenization
• Two words may be connected by hyphens.
  – Should two words connected by hyphens or punctuation marks be taken as one word or two? Break up a hyphenated sequence into two tokens?
    • In most cases the hyphen should be broken up (e.g. state-of-the-art → state of the art), but some words, e.g. MS-DOS, B-49, are unique words which require hyphens.
• Two words may be connected by punctuation marks.
  – Remove punctuation marks entirely unless they are significant, e.g. in program code: x.exe vs. xexe.
• Two words may be separated by space.
  – E.g. Addis Ababa, San Francisco, Los Angeles
• Two words may be written in different ways:
  – lowercase, lower-case, lower case?
  – data base, database, data-base?
Issues in Tokenization
• Numbers: are numbers/digits words, and should they be used as index terms?
  – dates (3/12/91 vs. Mar. 12, 1991)
  – phone numbers (+251923415005)
  – IP addresses (100.2.86.144)
  – Generally, don't index numbers as text; most numbers are not good index terms (like 1910, 1999).
• What about the case of letters (e.g. Data vs. data vs. DATA)?
  – Case is usually not important, so everything is converted to upper or lower case. Which one is mostly used by human beings?
• The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens, as in the sketch below.
• Issues of tokenization are language specific.
  – They require the language to be known.
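A minimal sketch of this simplest tokenization policy, using a regular expression over lowercased text (the example input is illustrative and reuses terms discussed above):

```python
import re

def simple_tokenize(text):
    """The 'simplest approach' above: case-fold and keep only unbroken runs
    of alphabetic characters; numbers and punctuation are discarded."""
    return re.findall(r"[a-z]+", text.lower())

# Note how hyphenated terms, file names and numbers are split or lost:
print(simple_tokenize("MS-DOS runs on the B-49; see x.exe at 100.2.86.144 on 3/12/91."))
# ['ms', 'dos', 'runs', 'on', 'the', 'b', 'see', 'x', 'exe', 'at', 'on']
```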
Tokenization
• Analyze text into a sequence of discrete tokens (words).
• Input: “Friends, Romans and Countrymen”
• Output: Tokens (an instance of a sequence of characters that are grouped
together as a useful semantic unit for processing)
– Friends
– Romans
– and
– Countrymen
• Each such token is now a candidate for an index entry, after further
processing
• But what are valid tokens to emit as index terms?
• The cat slept peacefully in the living room. It’s a very old cat.
Exercise: Tokenization
• The instructor (Dr. O’Neill) thinks that the boys’ stories about Chile’s
capital aren’t amusing.
Elimination of Stop words
• Stop words are extremely common words across document collections that have no
discriminatory power
– They may occur in 80% of the documents in a collection.
• Stop words have little semantic content; it is typical to remove such high-frequency words.
  – They appear to be of little value in helping select documents matching a user need and need to be filtered out as potential index terms.
• Examples of stop words are articles, prepositions, conjunctions, etc.:
– Articles (a, an, the);
– Pronouns: (I, he, she, it, their, his)
– Some prepositions (on, of, in, about, besides, against),
– Conjunctions/ connectors (and, but, for, nor, or, so, yet),
– Verbs (is, are, was, were),
– Adverbs (here, there, out, because, soon, after) and
– Adjectives (all, any, each, every, few, many, some) can also be treated as stop words
Stop words
• Intuition:
  – Stop words take up about 50% of the text. Hence, removing them reduces document size drastically, enabling the system to build smaller indices for information retrieval.
  – Good compression techniques for indices: the 30 most common words account for about 30% of the tokens in written text.
• Better approximation of importance for classification, summarization, etc.
• Stop words are language dependent.
How to detect a stop word?
• One method: sort terms in decreasing order of document frequency and take the most frequent ones (see the sketch below).
  – In a collection about insurance practices, "insurance" would be a stop word.
• Another method: build a stop word list that contains a set of articles, pronouns, etc.
  – Why do we need stop lists? With a stop list, we can compare and exclude from the index terms entirely the commonest words.
• With the removal of stop words, we get a better approximation of importance for classification, summarization, etc.
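A minimal sketch of the first method, assuming documents are plain strings and an illustrative top_k cut-off:

```python
from collections import Counter

def stop_list_by_df(documents, top_k=10):
    """Sort terms by document frequency (number of documents containing the
    term) and take the most frequent ones as candidate stop words."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))   # count each term once per document
    return [term for term, _ in df.most_common(top_k)]

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "a cat and a dog played in the garden"]
print(stop_list_by_df(docs, top_k=3))
# e.g. ['the', 'cat', 'dog'] -- 'the' and 'cat' occur in all 3 toy documents
```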
Stop words
• Stop word elimination used to be standard in older IR systems.
• But the trend is away from doing this. Most web search engines index stop
words:
–Good query optimization techniques mean you pay little at query time for
including stop words.
–You need stop words for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
–Elimination of stop words might reduce recall (e.g. “To be or not to be” –
all eliminated except “be” – no or irrelevant retrieval)
Normalization
• Normalization is canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
  – We need to "normalize" terms in the indexed text as well as query terms into the same form.
  – Example: we want to match U.S.A. and USA, by deleting periods in a term (see the sketch below).
• Case folding: it is often best to lowercase everything, since users will use lowercase regardless of 'correct' capitalization.
  – Republican vs. republican
  – mekaneselam vs. Mekaneselam vs. MEKANESELAM
  – Anti-discriminatory vs. antidiscriminatory
  – Car vs. automobile?
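A minimal normalization sketch, assuming case folding plus removal of periods and hyphens is enough for the examples above (real normalizers are more careful, e.g. with proper names):

```python
def normalize(token):
    """Case folding plus removal of periods and hyphens, so that
    U.S.A. / USA and Anti-discriminatory / antidiscriminatory
    map to the same index term."""
    return token.lower().replace(".", "").replace("-", "")

for t in ["U.S.A.", "USA", "Anti-discriminatory", "antidiscriminatory", "Republican"]:
    print(t, "->", normalize(t))
```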
Normalization issues
• Good for
– Allow instances of Automobile at the beginning of a sentence to match
with a query of automobile
– Helps a search engine when most users type ferrari when they are
interested in a Ferrari car
• Bad for
– Proper names vs. common nouns
• E.g. General Motors, Associated Press, …
• Solution:
– lowercase only words at the beginning of the sentence
• In IR, lowercasing is most practical because of the way users issue their
queries
Stemming/Morphological analysis
• Stemming reduces tokens to their "root" form to recognize morphological variation.
  – The process involves removal of affixes (i.e. prefixes and suffixes) with the aim of reducing variants to the same stem.
• Stemming often removes the inflectional and derivational morphology of a word.
  – Inflectional morphology: varies the form of words in order to express grammatical features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
  – Derivational morphology: makes new words from old ones. E.g. creation is formed from create, but they are two separate words. Likewise, destruction → destroy.
• Stemming is language dependent.
  – Correct stemming is language specific and can be complex.
  – Example: "for example compressed and compression are both accepted" becomes "for example compress and compress are both accept".
Stemming
• The final output of a conflation algorithm is a set of classes, one for each stem detected.
  – A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or suffixes).
  – Example: 'connect' is the stem for {connected, connecting, connection, connections}.
  – Thus, [automate, automatic, automation] all reduce to: automat.
• A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
  – A document representative then becomes a list of class names, which are often referred to as the document's index terms/keywords.
• Queries: queries are handled in the same way.
Ways to implement stemming
There are basically two ways to implement stemming.
–The first approach is to create a big dictionary that maps words to their stems.
• The advantage of this approach is that it works perfectly (insofar as the stem
of a word can be defined perfectly); the disadvantages are the space
required by the dictionary and the investment required to maintain the
dictionary as new words appear.
–The second approach is to use a set of rules that extract stems from words (see the sketch below).
  • The advantages of this approach are that the code is typically small and it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes.
  • But, since stemming is imperfectly defined anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
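A minimal rule-based sketch with a tiny, hand-picked suffix list; real stemmers such as Porter's use much richer rule sets and conditions:

```python
# A toy rule-based stemmer in the spirit of the second approach:
# an ordered list of (suffix, replacement) stripping rules.
RULES = [("ions", ""), ("ion", ""), ("ing", ""), ("ed", ""), ("ies", "y"), ("s", "")]

def rule_stem(word, min_stem_len=3):
    """Apply the first matching rule, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)] + replacement
    return word

for w in ["connected", "connecting", "connection", "connections", "automation"]:
    print(w, "->", rule_stem(w))
# connected/connecting/connection/connections -> connect, automation -> automat
```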
Stemming: challenges
• May produce unusual stems that are not English words:
– Removing ‘UAL’ from FACTUAL and EQUAL

• May conflate (reduce to the same token) words that are actually distinct.
  – "computer", "computational", "computation" are all reduced to the same token "comput".
• May not recognize all morphological derivations.

Language-specificity
• Many of the above features embody transformations that
are
– Language-specific and
– Often, application-specific

• These are “plug-in” addenda to the indexing process


• Both open source and commercial plug-ins are available
for handling these
Chapter Three
Term Extraction: Term Weighting and Similarity Measures
Terms
• Terms are usually stems. Terms can also be phrases, such as "Information Technology", "World Wide Web", etc.
• Documents and queries are represented as vectors or "bags of words" (BOW).
  – Each vector holds a place for every term in the collection.
  – Position 1 corresponds to term 1, position 2 to term 2, …, position n to term n.

    D_i = (w_di1, w_di2, ..., w_din)
    Q   = (w_q1, w_q2, ..., w_qn)

  – w = 0 if a term is absent.
• Documents are represented by binary weights or non-binary weighted vectors of terms.
Bag of words
• The bag-of-words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears.
• For example, consider the following two sentences:
  – Sentence 1: "The cat sat on the hat"
  – Sentence 2: "The dog ate the cat and the hat"
• From these two sentences, our vocabulary is as follows:
  – {the, cat, sat, on, hat, dog, ate, and}
• The feature vector for Sentence 1 is: {2, 1, 1, 1, 1, 0, 0, 0}
• The feature vector for Sentence 2 is: {3, 1, 0, 0, 1, 1, 1, 1}
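A minimal bag-of-words sketch reproducing the two vectors above, assuming the fixed vocabulary order shown:

```python
import re

vocabulary = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]

def bow_vector(sentence, vocab):
    """Count how many times each vocabulary term appears in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [tokens.count(term) for term in vocab]

print(bow_vector("The cat sat on the hat", vocabulary))          # [2, 1, 1, 1, 1, 0, 0, 0]
print(bow_vector("The dog ate the cat and the hat", vocabulary)) # [3, 1, 0, 0, 1, 1, 1, 1]
```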
Document Collection
• A collection of n documents can be represented in the vector space model by a
term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document;
zero means the term has no significance in the document or it simply doesn’t exist
in the document.

T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn

Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal relevance.
• It can be useful when frequency is not important.
• Binary weights formula:

    freq_ij = 1 if freq_ij > 0
              0 if freq_ij = 0

  docs  t1  t2  t3
  D1     1   0   1
  D2     1   0   0
  D3     0   1   1
  D4     1   0   0
  D5     1   1   1
  D6     1   1   0
  D7     0   1   0
  D8     0   1   0
  D9     0   0   1
  D10    0   1   1
  D11    1   0   1
Why use term weighting?
• Binary weights are too limiting.
  – Terms are either present or absent.
  – They do not allow ordering documents according to their level of relevance for a given query.
• Non-binary weights allow us to model partial matching.
  – Partial matching allows retrieval of documents that approximate the query.
• Term weighting improves the quality of the answer set.
  – Term weighting enables ranking of retrieved documents, such that the best matching documents are ordered at the top as they are more relevant than others.
Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term occurs in a document.
    f_ij = frequency of term i in document j
• The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of the topic.
  – If used alone, it favors common words and long documents.
  – It gives too much credit to words that appear more frequently.
• We may want to normalize term frequency (tf) across the entire corpus:
    tf_ij = f_ij / max{f_ij}

  docs  t1  t2  t3
  D1     2   0   3
  D2     1   0   0
  D3     0   4   7
  D4     3   0   0
  D5     1   6   3
  D6     3   5   0
  D7     0   8   0
  D8     0  10   0
  D9     0   0   1
  D10    0   3   5
  D11    4   0   1
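A minimal sketch of this normalization, reading the formula as dividing each document's counts by that document's largest count (as in the worked TF*IDF example later):

```python
# Raw counts from the term-document table above.
raw_counts = {
    "D1": [2, 0, 3], "D2": [1, 0, 0], "D3": [0, 4, 7], "D4": [3, 0, 0],
    "D5": [1, 6, 3], "D6": [3, 5, 0], "D7": [0, 8, 0], "D8": [0, 10, 0],
    "D9": [0, 0, 1], "D10": [0, 3, 5], "D11": [4, 0, 1],
}

# tf_ij = f_ij / max{f_ij}: scale each document's counts by its largest count.
tf = {doc: [f / max(freqs) for f in freqs] for doc, freqs in raw_counts.items()}
print([round(v, 2) for v in tf["D1"]])   # [0.67, 0.0, 1.0]
```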
Document Normalization
• Long documents have an unfair advantage:
  – They use a lot of terms, so they get more matches than short documents.
  – And they use the same words repeatedly, so they have much higher term frequencies.
• Normalization seeks to remove these effects:
  – It is related somehow to the maximum term frequency,
  – but is also sensitive to the number of terms.
• If we don't normalize, short documents may not be recognized as relevant.
Problems with term frequency
• We need a mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance/meaning determination.
• Scale down the term weight of terms with high collection frequency.
  – Reduce the tf weight of a term by a factor that grows with the collection frequency.
• More common for this purpose is document frequency:
  – how many documents in the collection contain the term.
• The example shows that collection frequency and document frequency behave differently.
Document Frequency
• Document frequency is defined as the number of documents in the collection that contain a term.
  – Count the frequency considering the whole collection of documents.
  – The less frequently a term appears in the whole collection, the more discriminating it is.

    df_i = document frequency of term i
         = number of documents containing term i
Inverse Document Frequency (IDF)
• IDF measures rarity of the term in collection.
• The IDF is a measure of the general importance of the term
– Inverts the document frequency.
• It diminishes the weight of terms that occur very frequently in the
collection and increases the weight of terms that occur rarely.
– Gives full weight to terms that occur in one document only.
– Gives lowest weight to terms that occur in all documents.
– Terms that appear in many different documents are less indicative of
overall topic.
idfi = inverse document frequency of term i,
= log2 (N/ df i) (N: total number of documents)
Inverse Document Frequency
• E.g.: given a collection of 1000 documents and the document frequencies below, compute the IDF for each word.

  Word    N     DF    IDF
  the     1000  1000  0
  some    1000  100   3.322
  car     1000  10    6.644
  merge   1000  1     9.966

• IDF provides high values for rare words and low values for common words.
• IDF is an indication of a term's discrimination power.
• The log is used to dampen the effect relative to tf.
• What is the difference between document frequency and collection (corpus) frequency?
TF*IDF Weighting
• The most used term-weighting scheme is tf*idf:
    w_ij = tf_ij * idf_i = tf_ij * log2 (N / df_i)
• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
  – The tf-idf value for a term will always be greater than or equal to zero.
• Experimentally, tf*idf has been found to work well.
  – It is often used in the vector space model together with cosine similarity to determine the similarity between two documents.
TF*IDF weighting
• When does TF*IDF register a high weight? When a term t occurs many times within a small number of documents.
  – The highest tf*idf for a term shows that the term has a high term frequency (in the given document) and a low document frequency (in the whole collection of documents);
  – the weights hence tend to filter out common terms,
  – thus lending high discriminating power to those documents.
• A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents,
  – thus offering a less pronounced relevance signal.
• The lowest TF*IDF is registered when the term occurs in virtually all documents.
Computing TF-IDF: An Example
• Assume the collection contains 10,000 documents and statistical analysis shows that the document frequencies (DF) of three terms are: A(50), B(1300), C(250). The term frequencies (TF) of these terms in a given document are: A(3), B(2), C(1). Compute TF*IDF for each term.

  A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644; tf*idf = 7.644
  B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
  C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.322; tf*idf = 1.774

• The query vector is typically treated as a document and is also tf-idf weighted.
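A minimal sketch that reproduces the three values above, assuming normalized TF (f / max f) and a base-2 logarithm for IDF:

```python
import math

def tf_idf(tf_raw, max_tf, df, n_docs):
    """tf*idf with normalized tf and log2 idf, as in the example above."""
    tf = tf_raw / max_tf
    idf = math.log2(n_docs / df)
    return tf, idf, tf * idf

N = 10_000
for term, tf_raw, df in [("A", 3, 50), ("B", 2, 1300), ("C", 1, 250)]:
    tf, idf, w = tf_idf(tf_raw, max_tf=3, df=df, n_docs=N)
    print(f"{term}: tf={tf:.2f} idf={idf:.3f} tf*idf={w:.3f}")
# A: tf=1.00 idf=7.644 tf*idf=7.644
# B: tf=0.67 idf=2.943 tf*idf=1.962
# C: tf=0.33 idf=5.322 tf*idf=1.774
```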
More Example
• Consider a document containing 100 words wherein the word cow appears 3 times. Now, assume we have 10 million documents and cow appears in one thousand of these.
  – The term frequency (TF) for cow: 3/100 = 0.03
  – The inverse document frequency: log2(10,000,000 / 1,000) = 13.288
  – The TF*IDF score is the product of these frequencies: 0.03 * 13.288 ≈ 0.399
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in the corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term.

  Word      C  TW  TD  DF  TF  IDF  TF*IDF
  airplane  5  46   3   1
  blue      1  46   3   1
  chair     7  46   3   3
  computer  3  46   3   1
  forest    2  46   3   1
  justice   7  46   3   3
  love      2  46   3   1
  might     2  46   3   1
  perl      5  46   3   2
  rose      6  46   3   3
  shoe      4  46   3   1
  thesis    2  46   3   2
Concluding remarks
• Suppose from a set of English documents we wish to determine which ones are the most relevant to the query "the brown cow".
• A simple way to start out is by eliminating documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents.
• To further distinguish them, we might count the number of times each term occurs in each document and sum them all together;
  – the number of times a term occurs in a document is called its TF. However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more, without giving enough weight to the more meaningful terms "brown" and "cow".
  – Also, the term "the" is not a good keyword to distinguish relevant and non-relevant documents, while terms like "brown" and "cow" that occur rarely are good keywords to distinguish relevant documents from the non-relevant ones.
Concluding remarks
• Hence IDF is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.
  – This leads to the use of TF*IDF as a better weighting technique.
• On top of that we apply similarity measures to calculate the distance between document i and query j.
• There are a number of similarity measures; the most common similarity measures are:
  – Euclidean distance, inner (dot) product, cosine similarity, Dice similarity, Jaccard similarity, etc.
Similarity Measure
• We now have vectors for all documents in the collection and a vector for the query; how do we compute similarity?
• A similarity measure is a function that computes the degree of similarity or distance between a document vector and a query vector.
• Using a similarity measure between the query and each document:
  – It is possible to rank the retrieved documents in the order of presumed relevance.
  – It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
[Figure: query vector Q and document vectors D1, D2 plotted in a term space with axes t1, t2, t3.]
Similarity Measure
Desiderata for proximity:
1. If d1 is near d2, then d2 is near d1.
2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
3. No document is closer to d than d itself.
  – Sometimes it is a good idea to determine the maximum possible similarity as the "distance" between a document d and itself.
• A similarity measure attempts to compute the distance between a document vector w_j and a query vector w_q.
  – The assumption here is that documents whose vectors are close to the query vector are more relevant to the query than documents whose vectors are far away from the query vector.
Similarity Measure: Techniques
• Euclidean distance
  – It is the most common distance measure. Euclidean distance examines the root of the squared differences between the coordinates of a pair of document and query terms.
• Dot product
  – The dot product is also known as the scalar product or inner product.
  – It is defined as the sum of the products of the corresponding query and document term weights.
• Cosine similarity (or normalized inner product)
  – It projects document and query vectors into a term space and calculates the cosine of the angle between them.
Euclidean distance
• Similarity between the vectors for document d_j and query q can be computed as:

    sim(d_j, q) = |d_j − q| = sqrt( Σ_{i=1..n} (w_ij − w_iq)^2 )

  where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query.
• Example: determine the Euclidean distance between the document vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0). A 0 means the corresponding term is not found in the document or query.

    sqrt( (0−2)^2 + (3−7)^2 + (2−1)^2 + (1−0)^2 + (10−0)^2 ) = sqrt(122) ≈ 11.05
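A minimal sketch of the Euclidean distance computation, reproducing the example above:

```python
import math

def euclidean_distance(doc, query):
    """Square root of the sum of squared differences of the term weights."""
    return math.sqrt(sum((w_d - w_q) ** 2 for w_d, w_q in zip(doc, query)))

# Document (0, 3, 2, 1, 10) vs. query (2, 7, 1, 0, 0)
print(round(euclidean_distance([0, 3, 2, 1, 10], [2, 7, 1, 0, 0]), 2))   # 11.05
```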
Inner Product
• Similarity between the vectors for document d_j and query q can be computed as the vector inner product:

    sim(d_j, q) = d_j · q = Σ_{i=1..n} w_ij · w_iq

  where w_ij is the weight of term i in document j and w_iq is the weight of term i in query q.
• For binary vectors, the inner product is the number of matched query terms in the document (size of the intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.
Properties of Inner Product
• Favors long documents with a large number of unique terms.
  – Again, the issue of normalization.
• Measures how many terms matched but not how many terms are not matched.
Inner Product -- Examples
• Binary weights (size of vector = size of vocabulary = 7):

        Retrieval  Database  Term  Computer  Text  Manage  Data
    D       1         1       1       0       1      1      0
    Q       1         0       1       0       0      1      1

    sim(D, Q) = 1*1 + 1*0 + 1*1 + 0*0 + 1*0 + 1*1 + 0*1 = 3

• Term weighted:

        Retrieval  Database  Architecture
    D1      2         3          5
    D2      3         7          1
    Q       1         0          2
Inner Product: Example 1
[Figure: documents d1–d7 positioned by the index terms k1, k2, k3 they contain.]

        k1  k2  k3   q · d_j
    d1   1   0   1      2
    d2   1   0   0      1
    d3   0   1   1      2
    d4   1   0   0      1
    d5   1   1   1      3
    d6   1   1   0      2
    d7   0   1   0      1

    q    1   1   1
Inner Product: Exercise
[Figure: documents d1–d7 positioned by the index terms k1, k2, k3 they contain.]

        k1  k2  k3   q · d_j
    d1   1   0   1      ?
    d2   1   0   0      ?
    d3   0   1   1      ?
    d4   1   0   0      ?
    d5   1   1   1      ?
    d6   1   1   0      ?
    d7   0   1   0      ?

    q    1   2   3
Cosine similarity
• Measures the similarity between two vectors (e.g. document d_j and query q) captured by the cosine of the angle θ between them:

    sim(d_j, q) = (d_j · q) / (|d_j| |q|)
                = Σ_{i=1..n} w_ij * w_iq / ( sqrt(Σ_{i=1..n} w_ij^2) * sqrt(Σ_{i=1..n} w_iq^2) )

• Or, between two documents d_j and d_k:

    sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|)
                  = Σ_{i=1..n} w_ij * w_ik / ( sqrt(Σ_{i=1..n} w_ij^2) * sqrt(Σ_{i=1..n} w_ik^2) )

• The denominator involves the lengths of the vectors, so the cosine measure is also known as the normalized inner product.

    Length |d_j| = sqrt( Σ_{i=1..n} w_ij^2 )
Example: Computing Cosine Similarity
• Let's say we have a query vector Q = (0.4, 0.8) and a document vector D1 = (0.2, 0.7). Compute their similarity using the cosine measure.

    sim(Q, D1) = (0.4*0.2 + 0.8*0.7) / sqrt( (0.4^2 + 0.8^2) * (0.2^2 + 0.7^2) )
               = 0.64 / sqrt(0.8 * 0.53)
               = 0.64 / 0.65
               ≈ 0.98
Example: Computing Cosine Similarity
• Let's say we have two documents in our corpus, D1 = (0.8, 0.3) and D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8), determine which document is the most relevant one for the query.

    cos θ1 = sim(D1, Q) ≈ 0.73
    cos θ2 = sim(D2, Q) ≈ 0.98

• D2 makes a smaller angle with Q, so D2 is more relevant to the query than D1.
[Figure: Q, D1 and D2 plotted in a two-term space; θ1 is the angle between D1 and Q, θ2 the angle between D2 and Q.]
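A minimal cosine similarity sketch reproducing the two scores above:

```python
import math

def cosine_similarity(d, q):
    """Inner product divided by the product of the vector lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    return dot / (math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q)))

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(round(cosine_similarity(D1, Q), 2))   # 0.73 -> cos(theta_1)
print(round(cosine_similarity(D2, Q), 2))   # 0.98 -> cos(theta_2), so D2 is ranked first
```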
Example
• Given three documents D1, D2 and D3 with the corresponding TF-IDF weights below, which documents are most similar according to the three similarity measures?

  Terms      D1     D2     D3
  affection  0.996  0.993  0.847
  jealous    0.087  0.120  0.466
  gossip     0.017  0.000  0.254
Cosine Similarity vs. Inner Product
• Cosine similarity measures the cosine of the angle between two vectors: it is the inner product normalized by the vector lengths.

    CosSim(d_j, q) = (d_j · q) / (|d_j| |q|)
                   = Σ_{i=1..t} (w_ij * w_iq) / ( sqrt(Σ_{i=1..t} w_ij^2) * sqrt(Σ_{i=1..t} w_iq^2) )

    InnerProduct(d_j, q) = d_j · q

    D1 = 2T1 + 3T2 + 5T3     CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
    D2 = 3T1 + 7T2 + 1T3     CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
    Q  = 0T1 + 0T2 + 2T3

• D1 is 6 times better than D2 using cosine similarity but only 5 times better using the inner product.
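A minimal sketch comparing the two measures on the example above:

```python
import math

def inner_product(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

def cosine(d, q):
    return inner_product(d, q) / (
        math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q)))

D1 = (2, 3, 5)
D2 = (3, 7, 1)
Q  = (0, 0, 2)

print(inner_product(D1, Q), inner_product(D2, Q))        # 10 2   -> ratio 5
print(round(cosine(D1, Q), 2), round(cosine(D2, Q), 2))  # 0.81 0.13 -> ratio ~6
```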
Exercises
• A database collection consists of 1 million documents, of which 200,000 contain the term holiday and 250,000 contain the term season. A document repeats holiday 7 times and season 5 times. It is known that holiday is repeated more than any other term in the document.
• Calculate the weight of both terms in this document using the following term weighting methods:
  (i) normalized and unnormalized TF;
  (ii) TF*IDF based on normalized and unnormalized TF.
