Chapter 2: Text Operations and Automatic Indexing
The IR system contains a collection of documents, with each document represented by a sequence
of tokens. Markup may indicate titles, authors, and other structural elements.
Many Web pages are actually spam — malicious pages deliberately posing as something that they
are not in order to attract unwarranted attention of a commercial or other nature.
Other problems derive from the scale of the Web: trillions of pages distributed among millions of
hosts. In order to index these pages, they must be gathered from across the Web by a crawler and stored
locally by the search engine for processing.
Because many pages may change daily or hourly, this snapshot of the Web must be refreshed on a
regular basis. While gathering data the crawler may detect duplicates and near-duplicates of pages,
which must be dealt with appropriately.
Another consideration is the volume and variety of queries commercial Web search engines
receive, which directly reflect the volume and variety of information on the Web itself. Queries are often
short, typically one or two terms; Google, for example, handles tens of thousands of queries per second.
The crawler component has the primary responsibility for identifying and obtaining documents
for the search engine. There are a number of different types of crawlers, but the most common
is the general web crawler. A web crawler is designed to follow the links on web pages to discover
and download new pages. Although this sounds deceptively simple, there are significant challenges in
designing a web crawler that can efficiently handle the huge volume of new pages on the Web, while
at the same time ensuring that pages that may have changed since the last time a crawler visited a site
are kept “fresh” for the search engine. A web crawler can be restricted to a single site, such as a
university, as the basis for site search. This type of crawler may be used by a vertical or topical search
application, such as a search engine that provides access to medical information on web pages.
For enterprise search, the crawler is adapted to discover and update all documents and web pages
related to a company’s operation. An enterprise document crawler follows links to discover both
external and internal (i.e., restricted to the corporate intranet) pages, but must also scan both corporate
and personal directories to identify email, word processing documents, presentations, database
records, and other company information.
Document crawlers are also used for desktop search, although in this case only the user’s personal
directories need to be scanned.
Feeds: Document feeds are a mechanism for accessing a real-time stream of documents. For example,
a news feed is a constant stream of news stories and updates. In contrast to a crawler, which must
discover new documents, a search engine acquires new documents from a feed simply by monitoring
it. Content such as news, blogs, and video is commonly distributed through web feeds; a feed reader
monitors those feeds and provides new content when it arrives. Radio and television feeds are also used
in some search applications, where the “documents” contain automatically segmented audio and video
streams, together with associated text from closed captions or speech recognition.
Conversion: The documents found by a crawler or provided by a feed are rarely in plain text. Instead,
they come in a variety of formats, such as HTML, XML, Adobe PDF, Microsoft Word, Microsoft
PowerPoint, and so on. Search engines therefore require that these documents be converted into a
consistent text-plus-metadata format.
Document data store: It is a database used to manage large numbers of documents and the
structured data that is associated with them. The document contents are typically stored in
compressed form for efficiency. The structured data consists of document metadata and other
information extracted from the documents, such as links and anchor text. To summarize the role of the
crawler: its task is to gather copies of web pages from across the Web and store them locally for the
search engine to process.
Formulated in the 1940s, Zipf’s law states that, given a corpus of natural language utterances, the
frequency of any word is inversely proportional to its rank in the frequency table. This can be
empirically validated by plotting the frequency of words in large textual corpora, as done for instance
in a well-known experiment with the Brown Corpus. Formally, if the words in a document collection
are ordered according to a ranking function r(w) in decreasing order of frequency f (w), the following
holds: 𝒓(𝒘) × 𝒇(𝒘) = 𝒄, where c is a language-dependent constant. In English collections, for
instance, c can be approximated as 10.
Zipf explained the law by his “principle of least effort”: it is easier for a speaker or writer of a
language to repeat certain words than to coin new and different ones, and the resulting distribution
balances the speaker’s desire for a small vocabulary against the hearer’s desire for a large one.
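The law is easy to check empirically. The sketch below (plain Python; the corpus file name is hypothetical) ranks words by frequency and prints rank times relative frequency, which Zipf's law predicts to be roughly constant; whether the constant comes out near 0.1 or near 10 depends on whether relative frequencies or percentages are used.

```python
from collections import Counter
import re

def zipf_check(text, top=10):
    """Rank words by frequency and print rank * relative frequency,
    which Zipf's law predicts to be roughly constant across ranks."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    for rank, (word, freq) in enumerate(Counter(words).most_common(top), start=1):
        print(rank, word, freq, round(rank * freq / total, 3))

# zipf_check(open("brown_corpus.txt").read())  # hypothetical corpus file
```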
Information from Zipf’s law can be combined with the findings of Luhn, roughly ten years later: “It
is here proposed that the frequency of word occurrence in an article furnishes a useful measurement
of word significance. It is further proposed that the relative position within a sentence of words having
given values of significance furnish a useful measurement for determining the significance of
sentences.” Formally, let f(w) be the frequency of occurrence of a word and r(w) its rank order; a plot
relating f(w) and r(w) yields a hyperbolic curve, demonstrating Zipf’s assertion that the product of the
frequency of use of words and their rank order is approximately constant.
Luhn used this law as a null hypothesis to specify two cut-offs, an upper and a lower, to exclude non-
significant words. Indeed, words above the upper cut-off can be considered as too common, while
those below the lower cut-off are too rare to be significant for understanding document content.
Consequently, Luhn assumed that the resolving power of significant words, by which he meant the
ability of words to discriminate content, reached a peak at a rank order position halfway between the
two cut-offs and from the peak fell off in either direction, reducing to almost zero at the cut-off points.
As a consequence of these properties of word distributions, the number of distinct index terms (the
vocabulary) exhibits a less-than-linear growth with respect to the growth of the document collection.
When we consider natural language text, it is easy to notice that not all words are equally effective for
the representation of a document’s semantics. Usually, noun words (or word groups containing nouns,
also called noun phrase groups) are the most representative components of a document in terms of
content. This is the implicit mental process we perform when distilling the “important” query concepts
into some representative nouns in our search engine queries. Based on this observation, the IR system
also preprocesses the text of the documents to determine the most “important” terms to be used as
index terms; a subset of the words is therefore selected to represent the content of a document.
When selecting candidate keywords, indexing must fulfill two different and potentially opposite goals:
one is exhaustiveness, i.e., assigning a sufficiently large number of terms to a document, and the other
is specificity, i.e., the exclusion of generic terms that carry little semantics and inflate the index.
Generic terms, for example, conjunctions and prepositions, are characterized by a low discriminative
power, as their frequency across any document in the collection tends to be high. In other words,
generic terms have high term frequency, defined as the number of occurrences of the term in a
document. In contrast, specific terms have higher discriminative power, due to their rare occurrences
across collection documents: they have low document frequency, defined as the number of documents
in a collection in which a term occurs.
2.4. PREPROCESSING
Text operations are the transformations that convert text into its logical representation. Document
preprocessing is a procedure that can be divided mainly into five text operations (transformations):
A. Lexical analysis of the text, with the objective of treating digits, hyphens, punctuation marks,
and the case of letters.
B. Elimination of stop-words, with the objective of filtering out words with very low
discrimination value for retrieval purposes (e.g., the, an, a, as, and).
C. Stemming of the remaining words with the objective of removing affixes (i.e., prefixes and
suffixes) and allowing the retrieval of documents containing syntactic variations of query
terms (e.g., connect, connecting, connected, etc).
D. Selection of index terms to determine which words/stems (or groups of words) will be used as
indexing elements. Usually, the decision on whether a particular word will be used as an
index term is related to the syntactic nature of the word. In fact, noun words frequently carry
more semantics than adjectives, adverbs, and verbs.
E. Construction of term categorization structures such as a thesaurus, or extraction of structure
directly represented in the text, for allowing the expansion of the original query with related
terms (a usually useful procedure). The next figure sketches the textual preprocessing phase
typically performed by an IR engine, taking as input a document and yielding its index terms.
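Before looking at each phase in detail, a toy end-to-end sketch of steps A-C may help; the stop-word list and the suffix-stripping rule below are deliberate simplifications, not a real stop list or stemmer.

```python
import re

STOPWORDS = {"the", "an", "a", "as", "and", "are", "to", "of", "in", "is"}  # toy stop list

def preprocess(document):
    """Toy version of steps A-C: lexical analysis, stop-word elimination,
    and a crude suffix-stripping stand-in for real stemming."""
    tokens = re.findall(r"[a-z]+", document.lower())          # A. lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]         # B. stop-word elimination
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]     # C. naive "stemming"

print(preprocess("The engines are connecting users to indexed documents."))
# ['engine', 'connect', 'user', 'index', 'document']
```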
1. Document Parsing. Documents come in all sorts of languages, character sets, and formats; often, the
same document may contain multiple languages or formats, e.g., a French email with Portuguese PDF
attachments. Document parsing deals with the recognition and “breaking down” of the document
structure into individual components. In this preprocessing phase, unit documents are created; e.g.,
emails with attachments are split into one document representing the email and as many documents as
there are attachments.
2. Lexical Analysis. After parsing, lexical analysis tokenizes a document, seen as an input stream, into
words. Issues related to lexical analysis include the correct identification of accents, abbreviations,
dates, and cases. It turns the text of the documents into the words that will be adopted as index terms.
– The objective of tokenization is to identify words in the text
– Numbers are usually not good index terms (like 1910, 1999), but some, such as 510 B.C., are
unique and worth keeping
– Hyphens: break up hyphenated words (e.g., state-of-the-art = state of the art), but some words,
e.g., B-29, are unique and require their hyphens
– Punctuation marks: remove totally unless significant, e.g., in program code x.exe differs from xexe
– Case of letters: usually not important; all letters can be converted to upper or lower case
Issues in Tokenization
One word or multiple: How do you decide it is one token or two or more?
– Hewlett-Packard: Hewlett and Packard as two tokens?
– State-of-the-art: break up hyphenated sequence.
– Addis Ababa
– lowercase, lower-case, lower case ?
Numbers:
– dates (3/12/91 vs. Mar. 12, 1991);
– phone numbers,
– IP addresses (100.2.86.144)
How to handle special cases involving apostrophes, hyphens, etc.? E.g., C++, C#, URLs, emails, e-mail.
The simplest approach is to ignore all numbers and punctuation and use only case-insensitive,
unbroken strings of alphabetic characters as tokens. Generally, systems do not index numbers as
text, although numbers are often very useful for search.
Tokenization is language specific: what works for one language does not work for another.
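A minimal sketch of the “simplest approach” described above, keeping only case-insensitive, unbroken alphabetic strings:

```python
import re

def simple_tokenize(text):
    """Keep only unbroken alphabetic strings, lowercased; numbers and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("State-of-the-art search in Addis Ababa, est. 1910."))
# ['state', 'of', 'the', 'art', 'search', 'in', 'addis', 'ababa', 'est']
```

Note how the hyphenated phrase is split and the year is dropped, illustrating the trade-offs listed above.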
3. Stop-Word Removal. A subsequent step optionally applied to the results of lexical analysis is stop-
word removal, i.e., the removal of high-frequency words. However, as this process may decrease recall
(prepositions are important to disambiguate queries), most search engines do not remove them. With
the removal of stop-words, we can obtain a better approximation of term importance for classification,
summarization, etc.
– Stop-words have little semantic content; it is typical to remove such high-frequency words.
Stop-words take up around 50% of the text, so removing them reduces the document representation by 30-50%.
– The 30 most common words account for about 30% of the tokens in written text.
– Stop word elimination used to be standard in older IR systems.
– But the trend is getting away from doing this. Most web search engines index stop words: Various
song titles, “Let it be”, “To be or not to be”
– Elimination of stop-words might reduce recall (e.g. “To be or not to be” – all eliminated except
“be” – will produce no or irrelevant retrieval)
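A minimal sketch of stop-word elimination with a tiny, hand-picked stop list (real lists are curated and language-specific); it reproduces the recall problem just mentioned:

```python
STOPWORDS = {"to", "or", "not", "the", "a", "an", "and", "of", "in"}  # toy stop list

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

# "to be or not to be" collapses to almost nothing once stop-words are removed,
# which is exactly the recall problem mentioned above.
print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))   # ['be', 'be']
```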
Normalization
Normalization is the process of canonicalizing (reducing to a standard form) tokens so that matches
occur despite superficial differences in the character sequences of the tokens.
– Need to “normalize” terms in indexed text as well as query terms into the same form
– Example: We want to match U.S.A. and USA, by deleting periods in a term
– Case Folding: Often best to lower case everything, since users will use lowercase regardless of
‘correct’ capitalization…
Republican vs. republican; Fasil vs. fasil vs. FASIL
Car vs. automobile?
– Case folding can be bad for proper names that coincide with common nouns,
e.g., General Motors, Associated Press, …
– One remedy is to lowercase only words at the beginning of a sentence; in practice, however,
lowercasing everything is most practical in IR because of the way users issue their queries.
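A minimal sketch of the two normalization steps discussed above, period deletion and case folding:

```python
def normalize(token):
    """Delete periods inside the term (U.S.A. -> USA) and case-fold."""
    return token.replace(".", "").lower()

print(normalize("U.S.A."))      # 'usa'  -- now matches the normalized form of "USA"
print(normalize("Republican"))  # 'republican'  -- case information is lost, as discussed
```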
4. Phrase detection.
This step captures text meaning beyond what is possible with pure bag-of-word approaches, thanks to
the identification of noun groups and other phrases. Phrase detection may be approached in several
ways, including rules (e.g., retaining terms that are not separated by punctuation marks),
morphological analysis (part-of-speech tagging), syntactic analysis, and combinations thereof. For
example, scanning our example sentence “search engines are the most visible information retrieval
applications” for noun phrases would probably result in identifying “search engines” and
“information retrieval”.
5. Stemming/Morphological analysis
Stemming reduces tokens to their “root” form in order to recognize morphological variation. The
process involves removal of affixes (i.e. prefixes and suffixes) with the aim of reducing variants to the
same stem. Often removes inflectional and derivational morphology of a word. Inflectional
morphology: vary the form of words in order to express grammatical features, such as singular/plural
or past/present tense. E.g. Boy → boys, cut → cutting.
Derivational morphology: makes new words from old ones, e.g., creation is formed from create, but
they are two separate words; similarly, destruction is derived from destroy.
Stemming is language dependent— correct stemming is language specific and can be complex. A
Stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or
suffixes). Example: ‘connect’ is the stem for {connected, connecting, connection, connections}
Thus, [automate, automatic, automation] all reduce to automat
The first approach is to create a big dictionary that maps words to their stems.
The disadvantages are the space required by the dictionary and the investment required to maintain
the dictionary as new words appear.
The second approach is to use a set of rules that extract stems from words.
The advantages of this approach are that the code is typically small, and it can gracefully handle
new words; the disadvantage is that it occasionally makes mistakes.
But, since stemming is imperfectly defined, anyway, occasional mistakes are tolerable, and the
rule-based approach is the one that is generally chosen.
Google, for instance, uses stemming to search for web pages containing the words connected,
connecting, connection and connections when users ask for a web page that contains the word connect.
In 1979, Martin Porter developed a stemming algorithm that uses a set of rules to extract stems from
words, and though it makes some mistakes, most common words seem to work out right. Porter
describes his algorithm and provides a reference implementation in C at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter Stemmer is the most common algorithm for stemming English words to their common
grammatical root. It is a simple procedure for removing known affixes in English without using a
dictionary. To get rid of plurals, the following rules are used:
– SSES → SS (caresses → caress)
– IES → I (ponies → poni)
– SS → SS (caress → caress)
– S → "" (cats → cat)
– EMENT → "" (delete final "ement" if what remains is longer than 1 character)
Stemming: challenges
May conflate (reduce to the same token) words that are actually distinct; for example, "universe" and
"university" may both be reduced to the stem "univers".
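For a concrete picture, the sketch below uses the Porter implementation shipped with the NLTK library (assuming NLTK is installed); the outputs illustrate the plural and suffix rules listed above.

```python
from nltk.stem import PorterStemmer   # assumes the nltk package is installed

stemmer = PorterStemmer()
for word in ["connected", "connecting", "connection", "connections", "caresses", "ponies"]:
    print(word, "->", stemmer.stem(word))
# connected, connecting, connection and connections all reduce to 'connect';
# caresses -> 'caress', ponies -> 'poni'
```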
6. Thesauri
Full-text searching alone is often not accurate, since different authors may select different words to
represent the same concept.
Problem: the same meaning can be expressed using different terms (synonyms, homonyms, and
related terms).
How can we ensure that the same terms are used for the same meaning in both the index and the
query?
Thesaurus: the vocabulary of a controlled indexing language, formally organized so that a priori
relationships between concepts (for example, "broader" and "related") are made explicit.
When a document contains automobile, index it under car as well (and usually also vice versa).
When the query contains automobile, look under car as well to expand the query.
– to provide classified hierarchies that allow the broadening and narrowing of the current request
according to user needs.
– Language-specific and
– Often, application-specific
– These are “plug-in” addenda to the indexing process
– Both open source and commercial plug-ins are available for handling these issues
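As a rough illustration, a toy query-expansion sketch; the two-entry thesaurus below is invented for the example, whereas real systems plug in curated controlled vocabularies as noted above.

```python
# Hand-built toy thesaurus; real systems use curated controlled vocabularies.
THESAURUS = {
    "automobile": ["car"],
    "car": ["automobile"],
}

def expand_query(terms):
    expanded = list(terms)
    for t in terms:
        expanded.extend(THESAURUS.get(t, []))   # add related terms, if any
    return expanded

print(expand_query(["automobile", "insurance"]))   # ['automobile', 'insurance', 'car']
```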
Index language is the language used to describe documents and requests
If a full text representation of the text is adopted, then all words in the text are used as index
terms = full text indexing
Otherwise, we need to select the words to be used as index terms in order to reduce the size of the
index file, which is essential for designing an efficient IR system
Terms are usually stems. Terms can also be phrases, such as “Computer Science”, “World Wide
Web”, etc. Documents and queries are represented as vectors or “bags of words” (BOW). Each vector
holds a place for every term in the collection: position 1 corresponds to term 1, position 2 to term 2,
..., position n to term n. The weight is set to 0 if the term is absent.
Binary Weights
Only the presence (1) or absence (0) of a term is included in the vector. The binary formula gives
every word that appears in a document equal relevance; it can be useful when frequency is not
important.
$$w_{ij} = \begin{cases} 1 & \text{if } freq_{ij} > 0 \\ 0 & \text{if } freq_{ij} = 0 \end{cases}$$
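A minimal sketch of binary weighting over a made-up four-term vocabulary; the terms and document are purely illustrative.

```python
vocabulary = ["information", "retrieval", "search", "engine"]   # illustrative vocabulary
doc_tokens = ["search", "engine", "search"]                     # tokens of one toy document

binary_vector = [1 if term in doc_tokens else 0 for term in vocabulary]
print(binary_vector)   # [0, 0, 1, 1] -- 'search' appearing twice makes no difference
```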
Document Normalization: long documents have an unfair advantage because they use a lot of terms.
The following example shows that collection frequency (cf) and document frequency (df) behave differently:
Word        cf      df
try         10422   8760
insurance   10440   3997
Document Frequency (df) is the number of documents in the collection that contain a term. The less
frequently a term appears in the whole collection, the more discriminating it is: df i = document
frequency of term i = number of documents containing term i.
E.g., given a collection of 1000 documents and the document frequency of each word, compute 𝑖𝑑𝑓 for each word.
𝑖𝑑𝑓 provides high values for rare words and low values for common words; IDF is an indication of a
term’s discrimination power. The log is used to dampen the effect relative to 𝑡𝑓. Note the difference
between document frequency and collection (corpus) frequency.
TF*IDF Weighting
The most widely used term-weighting scheme is 𝑡𝑓 ∗ 𝑖𝑑𝑓:
𝒘𝒊𝒋 = 𝒕𝒇𝒊𝒋 ∗ 𝒊𝒅𝒇𝒊 = 𝒕𝒇𝒊𝒋 ∗ 𝒍𝒐𝒈𝟐(𝑵/𝒅𝒇𝒊)
A term occurring frequently in the document but rarely in the rest of the collection is given high
weight. The 𝑡𝑓 ∗ 𝑖𝑑𝑓 value for a term will always be greater than or equal to zero. Experimentally,
𝑡𝑓 ∗ 𝑖𝑑𝑓 has been found to work well. It is often used in the vector space model together with cosine
similarity to determine the similarity between two documents.
TF*IDF assigns its highest weight when a term t occurs many times within a small number of documents.
Highest 𝑡𝑓 ∗ 𝑖𝑑𝑓 for a term shows a term has a high term frequency (in the given document) and
a low document frequency (in the whole collection of documents). Thus lending high
discriminating power to those documents.
Lower 𝑡𝑓 ∗ 𝑖𝑑𝑓 is registered when the term occurs fewer times in a document, or occurs in many
documents. Thus offering a less discriminating power to those documents. Lowest 𝑡𝑓 ∗ 𝑖𝑑𝑓 is
registered when the term occurs in virtually all documents.
Computing 𝒕𝒇 ∗ 𝒊𝒅𝒇: an example. Assume the collection contains 10,000 documents and statistical
analysis shows that the document frequencies (𝑑𝑓) of three terms are: A(50), B(1300), C(250). The
term frequencies (𝑡𝑓) of these terms in a given document are: A(3), B(2), C(1). Compute TF*IDF for each term.
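A minimal sketch that works the example out using the formula above (log base 2):

```python
import math

N = 10_000                                  # documents in the collection
df = {"A": 50, "B": 1300, "C": 250}         # document frequencies from the example
tf = {"A": 3, "B": 2, "C": 1}               # term frequencies in the given document

for term in ("A", "B", "C"):
    idf = math.log2(N / df[term])           # idf_i = log2(N / df_i)
    w = tf[term] * idf                      # w_ij = tf_ij * idf_i
    print(term, round(idf, 2), round(w, 2))

# A: idf ≈ 7.64, w ≈ 22.93  (rare term: highest weight despite modest tf)
# B: idf ≈ 2.94, w ≈ 5.89   (frequent in the collection: low idf)
# C: idf ≈ 5.32, w ≈ 5.32
```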
More example
Consider a document containing 100 words wherein the word cow appears 3 times. Now, assume
we have 10 million documents and cow appears in one thousand of these.
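Working this out depends on the tf convention. The notes' formula uses the raw count with log base 2; a common textbook variant normalizes tf by document length and uses log base 10 (that variant is an additional assumption, not the formula given above). A small sketch computing both:

```python
import math

tf_raw, doc_len = 3, 100        # "cow" occurs 3 times in a 100-word document
N, df = 10_000_000, 1_000       # 10 million documents; "cow" appears in 1,000 of them

w_raw = tf_raw * math.log2(N / df)                  # 3 * log2(10,000)  ≈ 39.86
w_norm = (tf_raw / doc_len) * math.log10(N / df)    # 0.03 * 4          = 0.12
print(round(w_raw, 2), round(w_norm, 2))            # 39.86 0.12
```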
To further differentiate documents (say, for a query containing "the brown cow"), we might count the
number of times each term occurs in each document and sum them all together; the number of times
a term occurs in a document is called its TF. However, because the term "the" is so common, this will
tend to incorrectly emphasize documents which happen to use the word "the" more, without giving
enough weight to the more meaningful terms "brown" and "cow". Moreover, the term "the" is not a
good keyword for distinguishing relevant from non-relevant documents, while terms like "brown" and
"cow" that occur rarely are good keywords for distinguishing relevant documents from non-relevant ones.
Hence, IDF is incorporated: it diminishes the weight of terms that occur very frequently in the
collection and increases the weight of terms that occur rarely. This makes 𝑇𝐹 ∗ 𝐼𝐷𝐹 a better
weighting technique. On top of that, we apply similarity measures to calculate the distance between
document i and query j. There are a number of similarity measures; the most common ones are
Euclidean distance, inner (dot) product, cosine similarity, Dice similarity, Jaccard similarity, etc.
Similarity Measure
We now have vectors for all documents in the collection, and a vector for the query. A similarity
measure is a function that computes the degree of similarity or distance between a document vector
and the query vector. Using a similarity measure between the query and each document:
– It is possible to rank the retrieved documents in the order of presumed relevance.
– It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
Postulate: Documents that are “close together” in the vector space talk about the same things and
are more similar than others.
1. If d1 is near d2, then d2 is near d1.
2. If d1 near d2, and d2 near d3, then d1 is not far from d3.
3. No document is closer to d than d itself.
Sometimes it is a good idea to determine the maximum possible similarity as the “distance” between
a document d and itself. A similarity measure attempts to compute the distance between a document
vector wj and the query vector wq. The assumption here is that documents whose vectors are close to
the query vector are more relevant to the query than documents whose vectors are far away from the
query vector.
We can use different techniques to compute similarity/relatedness of query and documents. E.g.
Euclidean distance, Dot product, cosine similarity, etc.
I. Euclidean distance
It is a commonly used distance measure. Euclidean distance is the square root of the sum of squared
differences between the corresponding coordinates of the document and query vectors.
Similarity between vectors for the document dj and query q can be computed as:
$$sim(d_j, q) = |d_j - q| = \sqrt{\sum_{i=1}^{n} (w_{ij} - w_{iq})^{2}}$$
where 𝑤𝑖𝑗 is the weight of term i in document j and 𝑤𝑖𝑞 is the weight of term i in the query q
Example: Determine the Euclidean distance between the document 1 vector (0, 3, 2, 1, 10) and query
vector (2, 7, 1, 0, 0). 0 means corresponding term not found in document or query.
√((0 − 2)² + (3 − 7)² + (2 − 1)² + (1 − 0)² + (10 − 0)²) = √122 ≈ 11.05
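For reference, the same computation as a minimal Python sketch:

```python
import math

def euclidean(d, q):
    """Euclidean distance between a document vector and a query vector."""
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

print(round(euclidean([0, 3, 2, 1, 10], [2, 7, 1, 0, 0]), 2))   # 11.05
```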
II. Dot product
The dot product is also known as the scalar product or inner product. It is computed by multiplying
the corresponding term weights of the query and document vectors and summing the results.
Similarity between vectors for the document dj and query q can be computed as the vector inner product:
$$sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{n} w_{ij} \times w_{iq}$$
III. Cosine similarity
Cosine similarity measures the cosine of the angle between the document vector and the query vector:
the dot product normalized by the lengths of the two vectors. For term-weighted vectors:
$$CosSim(d_j, q) = \frac{d_j \cdot q}{|d_j|\,|q|} = \frac{\sum_{i=1}^{n} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{n} w_{ij}^{2}}\; \sqrt{\sum_{i=1}^{n} w_{iq}^{2}}}$$
Example: let the query and document vectors be Q = (0.4, 0.8), D1 = (0.8, 0.3), and D2 = (0.2, 0.7).

CosSim(D1, Q) = (0.4 × 0.8 + 0.8 × 0.3) / √([(0.4)² + (0.8)²] × [(0.8)² + (0.3)²])
             = (0.32 + 0.24) / √((0.16 + 0.64) × (0.64 + 0.09))
             = 0.56 / √(0.8 × 0.73) = 0.56 / √0.584 = 0.56 / 0.764 ≈ 0.73

CosSim(D2, Q) = (0.4 × 0.2 + 0.8 × 0.7) / √([(0.4)² + (0.8)²] × [(0.2)² + (0.7)²])
             = (0.08 + 0.56) / √((0.16 + 0.64) × (0.04 + 0.49))
             = 0.64 / √(0.8 × 0.53) = 0.64 / √0.424 = 0.64 / 0.65 ≈ 0.98

So cos θ1 ≈ 0.73 and cos θ2 ≈ 0.98: D2 is judged more similar to the query than D1.
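The same example as a minimal Python sketch, with the dot product factored out so the relationship between the two measures is explicit:

```python
import math

def dot(a, b):
    """Inner (dot) product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(d, q):
    """Cosine similarity: the dot product normalized by the vector lengths."""
    return dot(d, q) / (math.sqrt(dot(d, d)) * math.sqrt(dot(q, q)))

Q, D1, D2 = [0.4, 0.8], [0.8, 0.3], [0.2, 0.7]
print(round(cosine(D1, Q), 2))   # 0.73
print(round(cosine(D2, Q), 2))   # 0.98 -- D2 is ranked as more similar to the query
```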