Chapter 2: Text Operations and Automatic Indexing
The IR system contains a collection of documents, with each document represented by a sequence
of tokens. Markup may indicate titles, authors, and other structural elements.
Many Web pages are actually spam — malicious pages deliberately posing as something that they
are not in order to attract unwarranted attention of a commercial or other nature.
Other problems derive from the scale of the Web: trillions of pages distributed among millions of
hosts. In order to index these pages, they must be gathered from across the Web by a crawler and stored
locally by the search engine for processing.
Because many pages may change daily or hourly, this snapshot of the Web must be refreshed on a
regular basis. While gathering data the crawler may detect duplicates and near-duplicates of pages,
which must be dealt with appropriately.
Another consideration is the volume and variety of queries commercial Web search engines
receive, which directly reflect the volume and variety of information on the Web itself. Queries are often
short, typically one or two terms; Google, for example, handles tens of thousands of queries per second.
The crawler component has the primary responsibility for identifying and obtaining documents
for the search engine. There are a number of different types of crawlers, but the most common
is the general web crawler. A web crawler is designed to follow the links on web pages to discover
and download new pages. Although this sounds deceptively simple, there are significant challenges in
designing a web crawler that can efficiently handle the huge volume of new pages on the Web, while
at the same time ensuring that pages that may have changed since the last time a crawler visited a site
are kept “fresh” for the search engine. A web crawler can be restricted to a single site, such as a
university, as the basis for site search. This type of crawler may be used by a vertical or topical search
application, such as a search engine that provides access to medical information on web pages.
For enterprise search, the crawler is adapted to discover and update all documents and web pages
related to a company’s operation. An enterprise document crawler follows links to discover both
external and internal (i.e., restricted to the corporate intranet) pages, but must also scan both corporate
and personal directories to identify email, word processing documents, presentations, database
records, and other company information.
Document crawlers are also used for desktop search, although in this case only the user’s personal
directories need to be scanned.
Feeds: Document feeds are a mechanism for accessing a real-time stream of documents. For example,
a news feed is a constant stream of news stories and updates. In contrast to a crawler, which must
discover new documents, a search engine acquires new documents from a feed simply by monitoring
it. Content such as news, blogs, and video is commonly distributed through web feeds; a feed reader
monitors those feeds and provides new content when it arrives. Radio and television feeds are also used
in some search applications, where the “documents” contain automatically segmented audio and video
streams, together with associated text from closed captions or speech recognition.
Conversion: The documents found by a crawler or provided by a feed are rarely in plain text. Instead,
they come in a variety of formats, such as HTML, XML, Adobe PDF, Microsoft Word, Microsoft
PowerPoint, and so on. Search engines therefore require that these documents be converted into a
consistent text-plus-metadata format.
Document data store: It is a database used to manage large numbers of documents and the
structured data that is associated with them. The document contents are typically stored in
compressed form for efficiency. The structured data consists of document metadata and other
information extracted from the documents, such as links and anchor text. To summarize the role of the
crawler: its task is to gather copies of web pages from across the Web and store them locally for the
search engine to process.
Formulated in the 1940s, Zipf’s law states that, given a corpus of natural language utterances, the
frequency of any word is inversely proportional to its rank in the frequency table. This can be
empirically validated by plotting the frequency of words in large textual corpora, as done for instance
in a well-known experiment with the Brown Corpus. Formally, if the words in a document collection
are ordered according to a ranking function r(w) in decreasing order of frequency f (w), the following
holds: 𝒓(𝒘) × 𝒇(𝒘) = 𝒄, where c is a language-dependent constant. In English collections, for
instance, c can be approximated as 10.
Zipf explained the law by his “principle of least effort”: it is easier for a speaker or writer of a
language to repeat certain words than to coin new and different ones, and the resulting distribution
balances the speaker’s desire for a small vocabulary against the hearer’s desire for a large one.
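The law is easy to check empirically. The sketch below (plain Python; the corpus file name is hypothetical) ranks words by frequency and prints rank times relative frequency, which Zipf's law predicts to be roughly constant; whether the constant comes out near 0.1 or near 10 depends on whether relative frequencies or percentages are used.

```python
from collections import Counter
import re

def zipf_check(text, top=10):
    """Rank words by frequency and print rank * relative frequency,
    which Zipf's law predicts to be roughly constant across ranks."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    for rank, (word, freq) in enumerate(Counter(words).most_common(top), start=1):
        print(rank, word, freq, round(rank * freq / total, 3))

# zipf_check(open("brown_corpus.txt").read())  # hypothetical corpus file
```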
Information from Zipf’s law can be combined with the findings of Luhn, roughly ten years later: “It
is here proposed that the frequency of word occurrence in an article furnishes a useful measurement
of word significance. It is further proposed that the relative position within a sentence of words having
given values of significance furnish a useful measurement for determining the significance of
sentences.” Formally, let f(w) be the frequency of occurrence of a word and r(w) its rank order; a plot
relating f(w) and r(w) yields a hyperbolic curve, demonstrating Zipf’s assertion that the product of the
frequency of use of words and their rank order is approximately constant.
Luhn used this law as a null hypothesis to specify two cut-offs, an upper and a lower, to exclude non-
significant words. Indeed, words above the upper cut-off can be considered as too common, while
those below the lower cut-off are too rare to be significant for understanding document content.
Consequently, Luhn assumed that the resolving power of significant words, by which he meant the
ability of words to discriminate content, reached a peak at a rank order position halfway between the
two cut-offs and from the peak fell off in either direction, reducing to almost zero at the cut-off points.
As a consequence of these properties of word distributions, the number of distinct index terms (the
vocabulary) exhibits a less-than-linear growth with respect to the growth of the document collection.
When we consider natural language text, it is easy to notice that not all words are equally effective for
the representation of a document’s semantics. Usually, noun words (or word groups containing nouns,
also called noun phrase groups) are the most representative components of a document in terms of
content. This is the implicit mental process we perform when distilling the “important” query concepts
into some representative nouns in our search engine queries. Based on this observation, the IR system
also preprocesses the text of the documents to determine the most “important” terms to be used as
index terms; a subset of the words is therefore selected to represent the content of a document.
When selecting candidate keywords, indexing must fulfill two different and potentially opposite goals:
one is exhaustiveness, i.e., assigning a sufficiently large number of terms to a document, and the other
is specificity, i.e., the exclusion of generic terms that carry little semantics and inflate the index.
Generic terms, for example, conjunctions and prepositions, are characterized by a low discriminative
power, as their frequency across any document in the collection tends to be high. In other words,
generic terms have high term frequency, defined as the number of occurrences of the term in a
document. In contrast, specific terms have higher discriminative power, due to their rare occurrences
across collection documents: they have low document frequency, defined as the number of documents
in a collection in which a term occurs.
2.4. PREPROCESSING
Text operations are the transformations that convert text into its logical representation. Document
preprocessing is a procedure that can be divided mainly into five text operations (transformations):
A. Lexical analysis of the text, with the objective of treating digits, hyphens, punctuation marks,
and the case of letters.
B. Elimination of stop-words, with the objective of filtering out words with very low
discrimination value for retrieval purposes (e.g., the, an, a, as, and).
C. Stemming of the remaining words with the objective of removing affixes (i.e., prefixes and
suffixes) and allowing the retrieval of documents containing syntactic variations of query
terms (e.g., connect, connecting, connected, etc).
D. Selection of index terms to determine which words/stems (or groups of words) will be used as
indexing elements. Usually, the decision on whether a particular word will be used as an
index term is related to the syntactic nature of the word. In fact, noun words frequently carry
more semantics than adjectives, adverbs, and verbs.
E. Construction of term categorization structures such as a thesaurus, or extraction of structure
directly represented in the text, for allowing the expansion of the original query with related
terms (a usually useful procedure). The next figure sketches the textual preprocessing phase
typically performed by an IR engine, taking as input a document and yielding its index terms.
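Before looking at each phase in detail, a toy end-to-end sketch of steps A-C may help; the stop-word list and the suffix-stripping rule below are deliberate simplifications, not a real stop list or stemmer.

```python
import re

STOPWORDS = {"the", "an", "a", "as", "and", "are", "to", "of", "in", "is"}  # toy stop list

def preprocess(document):
    """Toy version of steps A-C: lexical analysis, stop-word elimination,
    and a crude suffix-stripping stand-in for real stemming."""
    tokens = re.findall(r"[a-z]+", document.lower())          # A. lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]         # B. stop-word elimination
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]     # C. naive "stemming"

print(preprocess("The engines are connecting users to indexed documents."))
# ['engine', 'connect', 'user', 'index', 'document']
```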
1. Document Parsing. Documents come in all sorts of languages, character sets, and formats; often, the
same document may contain multiple languages or formats, e.g., a French email with Portuguese PDF
attachments. Document parsing deals with the recognition and “breaking down” of the document
structure into individual components. In this preprocessing phase, unit documents are created; e.g.,
emails with attachments are split into one document representing the email and as many documents as
there are attachments.
2. Lexical Analysis. After parsing, lexical analysis tokenizes a document, seen as an input stream, into
words. Issues related to lexical analysis include the correct identification of accents, abbreviations,
dates, and cases. It turns the text of the documents into the words that will be adopted as index terms.
– The objective of tokenization is to identify words in the text
– Numbers are usually not good index terms (like 1910, 1999), but some, such as 510 B.C., are
unique and worth keeping
– Hyphens: break up hyphenated words (e.g., state-of-the-art = state of the art), but some words,
e.g., B-29, are unique and require their hyphens
– Punctuation marks: remove totally unless significant, e.g., in program code x.exe differs from xexe
– Case of letters: usually not important; all letters can be converted to upper or lower case
Issues in Tokenization
One word or multiple: How do you decide it is one token or two or more?
– Hewlett-Packard: Hewlett and Packard as two tokens?
– State-of-the-art: break up hyphenated sequence.
– Addis Ababa
– lowercase, lower-case, lower case ?
Numbers:
– dates (3/12/91 vs. Mar. 12, 1991);
– phone numbers,
– IP addresses (100.2.86.144)
How to handle special cases involving apostrophes, hyphens, etc.? E.g., C++, C#, URLs, emails, e-mail.
The simplest approach is to ignore all numbers and punctuation and use only case-insensitive,
unbroken strings of alphabetic characters as tokens. Generally, systems do not index numbers as
text, although numbers are often very useful for search.
Tokenization is language specific: what works for one language does not work for another.
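A minimal sketch of the “simplest approach” described above, keeping only case-insensitive, unbroken alphabetic strings:

```python
import re

def simple_tokenize(text):
    """Keep only unbroken alphabetic strings, lowercased; numbers and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("State-of-the-art search in Addis Ababa, est. 1910."))
# ['state', 'of', 'the', 'art', 'search', 'in', 'addis', 'ababa', 'est']
```

Note how the hyphenated phrase is split and the year is dropped, illustrating the trade-offs listed above.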
3. Stop-Word Removal. A subsequent step optionally applied to the results of lexical analysis is stop-
word removal, i.e., the removal of high-frequency words. However, as this process may decrease recall
(prepositions are important to disambiguate queries), most search engines do not remove them. With
the removal of stop-words, we can obtain a better approximation of term importance for classification,
summarization, etc.
– Stop-words have little semantic content; it is typical to remove such high-frequency words.
Stop-words take up around 50% of the text, so removing them reduces the document representation by 30-50%.
– The 30 most common words account for about 30% of the tokens in written text.
– Stop word elimination used to be standard in older IR systems.
– But the trend is getting away from doing this. Most web search engines index stop words: Various
song titles, “Let it be”, “To be or not to be”
– Elimination of stop-words might reduce recall (e.g. “To be or not to be” – all eliminated except
“be” – will produce no or irrelevant retrieval)
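A minimal sketch of stop-word elimination with a tiny, hand-picked stop list (real lists are curated and language-specific); it reproduces the recall problem just mentioned:

```python
STOPWORDS = {"to", "or", "not", "the", "a", "an", "and", "of", "in"}  # toy stop list

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

# "to be or not to be" collapses to almost nothing once stop-words are removed,
# which is exactly the recall problem mentioned above.
print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))   # ['be', 'be']
```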
Normalization
Normalization is the process of canonicalizing (reducing to a standard form) tokens so that matches
occur despite superficial differences in the character sequences of the tokens.
– Need to “normalize” terms in indexed text as well as query terms into the same form
– Example: We want to match U.S.A. and USA, by deleting periods in a term
– Case Folding: Often best to lower case everything, since users will use lowercase regardless of
‘correct’ capitalization…
Republican vs. republican; Fasil vs. fasil vs. FASIL
Car vs. automobile?
– Case folding can be bad for proper names that coincide with common nouns,
e.g., General Motors, Associated Press, …
– One remedy is to lowercase only words at the beginning of a sentence; in practice, however,
lowercasing everything is most practical in IR because of the way users issue their queries.
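A minimal sketch of the two normalization steps discussed above, period deletion and case folding:

```python
def normalize(token):
    """Delete periods inside the term (U.S.A. -> USA) and case-fold."""
    return token.replace(".", "").lower()

print(normalize("U.S.A."))      # 'usa'  -- now matches the normalized form of "USA"
print(normalize("Republican"))  # 'republican'  -- case information is lost, as discussed
```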
4. Phrase detection.
This step captures text meaning beyond what is possible with pure bag-of-word approaches, thanks to
the identification of noun groups and other phrases. Phrase detection may be approached in several
ways, including rules (e.g., retaining terms that are not separated by punctuation marks),
morphological analysis (part-of-speech tagging), syntactic analysis, and combinations thereof. For
example, scanning our example sentence “search engines are the most visible information retrieval
applications” for noun phrases would probably result in identifying “search engines” and
“information retrieval”.
5. Stemming/Morphological analysis
Stemming reduces tokens to their “root” form in order to recognize morphological variation. The
process involves removal of affixes (i.e. prefixes and suffixes) with the aim of reducing variants to the
same stem. Often removes inflectional and derivational morphology of a word. Inflectional
morphology: vary the form of words in order to express grammatical features, such as singular/plural
or past/present tense. E.g. Boy → boys, cut → cutting.
Derivational morphology: makes new words from old ones, e.g., creation is formed from create, but
they are two separate words; similarly, destruction is derived from destroy.
Stemming is language dependent— correct stemming is language specific and can be complex. A
Stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or
suffixes). Example: ‘connect’ is the stem for {connected, connecting, connection, connections}
Thus, [automate, automatic, automation] all reduce to automat
The first approach is to create a big dictionary that maps words to their stems.
The disadvantages are the space required by the dictionary and the investment required to maintain
the dictionary as new words appear.
The second approach is to use a set of rules that extract stems from words.
The advantages of this approach are that the code is typically small, and it can gracefully handle
new words; the disadvantage is that it occasionally makes mistakes.
But, since stemming is imperfectly defined, anyway, occasional mistakes are tolerable, and the
rule-based approach is the one that is generally chosen.
Google, for instance, uses stemming to search for web pages containing the words connected,
connecting, connection and connections when users ask for a web page that contains the word connect.
In 1979, Martin Porter developed a stemming algorithm that uses a set of rules to extract stems from
words, and though it makes some mistakes, most common words seem to work out right. Porter
describes his algorithm and provides a reference implementation in C at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter Stemmer is the most common algorithm for stemming English words to their common
grammatical root. It is a simple procedure for removing known affixes in English without using a
dictionary. To get rid of plurals, the following rules are used:
– SSES → SS (caresses → caress)
– IES → I (ponies → poni)
– SS → SS (caress → caress)
– S → "" (cats → cat)
– EMENT → "" (delete final "ement" if what remains is longer than 1 character)
Stemming: challenges
May conflate (reduce to the same token) words that are actually distinct; for example, "universe" and
"university" may both be reduced to the stem "univers".
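For a concrete picture, the sketch below uses the Porter implementation shipped with the NLTK library (assuming NLTK is installed); the outputs illustrate the plural and suffix rules listed above.

```python
from nltk.stem import PorterStemmer   # assumes the nltk package is installed

stemmer = PorterStemmer()
for word in ["connected", "connecting", "connection", "connections", "caresses", "ponies"]:
    print(word, "->", stemmer.stem(word))
# connected, connecting, connection and connections all reduce to 'connect';
# caresses -> 'caress', ponies -> 'poni'
```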
6. Thesauri
Full-text searching alone is often not accurate, since different authors may select different words to
represent the same concept.
Problem: the same meaning can be expressed using different terms (synonyms, homonyms, and
related terms).
How can we ensure that the same terms are used for the same meaning in both the index and the
query?
Thesaurus: the vocabulary of a controlled indexing language, formally organized so that a priori
relationships between concepts (for example, "broader" and "related") are made explicit.
When a document contains automobile, index it under car as well (and usually also vice versa).
When the query contains automobile, look under car as well to expand the query.
– to provide classified hierarchies that allow the broadening and narrowing of the current request
according to user needs.
– Language-specific and
– Often, application-specific
– These are “plug-in” addenda to the indexing process
– Both open source and commercial plug-ins are available for handling these issues
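As a rough illustration, a toy query-expansion sketch; the two-entry thesaurus below is invented for the example, whereas real systems plug in curated controlled vocabularies as noted above.

```python
# Hand-built toy thesaurus; real systems use curated controlled vocabularies.
THESAURUS = {
    "automobile": ["car"],
    "car": ["automobile"],
}

def expand_query(terms):
    expanded = list(terms)
    for t in terms:
        expanded.extend(THESAURUS.get(t, []))   # add related terms, if any
    return expanded

print(expand_query(["automobile", "insurance"]))   # ['automobile', 'insurance', 'car']
```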
Index language is the language used to describe documents and requests
If a full text representation of the text is adopted, then all words in the text are used as index
terms = full text indexing
Otherwise, we need to select the words to be used as index terms in order to reduce the size of the
index file, which is essential for designing an efficient IR system
Terms are usually stems. Terms can also be phrases, such as “Computer Science”, “World Wide
Web”, etc. Documents and queries are represented as vectors or “bags of words” (BOW). Each vector
holds a place for every term in the collection: position 1 corresponds to term 1, position 2 to term 2,
..., position n to term n. The weight is set to 0 if the term is absent.
Binary Weights
Only the presence (1) or absence (0) of a term is included in the vector. The binary formula gives
every word that appears in a document equal relevance; it can be useful when frequency is not
important.
$$w_{ij} = \begin{cases} 1 & \text{if } freq_{ij} > 0 \\ 0 & \text{if } freq_{ij} = 0 \end{cases}$$
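A minimal sketch of binary weighting over a made-up four-term vocabulary; the terms and document are purely illustrative.

```python
vocabulary = ["information", "retrieval", "search", "engine"]   # illustrative vocabulary
doc_tokens = ["search", "engine", "search"]                     # tokens of one toy document

binary_vector = [1 if term in doc_tokens else 0 for term in vocabulary]
print(binary_vector)   # [0, 0, 1, 1] -- 'search' appearing twice makes no difference
```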
Document Normalization: long documents have an unfair advantage because they use a lot of terms.
The following example shows that collection frequency (cf) and document frequency (df) behave differently:
Word        cf      df
try         10422   8760
insurance   10440   3997
Document Frequency (df) is the number of documents in the collection that contain a term. The less
frequently a term appears in the whole collection, the more discriminating it is: df i = document
frequency of term i = number of documents containing term i.
E.g., given a collection of 1000 documents and the document frequency of each word, compute 𝑖𝑑𝑓 for each word.
𝑖𝑑𝑓 provides high values for rare words and low values for common words; IDF is an indication of a
term’s discrimination power. The log is used to dampen the effect relative to 𝑡𝑓. Note the difference
between document frequency and collection (corpus) frequency.
TF*IDF Weighting
The most widely used term-weighting scheme is 𝑡𝑓 ∗ 𝑖𝑑𝑓:
𝒘𝒊𝒋 = 𝒕𝒇𝒊𝒋 ∗ 𝒊𝒅𝒇𝒊 = 𝒕𝒇𝒊𝒋 ∗ 𝒍𝒐𝒈𝟐(𝑵/𝒅𝒇𝒊)
A term occurring frequently in the document but rarely in the rest of the collection is given high
weight. The 𝑡𝑓 ∗ 𝑖𝑑𝑓 value for a term will always be greater than or equal to zero. Experimentally,
𝑡𝑓 ∗ 𝑖𝑑𝑓 has been found to work well. It is often used in the vector space model together with cosine
similarity to determine the similarity between two documents.
TF*IDF assigns its highest weight when a term t occurs many times within a small number of documents.
Highest 𝑡𝑓 ∗ 𝑖𝑑𝑓 for a term shows a term has a high term frequency (in the given document) and
a low document frequency (in the whole collection of documents). Thus lending high
discriminating power to those documents.
Lower 𝑡𝑓 ∗ 𝑖𝑑𝑓 is registered when the term occurs fewer times in a document, or occurs in many
documents. Thus offering a less discriminating power to those documents. Lowest 𝑡𝑓 ∗ 𝑖𝑑𝑓 is
registered when the term occurs in virtually all documents.
Computing 𝒕𝒇 ∗ 𝒊𝒅𝒇: an example. Assume the collection contains 10,000 documents and statistical
analysis shows that the document frequencies (𝑑𝑓) of three terms are: A(50), B(1300), C(250). The
term frequencies (𝑡𝑓) of these terms in a given document are: A(3), B(2), C(1). Compute TF*IDF for each term.
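A minimal sketch that works the example out using the formula above (log base 2):

```python
import math

N = 10_000                                  # documents in the collection
df = {"A": 50, "B": 1300, "C": 250}         # document frequencies from the example
tf = {"A": 3, "B": 2, "C": 1}               # term frequencies in the given document

for term in ("A", "B", "C"):
    idf = math.log2(N / df[term])           # idf_i = log2(N / df_i)
    w = tf[term] * idf                      # w_ij = tf_ij * idf_i
    print(term, round(idf, 2), round(w, 2))

# A: idf ≈ 7.64, w ≈ 22.93  (rare term: highest weight despite modest tf)
# B: idf ≈ 2.94, w ≈ 5.89   (frequent in the collection: low idf)
# C: idf ≈ 5.32, w ≈ 5.32
```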
More example
Consider a document containing 100 words wherein the word cow appears 3 times. Now, assume
we have 10 million documents and cow appears in one thousand of these.
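Working this out depends on the tf convention. The notes' formula uses the raw count with log base 2; a common textbook variant normalizes tf by document length and uses log base 10 (that variant is an additional assumption, not the formula given above). A small sketch computing both:

```python
import math

tf_raw, doc_len = 3, 100        # "cow" occurs 3 times in a 100-word document
N, df = 10_000_000, 1_000       # 10 million documents; "cow" appears in 1,000 of them

w_raw = tf_raw * math.log2(N / df)                  # 3 * log2(10,000)  ≈ 39.86
w_norm = (tf_raw / doc_len) * math.log10(N / df)    # 0.03 * 4          = 0.12
print(round(w_raw, 2), round(w_norm, 2))            # 39.86 0.12
```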
To further differentiate documents (say, for a query containing "the brown cow"), we might count the
number of times each term occurs in each document and sum them all together; the number of times
a term occurs in a document is called its TF. However, because the term "the" is so common, this will
tend to incorrectly emphasize documents which happen to use the word "the" more, without giving
enough weight to the more meaningful terms "brown" and "cow". Moreover, the term "the" is not a
good keyword for distinguishing relevant from non-relevant documents, while terms like "brown" and
"cow" that occur rarely are good keywords for distinguishing relevant documents from non-relevant ones.
Hence, IDF is incorporated: it diminishes the weight of terms that occur very frequently in the
collection and increases the weight of terms that occur rarely. This makes 𝑇𝐹 ∗ 𝐼𝐷𝐹 a better
weighting technique. On top of that, we apply similarity measures to calculate the distance between
document i and query j. There are a number of similarity measures; the most common ones are
Euclidean distance, inner (dot) product, cosine similarity, Dice similarity, Jaccard similarity, etc.
Similarity Measure
We now have vectors for all documents in the collection, and a vector for the query. A similarity
measure is a function that computes the degree of similarity or distance between a document vector
and the query vector. Using a similarity measure between the query and each document:
– It is possible to rank the retrieved documents in the order of presumed relevance.
– It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
Postulate: Documents that are “close together” in the vector space talk about the same things and
are more similar than others.
1. If d1 is near d2, then d2 is near d1.
2. If d1 near d2, and d2 near d3, then d1 is not far from d3.
3. No document is closer to d than d itself.
Sometimes it is a good idea to determine the maximum possible similarity as the “distance” between
a document d and itself. A similarity measure attempts to compute the distance between a document
vector wj and the query vector wq. The assumption here is that documents whose vectors are close to
the query vector are more relevant to the query than documents whose vectors are far away from the
query vector.
We can use different techniques to compute similarity/relatedness of query and documents. E.g.
Euclidean distance, Dot product, cosine similarity, etc.
I. Euclidean distance
It is a commonly used distance measure. Euclidean distance is the square root of the sum of squared
differences between the corresponding coordinates of the document and query vectors.
Similarity between vectors for the document dj and query q can be computed as:
$$sim(d_j, q) = |d_j - q| = \sqrt{\sum_{i=1}^{n} (w_{ij} - w_{iq})^{2}}$$
where 𝑤𝑖𝑗 is the weight of term i in document j and 𝑤𝑖𝑞 is the weight of term i in the query q
Example: Determine the Euclidean distance between the document 1 vector (0, 3, 2, 1, 10) and query
vector (2, 7, 1, 0, 0). 0 means corresponding term not found in document or query.
√((0 − 2)² + (3 − 7)² + (2 − 1)² + (1 − 0)² + (10 − 0)²) = √122 ≈ 11.05
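For reference, the same computation as a minimal Python sketch:

```python
import math

def euclidean(d, q):
    """Euclidean distance between a document vector and a query vector."""
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))

print(round(euclidean([0, 3, 2, 1, 10], [2, 7, 1, 0, 0]), 2))   # 11.05
```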
II. Dot product
The dot product is also known as the scalar product or inner product. It is computed by multiplying
the corresponding term weights of the query and document vectors and summing the results.
Similarity between vectors for the document dj and query q can be computed as the vector inner product:
$$sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{n} w_{ij} \times w_{iq}$$
III. Cosine similarity
Cosine similarity measures the cosine of the angle between the document vector and the query vector:
the dot product normalized by the lengths of the two vectors. For term-weighted vectors:
$$CosSim(d_j, q) = \frac{d_j \cdot q}{|d_j|\,|q|} = \frac{\sum_{i=1}^{n} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{n} w_{ij}^{2}}\; \sqrt{\sum_{i=1}^{n} w_{iq}^{2}}}$$
Example: let the query and document vectors be Q = (0.4, 0.8), D1 = (0.8, 0.3), and D2 = (0.2, 0.7).

CosSim(D1, Q) = (0.4 × 0.8 + 0.8 × 0.3) / √([(0.4)² + (0.8)²] × [(0.8)² + (0.3)²])
             = (0.32 + 0.24) / √((0.16 + 0.64) × (0.64 + 0.09))
             = 0.56 / √(0.8 × 0.73) = 0.56 / √0.584 = 0.56 / 0.764 ≈ 0.73

CosSim(D2, Q) = (0.4 × 0.2 + 0.8 × 0.7) / √([(0.4)² + (0.8)²] × [(0.2)² + (0.7)²])
             = (0.08 + 0.56) / √((0.16 + 0.64) × (0.04 + 0.49))
             = 0.64 / √(0.8 × 0.53) = 0.64 / √0.424 = 0.64 / 0.65 ≈ 0.98

So cos θ1 ≈ 0.73 and cos θ2 ≈ 0.98: D2 is judged more similar to the query than D1.
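The same example as a minimal Python sketch, with the dot product factored out so the relationship between the two measures is explicit:

```python
import math

def dot(a, b):
    """Inner (dot) product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(d, q):
    """Cosine similarity: the dot product normalized by the vector lengths."""
    return dot(d, q) / (math.sqrt(dot(d, d)) * math.sqrt(dot(q, q)))

Q, D1, D2 = [0.4, 0.8], [0.8, 0.3], [0.2, 0.7]
print(round(cosine(D1, Q), 2))   # 0.73
print(round(cosine(D2, Q), 2))   # 0.98 -- D2 is ranked as more similar to the query
```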