Cross Lingual Information Retrieval and Error Tracking in Search Engine
by
Saurabh Garg
Roll No: 140070003
RnD Project
Declaration

I declare that this written submission represents my ideas in my own words and where others' ideas or words have been included, I have adequately cited and referenced the original sources. I also declare that I have adhered to all principles of academic honesty and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I understand that any violation of the above will be cause for disciplinary action by the Institute and can also evoke penal action from the sources which have thus not been properly cited or from whom proper permission has not been taken when needed.
Acknowledgements

I am thankful to the people who have been instrumental in helping me throughout this project. First and foremost, I express my sincere gratitude towards my supervisor Prof. Pushpak Bhattacharyya for his guidance; my research in Cross Lingual Information Retrieval is driven by his vision and support. I thank Arjun Atreya and Maulik Vacchani for their guidance throughout this project. For the work and experience that we have shared, I wholeheartedly thank the Sandhan team members at IITB. I am also thankful to my friends, family and teachers who have always been there for me whenever I needed them.
Chapter 1
Figure 1.1: An example inverted index

Dictionary    Postings list
abc           1 -> 2 -> 5 -> 6
xyz           4 -> 5 -> 8
pqrt          2 -> 5 -> 8
...           ...
Postings list: each document ID is called a posting, and the set of document IDs for a term is its postings list. A basic inverted index is a dictionary of terms, each of which is associated with a postings list. The document frequency of a term is the length of its postings list. See Figure 1.1 for an example.
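To make the structure concrete, here is a minimal sketch in Java of building such an index; the toy corpus, class name and tokenization rule are purely illustrative and not the implementation used in Sandhan.

import java.util.*;

public class InvertedIndex {
    // Dictionary: term -> sorted set of document IDs (its postings list).
    private final Map<String, TreeSet<Integer>> index = new HashMap<>();

    public void add(int docId, String text) {
        // Very naive tokenization: lowercase and split on non-letters.
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Document frequency = length of the postings list.
    public int documentFrequency(String term) {
        TreeSet<Integer> postings = index.get(term);
        return postings == null ? 0 : postings.size();
    }

    public SortedSet<Integer> postings(String term) {
        return index.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "cross lingual information retrieval");
        idx.add(2, "error tracking in a search engine");
        idx.add(3, "information retrieval evaluation");
        System.out.println(idx.postings("retrieval"));          // [1, 3]
        System.out.println(idx.documentFrequency("retrieval")); // 2
    }
}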
1.2 Term vocabulary and postings lists
1.2.1 Choosing a document unit

To construct an inverted index, the first task is to choose a document unit. In simple cases each file is regarded as a document unit. But there are many exceptions: a traditional Unix email file stores a sequence of emails, each of which should be regarded as a separate document, and present-day emails may contain attached documents that should also be treated as separate documents. Going in the opposite direction, various pieces of web software (such as latex2html) take things that you might regard as a single document (e.g., a PowerPoint file or a LaTeX document) and split them into separate HTML pages for each slide or subsection, stored as separate files. In these cases, you might want to combine multiple files into a single document.
More generally, for very long documents, the issue of indexing granularity arises. For a collection of books, it would usually be a bad idea to index an entire book as a document. A search for Indian languages might bring up a book that mentions India in the first chapter and languages in the last chapter, but this does not make it relevant to the query. Instead we might say that each chapter, each paragraph or even each line should be treated as a separate document. There is clearly a precision/recall trade-off here: if the units are too small, we are likely to miss important passages because terms are distributed over several mini-documents, while if the units are too large we tend to get spurious matches and the relevant information is hard for the user to find.
1.2.2 Determining the vocabulary of terms
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Example:

Input: Hey! How are you?
Output: Hey How are you

A token[11] is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a type that is included in the IR system's dictionary.
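A minimal Java sketch of this whitespace-and-punctuation tokenizer follows; the regular expressions are an illustrative assumption, and, as discussed below, a production tokenizer needs to be language-aware.

import java.util.*;

public class SimpleTokenizer {
    // Splits on whitespace and strips leading/trailing punctuation from each token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            String token = piece.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hey! How are you?")); // [Hey, How, are, you]
    }
}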
The major question in the tokenization phase is: what are the correct tokens to use? In the example above it looks fairly trivial: you chop on whitespace and throw away punctuation characters. This is a starting point, but even for English there are a number of tricky cases. For example, what do you do about the various uses of the apostrophe for possession and contractions? For a term like aren't, we cannot simply split at the apostrophe and keep aren and t. These issues of tokenization are language-specific, so the language of the document needs to be known. Language identification based on classifiers that use short character subsequences as features is highly effective; most languages have distinctive signature patterns.
Another problem is hyphenation and spacing. Hyphens are used for various purposes, ranging from splitting up vowels in words (co-education) to joining nouns as names (Hewlett-Packard). Handling hyphens automatically can thus be complex: it can either be treated as a classification problem or, more commonly, handled by heuristic rules, such as allowing short hyphenated prefixes on words but not longer hyphenated forms. Splitting on white space can also split what should be regarded as a single token. This occurs most commonly with names (San Francisco, Los Angeles), but also with borrowed foreign phrases (au fait) and compounds that are sometimes written as a single word and sometimes space-separated (such as white space vs. whitespace). Queries like lowercase, lower-case and lower case should all return the same results. One effective strategy in practice, used by some Boolean retrieval systems such as Westlaw and Lexis-Nexis, is to encourage users to enter hyphens wherever they may be possible; whenever there is a hyphenated form, the system generalizes the query to cover all three of the one-word, hyphenated, and two-word forms, so that a query for lower-case will search for "lower-case" OR "lower case" OR "lowercase".
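A small sketch of this query generalization, assuming the user typed the hyphenated form; the OR syntax here is generic and not Westlaw's actual query language.

public class HyphenQueryExpander {
    // "lower-case" -> ("lower-case" OR "lower case" OR "lowercase")
    public static String expand(String hyphenated) {
        String twoWords = hyphenated.replace('-', ' ');
        String oneWord = hyphenated.replace("-", "");
        return "(\"" + hyphenated + "\" OR \"" + twoWords + "\" OR \"" + oneWord + "\")";
    }

    public static void main(String[] args) {
        System.out.println(expand("lower-case"));
    }
}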
Different languages also present different problems, such as French (uses of the apostrophe) and Chinese (no space delimiter).
1.2.3 Stop words

Stop words are words that carry too little significance to be useful in search queries. They are usually filtered out from search queries because they would otherwise return vast amounts of unnecessary information. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), take the most frequent terms, often hand-filtered, and remove them when creating the index. Removing stop words can, however, sometimes hurt: the query "Prime Minister of India" becomes "Prime Minister" and "India" if we remove "of".
1.2.4 Normalization
Token normalization is the process of canonicalizing tokens so that matches occur despite
superficial differences in the character sequences of the tokens. The most standard way to
normalize is to implicitly create equivalence classes, which are normally named after one
member of the set. An alternative to creating equivalence classes is to maintain relations
between un-normalized tokens.
Some normalization forms:

1. Accents and diacritics

2. Capitalization/case-folding
Query term    Terms in documents that should be matched
Windows       Windows
Stemming and lemmatization: the goal of both is to reduce inflectional and derivationally related forms of a word to a common base form. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. The most common algorithm for stemming English is Porter's algorithm, which consists of five phases of word reduction applied in sequence. Example:

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Porter stemmer output: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret
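The full Porter algorithm applies five phases of carefully ordered rules; the sketch below only strips a handful of common suffixes to illustrate the idea of crude suffix stripping. It is not a faithful Porter implementation, and in practice one would use an existing Porter or Snowball stemmer.

public class NaiveStemmer {
    // Illustrative only: strips a few common English suffixes, longest first.
    private static final String[] SUFFIXES = {"ational", "ization", "fulness",
            "tional", "ement", "ness", "able", "ible", "ing", "ed", "ly", "s"};

    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("variations")); // variation (full Porter gives "variat")
        System.out.println(stem("easily"));     // easi
    }
}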
Consider a query like "Prime Minister India". It does not make sense to split it on spaces and search for "Prime" and "Minister" separately. To support such phrase queries, it is no longer sufficient for postings lists to be simply lists of documents that contain individual terms.

Biword indexes: one approach is to index every pair of consecutive terms as a phrase; for the query above we generate the two vocabulary terms "Prime Minister" and "Minister India".
This can work fairly well, but there can be false positives, so we often first do part-of-speech tagging. The concept of biwords can also be extended to longer sequences of words; if the index includes variable-length word sequences, it is called a phrase index.
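A sketch of biword generation for indexing, under the simplifying assumption that every consecutive pair of tokens is kept; as noted above, a real system would typically keep only pairs selected with the help of part-of-speech tagging.

import java.util.*;

public class BiwordIndexer {
    // Generates all consecutive word pairs ("biwords") from a tokenized text.
    public static List<String> biwords(List<String> tokens) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            pairs.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("Prime", "Minister", "India");
        System.out.println(biwords(query)); // [Prime Minister, Minister India]
    }
}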
Given an inverted index and a query, our first task is to determine whether each query term exists in the vocabulary and, if so, to identify the pointer to the corresponding postings.[8] The vocabulary lookup requires one of two data structures for efficiency: hashing or search trees.
Hashing has been used for dictionary lookup in some search engines. Each vocabulary term (the key) is hashed into an integer over a large enough space that hash collisions are unlikely; collisions, if any, are resolved by auxiliary structures such as bins. At query time, we hash each query term separately and follow a pointer to the corresponding postings, taking into account any logic for resolving hash collisions.
The best-known search tree is the binary tree, in which each internal node has two children.
The search for a term begins at the root of the tree. Each internal node (including the root)
represents a binary test, based on whose outcome the search proceeds to one of the two
sub-trees below that node.
A wildcard query is useful when the user is unsure of the letters in between, or is unaware of some letters at a few positions in a term. A query such as nam* is known as a trailing wildcard query, because the * symbol occurs only once, at the end of the search string. A search tree on the dictionary is a convenient way of handling trailing wildcard queries: we walk down the tree following the symbols n, a and m in turn, at which point we can enumerate the set W of terms in the dictionary with the prefix nam. Finally, we use |W| lookups on the standard inverted index to retrieve all documents containing any term in W. We can also maintain a reverse B-tree on the dictionary to handle queries like *nam, which are then handled in a similar way.
Thus, using a normal B-tree and a reverse B-tree, we can handle a more general case like nam*fl by querying both trees and intersecting the two result sets. To handle general wildcard queries we use a permuterm index. First, we introduce a special symbol $ into our character set to mark the end of a term. Next, we construct a permuterm index, in which the various rotations of each term (augmented with $) all link to the original vocabulary term. Consider the wildcard query m*n. The key is to rotate the wildcard query so that the * symbol appears at the end of the string; the rotated query becomes n$m*. Next, we look up this string in the permuterm index.
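A sketch of building permuterm rotations and rotating a single-* wildcard query so that the * ends up at the end; a sorted map with prefix lookup stands in for the B-tree walk, which is an implementation simplification.

import java.util.*;

public class PermutermIndex {
    // Maps each rotation of term+"$" back to the original vocabulary term.
    private final TreeMap<String, String> rotations = new TreeMap<>();

    public void add(String term) {
        String augmented = term + "$";
        for (int i = 0; i < augmented.length(); i++) {
            rotations.put(augmented.substring(i) + augmented.substring(0, i), term);
        }
    }

    // Rotate a single-* wildcard query so that * is at the end, e.g. "m*n" -> "n$m".
    static String rotateQuery(String wildcard) {
        int star = wildcard.indexOf('*');
        return wildcard.substring(star + 1) + "$" + wildcard.substring(0, star);
    }

    // All terms whose permuterm rotations start with the rotated query prefix.
    public Set<String> lookup(String wildcard) {
        String prefix = rotateQuery(wildcard);
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> e : rotations.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            result.add(e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        PermutermIndex idx = new PermutermIndex();
        idx.add("man");
        idx.add("moron");
        idx.add("moon");
        idx.add("mat");
        System.out.println(idx.lookup("m*n")); // [man, moon, moron]
    }
}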
Whereas the permuterm index is simple, it can lead to a considerable blowup because of the number of rotations per term. An alternative is a k-gram index. A k-gram is a sequence of k characters. In a k-gram index, the dictionary contains all k-grams that occur in any term in the vocabulary, and each postings list points from a k-gram to all vocabulary terms containing that k-gram.
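A short sketch of extracting the k-grams of a term, with $ marking the term boundaries as is commonly done; a k-gram index would then map each k-gram to all terms containing it.

import java.util.*;

public class KGrams {
    // 3-grams of "castle" (with boundary markers): $ca, cas, ast, stl, tle, le$
    public static List<String> kgrams(String term, int k) {
        String padded = "$" + term + "$";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + k <= padded.length(); i++) {
            grams.add(padded.substring(i, i + k));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(kgrams("castle", 3)); // [$ca, cas, ast, stl, tle, le$]
    }
}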
There are two specific forms of spelling correction, referred to as isolated-term correction and context-sensitive correction. In isolated-term correction, we attempt to correct a single query term at a time, even when we have a multiple-term query.
Techniques for isolated-term correction:

1. Edit distance: given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2. It is computed with dynamic programming (see the sketch after this list).

2. k-gram indexes: a k-gram index is used to retrieve vocabulary terms that have many k-grams in common with the query.
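The standard dynamic-programming (Levenshtein) edit distance referenced in item 1 above, with insert, delete and substitute each costing 1, which is one common convention.

public class EditDistance {
    // dp[i][j] = edit distance between the first i chars of s1 and the first j chars of s2.
    public static int distance(String s1, String s2) {
        int[][] dp = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) dp[i][0] = i;   // i deletions
        for (int j = 0; j <= s2.length(); j++) dp[0][j] = j;   // j insertions
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(dp[i - 1][j - 1] + cost,              // match/substitute
                        Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1));    // delete, insert
            }
        }
        return dp[s1.length()][s2.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("informatoin", "information")); // 2
    }
}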
Algorithms for such phonetic hashing are commonly and collectively known as soundex algorithms. The idea is:

1. Turn every term to be indexed into a 4-character reduced form. Build an inverted index from these reduced forms to the original terms; call this the soundex index.

2. Do the same with query terms.

3. When the query calls for a soundex match, search this soundex index.

The reduction algorithm is:

1. Retain the first letter of the term.

2. Change all occurrences of the letters A, E, I, O, U, H, W, Y to 0 (zero).

3. Change letters to digits as follows: B, F, P, V to 1; C, G, J, K, Q, S, X, Z to 2; D, T to 3; L to 4; M, N to 5; R to 6.

4. Repeatedly remove one of each pair of consecutive identical digits.

5. Remove all zeros from the resulting string. Pad the resulting string with trailing zeros and return the first four positions, which will consist of a letter followed by three digits.
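A sketch of the soundex reduction described above; the letter-to-digit mapping follows the standard soundex scheme.

public class Soundex {
    // Standard soundex digit classes; vowels and h, w, y map to '0'.
    private static char code(char c) {
        switch (Character.toUpperCase(c)) {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0';
        }
    }

    public static String soundex(String term) {
        String t = term.toUpperCase().replaceAll("[^A-Z]", "");
        if (t.isEmpty()) return "";
        StringBuilder digits = new StringBuilder();
        digits.append(t.charAt(0));                                   // retain the first letter
        for (int i = 1; i < t.length(); i++) digits.append(code(t.charAt(i)));
        StringBuilder dedup = new StringBuilder();                    // drop repeated digits
        for (int i = 0; i < digits.length(); i++) {
            if (i == 0 || digits.charAt(i) != digits.charAt(i - 1)) dedup.append(digits.charAt(i));
        }
        String result = dedup.toString().replace("0", "") + "000";    // drop zeros, pad
        return result.substring(0, 4);
    }

    public static void main(String[] args) {
        System.out.println(soundex("Herman"));  // H655
        System.out.println(soundex("Hermann")); // H655
    }
}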
1.4 Index Construction
Blocked sort-based indexing (BSBI) parses documents into termID-docID pairs and accumulates the pairs in memory until a block of a fixed size is full. We choose the block size to fit comfortably into memory so as to permit a fast in-memory sort. The block is then inverted and written to disk. Inversion involves two steps: first, we sort the termID-docID pairs; next, we collect all termID-docID pairs with the same termID into a postings list, where a posting is simply a docID. The result, an inverted index for the block we have just read, is then written to disk. In the final step, the algorithm simultaneously merges all the blocks into one large merged index.[10]
Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs, and for very large collections this data structure does not fit into memory. Single-pass in-memory indexing (SPIMI) avoids this. A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list: instead of first collecting all termID-docID pairs and then sorting them (as we did in BSBI), each postings list is dynamic (i.e., its size is adjusted as it grows) and is immediately available to collect postings. This has two advantages: it is faster because no sorting is required, and it saves memory because we keep track of the term a postings list belongs to, so the termIDs of postings need not be stored.
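A much simplified, single-block sketch of the SPIMI idea: postings are appended directly to a growing per-term list, and terms are sorted only when the block is written out. Block-to-disk handling and the final merge are omitted here.

import java.util.*;

public class SpimiBlock {
    // Term -> dynamically growing postings list (docIDs in arrival order).
    private final Map<String, List<Integer>> dictionary = new HashMap<>();

    public void addPosting(String term, int docId) {
        // No global termID table and no sort of term-docID pairs:
        // the posting goes straight onto the term's list.
        dictionary.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }

    // "Writing the block": emit terms in sorted order with their postings.
    public void writeBlock() {
        new TreeMap<>(dictionary).forEach((term, postings) ->
                System.out.println(term + " -> " + postings));
    }

    public static void main(String[] args) {
        SpimiBlock block = new SpimiBlock();
        block.addPosting("retrieval", 1);
        block.addPosting("error", 2);
        block.addPosting("retrieval", 3);
        block.writeBlock();
    }
}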
Collections are often so large that we cannot perform index construction efficiently on a single machine. This is particularly true of the World Wide Web, for which we need large computer clusters to construct any reasonably sized web index. Web search engines therefore use distributed indexing algorithms for index construction. The result of the construction process is a distributed index that is partitioned across several machines, either according to term or according to document. MapReduce, a general architecture for distributed computing, is designed for large computer clusters. The point of a cluster is to solve large computing problems on cheap commodity machines. A master node directs the process of assigning and reassigning tasks to individual worker nodes. The map and reduce phases of MapReduce split the computing job into chunks that standard machines can process in a short time.

First, the input data, in our case a collection of web pages, are split into n splits, where the size of each split is chosen to ensure that the work can be distributed evenly (chunks should not be too large) and efficiently. Splits are not preassigned to machines, but are instead assigned by the master node on an ongoing basis: as a machine finishes processing one split, it is assigned the next one. If a machine dies or becomes a laggard due to hardware problems, the split it is working on is simply reassigned to another machine. In general, MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs.
The map phase of MapReduce consists of mapping splits of the input data to key-value pairs. For the reduce phase, we want all values for a given key to be stored close together, so that they can be read and processed quickly. Collecting all values for a given key into one list is the task of the inverters in the reduce phase. The master assigns each term partition to a different inverter and, as in the case of the parsers, reassigns term partitions in case of failing or slow inverters.
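A toy, in-process illustration of this key-value recasting (this is not Hadoop code): the map phase emits (term, docID) pairs for each split, and the reduce phase collects all docIDs for a term into its postings list.

import java.util.*;

public class MiniMapReduceIndexer {
    // Map phase: parse one split (docID -> text) into (term, docID) pairs.
    static List<Map.Entry<String, Integer>> map(Map<Integer, String> split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        split.forEach((docId, text) -> {
            for (String term : text.toLowerCase().split("\\s+")) {
                pairs.add(new AbstractMap.SimpleEntry<>(term, docId));
            }
        });
        return pairs;
    }

    // Reduce phase: collect all values (docIDs) for each key (term).
    static Map<String, List<Integer>> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            postings.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<Integer, String> split = new HashMap<>();
        split.put(1, "cross lingual retrieval");
        split.put(2, "error tracking in retrieval");
        System.out.println(reduce(map(split))); // {cross=[1], error=[2], ..., retrieval=[1, 2], ...}
    }
}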
1.5 Index Compression

1.5.1 Heaps' law: estimating the number of terms
Heaps' law estimates the number of distinct terms M (the vocabulary size) as a function of the collection size T (the number of tokens):

M = kT^b

where k and b are parameters that depend on the collection; typical values are 30 <= k <= 100 and b ~ 0.5.
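As a worked example, taking the parameter values commonly reported for the Reuters-RCV1 collection (k ~ 44, b ~ 0.49), which are quoted here for illustration rather than measured on our own data:

M = 44 x (1,000,000)^0.49 ~ 38,000

i.e., roughly 38,000 distinct terms are predicted after the first million tokens.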
1.5.2 Dictionary Compression

Using fixed-width entries for dictionary terms is wasteful. One can overcome this shortcoming by storing the dictionary terms as one long string of characters; the pointer to the next term then also demarcates the end of the current term. One can further compress the dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block; the length of each term is stored as an additional byte at the beginning of the term. A further source of redundancy in the dictionary is that consecutive entries in an alphabetically sorted list share common prefixes. This observation leads to front coding.
Chapter 2
A web search engine is a software system that is designed to search for information on the World Wide Web. The search results are generally presented as a list of results, often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. A search system consists of two parts, viz. offline processing and online processing.

2.1 Offline processing
2.1.1 Crawling
The injection step injects the seed URLs into the crawling pipeline; this step occurs only once. In the generation step, the crawler generates the list of URLs to be fetched, by scoring the URLs with some metric and selecting the top N. In the fetching step, the web pages for the URLs produced by the generate phase are downloaded. In the parse step, the downloaded pages are parsed and their hyperlinks are extracted to be given to the generate phase again. In the updateDB step, the crawl DB is updated with the newly discovered URLs.
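A highly simplified sketch of this generate-fetch-parse-update cycle; the class and method names are illustrative and do not correspond to the actual crawler's API.

import java.util.*;
import java.util.stream.Collectors;

public class CrawlCycle {
    // crawlDB: URL -> score used by the generate step.
    private final Map<String, Double> crawlDb = new HashMap<>();

    public void inject(Collection<String> seedUrls) {                 // Inject (runs once)
        for (String url : seedUrls) crawlDb.putIfAbsent(url, 1.0);
    }

    public List<String> generate(int topN) {                          // Generate: top N by score
        return crawlDb.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(topN).map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Fetch + Parse are stubbed out; a real crawler downloads the page
    // and extracts its outgoing hyperlinks.
    public List<String> fetchAndParse(String url) {
        return Collections.emptyList();
    }

    public void updateDb(Collection<String> discoveredUrls) {         // UpdateDB
        for (String url : discoveredUrls) crawlDb.putIfAbsent(url, 0.5);
    }

    public void crawl(Collection<String> seeds, int rounds, int topN) {
        inject(seeds);
        for (int i = 0; i < rounds; i++) {
            for (String url : generate(topN)) {
                updateDb(fetchAndParse(url));
            }
        }
    }
}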
2.1.2 Indexing
In this phase, the downloaded pages are sent to the Solr indexer to be indexed.
2.2 Online processing

Online query processing involves stages such as stop word removal, stemming and query formulation. Some search engines also perform named entity recognition, multi-word recognition or word sense disambiguation to enhance query processing. This processing is done as a pipeline, where the output of one stage is fed as input to the next stage.
Consider the diagram in Figure 2.2: a user's query enters the pipeline, and the first stage is stop word removal, where stop words like the, is, etc. are removed as explained earlier. The second stage is stemming, where each word is stemmed[5] and then passed on to the next stages for named entity and multi-word entity recognition against a pre-computed dictionary. From this, a new Boolean query is created, which is used to fetch results. The quality of the result set is a function of the accuracy of the individual modules.
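A sketch of such a pipeline in Java; the stage names follow the description above, but the interfaces, the toy stemmer and the final Boolean-query syntax are assumptions for illustration only.

import java.util.*;
import java.util.function.UnaryOperator;

public class QueryPipeline {
    // Each stage transforms the list of query tokens and passes it on to the next stage.
    private final List<UnaryOperator<List<String>>> stages = new ArrayList<>();

    public QueryPipeline addStage(UnaryOperator<List<String>> stage) {
        stages.add(stage);
        return this;
    }

    public List<String> process(List<String> tokens) {
        for (UnaryOperator<List<String>> stage : stages) tokens = stage.apply(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is", "of", "in"));
        QueryPipeline pipeline = new QueryPipeline()
                .addStage(tokens -> {                                 // stop word removal
                    List<String> out = new ArrayList<>(tokens);
                    out.removeIf(t -> stopWords.contains(t.toLowerCase()));
                    return out;
                })
                .addStage(tokens -> {                                 // stemming (placeholder)
                    List<String> out = new ArrayList<>();
                    for (String t : tokens) out.add(t.toLowerCase());
                    return out;
                });
        // NE and multi-word entity recognition stages would be added the same way.
        List<String> result = pipeline.process(Arrays.asList("Prime", "Minister", "of", "India"));
        System.out.println(String.join(" AND ", result));             // prime AND minister AND india
    }
}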
Chapter 3

Fora for IR Evaluation

Evaluation fora such as FIRE, CLEF, TREC and NTCIR provide a platform to evaluate system performance in terms of precision, recall, MAP value, etc. However, these measures indicate the end-to-end performance of the system and do not evaluate the performance of individual modules. Since the architecture is pipelined, errors propagate and multiply, so tracking the root cause of an error is important.
3.1 FIRE
Aims of FIRE[4]:

1. To build large-scale test collections for Indian language information retrieval experiments.

2. To provide a common evaluation infrastructure for comparing different information retrieval systems.
3.2 CLEF
CLEF[3] provides an evaluation infrastructure for testing and comparing multilingual and cross-language information retrieval systems, with a focus on European languages.
3.3 TREC
Goals of TREC[1]:

1. To encourage research in information retrieval based on large test collections.

2. To increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas.

3. To speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems.

4. To increase the availability of appropriate evaluation techniques for use by industry and academia, including the development of new evaluation techniques more applicable to current systems.
3.4 NTCIR
NTCIR[2] aims to provide a research infrastructure for the large-scale evaluation of information access technologies, such as information retrieval and question answering, with a focus on East Asian languages and English.

Tracker is a tool that captures the input and output information of each stage of the pipeline and displays it to the assessor/developer.
Chapter 4
4.1 Errors in the query processing pipeline

• In stop word removal, consider the case of a stop word not being detected by the module. This can boost non-relevant results to the top of the ranking because of the high frequency of the stop word in them. On the other hand, an important word may be wrongly removed as a stop word, so the results corresponding to it are lost.

• A named entity or multi-word entity might not be detected because of wrong stemming, which causes the loss of important proximity information.
Thus there is a need to track, and thereby correct, the errors of each module. Changing a module for each error and retesting the whole system after every change is quite time consuming. A simpler and more elegant solution is to have error detection and pseudo error correction built into the search engine. This involves correcting the output of a particular module temporarily, if needed, without making a change in the module itself, in order to detect errors in the subsequent modules.
4.2 Tracker
Tracker[6] comes with the capability of error detection and pseudo error correction after each module, if needed, within the user interface of the search engine itself. It was developed to assist developers and assessors in analyzing each module for errors and accordingly tuning the parameters that can improve the performance of the search engine. Tracker captures the input and output of each module for a query and displays them to the assessor/developer, who can then manually judge the outputs as correct or incorrect. This helps in detecting errors in modules.

Tracker also allows the assessor/developer to replace the output of a particular module with the correct output without actually modifying the module. Once the assessor re-submits the query after modifying the output of a particular stage, the new corrected output overrides the existing output of that stage and is fed as input to the next stage. This can be done incrementally to detect errors in all the modules. Finally, after analyzing the complete situation, the developer can make the necessary changes to the system.
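A conceptual sketch of what Tracker does around each stage; all class and method names here are hypothetical and much simpler than the actual Sandhan/Solr implementation described in the next chapter.

import java.util.*;
import java.util.function.UnaryOperator;

public class TrackedPipeline {
    public static class StageRecord {
        public final String stageName, input, output;
        public StageRecord(String stageName, String input, String output) {
            this.stageName = stageName; this.input = input; this.output = output;
        }
    }

    private final List<StageRecord> trace = new ArrayList<>();        // shown to the assessor
    private final Map<String, String> overrides = new HashMap<>();    // pseudo error corrections

    // Runs one stage, but lets an assessor-supplied override replace its output.
    public String runStage(String stageName, String input, UnaryOperator<String> stage) {
        String output = overrides.containsKey(stageName) ? overrides.get(stageName) : stage.apply(input);
        trace.add(new StageRecord(stageName, input, output));
        return output;
    }

    // The assessor corrects the output of one stage before re-firing the query.
    public void overrideStage(String stageName, String correctedOutput) {
        overrides.put(stageName, correctedOutput);
    }

    public List<StageRecord> trace() {
        return Collections.unmodifiableList(trace);
    }
}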
Chapter 5
Tracker is developed in Java. It has a web interface that is linked to the results page of the search engine. When a query is fired, the input and output of each module are stored in an object. This object persists until the session expires. The values captured in the object are displayed to the assessors on the Tracker web interface. The assessor is allowed to modify the output of any module. The modified output is stored back in the object and is used to override the output of the stages for which a modification was made. The object stores the change information until either the session expires or the user fires a different query; in the latter case, the object is reused to store information about the new query. Implementing Tracker on Sandhan involved two kinds of changes:

(i) Changes in the user interface

(ii) Changes in the Solr code
5.1 User Interface Changes

Figure 5.1: Sandhan homepage with the added "Error Tracker" option

An overview of the user interface implementation details is given in Figure 5.2. When a user submits a query and clicks Search, a normal search happens and the user is shown the results, as before. But if the user clicks "Error Tracker", the normal results are shown along with the output of each stage of the pipeline; these outputs can be edited (for pseudo error correction) and the query can be re-fired to see how the results change due to the edited values.
23
5.2 Solr Code Changes

The current version of the Sandhan search engine uses Apache Solr 3.4. An overview of the implementation details is given in Figure 5.3. When a query is received, the Solr parser creates a name-value pair for each entry in the user query and adds it to a SolrParams object. These name-value pairs are used throughout the query-processing code, particularly in DisMaxQParser.java. When any field is changed by the user, corresponding changes are made to the XMLWriter object in XMLWriter.java to pass the changes back to the user interface.
5.3 Results
Consider the query in Figure 5.4. The output of the Solr code that is passed to the user interface contains some extra fields along with the originally retrieved documents and metadata, shown in Figure 5.5 (only the extra added fields are shown). The search results shown on the Sandhan user interface look like Figure 5.6.

Figure 5.6: Results of Error Tracker on Sandhan on firing a query with error tracking on

The first column specifies the level in the module hierarchy, the second denotes the module name, and the last column shows the output of the corresponding module. Since it is a pipeline architecture, the output of one module directly forms the input of the next module, and hence inputs are not shown explicitly. The assessors are allowed to edit one level at a time, which enables tracking incremental changes.
Thus, the interface output contains the fields necessary for error tracking and correction. It also contains a check box for turning tracking on or off: if it is selected and the query is re-fired after editing some fields, the changed fields overwrite the corresponding outputs in the query processing pipeline. Using these, a new Boolean query is generated and the search results then correspond to the updated query, as shown in Figure 5.7. If the tracking check box is not checked, the search is directed to the normal search without error tracking outputs. (The search results also include the relevant web pages, which are not shown here.)

Figure 5.7: Results of Error Tracker on Sandhan after editing NEs and re-firing the query
Chapter 6
Conclusion
In this report, I started with Boolean information retrieval theory and the offline and online processing in a search engine, followed by a list of fora for IR evaluation, and then the importance and need of an error tracker in a search engine, highlighting the benefits it can provide to assessors. Tracker facilitates the detection of errors in the different modules of the search engine. Pseudo error correction of outputs helps in discovering further errors in the system without making a change in the module. The various stages of query processing were also briefly explained along the way. This error tracker and corrector can be extended to large-scale applications.
Bibliography
[3] The CLEF Initiative (Conference and Labs of the Evaluation Forum) homepage. http://www.clef-initiative.eu/, 2000.
[8] C. Manning, P. Raghavan, and H. Schütze. Dictionaries and tolerant retrieval. In Introduction to Information Retrieval. Cambridge University Press, 2008.
[9] C. Manning, P. Raghavan, and H. Schütze. Index compression. In Introduction to Information Retrieval. Cambridge University Press, 2008.
[11] C. Manning, P. Raghavan, and H. Schütze. The term vocabulary and postings lists. In Introduction to Information Retrieval. Cambridge University Press, 2008.