OS Search Engine Comparison
Contents

1 Introduction
2 Background
  2.1 Document Collection
    2.1.1 Web Crawling
    2.1.2 TREC
  2.2 Indexing
  2.3 Searching and Ranking
  2.4 Retrieval Evaluation
3 Search Engines
  3.1 Features
  3.2 Description
  3.3 Evaluation
4 Methodology
  4.1 Document collections
  4.2 Performance Comparison Tests
  4.3 Setup
5 Tests
  5.1 Indexing
    5.1.1 Indexing Test over TREC-4 collection
  5.2 Searching
  5.3 Global Evaluation
6 Conclusions
Chapter 1
Introduction
incremental indexes. Other important factors to consider are the date of the last software update, the current version, and the activity of the project. These factors are important since a search engine that has not been updated recently may be difficult to customize to the needs of a particular website. These characteristics are useful to make a broad classification of the search engines and to narrow the spectrum of available alternatives. Afterward, it is important to consider the performance of these search engines with different loads of data, and also to analyze how it degrades when the amount of information increases. At this stage, it is possible to analyze the indexing time versus the amount of data, the amount of resources used during indexing, and the performance during the retrieval stage.
The present work is, to the best of our knowledge, the first study to cover a comparison of the main features of 17 search engines, as well as a comparison of their performance during the indexing and retrieval tasks with different document collections and several types of queries. The objective of this work is to serve as a reference for deciding which open source search engine best fits the particular constraints of the search problem to be solved.
Chapter 2 presents a background on the general concepts of Information Retrieval. Chapter 3 presents a description of the search engines used in this work. Chapter 4 describes the methodology used during the experiments. Sections 5.1 and 5.2 present the results of the different experiments conducted, and Section 5.3 the analysis of these results. Finally, Chapter 6 presents the conclusions.
Chapter 2
Background
The main idea is to satisfy the user's information need by searching the available material for information that seems relevant. In order to accomplish this, an IR system consists of several modules that interact with each other (see Figure 2.1). It can be described, in a general form, in terms of three main areas: Indexing, Searching, and Ranking:
Indexing In charge of building the index, the data structure (e.g., an inverted index) that allows the documents of the collection to be searched efficiently.

Searching In charge of processing the user query and retrieving, using the index, the documents that match it.
Ranking Although this is an optional task, it is also very important for retrieval. It is in charge of sorting the results, based on heuristics that try to determine which results best satisfy the user's need.
2.1.2 TREC
Other document collections have been generated, some of them for academic analysis. For example, the Text REtrieval Conference (TREC) [13] has created several document collections with different sizes and different types of documents, specifically designed for particular tasks. The tasks are divided into several tracks that characterize the objective of the study of that collection; in 2007, for example, TREC was divided into seven tracks.
Sample text (character positions 1-70): "It was open - wide, wide open - and I grew furious as I gazed upon it."

Table 2.1: Example of an inverted index based on a sample text. For every word, the list of occurrences is stored.
2.2 Indexing
Indexers differ in the functionalities offered. For example, some indexers store the full text of the collection in order to present the user with a sample of the text ("snippet") surrounding the search terms, while others use less space but are not able to give a snippet. Other indexers use techniques for reducing the size of the posting lists (e.g., block addressing, where the text is divided into blocks and the posting lists point to blocks instead of exact positions, grouping several occurrences into fewer entries), but the trade-off is that obtaining the exact position of a word may require extra work (in this case, a sequential scan over the desired block).
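To make the structure concrete, the following is a minimal Python sketch (illustrative only, not the implementation of any engine discussed here) of a positional inverted index in the spirit of Table 2.1: each term maps to its list of occurrences as (document, position) pairs. A block-addressing index would store block identifiers instead of exact positions.

from collections import defaultdict

def build_positional_index(documents):
    # documents: dict of doc_id -> text; a simple whitespace tokenizer
    # stands in for the much richer parsing done by real indexers.
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            index[token].append((doc_id, position))
    return index

docs = {1: "it was open wide wide open", 2: "the door was open"}
index = build_positional_index(docs)
print(index["open"])  # [(1, 2), (1, 5), (2, 3)]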
There are several pre-processing steps that can be performed over the text during the indexing stage. Some of the most commonly used are stopword elimination and stemming.
Some terms appear very frequently in the collection and are not relevant for the retrieval task (for example, in English, the words "a", "an", "are", "be", "for", . . . ); these are referred to as stopwords. Depending on the application and the language of the collection, the list of words can vary. A common practice, called stopword elimination, is to remove these words from the text and not index them, making the inverted index much smaller.
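A minimal sketch of stopword elimination applied during tokenization; the stopword list below is a tiny illustrative subset (real engines ship language-specific lists, and stemming would typically be applied to the surviving tokens afterwards):

import re

STOPWORDS = {"a", "an", "and", "are", "as", "be", "for", "it", "the", "was"}  # illustrative subset

def tokenize(text, stopwords=STOPWORDS):
    # Lowercase, keep alphabetic tokens, and drop stopwords before indexing.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]

print(tokenize("It was open - wide, wide open - and I grew furious"))
# ['open', 'wide', 'wide', 'open', 'i', 'grew', 'furious']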
All the pre-processing and the way of storing the inverted index affect the space required as well as the time used for indexing a collection. As mentioned before, depending on the application, it might be convenient to trade the time needed to build the index for a more space-efficient index. Also, the characteristics of the index will affect the searching tasks, which will be explained in the following section.
the word positions, and perform pattern matching over the resulting list, making the retrieval more complicated than simple boolean queries.
After performing the search over the index, it might be necessary to rank the results obtained in order to satisfy the user's need. This ranking stage might be optional, depending on the application, but for the Web search scenario it has become very important. The process of ranking must take into consideration several additional factors, besides whether the documents satisfy the query or not. For example, in some applications the size of the retrieved document might indicate its level of importance; in the Web scenario, another factor might be the "popularity" of the retrieved page (e.g., a combination of the number of in- and out-links, the age of the page, etc.); another is the location of the queried terms (e.g., whether they appear in the body or in the title of the document); etc.
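As a purely illustrative sketch of how such signals can be combined (the field names, weights, and formula below are assumptions for the example, not the ranking function of any engine discussed here):

def score(doc, query_terms, w_body=1.0, w_title=2.0, w_pop=1.0):
    # Combine body matches, title matches, and a precomputed
    # "popularity" signal (e.g., derived from in-/out-links).
    body = doc["body"].lower().split()
    title = doc["title"].lower().split()
    body_hits = sum(body.count(t) for t in query_terms)
    title_hits = sum(title.count(t) for t in query_terms)
    return w_body * body_hits + w_title * title_hits + w_pop * doc["popularity"]

docs = [
    {"title": "open source search", "body": "comparison of search engines", "popularity": 0.9},
    {"title": "weather report", "body": "search for rain", "popularity": 0.1},
]
ranked = sorted(docs, key=lambda d: score(d, ["search", "engines"]), reverse=True)
print([d["title"] for d in ranked])  # ['open source search', 'weather report']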
• Recall: Ratio between the relevant retrieved documents and the set of relevant documents.

  Recall = |Ra| / |R|
Figure 2.2: Example precision-recall plots; plot (b) compares the curves of three different engines.
• Precision: Ratio between the relevant retrieved documents and the set of retrieved documents.

  Precision = |Ra| / |A|
Since each of these values by itself may not be sufficient (e.g., a system might get full recall by retrieving all the documents in the collection), it is possible to analyze them in combination. For example, to analyze graphically the quality of a retrieval system, we can plot the average precision-recall curve (see Figure 2.2) and observe the behavior of the precision and recall of a system. This type of plot is also useful for comparing the retrieval of different engines. For example, in Figure 2.2(b) we observe the curves for 3 different engines. We can observe that Engine 2 has lower precision than the others at low recall, but as the recall increases, its precision does not degrade as fast as that of the other engines.
Another common measure is to calculate precision at certain document cut-offs, for example, the precision over the first 5 documents. This is usually called precision at n (P@n) and represents the quality of the answer, since the user is frequently presented with only the first n documents retrieved, and not with the whole list of results.
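For a single query, these set-based measures can be computed as follows (a small sketch with made-up document identifiers):

def precision_recall(retrieved, relevant):
    # Set-based measures: Precision = |Ra|/|A|, Recall = |Ra|/|R|.
    relevant = set(relevant)
    ra = sum(1 for d in retrieved if d in relevant)
    return ra / len(retrieved), ra / len(relevant)

def precision_at(n, retrieved, relevant):
    # P@n: precision computed over the first n retrieved documents.
    return sum(1 for d in retrieved[:n] if d in set(relevant)) / n

retrieved = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d5", "d7"}
print(precision_recall(retrieved, relevant))  # (0.5, 0.75)
print(precision_at(5, retrieved, relevant))   # 0.4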
Chapter 3
Search Engines
There are several open source search engines available to download and use. This study presents a list of the available search engines and an initial evaluation of them that gives a general overview of the alternatives. The criteria used in this initial evaluation were the development status, the current activity, and the date of the last update made to the search engine. We compared 29 search engines: ASPSeek, BBDBot, Datapark, ebhath, Eureka, ht://Dig, Indri, ISearch, IXE, Lucene, Managing Gigabytes (MG), MG4J, mnoGoSearch, MPS Information Server, Namazu, Nutch, Omega, OmniFind IBM Yahoo! Ed., OpenFTS, PLWeb, SWISH-E, SWISH++, Terrier, WAIS/freeWAIS, WebGlimpse, XML Query Engine, XMLSearch, Zebra, and Zettair.
Based on the information collected, it is possible to discard some projects because they are considered outdated (e.g., the last update is prior to the year 2000), the project is not maintained or has stalled, or it was not possible to obtain information about them. For these reasons we discarded ASPSeek, BBDBot, ebhath, Eureka, ISearch, MPS Information Server, PLWeb, and WAIS/freeWAIS.
In some cases, a project was rejected because of additional factors. For example, although the MG project (presented in the book "Managing Gigabytes" [18]) is one of the most important works in the area, it was not included in this work because it has not been updated since 1999. Another special case is the Nutch project. The Nutch search engine is based on the Lucene search engine, and is essentially an implementation that uses the API provided by Lucene; for this reason, only the Lucene project will be analyzed. Finally, XML Query Engine and Zebra were discarded since they focus on structured data (XML) rather than on semi-structured data such as HTML.
Therefore, the initial list of search engines that we wanted to cover in the present work was: Datapark, ht://Dig, Indri, IXE, Lucene, MG4J, mnoGoSearch, Namazu, OmniFind, OpenFTS, Omega, SWISH-E, SWISH++, Terrier, WebGlimpse (Glimpse), XMLSearch, and Zettair. However, in the preliminary tests we observed that the indexing times for Datapark, mnoGoSearch, Namazu, OpenFTS, and Glimpse were 3 to 6 times longer than those of the rest of the search engines for the smallest database, and hence we did not consider them in the final performance comparison either.
3.1 Features
Stop words Indicates whether the indexer can use a list of stop words in order to discard overly frequent terms.
Filetype The types of files the indexer is capable of parsing. The common
filetype of the engines analyzed was HTML.
Stemming If the indexer/searcher is capable of doing stemming operations
over the words.
Fuzzy Search Ability to solve queries in a fuzzy way, i.e., not necessarily matching the query exactly.
Sort Ability to sort the results by several criteria.
Ranking Indicates if the engine gives the results based on a ranking func-
tion.
Search Type The type of searches it is capable of doing, and whether it
accepts query operators.
Indexer Language The programming language used to implement the indexer. This information is useful in order to extend the functionalities or to integrate it into an existing platform.
License Determines the conditions for using and modifying the indexer
and/or search engine.
Table 3.2 presents a summary of the features of each of the search engines. In order to make a decision it is necessary to analyze the features as a whole, and to complement this information with the results of the performance evaluation.
3.2 Description
Each of the search engines that will be analyzed can be described briefly, based on who developed it, where it was developed, and the main characteristic that identifies it.
ht://Dig [16] is a set of tools that permits indexing and searching a website. It provides a command-line tool to perform searches, as well as a CGI interface. Although there are newer versions than the one used here, according to their website, version 3.1.6 is the fastest one.
IXE Toolkit is a set of modular C++ classes and utilities for indexing and querying documents. There is a commercial version from Tiscali (Italy), as well as a non-commercial version for academic purposes.
Indri [3] is a search engine built on top of the Lemur [4] project, a toolkit designed for research in language modeling and information retrieval. It was developed cooperatively by the University of Massachusetts and Carnegie Mellon University, in the USA.
Lucene [6] is a text search engine library that is part of the Apache Software Foundation. Since it is a library, several applications make use of it, e.g., the Nutch project [8]. In the present work, the simple applications bundled with the library were used to index the collection.
MG4J [7] (Managing Gigabytes for Java) is a full-text indexer for large collections of documents, developed at the University of Milano, Italy. As by-products, it offers general-purpose optimized classes for processing strings, bit-level I/O, etc.
Omega is an application built on top of Xapian [14], an open source probabilistic Information Retrieval library. Xapian is written in C++ but has bindings for several languages (Perl, Python, PHP, Java, TCL, C#).
IBM OmniFind Yahoo! Edition [2] is search software that enables rapid deployment of intranet search. It combines internal search, based on the Lucene search engine, with the possibility of searching the Internet using the Yahoo! search engine.
SWISH-E [11] (Simple Web Indexing System for Humans - Enhanced) is
an open source engine for indexing and searching. It is an enhanced version
of SWISH, written by Kevin Hughes.
SWISH++ [10] is an indexing and searching tool based on Swish-E, al-
though completely rewritten in C++. It has most of the features of Swish-E,
but not all of them.
3.3 Evaluation
As seen before, each search engine has multiple characteristics that differentiate it from the other engines. To compare the engines, we would like to have a well-defined qualification process that gives the user an objective grade indicating the quality of each search engine. The problem is that the choice of the "best" search engine depends on the particular needs of each user and on the main objective of the engine. For example, the evaluation can be tackled from the usability point of view, i.e., how simple it is to use the engine out-of-the-box, and how simple it is to customize it in order to have it running. This depends on the main characteristic of the search engine. For example, Lucene is intended to be an indexing and search API, but if you need the features of Lucene in a front-end you must focus on the subproject Nutch. Another possibility is to analyze common, quantifiable characteristics, such as indexing and searching performance; these features are easier to measure, but they must be analyzed with care since they are not the only relevant ones. For this reason, we present a comparison based on these quantifiable parameters (indexing time, index size, resource consumption, etc.).
Table 3.1: Initial characterization of the available open source search en-
gines.
Search Engine    Storage(f)  Filetype(e)  Sort(d)  Search Type(c)  Indexer Lang.(b)  License(a)
Datapark         2           1,2,3        1,2      2               1                 4
ht://Dig         1           1,2          1        2               1,2               4
Indri            1           1,2,3,4      1,2      1,2,3           2                 3
IXE              1           1,2,3        1,2      1,2,3           2                 8
Lucene           1           1,2,4        1        1,2,3           3                 1
MG4J             1           1,2          1        1,2,3           3                 6
mnoGoSearch      2           1,2          1        2               1                 4
Namazu           1           1,2          1,2      1,2,3           1                 4
Omega            1           1,2,4,5      1        1,2,3           2                 4
OmniFind         1           1,2,3,4,5    1        1,2,3           3                 5
OpenFTS          2           1,2          1        1,2             4                 4
SWISH-E          1           1,2,3        1,2      1,2,3           1                 4
SWISH++          1           1,2          1        1,2,3           2                 4
Terrier          1           1,2,3,4,5    1        1,2,3           3                 7
WebGlimpse(g)    1           1,2          1        1,2,3           1                 8,9
XMLSearch        1           3            3        1,2,3           2                 8
Zettair          1           1,2          1        1,2,3           1                 2

(a) License: 1:Apache, 2:BSD, 3:CMU, 4:GPL, 5:IBM, 6:LGPL, 7:MPL, 8:Commercial, 9:Free
(b) Indexer language: 1:C, 2:C++, 3:Java, 4:Perl, 5:PHP, 6:Tcl
(c) Search type: 1:phrase, 2:boolean, 3:wild card
(d) Sort: 1:ranking, 2:date, 3:none
(e) Filetype: 1:HTML, 2:plain text, 3:XML, 4:PDF, 5:PS
(f) Storage: 1:file, 2:database
(g) Commercial version only

Table 3.2: Main characteristics of the open source search engines analyzed (storage, file types parsed, sort criteria, search types, indexer language, and license).
Chapter 4
Methodology
We executed 5 different tests over the document collections. The first three experiments were conducted over the parsed document collection (TREC-4), and the last two experiments were conducted over the WT10g WebTREC document collection. The first test consisted of indexing the document collection with each of the search engines and recording the elapsed time as well as the resource consumption. The second test consisted of comparing the search time of the search engines that performed better during the indexing tests, and analyzing their performance with each of the collections. The third test consisted of comparing the indexing time required for making incremental indices. The indexing processes of all the search engines were performed sequentially, using the same computer. The fourth experiment consisted of comparing the indexing time for subcollections of different sizes from the WT10g, with the search engines that were capable of indexing the whole collection in the previous experiments. Finally, the fifth experiment consisted of analyzing the searching time, precision, and recall using a set of query topics over the full WT10g collection.
4.3 Setup
The main characteristics of the computer used were: a Pentium 4HT 3.2 GHz processor, 2.0 GB of RAM, and a SATA HDD, running Debian Linux (kernel 2.6.15). In order to analyze the resource consumption of every search engine during the indexing process, it was necessary to have a monitoring tool. There are some open source monitors available, for example "Load Monitor" [5] and "QOS" [9], but for this work a simple monitor was sufficient. For this reason, we implemented a simple daemon that logged the CPU and memory consumption of a given process at certain time intervals. Afterward, the information collected can easily be parsed to generate data that can be plotted with Gnuplot.
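The daemon used in this work is not reproduced here; the following is a minimal sketch of the same idea, assuming a Unix-like system: it polls ps for the CPU percentage (ps reports a lifetime average) and resident memory of a given process at fixed intervals, appending one line per sample to a plain-text log that Gnuplot can plot directly.

import subprocess
import sys
import time

def sample(pid):
    # Ask ps for %CPU and RSS (in KB) of the process, without headers.
    out = subprocess.run(["ps", "-o", "%cpu=,rss=", "-p", str(pid)],
                         capture_output=True, text=True, check=True).stdout.split()
    return float(out[0]), int(out[1])

def monitor(pid, logfile, interval=5.0):
    start = time.time()
    with open(logfile, "a") as log:
        while True:
            try:
                cpu, rss = sample(pid)
            except subprocess.CalledProcessError:
                break  # the monitored process has finished
            # One "elapsed_seconds cpu_percent rss_kb" line per sample.
            log.write(f"{time.time() - start:.1f} {cpu} {rss}\n")
            log.flush()
            time.sleep(interval)

if __name__ == "__main__":
    monitor(int(sys.argv[1]), sys.argv[2])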
Chapter 5
Tests
5.1 Indexing
The indexing tests consisted of indexing the document collections with each of the search engines and recording the elapsed time as well as the resource consumption (CPU, RAM, and index size on disk). After each phase, the resulting times were analyzed and only the search engines that had "reasonable" indexing times continued to be tested in the following phase with the larger collection. We arbitrarily defined the concept of "indexers with reasonable indexing time", based on the preliminary observations, as the indexers whose indexing time was no more than 20 times that of the fastest indexer.
Indexing Time
Figure 5.1: Indexing time for document collections of different sizes (750MB,
1.6GB, and 2.7GB) of the search engines that were capable of indexing all
the document collections.
A first observation is that all of the search engines that used a database for storing the index had indexing times much larger than the rest of the search engines. For the 750MB collection, the search engines had indexing times between 1 and 32 minutes. Then, with the 1.6GB collection, their indexing times ranged from 2 minutes to 1 hour. Finally, with the 2.7GB collection, the indexing time of the search engines, with the exception of Omega, was between 5 minutes and 1 hour. Omega showed a different behavior from the others, since its indexing time for the largest collection was 17 hours and 50 minutes.
Resource consumption was monitored on the server that was used during the tests. We observed that the CPU consumption of the search engines remained constant during the indexing stage, using almost 100% of the CPU. On the other hand, we observed 6 different behaviors in the RAM usage: constant (C), linear (L), step (S), and combinations of them: linear-step (L-S), linear-constant (L-C), and step-constant (S-C). ht://Dig, Lucene, and XMLSearch had a steady usage of RAM during the whole process. MG4J, Omega, Swish-E, and Zettair presented a linear growth in their RAM usage, while Swish++ presented a step-like behavior, i.e., it started using some memory, maintained that usage for a period of time, and then continued using more RAM. Indri had a linear growth in RAM usage, then abruptly decreased the amount used, and then started using more RAM in a linear way. Terrier's behavior was a step-like growth followed by an abrupt descent, after which it kept its RAM usage constant until the end of the indexing. Finally, Omega's behavior was a linear growth, but when it reached 1.6GB of RAM usage, it maintained a constant usage until the end of the indexing.
Index Size
In Table 5.2 we present the size of the indices created by each of the search engines that were able to index the three collections in reasonable time. We can observe 3 groups: indices whose size ranges between 25% and 35% of the collection size, a group using 50%-55%, and a last group that used more than 100% of the size of the collection.
We also compared the time needed for making incremental indices, using three sets of different sizes: 1%, 5%, and 10% of the initial collection. We started from the indices created for the 1.6GB collection, and each of the new sets contained documents that were not included before. We compared ht://Dig, Indri, IXE, Swish-E, and Swish++. In Figure 5.2 we present a graph comparing their incremental indexing times.

Figure 5.2: Incremental indexing time of ht://Dig, Indri, IXE, Swish-E, and Swish++.
Table 5.1: Maximum CPU and RAM usage, RAM behavior, and index
size of each search engine, when indexing collections of 750MB, 1.6GB, and
2.7GB.
Table 5.2: Index size of each search engine, when indexing collections of
750MB, 1.6GB, and 2.7GB.
The next test was performed with the whole WT10g collection (10.2 GB). Only Indri, IXE, MG4J, Terrier, and Zettair could index the whole collection with a linear growth in time (compared to their corresponding indexing times in the previous tests). The other search engines did not scale appropriately or crashed due to lack of memory. ht://Dig and Lucene took more than 7 times their expected indexing time and more than 20 times that of the fastest search engine (Zettair), while Swish-E and Swish++ crashed due to an "out of memory" error.
Based on these results, we analyzed the indexing time with subcollections of the original collection of different sizes (2.4GB, 4.8GB, 7.2GB, and 10.2GB). In Figure 5.3 we present a comparison of the indexing time for each of the search engines that were capable of indexing the entire collection. We can observe that these search engines scaled linearly as the collection grew.

Figure 5.3: Indexing time of Indri, IXE, MG4J, Terrier, and Zettair for WT10g subcollections of 2.4GB, 4.8GB, 7.2GB, and 10.2GB.
5.2 Searching
The searching tests are based on a set of queries that must be answered, comparing the level of "correct" results that each engine retrieves. Depending on the collection and the set of queries, this idea of "correct"
results will be defined. In order to obtain the set of queries to use, we can identify three approaches: using a query log, generating random queries from the words contained in the documents, or using a predefined set of queries related to the document collection.
The first approach, using a query log, seems attractive since it tests the engines in a "real-world" situation. The problem with this approach is that, in order to be really relevant, it must be tested with a set of pages that are related to the query log, i.e., we would need to obtain a set of crawled pages together with the query logs that were used over these documents. Since the first tests use the TREC-4 collection, which is based on a set of news articles, we do not have a query log relevant to these documents. For this reason we used a set of randomly created queries (more detail in Section 5.2.1), based on the words contained in the documents and using different word distributions. Finally, the most complete test environment can be obtained by using a predefined set of queries related to the document collection. These queries are used in the second set of experiments, which operate over the WT10g collection created for the TREC evaluation. This approach seems to be the most complete and closest to the real-world situation, within a controlled environment.
For the reasons mentioned above, we used a set of randomly generated queries over the TREC-4 collection, and a set of topics and relevance judgments for the WT10g TREC collection.
The Searching Tests were conducted using the three document collections, with the search engines that performed best during the Indexing Tests (i.e., ht://Dig, Indri, IXE, Lucene, MG4J, Swish-E, Swish++, Terrier, XMLSearch, and Zettair). These tests consisted of creating 1-word and 2-word queries from the dictionary obtained from each of the collections, and
then analyzing the search time of each of the search engines, as well as the "retrieval percentage". The "retrieval percentage" is the ratio between the number of documents retrieved by a search engine and the maximum number of documents retrieved by any of the search engines.
In order to create the queries, we chose 1 or 2 words at random from the dictionary of words that appeared in each of the collections (excluding stopwords), using several word distributions (e.g., the original word frequency distribution of the collection and a uniform distribution).
The queries used on each of the collections considered the dictionary and word distribution particular to that collection. The word frequencies of all of the collections followed a Zipf law.
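A sketch of how such queries can be generated, assuming the per-collection dictionary has already been extracted; drawing words with their observed frequencies as weights approximates the original (Zipf-like) distribution, while uniform sampling gives every word the same chance:

import random
from collections import Counter

def build_dictionary(texts, stopwords=frozenset()):
    # Word -> frequency over the collection, excluding stopwords.
    counts = Counter()
    for text in texts:
        counts.update(t for t in text.lower().split() if t not in stopwords)
    return counts

def random_queries(counts, n_queries=100, words_per_query=2, uniform=False):
    vocab = list(counts)
    weights = None if uniform else [counts[w] for w in vocab]
    # random.choices samples with replacement, so a 2-word query may repeat a word.
    return [" ".join(random.choices(vocab, weights=weights, k=words_per_query))
            for _ in range(n_queries)]

texts = ["open source search engines", "search engine comparison", "indexing and search"]
print(random_queries(build_dictionary(texts, {"and"}), n_queries=3))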
Figure 5.5: Average search time of each search engine for the 2.7GB collection.
After submitting the sets of 1- and 2-word queries (each set consisted of 100 queries), we could observe the average searching time for each collection and the corresponding retrieval percentage. For the 2-word queries, we considered a match of any of the words (using the OR operator). In Figure 5.5 we present a comparison of the average search times of each search engine for the 2.7GB collection.
The results obtained show that all of the search engines that qualified for the searching stage had similar searching times on each of the sets of queries. On average, the searching times for 1-word and 2-word queries differed by a factor of 1.5 to 2.0, in a linear way. The fastest search engines were Indri, IXE, Lucene, and XMLSearch, followed by MG4J and Zettair. The retrieval percentage was also very similar among them, but it decreased abruptly as the collection became larger and with queries taken from the least frequent 30% of the words.
Using the indices created for the WT10g, it was possible to analyze precision and recall for each of the search engines. We used the 50 topics (as title-only queries) from the TREC-2001 Web Track "Topic Relevance Task", and their corresponding relevance judgments. To have a common scenario for every search engine, we did not use stemming or stop-word removal on the queries, and used the OR operator between the terms.
Afterward, the processing of the results was done using the trec_eval software, which permits evaluating the results with the standard NIST evaluation and is freely available. As output, the program gives general information about the queries (e.g., the number of relevant documents) as well as precision and recall statistics. We focused on the interpolated average precision/recall and the precision at different cutoff levels. The average precision/recall permits comparing the retrieval performance of the engines by observing their behavior throughout the retrieval (see Figure 5.4). On the other hand, we also compared the precision at different cutoff values, which allows observing how precision behaves at different thresholds (see Table 5.3).

Figure 5.4: Interpolated average precision/recall of the search engines that indexed the whole WT10g collection.
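As a reference for how the interpolated values are obtained, the sketch below computes, for a single topic, the interpolated precision at the 11 standard recall levels (the interpolated precision at recall r is the maximum precision observed at any recall greater than or equal to r); it is a simplified stand-in for the statistics reported by trec_eval.

def interpolated_precision(ranked, relevant,
                           levels=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    relevant = set(relevant)
    hits, points = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))  # (recall, precision) after i docs
    # Interpolated precision at level r: best precision at any recall >= r.
    return [max((p for r_, p in points if r_ >= r), default=0.0) for r in levels]

ranked = ["d2", "d8", "d1", "d4", "d5"]
relevant = {"d1", "d2", "d5"}
print(interpolated_precision(ranked, relevant))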
5.3 Global Evaluation
Based on the results obtained after performing the tests with different collections of documents, the search engines that took the least indexing time were ht://Dig, Indri, IXE, Lucene, MG4J, Swish-E, Swish++, Terrier, XMLSearch, and Zettair. When analyzing the size of the index created, there are 3 different groups: IXE, Lucene, MG4J, Swish-E, Swish++, XMLSearch, and Zettair created an index of 25%-35% of the size of the collection; Terrier had an index of 50%-55% of the size of the collection; and ht://Dig, Omega, and OmniFind created an index of more than 100% of the size of the collection. Finally, another aspect to consider is the behavior of the RAM usage during the indexing stage. ht://Dig, Lucene, and XMLSearch had a constant usage of RAM. The first two used the same amount of RAM, independent of the collection (between 30MB and 120MB). On the other hand, IXE, MG4J, Swish-E, Swish++, and Terrier used much more memory, which grew in a linear way, reaching between 320MB and 600MB for the smallest collection, and around 1GB for the largest collection.
Another observation is related to the way the search engines store and manipulate the index. The search engines that used a database (DataparkSearch, mnoGoSearch, and OpenFTS) had a very poor performance during the indexing stage, since their indexing times were 3 to 6 times larger than those of the best search engines.
In the second part of the tests, it was possible to observe that, for a given collection and type of query (1- or 2-word), the search engines had similar searching times. For the 1-word queries, the searching time ranged from less than 10 ms to 90 ms, while for the 2-word queries the searching time ranged from less than 10 ms to 110 ms. The search engines with the smallest searching times were Indri, IXE, Lucene, and XMLSearch. The only notable difference was observed when searching for the least frequent words: since most of these queries retrieved 0 or 1 documents, the retrieval percentage is not representative.
From the tests performed with the WT10g collection we can observe that only Indri, IXE, MG4J, Terrier, and Zettair were capable of indexing the whole collection without considerable degradation, compared to the results obtained with the TREC-4 collection. Swish-E and Swish++ were not able to index it under the given system characteristics (operating system, RAM, etc.). ht://Dig and Lucene degraded considerably in indexing time, and we excluded them from the final comparison. Zettair was the fastest indexer, and its average precision/recall was similar to that of Indri, MG4J, and Terrier. IXE had low values of average precision/recall compared to the other search engines. By comparing these results with the results obtained in other TREC tracks (e.g., the Tera collection), we can observe that IXE, MG4J, and Terrier were on the top list of search engines. This difference with the official TREC evaluation can be explained by the fact that the engines are carefully fine-tuned by their developers for the particular needs of each track, and most of this fine-tuning is not fully documented in the released version, since it is particularly fitted to the track objective.
Chapter 6
Conclusions
This study presents the methodology used for comparing different open source search engines, and the results obtained after performing tests with document collections of different sizes. At the beginning of the work, 17 search engines were selected (from the 29 search engines found) to be part of the comparison. After executing the tests, only 10 search engines were able to index a 2.7GB document collection in "reasonable" time (less than an hour), and only these search engines were used for the searching tests. It was possible to identify different behaviors in their memory consumption during the indexing stage, and we also observed that the size of the indexes created varied according to the indexer used. In the searching tests, there was no considerable difference in the performance of the search engines that were able to index the largest collections.
The final tests consisted of comparing their ability to index a larger collection (10GB) and analyzing their precision at different levels. Only five search engines were capable of indexing the collection (given the characteristics of the server). By observing the average precision/recall we can see that Zettair had the best results, similar to the results obtained by Indri. By comparing these results with the results obtained in the official TREC evaluation, it is possible to observe some differences.
Search Engine Indexing Time Index Size Searching Time Answer Quality
(h:m:s) (%) (ms) P@5
ht://Dig (7) 0:28:30 (10) 104 (6) 32 -
Indri (4) 0:15:45 (9) 63 (2) 19 (2) 0.2851
IXE (8) 0:31:10 (4) 30 (2) 19 (5) 0.1429
Lucene (10) 1:01:25 (2) 26 (4) 21 -
MG4J (3) 0:12:00 (8) 60 (5) 22 (4) 0.2480
Swish-E (5) 0:19:45 (5) 31 (8) 45 -
Swish++ (6) 0:22:15 (3) 29 (10) 51 -
Terrier (9) 0:40:12 (7) 52 (9) 50 (3) 0.2800
XMLSearch (2) 0:10:35 (1) 22 (1) 12 -
Zettair (1) 0:04:44 (6) 33 (6) 32 (1) 0.3240
Table 6.1: Ranking of search engines, comparing their indexing time, index
size, and the average searching time (for the 2.7GB collection), and the
Answer Quality for the engines that parsed the WT10g. The number in
parentheses corresponds to the relative position of the search engine.
This can be explained by the fact that most of the search engines are fine-tuned by their developers for each of the retrieval tasks of TREC, and some of this tuning is not fully documented.
When comparing the results of the initial tests made with the discarded search engines (Datapark, mnoGoSearch, Namazu, OpenFTS, and Glimpse), it is possible to observe that they were much slower than the search engines included in the final comparison.
With the information presented in this work, it is possible to have a general view of the characteristics and performance of the available open source search engines in the indexing and retrieval tasks. In Table 6.1 we present a ranked comparison of the indexing time and index size when indexing the 2.7GB collection, and the average searching time of each of the search engines. The ranked comparison of the searching time was made considering all the queries (1- and 2-word queries with original and uniform distributions) over the 2.7GB collection. We also present the precision over the first 5 results (P@5) for the search engines that indexed the WT10g collection.
By analyzing the overall quantitative results over the small (TREC-4) and the large (WT10g) collections, we can observe that Zettair is one of the most complete engines, due to its ability to process large amounts of information in considerably less time than the other search engines (less than half the time of the second fastest indexer) and to obtain the highest average precision and recall over the WT10g collection.
On the other hand, in order to decide which search engine to use, it is necessary to complement the results obtained here with the additional requirements of each website. There are some considerations to make based on the programming language (e.g., to be able to modify the sources) and/or the characteristics of the server (e.g., RAM available). For example, if the collection to index is very large and tends to change (i.e., needs to be indexed frequently), it may be wise to focus on Zettair, MG4J, or Swish++, since they are fast in the indexing and searching stages; Swish-E would also be a good alternative. On the other hand, if one of the constraints is the amount of disk space, then Lucene would be a good alternative, since it uses little space and has low retrieval time; the drawback is the time it takes to index the collection. Finally, if the collection does not change frequently, and since all the search engines had similar searching times, you can make a decision based on the programming language used by the other applications of the website, so that the customization time is minimized. For Java you can choose MG4J, Terrier, or Lucene, and for C/C++ you can choose Swish-E, Swish++, ht://Dig, XMLSearch, or Zettair.
Bibliography
[18] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gi-
gabytes: Compressing and Indexing Documents and Images. Morgan
Kaufmann Publishers, San Francisco, CA, 1999.