NSums
Search Engines
Apisitt Rattana and Andrew Davison
Dept. of Computer Engineering
Prince of Songkla University
Hat Yai, Songkhla 90112, Thailand
E-mail: [email protected]
Abstract
The N-Sums framework is aimed at making the design, implementation, and maintenance of
specialized domain search engines easier, with the resulting applications able to produce more
useful matches and fewer bad hits. The heart of N-Sums is the realization that effective search
requires a combination of search strategies.
ComicSearch is a specialized domain search engine for finding comic covers and other
information, built using the N-Sums framework. It has outperformed the popular Copernic general
purpose search engine in all of its tests.
1. Introduction
N-Sums (Niche-Search Using Multiple Strategies) is a framework for creating effective Web search
engines for specialized domains (e.g. finding comic books, finding the latest soccer scores). By
effective, we mean that the engine will return matches containing a high percentage of URLs
pointing to the requested information and a low percentage of links to poorly related or unrelated
data.
At the heart of N-Sums is the recognition that a variety of different search mechanisms must be
combined in order to obtain good search results. This differs from the present trend in search engine
design which hopes that 'more of the same' will lead to improved answers.
In the following sections, we describe our framework, and a particular example of its use, the
ComicSearch search engine. We compare ComicSearch with one of the leading 'super-search'
engines, Copernic [3], and see that ComicSearch performs significantly better within its chosen
domain.
4. ComicSearch
ComicSearch finds information about comics – the user enters a comic title (or a phrase from the
title) and issue number of interest, and the search engine returns a table of URLs. The user can click
on a URL to open it in the system's default browser. The results of a query for "iron man" issue 1
are shown in Figure 1. "iron man" matches against all the comics which contain that phrase (e.g.
"Giant-Size Iron Man"), but only details on the comics with the given issue number are returned.
ComicSearch is primarily aimed at US comics produced during the so-called Golden, Silver, and
Bronze ages (roughly 1939 – 1970). These comics are the main concern of collectors. ComicSearch
is based on a similar search engine, MCCSE, which concentrated on the Marvel Comic publishing
company [7].
Google is used as the general purpose search engine, and acts as a backup to three specialized
domain search engines, AuctionHawk, the Grand Comic Database (GCD,
http://213.203.29.50/), and the search engine at Nick Simon's Marvel Silver Age site
(http://www.geocities.com/Area51/Zone/4414/index.html). GCD stores author and artist
information on over 64,000 comics, but has many gaps in its data, some errors, and a poor selection
of cover images. It is also somewhat daunting for non-technical users due to its complex query
interface. Nick Simon's site focuses on a single US publisher, Marvel, and a particular era (about
1956 to 1970), but is very comprehensive within those boundaries.
ComicSearch comes with a small database of useful comic sites (currently about 40). For each site,
a list is given of the comic titles which can be found there, together with issue numbers and URLs.
However, storing a single URL for each issue would quickly lead to a very large database. Instead,
a format was designed around storing issue ranges and URL patterns using place-holders. For
example:
Title: The Mighty Banana
Issues: 1 5 7 9-205 1004
imageURL: http://foo.com/mb**.html
These three lines represent 201 URLs for comics. At run time, ComicSearch substitutes the issue
number for the place-holding *'s in the image URL, left-padding it with zeros to the width of the
place-holder if necessary. A search for "The Mighty Banana" issue 7 will return the URL
http://foo.com/mb07.html. Issue numbers with more digits than there are *'s are substituted
with no padding (e.g. http://foo.com/mb1004.html).
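The expansion just described can be sketched in a few lines of Java. This is only an illustration of the issue-range and place-holder idea, not ComicSearch's actual code: the class and method names (IssuePatternSketch, parseIssues, expand) are hypothetical, and it assumes a pattern contains at most one run of *'s.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of expanding a database entry; not the ComicSearch source. */
final class IssuePatternSketch {

    /** Parses an issue list such as "1 5 7 9-205 1004" into individual issue numbers. */
    static List<Integer> parseIssues(String issues) {
        List<Integer> result = new ArrayList<>();
        for (String token : issues.trim().split("\\s+")) {
            int dash = token.indexOf('-');
            if (dash < 0) {
                result.add(Integer.parseInt(token));           // single issue
            } else {
                int first = Integer.parseInt(token.substring(0, dash));
                int last = Integer.parseInt(token.substring(dash + 1));
                for (int issue = first; issue <= last; issue++) {
                    result.add(issue);                         // every issue in the range
                }
            }
        }
        return result;
    }

    /** Replaces the run of '*' place-holders with the issue number, left-padded with zeros. */
    static String expand(String urlPattern, int issue) {
        int start = urlPattern.indexOf('*');
        if (start < 0) {
            return urlPattern;                                 // no place-holder in the pattern
        }
        int end = start;
        while (end < urlPattern.length() && urlPattern.charAt(end) == '*') {
            end++;
        }
        StringBuilder number = new StringBuilder(Integer.toString(issue));
        while (number.length() < end - start) {
            number.insert(0, '0');                             // pad up to the place-holder width
        }
        return urlPattern.substring(0, start) + number + urlPattern.substring(end);
    }

    public static void main(String[] args) {
        String pattern = "http://foo.com/mb**.html";
        System.out.println(parseIssues("1 5 7 9-205 1004").size());
        System.out.println(expand(pattern, 7));
        System.out.println(expand(pattern, 1004));
    }
}
```

Storing the range "9-205" as one token rather than 197 separate entries is what keeps the database small enough for users to read and extend by hand.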
The database format was designed to be understandable by its users, the intention being that they
could extend it. The online documentation for ComicSearch encourages users to submit their
information by e-mail for inclusion in future releases of the application.
ComicSearch utilizes Java threads to query its component search engines and local database. An
important aspect of the threads is the filtering of the results returned by the engines. For example,
the Google thread uses a regular expression based on the comic title followed by spaces or letters or
a '#' and then the issue number. This relatively simple pattern is applied to the title lines of the
Google results, and filters out 50-70% of the bad hits on average, compared with the unfiltered
results. ComicSearch employs the SteveSoft regular expression class [1] for this, although much
can be achieved with Java's String class alone. The number of good hits filtered out is typically
quite small, about 5%. A similar technique is used in the AuctionHawk thread, but was unnecessary
in the GCD and Nick Simon threads due to their accuracy.
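A filter of the kind described for the Google thread might look roughly like the sketch below. It uses the standard java.util.regex package as a stand-in for the SteveSoft class, and the exact pattern is an assumption reconstructed from the description (title, then spaces, letters or '#', then the issue number). The trailing check that no further digit follows the issue number is our own addition, and the names TitleFilterSketch, issuePattern and filter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of a title-line filter; not the ComicSearch source. */
final class TitleFilterSketch {

    /** Builds: title, then one or more spaces/letters/'#', then the issue number. */
    static Pattern issuePattern(String title, int issue) {
        // (?!\d) is an assumed refinement so that issue 1 does not match "#10".
        String regex = Pattern.quote(title) + "[\\sA-Za-z#]+" + issue + "(?!\\d)";
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    /** Keeps only the result title lines that contain the title and issue number. */
    static List<String> filter(List<String> titleLines, String title, int issue) {
        Pattern pattern = issuePattern(title, issue);
        List<String> kept = new ArrayList<>();
        for (String line : titleLines) {
            if (pattern.matcher(line).find()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Iron Man #1 cover gallery",
                "Giant-Size Iron Man 1",
                "Iron Man #10 review",
                "Iron Man movie trailer");
        // Only the first two lines mention "iron man" followed by issue 1.
        System.out.println(filter(lines, "iron man", 1));
    }
}
```

Because the pattern is searched for anywhere in the line, a partial title such as "iron man" still matches longer titles such as "Giant-Size Iron Man", in line with the matching behaviour described earlier.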
A specially coded networking class lets ComicSearch operate through proxies/firewalls, and deals
with user authentication.
There are no secondary searches of the results by visiting their Web pages. The success of the filters
makes this extra step unnecessary. Also there is no direct searching of Web pages. Since the comic
sites are quite static, their details were converted into database entries.
The specialized domain search engines do not return duplicate URLs since they search in different
places on the Web, but Google does occasionally return a few hits found by the others. The removal
of duplicate URLs from the results table would be straightforward, but is not currently carried out.
The contents of the results pages do often overlap, but this is not seen as a bad thing – it is
quite useful to have repeated information, images, etc. from different sources, to allow comparisons
between them. For example, in Figure 1, the results rows 1, 2, 3, 5, 7, 8, 9, 10, 15, and 18 all refer to
the same comic, but the details on the pages vary, such as where to buy the comic, its cost, the
publication history, and the cover image size and quality.
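Although ComicSearch does not remove duplicate URLs, the step mentioned above could be as small as the following sketch. It assumes that two hits count as duplicates only when their URL strings are identical, so pages with overlapping content at different URLs are all kept. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

/** Hypothetical sketch of duplicate-URL removal; ComicSearch itself does not do this. */
final class DedupSketch {

    /** Drops repeated URL strings while keeping the first occurrence of each in order. */
    static List<String> removeDuplicateUrls(List<String> urls) {
        // LinkedHashSet rejects repeats and remembers insertion order.
        return new ArrayList<>(new LinkedHashSet<>(urls));
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("url-a", "url-b", "url-a", "url-c");
        System.out.println(removeDuplicateUrls(hits));
    }
}
```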
ComicSearch was made freely available over the Internet in March 2001. It can be obtained from
http://fivedots.coe.psu.ac.th/~ad/ComicSearch/readme.html.
Table 1 shows six typical comic queries, the number of exact matches, related matches, unrelated
matches, and the total number of matches returned by Copernic. An exact match is a URL to a page
which describes a comic containing the partial title string with the given issue number. Related
matches are URLs to pages about the general comic or its characters. Unrelated matches have
nothing to do with the comic. The 'rows' column gives the row positions of the exact matches in
Copernic's output after it had been sorted into decreasing order by score. The query input was a
(partial) comic title and issue number, separated by a space.
Partial title     No. of exact  Sources which supplied    No. of related  No. of unrelated  Total no.
and issue no.     matches       the exact matches         matches         matches           of matches
Wolverine 1       37            GG/18, AH/13, GCD/6       4               3                 43
Superman 100      10            AH/4, DB/3, GG/2, GCD/1   0               0                 10
Spider-man 122    9             GCD/3, GG/3, DB/2, AH/1   0               0                 9
Mad 99            3             GG/2, DB/1                0               2                 5
Green Lantern 59  8             DB/3, GCD/3, AH/2         0               0                 8
Flash 13          11            DB/4, GCD/3, GG/3, AH/1   0               19                30
Table 2. ComicSearch results for six queries.
Table 2 shows the same queries processed by ComicSearch. A 'rows' column is not included, partly
because the output from ComicSearch is ordered nondeterministically due to its threaded behaviour.
The other reason is that the exact matches almost always appear before the related or unrelated
ones, a consequence of the slow response rate of Google. Instead, a 'sources' column is given, which
details the number of exact matches contributed by each search thread. AH is AuctionHawk, DB is
the local database, GCD is the Grand Comic Database, GG is Google, and NS is Nick Simon's site.
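The effect of threading on result order can be illustrated with the sketch below, in which each source runs concurrently and its hits are appended to the results table as soon as that source finishes. This is not the ComicSearch implementation: it uses java.util.concurrent executors in place of hand-managed threads, and the sources are stubs with artificial delays rather than real queries. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch of concurrent querying; not the ComicSearch source. */
final class ThreadedSearchSketch {

    /** Runs all sources concurrently and collects hits in order of completion. */
    static List<String> searchAll(List<Callable<List<String>>> sources) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        CompletionService<List<String>> completed = new ExecutorCompletionService<>(pool);
        try {
            for (Callable<List<String>> source : sources) {
                completed.submit(source);
            }
            List<String> table = new ArrayList<>();
            for (int i = 0; i < sources.size(); i++) {
                table.addAll(completed.take().get());   // next source to finish, whichever it is
            }
            return table;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a search source failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Stub source that waits for the given delay and then returns a single hit. */
    static Callable<List<String>> stub(String name, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis);
            List<String> hits = new ArrayList<>();
            hits.add(name + ": hit");
            return hits;
        };
    }

    public static void main(String[] args) {
        List<Callable<List<String>>> sources = new ArrayList<>();
        sources.add(stub("GG", 300));   // a slow source, standing in for Google
        sources.add(stub("DB", 0));     // a fast source, standing in for the local database
        System.out.println(searchAll(sources));
    }
}
```

With this arrangement the row order depends on which source answers first, so a slow source tends to contribute the last rows of the table.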
ComicSearch produces fewer results than Copernic, but the quality is higher; quality can be
measured as the percentage of exact matches in the total number of matches. Another indicator is
the coverage of the exact matches. Queries such as "flash 13" can match on many different comics
which use the word "flash" in their titles, and different volumes within the same comic title.
ComicSearch frequently returns at least one link to each of these possibilities.
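The quality measure can be computed directly from the table columns. The sketch below applies it to three ComicSearch rows of Table 2; the class and method names are hypothetical, and the figures are copied from the table as printed.

```java
/** Hypothetical sketch of the quality measure: exact matches as a percentage of all matches. */
final class QualitySketch {

    static double exactMatchPercentage(int exactMatches, int totalMatches) {
        if (totalMatches <= 0) {
            throw new IllegalArgumentException("totalMatches must be positive");
        }
        return 100.0 * exactMatches / totalMatches;
    }

    public static void main(String[] args) {
        // (exact matches, total matches) taken from Table 2.
        System.out.println("Superman 100: " + exactMatchPercentage(10, 10));
        System.out.println("Mad 99:       " + exactMatchPercentage(3, 5));
        System.out.println("Flash 13:     " + exactMatchPercentage(11, 30));
    }
}
```

On this measure the "Flash 13" query is the weakest of the six, since 19 of its 30 matches are unrelated.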
The presence of so many unrelated matches in Copernic's output can make using it quite tiresome:
finding the best URLs frequently involves a lengthy study of the 50 or more results. This indicates
that usability is as much affected by the number of bad hits as the number of good ones.
The number of unrelated matches generated by Copernic points to a difficulty when searching for
comics using general purpose search engines – comic titles regularly use common words such as
'flash' and 'mad'. Even a word like 'wolverine' has a number of meanings; in this instance, a US
football team and the Canadian animal.
ComicSearch is helped and hindered by the presence of Google, which accounts for all the
unrelated matches, but also occasionally turns up exact matches missed by the other searches.
Fortunately, the number of unrelated matches from Google is significantly reduced by having
ComicSearch filter its results.
The local database contributes exact matches in almost all the example queries, but its success
depends on the choice of queries. AuctionHawk returns exact matches quite often, but this depends
on the auctions currently in progress; when an auction finishes, the information will disappear soon
after. Nick Simon's search engine does not return anything for the test queries, but this is because
the titles and issues are outside its range of interest. GCD information can sometimes be rather
brief.
6. Conclusions
The N-Sums framework rests on the idea that effective search engines for specific domains (e.g.
comics, soccer scores) are best designed using a combination of search strategies. These utilize:
• general purpose search engines;
• specialized domain search engines;
• direct search of static/dynamic Web pages;
• local database(s) of search information.
We described how these different approaches have consequences for the design, implementation,
and maintenance of the resulting search engine.
We used the N-Sums framework to build ComicSearch, a search engine focussing on US comics
from the 1940s to the 1970s. It performs significantly better than general purpose search engines
due to its multi-faceted approach to finding results. ComicSearch was made freely available on the
Web in March 2001.
Our future work will utilize N-Sums to build three more search engines: one for finding celebrity
mailing addresses, one for returning the latest soccer scores for any team, and a tool for finding
online resources for computing textbooks (e.g. slides, software, exercises). We expect that these
efforts will indicate further avenues for the refinement of N-Sums.
References
[1] BRANDT, S.R. "Regular Expressions in Java", Package com.stevesoft.pat version 1.4. Available at
http://javaregex.com/, March 2001.
[2] CLIENT HELP DESK. "Web Statistics: Size, the Average Page", Available at
http://www.clienthelpdesk.com/statistics_research/web_statistics.html, March 2001.
[6] KISTLER, T. AND MARAIS, H. "WebL – A Programming Language for the Web", SRC Research Report, Digital
Systems Research Center, Palo Alto, California, USA, 1998.
[7] RATTANA, A. "Marvel Comics Cover Search Engine (MCCSE)", Senior Project, Dept. of Comp. Eng., PSU, Hat
Yai, Thailand, February 2001.