Grouper: A Dynamic Cluster Interface to Web Search Results
Abstract
Users of Web search engines are often forced to sift through the long ordered list of document “snippets”
returned by the engines. The IR community has explored document clustering as an alternative method of
organizing retrieval results, but clustering has yet to be deployed on most major search engines. The
NorthernLight search engine organizes its output into “custom folders” based on pre-computed document
labels, but does not reveal how the folders are generated or how well they correspond to users’ interests.
In this paper, we introduce Grouper – an interface to the results of the HuskySearch meta-search
engine, which dynamically groups the search results into clusters labeled by phrases extracted from the
snippets. In addition, we report on the first empirical comparison of user Web search behavior on a
standard ranked-list presentation versus a clustered presentation. By analyzing HuskySearch logs, we are
able to demonstrate substantial differences in the number of documents followed, and in the amount of
time and effort expended by users accessing search results through these two interfaces.
Keywords: document clustering, user interface, search engine, visualization
1. Introduction
Conventional document retrieval systems return long lists of ranked documents that users are forced to
sift through to find relevant documents. The majority of today's Web search engines (e.g., Excite,
AltaVista) follow this paradigm. The notoriously low precision of Web search engines, coupled with the
ranked-list presentation, makes it hard for users to find the information they are looking for. Our goal is to
make search engine results easy to browse by clustering them.
Document clustering has long been investigated as a post-retrieval document visualization technique
[1, 4, 13, 18]. Document clustering algorithms attempt to group documents together based on their
similarities; thus documents relating to a certain topic will hopefully be placed in a single cluster. This
can help users both in locating interesting documents more easily and in getting an overview of the
retrieved document set.
Document clustering can be performed in advance on the collection as a whole [5], but post-retrieval
document clustering has been shown to produce superior results [13]. This is because the clusters are
computed from the returned document set, so the cluster boundaries are drawn to partition the set of
documents at hand appropriately. In contrast, pre-retrieval clusters might be based on features that are
infrequent in the retrieved set, as many non-retrieved documents influence the cluster formation.
One way to cluster Web search results is to do so on the search engine. But search engines are under
severe resource constraints and dedicating enough CPU time to each query might not be feasible. The
clusters, therefore, would have to be pre-computed (at least partially – e.g., NorthernLight). We are
interested in exploring an alternative model, one that would not impose additional resource demands on
the search engines. We assume the clustering interface is independent of the search engine, i.e., it can
reside on a meta-search engine or on a client’s browser. Previously we have identified some key
requirements for a post-retrieval document clustering interface of this sort [38]:
1. Coherent Clusters: The clustering algorithm should group similar documents together. As
documents often have multiple topics, they should not be confined to a single cluster [11]; the
clustering algorithm should generate overlapping clusters when appropriate.
2. Efficiently Browsable: The user needs to determine at a glance whether the contents of a cluster are
of interest. Therefore, the system has to provide concise and accurate cluster descriptions.
3. Speed: The clustering system should not introduce a substantial delay before displaying the results.
This suggests two additional requirements. Algorithmic Speed: The clustering algorithm used must be
able to cluster hundreds or even thousands of snippets in a few seconds. Snippet-Tolerance: As there
is no time to download the original documents off the Web, the system should produce high quality
clusters even when it only has access to the snippets returned by the search engines.
In earlier work we introduced Suffix Tree Clustering (STC) – a fast, incremental, linear time clustering
algorithm that produces coherent clusters [38]. Using traditional IR metrics (e.g., average-precision) we
compared STC to other fast clustering algorithms (including k-means [25], Buckshot and Fractionation
[5]) on the task of clustering results returned by Web search engines, and showed it to be both faster and
more precise than previous algorithms. We used the results to argue that post-retrieval clustering of Web
search engine results à la STC is feasible.
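To make the preceding discussion concrete, here is a minimal sketch of the two core STC steps: identifying base clusters (a phrase together with the set of documents sharing it) and merging base clusters whose document sets overlap heavily. The n-gram scan below stands in for the suffix tree, which performs this step in linear time, and the 0.5 overlap threshold and other parameters are illustrative assumptions rather than Grouper’s actual values.

```python
from collections import defaultdict
from itertools import combinations

def base_clusters(snippets, max_len=4, min_docs=3):
    """Group documents by shared phrases (word n-grams of up to max_len words).
    A true STC implementation finds these shared phrases with a suffix tree in
    linear time; the quadratic n-gram scan here is only for illustration."""
    docs_by_phrase = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                docs_by_phrase[" ".join(words[i:i + n])].add(doc_id)
    # A base cluster is a phrase plus the set of documents sharing it.
    return {p: d for p, d in docs_by_phrase.items() if len(d) >= min_docs}

def merge_base_clusters(bases, overlap=0.5):
    """Merge base clusters whose document sets overlap by more than `overlap`
    in both directions, taking connected components with a small union-find."""
    items = list(bases.items())
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        a, b = items[i][1], items[j][1]
        common = len(a & b)
        if common > overlap * len(a) and common > overlap * len(b):
            parent[find(i)] = find(j)
    merged = {}
    for k, (phrase, docs) in enumerate(items):
        docset, phrases = merged.setdefault(find(k), (set(), []))
        docset |= docs
        phrases.append(phrase)
    return list(merged.values())   # [(documents, shared phrases), ...]
```

Scoring the merged clusters and ordering them by estimated coherence, which Grouper also does, are omitted from this sketch.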
In this paper, we present Grouper – a clustering interface to the HuskySearch meta-search engine
(publicly available at [www.cs.washington.edu/research/clustering]), which uses STC as its clustering
algorithm. The paper is organized as follows. First, we describe related work on post-retrieval document
visualization and document clustering. The following section describes the Grouper system: its user
interface and the design decisions made to make the system fast and its clusters easy to browse. Next, we
evaluate Grouper in practice based on the logs gathered by this fully deployed system. By contrasting
Grouper’s logs with the logs of the standard ranked-list interface to HuskySearch, we are able to
demonstrate a substantial difference in the number of documents followed, and in the amount of time and
effort expended by users accessing search results through these two interfaces. To the best of our
knowledge, this is the first evaluation of a post-retrieval clustering interface on the Web. We conclude by
summarizing our contributions and presenting directions for future work.
2. Related Work
The Information Retrieval community has long explored a number of post-retrieval document
visualization techniques as alternatives to the ranked list presentation. One category of visualization
techniques aims to display additional information about each retrieved document. Often, this will have the
secondary effect of grouping documents that share this additional information. The additional information
displayed highlights the relationship between the documents and the query terms [10, 33]; predefined
document attributes (e.g., size, date, source or popularity) [12, 23]; or user-specified attributes (i.e.,
predefined topics) [17, 24, 29].
The second category of visualization techniques attempts to visualize inter-document similarities.
This can help the user get an overview of the collection, or, alternatively, find interesting documents
faster, as once one document has been found it is easy to find others similar to it in the visualized space.
There are four major techniques to visualize inter-document similarities: document networks [8, 31],
spring embeddings [3, 30], document clustering [1, 13, 18, 38] and self-organizing maps [16, 20, 22]. Of
the four major techniques, only document clustering appears to be both fast enough and intuitive enough
to require little training or adjustment time from the user [13, 37].
Numerous document clustering algorithms appear in the literature [36]. Agglomerative Hierarchical
Clustering algorithms are probably the most commonly used. These algorithms are quadratic in the
number of documents and are therefore too slow for our online requirements [34]. Linear time clustering
algorithms are the best candidates to comply with the speed requirement of on-line clustering. These
include K-Means [25], Single-Pass [14], Buckshot and Fractionation [5]. In earlier work [38] we have
introduced another linear time algorithm – STC – which will be briefly described in the next section.
In contrast to STC, most clustering algorithms treat a document as a set of words and not as an
ordered sequence of words, thus losing valuable information. Phrases have long been used to supplement
word-based indexing in IR systems [7, 15, 26], yet these methods have not been widely applied to
document clustering. The only example known to the authors is the use of the co-appearance of pairs of
words as the attributes of the documents’ vector representations [21].
Many document clustering algorithms rely on off-line clustering of the entire document collection [6,
28]. Since the Web is a dynamic environment, static, pre-computed clusters would have to be constantly
updated, and most clustering algorithms cannot do so incrementally. This would require a huge amount of
resources, and it has been shown that such an approach results in clusters of lower quality [13]. A pre-
computed clustering of a sizeable part of the Web (30M HTML documents in over 10 CPU-days) was
shown to be feasible [2], but the approach used (a non-hierarchical single-link algorithm with a
predetermined similarity threshold) was designed for near-duplicate document identification, and is not
suitable for the grouping of retrieval results.
On the Web, there are some attempts to handle the large number of documents returned by search
engines. Many search engines provide query refinement features. Excite [www.excite.com], for example,
suggests words to be added to or to be excluded from the query. In AltaVista [www.altavista.com], these
words are organized into groups, but these groups do not represent clusters of documents.
The NorthernLight search engine [www.northernlight.com] provides "Custom Folders" in which the
retrieved documents are organized. These folders can be based on type (e.g., press releases, product
reviews, etc.), source (sites, domains), language or subject. The subject-folders are labeled by a short
phrase (usually of one or two words). These folders are organized hierarchically and introduce no
apparent delay to the search.
NorthernLight does not reveal the method used to create these folders. After experimenting with their
system, we hypothesize that at indexing time a set of potential folders is assigned to each document, and
at query time the system decides which folders to present. Electric Library [www.elibrary.com] uses
“Recurring Themes” in a similar manner. Themes are of three categories: places, persons and subjects.
The subject-themes appear to be selected from a restricted, predefined list.
Grouper differs from the NorthernLight approach in several ways. First, it receives hits from different
engines, and only looks at the top hits from each search engine. Thus, its clusters are not affected by many
low relevance documents, as is the case for NorthernLight’s folders. Second, Grouper performs post-
retrieval clustering, and thus the clusters are tailored to the retrieved set. Finally, as Grouper can run on a
client-machine, it is suitable for distributed IR systems, and addresses the scalability issue.
There are several additional techniques used in Web search engines that are also meant to help the
user sift through a large set of retrieved results. Excite, for example, has a “List by Web Site” option that
presents an alternative view of the retrieved documents; its “More Like This” option allows the user, once
one interesting document has been found, to identify additional similar documents (though, unlike the
other techniques described previously, this performs a new search using the document as a search seed).
3. Grouper
Grouper is a document clustering interface to the HuskySearch meta-search service. HuskySearch
(which is based on MetaCrawler [27]) retrieves results from several popular Web search engines, and
Grouper clusters the results as they arrive using the STC algorithm. Below we present a brief overview of
STC and describe its characteristics. Next, we describe the user interface of the Grouper system. Finally,
we explain key design decisions that make Grouper fast, and the clusters easy to understand.
Figure 1: Grouper’s query interface. Users do not need to enter any parameters for the clustering
algorithm.
The main results page displays the number of documents retrieved and the number of clusters found.
The clusters are presented in a large table – each cluster in a single row referred to as the summary of the
cluster. The clusters are ordered by their estimated coherence. A summary of a cluster includes its size
(the number of documents it contains), and attempts to convey the content of the documents in the
cluster by displaying shared phrases and sample titles. Shared phrases are phrases that appear in many
documents of the cluster. The numbers appearing in parentheses after each phrase indicate the percentage
of the documents in the cluster that contain the phrase. Titles of three sample documents are also
displayed in the summary of each cluster.
If a summary of a cluster indicates to the user that the cluster is of interest, the user can take several
actions. If a certain title seems promising, the user can go directly to that document by clicking on the
title. Alternatively, the user can click on the "View Results" links to go to the cluster's page. There, the
snippets of all the documents in the cluster are presented in a ranked list.
Figure 2: The main results page in Grouper for the query “israel”. Each row in the table is the
summary of a cluster – an attempt to convey the content of the documents in the cluster. It includes
the size of the cluster, shared phrases – phrases that appear in many documents of the cluster, and up
to three sample titles of documents in the cluster. The numbers appearing in parentheses after each
phrase indicate the percentage of the documents in the cluster that contain the phrase. In the example
above only the first five clusters are shown.
Finally, the user may want to retrieve additional similar documents. Several search engines (e.g.,
Excite) allow the user to use a certain document as a seed for a new search (the "More Like This” option).
However, this method does not allow the user to control the search being made. We wanted to give the
user the ability to find documents that are related to the cluster of interest, but allow the user to control the
exact search that is being done. The "Refine Query Based On This Cluster" option in Grouper is designed
to do precisely that (Figure 3) – the user can use the phrases that characterize the cluster to reformulate
the query. To our knowledge, Grouper is the only search engine that suggests multi-word phrases to the
user for query refinement. The system’s logs show that in 80% of the cases where users invoked the
“Refine Query” option, they used the phrases suggested by the system, and 41% of these were
multi-word phrases. These figures suggest that users find this feature beneficial in practice.
Figure 3: The query refinement page for the first cluster of the query “israel”. The user can
reformulate the query using the phrases that were identified in the cluster. Clicking on a checkbox
associated with a phrase will include it in the refined query.
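Folding the checked phrases into a refined query can be as simple as the following sketch; quoting multi-word phrases is our assumption about the target engines’ query syntax, not documented Grouper behavior, and the example phrase is hypothetical.

```python
def refine_query(original_query: str, selected_phrases: list) -> str:
    """Append the phrases the user checked to the original query, quoting
    multi-word phrases so downstream engines treat them as units."""
    terms = [f'"{p}"' if " " in p else p for p in selected_phrases]
    return " ".join([original_query] + terms)

# Hypothetical phrase for the query "israel":
# refine_query("israel", ["ministry of foreign affairs"])
#   -> 'israel "ministry of foreign affairs"'
```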
The Grouper search service has been running in its current format since July 1998, and receives
approximately 65 queries a day. We are logging all activities on this search engine and have logged to
date more than 10,000 queries.
A noteworthy characteristic of STC is that the user does not need to input the required number of
clusters to the system. The clusters produced are ordered by their approximate coherence, and the clusters
gracefully degrade from coherent clusters at the top of the cluster list, to clusters that are characterized by
a single phrase or word at the bottom of the list. These clusters do not necessarily represent a semantically
coherent topic, but rather function as an index to the retrieved document set. We believe these index-
clusters are useful to the user when no earlier cluster appears to be of interest. This sometimes occurs
when there are no “natural” clusters in the retrieved document set or when the user’s interest is narrow
and is overshadowed by more dominant themes among the retrieved documents.
Analysis of the system’s logs shows that in approximately 21% of the sessions users followed
documents from index-clusters. We take this figure as confirmation of our reasoning in including this type
of cluster in our output.
[Footnote 1: The fifth and the sixth phrases are displayed only if their coverage is greater than 50%.]
To decide which of a cluster’s phrases to display, we apply several heuristics:
1. Word Overlap: If most of a phrase’s words also appear in another phrase of higher coverage, it will
not be displayed. In Table 1, phrase 7 is not displayed as 75% of its words appear also in phrase 6,
which also has a higher coverage.
2. Sub- and Super- Strings: Often, we find that STC identifies phrases that are sub-strings of other
phrases (we call these sub-phrases and super-phrases). This happens both when the two phrases are
independent phrases in their own right (such as “united states” and “president of the united states”)
and when the phrase was truncated in the snippet received from the search engine (such as “president
of the united states” and “president of the united…”). A sub-phrase will always have a higher
coverage than its super-phrase, but the super-phrase will be more specific and therefore more
informative. To balance this tradeoff, we determine, for each phrase, whether it has a sub-phrase or a
super-phrase among the other phrases of the cluster. If it doesn’t have a sub-phrase, it is designated as
a most-general phrase; if no super-phrase exists, it is designated as a most-specific phrase. We will
consider phrases redundant if they are not in either of these categories. In Table 1, phrases 3 and 4
will not be displayed, as they are neither most-general nor most-specific.
3. Most-General Phrase with Low Coverage: The purpose of displaying short, general phrases is to
achieve high coverage. In cases where a most-general phrase adds little to the coverage of its
corresponding most-specific phrase, we will not display it. Thus, we define a minimal coverage
difference (we found 20% to be satisfactory) that must be attained in order for the most-general
phrase to be displayed. In Table 1, phrase 8 will not be displayed, even though it is a most-general
phrase, as it adds only 5% to the coverage of phrase 6. (A sketch of these three heuristics in code
follows Table 1.)
Table 1: Determining which phrases to display: The table shows the set of phrases identified in a
cluster. The phrases are presented along with their coverage, whether they are most-specific or most-
general and whether they are selected to be displayed. Phrases 3 and 4 are redundant, and thus will not
be displayed, as they are neither most-general nor most-specific (the Sub- and Super- Strings
heuristic). Phrase 7 is redundant as 75% of its words also appear in phrase 6, which also has a
higher coverage (the Word Overlap heuristic). Phrase 8 is redundant, even though it is a most-general
phrase, as it adds only 5% to the coverage of phrase 6 (the Most-General Phrase with Low Coverage
heuristic).
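The following sketch pulls the three heuristics together. The 20% minimal coverage difference is the value reported above; the word-overlap threshold is not stated in the surviving text and is an assumed parameter, as is the data representation.

```python
from dataclasses import dataclass

@dataclass
class Phrase:
    words: list          # the phrase as a list of words
    coverage: float      # fraction of the cluster's documents containing it

def is_subphrase(a, b):
    """True if a's word sequence occurs contiguously inside b's (a shorter)."""
    n, m = len(a.words), len(b.words)
    if n >= m:
        return False
    return any(b.words[i:i + n] == a.words for i in range(m - n + 1))

def phrases_to_display(phrases, overlap=0.6, min_gain=0.2):
    """Apply the three display heuristics; `overlap` is an assumed threshold,
    `min_gain` is the 20% minimal coverage difference reported in the text."""
    keep = []
    for p in phrases:
        others = [q for q in phrases if q is not p]
        most_general = not any(is_subphrase(q, p) for q in others)
        most_specific = not any(is_subphrase(p, q) for q in others)
        # Sub- and Super-Strings: drop phrases that are neither.
        if not (most_general or most_specific):
            continue
        # Word Overlap: drop a phrase if too many of its words recur in a
        # phrase of higher coverage.
        if any(len(set(p.words) & set(q.words)) / len(p.words) > overlap
               for q in others if q.coverage > p.coverage):
            continue
        # Most-General Phrase with Low Coverage: a most-general phrase must
        # add enough coverage beyond its most-specific super-phrases.
        supers = [q for q in others if is_subphrase(p, q)]
        if most_general and supers and \
           p.coverage - max(q.coverage for q in supers) < min_gain:
            continue
        keep.append(p)
    return keep
```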
As mentioned earlier, documents are processed before they are inserted into the suffix tree (step 1 of
STC). This processing improves the identification of common phrases; however, it reduces the readability
of the text. To deal with this problem we keep the original (unprocessed) text of the documents; instead of
displaying a processed, and less comprehensible, phrase, we display the original text that corresponds to
it. For example, the phrase “post retrieve document cluster” might actually correspond to the string “post-
retrieval document clustering”, which is more meaningful to the user.
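One simple way to realize this mapping, assuming (our assumption, not a documented detail of Grouper) that preprocessing keeps a token-level alignment from each processed token back to the raw text it came from:

```python
def display_phrase(processed_phrase, doc_tokens, raw_spans):
    """Render a processed phrase using the original document text.
    doc_tokens are a document's processed tokens; raw_spans[i] is the raw
    text that token i was derived from, recorded during preprocessing."""
    n = len(processed_phrase)
    for i in range(len(doc_tokens) - n + 1):
        if doc_tokens[i:i + n] == processed_phrase:
            return " ".join(raw_spans[i:i + n])
    return " ".join(processed_phrase)   # fall back to the processed form

# Simplified example of the idea:
# display_phrase(["retrieve", "document", "cluster"],
#                ["retrieve", "document", "cluster"],
#                ["retrieval", "document", "clustering"])
#   -> "retrieval document clustering"
```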
In addition to describing the cluster using its phrases, we also use the standard technique of
identifying and presenting single words that appear frequently within the documents of the cluster. We
display up to five such words for each cluster. A word will not be shown if it appears in fewer than 40% of
the documents, in the stoplist, or in any of the displayed phrases. A semi-colon separates these words
from the phrases of the cluster. There is one major difference between the phrases and these frequent
words that should be emphasized. A phrase indicates that all the documents containing it are in the
cluster, because it was one of the basic blocks from which the cluster was constructed. Documents
containing a frequent word, on the other hand, might not all appear in the cluster.
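A sketch of this frequent-word selection; the 40% threshold and five-word limit are from the text, while the data representation (each document as a list of words) is assumed.

```python
from collections import Counter

def frequent_words(cluster_docs, displayed_phrases, stoplist,
                   min_fraction=0.4, max_words=5):
    """Pick up to five words appearing in at least 40% of the cluster's
    documents, excluding stop-words and words in displayed phrases."""
    phrase_words = {w for p in displayed_phrases for w in p.split()}
    doc_freq = Counter()
    for words in cluster_docs:            # each document as a list of words
        doc_freq.update(set(words))       # count document frequency, not tf
    n = len(cluster_docs)
    candidates = sorted(((c, w) for w, c in doc_freq.items()
                         if c / n >= min_fraction
                         and w not in stoplist and w not in phrase_words),
                        reverse=True)
    return [w for _, w in candidates[:max_words]]
```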
Another design decision in Grouper was how to order the documents in a cluster. This affects both
the order of the documents on the cluster’s page, and the choice of the three documents that have their
titles displayed in the cluster’s summary on the main results page. Two options were considered: sorting
the documents based on their relevance to the query or sorting them based on the number of the cluster's
phrases each contains (this is similar to the option of sorting the documents of a cluster based on their
similarity to the query or based on their similarity to the cluster’s centroid, as described in [13]). We
chose the second approach – sorting by the number of the cluster’s phrases each document contains – as
we believe this results in more informative cluster summaries.
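A minimal sketch of this ordering, assuming each result record carries its snippet text (a hypothetical representation); the top three documents under this ordering supply the sample titles in the cluster’s summary.

```python
def order_documents(results, cluster_phrases):
    """Order a cluster's documents by how many of the cluster's phrases each
    snippet contains, most phrase-rich first. Phrases are assumed lowercase."""
    def count_phrases(snippet):
        return sum(phrase in snippet.lower() for phrase in cluster_phrases)
    return sorted(results, key=lambda r: count_phrases(r["snippet"]),
                  reverse=True)
```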
[Footnote 2: For users willing to wait, Grouper also has a “Quality Search” option that will download the
original documents off the Web and use them in the STC algorithm. This option is much slower (1-5
minutes as opposed to a few seconds), but will generally produce better results.]
As the machine running Grouper is shared among several research projects, the reported times are greatly
affected by its load as well as the load on our file server, and would be improved substantially by
allocating dedicated hardware to Grouper.
[Figure 4 plot: average clustering time, 0.0 to 2.5 seconds, plotted against 50 to 750 retrieved documents.]
Figure 4: The average time (in seconds) it takes Grouper to cluster the retrieved documents as a
function of the number of documents retrieved.
Figure 5: The average number of followed clusters as a function of the number of documents
followed thus far in the session. We compare Grouper clusters to the number of followed clusters if a
random clustering had been created, and show that Grouper is substantially better than random. In
addition, we present the calculated number of clusters followed using HuskySearch data: the
documents retrieved in HuskySearch search sessions are clustered using both STC and K-Means;
assuming that the same documents would have been followed, we calculate the number of followed
clusters. We see that STC produces slightly more coherent clusters than K-Means.
We had hypothesized that the user would visit few clusters in Grouper, but the numbers are not as
low as we had hoped (e.g., users viewing seven documents visit three clusters on average). This result can
be interpreted in several ways. It might indicate that even when users follow many documents, they
typically seek an overview of the retrieved results and not necessarily an exhaustive review of all the
documents on a specific subtopic. Alternatively, these figures might indicate that our clustering algorithm
(using the documents’ snippets) falls short of the ideal goal of creating a single cluster corresponding to
the user’s information needs.
[Footnote 3: We have compared STC with additional clustering algorithms (Buckshot, Fractionation, and
Agglomerative Hierarchical Clustering) and found similar results.]
1. Documents followed: we recorded the number of retrieved documents the user chose to view (Table 2).
A limitation of this measure is that users might satisfy their information needs by scanning
snippets or cluster summaries without any clicks.
2. Time spent: as a rough measure of search efficiency, we recorded the amount of time users spent
traversing the results set (Figure 6a). The main problem with this measure is that the time recorded is
the sum of time spent in network delays, in reading documents, and in traversing the results
presentation (ranked list or clusters). Our server-based experimental apparatus cannot decompose the
time measured into its components.
3. Click distance: as another rough measure of search efficiency, we recorded the distance between
successive user clicks on the document set (Figure 6b). While in the ranked-list presentation, this
notion has an unambiguous interpretation (the distance between the first document in the list and the
tenth document is nine), it is not immediately obvious what the corresponding notion is in the cluster
presentation. Specifically, when a user skips a cluster, what is the distance traversed? We discuss our
distance function below.
We acknowledge that in the experiments described below, we are comparing the behavior of two distinct
user populations – HuskySearch users and Grouper users. Nevertheless, the data we collected shows
substantial differences in how people use the two interfaces. We plan to follow up with a carefully
controlled user study that will complement the log analysis and help to test hypotheses as to why the
observed differences arise.
Looking at the number of retrieved documents that were followed by the user, we note a considerable
difference between the two systems. In HuskySearch, users followed only 1.0 documents on average,
whereas in Grouper, users followed 40% more. Table 2 presents the percentage of sessions in which the
user has followed a certain number of retrieved documents for both systems. All sessions resulting in 8 or
more followed documents are presented in a single column (the “8+” column). The data shows that the
percentage of sessions in which users do not follow even a single document is smaller in Grouper (46%)
than in HuskySearch (53%), while the percentage of sessions in which users followed multiple documents
is higher in Grouper.
Optimistically, Table 2 suggests three possible hypotheses: First, it might be easier to find interesting
documents using Grouper (as fewer sessions result in no document being followed). Second, once an
interesting document has been located, Grouper seems to do a better job in helping the user find
additional interesting documents. And third, users might prefer a clustering interface such as Grouper
when faced with tasks where several documents are required. As mentioned earlier, a user study is needed
to test these hypotheses.
Table 2: The percentage of sessions in which users have followed a certain number of retrieved
documents in Grouper and in HuskySearch. All sessions resulting in 8 or more followed documents are
presented in a single column (the “8+” column). In HuskySearch, users followed two or more
documents in 20% of the sessions, while in Grouper they did so in 29% of the sessions.
The second metric we use to compare the two systems is the time the user spends on each document
followed. When a user follows a document, the request is actually through our server and we log the
arrival times of these requests. The time spent on a followed document is measured as the time between a
user’s request for a document and the user’s previous request (or, for the first followed document, the
time the results page was sent to the user). The measured time therefore includes the time to download
and view a selected document and the time to find the next document of interest. We exclude from the
analysis all sessions in which more than 200 seconds elapsed between the following of consecutive
documents (approximately 10% of the sessions), as the user might have abandoned the search task for a
while. The time spent on each document as a function of the number of the documents followed thus far
in the session is presented in Figure 6a.
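These per-document times can be read directly off the request log. A sketch under the stated rules, with hypothetical timestamp inputs:

```python
def session_document_times(results_sent, request_times, cutoff=200.0):
    """Time spent on each followed document in a session: the gap between a
    document request and the previous request (or, for the first document,
    the time the results page was sent). Returns None if any gap exceeds the
    cutoff, since the user may have abandoned the task for a while."""
    gaps = []
    prev = results_sent
    for t in request_times:         # timestamps in seconds, in arrival order
        gaps.append(t - prev)
        prev = t
    return None if any(g > cutoff for g in gaps) else gaps
```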
From Figure 6a it can be seen that the average time per document for the first three followed
documents of a session is lower in HuskySearch, but the Grouper interface is apparently faster for the rest
of the followed documents of the session.
[Figure 6 plots. Left panel (6a): time spent on document (sec.), roughly 40 to 65, versus the document’s
rank in session (1-8), with lines for HuskySearch and Grouper. Right panel (6b): click distance, 0 to 16,
versus the document’s rank in session (1-8), with lines for HuskySearch, Grouper-snippets, and
Grouper-total.]
Figure 6a: The average amount of time spent on each document followed, as a function of the
document’s rank (the number of documents followed thus far in the session). Figure 6b: The click
distance as a function of the document’s rank. This metric attempts to capture the effort spent by the
user to find the next interesting document. The “Grouper-snippets” line represents the number of
snippets skipped. The “Grouper-total” line represents the total click distance, which is the sum of the
number of snippets and clusters skipped (under the assumption that the “cost” of scanning a cluster’s
summary is equal to that of scanning a snippet).
While speed appears to be an advantage for Grouper when multiple documents are followed, some
caveats are in order. In the analysis above we assume the time to download a document and the time to
view it are independent of the system used and of a document’s rank in the session, and therefore these
factors should average out across all measurements. Unfortunately, we cannot validate this empirically in
our system. Another possible problem in interpreting this metric is that it is not clear that shorter time is
better. It might indicate that an interesting document was easier to find, but on the other hand, it might
indicate that the user did not find the document interesting enough to dedicate much time to it. Again, the
logs do not give us enough information to discriminate between these possible interpretations.
The third metric used to compare the two systems was click distance, an attempt to estimate the cost
to the user of finding the next interesting document. We assume only documents followed are beneficial
to the user; snippets that are scanned but not followed represent wasted efforts. Likewise, the user is
assumed to gain nothing from the structure of the clusters. The distance between clicks on a ranked list is
simply the number of snippets between the two clicks (or preceding the first click). In the cluster
interface, we assume the user does not scan through the snippets in clusters that are skipped, but in any
cluster visited, all of the snippets are scanned. For example, if the user clicks on the second document of
the first cluster (which contains 20 documents), skips the second cluster and then clicks on the fifth
document of the third cluster, we assume she scanned 22 snippets between these two clicks (18 in the first
cluster and 4 in the third). In the clustering interface there is the additional cost of skipping clusters. This
should be added to the measure presented above as they both represent wasted effort.
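Under these assumptions the distance is straightforward to compute. The sketch below reproduces the worked example; the same-cluster case is our own reading, not spelled out in the text, and the sizes of the second and third clusters are hypothetical.

```python
def cluster_click_distance(prev_click, next_click, cluster_sizes):
    """Click distance between two followed documents in the cluster interface.
    A click is (cluster_index, position_in_cluster), both zero-based. All
    snippets in a visited cluster are assumed scanned; each skipped cluster
    adds one summary scan, costed the same as one snippet."""
    (c1, p1), (c2, p2) = prev_click, next_click
    if c1 == c2:                          # same cluster: snippets in between
        return p2 - p1 - 1
    snippets = (cluster_sizes[c1] - p1 - 1) + p2
    skipped_clusters = c2 - c1 - 1
    return snippets + skipped_clusters

# The example above: second document of a 20-document first cluster, then the
# fifth document of the third cluster, skipping the second:
# cluster_click_distance((0, 1), (2, 4), [20, 8, 12]) -> 23
# (18 + 4 = 22 snippets scanned, plus one skipped cluster summary).
```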
This data is presented in Figure 6b. The “Grouper-snippets” line represents the number of snippets
skipped. The “Grouper-total” line represents the total click distance, which is the sum of the number of
snippets and clusters skipped. We assume the “cost” of scanning a summary of a cluster is equal to that of
scanning a snippet. Of course, alternative assumptions can be made, and it is easy to see from the graph
how these alternative “Grouper-total” lines would look. These results exhibit a trend similar to the results
of the previous experiment – finding the first few interesting documents actually requires more “effort” in
Grouper as compared to HuskySearch’s ranked-list presentation, but after the first two or three
documents, finding additional interesting documents appears to require less effort in Grouper. This
possibly reflects the fact that the user must spend some time/effort to understand the clusters that Grouper
presents, but after doing so the clusters are helpful in finding required information faster. It also confirms
our observation that a clustering interface is not suitable for all search tasks.
5. Conclusion
In this paper, we have presented Grouper – a clustering interface for Web search engine results. We
described the system and the design decisions made in its implementation. Grouper enabled us to carry
out the first empirical assessment of user behavior given a clustering interface to Web search results. By
analyzing the system’s logs and comparing them to the logs of HuskySearch, we were able to show
a substantial difference in the patterns of use of the two systems. We are now planning a controlled user
study to validate some of the hypotheses inferred from this data.
By fielding a document clustering system on the Web, we were also able to identify some of its
shortcomings. Grouper creates clusters by merging base clusters (a phrase and the set of documents that
contain it). This is often beneficial, but it can be confusing, especially when the clusters fail to capture the
semantic distinctions the users were expecting. Grouper-II will address this issue by allowing users to
view non-merged base clusters. This view is essentially an inverted index of phrases to the retrieved
document set. We believe this view will be useful in a larger portion of the queries and will be more
intuitive to novice users.
We have also noticed that the number of clusters created increases as more documents are retrieved.
As we are interested in making the system scale to substantially larger retrieved document sets, clusters
should be presented hierarchically so users can navigate the results more efficiently. Grouper-II will
support a hierarchical and interactive interface, similar to Scatter/Gather [13]. The user will be able to
select one or more base clusters to focus on, thus defining a subset of the retrieved document set.
Grouper-II will then, recursively, display the base clusters found in this subset of documents.
Acknowledgements We thank AnHai Doan, Marc Friedman, Mike Perkowitz and Erik Selberg for
commenting on an earlier draft of this paper. This research was funded in part by Office of Naval Research
grant 98-1-0177, and by National Science Foundation grants IRI-9357772 and DL-9874759.
6. References
[1] R. B. Allen, P. Obry and M. Littman, An interface for navigating clustered document sets returned by queries,
in: Proceedings of the ACM Conference on Organizational Computing Systems, 1993, pp 166-171.
[2] A. Z. Broder, S. C. Glassman, M. S. Manasse and G. Zweig, Syntactic clustering of the Web, in: Proceedings
of the Sixth International World Wide Web Conference (WWW6), 1997.
[3] M. Chalmers, and P. Chitson, BEAD: explorations in information visualization, in: Proceedings of the 15th
International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’92),
1992, pp 330-337.
[4] W. B. Croft, Organizing and searching large files of documents, Ph.D. Thesis, University of Cambridge,
1978.
[5] D. R. Cutting, D. R. Karger, J. O. Pedersen and J. W. Tukey, Scatter/Gather: a cluster-based approach to
browsing large document collections, in: Proceedings of the 15th International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR’92), 1992, pp 318-329.
[6] D. R. Cutting, D. R. Karger and J. O. Pedersen, Constant interaction-time Scatter/Gather browsing of large
document collections, in: Proceedings of the 16th International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR’93), 1993, pp 126-135.
[7] J. L. Fagan, Experiments in automatic phrase indexing for document retrieval: a comparison of syntactic and
non-syntactic methods, Ph.D. Thesis, Cornell University, 1987.
[8] R. H. Fowler, W. A. Fowler and B. A. Wilson, Integrating query, thesaurus, and documents through a
common visual representation, in: Proceedings of the 14th International ACM SIGIR Conference on Research
and Development in Information Retrieval (SIGIR’91), 1991, pp 142-151.
[9] D. Gusfield, Algorithms on strings, trees and sequences: Computer Science and Computational Biology,
chapter 6, Cambridge University Press, 1997.
[10] M. Hearst, TileBars: visualization of term distribution information in full text information access, in:
Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI’95), 1995, pp
59-66.
[11] M. A. Hearst, The use of categories and clusters in information access interfaces, in: T. Strzalkowski (Ed.),
Natural Language Information Retrieval, Kluwer Academic Publishers, 1998.
[12] M. Hearst and C. Karadi, Cat-a-Cone: an interface for specifying searches and viewing retrieval results using a
large category hierarchy, in: Proceedings of the 20th International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR’97), 1997.
[13] M. A. Hearst and J. O. Pedersen, Reexamining the cluster hypothesis: Scatter/Gather on retrieval results, in:
Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information
Retrieval (SIGIR’96), 1996, pp 76-84.
[14] D. R. Hill, A vector clustering technique, in: Samuelson (Ed.), Mechanized Information Storage, Retrieval
and Dissemination, North-Holland, Amsterdam, 1968.
[15] D. A. Hull, G. Grefenstette, B. M. Schulze, E. Gaussier, H. Schütze and L. O. Pedersen, Xerox TREC-5 site
report: routing, filtering, NLP, and Spanish tracks, in: D. K. Harman (Ed.), The Fifth Text Retrieval
Conference (TREC-5), NIST Special Publication, 1997.
[16] T. Kohonen, Exploration of very large databases by self-organizing maps, in: Proceedings of the IEEE
International Conference on Neural Networks, vol. 1, 1997, pp 1-6.
[17] R. Korfhage, To see, or not to see – is that the query?, in: Proceedings of the 14th International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR’91), 1991, pp 134-141.
[18] A. V. Leouski and W. B. Croft, An evaluation of techniques for clustering search results. Technical Report IR-
76, Department of Computer Science, University of Massachusetts, Amherst, 1996.
[20] X. Lin, A self-organizing semantic map for information retrieval, in: Proceedings of the 14th International
ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’91), 1991, pp 262-
269.
[21] Y. S. Maarek and A. J. Wecker, The Librarian’s Assistant: automatically organizing on-line books into
dynamic bookshelves, in: Proceedings of the International Conference on Intelligent Multimedia Information
Retrieval Systems and Management (RIAO’94), 1994.
[22] D. Merkl, Exploration of text collections with hierarchical feature maps, in: Proceedings of the 20th
International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’97),
1997, pp 186-195.
[23] L. T. Nowell, R. K. France, D. Hix, L. S. Heath and E. Fox, Visualizing search results: some alternatives to
query document similarity, in: Proceedings of the 19th International ACM SIGIR Conference on Research
and Development in Information Retrieval (SIGIR’96), 1996, pp 67-75.
[24] K. A. Olsen, R. R. Korfhage, M. B. Spring, K. M. Sochats and J. G. Williams, Visualization of a document
collection: the VIBE system. Information Processing and Management, vol. 29(1), 1993, pp 69-81.
[25] J. J. Rocchio, Document retrieval systems – optimization and evaluation, Ph.D. Thesis, Harvard University,
1966.
[26] G. Salton, C. S. Yang and C. T. Yu, A theory of term importance in automatic text analysis, Journal of the
American Society for Information Science, vol. 26(1), 1975, pp 33-44.
[27] E. Selberg and O. Etzioni, Multi-service search and comparison using the MetaCrawler, in: Proceedings of the
4th World Wide Web Conference (WWW4), 1995.
[28] C. Silverstein and J. O. Pedersen, Almost-constant time clustering of arbitrary corpus subsets, in: Proceedings
of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval
(SIGIR’97), 1997, pp 60-66.
[29] A. Spoerri, InfoCrystal: A visual tool for information retrieval and management, in: Proceedings of
Information Knowledge and Management (CIKM'93), 1993, pp 150-157.
[30] R. Swan and J. Allan, Aspect Windows, 3-D Visualizations, and Indirect Comparisons of Information
Retrieval Systems, in: Proceedings of the 21st International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR'98), 1998.
[31] R. H. Thompson and W. B. Croft, Support for browsing in an intelligent text retrieval system. International
Journal of Man-Machine Studies, vol. 30(6), 1989, pp 639-668.
[32] E. Ukkonen, On-line construction of suffix trees, Algorithmica, vol. 14, 1995, pp 249-260.
[33] A. Veerasamy and N. J. Belkin, Evaluation of a tool for visualization of information retrieval results, in:
Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information
Retrieval (SIGIR’96), 1996, pp 85-92.
[34] E. M. Voorhees, Implementing agglomerative hierarchical clustering algorithms for use in document retrieval,
Information Processing and Management, vol. 22, 1986, pp 465-476.
[35] P. Weiner, Linear pattern matching algorithms, in: Proceedings of the 14th Annual Symposium on
Foundations of Computer Science (FOCS), 1973, pp 1-11.
[36] P. Willett, Recent trends in hierarchical document clustering: a critical review. Information Processing and
Management, vol. 24, 1988, pp 577-597.
[37] O. Zamir, Visualization of search results in document retrieval systems, General Examination Report,
University of Washington, 1998.
[38] O. Zamir and O. Etzioni, Web document clustering: a feasibility demonstration, in: Proceedings of the 21st
International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98),
1998, pp 46-54.
7. Vitae