Indexing and Retrieval of Scientific Literature

Steve Lawrence, Kurt Bollacker, C. Lee Giles


NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
{lawrence,kurt,giles}@research.nj.nec.com

Abstract

The web has greatly improved access to scientific literature. However, scientific articles on the web are largely disorganized, with research articles being spread across archive sites, institution sites, journal sites, and researcher homepages. No index covers all of the available literature, and the major web search engines typically do not index the content of Postscript/PDF documents at all. This paper discusses the creation of digital libraries of scientific literature on the web, including the efficient location of articles, full-text indexing of the articles, autonomous citation indexing, information extraction, display of query-sensitive summaries and citation context, hubs and authorities computation, similar document detection, user profiling, distributed error correction, graph analysis, and detection of overlapping documents. The software for the system is available at no cost for non-commercial use.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM '99 11/99 Kansas City, MO, USA
© 1999 ACM 1-58113-146-1/99/0010...$5.00

1 Introduction

The progress of science has often been hampered by the inefficiency of traditional methods of disseminating scientific information. Publication delays, and the difficulty in easily locating all relevant literature, mean that researchers are not always working with the most up-to-date and comprehensive information available. The World Wide Web, along with search engines such as AltaVista, have greatly improved the dissemination and retrieval of an increasingly large body of information [1, 20, 21]. However the major web search engines such as AltaVista do not index the content of a large body of scientific literature on the publicly indexable web: Postscript or PDF copies of research articles. This article discusses the creation of an index of scientific literature on the web, called CiteSeer, along with several features that improve access to scientific literature.

The purpose of this paper is to outline the CiteSeer project, to provide details of several aspects of the project not contained in the previous papers that focus on the citation indexing component [3, 13], and to encourage work on the CiteSeer project or related projects (the software and data from CiteSeer is available at no cost for non-commercial use).

2 Related Work

There are many freely available indices of scientific literature on the web, examples include the LANL e-Print archive, NCSTRL, UCSTRI, ML Papers, LTRS, NZDL, CORA, and CORR. There are also many commercial services, one of the most well-known being the Science Citation Index (http://www.isinet.com/). The effectiveness of the available services varies according to discipline. The most successful free service appears to be the LANL e-Print archive [14], which has had great success in the physics community (a plan to shut down the service could not be carried out due to the response from the user community). These services are mostly complementary, providing different levels of comprehensiveness, recency, and features. None of the indices are comprehensive, so using multiple indices increases coverage, similar to using multiple web search engines [20, 21, 24].

3 CiteSeer

The CiteSeer project at NEC Research Institute [13] aims to improve the dissemination, retrieval, and accessibility of scientific literature. Specific areas of focus include the effective use of the capabilities of the web, and the use of
machine learning. CiteSeer locates scientific articles on the web, extracts information such as the citations, citation context, article title, etc., and performs full-text indexing and autonomous citation indexing. Rather than providing just another digital library, CiteSeer provides algorithms, techniques, and software that can be used in other digital libraries. The project encompasses areas including:

1. Location of articles. The efficient location of scientific articles on the web.
2. Full-text indexing of articles, including the content of Postscript and PDF files.
3. Autonomous Citation Indexing (ACI) - the autonomous creation of an index of the links between scientific articles, similar to the Science Citation Index.
4. Information extraction. CiteSeer includes algorithms and machine learning techniques for automatically extracting information such as the title and author from indexed documents and individual citations.
5. Query-sensitive summaries of documents. Similar to the context display in the search engines Inquirus [19] and Google [6] (http://www.google.com/), CiteSeer aims to present results in such a way as to facilitate relevance estimation by the user, and improve the overall efficiency of the search process.
6. Related documents. CiteSeer employs new algorithms for the location of related documents, based on citation information in addition to the usual word information.
7. Overlapping documents. The web often contains minor revisions of articles, which lead to near duplicates in the database if not dealt with. CiteSeer identifies the amount of overlap between documents in order to detect these revisions.
8. Citation graph analysis. Analyzing the network of scientific literature. For example, similar to the work of Page et al. [22, 6] and Kleinberg [17], CiteSeer aims to identify “hubs” and “authorities” in the scientific literature.
9. User profiling. CiteSeer has a user profiling system which tracks the interests of users and recommends new citations and documents when they appear.
10. Distributed error correction. CiteSeer allows users of the system to correct errors in the database.
11. External links. Where possible, CiteSeer links to articles in external databases.
12. Universal article access. Access to details, statistics, and external links for articles using a standard article key.

The following sections provide more details on these areas of the CiteSeer project.

3.1 Locating Scientific Articles

There are a number of possible approaches for locating scientific articles on the web. Brute force search of the web is possible but time consuming. A more efficient technique might use machine learning algorithms (or heuristics) to direct the crawl [10, 23]. This can help to locate more documents earlier in the crawl, but does not guarantee finding all papers without crawling the entire web.

CiteSeer uses a more efficient approach to the location of articles on the publicly indexable web. Specifically, CiteSeer uses web search engines such as AltaVista to directly locate pages likely to contain research articles. Multiple queries are sent to the search engines using various keywords likely to match pages containing scientific articles (e.g. “Postscript”, “PDF”, “technical report”, “conference”, “proceedings”, etc.). Multiple search engines are used because this substantially improves the coverage of the web [20, 21]. This method has a number of advantages. The method avoids the duplication of work involved in crawling the web, and allows the combination of search engines, which substantially improves coverage over any single crawler currently available. The method also allows easy customization of the database to specific areas of the literature by including appropriate keywords in the queries. Hundreds of thousands of scientific articles can be quickly and efficiently located using this technique.

CiteSeer also supports crawling functionality, designed to be used from start pages located using techniques similar to those above. We are not yet using this functionality because of the success of the more directed approach. Additionally, CiteSeer monitors mailing lists, automatically indexing new documents as they are posted.

Once services like CiteSeer become common, we expect that many researchers will register their papers directly, making all of these techniques less necessary (thousands of papers have already been registered with CiteSeer).

3.2 Full-Text Indexing

CiteSeer includes full-text indexing of the entire content of articles, similar to the New Zealand Digital Library [31, 32]. Postscript and PDF documents are converted to text using pstotext (http://www.research.digital.com/SRC/virtualpaper/pstotext.html) from the Digital Virtual Paper project (http://www.research.digital.com/SRC/virtualpaper/home.html). The full-text indexing performed by CiteSeer is similar to the standard techniques [30], however there are some important differences.
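For orientation, the following is a minimal sketch of the kind of standard positional inverted index that full-text indexing builds on. It is an illustration only and not CiteSeer's implementation: the class name `PositionalIndex` and its methods are hypothetical, the index lives in memory rather than in a compressed record file, and tokenization is a plain whitespace split. Every token is kept, including single letters, so that phrase queries work on any sequence of words.

```python
from collections import defaultdict


class PositionalIndex:
    """Toy in-memory inverted index: word -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        # Every token is indexed; nothing is dropped as a stop word.
        for position, word in enumerate(text.lower().split()):
            self.postings[word].setdefault(doc_id, []).append(position)

    def phrase_search(self, phrase):
        """Return sorted ids of documents containing the phrase words consecutively."""
        words = phrase.lower().split()
        if not words or any(w not in self.postings for w in words):
            return []
        # Only documents that contain every word can match.
        candidates = set(self.postings[words[0]])
        for word in words[1:]:
            candidates &= set(self.postings[word])
        hits = []
        for doc_id in candidates:
            later = [set(self.postings[w][doc_id]) for w in words[1:]]
            for start in self.postings[words[0]][doc_id]:
                # Word k+1 of the phrase must sit k+1 positions after the first word.
                if all(start + k + 1 in pos for k, pos in enumerate(later)):
                    hits.append(doc_id)
                    break
        return sorted(hits)


index = PositionalIndex()
index.add_document(1, "m i jordan and r a jacobs hierarchical mixtures of experts")
index.add_document(2, "a survey by d jordan of the mixtures literature")
print(index.phrase_search("m i jordan"))
print(index.phrase_search("jordan"))
```

Storing positions alongside document identifiers is what makes phrase and proximity matching possible; a real system would also compress the postings and keep them on disk.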

CiteSeer aims to be as up to date as possible. Therefore the indexing is designed for continuous operation, so that updates can be performed continuously, without building or merging a new index. Index organization is similar to previous work [8, 11, 27]. CiteSeer maintains the usual hash table of words (inverted index) where each entry contains a compressed version of the word and a pointer to a block in a variable length record file that contains the matching documents and corresponding positions within the documents (compressed into a single bit stream with variable length identifiers). As the entries for each word grow the space allocated for them grows as a power of 2. CiteSeer supports full Boolean, phrase and proximity retrieval, using a standard recursive descent parser.

CiteSeer does not use any “stop” words (common words like “the”, “a”, etc. that are typically excluded from indexing). This is important for allowing higher precision search. One example where this is important is when searching for a specific author. Often author names are only specified in citations using initials instead of the full name. When looking for information on an author with a common last name it is important to be able to restrict the results only to those items that contain the correct first name or author initials. Thus, it is necessary to be able to search for phrases containing initials.

With a test database of about 200,000 documents and over 2.5 million citations, queries are typically executed in a fraction of a second on a Pentium Pro 200 machine. Performance degrades to about one second or longer when queries include phrases that contain very common words. In the demonstration database, queries using author initials (for example, “m jordan” or “m i jordan”) are common, accounting for about 20% of all queries. These queries were often taking several seconds due to the very frequent occurrence of initials in citations. In order to speed up these queries we cache the list of word positions and maintain a hash table for each initial indexed by document identifier (each entry contains the list of positions within the document). Proximity comparisons including initials typically do not need to examine the entire document list anymore. We randomly selected 25 queries that did not use initials and 25 that did in order to quantify the speedup. Table 1 shows the results. Without the speedup queries with initials were 7 times slower than other queries on average. With the speedup the queries with initials executed faster than the other queries.

Table 1: Query execution time for 25 random queries with and without initials in a test database of about 200,000 documents containing over 2.5 million citations.

3.3 Autonomous Citation Indexing

CiteSeer includes autonomous citation indexing - the autonomous creation of a citation index similar to the Science Citation Index® [12]. A citation index indexes the citations that research articles make, allowing, for example, the location of papers that cite a given paper. Autonomous citation indexing provides several advantages over traditional citation indexing. Traditional citation indexing requires manual effort. Automating the task as performed by CiteSeer should result in a reduction in cost and an increase in the availability of citation indices. An autonomous citation index can also provide more comprehensive and up-to-date indices of the literature - the Science Citation Index primarily indexes journal articles while CiteSeer can also index conference papers, preprints, technical reports, etc. The importance of indexing non-journal items varies by discipline, but is particularly important in areas like computer science where important research is often presented at conferences. We took a sample of 10 papers from the WWW7 conference and analyzed the distribution of references. We found that only 19.7% of references were to journal papers, while 30.3% were to conference papers, 18.0% were to books, and 32.0% were to technical reports, theses, and web pages.

For details of the citation matching and citation indexing in CiteSeer see [3, 13]. For related research, see the Open Journal Project [15], and Cameron's [9] proposal of a “universal, [Internet-based,] bibliographic and citation database linking every scholarly work ever written”.

Note that CiteSeer has a general philosophy of investigating word-insensitive algorithms before introducing algorithms that use specific word information. This is in order to minimize bias in the errors made by the system. For example, it is simple to create a probabilistic model that labels the individual fields of citations by using the probability of each specific word belonging to certain fields. Such an algorithm can work well, however the algorithm depends critically on the coverage and recency of available training data, and errors are likely to be biased towards authors, titles, etc. that are not contained in the training data. This may correspond to a bias against new authors, new subjects, etc. that could potentially have a negative effect on scientific dissemination.

Figure 1 shows a sample CiteSeer response for a search within the citations extracted from articles. Citations to the same paper that may be written in different formats are grouped together [13]. Articles can be sorted according to
the number of citations to them or by date. The “hosts” and “self” numbers indicate the number of distinct hosts that the citing articles were found on, and the number of citations predicted to be self-citations. The graph at the bottom shows the number of citations versus the year of publication of the cited articles. The “Context” links show the context of the citations, the “Bib” links provide a BibTeX entry for the article, the “Track” links activate tracking for the article (new citations will be emailed to the user), and the “Check” links display the individual citations that were grouped together as the same article (this can be used for detecting errors in the citation matching algorithms). The “Field” selection allows restricting the search results to the author or title fields.

3.4 Information Extraction

CiteSeer performs several types of information extraction on the indexed documents:

- Extraction of the list of citations. CiteSeer can identify the citation list in a document, re-order documents that print in reverse order, remove page numbers from within the citation list, and delineate individual citations using citation tags, vertical spacing or indentation.
- Extraction of the context of citations. For each citation made in a document, CiteSeer can extract the context of the article where the citation is made. Regular expressions are used to identify the references in the text which may differ from the citation tag used in the citation list.
- Extraction of subfields from citations. CiteSeer extracts fields such as the title and author from citations. For more details see [13].
- Extraction of bibliographic details of the articles being indexed. CiteSeer can identify the indexed articles by extracting the title and author from the header of the document. This is simple to do with reasonably high accuracy by analyzing the font information. The algorithm currently used by CiteSeer is as follows:
  - The document is marked up with tags indicating font changes. Each font change is identified by the average width of characters contained in the font.
  - Font markup is changed to reflect relative font changes with respect to the most common font size in the document.
  - Heuristics search for the title of the document within the resulting representation (for example, the title is often written in the largest font in the header of the document).
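The font-based steps above can be sketched roughly as follows. This is a minimal illustration and not the code CiteSeer uses: the function name `guess_title`, the `(text, font_size)` input format, and the use of a plain positive font size in place of the average character width are all assumptions made for the example. Only the single heuristic named in the text (the title is often in the largest header font) is applied.

```python
from collections import Counter


def guess_title(runs):
    """Guess a title from header text runs given as (text, font_size) pairs.

    `font_size` is a stand-in for the average character width of a font.
    """
    if not runs:
        return ""
    # Most common font size, weighted by the number of characters set in it.
    weight = Counter()
    for text, size in runs:
        weight[size] += len(text)
    common_size = weight.most_common(1)[0][0]
    # Re-express every run relative to the most common font size.
    relative = [(text, size / common_size) for text, size in runs]
    largest = max(ratio for _, ratio in relative)
    # Heuristic: the title is the first contiguous stretch in the largest font.
    title_parts = []
    for text, ratio in relative:
        if ratio == largest:
            title_parts.append(text.strip())
        elif title_parts:
            break
    return " ".join(title_parts)


header = [
    ("Indexing and Retrieval of", 17.0),
    ("Scientific Literature", 17.0),
    ("Steve Lawrence, Kurt Bollacker, C. Lee Giles", 12.0),
    ("The web has greatly improved access to scientific literature. However, ...", 10.0),
]
print(guess_title(header))
```

Working with sizes relative to the most common font keeps the heuristic independent of the absolute point size a document happens to use. A header set entirely in one font would defeat this single rule, which is one reason a practical system would combine several heuristics.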
For a hidden Markov model approach to extracting subfields see [25]. Figure 2 shows a sample of the details available for each document indexed by CiteSeer. The document header, abstract, and citations can be seen, along with an “active bibliography” of related documents.

[Figure 1: screenshot not reproduced. Note in the figure: Self-citations are not included in the graph or the main number of citations.]
Figure 1: Sample response of CiteSeer for a search within the citations. Articles can be sorted according to the number of citations to them or by date. The “hosts” and “self” numbers indicate the number of distinct hosts that the citing articles were found on, and the number of citations predicted to be self-citations. The graph at the bottom shows the number of citations versus the year of publication of the cited articles. The “Context” links show the context of the citations, the “Bib” links provide a BibTeX entry for the article, the “Track” links activate tracking for the article (new citations will be emailed to the user), and the “Check” links display the individual citations that were grouped together as the same article. Article titles are automatically highlighted. The syntax w/2 in the query means that words must appear within a maximum distance of 2 words.

3.5 Context and Query-Sensitive Summaries

Once a user locates an article of possible interest in the database, CiteSeer can display the context of how that article is cited in subsequent publications. When searching within the indexed documents, CiteSeer displays sample context of the query terms within the documents. These techniques typically help the user to more efficiently determine the relevance of the documents in question. In general, query-sensitive summaries of documents have been
shown to improve search efficiency [28, 29]. Tombros performed a user study that showed that users working with the query-sensitive summaries had a higher success rate. The query-sensitive summaries allowed the users to more accurately and rapidly determine the relevance of documents, and greatly reduced the need to refer to the full text of documents. Figure 3 shows sample citation context and Figure 4 shows a sample of the response of CiteSeer for a search within the text of the indexed articles.

[Figure 2: screenshot not reproduced; it shows the detail page for the article “Hierarchical mixtures of experts and the EM algorithm (1993)”.]
Figure 2: Sample detail information for an article in the CiteSeer database. The document header, abstract, and citations can be seen, along with an “active bibliography” of related documents.

[Figure 3: screenshot not reproduced.]
Figure 3: Sample citation context information for an article in the CiteSeer database. For each article citing the given article of interest, the header, context of the citation, and the specific form of the citation can be seen. Note that the two citations grouped together above actually refer to a technical report and the corresponding journal article. This is by design, the algorithm currently used in the demonstration system groups together articles with the same title and authors.

3.6 Related Documents

Research articles contain citations to related and previous research. CiteSeer augments the citation list of articles by locating other related documents using algorithms based on word and citation information. Details of the algorithms can be found in [13]. When viewing the details of a document, CiteSeer displays an “active bibliography”, showing the most related documents. The active bibliography is computed in a fraction of a second in real-time, and changes to reflect changes to the database as new documents are indexed.

One observation we have made regarding the active bibliographies is that papers by the same authors or authors at the same institution are often ranked highly, as might be expected. It may therefore be useful to separately identify such articles in order to highlight related documents from other authors.

3.7 Overlapping Documents

There are many duplicate research articles on the web. Identical documents are easy to detect (CiteSeer uses SHA checksums), however there are many minor revisions of articles that would lead to duplicates in the digital library unless detected. For example, two co-authors might have the same article online but one of the authors might have made a minor revision to the article (e.g. by adding the publication details).

CiteSeer takes a sentence based approach to detecting these revisions. A database of all sentences is maintained and the percentage of identical sentences is computed between all documents. Pairs of documents with a very high percentage
of identical sentences are considered duplicates. Sentence identification is non-trivial, however correct identification of sentences is not necessary for this application. CiteSeer simply uses punctuation to delineate sentences (periods, exclamation marks, and question marks), although this is not always correct (e.g. consider abbreviations).

Sentences are stored in a hash table, where each entry contains a list of the documents containing the sentence. Sentences are preprocessed by removing non-alphabetic characters and vowels, truncating to a maximum length (currently 80), ignoring short sentences (currently those with less than 30 characters), and folding 5 additional characters onto other characters to enable packing two characters per byte. In order to avoid the 2Gb maximum file size limitation imposed by some operating systems, entries are split between multiple files. Sentences in the header and citation list of documents are ignored.

A similar sentence based technique is used by COPS to detect copies of documents [5]. Another well-known copy detection mechanism is SCAM [26], which uses word frequencies and works better for detecting documents with partial overlap. Also related is [7], however the algorithm they present is more expensive than necessary for the application considered here.

[Figure 4: screenshot not reproduced; it shows the results of a search for “reinforcement learning”.]
Figure 4: Sample response of CiteSeer for a search within the full-text of indexed articles, ordered by “hubs”. Articles can be sorted according to citations, “hubs”, and date. Query-sensitive summaries are shown for each article highlighting the context of the query terms within the document. The search can be restricted to the title or header fields using the “Field” selection.

3.8 Citation Graph Analysis

CiteSeer currently performs two different types of analysis on the graph formed by citation links. CiteSeer predicts whether or not citations are self-citations by comparing the authors in the citations with the authors in the header of the indexed articles. This information is used when ranking documents based on citations (currently, self-citations are not counted).

Page et al. [22] and Kleinberg [17] have introduced methods of ranking web pages using link information (also see the improvements to Kleinberg's algorithm by Bharat and Henzinger [2]). Kleinberg's algorithm identifies “hubs” and “authorities”. “Authorities” are pages with many links to them, while “hubs” are pages containing many links to authorities. In the context of scientific articles, we suggest that “hub” articles can be useful for an introduction to a field, and correspond to review, survey, or tutorial style articles. Review articles that summarize important literature are relatively rare in some areas, however the ranking of regular articles as hubs may be useful because these papers can contain good introductions to areas of the literature in their prior work and introductory sections. CiteSeer currently allows ordering articles by either “hubs” or “authorities”. We are investigating extensions to the Page/Kleinberg algorithms, such as normalization according to the number of citations contained in articles, and incorporation of the temporal aspect of citations (more recent articles are expected to have fewer citations). CiteSeer can order results based on the number of citations compared to the expected number of citations, considering the time since the article was published.

We are also interested in analyzing the graph of citations to identify communities and experts (see [16]) and to analyze the relationships between articles and the evolution of the literature. One interesting application of citation graph analysis and/or related document algorithms is the identification of potential reviewers for a given article. This may help to locate more appropriate reviewers for given articles. (Currently, some editors use web search engines to help them locate potential reviewers).

3.9 User Profiling

CiteSeer has a system for automatically recommending new relevant documents via email or the web interface [4]. For example, new documents that contain specific keywords or citations, or that are related to specific documents can be recommended. A personal profile is maintained for each user. The profile can be updated manually by the user, or automatically with machine learning based on browsing patterns or responses to recommendations. Users can remain anonymous in which case they are identified only by a unique identification number stored in a HTTP cookie.

3.10 Distributed Error Correction

CiteSeer uses distributed error correction to allow individual users to correct errors in the database. See [18] for details of the implementation and issues such as trust, recovery, detecting malicious changes, and the use of correction information to improve automated algorithms or predict the probability of errors.

3.11 External Links

Some papers are not freely available on the web due to copyright restrictions, however they may be available in publisher databases. When possible, CiteSeer links citations in the database to external publishers. The ease of this task depends on the organization of the individual databases. The American Physical Society (APS) provides a good example and allows easy linking of papers. A sample URL for a paper in the APS database is: http://publish.aps.org/abstract/PRD/v10/p20. This link would refer to a paper in volume 10 of Physical Review D on page 20. It is relatively simple for CiteSeer to extract this information from citations. CiteSeer currently generates these links in real-time.

3.12 Universal Article Access

CiteSeer contains many kinds of information about articles. For source articles, CiteSeer has many details including the title, authors, abstract, citations, and full-text. However, CiteSeer has citation details and statistics for all articles cited by any of the source articles. CiteSeer also knows how to find articles in selected external databases. CiteSeer allows access to all of the information for a given article using a universal article key. Currently, this key consists of the last name of the first author of the article, the year of publication, and the first word of the article title (ignoring “the”, “a”, etc.), although alternative keys are likely to be supported in the future. This key is unique for a large

4 Availability

Perhaps most importantly, NEC Research Institute has made the software for CiteSeer available at no cost for non-commercial use. To obtain the latest version contact citeseer@research.nj.nec.com. There is a mailing list for CiteSeer announcements, to join the list send a message to majordomo@research.nj.nec.com with subscribe citeseer-announce in the body of the message. A demonstration CiteSeer service indexing over 200,000 computer science articles containing over 3 million citations can be found at http://csindex.com/.

5 Summary

CiteSeer is a digital library that aims to improve the dissemination, retrieval, and accessibility of scientific literature on the web. Specific areas of focus include the effective use of the capabilities of the web, and the use of machine learning. Software and data from the CiteSeer project is available at no cost for non-commercial use, which we hope will encourage extensions of this and related work.

References

[1] J.M. Barrie and D.E. Presti. The World Wide Web as an instructional tool. Science, 274:371-372, 1996.
[2] K. Bharat and M.R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In SIGIR Conference on Research and Development in Information Retrieval, 1998.
[3] Kurt Bollacker, Steve Lawrence, and C. Lee Giles. CiteSeer: An autonomous web agent for automatic retrieval and identification of interesting publications. In Katia P. Sycara and Michael Wooldridge, editors, Proceedings of the Second International Conference on Autonomous Agents, pages 116-123, New York, 1998. ACM Press.
[4] Kurt Bollacker, Steve Lawrence, and C. Lee Giles. A system for automatic personalized tracking of scientific literature on the web. In Digital Libraries 99 - The Fourth ACM Conference on Digital Libraries, pages 105-113, New York, 1999. ACM Press.
[5] S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the ACM SIGMOD Annual Conference, 1995.
percentageof articles, but not for all articles. When the key
is not unique CiteSeer presents all articles with the same [61 S. Brin and L. Page. The anatomy of a large-scale
hypertextual web search engine. In Seventh Znterna-
key for user selection.
tional World Wide Web Conference, Brisbane, Aus-
tralia, 1998.

[7] Andrei Broder, Steve Glassman, Mark Manasse, and Geoffrey Zweig.
Syntactic clustering of the web. In Sixth International World Wide
Web Conference, pages 391-404, 1997.

[8] Eric W. Brown, James P. Callan, and W. Bruce Croft. Fast
incremental indexing for full-text information retrieval. In
Proceedings of the 20th International Conference on Very Large
Databases, pages 192-202, 1994.

[9] Robert D. Cameron. A universal citation database as a catalyst
for reform in scholarly communication. First Monday, 2(4), 1997.

[10] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient
crawling through URL ordering. In Proceedings of the Seventh
World-Wide Web Conference, 1998.

[11] Doug Cutting and Jan Pedersen. Optimizations for dynamic
inverted index maintenance. In Proceedings of the 13th International
ACM SIGIR Conference on Research and Development in Information
Retrieval, pages 405-411, 1990.

[12] Eugene Garfield. Citation Indexing: Its Theory and Application
in Science, Technology, and Humanities. Wiley, New York, 1979.

[13] C. Lee Giles, Kurt Bollacker, and Steve Lawrence. CiteSeer: An
automatic citation indexing system. In Ian Witten, Rob Akscyn, and
Frank M. Shipman III, editors, Digital Libraries 98 - The Third ACM
Conference on Digital Libraries, pages 89-98, Pittsburgh, PA, June
23-26 1998. ACM Press.

[14] P. Ginsparg. First steps towards electronic research
communication. Computers in Physics, 8:390-396, 1994.

[15] S. Hitchcock, L. Carr, S. Harris, J.M.N. Hey, and W. Hall.
Citation linking: Improving access to online journals. In Robert B.
Allen and Edie Rasmussen, editors, Proceedings of the 2nd ACM
International Conference on Digital Libraries, pages 115-122, New
York, NY, 1997. ACM.

[16] H. Kautz, B. Selman, and M. Shah. ReferralWeb: Combining social
networks and collaborative filtering. Communications of the ACM,
30(3), 1997.

[17] J. Kleinberg. Authoritative sources in a hyperlinked
environment. In Proceedings ACM-SIAM Symposium on Discrete
Algorithms, pages 668-677, San Francisco, California, 25-27 January
1998.

[18] Steve Lawrence, Kurt Bollacker, and C. Lee Giles. Distributed
error correction. In Digital Libraries 99 - The Fourth ACM
Conference on Digital Libraries, page 232, New York, 1999. ACM
Press.

[19] Steve Lawrence and C. Lee Giles. Context and page analysis for
improved web search. IEEE Internet Computing, 2(4):38-46, 1998.

[20] Steve Lawrence and C. Lee Giles. Searching the World Wide Web.
Science, 280(5360):98-100, 1998.

[21] Steve Lawrence and C. Lee Giles. Accessibility of information
on the web. Nature, 400(6740):107-109, 1999.

[22] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank
citation ranking: Bringing order to the web. 1998.

[23] J. Rennie and A. McCallum. Using reinforcement learning to
spider the web efficiently. In Proceedings of the Sixteenth
International Conference on Machine Learning (ICML-99), 1999.

[24] E. Selberg and O. Etzioni. Multi-service search and comparison
using the MetaCrawler. In Proceedings of the 1995 World Wide Web
Conference, 1995.

[25] Kristie Seymore, Andrew McCallum, and Roni Rosenfeld. Learning
hidden Markov model structure for information extraction. In AAAI 99
Workshop on Machine Learning for Information Extraction, 1999.

[26] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection
mechanism for digital documents. In 2nd International Conference on
the Theory and Practice of Digital Libraries, 1995.

[27] Anthony Tomasic, Hector Garcia-Molina, and Kurt Shoens.
Incremental updates of inverted lists for text document retrieval.
In Proceedings of the 1994 ACM SIGMOD Conference, pages 289-300,
1994.

[28] A. Tombros and M. Sanderson. Advantages of query biased
summaries in information retrieval. In Proceedings of SIGIR 98,
Melbourne, Australia, 1998.

[29] Anastasios Tombros. Reflecting User Information Needs Through
Query Biased Summaries. PhD thesis, Department of Computer Science,
University of Glasgow, September 1997.

[30] I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes:
Compressing and indexing documents and images. Van Nostrand
Reinhold, New York, NY, 1994.

[31] I.H. Witten, C.G. Nevill-Manning, and S.J. Cunningham. Building
a digital library for computer science research: technical issues.
In Proceedings Australasian Computer Science Conference, Melbourne,
Australia, January 1996.

[32] I.H. Witten, C.G. Nevill-Manning, and S.J. Cunningham. Digital
libraries based on full-text retrieval. In Proceedings of WebNet 96,
San Francisco, October 1996.
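
As an illustration of the external links of Section 3.11 and the universal article key of Section 3.12, the following is a minimal Python sketch, not the CiteSeer implementation. The function names, the journal-abbreviation table, the stop-word list beyond “the” and “a”, and the lowercase concatenated key format are all assumptions made for the example. The URL pattern simply mirrors the sample APS URL given in Section 3.11.

```python
import re

# Assumed stop words to skip when choosing the first title word; the paper
# names only "the" and "a" and leaves the rest as "etc.".
STOP_WORDS = {"the", "a", "an"}

# Hypothetical table mapping journal names to the abbreviations used in
# APS-style URLs; only the journal from the paper's example is listed.
APS_JOURNALS = {"Physical Review D": "PRD"}


def aps_link(journal, volume, page):
    """Return an APS-style abstract URL, or None for an unknown journal."""
    abbrev = APS_JOURNALS.get(journal)
    if abbrev is None:
        return None
    return f"http://publish.aps.org/abstract/{abbrev}/v{volume}/p{page}"


def article_key(first_author_last_name, year, title):
    """Combine first author's last name, year, and first non-stop title word.

    The lowercase, separator-free format is an assumption of this sketch.
    """
    words = re.findall(r"[A-Za-z0-9]+", title.lower())
    first_word = next((w for w in words if w not in STOP_WORDS), "")
    return f"{first_author_last_name.lower()}{year}{first_word}"


def lookup(index, key):
    """Return every article sharing a key, so a user can choose among them."""
    return index.get(key, [])
```

Under these assumptions, the Brin and Page paper on a large-scale hypertextual web search engine would receive the key brin1998anatomy, and a citation to volume 10, page 20 of Physical Review D would map to the sample URL above. Returning a list from the lookup reflects the observation that the key is not unique for all articles.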
