Slug: A Semantic Web Crawler
Abstract
This paper introduces “Slug”, a web crawler (or “Scutter”) designed for harvesting semantic web
content. Implemented in Java using the Jena API, Slug provides a configurable, modular framework
that allows a great degree of flexibility in configuring the retrieval, processing and storage of
harvested content. The framework provides an RDF vocabulary for describing crawler
configurations and collects metadata concerning crawling activity. Crawler metadata allows for
reporting and analysis of crawling progress, as well as more efficient retrieval through the storage
of HTTP caching data.
Introduction
In concept a semantic web crawler differs from a traditional web crawler in only two regards: the
format of the source material it is traversing, and the means of specifying links between information
resources. Whereas a traditional crawler operates on HTML documents, linked using HTML
anchors, a Semantic Web crawler (or “scutter” [Scutter], as they are known within the community)
operates on RDF metadata with linking implemented using the rdfs:seeAlso relationship
[Brickley2003].
However, in practice, the aggregation and processing of semantic web content by a scutter differs
significantly from that of a normal web crawler. A normal crawler need only contend with extracting text and links from (possibly invalid) content, whereas a semantic web crawler must carry out additional processing tasks: merging of information resources via inverse functional properties; tracking provenance of data; harvesting schemas and ontologies in
addition to source data; extraction of embedded metadata (e.g. EXIF, XMP), etc. Matt Biddulph's
2004 paper “Crawling the Semantic Web” [Biddulph04] provides an excellent introduction to these
issues and summarizes implementation experience gained whilst constructing a scutter.
Despite this promising initial research, and the growing availability of rich sources of freely
available metadata, there are still very few open source semantic web crawling frameworks
[RDFCrawl][Biddulph03][Elmo].
Although large RDF datasets are readily available in a number of specific domains, e.g.
bioinformatics, there are as yet no scutters actively aggregating semantic web data for the purposes
of enabling easier application development and further research. The creativity surrounding
“mashups” [Mashup], which mix data and functionality across domains, highlights the need for
easily accessible cross-domain RDF data sets. Semantic web research into areas such as storage and
query performance, federated queries, etc. will be enabled by the provision of a wider variety of
data sources. For example, recent experience has shown that the structure of an RDF data set can greatly affect triple store performance [Broekstra].
To address these concerns and encourage additional research, the author has written Slug [Slug], an
open source, configurable and modular scutter. The emphasis of the framework is on modularity
and ease of use rather than speed (hence the name!), although performance is currently acceptable.
The rest of this paper describes the design and features of the framework.
Summary of Features
The current release (“alpha-2”) of the Slug crawling framework includes the following
functionality:
● Multi-threaded retrieval of data via HTTP
● Support for HTTP Conditional GETs [Biddulph03b], enabling optimised retrieval of previously fetched content (illustrated by the sketch following this list)
● Internal collection of statistics on crawler state, e.g. number of active workers, to facilitate
monitoring of the crawler
● Crawling from a fixed list of starting points (a “Scutter Plan” [ScutterPlan]) and/or
refreshing previously crawled data sources
● A persistent “memory” for capturing crawl-related metadata, e.g. HTTP status codes from responses, the location of local copies of data, the location from which a source was discovered, etc.
● Configurable crawler behaviour including crawling “depth”, blacklisting of URLs based on
regular expressions, and avoidance of crawling loops
● Automatic processing and traversal of rdfs:seeAlso links between RDF documents
● Creation of a local cache of retrieved data
● Storage of retrieved RDF data using Jena persistent models
● Configurable processing and filtering pipelines enabling the scutter to be customised for
specific tasks
● Creation of scutter “profiles” using a custom RDF vocabulary for describing the scutter
configuration
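The conditional GET mechanism works along the following lines: validators (ETag, Last-Modified) captured from an earlier response are replayed on the next request, and a 304 response means the previously cached copy is still valid. The sketch below illustrates the idea using plain java.net classes and the example validator values shown later in this paper; it is not the actual Slug retrieval code.

import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative sketch of an HTTP conditional GET (not the actual Slug code).
// Validators captured from an earlier response are replayed on the next request.
public class ConditionalGetExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.ldodds.com/ldodds.rdf");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // ETag and Last-Modified values stored from a previous fetch
        conn.setRequestProperty("If-None-Match", "\"192033-1e38-44217fc0\"");
        conn.setRequestProperty("If-Modified-Since", "Thu, 16 Jun 2005 12:47:03 GMT");
        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            // 304: the locally cached copy is still valid, nothing to download
            System.out.println("Not modified; using cached copy");
        } else {
            // 200: read and cache the new representation along with its validators
            System.out.println("Fetching fresh copy: " + conn.getHeaderField("ETag"));
        }
    }
}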
Crawler Architecture
The UML diagram provided in Figure 1 summarizes the key components and relationships present
in the Slug framework. Central to the design of the framework are variations of the Master-Slave
and Producer-Consumer design patterns.
The Controller is responsible for managing a list of Tasks which are carried out by a number of
Worker instances. The Controller implementation is agnostic to the exact nature of each Task and
the implementation of the Workers. A factory class is used by the Controller to create worker
instances at application startup and on demand during application processing; each Worker runs as a
separate thread within the application, co-ordinated by the single Controller instance.
The core Controller-Worker framework is therefore similar to the Master-Slave design pattern in
which a central component delegates tasks to a number of slave instances. The variation imposed
here is that the Controller itself does not aggregate the results of completing each Task; neither do
the workers.
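The sketch below illustrates this division of responsibilities. The interface and method names are illustrative only, not the actual Slug API; it simply shows a Controller that owns a task queue and a set of Worker threads that pull tasks from it and hand results to a Consumer.

// Illustrative sketch of the Controller-Worker relationship (names are not
// the actual Slug API). The Controller owns the task queue; each Worker runs
// in its own thread, takes tasks, and hands results to a Consumer.
interface Task { String getUrl(); }

interface Consumer { void consume(Task task, byte[] response); }

class Controller {
    private final java.util.Queue<Task> tasks = new java.util.concurrent.ConcurrentLinkedQueue<Task>();
    private final Consumer consumer;

    Controller(Consumer consumer) { this.consumer = consumer; }

    void addTask(Task task) { tasks.add(task); }
    Task nextTask() { return tasks.poll(); }

    void start(int workers) {
        for (int i = 0; i < workers; i++) {
            new Thread(new Worker(this, consumer)).start();
        }
    }
}

class Worker implements Runnable {
    private final Controller controller;
    private final Consumer consumer;

    Worker(Controller controller, Consumer consumer) {
        this.controller = controller;
        this.consumer = consumer;
    }

    public void run() {
        Task task;
        while ((task = controller.nextTask()) != null) {
            byte[] response = fetch(task);      // retrieve content over HTTP
            consumer.consume(task, response);   // results are handled by the Consumer, not the Controller
        }
    }

    private byte[] fetch(Task task) { /* HTTP retrieval elided */ return new byte[0]; }
}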
The processing of task results, which may involve multiple processing stages, is abstracted from the
task itself by applying the Producer-Consumer pattern. Each Worker produces results which are
handled by a Consumer instance. This allows alternate processing behaviour to be substituted into
the framework without changing the core functionality that deals solely with multi-threaded
retrieval of content via HTTP.
Finally, to allow for modular extension of the consuming of task results, the framework provides an
implementation of the Consumer interface that co-ordinates several other Consumer instances. This
delegation model allows multiple Consumer instances to be involved in processing a single
Response generated by each Worker.
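Building on the interfaces sketched above, a delegating consumer might look roughly as follows (again, the names are illustrative rather than the real Slug classes):

import java.util.List;

// Illustrative composite Consumer: forwards each response to a list of
// delegate Consumers, so storage, link extraction, etc. can be combined.
class DelegatingConsumer implements Consumer {
    private final List<Consumer> delegates;

    DelegatingConsumer(List<Consumer> delegates) { this.delegates = delegates; }

    public void consume(Task task, byte[] response) {
        for (Consumer delegate : delegates) {
            delegate.consume(task, response);
        }
    }
}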
A typical Consumer configuration might consist of:
● A ResponseStorer that stores retrieved RDF data in a local file system cache, treating the
response as a series of bytes
● An RDFConsumer that parses the retrieved data, extracting rdfs:seeAlso links in order
to generate new Tasks, adding them to the Controller's job queue.
● A PersistentResponseStorer that parses the retrieved data, storing the resulting triples in a
persistent Jena model
Another key component in the architecture is the “task filter”. Task filters are used by the Controller to vet newly encountered tasks against arbitrary criteria: for example, that the URL associated with a new task does not match a given regular expression, allowing specific URLs to be blacklisted. As with the Consumer interface, the framework allows for
multiple task filters to co-operate in the decision to admit a new task to the queue. Complex filtering
criteria can therefore be assembled from fine-grained filter implementations.
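A regular-expression blacklist filter of the kind described above might be sketched as follows, reusing the Task interface from the earlier sketch; the TaskFilter interface and class names here are assumptions, not the actual Slug types:

import java.util.regex.Pattern;

// Illustrative task filter (names are not the real Slug classes): rejects any
// task whose URL matches a blacklist regular expression.
interface TaskFilter { boolean accept(Task task); }

class RegexBlacklistFilter implements TaskFilter {
    private final Pattern pattern;

    RegexBlacklistFilter(String regex) { this.pattern = Pattern.compile(regex); }

    public boolean accept(Task task) {
        // reject the task if the blacklist expression matches its URL
        return !pattern.matcher(task.getUrl()).find();
    }
}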
The final aspect of the Slug framework is the concept of a “memory”. A Memory instance, maintained by the Controller but accessible to all framework components, provides a persistent RDF model for storing crawl-related metadata, e.g. HTTP response codes and headers. See “The
Scutter Vocabulary” for more details.
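A minimal sketch of a file-backed memory, assuming a simple load-on-startup and save-on-shutdown lifecycle (the real Memory implementation may differ), could look like this:

import java.io.FileOutputStream;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.util.FileManager;

// Illustrative sketch of a file-backed memory: crawl metadata is held in a
// Jena model, loaded at startup and written back out when the crawl ends.
class FileMemory {
    private final String file;
    private final Model model;

    FileMemory(String file) {
        this.file = file;
        Model loaded;
        try {
            loaded = FileManager.get().loadModel(file);   // reuse an existing memory file
        } catch (Exception e) {
            loaded = ModelFactory.createDefaultModel();   // or start with an empty model
        }
        this.model = loaded;
    }

    Model getModel() { return model; }

    void save() throws Exception {
        // write the crawl metadata back to the configured file
        model.write(new FileOutputStream(file), "RDF/XML");
    }
}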
All of the components in the Slug framework can be configured using a custom RDF vocabulary
that allows extension of its capabilities by integrating custom components, e.g. new TaskFilters or
Consumers.
Crawler Configuration
A Slug scutter instance is configured via a custom RDF vocabulary, in the namespace
“http://purl.org/NET/schemas/slug/config/”. A sample crawler configuration file is
provided in Appendix 1.
The classes and properties in the configuration vocabulary echo the main application architecture:
● A Scutter resource has properties configuring its initial number of workers, its memory, and
the required consumer and task filter components. The latter are both modelled as RDF
sequences.
● A Memory resource is configured with properties indicating a file name (in the case of a
simple file system based memory) or database connection parameters.
● Consumer and Filter components are configured using an “impl” property that indicates the
fully-qualified Java class name of the implementation. Java reflection is used to
automatically instantiate these components at runtime. Component instances are responsible
for configuring themselves further from the available metadata: each instance is passed a
reference to its “peer” RDF resource in the configuration file immediately after instantiation (a rough sketch of this step follows this list).
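The instantiation step referenced above can be sketched roughly as follows; the Configurable interface and the property handling shown here are assumptions rather than the actual Slug code:

import com.hp.hpl.jena.rdf.model.Resource;

// Rough sketch of reflective component creation: the fully-qualified class
// name is read from the "impl" property and the new instance is handed its
// "peer" configuration resource. Method and interface names are illustrative.
class ComponentFactory {
    Object createComponent(Resource config) throws Exception {
        String className = config
            .getProperty(config.getModel().createProperty(
                "http://purl.org/NET/schemas/slug/config/", "impl"))
            .getString();
        Object component = Class.forName(className).newInstance();
        if (component instanceof Configurable) {
            // the component configures itself further from its peer resource
            ((Configurable) component).setConfig(config);
        }
        return component;
    }
}

// Assumed interface for components that accept configuration
interface Configurable { void setConfig(Resource config); }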
An RDF vocabulary was chosen as the means for configuring the framework rather than a custom
XML vocabulary or Java properties file for several reasons.
Firstly, the ability to add arbitrary additional properties makes it easy to extend the configuration for specific component instances. For example, a custom Consumer instance may need to be configured with the
location and authentication credentials for a remote HTTP service to which it passes newly fetched
data. This same flexibility can also be exploited to improve readability of the configuration by using
Dublin Core properties to associate a title and description with each component.
Secondly, it was desirable to allow for the specification of different crawling “profiles”: alternate
configurations of the framework specialised for different crawling strategies. With an RDF vocabulary it was easier to relate individual components to the Scutter instance that uses them.
For example, a configuration file may contain the specification of a “fast, shallow” scutter that defines a Scutter instance with a large number of worker threads, but a task filter that ensures that the depth of crawling is quite shallow. The same configuration might define a “slow, deep” scutter that uses a smaller number of workers, allows for a greater depth when following links, but additionally blacklists any LiveJournal URLs encountered during the crawl. These two scutter
profiles may share the same configuration relating to memory storage, and the consumer pipeline.
These configuration options allow the basic framework to be used to specify a number of different
crawling profiles by reconfiguring the components for specialised purposes.
The Scutter Vocabulary
Crawl metadata is recorded in the scutter memory using the Scutter Vocabulary [ScutterVocab]. The following SPARQL query, the contents of the file list-fetches.rq, lists the fetches recorded for a single source document:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX scutter: <http://purl.org/net/scutter/>
SELECT ?date ?status ?contentType ?rawTripleCount
WHERE
{
?representation scutter:fetch ?fetch.
?fetch dc:date ?date.
OPTIONAL { ?fetch scutter:status ?status. }
OPTIONAL { ?fetch scutter:contentType ?contentType. }
OPTIONAL { ?fetch scutter:rawTripleCount ?rawTripleCount. }
?representation scutter:source <http://www.ldodds.com/ldodds.rdf>.
}
ORDER BY DESC(?date)
Queries such as the one listed above are important for administering a scutter instance. It should be
possible to discover errors, trace the history of interactions with a given resource, etc. The crawl
metadata may also be useful for identifying problems with the publishing of semantic web data, e.g.
broken links or incorrect mime types in responses.
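Such queries can be executed against the memory model using Jena's ARQ API, roughly as follows (file names and selected variables are illustrative):

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;

// Rough sketch: run list-fetches.rq over the scutter memory to list the
// fetch history recorded for a source document.
public class ListFetches {
    public static void main(String[] args) throws Exception {
        Model memory = FileManager.get().loadModel("memory.rdf");
        Query query = QueryFactory.read("list-fetches.rq");
        QueryExecution qe = QueryExecutionFactory.create(query, memory);
        ResultSet results = qe.execSelect();
        while (results.hasNext()) {
            QuerySolution row = results.nextSolution();
            System.out.println(row.get("date") + " " + row.get("status"));
        }
        qe.close();
    }
}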
Cross-referencing crawl metadata with that aggregated by the scutter (e.g. into a persistent store)
provides other possibilities, e.g. tracing usage of particular RDF schemas and/or properties.
A basic feature of the Slug framework is the ability to create a local cache of remote RDF data. The crawler memory (see above) includes properties that map source HTTP URLs to local file system URIs. It is therefore possible to apply some simple inferencing rules to convert from the Scutter Vocabulary to the Jena location-mapping vocabulary [FileManager]. The following example uses the Jena rules syntax [Inference].
@prefix scutter: <http://purl.org/net/scutter/>.
@prefix lm: <http://jena.hpl.hp.com/2004/08/location-mapping#> .
[inferMappingFromLocalCopy:
   (?rep scutter:source ?source), (?rep scutter:localCopy ?copy),
   makeTemp(?mapping), makeLiteral(?source, ?name), makeLiteral(?copy, ?altName)
   -> (?mapping lm:name ?name), (?mapping lm:altName ?altName)]
[createMapping:
   (?mapping lm:name ?source), makeTemp(?x)
   -> (?x lm:mapping ?mapping)]
The rules assume the availability of a bespoke inferencing primitive (shown here as makeLiteral; the name is indicative only) that binds the String value of a variable to a newly created literal. This is required to map the source properties from the
ScutterVocab (which are RDF Resources) to the literal values used in the location mapping
vocabulary. An implementation of this primitive is included in the Slug distribution.
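A primitive of this kind can be written as a Jena rule Builtin, roughly as sketched below; the class and registration details are illustrative rather than the version shipped with Slug:

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.reasoner.rulesys.BuiltinRegistry;
import com.hp.hpl.jena.reasoner.rulesys.RuleContext;
import com.hp.hpl.jena.reasoner.rulesys.builtins.BaseBuiltin;

// Rough sketch of a rule primitive that binds the string form of a resource
// (its URI) to a new literal node. Registering it makes it available to the
// rules shown above. Details are illustrative, not the actual Slug code.
public class MakeLiteral extends BaseBuiltin {
    public String getName() { return "makeLiteral"; }

    public int getArgLength() { return 2; }

    public boolean bodyCall(Node[] args, int length, RuleContext context) {
        checkArgs(length, context);
        Node resource = getArg(0, args, context);
        if (!resource.isURI()) {
            return false;
        }
        // bind the second argument to a plain literal holding the URI string
        Node literal = Node.createLiteral(resource.getURI());
        return context.getEnv().bind(args[1], literal);
    }

    public static void register() {
        BuiltinRegistry.theRegistry.register(new MakeLiteral());
    }
}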
Jena-based applications may apply these rules to generate a location mapping configuration which can then be used to optimise how the application fetches remote resources. One specific use case for this feature is the implementation of a SPARQL Protocol endpoint which may deal with arbitrarily specified SPARQL Datasets.
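For example, a Jena application might register the inferred mappings with a LocationMapper so that subsequent reads of a source URL are satisfied from the local cache. A rough sketch, with illustrative file paths:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;
import com.hp.hpl.jena.util.LocationMapper;

// Rough sketch: register a mapping from a source URL to its locally cached
// copy, so that FileManager.loadModel() reads the cache instead of the web.
// The file paths are illustrative.
public class UseLocalCache {
    public static void main(String[] args) {
        FileManager fm = FileManager.get();
        LocationMapper mapper = fm.getLocationMapper();
        mapper.addAltEntry("http://www.ldodds.com/ldodds.rdf",
                           "file:cache/www.ldodds.com/ldodds.rdf");
        // this now reads the cached copy rather than fetching over HTTP
        Model model = fm.loadModel("http://www.ldodds.com/ldodds.rdf");
        System.out.println("Loaded " + model.size() + " statements");
    }
}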
Appendix 1: Sample Crawler Configuration
<slug:Scutter rdf:about="slug">
<slug:hasMemory rdf:resource="memory"/>
<slug:workers>10</slug:workers>
<slug:consumers>
<rdf:Seq>
<rdf:li rdf:resource="storer"/>
<rdf:li rdf:resource="rdf-consumer"/>
</rdf:Seq>
</slug:consumers>
<slug:filters>
<rdf:Seq>
<rdf:li rdf:resource="single-fetch-filter"/>
<rdf:li rdf:resource="depth-filter"/>
<rdf:li rdf:resource="regex-filter"/>
</rdf:Seq>
</slug:filters>
</slug:Scutter>
<slug:Memory rdf:about="memory">
<slug:file>c:\eclipse\workspace\slug\memory.rdf</slug:file>
</slug:Memory>
<slug:Consumer rdf:about="storer">
<dc:title>ResponseStorer</dc:title>
<slug:impl>com.ldodds.slug.http.ResponseStorer</slug:impl>
<slug:cache>c:\temp\slug-cache</slug:cache>
</slug:Consumer>
<slug:Consumer rdf:about="rdf-consumer">
<dc:title>RDFConsumer</dc:title>
<slug:impl>com.ldodds.slug.http.RDFConsumer</slug:impl>
</slug:Consumer>
<slug:Filter rdf:about="regex-filter">
<dc:title>Block URLs based on Regex</dc:title>
<slug:impl>com.ldodds.slug.http.RegexFilter</slug:impl>
<!-- regular expression, if matches, then url not included -->
<slug:filter>livejournal</slug:filter>
</slug:Filter>
The following fragment illustrates the crawl metadata recorded for a single Representation, using the Scutter Vocabulary:
<scutter:Representation>
<scutter:latestFetch>
<scutter:Fetch>
<dc:date>2006-02-27T14:46:29+0000</dc:date>
<scutter:status>304</scutter:status>
<scutter:etag>"192033-1e38-44217fc0"</scutter:etag>
</scutter:Fetch>
</scutter:latestFetch>
<scutter:fetch>
<scutter:Fetch>
<dc:date>2006-02-27T14:14:49+0000</dc:date>
<scutter:status>304</scutter:status>
<scutter:etag>"192033-1e38-44217fc0"</scutter:etag>
<scutter:rawTripleCount>107</scutter:rawTripleCount>
</scutter:Fetch>
</scutter:fetch>
<scutter:fetch>
<scutter:Fetch>
<dc:date>2006-02-27T14:05:12+0000</dc:date>
<scutter:status>200</scutter:status>
<scutter:contentType>text/xml</scutter:contentType>
<scutter:lastModified>Thu, 16 Jun 2005 12:47:03 GMT</scutter:lastModified>
<scutter:etag>"192033-1e38-44217fc0"</scutter:etag>
<scutter:rawTripleCount>107</scutter:rawTripleCount>
</scutter:Fetch>
</scutter:fetch>
<scutter:localCopy>c:\temp\slug-cache\www.ldodds.com\ldodds.rdf</scutter:localCopy>
</scutter:Representation>
References
[Biddulph03] Matt Biddulph (2003) An RDF Crawler
http://www.hackdiary.com/archives/000030.html
[Biddulph03b] Matt Biddulph (2003) Using HTTP conditional GET in java for efficient polling
http://www.hackdiary.com/archives/000028.html
[Biddulph04] Matt Biddulph (2004) Crawling the Semantic Web
http://www.idealliance.org/papers/dx_xmle04/papers/03-06-03/03-06-03.html
[Brickley2003] Dan Brickley (2003) RDF Hyper-linking
http://www.w3.org/2001/sw/Europe/talks/xml2003/Overview-6.html
[Broekstra] Jeen Broekstra (2006) Pitfalls in Benchmarking Triple Stores
http://jeenbroekstra.blogspot.com/2006/02/pitfalls-in-benchmarking-triple-stores.html
[Elmo] Peter Mika, The Elmo Scutter
http://www.openrdf.org/doc/elmo/users/userguide.html#d0e229
[FileManager] The Jena FileManager and LocationMapper
http://jena.sourceforge.net/how-to/filemanager.html
[GRDDL] Dominique Hazaël-Massieux, Dan Connolly (2005) Gleaning Resource Descriptions
from Dialects of Languages (GRDDL) http://www.w3.org/TeamSubmission/grddl/
[Inference] Dave Reynolds (2005) Jena 2 Inference Support
http://jena.sourceforge.net/inference/index.html#rules
[Mashup] Mashups http://en.wikipedia.org/wiki/Mashup_(web_application_hybrid)
[RDFCrawl] Siegfried Handschuh (2000) RDF Crawl http://ontobroker.semanticweb.org/rdfcrawl/
[Scutter] Scutter http://rdfweb.org/topic/Scutter
[ScutterPlan] Scutter Plan http://rdfweb.org/topic/ScutterPlan
[ScutterVocab] Morten Frederiksen (2003) Scutter Vocab http://rdfweb.org/topic/ScutterVocab
[Slug] Leigh Dodds (2006) Slug: A Semantic Web Crawler http://www.ldodds.com/projects/slug
[XMLCatalogs] (2005) XML Catalogs
http://www.oasis-open.org/committees/download.php/14041/xml-catalogs.html