Documentation Extraction Framework
Contents
1 Getting started
2 Overview
3 Core Module
4 Dump extraction Module
  4.1 Configuration
  4.2 Running the dump extraction
  4.3 Running Abstract Extraction
5 Step by Step Guide
6 Server Module
  6.1 Configuration
  6.2 Running the extraction server
1 Getting started
Before you can start developing, you need to take care of some prerequisites:
DBpedia Extraction Framework: Get the most recent revision from the GitHub repository.
  git clone git://github.com/dbpedia/extraction-framework.git (read only)
Java Development Kit: The DBpedia extraction framework uses Java. Get the most recent JDK from http://java.com/.
  The DBpedia extraction framework currently requires at least Java 7 (JDK v1.7.0) for full functionality.
  You can compile and run it with an earlier JDK by deleting or blanking the following two files. (The launchers purge-download and purge-extract in the dump module won't work, but they are not vitally necessary.)
    core/src/main/scala/org/dbpedia/extraction/util/RichPath.scala
    dump/src/main/scala/org/dbpedia/extraction/dump/clean/Clean.scala
Maven: Used for project management and build automation. Get it from http://maven.apache.org/. Either Maven 2 or 3 should work.
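You can verify that the prerequisites are in place by asking each tool for its version (the expected versions are the ones listed above):
$ git --version
$ java -version
$ mvn --version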
This is enough to compile and run the DBpedia extraction framework. The required input files (the Wikimedia dumps) will be downloaded by the extractor code if configured to do so (see below).
If you'd like to use an IDE for coding, there are a number of options:
IntelliJ IDEA: Currently the most stable IDE for developing with Scala. To get the most recent Scala plugin, get the current early access version and install the Scala plugin from the official repository. Please follow the DBpedia & IntelliJ Quick Start Guide.
Eclipse: Please follow the DBpedia & Eclipse Quick Start Guide.
NetBeans: Also offers a Scala plugin.
2 Overview
The Live, Wiktionary, and Server modules are not necessary for running the extraction, and can
be disabled from the pom.xml for that use case.
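In Maven, the modules taking part in a build are listed in the <modules> section of the parent pom.xml, so a module can be disabled by commenting out its entry. A sketch (the surrounding contents of pom.xml are omitted here):
<modules>
  <module>core</module>
  <module>dump</module>
  <!-- <module>live</module> -->
  <!-- <module>wiktionary</module> -->
  <!-- <module>server</module> -->
</modules>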
3 Core Module
Components
Source: The Source package provides an abstraction over a source of MediaWiki pages.
WikiParser: The WikiParser package specifies a parser, which transforms a MediaWiki page source into an Abstract Syntax Tree (AST).
Extractor: An Extractor is a mapping from a page node to a graph of statements about it.
Destination: The Destination package provides an abstraction over a destination of RDF
statements.
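Conceptually, these four abstractions form a pipeline from page source to RDF output. The Scala sketch below uses simplified stand-ins to illustrate the flow; the types and signatures are illustrative and do not reproduce the framework's actual interfaces:
// Illustrative stand-ins only; not the framework's real types or signatures.
object PipelineSketch {
  trait Source      { def pages: Iterator[String] }         // yields MediaWiki page sources
  trait WikiParser  { def parse(page: String): Node }       // page source -> AST
  trait Extractor   { def extract(node: Node): Seq[Quad] }  // AST node -> RDF statements
  trait Destination { def write(quads: Seq[Quad]): Unit }   // consumes RDF statements

  case class Node(children: Seq[Node] = Nil)                // hypothetical AST node
  case class Quad(s: String, p: String, o: String)          // hypothetical RDF statement

  // Wire the components: parse each page, extract statements, write them out.
  def run(src: Source, parser: WikiParser, ex: Extractor, dest: Destination): Unit =
    for (page <- src.pages) dest.write(ex.extract(parser.parse(page)))
}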
In addition to the core components, a number of utility packages offer essential functionality to be used by the extraction code:
Ontology: Classes used to represent an ontology. Methods for both reading and writing ontologies are provided. All classes are located in the namespace org.dbpedia.extraction.ontology.
DataParser: Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace org.dbpedia.extraction.dataparser.
Util: Various utility classes. All classes are located in the namespace org.dbpedia.extraction.util.
For details about a package, follow the links. You can find the complete scaladoc here.
4 Dump extraction Module
The framework is undergoing a lot of refactoring, and the following sections are not 100% correct. For now, you should use the 'dump' branch for dump-based extraction:
$ git clone git://github.com/dbpedia/extraction-framework.git
$ cd extraction-framework
$ mvn clean install
$ cd dump
$ ../run download config=download.properties.file
$ ../run extraction extraction.properties.file
The extraction and download property files contain many documentation comments and can be easily adapted to your needs.
4.1 Configuration
All configuration is read from a Java properties file named dump/config.properties. After a fresh checkout, you will need to copy it from the .default file and modify it according to your needs (e.g., removing languages you do not want to extract).
dumpDir: The directory where the dumps are located. The extraction framework expects to see subdirectories of the form "enwiki/[date]" inside it.
updateDumps: If true, the extraction framework will download every dump which is either missing or not up to date. If you want to use your own dumps or don't want the framework to update the dumps, set it to false and make sure that (uncompressed) dumps are available in dumpDir/<lang>/<date>/<lang>wiki-<date>-pages-articles.xml.
outputDir: The output directory.
languages: The languages of the Wikipedia dumps to be extracted.
extractors: The extractor classes to be used for the extraction. See Available Extractors. Language-specific extractors can be configured using a property of the format extractors.{wikiCode}, e.g., extractors.en.
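Put together, a minimal dump/config.properties might look like the sketch below. The paths are illustrative, and the extractor class names and list syntax are assumptions made for illustration; the comments in the .default file document the real options:
# Illustrative sketch of dump/config.properties; not a complete or verified configuration.
dumpDir=/srv/dbpedia/dumps
updateDumps=false
outputDir=/srv/dbpedia/output
languages=en,it
# Extractor names below are examples only; see Available Extractors for actual values.
extractors=org.dbpedia.extraction.mappings.LabelExtractor
extractors.en=org.dbpedia.extraction.mappings.AbstractExtractor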
4.3 Running Abstract Extraction
Abstracts are not generated by the SimpleWikiParser; they are produced by a local Wikipedia clone using a modified MediaWiki installation.
In order to generate clean abstracts from Wikipedia articles, one needs to render wiki templates as they would be rendered in the original Wikipedia instance. So in order for the DBpedia AbstractExtractor to work, a running MediaWiki instance with Wikipedia data in a database is necessary.
See:
http://dbpedia.hg.sourceforge.[..]tExtractor.scala#l66
http://dbpedia.hg.sourceforge.[..]xtraction/README.txt
Extraction on Ubuntu.
Internationalization
If you leave the updateDumps setting at false, you can download the dump you are interested in from http://dumps.wikimedia.org/backup-index.html: pick the latest complete one from <lang>wiki (e.g., itwiki) and choose the pages-articles.xml.bz2 file (e.g., itwiki-20120226-pages-articles.xml.bz2). The input file must be placed in dumpDir/<lang>/<date> (e.g., /srv/dbpedia/dumps/it/20120226/itwiki-20120226-pages-articles.xml.bz2, if dumpDir is /srv/dbpedia/dumps).
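For example, for Italian (the date below is illustrative, and the URL follows the usual dumps.wikimedia.org layout):
$ mkdir -p /srv/dbpedia/dumps/it/20120226
$ cd /srv/dbpedia/dumps/it/20120226
$ wget http://dumps.wikimedia.org/itwiki/20120226/itwiki-20120226-pages-articles.xml.bz2
$ bunzip2 itwiki-20120226-pages-articles.xml.bz2
The last step uncompresses the dump, since the framework expects uncompressed dumps when updateDumps is false (see 4.1).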
6 Server Module
This module is intended for testing the framework.
6.1 Configuration
The parameters of the server are configured in two Scala classes in the server module.
6.2 Running the extraction server
The extraction server is started by running mvn scala:run from the directory extraction/server. The standard port is 9999.
A browser window should open in which you can specify the language and the URI that you would
like to extract.
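Assuming the build succeeded, a typical session looks like this (only the command and port named above are used):
$ cd extraction/server
$ mvn scala:run
If no browser window opens automatically, the server can be reached at http://localhost:9999/ on the standard port.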