Module On Digitization-GREENSTONE
Module on Digitization
and Digital Libraries
NOTE
Please note that this PDF version does not have the interactive features
offered through the IMARK courseware such as exercises with feedback,
pop-ups, animations etc.
We recommend that you take the lesson using the interactive courseware
environment, and use the PDF version for printing the lesson and to use as a
reference after you have completed the course.
Introduction
A very important aspect of the creation, organization and provision of access to digital library collections is the software used for this purpose.
“A colleague from another department has talked to me about Greenstone, an open source suite of software for building digital library collections. Can you please collect some information about it?”
Greenstone is a freely available suite of software for building and distributing digital library collections. It
provides a new way of organizing information and publishing it on the Internet or on CD-ROM.
Greenstone is open-source software, issued under the terms of the GNU General Public License.
The aim of the software is to empower users, particularly in universities, libraries and other public service
institutions, to build their own digital libraries.
Greenstone overview
Multiplatform availability
Greenstone is available for various operating system platforms, including Windows (all versions),
Linux, Sun Solaris, and Mac OS X.
It is available in both binary (executable) and source code form for the Windows (all versions), Linux,
and Mac OS X operating systems, and in source code form only for other (Unix) operating systems.
Collection building
Greenstone supports a variety of interfaces for collection building.
Powerful Indexing
Greenstone can build indexes from full text documents and also metadata associated with
these documents. It supports creation of indexes for various metadata fields, either automatically
extracted or manually assigned.
File formats
Greenstone supports different file formats, such as HTML, PDF, DOC, RTF, E-mail, Plain text, PPT
etc. These file formats are converted into a standard XML-based internal format for indexing using
‘plugins’.
New plugins can be developed for file formats not supported by Greenstone. Greenstone allows
you to configure a collection to customize the interface, indexing, browsing and presentation
features according to your requirements.
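The plugin idea can be illustrated with a minimal sketch. It is not Greenstone's actual plugin code: the registry, the function names and the XML element names (Document, Metadata, Content) are all hypothetical and only show how one converter per file format can feed a single XML-based internal format.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical registry: file extension -> converter function.
PLUGINS = {}


def plugin(*extensions):
    """Register a converter for one or more file extensions."""
    def register(func):
        for ext in extensions:
            PLUGINS[ext.lower()] = func
        return func
    return register


def to_internal_xml(title, text):
    """Build a tiny XML document (element names are illustrative only)."""
    doc = ET.Element("Document")
    ET.SubElement(doc, "Metadata", name="Title").text = title
    ET.SubElement(doc, "Content").text = text
    return ET.tostring(doc, encoding="unicode")


@plugin(".txt")
def plain_text_plugin(name, raw):
    # Use the first line as the title, falling back to the file name.
    lines = raw.splitlines()
    title = lines[0].strip() if lines else name
    return to_internal_xml(title, raw)


@plugin(".html", ".htm")
def html_plugin(name, raw):
    # Take the <title> element as the title and strip tags from the text.
    match = re.search(r"<title>(.*?)</title>", raw, re.I | re.S)
    title = match.group(1).strip() if match else name
    text = re.sub(r"<[^>]+>", " ", raw)
    return to_internal_xml(title, " ".join(text.split()))


def convert(name, raw):
    """Dispatch a document to the converter registered for its extension."""
    ext = Path(name).suffix.lower()
    if ext not in PLUGINS:
        raise ValueError(f"no plugin registered for {ext!r}")
    return PLUGINS[ext](name, raw)
```

Supporting a new file format in this sketch means registering one more converter function; the rest of the pipeline only ever sees the internal XML.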
Greenstone lets you build collections of non-textual multimedia documents such as audio, video,
and pictures accompanied by textual description or metadata to allow searching and browsing.
Unicode, an encoding standard for representing a large number of language scripts, is used
throughout Greenstone. This facilitates building, searching and browsing documents in any
Unicode-compliant language.
In order to have a more precise idea of what type of collections can be built with Greenstone, the team
examines some of the various digital library collections that have been developed using
Greenstone. Such collections have been developed around the world, in several languages and in various
domains, including historical, educational, cultural, and research.
[Figure: example collections built with Greenstone: New Zealand Digital Library; Mari El Republic government information (Russia); Archives of Indian Labour (India); Lehigh University Library]
Here are some features the team has found about digital library collections built using Greenstone:
“A variety of collections can be built using Greenstone! This is clear from the features we have identified. I think we should summarize them in a short report...”
Can you help the team to list the features supported by Greenstone?
Greenstone’s installation is guided, but you may want to have more detailed instructions.
Detailed information and step-by-step instructions for installing Greenstone on Windows and Linux
are available in the Greenstone Digital Library Installer’s Guide (2.50). You can find this
document in the Greenstone Digital Library Documentation section (Resources-software&tools-Greenstone).
Greenstone installation
Before installing the software, make sure you meet all the hardware and software requirements.
Storage requirements:
Software:
Java Run-time Environment (JRE) version 1.4 or above. JRE is required for the Librarian Interface (GLI); install it before installing Greenstone (GSDL).
Web server (Apache recommended). Not required for the default Windows installation.
Perl, which is installed automatically.
C++ compiler (Visual Studio or GCC), only if you wish to compile the source code.
Web browser (Internet Explorer 4 and above, or Netscape 4.5 and above).
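The JRE requirement can be checked before installation with a small script. The following is a minimal sketch, assuming that the java command is on the path and that it reports its version in the usual banner form (for example java version "1.4.2_05"); the function names are hypothetical.

```python
import re
import subprocess

MIN_JAVA = (1, 4)  # minimum JRE version stated in the requirements above


def parse_java_version(banner):
    """Extract (major, minor) from a `java -version` banner, or None."""
    if not banner:
        return None
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2) or 0))


def meets_requirement(banner, minimum=MIN_JAVA):
    """True if the banner reports a version at or above the minimum."""
    version = parse_java_version(banner)
    return version is not None and version >= minimum


def installed_java_banner():
    """Return the banner printed by `java -version`, or None if Java is absent."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None
    # `java -version` writes its banner to standard error.
    return result.stderr or result.stdout
```

Calling meets_requirement(installed_java_banner()) on the target machine gives a quick yes/no answer before the Greenstone installer is started.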
When you install Greenstone, you are asked to select one of the four available types of setup: custom, source code, Local Library and Web Library.
For Windows and Linux, installing from binaries (not source code) is recommended, unless you
want to work at the source code level, which requires compilation.
Note that collection building is not supported in Windows 3.x.
Since Linux is the primary operating system under which Greenstone has been developed, it is more
robust and efficient on this platform. You may prefer the Linux platform if you have the required
support for installation and maintenance.
For Windows, if you do not feel comfortable configuring a suitable web server (e.g. Apache or Internet
Information Server), we recommend that you use the Local Library setup (this setup is not available
on other platforms). This setup is easy to install and comes with a built-in web server. We
recommend it over the Web Library setup unless:
a) you expect a large number of simultaneous users, or
b) you are already using a web server to serve other content, or
c) you need to offer a 24/7 service that starts automatically after a reboot.
To serve a large number of simultaneous users and provide 24/7 services on the Internet, you should
opt for the Web Library setup (all platforms).
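The choice between the two setups can be summarized as a simple decision rule. The sketch below only encodes the recommendations given above; the function and parameter names are hypothetical and are not part of the Greenstone installer.

```python
def recommend_setup(platform, many_simultaneous_users=False,
                    existing_web_server=False, needs_24x7=False):
    """Suggest 'Local Library' or 'Web Library' following the guidance above."""
    needs_web_library = (
        many_simultaneous_users or existing_web_server or needs_24x7
    )
    # The Local Library setup is described as available on Windows only.
    if platform.lower() != "windows" or needs_web_library:
        return "Web Library"
    return "Local Library"


print(recommend_setup("Windows"))                   # Local Library
print(recommend_setup("Windows", needs_24x7=True))  # Web Library
print(recommend_setup("Linux"))                     # Web Library
```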
An end-user accesses a Greenstone digital library collection through its user interface. Before
building your collection, it’s very important to understand how Greenstone supports various features
in the user interface.
END-USER INTERFACE FEATURES
• Collection Searching
• Document Browsing
• Presentation of Search Results
• Multilingual Support
Although the user interfaces of different Greenstone collections may appear remarkably similar, each one can provide varying search, browse and display features, depending on access requirements, the nature of the documents comprising the collection and the metadata associated with these documents.
As a digital library developer you can define the desired end-user interface features for your collection.
Greenstone supports different ways of searching collections. They can be grouped in two main
categories: “plain search” (through Google-like single search box) and “form-based search”.
Collection Searching
PLAIN SEARCH
Simple: Users can search for words or phrases in the full text of the document, or limit the search to a specific index (e.g. document title or author) by selecting the available index from the drop-down box.
Advanced: Boolean queries.
FORM-BASED SEARCH
Simple: Users can search for words or phrases across different fields.
Advanced: Users can search for words or phrases across different fields, with support for Boolean query combination, case folding and stemming.
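To see what case folding, stemming and Boolean combination mean in practice, consider the following minimal sketch. It is not Greenstone's search engine: the suffix-stripping stemmer is deliberately crude and all function names are hypothetical.

```python
import re


def crude_stem(term):
    """Strip a few common English suffixes (illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[:-len(suffix)]
    return term


def normalize(term, casefold=True, stem=True):
    """Apply the selected options to a single word."""
    if casefold:
        term = term.casefold()  # "Building" and "building" become equal
    if stem:
        term = crude_stem(term)  # "building" and "build" become equal
    return term


def matches(query, text, casefold=True, stem=True):
    """True if any word of the text equals the query after normalization."""
    target = normalize(query, casefold, stem)
    words = re.findall(r"\w+", text)
    return any(normalize(word, casefold, stem) == target for word in words)


def matches_all(terms, text, **options):
    """Boolean AND: every term must match."""
    return all(matches(term, text, **options) for term in terms)


def matches_any(terms, text, **options):
    """Boolean OR: at least one term must match."""
    return any(matches(term, text, **options) for term in terms)
```

With both options switched on, the query "Building" matches the text "We build digital libraries"; switching off either case folding or stemming makes the same query fail.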
Document Browsing
Multilingual Support
Greenstone supports three approaches to collection building:
• Collector,
• Command line mode, and
• Librarian Interface.
Approach 1: Collector
How to use it
Collection building using the Collector involves the following steps:
1) Specify collection information: its name and associated information.
2) Specify source data: where the source data comes from.
3) Configure the collection: adjust the configuration options (advanced use).
4) Build the collection.
5) Access the collection.
More details
More details on using the Collector for building collections are given in the Greenstone Digital Library User’s Guide 2.50 (section 3.4). You can find this document in the Greenstone Digital Library Documentation section (Resources-software&tools-Greenstone).
Approach 2: Command line mode
Supported functions
Command line mode supports simple and advanced collection building.
More details
More details on using the Command line mode for building collections are given in the Greenstone Digital Library Developer’s Guide 2.50 (section 1, pages 1-25). You can find this document in the Greenstone Digital Library Documentation section.
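As a rough orientation, a command line build can be thought of as a short sequence of script calls. The sketch below only assembles such a sequence and does not run anything. The script names (mkcol.pl, import.pl, buildcol.pl) and the -creator option are given here as an assumption based on the Greenstone 2 Developer’s Guide and should be checked against the guide for your version; the collection name and e-mail address are hypothetical.

```python
def command_line_build_plan(collection, creator_email):
    """Return the assumed sequence of commands for building a collection.

    Nothing is executed; the script names and the -creator option are
    assumptions to be verified against the Developer's Guide.
    """
    return [
        # 1. Create the collection skeleton and its configuration file.
        ["mkcol.pl", "-creator", creator_email, collection],
        # 2. Import the source documents into the internal format.
        ["import.pl", collection],
        # 3. Build the indexes and browsing structures.
        ["buildcol.pl", collection],
    ]


for command in command_line_build_plan("demo", "librarian@example.org"):
    print(" ".join(command))
```

The source documents have to be placed in the collection before the import step, and the newly built indexes have to be put in place before the collection can be accessed; the Developer’s Guide describes both steps.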
Approach 3: Librarian Interface (GLI)
Supported functions
GLI is the most advanced approach to collection building and metadata management. It has excellent metadata management support and also supports the definition of custom metadata fields, if required.
“Now that we have chosen our approach, the next step is to start building our first collection!”
Before building a Greenstone digital library collection, source documents should be prepared carefully to obtain the desired access (search and browse) and display features.
If you wish to support full text searching or if you expect Greenstone to automatically extract metadata
(e.g. document title)…
This will not be possible if source documents are in image formats (e.g. image PDF and JPEG).
Ensure that the input documents are in a format from which Greenstone can extract text (e.g. Text
PDF, Word Doc or RTF, and HTML).
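A quick way to review a folder of source documents is to sort them by file extension. This is a minimal sketch under stated assumptions: the extension lists are illustrative (only image PDF and JPEG are named above as image formats), the function name is hypothetical, and a PDF always needs a manual check because its extension does not reveal whether it contains text or scanned images.

```python
from pathlib import Path

# Assumed extension lists, for illustration only.
TEXT_FORMATS = {".html", ".htm", ".doc", ".rtf", ".txt"}
IMAGE_FORMATS = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff"}


def full_text_readiness(filename):
    """Classify a source file as 'text', 'image', 'check' or 'unknown'."""
    ext = Path(filename).suffix.lower()
    if ext in TEXT_FORMATS:
        return "text"    # text can be extracted for full text searching
    if ext == ".pdf":
        return "check"   # text PDF is fine, image PDF is not
    if ext in IMAGE_FORMATS:
        return "image"   # no text to extract; rely on assigned metadata
    return "unknown"


for name in ["report.doc", "scan.jpeg", "paper.pdf"]:
    print(name, full_text_readiness(name))
```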
If you expect Greenstone to correctly extract the document ‘Title’ metadata automatically…
It will be useful to assign correct document properties, particularly for Microsoft Word documents.
The first readable text line in the PDF is likely to be used as a document title – ensure that this is
appropriate. Similarly, ensure that HTML documents have meaningful document titles. However, if
you are explicitly assigning the document title as a metadata field and using this for indexing, these
precautions are not necessary.
If you find that some Word and PDF documents are not getting converted and indexed properly by
Greenstone…
We suggest that you find an alternative way of converting these to HTML and using them for collection
building. Greenstone uses third party software for converting Word and PDF documents and since there
are so many versions of these documents, conversion may sometimes fail.
If you want to support section level searching or hierarchical browsing of large documents…
You should incorporate appropriate full text tags in the text document. You will find more details of
full text tagging in the following documents:
Tagging Document Files, Section 3.3, in Greenstone Digital Library User’s Guide 2.50;
Librarian Interface: adding and using metadata (gsdl-4-GLI2.pdf) in Greenstone training
workshop material.
You can find these documents in the Greenstone Digital Library Documentation section (Resources-
software&tools-Greenstone).
[Figure labels: Encoding Schemes (cataloguing/vocabulary control schemes); Metadata Scheme]
Adoption of Dublin Core should be considered seriously since Greenstone has internal support for this standard.
Decisions regarding the encoding schemes (cataloguing rules, vocabulary control schemes) to be used and the formats for rendering the content of each field are also important.
These decisions lie outside the digital library software, but form a very important part of the collection building process. The metadata element set and the cataloguing/vocabulary control schemes together comprise the “metadata scheme” (or “metadata specification”) for the collection and should be adhered to during collection building.
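Adhering to a metadata scheme can be checked mechanically. The sketch below assumes a scheme made of the fifteen Dublin Core elements plus an optional controlled vocabulary per element; the function name, the sample record and the sample vocabulary are hypothetical.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
}


def validate_record(record, vocabularies=None):
    """Return a list of problems found in one metadata record.

    record: dict mapping element name to value.
    vocabularies: optional dict mapping element name to allowed values.
    """
    vocabularies = vocabularies or {}
    problems = []
    for element, value in record.items():
        if element not in DUBLIN_CORE_ELEMENTS:
            problems.append(f"unknown element: {element}")
        elif element in vocabularies and value not in vocabularies[element]:
            problems.append(f"{element}: {value!r} not in controlled vocabulary")
    return problems


# Hypothetical vocabulary and record, for illustration only.
vocab = {"Language": {"en", "fr", "es"}}
record = {"Title": "Annual report", "Language": "English", "Author": "A. Smith"}
for problem in validate_record(record, vocab):
    print(problem)
```

In this example the record is reported twice: "English" is not in the controlled vocabulary for Language, and "Author" is not a Dublin Core element (the corresponding element is Creator).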
Summary
Greenstone is a freely available suite of software for building and distributing digital
library collections on the Internet or on CD-ROM.
Digital library developers can benefit from the variety of features supported by
Greenstone, such as multiplatform availability, the capability of providing access in
different ways and managing different file formats, media and languages. Greenstone
also allows powerful indexing, search and browse methods.
When you install Greenstone, you can select the type of setup that best suits your
needs. Available types are: custom, source code, local library and web library setup.
The Librarian Interface provides the most advanced and at the same time a very
user friendly approach to collection building and also metadata management.
The following five exercises will help you test your understanding of the concepts covered in
the lesson and provide you with feedback.
Good luck!
Exercise 1
For which of the following operating platforms is Greenstone software available in binary
(executable) form?
Windows
Unix
Linux
Macintosh
Exercise 2
Greenstone uses a standard character encoding format for storing document content in order to
support multiple languages.
Exercise 3
When installing Greenstone in the Windows operating system, the default setup type is…
Exercise 4
Look at the Greenstone end-user interface below. What kind of collection searching does it provide?
Exercise 5
Which one of the following collection building approaches supports metadata management?
Collector
Librarian Interface (GLI)
Command line mode
Online Resources:
Additional Reading:
Ian H. Witten and David Bainbridge. 2003. How to Build a Digital Library.
Morgan Kaufmann Publishers.