Presentation On: Library Building

This document presents a new framework for building digital library collections called "Greenstone 3". It describes how collections are configured using XML files and how documents are represented using METS. It explains the multi-phase process for building collections, including expansion, recognition, encoding, extraction, classification, indexing, and validation. Key improvements over the previous Greenstone system include support for new open standards and more flexible handling of document formats and metadata.

Uploaded by

Aashik Jayswal

SAM HIGGINBOTTOM UNIVERSITY OF AGRICULTURE,
TECHNOLOGY AND SCIENCES, Allahabad, U.P.-211007

PRESENTATION ON: LIBRARY BUILDING


SUBMITTED TO: Dr. M. Srivastava
SUBMITTED BY: Ansh Singh (19MSHVS034)
M. Sc Horticulture (Vegetables Science), SEM: 1ST
Abstract
• This paper introduces a new framework for building digital library
collections and contrasts it with existing systems.
• It describes a radical new step in the development of a widely-used
open-source digital library system, Greenstone, which has evolved
over many years.
• It is supported by a fresh implementation, which forced us to
rethink the entire design rather than making incremental
improvements.
• The redesign capitalizes on the best ideas from the existing system,
which have been refined and developed to open new avenues
through which users can tailor their collections.
• We demonstrate its flexibility by showing how digital library
collections can be extended and altered to satisfy new requirements.
Introduction

• The Greenstone digital library software provides a wide range of tools
for building digital library collections. Based on our own extensive
and varied experience, and that of others, we have designed a new
framework for building digital library collections. We call it
“Greenstone 3” to distinguish it from the earlier system, “Greenstone
2”.
• The new framework is supported by a fresh implementation that is
completely independent of the existing one. It capitalizes on the best
ideas from the existing system, which we have further refined and
developed to open new avenues through which users can tailor their
digital library collections. Several pertinent new open standards have
emerged since the original design many years ago, and a key objective
is to incorporate them into the new design.
• The trend towards increasingly open, flexible architectures can
be traced in the development of digital library protocols. It has
provided a simple baseline for metadata access, and
subsequent work strives to build component-based, modular
protocols upon it.
• The METS document framework provides an open, extensible
system for representing documents in digital repositories.
Greenstone 3 adopts the same approach for a range of digital
library functions, as this paper demonstrates.
• Set against the move towards standard protocols, today’s
digital library systems must confront an increasing range of
document formats and media, architectural designs for
browsing and classification, indexing requirements, and user
interface techniques.
LIBRARY BUILDING
Collection Configuration

• Collections are designed individually, and the structure of a
collection is encapsulated in an XML file called the “collection
configuration file.” Its contents include build-time configuration
options such as:
• The document types that the collection should recognise.
• The metadata access structures or “classifiers” that are to be
provided for users to browse the collection, for example by title, A–Z.
• The full-text indexes to build for searching the collection.
• The configuration file also contains run-time information about
the collection, for example display options.
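To make the four kinds of setting listed above concrete, here is a minimal sketch in Python. The slides do not reproduce the file's real schema, so every element and attribute name below (collectionConfig, recognise, classifier, and so on) is invented for illustration and should not be read as Greenstone 3's actual format.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: every element and attribute name below is
# invented for illustration and is not Greenstone 3's real schema.
EXAMPLE_CONFIG = """\
<collectionConfig>
  <recognise>
    <documentType name="HTML"/>
    <documentType name="PDF"/>
  </recognise>
  <browse>
    <classifier name="AZList" metadata="Title"/>
  </browse>
  <search>
    <index name="text"/>
    <index name="Title"/>
  </search>
  <display>
    <option name="resultsPerPage" value="20"/>
  </display>
</collectionConfig>
"""


def summarise_config(xml_text):
    """Return the build-time and run-time settings as plain Python data."""
    root = ET.fromstring(xml_text)
    return {
        # Build-time: document types, classifiers, full-text indexes.
        "document_types": [e.get("name") for e in root.findall("./recognise/documentType")],
        "classifiers": [(e.get("name"), e.get("metadata")) for e in root.findall("./browse/classifier")],
        "indexes": [e.get("name") for e in root.findall("./search/index")],
        # Run-time: display options.
        "display": {e.get("name"): e.get("value") for e in root.findall("./display/option")},
    }


summary = summarise_config(EXAMPLE_CONFIG)
```

Keeping build-time and run-time settings in one file means a single document describes the whole collection, which is the property the slide emphasises.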
[Figure: A collection configuration file]
METS and Document Representation
• When a document first enters the library, it is given
a unique identifier. That identifier will remain with
the document throughout any subsequent revisions,
and is recorded within the METS framework.

• Having described the configuration controls and
document representation that underpin the new
collection-building architecture, we now describe
the building process itself.
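Before turning to the build phases, the identifier behaviour described above can be sketched briefly. The METS namespace and the OBJID and LABEL attributes of the root element come from the METS standard; the use of a random UUID and the helper names new_mets_wrapper and revise are assumptions made for this illustration. The element produced here is deliberately minimal and is not a schema-valid METS document.

```python
import uuid
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"  # namespace of the METS schema
ET.register_namespace("mets", METS_NS)


def new_mets_wrapper(label):
    """Create a minimal METS root for a document entering the library."""
    root = ET.Element(f"{{{METS_NS}}}mets")
    # OBJID is the METS attribute for the object's primary identifier.
    # Generating it as a random UUID is an assumption of this sketch.
    root.set("OBJID", str(uuid.uuid4()))
    root.set("LABEL", label)
    return root


def revise(root, new_label):
    """Record a revision: the label changes, the identifier does not."""
    root.set("LABEL", new_label)
    return root


doc = new_mets_wrapper("Draft essay")
original_id = doc.get("OBJID")
revise(doc, "Final essay")
```

After the revision, doc.get("OBJID") still equals original_id, which is the stability property the slide describes.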
Building Digital Library Collections

In our new architecture, collections are built in several
distinct phases, which occur in sequence.
• Expansion. Compressed files such as Zip archives are
expanded, and links to web sites are expanded into lists
of constituent web pages.
• Recognition. All files are sent to the Recognition
Manager, which identifies groups of files as documents.
• Encoding. All the recognized documents are
catalogued for the subsequent phases of building.
• Extraction. Every document is passed through extractors,
which use special processing algorithms to extract
information from the document (e.g. title, key phrases) or
add metadata stored in special files.
• Classification. Documents are assigned to classifiers (e.g.
topical classifiers, ordered list classifiers) depending on their
inherent and extracted metadata.
• Indexation. Documents are sent to indexers to build indexes
that support later searching.
• Validation. Post-building checks are carried out on the
collection as a whole, and on its constituent documents.
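The sequence of phases above can be sketched as a chain of functions, each consuming the output of the one before. This is a toy model under stated assumptions: the Document class, the archive listing, and the rules used in each phase (one HTML file per document, title taken from the file name) are all invented here and are not taken from Greenstone 3.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    files: list
    metadata: dict = field(default_factory=dict)


def expand(paths):
    # Expansion: stand-in that "unpacks" a hypothetical archive listing.
    archive_contents = {"essays.zip": ["a.html", "b.html"]}
    out = []
    for p in paths:
        out.extend(archive_contents.get(p, [p]))
    return out


def recognise(files):
    # Recognition: here each HTML file is treated as one document.
    return [Document(files=[f]) for f in files if f.endswith(".html")]


def encode(docs):
    # Encoding: catalogue each document with a sequential identifier.
    for i, d in enumerate(docs, start=1):
        d.metadata["id"] = f"doc{i}"
    return docs


def extract(docs):
    # Extraction: derive a title from the file name (toy extractor).
    for d in docs:
        d.metadata["title"] = d.files[0].rsplit(".", 1)[0].upper()
    return docs


def classify(docs):
    # Classification: an ordered list of (title, id) pairs sorted by title.
    return sorted((d.metadata["title"], d.metadata["id"]) for d in docs)


def index(docs):
    # Indexation: map each lower-cased title to the documents that carry it.
    idx = {}
    for d in docs:
        idx.setdefault(d.metadata["title"].lower(), []).append(d.metadata["id"])
    return idx


def validate(docs):
    # Validation: check that every document was catalogued and titled.
    return all({"id", "title"} <= set(d.metadata) for d in docs)


def build(paths):
    docs = extract(encode(recognise(expand(paths))))
    return {"browse": classify(docs), "search": index(docs), "valid": validate(docs)}


result = build(["essays.zip", "notes.txt"])
```

In this run the archive expands to two HTML files, the text file is not recognised as a document, and the two remaining documents flow through every later phase.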
Extensibility: Managers and Plugins

• The central role of plugins in each phase should now be apparent.
However, we have made little mention of the structure of plugins
themselves, or how they connect to the core system.
• The architecture achieves extensibility through Plugins and
Managers. Each phase of the building process is controlled by a
Manager, e.g. the Recognition Manager or the Indexer Manager.
• Managers are configured through the collection configuration
file when the building process starts. In some cases further
configuration occurs when special files, such as metadata-only
files, are found during building. Each manager coordinates the
plugins for its phase of the build cycle.
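One way to picture the Manager and Plugin relationship is the sketch below. It assumes that plugins are simple callables and that the parsed configuration is a dictionary from phase name to plugin names; the class, the registry, and both plugins are hypothetical and do not reproduce Greenstone 3's actual interfaces.

```python
class Manager:
    """Coordinates the plugins registered for one build phase."""

    def __init__(self, phase):
        self.phase = phase
        self.plugins = []

    def configure(self, plugin_names, registry):
        # Load only the plugins the configuration names for this phase.
        self.plugins = [registry[name] for name in plugin_names]

    def run(self, item):
        # Pass the item through each plugin in configured order.
        for plugin in self.plugins:
            item = plugin(item)
        return item


# Hypothetical plugins: plain functions that take and return a dict.
def title_extractor(doc):
    return {**doc, "title": doc["text"].splitlines()[0]}


def word_counter(doc):
    return {**doc, "words": len(doc["text"].split())}


REGISTRY = {"TitleExtractor": title_extractor, "WordCounter": word_counter}

# Stand-in for the parsed collection configuration file.
CONFIG = {"extraction": ["TitleExtractor", "WordCounter"]}

extraction_manager = Manager("extraction")
extraction_manager.configure(CONFIG["extraction"], REGISTRY)
processed = extraction_manager.run({"text": "My Story\nOnce upon a time"})
```

Adding a new capability then means registering one more plugin and naming it in the configuration, without touching the manager itself.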
Advance over earlier work

• The architecture that we have described
capitalizes on lessons learned from the existing
Greenstone digital library system, and
incorporates some very significant improvements.
• At the time the earlier system was designed
(1998) several important open standards did not
exist, or were available only in draft form. For
example, the METS Document Framework, a key
component of the new architecture, did not yet exist.
Architecture
• In Greenstone 2, collections are constructed in two
phases: importing and building. The first parallels the
Expansion, Recognition and Encoding phases of
Greenstone 3, while the second mirrors the
Extraction, Classification and Indexation phases.
Both phases use the same set of plugins, which are
listed in the collection’s configuration file and loaded
separately in each phase, even though some
plugins pertain to only one phase.
Plugin details

• Document plugins handle Encoding very
differently in the new design. Originally, all
documents were encoded into the standard
Greenstone Archive Format as soon as they
were encountered. This duplicated content,
and could lose, or render inaccessible, some
information in the original file. The benefit
was that all documents were presented to
subsequent phases in a standardized format.
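The trade-off described above can be illustrated with a small contrast. The first function mimics early conversion into a standard record, greatly simplified. The second shows one possible alternative, cataloguing the file by reference, which is an assumption of this sketch since the slide does not spell out the new mechanism. Both function names and the record fields are hypothetical.

```python
def encode_by_conversion(original):
    """Earlier approach, simplified: copy the text into a standard record.

    Anything the standard record has no field for (here, 'layout') is lost.
    """
    return {"format": "standard", "text": original["text"]}


def encode_by_reference(path):
    """Assumed alternative: catalogue the file and leave it untouched."""
    return {"format": "wrapper", "source": path}


original = {"text": "Hello", "layout": "two-column"}
converted = encode_by_conversion(original)
wrapped = encode_by_reference("import/hello.pdf")
```

The converted record holds a second copy of the text and no longer carries the layout field, while the wrapper only points at the source file.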
Example: The Kids Digital Library

• We briefly present an actual digital library that was difficult to
accommodate within the earlier design, and show how it benefits
from the new architecture.
• In the Kids Digital Library each document can belong to several
collections. Some collections are private (e.g. a child’s own
documents), others public. Some documents are unchanging
(accepted final essays); others are under continual revision by
a restricted group of users. Students can annotate the work of
others, and teachers provide feedback too.
• The Kids Digital Library featured some unusual browsing
classifiers such as the “Top ten” and “Latest ten” stories. These
require simple support for feature extraction.
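A minimal sketch of such classifiers follows. It assumes that "Top ten" ranks stories by a read count and "Latest ten" by the date a story was added; the slide does not define either ordering, and the sample data and the field names reads and added are invented.

```python
from datetime import date
import heapq

# Invented sample data; the field names "reads" and "added" are assumptions.
STORIES = [
    {"title": "The Lost Kite", "reads": 42, "added": date(2004, 3, 1)},
    {"title": "My Dog Max", "reads": 97, "added": date(2004, 5, 12)},
    {"title": "Rainy Day", "reads": 15, "added": date(2004, 6, 30)},
]


def top_n(stories, n=10):
    """Most-read stories first."""
    return [s["title"] for s in heapq.nlargest(n, stories, key=lambda s: s["reads"])]


def latest_n(stories, n=10):
    """Most recently added stories first."""
    return [s["title"] for s in heapq.nlargest(n, stories, key=lambda s: s["added"])]


top_two = top_n(STORIES, n=2)
latest_two = latest_n(STORIES, n=2)
```

Both classifiers depend only on one extracted value per document, which is why simple feature extraction is enough to support them.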
Comparison with other Digital Library Systems

• Cheshire II [3] emphasizes the construction of digital
libraries from original scanned documents.
• The process described for collection building reveals a
system of fixed metadata fields and strict control of the
format in which documents are presented to the system.
• Nowhere is support for feature extraction, expansion of
compressed files, or novel indexes described.
• CoRR is built on the NCSTRL software. The CoRR
documentation reports that the system requires documents to
be submitted in a standard source format. No documentation
on CoRR or NCSTRL describes the parameters for
collection configuration.
Conclusion

• In the new collection building architecture we have
described, the building process is segmented into a
number of distinct phases. Once documents are
identified, they are encoded into a flexible, open
framework (METS) and are passed in that form to the
succeeding phases of the build process.
• Within each phase, the elements are componentized to
support greater portability and simpler development. The
build process is configured through a simple XML format
file which is readily extensible for future components.
References

• J. R. Davis and C. Lagoze. NCSTRL: Design and deployment of a
globally distributed digital library. Journal of the American Society for
Information Science, 51(3):273–280, 2000.
• Free Software Foundation. GNU make Manual (version 3.80), 2002.
• R. R. Larson and C. Carson. Information access for a digital library:
Cheshire II and the Berkeley environmental digital library. In
Proceedings ASIS ’99, pages 515–535. Information Today, 1999.
• Library of Congress. Metadata Encoding and Transmission Standard
(METS).
• G. W. Paynter, I. H. Witten, S. J. Cunningham, and G. Buchanan.
Scalable browsing for large collections: A case study. In Proceedings of
the Fifth ACM International Conference on Digital Libraries, pages
215–218, June 2000.
• C. Sperberg-McQueen and L. Burnard, editors. Guidelines for
Electronic Text Encoding and Interchange. TEI P3 Text Encoding
Initiative, Oxford, 1999.
• H. Suleman and E. A. Fox. Designing protocols in support of digital
library componentization. In Proceedings of the 6th European
Conference on Research and Advanced Technology for Digital
Libraries, pages 568–582. Springer-Verlag, 2002.
• Y. L. Theng, N. Mohd-Nasir, G. Buchanan, B. Fields, H. Thimbleby,
and N. Cassidy. Dynamic digital libraries for children. In
Proceedings of the first ACM/IEEE-CS joint conference on Digital
libraries, pages 406–415. ACM Press, 2001.
• I. H. Witten. Examples of practical digital libraries: collections built
internationally using Greenstone. D-Lib Magazine, 9(3), 2003.
• I. H. Witten and D. Bainbridge. How to build a digital library. Morgan
Kaufmann, San Francisco, CA, 2003.
• I. H. Witten, D. Bainbridge, G. W. Paynter, and S. J. Boddie. Importing
documents and metadata into digital libraries: Requirements analysis and an
extensible architecture. In Proceedings of the European Conference on Digital
Libraries, pages 390–405, Sept. 2002.
• I. H. Witten, A. Moffat, and T. C. Bell. Managing gigabytes: compressing and
indexing documents and images. (second edition). Morgan Kaufmann, San
Francisco, CA., 1999.
• I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning.
KEA: Practical automatic keyphrase extraction. In ACM DL, pages 254–255,
1999.
THANK YOU
