Thesisfs: Online Document Management System: Joseph Christian G. Noel William Yu Pierre Tagle, PHD


ThesisFS: Online Document Management System

Joseph Christian G. Noel William Yu Pierre Tagle, PhD


Author Co-Author, Adviser Co-Author, Adviser

Department of Information Systems and Computer Science
Ateneo de Manila University
+(632) 426001

[email protected] [email protected] [email protected]


ABSTRACT
With storage prices falling and capacity ever increasing, the problem of how or where to store files and documents has largely been solved for normal users. Indeed, with the increasing number of files a user stores, the main problem now is the efficient and effective management of those files. By "management", we mean a system which enables easy access, organization and retrieval of the files the user keeps, and the ability to perform certain functions automatically. The authors aim to build a prototype of such a document management system, ThesisFS. ThesisFS will present all the basic functionalities of a web-based filesystem, and will have additional document management features such as intelligent document searching called Search Folders, automated indexing and tagging called Smart Indexing, and automated user-defined actions called Action Folders.

Keywords
Web, Filesystems, Document Management, Searching

1. INTRODUCTION
1.1 Background
Online Document Management Systems provide a web interface for accessing one's files and folders. This is useful because it provides cross-platform benefits, where the only requirement to access the filesystem is a capable browser. If placed on the Internet, it also has the added benefit of making the user's files instantly portable, accessible wherever there is an Internet connection. With the increasing number of computers and other electronic devices a person uses that are able to connect to the Internet, there is a greater need to be able to access your files with any electronic device you might have at the moment. Another benefit of Online Document Management Systems is that when done correctly they may provide an easy means for online collaboration between separate groups of people.

1.2 Statement of the Problem
With the increasing number of files a user has, it keeps getting harder and more complicated to keep track of all of them. Organizing them into folders isn't enough, as a file may need to be categorized into multiple groups. Thus the need for efficient and effective searching functionalities in filesystems. However, current web-based filesystems have very rudimentary search functionalities available, if at all. This makes web-based filesystems currently unsuitable for storing large numbers of assorted files. In addition, a user may need to perform certain actions across multiple files. Current web-based filesystems lack sufficient functionalities to enable automated actions for files. If a user has more than a few files he wants to perform an action on, this can get pretty tedious if done manually and separately on each file by downloading and re-uploading.

1.3 Scope and Limitations
The main thrust of the thesis is to create an Online Document Management System with metadata and content indexing and searching functionality and support for automated actions through Action Folders. It will not have advanced filesystem features like online editing and saving of files.

1.4 Significance of the Study
Document Management is an ever increasing need in today's connected world. People are accumulating enormous numbers of files and documents over the course of their day to day living. In fact, in both of the next-generation Operating Systems, Apple's Tiger[1][2] and Microsoft's Longhorn[3][4], management of files and documents is one of the key selling features. This study will show what is possible when document management is taken online. While the system implemented here is raw and far from being a finished work, it gives a good place to start on further developments of Online Document Management Systems.

2. DOCUMENT MANAGEMENT
Document management as a technology and a discipline has traditionally augmented the capabilities of a computer's file system. By enabling users to characterize their documents, which are usually stored in files, document management systems enable users to store, retrieve, and use their documents more easily and powerfully than they can do within the file system itself. [5]

2.1 Features
The main purpose of document management should be to help store and retrieve the user's files. Therefore all document management systems should have[6]:

1. Scalability
Scalability means the system will still be useful when large numbers of document collections are moved to it. Scalability doesn't just mean that a system won't crash or slow down given a large enough number of files, but also that the user will be provided with enough tools to effectively manage a very large collection of documents. For scalability ThesisFS will have dynamic database searches so that the user can still quickly find his files among a large collection.

2. Compliant
Compliancy entails being able to meet and adapt to new requirements of the user as they come. In ThesisFS the user can
define his own Action classes which can be attached to Action Folders. As the user's workflow requirements change, more Action classes can be defined or edited to meet those emerging standards.

3. Accessibility
Remote access means being able to access files and documents offsite, anywhere. The main goal of this feature is document portability and availability. ThesisFS will be web-based for easiest accessibility. If put on a public web server it can be accessed anywhere with an Internet connection and a web browser.

4. Dynamic
Search functionalities are needed to help the user find specific files quickly. This feature goes hand in hand with scalability as it enables the productivity of the user to scale as well with the number of documents in the collection. ThesisFS will have multiple options for conducting searches: metadata and content search, Labels, and Search Folders.

3. REVIEW OF RELATED SYSTEMS

3.1 Yahoo! Briefcase
Yahoo! Briefcase is the web-based file storage system by Yahoo! Inc. It supports the creation of folders for grouping files, and Premium Service users have the option of publicly sharing selected folders.

Yahoo! Briefcase features remote access, but it's not entirely scalable as it doesn't give users many tools for management when the number of files uploaded becomes extremely large. The user also isn't able to define new features or functions so there's the danger that the system might not meet future requirements and demands. Lastly, the system doesn't have any database search functionality for quick accessing of files.

3.2 Apple iDisk
“Keep important files with you, even when your Mac isn't”
With iDisk, you can store files on Apple's servers and access them online from any Internet-connected computer, Mac or Windows [7]

iDisk is the online file-storage system by Apple Computer, Inc., as part of its .Mac package of Internet services. It supports the creation of folders for grouping files, and a public folder for making files available online. Rather than being a web-based system, it uses the WebDAV protocol for deployment. Accessing it requires a Mac OS X computer, which has built-in iDisk WebDAV support, or the iDisk Utility for Windows XP. A big feature of iDisk, at least on Apple computers, is its integration with Mac OS X.

iDisk is scalable, allows for remote access, and has integrated search functionalities with Mac OS X. However, the user isn't able to edit or create new functionalities and actions so it might not meet future requirements for document management.

3.3 FileNet Content Manager
FileNet Content Manager provides control, access, and sharing of content in a secure and highly scalable environment. Content Manager integrates easily with existing desktop and business applications for easy collaboration, and it manages all events in the content lifecycle, from creation to revision, approval to archival.[8]

FileNet Content Manager is a high-end content managing platform geared more towards big enterprises. It is scalable, and it has advanced search capabilities for helping the user find specific files and documents.[9] One thing it lacks is the ability to modify functionalities in order to meet future requirements and specifications as they arrive.

4. THESISFS

Figure 1: Use Case of ThesisFS

4.1 Filesystem

Figure 2: Main directory view
The filesystem will be web-based to allow for greater accessibility. Thus remote access is enabled simply by placing the system on a public web server. By using an industrial-strength database backend, ThesisFS will be able to handle enormous collections of documents and their respective metadata. The filesystem is a basic Unix filesystem. It has files, folders, users and groups, privileges, and permissions. The user can upload, download, and delete files, and can create and delete folders. The base directory of the filesystem is the root directory, symbolized by '/'. The user can be a member of any number of groups. Which groups the user is a member of has an effect on the privileges he has on files other than his own. Security is modeled on the security architecture of the basic Unix filesystem. Folders have 9 bytes reserved for privilege information, specifying the view, write, and delete privileges of its owner, group, and the world. Files have 6 bytes reserved for privilege information, specifying the view and delete privileges of its owner, group, and the world. Security is handled by the individual Servlets.

4.2 Smart Indexing
All files and metadata will be put in a relational database for easy and efficient retrieval of information. As soon as a file is uploaded, information such as file name, file type, file size, and date uploaded will be retrieved and included with the file in the file table of the database. Some information proprietary to a specific file format may also be retrieved along with its contents. These are parsed by each file format's specific indexer. File format specific metadata are added to the label table and the file contents are indexed in the index table. Because different file formats will require different parsing methods, file contents and file format specific metadata parsing are handled by each file format's respective indexer. Indexers are Java classes that implement the indexers.Indexer interface. Indexers are mapped to a specific file format in the indexers.xml file in the WEB-INF folder of the thesisfs package.

4.3 Labels
Labels are user-defined metadata that are attached to individual files. A Label has two strings, a key and a value associated with the key. Files can have an unlimited number of Labels, although each key should be unique for a specific file. During searching, each search token is matched against the value string of a label, with each match increasing the search score of the label's file.

Figure 3: Adding Labels to a file.

4.3 Searching
Searching will be in a text field available on top of every page. Searches can be limited to the user's files, all files, or the specific folder. Search results will be returned in an ordered list, with the most probable file on top. Searching can also be limited to specific files.

Figure 4: Returned files during a search

4.3.1 Search Folders
Search Folders are special folders with search criteria already linked to them. Every time the user accesses a Search Folder, the system will do a search for files that match the linked search criteria. Thus, the contents of a Search Folder will always be dynamically updated depending on what files are in the system and which match the linked search criteria. This makes Search Folders useful for tasks like data mining. Despite the Folder tag, Search Folders do not have most of the functionalities of normal folders. Files cannot be uploaded into a Search Folder, for example. In fact, in the current implementation, Search Folders are just a link that takes you to the search functionality of the system with the search criteria automatically entered.

It is the sum of these functionalities, Smart Indexing, Labels, metadata and content search, and Search Folders, that helps the user scale with the large number of files that may be collected. By helping the user find specific files quickly, the user is kept productive and saved from being bogged down by having to find a single file among many.

Figure 5: Attaching actions to a folder

4.4 Action Folders
Automated Actions are implemented through Action Folders. Action Folders are useful for managing workflow by making it easy to perform multiple actions on numerous files at once.
Action Folders are folders with a specific action attached. Actions are mapped and listed in the actions.xml file in the project's WEB-INF directory. Actions are defined and implemented as Java classes in the action package and should implement the action.Action interface. Each time a file is uploaded to an Action Folder, the appropriate Action class is instantiated and made to run on the uploaded file. Built-in Action classes will include Zip, Unzip, and Delete.

By defining their own Action classes when the need arises, users can adapt the system to their needs and meet emerging requirements for workflows and document management.

4.5 Usability
The system is meant to be used like any other web-based filesystem out there. There will be clear links for downloading files and entering folders, and HTML forms for uploads and searches. Users are encouraged to create Search Folders and Action Folders as much as possible to help ease their job of managing their files. Indeed, the ease with which Search Folders and Action Folders can be created is designed to facilitate this. Proper use of Search Folders and Action Folders allows the user to be more dynamic and gives him more options.

To summarize, the key components of any document management system are effective indexing and searching functionality, and a way to automate workflow for easy management and collaboration. ThesisFS has content and metadata indexing and searching, and even allows users to add their own metadata through Labels. Searches will look through all these and return the relevant results back to the user. A key feature of ThesisFS is the ability to save those searches as Search Folders, for later access and for having folders with dynamic contents. Action Folders are folders with Actions attached to them. Actions are triggered whenever a file is moved into the folder. Actions can perform any operation upon a file and can be used to manage the workflow and free the user from doing repetitive tasks.

These are the main features of ThesisFS. Taken together, they can be a framework for later and more advanced systems doing document management.

5. IMPLEMENTATION DETAILS

5.1 Indexing
Indexing is done through the specific Indexer classes. All Indexer classes will have a method called contentIndex. This method takes as arguments the contents of the file, the file id, a java.sql.Statement object, and the username of the user. The Indexer is responsible for parsing the contents of the file and inserting it into the index table of the database. The index table has the following schema:

    file TEXT REFERENCES file(id),
    word TEXT,
    count INTEGER,

The index table records which words appear in a file and how many times each word appears. The Indexer class inserts data into the table through the java.sql.Statement argument in the contentIndex method. Indexers are mapped to a specific file format in the indexers.xml file in the WEB-INF folder of the thesisfs package. The entry for the content parser of a plaintext file is:

    <indexer>
        <mimetype>text/plain</mimetype>
        <class>indexers.Plain</class>
    </indexer>

To get the appropriate Indexer class for a specific file type, the getIndexer method of the indexer.IndexerUtil is called. This method takes a file type as an argument, parses the indexers.xml file with a SAX parser, and returns the Indexer for that file type. The complete code for parsing during uploads is:

    IndexerUtil indexerUtil = new IndexerUtil(context.getRealPath(indexerConfigPath));
    Indexer indexer = indexerUtil.getIndexer(bf.getType());
    if (indexer != null) {
        indexer.contentIndex((new String(bf.getFile())).toLowerCase(), id, statement, bf.getOwner());
    }

5.2 Searching
During searches, ThesisFS tokenizes the search query into words and tries to match each word to a file's name, Labels, and contents. A search score is given for every matched search word. Name matches are given 100 points, Label matches are given 50 points for every matching Label value, and content matches are given five points multiplied by the number of occurrences of every matching word. All files with a search score greater than zero will be sorted in descending order of search score and returned to the user in list form.

5.3 Action Folders
Action Folders are the authors' implementation of automated actions. Action classes are mapped to an Action Folder with the is_action and action columns of the folder table. Action classes are defined in the action.xml configuration file and implemented in the action package. A sample entry in the action.xml file, for the Zip action, is:

    <action>
        <name>Zip</name>
        <class>action.Zip</class>
    </action>

All Action classes should implement the Action interface and define the doAction method. doAction takes a beans.BinaryFile object as an argument. This beans.BinaryFile will contain all the information of a file for editing. To get the Action class for a specific Action Folder, the getAction method of the action.ActionUtil class is used. This method takes as an argument a String object specifying the name of the Action. In the Upload servlet, the code for activating an action when a file is uploaded to an Action Folder is:

    String actionName = result.getString("action");
    ServletContext context = getServletConfig().getServletContext();
    ActionUtil actionUtil = new ActionUtil(context.getRealPath(actionConfigPath));
    Action action = actionUtil.getAction(actionName);
    action.doAction(bf);

6. RESULTS
The system was run on an iBook G4 with an 800 MHz G4 processor and 640 MB of RAM. The operating system is Mac OS X 10.3.8. The application server used is JBoss 3.2.5, and the database server is PostgreSQL 7.3.

6.1 Comparison of Features

Feature                  DMS        Yahoo!   FileNet   iDisk   ThesisFS
Web access               Ideal      Yes      *         Yes     Yes
File system features     Required   Yes      Yes       Yes     Yes
Content Indexing         Required   No       Yes       No      Yes
Plug-in based Indexing   Ideal      No       *         No      Yes
Search features          Required   No       Yes       Yes     Yes
Search Folders           Ideal      No       No        No      Yes
Workflow                 Required   No       Yes       No      Yes

6.2 Searching
In a partially filled database with an even distribution of files and folders, searching for the keyword “doc” returned all the Word files in the database. The search itself took an average of 0.2134 seconds.

6.3 Action Folders
Given a folder with an action attached to it that automatically compresses files using the Zip algorithm, moving a 20 kB file to that folder resulted in the file being zipped in an average of 1.1598 seconds.

6.4 Content Indexing
As an example we indexed a plaintext version of the GNU General Public License. The file contains 2,971 words, consisting of 17,990 characters. For this particular file, content indexing using our own indexing algorithm took an average of 16.3519 seconds. Given a more efficient algorithm, the time it takes for indexing can be reduced dramatically.

7. CONCLUSIONS
When we started, the problem we aimed to solve was to help the user find files easily and free the user from doing tedious tasks that could be automated and avoided. Most online filesystems and document management systems focus on categorizing and indexing content. ThesisFS, on the other hand, goes further by allowing the user to save those searches, so that dynamic results can be retrieved at a later date. ThesisFS also implements the concept of Labels, basically user-defined metadata that can be added to files. This is on the premise that a user will almost certainly know best what information should be tagged along with the file for indexing. Lastly, ThesisFS provides Action Folders to help better manage the user's workflow. The possibilities for advanced Action classes are nearly limitless and are constrained only by the imagination and skill of the programmer. It is these features, advanced searching functionality, Search Folders, and Action Folders, that separate ThesisFS from other document management systems.

8. RECOMMENDATIONS
The authors recommend that further Indexer classes for different filetypes be created to increase the range of files the system can content index. Also, the authors further recommend that more Action classes be created supporting more advanced automated actions upon files.

Another way to improve the system is to integrate keywords into the search functionality, similar to Google. This can enable the users to do more advanced searches by constraining or expanding the search criteria as they see fit.

Lastly, we recommend adding more advanced filesystem features into the system. These features may include Access Control Lists, copying and moving commands, and online editing of the contents of files.

9. REFERENCES
[1] “Working with Spotlight”, Apple Developer Connection, <http://developer.apple.com/macosx/tiger/spotlight.html>

[2] Spotlight Technology Preview, Apple Computer, Inc., <http://images.apple.com/macosx/tiger/pdf/Spotlight_Tech_Preview_20050111.pdf>

[3] “A Developers Perspective on WinFS: Part 1”, Microsoft Developer's Network, <http://msdn.microsoft.com/data/WinFS/default.aspx?pull=/library/en-us/dnwinfsta/html/winfsdevpersp.asp>

[4] “A Developers Perspective on WinFS: Part 2”, Microsoft Developer's Network, <http://msdn.microsoft.com/data/WinFS/default.aspx?pull=/library/en-us/dnwinfsta/html/winfsdevpersppart2.asp>

[5] Freter, Todd, “XML: Document and Information Management”, Sun Microsystems, <http://www.sun.com/980908/xml/>

[6] “Real World User Requirements for Document Management”, Cimage Novasoft, <http://www.cimagenovasoft.com/products/whitepaper/wp%5Findustrial.htm>

[7] Apple iDisk Tour, Apple Computer Corp., <http://www.mac.com/1/iTour/tour_idisk.html>
[8] FileNet Content Manager, FileNet Corporation, <http://www.filenet.com/English/Products/Content_Manager/>

[9] FileNet Content Manager Brochure, FileNet Corporation, <http://www.filenet.com/English/Products/Datasheets/023250027.pdf>

[10] Giampaolo, Dominic, Practical File System Design with the Be File System, Morgan Kaufmann Publishers, Inc., CA (1999)

[11] Lawrence, Steve, Bollacker, Kurt, Giles, C. Lee, “Indexing and Retrieval of Scientific Literature”, Eighth International Conference on Information and Knowledge Management, pp. 139-146
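
As an illustration of the scoring rule described in Section 5.2 (100 points for a name match, 50 points per matching Label value, and five points per occurrence of a word in the contents), the following is a minimal sketch in Java. It is not the ThesisFS source. The class SearchScoreSketch, its nested Doc type, and the methods score and rank are hypothetical names introduced here, and the sketch keeps files in memory instead of querying the database. The paper does not say how a "match" is decided, so the sketch assumes a case-insensitive substring match for names and Label values and an exact match against lower-cased indexed words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the Section 5.2 scoring rule; not the ThesisFS source.
class SearchScoreSketch {
    static final int NAME_POINTS = 100;   // per query word found in the file name
    static final int LABEL_POINTS = 50;   // per Label value matching a query word
    static final int CONTENT_POINTS = 5;  // per occurrence of a query word in the contents

    // In-memory stand-in for a file row plus its Label values and index-table rows.
    static final class Doc {
        final String name;
        final List<String> labelValues;
        final Map<String, Integer> wordCounts; // lower-cased indexed word -> occurrences

        Doc(String name, List<String> labelValues, Map<String, Integer> wordCounts) {
            this.name = name;
            this.labelValues = labelValues;
            this.wordCounts = wordCounts;
        }
    }

    // Sums the points earned by every whitespace-separated word of the query.
    static int score(Doc doc, String query) {
        int total = 0;
        for (String token : query.toLowerCase().trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty query yields a single empty token
            }
            // Assumption: a name or Label "match" is a case-insensitive substring match.
            if (doc.name.toLowerCase().contains(token)) {
                total += NAME_POINTS;
            }
            for (String value : doc.labelValues) {
                if (value.toLowerCase().contains(token)) {
                    total += LABEL_POINTS;
                }
            }
            // Assumption: a content match is an exact match on an indexed word.
            total += CONTENT_POINTS * doc.wordCounts.getOrDefault(token, 0);
        }
        return total;
    }

    // Keeps files with a score above zero, sorted by descending score.
    static List<Doc> rank(List<Doc> docs, String query) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (score(d, query) > 0) {
                hits.add(d);
            }
        }
        hits.sort((a, b) -> Integer.compare(score(b, query), score(a, query)));
        return hits;
    }

    public static void main(String[] args) {
        Doc a = new Doc("report.txt", Arrays.asList("thesis draft"),
                Collections.singletonMap("report", 3));
        Doc b = new Doc("notes.txt", Collections.<String>emptyList(),
                Collections.singletonMap("report", 1));
        for (Doc d : rank(Arrays.asList(b, a), "report")) {
            System.out.println(d.name + " " + score(d, "report"));
        }
    }
}
```

With the two sample files above, the query "report" scores report.txt at 115 (100 for the name plus 5 x 3 for the contents) and notes.txt at 5, so report.txt is listed first.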

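
As an illustration of the content indexing described in Section 5.1, where the index table stores each word of a file together with the number of times it appears, the following is a minimal sketch in Java of the counting step only. It is not the ThesisFS indexing algorithm. The class WordCountIndexSketch and the method countWords are hypothetical names, the file id 42 is made up, and the sketch collects the counts in a map instead of writing them through a java.sql.Statement. It also assumes that words are runs of ASCII letters and digits, which the paper does not specify.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of per-file word counting for the index table; not the ThesisFS source.
class WordCountIndexSketch {

    // Returns word -> occurrences for the given contents, with words lower-cased.
    static Map<String, Integer> countWords(String contents) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys give stable output
        // Assumption: anything other than a-z and 0-9 separates words.
        for (String token : contents.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator produces one empty token
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int fileId = 42; // made-up id standing in for file(id)
        // Each entry corresponds to one (file, word, count) row of the index table.
        for (Map.Entry<String, Integer> row : countWords("The file, the index.").entrySet()) {
            System.out.println(fileId + " " + row.getKey() + " " + row.getValue());
        }
    }
}
```

Counting all words of a file in one pass and then writing one row per distinct word keeps the number of database writes proportional to the vocabulary of the file, not to its length.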