White Paper System Architecture
System Architecture
Version 1.2
February 2010
DocuWare AG
Therese-Giehse-Platz 2
E-mail: [email protected]
Disclaimer:
This document was compiled to the best of our knowledge and with great care. All
references are to DocuWare products starting with DocuWare version 5.1c.
Essentially, this white paper sets out to describe the basic technical structure of the
DocuWare products. There may be small or temporary differences with respect to
individual functions in a particular version.
Contents
1. Objectives of This White Paper
5.6. Metadata
6. Databases
6.1. Database Structure
9. Full-Text Index
9.1. Functional Principle
Objectives of This White Paper
This will enable the technically minded reader to form an opinion about the DocuWare
system and to assess its power in terms of flexibility, scalability and performance when
handling current requirements. The paper includes a discussion of the measures undertaken
to achieve access security and to prevent down-times – or at least to minimize their adverse
effects on users. Another topic we will cover is integration. This will give the reader an idea of
how the DocuWare system behaves within an IT environment that it shares with other
systems, and to what extent customizations may be required in order to ensure maximum
return on investment and minimum administrative costs (total cost of ownership).
The White Paper addresses clients (users), consultancy companies, IT magazines and
distribution partners. It assumes a certain level of technical knowledge about the structure of
modern software applications, ideally of document management systems. Detailed
knowledge of current or previous DocuWare systems is not required.
As this White Paper is the first in a series, it attempts to provide an overview of the total
architecture. There are other White Papers on the subjects of Security and Integrations.
2. Future Requirements
DocuWare is one of the leading developers of document management systems, not just in
Germany, but also worldwide. This undoubtedly is the most important and best proof of the
quality and performance of the company's systems. One of the critical success factors is the
simplicity of the system’s installation, operation and administration.
Thanks to this success DocuWare systems are increasingly being used in larger and more
complex installations. This White Paper will show that technically the DocuWare system is
well suited for larger and more complex environments and as such constitutes a solid
foundation for any future needs.
In the competition for larger and more complex installations, DocuWare is measuring up
against a different set of rivals, who mostly "externalize" this complexity. DocuWare on the
other hand is intent on retaining its proven success factors and on remaining a leader in
terms of simplicity of installation, operation and administration.
Even though the overriding need continues to be for conventional archiving systems, market
trends are inevitably moving towards "Integrated Document Management" (IDM), and in
the longer term even towards "Enterprise Content Management" systems.
While we may not yet have arrived at a precise definition of Enterprise Content Management
(ECM), the needs of IDM are by now largely established, and the DocuWare systems
already go a long way toward meeting them.
Integrated document management must be independent of "time and space." This means it
must be available everywhere and at all times, regardless of whether the user is at company
headquarters, at a branch office, at a client site, or in his office at home. It also means that
documents do not necessarily have to be stored at the location where the documents
originate and that the documents are available irrespective of the location at which they were
archived.
Additionally, clients may have very different needs. Some of them are faced with enormous
volumes of documents that may need to be captured and stored, even though they may
seldom be accessed. At the other end of the spectrum there may be clients with relatively
small volumes of documents that are accessed by large numbers of users from various
locations on a constant basis.
It follows that systems for large and complex installations must be differentiated and assessed for
suitability above all on the basis of their system architecture. With this in mind, the following
evaluation criteria are important:
Administration
Simple and coherent administration for the entire system in order to reduce maintenance
costs, part of the Total Cost of Ownership (TCO).
Scalability
In order to meet the requirements it may be necessary to implement one large system
spanning several sites, or it may be that several, smaller installations are better suited to
particular organizational and technical needs. Whatever the case may be, mobile users
must have the option of transporting subsets of the archive on their notebooks.
Clearly, the intention is not to cater for such different requirements by providing different
systems, but to cater for different needs with different expansion stages of the same
technology.
Security
In the context of an archiving system ("file cabinet"), security in all its facets is a
critical consideration. For one thing, the basic need for revision-proof archiving brings
with it the necessity to prevent data loss in case of system failures. If the client depends
on the availability of the system – which is increasingly the case – continuity becomes
ever more important.
In addition, the ability to map organizational competencies and permissions is of great
importance. To safeguard security it must be possible to restrict user access to
functionalities and to data in a flexible manner that matches organizational needs.
Integration capability
The all-important criteria in terms of integration in today's complex and heterogeneous IT
landscapes are the availability of interfaces, the possibility to integrate existing IT
infrastructures, the conformity to standards and the openness towards system internals.
Migration capability
As is well known, information technology has extremely short innovation cycles, while
archives ("file cabinets") often have very long life spans. Consequently, migration is very
much part of the abovementioned integration. Additionally, there are compatibility
requirements in terms of system generations and migration tools that need to be
considered.
These topics, which are very important, fall outside the scope of this White Paper which
concentrates on the system architecture as a whole, but they will be covered in greater detail
by other White Papers on the subjects of Security and Integration Capability.
System Architecture - Overview
In order to be able to deliver the full IDM functional spectrum in large and complex
installations and to meet the listed requirements in terms of scalability, integration capability
etc., the design criteria were broken down into
requirements from the perspective of the provider
requirements from the perspective of the user.
DocuWare systems are fully multi-client enabled so that an outsourcing provider can run one
system for a number of clients. Users, storage locations, functional modules, language
support, etc. can all be defined independently for each client, without having to take into
consideration the settings of the other clients. This also means that several, totally different,
archives for different clients using different storage structures and/or different storage
technologies can all be run on one and the same DocuWare system.
To this effect, various different "organizations" are defined in Administration, where each
"organization" represents one particular client. In addition, for each client you can define a
separate administrator with appropriate rights. In other words, it is possible to set up client-
specific configurations in parallel.
A synchronized procedure is required only for changes in the hardware and the basic
configuration, which affect the entire system, for example the type and number of servers.
This is handled by a separate "system administrator" who is authorized to carry out such
system changes but whose rights might be restricted when it comes to the clients' individual
databases. A log makes it possible to monitor usage and thus supports security.
Additionally, the in-built log provides a basis for invoicing when operating the ASP model.
DocuWare providers thus have the option of running several different DocuWare
configurations for multiple clients.
Web access
Web Client provides the option of accessing DocuWare file cabinets via the
Internet, an intranet or an extranet. All essential features, such as opening documents, marking with
notes and stamps, changing index words and storing documents, are available irrespective
of the workstation and without installation on the client computer.
Multi-language support
For DocuWare AG as an international company, support for the most widely used Romance,
Germanic and Slavic languages, as well as languages such as Japanese and Arabic, is of major importance. This
doesn't stop at localizing the user interface, it also involves adding support for various
number and date formats.
DocuWare uses Unicode on its servers and on new client components. The Unicode
character set (UTF-8) is used both for the interface and for data management. This makes it
possible to manage documents and their associated index data in various languages
(including Asian ones) within one and the same archive. The same applies for the full-text
index feature.
Location independence
For systems to provide an architecture within large organizations they must be capable of
operating across site boundaries. As a consequence, cross-site archives and sub-archives,
communication between components via WAN technology, remote administration and
synchronization mechanisms are all critical components of the architecture.
Today's hardware makes it possible to store huge volumes of information. Many customers
have therefore amassed very large archives on which they do not want to impose any
software-induced restrictions. This is why every attempt was made to enable DocuWare
systems to handle any volume of documents while allowing full functionality, including
security features.
In order to comply with the requirement for optimal integration of the system into a
heterogeneous IT landscape, great emphasis was put on supporting and using existing de-
facto standards. This includes the use of existing directory, database and mail servers, but
also openness toward different storage technologies.
There are many tasks within the daily operation of a DMS that are completely routine, for
example copying documents from data sources. Similarly, users — especially in
administrative capacities — handle a range of repetitive processes that recur all the time.
This called for a powerful tool that would automate both system-internal as well as user-
oriented processes.
A modern document management system is expected to deliver a great deal more than just
collect and provide information. To achieve a high degree of automation, it must be possible
for the processes in place for capturing and processing documents to be defined in a flexible
manner. In addition, documents these days often control many of the workflows within
organizations. It is part of the remit of a document management system to electronically map
this process control by means of document workflow functions.
DocuWare provides the Workflow server for this purpose. This controls all automation
processes and acts as the workflow engine for the document workflow.
In order to embed the DocuWare system transparently and seamlessly into existing
infrastructures, it must be capable of being integrated with the database technologies of
leading developers independent of the operating system. This is not simply a question of
protecting a company's investment, but plays an important role in terms of administrative
efficiency.
A number of different storage technologies have been competing with each other for a while.
This means that their relative strengths and weaknesses are undergoing constant changes.
By contrast, archiving systems are expected to provide efficient and secure storage facilities
over long periods. This is why openness vis-à-vis storage technologies, independent of
operating systems, is crucial in order to achieve continuity regarding security and efficiency
throughout the entire storage cycle.
Even if the requirements from the perspective of the provider are essential to the system
architecture, the real benefit to the user depends on the functionality of the document management
system. At this point, we have provided a summary of the features to give an overview of the
requirements. For more detailed descriptions please consult the product literature and the
materials available on the DocuWare website (www.docuware.com).
The next section describes how the above requirements were integrated into the system
architecture.
With this basic system as the starting point, the DocuWare system is expandable and
scalable in discrete steps. The next figure shows an example of how
a process server can be added to give extra functionality
Web Clients can be integrated using Web Client Server
separate hardware systems can be used for
authentication and workflow servers on the one hand, and
content servers on the other
database and file store
Workflow Server
The workflow server controls all automation and workflow processes. Automation
processes include, for example, document import/export, file cabinet synchronization,
migration and fulltext indexing.
Web Client Server and Imaging Server
Web Clients can be integrated via the Web Client Server, which in turn accesses the
Imaging Server. Users of these clients need a browser (Internet Explorer or Firefox) and
they can store, search for, display, mark with notes and stamps, etc. documents in
DocuWare file cabinets.
Communication between components occurs via standard protocols such as TCP/IP and
HTTP. This allows systems to be implemented across different sites using Internet
technology. If security is an important consideration, communication can also be realized via
VPN (Virtual Private Network).
Figure 5: File cabinets spanning several sites with Master (m) and Satellite (s).
This architecture not only allows reciprocal access to remote file cabinets but also the
creation of redundant file cabinets in order to be able to work on the same file cabinets
(archives) regardless of site and transmission capacity. Regardless of the file cabinet type
("master" or "satellite"), the full DocuWare functionality can be used at both sites, including
copying any documents. Synchronization between "master" and "satellite" takes place via
Workflow Server (see 8.1). The selected architecture therefore above all follows the
requirements for scalability across site boundaries to cover organizations that have a number
of branches in geographically different locations.
On the client side, all Microsoft Windows versions starting with Windows XP are supported.
This means that a Rich Client exists for these versions making available the full range of
features provided by the DocuWare system.
Users can access DocuWare using a Web Browser via the Web Client. To have full use of all
the features of this Web Client, you need the following: Windows XP or higher and Internet
Explorer from version 6 or Firefox from version 2. The Web Client can also be used with
Firefox on Mac or Linux systems. (For restrictions to the functional scope, see 7.1.6
ClickOnce applications and 7.1.7 Silverlight Plug-In for Web baskets.)
The servers of the DocuWare system are implemented on the basis of Microsoft's .NET
architecture. Since their optimization for the Microsoft platform, both installation and
administration have become much easier, and performance has soared.
Although the DocuWare system comes with a number of its own servers, it does not require
a Windows server license, but only the "Windows engine." This makes the system very
economical, including for set-ups with several DocuWare servers.
DocuWare servers can therefore be run on all platforms supporting one of the Windows
versions XP/2003 or higher.
The basics of DocuWare are a database and a file cabinet. MySQL provides a powerful
database within the basic system. You can then use any Windows filing system as your file
cabinet, for example the one on the content server platform. Both tasks are typically
handled by dedicated, existing hardware systems, which may also reside on non-Windows
platforms. DocuWare can take advantage of such resources.
Figure 6: Open systems integration (DocuWare Server connected to File Store, User Directory
(LDAP, Active Directory, NT Domain) and Database)
Extensive tests were carried out to ensure that the DocuWare system runs on the Microsoft
terminal server and the Citrix Metaframe extensions. This means that using elementary
Windows stations in this environment is a perfectly viable option.
3.5. Summary
DocuWare systems can be integrated into existing IT landscapes without the need for
redundant installation and administrative expenditure, for example in terms of additional
databases or user management.
4. Authentication Server
Authentication Server manages all users and resources within the entire system. Before you
can use the system, you must always log on to Authentication Server.
Figure 7: Authentication
During a user login, the authentication server also checks the licenses for the various
DocuWare servers which are available to that particular user. Both "concurrent licenses" and
"named licenses" are supported.
4.1. Passwords
Passwords are usually encrypted, or stored as hash values. The same applies to system
settings such as the login for the database server.
DocuWare uses the "salted" hash procedure, whereby a random value ensures that even two identical
passwords do not generate the same hash value. This means that passwords can neither be
read nor reproduced.
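The salted hash principle described above can be illustrated with a short Python sketch. It is not DocuWare's implementation: the white paper does not name the hash algorithm, so PBKDF2 with HMAC-SHA-256, the 16-byte salt, the iteration count and the function names are all assumptions chosen for illustration.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 100_000  # assumed work factor, not taken from the white paper


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is drawn when none is given."""
    if salt is None:
        salt = os.urandom(16)  # the random value that makes identical passwords differ
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# Two users who choose the same password still get different stored values.
salt_a, hash_a = hash_password("secret")
salt_b, hash_b = hash_password("secret")
```

Only the salt and the digest would be stored. The password itself cannot be read back from them, and a login attempt is checked by recomputing the digest with the stored salt.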
The login options are specified when a user is set up. User management is performed by the
Organizations Administrator.
DocuWare uses a "ticket granting ticket" (TGT) whereby the user or client identify themselves
to the authentication server, request a service, are given a "ticket" and with this ticket can
then use the service of another server, for example of a content server. For the purposes of
identification the client needs "credentials" which, as mentioned above, it receives either
through user input (DocuWare login) or through the Windows user administration (trusted
login). Thus, the Authentication Server exerts the central control function over the sessions
within the system and can on the one hand impose the security features and on the other
react dynamically in case of failure or overload of individual servers.
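The ticket principle can be sketched as follows. This is a deliberately simplified illustration and not the DocuWare or Kerberos protocol: it assumes a single secret key shared by the servers, an HMAC-signed ticket, and hypothetical names such as `issue_ticket`, `verify_ticket` and `SERVER_KEY`.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SERVER_KEY = b"demo-key-shared-by-servers"  # hypothetical shared secret


def issue_ticket(user: str, service: str, lifetime_s: int = 300,
                 now: Optional[float] = None) -> str:
    """Authentication server side: sign a short-lived ticket for one service."""
    issued = time.time() if now is None else now
    claims = {"user": user, "service": service, "expires": issued + lifetime_s}
    payload = json.dumps(claims, sort_keys=True).encode("utf-8")
    signature = hmac.new(SERVER_KEY, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode("ascii") + "." + signature


def verify_ticket(ticket: str, service: str, now: Optional[float] = None) -> bool:
    """Service side (e.g. a content server): check signature, service and expiry."""
    try:
        body, signature = ticket.rsplit(".", 1)
        payload = base64.urlsafe_b64decode(body.encode("ascii"))
    except ValueError:
        return False
    expected = hmac.new(SERVER_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False
    claims = json.loads(payload)
    current = time.time() if now is None else now
    return claims["service"] == service and claims["expires"] > current


ticket = issue_ticket("alice", "content-server")
```

In this sketch the client never presents its credentials to the content server. It only presents the ticket, which the content server can check without contacting the authentication server again.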
The communication between client and servers and between servers takes place securely.
The supported protocols are NTLM and Kerberos. Because Kerberos provides a higher level of
security, DocuWare servers optionally use it amongst themselves and also try to use it
to communicate with external systems. Only in cases where the partner system
does not support this – e.g. older Windows versions – is NTLM used for compatibility
reasons.
4.4.1. Roles
In addition to "user groups," the DocuWare system also works with the "role" concept. This
involves defining "roles" to which authorization profiles (collections of rights) are then
assigned. A "role" therefore is a particular set of authorizations, not users. It typically
corresponds to the rights that are necessary in order to fulfill particular tasks within a
process.
By assigning roles to individual users or user groups these are automatically awarded the
authorizations that were previously defined in the profiles.
A role comprises one or more profiles defining the available features plus one or more
profiles specifying the access rights to stored documents (see "feature profile" and "file
cabinet profile" in the glossary). One or more roles can be assigned to a user or a group.
4.4.2. Profiles
A particular position or "role" in an organizational unit can entail quite different tasks – and
hence require a number of different authorizations. This is why individual authorizations are
grouped into profiles. The role of the "chief buyer" for example might require the profile for
approving vacation requests as well as the profile for purchasing complex IT systems,
because both these tasks happen to fall within the competency of the chief buyer's position,
even though they are totally unrelated.
Users are allocated roles according to their tasks within the organization. Typically, a user
will have a number of different roles, and in many cases, several users have overlapping
roles. Users can therefore be put into "groups."
A "group" is a set of users. Groups cannot contain subgroups. A user can be a member of
several groups. Users can also be allocated profiles and individual authorizations directly.
Since DocuWare allows the exchange with external user administration systems that may
typically also work with the "group" concept, these settings make it very easy to assign
DocuWare rights to external users.
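The relationship between users, groups, roles and profiles described in 4.4.1 and 4.4.2 can be sketched as a small data model. The class names, field names and right identifiers below are hypothetical and are not DocuWare's schema. The sketch only shows how a user's effective rights could be collected from direct rights, direct profiles, the user's own roles and roles assigned to the user's groups.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Profile:
    name: str
    rights: FrozenSet[str]  # a profile is a collection of individual authorizations


@dataclass(frozen=True)
class Role:
    name: str
    profiles: Tuple[Profile, ...]  # a role is a set of authorizations, not users


@dataclass
class User:
    name: str
    groups: Set[str] = field(default_factory=set)  # flat groups, no subgroups
    roles: List[Role] = field(default_factory=list)
    profiles: List[Profile] = field(default_factory=list)
    rights: Set[str] = field(default_factory=set)


def effective_rights(user: User, group_roles: Dict[str, List[Role]]) -> Set[str]:
    """Union of direct rights, direct profiles, own roles and group roles."""
    roles = list(user.roles)
    for group in user.groups:
        roles.extend(group_roles.get(group, []))
    result = set(user.rights)
    for profile in user.profiles:
        result |= profile.rights
    for role in roles:
        for profile in role.profiles:
            result |= profile.rights
    return result


vacation = Profile("vacation approval", frozenset({"approve_vacation_request"}))
it_purchasing = Profile("IT purchasing", frozenset({"purchase_it_system"}))
chief_buyer = Role("chief buyer", (vacation, it_purchasing))
reader = Role("reader", (Profile("search", frozenset({"search_documents"})),))

group_roles = {"purchasing": [chief_buyer]}
alice = User("alice", groups={"purchasing"}, rights={"export_documents"})
bob = User("bob", roles=[reader])
```

Here the hypothetical user alice receives both chief buyer profiles through her group, plus one directly assigned right, while bob only holds the reader role.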
5. Content Server
Clients and other servers gain access to archive and database information via Content Server.
Thus, Content Server is responsible for providing standard access, central control and
logging of the file cabinet utilization.
The various organizations that use a DocuWare system can use different Content servers
simultaneously. Content Servers are individually scalable, so that it is easy to distribute the
load within a DocuWare system optimally. All Content servers manage the index data and metadata
of the stored documents in one or more databases.
Moreover, the Content server manages a number of "logical file cabinets" to which
documents are allocated. It looks after all activities that access these archives whether they
involve storing documents or searching and retrieving them.
In order to facilitate the mapping to removable mass storage media, "logical disks" with
specific storage capacities are assigned to the archive. This makes it possible for documents
to be stored together physically so that they can then be swapped out, deleted or transported
more easily.
These "disks" are located in a "storage location," which can be any file store that you may
choose. Different types of storage media are supported (see 5.4). Even within one file store it
is possible to have a combination of different media.
Each organization can have several archives. Each archive uses a database for managing
the index data and one or more locations for storing the documents and header files.
As mentioned already, you can build archives that span several sites by using a master-
satellite structure. The synchronization between master and satellite is handled by the
Workflow server. As far as the user is concerned, both master and satellite provide the same
functionality. You can add documents to either of these two archives and, if you have the
necessary authorization, also modify and delete documents. Each archive has an ID which is
unique in the world and which cannot be altered. This prevents any clashes between names
even if the systems are merged.
Achieving a high degree of security was an important design criterion for the developers of
the DocuWare server. The following are among the crucial security aspects of Content
Server [1]:
Users and administrators require no knowledge of the internal file structure – nor do they
need access rights to it.
Documents and files can be stored in an encrypted format (only in conjunction with
enterprise server).
Files are protected with a type of checksum (using a hash algorithm) making any
changes immediately visible.
When multiple users access the same document at the same time, the DocuWare system
ensures consistency.
Additional security aspects are described in the following section which discusses the main
elements of the Content server and the file cabinets.
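The checksum protection mentioned in the list above can be illustrated with a minimal Python sketch. The white paper does not state which hash algorithm DocuWare uses, so SHA-256 and the function names are assumptions.

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> str:
    """Compute a hash-based checksum for a stored file's content."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, stored_fingerprint: str) -> bool:
    """Return True only if the content still matches the recorded checksum."""
    return hmac.compare_digest(fingerprint(data), stored_fingerprint)


original = b"Invoice 4711, total 100.00 EUR"
recorded = fingerprint(original)  # stored when the document is archived
```

Any later change to the file content, even a single character, yields a different checksum, so the modification becomes visible when the file is checked against the recorded value.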
Every organization has one or more logical "file cabinets" for storing documents.
The archive settings define:
General characteristics, such as name, etc.
Database to be used, and any additional database-related settings
The file cabinet to be used and its subdivision into logical disks (with capacity limits)
Access rights (and file cabinet profiles) for the archive or for individual fields
User dialogs for file storage, searches and results list
Web instance(s) to which the file cabinet is available (for access via Web Client)
When setting up a file cabinet you also need to specify which Content Server is going to be
used to access that file cabinet. Other than that, the archive settings define the principal
functionalities. These include availability of a full-text index, type and extent of the stamps
that are available for document processing as well as electronic signatures.
Optionally, an archive can be accessed via several Content servers. Allocation takes place at
user login and is controlled by the Authentication server. This allows on the one hand load
distribution across several Content servers and on the other a "changeover" if a Content
server should fail.
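The allocation at login can be sketched as follows. The white paper only states that the Authentication server controls the allocation. The selection rule used here (the available server with the fewest active sessions) and all names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContentServer:
    name: str
    active_sessions: int = 0
    available: bool = True


def allocate(servers: List[ContentServer]) -> Optional[ContentServer]:
    """Pick the least-loaded available server for a new login, or None if all are down."""
    candidates = [s for s in servers if s.available]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda s: s.active_sessions)
    chosen.active_sessions += 1
    return chosen


# Two hypothetical Content servers that both serve the same archive.
server_a = ContentServer("content-a", active_sessions=2)
server_b = ContentServer("content-b", active_sessions=0)
pool = [server_a, server_b]
```

Marking a server as unavailable removes it from the candidates, so subsequent logins are directed to the remaining servers. This corresponds to the "changeover" case described above.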
[1] In view of the importance of the security aspect, there is a separate White Paper which provides an
in-depth description of this topic.
This means that for each document stored, DocuWare may need to manage a number of
files. A "document" as understood by DocuWare may consist of a combination of several
TIFF, Office, PDF and other files, for example in cases where DocuWare stores an e-mail
with multiple attachments as one document. In DocuWare such parts of documents are also
called "pages" (see 5.5 to 5.7). For each document that it stores, DocuWare creates a
separate document directory. The system manages documents by their header file (XML
format). Each document is assigned a unique sequential number, the so-called DOCID. This
is automatically incremented for each new document.
In order to achieve optimum flexibility and openness, the document store is mapped on to a
file directory from which external storage systems may be addressed. The range of options
available in these file directories is determined by the operating system. In view of the
intended open architecture and the independence with regard to the storage systems,
DocuWare models itself on the possibilities offered by these file systems:
CD-ROM standards ISO 9660 and Joliet
DVD standard
Microsoft NTFS, FAT16, FAT32
Linux file systems (ext2, Minix, NFS, etc.)
Novell File System
These were taken as the framework conditions for the DocuWare file storage structure. Since
for reasons of compatibility and performance no more than 256 files should be stored in a
directory, you need to use several hierarchy levels.
By using four levels, DocuWare can manage more than 4 billion documents per file cabinet (256^4 =
4,294,967,296). Below the file directory assigned by the administrator, the DocuWare
directory is addressed by its archive name, the disk numbers, three directory levels and the
document level.
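The capacity figure follows directly from four levels with at most 256 entries each. The Python sketch below reproduces the arithmetic and shows one way a zero-based document index could be split into four level indices. The splitting scheme and the function names are assumptions; the white paper does not specify how DocuWare derives directory names from the DOCID.

```python
ENTRIES_PER_LEVEL = 256
LEVELS = 4  # three directory levels plus the document level


def capacity(entries: int = ENTRIES_PER_LEVEL, levels: int = LEVELS) -> int:
    """Number of documents addressable with the given fan-out and depth."""
    return entries ** levels


def split_index(index: int) -> tuple:
    """Split a zero-based index into four values in 0..255, most significant first."""
    if not 0 <= index < capacity():
        raise ValueError("index outside the addressable range")
    return tuple((index >> shift) & 0xFF for shift in (24, 16, 8, 0))
```

For example, under this assumed scheme index 256 maps to `(0, 0, 1, 0)`: the first 256 documents fill one lowest-level directory, and the next document starts a new one.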
If for example you allocate the directory D:\DOCS and the name SALE to the file cabinet, the
documents of the first disk will reside in the following subdirectory:
Apart from the header file, the document directory will also contain the files associated with
the stored document, all beginning with "F" (= File) plus a sequential number. Sound
annotations (spoken text, etc.) are identified by the letter "A" (= Annotations), the number of
the associated F file and a sequential number.
A document that consists of several parts and contains speech annotations would therefore
be represented like this:
\00000001\ 00000001.XML
\ F1.pdf
\ F2.doc
\ F3.tif
\ A1_1.wav
\ A1_2.wav
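The naming convention shown in this listing can be sketched in Python. The regular expressions and the function name are illustrative assumptions derived only from the example above ("F" plus a sequential number for files, "A" plus the F file number and a sequential number for annotations, and an XML header file).

```python
import re
from typing import Dict, Iterable

FILE_RE = re.compile(r"^F(\d+)\.(\w+)$", re.IGNORECASE)
ANNOTATION_RE = re.compile(r"^A(\d+)_(\d+)\.(\w+)$", re.IGNORECASE)


def group_document_files(names: Iterable[str]) -> Dict[str, object]:
    """Group the files of one document directory by the F file they belong to."""
    header = None
    pages: Dict[int, dict] = {}
    for name in names:
        annotation = ANNOTATION_RE.match(name)
        page_file = FILE_RE.match(name)
        if annotation:
            number = int(annotation.group(1))  # number of the associated F file
            entry = pages.setdefault(number, {"file": None, "annotations": []})
            entry["annotations"].append(name)
        elif page_file:
            number = int(page_file.group(1))
            entry = pages.setdefault(number, {"file": None, "annotations": []})
            entry["file"] = name
        elif name.upper().endswith(".XML"):
            header = name  # the header file that describes the document
    return {"header": header, "pages": pages}


listing = ["00000001.XML", "F1.pdf", "F2.doc", "F3.tif", "A1_1.wav", "A1_2.wav"]
document = group_document_files(listing)
```

Applied to the example listing, both sound annotations are attached to the first file (F1.pdf), while F2.doc and F3.tif have none.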
You can transfer these logical disks to another – physical – medium at any time you choose,
for example when they reach a certain size. This has the advantage that documents can be
swapped out to physical media either by pre-defined rules, or automatically. DocuWare
provides a number of convenient support functions which automate the necessary steps.
The concept of logical disks and the open file structure gives the administrator a high degree
of transparency and flexibility when working with stored files. Since the structure conforms to
common standards you may also use the tools provided by the operating system, though
these are less convenient.
local hard disks, (virtual) network storage media and external storage systems. The
technological basis of these systems is irrelevant, since DocuWare is capable of supporting
any media, provided they conform to the conventions for Windows filing systems. This
means that advanced storage technologies such as RAID systems, NetApp storage
solutions, Network Attached Storage (NAS), other "shared disk" systems or Storage Area
Networks (SAN) can be used, as long as they can be integrated in the Windows file system
as virtual disks.
In addition, DocuWare offers direct support for certain jukeboxes and special storage
systems by providing software that integrates these systems as DocuWare file cabinets just
as it does with Windows file systems.
You can set specific options to determine whether files will be written direct to the target
medium, which in the case of WORM for example will ensure maximum security, or whether
to go via the intermediary of the virtual disk, because CD/DVDs cannot be burnt in
succession.
The following sections describe the different media and their application.
Each GB of mass storage can contain some 20,000 DIN A4 pages. This is the equivalent of
about 40 well filled paper folders. In addition, you have the option of combining several hard
disks in a so-called Disk Array. These arrays are the ideal solution for storage capacities of
up to 150 GB for an archiving system where magnetic storage technology does not present a
problem.
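The rule-of-thumb figures in this paragraph can be turned into a rough sizing estimate. The sketch below uses only the numbers stated above (20,000 pages and 40 folders per GB) and assumes a decimal gigabyte; actual page sizes depend on resolution, compression and file format.

```python
PAGES_PER_GB = 20_000    # figure stated in the text
FOLDERS_PER_GB = 40      # figure stated in the text
BYTES_PER_GB = 10 ** 9   # assumption: decimal gigabyte

bytes_per_page = BYTES_PER_GB / PAGES_PER_GB      # average size of one page
pages_per_folder = PAGES_PER_GB / FOLDERS_PER_GB  # pages in one paper folder


def pages_for_capacity(gigabytes: float) -> float:
    """Estimated number of DIN A4 pages that fit into the given capacity."""
    return gigabytes * PAGES_PER_GB
```

Under these assumptions a page averages about 50 KB, a folder holds about 500 pages, and a 150 GB disk array corresponds to roughly 3 million pages.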
A RAID (Redundant Array of Independent Disks) provides increased security against data
loss in the event of a hard disk failure. Depending on the RAID level, it also allows removal of
the disk "on the fly."
An optical removable disk (CD, DVD, Blu-Ray, WORM) can store up to 50 Gigabytes of data
- the equivalent of 800,000 pages of text. Using such large drives without a jukebox makes
sense only for single workstations. Their advantage is that they can be expanded indefinitely,
simply by inserting more disks. DocuWare looks after the management and numbering of the
disks. As long as you leave the disk labeling to DocuWare, retrieval of documents is very
easy, even if you work with many different disks.
For a long time, optical removable disks were considered revision-proof in comparison
to magnetic storage media, which is why they were the medium of choice even where magnetic
disks would have been possible. However, DocuWare ensures that modifications are either
not possible or immediately obvious.
5.4.3. Jukeboxes
Jukeboxes are "disk-changing robots" that handle optical media, typically containing one to
four drives. Currently, jukeboxes provide the largest storage volume with online access.
Small-volume solutions store 10 GB, high-end systems up to several thousand GB. These
systems are clearly useful for networks that handle huge amounts of data. Access speed
depends on the number of built-in drives. When frequent disk changes occur, access time
can be several seconds per image file.
Apart from disk-based jukeboxes, advanced tape systems (tape libraries) are now also
available, for example with WORM tapes, which are a cost-effective way of providing large
storage capacities. You need to make sure that the system can be integrated into the Windows
file system. Quite a few systems now have DocuWare certification.
Special access software will integrate jukeboxes transparently into the Windows file system
so that they can then be used by DocuWare. A list of storage systems that are supported
directly by DocuWare and some of which are certified is published on www.docuware.com.
Until recently, optical media were used whenever revision-proof archiving was a major
concern. Now, however, RAID-based solutions have become a perfectly good alternative,
especially for large volumes, as special software can make them behave in a similar manner
to WORM drives. These are closed systems that typically have the following
characteristics:
Application and users have no knowledge about the physical location of a file within the
subsystem. Accidental or intentional modification of the data by users/administrators is
not possible.
"Hashing" similar to the signature procedure is used to give the file a "fingerprint" – which
also serves as its address.
Identical copies are saved only once.
The file is automatically given a time signature.
Storage and access is possible from different systems on different platforms, i.e.
documents that were stored with DocuWare may be read by applications on other
platforms.
It is possible to increase capacity on the fly. Data is automatically distributed, which also
implies that it is possible to migrate to other media within the same system.
The system provides redundancy, error monitoring and – wherever possible –
autonomous error correction.
In order to use these functions, the application needs to address a specific interface of the
CAS (content-addressed storage) system. A list of the CAS systems that are directly supported
by DocuWare can be found at www.docuware.com.
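The principle that a file's fingerprint also serves as its address can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the interface of any real CAS product: the class name, the choice of SHA-256 and the use of the system clock for the time signature are assumptions made for this example.

```python
import hashlib
import time

class ContentAddressedStore:
    """Toy in-memory store: the SHA-256 fingerprint of a file is its address."""

    def __init__(self):
        self._objects = {}   # fingerprint -> (content, time stored)

    def put(self, content: bytes) -> str:
        fingerprint = hashlib.sha256(content).hexdigest()
        # Identical content yields the same fingerprint, so it is kept only once.
        if fingerprint not in self._objects:
            self._objects[fingerprint] = (content, time.time())
        return fingerprint

    def get(self, fingerprint: str) -> bytes:
        content, _stored_at = self._objects[fingerprint]
        # Re-hashing on read reveals any modification of the stored bytes.
        if hashlib.sha256(content).hexdigest() != fingerprint:
            raise ValueError("stored content no longer matches its fingerprint")
        return content

    def __len__(self):
        return len(self._objects)

store = ContentAddressedStore()
a = store.put(b"invoice 4711")
b = store.put(b"invoice 4711")   # second copy of the same content
```

Both calls return the same address and the store holds a single object, which mirrors the "identical copies are saved only once" characteristic.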
The NetApp storage solutions are based on one of NetApp's own operating systems and can
be integrated into various storage environments (NAS, SAN, iSCSI). They are especially
intended for managing large volumes of data and for the long-term archiving of WORM
documents. The company provides special software for data management. This supports the
following tasks:
Management of SANs
Performance optimization
Application integration (e.g. with VMware, SAP, Oracle, Windows, Exchange,
SharePoint)
Data backup and restore
Archiving
Ensuring compliance with statutory retention periods
In combination with DocuWare, NetApp storage can be used only for storing documents and
requires an enterprise license from DocuWare.
Header files are XML files. Using these standard file formats gives customers the following
benefits:
Less dependency on the manufacturer because the internal structures are open.
Maximum transparency thanks to formats that can be both read and written.
Simplified exchange with all standard-compatible systems, including future DocuWare
generations.
Simplified exchange with capturing systems and scan service providers.
DocuWare uses this format for storing the metadata and any additions. The actual content is
stored separately (for performance reasons), except when exporting. DocuWare uses the
XML file not just for NCI but for all documents that are managed by the DocuWare system.
For each file that is part of a DocuWare document the XML file contains a separate section
which may contain metadata.
5.6. Metadata
The metadata contain both the attributes allocated by the user (index data, field properties)
and the data that DocuWare requires for its management function (system properties), such
as the DOCID. This data is identical to the index data which the database maintains for every
file.
DocuWare ensures consistency between the database and the header file. If a database is
irretrievably lost and no usable backup is available, the header files can be used to
regenerate the database information. However, since this procedure can be rather
time-consuming, it should not be used as a substitute for a conventional data backup.
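The idea of regenerating database information from header files can be sketched as follows. The XML layout, the element and attribute names, and the table name are hypothetical, since the actual header schema is not shown in this paper; the sketch only illustrates that index rows can be rebuilt when each header carries the DOCID and the index fields.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical header layout; the real DocuWare schema is not shown in this paper.
HEADERS = [
    '<header docid="1"><field name="COMPANY">Acme</field>'
    '<field name="DOCTYPE">Invoice</field></header>',
    '<header docid="2"><field name="COMPANY">Globex</field>'
    '<field name="DOCTYPE">Contract</field></header>',
]

def rebuild_index(header_xml_strings):
    """Recreate an index table from the metadata kept in header files."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE idx (docid INTEGER, name TEXT, value TEXT)")
    for xml_text in header_xml_strings:
        root = ET.fromstring(xml_text)
        docid = int(root.get("docid"))
        for field in root.findall("field"):
            db.execute("INSERT INTO idx VALUES (?, ?, ?)",
                       (docid, field.get("name"), field.text))
    db.commit()
    return db

db = rebuild_index(HEADERS)
rows = db.execute("SELECT docid FROM idx WHERE value = 'Acme'").fetchall()
```

Every header has to be read and parsed individually, which is one reason why such a rebuild takes far longer than restoring a database backup.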
The storage properties contain information about the history and the logical archive of the
file. Application properties are information that is required for integration with other
applications, for example with SAP.
5.7. Document
A DocuWare document can consist of several files of different formats (TIFF, Word, PDF,
etc.), which can in turn consist of several pages.
For example:
I. A 3-page paper document that was scanned into DocuWare consists of three
document pages, each of which is a one-page file (b/w TIFF files generated by
DocuWare).
II. For one document, a b/w TIFF file generated by DocuWare, a 3-page Word file, and a
2-page PDF file are linked together. The document then consists of three files.
Annotations (multiple layers of redlining, text and speech annotations, etc.) can be made
within a document in each file, but only on the first page within a file.
As in Adobe PDF, the annotations with their characteristics and any additional attributes such
as user information are stored and then reproduced by the Viewer at runtime. No additional
image files are therefore necessary, and the annotations can be traced and modified in a
flexible manner.
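The relationship between documents, files, pages and annotations can be pictured as a small data model. The following Python sketch uses class and field names invented for this illustration; it models example II above and stores an annotation as data attached to a file rather than as a modified image.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    kind: str      # e.g. "text", "redlining", "stamp"
    author: str
    content: str

@dataclass
class DocFile:
    fmt: str       # e.g. "TIFF", "Word", "PDF"
    pages: int
    annotations: list = field(default_factory=list)

@dataclass
class Document:
    docid: int
    files: list = field(default_factory=list)

    def page_count(self):
        return sum(f.pages for f in self.files)

# Example II: one TIFF page, a 3-page Word file and a 2-page PDF file.
doc = Document(docid=42, files=[DocFile("TIFF", 1), DocFile("Word", 3), DocFile("PDF", 2)])
doc.files[1].annotations.append(Annotation("text", "jdoe", "Please check the total."))
```

Because the annotation is kept as a separate record with its author, it can later be traced or changed without touching the underlying file.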
6. Databases
DocuWare requires a relational database for its operation. It uses this database to store
and search the structured index data of the documents and for the full-text index. In
addition, DocuWare stores all essential system information (such as Authentication Server
data) in this database.
Several database systems can be used within one DocuWare system: the administrator has the
option of specifying a particular database to be used for each file cabinet. It is also
possible to switch to another database system at a later stage.
The database not only manages the search criteria that are relevant for the user, but also the
system-internal information needed for storing and retrieving the documents in the file
cabinets.
The characteristic that uniquely identifies a document is its DOCID, a number that is unique
within each file cabinet. A document identified in this way may consist of several files.
Of particular importance are the user-defined fields. These specify the keywords and
categories by which documents are stored and retrieved.
7. Web-Based Applications
The trend in IT applications is increasingly toward Web-based solutions. Installation and
maintenance on client computers thereby become unnecessary, and the application can be
accessed from anywhere and from any computer, irrespective of operating system. All that is
needed is an Internet connection.
DocuWare is also following this path. File cabinets are accessed through the Web Client.
From the user's perspective, documents are searched and shared as on the Windows Client;
technologically, however, it is a completely new development based on ASP.NET, JavaScript,
AJAX and Silverlight.
The administration of the DocuWare system is also becoming increasingly Web-based. In the
future, it should be possible to manage everything via an Internet connection.
Technologically, this will also be based on Silverlight.
File cabinet access via the Internet is based on Web Client Server, which is installed within
the DocuWare system as an additional server module. Web Client Server supplies the user
interface which is displayed in the browser window.
To access a file cabinet, the user connects to Web Client Server via the Internet using Web
Client. Web Client Server forwards the request to Authentication Server to verify the user
account and checks the file cabinet access rights via Content Server. From the perspective
of Authentication Server and Content Server, Web Client Server acts like a client.
Imaging Server, another component for Web-based document access, converts archived
documents that are to be displayed in the Web Client Viewer to a graphics format. This
allows all main file formats to be displayed and printed in high quality without having to install
anything on the client computer. Imaging Server is also responsible for converting files to
PDF and for the text search in the Web Client Viewer.
Web Client Server communicates directly with Imaging Server. More than one Imaging
Server can be installed within a DocuWare system, making it possible to distribute the load.
In Web Client, documents can be displayed in the Viewer and in the basket as thumbnails.
For better performance, the thumbnails are not recreated each time they are loaded, but
saved in a dedicated database and supplied from there when needed for display. Thumbnail
Server is responsible for saving and retrieving thumbnails and is connected to both Web
Client Server and the database.
Figure 13: Web Client Server with Imaging Server and Thumbnail Server
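The thumbnail handling described above follows a common caching pattern: render once, store the result, and serve it from storage afterwards. The Python sketch below illustrates only that pattern; the class name, the dictionary standing in for the thumbnail database and the placeholder render function are assumptions for this example.

```python
class ThumbnailCache:
    """Return stored thumbnails; render only on the first request."""

    def __init__(self, render):
        self._render = render      # function: (docid, page) -> bytes
        self._store = {}           # stands in for the thumbnail database
        self.render_calls = 0

    def get(self, docid, page):
        key = (docid, page)
        if key not in self._store:
            self.render_calls += 1
            self._store[key] = self._render(docid, page)
        return self._store[key]

def fake_render(docid, page):
    # Placeholder for real image conversion.
    return f"thumb-{docid}-{page}".encode()

cache = ThumbnailCache(fake_render)
first = cache.get(7, 1)    # rendered and stored
second = cache.get(7, 1)   # served from the store
```

The second request does not trigger another rendering pass, which is where the performance gain comes from.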
Any number of Web instances can be created for a Web Client Server. A unique URL is
assigned to each of these instances. The user connects to Web Client Server via this URL
and loads the corresponding instance in Web Client.
The available file cabinets and file cabinet dialogs, as well as the way in which users log on
to the DocuWare system, are defined separately for each instance.
Web Client is the user interface for Web-based file cabinet access.
When the user calls up a URL for a Web instance in the browser, Web Client is displayed in
the browser window. All major features of the document management system can be run via
Web Client: opening documents, marking with annotations and stamps, editing index words,
storing and sending documents, etc.
DocuWare Web Client is based on ASP.NET and AJAX (Asynchronous JavaScript and XML).
These technologies allow Web Client to process searches very quickly, so users receive
immediate answers to their queries. Web Client is built from individual control elements
known as Web Parts.
Web Client does not require any installation on the client computer and is not dependent on
the operating system. Only features that cannot be implemented using a browser alone
require applications to be installed on the client computer (see following section).
Sending archived documents via the local mail client, another feature of DocuWare Web
Client, is technically not possible using only a browser. A DocuWare application, a "Smart
Client", must be installed on the local client computer.
DocuWare uses the ClickOnce technology from Microsoft for this. The first time they send
mail, the user clicks to download the DocuWare application once, and this is automatically
installed on the local client computer. No administrative rights are required on Windows. This
application can be updated automatically.
A local application is also required for the browser-based client application of the DocuWare
SmartConnect add-on module. This is also installed on the client computer using the
ClickOnce process.
In DocuWare, documents are processed, e.g. stapled, unstapled and pre-indexed, in so-
called baskets. For Web Client, these baskets are generally not located on the local
computer but on the network. These baskets, also known as Web baskets, are managed by
Content Server. For the Web Client user to be able to use these Web baskets, a Silverlight
browser plug-in must be installed locally. A Silverlight browser plug-in requires a Windows or
Mac operating system.
There are many integration options for Web Client. Either Web Client as a whole or only
individual elements of it, such as the result list or the Viewer, can be integrated into other
applications. The integration works with Windows and Web programs via special URL calls.
A full overview of the integration options for DocuWare Web Client can be found in the
"Integrations" White Paper.
Management Framework Process
Document workflow
The automatic routing of documents along pre-defined paths, including user interaction, is
one of the most common workflow applications. Invoices, purchase requests, vacation
applications – these are just some of the documents that need to be created, approved
and posted in large organizations. All these processes can be controlled and tracked by
means of the document workflow.
Workflow Server is the central workflow engine for performing pre-defined workflows.
Workflows have the following characteristics:
Triggering event
Input data
Various logically separate procedural steps
Output data
The Workflow Server works to match this model. Events may be triggered by user actions,
they may be timed, or they can be triggered on reaching a particular condition (e.g. "disk
full").
Such an event then starts off a particular workflow, which – depending on instructions – may
first read in certain input data. Input can come via interaction or by reading a file from a
particular directory, or from data extracted from a database.
The process itself consists of several steps, each of which represents a transaction. If a step
cannot be completed successfully, the Workflow Server writes a notification to a log file
and the process is reset to the last valid state.
On successfully completing a step, the (intermediate) result is handed to the next procedural
step. The final output is sent to the user, to a directory, or to the DocuWare file cabinet.
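The step-by-step processing with a reset to the last valid state can be illustrated with a short Python sketch. It is not the Workflow Server's implementation: the function names, the sample steps and the use of deep copies to protect the last valid state are assumptions chosen to keep the example small.

```python
import copy
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("workflow")

def run_workflow(steps, input_data):
    """Run steps in order; on failure, log it and return the last valid state."""
    state = copy.deepcopy(input_data)
    for step in steps:
        candidate = copy.deepcopy(state)   # work on a copy so state stays valid
        try:
            candidate = step(candidate)
        except Exception as exc:
            log.error("step %s failed: %s", step.__name__, exc)
            return state, False            # reset to the last valid state
        state = candidate                  # hand the result to the next step
    return state, True

def read_invoice(data):
    data["amount"] = 120
    return data

def approve(data):
    if data["amount"] > 100:
        raise ValueError("amount exceeds approval limit")
    data["approved"] = True
    return data

result, ok = run_workflow([read_invoice, approve], {"docid": 1})
```

Here the second step fails, so the result contains the output of the first step only and the failure is written to the log.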
An intermediate result of a workflow task can trigger new events which in turn may initiate
new workflows. Several workflows can resolve the same tasks in parallel and for this purpose
share the same resources, such as directories, file cabinets, etc. – while the Workflow server
ensures the integrity of the data. The processing status is monitored and each workflow task
is visible.
More than one Workflow Server can be installed within a DocuWare system. Individual
workflows are then allocated to a specific Workflow Server, so that the load can be
distributed among the Workflow Servers.
Pre-defined processes are implemented during the initial DocuWare installation, but also
during the (subsequent) installation of additional expansion modules. Users with the
necessary authorization can modify the pre-defined processes to suit their own needs.
Typically, this is a task that falls to the organization administrator.
Pre-defined workflows that are controlled by the Workflow Server exist for the following
tasks:
Migration
Exporting archives and sub-archives
Generating and synchronizing satellite archives
Creating independent CD/DVD file cabinets
Adding index information from external data sources (AUTOINDEX)
Index Restores
Deleting documents that are defined via filters
Generating and/or updating the full-text catalog
Importing documents from spool files (COLD/READ)
9. Full-Text Index
9.1. Functional Principle
DocuWare includes a full-text index, which is available to users but not mandatory. The
full-text service uses the same database as the Content Server, but creates its own tables.
When a full-text index is generated, the archive database and the documents are accessed
directly, i.e. without the intervention of the Content Server.
The full-text search function is completely integrated in the client functionality, both for
Windows client and for Web client. This means that no special databases are required and
there is no need for users to familiarize themselves with different search clients. When
configuring file cabinets, users must simply decide whether or not to create a full-text index
for the documents. Full-text searching is carried out via the Content server.
A full-text index can be generated for each logical file cabinet. Which documents are included
is determined by their association to a particular file cabinet. Since a DocuWare system can
contain a great many archives, you may end up with a large number of full-text indexes too.
Since the ongoing changes to the documents within the archive are managed by the Content
Server, communication between the Content Server and the full-text workflow is necessary,
even though the two are independent of each other. This happens "indirectly" via the
full-text main table, in which the Content Server marks the documents and files that are to
be indexed and those that need to be deleted. The full-text workflow then makes any
necessary modifications and updates the status fields.
Each occurrence of a search string also comes with an evaluation of the probable relevancy
of the term. The result list of a full-text search is sorted according to this relevancy (or
irrelevancy = noise).
To prevent the full-text index from being loaded with irrelevant words such as articles,
pronouns, etc., the full-text process uses a stop-word list which acts as an automatic
filter. The administrator can modify this stop-word list, for example by adding terms that
occur frequently within a company but are of no interest for search purposes. The name
DocuWare, for example, is not a useful differentiator within the DocuWare company. It is
also possible to exclude files (for example image files) by specifying their file extension.
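A stop-word filter of this kind can be sketched in Python as follows. The word list, the excluded file extensions and the simple tokenization rule are assumptions for this illustration and do not reflect the lists shipped with the product.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "it", "docuware"}
EXCLUDED_SUFFIXES = {".tif", ".jpg", ".png"}

def index_terms(filename, text):
    """Return the words of a file that should enter the full-text index."""
    if any(filename.lower().endswith(s) for s in EXCLUDED_SUFFIXES):
        return []                      # file type excluded from indexing
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

terms = index_terms("offer.txt", "The DocuWare offer and the price list")
```

Only "offer", "price" and "list" remain, because the articles, the conjunction and the company name are filtered out.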
In order to achieve a powerful search for partial strings and to allow a search term to be
preceded by a wildcard, a special algorithm, the so-called "Multi Suffix Tree" (MST), is used.
It works with two special files that first identify the correct entry in the dictionary table.
This entry then provides all other important information (relevancy, position, etc.).
The actual full-text index is implemented via the MST and the stringlist files which are stored
for each archive within the filing system. The individual words and substrings are stored as a
tree structure in the MST file. The stringlist file is a list of IDs which links all words and
substrings with entries in the dictionary table.
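The format of the MST and stringlist files is not described in this paper, so the following Python sketch uses a plain sorted suffix list as a stand-in to show why indexing suffixes makes searches with a leading wildcard efficient. The function names are invented for this example; the word list plays the role of the dictionary table and the numeric IDs the role of the stringlist entries.

```python
import bisect

def build_suffix_index(words):
    """Return a sorted list of (suffix, word_id) pairs and the word list."""
    dictionary = sorted(set(words))
    entries = []
    for word_id, word in enumerate(dictionary):
        for start in range(len(word)):
            entries.append((word[start:], word_id))
    entries.sort()
    return entries, dictionary

def search_substring(entries, dictionary, part):
    """Find all words containing `part`, e.g. for a query like *part*."""
    pos = bisect.bisect_left(entries, (part, -1))
    hits = set()
    while pos < len(entries) and entries[pos][0].startswith(part):
        hits.add(dictionary[entries[pos][1]])
        pos += 1
    return sorted(hits)

entries, dictionary = build_suffix_index(["invoice", "voice", "vacation", "office"])
found = search_substring(entries, dictionary, "oic")
```

Every substring of a word is the beginning of one of its suffixes, so a substring query becomes a prefix lookup in the sorted suffix list instead of a scan over all words.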
Distributed and Redundant Archives
Moreover, it is often desirable to export (sub-)archives, for example in order to deliver
information to mobile users outside the enterprise structure. This can be achieved by means
of so-called "autonomous archives."
Thanks to today's advanced security technologies such as VPNs, firewalls, etc., misuse can
largely be prevented. In this chapter, we restrict ourselves to a discussion of the functions
that DocuWare provides for distributed and redundant archives.
This means that documents can not only be read offline but also edited. New documents can
be added to the archive, and certain tasks, such as releases, can be carried out via workflow
control.
Synchronization with the master can be time-triggered or can be initiated manually by the
user. Since modifications usually occur in sub-areas of the archive only, the synchronization
areas can be specified by the powerful filter functions. The user-specific restriction of the
synchronization process to individual archives and sub-areas of archives is particularly
important for minimizing both storage requirements on mobile PCs and data transfer volumes
for regular synchronization.
In order to work autonomously, these installations have their own local database. All
necessary components for working with the archive are stored with the data and documents
on one medium, e.g. a CD or DVD. The target system does not require any software to be
installed. However, you may install extra software if you wish to increase the speed.
Such an archive can be used in a flexible way on the most diverse computer systems, e.g.
notebooks, without requiring a connection to the rest of the IT infrastructure. The capacity
of the archive depends solely on the medium's capacity, minus the space taken up by the
search software.
Typical applications:
Transferring legacy data
Creating backup copies of sub-archives
Interaction with external partners, e.g. service providers or subcontractors
Publishing and distributing catalogs, parts lists and drawings
Providing norms and technical documentation, e.g. for development, quality assurance,
purchasing and distribution
If no modifications are to be made or none are allowed, it makes sense to use archives for
pure search functions on a notebook, without synchronization.
11. Integration
Archive systems are typically integrated into an existing IT environment. The challenge
therefore is not just to ensure consistency but also to optimize the interchange of data and
documents with other systems without incurring complex and highly redundant
administrative effort.
DocuWare solves this problem by working with several servers and providing the appropriate
interfaces as well as by adhering to common standards. User data that are maintained in
Active Directories or in LDAP directories can be transferred to DocuWare without any
problems. This of course includes synchronization of changes on the fly.
Moreover, any storage technology can be used, provided it can be mapped as a Windows
file directory. This is the case with all systems by leading manufacturers and means that
DocuWare archives can be set up with non-Microsoft system platforms (such as Linux,
Novell, Solaris).
Integrating third-party platforms is equally an option for database servers, mail systems, Web
servers and applications for which interfaces are available, for example SAP.
In view of the importance of the integration aspect, a separate White Paper is devoted to it.
The following diagram gives an overview of how DocuWare can be set up to work with third-
party applications. This is also described in detail in the "Integrations" White Paper.
12. Scalability
DocuWare systems are highly scalable, starting from single workstations up to enterprise-
wide systems that can span several sites, accommodate thousands of users and are
distributed across several servers.
Figure 16: Scalability; DocuWare Client subsumes Windows Client and Web Client
However, the most frequent type of installation is a multi-user system within a local network.
The performance of the described system architecture comes into its own when the system
is fully exploited, because functionality can then be distributed across several servers, each
configured to work optimally according to organizational, technical and performance criteria.
TCP/IP networks, which today provide wide-area coverage, are required for this. The
DocuWare servers require MS Windows platforms, although these can work with other
platforms, as described under Integration.
In the case of large-scale installations and intensive system utilization, the Content Server
can become a bottleneck. In such cases, the load must be distributed across
several Content servers. If a Content server fails, restarting the client causes the
Authentication server to allocate a new Content server (CTS).
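The allocation behavior can be pictured with a small Python sketch. How the Authentication Server actually chooses a Content Server is not described here, so the "least loaded" rule, the class and method names and the server names are assumptions made for this illustration.

```python
class AuthenticationServer:
    """Hands out Content Servers to clients, skipping failed ones."""

    def __init__(self, content_servers):
        self._servers = list(content_servers)
        self._failed = set()
        self._load = {name: 0 for name in self._servers}

    def mark_failed(self, name):
        self._failed.add(name)

    def allocate(self):
        available = [s for s in self._servers if s not in self._failed]
        if not available:
            raise RuntimeError("no Content Server available")
        chosen = min(available, key=lambda s: self._load[s])  # least loaded
        self._load[chosen] += 1
        return chosen

auth = AuthenticationServer(["CTS1", "CTS2"])
first = auth.allocate()      # client logs on
auth.mark_failed(first)      # that server goes down
second = auth.allocate()     # client restarts and is given another server
```

After the first server is marked as failed, the restarted client is assigned the remaining one.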
In addition, load distribution can be handled by the platform solutions of the system
manufacturers, e.g. the Microsoft cluster solution. Thanks to the modular structure and the
N-tier architecture, the options provided by such a solution can be used optimally, since the
system can allocate resources according to requirements.
For details about the fail-safe operation of the DocuWare system see our White Paper on
Security.
The organization administrator can define appropriate cache capacities when setting up the
client. When the maximum capacity has been reached, part of the cache is emptied to make
room for new documents. Optionally, the cache may be emptied when the user session is
closed. In addition, you can specify that the cache should only ever contain current data,
i.e. that data over a certain age is automatically deleted.
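The three cache rules mentioned above (a size limit, an age limit and emptying at the end of a session) can be combined in a short Python sketch. The class name, the parameters and the oldest-first eviction order are assumptions for this example; the text does not state which part of the cache is emptied first.

```python
from collections import OrderedDict

class DocumentCache:
    """Client-side cache limited by total size and by entry age."""

    def __init__(self, max_bytes, max_age_seconds):
        self.max_bytes = max_bytes
        self.max_age = max_age_seconds
        self._items = OrderedDict()   # docid -> (content, time stored)

    def put(self, docid, content, now):
        self._items.pop(docid, None)
        self._items[docid] = (content, now)
        self._evict(now)

    def get(self, docid, now):
        self._evict(now)
        entry = self._items.get(docid)
        return entry[0] if entry else None

    def _evict(self, now):
        # Drop entries that are older than the configured age.
        for key in [k for k, (_, t) in self._items.items() if now - t > self.max_age]:
            del self._items[key]
        # Drop the oldest entries until the size limit is met again.
        while sum(len(c) for c, _ in self._items.values()) > self.max_bytes:
            self._items.popitem(last=False)

    def clear(self):
        """Corresponds to emptying the cache when the session is closed."""
        self._items.clear()

cache = DocumentCache(max_bytes=10, max_age_seconds=60)
cache.put(1, b"aaaaaa", now=0)
cache.put(2, b"bbbbbb", now=1)    # total 12 bytes > 10, so document 1 is evicted
```

Timestamps are passed in explicitly so that the behavior is easy to follow and to test.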
13. Glossary
Administrative Rights: Administrative rights are the rights for modifying archive definitions
and definitions within an organization.
File Cabinet: A file cabinet in DocuWare is a logical unit for receiving, storing, searching
and retrieving documents. A file cabinet always comprises the actual storage location where
the documents are physically held, together with the associated database tables, index data
and other descriptive or complementary elements belonging to a document. Optionally, a file
cabinet may contain a full-text index, which makes the documents accessible via full-text
information.
A range of storage media types are supported. "Logical disks" are allocated to the file
cabinets and are mapped to the physical storage media according to certain rules. A file
cabinet is a collection of indexed documents. Precisely coordinated access and
administrative rights can be assigned to file cabinets.
File cabinet administrator: User who has administrator privileges for a file cabinet. This
right is not transferable.
Owner: User who can create and manage a file cabinet. File cabinet owners manage the file
cabinet structure and allocate the access rights to it. The administration right is
transferable, i.e. the owner may delegate the tasks.
File cabinet profile: A file cabinet profile is the set of all access rights to a file cabinet.
Among others, this includes the access rights to index fields or documents, which may also
depend on certain index entries (field-dependent rights). A file cabinet profile can also
include administrative rights within a file cabinet. A file cabinet profile is defined within a
file cabinet.
User: In the context of this White Paper, a user is always a DocuWare user. Users can be
combined into groups. Users obtain rights by means of individual rights, profiles or roles.
COLD: COLD is the only proprietary file format in DocuWare. It is an ANSI format used when
reading in text spool data with the DocuWare COLD/READ instruction.
DocuWare Client: DocuWare Client is a generic term for Windows Client and Web Client.
The Windows Client is installed on a Windows computer and runs there as a native
application. Together with the DocuWare servers, it constitutes a working installation. A
DocuWare system always requires at least one Windows Client.
Using DocuWare Web Client, you can access DocuWare file cabinets via the Internet. An
installation on the client computer is not required. Web Client Server must be installed in
the DocuWare system.
DocuWare Servers: DocuWare servers is a generic term that covers all server modules, such as
Authentication Server, Content Server, Workflow Server, Imaging Server and Web Client
Server.
DocuWare System: The DocuWare system comprises a full DocuWare installation with all
necessary and optional components. A DocuWare system is characterized by shared hardware
and system settings for one or more "organizations". Occasionally the term "DocuWare" is
used to refer to the DocuWare system.
Document: A "document" is a term referring to all objects stored in the file cabinet which
from the user's perspective form a logical unit. A document may consist of any number of
files. These may be scanned data in TIFF or multi-page TIFF format. However, files from
output management systems, Office or graphics applications or even binary files are also
handled.
A file can represent one or more pages, but it may equally contain stamps, signatures,
annotations or other, similar information associated with the document. A document may
also consist of files with content in different formats, for example an Office file together
with an email file and several TIFF files. A unique identification is provided by the DOCID.
Field-dependent rights: Field-dependent rights are rights that depend on certain index field
entries.
Function profile: A function profile contains the access rights to features of the DocuWare
client. These include the access rights to menu functions and stamps. Function profiles are
defined at organization level. A function profile can also include administrative rights at
organization level.
Group: Independent of roles, users can be combined into groups to which roles can be
assigned. A group is therefore a collection of users. The only way to assign rights to a group
is via roles. Groups facilitate the administration of large numbers of users.
Header: DocuWare uses this XML format for storing the metadata (index data) and any
additions (annotations, stamps, etc.). The actual content is stored separately (for
performance reasons), except when exporting.
Rights: Rights allow the execution of particular functions within the DocuWare system.
Individual rights can be allocated in the file cabinets and at organization level.
Role: Within enterprise organizations, users are assigned different roles according to their
place in the hierarchy (e.g. approval of vacation requests) and their job description (e.g.
purchaser). These roles can be mapped in DocuWare in order to simplify installation and
administration. This is achieved by combining features and access rights into profiles, which
in turn are allocated to roles.
The DocuWare system also makes use of the role concept: certain roles with their associated
profiles are predefined in order to handle administrative tasks.
A role is a collection of profiles. Roles cannot contain individual rights. Predefined roles
facilitate the allocation of administrative rights.
System: See DocuWare system.
System administrator: The system administrator manages the system, particularly as far as
hardware is concerned. This includes the administration of database connections,
communication paths and document storage paths. The system administrator has no access
rights to organizational information. In particular, he/she cannot interfere with user
administration.
TIFF: Tagged Image File Format. The most important format in DocuWare is black and white
(1 bit) TIFF, compressed according to CCITT Group 4. This format has become the
established standard for electronic archiving of scanned documents. For the purposes of
archiving, DocuWare generates a file for every page of a document.
Predefined roles: Predefined roles are supplied with the DocuWare system; they guarantee
that the system works immediately after it has been installed. Predefined roles are: system
administrator, organization administrator and file cabinet owner.
Workflow: A workflow is a predefined sequence of steps which DocuWare performs
automatically when a predefined event occurs.
Workflow Server: The Workflow Server is the module that executes the workflows at runtime.
XML: See Header.
Access rights: Access rights relate to file cabinets or to menu features within the DocuWare
client.