Content Archiving With Infoarchive: View Point

This document discusses OpenText InfoArchive, an enterprise archival platform that supports high volume archiving of structured and unstructured data in a cost-effective and compliant manner. It provides insights into InfoArchive architecture, typical use cases, and design considerations for InfoArchive-based archiving solutions.


VIEW POINT

CONTENT ARCHIVING WITH INFOARCHIVE
Executive Summary

Deciding on the archival of legacy applications' data is a difficult but necessary step for many IT organizations today, considering the very high data volumes and the resultant costs of maintaining such applications. Similarly, live applications with a high storage footprint pose cost challenges to IT departments. Regulatory, compliance and legal requirements add further complexity to the data storage strategy. All this, coupled with the disparate data characteristics present in an organization (e.g. data formats, confidentiality requirements, retention needs, output adaptability), as well as increasing digital requirements (e.g. analytics, reporting, integrations), makes the selection of an archival platform a non-trivial task.

Due to the sheer volume and resultant costs, unstructured data (i.e. content) archival is one of the hot areas in the ECM landscape of many organizations. While there are many archival products on the market offering a number of options to handle the data complexity, cost and compliance requirements, it can be daunting to zero in on a cost-effective product while ensuring that compliance and performance expectations are met.

OpenText InfoArchive is one of the leading players in the archival landscape, allowing data to be archived in a cost-effective and compliant manner, while offering digital features and the flexibility to accommodate future extensions.

This white paper provides insights into the InfoArchive architecture, typical use cases, and design considerations for InfoArchive-based content archiving solutions (e.g. archive structure, search design, transformation strategy, retention strategy, storage options), which in turn can aid in designing a cost-effective, standards-based, performant, infrastructure-agnostic and compliant archiving platform.

External Document © 2020 Infosys Limited


InfoArchive – How It Works
InfoArchive is an enterprise archival platform based on the OAIS (Open Archival Information System) standard. It supports very high volume archiving scenarios (petabyte scale), with structured and/or unstructured data, and enhances the ability to secure and leverage critical application data and content.

It supports multiple languages, making it suitable for a global audience. Due to its enormous scaling capabilities, it is well suited as an application-agnostic archival mechanism: multiple applications can be archived on a single InfoArchive instance, separated from each other by security controls, and can have different design elements.

Being an OAIS-compliant archival solution, InfoArchive ensures that it can:

• Gather and accept appropriate information about the data from information producers
• Apply sufficient controls on the information to ensure long-term preservation
• Determine the access scope of the target user base
• Make the information understandable by the target audience, without any assistance from the original information producers
• Follow defined policies and procedures to preserve the information against reasonable contingencies, and to disseminate it as authenticated copies of the original data, or as traceable to the original
• Make the information available to the target user base
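The last two OAIS points — disseminating authenticated copies traceable to the original — are commonly implemented by recording a fixity digest at ingest and re-verifying it on access. Below is a minimal, product-independent sketch of that technique (this is not an InfoArchive API, only an illustration of the general idea):

```python
import hashlib

def fixity_digest(path: str, algorithm: str = "sha256") -> str:
    """Compute a fixity digest for an archived file, streamed in chunks
    so that very large files don't need to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_copy(original_digest: str, copy_path: str) -> bool:
    """A disseminated copy counts as 'authenticated' if its digest
    matches the one recorded at ingest time."""
    return fixity_digest(copy_path) == original_digest
```

In a real archive the ingest-time digest would be stored alongside the record's metadata, so any later copy can be checked against it.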

Architecture Under The Hood

InfoArchive has a classic three-tiered architecture: a native XML database, called xDB, at the backend for data persistence; the InfoArchive server as middleware, for computational duties; and the InfoArchive web UI, providing the user experience.

The newer versions of InfoArchive (4.x onwards, with the 16.5 stack being the latest) introduce a new architecture, shedding the Documentum components present until 3.x. The new InfoArchive application has an AngularJS and Bootstrap based UI, with the server relying on Spring Data. The result is an application that is lightweight and more performant.

The REST API layer follows a decoupled architecture paradigm, allowing a custom UI to be built for presentation where need be. It enhances interoperability, allowing the functionality to be used via REST calls, while data encryption capabilities help secure the data at rest.

The modular architecture of InfoArchive allows servers to be specialized for a specific function, e.g. ingestion or search. This gives better flexibility to handle SLA and availability requirements (e.g. add more search server instances to improve search responses, or add dedicated servers for data ingest).

InfoArchive uses XML as the format for long-term data preservation, enabling platform independence. The xDB database is used to store these files (the XML data and structure), which allows XML queries to be run against the data efficiently, at any level of detail. xDB itself is a proprietary, lightweight XML database.

InfoArchive allows multiple, different data structures to be stored in a single repository (archive), which can consist of both structured and unstructured data. Structured and unstructured data can constitute a single record (e.g. invoice content files + SAP data for invoices) and can be accessed and treated as a single object for querying and reporting.

A holding is the entity holding an archive's data (Archival Information Packages, or AIPs). An application can have one or more holdings. A holding can be thought of as a top-level (i.e. root) folder under which all the data resides; the data under a holding resides in a flat structure.

A single instance of InfoArchive can host multiple applications, completely isolated from each other and secured by access controls.

Loading The Data

An InfoArchive holding (archive) can be either table-based or SIP-based (SIP: submission information package). A table-based archive uses a schema to ingest structured data and linked files from a database table (not covered in this document).

For content (i.e. unstructured data), a SIP-based archive uses submission information packages to ingest data from files, data records, or compound records. A SIP is a ZIP file containing an index file, a descriptor file and one or more content files: the unstructured data is ingested into InfoArchive as SIPs containing the content files, a metadata XML file (holding the indexes for all these content files) and a descriptor file (XML) describing the contents of the SIP. In the case of structured data, the SIP contains only the XML files.

Post upload, the SIP becomes a logical partition (AIP) in InfoArchive. The metadata is stored in xDB, and the ZIP file is pushed to the storage location.

Data upload can be performed using batch jobs (scripts) or via REST calls, allowing near real-time data ingest.



Finding The Data

InfoArchive deploys a 2-step search mechanism. Each logical partition (AIP) is associated with at least one search criterion; a partition would typically contain tens of thousands of records (Archival Information Units, or AIUs).

At a high level, InfoArchive search functions as follows:

• An index is created against the search criteria, which is mapped to the specific AIP (i.e. the partition key)
• The 1st step search locates the AIP which contains the record. Since AIPs are low in number, this is very quick. This search is carried out in the main database, one of the databases within xDB
• The 2nd step search is performed within the AIP, for the specific record (AIU). Due to the smaller number of records to search from (typically multiples of 10,000), the record is located quickly

To keep retrieval performant, a caching mechanism is supported that allows content to be cached in local storage.

Storage Options

InfoArchive supports a number of storage solutions, e.g. SAN, cloud and CAS. This gives the enterprise a number of options to choose from, based on availability, performance and cost considerations.

Integration with cloud storage is supported via the S3 interface, allowing a standard way of integrating with different providers (both EMC ECS and Amazon S3 can be integrated using the S3 interface). Cheaper options like Glacier allow tiered data storage, so that low-value, infrequently accessed data can be kept on low-cost storage media.

Data Security

For data security requirements, InfoArchive supports content and metadata encryption as well as data masking. It supports AES encryption using Bouncy Castle and the Java Cryptography Architecture (JCA).

As a typical use case, personally identifiable information can be encrypted while ingesting the data. The search UI will still allow authorized users to search the encrypted data; the search results will be displayed in masked format (or in decrypted format for authorized groups).

Compliance

On the compliance end, InfoArchive 16.5 has built-in features enabling retention/purge management and hold capabilities. The built-in retention capabilities allow you to apply retention to records at the application, AIP or record level, using the UI/REST APIs, during ingestion or via batch jobs. Multiple retention policies can be applied to the same record. It provides fixed, event-based and mixed-mode* retention capabilities.

InfoArchive is certified against many compliance standards, such as Dodd-Frank (US), MiFID2/R (EU), OSFI E-13 (Canada) etc.

*Mixed mode: retain the object for a fixed duration, but dispose of it immediately if an event implies that the object is no longer required.


Archival Scenarios and InfoArchive Fitment

Archival typically has 2 use cases: cold archiving and live archiving.

Cold archiving covers data that is no longer part of live business processes, where it is more cost effective to decommission the application containing this data. However, the data can still be valuable to the business and important to preserve for compliance reasons, hence this data set is archived, usually in one go (or in a limited, defined number of migration cycles). Access to this data is infrequent, so performance criteria for the archival application are quite relaxed.

Live archiving refers to the use case where live business data from the leading application is archived regularly, to obtain cost savings and keep the leading application performant. While this data set too is read-only, the archival process is always running. Live archiving is characterized by a high ingest rate, as well as higher search/retrieval volumes compared to cold archiving, often with a custom or 3rd-party application accessing the InfoArchive data. Since the data is still used by the leading live application, performance and availability requirements are often quite stringent.

The sections below provide more detail on both archival scenarios, and the fitment of InfoArchive for each case.

Cold Archival

Traditional data archiving refers to the process of moving data from the primary application to read-only, cost-effective storage with search and retention capabilities, coupled with restricted access patterns (e.g. accessible only to administrators or a limited set of users). The data is moved based on a defined set of criteria (e.g. age), either one time or in multiple steps. This scenario is often associated with business application decommissioning.

Such archival applications typically have relaxed SLAs and availability requirements compared to live business applications, with the data sitting in retired status and accessed only for legal/compliance requirements. The underlying premise of this approach is that the data has reached end-of-life from a business value perspective, and needs to be retained primarily because of regulatory/legal requirements.

For this scenario, InfoArchive is a straight fit, providing a low-cost, flexible solution with the ability to store large volumes of data for extended periods of time, in a client-agnostic format. It allows multiple archival applications to be created within a single InfoArchive instance, with options for a number of storage media (e.g. ECS, S3) to be configured. The built-in retention mechanism ensures that data is retained as per compliance requirements, eliminating accidental or malicious data loss. Robust data encryption capabilities ensure that sensitive data, while lying in an inactive state, still has safety measures equivalent to live business data. Role-based access permits only authorized users (typically a small number in this scenario) to access the data.

Live Archival

A slightly different archiving approach has increasingly found favor with large organizations (especially financial institutions) in the past decade. Owing to the ever-increasing volumes of customer data (e.g. account statements, marketing e-mails, text messages, notifications), these companies often look for an archival solution which provides the traditional archival cost advantages, but can also match the SLA requirements of the live business applications, along with digital enablement for integration, reporting and analytics requirements.

In this scenario, the archival application is typically integrated with the leading business application, serving as its data store, or is sometimes directly exposed to users via a custom search UI. Writes can happen synchronously, near-real-time (e.g. at the end of a workflow), or asynchronously (e.g. via batch jobs). Read SLAs of the applications are expected to be on par with the live business applications. Archival application availability is also critical, due to it being part of the business process.

This approach can decrease the load on a live application, increasing performance and reducing hardware costs, by offloading some or most of the application's data to InfoArchive. The data that you choose to archive, and the frequency of archiving, is based on your business rules. For example, on a weekly basis, a bank can archive all bank statements that are a year old or more.

For the live archiving scenario, InfoArchive's fitment needs to be thoroughly assessed, based on the availability and SLA requirements of the application. While InfoArchive is quite performant when it comes to ingest, search or retrieval, its high availability capabilities have some limitations (the underlying single instance of xDB presents a single point of failure).
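The weekly bank-statement rule mentioned above amounts to a simple age-based selection over the live data set. A minimal sketch (field names are hypothetical, not a product schema):

```python
from datetime import date, timedelta

def select_for_archival(statements, today, min_age_days=365):
    """Pick the records old enough to move to the archive -- a sketch of
    the 'archive all statements a year old or more' business rule.
    statements: list of dicts with a 'statement_date' date field."""
    cutoff = today - timedelta(days=min_age_days)
    return [s for s in statements if s["statement_date"] <= cutoff]
```

A weekly batch job would run this selection, package the results into SIPs, ingest them, and only then delete the originals from the live application.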



Design Considerations

Search Architecture

The InfoArchive search architecture, as described earlier, relies on a 2-step search mechanism: the 1st step identifies the AIP containing the record, while the 2nd step searches for the given record within that AIP.

For the search design, the AIP size and mode should be chosen carefully. If the AIP size is too small (e.g. 10 records per AIP), then for a large repository the AIP count itself will be very high, impacting 1st step performance (as per OpenText, past 200,000 AIPs search performance starts to suffer). On the other hand, a very large AIP can choke the 2nd step search, eliminating the advantages of the 2-step approach. It should also be evaluated whether background search can serve the purpose (e.g. for scheduled reporting tasks).

For batch ingest, the private AIP mode works best (a large number of records, e.g. 50,000 – 100,000, is put into a single AIP and ingested at once). Private mode means that the data becomes searchable immediately once ingestion is completed. On the other hand, if the incoming data volumes are small (typical in live archiving scenarios), then aggregated AIP mode can be a better option (the AIP is created with volume and time thresholds, and incoming records are pushed into it until a threshold is reached, upon which it is closed). For medium-size data volumes (typically > 1,000 to < 10,000), pooled AIP mode might be a better option (multiple AIPs share an xDB library, and multiple AIPs can be open simultaneously).

From an availability perspective, since xDB is a single point of failure, the search architecture should be carefully vetted against the SLA requirements. For live environments, a custom search/caching layer may need to be built on top of InfoArchive to meet the required availability. Alternatively, an active DR architecture can be considered (with xDB replication enabled, and incremental content replication in place).

Retention Design

InfoArchive supports fixed, event-based as well as mixed-mode retention. However, some factors need to be carefully considered while choosing the retention type for the given object types. Two factors need to be evaluated carefully for an effective and easy-to-use retention architecture:

• Functionality required
• Capacity to be handled

Functionality-wise, InfoArchive has specific behavior when used with certain hardware combinations, so if there is a need for mixed-mode retention, the requirements should be thoroughly evaluated against the InfoArchive capabilities and the product stack in use. For example, when InfoArchive is configured with ECS cloud storage, the ECS storage will apply its own hardware retention policy, on top of the InfoArchive retention, based on the duration supplied by InfoArchive. ECS treats this duration as fixed: it doesn't allow an item to be purged before this fixed date, nor does it allow the policy duration to be shortened (though it does allow the duration to be extended). This is not the case with S3 or SAN. Scenarios like this might have cost and possibly compliance implications in certain cases, hence requirements should be clearly articulated early in the project, and the design should take this limitation into account.

Capacity-wise, if the number of records is massive (e.g. billions), then the retention design should be carefully thought through, otherwise the managed item database (retention database) may become huge, with design and cost implications. For example, in event-based retention, a retention policy object is created for each record. If we assume that the policy objects for 1 record require 8 KB of storage space, then event-based retention set up on 1 billion records will result in policy objects requiring almost 7.5 TB of space. This, coupled with the fact that the managed item database needs to reside on block storage on a single xDB instance, may require a very large VM or physical server for xDB, creating challenges for the infrastructure architecture and maintenance. For very large data sets, batch-level retention policies are therefore better suited.
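The capacity estimate above is easy to reproduce. Using the paper's assumption of 8 KB of policy objects per record, 1 billion records come to roughly 7.45 TiB, matching the "almost 7.5 TB" figure:

```python
def retention_overhead_tib(records, bytes_per_policy=8 * 1024):
    """Estimated total size of per-record event-based retention policy
    objects, in TiB, under the paper's 8 KB-per-record assumption."""
    return records * bytes_per_policy / 2**40
```

Running the same formula for batch-level policies (one policy object per AIP of, say, 100,000 records instead of per record) shrinks the overhead by five orders of magnitude, which is why the paper recommends batch-level retention for very large data sets.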



Availability

For the live archiving use cases where availability SLAs are stringent, the application design should take into consideration InfoArchive's reliance on a single xDB instance (the web and IAS layers support high availability). There are design possibilities to mitigate this risk to a certain extent, e.g. spreading the databases within xDB across different machines, to protect against complete availability loss (searches will still be functional if the main database is up while others are down). However, even in this scenario the availability risk remains (e.g. the main database itself may go down, making the InfoArchive metadata unavailable). While the web UI and InfoArchive server support high availability, xDB as a single point of failure limits the overall application availability and failover options.

A possible option to handle this is to create an xDB replica and automate the switch from the primary instance to the replica upon failure detection. xDB doesn't provide any tools or mechanisms to detect primary instance failure and fail over automatically, hence a custom script/tool needs to be used for this. This provides a limited level of failover capability; however, failover will not be seamless, as the InfoArchive server needs to be rebooted for the failover to take effect (i.e. to point to the new instance), so the switchover will be visible to the users.

AIP Storage and Retrieval

Like many other applications, InfoArchive's search/retrieval performance is greatly dependent on the solution design. As described earlier, search performance requires thought around the data partitioning strategy. Similarly, data retrieval performance can vary a lot based on the AIP creation strategy (i.e. data packaging). For example, retrieving a record from a 15 GB AFP (a data format) record (i.e. the entire file in one AIP) will take considerable time, due to the need to load the AFP file from disk, parse it, and then retrieve the document. On the other hand, if the same AFP is split per record while ingesting, record retrieval will be very quick, but the downside is that the storage required will be much greater than the original document size (3 – 4 times).

Hence the AIP creation strategy should strive to strike a balance between the two extremes. For example, AIPs with n*100 records in one file (where n <= 50, typically) offer better performance while still providing space savings.

Encryption Design

Encryption is invaluable as a mechanism for protecting data, particularly personally identifiable information, from disclosure to unauthorized channels. For encryption to be effective across large enterprises, the encryption keys must be managed with the same level of attention given to the confidential data they protect. Enough protection should be built into the solution to ensure that the keys are not easily guessed, disclosed or lost, and that the data they encrypt can be recovered through authorized channels.

InfoArchive supports a number of encryption providers (e.g. Bouncy Castle, JCA), offering multiple encryption options. However, it provides limited options when it comes to integrating with more sophisticated centralized encryption and key management solutions. For example, it doesn't support integration with Vormetric or Voltage Security; Gemalto-SafeNet, which offered some of these capabilities in an earlier version, is not supported in the 16.x versions. Hence the use of InfoArchive should take into consideration the enterprise key management solution in place, if any, and the design challenges it may introduce for the InfoArchive architecture.

Transformation

In the archival space, with a growing number of application areas and content sources involved, the diversity challenges related to source data have also increased. This presents a heterogeneity challenge: how to present AFP, PDF, Word or other format content in a unified, read-only format to the end user. The source content is often varied in nature, e.g. customer statements (AFP), communications (e-mails), invoices (JPEG/TIFF), manuals (PDF) or other content in different formats. To make it suitable for a support executive to resolve a consumer query quickly, or for an analyst to derive meaningful insights from the content, format homogeneity is often a pre-requisite.

For InfoArchive, while transformation is not a core product capability (except for some PDF rendering features) and needs to be handled using additional solution elements, it can still have a large impact on content storage (and thus costs), as well as on retrieval performance (important in live archival scenarios).

Transformation can typically be performed in 2 different ways:

1. Ingest transformation – as the name implies, the content is transformed at the time of ingest
2. On-the-fly transformation – the content is stored in its original format and transformed only at the time of retrieval

The decision in favor of either approach (or a combination of both) should be carefully vetted against storage costs, retrieval performance, as well as regulatory requirements (i.e. keeping the 'original' content copy). For example, converting a 10 GB AFP to PDF while ingesting it will reduce the size a bit, but will introduce performance challenges for individual record retrieval. The same AFP, if split at the record level, will make retrievals very efficient, but the size will increase multifold (3 to 4 times compared to the original), due to the duplication of resources in each PDF. In this example, a middle approach might work better: a PDF with n records (where n could range from the 100s to a couple of 1000s) typically offers the performance benefits, while still not bloating the storage.

Some add-ons (e.g. ProArchiver) offer certain storage and performance advantages through a custom storage format; however, this needs to take into consideration the vendor lock-in and the subsequent dependency on the proprietary format.
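The "middle approach" of packaging n records per output file, and the storage effect of duplicating shared resources (fonts, images) into every file, can be sketched as follows. The storage model is a rough illustrative assumption, not measured product behavior:

```python
def batch_records(records, batch_size):
    """Group records into fixed-size batches, one output file per batch --
    the middle ground between one huge file and one file per record."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def estimated_storage(num_records, record_bytes, shared_bytes, batch_size):
    """Rough storage model: every output file repeats the shared
    resources, so smaller batches mean more duplication."""
    files = -(-num_records // batch_size)  # ceiling division
    return num_records * record_bytes + files * shared_bytes
```

Plugging in sample numbers shows the trade-off the text describes: per-record splitting (batch_size=1) repeats the shared resources in every file and can multiply the total size several times over, while a few hundred records per file keeps the duplication marginal without forcing whole-file parsing on every retrieval.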



Conclusion
InfoArchive as an archival solution can bring valuable benefits to organizations handling large content volumes. It can make existing applications more performant, as well as provide a cost-effective data store for new ones, helping bring down IT budgets significantly. However, it does not offer a one-size-fits-all solution. Making design decisions after careful evaluation of the business requirements, and with the product capabilities in mind, will ensure that users are able to maximize the business benefits, in addition to realizing the IT cost savings.

About the Author

Ravi is one of the ECM architects at Infosys, working primarily with OpenText technologies. When not solving design challenges, he can often be found immersed in automobile materials.

For more information, contact [email protected]

© 2020 Infosys Limited, Bengaluru, India. All Rights Reserved. Infosys believes the information in this document is accurate as of its publication date; such information is subject to change without notice. Infosys
acknowledges the proprietary rights of other companies to the trademarks, product names and such other intellectual property rights mentioned in this document. Except as expressly permitted, neither this
documentation nor any part of it may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, printing, photocopying, recording or otherwise, without the
prior permission of Infosys Limited and/ or any named intellectual property rights holders under this document.

Infosys.com | NYSE: INFY Stay Connected
