IBM Software July 2012

White Paper

IBM InfoSphere Guardium Data Redaction:


Reconciling openness with privacy
Document protection for regulatory compliance and risk reduction

Introduction: The market demand for redaction
Until recently, the need to delete sensitive information in documents was restricted to the national intelligence community. In business, government and nonprofits, the need for redaction was much rarer. But the need for redaction is growing quickly, as organizations face regulatory and business requirements to enforce document privacy, not just by controlling access to entire documents, but by selectively deleting private units of information. This means automated redaction solutions are rapidly becoming more important.

The conflict between openness and privacy
With the data explosion in IT today, storing more data is a challenge, but managing it is even more of a challenge. Proper data management ensures that the right people have access to information and that they are using it for legitimate purposes. This creates a conflict between ensuring that those entitled to see given types of information can access it easily and simultaneously withholding access to information that they are not allowed to see (see Figure 1).

To take an example from the U.S. government: the Freedom of Information Act (FOIA) is intended to hold government organizations more accountable for their actions by making information about those actions available on demand. On the other hand, the same regulation requires that those ordering the documents must not see any sensitive personal or national security information.

Similarly, the Health Insurance Portability and Accountability Act (HIPAA) is designed to enhance sharing of documents between physicians, hospitals and insurers while preventing the unauthorized disclosure of individuals’ personal healthcare information. For example, consulting physicians need access to individuals’ electronic health records, but they do not need to see billing information that is unrelated to their job duties.

Regulation                                       Functional area
FOIA                                             Government
United States Federal Rules of Civil Procedure   Legal eDiscovery
HIPAA                                            Health
PCI                                              Bank cards
Sarbanes-Oxley Act                               Corporate governance

Figure 1: Sample regulations by sector.



Both regulatory requirements and business pressures make redaction essential.

• Redaction can satisfy governmental regulations, including those in data privacy laws, without restricting the legitimate use of information, thus avoiding sanctions, penalties and costs associated with addressing compliance violations after the fact.
• Redaction can accelerate business interaction by sharing information with customers, partners and other third parties without exposing them to sensitive information that they should not see. Organizations often find strong business value in sharing information, but they must take care to limit exposure to the minimum needed, to avoid the embarrassment and competitive risk of leaks.

Traditional IT solutions are not granular enough
What can technology do to achieve this balance between openness and privacy? The first line of defense consists of familiar access controls for documents, often defined per role. Another part of the solution is data loss prevention (DLP) software, which can restrict the transmission (such as through email) of sensitive information from an authorized user to an unauthorized user. Encryption is also an important means of ensuring data confidentiality. But these approaches, with some exceptions, tend to be blunt instruments that too often restrict access more than is really necessary.

These forms of document security are essential; but for a more flexible, fine-grained approach, redaction is needed as well.

“Enterprises must now add to the basic characteristics of data protection – preservation, availability, responsiveness and confidentiality – the what and where of data.”
- David Hill, Analyst, Mesabi Group

To do the job correctly, redaction software needs the following capabilities:

• The software must securely and completely delete all relevant data. Some ad hoc solutions layer a black rectangle over the data, leaving the private data underneath. At the trial of former Illinois Governor Rod Blagojevich in 2010, documents with information about U.S. President Barack Obama’s relationship to the case were officially released with redactions, but the underlying text could be easily recovered by simply copying and pasting it.
• To comply with some regulations, redaction software must be able to retain the original pre-redaction version in a safe place; for other regulations, it must securely delete such versions.
• The tools should allow the labeling of redactions. Rather than simply masking text, some regulations require a meaningful label for the redacted text, such as the words “Social Security Number” or a regulatory section-number, for readability and justification of the redaction.
• To address the growing amount of electronically stored information, the solution must scale to large numbers of documents.
• The software should automatically identify suggested redactions, but also allow for manual review, so that a compliance officer can accept, reject or refine suggested redactions.
• The interface should allow secure online viewing, in addition to the creation of redacted documents. Viewing documents in the browser is more convenient and more secure than issuing a file that could more easily leak out.
• In some use cases, the web viewer must give users with proper permissions the ability to securely retrieve some types of redacted information as long as they specify a valid business reason and their access is logged. Without the flexibility of this feature, redaction policies must be either overcautious and redact information that the user may need, or too permissive, exposing information for the user’s convenience but revealing more information than needed.
• The solution should log the information redacted, along with the documents, pages and text sections that were viewed, for future auditing.

IBM® InfoSphere® Guardium® Data Redaction addresses all these capabilities and more.

Privacy vs. security

Security and privacy are related, but they are distinct concepts.

Security is the infrastructure-level lockdown that prevents or grants access to data based on authorization. It is the realm of passwords and encryption. In contrast, privacy control validates that already-authenticated users have a legitimate business need to see specific information. These needs are usually specific to a job function and defined by regulatory or management policy.

There are many security solutions that prevent unauthorized user access. However, there are very few privacy solutions that protect sensitive data from improper use by employees and other authorized users who might pry into data that they have no legitimate business purpose to see.

Two recent cases illustrate this distinction: doctors and nurses at UCLA Medical Center were caught going through Britney Spears’ medical records. And during the 2008 presidential campaign, U.S. State Department contractors viewed passport records of presidential candidates, including Barack Obama.

This was not hacking. These people had passwords; they needed access permissions as part of their day-to-day work. The problem was that they had no need-to-know. They accessed the records out of mere curiosity, not out of a legitimate functional need.

With a redaction-based web viewer, users see the documents redacted according to their roles: in a hospital, physicians and financial personnel will see different information; and in the military, combat officers will see different information from logistics specialists. These redactions can be made very conservative, redacting information if there is any doubt about whether it should be visible to users of a given role. Thus, where permitted, authorized users can state their business purposes and fill in some of the redacted information in their documents, knowing that all accesses are logged.
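
The idea of revealing redacted information only against a stated business purpose, with every access logged, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's implementation: the `REVEALABLE` table, the `RevealService` class, the role names and the log fields are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical policy: the information types each role may reveal on request.
REVEALABLE = {
    "physician": {"ssn"},
    "billing_clerk": set(),
}


@dataclass
class RevealService:
    """Grants or denies reveal requests and records every attempt."""
    audit_log: list = field(default_factory=list)

    def reveal(self, user, role, info_type, value, reason):
        # A reveal needs both a non-empty business reason and a role whose
        # policy permits revealing this type of information.
        permitted = info_type in REVEALABLE.get(role, set())
        granted = bool(reason.strip()) and permitted
        # Denied requests are logged too, so auditors can see attempted access.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "info_type": info_type,
            "reason": reason,
            "granted": granted,
        })
        return value if granted else f"[{info_type} redacted]"


service = RevealService()
shown = service.reveal("dr_lee", "physician", "ssn", "078-05-1120", "verify patient identity")
hidden = service.reveal("clerk_7", "billing_clerk", "ssn", "078-05-1120", "curiosity")
no_reason = service.reveal("dr_lee", "physician", "ssn", "078-05-1120", "   ")
```

Roles missing from the table reveal nothing, which mirrors the conservative default described above: when in doubt, the information stays redacted.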
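
Two of the capabilities listed earlier, complete deletion and labeling, can be shown together in a small sketch. It assumes the sensitive spans have already been located and works on plain text only; the `redact` function and the sample sentence are hypothetical. The point is that the original characters are removed from the output rather than covered, so nothing remains to be copied and pasted.

```python
def redact(text, spans):
    """Replace each (start, end, label) span with its label.

    The original characters are removed from the result, not overlaid.
    Spans must not overlap; they are applied from right to left so that
    the offsets of earlier spans stay valid.
    """
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out


source = "Applicant SSN: 078-05-1120, phone 555-0100."
ssn_start = source.index("078-05-1120")
phone_start = source.index("555-0100")
spans = [
    (ssn_start, ssn_start + len("078-05-1120"), "Social Security Number"),
    (phone_start, phone_start + len("555-0100"), "Phone Number"),
]
redacted = redact(source, spans)
```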

Automation is essential
As redaction takes on an important role across enterprise IT departments, manual redaction is insufficient, whether with a black marker or with ad hoc electronic solutions.

High volumes of electronic documents make manual redaction expensive and error-prone. People who have the regulatory knowledge to identify private information are too expensive for painstaking rote tasks, and even if they are assigned to such a task, they are only human; they are slow and make mistakes.

Even if the black marker method were feasible for isolated documents, managing the workflow involved makes it prohibitive for large document collections. The redaction process includes identifying the documents in the repository that need redaction; finding the sensitive information in each document; cross-referencing the semantic type of each unit of information to the role of the recipient and determining whether to redact it; creating the redacted copy; reviewing the redaction and then redacting it again if needed; and finally storing the redacted copy in a way that links to the original in the repository.

Automating the redaction process is essential for making this time-intensive activity more cost-effective, allowing organizations to better comply with regulations, preserve their competitive advantage, secure their intellectual property and safeguard their public reputation.

Free-text and forms
The IBM InfoSphere Guardium Data Redaction solution provides automated redaction that works in two ways, depending on whether the document is free-text or a structured form.

For free-text documents, the redaction engine automatically identifies and extracts relevant units of information (see Figure 2). Simply using text patterns is not enough: there is no formal pattern, for example, that captures personal names. Dictionaries are not sufficient either, since homonyms can disguise meanings (is “bush” a plant or a former U.S. president?). Instead, it is necessary to combine regular expressions and dictionaries with a syntactic analysis of the text surrounding the relevant information.

Figure 2: A free-text document in InfoSphere Guardium Data Redaction Manager; the text highlighted in blue is to be redacted.

Structured forms, on the other hand, require a different technique, in which the known form layout is leveraged for accurate redaction. This allows even low-quality scans with
handwritten text to be processed; if they are accidentally skewed or resized, they can be straightened and aligned with a template. To accomplish this, a reviewer begins by redacting a sample form (for example, a blank) and marking the sensitive fields to be redacted, together with elements that identify instances of the form, such as the form title or identification number. This creates a template for subsequent forms. The software redaction solution matches templates to forms, eliminating the costly presorting of different form types. Next, it applies a template to each form, precisely deleting the marked fields based on their position (see Figure 3).

Figure 3: On the left is a skewed scan of a form; on the right is the automatic identification of the sensitive field in the form, as seen in the InfoSphere Guardium Data Redaction Manager.

Complete document format coverage
InfoSphere Guardium Data Redaction processes documents in many formats: PDF, TIFF, Microsoft Word, plain text, XML files and more.

Some input documents, such as Microsoft Word and many PDFs, carry text in them, but others like TIFF and some PDFs are pure images. For image files, the solution applies high-quality optical character recognition, and then processes the text. If there are any photographs or other graphics in a document, the solution preserves them as such.

Though most sensitive information arrives as text, images too can contain sensitive information. For example, an X-ray image may identify a patient’s name, a portrait photograph may betray an identity or a satellite image may expose the location of a military unit. In the InfoSphere Guardium Data Redaction system, sensitive images can be located in a form using templates, or marked by a reviewer in the web-based Redaction Manager.

Each of the document formats (PDF, TIFF, Microsoft Word, plain text and XML files) can serve not only as input but also as output, and the choice of output type is configurable. In some cases, regulations or business needs require the redacted document to be in “native” format, the original format of the
input. In other cases, it is necessary to output all documents, regardless of the input, into standard graphical formats preserving the precise layout, such as TIFF or PDF. Alternatively, if further machine processing is needed, plain-text output can be specified for all input formats.

Finally, the InfoSphere Guardium Data Redaction system automatically removes the wide variety of hidden information that is often stored in PDFs and Microsoft Word documents, even without the user’s knowledge. This includes hidden layers; comments and scripts; white text and tiny fonts; metadata such as the names of document editors and the creation date; and historical contents of a document preserved with editing features like Undo and Track Changes. All these are safely deleted as part of the redaction workflow.

Efficient workflow
With thousands of documents to redact, workflow management is essential. Simply printing out the pages and deleting the sensitive text won’t do it, since it would be impossible to keep track of the stacks of paper, and the same is true with ad hoc redaction of masses of electronic documents. The only solution is for redaction to fit into enterprise content management (ECM) processes. The InfoSphere Guardium Data Redaction system supports a variety of such workflows out of the box, and can read and write documents in ECM systems such as IBM FileNet® P8 and IBM Content Manager 8:

• Batch redaction: This workflow automatically redacts large numbers of documents in a repository. Depending on regulatory requirements, a reviewer can then examine from 0 to 100 percent of the redacted documents and approve, reject or refine the redaction as needed. In this way, the redaction solution combines the strengths of machine processing and human domain knowledge.
• On-demand redaction: A workflow used when individual documents must be processed as needed. For example, a business user may need to cleanse private information from a document before emailing it to a business partner. The sender can open the document in Redaction Manager, which instantly suggests text to be redacted. The sender can then refine this redaction before releasing the document.
• Secure document viewing: For this workflow, InfoSphere Guardium Data Redaction provides a document viewer. All documents, regardless of original format, are displayed to the user in a uniform way in the browser, with no need to download a document that could subsequently be leaked. Sensitive information in the document is securely deleted according to the recipient’s job role. In accordance with regulations, this data is typically deleted in a way that does not allow it to be viewed; for some types of information, users may have permission to securely retrieve the redacted units after specifying their need to know (see Figure 4).
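
The batch workflow's review step, in which a reviewer examines anywhere from 0 to 100 percent of the redacted documents, might be sketched as below. The function names, the document identifiers and the use of uniform random sampling are assumptions made for illustration; the paper does not say how documents are chosen for review.

```python
import random


def select_for_review(doc_ids, review_fraction, seed=None):
    """Pick the batch-redacted documents that a reviewer should examine.

    review_fraction runs from 0.0 (no review) to 1.0 (review everything).
    """
    if not 0.0 <= review_fraction <= 1.0:
        raise ValueError("review_fraction must be between 0 and 1")
    count = round(len(doc_ids) * review_fraction)
    # A seeded generator makes the sample reproducible for audit purposes.
    return sorted(random.Random(seed).sample(list(doc_ids), count))


def group_by_decision(decisions):
    """Group reviewed documents by the reviewer's decision."""
    outcome = {"approve": [], "reject": [], "refine": []}
    for doc_id, decision in decisions.items():
        outcome[decision].append(doc_id)
    return outcome


batch = [f"doc-{i:03d}" for i in range(200)]
sample = select_for_review(batch, 0.10, seed=1)
outcome = group_by_decision({"doc-001": "approve", "doc-002": "refine", "doc-003": "approve"})
```

Setting the fraction to 0.0 corresponds to fully automatic redaction, and 1.0 to reviewing every document.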
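
The free-text approach described under "Free-text and forms" combines regular expressions, dictionaries and an analysis of the surrounding text. The toy sketch below shows how the three signals can work together on the paper's "bush" example. Everything in it is an assumption: the pattern, the tiny word lists and the function name are hypothetical, and the capitalization check plus preceding-title cue is only a crude stand-in for real syntactic analysis, not the product's extraction library.

```python
import re

# Illustrative resources only; a real extractor would use far larger ones.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # regular expression
SURNAMES = {"bush", "obama"}                          # dictionary
TITLES = {"president", "governor", "senator", "mr.", "mrs.", "ms.", "dr."}  # context cue


def find_sensitive(text):
    """Return sorted (start, end, label) spans for SSNs and for names in context."""
    spans = [(m.start(), m.end(), "Social Security Number")
             for m in SSN_PATTERN.finditer(text)]
    tokens = list(re.finditer(r"[A-Za-z][A-Za-z.]*", text))
    for prev, tok in zip(tokens, tokens[1:]):
        word = tok.group().rstrip(".")
        in_dictionary = word.lower() in SURNAMES
        # Context check: capitalized and directly preceded by a title.
        looks_like_person = word[0].isupper() and prev.group().lower() in TITLES
        if in_dictionary and looks_like_person:
            spans.append((tok.start(), tok.start() + len(word), "Name"))
    return sorted(spans)


sample = "President Bush planted a rose bush. SSN 078-05-1120."
spans = find_sensitive(sample)
```

Here the dictionary alone would flag both occurrences of "bush"; the context check keeps only the one that follows a title.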

Figure 4: In this example, a user is providing a business justification to access redacted SSN information; the user’s permission level determines whether or not the requested information will be revealed.

Policy-based redaction
To gain maximum business value in the redaction process while also minimizing deployment costs, the redaction solution must have the ability to quickly and easily implement the policies defined by regulatory frameworks, typically by cross-referencing the recipient’s role against the type of information to be redacted.

The relevant roles are already defined in many enterprises. The redaction solution leverages these roles and links them to fine-grained permissions drawn from regulations, creating a privacy compliance system that directly meets requirements at minimum cost.

Thus, a physician might be allowed to see a patient’s medical information, but not sensitive financial information, while the reverse is true for the hospital’s billing clerk.

Likewise, in eDiscovery as part of legal cases, the United States Federal Rules of Civil Procedure specify that a litigant’s attorney can see all client documents in full, including privileged information, while the opposing counsel can see the documents minus the attorney-client privileged information.

The InfoSphere Guardium Data Redaction architecture
InfoSphere Guardium Data Redaction is centered on a server that automatically identifies sensitive information and generates the redacted documents. The server controls the various workflows needed to manage the redaction process, typically accessing ECM systems with thousands or millions of documents. For maximum efficiency and better hardware utilization, the server can run multiple redaction sessions in parallel in multi-core CPUs or across multiple machines.

InfoSphere Guardium Data Redaction features a modular architecture, so the various components (such as repository connectors, information extractors, and authentication and policy libraries) can be plugged in as needed to support specific redaction requirements.

For programmatic access, the server exposes redaction services over SOAP or Java, or simply by placing files in a file system or ECM folder. This enables integration into existing enterprise redaction workflows. There are two graphical user interfaces: a web-based Redaction Manager that offers redaction review and refinement capabilities, and a secure document viewer that enables documents to be presented on the web without requiring software installation for the end user (see Figure 5).
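
The cross-referencing of recipient role against information type described under "Policy-based redaction" amounts to a lookup table. The sketch below encodes the paper's physician, billing clerk and eDiscovery examples. The table layout, the role and type names and both functions are hypothetical, and the treatment of unknown roles (redact everything) is an assumption chosen to be conservative.

```python
# Hypothetical policy table: for each recipient role, the information
# types that must be redacted before the document is shown.
POLICY = {
    "physician": {"financial"},
    "billing_clerk": {"medical"},
    "client_attorney": set(),
    "opposing_counsel": {"attorney_client_privileged"},
}


def must_redact(role, info_type):
    """Cross-reference the recipient's role against the information type."""
    if role not in POLICY:
        return True  # assumption: an unknown role sees nothing
    return info_type in POLICY[role]


def view_for(role, units):
    """units is a list of (info_type, text) pairs; return what the role may see."""
    return [f"[{info_type} redacted]" if must_redact(role, info_type) else text
            for info_type, text in units]


chart = [("medical", "Diagnosis: fracture"), ("financial", "Balance due: $1,200")]
physician_view = view_for("physician", chart)
clerk_view = view_for("billing_clerk", chart)
```

Keeping the policy as data rather than code is what would let a table like this be derived from roles an enterprise has already defined.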

[Figure 5 diagram labels: review tool for manual redaction and visual verification (web application); IBM InfoSphere Guardium Data Redaction server with core, repository connector (ECM / file system) and internal API (in-process Java client); SOAP API (SOAP client); Java API (remote Java client).]

Figure 5: The InfoSphere Guardium Data Redaction solution architecture.

Case in point: Data redaction at a health insurer
A large private health insurer faced new regulations requiring it to share health records with its customers, healthcare providers and the National Health Service. However, the same laws required the insurer to carefully tailor the released personal health information according to the role of the document recipient. This insurer also faced financial laws that required it to delete credit card numbers stored in its archives.

The insurer needed to make documents accessible to independent insurance agents and other business partners. At the same time, preserving privacy was essential to maintaining the company’s good reputation for respecting its customers’ rights.

Previous privacy practices, such as manual redaction and document-level access controls, put the insurer at risk of regulatory violation. If the organization shared documents without checking them, it risked exposing customers’ private information. This led to the opposite extreme: the insurer withheld documents from various recipients, even where this meant violating regulations or missing opportunities for business value.

As one part of the organization’s requirements, the document management team needed to archive an extensive collection of insurance policy applications, most of which were low-quality scans of forms. These forms contained credit card information, and Payment Card Industry Data Security Standard (PCI-DSS) regulations required the insurer to keep credit card numbers out of its archives. Conversely, insurance regulations required these documents to be archived indefinitely, since decades may pass before insurance claims are made. By batch-redacting the policy applications using the InfoSphere Guardium Data Redaction solution, the insurer was able to rapidly and efficiently control the private information in the archives. As new forms arrived, the same process was automatically applied before they were archived. The insurer is now able to smoothly share or archive documents while precisely withholding the information required by law.

InfoSphere Guardium Data Redaction: Share sensitive data securely
Regulation, best practices and data privacy laws are changing the rules for how organizations grant or deny access to information. Organizations must not only ensure the retention, availability and confidentiality of documents, but release or archive precisely the information allowed by regulations and business needs, taking into account intended readers of the documents.

The InfoSphere Guardium Data Redaction solution was designed to do just this. It automatically identifies sensitive data in documents, and securely deletes sections of the document according to the regulatory or business policy while taking into account the semantics of the information and the role of the recipient.

InfoSphere Guardium Data Redaction supports many of today’s document types, including scanned or originally electronic documents. It leverages unique entity extraction and optical character recognition techniques to identify sensitive data in documents, making the redaction process repeatable and reliable for organizations to manage, measure and trust. InfoSphere Guardium Data Redaction is part of the IBM Security framework for data and information, helping organizations meet the broader challenge of protecting sensitive data, no matter where it resides.

IBM InfoSphere Guardium Data Redaction features

Enterprise integration
• Out-of-the-box support for FileNet P8 and IBM Content Manager 8
• Integration-ready capabilities for other document management products
• Integration-ready for enterprise authentication and policy systems
• Regulation-based policy model

Ease of use
• Automatically identify sensitive information in documents, forms, images, text and more for review by security professionals or other stakeholders to set appropriate redaction or other privacy policies or for reporting
• Richly functional, zero-install web interfaces
• Workflow support; automated batch redaction with optional review; on-demand redaction; document review and forwarding
• Secure role-based document viewer with optional flexible revealing by policy

Automated multi-format redaction
• Free-text entity extraction with industry-leading library developed by IBM Research
• Advanced form redaction, including low-quality, skewed or resized scans
• Mixed free-text/form redaction
• Support for multiple input and output document formats, including Microsoft Word, TIFF, plain text, XML and PNG
• Graphical/textual redaction, simultaneously preserving layout and accurately analyzing the text
• Support for English, German, French and Spanish textual entities
• Complete removal of hidden data
• Support for stamps such as Bates Number, Date, Document ID, Content Type, Repository Info, or other ways to uniquely identify a document
• Support for watermarking to help with content identification and authentication as well as communication of ownership and copyrights

About IBM InfoSphere Guardium Solutions
Since data is a critical component of daily business operations, it is essential to ensure privacy and protect data no matter where it resides. Different types of information have different protection requirements; therefore, organizations must take a holistic approach to safeguarding information.

• Understand where the data exists: Organizations can’t protect sensitive data unless they know where it resides and how it’s related across the enterprise.

• Safeguard sensitive data, both structured and unstructured: Structured data contained in databases must be protected from unauthorized access. Unstructured data in documents and forms requires privacy policies to redact (remove) sensitive information while still allowing needed business data to be shared.
• Protect non-production environments: Data in nonproduction, development, training and quality assurance environments needs to be protected, yet still usable during the application development, testing and training processes.
• Secure and continuously monitor access to the data: Enterprise databases, data warehouses and file shares require real-time insight to ensure data access is protected and audited. Policy-based controls are required to rapidly detect unauthorized or suspicious activity and alert key personnel. In addition, databases and file shares need to be protected against new threats or other malicious activity and continually monitored for weaknesses.
• Demonstrate compliance to pass audits: It’s not enough to develop a holistic approach to data security and privacy. Organizations must also demonstrate and prove compliance to third party auditors.

IBM InfoSphere Guardium solutions for data security and compliance support this holistic approach, helping organizations protect against a complex threat landscape while remaining focused on their business goals.

About IBM InfoSphere
InfoSphere Guardium is a key part of the IBM InfoSphere portfolio. IBM InfoSphere software is an integrated platform for defining, integrating, protecting and managing trusted information across your systems. The InfoSphere platform provides all the foundational building blocks of trusted information, including data integration, data warehousing, master data management and information governance, all integrated around a core of shared metadata and models. The portfolio is modular, allowing you to start anywhere, and mix and match InfoSphere software building blocks with components from other vendors, or choose to deploy multiple building blocks together for increased acceleration and value. The InfoSphere platform provides an enterprise-class foundation for information-intensive projects, providing the performance, scalability, reliability and acceleration needed to simplify difficult challenges and deliver trusted information to your business faster.

For more information
To learn more about IBM InfoSphere, please contact your IBM sales representative or visit: ibm.com/software/data/infosphere

For more information about data privacy and IBM InfoSphere Guardium Data Redaction, please contact your IBM representative or visit: ibm.com/software/data/guardium/data-redaction/

© Copyright IBM Corporation 2012

IBM Corporation
Software Group
Route 100
Somers, NY 10589
U.S.A.

Produced in the United States of America
July 2012
All Rights Reserved

IBM, the IBM logo, ibm.com, Guardium and InfoSphere are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries or both.

Microsoft is a trademark of Microsoft Corporation in the United States, other countries or both.

Other product, company or service names may be trademarks or service marks of others.

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates. All statements regarding IBM’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

Please Recycle

IMW14307-USEN-03
