IMW14307USEN
White Paper
Introduction: The market demand for redaction
Until recently, the need to delete sensitive information in documents was restricted to the national intelligence community. In business, government and nonprofits, the need for redaction was much rarer. But the need for redaction is growing quickly, as organizations face regulatory and business requirements to enforce document privacy, not just by controlling access to entire documents, but by selectively deleting private units of information. This means automated redaction solutions are rapidly becoming more important.

The conflict between openness and privacy
With the data explosion in IT today, storing more data is a challenge, but managing it is even more of a challenge. Proper data management ensures that the right people have access to information and that they are using it for legitimate purposes. This creates a conflict between ensuring that those entitled to see given types of information can access it easily and withholding access to information that they are not allowed to see (see Figure 1).

To take an example from the U.S. government: the Freedom of Information Act (FOIA) is intended to hold government organizations more accountable for their actions by making information about those actions available on demand. On the other hand, the same regulation requires that those ordering the documents must not see any sensitive personal or national security information.

Similarly, the Health Insurance Portability and Accountability Act (HIPAA) is designed to enhance sharing of documents between physicians, hospitals and insurers while preventing the unauthorized disclosure of individuals’ personal healthcare information. For example, consulting physicians need access to individuals’ electronic health records, but they do not need to see billing information that is unrelated to their job duties.
[Table; only the column headings survive: Functional area, Government, Legal eDiscovery, Health, Bank cards, Corporate governance]
Both regulatory requirements and business pressures make redaction essential.
• Redaction can satisfy governmental regulations, including those in data privacy laws, without restricting the legitimate use of information, thus avoiding sanctions, penalties and costs associated with addressing compliance violations after the fact.
• Redaction can accelerate business interaction by sharing information with customers, partners and other third parties without exposing them to sensitive information that they should not see. Organizations often find strong business value in sharing information, but they must take care to limit exposure to the minimum needed, to avoid the embarrassment and competitive risk of leaks.

Traditional IT solutions are not granular enough
What can technology do to achieve this balance between openness and privacy? The first line of defense consists of familiar access controls for documents, often defined per role. Another part of the solution is data loss prevention (DLP) software, which can restrict the transmission (such as through email) of sensitive information from an authorized user to an unauthorized user. Encryption is also an important means of ensuring data confidentiality. But these approaches, with some exceptions, tend to be blunt instruments that too often restrict access more than is really necessary.

These forms of document security are essential; but for a more flexible, fine-grained approach, redaction is needed as well.

“Enterprises must now add to the basic characteristics of data protection—preservation, availability, responsiveness and confidentiality—the what and where of data.”
-David Hill, Analyst, Mesabi Group

To do the job correctly, redaction software needs the following capabilities:
• The software must securely and completely delete all relevant data. Some ad hoc solutions layer a black rectangle over the data, leaving the private data underneath. At the trial of former Illinois Governor Rod Blagojevich in 2010, documents with information about U.S. President Barack Obama’s relationship to the case were officially released with redactions, but the underlying text could be easily recovered by simply copying and pasting it.
• To comply with some regulations, redaction software must be able to retain the original pre-redaction version in a safe place; for other regulations, it must securely delete such versions.
• The tools should allow the labeling of redactions. Rather than simply masking text, some regulations require a meaningful label for the redacted text, such as the words “Social Security Number” or a regulatory section-number, for readability and justification of the redaction.
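To make the secure-deletion and labeling capabilities concrete, here is a minimal sketch in Python. It assumes the sensitive spans have already been located, and the names `Span` and `redact` are hypothetical, chosen for illustration only; they are not part of any product API. The point is that the sensitive characters are removed from the output and replaced by a label, instead of being covered by an overlay that leaves the text recoverable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A sensitive region of the text: [start, end) plus a human-readable label."""
    start: int
    end: int
    label: str


def redact(text: str, spans: list[Span]) -> str:
    """Return a copy of `text` in which each span is replaced by its label.

    The original characters are not carried into the result, so they cannot
    be recovered by copying and pasting from the redacted copy.
    """
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor or span.end > len(text) or span.end < span.start:
            raise ValueError(f"invalid or overlapping span: {span}")
        pieces.append(text[cursor:span.start])   # keep the non-sensitive text
        pieces.append(f"[{span.label}]")         # label instead of the data
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Illustrative input; the number is a made-up placeholder.
original = "Applicant SSN: 123-45-6789, approved."
start = original.index("123-45-6789")
redacted = redact(original, [Span(start, start + 11, "Social Security Number")])
print(redacted)
```

Whether the pre-redaction original is then archived or securely destroyed is a separate policy decision, as the second capability above notes; this sketch only produces the redacted copy.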
4 IBM InfoSphere Guardium Data Redaction: Reconciling openness with privacy
IBM® InfoSphere® Guardium® Data Redaction addresses all these capabilities and more.

This was not hacking. These people had passwords; they
needed access permissions as part of their day-to-day work.
The problem was that they had no need-to-know. They
accessed the records out of mere curiosity, not out of a
legitimate functional need.
With a redaction-based web viewer, users see the
documents redacted according to their roles: in a hospital,
physicians and financial personnel will see different
information; and in the military, combat officers will see
different information from logistics specialists. These
redactions can be made very conservative, redacting
information if there is any doubt about whether it should be
visible to users of a given role. Thus, where permitted,
authorized users can state their business purposes and fill
in some of the redacted information in their documents,
knowing that all accesses are logged.
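
The role-based viewing idea described above can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the role names, the semantic types, the `VISIBLE_TYPES` policy table and the `Viewer` class are all hypothetical. The sketch redacts conservatively (anything not explicitly allowed for a role is hidden) and logs every reveal together with the stated business purpose.

```python
from dataclasses import dataclass, field

REDACTED = "[REDACTED]"

# Hypothetical policy: which semantic types each role may see by default.
VISIBLE_TYPES = {
    "physician": {"diagnosis", "medication"},
    "billing_clerk": {"invoice_amount"},
}


@dataclass
class Viewer:
    # Each unit is a (semantic_type, value) pair extracted from one document.
    units: list
    access_log: list = field(default_factory=list)

    def view(self, role: str) -> list:
        """Show values the role is allowed to see; redact everything else."""
        allowed = VISIBLE_TYPES.get(role, set())  # unknown role sees nothing
        return [value if kind in allowed else REDACTED
                for kind, value in self.units]

    def reveal(self, index: int, user: str, may_reveal: bool, purpose: str) -> str:
        """Return one redacted unit to a permitted user and log the access."""
        if not may_reveal:
            raise PermissionError("user is not permitted to reveal redactions")
        if not purpose.strip():
            raise ValueError("a business purpose must be stated")
        kind, value = self.units[index]
        self.access_log.append((user, kind, purpose))
        return value


viewer = Viewer([("diagnosis", "asthma"), ("invoice_amount", "$240")])
print(viewer.view("physician"))
print(viewer.view("billing_clerk"))
```

Defaulting to redaction for unknown roles and unlisted types mirrors the conservative stance in the text: when in doubt, hide the unit and let a permitted user request it with a stated purpose.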
IBM Software 5
High volumes of electronic documents make manual redaction expensive and error-prone. People who have the regulatory knowledge to identify private information are too expensive for painstaking rote tasks, and even if they are assigned to such a task, they are only human; they are slow and make mistakes.

Even if the black marker method were feasible for isolated documents, managing the workflow involved makes it prohibitive for large document collections. The redaction process includes identifying the documents in the repository that need redaction; finding the sensitive information in each document; cross-referencing the semantic type of each unit of information to the role of the recipient and determining whether to redact it; creating the redacted copy; reviewing the redaction and then redacting it again if needed; and finally storing the redacted copy in a way that links to the original in the repository.

For free-text documents, the redaction engine automatically identifies and extracts relevant units of information (see Figure 2). Simply using text patterns is not enough: there is no formal pattern, for example, that captures personal names. Dictionaries are not sufficient either, since homonyms can disguise meanings (is “bush” a plant or a former U.S. president?). Instead, it is necessary to combine regular expressions and dictionaries with a syntactic analysis of the text surrounding the relevant information.

Structured forms, on the other hand, require a different technique, in which the known form layout is leveraged for accurate redaction. This allows even low-quality scans with handwritten text to be processed; if they are accidentally skewed or resized, they can be straightened and aligned with a template. To accomplish this, a reviewer begins by redacting a sample form (for example, a blank) and marking the sensitive fields to be redacted, together with elements that identify instances of the form, such as the form title or identification number. This creates a template for subsequent forms. The software redaction solution matches templates to forms, eliminating the costly presorting of different form types. Next, it applies a template to each form, precisely deleting the marked fields based on their position (see Figure 3).

Figure 3: On the left is a skewed scan of a form; on the right is the automatic identification of the sensitive field in the form, as seen in the InfoSphere Guardium Data Redaction Manager.

Complete document format coverage
InfoSphere Guardium Data Redaction processes documents in many formats: PDF, TIFF, Microsoft Word, plain text, XML files and more.

Some input documents, such as Microsoft Word and many PDFs, carry text in them, but others like TIFF and some PDFs are pure images. For image files, the solution applies high-quality optical character recognition, and then processes the text. If there are any photographs or other graphics in a document, the solution preserves them as such.

Though most sensitive information arrives as text, images too can contain sensitive information. For example, an X-ray image may identify a patient’s name, a portrait photograph may betray an identity or a satellite image may expose the location of a military unit. In the InfoSphere Guardium Data Redaction system, sensitive images can be located in a form using templates, or marked by a reviewer in the web-based Redaction Manager.

Each of the document formats (PDF, TIFF, Microsoft Word, plain text and XML files) can serve not only as input but also output, and the choice of output type is configurable. In some cases, regulations or business needs require the redacted document to be in “native” format, the original format of the input. In other cases, it is necessary to output all documents, regardless of the input, into standard graphical formats preserving the precise layout, such as TIFF or PDF. Alternatively, if further machine processing is needed, plain-text output can be specified for all input formats.

Finally, the InfoSphere Guardium Data Redaction system automatically removes the wide variety of hidden information that is often stored in PDFs and Microsoft Word documents, even without the user’s knowledge. This includes hidden layers; comments and scripts; white text and tiny fonts; metadata such as the names of document editors and the creation date; and historical contents of a document preserved with editing features like Undo and Track Changes. All these are safely deleted as part of the redaction workflow.

Efficient workflow
With thousands of documents to redact, workflow management is essential. Simply printing out the pages and deleting the sensitive text won’t do it (it would be impossible to keep track of the stacks of paper), and the same is true with ad hoc redaction of masses of electronic documents. The only solution is for redaction to fit into enterprise content management (ECM) processes. The InfoSphere Guardium Data Redaction system supports a variety of such workflows out of the box, and can read and write documents in ECM systems such as IBM FileNet® P8 and IBM Content Manager 8:

• Batch redaction: This workflow automatically redacts large numbers of documents in a repository. Depending on regulatory requirements, a reviewer can then examine from 0 to 100 percent of the redacted documents and approve, reject or refine the redaction as needed. In this way, the redaction solution combines the strengths of machine processing and human domain knowledge.
• On-demand redaction: A workflow used when individual documents must be processed as needed. For example, a business user may need to cleanse private information from a document before emailing it to a business partner. The sender can open the document in Redaction Manager, which instantly suggests text to be redacted. The sender can then refine this redaction before releasing the document.
• Secure document viewing: For this workflow, InfoSphere Guardium Data Redaction provides a document viewer. All documents, regardless of original format, are displayed to the user in a uniform way in the browser, with no need to download a document that could subsequently be leaked. Sensitive information in the document is securely deleted according to the recipient’s job role. In accordance with regulations, this data is typically deleted in a way that does not allow it to be viewed; for some types of information, users may have permission to securely retrieve the redacted units after specifying their need to know (see Figure 4).
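
The automatic redaction and suggestion steps in these workflows rest on the free-text detection discussed earlier: patterns and dictionaries combined with the surrounding context. The following Python sketch illustrates that combination under loud assumptions. The pattern, the toy `SURNAMES` dictionary, the `TITLES` cue list and the function `find_sensitive` are all hypothetical, and the "preceded by a title" check is only a crude stand-in for the syntactic analysis the paper describes; it is not the product's engine.

```python
import re

# Pattern-based detection: works for rigidly formatted data.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Toy dictionary of surnames; on its own it cannot tell "Bush" from "bush".
SURNAMES = {"bush", "hill"}

# Context cue (assumption): a surname counts as a name only after a title.
TITLES = {"mr.", "mrs.", "ms.", "dr.", "president", "governor"}

WORD = re.compile(r"[A-Za-z][A-Za-z.]*")


def find_sensitive(text: str) -> list:
    """Return sorted (start, end, label) tuples for detected sensitive units."""
    hits = [(m.start(), m.end(), "Social Security Number")
            for m in SSN_PATTERN.finditer(text)]

    tokens = list(WORD.finditer(text))
    for prev, tok in zip(tokens, tokens[1:]):
        word = tok.group().rstrip(".")           # drop sentence-final period
        in_dictionary = word.lower() in SURNAMES
        capitalized = word[:1].isupper()
        after_title = prev.group().lower() in TITLES
        if in_dictionary and capitalized and after_title:
            hits.append((tok.start(), tok.start() + len(word), "Personal Name"))
    return sorted(hits)


sample = "President Bush planted a rose bush. SSN 123-45-6789."
for start, end, label in find_sensitive(sample):
    print(label, "->", sample[start:end])
```

In this sample the dictionary matches both occurrences of "bush", but only the capitalized one following a title is flagged; a reviewer can then approve, reject or refine such suggestions, as the workflows above describe.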
[Figure residue; surviving diagram labels: File system, Core, Internal, In-process, API, Java client]

information. This led to the opposite extreme: the insurer withheld documents from various recipients, even where this meant violating regulations or missing opportunities for business value.
IBM InfoSphere Guardium solutions for data security and compliance support this holistic approach, helping organizations protect against a complex threat landscape while remaining focused on their business goals.

For more information about data privacy and IBM InfoSphere Guardium Data Redaction, please contact your IBM representative or visit: ibm.com/software/data/guardium/data-redaction/
© Copyright IBM Corporation 2012
IBM Corporation
Software Group
Route 100
Somers, NY 10589
U.S.A.
IBM, the IBM logo, ibm.com, Guardium and InfoSphere are trademarks or
registered trademarks of International Business Machines Corporation in
the United States, other countries or both. If these and other IBM
trademarked terms are marked on their first occurrence in this information
with a trademark symbol (® or ™), these symbols indicate U.S. registered
or common law trademarks owned by IBM at the time this information was
published. Such trademarks may also be registered or common law
trademarks in other countries. A current list of IBM trademarks is available
on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml
IMW14307-USEN-03