How To Take Control of Your Data

Proactive classification technology creates benefits downstream from the data in a litigation event and upstream for information governance of the enterprise. This article discusses how stakeholders can apply classification to their objectives, the types of questions an e-Discovery provider may ask of you, and one approach to managing poor quality data.

eForensics Magazine: eDiscovery Compendium, Vol. 2 No. 8, Issue 08/2013 (11), July

HOW TO TAKE CONTROL OF YOUR DATA


INSTEAD OF WAITING FOR THE NEXT TRIGGERING EVENT
by Benjamin Marks and Brent Stanley

"Come e-Discovery counsel throughout the land, and please don't ignore what you can't understand." During a time of political and social upheaval in the early 1960s, American songwriter Bob Dylan penned "The Times They Are A-Changin'." In our community, change continues to occur as data volumes grow.

What you will learn:


• The benefits of classification
• How to manage ugly data and atypical data populations
• Stakeholder questions in consideration of classification (see Figure 2)

What you should know:


• The difference between reactive e-Discovery and proactive Information Governance
• Not all managed reviews are created equal
• Technology Assisted Review requires subject matter expertise for effective deployment

INTRODUCTION

The importance of data classification by relevant business purpose, prior to processing, cannot be overstated. Proactive technology choices such as classification create numerous benefits downstream during a litigation event, as well as upstream in managing information governance across an enterprise. Poor quality data might not be searchable, but that must not diminish its relevance or the need to understand its content. Whereas predictive coding employs technology that relies upon the searchability of good quality text, what is your workflow for the boxes of paper and the unsearchable electronic files created from third-generation scans?

Big Data is growing beyond your command, the old methods are rapidly aging

In 2013, unstructured data continues to increase exponentially in volume. For the longest time, our industry has followed the Four Ps (People, Process, Platform, and Protocol) of the decidedly reactive Electronic Discovery Reference Model (EDRM). Clients relate that their chief problems tend to revolve around productivity, accuracy, risk mitigation, and defensibility of process, all of which have an impact on the bottom line: their legal spend. However, the time has come to understand a Fifth P: PROACTIVE. We now know that not all workflows are equal. An abundance of interest in enterprise-wide Business Process Management (BPM) cost-saving measures is driving solutions towards the creation and deployment of a proactive workflow in which classification occurs prior to the managed review of documents. Poor quality data is rarely reviewed or effectively searched prior to, or in conjunction with, Rule 26 conferences. Case studies and interactive questions are used to illustrate the concepts in this article.

Information Lifecycle Management with a foundation in the EDRM is a multi-step process in which data is forensically collected, processed and analyzed, hosted, and then reviewed and produced, according to a very specific protocol and set of instructions. According to the 2012 RAND report, "Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery," collection and processing account for about 27% of the litigation spend, while the review component accounts for roughly 73%. We know that strategic decisions enacted upstream lead to a proactive and cost-effective workflow downstream. Classification is best applied prior to the processing and analytics step, where an average of 5 cents per document spent proactively on classification can offset 50 cents per document spent reactively during a traditional linear managed review. In proactive classification workflows, the cost savings accrue, leading to greater predictability of budget for both time and money, so that the 73% of spend that occurs in review may carry higher value than a mere linear review.

As the present now will later be past

Clients have shared horror stories of what happened on their last managed review, such as performing quality control and finding a high rate of material error in the 1st-level review. They were chagrined about the time the vendor promised that the review would be finished in four weeks, but had to add twenty reviewers and work overtime in the fourth week; not unexpectedly, the project bill was over budget. Of equal frustration were the occurrences of last-minute productions delivered to associates with no time to spare for around-the-clock deposition preparation, over the weekend again. Not surprisingly, these problems occur with greater frequency in a reactive workflow, or where the vendor failed to lay the foundation for solutions and defensibility by not asking the right questions. These questions and a client's answers carry a dual purpose: (1) to assist the scoping of the project, and (2) to align the value of purchased e-Discovery services with client needs (see List 1).
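For readers who want to test the five-cents-versus-fifty-cents arithmetic above against their own matters, here is a back-of-the-envelope sketch in Python. The per-document rates are the figures cited above; the corpus size and the share of documents that classification sets aside are hypothetical assumptions, not results from any engagement.

# Sketch of the upstream-vs-downstream cost comparison described above.
# Rates come from the article; corpus size and set-aside share are assumed.

def review_spend(n_docs, classify_rate=0.05, review_rate=0.50, set_aside=0.40):
    """Return (linear_cost, classified_cost) for a document corpus.

    set_aside is an assumed fraction of documents that classification
    parks as potentially non-responsive before 1st-level review.
    """
    linear = n_docs * review_rate
    classified = n_docs * classify_rate + n_docs * (1 - set_aside) * review_rate
    return linear, classified

linear, classified = review_spend(1_000_000)
print(f"Linear review:         ${linear:,.0f}")      # $500,000
print(f"Classify, then review: ${classified:,.0f}")  # $350,000 under these assumptions

Even with a modest set-aside rate, the classification fee is recovered several times over; the break-even point is simply the point where the classification cost per document is less than the review rate multiplied by the fraction culled.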


PROACTIVE APPROACH TO COST SAVINGS

Client 1 is a Fortune 500 vertically integrated company that relies on several managed review providers and outsourced early case assessment (ECA) tools; ostensibly, they made purchasing decisions based on relationships and price, rather than on an underlying awareness of their needs or of the changing technologies in the marketplace. The client shared that they were concerned about the high cost of 1st-level document review. In an effort to identify cost savings, we offered to re-review their data from a recent case to illustrate how machine learning via a classification tool could improve the client's knowledge of their data, prior to processing and especially prior to managed review, so that intelligent staffing choices could be made for a future managed review.
List 1: Scoping Questions

• What is the subject matter?
  – Similar subject matters may engender similar protocols for review
  – Case matter profiles may be replicated for the client
  – Demonstrable expertise from measurable historic results
• What is the approximate volume of documents?
  – Working assumptions are confirmed
  – Greater predictability for duration of review
  – Which pricing model to apply: hourly or fixed price per document
• What are the file types?
  – Impact on timing and workflow requirements
  – Historic file type management on this type of case
  – Identify any special types of skilled reviewers needed
• What are the average pages per document? What are the average pages per GB? What are the average documents per GB?
  – Collectively, these three questions assist identification of an atypical document population (see the sketch after this list)
  – Such an identification can alert us to special staffing concerns before the review begins
  – Comparison against historic workflow profiles for anomalies that may impact timing and other services, such as privilege log creation or redaction
• How many custodians?
  – Comparison against historic hit rates
  – Prioritization for workflow and best practices
  – Staffing needs
• How many issue tags?
  – Historic responsiveness rates compared to the current case
  – Best practices favor 10 tags or fewer
  – Discussion of potential areas of data uncertainty prior to review, so that data may be strategically batched to mitigate the costly re-review that results from client protocol changes
• What drives your purchasing decision to choose one provider over another?
• Is there a feature or aspect of your current service that you consider important? Why?
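The three volume metrics in List 1 can be checked mechanically. The sketch below shows one way to flag an atypical population; the baseline figures are hypothetical placeholders, not industry benchmarks.

# Sketch of the three volume metrics that together flag an atypical
# document population. Baseline values are assumed, not benchmarks.

def population_profile(total_pages, total_docs, total_gb):
    return {
        "pages_per_doc": total_pages / total_docs,
        "pages_per_gb": total_pages / total_gb,
        "docs_per_gb": total_docs / total_gb,
    }

# Assumed historical baseline for "typical" matters of this kind.
BASELINE = {"pages_per_doc": 4.0, "pages_per_gb": 60_000, "docs_per_gb": 15_000}

def flag_anomalies(profile, baseline=BASELINE, tolerance=0.5):
    """Flag any metric deviating more than 50% from the historic baseline."""
    return [metric for metric, value in profile.items()
            if abs(value - baseline[metric]) / baseline[metric] > tolerance]

profile = population_profile(total_pages=2_400_000, total_docs=200_000, total_gb=50)
print(flag_anomalies(profile))  # ['pages_per_doc', 'docs_per_gb']

A population that is heavy on pages per document but light on documents per GB, as in this toy example, often signals scanned paper rather than native e-mail, which changes staffing and timing assumptions before the review begins.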

Considering new technology

We classified data for its relevant business purpose as a precursor to creating a seed set for predictive analytics. We compared the effectiveness of the tool to an existing in-house product. We identified best practices for seed set creation protocols, and can share some lessons learned about the process that will benefit future clients.

For testing, we utilized a sample batch of data consisting of a mix of Excel, Word, PowerPoint, Adobe PDF, and MS Outlook email files. The test data set was provided to us; we then analyzed the data, created categories for classification, identified a seed set, and ran an automated classification process on the remainder. The results were loaded into a Relativity database for testing. The deliverable included a list of identified categories, a list of documents used in the seed set, and a load file listing all documents and their corresponding categories. Once all categorization sets were completed, we built saved searches to identify discrepancies. We employed a human element to validate the classifications performed and to create a blind seed set for comparison.

The subject matter expertise of the engagement engineer is a factor in the way that seed sets are created. The new classification technology was able to classify a higher percentage of documents and showed better optimization across multiple file types than any of the in-house categorization sets created by incumbent products. The ability to classify on a relevant business purpose with a robust file identification engine is perhaps one of the largest differentiators between competing technologies. The human intelligence married to the artificial intelligence of machine learning is an important step in the iterative process of seed set creation. Subject matter knowledge differs from person to person based on understanding of the type of case, the case in point, familiarity with the use of technology, and professional experience and exposure to the documents and concepts that clients provide for production. The blind classification set created by the subject matter expert was found to match favorably (72%) with the machine learning classification performed by our tool.
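The blind comparison above reduces to a simple agreement measurement: the expert labels a sample without seeing the machine's output, and the two are compared document by document. A minimal sketch, with illustrative document IDs and category names rather than real case data:

# Sketch of the blind-comparison step: machine labels vs. a blind expert.
machine = {"DOC-001": "Contracts", "DOC-002": "HR",
           "DOC-003": "Invoices", "DOC-004": "Board Materials"}
expert = {"DOC-001": "Contracts", "DOC-002": "Board Materials",
          "DOC-003": "Invoices", "DOC-004": "Board Materials"}

shared = machine.keys() & expert.keys()
agreement = sum(machine[d] == expert[d] for d in shared) / len(shared)
print(f"Blind agreement: {agreement:.0%}")  # 75% on this toy sample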

LESSONS LEARNED

Please consider a change in your data workflow.

PROCEDURAL BEST PRACTICES

• Process source data to expose actual (as opposed to stated) file types, system files, duplicates, and near-duplicates.
• Classification on multiple relevant business purposes helps you understand the data and leads to a higher quality of prioritized data, different from a linear review.
• Define categories and identify where overlap occurs. The result is a prioritized classification of potentially responsive material; certain categories may require a second look as part of the iterative process, prior to managed review.
• Clients should be encouraged to provide lists of responsive terms and privilege names during custodial collection, for the purpose of data mapping and classification, prior to the project kickoff.
• A Potential Privilege filter can be applied based upon a list of counsel names, mitigating the impact of inconsistent coding in a traditional linear review (a minimal sketch of such a filter follows this list).
• On a case-by-case basis, confirm with your vendor who from their pool of candidates and subject matter experts will be provided for supervised machine learning and seed set creation.
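To make the Potential Privilege filter concrete, here is a minimal sketch; the counsel names, legal domains, and message format are hypothetical stand-ins for the client-provided lists, not any tool's actual interface.

# Sketch of a Potential Privilege filter built from client-provided lists.
COUNSEL_NAMES = {"jane doe", "john roe"}           # assumed counsel list
LEGAL_DOMAINS = {"lawfirm.com", "legal.acme.com"}  # assumed legal domains

def is_potentially_privileged(message):
    """Route a message to the Privilege Review team if any participant
    is known counsel or writes from a legal domain."""
    participants = [message["from"], *message["to"]]
    for person in participants:
        _, _, domain = person["email"].partition("@")
        if person["name"].lower() in COUNSEL_NAMES or domain in LEGAL_DOMAINS:
            return True
    return False

message = {"from": {"name": "Jane Doe", "email": "jdoe@lawfirm.com"},
           "to": [{"name": "Sam Smith", "email": "ssmith@acme.com"}]}
print(is_potentially_privileged(message))  # True: counsel on the thread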

EVOLUTION OF ENGINEERED REVIEW

If your time to you is worth savin'

To summarize, proactive pre-processing classification takes a large corpus of unstructured data and organizes it around a central business purpose or theme. This categorization prioritizes, and in turn reduces, the number of documents that undergo a traditional linear first-pass review for responsiveness. A reduced volume of documents leads to a reduced labor cost, where fewer reviewers are needed to accomplish the same task, perhaps in fewer hours, days, or weeks. The potentially responsive documents are classified and prioritized around the relevant purpose, and the potentially non-responsive documents are set aside for later review, if necessary. No coding decisions to tag have been made at this stage. Neither have non-responsive documents had to be processed in order to determine that they do not meet the threshold requirements for responsive production.

Through the use of proactive classification, we have transformed managed review into an engineered review. It is a more efficiently staffed project. We train and qualify our review team on the classification of the data and the alignment of who, what, when, and how. Everything that we learned from the classification process is a point of knowledge for the case, and this is conveyed through the delivery of a production binder documenting every step taken for defensibility. Better trained reviewers make fewer material errors because the training on the quality control process is very robust. Productive reviewers complete batches faster because they are not distracted by uncategorized linear data. Rather, they are tuned in to the proactive prioritization of the classifications. Thus, they are more likely to spot outliers and departures in behavior patterns, analyze sentiment in a message, and spot differences not readily found in a traditional linear review (see Frame: Classification Use Case).

Classification Use Case

Classification organizes the data, and themes emerge. Trends and occurrences are readily visible as patterns of behavior: who was talking to whom, about what, and how and when did it occur? Every month for nine months, Smith and Jones had a meeting and exchanged 3 emails with 6 attachments. There were always 3 spreadsheets, 1 HR Word document related to goal measurement, a PowerPoint presentation for the board of directors, and an agenda. There were multiple drafts of the PowerPoint. There were requests for legal advice that made some of the documents potentially privileged. Messages involving lawyers and legal domain names, identified in advance through the use of classification tools, were set aside as potentially privileged for the Privilege Review team instead of being reviewed twice, at the risk of an inconsistent call. Classification identifies frequency of events, conversations, and third parties to a conversation.

Then one day, in the 10th month, Smith and Jones introduced Davis, a competitor, into the mix of their regularly patterned behavior. All of a sudden, Smith and Jones were scheduling a meeting with Davis to discuss fixing a price. Consider the following questions. Could you have found that in a traditional linear review? When would you have found it? Would you have noticed the frequent pattern of behavior for 9 months and then spotted the anomaly, Davis, in the 10th month? What if you had different reviewers on the two batches, a distinct likelihood? In a classification system, you could find it with frequency reports, and then, using the iterative process of machine learning, train the machine to find other documents like that smoking gun whose existence was previously unknown. Data can be batched specific to this particular incident before reviewers are in their seats, and classification can provide valuable case knowledge where you aren't necessarily aware of what you did not know. One by-product for the corporation that engages in classification is an understanding of its data in terms of knowledge management. Classification can deliver reports on the frequency of nouns and verbs, both for defensibility of the process undertaken (for use in Rule 26 meet-and-confers) and for the identification of the next triggering event. In this manner the wheel is not recreated each and every time there is a triggering event.
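The frequency reports described in this use case can be as simple as grouping messages by month and participant set, then flagging the month in which a newcomer appears. A sketch under those assumptions (the data structures are illustrative, not our tool's actual format):

# Sketch of a frequency report that surfaces a new participant
# in an otherwise stable monthly communication pattern.
from collections import defaultdict

def monthly_participants(messages):
    """messages: iterable of (month, set_of_participants)."""
    by_month = defaultdict(set)
    for month, people in messages:
        by_month[month] |= people
    return by_month

def new_participants(by_month):
    """Yield (month, newcomers) whenever someone appears for the first time."""
    seen = set()
    for month in sorted(by_month):
        newcomers = by_month[month] - seen
        if seen and newcomers:  # skip the baseline month
            yield month, newcomers
        seen |= by_month[month]

msgs = [(m, frozenset({"Smith", "Jones"})) for m in range(1, 10)]
msgs.append((10, frozenset({"Smith", "Jones", "Davis"})))
for month, who in new_participants(monthly_participants(msgs)):
    print(f"Month {month}: new participant(s) {sorted(who)}")  # Month 10: Davis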

Applications include:
• Classification Services for Information Governance
• Due Diligence and Audit Support
• Data Mining on Physical Records, aka "What's In the Box?"
• Records Validation and Verification

Services include:
• Subject matter expertise (SME) in machine learning and data extraction
• Classification training and certification for clients and partners

Products include:
• Haystac RetenGine, which processes enterprise data, and Haystac Web, which processes data on the internet

Contact: +1 781-820-7616
Email: [email protected]
On the web: http://www.haystac.com
To read more from Haystac, please visit http://www.haystac.com/whitepapers


UGLY DATA AND ATYPICAL DOCUMENT POPULATIONS

Client 2 is a Fortune 100 commercial bank. Because we have a very deep understanding of this bank's litigation matters, we undertook three custom tasks that would be considered atypical by any vendor standard in the e-Discovery industry. While many providers would shy away from undertaking such projects, these were the perfect test cases to employ technology, identify efficiencies, and share results both with our banking client and with other companies who face the same challenges (see Frame: Why is data classification a good idea for your organization?).

WHAT IS UGLY DATA?

Ugly data is poor quality data that originated as a paper document at some point in its life. One easy-to-digest example is the process of contract execution, where a contract was printed and signed, then scanned and sent to a counterparty or additional signatory for signing, where it was re-scanned and returned. That copy is at least three generations off the original. Depending upon the quality of the printout and scan, there may be some loss of fidelity during the OCR conversion from native file to TIFF. Recent work for clients in the oil and gas industry required the cleanup of a fax document for the production of a maintenance report related to a well (see Figures 1 and 2).

Large PDF Splitting and Classification

The transfer of assets and collection of work product across several vertical markets has resulted in the records for such assets being compiled into a single PDF, usually with no index. This condition is prevalent in the oil and gas and mortgage industries, where the records associated with an asset are created as these large PDFs. The holder of these PDFs is forced to reconstruct the original document collection in order to determine the presence of critical records and/or recreate a database of key attributes contained within the documents. In addition, the quality of the OCR text is usually poor, severely limiting the usefulness of search-based interrogation. Manually splitting these PDFs into their original documents is an expensive and time-consuming process.

We were able to train on a seed set of documents and automatically split 21 loan files into 1,900 PDFs, the original document set, accurately identifying the logical document breaks and auto-classifying each document to high levels of accuracy. New document naming conventions are auto-generated, usually by appending the page range of the new document to the original file name (see Figure 3; a sketch of this convention follows below).

The client provided a list of 13 categories into which to place documents. For comparison, we had our Haystac technology go head to head with human reviewers. The technology was able to categorize all of the documents and left fewer documents in the OTHER category than the off-shore human review team. The advantage of Haystac's machine-based process is quicker recognition of error patterns and their correction, thus eliminating the inherent variability of human judgment. The process can be applied to millions of pages of PDFs and produces results in a fraction of the time of its manual counterpart.
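A minimal sketch of the naming convention referenced above: given the logical document breaks that the trained model predicts (here supplied by hand), derive page ranges and append them to the original file name. The format string is an assumption for illustration, not the exact convention used on the engagement.

# Sketch of split naming: original file name plus page range per document.
# Break detection is the trained model's job; breaks are given here.

def split_ranges(first_pages, total_pages):
    """first_pages: sorted 1-based page numbers where new documents start."""
    bounds = list(first_pages) + [total_pages + 1]
    return [(start, bounds[i + 1] - 1) for i, start in enumerate(first_pages)]

def name_splits(source_name, first_pages, total_pages):
    stem = source_name.rsplit(".", 1)[0]
    return [f"{stem}_p{a:04d}-p{b:04d}.pdf"
            for a, b in split_ranges(first_pages, total_pages)]

# e.g. a 12-page loan file with documents starting at pages 1, 4, and 9
print(name_splits("LoanFile_0001.pdf", [1, 4, 9], 12))
# ['LoanFile_0001_p0001-p0003.pdf', 'LoanFile_0001_p0004-p0008.pdf',
#  'LoanFile_0001_p0009-p0012.pdf']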
Why is data classification a good idea for your organization?

Classification equals preparedness for all stakeholders. Who in your enterprise is involved in making these decisions? Do you have any of these concerns?

Legal and General Counsel
• Future litigation
• Periodic M&A events that require extensive due diligence
• Regulatory compliance

Records Management
• Indexing and remediation of legacy data for storage
• What are the new record-keeping requirements under Dodd-Frank?
• Classification can reduce annual storage costs at the terabyte and petabyte level

Risk Management, Audit, and Compliance
• Defensible deletion reduces enterprise risk. Are we holding data for too long?
• How will we meet the new statutory regimes for reporting under Dodd-Frank?
• Classification can establish cost-effective predictability for compliance and mitigate costs found in a risk profile

Corporate Knowledge Base
• A repository allows clear insight into the language used to discuss common business events: who was talking to whom; when these conversations were occurring; and identification of a pattern of expected behavior, thus enabling the visibility of outliers, anomalies, and departures from the pattern (in essence, needles in a haystack)
• Classification enables the creation of a corporate repository and promotes the reusability of data, so that you no longer have to recreate the wheel

C-Suite Management
• Classification provides a material benefit to C-Suite stakeholders: reduction of labor cost occurs at the most variable portion of a managed review
• Improved productivity to reach higher priority data where strategic decisions are made
• Classification enables greater accuracy, allowing the production of data sooner

Auto-Extraction of Text for Logical Document Determination

Poor quality text documents can constitute a significant percentage of stored documents. Scanned documents are typically stored as TIFF or PDF files on file servers and in email archives, and are usually poorly indexed, making them hard to find using enterprise search engines. In addition, important records stored in boxes and files are also poorly indexed at the box or file level, making the box or file contents blind to the enterprise. Manually indexing these documents is resource intensive and costly, yet locating important records is essential to satisfying audit, investigatory, and document control objectives, as well as meeting information governance requirements.

Document titles are often a key indicator of the purpose of a document, so accurately and cost-effectively determining the title means the document's importance as a record can be determined. Determining the title allows the document to be classified to a business purpose using database mapping. We use a soft dictionary-based approach to identifying document titles: a dictionary has been compiled from common business function-based documents and is supplemented with actual document headers gleaned by sampling client data. Image processing extracts the title fragment, and algorithmic processing determines the most probable title match (a minimal sketch of such a match follows). The user interface contains an editor which allows the user to view machine results, enter new headers, and correct errors. On this task, we extracted document titles that were meaningful to the useful categorization of the poor quality OCRed documents.
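A minimal sketch of the soft dictionary match, using Python's difflib as a stand-in for whatever similarity measure the production engine applies; the dictionary entries are examples, not the compiled dictionary itself.

# Sketch: match an OCR-damaged title fragment against a title dictionary.
import difflib

TITLE_DICTIONARY = [
    "Promissory Note", "Deed of Trust", "Settlement Statement",
    "Well Maintenance Report", "Title Insurance Policy",
]

def best_title(ocr_fragment, threshold=0.6):
    """Return the most probable dictionary title, or None below threshold."""
    scored = [(difflib.SequenceMatcher(None, ocr_fragment.lower(),
                                       title.lower()).ratio(), title)
              for title in TITLE_DICTIONARY]
    score, title = max(scored)
    return title if score >= threshold else None

print(best_title("PR0MISS0RY N0TE"))   # 'Promissory Note' despite OCR zeros
print(best_title("quarterly budget"))  # None: no plausible match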

Figure 1. Poor Quality Data

Figure 2. Data Cleaned Up

Auto-Extraction of Text for Reporting Purposes

There were 68 fields of entry on a custom reporting document, for which items such as loan numbers, amounts, codes, dates, borrower names, mortgage lenders, title insurance details, and other information were required. Our auto-extraction technology was able to accurately populate the data in an Excel spreadsheet in response to the government request for production. A minimal sketch of such extraction follows.
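The sketch below pulls a few illustrative fields from OCR text with patterns and writes them out as a spreadsheet row. The field names and patterns are hypothetical; the real template covered all 68 fields, which are not reproduced here, and CSV stands in for the Excel deliverable.

# Sketch of pattern-based field extraction from OCR text to a spreadsheet row.
import csv
import re

FIELD_PATTERNS = {
    "loan_number": re.compile(r"Loan\s*#?\s*[:\-]?\s*(\d{6,12})", re.I),
    "amount": re.compile(r"Amount\s*[:\-]?\s*\$?([\d,]+\.\d{2})", re.I),
    "borrower": re.compile(r"Borrower\s*[:\-]?\s*([A-Z][\w .'-]+?)(?:\s{2,}|$)"),
}

def extract_fields(ocr_text):
    """Return one row: the best match per field, blank if absent."""
    row = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        row[field] = match.group(1).strip() if match else ""
    return row

sample = "Loan # 00123456  Borrower: Jane Q. Public  Amount: $250,000.00"
with open("report.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerow(extract_fields(sample))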



Custom Solutions Yield Workflow Benefits

• Accurately identify important records without manual brute-force processing
• Scale classification to large document collections
• Eliminate unnecessary documents from storage
• Improve the odds of finding critical records
• Increase the speed of getting results

CONCLUSION

The slow ones now will later be fast

Prospects and clients will one day realize that the lowest vendor price does not always equate to the best value for their litigation spend. On the review side, value can be added where multi-class classification occurs prior to processing and subject matter expertise is applied as the human complement to machine learning. Value is enhanced where the proactive use of technology removes inefficiencies, leading to improved knowledge management and ultimately a higher quality litigation spend. Proposed changes to the Federal Rules of Civil Procedure (the proportionality amendments) seek to reduce the time delays and extraordinary costs associated with e-Discovery where such costs outweigh the utility of the task undertaken; in this regard, classification applied in a proactive workflow would meet the goal of proportionality, because better organization of data upstream saves downstream costs.

The benefits of proactive classification at the outset of an engineered review are multiple:

• Increased productivity on a 1st-level review adds value to the predictability of your litigation budget for time and money.
• A fully defensible engineered review process mitigates a client's risk profile.
• Reduction in legal spend through a more efficient engineered review, where fewer attorneys are needed for a 1st-level review, is in essence doing more with less.
• A Corporate Knowledge Base is created, signifying an advance in the reuse of data.
• Accuracy and robust quality control protocols enable the direction and allocation of litigation spend towards higher value legal functions, sooner.

Oh, the times, they are a-changin'.

About the Author

Benjamin S. Marks is a consultant on eDiscovery and Information Governance initiatives. Most recently, he assisted the development of a document review center in Charlotte, North Carolina, and a new product introduction for an e-Discovery service provider. An entrepreneurial, strategic-minded lawyer with a business operations background, Ben's prior work on staffing managed reviews affords him the insight to identify subject matter expertise for teams, develop proactive workflows, and assemble responses to RFPs. Prior to law school, Ben was the founder of Eco Specialties and Design, an environmentally themed promotions company. Today, when he's not building seed sets or reading about Dodd-Frank's impact on enterprise risk management, Ben follows Orioles baseball, attends live music events, enjoys cooking, and runs with his puggle in Baltimore, Maryland. He holds a J.D. and an Environmental Certificate from Pace University School of Law.
