How To Take Control of Your Data
eDiscovery Compendium, Vol. 2 No. 8
Issue 08/2013 (11), July

INTRODUCTION
Come, e-Discovery counsel throughout the land, and please don't ignore what you can't understand. During a time of political and social upheaval in the early 1960s, American songwriter Bob Dylan penned "The Times They Are a-Changin'." In our community, change continues to occur as data volumes grow.
The importance of data classification by relevant business purpose, prior to processing, cannot be overstated or misunderstood. Proactive technology choices such as classification create numerous benefits downstream during a litigation event, as well as upstream in managing information governance across an enterprise. Poor-quality data might not be searchable, but that must not diminish its relevance or the need to understand its content. Whereas predictive coding employs technology that relies upon the searchability of good-quality text, what is your workflow for the boxes of paper and the unsearchable electronic files created from third-generation scans?
Big Data is growing beyond your command, and the old methods are rapidly aging. In 2013, unstructured data continues to increase exponentially in volume. For the longest time, our industry has followed the Four Ps (People, Process, Platform, and Protocol) of the decidedly reactive Electronic Discovery Reference Model. Clients relate that their chief problems tend to revolve around productivity, accuracy, risk mitigation, and defensibility of process, and these all have an impact on the bottom line: their legal spend. However, the time has come to understand a Fifth P: PROACTIVE. We now know that not all workflows are equal. An abundance of interest in proactive workflows led us to the client engagements described below.
BODY
Client 1 is a Fortune 500 vertically integrated company that relies on several managed review providers and outsourced early case assessment (ECA) tools; ostensibly, they made purchasing decisions based on relationships and price, rather than on an underlying awareness of their needs or the changing technologies in the marketplace. The client shared that they were concerned about the high cost of first-level document review. In an effort to identify cost savings, we offered to re-review their data from a recent case to illustrate how machine learning via a classification tool could provide improved client knowledge about their data, prior to processing and especially prior to managed review, so that intelligent staffing choices could be made for a future managed review.
What is the subject matter?
- Similar subject matters may engender similar protocols for review
- Case matter profiles may be replicated for the client
- Demonstrable expertise from measurable historic results

What is the approximate volume of documents?
- Working assumptions are confirmed
- Greater predictability for the duration of review
- Which pricing model to apply: hourly or fixed price per document

What are the file types?
- Impact on timing and workflow requirements
- Historic file-type management on this type of case
- Identify any special types of skilled reviewers needed

What are the average pages per document? What are the average pages per GB? What are the average documents per GB?
- Collectively, these three questions assist in identifying an atypical document population (a small worked example follows this list)
- Such an identification can alert us to special staffing concerns before the review begins
- Comparison against historic workflow profiles for anomalies that may impact timing and other services, such as privilege log creation or redaction

How many custodians?
- Comparison against historic hit rates
- Prioritization for workflow and best practices
- Staffing needs

How many issue tags?
- Historic responsiveness rates compared to the current case
- Best practices favor 10 tags or fewer
- Discussion of potential areas of data uncertainty prior to review, so that data may be strategically batched to mitigate the costly re-review that results from a client protocol change

What drives your purchasing decision to choose one provider over another?

Is there a feature or aspect of your current service that you consider important? Why?
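As a quick illustration of how the three volume ratios above can flag an atypical document population before review begins, here is a minimal sketch in Python. The metric names, collection figures, and historic ranges are hypothetical, offered only to show the arithmetic; they are not drawn from the engagement described in this article.

    # Illustrative sketch only: the ratios and thresholds are hypothetical.
    # It shows how the three volume questions can flag an atypical population.

    def profile_population(total_docs, total_pages, total_gb):
        """Return the three intake ratios for a collection."""
        return {
            "pages_per_doc": total_pages / total_docs,
            "pages_per_gb": total_pages / total_gb,
            "docs_per_gb": total_docs / total_gb,
        }

    # Hypothetical historic ranges drawn from prior matters of the same type.
    HISTORIC_RANGES = {
        "pages_per_doc": (2, 15),
        "pages_per_gb": (3_000, 12_000),
        "docs_per_gb": (500, 4_000),
    }

    def flag_anomalies(profile, ranges=HISTORIC_RANGES):
        """List any ratio that falls outside the historic workflow profile."""
        flags = []
        for metric, value in profile.items():
            low, high = ranges[metric]
            if not low <= value <= high:
                flags.append(f"{metric} = {value:,.1f} outside historic range {low}-{high}")
        return flags

    if __name__ == "__main__":
        current = profile_population(total_docs=85_000, total_pages=2_400_000, total_gb=40)
        for warning in flag_anomalies(current):
            print("REVIEW PLANNING ALERT:", warning)

In this toy run, the unusually high pages-per-document and pages-per-GB figures would prompt the kind of staffing and workflow conversation the questions above are designed to trigger.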
We classified data for its relevant business purpose as a precursor to creating a seed set for predictive analytics. We compared the effectiveness of the tool to an existing in-house product. We identified best practices for seed set creation protocols, and can share some lessons learned about the process that will benefit future clients.
For testing, we utilized a sample batch of data consisting of a mix of Excel, Word, PowerPoint, Adobe PDF, and MS Outlook email. The test data set was initially provided, and we then proceeded to analyze the data, create categories for classification, identify a seed set, and run an automated classification process on the remainder. The results were provided to us and we loaded them into a Relativity database for testing. The deliverable included a list of identified categories, a list of documents used in the seed set, and a load file listing all documents and their corresponding category. Once all categorization sets were completed, we built saved searches to identify discrepancies. We employed a human element to validate the classifications performed and to create a blind seed set for comparison.

The subject matter expertise of the engagement engineer plays a role in the way that seed sets are created. The new classification technology was able to classify a higher percentage of documents and showed better optimization across multiple file types than any of the in-house categorization sets created by incumbent products. The ability to classify on a relevant business purpose with a robust file identification engine is perhaps one of the largest differentiators between competing technologies. The human intelligence married to the artificial intelligence of machine learning is an important step in the iterative process of seed set creation. Subject matter knowledge differs from person to person based on understanding of the type of case, the case in point, familiarity with the use of technology, and professional experience and exposure to the documents and concepts that clients provide for production. The blind classification set created by the subject matter expert was found to match favorably (72%) with the machine learning classification performed by our tool.
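To make that comparison concrete, the following is a minimal sketch of how agreement between the expert's blind seed set and the machine's categories might be measured once both are exported as load files. The file names and field names are hypothetical; the approach is simply a document-by-document comparison of category assignments.

    # Hedged sketch: compare a blind human classification against machine
    # output loaded from two CSV load files. Field names are hypothetical.

    import csv
    from collections import Counter

    def load_categories(path, doc_id_field, category_field):
        """Read a load file and return {document_id: category}."""
        with open(path, newline="", encoding="utf-8") as f:
            return {row[doc_id_field]: row[category_field] for row in csv.DictReader(f)}

    def agreement_report(machine, human):
        """Compare the two classification sets document by document."""
        shared = machine.keys() & human.keys()
        matches = sum(1 for doc_id in shared if machine[doc_id] == human[doc_id])
        disagreements = Counter(
            (human[doc_id], machine[doc_id])
            for doc_id in shared
            if machine[doc_id] != human[doc_id]
        )
        return matches / len(shared), disagreements

    if __name__ == "__main__":
        machine = load_categories("machine_categories.csv", "DocID", "Category")
        human = load_categories("blind_seed_set.csv", "DocID", "Category")
        rate, disagreements = agreement_report(machine, human)
        print(f"Agreement: {rate:.0%}")  # e.g. the 72% figure reported above
        for (expected, predicted), count in disagreements.most_common(5):
            print(f"  SME said {expected!r}, machine said {predicted!r}: {count} docs")

The disagreement counts are as useful as the headline percentage: the most common expert-versus-machine category pairs point at where the seed set needs another iteration.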
LESSONS LEARNED

- Classification yields a higher quality of prioritized data, different from a linear review.
- Define categories and identify where overlap occurs. The output is a prioritized classification of potentially responsive material; certain categories may require a second look as part of the iterative process, prior to managed review.
- The client should be encouraged to provide a list of responsive terms and privilege names during custodial collection, for the purpose of data mapping and classification, prior to the project kickoff.
- A Potential Privilege filter can be applied based upon a list of counsel names, mitigating the impact of inconsistent coding in a traditional linear review (a small sketch of this idea follows this list).
- On a case-by-case basis, confirm with your vendor who from their pool of candidates and subject matter experts will be provided for supervised machine learning and seed set creation.
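Here is a minimal sketch of the Potential Privilege filter mentioned above, assuming the client has supplied counsel names and outside-firm email domains. The sample names, domains, and email field layout are hypothetical; the point is simply that flagged items can be batched to the privilege team before first-pass review.

    # Hedged sketch of a "Potential Privilege" filter. Counsel names, domains,
    # and the email dictionary layout are hypothetical assumptions.

    COUNSEL_NAMES = {"jane doe", "richard roe"}          # supplied by the client
    COUNSEL_DOMAINS = {"outsidefirm.com", "lawdept.example.com"}

    def is_potentially_privileged(email):
        """Flag an email whose participants include known counsel."""
        participants = [email.get("from", "")] + email.get("to", []) + email.get("cc", [])
        for address in participants:
            address = address.lower()
            if any(domain in address for domain in COUNSEL_DOMAINS):
                return True
            if any(name in address for name in COUNSEL_NAMES):
                return True
        return False

    def batch_for_review(emails):
        """Split a collection into privilege-team and standard review batches."""
        privilege, standard = [], []
        for email in emails:
            (privilege if is_potentially_privileged(email) else standard).append(email)
        return privilege, standard

Because the flag is applied once, up front, the same document cannot receive two inconsistent privilege calls from two different first-pass reviewers.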
If your time to you is worth savin', to summarize: proactive pre-processing classification takes a large corpus of unstructured data and organizes it around a central business purpose or theme. This categorization prioritizes, and in turn reduces, the number of documents that undergo a traditional linear first-pass review for responsiveness. A reduced volume of documents leads to a reduced labor cost, where fewer reviewers are needed to accomplish the same task, perhaps in fewer hours, days, or weeks. The potentially responsive documents are classified and prioritized around the relevant purpose, and the potentially non-responsive documents are set aside for later review, if necessary. No coding decisions to tag have been made at this stage. Neither have non-responsive documents had to be processed in order to determine that they do not meet the threshold requirements for responsive production.

Through the use of proactive classification, we have transformed managed review into an engineered review. It is a more efficiently staffed project. We train and qualify our review team on the classification of the data and the alignment of who, what, when, and how. Everything that we learned from the classification process is a point of knowledge for the case, and this is conveyed through the delivery of a production binder documenting every step taken, for defensibility. Better trained reviewers make fewer material errors because the training on the quality control process is very robust. Productive reviewers complete batches faster because they are not distracted by uncategorized linear data. Rather, they are tuned in to the proactive prioritization of the classifications. Thus, they are more likely to spot outliers and departures in behavior patterns, analyze sentiment in a message, and spot differences not readily found in a traditional linear review (see Frame: Classification Use Case).
Classification Use Case

Classification organizes the data, and themes emerge. Trends and occurrences become readily visible as patterns of behavior: who was talking to whom, about what, and how and when did it occur? Every month for nine months, Smith and Jones had a meeting and exchanged three emails with six attachments. There were always three spreadsheets, one HR Word document related to goal measurement, a PowerPoint presentation for the board of directors, and an agenda. There were multiple drafts of the PowerPoint. There were requests for legal advice that made some of the documents potentially privileged. Documents involving lawyers and legal domain names, identified in advance through the use of classification tools as potentially privileged, were set aside for the privilege review team instead of having to be reviewed twice, at the risk of an inconsistent call. Classification identifies the frequency of events, conversations, and third parties to a conversation. Then one day, in the 10th month, Smith and Jones introduced Davis, a competitor, to the mix of their regularly patterned behavior. All of a sudden, Smith and Jones were scheduling a meeting with Davis to discuss fixing a price.

Consider the following questions. Could you have found that in a traditional linear review? When would you have found it? Would you have noticed the frequent pattern of behavior for nine months and then spotted the anomaly, Davis, in the 10th month? What if you had different reviewers on the two batches, a distinct likelihood? In a classification system, you could find it with frequency reports and then, using the iterative process of machine learning, train the machine to find other documents like that smoking gun whose existence was previously unknown. Data can be batched specific to this particular incident before reviewers are in their seats, and classification can provide valuable case knowledge in instances where you are not necessarily aware of what you did not know. One by-product for the corporation that engages in classification is an understanding of its data in terms of knowledge management. Classification can deliver reports on the frequency of nouns and verbs, both for defensibility of the process undertaken (for use in Rule 26 meet and confers) and for the identification of the next triggering event. In this manner, the wheel is not recreated each and every time there is a triggering event.
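The frequency reporting behind the Smith-and-Jones example can be sketched in a few lines. This is an illustrative toy, assuming messages with a date string and a participant list; it is not the classification tool's actual reporting, but it shows how a "Davis" who appears only in month ten stands out against nine months of stable behavior.

    # Simplified sketch of a participant-frequency report. The message
    # structure (date string, participant list) is a hypothetical assumption.

    from collections import defaultdict

    def monthly_participants(messages):
        """Map 'YYYY-MM' -> set of participants seen that month."""
        by_month = defaultdict(set)
        for msg in messages:
            month = msg["date"][:7]              # e.g. "2012-10" from "2012-10-15"
            by_month[month].update(p.lower() for p in msg["participants"])
        return by_month

    def find_new_participants(by_month, baseline_months=9):
        """Report participants who first appear after the baseline period."""
        months = sorted(by_month)
        baseline = set().union(*(by_month[m] for m in months[:baseline_months]))
        alerts = []
        for month in months[baseline_months:]:
            newcomers = by_month[month] - baseline
            if newcomers:
                alerts.append((month, sorted(newcomers)))
        return alerts

    # Usage: find_new_participants(monthly_participants(messages)) might return
    # [("2012-10", ["davis"])], surfacing the outlier before review begins.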
Applications include: Classification Services for Information Governance; Due Diligence and Audit Support; Data Mining on Physical Records (aka "What's in the Box?"); Records Validation and Verification.

Services include: Subject matter expertise (SME) in machine learning and data extraction; classification training and certification for clients and partners.

Products include: Haystac RetenGine, which processes enterprise data, and Haystac Web, which processes data on the internet.

Contact: +1 781-820-7616. Email: [email protected]. On the web: http://www.haystac.com. To read more from Haystac, please visit http://www.haystac.com/whitepapers.
Client 2 is a Fortune 100 commercial bank. Because we have a very deep understanding of this bank's litigation matters, we undertook three custom tasks that would be considered atypical by any vendor standard in the e-discovery industry. While many providers would shy away from undertaking such projects, these were the perfect test cases to employ technology, identify efficiencies, and share results both with our banking client and with other companies who face the same challenges (see Frame: Why is data classification a good idea for your organization?).

Ugly data is poor-quality data that originated as a paper document at some point in its life. One easy-to-digest example is the process of contract execution, where a contract was printed and signed, then scanned and sent to a counterparty or additional signatory for signing, where it was re-scanned and returned. That is at least three generations removed from the original. Depending upon the quality of the printout and scan, there may be some loss of fidelity during conversion to TIFF and OCR. Recent work for clients in the oil and gas industry required the cleanup of a fax document for the production of a maintenance report related to a well (see Figures 1 and 2).
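For readers who want a feel for what such cleanup can involve, here is one possible pre-OCR pass using the open-source Pillow and pytesseract libraries. The filter choices, threshold, and file name are illustrative assumptions, not the workflow actually used on the well maintenance fax described above.

    # One possible cleanup pass for a low-fidelity fax or third-generation
    # scan before OCR. Filter choices and threshold are illustrative only.

    from PIL import Image, ImageFilter
    import pytesseract

    def clean_and_ocr(path, threshold=160):
        """Grayscale, despeckle, and binarize a scanned page, then run OCR."""
        page = Image.open(path).convert("L")                  # grayscale
        page = page.filter(ImageFilter.MedianFilter(size=3))  # remove fax speckle
        page = page.point(lambda p: 255 if p > threshold else 0)  # binarize
        return pytesseract.image_to_string(page)

    if __name__ == "__main__":
        text = clean_and_ocr("well_maintenance_fax.tif")  # hypothetical file name
        print(text[:500])

Even a modest pass like this can turn unsearchable "ugly data" into text that is good enough to classify, which is the prerequisite for everything else described in this section.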
The transfer of assets and collection of work product across several vertical markets has resulted in the records for each asset being compiled into a single PDF, usually with no index. This condition is prevalent in the oil and gas and mortgage industries, where the records associated with an asset are created as these large PDFs. The holder of these PDFs is forced to reconstruct the original document collection in order to determine the presence of critical records and/or recreate a database of key attributes contained within the documents. In addition, the quality of the OCR text is usually poor, severely limiting the usefulness of search-based interrogation. Manually splitting these PDFs into their original documents is an expensive and time-consuming process.

We were able to train on a seed set of documents and automatically split 21 loan files into 1,900 PDFs, the original document set, accurately identifying the logical document breaks and auto-classifying each document to a high level of accuracy. New document naming conventions are auto-generated, usually by appending the page range of the new document to the original file name (see Figure 3). The client provided a list of 13 categories into which to place documents. For comparison, we had our Haystac technology go head to head with human reviewers. The technology was able to categorize all of the documents and left fewer documents in the OTHER category than the off-shore human review team did. The advantage of Haystac's machine-based process is quicker recognition of error patterns and their correction, thus eliminating the inherent variability of human judgment. The process can be applied to millions of pages of PDFs and produce results in a fraction of the time of its manual counterpart.
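The splitting step itself is mechanical once the logical breaks are known. Below is a minimal sketch using the open-source pypdf library, with the break pages passed in by hand; it is not the Haystac splitter, and the break detection (the hard part) is assumed to have happened upstream. It does, however, reproduce the naming convention of appending a page range to the original file name.

    # Minimal sketch: split one large PDF at pre-detected logical breaks.
    # Uses the open-source pypdf library; break detection is assumed upstream.

    from pypdf import PdfReader, PdfWriter

    def split_loan_file(path, break_pages):
        """Split one large PDF at the given 0-based page indexes."""
        reader = PdfReader(path)
        boundaries = sorted(set(break_pages) | {0}) + [len(reader.pages)]
        stem = path.rsplit(".", 1)[0]
        outputs = []
        for start, end in zip(boundaries, boundaries[1:]):
            if start == end:
                continue
            writer = PdfWriter()
            for page_index in range(start, end):
                writer.add_page(reader.pages[page_index])
            out_name = f"{stem}_p{start + 1}-{end}.pdf"   # original name + page range
            with open(out_name, "wb") as f:
                writer.write(f)
            outputs.append(out_name)
        return outputs

    # Usage: split_loan_file("loan123.pdf", break_pages=[4, 9, 15]) would produce
    # loan123_p1-4.pdf, loan123_p5-9.pdf, loan123_p10-15.pdf, and loan123_p16-N.pdf.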
Why is data classification a good idea for your organization?

Classification equals preparedness for all stakeholders. Who in your enterprise is involved in making these decisions? Do you have any of these concerns?
- Indexing and remediation of legacy data for storage
- What are the new record-keeping requirements under Dodd-Frank?
- How will we meet the new statutory regimes for reporting under Dodd-Frank?
- Are we holding data for too long?
- Periodic M&A events that require great due diligence
- Regulatory compliance
- Future litigation

C-Suite Management
Classification provides a material benefit to C-Suite stakeholders. Reduction of labor cost occurs at the most variable portion of a managed review. Improved productivity reaches higher-priority data, where strategic decisions are made. Greater accuracy allows the production of data sooner. Classification can establish cost-effective predictability for compliance and mitigate the costs found in a risk profile.

Records Management
Classification can reduce annual storage costs at the terabyte and petabyte level, and defensible deletion reduces enterprise risk. A repository allows clear insight into the language used to discuss common business events:
- Who was talking to whom,
- When these conversations were occurring, and
- Identification of a pattern of expected behavior, thus enabling the visibility of outliers, anomalies, and departures from the pattern: in essence, needles in a haystack.
Classification enables the creation of a corporate repository and promotes the reusability of data, so that you no longer have to recreate the wheel.
Auto Extraction of Text for Logical Document Determination

Poor-quality text documents can constitute a significant percentage of stored documents. Scanned documents are typically stored as TIFF or PDF files on file servers and in email archives, and are usually poorly indexed, making them hard to find using enterprise search engines. In addition, important records stored in boxes and files are also poorly indexed at the box or file level, making the box or file contents blind to the enterprise. Manually indexing these documents is resource intensive and costly, yet locating important records is very meaningful for satisfying audit, investigatory, and document control objectives, as well as for meeting information governance requirements. Document titles are often a key indicator of the purpose of a document, so accurately and cost-effectively determining the title means a document's importance as a record can be determined. Determining the title allows classifying the document to a business purpose using database mapping.

Using a soft, dictionary-based approach to identifying document titles, a dictionary has been compiled from common business function-based documents and is supplemented with actual document headers gleaned by sampling client data. Image processing extracts the title fragment, and algorithmic processing determines the most probable title match. The user interface contains an editor which allows the user to view machine results, enter new headers, and correct errors. On this task, we extracted document titles that were meaningful to the useful categorization of the poor-quality OCRed documents.
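As an illustration of the dictionary-matching idea (not Haystac's implementation), the Python standard library's difflib module can resolve a noisy OCR title fragment to the closest dictionary entry. The dictionary entries, cutoff, and fragment handling below are assumptions made for the sketch.

    # Hedged sketch of dictionary-based title matching over noisy OCR text.
    # Dictionary entries and the similarity cutoff are illustrative assumptions.

    import difflib

    TITLE_DICTIONARY = [
        "promissory note",
        "deed of trust",
        "title insurance policy",
        "loan application",
        "well maintenance report",
    ]

    def probable_title(ocr_first_lines, cutoff=0.6):
        """Return (best_matching_title or None, raw_fragment)."""
        # Take the first non-empty OCR line as the candidate title fragment.
        fragment = next((line.strip() for line in ocr_first_lines if line.strip()), "")
        matches = difflib.get_close_matches(fragment.lower(), TITLE_DICTIONARY,
                                            n=1, cutoff=cutoff)
        return (matches[0] if matches else None), fragment

    # Usage: probable_title(["  PROMlSS0RY N0TE  ", "..."]) should still resolve
    # to "promissory note" despite the OCR character substitutions.

Fuzzy matching against a curated dictionary is what makes this workable on third-generation scans, where exact string search would miss the garbled header entirely.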
There were 68 fields of entry on a custom reporting document, for which items such as loan #, amounts, codes, dates, borrower names, mortgage lenders, title insurance, and other information were required. Our auto-extraction technology was able to accurately populate the data into an Excel spreadsheet in response to the government request for production.
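A rough sense of how such field extraction can work is sketched below: regular-expression templates per field, one row per document, written to a CSV file that opens in Excel. The three patterns shown are hypothetical stand-ins for a handful of the 68 fields; the real engagement used our auto-extraction technology, not this script.

    # Illustrative sketch of template-based field extraction into a
    # spreadsheet-friendly CSV. Field patterns are hypothetical examples.

    import csv
    import re

    FIELD_PATTERNS = {
        "loan_number": re.compile(r"Loan\s*(?:No\.?|#)\s*[:\-]?\s*(\S+)", re.I),
        "amount": re.compile(r"Amount\s*[:\-]?\s*\$?([\d,]+\.?\d*)", re.I),
        "borrower": re.compile(r"Borrower\s*[:\-]?\s*(.+)", re.I),
    }

    def extract_fields(document_text):
        """Pull each field's first match out of one document's OCR text."""
        row = {}
        for field, pattern in FIELD_PATTERNS.items():
            match = pattern.search(document_text)
            row[field] = match.group(1).strip() if match else ""
        return row

    def write_report(documents, out_path="extraction_report.csv"):
        """documents: iterable of (doc_id, text). Writes one row per document."""
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["doc_id"] + list(FIELD_PATTERNS))
            writer.writeheader()
            for doc_id, text in documents:
                writer.writerow({"doc_id": doc_id, **extract_fields(text)})

Blank cells in the output double as a quality-control report: any document where a required field failed to extract can be routed to a human for a second look.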
CONCLUSION

A fully defensible, engineered review process mitigates a client's risk profile. Reduction in legal spend through a more efficient engineered review, where fewer attorneys are needed for first-level review, is, in essence, doing more with less. A corporate knowledge base is created, signifying an advance in the reuse of data. Accuracy and robust quality control protocols enable the direction and allocation of litigation spend towards higher-value legal functions, sooner. Oh, the times, they are a-changin'.
Benjamin S. Marks is a consultant on eDiscovery and Information Governance initiatives. Most recently, he assisted in the development of a document review center in Charlotte, North Carolina, and in a new product introduction for an e-Discovery service provider. An entrepreneurial, strategic-minded lawyer with a business operations background, Ben's prior work on staffing managed reviews affords him the insight to identify subject matter expertise for teams, develop proactive workflows, and assemble responses to RFPs. Prior to law school, Ben was the founder of Eco Specialties and Design, an environmentally themed promotions company. Today, when he's not building seed sets or reading about Dodd-Frank's impact on enterprise risk management, Ben follows Orioles baseball, attends live music events, enjoys cooking, and runs with his puggle in Baltimore, Maryland. He holds a J.D. and an Environmental Certificate from Pace University School of Law.