IBM Classification Module
Wei-Dong Zhu, Solomon Barron, Maurizio Gallotti, Vijay Gupta, Xiaomei Wang, Josemina Magdalen, Jane Singer
ibm.com/redbooks
International Technical Support Organization
IBM Classification Module: Make It Work for You
October 2009
SG24-7707-00
Note: Before using this information and the product it supports, read the information in Notices on page xix.
First Edition (October 2009)
This edition applies to Version 8, Release 6, Modification 0 of IBM Classification Module (product number 5724-T45).
© Copyright International Business Machines Corporation 2009. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Table of Contents
Figures  ix
Notices  xix
Trademarks  xx
Preface  xxi
The team who wrote this book  xxii
Become a published author  xxiv
Comments welcome  xxiv
Part 1. Core product information  1
Chapter 1. Overview  3
1.1 Introducing Classification Module  4
1.2 How Classification Module works  5
1.2.1 Input and output  6
1.2.2 Configuration process  7
1.3 Business use cases  9
1.3.1 Achieving and maintaining regulatory compliance  10
1.3.2 E-mail management  11
1.3.3 Customer communication management  12
1.3.4 Enterprise content management  13
1.3.5 Other use cases  15
1.4 Integration options  16
1.4.1 Stand-alone Classification Module  16
1.4.2 Integration with IBM FileNet P8  17
1.4.3 Integration with IBM Content Collector  18
Chapter 2. Concepts, architecture, tools, and integration  21
2.1 Classification concepts  22
2.1.1 Knowledge base  22
2.1.2 Classification workflow  23
2.1.3 Core classification technology  25
2.1.4 Decision plan  26
2.1.5 Field definitions  28
2.2 Core components  29
2.2.1 Classification Workbench  29
2.2.2 Classification Module server  30
2.2.3 IBM FileNet P8 integration asset  34
2.3 Classification Workbench  34
2.3.1 Create and configure knowledge base project overview  36
2.3.2 Taxonomy Proposer  37
2.4 Classification Module server  39
2.4.1 Management Console  40
2.4.2 Classification Module server deployment configurations  45
2.5 Classification Module APIs  48
2.6 Classification Module integration architecture  49
2.6.1 Classification Center  50
2.6.2 Content Extractor  57
2.7 Additional reference  57
Chapter 3. Working with knowledge bases and decision plans  59
3.1 Importance of knowledge bases and decision plans  60
3.2 Creating, training, and analyzing a knowledge base  60
3.2.1 Preparing data for import  61
3.2.2 Building a knowledge base  62
3.2.3 Analyzing and learning with a knowledge base  76
3.3 Creating and analyzing a decision plan  82
3.3.1 Creating a decision plan  83
3.3.2 Analyzing a decision plan  93
3.4 Building a knowledge base advanced topics  99
3.4.1 Using an uncategorized or partially categorized content set  100
3.4.2 Using keywords  101
3.4.3 Enabling the knowledge base with feedback processing  102
3.4.4 Working with offline (deferred) feedback  108
3.4.5 Handling overlapping categories  109
Chapter 4. Knowledge base 101  119
4.1 Use cases  120
4.2 Working with Classification Module  121
4.2.1 Typical life cycle of a knowledge base  122
4.2.2 Natural Language Processing  123
4.2.3 Classification Module implementation success  124
4.2.4 Classification Module implementation summary  124
4.3 Defining categories  125
4.3.1 Become familiar with the problem and your data  125
4.3.2 Choosing categories  126
4.3.3 Getting good sample content for each category  128
4.3.4 Discover the categories from your data  128
4.3.5 Separate content and categories for separate uses  128
4.3.6 A detailed example of the Self-Help/CRM scenario  130
4.4 Preparing the content set  131
4.4.1 Data sources for knowledge base building  131
4.4.2 The nature of the content  132
4.4.3 Cleaning corpus texts  133
4.4.4 Various types of texts: Their use and content types  134
4.4.5 Overcoming the lack of data: Using initialization data  137
4.5 Building and training the knowledge base  139
4.5.1 Building a knowledge base with existing categories and data  139
4.5.2 Building a knowledge base when the categories are unclear  140
4.5.3 Training as opposed to online learning (feedback)  140
4.5.4 Discover categories from your data using Taxonomy Proposer  141
4.6 Analyzing knowledge base performance: Identifying the problems  141
4.6.1 Analyzing your knowledge base  141
4.6.2 Sources of analysis data  143
4.6.3 Scores used for analysis  144
4.6.4 Understanding your analysis data  144
4.6.5 Running analysis in Classification Workbench  145
4.6.6 Measures of accuracy  146
4.6.7 The process of analysis  147
4.6.8 Automation and accuracy  147
4.6.9 The number of categories returned by matching  148
4.6.10 Typical analysis process (report interpretation)  148
4.6.11 Reviewing Cumulative Success scores  149
4.6.12 Understanding precision and recall  150
4.7 Fine-tuning your knowledge base: Fixing the problems  151
4.7.1 Overlapping categories (categories stealing from each other)  152
4.7.2 Low scores in all categories  153
4.7.3 Category does not represent intent of a message  154
4.7.4 Possible multiple intent categories  154
4.7.5 Human error (poor manual classification)  155
4.7.6 Poor sampling (randomization issues)  156
4.7.7 Identifying hidden subcategories  156
4.7.8 Poorly performing category with too few examples  157
4.7.9 Identifying obsolete categories  158
4.7.10 Automation thresholds  158
4.8 Maintaining your knowledge base over time: Using feedback  160
4.8.1 Working with feedback: Avoiding the problems  161
4.9 Analyzing the knowledge base in production  164
4.9.1 Adding and removing categories  165
4.10 Application design considerations  166
4.10.1 Matching  166
4.10.2 Feedback  167
4.10.3 Using Classification Module to route texts  169
4.10.4 Gathering feedback in an application  170
4.10.5 Using read-only and read-write knowledge bases  170
4.10.6 Retraining a knowledge base after a major reorganization using Classification Workbench  171
4.10.7 Design considerations for filtering systems  172
4.10.8 User interface for application feedback  173
4.11 Summary  174
Part 2. Integration details  175
Chapter 5. Integration with IBM FileNet P8  177
5.1 Solution introduction  178
5.1.1 Integration architecture  178
5.1.2 Use case  180
5.2 Integration steps overview  182
5.3 Enabling the integration  183
5.3.1 Installing integration component for IBM FileNet P8  183
5.3.2 Configuring IBM FileNet P8 for integration  185
5.3.3 Configuring connectivity between Classification Module and IBM FileNet P8  198
5.4 Training Classification Module with existing IBM FileNet P8 content  204
5.4.1 Using Content Extractor  205
5.4.2 Creating a knowledge base  215
5.4.3 Creating a decision plan  215
5.4.4 Analyzing a decision plan and its related knowledge base  234
5.4.5 Exporting knowledge bases and decision plans from Classification Workbench  235
5.4.6 Deploying the knowledge base and decision plan using Management Console  242
5.5 Configuring and performing classification  249
5.5.1 Classification Center overview  249
5.5.2 Working with the Classification Center  252
5.5.3 Configuring Classification Center  257
5.5.4 Verifying IBM FileNet P8 structures  269
5.5.5 Performing classification  273
5.6 Reviewing and fine-tuning classification  276
5.6.1 Classification Center review function overview  276
5.6.2 Providing feedback to the knowledge base  276
5.6.3 Configuring the Classification Center for review  277
5.6.4 Reclassifying a document  283
Chapter 6. Integration with IBM Content Collector for File Systems  291
6.1 Solution introduction  292
6.1.1 Overview of IBM Content Collector for File Systems  292
6.1.2 Integration architecture  292
6.1.3 Use case description  294
6.2 Integration steps overview  295
6.3 Enabling the integration  295
6.3.1 Installing Classification Module client components on the Content Collector server  295
6.3.2 Registering Classification Module with Content Collector  296
6.4 Validating the integration  297
6.4.1 Windows registry entry  297
6.4.2 Classification Module task in Content Collector Configuration Manager  298
6.4.3 System metadata  299
6.5 Configuring the integration system  300
6.5.1 Defining a knowledge base and the Classification Module field definitions  301
6.5.2 Building a file system archiving task route  305
6.5.3 Activating the system for archiving  330
6.6 Performing the file system archiving  332
Chapter 7. Integration with IBM Content Collector for E-mail  339
7.1 Solution introduction  340
7.1.1 Overview of IBM Content Collector for E-mail  340
7.1.2 Integration architecture  341
7.1.3 Use cases  342
7.2 Integration steps overview  343
7.3 Enabling the integration  344
7.3.1 Configuring IBM Content Collector server  344
7.3.2 Configuring Classification Module server  346
7.3.3 Validating the integration  348
7.4 Use case 1: E-mail archiving with automatic classification  352
7.4.1 Creating a knowledge base  352
7.4.2 Verifying and starting the servers  354
7.4.3 Pre-task route configuration  362
7.4.4 Creating an e-mail task route  374
7.4.5 Activate the integration system  418
7.4.6 Verify the automatic classification and archiving results  423
7.5 Use case 2: E-mail classification with records management  426
7.5.1 Modify the existing task route to add the P8 4.x Declare Record task node  427
7.5.2 Verify record declaration results  431
Related publications  433
IBM Redbooks publications  433
Online resources  433
How to get IBM Redbooks publications  434
Help from IBM  434
Index  435
Figures
1-1 Automated classification  4
1-2 A simple classification scheme  7
1-3 Classification Module operational schematic  9
1-4 Automatic mapping to records file plan  11
1-5 Mapping old structures to new  14
1-6 Customer support and retention management process  15
1-7 A simple Classification Module implementation  17
1-8 Integration with IBM FileNet P8 content repository  18
1-9 IBM Content Collector integration with IBM Content Manager (CM8)  19
2-1 A KB project for HR categories  24
2-2 Inside the IBM Classification technology  25
2-3 A decision plan project  27
2-4 Field definition settings within Classification Workbench  28
2-5 Classification Module server system architecture  31
2-6 Classification Module server processes: Single server mode  33
2-7 Classification Workbench  35
2-8 Taxonomy Proposer  37
2-9 Taxonomy Proposer Workflow Assistant  38
2-10 Typical workflow of a Classification Module system  39
2-11 Management Console connection parameters  41
2-12 Management Console application  41
2-13 Management Console: Knowledge bases administration tab  42
2-14 Management Console: Decision plans administration tab  43
2-15 Management Console: Field definitions administration tab  44
2-16 Management Console: Servers administration tab  45
2-17 Management Console: Servers administration host tab  45
2-18 Single server configuration  46
2-19 Multiple server configuration  47
2-20 Multiple listener configuration  48
2-21 Classification Module client API architecture  49
2-22 Classification Module integration process workflow architecture  50
2-23 Classification Center architecture  51
2-24 Classification server start command window  52
2-25 Classification Center landing page  53
2-26 Classification Center Configuration tab  54
2-27 Classification Center Dashboard tab  55
2-28 Classification Center Review tab  56
3-1 Classification Workbench: Knowledge base development cycle  61
3-2 File system folders as a content set  62
3-3 Starting Classification Workbench  63
3-4 New knowledge base project window  64
3-5 Creating a project by importing a content set  65
3-6 Content set format: Files from a file system folder  65
3-7 Selecting the root folder of a content set  66
3-8 Applying file filters  67
3-9 Editing category names for file system folders  68
3-10 Specifying language and text filter settings  68
3-11 Project view with imported content items  69
3-12 The Body field properties  70
3-13 Options of creating and analyzing a knowledge base  72
3-14 Create a new knowledge base and delete existing knowledge bases  73
3-15 Analysis options  74
3-16 Status information of creating and analyzing a knowledge base  75
3-17 Content set view with match fields  75
3-18 Generating Cumulative Success report and Total Precision vs. Recall Graph  77
3-19 Total cumulative success in Knowledge Base Data Sheet for HR  78
3-20 Cumulative Success Report for HR  79
3-21 Total Precision vs. Recall graph for HR  80
3-22 Create Knowledge base, using existing knowledge base structure  81
3-23 Learn during analysis  82
3-24 New decision plan project window  84
3-25 Creating an empty project  85
3-26 Project overview of the new empty decision plan  85
3-27 Add an existing knowledge base project as a referenced project  86
3-28 Add a knowledge base project to your decision plan  86
3-29 Defining a rule group  87
3-30 Adding a new rule  88
3-31 Rule 1 Properties tab  89
3-32 Rule 1 Trigger tab  89
3-33 Rule 1 Add Actions  90
3-34 Rule 2 Properties tab  91
3-35 Rule 2 Trigger tab  91
3-36 Rule 2 Add Action  92
3-37 Decision plan view with two rules and one knowledge base  93
3-38 Content set file format: CSV  94
3-39 Select CSV file  94
3-40 Editing property of the Message_body field  95
3-41 Setting the Content type value to the Message_body field  96
3-42 Launching the Analyze Decision Plan wizard  96
3-43 Report options of analyzing a decision plan  97
3-44 Decision plan summary report  98
3-45 Analyzed content set  99
3-46 Taxonomy Proposer  100
3-47 Keywords CSV file for the knowledge base creation  102
3-48 Management Console logon  103
3-49 Knowledge base menu  104
3-50 Feedback: Defer processing  105
3-51 Restart knowledge base  106
3-52 Learn using active view  108
3-53 Analyze knowledge base using active view  109
3-54 View reports  110
3-55 Knowledge Base Data Sheet report  111
3-56 Pairs of categories with overlapping intents  112
3-57 Example of overlapping categories: Gift Certificates and Gift Wrap  112
3-58 Category graphs and tables  113
3-59 Stealing/stolen table  113
3-60 Content item scoring graph  114
3-61 Show Items of the Gift Wrap category  115
3-62 Select all items belonging to the Gift Wrap category  115
3-63 Categorize highlighted items  116
3-64 Select the Gift Certificates category to apply to highlighted items  116
3-65 Deleting the Match n fields  117
3-66 Knowledge Base Data Sheet with improved cumulative success  118
4-1 Analysis process  143
4-2 Precision/Recall graph  151
5-1 Classification Module integration with IBM FileNet P8  179
5-2 Classification Center Integration Components menu  183
5-3 Basic or Custom installation  184
5-4 Features selection  185
5-5 IBM FileNet P8 folder structure under icm_integration folder  186
5-6 The Classification Workbench category structure in the Knowledge Base Editor window  186
5-7 Start IBM FileNet Enterprise Manager  187
5-8 Select IBM FileNet P8 server to integrate with Classification Module  188
5-9 Create a new AddOn  188
5-10 New AddOn configuration  189
5-11 Open the XML file for the new AddOn  190
5-12 Verify the correct import file name and path for the new AddOn  191
5-13 Displaying New AddOn in the list of AddOns  192
5-14 Install the new AddOn  193
5-15 Select the AddOn to install  194
5-16 New AddOn installed confirmation  194
5-17 Refresh the object store to see the new imported properties  195
5-18 Modify properties of document class  196
5-19 Document Class Property Definitions tab  197
5-20 Select all ICM_ prefixed properties  198
5-21 Edit setupCommandLine.sh  199
5-22 The startConnectTest.bat file  200
5-23 Edit the WcmConfig.properties file  201
5-24 Change wcmconfig parameter in the WcmConfig.properties file  201
5-25 Edit the WcmApiConfig40.properties file  202
5-26 Testing connection  203
5-27 Verify Classification Module connection to FileNet P8 is successful  204
5-28 Extractor.properties file path  206
5-29 Edit Extractor.properties  207
5-30 Default directory for XML output from Content Extractor  208
5-31 Path for the object store from which to extract content  208
5-32 Output window of running startExtractor.bat  210
5-33 Output of running Content Extractor with the -m parameter  211
5-34 Directory where the output of the Content Extractor file is located  212
5-35 ExtractorOutput.txt  212
5-36 After content extraction completes  213
5-37 XML output file from the content extraction  214
5-38 Output binary content  214
5-39 Launch Classification Workbench  217
5-40 Classification Workbench: Workflow assistant  217
5-41 Create a new project (decision plan or knowledge base)  218
5-42 Create an empty project  219
5-43 New rule for decision plan  219
5-44 New Rule Properties tab  220
5-45 Set the trigger for the new rule  220
5-46 Add actions to the new rule  221
5-47 Display decision plan actions  223
5-48 Display configured action  224
5-49 Decision plan  225
5-50 Create a new Review rule  226
5-51 Define the trigger condition for Review rule  227
5-52 Define Review rule based on score result from the select_branch knowledge base  228
5-53 Define if score is less than 80% for Review rule  229
5-54 Add action to the Review rule  229
5-55 Review rule action  231
5-56 Action for Review rule is created  232
5-57 Decision plan with new Review rule  233
5-58 Save the new decision plan: branches  234
5-59 Open project in Classification Workbench  235
5-60 Select knowledge base project  236
5-61 Start the Export Wizard in Classification Workbench  236
5-62 Classification Workbench Export Wizard: Welcome window  237
5-63 Classification Workbench Export Wizard: Select knowledge base for export  238
5-64 Select knowledge base version 6.x for export  239
5-65 Export knowledge base confirmation window  239
5-66 Open decision plan: branches  240
5-67 Export the decision plan  241
5-68 Export decision plan  241
5-69 Export decision plan in .dpn format  242
5-70 Export decision plan to a file  242
5-71 Launch Management Console  243
5-72 URL for the Classification Module listener component  243
5-73 Add decision plan  244
5-74 Select the Classification Module server to deploy the decision plan  245
5-75 Warning that the knowledge base associated with this decision plan is missing  246
5-76 Add knowledge base in Management Console  246
5-77 Deployed knowledge base in Management Console  247
5-78 Deployed knowledge base information  248
5-79 Start decision plan and associated knowledge bases  249
5-80 Classification Center main page  250
5-81 Verify server port is correct  252
5-82 Verify that the decision plan and knowledge bases are started  253
5-83 Log on to Workplace  254
5-84 Starting the Classification Center  255
5-85 Launching the Classification Center application  255
5-86 Log on to IBM FileNet P8  256
5-87 Warning message when first starting the Classification Center  256
5-88 Classification Module: Configuration page  258
5-89 Classification Module configuration settings  259
5-90 IBM FileNet P8 settings  260
5-91 Add fields if they do not already exist  262
5-92 Field mapping  263
5-93 Mapping the IBM FileNet P8 document property to the IBM Classification Module field  263
5-94 Edit classification filter  265
5-95 Configure content to classify  266
5-96 Choose the folder and the document class  267
5-97 Edit runtime settings for classifying content  268
5-98 Configure runtime setting for reviewing documents  269
5-99 Open decision plan  270
5-100 Open knowledge base that is referenced by the decision plan  271
5-101 View categories for the knowledge base named select_branch  272
5-102 Start classifying documents  274
5-103 Classifying documents with elapsed time showing  275
5-104 Click the Review icon to start the document review process  278
5-105 Filter Settings for reviewing documents  279
5-106 Edit Filter Settings  280
5-107 Document filter  282
5-108 Review documents  284
5-109 Review decision history  285
5-110 Reclassify documents  285
5-111 Reclassify document to additional categories  286
5-112 Save document in XML format  287
5-113 Add a document to IBM FileNet P8 for classification analysis  288
5-114 Browse to the document location to add it for classification analysis  288
5-115 Review the decision history for document classification  289
6-1 Architecture overview of the integration  293
6-2 Installing Classification Module client components on the Content Collector server  296
6-3 Validating that the Windows registry entry of ibm.ctms.utilityconnector.ICMClassificationTask exists  298
6-4 Classification Module task  299
6-5 Classification Module system metadata properties in Content Collector Configuration Manager  300
6-6 Selecting the Add knowledge base menu option  302
6-7 Adding a knowledge base to the Classification Module server  304
6-8 Field definitions  305
6-9 File system archiving task route to create  306
6-10 The Task Routes explore pane and Toolbox of the Content Collector Configuration Manager  307
6-11 Add icon for adding a new task route  308
6-12 Creating a new task route from a blank task route  309
6-13 A new blank task route  309
6-14 Selecting the FSC Collector  310
6-15 Adding an FSC Collector to the task route  311
6-16 Selecting the P8 4.x Create Document task  313
6-17 Adding a P8 4.x Create Document task to the task route  314
6-18 Selecting the IBM Classification Module task  316
6-19 Adding a Classification Module task to the task route  317
6-20 Configuring the Classification Module task  318
6-21 Adding the Decision Point to the task route  319
6-22 Add conditional clause  320
6-23 Configuring the rule of the left branch of the task route  321
6-24 Configuring the P8 4.x File Document in Folder task  323
6-25 Configuring the P8 4.x Declare Record task  325
6-26 Configuring the FSC Post Processing task  326
6-27 Configuring the rule of the right branch of the task route  327
6-28 Configuring the P8 4.x File Document in Folder task  328
6-29 The complete task route created in Configuration Manager  329
6-30 Starting the knowledge base  330
6-31 Activating the file system collector for your task route  331
6-32 Starting Task Routing Engine  332
6-33 The HR LegalCase folder before the Content Collector file system archiving operation  333
6-34 File share files waiting for capture  333
6-35 Non-captured files in the monitored file system folder  334
6-36 HR LegalCase folder after Content Collector file system archiving operation  335
6-37 Documents in the HRReview folder waiting for review  336
6-38 Records declared in the P8 file plan object store  337
7-1 Integration architecture  341
7-2 Client only installation of Classification Module server  345
7-3 Internet Explorer Options setting  347
7-4 Validating that the Windows registry entry of ibm.ctms.utilityconnector.ICMClassificationTask exists  349
7-5 Classification Module task  350
7-6 IBM Classification Manager system metadata  351
7-7 HR knowledge base  354
7-8 Management Console knowledge base status  355
7-9 Classification Center Configuration tab  356
7-10 IBM Content Collector Task Route service status  357
7-11 Starting Configuration Manager  358
7-12 Starting IBM Content Collector 2.1  358
7-13 IBM Content Collector Quickstart Tutorial  359
7-14 Configuration Manager User Interface  360
7-15 FileNet P8 Workplace  361
7-16 IBM FileNet Records Manager  362
7-17 Data Store settings  363
7-18 FileNet P8 Connector settings  365
7-19 E-mail Server Connector settings: General tab  367
7-20 E-mail Server Connector settings: Connection tab  368
7-21 E-mail Server Connector settings: Active Directory tab  369
7-22 Metadata and Lists configuration: System Metadata  370
7-23 E-mail Services Client Configuration settings  371
7-24 E-mail Services: Configuration Web Service settings  372
7-25 E-mail Services: Information Center settings  373
7-26 E-mail Services: Web Application Client settings  374
7-27 Create New Task Route  375
7-28 Creating new task route from a blank task route  376
7-29 Empty task route showing start and end task nodes  376
7-30 Start task node: General tab settings  377
7-31 Selecting EC Collect E-Mail By Rules task node  378
7-32 EC Collect E-Mail By Rules: General tab settings  379
7-33 EC Collect E-mail By Rules: Schedule tab settings  380
7-34 EC Collect E-mail By Rules: Collection Source tab settings  381
7-35 EC Collect E-mail By Rules: Collection Source addition  382
7-36 EC Collect E-mail By Rules: Filter tab settings  383
7-37 Selecting EC Extract Metadata task node  384
7-38 EC Extract Metadata: General tab settings  385
7-39 Selecting EC Prepare E-mail for Archive task node  386
7-40 EC Prepare E-mail for Archive: General tab settings  387
7-41 Adding EC Finalize E-mail for Compliance task node  388
7-42 EC Finalize for Compliance: General tab settings  389
7-43 Adding P8 4.x Create Document task node  390
7-44 P8 4.x Create Document: General tab settings  392
7-45 Adding Classification Module task node  394
7-46 Utility Classification Module: General Properties  396
7-47 Adding a decision point  398
7-48 Conditional clause for Score < 70 rule  399
7-49 Score < 70 decision rule: General tab settings  400
7-50 Add new rule  401
7-51 Rule General Properties  402
7-52 Conditional metadata rule  403
7-53 Adding Create P8 4.x File Document in Folder task node  404
7-54 P8 4.x File Document in Folder: General tab settings  406
7-55 Adding P8 4.x File Document in Folder task node  407
7-56 FileNet P8 4.x File Document in Folder: General tab settings  409
7-57 Adding EC Prepare E-mail for Stubbing task node  410
7-58 EC Prepare E-mail for Stubbing: General tab settings  412
7-59 New link  413
7-60 New link connecting P8 4.x File Document in Folder task node to EC Prepare E-Mail for Stubbing task node  414
7-61 Adding EC Create E-Mail Stub task node  415
7-62 E-mail Server - EC Create E-mail Stub: General Properties  417
7-63 Active flag for task route  419
7-64 Start Task Routing Engine service  420
7-65 Activating the audit log  421
7-66 Audit Log Node General Properties  422
7-67 Folder created based on top category Redbook/HR/Stock Options  424
7-68 Folder created based on top category Redbook/HR/Pay  425
7-69 Folder created based on score < 70% - Redbook/manualReview  426
7-70 Adding the P8 4.x Declare Record task node  428
7-71 P8 4.x Declare Record Classification folder selection  429
7-72 P8 4.x Declare Records: General tab settings  430
7-73 E-mail declared as records  431
Notices
This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. The following company names appearing in this publication are fictitious: Fictional Auto Rental Company A. This name is used for instructional purposes only.

COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
xix
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol ( or ), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both: AIX DB2 Domino FileNet Global Business Services IBM InfoSphere iSeries Lotus Notes Lotus Notes OmniFind Quickr Redbooks Redbooks (logo) Sametime Tivoli
The following terms are trademarks of other companies: FileNet and the FileNet logo are registered trademarks of FileNet Corporation in the United States, other countries or both. Java, JDBC, Solaris, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Active Directory, Excel, Internet Explorer, Microsoft, MS, Outlook, SharePoint, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.
xx
Preface
IBM Classification Module (Classification Module) Version 8.6 is an advanced enterprise software platform tool designed to allow organizations to automate the classification of unstructured content. By deploying the module in various areas of a business, organizations can reduce or avoid manual processes associated with subjective decision making around unstructured content. Organizations can also streamline the ingestion of that content into their business systems in order to use the information within the business systems more effectively. At the same time, the organizations can safely remove irrelevant or obsolete information and therefore utilize the storage infrastructure more efficiently. By reducing the human element in this process, Classification Module ensures accuracy and consistency and enables auditing while simultaneously driving down labor costs. This IBM Redbooks publication explains what Classification Module does, the key concepts to understand when working with Classification Module, and its integration with other products and systems. With this book, we show you how Classification Module helps your organization to automate the classification of large volumes of unstructured content in a consistent and accurate manner. The topics that are covered include building, training, and fine-tuning the knowledge base, creating decision plans, working with Classification Workbench, and step-by-step integration with other products and solutions. This book is intended to educate both technical specialists and non-technical personnel in how to make Classification Module work for your organizations. Changes in product version and name: This book is written based on Classification Module Version 8.6. The new Version 8.7 has since been released. The product also has a new name: IBM InfoSphere Classification Module. For the purpose of accuracy, Version 8.6 was known in the field as IBM Classification Module (and also Classification Module), and this name is used in the rest of the book. What's new: You can view what's new from the following Information Center topic: http://publib.boulder.ibm.com/infocenter/classify/v8r7/topic/com.ibm.classify.nav.doc/whats_new87.htm
xxi
xxii
Xiaomei Wang is a Senior Technical Consultant in Enterprise Content Management, Business Partner Technical Enablement team, in IBM U.S. She assists IBM Business Partners with integrating IBM Discovery and Enterprise Content Management products into their solutions. Xiaomei holds a Master of Science degree in Computer Science from York University. She joined IBM in 1999 and has held numerous positions in the company, such as the DB2 UDB Advanced Support Specialist, the Information Management Business Development Manager, and the Technical Sales Specialist. Xiaomei is a Certified IBM IT Specialist Professional and a Board Member. In addition, she is an IBM Certified Solution Experts (IBM Content Manager, IBM FileNet P8, DB2 UDB Database Administration and Application Development, and Business Intelligence), and a Prentice Hall book author. The following people from the Classification Module software development team, IBM Israel Software lab, have also contributed to writing this book: Josemina Magdalen is a Senior Team Leader and Architect at Israel Software Group (ILSL). She has a background in Natural Language Processing (text classification and search, as well as text mining and filtering technologies). Josemina joined IBM in 2005 and has worked in the Content Discovery Engineering Group performing software development projects in text categorization, filtering, and search, as well as text analytics. Prior to joining IBM, Josemina has worked in Natural Language Processing research and development (Machine Translation, Text Classification and Search, and Data Mining) for over ten years. Jane Singer is a member of the Advanced Engagement Team. She has worked as a Support Engineer in the Software Group with a focus in the Information Management area, supporting IBM OmniFind Discovery Edition, OmniFind Enterprise Edition, and IBM Classification Module. Jane is part of the IBM Israel Software Development Lab in Jerusalem, Israel, and she has worked within the Quality Assurance group and supported presales for IBM Classification Module for many years. Jane holds a Masters Degree in Library and Information Science and a Ph.D. in Musicology, both from the Hebrew University of Jerusalem. Very special thanks to the entire IBM Classification Module software development team in Jerusalem, Israel. Without their gracious and enthusiastic assistance, we would not have been able to produce this book. Specifically, we would like to thank the following people from the team for their contribution to this book: Shimon Stark Adina Taragin Oren Paikowsky Yariv Tzaban Steve Kirshner
xxiii
Victoria Mazel Eliahu Ben-Reuven Aryeh Krassenstein Tamar Lavee Boruch Nager Eitan Porat Vladislav Rybak Nir Salansky Dmitry Shusterman Denis Voloshin IBM Software Development Lab, Jerusalem, Israel We want to thank the following people from IBM U.S. who have contributed to this book project: Chuck Beretz Srinivas Varma Chitiveli Joshua Payne IBM Software Development Lab, United States
Comments welcome
Your comments are important to us. We want our books to be as helpful as possible. Send us your comments about this book or other IBM Redbooks publications in one of the following ways:
xxiv
Use the online Contact us review IBM Redbooks publications form found at: ibm.com/redbooks Send your comments in an e-mail to: [email protected] Mail your comments to: IBM Corporation, International Technical Support Organization Dept. HYTD Mail Station P099 2455 South Road Poughkeepsie, NY 12601-5400
xxv
xxvi
Part 1
Chapter 1.
Overview
This chapter provides both technical specialists and non-technical personnel with an introduction to IBM Classification Module (Classification Module) and the advanced classification technology that it uses. After reading this chapter, you can establish the relevance of Classification Module to your project and have a basic understanding of how to architect a classification solution. In this chapter, we discuss the following topics:
Introducing Classification Module
How Classification Module works
Business use cases
Integration options
Changes in product version and name: This book is written based on IBM Classification Module Version 8.6. The new Version 8.7 has since been released. The product also has a new name: IBM InfoSphere Classification Module. For the purpose of accuracy, Version 8.6 was known in the field as IBM Classification Module, and this name is used in the rest of the book. What's new: You can view what is new for Version 8.7 at the following Information Center topic: http://publib.boulder.ibm.com/infocenter/classify/v8r7/topic/com.ibm.classify.nav.doc/whats_new87.htm
No matter what action the application takes, we want to ensure that the text categorization is as accurate as possible. To ensure accuracy, we build a knowledge base that reflects the nature and structure of the world that it is trying to understand. In addition, the categories in the knowledge base are carefully selected to reflect real-life business processes. In the next few sections, we explain in simple terms the elements that are required in the classification solution. We provide a more detailed and technical description in Chapter 2, Concepts, architecture, tools, and integration on page 21.
The learning is maintained in an entity called a knowledge base. Classification Module allows you to test your knowledge base with known content so that you can be confident of its accuracy. You can further augment the textual analysis capabilities of a knowledge base with the addition of rule-based analysis, which is provided through a decision plan. A decision plan can provide a rule-driven framework to invoke multiple knowledge bases in a single call to Classification Module. Several knowledge bases covering separate domains can be maintained so that separate types of classification can be carried out and multiple classifications can be assigned to a single document. We describe detailed training and testing in later chapters of this book.
(Figure: knowledge base workflow showing the training, matching, review, and feedback phases, with matching returning relevancy scores for candidate categories)
Customer communication management:
  Routing customer communication efficiently
  Enabling automated query response
  Providing self-help (FAQ)
Enterprise content management:
  Generating taxonomy
  Ingesting content
  Creating content-centric workflow processes
  Managing records
A partial solution
Classification Module allows you to provide accurate classification and metadata information for these documents to enable timely search and retrieval of the information. Classification Module also allows businesses to apply appropriate access controls on this information. However, these functions are not the full picture. A truly compliant system must also ensure that information is not kept for longer than is necessary and that out-of-date information is disposed of in a timely manner. Because of these requirements, organizations tend to implement a certified records management system.
10
11
important e-mails appropriately. This automation can save hundreds of hours of manual effort and can make the difference between being compliant with regulations or not. E-mail management goes beyond organizing a company's stored e-mails. Incoming and outgoing e-mails also need to be actively managed so that they are accurately categorized on receipt or send. Classification Module can provide spam detection, separating important e-mails from spam or other types of unimportant messages. You can also use Classification Module for filtering by identifying messages of special interest and pinpointing a small percentage of relevant texts from a large body of texts.
12
In the case of inbound e-mails to an inquiry desk, Classification Module can assist in either analyzing the e-mail and directing the inquiry to the appropriate staff, or it can decide which standard response to use, given the nature of the inquiry. In the case of self-help or frequently asked questions (FAQ) systems, you can use Classification Module to return the most likely answers to an inquiry submitted to a self-help application.
Taxonomy generation
One of the primary objectives of moving to an ECM system is the ability to apply structure and more granular control to the unstructured information in an organization. Deciding on an appropriate classification scheme can be extremely time-consuming. Classification Module includes a tool specifically designed to propose a category list (a taxonomy) based on an analysis of a representative set of content. It divides the content into conceptually similar clusters and then proposes names for them (which organizations have the option to rename later). You can run the process in an unattended manner, or you can run it interactively with a user who can guide the system to fine-tune the clustering and naming exercise. Taking this approach, an organization can save significant time and effort in arriving at a classification scheme, which can be used as the basis for intelligently importing the large numbers of files that typically reside on network file shares.
13
The use of Classification Module in this scenario can ensure that all content is moved into the new environment accurately and consistently regardless of the size of the task.
After this stage of content loading is complete, Classification Module can continue to provide an interface to the new system for content being produced by line of business applications that do not have their own direct interface into the new content repository.
14
latency, as well as eliminating the potential for introducing errors. Figure 1-6 shows how you can use the module in the context of such a process.
(Figure 1-6: a content-centric workflow in which IBM Classification Module evaluates incoming content, for example to determine whether it comes from a high-value customer, and routes it accordingly)
Records management
Records management is a critical element of an ECM system's ability to deliver compliance within an organization. We describe compliance and records management in 1.3.1, Achieving and maintaining regulatory compliance on page 10.
15
Report management based on report content concepts
Structuring OmniFind Enterprise search results along a category tree for simple navigation
What it provides
When implemented as a Web service or as part of a service-oriented architecture (SOA), all applications within an organization can use Classification Module for immediate classification or metadata tagging of a specific text stream. Both the provision of input content and the handling of the output results are the responsibility of the implementer. Classification Module has extensible and well-documented application programming interfaces (APIs) in Java, C, and Component Object Model (COM), as well as Web Services Description Language (WSDL). Sample code that is delivered as part of the installation process provides examples of integration with Microsoft SharePoint Web parts and Microsoft Word 2007. As an example, for a Microsoft SharePoint integration, you can enable a workflow to call Classification Module and add metadata, such as a category, to a Microsoft SharePoint document. For a Microsoft Word 2007 integration, you can automatically suggest a category for a document when it is either opened or saved.
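To illustrate this integration pattern in general terms, the following Java sketch shows how an application might wrap the classification call behind a small interface and attach the returned category to a document as metadata, which is the kind of tagging that a SharePoint or Word integration performs. Everything in the sketch (the TextClassifier interface, the stub implementation, and the metadata field name) is illustrative scaffolding for the pattern, not the actual Classification Module API.

import java.util.HashMap;
import java.util.Map;

public class TaggingExample {

    // A thin abstraction that the application codes against; in a real
    // integration this method would delegate to the Classification Module
    // client API or its Web service interface.
    interface TextClassifier {
        String topCategory(String text);
    }

    public static void main(String[] args) {
        // Stub classifier used only to keep the sketch self-contained.
        TextClassifier classifier = text ->
                text.toLowerCase().contains("401k") ? "401k" : "General";

        String documentText = "Please review the attached 401k enrollment form.";

        // Attach the suggested category to the document as metadata.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Category", classifier.topCategory(documentText));

        System.out.println(metadata);
    }
}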
16
(Figure: Classification Module uses statistical methods (knowledge bases) and rule- and keyword-based methods (decision plans) to determine document classification)
What it provides
IBM FileNet P8 integration provides a Web-based application called the Classification Center. The Classification Center enables IBM FileNet P8 documents from one or more folders to be submitted to Classification Module for automatic classification. Through the decision plan and its IBM FileNet specific functions, documents in IBM FileNet P8 content repository can be moved from their existing folders to other locations, metadata can be added to the documents, and documents can even be declared as records with an automatically chosen file plan location.
17
Note: Starting from Version 8.7, Classification Module also provides predefined integration with IBM Content Manager.
18
Figure 1-9 IBM Content Collector integration with IBM Content Manager (CM8)
19
20
Chapter 2. Concepts, architecture, tools, and integration
21
Categories are the basic elements within the knowledge base. Categories can represent the textual content of a text, or they can indicate another attribute, such as its source. A category has a set of features, which are known as concepts, that characterizes a category and distinguishes it from other categories. The creation and maintenance of these concepts are internal to Classification Module and are not controlled by the user. Instead, the creation and maintenance of these concepts are affected only by the training or learning processes.
22
23
Note: For step-by-step instructions to create a knowledge base using Classification Workbench, refer to Chapter 3, Working with knowledge bases and decision plans on page 59.
24
Document Filter Manager: Strips out the rich formats that are applied to documents, such as Word, Excel, and PDF documents. The document filter identifies the language encoding and converts any other encoding into standard Unicode plain text.
NLP (Natural Language Processing) Engine: Extracts concepts from free-text fields and generates a Semantic Modeling Language (SML) document. A concept is a basic unit of linguistic or quantitative information, derived from input text and its context, that can influence classification.
Semantic Modeling Engine: Performs statistical pattern matching on the SML by comparing it with the content of categories residing in the KB to generate relevancy scores for the top matching categories.
Figure 2-2 illustrates the classification technology.
(Figure 2-2: input documents pass through document filtering, NLP concept extraction, and semantic modeling to produce suggested categories and relevancy scores as output, with feedback flowing back into the knowledge base)
25
Note: It is beyond the scope of this book to discuss in further detail how the technology works internally to perform classification. We highlight the three-phase process instead of going into more detail.
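To make the three-phase flow easier to picture, here is a compact Java sketch of the same shape: a filtering step that produces plain text, a concept-extraction step, and a matching step that scores the result against each category. The classes and the scoring logic are purely illustrative stand-ins, not the product's internal components.

import java.util.List;
import java.util.Map;

public class ThreePhaseSketch {

    public static void main(String[] args) {
        String rawDocument = "<html><body>Please review the 401k enrollment form.</body></html>";

        // Phase 1: document filtering - strip formatting down to plain text.
        String plainText = rawDocument.replaceAll("<[^>]+>", " ");

        // Phase 2: concept extraction - here simply lower-cased tokens.
        List<String> concepts = List.of(plainText.toLowerCase().trim().split("\\W+"));

        // Phase 3: semantic matching - score each category by how many of its
        // characteristic concepts appear in the document.
        Map<String, List<String>> categoryConcepts = Map.of(
                "401k", List.of("401k", "enrollment", "retirement"),
                "Health Care", List.of("medical", "insurance", "claim"));

        categoryConcepts.forEach((category, features) -> {
            long hits = features.stream().filter(concepts::contains).count();
            double score = (double) hits / features.size();
            System.out.println(category + ": " + score);
        });
    }
}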
Rules consist of Triggers and Actions. A trigger determines the conditions that must be met to initiate an Action. Each rule has exactly one trigger; however, it
can have multiple actions. An action specifies the operation that Classification Module performs when the rule's trigger returns true. The action can be to store a document in a specific folder or document class, to move or copy a document from one folder to another folder, to declare a document as a record so that it can be placed under the control of a records management system, or to extract and update metadata information from the document. You can also configure triggers and actions based on content or metadata. A decision plan can use one or more knowledge bases for a combination of rule-based and knowledge-based classification. For example, you can add a knowledge base to your decision plan project, and you can define rules that are based on matches and scores. You can combine multiple rules into a group. A group is a logical collection of triggers and actions designed to achieve a certain task. You can create multiple groups within a decision plan and sequence them. Each group can then be dynamically activated or deactivated based on the rule evaluation of any previous group in the sequence. You can use Classification Workbench to configure decision plans and groups of decision plans. Figure 2-3 on page 27 shows a decision plan project within the Classification Workbench tool, which uses a KB as a reference project.
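As a concrete illustration of these concepts (Chapter 3 walks through the actual Workbench dialogs), a simple decision plan group might contain two rules along the following lines. The category, field, and folder names are only examples, and the trigger wording is schematic rather than exact product syntax; the action line of Rule 2 reuses the action syntax shown in Chapter 3.

Rule 1: Identify top category
  Trigger: true (fires for every incoming document)
  Action: classify the document against the HR knowledge base and write the
          top-scoring category into a user-defined field, for example DocCategory

Rule 2: File for review
  Trigger: DocCategory is Health Care
  Action: add_to_content_field '$P8:File' 'Content_OS/icm_integration/HealthCareReview'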
26
Note: For instructions to create a decision plan using Classification Workbench, refer to Chapter 3, Working with knowledge bases and decision plans on page 59.
27
Content type definitions are used in determining the final relevancy score of a
document. Higher weight is assigned if matches are found against a field with a
subject content type as opposed to matches found against a field with a body
content type. Figure 2-4 shows a sample field definition property association dialog box using the Classification Workbench tool.
28
For a detailed discussion about configuring field properties and their usage, refer to Chapter 3, Working with knowledge bases and decision plans on page 59.
29
use Classification Workbench to apply learning to an existing knowledge base (KB), and improve its effectiveness. Typically, you create a KB in stages. Prior to using Classification Workbench, you collect sample data (for example, e-mails and documents) representative of the data that you expect to classify using your own applications or scripts under a file system or an enterprise content management system. Then, you arrange the collected sample data under categories or logical groups that you define for your enterprise. Note: In case a predefined category does not exist, there are several ways in which Classification Module can help you with categorization needs. One way is to use a tool called Taxonomy Proposer. For more information about Taxonomy Proposer, refer to 2.3.2, Taxonomy Proposer on page 37. You import this data (including sample documents and categories) into Classification Workbench to create a content set file. Classification Workbench provides a variety of features and techniques that allow you to fine-tune the content set to optimize knowledge base accuracy. The knowledge base is a single file encapsulating data that the Classification Module requires for accurate classification. The KB will change and improve over time as it learns new categories and adapts to changes in data received by the system. Refer to 2.3, Classification Workbench on page 34 for a description of the functions of Classification Workbench. Chapter 3, Working with knowledge bases and decision plans on page 59 provides a detailed overview and step-by-step instructions to work with Classification Workbench.
30
(Figure: Classification Module server architecture, in which client applications and the Management Console connect to the listener, which routes requests to the administration process, the knowledge base and decision plan read-write processes (KB R/W, DP R/W), their read-only instances (KB R/O, DP R/O), and the data server (D-server))
Listener
This component serves as the entry point to the system. Client requests are sent to the listener, which then dispatches them to the appropriate server-side component for processing. Requests related to a specific knowledge base are
31
routed to the corresponding read-write instance, and administration requests are routed to the administration component.
Read-write process
This process handles requests to the knowledge base or decision plan, such as matching, feedback, language identification, and modifications made to the knowledge base or decision plan. There is only one read-write instance for each knowledge base (KB R/W) and one read-write instance for each decision plan (DP R/W). Depending on the workload, the read-write instance might use a load-balancing algorithm to forward read-only requests to an available read-only instance.
Read-only process
This optional process handles read-only requests that are forwarded by the read-write instance. Common read-only requests include matching, language identification, and viewing the structure of a knowledge base or decision plan. You can configure any number of read-only instances for a given knowledge base or decision plan and configure them to run on any number of computers. You can have multiple knowledge base read-only instances (KB R/O) and multiple decision plan read-only instances (DP R/O).
Administration process
This process handles all global administration requests, including requests that originate from the Classification Module administration tool called Management Console. Refer to 2.4.1, Management Console on page 40 for a discussion of this administration tool. The administration process is configured to run on a specific server when the Classification Module is installed. Communication between the various server-side components is accomplished through SOAP. Figure 2-6 on page 33 shows a sample configuration of a Classification Module server in a single server mode.
32
Data server
The data server (D-server) is a proprietary data storage mechanism that is used by the Classification Module server to persist information that is required by the Classification Module server. Types of information include server configuration information, knowledge base information, decision plan information, and feedback information.
SOAP layer
Applications can interact with the Classification Module system by using SOAP. The SOAP layer wraps the native Classification Module function calls. The definition for the SOAP interface is provided by a Web Services Description Language (WSDL). For example, the Classification Module might be included as a Web reference in any .NET application using the WSDL.
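As a rough illustration of what a SOAP-based call looks like from the client side, the following Java sketch posts a SOAP envelope to the listener with the standard java.net.http classes. The endpoint URL, namespace, and operation name used here are placeholders; the real operation names and message formats come from the WSDL that is provided with the product.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SoapSuggestExample {
    public static void main(String[] args) throws Exception {
        // Placeholder values: take the real endpoint, namespace, and
        // operation name from the product WSDL.
        String endpoint = "http://localhost:18087";
        String envelope =
            "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\">"
          + "  <soapenv:Body>"
          + "    <Suggest xmlns=\"urn:example-classification\">"  // hypothetical operation
          + "      <knowledgeBase>HR</knowledgeBase>"
          + "      <text>Please review the attached 401k enrollment form.</text>"
          + "    </Suggest>"
          + "  </soapenv:Body>"
          + "</soapenv:Envelope>";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Content-Type", "text/xml; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(envelope))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        // The response body is a SOAP envelope containing the suggested
        // categories and their relevancy scores.
        System.out.println(response.body());
    }
}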
Client APIs
Classification Module server provides client APIs to interface with applications to perform data layout and formatting. Classification Module includes C, Component Object Model (COM), and Java client APIs. Classification Module supports several configuration options to accommodate varying client needs. Regardless of the type of system configuration that you choose, you can administer all system components from a single point, because all configuration data is stored in a common data server. In most cases, you develop your own clients that fit in with the rest of your application suite.
33
Management Console
Management Console is the application that you use to administer a Classification Module system.
Classification Center
If you integrate Classification Module with IBM FileNet P8, you can use the Classification Center to manage the classification processes.
Content Extractor
The Content Extractor is a command-line tool that you use to extract the content from an IBM FileNet P8 object store.
34
With Classification Workbench, you can import a knowledge base or decision plan for analysis and fine-tuning. After fine-tuning the performance, you can export the knowledge base or decision plan back to the Classification Module server. Figure 2-7 shows you the Classification Workbench application, which was started using Start → Programs → IBM Classification Module 8.6 → Classification Workbench.
Client applications for Classification Module interact with knowledge bases and decision plans in a variety of ways. In this section, we discuss how you can perform these tasks: Create and configure knowledge bases Create and configure decision plans
35
You can create a decision plan that includes knowledge base classification. The steps that you follow to create a decision plan that includes knowledge base classification are: 1. Create a knowledge base. 2. Add the knowledge base to your decision plan project. 3. Define rules based on matches and scores.
36
You can run your sample content set items through the decision plan to see if the appropriate action was taken on the required content set items. You can check the performance of a decision plan in production by exporting analysis data from the Classification Module server and importing the data into Classification Workbench. For detailed steps, refer to Chapter 3, Working with knowledge bases and decision plans on page 59.
After you run the Taxonomy Proposer, you can create a knowledge base for Classification Module by manually reviewing documents in each cluster or suggested category and renaming the suggested categories based on document content. You import the categorized content set into Classification Workbench to build the knowledge base and, later, fine-tune the knowledge base performance. The Taxonomy Proposer is installed with Classification Workbench and runs in a Windows environment.
37
To start the Taxonomy Proposer, double-click the TaxonomyProposer86U.exe file that is located in the ICM_home\Classification Workbench\Program Files directory. The Taxonomy Proposer displays the Workflow Assistant, which guides you through typical scenarios. You can also access the Workflow Assistant from the Help menu. Figure 2-9 shows the Taxonomy Proposer Workflow Assistant.
38
Note: We encourage you to have your own predefined categorization. It is beyond the scope of this book to cover Taxonomy Proposer in detail. We recommend using the Workflow Assistant for guidance on using the Taxonomy Proposer.
(Figure: overall workflow in which classified documents, or unclassified documents processed by the Taxonomy Proposer into a category structure, feed Classification Workbench for training, analyzing, and tuning knowledge bases and decision plans, which are then deployed to the ICM server for use by ICM and ECM applications, including the P8 integration)
Classified documents (with well-defined structured categories) are imported into the Classification Workbench application as a content set. Unclassified documents can be assigned automatically generated categories by processing these documents through Taxonomy Proposer and then importing them into Classification Workbench as a content set. In Classification Workbench, the content set is used to create (or train) a knowledge base. The knowledge base can be analyzed with another content set to understand the accuracy and effectiveness of the knowledge base. Active feedback can be applied to help the system learn about recent changes and incorporate them into the knowledge base. You can also apply other techniques as described in Chapter 3, Working with knowledge bases and decision plans on page 59 to fine-tune the knowledge base. This process is called tuning.
39
The result of the Classification Workbench process of training, analyzing, and tuning is a knowledge base that, combined with a decision plan, can then be deployed to a Classification Module server, where training, analyzing, and tuning continue as internal processes. After the Classification Module server is running with knowledge base read-only or read-write processes and decision plan read-only or read-write processes, applications and other clients can invoke functions on this Classification Module server through the APIs, requesting classification services such as suggest and decide. Certain IBM ECM applications, such as IBM FileNet P8, can connect directly to the Classification Module server when the integration is enabled. Classification Module in this case can classify documents and automatically file them in the correct folder structure in IBM FileNet P8. We discuss other use case scenarios in the second half of this book. Regarding the management capabilities of the Classification Module server, we discuss the administrative graphical interface of the Classification Module system and the various Classification Module server deployment configurations. In this section, we describe Management Console and Classification Module server deployment configurations.
40
After you connect to the Management Console, you see the Management Console application interface. Figure 2-12 shows the Management Console application with all the administration tabs available.
41
Remove knowledge bases from the Classification Module server. Figure 2-13 shows the Knowledge bases administration tab inside the Management Console application. For detailed administration, refer to Chapter 3, Working with knowledge bases and decision plans on page 59.
42
43
44
Figure 2-17 shows the Servers administration tab inside the Management Console with the host name view. It shows all the knowledge base and decision plan processes running on a particular host.
45
knowledge base and decision plan), and read-only processes (for knowledge base and decision plan), run on the same computer. A single server configuration is useful on multiprocessor computers and in small-scale or development environments. On a single server, there can only be one listener, one knowledge base read-write (KB R/W), and one decision plan read-write process (DP R/W). However, you can have multiple read-only processes (KB R/O) for the same knowledge base or multiple read-only processes (DP R/O) for the same decision plan on a single server. Figure 2-18 shows the single server configuration mode of the Classification Module server.
Clients are either custom applications that use the client libraries or applications for integration with IBM FileNet P8. The applications communicate with the listener that runs within the server through SOAP. The listener forwards the request to the appropriate server-side component for handling. Administration requests are forwarded to the administration process, and requests on a specific knowledge base or decision plan are routed to the read-write instance of that knowledge base or decision plan. Depending on the nature of the request and the workload, the read-write process either handles the request by itself or forwards the request to a read-only instance of the knowledge base or decision plan.
46
47
ensure the availability of a specific knowledge base, read-only instances are distributed on all of the servers. Figure 2-20 shows the multiple listener configuration mode of the Classification Module server.
48
The Classification Module provides several client API libraries to enable the rapid development of various client applications in several programming languages, such as C, C++, C#, and Java, as well as scripting languages, such as Active Server Page (ASP) or VBScript. Samples that demonstrate system functionality and how to use the various client libraries are provided for several programming languages as part of the code samples with Classification Module installation. Figure 2-21 shows the Classification Module client API architecture. The .NET clients interface with the Classification Module server using the WSDL directly with the SOAP layer. Other clients, such as Java, C, and COM use their corresponding APIs to invoke SOAP function calls on the Classification Module server.
(Figure 2-21: client API architecture in which .NET clients use the WSDL to call the SOAP layer directly, while Java, COM, and C/C++ clients use their respective APIs to issue SOAP calls to the ICM server components: the listener, the administration process, and the knowledge base and decision plan read-write and read-only processes)
49
In this section, we describe applications that get installed along with the Classification Module FileNet P8 Integration platform. We discuss the following two applications: Classification Center Content Extractor
50
whereas the Spring beans use the Classification Module APIs (using the SOAP layer) to invoke Web service methods to Classification Module server.
The Classification Center includes three major tasks that you use to configure classification options, start and monitor the classification processes, and review classification decisions. In this section, we describe each of the Classification Center tasks: Configuration, Dashboard, and Review. Before you browse to the Classification Center application Web page, start the Classification Center server by running Start → Programs → IBM Classification Module 8.6 → Classification Center → Start Classification Center server. Figure 2-24 on page 52 shows the command window that shows the status of the Classification Center server.
51
After the Classification Center server is started, launch the Classification Center application by running Start → Programs → IBM Classification Module 8.6 → Classification Center → Classification Center. Figure 2-25 on page 53 shows the landing page of the Classification Center application.
52
Configuration
You can specify how you want Classification Module to classify documents into folders and document classes in IBM FileNet P8. To configure the setting, click the Configuration icon from the Classification Center main page. You can configure the following settings: General Settings: Specify settings, such as the decision plan to use, the IBM FileNet P8 object store to use, the metadata mappings between the IBM
53
FileNet P8 document and the fields in Classification Module, and whether to declare documents as records. Content to Classify: Specify from what folders or document classes the content is to be classified. Runtime Settings: Select settings, such as how resources are used and the number of threads to use for classifying content. Figure 2-26 shows the Configuration tab in the Classification Center application.
54
Dashboard
With the dashboard, you can start, stop, and monitor the classification processes. You can view process statistics, such as how many documents were classified. You can view summary information, such as which folders or document classes received the greatest number of documents. You can also view the event log and error log. Figure 2-27 shows the Dashboard tab in the Classification Center application.
Review
With the Review tab, you can review the classification results and, if necessary, reclassify documents. When reviewing a document, you can either confirm that the document is correctly classified or select various categories and actions and reclassify the document. By reviewing documents, you help verify that the system performs as expected and help ensure that the correct folders, document classes, and decision plan actions are applied during classification. In addition, when you review documents and manually select appropriate categories, the system learns from your selection, thereby improving future classification.
55
You can also add documents to the IBM FileNet P8 repository. When you add a document, Classification Module analyzes its content and suggests how to classify it. You can then review and confirm the actions or reclassify the document just as any other document that is available for review. Figure 2-28 shows the Review tab in the Classification Center application.
Note: For more details about the usage of Classification Center, refer to 5.5, Configuring and performing classification on page 249.
56
57
58
Chapter 3. Working with knowledge bases and decision plans
59
60
3. Create, train, and analyze a knowledge base. 4. Evaluate knowledge base performance by generating and viewing summary reports and graphs. 5. As required, improve knowledge base performance by reanalyzing and learning. Figure 3-1 illustrates the typical stages of knowledge base development. Your workflow can vary.
(Figure 3-1: typical stages of knowledge base development in Classification Workbench, beginning with a preparation stage in which you create and edit a content set and import it)
61
content items derived from all files in the Health Care folder have a category name called Health Care.
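For example, a root folder laid out as follows (the folder names are simply an illustration consistent with the HR example used in this chapter) produces one category per subfolder:

C:\HR\
    401k\           (files become content items in the 401k category)
    Health Care\    (files become content items in the Health Care category)
    Pay\            (files become content items in the Pay category)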
62
63
64
2. In the Import Content Set window, as shown in Figure 3-6, determine the type of content set that you want to import. In this example, we select the Files from a file system folder radio button.
Figure 3-6 Content set format: Files from a file system folder
65
3. In the content set location window, as shown in Figure 3-7, specify the fully qualified path for the root folder from which files are to be imported. All files in the root folder and its subfolders are imported. In this example, click Browse, and locate the C:\HR root folder.
4. In the file filter window, when the list of files and folders is displayed, select or clear the check boxes for the files and folders that you want to import. In addition, the Apply Filters button allows you to perform these tasks: Select the folders that you want to include or exclude. Select the file extensions for the files that you want to include or exclude. Select an option to activate random file select. In this example, we use the default filter settings to include all files for import, as shown in Figure 3-8 on page 67.
66
5. In the window to specify category names for each file system folder that you selected, you can edit the Category Name column with a new category name for the folder. In this example, we accept the default folder names as the category names, as shown in Figure 3-9 on page 68.
67
6. Specify the language and text filter settings for the content set. In this example, we accept the default settings, as shown in Figure 3-10.
Figure 3-11 on page 69 shows the project view after importing the HR content set into the Classification Workbench.
68
69
In this example, we decide to assign the content type value to the Body field by double-clicking the Body field in the Project Details pane, as shown in Figure 3-11 on page 69, to display its Field Properties window. In the Field Properties window, the Content type pull-down list box provides the following five options:
Body: For e-mail environments only. Select this option for content fields that contain the main body text.
DocTitle: Select this option for content fields that contain document titles.
Plain Text: Select this option for content fields that contain textual content.
Sender: For e-mail environments only. Select this option for the Sender or From field that contains the e-mail address.
Subject: For e-mail environments only. Select this option for a Subject field that contains the subject of an e-mail message.
In this example, we select the Plain Text content type, as shown in Figure 3-12.
70
Run the Create, Analyze and Learn Wizard to create a knowledge base: 1. Click Create, Analyze and Learn Wizard on the toolbar to start the wizard. 2. Specify how to split the content set to create and analyze the knowledge base. Classification Workbench divides the content set in various ways, as indicated by the option names, as shown in Figure 3-13 on page 72. In this example, we select Create and analyze knowledge base using active view and the option of Create using even, analyze using odd. With these settings, the content set in the active view is divided into two parts. One part of the content set is used to create and train the knowledge base, and the other part of the content set is used to analyze the performance of the knowledge base when classifying data. In addition, the Create using even, analyze using odd option allows you to create and train a knowledge base using the content items listed in even-numbered positions and analyze its performance using the items in odd-numbered positions. Note: When you use Create using all, analyze using all, your test results will not be representative of knowledge base performance in a live environment, because the items you use to test the knowledge base are the same items as those that are used to create it. Because the knowledge base already knows these items, it can produce better-than-expected results.
71
3. Choose the default option to create a new knowledge base, deleting any existing knowledge base in this project if there is any, as shown in Figure 3-14 on page 73.
72
Figure 3-14 Create a new knowledge base and delete existing knowledge bases
4. Optionally, add match fields to content set items. In this example, we accept the default settings of adding and displaying five match fields, as shown in Figure 3-15 on page 74. When the Add match fields option is selected, the content set view window displays additional match columns after the analysis. These columns show the top matches identified by the Classification Module as most applicable to each tested content item. In this example, we choose to display five match fields. For each match field, the category's relevancy score appears in parentheses after the category name. Figure 3-17 on page 75 shows an example of the content set view with match fields. For example, the system determines that the 401k category is most applicable to content item 14; therefore, 401k appears in the Match1 column, followed by the score (99.29), which is the same category that you assigned to this content item. Using match fields is a good way to verify your initial categorization, especially if the system-suggested top category differs from the category that you assigned initially.
73
5. The Status Information window allows you to view the create and analyze processes as they progress, as shown in Figure 3-16 on page 75.
74
6. After the knowledge base creation and analysis, Classification Workbench displays a content set view with the match fields added, as shown in Figure 3-17.
75
76
Figure 3-18 Generating Cumulative Success report and Total Precision vs. Recall Graph
77
Figure 3-19 Total cumulative success in Knowledge Base Data Sheet for HR
To understand how well each category performs, you can run the Cumulative Success summary report for each category, as shown in Figure 3-20 on page 79.
78
Total Precision vs. Recall graph
The Total Precision vs. Recall graph provides an immediate, visual sense of the knowledge base's overall accuracy. The graph in Figure 3-21 on page 80 indicates that the HR knowledge base performs well: the curve is in the upper-right portion of the graph. When the curve is in the lower-left portion of the graph and the number of categories is small, the knowledge base's overall performance is poor. If the results are poor, you might want to view the Precision vs. Recall graph for each category to see if individual categories are particularly problematic.
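As a reminder of the standard definitions behind this graph: for a given category, precision is the fraction of the documents that the knowledge base assigned to the category that truly belong to it, and recall is the fraction of the documents that truly belong to the category that the knowledge base assigned to it:

    precision = correctly assigned items / all items assigned to the category
    recall    = correctly assigned items / all items that belong to the category

A curve toward the upper right therefore indicates that the knowledge base keeps both measures high at the same time.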
79
Note: If you are satisfied with the accuracy of the knowledge base from the initial creation and analysis, subsequent reanalyzing and learning with the knowledge base is optional.
Reanalyzing and recreating knowledge base based on existing knowledge base structure
If required, you can improve the knowledge base accuracy by reanalyzing and recreating the knowledge base based on existing knowledge base data: 1. Click Create, Analyze and Learn Wizard on the toolbar to launch the wizard. 2. In the Options window, select Create knowledge base using active view and the option of Create using even, analyze using odd.
80
3. In the Specify options for the selected process window, select Create knowledge base, using existing knowledge base structure, as shown in Figure 3-22.
Figure 3-22 Create Knowledge base, using existing knowledge base structure
4. In the Match Fields window, accept the default match field settings, and check Learn during analysis, as shown in Figure 3-23 on page 82. The Learn during analysis option enables the knowledge base to learn from categorized content items as the items are analyzed. All the content items that you designate for analysis are also applied to the knowledge base for learning. The knowledge base learns each text after it is analyzed: for each learned text, the new information pertaining to that text is added to the knowledge base, so the knowledge base is constantly receiving feedback. In this example, we have the HR content set containing 1000 items. We create the knowledge base using the even-numbered items and analyze the knowledge base using the odd-numbered items. In the process of learning during analysis, the feedback from each of the odd-numbered items during the analysis phase gets processed by the Classification Module server and is added to the knowledge base to enhance its accuracy. In the meantime,
81
each time that an item is analyzed, all previous feedback is utilized in the analysis.
5. Click Finish to continue. Wait for the system to analyze and process. 6. Click View Reports on the toolbar to open the View Reports window. 7. Check the following report and graph: Cumulative Success summary report Total Precision vs. Recall graph 8. View the reports and compare them against the previous versions of the reports.
82
together if necessary. Decision plan rules can refer to one or more knowledge bases for a combination of rule-based and content-based classification. This section introduces working with a decision plan in two typical scenarios: 1. Create a decision plan. 2. Analyze a decision plan.
83
3. Select one of the following options to create the project: Create a project by importing a content set. Create a project by importing an existing decision plan file. Create an empty project. In this example, we choose Create an empty project, as shown in Figure 3-25 on page 85.
84
4. Figure 3-26 shows the project overview of the newly created empty decision plan project. The subsequent steps of adding knowledge bases and configuring rules are required to complete the decision plan creation.
85
2. In the Add Project window, select the knowledge base project that will be used in the decision plan. In this example, we select the HR project from the project list and click Add, as shown in Figure 3-28.
86
Creating rules
The decision plan in this example is to identify and file the Health Care documents among all incoming documents in a designated IBM FileNet P8 folder for review. You can fulfill this task using the following two rules: Rule 1: Identify the most relevant top category of each incoming document. Rule 2: If the document belongs to the Health Care category, file it in a designated FileNet P8 folder for review. Perform the following steps to create these two rules: 1. Define a rule group. In this example, we define a rule group by renaming the default New Group. Right-click New Group and select Rename to define its name, as shown in Figure 3-29.
2. A rule consists of properties, a trigger, and actions. Perform the following steps to create Rule 1: a. Right-click the newly defined rule group and select New Rule to create the first new rule, as shown in Figure 3-30 on page 88.
87
b. In the New Rule window, click the Properties tab. It has the following three settings:
Name: The name of the rule.
When triggered: Refers to the rule's behavior when it is triggered. When you have more than one rule in a group, you can specify how you want to affect group processing or decision plan processing based on the trigger. There are three options available for the trigger: Continue, Stop group processing, and Stop all processing.
Enabled: Specify this option to enable the trigger. If the check box is cleared, the rule is skipped.
88
c. In the New Rule window, select the Trigger tab. You can click Add to open a series of guided menus to define the trigger, or you can edit the trigger text directly. Click Undo to undo your previous action, and click Clear to start over. After building your trigger, click Validate to ensure that your trigger is written correctly. In this example, Rule 1 is used to process all incoming documents. So, we set the trigger to true, as shown in Figure 3-32.
d. In the New Rule window, select the Actions tab. You can access a set of predefined action scenarios by clicking Add Actions. Most action scenarios are designed for using IBM FileNet P8 integration. For example, you can use action scenario to file and unfile documents in IBM FileNet P8 folders, set document classes, and declare documents as records in IBM FileNet Records Manager. In addition to selecting and configuring action
89
scenarios, you can build your own scenarios by selecting one or more advanced actions. In this example, because Rule 1 is used to identify the category to which each incoming HR document belongs, its action is defined in the following manner: analyze the document with Classification Module based on the HR knowledge base, and then assign its most relevant top category name to a user-defined content field, as shown in Figure 3-33. This user-defined content field will be used in Rule 2 to take further actions.
3. You can use an approach similar to the approach that is described in step 2 to create Rule 2: a. Define the Rule 2 properties, as shown in Figure 3-34 on page 91.
90
c. Define the Rule 2 action as shown in Figure 3-36 on page 92: i. Select File document into a specific folder in IBM FileNet P8. ii. In the Folder Name field, type the name of the IBM FileNet P8 folder where we want to file the documents. In our example, this name is Content_OS/icm_integration/HealthCareReview. Note: To ensure a correct classification from Classification Module into the IBM FileNet P8 repository, the P8 folder name must be the complete path: <object_store_name>/<folder_name>/<subfolder_name_if_any>
91
iii. In the Preview Decision Plan Actions pane, you can see the action syntax, which is add_to_content_field '$P8:File' 'Content_OS/icm_integration/HealthCareReview' in our example. iv. Click OK.
Now, you have created a decision plan with two rules and one knowledge base, as shown in Figure 3-37 on page 93.
92
Figure 3-37 Decision plan view with two rules and one knowledge base
93
3. In the Import Content Set window, click Browse, and locate the CSV file, as shown in Figure 3-39. And, click Finish to import the content set into Classification Workbench.
94
2. Open the Content type pull-down menu and select Plain Text, as shown in Figure 3-41 on page 96.
95
Figure 3-41 Setting the Content type value to the Message_body field
3. In the Specify options window, select all the fields that will be displayed in the decision plan report. See Figure 3-43 on page 97.
96
Decision plan summary report
A decision plan summary report contains the following statistical information:
The Rule statistics section lists all rules in the decision plan and for each rule, whether it was triggered or skipped, the IDs of the content items on which each rule took action, and the percentage of these items relative to the testing set. The Impact on content fields section provides an overall breakdown of what happened to each content field when the decision plan was run against the specified content set. The Modified content fields section shows each content field in the content set that was modified by the decision plan. The Deleted content fields section lists content fields that were deleted from one or more content items by the decision plan.
97
The Content field values section shows all content fields in the content set with the values that they contain after running the decision plan. For example, Figure 3-44 shows the rule statistics information after testing our decision plan against the content set. It indicates that the rule of Identify HealthCare Doc has been triggered for all 156 items in the content set, while the rule of File to P8 Review has been triggered 42 times, and has not been triggered for the rest of the content items.
Analyzed content set
The Analyzed Content Set view shows the analyzed content set after running the
content set through the decision plan. In addition to the content fields that are inherited from the testing content set, Classification Workbench can optionally display the user-defined content fields or system-defined content fields utilized in your decision plan. For example, Figure 3-45 on page 99 displays the DocCategory and P8: File fields that are added to the testing content set after running it through our decision plan of HealthCare Review. Per the rule definitions, the DocCategory field contains the category to which each item belongs, and the P8:File field includes the FileNet P8 folder path where the
98
Health Care document will be filed. Moreover, the Analyzed Content Set view provides additional analysis information for each content item, such as the list of suggested categories by Classification Module and their associated relevancy scores, its changed content field, and fired rules.
We have shown you how to create and analyze a decision plan with two rules and one knowledge base. Based on the requirements of your applications, you can create decision plans with rules only, or rules along with multiple knowledge bases, by using the techniques that are described in this section.
99
100
It is beyond the scope of this book to go into detail about Taxonomy Proposer. Follow the Taxonomy Proposer Workflow Assistant to learn to use the tool.
101
Figure 3-47 Keywords CSV file for the knowledge base creation
Defer processing: Tracks the feedback events. Users can extract and analyze the feedback, and then, they can apply it to a knowledge base at a later time. Do not process: Does not track or apply feedback to a knowledge base.
102
If your application is designed to work with deferred feedback, these steps show a typical workflow of maintaining the accuracy of your knowledge base: 1. Defer the processing of feedback for the relevant knowledge base. 2. Extract the saved analysis data containing the postponed feedback. 3. Use the saved analysis data, and add learning to your knowledge base offline. The accuracy of feedback to a knowledge base has a direct impact on the knowledge base's performance. We recommend using the defer processing option to have knowledgeable users review and analyze feedback before it is applied to their knowledge bases. Note: Depending on which suggested functions were used, the analysis data might not be saved to the server. In general, if you plan to analyze knowledge base performance or to defer feedback, you must turn on Save analysis data in the global properties of the server before the events are sent to the server. Otherwise, no data will be saved unless special suggest and feedback functions are used. Follow these steps to enable the knowledge base with the defer processing feedback option: 1. Launch Management Console by clicking Start → Programs → IBM Classification Module 8.6 → Management Console. 2. Enter the server listener URL, as shown in Figure 3-48, which is defined during the Classification Module product installation. In this example, it is http://localhost:18087.
3. In the Management Console window, select Knowledge bases on the left pane. 4. Right-click your knowledge base on the right pane, and select its Properties menu entry, as shown in Figure 3-49 on page 104.
103
5. In the Properties window, set the Feedback field to Defer processing, as shown in Figure 3-50 on page 105.
104
6. Restart the knowledge base so that the new setting takes effect, as shown in Figure 3-51 on page 106. The system now starts to track the analysis data for this knowledge base. The analysis data that can be useful for knowledge base analysis includes feedback (the confirmation or correction of how a content item was classified) and matches (the categories suggested for a particular content item and the associated relevance scores).
105
# The folder where the output XML files are to be created
XmlDir = C:\Data\Xml

# The time period for the event data that you want to import
StartTime = 2008/08/18 00:03:00.000    # Events starting at this time
EndTime = 2008/11/20 23:59:59.000      # Events ending at this time

# The type of data that you want to extract
# Options are TextOnly, KB, or DP.
ExtractType = KB

# The name of the knowledge base project (.kb) that you want to
# train or analyze.
KBName = HR

# Extracts all feedback or match events during the specified time
# period, or only the last feedback or match event.
# Options are All or Recent.
Scope = All

# The type of feedback event data that you want to extract.
# Options are 0 to not extract feedback, or any combination
# of Feedback, FeedbackPostpone.
FeedbackEvents = FeedbackPostpone

# The type of suggest event data that you want to extract.
# Options are 0 to not extract matches, or any combination
# of Suggest, SuggestFromDecide, and SuggestDocument.
# 1. Feedback is used together with suggest to analyze how well
#    the KB performed in the past.
# 2. Just Postponed Feedback is used to add learning (offline
#    feedback) after auditing.
SuggestEvents = 0

# The action to take if both feedback and suggest data is extracted.
# Yes: If both FeedbackEvents and SuggestEvents are configured, a
#      text will be skipped if just one event type is found for the text.
# No:  Extracts data for texts that have one event type.
Correspond = No

2. Run the bnsExtractTexts86.exe command to extract the stored analysis data into XML output files. For example, on Windows:

cd C:\IBM\ClassificationModule\Bin
bnsExtractTexts86.exe extractHRConfig.txt
107
After adding learning to your knowledge base with offline feedback, you can run a new analysis with a testing content set: 1. Import a testing content set of your choice. Make sure that the data contains a text field with a content type and a classification field for analysis. 2. Analyze your newly learned knowledge base with the testing content set in active view by using the Create, Analyze and Learn Wizard. In particular, choose the option Analyze knowledge base using active view, as shown in Figure 3-53 on page 109.
108
3. View the reports to analyze the knowledge base accuracy. 4. If you are satisfied with the accuracy of your newly learned knowledge base, you can deploy it to your production system.
109
2. Review the Knowledge Base Data Sheet report: a. The Total cumulative success table, as shown in Figure 3-55 on page 111, indicates that the cumulative success is low.
110
b. The Pairs of Categories with overlapping intents section lists the pairs of categories with overlapping intents, as shown in Figure 3-56 on page 112.
111
c. As an example, we further examine the following pair of categories, as shown in Figure 3-57, with additional reports.
Figure 3-57 Example of overlapping categories: Gift Certificates and Gift Wrap
d. Click View Reports on the toolbar. On the Category Graph and Tables tab, generate reports for the Gift Certificates category, as shown in Figure 3-58 on page 113. Note: It is important to choose Classification Workbench View so that you are able to open and view each content item directly by using the Content Item Scoring report.
112
3. We reach the following conclusion: the Stealing/Stolen Table report in Figure 3-59 shows that the Gift Wrap and Gift Certificates categories overlap heavily.
113
4. The Content Item Scoring graph in Figure 3-60 shows the scoring for each content item for the selected Gift Certificates category. The light colored (blue) points represent content items that belong to the selected category, while the darker colored (maroon) points represent content items that do not belong to the selected category. The content items with low scores are at the bottom, and those content items that received high scores are in the upper part of the graph. You can click a point and open the document to read it and to decide whether its classification is correct.
5. A common solution to overlapped categories is to combine those categories. In this example, we decide to combine the Gift Wrap and Gift Certificates categories into one Gift Certificates category. Combine the categories using these steps: a. In the Project Details pane, on the Category tab, right-click Gift Wrap, and select Show Items, as shown in Figure 3-61 on page 115.
114
b. Select all content items belonging to the Gift Wrap category, as shown in Figure 3-62.
Figure 3-62 Select all items belonging to the Gift Wrap category
115
c. Right-click the selected items, and select Categorize Highlighted As, as shown in Figure 3-63.
d. Select the Gift Certificates category on the left pane, click >>, and click Apply, as shown in Figure 3-64.
Figure 3-64 Select the Gift Certificates category to apply to highlighted items
116
e. In the Project Details pane, delete the Match n fields on the Fields tab to remove the previous classification result, as shown in Figure 3-65.
6. Use the Create, Analyze and Learn Wizard to create a new knowledge base after removing the Gift Wrap category:
a. Select the right content set view in Classification Workbench.
b. Click the Create, Analyze and Learn Wizard on the toolbar to launch the wizard.
c. Select Create and analyze knowledge base using active view, and Create using even, analyze using odd.
d. Select Create new knowledge base, deleting any existing knowledge base in this project.
e. Accept the default analysis options, and click Finish.
f. Wait until the Create and Analyze operation finishes, and click Close.
7. Click View Reports on the toolbar to generate the Knowledge Base Data Sheet report. Figure 3-66 on page 118 shows that the cumulative success has improved.
117
Figure 3-66 Knowledge Base Data Sheet with improved cumulative success
118
Chapter 4.
119
ECM e-mail archiving, document storage, content management, and records management
Classification Module can assist enterprise content management (ECM) systems in tasks, such as e-mail archiving, records declaration, document management, and eDiscovery readiness, by automating the decision-making process. E-mail content (from a mail server) can be extracted by IBM Content Collector and sent for archiving. The ECM application, empowered by Classification Module, performs tasks, such as choosing among the various storage options (using a decision plan), moving or copying a file to a new location, removing documents, and archiving. Typically, a category is mapped to a storage location, such as a directory or archive. The ECM application can also add attributes to the document to make document retrieval easier.
Self-help applications
Self-help applications, empowered by Classification Module, automate e-mail handling in customer service business processes. Self-help applications can automatically classify customer correspondence, identify the customer's problem quickly, and often deliver an answer back immediately based on the classification, which removes the requirement for human intervention.
120
Search applications
Typically, categorization is added to search and retrieval systems as an overall organizational mechanism. The category tree can also be used as a navigational device to help users find possible areas of interest. Within traditional search, categorization adds to the document description and can be used within traditional search techniques (queries). It adds to the knowledge about a document and helps in cases where the search query returns a large result set containing too many irrelevant documents. Typically, such a result set occurs when the identified keywords do not reflect the user's intent.
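To make this concrete, the following short Python sketch (our own illustration, not part of Classification Module) shows how a category assigned by a classifier can be used to narrow a keyword search result set; the result records and the category field are hypothetical:

# A minimal sketch: combine keyword search results with category filtering.
# The 'results' list and the 'category' field are hypothetical illustrations.
results = [
    {"title": "Mortgage rates Q3", "category": "Mortgage Interest"},
    {"title": "Checking account fees", "category": "Checking Interest"},
    {"title": "Mortgage refinancing guide", "category": "Mortgage Interest"},
]

def filter_by_category(results, category):
    """Keep only the documents whose assigned category matches the user's selection."""
    return [doc for doc in results if doc["category"] == category]

print(filter_by_category(results, "Mortgage Interest"))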
Document filtering
You can use Classification Module for document filtering in the following areas:
Custom spam: The spam filtering system categorizes all incoming e-mails as either spam or non-spam, according to the internal conventions of the company (rather than universal spam concepts). The user decides how rigid a filter to use, depending on the margin of error that the user will accept.
Personal/Business: Often, workplaces want to archive only business correspondence and documents on a file system. The knowledge base needs to learn to differentiate between personal and business-related documents and e-mails.
Pinpointing relevant categories and documents: Another filtering use case involves finding relevant documents among a large collection of mostly irrelevant documents.
121
important aspects of a successful Classification Module implementation, and the major implementation steps.
122
123
124
4. Fine-tuning your knowledge base. Improve the knowledge base performance by fine-tuning the content and retraining the knowledge base.
5. Deploying knowledge bases. Optionally, create a decision plan to modify how the requests are sent to one or more knowledge bases.
6. Maintaining your knowledge base. Ensure that any changes in the type of texts sent to the server are reflected in the knowledge base, such as removing obsolete categories, adding new categories, or retraining existing categories.
In the following sections, we explore each step (except the actual deployment) in detail in order to help you implement a successful Classification Module solution.
125
the company want to do with unclassified e-mails (classify them later manually, delete them, or take other actions).
ECM: Does the company want to simply categorize the documents, rearrange them within the file system, or ingest them into a full records management application?
Content assessment: Investigate the data that you have gathered and stored for years before you start the eDiscovery readiness preparation, compliance, or document management organization.
126
Categorization must not be based on external factors that are not reflected in the texts themselves. For example, several issues might easily be assigned to a Requires Research category, although they might actually have little in common. Therefore, Requires Research is typically a bad choice for a category. If there is a valid need to have an alternate view of the data, more than one knowledge base can be created for the same data set, representing different facets (for instance, Country of origin and Subject). However, if any category is based on external information, this factor needs to be dealt with externally to the knowledge base, by either a custom application or by writing rules in a decision plan.
Categories must be distinct from each other. If, for example, you create a category to categorize texts related to spectator sports, your category overlaps with a category for the topic of football. If you use Classification Module for an e-mail classification application, overlapping categories will naturally confuse the agent that provides the feedback to Classification Module. As a result, Classification Module will also find it difficult to differentiate between these overlapping categories. Texts with multiple intents, however, are handled extremely well by Classification Module. A single message can belong to more than one category.
Categories must reflect the business practices that you want to address by using Classification Module-based applications. For practical reasons, sometimes the suggestion is made to train categories by the answer that will be sent, and not by the intent. However, training categories using this method might not always provide optimal performance, because the answer texts differ in nature from the question texts. If this method is chosen, special care must be taken to analyze and tune the knowledge base after it has been created and trained. The knowledge base needs to learn to associate the category with the actual texts that it receives, either by feedback or retraining.
127
128
Self-help applications
Typically, users try to find texts relevant to their problems. These answer texts need to be identified (which might involve dividing a large document into topics) and mapped to categories. For each category, a number of sample user requests must be identified and incorporated into the training or learning data.
CRM
Similar to self-help applications, the agents' answers need to be mapped to possible questions. CRM systems can differ in that routing to agents might require another level of organization within the knowledge base. For example, each agent can deal with a separate area of expertise, and this area becomes a category. The subcategories correspond to each possible answer that is sent by that agent. In this scenario, you can build the knowledge base using example e-mails and documents that have been manually categorized by agents.
129
130
is improved automatically over time through Classification Module's feedback and learning mechanism.
131
Length of texts
We recommend that the texts be similar in length to the actual texts that the Classification Module is likely to encounter in your system.
132
Example one
Your organization plans to use a Classification Module-based application to categorize news articles. You build a corpus by collecting and importing a number of sample news articles from the Internet. Along with the main body text, the Web pages include extra, seemingly unnecessary text (for example, copyright information) that is unrelated to the news articles' content. If this extra text will not be included in the actual news articles that you plan to categorize using Classification Module, you need to remove it. If, however, you expect to categorize news articles with similar unnecessary text, leave it in the sample news articles to maximize knowledge base training and performance.
133
Example two
You create a corpus by gathering and importing archived e-mail messages. Other people in your company forward messages to you for inclusion in the corpus. By forwarding these messages, your own company's signature is included at the bottom of each message. Because this footer text will not appear in the messages that you expect to receive and categorize using Classification Module, you search for and remove all occurrences of your company's footer. You might want to leave other footer text (besides your company's footer) in the e-mails, if you expect to receive messages with these types of footers.
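As an illustration of this kind of cleanup, the following Python sketch removes a hypothetical company footer from corpus messages before import; the footer text and regular expression are assumptions that you would replace with your own signature:

import re

# Hypothetical footer added by the forwarding company; adjust to your own signature.
FOOTER_PATTERN = re.compile(
    r"--\s*\nExample Corp \| 123 Main Street.*$", re.DOTALL | re.MULTILINE
)

def strip_footer(message_text: str) -> str:
    """Remove the forwarding company's footer so that it does not skew training."""
    return FOOTER_PATTERN.sub("", message_text).rstrip()

sample = "Please review the attached claim.\n--\nExample Corp | 123 Main Street | www.example.com"
print(strip_footer(sample))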
Defining fields
By defining fields, you determine how you want to process the texts in your corpus. You must define fields and their content types based on the text elements that you expect Classification Module to analyze and act upon.
Important: The name of a field must not begin with an underscore (_) character.
134
When Classification Module receives a text directly from a document, this text is first processed by the DocFilterManager. This process extracts metadata and turns binary content into text. The extracted metadata is transformed into fields, which can be handled by a decision plan and either serve as the basis for decisions or be selectively sent to the knowledge base for matching.
Table 4-1 Application text type for Classification Module process and learn

Type of application text | Receives special NLP processing? | Suggested content type | Used in Classification Module function
Self-help inquiry | No | PlainText | Feedback, Matching, Training
E-mail message body | Removal of original message (for forwarded e-mails and replies); removal of RE: and FW:; e-mail address parsing | Body | Feedback, Matching, Training
E-mail message subject | No | Subject |
E-mail message sender | | Sender |
Canned answer (standard response), keywords, category name, answer's title | | Not applicable | Knowledge base initialization; initialization of a new category
Fields from structured content (database) | No | PlainText | Decision plan can be used to differentiate the use of fields with differing semantic content; selected fields can be used for feedback, matching, and training of multiple knowledge bases
Extracted DocTitle | | Document | Matching, Training
Extracted FileName | | | Can be used by decision plan
Extracted FilePath | | | Can be used by decision plan
HTML metadata | | | Can be used for decision plan, similar in nature to structured data
Categorizing texts
As mentioned earlier, the best way to build an effective knowledge base is by manually categorizing the training corpus. A well-categorized corpus maximizes the initial knowledge base performance. If your corpus is not fully categorized, your data might contain structural information that can be used to categorize the texts, for example:
Newsgroup name, if you are categorizing newsgroup postings
Tags in HTML data
XML tags in XML extracts
If your data includes this structured information, it can be assigned directly as the category for each sample text, as we have shown in 3.2.2, Building a knowledge base on page 62, and as sketched in the example that follows.
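The following Python sketch illustrates the idea of assigning structural information directly as the category; the XML element names are invented for illustration and do not reflect the actual Classification Workbench import format:

import xml.etree.ElementTree as ET

# Hypothetical corpus extract; the element names are illustrative only.
corpus_xml = """
<corpus>
  <item><department>Claims</department><body>Please process my claim...</body></item>
  <item><department>Billing</department><body>My invoice is wrong...</body></item>
</corpus>
"""

root = ET.fromstring(corpus_xml)
for item in root.findall("item"):
    # Use the structural tag as the category for this sample text.
    category = item.findtext("department")
    text = item.findtext("body")
    print(category, "->", text)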
136
To understand this functionality, consider the task of categorizing the following texts: pen, water glasses, coffee cup, laptop, and writing tablet. If you are asked to make two clusters, you can find objects for writing and objects for drinking. If you are asked to make four clusters, you can find laptops, writing instruments, water glasses, and coffee cups. Clustering is also available in the Taxonomy Proposer, which assists the user to iteratively build a taxonomy from the suggestions of the clustering algorithm. (A small clustering sketch follows this list.)
Tip: Follow the Taxonomy Proposer Workflow Assistant.
Classify by response: If your sample texts include standard responses, you can use Classification Workbench to automatically classify corpus texts according to the particular canned responses that they contain. The system identifies common responses even if they have been modified slightly for the specific response, such as adding the customer's name and account number.
Finding patterns: If your text does not contain standard responses, but it contains repeating patterns that can be used to categorize the data, you can use Classification Workbench to find these patterns. The resulting list of items can be examined as a possible category.
Manual categorization: This method is the most labor-intensive method, but it is sometimes the only option available. If your texts do not have any systematically identifiable categories, you can use Classification Workbench to assign a category to each text.
Your use of these techniques depends on the state of your texts. In most cases, corpus texts require categorization and cleanup. Depending on how the resulting knowledge base performs, you can use the categorization and cleanup techniques to fine-tune the corpus and improve knowledge base performance. Optimal training results are achieved when your corpus contains texts that are as close as possible in content and structure to the real-life texts that Classification Module will categorize.
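The clustering behavior described above can be approximated outside the product with a few lines of Python; this sketch uses scikit-learn's TF-IDF vectorizer and k-means, which are not the algorithms that Taxonomy Proposer uses, purely to show how texts group into candidate categories:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "pen for writing notes",
    "water glass for drinking",
    "coffee cup for hot drinks",
    "laptop for writing documents",
    "writing tablet for notes",
]

# Turn the texts into TF-IDF vectors and group them into two clusters.
vectors = TfidfVectorizer().fit_transform(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for text, label in zip(texts, labels):
    print(label, text)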
137
Note: Using initialization data does not make full use of Classification Module's ability to learn from real-life data. For the best results, we recommend corpus-based knowledge base creation. However, using initialization data is a quick way to get a knowledge base or individual categories up and running, without having to gather sample corpus texts. Subsequent learning based on real-life texts must be applied to these categories to ensure optimal results.
In spite of their limitations (minimal length and content), within question-and-answer types of applications, it is best to train Classification Module using pre-categorized question texts, because these question texts are the texts that the system will be required to classify. However, when sample questions are not available, you can use Classification Workbench's initialization functionality by providing answer texts, a manually created list of keywords, or even the category names to train Classification Module. This initialization functionality is useful both when initializing an entire knowledge base and when adding a new category to an active knowledge base.
Initialization data
Initialization data consists of keywords and texts that are associated with individual categories. Keywords are words or phrases that you expect will appear in texts classified by Classification Module. When the system identifies keywords in a text, the category associated with the keywords is more likely to be returned. For example, if you expect to receive questions about your company's exchange policy, you might choose keywords, such as exchange and return, for this category.
Note: You can attach the same keyword to more than one category.
In addition to keywords, you can associate one or more texts with categories in the knowledge base. For example, for a Classification Module-based e-mail classification system, an appropriate text is a canned answer sent in response to e-mail inquiries. Classification Workbench analyzes the text and uses this information to classify incoming texts appropriately. You can also add shorter texts, referred to as titles, to a category (for example, the subject of the canned answer). You can choose to enter initialization data manually or to gather data offline and import it.
138
When using the initialization functionality, the following predefined content types are available:
Title: Used for short descriptions
Canned answer: Used for predefined texts that the application returns for a category
Keywords: Words or phrases associated with categories
You cannot change these predefined types.
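The following Python sketch writes a simple initialization file that pairs categories with keywords and canned answers; the column layout is an assumption for illustration and must be adapted to the CSV format that Classification Workbench actually expects (see Figure 3-47):

import csv

# Hypothetical initialization data; the column layout is illustrative only.
rows = [
    ("Exchanges", "exchange", "You can exchange an item within 30 days of purchase."),
    ("Exchanges", "return", "You can exchange an item within 30 days of purchase."),
    ("Gift Certificates", "gift card", "Gift certificates are available in any amount."),
]

with open("init_keywords.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["category", "keyword", "canned answer"])
    writer.writerows(rows)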
139
Tip: We recommend that you keep a small to medium-sized corpus to be used as a benchmark for checking after retraining or periodically after feedback. This corpus needs to contain a broad range of the most common types of queries. The purpose of this corpus is to test basic functionality after fine-tuning more specific areas. Often feedback is biased toward a few newer and popular categories. In the next sections, we discuss using precision and recall with thresholds to analyze your knowledge base.
140
141
This section describes the analysis of a knowledge base after it is created, but before it has been deployed on a live server (the first option listed). We discuss the analysis of live performance later in this chapter. To analyze your knowledge base's performance, use a set of texts (also referred to as an analysis content set) that have been categorized (by a subject expert or other business user) into the appropriate categories. For each text, the categories and scores suggested by Classification Module are compared with the categories that were chosen by a person. This information enables you to understand how your knowledge base aligns with the accepted knowledge base design and also how it will perform on live data. Using this process, you can gain insight into potential areas for improvement. Specifically, analyzing the knowledge base's performance allows you to gain the following benefits:
Identify how accurately the knowledge base performs.
Analyze the use of thresholds for process automation.
Identify categories that require improvement.
Identify possible new categories to add to the knowledge base or obsolete categories that need to be removed.
The analysis process consists of understanding your analysis data, viewing and understanding performance, and identifying categories for improvement. Figure 4-1 on page 143 shows the stages of knowledge base analysis and the activities that you perform during each stage.
142
143
Data quality
Your data must contain the correct categories for each text, which are obtained through manual verification, such as feedback or any other reliable mechanism. These categories represent your expectations and provide the basis for the calculation of success. In addition, your test data must be relevant, unbiased, and reliable. Working with feedback describes the ideal data for feedback.
144
Similar concepts apply to ideal analysis data, because the category field represents the ideal categorization of the text.
Category distribution
The distribution of your analysis data must be representative of the distribution of the actual data that Classification Module will classify using your knowledge base. For example, if you expect that Category A will account for 20% of your classifications, the analysis data needs to reflect that, as well. In general, focus on the most frequently used categories during the analysis process. In certain cases, extremely small categories can be highly important. Make a note of these categories, so that you can give them special attention during your analysis. For specific case analysis, use selected categories.
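One hedged way to check this is to compare the category distribution of your analysis content set against the distribution that you expect in production, as in this small Python sketch with hypothetical labels and shares:

from collections import Counter

# Hypothetical category labels of an analysis content set.
analysis_labels = ["A", "A", "B", "C", "A", "B", "A", "C", "B", "A"]
expected_share = {"A": 0.20, "B": 0.50, "C": 0.30}  # what you expect in production

counts = Counter(analysis_labels)
total = len(analysis_labels)
for category, share in expected_share.items():
    actual = counts[category] / total
    print(f"{category}: expected {share:.0%}, in analysis set {actual:.0%}")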
Languages
The language of the analysis data needs to agree with the categories that are being analyzed.
145
Tip: We recommend that you store the ID, name, or other identifier of the person who provided the original feedback in the system. This information can be useful when running reports and performing analysis based on specific users or agents.
146
are happy to receive the correct answer as one of the top few suggestions, while unanswered questions can be directed to an e-mail address or a phone number. Your application measures success based on cumulative success, or how often Classification Module identifies the correct answer as one of the top three answers.
Custom spam filtering systems: Your application categorizes incoming messages as spam or non-spam. Your users cannot afford to lose more than one non-spam e-mail in 100,000, but they are willing to live with a small amount of spam that remains unfiltered. A successful knowledge base can assure that an extremely high percentage (99.999%) of texts that belong to non-spam are assigned correctly (high precision), while a fairly high percentage of spam messages are correctly assigned (reasonably high recall). Classification Workbench allows you to set thresholds per category to adjust the precision/recall levels.
For self-help applications, the user is expected to be able to choose the most appropriate answer from a list of suggested results. You will be satisfied with your knowledge base as long as it identifies most of the interested users (moderate precision), while you will excuse false positives in this category (moderate recall).
147
As a result of analysis, you might determine that your current knowledge base does not give you the levels of automation and accuracy (that is, return on investment (ROI)) that you require. If so, you must determine the causes and attempt to correct them. If your application relies on precision and recall ratings, examining a graph of the average precision and recall for all categories might give you a sense of the accuracy that you can expect for given levels of automation. You will probably want to examine individual categories and determine the correct level of automation for each category.
148
Are there insufficient items for specific categories? Typically, the performance of a category decreases if it has fewer items than other categories.
2. Look for categories with lots of content but poor performance. Possible solutions include:
Merge overlapping categories.
Split categories with multiple intents.
Create new categories.
Use precision to eliminate false positives.
3. Look at other reports for problems:
Steal and stolen report: Do you need to merge the categories or retrain with more accurate texts?
Precision vs. recall report: Review the report, explain the report's meaning to the client, and understand what the client wants to do with this information. Does the client want more precision or more recall? You can adjust accordingly. Understand False Positive (false yes) and False Negative (false no).
Threshold information: Check the threshold setting and determine how it can be set for better precision or recall scores.
Deflection report information: Analyze the report and learn how you can use the information for better performance.
In the following section, we cover the Cumulative Success and Precision/Recall reports. You can also review other reports to analyze the performance and improve your knowledge base.
149
Table 4-2 Cumulative Success scores

Number of categories returned | Percentage of items correctly classified
1 | 81
2 | 85
3 | 90
4 | 92
5 | 94
6 | 95
7 | 99
8 | 100
9 | 100
10 | 100
This table indicates for this knowledge base that Classification Module selects the correct category as the first choice 81% of the time. It also indicates that if the application presents the top seven scoring categories for an incoming text, the correct answer will be included 99% of the time. Higher percentages reflect better performance. Make sure that you check the cumulative success when running analysis on any newly improved knowledge base.
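If you want to reproduce this kind of table from your own analysis data, the following Python sketch computes cumulative success at each rank from ranked category suggestions and the categories chosen by a person; the sample data is hypothetical:

def cumulative_success(ranked_suggestions, correct, max_rank=10):
    """For each rank n, the share of items whose correct category is in the top n suggestions."""
    totals = []
    for n in range(1, max_rank + 1):
        hits = sum(1 for sugg, cat in zip(ranked_suggestions, correct) if cat in sugg[:n])
        totals.append(100.0 * hits / len(correct))
    return totals

# Hypothetical data: ranked suggestions per item, plus the category chosen by a person.
ranked = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truth = ["A", "A", "B"]
print(cumulative_success(ranked, truth, max_rank=3))  # approximately [33.3, 100.0, 100.0]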
150
151
To further help you fine-tune your knowledge base, we examine the following issues and topics:
Overlapping categories (categories stealing from each other)
Low scores in all categories
Category does not represent intent of a message
Possible multiple intent categories
Human error (poor manual classification)
Poor sampling (randomization issues)
Identifying hidden subcategories
Poorly performing category with too few examples
Identifying obsolete categories
152
Situation two: Two categories are created (Mortgage Interest and Checking Interest). Mortgage Interest supports interest rate questions for mortgages. Checking Interest supports interest rate questions for checking accounts.
Situation three: The Out of Stock category is similar to the Inventory category (one category is a subset of the other category).
153
There is a problem. However, this kind of text is so rare that it is not worth providing a solution.
The text is currently classified to a generic category. If we divide this generic category and allocate a separate category for this kind of text, Classification Module will be able to identify this type of text in the future.
There is a technical problem, such as an incorrect rule or a problem with language identification.
The text represents a new topic that requires a new category because of a predicted need.
154
Errors in the manual classification that you used to train the system have caused categories to be confused (with multiple intentions). You must always define clear objective criteria for inclusion in a category. If you determine that two categories are often coupled, you might decide to merge them. That is, the multiple intentions might be extremely close in meaning and impossible to separate in practice.
155
156
157
As the analysis progresses, the knowledge base that is used to test each text will always contain the maximum amount of knowledge. In this way, every text used in analysis is also learned, creating a knowledge base much like one in a live system.
158
Recall: Determine the percentage of texts that you want to catch for a specific category. Realize that recall and precision are inversely related and that there are costs associated with catching a higher percentage of texts. For example, a filtering application in a security environment might opt to set the recall value for the Terrorism category to 90%. Recall is the percentage of items that are actually relevant to the category that are recognized as such by Classification Module (the rest are false negatives). High recall means that you do not have many false negatives; you do not miss many items, and you catch almost all of the items that belong to the category.
Precision: Determine the percentage of texts that are caught correctly for a given category. For example, you might want 90% of the automatic responses to be correct, regardless of the associated cost. Precision is the percentage of items that Classification Module identifies as relevant to a category that are actually relevant to the category (the rest are false positives). High precision means that you do not have many false positives; you do not claim that many items belong to the category when in fact they do not.
Note: Selecting a threshold based on recall or on precision is possible using Classification Workbench.
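As a minimal illustration of these two measures, the following Python sketch computes precision and recall for one category from per-item predictions and true labels; the sample data is hypothetical, and the calculation is the standard textbook one, not a Classification Workbench report:

def precision_recall(predicted, actual, category):
    """Precision and recall for one category, given per-item predictions and true labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == category and a == category)
    fp = sum(1 for p, a in zip(predicted, actual) if p == category and a != category)
    fn = sum(1 for p, a in zip(predicted, actual) if p != category and a == category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical per-item results.
predicted = ["Spam", "Spam", "NotSpam", "Spam", "NotSpam"]
actual    = ["Spam", "NotSpam", "NotSpam", "Spam", "Spam"]
print(precision_recall(predicted, actual, "Spam"))  # approximately (0.67, 0.67)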
159
Question: If you want to be accurate 85% of the time (that is, the precision), what do you do?
Answer: Make sure that your scores are calibrated, and then set the threshold to 80%.
For the Web Self-Service page on your company's Web site, or within the ECM or records management automation, the system is completely unmanned, and your goal is to provide the three most likely responses to every question asked.
Question: When you provide the three top scoring categories (and their associated answer texts) to the user, how do you know your level of accuracy?
Answer: Look at the Cumulative Success value 3. This value tells you the percentage of times that the correct answer was included in the top three Classification Module matches.
Question: If you want to automatically archive e-mails into an ECM repository (for example, IBM FileNet P8) with a precision of 90%, what do you do?
Answer: The e-mail archiving scenario, using IBM Content Collector to ingest e-mails into IBM FileNet P8 or IBM Content Manager CM8, can be performed automatically by using Classification Module to decide the location of the e-mails inside the repository. For this purpose, we use Classification Workbench to calculate the threshold settings for 90% precision and use them in the decision process (for example, you can set up a decision plan that uses a knowledge base and its associated thresholds set for 90% precision). The same threshold-based automation can be used to reclassify e-mails that are already inside the ECM repository.
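The threshold selection that Classification Workbench performs can be illustrated, in simplified form, with the following Python sketch: given calibration items with scores and a flag indicating whether the classification was correct, it finds the lowest score threshold at which the retained items reach a target precision. The data and the procedure are our own simplification, not the product's algorithm:

def threshold_for_precision(scored_items, target_precision):
    """
    scored_items: list of (score, correctly_classified) pairs for one category.
    Returns the lowest threshold at which the items scoring at or above it
    reach the target precision, or None if no threshold qualifies.
    """
    for threshold, _ in sorted(scored_items):
        kept = [ok for score, ok in scored_items if score >= threshold]
        if sum(kept) / len(kept) >= target_precision:
            return threshold
    return None

# Hypothetical calibration data: (score, correctly classified?).
data = [(95, True), (90, True), (85, True), (80, False), (75, True), (60, False)]
print(threshold_for_precision(data, 0.90))  # prints 85 for this sample data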
160
Deferring feedback
Depending on the type of user that supplies the feedback (internal personnel or random users), it might be advisable to defer the feedback, saving it on the server but not sending it to the knowledge base for learning. You can set up this option, prior to giving feedback, using the Management Console. There is a feedback option in the Properties dialog for each knowledge base. Choose Defer processing. When you want to process the feedback, run bnsExtractText, exporting only the deferred feedback. This data can be imported into Classification Workbench and audited before running the Learn using active view option in the Create, Analyze and Learn Wizard.
161
Relevant data
Preparing text prior to sending it to Classification Module will help ensure that the system works with the best possible data. This recommendation applies despite the fact that Classification Module is designed to handle imperfect data and automatically ignores nonrelevant texts. Note: The same text preparation needs to be applied to all texts sent to the Classification Module for training, learning, and matching.
Unbiased data
Under optimal conditions, it is best to provide accurate and unbiased data to Classification Module as feedback. When unbiased data is not available, you must carefully select a random sampling of this data and use it as feedback. The following situations are typical examples of bias in the training set or feedback:
Self-help applications: When feedback is received from users who click a feedback button (for example, Did this answer help you?), it is possible that only satisfied customers (or only unsatisfied customers) will submit feedback. This data is both biased and unreliable. Therefore, it is preferable to manually audit this feedback. An objective auditor might use a separate Feedback Tool to audit (for example, confirm or modify) the users' feedback, which provides a more reliable and less biased source of input to Classification Module.
E-mail auto-response system: Bias can occur when agents only provide feedback on manually handled messages, while messages that receive auto-responses do not receive feedback. In this situation, Classification Module will not learn the characteristics of messages with high scores (in this example, the auto-response messages), and the results will be biased. This type of bias might not always be undesirable. We recommend that you designate a random percentage of auto-responses for manual audit and quality control. Alternatively, consider this scenario: the agent can only view and provide feedback to the five high-scoring categories suggested by Classification Module and cannot access the other (low-scoring) categories in the knowledge base. This undesirable situation creates the opposite bias to the previous example, because only the high-scoring categories receive feedback. Therefore, agents must always have access to the entire knowledge base.
Certain users of Classification Module-enabled applications do not provide positive feedback when Classification Module suggests the correct category and answer with a high confidence level. Often, low-scoring texts (which are more likely to be selected for feedback submission) are not good representatives of the category and might not be the best training material. Feedback must be implicit (triggered by users' actions) whenever possible. For example, the selection of a canned response needs to automatically trigger a feedback event.
In general, Classification Module takes into account the popularity of categories in the example set and gives more weight to larger categories. As a result, any large deviation from the real world can introduce bias. For example, a category that is not visible to users and receives less feedback than it deserves will receive lower Classification Module scores in the future. One possible implementation for handling biased feedback is to use random sampling of texts or messages for feedback. Your application can use the results of a random number generator to select a certain percentage of the messages that will be sampled. For example, if you want to sample 10% of your messages, your application can assign a random number (between 0 and 1) to a message and select it for sampling when the random number value is below 0.1. The application must ensure that all of these sampled messages receive feedback (through user feedback or auditing). The other 90% of the messages must not provide any feedback to Classification Module (regardless of the feedback that they might get in the application). In cases where feedback can be provided to all messages, there is no need to use a sampling mechanism. This approach is specifically true for e-mail response applications that never send automatic responses.
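The 10% sampling gate described above can be implemented with a few lines; this Python sketch is an illustration only, with a hypothetical message list:

import random

SAMPLE_RATE = 0.10  # audit and give feedback on roughly 10% of messages

def select_for_feedback() -> bool:
    """Randomly mark roughly 10% of messages for mandatory feedback or audit."""
    return random.random() < SAMPLE_RATE

messages = ["msg-001", "msg-002", "msg-003", "msg-004"]
sampled = [msg for msg in messages if select_for_feedback()]
print("Messages that must receive feedback:", sampled)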
Reliable feedback
Unreliable data can originate from feedback submitted by a person who is not familiar with the entire category set. For example, a user who has a problem with the mobile computer display might think that the Laptop-Problems category provided a satisfactory answer, although there is a more specific and appropriate category. When determining how and from whom your system will receive feedback, consider the trade-off between the amount and the reliability of the feedback. Depending on the environment that you support, it is generally preferable to opt for more reliable feedback, as long as the amount of feedback is sufficient to keep your system running accurately. It is important to note that Classification Module does expect a portion of the feedback to be incorrect and, therefore, has mechanisms to avoid problems resulting from incorrect feedback.
163
the most recent feedback bears more weight than the corpus and the feedback presented in previous calls.
164
application, you can use special functions to create a text that is stored on the server. This text is assigned an ID, and subsequent functions (suggestWithID, decideWithID, and feedbackWithID) will save the data on the server in such a way that the user feedback is connected to the matched categories. Using the command-line tool bnsExtractText, you can export XML data from the server and import it into Classification Workbench. The resulting data can be used to analyze the performance of the Classification Module server over time. We recommend that you work in this mode if you want full control over the analysis of past knowledge base performance. The disadvantage of this approach is that a lot of data is stored on the server. If the special ID functions are not used, the exported data will contain separate items for the Suggest/Decide and Feedback events. These items can be imported into Classification Workbench and examined manually, or alternatively, the data can be manipulated externally.
Option four: Write an external application that stores all of the requests that are sent to the Classification Module server and that is capable of writing this information to XML files that can be imported into Classification Workbench. The disadvantage of this approach is that you must verify that the requests were actually received by the server and that no errors occurred.
For analysis options within Classification Workbench, see 4.6.5, Running analysis in Classification Workbench on page 145. For solutions to knowledge base problems, see 4.7, Fine-tuning your knowledge base: Fixing the problems on page 151.
165
When a new category is added to the system, it will not be active until it receives at least one feedback. When you create the new category in Classification Workbench, you can initialize just one category using initialization data. Monitor the performance of a knowledge base with a new category, making sure that it receives appropriate feedback.
4.10.1 Matching
Classification is based on the scores that are returned by Classification Module's Match function and can be integrated at one or more points in your application. The following examples are a few typical scenarios:
E-mail response application: The application can perform matching when the e-mail is received. In this case, the agent can view the match results at a later time. Or, the application can perform matching each time that the agent views the message. In this case, the classification will be based on a more up-to-date knowledge base, which can include relevant feedback that was processed. This approach will generally improve the accuracy of the system.
166
Self-help/search/text retrieval applications: The application calls the matching function each time that the user submits a question or query. This data can be saved and used later for analysis. If feedback is included, this feedback can be used for later retraining. Or, if feedback is deferred (not submitted to the knowledge base), it can be audited and used for offline learning.
ECM e-mail archiving, document management, and records management scenario: In the ECM integration between Classification Module and FileNet P8 or IBM Content Manager CM8, the classification information is used to automatically place an e-mail into the appropriate archiving folder, using IBM Content Collector to extract the e-mail content and store it in the ECM system. Or, the classification information is used to assign the appropriate document class, according to the results of a decision plan, or to define the needed record. You can review this information using the Classification Center process, which accumulates feedback and enriches the respective knowledge bases with new information.
4.10.2 Feedback
The sections that follow describe issues that you need to consider when working with the feedback function.
Submitting feedback
When Classification Module receives feedback, it learns from the submitted text and adds learning to the submitted category. Only learning categories are affected by feedback. Depending on the structure of the knowledge base, a node can be affected by feedback to its parent node. The knowledge base updates itself based on texts that are supplied as feedback. The text has a positive effect on categories that receive positive feedback and a negative effect on the other categories.
167
Consider the case of categories X, Y, and Z. If a multiple-intent message arrives that applies to categories X and Y, we must provide feedback to both X and Y in a single text with two associated categories. If we provide this feedback as two separate texts, we get the following result:
We first supply positive feedback to category X. This action implicitly provides negative feedback to categories Y and Z (that is, the system tells categories Y and Z that the text does not belong to their categories).
Then, we supply positive feedback to Y. This action implicitly provides negative feedback to categories X and Z.
Due to this incorrect use of feedback, Classification Module is now faced with contradictory information. When a new message arrives that is similar to this message, Classification Module will know that Z is not the correct category, but it will be confused by the conflicting feedback on categories X and Y. Therefore, follow the recommended method of providing feedback as a single text. This method provides positive feedback to categories X and Y and negative feedback to category Z with one call.
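The following Python sketch contrasts the two approaches; the client class and its feedback method are hypothetical stand-ins, not the actual Classification Module API:

# Hypothetical client wrapper; 'feedback' here is illustrative only and is NOT
# the actual Classification Module API call.
class HypotheticalClassifierClient:
    def feedback(self, text: str, categories: list) -> None:
        print(f"feedback: {categories!r} for {text[:30]!r}")

client = HypotheticalClassifierClient()
text = "Please update my address and close my savings account."

# Recommended: one feedback call listing both intended categories (X and Y).
client.feedback(text, ["Address Change", "Account Closure"])

# Not recommended: two separate calls, which implicitly give each other's
# category negative feedback and leave the classifier with conflicting signals.
client.feedback(text, ["Address Change"])
client.feedback(text, ["Account Closure"])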
Negative feedback
Negative feedback is information that is provided by a user indicating that the answer received is incorrect. An application can use this information, but the application must not send it directly to Classification Module for handling.
168
If you use an auditing mechanism to review and validate user feedback, the auditor needs to convert this information into positive feedback before it is sent to Classification Module, using one of the following methods:
The auditor, who is familiar with the entire knowledge base, specifies the correct category for the text.
If there is no clearly correct category in the knowledge base, the application can simulate one by adding a Not relevant to the knowledge base category and supplying this text as feedback to that category. This type of category, when added to the knowledge base, is called a background category. The background category might or might not return a score, depending on how you decide to set it up.
Note: Extensive use of the Not relevant to the knowledge base category will lower the score of the other categories in the knowledge base and might reduce Classification Module's accuracy for the categories that are of interest. Use this category with caution.
169
not recommended, because the categories for each agent are too widely defined and are associated with too broad a range of suggested answers.
170
Other applications do not require immediate matching, such as:
E-mail response applications in which the sender does not expect an immediate reply
Single-user desktop applications, where the user does not expect matching to work while the user is working on the knowledge base (for example, reorganizing it and training it)
Applications that require immediate matching must always keep at least one instance of the Classification Module knowledge base running. In this case, it is desirable to dedicate one instance of Classification Module to running all the learning and knowledge base management tasks and to dedicate several other servers to running read-only instances of the knowledge base (each instance on its own machine). Classification Module allows you to create a read-only instance of a knowledge base, which is significantly smaller, takes fewer resources, and loads more quickly. We recommend that the learning instance of the knowledge base run on one Classification Module server and that it generate a read-only instance for matching. The learning instance is responsible for updating the read-only instances at a regular interval.
4.10.6 Retraining a knowledge base after a major reorganization using Classification Workbench
In a typical day-to-day scenario, most feedback is given to existing categories. Categories are added or deleted infrequently. However, if major changes are made to the knowledge base and corpus (for example, when completely reorganizing mail folders), it makes sense to completely retrain the knowledge base, using texts that are relevant to the newly created category. In this case, the retraining process gives more weight to more recent (that is, more relevant) texts:
1. Select all of the texts from the last period of time (for example, the last month), so that the average category has at least 50 - 100 texts.
2. Import the corpus into Classification Workbench, making sure that it includes classified texts for the new category.
3. Create and analyze a new knowledge base with the Create, Analyze and Learn Wizard.
4. Export the knowledge base to the Classification Module server, replacing the old knowledge base.
171
Feedback bias
Because only a small number of the texts in a filtering system are read by users (that is, the messages that receive high scores in one of the relevant categories), the overwhelming majority of messages do not get feedback. This situation creates two kinds of bias:
Bias in the feedback to the background category
Bias in the feedback to the relevant categories
Note: In general, the bias in the feedback to the relevant categories is more problematic than the bias in the feedback to the background category.
For example, a drugs category might focus on domestic drug dealers and will return high scores for messages that refer to this organization. Users will not receive suggestions nor provide feedback for messages about foreign drug dealers due to differences such as the names of people, their locations, and language. This situation creates a problem with biased data.
172
We suggest the following methods for handling this bias in the feedback to relevant categories:
Use an external mechanism to look for relevant messages, and feed them as feedback to new (or existing) categories in Classification Module. Such mechanisms can be a combination of external knowledge, such as newspapers and relevant intelligence, and a keyword-based search engine.
Audit and give feedback to a certain percentage of the messages that were filtered out.
It might be desirable to create two separate categories (for domestic and foreign drugs) due to the differences in the content of their texts.
The method for overcoming the bias in the background category is to audit a certain percentage of the texts that were filtered out. Provide all the texts that received low scores on all the relevant categories as feedback to the background category, which assumes that the overwhelming majority of these texts belong to the background category.
Accuracy of the filtering system
When analyzing the accuracy of a filtering system that makes use of queues, consider the following measurements:
The percentage of texts that are routed to the correct queue
The percentage of texts that are routed to the incorrect queue (that is, they are handled, but they are placed in the wrong queue)
The percentage of texts that entered the queues but should have been filtered out
These measurements cannot be obtained immediately by using Classification Workbench, because Classification Workbench cannot simulate the queuing mechanism of the application.
173
4.11 Summary
This chapter presented the typical life cycle of a knowledge base and showed how to optimize its behavior, in the context of a number of common scenarios. Within Classification Module, the knowledge base plays a crucial role. Even though the decision plan was later added to enhance the manipulation of one or more knowledge bases, using rule-based mechanisms, only the knowledge base has the ability to adapt dynamically to the content that it receives and to learn new user needs (through feedback). However, as a result of this powerful aspect of knowledge base behavior, it is essential that you closely observe and direct its performance. This chapter showed a wide range of concepts; you need to analyze your own unique requirements to apply and adapt the techniques that are described in the use cases.
174
Part 2
Integration details
In this part, we describe the integration between IBM Classification Module and other products and solutions, including IBM FileNet P8 systems and IBM Content Collector for both E-mail and File Systems. We describe the steps to enable the integrations and provide step-by-step configuration instructions with use case scenarios.
175
176
Chapter 5.
177
Classification Center
The Classification Center is a Web application that provides the following functions:
Configure classification settings (filters for the folders and documents to be classified).
Run automatic classification of chosen content.
Monitor classification activity and errors.
Review documents according to various filter settings (such as those documents that did not exceed the configured confidence threshold for an automatic classification action), and, if necessary, reclassify documents.
Content Extractor
This is a command-line tool that extracts content from an IBM FileNet P8 repository to create and train knowledge bases. We explain the configuration and usage of these components in later sections of this chapter.
178
Figure 5-1 illustrates the Classification Module and IBM FileNet P8 integration architecture with two major workflows: Setup and Production.
Figure 5-1 Classification Module and IBM FileNet P8 integration architecture (Setup and Production workflows). In the figure, Content Extractor pulls selected documents out of IBM FileNet P8 in XML format, and Classification Workbench uses those documents to build knowledge bases. The IBM Classification server uses rule/keyword-based methods (a decision plan) and statistical methods (knowledge bases) to determine document classification. Classification Center 8.6 provides services on documents in P8: automatic classification and manual review.
Setup workflow
The setup workflow consists of the following steps (the numbers match the numbers in Figure 5-1):
Step 1: Extract data from the repository. Content Extractor extracts the documents from the IBM FileNet P8 repository to one or more XML files. The attachments referenced in the extracted documents are filed in a bin subfolder. We discuss detailed information about configuring and using Content Extractor in the next section of this chapter.
Step 2: Training and analysis in Classification Workbench. Use Classification Workbench to import the extracted files (in XML format) together with their attachments and use them to build knowledge bases. A decision plan is created.
179
Production workflow
The production workflow consists of the following steps (the numbers match the numbers in Figure 5-1 on page 179):
Step 3: Classification and review setup with the Classification Center. The knowledge base and decision plan are deployed to the Classification Module server, which performs the classification activities. Configure the Classification Center to classify the documents stored in the IBM FileNet P8 object store. It passes the document content to the IBM Classification server, where each document is evaluated by the decision plan. According to the decision plan, the system processes each document (either moves or files the document) based on the category classified by one or more knowledge bases. Certain documents might receive no categories or actions and can remain unclassified. These documents are highlighted for review. You can use the Classification Center to review documents that match specific conditions (a filter). For example, you can review documents that do not exceed a preset confidence threshold for an automatic classification action.
Clarification: The threshold is processed by the decision plan rules. Thresholds are not processed in the Classification Center filters. However, you can go to the filters to view the documents that do not exceed a certain threshold.
Step 4: Classification Module automatic classification. If necessary, Classification Module reclassifies these documents. Using the Content Extractor, the metadata of classified documents can be exported to XML to evaluate the knowledge base accuracy. This analysis is based on the categories that are suggested by the decision plan and those categories that are chosen by the user in the review process. This data can be imported into Classification Workbench, where you can create a large range of reports.
Our sample company must classify content in an Enterprise Content Management system and ensure that document retention and disposition policies are enforced. The present enterprise corporate repository is built on IBM FileNet P8. Initially, the content in the repository is not well organized, and it does not comply with records management policies. To address this issue, the IT specialist responsible for the repository wants to ensure that all of the documents in the IBM FileNet P8 content store are organized into a consistent set of folders (or document classes). The business analyst wants to ensure that data in the repository is organized according to a corporate taxonomy. To achieve this goal, the analyst defines a new corporate taxonomy for assigning document properties and classifying content into folders. The analyst also defines records management policies (rules for the retention and disposition of documents) per the instructions of the company's records manager specialists. The IT specialist's task is to organize content in the company repository by using the new taxonomy policies. The IT specialist works closely with the business analyst to reclassify content that already exists in the repository by applying this new taxonomy. They must work together to eventually ensure that the correct documents are declared as records so that the records can be managed according to the records management policies. To automate this task, they decide to use Classification Module. They need to configure a set of rules that uses knowledge bases and assigns each document to one or more categories in the corporate taxonomy. The rules need to be easy to configure but powerful enough to allow documents to be classified on the basis of both metadata (document properties) and content. During the reclassification phase, information about the documents, such as the document properties and the target folder or document class, might require updating, and certain documents might need to be declared as records. The IT specialist extracts sample content from IBM FileNet P8 for training and testing purposes, and then, the rules can be tested with part of the extracted content to see how the system arrives at its classification decisions. The business analyst plans to use Classification Module to review documents. If the analyst disagrees with a classification decision, the analyst can reclassify a document by applying other classification criteria. By reviewing documents and either confirming the classification decision or reclassifying the content, the analyst helps to fine-tune the system and improve accuracy over time. This use case integration solution requires the following software stack: IBM FileNet P8 (either IBM FileNet Content Manager or IBM FileNet Business Process Manager) Classification Module
If the Classification Center menu is not shown, launch the Classification Module installation wizard as described in the IBM InfoSphere Classification Module Version 8.7 Information Center: http://publib.boulder.ibm.com/infocenter/classify/v8r7/ Look for the chapter Deploying on IBM FileNet P8, section Getting Started with IBM FileNet P8 integration, and paragraph Installing the integration components for IBM FileNet P8.
During the installation procedure, you will see the window in Figure 5-3, where you select the Custom radio button, and then, click Next.
When you see the dialog that is shown in Figure 5-4 on page 185, select IBM FileNet P8 Integration.
Complete the remainder of the installation. At the end, after rebooting, verify that the integration components are installed, as explained at the beginning of this section.
The Classification Module knowledge base structure that we use to classify IBM FileNet P8's unclassified content must match the existing folder structure. As a reference for now, Figure 5-6 shows a Classification Workbench - Knowledge Base Editor window with the knowledge base category structure that will be created in 5.4.2, Creating a knowledge base on page 215 of this chapter.
Figure 5-6 The Classification Workbench category structure in the Knowledge Base Editor window
Classification Module classifies documents based on categories that are defined in the knowledge base. If these categories relate to FileNet P8 folders or FileNet P8 document classes to which documents are to be assigned in the FileNet P8 repository, these folders and document classes must already exist in FileNet P8. The Classification Module server and the Classification Center are designed not to create these folders and classes automatically. To configure IBM FileNet P8 for the integration:
1. Import the AddOn file (icm_prps_addon.xml).
2. Install the AddOn file.
b. Select the IBM FileNet P8 server to integrate with Classification Module, and click Connect. See Figure 5-8 on page 188.
Figure 5-8 Select IBM FileNet P8 server to integrate with Classification Module
3. Right-click the AddOn folder (see Figure 5-9) in the right pane, and select New AddOn.
4. The window that is shown in Figure 5-10 on page 189 opens. Type a name for the new AddOn, for example, Classification Module. Then, click Browse to find the directory where the file icm_prps_addon.xml is copied (in our example, the C:\Temp directory).
Figure 5-11 Open the XML file for the new AddOn
6. Verify that the correct file path and file name show in the Import File field, as shown in Figure 5-12 on page 191. Then, click OK.
Figure 5-12 Verify the correct import file name and path for the new AddOn
Figure 5-13 on page 192 shows the result of the AddOn import operation.
2. The Add On Installer window appears, as shown in Figure 5-15 on page 194. In the Select AddOns to Install section, select the IBM Classification Module AddOn check box, and then, click Install.
3. The AddOn Installation Status window appears confirming the successful installation, as shown in Figure 5-16. Click OK and the window disappears.
4. To see the imported properties in the object store, from the menu bar, select Action → Refresh, as shown in Figure 5-17 on page 195.
Figure 5-17 Refresh the object store to see the new imported properties
2. The Document Class properties window that is shown in Figure 5-19 on page 197 appears. Select the Property Definitions tab, and click Add/Remove.
3. The window in Figure 5-20 on page 198 showing the list of the available properties appears. Scroll down the list of the available properties to find properties with the ICM_ prefix. These properties were added in the previous step 4. Select them all from the left pane and add them to the right pane by clicking Propagate (Figure 5-19), and then, click OK.
4. In the Document Class Properties window, click Apply, and then OK. Repeat the previous steps for each document class.
We used these files in our installation: C:\Program Files\FileNet\ContentEngine\lib\Jace.jar C:\Program Files\FileNet\ContentEngine\lib2\javaapi.jar 2. If Classification Module and IBM FileNet P8 are on separate servers, copy the entire FileNet_home/ContentEngine/wsi directory from the IBM FileNet P8 server to the Classification Module server, in any place on the hard disk. In our example, this step is not required, because we installed the two products on the same server. 3. Edit the ICM_home/ECMTools/setupCommandLine executable file and change the value of the WASP_HOME property to the path where the wsi directory resides, or to the path where you copied it in the previous step. In our example, Figure 5-21 shows where the file is located.
4. Edit the following executable files and change, if needed, the value of the P8_VERSION property to 4.0, as shown in Figure 5-22 on page 200. There are .bat files for Windows and .sh files for UNIX operating systems: ICM_home/ECMTools/startConnectTest ICM_home/ECMTools/startClassificationCenter ICM_home/ECMTools/stopClassificationCenter ICM_home/ECMTools/startExtractor
5. Edit the ICM_home/ECMTools/conf/wcmConfig.properties file to change the value of the wcmConfig property to WcmApiConfig40.properties. In our example, Figure 5-23 on page 201 shows where the file is located. (A consolidated recap of the edits in steps 3 through 7 appears at the end of this procedure.)
6. If necessary, change the value of the wcmConfig property to WcmApiConfig40.properties, as shown in Figure 5-24.
7. Edit the ICM_home/ECMTools/conf/WcmApiConfig40.properties file to identify the host name or IP address and the port for the IBM FileNet P8 server. See the example in Figure 5-25 on page 202.
8. Verify that Classification Module can connect to the IBM FileNet P8 server: a. Start the IBM FileNet P8 server if it is not running already. b. Open a command-line window. Go to the ICM_home/ECMTools directory, which is C:\IBM\ClassificationModule\ECMTools\ in our example. c. Locate the startConnectTest.bat file. In order to verify the connection with the IBM FileNet P8 system, you must launch this file with the following parameters: IBM FileNet administrator ID and password Configuration file
In our example, we launched the file as follows:
startConnectTest.bat -user administrator -password filenet -config conf\WcmApiConfig40.properties
d. The test procedure runs for a while. At the end, you see a message in the command-line window similar to the message that is shown in Figure 5-26 on page 203.
If the connection test fails, open the HTML file that was created by the startConnectTest.bat file (the ConnectTest.html file) for detailed information about the test results. In our installation, the ConnectTest.html file is located in the C:\IBM\ClassificationModule\ECMTools directory. The file appears similar to Figure 5-27 on page 204 when the test is successful. Ignore the warnings about possible .jar file server version mismatches. Verify that all of the previous steps were executed correctly.
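To recap the edits from steps 3 through 7, the following sketch shows how the changed lines might look on a Windows server with the paths that are used in our example. The set statements are an assumption about how these scripts define their variables, so always compare against the actual files in your installation:

  In ICM_home\ECMTools\setupCommandLine.bat (step 3):
    set WASP_HOME=C:\Program Files\FileNet\ContentEngine\wsi
    (the location of the wsi directory, or the directory to which you copied it)

  In startConnectTest.bat, startClassificationCenter.bat, stopClassificationCenter.bat, and startExtractor.bat (step 4):
    set P8_VERSION=4.0

  In ICM_home\ECMTools\conf\wcmConfig.properties (steps 5 and 6):
    wcmConfig = WcmApiConfig40.properties

  In ICM_home\ECMTools\conf\WcmApiConfig40.properties (step 7):
    (adjust the existing connection entry so that it points to the host name or IP address
    and port of your IBM FileNet P8 server; do not add new property names)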
One practical source of content for knowledge base training is the set of existing documents that are stored in the FileNet P8 repository. These documents are extracted by using the Content Extractor command-line tool. If documents are already organized in folders and document classes in the FileNet P8 object store, you can use the same structure and the exact folder and document class names to create the initial knowledge base categories.
Edit the properties file and set, at a minimum, the following properties before you run Content Extractor: If you extract content from an IBM FileNet P8 Version 4.0 server, ensure that the filenetWcmApiConfigPath property is set to conf/WcmApiConfig40.properties or to the name of the WcmApiConfig.properties file that you are using, according to the setup in 5.3.3, Configuring connectivity between Classification Module and IBM FileNet P8 on page 198. Refer to Figure 5-29 on page 207.
Ensure that the XML output directory that you specify in the XmlDirectory property exists and is empty. The default value is extractorOutput. Ensure also that the binaryOutputDirectory folder exists under the XML output directory and that it, too, is empty. The default value is binaryOutput as you can see in Figure 5-29. The default location for these two directories, which were created during the Classification Module installation, is C:\IBM\ClassificationModule\ECMTools, as shown in Figure 5-30 on page 208.
Figure 5-30 Default directory for XML output from Content Extractor
Ensure that the Path_n properties specify the paths of IBM FileNet P8 folders or document classes from which you want to extract content. The format is Path_x = objectStoreName, which means to extract everything from this object store, or Path_x = objectStoreName/FolderName/FolderName, which means to extract from this folder and its subfolders. In our example, we extracted documents from the review subfolder, which is contained in the icm_integration folder resident in the Content_OS object store. Therefore, the property is Path_1 = Content_OS/icm_integration/review, as shown in Figure 5-31. You can use more than one Path_x parameter for a single extraction operation.
Figure 5-31 Path for the object store from which to extract content
The Path_n parameters are mandatory for running Content Extractor. You can configure additional parameters to extract documents from IBM FileNet P8 according to various filters. Refer to the Extractor.properties file or the product documentation for further details.
With_x = key[=value]: Positive document constraint. Extract the document only if the key (and, optionally, the value) matches, for example:
With_1 = DocumentTitle = restaurant_wine_list.doc extracts documents with a specific title.
With_2 = DocumentClass= Document extracts all documents with a document class equal to Document.
With_4 = DocumentBaseClass= BaseDocument extracts all documents with the BaseDocument class and all of its subclasses.
Date: Date constraint. Extract the document only if it was modified after this date; use the format dd-MMM-yyyy, for example:
Date = 13-Jul-2008 extracts only documents modified after 13 July 2008.
FolderMax: Maximum number of documents to extract from each folder, for example: FolderMax = 10
FolderMin: Minimum number of documents to extract from each folder, for example: FolderMin = 1
A consolidated example Extractor.properties file follows.
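Pulling the settings from this section together, a minimal Extractor.properties file for our example might look similar to the following sketch. The values are the ones used in this chapter; verify the exact property names against the Extractor.properties file that ships with the product:

  filenetWcmApiConfigPath = conf/WcmApiConfig40.properties
  XmlDirectory = extractorOutput
  binaryOutputDirectory = binaryOutput
  Path_1 = Content_OS/icm_integration/review
  # Optional filters
  With_1 = DocumentClass=Document
  Date = 13-Jul-2008
  FolderMax = 10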
Figure 5-33 shows the Content Extractor console output for the previous command.
When the -m parameter is used, the Content Extractor provides the output in the ExtractorOutput.txt text file, which is located in the directory IbmClassificationModule_installation_path\ECMTools. In our example installation, it is C:\IBM\ClassificationModule\ECMTools, as shown in Figure 5-34 on page 212.
Figure 5-34 Directory where the output of the Content Extractor file is located
To read the file easily, we suggest that you use WordPad. The file appears similar to Figure 5-35, listing all of the document classes that are available in the object store that was previously set in the Extractor.properties file.
To extract documents from IBM FileNet P8 repository: 1. Open a command-line window. 2. Move to the directory where the Content Extractor is located, that is, IbmClassificationModule_installation_path\ECMTools. In our example installation, it is C:\IBM\ClassificationModule\ECMTools. 3. Launch the command: startExtractor.bat -u IBM_Filenet_P8_user_ID -p IBM_Filenet_P8_user_password -f Extractor.properties_file_with complete_path In our example installation, the command is: C:\IBM\ClassificationModule\ECMTools>startExtractor.bat -u Administrator -p filenet -f C:\IBM\ClassificationModule\ECMTools\conf\Extractor.properties At the completion of the document extraction activity, the command-line console appears similar to Figure 5-36.
The extracted documents' metadata is stored in an XML file that is located in the C:\IBM\ClassificationModule\ECMTools\extractorOutput directory, as shown in Figure 5-37 on page 214.
The binary document contents are written in a separate file in the C:\IBM\ClassificationModule\ECMTools\extractorOutput\binaryOutput directory, as shown in Figure 5-38.
Perform the classification using:
a. The statistical analysis provided by a knowledge base that was previously created, analyzed, and tested using Classification Workbench
b. The rules embedded in the decision plan that is created later in this section
A document whose statistical analysis score is equal to or higher than 80% for one or more categories is classified into the folders associated with those categories in the IBM FileNet P8 repository. Documents whose statistical analysis scores are below 80% for all categories are classified into the review folder for subsequent manual review. Before starting this procedure, you need to have an available knowledge base. There are use cases where the Classification Module and IBM FileNet P8 integration through the Classification Center works with only a decision plan, without the need for a knowledge base. However, in those cases, the classification relies only on keyword rules, and the document classification does not benefit from the statistical analysis of the document content and attributes.
2. The Workflow Assistant window appears, as shown in Figure 5-40. From the right pane, select Create a decision plan project.
2. In the next window, as shown in Figure 5-42 on page 219, select Create an empty project, and click Finish.
3. The window in Figure 5-43 appears. Right-click the Referenced Projects tree in the lower-left pane and add the project for the knowledge base that was created in 5.4.2, Creating a knowledge base on page 215. In our example, we select select_branch. 4. Now, click New Group, go to the upper-right pane, right-click anywhere, and select New Rule.
5. The New Rule window appears as in Figure 5-44. Type a descriptive rule name in the Name field. Leave the When triggered field set to Continue, leave the Enabled check box selected, and then, click the Trigger tab.
6. Referring to Figure 5-45, click the condition link, verify that the random field specifies 100% (which indicates that the rule will always be evaluated), click the check mark to apply the trigger, and then, click the Actions tab.
7. The window in Figure 5-46 on page 221 appears. Click Add actions.
8. The window in Figure 5-47 on page 223 appears. Choose a scenario for your classification in IBM FileNet P8 from this list of available scenarios and their explanations:
File the document in an IBM FileNet P8 folder: This scenario copies the document to one or more IBM FileNet P8 folders.
Move the document into an IBM FileNet P8 folder: This scenario copies the document to one or more IBM FileNet P8 folders and removes (unfiles) the document from its source folder.
Unfile the document from an IBM FileNet P8 folder: This scenario removes (unfiles) the document from one or all folders. The value of the content field contains a folder name or an asterisk (*), indicating that the document is removed from all folders. Note that documents are not deleted from the repository.
Set the document's IBM FileNet P8 document class: This scenario assigns a document class to documents.
Declare the document as a record in IBM FileNet Records Manager: This scenario declares documents as records in IBM FileNet Records Manager.
Set a metadata field for the document in IBM FileNet P8: This scenario sets a metadata field for the document in IBM FileNet P8. If the specified metadata field does not exist in the IBM FileNet P8 repository, it is added.
Set a document content field. This scenario sets the value of the specified content field to a value that is copied from another content field or the specified string, counter, or temporary variable. If you enter a new content field name, the content field is added to the content item. In our use case, we perform the following actions: 1. From the Available scenarios pane, we select the action to File the document to an IBM FileNet P8 Folder. 2. In the Scenario settings pane: a. We select the File document into one or more folders, based on category scores in the selected knowledge base radio button. b. From the knowledge base name combination box, select the knowledge base that you have previously added as a referenced project. In our example, we use select_branch. c. You can select one of the three available options to file the document: Top Category only: File the document in a single IBM FileNet P8 folder associated with the top score category. All categories whose score is above this percentage: File the document in one or multiple IBM FileNet P8 folders, depending on the assigned categories, their relative scores, and the threshold that you set in the combination box. All categories whose score is above the defined threshold in this file - File name: This option refers to comma-separated value (csv) files that were produced when setting thresholds during knowledge base tuning, when you work with Precision versus Recall values. For example, you might generate several threshold files based on different precision values, and then write a rule that refers to the different threshold files based on specific trigger conditions. Threshold files that are associated with the knowledge base are included when you add a knowledge base to a decision plan. Associated threshold files are listed under the knowledge base names in the Referenced Projects panel of the Decision Plan window. You can select or clear the check box next to each threshold file to enable or disable it in the decision plan. For details about this topic, refer to the Classification Module V8.6 product Information Center, chapter Configuring Classification - Knowledge Base Editor - Setting Thresholds: http://publib.boulder.ibm.com/infocenter/classify/v8r6/topic/c om.ibm.classify.workbench.doc/t_WBG_Set_Thresholds.htm In our example, we select the second choice, with a threshold of 80%.
3. After you complete the Add Action configuration, as shown in Figure 5-47, click OK. The New Rule window appears, which is shown in Figure 5-48 on page 224, with a summary of the configured actions. Make sure that the actions are correct, and click OK.
4. The Decision Plan window returns, as shown in Figure 5-49 on page 225, with all of the information of the new rule that was just configured.
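For reference, the rule that we just configured can be summarized in the same style as the rule listings that Classification Workbench produces. The exact trigger and action strings that the Add Action wizard generates are not reproduced here; the following listing is a descriptive sketch only:

  Rule Name: (the descriptive name that you typed, for example, classify)
  Rule Status: Enabled
  When Triggered: Continue
  Trigger: always evaluated (the random trigger is set to 100%)
  Actions: [on] file the document into the IBM FileNet P8 folder that is associated
           with each select_branch category whose score is above the 80% threshold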
3. In the Define a trigger for the rule window, open the combination box and select the less than (<) symbol, as shown in Figure 5-51 on page 227.
4. Click the first instance of the word number, and click the option specify match result. In the next window that appears (see Figure 5-52 on page 228), select the knowledge base from the referenced project, which in our example is select_branch, and then, click OK.
Figure 5-52 Define Review rule based on score result from the select_branch knowledge base
5. Click the second instance of the word number, click the option constant number, and type the threshold value in the field. In our example, it is 0.8, which means a score of 80%. Click the check box to confirm. The final result of the above trigger setting is Trigger: $select_branch__scores[1] < 0.8 (see Figure 5-53 on page 229). Therefore, each document with a top category score (assigned by the knowledge base) of less than 80% will trigger the action that we will write in the following steps.
Figure 5-53 Define if score is less than 80% for Review rule
6. Click the Action tab, and the window in Figure 5-54 appears. Click Add actions.
7. In the Add Action window, as shown in Figure 5-55 on page 231, from the Available scenarios pane, select the action to File the document to an IBM FileNet P8 Folder. 8. In the Scenario settings pane: a. Select the File document into a specific folder in IBM FileNet P8 radio button. b. In the Folder Name field, type the name of the IBM FileNet P8 folder where you want to file the documents with a top category score of less than 80%. In our example, this folder is Content_OS/icm_integration/review. Note: To ensure a correct classification from Classification Module into the IBM FileNet P8 repository, you must specify the complete path for the FileNet P8 folder name: <object_store_name>/<folder_name>/<subfolder_name_if_any> c. In the Preview Decision Plan Actions pane in Figure 5-55 on page 231, you can see the action syntax, which is add_to_content_field '$P8:File' 'Content_OS/icm_integration/review' in our example. d. Click OK.
9. The New Rule window appears, as shown in Figure 5-56 on page 232, with a summary of the configured actions. Verify that the actions are correct, and click OK.
10.The main Decision Plan window returns again, as shown in Figure 5-57 on page 233, with both of the rules configured.
11.Save your decision plan (our example is named branches), as shown in Figure 5-58 on page 234. From the Classification Workbench main menu, click Project → Save.
Note: Use Classification Workbench to create, analyze, and test knowledge bases and decision plans. Be sure to test and verify the results of a new or modified decision plan and knowledge base before deploying decision plans and knowledge bases to the production environment.
5.4.5 Exporting knowledge bases and decision plans from Classification Workbench
After creating the decision plan and knowledge bases, export them so that they can be deployed to the Classification Module server.
3. Select the project for the knowledge base that you want to export, as shown in Figure 5-60 on page 236, and then, click Open.
4. After the Knowledge Base Project opens in Classification Workbench, start the Export Wizard by clicking the blue arrow icon that is shown in Figure 5-61. You can also select File → Export to open the wizard.
5. The Export Wizard window appears, as shown in Figure 5-62 on page 237. Click Next.
6. Because we need to export the knowledge base to deploy to the Classification Module server, select Knowledge base, as shown in Figure 5-63 on page 238, and click Next.
Figure 5-63 Classification Workbench Export Wizard: Select knowledge base for export
7. It is possible to export a knowledge base in various formats, depending on how it will be used. For our purposes, select Knowledge base version 6.x and later, as shown in Figure 5-64 on page 239, and click Next.
8. Specify a path and a file name for the exported knowledge base, as shown in Figure 5-65, and click Finish. The knowledge base is now exported.
4. After the decision plan opens in Classification Workbench, start the Export Wizard by clicking the blue arrow icon, as shown in Figure 5-67 on page 241. You can also select File → Export to open the wizard.
5. When the Export Wizard window appears, click Next to continue. 6. Because we need to export the decision plan to deploy on the Classification Module server, select Decision plan, as shown in Figure 5-68, and click Next.
7. It is possible to export a decision plan in various formats, depending on how you will use it. For our purposes, select Decision plan file (*.dpn), as shown in Figure 5-69 on page 242, and click Next.
8. Specify a path and a file name for the decision plan that you are exporting, as shown in Figure 5-70, and click Finish. The decision plan is now exported.
5.4.6 Deploying the knowledge base and decision plan using Management Console
To make the decision plan and knowledge base available to a Classification Module server, they have to be deployed to the server through an administrative application called the Management Console.
In our example, the decision plan has a referenced knowledge base. Therefore, in this section, we perform the deployment of the previously created decision plan and a knowledge base that was created by following the instructions that were provided in Chapter 3, Working with knowledge bases and decision plans on page 59.
3. The Management Console main window appears, as shown in Figure 5-73. In the left pane, right-click Decision plans and select Add decision plan.
4. In the Add Decision Plan window, perform the following actions, as shown in the example in Figure 5-74 on page 245: a. Enter a name for the decision plan, which in our example is branches. b. Select the Import a decision plan radio button. c. Navigate to the DP_project_name.dpn file that you previously created with the Classification Workbench. The .dpn default file location is the Classification Workbench project directory, for example: ICM_home\Classification Workbench\Projects_Unicode\branches.dpn d. In the Servers pane, click the Server combination box and select your Classification Module server, in our example, HQDEMO1. e. In the Supported Languages pane, you must select a language even though a decision plan does not explicitly support languages. Select a language from the left pane, and move it to the right by double-clicking it or by using the move button (>). f. Leave all other default values, and click OK.
Figure 5-74 Select the Classification Module server to deploy the decision plan
5. The warning message box in Figure 5-75 on page 246 appears, because the deployed decision plan refers to a knowledge base that is not yet available on the Classification Module server. We deploy this knowledge base in the next steps of this procedure. Click OK for now.
Figure 5-75 Warning that the knowledge base associated with this decision plan is missing
3. In the Add Knowledge Base window, perform the following actions, as shown in Figure 5-78 on page 248: a. Type the knowledge base name, which is select_branch in our example. Note: Regardless of the knowledge base file name, the name that you type here must match the name that is used in the decision plan. b. Select the Import a knowledge base radio button.
c. Navigate to the KB_project_name.kb file that you previously created with the Classification Workbench. The .kb default file location is the Classification Workbench project directory, which is normally located in the folder ICM_home\Classification\Workbench\Projects_Unicode. In our example, the folder is the ICM_home\Classification\Workbench\Projects_Unicode\select_branch\ folder, and the knowledge base file is the select_branch.kb file. d. In the Servers pane, click the Server combination box and select your Classification Module server, in our example, HQDEMO1. e. In the Supported Languages pane, select one or more languages from the left, and move them to the right by double-clicking the language or by using the move button (>). Your language choices depend on the languages present in the documents that your system expects and the languages that are used to build the (multilingual) knowledge base. f. Leave all the other default values, and click OK. 4. The Management Console window reappears, as shown in Figure 5-77, with only the just deployed knowledge base.
Figure 5-78 on page 248 shows the detailed information of the deployed knowledge base.
5. Start the decision plan and all related knowledge bases by clicking Decision plans in the console tree, right-clicking the decision plan name in the left pane, and selecting Start decision plan and associated knowledge bases.
Refer to the example in Figure 5-79. Another way to start the decision plan and all related knowledge bases is to click the green arrow icon.
Configuration
Clicking Configuration on the main page enables you to launch the configuration tool and specify how Classification Module is to classify documents into folders or document classes in IBM FileNet P8.
General Settings
Specify information about the Classification Module server, the name of the decision plan that you want to use for rule-based classification, and the name of the IBM FileNet P8 object store where the classified content is to be stored. If your decision plan includes rules for declaring documents as records, you can create records from classified documents in a file plan object store and put the documents under the control of IBM FileNet Records Manager.
You can also structure content by specifying how IBM FileNet P8 document properties are to be mapped to fields in Classification Module. By mapping fields, you define the elements that you expect Classification Module to analyze and act upon.
Content to Classify
Select the folders or document classes that contain the content to be classified. You can refine the content set by excluding certain folders or including documents only if they meet certain criteria, such as documents that contain specific document properties or document property values, or if the document date falls in a specific time period.
Runtime Settings
You can control how the classification processes use system resources by limiting the number of documents to classify, specifying the number of seconds that the server is to wait before looking for new documents to classify, and increasing or decreasing the number of threads that the classification processes use to classify content. You can also specify options for previewing the content of documents that you review. For example, you can prevent large documents from being displayed in the Classification Center. To be able to open documents for review in IBM FileNet P8, you can provide the URL for the IBM FileNet Workplace XT browser-based application.
Dashboard tab
Use the dashboard to start, stop, and monitor the progress of the classification processes. You can view process statistics, such as how many documents were classified and how many documents were flagged to be reviewed. You can also view summary information, such as which folders or document classes received the greatest number of documents and which rules returned the greatest number of matches.
Event log
You can view a history of all classification activity, such as when a document was classified and the name of the folder or document class in which it was filed.
Error log
If you need to troubleshoot a problem, you can view the error log. While viewing the log, you can select individual messages to see details about the error.
Review tab
Review the classification results and, if necessary, reclassify documents.
Review documents
Use the options on this page to view source documents and review the actions that were applied by the classification processes. You can either confirm that the document is correctly classified or select other categories and reclassify the document. By reviewing documents, you help to verify that the system is performing as expected and to ensure that the correct folders, document classes, and decision plan actions are applied during classification. In addition, when you review documents and manually select appropriate categories, the system learns from your selection, thereby improving future classification.
Add document
You can add new documents to the IBM FileNet P8 repository. When you add a document, Classification Module analyzes its content and suggests how to classify it. You can then review and confirm the actions or reclassify the document just as you reclassify any other document that is available for review.
3. The Management Console main window appears. Click the decision plan that you will use and verify that it is started. Check the related knowledge bases
that you will use and verify that they are started by right-clicking each object. All running objects will show the Start option grayed out, as shown in Figure 5-82.
Figure 5-82 Verify that the decision plan and knowledge bases are started
You must have read/write permissions for all of the folders in the IBM FileNet P8 object store that are required for classification.
3. After the Classification Center server is started, click Start → Programs → IBM Classification Module 8.6 → Classification Center → Classification Center to open the Classification Center in a Web browser, as in Figure 5-85.
4. The login page in Figure 5-86 on page 256 appears. Type your IBM FileNet P8 administrator ID and password, and click Log on, as in our example.
The first time that the Classification Center starts, certain configuration options have yet to be set, resulting in the warning in Figure 5-87. We will address setting these configuration options next, so for now, click Close.
Figure 5-87 Warning message when first starting the Classification Center
3. When the window, which is shown in Figure 5-89 on page 259, appears, type these values: The Classification Module URL, including the port number. In our example, it is http://hqdemo1:18087. The Classification Module decision plan that was previously deployed. In our example, it is branches. 4. Click Save.
3. When the window that is shown in Figure 5-91 on page 262 appears, enter these values: The IBM FileNet P8 object store that contains the content to classify and where the content is to be stored after it is classified. If the decision plan that was previously selected includes rules for declaring documents as records: Select the Declare documents as records check box. Specify the name of the IBM FileNet Records Manager object store where records are to be created for documents that are declared as records when they are classified.
4. Click Save.
Field mapping
Classification Module analyzes document fields that are designated for natural language processing. To enable the classification processes to better classify documents, you need to map IBM FileNet P8 document properties to Classification Module fields. The list of available IBM FileNet P8 document properties is determined by the object store that contains the content to be classified. To be able to view a list of the available document properties, ensure that the IBM FileNet P8 server is started. Also, ensure that the Classification Module fields that you want to use for classifying content and to which you want to map document properties are defined in the Management Console application (field definitions). If important information about the documents that you want to classify is stored in document property fields in IBM FileNet P8, you can map these document properties to Classification Module fields so that the information can be evaluated as part of the classification process. For example, if documents in your IBM FileNet P8 repository have a text field that contains comments about the document, you might want to evaluate this metadata in addition to the actual document content when the documents are classified. Note: At least one mapping is required.
To map document properties to fields so that they can be used for classification: 1. In the Management Console, check which field definitions are available. A large number of default fields are available. Common Classification Module fields include Document (data type Document), Title (data type Text), and FileName (data type Text). For the IBM FileNet P8 integration, at least one field of data type Document is required. If such a field does not exist, add it, as shown in Figure 5-91.
2. In the Management Console, restart all knowledge bases if you added new field definitions. 3. In the Classification Center, select Configuration → General Settings → Edit Field Mappings, as shown in Figure 5-92 on page 263.
4. The window in Figure 5-93 appears. You can see that the default field mapping has already been done by the system. In order to be used in the classification process, each IBM FileNet P8 document property needs to be mapped to an appropriate Classification Module field. If you need additional field mapping, click Add Document Property, and then, click Browse.
Figure 5-93 Mapping the IBM FileNet P8 document property to the IBM Classification Module field
5. In the Document Properties window, expand the list of available document classes, select the class that contains the document property that you want to map, and then, select the document property. 6. From the list of available Classification Module fields, select the field to which you want to map the document property, and click Save. When the classification processes run, the content of the document property will be evaluated when determining which categories a document matches best and which decision plan actions to trigger.
Classification filters
When you configure the content to classify, you specify filters to include or exclude documents from the classification process. Although you can specify the content that you want to classify by typing your preferences in fields, you can browse and select folders, classes, properties, and so on, only if the IBM FileNet P8 server is running. To specify a classification filter: 1. In the Classification Center, select the Configuration page, and then, select Content to Classify. When the window in Figure 5-94 on page 265 appears, click Edit Classification Filters.
2. On the Content to Classify page that is shown in Figure 5-95 on page 266, expand the filter that you want to configure. You have two options: Enter the folder name into the field. Click Browse, connect to the IBM FileNet P8 object store, and select the folder that you need.
3. In our example, we have selected only the Start Folder and the Document class, choosing from the list obtained from the IBM FileNet P8 system, as shown in Figure 5-96 on page 267. After making your choices, click Save.
2. When you are returned to the Classification Module - Configuration tab Runtime Settings pane, click the Dashboard tab.
Remember that category names in the knowledge base must follow this convention: For classification into folders: object_store_name/folder_path, for example: ObjectStore1/ParentFolder/SubFolder For classification into document classes: object_store_name/document_class_name, for example: ObjectStore1/DocumentClass To gather the document classes and folders that are needed in IBM FileNet P8: 1. Launch Classification Workbench. 2. Using the Classification Workbench Workflow Assistant, click Open an existing project. 3. Open the decision plan that you have configured in 5.4.3, Creating a decision plan on page 215 in the Classification Center, in our example, branches. As shown in Figure 5-99, select the decision plan, and then, click Open.
4. Look at the lower-left pane to see the knowledge bases that are referenced projects from the decision plan; in our use case, there is just one, select_branch. Double-click select_branch to open the knowledge base, as shown in Figure 5-100 on page 271.
Figure 5-100 Open knowledge base that is referenced by the decision plan
5. Click the Category tab located in the left pane, and then, enlarge the pane to see the complete category name that must comply with the rule explained at the beginning of this section. See the example in Figure 5-101 on page 272.
Figure 5-101 View categories for the knowledge base named select_branch
Note: Decision plans can contain rules that move or file documents to specific IBM FileNet P8 document classes, folders, or subfolders. If you are not the person who created the decision plan, check with the author of the decision plan to understand if these types of actions are present in the decision plan. Alternatively, you can check by going through all the rules of the decision plan, looking for actions, such as the action that we put in the branches decision plan, in the review rule: Rule Name: review Rule Status: Enabled When Triggered: Continue Trigger: $select_branch__scores[1] < 0.8 Actions: [on] add_to_content_field '$P8:File' 'Content_OS/icm_integration/review' If your results show that document classes or folders are missing, you have to refer to the IBM FileNet P8 administrator to address the problem, or make the Knowledge base match the IBM FileNet P8 document classes or folder structure.
Start classification
The Classification Module server and the IBM FileNet P8 server must be running in order to classify documents. The decision plan that is used to classify content and its associated knowledge bases must also be running. For detailed instructions, go to 5.5.2, Working with the Classification Center on page 252. To run the classification processes: 1. If the Classification Center is not running on the Dashboard tab, follow these steps; otherwise, go to step 2: a. Start the Classification Center server by selecting Start → Programs → IBM Classification Module 8.6 → Classification Center → Start Classification Center server. b. Wait for the server to start, and then, select Start → Programs → IBM Classification Module 8.6 → Classification Center → Classification Center. c. Click the Dashboard tab, and when required, enter your IBM FileNet P8 administrator ID and password. 2. Start the classification by clicking Start classifying, as shown in Figure 5-102 on page 274.
3. The pane changes, as shown in Figure 5-103 on page 275. Notice the Elapsed time counter is running.
Use the dashboard to view the progress of the classification activity or to open the event and error logs. The classifier can be left running with the status "Waiting for documents", or it can be stopped by clicking Stop classifying.
then configure the document review filter in the Classification Center to review only the documents in those folders.
Figure 5-104 Click the Review icon to start the document review process
2. On the Review Documents page, click the Filter Settings button, as shown in Figure 5-105 on page 279.
3. Scroll down the window to find the Edit Filter Settings option, as shown in Figure 5-106 on page 280, and click it.
4. The Document Filter page appears, as shown in Figure 5-107 on page 282. Expand the filter that you want to configure, and specify criteria for the documents that you want to review. You can specify and configure one or more of the following criteria: Start folders: Specify any number of folders that contain documents that are to be reviewed. The review set will include documents in the folder that you specify and documents in subfolders of that folder. Click Browse to select a folder from the list of folders in the object store. Click Add Folder to include additional folders in the review set. Skip folders: Specify any number of folders that contain documents that are not to be reviewed. The review set will exclude documents in the folder that you specify and documents in subfolders of that folder. Click Browse to select a folder from the list of folders in the object store. Click Add Folder to exclude additional folders from the review set.
Document classes: Specify any number of document classes that contain documents to review. Click Browse to select a document class from the list of classes in the object store. Click Add Document Class to include additional document classes in the review set. Document properties: Limit which documents to review by specifying that only those documents that have specific document properties are included in the review set. Click Browse to select the document class to which the property belongs, and then, select the document property. Click Add Property to specify additional document properties. Document property values: Limit which documents to review by specifying that only those documents which have document properties with specific values are included in the review set. Click Browse to select the document class to which the property belongs, select the document property, and then, type the value that the property must contain. Click Add Property to specify additional document properties and values. Date: Limit which documents to review by specifying that only documents with dates that match a date or date range are included in the review set. This value is the date that the document was last modified. You can specify whether the document date must occur before or after a specific date or specify that the document date must occur in a specific date range. Classification status: Use a documents classification history to control which documents are included in the review set: Do not use the classification status to filter documents: To ignore the classification status when, for example, significant changes were made to the decision plan or a knowledge base after documents were first classified. This option ensures that all documents that match the other document filter criteria are presented for review. Include only documents that were previously classified by IBM Classification Module: To review documents that were previously classified if the decision plan or knowledge base significantly changed, and you want to ensure that the new classification actions are being applied correctly. Include only documents that were not previously classified by IBM Classification Module: To review only newly classified documents if the decision plan or knowledge base did not significantly change. You can limit the review set to documents that were added to the IBM FileNet P8 object store and classified for the first time.
In our example installation, we configured the start folder, as shown in Figure 5-107. The icm_integration/review folder is used to store documents with a classification score under the selected threshold. 5. You can select the Save as the default filter check box if the current document filtering options are to be automatically applied when you review documents in future Classification Center sessions. The default filter remains in effect until you save other options as a new default filter. If you change the document filter settings without saving them as the default filter, the settings remain in effect only for the duration of the current Classification Center session. 6. Click Save on the right side of the window to save your settings.
decision plan that was previously configured in 5.5.3, Configuring Classification Center on page 257. Launch IBM FileNet P8 Workplace XT using the appropriate option. Be aware of the number of documents in the review set under revision. Look at and modify the Filter Settings through the appropriate button. Obtain information about the document under review, document type, and title. More information is available by clicking Show details. Look at the document content by clicking View full document.
In our example installation, we use the branches decision plan, where a rule is designed to put all the documents with a top category score under 80% in the review folder (for details, see Create the second decision plan rule on page 225). Now, we review this document set and want to submit the document again to the Classification Module to use the feedback already received by the knowledge base in order to have a more accurate classification. It might be useful to look at the document content to understand it, using the View full document option. 4. In the Classification Module Review tab window that is shown in Figure 5-108, scroll the window down to find the Decision History section and look at the score previously assigned for each category, as shown in Figure 5-109 on page 285.
5. Click the Reclassify icon at the top of the window, and the window that is shown in Figure 5-110 appears.
6. You can reclassify a document using the category that is suggested by Classification Module, or make your own choice if you think it is more appropriate for an accurate classification. In our example, we selected the
Reclassify by selecting specific categories option. Here are several additional options that you can select: Open the combination box, as shown in Figure 5-110 on page 285, to choose a category from the top five categories. Click the ellipsis button (...) to choose a category that is not among the top five categories. Click the plus button (+) to add more categories to be assigned to the document, as shown in Figure 5-111.
7. After you click Submit, the Classification Rules and Actions window appears, as shown in Figure 5-112 on page 287. Review the rules that were triggered by the classification processes, the actions that the decision plan suggests, and the top scoring categories from each knowledge base associated with the decision plan. 8. If you want to save the document in XML format so that you can import the data into Classification Workbench and use it to tune the decision plan, click Save Document in XML Format. When you are prompted to open or save the file, select the option to save the file to disk.
9. Click Apply Actions. The document is classified according to the selected actions and removed from the review set. 10.The document is classified in FileNet P8 as you have decided, and the next document in the queue in the review folder is presented to be reviewed; this process happens until there are no documents left in the queue to review.
4. On the Add Document window that is shown in Figure 5-114, click Browse to select the file that you want to upload or type its network path, and click OK.
Figure 5-114 Browse to the document location to add it for classification analysis
5. Classification Module processes the document, and the decision history is displayed, as shown in Figure 5-115 on page 289. From the decision history, you can review the rules that were triggered for the document by the classification processes, the actions that the decision plan suggests, and the top scoring categories from each knowledge base associated with the decision plan. On this window, you can apply the classification actions that you want and clear the check boxes for actions that you do not want. Then, click Apply Actions, and the document is added to the IBM FileNet P8 folder or document class that the decision plan recommends.
6. If the suggested classification actions are not wanted, click Reclassify in order to specify other actions or categories for classifying the document. The window that is shown in Figure 5-110 on page 285 appears. 7. Follow the same steps described for reclassifying documents (step 7 to step 10 on page 286). 8. In the event that the classification of the document is no longer required, you can click Discard Document. The classification actions are discarded, and the document is not added to the IBM FileNet P8 object store.
Chapter 6.
Module, an application that enables Content Collector for File Systems to look inside documents to understand and classify their contents for use in a Content Collector task route. Figure 6-1 shows the architecture overview of the integration between Content Collector for File Systems and Classification Module.
Figure 6-1 Architecture overview of the Content Collector for File Systems and Classification Module integration. In the standard flow, IBM Content Collector (ICC) archives content into IBM FileNet P8 or IBM Content Manager (CM8) after consulting Classification Module (ICM), which uses statistical methods (knowledge bases) to determine the document classification and to declare records. Optionally, for IBM FileNet P8, the Classification Center 8.6 provides additional services on the archived documents (automatic classification, record declaration, and manual review), using rule/keyword-based methods (decision plan) and statistical methods (knowledge bases).
Within this integration, Classification Module uses a knowledge base to analyze documents to discover possible categories and their relevancy scores. You can use the top relevancy score and the top category name in IBM Content Collector task routes for any of the following typical scenarios: Assign the folder path for the archived documents in an IBM FileNet P8 or IBM Content Manager (CM8) repository. Identify documents that need to be declared as records. Identify documents that must be reviewed in the Classification Center to enhance the knowledge base accuracy (applicable to the IBM FileNet P8 repository type only when the Classification Module and FileNet P8 integration asset is installed and configured).
6.3.1 Installing Classification Module client components on the Content Collector server
This example assumes that you have installed the following software successfully: IBM Content Collector for File Systems Classification Module server components IBM FileNet Content Manager or IBM FileNet Business Process Manager IBM FileNet Records Manager
For IBM Content Collector to use the classification capabilities of Classification Module, you must install Classification Module client components on the Content Collector server by using the following procedure: 1. Run the Classification Module installation program on the IBM Content Collector server. 2. Select the option to install Custom components. 3. Select the Classification Module Client only check box, as shown in Figure 6-2, and then, finish the rest of the installation steps.
Figure 6-2 Installing Classification Module client components on the Content Collector server
Follow this process to register the Classification Module .dll files: 1. Copy the three .dll files to the ctms directory under the Content Collector for File Systems installation path (for our case study, it is the C:\Program Files\IBM\ContentCollector\ctms directory), and rename the bnsClient86.dll file to bnsClient85.dll. 2. Open a command window, and from the ctms directory, run the utility connector in registration mode by entering the following command: utilityconnector.exe -r If the command returns an error message saying, Failed to create service: ibm.ctms.utilityconnector.UtilityConnector -- The specified service already exists, most likely you registered the .dll files in the wrong location. To correct the operation, perform the following steps: a. Use the following command to first unregister the Classification Module .dll files in the location where you issued the initial registration command: utilityconnector.exe -u b. Ensure that you change to the ctms directory, and register the .dll files again using the following command: utilityconnector.exe -r A consolidated command sequence follows.
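Assuming the default installation paths that are used in this case study, the registration can be run from a command window similar to the following sketch. The source location of the three .dll files depends on your Classification Module client installation, so copy them from wherever they were delivered:

  cd /d "C:\Program Files\IBM\ContentCollector\ctms"
  rem The three Classification Module .dll files were already copied into this directory
  ren bnsClient86.dll bnsClient85.dll
  utilityconnector.exe -r
  rem If the files were registered in the wrong location earlier, run utilityconnector.exe -u there first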
2. In the Registry Editor window, navigate to My Computer HKEY_LOCAL_MACHINE SOFTWARE IBM EMRC 4.0 Service Declarations2 ibm.ctms.utilityconnector.UtilityConnector, and validate that the registry entry of ibm.ctms.utilityconnector.ICMClassificationTask is listed, as shown in Figure 6-3.
Figure 6-3 Validating that the Windows registry entry of ibm.ctms.utilityconnector.ICMClassificationTask exists
The Classification Module task produces the metadata values during the run time. In turn, you can use them in Content Collector task routes to determine how documents are processed.
Use the following steps to ensure that the Classification Module system metadata properties are available in the Content Collector Configuration Manager:
1. Launch the Content Collector Configuration Manager by clicking Start → Programs → IBM Content Collector Configuration Manager.
2. In the Content Collector Configuration Manager:
a. Go to the Metadata and Lists box by clicking Metadata and Lists in the Navigation pane.
b. In the Metadata and Lists box on the left pane, select the System Metadata type, and then select the IBM Classification Manager system metadata in the middle System Metadata pane. Ensure that five Classification Module system metadata properties are displayed on the right pane. See Figure 6-5.
Figure 6-5 Classification Module system metadata properties in Content Collector Configuration Manager
You need to configure Content Collector for File Systems to work with Classification Module before using it in file system archiving solutions. The following list summarizes the configuration process to use Content Collector for File Systems and Classification Module together:
1. Define a knowledge base and Classification Module field definitions.
2. Configure a file system archiving task route, including a Classification Module task.
3. Activate the system for archiving.
This section introduces the configuration process in light of the HR legal discovery and compliance use case for our sample company, as described in 6.1.3, Use case description on page 294.
6.5.1 Defining a knowledge base and the Classification Module field definitions
A knowledge base provides the criteria that Content Collector for File Systems uses to determine if a document needs to be captured and, if so, how it is processed. In addition, the Classification Module server receives documents as a series of fields. The field definition defines the data type and the method of language processing that the classification server performs on the field. The following steps are the typical workflow to define a knowledge base and configure Classification Module field definitions: 1. Create, analyze, and tune a knowledge base in Classification Workbench. 2. Add a knowledge base to the Classification Module server in Management Console. 3. Configure field definitions in Management Console.
Using initialization data
When a content set is not available, you can use initialization data to build the knowledge base. Initialization data consists of keywords and texts associated with individual categories. Keywords are words or phrases that you expect will appear in documents classified by the classification server. In addition to keywords, you can associate one or more texts with categories. The Classification Module server uses this information to classify incoming documents appropriately.
There are many ways to prepare a categorized content set for building a knowledge base. In our example, we first identify a set of representative documents that are pertinent to the application environment, and we organize the documents in a file system folder structure. The category names that we choose for our knowledge base are used as the folder names. We then import this folder structure into Classification Workbench to create, analyze, and tune the knowledge base. We describe the detailed steps in Chapter 3, Working with knowledge bases and decision plans on page 59.
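To illustrate, the content set for the HR example might be laid out on disk as follows, where each folder name becomes a category name when the structure is imported into Classification Workbench. The root path is hypothetical; the folder names match the categories used later in this chapter, and each folder contains the representative documents for that category:

C:\HRContentSet\
    401k\
    EAP\
    Health Care\
    Leaves of Absence\
    Pay\
    PTO\
    Stock Options\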
3. In the Add Knowledge Base window, which is shown in Figure 6-7 on page 304, define the following fields to add a knowledge base to the Classification Module server:
- Specify the knowledge base name. In our example, it is HR.
- Import a knowledge base that contains statistics by selecting the Import statistics from file radio button.
- Instruct the Classification Module server to read the knowledge base file that you want to import by selecting the Access file from server option and browsing to the location of the knowledge base file.
- Specify runtime options for the knowledge base:
  - Specify whether the knowledge base is to use a cache. A cache enables the system to handle knowledge bases that are too large to load into memory in one piece. In our example, because the HR knowledge base is within a reasonable size, we do not select the Use a cache option.
  - The Back up automatically option automatically creates backup copies of the knowledge base when you make changes to it, such as importing knowledge base statistics, changing feedback options, and adding or removing a read-only instance. Having a backup is useful, for example, if you need to reproduce results from the previous version after the knowledge base is changed. In our example, we select Back up automatically.
  - Specify how feedback is to be processed. Because inaccurate feedback degrades the knowledge base accuracy, in most business environments, feedback is reviewed and applied by a knowledgeable user at a later time. So in our example, we select the Defer processing option.
- Specify the servers and ports for running read/write and read-only instances of the knowledge base. In our example, we specify the current server to run both read/write and read-only instances.
- Specify the languages that the knowledge base is required to support. In our example, our knowledge base supports English.
Within the Content Collector and Classification Module integration, in order to have the Classification Module server process the content of a document, you have to use a field with the data type of Document. We use the predefined Document field, as shown in Figure 6-8, to build the Content Collector task route.
Important: When creating a task route in Content Collector for File Systems, you must arrange tasks in a specific order. Not every task is necessary, depending on your desired configuration, but the tasks that are present must follow this order:
1. Create document.
2. Classify with the Classification Module task.
3. File document in folder.
4. Declare record.
5. Perform post-processing.
In particular, you can only use the Classification Module task after the document is created. Follow these steps to build the file system archiving task route:
1. Launch the Content Collector Configuration Manager by clicking Start → Programs → IBM Content Collector Configuration Manager.
2. Go to the Task Routes explore pane and toolbox by clicking Task Routes in the Navigation pane of the Configuration Manager, as shown in Figure 6-10.
Figure 6-10 The Task Routes explore pane and Toolbox of the Content Collector Configuration Manager
3. In the Task Routes explore pane, click the Add icon to add a new task route, as shown in Figure 6-11 on page 308.
4. Content Collector allows you to create new task routes from the beginning or from existing templates. In our example, we show you how to create a simple file system archiving task route from the beginning. In the New Task Route, we select Blank task route as a starting point for the task route creation, and we enter HR Task Route as the task route name, as shown in Figure 6-12 on page 309. Click Create.
Figure 6-12 Creating a new task route from a blank task route
5. A new blank task route with Start and End nodes appears in the Designer pane, as shown in Figure 6-13. Click the Save icon to save your progress.
6. For a new file system task route, a File System Collector is required to define the location from which files will be captured. Follow these steps to add and configure an FSC Collector in your task route: a. From the Toolbox pane, under File System Source, select FSC Collector, as shown in Figure 6-14.
b. Click anywhere in the Task Route Designer pane to drop the new FSC Collector task onto the task route, as shown in Figure 6-15 on page 311.
c. After adding the FSC Collector, you can configure it:
- Name and enable the collector on the General tab page.
- Specify when to collect files on the Schedule tab page.
- Specify where to collect files on the Collection Sources tab page.
- Specify files that must not be collected on the Filter tab page.
In our example, we describe the settings of the FSC Collector for each tab in Table 6-2 on page 312.
Table 6-2 Settings for the four configuration tabs of the FSC Collector
General: Enter your task route name and its description. Also, ensure that this collector is Active.
Schedule: The collector can run at different frequencies, such as daily, weekly, and monthly. In our example, we set the collector to run always.
Collection Source: Define the file system folders that are monitored by the collector. These folders can be local folders or folders on a shared network drive. In our example, we specify the monitored folder as the C:\HRMonitored folder, and select the Monitored sub-folders option, as well.
Filter: Set the filter so that the collector ignores certain documents. In our example, we set the filter options to ignore documents that are already processed, already captured, and where access is denied.
7. Files are stored as documents in the IBM FileNet P8 content repository. To create a document in the repository, you need to specify where to create the document and how to index it. You specify where to create the document by selecting a repository from a list of the configured repositories. You index the document by choosing a document class for the item and by specifying the values to assign to each property of that class. Perform the following steps to add and configure a P8 4.x Create Document task in your task route: a. From the Toolbox pane, under FileNet P8 4.x Repository, select the P8 4.x Create Document task, as shown in Figure 6-16 on page 313.
b. Click anywhere in the Designer pane (white space) to drop the new task onto the task route. After you have dropped the task onto the task route, you can drag it over the arrow connecting the Start and End nodes until the task and the arrow are highlighted and the task icon has an exclamation point (!) in a yellow triangle, as shown in Figure 6-17 on page 314.
Figure 6-17 Adding a P8 4.x Create Document task to the task route
c. You can configure the P8 4.x Create Document task with the settings that are described in Table 6-3 on page 315.
Table 6-3 Settings of the P8 4.x Create Document task
General: Enter a name and a description for the task.
P8 Connection: From the Connection drop-down list, select the connection to use when creating a document. In our example, we choose a predefined P8 4.x connector connecting to a P8 content object store.
Check In Options: Set Check In options for your document, such as version and content capture options. These options are required by P8. For more information about each option, refer to the P8 Enterprise Manager documentation. In our example, we accept the default settings.
Property Mappings: Select the document class that you want to use when creating the document, and enter values for the document class properties. In our example, we select the document class called document, and map the Document Title property with the File Name metadata value.
8. After the documents are created, they are passed to the Classification Module task for analysis. The content of each document can be used to determine whether to capture a document, rather than relying solely on the document metadata or source location metadata to make this determination. In addition, for each document, the Classification Module task populates its system metadata properties as described in 6.4.3, System metadata on page 299. You can use these metadata properties in the task route to determine how to process the documents. Perform the following steps to add and configure a Classification Module task in your task route: a. From the Toolbox pane, under Utility, select the IBM Classification Module task, as shown in Figure 6-18 on page 316.
b. Click anywhere in the Designer pane between the P8 4.x Create Document task and the End node to drop the new task onto the task route, as shown in Figure 6-19 on page 317.
c. Perform the following steps to configure the Classification Module task:
- Specify the host name of the server that hosts the Classification Module server.
- Specify the port of the Classification Module listener component. The default port is 18087.
- Select the knowledge base that you want to use for analyzing the files collected from the monitored file share.
- Select the content field that identifies the part of a document that you want Classification Module to analyze. To process the content of the document, this field must be of type Document.
- Specify the relevance threshold. In our example, the threshold is set to 0.3, which means that Classification Module only returns suggested categories with relevancy scores greater than 0.3. In other words, if the relevancy score of the top category is not greater than 30%, Content Collector does not capture this document.
- Specify the maximum number of categories suggested by Classification Module to be used by Content Collector for processing in a task route.
Figure 6-20 shows the settings of the Classification Module task as configured in the HR task route example.
To support this legal discovery and compliance use case, the Content Collector task route uses the Most Relevant Score and Most Relevant Category system metadata properties to determine how to process each document. You need to create a Decision Point and two rules branching from the Classification Module task to handle the documents based on the values of the Most Relevant Score and Most Relevant Category system metadata properties.
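Taken together, the 0.3 relevance threshold and the 90% rule encode a simple routing policy. The following sketch is illustrative only: it is Python pseudocode, not the Content Collector or Classification Module API, and the helper functions and folder names are hypothetical stand-ins for the task nodes configured in the remaining steps.

# Illustrative sketch of the routing policy expressed by the HR task route.
# Not product code: the helpers stand in for the Classification Module task
# and the P8 4.x tasks described in this chapter.

RELEVANCE_THRESHOLD = 0.3   # set on the Classification Module task
AUTO_FILE_THRESHOLD = 0.9   # the "Score > 90%" rule on the Decision Point

def classify(doc):
    # Stand-in for the Classification Module task: returns the top category
    # and its relevancy score (values here are made up for illustration).
    return "Pay", 0.95

def file_in_folder(doc, folder):
    print(f"filing {doc} into {folder}")       # stand-in for P8 4.x File Document in Folder

def declare_record(doc):
    print(f"declaring record for {doc}")       # stand-in for P8 4.x Declare Record

def route_document(doc):
    category, score = classify(doc)
    if score <= RELEVANCE_THRESHOLD:
        return "not captured"                   # the file stays in the monitored folder
    if score > AUTO_FILE_THRESHOLD:
        file_in_folder(doc, f"/HR LegalCase/{category}")   # most relevant category folder
        declare_record(doc)
    else:
        file_in_folder(doc, "/HRReview")        # designated review folder
    return "captured"

print(route_document("offer_letter.doc"))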
9. Perform the following steps to add a new Decision Point to your task route: a. From the Toolbox pane, select Decision Point. b. Click anywhere in the Designer pane between the Classification Module task and the End node to drop the new Decision Point onto the task route, as shown in Figure 6-21.
10. Configure the arrow connecting the Decision Point and the End node as a rule by taking the following steps:
a. Click the rule arrow to display the rule Configuration Pane.
b. This branch of the task route processes documents whose Most Relevant Score (the score of the Most Relevant Category) is greater than 90%, so we name this rule Score > 90%.
c. For the evaluation criteria, select the Configure rule radio button, and click Add.
d. In the Edit Conditional Clause window, configure the rule with the following fields, as shown in Figure 6-22:
- For the Metadata type, select IBM Classification Module.
- For the Property, select Most Relevant Score.
- For the Operator, select Greater than.
- Select the Literal radio button with the value of 0.9.
Figure 6-23 on page 321 shows the rule configured for the left branch of the task route, which processes documents whose Most Relevant Score (for the Most Relevant Category) is greater than 90%.
Figure 6-23 Configuring the rule of the left branch of the task route
11.In order to browse for the document at a later point, you must add the document to a folder in the repository by using the P8 4.x File Document In Folder task. Perform the following steps to add and configure the P8 4.x File Document In Folder task: a. From the Toolbox pane, under FileNet P8 4.x Repository, select the P8 4.x File Document In Folder task. b. Click anywhere in the Designer pane between the Decision Point and the End node to drop the P8 4.x File Document in Folder task onto the task route. c. You can configure the P8 4.x File Document in Folder task with the settings that are described in Table 6-4 on page 322.
Table 6-4 Settings of the P8 4.x File Document in Folder task
General: Enter a name and a description for the task.
P8 Connection: From the Connections drop-down list, select the connection to use when filing a document. In our example, we choose a predefined P8 4.x connector connecting to a P8 content object store.
Folder path: In the Folder Path text box, provide the complete path to the folder in P8. In our example, because the Content Collector task route is filing the document into the Most Relevant Category folder in P8 based on the analysis by Classification Module, the folder path is set with the metadata type of IBM Classification Module and the property of Most Relevant Category. Select to create the folder if it does not exist.
Figure 6-24 on page 323 shows that the P8 4.x File Document in Folder task is configured to file documents into the most relevant folder in P8 suggested by Classification Module, and it also instructs the Content Collector to create the folder in P8 if it does not exist.
12. The P8 4.x Declare Record task allows you to declare a record in IBM FileNet Records Manager.
Note: If it is not already installed, you must install and configure IBM FileNet Records Manager 4.x to support the use of the P8 4.x Declare Record task in a Content Collector task route.
Perform the following steps to add and configure the P8 4.x Declare Record task in your task route:
a. From the Toolbox pane, under FileNet P8 4.x Repository, select the P8 4.x Declare Record task.
b. Click anywhere in the Designer pane between the P8 4.x File Document in Folder task and the End node to drop the P8 4.x Declare Record task onto the task route.
c. You can configure the P8 4.x Declare Record task with the settings that are described in Table 6-5.
Table 6-5 Settings of the P8 4.x Declare Record task
General: Enter a name and a description for the task.
P8 Connection: From the Connections drop-down list, select the file plan object store in the repository in which to declare the record.
Property Mappings: Record Class: Click the Browse button to select a record class to use when declaring the record. Classification: To add a classification to use when declaring a record, in the Classification section, click Add; to remove a classification, click Remove. The classification can be a static value, or it can be dynamically assigned (that is, based on the name of the folder in which a record is located).
Figure 6-25 on page 325 shows that a P8 4.x Declare Record task is configured to declare a record for the document under the management of IBM FileNet Records Manager.
13.Use the FSC Post Processing task to define what happens to a file on the file system after it has been processed. The settings that you define apply only to the document on the file system, not to the document that is added to the IBM FileNet P8 repository. As its name suggests, you need to place a Post Processing task at or near the end of a task route. Perform the following steps to add and configure the FSC Post Processing task in your task route: a. From the Toolbox pane, under File System Source, select the FSC Post Processing task. b. Click anywhere in the Designer pane between the P8 4.x Declare Record task and the End node to drop the FSC Post Processing task onto the task route. c. You can configure the FSC Post Processing task with the settings that are described in Table 6-6 on page 326.
Table 6-6 Settings of the FSC Post Processing task
General: Enter the name and description for the task.
Post Processing Options: You can either delete the file from the file system after capture, or you can retain the file on the system. In our example, we choose to retain the file on the file system by selecting Do not delete file and Move file to a folder.
Now, you have configured the task route with the required tasks for processing the documents whose most relevant score is greater than 90%, as shown in Figure 6-26.
14. You now need to create a second rule to handle the documents whose most relevant scores are not greater than 90%. Perform the following steps to add and configure the second rule in your task route:
a. From the Toolbox pane, select Link.
b. Click anywhere in the Designer pane. A new rule arrow appears in the Designer, but neither end of the rule arrow is connected to the task route.
c. Using your mouse, select and drag the tail of the new rule to join the decision point.
d. Perform the same steps as described in Step 10 to configure the second rule, Not Greater than 0.9, as shown in Figure 6-27.
Figure 6-27 Configuring the rule of the right branch of the task route
15. Just as in Step 11, a P8 4.x File Document in Folder task is required here to add the document to a folder in the repository so that you can browse for it at a later point. Perform the following steps to add and configure the P8 4.x File Document In Folder task:
a. From the Toolbox pane, under FileNet P8 4.x Repository, select the P8 4.x File Document In Folder task.
b. Click anywhere in the Designer pane between the Decision Point and the FSC Post Processing task to drop the P8 4.x File Document in Folder task onto the task route.
c. Configure the P8 4.x File Document in Folder task with the same general and P8 connection settings as described in Step 11. The only difference is that the documents whose most relevant score is not greater than 90% are configured to be placed in a designated review folder in the P8 repository for later review, as shown in Figure 6-28.
16.Use a link to join the P8 4.x File Document in Folder task and the FSC Post Processing task to complete the task route creation: a. From the Toolbox pane, select Link. b. Click anywhere in the Designer pane. A new rule arrow appears in the Designer, but neither end of the rule arrow is connected to the task route. c. Using your mouse, select and drag the tail of the new rule to join the P8 4.x File Document in Folder task. d. Using your mouse, select and drag the head of the rule to join the FSC Post Processing task. Figure 6-29 shows the complete task route that was created in our legal discovery and compliance use case.
Figure 6-31 Activating the file system collector for your task route
Figure 6-33 The HR LegalCase folder before the Content Collector file system archiving operation
2. To capture documents: a. Open the file system folder monitored by the File System Collector. In our example, open the C:\HRMonitored folder. b. Drag and drop a set of file share files on the monitored folder. Figure 6-34 shows the example of ten files that were added to the C:\HRMonitored folder.
3. Check the monitored file system folder to observe the changes made to the files as they get captured into the ECM repository. In our example, eight out of ten files under the C:\HRMonitored folder are moved to the C:\HRMonitored\Processed file system folder as they get captured. Two files are not captured and left untouched in the C:\HRMonitored folder, because their relevancy score is less than 30% per the Classification Module analysis, as shown in Figure 6-35.
4. In the IBM FileNet P8 Workplace, browse to your P8 content object store and view the documents in the target P8 folders. In our example, browse to the CE_OS P8 object store. The HR LegalCase folder contains the following newly created subfolders, as shown in Figure 6-36 on page 335:
- 401k
- EAP
- Health Care
- Leaves of Absence
- Pay
- PTO
- Stock Options
These P8 folders are created by the system and align with the Most Relevant Category suggested by the Classification Module server through the Classification Module task processing.
Figure 6-36 HR LegalCase folder after Content Collector file system archiving operation
5. You can further review the documents auto-classified into each of the sub-folders. In our example, according to the rules defined in the Content Collector task route, these documents have a Most Relevant Score that is greater than 90%. 6. The Content Collector task route in our example is configured so that the documents with a Most Relevant Score no greater than 90% but greater than 30% are captured into a review folder. You can browse to the review folder and check the documents inside it. For example, Figure 6-37 on page 336 shows the documents that are created in the HRReview folder, waiting for review. Note: When working with the P8 ECM repository, you can use the Classification Center to review the documents in the review folder and take the necessary actions to tune and improve the accuracy of your knowledge base. We describe this process in Chapter 5, Integration with IBM FileNet P8 on page 177.
7. In addition to archiving the file share files into a meaningful folder path for easy discovery, you can use the Most Relevant Score that is assigned to the Most Relevant Category as a threshold to mark documents as records. In our example, documents with a Most Relevant Score greater than 90% are declared as records. In IBM FileNet Records Manager, you can browse to the destination file plan path to review the records declared for those seven documents whose Most Relevant Score is greater than 90%, as shown in Figure 6-38 on page 337.
Chapter 7.
In addition, IBM Content Collector for E-mail can use the content classification capabilities of Classification Module to:
- Archive e-mails into predefined folders. For faster discovery of information from archived content, it is important to organize the contents into a known set of predefined folders.
- Identify mission-critical e-mails and declare them as records. To address any third-party litigation or compliance initiatives, it is necessary to identify digital content that might address issues related to enterprise policies or business missions.
- Populate metadata of the archived instance with category names, which can be configured to support parametric searches and the faceted display of search results.
[Figure: E-mail integration overview. Content Collector (ICC) archives e-mails from the e-mail server into IBM FileNet P8 or IBM Content Manager (CM8) after consulting Classification Module (ICM) for classification. Classification Module uses rule/keyword-based methods (decision plans) and statistical methods (knowledge bases) to determine e-mail classification and declare records. Optionally, Classification Center 8.6 provides automatic classification, record declaration, and manual review services on documents in P8.]
Within this integration, Classification Module uses a knowledge base to analyze documents to discover possible categories and their relevancy scores. You can use the top relevancy score and the top category name in IBM Content Collector (Content Collector) task routes for any of the following typical scenarios:
- Assign the folder path for the archived e-mails in an IBM FileNet P8 or IBM Content Manager (CM8) repository.
- Identify e-mails that need to be declared as records.
- Identify e-mails that must be reviewed in the Classification Center to enhance the knowledge base accuracy (applicable to the IBM FileNet P8 repository type only, when the Classification Module and IBM FileNet P8 integration asset is installed and configured).
- Configure rules based on corporate records management policies that define which e-mail messages to declare as records.
- Define a set of rules that filters out irrelevant e-mail messages before they are archived.
- Train the system using a small set of user mailboxes to serve as a set of representative e-mails.
- Build the archive from existing e-mail messages over the past year, resulting in an initial archive.
- Configure the system to actively add to the e-mail archive daily after performing the initial archive.
- Understand how and why Classification Module is filtering various e-mail messages into various categories, and then tune the filtering mechanism as needed.
- Review e-mails that were not auto-classified or those e-mails that were sent to review for auditing purposes.
If the command returns an error message saying, Failed to create service: ibm.ctms.utilityconnector.UtilityConnector -- The specified service already exists, you most likely registered the .dll files in the wrong location. To correct the operation, perform the following steps:
a. Unregister the Classification Module .dll files from the location where you issued the initial registration command:
utilityconnector.exe -u
b. Ensure that you change to the ctms directory, and register the .dll files again using the following command:
utilityconnector.exe -r
To configure this support: 1. Install Microsoft Office Outlook 2003 or Microsoft Office Outlook 2007 on the server that hosts Classification Module. 2. Select Microsoft Office Outlook as the default e-mail application in your Web browser (see Figure 7-3).
2. Stop the Classification Module services:
a. Launch Windows services.
b. Stop the Classification Module Process Manager service.
c. Stop the Classification Module Trace Service service.
3. If e-mails are being archived, overwrite the default document filter:
a. Open a DOS command window.
b. Change to the C:\IBM\ClassificationModule\Filters directory and enter these commands:
copy docFilterManager.xml docFilterManager.xml.orig
copy docFilterManager.E-mail.xml docFilterManager.xml
4. Start the Classification Module services:
a. Launch Windows services.
b. Start the Classification Module Process Manager service.
c. Start the Classification Module Trace Service service.
Figure 7-4 Validating that the Windows registry entry of ibm.ctms.utilityconnector.ICMClassificationTask exists
System metadata
The integration introduces five Classification Module system metadata properties, as described in Table 7-1.
Table 7-1 Classification Module system metadata
All Relevant Categories: List of top categories matched
All Relevant Categories and Scores: Combined list of categories and scores
All Relevant Scores: List of top scores
Most Relevant Category: Winning category
Most Relevant Score: Winning category score
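To make the shape of these properties concrete, the following illustrative sketch (Python) shows what the five properties might contain for a single classified e-mail. All values are hypothetical; the category names follow the HR example used in this chapter.

# Hypothetical Classification Module system metadata for one classified e-mail
classification_metadata = {
    "Most Relevant Category": "Pay",
    "Most Relevant Score": 0.94,
    "All Relevant Categories": ["Pay", "401k", "PTO"],
    "All Relevant Scores": [0.94, 0.41, 0.33],
    "All Relevant Categories and Scores": [("Pay", 0.94), ("401k", 0.41), ("PTO", 0.33)],
}
print(classification_metadata["Most Relevant Category"])  # prints "Pay"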
The Classification Module task produces the metadata values during the run time. In turn, you can use them in the task route to determine how to process documents.
Use the following steps to ensure that the Classification Module system metadata properties are available in the IBM Content Collector Configuration Manager:
1. Launch the IBM Content Collector Configuration Manager by clicking Start → Programs → IBM Content Collector Configuration Manager.
2. Go to the Metadata and Lists box by clicking Metadata and Lists in the explorer pane of the Configuration Manager.
3. From the Metadata and Lists box on the left pane, select the System Metadata type, and then select the IBM Classification Manager system metadata in the middle pane. Ensure that five Classification Module system metadata properties are displayed on the right pane, as shown in Figure 7-6.
If your documents are not categorized, but you know how they need to be categorized, follow these steps:
a. Import documents into Classification Workbench.
b. Assign categories to documents by using Classification Workbench and build a knowledge base.
c. Create folders and document classes in an IBM FileNet P8 object store that correspond to categories in the knowledge base.
If your documents are not categorized and you do not know how they need to be categorized, follow these steps:
a. Import documents into the Taxonomy Proposer and discover categories.
b. Import categorized content items into Classification Workbench.
c. Build a knowledge base that is based on the discovered categories.
d. Create folders and document classes in an IBM FileNet P8 object store that correspond to categories in the knowledge base.
If none of the previous situations apply, use a keyword list and follow these steps:
a. Create a category structure with a starter list of keywords representative of these categories.
b. Use this list of keywords to create and analyze the knowledge base.
For our use case, we use the HR knowledge base that we used to demonstrate creating, training, and tuning a knowledge base in Chapter 3, Working with knowledge bases and decision plans on page 59. Figure 7-7 on page 354 shows the HR knowledge base that we use in this use case.
2. Start Classification Center. Make sure that there are no connection issues from the Classification Module server to the IBM FileNet P8 systems. To verify, log on to the Classification Center using IBM FileNet P8 credentials to see the Configuration tab. See Figure 7-9 on page 356.
2. Start the Configuration Manager by clicking Start → Programs → IBM Content Collector Configuration Manager, as shown in Figure 7-11 on page 358.
3. Click Close, and you see the Configuration Manager application user interface, as shown in Figure 7-14 on page 360. The highlighted sections are the major components that make up the Content Collector interface. Starting from top to bottom, going from left to right, these areas are highlighted:
- Task Routes Pane
- Toolbox Pane
- Navigation Pane
- Designer Pane
- Configuration or Task Route Pane
2. If you have IBM FileNet Records Manager installed on an IBM FileNet P8 machine and you plan on declaring records using IBM Content Collector and Classification Module task route nodes, verify that you can connect to IBM FileNet Records Manager. You can verify the connection by opening a browser and going to the following URL, as shown in Figure 7-16 on page 362:
http://hostname:port/RecordsManager
Data Stores
To configure Data Stores, click Data Stores on the Navigation Pane and select the appropriate configured database from the explorer pane on the left bar. Table 7-2 describes the settings of the Data Store.
Table 7-2 Settings under the General tab of the Data Store
General: Enter your Data Store name and its description. Also, ensure that Make this my active data store is selected.
Database Properties: Enter your database alias name, login name, and login password to access the database. Click Validate Database to check the database connection. Click either Export or Import to export or import database properties.
In our use case example, we use the Data Store settings, as shown in Figure 7-17.
Connectors
Click the Connectors icon under the Navigation pane to configure connectors. There are several types of connectors that you must configure for the IBM Content Collector task routes to work as required:
- IBM FileNet P8 Connectors
- E-mail Server Connectors
- Metadata and Lists
- E-mail Services
P8 Login Information
In our example, we use three IBM FileNet P8 Connections for three object stores. See Figure 7-18 on page 365.
Table 7-4 Settings under the General tab of the E-mail Server Connection
General: Enter your E-mail Server name and its description. Also, select the Mail System using the drop-down list.
Log Settings: Select the appropriate Log Level using the drop-down list. Select the Log file Location using the Browse button. Truncate the log files and enter the number and size of the log files allowed. Choose the appropriate Logging Type, either Common base event or Plain text. Specify the location of the Working Directory by using the Browse button.
Figure 7-19 on page 367 shows the setting in the General tab for the E-mail Server Connector in our example.
Table 7-5 describes the settings of the E-mail Server Connector Connection tab.
Table 7-5 Settings under the Connection tab of the E-mail Server Connection
Connection Parameters: Enter the connection parameters for the e-mail server. For example, in the case of Exchange, enter the Exchange server host name and the User ID for the Exchange server.
In our example, we use the Connection tab settings for the E-mail Server Connector, as shown in Figure 7-20 on page 368.
Table 7-6 describes the settings of the E-mail Server Connector Active Directory tab.
Table 7-6 Settings under the Active Directory tab of the E-mail Server Connection
Credentials: Enter your User ID and Password to access the Active Directory.
Location: Choose the Location of the Active Directory as either Domain default or User defined.
In our example, we use the Active Directory tab settings for E-mail Server Connector, as shown in Figure 7-21 on page 369.
E-mail services
You must configure a required set of e-mail services as a pre-task-route configuration step. The required set of e-mail services depends on the e-mail platform in use, such as Microsoft Outlook or Lotus Notes. With either platform, you need to determine which e-mail components must be captured and then install the corresponding client components as e-mail services on the IBM Content Collector server. For Microsoft Outlook, these components are the Outlook client with a connection to the Exchange Server, a connection to a PST file, and a browser connection to the Outlook Web Mail Server. In our example, we click the E-mail Services tab in the Navigation pane and configure the following e-mail services:
- Client Configuration
- Configuration Web Service
- Information Center
- Web Application Client
Client Configuration
To configure Client Configuration, click the E-mail Services tab from the Navigation pane, and then, select Client Configuration from the explorer pane on the left bar. Table 7-7 describes the settings for the Client Configuration for E-mail Service.
Table 7-7 Settings for Client Configuration for E-mail Services
General: Enter the Name and Description of the Client Configuration.
Client Definition: Specify the Trigger mailbox, that is, the mailbox that is being monitored.
In our example, we use the Client Configuration settings for E-mail Services that are shown in Figure 7-23.
Table 7-8 Settings for Configuration Web Service for E-mail Services
General: Enter the Name and Description of the Configuration Web Service.
Configuration Web Service Definition: Enter the Configuration Web Service definition, such as Host name and Port number. You can use Validate to check the Configuration Web Service. Use the check box to let IBM Content Collector use the embedded Web application server. Specify the Java Database Connectivity (JDBC) driver directory by using Browse. Specify the JDBC Port number and the Database Server.
In our example, we use the Configuration Web Service settings for E-mail services that are shown in Figure 7-24.
Information Center
To configure Information Center, click the E-mail Services tab from the Navigation pane, and then, select Information Center from the explorer pane on the left bar. Table 7-9 describes the settings for the Information Center for E-mail Services.
Table 7-9 Settings for Information Center for E-mail Services
General: Enter the Name and Description of the Information Center.
Information Definition: Enter the Information Definition Host name and Password.
In our example, we use the Information Center settings for E-mail Services that are shown in Figure 7-25.
Table 7-10 Settings for Web Application Client for E-mail Services
General: Enter the Name and Description of the Web Application Client.
Web Application Definition: Enter the Host name and Port number for the Web Application Client.
Repository Connection: Select between the two Repository Connections for connection to the ECM system.
In our example, we use the Web Application Client settings for E-mail services that are shown in Figure 7-26.
In this use case, e-mails are archived into the IBM FileNet P8 content repository. We use the HR knowledge base to classify HR-related e-mails into the appropriate categories. These steps summarize building our task route for automatic classification:
1. Create an empty task route.
2. Create the EC Collect E-mails by Rules task node.
3. Create the EC Extract Metadata task node.
4. Create the EC Prepare E-mail for Compliance task node.
5. Create the P8 4.x Create Document task node.
6. Create the Classification Module task node.
7. Create a Decision Point and Rules.
8. Create the P8 File Document to Folder task nodes.
9. Create the EC Prepare E-Mail for Stubbing task node.
10. Create the EC Create E-Mail Stub task node.
We discuss each of these steps in detail.
You can either start by creating a new Blank task route or, if your task route has a workflow similar to an existing task route template, you can create your task route from an Existing template. In this example, we select Blank task route and click Create, as shown in Figure 7-28.
Figure 7-28 Creating new task route from a blank task route
A new blank task route with Start and End nodes appears in the Designer pane, as shown in Figure 7-29.
Figure 7-29 Empty task route showing start and end task nodes
3. Click the green start task node, as highlighted in Figure 7-29, to configure the general properties for the task route. Configure the name and description of the task route. We configured the task route, as shown in Figure 7-30 on page 377.
2. Click anywhere in the Task Route Designer pane to drop the new EC Collect E-Mail By Rules task onto the task route, as shown in Figure 7-32 on page 379.
3. After adding the EC Collect E-Mail By Rules, you can configure it:
- Name and enable the collector under the General tab.
- Specify when to collect files under the Schedule tab.
- Specify from where to collect files under the Collection Sources tab.
- Specify files that must not be collected under the Filter tab.
Table 7-11 describes the settings of the EC Collect E-Mail By Rules under each tab for our example.
Table 7-11 Settings of the EC Collect E-mail By Rules task node
General: Enter your task route name and its description. Also, ensure that this collector is Active.
Schedule: The collector can run at various frequencies, such as daily, weekly, and monthly. In this example, we set the collector to run Always.
Collection Source: Define a mailbox to be monitored by the collector. Under Collection Sources, click Add to add the mailbox to be monitored. In this example, we select Mailbox as the Source Type and we provide the Mailbox Simple Mail Transfer Protocol (SMTP) address.
Filter: Set the filter to constrain messages, exclude or include monitored folders, and set excluded message types. In this example, we select the defaults for the filter options.
Figure 7-33 shows the Schedule settings for EC Collect E-mail By Rules task node for our example.
Figure 7-34 shows the Collection Sources settings for EC Collect E-mail By Rules task node for our example.
Click Add to add Collection Sources, as shown in Figure 7-35 on page 382.
Figure 7-36 on page 383 shows the Filter settings for EC Collect E-mail By Rules task node for our example.
At this point, we have completed the creation of the E-mail collector task node. In the next section, we create the next task node for our task route.
Perform the following steps to add and configure an EC Extract Metadata task node in your task route: 1. From the Toolbox pane, under E-mail Server, select EC Extract Metadata, as shown in Figure 7-37.
2. Click anywhere in the Designer pane (white space) to drop the new task onto the task route. After you have dropped the task onto the task route, you can drag it over the arrow connecting the Start and End nodes until the task and arrow are highlighted and the task icon has an exclamation point (!) in a yellow triangle. You can click the newly created task node to configure the task node properties. 3. After adding the EC Extract Metadata, you can configure it by entering the Name, Description, and Associate Metadata under the General tab. Figure 7-38 shows the EC Extract Metadata task node settings for our example.
At this point, we have completed the creation of Extract Metadata task node. In the next section, we create the next task node for our task route.
2. Click anywhere in the Designer pane (white space) to drop the new task onto the task route. After you have dropped the task onto the task route, you can drag it over the arrow connecting the EC Extract Metadata and End nodes until the task and arrow are highlighted and the task icon has an exclamation point (!) in a yellow triangle. You can click the newly created task node to configure the task node properties.
3. After adding the EC Prepare E-mail for Archive, you can configure it by entering the Name and Description under the General tab. Figure 7-40 shows the EC Prepare E-mail for Archive task node settings for our example.
At this point, we have completed the creation of the Prepare E-mail for Archive task node. In the next section, we create the next task node for our task route.
2. Click anywhere in the Designer pane (white space) to drop the new task onto the task route. After you have dropped the task onto the task route, you can drag it over the arrow connecting the EC Prepare E-Mail for Archive and End nodes until the task and arrow are highlighted and the task icon has an exclamation point (!) in a yellow triangle. You can click the newly created task node to configure the task node properties.
3. After adding the EC Finalize E-mail for Compliance, you can configure it by entering the Name and Description under the General tab. Figure 7-42 shows the EC Finalize E-mail for Compliance task node settings for our example.
At this point, we have completed the creation of the Finalize E-mail for Compliance task node. In the next section, we create the next task node for our task route.
2. Click anywhere in the Designer pane (white space) to drop the new task onto the task route. After you have dropped the task onto the task route, you can drag it over the arrow connecting the EC Finalize E-mail for Compliance and End nodes until the task and arrow are highlighted and the task icon has an exclamation point (!) in a yellow triangle. You can click the newly created task node to configure the task node properties.
3. After adding the P8 4.x Create Document task node, you can configure it with the settings that are shown in Table 7-12.
Table 7-12 Settings for the P8 4.x Create Document task node
General: Enter a name and a description for the task.
P8 Connection: From the Connection drop-down list, select the connection to use when creating a document. In this example, we choose a predefined P8 4.x connector connecting to a P8 content object store.
Check In Options: Set Check In options for your document, such as version and content capture options. These options are required by FileNet P8. For more information about each option, refer to the IBM FileNet Enterprise Manager documentation. In this example, we accept the default settings.
Property Mappings: Select the document class that you want to use when creating the document, and enter values for the document class properties. In this example, we select the document class called document, and map the Document Title property with the E-mail Subject metadata value.
Figure 7-44 on page 392 shows the P8 4.x Create Document task node settings for our example.
Note: The mapping of the document title to the E-mail subject is highlighted in Figure 7-45 on page 394.
At this point, we have completed the P8 4.x Create Document task node. In the next section, we create the next task node for our task route.
2. Click anywhere in the Designer pane (white space) to drop the new task onto the task route. After you have dropped the task onto the task route, you can drag it over the arrow connecting the P8 4.x Create Document and End nodes until the task and arrow are highlighted and the task icon has an exclamation point (!) in a yellow triangle. You can click the newly created task node to configure the task node properties.
3. Configure the Classification Module task, as shown in Table 7-13 on page 395.
Table 7-13 Settings for the Classification Module task node
General: Enter a name and a description for the task.
Server: Specify the Classification Module server address and the listener port. The default port is 18087.
Classification: Select the knowledge base that you want to use for analyzing the files collected from the monitored file share. Also, select the content field that identifies the part of a document that you want Classification Module to analyze. To process the content of the document, this field must be of type Document.
Result Set: Specify the relevance threshold. In our example, we set the threshold to 0.3, which means that Classification Module only returns suggested categories with relevancy scores greater than 0.3. Therefore, if the relevancy score of the top category is not greater than 30%, this document is not captured by IBM Content Collector. Also, specify the maximum number of categories suggested by Classification Module to be used by IBM Content Collector for processing in a task route.
Figure 7-46 on page 396 shows the settings for the Classification Module task node.
At this point, we have completed the creation of the Classification Module task node. In the next section, we create the next task node for our task route.
You need to create a decision point and two rules branching from the Classification Module task to deal with documents based on the values of the Most Relevant Score and Most Relevant Category system metadata properties. A decision point is a node where you can define multiple rules based on certain conditions. In our example, we introduce two rules based on the relevancy score of each e-mail after it is examined and returned by the Classification Module task node. Follow these steps to add a new decision point to your task route:
1. From the Toolbox pane, select Decision Point.
2. Click anywhere in the Designer pane between the Classification Module task and the End node to drop the new decision point onto the task route. The first rule link is highlighted by default, as shown in Figure 7-47 on page 398.
3. Configure the arrow connecting the Decision Point and the End node as a rule by taking the following steps:
a. Click the rule arrow to display the rule Configuration Pane.
b. Enter the name for the rule. Because this branch of the task route only processes documents with a Most Relevant Score value (of its Most Relevant Category) that is not greater than 70%, we name this rule Score < 70%.
c. For the evaluation criteria, select Configure rule, and click Add. d. In the Edit Conditional Clause window, configure the rule with the following fields, as shown in Figure 7-48: For the Metadata type, select IBM Classification Module. For the Property, select Most Relevant Score. For the Operator, select Not Greater than.
Figure 7-49 on page 400 shows the settings for the Score < 70 decision rule.
4. We now configure the other Decision Rule by taking the following steps: a. Highlight the Decision Point, right-click the Decision Point, and choose Add Rule, as shown in Figure 7-50 on page 401.
The new decision rule is now displayed in the task route, as shown in Figure 7-51 on page 402.
b. Enter a name for the rule. Because this branch of the task route processes documents with a Most Relevant Score value (of its Most Relevant Category) that is greater than 70%, we name this rule Score > 70%. c. For the evaluation criteria, select the Configure rule radio button, and click Add.
d. In the Edit Conditional Clause window, configure the rule with the following fields, as shown in Figure 7-52: For the Metadata type, select IBM Classification Module. For the Property, select Most Relevant Score. For the Operator, select Greater than.
At this point, we have completed the creation of a decision point and two rules. In the next section, we create the next task nodes for our task route.
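Together, the two rules express the following policy for each classified e-mail. The sketch below is illustrative only: it is Python pseudocode rather than the product API, and the function and folder names are hypothetical stand-ins for the task nodes configured in the following steps.

# Illustrative sketch of the e-mail decision point: "Score > 70%" vs. "Score < 70%"
EMAIL_AUTO_FILE_THRESHOLD = 0.7

def route_email(most_relevant_category, most_relevant_score):
    if most_relevant_score > EMAIL_AUTO_FILE_THRESHOLD:
        # Score > 70%: file into the Most Relevant Category folder, then stub the e-mail
        return f"/Redbook/HR/{most_relevant_category}"
    else:
        # Score not greater than 70%: file into a fixed folder for manual review
        return "/Redbook/HR/Manual Review"

print(route_email("Pay", 0.92))   # hypothetical values; prints "/Redbook/HR/Pay"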
Figure 7-53 Adding Create P8 4.x File Document in Folder task node
2. Click anywhere in the Designer pane (white space) to drop the new task onto the task route. After you have dropped the task onto the task route, you can drag it over the arrow connecting the Decision Point and the end of Rule link (Score >= 70%) until the task and arrow are highlighted and the task icon has an exclamation point (!) in a yellow triangle. Click the newly created task node to configure the task node properties.
3. Configure the P8 4.x File Document in Folder task with the settings that are described in Table 7-14.
Table 7-14 Settings for the P8 4.x File Document in Folder task node
General: Enter a name and a description for the task.
P8 Connection: From the Connections drop-down list, select the connection to be used when filing a document. In this example, we choose a predefined P8 4.x connector connecting to a P8 content object store.
Folder path: In the Folder Path text box, provide the complete path to the folder in FileNet P8. In this example, because the IBM Content Collector task route is filing the document into the Most Relevant Category folder in FileNet P8 based on the analysis by Classification Module, the folder path is set with the metadata type of Classification Module and the property of Most Relevant Category. Select to create the folder if it does not exist.
Figure 7-54 on page 406 shows the settings for the P8 4.x File Document in Folder task node for Score >= 70% rule.
4. To configure the other P8 4.x File Document in Folder task node, from the Toolbox pane, under FileNet P8 4.x Repository, select Create P8 4.x File Document to Folder, as shown in Figure 7-55 on page 407.
5. Click anywhere in the Designer pane (white space) to drop the new task onto the task route. After you have dropped the task onto the task route, you can drag it over the arrow connecting the Decision Point and End nodes for the Rule link (Score < 70%) until the task and arrow are highlighted and the task icon has an exclamation point (!) in a yellow triangle. Click the newly created task node to configure the task node properties.
6. Configure the P8 4.x File Document in Folder task with the settings that are described in Table 7-15.
Table 7-15 Settings for the P8 4.x File Document in Folder task node
General: Enter a name and a description for the task.
P8 Connection: From the Connections drop-down list, select the connection to be used when filing a document. In this example, we choose a predefined P8 4.x connector connecting to a P8 content object store.
Folder path: In the Folder Path text box, provide the complete path to the folder in FileNet P8. In this example, because this branch handles e-mails whose Most Relevant Score is not greater than 70%, the folder path is set to a fixed location for Manual Review. Select to create the folder if it does not exist.
Figure 7-56 on page 409 shows the settings for P8 4.x File Document in Folder task node for Score < 70% rule.
Figure 7-56 FileNet P8 4.x File Document in Folder: General tab settings
At this point, we have completed the creation of two P8 4.x File Document in Folder task nodes. In the next section, we create the next task node for our task route.
2. Click anywhere in the Designer pane (white space) to drop the new task onto the task route. After you have dropped the task onto the task route, you can drag it over the arrow connecting the P8 4.x File Document in Folder and End nodes until the task and arrow are highlighted and the task icon has an exclamation point (!) in a yellow triangle. Click the newly created task node to configure the task node properties. 3. Add the EC Prepare E-Mail for Stubbing task node. Enter a name and a description under the General tab. We set the EC Prepare E-Mail for Stubbing task node settings, as shown in Figure 7-58 on page 412, for our example.
4. We now need to add a link from the P8 4.x File Document in Folder task node on the Score >= 70% rule branch to the newly created EC Prepare E-Mail for Stubbing task node. To create a new link, click the link icon in the Toolbox pane. You will see a new link (blue color), as shown in Figure 7-59 on page 413.
Drag and stretch the blue link to connect the P8 4.x File Document in Folder task node to EC Prepare E-Mail for Stubbing task node, as shown in Figure 7-60 on page 414.
Figure 7-60 New link connecting P8 4.x File Document in Folder task node to EC Prepare E-Mail for Stubbing task node
At this point, we have completed the creation of the EC Prepare E-Mail for Stubbing task node. In the next section, we create the next task node for our task route.
2. Click anywhere in the Designer pane (white space) to drop the new task onto the task route. After you have dropped the task onto the task route, you can drag it over the arrow connecting the EC Prepare E-Mail for Stubbing and End nodes until the task and arrow are highlighted and the task icon has an exclamation point (!) in a yellow triangle. Click the newly created task node to configure the task node properties.
3. Add the EC Create E-Mail Stub task node. Enter a name and a description under the General tab. We set the EC Create E-Mail Stub task node settings, as shown in Figure 7-62 on page 417, in our example.
At this point, we have successfully completed the entire task route for our use case scenario. In the next section, we describe the procedure to activate this task route to automatically classify e-mails in the required folder.
Click the highlighted audit node in Figure 7-65 to set the correct audit level, as shown in Figure 7-66 on page 422.
4. The IBM Content Collector task route is configured so that, after the e-mail is classified by Classification Module and archived into the IBM FileNet P8 repository under the appropriate folder, a stub is left behind in the Microsoft Outlook e-mail client. In this example, verify that the Outlook client e-mail body is now replaced by a Link to archived E-mail body stub.
5. Similarly, you can verify the classification by sending an e-mail with Pay as the subject and confirming that the e-mail is classified and archived into the Redbook/HR/Pay category. In this example, we already sent an e-mail with Pay Scale as the subject, and the e-mail was classified and archived into the correct IBM FileNet P8 folder, that is, Object Stores/CE_OS/Redbook/HR/Pay, as shown in Figure 7-68 on page 425.
4. The IBM Content Collector task route is configured so that, after the e-mail is classified by Classification Module and archived into the IBM FileNet P8 repository under the appropriate folder, a stub is left behind in the Outlook e-mail client. Verify that the Outlook client e-mail body is now replaced by a Link to archived E-mail body stub.
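After archiving, the stubbing tasks replace the message body in the mailbox with a small placeholder that points back to the archived copy in the FileNet P8 repository, which is what the Link to archived E-mail body text represents. The following Java sketch only illustrates that idea; the link format, server name, and identifier are invented for this example and do not reflect the actual stub content that IBM Content Collector writes.

public class EmailStubSketch {

    // Hypothetical builder for the placeholder body that remains in the
    // Outlook client after the original e-mail body has been archived.
    static String buildStubBody(String objectStore, String archivedDocumentId) {
        return "This message has been archived.\n"
                + "Link to archived E-mail body:\n"
                + "http://icc-server.example.com/retrieve"  // assumed retrieval URL
                + "?objectStore=" + objectStore
                + "&id=" + archivedDocumentId;
    }

    public static void main(String[] args) {
        // "CE_OS" is the object store used in this example; the ID is made up.
        System.out.println(buildStubBody("CE_OS", "DOC-0001"));
    }
}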
with automatic classification on page 352. The following steps provide an overview of implementing this use case:
1. Create the knowledge base.
2. Verify the integration settings.
3. Create an e-mail task route.
4. Activate the task route.
5. Verify the use case end results.
These are exactly the same steps that we described in 7.4, Use case 1: E-mail archiving with automatic classification on page 352; however, we need to modify the task route to add the P8 4.x Declare Record task node. In this section, we describe how to modify the task route that you created earlier in order to add the records declaration capability, and we then verify the use case end results in the Outlook client and in the IBM FileNet P8 records management folder.
7.5.1 Modify the existing task route to add the P8 4.x Declare Record task node
This task allows you to configure the required properties to declare a record in IBM FileNet Records Manager. Follow these steps to add and configure a P8 4.x Declare Record task node in your task route:
1. From the Toolbox pane, under FileNet P8 4.x Repository, select P8 4.x Declare Record, as shown in Figure 7-70 on page 428.
2. Click anywhere in the Designer pane (white space) to drop the new task onto the task route. After you have dropped the task onto the task route, you can drag it over the arrow connecting the P8 4.x File Document in Folder and EC Prepare E-Mail for Stubbing task nodes on the Score >= 70% Rule link, until the task and arrow are highlighted and the task icon has an exclamation point (!) in a yellow triangle. Click the newly created task node to configure the task node properties.
3. Add the P8 4.x Declare Record task node, and configure it with the settings that are shown in Table 7-16.
Table 7-16   Settings for the P8 4.x Declare Record task node

General:
Enter a name and a description for the task.

P8 Connection:
From the Connection drop-down list, select the connection to use when creating a document. In this example, we choose a predefined P8 4.x connector connecting to a FileNet P8 content object store.

Property Mappings:
Select the record class that you want to use when creating the record, and enter the classification path for the record declaration. In this example, we select the record class called ElectronicRecord, use the classification path /RecordsManagement/File Plan/Email, and map the Document Title, From, and Sent On properties to the Email Subject, From, and Sent On metadata values.
4. Select the Classification Path by clicking Add under the Property Mappings group. A P8 Classification dialog appears. In this example, we select the P8 Classification path, as shown in Figure 7-71.
Figure 7-72 on page 430 shows the P8 4.x Declare Record task node settings for our example.
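The Property Mappings section of the task node is essentially a list of pairs that copy e-mail metadata onto the declared record, together with the record class and classification path from Table 7-16. The minimal Java sketch below restates that configuration as plain data so the mapping is easy to see at a glance; it is an illustration only, not an IBM Content Collector or IBM FileNet Records Manager API.

import java.util.LinkedHashMap;
import java.util.Map;

public class DeclareRecordMappingSketch {

    public static void main(String[] args) {
        // Record class and classification path used in this example (Table 7-16).
        String recordClass = "ElectronicRecord";
        String classificationPath = "/RecordsManagement/File Plan/Email";

        // Record property -> e-mail metadata value, as configured under Property Mappings.
        Map<String, String> propertyMappings = new LinkedHashMap<>();
        propertyMappings.put("Document Title", "Email Subject");
        propertyMappings.put("From", "From");
        propertyMappings.put("Sent On", "Sent On");

        System.out.println("Declare a record of class " + recordClass
                + " under " + classificationPath
                + " with property mappings " + propertyMappings);
    }
}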
At this point, we have successfully completed the modification of the existing task route to add the P8 4.x Declare Record task node for our use case scenario. In the next section, we activate this task route and verify the declare records results.
Note: When working with the IBM FileNet P8 repository, you can use the Classification Center to review the documents in the review folder and take necessary actions to tune and improve the accuracy of your knowledge base.
Related publications
The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this book.
Online resources
These Web sites are also relevant as further information sources:

IBM InfoSphere Classification Module Version 8.7 Information Center
http://publib.boulder.ibm.com/infocenter/classify/v8r7/

IBM Classification Module Version 8.6 publication library
http://www.ibm.com/support/docview.wss?rs=3376&uid=swg27012760

IBM Classification Module Web site
http://www.ibm.com/software/data/enterprise-search/classification

IBM Enterprise Content Management Web site
http://www.ibm.com/software/data/content-management
Back cover
Introduces the concepts, architecture, tools, and integration
Describes building, training, and fine-tuning the knowledge base
Provides the steps to integrate with other products and solutions
IBM Classification Module Version 8.6 is an advanced enterprise software platform designed to allow organizations to automate the classification of unstructured content. By deploying the module in various areas of a business, organizations can reduce or avoid the manual processes associated with subjective decision making around unstructured content and streamline the ingestion of that content into their business systems in order to use the information within their organization. At the same time, they can safely remove irrelevant or obsolete information and therefore utilize their storage infrastructure more efficiently. By reducing the human element in this process, IBM Classification Module ensures accuracy and consistency and enables auditing, while simultaneously driving down labor costs.

This IBM Redbooks publication explains what IBM Classification Module does, the key concepts to understand, and how it integrates with other products and systems. With this book, we show you how IBM Classification Module can help your organization automate the classification of large volumes of unstructured content in a consistent and accurate manner. We also cover several of the major use cases for IBM Classification Module and show you how to implement each use case. This book is intended to educate both technical specialists and nontechnical personnel in how to make IBM Classification Module work for your organization.
BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE

IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers, and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.