Organisational Learning in Data Mining
This white paper is a summary of a chapter written by Jeff Zeanah and published in Organizational Data Mining (Idea Group Publishing, 2004).
INTRODUCTION
Organizations of all kinds are experimenting with the application of data mining techniques. They may refer to these projects as data warehousing applications, market research or data mining. Regardless of the terminology, data mining applications are intended to provide an organization with a better understanding of the environment or market in which it operates.
There are two general types of data mining undertakings. Relatively well understood are the traditional scoring applications, in which observations are scored to determine whether they meet certain criteria. In these projects, an organization typically will apply a set of tools to a large database such as a mailing list. For example, a charity considering a mail-out to solicit donations will score its list to determine the most likely candidates to be solicited. By using data mining to examine the characteristics of individuals who donated in the past, the charity can reduce a mailing list of two million households to a list of the 200,000 households most likely to donate. Soliciting this smaller list will be more profitable than wasting a mailing on the 1,800,000 households that are very unlikely to respond. In this effort there is not a great need to understand why the households were selected, only whether the refinement of the list leads to a higher response rate and increases the profitability of the mail-out.
Conversely, exploratory data mining is designed to provide strategic insights from the data and guidance for future strategic or operational decision-making. Consider the following example. A consumer products manufacturer is interested in the characteristics of the consumers who buy its products rather than its competitors' products. Are these customers younger or older? Are they married or single? What is their ethnicity? Are their household incomes higher or lower? Simple queries of the company's data warehouse can be used to answer most of these questions.
If the data is available, the average household income of the company's customers may easily be compared to the average household income of competitors' customers. This is exploratory data mining. And for many organizations, the ability to read the databases and perform these queries is the extent of their data mining activities. In fact, for very large databases this can be nontrivial, requiring substantial effort.
Simple queries may not provide all the answers a company needs. For example, the company discovers that its customers have higher household incomes than the competitors' customers have, are older and are more likely to be married. The organization realizes that many married families have higher incomes than do single households (two incomes versus one) and that many older households have higher incomes than do the younger, beginning households. The company wonders: Are the customers who prefer our product older and happen to be married at a higher percentage, or are our customers married and happen to be older? Or are they just higher-income people? Whatever the relationship, is this the same as it was five years ago? And most important, what will they be like five years from now? If the overall population ages, will that help the company's sales? Will it help the company's sales only if the aging population stays married? Is the company not seeing a hidden trend that may change the company's strategic direction?
These questions are not going to be answered through simple methods. The solution will be found only through causal predictive modeling and similar investigations. Many organizations lack the ability to answer these difficult questions. When they began collecting data in their data warehouse, they expected that they would be able to resolve some of these difficult strategic puzzles. Their exploratory findings are less than they had hoped. They can identify what has happened but have a more difficult time answering why it happened.
This paper discusses reasons for these shortcomings. These observations are based on projects reviewed by the author, supplemented by discussions with over one hundred data mining professionals. Based on these observations, four impediments to exploratory success have been identified. Conditions leading to the impediments are discussed and solutions presented. The four impediments are:
1. data quality
2. lack of secondary or supporting data
3. insufficient analysis manpower
4. lack of openness to new results
The projects that provided the foundations for these conclusions are proprietary efforts from private corporations and public sector organizations that are seeking to improve their understanding of their environments. Because of the proprietary nature of the projects, specific situations and data details are not presented but are discussed in general terms.
settlement of workers' compensation claims. When analyzing these claims to find patterns concerning the value of claims, the amount of the workers' compensation payment was found to be reliable. This is expected. If the organization did not keep track of payments, the company would not be in business long. Further work with the data revealed that the information on the recipients of the payments was reliable to a lesser degree. There was more uncertainty in some of these data fields. The name might be correct, but the address of the person might be missing or the zip code incorrect. Or perhaps the name was misspelled, making it impossible to match the record with other files. The individual identifier information is further away from the financial transaction and did not have a high degree of data quality.
Other information necessary for the completion of the study included the type of injury leading to the claim. This information was not actually necessary for the financial transaction. The company found this data to be of the lowest quality. Often the data was missing or apparently inaccurate. For example, it is difficult to explain the loss of six months of work for a minor sprained ankle. Other unrecorded mitigating factors had to be in play.
The second corrective action for data quality is to recognize that what gets rewarded gets done. In the above example the field agents had a data entry system to collect information and return that information to the corporate office. However, the field agents were compensated for the number of claims handled. The implementing and structuring of compensation packages is well beyond the scope of this discussion. However, the impact of this situation on data quality is intuitive. In most organizations the sales staff is compensated with sales commissions. There is an obvious incentive to complete the information on the volume of each sale. However, if there is an expectation that the sales staff is to collect demographic information and the collection of that information is not rewarded, then the results will be obvious. The quality of the demographic information will not be as high as the quality of the sales volume information. If data is to be of value to the organization, then collecting that data has to be of value to the employees. Data quality needs to be measured and made an organizational goal.
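One simple way to begin measuring data quality is to track, field by field, how often a value is actually recorded. The sketch below is illustrative only: the records and the field names (payment_amount, recipient_zip, injury_type) are hypothetical stand-ins for the payment, recipient and injury information discussed above, not data from the projects reviewed.

```python
# Hypothetical claim records; None marks a value the field agent did not record.
claims = [
    {"payment_amount": 1200.0, "recipient_zip": "30301", "injury_type": "sprain"},
    {"payment_amount": 560.0, "recipient_zip": None, "injury_type": None},
    {"payment_amount": 8900.0, "recipient_zip": "30305", "injury_type": None},
    {"payment_amount": 430.0, "recipient_zip": "30310", "injury_type": "fracture"},
]


def completeness_by_field(records):
    """Return, for each field, the share of records with a non-missing value."""
    fields = sorted({field for record in records for field in record})
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in fields
    }


# Report fields from least complete to most complete.
for field, share in sorted(completeness_by_field(claims).items(), key=lambda kv: kv[1]):
    print(f"{field}: {share:.0%} complete")
```

A completeness rate captures only missing values. It says nothing about values that are present but wrong, such as a misspelled name or an implausible injury description, so it is a starting point for a data quality goal rather than a full measure.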
Improving data quality depends on understanding and utilizing one simple fact. Bad data is usually not going to get the organization's attention, but bad decisions will. Demonstrating the cost of bad decisions or the potential of a future bad decision will make the point. From a practical point of view, it is difficult, if not impossible, to calculate the Return on Investment (ROI) implications of bad data; however, you can calculate the ROI implications of a bad decision.
This concept is not new. Supporting data is frequently brought into an analysis without much recognition. A typical example is when a company is analyzing product sales over a period of years. Often included in this analysis is data related to economic conditions, such as Gross National Product, interest rates and unemployment. The organization evaluates its sales performance in light of general economic conditions. Therefore, some downturn in sales can be attributed to economic conditions, not to a lack of success of the company relative to its competitors. The company may be using a large database of product sales, which resides in the company data warehouse. However, the company does not keep track of economic data and is fortunate that others keep this data for it. Using external secondary data to improve an analysis is common. But, unfortunately, when there is a need for secondary and supporting data from internal sources, we can have an altogether different story. This secondary data does not get the attention that the data warehouse data gets, but it can be as important or more so.
person is likely overworked now. They are already overtaxed, and having new data does not mean they have time to fully utilize it.
Three requirements are needed to create the Sense of Openness. They are: executive sponsorship, a reduction in the emphasis on statistical accuracy, and for the data miner to present exploratory findings with good documentation and support.
Executive Sponsorship
Dr. Robert Kriegel (1996), a leading writer and lecturer on organizational change, believes resistance to change is personal. In his book Sacred Cows Make the Best Burgers, he lists four personal resistance drivers:
Fear: "What if I lose my job, look stupid, can't adapt, etc."
Feeling Powerless: "No one asked me!"
Inertia: "It's too much effort, too uncomfortable."
Absence of Self-Interest: "What's in it for me?"
(Kriegel, 1996, p. 195)
Dealing with these personal resistance drivers is an organizational issue, but they do impact the completion of data mining projects. A manager's fear that his understanding of the marketplace, developed over 30 years, no longer applies is powerful. To the fears identified by Kriegel we can add the fear of "What if I am wrong?" Feeling powerless, or left out, when new techniques are used that you don't understand is an equally strong deterrent to change. The same can be said for inertia. When a product is at the top of a cycle of market share, who wants to say, "Now is the time to make changes"? Yet we all know that products have life cycles and that sales go up and sales go down. And we have all seen great products, once unstoppable, reduced in significance. It makes an organization very uncomfortable to discuss a potential change from this lofty position. The desire to believe that the present situation will continue provides inertia that is hard to overcome. For new exploratory findings to have a significant impact, these personal and organizational issues must be understood and addressed.
Although executive sponsorship is cited as the requirement for most organizational accomplishments, ranging from human resource programs to recycling programs, rarely are we told exactly how to apply it. However, the requirement here is clear. An acknowledgement is required that an effort is underway to investigate information to reveal what is not presently known or what is incorrect. Presently held beliefs can be questioned. Furthermore, it is still acceptable to continue the research when it is revealed that the initial questioning was incorrect.
Heresy: Ignore Statistical Accuracy
This section is intentionally mislabeled to make a point, so let us state clearly what we are not saying before the point can be misrepresented. The recommendation is not to ignore statistical accuracy; rather, it is to temporarily drop or reduce the requirements of statistical accuracy. In most projects, we analyze a sample of a larger population. Analysts work from a representative sample of the population, drawn randomly. Therefore, there can be variation in the results, leading to the need to understand the variation and determine if the sample is accurate. The exploratory data miner's job is to find new relationships, relationships that we don't know exist. Often these relationships are found outside the main view of the organization. The researcher often is faced with a dilemma: finding new relationships often pushes the data to its limits, and these new relationships are difficult to prove with statistical support. Should the researcher withhold this information until new data is available, which may be a lengthy delay, or should the researcher report the new discovery and begin speculations about the new findings? In the organization with a sense of openness, the speculation will open new discussions about the topic. The discussion will suggest new areas of research to be explored to answer the questions the speculation raises. Even if some of the
statistically questionable initial ideas later prove wrong, the organization will benefit from the focus on the new issues. In the organization with the sense of openness, the researcher is confident that his speculations will be used appropriately. They will not be confused with research findings, but the speculations will instead be the foundation of potential new knowledge.
Presenting Exploratory Findings
For speculations to be treated as speculations, it must be clear what an exploratory result, a finding, is. These final results of the exploratory analyses should be properly documented and circulated throughout the organization. The goal of exploratory data mining is finding relationships and trends that are not readily apparent. In order to show these difficult findings, the researcher must clearly and completely articulate what has been discovered and offer supporting documentation for the discovery. It should be stated clearly to what the results apply. In order to differentiate these findings from speculations, a format for presenting results, such as the following, should be used.
Finding: Women under 25 who buy product X are three times more likely to report that they are interested in using the product for fun than women under 25 who buy products Y and Z.
Support:
1. Customer surveys from 1998-2001 were used for the analysis. Of the women under 25 who responded, there were 98 responses from purchasers of product X and 154 responses from purchasers of products Y and Z. Of the product X responses, 54% listed fun as a reason for purchase; of the product Y and Z responses, 17% mentioned fun.
2. Women of all other age groups did not show the same emphasis on fun. Of those responses, 16% listed fun if they purchased product X, and 18% listed fun if they bought products Y and Z.
The finding states clearly what the exploratory finding is about: the reason for buying the product. The statement also makes very clear the dimensionality of the problem and what region of the data the finding relates to: gender is a dimension, age is a dimension and product preference is a dimension. The supporting statement gives the statistics behind the statement and even alludes to another dimension, the time period.
The format that the organization uses can vary, but the components above should be used. The finding states clearly what population is being discussed. Depending on the audience, it is not necessary to give specific statistical information in the finding. However, the supporting information should give the details. Notice the supporting information also clarifies why this finding is important. In this case, it is because this population is different from other women.
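The arithmetic behind the example finding can be checked with a short calculation. The sketch below uses only the percentages and response counts quoted in the support statements. The interval calculation is an added illustration, not part of the original analysis: it assumes a simple normal approximation for the difference between two proportions, and the function names are hypothetical.

```python
import math


def rate_ratio(rate_a, rate_b):
    """How many times more often group A mentions fun than group B."""
    return rate_a / rate_b


def difference_interval(rate_a, n_a, rate_b, n_b, z=1.96):
    """Approximate 95% interval for rate_a - rate_b (normal approximation, an assumption)."""
    diff = rate_a - rate_b
    std_err = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    return diff - z * std_err, diff + z * std_err


# Women under 25: 54% of 98 product X responses vs. 17% of 154 product Y and Z responses.
under_25_ratio = rate_ratio(0.54, 0.17)
low, high = difference_interval(0.54, 98, 0.17, 154)

# All other age groups: 16% vs. 18% (response counts are not given in the support).
other_ages_ratio = rate_ratio(0.16, 0.18)

print(f"Under 25: ratio {under_25_ratio:.2f}, difference between {low:.2f} and {high:.2f}")
print(f"Other ages: ratio {other_ages_ratio:.2f}")
```

The ratio for women under 25 comes out a little above three, which matches the wording of the finding, while the ratio for other age groups is slightly below one. Keeping the counts in the support statement is what allows a reader to repeat this kind of check.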
MOVING FORWARD
Writing in The Atlantic, author Jonathan Rauch (2001) presents the concept of the New Old Economy to explain the recent impact of Information Technology (IT) on the economy. The New Old Economy refers to the impact of information technology on old-line businesses that have existed for decades using basically the same processes but with greater efficiency thanks to improvements from IT capabilities.
The United States economy grew at an unprecedented rate in the 1990s. The economy produced a higher rate of growth of real output per worker in that decade than in the previous decade. Throughout the late '80s and into the '90s, organizations operating in the old economy were investing in personal computers and the basic software (spreadsheets and word processing packages) to perform the old-economy tasks. At first the uses of those innovations were a convenience at best and a difficult-to-use nuisance at worst. Gradually, as the organizations learned to apply spreadsheets and word processing (the new technology of the time), software and hardware technology showed gains in efficiency. Now those techniques are considered basic to these Old Economy businesses. Hence, the term New Old Economy.
Eventually, the convenience of the then new, now commonplace, tools (spreadsheets and word processing) made them more viable than the old mainframe systems that cost hundreds of thousands of dollars. They are now so seamlessly integrated into the company's old-economy businesses that they receive little attention compared with the attention given to the Internet and e-commerce activities. As Rauch states, the impact of these basic technologies is unquantified and perhaps unquantifiable. However, it is certain that we can see it in today's workplace.
The parallel to the growth of the use of PCs in the '80s and '90s is today's use of data mining software and databases. As in the past, there is an organizational learning curve (as well as an individual learning curve) in applying the new technology. The organization must understand new large databases and learn how to apply what is available. That process will occur in countless gradual steps and some great leaps in data, software and techniques. A necessary first step is to remove present organizational impediments to the use of exploratory data mining so that these techniques become basic to the process.
As Lyndon Johnson once advised the country, "We must change to master change" (Johnson, 1966).
REFERENCES
All Things Considered. (2001, May 2). National Public Radio.
Kriegel, R., & Brandt, D. (1996). Sacred Cows Make the Best Burgers. Warner Books.
Rauch, J. (2001, January). The New Old Economy: Oil, Computers, and the Reinvention of the Earth. The Atlantic Monthly, 35-50.
Johnson, L. B. (1966). State of the Union message.
Copyright 2004 by Z Solutions, Inc. For More Information contact Z Solutions at [email protected]