
Data Mining and Visualization

1. Introduction
The point of data visualization is to let the user understand what is going on. Since data mining usually involves extracting "hidden" information from a database, this understanding process can get somewhat complicated. In most standard database operations nearly everything the user sees is something that they knew existed in the database already. A report showing the breakdown of sales by product and region is straightforward for the user to understand because they intuitively know that this kind of information already exists in the database. If the company sells different products in different regions of the country, there is no problem translating a display of this information into a relevant understanding of the business process.

Data mining, on the other hand, extracts information from a database that the user did not already know about. Useful relationships between variables that are non-intuitive are the jewels that data mining hopes to locate. Since the user does not know beforehand what the data mining process has discovered, it is a much bigger leap to take the output of the system and translate it into an actionable solution to a business problem. Since there are usually many ways to graphically represent a model, the visualizations that are used should be chosen to maximize the value to the viewer. This requires that we understand the viewer's needs and design the visualization with that end-user in mind. If we assume that the viewer is an expert in the subject area but not in data modeling, we must translate the model into a more natural representation for them. For this purpose we suggest the use of orienteering principles as a template for our visualizations.

1.1 Orienteering
Orienteering is typically accomplished by two chief approaches: maps and landmarks. Imagine yourself set down in an unknown city with instructions to find a given hotel. The usual method is to obtain a map showing the large-scale areas of the city. Once the "hotel district" is located, we will then walk along looking for landmarks such as street names until we arrive at our location. If the landmarks do not match the map, we will re-consult the map or even replace one map with another. If the landmarks do not appear correct, then usually one will backtrack, try a short side journey, or ask for further landmarks from people on the street. The degree to which we will follow the landmark chain or trust the map depends upon the match between the landmarks and the map. It will be reinforced by unexpected matches (happening upon a unique landmark for which we were not looking), by finding the landmark by two different routes, and by noting that variations are small. Additionally, our experience with cities and maps, and the urgency of our journey, will affect our confidence as well. The combination of a global coordinate system (the map analogy) and the local coordinate system (the landmarks) must fit together and must instill confidence as the journey is traversed.

The concept of a manifold is relevant here in that the global coordinates must be realizable, in some sense, as a combination of local coordinate systems. To grow the user's trust, we should:

1. Show that nearby paths (small distances in the model) do not lead to widely different ends.
2. Show, on demand, the effect that different perspectives (change of variables or inclusion probabilities) have on model structure.
3. Make dynamic changes in coloring, shading, edge definition, and viewpoint (dynamic dithering).
4. Sprinkle known relationships (landmarks) throughout the model landscape.
5. Allow interaction that provides more detail and answers queries on demand.

The advantages of this manifold approach include the ability to explore it in some optimal way (such as projection pursuit), the ability to reduce the models to an independent coordinate set, and the ability to measure model adequacy in a more natural manner.

1.2 Why Visualize a Data Mining Model?
The driving forces behind visualizing data mining models can be broken down into two key areas: Understanding and Trust.

Understanding is undoubtedly the most fundamental motivation behind visualizing the model. Although the simplest way to deal with a data mining model is to leave the output in the form of a black box, the user will not necessarily gain an understanding of the underlying behavior in which they are interested. If they take the black box model and score a database, they can get a list of customers to target (send them a catalog, increase their credit limit, etc.). There's not much for the user to do other than sit back and watch the envelopes go out. This can be a very effective approach: mailing costs can often be reduced by an order of magnitude without significantly reducing the response rate.

The more interesting way to use a data mining model is to get the user to actually understand what is going on so that they can take action directly. Visualizing a model should allow a user to discuss and explain the logic behind the model with colleagues, customers, and other users. Getting buy-in on the logic or rationale is part of building the user's trust in the results. For example, if the user is responsible for ordering a print advertising campaign, understanding customer demographics is critical. Decisions about where to put advertising dollars are a direct result of understanding data mining models of customer behavior. There's no automated way to do this; it is all in the marketing manager's head. Unless the output of the data mining system can be understood qualitatively, it won't be of any use. In addition, the model needs to be understood so that the actions that are taken as a result can be justified to others.

Understanding means more than just comprehension; it also involves context. If the user can understand what has been discovered in the context of their business issues, they will trust it and put it into use. There are two parts to this problem: 1) visualization of the data mining output in a meaningful way, and 2) allowing the user to interact with the visualization so that simple questions can be answered.

Creative solutions to the first part have recently been incorporated into a number of commercial data mining products (such as MineSet [1]). Graphing lift, response, and (probably most importantly) financial indicators (e.g., profit, cost, ROI) gives the user a sense of context that can quickly ground the results in reality. After that, simple representations allow the user to see the data mining results themselves. Graphically displaying a decision tree (CART, CHAID, or C4.5) can significantly change the way in which the data mining software is used. Some algorithms (e.g., neural networks) pose more problems than others, but novel solutions are starting to appear.
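To make the kind of grounding described above concrete, the sketch below scores a customer table with a generic classifier and summarizes response and lift by decile. It is a minimal illustration assuming a synthetic pandas DataFrame with a hypothetical binary "responded" column; it is not the approach of MineSet or any other specific product.

```python
# Sketch: score customers with a black-box model and summarize lift by decile.
# The data, columns, and model choice here are hypothetical stand-ins.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "age": rng.integers(18, 80, 5000),
    "income": rng.normal(50_000, 15_000, 5000),
})
# Synthetic response flag, for demonstration only.
customers["responded"] = (rng.random(5000) < 0.02 + 0.0000015 * customers["income"]).astype(int)

X, y = customers[["age", "income"]], customers["responded"]
model = GradientBoostingClassifier().fit(X, y)          # in practice, fit on training data only

scored = customers.assign(score=model.predict_proba(X)[:, 1]).sort_values("score", ascending=False)
# Decile 1 = highest-scoring 10% of customers.
scored["decile"] = 10 - pd.qcut(scored["score"].rank(method="first"), 10, labels=False)

baseline = y.mean()
summary = scored.groupby("decile")["responded"].agg(["count", "mean"])
summary["lift"] = summary["mean"] / baseline
print(summary)
```

Mailing only the top few deciles is exactly the order-of-magnitude cost reduction described in section 1.2, and the same table is what a lift or cumulative-response chart would plot.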

2. Trusting the Model


Attributing the appropriate amount of trust to data mining models is essential to using them wisely. Good quantitative measures of "trust" must ultimately reflect the probability that the model's predictions would match future test targets. However, due to the exploratory and large-scale nature of most data mining tasks, fully articulating all of the probabilistic factors needed to do so would seem to be generally intractable. Thus, instead of trying to boil "trust" down to one probabilistic quantity, it is typically most useful to visualize, along many dimensions, some of the key factors that contribute to trust (and distrust) in one's models. Furthermore, since, as with any scientific model, one can ultimately only disprove a model, visualizing the limitations of the model is of prime importance. Indeed, one might best view the overall goal of "visualizing trust" as that of understanding the limitations of the model, as opposed to understanding the model itself.

2.1 Assessing Trust in a Model
Assessing model trustworthiness is typically much more straightforward than the holy grail of model understanding per se, essentially because the former is largely deconstructive while the latter is constructive. For example, without a deep understanding of a given model, one can still use general domain knowledge to detect that it violates expected qualitative principles. A well-known example is that one would be concerned if one's model employed a (presumably spurious) statistical correlation between shoe size and IQ. Of course, there are still very significant challenges in declaring such knowledge as completely and consistently as possible.

Domain knowledge is also critical for the outlier detection needed to clean data and avoid classic problems such as a juvenile crime committed by an 80-year-old "child". If a data mining model were built using the data in Figure 1, it is possible that outliers (most likely caused by incorrect data entry) would skew the resulting model (especially the zero-year-old children, which are more reasonable than eighty-year-old children). The common role of visualization here is mostly in terms of annotating model structures with domain knowledge that they violate.
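As a small illustration of this kind of domain-knowledge check, the sketch below flags records that violate a simple qualitative rule, in the spirit of the 80-year-old "child" above. The column names and the age range are hypothetical assumptions, not the data behind Figure 1.

```python
# Sketch: flag records that violate a simple domain rule before modeling.
# Columns, values, and the plausible age range are hypothetical.
import pandas as pd

records = pd.DataFrame({
    "age":     [14, 16, 0, 80, 15, 17],
    "offense": ["juvenile"] * 6,
})

# Domain knowledge: juvenile offenders should fall in a plausible age range.
violations = records[(records["offense"] == "juvenile")
                     & ~records["age"].between(10, 17)]
print(violations)  # candidates to annotate, correct, or exclude before modeling
```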

Not all assessments of trust are negative in nature, however. In particular, one can also increase one's trust in a model if other reasonable models seem worse. In this sense, assessing trust is closely related to model comparison. In particular, it is very useful to understand the sensitivity of model predictions and quality to changes in the parameters and/or structure of the given model. There are many ways to visualize such sensitivity, often in terms of local and global (conditional) probability densities, with special interest in determining whether multiple modes of high probability exist for some parameters and combinations. Such relative measures of trust can be considerably less demanding to formulate than attempts at more absolute measures, but they do place special demands on the visualization engine, which must support quick and non-disorienting navigation through neighboring regions in model space.

Finally, it is important, though rather rare in practice to date, to consider many transformations of the data during visual exploration of model sensitivities. For example, a model that robustly predicts the internal pressure of some engineering device well should probably also predict related quantities well, such as its derivative, its power spectrum, and other relevant quantities (such as nearby or redundant pressures). Checking for such internal consistency is perhaps ultimately one of the most important ways to judge the trustworthiness of a model, beyond standard cross-validation error. Automated and interactive means of exploring and visualizing the space (and degrees) of inconsistencies a model entails seem to be a particularly important direction for future research on assessing model trustworthiness.
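A rough sketch of the sensitivity idea above: refit "nearby" models under small perturbations (here, bootstrap resamples, which is only one of many possible perturbation schemes) and look at how much their predictions spread. The dataset and model family are illustrative assumptions.

```python
# Sketch: probe how sensitive predictions are to small changes in the fitting
# conditions by refitting nearby models on bootstrap resamples.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)

preds = []
for _ in range(20):
    idx = rng.integers(0, len(X), len(X))                        # resample the data
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])    # a "nearby" model
    preds.append(m.predict_proba(X)[:, 1])

spread = np.vstack(preds).std(axis=0)   # per-record disagreement among nearby models
print("median spread:", round(float(np.median(spread)), 4))
print("most unstable records:", np.argsort(spread)[-5:])
```

A visualization engine would let the user navigate these neighboring models interactively rather than reading the numbers off a console, but the underlying quantity being displayed is the same.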

3. Understanding the Model

A model that can be understood is a model that can be trusted. While statistical methods build some trust in a model by assessing its accuracy, they cannot assess the model's semantic validity: its applicability to the real world. A data mining algorithm that uses a human-understandable model can be checked easily by domain experts, providing much-needed semantic validity to the model. Unfortunately, users are often forced to trade off accuracy of a model for understandability.

The rest of this section will focus on understanding classification models. Specifically, we will examine three models built using Silicon Graphics' MineSet: decision tree, simple Bayesian, and decision table classifiers [3]. Each of these tools provides a unique form of understanding based on representation, interaction, and integration.

Decision trees are easy to understand but can become overwhelmingly large when automatically induced. The SGI MineSet Tree Visualizer uses a detail-hiding approach to simplify the visualization. In figure 2, only the first few levels of the tree are initially displayed, despite the fact that the tree is extensive. The user can gain a basic understanding of the tree by following the branches of these levels. Additional levels of detail are revealed only when the user navigates to a deeper level, providing more information only as needed.
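The detail-hiding idea is not tied to MineSet. As a rough analogue, the sketch below prints only the top levels of an automatically induced scikit-learn tree and re-renders deeper levels on demand; the dataset and depth limits are illustrative choices.

```python
# Sketch: a detail-hiding analogue, showing only the top levels of a tree
# and revealing more levels when the user asks for them.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Initial overview: only the first two levels of what may be an extensive tree.
print(export_text(tree, feature_names=list(data.feature_names), max_depth=2))

# "Navigating deeper": re-render with more levels on request.
print(export_text(tree, feature_names=list(data.feature_names), max_depth=5))
```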

Figure 2: The MineSet Tree Visualizer shows only the portion of the model close to the viewer.

Using decision tables as a model representation generates a simple but large model. A full decision table theoretically contains the entire dataset, which may be very large, so simplification is essential. The MineSet decision table arranges the model into levels based on the importance of each feature in the table. The data is automatically aggregated to provide a summary using only the most important features.

When the user desires more information, he can drill down as many levels as needed to answer his question. The visualization automatically changes the aggregation of the data to display the desired level of detail. In figure 3, a decision table shows the well-known correlation between head shape and body shape in the monk dataset. It also shows that the classification is ambiguous in cases where head shape does not equal body shape. For these cases, the user can drill down to see that the attribute jacket color determines the class.

Figure 3: The MineSet Decision Table Visualizer shows additional pairs of attributes as the user drills down into the model.
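A minimal sketch of this aggregate-then-drill-down behavior, built with pandas group-bys over a toy table that merely imitates the monk dataset's attributes; the rows and values below are invented for illustration and are not MineSet's decision table implementation.

```python
# Sketch: a decision-table-style summary, aggregating the class by the most
# important attributes and drilling down by adding another attribute.
import pandas as pd

df = pd.DataFrame({
    "head_shape":   ["round", "round", "square", "square", "round", "square"],
    "body_shape":   ["round", "square", "square", "round", "square", "round"],
    "jacket_color": ["red", "red", "blue", "red", "green", "blue"],
    "class":        [1, 1, 1, 1, 0, 0],
})

# Top level: summarize by the two most important attributes.
top = df.groupby(["head_shape", "body_shape"])["class"].mean()
print(top)   # ambiguous (between 0 and 1) exactly where head_shape != body_shape

# Drill-down: add jacket_color, but only for the ambiguous cases.
ambiguous = df[df["head_shape"] != df["body_shape"]]
print(ambiguous.groupby(["head_shape", "body_shape", "jacket_color"])["class"].mean())
```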

While a good representation can greatly aid the user's understanding, in many cases the model contains too much information to provide a representation that is both complete and understandable. In these cases we exploit the brain's ability to reason about cause and effect and let the user interact with the more complex model. Interaction can be thought of as "understanding by doing" as opposed to "understanding by seeing".

The MineSet Evidence Visualizer allows the user to interact with a simple Bayesian classifier (Figure 4). Even simple Bayesian models are based on multiplying arrays of probabilities that are difficult to understand by themselves.

However, by allowing the user to select values for features and see the effects, the visualization provides cause-and-effect insight into the operation of the classifier. The user can play with the model to understand exactly how much each feature affects the classification and ultimately decide to accept or reject the result. In the example in the figure, the user selects the value of "working class" to be "self-employed-incorporated," and the value of "education" to be "professional-school". The pie chart on the right displays the expected distribution of incomes for people with these characteristics.

Figure 4: Specific attribute values are selected in the Evidence Visualizer in order to predict income for people with those characteristics.
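The same select-values-and-see-the-effect interaction can be imitated by hand with a naive Bayes calculation. The sketch below conditions a toy census-like table on two chosen attribute values and prints the resulting class distribution; the table, feature values, and smoothing choice are hypothetical stand-ins for the example in the figure, not the Evidence Visualizer itself.

```python
# Sketch: the cause-and-effect flavor of an evidence-style display, done by hand
# with a naive Bayes model over a toy table. Data and values are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "working_class": ["self-emp-inc", "private", "self-emp-inc", "private", "private"],
    "education":     ["prof-school", "hs-grad", "prof-school", "prof-school", "hs-grad"],
    "income":        [">50K", "<=50K", ">50K", ">50K", "<=50K"],
})

def posterior(selected: dict) -> pd.Series:
    """P(income | selected feature values) under the naive Bayes assumption."""
    prior = df["income"].value_counts(normalize=True)
    logp = np.log(prior)
    for feature, value in selected.items():
        for cls in prior.index:
            subset = df[df["income"] == cls]
            count = (subset[feature] == value).sum()
            # Per-class conditional P(feature = value | income), with add-one smoothing.
            logp[cls] += np.log((count + 1) / (len(subset) + df[feature].nunique()))
    p = np.exp(logp - logp.max())
    return p / p.sum()

# "Select" values as in the Evidence Visualizer and read off the predicted mix.
print(posterior({"working_class": "self-emp-inc", "education": "prof-school"}))
```

Changing the selected values and re-running is the batch equivalent of the interactive "playing with the model" described above.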

Beyond interactive classification, interactively guiding the model-building process provides additional control and understanding to the user. Angoss [4] provides a decision tree tool that gives the user full control over when and how the tree is built. The user may suggest splits, perform pruning, or manually construct sections of the tree. This facility can boost understanding greatly. Figure 5a shows a decision tree split on a car's brand attribute. While the default behavior of the tree is to form a separate branch for each categorical value, a better approach is often to group similar values together and produce only a few branches. The result shown in figure 5b is easier to understand and can sometimes give better accuracy. Interactive models allow the user to make changes like this as the situation warrants.

Figures 5a and 5b: A decision tree having branches for every value of the brand attribute (top), and a decision tree which groups values of brand to produce a simpler structure (bottom).
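A small sketch of the effect of such manual grouping, using hypothetical brands, groups, and target rule; this is not the Angoss tool, just an illustration of why grouping a high-cardinality attribute yields a simpler tree.

```python
# Sketch: grouping a high-cardinality attribute before splitting (figure 5b)
# yields a simpler tree than one branch per value (figure 5a).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

brands = ["alfa", "audi", "bmw", "chevrolet", "dodge", "ford", "honda", "toyota"]
df = pd.DataFrame({"brand": brands * 50})
df["high_resale"] = df["brand"].isin(["alfa", "audi", "bmw"]).astype(int)  # toy target

# Figure 5a style: one indicator column (one candidate branch) per brand value.
per_value = pd.get_dummies(df[["brand"]])

# Figure 5b style: the analyst groups similar values first, giving fewer branches.
groups = {"alfa": "european", "audi": "european", "bmw": "european",
          "chevrolet": "american", "dodge": "american", "ford": "american",
          "honda": "asian", "toyota": "asian"}
grouped = pd.get_dummies(pd.DataFrame({"brand": df["brand"].map(groups)}))

for name, X in [("per-value", per_value), ("grouped", grouped)]:
    tree = DecisionTreeClassifier(random_state=0).fit(X, df["high_resale"])
    print(name, "-", X.shape[1], "input columns,", tree.get_n_leaves(), "leaves")
    print(export_text(tree, feature_names=list(X.columns)))
```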

Interactive techniques and simplified representations can produce models that can be understood within their own context. However, for a user to truly understand a model, he must understand how the model relates to the data from which it was derived. For this goal, tool integration is essential. Few tools on the market today use integration techniques. The techniques that are used come in three forms: drill-through, brushing, and coordinated visualizations. Drill-through refers to the ability to select a piece of a model and gain access to the original data from which that piece of the model was derived. For example, the decision tree visualizer allows selection and drill-through on individual branches of the tree.

This will provide access to the original data that was used to construct those branches, leaving out the data represented by other parts of the tree. Brushing refers to the ability to select pieces of a model and have the selections appear in an alternate representation. Coordinated visualizations generalize both techniques by showing multiple representations of the same model, combined with representations of the original data. Interactive actions that affect the model also affect the other visualizations. All three of these techniques help the user understand how the model relates to the original data. This provides an external context for the model and helps establish semantic validity.

As data mining becomes more extensive in industry and as the number of automated techniques employed increases, there is a natural tendency for models to become increasingly complex. In order to prevent these models from becoming mysterious oracles, whose dictates must be accepted on faith, it is essential to develop more sophisticated visualization techniques to keep pace with the increasing model complexity. Otherwise there is a danger that we will make decisions without understanding the reasoning behind them.
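The drill-through technique described above is straightforward to sketch outside any particular product: pick a branch (here, a leaf) of a fitted tree and recover the original records that reached it. The dataset is an arbitrary stand-in, and a real tool would query the source database rather than an in-memory frame.

```python
# Sketch: minimal drill-through, recovering the original records behind one
# selected leaf of a decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris(as_frame=True)
df = data.frame                       # original data kept alongside the model
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

leaf_ids = tree.apply(data.data)      # which leaf each original record falls into
selected_leaf = leaf_ids[0]           # pretend the user clicked this branch

drill_through = df[leaf_ids == selected_leaf]
print(f"leaf {selected_leaf}: {len(drill_through)} underlying records")
print(drill_through.head())
```

Brushing and coordinated visualizations extend the same record-to-model linkage in the other direction, highlighting the selected records in every open view.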

4. Comparing Different Models using Visualization


Model comparison requires the creation of an appropriate metric for the space of models under consideration. To visualize the model comparison, these metrics must be interpretable by a human observer through his or her visual system. The first step is to create a mapping from the input to the output of the modeling process. The second step is to map this process to the human visual space.

4.1 Different Meanings of the Word "Model"
It is important to recognize that the word "model" can have several levels of meaning. Common usage often associates the word model with the data modeling process. For example, we might talk of applying a neural network model to a particular problem. In this case, the word model refers to the generic type of model known as a neural network. Another use of the word model is associated with the end result of the modeling process. In the neural network example, the model could be the specific set of weights, topology, and node types that produces an output given a set of inputs. In still another use, the word model refers to the input-output mapping associated with a "black-box." Such a mapping necessarily places emphasis on careful identification of the input and output spaces.

4.2 Comparing Models as Input-Output Mappings
The input-output approach to model comparison simply considers the mapping from a defined input space to a defined output space. For example, we might consider a specific 1-gigabyte database with twenty-five variables (columns). The input space is simply the Cartesian product of the database's twenty-five variables. Any actions inside the model, such as the creation of new variables, are hidden in the "black-box" and are not interpreted. At the end of the modeling process, an output is generated. This output could be a number, a prioritized list, or even a set of rules about the system.

The crucial issue is that we can define the output space in some consistent manner to derive an input-to-output mapping.

4.3 Comparing Models as Algorithms
In the view of a model as a static algorithm, again there seems to be a reasonable way to approach the model comparison problem. For example, a neural network model and an adaptive nonlinear regression model might be compared. These models would be expressed as a series of algorithmic steps. Each model's algorithm could then be analyzed by standard methods for measuring algorithmic performance, such as complexity, finite word length, and the stability of the algorithm. The investigator could also include measures on the physical implementation of the algorithm, such as computation time or computation size. Using these metrics, the visualization could take the form of bar charts across the metrics. Again, different models could be encoded by color or symbol, and a graph of only the differences between the two models on each metric could be provided. Each comparison would be for a static snapshot, but dynamic behavior could certainly be explored through a series of snapshots, i.e., a motion picture.
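The sketch below illustrates both views on a toy problem: two models fitted to the same data are compared as input-output mappings (how often they disagree over the shared input space) and by a few per-model metrics, including one physical-implementation measure (fitting time), which could then be drawn as the bar charts suggested above. The models, dataset, and metric choices are illustrative assumptions.

```python
# Sketch: compare two models as input-output mappings and by simple metrics.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "neural net": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    "regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    results[name] = {
        "fit seconds": time.perf_counter() - start,   # a physical-implementation metric
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "AUC": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    }

# Input-output view: where in the shared input space do the two mappings disagree?
disagree = np.mean(models["neural net"].predict(X_test) != models["regression"].predict(X_test))
print("disagreement rate:", round(float(disagree), 3))
for metric in ["fit seconds", "accuracy", "AUC"]:
    print(metric, {name: round(r[metric], 3) for name, r in results.items()})
```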

5. Conclusion
In this paper we have discussed a number of methods to visualize data mining models. Because data mining models typically generate results that were previously unknown to the user, it is important that any model visualization provide the user with sufficient levels of understanding and trust.

6. Acknowledgements
The authors would like to thank Professor Ben Shneiderman and his colleagues at the Human-Computer Interaction Laboratory, University of Maryland, College Park, for providing Figure 1. It was created using an early version of the Spotfire visualization software.

7. References
[2] D. DeCoste, "Mining multivariate time-series sensor data to discover behavior envelopes," Proceedings of the Third Conference on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, CA, August 1997.
[3] D. Rathjens, MineSet User's Guide, Silicon Graphics, Inc., 1997.
