Knowledge Mining for the Business Analyst
Themis Palpanas
University of Trento
[email protected]
Abstract—Numerous studies have focused on problems related to data mining techniques, including applications of these techniques in the e-commerce setting. In this work, we describe how data mining technology can be effectively applied in an e-
commerce environment, delivering significant benefits to the business analyst. We propose a framework that takes the results of
the data mining process as input, and converts these results into actionable knowledge, by enriching them with information that
can be readily interpreted by the business analyst. The framework can accommodate various data mining algorithms, and provides
a customizable user interface.
We experimentally evaluate the proposed approach by using a real-world case study that demonstrates the added benefit of the
proposed method. The same study validates the claim that the produced results represent actionable knowledge that can help the
business analyst improve the business performance, since it significantly reduces the time needed for data analysis, which results in
substantial financial savings.
ACM Categories: H.2.8 Database Applications, H.3 Information Storage and Retrieval
1. INTRODUCTION
Data mining (Hand et al. 2000) has been extensively used in the past with the purpose of helping the analyst quickly examine
huge collections of data, and extract new knowledge about the processes that generated those data. Examples of data mining
applications include star formation discovery in astronomy data, drug discovery in bioinformatics, targeted advertising in
customer relationship management, market basket analysis in retail data, buying patterns identification in online stores, and
many others (Han et al. 2006).
More recently, data mining techniques have also been proposed and used in the specific context of e-commerce (Li et al.
2005; Yang et al. 2005). In this case, one of the concerns is how to integrate these techniques with the overall business
process, and take advantage of the benefits they have to offer. For example, one of the studies examines the utility of
employing data mining for performing risk analysis of customs declaration cargo (Li et al. 2005). The study shows how the
results of the analysis helped to transform the associated business process, so that it became more efficient and effective.
We argue that an important problem regarding the use of data mining tools by business analysts is that, often, the information necessary to make business decisions is absent from the data mining results. In those cases, these results have
to be enriched before they can be transformed into actionable knowledge.
In this work, we describe a framework that demonstrates how data mining technology can be effectively applied in an e-
commerce environment, in a way that delivers immediate benefits to the business analyst. The main idea behind our approach
is to concentrate on the results of the data mining procedure, and then merge all other related pieces of information into these results. The reason for doing this is the following. The data mining results per se are usually rather poor in information content, carrying only the bits of data necessary for achieving a particular mining task. This information is not sufficient for the business analyst, who needs a complete view of the data in order to make business decisions.
Under the proposed framework, all the relevant information is gathered around the mining results, which now represent
actionable knowledge. These data are organized according to different dimensions, where each dimension represents a
particular aspect of the objects represented in the data mining results.
1 Work done while the author was a member of the IBM T.J. Watson Research Center. An early version of this work has appeared in the LNCS Proceedings of DEXA (Palpanas et al. 2008).
Once the business analyst has all the above information at hand, she is in a position to interpret the mining results and make
informed business decisions. The multi-dimensional organization of the results allows the analyst to view the data from
different perspectives, group them according to various business criteria, and analyze them further, just like any other
collection of business data.
In order to demonstrate the benefits of the proposed approach, we applied our framework to a real-world scenario, namely to
the problem of identifying and managing failures in manufactured products. In the manufacturing industry, it is very common that some of the produced items have defects, or fail during usage because of a flawed design or an inappropriate manufacturing process. In either case, we say that there is a failure in the product. When the products we are examining are
complex, e.g., a vehicle or an airplane, it is very important to be able to identify such failures as early as possible, and trace
them back to the root cause in the design and manufacturing process.
Our case study, where we examined real data coming from the vehicle manufacturing industry, revealed that there exist many
frequent sequences of the same set of failures. That is, many vehicles share a specific set of failures that occur in the same
order (in time) for all those vehicles. Mining the data for frequent failure patterns is essentially a way of searching for
dependencies among the failures. The failure patterns we identify reveal trends that have occurred in the past, and are also indicative of future trends. These failure patterns are important, because they are entire sequences of the same
failures occurring in many different vehicles. This indicates that the specific failures are correlated, and may lead to the
identification of deeper problems. Furthermore, these patterns represent actionable knowledge, since they can be correlated to
the manufacturing process of the vehicle, and may be translated to errors in the design or the production phases.
Since the final output of our method can be quite sizable, we capture all the detailed results in a database specifically created
to facilitate the further inspection and understanding of these results. Subsequently, these results are correlated with relevant
information from the other databases in the business, thus, creating a rich context for the analysis of the data mining results.
For example, in our case study, the results of the mining process are correlated with detailed information regarding the
manufacturing and the demographics of each one of the vehicles that are part of the mining results. The user can then use a
customized interface in order to access a variety of reports that present the results along several different dimensions that are
useful to the business analyst. What is important to note here is that in our framework it is very easy to change and customize
the interface, so that the reports match the needs of the analysts.
In summary, the method we propose allows the analyst to:
quickly discover frequent, time-ordered, event-patterns,
automate the process of event-pattern discovery using a feedback loop,
enrich these event-patterns with demographics related to the processes that generated them,
identify the commonalities among these event-patterns,
trace the events back to the process that generated them, and
predict future events, based on the history of event-patterns.
The rest of this paper is organized as follows. In Section 2 we review the related work and give some background on frequent
sequence mining. We introduce the knowledge mining framework in Section 3, and present a case study, where we apply the
proposed framework, in Section 4. Finally, we conclude in Section 5.
2. RELATED WORK
There exists a sizable literature in the area of knowledge extraction from huge datasets (Han et al. 2006), and many studies
focus on improving the performance and functionality of the various data mining algorithms (Aggarwal et al. 2004; Agrawal
et al. 1994; Evfimievski et al. 2004; Ganti et al. 2002; Keogh et al. 2002; Manerikar et al. 2009; Palpanas et al. 2001;
Palpanas et al. 2005; Srikant et al. 1996; Tantono et al. 2008; Yamanishi et al. 2004). These algorithms include clustering,
classification, association rules, frequent sequences, outliers, and others. The main goal of these algorithms is to make the
data mining process scale to very large datasets.
In all the above cases, the focus is on devising efficient algorithms that can process large amounts of data in a small
timeframe. As a result, these algorithms are targeted to expert users, and are very cumbersome for use by business analysts.
Because of their popularity as data mining techniques, some of the above algorithms are being offered as part of commercial
data mining solution packages (DB2 Intelligent Miner 2011; IBM SPSS Modeler 2011; Microsoft SQL Server Business
Intelligence 2011; Oracle Data Mining 2011; SAS Enterprise Miner 2011), as well as open source software (Orange 2011;
RapidMiner 2011; Weka 2011). These packages hide the intricate details from the users, and only expose graphical user
interfaces, which make the use of these algorithms more accessible. The framework we propose in this work introduces a technique for enriching the results of these algorithms with other relevant data, in such a way that it provides several different views that are useful to the business analyst.
The CRISP-DM project (Cross Industry Standard Process for Data Mining 2011) and SAS SEMMA (SAS Institute 2008) have each proposed an end-to-end methodology for data mining, starting from understanding the business in which it will be applied, and ending with the assessment of the proposed solution. In this paper, we present a specific framework that addresses some
of the problems arising during this process, namely, on how to organize and present to the user the new knowledge gained by
applying some data mining techniques. (We further discuss the relationship of the proposed framework to these
methodologies in Section 4.) There also exist standards for defining statistical and data mining models, such as the Predictive
Model Markup Language (PMML) (Predictive Model Markup Language 2011), which is an XML-based language. The main
goal of these standards is to enable the sharing of models among relevant applications.
There are several applications of data mining in the e-business environment (Li et al. 2005; Yang et al. 2005), which prove the usefulness of such techniques for improving business operations. These studies show that significant improvements can be achieved by using the results of data mining. However, the proposed solutions are targeted to the specific tasks for which they were developed. In contrast, we propose a framework that is more general than the above solutions, and that can be used in a variety of domains.
Finally, there is work on the problem of applying data mining techniques in the context of data warehouses (Palpanas 2000;
Sarawagi 2000; Sarawagi et al. 1998; Sathe et al. 2001). Data warehouses are usually employed by business analysts to assist
them during the decision-making process. These techniques are very effective in helping the business analyst focus on the
interesting parts of the data in the warehouse. However, they are targeted specifically to data warehouses, and are not directly
related to the framework we propose in this paper.
1) Formal Model
A formal statement of the problem was proposed in (Agrawal et al. 1995). Let D be a database of customer transactions, each transaction consisting of three fields: customer-id, transaction-time, and the purchased items. Let itemset i = (i1, i2, ..., im) be a non-empty set of items, and sequence s = <s1, s2, ..., sn> be an ordered list of itemsets. An itemset is large, also called a litemset, if the fraction of customers who bought all of its items is greater than the minimum support.
A sequence <a1, a2, ..., an> is contained in another sequence <b1, b2, ..., bm> if there exist integers i1 < i2 < ... < in, such that a1 is a subset of bi1, a2 a subset of bi2, ..., and an a subset of bin. In a set of sequences, a sequence s is maximal if s is not contained in any other sequence.
A sequence that contains all the transactions of a customer, ordered by increasing time, is called a customer-sequence. A customer supports a sequence if it is contained in the customer-sequence of this customer. A sequence is large if the fraction of total customers who support it is greater than the minimum support threshold. Note that a large sequence is composed exclusively of litemsets.
Several algorithms have been proposed for the solution of the sequential pattern mining problem (Agrawal et al. 1995; Ayres et al. 2002; Mannila et al. 1997; Srikant et al. 1996). Several of them can also solve more general variations of the problem, involving time constraints, sliding time windows, and taxonomies (is-a hierarchies); these extensions, which allow patterns of a finer granularity of detail to be mined, are described in the next subsection.
2) Proposed Algorithms
We now briefly describe algorithms for solving the frequent sequence mining problem. First, we present the Apriori algorithm (Agrawal et al. 1995), which consists of the following steps (a code sketch follows the list).
1. Scan the database to identify all the frequent sequences of unit length, i.e., 1-element sequences.
2. Iteratively scan the database to identify all frequent sequences. During iteration k, discover all the frequent sequences of
length k (k-sequences) as follows.
Generate the set of all candidate k-sequences by joining pairs of frequent (k-1)-sequences.
Prune a candidate sequence if any of its contiguous (k-1)-subsequences is not frequent.
Scan the database to determine which of the remaining candidate sequences are frequent.
3. Terminate when no more frequent sequences can be found.
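The following Python sketch illustrates this level-wise procedure. It is deliberately simplified: every element of a sequence is a single item rather than an itemset, which keeps the join and pruning steps easy to read; it is not the full generalized algorithm.

def frequent_sequences(db, min_sup):
    # db: list of customer-sequences, each a list of single items
    # (every element is a 1-item itemset, for readability).
    # min_sup: minimum number of supporting customer-sequences.

    def supp(seq):
        # a customer supports seq if seq is a time-ordered subsequence
        def contained(s, cs):
            it = iter(cs)
            return all(x in it for x in s)
        return sum(1 for cs in db if contained(seq, cs))

    # step 1: frequent 1-sequences
    items = {x for cs in db for x in cs}
    freq = {(x,): supp((x,)) for x in items}
    freq = {s: n for s, n in freq.items() if n >= min_sup}
    result = dict(freq)

    # step 2: iteratively generate, prune, and count longer candidates
    k = 2
    while freq:
        # join two frequent (k-1)-sequences that overlap in k-2 items
        cands = {a + (b[-1],) for a in freq for b in freq if a[1:] == b[:-1]}
        # prune candidates with an infrequent contiguous (k-1)-subsequence
        cands = {c for c in cands
                 if all(c[i:i + k - 1] in freq for i in range(2))}
        freq = {c: supp(c) for c in cands}
        freq = {c: n for c, n in freq.items() if n >= min_sup}
        result.update(freq)   # step 3: the loop ends when nothing survives
        k += 1
    return result

# toy run: failure codes per vehicle, ordered by date
db = [[300, 500, 450], [300, 500, 150, 450], [300, 450], [500, 150]]
print(frequent_sequences(db, min_sup=2))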
The above algorithm has been extended, in order to solve a more general form of the problem, by introducing three new
features (Srikant et al. 1996). First, it adds time constraints that specify a minimum and a maximum time period between
adjacent elements in a sequential pattern. Second, it allows items that belong to different transactions, but are close enough in time to one another, to be considered as belonging to the same transaction. Third, it handles taxonomies (is-a
hierarchies), so that the mined sequential patterns may contain items across all levels of the taxonomy.
A slightly different approach is presented in (Mannila et al. 1997). The term episode is used to describe a collection of events occurring frequently close to each other. There are serial episodes, where the events have to occur in a specific time order; parallel episodes, where no time constraints are given; and complex episodes, which are a combination of the above two.
The discovery of frequent episodes proceeds by building new candidate episodes and evaluating their actual frequency with a pass over the data. Complex episodes are handled by decomposing them into serial and parallel ones. Events pertaining to an episode should occur within a specific time window.
Two variations of the algorithm are given, Winepi and Minepi. The former discovers frequent episodes that are constrained to a time window, while the latter relaxes this constraint, and is able to produce rules with two different time windows on the left- and right-hand sides (e.g., “if A and B occur within 15 seconds, then C follows within 30 seconds”).
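To illustrate the Winepi window semantics, the following naive sketch (our own, with simplified integer timestamps) estimates the frequency of a serial episode as the fraction of sliding windows of a given width that contain its events in order.

def winepi_frequency(events, episode, width):
    # events: list of (timestamp, event_type), sorted by integer timestamp
    # episode: tuple of event types that must occur in this order
    # width: window width; the episode must fit inside one window
    if not events:
        return 0.0
    t_min, t_max = events[0][0], events[-1][0]
    # consider every window position that overlaps the event sequence
    starts = range(t_min - width + 1, t_max + 1)
    hits = 0
    for start in starts:
        window = [e for t, e in events if start <= t < start + width]
        it = iter(window)
        if all(x in it for x in episode):   # serial (ordered) occurrence
            hits += 1
    return hits / len(starts)

# e.g., how often does "A then B" fit in a window of width 15?
events = [(1, "A"), (5, "B"), (30, "A"), (47, "B")]
print(winepi_frequency(events, ("A", "B"), 15))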
Other algorithms for solving the frequent sequence mining problem have also been proposed more recently, where the focus
is on optimizing the resource consumption (e.g., the SPAM algorithm (Ayres et al. 2002)). These algorithms aim at
efficiently searching the candidate space of all possible frequent sequences, and result in performance improvements for
several specialized cases of the problem.
For our study, we have used the Apriori algorithm. Nevertheless, the specific choice of algorithm is orthogonal to the
framework we propose, and any of the other frequent sequence mining algorithms can be used instead.
3. KNOWLEDGE MINING FRAMEWORK
Discovering Patterns
In the following set of steps, we analyze the data in order to identify interesting patterns. These patterns should conform to
some user-defined constraints that determine the form of the patterns that are going to be computed. For example, the user
may focus the analysis by setting minimum and maximum lengths for each of the reported patterns (in the case of frequent
sequences). The user may also set a maximum time between the start and the end of the sequence, or between two
consecutive items of the same sequence.
Obviously, different settings for the parameters will result in different sets of discovered patterns. The proposed framework
allows the users to try out the various alternatives, and customize the mining process to their needs and the specific data they
are working with. (This functionality is, in principle, similar to exploratory mining (Ng et al. 1998).)
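As an illustration, such constraints can be captured by a small, checkable object; the sketch below is hypothetical (the parameter names and default values are ours), and would be consulted when deciding whether a discovered occurrence should be reported.

from dataclasses import dataclass

@dataclass
class PatternConstraints:
    # user-defined knobs that focus the mining step (illustrative)
    min_length: int = 2      # shortest pattern to report
    max_length: int = 10     # longest pattern to report
    max_span: float = 365.0  # max time between first and last event
    max_gap: float = 90.0    # max time between consecutive events

def satisfies(timestamps, cons):
    # timestamps: sorted event times of one candidate occurrence
    n = len(timestamps)
    if not (cons.min_length <= n <= cons.max_length):
        return False
    if timestamps[-1] - timestamps[0] > cons.max_span:
        return False
    return all(b - a <= cons.max_gap
               for a, b in zip(timestamps, timestamps[1:]))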
Data Enrichment
When the mining phase of the process is over and the system has produced the final set of frequent sequences, we proceed to
the post-processing step of data enrichment. The frequent sequences that come out as the result of the previous step only hold
some minimal, vital information: this is the information needed to identify the items that participate in each one of these
frequent sequences. The goal of the data enrichment step is to correlate the identified frequent sequences with all the relevant
pieces of information that are stored in the system.
In order to achieve this goal, data originating from different parts of the business are gathered and integrated with the data
mining results. These data naturally refer to different phases of the lifecycle of each specific item, and as such they enrich the
identified frequent sequences with contextual information relevant to the processes that generated them. When all these data
are put together, the business analyst can drill down on the root causes for the patterns that have been identified by the
mining phase.
The above procedure represents an important step in the framework we describe, since it augments the data mining results
with information that renders those results useful for business intelligence. A more concrete discussion on this topic, relevant
to the vehicle manufacturing business, follows in Section 4.
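A minimal sketch of this enrichment step follows, assuming (hypothetically) that each mined pattern records the VINs of the vehicles that support it, and that a lookup table holds the manufacturing and demographic record of each VIN; the actual schemas are richer. Any attribute of the supporting vehicles (model, engine type, plant, year) can be summarized in the same way.

def enrich_patterns(patterns, vehicle_info):
    # patterns: list of dicts like {"sequence": (...), "vins": [...]}
    # vehicle_info: dict VIN -> record, e.g.
    #   {"model": "M1", "plant": "A1", "year": 2003, "engine": "E"}
    enriched = []
    for p in patterns:
        records = [vehicle_info[v] for v in p["vins"] if v in vehicle_info]
        p = dict(p)
        p["context"] = records
        # e.g., the plant break-down among the supporting vehicles
        plants = {}
        for r in records:
            plants[r["plant"]] = plants.get(r["plant"], 0) + 1
        p["plant_breakdown"] = plants
        enriched.append(p)
    return enriched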
Result Analysis
Next, the enriched sequences produced during the previous step are analyzed in different ways. In Figure 1, we depict the three analysis modules that we propose.
Ranking Sequences: The first module uses various knowledge models and utility functions, and its purpose is to
rank the event-patterns according to different criteria. The ranking mechanism combines objective business metrics
with subjective domain knowledge. The results of this technique are able to capture the most important business
performance issues based on a small but representative fraction of the available information. Therefore, the analyst
can examine huge volumes of data, and quickly focus on the data most relevant to a particular aspect of the business.
(For a more elaborate discussion, the interested reader is referred to our previous work (Chen et al. 2005).)
Clustering Sequences: The purpose of the clustering module is to use the contextual information of each event-pattern in order to identify clusters of similar event-patterns (a small code sketch follows this list). When a clustering algorithm (for example, K-means (MacQueen 1967)) or some subspace clustering method (Domeniconi et al. 2004) is run against the enriched event-
patterns, it produces groupings of those patterns that are semantically meaningful within the business context. These
groups (or clusters) are a summarized representation of the behavior of the participating event-patterns. Therefore,
they help analyze the aggregated behavior of several business objects, and enable the analyst to gain insight on the
root causes for each behavior.
Predictive Analysis: The third module aims at using the history of event-patterns in order to predict future events.
The identified patterns represent (by definition) an approximation of the past behavior of the items under
examination. Given these data, we can make projections for the future behavior of the same items (Pednault 2004).
This kind of analysis is extremely useful to the business analysts, since it enables them to make predictions and
anticipate future events (based on the assumption that past behavior is a good indicator of future behavior).
We note that the above list of analysis modules is not exhaustive; other modules can be easily incorporated in the proposed
framework as well.
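As a concrete illustration of the clustering module, the sketch below runs a plain K-means over a few numeric features of each enriched event-pattern. The feature choice (support, average mileage, average repair cost) is hypothetical, and in practice the features should be normalized to comparable scales.

import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iters=20, seed=0):
    # plain K-means (MacQueen 1967 style): assign points, then re-center
    centers = random.Random(seed).sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            idx = min(range(k), key=lambda c: dist2(v, centers[c]))
            clusters[idx].append(v)
        centers = [mean(cl) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# hypothetical features per enriched event-pattern:
# (support, average mileage at failure, average total repair cost)
patterns = [[1200, 14000, 350.0], [90, 52000, 1800.0],
            [1100, 15500, 400.0], [85, 49000, 1650.0]]
centers, clusters = kmeans(patterns, k=2)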
Report Creation
In the final phase of the proposed framework, we produce a set of reports with the aim of summarizing the results of the
preceding data analysis phases. To this end, we have developed a graphical user interface that enables the analyst to access
and customize several different report types.
4. CASE STUDY
In order to evaluate the effectiveness of the proposed framework, we applied it to a real-world problem. In this section, we
present a case study with one of the two vehicle manufacturing companies we collaborated with.
The manufacturer has data detailing the characteristics of each vehicle that appears in some warranty claim. For each vehicle, we know the model type, the time and place of its assembly, and also the type of some of its components2, such as the engine. Apart from the above information, we also have data regarding the warranty claims for each vehicle, which describe the behavior of the vehicle after it has left the factory. Thus, we know at which points in time the vehicle failed, and also what the reason of the failure was, as identified by the mechanic. For those incidents, we also record the vehicle parts that failed, and the related costs (that is, parts and labor).
2 Of course, vehicle manufacturers also maintain databases with detailed information on the components of each vehicle. However, for this study, we did not have access to those databases.
A. Proposed Process
We now elaborate on the process we propose in order to extract new knowledge from the available data, and specifically for
the vehicle manufacturing company.
Figure 2 depicts an overview of the approach. Note that here we give a more high-level view than that illustrated in Figure 1.
In the description that follows, we focus on a particular task of knowledge extraction and mining, namely, that of frequent
sequences.
Pre-Processing
As shown in Figure 2, the input to our system are the warranty claims associated with each vehicle. Each warranty claim is
associated with the following pieces of information.
The Vehicle Identification Number (VIN).
The date that the vehicle visited the mechanic for a particular failure (or for a set of failures).
The mileage of the vehicle at the time of the visit to the mechanic.
The ids for all the failed vehicle parts, as identified by the mechanic.
The root cause for each one of the failures, as identified by the mechanic.
The cost of the visit to the mechanic, broken down by part cost and labor cost.
These warranty claims are gathered from several different sources, normalized, and then stored in a database called the warranty claims data store. This database supports the subsequent steps of data analysis.
During the next phase, we pre-process the warranty claim data, in order to put them in a form that is more suitable for the
frequent sequence discovery step. Let D be the database of warranty claims. We logically divide each claim into four parts.
1. The Vehicle Identification Number (VIN),
2. the date of the claim,
3. the identified failure,
4. and the rest of the attributes.
Example 1: Assume that in the warranty claims database we have the (simplified) entries shown in Table 1. In this case, the
only sequence of failures that is common among more than one of the vehicles is: (300, 500), 450. In this sequence, the first
two failures co-occur, that is, they happen on the same date. The vehicles that exhibit the above sequence are the vehicles
with VIN numbers 1 and 2. Note that in the series of failures for vehicle 2, there is another failure, 150, that takes place
between (300, 500) and 450. This is allowed in our analysis.
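The following Python sketch shows this pre-processing step on hypothetical claim records that reproduce Example 1: failures reported on the same date are grouped into one itemset, and the itemsets of each vehicle are ordered by date.

from collections import defaultdict

def build_failure_sequences(claims):
    # claims: iterable of (vin, date, failure_code) tuples;
    # failures reported on the same date form one itemset (they co-occur)
    by_vin = defaultdict(lambda: defaultdict(set))
    for vin, date, failure in claims:
        by_vin[vin][date].add(failure)
    sequences = {}
    for vin, by_date in by_vin.items():
        # ISO-formatted date strings sort chronologically
        sequences[vin] = [frozenset(by_date[d]) for d in sorted(by_date)]
    return sequences

# hypothetical claims mirroring Example 1: vehicles 1 and 2 share the
# pattern (300, 500) followed by 450; vehicle 2 has 150 in between
claims = [(1, "2005-02-01", 300), (1, "2005-02-01", 500),
          (1, "2005-05-10", 450),
          (2, "2005-03-03", 300), (2, "2005-03-03", 500),
          (2, "2005-04-20", 150), (2, "2005-06-15", 450)]
print(build_failure_sequences(claims))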
In addition to the information mentioned above, which is relevant to the entire pattern, we also associate some information to
each particular failure (of each failure pattern).
Five most common cause-codes for the failure (among the cause-codes of the particular failure, for all the vehicles
that support the failure pattern).
Minimum, maximum, average, and standard deviation of the mileage at which the failure occurred (among the
mileages when the particular failure occurred, for all the vehicles that support the failure pattern).
Minimum, maximum, average, and standard deviation of the replacement part cost for the failure (among the part
costs for the particular failure, for all the vehicles that support the failure pattern).
Minimum, maximum, average, and standard deviation of the labor part cost for the failure (among the labor costs for
the particular failure, for all the vehicles that support the failure pattern).
Obviously, these kinds of queries have the power to turn the output of the frequent failure patterns approach into actionable
knowledge. They can help interpret failure symptoms, identify underlying problems, and plan corrective actions.
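As an example of such a query, the following Python/SQLite sketch computes the mileage statistics for one failure of one pattern. The table and column names are illustrative assumptions, not the actual schema of our results database; since SQLite has no built-in standard deviation, it is derived from AVG(x) and AVG(x*x).

import math
import sqlite3

# a minimal sketch, assuming a results database with tables like:
#   pattern_support(pattern_id, vin)   -- vehicles supporting a pattern
#   claims(vin, failure, mileage, part_cost, labor_cost)
conn = sqlite3.connect("mining_results.db")

def failure_stats(pattern_id, failure):
    # min/max/avg/std of the mileage at which `failure` occurred,
    # over all vehicles that support the given frequent failure pattern
    row = conn.execute(
        """SELECT MIN(c.mileage), MAX(c.mileage),
                  AVG(c.mileage), AVG(c.mileage * c.mileage)
           FROM claims c
           JOIN pattern_support p ON p.vin = c.vin
           WHERE p.pattern_id = ? AND c.failure = ?""",
        (pattern_id, failure)).fetchone()
    mn, mx, avg, avg_sq = row
    if avg is None:           # no supporting vehicles found
        return None
    std = math.sqrt(max(avg_sq - avg * avg, 0.0))
    return mn, mx, avg, std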
Another benefit of using the above database is that it enables the use of data analytics and reporting tools (DB2 Alphablox
2011). These tools can connect to a database and export a rich graphical user interface for exploring the available data and for
producing relevant reports.
B. Evaluation Results
We now present the results of applying the proposed approach to analyze data coming from two different companies in the
vehicle manufacturing industry.
The data refer to warranty claims made for vehicles of particular models during one year (2005). The first dataset
we examined includes almost 2,000,000 records of warranty claims, referring to almost 1,000 different failure reasons. These
claims were made by approximately 250,000 unique vehicles, corresponding to more than 100 different vehicle models.
We should note that this dataset is skewed. For example, 35% of the claims are on vehicles that only failed once, but there are
also vehicles that failed more than 10 times during the same time period (i.e., one year). Another example is that 90% of the
engine claims refer to 5 specific failures (out of the 167 possible failures).
The second dataset is smaller in size (300,000 warranty claims), but has the same overall characteristics. For the sake of
brevity, in this paper we only present results on the first dataset. As mentioned earlier, we provided the interested parties with
the toolset to analyze the available data for frequent sequences of failures, and subsequently, to explore the results and view
them from several different angles. In the following paragraphs, we discuss some sample results, and we show screenshots
from the procedure.
1) Ranking by the difference of part and labor cost for a specific type of failure
In this example, we are focusing on failure sequences that involve engine failures, for which the labor cost is more expensive
than the part cost. Such sequences correspond to an interesting subset of the warranty claims, those for which the part cost is
very small, while the labor cost is much higher (around $1,200). Knowledge of these sequences can be useful when
redesigning an engine, so as to make sure that the labor cost for repairing the same kind of failures is reduced (e.g., by
making particular parts of the engine more easily accessible).
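A sketch of the selection logic behind this report is given below; the field names on the enriched patterns are illustrative, not our actual schema. Ranking by the difference between average labor and part cost surfaces exactly the sequences described above, where an inexpensive part hides an expensive repair.

def rank_by_labor_minus_part(patterns, engine_failures):
    # patterns: enriched failure patterns; p["sequence"] is a tuple of
    # itemsets (tuples of failure codes); engine_failures: set of codes
    def involves_engine(p):
        return any(f in engine_failures
                   for step in p["sequence"] for f in step)
    ranked = [p for p in patterns
              if involves_engine(p)
              and p["avg_labor_cost"] > p["avg_part_cost"]]
    ranked.sort(key=lambda p: p["avg_labor_cost"] - p["avg_part_cost"],
                reverse=True)
    return ranked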
2) Focused Reports
We now turn our attention to some specific failure sequences. In the following examples, we concentrate on some frequent
failure sequences of interest, and we analyze in detail the characteristics of the vehicles that support these failure sequences.
These are sequences that may be important to the analyst, because they involve problems on some parts that are under close
examination, or because they appear too often, or just because of their sheer cost.
In all cases, the detailed statistics reveal useful information about the vehicles that exhibit these problems, such as the
predominant vehicle or engine model, the plant and year where most of these vehicles were manufactured, the main cause of
the problems, and others.
It is important to note that the following are just two indicative examples from the many that we used during our case study.
The proposed approach enables the analysts to focus on any aspect that is deemed important for their analysis.
failure X                      failure Y
cause code          %          cause code          %
inoperative        72          leaking            64
shorted            13          rubs                9
leaking             7          broken              4
Table 2. Example 1: Cause code break-down. (All the code ids and descriptions have been replaced by dummy values for reasons of anonymity.)
In Table 3, we can see yet another view of the same data. This table shows the years during which most of the vehicles that
exhibit the specific failure sequence were manufactured, their model, the factory where they were manufactured, and the type
of engine they have. We observe that many more of the failed vehicles were manufactured in 2003 than in 2002.
The same is true when we compare factory A1 to factory A3. These results can help the analyst understand the root cause of
the problems, and take corrective actions.
3) User Interface
We now present examples from the graphical user interface we have developed. In our case study, we used the Alphablox e-
business analysis application (DB2 Alphablox 2011), but any data analytics and reporting platform can be used instead.
In Figure 3 we illustrate the distribution of the average total cost (i.e., replacement part cost plus labor cost) for repairing the
failures in the failure sequence (y-axis), as a function of the number of failures in the sequence (x-axis). Remember that each
failure sequence refers to a single vehicle. This means that there are several vehicles that repeatedly failed (up to thirteen
times) during the time period represented by our data, which is approximately one year. What is interesting to note in this
graph is that it reveals a trend regarding the cost for fixing those failures. When there are a few failures, the total cost is split
between the part and the labor costs. However, as the number of failures grows, the labor cost dominates the total cost of
fixing these failures. The reason for this trend is that the parts per se are in good working condition (so there is little or no replacement part cost); rather, some human error (perhaps during an earlier visit to the mechanic) is causing the vehicle to malfunction.
Figure 3. Average labor and part cost vs total number of failures. (All the code ids and descriptions have been
replaced by dummy values for reasons of anonymity.)
Figure 4 depicts two screen-captures that show the process of selecting a particular view for displaying the results of the
frequent failure sequences analysis. In this particular example, the analyst is interested in getting information about failures
that involve a specific engine type. Therefore, in the picture on the top (Figure 4(a)), the analyst is selecting a view that will
display results for engine type “E”. In the next picture (Figure 4(b)), the system is presenting the analyst with detailed
information about the exact sequences of failures that are occurring most often for the engine type “E”. As we can see, these
sequences of failures have occurred to thousands of vehicles (see second column) within a single year.
Figure 4. Output showing a particular view of the frequent failure sequence details. (All the code ids and descriptions
have been replaced by dummy values for reasons of anonymity.)
C. Discussion
As discussed earlier, the analyst can select one of the frequent failure sequences, and get even more detailed information on
the vehicles that exhibit that particular failure sequence. For example, the analyst can see a breakdown of the causes of the failures, the year the vehicles were manufactured, the factory where they were manufactured, and others.
When following the process we describe in this work, the analyst can efficiently and effectively focus on the most relevant
and important problems. The analyst is able to view, prioritize, and evaluate the corresponding information according to
different criteria, such as the cost (relating to the financial aspect of the business), or the frequency of failures (relating to
customer satisfaction and the marketing aspect of the business). It is for the above reasons that we claim that the proposed
framework leads to actionable knowledge. Our approach makes it possible to quickly analyze huge collections of data, and
provide the different interested parties in the business with focused and specific evidence as to which processes are
malfunctioning and what the most probable causes may be.
The effectiveness of the proposed approach was also validated by analysts from the two vehicle manufacturing companies
with whom we collaborated in our case studies. The processes they were previously following for analyzing their data could
not produce results of equivalently high quality and rich content. After the adoption of our framework, they were able to cut
down on the time spent on data analysis and interpretation to a small fraction of the time that was previously necessary (from
more than 45 days down to a few days). In addition, our framework enabled them to perform more focused analyses and deliver higher-impact reports.
Our experience with the real case studies showed that shortening the vehicle warranty resolution cycle by 10 days can save an
Original Equipment Manufacturer (OEM) around $300m and reduce the total number of warranty claims by 5%. Early
evidence also suggests that the results of our analysis can be useful for re-designing the problematic parts of vehicles, in
order to reduce, or even avoid altogether, some of the occurring failures and sequences of failures.
Finally, we note that the proposed framework is not in competition with other data mining methodologies (e.g., CRISP-DM (Cross Industry Standard Process for Data Mining 2011) or SEMMA (SAS Institute 2008)), but is rather complementary to them.
In particular, what we describe in this paper is a specific example of how these generic data mining methodologies could be
formalized and implemented, focusing on the steps of data modeling and result presentation. (We note that these steps are
relevant to the data preparation, modeling, and evaluation steps of CRISP-DM, and to the modify, model, and assess steps of
SEMMA.) Our work could also help in the direction of standardizing the interfaces of different commercial data mining tools
(e.g., (DB2 Intelligent Miner 2011; IBM SPSS Modeler 2011; Microsoft SQL Server Business Intelligence 2011; Oracle
Data Mining 2011; SAS Enterprise Miner 2011)) for the purpose of helping business analysts to explore the produced results
along different dimensions. In our case study, we investigated such a report generation technique for a set of queries related
to sequences of vehicle failures.
5. CONCLUSIONS
Data mining algorithms have been used for the past several years in various different domains, because they provide the
ability to analyze the huge collections of data now available, and gain insight into the processes that generate these data.
Nevertheless, these algorithms produce results that are not easy for non-expert users to interpret, and are thus not easy for business analysts to use.
In this study, we present a framework for enriching the results of the data mining process with additional information
necessary for the business analyst. This information is relevant to different aspects of the data mining results, and enables the
analyst to manipulate and interpret these results in a principled and systematic way.
We experimentally evaluated our approach with real-world case studies, which demonstrated that the proposed framework has value in the e-commerce context. It offers a systematic way to convert the data mining results into actionable knowledge that the business analyst can use to improve the business operations. In the context of our case studies, this
translated to changes in the design and manufacturing processes in order to avoid expensive warranty claims for specific
vehicle failures.
There are several research directions that we are currently pursuing. We intend to make our framework more general, by
incorporating more data mining algorithms. This will give the business analyst a richer toolset for analyzing the available data,
and more opportunities for discovering new knowledge. We also plan to make our framework easier to deploy, by making the
entire process model-driven.
REFERENCES
Cross Industry Standard Process for Data Mining. http://www.crisp-dm.org/, 2011.
DB2 Alphablox. http://www.alphablox.com/, 2011.
DB2 Intelligent Miner. http://www-306.ibm.com/software/data/iminer/, 2011.
IBM SPSS Modeler. http://www.spss.com/software/modeling/modeler/, 2011.
Microsoft SQL Server Business Intelligence. http://www.microsoft.com/sql/solutions/bi/default.mspx, 2011.
Oracle Data Mining. http://www.oracle.com/technology/products/bi/odm/, 2011.
Orange. http://www.ailab.si/orange/, 2011.
Predictive Model Markup Language. http://www.dmg.org/pmml-v3-0.html/, 2011.
RapidMiner. http://rapid-i.com/, 2011.
SAS Enterprise Miner. http://www.sas.com/technologies/analytics/datamining/miner/, 2011.
Weka. http://www.cs.waikato.ac.nz/ml/weka/, 2011.
Aggarwal C. C., Han J., Wang J., and Yu P. S. A framework for projected clustering of high dimensional data streams. In
VLDB, pages 852–863, 2004.
Agrawal R. and Srikant R. Fast Algorithms for Mining Association Rules. In VLDB, pages 487–499, Santiago de Chile,
Chile, Sept. 1994.
Agrawal R. and Srikant R. Mining Sequential Patterns. In ICDE, pages 3–14, Taipei, Taiwan, Mar. 1995.
Ayres J., Flannick J., Gehrke J., and Yiu T. Sequential Pattern Mining Using a Bitmap Representation. In International Conference on Knowledge Discovery and Data Mining, 2002.
Chen M. and Sairamesh J. Ranking-Based Business Information Processing. In E-Commerce Technology, 2005.
Domeniconi C., Papadopoulos D., Gunopulos D., and Ma S. Subspace clustering of high dimensional data. In SDM, 2004.
Evfimievski A.V., Srikant R., Agrawal R., and Gehrke J. Privacy preserving mining of association rules. Inf. Syst., 29(4):343–364, 2004.
Ganti V., Gehrke J., and Ramakrishnan R. Mining Data Streams under Block Evolution. SIGKDD Explorations, 3(2), 2002.
Han J. and Kamber M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
Hand D.J., Mannila H., and Smyth P. Principles of Data Mining. MIT Press, 2000.
Keogh E., Lonardi S., and Chiu W. Finding Surprising Patterns in a Time Series Database In Linear Time and Space. In International Conference on Knowledge Discovery and Data Mining, pages 550–556, Edmonton, Canada, July 2002.
Li Y.-H. and Sun L.-Y. Study and Applications of Data Mining to the Structure Risk Analysis of Customs Declaration Cargo. In ICEBE, pages 761–764, 2005.
MacQueen J. Some Methods for Classification and Analysis of Multivariate Observations. In Berkeley Symposium on Mathematical Statistics and Probability, 1967.
Manerikar N. and Palpanas T. Frequent Items in Streaming Data: An Experimental Evaluation of the State-of-the-Art. Data and Knowledge Engineering (DKE), 68, 2009.
Mannila H., Toivonen H., and Verkamo A.I. Discovery of frequent episodes in event sequences. Technical Report C-1997-15, Department of Computer Science, University of Helsinki, 1997.
Ng R.T., Lakshmanan L.V.S., Han J., and Pang A. Exploratory Mining and Pruning Optimizations of Constrained Associations Rules. In ACM SIGMOD International Conference, Seattle, WA, USA, June 1998.
Palpanas T. Knowledge Discovery in Data Warehouses. ACM SIGMOD Record, 29(3):88–100, 2000.
Palpanas T. and Koudas N. Entropy Based Approximate Querying and Exploration of Datacubes. SSDBM, Fairfax, VA, USA, July 2001.
Palpanas T., Koudas N., and Mendelzon A. Using Datacube Aggregates for Approximate Querying and Deviation Detection. IEEE Transactions on Knowledge and Data Engineering, 17(11), 2005.
Palpanas T. and Sairamesh J. Knowledge Mining for the Business Analyst. In DEXA, 2008.
SAS Institute. From Data to Business Advantage: Data Mining, the SEMMA Methodology and the SAS System. White paper, SAS Institute Inc., 2008.
Pednault E. Transform Regression and the Kolmogorov Superposition Theorem. Technical Report RC-23227, IBM Research, 2004.
Sarawagi S. User-Adaptive Exploration of Multidimensional Data. In VLDB International Conference, pages 307–316, Cairo, Egypt, Sept. 2000.
Sarawagi S., Agrawal R., and Megiddo N. Discovery-driven Exploration of OLAP Data Cubes. In International Conference on Extending Database Technology, pages 168–182, Valencia, Spain, Mar. 1998.
Sathe G. and Sarawagi S. Intelligent Rollups in Multidimensional OLAP Data. In VLDB International Conference, pages 531–540, Rome, Italy, Sept. 2001.
Srikant R. and Agrawal R. Mining Sequential Patterns: Generalizations and Performance Improvements. In EDBT, pages 3–17, Avignon, France, Mar. 1996.
Tantono F.I., Manerikar N., and Palpanas T. Efficiently Discovering Recent Frequent Items in Data Streams. SSDBM, Hong Kong, China, July 2008.
Yamanishi K., Takeuchi J., Williams G.J., and Milne P. On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms. Data Mining and Knowledge Discovery, 8(3), 2004.
Yang X., Weiying W., Hairong M., and Qingwei S. Design and Implementation of Commerce Data Mining System Based on Rough Set Theory. In ICEBE, pages 258–265, 2005.