Integrating E-Commerce and Data Mining
Integrating E-Commerce and Data Mining
{suhail,ronnyk,lmason,zijian}@bluemartini.com
arXiv:cs.LG/0007026 14 Jul 2000
el
av
Bo g/M
Tr
Cl g
n
s/
i
s
nt
hi
hi
ice
ok
ok
U
chase)?
ua
ot
ot
SK
Bo
Cl
Q
Figure 3. Data record created by the add product E-commerce data contains many date and time
hierarchy transformation. columns. We have found that these date and time col-
umns convey useful information that can reveal im-
portant patterns. However, the common date and time • What characterizes customers that accept
format containing the year, month, day, hour, minute, cross-sells and up-sells?
and second is not often supported by data mining algo- • What characterizes customers that buy
rithms. Most patterns involving date and time cannot quickly?
be directly discovered from this format. To make the
discovery of patterns involving dates and times easier, • What characterizes visitors that do not buy?
we need transformations which can compute the time
difference between dates (e.g., order date and ship
date), and create new attributes representing day-of- Based on our experience, in addition to automatic
week, day-of-month, week, month, quarter, year, etc. data mining algorithms, it is necessary to provide inter-
from date and time attributes. active model modification tools to support business
insight. Models either automatically generated or cre-
Based on the considerations mentioned above, the ated by interactive modifications can then be examined
architecture is designed to support a rich set of trans- or evaluated on test data. The purpose is to let business
formations. We have found that transformations in- users understand their models before deploying them.
cluding: create new attributes, add hierarchy attributes, For example, we have found that for rule models,
aggregate, filter, sample, delete columns, and score are measures such as confidence, lift, and support at the
useful for making analyses easier. individual rule level and the individual conjunct level
are very useful in addition to the overall accuracy of
With transformations described, let us discuss the the model. In our experience, the following function-
analysis tools. Basic reporting is a bare necessity for e- ality is useful for interactively modifying a rule model:
commerce. Through generated reports, business users • Being able to view the segment (e.g., customer
can understand how a web site is working at different segments) defined by a subset of rules or a sub-
levels and from different points of view. Example set of conjuncts of a rule.
questions that can be answered using reporting are: • Being able to manually modify a rule model by
• What are the top selling products? deleting, adding, or changing a rule or individ-
• What are the worst selling products? ual conjunct.
• What are the top viewed pages? For example, a rule model predicting heavy spenders
• What are the top failed searches? contains the rule:
• What are the conversion rates by brand? IF Income > $80,000 AND
• What is the distribution of web browsers? Age <= 31 AND
• What are the top referrers by visit count? Average Session Duration is between
• What are the top referrers by sales amount? 10 and 20.1 minutes AND
• What are the top abandoned products? Account creation date is before
Our experience shows that some reporting questions 2000-04-01
such as the last two mentioned above are very hard to THEN Heavy spender
answer without an integrated architecture that records It is very likely that you wonder why the split on age
both event streams and sales data. occurs at 31 instead of 30 and the split on average ses-
sion duration occurs at 20.1 minutes instead of 20 min-
Model generation using data mining algorithms is a utes. Why does account creation date appear in the
key component of the architecture. It reveals patterns rule at all? A business user may want to change the
about customers, their purchases, page views, etc. By rule to:
generating models, we can answer questions like:
• What characterizes heavy spenders?
• What characterizes customers that prefer pro-
motion X over Y?
IF Income > $80,000 AND 5 Challenges
Age <= 30 AND
In this section we describe several challenging
Average Session Duration is between
problems based on our experiences in mining e-
10 and 20 minutes
commerce data. The complexity and granularity of
THEN Heavy spender
these problems differ, but each represents a real-life
However, before doing so, it is important to see how area where we believe improvements can be made.
this changes the measures (e.g. confidence, lift, and Except for the first two challenges, the problems deal
support) of this rule and the whole rule model. with data mining algorithmic challenges.
Given that humans are very good at identifying Make Data Mining Models Comprehensible to
patterns from visualized data, visualization and OLAP Business Users
tools can greatly help business users to gain insight into
business problems by complementing reporting tools Business users, from merchandisers who make de-
and data mining algorithms. Our experience suggests cisions about the assortments of products to creative
that visualization tools are very helpful in understand- designers who design web sites to marketers who de-
ing generated models, web site operations, and data cide where to spend advertising dollars, need to under-
itself. Figure 4 shows an example of a visualization stand the results of data mining. Summary reports are
tool, which clearly reveals that females aged between easiest to understand and usually easy to provide, espe-
30 and 39 years are heavy spenders (large square), cially for specific vertical domains. Simple visualiza-
closely followed by males aged between 40 and 49 tions, such as bar charts and two-dimensional scatter-
years. plots, are also easy to understand and can provide more
information and highlight patterns, especially if used in
conjunction with color. Few data mining models,
however, are easy to understand. Classification rules
are the easiest, followed by classification trees. A
visualization for the Naïve-Bayes classifier [12] was
also easy for business users to understand in the second
author's past experience.
The challenge is to define more model types (hy-
pothesis spaces) and ways of presenting them to busi-
ness users. What regression models can we come up
with and how can we present them? (Even linear re-
gression is usually hard for business users to under-
stand.) How can we present nearest-neighbor models,
for example? How can we present the results of asso-
ciation rule algorithms without overwhelming users
with tens of thousands of rules (a nice example of this
problem can be found in Berry and Linoff [13] starting
on page 426)?
Square size represents the average purchase amount
Make Data Transformation and Model Building
Figure 4. A visualization tool reveals the purchase Accessible to Business Users
pattern of female and male customers in
different age groups. The ability to answer a question given by a business
user usually requires some data transformations and
technical understanding of the tools. Our experience is but generalizations are likely to be found at higher lev-
that even commercial report designers and OLAP tools els (e.g., families and categories). Some algorithms
are too hard for most business users. Two common have been designed to support tree-structured attributes
solutions are (i) provide templates (e.g., reporting tem- [15], but they do not scale to the large product hierar-
plates, OLAP cubes, and recommended transforma- chies. The challenge is to support such hierarchies
tions for mining) for common questions, something within the data mining algorithms.
that works well in well-defined vertical markets, and
(ii) provide the expertise through consulting or a serv- Scale Better: Handle Large Amounts of Data
ices organization. The challenge is to find ways to
empower business users so that they will be able to Yahoo! had 465 million page views per day in De-
serve themselves. cember of 1999 [16]. The challenge is to find useful
techniques (other than sampling) that will scale to this
Support Multiple Granularity Levels volume of data. Are there aggregations that should be
performed on the fly as data is collected?
Data collected in a typical web site contains records
at different levels of granularity: Support and Model External Events
• Page views are the lowest level with attributes
such as product viewed and duration. External events, such as marketing campaigns (e.g.,
• Sessions include attributes such as browser promotions and media ads), and site redesigns change
used, initiation time, referring site, and cookie patterns in the data. The challenge is to be able to
information. Each session includes multiple model such events, which create new patterns that
page views. spike and decay over time.
• Customer attributes include name, address, and
demographic attributes. Each customer may be Support Slowly Changing Dimensions
involved in multiple sessions.
Mining at the page view level by joining all the session Visitors’ demographics change: people get married,
and customer attributes violates the basic assumption their children grow, their salaries change, etc. With
inherent in most data mining algorithms, namely that these changes, their needs, which are being modeled,
records are independently and identically distributed. change. Product attributes change: new choices (e.g.,
If we are trying to build a model to predict who visits colors) may be available, packaging material or design
page X, and Joe happens to visit it very often, then we change, and even quality may improve or degrade.
might get a rule that if the visitor’s first name is Joe, These attributes that change over time are often re-
they will likely visit page X. The rule will have multi- ferred to as "slowly changing dimensions" [4]. The
ple records (visits) to support it, but it clearly will not challenge is to keep track of these changes and provide
generalize beyond the specific Joe. This problem is support for such changes in the analyses.
shared by mining problems in the telecommunication
domain [14]. The challenge is to design algorithms Identify Bots and Crawlers
that can support multiple granularity levels correctly.
Bots and crawlers can dramatically change click-
Utilize Hierarchies stream patterns at a web site. For example, Keynote
(www.keynote.com) provides site performance meas-
Products are commonly organized in hierarchies: urements. The Keynote bot can generate a request
SKUs are derived from products, which are derived multiple times a minute, 24 hours a day, 7 days a week,
from product families, which are derived from catego- skewing the statistics about the number of sessions,
ries, etc. A product hierarchy is usually three to eight page hits, and exit pages (last page at each session).
levels deep. A customer purchases SKU level items, Search engines conduct breadth first scans of the site,
generating many requests in short duration. Internet References
Explorer 5.0 supports automatic synchronization of
web pages when a user logs in, when the computer is [1] Eric Schmitt, Harley Manning, Yolanda Paul,
idle, or on a specified schedule; it also supports offline and Sadaf Roshan, Commerce Software Takes
browsing, which loads pages to a specified depth from Off, Forrester Report, March 2000.
a given page. These options create additional click- [2] Eric Schmitt, Harley Manning, Yolanda Paul,
streams and patterns. Identifying such bots to filter and Joyce Tong, Measuring Web Success,
their clickstreams is a non-trivial task, especially for Forrester Report, November 1999.
bots that pretend to be real users. [3] Gregory Piatetsky-Shapiro, Ron Brachman,
Tom Khabaza, Willi Kloesgen, and Evangelos
Simoudis, An Overview of Issues in Devel-
6 Summary oping Industrial Data Mining and Knowledge
Discovery Applications, Proceeding of the
We proposed an architecture that successfully inte- second international conference on Knowl-
grates data mining with an e-commerce system. The edge Discovery and Data Mining, 1996.
proposed architecture consists of three main compo- [4] Ralph Kimball, The Data Warehouse Toolkit:
nents: Business Data Definition, Customer Interaction, Practical Techniques for Building Dimen-
and Analysis, which are connected using data transfer sional Data Warehouses, John Wiley & Sons,
bridges. This integration effectively solves several 1996.
major problems associated with horizontal data mining [5] Ralph Kimball, Laura Reeves, Margy Ross,
tools including the enormous effort required in pre- Warren Thornthwaite, The Data Warehouse
processing of the data before it can be used for mining, Lifecycle Toolkit : Expert Methods for De-
and making the results of mining actionable. The tight signing, Developing, and Deploying Data
integration between the three components of the archi- Warehouses, John Wiley & Sons, 1998.
tecture allows for automated construction of a data [6] Robert Cooley, Bamshad Mobashar, and
warehouse within the Analysis component. The shared Jaideep Shrivastava, Data Preparation for
metadata across the three components further simpli- Mining World Wide Web Browsing Patterns,
fies this construction, and, coupled with the rich set of Knowledge and Information Systems, 1, 1999.
mining algorithms and analysis tools (like visualiza- [7] L. Catledge and J. Pitkow, Characterizing
tion, reporting and OLAP) also increases the efficiency browsing behaviors on the World Wide Web,
of the knowledge discovery process. The tight integra- Computer Networks and ISDN Systems, 27(6),
tion and shared metadata also make it easy to deploy 1995.
results, effectively closing the loop. Finally we pre- [8] J. Pitkow, In search of reliable usage data on
sented several challenging problems that need to be the WWW, Sixth International World Wide
addressed for further enhancement of this architecture. Web Conference, 1997.
[9] Shahana Sen, Balaji Padmanabhan, Alexander
Acknowledgments Tuzhilin, Norman H. White, and Roger Stein,
The identification and satisfaction of con-
We would like to thank other members of the data
sumer analysis-driven information needs of
mining and visualization teams at Blue Martini Soft-
marketers on the WWW, European Journal of
ware and our documentation writer, Cindy Hall. We
Marketing, Vol. 32 No. 7/8 1998.
wish to thank our clients for sharing their data with us
[10] Osmar R. Zaiane, Man Xin, and Jiawei Han,
and helping us refine our architecture and improve
Discovering Web Access Patterns and Trends
Blue Martini’s products.
by Applying OLAP and Data Mining Tech-
nology on Web Logs, Proceedings of Ad-
vances in Digital Libraries Conference
(ADL’98), Santa Barbara, CA, 1998.
[11] Stephen Gomory, Robert Hoch, Juhnyoung
Lee, Mark Podlaseck, Edith Schonberg,
Analysis and Visualization of Metrics for On-
line Merchandizing, Proceedings of
WEBKDD’99, Springer 1999.
[12] Barry Becker, Ron Kohavi, and Dan Sommer-
field, Visualizing the Simple Bayesian Classi-
fier, KDD Workshop on Issues in the Integra-
tion of Data Mining and Data Visualization,
1997.
[13] Michael J. A. Berry and Gordon Linoff, Data
Mining Techniques: For Marketing, Sales,
and Customer Support, John Wiley & Sons,
2000.
[14] Saharon Rosset, Uzi Murad, Einat Neumann,
Yizhak Idan, and Gadi Pinkas, Discovery of
Fraud Rules for Telecommunications: Chal-
lenges and Solutions, Proceedings of the Fifth
ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 1999.
[15] Hussein Almuallim, Yasuhiro Akiba, and Shi-
geo Kaneda, On Handling Tree-Structured
Attributes, Proceedings of the Twelfth Inter-
national Conference on Machine Learning,
p.12--20, 1995.
[16] CFO Magazine, April 2000.