
Integrating E-Commerce and Data Mining:

Architecture and Challenges


Suhail Ansari, Ron Kohavi, Llew Mason, and Zijian Zheng

Blue Martini Software


2600 Campus Drive
San Mateo, CA, 94403, USA

{suhail,ronnyk,lmason,zijian}@bluemartini.com
arXiv:cs.LG/0007026 14 Jul 2000

Abstract

We show that the e-commerce domain can provide all the right ingredients for successful data mining and claim that it is a killer domain for data mining. We describe an integrated architecture, based on our experience at Blue Martini Software, for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data understanding effort often documented to take 80% of the time in knowledge discovery projects. We emphasize the need for data collection at the application server layer (not the web server) in order to support logging of data and metadata that is essential to the discovery process. We describe the data transformation bridges required from the transaction processing systems and customer event streams (e.g., clickstreams) to the data warehouse. We detail the mining workbench, which needs to provide multiple views of the data through reporting, data mining algorithms, visualization, and OLAP. We conclude with a set of challenges.

1 Introduction

E-commerce is growing fast, and with this growth companies are willing to spend more on improving the online experience. In Commerce Software Takes Off [1], the authors from Forrester Research wrote that online business-to-consumer retail spending in 1999 was $20.3 billion and is estimated to grow to $144 billion by 2003. Global 2500 companies will spend 72% more on e-commerce in 2000 than they did in 1999. Existing sites are using primitive measures, such as page views, but the need for more serious analysis and personalization is growing quickly with the need to differentiate. In Measuring Web Success [2], the authors claim that "Leaders will use metrics to fuel personalization" and that "firms need web intelligence, not log analysis."

Data mining tools aid the discovery of patterns in data. (In this paper, we use the term data mining to denote the wider process, sometimes called knowledge discovery, which includes multiple disciplines, such as preprocessing, reporting, exploratory analysis, visualization, and modeling.) Until recently, companies that concentrated on building horizontal data mining modeling tools had little commercial success. Many companies were bought, including the acquisition of Compression Sciences by Gentia for $3 million, HyperParallel by Yahoo for about $2.3 million, Clementine by SPSS for $7 million, and Thinking Machines' Darwin by Oracle for less than $25 million. Recently, a phase shift has occurred in the valuation of such companies, and recent acquisitions have given rise to valuations 10 to 100 times higher. KD1 was acquired by Net Perceptions for $116M, RightPoint (previously DataMind) was acquired by E.piphany for $400M, DataSage was acquired by Vignette for $577M, and NeoVista was acquired by Accrue for $140M. The shift in valuations indicates wider recognition of the value of data mining modeling techniques for e-commerce.
E-commerce is the killer domain for data mining. It is ideal because many of the ingredients required for successful data mining are easily satisfied: data records are plentiful, electronic collection provides reliable data, insight can easily be turned into action, and return on investment can be measured. To really take advantage of this domain, however, data mining must be integrated into the e-commerce systems with the appropriate data transformation bridges from the transaction processing system to the data warehouse and vice versa. Such integration can dramatically reduce the data preparation time, known to take about 80% of the time to complete an analysis [3]. An integrated solution can also provide users with a uniform user interface and seamless access to metadata.

The paper is organized as follows. Section 2 describes the integrated architecture that we propose, explaining the main components and the bridges connecting them. Section 3 details the data collector, which must collect much more data than what is available using web server log files. Section 4 describes the analysis component, which must provide a breadth of data transformation facilities and analysis tools. We describe a set of challenging problems in Section 5, and conclude with a summary in Section 6.

2 Integrated Architecture

In this section we give a high-level overview of a proposed architecture for an e-commerce system with integrated data mining. Details of the most important parts of the architecture and their advantages appear in the following sections. The described system is an ideal architecture based on our experiences at Blue Martini Software. However, we make no claim that everything described here is implemented in Blue Martini Software's products. In our proposed architecture there are three main components: Business Data Definition, Customer Interaction, and Analysis. Connecting these components are three data transfer bridges: Stage Data, Build Data Warehouse, and Deploy Results. The relationship between the components and the data transfer bridges is illustrated in Figure 1. Next we describe each component in the architecture and then the bridges that connect these components.

Figure 1. Proposed high-level system architecture.
In the Business Data Definition component the e-commerce business user defines the data and metadata associated with their business. This data includes merchandising information (e.g., products, assortments, and price lists), content information (e.g., web page templates, articles, images, and multimedia) and business rules (e.g., personalized content rules, promotion rules, and rules for cross-sells and up-sells). From a data mining perspective the key to the Business Data Definition component is the ability to define a rich set of attributes (metadata) for any type of data. For example, products can have attributes like size, color, and targeted age group, and can be arranged in a hierarchy representing categories like men's and women's, and subcategories like shoes and shirts. As another example, web page templates can have attributes indicating whether they show products, search results, or are used as part of the checkout process. Having a diverse set of available attributes is not only essential for data mining, but also for personalizing the customer experience.

The Customer Interaction component provides the interface between customers and the e-commerce business. Although we use the example of a web site throughout this paper, the term customer interaction applies more generally to any sort of interaction with customers. This interaction could take place through a web site (e.g., a marketing site or a web store), customer service (via telephony or email), a wireless application, or even a bricks-and-mortar point of sale system. For effective analysis of all of these data sources, a data collector needs to be an integrated part of the Customer Interaction component. To provide maximum utility, the data collector should not only log sale transactions, but it should also log other types of customer interactions, such as web page views for a web site. Further details of the data collection architecture for the specific case of a web site are described in Section 3.

To illustrate the utility of this integrated data collection, let us consider the example of an e-commerce company measuring the effectiveness of its web banner advertisements on other sites geared at attracting customers to its own site. A similar analysis can be applied when measuring the effectiveness of advertising or different personalizations on its own site.

The cost of a web banner advertisement is typically based on the number of "click-throughs." That is, there is a fee paid for each visitor who clicks on the banner advertisement. Many e-commerce companies measure the effectiveness of their web banner advertisements using the same metric, the number of click-throughs, and thus fail to take into account the sales generated by each referred visitor. If the goal is to sell more products then the site needs to attract buyers rather than browsers. A recent Forrester Research report [2] stated that "Using hits and page views to judge site success is like evaluating a musical performance by its volume." In practice, we have seen the ratio of generated sales to click-throughs vary by as much as a factor of 20 across a company's web banner advertisements. One advertisement generated five times as much in sales as another advertisement, even though click-throughs from the former advertisement were one quarter of the clickstreams from the latter. The ability to measure this sort of relationship requires conflation of multiple data sources.
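Once click-throughs and orders sit in the same warehouse, the comparison above reduces to a join and a ratio. The following is a minimal sketch of that computation, not taken from the paper; the table layout and field names (banner_id, session_id, amount) are illustrative assumptions.

    from collections import defaultdict

    # Hypothetical extracts from the data warehouse: referred sessions and orders.
    click_throughs = [                      # one row per visitor referred by a banner
        {"session_id": "s1", "banner_id": "spring_sale"},
        {"session_id": "s2", "banner_id": "spring_sale"},
        {"session_id": "s3", "banner_id": "free_shipping"},
    ]
    orders = [                              # one row per order placed in a session
        {"session_id": "s2", "amount": 150.00},
        {"session_id": "s3", "amount": 20.00},
    ]

    # Conflate the two sources: total sales and click-throughs per banner.
    sales_by_session = defaultdict(float)
    for order in orders:
        sales_by_session[order["session_id"]] += order["amount"]

    clicks = defaultdict(int)
    sales = defaultdict(float)
    for ct in click_throughs:
        clicks[ct["banner_id"]] += 1
        sales[ct["banner_id"]] += sales_by_session[ct["session_id"]]

    for banner in clicks:
        ratio = sales[banner] / clicks[banner]   # generated sales per click-through
        print(f"{banner}: {clicks[banner]} click-throughs, "
              f"${sales[banner]:.2f} in sales, ${ratio:.2f} per click-through")

With real data the same grouping would normally be pushed into the warehouse itself, but the shape of the computation is the same.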
The Analysis component provides an integrated environment for decision support utilizing data transformations, reporting, data mining algorithms, visualization, and OLAP tools. The richness of the available metadata gives the Analysis component significant advantages over horizontal decision support tools, in both power and ease-of-use. For instance, the system automatically knows the type of each attribute, including whether a discrete attribute's values are ordered, whether the range of a continuous attribute is bounded, and textual descriptions. For a web site, the system knows that each customer has web sessions and that each web session includes page views and orders. This makes it a simple matter to compute aggregate statistics for combinations of customers, sessions, page views, and orders automatically. We examine the integrated analysis component in more detail in Section 4.

The Stage Data bridge connects the Business Data Definition component to the Customer Interaction component. This bridge transfers (or stages) the data and metadata into the Customer Interaction component. Having a staging process has several advantages, including the ability to test changes before having them implemented in production, allowing for changes in the data formats and replication between the two components for efficiency, and enabling e-commerce businesses to have zero down-time.

The Build Data Warehouse bridge links the Customer Interaction component with the Analysis component. This bridge transfers the data collected within the Customer Interaction component to the Analysis component and builds a data warehouse for analysis purposes. The Build Data Warehouse bridge also transfers all of the business data defined within the Business Data Definition component (which was transferred to the Customer Interaction component using the Stage Data bridge). The data collector in the Customer Interaction component is usually implemented within an On-Line Transaction Processing (OLTP) system, typically designed using entity-relation modeling techniques. OLTP systems are geared towards efficient handling of a large number of small updates and short queries. This is critical for running an e-commerce business, but is not appropriate for analysis [4, 5], which usually requires full scans of several very large tables and a star schema design which business users can understand. For data mining, we need to build a data warehouse using dimensional modeling techniques. Both the data warehouse design and the data transfer from the OLTP system to the data warehouse system are very complex and time-consuming tasks. Making the construction of the data warehouse an integral part of the architecture significantly reduces the complexity of these tasks. In addition to typical ETL (Extract, Transform and Load) functionality, the bridge supports import and integration of data from both external systems and syndicated data providers (e.g., Acxiom). Since the schema in the OLTP system is controlled by the architecture, we can automatically convert the OLTP schema to a multi-dimensional star schema that is optimized for analysis.

The last bridge, Deploy Results, is the key to "closing the loop" and making analytical results actionable. It provides the ability to transfer models, scores, results and new attributes constructed using data transformations back into the Business Data Definition and Customer Interaction components for use in business rules for personalization. For example, customers can be scored on their propensity to accept a cross-sell and the site can be personalized based on these scores. This is arguably the most difficult part of the knowledge discovery process to implement in a non-integrated system. However, the shared metadata across all three components means that results can be directly reflected in the data which defines the e-commerce company's business.

3 Data Collection

This section describes the data collection component of the proposed architecture. This component logs customers' transactions (e.g., purchases and returns) and event streams (e.g., clickstreams). While the data collection component is a part of every customer touch point (e.g., web site, customer service applications, and wireless applications), in this section we will describe in detail the data collection at the web site. Most of the concepts and techniques mentioned in this section could be easily extended to other customer touch points.

3.1 Clickstream Logging

Most e-commerce architectures rely on web server logs or packet sniffers as a source for clickstream data. While both these systems have the advantage of being non-intrusive, allowing them to "bolt on" to any e-commerce application, they fall short in logging high-level events and lack the capability to exploit metadata available in the application. A typical web log contains data such as the page requested, time of request, client HTTP address, etc., for each web server request. For each page that is requested from the web server, there are a huge number of requests for images and other content on the page. Since all of these are recorded in the web server logs, most of the data in the logs relates to requests for image files that are mostly useless for analysis and are commonly filtered out. All these requests need to be purged from the web logs before they can be used. Because of the stateless nature of HTTP, each request in a web log appears independent of other requests, so it becomes extremely difficult to identify users and user sessions from this data [6, 7, 8, 9]. Since the web logs only contain the name of the page that was requested, these page names have to be mapped to the content, products, etc., on the page. This problem is further compounded by the introduction of dynamic content, where the same page can be used to display different content for each user. In this case, details of the content displayed on a web page may not even be captured in the web log. The mechanism used to send request data to the server also affects the information in the web logs. If the browser sends a request using the "POST" method, then the input parameters for this request are not recorded in the web log.

Packet sniffers try to collect similar data by looking at data "on the wire." While packet sniffers can "see" more data than what is present in web logs, they still have problems identifying users (e.g., the same visitor logging in from two different machines) and sessions. Also, given the myriad ways in which web sites are designed, it is extremely difficult to extract logical business information by looking at data streaming across a wire. To further complicate things, packet sniffers can't see the data in areas of the site that are encoded for secure transmission and thus have difficulty working with sites (or areas of a site) that use SSL (Secure Socket Layer). Such areas of a site are the most crucial for analysis, including checkout and forms containing personal data. In many financial sites, including banks, the entire site is secure, thus making packet sniffers that monitor the encrypted data blind and essentially useless, so the sniffers must be given access to data prior to encryption, which complicates their integration.
Collecting data at the application server layer can effectively solve all these problems. Since the application server serves the content (e.g., images, products and articles), it has detailed knowledge of the content being served. This is true even when the content is dynamically generated or encoded for transmission using SSL. Application servers use cookies (or URL encoding in the absence of cookies) to keep track of a user's session, so "sessionizing" the clickstream is trivial. Since the application server also keeps track of the user, using login mechanisms or cookies, associating the clickstream with a particular visitor is simple. The application server can also be designed to keep track of information absent in web server logs, including pages that were aborted (user pressed the "stop" button while the page was being downloaded), local time of the user, speed of the user's connection and if the user had turned their cookies off. This method of collecting clickstream data has significant advantages over both web logs and packet sniffers.
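To make the sessionizing point concrete, below is a rough sketch of what a request handler at the application server layer might log; the cookie name, record fields, and log format are our assumptions for illustration, not a description of any particular product.

    import json
    import sys
    import uuid
    from datetime import datetime, timezone

    SESSION_COOKIE = "session_id"   # assumed cookie name

    def handle_request(cookies, user_id, page, log_file):
        """Serve one page request and append an enriched clickstream record.

        `page` carries the metadata the application server already knows
        (template, products shown, etc.), which never appears in web server logs.
        """
        session_id = cookies.get(SESSION_COOKIE)
        if session_id is None:                       # first request: start a session
            session_id = uuid.uuid4().hex
            cookies[SESSION_COOKIE] = session_id     # would be sent back as a cookie

        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "session_id": session_id,                # sessionizing is free at this layer
            "user_id": user_id,                      # known from login, if any
            "template": page.get("template"),        # e.g. "product", "checkout"
            "products_shown": page.get("products", []),
        }
        log_file.write(json.dumps(record) + "\n")
        return session_id

    # Example: log one anonymous product-page view to standard output.
    handle_request({}, None, {"template": "product", "products": ["SKU-4"]}, sys.stdout)

Because the server issues the session cookie itself and already knows which template and products it is rendering, each record arrives pre-sessionized and pre-mapped to content.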
3.2 Business Event Logging

The clickstream data collected from the application server is rich and interesting; however, significant insight can be gained by looking at subsets of requests as one logical event or episode [6, 10]. We call these aggregations of requests business events. Business events can also be used to describe significant user actions like sending an email or searching [2]. Since the application server has to maintain the context of a user's session and related data, the application server is the logical choice for logging these business events. Business events can be used to track things like the contents of abandoned shopping carts, which are extremely difficult to track using only clickstream data.

Business events also enable marketers to look beyond page hit rates to micro-conversion rates [11]. A micro-conversion rate is defined for each step of the purchasing process as the fraction of products that are successfully carried through to the next step of the purchasing process. Two examples of these are the fraction of product views that resulted in the product being added to the shopping cart and the fraction of products in the shopping cart that successfully passed through each phase of the checkout process. Thus the integrated approach proposed in this architecture gives marketers the ability to look directly at product views, content views, and product sales, a capability far more powerful than just page views and click-throughs. Some interesting business events that help with the analysis given above and are supported by the architecture are:
• Add/Remove item to/from shopping cart
• Initiate checkout
• Finish checkout
• Search event
• Register event
The search keywords and the number of results for each of these searches that can be logged with the search events give marketers significant insight into the interests of their visitors and the effectiveness of the search mechanism.
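With events like these logged, micro-conversion rates reduce to ratios of step counts. A minimal sketch (the event names below are our illustrative assumptions, not the architecture's actual event vocabulary):

    from collections import Counter

    # Hypothetical business-event stream: (event name, product SKU) pairs.
    events = [
        ("product_view", "SKU-4"), ("product_view", "SKU-7"), ("product_view", "SKU-9"),
        ("add_to_cart", "SKU-4"), ("add_to_cart", "SKU-7"),
        ("initiate_checkout", "SKU-4"), ("finish_checkout", "SKU-4"),
    ]

    counts = Counter(event for event, _sku in events)
    steps = ["product_view", "add_to_cart", "initiate_checkout", "finish_checkout"]

    # Micro-conversion rate per step: fraction of products carried to the next step.
    for this_step, next_step in zip(steps, steps[1:]):
        rate = counts[next_step] / counts[this_step] if counts[this_step] else 0.0
        print(f"{this_step} -> {next_step}: {rate:.0%}")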
3.3 Measuring Personalization Success

The architecture also supports a rules engine that runs on the web site for personalization. Rules can be deployed for offering promotions to visitors, displaying specific products or content to a specific visitor, etc. After the rules are deployed, business events can be used to track the effect of deploying these rules. A business event can be collected each time that a rule is used in personalization, and these events, coupled with the shopping-cart/checkout events, can give an excellent estimate of the effectiveness of each rule. The architecture can also use control groups so that personalization rules are only activated for a fraction of the target visitors. This enables analysts to directly look at sales or results for visitors when the rules were and were not activated.

Similar data collection techniques can be used for all the customer touch points like customer service representatives, wireless applications, etc. Collecting the right data is critical to effective analysis of an e-commerce operation.
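Read concretely, the control-group idea can be as simple as activating a rule only for a hash-determined fraction of targeted visitors and then comparing per-visitor sales between the two groups. The sketch below is our own illustration of that reading; the activation fraction and field names are assumptions.

    import hashlib

    ACTIVATION_FRACTION = 0.8   # assumed: the rule fires for 80% of targeted visitors

    def in_treatment_group(visitor_id, rule_id):
        """Deterministically assign a visitor to treatment or control for one rule."""
        digest = hashlib.sha256(f"{rule_id}:{visitor_id}".encode()).hexdigest()
        return (int(digest, 16) % 1000) / 1000.0 < ACTIVATION_FRACTION

    def average(values):
        return sum(values) / len(values) if values else 0.0

    def rule_lift(visitor_sales, rule_id):
        """Difference in average sales between exposed and held-out visitors."""
        treated, control = [], []
        for visitor_id, sales in visitor_sales.items():
            (treated if in_treatment_group(visitor_id, rule_id) else control).append(sales)
        return average(treated) - average(control)

    # Per-visitor sales observed after the rule was deployed (toy numbers).
    print(rule_lift({"v1": 40.0, "v2": 0.0, "v3": 75.0, "v4": 10.0}, "cross_sell_rule_7"))

Deterministic hashing keeps each visitor in the same group across sessions, so the comparison stays clean over time.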
4 Analysis

This section describes the analysis component of our architecture. We start with a discussion of data transformations, followed by analysis techniques including reporting, data mining algorithms, visualization, and OLAP. The data warehouse is the source data of analyses in our architecture. Although dimensional modeling is usually a prerequisite for analysis, our experience shows that many analyses require additional data transformations that convert the data into forms more amenable to data mining.

As we mentioned earlier, the business user can define product, promotion, and assortment hierarchies in the Business Data Definition component. Figure 2 gives a simple example of a product hierarchy. This hierarchical information is very valuable for analysis, but few existing data mining algorithms can utilize it directly. Therefore, we need data transformations to convert this information to a format that can be used by data mining algorithms. One possible solution is to add a column indicating whether the item falls under a given node of the hierarchy. Let us use the product hierarchy shown in Figure 2 as an example. For each order line or page request containing a product SKU (Stock Keeping Unit), this transformation creates a Boolean column corresponding to each selected node in the hierarchy. It indicates whether this product SKU belongs to the product category represented by the node. Figure 3 shows the enriched row from this operation.

Figure 2. An example hierarchy of products.

Figure 3. Data record created by the add product hierarchy transformation (e.g., a row with the values 4, $12, and four Boolean hierarchy flags T, T, F, F).
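A minimal version of this add product hierarchy transformation, assuming the hierarchy is stored as a child-to-parent map; the node names below only loosely follow Figure 2 and are otherwise our own:

    # Child -> parent map for a toy product hierarchy (loosely following Figure 2).
    PARENT = {
        "SKU-4": "Shirts", "Shirts": "Clothing", "Shoes": "Clothing",
        "Travel": "Books", "Clothing": "All Products", "Books": "All Products",
    }
    SELECTED_NODES = ["Books", "Clothing", "Travel", "Shirts"]  # nodes chosen for analysis

    def ancestors(sku):
        """All hierarchy nodes above the SKU."""
        result, node = set(), sku
        while node in PARENT:
            node = PARENT[node]
            result.add(node)
        return result

    def add_hierarchy_attributes(order_line):
        """Enrich an order line with one Boolean column per selected hierarchy node."""
        above = ancestors(order_line["sku"])
        enriched = dict(order_line)
        for node in SELECTED_NODES:
            enriched[f"in_{node}"] = node in above
        return enriched

    print(add_hierarchy_attributes({"sku": "SKU-4", "quantity": 4, "price": 12.0}))
    # -> {'sku': 'SKU-4', 'quantity': 4, 'price': 12.0,
    #     'in_Books': False, 'in_Clothing': True, 'in_Travel': False, 'in_Shirts': True}

Each selected node becomes one Boolean column, which is the flattened form most mining algorithms can consume.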

Since customers are the main concern of any e-commerce business, most data mining analyses are at the customer level. That is, each record of a data set at the final stage of an analysis is a customer signature containing all the information about the customer. However, the majority of the data in the data warehouse is at other levels such as the order header level, the order line level, and the page request level. Each customer may have multiple rows at these levels. To make this detailed information useful for analyses at the customer level, aggregation transformations are necessary. Here are some examples of attributes we have found useful:
• What percentage of each customer's orders used a VISA credit card?
• How much money does each customer spend on books?
• How much is each customer's average order amount above the mean value of the average order amount for female customers?
• What is the total amount of each customer's five most recent purchases over $30?
• What is the frequency of each customer's purchases?
• What is the recency of each customer's purchases (the number of days since the last purchase)?
These attributes are very hard to construct using standard SQL statements, and need powerful aggregation transformations. We have found RFM (Recency, Frequency, and Monetary) attributes particularly useful for the e-commerce domain.
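As an illustration of such an aggregation transformation, here is a rough sketch that rolls order rows up into one RFM record per customer; the field names and reference date are assumptions made for the example.

    from datetime import date

    # Hypothetical order-header rows from the data warehouse.
    orders = [
        {"customer": "c1", "order_date": date(2000, 5, 2), "amount": 45.0},
        {"customer": "c1", "order_date": date(2000, 6, 20), "amount": 80.0},
        {"customer": "c2", "order_date": date(2000, 3, 15), "amount": 12.5},
    ]
    AS_OF = date(2000, 7, 1)   # assumed reference date for recency

    def rfm_signatures(orders, as_of):
        """Aggregate order-level rows into one RFM record per customer."""
        signatures = {}
        for row in orders:
            sig = signatures.setdefault(
                row["customer"],
                {"recency_days": None, "frequency": 0, "monetary": 0.0})
            sig["frequency"] += 1
            sig["monetary"] += row["amount"]
            days = (as_of - row["order_date"]).days
            if sig["recency_days"] is None or days < sig["recency_days"]:
                sig["recency_days"] = days      # days since the most recent purchase
        return signatures

    print(rfm_signatures(orders, AS_OF))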
E-commerce data contains many date and time columns. We have found that these date and time columns convey useful information that can reveal important patterns. However, the common date and time format containing the year, month, day, hour, minute, and second is not often supported by data mining algorithms. Most patterns involving date and time cannot be directly discovered from this format. To make the discovery of patterns involving dates and times easier, we need transformations which can compute the time difference between dates (e.g., order date and ship date), and create new attributes representing day-of-week, day-of-month, week, month, quarter, year, etc. from date and time attributes.
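A small sketch of that kind of transformation; the attribute names are our own, and the set of derived columns would be configurable in practice.

    from datetime import datetime

    def date_attributes(order_date, ship_date):
        """Derive mining-friendly attributes from raw date/time columns."""
        return {
            "days_to_ship": (ship_date - order_date).days,    # difference between dates
            "order_day_of_week": order_date.strftime("%A"),
            "order_day_of_month": order_date.day,
            "order_week": order_date.isocalendar()[1],
            "order_month": order_date.month,
            "order_quarter": (order_date.month - 1) // 3 + 1,
            "order_year": order_date.year,
        }

    print(date_attributes(datetime(2000, 7, 14, 9, 30), datetime(2000, 7, 17, 16, 0)))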
Based on the considerations mentioned above, the architecture is designed to support a rich set of transformations. We have found that transformations including create new attributes, add hierarchy attributes, aggregate, filter, sample, delete columns, and score are useful for making analyses easier.

With transformations described, let us discuss the analysis tools. Basic reporting is a bare necessity for e-commerce. Through generated reports, business users can understand how a web site is working at different levels and from different points of view. Example questions that can be answered using reporting are:
• What are the top selling products?
• What are the worst selling products?
• What are the top viewed pages?
• What are the top failed searches?
• What are the conversion rates by brand?
• What is the distribution of web browsers?
• What are the top referrers by visit count?
• What are the top referrers by sales amount?
• What are the top abandoned products?
Our experience shows that some reporting questions such as the last two mentioned above are very hard to answer without an integrated architecture that records both event streams and sales data.

Model generation using data mining algorithms is a key component of the architecture. It reveals patterns about customers, their purchases, page views, etc. By generating models, we can answer questions like:
• What characterizes heavy spenders?
• What characterizes customers that prefer promotion X over Y?
• What characterizes customers that accept cross-sells and up-sells?
• What characterizes customers that buy quickly?
• What characterizes visitors that do not buy?

Based on our experience, in addition to automatic data mining algorithms, it is necessary to provide interactive model modification tools to support business insight. Models either automatically generated or created by interactive modifications can then be examined or evaluated on test data. The purpose is to let business users understand their models before deploying them. For example, we have found that for rule models, measures such as confidence, lift, and support at the individual rule level and the individual conjunct level are very useful in addition to the overall accuracy of the model. In our experience, the following functionality is useful for interactively modifying a rule model:
• Being able to view the segment (e.g., customer segments) defined by a subset of rules or a subset of conjuncts of a rule.
• Being able to manually modify a rule model by deleting, adding, or changing a rule or individual conjunct.
For example, a rule model predicting heavy spenders contains the rule:

    IF Income > $80,000 AND
       Age <= 31 AND
       Average Session Duration is between 10 and 20.1 minutes AND
       Account creation date is before 2000-04-01
    THEN Heavy spender

It is very likely that you wonder why the split on age occurs at 31 instead of 30 and the split on average session duration occurs at 20.1 minutes instead of 20 minutes. Why does account creation date appear in the rule at all? A business user may want to change the rule to:

    IF Income > $80,000 AND
       Age <= 30 AND
       Average Session Duration is between 10 and 20 minutes
    THEN Heavy spender

However, before doing so, it is important to see how this changes the measures (e.g., confidence, lift, and support) of this rule and the whole rule model.
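For concreteness, one standard way these three measures can be defined and recomputed over a test set after such an edit is sketched below; this is the usual textbook formulation rather than a description of any particular product's implementation, and the record fields are assumptions.

    def rule_measures(records, antecedent, consequent):
        """Support, confidence, and lift of the rule antecedent -> consequent.

        `antecedent` and `consequent` are predicates over a record, and
        `records` is the evaluation (test) data set.
        """
        n = len(records)
        n_a = sum(1 for r in records if antecedent(r))
        n_c = sum(1 for r in records if consequent(r))
        n_both = sum(1 for r in records if antecedent(r) and consequent(r))
        support = n_both / n                            # fraction matching both sides
        confidence = n_both / n_a if n_a else 0.0       # P(consequent | antecedent)
        lift = confidence / (n_c / n) if n_c else 0.0   # confidence relative to base rate
        return support, confidence, lift

    def edited_rule(r):
        """The manually edited heavy-spender rule from the text."""
        return r["income"] > 80000 and r["age"] <= 30 and 10 <= r["session_minutes"] <= 20

    def is_heavy_spender(r):
        return r["heavy_spender"]

    # Toy test records; in practice this would be the held-out evaluation data.
    test = [
        {"income": 95000, "age": 28, "session_minutes": 15, "heavy_spender": True},
        {"income": 60000, "age": 45, "session_minutes": 5, "heavy_spender": False},
        {"income": 85000, "age": 30, "session_minutes": 12, "heavy_spender": False},
        {"income": 40000, "age": 29, "session_minutes": 18, "heavy_spender": False},
    ]
    print(rule_measures(test, edited_rule, is_heavy_spender))   # support, confidence, lift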

Given that humans are very good at identifying patterns from visualized data, visualization and OLAP tools can greatly help business users to gain insight into business problems by complementing reporting tools and data mining algorithms. Our experience suggests that visualization tools are very helpful in understanding generated models, web site operations, and data itself. Figure 4 shows an example of a visualization tool, which clearly reveals that females aged between 30 and 39 years are heavy spenders (large square), closely followed by males aged between 40 and 49 years.

Figure 4. A visualization tool reveals the purchase pattern of female and male customers in different age groups. Square size represents the average purchase amount.

5 Challenges

In this section we describe several challenging problems based on our experiences in mining e-commerce data. The complexity and granularity of these problems differ, but each represents a real-life area where we believe improvements can be made. Except for the first two challenges, the problems deal with data mining algorithmic challenges.

Make Data Mining Models Comprehensible to Business Users

Business users, from merchandisers who make decisions about the assortments of products to creative designers who design web sites to marketers who decide where to spend advertising dollars, need to understand the results of data mining. Summary reports are easiest to understand and usually easy to provide, especially for specific vertical domains. Simple visualizations, such as bar charts and two-dimensional scatterplots, are also easy to understand and can provide more information and highlight patterns, especially if used in conjunction with color. Few data mining models, however, are easy to understand. Classification rules are the easiest, followed by classification trees. A visualization for the Naïve-Bayes classifier [12] was also easy for business users to understand in the second author's past experience.

The challenge is to define more model types (hypothesis spaces) and ways of presenting them to business users. What regression models can we come up with and how can we present them? (Even linear regression is usually hard for business users to understand.) How can we present nearest-neighbor models, for example? How can we present the results of association rule algorithms without overwhelming users with tens of thousands of rules (a nice example of this problem can be found in Berry and Linoff [13] starting on page 426)?

Make Data Transformation and Model Building Accessible to Business Users

The ability to answer a question given by a business user usually requires some data transformations and technical understanding of the tools. Our experience is that even commercial report designers and OLAP tools are too hard for most business users. Two common solutions are (i) provide templates (e.g., reporting templates, OLAP cubes, and recommended transformations for mining) for common questions, something that works well in well-defined vertical markets, and (ii) provide the expertise through consulting or a services organization. The challenge is to find ways to empower business users so that they will be able to serve themselves.

Support Multiple Granularity Levels

Data collected in a typical web site contains records at different levels of granularity:
• Page views are the lowest level with attributes such as product viewed and duration.
• Sessions include attributes such as browser used, initiation time, referring site, and cookie information. Each session includes multiple page views.
• Customer attributes include name, address, and demographic attributes. Each customer may be involved in multiple sessions.
Mining at the page view level by joining all the session and customer attributes violates the basic assumption inherent in most data mining algorithms, namely that records are independently and identically distributed. If we are trying to build a model to predict who visits page X, and Joe happens to visit it very often, then we might get a rule that if the visitor's first name is Joe, they will likely visit page X. The rule will have multiple records (visits) to support it, but it clearly will not generalize beyond the specific Joe. This problem is shared by mining problems in the telecommunication domain [14]. The challenge is to design algorithms that can support multiple granularity levels correctly.

Utilize Hierarchies

Products are commonly organized in hierarchies: SKUs are derived from products, which are derived from product families, which are derived from categories, etc. A product hierarchy is usually three to eight levels deep. A customer purchases SKU level items, but generalizations are likely to be found at higher levels (e.g., families and categories). Some algorithms have been designed to support tree-structured attributes [15], but they do not scale to the large product hierarchies. The challenge is to support such hierarchies within the data mining algorithms.

Scale Better: Handle Large Amounts of Data

Yahoo! had 465 million page views per day in December of 1999 [16]. The challenge is to find useful techniques (other than sampling) that will scale to this volume of data. Are there aggregations that should be performed on the fly as data is collected?

Support and Model External Events

External events, such as marketing campaigns (e.g., promotions and media ads), and site redesigns change patterns in the data. The challenge is to be able to model such events, which create new patterns that spike and decay over time.

Support Slowly Changing Dimensions

Visitors' demographics change: people get married, their children grow, their salaries change, etc. With these changes, their needs, which are being modeled, change. Product attributes change: new choices (e.g., colors) may be available, packaging material or design change, and even quality may improve or degrade. These attributes that change over time are often referred to as "slowly changing dimensions" [4]. The challenge is to keep track of these changes and provide support for such changes in the analyses.

Identify Bots and Crawlers

Bots and crawlers can dramatically change clickstream patterns at a web site. For example, Keynote (www.keynote.com) provides site performance measurements. The Keynote bot can generate a request multiple times a minute, 24 hours a day, 7 days a week, skewing the statistics about the number of sessions, page hits, and exit pages (last page at each session). Search engines conduct breadth-first scans of the site, generating many requests in short duration. Internet Explorer 5.0 supports automatic synchronization of web pages when a user logs in, when the computer is idle, or on a specified schedule; it also supports offline browsing, which loads pages to a specified depth from a given page. These options create additional clickstreams and patterns. Identifying such bots to filter their clickstreams is a non-trivial task, especially for bots that pretend to be real users.
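A first-cut filter often looks something like the sketch below, flagging sessions by user-agent substrings and request rate; the thresholds and field names are purely illustrative assumptions, and, as noted above, bots that imitate real users defeat rules this simple.

    BOT_AGENT_HINTS = ("bot", "crawler", "spider", "keynote")   # assumed substrings
    MAX_REQUESTS_PER_MINUTE = 30                                # assumed threshold

    def looks_like_bot(session):
        """Flag a session as a likely bot from its user agent and request rate."""
        agent = session.get("user_agent", "").lower()
        if any(hint in agent for hint in BOT_AGENT_HINTS):
            return True
        minutes = max(session["duration_seconds"] / 60.0, 1 / 60.0)
        return session["num_requests"] / minutes > MAX_REQUESTS_PER_MINUTE

    sessions = [
        {"user_agent": "Mozilla/4.0", "num_requests": 12, "duration_seconds": 600},
        {"user_agent": "FastCrawler/1.2", "num_requests": 900, "duration_seconds": 300},
    ]
    print([looks_like_bot(s) for s in sessions])   # -> [False, True]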
6 Summary

We proposed an architecture that successfully integrates data mining with an e-commerce system. The proposed architecture consists of three main components: Business Data Definition, Customer Interaction, and Analysis, which are connected using data transfer bridges. This integration effectively solves several major problems associated with horizontal data mining tools, including the enormous effort required in pre-processing of the data before it can be used for mining, and making the results of mining actionable. The tight integration between the three components of the architecture allows for automated construction of a data warehouse within the Analysis component. The shared metadata across the three components further simplifies this construction, and, coupled with the rich set of mining algorithms and analysis tools (like visualization, reporting and OLAP), also increases the efficiency of the knowledge discovery process. The tight integration and shared metadata also make it easy to deploy results, effectively closing the loop. Finally, we presented several challenging problems that need to be addressed for further enhancement of this architecture.

Acknowledgments

We would like to thank other members of the data mining and visualization teams at Blue Martini Software and our documentation writer, Cindy Hall. We wish to thank our clients for sharing their data with us and helping us refine our architecture and improve Blue Martini's products.

References

[1] Eric Schmitt, Harley Manning, Yolanda Paul, and Sadaf Roshan, Commerce Software Takes Off, Forrester Report, March 2000.
[2] Eric Schmitt, Harley Manning, Yolanda Paul, and Joyce Tong, Measuring Web Success, Forrester Report, November 1999.
[3] Gregory Piatetsky-Shapiro, Ron Brachman, Tom Khabaza, Willi Kloesgen, and Evangelos Simoudis, An Overview of Issues in Developing Industrial Data Mining and Knowledge Discovery Applications, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.
[4] Ralph Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, John Wiley & Sons, 1996.
[5] Ralph Kimball, Laura Reeves, Margy Ross, and Warren Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses, John Wiley & Sons, 1998.
[6] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns, Knowledge and Information Systems, 1, 1999.
[7] L. Catledge and J. Pitkow, Characterizing Browsing Behaviors on the World Wide Web, Computer Networks and ISDN Systems, 27(6), 1995.
[8] J. Pitkow, In Search of Reliable Usage Data on the WWW, Sixth International World Wide Web Conference, 1997.
[9] Shahana Sen, Balaji Padmanabhan, Alexander Tuzhilin, Norman H. White, and Roger Stein, The Identification and Satisfaction of Consumer Analysis-Driven Information Needs of Marketers on the WWW, European Journal of Marketing, Vol. 32, No. 7/8, 1998.
[10] Osmar R. Zaiane, Man Xin, and Jiawei Han, Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs, Proceedings of Advances in Digital Libraries Conference (ADL'98), Santa Barbara, CA, 1998.
[11] Stephen Gomory, Robert Hoch, Juhnyoung Lee, Mark Podlaseck, and Edith Schonberg, Analysis and Visualization of Metrics for Online Merchandizing, Proceedings of WEBKDD'99, Springer, 1999.
[12] Barry Becker, Ron Kohavi, and Dan Sommerfield, Visualizing the Simple Bayesian Classifier, KDD Workshop on Issues in the Integration of Data Mining and Data Visualization, 1997.
[13] Michael J. A. Berry and Gordon Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons, 2000.
[14] Saharon Rosset, Uzi Murad, Einat Neumann, Yizhak Idan, and Gadi Pinkas, Discovery of Fraud Rules for Telecommunications: Challenges and Solutions, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
[15] Hussein Almuallim, Yasuhiro Akiba, and Shigeo Kaneda, On Handling Tree-Structured Attributes, Proceedings of the Twelfth International Conference on Machine Learning, pp. 12-20, 1995.
[16] CFO Magazine, April 2000.
