DATA MINING APPLICATIONS IN E-COMMERCE

TABLE OF CONTENTS

ABSTRACT

INTRODUCTION

1. WHY USE DATA MINING?

2. AN OVERVIEW OF E-COMMERCE

3. TASKS SOLVED BY DATA MINING
   3.1 Predicting
   3.2 Detection of Relations
   3.3 Clustering
   3.4 Market Basket Analysis

4. DATA MINING APPLICATIONS IN E-COMMERCE
   4.1 Preparation of Personalization Techniques
   4.2 Searching for Patterns in Transactional Data

5. A GENETIC ALGORITHM-BASED DATA MINING
   5.1 Research methodology
       5.1.1 Assumptions
       5.1.2 Searching for patterns
       5.1.3 Rule generation
       5.1.4 Algorithm complexity
   5.2 Results
       5.2.1 Preliminary statistical data analysis
       5.2.2 Searching for patterns
       5.2.3 Rule generation
   5.3 Summary

6. CONCLUSION

BIBLIOGRAPHY

APPENDIX - POWERPOINT SLIDES

ABSTRACT

The e-commerce domain provides all the right ingredients for successful data
mining, and it has been claimed to be a killer domain for data mining. With the
proliferation of electronic commerce, e-purchasing has become a daily practice
for many purchasing organizations. Data mining has been used in e-commerce for
some time and has many applications in this field, such as searching for
patterns in transactional data and preparing personalization applications [1].
However, before this kind of data analysis is possible at all, the e-commerce
system itself needs to be successfully implemented.

The data mining applications are discussed briefly, based on a project
involving IBM and Britain's Safeway supermarkets.

INTRODUCTION

Data mining is the process of extracting hidden knowledge from large volumes of
raw data.

Human analysts with no special tools can no longer make sense of enormous
volumes of data that require processing in order to make informed business
decisions. Data mining automates the process of finding relationships and patterns
in raw data and delivers results that can be either utilized in an automated decision
support system or assessed by a human analyst.

Data mining and e-commerce go hand in hand, as many examples have shown. With
the help of data mining we can predict future events and reassess business
decisions accordingly. Typical questions include:

• What goods should be promoted to this customer?
• Will this customer default on a loan or pay back on schedule?
• What medical diagnosis should be assigned to this patient?

These are all questions that can probably be answered if the information hidden
among the megabytes of data in your database can be found and put to use.
Modeling the investigated system and discovering the relations that connect
variables in a database are the subject of data mining.

CHAPTER 1
WHY USE DATA MINING?

Data might be one of the most valuable assets of your corporation, but only if
you know how to reveal the valuable knowledge hidden in raw data. Data mining
allows you to extract diamonds of knowledge from your historical data and predict
outcomes of future situations. It will help you optimize your business decisions,
increase the value of each customer and communication, and improve customer
satisfaction with your services.

The data that require analysis differ for companies in different industries.
Examples include:

• Sales and contact histories
• Demographic data on your customers and prospects
• Clickstream and transactional data from your website

In all these cases data mining can help you reveal knowledge hidden in data
and turn this knowledge into a crucial competitive advantage. Today more and
more companies acknowledge the value of this new opportunity and turn to
Megaputer for leading-edge data mining tools and solutions that help optimize
their operations and increase their bottom line.

CHAPTER 2
AN OVERVIEW OF E-COMMERCE

Compared with ancient or shielded legacy systems, data collection in electronic
commerce systems can be controlled to a large extent. We now have the
opportunity to design systems that collect data for the purposes of data mining,
rather than having to struggle with translating and mining data collected for
other purposes. Data are collected electronically, rather than manually, so less
noise is introduced from manual processing. Electronic commerce data are rich,
containing information on prior purchase activity and detailed demographic data.
In addition, some data that previously were very difficult to collect are now
easily accessible. For example, electronic commerce systems can record the
actions of customers in the virtual "store", including what they look at, what
they put into their shopping cart and do not buy, and so on. Previously, in
order to obtain such data, companies had to trail customers (in person),
surreptitiously recording their activities, or had to undertake complicated
analysis of in-store videos (Underhill 2000). It was not cost-effective to
collect such data in bulk, and correlating them with individual customers was
practically impossible. In e-commerce systems, massive amounts of data can be
collected inexpensively.
Unlike in many data mining applications, the vehicle for capitalizing on the
results of mining (the electronic commerce system) is already automated.
Therefore the hurdles of system building are substantially lower, as are the
political and social hurdles involved in automating a manual process. Also,
because the mined models fit well with the existing system, computing the return
on investment can be much easier.

Web merchandising, as distinct from marketing for example, focuses on how to
acquire products and how to make them available. Electronic commerce affects
the acquisition of products because (as illustrated best by Dell Computer
Corporation) the supply chain can be integrated tightly with the customer
interface. Even more intriguing from the data mining perspective, since customers
interact with the computer directly, product assortments, virtual product
displays, and other merchandising interfaces can be modified dynamically, and
can even be personalized for individual customers, as shown in Figure 1.

CHAPTER 3
TASKS SOLVED BY DATA MINING

3.1 Predicting
A task of learning a pattern from examples and using the developed model to
predict future values of the target variable.

3.2 Detection of Relations
A task of searching for the most influential independent variables for a selected
target variable.

3.3 Clustering
A task of identifying groups of records that are similar to one another but
different from the rest of the data. Often, the variables providing the best
clustering should be identified as well.

3.4 Market Basket Analysis
Processing transactional data in order to find groups of products that sell well
together. One also searches for directed association rules that identify the
best product to offer with a current selection of purchased products.
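
As an illustration of market basket analysis, the following minimal Python sketch counts how often pairs of products occur in the same basket and derives directed rules with their support and confidence. The transactions, the function name pair_rules and the min_support value are hypothetical and chosen only for this example; real systems usually consider larger product sets than pairs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set holds the products of one basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]


def pair_rules(transactions, min_support=0.4):
    """Return directed rules (premise, consequent, support, confidence) for product pairs."""
    n = len(transactions)
    item_counts = Counter(item for basket in transactions for item in basket)
    pair_counts = Counter(
        pair for basket in transactions for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both products
        if support < min_support:
            continue
        # Each frequent pair yields two directed rules: a -> b and b -> a.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first: the best product to offer next.
    return sorted(rules, key=lambda rule: rule[3], reverse=True)


rules = pair_rules(transactions)
for premise, consequent, support, confidence in rules:
    print(f"IF {premise} THEN {consequent}: "
          f"support={support:.0%}, confidence={confidence:.0%}")
```

The direction of a rule matters: in this toy data every basket with butter also contains bread, while only three of the four baskets with bread contain butter.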

CHAPTER 4
DATA MINING APPLICATIONS IN E-COMMERCE

4.1 Preparation of Personalization Techniques


Personalization techniques make shopping easier for customers by recommending
products. Here we study a project involving IBM and Britain's Safeway
supermarkets, in which customers use palm-top PDAs to compose shopping lists
(based to a large extent on the products they have purchased previously). The
use of PDAs increases customer convenience, because customers don't have to walk
the aisles for these purchases; they simply pick them up at the store. However,
it reduces the company's ability to "recommend" products via in-store displays
and the like.
The project shows how recommendations can be made on the PDA instead, using a
combination of data mining techniques. The recommendations were made to
actual customers in two field trials. After incorporating "interestingness"
knowledge learned from the first trial, in the second trial (in a different
store) the results were encouraging, notwithstanding several application
challenges. Specifically, 25% of orders included something from the
recommendation list, corresponding to a revenue boost of 1.8% (respectable as
compared to other promotions). Perhaps more important, the trials show that
customers are significantly more likely to choose high-ranked recommendations
than low-ranked ones, indicating that the algorithms do well at modeling the
likelihood of purchasing items not purchased previously. The study also shows
intuitive rules, clusters and relative preferences, demonstrating the potential
of data mining for improving understanding of the business, which may be useful
even where recommendations are not implemented (or are not effective).

4.2 Searching for Patterns in Transactional Data
Data mining algorithms often produce a mass of patterns, much smaller than the
original mountain of data, but still in need of post-processing.
Creating individual consumer profiles for personalized recommendation (or
for other purposes, such as providing dynamic content or tailored advertising)
exacerbates this problem, because now one may be searching for patterns
individually for each of millions of consumers.
As mentioned above, electronic commerce systems allow unprecedented
flexibility in merchandising. However, flexibility is not a benefit unless one
knows how to map the many options to different situations. For example, how
should different product assortments or merchandising cues be chosen? The work
discussed in [1] focuses on the analysis and evaluation of web merchandising.
Specifically, it analyzes "clickstreams", the series of links followed by
customers on a site. Its thesis is that the effectiveness of many on-line
merchandising tactics can be analyzed by a combination of specified metrics and
visualization techniques applied to clickstreams.
A detailed case study of the analysis of clickstream data from a web retailer
is also described in [1]. The study shows how the breakdown of clickstreams into
sub-segments can highlight potential problems in merchandising. For example, a
product may have many click-throughs but a low click-to-buy rate. Subsequent
analysis shows that it has a high basket-to-buy rate, but a low click-to-basket
rate. This analysis allows merchandisers to begin to develop informed hypotheses
about how performance might be improved. For example, since this is a
high-priced product, one might hypothesize that customers were lured to the
product page and then turned off by the product's high price. If this were true,
there are several different actions that might be appropriate (reduce the price,
convince the customer that the product is worth its high price, target the lure
better so as not to "waste clicks", etc.).
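
The breakdown of the click-to-buy rate described above can be sketched in a few lines of Python. The counts, the ProductCounts class and the funnel_rates function are hypothetical names and numbers used only for illustration; they are not taken from the case study in [1].

```python
from dataclasses import dataclass


@dataclass
class ProductCounts:
    """Hypothetical per-product counts extracted from clickstream logs."""
    clicks: int   # visits to the product page
    baskets: int  # times the product was put into a shopping cart
    buys: int     # times the product was actually purchased


def funnel_rates(counts: ProductCounts) -> dict:
    """Split the click-to-buy rate into click-to-basket and basket-to-buy rates."""
    click_to_basket = counts.baskets / counts.clicks if counts.clicks else 0.0
    basket_to_buy = counts.buys / counts.baskets if counts.baskets else 0.0
    click_to_buy = counts.buys / counts.clicks if counts.clicks else 0.0
    return {
        "click_to_basket": click_to_basket,
        "basket_to_buy": basket_to_buy,
        "click_to_buy": click_to_buy,
    }


# A product with many click-throughs that rarely reaches the basket.
rates = funnel_rates(ProductCounts(clicks=2000, baskets=50, buys=40))
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

Because the click-to-buy rate is the product of the other two rates, a low overall rate can be traced to the step where customers drop out, here the step from product page to basket.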
Measuring and improving the success of web sites has also been studied. In
particular, the concern is that success should be evaluated in terms of the
business goal of the web site (e.g., retail sales), and that treatments should
not be limited to measurement alone, but should also suggest concrete avenues
for improvement. The discovery of navigation patterns is discussed as well, with
a brief but comprehensive survey of the state of the art and a method that
addresses some of its deficiencies.

CHAPTER 5
A GENETIC ALGORITHM-BASED DATA MINING

5.1 Research methodology


One of the data mining techniques that aim at the discovery of hidden
knowledge is the extraction of association or pseudo-association rules
(attributes are not limited to the binary domain) in the form of IF-THEN
statements [2].

We present a slightly different approach that combines the power of
genetic algorithms with the simplicity of standard association rule
generation algorithms. This approach applies a genetic algorithm to the
problem of searching for repetitive patterns hidden in data and then simply
generates pseudo-association rules based on those patterns.

5.1.1 Assumptions
The main goal of this project was to identify the profiles of e-purchasing
adopters. It was also very important to determine the most significant
factors in terms of the firm's perceived importance of managerial benefits
as well as the obstacles related to the possible implementation of an
e-purchasing system. All the knowledge that can be discovered with this
approach is hidden in the data and can be extracted with virtually no
assumptions. The only requirement that must be fulfilled before
investigating this kind of relation in the data is to divide the attribute
space into two disjoint subspaces representing the premise and consequent
parts of the decision rules.

5.1.2 Searching for patterns
The theory of genetic algorithms is based on the process of natural selection,
according to which nature aims at the creation of organisms that are
adjusted to the surrounding environment in the best possible way.
In genetic algorithms, possible solutions to the analyzed problem are
encoded into so-called chromosomes. Chromosomes consist of genes that
represent a solution numerically. The possible values that can be assigned
to a particular gene are determined by its allele (domain).
Here, the chromosomes that are produced and modified over the course of
evolution (a sequence of generations) represent patterns covering records
in the data set. Each such pattern has a possible coverage in the data
(support), which is given by the number of records matching the pattern.
For the example shown in Figure 3, the support is the number of all records
containing the given values at the fourth, eleventh, thirteenth, and
twenty-second positions, regardless of all the other values. Obviously, we
are mostly interested in patterns that have relatively high support, and
this is the main feature of the fitness function used for this algorithm.
The minimal desired level of support in the data can be specified before
the execution of the genetic algorithm, so that patterns with less coverage
are not included in the result at all. Pseudo-code for the general scheme
of the genetic algorithm is shown in Figure 4.

***1******5*1********3*****

Figure 3: Example of a chromosome (set positions: genes no. 4, 11, 13, 22).

t := 0;
P (t) := InitializePopulation (no_of_attributes, attributes_domains);
while (t < max_number_of_generations) do
    EvaluateFitness (P (t), dataset);
    t := t + 1;
    P (t) := Select (P (t-1));
    Crossover (P (t));
    Mutate (P (t));
end while;

Figure 4: Pseudo-code for the general scheme of the genetic algorithm.
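
To make the notion of support concrete, the sketch below counts the records matched by the chromosome of Figure 3. It is an illustration only: it assumes every attribute value fits into a single character, and the three records, the helper make_row and the functions matches and support are hypothetical.

```python
WILDCARD = "*"  # an unset gene matches any attribute value
LENGTH = 27     # number of genes in the chromosome of Figure 3


def make_row(fill, overrides):
    """Build a 27-character row from a fill character and {position: value} (1-based)."""
    chars = [fill] * LENGTH
    for position, value in overrides.items():
        chars[position - 1] = value
    return "".join(chars)


def matches(pattern, record):
    """A record matches when it agrees with the pattern at every set position."""
    return len(pattern) == len(record) and all(
        gene == WILDCARD or gene == value for gene, value in zip(pattern, record)
    )


def support(pattern, records):
    """Support of a pattern: the number of records it matches."""
    return sum(matches(pattern, record) for record in records)


# The chromosome of Figure 3: genes 4, 11, 13 and 22 are set.
pattern = make_row(WILDCARD, {4: "1", 11: "5", 13: "1", 22: "3"})

# Hypothetical data records.
records = [
    make_row("2", {4: "1", 11: "5", 13: "1", 22: "3"}),  # matches
    make_row("4", {4: "1", 11: "5", 13: "1", 22: "3"}),  # matches
    make_row("2", {4: "1", 11: "5", 13: "2", 22: "3"}),  # differs at position 13
]

set_positions = [i + 1 for i, gene in enumerate(pattern) if gene != WILDCARD]
print(pattern)
print("set positions:", set_positions)
print("support:", support(pattern, records))
```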

FitnessEvaluation (chromosome)
    set_positions := CountSetPositions (chromosome);
    if (HasSupportInData (chromosome)) then
        support := CalculateSupport (chromosome);
        fitness := support * set_positions;
    else
        partial_support := CalculatePartialSupport (chromosome);
        fitness := partial_support * threshold_support;
    end if;
    FitnessEvaluation := fitness;
end FitnessEvaluation;

CalculatePartialSupport (chromosome)
    partial_support := 0;
    for each record in dataset
        matching := CountGenesMatchingRecord (record);
        matching_ratio := matching / length_of_chromosome;
        partial_support := partial_support + matching_ratio;
    end for;
    CalculatePartialSupport := partial_support / no_of_records;
end CalculatePartialSupport;

Figure 5: Pseudo-code for the fitness function evaluation.

A more detailed description of the algorithm for the fitness function
evaluation is given by the pseudo-code in Figure 5.
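
The pseudo-code of Figure 5 can be turned into a small runnable Python sketch. Several details are assumptions made for this illustration rather than statements from the source: unset genes are represented by None and count as matching, support is an absolute number of records, and a chromosome "has support in data" when its support reaches threshold_support. The data set is hypothetical.

```python
from typing import Optional, Sequence

Gene = Optional[int]  # None marks an unset ("*") position


def fully_matches(chromosome: Sequence[Gene], record: Sequence[int]) -> bool:
    """True when the record agrees with every set gene of the chromosome."""
    return all(gene is None or gene == value for gene, value in zip(chromosome, record))


def partial_support(chromosome: Sequence[Gene], dataset) -> float:
    """Average share of genes matched per record (unset genes count as matched)."""
    total = 0.0
    for record in dataset:
        matching = sum(
            1 for gene, value in zip(chromosome, record) if gene is None or gene == value
        )
        total += matching / len(chromosome)
    return total / len(dataset)


def fitness(chromosome: Sequence[Gene], dataset, threshold_support: int) -> float:
    """Reward supported patterns by support * set positions, others by partial support."""
    set_positions = sum(gene is not None for gene in chromosome)
    support = sum(fully_matches(chromosome, record) for record in dataset)
    if support >= threshold_support:  # assumed meaning of HasSupportInData
        return support * set_positions
    return partial_support(chromosome, dataset) * threshold_support


# Hypothetical data set with four attributes per record.
dataset = [(1, 2, 3, 1), (1, 2, 0, 1), (1, 0, 3, 0), (0, 2, 3, 1)]

supported = (1, None, None, 1)  # fully matched by two records
too_specific = (1, 2, 3, 1)     # fully matched by one record only

print(fitness(supported, dataset, threshold_support=2))
print(fitness(too_specific, dataset, threshold_support=2))
```

Under these assumptions a pattern below the threshold can never score higher than threshold_support, so it still guides the search without outranking a supported pattern.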

Another very important feature of the proposed genetic algorithm is a
multi-point crossover option. In many experiments on different types of
data, this approach was found to be much more effective with respect to
both the number of discovered patterns and the time of convergence.
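
A multi-point crossover can be sketched as follows. This is a generic textbook variant written for illustration, not the operator used in [2]: cut points are drawn at random, and the two parents exchange every second segment. The function name, the parent chromosomes and the fixed seed are hypothetical.

```python
import random


def multi_point_crossover(parent_a, parent_b, n_points, rng):
    """Exchange alternating segments of two equal-length chromosomes at n_points cuts."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    length = len(parent_a)
    if not 1 <= n_points < length:
        raise ValueError("n_points must be between 1 and length - 1")
    # Distinct cut points strictly inside the chromosome, in increasing order.
    points = sorted(rng.sample(range(1, length), n_points))
    child_a, child_b = [], []
    swap = False
    start = 0
    for point in points + [length]:
        segment_a, segment_b = parent_a[start:point], parent_b[start:point]
        if swap:
            segment_a, segment_b = segment_b, segment_a
        child_a.extend(segment_a)
        child_b.extend(segment_b)
        swap = not swap  # the next segment comes from the other parent
        start = point
    return child_a, child_b


rng = random.Random(0)  # fixed seed so the example is repeatable
child_a, child_b = multi_point_crossover(list("AAAAAAAA"), list("BBBBBBBB"), 3, rng)
print("".join(child_a))
print("".join(child_b))
```

With a single cut point the function reduces to the classic one-point crossover; more cut points mix the parents' genes more thoroughly in one step.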
As an outcome of several evolutions modeled by this genetic algorithm,
a set of data patterns was created. Those patterns, along with the
information about the level of their support, were then used as input to
the second algorithm, which generated pseudo-association rules.

5.1.3 Rule generation

Association rules expose the existence of relations between attributes in
data (binary domain: exists vs. does not exist), while pseudo-association
rules discover relations between the values of those attributes (the
domains of the attributes themselves). Basically, pseudo-association rules
are simple IF-THEN statements:
“If the set of attributes X, included in the premise
part of the rule, has some values, described by a set of
values V (X), then the set of attributes Y, included in
the consequent part, tends to have values, described by
another set of values V (Y)”.
This approach seems ideal in our case, since we aim at deriving rules of
the type:
“If a given company’s profile, in the sense of attributes
X, is described by the values V (X), then the
likelihood of the company being involved in the adoption of
particular e-commerce solution(s), described by a set of
attributes Y, is determined by the level of importance
of the managerial benefits it looks for and its concerns
with some obstacles V (Y)”
and:
“If a given company perceives managerial benefits X as
important to the extent of V (X), then its likelihood of
adopting particular e-commerce solution(s) Y is
determined by the level of concern with some obstacles
V (Y)”.

An algorithm for extracting association rules usually consists of two parts:
searching for patterns hidden in data (in this project achieved by applying the
genetic algorithm) and generating rules based on those patterns.
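
The second part, generating a rule from a discovered pattern, can be illustrated with the sketch below. It splits the attributes of a pattern into a premise and a consequent and computes support and confidence in the usual way. The attribute names, the five records and the function rule_from_pattern are hypothetical; they are not the survey data or the implementation used in [2].

```python
def rule_from_pattern(pattern, premise_attrs, records):
    """Turn a pattern {attribute: value} into an IF-THEN rule with support and confidence."""
    premise = {a: v for a, v in pattern.items() if a in premise_attrs}
    consequent = {a: v for a, v in pattern.items() if a not in premise_attrs}
    if not premise or not consequent:
        return None  # the pattern does not span both parts of the attribute space

    def covers(values, record):
        return all(record.get(attr) == value for attr, value in values.items())

    n_premise = sum(covers(premise, record) for record in records)
    n_both = sum(covers(pattern, record) for record in records)
    if n_premise == 0:
        return None
    return {
        "if": premise,
        "then": consequent,
        "support": n_both / len(records),   # share of records matching the whole pattern
        "confidence": n_both / n_premise,   # share of premise matches that also match the consequent
    }


# Hypothetical company records (not the data analyzed in the study).
records = [
    {"uses_edi": "yes", "helps_suppliers": "yes"},
    {"uses_edi": "yes", "helps_suppliers": "yes"},
    {"uses_edi": "yes", "helps_suppliers": "yes"},
    {"uses_edi": "yes", "helps_suppliers": "no"},
    {"uses_edi": "no", "helps_suppliers": "no"},
]
pattern = {"uses_edi": "yes", "helps_suppliers": "yes"}

rule = rule_from_pattern(pattern, premise_attrs={"uses_edi"}, records=records)
print(rule)
```

Changing premise_attrs produces a different rule from the same pattern, which is how one set of patterns can serve several points of view.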

5.1.4 Algorithm complexity

Apart from the execution of the standard genetic algorithm operators, the
comparison of the chromosomes to the data records is the most critical part
of the algorithm. The effectiveness of this portion of the algorithm
strongly depends on four parameters: the number of records in the database
(N), the number of attributes in the database (A), the size of the
population (P), and the number of generations (G) within a given evolution
process. Since the matching is nothing more than a value-to-gene comparison
of each chromosome in the population against each record in the database,
the total number of such comparisons (C) is given by:

C = N × A × P × G
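
As a worked example of this formula, the short sketch below evaluates C for one set of parameter values. The value A = 27 matches the number of genes in the chromosome of Figure 3; the remaining values are hypothetical and serve only to show the order of magnitude.

```python
def total_comparisons(n_records, n_attributes, population_size, n_generations):
    """Number of value-to-gene comparisons: C = N * A * P * G."""
    return n_records * n_attributes * population_size * n_generations


# Hypothetical run: 1000 records, 27 attributes, 100 chromosomes, 500 generations.
c = total_comparisons(1000, 27, 100, 500)
print(f"{c:,} comparisons")

# The cost is linear in each parameter: doubling the number of records doubles C.
print(total_comparisons(2000, 27, 100, 500) // c)
```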
As for the second part of the methodology presented in this paper, rule
generation, its efficiency obviously depends on the number of attributes
included in a given pattern and on the constraints concerning the rule
structure in terms of its premise and consequent parts (provided by the
user).

Because of the emphasis on the fast discovery of patterns by the genetic
algorithm in the first part, the rule generation is rather fast. Even more
importantly, it is very flexible and allows the user to generate different
sets of rules (i.e. different points of view) on the basis of the same set
of patterns.

5.2 Results

5.2.1 Preliminary statistical data analysis


The preliminary statistical analysis is important in this data mining
approach because it indicates the levels of support and confidence that
discovered patterns and rules can potentially reach (these levels are
directly related to how frequently the attribute values included in those
rules occur). Accordingly, the results of this preliminary statistical data
analysis were carefully analyzed and taken into account when determining
the genetic algorithm's parameters.

5.2.2 Searching for patterns


In order to increase the variety of patterns, the genetic algorithm was
launched on several computers simultaneously.
Because of the relatively small size of the data sample, as well as the
conclusions derived from the preliminary statistical data analysis (about
the small diversity of the data), the support threshold for desired
patterns was lowered to 3-5%.
As a result of several evolutions of the genetic algorithm (each consisting
of a comparable number of generations), 1564 patterns were found in the
database. Each of those patterns had at least two “set” values for the
corresponding attributes.

5.2.3 Rule generation
Patterns discovered and prepared in the previous step were then used as
the basis for rule generation. At this level, the sets of attributes were
divided into the premise and consequent parts of the rules according to the
problem specification. The input patterns were also checked for
overlapping, and those that were covered by another pattern were
removed from consideration. After this verification, 633 effective patterns
were preserved. On the basis of this final set of patterns, a
number of rules of given support and confidence were generated.
A few examples of those rules are presented below:
RULE 1:
IF a company uses an EDI system, THEN the company is willing to help
suppliers to establish an electronic commerce network WITH support of
32% and confidence of 79%.
RULE 2:
IF a company uses electronic commerce in purchase orders frequently,
THEN the reduction of transaction time is extremely important for them
WITH support of 21% and confidence of 81%.
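
The overlap check mentioned above can be sketched as a simple filter. The source does not define "covered" precisely, so this sketch assumes that a pattern is covered when every one of its set positions appears with the same value in another, different pattern. The patterns and the functions is_covered_by and drop_covered are hypothetical.

```python
def is_covered_by(pattern, other):
    """Assumed definition: every set gene of `pattern` has the same value in `other`."""
    return pattern != other and all(
        other.get(position) == value for position, value in pattern.items()
    )


def drop_covered(patterns):
    """Keep only the patterns that are not covered by any other pattern."""
    return [
        p for p in patterns
        if not any(is_covered_by(p, q) for q in patterns)
    ]


# Hypothetical patterns as {gene position: value}; unset genes are simply absent.
patterns = [
    {4: 1, 11: 5},         # covered by the next pattern
    {4: 1, 11: 5, 13: 1},
    {22: 3},
    {4: 2, 11: 5},         # differs at position 4, so it is kept
]

effective = drop_covered(patterns)
print(effective)
```

Exact duplicates are not treated as covering each other here, so a real pipeline would also remove duplicate patterns before this step.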

5.3 Summary
All of the generated rules were reasonable; however, not all of them were
obvious, and some could not have been easily anticipated. Some of them
simply confirmed the conclusions drawn from the statistical data analysis,
while others produced an interesting and novel insight into the problem of
profiling adopters and non-adopters of e-purchasing. On the basis of these
results it can be stated that this approach is appropriate and useful for
the discovery of the profile description hidden in data.

CHAPTER 6
CONCLUSION

Company preference knowledge must be incorporated into recommendations: the task
is not just to recommend what the customer will most like, but also what the
store would like to sell. It should also be kept in mind that there is more to
data mining than just building an automated recommendation system. If one is
indeed participating in a knowledge discovery process, the knowledge that is
discovered may be used for various purposes. However, when it comes to improving
the efficiency of the knowledge discovery process as a whole, additional
research on efficient mining algorithms will have diminishing returns if the
rest of the process remains difficult and manual.
Overall, although electronic commerce systems are an ideal application for data
mining, much research is still needed, mostly in areas of the knowledge
discovery process other than the algorithmic phase.

BIBLIOGRAPHY

[1] Kohavi R., Provost F., Applications of Data Mining to Electronic
Commerce, Data Mining and Knowledge Discovery - International
Journal, Special Issue on E-Commerce and Data Mining, Kluwer
Academic Publishers, Boston, 2001.
[2] Min H., Smolinski G. T., Boratyn M. G., "A Genetic Algorithm-based
Data Mining Approach to Profiling the Adopters and Non-Adopters of
E-Purchasing", Logistics and Distribution Institute, University of
Louisville, Louisville, KY 40292.

APPENDIX
