Data Mining Applications in PDF
E-COMMERCE
TABLE OF CONTENTS
ABSTRACT
INTRODUCTION
2. AN OVERVIEW OF E-COMMERCE
5.2 Results
5.2.1 Preliminary statistical data analysis
5.2.2 Searching for patterns
5.2.3 Rule generation
5.3 Summary
6. CONCLUSION
BIBLIOGRAPHY
ABSTRACT
The e-commerce domain provides all the right ingredients for successful data
mining, and it has been claimed to be a killer domain for data mining. With the
proliferation of electronic commerce, e-purchasing has become a daily practice
for many purchasing organizations. Data mining has been used in e-commerce for
some time already. It has many applications in this field, such as searching
for patterns in transactional data and preparing personalization applications
[1]. However, before this kind of data analysis is possible, an e-commerce
system itself needs to be successfully implemented.
INTRODUCTION
Data mining is the process of extracting hidden knowledge from large volumes of
raw data.
Human analysts with no special tools can no longer make sense of the enormous
volumes of data that must be processed in order to make informed business
decisions. Data mining automates the process of finding relationships and patterns
in raw data and delivers results that can be either utilized in an automated decision
support system or assessed by a human analyst.
Data mining and e-commerce go hand in hand, as many examples have shown. With
the help of data mining we can predict future events and reassess business
decisions accordingly.
Many business questions can probably be answered if the information hidden
among megabytes of data in your database can be found explicitly and utilized.
Modeling the investigated system and discovering the relations that connect
variables in a database are the subject of data mining.
CHAPTER 1
WHY USE DATA MINING?
Data might be one of the most valuable assets of your corporation, but only if
you know how to reveal the valuable knowledge hidden in raw data. Data mining
allows you to extract diamonds of knowledge from your historical data and predict
outcomes of future situations. It will help you optimize your business decisions,
increase the value of each customer and communication, and improve customer
satisfaction with your services.
Data mining can help you reveal knowledge hidden in data and turn this
knowledge into a crucial competitive advantage. Today, more and more companies
acknowledge the value of this new opportunity and turn to Megaputer for
leading-edge data mining tools and solutions that help optimize their
operations and increase their bottom line.
CHAPTER 2
AN OVERVIEW OF E-COMMERCE
Web merchandising, as distinct from marketing, for example, focuses on how to
acquire products and how to make them available. Electronic commerce affects
the acquisition of products because (as illustrated best by Dell Computer
Corporation) the supply chain can be integrated tightly with the customer interface.
Even more intriguing from the data mining perspective, since customers
interact with the computer directly, product assortments, virtual product
displays, and other merchandising interfaces can be modified dynamically, and
can even be personalized to individual customers, as shown in Figure 1.
CHAPTER 3
TASKS SOLVED BY DATA MINING
3.1 Predicting
A task of learning a pattern from examples and using the developed model to
predict future values of the target variable.
3.3 Clustering
A task of identifying groups of records that are similar between themselves but
different from the rest of the data. Often, the variables providing the best
clustering should be identified as well.
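The clustering task can be illustrated with a minimal sketch. The text does not name a clustering algorithm, so k-means (Lloyd's algorithm) is used here purely as an assumed stand-in, and the customer attributes and values are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points: List[Point], centroids: List[Point], iterations: int = 10) -> List[int]:
    """Assign each point to one of len(centroids) clusters (Lloyd's algorithm)."""
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each record joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        centroids = updated
    return labels


# Hypothetical customers: (orders per year, average order value)
customers = [(1, 20), (2, 25), (1, 22), (30, 200), (28, 210), (32, 190)]
labels = k_means(customers, centroids=[customers[0], customers[3]])
print(labels)
```

With these illustrative values the occasional, low-value buyers fall into one group and the frequent, high-value buyers into the other.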
CHAPTER 4
DATA MINING APPLICATIONS
4.2 Searching for Patterns in Transactional Data
Data mining algorithms often produce a mass of patterns, much smaller than the
original mountain of data, but still in need of post-processing.
Creating individual consumer profiles for personalized recommendation (or
for other purposes, such as providing dynamic content or tailored advertising)
exacerbates this problem, because now one may be searching for patterns
individually for each of millions of consumers.
As we’ve mentioned, electronic commerce systems allow unprecedented
flexibility in merchandising. However, flexibility is not a benefit unless one
knows how to map the many options to different situations. For example, how
should different product assortments or merchandising cues be chosen? Here we
focus on the analysis and evaluation of web merchandising [1], and specifically
on the analysis of “clickstreams”, the series of links followed by customers on
a site. The thesis in [1] is that the effectiveness of many on-line
merchandising tactics can be analyzed by a combination of specified metrics and
visualization techniques applied to clickstreams.
A detailed case study of the analysis of clickstream data from a web retailer
is provided in [1]. The study shows how breaking clickstreams down into
sub-segments can highlight potential problems in merchandising, for example
when a product has many click-throughs but a low click-to-buy rate. Subsequent
analysis shows that it has a high basket-to-buy rate, but a low click-to-basket
rate. This analysis would allow merchandisers to begin to develop informed
hypotheses about how performance might be improved. For example, since this is
a high-priced product, one might hypothesize that customers were lured to the
product page and then turned off by the product’s high price. If this were true,
there are several different actions that might be appropriate (reduce the price,
convince the customer that the product is worth its high price, target the lure
better so as not to “waste clicks”, etc.).
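The three rates in this case study are related: the click-to-buy rate is the product of the click-to-basket and basket-to-buy rates. The sketch below shows one way to compute them. The class name, field names and counts are hypothetical and are not taken from [1].

```python
from dataclasses import dataclass


@dataclass
class ProductFunnel:
    """Clickstream counts for one product (hypothetical structure)."""
    clicks: int   # click-throughs to the product page
    baskets: int  # times the product was put in a basket
    buys: int     # times the product was actually bought

    def click_to_basket(self) -> float:
        return self.baskets / self.clicks if self.clicks else 0.0

    def basket_to_buy(self) -> float:
        return self.buys / self.baskets if self.baskets else 0.0

    def click_to_buy(self) -> float:
        return self.buys / self.clicks if self.clicks else 0.0


# Illustrative numbers only: many click-throughs, few baskets, most baskets bought.
funnel = ProductFunnel(clicks=1000, baskets=20, buys=18)
print(f"click-to-basket: {funnel.click_to_basket():.1%}")
print(f"basket-to-buy:   {funnel.basket_to_buy():.1%}")
print(f"click-to-buy:    {funnel.click_to_buy():.1%}")
```

Splitting the overall rate into its two factors is what localizes the problem: here almost every basket converts, so the loss occurs between the product page and the basket.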
We also study measuring and improving the success of web sites. In particular,
we are concerned that success should be evaluated in terms of the business goal
of the web site (e.g., retail sales), and that treatments should not be limited to
measurement alone, but should also suggest concrete avenues for improvement.
We discuss the discovery of navigation patterns, presenting a brief but
comprehensive survey of the state of the art, and also presenting a method that
addresses some of its deficiencies.
CHAPTER 5
A GENETIC ALGORITHM BASED DATA MINING
5.1.1 Assumptions
The main goal of this project was to identify the profiles of e-purchasing
adopters. It was also very important to determine the most significant
factors in terms of the firm’s perceived importance of managerial benefits,
as well as the obstacles related to the possible implementation of an
e-purchasing system. All the knowledge that can be discovered with this
approach is hidden in the data and can be extracted with virtually no
assumptions. The only requirement that must be fulfilled prior to the
investigation of this kind of relation in the data is to divide the attribute
space into two disjoint subspaces representing the premise and consequent
parts of the decision rules.
5.1.2 Searching For Patterns
The theory of genetic algorithms is based on the process of natural selection,
according to which nature aims at the creation of organisms that are
adapted to the surrounding environment in the best possible way.
In genetic algorithms, possible solutions to the analyzed problem are
encoded into so-called chromosomes. Chromosomes consist of genes that
represent a solution numerically. The possible values that can be assigned to a
particular gene are determined by its allele (domain).
We consider that the chromosomes produced and modified during the process of
evolution (a sequence of generations) represent patterns covering records in
the data set. Each such pattern has a possible coverage in the data (support),
which is given by the number of records matching the pattern.
For the example shown
***1******5*1********3*****
Figure 3: Example of a chromosome (set positions: genes no. 4, 11, 13, 22).
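To make the notion of support concrete, the following sketch counts how many records match the chromosome of Figure 3. It assumes that "*" marks an unset gene that matches any value and that every attribute value fits in a single character. The records and the helper `make_record` are hypothetical.

```python
from typing import Dict, List

WILDCARD = "*"  # assumed marker for an unset gene


def set_positions(pattern: str) -> List[int]:
    """1-based positions of the genes that carry a concrete value."""
    return [i + 1 for i, gene in enumerate(pattern) if gene != WILDCARD]


def matches(pattern: str, record: str) -> bool:
    """A record matches if it agrees with the pattern on every set gene."""
    return len(pattern) == len(record) and all(
        gene == WILDCARD or gene == value for gene, value in zip(pattern, record)
    )


def support(pattern: str, records: List[str]) -> int:
    """Support as defined in the text: the number of matching records."""
    return sum(1 for record in records if matches(pattern, record))


def make_record(values: Dict[int, str], length: int = 27, filler: str = "2") -> str:
    """Build a hypothetical record, overriding selected 1-based positions."""
    chars = [filler] * length
    for position, value in values.items():
        chars[position - 1] = value
    return "".join(chars)


pattern = "***1******5*1********3*****"  # chromosome from Figure 3
records = [
    make_record({4: "1", 11: "5", 13: "1", 22: "3"}),          # matches
    make_record({4: "1", 11: "5", 13: "1", 22: "4"}),          # differs at gene 22
    make_record({1: "7", 4: "1", 11: "5", 13: "1", 22: "3"}),  # matches
]
print(set_positions(pattern), support(pattern, records))
```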
t := 0;
P(t) := InitializePopulation(no_of_attributes, attributes_domains);
while (t < max_number_of_generations) do
    EvaluateFitness(P(t), dataset);
    t := t + 1;
    P(t) := Select(P(t-1));
    Crossover(P(t));
    Mutate(P(t));
end while;
FitnessEvaluation(chromosome)
    set_positions := CountSetPositions(chromosome);
    if (HasSupportInData(chromosome)) then
        support := CalculateSupport(chromosome);
        fitness := support * set_positions;
    else
        partial_support := CalculatePartialSupport(chromosome);
        fitness := partial_support * threshold_support;
    end if;
    FitnessEvaluation := fitness;
end FitnessEvaluation;
CalculatePartialSupport(chromosome)
    partial_support := 0;
    for each record in dataset
        matching := CountGenesMatchingRecord(chromosome, record);
        matching_ratio := matching / length_of_chromosome;
        partial_support := partial_support + matching_ratio;
    end for;
    CalculatePartialSupport := partial_support / no_of_records;
end CalculatePartialSupport;
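The fitness pseudocode above can be turned into a runnable sketch under a few stated assumptions, none of which the text confirms: unset genes are represented as None, "has support in data" means that at least one record matches every set gene, only set genes count toward the number of matching genes, and threshold_support is a user-supplied constant. This is an illustration, not the authors' implementation.

```python
from typing import Optional, Sequence

Chromosome = Sequence[Optional[int]]  # None marks an unset gene (assumption)
Record = Sequence[int]


def count_set_positions(chromosome: Chromosome) -> int:
    return sum(1 for gene in chromosome if gene is not None)


def count_genes_matching_record(chromosome: Chromosome, record: Record) -> int:
    """Number of set genes whose value equals the record's value (assumption)."""
    return sum(1 for gene, value in zip(chromosome, record)
               if gene is not None and gene == value)


def calculate_support(chromosome: Chromosome, dataset: Sequence[Record]) -> int:
    """Number of records that agree with the chromosome on every set gene."""
    needed = count_set_positions(chromosome)
    return sum(1 for record in dataset
               if count_genes_matching_record(chromosome, record) == needed)


def calculate_partial_support(chromosome: Chromosome, dataset: Sequence[Record]) -> float:
    """Average fraction of genes matched per record, as in the pseudocode."""
    total = 0.0
    for record in dataset:
        total += count_genes_matching_record(chromosome, record) / len(chromosome)
    return total / len(dataset)


def fitness_evaluation(chromosome: Chromosome, dataset: Sequence[Record],
                       threshold_support: float) -> float:
    support = calculate_support(chromosome, dataset)
    if support > 0:  # assumed meaning of HasSupportInData
        return support * count_set_positions(chromosome)
    return calculate_partial_support(chromosome, dataset) * threshold_support


# Hypothetical data set with four attributes per record.
dataset = [(1, 2, 3, 1), (1, 2, 0, 0), (0, 2, 3, 1)]
supported = (1, 2, None, None)    # matched by the first two records
unsupported = (0, 0, None, None)  # no record matches both set genes
print(fitness_evaluation(supported, dataset, threshold_support=1.0))
print(fitness_evaluation(unsupported, dataset, threshold_support=1.0))
```

Multiplying support by the number of set positions rewards patterns that are both frequent and specific, while the partial-support branch still gives unsupported chromosomes a small, graded score so that selection has something to work with.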
The discovered patterns, together with information about the level of their
support, were then used as input to the second algorithm, which generated
pseudo-association rules.
V (Y)”,
this approach seems to be ideal.
Apart from the execution of the standard genetic algorithm operators, the
comparison of the chromosomes to the data records is the most critical part
of the algorithm. The efficiency of this portion of the algorithm strongly
depends on four parameters: the number of records in the database (N),
the number of attributes in the database (A), the size of the population (P),
and the number of generations (G) within a given evolution process. Since the
matching is nothing more than a value-to-gene comparison of each chromosome in
the population against each record in the database, the total number of such
comparisons (C) is given by:
C = N × A × P × G.
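A quick calculation shows how fast the comparison count C grows, since it is the product of N, A, P and G. The parameter values below are hypothetical and chosen only for illustration.

```python
def total_comparisons(n_records: int, n_attributes: int,
                      population_size: int, n_generations: int) -> int:
    """C = N x A x P x G: value-to-gene comparisons over a whole evolution run."""
    return n_records * n_attributes * population_size * n_generations


# Hypothetical setting: 1,000 records, 27 attributes, 100 chromosomes, 50 generations.
c = total_comparisons(1000, 27, 100, 50)
print(f"{c:,} comparisons")

# The count is linear in each parameter: doubling the records doubles the work.
print(total_comparisons(2000, 27, 100, 50) == 2 * c)
```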
As for the second part of the methodology presented in this paper, rule
generation, its efficiency obviously depends on the number of attributes
included in a given pattern and on the constraints concerning the rule
structure in terms of its premise and consequent parts (provided by the
user).
Because of the emphasis on the fast discovery of patterns by the genetic
algorithm in the first part, the rule generation is rather fast. Even
more importantly, it is very flexible and allows the user to generate different
sets of rules (i.e., different points of view) on the basis of the same set of
patterns.
5.2 Results
5.2.3 Rule generation
Patterns discovered and prepared in the previous step were then used as
the basis for rule generation. At this level, the sets of attributes were
divided into the premise and consequent parts of the rules according to the
problem specification. The input patterns were also checked for
overlap, and those that were covered by another pattern were
removed from consideration. After this verification, 633 effective patterns
were preserved. On the basis of this final set of patterns, a
number of rules of given support and confidence were generated.
A few examples of those rules are presented below:
RULE 1:
IF a company uses an EDI system, THEN the company is willing to help
suppliers to establish an electronic commerce network WITH support of
32% and confidence of 79%.
RULE 2:
IF a company uses electronic commerce in purchase orders frequently,
THEN the reduction of transaction time is extremely important for them
WITH support of 21% and confidence of 81%.
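The text does not define how support and confidence were computed for these pseudo-association rules. The sketch below assumes the standard association-rule definitions: support is the share of all records in which both premise and consequent hold, and confidence is the share of premise-matching records in which the consequent also holds. The survey records and field names are hypothetical and do not reproduce the figures quoted above.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, bool]


def rule_metrics(records: List[Record],
                 premise: Callable[[Record], bool],
                 consequent: Callable[[Record], bool]) -> Tuple[float, float]:
    """Return (support, confidence) of the rule IF premise THEN consequent."""
    premise_count = sum(1 for r in records if premise(r))
    both_count = sum(1 for r in records if premise(r) and consequent(r))
    support = both_count / len(records) if records else 0.0
    confidence = both_count / premise_count if premise_count else 0.0
    return support, confidence


# Ten hypothetical companies: four use EDI, and three of those are willing to help suppliers.
records = (
    [{"uses_edi": True, "helps_suppliers": True}] * 3
    + [{"uses_edi": True, "helps_suppliers": False}] * 1
    + [{"uses_edi": False, "helps_suppliers": True}] * 2
    + [{"uses_edi": False, "helps_suppliers": False}] * 4
)
support, confidence = rule_metrics(
    records,
    premise=lambda r: r["uses_edi"],
    consequent=lambda r: r["helps_suppliers"],
)
print(f"support {support:.0%}, confidence {confidence:.0%}")
```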
5.3 Summary
All of the generated rules were reasonable; however, not all of
them were obvious, and some could not have been easily anticipated. Some of them
simply confirmed the conclusions drawn from the statistical data analysis,
while others produced interesting and novel insight into the problem of
profiling adopters and non-adopters of e-purchasing. On the basis of these
results it can be stated that this approach is appropriate and useful for the
discovery of the profile description hidden in data.
6. CONCLUSION
efficient mining algorithms will have diminishing returns if the rest of the process
remains difficult and manual.
In summary, we highlight that although electronic commerce systems are an ideal
application for data mining, much research is still needed, mostly in areas
of the knowledge discovery process other than the algorithmic phase.
BIBLIOGRAPHY
APPENDIX