
INTERNATIONAL CONFERENCE ON INNOVATIVE DESIGN AND MANUFACTURING

AUGUST, MONTREAL, QUEBEC, CANADA

PRODUCT RECOMMENDATION SYSTEM FOR SMALL ONLINE RETAILERS USING ASSOCIATION RULES MINING

Junnan Chen
CIISE, Concordia University
junn_che@encs.concordia.ca

Courtney Miller
CIISE, Concordia University
co_mille@live.concordia.ca

Gaby G. Dagher
CIISE, Concordia University
daghir@encs.concordia.ca

ABSTRACT
Recommendation systems in e-commerce have become essential tools to help businesses increase
their sales. In this paper, we detail the design of a product recommendation system for small online
retailers. Our system is specifically designed to address the needs of retailers with small data pools
and limited processing power, and is tested for accuracy, efficiency, and scalability on real life data
from a small online retailer.

KEYWORDS
Data Mining, Database Management, E-commerce, Performance

1. INTRODUCTION

The use of product recommendation systems is ubiquitous among large e-commerce companies today; some of the more famous product recommendation modules can be found on Amazon.com (Linden et al, 2003) and eBay. In many ways, the profits of a large e-commerce company can rise and fall on the efficacy of their product recommendation algorithms, which is why such companies often put much of their time and money into these algorithms.

Smaller e-commerce companies, however, often do not have the skill or the scale of resources to implement algorithms like those of Amazon, which has largely put effective product recommendation systems out of the reach of smaller retailers. In order for a small retailer to implement a product recommendation system, such a system must be efficient when running on a server machine with modest computing capability, as small businesses normally do not have the financial capacity to invest in a large infrastructure. The system must also make do with significantly less training data than a powerhouse like Amazon might have. In order to be of use to the company, however, this recommendation system must still be robust enough to make a difference in customer click-through on recommended products.

In this paper, we propose a recommendation system for a real life small retailer. To make the system more robust, we identify multiple product prediction criteria which might apply to any given customer and we weight each of these criteria such that they can be applied based on the current customer to result in a single product recommendation.

The contributions of this paper can be summarized as follows:


Contribution #1: Though much work has been done on data mining algorithms for large businesses with an excellent resource base, we have found relatively little work in the less glamorous area of developing data mining systems for small retail businesses. The constraints and focus points of a data mining solution scaled for small online retailers are much different, and we have implemented our solution with these constraints in mind.

Contribution #2: We have tested our solution on real data from a small online retailer which has transaction data dating back slightly less than three years. We show that the proposed solution is relatively accurate, as well as efficient and scalable.

The rest of the paper is organized as follows. Section 2 reviews the related literature. Section 3 provides the formal definition of the recommendation system problem. Section 4 describes the proposed solution. Comprehensive experimental results are presented in Section 5. Finally, we conclude the paper in Section 6.

2. RELATED WORK

In this section, we briefly overview some relevant work in recommendation systems.

There are five recommendation algorithms that are commonly used: content-based recommendation, collaborative filtering recommendation, association rule-based recommendation, utility-based recommendation and knowledge-based recommendation. In practice, hybrid recommendations, which consist of two or more above-mentioned algorithms, are usually used.

Collaborative filtering is so far the earliest and most successful recommendation technology. It filters the information presented to the user by using information about other users’ preferences (Herlocker et al, 2000). It shows good performance in using existing experiences, interests, and personal tastes to extract preference information which is more difficult to find intuitively. User-based, item-based, and model-based are three collaborative filter types. Despite their efficiency, scalability and sparsity are the main limitations of these types. Cacheda et al. proposed a new algorithm based on trends and user-item differences to improve sparsity by 20% (Cacheda et al, 2011) while Zhang et al. proposed an approach called Smoothing and Fusing (CFSF) strategies which construct a local item-user matrix from large-scale item-user matrices to improve on the scalability issue (Zhang et al, 2009). In recent years, a personalized one-to-one marketing strategy has caught researchers’ attention, along with the fast growth of e-commerce. Item-based and user-based collaborative filtering have proven to be efficient and successful for businesses. Wang et al. proposed an approach which fuses predictions between item ratings from other users, ratings of a different item from the same user, and other similar ratings from other similar users. The model gives better recommendations even on problems with sparsity (Wang et al, 2006).

Using association rules is a traditional data mining method. Agrawal et al. proposed efficient algorithms that fast-mine and prune the generated association rules (Agrawal et al, 1993) (Agrawal and Srikant, 1994). Association rules-based recommendation usually implies the top-N items recommendation. Deshpande et al. developed a model to calculate the similarities between different items and then to output the top recommendations (Deshpande and Karypis, 2004). Ren et al. proposed a model which dynamic-learns using tendencies of the customers’ profile, and thus improves accuracy (Ren et al, 2012).

3. PROBLEM FORMULATION

In this section, we formally define the research problem. First, we present an overview of the problem of scalable product-recommendation in Section 3.1. Next, we define and explain the data inputs in Section 3.2. Finally, we present the problem statement in Section 3.3.

3.1. PROBLEM OVERVIEW

In this paper, we examine a small online retailer that wishes to recommend products to its customers based on certain characteristics mined from its data. This product-recommendation system requires the interaction of the retailer, all of the retailer’s past customer base, and the active customer who has currently made a selection for his or her shopping cart.

The retailer wishes to accurately predict which product recommendation will be most likely to result in the active customer making an extra purchase during this current transaction. The retailer’s customer base and the retailer’s active customer balance two desires: firstly, customers wish for a certain level of privacy and anonymity; secondly, customers may wish to be offered products which better suit their needs and convenience.

Most product-recommendation systems take these priorities into account. However, we go a step further as small online retailers have the following limitations: low computing resources and small data pools. Thus, our algorithm must both restrict its resource needs and make do with less overall data.

3.2. DATA INPUTS


In this section, we give an overview of the input data and its formatting, and of our methodology in cleaning and extracting said data.

We started our project by identifying and extracting relevant data tables from our online retailer. This data comes from an open-source e-commerce database which has been running for slightly less than three years, though with a moderately large established customer base. We took this data entirely out of context by extracting only those columns of interest to our algorithm, many of which contain data ids rather than descriptions or names. Since the source database is normalized, this data will be of great use when it is plugged back into the original database and given back its context.

With regard to product data, we extracted product category ids rather than product ids, in order to improve scalability. For the data set we selected, this should yield high-quality results, since this retailer’s product categories are highly specific. Another small retailer may require the use of product ids instead, if their products are cross-listed across many categories, or if their categories are too loosely generalized. Table-1 provides a snapshot of the ORDERS table in our dataset.

Table 1 - Snapshot of collated data of interest

Figure-1 illustrates the general makeup of our dataset with regard to the products purchased under each category.

Figure 1 – Frequency of category ids in the sample dataset

3.3. PROBLEM STATEMENT

Given normalized data by id as above, the objective is to design a recommendation system which accepts a current product selection as input and returns one recommended product as output. The system must be able to do the following: (1) give an efficient recommendation response despite the limited computing resources; specifically, the recommendation process must consume less than one tenth of a second of extra processing; and (2) make use of a relatively small training dataset, adapting to use different criteria if one criterion is not available.

4. PROPOSED SOLUTION

In this section, we present our proposed product-recommendation system for small retailers. We explain in detail how each part was designed to handle the two problems previously mentioned: low computing resources and small data pools.

4.1. ASSOCIATION RULES EXTRACTION

We define in our system four different criteria to help us determine a good product recommendation. Those criteria are:
• The previous purchases of the customer in question.
• The home country of the customer in question.
• The month of the current purchase.
• The product which the customer has most recently selected.

The goal is to extract hidden patterns from the training dataset that correspond to the four defined criteria. Since one of our constraints is low computing resources, we store the set of patterns corresponding to each criterion in a separate table for fast access. The following algorithms illustrate how the hidden patterns are extracted and stored; the data mining tool R (http://www.rdatamining.com/) is utilized to perform the mining process.

Algorithm 1 association_rules: Association Rules by Category
Function association_rules
{
    Load database.csv;
    Load library ‘arules’;
    Remove duplicate items in each order;
    Read category_id, order_id as ‘Transactions’;
    Generate apriori rules (min support = 0.07, min confidence = 0.5);
    Read apriori rules as ‘Data Frame’;
    Export rules;
}
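The rule-mining step of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' R code: it assumes order lines arrive as (order_id, category_id) pairs, and it only mines rules with a single category on each side, whereas the Apriori implementation in arules can also produce larger itemsets. The function name mine_pair_rules and its parameters are hypothetical; the default thresholds mirror the 7% support and 50% confidence used in Algorithm 1.

```python
from collections import defaultdict
from itertools import permutations


def mine_pair_rules(rows, min_support=0.07, min_confidence=0.5):
    """Return (antecedent, consequent, support, confidence) tuples.

    rows: iterable of (order_id, category_id) pairs, one per order line.
    Simplifying assumption: only one-category -> one-category rules are mined.
    """
    # Group order lines into baskets; a set drops duplicate categories per order.
    baskets = defaultdict(set)
    for order_id, category_id in rows:
        baskets[order_id].add(category_id)

    n_orders = len(baskets)
    if n_orders == 0:
        return []

    item_count = defaultdict(int)  # number of orders containing a category
    pair_count = defaultdict(int)  # number of orders containing both (a, b)
    for items in baskets.values():
        for item in items:
            item_count[item] += 1
        for a, b in permutations(sorted(items), 2):
            pair_count[(a, b)] += 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n_orders          # share of orders with a and b
        confidence = count / item_count[a]  # share of a-orders that also hold b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))

    # Strongest rules first; ids break ties so the order is deterministic.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return rules
```

Calling mine_pair_rules on the rows of a CSV export and keeping the first five consequents per antecedent would give one possible form of the cached category-association table that the later steps query.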


[Figure 2: flow diagram. A data pool of previous orders is pre-processed (ID extraction, collation) and passed to the R data mining algorithms: classification (category ID associations) and frequency analysis (customer ID -> category ID, country ID -> category ID, month ID -> category ID). The resulting cached association rules are queried through a PHP interface using the active order's product category ID, the active customer's customer ID and country ID, and the current month. The top 5 results per criterion feed a weighted recommendation function, which outputs the recommended category ID.]

Figure 2 – Overall process of our recommendation system

In Algorithm-1, we determine association rules between the category ids of products in each transaction within our dataset. We use the ‘arules’ library in R to remove duplicate category ids within the same orders; we also use it to regroup our data from multiple entries per order into transaction-type data, which creates a ‘shopping cart’ format for our data. This shopping cart data is now in the proper format to run through the Apriori algorithm which comes with the arules library. Because we are expecting a high spread of categories, we define our minimum support at 7%, but require the minimum confidence to be 50%. Lastly, we export the data to a new csv file so that it can be later read into a table.

Algorithm 2 country_category_ranks: Most Popular Categories by Country ID
Function country_category_ranks
{
    Load database.csv;
    Load library ‘plyr’;
    Count frequency by country_id and category_id;
    Arrange country_id by descending order;
    Export country_category_ranks;
}

In Algorithm-2, we use frequency analysis to determine which category ids are most closely tied to each country id. We use the ‘plyr’ library in R to do a frequency count of category ids by country id. We then order the data into a more intuitively useful format by ranking it in descending order of frequency. Lastly, we export this data to a new csv file as well, so that we can read it into a different table than the results of Algorithm-1.

Algorithm 3 customer_category_ranks: Most Popular Categories by Customer ID
Function customer_category_ranks
{
    Load database.csv;
    Load library ‘plyr’;
    Count frequency by customer_id and category_id;
    Arrange customer_id by descending order;
    Export category_customer_ranks;
}

In Algorithm-3, we use frequency analysis to determine which category ids are most closely tied to each customer id. We use the ‘plyr’ library in R to do a frequency count of category ids by customer id. We then order the data into a more intuitively useful format by ranking it in descending order of frequency. This data is also exported into a csv file.

Algorithm 4 month_category_ranks: Most Popular Categories by Month ID
Function month_category_ranks
{
    Load database.csv;
    Load library ‘plyr’;
    Read category_id and order_time;
    Format order_time to POSIXlt timestamps;
    Count frequency by month and category_id;
    Arrange month by descending order;
    Export month_category_ranks;
}

In Algorithm-4, we use frequency analysis to determine which category ids are most closely tied to each month of the year. We load the ‘plyr’ library in R, then convert our timestamps into month ids from 1 to 12. We use the ‘plyr’ library in R to do a frequency count of category ids by this month id. We then order the data into a more intuitively useful format by ranking it in descending order of frequency. This data is the last to be exported into a csv file.

Later queries on these tables are made using a PHP interface which simulates integration with an e-commerce software by sending a query to the cached table and requesting a category recommendation.

4.2. DETERMINE BEST CATEGORY

As mentioned in Section-4.1, each criterion has its own table, which directly relates information about the active customer or the last product selection to likely category recommendations. Each of these criteria suggests the top 5 product category ids based on the data of the active customer, starting with the most-suggested category.

Once each criterion’s algorithm has run, its results are weighted based on importance, and the weighted results of each algorithm are added together. The category id with the most support is then selected, and is given as the overall output. Since one of the constraints is small data pools, if a criterion has no relevant data available (for instance, if the customer is new and has made no previous purchases), that criterion will be ignored, and our system will make its recommendation decision based only on other criteria which do have relevant data available.

4.3. EXECUTING A RECOMMENDATION

Normally, the logic used in calculating a recommendation is done in the model layer of an e-commerce website. We used a simple set of PHP functions to handle this calculation; these functions can be added to the model layer of any e-commerce site which uses an MVC framework.

Figure-2 illustrates the overall process of our recommendation system. The process in its entirety consists of four major steps:

Step 1 – Accept Active Order Variables. The customer selects a product, and information from that active customer and his active order is sent to a PHP algorithm in the model layer.

Step 2 – Query Cached Tables. Each relevant table of cached association rules is queried. The queries return up to 5 relevant category ids for potential recommendation, depending on availability. These category ids are returned in order of descending importance.

Step 3 – Weight Query Results. We exclude all criteria with no query results from the calculation. We define the importance (weight) of each result as follows:
• Customer Classification: 40%
• Category Associations: 30%
• Month Classification: 20%
• Country Classification: 10%

We determined the weights above based on the potential specificity of each criterion to the order. That is, a customer’s previous history of purchases gives the most potentially relevant and specific data regarding his purchasing needs, while a customer’s country gives the least specific data regarding his purchasing needs.

Step 4 – Calculate the Best Recommendation. After weighting our query results, we assign each returned category id a ‘value’ commensurate with its ranking (1 through 5) and the weight of its criterion. We then add the value of any category ids which repeat across multiple criteria, and return the category id with the most aggregate value as the recommended category.

5. EXPERIMENTAL EVALUATION

We implemented our product recommendation system in the 64-bit Windows 7 environment on a machine equipped with an AMD E-450 APU 1.65 GHz Processor and 8GB DRAM. The training dataset comes from an active small online retailer whose database contains transactions dating back slightly less than three years. The training dataset contains 4000 records.

5.1. ACCURACY

To measure the accuracy of our solution, we built our test dataset from the same online retailer’s database by extracting all relevant orders which had been placed since the last order we included in our training dataset. We excluded orders which had only one item in their basket (which made them irrelevant to tests); we also excluded orders from new categories which had not been in the original algorithm’s dataset. The test dataset contains 200 records. We start by setting the first listed category_id of each transaction in the test dataset as the ‘active category’ selection. Next, we extract the associated customer_id, country_id, and month_id as the other input variables, and then test to see if the category_id which our system recommends matches the actual one in the order, i.e., determine the percentage of true positives.
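As a rough illustration of the weighting in Steps 3 and 4 and of the true-positive test in Section 5.1, the Python sketch below scores ranked category lists and measures a hit rate. Several details are assumptions, since the paper does not state them: the value formula (weight times 6 minus rank, so rank 1 earns the most), the tie-break by smallest category id, the reading of a "match" as the recommended category appearing among the other categories in the order, and all names (WEIGHTS, recommend, hit_rate, lookup). The authors' implementation is a set of PHP functions that is not shown in the paper.

```python
# Criterion weights from Step 3 of Section 4.3.
WEIGHTS = {"customer": 0.40, "association": 0.30, "month": 0.20, "country": 0.10}
TOP_N = 5


def recommend(ranked_by_criterion, weights=WEIGHTS, top_n=TOP_N):
    """Return the category id with the highest aggregate value, or None.

    ranked_by_criterion maps a criterion name to its category ids, best first.
    """
    scores = {}
    for criterion, categories in ranked_by_criterion.items():
        if not categories:
            continue  # criterion has no data (e.g. a new customer): ignore it
        weight = weights[criterion]
        for rank, category_id in enumerate(categories[:top_n], start=1):
            # Assumed value: rank 1 earns top_n points, rank top_n earns 1.
            value = weight * (top_n - rank + 1)
            scores[category_id] = scores.get(category_id, 0.0) + value
    if not scores:
        return None
    # Highest score wins; the smallest id breaks ties deterministically.
    return min(scores, key=lambda c: (-scores[c], c))


def hit_rate(test_cases, lookup):
    """Fraction of test orders whose recommendation is in the order.

    test_cases: iterable of (inputs, other_categories) pairs, where
    other_categories holds the order's categories besides the active one.
    lookup: callable that maps inputs to a ranked_by_criterion dict.
    """
    hits = 0
    total = 0
    for inputs, other_categories in test_cases:
        total += 1
        if recommend(lookup(inputs)) in other_categories:
            hits += 1
    return hits / total if total else 0.0
```

Dropping an empty criterion without renormalizing the remaining weights does not change which category wins, because renormalizing would scale every score by the same factor.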


Figure 4 – Accuracy of recommendations given for increasing number of order records in the test dataset

Figure-4 depicts the accuracy of our recommendation system, where the number of data records is 100 and 200 respectively. We observe that the accuracy can be as high as 56% when the number of records in the test dataset is 100, and 28% when the number of records in the test dataset is 200. Even though the accuracy decreased when the number of records increased from 100 to 200, we argue that this occurred because the size of the test dataset available to us is small, and this pattern will not bear out over a larger test dataset.

5.2. EFFICIENCY AND SCALABILITY

One major contribution of our work is the development of a scalable product recommendation system. We study the runtime required for recommending a product to an active customer, where the number of records in the training dataset ranges from 1,000 to 4,000, as depicted in Figure-5.

Figure 5 – Scalability of product recommendation w.r.t. the number of records in the training dataset

We observe that our proposed system is scalable as the runtime grows linearly when the number of records increases.

According to Jakob Nielsen, the author of Usability Engineering (Nielsen, 1993), 0.1 second is considered instantaneous from a user’s point of view. Therefore, we argue that our proposed recommendation system is also efficient, because our tests show that it takes on average 7 milliseconds to run a single recommendation.

6. CONCLUSIONS

In this paper, we address the problem of designing a recommendation system that addresses the needs of small online retailers. That is, it should effectively work with small data pools and execute relatively quickly. Our proposed solution makes use of cached association rules to cut down on runtime processing and uses multiple weighted criteria for recommendation in order to make good use of a small data pool. As future work, more product prediction criteria can be added to yield higher recommendation accuracy.

7. REFERENCES

[1] Nielsen, J. (1993). Usability Engineering. Academic Press, Boston, ISBN 0-12-518405-0 (hardcover).
[2] B. Sarwar, G. Karypis, J. Konstan, J. Riedl. Analysis of recommendation algorithms for e-commerce. Proceedings of the 2nd ACM conference on Electronic commerce, Pages 158-167, 2000.
[3] J. Herlocker, J. Konstan, L. Terveen, J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), Volume 22 Issue 1, Pages 5-53, January 2004.
[4] Fidel Cacheda, Víctor Carneiro, Diego Fernández, Vreixo Formoso. Comparison of collaborative filtering algorithms: Limitations of current techniques and proposals for scalable, high-performance recommender systems. ACM Transactions on the Web (TWEB), Volume 5 Issue 1, Article No. 2, February 2011.
[5] G. Linden, B. Smith, J. York. Amazon.com Recommendations: item-to-item collaborative filtering. IEEE Internet Computing, Volume 7, Issue 1. 2003.
[6] D. Zhang, J. Cao, J. Zhou, M. Guo, V. Raychoudhury. An Efficient Collaborative Filtering Approach Using Smoothing and Fusing. ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing, Pages 558-565. September 2009.
[7] J. Wang, A. Vries, M. Reinders. Unifying user-based and item-based collaborative filtering approaches by similarity fusion. SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. 2006.
[8] R. Agrawal, T. Imieliński, A. Swami. Mining association rules between sets of items in large databases. SIGMOD '93: Proceedings of the 1993 ACM SIGMOD international conference on Management of data, Volume 22 Issue 2. June 1993.
[9] R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases. September 1994.
[10] Y. Ren, G. Li, W. Zhou. Improving top-n recommendations with user consuming profiles. Proceedings of the 12th Pacific Rim international conference on Trends in Artificial Intelligence, Pages 887-890, September 2012.
[11] M. Deshpande, G. Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems (TOIS), Volume 22 Issue 1, Pages 143-177, January 2004.


