Customer-Oriented Catalog Segmentation: Effective Solution Approaches
Abstract

We consider in this paper the customer-oriented catalog segmentation problem, which consists of designing K catalogs, each containing r products, so as to maximize the number of covered customers. A customer is covered if he/she has interest in at least a specified minimum number of products in one of the catalogs. The problem addresses the crucial issue of designing the actual contents of the catalogs; it serves as a back-end to catalog production, enabling a more focused design of catalogs as a targeted marketing tool. We develop two algorithms to solve the problem. Results of an extensive computational study using real and synthetic data sets show that one of the proposed algorithms outperforms the state-of-the-art algorithm found in the literature in terms of customer coverage, potentially resulting in a significant increase in organization profit. In the spirit of the guidance role that a Decision Support System (DSS) should play by recommending alternative, satisfactory solutions to the decision maker, the prototype of a DSS integrating all three algorithms (the two proposed ones and the benchmark) is presented to provide the decision maker with an easy-to-use, yet powerful tool to examine various catalog design options and their implications for the contents of the catalogs and the clusters of covered customers.

© 2006 Elsevier B.V. All rights reserved.
Keywords: Decision support system; Catalog design; Customer clustering
1. Introduction

In today's marketing-oriented era, product catalogs are used as an effective tool by a large number of companies that strive to achieve their goals through customer attraction, satisfaction, and retention. For these companies, there is an increasing need to optimize the design of their product catalogs. By producing a variety of catalogs, a company can customize the contents of each catalog to meet the needs of a variety of customer segments. With the fast-growing number of products
Tel.: +1 405 744 8649. E-mail address: [email protected]. doi:10.1016/j.dss.2006.04.010
offered by retailers and e-retailers alike, designing one catalog that contains all products becomes impractical and uneconomical. In many cases, some customers are interested in only a small portion of the products the company carries. It is more convenient for them (and more cost-effective for the company) to receive a catalog that represents just those particular lines or product categories that suit their needs. Print catalogs are widely used even by online stores, which allow customers to place orders online or by phone. For example, Landsend.com and Americanmusical.com have integrated their catalogs with their online stores so that a customer can easily order items online by simply entering the item numbers as they appear in the catalogs.
In fact, the latest Benchmark Report on Operations conducted by Catalog Age magazine in 2005 [5] indicates that 70% of all survey respondents with sales of at least $10 million have their transactional websites fully integrated with their catalog management systems. Catalogs, whether in the form of paper, CD-ROM, or online, are used as an effective tool to attract customers to a store, where they may purchase products beyond the ones contained in the catalog. One of the top challenges facing companies that use catalogs as a marketing tool is the increase in marketing costs. The latest Benchmark Survey on Critical Issues and Trends conducted by Catalog Age magazine in 2003 [4] indicates that the catalog companies that participated in the survey spent a mean of 26.1% of their revenue on marketing expenses, and that print and postage costs account for nearly half of those expenses. Catalogs created in other forms are also expensive. The number of catalogs that companies send to customers and potential buyers is increasing at a fast pace. For instance, Victoria's Secret mails around 395 million catalogs every year. The same survey indicates that nearly four out of five respondents (79%) cited the need to reduce costs without reducing offerings or services as one of the top three management issues facing their companies. Faced with increasing marketing costs combined with the need to compete in the global economy, companies are looking for new ways to produce and use their catalogs more effectively. One way that has already been adopted is the automation of the creation of the catalog layout using expensive state-of-the-art software packages, with price tags as high as millions of dollars depending on the integrated modules and the volume of products the software can handle. These software packages serve as a front-end in the catalog production process.
In this paper, we address the crucial issue of designing the actual contents of the catalogs, which serves as a back-end to catalog production for the purpose of a more focused design of catalogs as a targeted marketing tool. This issue is independent of the catalog delivery medium (e.g., paper, CD, or online), and hence all companies, traditional brick-and-mortar or online, can benefit from proper design of the contents of their catalogs. Similar to the strategy of a manufacturer offering a product line rather than a single product, a company can produce a line of catalogs rather than one single catalog to address the varying needs and interests of its customers. We consider in this paper the customer-oriented catalog segmentation problem, first introduced
in Ester et al. [7] as an extension of the classical catalog segmentation problem [11]. The customer-oriented catalog segmentation problem consists of designing K catalogs, each containing r products, so as to maximize the number of covered customers. A customer is covered if he/she has interest in at least a specified minimum number t of products, called the threshold, in one of the catalogs. A similar problem was mentioned as an open problem in Kleinberg et al. [11]. The customer-oriented catalog segmentation problem is a customer clustering problem where:
The task is to determine K clusters of customers, where each cluster is defined by a set of products of interest to the customers, and each customer is assigned to the cluster (equivalent to a catalog) that contains the largest number of products of interest to him/her. An acceptable clustering should satisfy two constraints: (i) the size of each cluster is r products; (ii) a customer can be assigned to (i.e., covered by) a cluster of products only if the cluster contains at least t products of interest to the customer. The goal is to maximize the number of customers assigned to (i.e., covered by) the clusters.

The conceptual foundation of catalog segmentation is the microeconomic framework for data mining introduced in Kleinberg et al. [10]. In this framework, a company considers many possible decisions about its customers who, depending on the decision selected, contribute differently to the overall utility/benefit of the decision [7]. In the case of segmentation (clustering) problems, the company strives to make the optimal decision per customer segment rather than per individual customer. The utility/benefit of a catalog is measured by the number of customers who have interest in at least a specified minimum number of products in the catalog. The classical catalog segmentation problem, introduced in Kleinberg et al. [11], seeks to maximize the number of catalog products of interest to customers with no regard to the customer's utility to the company, defined in terms of the customer's interest in at least t products in the catalog he/she receives, a concept referred to as customer coverage. The published literature, with the exception of Ester et al. [7], has focused exclusively on the classical segmentation problem. The authors in Refs. [10] and [12] outlined a sampling-based algorithm that basically enumerates and evaluates all possible partitions of a selected sample of
customers. According to the authors [12], the algorithm can only be shown to work under a fairly strong density assumption on the instances in the sample. Xu et al. [15] studied the 2-catalog segmentation problem, where only two catalogs are designed. They developed an approximation algorithm based on semi-definite programming that has a performance guarantee of 1/2 for any catalog size r, and a guarantee greater than 1/2 when the size of the catalog is at least m/3, where m is the number of available products. To overcome the inefficiency of the sampling-based algorithm, Steinbach et al. [13] studied two variations of the problem, in which a customer receives one catalog in the first variation and multiple catalogs over time in the second. They developed three algorithms based on the K-means clustering approach and reported computational results comparing the performance of the three algorithms. As in Ester et al. [7], we study the customer-oriented catalog segmentation problem, where the benefit/utility of a catalog is measured by the number of customers who have interest in at least t products in the catalog sent to them. This concept of customer coverage in catalog segmentation is similar to the concept of minimum support in association rule mining [1,14]. The concept of customer coverage serves several purposes: (i) catalogs are not personalized to the individual level but to a like-behaving group of customers with similar interests [13]; (ii) a catalog sent to a customer containing only very few products of interest to him/her will not get his/her attention and will therefore be ineffective; and (iii) developing catalogs to target the most promising customers can reduce marketing costs. Given the complexity of the customer-oriented segmentation problem (discussed in the next section), we propose two algorithms to solve the problem. The first one constructs the catalogs one at a time in a greedy fashion.
Each catalog initially contains all products, and then products are removed so as to minimize the number of uncovered customers until the required catalog size is reached. This greedy algorithm includes a randomization feature to avoid getting trapped in a local optimum. The second algorithm is inspired by association rule mining in the area of data mining and also constructs the catalogs one at a time. Products are grouped in catalogs to maximize associations between products defined in terms of customer interest relationships. Results of an extensive computational study using real and synthetic data sets show that the proposed greedy algorithm outperforms the state-of-the-art algorithm
developed in Ester et al. [7] in terms of customer coverage, potentially resulting in a significant increase in organization profit. In the spirit of the guidance role that a DSS should play by recommending alternative, satisfactory (or satisficing) solutions to the decision maker (DM) [2], we present the prototype of a DSS integrating all three algorithms to provide the DM with an easy-to-use, yet powerful tool to investigate various catalog design options and their implications for the contents of the catalogs and the clusters of covered customers. The remainder of the paper is organized as follows. In Section 2, we give the definition and formulation of the customer-oriented catalog segmentation problem. In Section 3, we present our two proposed algorithms for solving the problem. Section 4 discusses the DSS that integrates all the algorithms to guide the decision maker in obtaining a satisfactory catalog line. In Section 5, we report the results of the extensive computational study conducted to evaluate the performance of the algorithms in comparison with the one proposed in Ester et al. [7]. Finally, we provide conclusions and directions for future research in Section 6.

2. Problem definition and formulation

We assume that customers and their product interests are known [7]. The main input data for catalog segmentation is the customer interest database, which contains the set of products that each customer is interested in. The product interests of a customer can be obtained either by aggregating all purchase transactions of the customer or by eliciting his/her explicit preferences over a set of products [7]. The customer-oriented catalog segmentation problem consists of designing K catalogs, each containing r products, so as to maximize the number of covered customers. A customer is covered if he/she has interest in at least the specified minimum number (t) of products in one of the catalogs. Let C_j represent the set of products to be included in catalog j.
The following notation is used in the formulation of the model:

N     set of customers, indexed by k
M     set of products, indexed by i
a_ki  1 if customer k has interest in product i, 0 otherwise
T     set of catalogs, indexed by j; |T| = K
r     size of a catalog; i.e., |C_j| = r
t     minimum customer interest threshold
A. Amiri / Decision Support Systems 42 (2006) 1860–1871

Table 1
Customer interest table for the case example

Customer   Products of interest
1          1, 10
2          2, 4
3          1, 3, 4, 5, 7, 8
4          1, 2, 3, 4, 6, 7, 9, 10
5          2, 9
6          1, 2, 4, 6, 9
7          3, 5
8          5, 6, 7, 8
9          1, 4, 5, 6, 7, 8, 9
10         2
The decision variables are:

X_k   1 if customer k is covered, i.e., the customer is interested in at least t products in one of the catalogs; 0 otherwise
V_kj  1 if customer k is covered by catalog j, i.e., the customer is interested in at least t products in the catalog; 0 otherwise
Y_ji  1 if catalog j includes product i; 0 otherwise

Max Z = Σ_{k∈N} X_k    (1)
Table 1 (continued)

Customer   Products of interest
11         1, 2, 3, 6, 8, 9
12         5
13         2, 6, 8
14         2, 3, 4, 6, 7, 8, 9, 10
15         3, 4, 5
16         2, 3, 8
17         4, 6
18         1, 2, 5, 7, 8
19         1, 2, 3, 4, 5, 7, 9
20         3, 4, 5, 6, 10
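For readers who want to check the case example discussed in this section mechanically, the interest data of Table 1 and the coverage rule can be encoded in a few lines. This is our own illustrative sketch, not the paper's code; the data layout and function name are our assumptions.

```python
# Customer interest data transcribed from Table 1.
interests = {
    1: {1, 10}, 2: {2, 4}, 3: {1, 3, 4, 5, 7, 8},
    4: {1, 2, 3, 4, 6, 7, 9, 10}, 5: {2, 9}, 6: {1, 2, 4, 6, 9},
    7: {3, 5}, 8: {5, 6, 7, 8}, 9: {1, 4, 5, 6, 7, 8, 9}, 10: {2},
    11: {1, 2, 3, 6, 8, 9}, 12: {5}, 13: {2, 6, 8},
    14: {2, 3, 4, 6, 7, 8, 9, 10}, 15: {3, 4, 5}, 16: {2, 3, 8},
    17: {4, 6}, 18: {1, 2, 5, 7, 8}, 19: {1, 2, 3, 4, 5, 7, 9},
    20: {3, 4, 5, 6, 10},
}

def covered_by(catalog, interests, t):
    """Customers interested in at least t products of the catalog."""
    return {k for k, prods in interests.items() if len(prods & catalog) >= t}

# The example solution from the text: catalogs {2,4,6,9} and {1,5,7,8}, t = 4.
catalog1, catalog2, t = {2, 4, 6, 9}, {1, 5, 7, 8}, 4
print(sorted(covered_by(catalog1, interests, t)))  # -> [4, 6, 14]
print(sorted(covered_by(catalog2, interests, t)))  # covers customers 3 and 9, among others
```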
Subject to:

Σ_{i∈M} Y_ji = r    ∀ j ∈ T    (2)
t · V_kj ≤ Σ_{i∈M} a_ki · Y_ji    ∀ k ∈ N, j ∈ T    (3)

X_k ≤ Σ_{j∈T} V_kj    ∀ k ∈ N    (4)

X_k, V_kj, Y_ji ∈ {0, 1}    ∀ k ∈ N, j ∈ T, i ∈ M    (5)

The objective function represents the number of customers covered by at least one catalog. Constraints (2) ensure that each catalog includes r products. Constraints (3) guarantee that a catalog j ∈ T covers a customer k ∈ N only if the customer is interested in at least t products in the catalog. Constraints (4) ensure that a customer k ∈ N is covered (i.e., X_k = 1) only if he/she is covered by at least one catalog. Constraints (5) represent the integrality requirements for the decision variables. Note that if the sizes of the catalogs are different, then the right-hand sides of constraints (2) should be set to Q_j, where Q_j is the number of products to include in catalog j. The two algorithms developed in the next section, as well as the algorithm presented in Ester et al. [7], can easily be adapted to handle this extension.

The following case example illustrates the definition of the problem. There are 20 customers and 10 products. The customer interest database is shown in Table 1. The threshold for the minimum customer interest is t = 4 products. The company wants to design 2 catalogs (i.e., K = 2) of size r = 4 products each. One possible solution consists of catalogs 1 and 2 made of products {2,4,6,9} and {1,5,7,8}, respectively. This results in customers 4, 6, and 14 covered by catalog 1 and customers 3 and 9 covered by catalog 2. The task in the customer-oriented catalog segmentation problem is to create the catalog line that yields the optimal customer coverage.

3. Catalog segmentation algorithms

The customer-oriented segmentation problem is very complex from a computational point of view. It is NP-hard [7], as it is a generalization of the well-known NP-hard maximum set covering problem [8]. This means the time needed to solve the problem increases at a much higher (non-polynomial) rate than the size of the problem. Therefore, for problem instances of realistic size, it is impractical to obtain optimal solutions in a reasonable amount of time. For example, if the task is to create one catalog containing 10 products chosen from 40 available products, the total number of possible configurations exceeds 10^9. This makes the strategic task of selecting even, say, four catalogs combinatorially complex. Consequently, heuristics or approximate algorithms are proposed to generate good feasible solutions efficiently, but not necessarily optimally. This is in accordance with the recommendation of Barkhi et al. [2] to use approximate algorithms to find satisfactory (or satisficing) solutions in a timely fashion to complex problems such as ours, when traditional optimization techniques such as Branch and Bound are doomed to fail to generate optimal (or even near-optimal) solutions in an acceptable amount of time.

We propose two algorithms to solve the customer-oriented segmentation problem. The first one, called the Greedy Out (GO) algorithm, constructs the catalogs one at a time. Each catalog is formed initially by including all the products in the catalog and then removing (|M| − r) products, one at a time, from the catalog in a greedy fashion so as to minimize the number of uncovered customers. The Greedy Out algorithm includes a randomization feature to avoid getting trapped in a local
optimum. The second one, called the Association-Based (AB) algorithm, is inspired by association rule mining in the area of data mining and also constructs the catalogs one at a time. Products are grouped in catalogs to maximize associations between products defined in terms of customer interest relationships.

3.1. The Greedy Out algorithm

The Greedy Out algorithm, as its name indicates, constructs the catalogs in a greedy fashion so as to maximize the number of customers covered. Each catalog C_j initially contains all available products, and then products are removed consecutively from the catalog until the number of products left in the catalog equals the required catalog size r, while minimizing the decrease in the number of covered customers. The algorithm uses two quantities to guide the search for the victim product to remove next from the catalog. The first quantity, called counter(k), represents the number of products in the current catalog that customer k is interested in; it initially equals Σ_{i∈M} a_ki. The second quantity, called score(i), is computed for each product i included in the catalog j being formed as follows:

score(i) = Σ_{k ∈ N_j^c : a_ki = 1} 1/counter(k),

where N_j^c = set of customers (not covered by previous catalogs) who have interest in at least t products in the current catalog j. The quantity score(i) is used to identify the next product to remove from the current catalog: it is the product with the lowest score, that is, the product that makes the lowest number of customers uncovered or closer to being uncovered by the current catalog. The Greedy Out algorithm is outlined below.

Step 1: Initialization:
  C_j = M ∀ j ∈ T; i.e., each catalog initially includes all products.
  counter(k) = Σ_{i∈M} a_ki ∀ k ∈ N.
  Set of covered customers: N^c = ∅.

Step 2: For each catalog j ∈ T do:
  N_j^c = set of customers (not covered by previous catalogs) who have interest in at least t products in the current catalog j (i.e., temporarily covered by catalog j, which is under construction).
  For h := 1 to |M| − r do:
    For each product i ∈ M included in the current catalog j, calculate score(i).
    Remove the product i with the lowest score from catalog j, i.e., C_j := C_j \ {i}.
    For each customer k ∈ N_j^c who has an interest in i do:
      counter(k) := counter(k) − 1;
      if counter(k) < t, then N_j^c := N_j^c \ {k}.
  N^c := N^c ∪ N_j^c.
  For each uncovered customer k ∉ N^c, reset counter(k) := Σ_{i∈M} a_ki.
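The outline above can be turned into a short working sketch. This is our own reading, not the author's code: it is the deterministic variant without the randomization feature, and ties in the lowest score are broken arbitrarily.

```python
def greedy_out(interests, K, r, t):
    """Build K catalogs of r products each; return (catalogs, covered set)."""
    M = set().union(*interests.values())              # all products
    covered = set()                                   # N^c
    catalogs = []
    for _ in range(K):
        Cj = set(M)                                   # start from all products
        # customers not covered so far with >= t interests in the full catalog
        Nj = {k for k in interests
              if k not in covered and len(interests[k] & Cj) >= t}
        counter = {k: len(interests[k] & Cj) for k in Nj}
        while len(Cj) > r:
            # score(i): sum of 1/counter(k) over temporarily covered
            # customers interested in i; remove the lowest-scoring product
            score = {i: sum(1.0 / counter[k] for k in Nj if i in interests[k])
                     for i in Cj}
            victim = min(Cj, key=score.__getitem__)
            Cj.remove(victim)
            for k in [k for k in Nj if victim in interests[k]]:
                counter[k] -= 1
                if counter[k] < t:
                    Nj.discard(k)                     # customer drops out
        covered |= Nj                                 # N^c := N^c U N_j^c
        catalogs.append(Cj)
    return catalogs, covered
```

On a tiny instance with two disjoint interest groups, the sketch produces one catalog per group, covering all customers.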
Basically, the algorithm works as follows. Initially, none of the customers is covered (i.e., N^c = ∅). The catalogs are then constructed one at a time. Each catalog j initially contains all available products (i.e., C_j = M); then products are removed until the catalog size r is reached, and at this stage the customers in N_j^c, i.e., the customers who do not belong to N^c but still have interest in at least t products in catalog j, become covered (i.e., N^c := N^c ∪ N_j^c).

Greedy heuristics can get trapped in a local optimum. To overcome this problem, we add a randomization feature to our Greedy Out algorithm, similar to the one used in Haouari and Chaouachi [9] to solve the set covering problem. Instead of removing the product i with the lowest score from the catalog in Step 2, we proceed in two sub-steps as follows. First, a set B of the f products with the lowest scores is identified (ties are broken arbitrarily). Second, a product in B is randomly selected for removal from the catalog. The selection probability of a product i ∈ B is

p_i = (S_max − score(i) + 1) / Σ_{i∈B} (S_max − score(i) + 1),

where S_max = max{score(i) : i ∈ B}.
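The two randomized sub-steps can be sketched as follows (our hedged illustration; the score dictionary, the parameter f, and the rng argument are assumptions, not the paper's interface):

```python
import random

def pick_victim(scores, f, rng=random):
    """Randomly choose a product among the f lowest-scoring ones.

    The selection probability of product i in B is proportional to
    S_max - score(i) + 1, so lower-scoring products are favored.
    """
    B = sorted(scores, key=scores.__getitem__)[:f]   # f lowest scores
    s_max = max(scores[i] for i in B)
    weights = [s_max - scores[i] + 1 for i in B]
    return rng.choices(B, weights=weights, k=1)[0]
```

With f = 1 the choice is deterministic (the lowest-scoring product), which recovers the pure greedy rule.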
This probability distribution is decreasing in the scores of the products, so the products with the lowest scores are favored. The value of the parameter f should be strictly greater than 1 to make the randomization feature active, but a large value of f tends to make the heuristic purely random. Given the significant computational complexity of the customer-oriented catalog segmentation problem, one can speed up the Greedy Out algorithm (as well as the other algorithms) by restricting the search for the products to include in the catalogs to the subset of the L most popular products, i.e., the products with the highest interest among customers. These popular products can easily be identified by first sorting the products in non-increasing order of Σ_{k∈N} a_ki and then selecting the first L products. The proposed Greedy Out algorithm differs from the Randomized Best Product Fit algorithm (referred to as RBPF) proposed in Ester et al. [7] in the
following ways. First, our algorithm starts with a catalog containing all products and then removes the (|M| − r) products whose removal results in the fewest customers losing coverage, whereas the RBPF algorithm starts with an empty catalog and adds the r products that result in the coverage of the highest number of customers. Second, the randomization feature is fully integrated into the Greedy Out algorithm so as to incorporate diversification within the algorithm, whereas the randomization step in the RBPF algorithm is applied only at the end of the algorithm. This step is repeated a specified number of times and involves randomly selecting a catalog, a product in the catalog, and a product not in the catalog, and swapping the two products if the number of covered customers increases.

3.2. The Association-Based algorithm

Ideally, one wants to form a catalog by selecting a subset of r products from among the |M| available products that covers the maximum number of customers. Unfortunately, a complete enumeration of the exponential number of such subsets is computationally infeasible. The Association-Based algorithm groups products in a catalog to maximize associations between products defined in terms of customer interest relationships. These interest relationships indicate collections of products that are frequently liked together, i.e., have high associations. The motivation for this algorithm is the successful use of association rule mining in the area of data mining, especially in market basket analysis. By mining basket data, a marketer identifies associations between products that are often purchased together by customers. The Association-Based algorithm uses the following parameter: S_ih is the support for the 2-item set of products i and h, i.e., the number of uncovered customers who like both products i and h; in other words, S_ih = Σ_{k∈N^u} a_ki · a_kh, where N^u is the set of uncovered customers.
We limit ourselves to creating a catalog as the union of 2-item sets such that the sum of their pairwise supports is maximized. This is equivalent to solving the following quadratic problem (Q):

Max Σ_{i∈M} Σ_{h∈M, h>i} S_ih · W_i · W_h
Table 2
Application of the algorithms to the case example

Algorithm                                    Customers covered
RBPF                                         3
Greedy Out                                   5
Association-Based (Q solved heuristically)   3
Association-Based (Q solved optimally)       5
Optimal (CPLEX)                              6
Subject to:

Σ_{i∈M} W_i = r

W_i ∈ {0, 1}    ∀ i ∈ M,

where W_i = 1 if product i is included in the catalog, 0 otherwise.
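One greedy way to tackle problem Q in code is to seed each catalog with the highest-support product pair and then repeatedly add the product with the highest total support with the products already in the catalog. The sketch below is our own reading of that idea; the data layout and tie-breaking are assumptions.

```python
from itertools import combinations

def association_based(interests, K, r, t):
    """Build K catalogs of r products each via pairwise supports."""
    M = sorted(set().union(*interests.values()))
    uncovered = set(interests)                       # N^u
    catalogs = []
    for _ in range(K):
        # S_ih: number of uncovered customers interested in both i and h
        S = {(i, h): sum(1 for k in uncovered
                         if i in interests[k] and h in interests[k])
             for i, h in combinations(M, 2)}
        i, h = max(S, key=S.__getitem__)             # best pair seeds catalog
        Cj = {i, h}
        while len(Cj) < r:
            # add the product with the highest total support with Cj
            best = max((p for p in M if p not in Cj),
                       key=lambda p: sum(S[tuple(sorted((p, q)))] for q in Cj))
            Cj.add(best)
        catalogs.append(Cj)
        uncovered -= {k for k in uncovered if len(interests[k] & Cj) >= t}
    return catalogs
```

Recomputing the supports over the remaining uncovered customers before each catalog mirrors the outline below: customers covered by earlier catalogs no longer influence later ones.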
The Association-Based algorithm is outlined below.

Step 1: Initialization:
  C_j = ∅ ∀ j ∈ T.
  Set of uncovered customers: N^u = N.

Step 2: For each catalog j ∈ T do:
  Compute S_ih for every pair of products (i, h).
  Solve problem Q (either optimally or heuristically) to create catalog j.
  If a customer k is covered by catalog j, then N^u := N^u \ {k}.

Unfortunately, problem Q can be solved optimally and efficiently only for small problems. For large problems, we solve problem Q heuristically as follows. First, we select the pair of products that has the highest support among all pairs and include both products in the catalog. Next, we add products in a greedy fashion, one at a time. At each iteration, the product added to the catalog is the one that has the highest total support with the products already in the catalog.

3.3. Illustrative example

The results of applying the three algorithms to the case example are shown in Table 2. The optimal solution obtained using CPLEX [6] is also included in the same table. The RBPF algorithm [7] produces a catalog line covering 3 customers. The solution produced by the Greedy Out algorithm covers 5 customers. The quality of the solution generated by
Fig. 1. DSS architecture.
the Association-Based algorithm depends on whether problem Q is solved heuristically (3 customers are covered) or optimally (5 customers are covered). The optimal catalog line covers 6 customers.

4. DSS for customer-oriented catalog segmentation

In the spirit of the guidance role that a DSS should play by recommending alternative, satisfactory (or satisficing) solutions to the decision maker (DM) [2], we developed a prototype of a DSS integrating all three algorithms to provide the DM with an easy-to-use, yet powerful tool to investigate various catalog design options and their implications for the contents of the catalogs and the clusters of covered customers. The main purpose of the DSS is to aid the DM in designing a catalog line and creating customer clusters. The DSS can be used on any PC running Windows. The general architecture of the DSS is shown in Fig. 1. The user interface is developed using the Delphi object-oriented programming language. The interactive environment of the DSS is based on the WIMP (windows, icons, menus, pointer) paradigm. A menu bar at the top of the screen lists the titles of the available pull-down menus. Scalable windows are used for displaying tabular or graphical information, allowing the visualization of the same information in different forms.
Dialog boxes are used to enter data, display messages, etc. Data entered by keying values into the appropriate locations in a dialog box is validated after the DM clicks the Ok button, which helps avoid input errors. The Cancel button in a dialog box nullifies the latest user request. The Close button is used to close a window that serves mainly to display information. The DSS provides the DM with an easy-to-use, yet powerful tool to tackle this complex catalog segmentation problem. It allows the DM to examine various catalog design alternatives and their effects on the contents of the catalogs and the clusters of covered customers. When the DSS is launched, it displays a start-up welcome notice (sometimes called a splash window) (Fig. 2). This window contains the name of the application (Catalog Segmentation Optimizer) and could contain other information (not shown here), such as user support information. Initially, the DM can load the customer interest database and specify values for the problem parameters, such as the number of catalogs to create, their sizes, and the minimum interest threshold (Fig. 3). The customer interest data can be stored in a spreadsheet or in a relational database such as an MS Access or SQL Server database. Next, the DM can choose to solve the problem (using the RBPF, Greedy Out, and Association-Based algorithms) from the Solve pull-
down menu. The user can display the generated solution either graphically or in tabular form (Fig. 4). He/she can also solve the problem repeatedly for different values of the problem parameters in order to identify the best compromise design. The DM can, of course, edit the problem at any time by changing its parameter values.

5. Computational experiments and results

We conducted extensive computational experiments using both real and synthetic data sets. This section describes the data used in the computational study and analyzes the results.

5.1. Results for the real data set

The real data set corresponds to a retail market data set from an anonymous Belgian retail store reported in Brijs et al. [3]; it is available at the FIMI repository (http://fimi.cs.helsinki.fi/). The real data set includes 88 162 customers and 16 469 products, with each customer interested in 10.3 products on average. This data set is much larger than the real data set used in Ester et al. [7] based on the data set density (i.e., Σ_{k∈N} Σ_{i∈M} a_ki, which equals 908 576 for our data set and 355 908 for the data set used in Ester et al. [7]). The task is to create a catalog line of 5 catalogs, each containing 100 products. The threshold t for customer coverage equals 3. The RBPF algorithm [7] generated a catalog line that covers 54 483 customers, whereas our Greedy Out algorithm created a catalog line that covers 56 100 customers. This corresponds to a 3% improvement in customer coverage. When we designed a catalog line with 4 catalogs, each containing 500 products, and a threshold of 10, the RBPF algorithm generated a catalog line that covers 11 302 customers, whereas our Greedy Out algorithm created a catalog line that covers 12 855 customers, corresponding to a 13.74% improvement in customer coverage. This important improvement in customer coverage can potentially result in a significant increase in profit. The RBPF algorithm [7] took more than 20 h to solve the problem, whereas our Greedy Out algorithm took only 5 h.

5.2. Results for the synthetic data sets

We investigate the impact of various problem characteristics, such as the number of customers, the number of products, and the catalog size, on the performance of the three algorithms, namely the RBPF, Greedy Out, and Association-Based (solved heuristically) algorithms, using the synthetic data sets. To this end, we employed a full factorial design experiment for the synthetic data sets using the following factors:

1. Number of customers (5000, 10 000, 20 000, 30 000)
2. Number of products (1000, 2000)
3. Number of catalogs (1–4)
4. Catalog size (200–1000)
5. Threshold (8–12).
Following Ester et al. [7], the synthetic data sets are generated using the well-known IBM data generator [1]. We generated 10 instances of each problem category, resulting in a total of 65 data sets. Some of these data sets are much larger than the synthetic data set used in Ester et al. [7] based on the data set density (i.e., Σ_{k∈N} Σ_{i∈M} a_ki, which equals 613 984 for our largest data set and 376 713 for the data set used in Ester et al. [7]). The results of applying the algorithms are given in Tables 3–5, which show the effects of changes in the threshold (t), the number of catalogs (K), and the catalog size (r), respectively, on the performance of all three algorithms. The average interest size reported in the tables represents the average number of products a customer is interested in. We compare the performance of the three algorithms by comparing the objective function values of the solutions they generate. To this end, we use Gap_X,Y to represent the percentage change (i.e., an increase if Gap_X,Y > 0 and a decrease if Gap_X,Y < 0) in the number of customers covered by the catalog line generated by algorithm X over the number of customers covered by the catalog line generated by algorithm Y. More specifically, Gap_X,Y = 100% × (|N_X^c| − |N_Y^c|) / |N_Y^c|, where |N_Z^c| is the number of customers covered by the solution obtained using algorithm Z; Z is either
Table 3
Effects of changes in threshold (t) on the performance of the solution algorithms

|N| = 5000, |M| = 1000, average interest size = 20.5, K = 4, r = 500
t     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
8     27.96          29.78        1.42           565            413          587
9     28.84          33.00        3.23           675            436          608
10    31.18          36.88        4.34           780            471          642
11    36.40          43.92        5.51           916            506          702
12    42.37          50.28        5.55           1050           549          760

|N| = 10 000, |M| = 2000, average interest size = 20.6, K = 4, r = 500
t     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
8     26.64          32.68        4.77           6445           9295         12 445
9     28.82          37.75        6.93           7542           9980         13 979
10    31.44          45.33        10.57          8447           10 260       13 392
11    27.98          49.65        16.93          9251           10 846       13 485
12    29.06          59.08        23.25          10 203         11 879       14 608

|N| = 20 000, |M| = 2000, average interest size = 20.5, K = 4, r = 500
t     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
8     25.42          31.76        5.05           7154           8693         11 137
9     42.61          51.77        6.42           8570           9711         11 717
10    65.73          79.52        8.32           9800           10 499       11 657
11    93.49          118.18       12.76          10 889         10 917       11 654
12    122.89         165.67       19.19          11 760         11 162       11 808

|N| = 30 000, |M| = 2000, average interest size = 20.5, K = 4, r = 500
t     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
8     27.23          32.19        3.90           11 263         13 847       16 852
9     43.78          52.47        6.04           13 227         15 000       18 252
10    69.33          80.59        6.65           14 994         16 044       19 526
11    97.69          119.78       11.18          16 500         16 571       20 167
12    131.19         167.91       15.88          17 814         16 875       20 537

RBPF: Randomized Best Product Fit algorithm; AB: Association-Based algorithm; GO: Greedy Out algorithm. GapX,Y = 100% × [(|NX^c| − |NY^c|) / |NY^c|], where |NZ^c| is the number of customers covered by the solution obtained using algorithm Z.
RBPF, the Greedy Out (GO), or the Association-Based (AB) algorithm. The higher GapX,Y is, the better the performance of algorithm X compared to algorithm Y. The gaps reported are averages over the 10 instances in each category. The computational tests were run on a PC with an Intel Pentium III 1800 MHz processor and 512 MB of memory. Based on the results of the computational experiments reported in Tables 3–5, the Association-Based algorithm was the least effective in solving the problem.
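As a concrete illustration of these evaluation measures, the following sketch computes customer coverage and GapX,Y from a binary interest matrix. The data layout and function names are our own assumptions for illustration, not notation from the paper.

```python
# Illustrative sketch (assumed data layout): a[k][i] == 1 iff customer k is
# interested in product i; a catalog is a set of product indices.

def covered_customers(a, catalogs, t):
    """Customers interested in at least t products of some catalog."""
    covered = set()
    for k, interests in enumerate(a):
        if any(sum(interests[i] for i in catalog) >= t for catalog in catalogs):
            covered.add(k)
    return covered

def gap(covered_x, covered_y):
    """GapX,Y: percentage change in coverage of algorithm X over algorithm Y."""
    return 100.0 * (covered_x - covered_y) / covered_y
```

For example, `gap(56100, 54483)` evaluates to roughly 2.97, which matches the 3% improvement reported above for the real data set.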
Table 4
Effects of changes in number of catalogs (K) on the performance of the solution algorithms

|N| = 5000, |M| = 1000, average interest size = 20.5, r = 500, t = 10
K     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
1     11.75          9.22         −2.27          465            273          289
2     23.05          24.90        1.50           594            354          410
3     28.14          32.32        3.26           691            417          526
4     31.18          36.88        4.34           780            471          642

|N| = 10 000, |M| = 2000, average interest size = 20.6, r = 500, t = 10
K     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
1     9.19           12.23        2.58           6888           8538         9025
2     5.05           13.56        8.10           7537           9460         11 477
3     6.96           16.71        9.11           8090           9895         13 860
4     6.49           17.75        10.57          8447           10 260       16 173

|N| = 20 000, |M| = 2000, average interest size = 20.5, r = 500, t = 10
K     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
1     23.77          28.15        3.41           5953           6079         6034
2     28.84          37.02        6.35           7560           7912         8618
3     36.88          47.52        7.77           8791           9324         11 275
4     40.11          51.77        8.32           9800           10 499       13 641

|N| = 30 000, |M| = 2000, average interest size = 20.5, r = 500, t = 10
K     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
1     28.70          31.45        1.78           9172           9291         11 431
2     30.46          36.53        4.65           11 724         12 216       15 030
3     39.15          46.88        5.55           13 661         14 442       17 571
4     42.96          52.47        6.65           14 994         16 044       19 740

RBPF: Randomized Best Product Fit algorithm; AB: Association-Based algorithm; GO: Greedy Out algorithm. GapX,Y = 100% × [(|NX^c| − |NY^c|) / |NY^c|], where |NZ^c| is the number of customers covered by the solution obtained using algorithm Z.
A. Amiri / Decision Support Systems 42 (2006) 1860–1871

Table 5
Effects of changes in catalog size (r) on the performance of the solution algorithms

|N| = 5000, |M| = 1000, average interest size = 20.5, K = 4, t = 10
r     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
200   92.37          135.95       22.66          942            1414         1866
300   51.76          65.59        9.11           1030           1029         1327
400   39.90          48.36        6.05           902            666          879
500   31.18          36.88        4.34           780            471          642
600   8.04           9.81         1.63           710            349          471

|N| = 10 000, |M| = 2000, average interest size = 20.6, K = 4, t = 10
r     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
300   72.80          171.71       57.24          6180           12 840       17 976
400   34.34          59.19        18.50          7906           12 168       15 575
500   31.44          45.33        10.57          8447           10 260       13 392
600   22.32          32.16        8.04           9339           9163         12 095
700   18.47          25.97        6.33           8671           9737         11 879
800   15.03          21.10        5.28           8357           6720         8333
900   12.03          16.72        4.18           8118           6077         7961
1000  11.18          15.00        3.44           7849           5305         6631

|N| = 20 000, |M| = 2000, average interest size = 20.5, K = 4, t = 10
r     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
300   95.92          188.00       47.00          6250           12 618       17 160
400   79.22          106.48       15.21          8843           11 835       15 622
500   65.73          79.52        8.32           9800           10 499       11 657
600   44.64          54.50        6.81           9725           8632         11 740
700   38.47          46.40        5.73           9051           6977         9210
800   28.64          34.30        4.40           8777           5807         7201
900   23.28          27.52        3.44           8368           4849         5867
1000  21.08          24.84        3.11           8030           4142         5178

|N| = 30 000, |M| = 2000, average interest size = 20.5, K = 4, t = 10
r     %Gap RBPF,AB   %Gap GO,AB   %Gap GO,RBPF   CPU RBPF (s)   CPU GO (s)   CPU AB (s)
300   210.66         344.39       43.05          10 422         18 960       23 510
400   129.77         160.82       13.51          13 562         17 849       21 597
500   69.33          80.59        6.65           14 994         16 044       19 526
600   53.22          61.56        5.45           15 180         13 575       18 598
700   45.18          51.91        4.64           14 200         11 446       15 567
800   39.05          44.32        3.79           14 318         9727         12 840
900   34.73          39.17        3.29           14 200         8292         10 282
1000  31.69          35.60        2.97           13 164         7125         8621

RBPF: Randomized Best Product Fit algorithm; AB: Association-Based algorithm; GO: Greedy Out algorithm. GapX,Y = 100% × [(|NX^c| − |NY^c|) / |NY^c|], where |NZ^c| is the number of customers covered by the solution obtained using algorithm Z.
The most likely reason is that problem Q was solved heuristically. Computational tests (not reported here) using very small instances show that the algorithm produced very competitive results when problem Q was solved optimally. However, solving problem Q optimally using the synthetic data sets described in this section turned out to be computationally impractical. Despite its relatively poor performance, the Association-Based algorithm is conceptually innovative. It draws its logic from the successful use of association rule mining in the area of data mining, especially in market basket analysis. The only shortcoming of the implementation
Fig. 5. Percentage of covered customers vs. threshold (t) using Greedy Out algorithm.
Fig. 6. Percentage of covered customers vs. number of catalogs (K) using Greedy Out algorithm.
of this algorithm is that the heuristic it uses to solve problem Q is not very effective. However, this does not diminish the merit of presenting and discussing the algorithm in this paper. In future research, more effective and efficient heuristics for solving problem Q could make the algorithm a more viable alternative for solving the customer-oriented catalog segmentation problem. Based on the results of the same experiments, our other proposed algorithm, Greedy Out, was the best; it produced solutions that are on average 9.14% better than those produced by RBPF. This out-performance is statistically significant (p-value = 0.000). The performance gap between Greedy Out and RBPF widens significantly in favor of Greedy Out with increases in the threshold (t) and the number of catalogs to create (K), and with a decrease in the size of the catalogs (r). Greedy Out and RBPF are also preferable to an exhaustive search for the optimal solution, through branch and bound for example. Indeed, the state-of-the-art commercial optimization software CPLEX [6], which implements a branch-and-bound approach, is very ineffective and inefficient in solving even small instances of the customer-oriented catalog segmentation problem. For example, we used Greedy Out, RBPF, and CPLEX to solve a small instance of the problem with 1000 customers and 20 products, constructing a catalog line of 4 catalogs, each containing 4 products, with a threshold equal to 4. The Greedy Out and RBPF algorithms generated catalog lines that cover 435 and 417 customers, respectively, each in less than 1 s. CPLEX, in contrast, generated a catalog line that covers only 419 customers after running for 7200 s. The common solver CPLEX is therefore not suitable for solving the customer-oriented catalog segmentation problem. Figs. 5–7 show the change in the percentage of covered customers due to changes in the threshold (t), the number of catalogs (K), and the catalog size (r), respectively,
for the data sets with 5000 customers and 1000 products (using the Greedy Out algorithm). It is important to note that it may be impossible to cover 100% of the customers, as some customers may be interested in fewer products than the threshold t.

6. Concluding remarks

The microeconomic view of data mining has been widely used as a theoretical framework capturing the notion of utility of the discovered knowledge, especially in the area of catalog segmentation. We studied in this paper the customer-oriented catalog segmentation problem, in which the utility/benefit of a catalog is measured by the number of customers who have interest in at least a specified minimum number of products in the catalog. Specifically, the problem consists of designing K catalogs, each of size r products, that maximize the number of covered customers. A customer is covered if he/she has interest in at least a specified minimum number of products in one of the catalogs. The problem addresses the crucial issue of designing the actual contents of the catalogs, serving as a back-end to catalog production for the purpose of more focused design of catalogs as a targeted marketing tool. We proposed two algorithms to solve the customer-oriented catalog segmentation problem. The first one, called the Greedy Out algorithm, constructs the catalogs one at a time. Each catalog is formed by initially including all the products and then removing (|M| − r) products, one at a time, in a greedy fashion so as to minimize the number of uncovered customers. The Greedy Out algorithm includes a randomization feature to avoid getting trapped in a local optimum. The second one, called the Association-Based (AB) heuristic, is inspired by association rule mining in the area of data mining and also constructs the catalogs one at a time. Products are grouped in catalogs so as to maximize associations between products, defined in terms of customer interest relationships.
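To make the Greedy Out construction concrete, the following is a minimal sketch of the idea just described. It is our own simplified rendering, not the paper's implementation: customer interests are modeled as sets of product indices, the paper's randomization feature is omitted, and the choice to set aside customers already covered by an earlier catalog is our assumption.

```python
# Simplified sketch of the Greedy Out idea: build one catalog at a time,
# starting from all products and greedily removing (|M| - r) of them, each
# time dropping the product whose removal loses the fewest covered customers.
# The randomization feature described in the paper is omitted here.

def count_covered(rows, catalog, t):
    """Number of customers (interest sets) with >= t interests in catalog."""
    return sum(1 for interests in rows if len(interests & catalog) >= t)

def greedy_out_catalog(rows, products, r, t):
    """Shrink the full product set down to a catalog of size r."""
    catalog = set(products)
    while len(catalog) > r:
        # Remove the product whose removal keeps coverage highest.
        best = max(catalog,
                   key=lambda p: count_covered(rows, catalog - {p}, t))
        catalog.remove(best)
    return catalog

def greedy_out(rows, products, K, r, t):
    """Build K catalogs; customers covered earlier are set aside (our assumption)."""
    remaining = list(rows)
    catalogs = []
    for _ in range(K):
        catalog = greedy_out_catalog(remaining, products, r, t)
        catalogs.append(catalog)
        remaining = [s for s in remaining if len(s & catalog) < t]
    return catalogs
```

Each removal step scans all remaining products and re-evaluates coverage, which is what makes the greedy construction tractable compared with searching over all size-r subsets.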
We conducted an extensive computational study using real and synthetic data sets to compare the performance of the proposed algorithms to that of the Randomized Best Product Fit (RBPF) algorithm proposed in Ester et al. [7]. The results show that the Greedy Out algorithm outperforms the AB and RBPF algorithms in terms of customer coverage, potentially resulting in a significant increase in organization profit. Using the synthetic data sets, the Greedy Out algorithm covers on average 59.65% and 9.14% (statistically significant) more customers than the AB and RBPF algorithms, respectively. The out-performance of the Greedy Out algorithm over the RBPF algorithm is also
Fig. 7. Percentage of covered customers vs. catalog size (r) using Greedy Out algorithm.
significant for the real data set, as the former algorithm covered 3% to 13.74% more customers than the latter. In the spirit of the guidance role that a DSS should play by recommending alternative, satisfactory (or satisficing) solutions to the decision maker (DM) [2], we presented the prototype of a DSS integrating all three algorithms to provide the DM with an easy-to-use, yet powerful tool to examine various catalog design options and their implications for the contents of the catalogs and the clusters of covered customers. Although the current study makes a significant contribution in the area of catalog segmentation, by developing an algorithm that outperforms an algorithm previously proposed by other researchers and by integrating all three algorithms discussed here in a DSS that guides the decision maker in designing a satisfactory catalog line, some important issues in catalog segmentation remain unaddressed by this and past studies and could be the subject of future research. One such issue is the need for a way to evaluate the quality of solutions to the problem, for example by generating tight upper bounds. This is a very challenging issue, since the customer-oriented catalog segmentation problem is NP-hard even when only one catalog is to be created and the interest threshold t = 1 [7]. Another issue concerns the space requirements of the products included in the catalogs. A product included in a catalog consumes space for a description and an image of the product. Space requirements are usually the same for all products. However, in special cases where they differ, the capacity of a catalog should be redefined in terms of the space requirements of the products it can accommodate rather than in terms of the number of products it can include. This extension makes the problem much more difficult to solve.
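One natural starting point for such upper bounds is the linear programming relaxation of an integer programming formulation of the problem. The following is a plausible formulation written down here for illustration only; it is our own sketch, not the model used in the paper. Let a_ki = 1 if customer k is interested in product i, and introduce binary variables x_ij (product i is placed in catalog j), z_kj (customer k is covered by catalog j), and y_k (customer k is covered):

```latex
\begin{aligned}
\max \quad & \sum_{k \in N} y_k \\
\text{s.t.} \quad & \sum_{i \in M} x_{ij} = r, && j = 1,\dots,K \\
& t \, z_{kj} \le \sum_{i \in M} a_{ki} \, x_{ij}, && k \in N,\ j = 1,\dots,K \\
& y_k \le \sum_{j=1}^{K} z_{kj}, && k \in N \\
& x_{ij},\ z_{kj},\ y_k \in \{0,1\}.
\end{aligned}
```

Relaxing the integrality constraints to 0 ≤ x_ij, z_kj, y_k ≤ 1 yields a linear program whose optimal value is an upper bound on the number of customers any catalog line can cover; how tight such a bound is in practice remains an open question.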
Acknowledgments The author would like to thank Dr. Ramesh Sharda and Dr. Varghese Jacob and the anonymous referees for their useful feedback that helped improve the paper's contents and presentation. References
[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, September 1994.
[2] R. Barkhi, E. Rolland, J. Butler, W. Fan, Decision Support System induced guidance for model formulation and solution, Decision Support Systems 40 (2) (2005) 269–281.
[3] T. Brijs, G. Swinnen, K. Vanhoof, G. Wets, The use of association rules for product assortment decisions: a case study, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, August 1999, pp. 254–260.
[4] S. Chiger, Benchmark survey on critical issues and trends, Catalog Age 20 (13) (2003) 32–37.
[5] S. Chiger, Benchmark 2005 on Operations, www.catalogagemag.com, April 1, 2005.
[6] CPLEX 9.0 User's Manual, ILOG Inc., http://www.ilog.com, Mountain View, CA, 2004.
[7] M. Ester, R. Ge, W. Jin, Z. Hu, A microeconomic data mining problem: customer-oriented catalog segmentation, Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Seattle, Washington), 2004, pp. 557–562.
[8] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman and Company, 1979.
[9] M. Haouari, J.S. Chaouachi, A probabilistic greedy search algorithm for combinatorial optimisation with application to the set covering problem, Journal of the Operational Research Society 53 (7) (2002) 792–799.
[10] J. Kleinberg, C. Papadimitriou, P. Raghavan, A microeconomic view of data mining, Data Mining and Knowledge Discovery 2 (1998) 311–324.
[11] J. Kleinberg, C. Papadimitriou, P. Raghavan, Segmentation problems, Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (Dallas, Texas), 1998, pp. 473–482.
[12] J. Kleinberg, C. Papadimitriou, P. Raghavan, Segmentation problems, Journal of the ACM 51 (2) (2004) 263–280.
[13] M. Steinbach, G. Karypis, V. Kumar, Efficient algorithms for creating product catalogs, Proceedings of SDM, 2001.
[14] V. Verykios, A. Elmagarmid, E. Bertino, Y. Saygin, E. Dasseni, Association rule hiding, IEEE Transactions on Knowledge and Data Engineering 16 (4) (2004) 434–447.
[15] D. Xu, Y. Ye, J. Zhang, Approximating the 2-catalog segmentation problem using semidefinite programming relaxations, Optimization Methods and Software 18 (6) (2003) 705–719.

Ali Amiri received MBA and Ph.D. degrees in information systems from the Ohio State University, Columbus, OH, in 1988 and 1992, respectively, and a B.S. degree in business administration from the Institut des Hautes Commerciales, Tunisia, in 1985. He is an Associate Professor of Management Science and Information Systems at Oklahoma State University. His research interests include data communications, electronic commerce, data mining, and database management. His papers have appeared in a variety of journals including the European Journal of Operational Research, INFORMS Journal on Computing, ACM Transactions on Internet Technology, Information Sciences, and Naval Research Logistics.