Bees Swarm Optimization Based Approach For Web Information Retrieval
1. Introduction
With the exponentially growing amount of information on the web, the classic search process lacks efficiency. Innovative tools for information retrieval (IR) have become necessary to cope with the complexity induced by this tremendous volume of information. Many research directions contribute to handling the complexity of the problem; distributed information retrieval and personalized information source selection are two examples of these research axes. Recent works consider user and source profiles in order to restrict the search to the sources whose profile matches the user's [6,7]. In this manner, a lot of information is pruned and therefore the response time of such systems becomes short.
The weight of a term in a document is computed using the expression tf*idf, where tf is the term frequency in the document and idf is the inverse document frequency, usually computed as idf = log(m/df), where m is the number of documents and df is the number of documents that contain the term. The component tf indicates the importance of the term for the document, while idf expresses the discriminating power of this term. In this way, a term with a high tf*idf value is at the same time important in the document and infrequent in the others. The weight for a query is computed in the same manner. The similarity of a document d and a query q is then computed using the Cosine formula:

f(d, q) = Σi (ai * bi) / (Σi ai² * Σi bi²)^(1/2)    (Cosine)

where ai and bi are the weights of term ti in the document and in the query respectively.
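As a concrete illustration, the tf*idf weighting and Cosine similarity above can be sketched in a few lines; the sparse-dictionary representation of documents is our own choice for the sketch, not prescribed by the text:

```python
import math

def tfidf_weights(doc_tf, df, m):
    """Weight each term by tf * idf, with idf = log(m / df) as in the text.

    doc_tf: term -> frequency in the document; df: term -> document frequency;
    m: number of documents in the collection."""
    return {t: tf * math.log(m / df[t]) for t, tf in doc_tf.items()}

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(w * w for w in a.values()))
           * math.sqrt(sum(w * w for w in b.values())))
    return num / den if den else 0.0
```

A query is scored against a document by weighting both with `tfidf_weights` and taking their `cosine`.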
2.2. Discussion
For information retrieval, the landscape of the search space is not homogeneous: similar documents may exist in a region that is hard to access because it is surrounded by barren regions. In other words, the promising region can be isolated like an oasis. This situation does not arise for all instances but can happen, especially when the number of terms greatly exceeds the number of documents. Since in the real world the terms we use are well defined and stored in a dictionary, while documents can be created in unlimited quantity, the number of documents determines the shape of the search space. Two kinds of situation therefore appear. When the number of documents is small, the landscape of the search space cannot be handled easily by heuristic search techniques, and an exact approach is better suited. On the contrary, when the number of documents is huge, the search space is more compact and heuristic search can then perform very well. Heuristic search methods can even help distributed information retrieval systems, where user and source profiles are used to direct the search, to speed up their response time. This observation led us to construct corpuses from CACM, which is a small collection, to which we added a considerable number of documents in order to test our approach.
3. BSO meta-heuristic

In 1946, Karl von Frisch, while decoding the language of bees, observed that it is through the dance that a bee, upon its return to the hive, communicates to its fellows the distance, the direction and the wealth of a food source. Bees of a same colony visit more than a dozen potential exploitation areas, but the colony concentrates its harvesting effort on a small number of them, the richest and the easiest to access. In addition, numerous observations show that a colony can quickly shift its exploitation from one source to another. In an experiment conducted in 1991, Seeley, Camazine and Sneyd showed that when a colony of bees can choose between two food sources whose sugar concentrations are very unequal, situated diametrically opposite with respect to the hive, one to the north and the other to the south, the colony moves towards the richer one to concentrate its harvesting effort there. In that phenomenon, the swarm follows the bee that performs the most vigorous dance, therefore the one that indicates the place of the richest source of food [3].

The meta-heuristic Bees Swarm Optimization is inspired by the collective bee behaviour described above. It handles artificial bees that imitate the feeding and working style of real bees when solving problems. First, a bee named BeeInit settles down to find a solution with good features, called Sref, from which the other solutions of the search space are determined via a certain strategy. The set of these solutions is called SearchArea. Then, every bee takes a solution from SearchArea as its starting point in the search. After accomplishing its search, every bee communicates the best solution it visited to its fellows through a structure named Dance. One of the solutions of this list becomes the new reference solution for the next iteration of the process. In order to avoid cycles, the reference solution is stored each time in a taboo list. The reference solution is chosen first according to a quality criterion. However, if after a period of time the swarm observes that the solution does not progress in quality, it integrates a second criterion of diversity that allows it to escape from the region where it is possibly trapped. The BSO algorithm is therefore outlined as follows:

begin
  let Sref be the solution found by BeeInit;
  while (MaxIter not reached) do
    insert Sref in a taboo list;
    determine SearchArea from Sref;
    assign a solution of SearchArea to each bee;
    for each Bee k do
      search starting with the assigned solution;
      store the result in Dance;
    endfor;
    compute the new solution of reference Sref;
  endwhile;
end;
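As an illustration only, the general BSO loop might be sketched as follows in Python. The fitness function, neighbourhood generator and local search are placeholders supplied by the caller; the encoding and default parameter values are our own assumptions, not the paper's exact implementation:

```python
import random

def bso(fitness, random_solution, neighbourhood, local_search,
        n_bees=10, max_iter=50, seed=0):
    """Generic BSO skeleton: reference solution, search area, dance table."""
    rng = random.Random(seed)
    sref = random_solution(rng)          # plays the role of BeeInit
    best = sref
    taboo = []
    for _ in range(max_iter):
        taboo.append(sref)
        area = neighbourhood(sref, n_bees, rng)       # SearchArea from Sref
        dance = [local_search(s, rng) for s in area]  # each bee searches
        # next reference solution: best danced solution not in the taboo list
        candidates = [s for s in dance if s not in taboo] or dance
        sref = max(candidates, key=fitness)
        if fitness(sref) > fitness(best):
            best = sref
    return best
```

The quality/diversity switch described above is omitted here for brevity; this skeleton selects on quality alone.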
4. BSO-IR Algorithm
In this section, we present the bee swarm optimization algorithm called BSO-IR designed for information retrieval. The adaptation of the meta-heuristic to IR requires the design of the following components: the artificial world where the bees live, the fitness function that evaluates solutions, the initial solution Sref, the strategies to determine the set of solutions SearchArea from Sref, the search procedure performed by each artificial bee, the quality and the diversity strategies and the choice
rules of the reference solution Sref allowing the iteration of the process. Let us first start with the description of the problem modelling.
number implies that Sref is probably close to the local optimum of the new exploitation region, so the probability of an improvement is very weak. On the other hand, if this value is too large, the swarm will move away from the region containing Sref, with the risk of losing good solutions. To perform the changes, we propose two strategies ensuring that the obtained solutions are as distinct as possible. If the number of generated solutions proves to be insufficient, a random technique is used to complete the rest.

The first strategy. The solution s is generated by flipping terms ti of Sref as follows:

begin
  h = 0;
  while size of SearchArea not reached and h < Flip do
    s = Sref; p = 0;
    repeat
      if the term Flip*p+h exists in s then remove it from s
      else insert it in s;
      p = p+1;
    until Flip*p+h > k;
    SearchArea = SearchArea ∪ {s}; (* set of all solutions s *)
    h = h+1;
  endwhile
end

The second strategy. Here we consider Sref as a set of contiguous packets of terms. The solution s is generated by changing the terms ti of one packet of Sref, that is:

begin
  h = 0;
  while size of SearchArea not reached and h < Flip do
    s = Sref; p = 0;
    repeat
      if the term (k/Flip)*h+p exists in s then remove it from s
      else insert it in s;
      p = p+1;
    until p >= k/Flip;
    SearchArea = SearchArea ∪ {s};
    h = h+1;
  endwhile;
end;

Let k = 20 be the number of terms and Flip = 5. If the terms are subscripted from 1 to 20, then the first strategy flips the terms (1,6,11,16), (2,7,12,17), (3,8,13,18), (4,9,14,19) and (5,10,15,20), while in the second strategy the following terms are inverted: (1,2,3,4), (5,6,7,8), (9,10,11,12), (13,14,15,16) and (17,18,19,20). Flipping or inverting a term of the solution s consists in removing it from s if it is present, or adding it otherwise.
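Assuming k is divisible by Flip, the index patterns produced by the two strategies can be reproduced with a short sketch; the 1-based subscripts follow the worked example above:

```python
def flip_strategy_1(k, flip):
    """Strategy 1: solution h flips positions h, h+flip, h+2*flip, ... (1-based)."""
    return [[flip * p + h for p in range(k // flip)] for h in range(1, flip + 1)]

def flip_strategy_2(k, flip):
    """Strategy 2: solution h flips one contiguous packet of k/flip terms."""
    size = k // flip
    return [list(range(h * size + 1, (h + 1) * size + 1)) for h in range(flip)]
```

With k = 20 and Flip = 5, strategy 1 yields the interleaved groups (1,6,11,16) ... (5,10,15,20) and strategy 2 the contiguous packets (1,2,3,4) ... (17,18,19,20), as in the text.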
    endif;
  endif;
end;

Remarks. (Sref is better in quality) is equivalent to (f(Sref) = Max f(s)), where s belongs to Dance and not to the taboo list. (Sref is better in diversity) is equivalent to (diversity(Sref) = Max diversity(s)), where s belongs to Dance. If two solutions s1 and s2 are equal in quality, that is, if they have the same value of the objective function, then the one with the larger degree of diversity is preferred. In the same way, if two solutions s1 and s2 present the same degree of diversity, the one that improves the fitness function is chosen. It can happen, although very rarely, that all solutions of Dance exist in the taboo list; to palliate this problem, the reference solution is then generated at random. maxchances is an empirical parameter designating the maximum number of chances accorded to create a search region SearchArea.
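A minimal sketch of this selection rule, with fitness and diversity supplied as caller-provided functions; as a simplification, the random fallback here simply re-draws from Dance rather than generating a fresh random solution as the text describes:

```python
import random

def choose_sref(dance, taboo, fitness, diversity, rng=None):
    """Pick the next reference solution from the Dance table.

    Quality decides first; diversity breaks ties. If every danced
    solution is already taboo (a rare case, per the text), fall back
    to a random pick from Dance."""
    rng = rng or random.Random()
    candidates = [s for s in dance if s not in taboo]
    if not candidates:
        return rng.choice(dance)   # simplified stand-in for random generation
    return max(candidates, key=lambda s: (fitness(s), diversity(s)))
```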
4.6. The bee search process

The bee search process is iterative; the number of iterations, MaxIter, is an empirical parameter. The process consists of two phases:
- a simple local search;
- an improvement technique that flips as many terms as possible of the solution found in the first phase, so that the fitness of the new solution is greater than or equal to it.

5. Experimental Results

In order to test the performance of the designed algorithm, we performed a series of extensive experiments. The first consists in setting the empirical parameters that yield high solution quality, such as the bee colony size, the maximum number of iterations and the maximum number of changes in the improvement procedure. A second step tests the performance of the algorithm. The algorithms were implemented in C# on a personal computer. The tested collections are the well-known CACM and RCV1, plus large-scale collections generated from CACM by the following process. CACM has 3204 documents over 6468 terms, whereas RCV1 possesses 804 414 documents over 47 236 terms.
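The improvement phase can be illustrated as follows, assuming the binary term-vector encoding used earlier; the left-to-right sweep order is our own assumption:

```python
def improve(solution, fitness):
    """Second phase of the bee search: flip each term in turn, keeping
    every flip that does not degrade the fitness, so the result is at
    least as good as the input solution."""
    current = list(solution)
    for i in range(len(current)):
        candidate = list(current)
        candidate[i] ^= 1                 # flip term i in the binary encoding
        if fitness(candidate) >= fitness(current):
            current = candidate
    return current
```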
close similarity. In the second step, new documents are generated by concatenating the documents of each cluster in all possible ways, that is, by 2, by 3, and so on until concatenating all the documents together. The concatenation of documents is performed by merely merging their respective indexes.

5.1.1. Clustering of documents

The algorithm that clusters the documents of a collection on the basis of their similarity is a BSO-based algorithm. It is designed within the same framework as the one described previously for information retrieval. The main difference is threefold: first, the classes are initially created by a diversification generator of documents; second, BSO is called for each class to fill the class with documents; third, each time an artificial bee finds a good solution it inserts it in a dynamic list representing the cluster. This list, managed as a FIFO (First In First Out) queue, is sorted according to the similarity of the documents, and its size is restricted to r, the number of documents per cluster. With this constraint, once the cluster is full, only documents whose similarity is greater than that of the element located at the end of the queue are inserted. The algorithm is outlined as follows:

Input: a collection C of m documents
Output: p clusters of r documents each
begin
  C' = C;
  initialize p queues to empty;
  determine p diversified documents from Sref; (* same process as SearchArea determination *)
  assign a document to each group;
  for each group p do
    create SearchArea of size equal to k;
    assign a solution to each bee;
    for each Bee k do
      search starting with the assigned solution;
      let s be the result of the search;
      store s in Dance list;
    endfor;
    if f(s, p) > f(queue.end, p) then
      if queue full then remove queue.end; endif;
      insert s in queue;
    endif;
  endfor;
end;
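The bounded, similarity-sorted queue at the heart of this clustering step can be sketched as follows; the class and method names are ours:

```python
import bisect

class Cluster:
    """Fixed-capacity cluster kept sorted by similarity, best first.

    Once full, a document enters only if its similarity beats that of
    the element at the end of the queue, which is then evicted."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.docs = []          # list of (similarity, doc), best first

    def offer(self, doc, similarity):
        if len(self.docs) == self.capacity and similarity <= self.docs[-1][0]:
            return False
        # insert while keeping descending order of similarity
        keys = [-s for s, _ in self.docs]
        i = bisect.bisect_left(keys, -similarity)
        self.docs.insert(i, (similarity, doc))
        if len(self.docs) > self.capacity:
            self.docs.pop()     # evict the least similar document
        return True
```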
5.1.2. Corpus generation

The second phase deals with the construction of the benchmark. The idea is to create documents from similar documents belonging to the same class. The algorithm is as follows:

Input: p clusters of r documents each
Output: a collection of documents
begin
  for each cluster p do
    for each document d of the cluster do
      merge d with the other documents of class p, 2 by 2, then 3 by 3, and so on;
      suppress d from the cluster p;
    endfor
  endfor
end;

The CACM collection, called CACM1 in Table I, was transformed into four collections called CACM2, CACM3, CACM4 and CACM5, as shown in Table I.
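Assuming a document is represented by its index, i.e. a term-to-frequency map, the merging step can be sketched as:

```python
from itertools import combinations

def generate_corpus(cluster):
    """Merge a cluster's documents in all combinations of size 2, 3, ...,
    up to the whole cluster. Merging two indexes adds their term frequencies."""
    new_docs = []
    for size in range(2, len(cluster) + 1):
        for group in combinations(cluster, size):
            merged = {}
            for doc in group:
                for term, tf in doc.items():
                    merged[term] = merged.get(term, 0) + tf
            new_docs.append(merged)
    return new_docs
```

A cluster of r documents thus yields 2^r - r - 1 new documents, which explains the explosive growth reported in Table I.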
TABLE I. GENERATED COLLECTIONS

Collection     CACM1   CACM2    CACM3     CACM4       CACM5
Cluster size   1       5        10        15          20
#doc           3 204   19 844   327 364   6 979 380   167 772 004
Figure 1 shows that the size of the generated corpuses grows exponentially with the cluster size. The aim of producing large-scale collections is thus achieved. In the following we describe the experimental study undertaken to demonstrate the efficiency of BSO on such tremendous corpuses.
The advantage of the evolutionary approach lies in the response time, where it shows its superiority over the exact algorithm. The time factor is very important since information retrieval is performed online and thus requires fast reactivity from the system.
Figure 3. Comparison of BSO-IR and the exact algorithm performances for CACM1
TABLE II. (caption and column headers not recovered)

CACM1   30   5   40   30
CACM5   50   8   70   55
Figure 4. Comparison of BSO-IR and the exact algorithm performances for CACM5
Table III exhibits the similarity and running-time values of the exact and BSO algorithms on the RCV1 collection, for four queries. Both algorithms achieve almost the same similarity, while BSO is faster than the exact algorithm. Another experiment shows this gap to be less significant for CACM, because of its smaller size. Figure 5 and Figure 6 show the runtime of both algorithms when processing CACM1 and CACM5 respectively. Note that the time increases exponentially for the classic algorithm whereas it grows almost linearly for BSO-IR.
TABLE III. COMPARISON BETWEEN EXACT AND BSO RUNTIME FOR RCV1 COLLECTION
Algorithm   exact            BSO
Query       1   2   3   4    1   2   3   4
6. Conclusions

In this paper, a bee swarm optimization algorithm named BSO-IR has been designed for information retrieval. The aim of this study is the adaptation of heuristic search techniques to large-scale IR and their comparison with classical approaches. Experimental tests have been conducted on the well-known CACM and RCV1 collections, and also on very large benchmarks generated from CACM for test purposes. The approach designed to construct the large collections is original and makes it possible to increase the scale of any collection. Through the undertaken experiments we have observed that, concerning solution quality, the exact algorithm achieved slightly better results than BSO-IR for the small collection, while for large-scale corpuses both algorithms are comparably effective. In terms of running time, however, BSO-IR exceeded the performance of the exact algorithm. We can therefore conclude that BSO-IR is better suited to large-scale information retrieval than the classical IR method. As future work, we plan to hybridize meta-heuristics with distributed information retrieval approaches to better address web information retrieval. Another intent is to design and develop a meta-optimization generator that automatically sets the empirical parameters for any meta-heuristic in general, and for BSO-IR in particular, in order to improve its performance. Manual tuning of parameters is a hard and tedious task, and may not reach optimality in spite of the extensive and numerous experiments one can perform.

References

[1] A. A. R. Ahmed, A. A. L. Bahgat, A. A. Abdel Mgeid and A. S. Osman, Using Genetic Algorithm to Improve Information Retrieval Systems, World Academy of Science, Engineering and Technology WASET06, vol. 17, pp. 6-12, 2006.
[2] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley Longman Publishing Co. Inc., 1999.
[3] E. Bonabeau, M. Dorigo, G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems, Oxford University Press, 1999.
[4] H. Drias, S. Sadeg, S. Yahi, Cooperative Bees Swarm for Solving the Maximum Weighted Satisfiability Problem, IWANN 2005, pp. 318-325.
[5] C. Hsinchun, Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning and Genetic Algorithms, Journal of the American Society for Information Science, 46, 3, pp. 194-216, 1995.
[6] S. Kechid, H. Drias, Personalizing the Source Selection and the Result Merging Process, International Journal on Artificial Intelligence Tools, 18(2), pp. 331-354, 2009.
[7] S. Kechid, H. Drias, Multi-agent System for Personalizing Information Source Selection, Web Intelligence 2009, pp. 588-595.
[8] P. Kromer, V. Snasel, J. Platos, A. Abraham, Implicit User Modelling Using Hybrid Meta-Heuristics, IEEE HIS, pp. 42-47, 2008.
[9] C. D. Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008.
[10] P. Pathak, M. Gordon and W. Fan, Effective Information Retrieval using Genetic Algorithms based Matching Functions Adaptation, 33rd IEEE HICSS, 2000.
[11] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management, 24, 5, pp. 513-523, 1988.
[12] C. J. Van Rijsbergen, Information Retrieval, Information Retrieval Group, University of Glasgow, 1979.
[13] D. Vrajitoru, Crossover Improvement for the Genetic Algorithm in Information Retrieval, Information Processing and Management, 34, 4, pp. 405-415, 1998.