PSO7
Abstract—With advances in computer technology and the rapid development of the Internet, the amount of information on the Internet is increasing exponentially. To retrieve relevant documents efficiently from the Internet and other rapidly growing information sources, a novel Web document retrieval algorithm based on particle swarm optimization (PSO) and linear discriminant analysis (LDA) is proposed. Experimental results demonstrate its effectiveness and efficiency.

I. INTRODUCTION

With advances in computer technology and the rapid development of the Internet, the amount of information on the Internet is increasing exponentially. To make use of this vast amount of data, efficient and effective techniques for retrieving Web documents based on their content need to be developed. As a consequence, the role of information retrieval (IR) systems is becoming more important [1]. One of the most important and difficult operations in information retrieval is to generate queries that succinctly identify relevant documents and reject irrelevant ones. In addition, document data are typically of very high dimensionality, ranging from several thousand to several hundred thousand dimensions. High-dimensional data often lead to inferior retrieval results due to the curse of dimensionality. To manipulate Web document data more efficiently, it is desirable to first project the documents into a lower-dimensional subspace in which the semantic structure of the document space becomes clear. Once the high-dimensional document space is mapped into a lower-dimensional space, traditional information retrieval algorithms can be applied.

The document space is generally of high dimensionality, and querying in such a space is often infeasible due to the curse of dimensionality. Therefore, dimensionality reduction is essential to the design of efficient document retrieval algorithms. Over the past several decades, many dimensionality reduction techniques have been proposed. Principal Component Analysis (PCA) [2] and Linear Discriminant Analysis (LDA) [2] are two of the most commonly used linear dimensionality reduction methods. Independent Component Analysis (ICA) [3] is another linear decomposition; it seeks statistically independent, non-Gaussian components and models the observed sample as a linear mixture of independent sources. ICA is mainly used in the field of blind source separation (BSS). PCA uses the Karhunen-Loeve transformation to derive the most representative subspace for object representation. The goal of LDA is to find a linear transformation that maximizes the between-class scatter and minimizes the within-class scatter, so that class separability is optimized in the transformed space. In general, LDA-based dimensionality reduction methods perform better than PCA-based methods, because LDA-based methods aim to find the projections carrying the most discriminant information, whereas PCA-based methods find the projections with minimal reconstruction error. Moreover, LDA has been successfully applied as a dimensionality reduction technique to many pattern recognition problems, such as information retrieval, face recognition, and microarray gene expression data analysis. In this paper, we focus on LDA, since it is a popular and widely used dimensionality reduction method in pattern recognition.

In addition, one of the most important and difficult operations in information retrieval is to generate queries that succinctly identify relevant documents and reject irrelevant ones. Relevance feedback (RF) is an important tool for improving the performance of information retrieval [1]. RF focuses on the interaction between the user and the search engine by letting the user label documents as semantically relevant or irrelevant. In recent years, many relevance feedback methods have been developed. They either adjust the weights of various features to adapt to the user's preferences or estimate the density of the positive feedback examples. Recently, Eberhart and Kennedy proposed particle swarm optimization (PSO), based on an analogy with a swarm of birds [4]. The main advantages of the PSO algorithm are its simple concept, easy implementation, robustness to control parameters, and computational efficiency compared with mathematical algorithms and other heuristic optimization techniques. The original PSO has been applied to a neural network learning problem and to function optimization problems, and the efficiency of the method has been confirmed. In this paper, the objective is to investigate the capability of the PSO algorithm, boosted by an LDA-based dimensionality reduction method, for Web document query optimization in the context of information retrieval.

II. DIMENSIONALITY REDUCTION WITH LDA

Web document sets are huge, so we need to find a lower-dimensional representation of the data. Dimensionality reduction is the representation of high-dimensional patterns in a
different regions of the document space simultaneously. The detailed steps of the PSO-based Web document retrieval algorithm are described as follows.

A. The encoding of query particle

Each particle is a point in the query vector space. A query particle representing a query has the form

$$Q_u = (q_{u1}, q_{u2}, \cdots, q_{uT}) \tag{7}$$

where $T$ is the total number of stemmed terms automatically extracted from the documents, and $q_{ui}$ is the real-valued weight of the $i$th term in $Q_u$, which defines the importance of that term in the query. Initially, the term weight $q_{ui}$ is computed by the following formula [8]:

$$q_{ui} = \left(0.5 + \frac{0.5 \times freq_{ui}}{maxfreq_u}\right) \times \left(\log_2 \frac{N}{n_i} + 1\right) \tag{8}$$

where $freq_{ui}$ is the frequency of term $t_i$ in document $d_u$, $maxfreq_u$ is the maximum frequency of any term in document $d_u$, $N$ is the total number of documents, and $n_i$ is the number of documents containing the term $t_i$.

Therefore, the position of particle $i$ at iteration 0 can be represented as the vector $Q_i^{(0)} = (q_{i1}^{(0)}, \ldots, q_{iT}^{(0)})$, where $T$ is the total number of stemmed terms automatically extracted from the documents. The velocity of particle $i$, $V_i^{(0)} = (v_{i1}^{(0)}, \ldots, v_{iT}^{(0)})$, corresponds to the term weight update quantity. The velocity of each particle is created at random. The position and velocity vectors have the same dimension.

The initial population (set of queries) contains the initial query and a list of relevant documents retrieved by this initial query. Therefore, the initial population is not randomly constructed. In this way, the PSO algorithm begins the exploration of the document space in "good" directions. The population is renewed after each iteration of the PSO algorithm.

C. Updating the position and velocity

Each particle $i$ memorizes its own $F(Q_i^{(s)})$ values and keeps the position with the maximum value found so far as its personal best position $P_i^{(s)}$, where $s$ denotes the iteration number. The particle with the best $F$ value among the $P_i^{(s)}$ is denoted as the global best position $P_{gb}^{(s)}$. Note that in the first iteration, each particle $i$ is set directly to $P_i^{(0)}$, and the particle with the best $F$ value among the $P_i^{(0)}$ is set to $P_{gb}^{(0)}$. Since the initial position is the only location each particle has encountered at the start of the run, this position becomes the particle's personal best position $P_i^{(0)}$. The first global best position $P_{gb}^{(0)}$ is then selected from among these initial positions.

Calculate the fitness value of each particle in the population using the fitness function $F$ given by Equation (9). Compare each particle's fitness value with that of its personal best position $P_i^{(s)} = (p_{i1}^{(s)}, \ldots, p_{iT}^{(s)})$; the global best position is denoted as $P_{gb}^{(s)}$. Modify the velocity $V_i^{(s)}$ of each particle $i$ according to the following formulation:

$$V_i^{(s+1)} = k\left[w_i V_i^{(s)} + c_1 r_1 \left(P_i^{(s)} - V_i^{(s)}\right) + c_2 r_2 \left(P_{gb}^{(s)} - V_i^{(s)}\right)\right] \tag{11}$$

If $V_i^{(s+1)} > V_{max}$, then $V_i^{(s+1)} = V_{max}$.

Based on the updated velocities, each individual (particle) changes its position according to the following formulation:

$$Q_i^{(s+1)} = Q_i^{(s)} + V_i^{(s+1)} \tag{12}$$

The personal best position $P_i^{(s)}$ of each individual at iteration $(s+1)$ is updated as follows:

If $F(Q_i^{(s+1)}) > F(Q_i^{(s)})$ then $P_i^{(s+1)} = Q_i^{(s+1)}$;
If $F(Q_i^{(s+1)}) < F(Q_i^{(s)})$ then $P_i^{(s+1)} = Q_i^{(s)}$;

where $F(Q_i^{(s)})$ denotes the fitness function evaluated at iteration $s$. Meanwhile, the global best position $P_{gb}$ at iteration $(s+1)$ is set as the best evaluated position among the $P_i^{(s+1)}$.
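The term weighting of Equation (8) in Section A can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `term_weights`, the toy documents, and the choice to give weight 0 to an empty document or to a term that occurs nowhere in the collection are assumptions that do not come from the paper.

```python
import math
from collections import Counter


def term_weights(doc_terms, vocabulary, doc_freq, num_docs):
    """Weight every vocabulary term for one document, following Equation (8)."""
    counts = Counter(doc_terms)
    max_freq = max(counts.values(), default=0)   # maxfreq_u
    weights = []
    for term in vocabulary:
        n_i = doc_freq.get(term, 0)               # documents containing the term
        if max_freq == 0 or n_i == 0:
            # Assumption (not in the paper): avoid division by zero with weight 0.
            weights.append(0.0)
            continue
        tf_part = 0.5 + 0.5 * counts[term] / max_freq
        idf_part = math.log2(num_docs / n_i) + 1.0
        weights.append(tf_part * idf_part)
    return weights


# Hypothetical toy collection of already stemmed documents.
docs = [["web", "search", "web"], ["search", "engine"], ["particle", "swarm"]]
vocabulary = sorted({t for d in docs for t in d})
doc_freq = Counter(t for d in docs for t in set(d))
initial_position = term_weights(docs[0], vocabulary, doc_freq, len(docs))
```

Taken literally, Equation (8) gives a term that is absent from the document the weight 0.5 × (log2(N/n_i) + 1) and not zero, and the sketch keeps that behaviour.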
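One iteration of the update in Section C might look like the sketch below. It is not the authors' implementation and deviates from the printed equations in three labeled ways. First, the attraction terms use the canonical PSO form (best position minus current position), whereas Equation (11) as printed subtracts the velocity. Second, the personal best is compared against the stored best fitness. Third, the velocity is clamped componentwise to [-v_max, v_max]. The symbols k, w, c1, c2, r1, and r2 are read here as constriction factor, inertia weight, acceleration coefficients, and uniform random numbers; the paper does not define them in this section, and the default values are purely illustrative.

```python
import random


def pso_step(positions, velocities, pbest, pbest_fit, fitness,
             k=0.729, w=1.0, c1=2.05, c2=2.05, v_max=1.0, rng=random):
    """Advance every particle by one iteration; all lists are updated in place."""
    # Global best = personal best with the highest fitness (F is maximized).
    g = max(range(len(pbest)), key=lambda i: pbest_fit[i])
    gbest = list(pbest[g])
    for i, (q, v) in enumerate(zip(positions, velocities)):
        for d in range(len(q)):
            r1, r2 = rng.random(), rng.random()
            # Assumption: canonical attraction terms (best minus position).
            new_v = k * (w * v[d]
                         + c1 * r1 * (pbest[i][d] - q[d])
                         + c2 * r2 * (gbest[d] - q[d]))
            # Assumption: symmetric clamp of each component to [-v_max, v_max].
            v[d] = max(-v_max, min(v_max, new_v))
            q[d] += v[d]                      # position update, Equation (12)
        f = fitness(q)
        if f > pbest_fit[i]:                  # keep the better position
            pbest[i] = list(q)
            pbest_fit[i] = f
    g = max(range(len(pbest)), key=lambda i: pbest_fit[i])
    return list(pbest[g]), pbest_fit[g]
```

The function expects `positions`, `velocities`, and `pbest` as lists of equal-length float lists, and `fitness` as any callable that scores a query vector. In the paper that role is played by the fitness function F of Equation (9).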
TABLE I
AVERAGE RUNNING TIME COMPARISON

Method   T-1 (s)   T-2 (s)   T-3 (s)   T-4 (s)
PSO      0.058     0.061     0.067     0.069
SVM      0.073     0.079     0.081     0.088
GA       0.067     0.072     0.076     0.085
RF       0.074     0.082     0.086     0.091

TABLE II
RELEVANT DOCUMENT COMPARISON

Method   T-1   T-2   T-3   T-4
PSO      111   69    54    56
SVM      96    64    56    53
GA       89    62    58    51
RF       83    60    56    54

E. Merging relevant documents

At each generation of the PSO, the relevant documents retrieved by all individual queries of the query population are merged into a single document list and presented to the user. The merging method ranks documents according to the following formula:

$$Rel^{(s)}(d_j) = \sum_{Q_u^{(s)} \in Pop^{(s)}} F(Q_u^{(s)}) \cdot SV(Q_u^{(s)}, d_j) \tag{15}$$

where $Pop^{(s)}$ is the population at generation $s$ of the PSO, and $SV(Q_u^{(s)}, d_j)$ is the satisfaction value (SV) of document $d_j$ for the query $Q_u^{(s)}$ at generation $s$ of the PSO. It indicates the degree to which a retrieved document satisfies the imposed query. When a retrieved document $d_j$ satisfies the imposed query $Q_u^{(s)}$, the value of $SV(Q_u^{(s)}, d_j)$ is set to 1; otherwise it is set to 0.
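The merging rule of Equation (15) amounts to adding up, for each document, the fitness of every query that retrieved it. The sketch below illustrates this under one assumption that the paper does not spell out: a document "satisfies" a query exactly when it appears in that query's retrieved list. The function name `merge_results`, the tie-breaking rule, and the sample data are hypothetical.

```python
from collections import defaultdict


def merge_results(population):
    """Rank documents by Equation (15).

    population: iterable of (fitness, retrieved_doc_ids) pairs, one per query.
    Returns (doc_id, relevance) pairs, highest relevance first.
    """
    relevance = defaultdict(float)
    for fitness, retrieved in population:
        # Assumption: SV = 1 for each retrieved document, 0 for all others.
        for doc_id in set(retrieved):
            relevance[doc_id] += fitness
    # Ties are broken by document id so that the order is deterministic.
    return sorted(relevance.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical population of three queries with their fitness values.
population = [(0.9, ["d1", "d2"]), (0.5, ["d2", "d3"]), (0.2, ["d3"])]
ranking = merge_results(population)
```

A document retrieved by several high-fitness queries accumulates the largest score, so it is placed at the top of the merged list presented to the user.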
F. Stopping Criteria

The PSO algorithm terminates when the evaluation value of the global best position $P_{gb}$ no longer improves appreciably or when the iteration number $s$ reaches the predefined maximum.

V. EXPERIMENTAL RESULTS

To demonstrate the effectiveness of the proposed PSO-based document retrieval algorithm, we conducted a series of experiments and compared it with document retrieval algorithms based on the relevance feedback (RF) approach [9], a genetic algorithm (GA) [10], and a support vector machine (SVM) [11]. The LIBSVM software [12] was used in our system to solve the SVM optimization problem. Leave-one-out cross validation on the training data is applied to select the SVM parameters, and the RBF kernel is used. The parameter values of the proposed PSO algorithm are set empirically as follows: the ratio β of increase (decrease) in the local search is set to 0.05, and the number of iterations is fixed at 4.

The standard document collection Reuters-21578 [13] was used in our experiments. The actual computation time of the different algorithms is given in Table I. All four algorithms respond to the user's query within 0.1 s. Our PSO algorithm is slightly faster than the other three algorithms. The reason is that the LDA-based dimensionality reduction method reduces the cost of the post-processing involved in document retrieval and so improves retrieval speed and efficiency.

In addition, we compare the number of relevant documents retrieved by PSO and by the other algorithms. Table II gives the number of relevant documents retrieved at each iteration. Our PSO retrieves the most relevant documents in columns T-1, T-2, and T-4, although SVM, GA, and RF each retrieve slightly more in column T-3. Overall, our PSO is more effective than the other algorithms in retrieving relevant documents. The reason is that we designed the position and velocity update operations of each particle according to the characteristics of information retrieval, and used a local search method to speed up finding a better query vector near the original query vector after applying the PSO operation. Therefore, our proposed PSO query optimization algorithm efficiently improves the performance of the query search.

VI. CONCLUSIONS

With the rapid development of the Internet, the amount of information on the Internet is increasing exponentially. As a consequence, the role of information retrieval (IR) systems is becoming more important. To manipulate Web document data more efficiently, we first project the documents into a lower-dimensional subspace in which the semantic structure of the document space becomes clear. Once the high-dimensional document space is mapped into a lower-dimensional space, the PSO-based document retrieval algorithm is applied. Experimental results are provided to validate the effectiveness of the proposed approach.

REFERENCES

[1] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.
[2] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd edition. New York: Wiley-Interscience, 2000.
[3] P. Comon, "Independent component analysis-a new concept?", Signal Processing, vol. 36, no. 3, pp. 287–314, April 1994.
[4] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory", in Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1995, pp. 39–43.
[5] H. Li, K. Zhang, and T. Jiang, "Robust and accurate cancer classification with gene expression profiling", in Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference, 2005, pp. 310–321.
[6] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd edition. Baltimore: Johns Hopkins University Press, 1996.
[7] J. Kennedy, "The particle swarm: social adaptation of knowledge", in Proceedings of the 1997 IEEE International Conference on Evolutionary Computation, 1997, pp. 303–308.
[8] A. Singhal, C. Buckley, and M. Mitra, "Pivoted document length normalisation", in Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp. 21–29.
[9] B. T. Bartell, G. W. Cottrell, and R. K. Belew, "Optimising similarity using multiquery relevance feedback", Journal of the American Society for Information Science, vol. 49, no. 8, pp. 742–761, December 1998.
[10] J.-T. Horng and C.-C. Yeh, "Applying genetic algorithms to query optimization in document retrieval", Information Processing and Management, vol. 36, no. 5, pp. 737–759, September 2000.
[11] V. N. Vapnik, Statistical Learning Theory. New York: John Wiley & Sons, 1998.
[12] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines", http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[13] D. D. Lewis, "Reuters-21578 text categorization test collection distribution 1.0", http://www.research.att.com/~lewis, 1999.