
A Web Document Retrieval Algorithm Based on Particle Swarm Optimization

Ziqiang Wang, Xia Sun, Dexian Zhang
School of Information Science and Engineering, Henan University of Technology
High New Technology Industries Development Zone, 450001 Zhengzhou, P. R. China
[email protected]

Abstract—With advances in computer technologies and the rapid development of the Internet, information on the Internet is increasing exponentially. To efficiently retrieve relevant documents from the explosive growth of the Internet and other sources of information access, a novel Web document retrieval algorithm based on particle swarm optimization (PSO) and the linear discriminant analysis (LDA) algorithm is proposed to deal with this problem. Experimental results clearly demonstrate its effectiveness and efficiency.

I. INTRODUCTION

With advances in computer technologies and the rapid development of the Internet, information on the Internet is increasing exponentially. In order to make use of this vast amount of data, efficient and effective techniques to retrieve Web document information based on its content need to be developed. As a consequence, the role of information retrieval (IR) systems is becoming more important [1]. One of the most important and difficult operations in information retrieval is to generate queries that can succinctly identify relevant documents and reject irrelevant documents. In addition, document data are typically of very high dimensionality, ranging from several thousand to several hundred thousand dimensions. High-dimensional data often lead to inferior retrieval results due to the curse of dimensionality. To achieve higher efficiency in manipulating Web document data, it is desirable to first project the documents into a lower-dimensional subspace in which the semantic structure of the document space becomes clear. Once the high-dimensional document space is mapped into a lower-dimensional space, traditional information retrieval algorithms can then be applied.

The document space is generally of high dimensionality, and querying in such a high-dimensional space is often infeasible due to the curse of dimensionality. Therefore, dimensionality reduction is essential to the design of efficient document retrieval algorithms. In the past several decades, many dimensionality reduction techniques have been proposed. Principal Component Analysis (PCA) [2] and Linear Discriminant Analysis (LDA) [2] have been two of the most commonly used linear dimension reduction methods. Independent Component Analysis (ICA) [3] is another linear decomposition, which seeks statistically independent and non-Gaussian components, modeling the observed sample as a linear mixture of independent sources; ICA is mainly used in the "Blind Source Separation (BSS)" field. PCA utilizes the Karhunen-Loeve transformation to derive the most representative subspace for object representation. The goal of LDA is to find a linear transformation that maximizes the between-class scatter and minimizes the within-class scatter, so that the class separability can be optimized in the transformed space. In general, LDA-based dimensionality reduction methods perform better than PCA-based methods, because LDA-based methods aim to find projections with the most discriminant information, whereas PCA-based methods find projections with minimal reconstruction errors. Moreover, LDA has been successfully used as a dimensionality reduction technique in many pattern recognition problems, such as information retrieval, face recognition, and microarray gene expression data analysis. In this paper, we focus on LDA, since it is a popular and widely used dimensionality reduction method in pattern recognition.

In addition, one of the most important and difficult operations in information retrieval is to generate queries that can succinctly identify relevant documents and reject irrelevant documents. Relevance feedback (RF) is an important tool to improve the performance of information retrieval [1]. RF focuses on the interactions between the user and the search engine by letting the user label semantically relevant or irrelevant documents. In recent years, many relevance feedback methods have been developed; they either adjust the weights of various features to adapt to the user's preferences or estimate the density of the positive feedback examples. Recently, Eberhart and Kennedy proposed particle swarm optimization (PSO) based on the analogy of a swarm of birds [4]. The main advantages of the PSO algorithm are its simple concept, easy implementation, robustness to control parameters, and computational efficiency when compared with mathematical algorithms and other heuristic optimization techniques. The original PSO has been applied to learning problems of neural networks and to function optimization problems, and the efficiency of the method has been confirmed. In this paper, the objective is to investigate the capability of the PSO algorithm, boosted by an LDA-based dimensionality reduction method, for Web document query optimization in the context of information retrieval.

II. DIMENSIONALITY REDUCTION WITH LDA

Web document sets are huge, so we need to find a lower-dimensional representation of the data. Dimensionality reduction is the representation of high-dimensional patterns in a low-dimensional subspace based on a transformation which optimizes a specified criterion in the subspace.

This is important in information retrieval, since the lower-dimensional approximation is not just a tool for transforming a given problem into another one which is easier to solve: the reduced representation itself also lowers the cost of the post-processing involved in document retrieval and improves retrieval speed and efficiency.

Linear Discriminant Analysis (LDA) [2] is a popular linear dimension reduction method which can be used as a preprocessing step for document retrieval. The goal of LDA is to find a linear transformation that maximizes the between-class scatter and minimizes the within-class scatter, so that the class separability can be optimized in the transformed space. LDA's optimization criterion can be written as

    J(W) = argmax_W |W^T S_b W| / |W^T S_w W|    (1)

where W is a linear transformation matrix whose columns are the projection vectors used for dimensionality reduction, S_b is the between-class scatter matrix, and S_w is the within-class scatter matrix. By maximizing the criterion J(W), LDA finds the subspaces in which the classes are most linearly separable. The solution that maximizes J(W) is a set of eigenvectors W that must satisfy

    S_b W = λ S_w W    (2)

This is called the generalized eigenvalue problem; the discriminant subspace is spanned by the generalized eigenvectors. The ratio in Equation (1) is maximized when the column vectors of the projection matrix W are the eigenvectors of S_w^{-1} S_b. In document retrieval tasks, this method cannot be applied directly, since the number of terms in the document space is typically larger than the total number of documents. As a consequence, S_w is singular in this case. This problem is also known as the "small sample size (SSS)" problem. To overcome it, we adopt generalized linear discriminant analysis (GLDA) [5], which is a general, direct, and complete solution for optimizing LDA's criterion. The merits of GLDA are as follows: it is mathematically well-founded and coincides with conventional LDA when S_w is nonsingular; unlike conventional LDA, GLDA does not assume the nonsingularity of S_w and hence naturally solves the small sample size problem. In addition, to accommodate the high dimensionality of the scatter matrices, a fast algorithm for GLDA via singular value decomposition (SVD) [6] has also been developed. Extensive experiments on real-world databases verify the competitive performance of this approach.
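To make the projection step concrete, the following Python sketch computes an LDA transformation for a small term-document matrix by solving the generalized eigenvalue problem of Equation (2). It illustrates only the standard LDA recipe, not the GLDA/SVD variant adopted in this paper; the ridge term eps (added so that S_w stays invertible) and the use of scipy.linalg.eigh are assumptions of this sketch.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, n_components, eps=1e-6):
    """Return an LDA projection matrix W of shape (d, n_components).

    X : (n_docs, d) document-term matrix, y : (n_docs,) class labels.
    A small ridge eps*I keeps S_w nonsingular (the SSS problem); the paper
    instead relies on GLDA [5], which needs no such regularization.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)        # within-class scatter
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += Xc.shape[0] * (diff @ diff.T)          # between-class scatter
    # Generalized eigenvalue problem S_b w = lambda S_w w, Equation (2)
    vals, vecs = eigh(S_b, S_w + eps * np.eye(d))
    order = np.argsort(vals)[::-1]                    # largest eigenvalues first
    return vecs[:, order[:n_components]]

# usage: W = lda_projection(X, y, n_components=2); X_reduced = X @ W
```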
III. PARTICLE SWARM OPTIMIZATION

Particle swarm optimization (PSO) [4] is a population-based optimization technique, where the population is called a swarm. A simple explanation of PSO's operation is as follows. Each particle represents a possible solution to the optimization task. During each iteration, each particle accelerates in the direction of its own personal best solution found so far, as well as in the direction of the global best position discovered so far by any of the particles in the swarm. This means that if a particle discovers a promising new solution, all the other particles will move closer to it, exploring the region more thoroughly in the process. Some of the attractive features of PSO include ease of implementation and the fact that no gradient information is required. It can be used to solve a wide array of different optimization problems; example applications include neural network training and function minimization.

Let s denote the swarm size. Each individual particle i (1 ≤ i ≤ s) has the following properties: a current position x_i in the search space, a current velocity v_i, a personal best position p_i in the search space, and the global best position p_gb among all the p_i. During each iteration, each particle in the swarm is updated using the following equations:

    v_i(t+1) = k [ w_i v_i(t) + c_1 r_1 (p_i − x_i(t)) + c_2 r_2 (p_gb − x_i(t)) ]    (3)

    x_i(t+1) = x_i(t) + v_i(t+1)    (4)

where c_1 and c_2 denote the acceleration coefficients, and r_1 and r_2 are random numbers uniformly distributed within [0, 1]. The value of each dimension of every velocity vector v_i can be clamped to the range [−v_max, v_max] to reduce the likelihood of particles leaving the search space. The value of v_max is chosen to be k × x_max (where 0.1 ≤ k ≤ 1). Note that this does not restrict the values of x_i to the range [−v_max, v_max]; rather, it merely limits the maximum distance that a particle will move in one step.

The acceleration coefficients c_1 and c_2 control how far a particle will move in a single iteration. Typically, both are set to a value of 2.0, although assigning different values to c_1 and c_2 sometimes leads to improved performance. The inertia weight w in Equation (3) is also used to control the convergence behavior of PSO. Typical implementations of PSO adapt the value of w by decreasing it linearly from 1.0 to near 0 over the execution. In general, the inertia weight w is set according to the following equation [7]:

    w_i = w_max − ((w_max − w_min) / iter_max) · iter    (5)

where iter_max is the maximum number of iterations and iter is the current iteration number.

In order to guarantee the convergence of the PSO algorithm, the constriction factor k is defined as follows:

    k = 2 / | 2 − φ − sqrt(φ^2 − 4φ) |    (6)

where φ = c_1 + c_2 and φ > 4.
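As an illustration of Equations (3)-(6), the short Python sketch below performs one velocity and position update for a swarm stored as NumPy arrays. The linearly decreasing inertia weight, the v_max clamp, and the constriction factor follow the description above, but the function and variable names, as well as the choice c1 = c2 = 2.05 (so that φ > 4), are assumptions of this sketch rather than settings prescribed by the paper.

```python
import numpy as np

def constriction_factor(c1, c2):
    """Constriction factor k of Equation (6); requires phi = c1 + c2 > 4."""
    phi = c1 + c2
    return 2.0 / abs(2.0 - phi - np.sqrt(phi ** 2 - 4.0 * phi))

def pso_step(x, v, p_best, p_gb, iter_, iter_max,
             c1=2.05, c2=2.05, w_max=1.0, w_min=0.0, v_max=1.0):
    """One PSO update (Equations (3)-(5)) for a swarm of shape (s, d)."""
    w = w_max - (w_max - w_min) / iter_max * iter_            # Equation (5)
    k = constriction_factor(c1, c2)                           # Equation (6)
    r1 = np.random.rand(*x.shape)
    r2 = np.random.rand(*x.shape)
    v_new = k * (w * v + c1 * r1 * (p_best - x) + c2 * r2 * (p_gb - x))  # Eq. (3)
    v_new = np.clip(v_new, -v_max, v_max)                     # velocity clamping
    x_new = x + v_new                                         # Equation (4)
    return x_new, v_new
```

A complete optimizer would evaluate a fitness function after every step and refresh p_best and p_gb accordingly, as Section IV-C describes for query vectors.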

IV. THE WEB DOCUMENT RETRIEVAL ALGORITHM

The proposed system is based on a vector space model [1] in which both documents and queries are represented as vectors. After dimension reduction by the above LDA-based method, the PSO algorithm is applied in the reduced-dimensional space. The goal of the PSO algorithm is to find an optimal set of documents which best match the user's need by exploring different regions of the document space simultaneously. The detailed steps of the PSO-based Web document retrieval algorithm are described as follows.

A. The encoding of query particles

Each particle is represented in the query vector space. A query particle, representing a query, is of the form:

    Q_u = (q_u1, q_u2, ..., q_uT)    (7)

where T is the total number of stemmed terms automatically extracted from the documents, and q_ui is the weight of the ith term in Q_u; it is represented by a real value and defines the importance of the term in the considered query. Initially, a term weight q_ui is computed by the following formula [8]:

    q_ui = (0.5 + 0.5 × freq_ui / maxfreq_u) × (log_2(N / n_i) + 1)    (8)

where freq_ui is the frequency of term t_i in document d_u, maxfreq_u is the maximum frequency of any term in document d_u, N is the total number of documents, and n_i is the number of documents containing term t_i.

Therefore, particle i's position at iteration 0 can be represented as the vector Q_i^0 = (q_i1^0, ..., q_iT^0), where T is the total number of stemmed terms automatically extracted from the documents. The velocity of particle i (i.e., V_i^0 = (v_i1^0, ..., v_iT^0)) corresponds to the term-weight update quantity; the velocity of each particle is created at random. The elements of position and velocity have the same dimension.

The initial population (set of queries) contains the initial query and a list of relevant documents retrieved by this initial query. Therefore, the initial population is not randomly constructed. Using this method, the PSO algorithm begins the exploration of the document space in "good" directions. This population is renewed after each iteration of the PSO algorithm.
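As an illustration of Equation (8), the following Python sketch computes the initial term weights for one document's query vector. The input representation (a list of stemmed terms plus precomputed document frequencies) and the variable names term_freqs, doc_freqs, and n_docs are assumptions of this sketch; any reasonable term-extraction step could be substituted.

```python
import math
from collections import Counter

def initial_query_weights(doc_terms, doc_freqs, n_docs):
    """Initial term weights q_ui of Equation (8).

    doc_terms : list of stemmed terms occurring in document d_u
    doc_freqs : dict mapping term -> number of documents containing it (n_i)
    n_docs    : total number of documents N
    """
    term_freqs = Counter(doc_terms)               # freq_ui
    max_freq = max(term_freqs.values())           # maxfreq_u
    weights = {}
    for term, freq in term_freqs.items():
        tf_part = 0.5 + 0.5 * freq / max_freq
        idf_part = math.log2(n_docs / doc_freqs[term]) + 1.0
        weights[term] = tf_part * idf_part        # q_ui
    return weights

# usage: q = initial_query_weights(["pso", "query", "pso"], {"pso": 3, "query": 12}, n_docs=100)
```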

B. Computing the fitness function

A fitness value is assigned to each query in the population. This fitness represents the effectiveness of a query during the retrieval stage. Its definition is as follows:

    F(Q_u^(s)) = Σ_{d_j ∈ D_r^(s)} Sim(d_j, Q_u^(s)) / Σ_{d_j ∈ D_nr^(s)} Sim(d_j, Q_u^(s))    (9)

where D_r^(s) is the set of relevant documents retrieved at generation s of the PSO, d_j is the jth document, D_nr^(s) is the set of non-relevant documents retrieved at generation s of the PSO, and Sim(d_j, Q_u^(s)) is a similarity measure function defined as follows:

    Sim(d_j, Q_u^(s)) = Σ_{i=1}^{T} q_ui^(s) d_ji / ( sqrt(Σ_{i=1}^{T} (q_ui^(s))^2) · sqrt(Σ_{i=1}^{T} d_ji^2) )    (10)
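The fitness of Equations (9) and (10) is essentially a ratio of cosine-similarity mass over the relevant and non-relevant result sets. A minimal NumPy sketch is given below; the small epsilon guarding against zero norms and empty non-relevant sets is an assumption added for numerical safety, not part of the original definition.

```python
import numpy as np

def cosine_sim(doc_vec, query_vec, eps=1e-12):
    """Similarity measure of Equation (10)."""
    return float(doc_vec @ query_vec) / (
        np.linalg.norm(query_vec) * np.linalg.norm(doc_vec) + eps)

def fitness(query_vec, relevant_docs, nonrelevant_docs, eps=1e-12):
    """Fitness of Equation (9): relevant similarity mass over non-relevant mass."""
    num = sum(cosine_sim(d, query_vec) for d in relevant_docs)
    den = sum(cosine_sim(d, query_vec) for d in nonrelevant_docs)
    return num / (den + eps)
```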
C. Updating the position and velocity

Each particle i memorizes its own F(Q_i^(s)) value and chooses the best one obtained so far as its personal best position P_i^(s), where s denotes the iteration number. The particle with the best F value among the P_i^(s) is denoted as the global best position P_gb^(s). Note that in the first iteration, each particle i is set directly to P_i^(0), and the particle with the best F value among the P_i^(0) is set to P_gb^(0). Since each particle's initial position is the only location encountered by that particle at the run's start, this position becomes the particle's respective personal best position P_i^(0); the first global best position P_gb^(0) is then selected from among these initial positions.

Calculate the fitness value of each particle in the population using the fitness function F given by Equation (9), and compare each particle's fitness value with that of its personal best position P_i^(s) = (p_i1^(s), ..., p_iT^(s)); the global best position is denoted as P_gb^(s). Modify the velocity V_i^(s) of each particle i according to the following formulation (consistent with Equation (3)):

    V_i^(s+1) = k [ w_i V_i^(s) + c_1 r_1 (P_i^(s) − Q_i^(s)) + c_2 r_2 (P_gb^(s) − Q_i^(s)) ]    (11)

If V_i^(s+1) > V_max, then V_i^(s+1) = V_max.

Based on the updated velocities, each individual (particle) changes its position according to the following formulation:

    Q_i^(s+1) = Q_i^(s) + V_i^(s+1)    (12)

The personal best position P_i of each individual at iteration (s + 1) is updated as follows:

    If F(Q_i^(s+1)) > F(Q_i^(s)) then P_i^(s+1) = Q_i^(s+1);
    If F(Q_i^(s+1)) < F(Q_i^(s)) then P_i^(s+1) = Q_i^(s);

where F(Q_i^(s)) denotes the fitness function evaluated at iteration number s. Meanwhile, the global best position P_gb at iteration (s + 1) is set as the best evaluated position among the P_i^(s+1).
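A compact sketch of this per-iteration bookkeeping is shown below, reusing the fitness function from Section IV-B. Representing the swarm as dense NumPy arrays and clamping every velocity component to [−V_max, V_max] are assumptions of this sketch.

```python
import numpy as np

def update_swarm(Q, V, P, P_gb, fitness_fn, k, w, c1, c2, v_max):
    """One iteration of Equations (11)-(12) plus the best-position updates.

    Q, V, P : (swarm_size, T) arrays of positions, velocities, personal bests.
    P_gb    : (T,) global best position; fitness_fn maps a vector to a score.
    """
    r1, r2 = np.random.rand(*Q.shape), np.random.rand(*Q.shape)
    V_new = k * (w * V + c1 * r1 * (P - Q) + c2 * r2 * (P_gb - Q))   # Eq. (11)
    V_new = np.clip(V_new, -v_max, v_max)                            # V_max clamp
    Q_new = Q + V_new                                                # Eq. (12)
    # Personal best update: keep the new position only if its fitness improved.
    old_fit = np.array([fitness_fn(q) for q in Q])
    new_fit = np.array([fitness_fn(q) for q in Q_new])
    P_new = np.where((new_fit > old_fit)[:, None], Q_new, Q)
    # Global best: best personal best of the updated swarm.
    best_fit = np.array([fitness_fn(p) for p in P_new])
    P_gb_new = P_new[int(np.argmax(best_fit))]
    return Q_new, V_new, P_new, P_gb_new
```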
D. Local Search Procedure

To reinforce the local search abilities of PSO, our algorithm adopts a neighborhood-based local search procedure to find a better query vector near the original query vector after applying the PSO algorithm. Let Q_u^(s)+ and Q_u^(s)− be the neighbors of the query vector Q_u^(s); their definitions are as follows:

    q_ui^(s)+ = q_ui^(s) · (1 + β)    (13)

    q_ui^(s)− = q_ui^(s) · (1 − β)    (14)

where the value of β decides the ratio of increase or decrease. Each weight in a query vector generates two neighboring vectors. From all neighboring vectors, the vector Q_u^(s)(new) with the best fitness function value is selected. If the fitness of Q_u^(s)(new) is larger than that of Q_u^(s), then Q_u^(s) is replaced by Q_u^(s)(new).
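A minimal sketch of this neighborhood search follows, reusing the fitness function from Section IV-B. Perturbing one weight at a time by (1 + β) and (1 − β) and accepting a neighbor only when its fitness improves reflects the description above; beta = 0.05 matches the experimental setting reported in Section V, while the function name and the use of a NumPy vector are assumptions of this sketch.

```python
def local_search(query_vec, fitness_fn, beta=0.05):
    """Neighborhood local search of Equations (13)-(14).

    Each weight is scaled by (1 + beta) and (1 - beta) in turn; the best
    neighbor replaces the query vector if it improves the fitness.
    """
    best_vec = query_vec.copy()
    best_fit = fitness_fn(query_vec)
    for i in range(len(query_vec)):
        for factor in (1.0 + beta, 1.0 - beta):
            neighbor = query_vec.copy()
            neighbor[i] *= factor
            fit = fitness_fn(neighbor)
            if fit > best_fit:
                best_vec, best_fit = neighbor, fit
    return best_vec
```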

E. Merging relevant documents

At each generation of PSO, the relevant documents retrieved by all individual queries of the query population are merged into a single document list and presented to the user. The adopted merging method ranks documents according to the following formula:

    Rel^(s)(d_j) = Σ_{Q_u^(s) ∈ Pop^(s)} F(Q_u^(s)) · SV(Q_u^(s), d_j)    (15)

where Pop^(s) is the population at generation s of the PSO, and SV(Q_u^(s), d_j) is the satisfaction value (SV) of document d_j for the query Q_u^(s) at generation s of the PSO. It indicates the degree to which a retrieved document satisfies the imposed query: when a retrieved document d_j satisfies the imposed query Q_u^(s), the value of SV(Q_u^(s), d_j) is set to 1, and otherwise to 0.
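A small sketch of this merging step is given below. Representing the population as (query_id, fitness) pairs and treating a document as satisfying a query exactly when it appears in that query's retrieved list (so that SV is 1 for those documents and 0 otherwise) are assumptions of this sketch.

```python
from collections import defaultdict

def merge_relevant_documents(population, retrieved_by_query):
    """Merge per-query result lists into one ranked list, as in Equation (15).

    population         : list of (query_id, fitness) pairs, one per query Q_u.
    retrieved_by_query : dict mapping query_id -> set of document ids it retrieved.
    """
    relevance = defaultdict(float)
    for query_id, fit in population:
        for doc_id in retrieved_by_query[query_id]:   # documents with SV(Q_u, d_j) = 1
            relevance[doc_id] += fit                  # accumulate F(Q_u) * SV(Q_u, d_j)
    # Present documents to the user in decreasing order of Rel(d_j).
    return sorted(relevance.items(), key=lambda item: item[1], reverse=True)
```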
F. Stopping Criteria

The PSO algorithm is terminated if the best evaluation value P_gb is not obviously improved or the iteration number s reaches the predefined maximum number of iterations.

V. EXPERIMENTAL RESULTS

In order to demonstrate the effectiveness of the proposed document retrieval algorithm based on PSO, we conducted a series of experiments and compared our proposed algorithm with document retrieval algorithms based on the relevance feedback (RF) approach [9], the genetic algorithm (GA) [10], and the support vector machine (SVM) [11]. The LIBSVM software [12] was used in our system to solve the SVM optimization problem; leave-one-out cross validation on the training data was applied to select the SVM parameters, and the RBF kernel was used. The parameter values of the proposed PSO algorithm are empirically set as follows: the ratio β of increase (decrease) in local search is set to 0.05, and the number of iterations is fixed at 4.

The standard Reuters-21578 document collection [13] was used in our experiments. The actual computational time of the different algorithms is given in Table I. All four algorithms respond to the user's query very fast, that is, within 0.1 s. Our PSO algorithm is slightly faster than the other three algorithms. The reason is that the LDA-based dimensionality reduction method reduces the cost of the post-processing involved in document retrieval and improves retrieval speed and efficiency.

TABLE I
AVERAGE RUNNING TIME COMPARISON

Method   T-1 (s)   T-2 (s)   T-3 (s)   T-4 (s)
PSO      0.058     0.061     0.067     0.069
SVM      0.073     0.079     0.081     0.088
GA       0.067     0.072     0.076     0.085
RF       0.074     0.082     0.086     0.091

In addition, we also compare the number of relevant documents retrieved using PSO and the other algorithms. Table II gives the number of relevant documents retrieved at each iteration. We can clearly see that our PSO is more effective than the other algorithms at retrieving relevant documents. The reason is that we designed the position and velocity updating operations of each particle according to the characteristics of information retrieval, and used a local search method to speed up finding a better query vector near the original query vector after applying the PSO operation. Therefore, our proposed PSO query optimization algorithm efficiently improves the performance of the query search.

TABLE II
RELEVANT DOCUMENT COMPARISON

Method   T-1   T-2   T-3   T-4
PSO      111   69    54    56
SVM      96    64    56    53
GA       89    62    58    51
RF       83    60    56    54

VI. CONCLUSIONS

With the rapid development of the Internet, information on the Internet is increasing exponentially. As a consequence, the role of information retrieval (IR) systems is becoming more important. To achieve higher efficiency in manipulating Web document data, we first project the documents into a lower-dimensional subspace in which the semantic structure of the document space becomes clear. Once the high-dimensional document space is mapped into a lower-dimensional space, the PSO-based document retrieval algorithm is then applied. Experimental results are provided to validate the effectiveness of the proposed approach.

REFERENCES

[1] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983.
[2] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd edition, New York: Wiley-Interscience, 2000.
[3] P. Comon, "Independent component analysis-a new concept?", Signal Processing, vol. 36, no. 3, pp. 287-314, April 1994.
[4] R. C. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory", in Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1995, pp. 39-43.
[5] H. Li, K. Zhang, and T. Jiang, "Robust and accurate cancer classification with gene expression profiling", in Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference, 2005, pp. 310-321.
[6] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd edition, Baltimore: Johns Hopkins University Press, 1996.
[7] J. Kennedy, "The particle swarm: social adaptation of knowledge", in Proceedings of the 1997 IEEE International Conference on Evolutionary Computation, 1997, pp. 303-308.
[8] A. Singhal, C. Buckley, and M. Mitra, "Pivoted document length normalisation", in Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp. 21-29.
[9] B. T. Bartell, G. W. Cottrell, and R. K. Belew, "Optimising similarity using multiquery relevance feedback", Journal of the American Society for Information Science, vol. 49, no. 8, pp. 742-761, December 1998.
[10] J.-T. Horng and C.-C. Yeh, "Applying genetic algorithms to query optimization in document retrieval", Information Processing and Management, vol. 36, no. 5, pp. 737-759, September 2000.
[11] V. N. Vapnik, Statistical Learning Theory, New York: John Wiley & Sons, 1998.
[12] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines", http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[13] D. D. Lewis, "Reuters-21578 text categorization test collection distribution 1.0", http://www.research.att.com/~lewis, 1999.
performance of the query search.
