Machine Learning for Information Retrieval
Hsinchun Chen
University of Arizona, Management Information Systems Department, Karl Eller Graduate School of Management,
McClelland Hall 4302, Tucson, AZ 85721. E-mail: [email protected]
Information retrieval using probabilistic techniques has attracted significant attention on the part of researchers in information and computer science over the past few decades. In the 1980s, knowledge-based techniques also made an impressive contribution to intelligent information retrieval and indexing. More recently, information science researchers have turned to other newer artificial-intelligence-based inductive learning techniques, including neural networks, symbolic learning, and genetic algorithms. These newer techniques, which are grounded on diverse paradigms, have provided great opportunities for researchers to enhance the information processing and retrieval capabilities of current information storage and retrieval systems. In this article, we first provide an overview of these newer techniques and their use in information science research. To familiarize readers with these techniques, we present three popular methods: the connectionist Hopfield network; the symbolic ID3/ID5R; and evolution-based genetic algorithms. We discuss their knowledge representations and algorithms in the context of information retrieval. Sample implementation and testing results from our own research are also provided for each technique. We believe these techniques are promising in their ability to analyze user queries, identify users' information needs, and suggest alternatives for search. With proper user-system interactions, these methods can greatly complement the prevailing full-text, keyword-based, probabilistic, and knowledge-based techniques.
Introduction
This situation is particularly evident for textual databases, which are widely used in traditional library science environments, in business applications (e.g., manuals, newsletters, and electronic data interchanges), and in scientific applications (e.g., electronic community systems and scientific databases). Information stored in these databases often has become voluminous, fragmented, and unstructured after years of intensive use. Only users with extensive subject area knowledge, system knowledge, and classification scheme knowledge (Chen & Dhar, 1990) are able to maneuver and explore in these textual databases.
Most commercial information retrieval systems still rely on conventional inverted index and Boolean querying techniques. Even full-text retrieval has produced less than satisfactory results (Blair & Maron, 1985). Probabilistic retrieval techniques have been used to improve the retrieval performance of information retrieval systems (Bookstein & Swanson, 1975; Maron & Kuhns, 1960). The approach is based on two main parameters, the probability of relevance and the probability of irrelevance of a document. Despite various extensions, probabilistic methodology still requires the independence assumption for terms and suffers from the difficulty of estimating term-occurrence parameters correctly (Gordon, 1988; Salton, 1989).
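To make the two parameters concrete, the standard formulation (a textbook rendering in the spirit of van Rijsbergen, 1979, not a formula reproduced from this article) ranks a document $d$ by the log-odds of its relevance, which, under the term-independence assumption, reduces to a sum of term weights:

$$g(d) \;=\; \log \frac{P(R \mid d)}{P(\bar{R} \mid d)} \;\propto\; \sum_{t \in d} \log \frac{p_t\,(1-q_t)}{q_t\,(1-p_t)}$$

where $p_t = P(t \mid R)$ and $q_t = P(t \mid \bar{R})$ are the probabilities that term $t$ occurs in relevant and irrelevant documents, respectively. The difficulty noted above is precisely the estimation of $p_t$ and $q_t$.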
Since the late 1980s, knowledge-based techniques have been used extensively by information science researchers. These techniques have attempted to capture searchers' and information specialists' domain knowledge and classification scheme knowledge, effective search strategies, and query refinement heuristics in document retrieval systems design (Chen & Dhar, 1991). Despite their usefulness, systems of this type are considered performance systems (Simon, 1991); they only perform what they were programmed to do (i.e., they are without learning ability). Significant efforts are often required to acquire knowledge from domain experts and to maintain and update the knowledge base.
A newer paradigm, generally considered to be the machine learning approach, has attracted the attention of researchers in artificial intelligence, computer science, and other functional disciplines such as engineering, medicine, and business (Carbonell, Michalski, & Mitchell, 1983; Michalski, 1983; Weiss & Kulikowski, 1991). In contrast to performance systems, which acquire knowledge from human experts, machine learning systems acquire knowledge automatically from examples, that is, from source data. The most frequently used techniques include symbolic, inductive learning algorithms such as ID3 (Quinlan, 1979); multiple-layered, feed-forward neural networks such as backpropagation networks (Rumelhart, Hinton, & Williams, 1986); and evolution-based genetic algorithms (Goldberg, 1989). Many information science researchers have started to experiment with these techniques as well (Belew, 1989; Chen & Lynch, 1992; Chen et al., 1993; Gordon, 1989; Kwok, 1989).
In this article, we aim to review the prevailing machine learning techniques and to present several sample implementations in information retrieval to illustrate the associated knowledge representations and algorithms. Our objective is to bring these newer techniques to the attention of information science researchers by way of a comprehensive overview and discussion of algorithms. We review the probabilistic and knowledge-based techniques and the emerging machine learning methods developed in artificial intelligence (AI). We then summarize some recent work adopting AI techniques in information retrieval (IR). After the overview, we present in detail a neural network implementation (the Hopfield network), a symbolic learning implementation (ID3 and ID5R), and a genetic algorithms implementation. Detailed algorithms, selected IR examples, and preliminary testing results are also provided. A summary concludes the study.
Information Retrieval Using Probabilistic, Knowledge-Based, and Machine Learning Techniques
Learning Systems: An Overview. The symbolic machine learning technique, the resurgent neural networks approach, and evolution-based genetic algorithms provide drastically different methods of data analysis and knowledge discovery (Chen et al., in press; Fisher & McKusick, 1989; Kitano, 1990; Mooney et al., 1989; Weiss & Kapouleas, 1989; Weiss & Kulikowski, 1991). These techniques are diverse in their origins and underlying paradigms.
Symbolic learning and ID3: Symbolic machine learning techniques, which can be classified by such underlying learning strategies as rote learning, learning by being told, learning by analogy, learning from examples, and learning from discovery (Carbonell, Michalski, & Mitchell, 1983), have been studied extensively by AI researchers over the past two decades. Among these techniques, learning from examples, a special case of inductive learning, appears to be the most promising symbolic machine learning technique for knowledge discovery or data analysis. It induces a general concept description that best describes the positive and negative examples. Examples of algorithms which require both positive and negative examples are Quinlan's (1983) ID3 and Mitchell's (1982) Version Space. Some algorithms are batch-oriented, such as ID3, whereas others, such as Utgoff's ID5R, are incremental.
In comparative studies, statistical analysis methods, the backpropagation net, and decision-tree-based inductive learning methods (ID3-like) were found to achieve comparable performance for several data sets. Fisher and McKusick (1989) found that, using batch learning, backpropagation performed as well as ID3 but was more noise-resistant. They also compared the effect of incremental learning versus batch learning. Kitano (1990) performed systematic, empirical studies on the speed of convergence of backpropagation networks and genetic algorithms. The results indicated that genetic search is, at best, equally efficient as faster variants of a backpropagation algorithm in very small scale networks, but far less efficient in larger networks. Earlier research by Montana and Davis (1989), however, showed that using some domain-specific genetic operators to train the backpropagation network, instead of using the conventional backpropagation delta learning rule, improved performance. Harp, Samad, and Guha (1989) also achieved good results by using GAs for neural network design.

Systems developed by Kitano (1990) and Harp et al. (1989) are also considered hybrid systems (genetic algorithms and neural networks), as are systems like COGIN (Greene & Smith, 1991), which performed symbolic induction using genetic algorithms, and SC-net (Hall & Romaniuk, 1990), which is a fuzzy connectionist expert system. Other hybrid systems developed in recent years employ symbolic and neural net characteristics. For example, Touretzky and Hinton (1988) and Gallant (1988) proposed connectionist production systems, and Derthick (1988) and Shastri (1991) developed different connectionist semantic networks.
Learning Systems in IR. The adaptive learning techniques cited have also drawn attention from researchers in information science in recent years. In particular, Doszkocs, Reggia, and Lin (1990) provided an excellent review of connectionist models for information retrieval, and Lewis (1991) briefly surveyed previous research on machine learning in information retrieval and discussed promising areas for future research at the intersection of these two fields.
In these models, intelligent behavior emerges from the local interactions that occur concurrently between the numerous network nodes through their synaptic connections. By taking a broader definition of connectionist models, these authors were able to discuss the well-known vector space model, cosine measures of similarity, and automatic clustering and thesauri in the context of network representation. Based on the network representation, spreading activation methods such as the constrained spreading activation adopted in GRANT (Cohen & Kjeldsen, 1987) and the branch-and-bound algorithm adopted in METACAT (Chen & Dhar, 1991) can be considered variants of connectionist activation. However, only a few systems are considered classical connectionist systems, which typically consist of weighted, unlabeled links and exhibit some adaptive learning capabilities.
The work of Belew is probably the earliest connectionist model adopted in IR. In AIR (Belew, 1989), he developed a three-layer neural network of authors, index terms, and documents. The system used relevance feedback from its users to change its representation of authors, index terms, and documents over time. The result was a representation of the consensual meaning of keywords and documents shared by some group of users. One of his major contributions was the use of a modified correlational learning rule. The learning process created many new connections between documents and index terms. Rose and Belew (1991) extended AIR to a hybrid connectionist and symbolic system called SCALIR, which used analogical reasoning to find relevant documents for legal research. Kwok (1989) also developed a similar three-layer network of queries, index terms, and documents. A modified Hebbian learning rule was used to reformulate probabilistic information retrieval. Wilkinson and Hingston (1991, 1992) incorporated the vector space model in a neural network for document retrieval. Their network also consisted of three layers: queries, terms, and documents. They have shown that spreading activation through related terms can help improve retrieval performance.
While the above systems represent information retrieval applications in terms of their main components of documents, queries, index terms, authors, etc., other researchers used different neural networks for more specific tasks. Lin, Soergel, and Marchionini (1991) adopted a Kohonen network for information retrieval. Kohonen's feature map, which produces a two-dimensional grid representation for N-dimensional features, was applied to construct a self-organizing (unsupervised learning), visual representation of the semantic relationships between input documents. In MacLeod and Robertson (1991), a neural algorithm was used for document clustering. The algorithm compared favorably with conventional hierarchical clustering algorithms. Chen et al. (1992, 1993, in press) reported a series of experiments and system developments which generated an automatically created weighted network of keywords from large textual databases and integrated it with several existing man-made thesauri (e.g., the ACM Computing Review Classification System and the LCSH).
Symbolic learning and IR: Han (1991) and Han, Cai, and Cercone (1993) developed an attribute-oriented, tree-ascending method for extracting characteristic and classification rules from relational databases. The technique relied on an existing conceptual tree for identifying higher-level, abstract concepts in the attributes. Ioannidis, Saulys, and Whitsitt (1992) examined the idea of incorporating machine learning algorithms (UNIMEM and COBWEB) into a database system for monitoring the stream of incoming queries and generating hierarchies with the most important concepts expressed in those queries. The goal is for these hierarchies to provide valuable input for dynamically modifying the physical and logical designs of a database. Also related to database design, Borgida and Williamson (1985) proposed the use of machine learning to represent exceptions in databases that are based on semantic data models. Li and McLeod (1989) used machine learning techniques to handle object flavor evolution in object-oriented databases.
Genetic algorithms and IR: Our literature search revealed several implementations of genetic algorithms in information retrieval. Gordon (1988) presented a genetic-algorithms-based approach for document indexing. Competing document descriptions (keywords) are associated with a document and altered over time by using genetic mutation and crossover operators. In his design, a keyword represents a gene (a bit pattern), a document's list of keywords represents an individual (a bit string), and a collection of documents initially judged relevant by a user represents the initial population. Based on Jaccard's score matching function (a fitness measure), the initial population evolved through generations and eventually converged to an optimal (improved) population: a set of keywords which best described the documents. Gordon (1991) further adopted a similar approach to document clustering. His experiment showed that after genetically redescribing the subject description of documents, descriptions of documents found co-relevant to a set of queries will bunch together. Redescription improved the relative density of co-relevant documents by 39.74% after 20 generations and 56.61% after 40 generations. Raghavan and Agarwal (1987) have also studied genetic algorithms in connection with document clustering. Petry et al. (1993) applied genetic programming to a weighted information retrieval system. In their research, a weighted Boolean query was modified to improve recall and precision. They found that the form of the fitness function has a significant effect upon performance. Yang and coworkers (Yang & Korfhage, 1993; Yang, Korfhage, & Rasmussen, 1993) have developed adaptive retrieval methods based on genetic algorithms and the vector space model using relevance feedback. They reported the effect of adopting genetic algorithms in large databases, the impact of genetic operators, and GAs' parallel searching capability. Frieder and Siegelmann (1991) also reported a data placement strategy for parallel information retrieval systems using a genetic algorithms approach. Their results compared favorably with pseudo-optimal document placement.
A Hopfield Network: Knowledge Representation and Procedure

Our Hopfield net spreading activation algorithm proceeded in the following steps:

(1) Assigning synaptic weights: For the automatically generated thesaurus, the resulting links represent probabilistic, synaptic weights between any two concepts. For other external thesauri which contain only symbolic links (e.g., narrower term, synonymous term, broader term, etc.), a user-guided procedure of assigning a probabilistic weight to each symbolic link can be adopted (Chen et al., 1993). The training phase of the Hopfield net is completed when the weights have been computed or assigned; $t_{ij}$ represents the synaptic weight from node $i$ to node $j$.

(2) Initialization with search terms: An initial set of search terms is provided by searchers, which serves as the input pattern. Each node in the network which matches the search terms is initialized (at time 0) to have a weight of 1:

$$\mu_i(0) = x_i, \qquad 0 \le i \le n - 1$$

where $\mu_i(t)$ is the output of node $i$ at time $t$ and $x_i$ indicates the presence (1) or absence (0) of term $i$ in the input pattern.

(3) Activation: Nodes are activated in parallel, and each node computes its new output through the SIGMOID transformation:

$$\mu_j(t+1) = f_s(net_j) = \frac{1}{1 + \exp[-(net_j - \theta_j)/\theta_0]}$$

where $net_j = \sum_{i=0}^{n-1} t_{ij}\,\mu_i(t)$ and $\theta_j$ serves as a threshold or bias. This formula shows the parallel relaxation property of the Hopfield net. At each iteration, all nodes are activated at the same time. The weight computation scheme, $net_j = \sum_{i=0}^{n-1} t_{ij}\,\mu_i(t)$, is a unique characteristic of the Hopfield net algorithm. Based on parallel activation, each newly activated node derives its new weight from the summation of the products of the weights assigned to its neighbors and their synapses.

(4) Convergence: The above process is repeated until there is no change in output between two iterations, which is accomplished by checking:

$$\sum_{j=0}^{n-1} \bigl|\mu_j(t+1) - \mu_j(t)\bigr| \le \delta$$

where $\delta$ is the maximal allowable error (a small number). The final output represents the set of terms relevant to the starting keywords. Some default threshold values were selected for $(\theta_0, \theta_j)$.
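As a concrete illustration of steps (2)-(4), the following is a minimal Python sketch of the parallel relaxation procedure. It is not the article's implementation: the term list, the co-occurrence weight matrix, the threshold values, and the clamping of the searcher's seed terms are all illustrative assumptions.

import numpy as np

def hopfield_spread(weights, terms, seed_terms,
                    theta0=0.1, theta_j=0.5, delta=1e-3, max_iter=100):
    """Parallel relaxation over a weighted keyword network.

    weights[i][j] holds the synaptic weight t_ij between terms i and j;
    seed_terms are the searcher-supplied keywords, activated at time 0.
    """
    seeds = [terms.index(t) for t in seed_terms]
    mu = np.zeros(len(terms))
    mu[seeds] = 1.0                                  # mu_i(0) = x_i
    for _ in range(max_iter):
        net = weights.T @ mu                         # net_j = sum_i t_ij * mu_i(t)
        new_mu = 1.0 / (1.0 + np.exp(-(net - theta_j) / theta0))  # SIGMOID f_s
        new_mu[seeds] = 1.0  # keep seed terms fully active (an assumption)
        if np.abs(new_mu - mu).sum() <= delta:       # convergence check
            return sorted(zip(terms, new_mu), key=lambda p: -p[1])
        mu = new_mu
    return sorted(zip(terms, mu), key=lambda p: -p[1])

# Toy network: four terms with assumed co-occurrence weights.
terms = ["information retrieval", "indexing", "thesaurus", "recall"]
W = np.array([[0.0, 0.7, 0.5, 0.3],
              [0.7, 0.0, 0.4, 0.2],
              [0.5, 0.4, 0.0, 0.6],
              [0.3, 0.2, 0.6, 0.0]])
for term, act in hopfield_spread(W, terms, ["information retrieval"]):
    print(f"{term:25s} {act:.2f}")

In a real system, the weight matrix would come from the automatically generated thesaurus of step (1), and the thresholds $(\theta_0, \theta_j)$ would be tuned so that only strongly associated terms remain active at convergence.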
A sample session of the Hopfield net spreading activation is presented below. Three thesauri were incorporated in the experiment: a Public thesaurus (generated automatically from 3000 articles extracted from DIALOG), the ACM Computing Review Classification System (ACM CRCS), and a portion of the Library of Congress Subject Headings (LCSH) in the computing area. The links in the ACM CRCS and in the LCSH were assigned weights between 0 and 1. Several user subjects (MIS graduate students) were also asked to review selected articles and create their own folders for topics of special interest to them. Notice that some keywords were folder names assigned by the users (in the format of *.*); for example, the QUERY.OPT folder for query optimization topics, the DBMS.AI folder for artificial intelligence and databases topics, and the KEVIN.HOT folder for HOT (current) topics selected by a user, Kevin. In the example shown below, the searcher was asked to identify descriptors relevant to knowledge-indexed deductive search. The initial search terms were: information retrieval, knowledge base, thesaurus, and automatic indexing (as shown in the following interaction).
*--------------------*
Initial terms:  {* Supplied by the subject. *}
--------------------
1. (P  L) INFORMATION RETRIEVAL   {* P: Public, A: ACM, L: LCSH *}
2. (P   ) KNOWLEDGE BASE
3. (P   ) THESAURUS
4. (P  L) AUTOMATIC INDEXING
*--------------------*
TABLE 1. Hopfield net activation: terms suggested over iterations 0-4 (activation weights in parentheses).

Iteration 0 (initial terms): INFORMATION RETRIEVAL (1.00), KNOWLEDGE BASE (1.00), THESAURUS (1.00), AUTOMATIC INDEXING (1.00).
Iterations 2-4: INDEXING (0.65), KEVIN.HOT (0.56), CLASSIFICATION (0.50), EXPERT SYSTEMS (0.50), ROSS.HOT (0.44), RECALL (0.50), INFORMATION RETRIEVAL SYSTEM EVALUATION (0.26), SELLING - INFORMATION STORAGE AND RETRIEVAL SYSTEMS (0.15).
1.  (P   ) INDEXING
2.  (   L) SELLING - INFORMATION STORAGE AND RETRIEVAL SYSTEMS
3.  (P   ) INFORMATION RETRIEVAL SYSTEM EVALUATION
4.  (P   ) RECALL
5.  (P   ) CLASSIFICATION
6.  (   L) INFORMATION STORAGE AND RETRIEVAL SYSTEMS
7.  (P  L) INFORMATION RETRIEVAL
8.  (P   ) KNOWLEDGE BASE
9.  (P   ) THESAURUS
10. (P  L) AUTOMATIC INDEXING

Enter the number of system-suggested terms or '0' to quit >> 1, 2, 4, 5, ...

{* The user decided to broaden the search by requesting the Hopfield network to identify 30 new terms based on the terms he had selected. *}

Enter numbers [1 to 40] or '0' to quit: 3-7, 9, 33, 35, 36, 38

Enter numbers [1 to 67] or '0' to quit: 0

{* The system listed his final selections. *}
1.  (P   ) PRECISION
2.  (P  L) INFORMATION RETRIEVAL
3.  (P   ) INDEXING
4.  (P  L) AUTOMATIC INDEXING
5.  (P   ) RECALL
6.  (   L) AUTOMATIC ABSTRACTING
7.  (   L) AUTOMATIC CLASSIFICATION
8.  (   L) AUTOMATIC INFORMATION RETRIEVAL
9.  (P   ) INFORMATION RETRIEVAL SYSTEM EVALUATION
10. (P   ) THESAURUS
11. (   L) INFORMATION STORAGE AND RETRIEVAL SYSTEMS
12. (P   ) KNOWLEDGE BASE

{* A total of 12 terms were selected. Eight terms were suggested by the Hopfield net algorithm. *}
TABLE 2. Hopfield network testing. For test queries of 2, 3, 4, 5, and 10 search terms (five queries per size), the table reported the distribution of the query terms and of the network-suggested terms across the three thesauri (P: Public, A: ACM, L: LCSH), the number of iterations, and the time in seconds required for convergence. On average, the suggested terms were drawn from the three sources in the proportions 14.5 (P), 2.5 (A), and 8.5 (L).
ID3/ID5R: ID3 selects, at each decision node, the attribute that yields the greatest entropy reduction over the training set, where the entropy of a set with proportions $p_+$ of positive and $p_-$ of negative examples is computed as $-p_+ \log_2 p_+ - p_- \log_2 p_-$.
Considered an incremental version of the ID3 algorithm, ID5R, developed by Utgoff (1989), is guaranteed to build the same decision tree as ID3 for a given set of training instances (Quinlan, 1993). In ID5R, a non-leaf node contains an attribute test (as in ID3) and a set of other non-test attributes, each with object counts for the possible values of the attribute. This additional non-test attribute and object count information at each non-leaf node allows ID5R to update a decision tree without rebuilding the entire tree. During the tree rebuilding process, an old test node may be replaced by a new attribute or swapped with other positions in the tree. As in ID3, the tree-building process requires much less computation time than other inductive learning methods, including neural networks and genetic algorithms.
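To make the entropy computation concrete, here is a minimal Python sketch of ID3-style attribute selection. It is not the article's code; the keyword attributes and the handful of training documents are hypothetical.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of (+)/(-) class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, attr):
    """Entropy reduction from splitting on a yes/no keyword attribute.

    examples: list of (attribute_dict, label) pairs, label '+' or '-'.
    """
    base = entropy([label for _, label in examples])
    remainder = 0.0
    for value in ("y", "n"):
        subset = [label for attrs, label in examples if attrs[attr] == value]
        if subset:
            remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

# Hypothetical training instances: does each document carry each keyword?
docs = [({"thesaurus": "y", "retrieval": "y"}, "+"),
        ({"thesaurus": "y", "retrieval": "n"}, "+"),
        ({"thesaurus": "n", "retrieval": "y"}, "-"),
        ({"thesaurus": "n", "retrieval": "n"}, "-"),
        ({"thesaurus": "y", "retrieval": "n"}, "-")]

# ID3 picks the attribute with the largest entropy reduction as the root test.
best = max(("thesaurus", "retrieval"), key=lambda a: information_gain(docs, a))
print(best, information_gain(docs, best))

ID5R computes the same information gain, but also caches, at every non-leaf node, the object counts for the non-test attributes, so that a new training instance only updates counts and, when warranted, swaps in a new test attribute instead of rebuilding the tree from scratch.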
To create a robust and real-time inductive learning system, a relevance feedback scheme was introduced into our implementation. Although the proposed inductive learning algorithms require users to provide examples to confirm their interests, it is inconceivable that users would be able to browse the entire database to identify such instances. An incremental, interactive feedback process, therefore, was designed to allow users to examine a few documents at a time. In essence, our ID5R algorithm was implemented such that it provided a few suggested documents based on the documents initially provided by the users.
An ID3/ID5R Example

We created a small test database of 60 records. For evaluation purposes, we were able to manually select a small set of target desired documents (i.e., eight documents in the areas of information retrieval and keywording). The goal of the experiment was to present a few documents at a time to our system and see whether the system would be able to identify them after the iterative relevance feedback process. The performance of the approach was then measured by the recall achieved after each feedback interaction.

Initial Training Instances: Each training instance was represented as a document number (e.g., 006, 008, 083, 084), a list of yes/no (y/n) values indicating whether each candidate keyword was assigned to the document, and a class label: (+) for a desired document and (-) otherwise.
The thesaurus keyword produced the most entropy reduction and was thus selected as the first decision node. Following the same computation, retrieval was selected as the next (and last) decision node. ID3 constructed the decision tree shown in Figure 1. In the figure, for example, [2, 1] means that 2 instances were in the negative class and 1 instance was in the positive class at that node.
We developed a test database of about 1000 documents from the 1992 COMPENDEX CD-ROM collection of computing literature. We then identified 10 research topics, each of which had between 5 and 20 relevant documents in the database (manually identified). The testing was conducted by comparing the recall of the ID3 algorithm and that of the ID5R incremental approach on the 10 research topics.

Detailed results of the experiment are presented in Table 3. ID5R and ID3 achieved the same levels of performance for 5 of the 10 test cases (cases 3 and 6-9). After we examined these cases carefully, we found that the initial documents presented for these cases had very precise keywords assigned to them. New instances provided during relevance feedback were consistent with the initial documents, thus ID5R did not revise its decision tree. (At each interaction, ID5R searched only a portion of the entire database. The trees constructed by ID3 remained constant because ID3 did not have any interaction with its users. However, to compare its results with those of ID5R fairly, ID3's performance at each interaction was computed based on the same documents visited by ID5R. As more documents were examined, ID3's classification results may also have improved.)

For the other five test cases, ID5R's performance increased gradually until it reached 93.1%; ID3 was able to reach 74.9%. These research topics tended to have more diverse keywords in the initial documents provided. ID5R appeared to benefit from incremental query tree revision based on the relevance feedback information provided by users. In all 10 cases, ID5R was able to terminate within eight interactions. The response times were often less than a second for each decision-tree-building process.
In conclusion, the symbolic ID3 algorithm and its ID5R variant were both shown to be promising techniques for inductive document retrieval. By using the entropy-reduction heuristic, both algorithms were able to identify the keywords that best distinguished the desired documents from the rest of the collection.
TABLE 3. ID3/ID5R testing results for the 10 research topics over eight relevance feedback interactions. Each cell gave the number of target documents identified (ID3/ID5R); average recall rose from 16.0%/16.0% at the first interaction to 74.9%/93.1% by the eighth (intermediate per-interaction averages included 16.5/31.2, 35.0/40.1, 55.5/64.1, 66.3/79.3, 74.0/90.4, and 74.0/91.3).
A Genetic Algorithm for IR: Knowledge Representation and Procedure
Genetic algorithms (GAs) (Goldberg, 1989; Kohonen, 1989; Michalewicz, 1992) are problem-solving systems based on principles of evolution and heredity. A GA maintains a population of individuals, P(t) = {x_1, ..., x_n}, at iteration t. Each individual represents a potential solution to the problem at hand and is implemented as some (possibly complex) data structure S. Each solution x_i is evaluated to give some measure of fitness. Then a new population at iteration t + 1 is formed by selecting the fitter individuals. Some members of the new population undergo transformation by means of genetic operators to form new solutions. There are unary transformations m_i (mutation type), which create new individuals by a small change in a single individual, and higher-order transformations c_j (crossover type), which create new individuals by combining parts from several (two or more) individuals. For example, if parents are represented by the five-dimensional vectors (a_1, a_2, a_3, a_4, a_5) and (b_1, b_2, b_3, b_4, b_5), then a crossover of the chromosomes after the second gene produces the offspring (a_1, a_2, b_3, b_4, b_5) and (b_1, b_2, a_3, a_4, a_5). The control parameters for genetic operators
(probability of crossover and mutation) need to be carefully selected to provide better performance. The intuition behind the crossover operation is information exchange between different potential solutions. After some number of generations the program converges; the best individual, hopefully, represents the optimum solution. Michalewicz (1992) provided an excellent algorithmic discussion of GAs. Goldberg (1989, 1994) presented a good summary of many recent GA applications in biology, computer science, engineering, operations research, physical sciences, and social sciences.

Genetic algorithms use a vocabulary borrowed from natural genetics, in that they talk about genes (or bits), chromosomes (individuals or bit strings), and populations (of individuals). Populations evolve through generations. Our genetic algorithm was executed in the following steps (a code sketch follows the procedure):
(1) Initialize population and evaluate fitness: To initialize a population, we needed first to decide the number of genes for each individual and the total number of chromosomes (popsize) in the initial population. When adopting GAs in IR, each gene (bit) in the chromosome (bit string) represents a certain keyword or concept. The loci (locations of a certain gene) decide the existence (1, ON) or nonexistence (0, OFF) of a concept. A chromosome therefore represents a document that consists of multiple concepts. The initial population contains a set of documents which were judged relevant by a searcher through relevance feedback. The goal of the GA was to find an optimal set of documents which best matched the searcher's needs (expressed in terms of underlying keywords or concepts). An evaluation function for the fitness of each chromosome was selected based on Jaccard's score matching function, as used by Gordon (1988) for document indexing. The Jaccard's score between two sets, X and Y, was computed as:

$$\#(X \cap Y)\,/\,\#(X \cup Y)$$

where #(S) indicates the cardinality of set S. The Jaccard's score is a common measure of association in information retrieval (van Rijsbergen, 1979).
(2) Reproduction (selection): Reproduction is the selection of a new population with respect to the probability distribution based on the fitness values. Fitter individuals have better chances of being selected for reproduction (Michalewicz, 1992). A roulette wheel with slots sized according to the total fitness of the population, F, was defined as follows:

$$F = \sum_{i=1}^{popsize} fitness(V_i)$$

where fitness(V_i) indicates the fitness value of chromosome V_i according to the Jaccard's score. Each chromosome had a certain number of slots proportional to its fitness value. The selection process was based on spinning the wheel popsize times; each time, we selected a single chromosome for the new population. Obviously, some chromosomes were selected more than once. This is in accordance with genetic inheritance: the best chromosomes get more copies, the average stay even, and the worst die off.
(3) Recombination (crossover and mutation): We were then ready to apply the first recombination operator, crossover, to the individuals in the new population. The probability of crossover, p_c, gave us the expected number p_c × popsize of chromosomes which should undergo the crossover operation. For each chromosome, we generated a random number r between 0 and 1; if r < p_c, then the chromosome was selected for crossover. We then mated selected pairs of chromosomes randomly: for each pair of coupled chromosomes, we generated a random integer pos from the range (1, ..., m − 1), where m was the total number of genes in a chromosome. The number pos indicated the position of the crossing point. The coupled chromosomes exchanged genes at the crossing point as described earlier.

The next recombination operator, mutation, was performed on a bit-by-bit basis. The probability of mutation, p_m, gave us the expected number of mutated bits, p_m × m × popsize. Every bit in all chromosomes of the whole population had an equal chance to undergo mutation, that is, to change from 0 to 1 or vice versa. For each chromosome in the crossed-over population, and for each bit within the chromosome, we generated a random number r from the range (0, ..., 1); if r < p_m, we mutated the bit. Typical p_c values selected ranged between 0.7 and 0.9, and p_m ranged between 0.01 and 0.03.
(4) Convergence: Following reproduction, crossover, and mutation, the new population was ready for its next generation. The rest of the evolution consisted of cyclic repetitions of the above steps until the system reached a predetermined number of generations.
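The following is a minimal Python sketch of steps (1)-(4): Jaccard's-score fitness, roulette-wheel reproduction, single-point crossover, and bit mutation. It is an illustrative reconstruction rather than the article's code; in particular, measuring fitness against a single assumed target concept vector is a simplification of the article's use of a searcher's relevance-feedback set, and the toy population and parameter values (chosen within the p_c and p_m ranges stated above) are assumptions.

import random

def jaccard(x, y):
    """Jaccard's score between two bit strings: #(X & Y) / #(X | Y)."""
    inter = sum(a & b for a, b in zip(x, y))
    union = sum(a | b for a, b in zip(x, y))
    return inter / union if union else 0.0

def evolve(population, target, pc=0.8, pm=0.02, generations=50):
    m = len(target)
    for _ in range(generations):
        # (1) Fitness of each chromosome against the target concepts.
        fits = [jaccard(ch, target) for ch in population]
        total = sum(fits)
        # (2) Reproduction: spin the roulette wheel popsize times.
        if total:
            population = [random.choices(population, weights=fits)[0]
                          for _ in range(len(population))]
        # (3a) Crossover: pair chromosomes; exchange genes after point pos.
        random.shuffle(population)
        for i in range(0, len(population) - 1, 2):
            if random.random() < pc:
                pos = random.randint(1, m - 1)
                a, b = population[i], population[i + 1]
                population[i] = a[:pos] + b[pos:]
                population[i + 1] = b[:pos] + a[pos:]
        # (3b) Mutation: flip each bit with probability pm.
        population = [[g ^ 1 if random.random() < pm else g for g in ch]
                      for ch in population]
    # (4) After the final generation, return the fittest chromosome.
    return max(population, key=lambda ch: jaccard(ch, target))

# Toy run: 5 documents over 8 concepts; the target concept vector is assumed.
random.seed(1)
docs = [[random.randint(0, 1) for _ in range(8)] for _ in range(5)]
target = [1, 0, 1, 0, 1, 0, 0, 1]
print(evolve(docs, target))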
A GA Example
The initial population consisted of five documents (DOC0 through DOC4), each represented as a 33-bit chromosome in which each gene position corresponds to a keyword (e.g., DATA, RETRIEVAL, DATABASE, COMPUTER):

DOC0  111111111111111111110000000000000  fitness [0.287744]
DOC1  000010000001000100001111100000000  fitness [0.411692]
DOC2  000010000000000101000010011110000  fitness [0.367556]
DOC3  000000000001100100000010101001110  fitness [0.427473]
DOC4  000000000001000100000010101000001  fitness [0.451212]

Average fitness = 0.3891
After convergence, the population consisted of five copies of the fittest chromosome:

000000000001000100000010101000001  fitness [0.45121]
000000000001000100000010101000001  fitness [0.45121]
000000000001000100000010101000001  fitness [0.45121]
000000000001000100000010101000001  fitness [0.45121]
000000000001000100000010101000001  fitness [0.45121]

Average fitness = 0.4512
GA testing results: the GA was tested on sets of 1, 2, 3, 4, 5, and 10 relevant documents (five test cases of each size). For each case, the table reported the initial Jaccard's score, the converged GA score, the relative improvement, the CPU time (in seconds), and the number of documents selected. The one- and two-document cases showed no improvement, while the average improvement grew with the size of the test set: roughly 5.4% for three, 6.7% for four, 13.3% for five, and 17.7% for ten relevant documents; the overall average GA score was 0.5511.
Beyond our preliminary investigation of these selected machine learning techniques for IR, there are numerous research directions that need to be pursued before we can develop a robust solution to intelligent information retrieval. We briefly review several important research directions below:
• Limitations of learning techniques for IR: The performance of the inductive learning techniques relies strongly on the examples provided (as in any other statistical and classification techniques) (Weiss & Kulikowski, 1991). In IR, these examples may include user-provided queries and documents collected during relevance feedback. The importance of sample size has been stressed heavily, even in the probabilistic models (Fuhr & Buckley, 1991; Fuhr & Pfeifer, 1994). In reality, user-provided relevance feedback information may be limited in quantity and noisy (i.e., contradictory or incorrect), which may have adverse effects on the IR or indexing tasks. Some learning techniques, such as the neural networks approach, have documented noise-resistant capability, but empirical research is needed to verify this characteristic in the context of IR and indexing. In our preliminary investigation, all three machine learning algorithms performed satisfactorily for small document samples, but the effect of the sample size needs to be examined more carefully.

For large-scale, real-life applications, neural networks and, to some extent, genetic algorithms may suffer from extensive computation time and a lack of interpretable results. Symbolic learning, on the other hand, efficiently produces simple production rules or decision-tree representations. The effects of these representations on the cognition of searchers in real-life retrieval environments (e.g., users' acceptance of the analytical results provided by an intelligent system) remain to be determined.
• Applicability to the full-text retrieval environment: In addition to extensive IR research conducted in probabilistic models, knowledge-based systems, and machine learning, significant efforts have also been made by many commercial companies in pursuit of more effective and intelligent information retrieval systems. In an attempt to understand the potential role of machine learning in commercial full-text retrieval systems, we examined several major full-text retrieval software packages on the market, including BRS/SEARCH, BASIS/Plus, PixTex, and Topic.

Most full-text retrieval software has been designed to handle large volumes of text by indexing every word (and its position). This allows users to perform proximity search, morphological search (using prefixes, suffixes, or wildcards), and thesaurus search. BRS/SEARCH and BASIS/Plus are typical of this type of software. PixTex and Topic, on the other hand, adopt different approaches.
We believe this research has shed light on the feasibility and usefulness of the newer, AI-based machine learning algorithms for IR. However, more extensive and systematic studies of various system parameters, and for large-scale, real-life applications, are needed. We hope that by incorporating into IR inductive learning capabilities, which are complementary to the prevailing full-text, keyword-based, probabilistic, or knowledge-based techniques, we will be able to advance the design of adaptive and intelligent information retrieval systems.
Acknowledgments
References

Belew, R. K. (1989, June). Adaptive information retrieval. In Proceedings of the Twelfth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval. Cambridge, MA.
Wilkinson, R., & Hingston, P. (1991, October). Using the cosine measure in a neural network for document retrieval. In Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (pp. 202-210). Chicago, IL.
Wilkinson, R., Hingston, P., & Osborn, T. (1992). Incorporating the vector space model in a neural network used for document retrieval. Library Hi Tech, 10, 69-75.
Yang, J., & Korfhage, R. R. (1993, April). Effects of query term weights modification in document retrieval: A study based on a genetic algorithm. In Proceedings of the Second Annual Symposium on Document Analysis and Information Retrieval. Las Vegas, NV.