
ARTICLE IN PRESS

Available online at www.sciencedirect.com

Expert Systems with Applications xxx (2007) xxx–xxx
www.elsevier.com/locate/eswa

An ontological website models-supported search agent for web services


Sheng-Yuan Yang *

Department of Computer and Communication Engineering, St. John’s University, Taipei, 499, Sec. 4, TamKing Road, Tamsui,
Taipei County 25135, Taiwan, ROC

Abstract

In this paper, we advocate the use of ontology-supported website models to provide a semantic-level solution for a search engine so that it can provide fast, precise and stable search results with a high degree of user satisfaction. A website model contains a website profile along with a set of webpage profiles. The former remembers the basic information of a website, while the latter contains the basic information, statistics information, and ontology information about each webpage stored in the website. Based on this concept, we have developed a Search Agent which manifests the following interesting features: (1) Ontology-supported construction of website models, by which we can attribute correct domain semantics to the Web resources collected in the website models. One important technique used here is ontology-supported classification (OntoClassifier). Our experiments show that the OntoClassifier performs very well in obtaining accurate and stable webpage classification to support correct annotation of domain semantics. (2) Website models-supported website model expansion, by which we can collect Web resources based on both user interests and domain specificity. The core technique here is a Focused Crawler which employs progressive strategies to do user query-driven webpage expansion, autonomous website expansion, and query results exploitation to effectively expand the website models. (3) Website models-supported webpage retrieval, by which we can leverage the power of ontology features as a fast index structure to locate most-needed webpages for the user.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Ontology; Website models; Search agents; Web services

1. Introduction

In this information-exploding era, the user expects to spend a short time retrieving really useful information rather than spending plenty of time and ending up with lots of garbage information. Current general search engines, however, produce so many entries of data that they often overwhelm the user. For example, Fig. 1 shows six of over 4000 entries returned from Google, a well-known robot-based search engine, for a query that asks for assembling a cheaper and practical computer. The user usually gets frustrated after a series of visits on these entries, however, when he discovers that dead entries are everywhere, irrelevant entries are equally abundant, what he gets is not exactly what he wants, etc. It is increasingly clear that one has to properly narrow down the scope of search and to cleverly sort the sites or combine them in order to benefit much from the Internet. This process is hard for general users, however; so is it for the experts. Two major factors are behind this difficulty. First, the Web is huge; it is reported that six major public search engines (AltaVista, Excite, HotBot, Infoseek, Lycos, and Northern Light) collectively only covered about 60% of the Web and the largest coverage of a single engine was about one-third of the estimated total size of the Web (Lawrence & Giles, 2000). Empirical study also indicates that no single search engine could return more than 45% of the relevant results (Selberg & Etzioni, 1995). Second, most pre-defined index structures used by the search engines are complex and inflexible. Lawrence and Giles (1999) report that indexing of new or modified pages in one of the major search engines could take months. Even if we can follow the well-organized Web directory structure of one search engine, it is still very easy that we get lost in the complex category

* Tel.: +886 2 28013131x6394; fax: +886 28013131x6391.
E-mail address: [email protected]

0957-4174/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2007.09.024

Please cite this article in press as: Yang, S.-Y., An ontological website models-supported search agent for web services, Expert Systems
with Applications (2007), doi:10.1016/j.eswa.2007.09.024

Fig. 1. A query example.

indexing jungle (see Fig. 2 for an example). Current domain-specific search engines do help users to narrow down the search scope by the techniques of query expansion, automatic classification and focused crawling; their weakness, however, is almost completely ignoring the user interests (Wang, 2003).

In general, current search engines face two fundamental problems. First, the index structures are usually very different from what the user conjectures about his problems. Second, the classification/clustering mechanisms for data hardly reflect the physical meanings of the domain concepts. These problems stem from a more fundamental problem: lack of semantic understanding of Web documents. New standards for representing website documents, including XML (Henry, David, Murray, & Noah, 2001), RDF (Brickley & Guha, 2004), DOM (Arnaud et al., 2004), Dublin metatag (Weibel, 1999), and WOM (Manola, 1998), can help cross-reference of Web documents; they alone, however, cannot help the user at any semantic level during the searching of website information. OIL (2000), DAML (2003), DAML+OIL (2001), and the concept of ontology stand for a possible rescue to the attribution of information semantics. In this paper, we advocate the use of ontology-supported website models (Yang,

Fig. 2. A Web directory example.


2006) to provide a semantic level solution for a search engine so that it can provide fast, precise and stable search results with a high degree of user satisfaction.

Basically, a website model consists of a website profile for a website and a set of webpage profiles for the webpages contained in the website. Each webpage profile reflecting a webpage describes how the webpage is interpreted by the domain ontology, while a website profile describes how a website is interpreted by the semantics of the contained webpages. The website models are closely connected to the domain ontology, which supports the following functions used in website model construction and application: query expansion, webpage annotation, webpage/website classification (Yang, 2006), and focused collection of domain-related and user-interested Web resources (Yang, 2006). We have developed a Search Agent using website models as the core technique, which helps the agent successfully tackle the problems of search scope and user interests. Our experiments show that the Search Agent can locate, integrate and update both domain-related and user-interested Web resources in the website models for ready retrieval.

The personal computer (PC) domain is chosen as the target application of our Search Agent and will be used for explanation in the remaining sections. The rest of the paper is organized as follows. Section 2 develops the domain ontology. Section 3 describes website models and how they are constructed. Section 4 illustrates how website models can be used to do better Web search. Section 5 describes the design of our search agent and reports how it performs. Section 6 discusses related works, while Section 7 concludes the work.

2. Domain ontology as the first principles

Ontology is a method of conceptualization on a specific domain (Noy & Hafner, 1997). It plays diverse roles in developing intelligent systems, for example, knowledge sharing and reusing (Decker et al., 2000, 1998), semantic analysis of languages (Moldovan & Mihalcea, 2000), etc. Development of an ontology for a specific domain is not yet an engineering process, but it is clear that an ontology must include descriptions of explicit concepts and their relationships of a specific domain (Ashish & Knoblock, 1997). We have outlined a principle construction procedure in Yang and Ho (1999); following the procedure, we have developed an ontology for the PC domain. Fig. 3 shows part of the PC ontology taxonomy. The taxonomy represents relevant PC concepts as classes and their parent–child relationships as isa links, which allow inheritance of features from parent classes to child classes. We then carefully selected those properties of each concept that are most related to our application and used them as attributes to define the corresponding class. Fig. 4 exemplifies the definition of the ontology class "CPU". In the figure, the uppermost node uses various fields to define the semantics of the CPU class, each field representing an attribute of "CPU", e.g., interface, provider, synonym, etc. The nodes at the bottom level represent various CPU instances that capture real-world data. The arrow line with the term "io" means the instance-of relationship. Our ontology construction tool is Protégé 2000 (Noy & McGuinness, 2001) and the complete PC ontology can be referenced from the Protégé Ontology Library at the Stanford website (http://protege.stanford.edu/ontologies.html). Fig. 5 demonstrates what the ontology looks like on Protégé 2000, where the left column represents the taxonomy hierarchically and the right column contains respective attributes for a specific class node selected. The example shows that the CPU ontology contains synonyms, along with a bunch of attributes and constraints on their values. Although the domain ontology was developed in Chinese (but was changed to English here for easy explanation), corresponding English names are treated as synonyms and can be processed by our system too.

In order to facilitate Web search, the domain ontology was carefully pre-analyzed with respect to how attributes are shared among different classes and then re-organized into Fig. 6. Each square node in the figure contains a set of representative ontology features for a specific class, while each oval node contains related ontology features between two classes. The latter represents a new node type called "related concept". We select representative ontology features for a specific class by first deriving a set of

Fig. 3. Part of PC ontology taxonomy (Hardware and its subclasses, e.g., Interface Card, Power Equipment, Storage Media, Memory, and Case, connected by isa links).


Fig. 4. Ontology class CPU (attribute fields such as Synonym, Interface, L1 Cache, and Abbr., with instance nodes including XEON, THUNDERBIRD 1.33G, DURON 1.2G, CELERON 1.0G, and several PENTIUM 4 models linked by "io").

candidate terms from pre-selected training webpages belonging to the class. We then compare them with the attributes of the corresponding ontology class; those candidate terms that also appear in the ontology are singled out and dubbed as the representative ontology features of the class. The representative features are then removed from the set of candidate terms and the rest of the candidate terms are compared with the attributes of other ontology classes. For any other ontology class that contains some of these candidate terms, we add a related concept node to relate it to the class. Fig. 7 takes CPU and motherboard as two example classes and shows what their related concept node looks like. The figure shows a related concept node between two classes; in fact, we may have related concept nodes among three or more classes too. For instance, in Fig. 6, we have a related concept node that relates CPU, motherboard and SCSI Card together. Table 1 illustrates related concept nodes of different levels, where level n means the related concept node relates n classes together. Under this definition, level 1 reduces to the representative features of a specific class. Thus, the term "graphi" in level 3 means it appears in three classes: graphic card, monitor, and motherboard. This design clearly structures semantics between ontology classes and their relationships and can facilitate the building of semantics-directed website models to support Web search.

Fig. 5. PC ontology shown on Protégé 2000.
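The selection procedure just described can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the dictionary shapes are assumptions, and for simplicity it builds only pairwise related-concept nodes, although the paper also allows nodes relating three or more classes.

```python
# Illustrative sketch of representative-feature selection and
# related-concept-node construction. Data shapes are assumptions:
#   attributes: class -> set of ontology attribute terms
#   candidates: class -> set of candidate terms from training webpages

def build_feature_nodes(attributes, candidates):
    # Representative features: candidate terms that also appear
    # among the class's own ontology attributes.
    representative = {c: candidates[c] & attributes[c] for c in attributes}

    # Remaining candidates are compared with the attributes of other
    # classes; shared terms populate a pairwise related-concept node.
    related = {}
    for c in attributes:
        leftovers = candidates[c] - representative[c]
        for other, attrs in attributes.items():
            if other != c and leftovers & attrs:
                key = frozenset({c, other})
                related.setdefault(key, set()).update(leftovers & attrs)
    return representative, related
```

With toy PC-domain data, a term such as "raid" that is a Motherboard attribute but only a candidate term for CPU ends up in the related-concept node between the two classes.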

Fig. 6. Part of re-organized PC ontology (square nodes for classes such as CPU, Motherboard, Graphic Card, Sound Card, Network Card, SCSI Card, Modem, Monitor, Hard Drive, and Optical Drive under superclass PC Hardware, connected through oval related-concept nodes).


Fig. 7. Re-organized PC ontology (representative features of CPU, e.g., 3dnow, Sse, Sse2, L1, Ondie, and of Motherboard, e.g., Onboard, Atx, Amr, Cnr, plus a related concept node holding shared terms such as Socket, 478pin, and 423pin).

Table 1
Example of related concept nodes of different levels (after stemming)

Level 3: ddr, dvi, graphi, inch, Kbp, khz, raid
Level 4: Bandwidth, Microphone, Network, scsi, Plug, usb
Level 9: Channel, Connector, Extern, mhz
Level 10: Intern, Memoir, Output, Pin

3. Website model and construction

A website model contains a website profile and a set of webpage profiles. A website profile contains statistics about a website. A webpage profile contains basic information, statistics information, and ontology information about a webpage. To produce the above information for website modeling, we use DocExtractor to extract primitive webpage information as well as to perform statistics. It also transforms the original webpage into a tag-free document and turns it over to OntoAnnotator to annotate ontology information. This section gives a detailed description of what a website model looks like and how it is constructed.

3.1. Website model

Fig. 8(a) illustrates the format of a website model. The webpage profile contains three sections, namely, basic information, statistics information, and ontology information. The first two sections profile a webpage and the last annotates domain semantics to the webpage.

Each field in the basic information section is explained below. DocNo is automatically generated by the system for identifying a webpage in the structure index. Location remembers the path of the stored version of the Web page in the website model; we can use it to answer user queries. URL is the path of the webpage on the Internet, the same as the returned URL index in the user query result; it helps hyperlinks analysis. WebType identifies one of the following six Web types: com (1), net (2), edu (3), gov (4), org (5), and other (0), each encoded as an integer in the parentheses. WebNo identifies the website that contains this webpage. It is set to zero if we cannot decide what website the webpage comes from. Update_Time/Date remembers when the webpage was modified last time. The statistics information section stores statistics about HTML tag properties, e.g., #Frame for the number of frames, #Tag for the number of different tags, and various texts enclosed in tags. Specifically, we remember the texts associated with Titles, Anchors, and Headings for webpage analysis; we also record Outbound_URLs for user-oriented webpage expansion. Finally, the ontology information section remembers how the webpage is interpreted by the domain ontology. It shows that a webpage can be classified into several classes with different scores of belief according to the ontology. It also remembers the ontology features of each class that appear in the webpage along with their term frequencies (i.e., number of appearances in the webpage). Domain_Mark is used to remember whether the webpage belongs to a specific domain; it is set to "true" if the webpage belongs to the domain, and "false" otherwise. This section annotates how a webpage is related to the domain and can serve as its semantics, which helps a lot in correct retrieval of webpages.

Let us turn to the website profile. WebNo identifies a website, the same as that used in the webpage profile. Through this number, we can access those webpage profiles describing the webpages that belong to this website. Website_Title remembers the text between tags <TITLE> of the homepage of the website. Start_URL stores the starting address of the website. It may be a domain name or a directory URL under the domain address. WebType identifies one of the six Web types as those used in the webpage profile. Tree_Level_Limit remembers how the website is structured, which can keep the search agent from exploring too deeply, e.g., 5 means it explores at most 5 levels of the website structure. Update_Time/Date remembers when the website was modified last time. Fig. 8(b) illustrates an example website model. This model structure helps interpret the semantics of a website through the gathered information; it also helps fast retrieval of webpage information and autonomous Web resources search. The last point will become clearer later. Fig. 8(c) illustrates how website profiles and webpage profiles are structured.

Fig. 8. Website model format, example and structure: (a) format of a website model; (b) an example website model (the AMD homepage, http://www.amd.com/us-en/); (c) conceptual structure of a website model.

3.2. Website modeling

Website modeling involves three modules. We use DocExtractor to extract basic webpage information and perform statistics. We then use OntoAnnotator to annotate ontology information. Since the ontology information contains webpage classes, OntoAnnotator needs to call OntoClassifier to perform webpage classification. Fig. 9 illustrates the architecture of DocExtractor. DocExtractor receives a webpage from DocPool and produces information for both the basic information and statistics information sections of a webpage profile. It also transforms the webpage into a list of words (pure text) for further processing by OntoAnnotator. Specifically, DocPool contains webpages retrieved from the Web. HTML Analyzer analyzes the HTML structure to extract URL, Title texts, anchor texts and heading texts, and to calculate tag-related statistics for website models. HTML TAG Filter removes HTML tags from the webpage, deletes stop words using 500 stop words we developed from McCallum (1996), and performs word stemming and standardization. Document Parser transforms the stemmed, tag-free webpage into a list of words for further processing by OntoAnnotator.

Fig. 10 illustrates the architecture of OntoAnnotator. Inside the architecture, OntoClassifier uses the ontology to classify a webpage, and Annotator uses the ontology to annotate ontology features with their term frequencies for each class according to how often they appear in the webpage. Domain Marker uses Table 2 to determine whether the webpage is relevant to the domain. The Condition column in the table means the number of classes appearing in the webpage and the Limit column specifies a minimal threshold on the average number of features of


Fig. 9. Architecture of DocExtractor (DocPool feeds webpages to HTML Analyzer and HTML TAG Filter; Document Parser passes the stemmed, tag-free webpage in plain text to OntoAnnotator; basic and statistics information goes to the website models).

Table 2
Domain-relevance threshold for webpages

Condition (class count)   Limit
0                         None
1                         Average ≥ 3
2–6                       Average ≥ 2
7–10                      Average ≥ 1

the class which must appear in the webpage. For example, row 2 means that if a webpage contains only one class of a domain, then the features of the class appearing in the webpage must be greater than or equal to three in order for it to be considered to belong to the domain. In addition to classification of webpages, OntoClassifier is used to annotate each ontology class by generating a classification score for each ontology class. For instance, the annotation of (CPU: 0.8, Motherboard: 0.5) means the webpage belongs to class CPU with score 0.8, while it belongs to class Motherboard with score 0.5.

OntoClassifier is a two-step classifier based on the re-organized ontology structure as shown in Figs. 6 and 7, where, for ease of explanation, we have deliberately skipped the information of term frequencies. Fig. 11 brings it back, showing that each ontology feature is associated with a term frequency. The term frequency comes from the number of times each feature appears in the classified webpages.

Fig. 11. Re-drawn ontology structure (level-1 representative features with term frequencies, e.g., CPU: 3dnow (35), Sse2 (52), L1 (43), Pipeline (36), Ondie (28); Motherboard: BIOS (72), ATX (47), Onboard (41), Amr (23), Cnr (15); a level-2 related concept node with Socket, 478pin, 423pin; level threshold THW = 1).

OntoClassifier classifies a webpage in two stages. The first stage uses representative ontology features. We employ the level threshold THW to limit the number of ontology features to be involved in this stage. For example, Fig. 11 shows THW = 1, which means only the set of representative features at level 1 will be used. The basic idea of classification of the first stage is defined by Eq. (1). In the equation, OntoMatch(d, C) is defined by Eq. (2), which calculates the number of ontological features of class C that appear in webpage d, where M(w, C) returns 1 if word w of d is contained in class C. Thus, Eq. (1) returns class C for webpage d if C has the largest number of ontology features appearing in d. Note that not all classes have the same number of ontology features; we have added $\#w_{C'}$, the number of words in each class $C'$, for normalization. Also note that Eq. (1) only compares those classes with more than three ontology features appearing in d, i.e., it filters out less possible classes. As to why classes with fewer than three features appearing in d are filtered, we refer to Joachims' concept that the classification process only has to consider the terms with appearance frequency larger than three (Joachims, 1997).

$\mathrm{HOntoMatch}(d) = \arg\max_{C' \in c} \frac{\mathrm{OntoMatch}(d, C')}{\#w_{C'}}, \quad \mathrm{OntoMatch}(d, C') > 3$    (1)

$\mathrm{OntoMatch}(d, C) = \sum_{w \in d} M(w, C)$    (2)

If for any reason the first stage cannot return a class for a webpage, we move to the second stage of classification. The second stage no longer uses level thresholds but gives an ontology feature a proper weight according to which level it is associated with. That is, we modify the traditional classifiers by including a level-related weighting mechanism for the ontology classes to form our ontology-based classifier.
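A minimal sketch of this two-stage scheme follows; it is not the authors' implementation. The data shapes (each class maps its ontology features to a (level, class term frequency) pair) and the simplified second-stage weighting are assumptions.

```python
from collections import Counter

# features: class -> {feature: (level, class_term_frequency)}
# A document d is a list of stemmed words. Shapes are hypothetical.

def onto_match(d, feats):
    # Eq. (2): number of the class's ontology features appearing in d
    words = set(d)
    return sum(1 for f in feats if f in words)

def classify(d, features):
    # Stage 1, Eq. (1): normalized match over level-1 (representative)
    # features, keeping only classes with more than three features in d.
    scores = {}
    for c, feats in features.items():
        level1 = {f for f, (lvl, _) in feats.items() if lvl == 1}
        m = onto_match(d, level1)
        if m > 3:
            scores[c] = m / len(feats)  # normalize by class vocabulary size
    if scores:
        return max(scores, key=scores.get)

    # Stage 2 (simplified stand-in for the level-weighted Eqs. (3)-(4)):
    # weight each matched feature by 1/level times its term frequencies.
    tf_d = Counter(d)
    def level_weighted(c):
        return sum((1.0 / lvl) * ctf * tf_d[f]
                   for f, (lvl, ctf) in features[c].items() if f in tf_d)
    return max(features, key=level_weighted)
```

A page rich in level-1 CPU features is decided in stage 1; a page matching only shared level-2 terms such as "socket" falls through to the level-weighted second stage.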

Fig. 10. Architecture of OntoAnnotator (webpages processed by DocExtractor flow through OntoClassifier, Annotator, and Domain Marker, all consulting the ontology, to produce ontology information and the Domain_Mark for the website models).
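The Domain Marker decision encoded in Table 2 can be sketched as below; the input shape (per-class counts of ontology features found in the page) and the behaviour beyond ten classes are our assumptions, not the authors' code.

```python
# Hedged sketch of the Domain Marker rule in Table 2.
# counts: {class: number of its ontology features found in the page}

def domain_mark(counts):
    present = {c: n for c, n in counts.items() if n > 0}
    k = len(present)                 # "Condition" column: class count
    if k == 0:
        return False                 # no class appears -> not in domain
    avg = sum(present.values()) / k  # average features per present class
    if k == 1:
        return avg >= 3
    if k <= 6:
        return avg >= 2
    return avg >= 1                  # 7-10 classes (and beyond, assumed)
```

For instance, a page mentioning one class needs at least three of that class's features, while a page touching two classes needs an average of two features per class.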


This level-related weighting mechanism will give a higher weight to the representative features than to the related features. The second stage of classification is defined by Eq. (3). Inside the equation, OntoTFIDF(d, C) is defined by Eq. (4), which calculates a TFIDF score for each class C with respect to d according to the terms appearing both in d and C, where TF(x|y) means the number of appearances of word x in y. Eq. (3) is used to create a list of class:score pairs for d and finally selects the one with the highest TFIDF score as the class for webpage d.

$\mathrm{HOntoTFIDF}(d) = \arg\max_{C' \in c} \mathrm{OntoTFIDF}(d, C')$    (3)

$\mathrm{OntoTFIDF}(d, C) = \sum_{w \in d} \frac{1}{L_w} \cdot \frac{TF(w|C)}{\sum_{w' \in F_C} TF(w'|C)} \cdot \frac{TF(w|d)}{\sum_{w' \in F_C} TF(w'|d)}$    (4)

4. Website models application

The basic goal of the website models is to help Web search take account of both user interest and domain dependence. Section 4.1 explains how this can be achieved. The second goal is to help fast retrieval of webpages stored in the website models for the user. Section 4.2 explains how this is done.

4.1. Focused web crawling supported by website models

In order to effectively use the website models to narrow down the search scope, we propose a new focused crawler as shown in Fig. 12, which features a progressive crawling strategy in obtaining domain-relevant Web information. Inside the architecture, Web Crawler gathers data from the Web. DocPool was mentioned before; it stores all returned Web pages from Web Crawler for DocExtractor during the construction of webpage profiles. It also stores query results from search engines, which usually contain a list of URLs. URLExtractor is responsible for extracting URLs from the query results and dispatching those URLs that are domain-dependent but not yet in the website models to Distiller. User-Oriented Webpage Expander pinpoints interesting URLs in the website models for further webpage expansion according to the user query. Autonomous Website Evolver autonomously discovers URLs in the website models that are domain-dependent for further webpage expansion. Since these two types of URLs are both derived from website models, we call them website model URLs in the figure. User Priority Queue stores the user search strings and the website model URLs from User-Oriented Webpage Expander. Website Priority Queue stores the website model URLs from Autonomous Website Evolver and the URLs extracted by URLExtractor.

Distiller controls the Web search by associating a priority score with each URL (or search string) using Eq. (5) and placing it in a proper Priority Queue. Eq. (5) defines ULScore(U, F) as the priority score for each URL (or search string).

$\mathrm{ULScore}(U, F) = W_F \times S_F(U)$    (5)

where U represents a URL or search string; and F identifies the way U is obtained, as shown in Table 3, which also assigns to each F a weight $W_F$ and a score $S_F(U)$. Thus, if F = 1, i.e., U is a search string, then $W_1 = 3$ and $S_1(U) = 100$, which implies all search strings are treated as the top-priority requests. As for F = 3, if U is new to the website models, $S_3(U)$ is set to 1 by URLExtractor; otherwise it is set to 0.5. Finally, for F = 2, the URLs may come from User-Oriented Webpage Expander or Autonomous Website Evolver. In the former case, we follow the algorithm in Fig. 13 (to be explained in Section 4.1.1) to calculate $S_2(U)$ for each U. The assignment of

Web

Search
Engines Query
result

Focused Crawler Search


Webpages
URL strings

Web Webpage
DocExtractor OntoAnnotator
Crawler DocPool
Webpage
User Website Query result
Priority Priority
Queue Queue URLExtractor User-Oriented
URL
Webpage
Expander
Distiller
Website Model Website
User search strings URLs Autonomous Models
Website
Evolver

Fig. 12. Architecture of focused crawler.
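To make the weighting of Eq. (5) and Table 3 concrete, the sketch below implements the priority scoring and the two-queue preference in Python. The class and function names, the heapq-based queues, and the source labels are our own illustrative assumptions, not part of the original system.

```python
import heapq

# Weights W_F from Table 3; F encodes how U was obtained:
# F = 1: user search string, F = 2: website model URL, F = 3: URL from URLExtractor.
WEIGHTS = {1: 3, 2: 2, 3: 1}

def ul_score(u, f, s2=0.0, is_new=True):
    """Eq. (5): ULScore(U, F) = W_F * S_F(U)."""
    if f == 1:
        s = 100                      # S_1(U) = 100: search strings are top priority
    elif f == 2:
        s = s2                       # S_2(U) comes from the expansion strategies
    else:
        s = 1.0 if is_new else 0.5   # S_3(U) assigned by URLExtractor
    return WEIGHTS[f] * s

class Distiller:
    """Two max-priority queues; the user queue is always drained first."""
    def __init__(self):
        self.user_q, self.site_q = [], []   # heapq is a min-heap: push negated scores

    def push(self, u, f, source, **kw):
        # Search strings and User-Oriented Webpage Expander URLs feed the
        # User Priority Queue; Autonomous Website Evolver and URLExtractor
        # URLs feed the Website Priority Queue.
        q = self.user_q if source in ("search_string", "expander") else self.site_q
        heapq.heappush(q, (-ul_score(u, f, **kw), u))

    def pop(self):
        # User-oriented crawling preempts website maintenance.
        q = self.user_q or self.site_q
        return heapq.heappop(q)[1] if q else None
```

Pushing a search string and an extracted URL and then popping twice returns the search string first, mirroring the scheduling preference described below.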

Please cite this article in press as: Yang, S.-Y., An ontological website models-supported search agent for web services, Expert Systems
with Applications (2007), doi:10.1016/j.eswa.2007.09.024
ARTICLE IN PRESS

S.-Y. Yang / Expert Systems with Applications xxx (2007) xxx–xxx 9

Table 3
Basic weighting for URLs to be explored

  Input type                       F   W_F   S_F(U)
  Search strings                   1   3     S_1(U) = 100
  Website model URLs               2   2     S_2(U)
  URLs extracted by URLExtractor   3   1     S_3(U)

Note that Distiller schedules all the URLs (and search strings) in the User Priority Queue according to their priority scores before it starts to schedule the URLs in the Website Priority Queue. In addition, whenever new URLs come into the User Priority Queue, Distiller stops scheduling the URLs in the Website Priority Queue and turns to the new URLs in the User Priority Queue. This design prefers user-oriented Web resource crawling to website maintenance, since user-oriented query or webpage expansion takes into account both user interest and domain constraint, which meets our design goal better than website maintenance does.

4.1.1. User-oriented web search supported by website models

Conventional information retrieval research has mainly been based on (computer-readable) text (Salton & Buckley, 1983, 1988) to locate desired text documents using a query consisting of a number of keywords, very similar to keyword-based search engines. Retrieved documents are ranked by relevance and presented to the user for further exploration. The main issue of this query model lies in the difficulty of query formulation and the inherent word ambiguity of natural language. To overcome this problem, we propose a direct query expansion mechanism, which helps users implicitly formulate their queries. This mechanism uses the domain ontology to expand the user query. One straightforward expansion is to add synonyms of the terms contained in the user query to the same query. Synonyms can be easily retrieved from the ontology. Table 4 illustrates a simple example. More complicated expansion adds ontology classes according to their relationships with the query terms. The most used relationships follow the inheritance structure. For instance, if more than half of the sub-concepts of a concept appear in a user query, we add the concept to the query too.

Table 4
Example of direct query expansion

  Original user query:
    Mainboard CPU Socket KV133 ABIT
  Expanded user query:
    Mainboard   Motherboard
    CPU         Central process unit, processor
    Socket      Slot, connector
    KV133
    ABIT

We also propose an implicit webpage expansion mechanism oriented to the user interest to better capture the user's intention. This user-oriented webpage expansion mechanism adds webpages related to the user interest into the website models for further retrieval. Here we exploit the outbound hyperlinks of the webpages stored in the website models. To be more precise, we use the anchor texts that the webpage designer specifies for the hyperlinks, which contain the terms the designer believes are most suitable for describing the hyperlinked webpages. We can compare these anchor texts against a given user query to determine whether the hyperlinked webpages contain the terms in the query. If so, the hyperlinked webpages might interest the user and should be collected in the website models for further query processing. Fig. 13 formalizes this idea into a strategy for selecting those hyperlinks, or URLs, in which the users are strongly interested. The algorithm returns a URL list which contains hyperlinks along with their scores. The URL list is sent to the Focused Crawler (to be discussed later), which uses the scores to rank the URLs and, accordingly, fetches the hyperlinked webpages. Note that the algorithm uses R_S,D to modulate the scores, which decreases those of hyperlinks that are less related to the target domain. R_S,D specifies the degree of domain correlation of website S with respect to domain D, as defined by Eq. (6). In the equation, N_S,D refers to the number of webpages on website S belonging to domain D, and N_S stands for the number of webpages on website S. Here we need the parameter Domain_Mark in the webpage profile to determine N_S,D. In short, R_S,D measures how strongly a website is related to a domain. Thus, the user-oriented webpage expansion mechanism is also domain-oriented in nature.

  Let D be a Domain, Q be a Query, and U be a URL.
  Let URL_List be a List of URLs with Scores.
  UserTL(Q) = User terms in Q.
  AnTerm(U) = Terms in the Anchor Text of U.
  S2(U) = Score of U (cf. Table 3).

  Webpage_Expand_Strategy(Q, D)
  {
    For each Website S in the Website Model
      For each outbound link U in S
      {
        Score = AnE_Score(U, Q)
        If Score is not zero
        {
          S2(U) = R_S,D x Score
          Expanding_URL(U, Score)
        }
      }
    Return URL_List
  }

  AnE_Score(U, Q)
  {
    For each Term T in AnTerm(U)
      If UserTL(Q) contains T
        AnE_Score = AnE_Score + 1
    Return AnE_Score
  }

  Expanding_URL(U, Score)
  {
    Add U and Score to URL_List
  }

Fig. 13. User-oriented webpage expansion strategy supported by the website models.
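A runnable sketch of the direct query expansion and the Fig. 13 strategy follows. The toy synonym table and the dictionary-based website models are illustrative assumptions (the real system draws synonyms from the domain ontology and links from the webpage profiles), and, unlike the figure's pseudocode, we record the R_S,D-modulated score in the URL list.

```python
# Toy synonym table standing in for the domain ontology (cf. Table 4).
SYNONYMS = {
    "mainboard": ["motherboard"],
    "cpu": ["central process unit", "processor"],
    "socket": ["slot", "connector"],
}

def expand_query(terms):
    """Direct query expansion: add ontology synonyms of each query term."""
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t.lower(), []))
    return expanded

def ane_score(anchor_terms, user_terms):
    """AnE_Score: count anchor-text terms that also appear in the user query."""
    return sum(1 for t in anchor_terms if t in user_terms)

def webpage_expand_strategy(website_models, user_terms):
    """Fig. 13: score each outbound link by its anchor text, modulated by R_S,D.
    Each site is assumed to look like {"r_sd": float, "links": {url: [anchor terms]}}."""
    url_list = []
    for site in website_models:
        for url, anchor_terms in site["links"].items():
            score = ane_score(anchor_terms, user_terms)
            if score:
                url_list.append((url, site["r_sd"] * score))  # S2(U) = R_S,D x AnE_Score
    return url_list
```

Links whose anchor text shares no term with the query are silently skipped, so only user-relevant URLs reach the Focused Crawler.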


R_S,D = N_S,D / N_S   (6)

4.1.2. Domain-oriented web search supported by website models

Now we can discuss how Autonomous Website Evolver performs domain-dependent website expansion. It employs a four-phase progressive strategy to autonomously expand the website models. The first phase uses Eq. (7) to calculate S_2(U) for each hyperlink U which is referred to by website S and recognized, from its URL address, to be in S, but whose hyperlinked webpage is not yet collected in S.

S_2(U) = R_S,D · Σ_{C∈D} (1 − P_S,D(C)),   U ∈ S and P_S,D(C) ≠ 0   (7)

where C is a concept of domain D, R_S,D was defined in Eq. (6), and P_S,D is defined by Eq. (8). P_S,D(C) measures the proportion of concept correlation of website S with respect to concept C of domain D. N_S,C refers to the number of webpages talking about domain concept C on website S. Fig. 14 shows the algorithm for calculating N_S,C. In short, P_S,D(C) measures how strongly a website is related to a specific domain concept.

P_S,D(C) = N_S,C / N_S,D   (8)

Literally, Eq. (7) assigns a higher score to U if U belongs to a website S which has a higher degree of domain correlation R_S,D but covers fewer domain concepts, i.e., has lower P_S,D(C). Fig. 15 illustrates how this strategy works: webpages I and J become the first choices for priority score calculation by the first phase. In the figure, the upper nodes represent the webpages stored in the website models and the lower nodes represent the webpages whose URLs are hyperlinked in the upper nodes. Nodes in dotted circles, e.g., nodes I and J, represent webpages not yet stored in the website models that need to be collected. The figure shows that webpage J, referred to by webpage A in website 1, belongs to website 1 but is not yet collected into its website model. Similarly, webpage I, referred to by webpage E in website 2, has to be collected into its website model. Note that webpage I is also referred to by webpage C in website 1. In summary, the first phase prefers to expand the websites that are well profiled in the website models but have less coverage of domain concepts.

Fig. 15. Basic operation of the first phase.

The first phase is good at collecting more webpages for well-profiled websites; it cannot help with unknown websites, however. Our second phase goes a step further by searching for webpages that can help define a new website profile. In this phase, we exploit URLs that are in the website models but belong to some unknown website profile. We use Eq. (9) to calculate S_2(U) for each outbound hyperlink U of the webpages stored in an indefinite website profile.

S_2(U) = Anchor(U, D)   (9)

where the function Anchor(U,D) gives outbound link U a weight according to how many terms in the anchor text of U belong to domain D.

Fig. 16 illustrates how this strategy works: phase 2 will choose hyperlink H for priority score calculation. The unknown website X represents a website profile which contains insufficient webpages to make its relationships to some domains clear. Thus, the second phase prefers to expand those webpages that can help bring in more information to complete the specification of indefinite website profiles.

Fig. 16. Basic operation of the second phase.

In the third phase, we relax one more constraint: the condition of unknown website profiles. We exploit any URLs as long as they are referred to by some webpages in the website models. We use Eq. (10) to calculate S_2(U) for each outbound hyperlink U which is referred to by any webpage in the website models. This equation relies heavily on the anchor texts to determine which URLs should receive higher priority scores.

  Let C be a Concept, S be a Website, and D be a Domain.
  Let THW be a threshold number for the website models.
  OntoPages(S) = The webpages on website S belonging to D.
  OntoCon(THW, P) = The top THW domain concepts of webpage P.

  Calculating_NS,C(S, C)
  {
    For each webpage P in OntoPages(S)
    {
      If OntoCon(THW, P) contains C
        NS,C = NS,C + 1
    }
    Return NS,C
  }

Fig. 14. Algorithm for calculating NS,C.
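Assuming each webpage profile carries a Domain_Mark flag and an ordered list of its top domain concepts (both assumptions about the profile layout), Eqs. (6)-(8) and the Fig. 14 count reduce to a few lines:

```python
def n_sc(pages, concept, thw=3):
    """Fig. 14: pages on a site whose top THW concepts contain the concept."""
    return sum(1 for p in pages if concept in p["concepts"][:thw])

def r_sd(pages):
    """Eq. (6): R_S,D = N_S,D / N_S, the fraction of a site's pages
    marked as belonging to domain D."""
    in_domain = sum(1 for p in pages if p["domain_mark"])
    return in_domain / len(pages) if pages else 0.0

def p_sd(pages, concept, thw=3):
    """Eq. (8): P_S,D(C) = N_S,C / N_S,D."""
    n_sd = sum(1 for p in pages if p["domain_mark"])
    return n_sc(pages, concept, thw) / n_sd if n_sd else 0.0

def s2_phase1(pages, domain_concepts, thw=3):
    """Eq. (7): favor well-profiled sites with poor coverage of domain concepts."""
    return r_sd(pages) * sum(
        1 - p_sd(pages, c, thw)
        for c in domain_concepts
        if p_sd(pages, c, thw) != 0        # Eq. (7) requires P_S,D(C) != 0
    )
```

A site fully covering every domain concept thus scores zero in phase 1, which matches the intent of collecting pages only where the profile is still weak.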



S_2(U) = Σ_{C∈D} Anchor(U, C) · R_S,D · P_S,D(C)   (10)

Fig. 17 illustrates how the strategy works: it will select all nodes in dotted circles for priority score calculation. In short, the third phase tends to collect every webpage that is referred to by the webpages in the website models.

Fig. 17. Basic operation of the third phase.

In the last phase, we resort to general website information to refresh and expand website profiles. This phase is invoked periodically according to the Update_Time/Date stored in the website profiles and webpage profiles. Specifically, we refer to the analysis of the refresh cycles of different types of websites conducted in Cho and Garcia-Molina (2000) and define a weight for each web type as shown in Table 5. This phase then uses Eq. (11) to assign an S_2(U) to each U which belongs to a specific website type T.

S_2(U) = R_S,D · W_T,   U ∈ T   (11)

Table 5
Weight table for different types of websites

  Web type, T   W_T
  com           0.62
  net/org       0.32
  edu           0.08
  gov           0.07

Fig. 18 summarizes how this four-phase progressive website expansion strategy works: whenever the Priority Queue runs empty, Distiller invokes phases 1, 2 and 3 in turn to add URLs to it, while the system periodically invokes phase 4.

Fig. 18. Progressive website expansion strategy.

4.2. Webpage retrieval from website models

Webpage retrieval concerns the way of providing the most-needed documents for users. Traditional ranking methods employ an inverted full-text index database along with a ranking algorithm to calculate the ranking sequence of relevant documents. The problems with this method are clear: too many entries in the returned results and too slow a response time. A simplified approach emerged, which employs various ad hoc mechanisms to reduce the query space (Salton & McGill, 1983, 1988). Two major problems lie behind these mechanisms: (1) they need a specific, labor-intensive and time-consuming pre-process; and (2) they cannot respond to changes in the real environment in time, due to the off-line pre-process. Another method, PageRank, proposed by Page, Brin, Motwani, and Winograd (1999), was employed in Google to rank webpages by their link information. Google spends lots of offline time pre-analyzing the link relationships among a huge number of webpages and calculating proper ranking scores for them before storing them in a special database for answering user queries. Google's high speed of response stems from a huge local webpage database along with a time-consuming, offline, detailed link structure analysis.

Instead, our ranking method takes advantage of the semantics in the website models. The major index structure uses ontology features to index the webpages in the website models. The ontology index contains terms that are stored in the webpage profiles. The second index structure is a partial full-text inverted index, since it contains no ontology features. Fig. 19 shows this two-layered index structure. Since we require each query to contain at least one ontology feature, we can always use the ontology index to locate a set of webpages. The partial full-text index is then used to further reduce them to a subset of webpages for the users.

This design of separating the ontology indices from a traditional full-text index is interesting, since we then know which ontology features are contained in a user query.

Fig. 19. Index structures in the website models: an ontology index (an inverted index of ontology feature terms) and a partial full-text index (an inverted index of the remaining terms), both mapping terms to the document numbers of webpages in the website models.
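A minimal sketch of the two-layered lookup in Fig. 19, with plain dictionaries standing in for the two inverted indices:

```python
def retrieve(ontology_index, fulltext_index, onto_terms, user_terms):
    """Two-layered retrieval: the ontology index locates a candidate set,
    then the partial full-text index narrows it down."""
    if not onto_terms:
        raise ValueError("each query must contain at least one ontology feature")
    # Layer 1: union of postings for the query's ontology features.
    candidates = set()
    for t in onto_terms:
        candidates |= ontology_index.get(t, set())
    # Layer 2: keep only candidates that also match the plain user terms.
    for t in user_terms:
        candidates &= fulltext_index.get(t, set())
    return candidates
```

Because the first layer touches only ontology feature terms, the candidate set is small before the full-text filter runs, which is where the speed advantage over a single full-text index comes from.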


Based on this information, we can apply OntoClassifier to analyze which ontology classes (equivalently, domain concepts) the user is really interested in, and use that information to locate user-interested webpages quickly. Let us explain how this is done in our system. First, we use the second stage of OntoClassifier along with a threshold, say TH_U, to limit the best classes a query is associated with. For example, if we set TH_U to three, we select the best three ontology classes from a query and use them as indices to locate user-interested webpages quickly.

As a matter of fact, we can leverage the identified ontology features in a user query to rank the webpages properly for the user, using the ranking method defined by Eq. (12). In the first term of the equation, M_QU(P) is the number of user terms appearing in webpage P, which can be obtained from Fig. 19, and W_QU is its weighting value. P_S,D(T) is defined by Eq. (13), which measures, for each term T in the user-term part of query Q (i.e., QU(Q)), the ratio of the number of webpages that contain T (i.e., N_S,T) to the total number of webpages related to D (i.e., N_S,D) on website S. Multiplying these factors together represents how strongly the to-be-retrieved webpages are oriented toward the user terms. The second term of Eq. (12) does a similar analysis on the ontology features appearing in the user query. Basically, W_QO is a weighting value for the ontology-term part of query Q, and M_QO(P) is the number of ontology features appearing in webpage P, which can also be obtained from Fig. 19. As to the factor P_S,D(T), we have a slightly different treatment here. It is used to calculate the ratio of the number of webpages containing ontology feature T, but we restrict T to appear only in the top TH_U concepts, as we have set a threshold number of domain concepts for each user query. We thus need to add a factor P_TH(Q,P) to reflect the fact that we also apply a threshold number of domain concepts, TH_W, for each webpage (see Section 3.2). P_TH(Q,P) is defined by Eq. (14), measuring the ratio of the number of domain concepts that appear both in the top TH_U concepts of query Q and in the top TH_W concepts of webpage P (i.e., M_TH(Q,P)) to the number of domain concepts that appear in the top TH_U concepts of Q (i.e., M_TH(Q)). This second term thus represents how strongly the to-be-retrieved webpages are related to user-interested domain concepts.

RA(Q, P) = W_QU · M_QU(P) · Σ_{T∈QU(Q)} P_S,D(T)
         + W_QO · M_QO(P) · P_TH(Q, P) · Σ_{T∈Onto(TH_U, Q)} P_S,D(T)   (12)

P_S,D(T) = N_S,T / N_S,D   (13)

P_TH(Q, P) = M_TH(Q, P) / M_TH(Q)   (14)

W_QU + W_QO = 1   (15)

Note that the two weighting factors are correlated, as defined by Eq. (15). The user is allowed to change the ratio between them, as illustrated in Fig. 20, to reflect his emphasis on either user terms or ontology features in retrieving webpages.

Fig. 20. Adjustment of user terms vs. ontology features in webpage retrieval.

5. System evaluation

5.1. Architecture of the search agent

Fig. 21 shows the architecture of our Search Agent, which integrates the design concepts and techniques discussed in the previous sections. To recap, Focused Crawler is responsible for gathering webpages into DocPool according to user interests and website model weakness. Model Constructor extracts important information from a webpage stored in DocPool and annotates proper ontology information to make a webpage profile for it. It also constructs a website profile for each website in due time according to what webpages it contains. Webpage Retrieval uses the ontology features in a given user query to quickly locate and rank a set of most-needed webpages in the website models and displays them to the user. Finally, the User Interface receives a user query, expands the query using the ontology, and sends it to Webpage Retrieval, which in turn returns a list of ranked webpages. Fig. 22 illustrates the various ways in which the user can enter a query. Fig. 22(a) shows the traditional keyword-based model along with the help of ontology features shown in the left column. The user can directly choose ontology terms into the input field.

Fig. 21. Ontology-centric search agent architecture.
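Looking back at the ranking method of Eqs. (12)-(15), a sketch might read as follows. The dictionary layouts for the query, page profile, and site statistics are our own assumptions, and M_TH(Q,P) is approximated by the overlap between the query's top TH_U concepts and the page's recorded concepts.

```python
def ra_score(page, query, site, w_qu=0.5, thu=3):
    """Sketch of Eq. (12). `page` holds its indexed terms and top TH_W concepts;
    `site` holds the N_S,T counts and N_S,D; W_QU + W_QO = 1 (Eq. (15))."""
    w_qo = 1.0 - w_qu                                     # Eq. (15)
    user_terms = query["user_terms"]                      # QU(Q)
    onto_terms = query["onto_terms"][:thu]                # Onto(TH_U, Q)

    p_sd = lambda t: site["n_st"].get(t, 0) / site["n_sd"]         # Eq. (13)
    m_qu = sum(1 for t in user_terms if t in page["terms"])        # M_QU(P)
    m_qo = sum(1 for t in onto_terms if t in page["concepts"])     # M_QO(P)
    m_th_qp = sum(1 for t in onto_terms if t in page["concepts"])  # M_TH(Q,P) approx.
    p_th = m_th_qp / len(onto_terms) if onto_terms else 0.0        # Eq. (14)

    return (w_qu * m_qu * sum(p_sd(t) for t in user_terms)
            + w_qo * m_qo * p_th * sum(p_sd(t) for t in onto_terms))
```

Raising w_qu shifts the ranking toward plain user terms, lowering it toward ontology features, as the Fig. 20 slider describes.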


Fig. 22(b) shows that the user can use natural language to input his query too. The user interface employs a template matching technique to select the best-matched query templates (Chiu, 2003; Yang, 2007), as shown in Fig. 22(c), for the users to confirm. Note that either type of query is finally transformed into a list of keywords internally by User Interface for subsequent query expansion.

Fig. 23 illustrates a retrieval result for a user query returned from Webpage Retrieval, which contains a ranked list of webpages. This retrieval method emphasizes the "precision" criterion. As a matter of fact, Webpage Retrieval can structure the retrieval result in another way, to emphasize the "recall" factor, by retrieving and ranking the websites that are best suited to answer the user query. Fig. 24 exemplifies such a returned result, composed of a ranked list of websites along with their contained webpages. Either type of ranked list is turned over to the User Interface for personalization before being displayed to the user.

5.2. Experiment environment

Our Search Agent is developed using Borland JBuilder 7.0 on Windows XP. We collected in total ten classes, with 100 webpages in each class, from hardware-related websites, as shown in Table 6.

Table 6
Experimental data

  Class           Webpage count   Website count
  CPU             100             7
  Motherboard     100             4
  Graphic card    100             3
  Sound card      100             13
  Network card    100             5
  SCSI card       100             7
  Optical drive   100             5
  Monitor         100             4
  Hard drive      100             5
  Modem           100             7

5.3. Performance evaluation of OntoClassifier

The first experiment is to learn how well OntoClassifier works. We applied the feature selection program described in the ontology reorganization to all collected webpages to select ontology features for each class. Table 7 shows the number of features for each class.

Table 7
Number of features in each training class

  Class           # of features
  CPU             93
  Motherboard     96
  Graphic card    90
  Sound card      92
  Network card    67
  SCSI card       55
  Optical drive   88
  Monitor         79
  Hard drive      57
  Modem           66

To avoid unexpected delays, we limit the level of related concepts to 7 during the second-stage classification of OntoClassifier. Fig. 25 shows its performance for each class. Several interesting points deserve notice here. First, with a very small number of ontology features, OntoClassifier can produce very accurate classification results in virtually all classes. Even with 10 features, over 80% classification accuracy can be obtained in all classes. Second, the classification accuracy of OntoClassifier is very stable with respect to the number of ontology features.

Fig. 22. User query through the user interface.

  http://www.amd.com    AMD - Advanced Micro Device, INC.   (CPU, 0.9)   0.95 ...
  http://www.intel.com  Welcome to Intel                    (CPU, 0.9)   0.9  ...
  http://www.idt.com    Welcome to IDT                      (CPU, 0.7)   0.8  ...
  http://www.cyrix.com  VIA Technologies, INC.              (CPU, 0.7)   0.7  ...
  .....

Fig. 23. Example of webpage-oriented retrieval from website models.

  AMD website model   => CPU class         => http://www.amd.com; (CPU, 0.9); 0.95; ...
                      => Motherboard class => .....
  Intel website model => CPU class         => http://www.intel.com; (CPU, 0.9); 0.9; ...
  IDT website model   => CPU class         => http://www.idt.com; (CPU, 0.7); 0.8; ...
  Cyrix website model => CPU class         => http://www.cyrix.com; (CPU, 0.7); 0.7; ...
  .....

Fig. 24. Example of website-oriented retrieval from website models.
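The website-oriented view of Fig. 24 can be derived from the webpage-oriented ranking of Fig. 23 by a simple regrouping; the tuple layout (url, website, class, score) is an illustrative assumption:

```python
from collections import defaultdict

def website_oriented(ranked_pages):
    """Group a webpage-oriented ranking (Fig. 23) into a website-oriented
    result (Fig. 24): webpages nested under their website, and websites
    ordered by the score of their best page."""
    sites = defaultdict(list)
    for url, site, cls, score in ranked_pages:
        sites[site].append((cls, url, score))
    return sorted(
        ((site, sorted(pages, key=lambda p: -p[2])) for site, pages in sites.items()),
        key=lambda s: -s[1][0][2],   # order sites by their top page's score
    )
```

The same ranked list thus serves both presentation modes, so no second retrieval pass is needed to switch between precision- and recall-oriented views.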


Fig. 25. Classification performance of OntoClassifier (classification accuracy for each of the ten classes vs. the number of features, 10-100).

In Wang (2003), we have reported a performance comparison between OntoClassifier and three other similar classifiers, namely O-PrTFIDF (Joachims, 1997), T-PrTFIDF (Ting, 2000), and D-PrTFIDF (Wang, 2003). All three classifiers and their respective feature selection methods were re-implemented. We found that none of these three classifiers can match the performance of OntoClassifier with respect to either classification accuracy or classification stability.

To verify that the superior performance of OntoClassifier is not due to overfitting, we used 1/3, 1/2, and 2/3 of the collected webpages, respectively, for training in each class to select the ontology features, and used all webpages of the class for testing. Table 8 shows how OntoClassifier behaves with respect to different ratios of training samples. The number-of-features columns give the number of ontology features used in each class. The table shows that the superior accuracy can be obtained even with only 1/3 of the webpages used for training.

Table 8
Classification performance of OntoClassifier under different ratios of training samples

  Class           1/3 Training data         1/2 Training data         2/3 Training data
                  # of Features  Accuracy   # of Features  Accuracy   # of Features  Accuracy
  CPU             69             97%        78             100%       82             100%
  Motherboard     81             100%       89             100%       89             100%
  Graphic card    61             100%       73             100%       77             100%
  Sound card      73             98%        73             99%        89             99%
  Network card    53             94%        60             98%        64             100%
  SCSI card       38             93%        48             98%        50             98%
  Optical drive   73             90%        82             94%        87             94%
  Monitor         69             100%       74             100%       75             100%
  Hard drive      39             99%        44             98%        50             99%
  Modem           64             100%       66             100%       66             100%

5.4. Do ontology features work for any classifiers?

The second experiment is to learn whether the superior performance of OntoClassifier is purely due to the ontology features. In other words, can the ontology features work for other classifiers too? For this purpose, this experiment uses the same set of ontology features derived for OntoClassifier to test the performance of O-PrTFIDF, D-PrTFIDF and T-PrTFIDF, the three classifiers mentioned before. This time we only used 1/3 of the collected webpages for training the ontology features and used all webpages for testing. To make each classifier work best, we allow each classifier to arbitrarily choose the first 40 features that make it work best. We limit the number to 40 because the class SCSI only has 38 features. Fig. 26 illustrates how the four classifiers work for the classes CPU, Motherboard, Graphic Card, and Sound Card.

We note that the O-PrTFIDF and D-PrTFIDF classifiers are the most unstable among the four with respect to different numbers of features. The T-PrTFIDF classifier works rather well, except for larger numbers of features, because T-PrTFIDF was designed to work based on ontology (Ting, 2000). Its computational complexity is greater than OntoClassifier's, though. From this experiment, we learn that ontology features alone do not work for just any classifier; the ontology features work best for those classifiers that are crafted by taking into account how to leverage the power of ontology. OntoClassifier is such a classification mechanism.

5.5. User-satisfaction evaluation of system prototype

Table 9 shows the comparison of user satisfaction of our system prototype against other search engines. In the table, ST, for satisfaction of testers, represents the average of satisfaction responses from 10 ordinary users, while SE, for satisfaction of experts, represents the average of satisfaction responses from 10 experts. Basically, each search engine receives 100 queries and returns the first 100 webpages for evaluation of satisfaction by both experts and non-experts. The table shows that our system prototype with the techniques described above (the last row) enjoys the highest satisfaction in all classes. From the evaluation, we conclude that, unless the comparing search engines are specifically tailored to this domain, such as HotBot and Excite, our system prototype in general retrieves more correct webpages in almost all classes.

6. Related works

We notice that ontology is mostly used in systems that work on information gathering or classification to improve their gathering processes or the search results


CPU from disparate resources (Eichmann, 1998). For instance,


100% MELISA (Abasolo & Gómez, 2000) is an ontology-based
80% information retrieval agent with three levels of abstraction,
D-PrTFIDF
separated ontologies and query models, and definitions of
Accuracy

60% O-PrTFIDF
some aggregation operators for combining results from dif-
T-PrTFIDF
40%
ferent queries. WebSifter II is a semantic taxonomy-based,
OntoClassifier
20% personalizable meta-search agent (Kerschberg, Kim, &
0%
Scime, 2001) that tries to capture the semantics of a user’s
10 20 30 40 decision-oriented search intent, to transform the semantic
# of Features query into target queries for existing search engines, and
(a) CPU class to rank the resulting page hits according to a user-specified
weighted-rating scheme. Chen and Soo (2001) describes an
Motherboard
100%
ontology-based information gathering agent which utilizes
the domain ontology and corresponding support (e.g., pro-
80%
D-PrTFIDF cedure attachments, parsers, wrappers and integration
Accuracy

60% O-PrTFIDF rules) to gather the information related to users’ queries


40% T-PrTFIDF from disparate information resources in order to provide
20%
OntoClassifier
much more coherent results for the users. Park and Zhang
0%
(2003) describe a novel method for webpage classification
10 20 30 40 that is based on a sequential learning of the classifiers
# of Features
which are trained on a small number of labeled data and
(b) Motherboard class
then augmented by a large number of unlabeled data.
Sound Card Wang, Yu, and Nishino (2004) propose a new website
100% information detection system based on Webpage type clas-
sification for searching information in a particular domain.
Accuracy

80%
D-PrTFIDF
60% O-PrTFIDF
SALEM (Semantic Annotation for LEgal Management)
T-PrTFIDF
(Bartolini, Lenci, Montemagni, Pirrelli, & Soria, 2004) is
40%
OntoClassifier an incremental system developed for automated semantic
20%
annotation of (Italian) law texts to effective indexing and
0% retrieval of legal documents. Chan and Lam (2005) propose
10 20 30 40
# of Features an approach for facilitating the functional annotation to
(c) Sound Card class the Gene ontology by focusing on a subtask of annotation,
that is, to determine which of the Gene ontology a litera-
Graphic Card ture is associated with. Swoogle (Ding et al., 2004) is a
100% crawler-based system that discovers, retrieves, analyzes
80% D-PrTFIDF
and indexes knowledge encoded in semantic web docu-
Accuracy

60% O-PrTFIDF
ments on the Web, which can use either character N-Gram
T-PrTFIDF or URIrefs as keywords to find relevant documents and to
40%
OntoClassifier compute the similarity among a set of documents. Finally,
20%
Song, Lim, Park, Kang, and Lee (2005) suggest an auto-
0%
10 20 30 40
mated method for document classification using an ontol-
# of Features ogy, which expresses terminology information and
(d) Graphic Card class vocabulary contained in Web documents by way of a hier-
archical structure. In this paper, we not only proposed
Fig. 26. Do ontology features work for any classifiers?
ontology-directed classification mechanism, namely, Onto-

Table 9
User satisfaction evaluation
K_Word method CPU (SE/ST) Motherboard (SE/ST) Memory (SE/ST) Average (SE/ST)
Yahoo 67%/ 61% 77%/ 78% 38%/ 17% 61%/ 52%
Lycos 64%/ 67% 77%/ 76% 36%/ 20% 59%/ 54%
InfoSeek 69%/ 70% 71%/ 70% 49%/ 28% 63%/ 56%
HotBot 69%/ 63% 78%/ 76% 62%/ 31% 70%/ 57%
Google 66%/ 64% 81%/ 80% 38%/ 21% 62%/ 55%
Excite 66%/ 62% 81%/ 81% 50%/ 24% 66%/ 56%
Alta Vista 63%/ 61% 77%/ 78% 30%/ 21% 57%/ 53%
Our prototype 78%/ 69% 84%/ 78% 45%/ 32% 69%/ 60%

Please cite this article in press as: Yang, S.-Y., An ontological website models-supported search agent for web services, Expert Systems
with Applications (2007), doi:10.1016/j.eswa.2007.09.024
ARTICLE IN PRESS

16 S.-Y. Yang / Expert Systems with Applications xxx (2007) xxx–xxx

OntoClassifier can make a decision on the class of a webpage or a website in the semantic decision process for Web services. We also advocated the use of ontology-supported website models to provide a semantic-level solution for a search agent so that it can provide fast, precise and stable search results.

As to Web search, current general search engines use the concept of crawlers (spiders or soft-robots) to help users automatically retrieve useful Web information in terms of ad-hoc mechanisms. For example, Dominos (Hafri & Djeraba, 2004) can crawl several thousands of pages every second, includes a high-performance fault manager, is platform independent, and can adapt transparently to a wide range of configurations without incurring additional hardware expenditure. Ganesh, Jayaraj, Kalyan, and Aghila (2004) propose the association-metric to estimate the semantic content of a URL based on the domain-dependent ontology, which in turn strengthens the metric that is used for prioritizing the URL queue. UbiCrawler (Boldi, Codenotti, Samtini, & Vigna, 2004), a scalable fully distributed web crawler, features platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function for partitioning the domain to crawl, and, more generally, the complete decentralization of every task. Chan (2008) proposes an intelligent spider that consists of a URL searching agent and an auction data agent to automatically collect related information by crawling over 1000 deals from Taiwan's eBay whenever users input the searched product. Finally, Google adopts the PageRank approach to rank a large amount of webpage link information and pre-records it for solving the problem (Brin & Page, 1998). A general Web crawler is, in general, a greedy tool that may make the URL list too large to handle. A focused crawler, instead, aims at locating domain knowledge and necessary meta-information for assisting the system to find related Web targets. The concept of a Distiller is employed to rank URLs for the Web search (Barfourosh, Nezhad, Anderson, & Perlis, 2002; Diligenti, Coetzee, Lawrence, Giles, & Gori, 2000; Rennie & McCallum, 1999). IBM is an example, which adopts the HITS algorithm, similar to PageRank, for controlling web search (Kleinberg, 1999). These methods are ad-hoc and need an off-line, time-consuming pre-processing. In our system, we not only develop a focused crawler using website models as the core technique, which helps search agents successfully tackle the problems of search scope and user interests, but also introduce a four-phase progressive website expansion strategy for the focused crawler to control the Web search, which takes into account both user interests and domain specificity.

7. Conclusions and discussion

We have described how ontology-supported website models can effectively support Web search. A website model contains webpage profiles, each recording basic information, statistics information, and ontology information of a webpage. The ontology information is an annotation of how the webpage is interpreted by the domain ontology. The website model also contains a website profile that remembers how a website is related to the webpages and how it is interpreted by the domain ontology.

We have developed a Search Agent, which employs the domain ontology-supported website models as the core technology to search for Web resources that are both user-interested and domain-oriented. Our preliminary experimentation demonstrates that the system prototype can retrieve more correct webpages with higher user satisfaction. The Agent features the following interesting characteristics. (1) Ontology-supported construction of website models. By this, we can attribute domain semantics to the Web resources collected and stored in the local database. One important technique used here is the ontology-supported OntoClassifier, which can do very accurate and stable classification of webpages to support more correct annotation of domain semantics. Our experiments show that OntoClassifier performs very well in obtaining accurate and stable webpage classification. (2) Website models-supported website model expansion. By this, we can take into account both user interests and domain specificity. The core technique here is the Focused Crawler, which employs progressive strategies to do user query-driven webpage expansion, autonomous website expansion, and query results exploitation to effectively expand the website models. (3) Website models-supported webpage retrieval. We leverage the power of ontology features as a fast index structure to locate the most-wanted webpages for the user. (4) We mentioned that the User Interface works as a query expansion and answer personalization mechanism for the Search Agent. As a matter of fact, the module has been expanded into a User Interface Agent in our information integration system (Yang, 2006; Yang, 2006). The User Interface Agent can interact with the user in a more semantics-oriented way according to his proficiency degree about the domain (Yang, 2007; Yang, 2007).

Most of our current experiments are on the performance test of OntoClassifier. We are as yet unable to do experiments on, or comparisons of, how good the Search Agent is at expanding useful Web resources. Our difficulties are summarized below. (1) To our knowledge, none of the current Web search systems adopt a similar approach to ours, in the sense that none of them rely on ontology as heavily as our system does to support Web search. It is thus rather hard for us to do a fair and convincing comparison. (2) Our ontology construction is based on a set of pre-collected webpages on a specific domain; it is hard to evaluate how critical this pre-collection process is to the nature of different domains. We are planning to employ the technique of automatic ontology evolution, for example, cooperating with data mining technology for discovering useful information and generating desired knowledge that supports ontology construction (Wang, Lu, & Zhang, 2007), to help study the robustness of our ontology. (3) In general, a domain ontology-based technique cannot work as a general-purpose search engine.
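The focused crawler discussed in this section ranks frontier URLs by domain relevance instead of expanding the URL list greedily. A minimal sketch of that idea follows; the `score_page` heuristic and the toy `web` graph are assumptions for illustration, and the paper's four-phase progressive expansion strategy is not reproduced here.

```python
import heapq

# Illustrative sketch of a focused crawler's frontier: URLs are prioritized
# by an ontology-based relevance score rather than crawled breadth-first.
# DOMAIN_TERMS, score_page, and the toy "web" are demonstration assumptions.

DOMAIN_TERMS = {"cpu", "motherboard", "memory", "chipset"}

def score_page(text):
    """Count how many domain ontology terms the page text mentions."""
    return len(set(text.lower().split()) & DOMAIN_TERMS)

def focused_crawl(seed, web, max_pages=3):
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score_page(web[seed]["text"]), seed)]
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in web[url]["links"]:
            if link not in visited and link in web:
                heapq.heappush(frontier, (-score_page(web[link]["text"]), link))
    return order

web = {
    "a": {"text": "cpu motherboard memory", "links": ["b", "c"]},
    "b": {"text": "cooking recipes", "links": []},
    "c": {"text": "chipset cpu overclocking", "links": ["b"]},
}
print(focused_crawl("a", web))  # -> ['a', 'c', 'b']
```

The on-topic page "c" is visited before the off-topic page "b" even though both were discovered at the same time, which is the behavior that keeps the URL list manageable.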


We are planning to create a general-purpose search engine by employing multiple copies of our Search Agent, supported by a set of domain ontologies, through a multi-agent architecture.

Acknowledgements

The author would like to thank Jr-Chiang Liou, Yung Ting, Jr-Well Wu, Yu-Ming Chung, Zing-Tung Chou, Ying-Hao Chiu, Ben-Chin Liao, Yi-Ching Chu, Shu-Ting Chang, Yai-Hui Chang, Chung-Min Wang, and Fang-Chen Chuang for their assistance in system implementation. This work was supported by the National Science Council, ROC, under Grants NSC-89-2213-E-011-059, NSC-89-2218-E-011-014, and NSC-95-2221-E-129-019.

References

Abasolo, J. M., & Gómez, M. (2000). MELISA: An ontology-based agent for information retrieval in medicine. Available at http://citeseer.nj.nec.com/442210.html.
Al-Halami, R., & Berwick, R. (1998). WordNet: An electronic lexical database. ISBN 0-262-06197-X.
Arnaud, L. H., Philippe, L. H., Lauren, W., Gavin, N., Jonathan, R., Mike, C., & Steve, B. (2004). Document object model (DOM) level 3 core specification. Available at http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/.
Ashish, N., & Knoblock, C. A. (1997). Wrapper generation for semi-structured Internet sources. ACM SIGMOD Record, 26(4), 8–15.
Barfourosh, A. A., Nezhad, H. M., Anderson, M. L., & Perlis, D. (2002). Information retrieval on the world wide web and active logic: A survey and problem definition. Technical Report CS-TR-4291, Department of Computer Science, University of Maryland, Maryland, USA.
Bartolini, R., Lenci, A., Montemagni, S., Pirrelli, V., & Soria, C. (2004). Automatic classification and analysis of provisions in Italian legal texts: A case study. In Proceedings of the 2nd workshop on regulatory ontologies (pp. 593–604). Larnaca, Cyprus.
Boldi, P., Codenotti, B., Samtini, M., & Vigna, S. (2004). UbiCrawler: A scalable fully distributed web crawler. Software: Practice and Experience, 34(8), 711–726.
Brickley, D., & Guha, R. V. (2004). RDF vocabulary description language 1.0: RDF schema. Available at http://www.w3.org/TR/2004/REC-rdf-schema-20040210/.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th international world wide web conference (pp. 107–117). Brisbane, Australia.
Chan, C. C. (2008). Intelligent spider for information retrieval to support mining-based price prediction for online auctioning. Expert Systems with Applications: An International Journal, 34(1), 347–356.
Chan, K., & Lam, W. (2005). Gene ontology classification of biomedical literatures using context association. In Proceedings of the 2nd Asia information retrieval symposium (pp. 552–557). Jeju Island, Korea.
Chen, Y. J., & Soo, V. W. (2001). Ontology-based information gathering agents. In Proceedings of the 2001 international conference on web intelligence (pp. 423–427). Maebashi TERRSA, Japan.
Chiu, Y. H. (2003). An interface agent with ontology-supported user models. Master thesis, Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
Cho, J., & Garcia-Molina, H. (2000). The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th international conference on very large databases (pp. 200–209). Cairo, Egypt.
DAML. (2003). Available at http://www.daml.org/about.html.
DAML+OIL. (2001). Available at http://www.daml.org/2001/03/daml+oil-index/.
Decker, S., Melnik, S., van Harmelen, F., Fensel, D., Klein, M., Broekstra, J., et al. (2000). The semantic web: The roles of XML and RDF. IEEE Internet Computing, 4(5), 63–74.
Diligenti, M., Coetzee, F., Lawrence, S., Giles, C., & Gori, M. (2000). Focused crawling using context graphs. In Proceedings of the 26th international conference on very large databases (pp. 527–534). Cairo, Egypt.
Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R. S., Peng, Y., Reddivari, P., Doshi, V., & Sachs, J. (2004). Swoogle: A search and metadata engine for the semantic web. In Proceedings of the 13th ACM international conference on information and knowledge management (pp. 652–659). Washington, DC, USA.
Eichmann, D. (1998). Automated categorization of web resources. Available at http://www.iastate.edu/~CYBERSTACKS/Aristotle.htm.
Ganesh, S., Jayaraj, M., Kalyan, V., & Aghila, G. (2004). Ontology-based web crawler. In Proceedings of the international conference on information technology: Coding and computing (pp. 337–341). Las Vegas, NV, USA.
Hafri, Y., & Djeraba, C. (2004). Dominos: A new web crawler's design. In Proceedings of the 4th international web archiving workshop. Bath, UK.
Henry, S. T., David, B., Murray, M., & Noah, M. (2001). XML Base. Available at http://www.w3.org/TR/2001/REC-xmlbase-20010627/.
Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the 14th international conference on machine learning (pp. 143–151). Nashville, Tennessee, USA.
Kerschberg, L., Kim, W., & Scime, A. (2001). WebSifter II: A personalizable meta-search agent based on weighted semantic taxonomy tree. In Proceedings of the international conference on internet computing (pp. 14–20). Las Vegas, USA.
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.
Lawrence, S., & Giles, C. L. (1999). Accessibility and distribution of information on the web. Nature, 400, 107–109.
Lawrence, S., & Giles, C. L. (2000). Accessibility of information on the web. ACM Intelligence: New Visions of AI in Practice, 11(1), 32–39.
Manola, F. (1998). Towards a web object model. Available at http://op3.oceanpark.com/papers/wom.html.
McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available at http://www.cs.cmu.edu/~mccallum/bow.
Moldovan, D. I., & Mihalcea, R. (2000). Using WordNet and lexical operators to improve internet searches. IEEE Internet Computing, 4(1), 34–43.
Noy, N. F., & Hafner, C. D. (1997). The state of the art in ontology design. AI Magazine, 18(3), 53–74.
Noy, N. F., & McGuinness, D. L. (2001). Ontology development 101: A guide to creating your first ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880.
OIL. (2000). Available at http://www.ontoknowledge.org/oil/downl/oil-whitepaper.pdf.
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford Digital Libraries Working Paper SIDL-WP-1999-0120, Department of Computer Science, Stanford University, CA, USA.
Park, S. B., & Zhang, B. T. (2003). Automatic webpage classification enhanced by unlabeled data. In Proceedings of the 4th international conference on intelligent data engineering and automated learning (pp. 821–825). Hong Kong, China.
Rennie, J., & McCallum, A. (1999). Using reinforcement learning to spider the web efficiently. In Proceedings of the 16th international conference on machine learning (pp. 335–343). Bled, Slovenia.
Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.


Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. McGraw-Hill.
Selberg, E., & Etzioni, O. (1995). Multi-service search and comparison using the MetaCrawler. In Proceedings of the 4th international world wide web conference (pp. 169–173). Boston, USA.
Song, M. H., Lim, S. Y., Park, S. B., Kang, D. J., & Lee, S. J. (2005). An automatic approach to classify web documents using a domain ontology. In Proceedings of the first international conference on pattern recognition and machine intelligence (pp. 666–671). Kolkata, India.
Ting, Y. (2000). A search agent with website models. Master thesis, Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
Wang, C., Lu, J., & Zhang, G. Q. (2007). Mining key information of web pages: A method and its applications. Expert Systems with Applications: An International Journal, 33(2), 425–433.
Wang, C. M. (2003). Web search with ontology-supported technology. Master thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
Wang, Z. L., Yu, H., & Nishino, F. (2004). Automatic special type website detection based on webpage type classification. In Proceedings of the first international workshop on web engineering. Santa Cruz, USA.
Weibel, S. (1999). The state of the Dublin Core metadata initiative. D-Lib Magazine, 5(4).
Yang, S. Y., & Ho, C. S. (1999). Ontology-supported user models for interface agents. In Proceedings of the 4th conference on artificial intelligence and applications (pp. 248–253). Chang-Hua, Taiwan.
Yang, S. Y. (2006). An ontology-directed webpage classifier for web services. In Proceedings of joint 3rd international conference on soft computing and intelligent systems and 7th international symposium on advanced intelligent systems (pp. 720–724). Tokyo, Japan.
Yang, S. Y. (2006). A website model-supported focused crawler for search agents. In Proceedings of the 9th joint conference on information sciences (pp. 755–758). Kaohsiung, Taiwan.
Yang, S. Y. (2006). An ontology-supported website model for web search agents. In Proceedings of the 2006 international computer symposium (pp. 874–879). Taipei, Taiwan.
Yang, S. Y. (2006). How does ontology help information management processing. WSEAS Transactions on Computers, 5(9), 1843–1850.
Yang, S. Y. (2006). An ontology-supported information management agent with solution integration and proxy. In Proceedings of the 10th WSEAS international conference on computers (pp. 974–979). Athens, Greece.
Yang, S. Y. (2007). An ontology-supported user modeling technique with query templates for interface agents. In Proceedings of 2007 WSEAS international conference on computer engineering and applications (pp. 556–561). Gold Coast, Queensland, Australia.
Yang, S. Y. (2007). How does ontology help user query processing for FAQ services. WSEAS Transactions on Information Science and Applications, 4(5), 1121–1128.


Expert Systems with Applications xxx (2008) xxx–xxx

Contents lists available at ScienceDirect

Expert Systems with Applications


journal homepage: www.elsevier.com/locate/eswa

Developing of an ontological interface agent with template-based linguistic processing technique for FAQ services

Sheng-Yuan Yang *

Department of Computer and Communication Engineering, St. John's University, 499, Sec. 4, TamKing Road, Tamsui, Taipei County 251, Taiwan, ROC

a r t i c l e  i n f o

Available online xxxx

Keywords:
Ontological interface agents
Template-based query processing
FAQ services

a b s t r a c t

This paper proposes an ontological Interface Agent which works as an assistant between users and FAQ systems. We integrated several interesting techniques, including domain ontology, user modeling, and template-based linguistic processing, to effectively tackle the problems associated with traditional FAQ retrieval systems. Specifically, we address the following issues. Firstly, how can an interface agent learn a user's specialty in order to build a proper user model for him/her? Secondly, how can domain ontology help in establishing user models, analyzing user queries, and assisting and guiding interface usage? Finally, how can the intention and focus of a user be correctly extracted? Our work features a template-based linguistic processing technique for developing ontological interface agents; a natural language query mode, along with an improved keyword-based query mode; and assistance and guidance for human–machine interaction. Our preliminary experimentation demonstrates that the intention and focus of up to eighty percent of user queries can be correctly understood by the system, which accordingly provides query solutions with higher user satisfaction.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

With the increasing popularity of the Internet, people depend more on the Web to obtain their information, just like a huge knowledge treasury waiting for exploration. People used to be puzzled by such problems as "how to search for information in the Web treasury?" As the techniques of Information Retrieval (Salton, Wong, & Yang, 1975; Salton & McGill, 1983) matured, a variety of information retrieval systems were developed, e.g., search engines, Web portals, etc., to help search the Web. How to search is no longer a problem. The problem now comes from the results of these information retrieval systems, which contain so much information that they overwhelm the users. Now, people want the information retrieval systems to help more, for instance, by only retrieving the results which better meet the user's requirements. Other wanted capabilities include a better interface for the user to express his/her true intention, better-personalized services, and so on. In short, how to improve traditional information retrieval systems to provide search results which can better meet the user requirements, so as to reduce his/her cognitive loading, is an important issue in current research (Chiu, 2003).

The websites which provide Frequently Asked Questions (FAQs) organize user questions and expert answers about a specific product or discipline in terms of question–answer pairs on the Web. Each FAQ is represented by one question along with one answer, and is characterized as domain-dependent, short and explicit, and frequently asked (Lee, 2000; OuYang, 2000). People usually go through the list of FAQs and read those FAQs that are related to their questions. This way of answering the user's questions can save the labor of experts from answering similar questions repeatedly. The problem is that, after the fast accumulation of FAQs, it becomes harder for people to single out related FAQs. Traditional FAQ retrieval systems, however, provide only little help, because they fail to provide assistance and guidance for human–machine interaction, personalized information services, flexible interaction interfaces, etc. (Chiu, 2003).

In order to capture the user's true intention and accordingly provide high-quality FAQ answers to meet the user's requests, we have proposed an Interface Agent that acquires user intention through an adaptive human–machine interaction interface with the help of ontology-directed and template-based user models (Yang et al., 1999; Yang, Chiu, & Ho, 2004; Yang, 2006). It also handles user feedback on the suitability of proposed responses. The agent features an ontology-based representation of domain knowledge, a flexible interaction interface, and personalized information filtering and display. Specifically, according to the user's behavior and mental state, we employed the technique of user modeling to construct a user model that describes his/her characteristics, preferences, knowledge proficiency level, etc. We also used the technique of user stereotypes (Rich, 1979) to construct and initialize a new user model, which helps provide fast personalized services for new users.

* Corresponding author. Tel.: +886 2 28013131x6394; fax: +886 2 28013131x6391.
E-mail address: [email protected].

0957-4174/$ - see front matter © 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2008.03.011

Please cite this article in press as: Yang, S. -Y. , Developing of an ontological interface agent with template-based linguistic ..., Expert Sys-
tems with Applications (2008), doi:10.1016/j.eswa.2008.03.011

2 S.-Y. Yang / Expert Systems with Applications xxx (2008) xxx–xxx

We built a domain ontology (Noy & McGuinness, 2001) to help define domain vocabulary and knowledge, and based on that we construct user models and the Interface Agent. We extended the concept of pattern matching (Hovy, Hermjakob, & Ravichandran, 2002) to query templates to construct natural language-based query models. This idea leads to the extraction of the user's true intention and focus from his/her query posed in natural language, which helps the agent quickly find precise information for the user, much as Coyle and Smyth (2005) do in using interaction histories to enhance search results. Our preliminary experimentation demonstrates that the intention and focus of up to eighty percent of the users' queries can be correctly understood, and the system accordingly provides query solutions with higher user satisfaction.

The rest of the paper is organized as follows: Section 2 describes the fundamental techniques. Section 3 explains the Interface Agent architecture. Section 4 reports the system demonstrations and evaluations. Section 5 compares the work with related works, while Section 6 concludes the work. The Personal Computer (PC) domain is chosen as the target application of our Interface Agent and will be used for explanation in the remaining sections.

2. Fundamental techniques

2.1. Domain ontology

The concept of ontology in artificial intelligence refers to knowledge representation for domain-specific contents (Chandrasekaran, Josephson, & Benjamins, 1999). It has been advocated as an important tool to support knowledge sharing and reuse in developing intelligent systems. Although the development of an ontology for a specific domain is not yet an engineering process, we have outlined a procedure for this in Yang et al. (1999), derived from how the process was conducted in existent systems. By following the procedure, we developed an ontology for the PC domain using Protégé 2000 (Noy & McGuinness, 2001) as the fundamental background knowledge for the system; it was originally developed in Chinese (Yang, Chuang, & Ho, 2007) but is rendered in English here for easy explanation. Fig. 1 shows part of the ontology taxonomy. The taxonomy represents relevant PC concepts as classes and their parent–child relationships as isa links, which allow inheritance of features from parent classes to child classes. We carefully selected those properties that are most related to our application from each concept, and defined them as the detailed ontology for the corresponding class. Fig. 2 exemplifies the detailed ontology of the concept of CPU. In the figure, the root node uses various fields to define the semantics of the CPU class, each field representing an attribute of "CPU", e.g., interface, provider, synonym, etc. The nodes at the lower level represent various CPU instances, which capture real-world data. The arrow line with the term "io" means the instance-of relationship. The complete PC ontology can be referenced from the Protégé Ontology Library at the Stanford website (http://protege.stanford.edu/download/download.html). We also developed a problem ontology to deal with query questions. Fig. 3 illustrates part of the Problem ontology, which contains query type and operation type. Together they imply the semantics of a question. Finally, we use Protégé's APIs to develop a set of ontology services, which provide primitive functions to support the application of the ontologies. The ontology services currently available include transforming query terms into canonical ontology terms, finding definitions of specific terms in the ontology, finding relationships among terms, finding compatible and/or conflicting terms against a specific term, etc.
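The ontology services listed above can be sketched over a toy taxonomy. The dictionary layout and function names below are assumptions for illustration only and do not reflect Protégé's actual API.

```python
# A toy sketch of the ontology services described above, over a tiny PC
# taxonomy. The data layout and function names are illustrative
# assumptions, not the system's actual Protégé-based implementation.

ONTOLOGY = {
    "CPU": {"isa": "Hardware", "synonyms": ["central processing unit", "processor"],
            "definition": "The chip that executes instructions."},
    "Memory": {"isa": "Hardware", "synonyms": ["ram"],
               "definition": "Volatile working storage."},
    "Hardware": {"isa": None, "synonyms": [], "definition": "Physical PC components."},
}

def canonicalize(term):
    """Map a query term (possibly a synonym) to its canonical ontology term."""
    t = term.lower()
    for name, node in ONTOLOGY.items():
        if t == name.lower() or t in node["synonyms"]:
            return name
    return None

def definition_of(term):
    """Look up the definition of a term after canonicalizing it."""
    name = canonicalize(term)
    return ONTOLOGY[name]["definition"] if name else None

def related(a, b):
    """True if two terms share an ancestor via isa links."""
    def ancestors(name):
        seen = set()
        while name:
            seen.add(name)
            name = ONTOLOGY[name]["isa"]
        return seen
    ca, cb = canonicalize(a), canonicalize(b)
    return bool(ca and cb and ancestors(ca) & ancestors(cb))

print(canonicalize("processor"))   # -> CPU
print(related("ram", "processor")) # -> True
```

Canonicalization through synonym lists is what lets the agent treat "processor" and "CPU" as the same concept before any template matching happens.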

[Fig. 1. Part of PC ontology taxonomy. The Hardware root class is connected by isa links to subclasses such as Interface Card, Power Equipment, Storage Media, Memory, and Case; these in turn have subclasses including Network Chip, Sound Card, Display Card, SCSI Card, Network Card, Power Supply, UPS, Main Memory, ROM, Optical (with CD, DVD, CDR/W, and CDR below it), and ZIP.]

[Fig. 2. Ontology of the concept of CPU. The root node defines the CPU class through fields such as Synonym (= Central Processing Unit), D-Frequency, Interface, L1 Cache, and Abbr.; "io" (instance-of) links connect it to instance nodes such as XEON, THUNDERBIRD 1.33G, DURON 1.2G, PENTIUM4 2.0AGHZ, PENTIUM 4 1.8AGHZ, CELERON 1.0G, and PENTIUM 4 2.53AGHZ, each carrying concrete attribute values (e.g., Factory, Interface, L1 Cache, Clock).]
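The class/instance structure of Fig. 2 can be modeled as plain records: a class node with attribute fields, and instance nodes linked to it by "io". The dictionary layout is an illustrative assumption, and the attribute values are abridged from the figure's jumbled extraction, so treat them as placeholders rather than authoritative data.

```python
# Sketch of the Fig. 2 structure: a class node with attribute fields and
# instance ("io") nodes carrying concrete values. The layout is an
# assumption and the attribute values are illustrative placeholders.

cpu_class = {"name": "CPU", "synonym": "Central Processing Unit",
             "fields": ["D-Frequency", "Interface", "L1 Cache", "Abbr."]}

cpu_instances = {
    "CELERON 1.0G": {"Interface": "Socket 370", "L1 Cache": "32KB",
                     "Abbr.": "Celeron", "Factory": "Intel"},
    "DURON 1.2G": {"Interface": "Socket A", "Abbr.": "Duron",
                   "Factory": "AMD"},
}

def instances_by(attr, value):
    """Return the CPU instances whose attribute matches the given value."""
    return [name for name, attrs in cpu_instances.items()
            if attrs.get(attr) == value]

print(instances_by("Factory", "AMD"))  # -> ['DURON 1.2G']
```

Queries like this over instance attributes are the kind of primitive the ontology services expose to the rest of the agent.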


[Fig. 3. Part of problem ontology taxonomy. The Query root class has isa links to Operation Type and Query Type; Operation Type instances include Adjust, Use, Setup, Close, Open, Support, and Provide, while Query Type instances include How, What, Why, and Where.]

2.2. Query templates

To build the query templates, we collected in total 1215 FAQs from the FAQ websites of six famous motherboard factories in Taiwan and used them as the reference materials for query template construction (Hovy et al., 2002; Soubbotin et al., 2001). Currently, we only take care of user queries with one intention word and at most three sentences. These FAQs were analyzed and categorized into six types of questions, as shown in Table 1. For each type of question, we further identified several intention types according to its operations. Table 2 illustrates some examples of intention types. Finally, we define a query pattern for each intention type. Table 3 illustrates the defined query patterns for the intention types of Table 2. Table 4 explains the syntactical constructs of the query patterns.

Now all the information for constructing a query template is ready, and we can formally define a query template. Table 5 defines what a query template is. It contains a template number, number of sentences, intention words, intention type, question type, operation type, query patterns, and focus. Table 6 illustrates an example query template for the ANA_CAN_SUPPORT intention type. Note here that we collect similar query patterns in the field of "Query patterns," which are used in the detailed analysis of a given query.

Table 1
Question types

Question type | Intention
(A-NOT-A) | Asks about can or cannot, should or should not, have or have not
(HOW) | Asks about solving methods
(WHAT) | Enumerates related information
(WHEN) | Asks about time, year, or date
(WHERE) | Asks about place or position
(WHY) | Asks about reasons

Table 2
Examples of intention types

Intention type | Description
ANA_CAN_SUPPORT | Asks if some specifications or products are supported
HOW_SET | Asks the method of assignment
WHAT_IS | Asks the meaning of terminology
WHEN_SUPPORT | Asks when something can be supported
WHERE_DOWNLOAD | Asks where something can be downloaded
WHY_SETUP | Asks reasons about setup

Table 3
Examples of query patterns
[table body rendered as an image in the source; not recoverable]

Table 4
Definition of pattern symbols and descriptions

Symbol | Description
⟨⟩ | Means a single sentence; considers the sequence of intention words and keywords
[] | Means at least one sentence; considers only the keywords that appear, regardless of sequence
Si | Means the variable part of a template, which is any string consisting of keywords
Intention Word | Means the fixed part of a template, which can help the system distinguish between user intentions
Keyword | Means the concepts in the domain ontology, which are usually domain terminologies
Focus | Means the variable part of a template that is the key point of the user query
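A minimal sketch of how such a query template might drive intention extraction: an intention word selects the intention type, and the domain keywords around it become the focus. Since the bodies of Tables 3 and 6 are not recoverable from the source, the templates and keyword set below are illustrative assumptions.

```python
# Minimal sketch of template-based intention extraction: each template
# pairs an intention word with an intention type; domain keywords in the
# query become the focus. Templates and keywords are assumptions, since
# the actual pattern tables are not recoverable here.

TEMPLATES = [
    {"intention_word": "support", "intention_type": "ANA_CAN_SUPPORT",
     "question_type": "A-NOT-A"},
    {"intention_word": "download", "intention_type": "WHERE_DOWNLOAD",
     "question_type": "WHERE"},
]
DOMAIN_KEYWORDS = {"motherboard", "cpu", "ddr", "memory", "driver"}

def analyze(query):
    """Return (intention_type, focus keywords) for a one-sentence query."""
    words = query.lower().replace("?", "").split()
    for t in TEMPLATES:
        if t["intention_word"] in words:
            focus = [w for w in words if w in DOMAIN_KEYWORDS]
            return t["intention_type"], focus
    return "UNKNOWN", []

print(analyze("Can this motherboard support DDR memory?"))
# -> ('ANA_CAN_SUPPORT', ['motherboard', 'ddr', 'memory'])
```

Once the intention type is identified, the intention type hierarchy can be used to narrow the set of FAQs searched, as the text describes.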

Table 5
Query template specification

Field            Description
Template_Number  Template ID
#Sentence        The number of sentences: 1 for one sentence, 2 for two
                 sentences, and 3 for three or more sentences
Intention_Word   The intention words that must appear
Intention_Type   Intention type
Question_Type    Question type
Operation_Type   Operation type
Query_Patterns   The semantic patterns, each consisting of intention words
                 and keywords
Focus            The focus of the user

Table 6
Query template for the ANA_CAN_SUPPORT intention type

According to the generalization relationships among intention types, we can form a hierarchy of intention types to organize all FAQs. Currently, the hierarchy contains two levels, as shown in Fig. 4. The system can employ the intention type hierarchy to reduce the search scope during the retrieval of FAQs once the intention of a user query has been identified. Table 7 shows the statistics of the query templates in our system. Currently, we have in total 154 intention words, which form an intention word base.

Fig. 4. Intention type hierarchy.

Table 7
Statistics of query templates

Question type  #Intention type  #Template  #Pattern  #FAQ (%)
(if)           19               53         113       385 (31.7)
(how)          18               44         121       265 (21.8)
(what)         6                18         19        91 (7)
(when)         1                1          3         3 (0.2)
(where)        3                4          4         4 (0.3)
(why)          25               199        771       467 (38.4)
Total          69               319        1031      1215

3. Interface agent architecture

3.1. User modeling

In order to provide customized services, we observe, record, and learn the user's behavior and mental state in a user model. A user model contains the interaction preference, solution presentation, domain proficiency, terminology table, query history, selection history, and user feedback, as shown in Fig. 5. The interaction preference records the user's preferred interface, e.g., favorite query mode and favorite recommendation mode. When the user logs on to the system, the system can select a proper user interface according to this preference. We provide two query modes, through either keywords or natural language input, and three recommendation modes, based on hit rates, hot topics, or collaborative learning. We record the user's recent preferences in a time window and accordingly determine the next interaction style. The solution presentation records the user's solution ranking preferences. We provide two types of ranking: by the degree of similarity between the proposed solutions and the user query, or by the user's proficiency with the solutions. In addition, we use a Show_Rate parameter (described later) to control how many solution items are displayed each time, in order to reduce the information-overloading problem. The domain proficiency factor describes how familiar the user is with the domain. By associating a proficiency degree with each ontology concept, we can construct a table containing a set of <concept, proficiency-degree> pairs as his/her domain proficiency. Thus, when deciding on the solution presentation, we can calculate the user's proficiency degree on the solutions using the table, and accordingly show only the most familiar part of the solutions while hiding the rest for advanced requests. To cope with the different terminologies used by different users, we include a terminology table to record these terminology differences. We can use the table to replace the terms used in the proposed solutions with the user's favorite terms during solution presentation, helping him/her better comprehend the solutions. Finally, we record the user's query history, FAQ selection history, and corresponding feedback in each query session in the interaction history, in order to support collaborative recommendation. The user feedback is a complicated factor. We remember both explicit user feedback in the selection history and implicit user feedback, which includes the query time, the time of each FAQ click, the sequence of FAQ clicks, the sequence of clicked hyperlinks, etc.

Fig. 5. Our user model (components: interaction preference, solution presentation, domain proficiency, terminology table, interaction history, and explicit and implicit feedback).

In order to quickly build an initial user model for a new user, we pre-defined five stereotypes, namely, expert, senior, junior, novice, and amateur (Yang et al., 1999, 2004), to represent the characteristics of different user groups. This approach is based on the idea that users in the same group tend to exhibit the same behavior and require the same information. Fig. 6 illustrates an example user stereotype. When a new user enters the system, he/she is asked to complete a questionnaire, which is used by the system to determine his/her domain proficiency and accordingly select a user stereotype to generate an initial user model for him/her. However, the initial user model constructed from the stereotype may be too generic or imprecise. It will be refined to reflect the specific user's real intent after the system has gained experience with his/her query history, FAQ-selection history and feedback, and implicit feedback (Chiu, 2003).

Fig. 6. Example of the expert stereotype.

3.2. Architecture overview

Fig. 7 illustrates the architecture of the Interface Agent and shows how it interacts with the other agents of the FAQ-master (Yang, 2006, 2007), which possesses intelligent retrieval, filtering, and integration capabilities and can provide high-quality FAQ answers from the Web. The Interaction Agent provides a personalized interaction, assistance, and recommendation interface for the user according to his/her user model, records interaction information and related feedback in the user model, and helps the User Model Manager and the Proxy Agent (Yang, 2007) to update the user model. The Query Parser processes user queries by first segmenting words, removing conflicting words, and standardizing terms, followed by recording the user's terminologies in the terminology table of the user model. It finally applies the template matching technique to select the best-matched query templates, and accordingly transforms the query into an internal query for the Proxy Agent, which searches for solutions and collects them into a list of FAQs, each containing a corresponding URL. The Web Page Processor pre-downloads FAQ-relevant webpages and performs some pre-processing tasks, including labeling keywords for subsequent processing. The Scorer calculates the user's proficiency degree for each FAQ in the FAQ list according to the domain proficiency table in his/her user model. The Personalizer then produces personalized query solutions according to the terminology table. The User Model Manager is responsible for quickly building an initial user model for a new user using the technique of user stereotyping, as well as for updating the user models and stereotypes to dynamically reflect changes in user behavior. The Recommender is responsible for recommending information to the user based on hit counts, hot topics, or the group's interests when a similar interaction history is detected.

Fig. 7. Interface agent architecture (the Interaction Agent, Query Parser, Web Page Processor, Scorer, Personalizer, User Model Manager, and Recommender, supported by the template base, homophone debug base, user model, and ontology; the agent exchanges internal queries, solutions, and webpage links with the Search, Answerer, and Proxy Agents over the Internet).
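The user model described in Section 3.1 can be summarized as a simple record. The field names and defaults below are illustrative assumptions, not the paper's schema; they only mirror the components of Fig. 5.

```python
from dataclasses import dataclass, field

@dataclass
class UserModel:
    """Sketch of the user model of Section 3.1 (illustrative names)."""
    # Interaction preference, tracked over a recent time window.
    query_mode: str = "keyword"           # "keyword" or "nl"
    recommendation_mode: str = "hit"      # "hit" / "hot_topic" / "collaborative"
    # Solution presentation.
    ranking: str = "similarity"           # "similarity" or "proficiency"
    show_rate: float = 0.5                # fraction of solutions displayed
    # Domain proficiency: <concept, proficiency-degree> pairs.
    proficiency: dict = field(default_factory=dict)
    # Terminology table: user's term -> canonical ontology term.
    terminology: dict = field(default_factory=dict)
    # Interaction history: queries, FAQ selections, and feedback.
    query_history: list = field(default_factory=list)
    selection_history: list = field(default_factory=list)

# A stereotype can seed the initial model for a new user.
EXPERT_STEREOTYPE = {"CPU": 0.9, "motherboard": 0.8}
new_user = UserModel(proficiency=dict(EXPERT_STEREOTYPE))
```

Instantiating the model from a copied stereotype table mirrors the paper's stereotyping step: the record starts generic and is later refined from the user's own histories and feedback.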


3.2.1. Interaction Agent

The Interaction Agent consists of the following three components: the Adapter, the Observer, and the Assistant. First, the Adapter constructs the best interaction interface according to the user's favorite query and recommendation modes. It is also responsible for presenting to the user the list of FAQ solutions (from the Personalizer) or recommendation information (from the Recommender). During solution presentation, it arranges the solutions in the user's preferred style (query similarity or solution proficiency) and displays them according to the Show_Rate. Second, the Observer passes the user query to the Query Parser, and simultaneously collects interaction information and related feedback from the user. The interaction information contains the user's preferred query mode, recommendation mode, solution presentation mode, and FAQ clicking behavior, while the related feedback contains the user's satisfaction degree and comprehension degree for each FAQ solution. The User Model Manager needs both the interaction information and the related feedback to properly update the user models and stereotypes. The satisfaction degree in the related feedback can also be passed to the Proxy Agent for tuning the solution search mechanism (Yang, 2006).

Finally, the Assistant provides proper assistance and guidance to help the user query process. First, the ontology concepts are structured and presented as a tree so that users who are not familiar with the domain can consult the tree and learn proper terms to enter their queries. We also rank all ontology concepts by their probabilities and display them in a keyword list. When the user enters a query in the input area, the Assistant automatically "scrolls" the content of the keyword list to those terms related to the input keywords. Fig. 8 illustrates an example of this automatic keyword scrolling mechanism. If the displayed terms of the list contain a concept that the user wants to enter, he can double-click the term into the input area, e.g., "" (ASUS) at step 2 of Fig. 8. In addition to the keyword-oriented query mode, the Assistant also provides lists of question types and operation types to support question type-oriented or operation type-oriented search. The user can use one, two, or all three of these mechanisms to help form his/her query in order to convey his/her intention to the system.

Fig. 8. Examples of the automatic keyword scrolling mechanism.

3.2.2. Query Parser

The Query Parser pre-processes the user query by performing Chinese word segmentation, correction of the word segmentation, fault correction of homophonous or multiple words, and term standardization. It then employs template-based pattern matching to analyze the user query and extract the user intention and focus. Finally, it transforms the user query into the internal query format and passes the query to the Proxy Agent for retrieving proper solutions (Yang, 2006). A detailed explanation follows.

Fig. 9 illustrates the two ways in which the user can enter a Chinese query through the Interface Agent. Fig. 9a shows the traditional keyword-based method, enhanced by the ontology features illustrated in the left column. The user can directly click on the ontology terms to select them into the input field. Fig. 9b shows the user using natural language to input his/her query. In this case, the Interface Agent first employs the Query Parser to do Chinese word segmentation using MMSEG (Tsai, 2000), correction of the word segmentation, fault correction of homophonous or multiple words, and term standardization. It then applies the template-based pattern matching technique to analyze the user query, extract the user intention and focus, select the best-matched query templates as shown in Fig. 9c (Yang et al., 2004), and trim any irrelevant keywords in accord with the templates. Finally, it transforms the user query into the internal query format and passes the query to the Proxy Agent for retrieving proper solutions. Fig. 10 shows the flow chart of the user query processing. We decided to use template matching for query processing because it is an easy and efficient way to handle pseudo-natural language processing, and the result is usually acceptable on a focused domain.

Fig. 9. User query through our Interface Agent.

Given a user query in Chinese, we segment the query using MMSEG. The results of segmentation alone were not good, for the pre-defined MMSEG word corpus contains insufficient terms of the PC domain. For example, it does not contain the keywords "" or "AGP4X", and returns wrong word segmentations like "", "", "AGP", and "4X". The step of query pruning can easily fix this by using the ontology as a second word corpus to bring those mis-segmented words back. It also performs fault correction of homophonous or multiple words using the ontology and the homophone debug base (Chiu, 2003).

The step of query standardization is responsible for replacing the terms used in the user query with the canonical terms in the ontology and intention word base. The original terms and the corresponding canonical terms are then stored in the terminology table for solution presentation personalization. Finally, we label the recognized keywords with the symbol "K" and the intention words with the symbol "I". The rest are regarded as stop words and removed from the query. Now, if the user is using the keyword mode, we jump directly to the step of query formulation. Otherwise, we use template-based pattern matching to analyze the natural language input.

The step of pattern matching is responsible for identifying the semantic pattern associated with the user query. Using the pre-constructed query templates in the template base, we can compare the user query with the query templates and select the best-matched one to identify the user intention and focus. Fig. 11 shows the algorithm for quickly selecting possibly matched templates, Fig. 12 describes the algorithm that finds all patterns matched with the user query, and Fig. 13 removes those matched patterns that are generalizations of some other matched patterns.

Fig. 10. Flow chart of the user query processing (from the user query through 1. segmentation, 2. query pruning, 3. query standardization, 4. pattern matching in NLP mode, 5. user confirmation, and 6. query formulation, to the internal query format passed to the Proxy Agent; the steps are supported by the homophone debug base, ontology, user model, intention word base, and template base).

Fig. 11. Query template selection algorithm.

Fig. 12. Pattern match algorithm.

Fig. 13. Query pattern removal algorithm.

Now we explain how template matching works with the following user query: "Could an Asus K7V motherboard support a CPU over 1 GHz?" The Interface Agent first applies MMSEG to obtain the following list of keywords from the user query: <could, support, 1 GHz, K7V, motherboard, Asus, CPU>. The "could" query type in the intention type hierarchy is then followed to retrieve the corresponding query templates. Table 6 illustrates the only corresponding query template, which contains two patterns, namely, <could S1 support S2> and <could support S1>. We find that the second pattern matches the user query and can be selected to transform the query into an internal form by query formulation (step 6), as shown in Table 8a. Note that more than two patterns may become candidates for a given query. In that case, the Query Parser prompts the user to confirm his/her intent (step 5), as illustrated in Fig. 9c. If the user says "No", meaning the pattern matching result is not the true intention of the user, he/she is allowed to modify the matched result or change to the keyword mode to place the query.

Table 8
Internal user query and keyword trimming

(a) Internal query form before keyword trimming
Query type      Could
Operation type  Support
Keywords        1 GHz, K7V, motherboard, Asus, CPU

(b) Internal query form after keyword trimming
Query type      Could
Operation type  Support
Keywords        1 GHz, K7V, CPU

The purpose of keyword trimming is to remove irrelevant keywords from the user query; irrelevant keywords sometimes have adverse effects on FAQ retrieval. The Query Parser uses trimming rules, as shown in Table 9, to prune these keywords. For example, in Table 8a, "motherboard" is trimmed and replaced with "K7V", since the latter is an instance of the former and can subsume the former according to Trimming Rule 2. Table 8b shows the result of the user query after keyword trimming, which now contains only three keywords, namely, 1 GHz, K7V, and CPU.

Table 9
Examples of trimming rules

Rule no.  Rule description                                     Example
1         A super-class can be replaced with its sub-class     "Interface card" => "Sound card"
2         A class can be replaced with its instance            "CPU" => "PIII"
3         A slot value referring to some instance of a class   "Microsoft" => "Windows 2000"
          can be replaced with the instance
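The pruning, labeling, and trimming steps just described can be sketched end to end. The ONTOLOGY, INTENTION_WORDS, and INSTANCE_OF tables below are toy stand-ins for the paper's ontology, intention word base, and trimming rules, and only Trimming Rule 2 is encoded; everything here is illustrative, not the actual implementation.

```python
ONTOLOGY = {"AGP4X", "CPU", "K7V", "motherboard", "1GHz"}  # toy keyword set
INTENTION_WORDS = {"could", "support"}                      # toy intention words
INSTANCE_OF = {"K7V": "motherboard", "PIII": "CPU"}         # instance -> class

def prune(tokens):
    """Step 2: merge adjacent mis-segmented tokens (e.g. 'AGP' + '4X')
    back into ontology terms, using the ontology as a second corpus."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in ONTOLOGY:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def label(tokens):
    """Step 3 (tail end): tag keywords 'K' and intention words 'I';
    anything else is treated as a stop word and dropped."""
    return [(t, "K") if t in ONTOLOGY else (t, "I")
            for t in tokens if t in ONTOLOGY or t in INTENTION_WORDS]

def trim(keywords):
    """Trimming Rule 2: drop a class term subsumed by a co-occurring
    instance ('motherboard' is subsumed by 'K7V')."""
    subsumed = {INSTANCE_OF[k] for k in keywords if k in INSTANCE_OF}
    return [k for k in keywords if k not in subsumed]

tokens = label(prune(["could", "K7V", "motherboard", "support", "CPU", "1GHz"]))
keywords = trim([t for t, tag in tokens if tag == "K"])
internal = {"query_type": "could", "operation_type": "support",
            "keywords": keywords}
print(internal["keywords"])   # → ['K7V', 'CPU', '1GHz']
```

In the paper's example, "Asus" is also trimmed (Table 8b) by a further rule; the toy ontology omits it for brevity.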


3.2.3. Web Page Processor

The Web Page Processor receives from the Proxy Agent a list of retrieved solutions, containing one or more FAQs matched with the user query, each represented as in Table 10, and retrieves and caches the solution webpages according to their FAQ_URLs. It then pre-processes those webpages for the subsequent customization process, performing URL transformation, keyword standardization, and keyword marking. The URL transformation changes all hyperlinks to point toward the cache server. The keyword standardization transforms all terms in the webpage content into ontology vocabularies. The keyword marking labels the keywords appearing in the webpages in boldface, <B>Keyword</B>, to facilitate subsequent keyword processing and webpage readability.

Table 10
Format of a retrieved FAQ

Field           Description
FAQ_No.         FAQ's identification
FAQ_Question    Question part of the FAQ
FAQ_Answer      Answer part of the FAQ
FAQ_Similarity  Similarity degree of the FAQ with respect to the user query
FAQ_URL         Source or related URL of the FAQ

3.2.4. Scorer

Each FAQ is a short document, so the concepts involved in an FAQ are in general more focused. In other words, the topic (or concept) is much clearer and more professional. The question part of an FAQ is even more pointed about what concepts are involved. Knowing this property, we can use the keywords appearing in the question part of an FAQ to represent its topic. Basically, we use the domain proficiency table to calculate a proficiency degree for each FAQ from the proficient concepts appearing in the question part of the FAQ, as detailed in Fig. 14.

Fig. 14. Proficiency degree calculation algorithm.

3.2.5. Personalizer

The Personalizer replaces the terms used in the solution FAQs with the terms in the user's terminology table, collected by the Query Parser, to improve the solution readability.

3.2.6. User Model Manager

The first task of the User Model Manager is to create an initial user model for a new user. To do this, we pre-defined several questions for each concept in the domain ontology, for example, "Do you know a CPU contains a floating co-processor?", "Do you know the concept of 1 GB = 1000 MB in specifying the capacity of a hard disk?", etc. The difficulty degrees of the questions are proportional to the hierarchy depth of the concepts in the ontology. When a new user logs on to the system, the Manager randomly selects questions from the ontology. The user answers either YES or NO to each question. The answers are collected, weighted according to their respective difficulty degrees, and passed to the Manager, which then calculates a proficiency score for the user according to the percentage of correct responses to the questions and accordingly instantiates a proper user stereotype as the user model for the user.

The second task is to update the user models. Here we use the interaction information and user feedback collected by the Interaction Agent in each interaction session or query session. An interaction session is defined as the time period from the point the user logs in until he logs out, while a query session is defined as the time period from when the user issues a query until he gets the answers and completes the feedback. An interaction session may contain several query sessions. After a query session is completed, we immediately update the interaction preference and solution presentation of the user model. Specifically, the user's query mode and solution presentation mode in this query session are remembered in their respective time windows, and the statistics of the preference change for each mode are calculated accordingly, to be used to adapt the Interaction Agent in the next query session. Fig. 15 illustrates the algorithm to update the Show_Rate of the similarity mode. The algorithm uses the ratio of the number of user-selected FAQs to the number of displayed FAQs to update the show rate; the algorithm to update the Show_Rate of the proficiency mode is similar.

Fig. 15. Algorithm to update the show rate in similarity mode.

In addition, each user is asked to evaluate each solution FAQ in terms of the following five levels of understanding: very familiar, familiar, average, not familiar, and very unfamiliar. This provides explicit feedback, which we can use to update his/her domain proficiency table. Fig. 16 shows the updating algorithm. Finally, after each interaction session, we can update the user's recommendation mode in this session in the respective time window. At the same time, we add the user's query and FAQ-selection records into the query history and selection history of his/her user model.

Fig. 16. Algorithm to update the domain proficiency table.

The third task of the User Model Manager is to update the user stereotypes. This happens when a sufficient number of user models in a stereotype have undergone changes. First, we reflect these changes to the stereotypes by re-clustering all affected user models, as shown in Fig. 17, and then re-calculate all parameters in each stereotype; an example is shown in Fig. 18.

Fig. 17. Algorithm to re-cluster all user groups.

Fig. 18. Example of updating the stereotype of expert.

3.2.7. Recommender

The Recommender uses the following three policies to recommend information. (1) High-hit FAQs: it recommends the first N solution FAQs according to their selection counts from all users in the same group within a time window. (2) Hot-topic FAQs: it recommends the first N solution FAQs according to their popularity, calculated as statistics on the keywords appearing in the query histories of the same group's users within a time window; the algorithm for the hot degree calculation is shown in Fig. 19. (3) Collaborative recommendation: it refers to the selection histories of users in the same group to provide solution recommendations. The basic idea is this: if user A and user B are in the same group and the first n interaction sessions of user A are the same as those of user B, then we can recommend the highest-rated FAQs in the (n + 1)th session of user A to user B; the detailed algorithm is shown in Fig. 20.

Fig. 19. Algorithm to calculate the hot degree.

Fig. 20. Algorithm to do the collaborative recommendation.

4. Demonstrations and evaluations

Our Interface Agent was developed on a Web-based client-server architecture. On the client side, we use JSP (Java Server Pages) and Java Applets for easy interaction with users, as well as for observing and recording user behavior. On the server side, we use Java and Java Servlets, under the Apache Tomcat 4.0 Web Server and MS SQL 2000 Server running on Microsoft Windows XP. In this section, we first demonstrate the developed Interface Agent and then report how well it performs.

4.1. System demonstrations

When a new user enters the system, the user is registered by the Agent as shown in Fig. 21. At the same time, a questionnaire is produced by the Agent for evaluating the user's domain proficiency. His/her answers are then collected and scored in order to help build an initial user model for the new user.

Fig. 21. System register interface.
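The model-update and recommendation machinery behind these interactions is given in the paper only as figures (Figs. 15 and 20). The sketch below is one plausible reading — a selection-ratio nudge for Show_Rate and prefix-matched session histories for collaborative recommendation — not the authors' exact formulas, and all names are illustrative.

```python
def update_show_rate(show_rate, n_selected, n_displayed, step=0.1):
    """Nudge Show_Rate toward the ratio of selected to displayed FAQs
    (an assumed reading of Fig. 15, not the paper's exact formula)."""
    if n_displayed == 0:
        return show_rate
    ratio = n_selected / n_displayed
    return show_rate + step * (ratio - show_rate)

def collaborative_recommend(sessions_a, ratings_a, sessions_b):
    """If B's sessions so far replay A's first n sessions, recommend the
    FAQs of A's (n+1)th session, best-rated first (idea of Fig. 20)."""
    n = len(sessions_b)
    if n < len(sessions_a) and sessions_a[:n] == sessions_b:
        next_session = sessions_a[n]
        return sorted(next_session,
                      key=lambda faq: ratings_a.get(faq, 0), reverse=True)
    return []

# User A's history: two sessions of selected FAQs, with feedback ratings.
sessions_a = [["faq1"], ["faq2", "faq3"]]
ratings_a = {"faq2": 3, "faq3": 5}
print(collaborative_recommend(sessions_a, ratings_a, [["faq1"]]))
# → ['faq3', 'faq2']
```

When the selection ratio equals the current Show_Rate, the update leaves it unchanged, which matches the stated goal of adapting the display volume to observed behavior.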


Now the user can get into the main tableau of our system (Fig
22), which consists of the following three major tab-frames,
namely, query interface, solution presentation, and logout. The
query interface tab is comprised of the following four frames: user
interaction interface, automatic keyword scrolling list, FAQ recom-
mendation list, and PC ontology tree. The user interaction interface
contains both keywords and NLP query modes as shown in Fig. 9.
The keyword query mode provides the lists of question types and
operation types, which allow the users to express their precise
intentions. The automatic keyword-scrolling list provides ranked-
keyword guidance for user query. A user can browse the PC ontol-
ogy tree to learn domain knowledge. The FAQ recommendation list
provides personalized information recommendations from the sys-
tem, which contains three modes: hit, hot topic, and collaboration.
When the user clicked a mode, the corresponding popup window is
produced by the system.
The solution presentation tab is illustrated in Fig 23. It pre-selects
the solutions ranking method according to the user’s preference and Fig. 24. FAQ-selection and feedback enticing.
hides part of solutions according to his/her Show_Rate for reducing
the cognitive loading of the user. The user can switch the solution
ranking method between similarity ranking and proficiency rank-
ing. The user can click the question part of an FAQ (Fig. 24) for dis-
playing its content or giving it a feedback, which contains the
satisfaction degree and comprehension degree. Fig. 25 illustrates
the window before system logout, which ask the user to fill a ques-
tionnaire for statistics to help further system improvement.

Fig. 25. System logout.

Table 11
Effectiveness of constructed query patterns

#Testing #Correct #Error Precision rate (%)


1215 1182 33 97.28

Fig. 22. Main tableau of our system.


4.2. System evaluations

The evaluation of the overall performance of our system in-


volves lots of manpower and is time-consuming. Here, we focus
on the performance evaluation of the most important module,
i.e., the Query Parser. Our philosophy is that if it can precisely parse
user queries and extract both true query intention and focus from
them, then we can effectively improve the quality of the retrieved.
First, we have done two experiments in evaluating how the user
query processing performs under the support of query templates
and domain ontology. Recall that this processing employs the tech-
nique of template-based pattern matching mechanism to under-
stand user queries and the templates were manually constructed
from 1215 FAQs. In the first experiment, we use these same FAQs
for testing queries, in order to verify whether any conflicts exist
within the query. Table 11 illustrates the experimental results,
where only 33 queries match with more than one query patterns
and result in confusion of query intention, called ‘‘error” in the ta-
ble. These errors may be corrected by the user. The experiment
Fig. 23. Solution presentation. shows the effectiveness rate of the constructed query templates

Please cite this article in press as: Yang, S. -Y. , Developing of an ontological interface agent with template-based linguistic ..., Expert Sys-
tems with Applications (2008), doi:10.1016/j.eswa.2008.03.011
ARTICLE IN PRESS

S.-Y. Yang / Expert Systems with Applications xxx (2008) xxx–xxx 11

Table 12 ven a user query, he first determines the class of the user query
User satisfaction evaluations according to the keywords and intention words, and then calcu-
K_WORD CPU MOTHERBOARD MEMORY AVERAGE lates the similarity degree between the user query and FAQs in
METHOD (SE/ST) (%) (SE/ST) (%) (SE/ST) (%) (SE/ST) (%) the related classes.
Alta Vista 63/61 77/78 30/21 57/53 The natural language processing technique was not used in the
Excite 66/62 81/81 50/24 66/56 work of Sneiders et al. (1999) for analyzing user queries. It, instead,
Google 66/64 81/80 38/21 62/55 was applied to analyze the FAQs stored in the database long before
HotBot 69/63 78/76 62/31 70/57
InfoSeek 69/70 71/70 49/28 63/56
any user queries are submitted, where each FAQ is associated with a
Lycos 64/67 77/76 36/20 59/54 required, optional, irrelevant, or forbidden keyword to help subse-
Yahoo 67/61 77/78 38/17 61/52 quent prioritized keyword matching. By this way, the work of FAQ
Our approach 78/69 84/78 45 /32 69/60 retrieval can be reduced to keyword matching without inference.
Razmerita, Angehrm, and Maedche (2003) presents a generic
ontology-based user modeling architecture, (OntobUM), applied
in the context of a Knowledge Management System (KMS). The
reaches 97.28%, which implies the template base can be used as an proposed user modeling system relies on a user ontology, using
effective knowledge base to do natural language query processing. Semantic Web technologies, based on the IMS LIP specifications,
Our second experiment is to learn how well this processing and it is integrated in an ontology-based KMS called Ontologging.
understands new queries. First, we collected in total 143 new FAQs, Degemmis, Licchelli, Lops, and Semeraro (2004) presents the
different from the FAQs collected for constructing the query tem- Profile Extractor, a personalization component based on machine
plates, from four famous motherboard factories in Taiwan, includ- learning techniques, which allows for the discovery of preferences
ing ASUS (<http://www.asus.com/>), SIS (<http://www.sis.com/>), and interests of users that have access to a website. Galassi, Gior-
MSI (<http://www.msi.com.tw/>), and GIGABYTE (<http://www.gi- dana, Saitta, and Botta (2005) also presents a method for automat-
ga-byte.com/>). We then used the question parts of those FAQs as testing queries, which test how well this processing performs. Our experiments show that we can precisely extract the true query intentions and focuses from 112 FAQs. The remaining 31 FAQs contain three or more sentences per query, which explains why we failed to understand them. In summary, 78.3% (112/143) of the new queries can be successfully understood.
Finally, Table 12 shows the comparison of user satisfaction of our system prototype against other search engines. In the table, ST, for Satisfaction of testers, represents the average of the satisfaction responses from 10 ordinary users, while SE, for Satisfaction of experts, represents the average of the satisfaction responses from 10 experts. Basically, each search engine receives 100 queries and returns the first 100 webpages for evaluation of satisfaction by both experts and non-experts. The table shows that our system prototype supported by ontological user modeling with query templates, the last row, enjoys the highest satisfaction in all classes. From the evaluation, we concluded that unless the competing search engines are specifically tailored to this specific domain, such as HotBot and Excite, our techniques, in general, can retrieve more correct webpages in almost all classes, because the intention and focus of a user can be correctly extracted.

5. Related works and comparisons

The work of Lee (2000) presents a user query system consisting of an intention part and a keywords part. With the help of syntax and parts-of-speech (POS) analysis, he constructs a syntax grammar from collected FAQs, and accordingly offers the capability of intention extraction from queries. He also extracts the keywords from queries through a sifting process on POS and stop words. Finally, he employs the semantic comparison technique of downward recursion on a parse tree to calculate the similarity degree of the intention parts between the user query and the FAQs, and then uses the vector space model (Salton et al., 1975) to calculate the vector similarity degree of the keyword parts between the user query and the FAQs for finding the best-matched FAQs.
The work of OuYang (2000) classifies pre-collected FAQs according to the business types of ChungHwa Telecom (<http://www.cht.com.tw/CHTFinalE/Web/>). He employs the technique of TFIDF (Term Frequency and Inverse Document Frequency) to calculate the weights of individual keywords and intention words, and accordingly selects representative keywords and intention words from them to serve as the index to each individual class. Gi-
ically constructing a sophisticated user/process profile from traces of user/process behavior, which is encoded by means of a Hierarchical Hidden Markov Model (HHMM). Finally, Hsu and Ho (1999) propose an intelligent interface agent to acquire patient data with medicine-related common sense reasoning.
In summary, the work of OuYang determines the user query intention according to the keywords and intention words appearing in the query, while the work of Sneiders uses similarity degrees on both intention words and keywords for solution searching and selection. Both approaches only consider the comparison between words and skip the problem of word ambiguity; e.g., two sentences with the same intention words may not have the same intention. The work of Lee uses the analysis of syntax and POS to extract query intention, which is a hard job with Chinese queries, because resolving the ambiguity of either the explicit or implicit meanings of Chinese words, especially in query analysis on long sentences or sentences with complex syntax, is not at all a trivial task. In this paper, we integrated several interesting techniques, including user modeling, domain ontology, and template-based linguistic processing, to effectively tackle the above annoying problems, much as Razmerita et al. (2003) associate ontology with user modeling in a different way and, especially, as Paraiso and Barthes (2006) highlight the role of ontologies for semantic interpretation. In addition, both Degemmis et al. (2004) and Galassi et al. (2005) propose different learning techniques for processing usage patterns and user profiles. The automatic processing feature, supported by HHMM with unsupervised learning and by common sense reasoning techniques, respectively, provides another level of automation in the interaction mechanism and deserves more attention.

6. Discussions and future work

We have developed an Interface Agent to work as an assistant between the users and FAQ systems, which differs in system architecture and implementation from our previous work (Yang et al., 2004). It is also used to retrieve FAQs on the domain of PC. We integrated several interesting techniques, including domain ontology, user modeling, and template-based linguistic processing, to effectively tackle the problems associated with traditional FAQ retrieval systems. Specifically, we have solved the following issues. Firstly, our ontological interface agent can truly learn a user's specialty in order to build a proper user model for him/her. Secondly, the domain ontology can efficiently and effectively help in
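The word-level comparison that these FAQ retrieval approaches share is essentially the vector space model of Salton et al. (1975): queries and FAQs become term vectors ranked by cosine similarity. A minimal sketch (illustrative only; Lee and OuYang each add their own weighting and intention handling, and a real system would use TF-IDF weights rather than raw counts):

```python
from collections import Counter
from math import sqrt

def cosine_similarity(query_terms, faq_terms):
    """Cosine of the angle between two raw term-count vectors."""
    q, f = Counter(query_terms), Counter(faq_terms)
    dot = sum(q[t] * f[t] for t in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in f.values()))
    return dot / norm if norm else 0.0

# An FAQ sharing keywords with the query outranks an unrelated one:
query = ["motherboard", "support", "DDR-400"]
faq_a = ["motherboard", "support", "DDR-400", "memory"]
faq_b = ["driver", "download", "sound"]
assert cosine_similarity(query, faq_a) > cosine_similarity(query, faq_b)
```

Because the score sees only word overlap, two queries with identical keywords but different intentions are indistinguishable, which is precisely the word-ambiguity problem noted above.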

Please cite this article in press as: Yang, S. -Y. , Developing of an ontological interface agent with template-based linguistic ..., Expert Sys-
tems with Applications (2008), doi:10.1016/j.eswa.2008.03.011

establishing user models, analyzing user queries, and assisting and guiding interface usage. Finally, the intention and focus of a user can be correctly extracted by the agent. In short, our work features a template-based linguistic processing technique for developing ontological interface agents; a natural language query mode, along with an improved keyword-based query mode; and assistance and guidance for human-machine interaction. Our preliminary experimentation demonstrates that the user intention and focus of up to eighty percent of the user queries can be correctly understood by the system, which accordingly provides query solutions with higher user satisfaction.
Most of our current experiments are on the performance test of the Query Parser. We are unable to do experiments on or comparisons of how good the Interface Agent is at capturing all interaction information/intention of a user. Our difficulties are summarized below: (1) To our knowledge, none of the current interface systems adopt a similar approach to ours, in the sense that none of them rely on ontology as heavily as our system does to support user interaction. It is thus rather hard for us to do a fair and convincing comparison. (2) Our ontology construction is based on a set of pre-collected webpages on a specific domain; it is hard to evaluate how critical this pre-collection process is to the nature of different domains. We are planning to employ the technique of automatic ontology evolution, for example in cooperation with data mining technology for discovering useful information and generating desired knowledge that supports ontology construction (Wang, Lu, & Zhang, 2007), to help study the robustness of our ontology. Finally, in the future, we will not only employ the techniques of machine learning and data mining to automate the construction of the template base, but, for the overall system evaluation, we are also planning to apply the concept of usability evaluation from the domain of human factors engineering to evaluate the performance of the agent.

Acknowledgements

The author would like to thank Yai-Hui Chang and Ying-Hao Chiu for their assistance in system implementation. This work was supported by the National Science Council, ROC, under Grants NSC-89-2213-E-011-059, NSC-89-2218-E-011-014, and NSC-95-2221-E-129-019.

References

Chandrasekaran, B., Josephson, J. R., & Benjamins, V. R. (1999). What are ontologies, and why do we need them? IEEE Intelligent Systems, 14(1), 20–26.
Chiu, Y. H. (2003). An interface agent with ontology-supported user models. Master Thesis, Department of Electronic Engineering, National Taiwan University of Science and Technology, Taiwan, ROC.
Coyle, M., & Smyth, B. (2005). Enhancing web search result lists using interaction histories. In Proceedings of the 27th European conference on IR research on advances in information retrieval (pp. 543–545).
Degemmis, M., Licchelli, O., Lops, P., & Semeraro, G. (2004). Learning usage patterns for personalized information access in e-commerce. In Proceedings of the 8th ERCIM workshop on user interfaces for user-centered interaction paradigms for universal access in the information society (pp. 133–148).
Galassi, U., Giordana, A., Saitta, L., & Botta, M. (2005). Learning profile based on hierarchical hidden markov model. In Proceedings of the 15th international symposium on foundations of intelligent systems (pp. 47–55).
Hovy, E., Hermjakob, U., & Ravichandran, D. (2002). A question/answer typology with surface text patterns. In Proceedings of the DARPA human language technology conference (pp. 247–250).
Hsu, C. C., & Ho, C. S. (1999). Acquiring patient data by an intelligent interface agent with medicine-related common sense reasoning. Expert Systems with Applications: An International Journal, 17(4), 257–274.
Lee, C. L. (2000). Intention extraction and semantic matching for internet FAQ retrieval. Master Thesis, Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, ROC.
Noy, N. F., & McGuinness, D. L. (2001). Ontology development 101: A guide to creating your first ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880.
OuYang, Y. L. (2000). Study and implementation of a dialogued-based query system for telecommunication FAQ services. Master Thesis, Department of Computer and Information Science, National Chiao Tung University, Taiwan, ROC.
Paraiso, E. C., & Barthes, J. P. A. (2006). An intelligent speech interface for personal assistants in R&D projects. Expert Systems with Applications: An International Journal, 31(4), 673–683.
Razmerita, L., Angehrm, A., & Maedche, A. (2003). Ontology-based user modeling for knowledge management systems. In Proceedings of the 9th international conference on user modeling (pp. 213–217).
Rich, E. (1979). User modeling via stereotypes. Cognitive Science, 3, 329–354.
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York, USA: McGraw-Hill Book Company.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of ACM, 18(11), 613–620.
Sneiders, E. (1999). Automated FAQ answering: Continued experience with shallow language understanding. In AAAI Fall symposium on question answering systems (pp. 97–107). Technical Report FS-99-02. North Falmouth, Massachusetts, USA: AAAI Press.
Soubbotin, M. M., & Soubbotin, S. M. (2001). Patterns of potential answer expressions as clues to the right answer. In Proceedings of the TREC-10 conference (pp. 293–302).
Tsai, C. H. (2000). MMSEG: A word identification system for Mandarin Chinese text based on two variants of the maximum matching algorithm. <http://technology.chtsai.org/mmseg/>.
Wang, C., Lu, J., & Zhang, G. Q. (2007). Mining key information of web pages: A method and its applications. Expert Systems with Applications: An International Journal, 33(2), 425–433.
Yang, S. Y. (2006). An ontology-supported and query template-based user modeling technique for interface agents. In Symposium on application and development of management information system (pp. 168–173).
Yang, S. Y. (2006). How does ontology help information management processing. WSEAS Transactions on Computers, 5(9), 1843–1850.
Yang, S. Y. (2006). FAQ-master: A new intelligent web information aggregation system. In Proceedings of international academic conference 2006 special session on artificial intelligence theory and application (pp. 2–12).
Yang, S. Y. (2007). An ontological multi-agent system for web FAQ query. In Proceedings of the international conference on machine learning and cybernetics (pp. 2964–2969).
Yang, S. Y. (2007). An ontological proxy agent for web information processing. In Proceedings of the 10th international conference on computer science and informatics (pp. 671–677).
Yang, S. Y., Chuang, F. C., & Ho, C. S. (2007). Ontology-supported FAQ processing and ranking techniques. Journal of Intelligent Information Systems, 28(3), 233–251.
Yang, S. Y., & Ho, C. S. (1999). Ontology-supported user models for interface agents. In Proceedings of the 4th conference on artificial intelligence and applications (pp. 248–253).
Yang, S. Y., Chiu, Y. H., & Ho, C. S. (2004). Ontology-supported and query template-based user modeling techniques for interface agents. In The 12th national conference on fuzzy theory and its applications (pp. 181–186).

WSEAS TRANSACTIONS ON INFORMATION SCIENCE & APPLICATIONS Issue 11, Vol. 4, November 2007 ISSN: 1709-0832 1400

An Ontological Interface Agent for FAQ Query Processing


SHENG-YUAN YANG
Dept. of Computer and Communication Engineering
St. John’s University
499, Sec. 4, TamKing Rd., Tamsui, Taipei County 251
TAIWAN
[email protected] http://mail.sju.edu.tw/~ysy

Abstract: - In this paper, we describe an Interface Agent which works as an assistant between the users and FAQ systems to retrieve FAQs on the domain of Personal Computer. It integrates several interesting techniques, including domain ontology, user modeling, and template-based linguistic processing, to effectively tackle the problems associated with traditional FAQ retrieval systems. Specifically, we address how ontology helps interface agents provide better FAQ services and describe the related algorithms in detail. Our work features an ontology-supported, template-based user modeling technique for developing interface agents. Our preliminary experimentation demonstrates that the user intention and focus of up to eighty percent of the user queries can be correctly understood by the system, which accordingly provides query solutions with higher user satisfaction.

Key-words: - Ontology, Template-based Processing, User Modeling, Interface Agents.

1 Introduction

In this information-exploding era, the Internet affects people's life style in terms of how people acquire, present, and exchange information. Especially, the use of the World Wide Web has been leading to a large increase in the number of people who access FAQ knowledge bases to find answers to their questions [10]. As the techniques of Information Retrieval [6,7] matured, a variety of information retrieval systems have been developed, e.g., search engines, Web portals, etc., to help search on the Web. How to search is no longer a problem. The problem now comes from the results of these information retrieval systems, which contain so much information that it overwhelms the users. Therefore, how to improve traditional information retrieval systems to provide search results which can better meet the user requirements, so as to reduce his cognitive loading, is an important issue in current research [2].

We have proposed FAQ-master as an intelligent Web information aggregation system, which provides intelligent information retrieval, filtering, and aggregation services [12,17]. Fig. 1 illustrates the system architecture of FAQ-master. The Interface Agent captures user intention through an adaptive human-machine interaction interface with the help of ontology-directed and template-based user models [18]. The Search Agent performs in-time, user-oriented, and domain-related Web information retrieval with the help of ontology-supported website models [19]. The Answerer Agent works as a back-end process to perform ontology-directed information aggregation from the webpages collected by the Search Agent [14,20]. Finally, the Proxy Agent works as an ontology-enhanced intelligent proxy mechanism to share most of the query loading with the Answerer Agent [15,16,20].

Fig. 1 System architecture for FAQ-master (the user submits queries to the Interface Agent, which sends internal queries to the Proxy Agent; the Search Agent retrieves webpages from Web search engines into the Content Base, and the Answerer Agent returns retrieved solutions)

This paper discusses the Interface Agent, focusing on how it captures the user's true intention and accordingly provides high-quality FAQ answers. The agent features ontology-based representation of domain knowledge, a flexible interaction interface, and personalized information filtering and display. Our preliminary experimentation demonstrates that the intention and focus of up to eighty percent of the users' queries can be correctly understood by the system, which accordingly provides query solutions with higher user satisfaction. The Personal Computer (PC) domain is chosen as the target application of our Interface Agent and will be used for explanation in the remaining sections.

2 Fundamental Techniques

2.1 Domain Ontology and Services

The concept of ontology in artificial intelligence refers to knowledge representation for domain-specific contents [1]. It has been advocated as an important tool to support knowledge sharing and reusing in developing intelligent systems. Although the development of an ontology for a specific domain is not yet an engineering process, we have outlined a procedure for this in [11] from how the process was conducted in existent systems. By following the procedure, we developed an ontology for the PC domain in Chinese using Protégé 2000 [4] (changed to English here for easy explanation) as the fundamental background knowledge for the system.

Fig. 2 shows part of the ontology taxonomy. The taxonomy represents relevant PC concepts as classes and their parent-child relationships as isa links, which allow inheritance of features from parent classes to child classes. Fig. 3 exemplifies the detailed ontology of the concept of CPU. In the figure, the root node uses various fields to define the semantics of the CPU class, each field representing an attribute of "CPU", e.g., interface, provider, synonym, etc. The nodes at the lower level represent various CPU instances,
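The class-plus-instances frame structure of Fig. 3 can be mirrored in a few lines of plain Python dicts. This is an illustrative sketch only (the actual system stores the ontology in Protégé 2000); the attribute values are taken from Fig. 3:

```python
# Class node: field definitions for the CPU concept.
cpu_class = {
    "name": "CPU",
    "synonym": "Central Processing Unit",
    "fields": ["Factory", "Interface", "L1 Cache", "Abbr.", "Clock"],
}

# Instance nodes attached to the class node by "io" (instance-of) arcs.
cpu_instances = {
    "DURON 1.2G": {"io": "CPU", "Factory": "AMD", "Interface": "Socket A",
                   "L1 Cache": "64KB", "Abbr.": "Duron", "Clock": "1.2GHZ"},
    "CELERON 1.0G": {"io": "CPU", "Factory": "Intel", "Interface": "Socket 370",
                     "L1 Cache": "32KB", "Abbr.": "Celeron", "Clock": "1GHZ"},
}

def instances_of(class_name):
    """Follow the "io" arcs backwards: all instance nodes of a class."""
    return [name for name, attrs in cpu_instances.items()
            if attrs["io"] == class_name]

assert instances_of("CPU") == ["DURON 1.2G", "CELERON 1.0G"]
```

An ontology service such as "find the definition of a term" then reduces to a lookup in such structures.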

which capture real world data. The arrow line with the term "io" denotes the instance-of relationship. The complete PC ontology can be referenced from the Protégé Ontology Library at the Stanford website (http://protege.stanford.edu/download/download.html).

Fig. 2 Part of PC ontology taxonomy (the root class Hardware has isa links to subclasses such as Interface Card, Power Equipment, Memory, Case, and Storage Media; Interface Card subsumes Network Chip, Sound Card, Display Card, SCSI Card, and Network Card; Power Equipment subsumes Power Supply and UPS; Storage Media subsumes ROM, Memory, Optical, and ZIP, with Optical subsuming CD, DVD, CDR/W, and CDR)

Fig. 3 Ontology of the concept of CPU (the root node defines fields such as Synonym = Central Processing Unit, D-Frequency, Interface, L1 Cache, and Abbr.; instance nodes linked by "io" arcs include XEON, THUNDERBIRD 1.33G, DURON 1.2G, PENTIUM 4 2.0AGHZ, PENTIUM 4 1.8AGHZ, CELERON 1.0G, and PENTIUM 4 2.53AGHZ, each carrying attribute values such as Factory, Interface, L1 Cache, Abbr., and Clock)

We also developed a Problem ontology to deal with query questions. Fig. 4 illustrates part of the Problem ontology, which contains query types and operation types. Together they imply the semantics of a question. Finally, we use Protégé's APIs to develop a set of ontology services, which provide primitive functions to support the application of the ontologies. The ontology services currently available include transforming query terms into canonical ontology terms, finding definitions of specific terms in the ontology, finding relationships among terms, finding compatible and/or conflicting terms against a specific term, etc.

Fig. 4 Part of problem ontology taxonomy (the root class Query has subclasses Operation Type and Query Type; instances include Adjust, Use, Setup, Close, Open, Support, and Provide for operation types, and How, What, Why, and Where for query types)

2.2 Ontological Query Templates

To build the query templates, we collected in total 1215 FAQs from the FAQ websites of six famous motherboard factories in Taiwan and used them as the reference materials for query template construction. Currently, we only take care of user queries with one intention word and at most three sentences. These FAQs were analyzed and categorized into six types of questions. For each type of question, we further identified several intention types according to its operations. Finally, we define a query pattern for each intention type. Table 1 illustrates the defined query patterns for the intention types. Now all information for constructing a query template is ready, and we can formally define a query template [3,8]. Table 2 illustrates an example query template for the ANA_CAN_SUPPORT intention type. Note here that we collect similar query patterns in the field "Query_Patterns," which are used in the detailed analysis of a given query.

Table 1 Examples of query patterns

- Question type 是否 (If), operation type 支援 (Support), intention type ANA_CAN_SUPPORT, query pattern <S1 是否 支援 S2>. Example: GA-7VRX 這塊主機板是否支援 KINGMAX DDR-400? (Could the GA-7VRX motherboard support the KINGMAX DDR-400 memory type?)
- Question type 如何 (How), operation type 安裝 (Setup), intention type HOW_SETUP, query pattern <如何 在 S1><安裝 S2>. Example: 如何在 Windows 98SE 下，安裝 8RDA 的音效驅動程式? (How to setup the 8RDA sound driver on a Windows 98SE platform?)
- Question type 什麼 (What), operation type 是 (Is), intention type WHAT_IS, query pattern <S1 是 什麼>. Example: AUX power connector 是什麼? (What is an AUX power connector?)
- Question type 何時 (When), operation type 支援 (Support), intention type WHEN_SUPPORT, query pattern <S1 何時 支援 S2>. Example: P4T 何時才能支援 32-bit 512 MB RDRAM 記憶體規格? (When can the P4T support the 32-bit 512 MB RDRAM memory specification?)
- Question type 哪裡 (Where), operation type 下載 (Download), intention type WHERE_DOWNLOAD, query pattern <S1><哪裡 可以 下載 S2>. Example: CUA 的 Driver CD 遺失，請問哪裡可以下載音效驅動程式? (Where can I download the sound driver of the CUA, whose Driver CD was lost?)
- Question type 為什麼 (Why), operation type 列印 (Print), intention type WHY_PRINT, query pattern [S1]<S2 無法 列印>. Example: 為什麼在 Win ME 底下，從休眠狀態中回復後，印表機無法列印? (Why can I not print after coming back from dormancy on a Win ME platform?)

Table 2 Query template for the ANA_CAN_SUPPORT intention type

Template_Number: 304
#Sentence: 3
Intention_Word: 是否 (If), 支援 (Support)
Intention_Type: ANA_CAN_SUPPORT
Question_Type: 是否 (If)
Operation_Type: 支援 (Support)
Query_Patterns: [S3]<S1 是否 支援 S2>; [S2]<是否 支援 S1>
Focus: S1

According to the generalization relationships among intention types, we can form a hierarchy of intention types to organize all FAQs. Currently, the hierarchy contains two levels, as shown in Fig. 5. Now, the system can employ the intention type hierarchy to reduce the search scope during the retrieval of FAQs after the intention of a user query is identified.

Fig. 5 Intention type hierarchy (the first level contains the question types 是否 (IF), 如何 (HOW_SOLVE), 什麼 (WHAT), 何時 (WHEN), 哪裡 (WHERE), and 為什麼 (WHY_EXPLAIN); the second level contains intention types such as ANA_CAN_SUPPLY, ANA_CAN_SET, HOW_SET, HOW_FIX, WHAT_SUPPORT, WHAT_SETUP, WHEN_SUPPORT, WHERE_DOWNLOAD, WHERE_OBTAIN, WHY_USE, and WHY_OFF)

3 System Architecture

3.1 User Modeling

A user model contains interaction preference, solution presentation, domain proficiency, a terminology table, query history, selection history, and user feedback, as shown in Fig. 6.

Fig. 6 Our user model (interaction preference, solution presentation, domain proficiency, terminology table, and interaction history with explicit and implicit feedback)

The interaction preference is responsible for recording the user's preferred interface, e.g., favorite query mode, favorite recommendation mode, etc. When the user logs on to the system, the system can select a proper user interface according to this preference. We provide two query modes, either through keywords or natural language input, and three recommendation modes, according to hit rates, hot topics, or collaborative learning. We record the user's recent preferences in a time window, and accordingly determine the next interaction style.

The solution presentation is responsible for recording the solution ranking preferences of the user. We provide two types of ranking, either according to the degree of similarity
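A query template such as the one in Table 2 is essentially a fixed record. The sketch below shows how one might be held in memory; the field names follow Table 2, while the dict/set representation and the helper function are assumptions of this sketch:

```python
# The ANA_CAN_SUPPORT template of Table 2 as a plain record.
template_304 = {
    "Template_Number": 304,
    "Sentence": 3,                        # number of sentences handled
    "Intention_Word": {"是否", "支援"},   # (If), (Support)
    "Intention_Type": "ANA_CAN_SUPPORT",
    "Question_Type": "是否",
    "Operation_Type": "支援",
    "Query_Patterns": ["[S3]<S1 是否 支援 S2>", "[S2]<是否 支援 S1>"],
    "Focus": "S1",
}

def template_fits(template, intention_words, sentence_count):
    """A template is a candidate for a query when the sentence counts agree
    and all of its intention words occur in the query."""
    return (template["Sentence"] == sentence_count
            and template["Intention_Word"] <= set(intention_words))

assert template_fits(template_304, ["是否", "支援"], 3)
```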

between the proposed solutions and the user query, or according to the user's proficiency about the solutions. In addition, we use a Show_Rate parameter to control how many solution items are displayed each time, in order to reduce the information overloading problem.

The domain proficiency factor describes how familiar the user is with the domain. By associating a proficiency degree with each ontology concept, we can construct a table which contains a set of <concept, proficiency-degree> pairs as his domain proficiency. Thus, during the decision of solution representation, we can calculate the user's proficiency degree on solutions using the table, and accordingly show only his most familiar part of the solutions and hide the rest for advanced requests. To solve the problem of different terminologies being used by different users, we include a terminology table to record these terminology differences. We can use the table to replace the terms used in the proposed solutions with the user's favorite terms during solution representation to help him better comprehend the solutions.

Finally, we record the user's query history as well as the FAQ selection history and the corresponding user feedback of each query session in the interaction history, in order to support collaborative recommendation. The user feedback is a complicated factor. We remember both explicit user feedback in the selection history and implicit user feedback, which includes query time, time of FAQ click, sequence of FAQ clicks, sequence of clicked hyperlinks, etc.

In order to quickly build an initial user model for a new user, we pre-defined five stereotypes [5], namely expert, senior, junior, novice, and amateur [11], to represent the characteristics of different user groups. This approach is based on the idea that the same group of users tends to exhibit the same behavior and requires the same information. Fig. 7 illustrates an example user stereotype. When a new user enters the system, he is asked to complete a questionnaire, which is used by the system to determine his domain proficiency, and accordingly select a user stereotype to generate an initial user model for him. However, the initial user model constructed from the stereotype may be too generic or imprecise. It will be refined to reflect the specific user's real intent after the system has experience with his query history, FAQ-selection history and feedback, and implicit feedback [2].

Fig. 7 Example of expert stereotype (the Expert stereotype records query mode preferences (Keyword 0/5, NLP 1/5) and recommendation mode preferences (Hit 0/7, Hot Topic 0/7, Collaborative 1/7) with time window sizes, solution presentation preferences with Show_Rate 0.9 for both similarity and proficiency modes, a domain proficiency table assigning proficiency 0.9 to concepts such as 主機板 (motherboard) and 中央處理器 (CPU), a terminology table, and interaction history with explicit and implicit feedback)

3.2 System Overview

Fig. 8 shows the architecture of the Interface Agent. The Interaction Agent provides a personalized interaction, assistance, and recommendation interface for the user according to his user model, records interaction information and related feedback in the user model, and helps the User Model Manager and Proxy Agent to update the user model. The Query Parser processes the user queries by first segmenting words, removing conflicting words, and standardizing terms, followed by recording the user's terminologies in the terminology table of the user model. It finally applies the template matching technique to select the best-matched query templates, and accordingly transforms the query into an internal query for the Proxy Agent to search for solutions and collect them into a list of FAQs, each containing a corresponding URL. The Web Page Processor pre-downloads FAQ-relevant webpages and performs some pre-processing tasks, including labeling keywords for subsequent processing. The Scorer calculates the user's proficiency degree for each FAQ in the FAQ list according to the terminology table in his user model. The Personalizer then produces personalized query solutions according to the terminology table. The User Model Manager is responsible for quickly building an initial user model for a new user using the technique of user stereotyping, as well as for updating the user models and stereotypes to dynamically reflect the changes of user behavior. The Recommender is responsible for recommending information to the user based on hit count, hot topics, or the group's interests when a similar interaction history is detected.

Fig. 8 Interface agent architecture (the Interaction Agent, Query Parser, Web Page Processor, Scorer, Personalizer, User Model Manager, and Recommender are connected by data and support flows, backed by the template base, homophone debug base, user model, and ontology, and communicate with the Search, Answerer, and Proxy Agents and the Internet)

3.2.1 Interaction Agent

The Interaction Agent consists of the following three components: Adapter, Observer, and Assistant. First, the Adapter constructs the best interaction interface according to the user's favorite query and recommendation modes. It is also responsible for presenting to the user the list of FAQ solutions (from the Personalizer) or the recommendation information (from the Recommender). During solution representation, it arranges the solutions in terms of the user's preferred style (query similarity or solution proficiency) and displays the solutions according to the Show_Rate. Second, the Observer passes the user query to the Query Parser, and simultaneously collects the interaction information and related feedback from the user. The interaction information contains the user's preferred query mode, recommendation mode, and solution presentation mode, and his FAQ clicking behavior, while the related feedback contains the user's satisfaction degree and comprehension degree for each FAQ solution. The User Model Manager needs both the interaction information and the related feedback to properly update the user models and stereotypes. The satisfaction degree in the related feedback can also be passed to the Proxy Agent for tuning the solution search mechanism [20].

Finally, the Assistant provides proper assistance and guidance to help the user query process. First, the ontology concepts are structured and presented as a tree so that users who are not familiar with the domain can check on the

tree and learn proper terms to enter their queries. We also rank all ontology concepts by their probabilities and display them in a keyword list. When the user enters a query in the input area, the Assistant automatically "scrolls" the content of the keyword list to those terms related to the input keywords. Fig. 9 illustrates an example of this automatic keyword scrolling mechanism. If the displayed terms of the list contain a concept that the user wants to enter, he can double-click the term into the input area, e.g., "華碩" (ASUS) at step 2 of Fig. 9. In addition to the keyword-oriented query mode, the Assistant also provides lists of question types and operation types to help question type-oriented or operation type-oriented search. The user can use one, two, or all three of these mechanisms to help form his query in order to convey his intention to the system.

Fig. 9 Examples of the automatic keyword scrolling mechanism (as the input area changes from 什麼華 to 什麼華碩 to 什麼華碩主 over three steps, the keyword list scrolls through related terms such as 華邦, 華碩, 螢幕, 主機板, 製程, 二進制, and 交換器)

3.2.2 Query Parser

Fig. 10 Flow chart of the Query Parser (User Query → 1. Segmentation → 2. Query Pruning → 3. Query Standardization → 4. Pattern Match (NLP mode) → 5. User Confirmation → 6. Query Formulation → Internal Query Format, passed to the Proxy Agent; the keyword mode jumps directly from step 3 to query formulation; a failed confirmation lets the user modify the query slightly; the steps are supported by the homophone debug base, ontology, user model, intention-word base, and template base)

The Query Parser pre-processes the user query by performing Chinese word segmentation, correction of word segmentation, fault correction of homophonous or multiple words, and term standardization. It then employs template-based pattern matching to analyze the user query and extract the user intention and focus. Finally, it transforms the user query into the internal query format and then passes the query to the Proxy Agent for retrieving proper solutions [20]. Fig. 10

second word corpus to bring those mis-segmented words back. It also performs fault correction on homophonous or multiple words using the ontology and the homophone debug base [2].

The step of query standardization is responsible for replacing the terms used in the user query with the canonical terms in the ontology and the intention word base. The original terms and the corresponding canonical terms will then be stored in the terminology table for solution presentation personalization. Finally, we label the recognized keywords with the symbol "K" and the intention words with the symbol "I." The rest are regarded as stop words and removed from the query. Now, if the user is using the keyword mode, we directly jump to the step of query formulation. Otherwise, we use template-based pattern matching to analyze the natural language input.

The step of pattern match is responsible for identifying the semantic pattern associated with the user query. Using the pre-constructed query templates in the template base, we can compare the user query with the query templates and select the best-matched one to identify the user intention and focus. Fig. 11 shows the algorithm for quickly selecting possibly matched templates, Fig. 12 describes the algorithm which finds all patterns matched with the user query, and Fig. 13 removes those matched patterns that are generalizations of some other matched patterns.

Template Selection:
  Q : user query.
  Q.Intention_Word = {I1, I2, ..., IN} : intention words in Q.
  Q.Sentence : number of sentences in Q.
  Template Base = {T1, T2, ..., TM}, M : number of templates.
  For each template Tj in Template Base
  {
    If Tj conforms to the following rules, then select Tj into C:
      1. Tj.Sentence = Q.Sentence.
      2. Tj.Intention_Word ⊆ Q.Intention_Word.
  }
  return C : candidate templates.

Fig. 11 Query template selection algorithm

Pattern Match:
  For each template Tj in C, the candidate templates
  {
    Tj.Pattern = {P1, P2, ...}, Pk : pattern k in template j.
    For each Pk in Tj.Pattern
    {
      If Pk matches Q, the user query, then
        Pk.Intention_Word = Tj.Intention_Word,
        Pk.Intention_Type = Tj.Intention_Type,
        Pk.Question_Type = Tj.Question_Type,
        Pk.Operation = Tj.Operation,
        Pk.Focus = Tj.Focus,
        and put Pk in M and break this inner loop.
    }
  }
  return M : patterns matching Q.

Fig. 12 Pattern match algorithm

Pattern Removal:
  For each pattern Pk in M, the matched patterns
  {
    If Pk conforms to the following rule, then remove Pk from M:
      ∃Pi ∈ M, Pk.Intention_Type = Pi.Intention_Type and
               Pk.Intention_Word ⊂ Pi.Intention_Word.
  }

Fig. 13 Query pattern removal algorithm
shows the flow chart of the Query Parser. Detailed Take the following query as an example: “華碩 K7V 主機
explanation follows.
板是否支援 1GHz 以上的中央處理器呢?”, which means
Given a user query in Chinese, we segment the query using
MMSEG [9]. The results of segmentation were not good, for “Could Asus K7V motherboard support a CPU over 1GHz?”
the predefined MMSEG word corpus contains insufficient Table 2 illustrates a query template example, which contains
terms of the PC domain. For example, it does not contain the two patterns, namely, <S1 是否 支援 S2> and <是否 支援
keywords “華碩” or “AGP4X”, and returns wrong word S1>. We find the second pattern matches the user query and
segmentation like “華”, “碩”, “AGP”, and “4X”. The step of can be selected to transform the query into an internal form
by query formulation (step 6) as shown in Table 3. Note that
query pruning can easily fix this by using the ontology as a
there may be more than two patterns to become candidates for
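The template-selection, pattern-matching, and generalization-removal steps of Figs. 11-13 can be sketched in Python as follows. This is a minimal illustration under our own simplified data model: the Template class, the token-based in-order match test, and the slot convention ("S1", "S2", ...) are assumptions for the sketch, not the system's actual structures.

```python
from dataclasses import dataclass

@dataclass
class Template:
    # Illustrative simplification of a query template (Figs. 11-13).
    sentences: int             # number of sentences the template expects
    intention_words: frozenset
    patterns: list             # token sequences; slot tokens are "S1", "S2", ...
    intention_type: str = ""

def select_templates(query_words, query_sentences, template_base):
    """Fig. 11: keep templates whose sentence count equals the query's and
    whose intention words all appear in the query."""
    return [t for t in template_base
            if t.sentences == query_sentences
            and t.intention_words <= set(query_words)]

def pattern_matches(pattern, query_words):
    """A pattern matches if its non-slot tokens occur in the query in order
    (membership tests on the iterator consume it, enforcing the order)."""
    it = iter(query_words)
    return all(tok.startswith("S") or tok in it for tok in pattern)

def match_patterns(query_words, candidates):
    """Fig. 12: collect the first matching pattern of each candidate template;
    the pattern inherits the template's labels (intention type, focus, ...)."""
    matched = []
    for t in candidates:
        for p in t.patterns:
            if pattern_matches(p, query_words):
                matched.append((p, t))
                break
    return matched

def remove_generalizations(matched):
    """Fig. 13: drop a matched pattern whose intention words are a proper
    subset of another matched pattern's with the same intention type."""
    return [(p, t) for p, t in matched
            if not any(t.intention_type == u.intention_type
                       and t.intention_words < u.intention_words
                       for _, u in matched)]
```

For instance, a template carrying the intention words {是否, 支援} with the two patterns of Table 2 would survive template selection for the segmented example query and match in step 4.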
WSEAS TRANSACTIONS ON INFORMATION SCIENCE & APPLICATIONS Issue 11, Vol. 4, November 2007 ISSN: 1709-0832 1404

In this case, the Query Parser will prompt the user to confirm his intent (step 5). If the user says "No", meaning the pattern matching result is not his true intention, he is allowed to modify the matched result or to switch to the keyword mode for placing the query.

Table 3 Internal form (fields: User_Level, Query_Mode, Intention_Type, Question_Type, Operation, Keyword, Focus)

3.2.3 Web Page Processor

The Web Page Processor receives from the Proxy Agent a list of retrieved solutions, which contains one or more FAQs matched with the user query, each represented in the format of Table 4, and retrieves and caches the solution webpages according to the FAQ_URLs. It then pre-processes those webpages for the subsequent customization process, including URL transformation, keyword standardization, and keyword marking. The URL transformation changes all hyperlinks to point toward the cache server. The keyword standardization transforms all terms in the webpage content into ontology vocabularies. The keyword marking marks the keywords appearing in the webpages in boldface (<B>Keyword</B>) to facilitate subsequent keyword processing and webpage readability.

Table 4 Format of retrieved FAQ
  FAQ_No. : FAQ's identification
  FAQ_Question : question part of the FAQ
  FAQ_Answer : answer part of the FAQ
  FAQ_Similarity : similarity degree between the FAQ and the user query
  FAQ_URL : source or related URL of the FAQ

3.2.4 Scorer

Each FAQ is a short document, so the concepts involved in an FAQ are in general focused; in other words, its topic (or concept) is clear and professional. The question part of an FAQ is even more pointed about which concepts are involved. Knowing this property, we can use the keywords appearing in the question part of an FAQ to represent its topic. Basically, we use the table of domain proficiency to calculate a proficiency degree for each FAQ from the proficient concepts appearing in its question part, as detailed in Fig. 14.

  A : answer FAQ list, A = {F1, F2, ..., Fn} ⊂ C, the FAQ collection;
      A contains the FAQs found by the system that best match the user query.
  Fi.Q : the question part of FAQ Fi.
  For each Fi ∈ A, we calculate Fi's proficiency score for user k:
    Fi = (1 / Number of concepts appearing in Fi.Q) × Σj Appearance(Concept j) × Proficiency(Concept j),
  where Concept j ∈ Ontology,
    Appearance(Concept j) = 1 if Concept j appears in Fi.Q, and 0 otherwise,
    Proficiency(Concept j) : the degree of user k's proficiency in Concept j.

Fig. 14 Proficiency degree calculation algorithm

3.2.5 Personalizer

The Personalizer replaces the terms used in the solution FAQs with the terms in the user's terminology table, collected by the Query Parser, to improve solution readability.

3.2.6 User Model Manager

The first task of the User Model Manager is to create an initial user model for a new user. To do this, we pre-defined several questions for each concept in the domain ontology, for example, "Do you know that a CPU contains a floating-point co-processor?", "Do you know the convention of 1 GB = 1000 MB in specifying the capacity of a hard disk?", etc. The difficulty degrees of the questions are proportional to the hierarchy depth of the corresponding concepts in the ontology. When a new user logs on to the system, the Manager randomly selects questions from the ontology. The user answers YES or NO to each question. The answers are collected, weighted according to the respective difficulty degrees, and passed to the Manager, which then calculates a proficiency score for the user according to the percentage of correct responses and accordingly instantiates a proper user stereotype as the user model for the user.

The second task is to update user models. Here we use the interaction information and user feedback collected by the Interaction Agent in each interaction session or query session. An interaction session is the time period from when the user logs in up to when he logs out, while a query session is the time period from when the user issues a query up to when he gets the answers and completes the feedback. An interaction session may contain several query sessions. After a query session is completed, we immediately update the interaction preference and solution presentation parts of the user model. Specifically, the user's query mode and solution presentation mode in this query session are remembered in both time windows, and the statistics of the preference change for each mode are calculated accordingly, to be used to adapt the Interaction Agent in the next query session. Fig. 15 illustrates the algorithm that updates the Show_Rate of the similarity mode; it uses the ratio of the number of user-selected FAQs to the number of displayed FAQs. The algorithm that updates the Show_Rate of the proficiency mode is similar.

  NS : number of FAQs in the solution FAQ list.
  N(Similarity Mode) : number of FAQs shown to the user in similarity mode,
    N(Similarity Mode) = ⌈NS × Show_Rate(Similarity Mode)old⌉.
  NHide = NS − N(Similarity Mode) : number of hidden FAQs.
  NSelect : number of FAQs selected by the user in the query session.
  Show_Rate(Similarity Mode) = Show_Rate(Similarity Mode)old + Variation, where
    Variation = (NSelect/NS − 0.7) × (1 − exp(−N(Similarity Mode)/α)), if NSelect/NS ≥ 0.7,
    Variation = 0, if 0.3 < NSelect/NS < 0.7,
    Variation = (NSelect/NS − 0.3) × (1 − exp(−N(Similarity Mode)/α)), if NSelect/NS ≤ 0.3,
    α : weight change rate.
  Show_Rate(Similarity Mode)new = Max(Min(Show_Rate(Similarity Mode), 1), 0.01).

Fig. 15 Algorithm to update the show rate in similarity mode

In addition, each user will be asked to evaluate each solution FAQ in terms of the following five levels of understanding: very familiar, familiar, average, not familiar, and very unfamiliar. This provides explicit feedback, which we use to update his domain proficiency table; Fig. 16 shows the updating algorithm.

  F = {F1, F2, ...} ⊆ C, the FAQ collection; Fi ∈ F : FAQs selected and rated by the user.
  For each Fi, update the user's proficiency in each Concept j:
    Proficiency(Concept j) = Proficiencyold(Concept j) + α × (T(Concept j) / Number of concepts in Fi) × Understanding Level,
    Proficiencynew(Concept j) = Max(Min(Proficiency(Concept j), 1), 0),
  where α : learning rate, Concept j ∈ Ontology,
    T(Concept j) : the number of times Concept j appears in Fi,
    Understanding Level = +2 ("very familiar"), +1 ("familiar"), 0 ("average"), −1 ("not familiar"), −2 ("very unfamiliar").

Fig. 16 Algorithm to update the domain proficiency table

Finally, after each interaction session, we update the user's recommendation mode in this session in the respective time window.
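The Scorer's FAQ proficiency degree (Fig. 14) and the two update rules above (Figs. 15 and 16) can be sketched as follows. This is a minimal sketch: the dictionary-based interfaces and the default α values are our own illustrative choices, not parameters the paper specifies.

```python
import math

def faq_proficiency_score(question_concepts, user_proficiency):
    """Fig. 14: average the user's proficiency over the ontology concepts
    appearing in the question part of an FAQ."""
    if not question_concepts:
        return 0.0
    return sum(user_proficiency.get(c, 0.0) for c in question_concepts) / len(question_concepts)

def update_show_rate(show_rate_old, n_solutions, n_selected, alpha=5.0):
    """Fig. 15: adapt Show_Rate(Similarity Mode) from the ratio of selected
    to listed FAQs; alpha is the weight change rate."""
    n_shown = math.ceil(n_solutions * show_rate_old)   # FAQs actually displayed
    ratio = n_selected / n_solutions
    damping = 1.0 - math.exp(-n_shown / alpha)
    if ratio >= 0.7:
        variation = (ratio - 0.7) * damping            # user selects a lot: show more
    elif ratio <= 0.3:
        variation = (ratio - 0.3) * damping            # user selects little: show less
    else:
        variation = 0.0                                # middle band: no change
    return max(min(show_rate_old + variation, 1.0), 0.01)

def update_proficiency(proficiency, concept_counts, n_concepts_in_faq, level, alpha=0.1):
    """Fig. 16: after the user rates a selected FAQ on the five-level
    understanding scale (level in {+2, +1, 0, -1, -2}), nudge the proficiency
    of every concept appearing in the FAQ; alpha is the learning rate."""
    updated = dict(proficiency)
    for concept, times in concept_counts.items():
        delta = alpha * (times / n_concepts_in_faq) * level
        updated[concept] = max(min(updated.get(concept, 0.0) + delta, 1.0), 0.0)
    return updated
```

Note how the damping factor 1 − exp(−N/α) in the show-rate rule makes the adjustment smaller when few FAQs were displayed, so a single sparse session cannot swing the preference strongly.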

At the same time, we add the query and FAQ-selection records of the user into the query history and selection history of his user model.

The third task of the User Model Manager is to update the user stereotypes. This happens when a sufficient number of user models in a stereotype have undergone changes. First, we reflect these changes by re-clustering all affected user models, as shown in Fig. 17, and then re-calculate all parameters in each stereotype; Fig. 18 shows an example.

  For each useri in each user proficiency group
  {
    Proficiencyavg(useri) = Σj Proficiency(Concept j) / Number of concepts in the Domain Proficiency Table,
    where Concept j : the jth concept in useri's Domain Proficiency Table;
    if 0.8 ≤ Proficiencyavg(useri) ≤ 1.0, useri is reassigned to the Expert group;
    if 0.6 ≤ Proficiencyavg(useri) < 0.8, useri is reassigned to the Senior group;
    if 0.4 ≤ Proficiencyavg(useri) < 0.6, useri is reassigned to the Junior group;
    if 0.2 ≤ Proficiencyavg(useri) < 0.4, useri is reassigned to the Novice group;
    if 0.0 ≤ Proficiencyavg(useri) < 0.2, useri is reassigned to the Amateur group.
  }

Fig. 17 Algorithm to re-cluster all user groups

  SExpert : stereotype of Expert. U = {U1, U2, ..., UN}, the users in the Expert group.
  SExpert.Show_Rate(Similarity Mode) = Σi Ui.Show_Rate(Similarity Mode) / N.
  SExpert.Show_Rate(Proficiency Mode) = Σi Ui.Show_Rate(Proficiency Mode) / N.
  For each Concept j in the Domain Proficiency Table (DPT):
    SExpert.DPT.Proficiency(Concept j) = Σi Ui.DPT.Proficiency(Concept j) / N.

Fig. 18 Example of updating the stereotype of Expert

3.2.7 Recommender

The Recommender uses the following three policies to recommend information. 1) High-hit FAQs: it recommends the first N solution FAQs according to their selection counts from all users in the same group within a time window. 2) Hot-topic FAQs: it recommends the first N solution FAQs according to their popularity, calculated as statistics on the keywords appearing in the query histories of the same-group users within a time window; Fig. 19 shows the hot degree calculation. 3) Collaborative recommendation: it refers to the selection histories of the users in the same group to provide solution recommendation. The basic idea is this: if user A and user B are in the same group and the first n interaction sessions of user A are the same as those of user B, then we can recommend the highest-rated FAQs in the (n+1)th session of user A to user B; the detailed algorithm is shown in Fig. 20.

  For each faqi ∈ C, the FAQ collection:
    HOT Score(faqi) = (1 / Number of concepts appearing in faqi) × Σj Appearance(Concept j) × Weight(Concept j),
  where Concept j ∈ Ontology,
    Appearance(Concept j) = 1 if Concept j appears in faqi, and 0 otherwise,
    Weight(Concept j) : the number of times Concept j appears in the queries of the same user group within a time window.

Fig. 19 Algorithm to calculate the hot degree

  G = {U1, U2, ..., Um}, the users in the same group G; Y = {Uj ∈ G | Uj ≠ Ui}.
  FAQ Collection = {F1, F2, ..., Fn}.
  Ui = {Si,1, Si,2, ..., Si,x}, the query sessions of user i.
  Si,x = {Ri,x,1, Ri,x,2, ..., Ri,x,n}, the FAQs rated and/or selected by user i in session x, where
    Ri,x,y = 1 if user i selects Fy in session x and rates it more than satisfying;
    Ri,x,y = 0.8 if he rates it satisfying;
    Ri,x,y = 0.6 if he rates it average or gives no rating;
    Ri,x,y = 0.4 if he rates it dissatisfying;
    Ri,x,y = 0.2 if he rates it less than dissatisfying;
    Ri,x,y = 0 if user i does not select Fy in session x.
  Similarity(Si,x, Sj,k) = argmax over Sj,k of ΣUj∈Y ΣSj,k∈Uj (1 − Distance(Si,x, Sj,k)),
  where Distance(Si,x, Sj,k) = (1/n) Σz=1..n (Ri,x,z − Rj,k,z)².
  Let Sj,k be the session most similar to Si,x, and let Si,x+1 be user i's current query session.
  Recommend the following FAQs for Si,x+1:
    {Fy | Rj,k,y > 0.5 and Ri,x,y = 0} ∪ {Fy | Rj,k+1,y > 0.5 and Ri,x,y = 0}.

Fig. 20 Algorithm for collaborative recommendation

4 System Demonstration and Experiments

4.1 System demonstration

When a new user enters the system, he is registered by the Agent as shown in Fig. 21. At the same time, a questionnaire is produced by the Agent for evaluating the user's domain proficiency. His answers are then collected and scored to help build an initial user model for the new user.

Fig. 21 System register interface (basic user information and proficiency questionnaire)

The user can then enter the main tableau of our system (Fig. 22), which consists of three major tab-frames, namely query interface, solution presentation, and logout. The query interface tab is comprised of four frames: the user interaction interface, the automatic keyword scrolling list, the FAQ recommendation list, and the PC ontology tree. The user interaction interface supports both keyword and NLP (Natural Language Processing) query modes. The keyword query mode provides the lists of question types and operation types, which allow the users to express their precise intentions. The automatic keyword scrolling list provides ranked-keyword guidance for user queries. A user can browse the PC ontology tree to learn domain knowledge. The FAQ recommendation list provides personalized information recommendations from the system in three modes: hit, hot topic, and collaboration.

Fig. 22 Main tableau of our system (user interaction interface, automatic keyword scrolling list, FAQ recommendation list, and PC ontology tree)
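The hot-degree calculation (Fig. 19) and the collaborative recommendation (Fig. 20) can be sketched as follows. This is a simplified sketch: sessions are represented as plain rating vectors indexed by FAQ, and the group-wide structures are ordinary dictionaries; these representations are our own assumptions for illustration.

```python
def hot_score(faq_concepts, group_keyword_counts):
    """Fig. 19: popularity of an FAQ, averaging how often its concepts
    appear in the same group's query histories within the time window."""
    if not faq_concepts:
        return 0.0
    return sum(group_keyword_counts.get(c, 0) for c in faq_concepts) / len(faq_concepts)

def session_distance(r_a, r_b):
    """Fig. 20: mean squared difference between two sessions' FAQ rating
    vectors (entries Ri,x,y range from 0 = not selected up to 1)."""
    return sum((a - b) ** 2 for a, b in zip(r_a, r_b)) / len(r_a)

def collaborative_recommend(current, other_histories):
    """Fig. 20: find the most similar past session among same-group users and
    recommend the FAQs rated above 0.5 in it (and in its follow-up session)
    that the current user has not selected yet."""
    best = None
    for sessions in other_histories.values():
        for k, past in enumerate(sessions):
            d = session_distance(current, past)
            if best is None or d < best[0]:
                best = (d, sessions, k)
    if best is None:
        return []                               # no same-group history at all
    _, sessions, k = best
    candidates = sessions[k:k + 2]              # session k and, if any, session k+1
    return sorted({y for past in candidates
                     for y, r in enumerate(past)
                     if r > 0.5 and current[y] == 0})
```

The union over sessions k and k+1 mirrors the two sets in Fig. 20: the best-matched session supplies FAQs the similar user liked at the same point, and its follow-up session supplies what he went on to like next.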

When the user clicks a mode, the corresponding popup window is produced by the system.

The solution presentation tab is illustrated in Fig. 23. It pre-selects the solution ranking method according to the user's preference and hides part of the solutions according to his Show_Rate to reduce the user's cognitive load. The user can switch the solution ranking method between similarity ranking and proficiency ranking. The user can click the question part of an FAQ (Fig. 24) to display its content or to give feedback, which covers the satisfaction degree and the comprehension degree. Fig. 25 illustrates the window shown before system logout, which asks the user to fill in a questionnaire; the collected statistics help further system improvement.

Fig. 23 Solution presentation (ranking order, show-all, and return-in-e-mail options)

Fig. 24 FAQ selection and feedback elicitation

Fig. 25 System logout (a questionnaire for statistics)

4.2 System Evaluation

The evaluation of the overall performance of our system involves lots of manpower and is time-consuming. Here, we focus on the performance evaluation of the most important module, i.e., the Query Parser. Our philosophy is that if it can precisely parse user queries and extract the true query intention and focus from them, then we can effectively improve the quality of the retrieved solutions. Recall that the Query Parser employs template-based pattern matching to understand user queries and that the templates were manually constructed from 1215 FAQs. In the first experiment, we used these same FAQs as testing queries in order to verify whether any conflicts exist within the query templates. Table 5 illustrates the experimental results: only 33 queries match more than one query pattern and result in confusion of query intention, called "error" in the table. These errors may be corrected by the user. The experiment shows that the effectiveness rate of the constructed query templates reaches 97.28%, which implies the template base can serve as an effective knowledge base for natural language query processing.

Table 5 Effectiveness of constructed query patterns
  #Testing: 1215, #Correct: 1182, #Error: 33, Precision Rate: 97.28%

Our second experiment examines how well the Parser understands new queries. First, we collected in total 143 new FAQs, different from the FAQs used for constructing the query templates, from four famous motherboard makers in Taiwan: ASUS, GIGABYTE, MSI, and SIS. We then used the question parts of those FAQs as testing queries. Our experiments show that we can precisely extract the true query intentions and focuses from 112 FAQs. The remaining 31 FAQs contain three or more sentences per query, which explains why we failed to understand them. In summary, 78.3% (112/143) of the new queries can be successfully understood.

Finally, Table 6 shows the comparison of the user satisfaction of our system prototype against other search engines. In the table, ST (Satisfaction of Testers) represents the average of the satisfaction responses from 10 ordinary users, while SE (Satisfaction of Experts) represents that of the satisfaction responses from 10 experts. Each search engine receives 100 queries and returns the first 100 webpages for satisfaction evaluation by both experts and non-experts. The table shows that our approach, the last row, enjoys the highest average satisfaction. From the evaluation, we conclude that, unless the competing search engines are specifically tailored to this specific domain, such as HotBot and Excite, our techniques in general retrieve more correct webpages in almost all classes.

Table 6 User satisfaction evaluation (SE / ST, in %)
  Method        CPU       MOTHERBOARD  MEMORY    AVERAGE
  AltaVista     63 / 61   77 / 78      30 / 21   57 / 53
  Excite        66 / 62   81 / 81      50 / 24   66 / 56
  Google        66 / 64   81 / 80      38 / 21   62 / 55
  HotBot        69 / 63   78 / 76      62 / 31   70 / 57
  InfoSeek      69 / 70   71 / 70      49 / 28   63 / 56
  Lycos         64 / 67   77 / 76      36 / 20   59 / 54
  Yahoo         67 / 61   77 / 78      38 / 17   61 / 52
  Our approach  78 / 69   84 / 78      45 / 32   69 / 60

5 Discussions and Future work

We have developed an Interface Agent to work as an assistant between the users and the system; its architecture and implementation differ from our previous work [12]. It is used to retrieve FAQs in the PC domain. We integrated several interesting techniques, including user modeling, domain ontology, and template-based linguistic processing, to effectively tackle the problems associated with traditional FAQ retrieval systems. In short, our work features an ontology-supported, template-based user modeling technique for developing interface agents; a natural language query mode, along with an improved keyword-based query mode; and assistance and guidance for human-machine interaction.

Our preliminary experimentation demonstrates that the user intention and focus of up to eighty percent of the user queries can be correctly understood by the system, which accordingly provides query solutions with higher user satisfaction. In the future, we plan to employ machine learning and data mining techniques to automate the construction of the template base. As for the overall system evaluation, we plan to apply the concept of usability evaluation from the domain of human factors engineering to evaluate the performance of the user interface.

Acknowledgements
The author would like to thank Yai-Hui Chang and Ying-Hao Chiu for their assistance in system implementation. This work was supported by the National Science Council, R.O.C., under Grant NSC-95-2221-E-129-019.

References:
[1] Chandrasekaran, B., Josephson, J.R., and Benjamins, V.R., What Are Ontologies, and Why Do We Need Them? IEEE Intelligent Systems, Vol. 14, No. 1, 1999, pp. 20-26.
[2] Chiu, Y.H., An Interface Agent with Ontology-Supported User Models, Master Thesis, Department of Electronic Engineering, National Taiwan University of Science and Technology, Taiwan, R.O.C., 2003.
[3] Hovy, E., Hermjakob, U., and Ravichandran, D., A Question/Answer Typology with Surface Text Patterns, Proc. of the DARPA Human Language Technology Conference, San Diego, CA, USA, 2002, pp. 247-250.
[4] Noy, N.F. and McGuinness, D.L., Ontology Development 101: A Guide to Creating Your First Ontology, Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, 2001.
[5] Rich, E., User Modeling via Stereotypes, Cognitive Science, Vol. 3, 1979, pp. 329-354.
[6] Salton, G., Wong, A., and Yang, C.S., A Vector Space Model for Automatic Indexing, Communications of the ACM, Vol. 18, No. 11, 1975, pp. 613-620.
[7] Salton, G. and McGill, M.J., Introduction to Modern Information Retrieval, McGraw-Hill, New York, USA, 1983.
[8] Soubbotin, M.M. and Soubbotin, S.M., Patterns of Potential Answer Expressions as Clues to the Right Answer, Proc. of the TREC-10 Conference, NIST, Gaithersburg, MD, USA, 2001, pp. 293-302.
[9] Tsai, C.H., MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm, available at http://technology.chtsai.org/mmseg/, 2000.
[10] Winiwarter, W., Adaptive Natural Language Interface to FAQ Knowledge Bases, International Journal on Data and Knowledge Engineering, Vol. 35, 2000, pp. 181-199.
[11] Yang, S.Y. and Ho, C.S., Ontology-Supported User Models for Interface Agents, Proc. of the 4th Conference on Artificial Intelligence and Applications, Chang-Hwa, Taiwan, 1999, pp. 248-253.
[12] Yang, S.Y. and Ho, C.S., An Intelligent Web Information Aggregation System Based upon Intelligent Retrieval, Filtering and Integration, Proc. of the 2004 International Workshop on Distance Education Technologies, San Francisco Bay, CA, USA, 2004, pp. 451-456.
[13] Yang, S.Y., Chiu, Y.H., and Ho, C.S., Ontology-Supported and Query Template-Based User Modeling Techniques for Interface Agents, Proc. of the 12th National Conference on Fuzzy Theory and Its Applications, I-Lan, Taiwan, 2004, pp. 181-186.
[14] Yang, S.Y., Chuang, F.C., and Ho, C.S., Ontology-Supported FAQ Processing and Ranking Techniques, accepted for publication in International Journal of Intelligent Information Systems, 2005.
[15] Yang, S.Y., Liao, P.C., and Ho, C.S., A User-Oriented Query Prediction and Cache Technique for FAQ Proxy Service, Proc. of the 2005 International Workshop on Distance Education Technologies, Banff, Canada, 2005, pp. 411-416.
[16] Yang, S.Y., Liao, P.C., and Ho, C.S., An Ontology-Supported Case-Based Reasoning Technique for FAQ Proxy Service, Proc. of the 17th International Conference on Software Engineering and Knowledge Engineering, Taipei, Taiwan, 2005, pp. 639-644.
[17] Yang, S.Y., FAQ-master: A New Intelligent Web Information Aggregation System, International Academic Conference 2006 Special Session on Artificial Intelligence Theory and Application, Tao-Yuan, Taiwan, 2006, pp. 2-12.
[18] Yang, S.Y., An Ontology-Supported and Query Template-Based User Modeling Technique for Interface Agents, 2006 Symposium on Application and Development of Management Information System, Taipei, Taiwan, 2006, pp. 168-173.
[19] Yang, S.Y., An Ontology-Supported Website Model for Web Search Agents, accepted for presentation at the 2006 International Computer Symposium, Taipei, Taiwan, 2006.
[20] Yang, S.Y., How Does Ontology Help Information Management Processing, WSEAS Transactions on Computers, Vol. 5, No. 9, 2006, pp. 1843-1850.
