Build Ontologies RPD-ER2002
ID 21
1. Introduction
An ontology is an explicit conceptualization of a domain of discourse, and thus provides a shared and common understanding of the domain [Reim01]. We have been producing ontologies for millennia to understand and explain our rationale and environment. From Plato's philosophical framework to modern-day classification systems, ontologies are, in most cases, the product of extensive analysis and categorization.
Only recently has the process of building ontologies become a research topic of interest. Today, ontologies are built very much ad-hoc. A terminology is first developed providing a controlled vocabulary for the subject area or domain of interest, then it is organized into a taxonomy where key concepts are identified, and finally these concepts are defined and related to create an ontology.
The intent of this paper is to show that domain analysis methods can be used for building ontologies. Domain analysis aims at generic models that represent groups of similar systems within an application domain. In this sense, it deals with categorization of common objects and operations, with clear, unambiguous definitions of them and with defining their relationships.
1.1. Background
Typically, the goal of building ontologies is to create a logical framework, a philosophy, a classification, or to develop a common understanding in a discipline. The goal determines the extent and complexity of the process. Creating an ontology intended only to provide a basic understanding of a domain may require less effort than an ontology intended for supporting formal logic arguments and proofs in a domain.
Rubén Prieto-Díaz
2/2002
ER2002
Answering questions such as "Why are we building an ontology?" and "What do we want to use it for?" is the first step in creating an ontology.
A skeletal methodology for building ontologies has been proposed and tested by Uschold and King [Usch95]. This attempt to formalize the ad-hoc process consists of the following steps:
- Identify Purpose
- Build Ontology
  o Ontology capture
  o Ontology coding
  o Ontology integration
- Evaluate Ontology
- Document Ontology
There are three sub-steps in the Build Ontology process:
1- Ontology capture is the identification and definition of key concepts and relationships in the domain of interest and of the terms that refer to such concepts.
2- Ontology coding deals with formalizing such definitions and relationships in some formal language.
3- Ontology integration deals with associating key concepts and terms in the ontology with concepts and terms of other ontologies; that is, incorporating concepts and terms from other domains.
Uschold et al. used this approach to create an Enterprise Ontology [Usch98].
The TOVE (TOronto Virtual Enterprise) project from the University of Toronto's Enterprise Integration Laboratory has developed and tested several ontologies for modeling enterprises¹ [Fox94]. TOVE's approach to engineering ontologies consists of four basic steps [Fox98], in essence very similar to the four steps proposed by Uschold and King above:
- Define ontology requirements.
- Define the terminology of the ontology (objects, attributes and relations).
- Specify the definitions and constraints on the terminology.
- Test the competency of the ontology.
Building ontologies is more difficult than it seems. In a special issue of Communications of the ACM on Ontology Engineering, Gruninger and Lee [Grun02] indicate that building ontologies is difficult, time-consuming, and expensive. It involves more than
the steps in the above two approaches: it also requires consensus building. Stemming from this difficulty, Holsapple and Joshi [Hols02] have proposed a collaborative approach to ontology design. Despite the need for consensus building, the first two steps in Uschold and King's and in TOVE's approaches are essential. In this paper we argue that domain analysis provides a method and techniques for supporting these first two steps, in particular the three sub-steps in Uschold and King's Build Ontology process.
This paper describes how a domain analysis method can be used for building the basis for ontologies. Section 2 relates both processes and makes the case that domain analysis can be used for building ontologies. Section 3 illustrates a step in the domain analysis method for identifying and categorizing concepts borrowed from Library Science. Section 4 describes how the faceted approach from Library Science is incorporated into the domain analysis method. Section 5 gives an overview of the method and describes a tool for automating parts of the process.
Systematic methods for domain analysis, such as FODA [Kang90] and DARE [Frak98], intend to formalize the ad-hoc nature of the process. We claim that the process for building a narrow ontology is almost identical to the process for building a broad domain model.
¹ http://www.eil.utoronto.ca/enterprise-modelling/tove/index.html
If conducted ad-hoc, the domain analysis process allows knowledge of a domain to evolve naturally over time until enough experience has been accumulated and several systems have been implemented. At that point, common objects and operations can be identified and generic abstractions can be isolated and reused. As we will see, identifying objects and operations common to a family of systems, categorizing them, and abstracting their commonalities is equivalent to the first sub-step of Uschold and King's ontology building process: ontology capture. Specifying reusable components that implement generic functions as defined by the abstractions in the domain model is equivalent to the second sub-step: ontology coding. Lastly, ontology integration is equivalent to incorporating definitions of components from other domains capable of carrying out some of the functionality in the domain being considered. The table below illustrates how little difference there is between an ontology and the product of a domain analysis: a domain model.

Feature                        Domain Model               Ontology
Controlled Vocabulary          Yes                        Yes
Taxonomy                       Yes                        Yes
Thesaurus                      Yes                        Yes
Abstract Concept Definitions   Informal                   Formal
Semantic Relationships         Yes                        Yes
Multiple Viewpoint Models      Yes                        Yes
Axioms                         No                         Yes
Cross-Domain Associations      Implicit (via thesaurus)   Explicit
Formal Notation                No                         Yes
Ontology capture, in particular, can be completely realized through domain analysis thus providing sufficient information and concepts to facilitate the following tasks of ontology building: ontology coding, integration and formal documentation. In other words, the ontologies produced through domain analysis include the basic concepts and relationships that make them usable and practical, and provide the basis for further refinement.
where the meaning of each concept is defined by specifying properties, relations to other concepts, and axioms narrowing down the interpretation.
A classification scheme in Library Science is a tool for the production of systematic order based on a controlled and structured index vocabulary. This index vocabulary is called the classification schedule and it consists of a set of names or terms representing concepts or classes, listed in systematic order, to display relationships among classes.
A classification scheme and its respective schedule can, then, be considered an extended taxonomy or a reduced ontology. As an extended taxonomy it goes beyond a mere arrangement of categories, since it includes relationships among categories and some brief definitions. Thesaurus-like associations provide some of the relationships. As a reduced ontology, it lacks formal definitions of concepts and axioms.
Based on this analysis and the similarity between domain analysis and building ontologies, techniques for deriving classification schemes can be used for systematically initiating the creation of ontologies.
The focus of our discussion in Section 3 is on an approach for the identification, structuring and definition of concepts and terms of a domain adopted from Library Science for creating faceted classification schemes for special collections. This approach is part of the DARE method [Frak98].
3. Building Ontologies
3.1. Classification in Library Science
A classification scheme must be able to express hierarchical relationships as well as relationships created to relate two or more concepts belonging to different hierarchies. Hierarchical relationships are based on the principle of subordination or inclusion and are typical in a taxonomy. Relationships among concepts are presented as compounded classes. For example, the compounded class "reproduction of reptiles" relates the term
"reproduction" from the class "processes" with the term "reptiles" from the class "taxonomy".
Two types of classification schemes are used in Library Science: enumerative and faceted. The enumerative (or traditional) method postulates a universe of knowledge divided into successively narrower classes that include all possible subclasses and compounded classes. The Dewey Decimal system is a typical example of an enumerative hierarchy where all possible classes are predefined. These schemes are called enumerative because the predefined classes are listed ready-made in a classification schedule.
The faceted approach, proposed by Ranganathan in 1939 [Rang67], relies not on the breakdown of a universe of knowledge, but on building up or synthesizing from the subject statements of particular documents. Subject statements are analyzed into their component elemental classes selected from the schedule. The classifier using such a scheme expresses a compound class by assembling its elemental classes. This process is called synthesis. The arranged groups of elemental classes making up the scheme are called facets. Facets can be construed as perspectives, viewpoints, or dimensions of a particular domain.
A faceted scheme therefore provides a controlled vocabulary, in the form of terms arranged systematically by facets, and a set of rules on how to combine such terms to define conceptual descriptors (i.e., categories).
considered as a domain specific language used to express activities in the domain of the specialized library. Literary Warrant considers that titles capture best, in a simplified form, the concepts in a document and they are used as representative subject statements.
We illustrate this process with an example from [Buch79]. Assume we are asked to build a classification for a list of zoology-related titles (i.e., books). The first step is to select a representative sample from the collection. Let us assume we select the following titles:
   Essays on the physiology of marine fauna
   Animals of the mountains
   Amphibious animals
   Desert reptiles
   Migratory Birds
   Salt water fish
   Mammalian Reproduction
   Snakes of the Amazon River
   Experimental reports on the respiration of vertebrates
   Tropical leaf moths
The next step is to group common terms together (i.e., conceptual clustering):
   physiology, reproduction, respiration
   tropical, desert, mountains, salt water
   marine, amphibious
   fauna, animals, vertebrates, reptiles, snakes, birds, fish, moths, mammals
   essays, experimental reports
These five groups are the initial facets in our special collection of zoology books. Each group is named by the general concept it represents:
   by process
   by habitat
   by element
   by taxonomy
   by literary form
These five facets are ordered by their relevance to the users of the collection, and terms within each group are listed in a logical order. It is assumed in this example that the users of the collection are mainly ecologists, making habitat the most relevant facet. The domain or subject area is animals/fauna and the resulting faceted classification scheme is shown below.
{by habitat}
   land
      tropic
      desert
      mountain
      :
      :
   water
      sea
      river
         Amazon
      :
      :

{by taxonomy}
   animals/fauna
      invertebrates
         insects
            moths
         :
      vertebrates
         mammals
            :
         birds
            :
         reptiles
            snakes
            :
         fish
As new titles enter the collection, new facets may be defined and new terms added to the scheme, thus extending and enriching the faceted scheme. This is the core of Literary Warrant. [Buch79] and [Vick60] present detailed tutorials.
This scheme is a structured controlled vocabulary that can be used systematically to define each title of the collection. Each title can now be reduced to a normal form of terms from each facet. To describe a title using this scheme we match by order of relevance each term in the title to the term in the scheme. For example:
The title "Essays on the physiology of marine fauna" can be represented (i.e., classified) by the following set of terms selected from the faceted scheme:
   /null/marine/animals/physiology/essays/
The first entry is null because there is no term (i.e., concept) in the title that corresponds to any concept in the habitat facet. The remaining terms are selected by conceptually matching keywords in the title to facet terms in the scheme. Similarly, the titles below are normalized (i.e., classified) following the same process:
   Animals of the mountains      /mountain/null/animals/null/null/
   Desert reptiles               /desert/null/reptiles/null/null/
This is the process of synthetic classification using a faceted classification scheme. The scheme provides a vocabulary and some basic rules for converting titles to a normalized set of concepts.
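The synthesis step can be sketched in Python as follows. This is a minimal illustration rather than the procedure given in [Buch79]: the facet order (habitat, element, taxonomy, process, literary form) follows the example descriptors, while the names FACETED_SCHEME and classify, the whole-word matching rule, and the keyword-to-term mapping (for instance "fauna" to "animals") are assumptions introduced here.

```python
import re

# Hypothetical encoding of the zoology scheme: one (facet, vocabulary) pair per
# facet, in order of relevance. Each vocabulary maps a keyword as it may appear
# in a title to the term used in the scheme (an assumed mapping).
FACETED_SCHEME = [
    ("habitat", {"tropical": "tropic", "desert": "desert",
                 "mountains": "mountain", "salt water": "salt water"}),
    ("element", {"marine": "marine", "amphibious": "amphibious"}),
    ("taxonomy", {"fauna": "animals", "animals": "animals",
                  "vertebrates": "vertebrates", "reptiles": "reptiles",
                  "snakes": "snakes", "birds": "birds", "fish": "fish",
                  "moths": "moths", "mammalian": "mammals"}),
    ("process", {"physiology": "physiology", "reproduction": "reproduction",
                 "respiration": "respiration"}),
    ("literary form", {"essays": "essays",
                       "experimental reports": "experimental reports"}),
]


def classify(title):
    """Reduce a title to a normalized descriptor, one term (or null) per facet."""
    text = title.lower()
    terms = []
    for _facet, vocabulary in FACETED_SCHEME:
        match = "null"  # no concept in the title corresponds to this facet
        for keyword, term in vocabulary.items():
            # Whole-word (or whole-phrase) match of the keyword in the title.
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                match = term
                break
        terms.append(match)
    return "/" + "/".join(terms) + "/"


for title in ("Essays on the physiology of marine fauna",
              "Animals of the mountains",
              "Desert reptiles"):
    print(title, "->", classify(title))
```

Titles that yield the same descriptor fall into the same category, which is what makes the normalized form useful for grouping.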
In summary, we have produced, from the bottom up, a knowledge structure that can be used to generate normalized descriptors of statements that facilitate their categorization: titles sharing the same facet terms belong to the same category. This can also be seen as a primitive domain language. In [Prie87] we used function descriptors, instead of book titles, to generate a faceted classification scheme for software components as part of a reuse library system. Function descriptors are, in most cases, one-liners defining a software function. In some cases a function descriptor is the title of the function. Function descriptors were found to be representative subject statements.
In the bottom-up process keywords and phrases are extracted from domain documents using standard text analysis tools. The Literary Warrant technique is then used to build a domain specific faceted classification scheme. The resulting scheme is used to group phrases into categories thus creating clusters that represent concepts in the domain.
This approach is consistent with Lakoff's view of experiential realism and the concept of prototype-based categories, which is based on embodied cognitive models [Lako90]. Lakoff argues and demonstrates that ontologies are based on our perception of reality and that any effort to build ontologies must follow such perceptions rather
than be built on abstract concepts imposed from rational arguments. Lakoff's position seems to be accepted by many.
[Figure: top-down and bottom-up ontology building process. Recoverable labels: Domain Knowledge, Domain Experience, Consult Experts, Postulate Ontology, Domain Ontology, Analyze Text, Classify Phrases, Classification Scheme, Synthesized Categories.]
Next, the synthesized clusters are compared to the postulated concepts, and an iterative process is carried out in which the ontology is modified and adjusted to match the bottom-up clusters. Figures 2 and 3 illustrate how this method is carried out.
[Figure: the top-down Postulated Ontology compared with the bottom-up Synthesized Categories.]
[Figure: mapping of synthesized clusters (A, B, D, E, F) to postulated ontology concepts (s, t, v, w, x).]
Mapping clusters to proposed concepts may require modifications to the ontology and the concept clusters. Some of the following situations may arise:
- Cluster A maps to concepts s and t (see Figure 3). In this case cluster A can be broken into two separate concepts, one matching s and another matching t; alternatively, either s or t can be deleted from the ontology, keeping only one link to A from either the parent of t or the parent of s.
- Clusters B and C map to the single concept u. In this case clusters B and C can be merged to represent concept u, or concept u can be partitioned into two different concepts.
- Elements from clusters D and E, together with cluster F, map to concept x. In this case a new cluster can be created to map to concept x, or concept x can be deleted and cluster F merged into D and E.
These examples illustrate how a postulated ontology is modified and validated by concepts extracted from domain documents.
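Detecting the first two situations can be sketched in Python. This is only an illustration under stated assumptions: the cluster-to-concept mapping is taken as given (in practice it comes from an analyst's judgment), the function name find_mismatches is hypothetical, and the partial-overlap situation involving concept x is not modeled.

```python
from collections import defaultdict


def find_mismatches(cluster_to_concepts):
    """Flag clusters and concepts that do not map one-to-one.

    cluster_to_concepts maps each cluster name to the set of postulated
    concepts it was matched against (an assumed input format).
    """
    # Invert the mapping so we can see which clusters reach each concept.
    concept_to_clusters = defaultdict(set)
    for cluster, concepts in cluster_to_concepts.items():
        for concept in concepts:
            concept_to_clusters[concept].add(cluster)

    # A cluster mapped to several concepts: split the cluster or drop a concept.
    split_candidates = {cluster: sorted(concepts)
                        for cluster, concepts in cluster_to_concepts.items()
                        if len(concepts) > 1}
    # Several clusters mapped to one concept: merge them or partition the concept.
    merge_candidates = {concept: sorted(clusters)
                        for concept, clusters in concept_to_clusters.items()
                        if len(clusters) > 1}
    return split_candidates, merge_candidates


mapping = {"A": {"s", "t"}, "B": {"u"}, "C": {"u"}}
splits, merges = find_mismatches(mapping)
print("split or prune:", splits)
print("merge or partition:", merges)
```

The function only reports candidates; choosing between modifying the ontology and modifying the clusters remains a decision for the analyst in each iteration.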
Automated text processing is a mature technology. The concepts, techniques and algorithms adopted by DARE are mainly from Frakes and Baeza-Yates' textbook [Frak92]. The text from a document is broken into words by a tokenizing operation (i.e., lexical analysis) and all stop words are removed. The resulting keywords are placed in an internal index and used to generate stems. An inverted index is also created for traceability of keywords to their original sources. Phrases are simplified to keywords and stems.
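The pipeline just described (tokenize, remove stop words, stem, build an inverted index) can be sketched in a few lines of Python. This is not DARE's implementation: the stop-word list is a small hypothetical sample, and naive_stem only strips a plural "s", standing in for the real stemming algorithms covered in [Frak92].

```python
import re
from collections import defaultdict

# Hypothetical, deliberately short stop-word list for illustration.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def tokenize(text):
    """Lexical analysis: lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Crude stand-in for a stemmer: strip a trailing plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]
    return word


def build_inverted_index(documents):
    """Map each stem to the ids of the documents it came from (traceability)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            if token not in STOP_WORDS:
                index[naive_stem(token)].add(doc_id)
    return dict(index)


docs = {
    "t1": "Desert reptiles",
    "t2": "Snakes of the Amazon River",
    "t3": "Reptiles and snakes",
}
for stem, sources in sorted(build_inverted_index(docs).items()):
    print(stem, sorted(sources))
```

Keeping document ids in the index is what lets a keyword in a cluster be traced back to the source statements that produced it.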
In essence, the clustering editor combines two proven syntactic techniques for generating quasi-semantic clusters.
A cluster is a group of terms and phrases whose similarities fall within a specific, user-defined conceptual distance from each other. Clusters are represented by a single central term with radial terms extending outward from it. The user can move terms to strengthen or weaken their relationships, bring terms from the keyword list into a cluster or remove terms from it, and merge or rearrange clusters.
The DARE method recommends building a thesaurus for each facet. Very often terms in a cluster are synonyms of each other or represent essentially the same concept. In these cases a term is selected to represent the concept. We call this selected synonym the canonical term, which is similar to Roschs prototype [Lako90]. Each facet therefore includes canonical terms in its list and each term represents a concept.
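A per-facet thesaurus of this kind can be represented as a simple lookup from synonyms to canonical terms. The sketch below is an assumption about one possible representation, not the DARE tool's data structure; the sample entries and the function names build_lookup and canonicalize are hypothetical.

```python
def build_lookup(thesaurus):
    """Invert {canonical term: [synonyms]} into {any term: canonical term}."""
    lookup = {}
    for canonical, synonyms in thesaurus.items():
        lookup[canonical] = canonical
        for synonym in synonyms:
            lookup[synonym] = canonical
    return lookup


def canonicalize(term, lookup):
    """Return the canonical term for a keyword, or the keyword itself if unknown."""
    key = term.lower()
    return lookup.get(key, key)


# Hypothetical thesaurus for a taxonomy facet.
taxonomy_thesaurus = {"animals": ["fauna"], "mammals": ["mammalian"]}
lookup = build_lookup(taxonomy_thesaurus)
print(canonicalize("Fauna", lookup))
print(canonicalize("moths", lookup))
```

Normalizing every keyword through such a lookup before classification ensures that each concept in a facet is represented by exactly one term.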
The categorization effectiveness of the faceted scheme is determined in part by the number of facets and by the number of terms in each facet. On the one hand, too few facets and terms limit the number of possible categories and decrease precision, but facilitate classification. On the other, too many facets and terms provide a large variety of categories and high precision, but make classification difficult. In our experience, a
practical and useful faceted scheme should have 4 to 7 facets and no more than five synonyms for any one canonical term.
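The trade-off can be made concrete by counting descriptors. Assuming a descriptor takes at most one term (or null) from each facet, the number of distinct descriptors is the product of (terms + 1) over all facets. The sketch below applies this to the term counts of the five zoology groups listed earlier; the function name and the one-term-per-facet rule are assumptions made for illustration.

```python
import math


def count_descriptors(terms_per_facet):
    """Number of distinct descriptors when each facet contributes one term or null."""
    return math.prod(count + 1 for count in terms_per_facet.values())


# Term counts taken from the five groups in the zoology example.
zoology = {"habitat": 4, "element": 2, "taxonomy": 9, "process": 3, "literary form": 2}
print(count_descriptors(zoology))  # 5 * 3 * 10 * 4 * 3 = 1800
```

Because the count grows multiplicatively, adding a facet or a few terms quickly enlarges the space of categories a classifier has to choose from.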
Use of the DARE tool demonstrated that software component descriptions and requirements definitions required very little human interaction. Such documents are usually written in a consistent style. Use of prose as in non-technical text typically demands a higher level of human interaction in the process. More details on our experiences using DARE are reported in [Frak98].
6. Conclusion
We have described a tool-assisted method for building the basis for ontologies adopted from domain analysis. The resulting ontologies do not include formal definitions of concepts and axioms. Instead, a structured controlled vocabulary is produced that defines concepts informally and indirectly by example. Concepts are defined by clusters of phrases and statements extracted from a body of textual experience.
One advantage of this approach is that it is practical and useful. The ontologies built by this method may not yet be comprehensive or formal enough for some purposes but they provide sufficient information and concepts to facilitate the task of ontology coding and formal documentation.
Our experiences using DARE and applying domain analysis methods have been generally very positive and include results with immediate applications such as domain models for command and control systems and for banking services. More research,
however, is needed on how to convert or evolve domain models into complete formalized ontologies.
7. Acknowledgments
This research was partially supported by Financial Systems Architects, New York. Special thanks to Sam Redwine for his comments and multiple reviews of this manuscript.
8. References
[Buch79] Buchanan, B. Theory of Library Classification. Clive Bingley, London, 1979.
[Fox93] Fox, M. et al. "A Common Sense Model of the Enterprise", Proceedings of the 2nd Industrial Engineering Research Conference, pp. 425-429, Norcross GA: Institute for Industrial Engineers, 1993.
[Fox98] Fox, M. et al. "An Organisation Ontology for Enterprise Modeling", in Simulating Organizations: Computational Models of Institutions and Groups, M. Prietula, K. Carley and L. Gasser (Eds.), Menlo Park CA: AAAI/MIT Press, pp. 131-152, 1998.
[Frak98] Frakes, W., Prieto-Díaz, R. and Fox, C. "DARE: domain analysis and reuse environment". In Annals of Software Engineering, (5)125-141, W. Frakes (Ed.), Baltzer Science Publishers, September 1998.
[Frak92] Frakes, W.B. and Baeza-Yates, R. (Eds.) Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1992.
[Grun02] Gruninger, M. and Lee, J. "Ontology Applications and Design". Introductory article to a special issue on Ontology Engineering. Communications of the ACM 45(2):39-41, February 2002.
[Hols02] Holsapple, C.W. and Joshi, K.D. "A Collaborative Approach to Ontology Design". Communications of the ACM 45(2):42-47, February 2002.
[Kang90] Kang, K. et al. Feature-Oriented Domain Analysis (FODA) Feasibility Study. CMU/SEI-90-TR-21, Software Engineering Institute, Pittsburgh, PA, November 1990.
[Lako90] Lakoff, G. Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. The University of Chicago Press, 1990. (First published 1984)
[Maar91] Maarek, Y., Berry, D. and Kaiser, G. "An information retrieval approach for automatically constructing software libraries". IEEE Transactions on Software Engineering, 17(8):800-813, August 1991.
[Prie87] Prieto-Díaz, R. and Freeman, P. "Classifying software for reusability". IEEE Software, 4(1):6-16, January 1987.
[Prie90] Prieto-Díaz, R. "Domain analysis: an introduction". ACM SIGSOFT Software Engineering Notes 15(2):47-54, April 1990.
[Rang67] Ranganathan, S.R. Prolegomena to Library Classification. Asian Publishing House, Bombay, India, 1967.
[Reim01] Reimer, U. Tutorial on Organizational Memories for Capturing, Sharing and Utilizing Knowledge. International Conference on Enterprise Information Systems, ICEIS 2001, Setubal, Portugal, July 7-10, 2001. http://research.swisslife.ch/~reimer/OM_Tutorial/index.html
[Usch95] Uschold, M. and King, M. Towards a Methodology for Building Ontologies. AIAI-TR-183, University of Edinburgh, Edinburgh EH1 1HN, 1995. Presented at the Workshop on Basic Ontological Issues in Knowledge Sharing, IJCAI95, Montreal. http://www.aiai.ed.ac.uk/~entprise/enterprise/ontology.html
[Usch98] Uschold, M. et al. "The Enterprise Ontology". The Knowledge Engineering Review, Vol. 13, Special Issue on Putting Ontologies to Use (eds. Mike Uschold and Austin Tate), 1998. Also available from AIAI as AIAI-TR-195 at: http://www.aiai.ed.ac.uk/~entprise/enterprise/ontology.html
[Vick60] Vickery, B.C. Faceted Classification: A Guide to Construction and Use of Special Schemes. Aslib, 3 Belgrave Square, London, 1960.
Abstract
An ontology can be defined as a conceptualization of a domain or subject area, typically captured in an abstract model of how people think about things in the domain. Humans have been producing ontologies for millennia to understand and explain our rationale and environment. Only recently has the process of building ontologies become a research topic of interest. Today, ontologies are built very much ad-hoc. A terminology is first developed providing a controlled vocabulary for the subject area or domain of interest, then it is organized into a taxonomy where key concepts are identified, and finally these concepts are defined and related to create an ontology. This paper describes how a domain analysis method based on faceted classification can be used for building ontologies. It relates domain analysis and ontologies, illustrates a step in the domain analysis method for identifying and categorizing concepts, and describes how this step, borrowed from Library Science, is incorporated into the domain analysis method. The paper also gives an overview of the method and describes a tool for automating parts of the process.