Computers and Chemical Engineering 85 (2016) 177–187

Contents lists available at ScienceDirect

Computers and Chemical Engineering


journal homepage: www.elsevier.com/locate/compchemeng

Ontology evaluation for reuse in the domain of Process Systems Engineering

Nikolaos Trokanas ∗ , Franjo Cecelja
PRISE Group, Faculty of Engineering & Physical Sciences, University of Surrey, Guildford GU2 7XH, United Kingdom

ARTICLE INFO

Article history:
Received 4 November 2014
Received in revised form 26 October 2015
Accepted 7 December 2015
Available online 17 December 2015

Keywords:
Ontology reuse
Compatibility
Ontology engineering

ABSTRACT

Ontologies are a useful tool for knowledge representation, sharing and reuse. Although the number of available ontologies is increasing, the concomitant reuse activities are not keeping pace. This is particularly true in the domain of Process Systems Engineering, where ontology development has proven to be a challenging task and reusability is in its infancy. This paper presents a framework for the evaluation of ontologies for reuse. The proposed framework uses information about ontologies, such as terminology and ontology structure, to calculate a compatibility metric of ontology suitability for reuse and hence integration. The framework is demonstrated using a Chemical and Process Engineering case.

© 2015 Elsevier Ltd. All rights reserved.

∗ Corresponding author. Tel.: +44 1483689471.
E-mail address: [email protected] (N. Trokanas).

http://dx.doi.org/10.1016/j.compchemeng.2015.12.003
0098-1354/© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

In practical terms, an ontology (Fig. 1) is a group of terms describing a domain, organised in a hierarchical structure of sets sharing common properties. These properties define the relationships between sets (object properties) or the data value of certain attributes (data type properties). The representation of the domain is complete with the instantiation of the ontology with specific elements of the domain. In terms of ontological engineering, the ontology consists of (i) classes, (ii) relations, (iii) restrictions and (iv) instances.

Ontologies are a useful tool for knowledge representation, sharing and reuse. Their potential has been recognised in a variety of domains, such as the biomedical domain, where Musen et al. (2012) present a web service that provides access to a large number of biomedical ontologies. More information on the use of ontologies in the biomedical domain is available in Whetzel et al. (2011). Ontologies are also widely used in the legal domain. Efforts have focused on upper ontologies (Casellas et al., 2005), specific methodologies for ontology engineering (Corcho et al., 2005) and more advanced applications extending to natural language processing (Lame, 2005). Finally, ontologies are used in a growing range of domains such as the financial domain, with examples of simple knowledge representation efforts (Zhang et al., 2000) and standardisation attempts (Alonso et al., 2005), as well as the domain of agriculture (Soergel et al., 2006) and, more recently, the domain of Chemical and Process Engineering (CPE).

The number of publicly available ontologies, more than 10,000 as shown by dedicated search engines (Swoogle, 2009), is a good indicator of the scale of development and use of ontologies across disciplines. Ontologies are developed for and used in several areas ranging from knowledge representation (Lee and Suh, 2007), knowledge and information sharing (Kraines et al., 2005) and process modelling (Morbach et al., 2007) up to semantic search, data integration (Muñoz et al., 2013) and web service discovery. In the domain of CPE, efforts in designing ontologies are directed to:

(i) process systems design (Bock et al., 2010) and, more specifically, collaborative modelling (Zhang and Yin, 2008) and design;
(ii) supply chain modelling and design (Muñoz et al., 2010);
(iii) decision-making (Raafat et al., 2013) related to environmental effects and causes (Cecelja et al., 2014); and
(iv) computer-aided process engineering (Morbach et al., 2009), specific engineering (Abanda et al., 2013) and data management applications (Panetto et al., 2012).

In addition, a number of specific ontology applications have been reported, including knowledge sharing (Morbach et al., 2009), enterprise information integration (Muñoz et al., 2011), standardisation of vocabularies for chemical (Muñoz et al., 2010), pharmaceutical (Venkatasubramanian et al., 2006) and petrochemical processes (Ni et al., 2011) and mathematical modelling (Suresh et al., 2008) supporting process integration and interoperability (Muñoz et al., 2013).

Developing an ontology is a time-consuming process which requires a high level of domain-specific expertise (Alani et al., 2006). By definition, ontologies are conceptualisations of specifications aimed to be shared. Ontology reuse is expected to be a paramount activity for knowledge engineers which, in turn, is expected to reduce the cost of development (Alani and Brewster, 2005) and to promote interoperability between applications. This is further amplified by the fact that many existing ontologies cover complementary and/or overlapping domains (Lonsdale et al., 2010).

Still, as frequently reported since the early attempts to review existing cases of ontology reuse (Uschold et al., 1998), reuse is not frequently exercised in practice, perhaps because of the absence of robust and pragmatic methods for evaluating and identifying ontologies for reuse (Alani and Brewster, 2005). This scarcity stems from the lack of a specific framework for ontology reuse, in addition to the lack of awareness of the process, benefits and shortcomings of reusing ontologies.

This paper presents a framework in the form of a set of metrics and a step-by-step procedure for evaluating the compatibility of ontologies for reuse. Quantification of semantic similarity between ontologies is used for ontology matching, aligning, merging and integrating. In contrast to the existing methodology of calculating semantic similarity from pairwise similarity between concepts aggregated into a single measure (David and Euzenat, 2008), we propose a method of calculating similarity by using high-level information describing ontologies, such as size, terminology, external resources and data types. The concomitant compatibility measure indicates the levels of granularity differences, encoding and coverage of ontologies needed for a full assessment of their compatibility for reuse.

Nomenclature

T           primary ontology
C           set of candidate ontologies
CS          compatibility score between primary and candidate ontologies
A           subset of C, containing all candidate ontologies whose CS score is higher than the defined threshold
threshold   threshold of CS for candidate ontologies that will be considered
L_i         vector of natural language for ontology i
E_i         vector of encoding (datatypes) for ontology i
R_i         vector of external resources (imports) for ontology i
T_i         vector of terminology for ontology i
LexSim(str1, str2)   lexical similarity between strings str1 and str2
CC          number of common characters between str1 and str2
LCS         length of the longest common substring between str1 and str2
l_1         length of str1
l_2         length of str2
MSP         maximum of Levenshtein distances of suffix/prefix of str1 and str2
doc         degree of confidence between two ontologies (primary and candidate)
OC          number of concepts
OL          number of levels
doc_norm    normalised value of doc, ranging between [0,1]
CoSim       cosine similarity of two vectors a and b
Eucl        Euclidean distance of two vectors a and b
NormEucl    normalised Euclidean distance, ranging between [0,1]
EuSim       Euclidean similarity of two vectors a and b
Sim         similarity score between two vectors a and b, calculated as the average of CoSim and EuSim
S_AB        aggregated similarity of Sim scores

2. Ontology reuse: basic definitions

Ontology is a framework for knowledge modelling, described as a group of terms organised in a class-subclass structure which describes a specific domain (Trokanas et al., 2014b). An ontology is further enhanced with properties characterising terms and with restrictions on these properties. An ontology is normally instantiated with instances representing specific entities of the domain.

Ontology reuse is the process of using existing ontologies as the input to the development of new ontologies or to expand existing ontologies.

Ontology evaluation is the process of assessing ontology quality and adequacy for reuse in a specific context and for a specific goal (Cantador et al., 2007).

Ontology ranking is the process of estimating the relative standing of a set of ontologies for given evaluation criteria.

Ontology similarity is a measure defined as the comparison between two or more ontologies, returning a value between 0 and 1 which indicates the level of feature correspondence between them (Ehrig, 2006).

Ontology compatibility is a measure indicating the level at which two separate ontologies are suitable to be integrated into a single ontology without conflict or inconsistency.

Primary ontology is the ontology which is the basis of reuse. In other words, it is an existing or new ontology that the user decides to expand by using candidate ontologies.

Candidate ontology is an ontology which is considered for reuse. After the ontology evaluation process is complete, the candidate ontology is merged with the primary ontology.

Fig. 1. Ontology excerpt example.



Fig. 2. Structure of measure.

3. Ontology evaluation

3.1. Problem statement

Given a primary ontology (T) and a set of n candidate ontologies C = {c_1, ..., c_n}, calculate the compatibility score CS(T, C) which identifies a set A = {c | c ∈ C, CS(T, c) > threshold}, where CS is the measure of compatibility as defined in Section 2 and the threshold is the compatibility threshold defined by the user (default value 0.5).

3.2. Ontology evaluation methodology

All n candidate ontologies are evaluated for compatibility for reuse using the ontology metadata, which includes terminology, language, encoding, external resources and size (Fig. 2). More precisely, the framework accounts for the following:

(i) Terminology used in the ontologies: refers to all terms used in the ontology irrespective of whether they describe concepts, properties or instances. The reasoning is that when two ontologies share common terms, they are more likely to describe similar domains. It is also easier to combine and integrate them by using common terms as a starting premise.
(ii) Natural language of the ontologies: refers to the languages used during ontology design. Currently, most existing ontologies are designed in English with possible annotation in other languages. However, some reported ontologies are developed in other languages (Zhang et al., 2011), and some even use more than one language (Tudorache, 2008). We argue that the use of one or more common languages improves the ontology reusability potential.
(iii) Encoding information of the ontologies: refers to the data types used during the ontology design. This category can cover wider aspects of ontology engineering, e.g. the use of integer as the range of a data type property can cause inconsistencies if merged with an ontology which uses the float data type for the same property. This aspect reflects the need for minimised encoding bias during ontology engineering, as defined by Gruber (1995).
(iv) External resources: refers to resources used by ontologies. Two ontologies which use the same external resources are more likely to describe the same domain(s). In addition, the use of imports, i.e. already developed ontologies, provides a level of modularisation that facilitates reuse. On the other hand, when two ontologies use different imported ontologies to describe similar domains, e.g. units of measurement, the risk of inconsistencies upon integration increases.
(v) Size and breadth of the ontologies: this aspect accounts for different types of ontologies, taking into account the number of concepts and levels. The term level is defined as the maximum number of taxonomic ranks of granularity of the taxonomy of both primary and candidate ontologies. The number of levels and concepts of an ontology indicates whether the ontology is detailed, top-level, etc.

3.3. Acquisition of ontology metadata

For the calculation of the metrics, each ontology is represented by a set of vectors representing the ontology metadata. Following 'good practice' in ontology design (Schulz et al., 2012), we assume that metadata specify the terms used in the ontology, the natural language, the data types and the external resources of the ontology. A similarity score is then calculated for each type of metadata as the vector similarity, using a combination of cosine similarity and Euclidean distance (Fig. 2). Similarity between vectors of the same type is aggregated into a single semantic similarity measure. Finally, the similarity measures for different types of metadata are combined to form a single similarity score.

3.3.1. Natural language

The use of different natural languages is perhaps the biggest obstacle in ontology matching (Hawalah and Fasli, 2011). Most existing matching and alignment approaches and techniques rely on lexical similarity between ontologies and hence perform poorly when the two ontologies do not share a common language. Conversely, a shared natural language indicates better suitability for reuse. To this end, we define the vector of natural languages L_A of ontology A as

\[ L_A = (l_1, l_2, \ldots, l_i, \ldots, l_n) \tag{1} \]

where l_i is a Boolean value representing the existence of language i ∈ (0, n). A vector is created for each of the ontologies participating in the comparison process. The length n of the vector is the number of distinct languages in the union L_T ∪ L_C of the languages used in the primary ontology L_T and in the candidate ontology L_C.

3.3.2. Data types

We argue that the modelling conventions adopted during the ontology development process could be beneficially used for ontology matching. Modelling conventions include aspects ranging from the level of detail (granularity of the ontology) and the ontology instantiation, up to practical aspects such as the choice of data types, e.g. float, integer or undefined. Similar or identical concepts can be modelled in different ways. For example, distance can be represented in miles or kilometres. In this work we use only the low-level convention of data types, as defined by the XML Schema standard (Bray et al., 1998). Hence, the vector of data types E_A in ontology A is defined as

\[ E_A = (e_1, e_2, \ldots, e_i, \ldots, e_n) \tag{2} \]

where e_i is a value representing the frequency of occurrence of each data type i ∈ (0, n). The length n of the vector is the number of distinct data types in the union E_T ∪ E_C of the data types used in the primary ontology E_T and in the candidate ontology E_C.

3.3.3. External resources

External resources give an indication of the scope of the ontology. When two ontologies use the same external resources, e.g. the same units or geographic namespace, they share some similarities. The vector R_A representing the external resources of an ontology A is formed from the ontologies which are imported in each case and is given as

\[ R_A = (r_1, r_2, \ldots, r_i, \ldots, r_n) \tag{3} \]

where r_i is a Boolean value which represents the existence of external resource i ∈ (0, n). The length n of the vector is the number of distinct external resources in the union R_T ∪ R_C of the external resources used in the primary ontology R_T and in the candidate ontology R_C. All Web Ontology Language (OWL) ontologies have some default external resources which are used to define information such as the data types (xsd imports) and the basic ontology structure (owl and rdf imports). These external resources are disregarded because they do not contribute additional useful information for matching.

3.3.4. Terminology

Terminological obstacles arise from the use of different terms to describe similar or the same concepts (synonyms), as well as from the use of the same terms to describe different concepts (homonyms). At present, an ontology is considered as a group of terms including concept and property names, annotations and instances, and for that reason alone terminology is of vital significance for ontology matching. Most existing matching techniques rely on terminological information for the identification of initial mappings between two ontologies. As such, the vector of terms T_A representing the terminology of ontology A is defined as

\[ T_A = (t_1, t_2, \ldots, t_i, \ldots, t_n) \tag{4} \]

where t_i represents the frequency with which each term i ∈ (0, n) appears in ontology A. The length n of vector T_A is the number of distinct terms in the union T_T ∪ T_C of the two terminology sets used in the primary ontology T_T and in the candidate ontology T_C.

In contrast to natural language, data types and external resource information, which are comparatively easy to extract from any ontology, the information on terminology is much more complex because of its scale and disparity. We propose to extract the terminology from the whole ontology file (.owl) using an existing terminology extraction software tool, which identifies the key terms and generates a list with the frequency of their appearance. The tool (Anchovy, 2013) also allows the user to define thresholds, i.e. the minimum/maximum number of times a term must appear to be considered (Frantzi et al., 2000).

3.3.5. Lexical similarity

There are two ways of addressing lexical similarity when creating vectors representing terminology (Section 3.3.4): (i) using only identical terms, which is faster and less complex but liable to inaccuracies, e.g. where the two ontologies use both plural and singular forms for concept names, these will not be accounted for, and (ii) employing a lexical similarity algorithm and/or external resources such as lexicons, a process which would go beyond the focus of this work. As a compromise, we employ a hybrid algorithm based on established methods for lexical similarity, for which a similarity of 60% is adopted as the threshold above which two terms are considered identical. More precisely, an existing and established lexical similarity algorithm is adopted which has been proven to perform well with terminology in the domain of chemical engineering and industrial symbiosis. The algorithm uses the Levenshtein distance (Levenshtein, 1966) to compare the prefixes and suffixes of the compared terms and uses a modified version of the String Metric presented in Stoilos et al. (2005) to calculate the similarity LexSim between two terms str1 and str2 as

\[ LexSim(str_1, str_2) = \left( \frac{CC + 2 \cdot LCS(str_1, str_2)}{l_1 + l_2} \right)^{MSP} \tag{5} \]

where CC is the number of common characters between the two strings, LCS is the longest common substring, l_i is the length of term i and MSP is the maximum of the Levenshtein distances of the suffixes and prefixes of the two terms. As such, not only identical terms are taken into account but also terms that score above a certain threshold. For this specific application, the threshold has been set to 0.60. Some indicative results of lexical comparison are given in Table 1, which demonstrate the performance of the lexical similarity algorithm. It is apparent that the use of the letter "s" to denote the plural in English does not affect the similarity score. This matters because the use of plural or singular is mainly dictated by the knowledge engineer's preference and intuition. Another important aspect of the algorithm is that similarity is not vastly affected by different suffixes, e.g. PET 1, PET 2, which are a very common phenomenon in the domain of Process Systems Engineering. Finally, the last pair, Article – Particle, is a pair of terms commonly used in tests for lexical similarity algorithms to test performance for two strings that are almost identical and, at the same time, semantically unrelated. In this case, the pair scores below the 0.60 threshold, so the algorithm performs well in that aspect as well.

Table 1
Lexical similarity.

Pair                     Score
Acid – Acids             1.00
HDPE – LDPE              0.88
PET 1 – PET 2            0.83
Industrial – Industry    0.82
Article – Particle       0.51
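The vector construction of Eqs. (1)–(4) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the axis ordering (alphabetical) is an arbitrary choice, and the data type counts are invented for illustration. Only the language sets are taken from Table 2 (conference and ekaw).

```python
from collections import Counter


def boolean_vectors(primary, candidate):
    """Presence vectors over the union of two feature sets (Eqs. (1) and (3))."""
    axes = sorted(set(primary) | set(candidate))  # one axis per distinct feature
    vec_primary = [1 if axis in primary else 0 for axis in axes]
    vec_candidate = [1 if axis in candidate else 0 for axis in axes]
    return axes, vec_primary, vec_candidate


def frequency_vectors(primary, candidate):
    """Frequency vectors over the union of two count tables (Eqs. (2) and (4))."""
    counts_primary, counts_candidate = Counter(primary), Counter(candidate)
    axes = sorted(set(counts_primary) | set(counts_candidate))
    # A Counter returns 0 for features that an ontology does not use.
    return (axes,
            [counts_primary[axis] for axis in axes],
            [counts_candidate[axis] for axis in axes])


# Languages listed in Table 2 for conference (primary) and ekaw (candidate).
lang_axes, l_primary, l_candidate = boolean_vectors({"en"}, {"en", "fr", "cn"})
print(lang_axes, l_primary, l_candidate)

# Hypothetical data type usage counts, for illustration only.
type_axes, e_primary, e_candidate = frequency_vectors(
    {"float": 8, "string": 5}, {"string": 4, "integer": 6})
print(type_axes, e_primary, e_candidate)
```

Because both vectors are built over the same union of features, they always have equal length, which the vector similarity measures require.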

Table 2
Candidate ontologies representing conference domain.

Ontology     Object properties   Data type properties   Instances   Imports   Languages    Number of concepts   Number of levels
confOf       yes                 yes                    no          no        en           38                   7
conference   yes                 yes                    no          no        en           59                   3
confious     yes                 yes                    yes         no        en           56                   3
crs dr       yes                 yes                    no          no        en           14                   2
edas         yes                 yes                    yes         no        en, ru, nl   103                  4
ekaw         yes                 no                     no          yes       en, fr, cn   73                   6
iasted       yes                 yes                    yes         no        en           140                  6

Table 3
Results for conference ontologies (doc) and normalised doc.

Ontology   doc     Normalised doc
confious   −0.37   0.00
confOf     0.48    0.59
crs dr     0.81    0.82
edas       0.80    0.81
iasted     1.06    1.00
ekaw       0.43    0.56

3.4. Structural information

The structure of ontologies represents tacit knowledge in the respective domains and hence it is a fingerprint which can be used to assess the similarity between them. Although two ontologies can describe the same domain using the same terms, they are still not considered identical unless they share the same structure. This aspect is more closely associated with ontology similarity than with ontology compatibility, because reusing an ontology does not require similar structure. In consequence, structural information is used to assess the relative degree of confidence.

The degree of confidence doc is a metric quantifying the structural compatibility of two or more ontologies. To this end, structural compatibility is defined as the level at which two ontologies are considered well-suited for each other, e.g. a top-level ontology is not appropriate to be a subset of a domain ontology. In order to identify the set-subset relation between two ontologies, the domain of each ontology has to be defined, a process which currently attracts numerous research activities. Instead, we assume that the primary ontology is a set and the candidate ontologies are the possible subsets. In consequence, doc is calculated by following three steps, which we demonstrate through an experiment using the Conference (Šváb et al., 2005) set of ontologies as presented in Table 2. The assumption is that the Conference ontology is the primary ontology and also the set for which compatible ontologies are to be found in order to extend the described knowledge. Table 2 contains information about the elements of each ontology, i.e. object properties, data type properties, instances, external resources/imports, languages used, number of concepts and number of levels (ontology hierarchy).

Step 1: Extract the number of levels OL and the number of concepts OC for each of the considered ontologies, which account for two dimensions of the size of an ontology. Information regarding the conference set of ontologies is presented in Table 2.

Step 2: Calculate the degree of confidence doc, which accounts for the size of the primary ontology and each of the candidate ontologies, as

\[ doc = \log_{10} \left| \frac{OC_{target} - OC_{candidate}}{OL_{target}} \right| \tag{6} \]

where the subscript target denotes the primary ontology. The results of the calculated confidence doc for the conference ontology are shown in Table 3.

As such, the metric doc accounts for the variability of granularity between the primary ontology and candidate ontologies. The resulting values vary and also include negative values, hence creating the need for normalisation.

Step 3: Normalise the degree of confidence as:

\[ doc_{norm} = \frac{doc_i - doc_{min}}{doc_{max} - doc_{min}} \tag{7} \]

with results for the conference ontology shown in Table 3. Here doc_min and doc_max represent the range of calculated values over all considered ontologies i.

3.5. Ontology similarity

A combination of cosine similarity and Euclidean distance is used for calculating similarity between primary and candidate ontologies, which is calculated for all vectors representing language, terminology, data types and external resources. The results are then aggregated into a single similarity measure Sim.

Cosine similarity CoSim accounts for the correlation of the values of two or more vectors. For two vectors a and b, cosine similarity is calculated as:

\[ CoSim(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}} \tag{8} \]

where n is the number of elements in each vector; the vectors must have the same size. For example, for the language vectors of ontologies ekaw and iasted given as:

L_ekaw = (1, 1, 1)

L_iasted = (1, 0, 0)

where the three dimensions of the vectors represent the three languages (en, ru, nl), the resulting score ranges between 0 and 1. It is apparent that the cosine similarity is not affected by the magnitude of the two vectors, only by their direction, and for that reason a scale-sensitive measure, the Euclidean distance, is also employed.

Euclidean distance Eucl is sensitive to the magnitude of vectors. For two vectors a and b, Euclidean distance is calculated as:

\[ Eucl(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \tag{9} \]

where n is the number of elements in each vector. Again, the vectors must have the same size. For normalisation NormEucl, the Euclidean distance of a vector is divided by the maximum value of the calculated Euclidean distances as:

\[ NormEucl_i = \frac{Eucl_i}{Eucl_{max}} \tag{10} \]

Finally, this is subtracted from 1 to be converted to a similarity measure EuSim as:

\[ EuSim_i = 1 - NormEucl_i \tag{11} \]
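The similarity measures of Eqs. (8)–(11), together with the averaging into Sim given in Eq. (12), can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and returning 0 for a zero-length vector or a zero maximum distance is an assumption the paper does not address. The example compares the language vector of the primary ontology conference (English only) with those of confOf and ekaw, using the languages listed in Table 2.

```python
import math


def cosine_similarity(a, b):
    """Eq. (8); a zero vector yields 0.0 (an assumption, not from the paper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    """Eq. (9)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_scores(pairs):
    """Sim per candidate: average of CoSim and EuSim (Eqs. (10)-(12)).

    `pairs` maps a candidate name to (primary_vector, candidate_vector).
    """
    distances = {name: euclidean_distance(a, b) for name, (a, b) in pairs.items()}
    d_max = max(distances.values())
    scores = {}
    for name, (a, b) in pairs.items():
        norm_eucl = distances[name] / d_max if d_max else 0.0  # Eq. (10)
        eu_sim = 1.0 - norm_eucl                               # Eq. (11)
        scores[name] = (cosine_similarity(a, b) + eu_sim) / 2  # Eq. (12)
    return scores


# Language vectors: the first axis is English, the others are ekaw's extra languages.
pairs = {
    "confOf": ([1], [1]),
    "ekaw": ([1, 0, 0], [1, 1, 1]),
}
sims = similarity_scores(pairs)
print({name: round(value, 4) for name, value in sims.items()})
```

For ekaw the cosine term is 1/√3 ≈ 0.5774 and, being the most distant candidate, its Euclidean similarity is 0, so Sim ≈ 0.2887, which agrees with the natural language entry for ekaw in Table 4.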

Table 4
Similarity for conference ontologies.

Similarity type                confOf     confious   crs dr     edas       ekaw     iasted
Natural language               1.00       1.00       1.00       0.2887     0.2887   1.00
Terminology                    0.2818     0.1182     0.2859     0.0896     0.2697   0.2191
Encoding (datatypes)           0.5516     0.6445     0.2110     0.5526     0.00     0.4560
External resources (imports)   1.00       1.00       1.00       1.00       0.3536   1.00
1st Aggregation step           0.70835    0.690675   0.624225   0.482725   0.228    0.668775
2nd Aggregation step           0.672845   0.483473   0.682958   0.580908   0.3276   0.768143
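The two aggregation rows of Table 4 can be reproduced from the per-feature similarities above and the normalised doc values of Table 3. The Python sketch below is illustrative only: function and variable names are hypothetical, the first step is written as a weighted mean (which coincides with Eq. (13) for the unit weights used in the paper), and the second step applies Eq. (14) with the normalised doc. The final filter applies the default threshold of 0.5 from Section 3.1.

```python
# Per-feature Sim values from Table 4, in the order:
# natural language, terminology, data types, external resources.
FEATURE_SIMS = {
    "confOf":   [1.00, 0.2818, 0.5516, 1.00],
    "confious": [1.00, 0.1182, 0.6445, 1.00],
    "crs_dr":   [1.00, 0.2859, 0.2110, 1.00],
    "edas":     [0.2887, 0.0896, 0.5526, 1.00],
    "ekaw":     [0.2887, 0.2697, 0.00, 0.3536],
    "iasted":   [1.00, 0.2191, 0.4560, 1.00],
}

# Normalised degree of confidence from Table 3.
DOC_NORM = {"confOf": 0.59, "confious": 0.00, "crs_dr": 0.82,
            "edas": 0.81, "ekaw": 0.56, "iasted": 1.00}


def aggregate_similarity(sims, weights=None):
    """1st aggregation step: weighted mean of the feature similarities."""
    weights = weights or [1.0] * len(sims)
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


def compatibility(similarity, doc_norm, w_sim=0.7, w_doc=0.3):
    """2nd aggregation step: compatibility score CS of Eq. (14)."""
    return w_sim * similarity + w_doc * doc_norm


def rank_candidates(feature_sims, doc_norm, threshold=0.5):
    """Score every candidate, rank by CS and keep those above the threshold."""
    scores = {name: compatibility(aggregate_similarity(sims), doc_norm[name])
              for name, sims in feature_sims.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    accepted = [name for name in ranking if scores[name] > threshold]
    return scores, ranking, accepted


scores, ranking, accepted = rank_candidates(FEATURE_SIMS, DOC_NORM)
for name in ranking:
    print(f"{name:9s} CS = {scores[name]:.4f}")
print("above threshold:", accepted)
```

With these inputs iasted ranks first and crs dr second, and confious and ekaw fall below the 0.5 threshold.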

The two similarity scores, namely the cosine similarity and the Euclidean similarity, are aggregated into a single measure Sim as

\[ Sim_i = \frac{CoSim_i + EuSim_i}{2}, \quad i \in O \tag{12} \]

where O is the set of candidate ontologies.

For the Conference ontologies used, Table 4 presents the similarity Sim, calculated by observing Eqs. (8)–(12), between the language vectors (Eq. (1)) of the primary ontology (conference) and the candidate ontologies.

For the terminology similarity measure, all the terms have been extracted, converted into vectors as described by Eq. (4), and compared by observing Eqs. (8)–(12), which yields the results presented in Table 4. The same approach is followed for the calculation of data type similarity (Eq. (2)). Next, the similarity of external resources is calculated from the vectors that represent the external resources used in each ontology (Eq. (3)). The limited variation in external resources used is apparent in the similarity results presented in Table 4.

3.6. Aggregating similarity

Finally, the similarity measures of each aspect, i.e. the Sim metric for each of the elements presented in Eqs. (1)–(4), are aggregated (Table 4) into a single measure S between two ontologies A and B as:

\[ S_{AB} = \frac{\sum_{i}^{O} Sim_i}{\sum_{i}^{O} w_i} \tag{13} \]

where i ∈ O indexes the ontology features described in Section 3.3, Sim is the similarity measure from Eq. (12) and w is the weighting factor of each feature i. At present, weighting factors are set to 1 for all elements: natural language L, terminology T, data types E and external resources R.

Most of the high-scoring ontologies are general ontologies with granularity level ranging from very generic, e.g. person, event ontologies, to very specific, e.g. early paid applicant, workshop ontologies. On the other hand, low-scoring ontologies are mainly ontologies that represent specific conferences. These ontologies contain scientific-specific terminology, e.g. MultimediaTopic ontology, different natural languages and instances of conference details, e.g. place of conference and currency.

3.7. Aggregating similarity and degree of confidence

Finally, the degree of confidence doc is aggregated with the similarity score S to give the compatibility score CS of the ontologies as:

\[ CS_{ij} = 0.7 \cdot S_{ij} + 0.3 \cdot doc_{ij} \tag{14} \]

where S_ij is the similarity between ontologies i and j and doc_ij is the degree of confidence between ontologies i and j, normalised as in Eq. (7). It is evident from Eq. (14) that the similarity S carries a higher weight than the degree of confidence doc. This is because the degree of confidence merely provides an indication of the structural compatibility.

The coefficients used in Eq. (14) have been experimentally verified through a number of case studies including the set of conference ontologies, the case study of the eSymbiosis ontology, a units of measurement ontology and a set of Biochemistry ontologies.

The aggregated values for the conference ontology in Table 4 show that the most compatible ontology for the primary ontology conference is the iasted ontology, and the second highest is the crs dr ontology. It is apparent that general ontologies with granularity level ranging from very generic, e.g. person, event, to very specific, e.g. early paid applicant, workshop, are considered more suitable for reuse. On the other hand, low-scoring ontologies are mainly ontologies that represent specific conferences. These ontologies contain scientific-specific terminology, e.g. MultimediaTopic, different natural languages and instances of conference details (place of conference and currency).

4. The compatibility for reuse metrics framework outlined

The ontology compatibility framework presented in Fig. 2 is outlined in the following steps:

Step R.1: Extract ontology metadata such as terminology, languages, datatypes, imports and size, and form vectors as defined by Eqs. (1)–(4);
Step R.2: Calculate the compatibility measure CS by using Eqs. (6) and (14);
Step R.3: Rank the results obtained from Step R.2 by their values.

The process of ontology development affects its reusability potential. Observing the following guidelines can improve the reuse process:

Step D.1: Modularise the ontology according to your needs, e.g. by domain, level or property.
Step D.2: Follow "good practice" conventions by considering naming conventions, minimised bias, consistent encoding and annotations.
Step D.3: Reuse existing ontologies and existing vocabularies, lexicons, etc.

5. eSymbiosis ontology experiment

The experiment with the eSymbiosis ontology is presented to demonstrate the framework for reusing ontologies. It involves ontologies developed for materials and chemical substances. The eSymbiosis ontology (Trokanas et al., 2012) has been developed for the eSymbiosis project (Cecelja et al., 2014). It represents knowledge in the domain of Industrial Symbiosis, including waste, materials, energy and processing technologies (Raafat et al., 2013). The ontology is used for user registration and for the formation of symbiotic networks by input/output matching (Trokanas et al., 2014a).

Three candidate ontologies have been considered for integration and reuse: the chemElem ontology and the substance ontology, both part of the SWEET (Raskin and Pan, 2003, 2005) ontologies developed by NASA, and the substance class ontology, which is part of the OntoCAPE ontology (Morbach et al., 2007), denoted as C_1, C_2 and C_3, respectively. The eSymbiosis ontology is the primary ontology, denoted as T.
N. Trokanas, F. Cecelja / Computers and Chemical Engineering 85 (2016) 177–187 183

Fig. 3. Inconsistencies from merging eSymbiosis – substance ontologies.

Table 5
Datatypes and their frequency.

Ontology          Datatype (frequency)
eSymbiosis        float (8), boolean (12), string (5), date (2)
substance class   string (4)
chemElement       integer (6), boolean (1), double (4), string (1)
substance         boolean (8), integer (4), double (4), string (1), date (1), dateTime (1)

Table 7
Structural information for candidate ontologies.

Ontology          Number of concepts   Number of levels
eSymbiosis        2250                 10
chemElem          2363                 N/A
Substance         483                  N/A
substance class   94                   N/A

The chemElem ontology represents chemical elements, which are defined as “pure chemical substances consisting of one type of atom distinguished by its atomic number”. In terms of number of concepts, chemElem is the largest of all the candidate ontologies, consisting of 2363 concepts. The Substance ontology represents non-living building blocks of nature including particles and chemical compounds (Raskin and Pan, 2003). Finally, the substance class of the OntoCAPE ontology (Wiesner et al., 2007), a subclass of the Material class, represents pure substances and mixtures.

The first step involves the extraction of the metadata from each ontology (Step R.1), as described in Section 3.3, with a focus on extraction of the terminology. As mentioned earlier, this work attempts to overcome this challenge with the use of existing term extraction software. This, however, requires a certain trade-off between accuracy and simplicity. For demonstration purposes, we only present the creation and calculations for the datatype vectors, which are relatively short.

As described in Section 3.3, the datatype vectors are created from the frequency of each datatype. The length of each vector is defined by the union of the datatypes of each primary–candidate ontology pair. The datatypes of each ontology are given in Table 5.

From the values given in Table 5, we get the pairs of vectors given in Table 6. From these vectors, we calculate the cosine similarity and the Euclidean distance, as described in Section 3.5 (Step R.2). Subsequently, the Euclidean distance is normalised (Eq. (10)) and converted to Euclidean similarity (Eq. (11)). Finally, the cosine and Euclidean similarity scores are averaged (Eq. (12)). All the values for these calculations are provided in Table 6.

The same approach is followed for the other types of metadata. These examples are not presented due to the large size of the vectors.

Fig. 4. Unassigned concepts.

The similarity scores for all types of metadata are presented in Table 8, which contains the aggregated similarity score S for each candidate ontology, obtained from Eq. (13). This aggregated metric S consists of the similarity of all types of metadata including terminology, data types, languages and external resources.

The second step involves the extraction of the structural information from each ontology (Step R.2), required for the calculation of the degree of confidence (Section 3). This information is presented in Table 7. Following Eq. (6), the number of levels OL of the primary ontology and the number of concepts OC for the primary and candidate ontologies are extracted.

The degree of confidence is calculated for each of the candidate ontologies by using Eqs. (6) and (7), yielding the results presented in Table 8. The results in this table present the normalised doc values for each of the candidate ontologies.

Finally, the two scores are aggregated by using Eq. (14) (Step R.3). The final results of the compatibility CS between the primary and candidate ontologies are also presented in Table 8. Specifically, the Substance ontology is identified as the most compatible between the

Table 6
Datatype vectors. Vector components follow the order [float, boolean, string, date, integer, double, dateTime].

Ontology pair                  Vectors                                   Cosine similarity   Euclidean distance   Normalised Euclidean distance   Euclidean similarity   Sim
eSymbiosis – substance class   eSymbiosis = [8, 12, 5, 2]                0.328               15.0997              0.9419                          0.0581                 0.1931
                               substance class = [0, 0, 1, 0]
eSymbiosis – chemElement       eSymbiosis = [8, 12, 5, 2, 0, 0]          0.1503              16.0312              1                               0                      0.0751
                               chemElement = [0, 1, 1, 0, 6, 4]
eSymbiosis – substance         eSymbiosis = [8, 12, 5, 2, 0, 0, 0]       0.6724              11.4018              0.7112                          0.2888                 0.4806
                               substance = [0, 8, 1, 1, 4, 4, 1]
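The datatype calculation summarised in Table 6 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the frequencies are copied from the vectors printed in Table 6, and the normalisation of Eq. (10) is assumed to divide each Euclidean distance by the largest distance among the pairs, which is what the tabulated values suggest. Euclidean similarity is then taken as one minus the normalised distance, and Sim as the average of the cosine and Euclidean similarities.

```python
import math

# Datatype order as listed in the Table 6 header.
DATATYPES = ["float", "boolean", "string", "date", "integer", "double", "dateTime"]

# Datatype frequencies as they appear in the Table 6 vectors.
PRIMARY = {"float": 8, "boolean": 12, "string": 5, "date": 2}  # eSymbiosis
CANDIDATES = {
    "substance class": {"string": 1},
    "chemElement": {"boolean": 1, "string": 1, "integer": 6, "double": 4},
    "substance": {"boolean": 8, "string": 1, "date": 1,
                  "integer": 4, "double": 4, "dateTime": 1},
}


def to_vectors(primary, candidate):
    """Build two frequency vectors over the union of datatypes of the pair."""
    keys = [d for d in DATATYPES if d in primary or d in candidate]
    return [primary.get(k, 0) for k in keys], [candidate.get(k, 0) for k in keys]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance of two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def datatype_similarity(primary, candidates):
    """Return cosine, distance, Euclidean similarity and averaged Sim per candidate."""
    pairs = {name: to_vectors(primary, c) for name, c in candidates.items()}
    distance = {name: euclidean(u, v) for name, (u, v) in pairs.items()}
    d_max = max(distance.values())  # assumed normalisation constant for Eq. (10)
    results = {}
    for name, (u, v) in pairs.items():
        cos = cosine(u, v)
        euclid_sim = 1.0 - distance[name] / d_max  # assumed form of Eq. (11)
        results[name] = {
            "cosine": cos,
            "distance": distance[name],
            "euclidean_similarity": euclid_sim,
            "sim": (cos + euclid_sim) / 2.0,  # average, as stated for Eq. (12)
        }
    return results


RESULTS = datatype_similarity(PRIMARY, CANDIDATES)
for name, row in RESULTS.items():
    print(name, {key: round(value, 4) for key, value in row.items()})
```

Under these assumptions the sketch reproduces the Euclidean distances and the Sim values printed for the chemElement and substance pairs. For the substance class pair the cosine evaluates to roughly 0.325, slightly below the printed 0.328, so the corresponding Sim differs in the third decimal.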

Table 8
Compatibility results.

Ontology          Metadata similarity (Sim)   Degree of confidence (doc)   Aggregated compatibility score   Aggregated compatibility score (with lexical similarity)
chemElem          0.3748                      0.4513                       0.39780                          0.4020
Substance         0.3801                      0.9600                       0.55490                          0.5616
substance class   0.3259                      1.0000                       0.52810                          0.5330
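The aggregation of Eq. (14) can be checked against the Table 8 values with a short Python sketch. The function and variable names below are illustrative assumptions; only the weights 0.7 and 0.3 and the tabulated Sim and doc values come from the paper.

```python
# Weights from Eq. (14): similarity carries more weight than degree of confidence.
W_SIM = 0.7
W_DOC = 0.3


def compatibility(sim, doc):
    """Compatibility score CS = 0.7 * S + 0.3 * doc, following Eq. (14)."""
    return W_SIM * sim + W_DOC * doc


# (metadata similarity, degree of confidence) as printed in Table 8.
TABLE_8 = {
    "chemElem": (0.3748, 0.4513),
    "Substance": (0.3801, 0.9600),
    "substance class": (0.3259, 1.0000),
}

SCORES = {name: compatibility(sim, doc) for name, (sim, doc) in TABLE_8.items()}

# Candidates ordered from most to least compatible with the primary ontology.
RANKING = sorted(SCORES, key=SCORES.get, reverse=True)

for name in RANKING:
    print(f"{name}: CS = {SCORES[name]:.4f}")
```

With the rounded inputs of Table 8 the recomputed scores agree with the printed ones to about three decimals, and the ordering places Substance first and chemElem last, as discussed in the text.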

Fig. 5. Inconsistencies from merging eSymbiosis – substance class ontologies.

three candidates, while the chemElem ontology is the least compatible. The chemElem ontology scored low in both the doc and S measures. Not only does it share few commonalities in terms of terminology or other metadata with the primary ontology, but it is also bigger than the primary ontology, thus scoring low in doc. Although the substance class ontology had the highest doc score, its low metadata similarity S (Table 8) affected the final compatibility score CS.

With the use of lexical similarity for the comparison of terminology (Eq. (5)), the results are not significantly affected (Table 8), leading to the conclusion that the extraction of terminology using existing software and the identification of only identical terms is sufficient for the high-level comparison that this work proposes.

Established validation metrics in ontology matching (precision and recall) are not helpful in this case, as those metrics require a mathematically defined expected outcome. In this framework, the final outcome is judged based on the resulting ontology, more specifically on the improved functionality of the resulting ontology, the number and type of inconsistencies caused, and the knowledge covered by the resulting ontology (broader than the primary ontology).

Regarding this case study, integration can be performed using Prompt, a Protégé plugin for merging ontologies (Noy and Musen, 2003). Merging the eSymbiosis ontology with the highest scoring

candidate ontology (Substance) creates an ontology with 4553 concepts, causing 2 inconsistencies (Fig. 3).

The use of this ontology greatly increases the knowledge described by the eSymbiosis ontology, especially regarding chemical substances. This increases the materials covered by the eSymbiosis ontology for Industrial Symbiosis practice and allows for higher accuracy in input/output matching in the eSymbiosis application.

Merging the primary ontology with the substance class ontology causes conflicts with the imports of the candidate ontology. This is caused by the large number of imported ontologies used, some of which are not available, leading to unassigned concepts (Fig. 4).

The resulting ontology consists of 2270 concepts and has a large number of inconsistencies (Fig. 5).

6. Conclusions

This paper presents an algorithm for the evaluation of ontologies for the purpose of reuse. The presented algorithm benefits from information about ontologies, such as the terminology of ontologies and their structure, to enable calculation of a single compatibility metric. For this, the algorithm relies on high-level ontology information which is readily available, easy to extract and does not require any special expertise in ontology matching. It also takes advantage of existing terminology extraction tools in an effort to simplify the process. This work presents a framework for ontology reuse that can be the foundation for further work in ontology evaluation and reuse, leading to improved awareness in the CPE community as well as promoting and facilitating the reuse and sharing of existing ontologies. The use and performance of the metrics have been demonstrated with two experiments: one, the Conference ontologies, is a well-elaborated benchmark for ontology matching and alignment, whereas the second is based on the eSymbiosis ontology, which is an application ontology that supports an Industrial Symbiosis web service.

Acknowledgements

This work has been partly funded by the European Commission (LIFE09 ENV/GR/000300) and the UK Engineering and Physical Sciences Research Council (EPSRC).

Appendix A.

References

Abanda FH, Tah JHM, Duce D. PV-TONS: a photovoltaic technology ontology system for the design of PV-systems. Eng Appl Artif Intell 2013;26(4):1399–412.
Alani H, Brewster C. Ontology ranking based on the analysis of concept structures. In: Proceedings of the 3rd international conference on knowledge capture. ACM; 2005. p. 51.
Alani H, Brewster C, Shadbolt N. Ranking ontologies with AKTiveRank. In: The semantic web-ISWC 2006. Springer; 2006. p. 1–15.
Alonso L, Bas L, Bellido S, Contreras J, Benjamins R, Gomez M. WP10: case study eBanking D10. 7 financial ontology. In: Data, information and process integration with semantic web services, FP6-507483; 2005.
Anchovy. Cross-platform tools for translators [homepage of maxprograms]; 2013, Available: http://www.maxprograms.com/products/anchovy.html [30.10.14].
Bock C, Zha X, Suh H, Lee J. Ontological product modeling for collaborative design. Adv Eng Inform 2010;24(4):510–24.
Bray T, Paoli J, Sperberg-McQueen CM, Maler E, Yergeau F. Extensible markup language (XML). In: World wide web consortium recommendation REC-xml-19980210; 1998 http://www.w3.org/TR/1998/REC-xml-19980210.
Cantador I, Fernández M, Castells P. Improving ontology recommendation and reuse in WebCORE by collaborative assessments; 2007.
Casellas N, Blázquez M, Kiryakov A, Casanovas P, Poblet M, Benjamins R. OPJK into PROTON: legal domain ontology integration into an upper-level ontology. In: On the move to meaningful internet systems 2005: OTM 2005 workshops. Springer; 2005. p. 846.
Cecelja F, Raafat T, Trokanas N, Innes S, Smith M, Yang A, et al. e-Symbiosis: technology-enabled support for industrial symbiosis targeting SMEs and innovation. J Clean Prod 2014;98:336–52.
Corcho O, Fernández-López M, Gómez-Pérez A, López-Cima A. Building legal ontologies with METHONTOLOGY and WebODE. In: Law and the semantic web. Springer; 2005. p. 142–57.
David J, Euzenat J. Comparison between ontology distances (preliminary results). In: The semantic web-ISWC 2008. Springer; 2008. p. 245–60.
Ehrig M. Ontology alignment: bridging the semantic gap. Springer; 2006.
Frantzi K, Ananiadou S, Mima H. Automatic recognition of multi-word terms: the C-value/NC-value method. Int J Digit Lib 2000;3(2):115–30.
Gruber TR. Toward principles for the design of ontologies used for knowledge sharing. Int J Hum Comput Stud 1995;43(5):907–28.
Hawalah A, Fasli M. A graph-based approach to measuring semantic relatedness in ontologies. In: Proceedings of the international conference on web intelligence, mining and semantics. ACM; 2011. p. 29.
Ni J, Yi J, Ni S. A practical development of knowledge management model for petrochemical product family. In: 2011 international conference on Information Management, Innovation Management and Industrial Engineering (ICIII); 2011. p. 197.
Kraines S, Batres R, Koyama M, Wallace D, Komiyama H. Internet-based integrated environmental assessment using ontologies to share computational models. J Ind Ecol 2005;9(3):31–50.
Lame G. Using NLP techniques to identify legal ontology components: concepts and relations. In: Law and the semantic web. Springer; 2005. p. 169–84.
Lee J, Suh H. Owl-based product ontology architecture and representation for sharing product knowledge on a web. ASME; 2007.
Levenshtein VI. Binary codes capable of correcting deletions, insertions and reversals. Sov Phys Dokl 1966:707.
Lonsdale D, Embley DW, Ding Y, Xu L, Hepp M. Reusing ontologies and language components for ontology generation. Data Knowl Eng 2010;69(4):318–30.
Morbach J, Wiesner A, Marquardt W. OntoCAPE—a (re) usable ontology for computer-aided process engineering. Comput Chem Eng 2009;33(10):1546–56.
Morbach J, Yang A, Marquardt W. OntoCAPE—a large-scale ontology for chemical process engineering. Eng Appl Artif Intell 2007;20(2):147–61.
Muñoz E, Capón-García E, Moreno-Benito M, Espuña A, Puigjaner L. Scheduling and control decision-making under an integrated information environment. Comput Chem Eng 2011;35(5):774–86.
Muñoz E, Espuña A, Puigjaner L. Towards an ontological infrastructure for chemical batch process management. Comput Chem Eng 2010;34(5):668–82.
Muñoz E, Capón-García E, Laínez JM, Espuña A, Puigjaner L. Integration of enterprise levels based on an ontological framework. Chem Eng Res Des 2013;91(8):1542–56.
Musen MA, Noy NF, Shah NH, Whetzel PL, Chute CG, Story MA, et al. The National Center for Biomedical Ontology. J Am Med Inform Assoc 2012;19(2):190–5.
Noy NF, Musen MA. The PROMPT suite: interactive tools for ontology merging and mapping. Int J Hum-Comput Stud 2003;59(6):983–1024.
Panetto H, Dassisti M, Tursi A. ONTO-PDM: product-driven ONTOlogy for Product Data Management interoperability within manufacturing process environment. Adv Eng Inform 2012;26(2):334–48.
Raafat T, Trokanas N, Cecelja F, Bimi X. An ontological approach towards enabling processing technologies participation in industrial symbiosis. Comput Chem Eng 2013;59:33–46.
Raskin R, Pan M. Semantic web for earth and environmental terminology (sweet). In: Workshop on semantic web technologies for searching and retrieving scientific data at ISWC; 2003. p. 2003.
Raskin RG, Pan MJ. Knowledge representation in the semantic web for earth and environmental terminology (SWEET). Comput Geosci 2005;31(9):1119–25.
Schulz S, Seddig-Raufie D, Grewe N, Röhl J, Schober D, Boeker M, et al. Guideline on developing good ontologies in the biomedical domain with description logics; 2012.
Soergel D, Lauser B, Liang A, Fisseha F, Keizer J, Katz S. Reengineering thesauri for new applications: the AGROVOC example. J Digit Inf 2006;4:4.
Stoilos G, Stamou G, Kollias S. A string metric for ontology alignment. In: The semantic web-ISWC 2005; 2005. p. 624–37.
Suresh P, Joglekar G, Hsu S, Akkisetty P, Hailemariam L, Jain A, et al. Onto MODEL: ontological mathematical modeling knowledge management. In: Computer aided chemical engineering. Elsevier; 2008. p. 985–90.
Šváb O, Svátek V, Berka P, Rak D, Tomášek P. Ontofarm: towards an experimental collection of parallel ontologies. In: Poster track of ISWC; 2005. p. 2005.
Swoogle. Homepage of University of Maryland; 2009, Available: http://swoogle.umbc.edu/ [28.10.14].
Trokanas N, Raafat T, Cecelja F, Kokossis A, Yang A. Semantic formalism for waste and processing technology classifications using ontology models. Comput Aided Chem Eng 2012;30:167–71.
Trokanas N, Cecelja F, Raafat T. Semantic input/output matching for waste processing in industrial symbiosis. Comput Chem Eng 2014a;66:259–68.
Trokanas N, Cecelja F, Raafat T. Towards a re-usable ontology for waste processing. Comput Aided Chem Eng 2014b;33:841–6.
Tudorache T. Ontologies in engineering – modeling, consistency and use cases. VDM Verlag Dr. Mueller e.K; 2008.
Uschold M, Healy M, Williamson K, Clark P, Woods S. Ontology reuse and application. In: Formal ontology in information systems; 1998. p. 192.
Venkatasubramanian V, Zhao C, Joglekar G, Jain A, Hailemariam L, Suresh P, et al. Ontological informatics infrastructure for pharmaceutical product development and manufacturing. Comput Chem Eng 2006;30(10):1482–96.
Whetzel PL, Noy NF, Shah NH, Alexander PR, Nyulas C, Tudorache T, et al. BioPortal: enhanced functionality via new web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res 2011;39(Web Server issue):W541–5.
Wiesner A, Morbach J, Marquardt W. An overview on OntoCAPE and its latest applications. In: Proceedings of the 2007 AIChE annual meeting; 2007.
Zhang C, Cao C, Sui Y, Wu X. A Chinese time ontology for the semantic web. Knowl Based Syst 2011;24(7):1057–74.
Zhang W, Yin J. Exploring semantic web technologies for ontology-based modeling in collaborative engineering design. Int J Adv Manuf Technol 2008;36(9–10):833–43.
Zhang Z, Zhang C, San Ong S. Building an ontology for financial investment. In: Intelligent data engineering and automated learning—IDEAL 2000. Data mining, financial engineering, and intelligent agents. Springer; 2000. p. 308–13.
