
Journal of Applied Research and Technology, Vol. 10, No. 3, 2012. DOI: 10.22201/icat.16656423.2012.10.3.395


Automatic Building of an Ontology from a Corpus of Text Documents
Using Data Mining Tools

J. I. Toledo-Alvarado*, A. Guzmán-Arenas, G. L. Martínez-Luna

Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN)


Av. Juan de Dios Bátiz esquina con calle Miguel Othón de Mendizabal, 07738
México, D.F., México.
[email protected]

ABSTRACT
In this paper we present a procedure to automatically build an ontology from a corpus of text documents without external help such as dictionaries or thesauri. The proposed method finds relevant concepts in the form of multi-words in the corpus, and non-hierarchical relations between them, in an unsupervised manner.

Keywords: Ontology learning, Data Mining, Machine Learning, Apriori algorithm

RESUMEN
In this article we present a procedure to automatically build an ontology from a corpus of text documents without external help such as dictionaries or thesauri. The proposed method finds relevant concepts in the form of multi-word phrases in the document corpus, and non-hierarchical relations between them, in an unsupervised manner.

1. Introduction

Information in text form is generated at an exponential rate [1]. Every day, organizations all over the world generate reports, articles, books, emails, and all kinds of textual data on many topics. The increasing storage capacity of computers and servers enables these organizations to keep every file they produce without deleting anything. Although this is an obvious advantage for everybody, it also brings problems. One of the main problems organizations face is knowing what kind of information they have and how it is related. One way to organize information in computer science is in the form of an ontology. An ontology is similar to a conceptual map in which the main topics or concepts are connected to each other by some kind of relations. Ontology learning (OL) is a research area focused on discovering or constructing an ontology from a collection of text documents in a semiautomatic or automatic manner. In this paper we show an automatic way to construct an ontology. As input, our method receives a corpus of documents related to a certain context; the experiments were made with papers related to "computed tomography". As output, the method delivers a non-hierarchical ontology in the form of a set of concepts and a set of relations between those concepts.

1.1 Ontology definition

An ontology is a formal, explicit specification of a shared conceptualization of a domain of interest [2]. In computer science, a data structure is usually used to represent knowledge about a certain domain. The most frequent representation of ontologies in computer science is with graphs, where nodes represent concepts and edges represent relations between concepts. Novak [3] gives a definition of concept that is useful for our purpose:

Definition 1.1. Concept is a perceived regularity in events, objects, records, etc., designated by a label.

Relations can be considered special kinds of concepts that relate other concepts. In general they are binary; however, there are n-ary relations, which are hard to represent graphically because several related elements intervene in them. Relations are usually tagged with verbs that describe them. Some works classify concepts into classes (or types) and instances, i.e., the sets I (instances) and T (types) form a partition of C. Formally, an ontology is defined as follows:

398 Vol. 10 No.3, June 2012


Automatic Building of an Ontology from a Corpus of Text Documents Using Data Mining Tools, J. I. Toledo‐Alvarado et al. / 398‐404

Definition 1.2. Ontology is a tuple O = (C, R), where C is a set of nodes (referring to concepts), some of which are relations, and R is a set of restrictions of the form (r[c1, c2, ..., ck]) between the relation r and the concepts c1 to ck. We call k the arity of r.

1.2 Problem statement

The main problem to solve is constructing an ontology automatically from a corpus of text documents. Solving it relies on solving smaller problems in order to obtain all the elements needed to build an ontology. First, it is necessary to identify the relevant concepts hidden in the corpus of documents. These concepts can be multi-words, i.e., they may be formed by n different words. After finding the relevant concepts, it is necessary to find the relations between them. This is a hard task, since we do not count on external help such as dictionaries or thesauri that could help us find relevant or semantic relations between concepts. Having found both sets (concepts and relations), building the ontology is a trivial task.

The paper is organized as follows: In Section 2, we review previous work on ontology learning; in Section 3, we explain the method for constructing an ontology automatically; in Section 4, we show an example of the model; and finally, in Section 5, we discuss the results and present future work.

2. Previous work on ontology learning

The process of OL may be separated into 6 layers [4][5]:
a. Term extraction.
b. Disambiguation and synonyms.
c. Finding concepts.
d. Establishing concept hierarchies.
e. Finding relations between concepts.
f. Finding rules in the ontology.

Research on OL can be applied to one or several layers of the process. This research focuses mainly on term extraction, finding relevant concepts, and finding non-hierarchical relations between concepts.

a. Term extraction is the basic process in every ontology learning effort. Its objective is to obtain terms, which may be considered linguistic realizations of domain-specific concepts. The specialized literature provides many examples of term extraction methods. Most of them are based on information retrieval methods for term indexing [6]. Some researchers also consider NLP for term extraction [7].

b. Disambiguation is the process by which every term found in the corpus is assigned to a specific context. Words may refer to several contexts; for example, the word horse may refer to an animal, to a unit of horsepower, to heroin, to an obstruction in a vein (mining), or to a frame or structure on which something is mounted [8]. There is much work and there are many methods for term disambiguation, e.g., [9]. The most common methods apply clustering techniques to group similar words, using some association measure to detect statistically correlated term pairs [10]. Clustering terms with the same meaning helps to identify ontological classes. Dictionaries and thesauri are used in term disambiguation; in this sense, WordNet [11] has been widely used. To learn more about research on term disambiguation, consult Buitelaar 2005 [4].

c. Finding concepts. Not all terms are suitable concepts, and there is controversy about what can be considered a concept; common definitions of concept are i) an abstract idea, and ii) an idea or mental picture of a group or class of objects [12]. In the context of this paper, a concept is the label assigned to some knowledge regularity in the corpus of documents, formed by a set of one to n terms. In other words, we consider that a corpus of documents from a specific context consists of several knowledge abstractions; those abstractions are regularly found several times all over the corpus. Finding those regularities is equivalent to finding concepts. Most research on finding concepts uses machine learning techniques such as clustering [13] in an unsupervised manner. However, some works use supervised procedures in order to improve their results [14].

d. Establishing concept hierarchies. Some kinds of ontologies consider hierarchies between concepts; this is a kind of relation of the type subset or is-a. Finding this kind of relation is equivalent to establishing a specific taxonomy between concepts. There are three main




paradigms to induce this kind of relation. The first is the application of lexico-syntactic patterns to detect hyponymy relations, as proposed by Hearst in [15]. The second is based on Harris's distributional hypothesis; in this line, researchers have exploited hierarchical clustering algorithms to derive hierarchies from text, e.g., [13].

e. Finding relations between concepts. Also known as non-hierarchical relations, this kind of relation establishes some kind of bond between two concepts. It is desirable that these relations be labeled in order to give significant meaning to the ontology. Most work on finding non-hierarchical relations combines statistical analysis with some level of linguistic analysis, as in Buitelaar 2004 [16]. Relation extraction through text mining was first introduced by Maedche and Staab in 2000 [17].

f. Finding rules. This layer of ontology learning focuses on discovering inference rules from text, such as "X is author of Y", "X wrote Y", "X solved Y", "X found a solution to Y", "X caused Y", and "Y is triggered by X". There is little work in this area of research, such as the work of Lin and Pantel [18], and there are no real trends in it. This layer of ontology learning is beyond the objectives of this work.

3. Building an ontology automatically

3.1 Definitions

In our model, we process a corpus of documents in order to get the most relevant concepts and the most important non-hierarchical relations between them. The relevance of a concept is determined by a data mining algorithm originally used for "market basket analysis". The algorithm is intended to find which items are likely to be bought together. Every sale is considered a transaction, and every product sold is called an item. For our purpose, the algorithm treats every document as a transaction and every word in the document as an item of the transaction. That is,

Definition 3.1. Token is a character string found in a file, separated by spaces from other tokens.

Definition 3.2. Term is the label that represents a set of identical tokens.

Definition 3.3. Itemset is a set of terms or items.

Definition 3.4. Frequent itemset is an itemset whose items appear together in at least S documents or transactions. S is called the minimum support.

After filtering the contents of the text documents, we calculate the frequencies of all terms appearing in the corpus; we consider the N most frequent terms as possible concepts, or parts of concepts that are multi-words. In order to find multi-word concepts, we find the frequent itemsets with the Apriori algorithm. Frequent itemsets are sets of items or terms that frequently appear in the same documents. Then, we find the multi-word concepts with an original algorithm. After we get the relevant concepts of the corpus, we find relations between them in the form of documents that contain two concepts, i.e., we consider that concepts appearing in the same documents are somehow related. Finally, we organize them in the form of a graph.

3.2 Preprocessing

As input, we consider a corpus of documents in PDF format related to a certain topic. In the experiments, a set of documents related to "computed tomography" was used. In order to have the documents ready for processing, they need to be "cleaned" by a series of filters. First, we extract the documents' content; to achieve this, we use PDFBox [19], an Apache library that enables us to extract content from PDF documents. Once we have the content in string form, it is ready to be filtered. The first filter expands contracted words; for example, a string of the form "I'm" is changed into "I am". The second filter removes every non-alphabetic character (symbols) and multiple spaces; the resulting string contains words separated by a single space character. The next filter removes a list of stop-words [20], that is, a list of the most frequent words in the English language; in this case, we remove the 2000 most frequent English words. Another filter is the stemmer, which reduces all the words in the string to their stem form; we do this in order to avoid multiple




concepts for words with the same meaning. For example, computer and computerized both refer to computation, hence the stemmer reduces both words to their common stem: comput. The stemmer we use is Snowball [21]. After the content has passed through all the filters, it is ready to be processed by the data mining algorithm. Table 2 shows the preprocessing statistics: columns are the successive states of preprocessing, and rows are the metrics. For example, the cell referenced by column Ex (Extraction) and row ToC (Token Count) shows that 76,649,828 tokens were found just after extraction, before any preprocessing filter. Table 2 shows the decrease of tokens, terms, and token length as the filters are applied. Table 1 lists the acronyms used in Table 2.

Acronym   Meaning
Ex        Extraction
WiSy      Without Symbols
WiStW     Without Stop-words
St        Stemming
WiSmW     Without Small Words
ToC       Token Count
TeC       Term Count
LoTo      Longest Token
AvTeL     Average Term Length
AvToC     Average Token Count
AvTeC     Average Term Count
AvFL      Average File Length
ShF       Shortest File
LoF       Longest File

Table 1. Acronyms for the preprocessing statistics table.

        Ex            WiSy          WiStW         St            WiSmW
ToC     76,649,828    70,427,042    33,753,162    33,753,162    27,443,988
TeC     924,219       933,307       928,995       849,421       848,732
LoTo    421           176           176           175           175
AvTeL   5.6           5.4           6.6           5.6           6.5
AvToC   4,508.8       4,142.7       1,985.4       1,985.4       1,614.3
AvTeC   1,604.6       1,051.4       668.6         622.4         554.3
AvFL    30,575.6      26,757.0      15,144.5      12,217.7      26,757.0
ShF     15,764        13,698        8,548         7,586         6,971
LoF     519,786,102   454,869,196   257,457,896   223,157,523   207,700,914

Table 2. Preprocessing statistics.

3.3 Mining frequent terms

The algorithm for mining the documents is called Apriori. In order to use it, it is necessary to convert the documents into a proper input for the algorithm. This is a binary matrix file in which the columns represent all the words in the corpus and the rows are the documents in the corpus. In the matrix, the character "1" means the word represented by the column appears in the document represented by the row, and "0" means the word does not appear in that document. The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 [22]. It first finds the set of frequent 1-itemsets by scanning the database to accumulate the count for each item and collecting those items that satisfy the minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. In Figure 1, we show an example of how this algorithm works.

In the context of this research, each transaction is a text file, and each item is a term in the corpus. In the first step of the algorithm, the support for all items is calculated. All of the items that satisfy the established support (2 in the example) are candidates to be frequent itemsets. Those items are then passed to the next pool to form 2-itemsets, and so on.
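The binary document-term matrix that Section 3.3 feeds to Apriori can be sketched as follows. This is a minimal illustration, not the paper's implementation; the toy documents (already stemmed, as in Section 3.2) and the vocabulary are invented for the example.

```python
# Sketch: rows are documents, columns are terms, and a cell is 1 when
# the term occurs in the document, 0 otherwise.

def binary_matrix(docs, vocabulary):
    """Build one 0/1 row per document over a fixed term vocabulary."""
    rows = []
    for text in docs:
        terms = set(text.split())
        rows.append([1 if term in terms else 0 for term in vocabulary])
    return rows

# Invented, already-stemmed document contents.
docs = [
    "comput tomograph imag scan",
    "tomograph imag reconstruct",
    "comput algorithm",
]
vocab = ["comput", "tomograph", "imag"]

for row in binary_matrix(docs, vocab):
    print(row)
```

Each printed row is one transaction for the mining step; in the paper's experiments this matrix has 1000 rows (documents) and 100 columns (the most frequent terms).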




Figure 1. Apriori algorithm with minimum support = 2.
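The level-wise search illustrated in Figure 1 can be sketched as follows. This is a simplified Apriori, shown only to make the candidate-generation loop concrete; the transactions are invented for the example and it is not the implementation used in the experiments.

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Level-wise search: frequent 1-itemsets (L1) seed candidate
    2-itemsets, and so on, until a level yields no frequent itemset."""
    transactions = [set(t) for t in transactions]

    def support(itemset):
        # Number of transactions (documents) containing every item.
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support]
    frequent = []
    while level:
        frequent.extend(level)
        # Join pairs from the current level that differ in one item.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = sorted((c for c in candidates if support(c) >= min_support),
                       key=sorted)
    return frequent

# Toy transactions with minimum support 2, in the spirit of Figure 1.
tx = [["ontolog", "corpus", "term"],
      ["ontolog", "corpus"],
      ["corpus", "term"],
      ["ontolog", "term"]]
for itemset in apriori(tx):
    print(sorted(itemset))
```

With these four transactions, all three single terms and all three term pairs are frequent, while the 3-itemset occurs in only one transaction and is pruned.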

3.4 Finding concepts

Every itemset is a suitable candidate to be a concept. We consider a concept to be a maximal frequent itemset whose members appear together in at least u documents, where u must be defined beforehand by the user.

Definition 3.5. Maximal frequent itemset. An itemset is called maximal if it is not a subset of another itemset.

Because frequent itemsets do not take into account the order of the items within the set, we have to generate all possible combinations of the items within each frequent itemset in order to find the multi-word concepts. That is, for a k-itemset there are k! possible concepts. The algorithm to find concepts is as follows.

Algorithm 3.1. Finding concepts

/* Array of frequent itemsets */
fis : ARRAY;
/* Array of combinations of items within an itemset */
com : ARRAY;
/* Array of files that will be searched for the concepts */
fil : ARRAY;
/* Array of items that represent the best combination, i.e., the one with the most occurrences */
best : ARRAY;
/* Array of concepts */
concepts : ARRAY;
begin
    for i := 0 to fis.length - 1 step 1 do
        com := GENERATE_COMBINATIONS(fis[i]);
        for c := 0 to com.length - 1 step 1 do
            best := GET_BEST_COMB(com[c], fil);
            if best != null and IS_NOT_SUBSET(best, concepts) then
                concepts.add(best);
            fi
        od
    od
end

3.5 Finding relations

The relations we find to form the ontology are not hierarchical. A relation is established when two concepts appear in the same documents.

Definition 3.6. Relation is an undirected link between two concepts.

After finding all concepts, we look for relations within all documents. This is accomplished by creating pairs of concepts and looking for every pair in all the documents of the corpus. The order of the concepts is irrelevant to establishing the relation, since relations are not directed.

4. Experiments

The experiments were made on an Intel Core 2 Duo processor running at 2.8 GHz with 4 GB of RAM, using a subset of documents already preprocessed. We consider 1000 documents and take the 100 most frequent terms as items for the Apriori algorithm; that is, the file used as input for the Apriori algorithm consists of a binary matrix of 100x1000, with a file size of 196 KB. With a support of 80 percent, the Apriori algorithm output was 222 frequent itemsets in less than a second. The largest itemsets contained 5 items; this means that those 5 terms appear together in at least 80 percent of the documents of the corpus. The algorithm for finding concepts found 9 concepts of 2 items, 25 concepts of 3 items, and 13 concepts of 4 items. No concepts of 5 items were found. With these concepts, 135 relations were found in the corpus. The set of relations and concepts forms an undirected graph that represents the ontology learned from the corpus.

5. Discussion

As with every parameterized method, it is difficult to find optimal values for the method to work as expected. The parameters used in the experiments were found on a trial-and-error basis, including the minimum support and the threshold to consider an itemset as a multi-word. In the future, we intend to give a better understanding of the impact of these parameters. We found a non-hierarchical ontology in a corpus of documents in an automatic way, without any help from dictionaries or thesauri. Most work in this field needs either the supervision of an expert who validates the generated ontology at every step of the method, or metadata such as dictionaries to find relations between concepts. With this research, we showed that it is possible to establish unknown relations between new concepts of the domain of study in an unsupervised way and without external help. We proposed a new manner of learning multi-words (concepts): instead of looking for multi-words in the whole vocabulary, we look them up only among the most frequent terms, reducing the computational effort. This is accomplished with data mining tools such as the Apriori algorithm. As future work, we want to tag relations either with some kind of weight or with a label that describes the relation between the concepts. We will also provide qualitative and quantitative measures to validate the generated ontology. The method will be tested with a larger corpus to validate its scalability.




References

[1] José Alfredo Victoria Barbosa and Manuel Ávila Aoki, "Patrones de crecimiento en la generación de información en discos duros," Revista Digital Universitaria, Vol. 10, No. 6, June 2009.

[2] Thomas R. Gruber, "A translation approach to portable ontology specifications," Knowledge Acquisition, Special issue: Current issues in knowledge modeling, Vol. 5, No. 2, June 1993, pp. 199-220.

[3] Novak, J. D. & Cañas, A. J., "The Theory Underlying Concept Maps and How to Construct and Use Them" (2006-01 Rev 2008-01), Technical report, Florida Institute for Human and Machine Cognition, 2006.

[4] Paul Buitelaar, Philipp Cimiano & Bernardo Magnini, "Ontology Learning from Text: An Overview," in Ontology Learning from Text: Methods, Evaluation and Applications, Frontiers in Artificial Intelligence and Applications, Vol. 123, Paul Buitelaar, Philipp Cimiano & Bernardo Magnini (Eds.), IOS Press, 2005, pp. 1-10.

[5] Cimiano, P., Völker, J. & Studer, R., "Ontologies on Demand? A Description of the State-of-the-Art, Applications, Challenges and Trends for Ontology Learning from Text," Information, Wissenschaft und Praxis, Vol. 57, No. 6-7, October 2006, pp. 315-320.

[6] Manning, C. D., Raghavan, P. & Schütze, H., An Introduction to Information Retrieval, Cambridge University Press, 2008.

[7] Kyo Kageura, Béatrice Daille, Hiroshi Nakagawa & Lee-Feng Chien, "Recent advances in computational terminology," John Benjamins, 2004, pp. 1-22.

[8] http://www.wordreference.com/definition/horse

[9] José Francisco Martínez Trinidad, Beatriz Beltrán Martínez, Adolfo Guzmán-Arenas & José Ruiz-Shulcloper, "CLASITEX: A Tool for Knowledge Discovery from Texts," in Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD '98), Jan M. Zytkow and Mohamed Quafafou (Eds.), Springer-Verlag, London, UK, 1998, pp. 459-467.

[10] Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, USA, 1999.

[11] http://wordnet.princeton.edu/

[12] http://www.wordreference.com/definition/concept

[13] Philipp Cimiano, Andreas Hotho and Steffen Staab, "Learning concept hierarchies from text corpora using formal concept analysis," Journal of Artificial Intelligence Research, Vol. 24, No. 1, August 2005, pp. 305-339.

[14] David Faure & Claire Nédellec, "Knowledge Acquisition of Predicate Argument Structures from Technical Texts Using Machine Learning: The System ASIUM," in Proceedings of the 11th European Workshop on Knowledge Acquisition, Modeling and Management (EKAW '99), Dieter Fensel and Rudi Studer (Eds.), Springer-Verlag, London, UK, 1999, pp. 329-334.

[15] Marti A. Hearst, "Automatic acquisition of hyponyms from large text corpora," in Proceedings of the 14th Conference on Computational Linguistics (COLING '92), Vol. 2, Association for Computational Linguistics, Stroudsburg, PA, USA, 1992, pp. 539-545.

[16] Buitelaar, P., Olejnik, D. & Sintek, M., "A Protégé Plug-In for Ontology Extraction from Text Based on Linguistic Analysis," in The Semantic Web: Research and Applications, Christoph Bussler, John Davies, Dieter Fensel & Rudi Studer (Eds.), Springer, Berlin/Heidelberg, 2004, pp. 31-44.

[17] Maedche, A. & Staab, S., "Discovering Conceptual Relations from Text," in Proceedings of the 14th European Conference on Artificial Intelligence (ECAI), IOS Press, Amsterdam, 2000.

[18] Dekang Lin and Patrick Pantel, "DIRT - discovery of inference rules from text," in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), ACM, New York, NY, USA, 2001, pp. 323-328.

[19] http://pdfbox.apache.org/

[20] http://www.cs.utexas.edu/users/ear/cs378NLP/EnglishWordFrequencies.txt

[21] http://snowball.tartarus.org/

[22] Rakesh Agrawal and Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), San Francisco, CA, USA, 1994, pp. 487-499.

