Automatic Building of An Ontology From A Corpus of Text Documents
ABSTRACT
In this paper we show a procedure to automatically build an ontology from a corpus of text documents without external help such as dictionaries or thesauri. The proposed method finds relevant concepts in the form of multi-words in the corpus and non-hierarchical relations between them in an unsupervised manner.
1. Introduction
Definition 1.2. An ontology is a tuple O = (C, R), where C is a set of nodes (referring to concepts), some of which are relations, and R is a set of restrictions of the form r[c1, c2, ..., ck] between the relation r and the concepts c1 to ck. We call k the arity of r.
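For illustration only (the paper does not prescribe a concrete data structure), Definition 1.2 could be held in memory roughly as follows; the class and field names are invented for this sketch:

    from dataclasses import dataclass, field

    @dataclass
    class Ontology:
        """O = (C, R): a set of concepts C and restrictions R of the form r[c1, ..., ck]."""
        concepts: set = field(default_factory=set)        # C; some members may denote relations
        restrictions: list = field(default_factory=list)  # R; entries are (r, (c1, ..., ck))

        def add_restriction(self, relation, *concepts):
            """Record the restriction r[c1, ..., ck]; the arity of r is k = len(concepts)."""
            self.concepts.add(relation)
            self.concepts.update(concepts)
            self.restrictions.append((relation, concepts))

    o = Ontology()
    o.add_restriction("diagnosed_by", "tumor", "computed_tomography")  # a binary (k = 2) relation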
Terms are linguistic realizations of domain-specific concepts. The specialized literature provides many examples of term extraction methods. Most of these are based on information retrieval methods for term indexing [6]. Some researchers also consider NLP techniques for term extraction [7].

There are two main paradigms to induce concept hierarchies from text. The first is the application of lexico-syntactic patterns to detect hyponymy relations, as proposed by Hearst [15]. The second is based on Harris's distributional hypothesis; in this line, researchers have exploited hierarchical clustering algorithms to derive hierarchies from text, e.g., [13].
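To make the first paradigm concrete, a single Hearst-style pattern ("X such as Y1, Y2 and Y3") can be matched with a regular expression. This is a simplified sketch with an invented example sentence, not part of the method proposed in this paper:

    import re

    # One lexico-syntactic (Hearst) pattern: "NP such as NP, NP and NP".
    # Noun phrases are approximated by single words for simplicity.
    PATTERN = re.compile(r"(\w+)\s+such as\s+((?:\w+,?\s*(?:and\s+)?)+)")

    def hyponym_candidates(sentence):
        """Return (hypernym, hyponym) pairs suggested by the 'such as' pattern."""
        pairs = []
        for match in PATTERN.finditer(sentence.lower()):
            hypernym = match.group(1)
            hyponyms = re.split(r",\s*|\s+and\s+", match.group(2).strip())
            pairs.extend((hypernym, h) for h in hyponyms if h)
        return pairs

    print(hyponym_candidates("Modalities such as tomography, radiography and ultrasound."))
    # [('modalities', 'tomography'), ('modalities', 'radiography'), ('modalities', 'ultrasound')]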
e. Finding relations between concepts. Also known as non-hierarchical relations, these relations establish some kind of bond between two concepts. It is desirable that these relations be labeled in order to give significant meaning to the ontology. Most work on finding non-hierarchical relations combines statistical analysis with some level of linguistic analysis, as in Buitelaar 2004 [16]. Relation extraction through text mining was first introduced by Maedche and Staab in 2000 [17].

f. Finding rules. This layer of ontology learning focuses on discovering inference rules from text, such as "X is author of Y", "X wrote Y", "X solved Y", "X found a solution to Y", "X caused Y", and "Y is triggered by X". There is little work related to this area of research, such as the work of Lin and Pantel [18], and there are no established trends in it. This layer of ontology learning is beyond the objectives of this work.
3. Building automatically an ontology

3.1 Definitions
In our model, we process a corpus of documents in order to get the most relevant concepts and the most important non-hierarchical relations between them. The relevance of a concept is determined by a data mining algorithm originally used for "market basket analysis". The algorithm is intended to find which items are likely to be bought together. Every sale is considered a transaction and every product that was sold is called an item.

For our purpose, the algorithm treats every document as a transaction and every word in the document as an item of the transaction.

Definition 3.1. Token is a character string found in a file, separated by spaces from other tokens.
Definition 3.2. Term is the label that represents a set of identical tokens.

Definition 3.3. Itemset is a set of terms or items.

Definition 3.4. Frequent itemset is an itemset whose items appear together in at least S documents or transactions. S is called the minimum support.
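As a toy illustration of Definitions 3.3 and 3.4 (the documents below are invented), an itemset is frequent when its terms co-occur in at least S documents:

    # Each document is represented by the set of (stemmed) terms it contains.
    docs = [
        {"comput", "tomographi", "imag"},
        {"comput", "tomographi", "scanner"},
        {"imag", "reconstruct"},
    ]

    def support(itemset, documents):
        """Number of documents (transactions) that contain every term of the itemset."""
        return sum(itemset <= d for d in documents)

    S = 2  # minimum support
    print(support({"comput", "tomographi"}, docs) >= S)  # True: the pair occurs in 2 documents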
After filtering the contents of the text documents, we calculate the frequencies of all terms appearing in the corpus; we consider the N most frequent terms as possible concepts or parts of multi-word concepts. In order to find multi-word concepts, we find the frequent itemsets with the Apriori algorithm. Frequent itemsets are sets of items or terms that frequently appear in the same documents. Then, we find the multi-word concepts with an original algorithm. After we get the relevant concepts of the corpus, we find relations between them in the form of documents that contain two concepts, i.e., we consider that concepts appearing in the same documents are somehow related. Finally, we organize them in the form of a graph.
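The relation-finding step sketched above (two concepts are related when they occur in the same documents) can be illustrated as follows; the concept names, document ids, and the plain-dictionary graph are invented for this example:

    from itertools import combinations
    from collections import Counter

    # concept -> set of ids of the documents in which the concept occurs
    concept_docs = {
        "comput tomographi": {0, 1, 4},
        "imag reconstruct": {1, 4},
        "contrast agent": {2},
    }

    # Undirected co-occurrence graph: edge weight = number of shared documents.
    edges = Counter()
    for a, b in combinations(concept_docs, 2):
        shared = concept_docs[a] & concept_docs[b]
        if shared:
            edges[(a, b)] = len(shared)

    print(edges)  # Counter({('comput tomographi', 'imag reconstruct'): 2})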
3.2 Preprocessing

As input, we consider a corpus of documents in PDF format related to a certain topic. In the experiments, a set of documents related to "Computed Tomography" was used. In order to have the documents ready for processing, they need to be "cleaned" by a series of filters. First, we extract the documents' content; to achieve this, we use PDFBox [19], an Apache library that enables us to extract content from PDF documents. Once we have the content in string form, it is ready to be filtered. The first filter expands contracted words; in other words, a string of the form "I'm" is changed into "I am". The second filter removes every non-alphabetic character (symbols) and collapses multiple spaces; the resulting string contains words separated by a single space character. The next filter removes a list of stop-words [20], that is, a list of the most frequent words in the English language; in this case, we remove the 2000 most frequent words. Another filter is the stemmer, which reduces all the words in the string to their proper stem form; we do this in order to avoid multiple concepts for words with the same meaning. For example, computer and computerized both refer to computation, hence the stemmer reduces both words to their proper stem: comput. The stemmer we use is Snowball [21]. After the content has passed through all the filters, it is ready to be processed by the data mining algorithm.

Table 2 shows the statistics of preprocessing: columns are different stages of preprocessing, and rows are different metrics. For example, the cell referenced by column Ex (Extraction) and row ToC (Token Count) shows that 76,649,828 tokens were found just after extraction, before any preprocessing filter. Table 2 shows the decrease in token count, term count, and token length as the filters are applied. Table 1 lists the acronyms used in Table 2.
Acronym   Meaning
Ex        Extraction
WiSy      Without Symbols
WiStW     Without Stop-words
St        Stemming
WiSmW     Without Small Words
ToC       Token Count
TeC       Term Count
LoTo      Longest Token
AvTeL     Average Term Length
AvToC     Average Token Count
AvTeC     Average Term Count
AvFL      Average File Length
ShF       Shortest File
LoF       Longest File

Table 1. Acronyms for preprocessing statistics table.
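The filter chain described in this section might look roughly as follows once the text has been extracted from the PDF (the PDFBox step is omitted); the contraction table and stop-word set are abbreviated stand-ins, and NLTK's Snowball stemmer is used here in place of the authors' Snowball setup:

    import re
    from nltk.stem.snowball import SnowballStemmer

    CONTRACTIONS = {"i'm": "i am", "it's": "it is", "don't": "do not"}  # abbreviated list
    STOPWORDS = {"the", "of", "and", "a", "in", "is"}  # stand-in for the 2000-word list [20]
    stemmer = SnowballStemmer("english")

    def preprocess(text):
        text = text.lower()
        for short, expanded in CONTRACTIONS.items():          # filter 1: expand contractions
            text = text.replace(short, expanded)
        text = re.sub(r"[^a-z\s]", " ", text)                 # filter 2: remove non-alpha characters
        text = re.sub(r"\s+", " ", text).strip()              #           and collapse multiple spaces
        tokens = [t for t in text.split(" ") if t not in STOPWORDS]  # filter 3: remove stop-words
        return [stemmer.stem(t) for t in tokens]              # filter 4: Snowball stemming

    print(preprocess("I'm filtering the computerized tomography images."))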
3.3 Mining frequent terms

The algorithm for mining the documents is called Apriori. In order to use it, it is necessary to convert the documents into a proper input for the algorithm: a binary matrix file in which the columns represent all the words in the corpus and the rows represent the documents in the corpus. In the matrix, a "1" means that the word represented by the column appears in the document represented by the row, and a "0" means that the word does not appear in that document. The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 [22]. It first finds the set of frequent 1-itemsets by scanning the database to accumulate the count for each item and collecting those items that satisfy the minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. In Figure 1, we show an example of how this algorithm works.
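A sketch of the binary matrix construction (rows are documents, columns are terms); the toy documents and the in-memory list-of-lists representation are simplifications for illustration:

    # Binary document-term matrix used as input to Apriori.
    docs = [
        ["comput", "tomographi", "imag"],   # document 0 (already preprocessed)
        ["comput", "scanner"],              # document 1
        ["imag", "reconstruct", "comput"],  # document 2
    ]

    vocabulary = sorted({term for doc in docs for term in doc})  # matrix columns
    matrix = [[1 if term in set(doc) else 0 for term in vocabulary] for doc in docs]

    print(vocabulary)
    for row in matrix:
        print(row)  # e.g. the "comput" column holds 1 in every row: the term occurs in all documents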
In the context of this research, each transaction is a text file, and each item is a term in the corpus. In the first step of the algorithm, the support of every item is calculated. All of the items that satisfy the established minimum support (2 in the example) are candidates to be frequent itemsets. Those items are then passed to the next pool to form 2-itemsets, and so on.
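A minimal, unoptimized rendering of the level-wise search just described (L1 is used to build L2, and so on); this is a didactic sketch, not the implementation used in the experiments:

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return every frequent itemset (as a frozenset) with its support count."""
        transactions = [set(t) for t in transactions]
        items = {i for t in transactions for i in t}

        def support(itemset):
            return sum(itemset <= t for t in transactions)

        frequent = {}
        level = [frozenset([i]) for i in items if support({i}) >= min_support]  # L1
        k = 1
        while level:
            frequent.update({s: support(s) for s in level})
            k += 1
            # Candidate k-itemsets: unions of frequent (k-1)-itemsets, kept only if frequent.
            candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
            level = [c for c in candidates if support(c) >= min_support]
        return frequent

    docs = [{"comput", "tomographi", "imag"},
            {"comput", "tomographi"},
            {"imag", "reconstruct"}]
    print(apriori(docs, min_support=2))
    # frequent: {comput}, {tomographi}, {imag} and {comput, tomographi}, each with support 2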