UTtoKB: A Model for Semantic Relation Extraction for Unstructured Text
Abstract. This paper presents UTtoKB, a prototype model that extracts semantic relationships from unstructured text based on an
ontology. The model is a pipeline of steps built on natural language processing (NLP) tasks and tools such as Coreference Resolution (CR),
Named Entity Recognition (NER), Semantic Role Labeling (SRL), and Part of Speech (PoS) Tagging. WordNet is used to measure the
similarity between extracted entities and ontology concepts and properties, so that the entities can be mapped to them and the ontology populated.
The model works well in specific domains, while performance degrades in other domains because WordNet is inconsistent in
finding semantic similarities.
1. Introduction
NLP focuses on interactions between machines and natural languages, as well as a machine's capacity to comprehend or imitate the
comprehension of human language through a range of tasks and tools. Information Extraction (IE) is a subfield of NLP that
automatically extracts structured information from semi-structured or unstructured text.
Ontology-Based Information Extraction (OBIE) is an information extraction technique guided by an ontology, in which
structured information is extracted from text according to concepts and properties defined by the ontology [1]. An ontology is described as
a "formal and explicit specification of a shared conceptualization" [2]. Ontologies are typically defined for specific domains.
One of the most popular ontologies at present is DBpedia, a collection of Resource Description Framework (RDF) data
extracted from information created in the Wikipedia project [3]. DBpedia is made available to users on the World Wide Web
(WWW) and allows them to semantically query relationships and properties of Wikipedia resources [3].
In this paper, the UTtoKB model is proposed. UTtoKB is built mainly on deep learning-based NLP tasks such as SRL, CR, and NER, along with
others such as PoS Tagging and some preprocessing tools. The input text is preprocessed using CR, tokenization, and
sentence splitting to make it machine-readable. Information is extracted as a set of relations in the form of triples, called RDF triples (RDFs). The RDFs are
refined against the predefined ontology with the help of the WordNet [4] similarity tool. Refining the set of RDFs improves results by
20% compared with the results obtained from the first extraction of RDFs.
1.3 Architecture
UTtoKB is a system that converts text into a set of RDF triples to be added as new assertions to the knowledge base according to a specific
domain ontology. UTtoKB is a pipeline of several main components. It takes text (a document) as input and produces RDF tuples
according to a specific domain.
The components form a workflow. The input text is first preprocessed to provide the input sentences to the next module. The next
module is a generic semantic processing component that extracts semantic relations from the text using a semantic role labeler and
provides the primary set of RDFs to the following module. RDF refinement then selects the RDFs that can be mapped to the domain ontology. The
main components of the UTtoKB model architecture are:
1. Preprocessing module: the text is preprocessed by applying CR and tokenization, and the output is produced as a set of
sentences.
2. RDF Extraction: the main semantic module, based on the AllenNLP semantic role labeler, converts the sentences to a semantic
frame structure and produces the primary RDFs.
3. RDF Refinement and Ontology Population: this step makes the extracted RDF triples correspond to the ontology contents by
mapping them to domain ontology concepts and properties.
The complete system architecture is shown in Figure 1. The main components are discussed in detail in later sections.
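The three-stage workflow can be illustrated with a minimal Python sketch. This is not the UTtoKB implementation: `preprocess`, `run_pipeline`, and the `aliases` map are hypothetical names introduced here, the alias map is a hand-written stand-in for a coreference resolution model, and the regular-expression sentence splitter is a simplification. The extraction and refinement stages are passed in as plain functions so that only the data flow between modules is shown.

```python
import re
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def preprocess(text: str, aliases: Dict[str, str]) -> List[str]:
    """Stand-in for the preprocessing module.

    `aliases` is a hypothetical, hand-written map from a mention to its
    canonical name; in UTtoKB a coreference resolution model plays this role.
    """
    for mention, canonical in aliases.items():
        # Whole-word match, so "Mu" is not replaced inside "Muhammad".
        text = re.sub(rf"\b{re.escape(mention)}\b", canonical, text)
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.rstrip(".!?").strip() for p in parts if p.strip()]


def run_pipeline(
    text: str,
    aliases: Dict[str, str],
    extract: Callable[[str], List[Triple]],
    refine: Callable[[List[Triple]], List[Triple]],
) -> List[Triple]:
    """Chain the three stages: preprocessing, RDF extraction, RDF refinement."""
    sentences = preprocess(text, aliases)
    primary = [triple for sentence in sentences for triple in extract(sentence)]
    return refine(primary)


text = (
    "Muhammad speaks English and French languages while John speaks English only. "
    "Both Mu and Johnny reside in London city."
)
sentences = preprocess(text, {"Mu": "Muhammad", "Johnny": "John"})
for sentence in sentences:
    print(sentence)
```

Keeping each stage behind a plain function signature means a real semantic role labeler or similarity measure could be substituted without changing the surrounding flow.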
1.4 Experiments
In the UTtoKB prototype model, a set of documents about a specific domain is processed to extract information from these documents
and add it to the domain ontology as new assertions in the knowledge base. A Country ontology was constructed manually using
the Protégé platform [23]. The main concepts and roles in the predefined ontology are shown in Figure 5.
The model prototype uses AllenNLP models for CR, SRL, and NER. To give a better understanding of how the prototype
works, the example below shows how the pipeline processes an input step by step:
1. Input example:
"Muhammad speaks English and French languages while John speaks English only. Both Mu and Johnny reside in London
city."
2. Preprocessing: The input will be split into two sentences after applying Coreference Resolution, tokenization and sentence
splitting.
● Before applying Coreference Resolution:
Muhammad speaks English and French languages while John speaks English only. Both Mu and Johnny reside in London
city.
● After applying Coreference Resolution and Unifying Entities:
Muhammad speaks English and French languages while John speaks English only. Both Muhammad and John reside in
London city.
The final output of this step is:
a) Muhammad speaks English and French languages while John speaks English only
b) Both Muhammad and John reside in London city
3. RDF Extraction:
This step extracts the initial RDFs based on SRL, NER, and PoS tagging. In SRL, each statement is divided into segments based on
where the predicates occur in the sentence.
a) [A0: Muhammad] [predicate: speaks] [A1: English and French languages]
b) [A0: John] [predicate: speaks] [A1: English only]
c) [A0: Both Muhammad and John] [predicate: reside] [A1: in London city]
The segments are then converted to the initial triple form using NER and PoS tagging:
(Muhammad, speaks, English)
(Muhammad, speaks, French)
(Muhammad, speaks, Languages)
(John, speaks, English)
(Muhammad, reside, London)
(John, reside, London)
(Muhammad, reside, city)
(John, reside, city)
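One way to obtain triples like those listed above from an SRL frame is to keep only the noun tokens of each argument and pair every A0 noun with every A1 noun. The sketch below assumes this noun-filtering rule; the source does not spell out the exact conversion procedure. The `Frame` layout, the function names, and the PoS tags are hand-written for illustration and are not output of the AllenNLP models.

```python
from itertools import product
from typing import Dict, List, Tuple

Token = Tuple[str, str]         # (word, PoS tag)
Frame = Dict[str, List[Token]]  # keys: "A0", "predicate", "A1"

# Penn Treebank noun tags; other tokens (determiners, conjunctions,
# prepositions, adverbs) are dropped from the arguments.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def nouns(tokens: List[Token]) -> List[str]:
    """Return the words of an argument that carry a noun tag."""
    return [word for word, tag in tokens if tag in NOUN_TAGS]


def frame_to_triples(frame: Frame) -> List[Tuple[str, str, str]]:
    """Pair every A0 noun with every A1 noun under the frame's predicate."""
    predicate = " ".join(word for word, _ in frame["predicate"])
    return [
        (subject, predicate, obj)
        for subject, obj in product(nouns(frame["A0"]), nouns(frame["A1"]))
    ]


# Frame for "Both Muhammad and John reside in London city".
# The tags are assumed by hand, not produced by a tagger.
frame_c: Frame = {
    "A0": [("Both", "CC"), ("Muhammad", "NNP"), ("and", "CC"), ("John", "NNP")],
    "predicate": [("reside", "VBP")],
    "A1": [("in", "IN"), ("London", "NNP"), ("city", "NN")],
}

for triple in frame_to_triples(frame_c):
    print(triple)
```

Under this rule the head noun "city" yields its own triples alongside "London", which matches the unrefined list above; deciding which of them to keep is left to the refinement step.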
The predicates "reside" and "speaks" are converted to the most similar ontology properties, "Live" and "Speak" respectively, based on
semantic similarity with the ontology. This step also adds the entities John and Muhammad as instances of the <Country: Person> class,
English and French as instances of <Country: Language>, and London as an instance of the <Country: City> class, if they do not already exist in
the ontology. These are the facts (is-a relations) needed to add each new instance to the class it belongs to in the
ontology. The same procedure applies to other examples.
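The refinement and population logic described above can be sketched as follows, under several assumptions. The similarity function is passed in as a parameter; `toy_similarity` and its scores are made-up placeholders, not WordNet output. The 0.5 threshold, the function names, and the representation of the knowledge base as a set of tuples are also hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Similarity = Callable[[str, str], float]


def best_property(
    predicate: str, properties: List[str], similarity: Similarity, threshold: float
) -> Optional[str]:
    """Return the most similar ontology property, or None if below threshold."""
    if not properties:
        return None
    best = max(properties, key=lambda prop: similarity(predicate, prop))
    return best if similarity(predicate, best) >= threshold else None


def refine(
    triples: List[Triple],
    properties: List[str],
    similarity: Similarity,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Triple]:
    """Rewrite predicates as ontology properties; drop unmappable triples."""
    refined = []
    for subject, predicate, obj in triples:
        prop = best_property(predicate, properties, similarity, threshold)
        if prop is not None:
            refined.append((subject, prop, obj))
    return refined


def populate(kb: Set[Triple], entity_classes: Dict[str, str]) -> List[Triple]:
    """Add missing is-a facts to the knowledge base and return what was added."""
    added = []
    for entity, cls in entity_classes.items():
        fact = (entity, "is-a", cls)
        if fact not in kb:
            kb.add(fact)
            added.append(fact)
    return added


# Made-up scores standing in for a real WordNet-based measure.
TOY_SCORES = {("reside", "Live"): 0.8, ("speaks", "Speak"): 0.9}


def toy_similarity(a: str, b: str) -> float:
    return TOY_SCORES.get((a, b), 0.0)


initial = [("Muhammad", "speaks", "English"), ("John", "reside", "London")]
refined = refine(initial, ["Live", "Speak"], toy_similarity)
kb: Set[Triple] = set()
added = populate(kb, {"John": "Person", "English": "Language", "London": "City"})
print(refined)
print(added)
```

Only predicate mapping and instance insertion are covered here; filtering objects such as "city" that do not name an instance would need an additional check against the ontology classes.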
Another experiment evaluates similarity scores obtained with Global Vectors for Word Representation (GloVe) [24] word vectors and
compares them with WordNet similarity techniques. Table 4 shows results for some example entities compared with ontology
classes.
In this work, the UTtoKB model is proposed for extracting structured information from unstructured text with the help of a specific
domain ontology knowledge base. The model is built using NLP tasks from the AllenNLP platform [27] to extract the initial triples.
The extracted triples are refined to better fit the ontology concepts and properties using WordNet similarity techniques. The
model's results depend on how well semantic similarities are found between the extracted RDFs and the domain ontology concepts and
properties. The results showed that WordNet is not a reliable tool for measuring similarity because of the limited information it offers on
some subjects. Other approaches can be used for semantic similarity tests, including Global Vectors for Word Representation (GloVe) with
Cosine Similarity [25] and Euclidean distance [26]. GloVe is an unsupervised vector model introduced at Stanford
University that represents each word in the corpus with a single vector in a low-dimensional space [24]. It is trained on
a co-occurrence matrix that records how frequently a word is used together with other words in a corpus.
Table 4 shows comparative results for semantic similarities between entities and classes using the WordNet similarity test and
the GloVe vector test. In future work, we aim to find an efficient way to improve the performance of the model in the
application of ontology mapping.
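The two vector measures named above, cosine similarity and Euclidean distance, can be written in a few lines of Python. The three-dimensional vectors below are made-up placeholders chosen only to exercise the functions; they are not GloVe embeddings, which are loaded from pretrained files and typically have 50 to 300 dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between u and v; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Placeholder vectors for an entity and two ontology classes (not real GloVe values).
vectors = {
    "London": [0.9, 0.1, 0.3],
    "City": [0.8, 0.2, 0.4],
    "Language": [0.1, 0.9, 0.2],
}

for cls in ("City", "Language"):
    cos = cosine_similarity(vectors["London"], vectors[cls])
    dist = euclidean_distance(vectors["London"], vectors[cls])
    print(f"London vs {cls}: cosine={cos:.3f}, euclidean={dist:.3f}")
```

Cosine similarity ignores vector length and compares direction only, whereas Euclidean distance is sensitive to magnitude, so the two measures can rank candidate classes differently.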
REFERENCES
1. Martinez-Rodriguez, J.L., Hogan, A., Lopez-Arevalo, I.: Information extraction meets the semantic web: a survey. Semantic
Web. 11, 255–335 (2020)
2. Guarino, N., Oberle, D., Staab, S.: What is an ontology? In: Handbook on ontologies. pp. 1–17. Springer (2009)
3. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., Van Kleef, P.,
Auer, S.: DBpedia–a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web. 6, 167–195 (2015)
4. Fellbaum, C.: WordNet. In: Theory and applications of ontology: computer applications. pp. 231–243. Springer (2010)
5. Papadopoulos, D., Papadakis, N., Litke, A.: A methodology for open information extraction and representation from large
scientific corpora: the CORD-19 data exploration use case. Applied Sciences. 10, 5630 (2020)
6. Karkaletsis, V., Fragkou, P., Petasis, G., Iosif, E.: Ontology based information extraction from text. In: Knowledge-Driven
Multimedia Information Extraction and Ontology Evolution. pp. 89–109. Springer (2011)
7. Kertkeidkachorn, N., Ichise, R.: T2kg: An end-to-end system for creating knowledge graph from unstructured text. Presented
at the Workshops at the Thirty-First AAAI Conference on Artificial Intelligence (2017)
8. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint
arXiv:1301.3781. (2013)
9. Etzioni, O., Bart, R.E., Schmitz, M.D., Soderland, S.G.: Open language learning for information extraction. (2014)
10. Exner, P., Nugues, P.: Entity extraction: From unstructured text to DBpedia RDF triples. Presented at The Web of Linked
Entities Workshop (WoLE 2012) (2012)
11. Augenstein, I., Padó, S., Rudolph, S.: Lodifier: Generating linked data from unstructured text. Presented at the Extended
Semantic Web Conference (2012)
12. Nebhi, K.: Ontology-based information extraction from twitter. Presented at the Proceedings of the Workshop on Information
Extraction and Entity Analytics on Social Media Data (2012)
13. Rosyiq, A., Hayah, A.R., Hidayanto, A.N., Naisuty, M., Suhanto, A., Budi, N.F.A.: Information extraction from Twitter
using DBpedia ontology: Indonesia tourism places. Presented at the 2019 International Conference on Informatics, Multimedia, Cyber
and Information System (ICIMCIS) (2019)
14. Anantharangachar, R., Ramani, S., Rajagopalan, S.: Ontology guided information extraction from unstructured text. arXiv
preprint arXiv:1302.1335. (2013)
15. Batouche, B., Gardent, C., Monceaux, A., Blagnac, F.: Parsing text into RDF graphs. Presented at the Proceedings of the
XXXI Congress of the Spanish Society for the Processing of Natural Language (2014)
16. Abburu, S., Golla, S.B.: Ontology and NLP support for building disaster knowledge base. Presented at the 2017 2nd
International Conference on Communication and Electronics Systems (ICCES) (2017)
17. Rizvi, S.T.R., Mercier, D., Agne, S., Erkel, S., Dengel, A., Ahmed, S.: Ontology-based Information Extraction from
Technical Documents. Presented at the ICAART (2) (2018)
18. Papadias, E., Kokla, M., Tomai, E.: Educing knowledge from text: semantic information extraction of spatial concepts and
places. AGILE: GIScience Series. 2, 1–7 (2021)
19. Abdelmagid, M., Ahmed, A., Himmat, M.: Information Extraction methods and extraction techniques in the chemical
document’s contents: Survey. ARPN Journal of Engineering and Applied Sciences. 10, 1068–1073 (2015)
20. Grishman, R.: Information extraction. IEEE Intelligent Systems. 30, 8–15 (2015)
21. Wimalasuriya, D.C., Dou, D.: Components for information extraction: Ontology-based information extractors and generic
platforms. Presented at the Proceedings of the 19th ACM international conference on Information and knowledge management (2010)
22. Dung, T.Q., Kameyama, W.: Ontology-based information extraction and information retrieval in health care domain.
Presented at the International Conference on Data Warehousing and Knowledge Discovery (2007)
23. Sivakumar, R., Arivoli, P.: Ontology visualization PROTÉGÉ tools–a review. International Journal of Advanced Information
Technology (IJAIT) Vol. 1, (2011)
24. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. Presented at the Proceedings of
the 2014 conference on empirical methods in natural language processing (EMNLP) (2014)
25. Rahutomo, F., Kitasuka, T., Aritsugi, M.: Semantic cosine similarity. Presented at The 7th International Student
Conference on Advanced Science and Technology ICAST (2012)
26. Vijaymeena, M., Kavitha, K.: A survey on similarity measures in text mining. Machine Learning and Applications: An
International Journal. 3, 19–28 (2016)
27. Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., Peters, M., Schmitz, M., Zettlemoyer, L.: Allennlp: A
deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640. (2018)