
UTtoKB: a Model for Semantic Relation Extraction for Unstructured Text

Mustafa Nabeel Salim 1, a) and Dr. Ban Shareef Mustafa 2, b)

1, 2 University of Mosul / College of Computer Science and Mathematics / Computer Science Dept.
a) [email protected]
b) [email protected]

Abstract. In this paper, a prototype model called UTtoKB is presented. It extracts semantic relationships from unstructured text based on an ontology. The model is a pipeline built on natural language processing (NLP) tasks and tools such as Coreference Resolution (CR), Named Entity Recognition (NER), Semantic Role Labeling (SRL), and Part-of-Speech (PoS) tagging. WordNet is used to measure similarities between extracted entities so that they can be converted into ontology concepts and properties and used to populate the ontology.
The model performs well in specific domains, while performance degrades in other domains due to the instability of WordNet in finding semantic similarities.

Keywords: Ontology-Based Information Extraction, WordNet, Triples, Natural Language Processing

1. Introduction
NLP focuses on the interaction between machines and natural languages, and on a machine's capacity to comprehend, or imitate the comprehension of, human language through a range of tasks and tools. Information Extraction (IE) is a subfield of NLP used to automatically extract structured information from semi-structured or unstructured text. Ontology-Based Information Extraction (OBIE) is an information extraction technique guided by an ontology, in which structured information is extracted from text according to the concepts and properties defined by that ontology [1]. An ontology is described as a "formal and explicit specification of a shared conceptualization" [2]. Ontologies are typically defined for specific domains.
One of the most popular ontologies at present is DBpedia, a collection of Resource Description Framework (RDF) triples extracted from information created in the Wikipedia project [3]. DBpedia is made available to users on the World Wide Web (WWW) and allows them to semantically query the relationships and properties of Wikipedia resources [3].
In this paper, the UTtoKB model is proposed. UTtoKB is built on deep learning-based NLP tasks such as SRL, CR, and NER, together with PoS tagging and several preprocessing tools. The input text is preprocessed using CR, tokenization, and sentence splitting to make it machine-readable. Information is extracted as a set of relations in triple form (RDF). The RDF triples are then refined against a predefined ontology with the help of the WordNet [4] similarity tool. Refining the set of RDF triples improves results by 20% compared to the results obtained from the first extraction.

1.1 Related Works


There are a number of proposed approaches for converting raw text into a formal representation to complete a knowledge graph [5]. Many works concentrate on extracting semantic relations from raw text and adding them to an ontology-based knowledge representation, with the relations formulated as RDF triples. Some of these systems work in an open-information setting and extract semantic relations with no specific guidance; others use the concepts and relations of a domain ontology to direct the extraction process [6]. In [7], the T2KG system is an early effort to extract semantic relations from unstructured text. T2KG uses word2vec [8] vector representations with cosine similarity and a rule-based approach to check similarities and make the extracted relations compatible with the DBpedia ontology. For extracting triples, it uses an open information extraction technique called Open Language Learning for Information Extraction (OLLIE) [9], which produces a predicate-argument structure. The system's steps are entity extraction, CR, triple extraction, triple integration into the knowledge graph, and finally mapping each predicate to its correspondent in the knowledge graph. In [10], an end-to-end system is created for extracting structured information from Wikipedia articles into the DBpedia namespace. DBpedia is an ontology-based knowledge graph of Wikipedia resources. The system extracts useful information from Wikipedia articles, especially the InfoBox structure, using a pipeline in which SRL works in parallel with Named Entity Linking and CR to extract RDF triples. LODifier [11] is a system proposed to create an RDF representation from unstructured text and link it to the DBpedia and WordNet ontologies. Its architecture is built on three elements: semantic analysis, Named Entity Recognition (NER), and Word Sense Disambiguation. It searches for all the entities mentioned in the text and replaces each one with its English Wikipedia link; the next step generates a URI for the Wikifier output, converting each Wikipedia URI to a DBpedia URI. The C&C parser and the Boxer system are used to determine the relations between entities, and the relations produced by Boxer are converted to RDF WordNet class types to obtain an RDF URI per relation. Finally, LODifier constructs RDF graphs by defining URIs for the predicates and relations obtained from Boxer. In [12], a system is proposed to extract structured information from semi-structured Twitter messages. Its architecture is built on GATE, which recognizes the entities in tweets and connects them to the DBpedia ontology. The system takes BBC and New York Times news tweets as input. The GATE component is a pipeline of five steps, including preprocessing, a gazetteer, grammar-rule creation, and disambiguation, which together extract RDF triples. The gazetteer is a dictionary that helps to extract entities related to Wikipedia and DBpedia content. In [13], a system is presented to extract information about tourism places in Indonesia from Twitter messages. It accepts tweets and the DBpedia ontology as input. To make the input machine-readable, it applies tokenization, sentence splitting, and PoS tagging as preprocessing; PoS tagging helps to extract NOUNs, which are likely potential entities. After extracting a set of entities, DBpedia Spotlight is used to annotate them based on DBpedia:Place.
Other works depend on a domain-specific ontology built by experts. In [14], a method is proposed to extend the content of an ontology with material, especially instances, that is more suitable for and related to the input text. In [15], an end-to-end system uses Stanford dependency parsing to extract the triples. In [16], a system is proposed to extract information from semi-structured text based on a predefined ontology for disaster management; it accepts a semi-structured information document and the predefined ontology as input. In [17], a new technique is designed for extracting tabular information relevant to users from papers. The system is generic, meaning it may be used on any document regardless of its domain or content; it is also robust and can handle a variety of document layouts. Table detection and ontological information extraction are its two key modules: the table detection module extracts all tables from a technical document, while the ontological information extraction module keeps only the relevant tables among those discovered. In [5], a pipeline methodology for extracting information from a huge corpus is proposed, comprising several NLP tasks. The methodology is demonstrated on a large medical dataset (CORD-19), extracting triples that are then mapped to a rich ontology of biomedical concepts. In [18], a web-based prototype system is proposed for extracting useful geospatial information from unstructured text. It uses NER to extract named entities, especially those related to locations.
In UTtoKB, the model uses SRL, NER, and PoS tagging to extract RDF triples. This paper concentrates on the techniques used to map the extracted triples to a predefined domain-specific ontology. WordNet and GloVe vector representations are used during RDF mapping to help link the triples to ontology concepts and properties.
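The core of this mapping idea can be sketched as a best-match search with a similarity threshold. The following is a minimal illustration, not the paper's implementation: the similarity function is a stand-in for WordNet or GloVe scores, and the candidate ontology terms are taken from the Country ontology example later in the paper.

```python
def map_to_ontology(term, ontology_terms, similarity, threshold=0.6):
    """Map an extracted term to the most similar ontology term,
    or return None when no candidate clears the threshold."""
    best_term, best_score = None, 0.0
    for candidate in ontology_terms:
        score = similarity(term, candidate)
        if score > best_score:
            best_term, best_score = candidate, score
    return best_term if best_score >= threshold else None

# Toy similarity table standing in for WordNet/GloVe scores (illustrative values).
TOY_SCORES = {("reside", "Live"): 0.9, ("reside", "Speak"): 0.1,
              ("speaks", "Speak"): 0.95, ("speaks", "Live"): 0.2}

def toy_similarity(a, b):
    return TOY_SCORES.get((a, b), 0.0)

print(map_to_ontology("reside", ["Live", "Speak"], toy_similarity))  # Live
```

With a real backend, `toy_similarity` would be replaced by a WordNet path-based score or the cosine similarity of GloVe vectors; the thresholding logic stays the same.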

1.2 Ontology-Based Information Extraction (OBIE)


The IE process converts an unstructured text, or a group of texts, into sets of facts [19]. The technique is used to extract certain kinds of information from text or other resources and represent them in a specific knowledge representation method such as a database or an ontology [20].
The best-known definition of an ontology is: "Ontology is a formal specification of a shared conceptualization" [2].
Ontologies are mostly used in conjunction with the Semantic Web, where they are represented by collections of URI-identified entities that carry meaning. An ontology represents the vocabulary and knowledge base of a specific domain. It is made up of various parts, including classes, datatype properties, object properties (including taxonomical relationships), instances, object property values, and restrictions [21].
An OBIE system processes unstructured or semi-structured raw text to extract new information according to an ontology; usually, the output is added to that ontology [22]. A more straightforward description is that an OBIE system uses the ontology to guide IE algorithms and methods toward the desired information. An OBIE system can be defined as "an IE process guided by the ontology to extract things such as classes, properties and instances". Thus, what distinguishes OBIE from plain IE is that the extraction method is oriented toward identifying entities for a specific ontology [6].

1.3 Architecture
UTtoKB is a system that converts a text into a set of RDF triples to be added as new assertions to a knowledge base according to a specific domain ontology. UTtoKB is a pipeline of several main components: it takes a text (document) as input and produces RDF triples according to a specific domain.
The components work as a workflow. The input text is preprocessed to provide input sentences to the next module, a generic semantic processing component that extracts semantic relations from the text based on a semantic role labeler, providing the primary set of RDF triples to the next module. RDF refinement then selects the triples that can be mapped to the domain ontology. The main components in the UTtoKB model architecture are:
1. Preprocessing module: the text is preprocessed by applying CR and tokenization, producing the output as a set of sentences.
2. RDF Extraction: the main semantic module, based on the AllenNLP semantic role labeler, converts the sentences into a semantic framing structure and produces the primary RDF triples.
3. RDF Refinement and Ontology Population: this step makes the extracted RDF triples correspond to the ontology contents and performs the final mapping of RDF triples to the domain ontology's concepts and properties.
The complete system architecture is shown in Figure 1. The main components are discussed in detail in later sections.
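The three-stage workflow above can be sketched as three chained functions. This is a toy sketch of the data flow only: each stage body is a trivial placeholder (the real modules use AllenNLP CR, SRL, and the WordNet mapping), and the sample text and predicate set are invented for illustration.

```python
def preprocess(text):
    """Stage 1 (toy stand-in): split into sentences.
    The real module also runs coreference resolution first."""
    return [s.strip() for s in text.split(".") if s.strip()]

def extract_rdf(sentences):
    """Stage 2 (toy stand-in): the real module uses AllenNLP SRL
    to produce (subject, predicate, object) triples per sentence."""
    triples = []
    for s in sentences:
        words = s.split()
        if len(words) >= 3:
            triples.append((words[0], words[1], words[-1]))
    return triples

def refine(triples, ontology_predicates):
    """Stage 3 (toy stand-in): keep only triples whose predicate
    maps into the domain ontology; the rest are discarded."""
    return [t for t in triples if t[1] in ontology_predicates]

text = "John reside London. John dislikes rain."
print(refine(extract_rdf(preprocess(text)), {"reside"}))
# [('John', 'reside', 'London')]
```

The point of the sketch is the pipeline shape: each stage consumes the previous stage's output, and refinement is a filter over the primary triples.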

Figure 1: UTtoKB Model Architecture

1.3.1 Text Preprocessing


Preprocessing is an important step in many NLP tasks. It initializes the text and cleans it of unneeded information that could negatively influence the system, especially when the input is large and unstructured. The module consists of coreference resolution, tokenization, and sentence splitting. Figure 2 shows the preprocessing module architecture.

Figure 2: Preprocessing Module Architecture
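The tokenization and sentence-splitting parts of this module can be illustrated with simple regular expressions. This is a simplified stand-in (the paper's pipeline uses NLP-library components, and the coreference-resolution step is not reproduced here):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on whitespace that follows
    terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Naive tokenizer: words and punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

doc = "Muhammad speaks English. Both Muhammad and John reside in London city."
sents = split_sentences(doc)
print(sents[1])            # Both Muhammad and John reside in London city.
print(tokenize(sents[0]))  # ['Muhammad', 'speaks', 'English', '.']
```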

1.3.2 RDF Extraction


RDF Extraction is the main phase of the system: it creates the initial set of RDF triples. This step consists of Semantic Role Labeling, PoS tagging, Chunk Organizing, and Named Entity Recognition. Chunk Organizing merges all similar arguments extracted by SRL that belong to the same chunk. RDF Extraction has four main steps, as shown in Figure 3.
Figure 3: RDF Extraction Module Architecture
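The conversion from an SRL frame to triples can be sketched as follows. The frame shape here is a simplified dictionary (AllenNLP's SRL model actually returns per-token BIO tags), and the coordination splitting is a crude heuristic; the real module also consults NER and PoS tags:

```python
import re

def frame_to_triples(frame):
    """Expand one SRL frame {ARG0, V, ARG1} into triples, splitting
    coordinated arguments on 'and' and commas (simplified heuristic)."""
    def split(arg):
        return [a for a in re.split(r",|\band\b", arg) if a.strip()]
    return [(s.strip(), frame["V"], o.strip())
            for s in split(frame["ARG0"]) for o in split(frame["ARG1"])]

frame = {"ARG0": "Muhammad and John", "V": "reside", "ARG1": "London city"}
print(frame_to_triples(frame))
# [('Muhammad', 'reside', 'London city'), ('John', 'reside', 'London city')]
```

This mirrors the worked example in the Experiments section, where the coordinated subject "Both Muhammad and John" yields one triple per person.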

1.3.3 RDF Refinement and Ontology Population


The main purpose of this phase is to make the RDF triples extracted earlier correspond to the ontology contents. It accepts the set of initial RDF triples from the RDF extraction phase and uses WordNet [4] to find similarities. Finally, the system adds the new refined RDF triples, separately for IS-A and non-IS-A relations, to the ontology. Figure 4 shows the architecture of this phase.

Figure 4: RDF Refinements and Ontology Mapping

1.4 Experiments
In the UTtoKB prototype, a set of documents about a specific domain is processed to extract information and add it to the domain ontology as new assertions in the knowledge base. A Country ontology was constructed manually using the Protégé platform [23]. The main concepts and roles of the predefined ontology are shown in Figure 5.

Figure 5: Country Ontology

The prototype uses AllenNLP models for CR, SRL, and NER. To give a better understanding of the prototype and how it works, the following example shows how the architecture pipeline processes the input below:

1. Input example:
"Muhammad speaks English and French languages while John speaks English only. Both Mu and Johnny reside in London
city."
2. Preprocessing: The input will be split into two sentences after applying Coreference Resolution, tokenization and sentence
splitting.
● Before applying Coreference Resolution:
Muhammad speaks English and French languages while John speaks English only. Both Mu and Johnny reside in London
city.
● After applying Coreference Resolution and Unifying Entities:
Muhammad speaks English and French languages while John speaks English only. Both Muhammad and John reside in
London city.
The final output of this step is:
a) Muhammad speaks English and French languages while John speaks English only
b) Both Muhammad and John reside in London city
3. RDF Extraction:
This step extracts the initial RDF triples based on SRL, NER, and PoS tagging. In SRL, each statement is divided into segments based on where predicates occur in the sentence:
a) [Muhammad]A0 [speaks]predicate [English and French languages]A1
b) [John]A0 [speaks]predicate [English only]A1
c) [Both Muhammad and John]A0 [reside]predicate [in London city]A1
These are then converted to the initial triple form with the help of NER and PoS tagging:
(Muhammad, speaks, English)
(Muhammad, speaks, French)
(Muhammad, speaks, Languages)
(John, speaks, English)
(Muhammad, reside, London)
(John, reside, London)
(Muhammad, reside, city)
(John, reside, city)

4. RDF Refinement and Ontology Population


The RDF triples extracted above are not fully precise and must be made fully dependent on the concepts and properties of the existing ontology. This step refines them into a form suitable for the ontology; unrelated triples are discarded. The following triples are added as new assertions:

(<Country: Person: Muhammad>, <Country: Speak>, <Country: Language: English>)
(<Country: Person: Muhammad>, <Country: Speak>, <Country: Language: French>)
(<Country: Person: John>, <Country: Speak>, <Country: Language: English>)
(<Country: Person: Muhammad>, <Country: Live>, <Country: City: London>)
(<Country: Person: John>, <Country: Live>, <Country: City: London>)

The predicates "reside" and "speaks" are converted to the most similar ontology properties, "Live" and "Speak" respectively, based on semantic similarity with the ontology. This step also adds the entities (John, Muhammad) as instances of the <Country: Person> class, (English, French) as instances of <Country: Language>, and London as an instance of <Country: City>, if they do not already exist in the ontology. These represent the needed facts (is-a relations) to be added as new instances of the proper classes in the ontology. The same process applies to other examples.
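The population step above can be sketched as formatting the refined triples in the paper's <Country: Class: Instance> notation and emitting the accompanying is-a assertions. The helper names and type table are hypothetical; the notation follows the worked example:

```python
def qualify(name, cls):
    """Render an entity in the paper's <Country: Class: Instance> notation."""
    return f"<Country: {cls}: {name}>"

def populate(triples, types):
    """Produce ontology assertions: the refined relation triples plus
    is-a facts typing each entity (a simplified sketch of this step)."""
    relations = [(qualify(s, types[s]), f"<Country: {p}>", qualify(o, types[o]))
                 for s, p, o in triples]
    is_a = [(qualify(e, c), "is-a", f"<Country: {c}>") for e, c in types.items()]
    return relations, is_a

types = {"Muhammad": "Person", "John": "Person", "London": "City"}
rels, is_a = populate([("Muhammad", "Live", "London")], types)
print(rels[0])
# ('<Country: Person: Muhammad>', '<Country: Live>', '<Country: City: London>')
```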

1.5 UTtoKB Experiment Results


Evaluating an OBIE model is not a standardized task, due to the lack of gold-standard extractions for a specific ontology domain. The common approach is to manually extract the correct set of RDF triples from a test document and compare that set with the model's generated results. All possible triples in the case-study text were generated manually, and precision and recall were calculated: precision as the ratio of valid RDF triples among those extracted by the UTtoKB pipeline, and recall as the ratio of valid extracted triples to the total number of valid (manually extracted) triples. The F1-score is the harmonic mean of precision and recall. Tables 1, 2, and 3 show the precision, recall, and F1-score values for the initial and refined RDF triples in the UTtoKB model.

Table 1: Evaluation of initial RDFs


Metrics Percentage
Precision (%) 54.7
Recall (%) 48.3
F1 (%) 51.3

Table 2: Evaluation of IS-A and non-IS-A relations for refined RDFs


Metrics Percentage
Precision (%) 75
Recall (%) 70
F1 (%) 72

Table 3: Evaluation of IS-A and non-IS-A relations with different thresholds


Metrics          Class/property similarity 60%    Class/property similarity 80%
Precision (%)    75                               66
Recall (%)       70                               61
F1 (%)           72                               62.5
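The evaluation scheme described above is the standard set-based precision/recall computation over triples. A minimal sketch (the triples below are toy examples, not the paper's data):

```python
def prf1(extracted, gold):
    """Precision, recall and F1 over sets of triples: precision is the share
    of extracted triples that are valid, recall the share of valid triples
    that were extracted, F1 their harmonic mean."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

gold = {("Muhammad", "Speak", "English"), ("John", "Speak", "English"),
        ("Muhammad", "Live", "London"), ("John", "Live", "London")}
extracted = {("Muhammad", "Speak", "English"), ("John", "Live", "London"),
             ("Muhammad", "Live", "city")}
p, r, f = prf1(extracted, gold)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.5 0.57
```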

Another experiment evaluates similarity percentages using Global Vector (GloVe) [24] representations and compares them with the WordNet similarity techniques. Table 4 shows results for several example entities compared against ontology classes.
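The GloVe comparison reduces to cosine similarity between word vectors. The following sketch uses invented 3-dimensional vectors purely for illustration; real GloVe embeddings are 50-300-dimensional and loaded from the pretrained glove.*.txt files:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy vectors standing in for real GloVe embeddings (illustrative values).
vectors = {"reside": [0.9, 0.1, 0.3],
           "live":   [0.8, 0.2, 0.35],
           "speak":  [0.1, 0.9, 0.2]}
print(round(cosine(vectors["reside"], vectors["live"]), 3))   # 0.989
print(round(cosine(vectors["reside"], vectors["speak"]), 3))  # 0.271
```

With such scores, "reside" would be mapped to the ontology property "Live" rather than "Speak", as in the worked example.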

Conclusion and Future Works


In this work, the UTtoKB model is proposed for extracting structured information from unstructured text with the help of a domain-specific ontology knowledge base. The model is built using NLP tasks from the AllenNLP platform [27] to extract the initial triples. The extracted triples are then refined to better fit the ontology's concepts and properties using WordNet similarity techniques. The model's results depend on how well semantic similarities can be found between the extracted RDF triples and the domain ontology's concepts and properties. The results showed that WordNet is not a reliable similarity tool because of the limited information it offers in some subject areas. Other semantic-similarity tests can be implemented instead, including Global Vector (GloVe) representations with cosine similarity [25] or Euclidean distance [26]. GloVe is an unsupervised vector model introduced by Stanford University that represents each word in a corpus as a single low-dimensional vector [24]. It is trained on a co-occurrence matrix that records how frequently a word appears together with other words in a corpus. Table 4 shows comparative results for the semantic similarities between entities and classes using the WordNet similarity test and the GloVe vector test. In future work, we intend to find an efficient way to improve the model's performance in ontology mapping.

Table 4: Similarity check with different methods

REFERENCES
1. Martinez-Rodriguez, J.L., Hogan, A., Lopez-Arevalo, I.: Information extraction meets the semantic web: a survey. Semantic
Web. 11, 255–335 (2020)
2. Guarino, N., Oberle, D., Staab, S.: What is an ontology? In: Handbook on ontologies. pp. 1–17. Springer (2009)
3. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., Van Kleef, P.,
Auer, S.: DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web. 6, 167-195 (2015)
4. Fellbaum, C.: WordNet. In: Theory and applications of ontology: computer applications. pp. 231–243. Springer (2010)
5. Papadopoulos, D., Papadakis, N., Litke, A.: A methodology for open information extraction and representation from large
scientific corpora: the CORD-19 data exploration use case. Applied Sciences. 10, 5630 (2020)
6. Karkaletsis, V., Fragkou, P., Petasis, G., Iosif, E.: Ontology based information extraction from text. In: Knowledge-Driven
Multimedia Information Extraction and Ontology Evolution. pp. 89–109. Springer (2011)
7. Kertkeidkachorn, N., Ichise, R.: T2KG: An end-to-end system for creating knowledge graph from unstructured text. Presented
at the Workshops at the Thirty-First AAAI Conference on Artificial Intelligence (2017)
8. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint
arXiv:1301.3781. (2013)
9. Etzioni, O., Bart, R.E., Schmitz, M.D., Soderland, S.G.: Open language learning for information extraction. (2014)
10. Exner, P., Nugues, P.: Entity extraction: From unstructured text to DBpedia RDF triples. Presented at the The Web of Linked
Entities Workshop (WoLE 2012) (2012)
11. Augenstein, I., Padó, S., Rudolph, S.: Lodifier: Generating linked data from unstructured text. Presented at the Extended
Semantic Web Conference (2012)
12. Nebhi, K.: Ontology-based information extraction from twitter. Presented at the Proceedings of the Workshop on Information
Extraction and Entity Analytics on Social Media Data (2012)
13. Rosyiq, A., Hayah, A.R., Hidayanto, A.N., Naisuty, M., Suhanto, A., Budi, N.F.A.: Information extraction from Twitter
using DBpedia ontology: Indonesia tourism places. Presented at the 2019 International Conference on Informatics, Multimedia, Cyber
and Information System (ICIMCIS) (2019)
14. Anantharangachar, R., Ramani, S., Rajagopalan, S.: Ontology guided information extraction from unstructured text. arXiv
preprint arXiv:1302.1335. (2013)
15. Batouche, B., Gardent, C., Monceaux, A., Blagnac, F.: Parsing text into RDF graphs. Presented at the Proceedings of the
XXXI Congress of the Spanish Society for the Processing of Natural Language (2014)
16. Abburu, S., Golla, S.B.: Ontology and NLP support for building disaster knowledge base. Presented at the 2017 2nd
International Conference on Communication and Electronics Systems (ICCES) (2017)
17. Rizvi, S.T.R., Mercier, D., Agne, S., Erkel, S., Dengel, A., Ahmed, S.: Ontology-based Information Extraction from
Technical Documents. Presented at the ICAART (2) (2018)
18. Papadias, E., Kokla, M., Tomai, E.: Educing knowledge from text: semantic information extraction of spatial concepts and
places. AGILE: GIScience Series. 2, 1–7 (2021)
19. Abdelmagid, M., Ahmed, A., Himmat, M.: Information Extraction methods and extraction techniques in the chemical
document’s contents: Survey. ARPN Journal of Engineering and Applied Sciences. 10, 1068–1073 (2015)
20. Grishman, R.: Information extraction. IEEE Intelligent Systems. 30, 8–15 (2015)
21. Wimalasuriya, D.C., Dou, D.: Components for information extraction: Ontology-based information extractors and generic
platforms. Presented at the Proceedings of the 19th ACM international conference on Information and knowledge management (2010)
22. Dung, T.Q., Kameyama, W.: Ontology-based information extraction and information retrieval in health care domain.
Presented at the International Conference on Data Warehousing and Knowledge Discovery (2007)
23. Sivakumar, R., Arivoli, P.: Ontology visualization PROTÉGÉ tools–a review. International Journal of Advanced Information
Technology (IJAIT) Vol. 1, (2011)
24. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. Presented at the Proceedings of
the 2014 conference on empirical methods in natural language processing (EMNLP) (2014)
25. Rahutomo, F., Kitasuka, T., Aritsugi, M.: Semantic cosine similarity. Presented at the The 7th International Student
Conference on Advanced Science and Technology ICAST (2012)
26. Vijaymeena, M., Kavitha, K.: A survey on similarity measures in text mining. Machine Learning and Applications: An
International Journal. 3, 19–28 (2016)
27. Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., Peters, M., Schmitz, M., Zettlemoyer, L.: Allennlp: A
deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640. (2018)
