Unit 5 - Notes
• Semantic web
o Today, the Web is perhaps the simplest way to share information: almost anyone can
write Web pages with the help of authoring tools, and a large number of
organizations disseminate data coded in Web pages.
o Web search engines have many limitations: despite indexing a large corpus they
achieve low recall and precision, they are sensitive to vocabulary, and search results
can reference the same page more than once (duplicates).
o The Syntactic Web is where computers carry out the presentation of information,
while the interpretation and identification of relevant information is delegated to
human beings.
o Web pages require humans to evaluate, classify and interpret information. But due
to the exponential growth of digital data, it has become difficult for humans to
manage. This is known as information overload.
o In order to organize Web content, researchers proposed a series of conceptual
models. The central idea is to categorize information in a standard way, facilitating
its access.
o Metadata: data about data. Metadata give meaning to websites, that is, they tell
the system what a website contains.
o Ontologies :
▪ From the Greek ontos (being) + logos (word).
▪ Their purpose is to facilitate knowledge sharing and reuse.
▪ They are conceptual models that capture the vocabulary used.
▪ Ontology description languages are specifically designed to define
ontologies.
▪ The Resource Description Framework (RDF) is a general-purpose language
for representing information about resources in the Web
o Formal systems:
▪ Provide the ability to deduce new sentences from existing sentences using
specific inference rules.
▪ This deduction process is called logical inference.
▪ Description logic describes the application domain by defining the domain's
key concepts and then using these concepts to specify the properties of
objects and individuals that appear in the domain.
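The idea of deducing new sentences from existing ones can be illustrated with a minimal Python sketch. It assumes facts are stored as (subject, predicate, object) tuples and applies two hand-written inference rules until nothing new appears. The fact names (Dog, Mammal, rex) and the two rules are illustrative assumptions, not part of any particular formal system.

```python
# Facts are (subject, predicate, object) tuples; all names are made up for illustration.
facts = {
    ("Dog", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
    ("rex", "type", "Dog"),
}

def infer(facts):
    """Apply two inference rules repeatedly until no new fact can be deduced."""
    known = set(facts)
    while True:
        new = set()
        for (a, p1, b) in known:
            for (c, p2, d) in known:
                if b != c:
                    continue
                # Rule 1: subClassOf is transitive.
                if p1 == "subClassOf" and p2 == "subClassOf":
                    new.add((a, "subClassOf", d))
                # Rule 2: an instance of a class is also an instance of its superclasses.
                if p1 == "type" and p2 == "subClassOf":
                    new.add((a, "type", d))
        if new <= known:
            return known
        known |= new

closure = infer(facts)
print(("rex", "type", "Animal") in closure)  # True
```

The fact that rex is an Animal was never stated; it follows from the three stored facts and the two rules.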
o We need the Semantic Web so that software agents can:
▪ Understand the desired tasks and user preferences
▪ Search for information on available resources
▪ Communicate with other software agents
▪ Compare information so as to provide adequate answers to their users
o The Semantic Web is not Artificial Intelligence. It is not a separate web, but an
extension of the current web. The use of complex expressions is not necessary.
• Ontology
o “An ontology is a formal, explicit specification of a shared conceptualization.”
o It is a catalogue of the types of things that are assumed to exist in a domain of
interest D from the perspective of a person who uses a language L for the purpose of
talking about D.
o It provides descriptions of classes, relationships (among classes), and
properties (attributes).
o O = {C, R, CH, rel, OA}, where:
▪ C: a set of concepts.
▪ R: a set of relations.
▪ C and R are disjoint sets, that is, C ∩ R = ∅.
▪ CH ⊆ C × C is a concept hierarchy or taxonomy, where CH(C1, C2)
indicates that C1 is a subconcept of C2. Taxonomies are commonly used
for organizing knowledge.
▪ rel: a relation that relates concepts non-taxonomically.
▪ OA: a set of ontology axioms, expressed in an appropriate logical
language.
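The tuple above maps naturally onto a small data structure. The sketch below is one possible encoding under stated assumptions: rel is stored as a dictionary from a relation name to the pair of concepts it connects, axioms are kept as plain strings, and CH is read transitively when testing subconcepts. The class name Ontology and the university example are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    concepts: set    # C
    relations: set   # R
    hierarchy: set   # CH, pairs (subconcept, superconcept) drawn from C x C
    rel: dict        # rel, assumed here to map a relation name to a pair of concepts
    axioms: list = field(default_factory=list)  # OA, kept as plain strings in this sketch

    def validate(self):
        """Raise ValueError if the structural constraints are violated."""
        if self.concepts & self.relations:
            raise ValueError("C and R must be disjoint")
        pairs = list(self.hierarchy) + list(self.rel.values())
        if any(a not in self.concepts or b not in self.concepts for a, b in pairs):
            raise ValueError("CH and rel must only connect concepts in C")
        if not set(self.rel) <= self.relations:
            raise ValueError("rel must be defined on relations in R")

    def is_subconcept(self, c1, c2):
        """True if CH(c1, c2) holds directly or through intermediate concepts."""
        frontier, seen = [c1], set()
        while frontier:
            current = frontier.pop()
            for sub, sup in self.hierarchy:
                if sub == current and sup not in seen:
                    if sup == c2:
                        return True
                    seen.add(sup)
                    frontier.append(sup)
        return False

o = Ontology(
    concepts={"Person", "Student", "GradStudent", "Course"},
    relations={"enrolledIn"},
    hierarchy={("Student", "Person"), ("GradStudent", "Student")},
    rel={"enrolledIn": ("Student", "Course")},
    axioms=["every GradStudent is enrolledIn at least one Course"],
)
o.validate()
print(o.is_subconcept("GradStudent", "Person"))  # True
```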
o Three properties an ontology must possess:
▪ Strict subconcept hierarchy: every instance of a class must be an instance
of its parent (is-a hierarchy). This structure is a tree of concepts, all
organized using the generalization (type-of) relationship.
▪ Ambiguity-free interpretation of meanings and relationships.
▪ The use of a controlled, finite, but extensible vocabulary.
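The first property can be checked mechanically. The sketch below assumes the hierarchy is given as (child, parent) pairs and tests that every concept has at most one parent and that no cycles exist; it then lists every class an instance belongs to by walking up the tree. The function names are hypothetical, and the check accepts several roots, which is slightly weaker than a single tree.

```python
def is_strict_hierarchy(hierarchy):
    """hierarchy is a set of (child, parent) pairs; check single parent and no cycles."""
    parent = {}
    for child, par in hierarchy:
        if child in parent and parent[child] != par:
            return False  # a concept with two different parents is not a tree
        parent[child] = par
    for start in parent:
        seen = {start}
        node = start
        while node in parent:
            node = parent[node]
            if node in seen:
                return False  # walking upward returned to a visited concept
            seen.add(node)
    return True

def classes_of(instance_class, hierarchy):
    """Own class plus every ancestor; assumes is_strict_hierarchy(hierarchy) is True."""
    parent = dict(hierarchy)
    result = [instance_class]
    while result[-1] in parent:
        result.append(parent[result[-1]])
    return result

tree = {("Student", "Person"), ("GradStudent", "Student")}
print(is_strict_hierarchy(tree))        # True
print(classes_of("GradStudent", tree))  # ['GradStudent', 'Student', 'Person']
```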
o Classifying Ontology According to a Semantic Spectrum
o RDF Triples and Graphs
▪ The RDF graphs notation translates a set of RDF statements into a graph,
with nodes representing subjects or objects, and arcs representing
properties.
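As a rough illustration of the graph notation, the sketch below turns a list of statements into a set of nodes and a list of labelled arcs per subject. The ex: names are made up for the example, and plain Python tuples stand in for real RDF terms.

```python
from collections import defaultdict

# Each statement is a (subject, property, object) triple; the ex: names are illustrative.
triples = [
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Alice", "ex:worksFor", "ex:Acme"),
    ("ex:Bob", "ex:worksFor", "ex:Acme"),
]

def build_graph(triples):
    """Nodes are subjects and objects; arcs are properties pointing from subject to object."""
    nodes = set()
    arcs = defaultdict(list)  # subject -> [(property, object), ...]
    for s, p, o in triples:
        nodes.update((s, o))
        arcs[s].append((p, o))
    return nodes, arcs

nodes, arcs = build_graph(triples)
for s in sorted(arcs):
    for p, o in arcs[s]:
        print(f"{s} --{p}--> {o}")
```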
o RDF Schema
▪ RDF offers enormous flexibility but, apart from the rdf:type property, which
has a predefined semantics, it provides no means for defining application-
specific classes and properties.
▪ The "rdfs:Class" qualified name represents the concept of a class in the RDF
Schema vocabulary, and any resource identified as such can be considered a
class within the RDF data model.
▪ Classes are fundamental in organizing and categorizing resources in RDF. By
assigning resources to specific classes, relationships and hierarchies can be
established, enabling better organization and interpretation of data.
▪ A property is any instance of the class rdf:Property. The rdfs:domain
property is used to indicate that a particular property applies to a
designated class, and the rdfs:range property is used to indicate that the
values of a particular property are instances of a designated class or,
alternatively, are instances (i.e., literals) of an XML Schema datatype.
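The effect of rdfs:domain and rdfs:range can be sketched with plain tuples: once a property has a declared domain and range, using it in a statement lets a reasoner infer the class of the subject and of the object. The ex: names are hypothetical, and the sketch assumes at most one domain and one range per property.

```python
RDF_TYPE, RDFS_DOMAIN, RDFS_RANGE = "rdf:type", "rdfs:domain", "rdfs:range"

# Schema statements (illustrative names): ex:teaches relates teachers to courses.
schema = {
    ("ex:teaches", RDF_TYPE, "rdf:Property"),
    ("ex:teaches", RDFS_DOMAIN, "ex:Teacher"),
    ("ex:teaches", RDFS_RANGE, "ex:Course"),
}
data = {("ex:maria", "ex:teaches", "ex:cs101")}

def infer_types(schema, data):
    """Derive rdf:type statements from the domain and range of each property used."""
    domain = {s: o for s, p, o in schema if p == RDFS_DOMAIN}
    range_ = {s: o for s, p, o in schema if p == RDFS_RANGE}
    inferred = set()
    for s, p, o in data:
        if p in domain:
            inferred.add((s, RDF_TYPE, domain[p]))  # the subject belongs to the domain class
        if p in range_:
            inferred.add((o, RDF_TYPE, range_[p]))  # the object belongs to the range class
    return inferred

for triple in sorted(infer_types(schema, data)):
    print(triple)
```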
• Web Ontology Language (OWL)
o The Web Ontology Language (OWL) is a language used to describe classes,
properties, and relations between conceptual objects, making web content
machine-interpretable.
o OWL is defined as a vocabulary, similar to RDF and RDF Schema, but with a richer
semantics.
o An ontology in OWL is a collection of RDF triples that utilizes the OWL vocabulary.
o OWL is organized into three sublanguages: OWL Lite, OWL DL, and OWL Full.
▪ OWL Lite allows hierarchies of classes and properties and simple
constraints, suitable for thesauri and simple ontologies.
▪ OWL DL increases expressiveness while maintaining decidability of the
classification problem, providing all OWL constructs with certain limitations.
▪ OWL Full is the most expressive language without limitations but disregards
decidability issues.
o The W3C defined the following design goals and requirements to guide the
specification of OWL and to ensure that ontology description languages for the
Semantic Web meet certain criteria for compatibility, logic-based design, ontology
identification, and facilitation of development and reuse.
o Compatibility with XML:
▪ The language should have an XML serialization syntax.
▪ XML Schema datatypes should be used when applicable.
o Based on description logic:
▪ Language should be designed based on the concepts of concept (class), role
(property), and individual.
▪ Support for expressions related to these concepts.
o Support for defining ontology vocabularies:
▪ Each ontology should be identified by a URI reference.
▪ Classes, properties, and individuals within the ontology should be identified
by URI references.
o Facilitation of distributed development and versioning:
▪ Support for developing ontologies in a distributed manner.
▪ Ability to define different versions of the same ontology.
▪ Enable the reuse of previously defined ontologies.
o R1. Ontologies as distinct resources:
▪ Ontologies must have unique identifiers, such as URI references.
o R2. Unambiguous concept referencing with URIs:
▪ Concepts in different ontologies should have distinct absolute identifiers.
▪ Concepts in an ontology can be uniquely identified using URI references.
o R3. Explicit ontology extension:
▪ Ontologies should be able to explicitly extend other ontologies, reusing
concepts while adding new classes and properties.
▪ Ontology extension should be transitive: if A extends B and B extends C,
then A also extends C.
o R4. Commitment to ontologies:
▪ Resources should be able to explicitly commit to specific ontologies,
indicating the set of definitions and assumptions made.
o R5. Ontology metadata:
▪ It should be possible to provide metadata for each ontology, such as author,
publishing date, etc.
▪ Metadata properties may be borrowed from the Dublin Core element set.
o R6. Versioning information:
▪ The language should support comparing and relating different versions of
the same ontology.
▪ Features for relating revisions, stating backwards-compatibility, and
deprecating identifiers should be available.
o R7. Class definition primitives:
▪ The language should allow expressing complex definitions of classes,
including subclassing and Boolean combinations (intersection, union,
complement).
o R8. Property definition primitives:
▪ The language should enable expressing property definitions, including
subproperties, domain and range constraints, transitivity, and inverse
properties.
o R9. Datatypes:
▪ The language should provide a set of standard datatypes, possibly based on
XML Schema datatypes.
o R10. Class and property equivalence:
▪ The language should have features for stating that two classes or properties
are equivalent.
o R11. Individual equivalence:
▪ The language should have features for stating that different identifiers
represent the same individual.
o R12. Attaching information to statements:
▪ The language should provide a mechanism to attach additional information
(e.g., source, timestamp, confidence level) to statements.
o R13. Classes as instances:
▪ The language should support treating classes as instances, as the same
concept can be seen as both a class and an individual depending on the
perspective.
o R14. Cardinality constraints:
▪ The language should support specifying cardinality restrictions on
properties, defining the minimum and maximum number of related
individuals.
o R15. XML syntax:
▪ The language should have an XML serialization syntax for compatibility and
reusability with existing XML tools.
o R16. User-displayable labels:
▪ The language should support specifying multiple alternative user-
displayable labels for ontology resources, facilitating multilingual views.
o R17. Supporting a character model:
▪ The language should support the use of multilingual character sets.
o R18. Supporting uniqueness of Unicode strings:
▪ The language should address cases where different character sequences
may appear the same, ensuring uniform normalization and justifying any
deviations from Unicode Normal Form C.
o OWL ontologies typically begin with namespace declarations and include sentences
about the ontology itself, grouped under the "owl:Ontology" tag.
o If an ontology O1 imports another ontology O2, the declarations in O2 are appended
to O1. O1 may include a namespace declaration pointing to the URIref of O2 for
using its vocabulary.
o A datatype property is a binary relation between instances of a class and instances
of a datatype. It is declared using the "owl:DatatypeProperty" constructor.
o An object property is a binary relation between instances of two classes. It is
declared using the "owl:ObjectProperty" constructor.
o In OWL Lite, the terms "rdfs:domain" and "rdfs:range" from RDF Schema are used to
declare the domain and range of a property.
o The term "rdfs:subPropertyOf" is used to declare property hierarchies in OWL Lite.
o OWL Lite is a subset of OWL that offers hierarchies of classes and properties and
simple constraints, suitable for modeling thesauri and simple ontologies.
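To make the constructors above concrete, here is a minimal sketch that builds the RDF/XML for a toy ontology with Python's standard xml.etree.ElementTree module. The ontology URI http://example.org/university and the names Student, Course, enrolledIn, and age are hypothetical; only the rdf, rdfs, owl, and XML Schema namespace URIs are the standard ones.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "http://example.org/university"  # hypothetical ontology URI

# Namespace declarations: make the serializer emit rdf:, rdfs:, and owl: prefixes.
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

def q(prefix, local):
    """Build a namespace-qualified tag or attribute name for ElementTree."""
    return f"{{{NS[prefix]}}}{local}"

root = ET.Element(q("rdf", "RDF"))

# Sentences about the ontology itself are grouped under owl:Ontology.
onto = ET.SubElement(root, q("owl", "Ontology"), {q("rdf", "about"): BASE})
ET.SubElement(onto, q("rdfs", "comment")).text = "Toy university ontology"

for name in ("Student", "Course"):
    ET.SubElement(root, q("owl", "Class"), {q("rdf", "about"): f"{BASE}#{name}"})

# Object property: a binary relation between instances of two classes.
enrolled = ET.SubElement(root, q("owl", "ObjectProperty"),
                         {q("rdf", "about"): f"{BASE}#enrolledIn"})
ET.SubElement(enrolled, q("rdfs", "domain"), {q("rdf", "resource"): f"{BASE}#Student"})
ET.SubElement(enrolled, q("rdfs", "range"), {q("rdf", "resource"): f"{BASE}#Course"})

# Datatype property: relates instances of a class to values of a datatype.
age = ET.SubElement(root, q("owl", "DatatypeProperty"),
                    {q("rdf", "about"): f"{BASE}#age"})
ET.SubElement(age, q("rdfs", "domain"), {q("rdf", "resource"): f"{BASE}#Student"})
ET.SubElement(age, q("rdfs", "range"), {q("rdf", "resource"): XSD + "nonNegativeInteger"})

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```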
A Pseudo-relevance feedback framework combining relevance matching
and semantic matching for information retrieval
The authors of this paper propose a new framework for pseudo-relevance feedback (PRF) in
information retrieval that combines relevance and semantic matching to improve the quality of
feedback documents. The authors note that most current PRF methods only consider
relevance matching from the perspective of terms used to sort feedback documents, which can
lead to a semantic gap between query representation and document representation. To
address this, the authors use a reranking mechanism that captures both exact term matches
and semantic similarity between query and document representations using bidirectional
encoder representations from transformers (BERT).
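For readers new to PRF, the basic loop can be sketched as follows: retrieve once, treat the top documents as if they were relevant, add their most frequent terms to the query, and retrieve again. This is a toy illustration of the general technique and not the authors' framework: the scoring function is a simple term count standing in for a real ranking model, and the documents and function names are made up.

```python
from collections import Counter

# Tiny illustrative collection; document ids and texts are invented.
docs = {
    "d1": "semantic web ontology rdf owl",
    "d2": "ontology owl reasoning description logic",
    "d3": "cooking pasta recipes",
    "d4": "owl birds nocturnal wildlife",
}

def score(query_terms, text):
    """Toy relevance score: how many times the query terms occur in the document."""
    counts = Counter(text.split())
    return sum(counts[t] for t in query_terms)

def retrieve(query_terms, docs):
    """Rank documents by descending score (ties broken by id), dropping zero scores."""
    ranked = sorted(docs, key=lambda d: (-score(query_terms, docs[d]), d))
    return [d for d in ranked if score(query_terms, docs[d]) > 0]

def expand(query_terms, feedback_ids, docs, n_terms=2):
    """Add the n_terms most frequent new terms found in the feedback documents."""
    counts = Counter()
    for d in feedback_ids:
        counts.update(t for t in docs[d].split() if t not in query_terms)
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]
    return list(query_terms) + [t for t, _ in best]

query = ["ontology"]
first = retrieve(query, docs)               # first-round retrieval
expanded = expand(query, first[:2], docs)   # top-2 documents are assumed relevant
second = retrieve(expanded, docs)           # second-round retrieval
print(first, expanded, second)
```

In this toy run the expansion term "owl" also pulls in d4, a document about birds. Matching on terms alone cannot tell the two senses apart, which is the kind of semantic gap the summary refers to.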
The proposed PRF framework uses probability-based and language-model-based methods to
process the results of the first round of retrieval. Experiments conducted on four Text Retrieval
Conference (TREC) datasets show that the proposed models outperform the baseline models
in terms of mean average precision (MAP). The proposed framework consists of three steps:
reranking, relevance matching, and semantic matching. In the reranking step, BERT is used
to capture exact term matches and semantic similarity between the query and document
representations. This step reduces the semantic gap between the query representation and
the document representation, and improves the quality of the feedback documents. The
reranking step does not rerank all of the documents, which reduces the computational cost
of running BERT.
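The cost argument can be sketched in a few lines: an expensive scorer is applied only to the first k documents of the first-round ranking, and the rest keep their original order. In this sketch a bag-of-words cosine similarity is a cheap placeholder for a BERT-based scorer, and the function names, documents, and value of k are assumptions for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def bow_similarity(query, text):
    # Placeholder scorer: a real system would call a neural model such as BERT here.
    return cosine(Counter(query.split()), Counter(text.split()))

def rerank_top_k(query, ranked_ids, docs, k, expensive_score):
    """Rescore only the first k documents; the tail keeps its first-round order."""
    head, tail = ranked_ids[:k], ranked_ids[k:]
    head = sorted(head, key=lambda d: -expensive_score(query, docs[d]))
    return head + tail

docs = {
    "a": "owl birds nocturnal wildlife habitats",
    "b": "owl ontology language",
    "c": "ontology",
}
first_round = ["a", "b", "c"]  # assumed output of the initial retrieval
reranked = rerank_top_k("owl ontology", first_round, docs, k=2, expensive_score=bow_similarity)
print(reranked)  # ['b', 'a', 'c']
```

Only two documents are rescored here, so the expensive scorer runs twice regardless of collection size. The trade-off is that document "c" is never reconsidered.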
In the relevance matching step, the probability-based PRF method is used to calculate the
relevance between the feedback documents and the original query. The relevance is calculated
based on the probability that the feedback documents are relevant to the original query. In the
semantic matching step, the language-model-based PRF method is used to calculate the
relevance between the feedback documents and the original query. The relevance is calculated
based on the similarity between the feedback documents and the original query, as measured
by a language model. To evaluate the proposed models, the authors conducted experiments on
Text Retrieval Conference (TREC) datasets.
The authors compared their proposed models with several baseline models, including a
baseline model without PRF, a model using only relevance matching for PRF, a model using
only semantic matching for PRF, and several state-of-the-art PRF models. The baseline models
were implemented using BM25 for initial retrieval and then applying PRF. To compare the
performance of the different models, the authors performed statistical significance tests using
the paired t-test. The results showed that the proposed models outperformed the baseline
models and the state-of-the-art PRF models in terms of MAP and P@10. Additionally, the
authors conducted an ablation study to investigate the contribution of different components of
the proposed models to their overall performance. The results showed that using semantic
matching in addition to relevance matching significantly improved the performance of the
models. The reranking mechanism also contributed to the improved performance of the
proposed models. Overall, the proposed PRF framework is a promising approach to improving
the quality of feedback documents in information retrieval. The proposed framework has both
theoretical and practical implications. By combining relevance matching and semantic
matching, the framework addresses the semantic gap between query representation and
document representation, and improves retrieval performance.
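The two evaluation measures mentioned above, MAP and P@10, have standard definitions that can be written down directly. The sketch below uses made-up rankings and relevance judgments purely to show the arithmetic; it does not reproduce any result from the paper.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two illustrative queries with invented judgments.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = 1/2
]
print(round(mean_average_precision(runs), 4))
print(precision_at_k(runs[0][0], runs[0][1], 2))
```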
The experimental results demonstrate the effectiveness of the proposed framework and
suggest that it may be a useful tool for improving the accuracy of information retrieval
systems. However, the proposed framework heavily relies on the quality of the initial retrieval
results. If the initial retrieval results are not accurate, then the effectiveness of the proposed
framework may be limited. The proposed framework also considers only the semantic
similarity between the query and the document representations; it does not consider
domain-specific knowledge or the user's search history. Incorporating both may further
improve the effectiveness of the proposed framework.