
Unit 5 - Notes

The document discusses the Semantic Web, highlighting its need for improved information organization through metadata and ontologies to combat information overload. It explains the structure and properties of ontologies, including their classification and the role of RDF and OWL in defining relationships and properties within a web context. Additionally, it outlines the requirements for ontology languages to ensure compatibility and facilitate development and reuse.

Uploaded by

shrustign27
Copyright
© All Rights Reserved

UNIT 5

• Semantic web
o Today, the Web provides perhaps the simplest way to share information, and literally
everyone writes Web pages, with the help of authoring tools, and a large number of
organizations disseminate data coded in Web pages.
o Web search engines have several limitations: despite a large corpus, they provide
low recall and precision scores; they are sensitive to vocabulary; and search results
can reference the same page more than once (duplicates).
o In the Syntactic Web, information presentation is carried out by computers, and
the interpretation and identification of relevant information is delegated to human
beings.
o Web pages require humans to evaluate, classify and interpret information. But due
to the exponential growth of digital data, it has become difficult for humans to
manage. This is known as information overload.
o In order to organize Web content, researchers proposed a series of conceptual
models. The central idea is to categorize information in a standard way, facilitating
its access.
o Metadata : Data about data. Metadata gives meaning to websites; that is, it tells
the system what a website contains.

o Ontologies :
▪ From the Greek ontos (being) + logos (word).
▪ They facilitate knowledge sharing and reuse.
▪ They are conceptual models that capture the vocabulary used in a domain.
▪ Ontology description languages are specifically designed to define
ontologies.
▪ The Resource Description Framework (RDF) is a general-purpose language
for representing information about resources on the Web.
o Formal Systems :
▪ provide the ability to deduce new sentences from existing sentences using
specific inference rules.
▪ Logical inference.
▪ Description logic describes the application domain by defining the domain's
key concepts and then using these concepts to specify the properties of
objects and individuals that appear in the domain.
o We need the Semantic Web so that software agents can:
▪ Understand the desired tasks and user preferences
▪ Search for information on available resources
▪ Communicate with other software agents
▪ Compare information so as to provide adequate answers to their users
o The Semantic Web is not Artificial Intelligence. It is not a separate web, but an
extension of the current Web. The use of complex expressions is not necessary.

• Ontology
o “An ontology is a formal, explicit specification of a shared conceptualization.”
o It is a catalogue of the types of things that are assumed to exist in a domain of
interest D from the perspective of a person who uses a language L for the purpose of
talking about D.
o It provides descriptions of classes, relationships (among classes), and
properties (attributes).
o O = {C, R, CH, rel, OA}:
▪ C : set of concepts
▪ R : set of relations
▪ C and R are disjoint sets, that is, C ∩ R = ∅.
▪ CH ⊆ C × C is a concept hierarchy or taxonomy, where
CH(C1, C2) indicates that C1 is a subconcept of C2.
▪ rel : a relation that relates concepts non-taxonomically.
▪ Taxonomies are commonly used for organizing knowledge.
▪ OA is a set of ontology axioms, expressed in an appropriate logical
language.
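The structure O = {C, R, CH, rel, OA} can be sketched as a small data structure. This is only an illustration under stated assumptions: the toy concepts and the relation name (Vehicle, Car, SportsCar, Person, drives) are hypothetical, the axiom set OA is left out, and following CH pairs transitively is a choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Ontology:
    concepts: set    # C
    relations: set   # R
    hierarchy: set   # CH: pairs (subconcept, superconcept)
    rel: dict        # rel: relation name -> (concept, concept)

    def is_valid(self) -> bool:
        """Check C ∩ R = ∅ and that CH and rel only mention known names."""
        if self.concepts & self.relations:
            return False
        for sub, sup in self.hierarchy:
            if sub not in self.concepts or sup not in self.concepts:
                return False
        return all(name in self.relations and set(pair) <= self.concepts
                   for name, pair in self.rel.items())

    def is_subconcept(self, c1, c2) -> bool:
        """True if c2 can be reached from c1 by following CH pairs."""
        frontier, seen = [c1], set()
        while frontier:
            current = frontier.pop()
            for sub, sup in self.hierarchy:
                if sub == current and sup not in seen:
                    if sup == c2:
                        return True
                    seen.add(sup)
                    frontier.append(sup)
        return False


# Hypothetical toy ontology (assumption, not from the notes).
toy = Ontology(
    concepts={"Vehicle", "Car", "SportsCar", "Person"},
    relations={"drives"},
    hierarchy={("Car", "Vehicle"), ("SportsCar", "Car")},
    rel={"drives": ("Person", "Vehicle")},
)
```

Here `toy.is_subconcept("SportsCar", "Vehicle")` follows the two CH pairs upward, while `drives` is a non-taxonomic relation between Person and Vehicle.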
o Three properties an ontology must possess:
▪ Strict subconcept hierarchy : every instance of a class must also be an instance
of its parent class (is-a hierarchy). This structure is a tree of concepts, all
organized using the generalization (type-of) relationship.
▪ Ambiguity-free interpretation of meanings and relationships
▪ The use of a controlled, finite, but extensible vocabulary.
o Classifying Ontology According to a Semantic Spectrum

▪ The classification, illustrated in the figure above, follows a line where
ontologies range from lightweight to heavyweight, depending on the
complexity and sophistication of the elements they contain.
▪ Controlled vocabularies are finite lists of terms. E.g. the North American
Industry Classification System (NAICS).
▪ Glossaries are lists of terms whose meaning is described in natural
language. E.g. the Multilingual Glossary of Internet Terminology (NetGlos).
▪ Thesauri are lists of terms and definitions that standardize words for
indexing purposes. They provide definitions and relationships (hierarchical,
associative, or equivalence) between terms. E.g. IEEE.
▪ Informal is-a hierarchies are hierarchies that use generalization (type-of)
relationships in an informal way. Related concepts can be aggregated into a
category, even if they do not respect the generalization relationship.
▪ Formal is-a hierarchies are hierarchies that fully respect the generalization
relationship.
▪ Frames are models that include classes and properties. The primitives of the
frame model are classes, or frames, that have properties, slots, or
attributes. Slots apply only to the classes for which they were defined.
▪ Ontologies that express value restrictions and logical restrictions are the
most heavyweight in the spectrum.
o Classifying Ontology According to their Generality
▪ Upper Level Ontologies describe generic concepts, such as space, time, and
events. These ontologies are, in principle, domain independent .
▪ Domain Ontologies describe the vocabulary pertaining to a given domain,
by specializing the concepts provided by the upper-level ontology.
▪ Task Ontologies describe the vocabulary required to perform generic tasks
or activities, again by specializing the concepts provided by the upper-level
ontology.
▪ Application Ontologies describe the vocabulary of a specific application,
whose concepts correspond, in general, to the roles performed by entities
in a given domain while performing some task or activity.

o Web Ontology Description Languages


▪ These languages are used to define ontologies.
▪ The layered model for the Semantic Web puts the relationship among
ontology description languages, RDF and RDF Schema, and XML in a better
perspective.
▪ The bottom layer offers character encoding (Unicode) and referencing (URI)
mechanisms.
▪ The second layer introduces XML as the document exchange standard.
▪ The third layer accommodates RDF and RDF Schema as mechanisms to
describe the resources available on the Web. They may be classified as
lightweight ontology languages.
▪ Full ontology description languages appear in the fourth layer as a way to
capture more semantics.
▪ The topmost layer introduces expressive rule languages.
o Other Languages
▪ Standard Generalized Markup Language (SGML) : a metalanguage to define
markup languages for documents, inspired the design of HTML and XML
▪ The Hypertext Markup Language (HTML) : language to format hypertext
documents for human reading.
▪ The Extensible Markup Language (XML) is a general-purpose markup
language for creating special-purpose markup languages.
• RDF : Resource Description Framework
o A general-purpose language for representing information about resources on the Web.
o Used to represent information about objects.
o A lightweight ontology language designed to support interoperability between
applications.
o XML Essentials
▪ XML is used to describe structured documents.
▪ Unlike HTML, users may create their own tags in XML, creating specific
markup languages, such as the ontology description languages.
▪ XML tags have no semantics indicating how to present documents in a Web
browser.
▪ An XML document consists of plain text and markup, in the form of tags,
which is interpreted by application programs.
▪ An element typically consists of a start-tag, the element content, and a
matching end-tag.
▪ Elements may be nested, defining a structure for the document.
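The points about tags, elements, and nesting can be illustrated with a short sketch that uses Python's standard `xml.etree.ElementTree` module. The document and its tag names (`catalogue`, `document`, `title`, `author`) are invented for this example; they show user-defined tags and are not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document with user-defined tags (assumption).
doc = """<catalogue>
  <document id="d1">
    <title>Unit 5 Notes</title>
    <author>A. Writer</author>
  </document>
</catalogue>"""

# The application program, not a browser, decides what the tags mean.
root = ET.fromstring(doc)


def outline(element, depth=0):
    """Return (depth, tag) pairs showing how the elements are nested."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


structure = outline(root)
title = root.find("./document/title").text  # content of the title element
```

Each element has a start-tag, content, and a matching end-tag; `structure` records the nesting that defines the document structure.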
o RDF has a very simple and flexible data model, based on the central concept of the
RDF statement.
o A resource is anything that has an identity.
o A Uniform Resource Identifier (URI) is a character string that identifies an abstract
or physical resource on the Web
o A URI reference (URIref) denotes the common usage of a URI, with an optional
fragment identifier attached to it and preceded by the character “#”. The URI of
such a reference is what remains after the fragment identifier is removed.
o An absolute URIref identifies a resource independently of the context.
o A relative URIref is a URIref with some prefix omitted. Information from the context
in which the URIref appears is required to fill in the omitted prefix. For example, the
relative URIref #PrivateDoc, appearing in a document identified by the URIref
http://www.cat.com/schema, is considered equivalent to the URIref
http://www.cat.com/schema#PrivateDoc
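The resolution of a relative URIref against its context can be tried with Python's standard `urllib.parse` module. This is a small sketch using the example from the notes; the variable names are chosen for illustration.

```python
from urllib.parse import urljoin, urldefrag

# URIref of the document in which the relative URIref appears.
base = "http://www.cat.com/schema"

# Resolve the relative URIref "#PrivateDoc" against the base.
absolute = urljoin(base, "#PrivateDoc")

# Split the absolute URIref back into the URI and its fragment identifier.
uri, fragment = urldefrag(absolute)
```

`urldefrag` shows the point made above about fragment identifiers: the URI part is what remains once the fragment is removed.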
o An XML namespace, or simply a namespace, is a collection of names.
o An RDF statement is a triple (S, P, O):
▪ S : the subject of the statement.
▪ P : the property (predicate), which denotes a binary relationship.
▪ O : the object. If O is a literal, it is also called the value of property P.
▪ The statement says that the resource s identified by the URIref S has
property P with value O.
▪ Literals are character strings that represent datatype values.
o RDF Vocabulary
▪ A vocabulary is a set of URIrefs and is therefore synonymous with an XML
namespace.
▪ A vocabulary V is frequently specified in one of two alternative ways.
▪ The first alternative:
• Select a fixed URIref U and a prefix p for it
• Define a set of qualified names with prefix p
• Define V as the set of URIrefs represented by such qualified names
▪ The second alternative:
• Select a fixed URIref U and a prefix p for it
• Define a set of fragment identifiers
• Define V as the set of URIrefs obtained by concatenating U, the
character “#” and each fragment identifier
▪ Example of the first alternative:
• Select a fixed URIref U: http://example.com/
• Define a prefix p: ex
• Define a set of qualified names with prefix p: ex:resource1,
ex:resource2, ex:resource3
• Define V as the set of URIrefs represented by such qualified names:
• http://example.com/resource1
• http://example.com/resource2
• http://example.com/resource3
▪ Example of the second alternative:
• Select a fixed URIref U: http://example.com
• Define a prefix p: ex
• Define a set of fragment identifiers: fragment1, fragment2, fragment3
• Define V as the set of URIrefs obtained by concatenating U, the
character "#", and each fragment identifier:
• http://example.com#fragment1
• http://example.com#fragment2
• http://example.com#fragment3
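The two ways of specifying a vocabulary can be sketched as two small functions. The function names are made up for this illustration, and the base URIref of the second call is written without a trailing slash so that plain concatenation with "#" gives the URIrefs listed above.

```python
def qualified_name_vocabulary(prefixes, qnames):
    """First alternative: expand qualified names such as 'ex:resource1'."""
    vocabulary = set()
    for qname in qnames:
        prefix, local = qname.split(":", 1)
        vocabulary.add(prefixes[prefix] + local)
    return vocabulary


def fragment_vocabulary(base, fragments):
    """Second alternative: concatenate base, '#', and each fragment identifier."""
    return {base + "#" + fragment for fragment in fragments}


v1 = qualified_name_vocabulary(
    {"ex": "http://example.com/"},
    ["ex:resource1", "ex:resource2", "ex:resource3"],
)
v2 = fragment_vocabulary(
    "http://example.com",
    ["fragment1", "fragment2", "fragment3"],
)
```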
o A catalogue is a collection of RDF statements used to describe documents using
metadata. However, the catalogue itself does not contain the actual documents, but
rather provides information about them. The catalogue is based on three native
vocabularies:
▪ docid: This vocabulary contains URIrefs (URI references) that serve as
fragment identifiers, uniquely identifying documents. These URIs can be
used to retrieve or reference specific documents.
▪ authid: This vocabulary contains URIrefs that uniquely identify authorities.
Authorities can be individuals, organizations, or entities responsible for
creating or managing the documents.
▪ cs: This vocabulary is a qualified name vocabulary, which means it contains
URIrefs that uniquely identify document types. These URIs represent
specific categories or classifications of documents.

o RDF Triples and Graphs
▪ The RDF graphs notation translates a set of RDF statements into a graph,
with nodes representing subjects or objects, and arcs representing
properties.
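A set of RDF statements and the graph it induces can be sketched with plain Python tuples. The URIrefs and literal values below are hypothetical, and the dictionary-of-arcs layout is just one convenient representation chosen for this sketch.

```python
from collections import defaultdict, namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical statements; plain strings without a scheme stand for literals.
statements = [
    Triple("http://example.com/docs#d1",
           "http://example.com/terms#author",
           "http://example.com/auth#a1"),
    Triple("http://example.com/docs#d1",
           "http://example.com/terms#title",
           "Unit 5 Notes"),
    Triple("http://example.com/auth#a1",
           "http://example.com/terms#name",
           "A. Writer"),
]


def to_graph(triples):
    """Map each subject node to its outgoing arcs (property, object)."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.predicate, t.object))
    return dict(graph)


def nodes(triples):
    """Nodes are all subjects and objects; properties label the arcs."""
    return {t.subject for t in triples} | {t.object for t in triples}


graph = to_graph(statements)
```

The author resource appears once as an object and once as a subject, so the graph has a single node for it with arcs on both sides.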


o RDF Schema
▪ RDF offers enormous flexibility but, apart from the rdf:type property, which
has a predefined semantics, it provides no means for defining application-
specific classes and properties.
▪ The "rdfs:Class" qualified name represents the concept of a class in the RDF
Schema vocabulary, and any resource identified as such can be considered a
class within the RDF data model.
▪ Classes are fundamental in organizing and categorizing resources in RDF. By
assigning resources to specific classes, relationships and hierarchies can be
established, enabling better organization and interpretation of data.
▪ A property is any instance of the class rdf:Property. The rdfs:domain
property is used to indicate that a particular property applies to a
designated class, and the rdfs:range property is used to indicate that the
values of a particular property are instances of a designated class or,
alternatively, are instances (i.e., literals) of an XML Schema datatype.
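One consequence of rdfs:domain and rdfs:range is that using a property lets a reasoner infer the types of its subject and object. The sketch below applies that rule to a few triples; the `ex:` names are hypothetical, the schema is a plain dictionary rather than RDF, and datatype ranges are not handled.

```python
RDF_TYPE = "rdf:type"

# Hypothetical schema: property -> (rdfs:domain class, rdfs:range class).
schema = {
    "ex:author": ("ex:Document", "ex:Person"),
}

triples = [
    ("ex:d1", RDF_TYPE, "ex:Document"),
    ("ex:a1", RDF_TYPE, "ex:Person"),
    ("ex:d1", "ex:author", "ex:a1"),
    ("ex:d2", "ex:author", "ex:a1"),  # ex:d2 has no declared type
]


def infer_types(triples, schema):
    """Return rdf:type statements implied by domain and range declarations."""
    inferred = set()
    for s, p, o in triples:
        if p in schema:
            domain, range_ = schema[p]
            inferred.add((s, RDF_TYPE, domain))  # subject is in the domain
            inferred.add((o, RDF_TYPE, range_))  # object is in the range
    return inferred - set(triples)  # keep only statements that are new


new_facts = infer_types(triples, schema)
```

Because `ex:d2` is used as the subject of `ex:author`, it is inferred to be an `ex:Document` even though no type was stated for it.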
• Web Ontology Language (OWL)
o The Web Ontology Language (OWL) is a language used to describe classes,
properties, and relations between conceptual objects, making web content
machine-interpretable.
o OWL is defined as a vocabulary, similar to RDF and RDF Schema, but with a richer
semantics.
o An ontology in OWL is a collection of RDF triples that utilizes the OWL vocabulary.
o OWL is organized into three sublanguages: OWL Lite, OWL DL, and OWL Full.
▪ OWL Lite allows hierarchies of classes and properties and simple
constraints, suitable for thesauri and simple ontologies.
▪ OWL DL increases expressiveness while maintaining decidability of the
classification problem, providing all OWL constructs with certain limitations.
▪ OWL Full is the most expressive language without limitations but disregards
decidability issues.
o The W3C defined a set of requirements to guide the specification
of OWL and to ensure that ontology description languages for the Semantic Web meet
certain criteria for compatibility, logic-based design, ontology identification, and
facilitation of development and reuse.
o Compatibility with XML:
▪ The ontology language should have an XML serialization syntax.
▪ XML Schema datatypes should be used when applicable.
o Based on description logic:
▪ Language should be designed based on the concepts of concept (class), role
(property), and individual.
▪ Support for expressions related to these concepts.
o Support for defining ontology vocabularies:
▪ Ontology should be identified by a URI reference.
▪ Classes, properties, and individuals within the ontology should be identified
by URI references.
o Facilitation of distributed development and versioning:
▪ Support for developing ontologies in a distributed manner.
▪ Ability to define different versions of the same ontology.
▪ Enable the reuse of previously defined ontologies.
o R1. Ontologies as distinct resources:
▪ Ontologies must have unique identifiers, such as URI references.
o R2. Unambiguous concept referencing with URIs:
▪ Concepts in different ontologies should have distinct absolute identifiers.
▪ Concepts in an ontology can be uniquely identified using URI references.
o R3. Explicit ontology extension:
▪ Ontologies should be able to explicitly extend other ontologies, reusing
concepts while adding new classes and properties.
▪ Ontology extension should be transitive, where A extends B, and B extends
C, implying A also extends C.
o R4. Commitment to ontologies:
▪ Resources should be able to explicitly commit to specific ontologies,
indicating the set of definitions and assumptions made.
o R5. Ontology metadata:
▪ It should be possible to provide metadata for each ontology, such as author,
publishing date, etc.
▪ Metadata properties may be borrowed from the Dublin Core element set.
o R6. Versioning information:
▪ The language should support comparing and relating different versions of
the same ontology.
▪ Features for relating revisions, stating backwards-compatibility, and
deprecating identifiers should be available.
o R7. Class definition primitives:
▪ The language should allow expressing complex definitions of classes,
including subclassing and Boolean combinations (intersection, union,
complement).
o R8. Property definition primitives:
▪ The language should enable expressing property definitions, including
subproperties, domain and range constraints, transitivity, and inverse
properties.
o R9. Datatypes:
▪ The language should provide a set of standard datatypes, possibly based on
XML Schema datatypes.
o R10. Class and property equivalence:
▪ The language should have features for stating that two classes or properties
are equivalent.
o R11. Individual equivalence:
▪ The language should have features for stating that different identifiers
represent the same individual.
o R12. Attaching information to statements:
▪ The language should provide a mechanism to attach additional information
(e.g., source, timestamp, confidence level) to statements.
o R13. Classes as instances:
▪ The language should support treating classes as instances, as the same
concept can be seen as both a class and an individual depending on the
perspective.
o R14. Cardinality constraints:
▪ The language should support specifying cardinality restrictions on
properties, defining the minimum and maximum number of related
individuals.
o R15. XML syntax:
▪ The language should have an XML serialization syntax for compatibility and
reusability with existing XML tools.
o R16. User-displayable labels:
▪ The language should support specifying multiple alternative user-
displayable labels for ontology resources, facilitating multilingual views.
o R17. Supporting a character model:
▪ The language should support the use of multilingual character sets.
o R18. Supporting uniqueness of Unicode strings:
▪ The language should address cases where different character sequences
may appear the same, ensuring uniform normalization and justifying any
deviations from Unicode Normal Form C.
o OWL ontologies typically begin with namespace declarations and include sentences
about the ontology itself, grouped under the "owl:Ontology" tag.
o If an ontology O1 imports another ontology O2, the declarations in O2 are appended
to O1. O1 may include a namespace declaration pointing to the URIref of O2 for
using its vocabulary.
o A datatype property is a binary relation between instances of a class and instances
of a datatype. It is declared using the "owl:DatatypeProperty" constructor.
o An object property is a binary relation between instances of two classes. It is
declared using the "owl:ObjectProperty" constructor.
o In OWL Lite, the terms "rdfs:domain" and "rdfs:range" from RDF Schema are used to
declare the domain and range of a property.
o The term "rdfs:subPropertyOf" is used to declare property hierarchies in OWL Lite.
o OWL Lite is a subset of OWL that offers hierarchies of classes and properties and
simple constraints, suitable for modeling thesauri and simple ontologies.
A Pseudo-relevance feedback framework combining relevance matching
and semantic matching for information retrieval

The authors of this paper propose a new framework for pseudo-relevance feedback (PRF) in
information retrieval that combines relevance and semantic matching to improve the quality of
feedback documents. The authors note that most current PRF methods only consider
relevance matching from the perspective of terms used to sort feedback documents, which can
lead to a semantic gap between query representation and document representation. To
address this, the authors used a reranking mechanism that calculates the exact terms and
semantic similarity between query and document representations using bidirectional encoder
representations from transformers (BERT).
The proposed PRF framework uses probability-based and language-model-based methods to
process the results of the first round of retrieval. Experiments conducted on four Text Retrieval
Conference (TREC) datasets show that the proposed models outperform the baseline models
in terms of mean average precision (MAP). The proposed framework consists of three steps:
reranking, relevance matching, and semantic matching. In the reranking step, a bidirectional
encoder representation from transformers (BERT) is used to calculate the information of exact
terms and semantic similarity between the query and document representations. This step
reduces the semantic gap between the query representation and the document
representation, and improves the quality of feedback documents. The reranking step does
not rerank all the documents, which reduces the computational cost of running BERT.
In the relevance matching step, the probability-based PRF method is used to calculate the
relevance between the feedback documents and the original query. The relevance is calculated
based on the probability that the feedback documents are relevant to the original query. In the
semantic matching step, the language-model-based PRF method is used to calculate the
relevance between the feedback documents and the original query. The relevance is calculated
based on the similarity between the feedback documents and the original query, as measured
by a language model. To evaluate the proposed models, the authors conducted experiments on
Text Retrieval Conference (TREC) datasets.
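The general pseudo-relevance feedback loop (retrieve, treat the top results as feedback documents, expand the query, retrieve again) can be sketched as follows. This is a toy illustration and not the authors' models: a simple term-count score stands in for the first-round retrieval model, the BERT reranking step is omitted, and the documents and query are invented.

```python
from collections import Counter

# Hypothetical mini-collection (assumption).
docs = {
    "d1": "owl ontology language",
    "d2": "owl ontology reasoning",
    "d3": "cooking pasta sauce",
    "d4": "ontology classes and properties",
}


def score(query_terms, text):
    """Toy relevance score: occurrences of query terms in the document."""
    counts = Counter(text.split())
    return sum(counts[t] for t in query_terms)


def retrieve(query_terms, docs):
    """Rank documents with a non-zero score, best first (ties by id)."""
    scored = [(score(query_terms, text), d) for d, text in docs.items()]
    return [d for s, d in sorted(scored, key=lambda x: (-x[0], x[1])) if s > 0]


def expand(query_terms, feedback_ids, docs, n_terms=1):
    """Add the most frequent non-query terms of the feedback documents."""
    counts = Counter(
        w for d in feedback_ids for w in docs[d].split() if w not in query_terms
    )
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return list(query_terms) + ordered[:n_terms]


query = ["owl"]
first_round = retrieve(query, docs)        # first round of retrieval
feedback = first_round[:2]                 # top-k feedback documents
expanded = expand(query, feedback, docs)   # query expansion
second_round = retrieve(expanded, docs)    # second round of retrieval
```

In this toy run the expanded query also reaches `d4`, which never mentions the original query term; that is the kind of vocabulary gap feedback is meant to close.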
The authors compared their proposed models with several baseline models, including a
baseline model without PRF, a model using only relevance matching for PRF, a model using
only semantic matching for PRF, and several state-of-the-art PRF models. The baseline models
were implemented using BM25 for initial retrieval and then applying PRF. To compare the
performance of the different models, the authors performed statistical significance tests using
the paired t-test. The results showed that the proposed models outperformed the baseline
models and the state-of-the-art PRF models in terms of MAP and P@10. Additionally, the
authors conducted an ablation study to investigate the contribution of different components of
the proposed models to their overall performance. The results showed that using semantic
matching in addition to relevance matching significantly improved the performance of the
models. The reranking mechanism also contributed to the improved performance of the
proposed models. Overall, the proposed PRF framework is a promising approach to improving
the quality of feedback documents in information retrieval. The proposed framework has both
theoretical and practical implications. By combining relevance matching and semantic
matching, the framework addresses the semantic gap between query representation and
document representation, and improves retrieval performance.
The experimental results demonstrate the effectiveness of the proposed framework and
suggest that it may be a useful tool for improving the accuracy of information retrieval
systems. However, the proposed framework heavily relies on the quality of the initial retrieval
results. If the initial retrieval results are not accurate, then the effectiveness of the proposed
framework may be limited. The proposed framework only considers the semantic similarity
between the query and the document representations, but it does not consider the domain-
specific knowledge or the user's search history. Incorporating domain-specific knowledge and
user's search history may further improve the effectiveness of the proposed framework.
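The evaluation metrics named in this summary, MAP and P@10, follow standard definitions and can be computed as in this sketch. The rankings and relevance judgments are invented for illustration and are not results from the paper.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where relevant documents occur."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant ids), one entry per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings for two queries (assumption).
runs = [
    (["d1", "d3", "d2"], {"d1", "d2"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d4"], {"d4"}),              # AP = 1/2
]
map_score = mean_average_precision(runs)
p_at_2 = precision_at_k(["d1", "d3", "d2"], {"d1", "d2"}, 2)
```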

A dummy-based user privacy protection approach for text information retrieval
• Text retrieval is used in various fields, such as web search, e-commerce websites, and
digital libraries. Texts are retrieved through user queries.
• The queries issued by the user should be protected. This privacy can be divided into
two levels: textual privacy and topic privacy.
• Textual privacy means that the query issued by a user is known to that user
alone. Other users cannot determine the query.
• Topic privacy means that other users cannot learn the user's topics of interest.
• This paper aims to protect user queries while preserving the accuracy and usability of
text retrieval.
• System model
o This system framework consists of an untrusted server and many trusted clients.
o When a client issues a query (say q0), a group of dummy queries is created,
taking q0 and the historical queries into consideration. The dummy queries are
sent to the server together with q0.
o The retrieved documents are then filtered. The client segregates the results of q0
from the other queries, and displays the result to the user. This is the “filtering
query component”.
o This method makes no change to the retrieval algorithm and the results are not
compromised.
o The topics of the dummy queries should also be unrelated to the user
query, i.e., semantically irrelevant, in order to implement topic privacy.
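The client-side flow of the system model can be sketched as below. This is a simplified illustration, not the paper's algorithm: dummy queries are drawn at random from a fixed pool instead of being chosen by topic features, and the query strings, document ids, and function names are all hypothetical.

```python
import random


def build_query_group(user_query, dummy_pool, n_dummies, rng):
    """Mix the real query with dummies and shuffle the submission order."""
    group = rng.sample(dummy_pool, n_dummies) + [user_query]
    rng.shuffle(group)
    return group


def untrusted_server(queries, index):
    """Stand-in for the server: answers every query with unchanged retrieval."""
    return {q: index.get(q, []) for q in queries}


def filter_results(user_query, results):
    """Client-side filtering: keep only the results of the real query."""
    return results[user_query]


# Hypothetical index and queries (assumption).
index = {
    "flu symptoms": ["doc7", "doc9"],
    "jazz history": ["doc2"],
    "tax forms": ["doc4"],
    "rose pruning": ["doc5"],
}
rng = random.Random(0)
group = build_query_group(
    "flu symptoms", ["jazz history", "tax forms", "rose pruning"], 2, rng
)
shown = filter_results("flu symptoms", untrusted_server(group, index))
```

The server sees three queries and cannot tell from the request alone which one is real, while the user still receives exactly the results of the original query.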
• Attack model
o The server is untrusted. This makes it the target for the attackers.
o Assume the attacker has obtained all the query sequences from the client.
o The attacker can guess the user queries by analyzing the keyword feature
distributions of the query text itself, and the relevance feature distributions of the
user historical query sequence.
o Assume that the attacker has mastered the keyword space, topic space and the
containing relationships between keywords and topics.
o Let us also assume that the attacker knows the retrieval algorithm used. Hence,
the attacker can input each of his obtained queries to the algorithm, and analyze
the output to speculate the user queries.
• Implementation algorithm
o Wikipedia was used for the implementation.
o In order to identify the topics contained in a user query q, concept titles are used
to represent query keywords, and a number of generalized categories are used
to represent query topics.
o First, the user keywords are identified, then the user topics. After that, dummy topics
are searched for that have similar topic features but are not semantically
related.
o The implementation was done in Java and run on a JVM with an Intel Core 2
Duo 3 GHz CPU and 2 GB of working memory.
• Results
o The algorithm gave good privacy protection on all three levels of privacy.
o Their method was not efficient, due to the processing of the dummy queries.
o Overall, the algorithm could mask the user query well.
• Inference
o This paper highlights the importance of privacy for user queries.
o The dummy queries generated in this algorithm
o The algorithm does not cause serious extra time and space overheads.
