
Semantic Web

The document outlines the concept of the Semantic Web, its evolution from Web 1.0 to Web 4.0, and the technologies that enable it, such as XML, RDF, and ontologies. It emphasizes the need for a web that is machine-processable and capable of understanding data meaning, addressing issues like information overload and poor content aggregation. The document also discusses Tim Berners-Lee's vision for a collaborative and intelligent web, highlighting the importance of trust and logical assertions in the future of web technologies.

SEMANTIC WEB NOTES

(Academic Year: 2024-25)

B. Tech – IV Year – II Semester

UNIT - I

Introduction: Introduction to Semantic Web, the Business Case for the Semantic Web, XML,
and Its Impact on the Enterprise.

Part-I. Introduction to Semantic Web


1.1 What Is the Semantic Web?

 The Semantic Web definition given by Tim Berners-Lee in 1999 is

“The first step is putting data on the Web in a form that machines can naturally understand, or
converting it to that form. This creates a Semantic Web—a web of data that can be processed
directly or indirectly by machines.”
 The Semantic Web is presented as a logical extension of the current Web instead of a distant possibility.
 The Semantic Web is both achievable and desirable.
 According to the vision of Tim Berners-Lee we can define the Semantic Web as a machine
processable web of smart data.
 We can also further define smart data as data that is application-independent, composable,
classified, and part of a larger information ecosystem (ontology).
 The World Wide Web Consortium (W3C) has established an Activity dedicated to
implementing the vision of the Semantic Web. The following diagram shows an example of
the Semantic Web with a Web of documents and a Web of data.

1.2 The Evolution of the Web

The World Wide Web, commonly known as the web, has become an integral part of modern
society. It has revolutionized the way we communicate, work, and access information. However,
the web we know today has evolved significantly since its inception in 1989. In this section we are
going to see an overview of the evolution of the web, from its early days to the current era of web
4.0.

 Web 1.0: The Static Web

 In first-generation web technology, users can only read and share information on web pages.
It relies on bookmarking and hyperlinking, and the pages themselves are static.
 The web as we know it today began in 1989 when Tim Berners-Lee, a British computer
scientist, proposed a new system for sharing information over the internet.
 This system, which he called the World Wide Web, was based on a simple set of protocols
that allowed users to access and share information using hypertext links. The first website,
which was created by Berners-Lee, went live in 1991.
 In the early days of the web, websites were primarily static pages that provided information
to users.
 These pages were created using HTML (Hypertext Markup Language), a markup
language that allowed developers to structure web pages using tags. Websites were
primarily created by developers and were often difficult for users to navigate.

Web 2.0: The Dynamic Web

 In the second-generation web, we can read, write, and interact with each other. In Web
2.0, dynamic pages and user-generated content replace the static pages.
 The next stage in the evolution of the web was the emergence of Web 2.0 in the early
2000s. This era was characterized by the emergence of user-generated content and social
media platforms.
 The term Web 2.0 was popularized by Tim O’Reilly, a technology entrepreneur and writer, to
describe a new generation of web-based applications that were more interactive and
dynamic.
 Web 2.0 was characterized by the emergence of social media platforms such
as Facebook, Twitter, and LinkedIn. These platforms allowed users to create and share
content with each other, and they became a central part of everyday life for many people.
 The emergence of Web 2.0 also saw the rise of e-commerce, with companies such as
Amazon and eBay becoming dominant players in the online retail space.
 Web 3.0: The Semantic Web

 In the third-generation web, machines, and not only humans, can interpret information. Web
3.0 is also known as the Semantic Web.
 The next stage in the evolution of the web is Web 3.0, also known as the Semantic Web.
 The Semantic Web is characterized by a shift from the current web, which is focused on
content, to a web that is focused on meaning.
 The Semantic Web aims to create a more intelligent web that can understand and interpret
the meaning of information, making it easier for users to find what they are looking for.
 The Semantic Web is based on a set of technologies and standards, including RDF
(Resource Description Framework) and OWL (Web Ontology Language), which
allow data to be represented in a machine-readable format. This allows machines to
understand the meaning of information and to make intelligent decisions based on that
information.
 Web 4.0: The Intelligent Web

 The fourth-generation web is also known as the Symbiotic Web. In it, humans and
machines can interact with each other.
 Web 4.0, also known as the Intelligent Web, is the next stage in the evolution of the web.
This era is characterized by the emergence of artificial intelligence (AI) and machine
learning (ML) technologies, which are being used to create more intelligent and
personalized web experiences for users.
 The Intelligent Web is based on a combination of AI and ML technologies, including
natural language processing (NLP), image recognition, and predictive analytics. These
technologies allow websites to provide personalized recommendations, automate tasks,
and interact with users in a more human-like way.
 The Intelligent Web is already being used in a variety of industries, including healthcare,
finance, and e-commerce.
 For example, healthcare companies are using AI and ML to analyse patient data and to
develop personalized treatment plans.
 E-commerce companies are using these technologies to provide personalized
recommendations to users based on their browsing and purchasing history.

1.3 Vision of Tim Berners-Lee Towards future of the Web


 Tim Berners-Lee has a two-part vision for the future of the Web.
 The first part is to make the Web a more collaborative medium.
 The second part is to make the Web understandable, and thus processable, by
machines.
 Figure 1.1 is Tim Berners-Lee’s original diagram of his vision.
 Tim Berners-Lee’s original vision clearly involved more than retrieving Hypertext
Markup Language (HTML) pages from Web servers.
 In Figure 1.1 we see relations between information items like “includes,” “describes,”
and “wrote.” Unfortunately, these relationships between resources are not currently
captured on the Web.
 The technology to capture such relationships is called the Resource Description
Framework (RDF).

1.4. How do we create a web of data that machines can process?

 The first step is a paradigm shift in the way we think about data. Historically, data has been
locked away in proprietary applications.
 Data was seen as secondary to processing the data. This incorrect attitude gave rise to
the expression “garbage in, garbage out,” or GIGO. GIGO basically reveals the flaw in the
original argument by establishing the dependency between processing and data.

Figure 1.2 displays the progression of data along a continuum of increasing intelligence.
Figure 1.2 shows four stages of the smart data continuum.

 Text and databases (pre-XML).


 The initial stage where most data is proprietary to an application. Thus, the “smarts” are
in the application and not in the data.
 XML documents for a single domain.
 The stage where data achieves application independence within a specific domain. Data
is now smart enough to move between applications in a single domain.
 An example of this would be the XML standards in the healthcare industry, insurance
industry, or real estate industry.
 Taxonomies and documents with mixed vocabularies.
 In this stage, data can be composed from multiple domains and accurately classified in
a hierarchical taxonomy. In fact, the classification can be used for discovery of data.
 Simple relationships between categories in the taxonomy can be used to relate and thus
combine data. Thus, data is now smart enough to be easily discovered and sensibly combined
with other data.
 Ontologies and rules.
 In this stage, new data can be inferred from existing data by following logical rules. In
essence, data is now smart enough to be described with concrete relationships, and
sophisticated formalisms where logical calculations can be made on this “semantic algebra.”
 This allows the combination and recombination of data at a more atomic level and
very fine-grained analysis of data.
 In this stage, data no longer exists as a blob but as a part of a sophisticated microcosm.
 An example of this data sophistication is the automatic translation of a document in one
domain to the equivalent document in another domain.

1.5. Why Do We Need the Semantic Web?

 The Semantic Web is not just for the World Wide Web.
 It represents a set of technologies that will work equally well on internal corporate
intranets.

 The Semantic Web will resolve several key problems facing current information
technology architectures.
 Information Overload
 Information overload is the most obvious problem in need of a solution, and technology
experts have been warning us about it for 50 years. This condition results from having a rapid
rate of growth in the amount of information available, while days remain 24 hours long and
our brains remain in roughly the same state of development as they were when cavemen
communicated by scrawling messages in stone.
 Stovepipe Systems
 A stovepipe system is a system where all the components are hardwired to only work
together. Therefore, information only flows in the stovepipe and cannot be shared by other
systems or organizations that need it.
 For example, the client can only communicate with specific middleware that only
understands a single database with a fixed schema.
 Poor Content Aggregation
 Putting together information from disparate sources is a recurring problem in a number
of areas, such as financial account aggregation, portal aggregation, comparison shopping, and
content mining.
 Unfortunately, the most common technique for these activities is screen scraping.
 The main drawback of the screen-scraping method is that it scrapes pages written in
HTML, which describes the format (type size, paragraph spacing, etc.) but gives no clue
about the meaning of a document.

1.6. How Does XML Fit into the Semantic Web?

 XML is the syntactic foundation layer of the Semantic Web. All other technologies
providing features for the Semantic Web will be built on top of XML.
 Requiring other Semantic Web technologies (like the Resource Description
Framework) to be layered on top of XML guarantees a base level of interoperability.

 The technologies that XML is built upon are Unicode characters and Uniform
Resource Identifiers (URIs).
 The Unicode characters allow XML to be authored using international characters.
 URIs are used as unique identifiers for concepts in the Semantic Web.

1.7. Is XML enough?

 The answer is no, because XML only provides syntactic interoperability. In other words,
sharing an XML document adds meaning to the content, but only when both parties know
and understand the element names.
 For example, if I label something <price> $12.00 </price> and you label that field on
the invoice <cost> $12.00 </cost>, there is no way that a machine will know those two mean
the same thing unless Semantic Web technologies like RDF and ontologies are added.
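The price/cost mismatch can be made concrete with a small Python sketch. It is only an illustration: the `SAME_CONCEPT` table and the `amounts` helper are hypothetical stand-ins for the shared vocabulary that RDF and an ontology would provide, not part of any XML standard.

```python
import xml.etree.ElementTree as ET

# Two documents that describe the same amount under different element names.
invoice_a = ET.fromstring("<item><price>12.00</price></item>")
invoice_b = ET.fromstring("<item><cost>12.00</cost></item>")

# Hypothetical shared vocabulary: both local names map to one concept.
# On the Semantic Web an ontology would carry this knowledge instead.
SAME_CONCEPT = {"price": "amount", "cost": "amount"}


def amounts(doc):
    """Return {concept: value} for every child whose tag is in the mapping."""
    return {
        SAME_CONCEPT[child.tag]: float(child.text)
        for child in doc
        if child.tag in SAME_CONCEPT
    }


# Syntactically the two documents differ, because the tag names differ.
tags_match = invoice_a[0].tag == invoice_b[0].tag

# With the added mapping, both documents yield the same statement.
meaning_matches = amounts(invoice_a) == amounts(invoice_b)
```

Here `tags_match` is false while `meaning_matches` is true: the parser alone sees two unrelated element names, and only the extra mapping relates them.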

1.8. How Do Web Services Fit into the Semantic Web?

 Web services are software services identified by a URI that are described, discovered, and
accessed using Web protocols.
 The important point about Web services is that they consume and produce XML. Thus, the
first way that Web services fit into the Semantic Web is by furthering the adoption of XML,
or more smart data.
 The second way is discovery. As Web services proliferate, they become similar to Web
pages in that they are more difficult to discover.
 Semantic Web technologies will be necessary to solve the Web service discovery problem.
There are several research efforts under way to create Semantic Web-enabled Web
services. Figure 1.3 demonstrates the various convergences that combine to form Semantic
Web services.
 The third way that Web services fit into the Semantic Web is in enabling Web services
to interact with other Web services.

 Advanced Web service applications involving comparison, composition, or
orchestration of Web services will require Semantic Web technologies for such interactions to
be automated.
1.9. What’s after Web Services?
Web services complete a platform-neutral processing model for XML. The step after that is to
make both the data and the processing model smarter.
In other words, continue along the “smart-data continuum.” In the near term, this will move
along five axes: logical assertions, classification, formal class models, rules, and trust.
1.9.1 Logical assertions.

 An assertion is the smallest expression of useful information.


 How do we make an assertion? One way is to model the key parts of a sentence by
connecting a subject to an object with a verb.
 The Resource Description Framework (RDF) captures these associations
between subjects and objects.
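A minimal Python sketch of the subject-verb-object idea follows. It models assertions as plain tuples rather than using an RDF library, and every URI under `http://example.org/` is a made-up placeholder.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One assertion: a subject connected to an object by a verb (predicate)."""
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers, chosen only for illustration.
ALICE = "http://example.org/people#alice"
REPORT = "http://example.org/docs#report"
WROTE = "http://example.org/terms#wrote"
DESCRIBES = "http://example.org/terms#describes"
PROJECT = "http://example.org/projects#semweb"

statements = [
    Triple(ALICE, WROTE, REPORT),        # "Alice wrote the report."
    Triple(REPORT, DESCRIBES, PROJECT),  # "The report describes the project."
]


def objects_of(subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [t.obj for t in statements
            if t.subject == subject and t.predicate == predicate]
```

Calling `objects_of(ALICE, WROTE)` returns the list containing the report URI, which is the kind of relationship ("wrote", "describes") that plain HTML links do not capture.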
1.9.2 Classification.

 We classify things to establish groupings by which generalisations can be made.

 Just as we classify files on our personal computer in a directory structure, we will
continue to better classify resources on corporate intranets and even the Internet.

 The downside of classification systems is evident when examining different people’s
filesystem classification on their personal computers. Categories (or folder names) can be
arbitrary, and the membership criteria for categories are often ambiguous.

1.9.3 Formal class models.

 A formal representation of classes and relationships between classes to enable inference
requires rigorous formalisms even beyond conventions used in current object-oriented
programming languages like Java and C#.
 Ontologies are used to represent such formal class hierarchies, constrained properties,
and relations between classes.
 The W3C is developing a Web Ontology Language (abbreviated as OWL).
 Figure 1.5 is an illustrative example of the key components of an ontology.

 It is important to state that the concepts described so far (classes, subclasses, properties)
are not rigorous enough for inference. To each of these basic concepts, additional formalisms
are added. For example, a property can be further specialized as a symmetric property or a
transitive property. Here are the rules that define those formalisms:

If x = y, then y = x. (symmetric property)

If x = y and y = z, then x = z. (transitive property)
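As a rough illustration of how a reasoner could apply these two formalisms, the Python sketch below treats a property as a set of (x, y) pairs and derives the pairs that symmetry or transitivity implies. The property names `married_to` and `ancestor_of` and the individuals are invented for the example.

```python
def symmetric_closure(pairs):
    """If (x, y) holds for a symmetric property, (y, x) holds too."""
    return set(pairs) | {(y, x) for x, y in pairs}


def transitive_closure(pairs):
    """If (x, y) and (y, z) hold for a transitive property, so does (x, z)."""
    closure = set(pairs)
    while True:
        derived = {(x, z)
                   for x, y in closure
                   for y2, z in closure
                   if y == y2} - closure
        if not derived:
            return closure
        closure |= derived


# Hypothetical facts for a symmetric and a transitive property.
married_to = {("ann", "bob")}
ancestor_of = {("ann", "carl"), ("carl", "dina")}

married_inferred = symmetric_closure(married_to)
ancestor_inferred = transitive_closure(ancestor_of)
```

After this runs, `married_inferred` also contains `("bob", "ann")` and `ancestor_inferred` also contains `("ann", "dina")`, neither of which was stated explicitly.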


 The Web ontology language being developed by the W3C will have a UML presentation
profile as illustrated in Figure 1.6.
 The wide availability of commercial and open-source UML tools in addition to the
familiarity of most programmers with UML will simplify the creation of ontologies. Therefore,
a UML profile for OWL will significantly expand the number of potential ontologists.

1.9.4 Rules.

 With XML, RDF, and inference rules, the Web can be transformed from a collection of
documents into a knowledge base.
 An inference rule allows us to derive conclusions from a set of premises. A well-known
logic rule called “modus ponens” states the following:
 If P is TRUE, then Q is TRUE. P is TRUE. Therefore, Q is TRUE.
 An example of modus ponens is as follows:

An apple is tasty if it is not cooked. This apple is not cooked. Therefore, it is tasty.
 The Semantic Web can use information in an ontology with logic rules to infer new
information. Let’s look at a common genealogical example of how to infer the “uncle” relation
as depicted in Figure 1.7:
 If a person C is a male and “childOf” a person A, then person C is a “sonOf” person A.
 If a person B is a male and siblingOf a person A, then person B is a “brotherOf” person
A.
 If a person C is a “sonOf” person A, and person B is a “brotherOf” person A, then person
B is the “uncleOf” person C.
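The three rules above can be sketched in Python with set comprehensions. This is only a toy forward-chaining example: the names Alice, Bob, and Carl are invented, and a real system would express the rules in an ontology and rule language rather than in application code.

```python
# Known facts (hypothetical individuals).
male = {"Bob", "Carl"}
child_of = {("Carl", "Alice")}    # Carl is a child of Alice
sibling_of = {("Bob", "Alice")}   # Bob is a sibling of Alice

# Rule 1: a male child of A is a son of A.
son_of = {(c, a) for c, a in child_of if c in male}

# Rule 2: a male sibling of A is a brother of A.
brother_of = {(b, a) for b, a in sibling_of if b in male}

# Rule 3: if C is a son of A and B is a brother of A, then B is an uncle of C.
uncle_of = {(b, c)
            for c, a in son_of
            for b, a2 in brother_of
            if a == a2}
```

The resulting `uncle_of` set holds the single pair `("Bob", "Carl")`, a fact that was never stated directly but follows from the premises.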

1.9.5 Trust.

 Instead of having trust be a binary operation of possessing the correct credentials, we
can make trust determination better by adding semantics.
 For example, we may want to allow access to information if a trusted friend vouches
(via a digital signature) for a third party. Digital signatures are crucial to the “web of trust.”
 In fact, by allowing anyone to make logical statements about resources, smart
applications will only want to make inferences on statements that they can trust. Thus, verifying
the source of statements is a key part of the Semantic Web.

These five directions of the Semantic Web (logical assertions, classification, formal class
models, rules, and trust) will move corporate intranets and the Web into a semantically rich
knowledge base where smart software agents and Web services can process information and
achieve complex tasks.

Part-II. The Business Case for the Semantic Web


“The business market for this integration of data and programs is huge. The companies who
choose to start exploiting Semantic Web technologies will be the first to reap the rewards.”

2.1. What Is the Semantic Web Good For?


 Traditional knowledge management techniques face new challenges from today’s
Internet: information overload, the inefficiency of keyword searching, the lack of
authoritative (trusted) information, and the lack of natural language-processing
computer systems.
 The Semantic Web can bring structure to information chaos. To get at our
knowledge, we need to do more than dump information into files and databases.
 Figure 2.1 provides a view of how the organization can revolve around the corporate
Semantic Web, impacting virtually every piece of the organization.

 We may have projects that could share lessons learned, provide competitive intelligence
information, and save us a lot of time and work.
 If we had a corporate knowledge base that could be searched and analysed by software
agents, we could have Web based applications that save us a lot of time and money.

The following sections provide some of these examples.


 Decision Support
 Having knowledge—not just data—at the fingertips allows us to make better decisions.
 Business Development

 It is important for members of the organization to have up-to-the-minute information
that could help us win business.
 E-commerce industry experts believe that the Semantic Web can be used in
matchmaking for e-business.
 Matchmaking is a process in which businesses are put in contact with potential business
partners or customers.
 The opportunities for maximizing the business opportunities with Semantic Web
technologies are limitless.

 Information Sharing and Knowledge Discovery

 Information sharing and communication are paramount in any organization, but as most
organizations grow and collect more information, this is a major struggle.
 We all understand the importance of not reinventing the wheel, but how many times
have we unintentionally duplicated efforts? When organizations get larger, communication
gaps are inevitable.
 With a little bit of effort, a corporate knowledge base could at least include a registry of
descriptions of projects and what each team is building.
 Imagine how easy it would be for the employees to be able to find relevant information.
Using Semantic Web enabled Web services can allow us to create such a registry.

 Administration and Automation

 A side effect of having such a knowledge base is the ability of software programs to
automate administrative tasks.
 Booking travel is one example where the Semantic Web and Web services
could aid in making a painful task easy.
 Making travel arrangements can be an administrative nightmare. Everyone has personal
travel preferences and must take items such as the following into consideration:
 Transportation preference (car, train, bus, plane)
 Hotel preference and rewards associated with hotel
 Airline preference and frequent flyer miles
 Hotel proximity to meeting places
 Hotel room preferences (nonsmoking, king, bar, wireless network in lobby)
 Rental car options and associated rewards
 Price (lodging and transportation per diem rates for the company)

Part-III XML

3. Introduction to XML:
3.1 What Is XML?

 XML has become the universal syntax and framework for exchanging data between
organizations. By agreeing on a standard schema, organizations can produce text
documents that can be validated, transmitted, and parsed by any application regardless of
hardware or operating system.
 XML provides a universally accepted language for creating semantically rich new markup
languages in a particular domain.
 In other words, we can apply XML to create new markup languages.
 Languages created via the rules of XML, like the Math Markup Language
(MathML) and the Chemical Markup Language (CML), are called applications of XML.

 A markup language’s primary concern is how to add semantic information about the
raw content in a document; thus, the vocabulary of a markup language is the external “marks”
to be attached or embedded in a document.

3.2 History of XML:


 SGML (Standard Generalized Markup Language) is an international standard for
the definition of markup languages; that is, it is a metalanguage.
 Markup consists of notations called tags that specify the function of a piece of text or
how it is to be displayed.
 The following diagram shows the hierarchical structure of the invention of XML, with its
parent markup language as well as its child languages.

3.3 Why is XML so successful?

XML has four primary accomplishments, which are given below.


 XML creates application-independent documents and data.
 It has a standard syntax for metadata.
 It has a standard structure for both documents and data.
 XML is not a new technology (not a 1.0 release); it is a simplified subset of SGML.

3.4 What are the Characteristics of XML?

There are three important characteristics of XML that make it useful in a variety of systems
and solutions:

 XML is extensible: XML allows you to create your own self-descriptive tags, or
language, that suit the application.

 XML carries the data, does not present it: XML allows you to store the data
irrespective of how it will be presented.

 XML is a public standard: XML was developed by an organization called the
World Wide Web Consortium (W3C) and is available as an open standard.

3.5 XML Usage

A short list of XML usage says it all:


 XML can work behind the scenes to simplify the creation of HTML documents for
large web sites.
 XML can be used to exchange information between organizations and
systems.
 XML can be used for offloading and reloading of databases.
 XML can be used to store and arrange data in a way that is customized to your data
handling needs.
 XML can easily be merged with style sheets to create almost any desired
output.
 Virtually, any type of data can be expressed as an XML document.

3.6 The Difference between XML and HTML

XML and HTML were designed with different goals:

 XML was designed to carry data - with focus on what data is


 HTML was designed to display data - with focus on how data looks

 XML tags are not predefined like HTML tags

 XML documents form a tree structure that starts at "the root" and branches to "the
leaves".

3.7 XML Syntax:


Following is a complete XML document:

<?xml version="1.0"?>
<contact_info>
  <name>Rajesh</name>
  <company>TCS</company>
  <phone>9333332354</phone>
</contact_info>

You can notice there are two kinds of information in the above example:

 markup, like <contact_info> and

 the text, like Rajesh etc.
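One way to see the split between markup and text is to parse the document and list the two separately. The sketch below uses Python's standard `xml.etree.ElementTree` module; the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0"?>
<contact_info>
  <name>Rajesh</name>
  <company>TCS</company>
  <phone>9333332354</phone>
</contact_info>"""

root = ET.fromstring(document)

# Markup: the element names that structure the document.
markup = [root.tag] + [child.tag for child in root]

# Text: the character data carried inside the child elements.
text = [child.text for child in root]
```

After parsing, `markup` holds the four element names (`contact_info`, `name`, `company`, `phone`) and `text` holds the three values (`Rajesh`, `TCS`, `9333332354`).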

The following diagram depicts the syntax rules to write different types of markups and text in
an XML document.

Let us see each component of the above diagram in detail:
3.8 XML Declaration

The XML document can optionally have an XML declaration. It is written as below:
<?xml version="1.0" encoding="UTF-8"?>

Where version is the XML version and encoding specifies the character encoding used in
the document.

 Syntax Rules for XML declaration

 The XML declaration is case sensitive and must begin with "<?xml", where
"xml" is written in lower case.

 If the document contains an XML declaration, it must be the first
statement of the XML document.

 The HTTP protocol can override the value of encoding that you put in the XML
declaration.
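The "first statement" rule can be checked with a short Python sketch. The `is_well_formed` helper is a hypothetical convenience wrapper around the standard-library parser, which reports a `ParseError` for documents that break XML syntax rules.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# The declaration is the very first thing in the document.
declaration_first = b'<?xml version="1.0" encoding="UTF-8"?>\n<note>hi</note>'

# A single newline before the declaration already breaks the rule.
declaration_late = b'\n<?xml version="1.0" encoding="UTF-8"?>\n<note>hi</note>'

# The declaration is optional, so omitting it is fine.
no_declaration = b"<note>hi</note>"

results = [is_well_formed(doc)
           for doc in (declaration_first, declaration_late, no_declaration)]
```

Only the second document is rejected: the declaration may be left out entirely, but when present nothing may come before it, not even whitespace.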
3.9 Tags and Elements

An XML file is structured as a set of XML elements, also called XML nodes or XML
tags. Element names are enclosed in angle brackets < > as shown below:

<element>

 Syntax Rules for Tags and Elements

Element Syntax: Each XML element must be closed, either with a matching end tag
as shown below:

<element> ........... </element>

or, for an empty element, with a single self-closing tag:

<element/>

3.10 Nesting of elements:

An XML element can contain multiple XML elements as its children, but the child
elements must not overlap; i.e., an end tag of an element must have the same name as that of
the most recent unmatched start tag.

Following example shows incorrect nested tags, because contact_info is closed before its
child element company:

<?xml version="1.0"?>
<contact_info>
  <company>TCS
</contact_info>
</company>

Following example shows correct nested tags:

<?xml version="1.0"?>
<contact_info>
  <company>TCS</company>
</contact_info>
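The nesting rule can be verified mechanically. In this Python sketch, `is_well_formed` is a hypothetical helper around the standard-library parser, and the two strings are compact versions of an overlapping and a properly nested document.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# company is still open when contact_info is closed: the elements overlap.
overlapping = "<contact_info><company>TCS</contact_info></company>"

# company is closed before its parent contact_info: proper nesting.
nested = "<contact_info><company>TCS</company></contact_info>"

overlapping_ok = is_well_formed(overlapping)
nested_ok = is_well_formed(nested)
```

The parser rejects the overlapping document because the end tag does not match the most recent unmatched start tag, and accepts the nested one.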

3.11 Root element:

An XML document can have only one root element. For example, following is not a correct
XML document, because both the x and y elements occur at the top level without a root
element:
<x>...</x>

<y>...</y>

The following example shows a correctly formed XML document:

<root>

<x>...</x>

<y>...</y>

</root>

3.12 Case sensitivity:


The names of XML elements are case-sensitive. That means the names in the start tag and the
end tag must be in exactly the same case.
For example, <contact_info> is different from <Contact_Info>.
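The single-root rule and the case-sensitivity rule can both be tried out with the standard-library parser. This is a small sketch; `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Two top-level elements: there is no single root.
two_roots = "<x>1</x><y>2</y>"

# The same elements wrapped in one root element.
one_root = "<root><x>1</x><y>2</y></root>"

# Start and end tag differ only in case, so they do not match.
mixed_case = "<contact_info>Rajesh</Contact_Info>"

checks = {
    "two_roots": is_well_formed(two_roots),
    "one_root": is_well_formed(one_root),
    "mixed_case": is_well_formed(mixed_case),
}
```

Only `one_root` passes. The other two documents are rejected as not well formed, so no element tree is produced for them at all.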

3.13 Attributes:

An attribute specifies a single property for the element, using a name/value pair. An
XML element can have one or more attributes. For example:
<a href="http://www.w3.org/Index.html">XML is Meta Language</a>

Here href is the attribute name and http://www.w3.org/Index.html is the attribute value.

 Syntax Rules for XML Attributes


Attribute names in XML (unlike HTML) are case sensitive. That is, HREF and href are
considered two different XML attributes.
The same attribute cannot appear twice in one start tag. The following example shows
incorrect syntax because the attribute b is specified twice:

<a b="x" c="y" b="z">. </a>

Attribute names are defined without quotation marks, whereas attribute values must always
appear in quotation marks. Following example demonstrates incorrect xml syntax:

<a b=x>. </a>

In the above syntax, the attribute value is not defined in quotation marks.
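The attribute rules above can be confirmed with a brief Python sketch using the standard-library parser. `is_well_formed` is a hypothetical helper, and the element and attribute names follow the examples in this section.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


duplicate_attribute = '<a b="x" c="y" b="z">.</a>'  # b is specified twice
unquoted_value = "<a b=x>.</a>"                     # value lacks quotation marks
valid_attributes = '<a b="x" c="y">.</a>'

attribute_checks = [is_well_formed(doc) for doc in
                    (duplicate_attribute, unquoted_value, valid_attributes)]

# HREF and href are two different attributes, so both may appear together.
element = ET.fromstring('<a HREF="1" href="2"/>')
both_attributes = element.attrib
```

The first two documents are rejected and the third is accepted, while `both_attributes` contains separate entries for `HREF` and `href`.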
3.14 XML References

References usually allow you to add or include additional text or markup in an XML
document. References always begin with the symbol "&", which is a reserved character, and
end with the symbol ";". XML has two types of references:

Entity References: An entity reference contains a name between the start and the end
delimiters. For example, &amp; where amp is name. The name refers to a predefined string
of text and/or markup.

Character References: A character reference, such as &#65;, contains a hash mark (“#”)
followed by a number. The number always refers to the Unicode code point of a character. In this
case, 65 refers to the letter "A".
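A parser resolves both kinds of reference before handing text to the application. The Python sketch below shows this with the standard-library parser; the element name `t` is arbitrary.

```python
import xml.etree.ElementTree as ET

# &amp; is an entity reference (predefined name "amp");
# &#65; is a character reference (Unicode code point 65).
element = ET.fromstring("<t>Tom &amp; Jerry &#65;</t>")
resolved = element.text

# &lt; and &gt; are also predefined, so markup characters can appear as text.
escaped = ET.fromstring("<t>&lt;b&gt;</t>").text
```

After parsing, `resolved` is the string `Tom & Jerry A` and `escaped` is the string `<b>`; the references themselves never reach the application.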

3.15. XML Tree Structure:

An Example XML Document

The image above represents books in this XML:

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J. K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>

XML documents are formed as element trees.

An XML tree starts at a root element and branches from the root to child elements. All
elements can have sub elements (child elements):

<root>
<child>
<subchild> .............. </subchild>
</child>
</root>

The terms parent, child, and sibling are used to describe the relationships between
elements.

Parents have children. Children have parents. Siblings are children on the same
level (brothers and sisters).
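The parent, child, and sibling relationships can be explored by walking a parsed tree. This Python sketch uses a shortened two-book version of the bookstore document; because the standard-library element objects do not store a link to their parent, the `parent_of` dictionary is built by hand as an illustration.

```python
import xml.etree.ElementTree as ET

bookstore = ET.fromstring("""<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>""")

# Children of the root: one entry per book element.
titles = [book.find("title").text for book in bookstore]

# Siblings: the children of the first book sit on the same level.
siblings = [child.tag for child in bookstore[0]]

# Parents: map every element to the element that contains it.
parent_of = {child: parent
             for parent in bookstore.iter()
             for child in parent}

first_title = bookstore[0][0]
title_parent = parent_of[first_title].tag
root_of_book = parent_of[bookstore[0]].tag
```

Here `titles` lists both book titles, `siblings` is `["title", "price"]`, the parent of the first title is a `book` element, and the parent of that book is the `bookstore` root.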

3.16 What are the Principles of XML

 The first key principle of XML is markup is separate from content. A corollary to that
principle is that markup can surround or contain content.
 An XML element is an XML container consisting of a start tag, content (contains
character data, sub elements, or both), and an end tag—except for empty elements, which use
a single tag denoting both the start and end of the element.
 The content of an element can be other elements. Following is an example of an element:

<footnote>

<author> Michael C. Daconta </author>

<title> Java Pitfalls </title>

</footnote>

Here we have one element, called “footnote,” which contains character data and two
subelements: “author” and “title.”

 The second key principle of XML is this: A document is classified as a member of a
type by dividing its parts, or elements, into a hierarchical structure known as a tree.

3.17. XML DTD: (Document Type Definition)

 An XML document with correct syntax is called "Well Formed".

 An XML document validated against a DTD is both "Well Formed" and "Valid".

 A "Valid" XML document is a "Well Formed" XML document, which also conforms to
the rules of a DTD.
 A DTD defines the basic building blocks of an XML document.

 The purpose of a DTD is to define the structure of an XML document. It defines the
structure with a list of legal elements.
A document type definition defines the rules and the legal elements and attributes for an
XML document.

<!DOCTYPE book [
<!ELEMENT book (title, author, price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT price (#PCDATA)>
]>

The DTD above is interpreted like this:

 !DOCTYPE book defines that the root element of the document is book

 !ELEMENT book defines that the book element must contain the elements:

"title, author, price”


 !ELEMENT title defines the title element to be of type "#PCDATA"
 !ELEMENT author defines the author element to be of type "#PCDATA"
 !ELEMENT price defines the price element to be of type "#PCDATA"

Note: PCDATA: Parsed Character Data (text that the parser examines for markup and entities),

CDATA: Character Data (text that the parser does not parse).


There are two types of DTDs:
1) Internal / Embedded DTD.

2) External DTD.

1) Internal / Embedded DTD.

<?xml version="1.0" encoding="UTF-8"?>


<!DOCTYPE student [
<!ELEMENT student (id, name, age, addr, email, ph)>
<!ELEMENT id (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT addr (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT ph (#PCDATA)> ]>
<student>
<id>543</id>
<name>Ravi</name>
<age>21</age>
<addr>Guntur</addr>
<email>[email protected]</email>
<ph>9855555</ph>
</student>
2) External DTD.
“student.dtd”

<!ELEMENT student (id, name, age, addr, email)>


<!ELEMENT id (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT addr (#PCDATA)>
<!ELEMENT email (#PCDATA)>

Save the above code as “student.dtd” and prepare “student.xml” as follows...

“student.xml”

<?xml version="1.0" encoding="UTF-8"?>


<!DOCTYPE student SYSTEM "student.dtd">
<student>
<id>543</id>
<name>Ravi</name>
<age>21</age>
<addr>Guntur</addr>
<email>[email protected]</email>
</student>
In the above example, the line <!DOCTYPE student SYSTEM "student.dtd"> links the external
DTD file “student.dtd” to our “student.xml” file.
If the XML document follows all the rules defined in the DTD, it is a valid document;
otherwise it is an invalid document.
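
The idea of checking a document against a declared structure can be sketched in a few lines. Python's standard library does not include a DTD validator, so the example below is only a hand-written approximation: the list EXPECTED_CHILDREN is a hypothetical stand-in for a content model matching the elements in “student.xml”, the function name matches_sequence is invented for illustration, and the e-mail address is a placeholder. A real validating parser would read the rules from the DTD itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a content model such as (id, name, age, addr, email).
# A real validating parser would read this from the DTD instead.
EXPECTED_CHILDREN = ["id", "name", "age", "addr", "email"]

def matches_sequence(xml_text, root_name, expected_children):
    """True if the root element has exactly the expected children, in order."""
    root = ET.fromstring(xml_text)
    if root.tag != root_name:
        return False
    return [child.tag for child in root] == expected_children

VALID_DOC = (
    "<student><id>543</id><name>Ravi</name><age>21</age>"
    "<addr>Guntur</addr><email>ravi@example.com</email></student>"
)
# Same document with an undeclared extra child element.
INVALID_DOC = VALID_DOC.replace("</student>", "<gender>male</gender></student>")

valid_result = matches_sequence(VALID_DOC, "student", EXPECTED_CHILDREN)
invalid_result = matches_sequence(INVALID_DOC, "student", EXPECTED_CHILDREN)
print(valid_result, invalid_result)
```

The second document is still well formed, but it would not be valid against such a content model because it contains an element that the model does not declare.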

3.18. Why Should XML Documents Be Well-Formed and Valid?

 The XML specification defines two levels of conformance for XML documents:
well-formed and valid. Well-formedness is mandatory, while validity is optional.
 A well-formed XML document complies with all the W3C syntax rules of XML
(explicitly called out in the XML specification as well-formedness constraints) like naming,
nesting, and attribute quoting.
 Well Formed XML Documents
An XML document with correct syntax is called "Well Formed".
The syntax rules are given below.
 XML documents must have a root element
 XML elements must have a closing tag
 XML tags are case sensitive
 XML elements must be properly nested
 XML attribute values must be quoted

 This requirement guarantees that an XML processor can parse (break into identifiable
components) the document without error.
 If a compliant XML processor encounters a well-formedness violation, the specification
requires it to stop processing the document and report a fatal error to the calling application.
 A valid XML document references and satisfies a schema.
 Valid XML Documents
 A "well formed" XML document is not the same as a "valid" XML document.
 A "valid" XML document must be well formed. In addition, it must conform to a
document type definition or schema.

There are two different document type definitions that can be used with XML:

 DTD - The original Document Type Definition


 XML Schema - An XML-based alternative to DTD
 A schema is a separate document whose purpose is to define the legal elements,
attributes, and structure of an XML instance document.
 A schema defines a particular type or class of documents.
 The markup language constrains the information to be of a certain type to be considered
“legal.”
 W3C-compliant XML processors check for well-formedness but may not check for
validity.
 Validation is often a feature that can be turned on or off in an XML parser.
 Validation is time-consuming and not always necessary.
 It is generally best to perform validation either as part of document creation or
immediately after creation.
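
The well-formedness rules listed above can be tried out directly. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, which is a non-validating parser: it raises ParseError on a well-formedness violation and does not check validity. The sample documents and the function name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the text parses without a well-formedness error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# One sample per syntax rule; only the first one is well formed.
samples = {
    "ok": "<note><to>Ravi</to></note>",
    "missing closing tag": "<note><to>Ravi</note>",
    "case mismatch": "<note><To>Ravi</to></note>",
    "bad nesting": "<note><b><i>hi</b></i></note>",
    "unquoted attribute": "<note id=1>hi</note>",
    "two roots": "<a/><b/>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(name, "->", "well formed" if ok else "fatal error")
```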

3.19. What Is XML Schema?


 XML Schema is a definition language that enables us to constrain conforming XML
documents to a specific vocabulary and a specific hierarchical structure.
 The things that we want to define in our language are element types, attribute types,
and the composition of both into composite types (called complex types).

 XML Schema is analogous to a database schema, which defines the column names and
data types in database tables.
 XML Schema became a W3C Recommendation (synonymous with standard) on May
5, 2001.
 XML Schema is not the only definition language, and you may hear about others like
Document Type Definitions (DTDs), RELAX NG, and Schematron (see the sidebar titled
“Other Schema Languages”).
 As shown in Figure 3.5, we have two types of documents: a schema document (or
definition document) and multiple instance documents that conform to the schema.
 A good analogy to remember the difference between these two types of documents is
that a schema definition is a blueprint (or template) of a type and each instance is an incarnation
of that template. This also demonstrates the two roles that a schema can play:
 Template for a form generator to generate instances of a document type
 Validator to ensure the accuracy of documents

 XML Schemas allow validation of instances to ensure the accuracy of field values and
document structure at the time of creation.
 The accuracy of fields is checked against the type of the field; for example, a quantity
typed as an integer or money typed as a decimal.
 The structure of a document is checked for things like legal element and attribute
names, correct number of children, and required attributes.
 All XML documents should be checked for validity before they are transferred to
another partner or system.

3.20. What Do Schemas Look Like?


 An XML Schema uses XML syntax to declare a set of simple and complex type
declarations.
 A type is a named template that can hold one or more values.
 Simple types hold one value.
 Complex types are composed of multiple simple types.
 A type has two key characteristics: a name and a legal set of values.
 Let’s look at examples of both simple and complex types.
 A simple type is an element declaration that includes its name and value constraints.

Here is an example of an element called “author” that can contain any number of text
characters:

Example: <xsd:element name="author" type="xsd:string" />

The preceding element declaration enables an instance document to have an element like this:

<author> Mike Daconta </author>


 The type attribute in the element declaration declares the type to be “xsd:string”.
 A string is a sequence of characters.
 There are many built-in data types defined in the XML Schema specification. Table 3.4
lists the most common.

 If a built-in data type does not constrain the values the way the document designer
wants, XML Schema allows the definition of custom data types.

 A complex type is an element that either contains other elements or has attached
attributes.
 Let’s first examine an element with attached attributes and then a more complex element
that contains child elements.
 Here is a definition for a book element that has two attributes called “title” and “pages”:

<xsd:element name="book">

<xsd:complexType>

<xsd:attribute name="title" type="xsd:string" />

<xsd:attribute name="pages" type="xsd:int" />

</xsd:complexType>

</xsd:element>

An XML instance of the book element would look like this:

<book title="More Java Pitfalls" pages="453" />


 Now let’s look at how we define a “product” element with both attributes and child
elements.
 The product element will have three attributes: id, title, and price. It will also have two
child elements: description and category.
 The category child element is mandatory and repeatable, while the description child
element is optional:

<xsd:element name="product">

<xsd:complexType>

<xsd:sequence>

<xsd:element name="description" type="xsd:string" minOccurs="0" maxOccurs="1" />

<xsd:element name="category" type="xsd:string" minOccurs="1" maxOccurs="unbounded" />

</xsd:sequence>

<xsd:attribute name="id" type="xsd:ID" />

<xsd:attribute name="title" type="xsd:string" />

<xsd:attribute name="price" type="xsd:decimal" />

</xsd:complexType>

</xsd:element>

 Here is an XML instance of the product element defined previously:

<product id="P01" title="Wonder Teddy" price="49.99">

<description>The bestselling teddy bear of the year. </description>

<category> toys </category>

<category> stuffed animals </category>

</product>
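
To make the occurrence and type constraints of the product definition concrete, the sketch below checks an instance by hand. It is not an XML Schema validator (Python's standard library does not provide one); the function check_product is hypothetical and hard-codes the rules stated above: description occurs at most once, category occurs at least once, children appear in sequence order, and price must be a decimal. The Decimal conversion only approximates xsd:decimal.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation

def check_product(xml_text):
    """Return a list of constraint violations (empty when none are found)."""
    product = ET.fromstring(xml_text)
    problems = []
    tags = [child.tag for child in product]
    n_desc = tags.count("description")
    n_cat = tags.count("category")
    if n_desc > 1:
        problems.append("description may occur at most once")   # maxOccurs="1"
    if n_cat < 1:
        problems.append("at least one category is required")    # minOccurs="1"
    # xsd:sequence: the optional description comes first, then the categories.
    if tags != ["description"] * n_desc + ["category"] * n_cat:
        problems.append("unexpected or out-of-order child elements")
    price = product.get("price")
    if price is not None:
        try:
            Decimal(price.strip())  # rough approximation of xsd:decimal
        except InvalidOperation:
            problems.append("price is not a decimal")
    return problems

GOOD = ('<product id="P01" title="Wonder Teddy" price="49.99">'
        "<description>The bestselling teddy bear of the year.</description>"
        "<category>toys</category><category>stuffed animals</category></product>")
BAD = '<product id="P02" title="No Category" price="cheap"><description>x</description></product>'

good_problems = check_product(GOOD)
bad_problems = check_product(BAD)
print(good_problems)
print(bad_problems)
```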

3.21. What Are XML Namespaces?

 Namespaces are a simple mechanism for creating globally unique names for the
elements and attributes of the markup language.
 This is important for two reasons: to deconflict the meaning of identical names in
different markup languages and to allow different markup languages to be mixed together
without ambiguity.
 Unfortunately, namespaces were not fully compatible with DTDs, and therefore their
adoption has been slow.
 The current markup definition languages, like XML Schema, fully support namespaces.

 Namespaces are implemented by requiring every XML name to consist of two parts: a
prefix and a local part. Here is an example of a fully qualified element name:
<xsd:integer>
 The local part is the identifier for the meta data (in the preceding example, the local
part is “integer”), and the prefix is an abbreviation for the actual namespace in the namespace
declaration.
 The actual namespace is a unique Uniform Resource Identifier (URI). Here is a sample
namespace declaration:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

The preceding example declares a namespace for all the XML Schema elements to be used in
a schema document.

It defines the prefix “xsd” to stand for the namespace
“http://www.w3.org/2001/XMLSchema”.

 It is important to understand that the prefix is not the namespace.


 The prefix can change from one instance document to another.
 The prefix is merely an abbreviation for the namespace, which is the URI.
 To specify the namespace of the new elements we are defining, we use the
targetNamespace attribute:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.mycompany.com/markup">
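
The point that the prefix is only an abbreviation for the namespace URI can be demonstrated with a parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, which expands every qualified name to the form {namespace-URI}local-part. The two one-line documents are made up for illustration and differ only in the prefix they choose.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Two documents that bind different prefixes to the same namespace URI.
doc_a = '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"/>'
doc_b = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>'

tag_a = ET.fromstring(doc_a).tag
tag_b = ET.fromstring(doc_b).tag

print(tag_a)           # the expanded name: {URI}local-part
print(tag_a == tag_b)  # the prefix plays no part in the element's identity
```

Because both elements resolve to the same URI and local part, a namespace-aware application treats them as the same element even though the prefixes differ.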

 XML Schema Example

The following example shows a schema for employee details.

employee.xsd

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="employee">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="firstname" type="xsd:string"/>
<xsd:element name="lastname" type="xsd:string"/>
<xsd:element name="email" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>

The following example shows the employee XML document (XML instance) for the above
employee schema.

employee.xml

<?xml version="1.0"?>
<employee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="./employee.xsd">
<firstname>vimal</firstname>
<lastname>jaiswal</lastname>
<email>[email protected]</email>
</employee>

DTD vs XSD

There are many differences between DTD (Document Type Definition) and XSD (XML Schema
Definition). In short, DTD provides less control on XML structure whereas XSD (XML schema)
provides more control.

The important differences are given below:

No.  DTD                                        XSD
1)   DTD stands for Document Type Definition.   XSD stands for XML Schema Definition.
2)   DTDs are derived from SGML syntax.         XSDs are written in XML.
3)   DTD doesn't support datatypes.             XSD supports datatypes for elements and attributes.
4)   DTD doesn't support namespaces.            XSD supports namespaces.
5)   DTD gives only limited control over the    XSD gives finer control over the order and number
     order and number of child elements.        of child elements (minOccurs, maxOccurs).
6)   DTD is not extensible.                     XSD is extensible.
7)   DTD uses its own syntax, which must be     XSD is simpler to learn because it is written in
     learned separately.                        XML, so no new syntax is needed.
8)   DTD provides less control over XML         XSD provides more control over XML structure.
     structure.

3.22. What is Parsing?


 XML parsing is the process of reading an XML document and providing an interface to
the user application for accessing the document. An XML parser is a software apparatus that
accomplishes such tasks.
 An XML parser is a software library or package that provides interfaces for client
applications to work with an XML document.

 An XML parser checks that the document is well formed and, optionally, validates it.

 Parsers can be categorized as validating and non-validating.
 Validating parser: It requires a document type declaration (DTD) and reports an error if
the document does not conform to the DTD and its constraints.
 Non-validating parser: It does not validate against the DTD and checks only that the
document is well formed.

Types of Parsers:

 There are three common ways to parse an XML document: by using the Simple API
for XML (SAX), by building a Document Object Model (DOM), and by employing a new
technique called pull parsing.
 SAX is a style of parsing called event-based parsing where each information class in the
instance document generates a corresponding event in the parser as the document is traversed.

 SAX parsers are useful for parsing very large XML documents or in low-memory
environments.
 Pull parsing is a new technique that aims for both low-memory consumption and high
performance.
 Pull parsing is also an event-based parsing technique; however, the events are read by
the application (pulled) and not automatically triggered as in SAX.
 The majority of applications use the DOM approach to parse XML.
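
The difference between SAX (events pushed to the application) and pull parsing (events requested by the application) can be sketched side by side. The example below assumes Python's standard xml.sax module and the XMLPullParser class from xml.etree.ElementTree; the class name TagCollector and the sample document are hypothetical.

```python
import xml.sax
import xml.etree.ElementTree as ET

XML_TEXT = ("<book><title>Java Pitfalls</title>"
            "<author>Michael C. Daconta</author></book>")

class TagCollector(xml.sax.ContentHandler):
    """SAX style: the parser calls startElement for every start tag it meets."""

    def __init__(self):
        super().__init__()
        self.started = []

    def startElement(self, name, attrs):
        self.started.append(name)

# SAX: hand control to the parser, which pushes events into the handler.
handler = TagCollector()
xml.sax.parseString(XML_TEXT.encode("utf-8"), handler)
sax_tags = handler.started

# Pull: feed data to the parser, then read events when the application wants them.
pull_parser = ET.XMLPullParser(events=("start",))
pull_parser.feed(XML_TEXT)
pull_parser.close()
pull_tags = [element.tag for event, element in pull_parser.read_events()]

print(sax_tags)
print(pull_tags)
```

Both styles report the same sequence of start tags; what differs is who drives the loop. In SAX the parser calls the handler, while in pull parsing the application iterates over the events itself.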

3.23. What Is the Document Object Model (DOM)?

 The Document Object Model (DOM) is a language-neutral data model and application
programming interface (API) for programmatic access and manipulation of XML and HTML.
 The Document Object Model is an in-memory representation of an XML or HTML
document and methods to manipulate it.
 DOMs can be loaded from XML documents, saved to XML documents, or dynamically
generated by a program.
 The DOM has provided a standard set of classes and APIs for browsers and
programming languages to represent XML and HTML.
 The DOM is represented as a set of interfaces with specific language bindings to those
interfaces.
 Unlike XML instances and XML schemas, which reside in files on disk, the DOM is an
in-memory representation of a document.

The model for this memory representation is object-oriented programming (OOP).


 Object-oriented programming introduces two key data modelling concepts: classes and
objects.
 A class is a definition or template describing the characteristics and behaviors of a real-
world entity or concept.
 From this description, an in-memory instance of the class can be constructed, which is
called an object.

 An object is a specific instance of a class.
 Figure 3.6 graphically portrays a class and two objects

The DOM in Figure 3.7 can also be accessed using specific subclasses of Node for each major
part of the document like Document, DocumentFragment, Element, Attr (for attribute), Text,
and Comment.

This more object-oriented tree is displayed in Figure 3.8.
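
The DOM operations mentioned above (loading a document into memory, manipulating it through node objects, and saving it back to XML) can be sketched as follows. This is a minimal illustration, assuming Python's standard xml.dom.minidom module, which implements a subset of the DOM interfaces; the sample document is made up.

```python
from xml.dom.minidom import parseString

# Load: build an in-memory DOM from XML text.
dom = parseString("<footnote><author>Michael C. Daconta</author></footnote>")
root = dom.documentElement  # the Element node for <footnote>

# Manipulate: create an Element node and a Text node, then attach them.
title = dom.createElement("title")
title.appendChild(dom.createTextNode("Java Pitfalls"))
root.appendChild(title)

# Navigate: Element and Text are specific kinds of Node.
author_text = root.getElementsByTagName("author")[0].firstChild.data
child_names = [child.nodeName for child in root.childNodes]

# Save: serialize the modified tree back to XML text.
serialized = root.toxml()
print(child_names)
print(serialized)
```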

What Are the DOM Levels?

The DOM has steadily evolved by increasing the detail of the representation, increasing the
scope of the representation, and adding new manipulation methods. This is accomplished by
dividing the DOM into conformance levels, where each new level adds to the feature set.

There are currently three DOM levels:
 DOM Level 1. This set of classes represents XML 1.0 and HTML 4.0 documents.
 DOM Level 2. This extends Level 1 to add support for namespaces; cascading style
sheets, level 2 (CSS2); alternate views; user interface events; and enhanced tree manipulation
via interfaces for traversal and ranges.
 DOM Level 3. This extends Level 2 by adding support for mixed vocabularies (different
namespaces), XPath expressions, load and save methods, and a representation of abstract
schemas (includes both DTD and XML Schema).
 XPath is a language to select a set of nodes within a document. Load and save methods
specify a standard way to load an XML document into a DOM and a way to save a DOM into
an XML document.

Part- IV
4.1. Impact of XML on Enterprise IT

 XML is spreading through all areas of the enterprise, from the IT department to the
intranet, extranet, Web sites, and databases.
 XML has become integrated with the majority of commercial products on the market,
either as a primary or enabling technology.
 The current and future impact of XML in 10 specific areas is given below.
 Data exchange and interoperability.
 XML has become the universal syntax for exchanging data between organizations. By
agreeing on a standard schema, organizations can produce text documents that can be
validated, transmitted, and parsed by any application regardless of hardware or operating
system.
 The government has become a major adopter of XML and is moving all reporting
requirements to XML. Companies report financial information via XML, and local
governments report regulatory information.
 XML has been called the next Electronic Data Interchange (EDI) system, which
formerly was extremely costly, was cumbersome, and used binary encoding.

 E-business
 Business-to-business (B2B) transactions have been revolutionized through XML. B2B
revolves around the exchange of business messages to conduct business transactions.
 There are dozens of commercial products supporting numerous business vocabularies
developed by RosettaNet, OASIS, and other organizations.
 Enterprise Application Integration (EAI).
 Enterprise Application Integration is the assembling of legacy applications, databases,
and systems to work together to support integrated Web views, e-commerce, and Enterprise
Resource Planning (ERP).
 The Open Applications Group is a nonprofit consortium of companies that defines standards
for application integration. It currently boasts over 250 live sites and more than 100 vendors
(including SAP, PeopleSoft, and Oracle) supporting the Open Applications Group Integration
Specification (OAGIS) in their products.
 Enterprise IT architectures.
 The impact of XML on IT architectures has grown increasingly important as a bridge
between the Java 2 Enterprise Edition (J2EE) platform and Microsoft’s .NET platform.
 Large companies are implementing both architectures and turning to XML Web services
to integrate them.
 Additionally, XML is influencing development on every tier of the N-tier network. On
the client tier, XML is transformed via XSLT to multiple presentation languages like Scalable
Vector Graphics (SVG).
 On the Web tier, XML is used primarily as the integration format of choice and merged
in middleware.
 XML is used to configure and deploy applications on the Web tier, such as Java Server Pages
(JSP) and Active Server Pages (ASP).
 In the back-end tier, XML is being stored and queried in relational databases and native
XML databases.
 Content Management Systems (CMS).
 CMS is a Web-based system to manage the production and distribution of content to
intranet and Internet sites.

 XML technologies are central to these systems in order to separate raw content from
its presentation.
 Content can be transformed on the fly via the Extensible Stylesheet Language
Transformation (XSLT) to browsers or wireless clients.
 Knowledge management and e-learning.
 Knowledge management involves the capturing, cataloging, and distribution of
corporate knowledge on intranets.
 Corporate knowledge is treated as an asset.
 Electronic learning (e-learning) is part of the knowledge acquisition for employees
through online training.
 XML is driving the future of knowledge management.
 Portals and data integration
 A portal is a customizable, multipaned view tailored to support a specific community of
users.
 XML is supported via standard transformation portlets that use XSLT to generate
specific presentations of content (as discussed previously under Content Management
Systems), syndication of content, and the integration of Web services.
 A portlet is a dynamically pluggable application that generates content for one pane (or
sub window) in a portal.
 Syndication is the reuse of content from another site. The most popular format for
syndication is an XML-based format called the Resource Description Framework Site
Summary (RSS).
 All the major portal vendors are integrating Web services into their portal products.
 Customer relationship management (CRM)
 CRM systems enable an organization’s sales and marketing staff to understand, track,
inform, and service their customers. CRM involves many of the other systems we have
discussed here, such as portals, content management systems, data integration, and databases
(see next item), where XML is playing a major role. XML is becoming the glue to tie all these
systems together to enable the sales force or customers (directly) to access information when
they want and wherever they are (including wireless).

 Databases and data mining.
 XML has had a greater effect on relational database management systems (DBMS) than
object-oriented programming did (which produced object-oriented database management
systems, or OODBMS).
 XML has created a new category of databases, called native XML databases, built
exclusively for the storage and retrieval of XML.
 All the major database vendors have responded to this challenge by supporting XML
translation between relational tables and XML schemas.
 Additionally, all of the database vendors are further integrating XML into their systems
as a native data type.
 This trend toward storing and retrieving XML will accelerate with the completion of the
W3C XQuery specification.
 Collaboration technologies and peer-to-peer (P2P)
 Collaboration technologies allow individuals to interact and participate in joint activities
from disparate locations over computer networks.
 P2P is a specific decentralized collaboration protocol.
 XML is being used for collaboration at the protocol level.

4.2 Why Meta Data Is Not Enough

 XML meta data is a form of description.


 It describes the purpose or meaning of raw data values via a text format to more easily
enable exchange, interoperability, and application independence.
 Meta data increases the fidelity and granularity of our data. The way to think about
the current state of meta data is that we attach words (or labels) to our data values to
describe it.
 How could we attach sentences? What about paragraphs? While the approach toward
meta data evolution will not follow natural language description, it is a good analogy
for the inadequacy of words alone.

 The more computers understand, the more effectively they can handle complex
tasks.
 We have not yet invented all the ways a semantically aware computing system can
drive new business and decrease your operation costs. But to get there, we must push
beyond simple meta data modelling to knowledge modelling and standard knowledge
processing. Here are three emerging steps beyond simple meta data: semantic levels,
rule languages, and inference engines.
 Semantic Levels
o The following figure shows the evolution of data fidelity required for
semantically aware applications.

 Instead of just meta data, we will have an information stack composed of semantic
levels. We are currently at Level 1 with XML Schema, which is represented as
modelling the properties of our data classes.
 We are capturing and processing meta data about isolated data classes like purchase
orders, products, employees, and customers.

 On the left side of the diagram, we associate a simple physical metaphor to the state
of each level.
 Level 1 is analogous to describing singular concepts or objects.
 In Level 2, we will move beyond data modelling (simple meta data properties) to
knowledge modelling. This includes the Resource Description Framework (RDF)
and taxonomies.
 Knowledge modelling enables us to model statements both about the relationships
between Level 1 objects and about how those objects operate. This is diagrammed as
connections between our objects in Figure 3.9.
 Beyond the knowledge statements of Level 2 are the superstructures or “closed world
modelling” (CWM) of Level 3. The technology that implements these sophisticated
models of systems is called ontologies.
 Rules and Logic
 The semantic levels of information provide the input for software systems.
 The operations that a software system uses to manipulate the semantic information
will be standardized into one or more rule languages.
 In general, a rule specifies an action if certain conditions are met. The general form
is this: if (x) then y.
 Inference Engines
 Applying rules and logic to semantic data requires standard, embeddable inference
engines. These programs will execute a set of rules on a specific instance of data using
an ontology.
 An early example of these types of inferencing engines is the open-source software
Closed World Machine (CWM).
 CWM is an inference engine that allows you to load ontologies or closed worlds, then
it executes a rule language on that world.
 So, meta data is a starting point for semantic representation and processing. The rise of
meta data is related to the ability to reuse meta data between organizations and systems.
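
The "if (x) then y" form of a rule and the role of an inference engine can be illustrated with a toy example. The sketch below is not CWM and does not use any rule language standard; it is a hypothetical forward-chaining loop over (subject, predicate, object) statements, with made-up names and a single made-up rule: if two different people have the same parent, then they are siblings.

```python
def infer(facts, rules):
    """Apply every rule repeatedly until no new statement is produced."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for new_fact in rule(known):  # rule returns a fresh set of conclusions
                if new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known

def sibling_rule(known):
    """if (X hasParent P and Y hasParent P and X != Y) then X siblingOf Y."""
    parents = [(s, o) for (s, p, o) in known if p == "hasParent"]
    return {(x, "siblingOf", y)
            for (x, px) in parents
            for (y, py) in parents
            if px == py and x != y}

# Hypothetical statements about two Level 1 objects and their relationship.
facts = {
    ("Ravi", "hasParent", "Sita"),
    ("Vimal", "hasParent", "Sita"),
}

inferred = infer(facts, [sibling_rule]) - facts
print(sorted(inferred))
```

The engine derives statements that were never entered explicitly, which is the step beyond plain meta data that this section describes.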

Part-V: Topics beyond the Syllabus

 Importance of Semantic Web:

 The Semantic Web's expansion and the tools it brings to the table are putting machines'
analytical skills to work in the areas of content creation, management, learning, support,
media, ecommerce, scientific research, knowledge management, and publishing in
general.

 Knowledge will become meaningful anywhere we convey it.

 The developing semantic web of material and data offers great potential for
intelligent content, semantic search, and smart devices.

 The Semantic Web will continue to give birth to new careers, businesses, and global
innovators.

 Publishers can use Semantic Web Technologies to:

 Create intelligent digital content infrastructures

 Connect disparate content silos throughout a large corporation.

 Make use of information to create more immersive experiences.

 Connect content sets from both inside and outside the company.

 Invest in real-world augmented and artificial intelligence.

 Enhance your authoring experiences and workflows.

 Real-world applications of Semantic Web:

Here are some applications of semantic web in real world:

 The oil and gas industry was reported to be using RDF/OWL in 2007 to combine data
from various sources and standardize data exchange, sharing, and integration across
partners or applications. It was also feasible to handle knowledge collaboratively.

 The BBC website employed semantic web technology to dynamically display material
during the 2010 FIFA World Cup.

 Facebook announced Open Graph in April 2010, allowing web publishers to
incorporate their websites into Facebook's social graph.

 Facebook may then utilise this information to figure out what a user likes, provide
customised suggestions, and connect individuals with similar interests.
