XML Basics Extensible Markup Language: Divya Panta 21109
DIVYA PANTA
21109
What is XML
XML stands for eXtensible Markup Language.
Markup languages are designed for the
processing, definition and presentation of text.
The language specifies code for formatting, both
the layout and style, within a text file.
Tags are added to the document to provide the
extra information.
HTML tags tell a browser how to display the
document.
XML tags give a reader some idea of what the
data means.
What is XML Used For?
XML is often used for distributing data over the Internet.
XML is used in many aspects of web development.
XML is often used to separate data from presentation.
XML tags are not predefined like HTML tags are.
XML was designed to carry data - with focus on what data is.
XML stores data in plain text format. This provides a
software- and hardware-independent way of storing,
transporting, and sharing data.
With XML, data can be available to all kinds of "reading
machines" like people, computers, voice machines, news
feeds, etc.
Advantages of XML
Simplicity
Information coded in XML is easy to read and understand,
plus it can be processed easily by computers.
Openness
XML is a W3C standard,
endorsed by software industry market leaders.
Extensibility
There is no fixed set of tags.
New tags can be created as they are needed.
Self-description
XML documents can be stored without [schemas] because they contain
meta data; any XML tag can possess an unlimited number of attributes
such as author or version.
Contains machine-readable context information
Tags, attributes and element structure provide context information ...
opening up new possibilities for highly efficient search engines,
intelligent data mining, agents, etc.
Separates content from presentation
XML tags describe meaning not presentation.
The look and feel of an XML document can be controlled by XSL
stylesheets, allowing the look of a document (or of a complete Web
site) to be changed without touching the content of the document.
Multiple views or presentations of the same content are easily rendered.
Facilitates the comparison and aggregation of data
The tree structure of XML documents allows documents to be
compared and aggregated efficiently element by element.
Can embed multiple data types
XML documents can contain any possible data type — from
multimedia data (image, sound, video) to active components (Java
applets, ActiveX).
Rapid adoption by industry
Software AG, IBM, Sun, Microsoft, Netscape, DataChannel, SAP ...
Example of an HTML Document
<html>
<head><title>Example</title></head>
<body>
<h1>This is an example of a page.</h1>
<h2>Some information goes here.</h2>
</body>
</html>
Example of an XML Document
<?xml version="1.0"?>
<address>
<name>Alice Lee</name>
<email>[email protected]</email>
<phone>212-346-1234</phone>
<birthday>1985-03-22</birthday>
</address>
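As a rough illustration of how a program might read a document like this one, the sketch below uses Python's standard xml.etree.ElementTree module. The email value alice@example.com is a placeholder assumed for the example, and the names ADDRESS_XML and record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Address record modelled on the slide; the email value is a placeholder.
ADDRESS_XML = """<?xml version="1.0"?>
<address>
  <name>Alice Lee</name>
  <email>alice@example.com</email>
  <phone>212-346-1234</phone>
  <birthday>1985-03-22</birthday>
</address>"""

root = ET.fromstring(ADDRESS_XML)

# Each child element's tag names the field; its text holds the value.
record = {child.tag: child.text for child in root}

print(root.tag)
print(record["name"])
print(record["birthday"])
```

Because the tags name each field, the program can look up the birthday by name instead of relying on its position in the file.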
Difference Between HTML and XML
HTML tags have a fixed meaning and browsers know
what it is.
XML tags are different for different applications, and
users know what they mean.
HTML tags are used for display.
XML tags are used to describe documents and data.
XML Rules
Tags are enclosed in angle brackets.
Tags come in pairs with start-tags and end-tags.
Tags must be properly nested.
<name><email>…</name></email> is not allowed.
<name><email>…</email></name> is.
Tags that do not have end-tags must be terminated by a
‘/’.
<br /> is an HTML example.
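The nesting and end-tag rules can be tried out with any XML parser. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name parses and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested: email closes before name does.
good_nesting = parses("<name><email>x</email></name>")
# Overlapping tags: name closes while email is still open.
bad_nesting = parses("<name><email>x</name></email>")
# A tag without an end-tag must be terminated by a '/'.
empty_ok = parses("<p>line<br /></p>")
empty_bad = parses("<p>line<br></p>")

print(good_nesting, bad_nesting, empty_ok, empty_bad)
```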
More XML Rules
Tags are case sensitive.
<address> is not the same as <Address>
Names that begin with "xml", in any combination
of cases, are reserved and should not be used for tags.
Tags may not contain ‘<’ or ‘&’.
Tags follow Java naming conventions, except that
a single colon and some other characters, such as
hyphens and periods, are allowed.
They must begin with a letter or an underscore and
may not contain white space.
Documents must have a single root tag that begins
the document.
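A few of these rules can be checked by handing small fragments to a parser and seeing which ones it rejects. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper first_error and the sample fragments are hypothetical, and the exact wording of the error messages depends on the underlying parser.

```python
import xml.etree.ElementTree as ET

def first_error(text):
    """Return the parser's error message, or None if text is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None

checks = {
    # Tags are case sensitive, so </Address> does not close <address>.
    "case mismatch": "<address></Address>",
    # A bare '&' is not allowed; it has to be written as &amp;.
    "bare ampersand": "<name>Lee & Sons</name>",
    "escaped ampersand": "<name>Lee &amp; Sons</name>",
    # A document must have exactly one root element.
    "two roots": "<name>Alice</name><name>Bob</name>",
    # Tag names may not contain white space.
    "name with space": "<first name>Alice</first name>",
}

results = {label: first_error(text) for label, text in checks.items()}
for label, message in results.items():
    print(label, "->", message or "well-formed")
```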
Encoding
XML (like Java) uses Unicode to encode characters.
Unicode comes in many flavors. The most common one used
in the West is UTF-8.
UTF-8 is a variable length code. Characters are encoded in 1
to 4 bytes.
The first 128 characters in Unicode are ASCII. UTF-8 encodes
each of them in 1 byte.
The code points between 128 and 255 cover some of the more
common characters used in western Europe, such as
ã, á, å, or ç. UTF-8 encodes these in 2 bytes.
Two byte codes also cover other alphabets such as Greek,
Cyrillic, Hebrew and Arabic.
Three byte codes cover most Asian ideographs.
Four byte codes can handle any characters that are left.
Those using non-western languages may want to investigate
other Unicode encodings, such as UTF-16.
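One way to see the variable-length encoding in practice is to encode a few characters and count the bytes each one takes. The sample characters below are chosen purely for illustration.

```python
# Sample characters chosen for illustration (written as escapes so the
# source file itself stays plain ASCII).
samples = {
    "Latin capital A": "\u0041",
    "c with cedilla": "\u00e7",
    "euro sign": "\u20ac",
    "CJK ideograph": "\u4e2d",
    "emoji": "\U0001F600",
}

byte_lengths = {}
for label, ch in samples.items():
    encoded = ch.encode("utf-8")
    byte_lengths[label] = len(encoded)
    print(f"U+{ord(ch):04X} {label}: {len(encoded)} byte(s)")
```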
Well-Formed Documents
An XML document is said to be well-formed if it
follows all the rules.
An XML parser is used to check that all the rules
have been obeyed.
Recent browsers such as Internet Explorer 5 and
Netscape 7 come with XML parsers.
Parsers are also available for free download over
the Internet. One is Xerces, from the Apache
open-source project.
Java 1.4 also supports an open-source parser.
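The slides mention Xerces and the Java parser. As a comparable sketch in Python, the standard xml.etree.ElementTree module can serve as the well-formedness checker. The function name check_well_formed and the broken sample are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def check_well_formed(text):
    """Return (True, None) if text parses, else (False, (line, column))."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # position is a (line, column) tuple reported by the parser.
        return False, err.position
    return True, None

# The phone element is never closed, so the document breaks a rule.
broken = """<address>
  <name>Alice Lee</name>
  <phone>212-346-1234
</address>"""

fixed = broken.replace("212-346-1234", "212-346-1234</phone>")

ok_broken, where = check_well_formed(broken)
ok_fixed, _ = check_well_formed(fixed)

print("broken:", ok_broken, where)
print("fixed: ", ok_fixed)
```

The parser stops at the first rule it finds broken and reports where it was in the text, which is usually enough to locate the missing or mismatched tag.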
XML Example Revisited
<?xml version="1.0"?>
<address>
<name>Alice Lee</name>
<email>[email protected]</email>
<phone>212-346-1234</phone>
<birthday>1985-03-22</birthday>
</address>
Markup for the data aids understanding of its purpose.
A flat text file is not nearly so clear.
Alice Lee
[email protected]
212-346-1234
1985-03-22
The last line looks like a date, but what is it for?
Expanded Example
<?xml version = "1.0" ?>
<address>
<name>
<first>Alice</first>
<last>Lee</last>
</name>
<email>[email protected]</email>
<phone>123-45-6789</phone>
<birthday>
<year>1983</year>
<month>07</month>
<day>15</day>
</birthday>
</address>
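Nested elements like these can be walked as a tree. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the email element is left out of the sample, and the names EXPANDED_XML and outline are hypothetical.

```python
import xml.etree.ElementTree as ET

# Expanded address record modelled on the slide (email omitted).
EXPANDED_XML = """<?xml version="1.0"?>
<address>
  <name>
    <first>Alice</first>
    <last>Lee</last>
  </name>
  <phone>123-45-6789</phone>
  <birthday>
    <year>1983</year>
    <month>07</month>
    <day>15</day>
  </birthday>
</address>"""

def outline(element, depth=0):
    """Return one indented line per element, visiting children depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(EXPANDED_XML)
print("\n".join(outline(root)))

# Path-style lookups reach nested values directly.
full_name = root.findtext("name/first") + " " + root.findtext("name/last")
birthday = "-".join(
    root.findtext(f"birthday/{part}") for part in ("year", "month", "day")
)
print(full_name, birthday)
```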
XML Files are Trees
address