
Extensible Markup Language Parsing Techniques

This document discusses XML parsing techniques. It begins with an introduction to XML and its uses for storing structured data and communicating between applications. It then describes two approaches for XML parsing - event-driven using SAX and tree-based parsing which builds a tree representation of the XML document. The document presents an XML parser that implements a subset of the XML specification to extract information and check validity of XML documents.

Uploaded by

BaneeIshaqueK

Seminartopicsonline.com

Extensible Markup Language Parsing Techniques
Abstract
XML is a language widely used in developing web applications. It is a set of rules for
designing structured data in a text format, as opposed to a binary format, which makes
the data readable by both humans and machines. A parser performs lexical and
syntactic analysis. An XML parser extracts the information from an XML document, a
task that almost every Web application needs. The Simple Object Access Protocol (SOAP)
is a protocol that lets a program send XML over HTTP to invoke methods on remote
objects. An XML parser can serve as an engine for implementing this or a comparable
protocol. An XML parser can also be used to exchange data messages formatted as XML
over HTTP. By adding XML and HTTP capabilities to an application, software developers
can begin to offer alternatives to traditional browsers that have significant value to
their customers. This paper presents an XML parser that implements a subset of the XML
specification. It is useful to developers and users for checking the
well-formedness and validity of XML documents.
1. Introduction
The World Wide Web Consortium created an SGML working group to build a set
of specifications that are easy and straightforward to use. The resulting subset,
called XML, keeps the advantages of SGML (extensibility, structure, and validation)
in a language that is much easier to learn, use, and implement than full SGML. XML is
fully internationalized for both European and Asian languages, with all conforming
processors required to support the Unicode character set in both its UTF-8 and
UTF-16 encodings. The language is designed for the quickest possible client-side
processing consistent with its primary purpose as an electronic publishing and data
interchange format.
Most applications need to save some configuration data, and often need to transmit
or receive data to or from other applications. This is especially true for software that
interacts with the Internet. If you need a format for interchanging such data, one
solution is to design your own binary format. A binary format can store complex
structures, lists, arrays, etc., but it has drawbacks: it is not easy to read, and
modifying it causes compatibility problems. As an alternative we can use a simple
text-based format, which is easy to use but not powerful. XML provides a more general
solution. It is a text-based, hierarchical format that combines the advantages of the
binary and text-based worlds: it is easy to use but also powerful. Although it was
primarily designed for the Web, it can be used for
any application that needs to store data or communicate with other applications.
This paper presents an XML parser that implements a subset of the XML
specification. The goal of an XML parser is to extract information from XML
documents. It takes an XML file as input and parses it. There are two
different ways of doing this. One way is to use an event-driven approach. The SAX
parser is the model for this approach. The second approach builds a tree to represent
the XML document, limiting the size of XML documents that can be parsed by this
approach. The DOM parser is the model for this approach.

2. Extensible Markup Language
The eXtensible Markup Language came out of the world of the Standard Generalized
Markup Language (SGML). Initially XML was developed to overcome the
shortcomings of HTML, a markup language containing stylistic information. The aim
of XML's developers was to create a language that was easy
to use over the Internet, supported by a wide variety of applications, compatible with
SGML, and legible to humans. Like its ancestor SGML, XML separates content from
style.
A typical XML document is hierarchical. It is made up of elements delimited by tags. A
document type definition (DTD), or an XML schema, is used to define the structure of a
document. An XML document is referred to as well formed if it conforms to the syntax
rules of the XML standard, and valid if it also complies with a DTD or schema. At the
core of an XML application is an XML parser. All XML parsers check that the documents
they receive are well formed, and most also check whether these documents are valid.
2.1 Valid
A valid XML document meets the following criteria:
(i) It meets the validity constraints
(ii) Validity constraints are referred to as VC or VCs
(iii) The parser checks and determines whether the validity constraints are fulfilled
(iv) There are no parser errors
(v) It has to contain a DTD or reference one
(vi) XML documents without a DTD must still be well formed
2.2 Well-formed
A well-formed XML document meets the following criteria:
(i) It contains one or more elements (+ means one or more)
(ii) There is exactly one root element, also called the document element
(iii) Elements are properly nested
(a) Children are nested inside their parent
(b) Elements without child elements exist by themselves
(iv) Every child element has one parent element
(v) A child element is said to be in the content of the parent
(vi) A child cannot be in the content of any other element that is in the content of
the parent
(vii) The parent can have 0, 1, or more children (* means zero or more)
(viii) Well-formedness constraints are referred to as WFC or WFCs
(ix) Violation of a WFC is a fatal error, and the parser is supposed to stop passing
document data in the normal manner (e.g., it raises an exception in Java or triggers an
error-handling routine)
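The criteria above can be tried out with a short sketch. It is not the parser presented in this paper; it assumes Python's standard xml.etree.ElementTree module, which raises ParseError on the first well-formedness violation. The helper name is_well_formed and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The underlying parser stops at the first well-formedness violation.
        return False
    return True


# Illustrative inputs, one per criterion discussed above.
samples = {
    "<note><to>Ann</to><body/></note>": "properly nested, one root",
    "<note><to>Ann</note></to>": "improperly nested",
    "<a/><b/>": "two root elements",
    "": "no elements at all",
}
for text, description in samples.items():
    print(f"{description}: {is_well_formed(text)}")
```

Only the first sample is accepted. Note that this checks well-formedness only; the standard-library parser used here does not validate against a DTD.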
3. XML Parsing
The XML Parser must be compliant with a complete set of W3C interfaces to ensure
interoperability with applications and web-based technologies.
The functionality of an XML parser includes:
(i) Checking the well-formedness of a document
(ii) Searching with faster algorithms
(iii) Validation of an XML document
4. Types of Parsers
There are two classes of XML parsers: validating parsers and non-validating parsers.
All true XML parsers must report violations of the well-formedness constraints of the
XML specification. Validating parsers must also report violations of the constraints
expressed by the declarations in the DTD. Non-validating parsers are required only to
check that the DTD is well formed. They are not required to understand and use the DTD
for document checking. There are some exceptions, however, in particular for attributes.
There are two different ways for an XML parser to extract information from an XML
document. One way is to use an event-driven approach. The parser begins reading
the input and sends messages when certain events occur. For example, a message is
sent when a start tag is encountered and another when an end tag is reached.
Programs that use these parsers have callback functions to process the events. When
a message signals that a desired tag has been found, the program can examine
the tag and its accompanying information in detail and act accordingly. The SAX
(Simple API for XML) parser is the model for this approach. The second approach
builds a tree to represent the XML document. Each tag of the document is represented
by a node in the tree. Once the tree is built, a program can traverse it to process the
document or to search for specific tags. Usually these trees reside in memory,
limiting the size of XML documents that can be parsed by this approach. In contrast,
event-driven parsers do not create a tree and can parse documents of any size.
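As a minimal sketch of the event-driven approach, the following uses Python's standard xml.sax module rather than the parser described in this paper. The class name EventRecorder, the helper collect_events, and the sample document are assumptions made for illustration; startElement and endElement are the callback names defined by the SAX ContentHandler interface.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Callback object: the parser calls these methods as events occur."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called each time a start tag is encountered.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        # Called each time an end tag is reached.
        self.events.append(("end", name))


def collect_events(text: str):
    """Parse `text` and return the list of recorded start/end events."""
    handler = EventRecorder()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.events


events = collect_events('<order id="7"><item>pen</item></order>')
for event in events:
    print(event)
```

No tree is kept here: the handler sees each tag once, in document order, and only what it chooses to record stays in memory.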
Most XML parsers are either event driven or produce an in-memory DOM (Document
Object Model) instance of the document. The one we use depends on the application
and memory requirements. Producing a DOM tree requires more memory, but it can
provide greater programmatic flexibility. SAX may be suitable for applications that
need smaller memory footprints, and it can process the XML document as a stream
of events.
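For comparison, a tree-based sketch using Python's standard xml.dom.minidom module is given below. It is an illustration of the DOM style, not the parser presented here; the helper tag_names and the sample document are hypothetical.

```python
from xml.dom import minidom


def tag_names(node):
    """Depth-first list of element names at and below `node`."""
    names = []
    if node.nodeType == node.ELEMENT_NODE:
        names.append(node.tagName)
    for child in node.childNodes:
        names.extend(tag_names(child))
    return names


# The whole document is parsed into an in-memory tree before any processing.
document = minidom.parseString(
    "<order id='7'><item>pen</item><item>ink</item></order>"
)
root = document.documentElement

all_names = tag_names(root)
order_id = root.getAttribute("id")
item_texts = [item.firstChild.data for item in root.getElementsByTagName("item")]
print(all_names, order_id, item_texts)
```

Because the tree stays in memory, the program can revisit any node or search for specific tags as often as it likes, which is the flexibility described above, at the cost of holding the full document.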
5. Developing XML Parsers
The following steps are followed when developing an XML parser.
(i) Canonical XML
XML parsers generally work with canonical XML. This is the XML we are left with after
an XML document is preprocessed. This is comparable to a C compiler that first removes
directives and macros, leaving only syntactically correct C code. With XML,
preprocessing resolves references to external files (external DTDs) and expands entity
references. What is left over is still an XML document, but one that uses a simpler
syntax. Is canonical XML useful? Absolutely. Many real-world applications generate
XML documents using this simpler syntax. For this reason, most XML parsers
parse documents that conform to canonical XML.
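The entity-expansion part of this preprocessing can be illustrated with a small sketch. It assumes Python's standard xml.etree.ElementTree module, which expands entities declared in the internal DTD subset while parsing; it is not the W3C Canonical XML algorithm and not this paper's preprocessor. The entity name company and the sample memo are made up.

```python
import xml.etree.ElementTree as ET

# A document that declares an internal entity and then references it.
SOURCE = """<!DOCTYPE memo [
  <!ENTITY company "Acme Tools">
]>
<memo><from>&company;</from></memo>"""

root = ET.fromstring(SOURCE)

# Serializing the parsed tree gives the simpler form: no DOCTYPE, and the
# entity reference has been replaced by its text.
expanded = ET.tostring(root, encoding="unicode")
print(expanded)  # <memo><from>Acme Tools</from></memo>
```

A parser that accepts only the simpler form never has to deal with the declaration or the reference itself.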
(ii) Building Trees
The DOM (Document Object Model) approach works by building a tree whose nodes
are Tag objects. Pointers between Tag objects allow a tag to maintain two lists: one of
siblings and one of children. Each tag also contains a list of Attribute objects and a
list of Contents objects.
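One possible shape for such a node is sketched below. The field and method names are assumptions, not the paper's class definition; in particular, this sketch stores attributes in a dictionary and derives the sibling list from the parent instead of storing it separately.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Tag:
    """One node of the tree (hypothetical layout for illustration)."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    contents: List[str] = field(default_factory=list)      # character data
    children: List["Tag"] = field(default_factory=list)
    # Excluded from repr/compare to avoid parent <-> child recursion.
    parent: Optional["Tag"] = field(default=None, repr=False, compare=False)

    def add_child(self, child: "Tag") -> "Tag":
        """Attach `child` below this tag and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def siblings(self) -> List["Tag"]:
        """All other children of this tag's parent."""
        if self.parent is None:
            return []
        return [tag for tag in self.parent.children if tag is not self]


root = Tag("order", {"id": "7"})
pen = root.add_child(Tag("item", contents=["pen"]))
ink = root.add_child(Tag("item", contents=["ink"]))
print([tag.contents for tag in pen.siblings()])
```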
(iii) Lexical Analysis
Its basic job is to figure out the pieces (tokens) of the input, such as the variables,
constants, and keywords of a program. The following points summarize its functionality:
a) Given a stream of characters, how do you recognize the key items and determine
what kind of item each one is?
b) The code that does lexical analysis is called the scanner.
c) First pass: just chop up everything by white space into words.

6. Parsing Environment
The Lexical class performs the role of lexical analyzer and provides two primary
member functions. The first, NextToken, is called as needed to look for
the next token in the string. The second, GetCharData, returns character data as
strings that may contain whitespace. These functions are called as needed to build the
Tag tree. GetCharData works by collecting every character it encounters into a string
until it reaches either the end of the XML document or a start tag.
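A rough sketch of such a lexical analyzer follows. The method names next_token and get_char_data mirror NextToken and GetCharData, but the token kinds, the regular expression, and the restriction to double-quoted attribute values are assumptions of this sketch, not details taken from the paper. For simplicity, get_char_data stops at any < character.

```python
import re
from typing import Optional, Tuple

# Hypothetical token kinds for a small subset of XML markup.
TOKEN = re.compile(
    r'\s*(?:(?P<END_OPEN></)|(?P<OPEN><)|(?P<CLOSE>>)|(?P<EQUALS>=)'
    r'|(?P<STRING>"[^"]*")|(?P<NAME>[A-Za-z_][\w.-]*))'
)


class Lexical:
    def __init__(self, text: str):
        self.text = text
        self.pos = 0

    def next_token(self) -> Optional[Tuple[str, str]]:
        """Return (kind, text) for the next token, or None at end of input."""
        match = TOKEN.match(self.text, self.pos)
        if match is None:
            if self.text[self.pos:].strip() == "":
                return None
            raise ValueError(f"unexpected input at offset {self.pos}")
        self.pos = match.end()
        return match.lastgroup, match.group(match.lastgroup)

    def get_char_data(self) -> str:
        """Collect characters, whitespace included, up to the next '<' or the end."""
        end = self.text.find("<", self.pos)
        if end == -1:
            end = len(self.text)
        data = self.text[self.pos:end]
        self.pos = end
        return data


lexer = Lexical('<item id="7">blue pen</item>')
start_tokens = [lexer.next_token() for _ in range(6)]   # < item id = "7" >
char_data = lexer.get_char_data()                       # keeps the inner space
end_tokens = [lexer.next_token() for _ in range(3)]     # </ item >
print(start_tokens, repr(char_data), end_tokens)
```

The caller decides which function to use: next_token inside a tag, where whitespace only separates tokens, and get_char_data between tags, where whitespace is part of the data.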
7. Parsing Process
A Parser object requests tokens or character data from the Lexical object. As the
tokens and character data are returned, the Parser builds the tag tree. The Parser
uses a recursive descent technique. The member function Translate starts the
process by getting the root tag of the XML document. The function GetTag follows
the syntax by parsing the start tag, the content, and finally the end tag with calls to
StartTag(), Content(), and EndTag(), respectively.
While parsing the content, Lexical could return a StartTag token, signaling the
beginning of another tag. If so, Content() recursively calls GetTag and adds the
returned tag to the current tag's content. Parsing of start and end tags proceeds
similarly. A start tag is defined syntactically as a tag name followed by zero or more
attributes between a < character and a > character. Match is a function that serves
only to move the parser to the next token. From the parser's perspective it is telling
Match, "I expect to match this token next. If it matches, great. Send it back to me."
If it doesn't match, there is a problem.
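The recursive descent described above might be sketched as follows. The method names translate, get_tag, start_tag, content, end_tag, and match mirror the functions named in the text, but the bodies, the token kinds, the peek helper, and the use of ValueError for errors are assumptions of this sketch. It folds the lexical functions into the Parser class to stay self-contained and handles only a small subset: no comments, no entities, no empty-element tags, and double-quoted attribute values only.

```python
import re

# Hypothetical token kinds for the subset handled by this sketch.
TOKEN = re.compile(
    r'\s*(?:(?P<END_OPEN></)|(?P<OPEN><)|(?P<CLOSE>>)|(?P<EQUALS>=)'
    r'|(?P<STRING>"[^"]*")|(?P<NAME>[A-Za-z_][\w.-]*))'
)


class Tag:
    def __init__(self, name):
        self.name, self.attributes, self.text, self.children = name, {}, [], []


class Parser:
    def __init__(self, source):
        self.source, self.pos = source, 0

    def next_token(self):
        """Scan one token at the current position (the lexical analyzer's job)."""
        m = TOKEN.match(self.source, self.pos)
        if m is None:
            if self.source[self.pos:].strip():
                raise ValueError(f"unexpected input at offset {self.pos}")
            return ("EOF", "")
        self.pos = m.end()
        return (m.lastgroup, m.group(m.lastgroup))

    def peek(self):
        """Look at the next token without consuming it."""
        saved = self.pos
        token = self.next_token()
        self.pos = saved
        return token

    def match(self, kind):
        """Consume the next token, insisting that it has the expected kind."""
        found, value = self.next_token()
        if found != kind:
            raise ValueError(f"expected {kind}, found {found} {value!r}")
        return value

    def get_char_data(self):
        """Collect characters up to the next '<' or the end of the document."""
        end = self.source.find("<", self.pos)
        end = len(self.source) if end == -1 else end
        data, self.pos = self.source[self.pos:end], end
        return data

    def translate(self):
        root = self.get_tag()
        self.match("EOF")                        # nothing may follow the root tag
        return root

    def get_tag(self):
        tag = self.start_tag()
        self.content(tag)
        self.end_tag(tag)
        return tag

    def start_tag(self):
        self.match("OPEN")
        tag = Tag(self.match("NAME"))
        while self.peek()[0] == "NAME":          # zero or more attributes
            key = self.match("NAME")
            self.match("EQUALS")
            tag.attributes[key] = self.match("STRING")[1:-1]
        self.match("CLOSE")
        return tag

    def content(self, tag):
        while True:
            data = self.get_char_data().strip()
            if data:
                tag.text.append(data)
            if self.peek()[0] != "OPEN":         # an end tag or EOF ends the content
                return
            tag.children.append(self.get_tag())  # recursive descent into a child tag

    def end_tag(self, tag):
        self.match("END_OPEN")
        if self.match("NAME") != tag.name:
            raise ValueError(f"end tag does not match <{tag.name}>")
        self.match("CLOSE")


root = Parser('<order id="7"><item>pen</item><item>ink</item></order>').translate()
print(root.name, root.attributes, [(c.name, c.text) for c in root.children])
```

Each grammar rule becomes one method, so the call stack mirrors the nesting of the document; a mismatched end tag surfaces as an error raised from end_tag or match.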
Finally, the use of the parser relies on traversing the resulting Tag tree. The
TagIterator class assists with this task. A TagIterator object requires that you identify
the tag that is the root of the tree. Once this is done, we can call the member
functions Begin() and Next() to move through the tree.
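A traversal helper of this kind could look like the sketch below. The paper does not state the visiting order, so the pre-order (parent before children) walk, the stack-based implementation, and the minimal Tag stand-in are all assumptions; begin and next correspond to Begin() and Next().

```python
class Tag:
    """Minimal stand-in for a parsed tag: a name plus child tags."""

    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)


class TagIterator:
    def __init__(self, root):
        self.root = root
        self.pending = []          # stack of tags still to be visited

    def begin(self):
        """Restart the walk and return the root tag."""
        self.pending = list(reversed(self.root.children))
        return self.root

    def next(self):
        """Return the next tag in pre-order, or None when the walk is done."""
        if not self.pending:
            return None
        tag = self.pending.pop()
        self.pending.extend(reversed(tag.children))
        return tag


# Hand-built tree standing in for a parser's output.
tree = Tag("order", [Tag("customer"), Tag("item", [Tag("price")]), Tag("item")])
iterator = TagIterator(tree)
visited = []
tag = iterator.begin()
while tag is not None:
    visited.append(tag.name)
    tag = iterator.next()
print(visited)
```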
8. Conclusion
As Web technology develops, markup languages are also being developed at a fast
pace to meet the specific requirements of individual products. This makes it difficult
to maintain a common standard among these markup languages. Fortunately, the
Extensible Markup Language addresses this problem by providing the facilities to
develop individual markup languages while keeping a common standard. This paper has
presented techniques for the development of an XML parser, which is needed for
checking the well-formedness and validity of an XML document. These techniques are
useful for anyone who wants to develop their own Web documents for Internet
applications.


