Extensible Markup Language Parsing Techniques
the XML document, limiting the size of XML documents that can be parsed by this
approach. The DOM parser is the model for this approach.
Seminartopicsonline.com
There are two classes of XML parsers: validating and non-validating. All true XML
parsers must report violations of the well-formedness constraints in the XML
specification. Validating parsers must also report violations of the constraints
expressed by the declarations in the DTD. Non-validating parsers are required only to
check that the DTD is well formed; they are not required to understand and use the
DTD for document checking. There are some exceptions, however, in particular for
attributes: a non-validating parser must still use the declarations it reads to
normalize attribute values and to supply default attribute values.
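As an illustration of well-formedness reporting, the following is a minimal sketch
using Python's standard xml.etree.ElementTree module, which is backed by the
non-validating expat parser. The function name check_well_formed is an assumption
made for this example and is not part of the parser described in this paper.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return (True, None) if text is well-formed XML, else (False, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The parser reports the first well-formedness violation it finds.
        return False, str(err)
    return True, None


# The second document closes <note> while <to> is still open.
print(check_well_formed("<note><to>Ann</to></note>")[0])  # True
print(check_well_formed("<note><to>Ann</note>")[0])       # False
```

Because the underlying parser is non-validating, this check says nothing about
whether the document obeys the declarations in a DTD.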
An XML parser can extract information from an XML document in two different ways.
The first is an event-driven approach. The parser begins reading the input and sends
messages when certain events occur. For example, one message is sent when a start
tag is encountered and another when an end tag is reached. Programs that use these
parsers supply callback functions to process the events. When a message signals that
a desired tag has been found, the program can examine the tag and its accompanying
information in detail and act accordingly. The SAX (Simple API for XML) parser is the
model for this approach. The second approach builds a tree to represent the XML
document. Each tag of the document is represented by a node in the tree. Once the
tree is built, a program can traverse it to process the document or to search for
specific tags. Usually these trees reside in memory, which limits the size of the XML
documents that can be parsed this way. In contrast, event-driven parsers do not
create a tree, so the size of the document is not bounded by available memory.
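The event-driven approach can be sketched with Python's standard xml.sax module.
The handler class TagLogger and the function collect_events are names assumed for
this illustration; startElement and endElement are the callbacks that xml.sax
invokes for start tags and end tags.

```python
import xml.sax


class TagLogger(xml.sax.ContentHandler):
    """Callback object: records one event per start tag and per end tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def collect_events(text):
    """Parse text as a stream of events; no tree is ever built."""
    handler = TagLogger()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.events


for event in collect_events("<order><item id='1'/><item id='2'/></order>"):
    print(event)
```

The program only sees one event at a time, so anything it needs later (for example,
the name of the enclosing tag) has to be remembered by the handler itself.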
Most XML parsers are either event driven or produce an in-memory DOM (Document
Object Model) instance of the document. Which one to use depends on the application
and its memory requirements. Producing a DOM tree requires more memory, but it
provides greater programmatic flexibility. SAX suits applications that need a smaller
memory footprint, because it processes the XML document as a stream of events.
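For comparison, the tree-based approach can be sketched with Python's standard
xml.dom.minidom module. The function tag_names is an assumption made for this
example. The whole document is held in memory, which is what buys the freedom to
walk it in any order and as many times as needed.

```python
from xml.dom import minidom


def tag_names(text):
    """Build the full DOM tree in memory, then list element names in document order."""
    doc = minidom.parseString(text)
    names = []

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE:
            names.append(node.tagName)
        for child in node.childNodes:
            walk(child)

    walk(doc.documentElement)
    return names


print(tag_names("<a><b/><c><d/></c></a>"))  # ['a', 'b', 'c', 'd']
```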
5. Developing XML Parsers
Developing an XML parser involves the following steps.
(i) Canonical XML
XML parsers generally work with canonical XML. This is the XML that remains after an
XML document has been preprocessed. The process is comparable to a C compiler that
first removes directives and macros, leaving only syntactically correct C code. With
XML, preprocessing resolves references to external files (external DTDs) and expands
entity references. What is left over is still an XML document, but one that uses a
simpler syntax. Is canonical XML useful? Absolutely. Many real-world applications
generate XML documents using this simpler syntax. For this reason, most XML parsers
parse documents that conform to canonical XML.
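The entity-expansion part of this preprocessing can be sketched as follows. This is
a simplified illustration, not a full preprocessor: the function expand_entities
and its entities argument (a table of entity names and replacement texts, as they
might be collected from a DTD) are assumptions made for this example. It performs a
single pass, does not handle nested references, and leaves the five predefined
entities untouched so that the result stays well formed.

```python
import re

# A general entity reference looks like &name;
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")

# Predefined entities must stay escaped in the output.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


def expand_entities(text, entities):
    """Replace each declared entity reference with its replacement text."""

    def replace(match):
        name = match.group(1)
        if name in PREDEFINED or name not in entities:
            return match.group(0)  # leave the reference as it is
        return entities[name]

    return ENTITY_REF.sub(replace, text)


print(expand_entities("<p>&co; &amp; Sons</p>", {"co": "Acme"}))
# <p>Acme &amp; Sons</p>
```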
(ii) Building Trees
The DOM (Document Object Model) approach works by building a tree whose nodes are
Tag objects. Each Tag object holds pointers that allow it to maintain two lists: one
of its siblings and one of its children. Each tag also contains a list of Attribute
objects and a list of Contents objects.
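One possible shape for such a node is sketched below. The field names
(first_child, next_sibling, attributes, contents) and the helper methods are
assumptions made for this illustration; the text above specifies only that a tag
keeps sibling and child lists plus lists of attributes and contents. Contents are
modeled here as plain strings for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    name: str
    value: str


@dataclass
class Tag:
    name: str
    attributes: List[Attribute] = field(default_factory=list)
    contents: List[str] = field(default_factory=list)
    first_child: Optional["Tag"] = None   # head of the list of children
    next_sibling: Optional["Tag"] = None  # next entry in the parent's child list

    def add_child(self, child):
        """Append child to the end of this tag's child list."""
        if self.first_child is None:
            self.first_child = child
        else:
            last = self.first_child
            while last.next_sibling is not None:
                last = last.next_sibling
            last.next_sibling = child
        return child

    def children(self):
        """Yield the children by following the sibling pointers."""
        node = self.first_child
        while node is not None:
            yield node
            node = node.next_sibling


book = Tag("book", attributes=[Attribute("id", "7")])
book.add_child(Tag("title", contents=["XML Parsing"]))
book.add_child(Tag("author", contents=["Ann"]))
print([child.name for child in book.children()])  # ['title', 'author']
```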
(iii) Lexical Analysis
Lexical analysis identifies the pieces (tokens) of the input: variables, constants,
keywords, and so on in the case of a program. The following points summarize its
functionality:
a) Given a stream of characters, the analyzer must recognize the significant items
and determine what kind of item each one is.
b) The code that does lexical analysis is called the scanner.
c) As a first pass, the input is simply chopped up by white space into words.
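The first pass in point c) can be sketched in a few lines. The function names
first_pass and classify and the token kinds START_TAG, END_TAG, and WORD are
assumptions made for this illustration.

```python
def first_pass(text):
    """Chop the input into words on white space."""
    return text.split()


def classify(word):
    """Decide what kind of thing a word is (point a)."""
    if word.startswith("</"):
        return "END_TAG"
    if word.startswith("<"):
        return "START_TAG"
    return "WORD"


tokens = [(classify(w), w) for w in first_pass("<greeting> hello world </greeting>")]
for token in tokens:
    print(token)
```

Splitting on white space alone breaks down as soon as a tag carries attributes or
is written without surrounding spaces, which is why a real scanner works character
by character.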
6. Parsing Environment
The Lexical class performs the role of lexical analyzer and provides two primary
member functions. The first is called NextToken, which is called as needed to look for
the next token in the string. The second is called GetCharData, which returns strings
that contain whitespace. These functions are called as needed to build the Tag tree.
GetCharData works by collecting every character it encounters into a string until it
either reaches the end of the XML document or encounters a start tag.
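A minimal sketch of such a lexical analyzer is given below. Only the two member
functions come from the description above (written here as next_token and
get_char_data); the token kinds, the (kind, value) return format, and the error
handling are assumptions made for this illustration.

```python
class Lexical:
    """Sketch of a lexical analyzer over an XML string."""

    PUNCTUATION = ("</", "/>", "<", ">", "=")  # "</" must be tried before "<"

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_token(self):
        """Skip white space and return the next token as a (kind, value) pair."""
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            return ("EOF", "")
        for mark in self.PUNCTUATION:
            if self.text.startswith(mark, self.pos):
                self.pos += len(mark)
                return (mark, mark)
        ch = self.text[self.pos]
        if ch in "\"'":  # quoted attribute value
            end = self.text.find(ch, self.pos + 1)
            if end == -1:
                raise ValueError("unterminated attribute value")
            value = self.text[self.pos + 1:end]
            self.pos = end + 1
            return ("STRING", value)
        start = self.pos
        while self.pos < len(self.text) and not (
            self.text[self.pos].isspace() or self.text[self.pos] in "<>=/\"'"
        ):
            self.pos += 1
        if self.pos == start:
            raise ValueError(f"unexpected character {ch!r}")
        return ("NAME", self.text[start:self.pos])

    def get_char_data(self):
        """Collect every character, white space included, up to the next '<'."""
        start = self.pos
        while self.pos < len(self.text) and self.text[self.pos] != "<":
            self.pos += 1
        return self.text[start:self.pos]


lex = Lexical('<a href="x">Hello world</a>')
print([lex.next_token() for _ in range(6)])  # tokens of the start tag
print(repr(lex.get_char_data()))             # 'Hello world'
```

The caller decides which function to use: next_token inside a tag, where white
space is insignificant, and get_char_data between tags, where it must be kept.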
7. Parsing Process
A Parser object requests tokens or character data from the Lexical object. As the
tokens and character data are returned, the Parser builds the tag tree. The Parser
uses a recursive descent technique. The member function Translate starts the
process by getting the root tag of the XML document. The function GetTag follows
the syntax by parsing the start tag, the content, and finally the end tag with calls to
StartTag(), Content(), and EndTag(), respectively.
While parsing the content, Lexical could return a StartTag token, signaling the
beginning of another tag. If so, Content() recursively calls GetTag and adds the
returned tag to the current tag's content. Parsing of start and end tags proceeds
similarly. A start tag is defined syntactically as a tag name followed by zero or more
attributes between a < character and a > character. Match is a function that serves
only to move the parser to the next token. From the parser's perspective, it is telling
Match, "I expect to match this token next. If it matches, great. Send it back to me."
If it does not match, there is a problem.
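The recursive descent structure described above can be sketched as follows. The
method names mirror Translate, GetTag, StartTag, Content, EndTag, and Match from
the text; everything else (the regular-expression tokenizer, the peek helper, the
dictionary used for each tag, and raising SyntaxError on a mismatch) is an
assumption made to keep the example self-contained. Empty-element tags, comments,
and entity references are not handled.

```python
import re

# One token: "</", "<", ">", "=", a quoted string, or a name.
TOKEN = re.compile(r"\s*(</|<|>|=|\"[^\"]*\"|'[^']*'|[^\s<>=/\"']+)")


class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        """Look at the next token without consuming it (None at end of input)."""
        m = TOKEN.match(self.text, self.pos)
        return m.group(1) if m else None

    def next_token(self):
        m = TOKEN.match(self.text, self.pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {self.pos}")
        self.pos = m.end()
        return m.group(1)

    def match(self, expected):
        """Consume the next token, insisting that it is the expected one."""
        token = self.next_token()
        if token != expected:
            raise SyntaxError(f"expected {expected!r}, found {token!r}")
        return token

    def translate(self):
        return self.get_tag()  # the root tag of the document

    def get_tag(self):
        name, attributes = self.start_tag()
        children, text = self.content()
        self.end_tag(name)
        return {"name": name, "attributes": attributes,
                "children": children, "text": text}

    def start_tag(self):
        self.match("<")
        name = self.next_token()
        attributes = {}
        while self.peek() != ">":
            key = self.next_token()
            self.match("=")
            value = self.next_token()
            if value[0] not in "\"'":
                raise SyntaxError(f"attribute value must be quoted: {value!r}")
            attributes[key] = value[1:-1]
        self.match(">")
        return name, attributes

    def content(self):
        children, text = [], []
        while True:
            start = self.pos
            while self.pos < len(self.text) and self.text[self.pos] != "<":
                self.pos += 1  # character data runs up to the next "<"
            text.append(self.text[start:self.pos])
            if self.peek() == "<":
                children.append(self.get_tag())  # nested tag: recurse
            else:
                return children, "".join(text).strip()

    def end_tag(self, name):
        self.match("</")
        self.match(name)
        self.match(">")


tree = Parser('<book id="7"><title>XML</title><author>Ann</author></book>').translate()
print(tree["name"], tree["attributes"], [c["name"] for c in tree["children"]])
# book {'id': '7'} ['title', 'author']
```

Each grammar rule becomes one method, and nesting in the document becomes nesting
of calls: content calls get_tag, which calls content again for the inner tag.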
Finally, the use of the parser relies on traversing the resulting Tag tree. The
TagIterator class assists with this task. A TagIterator object requires that we
identify the tag that is the root of the tree. Once this is done, we can call the
member functions Begin() and Next() to move through the tree.
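A traversal of this kind can be sketched as shown below. The text does not specify
the order in which TagIterator visits the tags, so a pre-order (document order)
walk is assumed here; the minimal Tag class with a children list and the use of
None to signal the end of the traversal are also assumptions made for this
illustration.

```python
class Tag:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


class TagIterator:
    """Pre-order walk over a Tag tree using an explicit stack."""

    def __init__(self, root):
        self.root = root
        self.stack = []

    def begin(self):
        """Restart the traversal and return the root tag."""
        self.stack = [self.root]
        return self.next()

    def next(self):
        """Return the next tag in document order, or None when done."""
        if not self.stack:
            return None
        tag = self.stack.pop()
        # Push children in reverse so the leftmost child is visited first.
        self.stack.extend(reversed(tag.children))
        return tag


root = Tag("book", [Tag("title"), Tag("chapter", [Tag("para")]), Tag("index")])
it = TagIterator(root)
names = []
tag = it.begin()
while tag is not None:
    names.append(tag.name)
    tag = it.next()
print(names)  # ['book', 'title', 'chapter', 'para', 'index']
```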
8. Conclusion
With the growth of Web technology, markup languages are also developing at a fast
pace to meet the specific requirements of individual products. This makes it
difficult to maintain a common standard among the markup languages being developed.
Fortunately, the Extensible Markup Language addresses this problem by providing the
facilities to develop individual markup languages while keeping the required standard
the same. This paper has presented techniques for developing an XML parser, which is
needed to check the well-formedness and validity of an XML document. These
techniques are useful for anyone who wants to develop their own Web documents for
Internet applications.