The document discusses the history and development of XML, including its origins from SGML and its creation by the W3C to enable generic SGML to be processed on the web similarly to HTML. It describes how XML allows users to define their own markup languages and has seen widespread success due to enabling data interchange over a standard syntax in an interconnected world.
The Extensible Markup Language
The Extensible Markup Language (XML)
Markup is defined as extra textual syntax that can be used to describe formatting, actions, structure information, text semantics, attributes, etc. One example of markup is the set of formatting commands of the popular text formatting software TeX. The Standard Generalized Markup Language (SGML), a metalanguage for tagging text, was defined in the late seventies by Charles F. Goldfarb and his group, building on previous work done at IBM.
In 1996 the SGML Editorial Review Board became the XML Working Group under the auspices of the World Wide Web Consortium (W3C), chaired by Jon Bosak of Sun Microsystems and with the intermediation of Dan Connolly. This group developed the Extensible Markup Language (XML), a subset of SGML whose goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML (also based on SGML). XML has been designed for ease of implementation and for interoperability with both SGML and HTML.
XML, like SGML, is not exactly a markup language; it is a metalanguage that can be used to define specific markup languages (like XHTML, MathML, SVG, etc.). That means that XML allows users to define new tags and structures for their own language. For a number of reasons, some obvious and others that will remain a mystery, XML has achieved amazing success worldwide. In an interconnected and global society, the interchange of data over a standard syntax has become a key issue, and this is where XML fits perfectly.
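As a small illustration of what defining one's own tags means in practice, the following Python sketch parses a document written in an invented vocabulary using only the standard library. The element and attribute names (`library`, `book`, `isbn`, `title`, `author`) and the values are assumptions made up for this example, not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: every tag and attribute name below is invented
# for illustration. XML itself only fixes the syntax, not the names.
DOCUMENT = """
<library>
  <book isbn="0-000-00000-0">
    <title>An Example Title</title>
    <author>A. Writer</author>
  </book>
  <book isbn="1-111-11111-1">
    <title>Another Title</title>
    <author>B. Writer</author>
  </book>
</library>
"""


def list_titles(xml_text):
    """Return (isbn, title) pairs from a document in the invented vocabulary."""
    root = ET.fromstring(xml_text)
    return [(book.get("isbn"), book.findtext("title"))
            for book in root.iter("book")]


if __name__ == "__main__":
    for isbn, title in list_titles(DOCUMENT):
        print(isbn, title)
```

A generic parser can read the document without knowing the vocabulary in advance, which is the property that makes a shared syntax useful for data interchange.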
The Semantic Web
The Semantic Web is a promising initiative led by the W3C whose aim is to provide a data model for the Web, allowing information to be understood and processed by machines as well. The definition of the W3C says: "The Semantic Web is the representation of data on the World Wide Web. [...] It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming." So it is clear that this initiative relies strongly on RDF, but what is the meaning of "the representation of data"?
Often in this document we have referred to the difficulties related to searching and retrieving information on the Web. One of the main reasons is the fact that most Web data, despite being processed by machines, can only be understood by humans. This includes natural language text, still and moving images, audio, etc. Earlier we discussed the difference between information retrieval and data retrieval, saying that while data retrieval is appropriate for databases, it is not appropriate (or not enough) for the Web. The reason is that the information on the Web, contrary to databases, does not have an underlying data model. So "the representation of data" means two things: the development of a data model for the Web, and the dissemination of machine-understandable metadata (under the framework of the data model) linked in some way to the Web information.
Another classic definition is that by Berners-Lee et al.: "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation."
Querying the Semantic Web
The Semantic Web initiative has opened a broad spectrum of opportunities for improving the search and retrieval of information on the Internet. Of course this is not accidental, but one of the main targets of this new scenario, as already pointed out.
However, the consolidation of a standardised way to interchange semantic information is just another step in the race for interoperability. Other battles are being fought to rationalise the way this information is processed, and search and retrieval are perhaps the most important elements of the information feed chain. The challenge is to find efficient and rational ways to exploit this new information that is beginning to be disseminated over the net. Although it is formalised in a standard way (RDF), it can be stored in different ways (embedded in HTML pages, in a database, in specific knowledge repositories, etc.) and it remains highly heterogeneous (an innumerable and unrestricted number of ontologies, potentially overlapping, can coexist in the Semantic Web). These two key issues, how to locate and access the information and how to manage heterogeneity, are relevant for our analysis and also closely related to what we have said in the previous sections.
Some research works reflect special approaches to this, like the Edutella project, which constitutes a distributed search network for educational resources and is based on P2P networking (its JXTA implementation) and RDF. This interesting work uses the query exchange language family RDF-QEL-i (based on Datalog semantics and subsets) as a standardised query exchange format. Because Edutella peers are highly heterogeneous, with different kinds of local storage for RDF triples as well as some kind of local query language (e.g., SQL), wrappers are used to translate queries and results to and from each peer's local format so that the peers can participate in the network. Another work is Sesame, an extensible architecture implementing a persistent RDF store and a query engine capable of processing RQL queries. Of special interest for us is TAP, a system that implements a general query interface called GetData, Semantic Negotiation and Web of Trust enabled registries.
It introduces the concept of Semantic Search and describes an implemented system which uses the data from the Semantic Web to improve traditional search results. The GetData interface is a simple query interface to network-accessible data presented as directed labelled graphs, in contrast to expressive query languages like SQL, RQL or DQL. This work favours deployability over query expressiveness. Related to this project, and also to the query language of Edutella, is RDF-QBE, a mechanism for specifying RDF subgraphs, which its authors call 'Query by Example', that could allow a high-performance standardised interface for retrieval of semantic information from remote servers.
From all these case studies we can observe the latent need to define a low-barrier mechanism that allows heterogeneous semantic information to be harvested, and how this generates a trade-off between deployability and expressiveness. Some of them (e.g. TAP) point out the need to consider other conventional or non-semantic search strategies as well, like crawler-based engines, when thinking about future applications.
Data Integration and XML
The classic data integration literature focused on the Relational Model for both queries and mappings until the mid-90s. In the late 90s, however, researchers turned their interest to a new and emerging data model, XML. The new model arose as a de facto standard for exposing and interchanging data, so it was the ideal choice for systems pursuing data interoperability. Now XML and its query languages are the selected interfaces for Web Services, XML-native databases and many other applications.
5.1 Mapping the classic data integration problems to XML
Integrating data from various XML sources raises the same problems described in the classic data integration literature, but new solutions need to be found to tackle the particularities of the new scenario. The first of these classical problems is schema mapping.
The schema in whose terms the query is expressed (there is no need to call it the Global Schema if we are, e.g., in a peer-to-peer context) must be mapped in some way to the schema or schemas of the sources where the query will actually be executed. The simplest approach to such a mapping is an attribute correspondence, where some property or attribute in one representation corresponds to some attribute in the other representation. We find increased complexity when mapping concepts that are semantically the same but whose XML representations are structured differently.
Example 5.1.1. This example illustrates some of the problems of mapping XML schemas.
Source1.xml DTD:
pubs
  book*
    title
    author*
      name
    publisher*
      name
Source2.xml DTD:
authors
  author*
    full-name
    publication*
      title
      pub-type
The example shows how a simple schema describing books and authors can take different shapes. The difficulty of obtaining a mapping between them will depend on the goal of that mapping. It may serve for simple migration tasks (translation of data from one schema to another), in which case a simple translation template will be enough. However, it may be needed for querying purposes, in which case a more complex strategy is needed, related to the old query rewriting problem described in previous sections.
5.2 XML query languages and data integration
XML query languages have been broadly used for the development of simple data integration applications. Mapping between schemas or defining wrappers with XSLT or XQuery can be a direct solution for some real-world problems. These solutions are generally based on the manual coding of templates and updates, so they represent the modern version of the more primitive data integration approaches.
XSL Transformations (XSLT)
XSL Transformations (XSLT) is a language standardized by the W3C for transforming XML documents into other XML documents.
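To make the idea of a hand-coded translation between schemas concrete, here is a minimal Python sketch that converts a document shaped like Source1.xml from Example 5.1.1 into the shape of Source2.xml. It rests on several assumptions: the two outlines are read as nested trees (`book` contains `title`, `author/name` and `publisher/name`; `author` contains `full-name` and `publication`, which in turn contains `title` and `pub-type`), the sample data is invented, `name` is taken to correspond to `full-name`, and `pub-type` is filled with a hypothetical default because Source1 has no equivalent. Publisher information is dropped because Source2 has no place for it.

```python
import xml.etree.ElementTree as ET

# Invented sample data following the assumed Source1 structure.
SOURCE1 = """
<pubs>
  <book>
    <title>Title A</title>
    <author><name>Ann</name></author>
    <author><name>Bob</name></author>
    <publisher><name>P1</name></publisher>
  </book>
  <book>
    <title>Title B</title>
    <author><name>Ann</name></author>
  </book>
</pubs>
"""


def source1_to_source2(xml_text, pub_type="book"):
    """Regroup book-centred data into an author-centred tree.

    pub_type is a hypothetical default: Source1 carries no such information.
    """
    pubs = ET.fromstring(xml_text)

    # Invert the nesting: collect the titles written by each author.
    # (dict keeps insertion order in Python 3.7+.)
    titles_by_author = {}
    for book in pubs.findall("book"):
        title = book.findtext("title")
        for name in book.findall("author/name"):
            titles_by_author.setdefault(name.text, []).append(title)

    # Emit the Source2 shape.
    authors = ET.Element("authors")
    for full_name, titles in titles_by_author.items():
        author = ET.SubElement(authors, "author")
        ET.SubElement(author, "full-name").text = full_name
        for title in titles:
            publication = ET.SubElement(author, "publication")
            ET.SubElement(publication, "title").text = title
            ET.SubElement(publication, "pub-type").text = pub_type
    return authors


if __name__ == "__main__":
    print(ET.tostring(source1_to_source2(SOURCE1), encoding="unicode"))
```

Even in this small case the translation is more than a one-to-one attribute correspondence: the nesting has to be inverted, and information is lost in one direction and invented in the other.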
XSLT is a component of the W3C's Extensible Stylesheet Language (XSL), and initially its main purpose was to be used in conjunction with a formatting language like XSL-FO, targeting presentation-layer independence. However, XSLT can be used independently, and it has been used in many application areas, especially by the data integration community.
A transformation expressed in XSLT, called a stylesheet, describes rules for transforming a source XML document into a result XML document. An XSLT stylesheet associates patterns with templates. When a pattern is matched against an element in the source XML tree, the corresponding template is instantiated to generate XML code for the result document. This generation can include data from the source tree, but can also include new data.
The current version, XSLT 2.0 (W3C Candidate Recommendation 3 November 2005), is a revised version of the XSLT 1.0 Recommendation published on 16 November 1999. It is designed to be used in conjunction with XPath 2.0. XSLT shares the same data model as XPath 2.0, which is defined in a separate W3C specification.
The capabilities of XSLT for transforming XML documents make it a natural choice for data integration applications. In scenarios where heterogeneous XML schemas need to be mapped, XSLT stylesheets can be manually coded or semi-automatically generated to allow conversion between the different schemas. XSLT has also been used for intra-model conversions, like RDF-to-XML. Another use of XSLT has been in the definition of web wrappers. HTML code can be easily modified to become XHTML with tools like HTML Tidy [143], and then filtered with XSLT stylesheets. Many commercial products make use of XSLT's data integration capabilities, like the Altova MapForce tool.
Example 5.2.1. This example shows how an input XML document can be transformed using an XSLT template. Take the following XML document describing two movies.
input.xml: 1982 1959
The following XSLT template is applied recursively to all the nodes of the underlying tree of the input document. The template translates the movie elements into record elements. It also translates the id attributes into equivalent elements.
template.xslt:
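As a rough Python analogue of the transformation described in this example (not the XSLT stylesheet itself), the sketch below walks the input tree recursively, renames `movie` elements to `record`, and turns each `id` attribute into a child element. The input document is an assumption: only the names `movie`, `id` and `record` and the values 1982 and 1959 come from the text, while the `movies` and `year` element names and the id values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: "movies", "year" and the id values are assumptions.
INPUT = """
<movies>
  <movie id="1"><year>1982</year></movie>
  <movie id="2"><year>1959</year></movie>
</movies>
"""


def transform(node):
    """Recursively copy a tree, applying two rewrite rules.

    Rule 1: an element named "movie" becomes "record".
    Rule 2: an "id" attribute becomes an <id> child element.
    Everything else is copied unchanged.
    """
    result = ET.Element("record" if node.tag == "movie" else node.tag)
    result.text = node.text
    result.tail = node.tail
    for name, value in node.attrib.items():
        if name == "id":
            ET.SubElement(result, "id").text = value
        else:
            result.set(name, value)
    for child in node:
        result.append(transform(child))
    return result


if __name__ == "__main__":
    output = transform(ET.fromstring(INPUT))
    print(ET.tostring(output, encoding="unicode"))
```

An XSLT stylesheet would express the same rules declaratively, as patterns paired with templates, and leave the tree traversal to the processor instead of an explicit recursive function.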