Unit-4 ET
Semi-structured data (also known as partially structured data) is a type of data that doesn’t
follow the tabular structure associated with relational databases or other forms of data tables but
does contain tags and metadata to separate semantic elements and establish hierarchies of records
and fields.
Semi-structured and structured data are distinguished by two primary characteristics. The first is
schema. Unlike structured data, semi-structured data doesn’t require a prior schema definition.
Without a predefined, fixed schema, it is more flexible and free to evolve over time as new
attributes are added. The second key differentiator is data structure. Semi-structured data
supports a hierarchical structure that contains nested information, whereas structured data
represents data in flat tables. This nested hierarchy makes semi-structured data a good fit for
data received from apps and other internet-enabled devices.
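As a rough illustration of the two differences above, the following Python sketch flattens two nested records into the rows and columns that a flat table would need. The device readings and field names are invented for this example; note how the second record adds an attribute (battery) that the first one lacks.

```python
# Hypothetical device readings: nested, and with attributes that vary per record.
readings = [
    {"device": "sensor-1", "location": {"city": "Pune", "floor": 2}, "temp_c": 24.5},
    {"device": "sensor-2", "location": {"city": "Delhi"}, "temp_c": 31.0, "battery": 0.8},
]


def flatten(record, prefix=""):
    """Turn nested keys into dotted column names, as a flat table would require."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


rows = [flatten(r) for r in readings]

# A flat table needs every column fixed up front; values a record lacks become None.
columns = sorted({col for row in rows for col in row})
table = [[row.get(col) for col in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

The semi-structured records accept the new attribute without any change elsewhere, while the flat form has to widen the table and pad the gaps.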
Unstructured data is raw data with no established data model or schema. Semi-structured data is
unlike unstructured data in that it has some definite and consistent markers that create distinct
semantic elements and impose an organizational hierarchy of records and fields within the data.
Semi-structured data comes in a variety of formats, based on the source they originate from.
Here are a few of the most common:
XML: Extensible Markup Language (XML) has become one of the most popular semi-structured
data formats. This versatile and easy-to-use markup language allows users to define tags and
attributes required for storing data in a hierarchical form.
JSON: A commonly used alternative to XML, JavaScript Object Notation (JSON) is a lightweight text
format for semi-structured data. Data collected from IoT devices, web browsers, and smartphones is
often encoded as JSON and organized into batches before being transmitted to a data platform via a
data pipeline. This versatile format can also be used to transfer data between servers and apps or
internet-connected devices.
Avro: Originally developed for use with Apache Hadoop, Avro is a data serialization and remote
procedure call (RPC) framework. Using schemas defined in JSON, Avro serializes data in a
compact, binary format that can be sent to any app or program, where it is deserialized.
ORC: Optimized Row Columnar (ORC) is a semi-structured data format that was initially
designed to achieve more-efficient compression and enhance performance for reading, writing,
and processing data over earlier Hive formats.
Parquet: Another columnar storage file format similar to ORC, Parquet is designed for use in the
Hadoop ecosystem. Parquet is ideal for working with complex data in large volumes and features
different methods for efficient data compression and encoding types.
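To make the two text formats in the list above concrete, here is a minimal Python sketch that writes the same record as JSON and as XML using only the standard library. The record and the element names (contact, phones, phone) are assumptions chosen for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record with one nested list.
record = {"name": "Vimal Jaiswal", "company": "SSSIT.org", "phones": ["(0120) 4256464"]}

# JSON: nesting is expressed with objects and arrays.
json_text = json.dumps(record)

# XML: nesting is expressed with user-defined tags.
contact = ET.Element("contact")
ET.SubElement(contact, "name").text = record["name"]
ET.SubElement(contact, "company").text = record["company"]
phones = ET.SubElement(contact, "phones")
for number in record["phones"]:
    ET.SubElement(phones, "phone").text = number
xml_text = ET.tostring(contact, encoding="unicode")

print(json_text)
print(xml_text)
```

Both outputs carry the same hierarchy; they differ only in how the tags and nesting are spelled.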
XML Database
An XML database is a data persistence software system used for storing large amounts of
information in XML format. It provides a secure place to store XML documents.
You can query the stored data using XQuery, then export and serialize it into the desired format.
XML databases are usually associated with document-oriented databases.
There are two major types of XML databases:
1. XML-enabled database
2. Native XML database (NXD)
XML-enabled Database
An XML-enabled database works just like a relational database, with an extension provided for
the conversion of XML documents. In this database, data is stored in tables, in the form of rows
and columns.
Native XML Database
A native XML database is used to store large amounts of XML data. Instead of a table format, a
native XML database is based on a container format. You can query the data with XPath expressions.
A native XML database is preferred over an XML-enabled database because it is better suited to
storing, maintaining, and querying XML documents.
<?xml version="1.0"?>
<contact-info>
  <contact1>
    <name>Vimal Jaiswal</name>
    <company>SSSIT.org</company>
    <phone>(0120) 4256464</phone>
  </contact1>
  <contact2>
    <name>Mahesh Sharma </name>
    <company>SSSIT.org</company>
    <phone>09990449935</phone>
  </contact2>
</contact-info>
In the above example, the root element contact-info holds two contacts (contact1 and
contact2). Each one contains three child elements: name, company, and phone.
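The structure of the example can be checked with a short Python sketch that parses the document with the standard library and walks its hierarchy. This only illustrates reading the document; it is not how an XML database stores it.

```python
import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0"?>
<contact-info>
  <contact1>
    <name>Vimal Jaiswal</name>
    <company>SSSIT.org</company>
    <phone>(0120) 4256464</phone>
  </contact1>
  <contact2>
    <name>Mahesh Sharma </name>
    <company>SSSIT.org</company>
    <phone>09990449935</phone>
  </contact2>
</contact-info>"""

root = ET.fromstring(xml_doc)

# Each child of the root is one contact; each grandchild is one of its fields.
contacts = [
    {field.tag: field.text.strip() for field in contact}
    for contact in root
]

print(root.tag)
for entry in contacts:
    print(entry)
```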
XML - Schemas
XML Schema is commonly known as XML Schema Definition (XSD). It is used to describe
and validate the structure and the content of XML data. XML schema defines the elements,
attributes and data types. Schema element supports Namespaces. It is similar to a database
schema that describes the data in a database.
Syntax
Elements
As we saw in the XML - Elements chapter, elements are the building blocks of XML document.
An element can be defined within an XSD as follows −
Definition Types
You can define XML schema elements in the following ways −
Simple Type
Simple type element is used only in the context of the text. Some of the predefined simple types
are: xs:integer, xs:boolean, xs:string, xs:date. For example −
</xs:complexType>
</xs:element>
Attributes
2. In the second approach, we again use a database management system, but this time to store
the document content as data elements. This approach works when the documents follow an
XML Schema or XML DTD and all have the same structure, so that a relational or object
database can be designed to store the leaf-level data elements of the XML documents. It
requires a mapping algorithm to design a database schema that is compatible with the XML
document structure.
It also requires a way to recreate the XML documents from the stored data, as specified by the
XML Schema or DTD. This can be implemented either as an internal module of the database
management system or as middleware that is not part of the database management system.
3. In the third approach, a different type of database management system, based on a
hierarchical (tree) model, is designed and implemented specifically for storing native XML
data. Such a system is called a native XML DBMS. It may include specialized indexing and
querying techniques that work for all types of XML documents.
It may also include data compression techniques to reduce the size of the documents for
storage. Several products offer native XML storage: Tamino by Software AG and the dynamic
application platform of eXcelon offer native XML DBMS capability, and Oracle also offers a
native XML storage option.
4. In the fourth approach, customized XML documents are created and published from a
pre-existing relational database. Because relational databases already store huge amounts of
data, parts of that data may need to be formatted as documents for display on the web or for
exchange.
This approach uses a separate middleware software layer to handle the conversion
between XML documents and the relational database.
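The second and fourth approaches can be sketched together in Python: leaf-level elements are mapped into a relational table, and XML is then published back from that table. This is only a toy under stated assumptions: an in-memory SQLite database stands in for the DBMS, the table name contact and its columns are hypothetical, and every record uses the same contact tag so that all documents share one structure.

```python
import sqlite3
import xml.etree.ElementTree as ET

xml_doc = """<contact-info>
  <contact><name>Vimal Jaiswal</name><company>SSSIT.org</company><phone>(0120) 4256464</phone></contact>
  <contact><name>Mahesh Sharma</name><company>SSSIT.org</company><phone>09990449935</phone></contact>
</contact-info>"""

conn = sqlite3.connect(":memory:")
# Hand-written mapping: one column per leaf-level element of <contact>.
conn.execute(
    "CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT, company TEXT, phone TEXT)"
)

# Approach 2: store the leaf-level data elements as rows.
for contact in ET.fromstring(xml_doc).iter("contact"):
    conn.execute(
        "INSERT INTO contact (name, company, phone) VALUES (?, ?, ?)",
        (contact.findtext("name"), contact.findtext("company"), contact.findtext("phone")),
    )

stored = conn.execute("SELECT name, phone FROM contact ORDER BY id").fetchall()

# Approach 4: a middleware-style step that publishes XML from the relational data.
root = ET.Element("contact-info")
for name, company, phone in conn.execute(
    "SELECT name, company, phone FROM contact ORDER BY id"
):
    element = ET.SubElement(root, "contact")
    ET.SubElement(element, "name").text = name
    ET.SubElement(element, "company").text = company
    ET.SubElement(element, "phone").text = phone
published = ET.tostring(root, encoding="unicode")

print(stored)
print(published)
```

The mapping here is written by hand; the mapping algorithm mentioned in the second approach would derive the table design from the XML Schema or DTD instead.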
SQL | Query Processing
Query processing includes translating high-level queries into low-level expressions that can
be used at the physical level of the file system, optimizing the query, and executing it to
produce the actual result.
Block Diagram of Query Processing is as:
It is done in the following steps:
Step-1:
Parser: During the parse call, the database performs the following checks after converting the
query into relational algebra: syntax check, semantic check, and shared pool check.
The parser performs these checks as follows (refer to the detailed diagram):
1. Syntax check – verifies the syntactic validity of the SQL statement. Example:
SELECT * FORM employee
Here, this check reports the misspelling of FROM.
2. Semantic check – determines whether the statement is meaningful. Example: a query
that refers to a table name which does not exist is rejected by this check.
3. Shared pool check – every query is assigned a hash code when it is parsed. This check
looks for that hash code in the shared pool; if it is found, the database skips the
additional optimization steps and proceeds straight to execution.
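The three checks can be imitated in a few lines of Python. This is a loose sketch, not how Oracle implements parsing: SQLite stands in for the database, its error messages are used to tell a syntax failure from a semantic one, and the dictionary plan_cache is a hypothetical stand-in for the shared pool.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT)")

plan_cache = {}  # hypothetical stand-in for the shared pool: statement hash -> seen


def parse(sql):
    """Classify a statement as a soft parse, a hard parse, or a failed check."""
    key = hashlib.sha256(sql.encode()).hexdigest()
    if key in plan_cache:
        return "soft parse: hash found in cache"
    try:
        # EXPLAIN compiles the statement without running it.
        conn.execute("EXPLAIN " + sql)
    except sqlite3.OperationalError as exc:
        kind = "syntax" if "syntax error" in str(exc) else "semantic"
        return f"{kind} check failed: {exc}"
    plan_cache[key] = True
    return "hard parse: compiled and cached"


misspelled = parse("SELECT * FORM employee")   # FROM is misspelled
missing = parse("SELECT * FROM staff")         # table does not exist
first_run = parse("SELECT * FROM employee")
second_run = parse("SELECT * FROM employee")   # identical text, so same hash

for outcome in (misspelled, missing, first_run, second_run):
    print(outcome)
```

Only the last call finds its hash in the cache, which mirrors the case where the database can skip optimization.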
Step-2:
Optimizer: During the optimization stage, the database must perform a hard parse at least once
for every unique DML statement, and it performs the optimization during this parse. The database
never optimizes DDL unless it includes a DML component, such as a subquery, that requires
optimization.
Optimization is a process in which multiple execution plans for satisfying a query are examined
and the most efficient plan is selected for execution.
The database catalog stores the execution plans, and the optimizer passes the lowest-cost plan on
for execution.
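One way to watch an optimizer change its choice is to ask a database for its plan before and after an index exists. The Python sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in; the table, the data, and the index name idx_emp_dept are assumptions, and the exact plan wording differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employee (name, dept) VALUES (?, ?)",
    [(f"emp{i}", f"dept{i % 50}") for i in range(1000)],
)

query = "SELECT name FROM employee WHERE dept = 'dept7'"


def chosen_plan(sql):
    """Return the plan text the optimizer picked for this statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " / ".join(row[-1] for row in rows)  # the last column holds the detail text


plan_without_index = chosen_plan(query)   # only a full table scan is available

conn.execute("CREATE INDEX idx_emp_dept ON employee (dept)")
plan_with_index = chosen_plan(query)      # an index search is now the cheaper plan

print(plan_without_index)
print(plan_with_index)
```

The query text never changes; only the set of available plans does, and the optimizer switches to the cheaper one.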
Row Source Generation –
The row source generator is software that receives the optimal execution plan from the
optimizer and produces an iterative execution plan that is usable by the rest of the database.
The iterative plan is a binary program that, when executed by the SQL engine, produces the
result set.
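The idea of an iterative plan can be sketched with Python generators, where each step pulls rows from the step below it. This is an analogy only: a real row source tree is a compiled program inside the database, and the operator names and sample rows here are invented.

```python
def table_scan(rows):
    """Leaf row source: yields every row of the table."""
    for row in rows:
        yield row


def filter_rows(source, predicate):
    """Passes on only the rows that satisfy the predicate."""
    for row in source:
        if predicate(row):
            yield row


def project(source, columns):
    """Keeps only the requested columns of each row."""
    for row in source:
        yield {col: row[col] for col in columns}


# Hypothetical table contents.
employee = [
    {"id": 1, "name": "Asha", "dept": "sales"},
    {"id": 2, "name": "Kiran", "dept": "hr"},
    {"id": 3, "name": "Ravi", "dept": "sales"},
]

# Roughly: SELECT name FROM employee WHERE dept = 'sales'
plan = project(
    filter_rows(table_scan(employee), lambda row: row["dept"] == "sales"),
    ["name"],
)

result_set = list(plan)  # rows are produced one at a time, on demand
print(result_set)
```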
Step-3:
Execution Engine: Finally, the execution engine runs the query and displays the required result.
The XML query language (XQuery) is a complete query language for XML databases. It
stands to XML databases as SQL stands to relational ones. An XML database is a collection of
(related) XML documents.
XQuery works on sequences, not on node sets as XPath does. A sequence contains items, which are
either XML nodes (such as elements and attributes) or atomic values (like strings and numbers).
XPath and XQuery are related in that XPath expressions are used in XQuery queries (xqueries).
Hence, we may consider XPath a syntactic fragment of XQuery.
An xquery typically goes through the following steps:
1. load: one or more XML documents are loaded from the database;
2. retrieve: XPath expressions are used to retrieve sequences of tree nodes from the loaded
documents;
3. process: the retrieved node sequences are processed with XQuery operations like
filtering (creating a new sequence by selecting some of the items of the original one) and
ordering (sorting the items of the sequence according to some criteria);
4. construct: new sequences may be constructed and combined with the retrieved ones;
5. output: a final sequence is returned as output.
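The five steps can be walked through with a Python analogue built on the standard library. This is not XQuery itself, only a sketch of the same flow; the inline document is a shortened, made-up bookstore, and the results wrapper element is an assumption.

```python
import xml.etree.ElementTree as ET

# 1. load: an inline document stands in for one loaded from the database.
doc = ET.fromstring("""<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>""")

# 2. retrieve: a path expression selects a sequence of nodes.
books = doc.findall("./book")

# 3. process: filter the sequence, then order it.
expensive = [b for b in books if float(b.findtext("price")) > 30]
expensive.sort(key=lambda b: b.findtext("title"))

# 4. construct: build new nodes from the retrieved ones.
results = ET.Element("results")
for book in expensive:
    ET.SubElement(results, "title").text = book.findtext("title")

# 5. output: return the final sequence, here serialized as text.
output = ET.tostring(results, encoding="unicode")
print(output)
```

In XQuery the filtering, ordering, and construction would normally be written as a single expression rather than as separate statements.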
Some of the above steps are optional. For instance, an xquery may avoid the loading of a
document and the retrieval of sequences with XPath expressions, and, instead, it may create,
process, and output its own sequences. In fact, XQuery can be used as a (Turing-
complete) programming language that manipulates item sequences, possibly using functions
defined by the user. In these pages we will focus on XQuery as an XML query language.
An XQuery processor is software that evaluates xqueries. See the XQuery resources page for
a list of XQuery processors. Saxon is a good example. To evaluate the query contained in the
file xquery.xql with Saxon, type the following:
java net.sf.saxon.Query xquery.xql
What is XQuery
XQuery is a functional query language used to retrieve information stored in XML format. XQuery
is to XML what SQL is to relational databases. It was designed to query XML data.
"XQuery is a standardized language for combining documents, databases, Web pages and almost
anything else. It is very widely implemented. It is powerful and easy to learn. XQuery is
replacing proprietary middleware languages and Web Application development languages.
XQuery is replacing complex Java or C++ programs with a few lines of code. XQuery is simpler
to work with and easier to maintain than many other alternatives."
What does it do
XQuery is a functional language which is responsible for finding and extracting elements and
attributes from XML documents.
XQuery Features
The XQuery query language has many features. A list of the top features is given below:
o XQuery is a functional language. It is used to retrieve and query XML based data.
o XQuery is an expression-oriented programming language with a simple type system.
o XQuery is analogous to SQL: just as SQL is the query language for relational databases,
XQuery is the query language for XML.
o XQuery is XPath based and uses XPath expressions to navigate through XML
documents.
o XQuery is a W3C standard and is supported by many major database systems.
Advantages of XQuery
XQuery vs XPath
Index | XQuery | XPath
1) | XQuery is a functional programming and query language that is used to query a group of XML data. | XPath is an XML path language that is used to select nodes from an XML document using queries.
2) | XQuery is used to extract and manipulate data from either XML documents or relational databases and MS Office documents that support an XML data source. | XPath is used to compute values like strings, numbers and boolean types from other XML documents.
and comments.
5) | The XQuery language helps to create syntax for new XML documents. | XPath was created to define a common syntax and behavior model for XPointer and XSLT.
What is XPath?
XPath can be used to navigate through elements and attributes in an XML document.
XPath uses path expressions to select nodes or node-sets in an XML document. These path
expressions look very much like the expressions you see when you work with a traditional
computer file system.
XPath expressions can be used in JavaScript, Java, XML Schema, PHP, Python, C and C++, and
lots of other languages.
With XPath knowledge you will be able to take great advantage of XSL.
XPath Example
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="web">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
In the table below we have listed some XPath expressions and the result of the expressions:
XPath Expression Result
element that have a price element with a
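A few path expressions can be tried against the bookstore document with Python's standard library. Two caveats: the document below is shortened to titles and prices, and ElementTree supports only a subset of XPath, so the numeric price comparison is done in Python rather than inside the path expression.

```python
import xml.etree.ElementTree as ET

bookstore = ET.fromstring("""<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>""")

# Paths are relative to the root element, which is <bookstore> here.
all_titles = [t.text for t in bookstore.findall("./book/title")]
first_title = bookstore.find("./book[1]/title").text
last_title = bookstore.find("./book[last()]/title").text
web_titles = [b.findtext("title") for b in bookstore.findall("./book[@category='web']")]
titles_with_lang = bookstore.findall(".//title[@lang]")

# Not expressible in ElementTree's XPath subset, so the filter runs in Python.
over_35 = [
    b.findtext("title")
    for b in bookstore.findall("./book")
    if float(b.findtext("price")) > 35
]

print(all_titles)
print(first_title, "|", last_title)
print(web_titles)
print(len(titles_with_lang))
print(over_35)
```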
Advanced techniques for XML Schema-based data include using object-relational storage;
annotating XML schemas; mapping XML Schema data types to SQL; using complexType extensions
and restrictions; creating, specifying relational constraints on, and partitioning XML
Schema-based data; storing XMLType data out of line; working with complex or large schemas; and
debugging schema registration.
Object-relational storage of XML documents is based on decomposing the document content into
a set of SQL objects. These SQL objects are based on the SQL 1999 Type framework. When an
XML schema is registered with Oracle XML DB, the required SQL type definitions are
automatically generated from the schema.
A SQL type definition is generated from each complexType defined by the XML schema. Each
element or attribute defined by the complexType becomes a SQL attribute in the corresponding
SQL type. Oracle XML DB automatically maps the 47 scalar data types defined by the XML
Schema Recommendation to the 19 scalar data types supported by SQL. A varray type is
generated for each element that can occur multiple times.
The generated SQL types allow XML content that is compliant with the XML schema to be
decomposed and stored in the database as a set of objects, without any loss of information. When
an XML document is ingested, the constructs defined by the XML schema are mapped directly
to the equivalent SQL types. This lets Oracle XML DB leverage the full power of Oracle
Database when managing XML, and it can lead to significant reductions in the amount of space
required to store the document. It can also reduce the amount of memory required to query and
update XML content.
When you register an XML schema for XMLType data that is stored object-relationally and you
set registration parameter GENTABLES to TRUE, default tables are created automatically to
store the associated XML instance documents.
Order is preserved among XML collection elements when they are stored. The result is
an ordered collection.
You can store data in an ordered collection as a varray in a table. Each element in the collection
is mapped to a SQL object. The collection of SQL objects is stored as a set of rows in a table,
called an ordered collection table (OCT). Oracle XML DB stores a collection as a heap-
based OCT.
You can also use out-of-line storage for an ordered collection. This corresponds to XML schema
annotation SQLInline = "false", and it means that a varray of REFs in the collection table (or the
LOB) tracks the collection content, which is stored out of line.
There is no requirement to annotate an XML schema before using it. Oracle XML DB uses a set
of default assumptions when processing an XML schema that contains no annotations.