Unit-4 ET

Semi-structured data is a flexible data format that contains tags and metadata, distinguishing it from structured data, which requires a predefined schema, and unstructured data, which lacks any organization. Common formats for semi-structured data include XML, JSON, Avro, ORC, and Parquet, each serving different purposes in data storage and transmission. XML databases can be categorized into XML-enabled and native XML databases, with XML Schema Definitions (XSD) used to validate the structure of XML data.

Uploaded by

Aasha Ganesh
Copyright: © All Rights Reserved
UNIT - 4

What is semi-structured data?

Semi-structured data (also known as partially structured data) is a type of data that doesn’t
follow the tabular structure associated with relational databases or other forms of data tables but
does contain tags and metadata to separate semantic elements and establish hierarchies of records
and fields.

How does semi-structured data differ from structured data?

Semi-structured and structured data are distinguished by two primary characteristics. The first is
schema. Unlike structured data, semi-structured data doesn't require a schema to be defined in
advance. Without a predefined, fixed schema, semi-structured data is more flexible and free to
evolve over time as new attributes are added. The second key differentiator is data structure.
Semi-structured data supports a hierarchical data structure that contains nested information. In
contrast, structured data represents data in flat tables of rows and columns. Semi-structured
data's nested hierarchy makes it a good fit for data received from apps and other
internet-enabled devices.
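
As a rough illustration of the difference, the Python sketch below parses one nested JSON record and flattens it into the kind of rows a relational table would need. The record, its field names, and the flatten helper are all made up for this example.

```python
import json

# A hypothetical device reading; the field names are illustrative only.
raw = """
{
  "device": {"id": "sensor-7", "location": {"city": "Pune", "floor": 2}},
  "readings": [
    {"metric": "temp", "value": 24.5},
    {"metric": "humidity", "value": 61, "unit": "%"}
  ]
}
"""

record = json.loads(raw)


def flatten(reading, device):
    """Turn one nested reading into a flat row, as a relational table would need."""
    return {
        "device_id": device["id"],
        "city": device["location"]["city"],
        "metric": reading["metric"],
        "value": reading["value"],
        # "unit" is optional: semi-structured records may omit attributes.
        "unit": reading.get("unit"),
    }


rows = [flatten(r, record["device"]) for r in record["readings"]]
for row in rows:
    print(row)
```

The nested record carries its hierarchy and optional attributes with it, while the flat rows need a fixed set of columns, with a missing attribute stored as an empty value.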

How does semi-structured data differ from unstructured data?

Unstructured data is raw data with no established data model or schema. Semi-structured data is
unlike unstructured data in that it has some definite and consistent markers that create distinct
semantic elements and impose an organizational hierarchy of records and fields within the data.

EXAMPLES OF SEMI-STRUCTURED DATA FORMATS

Semi-structured data comes in a variety of formats, depending on the source it originates from.
Here are a few of the most common:

XML: Extensible Markup Language (XML) has become one of the most popular semi-structured
data formats. This versatile and easy-to-use markup language allows users to define tags and
attributes required for storing data in a hierarchical form.

JSON: A commonly used alternative to XML, JavaScript Object Notation (JSON) is a lightweight text
format built from name/value pairs and ordered lists. Semi-structured data from IoT devices, web
browsers, and smartphones is often encoded as JSON and grouped into batches before being sent to a
data platform through a data pipeline. The format is also widely used to transfer data between
servers and apps or internet-connected devices.

Avro: Originally developed for use with Apache Hadoop, Avro is a data serialization system that
also provides a remote procedure call (RPC) framework. Using schemas defined in JSON, Avro
serializes data in a compact binary format that can be sent to any app or program, where it is
deserialized.

ORC: Optimized Row Columnar (ORC) is a columnar semi-structured data format that was initially
designed to achieve more efficient compression and better performance for reading, writing,
and processing data than earlier Hive formats.

Parquet: Another columnar storage file format similar to ORC, Parquet is designed for use in the
Hadoop ecosystem. Parquet is ideal for working with complex data in large volumes and features
different methods for efficient data compression and encoding types.

XML Database

An XML database is a data persistence software system used for storing large amounts of
information in XML format. It provides a secure place to store XML documents.

You can query the stored data with XQuery, then export and serialize it into the desired format.
XML databases are usually associated with document-oriented databases.

Types of XML databases

There are two types of XML databases.

1. XML-enabled database
2. Native XML database (NXD)

XML-enabled Database

An XML-enabled database works just like a relational database, with an extension that handles
the conversion of XML documents. In this type of database, data is stored in tables, in the form
of rows and columns.

Native XML Database

A native XML database is used to store large amounts of XML data. Instead of a table format, a
native XML database is based on containers that hold XML documents. You can query the data with
XPath expressions.

A native XML database is often preferred over an XML-enabled database because it is better suited
to storing, maintaining, and querying XML documents.

Let's take an example of XML database:

<?xml version="1.0"?>
<contact-info>
   <contact1>
      <name>Vimal Jaiswal</name>
      <company>SSSIT.org</company>
      <phone>(0120) 4256464</phone>
   </contact1>
   <contact2>
      <name>Mahesh Sharma</name>
      <company>SSSIT.org</company>
      <phone>09990449935</phone>
   </contact2>
</contact-info>

In the above example, the root element contact-info holds two contacts (contact1 and
contact2). Each one contains three child elements: name, company, and phone.
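
To give a feel for path-based querying of such a document, here is a minimal Python sketch that uses the standard library's xml.etree.ElementTree. It is not an XML database, only an in-memory parse of the same contact data; the variable names are chosen for this example.

```python
import xml.etree.ElementTree as ET

CONTACTS_XML = """
<contact-info>
  <contact1>
    <name>Vimal Jaiswal</name>
    <company>SSSIT.org</company>
    <phone>(0120) 4256464</phone>
  </contact1>
  <contact2>
    <name>Mahesh Sharma</name>
    <company>SSSIT.org</company>
    <phone>09990449935</phone>
  </contact2>
</contact-info>
"""

root = ET.fromstring(CONTACTS_XML)

# "*/name" is a path expression: the name child of every child of the root.
names = [el.text.strip() for el in root.findall("*/name")]

# Build a lookup from contact name to phone by walking the tree.
phone_by_name = {
    contact.findtext("name").strip(): contact.findtext("phone")
    for contact in root
}

print(names)
print(phone_by_name["Mahesh Sharma"])
```

A native XML database evaluates similar path expressions, but against stored and indexed documents rather than a string held in memory.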

XML - Schemas
XML Schema is commonly known as XML Schema Definition (XSD). It is used to describe
and validate the structure and the content of XML data. XML schema defines the elements,
attributes and data types. Schema element supports Namespaces. It is similar to a database
schema that describes the data in a database.

Syntax

You need to declare a schema in your XML document as follows −

<xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema">

Example
The following example shows how to use schema −

<?xml version = "1.0" encoding = "UTF-8"?>

<xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema">
   <xs:element name = "contact">
      <xs:complexType>
         <xs:sequence>
            <xs:element name = "name" type = "xs:string" />
            <xs:element name = "company" type = "xs:string" />
            <xs:element name = "phone" type = "xs:int" />
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>
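The basic idea behind XML Schemas is that they describe the legitimate format that an XML
document can take. Note that xs:int accepts only whole numbers in the 32-bit signed range, so a
phone number written as (0120) 4256464 would not validate against this schema; xs:string is the
safer type for phone numbers.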
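
Python's standard library has no XSD validator, so the sketch below is only a hand-written approximation of two rules from the contact schema: the order of the children in the xs:sequence and the xs:int type of phone. The check_contact helper and the sample documents are hypothetical; a real project would use a schema-aware validator instead.

```python
import xml.etree.ElementTree as ET

# Expected children, in order, mirroring the xs:sequence in the schema.
EXPECTED = [("name", str), ("company", str), ("phone", int)]
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # value range of xs:int


def check_contact(xml_text):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper: it covers only element order and the xs:int
    range, not the full XML Schema rules.
    """
    contact = ET.fromstring(xml_text)
    problems = []
    tags = [child.tag for child in contact]
    if tags != [name for name, _ in EXPECTED]:
        problems.append(f"unexpected children: {tags}")
        return problems
    for child, (name, kind) in zip(contact, EXPECTED):
        if kind is int:
            text = (child.text or "").strip()
            try:
                value = int(text)
            except ValueError:
                problems.append(f"{name}: {text!r} is not an integer")
                continue
            if not INT_MIN <= value <= INT_MAX:
                problems.append(f"{name}: {value} is outside the xs:int range")
    return problems


ok = "<contact><name>A</name><company>B</company><phone>4256464</phone></contact>"
bad = "<contact><name>A</name><company>B</company><phone>(0120) 4256464</phone></contact>"
big = "<contact><name>A</name><company>B</company><phone>09990449935</phone></contact>"

print(check_contact(ok))
print(check_contact(bad))
print(check_contact(big))
```

The second and third documents show why a numeric type is a poor fit for phone numbers: punctuation is rejected outright, and long numbers overflow the 32-bit range.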

Elements

As we saw in the XML - Elements chapter, elements are the building blocks of XML document.
An element can be defined within an XSD as follows −

<xs:element name = "x" type = "y"/>

Definition Types

You can define XML schema elements in the following ways −
Simple Type
A simple type element can contain only text; it cannot have child elements or attributes. Some of
the predefined simple types are: xs:integer, xs:boolean, xs:string, xs:date. For example −

<xs:element name = "phone_number" type = "xs:int" />


Complex Type
A complex type is a container for other element definitions. This allows you to specify which
child elements an element can contain and to provide some structure within your XML
documents. For example −
<xs:element name = "Address">
   <xs:complexType>
      <xs:sequence>
         <xs:element name = "name" type = "xs:string" />
         <xs:element name = "company" type = "xs:string" />
         <xs:element name = "phone" type = "xs:int" />
      </xs:sequence>
   </xs:complexType>
</xs:element>
In the above example, the Address element consists of child elements. It is a container for
other <xs:element> definitions, which allows you to build a simple hierarchy of elements in the
XML document.
Global Types
With a global type, you can define a single type in your document that can be used by all
other references. For example, suppose you want to generalize the person and company for
different addresses of the company. In such a case, you can define a general type as follows −
<xs:complexType name = "AddressType">
   <xs:sequence>
      <xs:element name = "name" type = "xs:string" />
      <xs:element name = "company" type = "xs:string" />
   </xs:sequence>
</xs:complexType>
Now let us use this type in our example as follows −

<xs:element name = "Address1">
   <xs:complexType>
      <xs:sequence>
         <xs:element name = "address" type = "AddressType" />
         <xs:element name = "phone1" type = "xs:int" />
      </xs:sequence>
   </xs:complexType>
</xs:element>

<xs:element name = "Address2">
   <xs:complexType>
      <xs:sequence>
         <xs:element name = "address" type = "AddressType" />
         <xs:element name = "phone2" type = "xs:int" />
      </xs:sequence>
   </xs:complexType>
</xs:element>
Instead of having to define the name and the company twice (once for Address1 and once
for Address2), we now have a single definition. This makes maintenance simpler: if you
decide to add "Postcode" elements to the address, you need to add them in just one place.

Attributes

Attributes in XSD provide extra information within an element. Attributes have name and type
properties, as shown below −
<xs:attribute name = "x" type = "y"/>

How to Store and Extract XML Documents from Databases

Several approaches can be used to organize the contents of XML documents so that they are easier
to query and retrieve later.
The following approaches are used for storing XML documents in a database and extracting them
from it.
1. Using a DBMS to store the documents as text. A relational or object DBMS stores each
whole XML document as a text field within a DBMS record or object. This approach can be
used if the DBMS has a special module for document processing, and it works for
schemaless and document-centric XML documents.

2. Using a DBMS to store the document contents as data elements. This approach works for
a collection of documents that follow a specific XML DTD or XML schema. Because all the
documents have the same structure, a relational or object database can be designed to
store the leaf-level data elements of the XML documents. This requires mapping algorithms
to design a database schema that is compatible with the XML document structure.

The same algorithms are used to recreate the XML documents from the stored data, as
specified by the XML schema or DTD. They can be implemented either as an internal DBMS
module or as separate middleware that is not part of the DBMS.

3. Designing a specialized system for storing native XML data. A new type of database
system based on the hierarchical (tree) model is designed and implemented. Such a system
is called a native XML DBMS. It can include specialized indexing and querying techniques,
and it works for all types of XML documents.

It may also include data compression techniques to reduce the size of the documents for
storage. Tamino by Software AG and the Dynamic Application Platform of eXcelon are
products that offer native XML DBMS capability, and Oracle also offers a native XML
storage option.
4. Creating or publishing customized XML documents from pre-existing relational databases.
Because relational databases already store huge amounts of data, parts of that data may
need to be formatted as documents for exchange or for display on the web.

This approach uses a separate middleware software layer to handle the conversion
between the XML documents and the relational database.
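
The first, second, and fourth approaches can be sketched in a few lines of Python with the built-in sqlite3 module. This is only a toy illustration: the table names, the hand-written element-to-column mapping, and the publish function are assumptions made for this example, not part of any particular DBMS.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = "<contact><name>Vimal Jaiswal</name><company>SSSIT.org</company></contact>"

db = sqlite3.connect(":memory:")

# Approach 1: keep the whole document as a text field.
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("INSERT INTO docs (body) VALUES (?)", (DOC,))

# Approach 2: map leaf-level elements to columns (a hand-written mapping).
db.execute("CREATE TABLE contacts (name TEXT, company TEXT)")
contact = ET.fromstring(DOC)
db.execute(
    "INSERT INTO contacts VALUES (?, ?)",
    (contact.findtext("name"), contact.findtext("company")),
)


# Approach 4: publish relational rows as a new XML document.
def publish(connection):
    root = ET.Element("contacts")
    for name, company in connection.execute("SELECT name, company FROM contacts"):
        item = ET.SubElement(root, "contact")
        ET.SubElement(item, "name").text = name
        ET.SubElement(item, "company").text = company
    return ET.tostring(root, encoding="unicode")


stored_text = db.execute("SELECT body FROM docs").fetchone()[0]
published = publish(db)
print(stored_text)
print(published)
```

Storing the text keeps the document byte for byte but makes its contents opaque to SQL, while the column mapping makes the values queryable at the cost of needing a mapping in both directions.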
SQL | Query Processing

Query processing includes the translation of high-level queries into low-level expressions that
can be used at the physical level of the file system, query optimization, and the actual
execution of the query to get the result.

[Figure: block diagram of query processing]

[Figure: detailed diagram of query processing]

It is done in the following steps:
Step-1:

Parser: During the parse call, the database performs a syntax check, a semantic check, and a
shared pool check as part of converting the query into relational algebra.
The parser performs these checks as follows (refer to the detailed diagram):
1. Syntax check – verifies the syntactic validity of the SQL. Example:
SELECT * FORM employee
Here the misspelling of FROM is reported by this check.
2. Semantic check – determines whether the statement is meaningful. Example:
a query that names a table which does not exist is rejected by this check.
3. Shared pool check – every query is assigned a hash code when it is parsed. This
check looks for that hash code in the shared pool; if the code exists there, the
database skips the additional optimization steps and goes directly to execution.
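
The first two checks can be observed with any SQL engine. The Python sketch below uses the built-in sqlite3 module as a stand-in; SQLite has no shared pool, so only the syntax and semantic checks are shown, and the try_query helper is a name made up for this example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employee (id INTEGER, name TEXT)")


def try_query(sql):
    """Run a statement and return 'ok' or the engine's error message."""
    try:
        db.execute(sql)
        return "ok"
    except sqlite3.OperationalError as exc:
        return str(exc)


syntax_result = try_query("SELECT * FORM employee")   # misspelled FROM
semantic_result = try_query("SELECT * FROM employe")  # table does not exist
valid_result = try_query("SELECT * FROM employee")

print(syntax_result)
print(semantic_result)
print(valid_result)
```

The first statement fails before any table is looked up, while the second is well formed but refers to an object the catalog does not contain.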

Hard Parse and Soft Parse –

If a query is new and its hash code does not exist in the shared pool, the query has to
pass through additional steps; this is known as hard parsing. If the hash code already
exists, the query skips those steps and passes directly to the execution engine (refer to the
detailed diagram); this is known as soft parsing.
A hard parse includes the following steps: optimization and row source generation.
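
The difference between the two kinds of parse can be sketched as a cache keyed by a hash of the statement text. The SharedPool class below is a toy model written for this example, with a placeholder in place of the real optimizer; it is not how any particular database implements its shared pool.

```python
import hashlib


class SharedPool:
    """Toy statement cache keyed by a hash of the SQL text."""

    def __init__(self):
        self.plans = {}
        self.hard_parses = 0
        self.soft_parses = 0

    def parse(self, sql):
        key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        if key in self.plans:
            # Soft parse: reuse the cached plan and skip optimization.
            self.soft_parses += 1
        else:
            # Hard parse: run the placeholder optimizer and cache its plan.
            self.hard_parses += 1
            self.plans[key] = self._optimize(sql)
        return self.plans[key]

    def _optimize(self, sql):
        # Placeholder for optimization and row source generation.
        return f"plan for: {sql}"


pool = SharedPool()
pool.parse("SELECT * FROM employee")
pool.parse("SELECT * FROM employee")     # same text, so a soft parse
pool.parse("SELECT name FROM employee")  # new text, so a hard parse
print(pool.hard_parses, pool.soft_parses)
```

Because the key is derived from the statement text, two statements that differ only in spacing or letter case would count as different queries in this model.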

Step-2:
Optimizer: The database must perform a hard parse at least once for every unique DML
statement, and it performs optimization during this parse. The database never optimizes
DDL unless it includes a DML component, such as a subquery, that requires optimization.

Optimization is the process in which multiple execution plans that satisfy a query are examined
and the most efficient plan is selected for execution.
The database catalog stores the execution plans, and the optimizer passes the lowest-cost plan
on for execution.
Row Source Generation –
The row source generator is software that receives the optimal execution plan from the
optimizer and produces an iterative execution plan that is usable by the rest of the database.
The iterative plan is a binary program that, when executed by the SQL engine, produces the
result set.

Step-3:
Execution Engine: Finally, the execution engine runs the query and displays the required result.

XML Query Language

The XML query language (XQuery) is a complete query language for XML databases. It
stands to XML databases as SQL stands to relational ones. An XML database is a collection of
(related) XML documents.

XQuery works on sequences, not on node sets as XPath 1.0 does. A sequence contains items which are
either XML nodes (such as elements and attributes) or atomic values (like strings and numbers).
The relationship between XPath and XQuery consists in the fact that XPath expressions are used
in XQuery queries (xqueries). Hence, we may consider XPath as a syntactic fragment of
XQuery.

A typical xquery works as follows:

1. load: one or more XML documents are loaded from the database;
2. retrieve: XPath expressions are used to retrieve sequences of tree nodes from the loaded
documents;
3. process: the retrieved node sequences are processed with XQuery operations like
filtering (creating a new sequence by selecting some of the items of the original one) and
ordering (sorting the items of the sequence according to some criteria);
4. construct: new sequences may be constructed and combined with the retrieved ones;
5. output: a final sequence is returned as output.
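
The five steps can be mimicked in Python with xml.etree.ElementTree. This is an analogy written in Python, not XQuery itself, and the small bookstore document and element names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """
<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>XQuery Kick Start</title><price>49.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>
"""

# 1. load: parse the document into a tree.
root = ET.fromstring(BOOKS_XML)

# 2. retrieve: a path expression returns a sequence of nodes.
books = root.findall("book")

# 3. process: filter the sequence, then order it by title.
expensive = [b for b in books if float(b.findtext("price")) > 35.00]
expensive.sort(key=lambda b: b.findtext("title"))

# 4. construct: build new nodes from the retrieved ones.
result = ET.Element("expensive-titles")
for b in expensive:
    ET.SubElement(result, "title").text = b.findtext("title")

# 5. output: serialize the final sequence.
output = ET.tostring(result, encoding="unicode")
print(output)
```

In XQuery the same pipeline is typically written as a single expression, with the filtering, ordering, and construction expressed declaratively rather than as separate statements.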

Some of the above steps are optional. For instance, an xquery may avoid the loading of a
document and the retrieval of sequences with XPath expressions, and, instead, it may create,
process, and output its own sequences. In fact, XQuery can be used as a (Turing-
complete) programming language that manipulates item sequences, possibly using functions
defined by the user. In these pages we will focus on XQuery as an XML query language.

An XQuery processor is software that evaluates xqueries. See the XQuery resources page for
a list of XQuery processors. Saxon is a good example. To evaluate with Saxon the query
contained in the file xquery.xql, type the following:

java net.sf.saxon.Query xquery.xql

What is XQuery

XQuery is a functional query language used to retrieve information stored in XML format. It is
to XML what SQL is to relational databases. It was designed to query XML data.

XQuery is built on XPath expressions. It is a W3C recommendation that is supported by many
major databases.

The definition of XQuery given in its official documentation is as follows:

"XQuery is a standardized language for combining documents, databases, Web pages and almost
anything else. It is very widely implemented. It is powerful and easy to learn. XQuery is
replacing proprietary middleware languages and Web Application development languages.
XQuery is replacing complex Java or C++ programs with a few lines of code. XQuery is simpler
to work with and easier to maintain than many other alternatives."

What does it do

XQuery is a functional language which is responsible for finding and extracting elements and
attributes from XML documents.

It can be used for the following things:

o To extract information to use in a web service.
o To generate summary reports.
o To transform XML data to XHTML.
o To search web documents for relevant information.

XQuery Features

XQuery has many features. The main ones are listed below:

o XQuery is a functional language. It is used to retrieve and query XML-based data.
o XQuery is an expression-oriented programming language with a simple type system.
o XQuery is analogous to SQL: just as SQL is the query language for relational databases,
XQuery is the query language for XML.
o XQuery is XPath based and uses XPath expressions to navigate through XML
documents.
o XQuery is a W3C standard and is supported by many major databases.

Advantages of XQuery

o XQuery can be used to retrieve both hierarchical and tabular data.
o XQuery can also be used to query tree and graph structures.
o XQuery can be used to build web pages.
o XQuery can be used to query web pages.
o XQuery is best for XML-based databases and object-based databases. Object databases
are much more flexible and powerful than purely tabular databases.
o XQuery can be used to transform XML documents into XHTML documents.

XQuery vs XPath
Index | XQuery | XPath

1) | XQuery is a functional programming and query language that is used to query a group of XML data. | XPath is an XML path language that is used to select nodes from an XML document using queries.

2) | XQuery is used to extract and manipulate data from either XML documents or relational databases and MS Office documents that support an XML data source. | XPath is used to compute values like strings, numbers and boolean types from other XML documents.

3) | XQuery is represented in the form of a tree model with seven nodes, namely processing instructions, elements, document nodes, attributes, namespaces, text nodes, and comments. | XPath is represented as a tree structure, which is navigated by selecting different nodes.

4) | XQuery supports XPath and extended relational models. | XPath is still a component of the query language.

5) | The XQuery language helps to create syntax for new XML documents. | XPath was created to define a common syntax and behavior model for XPointer and XSLT.

What is XPath?

XPath is a major element in the XSLT standard.

XPath can be used to navigate through elements and attributes in an XML document.

o XPath is a syntax for defining parts of an XML document
o XPath uses path expressions to navigate in XML documents
o XPath contains a library of standard functions
o XPath is a major element in XSLT and in XQuery
o XPath is a W3C recommendation

XPath Path Expressions

XPath uses path expressions to select nodes or node-sets in an XML document. These path
expressions look very much like the expressions you see when you work with a traditional
computer file system.

XPath expressions can be used in JavaScript, Java, XML Schema, PHP, Python, C and C++, and
lots of other languages.

XPath is Used in XSLT

XPath is a major element in the XSLT standard.

With XPath knowledge you will be able to take great advantage of XSL.

XPath Example

We will use the following XML document:

<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>

<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>

<book category="web">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>

</bookstore>

In the table below we have listed some XPath expressions and the result of the expressions:

XPath Expression | Result

/bookstore/book[1] | Selects the first book element that is the child of the bookstore element

/bookstore/book[last()] | Selects the last book element that is the child of the bookstore element

/bookstore/book[last()-1] | Selects the last but one book element that is the child of the bookstore element

/bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element

//title[@lang] | Selects all the title elements that have an attribute named lang

//title[@lang='en'] | Selects all the title elements that have a "lang" attribute with a value of "en"

/bookstore/book[price>35.00] | Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00

/bookstore/book[price>35.00]/title | Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00
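
Several of these expressions can be tried with Python's xml.etree.ElementTree, which supports only a subset of XPath. In this sketch the paths are written relative to the bookstore root element, the document is a shortened copy of the one above, and the price and position comparisons are done in plain Python because ElementTree does not evaluate them.

```python
import xml.etree.ElementTree as ET

BOOKSTORE = """
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title><price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title><price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">XQuery Kick Start</title><price>49.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title><price>39.95</price>
  </book>
</bookstore>
"""

root = ET.fromstring(BOOKSTORE)  # root is the bookstore element

# /bookstore/book[1], book[last()] and book[last()-1], relative to the root.
first = root.find("book[1]").findtext("title")
last = root.find("book[last()]").findtext("title")
last_but_one = root.find("book[last()-1]").findtext("title")

# //title[@lang='en']
english = [t.text for t in root.findall(".//title[@lang='en']")]

# Equivalent of /bookstore/book[price>35.00]/title, filtered in Python.
over_35 = [
    b.findtext("title")
    for b in root.findall("book")
    if float(b.findtext("price")) > 35.00
]

# Equivalent of /bookstore/book[position()<3], done with a slice.
first_two = [b.findtext("title") for b in root.findall("book")[:2]]

print(first, last, last_but_one, sep=" | ")
print(english)
print(over_35)
print(first_two)
```

A full XPath processor would accept the expressions from the table unchanged, including the leading /bookstore step and the comparison predicates.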

XML Schema Storage and Query: Object-Relational Storage

Advanced techniques for XML Schema-based data include using object-relational storage;
annotating XML schemas; mapping Schema data types to SQL; using complexType extensions
and restrictions; creating, specifying relational constraints on, and partitioning XML
Schema-based data; storing XMLType data out of line; working with complex or large schemas;
and debugging schema registration.

Object-Relational Storage of XML Documents

Object-relational storage of XML documents is based on decomposing the document content into
a set of SQL objects. These SQL objects are based on the SQL 1999 Type framework. When an
XML schema is registered with Oracle XML DB, the required SQL type definitions are
automatically generated from the schema.

A SQL type definition is generated from each complexType defined by the XML schema. Each
element or attribute defined by the complexType becomes a SQL attribute in the corresponding
SQL type. Oracle XML DB automatically maps the 47 scalar data types defined by the XML
Schema Recommendation to the 19 scalar data types supported by SQL. A varray type is
generated for each element that can occur multiple times.

The generated SQL types allow XML content that is compliant with the XML schema to be
decomposed and stored in the database as a set of objects, without any loss of information. When
an XML document is ingested, the constructs defined by the XML schema are mapped directly
to the equivalent SQL types. This lets Oracle XML DB leverage the full power of Oracle
Database when managing XML, and it can lead to significant reductions in the amount of space
required to store the document. It can also reduce the amount of memory required to query and
update XML content.

How Collections Are Stored for Object-Relational XMLType Storage


You can store an ordered collection as a varray in an ordered collection table (OCT), which can
be a heap-based table. You can store the actual data out of line by using varray entries that
are REFs to the data.

When you register an XML schema for XMLType data that is stored object-relationally and you
set registration parameter GENTABLES to TRUE, default tables are created automatically to
store the associated XML instance documents.

Order is preserved among XML collection elements when they are stored. The result is
an ordered collection.

You can store data in an ordered collection as a varray in a table. Each element in the collection
is mapped to a SQL object. The collection of SQL objects is stored as a set of rows in a table,
called an ordered collection table (OCT). Oracle XML DB stores a collection as a heap-
based OCT.
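
The idea of keeping a collection as ordered rows can be illustrated with a small Python and SQLite sketch. It is only an analogy: the table layout, the pos column, and the names used here are assumptions for this example and do not reproduce how Oracle XML DB lays out an ordered collection table.

```python
import sqlite3
import xml.etree.ElementTree as ET

BOOK = """
<book>
  <title>XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
</book>
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT)")
# One row per collection element; "pos" records document order.
db.execute("CREATE TABLE book_author (book_id INTEGER, pos INTEGER, name TEXT)")

book = ET.fromstring(BOOK)
book_id = db.execute(
    "INSERT INTO book (title) VALUES (?)", (book.findtext("title"),)
).lastrowid
for pos, author in enumerate(book.findall("author")):
    db.execute(
        "INSERT INTO book_author VALUES (?, ?, ?)", (book_id, pos, author.text)
    )

# Reading back ordered by "pos" reproduces the original element order.
authors = [
    name
    for (name,) in db.execute(
        "SELECT name FROM book_author WHERE book_id = ? ORDER BY pos", (book_id,)
    )
]
print(authors)
```

Recording the position explicitly is what lets the repeated elements be returned in document order, since rows in a table have no inherent order of their own.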

You can also use out-of-line storage for an ordered collection. This corresponds to XML schema
annotation SQLInline = "false", and it means that a varray of REFs in the collection table (or the
LOB) tracks the collection content, which is stored out of line.

There is no requirement to annotate an XML schema before using it. Oracle XML DB uses a set
of default assumptions when processing an XML schema that contains no annotations.
