Unit-IV XML and Data Warehouse
XML Database: XML – XML Schema – XML DOM and SAX Parsers – XSL – XSLT – XPath
and XQuery – Data Warehouse: Introduction – Multidimensional Data Modeling – Star and
Snowflake Schema – Architecture – OLAP Operations and Queries
XML – DATABASES
An XML database is used to store large amounts of information in XML format.
As the use of XML is increasing in every field, a secure place is needed to
store XML documents.
The data stored in the database can be queried using XQuery, serialized, and exported
into a desired format.
There are two major types of XML databases:
XML-enabled
Native XML (NXD)
An XML-enabled database provides an extension for the conversion of XML
documents.
This is a relational database, where data is stored in tables consisting of rows and
columns.
A native XML database is based on containers rather than a table format.
It can store large amounts of XML documents and data.
A native XML database is queried using XPath expressions.
A native XML database has an advantage over an XML-enabled database:
it is more capable of storing, querying, and maintaining XML documents.
<contact-info>
   <contact2>
      <name>Manisha Patil</name>
      <company>TutorialsPoint</company>
      <phone>(011) 789-4567</phone>
   </contact2>
</contact-info>
XML SCHEMA
SYNTAX
EXAMPLE
ELEMENTS
DEFINITION TYPES
SIMPLE TYPE
COMPLEX TYPE
GLOBAL TYPES
With the global type, you can define a single type in your document, which can be used
by all other references.
For example, suppose you want to generalize the person and company for different
addresses of the company. In such a case, you can define a general type as follows −
<xs:complexType name = "AddressType">
   <xs:sequence>
      <xs:element name = "name" type = "xs:string" />
      <xs:element name = "company" type = "xs:string" />
   </xs:sequence>
</xs:complexType>
Now let us use this type in our example as follows −
<xs:element name = "Address1">
   <xs:complexType>
      <xs:sequence>
         <xs:element name = "address" type = "AddressType" />
         <xs:element name = "phone1" type = "xs:int" />
      </xs:sequence>
   </xs:complexType>
</xs:element>
ATTRIBUTES
A DOM parser consumes more memory (if the XML structure is large), as the document
tree, once built, remains in memory until it is explicitly removed.
Due to this extensive use of memory, its operational speed is slower than that of SAX.
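To make the memory point concrete, here is a minimal sketch using Python's standard xml.dom.minidom module. The choice of Python, the sample document, and the helper name read_names are assumptions made for illustration; these notes do not name a parser library. The parser builds the entire tree before any node can be read, which is why memory use grows with document size.

```python
from xml.dom import minidom

# Hypothetical sample document, modelled on the contact example in these notes.
SAMPLE = """<contact-info>
  <contact2>
    <name>Manisha Patil</name>
    <company>TutorialsPoint</company>
    <phone>(011) 789-4567</phone>
  </contact2>
</contact-info>"""


def read_names(xml_text):
    """Parse the whole document into a DOM tree, then collect <name> text."""
    dom = minidom.parseString(xml_text)  # the entire tree is now in memory
    try:
        names = []
        for node in dom.getElementsByTagName("name"):
            # Each <name> element in this sample has a single text child.
            names.append(node.firstChild.data)
        return names
    finally:
        dom.unlink()  # explicitly release the tree once it is no longer needed


if __name__ == "__main__":
    print(read_names(SAMPLE))
```

Because the tree stays available until unlink() is called, any node can be revisited in any order, which is the convenience that DOM trades memory for.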
SAX (Simple API for XML)
A SAX parser implements the SAX API. This API is event-based and less intuitive.
Clients do not choose which methods to call; they override the callback methods of the API and place
their own code inside those methods.
Advantages
Disadvantages
2) Clients never know the full information because the data is broken into pieces.
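The event-based style described above can be sketched with Python's standard xml.sax module. The handler class NameCollector, the helper collect_names, and the sample document are hypothetical names chosen for this illustration. The parser calls the overridden methods as it streams through the input, so the client only ever sees one piece of the document at a time.

```python
import xml.sax

# Hypothetical sample document, modelled on the contact example in these notes.
SAMPLE = b"""<contact-info>
  <contact2>
    <name>Manisha Patil</name>
    <company>TutorialsPoint</company>
  </contact2>
</contact-info>"""


class NameCollector(xml.sax.ContentHandler):
    """Overrides SAX callbacks to collect the text of every <name> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is buffered until the end tag.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buffer))
            self._in_name = False


def collect_names(xml_bytes):
    handler = NameCollector()
    xml.sax.parseString(xml_bytes, handler)  # the parser drives the callbacks
    return handler.names


if __name__ == "__main__":
    print(collect_names(SAMPLE))
```

No tree is kept here: once endElement returns, the parser retains nothing about that element, which is why SAX suits large inputs but gives the client no view of the whole document.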
XSL
XSL (Extensible Stylesheet Language), formerly called Extensible
Style Language, is a language for creating a style sheet that
describes how data sent over the Web using the Extensible
Markup Language (XML) is to be presented to the user. ...
XSL is developed under the auspices of the World Wide Web
Consortium (W3C).
XSLT
EXtensible Stylesheet Language Transformations, commonly known as XSLT, is a
way to transform an XML document into other formats such as XHTML.
XSL
Before learning XSLT, we should first understand XSL, which stands for
EXtensible Stylesheet Language. XSL is to XML what CSS is to HTML.
What is XSLT
XSLT, Extensible Stylesheet Language Transformations, provides the ability to
transform XML data from one format to another automatically.
XSLT SYNTAX
Suppose we have the following sample XML file, students.xml, which needs to be
transformed into a well-formatted HTML document.
students.xml
<?xml version = "1.0"?>
<class>
   <student rollno = "393">
      <firstname>Dinkar</firstname>
      <lastname>Kad</lastname>
      <nickname>Dinkar</nickname>
      <marks>85</marks>
   </student>
   <student rollno = "493">
      <firstname>Vaneet</firstname>
      <lastname>Gupta</lastname>
      <nickname>Vinni</nickname>
      <marks>95</marks>
   </student>
   <student rollno = "593">
      <firstname>Jasvir</firstname>
      <lastname>Singh</lastname>
      <nickname>Jazz</nickname>
      <marks>90</marks>
   </student>
</class>
We need to define an XSLT style sheet document for the above XML document to meet
the following criteria −
Page should have a title Students.
Page should have a table of student details.
Columns should have following headers: Roll No, First Name, Last Name, Nick
Name, Marks
Table must contain details of the students accordingly.
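<?xml version = "1.0" encoding = "UTF-8"?>
<xsl:stylesheet version = "1.0"
   xmlns:xsl = "http://www.w3.org/1999/XSL/Transform">
   <xsl:template match = "/">
      <html>
         <body>
            <h2>Students</h2>
            <table border = "1">
               <tr>
                  <th>Roll No</th>
                  <th>First Name</th>
                  <th>Last Name</th>
                  <th>Nick Name</th>
                  <th>Marks</th>
               </tr>
               <xsl:for-each select = "class/student">
                  <tr>
                     <!-- xsl:value-of outputs the string value of the
                          node selected by the XPath expression -->
                     <td><xsl:value-of select = "@rollno"/></td>
                     <td><xsl:value-of select = "firstname"/></td>
                     <td><xsl:value-of select = "lastname"/></td>
                     <td><xsl:value-of select = "nickname"/></td>
                     <td><xsl:value-of select = "marks"/></td>
                  </tr>
               </xsl:for-each>
            </table>
         </body>
      </html>
   </xsl:template>
</xsl:stylesheet>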
Output
XPATH AND XQUERY
XPath (XML Path Language) is a query language for selecting nodes from an XML
document. In addition, XPath may be used to compute values (e.g., strings, numbers, or
Boolean values) from the content of an XML document.
XPath path expressions look very much like the path expressions you use with
traditional computer file systems.
XPath Standard Functions
XPath includes over 200 built-in functions.
There are functions for string values, numeric values, booleans, date and time
comparison, node manipulation, sequence manipulation, and much more.
Today XPath expressions can also be used in JavaScript, Java, XML Schema, PHP,
Python, C and C++, and lots of other languages.
With XPath knowledge you will be able to take great advantage of your XSLT
knowledge.
XPath Terminology
Nodes
In XPath, there are seven kinds of nodes: element, attribute, text, namespace,
processing-instruction, comment, and document nodes.
XML documents are treated as trees of nodes. The topmost element of the tree is
called the root element.
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
Relationship of Nodes
Parent
Each element and attribute has one parent.
In the following example, the book element is the parent of the title, author, year,
and price elements:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Children
Element nodes may have zero, one or more children.
In the following example, the title, author, year, and price elements are all
children of the book element:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Siblings
Nodes that have the same parent.
In the following example, the title, author, year, and price elements are all
siblings:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
XPath Syntax
XPath uses path expressions to select nodes or node-sets in an XML document. The
node is selected by following a path or steps.
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="en">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>
Selecting Nodes
XPath uses path expressions to select nodes in an XML document. The node is
selected by following a path or steps. The most useful path expressions are listed
below:
Expression Description
@ Selects attributes
In the table below we have listed some path expressions and the result of the
expressions:
bookstore//book Selects all book elements that are descendant of the bookstore
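As a small illustration of these path expressions, the sketch below runs a few of them with Python's standard xml.etree.ElementTree module against the bookstore document shown earlier. ElementTree supports only a subset of XPath and works with paths relative to an element, so the absolute forms from the text are written here starting from the root element with "./" or ".//"; that rewriting is an assumption of this example, not part of XPath itself.

```python
import xml.etree.ElementTree as ET

BOOKSTORE = """<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE)  # root is the <bookstore> element

# bookstore//book: every book element that is a descendant of bookstore.
books = root.findall(".//book")

# /bookstore/book/title: title children of the book children of the root.
titles = [t.text for t in root.findall("./book/title")]

# @lang: ElementTree returns attribute values through .get(), not as nodes.
langs = [t.get("lang") for t in root.findall(".//title")]

# /bookstore/*: all child element nodes of the bookstore element.
child_tags = [child.tag for child in root.findall("./*")]

if __name__ == "__main__":
    print(len(books), titles, langs, child_tags)
```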
Predicates
Predicates are used to find a specific node or a node that contains a specific
value.
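A few predicates can be tried with Python's standard xml.etree.ElementTree module, as in the sketch below. It assumes the two-book bookstore document from the XPath Syntax section. ElementTree implements positional, attribute, and child-text predicates but not numeric comparisons such as price>35, so that last case is filtered in plain Python here as a stand-in.

```python
import xml.etree.ElementTree as ET

BOOKSTORE = """<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE)

# book[1]: the first book child (XPath positions start at 1, not 0).
first = root.find("./book[1]/title").text

# book[last()]: the last book child.
last = root.find("./book[last()]/title").text

# title[@lang='en']: titles whose lang attribute has the value "en".
english = [t.text for t in root.findall(".//title[@lang='en']")]

# book[price='39.95']: books with a price child whose text is exactly 39.95.
by_price = [b.findtext("title") for b in root.findall("./book[price='39.95']")]

# Stand-in for book[price>35.00], which ElementTree cannot evaluate itself.
expensive = [
    b.findtext("title")
    for b in root.findall("./book")
    if float(b.findtext("price")) > 35.00
]

if __name__ == "__main__":
    print(first, last, english, by_price, expensive)
```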
In the table below we have listed some path expressions with predicates and the
result of the expressions:
SelectionLanguage to XPath:
In JavaScript:
xml.setProperty("SelectionLanguage","XPath");
Wildcard Description
In the table below we have listed some path expressions and the result of the
expressions:
/bookstore/* Selects all the child element nodes of the bookstore element
In the table below we have listed some path expressions and the result of the
expressions:
//book/title | //book/price    Selects all the title AND price elements of all book elements
//title | //price              Selects all the title AND price elements in the document
What is XQuery
XQuery is a functional language that is used to retrieve information stored in XML
format. XQuery can be used on XML documents, relational databases containing data in
XML formats, or XML Databases. XQuery 3.0 is a W3C recommendation from April 8,
2014.
XQuery is a standardized language for combining documents, databases, Web pages, and other
data sources. It is widely implemented and comparatively easy to learn. In many cases, XQuery can
replace proprietary middleware languages and Web application development languages, and it can
express in a few lines of code what would otherwise need a longer Java or C++ program. This makes
queries simpler to work with and easier to maintain.
Characteristics
Functional Language − XQuery is a functional language for retrieving and querying
XML-based data.
Analogous to SQL − XQuery is to XML what SQL is to relational databases.
XPath based − XQuery uses XPath expressions to navigate through XML
documents.
Widely accepted − XQuery is supported by most major databases.
W3C Standard − XQuery is a W3C standard.
Benefits of XQuery
Using XQuery, both hierarchical and tabular data can be retrieved.
XQuery can be used to query tree and graphical structures.
XQuery can be directly used to query webpages.
XQuery can be directly used to build webpages.
XQuery can be used to transform xml documents.
XQuery is ideal for XML-based databases and object-based databases. Object
databases are much more flexible and powerful than purely tabular databases.
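XQuery itself is not part of the Python standard library, so the sketch below only mirrors, step by step, what a typical FLWOR expression does. The XQuery text appears in the comment. The function name titles_over and the bookstore sample are assumptions for illustration, and the Python code is an analogy rather than an XQuery processor.

```python
import xml.etree.ElementTree as ET

BOOKSTORE = """<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

# XQuery being imitated (FLWOR: for, let, where, order by, return):
#   for $b in /bookstore/book
#   where $b/price > 30
#   order by $b/title
#   return data($b/title)


def titles_over(xml_text, min_price):
    root = ET.fromstring(xml_text)
    selected = [
        book
        for book in root.findall("./book")                  # for $b in ...
        if float(book.findtext("price")) > min_price        # where ...
    ]
    selected.sort(key=lambda book: book.findtext("title"))  # order by ...
    return [book.findtext("title") for book in selected]    # return ...


if __name__ == "__main__":
    print(titles_over(BOOKSTORE, 30))
```

The shape is close to SQL's SELECT ... WHERE ... ORDER BY, which is the sense in which XQuery is described as analogous to SQL.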
DATA WAREHOUSE
Sales
Marketing
HR
SCM, etc.
The data may pass through an operational data store or other transformations before it
is loaded into the DW system for information processing.
A Data Warehouse is used for reporting and analysis of information and
stores both historical and current data. The data in the DW system is used for
analytical reporting, which is later used by Business Analysts, Sales
Managers, or Knowledge Workers for decision-making.
In the above image, you can see that data comes from multiple
heterogeneous data sources to a Data Warehouse. Common data sources for a data
warehouse include −
Operational databases
SAP and non-SAP Applications
Flat Files (xls, csv, txt files)
Data in a data warehouse is accessed by BI (Business Intelligence) users for
Analytical Reporting, Data Mining, and Analysis. It is used by Business Users,
Sales Managers, and Analysts for decision making and to define future strategy.
Data Warehousing involves data cleaning, data integration, and data consolidations.
A Data Warehouse has a 3-layer architecture −
Data Source Layer
It defines how the data comes to a Data Warehouse. It involves various data
sources and operational transaction systems, flat files, applications, etc.
Integration Layer
It consists of the Operational Data Store and the staging area. The staging area is used to
perform data cleansing and data transformation and to load data from different sources
into the data warehouse. As multiple data sources are available for extraction in
different time zones, the staging area is used to store the data first and to apply
transformations to it later.
Presentation Layer
OLTP vs OLAP
Firstly, OLTP stands for Online Transaction Processing, while OLAP stands
for Online Analytical Processing
In an OLTP system, there are a large number of short online transactions such as
INSERT, UPDATE, and DELETE.
In an OLTP system, an effective measure is the processing time of short
transactions, which is very low. It controls data integrity in multi-access environments.
For an OLTP system, the number of transactions per second measures the
effectiveness. An OLTP system contains current and detailed data
and is maintained in schemas that follow the entity model (3NF).
For Example −
A Day-to-Day transaction system in a retail store, where the customer records are
inserted, updated and deleted on a daily basis. It provides faster query processing.
OLTP databases contain detailed and current data. The schema used to store OLTP
database is the Entity model.
In an OLAP system, there are fewer transactions than in a
transactional system. The queries executed are complex in nature and involve data
aggregations.
What is an Aggregation?
We save tables with aggregated data, such as yearly (1 row), quarterly (4 rows), or
monthly (12 rows). If someone then has to do a year-to-year comparison, only one
row per year has to be processed, whereas with an un-aggregated table all the
rows have to be compared. This is called Aggregation.
There are various Aggregation functions that can be used in an OLAP system like
Sum, Avg, Max, Min, etc.
For Example −
SELECT Avg(salary)
FROM employee
WHERE title = 'Programmer';
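The query above can be tried with Python's built-in sqlite3 module, as in the sketch below. The table contents and the summary table name salary_by_title are made up for illustration. The second statement shows the idea of saving aggregated data in its own table so that later comparisons read one row per group instead of every detail row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, title TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("Asha", "Programmer", 50000.0),   # hypothetical sample rows
        ("Ravi", "Programmer", 70000.0),
        ("Meena", "Analyst", 65000.0),
    ],
)

# The aggregate query from the text: many rows in, one value out.
avg_salary = conn.execute(
    "SELECT AVG(salary) FROM employee WHERE title = 'Programmer'"
).fetchone()[0]

# A pre-aggregated summary table: one row per title, computed once.
conn.execute(
    "CREATE TABLE salary_by_title AS "
    "SELECT title, AVG(salary) AS avg_salary, COUNT(*) AS headcount "
    "FROM employee GROUP BY title"
)
summary = conn.execute(
    "SELECT title, avg_salary, headcount FROM salary_by_title ORDER BY title"
).fetchall()

if __name__ == "__main__":
    print(avg_salary, summary)
```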
Key Differences
These are the major differences between an OLAP and an OLTP system.
Indexes − An OLTP system has only a few indexes, while an OLAP system
has many indexes for performance optimization.
Joins − In an OLTP system, data is normalized, so queries involve a large number of joins.
In an OLAP system, data is de-normalized, so there are fewer joins.
Aggregation − In an OLTP system, data is not aggregated, while an OLAP
database uses more aggregations.
Normalization − An OLTP system contains normalized data, whereas data is
not normalized in an OLAP system.
Data mart focuses on a single functional area and represents the simplest form of a
Data Warehouse. Consider a Data Warehouse that contains data for Sales,
Marketing, HR, and Finance. A Data mart focuses on a single functional area like
Sales or Marketing.
In the above image, you can see the difference between a Data Warehouse and a data
mart.
1110 25 2 125
1210 28 4 252
Dimensional and relational models each have their own way of storing
data, and each has specific advantages.
Dimension
Dimensions provide the context surrounding a business process event. In
simple terms, they give the who, what, and where of a fact. In the Sales business
process, for the fact quarterly sales number, dimensions would be
Attributes
The Attributes are the various characteristics of the dimension in dimensional
data modeling.
State
Country
Zipcode etc.
Attributes are used to search, filter, or classify facts. Dimension Tables contain
Attributes
Fact Table
A fact table is the primary table in dimensional modelling. It contains:
1. Measurements/facts
2. Foreign key to dimension table
Dimension Table
A dimension table contains dimensions of a fact.
They are joined to fact table via a foreign key.
Dimension tables are de-normalized tables.
The Dimension Attributes are the various columns in a dimension table
Dimensions offer descriptive characteristics of the facts with the help of
their attributes
There is no set limit on the number of dimensions
The dimension can also contain one or more hierarchical relationships
The model should describe the Why, How much, When/Where/Who and What
of your business process
To describe the business process, you can use plain text or use basic Business
Process Modelling Notation (BPMN) or Unified Modelling Language (UML).
Example of Grain:
The CEO at an MNC wants to find the sales for specific products in different
locations on a daily basis.
Example of Dimensions:
The CEO at an MNC wants to find the sales for specific products in different
locations on a daily basis.
Dimensions: Product, Location and Time
Attributes: For Product: Product key (Foreign Key), Name, Type, Specifications
Example of Facts:
The CEO at an MNC wants to find the sales for specific products in different
locations on a daily basis.
1. Star Schema
The fact table in a star schema is in third normal form, whereas the
dimension tables are de-normalized.
2. Snowflake Schema
Star Schema
Snowflake Schema
Galaxy Schema
In the following Star Schema example, the fact table is at the center which
contains keys to every dimension table like Dealer_ID, Model ID, Date_ID,
Product_ID, Branch_ID & other attributes like Units sold and revenue.
Example of Star Schema Diagram
Characteristics of Star Schema:
Every dimension in a star schema is represented by only one dimension table.
The dimension table should contain the set of attributes.
The dimension table is joined to the fact table using a foreign key.
The dimension tables are not joined to each other.
The fact table contains keys and measures.
The Star schema is easy to understand and provides optimal disk usage.
The dimension tables are not normalized. For instance, in the above
figure, Country_ID does not have Country lookup table as an OLTP
design would have.
The schema is widely supported by BI Tools
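The characteristics above can be seen in a small star schema built with Python's sqlite3 module. All table names, column names, and rows below are hypothetical and much simpler than the diagram's schema. The fact table holds only keys and measures, each dimension is a single table joined to the fact table by a foreign key, and the country is stored directly in the dealer dimension rather than in a lookup table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_dealer (
    dealer_id   INTEGER PRIMARY KEY,
    dealer_name TEXT,
    country     TEXT          -- de-normalized: no separate country table
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    year    INTEGER,
    quarter TEXT
);
CREATE TABLE fact_sales (     -- keys to every dimension, plus measures
    dealer_id  INTEGER REFERENCES dim_dealer(dealer_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units_sold INTEGER,
    revenue    REAL
);
INSERT INTO dim_dealer VALUES (1, 'North Motors', 'India'), (2, 'West Autos', 'US');
INSERT INTO dim_date VALUES (10, 2023, 'Q1'), (11, 2023, 'Q2');
INSERT INTO fact_sales VALUES (1, 10, 5, 500.0), (1, 11, 3, 300.0), (2, 10, 4, 800.0);
""")

# One join per dimension; the dimensions are never joined to each other.
rows = conn.execute("""
    SELECT d.country, t.quarter, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_dealer AS d ON d.dealer_id = f.dealer_id
    JOIN dim_date   AS t ON t.date_id = f.date_id
    GROUP BY d.country, t.quarter
    ORDER BY d.country, t.quarter
""").fetchall()

if __name__ == "__main__":
    print(rows)
```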
Example of Snowflake Schema
Characteristics of Snowflake Schema:
The main benefit of the snowflake schema is that it uses less disk space.
It is easier to implement when a dimension is added to the schema.
Query performance is reduced because of the multiple tables.
The primary challenge that you will face while using the snowflake
schema is that you need to perform more maintenance, because there are
more lookup tables.
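A minimal snowflake sketch in Python's sqlite3 module shows where the extra lookup tables and the extra joins come from. The table names, column names, and rows are hypothetical. Here the dealer dimension is normalized, so the country name lives in its own lookup table and is stored once per country instead of once per dealer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_country (    -- lookup table split out of the dimension
    country_id   INTEGER PRIMARY KEY,
    country_name TEXT
);
CREATE TABLE dim_dealer (
    dealer_id   INTEGER PRIMARY KEY,
    dealer_name TEXT,
    country_id  INTEGER REFERENCES dim_country(country_id)
);
CREATE TABLE fact_sales (
    dealer_id INTEGER REFERENCES dim_dealer(dealer_id),
    revenue   REAL
);
INSERT INTO dim_country VALUES (1, 'India'), (2, 'US');
INSERT INTO dim_dealer VALUES
    (1, 'North Motors', 1), (2, 'West Autos', 2), (3, 'East Cars', 1);
INSERT INTO fact_sales VALUES (1, 500.0), (2, 800.0), (3, 200.0);
""")

# Reaching the country name now takes two joins instead of one.
rows = conn.execute("""
    SELECT c.country_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_dealer  AS d ON d.dealer_id = f.dealer_id
    JOIN dim_country AS c ON c.country_id = d.country_id
    GROUP BY c.country_name
    ORDER BY c.country_name
""").fetchall()

if __name__ == "__main__":
    print(rows)
```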
1. Revenue
2. Product.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations −
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
By climbing up a concept hierarchy for a dimension
By dimension reduction
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following
ways −
By stepping down a concept hierarchy for a dimension
By introducing a new dimension
Slice
The slice operation selects one particular dimension from a given cube and provides a
new sub-cube. Consider the following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
It will form a new sub-cube by selecting one or more dimensions.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows
the pivot operation.
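The operations above can be sketched on a tiny cube held in a Python dictionary. The cell values, the city-to-country hierarchy, and the function names are all made up for illustration; a real OLAP server would run these operations over its own cube storage. Drill-down is not shown separately because it is the reverse of roll_up: it returns from the country level to the city-level cells of the original cube.

```python
from collections import defaultdict

# Hypothetical cube: (time, location, item) -> units sold.
CUBE = {
    ("Q1", "Toronto", "Mobile"): 10,
    ("Q1", "Vancouver", "Mobile"): 20,
    ("Q1", "Chicago", "Mobile"): 30,
    ("Q2", "Toronto", "Modem"): 40,
    ("Q2", "Chicago", "Mobile"): 50,
}
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}


def roll_up(cube):
    """Climb the location hierarchy from city to country, summing the measure."""
    rolled = defaultdict(int)
    for (time, city, item), units in cube.items():
        rolled[(time, CITY_TO_COUNTRY[city], item)] += units
    return dict(rolled)


def slice_cube(cube, time):
    """Fix one value of the time dimension; the sub-cube has two dimensions."""
    return {(loc, item): u for (t, loc, item), u in cube.items() if t == time}


def dice(cube, times, locations, items):
    """Keep the cells that match selection criteria on all three dimensions."""
    return {
        key: u
        for key, u in cube.items()
        if key[0] in times and key[1] in locations and key[2] in items
    }


def pivot(table):
    """Rotate a two-dimensional view by swapping its axes."""
    return {(b, a): u for (a, b), u in table.items()}


if __name__ == "__main__":
    print(roll_up(CUBE))
    print(slice_cube(CUBE, "Q1"))
    print(dice(CUBE, {"Q1", "Q2"}, {"Toronto", "Vancouver"}, {"Mobile"}))
    print(pivot(slice_cube(CUBE, "Q1")))
```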
OLAP QUERIES
Online Analytical Processing (OLAP) databases facilitate business-intelligence queries.
OLAP is a database technology that has been optimized for querying and reporting,
instead of processing transactions. ... OLAP data is also organized hierarchically and
stored in cubes instead of tables.
Suppose you have to run an OLAP query that computes the sum of Sales in a table, with a
WHERE clause of Country = 'US'.
SELECT SUM(Sales) FROM FCT_SALES WHERE Country = 'US';
If the storage type is column-based, all the values for Sales are stored together in
adjacent memory cells, so an aggregation such as SUM is much faster than the same
query on OLTP-style storage.
If the table uses row-based storage, values of different data types are stored
together in each row, so a SUM aggregation has to read past the other columns to find
the values for the Sales column.
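The difference between the two layouts can be pictured with plain Python data structures. This is a conceptual sketch only: the sample data and function names are made up, and Python lists and dictionaries do not model how a real storage engine lays out memory or disk pages.

```python
# Row-based layout: each record keeps all of its columns together.
ROWS = [
    {"order_id": 1, "country": "US", "sales": 100.0},
    {"order_id": 2, "country": "IN", "sales": 250.0},
    {"order_id": 3, "country": "US", "sales": 75.0},
]

# Column-based layout: one list per column, aligned by position.
COLUMNS = {
    "order_id": [1, 2, 3],
    "country": ["US", "IN", "US"],
    "sales": [100.0, 250.0, 75.0],
}


def sum_sales_row_store(rows, country):
    # Every full record is visited, including columns the query never uses.
    return sum(row["sales"] for row in rows if row["country"] == country)


def sum_sales_column_store(columns, country):
    # Only the two columns named in the query are read.
    pairs = zip(columns["country"], columns["sales"])
    return sum(sales for c, sales in pairs if c == country)


if __name__ == "__main__":
    print(sum_sales_row_store(ROWS, "US"), sum_sales_column_store(COLUMNS, "US"))
```

Both functions return the same total; the point is how much unrelated data each layout has to touch to compute it.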
MULTIDIMENSIONAL DATA MODELING
Types of Dimensions in Data Warehouse
Following are the Types of Dimensions in Data Warehouse:
Conformed Dimension
Outrigger Dimension
Shrunken Dimension
Role-playing Dimension
Dimension to Dimension Table
Junk Dimension
Degenerate Dimension
Swappable Dimension
Step Dimension
Step 1) Identify the Business Process
Identify the actual business process a data warehouse should cover. This could
be Marketing, Sales, HR, etc., as per the data analysis needs of the
organization. The selection of the business process also depends on the
quality of data available for that process. It is the most important step of the
data modelling process, and a failure here would have cascading and
irreparable defects.
What is a Star Schema?
In a star schema, the center of the star has one fact table, with a number of
associated dimension tables around it. It is known as a star
schema because its structure resembles a star. The star schema data model is the
simplest type of Data Warehouse schema. It is also known as the Star Join
Schema and is optimized for querying large data sets.