Sec A and B DWDM
SECTION – A
What is a Data Warehouse?
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction
processing. It includes historical data derived from transaction data from single or multiple sources. A Data
Warehouse provides integrated, enterprise-wide, historical data and focuses on supporting
decision-makers in data modeling and analysis. A Data Warehouse is a collection of data specific to the
entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing but for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in
support of management's decisions."
Subject-Oriented
A data warehouse focuses on the modeling and analysis of data for decision-makers. Therefore, data
warehouses typically provide a concise and straightforward view around a particular subject, such as
customer, product, or sales, instead of the organization's ongoing operations as a whole. This is done by
excluding data that are not useful concerning the subject and including all data needed by the users
to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and online
transaction records. Data cleaning and integration must be performed during data warehousing
to ensure consistency in naming conventions, attribute types, etc., among the different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months,
6 months, 12 months, or even earlier from a data warehouse. This contrasts with a
transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e., update,
insert, and delete operations are not performed. It usually requires only two procedures in data
accessing: initial loading of data and access to data. Therefore, the DW does not require transaction
processing, recovery, and concurrency capabilities, which allows for substantial speedup of data
retrieval. Non-volatile means that once data has entered the warehouse, it should not change.
Data Warehouse Usage:-
1. Data warehouses and data marts are used in a wide range of applications.
2. Business executives use the data in data warehouses and data marts to perform data
analysis and make strategic decisions.
3. In many areas, data warehouses are used as an integral part of enterprise management.
4. The data warehouse is mainly used for generating reports and answering predefined queries.
5. It is used to analyze summarized and detailed data, where the results are presented in the
form of reports and charts.
6. Later, the data warehouse is used for strategic purposes, performing multidimensional
analysis and sophisticated operations.
7. Finally, the data warehouse may be employed for knowledge discovery and strategic
decision making using data mining tools.
8. In this context, the tools for data warehousing can be categorized into access and retrieval
tools, database reporting tools, data analysis tools, and data mining tools.
What is Data Mart?
A Data Mart is a subset of an organizational data store, generally oriented to a specific
purpose or primary data subject, which may be distributed to meet business needs. Data
Marts are analytical data stores designed to focus on particular business functions for a
specific community within an organization. Data marts are derived from subsets of data in a
data warehouse, though in the bottom-up data warehouse design methodology, the data
warehouse is created from the union of organizational data marts.
The fundamental use of a data mart is in Business Intelligence (BI) applications. BI is used to
gather, store, access, and analyze data. A data mart can be used by smaller businesses to utilize the
data they have accumulated, since it is less expensive than implementing a data warehouse.
There are mainly two approaches to designing data marts. These approaches are
Metadata is used for building, maintaining, managing, and using the data warehouse.
Metadata helps users understand the content and find data.
o First, it acts as the glue that links all parts of the data warehouses.
o Next, it provides information about the contents and structures to the developers.
o Finally, it opens the doors to the end-users and makes the contents recognizable in
their terms.
Metadata is like a nerve center. Various processes during the building and administering of
the data warehouse generate parts of the data warehouse metadata. Parts of the metadata
generated by one process are used by another. In the data warehouse, metadata assumes a key position
and enables communication among the various processes. It acts as a nerve center in the data
warehouse.
The general idea of this approach is to materialize certain expensive computations that are
frequently queried.
For example, a relation with the schema sales(part, supplier, customer, sale-price) can be
materialized into a set of eight views as shown in the figure, where psc indicates a view consisting of
aggregate function values (such as total-sales) computed by grouping on the three attributes part,
supplier, and customer, p indicates a view composed of the corresponding aggregate function
values calculated by grouping on part alone, etc.
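The eight views correspond to the 2^3 subsets of the grouping attributes (part, supplier, customer). The following is a minimal sketch of that idea, not a definitive implementation: the sample rows, the view name "none" for the empty grouping, and the function name materialize_views are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows for the relation sales(part, supplier, customer, sale_price).
SALES = [
    ("p1", "s1", "c1", 100),
    ("p1", "s2", "c1", 150),
    ("p2", "s1", "c2", 200),
    ("p2", "s1", "c1", 50),
]
DIMENSIONS = ("part", "supplier", "customer")


def materialize_views(rows):
    """Compute one total-sales view for every subset of the grouping attributes."""
    views = {}
    for size in range(len(DIMENSIONS) + 1):
        for idx in combinations(range(len(DIMENSIONS)), size):
            # View name built from attribute initials, e.g. "psc", "ps", "p".
            name = "".join(DIMENSIONS[i][0] for i in idx) or "none"
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in idx)
                totals[key] += row[3]  # aggregate the sale price
            views[name] = dict(totals)
    return views


views = materialize_views(SALES)
print(sorted(views))   # names of the materialized views
print(views["p"])      # total sales grouped by part alone
```

Once the views are stored, a query such as "total sales per part" reads the small view p instead of scanning the base relation again.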
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to
be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected
as dimensions or functional attributes. The measure attributes are aggregated according to the
dimensions. For example, XYZ may create a sales data warehouse to keep records of the store's sales
for the dimensions time, item, branch, and location. These dimensions enable the store to keep track
of things like monthly sales of items, and the branches and locations at which the items were sold.
Each dimension may have a table associated with it, known as a dimension table, which describes the
dimension. For example, a dimension table for items may contain the attributes item_name, brand,
and type.
The data cube method is an interesting technique with many applications. Data cubes can be sparse in many cases
because not every cell in each dimension may have corresponding data in the database. Techniques should be
developed to handle sparse cubes efficiently. If a query contains constants at even lower levels than those provided
in a data cube, it is not clear how to make the best use of the precomputed results stored in the data cube.
The multidimensional data model views data in the form of a data cube. OLAP tools are based on the
multidimensional data model. Data cubes usually model n-dimensional data. A data cube enables data to be
modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme,
like sales or transactions. A fact table represents this theme. Facts are numerical measures. Thus, the fact table
contains measures (such as Rs_sold) and keys to each of the related dimension tables. Dimensions are the
perspectives that define a data cube. Facts are generally quantities, which are used for analyzing the
relationships between dimensions.
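One common way to cope with sparsity is to store only the cells that actually hold data. The sketch below illustrates this under stated assumptions: the dimension members, the fact rows, and the helper name cell are hypothetical, and a dictionary keyed by coordinates is only one of several possible sparse representations.

```python
from collections import defaultdict

# Hypothetical dimension members.
TIMES = ["Q1", "Q2", "Q3", "Q4"]
ITEMS = ["phone", "laptop", "tablet"]
LOCATIONS = ["Delhi", "Mumbai"]

# Hypothetical fact rows: (time, item, location, rs_sold).
FACTS = [
    ("Q1", "phone", "Delhi", 400),
    ("Q1", "laptop", "Mumbai", 900),
    ("Q2", "phone", "Delhi", 350),
]

# Sparse cube: a cell exists only if at least one fact falls into it.
cube = defaultdict(int)
for time, item, location, rs_sold in FACTS:
    cube[(time, item, location)] += rs_sold

total_cells = len(TIMES) * len(ITEMS) * len(LOCATIONS)
density = len(cube) / total_cells  # fraction of cells that hold data


def cell(time, item, location):
    """Read a cell; empty cells read as 0 and are not stored."""
    return cube.get((time, item, location), 0)


print(f"{len(cube)} of {total_cells} cells populated")
print(cell("Q3", "tablet", "Delhi"))
```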
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to dimensions. A fact table has two types
of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary
key of a fact table is generally a composite key that is made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that
contain aggregated facts are often called summary tables instead). A fact table generally contains facts
with the same level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a
dimension has no hierarchies and levels, it is called a flat dimension or list. The primary key of
each dimension table is part of the composite primary key of the fact table. Dimensional
attributes help to define the dimensional value. They are generally descriptive, textual values.
Dimension tables are usually smaller than fact tables.
For example, fact tables store data about sales, while dimension tables store data about geographic regions (markets,
cities), clients, products, times, and channels.
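The relationship between a fact table and its dimension tables can be sketched with Python's built-in sqlite3 module. This is an illustrative sketch only: the table names (fact_sales, dim_item, dim_location), the column names, and the sample rows are assumptions, not taken from any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small, descriptive, one primary key each.
CREATE TABLE dim_item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY, city TEXT);

-- Fact table: foreign keys to each dimension plus a numeric measure.
-- Its primary key is the composite of all its foreign keys.
CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    rs_sold REAL,
    PRIMARY KEY (item_key, location_key));

INSERT INTO dim_item VALUES
    (1, 'phone', 'Acme', 'electronics'), (2, 'desk', 'Oak', 'furniture');
INSERT INTO dim_location VALUES (1, 'Delhi'), (2, 'Mumbai');
INSERT INTO fact_sales VALUES (1, 1, 400), (1, 2, 250), (2, 1, 900);
""")

# A typical star join: aggregate the measure by dimension attributes.
rows = conn.execute("""
    SELECT i.type, l.city, SUM(f.rs_sold)
    FROM fact_sales AS f
    JOIN dim_item AS i ON i.item_key = f.item_key
    JOIN dim_location AS l ON l.location_key = f.location_key
    GROUP BY i.type, l.city
    ORDER BY i.type, l.city
""").fetchall()

for row in rows:
    print(row)
```

Each query touches the fact table once and joins directly to the dimension tables it needs, which is the "star" shape.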
Characteristics of Star Schema
The star schema is well suited to data warehouse database design because of the following
features:
o It creates a denormalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the development cycle,
and as the database grows.
o It provides a parallel in design to how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.
Advantages of Star Schema
A snowflake schema is designed for flexible querying across more complex dimensions and relationships.
It is suitable for many-to-many and one-to-many relationships between dimension levels.
Advantages of Snowflake Schema
1. The primary advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and components.
3. No redundancy, so it is easier to maintain.
Disadvantages of Snowflake Schema
1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due
to the increasing number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution time.
Fact Constellation Schema describes a logical structure of a data warehouse or data mart. A Fact
Constellation Schema can be designed with a collection of denormalized fact tables and shared, conformed
dimension tables.
Fact Constellation Schema is a sophisticated database design in which it is difficult to summarize information.
A Fact Constellation Schema can be implemented between aggregate fact tables, or by decomposing a complex fact
table into independent simple fact tables.
The primary disadvantage of the fact constellation schema is that it is a more challenging design
because many variants for specific kinds of aggregation must be considered and selected.
In this architecture, the data is collected into a single centralized store and then processed
by a single machine with very large memory, processor, and storage capacity.
Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service.
OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries
are often very complex and involve aggregations. For OLAP operations, response time is an effectiveness
measure. OLAP applications are widely used by data mining techniques. An OLAP database holds
aggregated, historical information, stored in multidimensional schemas (generally a star schema).
ROLAP | MOLAP
ROLAP stands for Relational Online Analytical Processing. | MOLAP stands for Multidimensional Online Analytical Processing.
It is usually used when the data warehouse contains relational data. | It is used when the data warehouse contains relational as well as non-relational data.
It has a high response time. | It has a lower response time due to prefabricated cubes.
HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional techniques.
ROLAP servers are intermediate servers which stand between a relational back-end server and user front-end
tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP
middleware to provide the missing pieces. ROLAP servers contain optimization for each DBMS back end,
implementation of aggregation navigation logic, and additional tools and services. ROLAP technology
tends to have higher scalability than MOLAP technology. ROLAP systems work primarily from the data that
resides in a relational database, where the base data and dimension tables are stored as relational tables.
This model permits the multidimensional analysis of data. This technique relies on manipulating the data
stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality.
In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL
statement.
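The equivalence between slicing and dicing and a "WHERE" clause can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the sales table, its columns, the sample rows, and the helper total_sales are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (quarter TEXT, item TEXT, city TEXT, rs_sold INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Q1", "phone", "Delhi", 400),
        ("Q1", "laptop", "Delhi", 900),
        ("Q2", "phone", "Mumbai", 350),
        ("Q2", "laptop", "Delhi", 700),
    ],
)


def total_sales(where="", params=()):
    """Aggregate rs_sold, optionally restricted by a WHERE clause."""
    sql = "SELECT COALESCE(SUM(rs_sold), 0) FROM sales"
    if where:
        sql += " WHERE " + where
    return conn.execute(sql, params).fetchone()[0]


# No restriction: the whole cube.
whole_cube = total_sales()
# Slice: fix one dimension to a single value.
slice_q1 = total_sales("quarter = ?", ("Q1",))
# Dice: restrict two or more dimensions at once.
dice = total_sales("quarter IN (?, ?) AND city = ?", ("Q1", "Q2", "Delhi"))

print(whole_cube, slice_q1, dice)
```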
ROLAP Architecture includes the following components
o Database server.
o ROLAP server.
o Front-end tool.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology depends on
the data size limitation of the underlying RDBMS. So, ROLAP itself does not restrict the data amount.
Can leverage RDBMS functionality: An RDBMS already comes with a lot of features. So ROLAP technologies,
which work on top of the RDBMS, can leverage these functionalities.
Disadvantages
Performance can be slow: Because each ROLAP report is a SQL query (or multiple SQL queries) against the relational
database, the query time can be long if the underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies on generating SQL statements to query
the relational database, and SQL statements do not suit all needs.
A MOLAP system is based on a native logical model that directly supports multidimensional data and
operations. Data are stored physically into multidimensional arrays, and positional techniques are used to
access them.
One of the significant distinctions between MOLAP and ROLAP is that data are summarized and stored
in an optimized format in a multidimensional cube, instead of in a relational database. In the MOLAP model,
data are structured into proprietary formats according to the client's reporting requirements, with the calculations
pre-generated on the cubes.
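Storage in multidimensional arrays with positional access, together with pre-generated calculations, can be sketched as follows. This is an illustration only, not how any specific MOLAP product stores data: the dimension members, the fact rows, and the names cell and quarter_totals are assumptions.

```python
# Hypothetical dimension members; each member's list index is its position.
QUARTERS = ["Q1", "Q2"]
ITEMS = ["phone", "laptop"]
CITIES = ["Delhi", "Mumbai"]

q_pos = {member: i for i, member in enumerate(QUARTERS)}
i_pos = {member: i for i, member in enumerate(ITEMS)}
c_pos = {member: i for i, member in enumerate(CITIES)}

# Dense 3-D array: cube[quarter][item][city] holds the rs_sold measure.
cube = [[[0 for _ in CITIES] for _ in ITEMS] for _ in QUARTERS]

# Hypothetical fact rows loaded into the array by position.
FACTS = [
    ("Q1", "phone", "Delhi", 400),
    ("Q1", "laptop", "Delhi", 900),
    ("Q2", "phone", "Mumbai", 350),
    ("Q2", "laptop", "Delhi", 700),
]
for quarter, item, city, rs_sold in FACTS:
    cube[q_pos[quarter]][i_pos[item]][c_pos[city]] += rs_sold

# Aggregates are computed once at build time and stored with the cube.
quarter_totals = [sum(sum(row) for row in plane) for plane in cube]


def cell(quarter, item, city):
    """Positional lookup: three index computations, no table scan."""
    return cube[q_pos[quarter]][i_pos[item]][c_pos[city]]


print(cell("Q1", "laptop", "Delhi"))
print(quarter_totals[q_pos["Q2"]])
```

A lookup costs only index arithmetic, which is why retrieval is fast, but every combination of members occupies a slot whether or not it holds data.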
MOLAP Architecture
MOLAP Architecture includes the following components
o Database server.
o MOLAP server.
o Front-end tool.
o MOLAP structure primarily reads the precompiled data. MOLAP structure has limited capabilities to
dynamically create aggregations or to evaluate results which have not been pre-calculated and stored.
o Applications requiring iterative and comprehensive time-series analysis of trends are well suited for MOLAP
technology (e.g., financial analysis and budgeting).
o Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship Server,
Sinper's TM/1, Planning Science's Gentium, and Kenan Technology's Multiway.
o Some of the problems faced by clients are related to maintaining support for multiple subject areas in an
RDBMS. Some vendors can solve these problems by maintaining access from MOLAP tools to detailed data
in an RDBMS.
o This can be very useful for organizations with performance-sensitive multidimensional analysis
requirements and that have built or are in the process of building a data warehouse architecture that
contains multiple subject areas.
o An example would be the creation of sales data measured by several dimensions (e.g., product and sales
region) to be stored and maintained in a persistent structure. This structure would be provided to reduce
the application overhead of performing calculations and building aggregation during initialization. These
structures can be automatically refreshed at predetermined intervals established by an administrator.
Advantages
o Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal for slicing and
dicing operations.
o Can perform complex calculations: All calculations have been pre-generated when the cube is created.
Hence, complex calculations are not only possible, but they return quickly.
Disadvantages
o Limited in the amount of information it can handle: Because all calculations are performed when the
cube is built, it is not possible to contain a large amount of data in the cube itself.
o Requires additional investment: Cube technology is generally proprietary and does not already exist in
the organization. Therefore, to adopt MOLAP technology, chances are other investments in human and
capital resources are needed.
HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP systems
store larger quantities of detailed data in the relational tables, while the aggregations are stored
in the pre-calculated cubes. HOLAP can also drill through from the cube down to the relational tables for
detailed data. Microsoft SQL Server 2000 provides a hybrid OLAP server.
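The split between pre-calculated aggregates and relational detail, and the drill-through between them, can be sketched as below. This is a simplified illustration under stated assumptions, not the design of any real HOLAP server: the sales_detail table, the aggregate_cube dictionary, and both helper functions are hypothetical.

```python
import sqlite3

# Relational side: detail records stay in ordinary tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (city TEXT, item TEXT, rs_sold INTEGER)")
conn.executemany(
    "INSERT INTO sales_detail VALUES (?, ?, ?)",
    [
        ("Delhi", "phone", 400),
        ("Delhi", "laptop", 900),
        ("Mumbai", "phone", 350),
    ],
)

# Cube side: only the aggregates are pre-calculated and kept in memory.
aggregate_cube = dict(
    conn.execute("SELECT city, SUM(rs_sold) FROM sales_detail GROUP BY city")
)


def city_total(city):
    """Answer a summary query from the pre-calculated cube."""
    return aggregate_cube.get(city, 0)


def drill_through(city):
    """Fetch the detail rows behind an aggregate from the relational table."""
    return conn.execute(
        "SELECT item, rs_sold FROM sales_detail WHERE city = ? ORDER BY item",
        (city,),
    ).fetchall()


print(city_total("Delhi"))
print(drill_through("Delhi"))
```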
Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP and ROLAP.
2. It provides fast access at all levels of aggregation.
3. HOLAP balances the disk space requirement, as it only stores the aggregate information on the
OLAP server while the detail records remain in the relational database. So no duplicate copy of the
detail records is maintained.
Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP servers.
Other Types
There are also less popular styles of OLAP that one may come across every so often. We
have listed some of the less popular styles existing in the OLAP industry.
The bottom tier consists of the Data Warehouse server, which is almost always an RDBMS. It may
include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by external
consultants) are extracted using application program interfaces called gateways. A gateway is provided
by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object Linking and
Embedding for Databases), by Microsoft, and JDBC (Java Database Connectivity).
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of
third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager
vary between specific solutions.
Warehouse Manager Architecture
A warehouse manager includes the following −
• The controlling process
• Stored procedures or C with SQL
• Backup/Recovery tool
• SQL scripts
t1 1 1 1 0 0
t2 0 1 1 1 0
t3 0 0 0 1 1
t4 1 1 0 1 0
t5 1 1 1 0 1
t6 1 1 1 1 1