DWDM Unit III
III CSE DWDM -III

UNIT-III

DATA WAREHOUSE AND OLAP TECHNOLOGY

3.1 What is a data warehouse?

3.2 A Multidimensional Data Model

3.3 Data Warehouse Architecture

3.4 Data warehouse Implementation

3.5 From Data warehouse to Data mining

3.1 What is a data warehouse

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process. Data warehousing is the process of
constructing and using data warehouses.

Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier,
product, and sales. Rather than concentrating on the day-to-day operations and transaction processing
of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.

Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources,
such as relational databases, flat files, and on-line transaction records. Data cleaning and data
integration techniques are applied to ensure consistency in naming conventions, encoding structures,
attribute measures, and so on.

Dr. K. M. Rayudu, Professor, Dept. of CSE

Time-variant: Data are stored to provide information from a historical perspective (e.g., the past
5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an
element of time.

Nonvolatile: A data warehouse is always a physically separate store of data transformed from
the application data found in the operational environment. Due to this separation, a data warehouse
does not require transaction processing, recovery, and concurrency control mechanisms. It usually
requires only two operations in data accessing: initial loading of data and access of data.

3.1.1 Differences between Operational Database Systems and Data Warehouses

We can better understand what a data warehouse is by comparing these two kinds of systems.

 The major task of on-line operational database systems is to perform on-line transaction and
query processing. These systems are called on-line transaction processing (OLTP) systems.

 They cover most of the day-to-day operations of an organization, such as purchasing, inventory,
manufacturing, banking, payroll, registration, and accounting.

 Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data
analysis and decision making. Such systems can organize and present data in various formats in order
to accommodate the diverse needs of the different users. These systems are known as on-line
analytical processing (OLAP) systems.

The major distinguishing features between OLTP and OLAP are summarized as follows:

Feature                     OLTP                                  OLAP

Characteristic              operational processing                informational processing
Orientation                 transaction                           analysis
User                        clerk, DBA, database professional     knowledge worker (e.g., manager, executive, analyst)
Function                    day-to-day operations                 long-term informational requirements, decision support
DB design                   ER based, application-oriented        star/snowflake, subject-oriented
Data                        current; guaranteed up-to-date        historical; accuracy maintained over time
Summarization               primitive, highly detailed            summarized, consolidated
View                        detailed, flat relational             summarized, multidimensional
Unit of work                short, simple transaction             complex query
Access                      read/write                            mostly read
Focus                       data in                               information out
Operations                  index/hash on primary key             lots of scans
Number of records accessed  tens                                  millions
Number of users             thousands                             hundreds
DB size                     100 MB to GB                          100 GB to TB
Priority                    high performance, high availability   high flexibility, end-user autonomy
Metric                      transaction throughput                query throughput, response time

3.2 MULTIDIMENSIONAL DATA MODEL

A data warehouse is based on a multidimensional data model, which views data in the form of a
data cube.
3.2.1 From tables and spreadsheets to data cubes
 A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
 Dimensions are perspectives or entities with respect to which an organization wants to keep
records such as time, item, branch, location etc.
 A dimension table, such as item (item name, brand, type) or time (day, week, month, quarter, year),
further describes a dimension.
 The fact table contains measures (such as dollars_sold) and keys to each of the related dimension
tables.
 In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid,
which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
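
The lattice of cuboids can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fact rows, their values, and the `cuboid` helper are hypothetical, and concept hierarchies within each dimension are ignored, so n dimensions give 2^n cuboids.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (time, item, location, dollars_sold).
DIMENSIONS = ("time", "item", "location")
FACTS = [
    ("Q1", "computer", "Toronto", 100),
    ("Q1", "phone", "Toronto", 50),
    ("Q2", "computer", "Vancouver", 70),
    ("Q2", "phone", "Toronto", 30),
]


def cuboid(group_by):
    """Sum dollars_sold over the dimensions named in group_by."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in FACTS:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)


# The lattice: one cuboid per subset of the dimensions.
lattice = {
    dims: cuboid(dims)
    for k in range(len(DIMENSIONS) + 1)
    for dims in combinations(DIMENSIONS, k)
}

base = lattice[DIMENSIONS]  # 3-D base cuboid: no summarization
apex = lattice[()]          # 0-D apex cuboid: the grand total
```

Here `base` keeps every fact cell, while `apex` holds a single total; every other entry of `lattice` is an intermediate level of summarization.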

 A 3-D view of the sales data for the data warehouse, according to the dimensions time, item, and
location. The measure displayed is dollars sold (in thousands).

A 3-D data cube representation of the data in the table above.

If we continue in this way, we may display any n-D data as a series of (n − 1)-D “cubes.” The data
cube is a metaphor for multidimensional data storage. The actual physical storage of such data may
differ from its logical representation. The important thing to remember is that data cubes are n-
dimensional and do not confine data to 3-D.

A 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier.


Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier. Each
cuboid represents a different degree of summarization.

3.2.2 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
 The entity-relationship data model is commonly used in the design of relational databases,
where a database schema consists of a set of entities and the relationships between them. Such
a data model is appropriate for on-line transaction processing.
 A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line
data analysis.
 The most popular data model for a data warehouse is a multidimensional model, which can exist
in the form of a star schema, a snowflake schema, or a fact constellation schema.

Star schema:
 The most common modeling paradigm is the star schema, in which the data warehouse contains
(1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2)
a set of smaller attendant tables (dimension tables), one for each dimension.
 The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern
around the central fact table.
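
A star schema can be tried out with Python's built-in `sqlite3` module. The sketch below is an assumption-laden illustration, not a schema taken from these notes: the table names (`sales_fact`, `time_dim`, `item_dim`, `location_dim`), columns, and rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Central fact table: one key per dimension plus the measure.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (1, 'laptop', 'BrandA', 'computer'),
                            (2, 'tv', 'BrandB', 'home entertainment');
INSERT INTO location_dim VALUES (1, 'Toronto', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO sales_fact VALUES (1, 1, 1, 100.0), (1, 2, 1, 50.0),
                              (2, 1, 2, 70.0), (2, 2, 1, 30.0);
""")

# A typical analytical query: join the fact table to two of its dimensions.
rows = conn.execute("""
    SELECT l.country, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN location_dim AS l ON f.location_key = l.location_key
    GROUP BY l.country, t.quarter
    ORDER BY l.country, t.quarter
""").fetchall()
```

Each dimension is reached from the fact table by a single join, which is what gives the schema graph its radial shape.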


Snowflake schema:
 The snowflake schema is a variant of the star schema model, where some dimension tables are
normalized, thereby further splitting the data into additional tables.
 The resulting schema graph forms a shape similar to a snowflake.
 The major difference between the snowflake and star schema models is that the dimension tables
of the snowflake model may be kept in normalized form to reduce redundancies.
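
The effect of normalizing a dimension table can be illustrated as follows. This is a hedged sketch: the item dimension, its supplier attributes, and the helper names `snowflake` and `supplier_type_of` are assumptions made for the example.

```python
# Star-style dimension: supplier attributes are repeated on every item row.
item_star = [
    {"item_key": 1, "item_name": "laptop", "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 2, "item_name": "tablet", "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 3, "item_name": "tv", "supplier_key": 20, "supplier_type": "retail"},
]


def snowflake(rows):
    """Split the dimension into an item table and a separate supplier table."""
    items = [
        {"item_key": r["item_key"], "item_name": r["item_name"],
         "supplier_key": r["supplier_key"]}
        for r in rows
    ]
    suppliers = {r["supplier_key"]: {"supplier_type": r["supplier_type"]} for r in rows}
    return items, suppliers


item_table, supplier_table = snowflake(item_star)


def supplier_type_of(item_key):
    """The same question now needs one extra lookup (a join)."""
    item = next(r for r in item_table if r["item_key"] == item_key)
    return supplier_table[item["supplier_key"]]["supplier_type"]
```

The supplier type is now stored once per supplier instead of once per item, at the cost of an additional join when browsing the dimension.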

Fact constellation:
 Sophisticated applications may require multiple fact tables to share dimension tables.
 This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema
or a fact constellation.

Fact constellation schema of a sales and shipping data warehouse.

3.2.3 Measures: three categories


 Distributive: an aggregate function is distributive if the result derived by applying the function
to n aggregate values (one per partition) is the same as that derived by applying the function on all
the data without partitioning.
E.g., count(), sum(), min(), max().
 Algebraic: an aggregate function is algebraic if it can be computed by an algebraic function with M
arguments (where M is a bounded integer), each of which is obtained by applying a distributive
aggregate function.
E.g., avg(), min_N(), standard_deviation().
 Holistic: an aggregate function is holistic if there is no constant bound on the storage size needed
to describe a subaggregate.
E.g., median(), mode(), rank().
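
The three categories can be checked on a small example. The partitions below are made-up numbers used only for illustration.

```python
from statistics import median

# Hypothetical data split into three partitions.
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]
all_values = [v for part in partitions for v in part]

# Distributive: combining per-partition sums (or counts) gives the overall result.
partial_sums = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]
total_sum = sum(partial_sums)
total_count = sum(partial_counts)

# Algebraic: avg() is a function of two distributive results, sum() and count().
average = total_sum / total_count

# Holistic: per-partition medians are not enough to recover the overall median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)
```

In this data set the median of the partition medians differs from the true median, which is why a holistic measure cannot in general be assembled from fixed-size subaggregates.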

3.2.4 A Concept Hierarchy


A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level,
more general concepts.


A concept hierarchy for location. Due to space limitations, not all of the hierarchy nodes are
shown, indicated by ellipses between nodes.

Hierarchical and lattice structures of attributes in warehouse dimensions: (a) a hierarchy for
location and (b) a lattice for time.

A concept hierarchy for price.

 Concept hierarchies may also be defined by discretizing or grouping values for a given dimension
or attribute, resulting in a set-grouping hierarchy.
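
Both kinds of hierarchy can be represented as simple mappings. The sketch below assumes a tiny, hypothetical location hierarchy (city < province_or_state < country) and a price grouping with an arbitrary bucket width of 100; none of these values come from the notes.

```python
# Location hierarchy expressed as two mapping tables (hypothetical entries).
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
}
PROVINCE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
}
LEVELS = ("city", "province_or_state", "country")


def generalize_city(city, level):
    """Map a city up the hierarchy to the requested level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    value = city
    if level in ("province_or_state", "country"):
        value = CITY_TO_PROVINCE[value]
    if level == "country":
        value = PROVINCE_TO_COUNTRY[value]
    return value


def price_range(price, width=100):
    """Set-grouping hierarchy: discretize a price into a half-open range."""
    low = int(price // width) * width
    return f"[{low}, {low + width})"
```

For example, `generalize_city("Toronto", "country")` climbs two levels, while `price_range(250)` places a value into its group.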

3.2.5 OLAP Operations in the Multidimensional Data Model

 In the multidimensional model, data are organized into multiple dimensions, and each dimension
contains multiple levels of abstraction defined by concept hierarchies.
 This organization provides users with the flexibility to view data from different perspectives.
 A number of OLAP data cube operations exist to materialize these different views, allowing
interactive querying and analysis of the data at hand.
 Hence, OLAP provides a user-friendly environment for interactive data analysis.

Roll-up:
 The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on
a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
 Example: consider a roll-up operation performed on the central cube by climbing up the
concept hierarchy for location given in the figure.
 This hierarchy was defined as the total order “street < city < province or state < country.”
 The roll-up operation shown aggregates the data by ascending the location hierarchy from the level
of city to the level of country.
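
Both forms of roll-up can be sketched on a small 2-D cube. The cell values, the city-to-country mapping, and the function names are hypothetical.

```python
from collections import defaultdict

CITY_TO_COUNTRY = {
    "Toronto": "Canada", "Vancouver": "Canada",
    "Chicago": "USA", "New York": "USA",
}

# Hypothetical cube cells: (quarter, city) -> dollars_sold.
cube = {
    ("Q1", "Toronto"): 10,
    ("Q1", "Vancouver"): 20,
    ("Q1", "Chicago"): 30,
    ("Q2", "Toronto"): 5,
    ("Q2", "New York"): 15,
}


def roll_up_to_country(cells):
    """Climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, city), value in cells.items():
        out[(quarter, CITY_TO_COUNTRY[city])] += value
    return dict(out)


def roll_up_drop_location(cells):
    """Dimension reduction: remove the location dimension entirely."""
    out = defaultdict(int)
    for (quarter, _city), value in cells.items():
        out[quarter] += value
    return dict(out)
```

The first function keeps the location dimension at a coarser level; the second removes it, leaving totals by time only.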

Drill-down:
 Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data.
 Drill-down can be realized by either stepping down a concept hierarchy for a dimension or
introducing additional dimensions.
 Figure 4.12 shows the result of a drill-down operation performed on the central cube by stepping
down a concept hierarchy for time defined as “day < month < quarter < year.”
 Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed
level of month.
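
Drill-down is only possible when the more detailed data are available. A minimal sketch, assuming monthly figures are stored and the quarter view is derived from them (all values hypothetical):

```python
from collections import defaultdict

MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

# Hypothetical detailed data at the month level.
monthly_sales = {"Jan": 10, "Feb": 20, "Mar": 30, "Apr": 5, "May": 15, "Jun": 25}


def by_quarter(monthly):
    """The summarized (quarter-level) view."""
    out = defaultdict(int)
    for month, value in monthly.items():
        out[MONTH_TO_QUARTER[month]] += value
    return dict(out)


def drill_down(quarter, monthly):
    """Step down from one quarter to its constituent months."""
    return {m: v for m, v in monthly.items() if MONTH_TO_QUARTER[m] == quarter}
```

The monthly values returned by `drill_down` always add up to the corresponding quarter total, which is what makes it the reverse of roll-up.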

Slice and dice:


 The slice operation performs a selection on one dimension of the given cube, resulting in a
subcube.
 Figure 4.12 shows a slice operation where the sales data are selected from the central cube for
the dimension time using the criterion time = “Q1.”
 The dice operation defines a subcube by performing a selection on two or more dimensions.
 Figure 4.12 shows a dice operation on the central cube based on the following selection criteria
that involve three dimensions: (location = “Toronto” or “Vancouver”) and (time = “Q1” or
“Q2”) and (item = “home entertainment” or “computer”).
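
Slice and dice are both selections over cube cells. In this sketch the cube is a dictionary keyed by (time, item, location); the cell values and function names are hypothetical, while the selection criteria mirror the ones quoted above.

```python
# Hypothetical cube cells: (time, item, location) -> dollars_sold.
cube = {
    ("Q1", "computer", "Toronto"): 10,
    ("Q1", "phone", "Vancouver"): 20,
    ("Q2", "home entertainment", "Toronto"): 30,
    ("Q3", "computer", "Vancouver"): 40,
    ("Q2", "security", "Chicago"): 50,
}


def slice_time(cells, time):
    """Select on one dimension; the result is a subcube with one fewer dimension."""
    return {(item, loc): v for (t, item, loc), v in cells.items() if t == time}


def dice(cells, times, items, locations):
    """Select on several dimensions at once; the result keeps all dimensions."""
    return {
        key: v for key, v in cells.items()
        if key[0] in times and key[1] in items and key[2] in locations
    }


q1_slice = slice_time(cube, "Q1")
diced = dice(
    cube,
    times={"Q1", "Q2"},
    items={"home entertainment", "computer"},
    locations={"Toronto", "Vancouver"},
)
```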

Pivot (rotate):
 Pivot (also called rotate) is a visualization operation that rotates the data axes in view to provide
an alternative data presentation.
 Figure 4.12 shows a pivot operation where the item and location axes in a 2-D slice are rotated.
Other examples include rotating the axes in a 3-D cube, or transforming a 3-D cube into a series
of 2-D planes.
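
For a 2-D slice, rotating the axes amounts to transposing the table. A small sketch with hypothetical labels and values:

```python
# Hypothetical 2-D slice: rows are items, columns are locations.
items = ["computer", "phone"]
locations = ["Toronto", "Vancouver", "Chicago"]
slice_2d = [
    [10, 20, 30],
    [40, 50, 60],
]


def pivot(row_labels, col_labels, table):
    """Swap the two axes; the data are unchanged, only the presentation differs."""
    rotated = [list(column) for column in zip(*table)]
    return col_labels, row_labels, rotated


new_rows, new_cols, rotated = pivot(items, locations, slice_2d)
```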

Other OLAP operations: Some OLAP systems offer additional drilling operations. For example,
drill-across executes queries involving (i.e., across) more than one fact table. The drill-through
operation uses relational SQL facilities to drill through the bottom level of a data cube down to its
back-end relational tables.

3.3 Data warehouse Architecture

3.3.1 Steps for the Design and Construction of a Data Warehouse
1. Design of a data warehouse: a business analysis framework
2. Data warehouse design process

1. Design of a data warehouse: a Business Analysis Framework


 To design an effective data warehouse we need to understand and analyze business needs and
construct a business analysis framework.
 The construction of a large and complex information system can be viewed as the construction of
a large and complex building, for which the owner, architect, and builder have different views.
 These views are combined to form a complex framework that represents the top-down, business-driven,
or owner’s perspective, as well as the bottom-up, builder-driven, or implementer’s view of
the information system.
 Four views regarding the design of a data warehouse:
o Top-down view - Allows selection of the relevant information necessary for the data
warehouse
o Data source view - Exposes the information being captured, stored, and managed by
operational systems
o Data warehouse view - Consists of fact tables and dimension tables
o Business query view - Sees the data in the warehouse from the perspective of the end user
2. Data warehouse design process
 Top-down, bottom-up approaches, or a combination of both
 Top-down: starts with overall design and planning (mature)
 Bottom-up: starts with experiments and prototypes (rapid)
 From a software engineering point of view
 Waterfall: structured and systematic analysis at each step before proceeding to the next
 Spiral: rapid generation of increasingly functional systems, with short turnaround time
 Typical data warehouse design process
1. Choose a business process to model, e.g., orders, invoices, etc.
2. Choose the grain (atomic level of data) of the business process.
3. Choose the dimensions that will apply to each fact table record.
4. Choose the measures that will populate each fact table record.

3.3.2 Multi-Tiered Architecture- Three tier Data warehouse Architecture

Data warehouses often adopt a three-tier architecture.
1. The bottom tier is a warehouse database server that is almost always a relational database system.
Back-end tools and utilities are used to feed data into the bottom tier from operational databases or
other external sources (such as customer profile information provided by external consultants). These
tools and utilities perform data extraction, cleaning, and transformation. This tier also contains a
metadata repository, which stores information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP
(ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data
to standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a
special-purpose server that directly implements multidimensional data and operations.
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).

A three-tier data warehousing architecture.


Three Data Warehouse Models
1. Enterprise warehouse
Collects all of the information about subjects spanning the entire organization.
2. Data Mart
A subset of corporate-wide data that is of value to a specific group of users, such as marketing.
3. Virtual Warehouse
A set of views over operational databases. Only some of the possible summary views may be
materialized.
Data warehouse Development: A Recommended Approach
 A recommended method for the development of data warehouse systems is to implement the data
warehouse in an incremental and evolutionary manner.
 First, a high-level corporate data model is defined within a reasonably short period (such as one
or two months) that provides a corporate-wide, consistent, integrated view of data among the
different subjects and potential uses.
 Second, independent data marts can be implemented in parallel with the enterprise warehouse,
based on the same corporate data model.
 Third, distributed data marts can be constructed to integrate different data marts via hub servers.
 Finally, a multitier data warehouse is constructed, in which the enterprise warehouse is the sole
custodian of all warehouse data, which is then distributed to the various dependent data marts.

A recommended approach for data warehouse development

3.3.3 Data Warehouse Back-End Tools and Utilities

 Data warehouse systems use back-end tools and utilities to populate and refresh their data. These
tools and utilities include the following functions:
 Data extraction, which typically gathers data from multiple, heterogeneous, and external
sources
 Data cleaning, which detects errors in the data and rectifies them when possible
 Data transformation, which converts data from legacy or host format to warehouse format
 Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds
indices and partitions
 Refresh, which propagates the updates from the data sources to the warehouse
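
The functions listed above can be chained into a toy pipeline. This is a simplified sketch, not how any particular tool works: the source records, field names, and cleaning rules are all hypothetical, and "load" here only appends rows and recomputes one summary view.

```python
# Two hypothetical sources with inconsistent conventions.
source_a = [{"cust": " Alice ", "amount": "100"}, {"cust": "Bob", "amount": None}]
source_b = [{"cust": "alice", "amount": "50"}]


def extract(*sources):
    """Gather rows from multiple sources."""
    return [dict(row) for source in sources for row in source]


def clean(rows):
    """Drop rows that cannot be repaired and normalize customer names."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue
        cleaned.append({**row, "cust": row["cust"].strip().lower()})
    return cleaned


def transform(rows):
    """Convert source format to the (assumed) warehouse format."""
    return [{"customer": r["cust"], "dollars_sold": float(r["amount"])} for r in rows]


def load(warehouse, rows):
    """Append rows and recompute a summary view per customer."""
    warehouse["facts"].extend(rows)
    summary = {}
    for r in warehouse["facts"]:
        summary[r["customer"]] = summary.get(r["customer"], 0.0) + r["dollars_sold"]
    warehouse["summary"] = summary


warehouse = {"facts": [], "summary": {}}
load(warehouse, transform(clean(extract(source_a, source_b))))

# Refresh: propagate a later update from a source to the warehouse.
new_rows = [{"cust": "Bob", "amount": "25"}]
load(warehouse, transform(clean(extract(new_rows))))
```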

3.3.4 Metadata Repository

 Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects.
 Metadata are created for the data names and definitions of the given warehouse.
 Additional metadata are created and captured for timestamping any extracted data, the source of
the extracted data, and missing fields that have been added by data cleaning or integration
processes.
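
One possible shape for such metadata records is sketched below. The `FieldMetadata` class, the `repository` dictionary, and the example entries are assumptions made for illustration, not a standard metadata format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FieldMetadata:
    name: str                         # data name
    definition: str                   # what the field means
    source: str                       # where the extracted data came from
    extracted_at: datetime            # timestamp of the extraction
    filled_by_cleaning: bool = False  # True if added by cleaning or integration


repository = {}


def register(meta):
    """Store a metadata record under its data name."""
    repository[meta.name] = meta


register(FieldMetadata(
    name="dollars_sold",
    definition="Sale amount in dollars",
    source="orders_db (hypothetical)",
    extracted_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
))
register(FieldMetadata(
    name="region",
    definition="Sales region",
    source="flat file (hypothetical)",
    extracted_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    filled_by_cleaning=True,
))
```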

3.3.5 OLAP Server Architectures


1. Relational OLAP (ROLAP)
 Uses a relational or extended-relational DBMS to store and manage warehouse data, with OLAP
middleware to support the missing pieces
 Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and
additional tools and services
 Greater scalability
2. Multidimensional OLAP (MOLAP)
 Array-based multidimensional storage engine (sparse matrix techniques)
 Fast indexing to pre-computed summarized data
3. Hybrid OLAP (HOLAP)
 User flexibility, e.g., low level: relational, high level: array
4. Specialized SQL servers
 Specialized support for SQL queries over star/snowflake schemas
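
The contrast between ROLAP and MOLAP storage can be illustrated on the same hypothetical facts. This is a conceptual sketch only; real servers use far more elaborate storage, indexing, and sparse-matrix techniques.

```python
# Hypothetical facts: (quarter, item, dollars_sold).
rows = [("Q1", "computer", 10), ("Q1", "phone", 20), ("Q2", "computer", 30)]


# ROLAP-style: keep relational rows and aggregate by scanning at query time.
def rolap_total(fact_rows, quarter):
    return sum(value for q, _item, value in fact_rows if q == quarter)


# MOLAP-style: store cells in an array indexed by dimension position,
# and precompute summaries so a query becomes a direct lookup.
quarters = ["Q1", "Q2"]
items = ["computer", "phone"]
array = [[0] * len(items) for _ in quarters]
for q, item, value in rows:
    array[quarters.index(q)][items.index(item)] += value

quarter_totals = [sum(cells) for cells in array]  # precomputed summary


def molap_total(quarter):
    return quarter_totals[quarters.index(quarter)]
```

The empty (Q2, phone) cell still occupies space in the dense array, which hints at why MOLAP engines rely on sparse matrix techniques.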
