MCS-221
Data Warehousing and Data Mining

Indira Gandhi National Open University
School of Computer and Information Sciences (SOCIS)

Block 1
DATA WAREHOUSE FUNDAMENTALS AND ARCHITECTURE

COURSE INTRODUCTION
This course is of 4 credits, divided into two parts – first part (2 credits) covering
the Data Warehousing and second part (2 credits) covering the Data Mining.

A data warehouse is a system that stores data from a company’s operational
databases as well as external sources. Data warehouse platforms are different from
operational databases because they store historical information, making it easier
for business leaders to analyze data over a specific period of time. Data warehouse
platforms also sort data based on different subject matter, such as customers,
products or business activities.

Many global corporations have turned to data warehousing to organize data that
streams in from corporate branches and operations centers around the world. It is
essential for IT students to understand how data warehousing helps businesses
remain competitive in a quickly evolving global marketplace. Data warehousing is
an increasingly important business intelligence tool: it enables historical
insights, ensures consistency, and helps organizations make better business
decisions, decrease costs, maximize efficiency, increase the power and speed of
data analytics, gain a competitive edge and increase sales to improve the
bottom line.

Choosing adequate Data Mining algorithms makes a Data Warehouse more useful.
Data mining algorithms transform data into business information and thereby
improve the decision making process. Data Mining is a set of methods for data
analysis, created with the aim of finding specific dependences, relations and
rules in the data and expressing them as new, higher-quality information. Data
Mining gives results that show the interdependence and relations of data. These
dependences are mainly based on various mathematical and statistical relations.
Data are collected from internal databases and converted into various documents,
reports, lists etc., which can be further used in decision making processes.
After the data for analysis has been selected, Data Mining is applied to find the
appropriate rules of behavior and patterns. That is why Data Mining is also known
as knowledge extraction, data archeology or pattern analysis. Data mining helps
to make smart market decisions, run accurate campaigns, make predictions, and
more. With the help of Data mining, we can analyze customer behavior and gain
insights from it, which supports a successful, data-driven business.
The course is organized into 4 Blocks:
Block 1 covers the Introductory topics on Data Warehousing, Data Warehouse
architecture, Data Marts and Dimensional Modeling.
Block 2 covers the Extract, Transform and Loading (ETL) aspects of Data Warehousing,
Online Analytical Processing and some Trends in Data Warehouse.
Block 3 covers the introductory topics related to Data Mining, Data Preprocessing and
Mining Frequent Patterns and Associations
Block 4 covers the Classification, Clustering of Data Mining, Text and Web Mining.
There is a lab component associated with this course (i.e., Section-2 Data Mining Lab of
MCSL-223 course).

Block 1
DATA WAREHOUSE FUNDAMENTALS AND ARCHITECTURE

UNIT 1
Fundamentals of Data Warehouse
UNIT 2
Data Warehouse Architecture
UNIT 3
Dimensional Modeling
PROGRAMME DESIGN COMMITTEE
Prof. (Retd.) S.K. Gupta, IIT, Delhi
Prof. T.V. Vijay Kumar, JNU, New Delhi
Prof. Ela Kumar, IGDTUW, Delhi
Prof. Gayatri Dhingra, GVMITM, Sonipat
Mr. Milind Mahajan, Impressico Business Solutions, New Delhi
Sh. Shashi Bhushan Sharma, Associate Professor, SOCIS, IGNOU
Sh. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. P. Venkata Suresh, Associate Professor, SOCIS, IGNOU
Dr. V.V. Subrahmanyam, Associate Professor, SOCIS, IGNOU
Sh. M.P. Mishra, Assistant Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU

COURSE DESIGN COMMITTEE


Prof. T.V. Vijay Kumar, JNU, New Delhi
Dr. Rahul Johri, USICT, GGSIPU, New Delhi
Mr. Vinay Kumar Sharma, NVLI, IGNOU
Sh. Shashi Bhushan Sharma, Associate Professor, SOCIS, IGNOU
Sh. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. P. Venkata Suresh, Associate Professor, SOCIS, IGNOU
Dr. V.V. Subrahmanyam, Associate Professor, SOCIS, IGNOU
Sh. M.P. Mishra, Assistant Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU

SOCIS FACULTY
Prof. P. Venkata Suresh, Director, SOCIS, IGNOU
Prof. V.V. Subrahmanyam, SOCIS, IGNOU
Dr. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. Naveen Kumar, Associate Professor, SOCIS, IGNOU (on EOL)
Dr. M.P. Mishra, Associate Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU
Dr. Manish Kumar, Assistant Professor, SOCIS, IGNOU

BLOCK PREPARATION TEAM


Course Editor
Prof. Devendra Kumar Tayal, Dept. of Computer Science & Engineering,
Indira Gandhi Delhi Technical University for Women, New Delhi

Language Editor
Prof. Parmod Kumar, School of Humanities, IGNOU, New Delhi

Course Writers
Unit 1: Ms. U. Chaitanya, Asst Professor, Dept. of Information Technology,
Mahatma Gandhi Institute of Technology, Hyderabad
Unit 2: Prof. K. Swathi, NRI Institute of Technology, Vijayawada
Unit 3: Prof. Archana Singh, Dept. of Information Technology,
Amity School of Engineering & Technology, Noida

Course Coordinator: Prof. V.V. Subrahmanyam

Print Production
Mr. Sanjay Aggarwal, Assistant Registrar (Publication), MPDD

July, 2022
Indira Gandhi National Open University, 2022
ISBN-
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without permission in writing from
the Indira Gandhi National Open University.
Further information on the Indira Gandhi National Open University courses may be obtained from the University’s office at Maidan Garhi, New
Delhi-110068.
Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by MPDD, IGNOU.
BLOCK INTRODUCTION
This Block is titled Data Warehouse Fundamentals and Architecture. Its objectives
are to help you understand the underlying concepts of Data Warehousing, identify
the components of the Data Warehouse Architecture, know the difference between a
Data Warehouse and Data Marts, understand the Data Warehouse Development Life
Cycle and become familiar with the dimensional modeling techniques.
The block is organized into 3 units:

Unit 1 covers the fundamentals of data warehousing, its evolution, characteristics of
data warehousing, online transaction processing systems and applications of data warehouses;
Unit 2 covers the data warehouse architecture, data marts and data warehouse development
life cycle; and
Unit 3 covers the introduction to dimensional modeling, identifying facts and dimensions,
star schema, snowflake schema and fact constellation schema.
UNIT 1 FUNDAMENTALS OF DATA WAREHOUSE
Structure

1.0 Introduction
1.1 Objectives
1.2 Evolution of Data Warehouse
1.3 Data Warehouse and its Need
1.3.1 Need for Data Warehouse
1.3.2 Benefits of Data Warehouse
1.4 Data Warehouse Design Approaches
1.4.1 Top-Down Approach
1.4.2 Bottom-Up Approach
1.5 Characteristics of a Data Warehouse
1.5.1 How Data Warehouse Works?
1.6 OLTP and OLAP
1.6.1 Online Transaction Processing (OLTP)
1.6.2 Online Analytical Processing (OLAP)
1.7 Data Granularity
1.8 Metadata and Data Warehousing
1.9 Data Warehouse Applications
1.10 Types of Data Warehouses
1.10.1 Enterprise Data Warehouse
1.10.2 Operational Data Store
1.10.3 Data Mart
1.11 Popular Data Warehouse Platforms
1.12 Summary
1.13 Solutions/Answers
1.14 Further Readings

1.0 INTRODUCTION

A database is an organized collection of information, or data, that is generally
stored electronically in a computer system. This makes the data easy to access,
manage, modify, update, monitor, and organize. The data is stored in the tables
of the database.

The process of consolidating data and analyzing it to obtain insights has
been around for centuries, but only recently have we begun referring to it as
data warehousing. An operational or transactional system is designed only for
its own functionality, and hence can handle limited amounts of data for a
limited amount of time. Operational systems are not designed or architected
for long-term data retention, as historical data is of little to no importance
to them. However, to gain point-in-time visibility and understand the
high-level operational aspects of any business, historical data plays a vital
role. With the emergence of mature Relational Database Management Systems
(RDBMS) in the 1980s, engineers across various enterprises started
architecting ways to copy the data from the transactional systems over to
different databases, via manual or automated mechanisms, and use it for
reporting and analysis. While the data in the transactional systems would get
purged periodically, this was not the case in these analytical repositories, as
their purpose was to store as much data as possible; hence the term “data
warehouse” came into existence, because these repositories would become a
warehouse for the data.

Data Warehousing (DW) as a practice became prominent during the late 1980s,
when enterprises started building decision support systems that were mainly
responsible for supporting reporting. With the rapid advancement in the
performance of relational databases during the late 1990s and early 2000s,
Data Warehousing became a core part of the Information Technology group
across large enterprises. In fact, vendors such as Netezza and Teradata
started offering customized hardware to run data warehouse architectures
on state-of-the-art machines. Data Warehousing has been at the top of the
list of priorities since the mid 2000s. The data supply chain ecosystem has
grown exponentially, and so have the ways in which enterprises architect
their data warehouses.

A well architected data warehouse serves as an extended vision for the
enterprise where multiple departments can gain actionable insights to manage
key business decisions that could drive operational excellence or revenue
generating opportunities for the enterprise.

This unit covers the basic features of data warehousing, its evolution,
characteristics, online transaction processing (OLTP), online analytical
processing, popular platforms and applications of data warehouses.

1.1 OBJECTIVES

After going through this unit, you shall be able to:

• understand the evolution of data warehouse;
• describe various characteristics of data warehouse;
• describe the benefits and applications of a data warehouse;
• discuss the significance of metadata in a data warehouse;
• list and discuss the types of data warehouses; and
• identify the popular data warehouse platforms.

1.2 EVOLUTION OF DATA WAREHOUSE


The relational database revolution in the early 1980s ushered in an era of
improved access to the valuable information contained deep within data. It was
soon discovered that databases modeled to be efficient at transactional
processing were not always optimized for complex reporting or analytical
needs.

In fact, the need for systems offering decision support functionality predates
the first relational model and SQL. But the practice known today as Data
Warehousing really saw its genesis in the late 1980s. An IBM Systems Journal
article published in 1988, “An architecture for a business and information
system”, coined the term “business data warehouse,” although a future
progenitor of the practice, Bill Inmon, used a similar term in the 1970s.
Considered by many to be the Father of Data Warehousing, Bill Inmon, an
American computer scientist, was the first to discuss the principles around
the Data Warehouse and even coined the term. Throughout the latter 1970s and
into the 1980s, Inmon worked extensively as a data professional, honing his
expertise in all manners of relational Data Modeling. Inmon’s work as a Data
Warehousing pioneer took off in the early 1990s when he ventured out on his
own, forming his first company, Prism Solutions. One of Prism’s main products
was the Prism Warehouse Manager, one of the first industry tools for creating
and managing a Data Warehouse.

In 1992, Inmon published Building the Data Warehouse, one of the seminal
volumes of the industry. Later in the 1990s, Inmon developed the concept of
the Corporate Information Factory, an enterprise level view of an
organization’s data of which Data Warehousing plays one part. Inmon’s
approach to Data Warehouse design focuses on a centralized data repository
modeled to the third normal form. Inmon's approach is often characterized as a
top-down approach. Inmon holds that strong relational modeling leads to
enterprise-wide consistency, which makes it easier to develop individual data
marts that better serve the needs of the departments using the actual data.
This approach differs in some respects from that of the “other” father of Data
Warehousing, Ralph Kimball.

While Inmon’s Building the Data Warehouse provided a robust theoretical
background for the concepts surrounding Data Warehousing, it was Ralph
Kimball’s The Data Warehouse Toolkit, first published in 1996, that included a
host of industry-honed, practical examples for OLAP-style modeling. In
contrast to Inmon, Kimball favors the development of individual data marts at
the departmental level that get integrated together using the Information Bus
architecture. This bottom-up approach fits in nicely with Kimball’s preference
for star-schema modeling. Both approaches remain core to Data Warehousing
architecture as it stands today. Smaller firms might find Kimball’s data mart
approach easier to implement with a constrained budget. Dimensional
modeling is in many cases easier for the end user to understand.

According to Bill Inmon, “A warehouse is a subject-oriented, integrated,
time-variant and non-volatile collection of data in support of management’s
decision making process”.

According to Ralph Kimball, “Data warehouse is the conglomerate of all data
marts within the enterprise. Information is always stored in the dimensional
model”.

1.3 DATA WAREHOUSING AND ITS NEED

A Data Warehouse is used to collect and manage data from various sources in
order to provide meaningful business insights. It is usually used for linking
and analyzing heterogeneous sources of business data, and it is the center of
the data collection and reporting framework developed for the Business
Intelligence (BI) system. Operational database systems are real-time
repositories of information, which are likely to be tied to specific
applications. Data warehouses, in contrast, gather data from multiple sources
(including databases), with an emphasis on storing, filtering, retrieving and,
in particular, analyzing huge quantities of organized data. The data warehouse
operates in an information-rich environment: it provides an overview of the
company, makes the current and historical data of the company available for
decisions, enables decision support transactions without obstructing
operational systems, makes information consistent for the organization, and
presents a flexible and interactive information source.

1.3.1 Need for Data Warehouse

Data warehouses are used extensively in the largest and most complex
businesses around the world. In demanding situations, good decision making
becomes critical. Significant and relevant data is required to make decisions.
This is possible only with the help of a well-designed data warehouse.
Following are some of the reasons for the need of Data Warehouses:

Enhancing the turnaround time for analysis and reporting: A data warehouse
allows business users to access critical data from a single source, enabling
them to make quick decisions. They need not waste time retrieving data from
multiple sources. Business executives can query the data themselves with
minimal or no support from IT, which in turn saves money and time.

Improved Business Intelligence: A data warehouse helps managers and business
executives pursue their vision for the organization. Decisions that affect the
strategy and procedures of an organization will be based on reliable facts and
supported with evidence and organizational data.

Benefit of historical data: Transactional systems store data on a day-to-day
basis or for a very short period, without including historical data. In
comparison, a data warehouse stores large amounts of historical data, which
enables the business to perform time-period analysis, trend analysis, and trend
forecasts.

Standardization of data: The data from heterogeneous sources are available in
a single format in a data warehouse. This simplifies the readability and
accessibility of data. For example, gender may be denoted as Male/Female in
Source 1 and m/f in Source 2, but in the data warehouse it is stored in one
format that is common across all the businesses, i.e. M/F.
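
The gender example can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the mapping table, the function name
standardize_gender and the sample rows are hypothetical and do not come from any
particular warehouse product.

```python
# Hypothetical source-specific encodings mapped to the warehouse standard (M/F).
GENDER_MAP = {
    "male": "M", "m": "M",
    "female": "F", "f": "F",
}

def standardize_gender(raw):
    """Return 'M' or 'F' for a known encoding, or None if it is unrecognized."""
    if raw is None:
        return None
    return GENDER_MAP.get(raw.strip().lower())

# Assumed sample rows from two sources that encode gender differently.
source_1 = [{"id": 1, "gender": "Male"}, {"id": 2, "gender": "Female"}]
source_2 = [{"id": 3, "gender": "m"}, {"id": 4, "gender": "f"}]

# Every row is stored in the warehouse with the same M/F encoding.
warehouse_rows = [
    {"id": row["id"], "gender": standardize_gender(row["gender"])}
    for row in source_1 + source_2
]
print(warehouse_rows)
```

Unrecognized values are returned as None rather than guessed, so that they can be
reviewed before loading.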

Immense ROI (Return On Investment): Return On Investment refers to the
additional revenues or reduced expenses a business will be able to realize from
any project.

Now, let us study the benefits.

1.3.2 Benefits of Data Warehouse

Several enterprises adopt data warehousing as it offers many benefits, such as
streamlining the business and increasing profits. Following are some of the
benefits of having a data warehouse:

Scalability - Businesses today cannot survive for long if they cannot easily
expand and scale to match the increase in the volume of daily transactions.
DW is easy to scale, making it easier for the business to stride ahead with
minimum hassle.

Access to Historical Insights - Though real-time data is important, historical
insights cannot be ignored when tracing patterns. Data warehousing allows
businesses to access past data with just a few clicks. Data that are months and
years old can be stored in the warehouse.

Works On-Premises and on Cloud - Data warehouses can be built on-premises
or on cloud platforms. Enterprises can choose either option, depending on their
existing business system and the long-term plan. Some businesses rely on both.

Better Efficiency - Data warehousing increases the efficiency of the business
by collecting data from multiple sources and processing it to provide reliable
and actionable insights. The top management uses these insights to make better
and faster decisions, resulting in more productivity and improved
performance.

Improved Data Security - Data security is crucial in every enterprise. By
collecting data in a centralized warehouse, it becomes easier to set up a
multi-level security system to prevent the data from being misused. Access to
data can be restricted based on the roles and responsibilities of the
employees.

Increase Revenue and Returns - When the management and employees have
access to valuable data analytics, their decisions and actions will strengthen the
business. This increases the revenue in the long run.

Faster and Accurate Data Analytics - When data is available in the central data
warehouse, it takes less time to perform data analysis and generate reports.
Since the data is already cleaned and formatted, the results will be more
accurate.

Let us study the various approaches in detail in the following section.

1.4 DATA WAREHOUSE DESIGN APPROACHES

The design approach is a very important aspect of building a data warehouse.
Selecting the right data warehouse design can save a lot of time and
project cost.

Two different Data Warehouse design approaches are normally followed when
designing a Data Warehouse solution, and you can choose the one that suits your
particular scenario based on the requirements of your project. These
methodologies are a result of research from Bill Inmon (Top-Down Approach) and
Ralph Kimball (Bottom-Up Approach).

1.4.1 Top-down Approach

Bill Inmon’s design methodology is based on a top-down approach, which is
illustrated in Figure 1. In the top-down approach, the data warehouse is
designed first and then the data marts are built on top of the data warehouse.

Figure 1: Top-Down DW Design Approach

Below are the steps that are involved in top-down approach:

• Data is extracted from the various source systems. The extracts are
loaded and validated in the staging area. Validation is required to make
sure the extracted data is accurate and correct. You can use ETL tools
or an ETL approach to extract the data and push it to the data warehouse.
• Data is extracted from the data warehouse on a regular basis into the
staging area. At this step, various aggregation and summarization
techniques are applied to the extracted data, and the results are loaded
back into the data warehouse.
• Once the aggregation and summarization are completed, the various data
marts extract that data and apply some more transformations to bring
the data into the structure defined by each data mart.
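
The top-down steps above can be sketched as a small Python pipeline. This is a
minimal illustration, not a real ETL tool: the function names (extract,
validate, summarize, build_mart), the validation rule and the sample source
rows are all assumptions made for the example.

```python
from collections import defaultdict

def extract(sources):
    """Pull raw rows from every source system into the staging area."""
    return [row for source in sources for row in source]

def validate(staged):
    """Keep only rows that have a region and a non-negative amount."""
    return [r for r in staged if r.get("region") and r.get("amount", -1) >= 0]

def summarize(warehouse):
    """Aggregate detailed warehouse rows into total sales per region."""
    totals = defaultdict(float)
    for r in warehouse:
        totals[r["region"]] += r["amount"]
    return dict(totals)

def build_mart(summary, regions):
    """A data mart takes only the slice of the summary that it needs."""
    return {region: summary[region] for region in regions if region in summary}

# Assumed sample data from two hypothetical source systems.
pos_system = [{"region": "North", "amount": 120.0}, {"region": "South", "amount": 80.0}]
web_orders = [{"region": "North", "amount": 30.0}, {"region": None, "amount": 10.0}]

warehouse = validate(extract([pos_system, web_orders]))  # step 1: extract, validate, load
summary = summarize(warehouse)                           # step 2: aggregate and summarize
north_mart = build_mart(summary, ["North"])              # step 3: data mart built on top
print(north_mart)
```

The data mart is derived from the warehouse, never directly from the sources,
which is the defining property of the top-down approach.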

1.4.2 Bottom-up Approach

Ralph Kimball’s data warehouse design approach is called dimensional
modelling or the Kimball methodology, and is illustrated in Figure 2. This
methodology follows the bottom-up approach.

As per this method, data marts are created first to provide the reporting and
analytics capability for specific business processes; later, the enterprise
data warehouse is created from these data marts.

Figure 2: Bottom-Up DW Design Approach

Basically, the Kimball model reverses the Inmon model, i.e. data marts are
directly loaded with the data from the source systems, and then an ETL process
is used to load the data into the Data Warehouse. Figure 2 depicts how the
bottom-up approach works.

Below are the steps that are involved in bottom-up approach:

• The data flow in the bottom-up approach starts with the extraction of
data from various source systems into the staging area, where it is
processed and loaded into the data marts that handle specific business
processes.

• After the data marts are refreshed, the current data is once again
extracted into the staging area, and transformations are applied to fit
the data into the data mart structure. The data extracted from the data
marts to the staging area is then aggregated, summarized and so on,
loaded into the Enterprise Data Warehouse (EDW), and made available to
end users for analysis, enabling critical business decisions.
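
For contrast with the top-down flow, the bottom-up steps can be sketched in
Python as follows. The helper names load_mart and build_edw and the sample rows
are hypothetical; the only point illustrated is the order of construction,
with data marts loaded first and the EDW assembled from them afterwards.

```python
def load_mart(source_rows, process):
    """Stage the source rows and keep those for one business process."""
    return [r for r in source_rows if r["process"] == process]

def build_edw(marts):
    """Combine the data marts and summarize the amount per business process."""
    edw = {}
    for name, rows in marts.items():
        edw[name] = sum(r["amount"] for r in rows)
    return edw

# Assumed sample rows from the source systems.
source_rows = [
    {"process": "sales", "amount": 200.0},
    {"process": "sales", "amount": 50.0},
    {"process": "purchasing", "amount": 75.0},
]

# Step 1: data marts are loaded directly from the sources.
marts = {p: load_mart(source_rows, p) for p in ("sales", "purchasing")}
# Step 2: the enterprise data warehouse is built from the marts.
edw = build_edw(marts)
print(edw)
```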
Having discussed the data warehouse design strategies, let us study the
characteristics of the DW in the next section.

1.5 CHARACTERISTICS OF A DATA WAREHOUSE

Data warehouses are systems concerned with studying, analyzing and
presenting enterprise data in a way that enables senior management to make
decisions. Data warehouses have four essential characteristics that
distinguish them from other data stores:

• Subject-oriented
A DW is always subject-oriented, as it provides information about a
specific theme rather than about the organization's ongoing operations. In
other words, the data warehousing process is designed to handle a
well-defined theme (subject). Figure 3 shows Sales, Products, Customers and
Account as different themes.
A data warehouse does not concentrate on ongoing operations. Instead, it
focuses on presenting and analyzing data to support decisions. It also
provides a simple and concise view of a specific theme by excluding
information that is not needed to make decisions.

Figure 3: Subject-oriented Characteristic Feature of a DW

• Integrated

Integration involves setting up a common system of measurement for all similar
data from multiple systems. Data spread across several database repositories
must be shared and stored in a secure manner so that the data warehouse can
access it. A data warehouse integrates data from various sources and
combines it in a relational database. The integrated data must be consistent,
readable, and coded in a uniform way. The data warehouse integrates several
subject areas, as shown in Figure 4.

Figure 4: Integrated Characteristic Feature of a DW

• Time-Variant

Information may be held at various intervals such as weekly, monthly, and
yearly, as shown in Figure 5. The warehouse holds a series of time-bound
records derived from online transactions. The data warehouse covers a
broader range of data than the operational systems. Because the data in
the warehouse is associated with a particular period of time, it provides
a history that can be analyzed and used for prediction; an element of time
is embedded within it. One other facet of the data warehouse is that the
data cannot be changed, modified or updated once it is stored.

Figure 5: Time-Variant Characteristic Feature of a DW

• Non-Volatile

The data residing in the data warehouse is permanent, as the name
non-volatile suggests. When new data is added, existing data is not
erased or removed. The warehouse therefore accumulates a mammoth amount
of data, which is analyzed using the warehouse's own technologies.
Figure 6 shows the non-volatile data warehouse vs the operational
database. A data warehouse is kept separate from the operational database,
and thus frequent changes in the operational database are not reflected
in the data warehouse.

Figure 6: Non-Volatile Characteristic Feature of a DW
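
The time-variant and non-volatile characteristics can be illustrated together
with a toy append-only store in Python. The class AppendOnlyStore and its
methods are hypothetical names chosen for this sketch; real warehouses enforce
these properties through their loading processes rather than through a class
like this one.

```python
from datetime import date

class AppendOnlyStore:
    """Toy warehouse table that offers no update or delete operation."""

    def __init__(self):
        self._rows = []

    def load(self, customer_id, city, as_of):
        # Each load adds a new time-stamped snapshot; older rows stay in place.
        self._rows.append({"customer_id": customer_id, "city": city, "as_of": as_of})

    def history(self, customer_id):
        """Return all snapshots for one customer, oldest first."""
        rows = [r for r in self._rows if r["customer_id"] == customer_id]
        return sorted(rows, key=lambda r: r["as_of"])

store = AppendOnlyStore()
store.load(7, "Hyderabad", date(2021, 1, 1))
store.load(7, "New Delhi", date(2022, 1, 1))  # a change of city adds a row

for row in store.history(7):
    print(row["as_of"], row["city"])
```

An operational database would typically overwrite the customer's city; here both
values remain available, each tied to the date from which it applies.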

1.5.1 How Data Warehouse Works?


A data warehouse is a central repository in which data from one or more
sources is collected. Data in the data warehouse may be structured,
semi-structured, or unstructured. The data is processed and transformed, and
then accessed by end users for business intelligence reporting and
decision-making. A data warehouse integrates disparate primary sources into
one comprehensive source. Through the integration of all this information, an
organization can maintain a more holistic level of customer service, because
all available data is properly considered. A data warehouse also enables
data mining to find patterns in the information that can increase profits.

Figure 7 shows the important components of the data warehouse.

Figure 7: Components of a Data Warehouse

• Load Manager

The Load Manager component of the data warehouse is responsible for collecting
data from operational systems and converting it into a form usable by the
users. It is responsible for importing and exporting data from operational
systems, and includes all of the programs and application interfaces that
pull the data out of the operational systems, prepare it, and load it into
the warehouse itself. It performs the following tasks:

• identification of data;
• validation of data for accuracy;
• extraction of data from the original source;
• cleansing of data by eliminating meaningless values and making it usable;
• data formatting;
• data standardization by getting the data into a consistent form;
• data merging by taking data from different sources and consolidating it
in one place; and
• establishing referential integrity.
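
A few of the Load Manager tasks (cleansing, standardization, merging and the
referential integrity check) can be sketched in Python as below. The function
names, the rule used for a "meaningless" value and the sample data are
assumptions made for illustration only.

```python
def cleanse(rows):
    """Drop rows whose amount is missing or not a number."""
    return [r for r in rows if isinstance(r.get("amount"), (int, float))]

def standardize(rows):
    """Bring customer codes into one consistent form (upper case, no spaces)."""
    return [{**r, "customer": r["customer"].strip().upper()} for r in rows]

def check_references(rows, known_customers):
    """Split rows into those whose customer exists and those that do not."""
    ok = [r for r in rows if r["customer"] in known_customers]
    rejected = [r for r in rows if r["customer"] not in known_customers]
    return ok, rejected

# Assumed sample rows from two hypothetical operational systems.
billing = [{"customer": " c1 ", "amount": 10.0}, {"customer": "c2", "amount": None}]
orders = [{"customer": "C3", "amount": 5}, {"customer": "c9", "amount": 2.5}]
known_customers = {"C1", "C2", "C3"}

merged = billing + orders  # merging: consolidate both sources in one place
loaded, rejected = check_references(standardize(cleanse(merged)), known_customers)
print(loaded)
print(rejected)
```

Rows that fail the referential check are set aside rather than discarded, so
that they can be corrected at the source and reloaded.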

• Warehouse Manager

The warehouse manager is the center of the data warehousing system and is the
data warehouse itself. It is a large, physical database that holds a vast
amount of information from a wide variety of sources. The data within the data
warehouse is organized such that it is easy to find and use, and can be
refreshed frequently from its sources.

• Query Manager

The Query Manager component provides end users with access to the stored
warehouse information through specialized end-user tools. These access
tools fall into various categories such as query and reporting, on-line
analytical processing (OLAP), statistics, data discovery, and graphical and
geographical information systems.

• End-user access tools

These tools are divided into the following categories:

• Reporting Data
• Query Tools
• Data Dippers
• Tools for EIS
• Tools for OLAP and tools for data mining.

Check Your Progress 1

1) What is a Data Warehouse and why is it important?

……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………

2) Mention the characteristics of a Data Warehouse.

……………………………………………………………………………
……………………………………………………………………………

1.6 OLTP AND OLAP


Online Transaction Processing (OLTP) and Online Analytical Processing
(OLAP) are two terms that look similar but refer to different kinds of
systems. OLTP captures, stores, and processes data from transactions in real
time. OLAP uses complex queries to analyze aggregated historical data from
OLTP systems.

1.6.1 Online Transaction Processing (OLTP)

An OLTP system captures and maintains transaction data in a database. Each
transaction involves individual database records made up of multiple fields or
columns. Examples include banking and credit card activity or retail checkout
scanning.

In OLTP, the emphasis is on fast processing, because OLTP databases are
read, written, and updated frequently. If a transaction fails, built-in system
logic ensures data integrity.

1.6.2 Online Analytical Processing (OLAP)

OLAP applies complex queries to large amounts of historical data, aggregated
from OLTP databases and other sources, for data mining, analytics,
and business intelligence projects. In OLAP, the emphasis is on response time to
these complex queries. Each query involves one or more columns of data
aggregated from many rows.

Examples include year-over-year financial performance or marketing lead
generation trends. OLAP databases and data warehouses give analysts and
decision-makers the ability to use custom reporting tools to turn data into
information. Query failure in OLAP does not interrupt or delay transaction
processing for customers, but it can delay or impact the accuracy of business
intelligence insights.

OLTP is operational, while OLAP is informational. A glance at the key
features of both kinds of processing illustrates their fundamental differences,
and how they work together. Table 1 below summarizes the differences
between OLTP and OLAP.
Table 1: OLTP Vs OLAP

| Feature | OLTP | OLAP |
|---|---|---|
| Characteristics | Handles a large number of small transactions | Handles large volumes of data with complex queries |
| Query types | Simple standardized queries | Complex queries |
| Operations | Based on INSERT, UPDATE, DELETE commands | Based on SELECT commands to aggregate data for reporting |
| Response time | Milliseconds | Seconds, minutes, or hours depending on the amount of data to process |
| Design | Industry-specific, such as retail, manufacturing, or banking | Subject-specific, such as sales, inventory, or marketing |
| Source | Transactions | Aggregated data from transactions |
| Purpose | Control and run essential business operations in real time | Plan, solve problems, support decisions, discover hidden insights |
| Data updates | Short, fast updates initiated by user | Data periodically refreshed with scheduled, long-running batch jobs |
| Space requirements | Generally small if historical data is archived | Generally large due to aggregating large datasets |
| Backup and recovery | Regular backups required to ensure business continuity and meet legal and governance requirements | Lost data can be reloaded from OLTP database as needed in lieu of regular backups |
| Productivity | Increases productivity of end users | Increases productivity of business managers, data analysts, and executives |
| Data view | Lists day-to-day business transactions | Multi-dimensional view of enterprise data |
| User examples | Customer-facing personnel, clerks, online shoppers | Knowledge workers such as data analysts, business analysts, and executives |
| Database design | Normalized databases for efficiency | Denormalized databases for analysis |

OLTP provides an immediate record of current business activity, while OLAP


generates and validates insights from that data as it’s compiled over time. That
historical perspective empowers accurate forecasting, but as with all business
intelligence, the insights generated with OLAP are only as good as the data
pipeline from which they emanate.
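
As a rough illustration of the contrast between the two kinds of processing, the following Python sketch uses the standard sqlite3 module with a hypothetical sales table. The table name, columns and values are assumptions made for this example and are not part of the course material. OLTP-style work appears as short single-row writes, and OLAP-style work as one read-only query that aggregates many rows.

```python
import sqlite3

# Hypothetical in-memory table; schema and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)

# OLTP-style work: many small transactions, each touching one row.
new_sales = [
    ("North", 2022, 100.0),
    ("South", 2022, 200.0),
    ("North", 2023, 150.0),
    ("South", 2023, 250.0),
]
for sale in new_sales:
    with conn:  # commits on success, rolls back if the statement fails
        conn.execute(
            "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)", sale
        )

# Another short transaction: correct the amount of a single sale.
with conn:
    conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (130.0, 1))

# OLAP-style work: one read-only query that aggregates many rows.
yearly_totals = conn.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()
print(yearly_totals)
```

In a real deployment the aggregate query would normally run against the warehouse rather than the operational database, so that long-running analysis does not slow down transactions.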

Check Your Progress 2


1) Why is a data warehouse separated from operational databases?

…………………………………………………………………………………………
…………………………………………………………………………………………
…………………………………………………………………………………………
2) Mention the key differences between a database and a data warehouse.

…………………………………………………………………………………………
…………………………………………………………………………………………

1.7 DATA GRANULARITY

Granularity is one of the main elements in the modeling of DW data.
Granularity of data refers to its level of detail. Multiple levels of detail may be
available depending on the requirements, and many data warehouses hold at
least two levels of granularity. The relation between detail and granularity is
important to understand. Fine (low) granularity means greater detail in the data
(less summarization). Coarse (high) granularity means less detail (greater
summarization).

Operational data is stored at the lowest level of detail. In an outlet
(point-of-sale) system, sale units are captured and stored at the level of units of
a product per transaction. In an order entry system, the quantity ordered is
captured and stored at the level of units per customer order. Whenever you need
summary data, you add up the individual transactions. To find the units of a
product ordered this month, you read all the orders entered for that product
during the month and add them together. Summary data is generally not kept in
an operational system.

When a user queries the data warehouse for analysis, the user usually starts by
viewing summary data, for example the total sale units of a product across an
entire region. The user may then wish to examine the breakdown within the
region. The next step might be to look at the sale units of each store. The
analysis often starts at a high level and moves down to lower levels of detail.

Therefore, in a data warehouse, it is efficient to keep data summarized at
different levels. You can then answer queries at the most summarized level or at
the most detailed level. The level of detail of the data is the level of granularity
in the data warehouse. The more detail in the data, the finer the granularity.
Keeping data at the lowest level of detail means storing a large volume of data
in the data warehouse. The choice of granularity depends on how the data will
be processed and on the performance expectations. For example, time-based data
can be kept at the level of year, month, day, hour, minute, second, and so forth.
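
The idea of keeping the same data at several levels of granularity can be sketched in a few lines of Python. The transaction rows, column order and the helper name roll_up below are hypothetical and chosen only for illustration: detailed rows are summed first by region (coarse granularity) and then by region and store (finer granularity).

```python
from collections import defaultdict

# Hypothetical transaction-level rows: (region, store, product, units).
transactions = [
    ("East", "Store A", "Pen", 3),
    ("East", "Store A", "Pen", 2),
    ("East", "Store B", "Pen", 4),
    ("West", "Store C", "Pen", 5),
]


def roll_up(rows, key_length):
    """Sum the units column, grouping by the first `key_length` columns."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[:key_length]] += row[-1]
    return dict(totals)


by_region = roll_up(transactions, 1)  # coarse granularity: one total per region
by_store = roll_up(transactions, 2)   # finer granularity: one total per store

print(by_region)
print(by_store)
```

A warehouse that stores only the regional totals cannot answer the store-level question later, which is why the lowest level of detail that users are likely to need is usually kept as well.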

1.8 METADATA AND DATA WAREHOUSING

In a data warehouse, data is stored using a common schema controlled by a


common dictionary. Within the data dictionary, data is kept about the logical
data structures, file and address data, index information and others. The
metadata should contain the following data warehouse information:

• The data structure based on the programmer's view


• Data structure based on DSS analysts' view
• The DW's data sources
• The data transformation at the moment of its migration to DW
• Model of data
• The connection between the data model and the DW
• Data extraction history

In the DW environment, metadata is a major component. Metadata helps to
control reporting accuracy, validates the transformation of data and ensures
calculation accuracy. Metadata also enforces the definition of business terms
for the company's end-users. More details on metadata are provided in the next
Unit.
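
To make the list of metadata contents above more concrete, here is a minimal Python sketch of one metadata record for a single warehouse table. The class name TableMetadata, its field names and the sample values are all assumptions made for this illustration. Real metadata repositories are considerably richer.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical metadata record for one data warehouse table."""

    table_name: str
    columns: dict          # programmer's view: column name -> data type
    business_names: dict   # DSS analyst's view: column name -> business term
    sources: list          # operational systems that feed the table
    transformations: list  # rules applied when data migrates to the DW
    extraction_history: list = field(default_factory=list)  # load dates

    def record_extraction(self, load_date: str) -> None:
        """Append one entry to the extraction history."""
        self.extraction_history.append(load_date)


sales_meta = TableMetadata(
    table_name="fact_sales",
    columns={"cust_id": "TEXT", "amt": "REAL"},
    business_names={"cust_id": "Customer number", "amt": "Sale amount"},
    sources=["order_entry_system"],
    transformations=["convert all amounts to one currency"],
)
sales_meta.record_extraction("2024-01-31")
print(sales_meta.extraction_history)
```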

1.9 DATA WAREHOUSING APPLICATIONS

Data warehouses have numerous applications in different sectors, such as
e-commerce, telecommunication, transport, marketing, distribution and retail.
Given below are some of the applications of data warehouses:

Investment and Insurance: In this sector, data warehousing is used to analyze
customer data, market trends and other patterns of data. The two sub-sectors
where data warehousing plays an important role are Forex and stock markets.

Healthcare: A data warehousing system is used to forecast the outcomes of a
treatment, generate reports and share data with different units. These
units can be research labs, medical units, and insurance
providers. Enterprise data warehouses serve as the backbone of healthcare
systems as they are updated with recent information, which is crucial for saving
lives.

Retail: Be it distribution, marketing, examining pricing policies, keeping
track of promotional deals, or finding patterns in customer buying
trends, data warehousing supports it all. Many retail chains incorporate enterprise
data warehousing for business intelligence and forecasting.

Social Media Websites: Social networking sites such as Facebook, Twitter,
LinkedIn etc. are based on the analysis of large data sets. These sites collect data on
members, groups, locations etc. and store this information in a single central
repository. A data warehouse is needed to handle this data because
of its high volume.

Banking: Most banks are now using warehouses to see account/cardholder


spending patterns. They use this to make special offers, deals, etc. available.

Government: Governments use data warehouses to store and analyze tax
records, which are used to detect tax theft.

Airlines: Data warehouses are used in airline systems for operational purposes
such as crew assignments, route profitability analyses, frequent flyer program
promotions, etc.

Public sector: Information is collected in the public sector's data warehouse. It


helps government agencies and departments manage their data and records.

1.10 TYPES OF DATA WAREHOUSES

There are three different types of traditional Data Warehouse models as listed
below:

i. Enterprise
ii. Operational
iii. Data Mart

(i) Enterprise Data Warehouse

An enterprise data warehouse provides a central repository database for decision
support throughout the enterprise. It is a central place where all business information
from different sources and applications is made available. Once the data is stored, it
can be analyzed and used by people across the organization. The
goal of the enterprise data warehouse is to provide a complete overview of any
particular object in the data model.

(ii) Operational Data Warehouse

An operational data store has a broad enterprise-wide scope, but unlike the
enterprise data warehouse, its data is refreshed in near real time and used for routine
business activity. It obtains data directly from the operational databases
that support transaction processing. The data present in the
Operational Data Store can be scrubbed, and any duplication
can be checked and resolved by examining the corresponding business rules.

(iii) Data Mart

A data mart is a subset of a data warehouse, and it supports a specific
region, business unit, or business function. A data mart focuses on storing data
for a particular functional area, and it contains a subset of the data stored in the
data warehouse. Data marts help improve user response times and reduce the
volume of data for analysis, which makes reporting easier.
More on Data Marts can be studied in the next Unit.

1.11 POPULAR DATA WAREHOUSE PLATFORMS

A data warehouse is a critical database for supporting data analysis and acts as
a conduit between analytical tools and operational data stores. The most
popular data warehousing solutions include a range of useful features for data
management and consolidation.

You can use them to extract/curate data from a range of environments,


transform data and remove duplicates, and ensure consistency in your
analytics.

Google BigQuery

BigQuery is a cost-effective data warehousing tool with built-in machine
learning capabilities. You can integrate it with Cloud ML and TensorFlow to
create powerful AI models. It can also execute queries on petabytes of data for
real-time analytics. This scalable and serverless cloud data warehouse is ideal
for companies that want to keep costs low. If you need a quick way to make
informed decisions through data analysis, BigQuery is one of the solutions.

AWS Redshift

Redshift is a cloud-based data warehousing tool for enterprises. The platform


can process petabytes of data quite fast. That's why it’s suitable for high-speed
data analytics. It also supports automatic concurrency scaling. The automation
increases or decreases query processing resources to match workload demand.

Although tooling provided by Amazon reduces the need to have a database


administrator full time, it does not eliminate the need for one. Amazon
Redshift is known to have issues with handling storage efficiently in an
environment prone to frequent deletes.

Snowflake

Snowflake is a data warehousing solution that offers a variety of options for


public cloud technology. With Snowflake, you can make your business more
data-driven. You may use Snowflake to set up an enterprise-grade cloud data
warehouse. With Snowflake, you can analyze data from various unstructured
and structured sources. However, Snowflake depends on Azure, Amazon
Web Services (AWS), or Google Cloud Platform (GCP). Availability can be a
problem whenever one of those cloud providers has an outage.

Microsoft Azure Synapse

Microsoft Azure is a robust platform for data management, analytics,


integration, and more, with solutions spanning AI, blockchain, and more than a
dozen unique databases for varying use cases. Among them is Azure Synapse,
formerly known as Azure SQL Data Warehouse, a platform built for analytics,
providing you the ability to query data using either serverless or provisioned
resources at scale.

Azure Synapse brings together the two worlds of data warehousing and
analytics with a unified experience to ingest, prepare, manage, and serve data
for immediate BI and machine learning. The broader Azure platform includes
thousands of tools, including others that interface with the various Azure
databases.


1.12 SUMMARY

In this unit you have studied the evolution, characteristics, benefits and
applications of data warehouses.

Operational database systems provide day-to-day information, but they cannot
be used easily for strategic decision-making. The data warehouse is a concept
designed to provide strategic information. A data warehouse helps people make
decisions and provides flexible, convenient and interactive sources of strategic
intelligence. A data warehouse combines several technologies because it
collects data from various operational database systems and external sources
such as magazines, newspapers and reports from the same industry, removes
contradictions, transforms the data and then stores it in formats suited to
easy access for decision-making purposes. The defining characteristics of the
data warehouse are: subject-oriented, integrated, time-variant, and non-volatile.

Data warehouses are meant to be used by executives, managers, and other


people at higher managerial levels who may not have much technical expertise
in handling the databases.

Advantages of data warehouses include better decisions, increased


productivity, lower operational costs, enhanced asset and liability management,
and better CRM.

1.13 SOLUTIONS/ANSWERS
Check Your Progress 1

1) Data Warehousing (DW) is a process for collecting and managing data


from diverse sources to provide meaningful insights into the business.
A Data Warehouse is typically used to connect and analyze
heterogeneous sources of business data. The data warehouse is the
centerpiece of the BI system built for data analysis and reporting.

It is an amalgam of technologies and components which helps to use data
strategically. Instead of transaction processing, it is the automated
collection of a vast amount of information by a company that is
configured for query and analysis. It is a process of transforming data
into information and making it available to users in a timely way so that
it can make a difference.

The decision support database (Data Warehouse) is maintained
separately from the operational database of the organization.
The data warehouse, however, is not a product but rather an
environment. It is an architectural construct of an information
system that provides users with current and historical decision
support information that is difficult to access or present in the
conventional operational data store.

Data warehouse platforms also sort data by subject, such as
customers, products or business activities.

Data warehousing is an increasingly important business intelligence tool
that allows companies to:

• Ensure consistency. All data gathered and shared with
decision makers worldwide should be in a uniform format.
Standardizing data from various sources reduces the risk of
misinterpretation and improves the overall accuracy of interpretation.
• Make better business decisions. Successful entrepreneurs have a
thorough understanding of data, and are good at predicting future
trends. The data warehouse helps users access various data sets with
speed and efficiency.
• Learn from the past. Data warehouse platforms allow companies to
access their business's past history and evaluate ideas and projects.
This gives managers an idea of how they can improve their sales and
management practices.

2) Following are the four main characteristics of a data warehouse:

(i) Subject-oriented

A data warehouse is subject-oriented, as it provides information on a
topic rather than the ongoing operations of the organization. Such subjects
may be inventory, promotion, storage, etc. A data warehouse never
concentrates on current operations. Instead, it emphasizes
modeling and analyzing data for decision-making. It also provides a simple
and succinct view of the particular subject by excluding details
that would not be useful in supporting the decision process.

(ii) Integrated

Integration in a Data Warehouse means establishing a standard unit of
measurement for all similar data coming from the different databases. The
data must also be stored in a simple and universally acceptable manner
within the Data Warehouse. A data warehouse is created by combining data
from various sources such as a mainframe, relational databases, flat files,
etc. Consistency must be maintained in naming conventions, formats,
measurements of characteristics, specification of encoding, etc. This
consistency supports robust data analysis.
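
The kind of standardization described above can be sketched in Python as follows. The two source systems, their field names and the gender coding are hypothetical and exist only to show how naming conventions, encodings and units of measurement are brought into one consistent form.

```python
# Hypothetical rows from two source systems that describe customers differently.
billing_rows = [{"cust": "C1", "gender": "M", "weight_lb": 220.0}]
crm_rows = [{"customer_id": "C2", "sex": "female", "weight_kg": 60.0}]

# One agreed encoding and one agreed unit for the warehouse (assumed here).
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}
LB_TO_KG = 0.45359237


def from_billing(row):
    """Map a billing-system row onto the warehouse naming and units."""
    return {
        "customer_id": row["cust"],
        "gender": GENDER_CODES[row["gender"].lower()],
        "weight_kg": round(row["weight_lb"] * LB_TO_KG, 2),
    }


def from_crm(row):
    """Map a CRM-system row onto the warehouse naming and units."""
    return {
        "customer_id": row["customer_id"],
        "gender": GENDER_CODES[row["sex"].lower()],
        "weight_kg": round(row["weight_kg"], 2),
    }


integrated = [from_billing(r) for r in billing_rows] + [from_crm(r) for r in crm_rows]
print(integrated)
```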

(iii) Time-variant

Compared to operational systems, the time horizon for the data
warehouse is considerably longer, and it provides historical information. It
contains a temporal element, either explicitly or implicitly. One
place where Data Warehouse data shows time variance is the structure
of the record key. Each primary key contained in the DW should have
an element of time, either implicitly or explicitly, such as the day, the
week, the month, etc.

(iv) Non-volatile

Also, the data warehouse is non-volatile, meaning that prior data is
not erased when new data is entered into it. Data is read-only and is
refreshed only periodically. This helps in analyzing historical data and in
understanding what happened and when. Transaction processing,
recovery, and concurrency control mechanisms are not required. In
the Data Warehouse environment, activities such as deleting, updating,
and inserting that are performed in an operational application
environment are omitted.
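
The time-variant and non-volatile characteristics can be illustrated together with a small Python sketch. The dictionary warehouse, the function load_snapshot and the sample balances are assumptions made for this example: every row carries a date in its key, and a load only adds rows and never overwrites earlier ones.

```python
# Hypothetical warehouse table keyed by (customer_id, snapshot_date).
warehouse = {}


def load_snapshot(customer_id, snapshot_date, balance):
    """Add a new dated row; refuse to overwrite a row that already exists."""
    key = (customer_id, snapshot_date)
    if key in warehouse:
        raise ValueError(f"row {key} is already loaded and cannot be changed")
    warehouse[key] = balance


# Periodic loads add new rows; earlier rows stay untouched.
load_snapshot("C1", "2024-01-31", 500.0)
load_snapshot("C1", "2024-02-29", 650.0)

# Because the key contains a time element, the history can be read back.
history = sorted(
    (date, balance)
    for (customer, date), balance in warehouse.items()
    if customer == "C1"
)
print(history)
```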

Check Your Progress 2

1) Data Warehouse systems are kept separate from operational databases so
that the two workloads do not interfere with each other. The main
reasons are:

• An operational database is built for well-known tasks such as
searching records, indexing, and digital archiving. In contrast, data
warehouse queries are often complex and varied in nature.
• Operational databases support the concurrent processing of multiple
transactions. Concurrency control and recovery
mechanisms are needed to ensure that operational
databases are robust and consistent.
• An operational database query allows both read and
modify operations, whilst an OLAP query needs only read
access to the stored information.
• An operational database maintains current information. In
contrast, historical data is kept in a data warehouse.

2) A database stores the current data required to power an application. A data


warehouse stores current and historical data from one or more systems in a
predefined and fixed schema, which allows business analysts and data
scientists to easily analyze the data. Table 2 below summarizes the
differences between databases and data warehouses:

Table 2: Database Vs Data Warehouse

Characteristic | Database | Data Warehouse
Workloads | Operational and transactional | Analytical
Data Type | Structured or semi-structured | Structured and/or semi-structured
Schema Flexibility | Rigid or flexible schema depending on database type | Pre-defined and fixed schema definition for ingest (schema on write and read)
Data Freshness | Real time | May not be up-to-date based on frequency of ETL processes
Users | Application developers | Business analysts and data scientists
Pros | Fast queries for storing and updating data | The fixed schema makes working with the data easy for business analysts
Cons | May have limited analytics capabilities | Difficult to design and evolve schema. Scaling compute may require unnecessary scaling of storage, because they are tightly coupled

1.14 FURTHER READINGS

1. William H. Inmon, Building the Data Warehouse, Wiley, 4th Edition, 2005.
2. Paulraj Ponniah, Data Warehousing Fundamentals, Wiley Student Edition.
3. Reema Thareja, Data Warehousing, Oxford University Press, 2011.


UNIT 2 DATA WAREHOUSE ARCHITECTURE

Structure

2.0 Introduction
2.1 Objectives
2.2 Data Warehouse Architecture and its Types
2.2.1 Types of Data Warehouse Architectures
2.3 Components of Data Warehouse Architecture
2.4 Layers of Data Warehouse Architecture
2.4.1 Best Practices for Data Warehouse Architecture
2.5 Data Marts
2.5.1 Data Mart Vs Data Warehouse
2.6 Benefits of Data Marts
2.7 Types of Data Marts
2.8 Structure of a Data Mart
2.9 Designing the Data Marts
2.10 Limitations with Data Marts
2.11 Summary
2.12 Solutions / Answers
2.13 Further Readings

2.0 INTRODUCTION

In the previous unit, we studied data warehousing and related
topics. Despite numerous advancements over the last five years in the arena of
Big Data, cloud computing, predictive analysis, and information technologies,
data warehouses have only gained more significance. For the success of any
data warehouse, its architecture plays an important role. For three decades,
the data warehouse architecture has been the pillar of corporate data
ecosystems.

This unit presents various topics including the basic concept of data warehouse
architecture, its types, the significant components and layers of data warehouse
architecture, and data marts and their design.

2.1 OBJECTIVES

After going through this unit, you shall be able to:

• Understand the purpose of data warehouse architecture;


• Describe the process of storing the data in a data warehouse;
• List and discuss the various types of data warehouse architectures;
• Discuss various components and layers of data warehouse architecture;
• Summarize the functionality of data marts, their benefits and various
types; and
• Know the ways of structuring and designing the data marts.

2.2 DATA WAREHOUSE ARCHITECTURE AND ITS
TYPES

Data warehouse architecture is the design of an organization's data storage
framework. It takes information from raw data sets and stores it in a
structured and easily digestible format.

A data warehouse architecture plays a vital role in the enterprise data
ecosystem. Databases assist in storing and processing data, while data
warehouses help in analyzing that data.

Data warehousing is the process by which a business or organization stores a
large amount of data. The data warehouse is designed to perform large complex
analytical queries on large multi-dimensional datasets in a straightforward
manner. Data warehouses extract data from different sources, which are in
different formats, convert it into a uniform format, and place it in the Data
Warehouse.

2.2.1 Types of Data Warehouse Architectures

Data warehouse architecture defines the arrangement of the data in different


databases. As the data must be organized and cleansed to be valuable, a
modern data warehouse structure identifies the most effective technique of
extracting information from raw data.

Using a dimensional model, the raw data in the staging area is extracted and
converted into a simple consumable warehousing structure to deliver valuable
business intelligence. When designing a data warehouse, there are three
different types of models to consider, based on the number of tiers
the architecture has.

(i) Single-tier data warehouse architecture


(ii) Two-tier data warehouse architecture
(iii) Three-tier data warehouse architecture

The details of each of the architecture are given below:

(i) Single-tier data warehouse architecture

The single-tier architecture (Figure 1) is not a frequently practiced


approach. The main goal of having such architecture is to remove
redundancy by minimizing the amount of data stored. Its primary
disadvantage is that it doesn’t have a component that separates analytical
and transactional processing.


Figure 1: Single Tier Data Warehouse Architecture

(ii) Two-tier data warehouse architecture

The two-tier architecture (Figure 2) includes a staging area for all data sources,
before the data warehouse layer. By adding a staging area between the sources
and the storage repository, you ensure all data loaded into the warehouse is
cleansed and in the appropriate format.

Figure 2: Two- Tier Data Warehouse Architecture

(iii) Three-tier data warehouse architecture

The three-tier approach (Figure 3) is the most widely used architecture for data
warehouse systems.

Essentially, it consists of three tiers:

1. The bottom tier is the database of the warehouse, where the cleansed
and transformed data is loaded.
2. The middle tier is the application layer giving an abstracted view of
the database. It arranges the data to make it more suitable for analysis.
This is done with an OLAP server, implemented using the ROLAP or
MOLAP model.
3. The top-tier is where the user accesses and interacts with the data. It
represents the front-end client layer. You can use reporting tools, query,
analysis or data mining tools.


Figure 3: Three- Tier Data Warehouse Architecture

Figure 4 illustrates the complete data warehouse architecture with the three
tiers:

Figure 4: 3-Tiers of Data Warehouse

2.2.2 Cloud-based Data Warehouse Architecture

Cloud-based data warehouse architecture is relatively new when compared to
legacy options. In this architecture, the data
warehouse is accessed through the cloud. There are several cloud-based data
warehouse options, each with a different architecture but the same
benefits of integrating, analyzing, and acting on data from different sources.
The differences between a cloud-based data warehouse approach and
a traditional approach include:

• Up-front costs: The different components required for traditional, on-


premises data warehouses mandate pricey up-front expenses. Since the
components of cloud architecture are accessed through the cloud, these
expenses don’t apply.

• Ongoing costs: While businesses with on-prem data warehouses must


deal with upgrade and maintenance costs, the cloud offers a low, pay-
as-you-go model.

• Speed: Cloud-based data warehouse architecture is substantially
speedier than on-premises options, partly due to the use of ELT —
which is an uncommon process for on-premises counterparts.

• Flexibility: Cloud data warehouses are designed to account for the


variety of formats and structures found in big data. Traditional
relational options are designed simply to integrate similarly structured
data.

• Scale: The elastic resources of the cloud make it ideal for the scale
required of big datasets. Additionally, cloud-based data warehousing
options can also scale down as needed, which is difficult to do with
other approaches.

Cloud-based platforms make it possible to create, share, and store massive data
sets with ease, paving the way for more efficient and effective data access and
analysis. Cloud systems are built for sustainable business growth, with many
modern Software-as-a Service (SaaS) providers separating data storage from
computing to improve scalability when querying data.

Some of the more notable cloud data warehouses in the market include
Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure SQL
Data Warehouse.

Now, let’s learn about the major components of a data warehouse and how
they help build and scale a data warehouse in the next section.

2.3 COMPONENTS OF DATA WAREHOUSE ARCHITECTURE

A data warehouse design consists of six main components:

• Data Warehouse Database


• ETL Tools
• Metadata
• Data Warehouse Access Tools
• Data Warehouse Bus
• Data Warehouse Reporting Layer

The details of all the components are given below.

2.3.1 Data Warehouse Database

The central component of DW architecture is a data warehouse database that
stores all enterprise data and makes it manageable for reporting. This means
you need to choose which kind of database you’ll use to store data
in your warehouse.

The following are the four database types that you can use:

• Typical relational databases are the row-oriented databases you
probably use on an everyday basis, for example, Microsoft SQL
Server, SAP, Oracle, and IBM DB2.
• Analytics databases are developed specifically for data storage to sustain
and manage analytics, such as Teradata and Greenplum.
• Data warehouse appliances aren’t exactly a kind of storage database,
but several vendors now offer appliances that bundle software for data
management with hardware for storing data, for example, SAP
HANA, Oracle Exadata, and IBM Netezza.
• Cloud-based databases are hosted and accessed on the cloud so that
you don’t have to procure any hardware to set up your data
warehouse, for example, Amazon Redshift, Google BigQuery, and
Microsoft Azure SQL.

2.3.2 Extraction, Transformation, and Loading Tools (ETL)

ETL tools are central components of enterprise data warehouse architecture.


These tools help extract data from different sources, transform it into a suitable
format, and load it into a data warehouse.

The ETL tool you choose will determine:

• The time spent on data extraction
• The approaches to extracting data
• The kinds of transformations applied and how simple they are to apply
• The definition of business rules for data validation and cleansing to improve
end-product analytics
• How missing data is filled in
• How information is distributed from the central repository to
your BI applications
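
The three ETL steps can be sketched with the Python standard library alone. The CSV text, the orders table, the function names and the cleansing rules below (trim and title-case names, drop duplicate order numbers, fill a missing amount with a default) are assumptions made for this illustration and do not represent any particular ETL product.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real source would be a file or a database.
SOURCE_CSV = (
    "order_id,customer,amount\n"
    "1, alice ,10.50\n"
    "2,BOB,\n"
    "2,BOB,\n"
    "3,carol,7.25\n"
)


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, default_amount=0.0):
    """Transform: remove duplicates, tidy names, fill missing amounts."""
    seen, clean = set(), []
    for row in rows:
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # duplicate order number, keep the first occurrence
        seen.add(order_id)
        raw_amount = row["amount"].strip()
        amount = float(raw_amount) if raw_amount else default_amount
        clean.append((order_id, row["customer"].strip().title(), amount))
    return clean


def load(conn, rows):
    """Load: write the cleansed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the way ETL tools let you change the source, the business rules or the target independently of one another.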

2.3.3 Metadata

Before we delve into the different types of metadata in data warehousing, we
first need to understand what metadata is. In the data warehouse architecture,
metadata describes the data warehouse database and offers a framework for
data. It helps in constructing, preserving, handling, and making use of the data
warehouse.

There are two types of metadata in data warehousing:

• Technical Metadata comprises information that can be used by


developers and managers when executing warehouse development and
administration tasks.
• Business Metadata comprises information that offers an easily
understandable standpoint of the data stored in the warehouse.

Metadata plays an important role in helping business and technical teams
understand the data present in the warehouse and convert it into information.

2.3.4 Data Warehouse Access Tools

A data warehouse uses a database or group of databases as a foundation.
Business users generally cannot work with databases directly unless they have
database administrators available, and not all business units do. This is why
they use several no-code data warehousing tools, such as:

• Query and reporting tools help users produce corporate reports for
analysis that can be in the form of spreadsheets, calculations, or
interactive visuals.
• Application development tools help create tailored reports and present
them in views intended for reporting purposes.
• Data mining tools for data warehousing automate the procedure of
identifying patterns and relationships in huge quantities of data using
advanced statistical modeling methods.
• OLAP tools help construct a multi-dimensional data warehouse and
allow the analysis of enterprise data from numerous viewpoints.

2.3.5 Data Warehouse Bus

It defines the data flow within a data warehousing bus architecture and
includes a data mart. A data mart is an access layer that is used to deliver
data to users. It is also used for partitioning data that is produced for a
particular user group.

2.3.6 Data Warehouse Reporting Layer

The reporting layer in the data warehouse allows the end-users to access the BI
interface or BI database architecture. The purpose of the reporting layer in the
data warehouse is to act as a dashboard for data visualization, create reports,
and extract any required information.

The construction of a data warehouse depends primarily on the particular
business, but every data warehouse architecture has four layers. Let’s study
them in the following section.

2.4 LAYERS OF DATA WAREHOUSE ARCHITECTURE

In general, the data warehouse architecture can be divided into four layers.
They are:

i. Data Source Layer


ii. Data Staging Layer
iii. Data Storage Layer
iv. Data Presentation Layer

Let us study the various layers and their functionality.

(i) Data Source Layer

The data source layer is the place where original data, gathered from a
variety of internal and external sources, resides in the relational database.
Following are the examples of the data source layer:

• Operational Data: Product information, stock information,
marketing information, or HR information
• Social Media Data: Website hits, content popularity, contact page
completion
• Third-party Data: Demographic information, survey information,
statistical information

While most data warehouses manage structured data, thought should be given
to the future use of unstructured data sources, for example, voice
recordings, scanned images, and unstructured text. These streams of data are
significant stores of information and should be considered when building
your warehouse.

(ii) Data Staging Layer

This layer sits between the data sources and the data warehouse. In this
layer, data is extracted from the various internal and external data
sources. Since source data comes in various formats, the data extraction
layer uses multiple technologies and tools to extract the necessary
information. Once the extracted data has been loaded, it is subjected to
rigorous quality checks. The final result is clean and organized data that
you load into your data warehouse. The staging layer contains the following
parts:

• Landing Database and Staging Area

The landing database stores the data retrieved from the data source.
Before the data goes to the warehouse, the staging process performs
stringent quality checks on it. Staging is a critical step in the
architecture. Poor-quality data leads to inadequate information, and the
result is poor business decision-making. The staging layer is also where
you make adjustments to the business process in order to handle
unstructured data sources.

• Data Integration Tool

Extract, Transform and Load (ETL) tools are the data tools used to extract
data from source systems, transform and prepare it, and load it into the
warehouse.
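
The extract, transform and load steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a real ETL tool: the sample CSV text, the `sales` table and the rule that a row without an amount is rejected are all assumptions made for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (the "landing" data).
RAW_CSV = """customer_id,country,amount
1, india ,100.50
2,USA,
3,India,75
"""

def extract(text):
    """Extract: read rows from the landing data as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: standardize values and drop rows that fail a quality check."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:  # assumed quality rule: amount is mandatory
            continue
        clean.append((int(row["customer_id"]),
                      row["country"].strip().upper(),
                      float(amount)))
    return clean

def load(rows, conn):
    """Load: write the cleansed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(customer_id INTEGER, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stands in for the data warehouse
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT customer_id, country, amount FROM sales ORDER BY customer_id"
).fetchall()
print(loaded)
```

Only the two rows that pass the quality check reach the warehouse table, with the country values standardized to upper case.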

(iii) Data Storage Layer

This layer is where the data that was cleansed in the staging area is
stored as a single central repository. Depending on your business and your
warehouse architecture requirements, your data storage may be a central
data warehouse, a data mart (a data warehouse partially replicated for
particular departments), or an Operational Data Store (ODS).
(iv) Data Presentation Layer

This is where users interact with the cleansed and organized data. This
layer of the data architecture gives users the ability to query the data
for product or service insights, analyze the data to explore hypothetical
business scenarios, and create automated or ad hoc reports.

You may use an OLAP or reporting tool with an easy-to-understand Graphical
User Interface (GUI) to help users build their queries, perform analysis,
or design their reports.

2.4.1 Best Practices for Data Warehouse Architecture

Designing the data warehouse with the designated architecture is an art. Some
of the best practices are shown below:

• Create data warehouse models that are optimized for information
retrieval, whether in a dimensional, denormalized, or hybrid approach.
• Select a single approach for data warehouse designs such as the top-
down or the bottom-up approach and stick with it.
• Always cleanse and transform data using an ETL tool before loading
the data to the data warehouse.
• Create an automated data cleansing process where all data is uniformly
cleaned before loading.
• Allow sharing of metadata between different components of the data
warehouse for a smooth retrieval process.
• Always make sure that data is properly integrated and not just
consolidated when moving it from the data stores to the data
warehouse. This would require the 3NF normalization of data models.
• Monitor the performance and security. The information in the data
warehouse is valuable, though it must be readily accessible to provide
value to the organization. Monitor system usage carefully to ensure that
performance levels are high.
• Maintain the data quality standards, metadata, structure, and
governance. New sources of valuable data are becoming available
routinely, but they require consistent management as part of a data
warehouse. Follow procedures for data cleaning, defining metadata,
and meeting governance standards.
• Provide an agile architecture. As the corporate and business unit usage
increases, they will discover a wide range of data mart and warehouse
needs. A flexible platform will support them far better than a limited,
restrictive product.
• Automate the processes such as maintenance. In addition to adding
value to business intelligence, machine learning can automate data
warehouse technical management functions to maintain speed and
reduce operating costs.
• Use the cloud strategically. Business units and departments have
different deployment needs. Use on-premise systems when required,
and capitalize on cloud data warehouses for scalability, reduced cost,
and phone and tablet access.

2.5 DATA MARTS
A data mart is a subset of a data warehouse focused on a particular line of
business, department, or subject area. Data marts make specific data available
to a defined group of users, which allows those users to quickly access critical
insights without wasting time searching through an entire data warehouse. For
example, many companies may have a data mart that aligns with a specific
department in the business, such as finance, sales, or marketing.

2.5.1 Data Mart Vs Data Warehouse

Data marts and data warehouses are both highly structured repositories where
data is stored and managed until it is needed. However, they differ in the scope
of data stored: data warehouses are built to serve as the central store of data for
the entire business, whereas a data mart fulfills the request of a specific
division or business function. Because a data warehouse contains data for the
entire company, it is best practice to strictly control who can access it.
Additionally, finding the data you need in a data warehouse can be a
difficult task for business users. Thus, the primary purpose of a data mart
is to isolate, or partition, a smaller set of data from the whole to provide
easier data access for the end consumers.

A data mart can be created from an existing data warehouse (the top-down
approach) or from other sources, such as internal operational systems or
external data. Similar to a data warehouse, it is a relational database that
stores transactional data (time value, numerical order, reference to one or
more objects) in columns and rows, making it easy to organize and access.

On the other hand, separate business units may create their own data marts
based on their own data requirements. If business needs dictate, multiple data
marts can be merged to create a single data warehouse. This is the
bottom-up development approach.

In a nutshell, the differences are as follows:

• A data mart serves a specific company department and is normally a subset
of an enterprise-wide data warehouse.

• Data marts improve query speed with a smaller, more specialized set of
data.

• Data warehouses help make enterprise-wide strategic decisions; data marts
are for department-level, tactical decisions.

• A data warehouse includes many data sets and takes time to update; data
marts handle smaller, faster-changing data sets.

• Data warehouse implementation can take many years; data marts are much
smaller in scope and can be implemented in months.


2.6 BENEFITS OF DATA MARTS


Data marts are designed to meet the needs of specific groups by having a
comparatively narrow subject of data. And while a data mart can still contain
millions of records, its objective is to provide business users with the most
relevant data in the shortest amount of time.

With its smaller, focused design, a data mart has several benefits to the end
user, including the following:

• Cost-efficiency: There are many factors to consider when setting up a
data mart, such as the scope, integrations, and the process to extract,
transform, and load (ETL). However, a data mart typically only incurs
a fraction of the cost of a data warehouse.

• Simplified data access: Data marts only hold a small subset of data, so
users can quickly retrieve the data they need with less work than they
would need when working with a broader data set from a data warehouse.

• Quicker access to insights: Insight gained from a data warehouse
supports strategic decision-making at the enterprise level, which
impacts the entire business. A data mart fuels business intelligence and
analytics that guide decisions at the department level. Teams can
leverage focused data insights with their specific goals in mind. As
teams identify and extract valuable data in a shorter space of time, the
enterprise benefits from accelerated business processes and higher
productivity.

• Simpler data maintenance: A data warehouse holds a wealth of
business information, with scope for multiple lines of business. Data
marts focus on a single line, typically housing less than 100 GB, which
leads to less clutter and easier maintenance.

• Easier and faster implementation: A data warehouse involves
significant implementation time, especially in a large enterprise, as it
collects data from a host of internal and external sources. On the other
hand, you only need a small subset of data when setting up a data mart,
so implementation tends to be more efficient and include less set-up
time.

2.7 TYPES OF DATA MARTS

There are three types of data marts that differ based on their relationship to the
data warehouse and the respective data sources of each system.

• Dependent data marts are partitioned segments within an enterprise
data warehouse. This top-down approach begins with the storage of all
business data in one central location. The newly created data marts
extract a defined subset of the primary data whenever required for
analysis.

• Independent data marts act as a standalone system that doesn't rely
on a data warehouse. Analysts can extract data on a particular subject
or business process from internal or external data sources, process it,
and then store it in a data mart repository until the team needs it.

• Hybrid data marts combine data from existing data warehouses and
other operational sources. This unified approach leverages the speed
and user-friendly interface of a top-down approach and also offers the
enterprise-level integration of the independent method.
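
As a rough illustration of the dependent (top-down) type, the sketch below carves one department's slice out of a central warehouse table. The table `warehouse_facts`, its columns and the helper `build_dependent_mart` are hypothetical names chosen for this example, and an in-memory SQLite database stands in for the enterprise warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical enterprise warehouse table holding data for every department.
conn.execute(
    "CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [("sales", "revenue", 1200.0),
     ("finance", "cost", 800.0),
     ("sales", "units", 30.0)],
)

def build_dependent_mart(conn, department):
    """Materialize one department's subset of the warehouse as its own table."""
    # The department name comes from trusted code here, never from user input.
    mart = f"mart_{department}"
    conn.execute(f"DROP TABLE IF EXISTS {mart}")
    conn.execute(f"CREATE TABLE {mart} (metric TEXT, value REAL)")
    conn.execute(
        f"INSERT INTO {mart} "
        "SELECT metric, value FROM warehouse_facts WHERE department = ?",
        (department,),
    )
    return mart

sales_mart = build_dependent_mart(conn, "sales")
mart_rows = conn.execute(
    f"SELECT metric, value FROM {sales_mart} ORDER BY metric"
).fetchall()
print(mart_rows)
```

An independent data mart would be loaded directly from operational or external sources instead of from `warehouse_facts`, and a hybrid one would combine both kinds of input.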

2.8 STRUCTURE OF A DATA MART

A data mart is a subject-oriented relational database that stores
transactional data in rows and columns, which makes it easy to access,
organize, and understand. As it contains historical data, this structure
makes it easier for an analyst to determine data trends. Typical data fields
include numerical order, time value, and references to one or more objects.

Companies organize data marts in a multidimensional schema as a blueprint to
address the needs of the people using the databases for analytical tasks.
The three main types of schema are:

Star

A star schema is a logical formation of tables in a multidimensional database
that resembles a star shape. In this blueprint, one fact table (a metric set
that relates to a specific business event or process) resides at the center
of the star, surrounded by several associated dimension tables.

There is no dependency between dimension tables, so a star schema requires
fewer joins when writing queries. This structure makes querying easier, so
star schemas are highly efficient for analysts who want to access and
navigate large data sets.

Snowflake

A snowflake schema is a logical extension of a star schema, building out the
blueprint with additional dimension tables. The dimension tables are
normalized to protect data integrity and minimize data redundancy.

While this method requires less space to store dimension tables, it is a
complex structure that can be difficult to maintain. The main benefit of
using a snowflake schema is the low demand for disk space, but the caveat is
a negative impact on performance due to the joins required by the additional
tables.
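
The trade-off between the star and snowflake layouts can be seen in a small sketch. All table and column names below are hypothetical: the same product dimension is stored once in denormalized (star) form and once in normalized (snowflake) form, and the same question is answered against both using an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one denormalized product dimension (category name repeated per row).
CREATE TABLE dim_product_star (
    product_id INTEGER PRIMARY KEY, name TEXT, category_name TEXT);

-- Snowflake: the category is moved to its own table and referenced by key.
CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snow (
    product_id INTEGER PRIMARY KEY, name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id));

CREATE TABLE fact_sales (product_id INTEGER, amount REAL);

INSERT INTO dim_product_star VALUES (1, 'Pen', 'Stationery'),
                                    (2, 'Pencil', 'Stationery');
INSERT INTO dim_category VALUES (10, 'Stationery');
INSERT INTO dim_product_snow VALUES (1, 'Pen', 10), (2, 'Pencil', 10);
INSERT INTO fact_sales VALUES (1, 5.0), (2, 3.0), (1, 2.0);
""")

# Star: a single join from the fact table to the dimension.
STAR_QUERY = """
SELECT p.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_star p ON p.product_id = f.product_id
GROUP BY p.category_name
"""

# Snowflake: the normalized dimension needs one extra join.
SNOWFLAKE_QUERY = """
SELECT c.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_snow p ON p.product_id = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
GROUP BY c.category_name
"""

star_result = conn.execute(STAR_QUERY).fetchall()
snowflake_result = conn.execute(SNOWFLAKE_QUERY).fetchall()
print(star_result, snowflake_result)
```

Both queries return the same total, but the snowflake version stores the text 'Stationery' only once at the price of a second join.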

Data Vault

Data vault is a modern database modeling technique that enables IT
professionals to design agile enterprise data warehouses. This approach
enforces a layered structure and has been developed specifically to combat
issues with agility, flexibility, and scalability that arise when using the
other schema models. Data vault eliminates star schema's need for cleansing
and streamlines the addition of new data sources without any disruption to
existing schema.

2.9 DESIGNING THE DATA MARTS

Data marts guide important business decisions at a departmental level. For
example, a marketing team may use data marts to analyze consumer behaviors,
while sales staff could use data marts to compile quarterly sales reports. As
these tasks happen within their respective departments, the teams don't need
access to all enterprise data.

Typically, a data mart is created and managed by the specific business
department that intends to use it. The process for designing a data mart usually
comprises the following steps:

(i) Essential Requirements Gathering

The first step is to create a robust design. Some critical processes involved in
this phase include collecting the corporate and technical requirements,
identifying data sources, choosing a suitable data subset, and designing the
logical layout (database schema) and physical structure.

(ii) Build/Construct

The next step is to construct it. This includes creating the physical database
and the logical structures. In this phase, you’ll build the tables, fields, indexes,
and access controls.

(iii) Populate/Data Transfer

The next step is to populate the mart, which means transferring data into it. In
this phase, you can also set the frequency of data transfer, such as daily or
weekly. This usually involves extracting source information, cleaning and
transforming the data, and loading it into the departmental repository.

(iv) Data Access

In this step, the data loaded into the data mart is used for querying,
generating reports and graphs, and publishing. The main task involved in
this phase is setting up a meta-layer that translates database structures
and item names into business terms so that non-technical users can easily
use the data mart. If necessary, you can also set up APIs and interfaces to
simplify data access.
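
One simple way to picture such a meta-layer is a database view that renames cryptic physical columns to business terms. The sketch below is only an assumption about how this could be done: the table `mkt_cmp_f`, its columns, the `BUSINESS_NAMES` mapping and the helper `create_business_view` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mkt_cmp_f (cmp_id INTEGER, chnl_cd TEXT, spend_amt REAL);
INSERT INTO mkt_cmp_f VALUES (1, 'EM', 250.0), (2, 'SM', 400.0);
""")

# Hypothetical mapping from physical column names to business terms.
BUSINESS_NAMES = {
    "cmp_id": "campaign_number",
    "chnl_cd": "channel_code",
    "spend_amt": "amount_spent",
}

def create_business_view(conn, table, view, names):
    """Expose a table through a view whose columns carry business names."""
    columns = ", ".join(
        f'{physical} AS "{label}"' for physical, label in names.items()
    )
    conn.execute(f'CREATE VIEW "{view}" AS SELECT {columns} FROM "{table}"')

create_business_view(conn, "mkt_cmp_f", "marketing_campaigns", BUSINESS_NAMES)

cursor = conn.execute("SELECT * FROM marketing_campaigns")
view_columns = [description[0] for description in cursor.description]
view_rows = cursor.fetchall()
print(view_columns)
print(view_rows)
```

Reporting tools and non-technical users would then query `marketing_campaigns` and never need to know the physical names.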

(v) Manage

The last step involves management and monitoring, which includes:

• Controlling ongoing user access.
• Optimizing and refining the target system for improved performance.
• Adding and managing new data in the repository.
• Configuring recovery settings and ensuring system availability in the
event of failure.

2.10 LIMITATIONS WITH DATA MARTS

Prospective builders of data warehouses are frequently advised to “start
small” with a data mart and use that kernel to expand gradually into a
full-blown data warehouse. This approach to warehousing generally leads to
failed projects for several reasons.

Sometimes the new data mart is so successful that the configuration is overrun
by user demands. The databases grow too large too fast, response times
become unacceptably long, and user frustration leads to searching for other
ways to get the answers.

The more common reason for failure is that the data mart is immediately
unsuccessful because it is designed in such a way that users are unable to
retrieve the sort of information they want and need to extract from the data.
Databases are highly denormalized to respond to a small set of canned queries;
summaries, rather than detail data, comprise the database so that fine-grained
exploratory data analysis is not possible; and support for ad hoc queries is
either absent or so poor as to discourage users from bothering with them.

The very factors that frequently defeat data mart projects are also the most
commonly recommended approaches to designing data marts and data
warehouses in the popular data warehousing literature:

• Denormalization (dimensional modeling)
• Storing aggregates at the expense of detail data
• Skewing performance toward a small, preselected set of queries at the
expense of all other exploratory analyses

Check your Progress 1

1. Define data warehouse architecture.

…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………

2. What is the correct flow of the data warehouse architecture?

…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………

3. Mention some Data Mart Use Cases.

…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………

2.11 SUMMARY

Data warehouse architecture is the design and building blocks of the modern
data warehouse. In this unit we have studied the basic building blocks of
the data warehouse, data warehouse architecture, its types, architecture
models, data marts, the designing of data marts and their limitations.

In the next unit we will study Dimensional Modeling.

2.12 SOLUTIONS / ANSWERS

Check Your Progress 1:

1. Data warehouse architecture is the method of defining the overall
architecture of data communication, processing, and presentation that
exists for end users. Every data warehouse is different, but each is
characterized by the same standard vital components.

In simple words, a data warehouse is an information system that contains
cumulative and historical data from single or multiple sources. Data
warehousing concepts simplify the process of reporting and analyzing data
in organizations. There are different approaches to constructing a data
warehouse architecture; the approach used depends on the requirements of
the organization.

2. On every operational database, there is a fixed number of operations
that have to be applied, and there are well-defined techniques for
delivering suitable solutions. Data warehousing is found to be more
effective when the correct flow of the data warehouse architecture is
followed completely.

The four processes that contribute to a data warehouse are extracting and
loading the data, cleaning and transforming the data, backing up and
archiving the data, and carrying out query management by directing queries
to the appropriate data sources.

3. Data marts are used to solve specific organizational problems,
especially those that are unique to one department. Typical use cases
for a data mart include:

Focused Analytics
Analytics is perhaps the most common application of data marts. The
data in these repositories is entirely relevant to the requirements of the
business department, with no extraneous information, resulting in faster
and more accurate analysis. For example, financial analysts will find it
easier to work with a financial data mart, rather than working with an
entire data warehouse.
Fast Turnaround
Data marts are generally faster to develop than a data warehouse, as the
developers are working with fewer sources and a limited schema. Data
marts are ideal for data projects operating under challenging time
constraints.
Permission Management
Data marts can be a low-risk way to grant limited data access without
exposing the entire data warehouse. For example, a dependent data mart
contains a segment of warehouse data, and users are only able to view
the contents of the mart. This prevents unauthorized access and
accidental writes.
Better Resource Management
Data marts are sometimes used where there is a disparity in resource
usage between different departments. For example, the logistics
department might perform a high volume of daily database actions,
which causes the marketing team’s analytics tools to run slowly. By
providing each department with its own data mart, it’s easier to allocate
resources according to their needs.


2.13 FURTHER READINGS

1. William H. Inmon, Building the Data Warehouse, Wiley, 4th Edition,
2005.
2. Paulraj Ponniah, Data Warehousing Fundamentals, Wiley Student Edition,
2001.
3. Reema Thareja, Data Warehousing, Oxford University Press, 2011.

UNIT 3 DIMENSIONAL MODELING
Structure

3.0 Introduction
3.1 Objectives
3.2 Dimensional Modeling
3.2.1 Strengths of Dimensional Modeling
3.3 Identifying Facts and Dimensions
3.4 Star Schema
3.4.1 Features of Star Schema
3.5 Advantages and Disadvantages of Star Schema
3.6 Snowflake Schema
3.6.1 Features of Snowflake Schema
3.7 Advantages and Disadvantages of Snowflake Schema
3.7.1 Star Schema Vs Snowflake Schema
3.8 Fact Constellation Schema
3.8.1 Advantages and Disadvantages of Fact Constellation Schema
3.9 Aggregate Tables
3.10 Need for Building Aggregate Fact Tables
Limitations of Aggregate Fact Tables
3.11 Aggregate Fact Tables and Derived Dimension Tables
3.12 Summary
3.13 Solutions/Answers
3.14 Further Readings

3.0 INTRODUCTION

In the previous unit, we studied Data Warehouse Architecture and Data Marts.
In this unit, let us focus on the modeling aspects. We will go through
dimensional modeling, the star schema, the snowflake schema, aggregate
tables and the fact constellation schema.

3.1 OBJECTIVES
After going through this unit, you shall be able to:
• understand the purpose of dimensional modeling;
• identify the measures, facts, and dimensions;
• discuss the fact and dimension tables and their pros and cons;
• discuss the Star and Snowflake schema;
• explore comparative analysis of star and snowflake schema;
• describe aggregate facts and fact constellation; and
• discuss various examples of star and snowflake schema.

3.2 DIMENSIONAL MODELING

Dimensional modeling is a data model design technique adopted when building
a data warehouse. Put simply, dimensional modeling reduces the response time
of queries compared with the normalized models used in relational
(operational) systems. It is primarily concerned with conceptual design. Let
us first look at what dimensional modeling is and how it differs from
traditional data model design. A data model is a representation of how data
is stored in a database; it is usually a diagram of the tables and the
relationships that exist between them. A dimensional model is designed to
read, summarize and compute numeric data from a data warehouse.

A data warehouse is an example of a system that requires a small number of
large tables, because many users use it to read a lot of data. A
characteristic of a data warehouse is that data is written once and read
many times over, so the read operation is dominant. For example, consider a
data warehouse that holds customer-related information in a single table:
this makes it much easier for an analyst to count the number of customers by
country, because using fewer tables simplifies query processing. The main
objective of dimensional modeling is to provide an easy architecture for the
end user to write queries and to reduce the number of relationships between
the tables and dimensions, hence providing efficient query handling.

Dimensional modeling represents data logically as a cube for OLAP data
management. The concept was developed by Ralph Kimball. It has “fact” and
“dimension” as its two key concepts. Each transaction record is divided
into “facts”, which consist of numerical business transaction data, and
“dimensions”, which are the reference information that gives context to the
facts. Facts and dimensions are explained in more detail in the subsequent
sections.


The following are the steps in dimensional modeling, as shown in Figure 1.

1. Identify Business Process
2. Identify Grain (level of detail)
3. Identify dimensions and attributes
4. Build Schema

The model should describe the Why, How much, When/Where/Who and What of your
business process.


Figure 1: Steps in Dimension Modeling

Step 1: Identify the Business Objectives


Selecting the right business process for which to build a data warehouse and
identifying the business objectives is the first step in dimensional
modeling. This step is very important; getting it wrong can lead to repeated
work and software defects.

Step 2: Identifying Granularity


The grain is the level of detail at which the business problem is recorded.
Identifying it means decomposing a large and complex problem into its
lowest level of information. For example, if the data is recorded
month-wise, the table would contain one set of details for each month of
the year. The appropriate grain depends on the reports to be submitted to
the management, and it affects the size of the data warehouse.
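
The effect of the grain on table size can be illustrated with a short sketch. The daily rows and the helper `roll_up_to_month` below are hypothetical; they simply show the same sales recorded at a daily grain and then rolled up to a monthly grain.

```python
from collections import defaultdict

# Hypothetical fact rows at daily grain: (date, product, amount).
daily_rows = [
    ("2023-01-05", "pen", 10.0),
    ("2023-01-20", "pen", 5.0),
    ("2023-02-02", "pen", 7.0),
]

def roll_up_to_month(rows):
    """Coarsen the grain from one row per day to one row per month."""
    totals = defaultdict(float)
    for date, product, amount in rows:
        month = date[:7]  # keep only 'YYYY-MM'
        totals[(month, product)] += amount
    return sorted((month, product, total)
                  for (month, product), total in totals.items())

monthly_rows = roll_up_to_month(daily_rows)
print(monthly_rows)
```

The monthly table is smaller, but the daily detail can no longer be recovered from it, which is why the grain should be chosen with the required reports in mind.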

Step 3: Identifying Dimensions and attributes


The dimensions of the data warehouse correspond to the entities of the
database, such as items, products, date, stocks, time, etc. The primary
keys and foreign keys are also identified and specified in this step.

Step 4: Build the Schema


The schema is determined by the database structure, that is, the arrangement
of tables and their columns. Popular schemas include the star, snowflake
and fact constellation schemas. To summarize, every step from selecting the
business process to identifying the finest level of detail of the business
transactions, together with the significant dimensions and attributes,
helps to build the schema.

3.2.1 Strengths of Dimensional Modeling

Following are some of the strengths of Dimensional Modeling:

• It provides a simple architecture or schema that is easy for the various
stakeholders, from warehouse designers to business clients, to understand
and work with.
• It reduces the number of relationships between different data elements.
• It promotes data quality by enforcing foreign key constraints as a form
of referential integrity check on a data warehouse. Dimensional modeling
thus helps database administrators to maintain the reliability of the
data.
• The aggregate functions used in the schemas optimize the performance of
the queries submitted by users. As the size of a data warehouse keeps
increasing, optimization becomes a concern, which dimensional modeling
makes easier to address.

3.3 IDENTIFYING FACTS AND DIMENSIONS


We have studied the steps of dimension modeling in the previous section. The
last step narrated is to build the schema. So, let’s see the elementary measures
to build a schema.
Facts and Fact table: A fact is an event. It is a measure that represents
business items or transactions of items, together with their association
and context data. The fact table contains the primary keys of all the
dimension tables used in the business process, which act as foreign keys in
the fact table. It also holds the measures, on which aggregate functions are
applied to evaluate the business process for some entity. A measure is a
numeric attribute of a fact, representing the performance or behavior of the
business relative to the dimensions. A fact table has fewer columns than a
dimension table and is in a more normalized form.
Dimensions and Dimension table: A dimension table is a collection of data
that describes one business dimension. Dimensions determine the contextual
background for the facts, and they are the framework over which OLAP is
performed. The dimension table stores fields that describe the facts. The
data in the table is in denormalized form, so it contains a large number of
columns compared with the fact table. The attributes in a dimension table
are used as row and column headings in a document or query results display.
Example: In a student registration case study, the fact table for
registration to a particular course can have attributes like student_id,
course_id, program_id, date_of_registration and fee_id. The course
dimension can hold the course name, duration of the course, etc. The
student dimension can contain personal details about the student such as
name, address, contact details, etc.

Student Registration

Fact Table (student_id, course_id, program_id, date_of_registration, fee_id)
Measure: Sum(Fee_amount)
Dimension Tables (Student_details, Course_details, Program_details,
Fee_details, Date)
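
The student registration example can be tried out with a small sketch using an in-memory SQLite database. Only two of the dimension tables are built here, and the column `fee_amount` is assumed to live in the Fee_details dimension (the text above names the measure but not where it is stored); the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (only two of the five are sketched here).
CREATE TABLE course_details (course_id INTEGER PRIMARY KEY, course_name TEXT);
CREATE TABLE fee_details (fee_id INTEGER PRIMARY KEY, fee_amount REAL);

-- Fact table: foreign keys pointing at the dimensions.
CREATE TABLE student_registration (
    student_id INTEGER,
    course_id INTEGER REFERENCES course_details(course_id),
    program_id INTEGER,
    date_of_registration TEXT,
    fee_id INTEGER REFERENCES fee_details(fee_id));

INSERT INTO course_details VALUES (1, 'Data Warehousing'), (2, 'Data Mining');
INSERT INTO fee_details VALUES (1, 2000.0), (2, 1500.0);
INSERT INTO student_registration VALUES
    (101, 1, 7, '2023-01-10', 1),
    (102, 1, 7, '2023-01-11', 1),
    (103, 2, 7, '2023-01-12', 2);
""")

# Measure: Sum(Fee_amount), analyzed by the course dimension.
fee_by_course = conn.execute("""
SELECT c.course_name, SUM(f.fee_amount)
FROM student_registration r
JOIN course_details c ON c.course_id = r.course_id
JOIN fee_details f ON f.fee_id = r.fee_id
GROUP BY c.course_name
ORDER BY c.course_name
""").fetchall()
print(fee_by_course)
```

The same fact table could be grouped by program or by registration date simply by joining a different dimension.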

3.4 STAR SCHEMA

There are two popular basic models used for dimensional modeling:
• Star Model
• Snowflake Model
Star Model: It represents the multidimensional model. In this model the data
is organized into facts and dimensions. The star model is the underlying
structure for a dimensional model. It has one broad central table (the fact
table) and a set of smaller tables (the dimensions) arranged in a star
design. This design is shown logically in Figure 2 below.

Figure 2 : Star Schema

3.4.1 Features of Star Schema

• The data is held in a denormalized database.
• It provides quick query response.
• A star schema is flexible; it can be changed or extended easily.
• It reduces the complexity of metadata for developers and end users.

3.5 ADVANTAGES AND DISADVANTAGES OF STAR SCHEMA
3.5.1 Advantages of Star Schema
Star schemas are easy for end users and applications to understand and
navigate. With a well-designed schema, users can quickly analyze large,
multidimensional data sets. The main advantages of star schemas in a decision-
support environment are:


• Query performance
Because a star schema database has a small number of tables and clear join
paths, queries run faster than they do against an OLTP system. Small single-
table queries, usually of dimension tables, are almost instantaneous. Large join
queries that involve multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are linked only through the
central fact table. When two dimension tables are used in a query, only one
join path, intersecting the fact table, exists between those two tables. This
design feature enforces accurate and consistent query results.
• Load performance and administration
Structural simplicity also reduces the time required to load large batches of
data into a star schema database. By defining facts and dimensions and
separating them into different tables, the impact of a load operation is reduced.
Dimension tables can be populated once and occasionally refreshed. You can
add new facts regularly and selectively by appending records to a fact table.
• Built-in referential integrity
A star schema has referential integrity built in when data is loaded. Referential
integrity is enforced because each record in a dimension table has a unique
primary key, and all keys in the fact tables are legitimate foreign keys drawn
from the dimension tables. A record in the fact table that is not related
correctly to a dimension cannot be given the correct key value to be retrieved.
• Easily understood
A star schema is easy to understand and navigate, with dimensions joined only
through the fact table. These joins are more significant to the end user, because
they represent the fundamental relationship between parts of the underlying
business. Users can also browse dimension table attributes before constructing
a query.
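
The built-in referential integrity described above can be demonstrated with a small sketch. The tables `dim_item` and `fact_sales` and the helper `try_insert` are hypothetical; note also that SQLite, used here as a stand-in, enforces foreign keys only when the `foreign_keys` pragma is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

conn.execute(
    "CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT)"
)
conn.execute(
    "CREATE TABLE fact_sales ("
    " item_key INTEGER NOT NULL REFERENCES dim_item(item_key),"
    " units_sold INTEGER)"
)
conn.execute("INSERT INTO dim_item VALUES (1, 'Notebook')")

def try_insert(item_key, units_sold):
    """Return True if the fact row is accepted, False if the key is invalid."""
    try:
        conn.execute("INSERT INTO fact_sales VALUES (?, ?)",
                     (item_key, units_sold))
        return True
    except sqlite3.IntegrityError:
        return False

valid_ok = try_insert(1, 3)    # key exists in the dimension table
orphan_ok = try_insert(99, 1)  # no such item_key in dim_item
print(valid_ok, orphan_ok)
```

The fact row that points at a non-existent dimension key is rejected, so every stored fact can always be joined back to its dimension.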
3.5.2 Disadvantages of Star Schema
As mentioned before, improving read queries and analysis in a star schema
could involve certain challenges:
• Decreased data integrity: Because of the denormalized data structure,
star schemas do not enforce data integrity very well. Although star
schemas use countermeasures to prevent anomalies from developing, a
simple insert or update command can still cause data incongruities.
• Less capable of handling diverse and complex queries: Database
designers build and optimize star schemas for specific analytical needs.
As denormalized data sets, they work best with a relatively narrow set
of simple queries. Comparatively, a normalized schema permits a far
wider variety of more complex analytical queries.
• No many-to-many relationships: Because they offer a simple
dimension schema, star schemas don’t work well for many-to-many
data relationships.
Example 1: Suppose a star schema is composed of a Sales fact table as shown
in Figure 3a and several dimension tables connected to it for Time, Branch,
Item and Location.
Fact Table
Sales is the Fact table.
Dimension Tables
The Time table has a column for each of day, month, quarter, year, etc.
The Item table has columns for each item_key, item_name, brand, type and
supplier_type.
The Branch table has columns for each branch_key, branch_name and
branch_type.
The Location table has columns of geographic data, including street, city,
state, and country.
Unit_Sold and Dollars_Sold are the measures stored in the Sales fact table.

Figure 3a: Example of Star Schema
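The schema of Example 1 can be sketched in code. The following is a minimal illustration using an in-memory SQLite database. Table and column names follow the example where the text gives them; the keys time_key and location_key, the column types, and all sample rows are assumptions made only for this sketch.

```python
import sqlite3

# Names follow Example 1 where given; time_key, location_key, the column
# types and every sample row are assumptions made for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                         branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
                           city TEXT, state TEXT, country TEXT);
-- The fact table holds one foreign key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    unit_sold    INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 15, 1, 'Q1', 2023);
INSERT INTO item_dim VALUES (1, 'Laptop', 'BrandA', 'Electronics', 'Wholesale'),
                            (2, 'Phone', 'BrandB', 'Electronics', 'Retail');
INSERT INTO branch_dim VALUES (1, 'Main', 'Store');
INSERT INTO location_dim VALUES (1, '1 High St', 'Delhi', 'Delhi', 'India'),
                                (2, '2 Park Rd', 'Mumbai', 'Maharashtra', 'India');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 2, 2000.0),
                              (1, 2, 1, 1, 3, 1500.0),
                              (1, 1, 1, 2, 1, 1000.0);
""")

# Two dimensions meet only through the fact table, so each needs one join.
sales_by_city_and_brand = conn.execute("""
    SELECT l.city, i.brand, SUM(f.unit_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN item_dim AS i ON i.item_key = f.item_key
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.city, i.brand
    ORDER BY l.city, i.brand
""").fetchall()
```

The query joins each dimension directly to the fact table, which is the single join path between dimensions that a star schema provides.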

Example 2:
The star schema works by dividing data into measurements and the “who,
what, where, when, why, and how” descriptive context. Broadly, these two
groups are facts and dimensions.
By doing this, the star schema methodology allows the business user to
restructure their transactional database into smaller tables that are easier to fit
together. Fact tables are then linked to their associated dimension tables with
primary or foreign key relationships. An example of this would be a quick
grocery store purchase. The amount you spent and how many items you bought
would be considered a fact, but what you bought, when you bought it and the
specific grocery store’s location would all be considered dimensions.
Once these two groups have been established, we can connect them by the
unique transaction number associated with your specific purchase. An
important note is that each fact, or measurement, will be associated with
multiple dimensions. This is what forms the star shape: the fact in the center,
and the dimensions radiating out around it. Dimensions relating to the grocery
store, the products you bought, and descriptions about you as their customer
are each separated into their own table with their own attributes.
This example is modeled as shown below and star schema for this is depicted
in Figure 3b.
Fact Table
Sales is the Fact Table.
Dimension Tables
The Store table consists of columns like store_id, store_address, city, region,
state and country.
Customer table has columns for each product_id, product_time and
product_type.
Sales_Type includes sales_type_id and type_name columns.
Product table consists of product_id, product_name and product_type.
Time table consists of columns like time_id, action_date, action_week,
action_month, action_year and action_weekday.
The measures may be the amount spent and the number of items bought.

Figure 3b: Example of Star Schema



Check Your Progress 1

1) Discuss the characteristics of a star schema.


……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………

2) Draw a Star Schema for a marketing employee living in New York City, USA,
who buys products and wants to compute the total number of products sold and
the total value of sales.
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………

3.6 SNOWFLAKE SCHEMA


The other popular modeling technique is the Snowflake Schema. You can
think of the flakes as the chocolate flakes sprinkled on pastries and ice creams,
which add extra taste. Similarly, the snowflake schema is an extension of the
star schema that adds more dimension tables to give more meaning to the
logical view of the database. These additional tables are more normalized than
those of a star schema. The data is arranged so that the centralized fact table
relates to multiple dimension tables, which in turn relate to further dimension
tables. This can become more complex if the dimensions are more detailed and
at multiple levels. In the resulting hierarchy, a child table can have multiple
parent tables. Keep in mind that we extend, or flake, only the dimension tables,
not the fact tables.
Snowflake Model
The snowflake model is the result of decomposing one or more of the
dimensions. A Snowflake Schema in a data warehouse is a logical arrangement of
tables in a multidimensional database such that the ER diagram resembles a
snowflake shape. A Snowflake Schema is an extension of a Star Schema that
adds additional dimension tables. The dimension tables are normalized, which
splits their data into additional tables.
In the following Snowflake Schema example, Country is further normalized
into an individual table.
3.6.1 Features of Snowflake Schema
Following are the important features of snowflake schema:
1. It has normalized tables.
2. It occupies less disk space.
3. It requires more lookup time, as many tables are interconnected through
the extended dimensions.
Example
The figure below shows a snowflake schema for a case study of customers,
sales, products and locations, in which the quantity sold and the number of
items sold are calculated location-wise. The fact table stores the keys of the
Customer, Product, Date and Store dimensions; the primary key of each
dimension table acts as a foreign key in the fact table. You will observe that
two aggregate functions can be applied to calculate the quantity sold and the
amount sold. Further, some dimensions are extended: Customer to the type of
customer, and Store to territory-wise information. Note that Date has been
expanded into date, month and year. This schema gives you more opportunity
to query the data in detail.

Figure 4: Snowflake Schema
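The effect of extending a dimension can be sketched as follows. This is a minimal illustration, assuming hypothetical table names (territory_dim, store_dim, sales_fact) and sample rows that are not taken from Figure 4; it shows only the Store dimension being normalized into a separate Territory table.

```python
import sqlite3

# Hypothetical names and rows; only the Store -> Territory extension is shown.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Territory is split out of the Store dimension (the "flake").
CREATE TABLE territory_dim (territory_id INTEGER PRIMARY KEY, territory_name TEXT);
CREATE TABLE store_dim (
    store_id     INTEGER PRIMARY KEY,
    store_name   TEXT,
    territory_id INTEGER REFERENCES territory_dim(territory_id)
);
-- The fact table still references only the first-level dimension.
CREATE TABLE sales_fact (
    store_id      INTEGER REFERENCES store_dim(store_id),
    quantity_sold INTEGER,
    amount_sold   REAL
);
INSERT INTO territory_dim VALUES (1, 'North'), (2, 'South');
INSERT INTO store_dim VALUES (1, 'S1', 1), (2, 'S2', 1), (3, 'S3', 2);
INSERT INTO sales_fact VALUES (1, 2, 200.0), (2, 5, 450.0), (3, 1, 100.0);
""")

# Reaching the territory name now takes two joins instead of one.
sales_by_territory = conn.execute("""
    SELECT t.territory_name, SUM(f.quantity_sold), SUM(f.amount_sold)
    FROM sales_fact AS f
    JOIN store_dim AS s ON s.store_id = f.store_id
    JOIN territory_dim AS t ON t.territory_id = s.territory_id
    GROUP BY t.territory_name
    ORDER BY t.territory_name
""").fetchall()
```

Each territory name is stored once rather than repeated on every store row, which saves space, while the extra join is the additional lookup cost mentioned in the features above.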

3.7 ADVANTAGES AND DISADVANTAGES OF SNOWFLAKE SCHEMA
Following are the advantages of Snowflake schema:
• A Snowflake schema occupies a much smaller amount of disk space
compared to the Star schema. Less disk space means more convenience
and less hassle.
• A Snowflake schema offers some protection from various data integrity
issues. Many designers prefer the Snowflake schema because of how safe it
is.
• Data is easy to maintain and more structured.
• Data quality is better than star schema.

Disadvantages of Snowflake Schema

• Complex data schemas: As you might imagine, snowflake schemas
create many levels of complexity while normalizing the attributes of a
star schema. This complexity results in more complicated source query
joins. While offering a more efficient way to store data, a snowflake
schema can suffer performance declines when queries traverse these
complex joins. Still, advances in processing technology have improved
snowflake schema query performance in recent years, which is one of
the reasons why snowflake schemas are rising in popularity.
• Slower at processing cube data: In a snowflake schema, the complex
joins result in slower cube data processing. The star schema is
generally better for cube data processing.
• Lower data integrity levels: While snowflake schemas offer greater
normalization and fewer risks of data corruption after performing
UPDATE and INSERT commands, they do not provide the level of
transactional assurance that comes with a traditional, highly
normalized database structure. Therefore, when loading data into a
snowflake schema, it's vital to be careful and double-check the quality
of information post-loading.

3.7.1 Star Schema Vs Snowflake Schema

Feature: Normalized dimension tables
Star Schema: The dimension tables in a star schema are not normalized, so
they may contain redundancies.
Snowflake Schema: This schema has normalized dimension tables.

Feature: Queries
Star Schema: The execution of queries is relatively faster, as fewer joins are
needed to form a query.
Snowflake Schema: The execution of complex queries is slower than in a star
schema, as many joins and foreign key relations are needed to form a query.
Thus performance is affected.

Feature: Performance
Star Schema: The star schema model has faster execution and response time.
Snowflake Schema: It has slower performance compared to the star schema.

Feature: Storage space
Star Schema: This type of schema requires more storage space than the
snowflake schema because of its unnormalized tables.
Snowflake Schema: Snowflake schema tables are easy to maintain and save
storage space because they are normalized.

Feature: Usage
Star Schema: The star schema is preferred when the dimension tables have
fewer rows.
Snowflake Schema: If the dimension tables contain a large number of rows,
the snowflake schema is preferred.

Feature: Type of DW
Star Schema: This schema is suitable for 1:1 or 1:many relationships, such as
in data marts.
Snowflake Schema: It is used for complex relationships, such as many:many,
in enterprise data warehouses.

Feature: Dimension tables
Star Schema: The star schema has a single table for each dimension.
Snowflake Schema: The snowflake schema may have more than one dimension
table for each dimension.

3.8 FACT CONSTELLATION SCHEMA
There is another schema for representing a multidimensional model. The term
fact constellation suggests a galaxy containing several stars. It is a
collection of fact schemas having one or more dimension tables in common, as
shown in the figure below. This logical representation is mainly used in
designing complex database systems.

Figure 7: Fact Constellation Schema

In the above figure, it can be observed that there are two fact tables, and that
the two dimension tables in the pink boxes are the common dimension tables
connecting both the star schemas.
For example, suppose we are designing a fact constellation schema for
university students. The problem specifies the following fact tables:

Fact tables

Placement (Stud_roll, Company_id, TPO_id): needed to calculate the number of
students eligible and the number of students placed.
Workshop (Stud_roll, Institute_id, TPO_id): needed to find the number of
students selected and the number of students who attended the workshop.

So, there are two fact tables, namely Placement and Workshop, which are part
of two different star schemas having:
i) dimension tables Company, Student and TPO in the star schema with fact
table Placement, and
ii) dimension tables Training Institute, Student and TPO in the star schema with
fact table Workshop.

Both star schemas have two dimension tables in common (Student and TPO),
hence forming a fact constellation or galaxy schema.


Figure 7: Fact Constellation
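The university example can be sketched as two fact tables sharing dimension tables. This is a minimal illustration in an in-memory SQLite database. The key columns follow the text (Stud_roll, Company_id, Institute_id, TPO_id); the flag columns eligible, placed, selected and attended, the tpo_name column, and all sample rows are assumptions made for the sketch.

```python
import sqlite3

# Key columns follow the text; the 0/1 flag measures, tpo_name and all
# sample rows are assumptions made for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimensions
CREATE TABLE student_dim (stud_roll INTEGER PRIMARY KEY);
CREATE TABLE tpo_dim (tpo_id INTEGER PRIMARY KEY, tpo_name TEXT);
-- Dimensions private to one star each
CREATE TABLE company_dim (company_id INTEGER PRIMARY KEY);
CREATE TABLE institute_dim (institute_id INTEGER PRIMARY KEY);
-- Two fact tables, both pointing at student_dim and tpo_dim
CREATE TABLE placement_fact (
    stud_roll  INTEGER REFERENCES student_dim(stud_roll),
    company_id INTEGER REFERENCES company_dim(company_id),
    tpo_id     INTEGER REFERENCES tpo_dim(tpo_id),
    eligible   INTEGER,
    placed     INTEGER
);
CREATE TABLE workshop_fact (
    stud_roll    INTEGER REFERENCES student_dim(stud_roll),
    institute_id INTEGER REFERENCES institute_dim(institute_id),
    tpo_id       INTEGER REFERENCES tpo_dim(tpo_id),
    selected     INTEGER,
    attended     INTEGER
);
INSERT INTO student_dim VALUES (101), (102), (103);
INSERT INTO tpo_dim VALUES (1, 'TPO-A'), (2, 'TPO-B');
INSERT INTO company_dim VALUES (1);
INSERT INTO institute_dim VALUES (1);
INSERT INTO placement_fact VALUES (101, 1, 1, 1, 1), (102, 1, 1, 1, 0),
                                  (103, 1, 2, 0, 0);
INSERT INTO workshop_fact VALUES (101, 1, 1, 1, 1), (102, 1, 1, 1, 1),
                                 (103, 1, 2, 1, 0);
""")

# Each fact table is summarized on its own, then lined up on the shared
# TPO dimension; joining the two fact tables directly would multiply rows.
per_tpo = conn.execute("""
    SELECT t.tpo_name,
           (SELECT COALESCE(SUM(p.placed), 0)
              FROM placement_fact AS p WHERE p.tpo_id = t.tpo_id),
           (SELECT COALESCE(SUM(w.attended), 0)
              FROM workshop_fact AS w WHERE w.tpo_id = t.tpo_id)
    FROM tpo_dim AS t
    ORDER BY t.tpo_id
""").fetchall()
```

The shared TPO dimension lets measures from both stars appear side by side in one report, which is the flexibility a fact constellation offers.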

3.8.1 Advantages and Disadvantages of Fact Constellation Schema

Advantage
This schema is more flexible and gives a wider perspective on the data
warehouse system.
Disadvantage
Because this schema connects two or more fact tables to form a constellation,
the structure is complex to implement and maintain.

Check Your Progress 2

1. Compare and contrast the Star Schema with the Snowflake Schema.

……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………

2. Suppose that a data warehouse consists of the dimensions time, doctor, ward and
patient, and the two measures count and charge, where charge is the fee that a
doctor charges a patient for a visit. Enumerate three classes of schemas that are
popularly used for modeling data warehouses.
a) Draw a Star Schema diagram.
b) Draw a Snowflake Schema diagram.
……………………………………………………………………………..…
………………………………………………………………………..………
……………………………………………………………………………….

3.9 AGGREGATE TABLES

In a data warehouse, the data is stored in a multidimensional cube. In the
information technology industry, various tools are available to process
the queries posted to the data warehouse engine. These are called
business intelligence (BI) tools, and they help to answer complex
queries and to take decisions. The word aggregate here has much the same
meaning as aggregation over relational tables, which you must be
familiar with. Aggregate fact tables roll up the basic fact tables of the schema
to improve query processing. BI tools transparently select the level
of aggregation that improves query performance. Aggregate fact tables
contain foreign keys referring to dimension tables.

Points to note about Aggregate tables:

1) They are also called summary tables.
2) They contain pre-computed results of queries on the data warehouse schema.
3) They reduce the dimensionality of the base fact tables.
4) They can be used to respond to queries on the dimensions that are
saved.

Figure 5: Aggregate Tables

3.10 NEED FOR BUILDING AGGREGATE FACT TABLES

Let us understand the need for building aggregate tables. Aggregate tables are
also referred to as pre-computed tables holding partially summarized data.

• Put simply, it is about speed, or quick response to queries.
You can think of an aggregate table as an intermediate table which stores the
results of queries on disk. It is built using aggregate functions.

For example, suppose ABC Corporation Limited takes orders online and has
millions of customer transactions placing orders. The dimension tables for the
company could be Customer, Product and Order_date. The fact table, say
Fact_Orders, maintains all the orders placed. To generate a report of monthly
orders by product type and by a particular region, the company needs
aggregates, which are summary tables that can be obtained with a GROUP BY
SQL query.

• An aggregate table occupies less space than the atomic fact table, and a
query on it can take roughly half the time of the same query on the atomic data.

• One of the more popular uses of aggregates is to adjust the granularity of a
dimension. When the granularity of a dimension is changed, the fact table
must be partially summarized to match the current grain of the new
dimension, resulting in the creation of new dimensional and fact tables that
fit this new grain standard.

• The Roll-up OLAP operation of the base fact tables generates aggregate
tables. Hence the query performance increases as it reduces the number of
rows to be accessed for the retrieval of data of a query.
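The ABC Corporation report described above can be sketched as a roll-up of the order fact table. This is a minimal illustration in an in-memory SQLite database. Fact_Orders, Customer and Product come from the example; the column names, the aggregate table name agg_orders_month, the sample rows, and storing the order date directly on the fact row instead of in a separate Order_date dimension are simplifying assumptions.

```python
import sqlite3

# Fact_Orders, Customer and Product come from the example; the columns,
# agg_orders_month and all rows are assumptions made for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_type TEXT);
CREATE TABLE fact_orders (
    customer_id INTEGER REFERENCES customer_dim(customer_id),
    product_id  INTEGER REFERENCES product_dim(product_id),
    order_date  TEXT,   -- ISO date kept on the fact row for brevity
    amount      REAL
);
INSERT INTO customer_dim VALUES (1, 'North'), (2, 'South');
INSERT INTO product_dim VALUES (1, 'Book'), (2, 'Toy');
INSERT INTO fact_orders VALUES
    (1, 1, '2023-01-05', 10.0), (1, 1, '2023-01-20', 15.0),
    (2, 1, '2023-01-11', 20.0), (1, 2, '2023-02-02', 30.0),
    (2, 2, '2023-02-15', 5.0),  (2, 2, '2023-02-28', 7.0);

-- Roll the atomic orders up to month, product type and region, and
-- store the result so reports need not rescan the detailed fact table.
CREATE TABLE agg_orders_month AS
SELECT strftime('%Y-%m', f.order_date) AS month,
       p.product_type,
       c.region,
       COUNT(*)      AS order_count,
       SUM(f.amount) AS total_amount
FROM fact_orders AS f
JOIN product_dim AS p ON p.product_id = f.product_id
JOIN customer_dim AS c ON c.customer_id = f.customer_id
GROUP BY strftime('%Y-%m', f.order_date), p.product_type, c.region;
""")

agg_rows = conn.execute(
    "SELECT * FROM agg_orders_month ORDER BY month, product_type, region"
).fetchall()
fact_count = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
```

In this small sample, six atomic order rows are summarized into four aggregate rows; with millions of orders the reduction in rows scanned per report would be far larger.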

3.11 AGGREGATE FACT TABLE AND DERIVED DIMENSION TABLES

Aggregate facts are produced by calculating measures from more atomic fact
tables. These tables hold the results of SQL aggregate functions such as
AVG, MIN, MAX and COUNT, usually computed together with a GROUP BY
clause. The aggregate fact tables therefore provide summary
statistics. Whenever speedy query handling is required, aggregate fact
tables are a good option.

• Basically, aggregates allow you to store the intermediate results of, or
pre-calculate, the subqueries or queries fired on a data warehouse by
summing data up to higher levels and storing them in a separate star.

• You can think of an aggregate fact table as a conformed copy of the
fact table, as it should give you the same query result as the
detailed fact table.

• Aggregate fact tables can be used in the case of large datasets or
when there is a large number of queries. They reduce the response time of
the queries fired by users or customers. They are very useful in business
intelligence application tools.

The levels at which facts are stored become especially important when you
begin to have complex queries with multiple facts in multiple tables that are
stored at levels different from one another, and when a reporting request
involves still a different level. You must be able to support fact reporting at the
business levels which users require. There is nothing wrong with enhancing an
aggregate with new facts or deriving new dimensions. For measures, the only
issue is whether the new measures are atomic in the context of the aggregate fact. If,
however, the new measures are received at a lower grain, you would be better
off creating a new atomic fact for those measures prior to incorporating
summarized measures into the aggregate. This would allow the new measures
to be used for other purposes without having to go back to the source.
Let's say we have a fact table, FactBillReciept, that holds monthly transactions.
There can be different types of transaction receipts during a month for each
supplier. This huge volume of data would result in a lot of calculations. So, we
build another aggregate table which is derived from the base table.

FactBillMonthReceipt: It contains aggregated receipts per month, per supplier.

The problem is that it has additional foreign keys, such as supplier_status for
the month. To solve this, we have the concept of derived tables, which contain
additional measures and foreign keys that are not present in the base fact table.
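The idea can be sketched in a few lines of plain Python. This is only an illustration: the row layout, the sample values, the function name build_month_receipt, and in particular the rule that derives supplier_status from a monthly total are hypothetical, since the unit does not specify how the status is determined.

```python
from collections import defaultdict

# Hypothetical base fact rows: (supplier_id, month, receipt_type, amount).
fact_bill_receipt = [
    ("S1", "2023-01", "cash", 100.0),
    ("S1", "2023-01", "credit", 250.0),
    ("S2", "2023-01", "cash", 40.0),
    ("S1", "2023-02", "cash", 75.0),
]


def build_month_receipt(rows, active_threshold=200.0):
    """Aggregate receipts per month and supplier and attach a derived status.

    The threshold rule for supplier_status is an assumption for illustration.
    """
    totals = defaultdict(float)
    for supplier_id, month, _receipt_type, amount in rows:
        totals[(month, supplier_id)] += amount

    derived = []
    for (month, supplier_id), total in sorted(totals.items()):
        derived.append({
            "month": month,
            "supplier_id": supplier_id,
            "total_amount": total,
            # Derived attribute that does not exist in the base fact table.
            "supplier_status": "ACTIVE" if total >= active_threshold else "LOW",
        })
    return derived


fact_bill_month_receipt = build_month_receipt(fact_bill_receipt)
```

The derived rows carry one total per month and supplier plus the extra supplier_status attribute, and the totals still add up to the sum of the base fact rows.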

Conformed Dimension
A conformed dimension is a dimension that is shared across multiple data
marts or subject areas. An organization may use the same dimension table across
different projects without making any changes to the dimension table.
Derived Tables
Derived tables are a significant addition to the Data Warehouse. They are used to
create second-level data marts for cross-functional analysis.
Consolidated Fact tables: A consolidated fact table holds data from different fact
tables, brought together in a schema with a common grain.

For example, to design a Sales department Data Warehouse schema, assume
there are the following entities with their respective grains.

Sales: Employee, date, and product.
Budget: Department, Financial Year, Quarter-wise
Product can have various attributes like product_size, product_category, etc.

One thing to notice here is that the product attributes keep changing as per
the requirements, but the product dimension remains the same. So, it is better to
keep Product as a separate dimension.

Let’s design the tables and their grains.

Aggregate Fact Table
Product: Product_Id, Category_Id, Supplier_Id, Timekey, Product_type,
Product_Description, Product_start_date, Quantity

Derived table
Product_Id, Product_Type, Product_Description, Unit Sales, Year, Quarter

Fact Table (Supplier)
Supplier_details: Supplier_Id, Product_Id, Store_Id, TimeKey

Figure 6: Aggregate Tables and Derived tables

Derived tables are very useful because they put less calculation load on the
Data Warehouse engine.

Check Your Progress 3


1. Discuss the limitations of Aggregate Fact tables.
…………………………………………………………………………………
…………………………………………………………………………………
………………………………………………………………………………..

3.12 SUMMARY
This unit presented the basic design of a data warehouse. The topics
focused on the various kinds of modeling and schemas. It explored the
grains, facts, and dimensions of the schemas. It is important to know about
dimensional modeling, as the appropriate modeling technique yields
correct responses to queries.
Dimensional modeling is a way of structuring data that optimizes the design of
a data warehouse for query retrieval operations. There are various schema
designs; this unit discussed the star, snowflake, and fact constellation schemas.
Schemas, from denormalized to normalized, use dimension, fact, derived and
aggregate fact tables. Every table has a purpose and contributes to efficient
design in terms of space and query handling. This unit discussed the pros
and cons of each kind of table, and a number of examples were used to explain
the design in different scenarios.

3.13 SOLUTIONS/ANSWERS
Check Your Progress 1:
1) Characteristics of Star Schema:

• Every dimension in a star schema is represented by only one dimension
table.
• Each dimension table contains a set of attributes.
• Each dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
• The fact table contains keys and measures.
• The star schema is easy to understand and keeps queries simple, although its
denormalized dimension tables use more disk space than a snowflake schema.
• The dimension tables are not normalized. For instance, in the above figure,
Country ID does not have a Country lookup table as an OLTP design would
have.
• The schema is widely supported by BI tools.

2)

Figure 8: Star Schema

Check Your Progress 2:


1:

Star Schema: It is a logical arrangement of one fact table surrounded by
dimension tables, like a star.
Snowflake Schema: It is a logical arrangement of one fact table with dimension
tables, where the dimension tables are further normalized into other dimension
tables.

Star Schema: It requires a single-join SQL command to fetch the data.
Snowflake Schema: It requires SQL commands with many joins to fetch the data.

Star Schema: The database design is simple and the query response time is
very low.
Snowflake Schema: The database design is complex and the query response
time is high.

Star Schema: The data is not normalized, so there is a high level of redundancy.
Snowflake Schema: The data is normalized, so there is a low level of redundancy.

2: a. Star Schema of Hospital Management

Fact Hospital: Patient_ID, Doctor_ID, Ward_ID, Time_Key, Bill_ID,
Calculate_billamt(), count_patients(), Count_Admission()
Dimension Doctor: Doctor_ID, Doctor_Name, Doctor_Contact,
DoctorAvail_status, Specialization
Dimension Patient: Patient_ID, Patient_name, Patient_Address,
Patient_Contact, Patient_Complain
Dimension Ward: Ward_ID, Ward_Name, Ward_Assistant, Admission_details
Dimension Time: Time_ID, Date
Dimension Bill: Bill_ID, Bill_Description, Amount, Time

Figure 9: Star Schema of Hospital Management System

b. Snowflake Schema of Hospital Management

Fact Hospital: Patient_ID, Doctor_ID, Ward_ID, Time_Key, Bill_ID,
Calculate_billamt(), count_patients(), Count_Admission()
Dimension Doctor: Doctor_ID, Doctor_Name, Address, Doctor_ContactNo,
DoctorAvail_status, Specialization
Dimension Patient: Patient_ID, Patient_name, Address, Patient_ContactNo,
Patient_Complain
Dimension Address: City, State, Country
Dimension Ward: Ward_ID, Ward_Name, Ward_Assistant, Admission_ID,
Patient_ID
Dimension_Ward_Assistant: Assistant_ID, Assistant_Name
Dimension Admission: Admission_ID, Type of Admission, Patient_ID, Details,
Time_ID
Dimension Bill: Bill_ID, Bill_Description, Amount, Time_ID, Patient_ID,
Doctor_ID
Dimension Time: Time_ID, Date, Time(HH:MM:SS)
Dimension Date: Date, Month, year

Figure 10: Snowflake Schema of Hospital Management System

Check Your Progress 3:

1.
Limitations of Aggregate fact tables: Building aggregate tables takes a lot of
time, because the rows of the base fact table must be scanned. There are also
more tables to manage. Computing and storing the aggregates can be costly.
The size of the aggregates is decided based on a greedy approach using a
hashing technique. If there are n dimensions in the table, then there can be
2^n possible aggregates. The load on the data warehouse becomes more complex.
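The count of 2^n possible aggregates can be checked with a short sketch. It assumes four hypothetical dimension names and simply enumerates every subset of dimensions that an aggregate could be grouped by, from the grand total (no dimensions) to the full base grain (all dimensions).

```python
from itertools import combinations


def possible_aggregates(dimensions):
    """Return every subset of dimensions that an aggregate could group by."""
    subsets = []
    for size in range(len(dimensions) + 1):
        subsets.extend(combinations(dimensions, size))
    return subsets


# Hypothetical dimension names used only for illustration.
dims = ["time", "item", "branch", "location"]
aggregates = possible_aggregates(dims)
aggregate_count = len(aggregates)  # 2 ** 4 subsets for n = 4 dimensions
```

Because the count doubles with every added dimension, it is rarely practical to build every aggregate, which is why a selection strategy is needed.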

3.14 FURTHER READINGS

• Building the Data Warehouse, William H. Inmon, Wiley, 4th
Edition, 2005.
• Data Warehousing Fundamentals, Paulraj Ponniah, Wiley
Student Edition.
• Data Warehousing, Reema Thareja, Oxford University Press.
• Data Warehousing, Data Mining & OLAP, Alex Berson and
Stephen J. Smith, Tata McGraw-Hill Edition, 2016.
