DW 1
Data Warehousing and Data Mining
Indira Gandhi National Open University
School of Computer and Information Sciences (SOCIS)
Many global corporations have turned to data warehousing to organize data that
streams in from corporate branches and operations centers around the world. It’s
essential for IT students to understand how data warehousing helps businesses
remain competitive in a quickly evolving global marketplace. Data warehousing is
an increasingly important business intelligence tool that enables historical
insights, ensures consistency, and allows organizations to make better business
decisions, decrease costs, maximize efficiency, increase the power and speed of
data analytics, gain a competitive edge, and increase sales to improve the
bottom line.
Block
1
DATA WAREHOUSE FUNDAMENTALS
AND ARCHITECTURE
UNIT 1
Fundamentals of Data Warehouse
UNIT 2
Data Warehouse Architecture
UNIT 3
Dimensional Modeling
PROGRAMME DESIGN COMMITTEE
Prof. (Retd.) S.K. Gupta, IIT, Delhi
Prof. T.V. Vijay Kumar, JNU, New Delhi
Prof. Ela Kumar, IGDTUW, Delhi
Prof. Gayatri Dhingra, GVMITM, Sonipat
Mr. Milind Mahajan, Impressico Business Solutions, New Delhi
Sh. Shashi Bhushan Sharma, Associate Professor, SOCIS, IGNOU
Sh. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. P. Venkata Suresh, Associate Professor, SOCIS, IGNOU
Dr. V.V. Subrahmanyam, Associate Professor, SOCIS, IGNOU
Sh. M.P. Mishra, Assistant Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU
SOCIS FACULTY
Prof. P. Venkata Suresh, Director, SOCIS, IGNOU
Prof. V.V. Subrahmanyam, SOCIS, IGNOU
Dr. Akshay Kumar, Associate Professor, SOCIS, IGNOU
Dr. Naveen Kumar, Associate Professor, SOCIS, IGNOU (on EOL)
Dr. M.P. Mishra, Associate Professor, SOCIS, IGNOU
Dr. Sudhansh Sharma, Assistant Professor, SOCIS, IGNOU
Dr. Manish Kumar, Assistant Professor, SOCIS, IGNOU
Print Production
Mr. Sanjay Aggarwal, Assistant Registrar (Publication), MPDD
July, 2022
Indira Gandhi National Open University, 2022
ISBN-
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without permission in writing from
the Indira Gandhi National Open University.
Further information on the Indira Gandhi National Open University courses may be obtained from the University’s office at Maidan Garhi, New
Delhi-110068.
Printed and published on behalf of the Indira Gandhi National Open University, New Delhi by MPDD, IGNOU.
BLOCK INTRODUCTION
The title of the Block is Data Warehouse Fundamentals and Architecture. The objectives of
this block are to help you understand the underlying concepts of Data Warehousing,
identify the components of the Data Warehouse Architecture, know the difference between
a Data Warehouse and Data Marts, understand the Data Warehouse Development Life
Cycle, and explain the dimensional modeling techniques.
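The block is organized into 3 units:
Unit 1: Fundamentals of Data Warehouse
Unit 2: Data Warehouse Architecture
Unit 3: Dimensional Modeling
Fundamentals of Data Warehouse
Structure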
1.0 Introduction
1.1 Objectives
1.2 Evolution of Data Warehouse
1.3 Data Warehouse and its Need
1.3.1 Need for Data Warehouse
1.3.2 Benefits of Data Warehouse
1.4 Data Warehouse Design Approaches
1.4.1 Top-Down Approach
1.4.2 Bottom-Up Approach
1.5 Characteristics of a Data Warehouse
1.5.1 How Data Warehouse Works?
1.6 OLTP and OLAP
1.6.1 Online Transaction Processing (OLTP)
1.6.2 Online Analytical Processing (OLAP)
1.7 Data Granularity
1.8 Metadata and Data Warehousing
1.9 Data Warehouse Applications
1.10 Types of Data Warehouses
1.10.1 Enterprise Data Warehouse
1.10.2 Operational Data Store
1.10.3 Data Mart
1.11 Popular Data Warehouse Platforms
1.12 Summary
1.13 Solutions/Answers
1.14 Further Readings
1.0 INTRODUCTION
The process of consolidating data and analyzing it to obtain insights has
been around for centuries, but only recently have we begun referring to it as data
warehousing. An operational or transactional system is designed only for its
own functionality, so it can handle a limited amount of data for a
limited amount of time. Operational systems are not designed or
architected for long-term data retention, as historical data is of little to no
importance to them. However, to gain point-in-time visibility and understand
the high-level operational aspects of a business, historical data plays a
vital role. With the emergence of mature Relational Database Management
Systems (RDBMS) in the 1980s, engineers across various enterprises started
architecting ways to copy data from the transactional systems to
separate databases, through manual or automated mechanisms, and use it for
reporting and analysis. Whereas data in the transactional systems was
purged periodically, data in these analytical repositories was retained, since
their purpose was to store as much data as possible. The term “data
warehouse” came into existence because these repositories became a
warehouse for the data.
Data Warehousing (DW) as a practice became prominent during the late 1980s,
when enterprises started building decision support systems that were
mainly responsible for supporting reporting. With the rapid advancement in
the performance of relational databases during the late 1990s and early 2000s,
Data Warehousing became a core part of the Information Technology group
across large enterprises. In fact, vendors such as Netezza and Teradata
started offering customized hardware to run data warehouse architectures
on purpose-built machines. Since the mid-2000s, Data Warehousing has been at
the top of the list of priorities. The data supply chain ecosystem has
grown exponentially, and so have the ways in which enterprises
architect their data warehouses.
This unit covers the basic features of data warehousing, its evolution,
characteristics, online transaction processing (OLTP), online analytical
processing (OLAP), popular platforms and applications of data warehouses.
1.1 OBJECTIVES
In 1992, Inmon published Building the Data Warehouse, one of the seminal
volumes of the industry. Later in the 1990s, Inmon developed the concept of
the Corporate Information Factory, an enterprise-level view of an
organization’s data in which Data Warehousing plays one part. Inmon’s
approach to Data Warehouse design focuses on a centralized data repository
modeled to the third normal form, and it is often characterized as a
top-down approach. Inmon holds that strong relational modeling leads to
enterprise-wide consistency, which makes it easier to develop individual data
marts that serve the needs of the departments using the actual data. This
approach differs in some respects from that of the “other” father of Data
Warehousing, Ralph Kimball.
A Data Warehouse is used to collect and manage data from various sources in
order to provide meaningful business insights. It is usually
used for linking and analyzing heterogeneous sources of business data, and it
is the center of the data collection and reporting framework
developed for the BI system. Operational databases are real-time
repositories of information that are usually tied to specific applications.
Data warehouses gather data from multiple sources (including databases), with
an emphasis on storing, filtering, retrieving and, in particular, analyzing huge
quantities of organized data. The data warehouse operates in an information-rich
environment: it provides an overview of the company, makes the current and
historical data of the company available for decisions, enables decision support
transactions without obstructing operational systems, makes information
consistent across the organization, and presents a flexible and interactive
information source.
Data warehouses are used extensively in the largest and most complex
businesses around the world. In demanding situations, good decision making
becomes critical, and it requires significant and relevant data.
A well-designed data warehouse makes that data available.
Following are some of the reasons for the need of Data Warehouses:
Enhancing the turnaround time for analysis and reporting: A data warehouse
allows business users to access critical data from a single source, enabling them
to take quick decisions. They need not waste time retrieving data from multiple
sources. Business executives can query the data themselves with minimal
or no support from IT, which in turn saves money and time.
Benefit of historical data: Transactional systems store data on a day-to-day basis
or for a very short duration, without historical data. In
comparison, a data warehouse stores large amounts of historical data, which
enables the business to perform time-period analysis, trend analysis, and trend
forecasts.
Scalability: Businesses today cannot survive for long if they cannot easily
expand and scale to match the increase in the volume of daily transactions.
A DW is designed to scale, making it easier for the business to grow with
minimum hassle.
Increase Revenue and Returns: When management and employees have
access to valuable data analytics, their decisions and actions strengthen the
business, which increases revenue in the long run.
Faster and Accurate Data Analytics: When data is available in the central data
warehouse, it takes less time to perform data analysis and generate reports.
Since the data is already cleaned and formatted, the results are more
accurate.
1.4 DATA WAREHOUSE DESIGN APPROACHES
The design approach is a very important aspect of building a data
warehouse. Selecting the right data warehouse design can save a lot of time and
project cost.
There are two Data Warehouse design approaches normally
followed when designing a Data Warehouse solution, and based on the
requirements of your project you can choose the one that suits your particular
scenario. These methodologies result from the work of Bill Inmon (Top-Down
Approach) and Ralph Kimball (Bottom-Up Approach).
• Data is extracted from the various source systems. The extracts are
loaded and validated in the stage area. Validation is required to make
sure the extracted data is accurate and correct. You can use ETL
tools or a similar approach to extract the data and push it to the data warehouse.
• Data is extracted from the data warehouse on a regular basis into the stage area.
At this step, various aggregation and summarization techniques are applied
to the extracted data, which is then loaded back into the data warehouse.
• Once the aggregation and summarization are completed, the various data
marts extract that data and apply further transformations to fit
the data structure defined by each data mart.
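The three steps above can be sketched in a few lines of Python using the built-in sqlite3 module. This is only an illustration under stated assumptions: the table names (stage_sales, dw_sales, dw_sales_summary, mart_north), the validation rule and the sample rows are hypothetical and do not come from any particular product.

```python
import sqlite3

def build_top_down(source_rows):
    """Source -> stage area (validate) -> warehouse -> summary -> data mart."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE stage_sales (region TEXT, product TEXT, amount REAL);
        CREATE TABLE dw_sales (region TEXT, product TEXT, amount REAL);
        CREATE TABLE dw_sales_summary (region TEXT, total REAL);
        CREATE TABLE mart_north (region TEXT, total REAL);
    """)
    # Step 1: load the extracts into the stage area, then validate them
    # before pushing them to the warehouse (hypothetical validation rule).
    con.executemany("INSERT INTO stage_sales VALUES (?, ?, ?)", source_rows)
    con.execute("""INSERT INTO dw_sales
                   SELECT region, product, amount FROM stage_sales
                   WHERE region IS NOT NULL AND amount >= 0""")
    # Step 2: aggregate warehouse data and load the summary back into it.
    con.execute("""INSERT INTO dw_sales_summary
                   SELECT region, SUM(amount) FROM dw_sales GROUP BY region""")
    # Step 3: a data mart extracts only the slice it needs.
    con.execute("""INSERT INTO mart_north
                   SELECT region, total FROM dw_sales_summary
                   WHERE region = 'North'""")
    return con

rows = [("North", "pen", 10.0), ("North", "ink", 5.0),
        ("South", "pen", 7.0), (None, "pen", 3.0), ("South", "ink", -1.0)]
con = build_top_down(rows)
mart = con.execute("SELECT region, total FROM mart_north").fetchall()
print(mart)
```

Note that the data mart is filled from the warehouse, never from the sources directly; that ordering is what makes the flow top-down.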
1.4.2 Bottom-up Approach
Ralph Kimball’s data warehouse design approach is called dimensional
modelling or the Kimball methodology, and it is illustrated in Figure 2. This
methodology follows the bottom-up approach.
In this method, data marts are created first to provide reporting and
analytics capability for specific business processes; the enterprise data
warehouse is later created from these data marts.
Basically, the Kimball model reverses the Inmon model: data marts are
loaded directly with data from the source systems, and an ETL process is then
used to load that data into the Data Warehouse. Figure 2 depicts how the
bottom-up approach works.
• The data flow in the bottom-up approach starts with the extraction of data
from various source systems into the stage area, where it is processed
and loaded into the data marts that handle specific business
processes.
• After the data marts are refreshed, the current data is once again extracted
into the stage area, and transformations are applied to fit the data to the data
mart structure. The data extracted from the data marts to the staging
area is then aggregated, summarized, and loaded into the EDW, where it is
made available to end users for analysis that supports critical
business decisions.
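For comparison, the bottom-up flow described above can be sketched as follows. It is a minimal illustration with Python's built-in sqlite3 module; the table names (stage, mart_sales, mart_finance, edw_summary) and the sample rows are hypothetical.

```python
import sqlite3

def build_bottom_up(source_rows):
    """Source -> stage area -> data marts -> enterprise data warehouse."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE stage (dept TEXT, item TEXT, amount REAL);
        CREATE TABLE mart_sales (item TEXT, amount REAL);
        CREATE TABLE mart_finance (item TEXT, amount REAL);
        CREATE TABLE edw_summary (dept TEXT, total REAL);
    """)
    con.executemany("INSERT INTO stage VALUES (?, ?, ?)", source_rows)
    # Step 1: each data mart is loaded directly from the stage area.
    con.execute("INSERT INTO mart_sales "
                "SELECT item, amount FROM stage WHERE dept = 'sales'")
    con.execute("INSERT INTO mart_finance "
                "SELECT item, amount FROM stage WHERE dept = 'finance'")
    # Step 2: mart data is aggregated and loaded into the EDW.
    con.execute("""INSERT INTO edw_summary
                   SELECT 'sales', SUM(amount) FROM mart_sales
                   UNION ALL
                   SELECT 'finance', SUM(amount) FROM mart_finance""")
    return con

con = build_bottom_up([("sales", "order-1", 120.0),
                       ("sales", "order-2", 80.0),
                       ("finance", "invoice-1", 50.0)])
edw = con.execute(
    "SELECT dept, total FROM edw_summary ORDER BY dept").fetchall()
print(edw)
```

Here the enterprise warehouse table is populated last, from the marts, which is the reverse of the top-down ordering.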
Having discussed the data warehouse design strategies, let us study the
characteristics of the DW in the next section.
1.5 CHARACTERISTICS OF A DATA WAREHOUSE
Data warehouses are systems concerned with studying, analyzing and
presenting enterprise data in a way that enables senior management to make
decisions. Data warehouses have four essential characteristics that
distinguish them from other data stores:
• Subject-oriented
A DW is always subject-oriented: it provides information about a
specific theme rather than about the organization's ongoing operations.
In other words, the data warehousing process is organized around a
well-defined theme (subject).
Figure 3 shows Sales, Products, Customers and Account as different
themes.
A data warehouse does not focus on current operations. Instead, it focuses
on presenting and analyzing data to support decisions. It also
provides a simple and accurate view of a specific theme by leaving out
information that is not needed to make decisions.
• Integrated
• Time-Variant
• Non-Volatile
The data residing in the data warehouse is permanent, as the name non-volatile
suggests. When new data is added, existing data is not
erased or removed. The warehouse therefore accumulates a very large amount
of data, which is analyzed with the warehouse's own tools. Figure 6 shows the
non-volatile data warehouse vs the operational database. A data warehouse is kept
separate from the operational database, and thus the data warehouse does
not reflect every routine change in the operational database. Data warehouse
integration manages different warehouses relevant to the topic.
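The contrast between a volatile operational table and a non-volatile warehouse table can be made concrete with a small sketch. It uses Python's built-in sqlite3 module; the tables op_customer and dw_customer, the load_date column and the helper change_city are hypothetical names chosen for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE op_customer (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dw_customer (id INTEGER, city TEXT, load_date TEXT);
""")

def change_city(cust_id, city, load_date):
    # Operational database: the current value is overwritten in place.
    con.execute("INSERT OR REPLACE INTO op_customer VALUES (?, ?)",
                (cust_id, city))
    # Warehouse: a new row is appended; earlier rows are never
    # updated or deleted.
    con.execute("INSERT INTO dw_customer VALUES (?, ?, ?)",
                (cust_id, city, load_date))

change_city(1, "Delhi", "2022-01-01")
change_city(1, "Mumbai", "2022-06-01")

op_rows = con.execute(
    "SELECT city FROM op_customer WHERE id = 1").fetchall()
dw_rows = con.execute(
    "SELECT city, load_date FROM dw_customer WHERE id = 1 "
    "ORDER BY load_date").fetchall()
print(op_rows)  # only the latest city survives
print(dw_rows)  # both cities are kept, each with its load date
```

The operational table can answer only "where does the customer live now?", while the warehouse table can also answer "where did the customer live in January?".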
• Load Manager
• Warehouse Manager
• Query Manager
The Query Manager component provides end users with access to the stored
warehouse information through specialized end-user tools. These data
access tools fall into categories such as query and reporting, online
analytical processing (OLAP), statistics, data discovery, and graphical and
geographical information systems.
• Reporting Data
• Query Tools
• Data Dippers
• Tools for EIS
• Tools for OLAP and tools for data mining.
……………………………………………………………………………
……………………………………………………………………………
Criteria | OLTP | OLAP
Characteristics | Handles a large number of small transactions | Handles large volumes of data with complex queries
Query types | Simple standardized queries | Complex queries
Operations | Based on INSERT, UPDATE, DELETE commands | Based on SELECT commands to aggregate data for reporting
Response time | Milliseconds | Seconds, minutes, or hours depending on the amount of data to process
Design | Industry-specific, such as retail, manufacturing, or banking | Subject-specific, such as sales, inventory, or marketing
Source | Transactions | Aggregated data from transactions
Purpose | Control and run essential business operations in real time | Plan, solve problems, support decisions, discover hidden insights
Data updates | Short, fast updates initiated by user | Data periodically refreshed with scheduled, long-running batch jobs
Space requirements | Generally small if historical data is archived | Generally large due to aggregating large datasets
Backup and recovery | Regular backups required to ensure business continuity and meet legal and governance requirements | Lost data can be reloaded from OLTP database as needed in lieu of regular backups
Productivity | Increases productivity of end users | Increases productivity of business managers, data analysts, and executives
Data view | Lists day-to-day business transactions | Multi-dimensional view of enterprise data
User examples | Customer-facing personnel, clerks, online shoppers | Knowledge workers such as data analysts, business analysts, and executives
Database design | Normalized databases for efficiency | Denormalized databases for analysis
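The "Operations" row of the comparison can be illustrated with a short sketch. It uses Python's built-in sqlite3 module on a single hypothetical orders table; in practice the two workloads would normally run on separate systems, so treat this only as a contrast between the two styles of statement.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders "
            "(id INTEGER PRIMARY KEY, region TEXT, qty INTEGER, status TEXT)")

# OLTP-style work: many small INSERT/UPDATE statements, one row at a time.
con.execute("INSERT INTO orders VALUES (1, 'East', 2, 'new')")
con.execute("INSERT INTO orders VALUES (2, 'West', 5, 'new')")
con.execute("INSERT INTO orders VALUES (3, 'East', 1, 'new')")
con.execute("UPDATE orders SET status = 'shipped' WHERE id = 2")

# OLAP-style work: a read-only SELECT that aggregates many rows
# for reporting.
report = con.execute(
    "SELECT region, COUNT(*), SUM(qty) FROM orders "
    "GROUP BY region ORDER BY region").fetchall()
print(report)
```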
…………………………………………………………………………………………
…………………………………………………………………………………………
…………………………………………………………………………………………
2) Mention the key differences between a database and a data warehouse.
…………………………………………………………………………………………
…………………………………………………………………………………………
1.7 DATA GRANULARITY
Government: Governments use data warehouses to store and analyze tax
records, for example to detect tax evasion.
Airlines: Data warehouses are used in the airline system for operational
purposes such as crew assignments, route profitability analyses, frequent
flyer program promotions, etc.
1.10 TYPES OF DATA WAREHOUSES
There are three different types of traditional Data Warehouse models as listed
below:
i. Enterprise
ii. Operational
iii. Data Mart
An operational data store has a sizable enterprise-wide scope, but unlike the
enterprise data warehouse, its data is refreshed in near real time and used for
routine business activity. It obtains data directly from the operational
databases and also supports transaction processing. The data in the
Operational Data Store can be scrubbed, and redundant data
can be checked and resolved against the corresponding business rules.
A data warehouse is a critical database for supporting data analysis and acts as
a conduit between analytical tools and operational data stores. The most
popular data warehousing solutions include a range of useful features for data
management and consolidation.
Google BigQuery
BigQuery is a cost-effective data warehousing tool with built-in machine
learning capabilities. You can integrate it with Cloud ML and TensorFlow to
create powerful AI models. It can also execute queries on petabytes of data for
real-time analytics. This scalable and serverless cloud data warehouse is ideal
for companies that want to keep costs low. If you need a quick way to make
informed decisions through data analysis, BigQuery is one of the solutions.
AWS Redshift
Snowflake
Azure Synapse brings together the two worlds of data warehousing and
analytics with a unified experience to ingest, prepare, manage, and serve data
for immediate BI and machine learning. The broader Azure platform includes
thousands of tools, including others that interface with the various Azure
databases.
1.12 SUMMARY
In this unit you have studied the evolution, characteristics, benefits and
applications of data warehouses.
1.13 SOLUTIONS/ANSWERS
Check Your Progress 1
Data warehousing is an increasingly important business intelligence
tool that allows companies to:
• Ensure consistency. All data gathered and shared with
decision makers worldwide should be in a uniform format.
Standardizing data from various sources reduces the risk of
misinterpretation and improves the overall accuracy of interpretation.
• Make better business decisions. Successful entrepreneurs have a
thorough understanding of data and are good at predicting future
trends. A data warehouse helps users access various data sets with
speed and efficiency.
• Gain historical insights. Data warehouses allow companies to access their
business's past history and evaluate ideas and projects. This gives managers
an idea of how they can improve their sales and management practices.
(i) Subject-oriented
(ii) Integrated
(iii) Time-variant
(iv) Non-volatile
Also, the data warehouse is non-volatile, meaning that prior data is
not erased when new data is entered into it. Data is read-only and is
refreshed only periodically. This also helps in analyzing historical data and in
understanding what happened and when. Transaction processing,
recovery, and concurrency control mechanisms are not required. In
the Data Warehouse environment, activities such as deleting, updating,
and inserting that are performed in an operational application
environment are omitted.
Data Warehouse Architecture
Structure
2.0 Introduction
2.1 Objectives
2.2 Data Warehouse Architecture and its Types
2.2.1 Types of Data Warehouse Architectures
2.3 Components of Data Warehouse Architecture
2.4 Layers of Data Warehouse Architecture
2.4.1 Best Practices for Data Warehouse Architecture
2.5 Data Marts
2.5.1 Data Mart Vs Data Warehouse
2.6 Benefits of Data Marts
2.7 Types of Data Marts
2.8 Structure of a Data Mart
2.9 Designing the Data Marts
2.10 Limitations with Data Marts
2.11 Summary
2.12 Solutions / Answers
2.13 Further Readings
2.0 INTRODUCTION
In the previous unit we studied data warehousing and related
topics. Despite numerous advancements over the last five years in the areas of
Big Data, cloud computing, predictive analysis, and information technologies,
data warehouses have only gained more significance. For the success of any
data warehouse, its architecture plays an important role. For three decades,
the data warehouse architecture has been the pillar of corporate data
ecosystems.
This unit presents various topics, including the basic concept of data warehouse
architecture, its types, the significant components and layers of data warehouse
architecture, and data marts and their design.
2.1 OBJECTIVES
2.2 DATA WAREHOUSE ARCHITECTURE AND ITS TYPES
Using a dimensional model, the raw data in the staging area is extracted and
converted into a simple, consumable warehousing structure to deliver valuable
business intelligence. When designing a data warehouse, there are three
different types of models to consider, based on the number of tiers
the architecture has.
The two-tier architecture (Figure 2) includes a staging area for all data sources,
before the data warehouse layer. By adding a staging area between the sources
and the storage repository, you ensure all data loaded into the warehouse is
cleansed and in the appropriate format.
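What "cleansed and in the appropriate format" might mean in a staging area can be shown with a small sketch. The field names (customer, amount, date), the dd/mm/yyyy source date format and the rejection rules below are assumptions made purely for illustration.

```python
from datetime import datetime

def cleanse(raw_rows):
    """Return rows in a uniform format; drop rows that fail validation."""
    clean = []
    for row in raw_rows:
        # Normalize whitespace and capitalization of the customer name.
        name = (row.get("customer") or "").strip().title()
        try:
            amount = float(row.get("amount"))
            day = datetime.strptime(row.get("date", ""), "%d/%m/%Y").date()
        except (TypeError, ValueError):
            continue  # reject the row rather than load bad data
        if not name or amount < 0:
            continue
        clean.append({"customer": name, "amount": amount,
                      "date": day.isoformat()})
    return clean

staged = [
    {"customer": "  asha rao ", "amount": "120.50", "date": "05/07/2022"},
    {"customer": "R. Mehta", "amount": "n/a", "date": "06/07/2022"},
    {"customer": "", "amount": "10", "date": "06/07/2022"},
]
warehouse_ready = cleanse(staged)
print(warehouse_ready)
```

Only the first row survives: the second has a non-numeric amount and the third has no customer name.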
The three-tier approach (Figure 3) is the most widely used architecture for data
warehouse systems.
1. The bottom tier is the database of the warehouse, where the cleansed
and transformed data is loaded.
2. The middle tier is the application layer giving an abstracted view of
the database. It arranges the data to make it more suitable for analysis.
This is done with an OLAP server, implemented using the ROLAP or
MOLAP model.
3. The top-tier is where the user accesses and interacts with the data. It
represents the front-end client layer. You can use reporting tools, query,
analysis or data mining tools.
Figure 4 illustrates the complete data warehouse architecture with the three
tiers:
• Speed: Cloud-based data warehouse architecture is substantially
speedier than on-premises options, partly due to the use of ELT —
which is an uncommon process for on-premises counterparts.
• Scale: The elastic resources of the cloud make it ideal for the scale
required of big datasets. Additionally, cloud-based data warehousing
options can also scale down as needed, which is difficult to do with
other approaches.
Cloud-based platforms make it possible to create, share, and store massive data
sets with ease, paving the way for more efficient and effective data access and
analysis. Cloud systems are built for sustainable business growth, with many
modern Software-as-a-Service (SaaS) providers separating data storage from
computing to improve scalability when querying data.
Some of the more notable cloud data warehouses in the market include
Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure SQL
Data Warehouse.
Now, let’s learn about the major components of a data warehouse and how
they help build and scale a data warehouse in the next section.
The following are the four database types that you can use:
• Typical relational databases are the row-oriented databases you
perhaps use on an everyday basis, for example, Microsoft SQL
Server, SAP, Oracle, and IBM DB2.
• Analytics databases are developed specifically to store data for
analytics workloads, such as Teradata and Greenplum.
• Data warehouse appliances aren’t exactly a kind of storage database,
but several vendors now offer appliances that bundle software for data
management with hardware for storing data, for example, SAP
Hana, Oracle Exadata, and IBM Netezza.
• Cloud-based databases can be hosted and accessed on the cloud so that
you don’t have to procure any hardware to set up your data
warehouse, for example, Amazon Redshift, Google BigQuery, and
Microsoft Azure SQL.
2.3.3 Metadata
Before we delve into the different types of metadata in data warehousing, we
first need to understand what metadata is. In the data warehouse architecture,
metadata describes the data warehouse database and offers a framework for
data. It helps in constructing, preserving, handling, and making use of the data
warehouse.
Metadata plays an important role in helping business and technical teams
understand the data present in the warehouse and convert it into information.
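As a rough illustration of metadata serving both technical and business teams, the sketch below reads column definitions from the database engine and looks up a plain-language description in a catalog table. It uses Python's built-in sqlite3 module; the meta_columns catalog table, its columns and its contents are hypothetical, and real warehouses typically rely on a dedicated metadata repository.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales "
            "(sale_id INTEGER, cust_id INTEGER, amt REAL)")
# Hypothetical catalog table describing each column in plain language.
con.execute("CREATE TABLE meta_columns (table_name TEXT, column_name TEXT, "
            "description TEXT, source_system TEXT)")
con.executemany("INSERT INTO meta_columns VALUES (?, ?, ?, ?)", [
    ("fact_sales", "amt", "Sale amount in rupees", "billing"),
    ("fact_sales", "cust_id", "Customer identifier", "crm"),
])

# Technical view: column names and declared types, read from the engine.
# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
technical = [(row[1], row[2])
             for row in con.execute("PRAGMA table_info(fact_sales)")]
# Business view: what a column means and where it came from.
business = con.execute(
    "SELECT description, source_system FROM meta_columns "
    "WHERE table_name = 'fact_sales' AND column_name = 'amt'").fetchone()
print(technical)
print(business)
```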
• Query and reporting tools help users produce corporate reports for
analysis that can be in the form of spreadsheets, calculations, or
interactive visuals.
• Application development tools help create tailored reports and present
them in views intended for reporting purposes.
• Data mining tools for data warehousing automate the procedure of
identifying patterns and relationships in huge quantities of data using
advanced statistical modeling methods.
• OLAP tools help construct a multi-dimensional data warehouse and
allow the analysis of enterprise data from numerous viewpoints.
The data warehouse bus defines the data flow within a data warehousing bus
architecture and includes a data mart. A data mart is an access layer that
delivers data to users. It is also used for partitioning data that is produced
for a particular user group.
The reporting layer in the data warehouse allows the end-users to access the BI
interface or BI database architecture. The purpose of the reporting layer in the
data warehouse is to act as a dashboard for data visualization, create reports,
and take out any required information.
In general, the data warehouse architecture can be divided into four layers.
They are:
(i) Data source layer
The data source layer is the place where original data, gathered from a
variety of internal and external sources, resides in the relational database.
Following are the examples of the data source layer:
While most data warehouses manage structured data, consideration should be given
to the future use of unstructured data sources, for example, voice
recordings, scanned images, and unstructured text. These streams of data are
significant stores of information and should be considered when building
up your warehouse.
This layer resides between the data sources and the data warehouse. In this
layer, data is extracted from various internal and external data sources.
Since source data comes in various formats, the data extraction layer
uses numerous technologies and tools to extract the necessary information.
Once the extracted data has been loaded, it is subjected to rigorous
quality checks. The final result is clean and organized data that
you load into your data warehouse. The staging layer contains the following
parts:
The landing database stores the data retrieved from the data source.
Before the data goes to the warehouse, the staging process performs stringent
quality checks on it. Staging is a critical step in the architecture. Poor data
leads to inadequate information, and the result is poor business
decision-making. The staging layer is also where you make adjustments in line
with the business process to handle unstructured data sources.
Extract, Transform and Load (ETL) tools are the data tools used to extract
data from source systems, transform and prepare it, and
load it into the warehouse.
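The three ETL stages can be sketched as three small functions. This is a toy illustration using only Python's standard library (csv, io, sqlite3); the CSV layout, the dw_sales table and the revenue calculation are hypothetical, and real ETL tools add scheduling, logging and error handling on top of this pattern.

```python
import csv
import io
import sqlite3

def extract(csv_text):
    """Extract: read raw records from a source export."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: tidy the SKU, cast types and derive revenue."""
    return [(r["sku"].strip().upper(),
             int(r["qty"]),
             round(float(r["price"]) * int(r["qty"]), 2))
            for r in rows]

def load(con, rows):
    """Load: write the prepared rows into the warehouse table."""
    con.execute("CREATE TABLE IF NOT EXISTS dw_sales "
                "(sku TEXT, qty INTEGER, revenue REAL)")
    con.executemany("INSERT INTO dw_sales VALUES (?, ?, ?)", rows)

source = "sku,qty,price\n ab-1 ,2,9.50\ncd-2,1,20.00\n"
con = sqlite3.connect(":memory:")
load(con, transform(extract(source)))
loaded = con.execute(
    "SELECT sku, qty, revenue FROM dw_sales ORDER BY sku").fetchall()
print(loaded)
```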
This layer is the place where the data that was cleansed in the staging
area is stored as a single central repository. Depending on your business
and your warehouse architecture requirements, your data storage might be a
data warehouse, a data mart (a data warehouse partially replicated for
particular departments), or an Operational Data Store (ODS).
(iv) Data Presentation Layer
This is where the users interact with the cleansed and organized data.
This layer of the data architecture gives users the ability to query the data for
product or service insights, analyze the data to explore hypothetical business
scenarios, and create automated or ad hoc reports.
Designing the data warehouse with the designated architecture is an art. Some
of the best practices are shown below:
2.5 DATA MARTS
A data mart is a subset of a data warehouse focused on a particular line of
business, department, or subject area. Data marts make specific data available
to a defined group of users, which allows those users to quickly access critical
insights without wasting time searching through an entire data warehouse. For
example, many companies may have a data mart that aligns with a specific
department in the business, such as finance, sales, or marketing.
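The idea of a data mart as a subset of the warehouse can be illustrated with a database view. This is one possible sketch using Python's built-in sqlite3 module; the dw_transactions table and the mart_finance view are hypothetical, and in practice a data mart is often a separate physical database rather than a view.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dw_transactions "
            "(dept TEXT, account TEXT, amount REAL)")
con.executemany("INSERT INTO dw_transactions VALUES (?, ?, ?)", [
    ("finance", "payroll", 900.0),
    ("sales", "north-region", 400.0),
    ("finance", "tax", 150.0),
    ("marketing", "ads", 75.0),
])
# The finance data mart exposes only the finance slice of the warehouse.
con.execute("""CREATE VIEW mart_finance AS
               SELECT account, amount FROM dw_transactions
               WHERE dept = 'finance'""")
finance_rows = con.execute(
    "SELECT account, amount FROM mart_finance ORDER BY account").fetchall()
print(finance_rows)
```

Finance users query mart_finance and never see the sales or marketing rows.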
Data marts and data warehouses are both highly structured repositories where
data is stored and managed until it is needed. However, they differ in the scope
of data stored: data warehouses are built to serve as the central store of data for
the entire business, whereas a data mart fulfills the needs of a specific
division or business function. Because a data warehouse contains data for the
entire company, it is best practice to strictly control who can access it.
Additionally, finding the data you need in a data warehouse can be a
difficult task for business users. Thus, the primary purpose of a data mart is to
isolate, or partition, a smaller set of data from the whole to provide easier data
access for the end consumers.
On the other hand, separate business units may create their own data marts
based on their own data requirements. If business needs dictate, multiple data
marts can be merged together to create a single data warehouse. This is the
bottom-up development approach.
• Data marts improve query speed with a smaller, more specialized set of
data.
• A data warehouse includes many data sets and takes time to update, whereas
data marts handle smaller, faster-changing data sets.
• A data warehouse implementation can take many years, whereas data marts are
much smaller in scope and can be implemented in months.
With its smaller, focused design, a data mart has several benefits to the end
user, including the following:
• Simplified data access: Data marts only hold a small subset of data, so
users can quickly retrieve the data they need with less work than they
could when working with a broader data set from a data warehouse.
There are three types of data marts that differ based on their relationship to the
data warehouse and the respective data sources of each system.
• Hybrid data marts combine data from existing data warehouses and
other operational sources. This unified approach leverages the speed
and user-friendly interface of a top-down approach and also offers the
enterprise-level integration of the independent method.
Star
Snowflake
While this method requires less space to store dimension tables, it is a complex
structure that can be difficult to maintain. The main benefit of using snowflake
schema is the low demand for disk space, but the caveat is a negative impact
on performance due to the additional tables.
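The trade-off described above can be seen by answering the same question against both layouts. The sketch below uses Python's built-in sqlite3 module; all table and column names are hypothetical. The star version stores the category name in every product row, while the snowflake version moves it into its own table and therefore needs one more join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Star schema: one denormalized product dimension.
    CREATE TABLE dim_product_star
        (product_id INTEGER PRIMARY KEY, name TEXT, category_name TEXT);
    -- Snowflake schema: category is normalized into its own table.
    CREATE TABLE dim_category
        (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product_snow
        (product_id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);

    INSERT INTO dim_category VALUES (1, 'Stationery');
    INSERT INTO dim_product_star VALUES
        (10, 'Pen', 'Stationery'), (11, 'Ink', 'Stationery');
    INSERT INTO dim_product_snow VALUES (10, 'Pen', 1), (11, 'Ink', 1);
    INSERT INTO fact_sales VALUES (10, 5.0), (11, 3.0), (10, 2.0);
""")

# Star: fact table joins directly to one dimension table.
star = con.execute("""
    SELECT d.category_name, SUM(f.amount) FROM fact_sales f
    JOIN dim_product_star d ON d.product_id = f.product_id
    GROUP BY d.category_name""").fetchall()
# Snowflake: the same answer needs an additional join.
snow = con.execute("""
    SELECT c.category_name, SUM(f.amount) FROM fact_sales f
    JOIN dim_product_snow p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name""").fetchall()
print(star, snow)
```

Both queries return the same totals; the snowflake layout avoids repeating the category name but pays for it with the extra join.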
Data Vault
The first step is to create a robust design. Some critical processes involved in
this phase include collecting the corporate and technical requirements,
identifying data sources, choosing a suitable data subset, and designing the
logical layout (database schema) and physical structure.
(ii) Build/Construct
The next step is to construct it. This includes creating the physical database
and the logical structures. In this phase, you’ll build the tables, fields, indexes,
and access controls.
The next step is to populate the mart, which means transferring data into it. In
this phase, you can also set the frequency of data transfer, such as daily or
weekly. This usually involves extracting source information, cleaning and
transforming the data, and loading it into the departmental repository.
In this step, the data loaded into the data mart is used for querying,
generating reports and graphs, and publishing. The main task involved in this
phase is setting up a meta-layer and translating database structures and item
names into corporate expressions so that non-technical operators can easily
use the data mart. If necessary, you can also set up APIs and interfaces to
simplify data access.
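The build, populate and access steps described above can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustrative sketch, not a prescribed implementation: the table dept_sales, the view department_revenue and the sample source rows are hypothetical names and data chosen for the example.

```python
import sqlite3

# Hypothetical rows extracted from an operational source (illustrative data only).
source_rows = [
    {"order_id": 1, "dept": " Finance ", "amount": "120.50"},
    {"order_id": 2, "dept": "finance", "amount": "79.50"},
    {"order_id": 3, "dept": "Marketing", "amount": None},  # incomplete record
]


def transform(rows):
    """Clean and standardise the extracted rows before loading."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:  # drop records that fail a basic quality check
            continue
        dept = row["dept"].strip().title()  # normalise department names
        cleaned.append((row["order_id"], dept, float(row["amount"])))
    return cleaned


# Build: create the physical table and an index in the (in-memory) data mart.
mart = sqlite3.connect(":memory:")
mart.execute(
    "CREATE TABLE dept_sales ("
    "order_id INTEGER PRIMARY KEY, dept TEXT NOT NULL, amount REAL NOT NULL)"
)
mart.execute("CREATE INDEX idx_dept_sales_dept ON dept_sales (dept)")

# Populate: load the cleaned and transformed rows.
mart.executemany("INSERT INTO dept_sales VALUES (?, ?, ?)", transform(source_rows))
mart.commit()

# Access: a view acts as a simple meta-layer with business-friendly names.
mart.execute(
    "CREATE VIEW department_revenue AS "
    "SELECT dept AS department, SUM(amount) AS total_revenue "
    "FROM dept_sales GROUP BY dept"
)
report = mart.execute(
    "SELECT department, total_revenue FROM department_revenue"
).fetchall()
print(report)
```

In a real data mart the populate step would be scheduled (for example daily or weekly) rather than run once, as the text notes.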
(v) Manage
The more common reason for failure is that the data mart is immediately
unsuccessful because it is designed in such a way that users are unable to
retrieve the sort of information they want and need to extract from the data.
Databases are highly denormalized to respond to a small set of canned queries;
summaries, rather than detail data, comprise the database so that fine-grained
exploratory data analysis is not possible; and support for ad hoc queries is
either absent or so poor as to discourage users from bothering with them.
The very factors that frequently defeat data mart projects are also the most
commonly recommended approaches to designing data marts and data
warehouses in the popular data warehousing literature:
Check your Progress 1
2.11 SUMMARY
cumulative and historical data from single or multiple sources. The process
of reporting and analyzing data in organizations is simplified with the
help of different data warehousing concepts. There are different approaches to
constructing a data warehouse architecture. The approach used depends on
the requirements of the organization.
2. On every operational database, there are a certain fixed number of
operations that have to be applied. There are different well-defined
techniques for delivering suitable solutions. Data warehousing is found
to be more effective when the correct flow of the data warehouse
architecture is completely followed.
Focused Analytics
Analytics is perhaps the most common application of data marts. The
data in these repositories is entirely relevant to the requirements of the
business department, with no extraneous information, resulting in faster
and more accurate analysis. For example, financial analysts will find it
easier to work with a financial data mart, rather than working with an
entire data warehouse.
Fast Turnaround
Data marts are generally faster to develop than a data warehouse, as the
developers are working with fewer sources and a limited schema. Data
marts are ideal for data projects operating under challenging time
constraints.
Permission Management
Data marts can be a low-risk way to grant limited data access without
exposing the entire data warehouse. For example, a dependent data mart
contains a segment of warehouse data, and users are only able to view
the contents of the mart. This prevents unauthorized access and
accidental writes.
Better Resource Management
Data marts are sometimes used where there is a disparity in resource
usage between different departments. For example, the logistics
department might perform a high volume of daily database actions,
which causes the marketing team’s analytics tools to run slowly. By
providing each department with its own data mart, it’s easier to allocate
resources according to their needs.
UNIT 3 DIMENSIONAL MODELING
Structure
3.0 Introduction
3.1 Objectives
3.2 Dimensional Modeling
3.2.1 Strengths of Dimensional Modeling
3.3 Identifying Facts and Dimensions
3.4 Star Schema
3.4.1 Features of Star Schema
3.5 Advantages and Disadvantages of Star Schema
3.6 Snowflake Schema
3.6.1 Features of Snowflake Schema
3.7 Advantages and Disadvantages of Snowflake Schema
3.7.1 Star Schema Vs Snowflake Schema
3.8 Fact Constellation Schema
3.8.1 Advantages and Disadvantages of Fact Constellation Schema
3.9 Aggregate Tables
3.10 Need for Building Aggregate Fact Tables
Limitations of Aggregate Fact Tables
3.11 Aggregate Fact Tables and Derived Dimension Tables
3.12 Summary
3.13 Solutions/Answers
3.14 Further Readings
3.0 INTRODUCTION
In the earlier unit, we studied Data Warehouse Architecture and Data Marts.
In this unit, let us focus on the modeling aspects. We will go through
dimensional modeling, the star schema, the snowflake schema, the fact
constellation schema and aggregate tables.
3.1 OBJECTIVES
After going through this unit, you shall be able to:
• understand the purpose of dimensional modeling;
• identify the measures, facts, and dimensions;
• discuss the fact and dimension tables and their pros and cons;
• discuss the Star and Snowflake schema;
• explore comparative analysis of star and snowflake schema;
• describe aggregate facts and fact constellation; and
• discuss various examples of star and snowflake schema.
3.2 DIMENSIONAL MODELING
Student Registration
3.4 STAR SCHEMA
There are two popular basic models used for dimensional modeling:
• Star Model
• Snowflake Model
Star Model: It represents the multidimensional model. In this model the data
is organized into facts and dimensions. The star model is the underlying
structure for a dimensional model. It has one broad central table (fact table)
and a set of smaller tables (dimensions) arranged in a star design. This design
is shown logically in Figure 2.
• Query performance
Because a star schema database has a small number of tables and clear join
paths, queries run faster than they do against an OLTP system. Small single-
table queries, usually of dimension tables, are almost instantaneous. Large join
queries that involve multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are linked only through the
central fact table. When two dimension tables are used in a query, only one
join path, intersecting the fact table, exists between those two tables. This
design feature enforces accurate and consistent query results.
• Load performance and administration
Structural simplicity also reduces the time required to load large batches of
data into a star schema database. By defining facts and dimensions and
separating them into different tables, the impact of a load operation is reduced.
Dimension tables can be populated once and occasionally refreshed. You can
add new facts regularly and selectively by appending records to a fact table.
• Built-in referential integrity
A star schema has referential integrity built in when data is loaded. Referential
integrity is enforced because each record in a dimension table has a unique
primary key, and all keys in the fact tables are legitimate foreign keys drawn
from the dimension tables. A record in the fact table that is not related
correctly to a dimension cannot be given the correct key value to be retrieved.
• Easily understood
A star schema is easy to understand and navigate, with dimensions joined only
through the fact table. These joins are more significant to the end user, because
they represent the fundamental relationship between parts of the underlying
business. Users can also browse dimension table attributes before constructing
a query.
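The built-in referential integrity described above can be illustrated with a small sketch. It uses Python's standard sqlite3 module; the tables dim_item and fact_sales and their rows are hypothetical. Note that SQLite enforces foreign keys only after the PRAGMA shown below is switched on, so this is a sketch of the idea rather than of any particular warehouse product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# One dimension table with a unique primary key ...
conn.execute(
    "CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT NOT NULL)"
)
# ... and a fact table whose key must be drawn from that dimension.
conn.execute(
    "CREATE TABLE fact_sales ("
    "item_key INTEGER NOT NULL REFERENCES dim_item (item_key), "
    "units_sold INTEGER NOT NULL)"
)

conn.execute("INSERT INTO dim_item VALUES (1, 'Notebook')")
conn.execute("INSERT INTO fact_sales VALUES (1, 5)")  # valid: item 1 exists

try:
    # Invalid: there is no item with key 99 in the dimension table.
    conn.execute("INSERT INTO fact_sales VALUES (99, 3)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

fact_rows = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(rejected, fact_rows)
```

The fact record with an unknown dimension key is refused at load time, so only facts that relate correctly to a dimension reach the fact table.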
3.5.2 Disadvantages of Star Schema
As mentioned before, improving read queries and analysis in a star schema
could involve certain challenges:
• Decreased data integrity: Because of the denormalized data structure,
star schemas do not enforce data integrity very well. Although star
schemas use countermeasures to prevent anomalies from developing, a
simple insert or update command can still cause data incongruities.
• Less capable of handling diverse and complex queries: Database
designers build and optimize star schemas for specific analytical needs.
As denormalized data sets, they work best with a relatively narrow set
of simple queries. Comparatively, a normalized schema permits a far
wider variety of more complex analytical queries.
• No Many-to-Many Relationships: Because they offer a simple
dimension schema, star schemas don’t work well for “many-to-many
data relationships”.
Example 1: Suppose a star schema is composed of a Sales fact table, as shown
in Figure 3a, and several dimension tables connected to it for Time, Branch,
Item and Location.
Fact Table
Sales is the Fact table.
Dimension Tables
The Time table has columns for day, month, quarter, year, etc.
The Item table has columns for item_key, item_name, brand, type and
supplier_type.
The Branch table has columns for branch_key, branch_name and
branch_type.
The Location table has columns of geographic data, including street, city,
state, and country.
Unit_Sold and Dollars_Sold are the Measures.
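Example 1 can be sketched as a small runnable schema using Python's standard sqlite3 module. The column lists follow the example; the surrogate keys time_key and location_key, the table name suffixes (_dim, _fact) and all sample rows are assumptions added for illustration and are not taken from Figure 3a.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                         branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
                           city TEXT, state TEXT, country TEXT);
-- Central fact table: one foreign key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim (time_key),
    item_key INTEGER REFERENCES item_dim (item_key),
    branch_key INTEGER REFERENCES branch_dim (branch_key),
    location_key INTEGER REFERENCES location_dim (location_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 15, 1, 'Q1', 2022), (2, 20, 4, 'Q2', 2022);
INSERT INTO item_dim VALUES (1, 'Laptop', 'Acme', 'Computer', 'Wholesale'),
                            (2, 'Phone', 'Acme', 'Mobile', 'Retail');
INSERT INTO branch_dim VALUES (1, 'Central', 'Flagship');
INSERT INTO location_dim VALUES (1, 'MG Road', 'Delhi', 'Delhi', 'India'),
                                (2, 'Park Street', 'Kolkata', 'West Bengal', 'India');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 2, 2000.0),
                              (1, 2, 1, 2, 3, 900.0),
                              (2, 1, 1, 1, 1, 1000.0),
                              (2, 2, 1, 1, 4, 1200.0);
""")

# Each dimension is reached through exactly one join from the fact table.
sales_by_city_quarter = conn.execute("""
    SELECT l.city, t.quarter, SUM(s.units_sold), SUM(s.dollars_sold)
    FROM sales_fact AS s
    JOIN time_dim AS t ON t.time_key = s.time_key
    JOIN location_dim AS l ON l.location_key = s.location_key
    GROUP BY l.city, t.quarter
    ORDER BY l.city, t.quarter
""").fetchall()

for row in sales_by_city_quarter:
    print(row)
```

The query touches only the dimensions it needs (Time and Location); Item and Branch stay out of the join because the star layout links every dimension directly to the fact table.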
Example 2:
The star schema works by dividing data into measurements and the “who,
what, where, when, why, and how” descriptive context. Broadly, these two
groups are facts and dimensions.
By doing this, the star schema methodology allows the business user to
restructure their transactional database into smaller tables that are easier to fit
together. Fact tables are then linked to their associated dimension tables with
primary or foreign key relationships. An example of this would be a quick
grocery store purchase. The amount you spent and how many items you bought
would be considered a fact, but what you bought, when you bought it and the
specific grocery store’s location would all be considered dimensions.
Once these two groups have been established, we can connect them by the
unique transaction number associated with your specific purchase. An
important note is that each fact, or measurement, will be associated with
multiple dimensions. This is what forms the star shape: the fact in the center
and the dimensions radiating out around it. Dimensions relating to the grocery
store, the products you bought, and descriptions of you as the customer
are each carefully separated into their own table with their own attributes.
This example is modeled as shown below and star schema for this is depicted
in Figure 3b.
Fact Table
Sales is the Fact Table.
Dimension Tables
The Store table consists of columns like store_id, store_address, city, region,
state and country.
Customer table has columns for product_id, product_time and
product_type.
Sales_Type includes sales_type_id and type_name columns.
Product table consists of product_id, product_name and product_type.
Time table consists of columns like time_id, action_date, action_week,
action_month, action_year and action_weekday.
Measurements may be the amount spent and the number of items bought.
2) Draw a star schema for a marketing employee living in New York City, USA.
He buys products and wants to compute the total number of products sold and
the total sales amount.
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
3. It requires more lookup time because many tables are interconnected and
dimensions are extended.
Example
The figure below shows a snowflake schema for a case study of customers,
sales, products and stores, in which the location-wise quantity sold and the
number of items sold are calculated. The customer, product, date and store
dimensions are referenced from the fact table, with their respective primary
keys acting as foreign keys in the fact table. You will observe that two
aggregate functions can be applied to calculate the quantity sold and the
amount sold. Further, some dimensions are extended, for example to the type of
customer and to territory-wise store information. Note that date has been
expanded into date, month and year. This schema gives you more opportunity to
handle queries in detail.
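The idea of extending a dimension can be sketched as follows, again with Python's standard sqlite3 module. Only the customer branch of the case study is shown; the table names, column names and sample rows are hypothetical and are not taken from the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The customer dimension is normalized: the customer type lives in its own table.
CREATE TABLE customer_type (customer_type_id INTEGER PRIMARY KEY, type_name TEXT);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    customer_name TEXT,
    customer_type_id INTEGER REFERENCES customer_type (customer_type_id)
);
CREATE TABLE sales_fact (
    customer_id INTEGER REFERENCES customer (customer_id),
    quantity_sold INTEGER,
    amount_sold REAL
);
INSERT INTO customer_type VALUES (1, 'Retail'), (2, 'Wholesale');
INSERT INTO customer VALUES (10, 'Asha', 1), (11, 'Ravi', 2), (12, 'Meena', 1);
INSERT INTO sales_fact VALUES (10, 2, 50.0), (11, 20, 400.0),
                              (12, 1, 25.0), (10, 3, 75.0);
""")

# Reaching the customer type needs two joins: fact -> customer -> customer_type.
sales_by_customer_type = conn.execute("""
    SELECT ct.type_name, SUM(f.quantity_sold), SUM(f.amount_sold)
    FROM sales_fact AS f
    JOIN customer AS c ON c.customer_id = f.customer_id
    JOIN customer_type AS ct ON ct.customer_type_id = c.customer_type_id
    GROUP BY ct.type_name
    ORDER BY ct.type_name
""").fetchall()
print(sales_by_customer_type)
```

In a star schema the type name would be stored directly in the customer table, so the same report would need one join fewer; the extra join is the lookup cost that snowflaking trades for reduced redundancy.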
Disadvantages of Snowflake Schema
3.8 FACT CONSTELLATION SCHEMA
There is another schema for representing a multidimensional model. The term
fact constellation likens the schema to a galaxy containing several stars. It
is a collection of fact schemas having one or more dimension tables in common,
as shown in the figure below. This logical representation is mainly used in
designing complex database systems.
In the above figure, it can be observed that there are two fact tables, and
the two dimension tables in the pink boxes are the common dimension tables
connecting both the star schemas.
For example, suppose we are designing a fact constellation schema for
university students, and the problem specifies the following fact tables.
Fact tables
So, there are two fact tables, namely Placement and Workshop, which are part
of two different star schemas having:
i) dimension tables – Company, Student and TPO in the star schema with fact
table Placement and
ii) dimension tables – Training Institute, Student and TPO in the star schema
with fact table Workshop.
Both star schemas have two dimension tables in common, hence forming a
fact constellation or galaxy schema.
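The university example can be sketched with Python's standard sqlite3 module as two fact tables sharing the Student and TPO dimensions. The measures offers and hours_attended, the key column names and the sample rows are hypothetical, since the text does not list the fact table columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimensions
CREATE TABLE student (student_id INTEGER PRIMARY KEY, student_name TEXT);
CREATE TABLE tpo (tpo_id INTEGER PRIMARY KEY, tpo_name TEXT);
-- Dimensions private to one star each
CREATE TABLE company (company_id INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE training_institute (institute_id INTEGER PRIMARY KEY,
                                 institute_name TEXT);
-- Two fact tables forming the constellation (measures are assumed)
CREATE TABLE placement_fact (
    student_id INTEGER REFERENCES student (student_id),
    company_id INTEGER REFERENCES company (company_id),
    tpo_id INTEGER REFERENCES tpo (tpo_id),
    offers INTEGER
);
CREATE TABLE workshop_fact (
    student_id INTEGER REFERENCES student (student_id),
    institute_id INTEGER REFERENCES training_institute (institute_id),
    tpo_id INTEGER REFERENCES tpo (tpo_id),
    hours_attended INTEGER
);
INSERT INTO student VALUES (1, 'Anil'), (2, 'Bina');
INSERT INTO tpo VALUES (1, 'Dr. Rao');
INSERT INTO company VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO training_institute VALUES (1, 'SkillUp');
INSERT INTO placement_fact VALUES (1, 1, 1, 1), (1, 2, 1, 1), (2, 1, 1, 1);
INSERT INTO workshop_fact VALUES (1, 1, 1, 6), (2, 1, 1, 4), (2, 1, 1, 3);
""")

# Summarise each fact table on its own, then combine through the shared
# Student dimension. Joining the two fact tables directly would multiply rows.
student_summary = conn.execute("""
    SELECT s.student_name, COALESCE(p.offers, 0), COALESCE(w.hours, 0)
    FROM student AS s
    LEFT JOIN (SELECT student_id, SUM(offers) AS offers
               FROM placement_fact GROUP BY student_id) AS p
           ON p.student_id = s.student_id
    LEFT JOIN (SELECT student_id, SUM(hours_attended) AS hours
               FROM workshop_fact GROUP BY student_id) AS w
           ON w.student_id = s.student_id
    ORDER BY s.student_name
""").fetchall()
print(student_summary)
```

The shared dimension is what lets a single report draw on both stars, which is the wider perspective a fact constellation offers.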
Advantage
This schema is more flexible and gives a wider perspective on the data
warehouse system.
Disadvantage
Because this schema connects two or more facts to form a constellation, it is
complex to implement and maintain.
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
2. Suppose that a data warehouse consists of the dimensions time, doctor, ward
and patient, and the two measures count and charge, where charge is the fee
that a doctor charges a patient for a visit. Enumerate three classes of schemas
that are popularly used for modeling.
a) Draw a Star Schema diagram
b) Draw a Snowflake Schema.
……………………………………………………………………………..…
………………………………………………………………………..………
……………………………………………………………………………….
3.9 AGGREGATE TABLES
In a data warehouse, the data is stored in multidimensional cubes. In the
information technology industry, various tools are available to process
the queries posed to the data warehouse engine. These are called
business intelligence (BI) tools. They help to answer complex
queries and to take decisions. The word aggregate has the same sense as the
aggregation over relational tables that you must already be
familiar with. Aggregate fact tables roll up the basic fact tables of the schema
to improve query processing. BI tools transparently select the level
of aggregation to improve query performance. Aggregate fact tables
contain foreign keys referring to dimension tables.
Let us understand the need for building aggregate tables. Aggregate tables are
also referred to as pre-computed tables and hold partially summarized data.
• Put simply, it is about speed, that is, quick response to queries.
You can think of an aggregate table as an intermediate table that stores the
results of queries on disk. It is built using aggregate functions.
• It occupies less space than atomic fact tables. Queries against it can take
roughly half the time of general query processing.
• The Roll-up OLAP operation on the base fact tables generates aggregate
tables. Hence query performance increases, as it reduces the number of
rows to be accessed to retrieve the data for a query.
Aggregate facts are produced by calculating measures from more atomic fact
tables. These tables hold values computed with SQL aggregate functions such as
AVG, MIN, MAX and COUNT, usually in combination with GROUP BY. The aggregate
fact tables therefore hold summary statistics. Whenever speedy query handling
is required, aggregate fact tables are a good option.
• You can think of an aggregate fact table as a conformed copy of the
fact table, as it should give you the same query result as the
detailed fact table.
• Aggregate fact tables can be used in the case of large datasets or
when there is a large number of queries. They reduce the response time of
the queries fired by users or customers, and are very useful in business
intelligence application tools.
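The roll-up of an atomic fact table into an aggregate fact table can be sketched as follows, using Python's standard sqlite3 module. The table names sales_fact and sales_monthly_agg and the sample rows are hypothetical; the sketch only illustrates that the aggregate holds fewer rows yet answers a monthly query with the same result as the detailed table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Atomic fact table: one row per sale.
CREATE TABLE sales_fact (sale_date TEXT, item_key INTEGER,
                         units_sold INTEGER, dollars_sold REAL);
INSERT INTO sales_fact VALUES
    ('2022-01-05', 1, 2, 20.0), ('2022-01-17', 1, 1, 10.0),
    ('2022-01-20', 2, 5, 100.0), ('2022-02-02', 1, 4, 40.0),
    ('2022-02-11', 2, 1, 20.0), ('2022-02-25', 2, 2, 40.0);

-- Aggregate fact table: rolled up from day level to month level per item.
CREATE TABLE sales_monthly_agg AS
    SELECT substr(sale_date, 1, 7) AS month, item_key,
           SUM(units_sold) AS units_sold, SUM(dollars_sold) AS dollars_sold
    FROM sales_fact
    GROUP BY substr(sale_date, 1, 7), item_key;
""")

base_rows = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
agg_rows = conn.execute("SELECT COUNT(*) FROM sales_monthly_agg").fetchone()[0]

# The same business question answered from the detailed and the aggregate table.
from_base = conn.execute("""
    SELECT substr(sale_date, 1, 7) AS month, SUM(dollars_sold)
    FROM sales_fact GROUP BY substr(sale_date, 1, 7) ORDER BY month
""").fetchall()
from_agg = conn.execute("""
    SELECT month, SUM(dollars_sold)
    FROM sales_monthly_agg GROUP BY month ORDER BY month
""").fetchall()

print(base_rows, agg_rows)
print(from_base == from_agg, from_agg)
```

With realistic volumes the gap between the row counts is far larger than in this toy data set, which is where the response-time benefit comes from.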
The levels at which facts are stored become especially important when you
begin to have complex queries with multiple facts in multiple tables that are
stored at levels different from one another, and when a reporting request
involves yet another level. You must be able to support fact reporting at the
business levels that users require. There is nothing wrong with enhancing an
aggregate with new facts or deriving new dimensions. For measures, the only
issue is whether the new measures are atomic in the context of the aggregate
fact. If, however, the new measures are received at a lower grain, you would be
better off creating a new atomic fact for those measures prior to incorporating
summarized measures into the aggregate. This would allow the new measures
to be used for other purposes without having to go back to the source.
Let's say we have a fact table, FactBillReciept, that holds monthly
transactions. There can be different types of transaction receipts during a
month for each supplier. This huge volume of data would require a lot of
calculation, so we would build an aggregate table derived from the base table.
Conformed Dimension
A conformed dimension is a dimension that is shared across multiple data
marts or subject areas. An organization may use the same dimension table across
different projects without making any changes to it.
Derived Tables
Derived tables are a significant addition to the data warehouse. They are used
to create second-level data marts for cross-functional analysis.
Consolidated fact tables: A consolidated fact table combines data from
different fact tables to form a schema with a common grain.
One thing to notice here is that the product attributes keep on changing as per
the requirements, but product dimension remains the same. So, it is better to
keep Product as a separate dimension.
Let’s design the tables and their grains.
Product
Product_Id               Product_Id
Category_Id              Product_Type
Supplier_Id              Product_Description
Timekey                  Unit Sales
Product_type             Year
Product_Description      Quarter
Product_start_date
Quantity

Fact Table (Supplier)
Supplier_details
Supplier_Id
Product_Id
Store_Id
TimeKey
Derived tables are very useful because they put less calculation load on the
data warehouse engine.
3.12 SUMMARY
This unit presented the basic design of a data warehouse. The topics focused
on the various kinds of modeling and schemas, and explored the grains, facts,
and dimensions of the schemas. It is important to know about dimensional
modeling, as an appropriate modeling technique yields correct responses to
queries.
Dimensional modeling is a data structure technique used to optimize the design
of a data warehouse for query retrieval operations. There are various schema
designs; this unit discussed the star, snowflake, and fact constellation
schemas. These schemas, ranging from denormalized to normalized, use dimension,
fact, derived and aggregate fact tables. Every table has a purpose and is used
for efficient design in terms of space and query handling. This unit discussed
the pros and cons of each kind of table and used a number of examples to
explain design in different scenarios.
3.13 SOLUTIONS/ANSWERS
Check Your Progress 1:
1) Characteristics of Star Schema:
2)
2: a. Star Schema of Hospital Management
Fact Hospital: Patient_ID, Doctor_ID, Ward_ID, Time_Key, Bill_ID,
Calculate_billamt(), count_patients(), Count_Admission()
Dimension Doctor: Doctor_ID, Doctor_Name, Doctor_Contact, DoctorAvail_status,
Specialization
Dimension Patient: Patient_ID, Patient_name, Patient_Address, Patient_Contact,
Patient_Complain
Dimension Ward: Ward_ID, Ward_Name, Ward_Assistant, Admission_details
Dimension Time: Time_ID, Date
Dimension Bill: Bill_ID, Bill_Description, Amount, Time
b. Snowflake Schema of Hospital Management
Fact Hospital: Patient_ID, Doctor_ID, Ward_ID, Time_Key, Bill_ID,
Calculate_billamt(), count_patients(), Count_Admission()
Dimension Doctor: Doctor_ID, Doctor_Name, Address, Doctor_ContactNo,
DoctorAvail_status, Specialization
Dimension Patient: Patient_ID, Patient_name, Address, Patient_ContactNo,
Patient_Complain
Dimension Address: City, State, Country
Dimension Ward: Ward_ID, Ward_Name, Ward_Assistant, Admission_ID, Patient_ID
Dimension_Ward_Assistant: Assistant_ID, Assistant_Name
Dimension Bill: Bill_ID, Bill_Description, Amount, Time_ID, Patient_ID,
Doctor_ID
Dimension Admission: Admission_ID, Type of Admission, Patient_ID, Details,
Time_ID
Dimension Date: Date, Month, year
Dimension Time: Time_ID, Date, Time(HH:MM:SS)
1.
Limitations of aggregate fact tables: Building aggregate tables takes a lot of
time because the rows of the base fact table must be scanned, and there are
more tables to manage. Computing the aggregates can be costly. The size of the
aggregates is decided using a greedy approach with a hashing technique. If
there are n dimensions in the table, then there can be 2^n possible aggregates.
The load on the data warehouse also becomes more complex.
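The number of possible aggregates grows as 2 to the power n because each of the n dimensions can either be kept as a grouping column or rolled up completely. The short Python sketch below enumerates these combinations for four dimensions; the dimension names are borrowed from the Sales example and the helper possible_aggregates is a hypothetical name used only for illustration.

```python
from itertools import combinations


def possible_aggregates(dimensions):
    """Return every subset of dimensions that a fact table could be grouped by."""
    subsets = []
    for size in range(len(dimensions) + 1):
        subsets.extend(combinations(dimensions, size))
    return subsets


dimensions = ["time", "item", "branch", "location"]
aggregates = possible_aggregates(dimensions)

# With n = 4 dimensions there are 2 ** 4 = 16 candidate aggregates, from the
# grand total (no grouping columns) up to the full detail (all four columns).
print(len(aggregates))
for grouping in aggregates:
    print(grouping)
```

This count ignores hierarchy levels within a dimension (such as day, month and year), which increase the number of candidates further; that is why only a selected subset of aggregates is normally materialized.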
3.14 FURTHER READINGS