DMW Unit 1


Data Warehousing

● Overview,
● Definition,
● Data Warehousing Components,
● Building a Data Warehouse,
● Warehouse Database,
● Mapping the Data Warehouse to a Multiprocessor
Architecture,
● Difference between Database System and Data
Warehouse,
● Multi-Dimensional Data Model,
● Data Cubes,
● Stars,
● Snow Flakes,
● Fact Constellations,
● Concept.

Data Warehousing - Overview
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a
data warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of
data. This data helps analysts to take informed decisions in an organization.

An operational database undergoes frequent changes on a daily basis because of the transactions that take place. If a business executive wants to analyze previous feedback on any data, such as a product, a supplier, or any consumer data, the executive will have no data available to analyze, because the previous data has been overwritten by transactions.

A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this view, a data warehouse also provides Online Analytical Processing (OLAP) tools. These tools support interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.

Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has become an important platform for data analysis and online analytical processing.

Understanding a Data Warehouse

● A data warehouse is a database that is kept separate from the organization's operational database.
● There is no frequent updating done in a data warehouse.
● It possesses consolidated historical data, which helps the organization analyze its business.
● A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.
● Data warehouse systems help in the integration of diverse application systems.
● A data warehouse system helps in consolidated historical data analysis.

Why a Data Warehouse is Separated from Operational
Databases

A data warehouse is kept separate from operational databases for the following reasons −

● An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.
● Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
● An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.
● An operational database maintains current data. On the other hand, a data warehouse maintains historical data.

Data Warehouse Features

The key features of a data warehouse are discussed below −

● Subject Oriented − A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations; rather, it focuses on modelling and analysis of data for decision making.
● Integrated − A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This integration
enhances the effective analysis of data.
● Time Variant − The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information from the
historical point of view.

● Non-volatile − Non-volatile means the previous data is not erased when new data is added. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.

Note − A data warehouse does not require transaction processing, recovery, or concurrency controls, because it is physically stored separately from the operational database.

Data Warehouse Applications

As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as the core of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields −

● Financial services
● Banking services
● Consumer goods
● Retail sectors
● Controlled manufacturing

Types of Data Warehouse

Information processing, analytical processing, and data mining are the three types of data
warehouse applications that are discussed below −

● Information Processing − A data warehouse allows the data stored in it to be processed. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.
● Analytical Processing − A data warehouse supports analytical processing of the
information stored in it. The data can be analyzed by means of basic OLAP
operations, including slice-and-dice, drill down, drill up, and pivoting.

● Data Mining − Data mining supports knowledge discovery by finding hidden patterns
and associations, constructing analytical models, performing classification and
prediction. These mining results can be presented using the visualization tools.

Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
13 | These are highly flexible. | It provides high performance.
Building a Data Warehouse

Building a data warehouse involves several steps, from initial planning to implementation
and maintenance. A data warehouse (DW) is designed to aggregate and organize large
amounts of data from different sources to facilitate analysis and decision-making. Here’s
an overview of the steps involved in building a data warehouse:

1. Requirement Gathering and Planning

● Understand Business Needs: Collaborate with stakeholders to identify key objectives, data sources, KPIs, and reporting needs.
● Define Scope: Determine the size of the data warehouse, the data volume, and the frequency of updates.
● Select Key Performance Indicators (KPIs): Identify the important metrics that the data warehouse will focus on for reporting and analytics.

2. Data Modeling

● Identify Data Sources: List the different systems (e.g., CRM, ERP, transactional
databases, third-party APIs) that will feed data into the warehouse.
● Schema Design: Design the structure of the warehouse. There are two common
approaches:
○ Star Schema: Uses fact tables and dimension tables for simplicity.
○ Snowflake Schema: A more normalized version of the star schema with
additional tables.
● Data Marts: Consider creating smaller, subject-specific databases (data marts) that
feed into the central warehouse.
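The two schema-design approaches above can be made concrete with a small DDL sketch. The SQLite tables below are hypothetical: a star schema keeps the product dimension denormalized (category stored inline), while the snowflake variant normalizes the category out into its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: one central fact table plus denormalized dimension tables.
cur.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER,
    revenue    REAL
);
""")

# Snowflake variant: normalize the category out of the product dimension
# into an additional table, at the cost of one more join at query time.
cur.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product_sf (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
""")

tables = {row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
print(sorted(tables))
```

The trade-off visible even in this sketch: the star schema favors simple queries, the snowflake schema favors reduced redundancy.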

3. ETL (Extract, Transform, Load) Process

● Extract: Gather data from various data sources, such as databases, flat files, cloud
services, or APIs.

● Transform: Cleanse, normalize, and aggregate the data. This includes handling
missing values, removing duplicates, applying business rules, and transforming the
data into the desired format.
● Load: Insert the cleaned data into the data warehouse. Loading can be done in two
ways:
○ Batch Loading: Scheduled, periodic loads (e.g., daily or weekly).
○ Real-Time Loading: Continuous updating as new data comes in.
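The three ETL stages above can be sketched end-to-end in a few lines. The source rows, cleaning rules, and in-memory SQLite target are all hypothetical stand-ins for real source systems and a real warehouse platform.

```python
import sqlite3

# Extract: rows as they might arrive from an operational source.
source_rows = [
    {"order_id": 1, "product": " Laptop ", "amount": "999.00"},
    {"order_id": 2, "product": "phone",    "amount": None},       # missing value
    {"order_id": 1, "product": " Laptop ", "amount": "999.00"},   # duplicate
]

# Transform: trim/normalize text, default missing amounts, drop duplicates.
seen, clean_rows = set(), []
for row in source_rows:
    key = row["order_id"]
    if key in seen:
        continue                      # remove duplicate records
    seen.add(key)
    clean_rows.append({
        "order_id": key,
        "product": row["product"].strip().lower(),
        "amount": float(row["amount"]) if row["amount"] is not None else 0.0,
    })

# Load (batch): insert the cleaned rows into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (:order_id, :product, :amount)", clean_rows)

total = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(total)  # (2, 999.0)
```

A real-time variant would run the same transform per incoming record instead of per scheduled batch.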

4. Data Storage and Infrastructure

● Choose the Storage Model: Decide on the type of data warehouse platform, either
cloud-based (e.g., AWS Redshift, Google BigQuery, Snowflake) or on-premise (e.g.,
Microsoft SQL Server, Oracle).
● Design the Physical Database: Set up indexing, partitioning, and sharding to improve
performance for large datasets.
● Data Archiving: Plan for long-term storage of historical data and its retention
policies.
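As a minimal illustration of the physical-design step, the snippet below adds an index on a hypothetical fact table in SQLite and checks the query plan. Real warehouse platforms would additionally use partitioning or sharding, which SQLite itself does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (date_id INTEGER, product_id INTEGER, units INTEGER)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(d, p, 1) for d in range(100) for p in range(10)])

# Index the common filter column so date-range scans avoid a full table scan.
conn.execute("CREATE INDEX idx_sales_date ON fact_sales(date_id)")

# EXPLAIN QUERY PLAN shows whether the index is used for a date-range query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(units) FROM fact_sales WHERE date_id BETWEEN 10 AND 20"
).fetchall()
print(plan[0][-1])  # the plan detail should mention idx_sales_date

units = conn.execute(
    "SELECT SUM(units) FROM fact_sales WHERE date_id BETWEEN 10 AND 20"
).fetchone()[0]
```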

5. Data Governance and Security

● Data Quality Management: Implement processes to monitor the quality, accuracy, and consistency of data.
● Data Security: Define access control policies and ensure compliance with relevant
regulations (e.g., GDPR, HIPAA). Encryption and role-based access can help secure
sensitive data.
● Audit and Logging: Keep track of data changes and user interactions with the
warehouse.

6. BI Tools and Reporting

● Select BI Tools: Choose tools for querying, reporting, and visualization (e.g., Tableau,
Power BI, Looker).
● Data Aggregation and Cubes: Use OLAP (Online Analytical Processing) cubes for
multidimensional data analysis.

● Build Dashboards and Reports: Create user-friendly reports and visualizations to help
stakeholders understand the data.

7. Testing and Optimization

● Data Validation: Ensure that the data in the warehouse is accurate and matches
the source data.
● Performance Testing: Test the system for performance issues such as query speed
and ETL job efficiency.
● Optimize Queries: Indexing, partitioning, and denormalization techniques can help
improve query performance.

8. Deployment and Maintenance

● Deploy the Data Warehouse: Move the data warehouse from development to
production.
● Monitoring: Continuously monitor the performance, data loading processes, and
usage patterns.
● Scaling: Plan for future growth and the need to scale storage, compute, and query
capabilities.
● Maintenance: Regularly update the ETL process, add new data sources, and ensure
that reports meet changing business needs.

9. Iterate and Evolve

● Feedback Loop: Regularly gather feedback from users to improve the system.
● Adapt to New Business Needs: Update the warehouse structure as the business
evolves.

Key Considerations:

● Cloud vs. On-Premise: Cloud data warehouses offer scalability, but on-premise
solutions give more control.

● Big Data Compatibility: If dealing with extremely large datasets, choose a
warehouse architecture that can handle big data efficiently (e.g., Hadoop-based
storage).
● Cost Management: Track the cost of data storage and processing, especially if using
a cloud-based platform where costs can rise quickly with increased usage.

Components or Building Blocks of Data
Warehouse
Architecture is the proper arrangement of the elements. We build a data warehouse with software and hardware components. To suit the requirements of our organization, we arrange these building blocks in an effective way; we may want to boost one part with extra tools and services. All of this depends on our circumstances.

The figure shows the essential elements of a typical warehouse. The Source Data component is shown on the left. The Data Staging element serves as the next building block. In the middle is the Data Storage component that manages the data warehouse's data. This element not only stores and manages the data; it also keeps track of the data using the metadata repository. The Information Delivery component, shown on the right, consists of all the different ways of making the information from the data warehouse available to the users.

Source Data Component
Source data coming into the data warehouses may be grouped into four broad categories:

Production Data: This type of data comes from the different operational systems of the enterprise. Based on the data requirements in the data warehouse, we choose segments of the data from the various operational systems.

Internal Data: In each organization, users keep their "private" spreadsheets, reports, customer profiles, and sometimes even departmental databases. This is the internal data, part of which could be useful in a data warehouse.

Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.

External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external agencies.

Data Staging Component


After we have extracted data from various operational systems and external sources, we have to prepare the files for storing in the data warehouse. The extracted data coming from several different sources needs to be changed, converted, and made ready in a format that is suitable to be saved for querying and analysis.

We will now discuss the three primary functions that take place in the staging area.

1) Data Extraction: This method has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.

2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even more significant challenges. We perform several individual tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or
elimination of duplicates when we bring in the same data from various source systems.

Standardization of data elements forms a large part of data transformation. Data transformation involves many forms of combining pieces of data from different sources. We combine data from a single source record or related data elements from many source records.

On the other hand, data transformation also includes purging source data that is not useful and separating out source records into new combinations. Sorting and merging of data take place on a large scale in the data staging area. When the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized.
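The cleaning tasks just described, correcting misspellings, providing default values, and eliminating duplicates when the same record arrives from different source systems, can be sketched as follows. The spelling map and source records are invented for the illustration.

```python
# Hypothetical spelling corrections discovered during data profiling.
SPELL_FIX = {"Nwe York": "New York", "Chicgo": "Chicago"}

def clean(record):
    city = record.get("city") or "UNKNOWN"        # default for missing values
    return {"cust_id": record["cust_id"],
            "city": SPELL_FIX.get(city, city)}    # correct known misspellings

# The same customer arriving from two source systems.
source_a = [{"cust_id": 7, "city": "Nwe York"}]
source_b = [{"cust_id": 7, "city": "New York"}, {"cust_id": 9, "city": None}]

merged = {}
for rec in map(clean, source_a + source_b):
    merged[rec["cust_id"]] = rec                  # duplicates collapse on the key

print(sorted(merged))  # [7, 9]
```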

3) Data Loading: Two distinct categories of tasks form the data loading function. When we complete the structure and construction of the data warehouse and go live for the first time, we do the initial loading of the data into the data warehouse storage. The initial load moves high volumes of data, consuming a substantial amount of time.

Data Storage Components


Data storage for the data warehouse is a separate repository. The data repositories for the operational systems generally include only the current data. Also, these data repositories hold the data structured in highly normalized form for fast and efficient processing.

Information Delivery Component


The information delivery element is used to enable the process of subscribing for data warehouse files and having them transferred to one or more destinations according to some customer-specified scheduling algorithm.

Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, we keep the data about the logical data structures, the data about the records and addresses, the information about the indexes, and so on.

Data Marts
It includes a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to particular selected subjects. Data in a data mart should be fairly current, though not necessarily up to the minute, although developments in the data warehouse industry have made frequent and incremental data loads more achievable. Data marts are smaller than data warehouses and usually contain data for a single department or subject area. The current trend in data warehousing is to develop a data warehouse with several smaller related data marts for particular kinds of queries and reports.

Management and Control Component


The management and control elements coordinate the services and functions within the data warehouse. These components control the data transformation and the data transfer into the data warehouse storage. On the other hand, they moderate the data delivery to the clients. They work with the database management systems and authorize data to be correctly saved in the repositories. They monitor the movement of data into the staging area and from there into the data warehouse storage itself.

Why we need a separate Data Warehouse?


Data Warehouse queries are complex because they involve the computation of large groups
of data at summarized levels.

It may require the use of distinctive data organization, access, and implementation methods based on multidimensional views.

Performing OLAP queries in an operational database degrades the performance of operational tasks.

A data warehouse is used for analysis and decision making, which requires an extensive database, including historical data, which an operational database does not typically maintain.

The separation of an operational database from data warehouses is based on the different
structures and uses of data in these systems.

Because the two systems provide different functionalities and require different kinds of
data, it is necessary to maintain separate databases.

Difference between Database and Data Warehouse

Database | Data Warehouse
1. It is used for Online Transactional Processing (OLTP) but can be used for other objectives such as Data Warehousing. It records the data from the clients for history. | 1. It is used for Online Analytical Processing (OLAP). It reads the historical information of the customers for business decisions.
2. The tables and joins are complicated since they are normalized for the RDBMS. This is done to reduce redundant data and to save storage space. | 2. The tables and joins are simple since they are de-normalized. This is done to minimize the response time for analytical queries.
3. Data is dynamic. | 3. Data is largely static.
4. Entity-Relationship modeling procedures are used for RDBMS database design. | 4. Data modeling approaches are used for the Data Warehouse design.
5. Optimized for write operations. | 5. Optimized for read operations.
6. Performance is low for analysis queries. | 6. High performance for analytical queries.
7. The database is the place where the data is taken as a base and managed to get fast and efficient access. | 7. The Data Warehouse is the place where the application data is handled for analysis and reporting objectives.
Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.

Data Warehouse applications are designed to support the user ad-hoc data requirements, an
activity recently dubbed online analytical processing (OLAP). These include applications
such as forecasting, profiling, summary reporting, and trend analysis.

Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables de-normalized, data cleansed of errors and redundancies, and new fields and keys added to reflect the needs of users for sorting, combining, and summarizing data.

Data warehouses and their architectures vary depending upon the elements of an organization's situation.

Three common architectures are:

○ Data Warehouse Architecture: Basic


○ Data Warehouse Architecture: With Staging Area
○ Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Operational System

An operational system is a term used in data warehousing to refer to a system that processes the day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and every file
in the system must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata is used in a data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.
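A tiny sketch of that routing idea: a metadata catalog (entirely hypothetical here) records which data source holds each subject, and queries are directed accordingly.

```python
# Hypothetical metadata catalog: subject -> the data source that holds it.
CATALOG = {
    "sales":     {"source": "sales_mart", "grain": "daily",  "last_load": "2024-01-05"},
    "inventory": {"source": "stock_mart", "grain": "hourly", "last_load": "2024-01-06"},
}

def route_query(subject):
    """Use the metadata to pick the most appropriate data source for a query."""
    entry = CATALOG.get(subject)
    if entry is None:
        raise KeyError(f"no source registered for subject {subject!r}")
    return entry["source"]

print(route_query("sales"))  # sales_mart
```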

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance. The summarized records are updated continuously as new information is loaded into the warehouse.
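Continuously refreshing a summary as new detail rows load can be sketched with a running aggregate; the batch contents and key names below are illustrative.

```python
from collections import defaultdict

# The "highly summarized" layer: product -> total revenue.
summary = defaultdict(float)

def load_batch(detail_rows):
    """Incrementally update the summary as each new batch of detail rows is loaded."""
    for product, revenue in detail_rows:
        summary[product] += revenue

load_batch([("laptop", 999.0), ("phone", 499.0)])
load_batch([("laptop", 999.0)])   # the summary is refreshed, not rebuilt from scratch

print(dict(summary))  # {'laptop': 1998.0, 'phone': 499.0}
```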

End-User access Tools

The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These customers interact with the warehouse using end-client access tools.

The examples of some of the end-user access tools can be:

○ Reporting and Query Tools


○ Application Development Tools
○ Executive Information Systems Tools
○ Online Analytical Processing Tools
○ Data Mining Tools

Data Warehouse Architecture: With Staging Area


We must clean and process our operational information before putting it into the warehouse.

We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse) instead.
A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.

The Data Warehouse Staging Area is a temporary location where data from the source systems is copied.

Data Warehouse Architecture: With Staging Area and Data Marts


We may want to customize our warehouse's architecture for multiple groups within our
organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse that can provide information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are separated. In this
example, a financial analyst wants to analyze historical data for purchases and sales or
mine historical information to make predictions about customer behavior.

Properties of Data Warehouse Architectures
The following architecture properties are necessary for a data warehouse system:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.

2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume, which has to be managed and processed, and the number of users' requirements, which have to be met, progressively increase.

3. Extensibility: The architecture should be able to accommodate new operations and technologies without redesigning the whole system.

4. Security: Monitoring accesses is necessary because of the strategic data stored in the data warehouse.

5. Administerability: Data Warehouse management should not be complicated.

Types of Data Warehouse Architectures

Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.

The figure shows the only layer physically available is the source layer. In this method,
data warehouses are virtual. This means that the data warehouse is implemented as a
multidimensional view of operational data created by specific middleware, or an
intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analytical queries are submitted to the operational data after the middleware interprets them. In this way, queries affect transactional workloads.

Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture
for a data warehouse system, as shown in fig:

Although it is typically called a two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from an information system outside the corporate walls.
2. Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
3. Data Warehouse layer: Information is saved in one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and customer-friendly GUIs.

Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration.

This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space used by the extra redundant reconciled layer. It also makes the analytical tools a little further away from being real-time.

What is Dimensional Modeling?
Dimensional modeling represents data with a cube operation, making the logical data representation more suitable for OLAP data management. The concept of dimensional modeling was developed by Ralph Kimball and consists of "fact" and "dimension" tables.

In dimensional modeling, the transaction record is divided into either "facts," which are frequently numerical transaction data, or "dimensions," which are the reference information that gives context to the facts. For example, a sales transaction can be broken down into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the order.

Objectives of Dimensional Modeling


The purposes of dimensional modeling are:

1. To produce a database architecture that is easy for end-clients to understand and write queries against.
2. To maximize the efficiency of queries. It achieves these goals by minimizing the number of tables and the relationships between them.

Advantages of Dimensional Modeling


Following are the benefits of dimensional modeling:

Dimensional modeling is simple: Dimensional modeling methods make it possible for warehouse designers to create database schemas that business customers can easily grasp and comprehend. There is no need for extensive training on how to read diagrams, and there are no complicated relationships between different data elements.

Dimensional modeling promotes data quality: The star schema enables warehouse administrators to enforce referential integrity checks on the data warehouse. Since the fact table key is a concatenation of the keys of its associated dimensions, a fact record is only loaded if the corresponding dimension records are duly defined and also exist in the database.

By enforcing foreign key constraints as a form of referential integrity check, data warehouse DBAs add a line of defense against corrupted warehouse data.
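This foreign-key defense can be demonstrated directly in SQLite on a hypothetical one-dimension star schema. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON` is issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite disables FK checks by default
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
    units      INTEGER
);
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'laptop')")

conn.execute("INSERT INTO fact_sales VALUES (1, 5)")       # OK: dimension row exists

try:
    conn.execute("INSERT INTO fact_sales VALUES (99, 5)")  # no such product
    rejected = False
except sqlite3.IntegrityError:
    rejected = True   # the corrupt fact row is kept out of the warehouse

print(rejected)  # True
```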

Performance optimization is possible through aggregates: As the size of the data warehouse increases, performance optimization develops into a pressing concern. Customers who have to wait for hours to get a response to a query will quickly become discouraged with the warehouse. Aggregates are one of the easiest methods by which query performance can be optimized.

Disadvantages of Dimensional Modeling


1. To maintain the integrity of facts and dimensions, loading the data warehouse with records from various operational systems is complicated.
2. It is difficult to modify the data warehouse operation if the organization adopting the dimensional technique changes the method in which it does business.

Elements of Dimensional Modeling

1. Fact
It is a collection of associated data items, consisting of measures and context data.
It typically represents business items or business transactions.

2. Dimensions
It is a collection of data which describe one business dimension. Dimensions decide
the contextual background for the facts, and they are the framework over which
OLAP is performed.

3. Measure
It is a numeric attribute of a fact, representing the performance or behavior of the
business relative to the dimensions.

Considering the relational context, there are two basic models which are used in
dimensional modeling:

○ Star Model
○ Snowflake Model

The star model is the underlying structure for a dimensional model. It has one broad central table (the fact table) and a set of smaller tables (the dimensions) arranged in a radial design around the central table. The snowflake model is the result of decomposing one or more of the dimensions.

Fact Table
Fact tables are used to store facts or measures of the business. Facts are the numeric data
elements that are of interest to the company.

Characteristics of the Fact table

● The fact table includes numerical values of what we measure. For example, a fact
value of 20 might mean that 20 widgets have been sold.
● Each fact table includes the keys to associated dimension tables. These are known
as foreign keys in the fact table.
● Fact tables typically include a small number of columns.
● When it is compared to dimension tables, fact tables have a large number of rows.

Dimension Table
Dimension tables establish the context of the facts. Dimensional tables store fields that
describe the facts.

Characteristics of the Dimension table

Dimension tables contain the details about the facts. This enables business
analysts, for example, to understand the data and their reports better.

28
The dimension tables include descriptive data about the numerical values in the fact table.
That is, they contain the attributes of the facts. For example, the dimension tables for a
marketing analysis function might include attributes such as time, marketing region, and
product type.

Since the record in a dimension table is denormalized, it usually has a large number of
columns. The dimension tables include significantly fewer rows of information than the
fact table.

The attributes in a dimension table are used as row and column headings in a document or
query results display.

Example: A store summary in a fact table can be viewed by city and state, an item summary
by brand and color, and customer information by name and address.

29
Fact Table

Time ID Product ID Customer ID Unit Sold

4 17 2 1

8 21 3 2

8 4 1 1

In this example, the Customer ID column in the fact table is a foreign key that joins with
the dimension table. By following the links, we can see that row 2 of the fact table
records the fact that customer 3, Gaurav, bought two items on day 8.

Dimension Tables

Customer Name Gender Income Education Region


ID

1 Rohan Male 2 3 4

2 Sandeep Male 3 5 1

3 Gaurav Male 1 7 3
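The join described above can be sketched with an in-memory SQLite database. This is an illustrative sketch, not production warehouse DDL; the table and column names simply mirror the example tables:

```python
import sqlite3

# Build the example star fragment in memory: one fact table plus a
# customer dimension table, then join them through the customer key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer_dim (
    customer_id INTEGER PRIMARY KEY,
    name TEXT, gender TEXT, income INTEGER, education INTEGER, region INTEGER
);
CREATE TABLE sales_fact (
    time_id INTEGER, product_id INTEGER,
    customer_id INTEGER REFERENCES customer_dim(customer_id),
    units_sold INTEGER
);
INSERT INTO customer_dim VALUES
    (1, 'Rohan',   'Male', 2, 3, 4),
    (2, 'Sandeep', 'Male', 3, 5, 1),
    (3, 'Gaurav',  'Male', 1, 7, 3);
INSERT INTO sales_fact VALUES (4, 17, 2, 1), (8, 21, 3, 2), (8, 4, 1, 1);
""")

# Follow the foreign key: who bought what on day 8?
row = con.execute("""
    SELECT d.name, f.units_sold
    FROM sales_fact f
    JOIN customer_dim d ON d.customer_id = f.customer_id
    WHERE f.time_id = 8 AND f.customer_id = 3
""").fetchone()
print(row)  # ('Gaurav', 2)
```

Running the query confirms the row-2 fact from the text: customer 3 (Gaurav) bought two items on day 8.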

Hierarchy
A hierarchy is a directed tree whose nodes are dimensional attributes and whose arcs
model many-to-one associations between dimensional attributes. It contains a
dimension, positioned at the tree's root, and all of the dimensional attributes that define it.

30
What is Multi-Dimensional Data Model?
A multidimensional model views data in the form of a data-cube. A data cube enables data
to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities concerning which an organization keeps
records. For example, a shop may create a sales data warehouse to keep records of the
store's sales for the dimensions time, item, and location. These dimensions allow the store to
keep track of things such as monthly sales of items and the locations at which the
items were sold. Each dimension has a table related to it, called a dimension table, which
describes the dimension further. For example, a dimension table for an item may contain
the attributes item_name, brand, and type.

A multidimensional data model is organized around a central theme, for example, sales.
This theme is represented by a fact table. Facts are numerical measures. The fact table
contains the names of the facts or measures of the related dimensional tables.

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is
shown in the table. In this 2D representation, the sales for Delhi are shown for the time
dimension (organized in quarters) and the item dimension (classified according to the
types of items sold). The fact or measure displayed is rupees_sold (in thousands).

31
Now suppose we want to view the sales data with a third dimension. For example, suppose
the data is considered according to time and item, as well as location, for the cities
Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table. The 3D data
of the table are represented as a series of 2D tables.

Conceptually, it may also be represented by the same data in the form of a 3D data cube,
as shown in fig:

32
33
Data Cube
What is Data Cube?
When data is grouped or combined into multidimensional matrices, the result is called a
data cube. The data cube method has a few alternative names or variants, such as
"multidimensional databases," "materialized views," and "OLAP (On-Line Analytical
Processing)."

The general idea of this approach is to materialize certain expensive computations that are
frequently inquired.

For example, a relation with the schema sales (part, supplier, customer, and sale-price) can
be materialized into a set of eight views as shown in fig, where psc indicates a view
consisting of aggregate function value (such as total-sales) computed by grouping three
attributes part, supplier, and customer, p indicates a view composed of the corresponding
aggregate function values calculated by grouping part alone, etc.

A data cube is created from a subset of attributes in the database. Specific attributes are
chosen to be measure attributes, i.e., the attributes whose values are of interest. Other
attributes are selected as dimensions or functional attributes. The measure attributes are
aggregated according to the dimensions.
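The idea of aggregating a measure according to a chosen set of dimensions can be sketched in plain Python. The sample records and names below are made up for illustration; they mirror the sales (part, supplier, customer, sale-price) relation mentioned in the text:

```python
from collections import defaultdict

# Toy sales records: (part, supplier, customer) dimension values plus a
# sale_price measure, mirroring sales(part, supplier, customer, sale-price).
sales = [
    ("bolt", "acme", "c1", 10),
    ("bolt", "acme", "c2", 15),
    ("nut",  "apex", "c1", 5),
]

def aggregate(records, dims):
    """Sum the measure, grouping by the chosen dimension attributes."""
    totals = defaultdict(int)
    for part, supplier, customer, price in records:
        key = tuple(value for value, name in
                    zip((part, supplier, customer),
                        ("part", "supplier", "customer"))
                    if name in dims)
        totals[key] += price
    return dict(totals)

# The 'p' view from the figure: total sales grouped by part alone.
print(aggregate(sales, {"part"}))  # {('bolt',): 25, ('nut',): 5}
```

Passing all three dimension names would reproduce the detailed psc view, while passing an empty set would yield the single grand total.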

For example, XYZ may create a sales data warehouse to keep records of the store's sales
for the dimensions time, item, branch, and location. These dimensions enable the store to

34
keep track of things like monthly sales of items, and the branches and locations at which
the items were sold. Each dimension may have a table associated with it, known as a
dimension table, which describes the dimension. For example, a dimension table for
items may contain the attributes item_name, brand, and type.

The data cube method is an interesting technique with many applications. Data cubes can be
sparse in many cases, because not every cell in each dimension may have corresponding
data in the database.

Techniques should be developed to handle sparse cubes efficiently.

If a query contains constants at even lower levels than those provided in a data cube, it is
not clear how to make the best use of the precomputed results stored in the data cube.

This model views data in the form of a data cube. OLAP tools are based on the
multidimensional data model. Data cubes usually model n-dimensional data.

A data cube enables data to be modeled and viewed in multiple dimensions. A


multidimensional data model is organized around a central theme, like sales and
transactions. A fact table represents this theme. Facts are numerical measures. Thus, the
fact table contains measures (such as Rs_sold) and keys to each of the related dimension
tables.

Dimensions are the perspectives or entities that define a data cube. Facts are generally
quantities, which are used for analyzing the relationships between dimensions.

35
Example: In the 2-D representation, we will look at the All Electronics sales data for items
sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in
thousands).

3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example,
suppose we would like to view the data according to time and item, as well as location, for
the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold
(in thousands). These 3-D data are shown in the table. The 3-D data of the table are
represented as a series of 2-D tables.

Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in
fig:

36

Let us suppose that we would like to view our sales data with an additional fourth
dimension, such as a supplier.

In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest
level of summarization is called a base cuboid.

For example, the 4-D cuboid in the figure is the base cuboid for the given time, item,
location, and supplier dimensions.

37
The figure shows a 4-D data cube representation of sales data, according to the dimensions
time, item, location, and supplier. The measure displayed is dollars sold (in thousands).

The topmost 0-D cuboid, which holds the highest level of summarization, is known as the
apex cuboid. In this example, this is the total sales, or dollars sold, summarized over all
four dimensions.

The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids forming a
4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents
a different degree of summarization.
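The lattice itself is easy to enumerate: for n dimensions there are 2^n cuboids, one per subset of dimensions. A minimal sketch (dimension names taken from the example above):

```python
from itertools import combinations

# For n dimensions, the data cube is the lattice of all 2^n group-bys
# (cuboids), from the base cuboid (grouping on every dimension) down to
# the apex cuboid (the empty group-by, i.e. the grand total).
dims = ("time", "item", "location", "supplier")

cuboids = [frozenset(c)
           for r in range(len(dims), -1, -1)
           for c in combinations(dims, r)]

print(len(cuboids))  # 16 cuboids for a 4-D cube
print(sorted(cuboids[0]))   # base cuboid: all four dimensions
print(len(cuboids[-1]))     # apex cuboid: zero dimensions
```

Each frozenset names the dimensions retained by one cuboid; an OLAP engine materializes some or all of these group-bys in advance.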

38
Schema
Schema is a logical description of the entire database. It includes the name and description
of records of all record types including all associated data-items and aggregates. Much like
a database, a data warehouse also needs to maintain a schema. A database uses the
relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation
schema.

What is Star Schema?


A star schema is the elementary form of a dimensional model, in which data are organized
into facts and dimensions. A fact is an event that is counted or measured, such as a sale
or a login. A dimension contains reference data about the fact, such as date, item, or
customer.

A star schema is a relational schema whose design represents a
multidimensional data model. The star schema is the simplest data warehouse schema. It is
known as a star schema because the entity-relationship diagram of this schema resembles a
star, with points diverging from a central table. The center of the schema consists of a
large fact table, and the points of the star are the dimension tables.

39
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to dimensions.
A fact table has two types of columns: those that contain facts and those that are foreign
keys to the dimension tables. The primary key of the fact table is generally a composite
key made up of all of its foreign keys.

A fact table might contain either detail-level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables). A fact table
generally contains facts at the same level of aggregation.

Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize
data. If a dimension does not have hierarchies and levels, it is called a flat dimension or list.
The primary keys of each of the dimension tables are part of the composite primary key
of the fact table. Dimensional attributes help to define the dimensional values. They are
generally descriptive, textual values. Dimension tables are usually smaller in size than fact
tables.

Fact tables store data about sales, while dimension tables store data about the geographic
region (markets, cities), clients, products, times, and channels.

Characteristics of Star Schema


The star schema is highly suitable for data warehouse database design because of the
following features:

○ It creates a denormalized database that can quickly provide query responses.


○ It provides a flexible design that can be changed easily or added to throughout
the development cycle, and as the database grows.
○ Its design parallels how end-users typically think about and use the
data.
○ It reduces the complexity of metadata for both developers and end-users.

40
Advantages of Star Schema
Star schemas are easy for end-users and applications to understand and navigate. With a
well-designed schema, users can quickly analyze large, multidimensional data
sets.

The main advantages of star schemas in a decision-support environment are:

Query Performance
Because a star schema database has a limited number of tables and clear join paths, queries
run faster than they do against OLTP systems. Small single-table queries, frequently against
a dimension table, are almost instantaneous. Large join queries that contain multiple tables
take only seconds or minutes to run.

In a star schema database design, the dimensions are connected only through the central
fact table. When two dimension tables are used in a query, only one join path,
intersecting the fact table, exists between those two tables. This design feature enforces
accurate and consistent query results.

Load performance and administration


Structural simplicity also decreases the time required to load large batches of records into a
star schema database. By describing facts and dimensions and separating them into the

41
various tables, the impact of a load on the structure is reduced. Dimension tables can be
populated once and occasionally refreshed. We can add new facts regularly and selectively
by appending records to a fact table.

Built-in referential integrity


A star schema has referential integrity built in when data is loaded. Referential
integrity is enforced because each record in a dimension table has a unique primary key, and
all keys in the fact table are legitimate foreign keys drawn from the dimension tables. A
record in the fact table that is not related correctly to a dimension cannot be given the
correct key value to be retrieved.

Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through
the fact table. These joins are meaningful to the end-user because they represent the
fundamental relationships between parts of the underlying business. Users can also
browse dimension table attributes before constructing a query.

Disadvantage of Star Schema


There are some conditions that cannot be met by star schemas. For example, the relationship
between a user and a bank account cannot be described as a star schema, because the
relationship between them is many-to-many.

Example: Suppose a star schema is composed of a fact table, SALES, and several dimension
tables connected to it for time, branch, item, and geographic locations.

The TIME table has a column for each day, month, quarter, and year. The ITEM table has
columns for each item_Key, item_name, brand, type, supplier_type. The BRANCH table has
columns for each branch_key, branch_name, branch_type. The LOCATION table has
columns of geographic data, including street, city, state, and country.

42
In this scenario, the SALES table contains only four columns with IDs from the dimension
tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four
columns for ITEM data, three columns for BRANCH data, and four columns for LOCATION
data. Thus, the size of the fact table is significantly reduced. When we need to change an
item, we need only make a single change in the dimension table, instead of making many
changes in the fact table.
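The SALES example above, together with the built-in referential integrity discussed earlier, can be sketched with SQLite. The column names are illustrative assumptions based on the text, and the foreign-key check is simplified:

```python
import sqlite3

# Star schema sketch for the SALES example: the fact table holds only
# dimension keys plus measures; foreign keys provide the referential
# integrity described above.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite requires this per connection
con.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day INT, month INT,
                           quarter INT, year INT);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                           brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                           branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
                           city TEXT, state TEXT, country TEXT);
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time_dim,
    item_key     INTEGER REFERENCES item_dim,
    branch_key   INTEGER REFERENCES branch_dim,
    location_key INTEGER REFERENCES location_dim,
    units_sold   INTEGER,
    rupees_sold  REAL
);
""")

# A fact row whose keys have no matching dimension rows is rejected,
# illustrating the referential-integrity defense described earlier.
try:
    con.execute("INSERT INTO sales VALUES (1, 99, 1, 1, 5, 120.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

Loading the dimension rows first, then the fact rows, is exactly the load order the star schema encourages.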

We can create even more complex star schemas by normalizing a dimension table into
several tables. The normalized dimension table is called a Snowflake.

43
What is Snowflake Schema?
A snowflake schema is a variant of the star schema. "A schema is known as a snowflake
if one or more dimension tables do not connect directly to the fact table but must join
through other dimension tables."

The snowflake schema is an expansion of the star schema in which each point of the star
explodes into more points. It is called a snowflake schema because its diagram resembles a
snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema.
When we normalize all the dimension tables entirely, the resultant structure resembles a
snowflake with the fact table in the middle.

Snowflaking is used to improve the performance of specific queries. The schema is
diagrammed with each fact surrounded by its associated dimensions, and those dimensions
are related to other dimensions, branching out into a snowflake pattern.

The snowflake schema consists of one fact table that is linked to many dimension tables,
which can be linked to other dimension tables through a many-to-one relationship. Tables
in a snowflake schema are generally normalized to third normal form. Each dimension
table represents exactly one level in a hierarchy.

The following diagram shows a snowflake schema with two dimensions, each having three
levels. A snowflake schema can have any number of dimensions, and each dimension can
have any number of levels.

44
Example: Figure shows a snowflake schema with a Sales fact table, with Store, Location,
Time, Product, Line, and Family dimension tables. The Market dimension has two
dimension tables with Store as the primary dimension table, and Location as the outrigger
dimension table. The product dimension has three dimension tables with Product as the
primary dimension table, and the Line and Family table are the outrigger dimension tables.

A star schema stores all attributes for a dimension in one denormalized table. This requires
more disk space than a more normalized snowflake schema. Snowflaking normalizes the
dimension by moving attributes with low cardinality into separate dimension tables that
relate to the core dimension table by using foreign keys. Snowflaking for the sole purpose
of minimizing disk space is not recommended, because it can adversely impact query
performance.
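The Product/Line/Family decomposition from the example above can be sketched as normalized outrigger tables reached through foreign keys. The sample rows and column names below are made up for illustration:

```python
import sqlite3

# Snowflake sketch of the product dimension: Product is the primary
# dimension table; Line and Family are outrigger tables reached through
# foreign keys (all sample data is invented).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE family  (family_key INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE line    (line_key   INTEGER PRIMARY KEY, line_name TEXT,
                      family_key INTEGER REFERENCES family);
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT,
                      line_key    INTEGER REFERENCES line);
INSERT INTO family  VALUES (1, 'Hardware');
INSERT INTO line    VALUES (10, 'Fasteners', 1);
INSERT INTO product VALUES (100, 'Hex bolt', 10);
""")

# Resolving a product's family now takes two joins instead of reading
# one wide denormalized dimension row.
row = con.execute("""
    SELECT p.product_name, f.family_name
    FROM product p
    JOIN line   l ON l.line_key   = p.line_key
    JOIN family f ON f.family_key = l.family_key
""").fetchone()
print(row)  # ('Hex bolt', 'Hardware')
```

The extra joins are the price paid for removing the low-cardinality attributes from the core dimension table.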

In a snowflake schema, tables are normalized to remove redundancy; dimension tables are
decomposed into multiple dimension tables.

45
The figure shows a simple star schema for sales in a manufacturing company. The sales fact
table includes quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT,
and TIME are the dimension tables.

The star schema for sales, as shown above, contains only five tables, whereas the
normalized version extends to eleven tables. Notice that in the snowflake
schema, the attributes with low cardinality in each original dimension table are removed
to form separate tables. These new tables are connected back to the original dimension
table through artificial keys.

46
A snowflake schema is designed for flexible querying across more complex dimensions and
relationships. It is suitable for many-to-many and one-to-many relationships between
dimension levels.

Advantage of Snowflake Schema


1. The primary advantage of the snowflake schema is the improvement in query
performance due to minimized disk storage requirements and joins against smaller lookup
tables.
2. It provides greater scalability in the interrelationship between dimension levels and
components.
3. No redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema


1. The primary disadvantage of the snowflake schema is the additional maintenance
effort required due to the increased number of lookup tables.

47
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution time.

Difference between Star and Snowflake Schemas

Star Schema

○ In a star schema, the fact table is at the center and is connected
to the dimension tables.
○ The tables are completely denormalized in structure.
○ SQL query performance is good, as fewer joins are involved.
○ Data redundancy is high and occupies more disk space.

Snowflake Schema

○ A snowflake schema is an extension of the star schema where the dimension
tables are connected to one or more further dimension tables.
○ The tables are partially denormalized in structure.
○ The performance of SQL queries is a bit lower when compared to the star
schema, as more joins are involved.
○ Data redundancy is low and occupies less disk space when compared to the
star schema.

48
Let's differentiate between the Star and Snowflake schemas.

Basis for Comparison: Star Schema vs. Snowflake Schema

○ Ease of maintenance/change: The star schema has redundant data and hence is less
easy to maintain and change. The snowflake schema has no redundancy and is
therefore easier to maintain and change.
○ Ease of use: The star schema has less complex queries and is simple to
understand. The snowflake schema has more complex queries and is therefore less
easy to understand.
○ Parent table: In a star schema, a dimension table will not have any parent
table. In a snowflake schema, a dimension table will have one or more parent
tables.
○ Query performance: The star schema has fewer foreign keys and hence less query
execution time. The snowflake schema has more foreign keys and thus more query
execution time.
○ Normalization: The star schema has denormalized tables. The snowflake schema
has normalized tables.

49
○ Type of data warehouse: The star schema is good for data marts with simple
relationships (one-to-one or one-to-many). The snowflake schema is good for a
data warehouse core, to simplify complex relationships (many-to-many).
○ Joins: The star schema has fewer joins; the snowflake schema has a higher
number of joins.
○ Dimension table: The star schema contains only a single dimension table for
each dimension. The snowflake schema may have more than one dimension table for
each dimension.
○ Hierarchies: In a star schema, hierarchies for a dimension are stored in the
dimension table itself. In a snowflake schema, hierarchies are broken into
separate tables; these hierarchies help to drill down the information from the
topmost to the lowermost level.
○ When to use: When the dimension table contains a small number of rows, we can
go for the star schema. When the dimension table stores a huge number of rows
with redundant information and space is an issue, we can choose the snowflake
schema to save space.
○ Data warehouse system: The star schema works best in any data warehouse/data
mart; the snowflake schema is better for a small data warehouse/data mart.

50
What is Fact Constellation Schema?
A fact constellation has two or more fact tables sharing one or more dimension tables. It is
also called a galaxy schema.

A fact constellation schema describes a logical structure of a data warehouse or data mart.
It can be designed with a collection of denormalized fact tables and shared, conformed
dimension tables.

A fact constellation schema is a sophisticated database design in which it is difficult to
summarize information. It can be implemented between aggregate fact tables or by
decomposing a complex fact table into independent simple fact tables.

Example: A fact constellation schema is shown in the figure below.

51
This schema defines two fact tables, sales, and shipping. Sales are treated along four
dimensions, namely, time, item, branch, and location. The schema contains a fact table for
sales that includes keys to each of the four dimensions, along with two measures:
Rupee_sold and units_sold. The shipping table has five dimensions, or keys: item_key,
time_key, shipper_key, from_location, and to_location, and two measures: Rupee_cost and
units_shipped.
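The two fact tables sharing conformed dimensions can be sketched with SQLite. The sample rows are invented; the table and key names follow the example above:

```python
import sqlite3

# Fact constellation sketch: the sales and shipping fact tables share
# the time and item dimension tables (all sample data is made up).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales    (time_key INT REFERENCES time_dim,
                       item_key INT REFERENCES item_dim,
                       rupees_sold REAL, units_sold INT);
CREATE TABLE shipping (time_key INT REFERENCES time_dim,
                       item_key INT REFERENCES item_dim,
                       shipper_key INT, rupees_cost REAL, units_shipped INT);
INSERT INTO time_dim VALUES (1, 'Q1');
INSERT INTO item_dim VALUES (7, 'Monitor');
INSERT INTO sales    VALUES (1, 7, 250.0, 10);
INSERT INTO shipping VALUES (1, 7, 3, 40.0, 10);
""")

# Because the dimensions are shared (conformed), the two facts can be
# compared along the same item and quarter.
row = con.execute("""
    SELECT i.item_name, s.units_sold, sh.units_shipped
    FROM sales s
    JOIN shipping sh ON sh.time_key = s.time_key AND sh.item_key = s.item_key
    JOIN item_dim i  ON i.item_key  = s.item_key
""").fetchone()
print(row)  # ('Monitor', 10, 10)
```

Sharing the dimension tables is what distinguishes a constellation from two unrelated star schemas.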

The primary disadvantage of the fact constellation schema is that it is a more challenging
design because many variants for specific kinds of aggregation must be considered and
selected.

Data Warehouse Applications


The application areas of the data warehouse are:

Information Processing

It deals with querying, statistical analysis, and reporting via tables, charts, or graphs.
Nowadays, information processing of a data warehouse involves constructing low-cost,
web-based access tools, typically integrated with web browsers.

52
Analytical Processing

It supports various online analytical processing such as drill-down, roll-up, and pivoting. The
historical data is being processed in both summarized and detailed format.

OLAP is implemented on data warehouses or data marts. The primary objective of OLAP is
to support the ad-hoc querying needed by decision support systems (DSS). The
multidimensional view of data is fundamental to OLAP applications. OLAP is an operational
view, not a data structure or schema. The complex nature of OLAP applications requires a
multidimensional view of the data.

Data Mining

It helps in the analysis of hidden patterns and associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using
visualization tools.

Data mining is the technique of discovering meaningful new correlations, patterns, and
trends by sifting through large amounts of data stored in repositories, using pattern
recognition technologies as well as statistical and mathematical techniques.

It is the process of selection, exploration, and modeling of huge quantities of information to
determine regularities or relations that are at first unknown, in order to obtain precise and
useful results for the owner of the database.

It is the process of inspection and analysis, by automatic or semi-automatic means, of


large quantities of records to discover meaningful patterns and rules.

53
Data Warehousing - Concepts

What is Data Warehousing?

Data warehousing is the process of constructing and using a data warehouse. A data
warehouse is constructed by integrating data from multiple heterogeneous sources that
support analytical reporting, structured and/or ad hoc queries, and decision making. Data
warehousing involves data cleaning, data integration, and data consolidations.

Using Data Warehouse Information

There are decision support technologies that help utilize the data available in a data
warehouse. These technologies help executives to use the warehouse quickly and effectively.
They can gather data, analyze it, and take decisions based on the information present in
the warehouse. The information gathered in a warehouse can be used in any of the
following domains −

● Tuning Production Strategies − The product strategies can be well tuned by


repositioning the products and managing the product portfolios by comparing the
sales quarterly or yearly.
● Customer Analysis − Customer analysis is done by analyzing the customer's buying
preferences, buying time, budget cycles, etc.
● Operations Analysis − Data warehousing also helps in customer relationship
management, and making environmental corrections. The information also allows us
to analyze business operations.

Integrating Heterogeneous Databases

To integrate heterogeneous databases, we have two approaches −

● Query-driven Approach
● Update-driven Approach

54
● Query-Driven Approach

This is the traditional approach to integrating heterogeneous databases. This approach is
used to build wrappers and integrators on top of multiple heterogeneous databases. These
integrators are also known as mediators.

● Process of Query-Driven Approach

When a query is issued on the client side, a metadata dictionary translates the query into an
appropriate form for the individual heterogeneous sites involved.

Now these queries are mapped and sent to the local query processor.

The results from heterogeneous sites are integrated into a global answer set.

Disadvantages

● Query-driven approach needs complex integration and filtering processes.


● This approach is very inefficient.
● It is very expensive for frequent queries.
● This approach is also very expensive for queries that require aggregations.

Update-Driven Approach

This is an alternative to the traditional approach. Today's data warehouse systems follow
the update-driven approach rather than the traditional approach discussed earlier. In the
update-driven approach, the information from multiple heterogeneous sources is integrated
in advance and stored in a warehouse. This information is available for direct querying
and analysis.

Advantages

This approach has the following advantages −

● This approach provides high performance.

55
● The data is copied, processed, integrated, annotated, summarized, and restructured in
a semantic data store in advance.
● Query processing does not require an interface to process data at local sources.

Functions of Data Warehouse Tools and Utilities

The following are the functions of data warehouse tools and utilities −

● Data Extraction − Involves gathering data from multiple heterogeneous sources.


● Data Cleaning − Involves finding and correcting the errors in data.
● Data Transformation − Involves converting the data from legacy format to
warehouse format.
● Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and
building indices and partitions.
● Refreshing − Involves updating the data from the sources to the warehouse.
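The extraction-to-loading pipeline above can be sketched in a few lines of Python. The field names, cleaning rule, and summary format are illustrative assumptions, not part of any standard tool:

```python
# Minimal sketch of the tool/utility functions listed above, applied to
# toy records (field names and the cleaning rule are invented).
raw = [
    {"name": " Rohan ", "amount": "120"},
    {"name": "Gaurav",  "amount": "bad"},   # dirty record
    {"name": "Sandeep", "amount": "80"},
]

def extract(source):
    # Data Extraction: gather records from a source
    return list(source)

def clean(records):
    # Data Cleaning: drop rows whose measure is not numeric
    return [r for r in records if r["amount"].strip().isdigit()]

def transform(records):
    # Data Transformation: legacy format -> warehouse format
    return [{"name": r["name"].strip(), "amount": int(r["amount"])}
            for r in records]

def load(records):
    # Data Loading: consolidate and summarize into the warehouse store
    return {"rows": records, "total": sum(r["amount"] for r in records)}

warehouse = load(transform(clean(extract(raw))))
print(warehouse["total"])  # 200
```

Real ETL tools perform the same stages at scale, with refreshing implemented as a repeated run of this pipeline against changed source data.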

Note − Data cleaning and data transformation are important steps in improving the quality
of data and data mining results.

56
