Data Warehouse Unit-3 Complete
A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived
from transaction data from one or more sources.
A Data Warehouse is a group of data specific to the entire organization, not only
to a particular group of users.
It is used for decision making rather than for daily operations and transaction
processing.
A Data Warehouse can be viewed as a data system with the following attributes:
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even further back from a data
warehouse. This contrasts with a transaction system, where often only the
most current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, populated with data
transformed from the source operational RDBMS. Operational updates of data do not
occur in the data warehouse, i.e., update, insert, and delete operations are not
performed. Data access usually requires only two procedures: initial
loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for
substantial speedup of data retrieval. Non-volatile means that once data has
entered the warehouse, it should not change.
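The two access procedures described above (loading and reading) can be sketched as a small append-only store. This is only an illustration: the class name Warehouse, its methods, and the sample values are hypothetical and do not come from any particular product.

```python
class Warehouse:
    """Append-only store: snapshots are loaded once and then only read."""

    def __init__(self):
        self._snapshots = {}

    def load(self, period, value):
        # Non-volatile: a period that is already loaded is never overwritten.
        if period in self._snapshots:
            raise ValueError(f"{period} already loaded; warehouse data is not updated")
        self._snapshots[period] = value

    def read(self, period):
        # Time-variant: older periods stay available next to the latest one.
        return self._snapshots[period]


warehouse = Warehouse()
warehouse.load("2024-01", 100)  # hypothetical monthly figures
warehouse.load("2024-02", 120)
```

Both months remain readable after the second load, which is the time-variant property, and a second load for the same month is rejected, which is the non-volatile property.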
Now we discuss the delivery process of the data warehouse. The main steps in the
data warehouse delivery process are as follows:
Data Warehouse Delivery Process
1. IT Strategy
○ Develop a strategy for securing and retaining funding for the data
warehouse project.
2. Business Case Analysis
A data warehouse and an OLTP database are both relational databases. However,
the goals of the two are different.
Consider the data of a shop for items sold per quarter in the city of Delhi. The
data is shown in the table. In this 2D representation, the sales for Delhi are shown
for the time dimension (organized in quarters) and the item dimension (classified
according to the types of items sold). The fact or measure displayed is
rupee_sold (in thousands).
Now suppose we want to view the sales data with a third dimension. For example,
suppose the data are considered according to time and item, as well as location,
for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are
shown in the table. The 3D data of the table are represented as a series of 2D
tables.
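The idea of viewing 3D data as a series of 2D tables can be sketched in Python. The figures below are made-up placeholders, not the values from the tables referred to above, and the function name slice_by_location is an assumption used only for illustration.

```python
# Hypothetical sales figures (rupees sold, in thousands), keyed by
# (location, quarter, item). These are NOT the values from the original table.
sales = {
    ("Delhi", "Q1", "Mobile"): 10,
    ("Delhi", "Q1", "Modem"): 20,
    ("Delhi", "Q2", "Mobile"): 30,
    ("Mumbai", "Q1", "Mobile"): 40,
    ("Mumbai", "Q2", "Modem"): 50,
}


def slice_by_location(cube, location):
    """Return the 2D (quarter, item) table for a single location."""
    return {
        (quarter, item): value
        for (loc, quarter, item), value in cube.items()
        if loc == location
    }


# One 2D table per city: together they represent the 3D data.
delhi_table = slice_by_location(sales, "Delhi")
mumbai_table = slice_by_location(sales, "Mumbai")
```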
Data Cube
A data cube is a multidimensional structure used to organize and analyze data. The
term is often used alongside multidimensional database, materialized view, and OLAP
(On-Line Analytical Processing), although these are related concepts rather than
exact synonyms. The main goal of a data cube is to store
precomputed, frequently queried data to enhance retrieval efficiency.
To create a data cube, specific attributes from a database are chosen as measure
attributes (values of interest, such as sales amounts) and dimensions (attributes
used to organize the data, such as time, item, branch, and location). For example,
a sales data warehouse might track sales across dimensions like time, item,
branch, and location, allowing detailed analysis such as monthly sales by item
and location. Dimension tables describe these dimensions with attributes like
item_name, brand, and type.
Data cubes are useful for various analytical applications but can be sparse,
meaning not all cells in each dimension have corresponding data. Efficient
techniques are needed to handle this sparsity. In a multidimensional data model,
data cubes allow data to be viewed and analyzed from multiple perspectives. A
central fact table, which contains numerical measures like total sales and keys to
dimension tables, anchors the model. OLAP tools leverage this structure to
enable complex queries and analyses.
Despite their benefits, data cubes can face challenges, such as efficiently using
precomputed results when queries include lower-level constants. Nonetheless,
they remain a powerful tool for data organization and analysis in data
warehousing and business intelligence.
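A minimal sketch of the precomputation idea, assuming a single measure (a sales amount) and three dimensions. The fact rows and the function name build_cube are hypothetical. The cube stores one aggregate table per subset of dimensions, and only the cells that have data are kept, which is one simple way to cope with sparsity.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("time", "item", "location")

# Hypothetical fact rows: (time, item, location, sales_amount)
facts = [
    ("Q1", "Mobile", "Delhi", 10),
    ("Q1", "Modem", "Delhi", 20),
    ("Q2", "Mobile", "Delhi", 30),
    ("Q1", "Mobile", "Mumbai", 40),
]


def build_cube(rows):
    """Precompute SUM(sales_amount) for every subset of the dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dims)
                totals[key] += row[-1]  # last field is the measure
            group_by = tuple(DIMENSIONS[i] for i in dims)
            cube[group_by] = dict(totals)
    return cube


cube = build_cube(facts)
total_sales = cube[()][()]  # grand total over all dimensions
sales_by_item = cube[("item",)]  # aggregated over time and location
delhi_q1 = cube[("time", "location")][("Q1", "Delhi")]
```

Once the cube is built, each of the lookups above is a dictionary access rather than a scan of the fact rows.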
What is Schema
Schema is a logical description of the entire database.
It includes the name and description of records of all record types including all
A database uses the relational model, while a data warehouse uses Star, Snowflake,
and Fact Constellation schemas.
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions.
The star schema is the simplest data warehouse schema. It is known as a star
schema because the entity-relationship diagram of this schema resembles a star,
with points radiating from a central table.
The center of the schema consists of a large fact table, and the points of the star
are the dimension tables.
Characteristics of Star Schema:
The dimension table is joined to the fact table using a foreign key
The Star schema is easy to understand and provides optimal disk usage.
The dimension tables are not normalized. For instance, in the above figure,
Country_ID does not have a Country lookup table as an OLTP design would have.
The schema is widely supported by BI Tools.
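The characteristics above can be illustrated with a small star schema built in an in-memory SQLite database. All table names, column names, and rows are hypothetical. The country is stored directly in the location dimension, so the dimension is not normalized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_item (
    item_id INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT
);
-- Denormalized dimension: country is kept in the same table as city.
CREATE TABLE dim_location (
    location_id INTEGER PRIMARY KEY, city TEXT, country TEXT
);
-- Fact table: foreign keys to the dimensions plus a numeric measure.
CREATE TABLE fact_sales (
    item_id INTEGER REFERENCES dim_item(item_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    quarter TEXT,
    rupees_sold INTEGER
);
INSERT INTO dim_item VALUES (1, 'Mobile', 'BrandA', 'Electronics'),
                            (2, 'Modem', 'BrandB', 'Electronics');
INSERT INTO dim_location VALUES (1, 'Delhi', 'India'), (2, 'Mumbai', 'India');
INSERT INTO fact_sales VALUES (1, 1, 'Q1', 10), (2, 1, 'Q1', 20), (1, 2, 'Q1', 40);
""")

# A single join from the fact table to one dimension answers the query.
rows = conn.execute("""
    SELECT l.city, SUM(f.rupees_sold)
    FROM fact_sales AS f
    JOIN dim_location AS l ON f.location_id = l.location_id
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
```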
Advantages:
Snowflake Schema
Definition:
● A snowflake schema is an expansion of the star schema where one or more dimension
tables do not connect directly to the fact table but join through other dimension tables.
● It is called a snowflake schema because its diagram resembles a snowflake.
Normalization:
Performance:
Structure:
● The schema consists of one fact table surrounded by its associated dimensions.
● These dimensions are related to other dimensions, creating a branching snowflake
pattern.
Relationships:
● The snowflake schema has many dimension tables, which can be linked to other
dimension tables through a many-to-one relationship.
Normalization Level:
● Tables in a snowflake schema are generally normalized to the third normal form.
Hierarchy:
● Each dimension table represents exactly one level in a hierarchy.
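As a sketch of the structure described above, the location dimension can be split into a city table that points to a country table, again using an in-memory SQLite database. The table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each table holds exactly one level of the location hierarchy.
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country_name TEXT);
CREATE TABLE dim_city (
    city_id INTEGER PRIMARY KEY,
    city_name TEXT,
    country_id INTEGER REFERENCES dim_country(country_id)  -- many-to-one
);
CREATE TABLE fact_sales (
    city_id INTEGER REFERENCES dim_city(city_id),
    rupees_sold INTEGER
);
INSERT INTO dim_country VALUES (1, 'India');
INSERT INTO dim_city VALUES (1, 'Delhi', 1), (2, 'Mumbai', 1);
INSERT INTO fact_sales VALUES (1, 10), (1, 20), (2, 40);
""")

# The country is reached through the city dimension, so two joins are needed.
sales_by_country = conn.execute("""
    SELECT co.country_name, SUM(f.rupees_sold)
    FROM fact_sales AS f
    JOIN dim_city AS ci ON f.city_id = ci.city_id
    JOIN dim_country AS co ON ci.country_id = co.country_id
    GROUP BY co.country_name
""").fetchall()
```

The country name is stored once instead of once per city, at the cost of an extra join in the query.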
Comparison of star schema and snowflake schema:
1. Star schema: the fact tables and the dimension tables are contained.
Snowflake schema: the fact tables, dimension tables, and sub-dimension tables are contained.
2. Star schema: it takes less time for the execution of queries.
Snowflake schema: it takes more time than the star schema for the execution of queries.
3. Star schema: normalization is not used.
Snowflake schema: both normalization and denormalization are used.
A Fact constellation means two or more fact tables sharing one or more
dimensions. It is also called Galaxy schema.
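A minimal sketch of a fact constellation, assuming two hypothetical fact tables (sales and shipping) that share one item dimension. The names and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT);
-- Two fact tables reference the same dimension table.
CREATE TABLE fact_sales (
    item_id INTEGER REFERENCES dim_item(item_id), units_sold INTEGER
);
CREATE TABLE fact_shipping (
    item_id INTEGER REFERENCES dim_item(item_id), units_shipped INTEGER
);
INSERT INTO dim_item VALUES (1, 'Mobile'), (2, 'Modem');
INSERT INTO fact_sales VALUES (1, 5), (1, 3), (2, 4);
INSERT INTO fact_shipping VALUES (1, 6), (2, 4);
""")

# The shared dimension lets one report combine measures from both fact tables.
report = conn.execute("""
    SELECT i.item_name,
           (SELECT SUM(units_sold) FROM fact_sales WHERE item_id = i.item_id),
           (SELECT SUM(units_shipped) FROM fact_shipping WHERE item_id = i.item_id)
    FROM dim_item AS i
    ORDER BY i.item_id
""").fetchall()
```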
Concept Hierarchy
A concept hierarchy is a directed acyclic graph of concepts where each concept is
identified by a unique name. In this hierarchy, an arc from concept A to concept B
indicates that A is a more general concept than B. Reports are tagged with concepts
that correspond to their content, and tagging a report with a concept also implicitly tags
it with all the ancestors in the concept hierarchy. Therefore, reports should be tagged
with the lowest possible concept to ensure accuracy.
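The implicit tagging described above can be sketched as a walk over the graph from a concept to all of its ancestors. The hierarchy below and the function name implied_tags are hypothetical.

```python
# Hypothetical concept hierarchy: each concept maps to its more general parents.
parents = {
    "Delhi": ["India"],
    "Mumbai": ["India"],
    "India": ["Asia"],
    "Asia": [],
}


def implied_tags(concept, hierarchy):
    """Return the concept together with all of its ancestors."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(hierarchy.get(current, []))
    return seen


# Tagging with the lowest concept implies every more general one.
tags = implied_tags("Delhi", parents)
```

Tagging a report with "India" instead would lose the information that it concerns Delhi, which is why the lowest applicable concept is preferred.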
The concept hierarchy, assumed to be predefined (a priori), can also be created for
documents using hierarchical clustering algorithms. These hierarchies map specific,
low-level concepts to more general, high-level concepts and are used in data
warehouses to express different levels of granularity of an attribute. They are crucial for
formulating useful OLAP queries, allowing users to summarize data at various levels.
For example, using a location hierarchy, users can retrieve and summarize sales data
for each location, area, state, or country without needing to reorganize the data.
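A minimal sketch of summarizing along a location hierarchy, assuming sales are recorded per city and each city maps to a state and a country. The mapping, the figures, and the function name roll_up are hypothetical.

```python
from collections import defaultdict

# Hypothetical location hierarchy: city -> (state, country)
hierarchy = {
    "Delhi": ("Delhi", "India"),
    "Mumbai": ("Maharashtra", "India"),
    "Pune": ("Maharashtra", "India"),
}
# Position of each level in the tuples above; None means the city itself.
LEVELS = {"city": None, "state": 0, "country": 1}

# Hypothetical sales stored at the lowest level of granularity (city).
city_sales = [("Delhi", 30), ("Mumbai", 40), ("Pune", 5)]


def roll_up(sales, level):
    """Summarize city-level sales at the requested level of the hierarchy."""
    index = LEVELS[level]
    totals = defaultdict(int)
    for city, amount in sales:
        key = city if index is None else hierarchy[city][index]
        totals[key] += amount
    return dict(totals)


by_state = roll_up(city_sales, "state")
by_country = roll_up(city_sales, "country")
```

The stored data stays at city level; only the grouping key changes with the requested level.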
Three-Tier Data Warehouse Architecture
Data Warehouses usually have a three-level (tier) architecture that includes the following tiers:
Bottom Tier
The bottom tier consists of the Data Warehouse server, which is almost always an RDBMS. It
may include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by
external consultants) are extracted using application program interfaces called gateways. A
gateway is supported by the underlying DBMS and allows client programs to generate SQL
code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking
and Embedding, Database), by Microsoft, and JDBC (Java Database Connectivity).
Middle-tier
The middle tier consists of an OLAP server for fast querying of the data warehouse. It is
typically implemented using either:
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations
on multidimensional data to standard relational operations; or
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly
implements multidimensional data and operations.
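The ROLAP idea of mapping an operation on multidimensional data to a standard relational operation can be sketched as a function that turns a list of dimensions into a GROUP BY query. The function name rollup_sql, the table, and the rows are hypothetical, and the sketch assumes the identifiers passed in are trusted.

```python
import sqlite3


def rollup_sql(table, measure, dimensions):
    """Build a GROUP BY query that aggregates the measure over the dimensions.

    Identifiers are interpolated directly, so they must come from trusted code.
    """
    cols = ", ".join(dimensions)
    return f"SELECT {cols}, SUM({measure}) FROM {table} GROUP BY {cols} ORDER BY {cols}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, item TEXT, city TEXT, rupees_sold INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Q1", "Mobile", "Delhi", 10),
        ("Q1", "Modem", "Delhi", 20),
        ("Q2", "Mobile", "Mumbai", 40),
    ],
)

# "Sales by quarter" becomes an ordinary relational aggregation.
query = rollup_sql("sales", "rupees_sold", ["quarter"])
result = conn.execute(query).fetchall()
```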
Top-tier
A top-tier that contains front-end tools for displaying results provided by OLAP, as well as
additional tools for data mining of the OLAP-generated data.
Data Marting
There are mainly two approaches to designing data marts. These approaches are
Designing
The design step is the first in the data mart process. This phase covers all of the
functions from initiating the request for a data mart through gathering data
about the requirements and developing the logical and physical design of the
data mart.
Constructing
This step involves creating the physical database and the logical structures
associated with the data mart to provide fast and efficient access to the data.
Populating
This step includes all of the tasks related to getting data from the source,
cleaning it up, modifying it to the right format and level of detail, and moving it
into the data mart.
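The populating tasks above (getting source data, cleaning it, changing its format and level of detail, and loading it) can be sketched as follows. The source rows and the function names to_quarter and populate are hypothetical.

```python
# Hypothetical raw rows from a source system: (date, item, amount as text)
source_rows = [
    ("2024-01-15", " mobile ", "10"),
    ("2024-02-03", "Modem", "20"),
    ("2024-02-20", "MOBILE", None),  # missing amount: dropped during cleaning
    ("2024-04-09", "mobile", "5"),
]


def to_quarter(date_text):
    """Reduce a YYYY-MM-DD date to a year-quarter label such as 2024-Q1."""
    month = int(date_text[5:7])
    return f"{date_text[:4]}-Q{(month - 1) // 3 + 1}"


def populate(rows):
    """Clean the rows, convert them to quarterly detail, and load the totals."""
    mart = {}
    for date_text, item, amount in rows:
        if amount is None:  # cleaning: skip incomplete records
            continue
        key = (to_quarter(date_text), item.strip().title())  # right format
        mart[key] = mart.get(key, 0) + int(amount)  # right level of detail
    return mart


data_mart = populate(source_rows)
```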
Accessing
This step involves putting the data to use: querying the data, analyzing it,
creating reports, charts and graphs and publishing them.
1. Set up an intermediate layer (Meta Layer) for the front-end tool to use. This
layer translates database structures and object names into business
terms so that end users can interact with the data mart using
words that relate to the business functions.
2. Set up and manage database structures, such as summary tables, which
help queries submitted through the front-end tools execute rapidly and
efficiently.
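A minimal sketch of such an intermediate layer, assuming a simple lookup from business terms to SQL fragments. The mapping, the table and column names, and the function name business_query are hypothetical.

```python
# Hypothetical meta layer: business terms mapped to physical SQL fragments.
META_LAYER = {
    "Revenue": "SUM(f.rupees_sold)",
    "City": "l.city",
    "Sales": "fact_sales AS f JOIN dim_location AS l ON f.location_id = l.location_id",
}


def business_query(measure, by, subject):
    """Translate a request phrased in business terms into a SQL string."""
    return (
        f"SELECT {META_LAYER[by]}, {META_LAYER[measure]} "
        f"FROM {META_LAYER[subject]} GROUP BY {META_LAYER[by]}"
    )


# The end user asks for "Revenue by City" without knowing any table names.
sql = business_query("Revenue", by="City", subject="Sales")
```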
Managing
This step involves managing the data mart over its lifetime. In this step, the
following management functions are performed:
The process architecture defines an architecture in which the data from the data
warehouse is processed for a particular computation.
Client-Server
In this architecture, the client handles information collection and presentation,
while the server handles the processing and management of data.
Three-tier Architecture
N-tier Architecture
Cluster Architecture
Peer-to-Peer Architecture
This is a type of architecture where there are no dedicated servers and clients.
Instead, all the processing responsibilities are allocated among all machines,
called peers. Each machine can perform the function of a client or server or just
process data.