Session 4 - Datawarehousing

The document discusses data warehouse concepts, focusing on data modeling schemas such as star and snowflake schemas. It outlines the advantages and disadvantages of each schema type, emphasizing the importance of selecting appropriate business processes, declaring the grain, identifying dimensions, and determining facts for effective data warehouse design. Additionally, it provides a use case for a retail business to illustrate the application of these concepts.

Data warehouse concepts, design & architecture

l.ben hiba - bi&a 2a - feb 2025



data warehouse
Data modeling: Schemas

• A schema is a collection of database objects, including tables, views, indexes, and synonyms
• The way these objects are arranged defines the schema model
• The most common schemas for data warehouses are the star schema and the snowflake schema
• The model of the source data and the analytical requirements help determine the appropriate schema design for the data warehouse

• The star schema is the most common schema for dimensional models: a fact table surrounded by multiple dimension tables
• The star schema is a compromise between a fully normalized and a denormalized model
• Facts are stored in normalized tables. Dimensions, on the other hand, are denormalized tables containing attributes that would be spread across multiple tables in a 3NF data model
• Star schemas have “flattened” (denormalized) dimension tables: each hierarchy is flattened into one table
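
A minimal sketch of the star layout described above, using Python's built-in sqlite3 module. The table names (dim_product, dim_store, fact_sales), the columns, and the sample rows are illustrative assumptions, not taken from the slides.

```python
import sqlite3

# In-memory database; every table and column name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized dimension: the product hierarchy
    -- (product -> brand -> category) is flattened into one table.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        brand        TEXT,
        category     TEXT
    );
    CREATE TABLE dim_store (
        store_key  INTEGER PRIMARY KEY,
        store_name TEXT,
        city       TEXT,
        state      TEXT
    );
    -- Fact table: one foreign key per dimension plus numeric measures.
    CREATE TABLE fact_sales (
        product_key    INTEGER REFERENCES dim_product(product_key),
        store_key      INTEGER REFERENCES dim_store(store_key),
        sales_quantity INTEGER,
        sales_amount   REAL
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?, ?)", [
    (1, "Cola 33cl", "BrandA", "Beverages"),
    (2, "Shampoo 250ml", "BrandB", "Health/Beauty"),
])
conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?, ?)", [
    (1, "Store 1", "CityX", "StateX"),
    (2, "Store 2", "CityY", "StateY"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (1, 1, 10, 15.0),
    (2, 1, 2, 9.0),
    (1, 2, 4, 6.0),
])

# A single join per dimension reaches every attribute of that dimension.
star_rows = conn.execute("""
    SELECT p.category, s.state, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_store   AS s ON s.store_key   = f.store_key
    GROUP BY p.category, s.state
    ORDER BY p.category, s.state
""").fetchall()
print(star_rows)
```

Because brand and category live in the same row as the product, grouping by category needs no join beyond the one to dim_product.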

• Advantages of a star schema:
• Easy for users to understand
• Optimized navigation, thanks to fewer joins
• Well suited to query processing, thanks to simple and straightforward join paths
• Allows special performance techniques such as STARjoin (a high-speed, single-pass, parallelizable, multitable join) and STARindex (a specialized index that accelerates join performance)

Source: Kimball, The Data Warehouse Toolkit



Source: R. Sherman, Business Intelligence Guidebook: From Data Integration to Analytics



• A snowflake schema normalizes the hierarchies within dimensions
• Each level in the dimensional hierarchy becomes its own dimension table, with parent keys linking the levels of the hierarchy together
• The fact table stores the foreign key to the lowest level of the dimensional hierarchy
• This technique is most often used with dimensions that have a very large number of rows and deep hierarchies that are relatively static
• Some BI tools are built specifically to leverage snowflake schemas
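
A minimal sketch of a snowflaked product dimension, again using Python's built-in sqlite3 module. The three-level hierarchy (product, brand, category), all table and column names, and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each hierarchy level is its own table, linked by a parent key.
    CREATE TABLE dim_category (
        category_key  INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE dim_brand (
        brand_key    INTEGER PRIMARY KEY,
        brand_name   TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        brand_key    INTEGER REFERENCES dim_brand(brand_key)
    );
    -- The fact table points only at the lowest level (product).
    CREATE TABLE fact_sales (
        product_key  INTEGER REFERENCES dim_product(product_key),
        sales_amount REAL
    );
    INSERT INTO dim_category VALUES (1, 'Beverages'), (2, 'Health/Beauty');
    INSERT INTO dim_brand VALUES (1, 'BrandA', 1), (2, 'BrandB', 2);
    INSERT INTO dim_product VALUES (1, 'Cola 33cl', 1), (2, 'Shampoo 250ml', 2);
    INSERT INTO fact_sales VALUES (1, 15.0), (2, 9.0), (1, 6.0);
""")

# Reaching the category now takes three joins instead of one.
snowflake_rows = conn.execute("""
    SELECT c.category_name, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_key  = f.product_key
    JOIN dim_brand    AS b ON b.brand_key    = p.brand_key
    JOIN dim_category AS c ON c.category_key = b.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(snowflake_rows)
```

Each category name is stored once, which is where the storage savings come from, but every query that groups by category pays for the extra joins.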

Source: Kimball, The Data Warehouse Toolkit



• Advantages of a snowflake schema:
• Small savings in storage space
• Normalized structures are easier to update and maintain

• Disadvantages of a snowflake schema:
• The schema is less intuitive, and end users are put off by its complexity
• Browsing through the contents is more difficult
• Query performance degrades because of the additional joins

• Almost all data warehouses contain multiple star schema structures
• Each star serves a specific purpose: tracking the measures stored in its fact table
• The fact tables of the star schemas share dimension tables (conformed dimensions)
• A collection of star schemas is called a family of stars, or a constellation
Data modeling: Fact table sizes

• Time dimension: 5 years × 365 days = 1,825 days
• Store dimension: 300 stores reporting daily sales
• Product dimension: 40,000 products in each store, of which about 4,000 sell in each store daily
• Promotion dimension: a sold item may be in only one promotion in a store on a given day

• Maximum number of base fact table records: 1,825 × 300 × 4,000 × 1 = 2.19 billion (about 2 billion)
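
The estimate above is a plain product of the per-dimension counts. A short sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Counts taken from the bullets above; variable names are illustrative.
days = 5 * 365                           # time dimension: 1,825 days
stores = 300                             # store dimension
products_sold_per_store_per_day = 4_000  # products that actually sell
promotions_per_sold_item = 1             # at most one promotion per item

max_rows = (days * stores
            * products_sold_per_store_per_day
            * promotions_per_sold_item)
print(f"{max_rows:,}")  # 2,190,000,000
```

The count uses the 4,000 products that sell per store per day rather than the full 40,000 on the shelves, since a base fact row only exists when a sale occurs.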

• Telephone call monitoring
• Time dimension: 5 years = 1,825 days
• Number of calls tracked each day: 150 million
• Maximum number of base fact table records: 1,825 × 150 million ≈ 274 billion
• Credit card transaction tracking
• Time dimension: 5 years = 60 months
• Number of credit card accounts: 150 million
• Average number of monthly transactions per account: 20
• Maximum number of base fact table records: 60 × 150 million × 20 = 180 billion
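
Both estimates follow the same pattern: multiply the counts along each dimension. A small sketch, where the helper name max_fact_rows is hypothetical:

```python
from math import prod

def max_fact_rows(*cardinalities):
    """Upper bound on fact rows: the product of the per-dimension counts."""
    return prod(cardinalities)

# Telephone call monitoring: days x calls per day.
call_rows = max_fact_rows(5 * 365, 150_000_000)

# Credit card tracking: months x accounts x monthly transactions per account.
card_rows = max_fact_rows(5 * 12, 150_000_000, 20)

print(f"{call_rows:,}")  # 273,750,000,000 (about 274 billion)
print(f"{card_rows:,}")  # 180,000,000,000
```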
Data modeling

Kimball approach
According to Kimball, there are four key decisions that must be made during the
design of a dimensional model:
1. Select the business process
2. Declare the grain
3. Identify the dimensions
4. Identify the facts
It is also important to decide on the duration of the database, that is, how far back in time the historical data should go.
Source: Kimball & Ross (2002)

Data warehouse design: use case

• A retail business has 100 grocery stores spread across five states
• Each store has a full complement of departments, including grocery, frozen foods,
dairy, meat, produce, bakery, floral, and health/beauty aids
• Each store has approximately 60,000 individual products, called stock keeping
units (SKUs), on its shelves
• Data is collected at the cash registers as customers purchase products

• The point-of-sale (POS) system scans product barcodes at the cash register, measuring consumer takeaway at the front door of the grocery store
• Some of the most significant management decisions have to do with pricing and promotions
• The visibility of all forms of promotion is an important part of analyzing the operations of a grocery store

1. Select the business process

• A business process is a low-level activity performed by an organization
• Business processes are generally supported by an operational system, such as the billing or purchasing system
• They generate or capture key performance metrics
• The performance measurements that users want to analyze in the DW/BI system result from business process events
• Identifying business processes is the result of multiple design sessions with business users

Source: Kimball (2013)



• Management wants to better understand customer purchases as captured by the POS system
• Target business process: POS retail sales transactions
• Aim: analyze which products are selling in which stores on which days under what promotional conditions in which transactions

2. Declare the grain

• What does an individual fact table row represent?
• The grain conveys the level of detail associated with the fact table measurements
• The atomic grain is the most detailed level that can be made available in the dimensional model
• Atomic data is highly dimensional and thus provides maximum analytic flexibility
• A summarized, higher-level grain limits the model to fewer and/or potentially less detailed dimensions
• We’d like to answer questions such as: How many shoppers took advantage of the 50-cents-off promotion on shampoo? What was the impact of decreased sales when a competitive diet soda product was promoted heavily?…
• The most atomic data is an individual product on a
POS transaction
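
A small sketch of why the grain matters, using plain Python and made-up rows. The field names and values are assumptions. Rows at the atomic grain (one product on one POS transaction) can be rolled up along any dimension, whereas rows stored only at a daily store grain can no longer be broken down by product.

```python
from collections import defaultdict

# Atomic grain: one row per product on a POS transaction (hypothetical data).
atomic_rows = [
    {"date": "2025-02-01", "store": "S1", "txn": 1001, "product": "Shampoo", "qty": 1},
    {"date": "2025-02-01", "store": "S1", "txn": 1001, "product": "Diet soda", "qty": 2},
    {"date": "2025-02-01", "store": "S1", "txn": 1002, "product": "Shampoo", "qty": 1},
    {"date": "2025-02-01", "store": "S2", "txn": 2001, "product": "Diet soda", "qty": 3},
]

def roll_up(rows, keys):
    """Sum qty over rows grouped by the given dimension keys."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row["qty"]
    return dict(totals)

# From atomic data, any roll-up is possible.
by_product = roll_up(atomic_rows, ["product"])
daily_store = roll_up(atomic_rows, ["date", "store"])

print(by_product)   # {('Shampoo',): 2, ('Diet soda',): 5}
print(daily_store)  # {('2025-02-01', 'S1'): 4, ('2025-02-01', 'S2'): 3}
```

If only daily_store had been loaded into the warehouse, the question about shampoo could not be answered, because the product dimension is gone at that grain.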

3. Identify the dimensions


• The descriptive dimensions that apply to the case: date, product, store, promotion, cashier, and method of payment
• Cryptic abbreviations, true/false flags, and operational indicators should be supplemented in dimension tables with full-text words that are meaningful when viewed independently
• Null values should be avoided, or at least replaced with "Unknown" or "Not Applicable"
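
A minimal sketch of this kind of decoding when loading a dimension. The payment codes, labels, and function names are hypothetical; a real source system would supply its own code list.

```python
# Hypothetical operational codes and their descriptive replacements.
PAYMENT_LABELS = {"CC": "Credit card", "DC": "Debit card", "CA": "Cash"}

def describe_payment(code):
    """Map a cryptic payment code to readable text; missing or unrecognized -> 'Unknown'."""
    if code is None:
        return "Unknown"
    return PAYMENT_LABELS.get(code, "Unknown")

def describe_promo_flag(flag):
    """Turn a true/false flag into text; a missing flag -> 'Not Applicable'."""
    if flag is None:
        return "Not Applicable"
    return "On promotion" if flag else "Not on promotion"

print(describe_payment("CC"))      # Credit card
print(describe_payment(None))      # Unknown
print(describe_promo_flag(True))   # On promotion
print(describe_promo_flag(None))   # Not Applicable
```

The original code can be kept in the dimension alongside the text, so reports show readable labels while the operational value stays traceable.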

4. Identify the facts

• The facts collected by the POS system:
• Sales quantity
• Per-unit regular price
• Per-unit discount
• Per-unit net paid price
• Extended discount dollar amount
• Extended sales dollar amount
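
A sketch of how these facts could relate to one another for a single POS line item. It assumes the usual relationships (net unit price = regular unit price minus unit discount; extended amount = quantity times unit amount). The slide only lists the facts, so these formulas and the function name pos_line_facts are assumptions.

```python
from decimal import Decimal

def pos_line_facts(quantity, regular_unit_price, discount_unit_price):
    """Build the fact values for one product line on a POS transaction."""
    # Assumed relationship: net = regular - discount (per unit).
    net_unit_price = regular_unit_price - discount_unit_price
    return {
        "sales_quantity": quantity,
        "regular_unit_price": regular_unit_price,
        "discount_unit_price": discount_unit_price,
        "net_unit_price": net_unit_price,
        # Assumed relationship: extended = quantity x per-unit amount.
        "extended_discount_amount": quantity * discount_unit_price,
        "extended_sales_amount": quantity * net_unit_price,
    }

# Three units at 2.50 regular price with 0.50 off each (made-up values).
line = pos_line_facts(3, Decimal("2.50"), Decimal("0.50"))
print(line["net_unit_price"])            # 2.00
print(line["extended_discount_amount"])  # 1.50
print(line["extended_sales_amount"])     # 6.00
```

Decimal is used instead of float so that currency amounts stay exact. The extended amounts add up correctly across rows, while the unit prices do not, which is one reason to store both.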

Source: Kimball, The Data Warehouse Toolkit



Source: Kimball (2013)



Source: Kimball (2013)



Source: Kimball (2013)
