Notes Data Modelling Fundamentals

The document discusses the differences between data modeling and dimensional modeling, emphasizing that data modeling focuses on structuring data with normalization for integrity, while dimensional modeling prioritizes usability and performance through denormalization. It outlines the characteristics and requirements of data warehouses and business intelligence systems, highlighting the importance of accessibility, consistency, and adaptability. Additionally, it explains the structure and function of fact and dimension tables in dimensional modeling, including the use of OLAP cubes for enhanced query performance.

- In data modeling, the focus is on structuring the data and its relationships, whereas in dimensional modeling, the focus is on organizing data for analysis and reporting.
- Data modeling involves creating normalized tables to minimize redundancy and ensure data integrity, while dimensional modeling often involves denormalization for performance and usability.
- Querying data modeled with data modeling may require complex joins, while in dimensional modeling, queries are typically simpler and more user-friendly, as they query fact and dimension tables directly.

Normalization:
In normalization, we aim to organize data to minimize redundancy (the same piece of data existing in multiple places) and maintain data integrity.

Users Table:
| UserID | Username | Email             |
|--------|----------|-------------------|
| 1      | user1    | [email protected] |
| 2      | user2    | [email protected] |
| 3      | user3    | [email protected] |

Products Table:
| ProductID | Name       | Description                 | Price |
|-----------|------------|-----------------------------|-------|
| 101       | Laptop     | High-performance laptop     | 1000  |
| 102       | Smartphone | Latest smartphone model     | 800   |
| 103       | Headphones | Noise-cancelling headphones | 200   |

Orders Table:
| OrderID | UserID | ProductID | Quantity | OrderDate  |
|---------|--------|-----------|----------|------------|
| 1       | 1      | 101       | 1        | 2024-02-29 |
| 2       | 2      | 102       | 2        | 2024-02-28 |
| 3       | 3      | 103       | 1        | 2024-02-27 |

In denormalization, we may combine tables to simplify queries and improve performance, accepting redundancy as a trade-off.

Orders Table with Denormalization:
| OrderID | Username | Email             | ProductID | ProductName | Quantity | OrderDate  |
|---------|----------|-------------------|-----------|-------------|----------|------------|
| 1       | user1    | [email protected] | 101       | Laptop      | 1        | 2024-02-29 |
| 2       | user2    | [email protected] | 102       | Smartphone  | 2        | 2024-02-28 |
| 3       | user3    | [email protected] | 103       | Headphones  | 1        | 2024-02-27 |
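The trade-off above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production schema: the lowercase table and column names and the example.com email addresses are placeholders standing in for the tables shown above.

```python
import sqlite3

# In-memory database mirroring the example tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT, email TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER,
                     product_id INTEGER, quantity INTEGER, order_date TEXT);

INSERT INTO users VALUES (1, 'user1', 'user1@example.com'),
                         (2, 'user2', 'user2@example.com');
INSERT INTO products VALUES (101, 'Laptop', 1000), (102, 'Smartphone', 800);
INSERT INTO orders VALUES (1, 1, 101, 1, '2024-02-29'),
                          (2, 2, 102, 2, '2024-02-28');
""")

# Normalized: answering "who ordered what?" needs two joins.
normalized = conn.execute("""
    SELECT u.username, p.name, o.quantity
    FROM orders o
    JOIN users u ON u.user_id = o.user_id
    JOIN products p ON p.product_id = o.product_id
    ORDER BY o.order_id
""").fetchall()

# Denormalized: the same answer comes from one wide table, at the cost of
# repeating user and product attributes on every order row.
conn.execute("""
    CREATE TABLE orders_denorm AS
    SELECT o.order_id, u.username, u.email, p.name AS product_name,
           o.quantity, o.order_date
    FROM orders o
    JOIN users u ON u.user_id = o.user_id
    JOIN products p ON p.product_id = o.product_id
""")
denormalized = conn.execute(
    "SELECT username, product_name, quantity FROM orders_denorm ORDER BY order_id"
).fetchall()
```

Both queries return the same rows; the denormalized version simply trades storage redundancy for a join-free query.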
Chapter 1:

- Both BI and data warehouses involve the storage of data. However, business intelligence also covers the collection, methodology (to process, clean, and transform the collected data), and analysis of data.
- DW/BI involves the requirements of both IT professionals and business users.
- You put data into operational systems, and you get data out of DW/BI systems.
- Users of operational systems take orders, sign up new customers, monitor the status of operational activities, and log complaints.
- Operational systems deal with one transaction record at a time.
- They update data to reflect the most current state rather than maintaining historical data.
- DW/BI systems count new orders and compare them with last week's orders, ask why new customers signed up, and ask what customers complained about.
- DW/BI systems are optimized for high-performance queries, since a single query may touch thousands of transactions.
- They maintain history to evaluate the organization's performance over time.

- DW/BI system requirements:
1. The DW/BI system must make information easily accessible and understandable to the business user. The business intelligence tools and applications must be simple, and query results must be fast.
2. The DW/BI system must present information consistently. Data must be carefully assembled from a variety of sources, cleansed, and quality assured. Consistency also implies that if two performance measures have the same name, they must mean the same thing. Conversely, if two measures don't mean the same thing, they should be labeled differently.
3. The DW/BI system must adapt to change. User needs, business conditions, data, and technology all change; the DW/BI system must change with them, and these changes must be transparent to the users.
4. The DW/BI system must convert data into actionable information in a timely way.
5. The DW/BI system must control access to the organization's confidential information, such as what you're selling to whom at what price.
6. The DW/BI system must serve as a trustworthy foundation for the decisions that are made, since those decisions are based on the analytic evidence it presents.
7. Business users will embrace the DW/BI system only if it is the "simple and fast" source for actionable information.

- Dimensional modeling is a technique for presenting analytic data. It delivers data that is easy for business users to understand (it keeps databases simple), and it delivers fast query performance.
- A data cube is a data structure that, unlike tables and spreadsheets, can store data in more than two dimensions.
- To slice and dice is to break a body of information down into smaller parts, or to examine it from different viewpoints, so that you can understand it better.
- The ability to visualize something as abstract as a set of data in a concrete and tangible way is the secret of understandability.
- Normalized third normal form (3NF) structures divide data into many discrete entities, each of which becomes a relational table.
- 3NF is ideal for transactional systems where data integrity and normalization are crucial, while dimensional modeling shines in analytical and reporting systems, offering improved query performance.
- Entity-relationship diagrams (ER diagrams or ERDs) are drawings that communicate the relationships between tables. Both normalized and dimensional models consist of joined relational tables; they differ in the degree of normalization.
Example of ER diagram:
Example of ER diagram:
+---------------+ +---------------+
| Author | | Publisher |
+---------------+ +---------------+
| AuthorID (PK) | | PublisherID |
| Name | | Name |
| Nationality | | Location |
+---------------+ +---------------+
| |
| |
| |
+---------------+ +---------------+
| Book | | Order |
+---------------+ +---------------+
| ISBN (PK) | | OrderID (PK) |
| Title | | CustomerID (FK)|
| Genre | | OrderDate |
| Price | +---------------+
| Pub_Year | |
| AuthorID (FK) | |
| PublisherID (FK)| |
+---------------+ |
|
+----------------+
| Customer |
+----------------+
| CustomerID (PK)|
| Name |
| Email |
| Address |
+----------------+
- Each box represents an entity.
- Lines connecting entities represent relationships between them.
- Attributes are listed within each entity box.
- A dimensional model contains the same information as a normalized model,
but packages the data in a format that delivers user understandability, query
performance, and resilience to change.
- Dimensional models implemented in relational database management
systems are
referred to as star schemas.
- Dimensional models implemented in multidimensional database
environments are
referred to as online analytical processing (OLAP) cubes.
- OLAP cubes deliver superior query performance because of precalculated
summary tables, indexing strategies, and other optimizations.
- OLAP cubes also provide more analytically robust functions that exceed
those available with SQL.
- The downside is that you pay a load performance price for these capabilities,
especially with large data sets.
- Most dimensional modeling techniques are applicable for both star schema
and OLAP cube.

OLAP Deployment Considerations:

1. A star schema is a good foundation for building an OLAP cube, and is good for backup and recovery.
2. With the introduction of appliances, in-memory databases, and columnar databases, the traditional query-performance advantage of OLAP cubes over RDBMSs has diminished.
3. It is typically more difficult to port BI applications between different OLAP tools than to port BI applications across different relational databases.
4. OLAP cubes typically offer more sophisticated security options than RDBMSs, such as limiting access to detailed data while providing more open access to summary data.
5. OLAP cubes offer richer analysis capabilities than RDBMSs.
6. OLAP cubes support slowly changing dimension type 2 changes, but cubes often need to be reprocessed whenever data is overwritten using alternative slowly changing dimension techniques.
7. OLAP cubes support transaction and periodic snapshot fact tables, but do not handle accumulating snapshot fact tables because of the limitations described in the previous point.
8. Ragged hierarchies: OLAP cubes typically support complex ragged hierarchies of indeterminate depth using native query syntax that is superior to the approaches required for RDBMSs.
9. OLAP cubes may impose constraints on the structure of dimension keys (the keys used to join dimensions with fact tables) that implement drill-down hierarchies, compared to relational databases.
10. Some OLAP products do not enable dimensional roles (a single physical dimension, such as date, used in multiple roles in one fact table, for example order date and ship date) or aliases, thus requiring separate physical dimensions to be defined.

- The fact table in a dimensional model stores the performance measurements (business measures) resulting from an organization's business process events (production, marketing, sales).
- Because measurement data is the largest set of data, it should be stored in a single dimensional model, a single centralized repository, so that business users throughout the enterprise work from consistent data.
- The grain of your data is the unique set of columns that defines what a record is.
- All measurement rows in a fact table must be at the same level of detail, which ensures that measurements aren't inappropriately double-counted.
- There should be a one-to-one relationship between a measurement event and a single row in the corresponding fact table; this is a core principle of dimensional modeling.
- The most useful facts are numeric and additive (such as dollar sales, which may be summed across hundreds of rows).
- Facts are sometimes semi-additive (account balances, which cannot be summed across the time dimension) or even non-additive (unit prices, which can never be added).
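The additive/semi-additive distinction can be shown with a few made-up numbers. Here daily dollar sales are fully additive across dates, while end-of-day account balances are semi-additive: summing them across the time dimension is meaningless, so an average or a period-end value is used instead.

```python
# Daily dollar sales (fully additive) and end-of-day account balances
# (semi-additive), keyed by date. All figures are invented for illustration.
sales = {"2024-02-27": 200.0, "2024-02-28": 800.0, "2024-02-29": 500.0}
balances = {"2024-02-27": 1000.0, "2024-02-28": 1100.0, "2024-02-29": 900.0}

# Additive fact: summing across the date dimension is meaningful.
total_sales = sum(sales.values())

# Semi-additive fact: summing balances across dates (3000.0 here) answers no
# business question; averaging or taking the period-end value is valid.
avg_balance = sum(balances.values()) / len(balances)
end_balance = balances["2024-02-29"]
```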

- Dimension attributes are the columns of dimension tables. Facts are the quantitative data.
- A fact's value is generally drawn from a broad, continuous range: you can anticipate a fact's type in advance, but not the specific value a query will return.
- Textual data belongs in dimension tables and rarely in fact tables (only when the text is unique for every row).
- Storing rows of zeros for measurement events that didn't happen would overwhelm the fact table, so only rows with actual activity are stored, making the fact table sparse. Even so, fact tables typically take up 90 percent or more of the total space consumed by a dimensional model, so fact table space utilization should be managed wisely.
- A fact table has far more rows than columns.
- A row in a periodic snapshot fact table captures data at regular intervals, e.g., a monthly bank account balance.
- A row in an accumulating snapshot fact table summarizes the measurement events occurring at predictable steps between the beginning and the end of a process, e.g., how many services were started in July.
- Fact table rows come in three main types: transaction, periodic snapshot, and accumulating snapshot.
- All fact tables have two or more foreign keys that connect to the dimension tables' primary keys.
- When all the keys in the fact table correctly match their respective primary keys in the corresponding dimension tables, the tables satisfy referential integrity.
- Foreign key values do not need to be unique.
- A fact table's primary key is typically composed of a subset of its foreign keys; such a key is called a composite key. Adding further dimension keys to it would still leave a valid (if redundant) unique key.
- Every table with a composite key is a fact table; the other tables are dimension tables.
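The key structure just described can be sketched in SQLite, where foreign key enforcement must be switched on explicitly. All table and column names here are invented for illustration: the fact table's primary key is a composite of its dimension foreign keys, and a fact row pointing at a nonexistent dimension key violates referential integrity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY, full_date TEXT);

-- The fact table's primary key is a composite of dimension foreign keys.
CREATE TABLE sales_fact (
    product_key   INTEGER REFERENCES product_dim (product_key),
    date_key      INTEGER REFERENCES date_dim (date_key),
    sales_dollars REAL,
    PRIMARY KEY (product_key, date_key)
);

INSERT INTO product_dim VALUES (101, 'Laptop');
INSERT INTO date_dim VALUES (20240229, '2024-02-29');
INSERT INTO sales_fact VALUES (101, 20240229, 500.0);
""")

# A fact row pointing at a nonexistent dimension key is rejected.
try:
    conn.execute("INSERT INTO sales_fact VALUES (999, 20240229, 100.0)")
    integrity_enforced = False
except sqlite3.IntegrityError:
    integrity_enforced = True

fact_rows = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```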

- The dimension tables contain the textual context associated with a business process measurement event.
- Unlike fact tables, dimension tables have relatively few rows but can be wide, often with 50 to 100 columns/attributes, including many large text columns. Each dimension table has a single primary key.
- Dimension attributes serve as the primary source of query constraints, groupings, and report labels. In a query or report request, attributes are identified as the "by" words. For example, when a user wants to see dollar sales by brand, brand must be available as a dimension attribute.
- Dimension attributes make the DW/BI system usable and understandable.
- Use verbose textual attributes rather than cryptic codes wherever possible.
- Decoded values should live in the dimension tables; they should never be hidden in the reporting applications (BI tools such as Tableau), where inconsistency is inevitable.
- Sometimes operational codes have legitimate business significance to users. In these cases, the codes should appear as explicit dimension attributes, in addition to the corresponding textual descriptors that can easily be filtered, grouped, or reported.
- Attribute columns should use verbose business terminology, with fully populated domain values of high quality, because robust dimension attributes deliver robust analytic slicing-and-dicing capabilities.
- Dimensions provide the entry points to the data, and the final labels and groupings on all DW/BI analyses.
- Row labels determine which columns are prominently displayed for a table, similar to a title.
- If a column is a measurement that takes on lots of values and participates in calculations, it is a fact; if it is a discretely valued description that is more or less constant (drawn from a small list) and participates in constraints and row labels, it is a dimension attribute.
- Dimension tables often represent hierarchical relationships; for example, products roll up into brands and then into categories. This structure supports ease of use and query performance.
- Storing only the brand code in the product dimension and creating a separate brand lookup table (and likewise a separate category lookup table) is a form of normalization called snowflaking.
- Because dimension tables are much smaller than fact tables, keeping them denormalized (rather than snowflaked) to improve simplicity and accessibility has a negligible impact on overall database size.
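The snowflaking described above can be sketched in SQLite. The table names, brand names, and surrogate keys below are all invented: a flat (star) product dimension repeats the brand name on every row, while the snowflaked version moves it into a lookup table and so needs an extra join to recover it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Flattened (star) product dimension: brand repeated per product row.
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT, brand_name TEXT
);
INSERT INTO product_dim VALUES
    (1, 'Laptop Pro', 'Acme'), (2, 'Laptop Air', 'Acme'), (3, 'Desk Lamp', 'Lumen');

-- Snowflaked version: brand normalized into its own lookup table.
CREATE TABLE brand_lookup (brand_key INTEGER PRIMARY KEY, brand_name TEXT);
CREATE TABLE product_dim_snowflaked (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_key INTEGER REFERENCES brand_lookup (brand_key)
);
INSERT INTO brand_lookup VALUES (10, 'Acme'), (11, 'Lumen');
INSERT INTO product_dim_snowflaked VALUES
    (1, 'Laptop Pro', 10), (2, 'Laptop Air', 10), (3, 'Desk Lamp', 11);
""")

# The snowflaked schema needs an extra join to recover the brand name.
rows = conn.execute("""
    SELECT p.product_name, b.brand_name
    FROM product_dim_snowflaked p
    JOIN brand_lookup b ON b.brand_key = p.brand_key
    ORDER BY p.product_key
""").fetchall()
```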

- Each business process (production, marketing, sales) is represented by a dimensional model (dimensional schema) consisting of a fact table surrounded by a halo of dimension tables. This characteristic star-like structure is called a star join.
- The dimensional schema is simple and symmetric, which makes it easier for business users to understand and navigate.
- Because of this simplicity, database optimizers process these schemas, with their fewer joins, more efficiently.

Example:
Let's continue with our previous example of a dimensional model with the "Product" and "Location" dimension tables and a fact table "Sales". Suppose we have the following sales data:
Sales table:
| Product_ID | Location_ID | Sales_Amount |
|------------|-------------|--------------|
| 1          | 101         | 500          |
| 1          | 102         | 700          |
| 2          | 101         | 300          |
| 2          | 102         | 600          |

Now, let's say we want to retrieve sales data for products sold in New York
(Location_ID 101) with sales amounts greater than 400.
First, the database optimizer constrains the dimension tables. In this case, it
would apply the condition "Location_ID = 101" to the "Location" table,
narrowing down the dimension table to only contain data related to New York.

Location table (constrained):
| Location_ID | City     |
|-------------|----------|
| 101         | New York |

Next, the optimizer performs the Cartesian product of the keys from the constrained dimension tables, which in this case pairs each Product_ID with the single constrained Location_ID:
(1, 101)
(2, 101)
These pairs represent all possible combinations of Product_ID with the constrained Location_ID (101).
Finally, the optimizer uses these combinations to efficiently access the fact table "Sales" and retrieve the relevant sales data.

| Product_ID | Location_ID | Sales_Amount |
|------------|-------------|--------------|
| 1          | 101         | 500          |
| 2          | 101         | 300          |

By following this process, the optimizer efficiently gathers the candidate sales rows for products sold in New York in just one pass through the fact table's index, and then applies the Sales_Amount > 400 predicate to those rows. This approach minimizes the need for multiple scans through the fact table and ensures efficient query processing.

Example:
Imagine a retail company using a dimensional model to analyze sales data. In this model, there are dimension tables like "Product," "Time," and "Location," and a fact table containing sales transactions.

Now, let's say the company decides to expand its product offerings by introducing new product categories. With a dimensional model, accommodating this change is straightforward. They can simply add new entries to the "Product" dimension table for the new categories without needing to overhaul the entire database schema.

Similarly, if there's a shift in customer behavior, such as increased interest in online shopping, the company can adapt without difficulty. They might add a new dimension table for "Online Sales Channel" and integrate it seamlessly into their existing dimensional model.

Furthermore, every dimension in the model is treated equally. Whether analyzing sales (the fact table) by product, time, or location, users have equal access and flexibility to explore the data from various perspectives. There's no inherent bias or preference for specific query patterns, allowing for unbiased analysis regardless of the questions asked.

In essence, the dimensional model provides a flexible and adaptable structure that can gracefully accommodate changes in business needs and user behavior without requiring significant schema adjustments (changes to tables, relationships, and how data is structured, stored, and accessed within the database), ensuring continuity and efficiency in data analysis processes.

- First name and last name are more granular than a single name attribute.
- Detailed, unaggregated (granular) data carries more dimensions (attributes). This raw data forms the core of fact table design, vital for handling spontaneous user queries effectively.
- In a retail business using a dimensional model for sales data analysis, extending the schema to incorporate new dimensions or facts is seamless. For instance, adding a "Customer Segment" dimension involves creating a new table with segments like "VIP" or "Regular" and linking sales transactions to the appropriate segment. Similarly, introducing new facts like returns or discounts is straightforward, provided they align with the existing level of detail in the fact table. Additionally, enhancing dimension tables with new attributes, such as adding a "Product Category" attribute to the "Product" dimension, is easily achievable. Importantly, these schema modifications can be made directly to the existing tables without reloading the data. Whether adding new rows or executing SQL commands like ALTER TABLE, the process is efficient and preserves the continuity of data analysis. Existing BI applications can continue to function seamlessly, ensuring consistent and reliable insights without impacting results.
- Fact tables (numeric values) and dimension tables (filters and labels) can also be translated directly into a report:

SELECT
    store.district_name,
    product.brand,
    SUM(sales_facts.sales_dollars) AS "Sales Dollars"
FROM sales_facts
JOIN store   ON store.store_key     = sales_facts.store_key
JOIN product ON product.product_key = sales_facts.product_key
JOIN date    ON date.date_key       = sales_facts.date_key
WHERE
    date.month_name = 'January' AND
    date.year = 2013
GROUP BY
    store.district_name,
    product.brand

- Four components of a DW/BI environment: operational source systems, the ETL system, the data presentation area, and business intelligence applications.
- The data warehouse environment is where the data that serves as the basis for corporate decision-making is stored.
- The business intelligence environment is the system that enables business users to query BI data, create data visualizations, and design dashboards on their own.

- Examples of source systems:
1. Point of Sale (POS): a computerized tool used in retail, restaurants, and other businesses to process transactions. It captures product details, calculates totals, processes payments, and provides receipts. Typically including hardware like cash registers and scanners, it also manages inventory, sales data, and customer information.
2. Inventory Management: tracks product movement and availability, including how much stock is available (stock levels), when to reorder to replenish supplies (replenishment orders), and where items are located within the warehouse (warehouse locations).
3. Customer Relationship Management (CRM) systems: store information about customers, including their contact details, purchase history, interactions with the company, and preferences.
4. Financial systems: handle financial transactions (the company sells 100 units of its product at $10 each, receiving $1,000 in revenue), accounting (the sale is recorded in the company's general ledger as a credit of $1,000 in sales revenue), and budgeting processes (the company budgeted $1,500 in monthly sales revenue but earned $1,000, prompting a variance analysis comparing actual to budgeted amounts; insights from this comparison inform adjustments to future budgets or strategies, ensuring effective financial resource management and informed decision-making).
5. Human Resources (HR) systems: manage employee-related information, including payroll, benefits, attendance, performance evaluations, and training records.
6. Supply Chain Management (SCM) systems: oversee the flow of goods, services, and information across the supply chain, from procurement and production to distribution and delivery. They optimize processes such as inventory management (ensuring the right amount of product is available), logistics (making the movement of products more efficient, reducing the time and cost of transportation, warehousing, and distribution), and supplier relationships (building and maintaining positive connections with suppliers for timely delivery, better pricing, and mutually beneficial collaboration).

- Data warehouse: the company decides to set up a data warehouse to analyze sales trends, inventory levels, customer behavior (the actions and decisions people make when they choose, buy, use, and dispose of a product or service), and financial performance (how well a firm can use assets from its primary mode of business to generate revenue). They regularly extract data from their source systems and load it into the data warehouse.

- Example: when a customer makes a purchase at a store, the transaction details are recorded in the POS system, which is a source system. This includes information like the items purchased, the price, the time of purchase, and the customer's details. The POS system's main priority is to process this transaction quickly and keep the system available for other customers. Operational queries against the POS system typically retrieve specific transaction details for individual purchases, such as looking up a receipt or checking the availability of a particular product in inventory.

- Meanwhile, the data warehouse is used for broader analysis. For example, the company might use data from the data warehouse to analyze sales trends over time, identify popular products, understand customer demographics, or forecast future sales. This involves querying the data warehouse in more complex and varied ways compared to the narrow, transaction-specific queries performed on the source systems. Additionally, the data warehouse stores historical data, allowing the company to analyze trends and patterns over time, which the source systems typically don't maintain.

- While ERP focuses on integrating and managing core business processes across the enterprise, operational MDM focuses on ensuring the consistency and quality of master data within operational systems. Both play critical roles in driving efficiency, improving decision-making, and enabling organizations to adapt to changing business needs.

- The extract, transformation, and load (ETL) system of the DW/BI environment consists of:
Work area: temporary storage for data undergoing transformation before loading into the data warehouse.
Instantiated data structures: models or schemas created within ETL to store transformed data temporarily.
Set of processes: a series of tasks from extraction to loading, including data profiling (examining data for quality and consistency), cleansing (removing errors and inconsistencies from data), integration (combining data from different sources), and validation (ensuring data accuracy and integrity).
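The profiling, cleansing, integration, and validation steps listed above can be illustrated with a toy pure-Python pass over two hypothetical source extracts. Every name and value here is invented; a real ETL tool would do far more, but the sequence of steps is the same.

```python
# Two hypothetical source extracts; amounts arrive as strings, one is missing.
store_extract = [{"order_id": 1, "amount": "500"},
                 {"order_id": 2, "amount": None}]      # data quality issue
online_extract = [{"order_id": 3, "amount": "300"}]

def profile(rows):
    """Profiling: count rows with missing measurements."""
    return sum(1 for r in rows if r["amount"] is None)

def cleanse(rows):
    """Cleansing: drop rows with missing amounts, convert strings to numbers."""
    return [{**r, "amount": float(r["amount"])}
            for r in rows if r["amount"] is not None]

def integrate(*sources):
    """Integration: combine cleansed rows from all source systems."""
    return [row for src in sources for row in src]

def validate(rows):
    """Validation: every amount must be a positive number."""
    return all(isinstance(r["amount"], float) and r["amount"] > 0 for r in rows)

missing = profile(store_extract) + profile(online_extract)
facts = integrate(cleanse(store_extract), cleanse(online_extract))
ok = validate(facts)
```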

- Extracting means reading and understanding the source data and copying the data needed into the ETL system.
- Diagnostic metadata (detailed information about data quality issues) generated from these activities provides insight into data quality problems. This information can drive business process reengineering efforts, redesigning workflows (sequential processes such as extraction, transformation, loading, and reporting) and systems (the software and hardware that store, process, and present data, such as databases, ETL tools like Informatica, and reporting platforms like Tableau) to gradually improve data quality in the source systems.
- Different sections of a data warehouse: sales, marketing, finance, human resources, inventory.
- In the final phase of the ETL process, data is structured and loaded into the presentation area's dimensional models (the section of the data warehouse where data is organized for reporting and analysis).
- In the ETL system, dimension and fact tables are delivered in the delivery step.
- Critical subsystems focus on dimension table processing to ensure accurate data representation.
- Dimension table tasks include surrogate key assignment, code lookups (finding the right labels or descriptions for data values, e.g., translating product codes into product names), and column manipulation.
- Fact tables, though large and time-consuming to load, are straightforward to deliver to the presentation area.
- Once the data is updated, indexed, and quality assured, the business community is notified that new data is available.

- In simpler terms, there's debate in the industry about whether data in the ETL system should be organized into normalized structures before being loaded into the presentation area's dimensional structures. ETL systems typically focus on basic tasks like sorting and processing data step by step, and they often use flat files instead of relational technology. Once the data is checked against specific business rules, building a 3NF (third normal form) physical database before transforming it into denormalized structures for BI presentation may be unnecessary; this extra step can introduce complexity without significant benefit.

An example of a table in Third Normal Form (3NF):

Consider a "Students" table with the following attributes:
Student_ID (Primary Key)
Student_Name
Date_of_Birth
Address
Course_ID (Foreign Key)

Each attribute in the table is atomic, meaning it cannot be further divided. There are no repeating groups, and each piece of data is stored in its own cell. Additionally, each non-key attribute (Student_Name, Date_of_Birth, Address) is functionally dependent on the primary key (Student_ID). There are no transitive dependencies; for example, the address attribute depends directly on the Student_ID, not on any other attribute like the student's name. This organization ensures data integrity and flexibility in querying, meeting the criteria for Third Normal Form.
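The Students table above can be sketched as SQLite DDL. The course table, sample names, and key values are invented for illustration; the point is that course details live in their own table, so every non-key attribute of Students depends only on Student_ID, with no transitive dependencies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE courses (
    course_id   INTEGER PRIMARY KEY,
    course_name TEXT
);
-- Every non-key attribute depends on the key (student_id) and nothing else;
-- course details are kept in their own table, avoiding transitive dependencies.
CREATE TABLE students (
    student_id    INTEGER PRIMARY KEY,
    student_name  TEXT,
    date_of_birth TEXT,
    address       TEXT,
    course_id     INTEGER REFERENCES courses (course_id)
);
INSERT INTO courses VALUES (10, 'Databases');
INSERT INTO students VALUES (1, 'Asha', '2001-05-14', '12 Elm St', 10);
""")

# Course details are recovered through the foreign key, not duplicated per student.
row = conn.execute("""
    SELECT s.student_name, c.course_name
    FROM students s JOIN courses c ON c.course_id = s.course_id
""").fetchone()
```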
- Some ETL developers receive data in 3NF and prefer to cleanse and transform it using normalized structures.
- The concern is that the data then gets processed twice: once into the normalized database and again into the dimensional model.
- This two-step process demands more time, investment, and storage capacity.
- A sole focus on normalized structures in DW/BI can lead to project failures.
- Cost-effective alternatives exist for maintaining data consistency without physically normalizing data in the ETL system.
- Let's illustrate the concept of data consistency in a retail company using simplified tables for physical store sales and online sales.

Physical Store Sales Table:

| Transaction_ID | Product_ID | Quantity | Price | Payment_Method | Store_ID | Transaction_Date |
|----------------|------------|----------|-------|----------------|----------|------------------|
| 1              | 101        | 2        | $20   | Credit Card    | Store A  | 2024-02-17       |
| 2              | 103        | 1        | $15   | Cash           | Store B  | 2024-02-17       |
| 3              | 105        | 3        | $30   | Debit Card     | Store A  | 2024-02-18       |

Online Sales Table:

| Transaction_ID | Product_ID | Quantity | Price | Payment_Method | Customer_ID | Transaction_Date |
|----------------|------------|----------|-------|----------------|-------------|------------------|
| 101            | 102        | 1        | $25   | PayPal         | 1001        | 2024-02-17       |
| 102            | 104        | 2        | $40   | Credit Card    | 1002        | 2024-02-17       |
| 103            | 106        | 1        | $20   | PayPal         | 1003        | 2024-02-18       |

In this example:
- Each table represents sales transactions from a different channel: physical stores and online.
- Both tables contain similar columns such as Transaction_ID, Product_ID, Quantity, Price, Payment_Method, and Transaction_Date, ensuring consistency in data structure.
- Despite the transactions originating from different channels, they are recorded uniformly with the same attributes.
- The Transaction_ID serves as a unique identifier for each transaction, facilitating data integration and analysis.
- Data consistency ensures that regardless of where the sale occurred, the information is accurately captured, enabling the company to analyze overall sales performance seamlessly.
- Consistent data structures across sales channels enable insightful analysis, informing strategic decisions and fostering business growth via data-driven initiatives.
- Users interact with the DW/BI presentation area for querying and analysis, unaware of the back-end ETL processes. It is likened to "getting the data out," as emphasized in The Data Warehouse Toolkit.
- Industry consensus favors dimensional modeling as the optimal technique for delivering data to DW/BI users.
- The presentation area must contain detailed, atomic data to handle unpredictable ad hoc queries effectively.
- While aggregated data can enhance performance, it is insufficient without the underlying granular data in dimensional form.

Ex.
| Date | Product | Region | Sales Amount |
|------------|------------|----------|--------------|
| 2024-01-01 | Laptop | North | $10,000 |
| 2024-01-01 | Smartphone | South | $5,000 |
| 2024-01-02 | Tablet | East | $8,000 |
If a user drills down to the most granular level (e.g., individual transactions),
they might lose the benefits of the dimensional presentation, such as
aggregations and summaries that provide a more comprehensive view of the
data.
- Storing only summary data in dimensional models while atomic data remains
in normalized models is unacceptable.
- Users need access to finely grained data in the presentation area to ask
precise questions, even if they infrequently examine single line items.
- Detailed data enables users to analyze specific scenarios, such as last
week's orders for specific products from customers who recently made their
first purchase.

- Dimensional structures in the presentation area must use common,
conformed dimensions.

In the Sales model, we might have the following dimensions:

Time (Date, Month, Quarter, Year)
Product (Product ID, Product Name, Category)
Location (Store ID, City, State)

In the Product model, we also have the Product dimension:

Product (Product ID, Product Name, Category)

Here, the Product dimension is common to both models. It is a conformed
dimension because it's defined and maintained consistently across all
dimensional models in the presentation area.

So, regardless of whether we're analyzing sales data or product data, we use
the same Product dimension, ensuring consistency and compatibility across
different analyses and reports.
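A conformed dimension can be sketched in code: both models resolve their product keys against one shared table, so product labels stay identical everywhere (a minimal Python sketch; the row values echo examples used elsewhere in these notes, and the `enrich` helper is illustrative):

```python
# A single, conformed Product dimension shared by two dimensional models.
product_dim = {
    101: {"name": "Laptop", "category": "Electronics"},
    102: {"name": "Smartphone", "category": "Electronics"},
}
sales_fact = [{"product_id": 101, "amount": 10000},
              {"product_id": 102, "amount": 5000}]
inventory_fact = [{"product_id": 101, "in_stock": 100},
                  {"product_id": 102, "in_stock": 50}]

def enrich(fact_rows):
    """Resolve each fact row's product key against the shared dimension."""
    return [{**row, **product_dim[row["product_id"]]} for row in fact_rows]

# Both models describe products identically because both use the same dimension.
print(enrich(sales_fact)[0]["name"])      # Laptop
print(enrich(inventory_fact)[1]["name"])  # Smartphone
```

If each model kept its own product table, the labels could drift apart; sharing one dimension makes cross-model reports line up by construction.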
- Adhering to the enterprise data warehouse bus architecture is crucial.
Imagine your data warehouse as a central hub where all your organization's
data comes together. The "bus" in EDW Bus Architecture is like the main road
leading to this hub. It's where all your different data sources and analysis
models connect and share information. Just as a bus transports people from
different places to a central destination, the bus in EDW Bus Architecture
transports data from various sources to your data warehouse, making it easier
to manage and analyze everything in one place.
- Isolated data sets hinder integration and perpetuate incompatible views of
the enterprise.
- Committing to the enterprise bus architecture is essential for building a
robust and integrated DW/BI environment.
- Conformed dimensions allow dimensional models to be combined and used
together effectively.
- The presentation area in a large enterprise DW/BI solution comprises
numerous dimensional models with shared dimension tables across fact
tables.

Here's an example of multiple tables representing multiple dimensional
models with shared dimension tables across multiple fact tables in a large
enterprise DW/BI solution:

Dimension Table: Product

| Product_ID | Product_Name | Category | Brand |
|------------|--------------|-------------|---------|
| 101 | Laptop | Electronics | Brand A |
| 102 | Smartphone | Electronics | Brand B |
| 103 | Tablet | Electronics | Brand C |
| 104 | Headphones | Electronics | Brand D |

Dimension Table: Customer

| Customer_ID | Name | Gender | Age | City | Country |
|-------------|---------------|--------|-----|-------------|---------|
| 1001 | John Smith | Male | 35 | New York | USA |
| 1002 | Emily Johnson | Female | 28 | Los Angeles | USA |
| 1003 | Michael Brown | Male | 40 | Chicago | USA |
| 1004 | Maria Garcia | Female | 45 | Madrid | Spain |

Dimension Table: Time

| Date | Day | Month | Quarter | Year |
|------------|-----------|---------|---------|------|
| 2024-01-01 | Monday | January | Q1 | 2024 |
| 2024-01-02 | Tuesday | January | Q1 | 2024 |
| 2024-01-03 | Wednesday | January | Q1 | 2024 |

Fact Table: Sales

| Date | Product_ID | Customer_ID | Quantity Sold | Sales Amount |
|------------|------------|-------------|---------------|--------------|
| 2024-01-01 | 101 | 1001 | 10 | $10,000 |
| 2024-01-01 | 102 | 1002 | 5 | $5,000 |
| 2024-01-02 | 103 | 1003 | 8 | $8,000 |

Fact Table: Inventory

| Date | Product_ID | Quantity_In_Stock |
|------------|------------|-------------------|
| 2024-01-01 | 101 | 100 |
| 2024-01-01 | 102 | 50 |
| 2024-01-02 | 103 | 80 |
In this example:
- The "Product," "Customer," and "Time" tables represent different
dimensional models containing information about products, customers, and
time, respectively.
- Each dimensional model is associated with its respective dimension table.
- The "Sales" and "Inventory" fact tables represent different aspects of
business operations.
- Both fact tables share the "Product" dimension (via Product_ID) for analysis purposes.
- The presentation area of the DW/BI solution comprises multiple dimensional
models and fact tables, allowing for comprehensive analysis of sales,
inventory, and other business aspects while maintaining consistency across
shared dimension tables.

- Using the bus architecture is like having a blueprint for building a distributed
DW/BI system.
- Imagine a large retail corporation with stores located in different regions. The
corporation wants to analyze sales data from all its stores to make informed
business decisions.
In this scenario:
- Data Storage: Sales data stored in distributed databases, with each store's
data in a separate instance near its location.
- Data Processing: Analytical tasks done on multiple servers simultaneously in
a distributed computing cluster, enabling parallel processing.
- Example Task: Analyzing sales performance for a product category across
all stores, retrieving and analyzing data concurrently from each store's
database.
- Overall, the distributed DW/BI system allows the retail corporation to
ef ciently analyze sales data from its stores across different locations,
enabling better decision-making and strategic planning.

- Develop the enterprise data warehouse in a manner that is:
1. Agile: Flexible response to changes, adapting to evolving business needs
quickly throughout development.
2. Decentralized: Task distribution among teams for independent work and
collaboration, leveraging diverse expertise efficiently.
3. Realistically Scoped: Achievable goals within defined constraints, breaking
tasks into manageable parts for successful completion.
4. Iterative: Continuous improvement through repeated cycles, enabling
incremental enhancements and adjustments for optimal system evolution.

- The business intelligence (BI) application provides tools for business users
to analyze data from the presentation area. BI applications query data to
facilitate informed decision-making, serving as the primary means for
leveraging data for analytics.

- BI applications range from simple ad hoc query tools to complex data mining
(decision trees, neural networks) or modeling (relational and dimensional
modeling) applications. While ad hoc tools are powerful, they're only
understood by a small user base. Most users access data through prebuilt
applications and templates. Advanced tools may upload results back into
source systems or the presentation area ensuring that the insights generated
are available for further analysis or decision-making.
- The analogy compares an ETL system to a restaurant kitchen, where
talented chefs transform raw ingredients into tasty dishes for diners. Both
require careful planning of layout and components for efficient operation.
- The kitchen's design emphasizes efficiency, consistency, and integrity.
Efficiency is crucial for high throughput during busy times, minimizing wasted
movement. Consistency is maintained by preparing special sauces in-house
to avoid variations. Integrity is upheld by separating tasks like salad
preparation from handling raw chicken to prevent contamination and ensure
food safety.
- Quality, consistency, and integrity are key considerations in both designing
the restaurant kitchen and everyday restaurant management. Chefs prioritize
obtaining high-quality ingredients, reject those that don't meet standards, and
may adjust menus based on ingredient availability.
- The restaurant employs skilled professionals in its kitchen who confidently
hold and use sharp knives, operate powerful equipment, and work around hot
surfaces safely and efficiently.
- For safety and hygiene reasons, restaurant kitchens are off-limits to patrons.
Even in open kitchen setups, there's typically a barrier like glass to prevent
access. This separation ensures that cooks can work without distractions and
maintains cleanliness standards.
- The ETL system in a data warehouse, similar to a restaurant kitchen,
transforms raw data into meaningful information efficiently. Both require
careful layout and planning to ensure throughput and minimize unnecessary
steps before any data extraction occurs.
- The ETL system focuses on ensuring data quality, integrity, and consistency.
It checks incoming data for quality, monitors conditions for integrity (ex.
Checking for missing values in a customer database), and applies
standardized business rules for consistent metrics (ex. Converting currency
values to a standardized format for financial reporting). This approach, though
demanding on the ETL team, aims to deliver a superior and more reliable
product to data warehouse patrons.
- The ETL system should be off-limits to business users and BI developers.
Distractions could disrupt ETL professionals, leading to errors. Once data is
ready and quality-checked, it's brought to the DW/BI presentation area for
user consumption.

- The DW/BI system's presentation area serves as a curated menu of data
available for analysis and decision-making. It provides consistency and high-
quality data through metadata, published reports, and parameterized analytic
applications, ensuring it is properly prepared and safe to consume.
- The presentation area in DW/BI should prioritize user comfort and
preferences over developer preferences. Prompt delivery of data in an
appealing form to business users or developers is essential.
- Happy patrons lead to successful restaurant management: busy dining
rooms, high turnover, revenue, and profit. Unhappy patrons result in financial
struggles and potential closure.
- DW/BI managers should proactively monitor satisfaction, just as restaurant
managers do. Act immediately on signs of dissatisfaction; don't wait for
complaints.
- Proactive management is crucial to prevent DW/BI patrons from seeking
alternatives. Keep the "kitchen" efficient so the presentation area delivers the
required data effectively and keeps users satisfied.

- Analytic data is deployed departmentally without considering enterprise-wide
integration. A single department identifies its data needs and collaborates with
IT or consultants to create a database/data mart reflecting its rules and
preferences. However, this isolated approach solely caters to the
department's analytic needs, lacking cross-departmental collaboration.

Ex.

Business Rule:
Retail: Purchases over $500 require manager authorization to prevent fraud.

Data Preference:
Customer Service: Store contact info (name, email, phone) in a centralized
CRM.

Labeling Convention/preference:
Inventory Management: Product codes follow format (e.g., ABC-1234-L for
large-sized products from supplier ABC with code 1234).
- Another department needs the same data but builds its own solution
because it can't access the existing data mart. This leads to discrepancies in
performance reports due to differences in data, business rules, and labeling
conventions.

Ex.
Imagine a scenario where the Sales department builds a system to track sales
performance, including revenue by region and product category. Later, the
Marketing department wants to analyze the same sales data for campaign
effectiveness. However, lacking access to the Sales system, Marketing
creates its own solution with similar metrics but slightly different
categorizations or rules. Consequently, when comparing reports,
discrepancies arise due to these differences, causing confusion and
inefficiency in decision-making.

- Standalone analytic silos, despite being unsupported by industry leaders,
are widespread, especially in large organizations. They emerge due to
conventional IT project funding practices and the absence of cross-
organizational data governance. Although initially cost-effective, they lead to
long-term inefficiencies from redundant data handling. These silos perpetuate
conflicting views of organizational performance, causing unnecessary
disputes and reconciliation efforts.

- Independent data marts, despite being discouraged, often employ
dimensional modeling for its user-friendly data presentation and query
responsiveness. However, they typically overlook fundamental principles such
as focusing on atomic details, organizing by business process rather than
department, and utilizing conformed dimensions for enterprise consistency
and integration.
- A "Date" dimension is a conformed dimension shared across multiple data
marts. It includes attributes like Year, Month, Day, etc., maintaining the same
structure and meaning in all data marts. This consistency enables cohesive
analysis and reporting across the organization.

- In the Corporate Information Factory (CIF) architecture, data is extracted
from operational systems and processed through an ETL system, landing in a
(3NF) normalized Enterprise Data Warehouse (EDW). Unlike the Kimball
approach, which allows optional normalization, the CIF mandates a
normalized EDW. While both emphasize enterprise data coordination, CIF
relies on the normalized EDW, whereas Kimball emphasizes an enterprise
bus with conformed dimensions.

- Ex.
Sales and Finance both utilize a "Customer" dimension table through an
enterprise bus. Here are simplified representations of the relevant tables:
Customer Dimension Table:
| CustomerID | Name | Address | City | State | ZipCode | Phone |
|------------|------------|-------------|------------|-------|---------|--------------|
| 1 | John Doe | 123 Main St | Anytown | NY | 12345 | 555-123-4567 |
| 2 | Jane Smith | 456 Elm St | Otherville | CA | 54321 | 555-987-6543 |
| ... | ... | ... | ... | ... | ... | ... |

Sales Data Mart:

| SalesID | Date | CustomerID | ProductID | Quantity | Amount |
|---------|------------|------------|-----------|----------|---------|
| 1001 | 2024-01-01 | 1 | 101 | 2 | $100.00 |
| 1002 | 2024-01-02 | 2 | 102 | 1 | $50.00 |
| ... | ... | ... | ... | ... | ... |

Finance Data Mart:

| TransactionID | Date | CustomerID | AccountType | Amount |
|---------------|------------|------------|-------------|---------|
| 2001 | 2024-01-01 | 1 | Savings | $500.00 |
| 2002 | 2024-01-02 | 2 | Checking | $300.00 |
| ... | ... | ... | ... | ... |
Through the enterprise bus:
Changes or updates to customer data are made in the central "Customer"
dimension table.
These changes are propagated to both the Sales and Finance data marts
through the enterprise bus.
Both data marts maintain consistency in customer information, ensuring
accurate reporting and analysis across different business functions.
This setup enables seamless integration and consistent use of customer data
across various departments within the organization.
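This propagation can be sketched in code: both marts store only the CustomerID foreign key, so a single update to the central dimension is immediately visible to every mart that references it (a minimal Python sketch using the example rows; the `report` helper and the "Brooklyn" update are illustrative):

```python
# Central "Customer" dimension, referenced by both marts via CustomerID.
customer_dim = {
    1: {"name": "John Doe", "city": "Anytown"},
    2: {"name": "Jane Smith", "city": "Otherville"},
}
sales_mart = [{"sales_id": 1001, "customer_id": 1, "amount": 100.0}]
finance_mart = [{"txn_id": 2001, "customer_id": 1, "amount": 500.0}]

def report(mart):
    """Look up customer attributes through the shared dimension."""
    return [(customer_dim[r["customer_id"]]["city"], r["amount"]) for r in mart]

# One update to the central dimension is seen consistently by both marts.
customer_dim[1]["city"] = "Brooklyn"
print(report(sales_mart))    # [('Brooklyn', 100.0)]
print(report(finance_mart))  # [('Brooklyn', 500.0)]
```

Contrast this with independent marts that each copy customer data: after such an update, their reports would disagree about where the customer lives.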

- Normalization doesn't inherently address integration; it establishes many-to-
one relationships in physical tables. Integration involves resolving
inconsistencies between separate sources, which can remain unresolved
even with extensive normalization. The Kimball architecture, with conformed
dimensions, prioritizes resolving data inconsistencies without necessitating
explicit normalization.

- In CIF-adoption organizations, business users access the detailed EDW for
its richness and timeliness. However, downstream analytic environments,
though dimensionally structured, differ from Kimball's presentation area.
They're often departmentally-centric and contain aggregated data, making it
difficult to tie them to the EDW's atomic repository if ETL processes apply
complex business rules like renaming columns.

- Consider a large retail company implementing the blended Kimball-Inmon
CIF architecture. The CIF-centric EDW stores integrated data from various
sources in a normalized form, maintaining data integrity and consistency. This
EDW is off-limits to business users due to its complexity.

Now, for analysis and reporting needs, the system offloads queries to a
dimensional presentation area, following Kimball principles. In this layer, data
is structured for easy analysis and reporting, utilizing dimensional models. For
instance, instead of navigating complex tables in the CIF-centric EDW, a
simplified sales data cube is created in the Kimball-esque presentation area.
Business users and BI applications can then easily query and analyze this
dimensional structure, providing a user-friendly and efficient experience while
still benefiting from the integrated CIF-centric foundation.

- Myth: Dimensional models should only offer summary data, with detailed
data being too unpredictable. Reality: Detailed data is essential for users to
explore and aggregate information. Summary data improves performance but
shouldn't replace detailed data.
- Myth: Dimensional models should be structured based on organizational
departments. Reality: Dimensional models should be organized around
business processes like orders, invoices, and service calls. This ensures
consistency and enables multiple business functions to analyze the same
metrics from a single process.
- Dimensional models are highly scalable. Database vendors actively support
data warehousing and business intelligence, continuously improving
scalability and performance capabilities for dimensional models.
- Dimensional models should prioritize measurement processes over
predefined reports or analyses. While considering filtering and labeling
requirements is crucial, designing around a fixed set of reports is problematic
because these requirements can change. Instead, focus on stable
measurement events within the organization, as they provide a more reliable
foundation for dimensional modeling compared to constantly evolving
analyses.
- Dimensional models are not inflexible but rather highly adaptable to changing
business needs. Their symmetry allows for flexibility, especially when fact
tables are built at the most granular level. Models providing only summary
data can lead to limitations in analysis and development. Starting with data at
the lowest detail level maximizes flexibility and extensibility, avoiding
premature summarization that can hinder future adaptability.
- Dimensional models can integrate effectively if they adhere to the enterprise
data warehouse bus architecture. Conformed dimensions are centrally
managed as master data in the ETL system, ensuring semantic consistency
across dimensional models.
Presentation area databases that deviate from the bus architecture and lack
shared conformed dimensions result in standalone solutions.

- When gathering requirements for a DW/BI initiative, prioritize understanding
business processes over focusing solely on reports or dashboards.
Emphasize identifying the events driving metrics and maintain focus on one
business process per project to ensure clarity and efficiency.
- To ensure success in DW/BI initiatives, aligning IT and business
management is crucial. Shift the focus from departmental data deployments to
a process perspective. Prioritize opportunities based on business processes,
reflecting key performance indicators. Collaborate with business leadership to
rank processes by value and feasibility, tackling those with the highest impact
and feasibility scores first. Your deep understanding of business processes is
vital for effective prioritization and actionable outcomes.
- When drafting the data architecture for a DW/BI system, understanding the
organization's processes and associated master descriptive dimension data is
crucial. The main output is the enterprise data warehouse bus matrix, which
also serves to highlight the benefits of a robust master data management
platform.
- Data governance programs should focus on key dimensions such as date,
customer, product, etc., led by subject matter experts. Assigning governance
responsibilities for these dimensions is vital for deploying consistent
dimensions in DW/BI systems, enhancing analytic capabilities and system
robustness.
- Dimensional modeling serves as the core motivation throughout the entire
DW/BI process, from initial design to ETL systems and BI applications. It acts
as a bridge between the business and technical communities, guiding the joint
design efforts for DW/BI deliverables.

- In the DW/BI industry, there's a growing interest in agile development
practices. Agile focuses on completing manageable increments of work within
weeks instead of committing to larger, riskier projects with longer timeframes.
Agile considerations:
- Deliver business value.
- Collaborate with business stakeholders closely.
- Prioritize face-to-face communication and feedback.
- Adapt to evolving requirements swiftly.
- Develop iteratively and incrementally.

- Agile approaches face criticism for lacking planning and architecture, but the
enterprise data warehouse bus matrix addresses these issues. It offers a
framework for agile development and identifies reusable descriptive
dimensions, ensuring data consistency and faster delivery. Collaborative
efforts between business and IT stakeholders produce the matrix quickly,
enabling incremental development until sufficient functionality is available for
release.
- Let's consider a retail company implementing a DW/BI system. They want to
analyze sales data across different regions, products, and time periods.
Instead of creating separate dimension tables for region, product, and time,
they establish conformed dimensions. This allows them to reuse these
dimensions across multiple analyses and reports, speeding up development
and reducing time-to-market for new analytical features. With conformed
dimensions in place, they can quickly integrate new data sources, such as
online sales or customer demographics, focusing their development efforts on
building insightful analytics rather than recreating dimension tables.
- Teams sometimes misuse agile techniques to create analytic solutions
without considering broader organizational needs. They may work with a
limited set of users to address specific problems, resulting in standalone data
sets that others can't use or don't align with the organization's broader
analytics. While agility is encouraged, creating isolated data sets should be
avoided.

Chapter 2:
- Before launching a dimensional modeling effort, the team needs to
understand the needs of the business, as well as the realities of the
underlying source data. You uncover the requirements via sessions with
business representatives to understand their objectives based on:

1.Key Performance Indicators (KPIs):

In a given month, the retail company's data shows that customers made a
total of 10,000 transactions.
The total revenue generated from these transactions amounted to $500,000.
By dividing the total revenue by the number of transactions, we nd that the
average basket size is $50.
This KPI helps the company understand the typical spending behavior of its
customers and can be used to evaluate the effectiveness of marketing
promotions or sales strategies aimed at increasing the average transaction
value.
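The KPI arithmetic above can be checked directly (average basket size = total revenue / number of transactions):

```python
# Average basket size KPI from the example figures.
total_transactions = 10_000
total_revenue = 500_000
average_basket_size = total_revenue / total_transactions
print(average_basket_size)  # 50.0
```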

2.Compelling Business Issues:

The e-commerce platform observed 50,000 initiated transactions (adding
items to the cart) in a month.
However, only 35,000 of these transactions were completed (resulting in a
successful purchase).
Calculating the cart abandonment rate: (50,000 - 35,000) / 50,000 * 100% =
30%.
This indicates that 30% of customers abandon their carts without completing a
purchase.
High cart abandonment rates can signify issues with website usability,
checkout process, or pricing, impacting revenue and customer satisfaction.
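The abandonment-rate arithmetic can be checked directly:

```python
# Cart abandonment rate from the example figures.
initiated = 50_000
completed = 35_000
abandonment_rate = (initiated - completed) / initiated * 100
print(abandonment_rate)  # 30.0
```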

3.Decision-Making Processes:

Example: A manufacturing company deciding whether to invest in automation
technology to streamline production processes and reduce labor costs.

4.Supporting Analytic Needs:

Example: A healthcare provider using predictive modeling to forecast patient
admissions to better allocate staffing and resources in emergency
departments.

- At the same time, data realities are uncovered by meeting with source
system experts and doing high-level data profiling (in data profiling, we look
closely at the data to understand what it contains, how it's structured, whether
there are any mistakes or missing parts, and whether it's reliable) to assess
data feasibility (data feasibility means checking that you have the right kinds
of data available for your project and that they're good enough to use; you're
making sure the data you have is suitable and will work well for what you
want to do).

- Dimensional models must be developed in collaboration with subject matter
experts and data governance representatives from the business. While the
data modeler leads the process, the design unfolds through interactive
workshops with business representatives. These workshops serve to further
clarify business requirements. Designing dimensional models in isolation
without a deep understanding of the business is not recommended;
collaboration is key for success.

- Collaborative modeling sessions consider business needs and data realities.


The design team determines table names, column details, sample values, and
business rules based on business processes, granularity, and dimension and
fact declarations. Involvement of business data governance representatives
ensures business buy-in during this phase.

- Business processes are the day-to-day activities conducted by an
organization, like placing orders or processing claims. These activities
generate performance metrics, which become facts in a fact table. Fact tables
typically revolve around one specific business process. Selecting the right
process is crucial as it sets the design focus and helps define the granularity,
dimensions, and facts. Each business process corresponds to a row in the
enterprise data warehouse bus matrix.

Let's consider the business process of "Order Fulfillment" for a retail company
and provide examples of multiple fact tables related to this single process:

1. **Order Details Fact Table**:

| Order_ID | Customer_ID | Product_ID | Quantity | Order_Date | Ship_Date | Shipping_Cost |
|----------|-------------|------------|----------|------------|------------|---------------|
| 1 | 101 | 001 | 2 | 2024-02-15 | 2024-02-17 | $5.00 |
| 2 | 102 | 002 | 1 | 2024-02-16 | 2024-02-18 | $7.50 |
2. **Order Delivery Fact Table**:

| Delivery_ID | Order_ID | Delivery_Method | Delivery_Date | Delivery_Status |
|-------------|----------|-----------------|---------------|-----------------|
| 1001 | 1 | Express | 2024-02-17 | Delivered |
| 1002 | 2 | Standard | 2024-02-19 | In Transit |
3. **Order Payment Fact Table**:

| Payment_ID | Order_ID | Payment_Method | Payment_Amount | Payment_Date |
|------------|----------|----------------|----------------|--------------|
| 5001 | 1 | Credit Card | $50.00 | 2024-02-15 |
| 5002 | 2 | PayPal | $30.00 | 2024-02-16 |

Each of these fact tables provides different perspectives on the "Order
Fulfillment" process, focusing on various aspects such as order details,
delivery, and payment. Together, they offer a comprehensive view of the
performance metrics related to the single business process of order
fulfillment.

- In dimensional design, declaring the grain specifies what each row in a fact
table represents, forming a binding contract for consistency. It's vital to define
the grain before selecting dimensions or facts to ensure uniformity across
designs. Starting with atomic-grained data allows for flexible query handling,
while rolled-up summary grains aid performance tuning. Each proposed fact
table grain corresponds to a separate physical table, preventing mixing of
different grains within the same fact table for clarity and integrity.

Consider the following fact tables representing different grains within the
same business process:

**Fact Table: Sales_Facts (Atomic Grain)**

| Transaction_ID | Product_ID | Customer_ID | Quantity_Sold | Total_Sales_Amount | Sales_Date |
|----------------|------------|-------------|---------------|--------------------|------------|
| 1 | 101 | 201 | 2 | $100.00 | 2024-02-15 |
| 2 | 102 | 202 | 1 | $50.00 | 2024-02-16 |
| 3 | 103 | 203 | 3 | $150.00 | 2024-02-17 |

**Fact Table: Sales_Summary_Facts (Summary Grain)**

| Sales_Date | Total_Sales_Amount | Total_Customers | Total_Products_Sold |
|------------|--------------------|-----------------|---------------------|
| 2024-02-15 | $1000.00 | 50 | 120 |
| 2024-02-16 | $1200.00 | 60 | 150 |
| 2024-02-17 | $900.00 | 45 | 100 |
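Rolling the atomic grain up to a summary grain can be sketched as follows (a minimal Python sketch using the three atomic sample rows; the summary table above reflects a larger data set, so the totals here cover only these three rows):

```python
from collections import defaultdict

# Atomic grain: one row per individual sales transaction.
atomic_facts = [
    {"date": "2024-02-15", "product_id": 101, "customer_id": 201, "qty": 2, "amount": 100.0},
    {"date": "2024-02-16", "product_id": 102, "customer_id": 202, "qty": 1, "amount": 50.0},
    {"date": "2024-02-17", "product_id": 103, "customer_id": 203, "qty": 3, "amount": 150.0},
]

# Summary grain: one row per date, always derivable from the atomic rows
# (never the other way around -- summaries can't be drilled back down).
summary = defaultdict(lambda: {"total_amount": 0.0, "total_qty": 0})
for row in atomic_facts:
    day = summary[row["date"]]
    day["total_amount"] += row["amount"]
    day["total_qty"] += row["qty"]

print(summary["2024-02-15"]["total_amount"])  # 100.0
```

This is why starting at the atomic grain preserves flexibility: any summary grain is a rollup away, while a summary-only table can never answer transaction-level questions.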

- Dimension tables in dimensional modeling hold descriptive attributes utilized
by BI applications for filtering and grouping facts. Understanding the grain of a
fact table helps identify all relevant dimensions. It's preferable for dimensions
to be single-valued when associated with a fact row.

Consider a fact table representing sales transactions:


**Fact Table: Sales_Facts**

| Transaction_ID | Product_ID | Customer_ID | Quantity_Sold | Total_Sales_Amount | Sales_Date |
|----------------|------------|-------------|---------------|--------------------|------------|
| 1 | 101 | 201 | 2 | $100.00 | 2024-02-15 |
| 2 | 102 | 202 | 1 | $50.00 | 2024-02-16 |
| 3 | 103 | 203 | 3 | $150.00 | 2024-02-17 |

In this example:

- The grain of the "Sales_Facts" table is at the transaction level, with each row
representing a single sales transaction.
- Understanding this grain helps identify all relevant dimensions. In this case,
potential dimensions could include "Product_Dim" for product details,
"Customer_Dim" for customer information, and "Time_Dim" for time-related
attributes.
- For instance, the "Product_ID," "Customer_ID," and "Sales_Date" columns
in the fact table represent foreign keys that link to corresponding dimensions.
Each of these dimensions should ideally provide single-valued attributes when
associated with a fact row. For example, when a sales transaction occurs, it
should reference a single product, customer, and sales date to maintain data
integrity and consistency.
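The filtering-and-grouping role of dimension attributes can be sketched in code: each fact row's foreign key is resolved against a dimension, and the dimension's descriptive attributes drive the grouping (a minimal Python sketch; the category labels are hypothetical):

```python
# Product dimension: descriptive attributes keyed by the surrogate/product key.
# The category labels here are hypothetical illustration values.
product_dim = {
    101: {"category": "Computers"},
    102: {"category": "Phones"},
    103: {"category": "Computers"},
}
sales_facts = [
    {"product_id": 101, "amount": 100.0},
    {"product_id": 102, "amount": 50.0},
    {"product_id": 103, "amount": 150.0},
]

# Group fact rows by a dimension attribute, resolving the foreign key per row.
totals = {}
for row in sales_facts:
    category = product_dim[row["product_id"]]["category"]
    totals[category] = totals.get(category, 0.0) + row["amount"]

print(totals)  # {'Computers': 250.0, 'Phones': 50.0}
```

The fact table itself stores no category label; the dimension supplies it, which is why dimension attributes are the "entry points" for BI filtering and grouping.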

- Dimension tables contain the entry points and descriptive labels that enable
the DW/BI system to be leveraged for business analysis.

- A single fact table row has a one-to-one relationship to a measurement event
as described by the fact table’s grain. Within a fact table, only facts consistent
with the declared grain are allowed. For example, in a retail sales transaction,
the quantity of a product sold and its extended price are good facts, whereas
the store manager’s salary is disallowed.

- Star schemas are used in relational database management systems
(RDBMS) and consist of fact tables linked to dimension tables through
primary/foreign key relationships. They organize data in a dimensional
structure. On the other hand, OLAP cubes are dimensional structures in
multidimensional databases, equivalent to or derived from relational star
schemas, containing dimensional attributes and facts. They offer more
analytic capabilities than SQL, accessed through languages like XMLA and
MDX. OLAP cubes are often the final step in deploying a dimensional DW/BI
system or serve as aggregate structures based on relational star schemas.

Ex.
As an example of OLAP cubes existing as aggregate structures based on
more atomic relational star schemas:
Consider a manufacturing company that maintains a relational star schema to
analyze production data. The star schema consists of a fact table "Production"
linked to dimension tables such as "Product", "Time", and "Location".

To enhance performance and facilitate more efficient analysis, the company
decides to aggregate the data into OLAP cubes. These cubes summarize the
production data at various levels of granularity, such as monthly or quarterly
totals, across different product categories and geographic regions.
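That aggregation step can be sketched in code: atomic star-schema rows are rolled up into a cube-like structure keyed by month and product, the kind of pre-computed rollup an OLAP cube stores (a minimal Python sketch; the production rows are hypothetical):

```python
# Atomic rows from the relational star schema (hypothetical values).
production = [
    {"date": "2024-01-05", "product": "Widget", "units": 10},
    {"date": "2024-01-20", "product": "Widget", "units": 15},
    {"date": "2024-02-03", "product": "Widget", "units": 7},
]

# Pre-aggregate to a coarser grain: (month, product) -> total units.
cube = {}
for row in production:
    key = (row["date"][:7], row["product"])  # truncate date to month level
    cube[key] = cube.get(key, 0) + row["units"]

print(cube[("2024-01", "Widget")])  # 25
```

Queries against the pre-aggregated structure avoid rescanning the atomic rows, which is the performance benefit the text describes.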

- Dimensional models are resilient when data relationships change. All the
following changes can be implemented without altering any existing BI query
or application, and without any change in query results.
■ Facts consistent with the grain of an existing fact table can be added by
creating new columns.
■ Dimensions can be added to an existing fact table by creating new foreign
key columns, presuming they don't alter the fact table's grain.
■ Attributes can be added to an existing dimension table by creating new
columns.
■ The grain of a fact table can be made more atomic by adding attributes to
an existing dimension table, and then restating the fact table at the lower
grain, being careful to preserve the existing column names in the fact and
dimension tables.

Ex. of last point:

Let's say we have a dimensional model for sales data with a fact table called
"Sales_Fact" and a dimension table "Product_Dimension". The original
schema looks like this:

**Fact Table: Sales_Fact**


- sales_id (primary key)
- date_id (foreign key referencing Date_Dimension)
- product_id (foreign key referencing Product_Dimension)
- quantity_sold
- revenue

**Dimension Table: Product_Dimension**


- product_id (primary key)
- product_name
- category
- brand

Now, suppose we want to make the grain of the fact table more atomic by
adding a new attribute, such as "color", to the Product_Dimension table. After
updating the dimension table, we need to restate the fact table at the lower
grain:

**Updated Dimension Table: Product_Dimension**


- product_id (primary key)
- product_name
- category
- brand
- color

After updating the dimension table, we can restate the fact table to reflect the
new granularity:

**Updated Fact Table: Sales_Fact**


- sales_id (primary key)
- date_id (foreign key referencing Date_Dimension)
- product_id (foreign key referencing Product_Dimension)
- quantity_sold
- revenue

Now, the Sales_Fact rows are restated at the more atomic product-plus-color
grain. The column names are unchanged, but each row now corresponds to
sales of a specific product color referenced through the Product_Dimension.
This change allows for more detailed analysis without altering any existing BI
queries or applications, and without affecting query results.

- A fact table in dimensional modeling stores numeric measures from real-world
operational events. Each row represents a single event and includes
foreign keys for associated dimensions, optional degenerate dimension keys,
and timestamps. Fact tables are central to computations and dynamic
aggregations in queries, independent of eventual report generation.

Here's how we can illustrate each key point with the help of a table:

**Example of Degenerate Dimension Keys**

| Transaction ID | Date       | Product ID | Customer ID | Quantity Sold | Revenue | Payment Method |
|----------------|------------|------------|-------------|---------------|---------|----------------|
| TRX001         | 2024-01-01 | 101        | 201         | 2             | $50     | Credit Card    |
| TRX002         | 2024-01-02 | 102        | 202         | 1             | $30     | Cash           |
| TRX003         | 2024-01-03 | 103        | 203         | 3             | $80     | Debit Card     |
| TRX004         | 2024-01-04 | 104        | 204         | 2             | $60     | Cash           |
1. **Unique Identifiers**: The "Transaction ID" serves as a unique identifier for
each transaction. Each row represents a different transaction, and the
Transaction ID uniquely identifies it.

2. **Embedded in Fact Table**: The "Transaction ID" is directly embedded


within the fact table along with other transaction details. It is not stored in a
separate dimension table.

3. **Avoids Dimension Table Creation**: The Transaction ID, being a unique


identifier for each transaction, doesn't require additional descriptive attributes
beyond its uniqueness. Hence, there's no need for a separate dimension
table.

4. **Provide Context**: The Transaction ID provides crucial context about


each transaction. It helps in uniquely identifying and tracking individual
transactions, facilitating reporting and analysis.

This example demonstrates how degenerate dimension keys are integrated


directly within the fact table, providing unique identifiers for transactions
without the need for separate dimension tables.

- The numeric measures in a fact table fall into three categories. The most
flexible and useful facts are fully additive; additive measures can be summed
across any of the dimensions associated with the fact table. Semi-additive
measures can be summed across some dimensions, but not all; balance
amounts are common semi-additive facts because they are additive across all
dimensions except time. Finally, some measures are completely non-additive,
such as ratios. A good approach for non-additive facts is, where possible, to
store the fully additive components of the non-additive measure and sum
these components into the final answer set before calculating the final
non-additive fact. This final calculation is often done in the BI layer or OLAP cube.
Ex.

Calculate Fully Additive Components:

Total Revenue: Sum of revenue across all transactions.


Transaction Count: Count of transactions.

Calculate Final Non-Additive Fact:

Average Revenue per Transaction = Total Revenue / Transaction Count.
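The component-then-ratio approach above can be sketched in Python; this is a minimal illustration with made-up transaction data, not a specific BI tool's API:

```python
# Sketch: computing a non-additive fact (average revenue per transaction)
# from its fully additive components. Transaction data is hypothetical.
transactions = [
    {"region": "East", "revenue": 50.0},
    {"region": "East", "revenue": 30.0},
    {"region": "West", "revenue": 80.0},
]

# Step 1: sum the fully additive components over the answer set.
total_revenue = sum(t["revenue"] for t in transactions)   # fully additive
transaction_count = len(transactions)                     # fully additive

# Step 2: compute the non-additive ratio last, in the BI layer.
avg_revenue = total_revenue / transaction_count
print(avg_revenue)  # roughly 53.33
```

Summing precomputed averages across regions would give the wrong answer; carrying the additive components through the aggregation and dividing last is what keeps the ratio correct.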

- Null-valued measurements behave gracefully in fact tables. The aggregate


functions
(SUM, COUNT, MIN, MAX, and AVG) all do the “right thing” with null facts.
However,
nulls must be avoided in the fact table’s foreign keys because these nulls
would automatically cause a referential integrity violation. Rather than a null
foreign key,
the associated dimension table must have a default row (and surrogate key)
representing the unknown or not applicable condition.

Ex.
Let's demonstrate the concept with example fact and dimension
tables, including a default row with a surrogate key in the associated
dimension table:

**Example Fact Table: Sales_Fact**

| Transaction ID | Date ID | Product ID | Customer ID | Quantity Sold | Revenue |
|----------------|---------|------------|-------------|---------------|---------|
| TRX001         | 101     | 201        | 301         | 2             | $50     |
| TRX002         | 102     | 202        | NULL        | 1             | $30     |
| TRX003         | 103     | 203        | 302         | 3             | $80     |
| TRX004         | 104     | NULL       | 303         | 2             | $60     |

**Example Dimension Table: Customer_Dimension**

| Customer ID | Customer Name | City     | Country |
|-------------|---------------|----------|---------|
| 301         | John          | New York | USA     |
| 302         | Alice         | Boston   | USA     |
| 303         | Emily         | Chicago  | USA     |
| ...         | ...           | ...      | ...     |
| -1          | Unknown       | Unknown  | Unknown |

The `-1` row is the default row representing the unknown or not-applicable customer.
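In the ETL pipeline, the NULL foreign keys shown in the fact table above would be replaced by the default row's surrogate key (-1) before loading, so referential integrity is preserved. A minimal sketch of that substitution; the key map and function name are hypothetical:

```python
# Sketch: ETL surrogate-key substitution for missing dimension references.
# Rather than loading a NULL foreign key, map it to the default row (-1).
UNKNOWN_KEY = -1  # surrogate key of the "Unknown" default dimension row

def resolve_customer_key(natural_key, key_map):
    """Look up the surrogate key; fall back to the default row."""
    if natural_key is None:
        return UNKNOWN_KEY
    return key_map.get(natural_key, UNKNOWN_KEY)

key_map = {"C301": 301, "C302": 302, "C303": 303}  # hypothetical natural->surrogate map
print(resolve_customer_key("C302", key_map))  # 302
print(resolve_customer_key(None, key_map))    # -1
```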

- If the same measurement appears in separate fact tables, care must be


taken to make sure the technical definitions of the facts are identical if they
are to be compared or computed together. If the separate fact definitions are
consistent, the conformed facts should be identically named; but if they are
incompatible, they should be differently named to alert the business users and
BI applications.

Ex. **Example Fact Tables:**

1. **Sales_Fact_USA:**

| Transaction ID | Date       | Product ID | Customer ID | Quantity Sold | Revenue (USD) |
|----------------|------------|------------|-------------|---------------|---------------|
| TRX001         | 2024-01-01 | 101        | 201         | 2             | $50           |
| TRX002         | 2024-01-02 | 102        | 202         | 1             | $30           |
| TRX003         | 2024-01-03 | 103        | 203         | 3             | $80           |
| TRX004         | 2024-01-04 | 104        | 204         | 2             | $60           |

2. **Sales_Fact_Europe:**

| Transaction ID | Date       | Product ID | Customer ID | Quantity Sold | Revenue (EUR) |
|----------------|------------|------------|-------------|---------------|---------------|
| TRX005         | 2024-01-01 | 101        | 301         | 2             | €40           |
| TRX006         | 2024-01-02 | 102        | 302         | 1             | €25           |
| TRX007         | 2024-01-03 | 103        | 303         | 3             | €70           |
| TRX008         | 2024-01-04 | 104        | 304         | 2             | €55           |

In this example:
- we could name the revenue columns as "Revenue_USD" and
"Revenue_EUR" to indicate the currency differences.

- Transaction fact tables represent measurement events at a specific point in
space and time. They are highly dimensional and expressive, enabling
extensive slicing and dicing of transaction data. These tables can be dense or
sparse, containing rows only for events where measurements occur. They
always include foreign keys for associated dimensions and may include
precise timestamps and degenerate dimension keys. Measured numeric facts
in transaction fact tables must be consistent with the transaction grain,
ensuring that each row represents a single measurement event without
aggregation or summarization.

**Example Transaction Fact Table: Sales_Transactions**

| Transaction ID | Date       | Product ID | Customer ID | Quantity Sold | Revenue |
|----------------|------------|------------|-------------|---------------|---------|
| TRX001         | 2024-01-01 | 101        | 201         | 2             | $50     |
| TRX002         | 2024-01-02 | 102        | 202         | 1             | $30     |
| TRX003         | 2024-01-03 | 103        | 203         | 3             | $80     |
| TRX004         | 2024-01-04 | 104        | 204         | 2             | $60     |

- A row in a periodic snapshot fact table summarizes many measurement


events occurring over a standard period, such as a day, a week, or a month.
The grain is the
period, not the individual transaction. Periodic snapshot fact tables often
contain
many facts because any measurement event consistent with the fact table
grain is
permissible. These fact tables are uniformly dense in their foreign keys
because
even if no activity takes place during the period, a row is typically inserted in
the
fact table containing a zero or null for each fact.

**Example Periodic Snapshot Fact Table: Weekly_Sales_Snapshot**


| Week Start Date | Week End Date | Total Sales Revenue | Total Quantity Sold | Average Customer Satisfaction |
|-----------------|---------------|---------------------|---------------------|-------------------------------|
| 2024-01-01      | 2024-01-07    | $5000               | 100                 | 4.5                           |
| 2024-01-08      | 2024-01-14    | $5500               | 110                 | 4.6                           |
| 2024-01-15      | 2024-01-21    | $4800               | 95                  | 4.3                           |
| 2024-01-22      | 2024-01-28    | $5200               | 105                 | 4.7                           |

In this example:
- Each row in the `Weekly_Sales_Snapshot` table summarizes the sales data
for a specific week.
- The grain of the fact table is the week, not the individual transaction. It
summarizes measurement events occurring over a standard period.
- The fact table includes various aggregated measures for each week, such
as total sales revenue, total quantity sold, and average customer satisfaction.
- Accumulating snapshot fact tables summarize measurement events that
happen at predictable steps within a process, particularly suited for pipeline or
workflow processes like order fulfillment or claim processing. Each row
corresponds to a specific instance, initiated at the process start and updated
as it progresses. Unique to this type of fact table is continuous updating as the
pipeline advances. Besides date foreign keys for critical milestones, they
include foreign keys for other dimensions and may feature degenerate
dimensions. Numeric lag measurements and milestone completion counters
are common attributes, offering insights into process duration and milestone
achievement.

Let's illustrate this concept with an example involving an order fulfillment
process. We'll create simplified data tables to represent the accumulating
snapshot fact table, as well as related dimension tables.

1. **Accumulating Snapshot Fact Table with Numeric Lag Measurements and


Completion Counters**:

| Order_ID | Start_Date | Stage1_Date | Stage2_Date | Stage3_Date | Stage1_Lag | Stage2_Lag | Stage3_Lag | Stage1_Completed | Stage2_Completed | Stage3_Completed |
|----------|------------|-------------|-------------|-------------|------------|------------|------------|------------------|------------------|------------------|
| 1        | 2024-03-01 | 2024-03-02  | 2024-03-05  | 2024-03-07  | 1 day      | 3 days     | 2 days     | Yes              | Yes              | Yes              |
| 2        | 2024-03-02 | 2024-03-03  | NULL        | NULL        | 1 day      | NULL       | NULL       | Yes              | No               | No               |
| 3        | 2024-03-03 | NULL        | NULL        | NULL        | NULL       | NULL       | NULL       | No               | No               | No               |

In this enhanced fact table:


- Numeric lag measurements (e.g., Stage1_Lag, Stage2_Lag) represent the
duration between consecutive stages for each order. For instance, for Order
1, Stage1_Lag is 1 day, indicating the time taken to move from the start to
Stage 1.
- Completion counters (e.g., Stage1_Completed, Stage2_Completed) indicate
whether each stage has been completed. For example, for Order 1, all stages
are marked as completed (Yes) since it has progressed through all stages.
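The lag measurements and completion flags above are simple derivations from the milestone dates. A small Python sketch, using the dates from Order 1 (the function name is hypothetical):

```python
# Sketch: deriving lag measurements for an accumulating snapshot row
# as an order moves through the pipeline.
from datetime import date

def lag_days(earlier, later):
    """Days between two milestone dates; None until the later milestone occurs."""
    if earlier is None or later is None:
        return None
    return (later - earlier).days

start = date(2024, 3, 1)
stage1 = date(2024, 3, 2)
stage2 = date(2024, 3, 5)

print(lag_days(start, stage1))   # 1  -> Stage1_Lag = 1 day
print(lag_days(stage1, stage2))  # 3  -> Stage2_Lag = 3 days
print(lag_days(stage2, None))    # None -> Stage 3 not yet reached
```

Each time the pipeline advances, the ETL revisits the existing fact row, fills in the newly known milestone date, and recomputes these lags and the completion flags.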

2. **Dimension Tables**:
- These tables provide additional context for the fact table.

a. **Order Dimension**:

| Order_ID | Customer_ID | Product_ID |
|----------|-------------|------------|
| 1        | 101         | 201        |
| 2        | 102         | 202        |
| 3        | 103         | 203        |

b. **Customer Dimension**:

| Customer_ID | Name  | Address     |
|-------------|-------|-------------|
| 101         | John  | 123 Main St |
| 102         | Sarah | 456 Elm St  |
| 103         | David | 789 Oak St  |

c. **Product Dimension**:

| Product_ID | Name       | Category    |
|------------|------------|-------------|
| 201        | Laptop     | Electronics |
| 202        | Smartphone | Electronics |
| 203        | Headphones | Electronics |

In this example, the accumulating snapshot fact table tracks orders through
stages of fulfillment, with each stage having a corresponding date. For
instance, Order 1 started on March 1st, moved to Stage 1 on March 2nd, then
to Stage 2 on March 5th, and finally completed on March 7th. Order 2 is still in
progress, having reached Stage 1 on March 3rd. Order 3 has not yet begun.

- Factless fact tables capture events that lack numerical metrics but involve
dimensional entities coming together at specific times, such as a student
attending a class or customer communications. These tables primarily consist
of foreign keys referencing dimensions like time, individuals, locations, and
event types. Additionally, factless fact tables enable analysis of events that
didn't happen by comparing a coverage table (listing all potential events) with
an activity table (documenting events that did occur). The difference between
the two reveals events that didn't transpire, providing insights into missed
opportunities or areas for improvement.

Let's demonstrate the last point, analyzing events that didn't happen, using
the factless fact table of student attendance and related dimension tables.

1. **Coverage Table**:
- This table lists all possible combinations of students, classes, and dates.

| Date       | Student_ID | Class_ID |
|------------|------------|----------|
| 2024-03-01 | 101        | 301      |
| 2024-03-01 | 102        | 301      |
| 2024-03-01 | 103        | 301      |
| 2024-03-01 | 104        | 301      |
| 2024-03-02 | 101        | 302      |
| 2024-03-02 | 102        | 302      |
| 2024-03-02 | 103        | 302      |
| 2024-03-02 | 104        | 302      |

2. **Activity Table**:
- This table documents the actual instances of student attendance in
classes.

| Date       | Student_ID | Class_ID |
|------------|------------|----------|
| 2024-03-01 | 101        | 301      |
| 2024-03-01 | 102        | 301      |
| 2024-03-02 | 103        | 302      |

3. **Events That Didn't Happen (Difference between Coverage and Activity)**:

| Date       | Student_ID | Class_ID |
|------------|------------|----------|
| 2024-03-01 | 103        | 301      |
| 2024-03-01 | 104        | 301      |
| 2024-03-02 | 101        | 302      |
| 2024-03-02 | 102        | 302      |
| 2024-03-02 | 104        | 302      |

In this example:
- This analysis provides insights into absenteeism patterns and helps
educators and administrators address attendance issues more effectively.
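The "events that didn't happen" table is just the set difference between the coverage and activity tables, which in SQL is a single `EXCEPT` query. A runnable sketch using SQLite from Python, loaded with the example's data (table and column names follow the tables above):

```python
# Sketch: subtracting the activity table from the coverage table with
# SQLite's EXCEPT set operator to find classes a student did not attend.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE coverage (date TEXT, student_id INTEGER, class_id INTEGER);
CREATE TABLE activity (date TEXT, student_id INTEGER, class_id INTEGER);
INSERT INTO coverage VALUES
  ('2024-03-01', 101, 301), ('2024-03-01', 102, 301),
  ('2024-03-01', 103, 301), ('2024-03-01', 104, 301),
  ('2024-03-02', 101, 302), ('2024-03-02', 102, 302),
  ('2024-03-02', 103, 302), ('2024-03-02', 104, 302);
INSERT INTO activity VALUES
  ('2024-03-01', 101, 301), ('2024-03-01', 102, 301),
  ('2024-03-02', 103, 302);
""")

missed = conn.execute("""
  SELECT date, student_id, class_id FROM coverage
  EXCEPT
  SELECT date, student_id, class_id FROM activity
  ORDER BY date, student_id
""").fetchall()

for row in missed:
    print(row)  # the five absences from the difference table above
```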
- Aggregate fact tables are optimized versions of atomic fact tables, intended
to accelerate query performance in Business Intelligence (BI) systems. They
are designed to be readily accessible alongside atomic fact tables, allowing BI
tools to seamlessly choose the appropriate level of aggregation during query
execution, a process known as aggregate navigation. This ensures consistent
performance benefits across different reporting and analysis tools. Aggregate
fact tables contain summarized numeric data obtained by aggregating
measures from atomic fact tables. They also include foreign keys pointing to
shrunken conformed dimensions, maintaining consistency in data
representation. Essentially acting like database indexes, aggregate fact tables
boost query performance without direct interaction from BI applications or
users. Additionally, aggregate OLAP cubes, built in a similar manner, provide
summarized measures directly accessible to business users for analysis.

1. **Aggregate Fact Table (Aggregate Sales by Product Category and


Month)**:
- This table summarizes sales data aggregated by month and product
category.

| Month    | Product_Category_ID | Total_Revenue |
|----------|---------------------|---------------|
| January  | Electronics         | $420          |
| January  | Clothing            | $300          |
| February | Electronics         | $500          |
| February | Clothing            | $350          |

2. **Shrunken Conformed Dimensions**:


- These are subset dimensions of the original dimensions, tailored to fit the
aggregated data.

a. **Shrunken Date Dimension** (only relevant attributes):

| Month | Year |
|-------|------|
| January | 2024 |
| February | 2024 |

b. **Shrunken Product Category Dimension** (only relevant attributes):

| Product_Category_ID | Name |
|---------------------|-----------|
| Electronics | Electronics |
| Clothing | Clothing |

By incorporating shrunken conformed dimensions, the aggregate fact table
remains aligned with the dimensional model while providing efficient access to
summarized data.
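Building such an aggregate fact table from the atomic fact table is essentially a `GROUP BY` rollup. A sketch using SQLite from Python; the table names and sample revenues are hypothetical, chosen so the January Electronics rows sum to the $420 shown above:

```python
# Sketch: deriving an aggregate fact table from an atomic fact table
# with a GROUP BY rollup (monthly totals by product category).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (month TEXT, category TEXT, revenue REAL);
INSERT INTO sales_fact VALUES
  ('January', 'Electronics', 300), ('January', 'Electronics', 120),
  ('January', 'Clothing', 300),
  ('February', 'Electronics', 500), ('February', 'Clothing', 350);
CREATE TABLE agg_sales AS
  SELECT month, category, SUM(revenue) AS total_revenue
  FROM sales_fact GROUP BY month, category;
""")

rows = conn.execute(
    "SELECT month, category, total_revenue FROM agg_sales ORDER BY month, category"
).fetchall()
print(rows)
```

In a real deployment the ETL maintains `agg_sales` incrementally as the atomic fact table grows, and the BI tool's aggregate navigator decides whether a query can be answered from the rollup or must hit the atomic rows.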
- Consolidated fact tables are a method of combining data from multiple
processes into a single table when they share the same granularity. For
instance, combining actual sales data with sales forecasts into one table
simplifies the analysis of actuals versus forecasts. While this approach adds
complexity to the ETL (Extract, Transform, Load) process, it streamlines
analysis for BI (Business Intelligence) applications. Consolidated fact tables
bring together data from various processes into a single table, simplifying
analysis of related metrics. They eliminate the need for complex setups like
drill-across applications, making it easier to compare and understand different
aspects of business performance in one place.

- Every dimension table has a single primary key column. This primary key is
embedded as a foreign key in any associated fact table where the dimension
row’s descriptive context is exactly correct for that fact table row. Dimension
tables are usually wide, flat denormalized tables with many low-cardinality text
attributes. While operational codes and indicators can be treated as attributes,
the most powerful dimension attributes are populated with verbose
descriptions. Dimension table attributes are the primary target of constraints
and grouping specifications from queries and BI applications. The descriptive
labels on reports are typically dimension attribute domain values.

Ex.
We have a "Customer_Dimension" table containing information about
customers such as their ID, name, type (Retail/Business), segment
(Individual/Corporate/Small Business), region, age group, and an indicator
"Is_Premium_Customer" (1 for premium, 0 for non-premium).

Ex. of last line:

In a retail sales data warehousing scenario, consider a dimension table


named "Product_Dimension" with the following attributes:

- Product_ID (Primary Key)


- Product_Name
- Category
- Brand
- Color
- Size

Suppose a fact table called "Sales_Fact" records sales transactions, including


the following columns:

- Sales_ID
- Sales_Date
- Customer_ID
- Product_ID (Foreign Key referencing Product_Dimension)
- Quantity_Sold
- Total_Sales_Amount
Now, imagine generating a sales report. The descriptive labels on this report
would likely derive from the dimension attribute domain values. For instance,
instead of just displaying product IDs, the report would show the actual
product names, categories, brands, colors, and sizes associated with each
sale. This way, users can easily understand and interpret the sales data within
the context of specific products and their attributes.

- A dimension table is designed with one column serving as a unique primary


key.
This primary key cannot be the operational system’s natural key because
there will
be multiple dimension rows for that natural key when changes are tracked
over time.
For instance, if the brand of a product changes, we would end up with two
rows for the same product in the dimension table, each reflecting its state at
different times.
In addition, natural keys for a dimension may be created by more than one
source
system, and these natural keys may be incompatible or poorly administered.

**Customer Dimension Table:**

| Customer_ID | Customer_Name | Customer_Email     |
|-------------|---------------|--------------------|
| 1           | John Smith    | [email protected] |
| 2           | Jane Doe      | [email protected] |
| 3           | Bob Johnson   | [email protected] |

Now, imagine two different source systems providing data about customers:

**Source System A:**

| Customer_ID_A | Customer_Name_A | Customer_Email_A   |
|---------------|-----------------|--------------------|
| 101           | John Smith      | [email protected] |
| 102           | Jane Doe        | [email protected] |
| 103           | Bob Johnson     | [email protected] |

**Source System B:**

| Customer_ID_B | Customer_Name_B | Customer_Email_B   |
|---------------|-----------------|--------------------|
| 201           | John Smith      | [email protected] |
| 202           | Jane Doe        | [email protected] |
| 203           | Bob Johnson     | [email protected] |

Using these identifiers directly as primary keys in the dimension table would
result in inconsistency and complexity.
Instead, we would use surrogate keys (e.g., "Customer_ID" in the Customer
Dimension Table) to maintain consistency and compatibility across different
source systems.

The DW/BI system needs to claim control of the primary keys of all
dimensions; rather than using explicit natural keys or natural keys with
appended dates, you should create anonymous integer primary keys for every
dimension.

**Example 1: Product Dimension Table**

Consider a "Product" dimension table with the following attributes:

| Product_ID | Product_Name | Category    | Brand   | Unit_Price |
|------------|--------------|-------------|---------|------------|
| 1          | Laptop       | Electronics | Dell    | 1000       |
| 2          | Smartphone   | Electronics | Samsung | 800        |
| 3          | Sneakers     | Fashion     | Nike    | 120        |

In this case, the "Product_ID" column serves as the anonymous integer


primary key generated and managed by the data warehousing system.
**Example 2: Customer Dimension Table**

Now, let's consider a "Customer" dimension table:

| Customer_ID | Customer_Name | Customer_Email     |
|-------------|---------------|--------------------|
| 1           | John Smith    | [email protected] |
| 2           | Jane Doe      | [email protected] |
| 3           | Bob Johnson   | [email protected] |

Here again, the "Customer_ID" column serves as the anonymous integer


primary key, generated and managed by the data warehousing system.

**Example 3: Time Dimension Table**

Finally, let's look at a "Time" dimension table:

| Time_ID | Date       | Day       | Month   | Year |
|---------|------------|-----------|---------|------|
| 1       | 2024-01-01 | Monday    | January | 2024 |
| 2       | 2024-01-02 | Tuesday   | January | 2024 |
| 3       | 2024-01-03 | Wednesday | January | 2024 |

Once again, the "Time_ID" column serves as the anonymous integer primary
key, controlled by the data warehousing system.
This approach simplifies data management, ensures consistency, and
facilitates easier integration, querying, and analysis of data within the data
warehousing environment.
These dimension surrogate keys are simple integers, assigned in sequence,
starting with the value 1, every time a new key is needed. The date dimension
is exempt from the surrogate key rule; this highly predictable and stable
dimension can use a more meaningful primary key.

**Date Dimension Table:**

| Date_ID | Date       | Day_of_Week | Month   | Quarter | Year |
|---------|------------|-------------|---------|---------|------|
| 1       | 2024-01-01 | Monday      | January | Q1      | 2024 |
| 2       | 2024-01-02 | Tuesday     | January | Q1      | 2024 |
| 3       | 2024-01-03 | Wednesday   | January | Q1      | 2024 |

Using the date itself as the primary key in the "Date" dimension table allows
for easier interpretation and analysis of time-related data. It aligns with the
natural structure and predictability of dates, making it more intuitive for users
querying the data. Additionally, it simplifies joins with fact tables that contain
date references, as there's no need to join on surrogate keys or perform
additional conversions.
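The sequential surrogate-key assignment described above can be sketched as a simple generator; this is a hypothetical ETL helper, not a specific tool's API:

```python
# Sketch: a minimal surrogate-key generator for dimension loads, assigning
# sequential integers starting at 1, independent of any natural key.
import itertools

class SurrogateKeyGenerator:
    def __init__(self):
        self._counter = itertools.count(start=1)
        self._assigned = {}  # natural key -> surrogate key

    def key_for(self, natural_key):
        """Assign a new surrogate key the first time a natural key is seen."""
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._counter)
        return self._assigned[natural_key]

gen = SurrogateKeyGenerator()
print(gen.key_for("SKU-1001"))  # 1
print(gen.key_for("SKU-2002"))  # 2
print(gen.key_for("SKU-1001"))  # 1 (already assigned)
```

In practice the counter would be a database sequence or identity column so that keys survive restarts, but the principle is the same: the warehouse, not the source system, owns the key.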

- In a data warehousing environment, natural keys created by operational


source systems may change due to external business rules. For example, an
employee number may change if the employee resigns and then rejoins the
company. To maintain consistency in the data warehouse, a new durable key,
known as a durable supernatural key, must be created for each entity. These
durable keys remain persistent and do not change over time, regardless of
any changes to the natural keys. The most effective durable keys have a
format that is independent of the original business process, often simple
integers assigned sequentially starting from 1. While multiple surrogate keys
may be associated with an entity over time, the durable key remains constant
and provides a reliable identifier for the entity.

- Drilling down is the most fundamental way data is analyzed by business


users. Drilling down simply means adding a row header to an existing query;
the new row header is a dimension attribute appended to the GROUP BY
expression in an SQL query. The attribute can come from any dimension
attached to the fact table in the query. Drilling down does not require the
definition of predetermined hierarchies or drill-down paths.

**Sales_Fact Table:**

| Sales_ID | Product_ID | Quantity_Sold | Total_Sales_Amount |
|----------|------------|---------------|--------------------|
| 101      | 1          | 3             | 150                |
| 102      | 2          | 2             | 200                |
| 103      | 1          | 1             | 50                 |

**Product_Dimension Table:**
| Product_ID | Product_Name | Category    | Brand   |
|------------|--------------|-------------|---------|
| 1          | Laptop       | Electronics | Dell    |
| 2          | Smartphone   | Electronics | Samsung |

Now, suppose we want to analyze sales by product category. We start with a


basic query:

```sql
SELECT pd.Category, SUM(sf.Total_Sales_Amount) AS Total_Sales
FROM Sales_Fact sf
JOIN Product_Dimension pd ON sf.Product_ID = pd.Product_ID
GROUP BY pd.Category;
```

This query gives us the total sales amount for each product category:

| Category | Total_Sales |
|-------------|-------------|
| Electronics | 400 |

Now, let's say we want to drill down further and see sales by both product
category and brand. We can achieve this by adding the "Brand" attribute from
the "Product_Dimension" table to the GROUP BY expression:

```sql
SELECT pd.Category, pd.Brand, SUM(sf.Total_Sales_Amount) AS
Total_Sales
FROM Sales_Fact sf
JOIN Product_Dimension pd ON sf.Product_ID = pd.Product_ID
GROUP BY pd.Category, pd.Brand;
```

This query gives us sales breakdown by both product category and brand:

| Category    | Brand   | Total_Sales |
|-------------|---------|-------------|
| Electronics | Dell    | 200         |
| Electronics | Samsung | 200         |

Ex2.

**Time_Dimension Table:**

| Time_ID | Date       | Year | Month    | Day |
|---------|------------|------|----------|-----|
| 1       | 2024-01-15 | 2024 | January  | 15  |
| 2       | 2024-02-20 | 2024 | February | 20  |
| 3       | 2024-03-10 | 2024 | March    | 10  |

Suppose we want to analyze sales using a predetermined time hierarchy of
Year -> Month -> Day (assuming the Sales_Fact table also carries a
Sales_Date column to join on):

```sql
SELECT td.Year, td.Month, td.Day, SUM(sf.Total_Sales_Amount) AS
Total_Sales
FROM Sales_Fact sf
JOIN Time_Dimension td ON sf.Sales_Date = td.Date
GROUP BY td.Year, td.Month, td.Day;
```

This query would provide a detailed breakdown of sales by year, month, and
day:

| Year | Month    | Day | Total_Sales |
|------|----------|-----|-------------|
| 2024 | January  | 15  | 150         |
| 2024 | February | 20  | 200         |
| 2024 | March    | 10  | 50          |

In this example, the analysis follows a predetermined time hierarchy, drilling


down from year to month to day. Each step provides more granular detail in
the analysis, similar to how a predetermined hierarchy would be used for
analyzing time-related data.

- Let's illustrate the concept of a degenerate dimension with an example


involving an "Invoice" fact table and its line item fact rows.

**Invoice Fact Table with Degenerate Dimension:**

| Invoice_Number | Line_Item_Number | Product_ID | Quantity | Unit_Price | Total_Amount |
|----------------|------------------|------------|----------|------------|--------------|
| INV-001        | 1                | 1          | 2        | 50         | 100          |
| INV-001        | 2                | 2          | 3        | 40         | 120          |
| INV-002        | 1                | 3          | 1        | 80         | 80           |
| INV-002        | 2                | 4          | 2        | 30         | 60           |

In this scenario:

- Each row in the fact table represents a line item on an invoice.


- The "Invoice_Number" serves as a unique identifier for each invoice, while
"Line_Item_Number" uniquely identifies each line item within an invoice.
- The primary key for the Invoice Fact Table is {Invoice_Number,
Line_Item_Number}.
- The fact rows inherit all the descriptive dimension foreign keys of the invoice,
such as "Invoice_Number."
- However, the fact table doesn't have a separate dimension table for the
invoice, leaving the invoice number as the only attribute associated with the
invoice.
- This situation is referred to as a degenerate dimension because the
dimension has no content except for its primary key, which is the
"Invoice_Number."
- The "Invoice_Number" is still used as a valid dimension key for fact tables at
the line item level, allowing analysis and reporting at that level.

In summary, a degenerate dimension occurs when a dimension has no


additional attributes beyond its primary key and is typically placed directly
within the fact table to represent data that doesn't warrant a separate
dimension table.

- In an operational database designed for an online store, you might have


separate normalized tables for customers, orders, and products.
- In contrast, in a dimensional model for data warehousing and analysis
purposes, you might denormalize these tables into a single fact table and
related dimension tables.

Consider a company's organizational structure, which may include


departments, teams, and employees. This structure forms a hierarchy where
each employee belongs to a team, and each team belongs to a department.
Here's a simplified version:

- **Departments**:
- Marketing
- Sales
- Finance

- **Teams** within Marketing:


- Branding
- Digital Marketing

- **Teams** within Sales:


- Inside Sales
- Field Sales

- **Teams** within Finance:


- Accounting
- Treasury

- **Employees**:
- John (Digital Marketing, Marketing)
- Alice (Inside Sales, Sales)
- Bob (Accounting, Finance)

In this hierarchy:
- Each department (e.g., Marketing, Sales, Finance) can have multiple teams.
- Each team (e.g., Branding, Digital Marketing) belongs to one department.
- Each employee belongs to one team and, by extension, one department.

This hierarchical structure is a many-to-one fixed-depth hierarchy because:


- There can be many employees within a team, but each employee belongs to
only one team.
- Each team belongs to one department, but a department can have multiple
teams.

Now, in a normalized database schema, you might represent this hierarchy


with separate tables for departments, teams, and employees, with foreign key
relationships between them.

In a dimensional model, however, you would denormalize this hierarchy into a


single dimension table with separate attributes representing each level:

- **Dimension Table**:
- employee_id
- employee_name
- team_name
- department_name

In this denormalized dimension table:


- Each row represents an employee.
- The attributes `team_name` and `department_name` directly indicate the
team and department to which each employee belongs.
- Instead of navigating through multiple tables and performing joins, you can
query this flattened dimension table directly for analytics or reporting
purposes, simplifying queries and enhancing performance.

This denormalized structure supports dimensional modeling's objectives of


simplicity and speed, crucial for efficient data analysis in data warehousing
environments.

- Many dimensions contain more than one natural hierarchy.

**Example 1: Date Dimension with Multiple Hierarchies**

Consider a date dimension table that contains data for different hierarchies:

| Date       | Day      | Week    | Month | Quarter | Year | Fiscal Period |
|------------|----------|---------|-------|---------|------|---------------|
| 2024-03-10 | Saturday | Week 10 | March | Q1      | 2024 | FY 2024 Q1    |
| 2024-03-11 | Sunday   | Week 11 | March | Q1      | 2024 | FY 2024 Q1    |
| 2024-03-12 | Monday   | Week 11 | March | Q1      | 2024 | FY 2024 Q1    |
| ...        | ...      | ...     | ...   | ...     | ...  | ...           |
In this example:
- The "Date" column represents individual dates.
- Multiple hierarchies coexist within the same dimension table:
- Day-to-Week-to-Fiscal Period hierarchy: Day -> Week -> Fiscal Period
- Day-to-Month-to-Year hierarchy: Day -> Month -> Year
- Each hierarchy level is represented by a separate column, allowing users to
navigate through different levels of granularity easily.

**Example 2: Location Dimension with Multiple Hierarchies**

Now, let's consider a location dimension table that contains data for different
geographic hierarchies:

| Location    | Country | Region    | State      | City        |
|-------------|---------|-----------|------------|-------------|
| New York    | USA     | Northeast | New York   | New York    |
| Los Angeles | USA     | West      | California | Los Angeles |
| Chicago     | USA     | Midwest   | Illinois   | Chicago     |
| ...         | ...     | ...       | ...        | ...         |

In this example:
- The "Location" column represents individual locations.
- Multiple geographic hierarchies coexist within the same dimension table:
- Country-to-Region-to-State-to-City hierarchy: Country -> Region -> State ->
City
- State-to-City hierarchy: State -> City
- Each level of the hierarchies is represented by a separate column, allowing
users to analyze data at different geographic levels.

In both examples, having multiple hierarchies within the same dimension table
provides flexibility and simplicity for querying and analyzing data. Users can
navigate through different levels of granularity without the need for complex
joins or separate dimension tables, thus supporting the objectives of
dimensional modeling.
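As a sketch, both hierarchies can be rolled up from the same dimension table with plain GROUP BY queries; the schema below is a minimal, hypothetical slice of the date dimension:

```python
import sqlite3

# A tiny date dimension holding two coexisting hierarchies:
# Day -> Month -> Year, and Day -> Fiscal Period.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE date_dim
    (date TEXT, week TEXT, month TEXT, year INTEGER, fiscal_period TEXT)""")
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?, ?)", [
    ("2024-03-10", "Week 10", "March", 2024, "FY 2024 Q1"),
    ("2024-03-11", "Week 11", "March", 2024, "FY 2024 Q1"),
])

# Calendar hierarchy: roll days up to month/year.
by_month = conn.execute(
    "SELECT month, year, COUNT(*) FROM date_dim GROUP BY month, year"
).fetchall()

# Fiscal hierarchy: roll the same days up to fiscal period instead.
by_fiscal = conn.execute(
    "SELECT fiscal_period, COUNT(*) FROM date_dim GROUP BY fiscal_period"
).fetchall()

print(by_month)   # [('March', 2024, 2)]
print(by_fiscal)  # [('FY 2024 Q1', 2)]
```

No joins are needed to switch hierarchies: the user simply groups by a different set of columns.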

- Let's provide separate examples of flags and operational indicators with
tabular data, along with their expanded descriptive attributes.

**Example 1: Flags**

Consider a dimension table for customer data where true/false flags are used
to indicate certain attributes:

| Customer ID | Name       | Premium Flag | Active Flag | Email Opt-in Flag |
|-------------|------------|--------------|-------------|-------------------|
| 001         | John Doe   | true         | true        | false             |
| 002         | Jane Smith | false        | true        | true              |
| 003         | Alice Lee  | true         | false       | true              |
| ...         | ...        | ...          | ...         | ...               |

In this example:
- The "Premium Flag," "Active Flag," and "Email Opt-in Flag" are true/false
flags indicating whether the customer has premium status, is active, and has
opted into email communication, respectively.
- While using flags can be efficient for storage and processing, they might not
provide clear meaning when viewed independently.
- To supplement these flags with full text words that have independent
meaning, you might add descriptive attributes like "Premium Status," "Active
Status," and "Email Opt-in Status."

The expanded dimension table might look like this:

| Customer ID | Name       | Premium Flag | Active Flag | Email Opt-in Flag | Premium Status | Active Status | Email Opt-in Status |
|-------------|------------|--------------|-------------|-------------------|----------------|---------------|---------------------|
| 001         | John Doe   | true         | true        | false             | Premium        | Active        | Not Opted-In        |
| 002         | Jane Smith | false        | true        | true              | Regular        | Active        | Opted-In            |
| 003         | Alice Lee  | true         | false       | true              | Premium        | Inactive      | Opted-In            |
| ...         | ...        | ...          | ...         | ...               | ...            | ...           | ...                 |

With these descriptive attributes, the meaning of the flags becomes clearer
when viewed independently.

**Example 2: Operational Indicators**

Consider a dimension table for product data where operational indicators are
used to represent various attributes:

| Product ID | Name      | Status Code | Category Code | Type Code |
|------------|-----------|-------------|---------------|-----------|
| 001        | Product A | 1           | 101           | 201       |
| 002        | Product B | 0           | 102           | 202       |
| 003        | Product C | 1           | 101           | 203       |
| ...        | ...       | ...         | ...           | ...       |

In this example:
- The "Status Code," "Category Code," and "Type Code" are operational
indicators representing the status, category, and type of each product,
respectively.
- While these codes may have embedded meanings within their values, they
might not be immediately clear when viewed independently.
- To break down these operational indicators into separate descriptive
attributes, you might add attributes like "Product Status," "Product Category,"
and "Product Type."

The expanded dimension table might look like this:

| Product ID | Name      | Status Code | Category Code | Type Code | Product Status | Product Category | Product Type |
|------------|-----------|-------------|---------------|-----------|----------------|------------------|--------------|
| 001        | Product A | 1           | 101           | 201       | Active         | Category A       | Type X       |
| 002        | Product B | 0           | 102           | 202       | Inactive       | Category B       | Type Y       |
| 003        | Product C | 1           | 101           | 203       | Active         | Category A       | Type Z       |
| ...        | ...       | ...         | ...           | ...       | ...            | ...              | ...          |

With these expanded descriptive attributes, the operational indicators are
broken down into understandable terms, making it easier to interpret the data
independently.
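One way to derive such descriptive attributes at query time is a CASE expression over the codes; the code-to-label mappings below are the hypothetical ones from the tables above, not a real reference list:

```python
import sqlite3

# Decode cryptic operational codes into descriptive attributes with CASE.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product
    (product_id TEXT, name TEXT, status_code INTEGER, category_code INTEGER)""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?)", [
    ("001", "Product A", 1, 101),
    ("002", "Product B", 0, 102),
])

rows = conn.execute("""
SELECT product_id,
       CASE status_code WHEN 1 THEN 'Active' ELSE 'Inactive'
           END AS product_status,
       CASE category_code WHEN 101 THEN 'Category A'
                          WHEN 102 THEN 'Category B'
           END AS product_category
FROM product
ORDER BY product_id
""").fetchall()
print(rows)  # [('001', 'Active', 'Category A'), ('002', 'Inactive', 'Category B')]
```

In practice the decoded labels would be stored in the dimension table during ETL rather than recomputed in every query.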

- Let's provide an example of null-valued dimension attributes with tabular
data and show how substituting descriptive strings can enhance clarity and
consistency.

**Example: Product Dimension Table**

Consider a product dimension table where some attributes may not be
applicable to all products, resulting in null values:

| Product ID | Name       | Category    | Color  | Size  | Weight  |
|------------|------------|-------------|--------|-------|---------|
| 001        | Laptop     | Electronics | Silver | 13"   | 2.5 lbs |
| 002        | T-Shirt    | Apparel     | Blue   | Large |         |
| 003        | Headphones | Electronics | Black  |       | 0.5 lbs |
| 004        | Sunglasses | Accessories |        |       | 0.2 lbs |
| ...        | ...        | ...         | ...    | ...   | ...     |

In this example:
- The "Size" and "Weight" attributes may not be applicable to all products. For
example, Sunglasses may not have a Size attribute, and T-Shirt may not have
a Weight attribute.
- As a result, null values appear in the corresponding cells, indicating missing
or non-applicable information.
- However, null values can lead to inconsistencies in querying and reporting
across different database systems, as they may be treated differently in
groupings or constraints.
To address this issue, we can substitute descriptive strings, such as
"Unknown" or "Not Applicable," in place of null values:

| Product ID | Name       | Category    | Color          | Size           | Weight         |
|------------|------------|-------------|----------------|----------------|----------------|
| 001        | Laptop     | Electronics | Silver         | 13"            | 2.5 lbs        |
| 002        | T-Shirt    | Apparel     | Blue           | Large          | Not Applicable |
| 003        | Headphones | Electronics | Black          | Unknown        | 0.5 lbs        |
| 004        | Sunglasses | Accessories | Not Applicable | Not Applicable | 0.2 lbs        |
| ...        | ...        | ...         | ...            | ...            | ...            |

By substituting descriptive strings for null values:


- We provide clarity and consistency in the data, making it easier to
understand the meaning of each attribute.
- We avoid potential inconsistencies in querying and reporting caused by
different treatments of null values in different database systems.
- Users can more easily interpret the data, especially when certain attributes
are not applicable to all dimension rows.
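In SQL, this substitution is commonly done with COALESCE, either at load time or in a view; a minimal sketch with hypothetical product rows:

```python
import sqlite3

# Replace NULLs with explicit descriptive strings using COALESCE.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product
    (product_id TEXT, name TEXT, size TEXT, weight TEXT)""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?)", [
    ("002", "T-Shirt", "Large", None),       # weight not applicable
    ("004", "Sunglasses", None, "0.2 lbs"),  # size not applicable
])

rows = conn.execute("""
SELECT product_id,
       COALESCE(size,   'Not Applicable') AS size,
       COALESCE(weight, 'Not Applicable') AS weight
FROM product
ORDER BY product_id
""").fetchall()
print(rows)  # [('002', 'Large', 'Not Applicable'), ('004', 'Not Applicable', '0.2 lbs')]
```

Applying the substitution once during ETL keeps every downstream query and BI tool consistent, regardless of how each engine groups or compares NULLs.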

In data warehousing, calendar date dimensions enhance navigation through
dates, months, and fiscal periods. Using meaningful keys like YYYYMMDD
aids in efficient partitioning. Dynamic dates like Easter are looked up in the
calendar dimension. Fact tables include separate date/time stamps for
precision, and optional time-of-day dimensions facilitate data grouping based
on attributes like day parts or shifts.
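A YYYYMMDD key can be derived arithmetically from a calendar date; this tiny helper is an illustrative sketch (dynamic dates like Easter would still be looked up in the dimension table, not computed):

```python
from datetime import date

def date_key(d: date) -> int:
    """Encode a calendar date as a meaningful integer key like 20220115."""
    return d.year * 10000 + d.month * 100 + d.day

print(date_key(date(2022, 1, 15)))  # 20220115
```

Because the key sorts the same way the dates do, range partitioning and pruning on the key behave like partitioning on the date itself.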

Let's combine all of these pieces into a comprehensive example.

**Calendar Date Dimension Table:**

| DateKey  | Date       | WeekNumber | MonthName | FiscalPeriod | NationalHoliday |
|----------|------------|------------|-----------|--------------|-----------------|
| 20220101 | 2022-01-01 | 1          | January   | Q1           | New Year's Day  |
| 20220102 | 2022-01-02 | 1          | January   | Q1           |                 |
| ...      | ...        | ...        | ...       | ...          | ...             |
| 20220131 | 2022-01-31 | 5          | January   | Q1           |                 |
| ...      | ...        | ...        | ...       | ...          | ...             |
| 20220419 | 2022-04-19 | 16         | April     | Q2           |                 |

**Time-of-Day Dimension Table:**

| TimeOfDayKey | DayPart   | ShiftNumber |
|--------------|-----------|-------------|
| 1            | Morning   | 1           |
| 2            | Afternoon | 2           |
| 3            | Evening   | 3           |
| 4            | Night     | 4           |
**Fact Table:**

| FactID | DateKey  | TimeStamp           | TimeOfDayKey | Measure1 | Measure2 |
|--------|----------|---------------------|--------------|----------|----------|
| 1      | 20220115 | 2022-01-15 08:30:00 | 1            | 100      | 50       |
| 2      | 20220115 | 2022-01-15 12:45:00 | 2            | 150      | 75       |
| 3      | 20220115 | 2022-01-15 18:00:00 | 3            | 200      | 100      |
| 4      | 20220115 | 2022-01-15 21:30:00 | 4            | 250      | 125      |
| ...    | ...      | ...                 | ...          | ...      | ...      |

In this example:

- We have a **Calendar Date Dimension Table** containing attributes such as
DateKey, Date, WeekNumber, MonthName, FiscalPeriod, and
NationalHoliday.
- There's a **Time-of-Day Dimension Table** with attributes TimeOfDayKey,
DayPart, and ShiftNumber.
- The **Fact Table** contains data with DateKey referencing the DateKey from
the Calendar Date Dimension Table and TimeOfDayKey referencing the
TimeOfDayKey from the Time-of-Day Dimension Table. It also includes
measures such as Measure1 and Measure2 for analysis.

Now, let's assume that the DayPart grouping is defined as follows:

- Morning: 06:00:00 to 11:59:59
- Afternoon: 12:00:00 to 17:59:59
- Evening: 18:00:00 to 23:59:59
- Night: 00:00:00 to 05:59:59

We can create a query to calculate the day part grouping for each record in
the fact table:

```sql
SELECT
    FactID,
    DateKey,
    TimeStamp,
    TimeOfDayKey,
    CASE
        WHEN CAST(TimeStamp AS TIME) >= '06:00:00'
             AND CAST(TimeStamp AS TIME) < '12:00:00' THEN 'Morning'
        WHEN CAST(TimeStamp AS TIME) >= '12:00:00'
             AND CAST(TimeStamp AS TIME) < '18:00:00' THEN 'Afternoon'
        WHEN CAST(TimeStamp AS TIME) >= '18:00:00' THEN 'Evening'
        ELSE 'Night'  -- 00:00:00 to 05:59:59
    END AS DayPart,
    Measure1,
    Measure2
FROM
    FactTable;
```

A single physical dimension can be referenced multiple times in a fact table,
with each reference linking to a logically distinct role for the dimension.

**Date Dimension Table:**

| DateKey  | Date       | WeekNumber | MonthName | DayOfWeek | Quarter | Year |
|----------|------------|------------|-----------|-----------|---------|------|
| 20220101 | 2022-01-01 | 1          | January   | Saturday  | Q1      | 2022 |
| 20220102 | 2022-01-02 | 1          | January   | Sunday    | Q1      | 2022 |
| ...      | ...        | ...        | ...       | ...       | ...     | ...  |
| 20220131 | 2022-01-31 | 5          | January   | Monday    | Q1      | 2022 |
| ...      | ...        | ...        | ...       | ...       | ...     | ...  |
| 20220419 | 2022-04-19 | 16         | April     | Tuesday   | Q2      | 2022 |

**Fact Table:**

| FactID | OrderDateKey | ShipDateKey | DeliveryDateKey | CustomerID | ProductID | Quantity | Amount |
|--------|--------------|-------------|-----------------|------------|-----------|----------|--------|
| 1      | 20220115     | 20220118    | 20220120        | 1001       | 001       | 2        | 50     |
| 2      | 20220120     | 20220122    | 20220125        | 1002       | 002       | 1        | 30     |
| ...    | ...          | ...         | ...             | ...        | ...       | ...      | ...    |

In this example:

- The **Date Dimension Table** contains various attributes such as DateKey,
Date, WeekNumber, MonthName, DayOfWeek, Quarter, and Year.
- The **Fact Table** represents sales orders and includes multiple references
to the Date Dimension through separate roles:
- `OrderDateKey` references the Date Dimension for the order date.
- `ShipDateKey` references the Date Dimension for the shipment date.
- `DeliveryDateKey` references the Date Dimension for the delivery date.
- Each reference to the Date Dimension provides different perspectives on the
date, such as the order date, shipment date, and delivery date, enabling
analysis based on various timeframes within the sales process.
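Role-playing is implemented by joining the same physical date dimension several times under different aliases; a minimal sketch using the sample keys above:

```python
import sqlite3

# One physical date dimension referenced three times by the fact table,
# each join playing a different role (order, ship, delivery).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, date TEXT);
INSERT INTO date_dim VALUES (20220115, '2022-01-15'),
                            (20220118, '2022-01-18'),
                            (20220120, '2022-01-20');
CREATE TABLE sales_fact (fact_id INTEGER, order_date_key INTEGER,
                         ship_date_key INTEGER, delivery_date_key INTEGER);
INSERT INTO sales_fact VALUES (1, 20220115, 20220118, 20220120);
""")

row = conn.execute("""
SELECT f.fact_id,
       od.date AS order_date,
       sd.date AS ship_date,
       dd.date AS delivery_date
FROM sales_fact f
JOIN date_dim od ON od.date_key = f.order_date_key
JOIN date_dim sd ON sd.date_key = f.ship_date_key
JOIN date_dim dd ON dd.date_key = f.delivery_date_key
""").fetchone()
print(row)  # (1, '2022-01-15', '2022-01-18', '2022-01-20')
```

Many modeling tools expose each alias as if it were a separate "Order Date" or "Ship Date" dimension, even though only one table exists physically.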
A transaction profile dimension, also known as a junk dimension, consolidates
miscellaneous, low-cardinality flags and indicators from transactional business
processes. Instead of creating separate dimensions for each flag or attribute,
a single junk dimension combines them. This approach reduces the
complexity of the schema. The junk dimension doesn't need to encompass
every possible combination of attribute values, but rather only those that occur
in the source data.

Let's say we have transactional data from an e-commerce platform. Here's a
simplified version of the data:

| Transaction ID | Customer ID | Product ID | Payment Method | Shipping Method | Coupon Used |
|----------------|-------------|------------|----------------|-----------------|-------------|
| 1              | 1001        | 500        | Credit Card    | Standard        | Yes         |
| 2              | 1002        | 501        | PayPal         | Express         | No          |
| 3              | 1003        | 502        | Credit Card    | Standard        | Yes         |
| 4              | 1004        | 503        | PayPal         | Express         | No          |

Now, instead of creating separate dimensions for Payment Method, Shipping
Method, and Coupon Used, we can create a junk dimension combining them.

Here's how the junk dimension might look:

| Payment Method | Shipping Method | Coupon Used |
|----------------|-----------------|-------------|
| Credit Card    | Standard        | Yes         |
| Credit Card    | Standard        | No          |
| Credit Card    | Express         | Yes         |
| PayPal         | Standard        | Yes         |
| PayPal         | Standard        | No          |
| PayPal         | Express         | Yes         |
| PayPal         | Express         | No          |

In this example, the junk dimension enumerates combinations of Payment
Method, Shipping Method, and Coupon Used. Strictly, only the combinations
that actually occur in the source data are required — here, (Credit Card,
Standard, Yes) and (PayPal, Express, No) — though teams sometimes
pre-populate likely combinations as shown above. Either way, this approach
reduces the complexity of the schema while still retaining the relevant
information for analysis and reporting.
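Populating a junk dimension from only the observed combinations can be sketched as a SELECT DISTINCT into a keyed table; names such as `txn` and `txn_profile_dim` are assumptions for illustration:

```python
import sqlite3

# Build a junk (transaction profile) dimension from observed combinations.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE txn (txn_id INTEGER, payment_method TEXT,
                                  shipping_method TEXT, coupon_used TEXT)""")
conn.executemany("INSERT INTO txn VALUES (?, ?, ?, ?)", [
    (1, "Credit Card", "Standard", "Yes"),
    (2, "PayPal", "Express", "No"),
    (3, "Credit Card", "Standard", "Yes"),
    (4, "PayPal", "Express", "No"),
])

# The surrogate key is auto-assigned; only combinations present in the
# source data end up in the dimension.
conn.execute("""CREATE TABLE txn_profile_dim (
    profile_key INTEGER PRIMARY KEY,
    payment_method TEXT, shipping_method TEXT, coupon_used TEXT)""")
conn.execute("""
INSERT INTO txn_profile_dim (payment_method, shipping_method, coupon_used)
SELECT DISTINCT payment_method, shipping_method, coupon_used
FROM txn
ORDER BY payment_method, shipping_method, coupon_used
""")

rows = conn.execute(
    "SELECT * FROM txn_profile_dim ORDER BY profile_key").fetchall()
print(rows)
```

The fact table would then carry a single `profile_key` foreign key instead of three separate flag columns.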

Hierarchical relationships in dimension tables are normalized by moving low-
cardinality attributes to secondary tables, forming a snowflake structure.
However, snowflakes are complex, hard to navigate, and can degrade query
performance. Denormalized, flattened dimensions offer the same data without
the complexities of snowflaking.

Let's consider a dimension table for "Employee" and its hierarchical attributes
"Department" and "Manager".

### Snowflake Structure:

**Employee Table:**
| Employee ID | Employee Name | Department ID |
|-------------|---------------|---------------|
| 1           | John          | 101           |
| 2           | Alice         | 102           |

**Department Table:**
| Department ID | Department Name | Manager ID |
|---------------|-----------------|------------|
| 101 | Sales | 201 |
| 102 | Marketing | 202 |

**Manager Table:**
| Manager ID | Manager Name |
|------------|--------------|
| 201 | Mark |
| 202 | Sarah |

In this snowflake structure, the "Employee" table has a foreign key reference
to the "Department" table, and the "Department" table has a foreign key
reference to the "Manager" table. Each table represents a level of the hierarchy.

### Flattened Denormalized Dimension Table:

In a denormalized structure, all hierarchical attributes are combined into a
single table, eliminating the need for joins.

**Employee Table (Denormalized):**

| Employee ID | Employee Name | Department Name | Manager Name |
|-------------|---------------|-----------------|--------------|
| 1           | John          | Sales           | Mark         |
| 2           | Alice         | Marketing       | Sarah        |

In this denormalized structure, all hierarchical attributes like "Department
Name" and "Manager Name" are present in a single table, simplifying queries
and making it easier for business users to understand and navigate the data.

Outrigger dimensions, secondary references in primary dimension tables, add
extra detail like account opening dates. While permissible, they're best used
sparingly. Instead, correlation between dimensions is often managed in fact
tables using separate foreign keys, promoting simpler schema designs.

Let's illustrate this with a simple example involving a bank account dimension
and a separate dimension for account opening dates.

### Bank Account Dimension:

The primary dimension table representing bank accounts:

| Account ID | Account Number | Account Type | Date Opened ID |
|------------|----------------|--------------|----------------|
| 1          | 123456         | Savings      | 101            |
| 2          | 789012         | Checking     | 102            |

### Date Dimension:

A separate dimension table for dates:

| Date ID | Date |
|---------|------------|
| 101 | 2023-01-15 |
| 102 | 2023-04-20 |

In this scenario:

- The "Bank Account Dimension" contains a foreign key reference ("Date
Opened ID") to the "Date Dimension", indicating when each account was
opened.
- "Date Opened ID" in the "Bank Account Dimension" acts as an outrigger
dimension, providing additional detail about the account.

Conformed dimensions ensure consistency across dimension tables by
sharing attributes with the same column names and content domains. They
facilitate combining information from multiple fact tables in reports, aligning
results seamlessly. This integration, essential in enterprise DW/BI systems,
promotes analytic consistency, reduces development costs, and ensures
alignment with business data governance standards.

**Dimension Table 1: ProductDim**

| Product_ID | Product_Name | Category    | Subcategory |
|------------|--------------|-------------|-------------|
| 1          | Product_A    | Electronics | Smartphones |
| 2          | Product_B    | Electronics | Laptops     |
| 3          | Product_C    | Clothing    | T-Shirts    |
| 4          | Product_D    | Home        | Furniture   |

**Dimension Table 2: DateTime**

| Date       | Day_Name  | Month | Quarter |
|------------|-----------|-------|---------|
| 2024-04-01 | Monday    | April | Q2      |
| 2024-04-02 | Tuesday   | April | Q2      |
| 2024-04-03 | Wednesday | April | Q2      |
| 2024-04-04 | Thursday  | April | Q2      |

**Fact Table 1: Sales**

| Sale_ID | Product_ID | Date       | Sale_Amount |
|---------|------------|------------|-------------|
| 1       | 1          | 2024-04-01 | 5000        |
| 2       | 2          | 2024-04-01 | 7000        |
| 3       | 3          | 2024-04-01 | 6000        |
| 4       | 4          | 2024-04-01 | 4500        |

**Fact Table 2: Inventory**

| Inventory_ID | Product_ID | Date       | Quantity |
|--------------|------------|------------|----------|
| 1            | 1          | 2024-04-01 | 100      |
| 2            | 2          | 2024-04-01 | 150      |
| 3            | 3          | 2024-04-01 | 120      |
| 4            | 4          | 2024-04-01 | 90       |

Now, let's explain the paragraph using this example:

1. The `ProductDim` and `DateTime` dimension tables are conformed
dimensions: every fact table that references them uses the same column
names (`Product_ID`, `Date`) and draws on the same content domains
(`Day_Name`, `Month`, `Quarter` for `DateTime`; `Product_Name`,
`Category`, `Subcategory` for `ProductDim`).

2. We can combine information from the `Sales` and `Inventory` fact tables
into a single report by using the conformed dimension attributes `Date` and
`Product_ID`, which are associated with each fact table.

3. By using `Date` as the row header, we can align sales and inventory data
on the same rows, allowing for a drill-across analysis where we can compare
sales and inventory levels for different products over time.

4. The use of conformed dimensions like `DateTime` and `ProductDim`
ensures integration within a data warehouse/business intelligence system. It
promotes analytic consistency and reduces development costs by allowing for
the reuse of dimension attributes across multiple fact tables.

**Combined Report: Sales and Inventory by Date**

| Date       | Product_ID | Product_Name | Sale_Amount | Inventory_Quantity |
|------------|------------|--------------|-------------|--------------------|
| 2024-04-01 | 1          | Product_A    | 5000        | 100                |
| 2024-04-01 | 2          | Product_B    | 7000        | 150                |
| 2024-04-01 | 3          | Product_C    | 6000        | 120                |
| 2024-04-01 | 4          | Product_D    | 4500        | 90                 |

Shrunken dimensions, subsets of base dimensions, are crucial for aggregate
fact table construction and capturing data at higher granularity levels, like
monthly forecasts by brand. They are necessary when dimensions share the
same level of detail but represent different subsets of data, enabling efficient
data analysis and reporting.

"Roll up" refers to the process of summarizing or aggregating data from a
lower level of granularity to a higher level. For example, if you have sales data
recorded at the daily level, rolling up the data to the monthly level involves
summing up the sales figures for each month.

| Base Dimension: Customer ID | Shrunken Dimension: Customer Segment |
|-----------------------------|--------------------------------------|
| 001                         | Corporate                            |
| 002                         | Individual                           |
| 003                         | Small Business                       |
| ...                         | ...                                  |

In this example, the base dimension is the full set of customer IDs, while the
shrunken dimension is limited to just the customer segments. Each row in the
shrunken dimension represents a subset of rows from the base dimension.

**Sales Fact Table (Base):**

| Date       | Customer ID | Sales Amount |
|------------|-------------|--------------|
| 2024-01-01 | 001         | $1000        |
| 2024-01-01 | 002         | $2000        |
| 2024-02-01 | 003         | $1500        |
| ...        | ...         | ...          |

**Aggregate Fact Table (Shrunken):**

| Month    | Customer Segment | Total Sales Amount |
|----------|------------------|--------------------|
| January  | Corporate        | $3000              |
| February | Individual       | $5000              |
| March    | Small Business   | $2000              |
| ...      | ...              | ...                |

In this example, the aggregate fact table summarizes sales data from the
sales fact table. It aggregates sales by month and customer segment,
effectively shrunken versions of the base dimensions (Date and Customer ID).
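Building the aggregate fact table is a roll-up query: join the base facts to the customer dimension and GROUP BY the shrunken attributes. A sketch with the sample rows above (segment assignments are the hypothetical ones from the shrunken-dimension table):

```python
import sqlite3

# Roll daily, per-customer sales up to month and customer segment,
# i.e. an aggregate fact table keyed by shrunken dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_id TEXT, customer_segment TEXT);
INSERT INTO customer_dim VALUES ('001', 'Corporate'), ('002', 'Individual'),
                                ('003', 'Small Business');
CREATE TABLE sales_fact (date TEXT, customer_id TEXT, sales_amount INTEGER);
INSERT INTO sales_fact VALUES ('2024-01-01', '001', 1000),
                              ('2024-01-01', '002', 2000),
                              ('2024-02-01', '003', 1500);
""")

rows = conn.execute("""
SELECT substr(f.date, 1, 7) AS month,       -- shrink Date to Month
       c.customer_segment,                  -- shrink Customer ID to Segment
       SUM(f.sales_amount)  AS total_sales
FROM sales_fact f
JOIN customer_dim c ON c.customer_id = f.customer_id
GROUP BY month, c.customer_segment
ORDER BY month, c.customer_segment
""").fetchall()
print(rows)
# [('2024-01', 'Corporate', 1000), ('2024-01', 'Individual', 2000),
#  ('2024-02', 'Small Business', 1500)]
```

The GROUP BY columns are exactly the shrunken-dimension attributes, so the result rows are at the coarser grain.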

**Forecast by Date (Base):**

| Date       | Customer ID | Forecast Amount |
|------------|-------------|-----------------|
| 2024-01-01 | 001         | $1200           |
| 2024-01-01 | 002         | $2500           |
| 2024-02-01 | 003         | $1800           |
| ...        | ...         | ...             |

**Forecast by Month and Customer Segment (Shrunken):**

| Month    | Customer Segment | Total Forecast Amount |
|----------|------------------|-----------------------|
| January  | Corporate        | $3500                 |
| February | Individual       | $6000                 |
| March    | Small Business   | $2300                 |
| ...      | ...              | ...                   |
fi
In this table, the forecast data is captured at a higher level of granularity (by
month and customer segment) compared to the base data, which is
forecasted by date and customer ID.

| Gender (Base) | Male (Shrunken) |
|---------------|-----------------|
| Male          | Male            |
| Female        |                 |
| ...           |                 |

In this final example, both the Gender and Male dimensions are at the same
level of detail, but the Male dimension represents only a subset of rows from
the Gender dimension, specifically just the male gender.

Drilling across involves querying multiple fact tables, each with identical
dimension attributes, and aligning the results through a sort-merge operation
based on common dimensions.

**Sales Fact Table:**

| Date       | Product ID | Sales Amount |
|------------|------------|--------------|
| 2024-01-01 | 001        | $1000        |
| 2024-01-01 | 002        | $2000        |
| 2024-02-01 | 003        | $1500        |
| ...        | ...        | ...          |

**Inventory Fact Table:**

| Date       | Product ID | Quantity |
|------------|------------|----------|
| 2024-01-01 | 001        | 50       |
| 2024-01-01 | 002        | 30       |
| 2024-02-01 | 003        | 20       |
| ...        | ...        | ...      |

Both fact tables have Date and Product ID as conformed attributes.

We perform a sort-merge operation on the Date attribute.

**Merged Result:**

| Date       | Sales Product ID | Sales Amount | Inventory Product ID | Quantity |
|------------|------------------|--------------|----------------------|----------|
| 2024-01-01 | 001              | $1000        | 001                  | 50       |
| 2024-01-01 | 002              | $2000        | 002                  | 30       |
| 2024-02-01 | 003              | $1500        | 003                  | 20       |
| ...        | ...              | ...          | ...                  | ...      |
In this merged result, the sales and inventory data are aligned based on the
common dimension attribute, Date. Each row represents information about
sales and inventory for the same date and product ID.
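The drill-across can be sketched as two independent aggregations merged on the conformed attributes; an equi-join stands in for the sort-merge step here, assuming both result sets cover the same dates and products:

```python
import sqlite3

# Drill across: query each fact table separately, then merge the result
# sets on the conformed dimension attributes (date, product_id).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (date TEXT, product_id TEXT, sales_amount INTEGER);
CREATE TABLE inventory_fact (date TEXT, product_id TEXT, quantity INTEGER);
INSERT INTO sales_fact VALUES ('2024-01-01', '001', 1000),
                              ('2024-01-01', '002', 2000);
INSERT INTO inventory_fact VALUES ('2024-01-01', '001', 50),
                                  ('2024-01-01', '002', 30);
""")

rows = conn.execute("""
SELECT s.date, s.product_id, s.sales, i.quantity
FROM (SELECT date, product_id, SUM(sales_amount) AS sales
      FROM sales_fact GROUP BY date, product_id) s
JOIN (SELECT date, product_id, SUM(quantity) AS quantity
      FROM inventory_fact GROUP BY date, product_id) i
  ON i.date = s.date AND i.product_id = s.product_id
ORDER BY s.date, s.product_id
""").fetchall()
print(rows)  # [('2024-01-01', '001', 1000, 50), ('2024-01-01', '002', 2000, 30)]
```

Aggregating each fact table to the same grain before merging is the key step: it prevents the fan-out that would occur if the two fact tables were joined directly row by row.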

A value chain identifies the natural flow of an organization's primary business
processes. Because each process produces unique metrics at unique time
intervals with unique granularity and dimensionality, each process typically
spawns at least one atomic fact table.

### Example: Retailer's Value Chain

### Purchasing Process - Purchase Orders

| Order ID | Supplier   | Date       | Product ID | Quantity | Unit Price |
|----------|------------|------------|------------|----------|------------|
| PO001    | Supplier A | 2024-04-01 | PROD001    | 100      | $10        |
| PO002    | Supplier B | 2024-04-03 | PROD002    | 50       | $20        |
| PO003    | Supplier C | 2024-04-05 | PROD003    | 200      | $15        |

### Warehousing Process - Inventory Snapshots

| Snapshot ID | Warehouse  | Date       | Product ID | Quantity On Hand | Location  |
|-------------|------------|------------|------------|------------------|-----------|
| IS001       | Warehouse1 | 2024-04-01 | PROD001    | 120              | Aisle 1   |
| IS002       | Warehouse2 | 2024-04-03 | PROD002    | 70               | Shelf 2   |
| IS003       | Warehouse3 | 2024-04-05 | PROD003    | 210              | Section C |

### Sales Process - Sales Transactions

| Transaction ID | Store  | Date       | Product ID | Quantity Sold | Revenue |
|----------------|--------|------------|------------|---------------|---------|
| ST001          | Store1 | 2024-04-01 | PROD001    | 80            | $800    |
| ST002          | Store2 | 2024-04-03 | PROD002    | 40            | $800    |
| ST003          | Store3 | 2024-04-05 | PROD003    | 180           | $2700   |

In this example:

- Each process (Purchasing, Warehousing, Sales) in the retailer's value chain
generates unique data with different attributes (Order ID, Supplier, Date,
Product ID, Quantity, Unit Price for Purchasing; Snapshot ID, Warehouse,
Date, Product ID, Quantity On Hand, Location for Warehousing; Transaction
ID, Store, Date, Product ID, Quantity Sold, Revenue for Sales).
- Each process operates at different time intervals (e.g., daily for Purchasing,
hourly for Sales).
- The granularity varies between processes (e.g., order level for Purchasing,
product level for Warehousing and Sales).
- Dimensions associated with each process differ (e.g., Supplier, Warehouse,
Store).
- To analyze these metrics effectively, each process typically spawns at least
one atomic fact table to store the unique data generated at each step of the
value chain.

Enterprise DW/BI bus architecture offers an incremental approach, focusing
on business processes and standardized dimensions for integration. It
encourages agile implementations aligned with the enterprise data warehouse
bus matrix, independent of technology and database platforms,
accommodating both relational and OLAP dimensional structures.

Instead of trying to build the entire DW/BI system at once, a company might
start by creating a data mart for one department, such as sales, and gradually
expand it to cover other departments like marketing or finance.

The planning process in the context of data warehousing and business
intelligence (DW/BI) typically refers to the initial phase of designing and
implementing a DW/BI system.

1. **Business Analysis**: Understand business needs, define requirements,
and establish key performance indicators for the DW/BI system.
2. **Data Assessment**: Evaluate existing data sources, quality, and
integration requirements for the data warehouse.
3. **Technology Evaluation**: Assess DW/BI tools, platforms, and
infrastructure to ensure compatibility and effectiveness.
4. **Architectural Design**: Plan data modeling, ETL processes, storage, and
reporting components (to generate insights) to meet business objectives
efficiently.
5. **Project Planning**: Develop a detailed timeline, allocate resources, and
budget while breaking down the project into manageable tasks.

Rather than trying to tackle the entire DW/BI planning process at once, the
company might prioritize key business processes, such as order management
or customer service, and design specific data models and reports to support
these processes.

Here is an Enterprise Data Warehouse (EDW) bus matrix with some hypothetical
subject areas and corresponding data elements:

| Subject Area | Dimension    | Fact Table(s)  | Example Data Elements                                           |
|--------------|--------------|----------------|-----------------------------------------------------------------|
| Sales        | Product      | Sales Fact     | Product ID, Product Name, Product Category, Unit Price          |
|              | Time         |                | Order Date, Ship Date, Month, Quarter, Year                     |
|              | Customer     |                | Customer ID, Customer Name, Customer Segment, Customer Location |
| Marketing    | Campaign     | Marketing Fact | Campaign ID, Campaign Name, Campaign Type, Campaign Start Date  |
|              | Channel      |                | Channel ID, Channel Name, Channel Type, Channel Description     |
|              | Demographics |                | Age Group, Gender, Income Level, Geographic Location            |
| Finance      | Financial    | Financial Fact | Transaction ID, Transaction Date, Amount, Transaction Type      |
|              | Account      |                | Account ID, Account Type, Account Holder, Account Balance       |
|              | Currency     |                | Currency Code, Exchange Rate                                    |

Imagine a company, XYZ Corp, is implementing an enterprise data
warehouse (EDW) using the bus architecture. The EDW bus matrix identifies
several key subject areas, including Sales, Marketing, and Finance. Each of
these subject areas represents a row in the matrix.

Now, let's focus on the Sales subject area. The company decides to adopt an
agile approach to develop the sales-related components of the data
warehouse. They decompose the Sales row on the bus matrix into smaller,
manageable pieces or user stories, such as:

1. **Sales Orders Reporting**: Build reports on sales by region, product
category, and customer segment.
2. **Customer Relationship Management (CRM) Integration**: Integrate data
from the CRM system to provide insights into customer interactions and sales
pipeline.
3. **Sales Performance Dashboards**: Create dashboards to monitor sales
team performance, track sales targets, and identify opportunities for
improvement.
4. **Product Sales Analysis**: Analyze product sales data to identify top-
selling products, seasonality trends, and cross-selling opportunities.

Each of these user stories corresponds to a specific aspect of the Sales
subject area and aligns with the company's business objectives. The agile
development team can then prioritize and tackle these user stories iteratively,
delivering incremental value to the organization with each sprint or
development cycle.

The enterprise data warehouse bus matrix organizes business processes as
rows and dimensions as columns, indicating their association. Design teams
use it to validate dimension relevance to processes and ensure conformity
across them. Additionally, it aids in prioritizing DW/BI projects, implementing
one matrix row at a time.
fi
fi
| Business Process | Customer | Product | Time   | Location | Sales  | Marketing | Finance |
|------------------|----------|---------|--------|----------|--------|-----------|---------|
| Sales            | Shaded   | Shaded  | Shaded | Shaded   | Shaded |           |         |
| Marketing        | Shaded   |         | Shaded |          |        | Shaded    |         |
| Finance          | Shaded   |         | Shaded |          |        |           | Shaded  |
| Inventory        |          | Shaded  |        | Shaded   |        |           |         |

The detailed implementation bus matrix is a more granular bus matrix where
each business process row has been expanded to show specific fact tables or
OLAP cubes. At this level of detail, the precise grain statement and list of facts
can be documented.

Here's a tabular example of a detailed implementation bus matrix:

| Business Process | Fact Table / OLAP Cube | Grain Statement                               | List of Facts                       |
|------------------|------------------------|-----------------------------------------------|-------------------------------------|
| Sales            | Sales Fact             | Each row represents a sales transaction.      | Quantity Sold, Sales Amount         |
|                  | Returns Fact           | Each row represents a returned item.          | Quantity Returned, Return Amount    |
| Marketing        | Campaigns Fact         | Each row represents a marketing campaign.     | Campaign ID, Campaign Cost          |
|                  | Leads Fact             | Each row represents a generated lead.         | Lead ID, Lead Source, Lead Status   |
| Finance          | Transactions Fact      | Each row represents a financial transaction.  | Transaction ID, Transaction Amount  |
|                  | Budgets Fact           | Each row represents a budget allocation.      | Budget ID, Budget Amount            |
| Inventory        | Inventory Fact         | Each row represents an inventory transaction. | Item ID, Quantity, Transaction Type |

"lead" typically refers to a potential customer or prospect who has shown


interest in a company's products or services but has not yet made a purchase
or taken any signi cant action.
The opportunity/stakeholder matrix replaces dimension columns with business
functions in the EDW bus matrix. Shaded cells indicate function interest in
each process, guiding collaborative sessions for tailored design and
development.
Let's adapt the example with business functions like Marketing,
Sales, and Finance:

Original matrix:

|          | Customer | Product | Time | Location |
|----------|----------|---------|------|----------|
| Marketing| X        | X       | X    | X        |
| Sales    | X        | X       | X    | X        |
| Finance  | X        | X       | X    | X        |

Opportunity/stakeholder matrix:

|          | Customer Support | Inventory Management | Sales Forecasting | Financial Reporting |
|----------|------------------|----------------------|-------------------|---------------------|
| Marketing| X                |                      | X                 |                     |
| Sales    | X                | X                    | X                 | X                   |
| Finance  |                  | X                    |                   | X                   |

In this adapted example:

• For Process Customer Support, we'd involve both Marketing and
Sales.
• For Process Inventory Management, we'd involve Sales and Finance.
• For Process Sales Forecasting, we'd involve Marketing and Sales
again.
• Process Financial Reporting is mainly relevant to Sales and
Finance.

This matrix helps identify the stakeholders for each business process,
facilitating collaboration and ensuring that the data warehouse meets the
needs of each department.
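The shaded cells above can be captured as a simple lookup structure in code. Here's a minimal Python sketch (the process and function names come from the example matrix; `who_to_involve` is a hypothetical helper, not part of any toolkit):

```python
# Opportunity/stakeholder matrix: each business process maps to the
# functions whose cells are shaded (the stakeholders to involve).
stakeholders = {
    "Customer Support":     ["Marketing", "Sales"],
    "Inventory Management": ["Sales", "Finance"],
    "Sales Forecasting":    ["Marketing", "Sales"],
    "Financial Reporting":  ["Sales", "Finance"],
}

def who_to_involve(process):
    """Return the business functions interested in a given process."""
    return stakeholders.get(process, [])

print(who_to_involve("Inventory Management"))  # ['Sales', 'Finance']
```

A structure like this makes it easy to drive collaborative design sessions from the matrix itself rather than re-deriving the stakeholder list each time.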

It's normal for a dimension table to have attributes (like customer address or
product price) that are managed or updated in different ways over time.

With type 0, the dimension attribute value never changes, so facts are always
grouped by this original value. Type 0 is appropriate for any attribute labeled
“original,” such as a customer’s original credit score or a durable identifier. It
also applies to most attributes in a date dimension.

Sales Fact Table:

| Transaction ID | Customer ID | Product ID | Quantity | Sales Amount | Transaction Date |
|----------------|-------------|------------|----------|--------------|------------------|
| 1              | 101         | 001        | 2        | $100         | 2023-01-15       |
| 2              | 102         | 002        | 1        | $50          | 2023-02-20       |
| 3              | 103         | 003        | 3        | $200         | 2023-03-10       |

Customer Dimension Table:

| Customer ID | Name    | Original Credit Score | Gender | Age |
|-------------|---------|-----------------------|--------|-----|
| 101         | Alice   | 750                   | Female | 35  |
| 102         | Bob     | 700                   | Male   | 42  |
| 103         | Charlie | 800                   | Male   | 28  |

For simplicity, let's group customers by their original credit score ranges:
- Original Credit Score 700-749
- Original Credit Score 750-799
- Original Credit Score 800 and above

After joining the tables and grouping the sales transactions by these credit
score ranges, we might get results like this:

| Credit Score Range | Total Sales Amount |
|--------------------|--------------------|
| 700-749            | $50                |
| 750-799            | $100               |
| 800 and above      | $200               |

In this analysis:
- We've ensured consistency in grouping customers based on their original
credit scores, regardless of any subsequent changes.
- This allows us to effectively analyze sales performance over time while
accounting for the customers' initial creditworthiness.

```sql
SELECT
    CASE
        WHEN OriginalCreditScore BETWEEN 700 AND 749 THEN '700-749'
        WHEN OriginalCreditScore BETWEEN 750 AND 799 THEN '750-799'
        WHEN OriginalCreditScore >= 800 THEN '800 and above'
        ELSE 'Unknown' -- Handling for cases outside defined ranges
    END AS CreditScoreRange,
    SUM(SalesAmount) AS TotalSalesAmount
FROM
    SalesFactTable sf
JOIN
    CustomerDimensionTable cd ON sf.CustomerID = cd.CustomerID
GROUP BY
    CASE
        WHEN OriginalCreditScore BETWEEN 700 AND 749 THEN '700-749'
        WHEN OriginalCreditScore BETWEEN 750 AND 799 THEN '750-799'
        WHEN OriginalCreditScore >= 800 THEN '800 and above'
        ELSE 'Unknown'
    END;
```

Here's an example illustrating each line with tabular data:

Consider a simplified customer dimension table:

| Customer ID | Name    | Original Credit Score | Gender | Age |
|-------------|---------|-----------------------|--------|-----|
| 1           | Alice   | 750                   | Female | 35  |
| 2           | Bob     | 700                   | Male   | 42  |
| 3           | Charlie | 800                   | Male   | 28  |

In this example:
- The "Original Credit Score" attribute is a Type 0 dimension. It remains
constant over time and is crucial for analyzing historical financial data. For
instance, if we're analyzing sales performance over time, we want to ensure
consistency in grouping customers based on their original credit score,
regardless of any subsequent changes.
- Attributes like "Name" and "Gender" can often be treated as Type 0 as well,
since they rarely change in a customer dimension context. "Age" is a weaker
Type 0 candidate because it changes every year; storing a birth date and
deriving age is usually safer.

For a date dimension, let's take a look at a simple example:

| Date       | Day      | Month | Quarter | Year |
|------------|----------|-------|---------|------|
| 2024-01-01 | Monday   | Jan   | Q1      | 2024 |
| 2024-02-01 | Thursday | Feb   | Q1      | 2024 |
| 2024-03-01 | Friday   | Mar   | Q1      | 2024 |

In this case:
- Each attribute ("Day", "Month", "Quarter", "Year") in the date dimension table
can be considered Type 0 dimensions. The values for these attributes do not
change over time and remain constant for each corresponding date record.
- For instance, the "Day" attribute stays the same for each date entry,
ensuring that facts (such as sales or transactions) are consistently grouped by
the day of the week, month, quarter, or year.

A durable identifier is a unique identifier assigned to an entity that remains
constant over time, even if other attributes of the entity change. Here's an
example to illustrate this concept:

Let's consider a product catalog for an e-commerce platform:

Product Dimension Table:

| Product ID | Product Name | Category    | Price | Manufacturer | Durable Identifier |
|------------|--------------|-------------|-------|--------------|--------------------|
| 001        | Laptop       | Electronics | $1000 | Dell         | LP-001             |
| 002        | Smartphone   | Electronics | $800  | Apple        | SP-002             |
| 003        | T-shirt      | Clothing    | $20   | Nike         | TS-003             |

In this example:
- Each product in the catalog has a unique Product ID.
- The "Durable Identifier" column contains a durable identifier assigned to
each product. This identifier remains constant for each product, even if other
attributes such as product name or price change over time.
- The durable identifier serves as a reliable reference point for tracking the
product, regardless of any modifications to its attributes.
- For instance, if the price of the Laptop (Product ID: 001) changes in the
future, the durable identifier "LP-001" will still uniquely identify that specific
product.

Durable identifiers are particularly useful in scenarios where entities need to
be referenced consistently across different systems or over long periods,
ensuring data integrity and reliability.
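To make the idea concrete, here is a minimal Python sketch of an attribute update that leaves the durable identifier untouched (illustrative names only, based on the product table above):

```python
# Sketch: the durable identifier stays fixed while mutable attributes change.
product = {
    "product_id": "001",
    "name": "Laptop",
    "price": 1000,
    "durable_id": "LP-001",   # never reassigned
}

def reprice(p, new_price):
    """Update a mutable attribute; the durable identifier is untouched."""
    p["price"] = new_price
    return p

before = product["durable_id"]
reprice(product, 1200)
assert product["durable_id"] == before  # still "LP-001"
print(product["durable_id"], product["price"])  # LP-001 1200
```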

With type 1, the old attribute value in the dimension row is overwritten with the
new value; type 1 attributes always reflect the most recent assignment, and
therefore this technique destroys history. Although this approach is easy to
implement and does not create additional dimension rows, you must be
careful that aggregate fact tables and OLAP cubes affected by this change
are recomputed.
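A type 1 overwrite can be sketched in a few lines of Python (illustrative only; the dictionary stands in for a dimension table):

```python
# Type 1: overwrite in place -- history of the old value is lost.
dimension = {101: {"name": "Alice", "city": "Austin"}}

def scd_type1_update(dim, key, attribute, new_value):
    """Overwrite the attribute; no new row, no history kept."""
    dim[key][attribute] = new_value

scd_type1_update(dimension, 101, "city", "Boston")
print(dimension[101]["city"])  # Boston -- "Austin" is gone
```

Note that any aggregates grouped by the old value would now disagree with this table, which is why affected rollups must be recomputed.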

Type 2 changes add a new row in the dimension with the updated attribute
values. This requires generalizing the primary key of the dimension beyond
the natural or durable key because there will potentially be multiple rows
describing each member. When a new row is created for a dimension
member, a new primary surrogate key is assigned and used as a foreign key
in all fact tables from the moment of the update until a subsequent change
creates a new dimension key and updated dimension row.
A minimum of three additional columns should be added to the dimension row
with type 2 changes: 1) row effective date or date/time stamp; 2) row
expiration date or date/time stamp; and 3) current row indicator.

Customer Dimension Table:

| Customer Key | Customer ID | Name    | Address      | Start Date | End Date   | Current Row |
|--------------|-------------|---------|--------------|------------|------------|-------------|
| 1            | 101         | Alice   | 123 Main St  | 2023-01-01 | 2023-05-15 | No          |
| 2            | 101         | Alice   | 456 Elm St   | 2023-05-16 | NULL       | Yes         |
| 3            | 102         | Bob     | 789 Oak St   | 2023-01-01 | NULL       | Yes         |
| 4            | 103         | Charlie | 101 Pine St  | 2023-01-01 | 2023-03-31 | No          |
| 5            | 103         | Charlie | 202 Maple St | 2023-04-01 | NULL       | Yes         |
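The expire-and-insert maintenance behind this table can be sketched in Python (a simplified illustration, not a real ETL job; here the expired row's end date is set to the change date rather than the prior day):

```python
from datetime import date

# Type 2 sketch: expire the current row and append a new one with a fresh
# surrogate key. Column names follow the example table above.
rows = [
    {"customer_key": 1, "customer_id": 101, "address": "123 Main St",
     "start_date": date(2023, 1, 1), "end_date": None, "current": True},
]
next_key = 2

def scd_type2_change(rows, next_key, customer_id, new_address, change_date):
    """Expire the current row for customer_id and insert the new version."""
    for r in rows:
        if r["customer_id"] == customer_id and r["current"]:
            r["end_date"] = change_date  # simplification: expire on change date
            r["current"] = False
    rows.append({"customer_key": next_key, "customer_id": customer_id,
                 "address": new_address, "start_date": change_date,
                 "end_date": None, "current": True})
    return next_key + 1

next_key = scd_type2_change(rows, next_key, 101, "456 Elm St", date(2023, 5, 16))
current = [r for r in rows if r["current"]]
print(len(rows), current[0]["address"])  # 2 456 Elm St
```

From the moment of the change, new fact rows would carry surrogate key 2 rather than 1, exactly as described above.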

Type 3 changes add a new attribute in the dimension to preserve the old
attribute value; the new value overwrites the main attribute as in a type 1
change. This kind of type 3 change is sometimes called an alternate reality. A
business user can group and filter fact data by either the current value or the
alternate reality.

Product Dimension Table (after price change):

| Product ID | Product Name | Current Price | Previous Price | Effective Date |
|------------|--------------|---------------|----------------|----------------|
| 001        | Laptop       | $1200         | $1000          | 2023-01-01     |
| 002        | Smartphone   | $750          | $800           | 2023-01-01     |
| 003        | T-shirt      | $20           |                | 2023-01-01     |
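The type 3 shift-and-overwrite step can be sketched in Python (illustrative; column names follow the table above):

```python
# Type 3 sketch: keep one "previous" column; the main column is overwritten.
product = {"product_id": "001", "name": "Laptop",
           "current_price": 1000, "previous_price": None}

def scd_type3_price_change(p, new_price, effective_date):
    """Shift the current value into the alternate-reality column, then overwrite."""
    p["previous_price"] = p["current_price"]
    p["current_price"] = new_price
    p["effective_date"] = effective_date

scd_type3_price_change(product, 1200, "2023-01-01")
print(product["current_price"], product["previous_price"])  # 1200 1000
```

Only one prior value survives: a second price change would overwrite `previous_price`, which is the key limitation of type 3.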

The type 4 technique is used when a group of attributes in a dimension rapidly
changes and is split off to a mini-dimension. Frequently used attributes in
multimillion-row dimension tables are mini-dimension design candidates, even
if they don’t frequently change. The type 4 mini-dimension requires its own
unique primary key; the primary keys of both the base dimension and mini-
dimension are captured in the associated fact tables.

Main Product Dimension Table (with Mini-Dimension Attributes):

| Product ID | Product Name | Category    | Price | Initial Color | Initial Size | Initial Material | Initial Style |
|------------|--------------|-------------|-------|---------------|--------------|------------------|---------------|
| 001        | Laptop       | Electronics | $1000 | Silver        | Small        | Aluminum         | Basic         |
| 002        | Smartphone   | Electronics | $800  | Gold          | Medium       | Plastic          | Standard      |
| 003        | T-shirt      | Clothing    | $20   | Red           | Small        | Cotton           | Casual        |

Mini-Dimension Table (Attributes for Product Variation):

| Variation ID | Product ID | Color  | Size   | Material | Style    |
|--------------|------------|--------|--------|----------|----------|
| 1            | 001        | Silver | Small  | Aluminum | Basic    |
| 2            | 001        | Black  | Large  | Steel    | Premium  |
| 3            | 002        | Gold   | Medium | Plastic  | Standard |
| 4            | 003        | Red    | Small  | Cotton   | Casual   |
Fact Table (Sales):

| Transaction ID | Product ID | Variation ID | Quantity | Sales Amount | Transaction Date |
|----------------|------------|--------------|----------|--------------|------------------|
| 1              | 001        | 1            | 2        | $2000        | 2023-01-15       |
| 2              | 001        | 2            | 1        | $1000        | 2023-01-20       |
| 3              | 002        | 3            | 1        | $800         | 2023-02-20       |
| 4              | 003        | 4            | 3        | $60          | 2023-03-10       |
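The key point — that each fact row carries both the base-dimension key and the mini-dimension key — can be sketched in Python (illustrative names from the example tables; `record_sale` is a hypothetical helper):

```python
# Type 4 sketch: a fact row stores both the base-dimension key (product_id)
# and the mini-dimension key (variation_id), so queries can join to either.
mini_dimension = {
    1: {"product_id": "001", "color": "Silver", "size": "Small"},
    2: {"product_id": "001", "color": "Black",  "size": "Large"},
}

def record_sale(facts, product_id, variation_id, qty, amount):
    """Append a fact row referencing both the base and mini-dimension keys."""
    facts.append({"product_id": product_id, "variation_id": variation_id,
                  "quantity": qty, "amount": amount})

facts = []
record_sale(facts, "001", 2, 1, 1000)
sale = facts[0]
print(mini_dimension[sale["variation_id"]]["color"])  # Black
```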

The Type 5 Slowly Changing Dimension (SCD) technique is a sophisticated
method used in data warehousing to handle changes in dimension attributes.
It aims to preserve historical attribute values while allowing historical facts to
be reported according to current attribute values. Here's a detailed
breakdown:

### Key Concepts

1. **Mini-Dimension (Type 4)**:


- A separate dimension that contains historical values of certain attributes
that change frequently (e.g., customer preferences).
- This allows for better performance and manageability.

2. **Current Attribute Reference (Type 1)**:


- A direct reference in the main dimension table that points to the current
attribute values in the mini-dimension.
- This is implemented as a simple overwrite without maintaining history in
the main dimension.

### How Type 5 SCD Works

1. **Combining Type 4 and Type 1**:


- **Mini-Dimension (Type 4)**: Keeps track of historical changes in a
separate table.
- **Current Attribute Reference (Type 1)**: A reference to the current state of
these attributes is embedded in the main dimension table.

2. **Base Dimension Table**:


- Includes all the main attributes of the entity (e.g., customer) plus a
reference to the current state of the mini-dimension.

3. **Updating the Reference**:


- Whenever there is a change in the mini-dimension (e.g., a customer
changes their preference), the ETL process updates the reference in the main
dimension to point to the new current state.

4. **Reporting**:
- For reporting purposes, the main dimension and mini-dimension are
logically represented as a single table.
- This allows easy access to both historical and current attribute values
without complex joins through the fact table.

### Example

#### Mini-Dimension Table (Customer Preferences)


| Pref_Key | Preference_Type | Preference_Value | Start_Date | End_Date |
|----------|-----------------|------------------|------------|------------|
|1 | Contact_Method | Email | 2023-01-01 | 2023-06-30 |
|2 | Contact_Method | Phone | 2023-07-01 | 9999-12-31 |
|3 | Contact_Time | Morning | 2023-01-01 | 2023-03-31 |
|4 | Contact_Time | Afternoon | 2023-04-01 | 9999-12-31 |

#### Base Dimension Table (Customer)

| Customer_Key | Customer_ID | Name     | Current_Pref_Key | Start_Date | End_Date   |
|--------------|-------------|----------|------------------|------------|------------|
| 1            | 1001        | John Doe | 2                | 2023-07-01 | 9999-12-31 |
| 2            | 1001        | John Doe | 1                | 2023-01-01 | 2023-06-30 |

#### Fact Table (Orders)


| Order_Key | Customer_Key | Order_Amount | Order_Date |
|-----------|--------------|--------------|------------|
|1 |2 | 150.00 | 2023-05-15 |
|2 |1 | 200.00 | 2023-07-10 |

### ETL Process

1. **Detect Change**: Customer John Doe changes their contact method from
Phone to SMS on 2023-07-16.
2. **Update Mini-Dimension**: Insert a new row in the `Customer
Preferences` table.
- Pref_Key: 5
- Preference_Type: Contact_Method
- Preference_Value: SMS
- Start_Date: 2023-07-16
- End_Date: 9999-12-31

#### Updated Mini-Dimension Table (Customer Preferences)


| Pref_Key | Preference_Type | Preference_Value | Start_Date | End_Date |
|----------|-----------------|------------------|------------|------------|
|1 | Contact_Method | Email | 2023-01-01 | 2023-06-30 |
|2 | Contact_Method | Phone | 2023-07-01 | 2023-07-15 |
|3 | Contact_Time | Morning | 2023-01-01 | 2023-03-31 |
|4 | Contact_Time | Afternoon | 2023-04-01 | 9999-12-31 |
|5 | Contact_Method | SMS | 2023-07-16 | 9999-12-31 |
3. **Update Base Dimension**: Overwrite the `Current_Pref_Key` in the
`Customer` table to reference the new preference key.

#### Updated Base Dimension Table (Customer)

| Customer_Key | Customer_ID | Name     | Current_Pref_Key | Start_Date | End_Date   |
|--------------|-------------|----------|------------------|------------|------------|
| 1            | 1001        | John Doe | 5                | 2023-07-16 | 9999-12-31 |
| 2            | 1001        | John Doe | 2                | 2023-07-01 | 2023-07-15 |
| 3            | 1001        | John Doe | 1                | 2023-01-01 | 2023-06-30 |

### Reporting

For reporting purposes, logically combine the base dimension and mini-
dimension as follows:

#### Combined Dimension for Presentation (Customer with Current Preference)

| Customer_ID | Name     | Current_Contact_Method | Current_Contact_Time | Order_Amount | Order_Date |
|-------------|----------|------------------------|----------------------|--------------|------------|
| 1001        | John Doe | SMS                    | Afternoon            | 200.00       | 2023-07-10 |
| 1001        | John Doe | Phone                  | Afternoon            | 150.00       | 2023-05-15 |

### Summary

- **Historical Accuracy**: The mini-dimension table preserves historical
attribute values.
- **Current Reporting**: The base dimension table's reference to the mini-
dimension allows for current attribute values to be accessed directly.
- **Efficient Access**: Reporting combines these tables logically, simplifying
access to both current and historical data without complex joins through the
fact table.
- **ETL Responsibility**: The ETL team ensures the `Current_Pref_Key` in
the base dimension table is updated whenever there's a change in the mini-
dimension.

The "simple overwrite" process is evident when the Current_Pref_Key in the


base dimension table (Customer) is directly updated to re ect the new preference
without keeping the old preference key. The history of the old preference is preserved
in the mini-dimension table (Customer Preferences), not in the base dimension table.
This allows the base dimension to always re ect the current state of the customer's
attributes while the mini-dimension maintains the historical records.
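The combination — history accumulating in the mini-dimension while the base row's pointer is simply overwritten — can be sketched in Python (a simplified illustration; keys and names follow the example):

```python
# Type 5 sketch: the mini-dimension (type 4) keeps history; the base row's
# current_pref_key is a type 1 overwrite pointing at the latest mini row.
preferences = {2: "Phone"}          # pref_key -> preference_value (mini-dimension)
customer = {"customer_id": 1001, "current_pref_key": 2}
next_pref_key = 5

def change_preference(prefs, cust, key, value):
    """Insert a new mini-dimension row, then overwrite the current pointer."""
    prefs[key] = value              # history accumulates in the mini-dimension
    cust["current_pref_key"] = key  # simple type 1 overwrite on the base row

change_preference(preferences, customer, next_pref_key, "SMS")
print(preferences[customer["current_pref_key"]])  # SMS
print(sorted(preferences))  # [2, 5] -- the old mini-dimension row survives
```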

Let's break down each line and provide a comprehensive tabular data example to
illustrate the concept of a Type 6 Slowly Changing Dimension (SCD).
### Explanation of Each Line

1. **"Like type 5, type 6 also delivers both historical and current dimension attribute
values."**
- Type 6 SCDs combine the features of both Type 1 (overwrite) and Type 2
(versioning) to maintain both historical and current values.

2. **"Type 6 builds on the type 2 technique by also embedding current type 1


versions of the same attributes in the dimension row..."**
- Type 6 starts with a Type 2 technique (adding a new row for changes) and also
includes the current value of the attribute within each row.

3. **"...so that fact rows can be ltered or grouped by either the type 2 attribute value
in effect when the measurement occurred or the attribute’s current value."**
- This allows analysis based on the historical value (Type 2) at the time the fact was
recorded or the current value (Type 1).

4. **"In this case, the type 1 attribute is systematically overwritten on all rows
associated with a particular durable key whenever the attribute is updated."**
- When an attribute is updated, the current value (Type 1) is overwritten on all
historical rows for the same durable key, ensuring consistency.

### Tabular Data Example

Let's consider a customer dimension with the following attributes: CustomerID
(Durable Key), CustomerName, and CustomerStatus. We'll track changes to
CustomerStatus using Type 6 SCD.

#### Initial Data

| CustomerID | CustomerName | CustomerStatus | StartDate  | EndDate | CurrentStatus |
|------------|--------------|----------------|------------|---------|---------------|
| 1          | John Doe     | Active         | 2023-01-01 | NULL    | Active        |

#### Change 1: CustomerStatus changes from Active to Inactive on 2023-06-01

| CustomerID | CustomerName | CustomerStatus | StartDate  | EndDate    | CurrentStatus |
|------------|--------------|----------------|------------|------------|---------------|
| 1          | John Doe     | Active         | 2023-01-01 | 2023-05-31 | Inactive      |
| 1          | John Doe     | Inactive       | 2023-06-01 | NULL       | Inactive      |

#### Change 2: CustomerStatus changes from Inactive to Suspended on 2023-09-01

| CustomerID | CustomerName | CustomerStatus | StartDate  | EndDate    | CurrentStatus |
|------------|--------------|----------------|------------|------------|---------------|
| 1          | John Doe     | Active         | 2023-01-01 | 2023-05-31 | Suspended     |
| 1          | John Doe     | Inactive       | 2023-06-01 | 2023-08-31 | Suspended     |
| 1          | John Doe     | Suspended      | 2023-09-01 | NULL       | Suspended     |
### Explanation of the Table

1. **Initial Data:**
- The initial state of the customer with CustomerID = 1 is Active.
- `CurrentStatus` is also Active.

2. **Change 1:**
- On 2023-06-01, the CustomerStatus changes to Inactive.
- A new row is created to record the change.
- The `EndDate` of the previous row is set to 2023-05-31.
- `CurrentStatus` of all rows for CustomerID = 1 is updated to Inactive.

3. **Change 2:**
- On 2023-09-01, the CustomerStatus changes to Suspended.
- A new row is created to record this change.
- The `EndDate` of the previous row is set to 2023-08-31.
- `CurrentStatus` of all rows for CustomerID = 1 is updated to Suspended.

### Key Points

- **Historical Data:** Maintained using the `StartDate` and `EndDate` columns.
- **Current Data:** The `CurrentStatus` column always shows the current value for
all rows.
- **Durable Key:** `CustomerID` remains the same for the customer across all
versions.

By using this approach, you can filter or group data based on the historical
`CustomerStatus` at the time a fact was recorded or based on the current
`CustomerStatus`. This flexibility allows for comprehensive analysis and reporting.
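The two maintenance steps — a type 2 insert plus a type 1 overwrite of the current-status column across all rows sharing the durable key — can be sketched in Python (simplified: the expired row's end date is set to the change date rather than the prior day):

```python
from datetime import date

# Type 6 sketch: a type 2 insert for the change, plus a type 1 overwrite of
# current_status on every row sharing the durable key (customer_id).
rows = [{"customer_id": 1, "status": "Active",
         "start": date(2023, 1, 1), "end": None, "current_status": "Active"}]

def scd_type6_change(rows, customer_id, new_status, change_date):
    """Expire the open row, insert the new version, sync current_status."""
    for r in rows:
        if r["customer_id"] == customer_id and r["end"] is None:
            r["end"] = change_date        # simplification: expire on change date
    rows.append({"customer_id": customer_id, "status": new_status,
                 "start": change_date, "end": None, "current_status": new_status})
    for r in rows:                        # the type 1 overwrite on all versions
        if r["customer_id"] == customer_id:
            r["current_status"] = new_status

scd_type6_change(rows, 1, "Inactive", date(2023, 6, 1))
scd_type6_change(rows, 1, "Suspended", date(2023, 9, 1))
print([r["current_status"] for r in rows])  # ['Suspended', 'Suspended', 'Suspended']
```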

Let's expand on the example by including fact table data and showing how you
can filter or group by either the Type 2 attribute value (historical) or the Type 1
attribute value (current).

### Queries

#### 1. Group by Type 2 Attribute (Historical CustomerStatus)

If you want to group sales by the historical `CustomerStatus` at the time of the sale,
you would join the fact table with the dimension table on the appropriate date range:

```sql
SELECT
d.CustomerStatus,
SUM(f.SaleAmount) AS TotalSales
FROM
FactTable f
JOIN
DimensionTable d
ON
f.CustomerID = d.CustomerID
AND f.SaleDate >= d.StartDate
AND (f.SaleDate <= d.EndDate OR d.EndDate IS NULL)
GROUP BY
d.CustomerStatus;
```

**Result:**

| CustomerStatus | TotalSales |
|----------------|------------|
| Active | 100 |
| Inactive | 150 |
| Suspended | 200 |

- **Explanation:** Sales are grouped by the `CustomerStatus` that was in effect at the
time of each sale.

#### 2. Group by Type 1 Attribute (Current CustomerStatus)

If you want to group sales by the current `CustomerStatus`, you would join the fact
table with the dimension table and use the `CurrentStatus` column:

```sql
SELECT
d.CurrentStatus,
SUM(f.SaleAmount) AS TotalSales
FROM
FactTable f
JOIN
DimensionTable d
ON
f.CustomerID = d.CustomerID
GROUP BY
d.CurrentStatus;
```

**Result:**

| CurrentStatus | TotalSales |
|---------------|------------|
| Suspended | 450 |

- **Explanation:** All sales are grouped by the current `CustomerStatus` (which is
`Suspended` for all rows in the dimension table).
### Key Points

- **Group by Historical (Type 2) Attribute:** Allows you to analyze data based on
the status at the time of each transaction.
- **Group by Current (Type 1) Attribute:** Allows you to analyze data based on the
current status, regardless of historical changes.

This approach provides the flexibility to perform different types of analysis depending
on the business requirement, leveraging both historical and current views of the data.

Let's illustrate how to filter fact rows by either the Type 2 attribute value
(historical) or the Type 1 attribute value (current).

### Filtering Example

### Queries

#### 1. Filter by Type 2 Attribute (Historical CustomerStatus)

If you want to filter sales that occurred when the `CustomerStatus` was "Active", you
would join the fact table with the dimension table on the appropriate date range and
filter by `CustomerStatus`:

```sql
SELECT
f.SaleID,
f.SaleAmount,
f.SaleDate,
d.CustomerStatus
FROM
FactTable f
JOIN
DimensionTable d
ON
f.CustomerID = d.CustomerID
AND f.SaleDate >= d.StartDate
AND (f.SaleDate <= d.EndDate OR d.EndDate IS NULL)
WHERE
d.CustomerStatus = 'Active';
```

**Result:**

| SaleID | SaleAmount | SaleDate   | CustomerStatus |
|--------|------------|------------|----------------|
| 101    | 100        | 2023-03-15 | Active         |
- **Explanation:** The query filters sales to only include those where the
`CustomerStatus` was "Active" at the time of the sale.

#### 2. Filter by Type 1 Attribute (Current CustomerStatus)

If you want to filter sales where the current `CustomerStatus` is "Suspended", you
would join the fact table with the dimension table and filter by `CurrentStatus`:

```sql
SELECT
f.SaleID,
f.SaleAmount,
f.SaleDate,
d.CurrentStatus
FROM
FactTable f
JOIN
DimensionTable d
ON
f.CustomerID = d.CustomerID
WHERE
d.CurrentStatus = 'Suspended';
```

**Result:**

| SaleID | SaleAmount | SaleDate   | CurrentStatus |
|--------|------------|------------|---------------|
| 101    | 100        | 2023-03-15 | Suspended     |
| 102    | 150        | 2023-07-10 | Suspended     |
| 103    | 200        | 2023-10-05 | Suspended     |

- **Explanation:** The query filters sales to only include those where the current
`CustomerStatus` is "Suspended", regardless of what the status was at the time of the
sale.

### Key Points

- **Filter by Historical (Type 2) Attribute:** Allows you to filter data based on the
status at the time of each transaction.
- **Filter by Current (Type 1) Attribute:** Allows you to filter data based on the
current status, regardless of historical changes.

This flexibility enables comprehensive and nuanced analysis, allowing for insights
based on both the historical and current states of the data.

Let's break down the explanation of the Type 7 hybrid technique and illustrate
each concept with a comprehensive tabular data example.
For the type 1 perspective, the current flag in the dimension is constrained to be
current, and the fact table is joined via the durable key. For the type 2 perspective, the
current flag is not constrained, and the fact table is joined via the surrogate primary
key.

### Explanation Breakdown:

1. **Type 7 hybrid technique**:
- This technique supports both "as-was" and "as-is" reporting by using a
combination of Type 1 and Type 2 dimensions.

2. **Fact Table**:
- Contains measurable data (facts) and keys to link to dimension tables.

3. **Type 1 Dimension**:
- Shows only the most current attribute values.

4. **Type 2 Dimension**:
- Shows historical changes with contemporary historical profiles.

5. **Dimension Table**:
- Contains attributes related to dimensions which can have both Type 1 and Type 2
perspectives.

6. **Durable Key**:
- A unique identifier that does not change over time.

7. **Primary Surrogate Key**:
- A unique identifier that can change over time (each version of a dimension record
gets a new surrogate key).

8. **Current Flag**:
- An attribute in the dimension table that indicates the current record.

9. **Separate Views**:
- BI applications use different views for Type 1 and Type 2 perspectives.

### Tabular Data Example:

#### Dimension Table: `Customer`

| Customer Durable Key | Customer Surrogate Key | Customer Name | Current Flag | Start Date | End Date   |
|----------------------|------------------------|---------------|--------------|------------|------------|
| 1                    | 101                    | John Doe      | Y            | 2020-01-01 | NULL       |
| 1                    | 102                    | John Doe      | N            | 2019-01-01 | 2019-12-31 |
| 2                    | 201                    | Jane Smith    | Y            | 2020-01-01 | NULL       |
| 2                    | 202                    | Jane Smith    | N            | 2018-01-01 | 2019-12-31 |
#### Fact Table: `Sales`

| Fact Date | Customer Durable Key | Customer Surrogate Key | Sales Amount |
|------------|-----------------------|------------------------|--------------|
| 2020-01-10 | 1 | 101 | 100 |
| 2019-05-15 | 1 | 102 | 200 |
| 2020-02-20 | 2 | 201 | 150 |
| 2019-08-25 | 2 | 202 | 250 |

### Joining for Type 1 Perspective (Current Values)

- **Join Condition**: Join `Sales` on `Customer Durable Key` where `Current Flag`
is `Y`.

```sql
SELECT s.Fact_Date, d.Customer_Name, s.Sales_Amount
FROM Sales s
JOIN Customer d
ON s.Customer_Durable_Key = d.Customer_Durable_Key
WHERE d.Current_Flag = 'Y';
```

**Result**:

| Fact Date  | Customer Name | Sales Amount |
|------------|---------------|--------------|
| 2020-01-10 | John Doe      | 100          |
| 2020-02-20 | Jane Smith    | 150          |

### Joining for Type 2 Perspective (Historical Values)

- **Join Condition**: Join `Sales` on `Customer Surrogate Key`.

```sql
SELECT s.Fact_Date, d.Customer_Name, s.Sales_Amount
FROM Sales s
JOIN Customer d
ON s.Customer_Surrogate_Key = d.Customer_Surrogate_Key;
```

**Result**:

| Fact Date  | Customer Name | Sales Amount |
|------------|---------------|--------------|
| 2020-01-10 | John Doe      | 100          |
| 2019-05-15 | John Doe      | 200          |
| 2020-02-20 | Jane Smith    | 150          |
| 2019-08-25 | Jane Smith    | 250          |
### Explanation with Example:

- **Fact Table Access**:
- The fact table (`Sales`) can be accessed via `Customer Durable Key` for the Type 1
perspective, ensuring only current data.
- It can be accessed via `Customer Surrogate Key` for the Type 2 perspective,
ensuring historical accuracy.

- **Dimension Table**:
- The `Customer` dimension table maintains both current and historical data with the
`Current Flag` indicating the most recent record.
- The same dimension table supports both Type 1 (current) and Type 2 (historical)
perspectives.

- **Durable Key and Surrogate Key**:
- `Customer Durable Key` is constant, linking to the most current dimension record.
- `Customer Surrogate Key` changes with each update, allowing historical tracking.

- **Views**:
- BI applications can have separate views to utilize the different perspectives:
- A view that joins on the `Customer Durable Key` for current data (Type 1).
- A view that joins on the `Customer Surrogate Key` for historical data (Type 2).

Here are examples of SQL views for both the Type 1 (current values) and
Type 2 (historical values) perspectives based on the `Customer` dimension and `Sales`
fact tables provided.

### Example of Type 1 View (Current Values)

This view will join the `Sales` fact table with the `Customer` dimension table using
the `Customer Durable Key` and filter for the current records using the `Current Flag`.

```sql
CREATE VIEW vw_Sales_Current AS
SELECT
s.Fact_Date,
d.Customer_Name,
s.Sales_Amount
FROM
Sales s
JOIN
Customer d
ON
s.Customer_Durable_Key = d.Customer_Durable_Key
WHERE
d.Current_Flag = 'Y';
```
**Querying the Type 1 View**:

```sql
SELECT * FROM vw_Sales_Current;
```

**Result**:

| Fact_Date  | Customer_Name | Sales_Amount |
|------------|---------------|--------------|
| 2020-01-10 | John Doe      | 100          |
| 2020-02-20 | Jane Smith    | 150          |

### Example of Type 2 View (Historical Values)

This view will join the `Sales` fact table with the `Customer` dimension table using
the `Customer Surrogate Key`, showing historical data.

```sql
CREATE VIEW vw_Sales_Historical AS
SELECT
s.Fact_Date,
d.Customer_Name,
s.Sales_Amount
FROM
Sales s
JOIN
Customer d
ON
s.Customer_Surrogate_Key = d.Customer_Surrogate_Key;
```

**Querying the Type 2 View**:

```sql
SELECT * FROM vw_Sales_Historical;
```

**Result**:

| Fact_Date  | Customer_Name | Sales_Amount |
|------------|---------------|--------------|
| 2020-01-10 | John Doe      | 100          |
| 2019-05-15 | John Doe      | 200          |
| 2020-02-20 | Jane Smith    | 150          |
| 2019-08-25 | Jane Smith    | 250          |

### Summary
- **Type 1 View (`vw_Sales_Current`)**: Provides the "as-is" perspective by
showing only the most current attribute values.
- **Type 2 View (`vw_Sales_Historical`)**: Provides the "as-was" perspective by
showing the historical changes and contemporary historical profiles.

These views enable BI applications to query either current or historical data without
needing to join tables manually each time.

Let's break down the description of a fixed depth hierarchy with a
comprehensive example using a dimension table that represents products, brands,
categories, and departments. Here's a detailed explanation for each line:

1. **A fixed depth hierarchy is a series of many-to-one relationships, such as product
to brand to category to department.**
- **Explanation:** This means each product belongs to one brand, each brand
belongs to one category, and each category belongs to one department. This creates a
hierarchical structure where each level is dependent on the level above it.

2. **When a fixed depth hierarchy is defined and the hierarchy levels have agreed
upon names, the hierarchy levels should appear as separate positional attributes in a
dimension table.**
- **Explanation:** Each level of the hierarchy should have a clearly defined name
and should be represented as a separate column in a dimension table. This makes the
structure clear and easy to understand.

3. **A fixed depth hierarchy is by far the easiest to understand and navigate as long as
the above criteria are met.**
- **Explanation:** When the hierarchy is well-defined with distinct levels, it
becomes straightforward for users to comprehend and traverse through the hierarchy.

4. **It also delivers predictable and fast query performance.**
- **Explanation:** With a fixed structure, queries can be optimized easily, leading
to faster performance because the relationships and the hierarchy are predefined and
clear.

### Comprehensive Tabular Data Example

Let's create a dimension table for a retail company that sells various products. The
table will include columns for Product, Brand, Category, and Department.

| Product ID | Product Name      | Brand ID | Brand Name | Category ID | Category Name | Department ID | Department Name |
|------------|-------------------|----------|------------|-------------|---------------|---------------|-----------------|
| 1          | iPhone 13         | 101      | Apple      | 201         | Smartphones   | 301           | Electronics     |
| 2          | Galaxy S21        | 102      | Samsung    | 201         | Smartphones   | 301           | Electronics     |
| 3          | MacBook Pro       | 101      | Apple      | 202         | Laptops       | 301           | Electronics     |
| 4          | ThinkPad X1       | 103      | Lenovo     | 202         | Laptops       | 301           | Electronics     |
| 5          | Nike Air Max      | 104      | Nike       | 203         | Footwear      | 302           | Apparel         |
| 6          | Adidas Ultraboost | 105      | Adidas     | 203         | Footwear      | 302           | Apparel         |
| 7          | Levi's 501 Jeans  | 106      | Levi's     | 204         | Jeans         | 302           | Apparel         |
| 8          | Wrangler Classic  | 107      | Wrangler   | 204         | Jeans         | 302           | Apparel         |

### Explanation of Each Column

- **Product ID:** Unique identifier for each product.
- **Product Name:** Name of the product.
- **Brand ID:** Unique identifier for the brand to which the product belongs.
- **Brand Name:** Name of the brand.
- **Category ID:** Unique identifier for the category to which the brand belongs.
- **Category Name:** Name of the category.
- **Department ID:** Unique identifier for the department to which the category belongs.
- **Department Name:** Name of the department.

### Hierarchical Structure Explanation

- **Product to Brand:** Each product belongs to one brand (e.g., iPhone 13 belongs
to Apple).
- **Brand to Category:** Each brand belongs to one category (e.g., Apple products in
this example belong to either Smartphones or Laptops).
- **Category to Department:** Each category belongs to one department (e.g.,
Smartphones and Laptops belong to Electronics, Footwear and Jeans belong to
Apparel).

### Benefits

- **Easy to Understand and Navigate:** The structure is clear, and each level of the hierarchy is explicitly defined, making it easy for users to comprehend and navigate through the data.
- **Predictable and Fast Query Performance:** Queries can be optimized for this fixed structure, resulting in predictable and fast performance. For example, filtering all products under the "Electronics" department is straightforward and efficient.

This fixed depth hierarchy ensures data is organized in a way that is both logical and efficient for querying and reporting.
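Because every level is a separate positional attribute, drilling up and down this hierarchy is plain SQL grouping. A minimal sketch, assuming the dimension above is stored in a table named `Product_Dim` (a hypothetical name) with the columns shown:

```sql
-- Drill down from department to category: count products at each level.
SELECT Department_Name, Category_Name, COUNT(*) AS Product_Count
FROM Product_Dim
GROUP BY Department_Name, Category_Name
ORDER BY Department_Name, Category_Name;
```

Rolling up to the department level is the same query with `Category_Name` removed from the `SELECT` and `GROUP BY` clauses, which is why fixed depth hierarchies deliver such predictable query plans.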
Let's break down the concept of slightly ragged hierarchies and explain each line with
a comprehensive tabular data example.

### Explanation

1. **Slightly ragged hierarchies don’t have a fixed number of levels, but the range in depth is small.**
- This means that the number of levels in the hierarchy is not always the same, but it
does not vary greatly.

2. **Geographic hierarchies often range in depth from perhaps three levels to six
levels.**
- For example, in geographic hierarchies, you might have different levels such as
Country, State, City, and Suburb.

3. **Rather than using the complex machinery for unpredictably variable hierarchies, you can force-fit slightly ragged hierarchies into a fixed depth positional design with separate dimension attributes for the maximum number of levels, and then populate the attribute value based on rules from the business.**
   - Instead of creating a flexible structure to handle different numbers of levels, you create a fixed structure with separate columns for each possible level. Any missing levels are filled according to specific business rules.

### Example: Geographic Hierarchy

Assume we have geographic data with varying depths, such as:

1. Country -> State -> City
2. Country -> State -> City -> Suburb
3. Country -> City

We can represent this hierarchy in a fixed-depth table as follows:

#### Original Data

| Region Type | Name        |
|-------------|-------------|
| Country     | USA         |
| State       | California  |
| City        | Los Angeles |
| Country     | USA         |
| State       | California  |
| City        | Los Angeles |
| Suburb      | Hollywood   |
| Country     | Canada      |
| City        | Toronto     |

#### Fixed Depth Positional Design

| Country | State      | City        | Suburb    |
|---------|------------|-------------|-----------|
| USA     | California | Los Angeles | NULL      |
| USA     | California | Los Angeles | Hollywood |
| Canada  | NULL       | Toronto     | NULL      |

### Explanation of Each Column

- **Country**: The highest level in the hierarchy. This is always populated.
- **State**: The second level. This is populated if available, otherwise `NULL`.
- **City**: The third level. This is populated if available, otherwise `NULL`.
- **Suburb**: The fourth level. This is populated if available, otherwise `NULL`.

### Rules for Populating Attribute Values

1. **If a level is missing in the original hierarchy, the corresponding cell in the fixed-depth table is set to `NULL`.**
2. **If the hierarchy has fewer levels than the maximum (in this case, four), then all missing levels are filled with `NULL`.**

By using this approach, you ensure that each row in the table has the same number of
columns, simplifying data analysis and querying, even though the depth of the
original hierarchies varies.
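Queries against this design only need to account for the `NULL` placeholders. A sketch, assuming the fixed-depth table above is named `Geo_Dim` (a hypothetical name):

```sql
-- Roll up to the country level; missing State levels are NULL,
-- so COALESCE supplies a display value for them.
SELECT Country,
       COALESCE(State, 'n/a') AS State,
       COUNT(DISTINCT City) AS City_Count
FROM Geo_Dim
GROUP BY Country, COALESCE(State, 'n/a');
```

The `COALESCE` here is one example of the "rules from the business" mentioned above; some teams instead copy the nearest populated level into the empty cells so that no `NULL` handling is needed at query time.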

Let's break down the concept of modeling ragged hierarchies using a bridge table in a
relational database and explain each line with a comprehensive tabular data example.

### Explanation

1. **Ragged hierarchies of indeterminate depth are difficult to model and query in a relational database.**
   - Hierarchies with varying levels of depth pose challenges because traditional relational databases are optimized for fixed-schema data structures.

2. **Although SQL extensions and OLAP access languages provide some support for recursive parent/child relationships, these approaches have limitations.**
   - SQL extensions like Common Table Expressions (CTEs) and OLAP tools can handle parent/child hierarchies, but they have constraints such as performance issues and lack of flexibility.

3. **With SQL extensions, alternative ragged hierarchies cannot be substituted at query time, shared ownership structures are not supported, and time-varying ragged hierarchies are not supported.**
   - The rigid nature of SQL extensions makes it hard to switch between different hierarchies, manage shared ownership, or track changes over time.

4. **All these objections can be overcome in relational databases by modeling a ragged hierarchy with a specially constructed bridge table.**
   - A bridge table can be used to represent complex hierarchies more flexibly.

5. **This bridge table contains a row for every possible path in the ragged hierarchy
and enables all forms of hierarchy traversal to be accomplished with standard SQL
rather than using special language extensions.**
- Each row in the bridge table represents a complete path from a top-level node to a
lower-level node, allowing standard SQL queries to navigate the hierarchy.

### Example: Ragged Hierarchy

Consider a company with the following organizational structure:

- **CEO**
  - **VP of Sales**
    - **Sales Manager 1**
      - **Salesperson 1**
    - **Sales Manager 2**
  - **VP of Marketing**
    - **Marketing Manager 1**

#### Original Hierarchy Data

| ID | ParentID | Name                |
|----|----------|---------------------|
| 1  | NULL     | CEO                 |
| 2  | 1        | VP of Sales         |
| 3  | 2        | Sales Manager 1     |
| 4  | 3        | Salesperson 1       |
| 5  | 2        | Sales Manager 2     |
| 6  | 1        | VP of Marketing     |
| 7  | 6        | Marketing Manager 1 |

#### Bridge Table Construction

The bridge table will contain all possible paths in the hierarchy:

| AncestorID | DescendantID | Depth |
|------------|--------------|-------|
| 1          | 1            | 0     |
| 1          | 2            | 1     |
| 1          | 3            | 2     |
| 1          | 4            | 3     |
| 1          | 5            | 2     |
| 1          | 6            | 1     |
| 1          | 7            | 2     |
| 2          | 2            | 0     |
| 2          | 3            | 1     |
| 2          | 4            | 2     |
| 2          | 5            | 1     |
| 3          | 3            | 0     |
| 3          | 4            | 1     |
| 4          | 4            | 0     |
| 5          | 5            | 0     |
| 6          | 6            | 0     |
| 6          | 7            | 1     |
| 7          | 7            | 0     |

### Explanation of Each Column

- **AncestorID**: The starting point of a path (e.g., higher-level employee).
- **DescendantID**: The ending point of a path (e.g., lower-level employee).
- **Depth**: The number of steps from the ancestor to the descendant.

### Benefits of Using a Bridge Table

- **Hierarchy Traversal**: Easily query the hierarchy using standard SQL to find all subordinates or all superiors.
- **Shared Ownership**: Manage cases where an entity might belong to multiple
hierarchies.
- **Time-Varying Hierarchies**: Track changes in the hierarchy over time by adding
a date range to the bridge table.
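The bridge table does not have to be maintained by hand. In databases that support recursive common table expressions, it can be generated from the parent/child table; a sketch, assuming the original hierarchy is stored in a table named `Employee` (a hypothetical name) with the `ID`/`ParentID` columns shown above:

```sql
-- Generate every (ancestor, descendant, depth) path from the parent/child table.
WITH RECURSIVE Bridge AS (
    -- Base case: each node is its own ancestor at depth 0.
    SELECT ID AS AncestorID, ID AS DescendantID, 0 AS Depth
    FROM Employee
    UNION ALL
    -- Recursive step: extend each path by one parent-to-child hop.
    SELECT b.AncestorID, e.ID, b.Depth + 1
    FROM Bridge b
    JOIN Employee e ON e.ParentID = b.DescendantID
)
SELECT AncestorID, DescendantID, Depth
FROM Bridge;
```

The result set is exactly the bridge table above: one row per path, including the zero-depth self-paths.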

### Example Queries

1. **Find all subordinates of the CEO:**

```sql
SELECT DescendantID
FROM BridgeTable
WHERE AncestorID = 1 AND Depth > 0;
```

#### Query Result

| DescendantID |
|--------------|
| 2            |
| 3            |
| 4            |
| 5            |
| 6            |
| 7            |

These results correspond to the VP of Sales, Sales Manager 1, Salesperson 1, Sales Manager 2, VP of Marketing, and Marketing Manager 1.

2. **Find the path from CEO to Salesperson 1:**


```sql
SELECT AncestorID, DescendantID
FROM BridgeTable
WHERE DescendantID = 4;
```

#### Query Result

| AncestorID | DescendantID |
|------------|--------------|
| 1          | 4            |
| 2          | 4            |
| 3          | 4            |
| 4          | 4            |

This shows the path from the CEO (ID 1) through VP of Sales (ID 2) and Sales
Manager 1 (ID 3) to Salesperson 1 (ID 4).

By structuring the data this way, we can efficiently retrieve and traverse the hierarchy using standard SQL queries.

Certainly! Let’s break down the explanation of using a pathstring attribute for
handling ragged variable depth hierarchies in a dimension table with a comprehensive
tabular data example.

### Explanation of Each Line with an Example

1. **Use of a bridge table for ragged variable depth hierarchies can be avoided by
implementing a pathstring attribute in the dimension.**
- Instead of using a bridge table (which is a common solution for handling
hierarchies with varying depths), we can use a pathstring attribute. This attribute
captures the hierarchy path in a single string.

2. **For each row in the dimension, the pathstring attribute contains a specially
encoded text string containing the complete path description from the supreme node
of a hierarchy down to the node described by the particular dimension row.**
- Each row in the dimension table will have a pathstring that represents the path from the root node to that specific node.

3. **Many of the standard hierarchy analysis requests can then be handled by standard SQL, without resorting to SQL language extensions.**
   - Using pathstrings allows you to perform hierarchical queries with standard SQL, avoiding the need for specialized SQL extensions.

4. **However, the pathstring approach does not enable rapid substitution of alternative hierarchies or shared ownership hierarchies.**
   - A limitation of the pathstring method is that it does not easily allow for switching between different hierarchical structures or handling nodes that belong to multiple parents.

5. **The pathstring approach may also be vulnerable to structure changes in the ragged hierarchy that could force the entire hierarchy to be relabeled.**
   - If the hierarchy changes (e.g., nodes are added or moved), the pathstrings for all affected nodes need to be updated, which can be cumbersome.

### Comprehensive Tabular Data Example

Let’s consider a company’s organizational structure as an example. The hierarchy might look like this:

- CEO
  - VP of Sales
    - Sales Manager 1
      - Sales Representative 1
      - Sales Representative 2
    - Sales Manager 2
  - VP of Marketing
    - Marketing Manager 1
    - Marketing Manager 2
      - Marketing Specialist 1

#### Dimension Table with Pathstring Attribute

| EmployeeID | EmployeeName           | Title      | Pathstring                             |
|------------|------------------------|------------|----------------------------------------|
| 1          | CEO                    | CEO        | /CEO                                   |
| 2          | VP of Sales            | VP         | /CEO/VP_Sales                          |
| 3          | Sales Manager 1        | Manager    | /CEO/VP_Sales/Sales_Mgr1               |
| 4          | Sales Representative 1 | Sales Rep  | /CEO/VP_Sales/Sales_Mgr1/Sales_Rep1    |
| 5          | Sales Representative 2 | Sales Rep  | /CEO/VP_Sales/Sales_Mgr1/Sales_Rep2    |
| 6          | Sales Manager 2        | Manager    | /CEO/VP_Sales/Sales_Mgr2               |
| 7          | VP of Marketing        | VP         | /CEO/VP_Marketing                      |
| 8          | Marketing Manager 1    | Manager    | /CEO/VP_Marketing/Mktg_Mgr1            |
| 9          | Marketing Manager 2    | Manager    | /CEO/VP_Marketing/Mktg_Mgr2            |
| 10         | Marketing Specialist 1 | Specialist | /CEO/VP_Marketing/Mktg_Mgr2/Mktg_Spec1 |

### Query 1: Finding All Employees Under VP of Sales


**SQL Query:**

```sql
SELECT EmployeeID, EmployeeName, Title
FROM DimensionTable
WHERE Pathstring LIKE '/CEO/VP_Sales/%';
```

**Result:**

| EmployeeID | EmployeeName           | Title     |
|------------|------------------------|-----------|
| 3          | Sales Manager 1        | Manager   |
| 4          | Sales Representative 1 | Sales Rep |
| 5          | Sales Representative 2 | Sales Rep |
| 6          | Sales Manager 2        | Manager   |

### Query 2: Counting the Number of Direct Reports for Each Manager

To perform this query, assume the manager IDs for which we need to find direct reports are `2` (VP of Sales) and `7` (VP of Marketing).

**SQL Query for VP of Sales:**

```sql
SELECT EmployeeID, EmployeeName, Title
FROM DimensionTable
WHERE Pathstring LIKE '/CEO/VP_Sales/%';
```

**Result:**

| EmployeeID | EmployeeName           | Title     |
|------------|------------------------|-----------|
| 3          | Sales Manager 1        | Manager   |
| 4          | Sales Representative 1 | Sales Rep |
| 5          | Sales Representative 2 | Sales Rep |
| 6          | Sales Manager 2        | Manager   |

**SQL Query for VP of Marketing:**

```sql
SELECT EmployeeID, EmployeeName, Title
FROM DimensionTable
WHERE Pathstring LIKE '/CEO/VP_Marketing/%';
```

**Result:**

| EmployeeID | EmployeeName           | Title      |
|------------|------------------------|------------|
| 8          | Marketing Manager 1    | Manager    |
| 9          | Marketing Manager 2    | Manager    |
| 10         | Marketing Specialist 1 | Specialist |

**Counting Direct Reports:**

For the direct reports, assuming direct reports only include the next level (directly
reporting) under a manager:

**SQL Query:**

```sql
SELECT ManagerID, COUNT(EmployeeID) AS NumDirectReports
FROM (
    SELECT 2 AS ManagerID, EmployeeID
    FROM DimensionTable
    WHERE Pathstring LIKE '/CEO/VP_Sales/%'
      AND Pathstring NOT LIKE '/CEO/VP_Sales/%/%'
    UNION ALL
    SELECT 7 AS ManagerID, EmployeeID
    FROM DimensionTable
    WHERE Pathstring LIKE '/CEO/VP_Marketing/%'
      AND Pathstring NOT LIKE '/CEO/VP_Marketing/%/%'
) AS DirectReports
GROUP BY ManagerID;
```

**Result:**

| ManagerID | NumDirectReports |
|-----------|------------------|
| 2         | 2                |
| 7         | 2                |

Here, the assumption is that direct reports only count the next hierarchical level,
hence `Sales Manager 1` and `Sales Manager 2` directly under `VP of Sales`, and
`Marketing Manager 1` and `Marketing Manager 2` directly under `VP of Marketing`.

### Limitations and Challenges

1. **Changing Hierarchies:**
   - If Sales Manager 1 is moved under VP of Marketing, all pathstrings under Sales Manager 1 need to be updated, e.g., changing from `/CEO/VP_Sales/Sales_Mgr1/...` to `/CEO/VP_Marketing/Sales_Mgr1/...`.

2. **Alternative Hierarchies:**
- If you need to view the hierarchy from different perspectives (e.g., based on
projects instead of departments), the pathstring approach doesn’t provide an easy way
to switch between these views.

By using the pathstring attribute, you can simplify many hierarchical queries and avoid the complexity of bridge tables. However, this approach also has its limitations, particularly when it comes to flexibility and maintenance in response to changes in the hierarchy structure.
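The relabeling burden can at least be scripted: moving a subtree is a single string replacement applied to every affected row. A sketch, assuming the `DimensionTable` and pathstrings used in the queries above:

```sql
-- Move Sales Manager 1 (and everyone beneath them) from VP of Sales
-- to VP of Marketing by rewriting the affected pathstrings in place.
UPDATE DimensionTable
SET Pathstring = REPLACE(Pathstring,
                         '/CEO/VP_Sales/Sales_Mgr1',
                         '/CEO/VP_Marketing/Sales_Mgr1')
WHERE Pathstring LIKE '/CEO/VP_Sales/Sales_Mgr1%';
```

Note that every descendant row is still touched, which is exactly the maintenance cost described in limitation 1.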

To explain each line with a tabular data example, let's consider a simplified data warehouse schema with a fact table and several dimension tables. We'll use a sales scenario for this example.

### Dimension Tables

**Product Dimension (`dim_product`):**


| product_key | product_name | category |
|-------------|--------------|----------|
| 1           | Widget A     | Gadgets  |
| 2           | Widget B     | Gadgets  |
| 3           | Gizmo A      | Gizmos   |

**Customer Dimension (`dim_customer`):**


| customer_key | customer_name | region  |
|--------------|---------------|---------|
| 1            | Alice         | East    |
| 2            | Bob           | West    |
| 3            | Charlie       | Central |

**Date Dimension (`dim_date`):**


| date_key | date       | month | year |
|----------|------------|-------|------|
| 1        | 2023-01-01 | Jan   | 2023 |
| 2        | 2023-01-02 | Jan   | 2023 |
| 3        | 2023-01-03 | Jan   | 2023 |

### Fact Table

**Sales Fact Table (`fact_sales`):**


| sales_key | product_key | customer_key | date_key | sales_amount |
|-----------|-------------|--------------|----------|--------------|
| 1         | 1           | 1            | 1        | 100.00       |
| 2         | 2           | 2            | 2        | 200.00       |
| 3         | 3           | 3            | 3        | 300.00       |
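With the surrogate keys in place, a typical star join connects the fact table to its dimensions on those keys and then aggregates. A sketch against the tables above:

```sql
-- Sales by region and month: join the fact table to two dimensions
-- on their surrogate keys, then aggregate the measure.
SELECT d.month, c.region, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.month, c.region;
```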

### Explanation of Each Line


1. **"Surrogate keys are used to implement the primary keys of almost all dimension tables."**
   - In the dimension tables, surrogate keys (`product_key`, `customer_key`, `date_key`) are unique identifiers for each record. They are typically sequential integers.
   - **Example:**

   | product_key | product_name | category |
   |-------------|--------------|----------|
   | 1           | Widget A     | Gadgets  |
   | 2           | Widget B     | Gadgets  |
   | 3           | Gizmo A      | Gizmos   |

2. **"In addition, single column surrogate fact keys can be useful, albeit not required."**
   - The fact table (`fact_sales`) includes a single column surrogate key (`sales_key`) which uniquely identifies each record in the fact table.
   - **Example:**

   | sales_key | product_key | customer_key | date_key | sales_amount |
   |-----------|-------------|--------------|----------|--------------|
   | 1         | 1           | 1            | 1        | 100.00       |

3. **"Fact table surrogate keys, which are not associated with any dimension, are assigned sequentially during the ETL load process and are used 1) as the single column primary key of the fact table;"**
   - The `sales_key` in the `fact_sales` table is a sequential surrogate key and serves as the primary key for the fact table.
   - **Example:**

   | sales_key | product_key | customer_key | date_key | sales_amount |
   |-----------|-------------|--------------|----------|--------------|
   | 1         | 1           | 1            | 1        | 100.00       |

4. **"2) to serve as an immediate identifier of a fact table row without navigating multiple dimensions for ETL purposes;"**
   - The `sales_key` allows easy identification of each row in the `fact_sales` table without needing to look up the dimension keys.
   - **Example:**

   | sales_key | sales_amount |
   |-----------|--------------|
   | 1         | 100.00       |

5. **"3) to allow an interrupted load process to either back out or resume;"**

### ETL Load Process Steps

1. **ETL Process Starts:**
   - New records to be loaded:

   | product_key | customer_key | date_key | sales_amount |
   |-------------|--------------|----------|--------------|
   | 1           | 2            | 1        | 150.00       |
   | 2           | 1            | 2        | 250.00       |
   | 3           | 3            | 3        | 350.00       |

2. **Assign Surrogate Keys:**
   - Sequentially assign surrogate keys (`sales_key`) to the new records.
   - New records with surrogate keys:

   | sales_key | product_key | customer_key | date_key | sales_amount |
   |-----------|-------------|--------------|----------|--------------|
   | 4         | 1           | 2            | 1        | 150.00       |
   | 5         | 2           | 1            | 2        | 250.00       |
   | 6         | 3           | 3            | 3        | 350.00       |

3. **ETL Process Interrupted:**
   - Assume the process is interrupted after loading the first new record (`sales_key` 4).
   - Fact table after interruption:

   | sales_key | product_key | customer_key | date_key | sales_amount |
   |-----------|-------------|--------------|----------|--------------|
   | 1         | 1           | 1            | 1        | 100.00       |
   | 2         | 2           | 2            | 2        | 200.00       |
   | 3         | 3           | 3            | 3        | 300.00       |
   | 4         | 1           | 2            | 1        | 150.00       |

### Backing Out or Resuming the ETL Process

**Backing Out:**
- To back out the changes, remove the partially loaded records.
- In this case, delete the row with `sales_key` 4.
- Fact table after backing out:

| sales_key | product_key | customer_key | date_key | sales_amount |
|-----------|-------------|--------------|----------|--------------|
| 1         | 1           | 1            | 1        | 100.00       |
| 2         | 2           | 2            | 2        | 200.00       |
| 3         | 3           | 3            | 3        | 300.00       |

**Resuming:**
- To resume the ETL process, start loading from where it was interrupted.
- Check the last successfully loaded surrogate key (`sales_key` 4) and continue with the next records.
- Fact table after resuming:

| sales_key | product_key | customer_key | date_key | sales_amount |
|-----------|-------------|--------------|----------|--------------|
| 1         | 1           | 1            | 1        | 100.00       |
| 2         | 2           | 2            | 2        | 200.00       |
| 3         | 3           | 3            | 3        | 300.00       |
| 4         | 1           | 2            | 1        | 150.00       |
| 5         | 2           | 1            | 2        | 250.00       |
| 6         | 3           | 3            | 3        | 350.00       |
6. **"4) to allow fact table update operations to be decomposed into less risky inserts
plus deletes."**
- By using the `sales_key`, updates to the fact table can be handled by inserting a
new row and then deleting the old row, reducing the risk of data inconsistencies.
- **Example:**
- To update a sales record, insert a new row with `sales_key` 4 and then delete the
old row with `sales_key` 1.
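Point 4 can be sketched directly in SQL: instead of an in-place `UPDATE`, write the corrected row and then remove the superseded one. The key values below follow the document's example and are illustrative only:

```sql
-- Decompose an update into an insert plus a delete.
INSERT INTO fact_sales (sales_key, product_key, customer_key, date_key, sales_amount)
VALUES (4, 1, 1, 1, 150.00);  -- corrected version of the row

DELETE FROM fact_sales
WHERE sales_key = 1;          -- remove the superseded row
```

If the load fails between the two statements, the surrogate key makes it easy to see which half completed, so the operation can be backed out or finished rather than leaving the table in an ambiguous state.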

### Summary Table

Here's a summarized table to illustrate the use of surrogate keys in the fact table:

**Fact Table (`fact_sales`):**


| sales_key | product_key | customer_key | date_key | sales_amount |
|-----------|-------------|--------------|----------|--------------|
| 1         | 1           | 1            | 1        | 100.00       |
| 2         | 2           | 2            | 2        | 200.00       |
| 3         | 3           | 3            | 3        | 300.00       |
| 4         | 1           | 1            | 1        | 150.00       | (updated record for `sales_key` 1)

By using surrogate keys, we can manage the fact table more efficiently and ensure data integrity during ETL operations.

Designers sometimes create separate dimensions for each level of a hierarchy (e.g.,
date, month, quarter, year) and include all these foreign keys in a fact table, resulting
in a "centipede fact table" with many hierarchically related dimensions. This approach
should be avoided. Instead, these dimensions should be collapsed to their unique
lowest grain (e.g., date). Centipede fact tables can also occur when designers embed
multiple foreign keys for low-cardinality dimensions instead of using a junk
dimension.

To explain the distinctions made in the text using a comprehensive tabular data
example, let's consider a hypothetical sales database with the following components:
Fact Table (Sales_Fact) and Dimension Tables (Product_Dim and Date_Dim).

### 1. Fact Table (Sales_Fact)

| Sale_ID | Date_Key | Product_Key | Quantity | Sale_Amount | Discount_Amt | On_Time_Delivery_Score |
|---------|----------|-------------|----------|-------------|--------------|------------------------|
| 1       | 20210701 | 101         | 2        | 200.00      | 20.00        | 0.95                   |
| 2       | 20210702 | 102         | 1        | 150.00      | 15.00        | 0.85                   |
| 3       | 20210703 | 101         | 3        | 300.00      | 30.00        | 0.90                   |

### 2. Dimension Table (Product_Dim)

| Product_Key | Product_Name | Category | Standard_List_Price | Price_Band | Qualitative_Description |
|-------------|--------------|----------|---------------------|------------|-------------------------|
| 101         | Widget A     | Widgets  | 100.00              | $50-100    | High Quality            |
| 102         | Gadget B     | Gadgets  | 150.00              | $100-150   | Premium Quality         |
| 103         | Tool C       | Tools    | 75.00               | $50-100    | Standard Quality        |

### 3. Dimension Table (Date_Dim)

| Date_Key | Date       | Month | Quarter | Year |
|----------|------------|-------|---------|------|
| 20210701 | 2021-07-01 | July  | Q3      | 2021 |
| 20210702 | 2021-07-02 | July  | Q3      | 2021 |
| 20210703 | 2021-07-03 | July  | Q3      | 2021 |

### Explanation

| Point | Explanation | Example |
|-------|-------------|---------|
| Numeric values that don’t clearly fall into fact or dimension attribute categories. | Some numeric values are ambiguous, making it difficult to decide whether they belong in a fact or a dimension table. | The standard list price of a product can be such a value. |
| A numeric value used primarily for calculation purposes belongs in the fact table. | If a numeric value is mainly used for calculations like aggregations or arithmetic operations, it should be part of the fact table. | `Sale_Amount` in the Sales_Fact table is used for calculations (e.g., total sales), so it is a fact. |
| A stable numeric value used predominantly for filtering and grouping should be a dimension attribute. | If a numeric value is stable and mainly used for filtering, grouping, or categorization, it should be part of a dimension table. | `Standard_List_Price` in the Product_Dim table is used to filter and group products based on their price. |
| Numeric values can be supplemented with value band attributes. | For better categorization, numeric values can be grouped into bands or ranges, which can be useful for filtering and grouping. | `Standard_List_Price` is supplemented with `Price_Band` (e.g., $0-50, $50-100) in the Product_Dim table. |
| Modeling a numeric value as both a fact and a dimension attribute. | In some scenarios, it might be useful to model a numeric value in both the fact and dimension tables to provide different perspectives for analysis. | `On_Time_Delivery_Score` is included in the Sales_Fact table for quantitative analysis and also described qualitatively in the Product_Dim table with `Qualitative_Description` (e.g., High Quality, Premium Quality). |

By including `Standard_List_Price` in both the `Product_Dim` table (as a dimension attribute) and supplementing it with a `Price_Band`, you can easily filter and group products based on their price bands. Similarly, `On_Time_Delivery_Score` is a fact used for quantitative analysis in the fact table, while the qualitative description provides a textual context for these scores in the dimension table.

Sure, here is a comprehensive tabular data example to illustrate how `Standard_List_Price` can be included in the `Product_Dim` table and supplemented with a `Price_Band` for easy filtering and grouping:

### 1. Fact Table (Sales_Fact)

| Sale_ID | Date_Key | Product_Key | Quantity | Sale_Amount |
|---------|----------|-------------|----------|-------------|
| 1       | 20210701 | 101         | 2        | 200.00      |
| 2       | 20210702 | 102         | 1        | 150.00      |
| 3       | 20210703 | 103         | 3        | 225.00      |

### 2. Dimension Table (Product_Dim)

| Product_Key | Product_Name | Category | Standard_List_Price | Price_Band |
|-------------|--------------|----------|---------------------|------------|
| 101         | Widget A     | Widgets  | 100.00              | $50-100    |
| 102         | Gadget B     | Gadgets  | 150.00              | $100-150   |
| 103         | Tool C       | Tools    | 75.00               | $50-100    |

### Example Queries

#### Query 1: Filter Products by Price Band

You can filter products by their `Price_Band` to group them into specific price ranges.

```sql
SELECT Product_Name, Category, Standard_List_Price, Price_Band
FROM Product_Dim
WHERE Price_Band = '$50-100';
```

**Result:**

| Product_Name | Category | Standard_List_Price | Price_Band |
|--------------|----------|---------------------|------------|
| Widget A     | Widgets  | 100.00              | $50-100    |
| Tool C       | Tools    | 75.00               | $50-100    |

#### Query 2: Group Sales by Product Price Band

You can also group sales by the `Price_Band` of the products sold.

```sql
SELECT P.Price_Band, SUM(F.Sale_Amount) AS Total_Sales
FROM Sales_Fact F
JOIN Product_Dim P ON F.Product_Key = P.Product_Key
GROUP BY P.Price_Band;
```

**Result:**

| Price_Band | Total_Sales |
|------------|-------------|
| $50-100 | 425.00 |
| $100-150 | 150.00 |

This example demonstrates how including `Standard_List_Price` and a supplementary `Price_Band` in the `Product_Dim` table facilitates easy filtering and grouping of products and sales data based on price ranges.

Let's break down the explanation with a tabular data example to illustrate how
accumulating snapshot fact tables work, how time lags are calculated, and how they
can simplify analysis.

### Explanation Breakdown

1. **Accumulating Snapshot Fact Tables:**
   - These tables capture multiple process milestones.
   - Each milestone has a date foreign key (FK) and possibly a date/time stamp.

2. **Business User Analysis:**
   - Users often analyze the lags or durations between milestones.
   - Lags can be simple differences between dates or based on complex business rules.
3. **Complexity with Dozens of Steps:**
- With many steps, there could be hundreds of possible lags.
- Calculating each lag from date/time stamps in a query can be cumbersome.

4. **Simplified Approach:**
   - Store just one time lag for each step measured against the process’s start point.
   - Calculate any lag between two steps as a simple subtraction of the stored lags.

### Tabular Data Example

Consider a sales order process with the following steps:

1. Order Placed
2. Order Approved
3. Order Packed
4. Order Shipped
5. Order Delivered

#### Fact Table Structure

| OrderID | StartDate  | OrderPlacedDate | OrderApprovedDate | OrderPackedDate | OrderShippedDate | OrderDeliveredDate | LagOrderPlaced | LagOrderApproved | LagOrderPacked | LagOrderShipped | LagOrderDelivered |
|---------|------------|-----------------|-------------------|-----------------|------------------|--------------------|----------------|------------------|----------------|-----------------|-------------------|
| 1001    | 2024-01-01 | 2024-01-01      | 2024-01-02        | 2024-01-03      | 2024-01-04       | 2024-01-05         | 0              | 1                | 2              | 3               | 4                 |
| 1002    | 2024-01-05 | 2024-01-05      | 2024-01-06        | 2024-01-07      | 2024-01-08       | 2024-01-09         | 0              | 1                | 2              | 3               | 4                 |

#### Explanation with Example Data

1. **OrderID:** Unique identifier for each order.
2. **StartDate:** The date when the process started (Order Placed).
3. **OrderPlacedDate to OrderDeliveredDate:** Dates for each milestone in the process.
4. **Lag Columns (LagOrderPlaced to LagOrderDelivered):**
- These columns store the time lag in days from the StartDate to each milestone.
- LagOrderPlaced is 0 because it is the start date.
- LagOrderApproved is 1, indicating the order was approved 1 day after it was
placed.
- LagOrderPacked is 2, indicating the order was packed 2 days after it was placed.
- LagOrderShipped is 3, indicating the order was shipped 3 days after it was placed.
- LagOrderDelivered is 4, indicating the order was delivered 4 days after it was
placed.

### Calculating Lags Between Steps

To find the lag between two steps, subtract the corresponding lag values.
- **Lag from Order Placed to Order Approved:**
  - LagOrderApproved - LagOrderPlaced = 1 - 0 = 1 day
- **Lag from Order Approved to Order Packed:**
  - LagOrderPacked - LagOrderApproved = 2 - 1 = 1 day
- **Lag from Order Placed to Order Delivered:**
  - LagOrderDelivered - LagOrderPlaced = 4 - 0 = 4 days

By storing these lags, you simplify the process of calculating any lag between steps,
making the data more accessible and easier to analyze for business users.
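Because every lag is stored against the common start point, any step-to-step duration is just a subtraction in the `SELECT` list. A sketch, assuming the fact table above is named `Order_Fact` (a hypothetical name):

```sql
-- Duration from approval to shipment, computed from the stored lags
-- rather than from the raw milestone dates.
SELECT OrderID,
       LagOrderShipped - LagOrderApproved AS Approved_To_Shipped_Days
FROM Order_Fact;
```

This is the payoff of the design: with N milestones you store only N lags yet can derive any of the N×(N-1)/2 possible step-to-step durations on demand.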

Certainly! Let's break down the concept of operational transaction systems with
header and line schemas through a comprehensive tabular example.

### Concept Explanation

1. **Header Table**: This fact table contains information about the overall transaction. It usually includes fields like transaction ID, date, customer information, and other high-level details.
2. **Line Table**: This fact table contains information about individual items or services within the transaction. Each row in this table is associated with a specific transaction ID from the header table and includes details like product ID, quantity, price, and other item-specific information.
3. **Degenerate Dimensions**: These are dimensions that do not have their own dimension tables and exist in the fact table as identifiers, like transaction number or order number.

### Example Schema

#### Header Table (Transaction_Header)


| Transaction_ID | Transaction_Date | Customer_ID | Payment_Type | Total_Amount |
|----------------|------------------|-------------|--------------|--------------|
| 1001 | 2023-07-01 | C001 | Credit Card | 150.00 |
| 1002 | 2023-07-02 | C002 | Cash | 200.00 |

#### Line Table (Transaction_Line)

| Line_ID | Transaction_ID | Product_ID | Quantity | Price | Discount | Total_Line_Amount | Customer_ID | Payment_Type | Transaction_Date | Total_Amount |
|---------|----------------|------------|----------|-------|----------|-------------------|-------------|--------------|------------------|--------------|
| 1       | 1001           | P001       | 2        | 30.00 | 0.00     | 60.00             | C001        | Credit Card  | 2023-07-01       | 150.00       |
| 2       | 1001           | P002       | 1        | 90.00 | 0.00     | 90.00             | C001        | Credit Card  | 2023-07-01       | 150.00       |
| 3       | 1002           | P003       | 4        | 50.00 | 0.00     | 200.00            | C002        | Cash         | 2023-07-02       | 200.00       |

### Explanation

- **Transaction_ID**: The primary key in the Header Table and a foreign key in the
Line Table, linking each line item to its corresponding transaction.
- **Transaction_Date**: This dimension is repeated in the Line Table to provide
context for each line item.
- **Customer_ID**: While usually a foreign key to a Customer Dimension table, it is
included in the Line Table to provide context.
- **Payment_Type**: Another header-level dimension repeated in the Line Table.
- **Total_Amount**: The total transaction amount is included in the Line Table for
each line item for easy aggregation and analysis.

### Comprehensive Data Flow

1. **Header Table**:
- Transaction 1001:
- Date: 2023-07-01
- Customer: C001
- Payment: Credit Card
- Total Amount: 150.00
- Transaction 1002:
- Date: 2023-07-02
- Customer: C002
- Payment: Cash
- Total Amount: 200.00

2. **Line Table**:
- Line 1 of Transaction 1001:
- Product: P001
- Quantity: 2
- Price: 30.00
- Line Total: 60.00
- Header-level dimensions: Date, Customer, Payment, Total Amount are repeated.
- Line 2 of Transaction 1001:
- Product: P002
- Quantity: 1
- Price: 90.00
- Line Total: 90.00
- Header-level dimensions: Date, Customer, Payment, Total Amount are repeated.
- Line 1 of Transaction 1002:
- Product: P003
- Quantity: 4
- Price: 50.00
- Line Total: 200.00
- Header-level dimensions: Date, Customer, Payment, Total Amount are repeated.

By including the header-level dimensions and degenerate dimensions in the Line
Table, querying and analysis become more efficient, as all necessary information is
available at the line level, avoiding the need for complex joins.
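To illustrate why repeating the header-level dimensions pays off, here is a small sketch using Python's built-in `sqlite3` (table and column names follow the example above, with the discount column omitted for brevity) that answers a header-level question directly from the line table, with no join:

```python
import sqlite3

# In-memory sketch of the Transaction_Line table from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Transaction_Line (
        Line_ID INTEGER, Transaction_ID INTEGER, Product_ID TEXT,
        Quantity INTEGER, Price REAL, Total_Line_Amount REAL,
        Customer_ID TEXT, Payment_Type TEXT, Transaction_Date TEXT)
""")
conn.executemany(
    "INSERT INTO Transaction_Line VALUES (?,?,?,?,?,?,?,?,?)",
    [(1, 1001, "P001", 2, 30.0, 60.0, "C001", "Credit Card", "2023-07-01"),
     (2, 1001, "P002", 1, 90.0, 90.0, "C001", "Credit Card", "2023-07-01"),
     (3, 1002, "P003", 4, 50.0, 200.0, "C002", "Cash", "2023-07-02")])

# Payment_Type is a header-level dimension repeated on every line,
# so this header-level question needs no join back to Transaction_Header.
rows = conn.execute("""
    SELECT Payment_Type, SUM(Total_Line_Amount)
    FROM Transaction_Line
    GROUP BY Payment_Type
    ORDER BY Payment_Type
""").fetchall()
print(rows)  # [('Cash', 200.0), ('Credit Card', 150.0)]
```

The same query against a normalized design would require a join from the line table to the header table before grouping.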

The Transaction_ID is the degenerate dimension in the provided example because it
is a transactional identifier that resides in the fact table without having its own
dimension table.

The determination of a header freight charge can vary based on several factors. These
factors might include the policies of the shipping company, the total weight or volume
of the shipment, the distance the shipment must travel, and any special handling
requirements.

In the context of the line "A 'header freight charge' is a cost applied to the entire
transaction rather than individual items," the distinction between the entire transaction
and individual items can be clarified with an example:

### Explanation

- **Entire Transaction:** This refers to the overall transaction, which could be an
invoice, order, or shipment. It encompasses all the items included in that single
transaction.

- **Individual Items:** These are the specific products or services listed within the
transaction. Each item is a line in the transaction that details the quantity, price, and
other item-specific information.

### Example

#### Entire Transaction (Header-Level Information)


Consider an invoice (INV001) for a customer that includes multiple items. The
header-level information includes details that apply to the whole invoice, such as the
total freight charge for shipping all items in the invoice.

| InvoiceID | Date | CustomerID | HeaderFreightCharge |
|-----------|------------|------------|---------------------|
| INV001 | 2024-07-01 | CUST001 | 20.00 |

In this example, the freight charge of $20.00 is a cost applied to the entire invoice, not
to any specific item within the invoice.
#### Individual Items (Line-Level Information)
The same invoice (INV001) includes multiple items. Each item has its own line in the
transaction.

| InvoiceID | LineID | ProductID | Quantity | UnitPrice | LineTotal |
|-----------|--------|-----------|----------|-----------|-----------|
| INV001 | 1 | PROD001 | 2 | 10.00 | 20.00 |
| INV001 | 2 | PROD002 | 1 | 5.00 | 5.00 |

In this case, the line items are:


- 2 units of PROD001 at $10.00 each, totaling $20.00.
- 1 unit of PROD002 at $5.00 each, totaling $5.00.

#### Distinction
- **Entire Transaction (Header Level):** The $20.00 freight charge is related to the
whole invoice (INV001). It does not vary based on the specific products or quantities.
- **Individual Items (Line Level):** Each product (PROD001, PROD002) has its
own cost and quantity details.

Allocating the $20.00 freight charge from the header to the line items allows for a
more granular analysis. For instance, if the freight charge is allocated proportionally
to the line totals, each line item will carry a portion of the $20.00 based on its relative
cost in the invoice.

### Allocated Freight Example


Using the previous allocation method:

For Invoice INV001:


- Total Line Amount: $20.00 (PROD001) + $5.00 (PROD002) = $25.00
- Allocation for PROD001: \( \frac{20.00}{25.00} \times 20.00 = 16.00 \)
- Allocation for PROD002: \( \frac{5.00}{25.00} \times 20.00 = 4.00 \)

| InvoiceID | LineID | ProductID | Quantity | UnitPrice | LineTotal | AllocatedFreight |
|-----------|--------|-----------|----------|-----------|-----------|------------------|
| INV001 | 1 | PROD001 | 2 | 10.00 | 20.00 | 16.00 |
| INV001 | 2 | PROD002 | 1 | 5.00 | 5.00 | 4.00 |

By allocating the header freight charge to the line items, you can analyze costs at the
item level, which is useful for detailed reporting and analysis.
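The proportional allocation above can be sketched as a small Python function; `allocate_freight` is an illustrative helper, and real business rules may allocate by weight, volume, or other drivers instead:

```python
def allocate_freight(line_totals, freight_charge):
    """Spread a header-level freight charge over the lines in proportion
    to each line's share of the invoice's total line amount."""
    invoice_total = sum(line_totals.values())
    return {line: total / invoice_total * freight_charge
            for line, total in line_totals.items()}

# Invoice INV001: a $20.00 header freight charge over two lines.
print(allocate_freight({"PROD001": 20.00, "PROD002": 5.00}, 20.00))
# {'PROD001': 16.0, 'PROD002': 4.0}
```

Note that the allocated amounts always sum back to the original header charge, which keeps header-level and line-level reporting consistent.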

Certainly! Let's break down the provided statement line by line, using a
comprehensive tabular data example to illustrate each concept.

### Statement Breakdown and Explanation


1. **Statement:** "It is quite common in header/line transaction data to encounter
facts of differing granularity, such as a header freight charge."
- **Explanation:** In transaction data, "header" refers to the overall transaction
(e.g., an invoice), and "line" refers to the individual items within that transaction. A
"header freight charge" is a cost applied to the entire transaction rather than individual
items.
- **Example:**

| InvoiceID | Date | CustomerID | HeaderFreightCharge |
|-----------|------------|------------|---------------------|
| INV001 | 2024-07-01 | CUST001 | 20.00 |
| INV002 | 2024-07-02 | CUST002 | 15.00 |

| InvoiceID | LineID | ProductID | Quantity | UnitPrice | LineTotal |
|-----------|--------|-----------|----------|-----------|-----------|
| INV001 | 1 | PROD001 | 2 | 10.00 | 20.00 |
| INV001 | 2 | PROD002 | 1 | 5.00 | 5.00 |
| INV002 | 1 | PROD001 | 1 | 10.00 | 10.00 |
| INV002 | 2 | PROD003 | 2 | 7.50 | 15.00 |

2. **Statement:** "You should strive to allocate the header facts down to the line
level based on rules provided by the business, so the allocated facts can be sliced and
rolled up by all the dimensions."
- **Explanation:** Allocating header-level costs to line items makes it possible to
analyze costs at the item level. Business rules might specify how to distribute these
costs, such as based on the proportion of line totals.
- **Example with Allocation:**

Let's allocate the header freight charge based on the proportion of each line's total to
the invoice total.

For Invoice INV001:


- Total Line Amount: 20.00 (PROD001) + 5.00 (PROD002) = 25.00
- Allocation for PROD001: \( \frac{20.00}{25.00} \times 20.00 = 16.00 \)
- Allocation for PROD002: \( \frac{5.00}{25.00} \times 20.00 = 4.00 \)

For Invoice INV002:


- Total Line Amount: 10.00 (PROD001) + 15.00 (PROD003) = 25.00
- Allocation for PROD001: \( \frac{10.00}{25.00} \times 15.00 = 6.00 \)
- Allocation for PROD003: \( \frac{15.00}{25.00} \times 15.00 = 9.00 \)

| InvoiceID | LineID | ProductID | Quantity | UnitPrice | LineTotal | AllocatedFreight |
|-----------|--------|-----------|----------|-----------|-----------|------------------|
| INV001 | 1 | PROD001 | 2 | 10.00 | 20.00 | 16.00 |
| INV001 | 2 | PROD002 | 1 | 5.00 | 5.00 | 4.00 |
| INV002 | 1 | PROD001 | 1 | 10.00 | 10.00 | 6.00 |
| INV002 | 2 | PROD003 | 2 | 7.50 | 15.00 | 9.00 |
3. **Statement:** "In many cases, you can avoid creating a header-level fact table,
unless this aggregation delivers query performance advantages."
- **Explanation:** Instead of maintaining separate tables for header and line facts,
allocate the header facts to line items to simplify querying and ensure consistency
across dimensions. However, if query performance significantly improves with a
header-level fact table, it might be worth keeping.
- **Example Without Header-Level Fact Table:**

| InvoiceID | LineID | ProductID | Quantity | UnitPrice | LineTotal | AllocatedFreight |
|-----------|--------|-----------|----------|-----------|-----------|------------------|
| INV001 | 1 | PROD001 | 2 | 10.00 | 20.00 | 16.00 |
| INV001 | 2 | PROD002 | 1 | 5.00 | 5.00 | 4.00 |
| INV002 | 1 | PROD001 | 1 | 10.00 | 10.00 | 6.00 |
| INV002 | 2 | PROD003 | 2 | 7.50 | 15.00 | 9.00 |

- **Example With Header-Level Fact Table for Performance:**

| InvoiceID | Date | CustomerID | HeaderFreightCharge |
|-----------|------------|------------|---------------------|
| INV001 | 2024-07-01 | CUST001 | 20.00 |
| INV002 | 2024-07-02 | CUST002 | 15.00 |

| InvoiceID | LineID | ProductID | Quantity | UnitPrice | LineTotal | AllocatedFreight |
|-----------|--------|-----------|----------|-----------|-----------|------------------|
| INV001 | 1 | PROD001 | 2 | 10.00 | 20.00 | 16.00 |
| INV001 | 2 | PROD002 | 1 | 5.00 | 5.00 | 4.00 |
| INV002 | 1 | PROD001 | 1 | 10.00 | 10.00 | 6.00 |
| INV002 | 2 | PROD003 | 2 | 7.50 | 15.00 | 9.00 |

By following these steps, the transaction data can be analyzed more effectively at
various levels of granularity.

Sure, let's break down each part of the explanation with a comprehensive tabular data
example. We'll use a simple scenario where a company sells three products: A, B, and
C, through two channels: Online and Retail. We'll illustrate the components involved
in the profit equation (revenue - costs = profit) and show how these can be rolled up to
analyze different profitability aspects.

### 1. The Equation of Profit: `(Revenue) – (Costs) = (Profit)`

**Table: Profit Equation Example**

| Transaction ID | Product | Channel | Revenue | Cost | Profit |
|----------------|---------|---------|---------|------|--------|
| 1 | A | Online | 100 | 60 | 40 |
| 2 | B | Retail | 150 | 90 | 60 |
| 3 | A | Retail | 200 | 120 | 80 |
| 4 | C | Online | 50 | 30 | 20 |
| 5 | B | Online | 100 | 70 | 30 |
| 6 | C | Retail | 70 | 50 | 20 |

- **Revenue**: The income generated from selling products.
- **Cost**: The expenditure incurred to produce/sell the products.
- **Profit**: The difference between revenue and costs.

### 2. Implementing Profit Equation at the Atomic Grain

**Atomic Grain**: Each row in the table represents an individual transaction.

### 3. Numerous Rollups Are Possible

**Customer Profitability, Product Profitability, Promotion Profitability, Channel
Profitability, and Others**

#### Example Rollups:

**Product Profitability**

| Product | Total Revenue | Total Cost | Total Profit |
|---------|---------------|------------|--------------|
| A | 300 | 180 | 120 |
| B | 250 | 160 | 90 |
| C | 120 | 80 | 40 |

**Channel Profitability**

| Channel | Total Revenue | Total Cost | Total Profit |
|---------|---------------|------------|--------------|
| Online | 250 | 160 | 90 |
| Retail | 420 | 260 | 160 |
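The rollups can be reproduced by recomputing totals directly from the atomic-grain rows. The following Python sketch takes its data from the Profit Equation table above and rolls revenue, cost, and profit up by either position of the grain:

```python
from collections import defaultdict

# Atomic-grain rows from the Profit Equation table:
# (product, channel, revenue, cost).
transactions = [
    ("A", "Online", 100, 60), ("B", "Retail", 150, 90),
    ("A", "Retail", 200, 120), ("C", "Online", 50, 30),
    ("B", "Online", 100, 70), ("C", "Retail", 70, 50),
]

def rollup(key_index):
    """Roll [revenue, cost, profit] up to any dimension of the grain
    (0 = product, 1 = channel)."""
    totals = defaultdict(lambda: [0, 0, 0])
    for row in transactions:
        revenue, cost = row[2], row[3]
        totals[row[key_index]][0] += revenue
        totals[row[key_index]][1] += cost
        totals[row[key_index]][2] += revenue - cost
    return dict(totals)

print(rollup(0))  # product profitability, e.g. 'A': [300, 180, 120]
print(rollup(1))  # channel profitability
```

Because profit is stored (or derived) at the atomic grain, the same rows support every rollup without any additional fact tables.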

### 4. Allocating Cost Components to the Fact Table's Grain

This involves breaking down the costs to match the atomic grain of revenue
transactions. This step can be complex and politically sensitive as it often requires
support from high-level executives to ensure accuracy and acceptance.

**Table: Allocated Costs Example**

| Transaction ID | Product | Channel | Revenue | Raw Material Cost | Labor Cost | Shipping Cost | Total Cost | Profit |
|----------------|---------|---------|---------|-------------------|------------|---------------|------------|--------|
| 1 | A | Online | 100 | 40 | 15 | 5 | 60 | 40 |
| 2 | B | Retail | 150 | 60 | 20 | 10 | 90 | 60 |
| 3 | A | Retail | 200 | 80 | 30 | 10 | 120 | 80 |
| 4 | C | Online | 50 | 20 | 8 | 2 | 30 | 20 |
| 5 | B | Online | 100 | 50 | 15 | 5 | 70 | 30 |
| 6 | C | Retail | 70 | 30 | 15 | 5 | 50 | 20 |

### 5. Complex ETL Subsystem

The process of breaking down and allocating costs requires complex ETL (Extract,
Transform, Load) processes.

**ETL Steps Example:**


1. **Extract**: Collect raw data from various sources.
2. **Transform**: Allocate costs to each transaction.
3. **Load**: Load the transformed data into the fact table.

### 6. Political and Executive Support

This step often involves aligning various departments and ensuring that all
stakeholders agree on the cost allocation methods and data accuracy. High-level
executive support is crucial for the successful implementation of these fact tables.

### 7. Profit and Loss Fact Tables Are Not Early Implementation Phases

Due to their complexity, profit and loss fact tables are typically implemented in later
phases of a DW/BI program after simpler and more straightforward tables have been
successfully deployed.

In summary, building fact tables that expose the full profit equation involves detailed
data allocation and transformation, significant ETL work, and political navigation,
making them challenging but extremely valuable for comprehensive business
analysis.

Sure, let's break down each line of the description with an example using a
comprehensive tabular data structure.

### Explanation and Example

- For every financial fact (e.g., sales amount, cost amount) in the table, there should
be two columns: one for the actual transaction currency and one for the standard
currency (e.g., USD).

Example:
| Transaction ID | Sales Amount (Local) | Sales Amount (USD) |
|----------------|----------------------|--------------------|
| 1 | 100 EUR | 110 USD |
| 2 | 200 JPY | 1.80 USD |
| 3 | 50 GBP | 65 USD |

- "Sales Amount (Local)" represents the sales amount in the currency in which the
transaction was made.
- "Sales Amount (USD)" represents the sales amount converted to a standard
currency (USD in this case).

Example:
| Transaction ID | Sales Amount (Local) | Sales Amount (USD) |
|----------------|----------------------|---------------------|
|1 | 100 EUR | 110 USD |
|2 | 200 JPY | 1.80 USD |
|3 | 50 GBP | 65 USD |

- During the ETL (Extract, Transform, Load) process, the local currency values are
converted to the standard currency (USD) based on predefined business rules and
exchange rates.

Example:
| Transaction ID | Sales Amount (Local) | Sales Amount (USD) | Conversion Rate |
|----------------|----------------------|---------------------|-----------------|
|1 | 100 EUR | 110 USD | 1.10 |
|2 | 200 JPY | 1.80 USD | 0.009 |
|3 | 50 GBP | 65 USD | 1.30 |

- A separate dimension table (Currency Dimension) is used to store details about
each currency.

Fact Table Example:


| Transaction ID | Sales Amount (Local) | Sales Amount (USD) | Currency Key |
|----------------|----------------------|---------------------|--------------|
|1 | 100 EUR | 110 USD |1 |
|2 | 200 JPY | 1.80 USD |2 |
|3 | 50 GBP | 65 USD |3 |

Currency Dimension Table Example:


| Currency Key | Currency Code | Currency Name |
|--------------|---------------|----------------|
|1 | EUR | Euro |
|2 | JPY | Japanese Yen |
|3 | GBP | British Pound |

### Comprehensive Example


Combining all the elements into one example:
**Fact Table:**
| Transaction ID | Sales Amount (Local) | Sales Amount (USD) | Currency Key |
|----------------|----------------------|---------------------|--------------|
|1 | 100 EUR | 110 USD |1 |
|2 | 200 JPY | 1.80 USD |2 |
|3 | 50 GBP | 65 USD |3 |

**Currency Dimension Table:**


| Currency Key | Currency Code | Currency Name |
|--------------|---------------|----------------|
|1 | EUR | Euro |
|2 | JPY | Japanese Yen |
|3 | GBP | British Pound |

This setup allows for accurate financial reporting across multiple currencies by
standardizing financial facts in a single currency (USD) while maintaining the original
transaction currency for reference.
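The ETL-time conversion step can be sketched in a few lines of Python; the exchange rates are the illustrative ones implied by the tables above, not real market rates:

```python
# ETL-time conversion sketch; rates are illustrative, matching the
# example tables above, not real market rates.
rates_to_usd = {"EUR": 1.10, "JPY": 0.009, "GBP": 1.30}

def to_usd(amount, currency):
    """Convert a local-currency fact to the standard currency (USD)."""
    return round(amount * rates_to_usd[currency], 2)

print(to_usd(100, "EUR"))  # 110.0
print(to_usd(200, "JPY"))  # 1.8
print(to_usd(50, "GBP"))   # 65.0
```

In a production pipeline the rate table would itself be dated, so each fact row is converted at the rate in effect on its transaction date.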

Sure, let's break down the explanation and provide a comprehensive tabular data
example for better understanding.

### Explanation of Each Line

1. **Business processes require facts to be stated simultaneously in several units of
measure:**
- Different stakeholders in a business may need to see data in various units. For
example, a warehouse manager might need inventory data in pallets, while a retailer
might need it in individual units.

2. **For example, depending on the perspective of the business user, a supply chain
may need to report the same facts as pallets, ship cases, retail cases, or individual scan
units:**
- This highlights the need to present the same data in different formats depending on
who is using it.

3. **If the fact table contains a large number of facts, each of which must be
expressed in all units of measure, a convenient technique is to store the facts once in
the table at an agreed standard unit of measure:**
- To avoid redundancy and storage issues, data should be stored in a single standard
unit of measure.

4. **But also simultaneously store conversion factors between the standard measure
and all the others:**
- Alongside the data, store conversion factors that can convert the standard measure
to other required units.
5. **This fact table could be deployed through views to each user constituency, using
an appropriate selected conversion factor:**
- Create views for different users that apply the necessary conversion factors to
present data in their required units.

6. **The conversion factors must reside in the underlying fact table row to ensure the
view calculation is simple and correct, while minimizing query complexity:**
- Storing conversion factors in the fact table ensures easy and accurate conversion,
simplifying the queries needed to create the views.

### Tabular Data Example

Let's say we have a fact table that tracks inventory data. We store the data in the
standard unit of "Individual Units."

| Product ID | Standard Units | Conversion Factor to Pallets | Conversion Factor to Ship Cases | Conversion Factor to Retail Cases |
|------------|----------------|------------------------------|---------------------------------|-----------------------------------|
| 1 | 1000 | 0.02 | 0.1 | 0.2 |
| 2 | 5000 | 0.01 | 0.05 | 0.1 |
| 3 | 2000 | 0.04 | 0.2 | 0.25 |

#### Explanation of Table

1. **Product ID:**
- Unique identifier for each product.

2. **Standard Units:**
- The quantity stored in the standard unit of measure (Individual Units).

3. **Conversion Factor to Pallets:**
- Factor to convert standard units to pallets. For Product 1, 1000 individual units
equate to 20 pallets (1000 * 0.02).

4. **Conversion Factor to Ship Cases:**
- Factor to convert standard units to ship cases. For Product 1, 1000 individual units
equate to 100 ship cases (1000 * 0.1).

5. **Conversion Factor to Retail Cases:**
- Factor to convert standard units to retail cases. For Product 1, 1000 individual
units equate to 200 retail cases (1000 * 0.2).

#### Creating Views


Views can be created to display the data in different units using the conversion
factors. For example:

1. **View for Pallets:**

```sql
CREATE VIEW InventoryInPallets AS
SELECT
ProductID,
StandardUnits * ConversionFactorToPallets AS Pallets
FROM FactTable;
```

2. **View for Ship Cases:**

```sql
CREATE VIEW InventoryInShipCases AS
SELECT
ProductID,
StandardUnits * ConversionFactorToShipCases AS ShipCases
FROM FactTable;
```

3. **View for Retail Cases:**

```sql
CREATE VIEW InventoryInRetailCases AS
SELECT
ProductID,
StandardUnits * ConversionFactorToRetailCases AS RetailCases
FROM FactTable;
```

By structuring the fact table and creating views in this manner, different business
users can access the data in the units most relevant to them, while maintaining
simplicity and accuracy in the database structure.
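The pattern can be exercised end to end with Python's built-in `sqlite3`; this is a sketch that mirrors the fact table and the `InventoryInPallets` view above:

```python
import sqlite3

# In-memory sketch of the fact table and one of the views above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE FactTable (
        ProductID INTEGER,
        StandardUnits REAL,
        ConversionFactorToPallets REAL,
        ConversionFactorToShipCases REAL,
        ConversionFactorToRetailCases REAL)
""")
conn.executemany("INSERT INTO FactTable VALUES (?,?,?,?,?)",
                 [(1, 1000, 0.02, 0.1, 0.2),
                  (2, 5000, 0.01, 0.05, 0.1),
                  (3, 2000, 0.04, 0.2, 0.25)])

# The conversion factor sits on the same row as the fact, so the view
# is a simple row-wise multiplication -- no extra join is needed.
conn.execute("""
    CREATE VIEW InventoryInPallets AS
    SELECT ProductID,
           StandardUnits * ConversionFactorToPallets AS Pallets
    FROM FactTable
""")
print(conn.execute("SELECT * FROM InventoryInPallets").fetchall())
```

Keeping the factor on the fact row is the design choice that makes the view a single multiplication; if the factors lived in a separate table, every view would need a join and the risk of picking the wrong factor.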

To provide a comprehensive understanding of the concept, let's break down the
provided statement and explain each part with an example of tabular data.

### Explanation with Example

#### 1. Business users often request year-to-date (YTD) values in a fact table.
**Explanation**: Business users frequently ask for YTD metrics, which represent
cumulative values from the beginning of the year up to the current date.

**Example Fact Table**:


| Date | Product | Sales |
|------------|---------|-------|
| 2024-01-01 | A | 100 |
| 2024-01-02 | A | 150 |
| 2024-01-03 | A | 200 |

In this example, if the current date is January 3, 2024, the YTD Sales for Product A
would be 450 (100 + 150 + 200).

#### 2. It is hard to argue against a single request, but YTD requests can easily morph
into “YTD at the close of the fiscal period” or “fiscal period to date.”
**Explanation**: While it might be manageable to fulfill a single YTD request, users
may soon ask for more specific calculations, such as YTD up to the end of a fiscal
period or values for a custom fiscal period to date.

**Example Fact Table with Fiscal Period**:

| Date | Product | Sales | Fiscal Period |
|------------|---------|-------|---------------|
| 2024-01-01 | A | 100 | 2024 Q1 |
| 2024-01-02 | A | 150 | 2024 Q1 |
| 2024-01-03 | A | 200 | 2024 Q1 |
| 2024-04-01 | A | 250 | 2024 Q2 |
| 2024-04-02 | A | 300 | 2024 Q2 |

For instance, the "Fiscal YTD" for Product A up to the end of Q1 (March 31, 2024)
would be 450, but for Q2 it would need to include additional sales.

#### 3. A more reliable, extensible way to handle these assorted requests is to
calculate the YTD metrics in the BI applications or OLAP cube rather than storing
YTD facts in the fact table.
**Explanation**: Instead of storing YTD values directly in the fact table (which can
lead to data redundancy and complexity), it is better to calculate these metrics
dynamically in Business Intelligence (BI) applications or OLAP cubes. This method
ensures flexibility and adaptability to different reporting requirements.

**Example BI Application Calculation**:

In a BI tool, we can use the following logic to calculate YTD dynamically:

- **YTD Sales** = SUM(Sales) WHERE Date <= CurrentDate
- **Fiscal YTD Sales** = SUM(Sales) WHERE Date <= EndOfFiscalPeriod

**Illustrative Calculation**:

Let's say we are querying data for Product A on April 2, 2024.

- **Calendar YTD Sales** for 2024: 100 + 150 + 200 + 250 + 300 = 1000
- **Fiscal YTD Sales** for Q2 2024: 250 + 300 = 550

By calculating YTD metrics within the BI tool or OLAP cube, users can easily switch
between different periods and tailor the calculations to their specific needs without
altering the underlying fact table.

### Summary

| Date | Product | Sales | Fiscal Period | YTD Sales (Calendar) | YTD Sales (Fiscal) |
|------------|---------|-------|---------------|----------------------|--------------------|
| 2024-01-01 | A | 100 | 2024 Q1 | 100 | 100 |
| 2024-01-02 | A | 150 | 2024 Q1 | 250 | 250 |
| 2024-01-03 | A | 200 | 2024 Q1 | 450 | 450 |
| 2024-04-01 | A | 250 | 2024 Q2 | 700 | 250 |
| 2024-04-02 | A | 300 | 2024 Q2 | 1000 | 550 |

This approach allows for maintaining a clean, manageable fact table while providing
the flexibility to adapt to various business reporting needs through dynamic
calculations in the BI layer.
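The dynamic calculation can be sketched in Python; `calendar_ytd` and `fiscal_ptd` are illustrative helper names for the BI-layer logic, not functions of any specific BI tool:

```python
from datetime import date

# Atomic-grain sales rows from the example: (date, fiscal_period, sales).
sales = [
    (date(2024, 1, 1), "2024 Q1", 100),
    (date(2024, 1, 2), "2024 Q1", 150),
    (date(2024, 1, 3), "2024 Q1", 200),
    (date(2024, 4, 1), "2024 Q2", 250),
    (date(2024, 4, 2), "2024 Q2", 300),
]

def calendar_ytd(as_of):
    """Calendar YTD: all sales in the as-of year, up to the as-of date."""
    return sum(s for d, _, s in sales if d.year == as_of.year and d <= as_of)

def fiscal_ptd(period, as_of):
    """Fiscal period to date: sales in the period, up to the as-of date."""
    return sum(s for d, p, s in sales if p == period and d <= as_of)

print(calendar_ytd(date(2024, 4, 2)))           # 1000
print(fiscal_ptd("2024 Q2", date(2024, 4, 2)))  # 550
```

Because both metrics are derived at query time from the atomic rows, a new variant ("fiscal YTD at period close", say) is another filter, not another stored column.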

Let's break down the provided statement step by step, explaining each part with a
comprehensive tabular data example to illustrate the concept clearly.

### Explanation with Example

#### 1. A BI application must never issue SQL that joins two fact tables together
across the fact table’s foreign keys.
**Explanation**: Directly joining two fact tables (which usually contain transactional
data) using their foreign keys can lead to ambiguous or incorrect results due to the
nature of their many-to-many relationships.

**Example Fact Tables**:

**Shipments Fact Table**:


| ShipmentID | CustomerID | ProductID | ShipmentDate | QuantityShipped |
|------------|------------|-----------|--------------|-----------------|
|1 | 101 | 1001 | 2024-01-01 | 10 |
|2 | 102 | 1002 | 2024-01-02 | 20 |
|3 | 101 | 1003 | 2024-01-03 | 15 |

**Returns Fact Table**:


| ReturnID | CustomerID | ProductID | ReturnDate | QuantityReturned |
|----------|------------|-----------|------------|------------------|
| 1 | 101 | 1001 | 2024-01-04 | 5 |
| 2 | 103 | 1002 | 2024-01-05 | 10 |
| 3 | 101 | 1003 | 2024-01-06 | 8 |

#### 2. It is impossible to control the cardinality of the answer set of such a join in a
relational database, and incorrect results will be returned to the BI tool.
**Explanation**: Joining these tables on `CustomerID` and `ProductID` would result
in a Cartesian product, where each shipment is incorrectly matched with each return,
leading to inflated or misleading results.
**Example Incorrect Join Result**:

| CustomerID | ProductID | QuantityShipped | QuantityReturned |
|------------|-----------|-----------------|------------------|
| 101 | 1001 | 10 | 5 |
| 101 | 1001 | 10 | 8 |
| 101 | 1003 | 15 | 5 |
| 101 | 1003 | 15 | 8 |

#### 3. For instance, if two fact tables contain customer’s product shipments and
returns, these two fact tables must not be joined directly across the customer and
product foreign keys.
**Explanation**: In our example, directly joining the `Shipments` and `Returns`
tables using `CustomerID` and `ProductID` leads to incorrect aggregation and
duplication of data.

#### 4. Instead, the technique of drilling across two fact tables should be used, where
the answer sets from shipments and returns are separately created, and the results sort-
merged on the common row header attribute values to produce the correct result.
**Explanation**: The correct approach involves creating separate result sets for each
fact table and then combining (sort-merging) these results based on common keys
(e.g., `CustomerID` and `ProductID`).

**Step-by-Step Drilling Across Example**:

1. **Query Shipments Table**:

```sql
SELECT CustomerID, ProductID, SUM(QuantityShipped) AS TotalShipped
FROM Shipments
GROUP BY CustomerID, ProductID;
```

| CustomerID | ProductID | TotalShipped |
|------------|-----------|--------------|
| 101 | 1001 | 10 |
| 101 | 1003 | 15 |
| 102 | 1002 | 20 |

2. **Query Returns Table**:


```sql
SELECT CustomerID, ProductID, SUM(QuantityReturned) AS TotalReturned
FROM Returns
GROUP BY CustomerID, ProductID;
```

| CustomerID | ProductID | TotalReturned |
|------------|-----------|---------------|
| 101 | 1001 | 5 |
| 101 | 1003 | 8 |
| 103 | 1002 | 10 |

3. **Sort-Merge the Results**:

| CustomerID | ProductID | TotalShipped | TotalReturned |
|------------|-----------|--------------|---------------|
| 101 | 1001 | 10 | 5 |
| 101 | 1003 | 15 | 8 |
| 102 | 1002 | 20 | NULL |
| 103 | 1002 | NULL | 10 |

By sort-merging the results, we avoid the pitfalls of direct joins and ensure accurate,
meaningful aggregates. This method allows us to handle complex data relationships
effectively without compromising data integrity.
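The drill-across merge itself can be sketched in Python; the two dictionaries stand in for the separately created answer sets from the shipments and returns queries:

```python
# Separately created answer sets, one per fact table, keyed on the shared
# row-header attributes (CustomerID, ProductID).
shipped = {(101, 1001): 10, (101, 1003): 15, (102, 1002): 20}
returned = {(101, 1001): 5, (101, 1003): 8, (103, 1002): 10}

# Drill across: sort-merge the two result sets on the common keys,
# instead of ever joining the two fact tables in SQL.
merged = {key: (shipped.get(key), returned.get(key))
          for key in sorted(set(shipped) | set(returned))}

for (customer, product), (ship_qty, return_qty) in merged.items():
    print(customer, product, ship_qty, return_qty)
# 101 1001 10 5
# 101 1003 15 8
# 102 1002 20 None
# 103 1002 None 10
```

Because each fact table is aggregated before the merge, every key appears at most once per side, so the uncontrolled fan-out of a direct fact-to-fact join cannot occur.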

### Summary

Directly joining two fact tables on their foreign keys can result in incorrect data due to
uncontrolled cardinality. Instead, using the drilling across technique, where each fact
table is queried separately and results are merged based on common keys, ensures
accurate and reliable reporting in BI applications.

Let's break down the provided statement step by step and explain each part with a
comprehensive tabular data example to illustrate the concept clearly.

### Explanation with Example

#### 1. There are three basic fact table grains: transaction, periodic snapshot, and
accumulating snapshot.
**Explanation**: Fact tables can be classified into three types based on their
granularity:
- **Transaction**: Each row represents a single event or transaction.
- **Periodic Snapshot**: Each row captures a summary or status at regular intervals.
- **Accumulating Snapshot**: Each row captures the progress of a process over time,
updating as the process moves through its stages.
**Example Fact Tables**:

**Transaction Fact Table**:


| TransactionID | ProductID | CustomerID | TransactionDate | Quantity | Amount |
|---------------|-----------|------------|-----------------|----------|--------|
|1 | 1001 | 2001 | 2024-01-01 |5 | 100 |
|2 | 1002 | 2002 | 2024-01-02 | 10 | 200 |

**Periodic Snapshot Fact Table**:


| SnapshotDate | ProductID | InventoryLevel |
|--------------|-----------|----------------|
| 2024-01-01 | 1001 | 50 |
| 2024-01-02 | 1001 | 45 |
| 2024-01-01 | 1002 | 30 |
| 2024-01-02 | 1002 | 20 |

**Accumulating Snapshot Fact Table**:


| OrderID | CustomerID | OrderDate | ShipDate | DeliveryDate | ReturnDate | OrderStatus |
|---------|------------|------------|------------|--------------|------------|-------------|
| 3001 | 2001 | 2024-01-01 | 2024-01-03 | 2024-01-05 | NULL | Delivered |
| 3002 | 2002 | 2024-01-02 | 2024-01-04 | 2024-01-06 | 2024-01-10 | Returned |

#### 2. In isolated cases, it is useful to add a row effective date, row expiration date,
and current row indicator to the fact table, much like you do with type 2 slowly
changing dimensions, to capture a timespan when the fact row was effective.
**Explanation**: Adding these columns helps track the duration during which a fact
was valid, similar to Type 2 Slowly Changing Dimensions (SCD), which track
historical data changes.

**Example Fact Table with Effective Dates**:

| ProductID | InventoryLevel | EffectiveDate | ExpirationDate | IsCurrent |
|-----------|----------------|---------------|----------------|-----------|
| 1001 | 50 | 2024-01-01 | 2024-01-02 | No |
| 1001 | 45 | 2024-01-02 | NULL | Yes |
| 1002 | 30 | 2024-01-01 | 2024-01-02 | No |
| 1002 | 20 | 2024-01-02 | NULL | Yes |

#### 3. Although an unusual pattern, this pattern addresses scenarios such as slowly
changing inventory balances where a frequent periodic snapshot would load identical
rows with each snapshot.
**Explanation**: This approach is beneficial in scenarios where changes are
infrequent. Regular snapshots might repeatedly store identical data, resulting in
redundancy. Adding effective dates ensures that only changes are tracked, reducing
data redundancy.
**Example Scenario - Slowly Changing Inventory Balances**:

**Initial Data Capture (2024-01-01)**:

| ProductID | InventoryLevel | EffectiveDate | ExpirationDate | IsCurrent |
|-----------|----------------|---------------|----------------|-----------|
| 1001 | 50 | 2024-01-01 | NULL | Yes |
| 1002 | 30 | 2024-01-01 | NULL | Yes |

**Inventory Update on 2024-01-02**:


| ProductID | InventoryLevel | EffectiveDate | ExpirationDate | IsCurrent |
|-----------|----------------|---------------|----------------|-----------|
| 1001 | 50 | 2024-01-01 | 2024-01-02 | No |
| 1001 | 45 | 2024-01-02 | NULL | Yes |
| 1002 | 30 | 2024-01-01 | 2024-01-02 | No |
| 1002 | 20 | 2024-01-02 | NULL | Yes |

This ensures only meaningful changes are captured, and the history of inventory
levels is accurately maintained without redundancy.

### Summary

| ProductID | InventoryLevel | EffectiveDate | ExpirationDate | IsCurrent |
|-----------|----------------|---------------|----------------|-----------|
| 1001 | 50 | 2024-01-01 | 2024-01-02 | No |
| 1001 | 45 | 2024-01-02 | NULL | Yes |
| 1002 | 30 | 2024-01-01 | 2024-01-02 | No |
| 1002 | 20 | 2024-01-02 | NULL | Yes |

By adding effective dates, expiration dates, and current row indicators to fact tables,
we can manage scenarios where frequent periodic snapshots would otherwise
introduce redundant data, thus ensuring a more efficient and accurate historical
record.
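A sketch of the load logic in Python; `apply_snapshot` is an illustrative helper for the ETL step, not a standard API:

```python
# Load-time sketch: expire the current row and insert a new one only
# when the measured value actually changes.
def apply_snapshot(rows, product_id, level, snapshot_date):
    """rows holds dicts with ProductID, InventoryLevel, EffectiveDate,
    ExpirationDate, and IsCurrent, as in the tables above."""
    current = next((r for r in rows
                    if r["ProductID"] == product_id and r["IsCurrent"]), None)
    if current and current["InventoryLevel"] == level:
        return  # unchanged: no redundant row is written
    if current:
        current["ExpirationDate"] = snapshot_date
        current["IsCurrent"] = False
    rows.append({"ProductID": product_id, "InventoryLevel": level,
                 "EffectiveDate": snapshot_date, "ExpirationDate": None,
                 "IsCurrent": True})

rows = []
apply_snapshot(rows, 1001, 50, "2024-01-01")
apply_snapshot(rows, 1001, 50, "2024-01-02")  # identical level: nothing stored
apply_snapshot(rows, 1001, 45, "2024-01-03")  # change: old row is expired
print(len(rows))  # 2
```

Three daily snapshots produce only two stored rows, exactly because the unchanged middle snapshot is skipped; this mirrors the type 2 SCD expiration pattern applied to a fact table.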

Let's break down the concept of a late arriving fact row with a comprehensive tabular
data example.

### Explanation:

1. **Definition:**
- A fact row is considered late arriving if the most current dimensional context for
new fact rows does not match the incoming row. This occurs when the fact row is
delayed.

2. **Implication:**
- When a fact row is late, the relevant dimensions must be searched to find the
dimension keys that were effective when the late arriving measurement event
occurred.

### Example:

Imagine we have a data warehouse for a retail store. We track sales transactions
(facts) and have dimensions for time, product, and store.

#### Dimensions:

##### Time Dimension:


| Time_Key | Date | Month | Year |
|----------|------------|-------|------|
|1 | 2024-07-01 | 07 | 2024 |
|2 | 2024-07-02 | 07 | 2024 |
|3 | 2024-07-03 | 07 | 2024 |

##### Product Dimension:


| Product_Key | Product_Name | Category |
|-------------|--------------|----------|
| 101 | Widget A | Widgets |
| 102 | Gadget B | Gadgets |

##### Store Dimension:


| Store_Key | Store_Name | Location |
|-----------|-------------|------------|
| 201 | Store X | New York |
| 202 | Store Y | Los Angeles|

#### Fact Table:

##### Sales Fact:


| Sales_Key | Time_Key | Product_Key | Store_Key | Sales_Amount |
|-----------|----------|-------------|-----------|--------------|
| 1001 |1 | 101 | 201 | 500 |
| 1002 |2 | 102 | 202 | 300 |

### Scenario of a Late Arriving Fact Row:

Let's say a sales transaction occurred on July 1, 2024, for Widget A at Store X, but the
data entry for this transaction was delayed, and it only arrives in the system on July 3,
2024.

#### Late Arriving Fact Row:


| Sales_Key | Time_Key | Product_Key | Store_Key | Sales_Amount |
|-----------|----------|-------------|-----------|--------------|
| 1003 |1 | 101 | 201 | 700 |
### Steps to Handle Late Arriving Fact Row:

1. **Identify the Delay:**


- The new fact row arrives on July 3, 2024, but the transaction happened on July 1,
2024.

2. **Search for Relevant Dimensions:**


- **Time Dimension:** Find the key for the date July 1, 2024, which is `Time_Key
= 1`.
- **Product Dimension:** Find the key for Widget A, which is `Product_Key =
101`.
- **Store Dimension:** Find the key for Store X, which is `Store_Key = 201`.

3. **Insert the Late Fact Row:**


- Ensure the fact table reflects the late arriving data correctly by inserting the fact
row with the correct dimensional keys.

#### Updated Fact Table:

| Sales_Key | Time_Key | Product_Key | Store_Key | Sales_Amount |
|-----------|----------|-------------|-----------|--------------|
| 1001      | 1        | 101         | 201       | 500          |
| 1002      | 2        | 102         | 202       | 300          |
| 1003      | 1        | 101         | 201       | 700          |

### Summary:

- A fact row is late arriving if the current dimensional context does not match the
incoming row due to a delay.
- To handle late arriving fact rows, identify the relevant dimensions that were
effective when the event occurred.
- Search the dimension tables for the correct keys and insert the fact row with these
keys to maintain accurate historical data.
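The key-lookup step can be sketched as follows. The in-memory dimension rows and the `resolve_late_fact` helper are hypothetical stand-ins for real dimension-table queries; the point is that keys are resolved against the event date, not the load date:

```python
from datetime import date

# Hypothetical in-memory dimension rows mirroring the example tables above.
time_dim = [
    {"Time_Key": 1, "Date": date(2024, 7, 1)},
    {"Time_Key": 2, "Date": date(2024, 7, 2)},
    {"Time_Key": 3, "Date": date(2024, 7, 3)},
]
product_dim = [{"Product_Key": 101, "Product_Name": "Widget A"},
               {"Product_Key": 102, "Product_Name": "Gadget B"}]
store_dim = [{"Store_Key": 201, "Store_Name": "Store X"},
             {"Store_Key": 202, "Store_Name": "Store Y"}]

def resolve_late_fact(event_date, product_name, store_name, amount):
    """Look up the keys that were in effect on the event date, not the load date."""
    time_key = next(r["Time_Key"] for r in time_dim if r["Date"] == event_date)
    product_key = next(r["Product_Key"] for r in product_dim
                       if r["Product_Name"] == product_name)
    store_key = next(r["Store_Key"] for r in store_dim
                     if r["Store_Name"] == store_name)
    return {"Time_Key": time_key, "Product_Key": product_key,
            "Store_Key": store_key, "Sales_Amount": amount}

# The $700 sale happened on July 1 but arrived on July 3; resolve against July 1.
row = resolve_late_fact(date(2024, 7, 1), "Widget A", "Store X", 700)
```

The resolved row carries `Time_Key = 1` (July 1), not the key for the arrival date, which is exactly the correction shown in the updated fact table above.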

Let's break down the provided explanation step by step with an example to illustrate
how dimensions, outrigger dimensions, and the placement of foreign keys impact the
growth of the base dimension and how this can be managed.

### Step-by-Step Explanation with Example

1. **Dimensions can contain references to other dimensions:**


- Dimensions (like Customer, Product) often reference other dimensions. For
example, a `Customer` dimension might reference a `Geography` dimension to store
the customer's location.

2. **These relationships can be modeled with outrigger dimensions:**


- An outrigger dimension is a secondary dimension that is referenced by a primary
dimension. For example, the `Customer` dimension references the `Geography`
dimension.

| CustomerID | CustomerName | GeographyID |
|------------|--------------|-------------|
| 1          | John Doe     | 101         |
| 2          | Jane Smith   | 102         |

| GeographyID | Country | State | City        |
|-------------|---------|-------|-------------|
| 101         | USA     | CA    | Los Angeles |
| 102         | Canada  | ON    | Toronto     |

3. **In some cases, the existence of a foreign key to the outrigger dimension in the
base dimension can result in explosive growth of the base dimension because type 2
changes in the outrigger force corresponding type 2 processing in the base
dimension:**
- If the `Geography` dimension undergoes type 2 changes (historical changes), the
`Customer` dimension will also need to track these changes. For example, if John Doe
moves from Los Angeles to San Francisco, the `Geography` dimension will have a
new row for the new city, and the `Customer` dimension will have a new row to
reflect this change.

Updated `Geography` Dimension:


| GeographyID | Country | State | City |
|-------------|------------|--------|--------------|
| 101 | USA | CA | Los Angeles |
| 103 | USA | CA | San Francisco|

Updated `Customer` Dimension with Type 2 Change:


| CustomerID | CustomerName | GeographyID |
|------------|--------------|-------------|
|1 | John Doe | 101 |
|1 | John Doe | 103 |

4. **This explosive growth can often be avoided if you demote the correlation
between dimensions by placing the foreign key of the outrigger in the fact table rather
than in the base dimension:**
- Instead of having the `GeographyID` in the `Customer` dimension, place it
directly in the fact table. This way, changes in the `Geography` dimension don't force
type 2 changes in the `Customer` dimension.

Fact Table with GeographyID:


| FactID | CustomerID | GeographyID | SalesAmount |
|--------|------------|-------------|-------------|
|1 |1 | 101 | 100 |
|2 |2 | 102 | 150 |
|3 |1 | 103 | 200 |
5. **This means the correlation between the dimensions can be discovered only by
traversing the fact table, but this may be acceptable, especially if the fact table is a
periodic snapshot where all the keys for all the dimensions are guaranteed to be
present for each reporting period:**
- The relationship between `Customer` and `Geography` can be determined by
querying the fact table. This is feasible if the fact table contains periodic snapshots
(e.g., daily, monthly) that ensure all dimension keys are recorded for each reporting
period.

Example of a Periodic Snapshot Fact Table:


| SnapshotDate | FactID | CustomerID | GeographyID | SalesAmount |
|--------------|--------|------------|-------------|-------------|
| 2024-01-01 | 1 |1 | 101 | 100 |
| 2024-01-01 | 2 |2 | 102 | 150 |
| 2024-02-01 | 3 |1 | 103 | 200 |
| 2024-02-01 | 4 |2 | 102 | 120 |

By placing the foreign key of the outrigger (`GeographyID`) in the fact table rather
than the `Customer` dimension, we avoid the explosive growth of the `Customer`
dimension due to type 2 changes in the `Geography` dimension. The correlation can
still be determined by analyzing the fact table, making it a viable solution for
managing dimensional relationships and growth.
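Discovering the customer-to-geography correlation by traversing the fact table can be sketched as a simple scan. This is an illustrative in-memory sketch over the periodic snapshot rows above (the `geography_history` helper is an assumption, not part of the source):

```python
# Derive the customer-to-geography correlation by scanning the fact table
# rather than storing GeographyID in the Customer dimension.
fact_rows = [
    {"SnapshotDate": "2024-01-01", "CustomerID": 1, "GeographyID": 101, "SalesAmount": 100},
    {"SnapshotDate": "2024-01-01", "CustomerID": 2, "GeographyID": 102, "SalesAmount": 150},
    {"SnapshotDate": "2024-02-01", "CustomerID": 1, "GeographyID": 103, "SalesAmount": 200},
    {"SnapshotDate": "2024-02-01", "CustomerID": 2, "GeographyID": 102, "SalesAmount": 120},
]

def geography_history(customer_id):
    """Return the (snapshot date, geography) pairs observed for a customer."""
    return [(r["SnapshotDate"], r["GeographyID"])
            for r in fact_rows if r["CustomerID"] == customer_id]
```

Because the fact table is a periodic snapshot, every customer key is present each period, so the geography history for customer 1 (101, then 103 after the move) is fully recoverable without any type 2 growth in the `Customer` dimension.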

Let's break down the explanation and provide a comprehensive tabular data example
for each step.

### Classic Dimensional Schema

In a classic dimensional schema, each dimension attached to a fact table has a single
value consistent with the fact table’s grain.

#### Example:

1. **Fact Table: Sales**


2. **Dimensions: Date, Product, Store**

| Date       | Product_ID | Store_ID | Sales_Amount |
|------------|------------|----------|--------------|
| 2024-07-01 | 1          | 1        | 100          |
| 2024-07-01 | 2          | 2        | 150          |
| 2024-07-02 | 1          | 1        | 200          |
| 2024-07-02 | 3          | 3        | 250          |

### Multivalued Dimension Scenario


A situation in which a dimension is legitimately multivalued is when a patient
receiving a healthcare treatment may have multiple simultaneous diagnoses.

#### Example:

1. **Fact Table: Treatment**


2. **Dimensions: Date, Patient, Treatment**

| Date       | Patient_ID | Treatment_ID | Treatment_Cost |
|------------|------------|--------------|----------------|
| 2024-07-01 | 1          | 1            | 1000           |
| 2024-07-01 | 2          | 2            | 1500           |
| 2024-07-02 | 1          | 3            | 2000           |
| 2024-07-02 | 3          | 4            | 2500           |

### Handling Multivalued Dimensions

To handle multivalued dimensions, a group dimension key and a bridge table are
used.

1. **Group Dimension Key: Diagnosis_Group_ID**


2. **Bridge Table: Diagnosis_Bridge**

#### Diagnosis Group Dimension Table

| Diagnosis_Group_ID | Diagnosis_ID |
|--------------------|--------------|
|1 | 101 |
|1 | 102 |
|2 | 103 |
|2 | 104 |
|3 | 105 |
|3 | 106 |

#### Diagnosis Bridge Table

| Treatment_ID | Diagnosis_Group_ID |
|--------------|--------------------|
|1 |1 |
|2 |2 |
|3 |3 |
|4 |1 |

### Detailed Example

1. **Fact Table: Treatment**


2. **Diagnosis Group Dimension Table**
3. **Bridge Table: Diagnosis_Bridge**
#### Fact Table: Treatment

| Date       | Patient_ID | Treatment_ID | Treatment_Cost | Diagnosis_Group_ID |
|------------|------------|--------------|----------------|--------------------|
| 2024-07-01 | 1          | 1            | 1000           | 1                  |
| 2024-07-01 | 2          | 2            | 1500           | 2                  |
| 2024-07-02 | 1          | 3            | 2000           | 3                  |
| 2024-07-02 | 3          | 4            | 2500           | 1                  |

#### Diagnosis Group Dimension Table

| Diagnosis_Group_ID | Diagnosis_ID |
|--------------------|--------------|
|1 | 101 |
|1 | 102 |
|2 | 103 |
|2 | 104 |
|3 | 105 |
|3 | 106 |

#### Bridge Table: Diagnosis_Bridge

| Treatment_ID | Diagnosis_Group_ID |
|--------------|--------------------|
|1 |1 |
|2 |2 |
|3 |3 |
|4 |1 |

In this example, the `Diagnosis_Group_ID` in the Fact Table `Treatment` allows each
treatment to be associated with a group of diagnoses through the Bridge Table
`Diagnosis_Bridge`. This design handles the multivalued dimension by creating a
group of diagnoses for each treatment, making it possible to maintain the relationship
between multiple diagnoses and a single treatment record in the Fact Table.
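Expanding the multivalued relationship for a given treatment amounts to a two-step join through the group key. This is a minimal in-memory sketch (the `diagnoses_for_treatment` helper and the tuple layout are assumptions, not part of the source):

```python
# Group dimension: each group key maps to its member diagnoses.
diagnosis_group = [
    (1, 101), (1, 102), (2, 103), (2, 104), (3, 105), (3, 106),
]
# Fact rows carrying the group key (only the columns needed here).
treatment_facts = [
    {"Treatment_ID": 1, "Diagnosis_Group_ID": 1},
    {"Treatment_ID": 2, "Diagnosis_Group_ID": 2},
    {"Treatment_ID": 3, "Diagnosis_Group_ID": 3},
    {"Treatment_ID": 4, "Diagnosis_Group_ID": 1},
]

def diagnoses_for_treatment(treatment_id):
    """Join fact -> group -> diagnoses to expand the multivalued dimension."""
    group_id = next(f["Diagnosis_Group_ID"] for f in treatment_facts
                    if f["Treatment_ID"] == treatment_id)
    return [diag for (gid, diag) in diagnosis_group if gid == group_id]
```

Treatments 1 and 4 share group 1, so both expand to diagnoses 101 and 102 without duplicating the fact rows themselves.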

If `Diagnosis_ID` 102 and 103 happened simultaneously as treatment for some patient, we would need to create a new diagnosis group to reflect this combination. Here's how we can adjust the tables:

1. **Create a new Diagnosis Group** to reflect the combination of `Diagnosis_ID` 102 and 103.
2. **Update the Bridge Table** to include this new combination.

### Updated Diagnosis Group Dimension Table

| Diagnosis_Group_ID | Diagnosis_ID |
|--------------------|--------------|
|1 | 101 |
|1 | 102 |
|2 | 103 |
|2 | 104 |
|3 | 105 |
|3 | 106 |
|4 | 102 |
|4 | 103 |

### Updated Bridge Table: Diagnosis_Bridge

| Treatment_ID | Diagnosis_Group_ID |
|--------------|--------------------|
|1 |1 |
|2 |2 |
|3 |3 |
|4 |1 |
| 5            | 4                  | <!-- New record indicating the combination of Diagnosis_ID 102 and 103 -->

### Fact Table: Treatment (Updated)

| Date | Patient_ID | Treatment_ID | Treatment_Cost | Diagnosis_Group_ID |


|------------|------------|--------------|----------------|--------------------|
| 2024-07-01 | 1 |1 | 1000 |1 |
| 2024-07-01 | 2 |2 | 1500 |2 |
| 2024-07-02 | 1 |3 | 2000 |3 |
| 2024-07-02 | 3 |4 | 2500 |1 |
| 2024-07-03 | 4          | 5            | 1800           | 4                  | <!-- New treatment record for the combination of Diagnosis_ID 102 and 103 -->

### Explanation

- **New Diagnosis Group**: A new `Diagnosis_Group_ID` 4 is created to represent


the combination of `Diagnosis_ID` 102 and 103.
- **Bridge Table**: The `Diagnosis_Bridge` table is updated to include the new
combination under `Diagnosis_Group_ID` 4.
- **Fact Table**: The Fact Table `Treatment` is updated with a new treatment record
for `Patient_ID` 4 (for example), reflecting the new combination of diagnoses with
`Treatment_ID` 5.

This approach ensures that each unique combination of diagnoses is represented by a


distinct `Diagnosis_Group_ID`, maintaining the integrity of the data model and
allowing for accurate analysis and reporting of treatments associated with multiple
diagnoses.
Certainly! Let's break down each line of the explanation and illustrate it with a
comprehensive tabular data example.

### 1. Multivalued Bridge Table

A **multivalued bridge table** handles many-to-many relationships between two dimensions.

### Example Scenario

Imagine a scenario with bank accounts and customers where each customer can have
multiple bank accounts, and each bank account can belong to multiple customers.

### 2. Type 2 Slowly Changing Dimension (SCD Type 2)

A **Type 2 SCD** tracks changes over time by creating a new record with a new
primary key whenever changes occur. This approach keeps the historical data.

### Example of Type 2 SCD for Customers

| CustomerID | CustomerSK | Name     | EffectiveDate | ExpirationDate |
|------------|------------|----------|---------------|----------------|
| 1          | 1          | Alice    | 2023-01-01    | 2023-06-30     |
| 1          | 2          | Alice A. | 2023-07-01    | NULL           |
| 2          | 3          | Bob      | 2023-01-01    | 2023-02-28     |
| 2          | 4          | Bob B.   | 2023-03-01    | NULL           |

### Example of Type 2 SCD for Bank Accounts

| AccountID | AccountSK | AccountType | EffectiveDate | ExpirationDate |
|-----------|-----------|-------------|---------------|----------------|
| 100       | 1         | Checking    | 2023-01-01    | NULL           |
| 101       | 2         | Savings     | 2023-01-01    | 2023-05-31     |
| 101       | 3         | Savings     | 2023-06-01    | NULL           |

### 3. Bridge Table with Effective and Expiration Date/Time Stamps

The bridge table must include effective and expiration date/time stamps to correctly
represent the many-to-many relationships over time.

### Example Bridge Table

| BridgeID | CustomerSK | AccountSK | EffectiveDate | ExpirationDate |
|----------|------------|-----------|---------------|----------------|
| 1        | 1          | 1         | 2023-01-01    | 2023-06-30     |
| 2        | 2          | 1         | 2023-07-01    | NULL           |
| 3        | 3          | 2         | 2023-01-01    | 2023-02-28     |
| 4        | 4          | 3         | 2023-03-01    | NULL           |
### 4. Constrain to a Specific Moment in Time

To get a consistent snapshot, the requesting application must constrain the bridge
table to a specific moment in time.

### Example Snapshot Query

To find the relationship as of July 1, 2023:

```sql
SELECT b.BridgeID, b.CustomerSK, b.AccountSK
FROM BridgeTable b
JOIN CustomerDimension c ON b.CustomerSK = c.CustomerSK
JOIN AccountDimension a ON b.AccountSK = a.AccountSK
WHERE '2023-07-01' BETWEEN b.EffectiveDate AND
COALESCE(b.ExpirationDate, '9999-12-31')
AND '2023-07-01' BETWEEN c.EffectiveDate AND COALESCE(c.ExpirationDate,
'9999-12-31')
AND '2023-07-01' BETWEEN a.EffectiveDate AND COALESCE(a.ExpirationDate,
'9999-12-31')
```

### Explanation of Query

- The query joins the bridge table with the customer and account dimension tables.
- It constrains the results to records effective on July 1, 2023, using the
`EffectiveDate` and `ExpirationDate` fields.
- `COALESCE` is used to handle records with no expiration date (`NULL`).

### Query Result

| BridgeID | CustomerSK | AccountSK |
|----------|------------|-----------|
| 2        | 2          | 1         |
| 4        | 4          | 3         |

This result shows the valid relationships between customers and accounts as of July 1,
2023.

### Summary Table

Here is a summary of the tables involved:

#### Customer Dimension Table


| CustomerID | CustomerSK | Name | EffectiveDate | ExpirationDate |
|------------|------------|------------|---------------|----------------|
|1 |1 | Alice | 2023-01-01 | 2023-06-30 |
|1 |2 | Alice A. | 2023-07-01 | NULL |
|2 |3 | Bob | 2023-01-01 | 2023-02-28 |
|2 |4 | Bob B. | 2023-03-01 | NULL |

#### Account Dimension Table


| AccountID | AccountSK | AccountType | EffectiveDate | ExpirationDate |
|-----------|-----------|-------------|---------------|----------------|
| 100 |1 | Checking | 2023-01-01 | NULL |
| 101 |2 | Savings | 2023-01-01 | 2023-05-31 |
| 101 |3 | Savings | 2023-06-01 | NULL |

#### Bridge Table


| BridgeID | CustomerSK | AccountSK | EffectiveDate | ExpirationDate |
|----------|------------|-----------|---------------|----------------|
|1 |1 |1 | 2023-01-01 | 2023-06-30 |
|2 |2 |1 | 2023-07-01 | NULL |
|3 |3 |2 | 2023-01-01 | 2023-02-28 |
|4 |4 |3 | 2023-03-01 | NULL |

This explanation, along with the tabular data example, should clarify how a
multivalued bridge table based on a type 2 slowly changing dimension works and how
to ensure accurate historical linkages between accounts and customers.

Certainly! Let's break down each line and provide a comprehensive tabular data
example to illustrate the concepts.

### 1. Descriptive Text in Dimension Tables

In a data warehouse, dimension tables often contain descriptive text to provide context for various entities.

### Example Dimension Table

| CustomerID | Name  | City        |
|------------|-------|-------------|
| 1          | Alice | New York    |
| 2          | Bob   | Los Angeles |

### 2. Data Mining Customer Cluster Analyses

Data mining techniques, such as cluster analysis, group customers based on their
behaviors and characteristics.

### Example of Cluster Analysis Results

| CustomerID | Cluster | BehaviorTag |
|------------|---------|-------------|
| 1          | A       | HighSpender |
| 2          | B       | LowSpender  |
### 3. Textual Behavior Tags

Behavior tags are textual labels assigned to customers based on their behavior.

### Example Behavior Tags Over Time

| CustomerID | Date       | BehaviorTag   |
|------------|------------|---------------|
| 1          | 2023-01-01 | HighSpender   |
| 1          | 2023-02-01 | MediumSpender |
| 1          | 2023-03-01 | LowSpender    |
| 2          | 2023-01-01 | LowSpender    |
| 2          | 2023-02-01 | MediumSpender |
| 2          | 2023-03-01 | HighSpender   |

### 4. Behavior Measurements Over Time as a Sequence of Tags

These tags form a time series representing customer behavior over time.

### Example Sequence of Behavior Tags

| CustomerID | Sequence |
|------------|-----------------------------------|
|1 | HighSpender, MediumSpender, LowSpender |
|2 | LowSpender, MediumSpender, HighSpender |

### 5. Positional Attributes in the Customer Dimension

The behavior tags are stored as positional attributes in the customer dimension to
facilitate complex queries.

### Example Customer Dimension with Positional Attributes

| CustomerID | Name  | City        | Tag1        | Tag2          | Tag3        | TagSequence                            |
|------------|-------|-------------|-------------|---------------|-------------|----------------------------------------|
| 1          | Alice | New York    | HighSpender | MediumSpender | LowSpender  | HighSpender, MediumSpender, LowSpender |
| 2          | Bob   | Los Angeles | LowSpender  | MediumSpender | HighSpender | LowSpender, MediumSpender, HighSpender |

### 6. Behavior Tags as Targets of Complex Queries

Behavior tags, stored as positional attributes, are used in complex queries rather than
numeric computations.

### Example Query


To find customers who were "HighSpender" at any point in time:

```sql
SELECT CustomerID, Name, City, TagSequence
FROM CustomerDimension
WHERE Tag1 = 'HighSpender'
OR Tag2 = 'HighSpender'
OR Tag3 = 'HighSpender';
```

### Explanation of Query

- The query selects customers who have the "HighSpender" tag in any of the
positional attributes (`Tag1`, `Tag2`, or `Tag3`).

### Query Result

| CustomerID | Name  | City        | TagSequence                            |
|------------|-------|-------------|----------------------------------------|
| 1          | Alice | New York    | HighSpender, MediumSpender, LowSpender |
| 2          | Bob   | Los Angeles | LowSpender, MediumSpender, HighSpender |

### Summary Table

Here is a summary of the tables involved:

#### Customer Dimension Table

| CustomerID | Name  | City        | Tag1        | Tag2          | Tag3        | TagSequence                            |
|------------|-------|-------------|-------------|---------------|-------------|----------------------------------------|
| 1          | Alice | New York    | HighSpender | MediumSpender | LowSpender  | HighSpender, MediumSpender, LowSpender |
| 2          | Bob   | Los Angeles | LowSpender  | MediumSpender | HighSpender | LowSpender, MediumSpender, HighSpender |

#### Behavior Tags Over Time Table

| CustomerID | Date       | BehaviorTag   |
|------------|------------|---------------|
| 1          | 2023-01-01 | HighSpender   |
| 1          | 2023-02-01 | MediumSpender |
| 1          | 2023-03-01 | LowSpender    |
| 2          | 2023-01-01 | LowSpender    |
| 2          | 2023-02-01 | MediumSpender |
| 2          | 2023-03-01 | HighSpender   |
This explanation, along with the tabular data example, should clarify how textual
behavior tags from customer cluster analyses are stored as positional attributes in the
customer dimension for complex queries.

Let's break down the explanation step-by-step with a comprehensive tabular data
example to illustrate the process.

### Explanation with Tabular Data Example

1. **Complex customer behavior can sometimes be discovered only by running lengthy iterative analyses.**
- This means that understanding customer behavior often requires detailed and
repetitive analysis, which might be computationally intensive and time-consuming.

2. **In these cases, it is impractical to embed the behavior analyses inside every BI
application that wants to constrain all the members of the customer dimension who
exhibit the complex behavior.**
- It is not efficient to include complex behavior analysis directly in every Business
Intelligence (BI) application because it would slow down the system and increase
complexity.

3. **The results of the complex behavior analyses, however, can be captured in a simple table, called a study group, consisting only of the customers’ durable keys.**
- Instead of embedding the analysis, we can store the results in a simplified table, called a "study group," which includes only unique customer identifiers (durable keys).

4. **This static table can then be used as a kind of filter on any dimensional schema
with a customer dimension by constraining the study group column to the customer
dimension’s durable key in the target schema at query time.**
- The study group table can act as a filter when querying other tables that include the
customer dimension, allowing us to constrain results based on the pre-analyzed
behavior.

5. **Multiple study groups can be defined and derivative study groups can be created
with intersections, unions, and set differences.**
- We can define multiple study groups for different behaviors and create new groups
through set operations like intersections, unions, and set differences.

### Tabular Data Example

#### Customer Dimension Table


| CustomerID | CustomerName | CustomerKey |
|------------|--------------|-------------|
|1 | Alice | C001 |
|2 | Bob | C002 |
|3 | Charlie | C003 |
|4 | David | C004 |

#### Purchase Behavior Study Group (Example)


| StudyGroupID | CustomerKey |
|--------------|-------------|
| SG001 | C001 |
| SG001 | C003 |

#### Loyalty Program Study Group (Example)

| StudyGroupID | CustomerKey |
|--------------|-------------|
| SG002        | C002        |
| SG002        | C003        |
| SG002        | C004        |

#### Intersected Study Group (Customers in Both Study Groups)


| StudyGroupID | CustomerKey |
|--------------|-------------|
| SG003 | C003 |

#### Union Study Group (Customers in Either Study Group)


| StudyGroupID | CustomerKey |
|--------------|-------------|
| SG004 | C001 |
| SG004 | C002 |
| SG004 | C003 |
| SG004 | C004 |

#### Filtered Query Using Study Group


Suppose we want to filter another table (e.g., Sales) using a study group.

#### Sales Fact Table


| SaleID | CustomerKey | Amount |
|--------|-------------|--------|
| S001 | C001 | 100 |
| S002 | C002 | 150 |
| S003 | C003 | 200 |
| S004 | C004 | 250 |

#### Query: Filter Sales Fact Table by Purchase Behavior Study Group (SG001)
```sql
SELECT SaleID, CustomerKey, Amount
FROM SalesFactTable
WHERE CustomerKey IN (SELECT CustomerKey FROM
PurchaseBehaviorStudyGroup WHERE StudyGroupID = 'SG001')
```

#### Result of Filtered Query


| SaleID | CustomerKey | Amount |
|--------|-------------|--------|
| S001 | C001 | 100 |
| S003 | C003 | 200 |

In this example, the "Purchase Behavior Study Group" (SG001) is used to filter the "Sales Fact Table," resulting in a subset of sales records where customers have been pre-identified based on their purchase behavior. This process can be repeated with
other study groups or set operations to derive different subsets of data based on
complex customer behaviors.
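Because study groups are just sets of durable keys, the derivative-group operations are ordinary set algebra. A minimal sketch, using illustrative key values (C003 is assumed to belong to both groups, as the intersected study group SG003 implies):

```python
# Study groups as sets of durable customer keys.
purchase_behavior = {"C001", "C003"}            # e.g. SG001
loyalty_program = {"C002", "C003", "C004"}      # e.g. SG002

intersection = purchase_behavior & loyalty_program  # customers in both groups
union = purchase_behavior | loyalty_program         # customers in either group
difference = purchase_behavior - loyalty_program    # in SG001 but not SG002

# Filtering a fact table by a study group at query time:
sales = [("S001", "C001", 100), ("S002", "C002", 150),
         ("S003", "C003", 200), ("S004", "C004", 250)]
filtered = [row for row in sales if row[1] in purchase_behavior]
```

The filtered list contains only sales S001 and S003, matching the filtered-query result above; the derivative sets correspond to the intersection, union, and set-difference study groups described in the text.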

Let's break down the explanation and provide a comprehensive tabular data example
to illustrate each line.

### Explanation and Tabular Data Example

1. **Business users are often interested in constraining the customer dimension based
on aggregated performance metrics, such as filtering on all customers who spent over
a certain dollar amount during last year or perhaps over the customer’s lifetime.**

**Explanation:**
Business users want to analyze and filter customers based on their spending behavior over a specific period (e.g., last year or lifetime). This allows them to focus
on high-value customers for targeted marketing and other business decisions.

**Example Table (Customer Spending):**


| Customer ID | Customer Name | Last Year Spend | Lifetime Spend |
|-------------|---------------|-----------------|----------------|
|1 | Alice | $1,200 | $10,000 |
|2 | Bob | $600 | $5,000 |
|3 | Charlie | $1,500 | $7,500 |

2. **Selected aggregated facts can be placed in a dimension as targets for constraining and as row labels for reporting.**

**Explanation:**
Aggregated performance metrics (such as total spending) can be added to a
dimension table to filter and label rows for reporting purposes.

**Example Table (Customer Dimension with Aggregated Facts):**


| Customer ID | Customer Name | Last Year Spend | Lifetime Spend |
|-------------|---------------|-----------------|----------------|
|1 | Alice | $1,200 | $10,000 |
|2 | Bob | $600 | $5,000 |
|3 | Charlie | $1,500 | $7,500 |

3. **The metrics are often presented as banded ranges in the dimension table.**
**Explanation:**
Metrics can be grouped into ranges (bands) to simplify analysis and reporting. For
example, spending can be categorized into ranges like "$0-$500," "$501-$1,000," etc.

**Example Table (Customer Dimension with Banded Ranges):**


| Customer ID | Customer Name | Last Year Spend | Last Year Spend Band | Lifetime Spend | Lifetime Spend Band |
|-------------|---------------|-----------------|----------------------|----------------|---------------------|
| 1           | Alice         | $1,200          | $1,001-$1,500        | $10,000        | $9,001-$10,000      |
| 2           | Bob           | $600            | $501-$1,000          | $5,000         | $4,001-$5,000       |
| 3           | Charlie       | $1,500          | $1,001-$1,500        | $7,500         | $7,001-$8,000       |

4. **Dimension attributes representing aggregated performance metrics add burden to the ETL processing, but ease the analytic burden in the BI layer.**

**Explanation:**
Adding these metrics to the dimension table makes the ETL (Extract, Transform,
Load) process more complex and time-consuming. However, this pre-processing
reduces the computational load during analysis and reporting in the Business
Intelligence (BI) layer.

**Example Table (ETL Process Output - Customer Dimension):**


| Customer ID | Customer Name | Last Year Spend | Last Year Spend Band | Lifetime Spend | Lifetime Spend Band |
|-------------|---------------|-----------------|----------------------|----------------|---------------------|
| 1           | Alice         | $1,200          | $1,001-$1,500        | $10,000        | $9,001-$10,000      |
| 2           | Bob           | $600            | $501-$1,000          | $5,000         | $4,001-$5,000       |
| 3           | Charlie       | $1,500          | $1,001-$1,500        | $7,500         | $7,001-$8,000       |

**ETL Burden:**
- Calculate and update the `Last Year Spend` and `Lifetime Spend` for each
customer.
- Assign each customer to the appropriate spend band for both last year and lifetime.

**BI Layer Ease:**


- Pre-calculated metrics and bands allow for quick filtering and reporting without
additional computation.

### Comprehensive Example


To demonstrate how this would work in a real-world scenario, let's consider a simple
ETL process and BI report based on the above tables.

**ETL Process:**
1. Extract raw transaction data for customers.
2. Transform data to calculate total spend for the last year and lifetime for each
customer.
3. Load the transformed data into the customer dimension table, including spend
bands.

**BI Report:**
- Filter customers who spent more than $1,000 last year.
- Group customers by spend bands to identify high-value segments.
- Report on customer names and their respective spend categories.

**BI Report Example:**


| Customer Name | Last Year Spend | Last Year Spend Band |
|---------------|-----------------|----------------------|
| Alice | $1,200 | $1,001-$1,500 |
| Charlie | $1,500 | $1,001-$1,500 |

This report quickly highlights high-value customers based on their last year's
spending, leveraging pre-processed data for efficient analysis.
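The band-assignment step of the ETL transform can be sketched in a few lines. The band boundaries and the `spend_band` helper are illustrative assumptions; a real design would typically keep the boundaries in a band table rather than in code:

```python
# Hypothetical band boundaries for the last-year spend attribute.
BANDS = [(0, 500, "$0-$500"), (501, 1000, "$501-$1,000"),
         (1001, 1500, "$1,001-$1,500"), (1501, 2000, "$1,501-$2,000")]

def spend_band(amount):
    """Map a spend amount to its banded-range label (ETL-time enrichment)."""
    for low, high, label in BANDS:
        if low <= amount <= high:
            return label
    return "over $2,000"

customers = [("Alice", 1200), ("Bob", 600), ("Charlie", 1500)]
enriched = [(name, spend, spend_band(spend)) for name, spend in customers]
```

Running this once during the load is the "ETL burden" the text mentions; afterward the BI layer can group and filter on the pre-computed band labels directly.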

Let's provide an example of how label rows can be used for reporting purposes,
leveraging the aggregated performance metrics and banded ranges.

### Example Table with Label Rows for Reporting Purposes

#### Customer Dimension with Aggregated Metrics and Labels


| Customer ID | Customer Name | Last Year Spend | Last Year Spend Band | Lifetime Spend | Lifetime Spend Band | Customer Segment |
|-------------|---------------|-----------------|----------------------|----------------|---------------------|------------------|
| 1           | Alice         | $1,200          | $1,001-$1,500        | $10,000        | $9,001-$10,000      | High Value       |
| 2           | Bob           | $600            | $501-$1,000          | $5,000         | $4,001-$5,000       | Medium Value     |
| 3           | Charlie       | $1,500          | $1,001-$1,500        | $7,500         | $7,001-$8,000       | High Value       |
| 4           | Dave          | $300            | $0-$500              | $1,200         | $1,001-$2,000       | Low Value        |

In this table, the `Customer Segment` column serves as a label row, categorizing
customers into segments based on their spending behavior.

### Example Report Using Label Rows


#### Report: Customer Spending by Segment

| Customer Segment | Customer Name | Last Year Spend | Last Year Spend Band | Lifetime Spend | Lifetime Spend Band |
|------------------|---------------|-----------------|----------------------|----------------|---------------------|
| High Value       | Alice         | $1,200          | $1,001-$1,500        | $10,000        | $9,001-$10,000      |
| High Value       | Charlie       | $1,500          | $1,001-$1,500        | $7,500         | $7,001-$8,000       |
| Medium Value     | Bob           | $600            | $501-$1,000          | $5,000         | $4,001-$5,000       |
| Low Value        | Dave          | $300            | $0-$500              | $1,200         | $1,001-$2,000       |

This report leverages the `Customer Segment` label to group and present customers
according to their value, making it easy to identify which customers fall into each
segment.

### Breakdown of Label Rows for Reporting


1. **High Value Segment:**
- Customers who spent significantly more, such as Alice and Charlie.
- Useful for targeted marketing and special offers.

2. **Medium Value Segment:**


- Customers like Bob, who have moderate spending patterns.
- Useful for nurturing to increase their spending.

3. **Low Value Segment:**


- Customers like Dave, with lower spending.
- Useful for identifying opportunities for engagement and improvement.

Using label rows in this manner simplifies the reporting process, making it clear and
actionable for business users to understand customer segments and tailor their
strategies accordingly.

Let's break down the explanation of a dynamic value banding report step by step with
a comprehensive tabular data example.

### Explanation Breakdown

1. **Dynamic Value Banding Report**: A report where row headers define varying-sized ranges of a numeric fact (e.g., balances).

2. **Report Row Headers**: Labels for the rows that specify the ranges (e.g.,
“Balance from 0 to $10”).
3. **Target Numeric Fact**: The numeric data being reported on (e.g., account
balances).

4. **Dynamic Definition**: Row headers are defined at query time, not during ETL
processing.

5. **Value Banding Dimension Table**: A small table with range definitions that
joins to the fact table using greater-than/less-than joins.

6. **SQL CASE Statement**: An alternative to the dimension table, where ranges are defined in the SQL query.

7. **Performance Consideration**: Using a value banding dimension table is likely more efficient, especially in a columnar database, compared to using a CASE statement.

### Tabular Data Example

#### Value Banding Dimension Table

| Band ID | Range Start | Range End |
|---------|-------------|-----------|
| 1       | 0.00        | 10.00     |
| 2       | 10.01       | 25.00     |
| 3       | 25.01       | 50.00     |
| 4       | 50.01       | 100.00    |

#### Fact Table

| Account ID | Balance |
|------------|---------|
| 1001 | 5.00 |
| 1002 | 12.50 |
| 1003 | 30.00 |
| 1004 | 75.00 |

#### Joined Table for Report

When the value banding dimension table is joined with the fact table, the result might
look like this:

| Band ID | Range Start | Range End | Account ID | Balance |
|---------|-------------|-----------|------------|---------|
| 1       | 0.00        | 10.00     | 1001       | 5.00    |
| 2       | 10.01       | 25.00     | 1002       | 12.50   |
| 3       | 25.01       | 50.00     | 1003       | 30.00   |
| 4       | 50.01       | 100.00    | 1004       | 75.00   |
fi
fi
fi
fi
fi
#### Final Report

The final report aggregates the balances into the defined bands:

| Band ID | Range          | Number of Accounts | Total Balance |
|---------|----------------|--------------------|---------------|
| 1       | 0.00 - 10.00   | 1                  | 5.00          |
| 2       | 10.01 - 25.00  | 1                  | 12.50         |
| 3       | 25.01 - 50.00  | 1                  | 30.00         |
| 4       | 50.01 - 100.00 | 1                  | 75.00         |

### Explanation of Each Line

1. **Dynamic Value Banding Report**: We create a report where the balance ranges
(bands) are dynamically defined based on current data at the time of the query.

2. **Report Row Headers**: The labels like “0.00 - 10.00” or “10.01 - 25.00” serve
as row headers.

3. **Target Numeric Fact**: The "Balance" from the fact table.

4. **Dynamic Definition**: The band definitions are determined at query execution
rather than during ETL processing.

5. **Value Banding Dimension Table**: This table contains pre-defined ranges for
balances.

6. **SQL CASE Statement**: Alternatively, ranges could be hardcoded in the query,
but this is less efficient.

7. **Performance Consideration**: Using the value banding dimension table is more
efficient, particularly in databases optimized for columnar storage, because it reduces
the need for extensive scans of the fact table.

This setup ensures efficient querying and clear, aggregated reporting of balances
within specified dynamic ranges.

To illustrate the concept of a dynamic value banding report where specific row
headers are defined at query time rather than during ETL processing, let's consider an
example using SQL.

### Example Scenario

Imagine we have a table of account balances, and we want to create a report that
dynamically groups these balances into bands that are defined at the time of query
execution.
#### Fact Table: `account_balances`

| Account ID | Balance |
|------------|---------|
| 1001 | 5.00 |
| 1002 | 12.50 |
| 1003 | 30.00 |
| 1004 | 75.00 |

### SQL Query to Generate Dynamic Value Banding Report

```sql
SELECT
    CASE
        WHEN Balance BETWEEN 0.00 AND 10.00 THEN 'Balance from 0 to $10'
        WHEN Balance BETWEEN 10.01 AND 25.00 THEN 'Balance from $10.01 to $25'
        WHEN Balance BETWEEN 25.01 AND 50.00 THEN 'Balance from $25.01 to $50'
        WHEN Balance BETWEEN 50.01 AND 100.00 THEN 'Balance from $50.01 to $100'
        ELSE 'Balance above $100'
    END AS Balance_Range,
    COUNT(*) AS Number_of_Accounts,
    SUM(Balance) AS Total_Balance
FROM
    account_balances
GROUP BY
    CASE
        WHEN Balance BETWEEN 0.00 AND 10.00 THEN 'Balance from 0 to $10'
        WHEN Balance BETWEEN 10.01 AND 25.00 THEN 'Balance from $10.01 to $25'
        WHEN Balance BETWEEN 25.01 AND 50.00 THEN 'Balance from $25.01 to $50'
        WHEN Balance BETWEEN 50.01 AND 100.00 THEN 'Balance from $50.01 to $100'
        ELSE 'Balance above $100'
    END
ORDER BY
    MIN(Balance);
```

### Resulting Report

| Balance Range               | Number of Accounts | Total Balance |
|-----------------------------|--------------------|---------------|
| Balance from 0 to $10       | 1                  | 5.00          |
| Balance from $10.01 to $25  | 1                  | 12.50         |
| Balance from $25.01 to $50  | 1                  | 30.00         |
| Balance from $50.01 to $100 | 1                  | 75.00         |

### Explanation

- **Dynamic Definition**: The row headers ("Balance from 0 to $10", "Balance from
$10.01 to $25", etc.) are defined within the SQL query using a `CASE` statement.
This means they are created at the time the query runs, not during ETL processing
when data is loaded into the database.
- **Aggregation**: The query groups account balances into the defined ranges and
calculates the number of accounts and the total balance for each range.
- **Flexibility**: If the ranges need to change, you can simply modify the `CASE`
statement in the SQL query. This makes the report flexible and dynamic, as it adapts
to new ranges without requiring changes to the underlying ETL process.

This example demonstrates how dynamic value banding works by defining row
headers at query time, providing flexibility and adaptability to changing requirements.
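Since the bands live only in the query text, they can even be assembled programmatically right before the query runs. A minimal Python sketch of that idea (the `build_case_sql` helper and the `ELSE 'Other'` fallback label are our own illustration, not part of the example above):

```python
def build_case_sql(bands):
    """Build a SQL CASE expression from (low, high, label) bands at query time."""
    clauses = [
        f"WHEN Balance BETWEEN {low:.2f} AND {high:.2f} THEN '{label}'"
        for low, high, label in bands
    ]
    return "CASE\n    " + "\n    ".join(clauses) + "\n    ELSE 'Other'\nEND"

# Bands can come from a config file, a UI, or the analyst's head -- no ETL change needed.
bands = [
    (0.00, 10.00, "Balance from 0 to $10"),
    (10.01, 25.00, "Balance from $10.01 to $25"),
]
sql = build_case_sql(bands)
```

Changing the report's bands then means changing only the `bands` list, which is the whole point of "defined at query time."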

Let's explain what is meant by implementing row definitions in a value banding
dimension table versus using an SQL CASE statement, with examples for each
approach.

### Value Banding Dimension Table

A value banding dimension table is a separate table that explicitly defines the ranges
(bands) of the target numeric fact (e.g., balances). This table can be joined with the
fact table using greater-than/less-than joins.

#### Example

1. **Value Banding Dimension Table**: `balance_bands`

| Band ID | Range Start | Range End | Label                         |
|---------|-------------|-----------|-------------------------------|
| 1       | 0.00        | 10.00     | 'Balance from 0 to $10'       |
| 2       | 10.01       | 25.00     | 'Balance from $10.01 to $25'  |
| 3       | 25.01       | 50.00     | 'Balance from $25.01 to $50'  |
| 4       | 50.01       | 100.00    | 'Balance from $50.01 to $100' |

2. **Fact Table**: `account_balances`

| Account ID | Balance |
|------------|---------|
| 1001 | 5.00 |
| 1002 | 12.50 |
| 1003 | 30.00 |
| 1004 | 75.00 |
3. **SQL Query to Join Tables**

```sql
SELECT
    b.Label AS Balance_Range,
    COUNT(*) AS Number_of_Accounts,
    SUM(a.Balance) AS Total_Balance
FROM
    account_balances a
JOIN
    balance_bands b
ON
    a.Balance >= b.Range_Start AND a.Balance <= b.Range_End
GROUP BY
    b.Label, b.Range_Start
ORDER BY
    b.Range_Start;
```

Note that the lower bound must be inclusive (`>=`): with a strict `>`, a balance
landing exactly on a band's start value (such as 0.00 or 10.01) would fall out of the
report entirely.

### Resulting Report

| Balance Range               | Number of Accounts | Total Balance |
|-----------------------------|--------------------|---------------|
| Balance from 0 to $10       | 1                  | 5.00          |
| Balance from $10.01 to $25  | 1                  | 12.50         |
| Balance from $25.01 to $50  | 1                  | 30.00         |
| Balance from $50.01 to $100 | 1                  | 75.00         |

### SQL CASE Statement

Alternatively, the band definitions can be directly embedded within the SQL query
using a CASE statement.

#### Example

1. **Fact Table**: `account_balances`

| Account ID | Balance |
|------------|---------|
| 1001 | 5.00 |
| 1002 | 12.50 |
| 1003 | 30.00 |
| 1004 | 75.00 |

2. **SQL Query Using CASE Statement**

```sql
SELECT
    CASE
        WHEN Balance BETWEEN 0.00 AND 10.00 THEN 'Balance from 0 to $10'
        WHEN Balance BETWEEN 10.01 AND 25.00 THEN 'Balance from $10.01 to $25'
        WHEN Balance BETWEEN 25.01 AND 50.00 THEN 'Balance from $25.01 to $50'
        WHEN Balance BETWEEN 50.01 AND 100.00 THEN 'Balance from $50.01 to $100'
        ELSE 'Balance above $100'
    END AS Balance_Range,
    COUNT(*) AS Number_of_Accounts,
    SUM(Balance) AS Total_Balance
FROM
    account_balances
GROUP BY
    CASE
        WHEN Balance BETWEEN 0.00 AND 10.00 THEN 'Balance from 0 to $10'
        WHEN Balance BETWEEN 10.01 AND 25.00 THEN 'Balance from $10.01 to $25'
        WHEN Balance BETWEEN 25.01 AND 50.00 THEN 'Balance from $25.01 to $50'
        WHEN Balance BETWEEN 50.01 AND 100.00 THEN 'Balance from $50.01 to $100'
        ELSE 'Balance above $100'
    END
ORDER BY
    MIN(Balance);
```

(After grouping, the raw `Balance` column is no longer available to `ORDER BY`, so
the bands are ordered by the aggregate `MIN(Balance)` instead.)

### Resulting Report

| Balance Range               | Number of Accounts | Total Balance |
|-----------------------------|--------------------|---------------|
| Balance from 0 to $10       | 1                  | 5.00          |
| Balance from $10.01 to $25  | 1                  | 12.50         |
| Balance from $25.01 to $50  | 1                  | 30.00         |
| Balance from $50.01 to $100 | 1                  | 75.00         |

(The "Balance above $100" band matches no accounts, and `GROUP BY` produces no
row for an empty group.)

### Comparison

- **Value Banding Dimension Table**:
  - Pros: More efficient in terms of performance, especially in columnar databases.
  - Cons: Requires maintenance of an additional table.

- **SQL CASE Statement**:
  - Pros: Simpler to implement for ad-hoc queries, no need for an additional table.
  - Cons: May be less efficient for large datasets as it involves scanning the entire fact
table.

By using a value banding dimension table, the query can leverage indexed joins and
avoid full table scans, which can significantly improve performance. However, for
simpler or one-time reports, a CASE statement might be more convenient.
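The range-join approach can be tried end-to-end with an in-memory SQLite database. This sketch reuses the example's data; SQLite and the lowercase column names are our illustration choices, not a statement about any particular warehouse platform:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account_balances (account_id INTEGER, balance REAL);
CREATE TABLE balance_bands (band_id INTEGER, range_start REAL, range_end REAL, label TEXT);
INSERT INTO account_balances VALUES (1001, 5.00), (1002, 12.50), (1003, 30.00), (1004, 75.00);
INSERT INTO balance_bands VALUES
  (1, 0.00, 10.00, 'Balance from 0 to $10'),
  (2, 10.01, 25.00, 'Balance from $10.01 to $25'),
  (3, 25.01, 50.00, 'Balance from $25.01 to $50'),
  (4, 50.01, 100.00, 'Balance from $50.01 to $100');
""")

# Banding-dimension approach: a range join replaces hardcoded CASE branches.
rows = conn.execute("""
    SELECT b.label, COUNT(*), SUM(a.balance)
    FROM account_balances a
    JOIN balance_bands b
      ON a.balance >= b.range_start AND a.balance <= b.range_end
    GROUP BY b.label, b.range_start
    ORDER BY b.range_start
""").fetchall()
```

Each tuple in `rows` is one report line: (band label, number of accounts, total balance).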

Sure! Let's break down each part of your statement with an example.

### Statement: "Rather than treating freeform comments as textual metrics in a fact
table, they should be stored outside the fact table in a separate comments dimension"

**Explanation:**

1. **Freeform Comments in a Fact Table**: In a data warehouse, a fact table typically
stores measurable, quantitative data, such as sales amounts or quantities. Freeform
comments, being textual data, don't fit this structure well and can complicate the
design and performance.

2. **Separate Comments Dimension**: Instead, comments should be stored in their
own dimension table, which is designed to handle text data and maintain referential
integrity.

**Example:**

Imagine we have a simple sales fact table (`FactSales`) with the following structure:

| TransactionID | ProductID | CustomerID | SalesAmount |
|---------------|-----------|------------|-------------|
| 1             | 101       | 1001       | 50.00       |
| 2             | 102       | 1002       | 30.00       |
| 3             | 101       | 1003       | 20.00       |

Instead of storing comments directly in this table, we create a separate `Comments`
dimension table:

| CommentID | CommentText                  |
|-----------|------------------------------|
| 1         | "Customer liked the product" |
| 2         | "Product arrived damaged"    |
| 3         | "Fast delivery"              |

### Statement: "with a corresponding foreign key in the fact table"


**Explanation:**

1. **Foreign Key**: To link the fact table and the comments dimension, we use a
foreign key in the fact table. This foreign key points to the primary key of the
comments dimension table.

**Example:**

We add a `CommentID` column to the `FactSales` table to store the foreign key:

| TransactionID | ProductID | CustomerID | SalesAmount | CommentID |
|---------------|-----------|------------|-------------|-----------|
| 1             | 101       | 1001       | 50.00       | 1         |
| 2             | 102       | 1002       | 30.00       | 2         |
| 3             | 101       | 1003       | 20.00       | 3         |

### Statement: "or as attributes in a dimension with one row per transaction if the
comments’ cardinality matches the number of unique transactions"

**Explanation:**

1. **Attributes in a Dimension**: If each transaction has a unique comment, the
comments can be stored as attributes in a dimension table that has one row per
transaction, effectively creating a 1-to-1 relationship.

**Example:**

If we choose to store comments directly in a `Transactions` dimension table:

| TransactionID | ProductID | CustomerID | SalesAmount | CommentText                  |
|---------------|-----------|------------|-------------|------------------------------|
| 1             | 101       | 1001       | 50.00       | "Customer liked the product" |
| 2             | 102       | 1002       | 30.00       | "Product arrived damaged"    |
| 3             | 101       | 1003       | 20.00       | "Fast delivery"              |

Here, the `Transactions` dimension table itself stores the comments, which makes
sense if every transaction has exactly one comment and the comment cardinality
matches the number of unique transactions.

### Summary Table

| TransactionID | ProductID | CustomerID | SalesAmount | CommentID | CommentText                  |
|---------------|-----------|------------|-------------|-----------|------------------------------|
| 1             | 101       | 1001       | 50.00       | 1         | "Customer liked the product" |
| 2             | 102       | 1002       | 30.00       | 2         | "Product arrived damaged"    |
| 3             | 101       | 1003       | 20.00       | 3         | "Fast delivery"              |

- **Fact Table (`FactSales`)**: Stores the core transactional data.
- **Comments Dimension**: Stores the textual comments.
- **Foreign Key**: Links the fact table to the comments dimension.
- **Combined Table**: Optionally, store comments directly in a dimension if
cardinality allows.

This approach optimizes data storage and ensures clear, organized schema design.
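A minimal sketch of this schema using SQLite (assuming the example's table and column names): the fact table stays numeric, and the text lives only in the comments dimension, reached through the foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Comments (CommentID INTEGER PRIMARY KEY, CommentText TEXT);
CREATE TABLE FactSales (
    TransactionID INTEGER PRIMARY KEY,
    ProductID INTEGER,
    CustomerID INTEGER,
    SalesAmount REAL,
    CommentID INTEGER REFERENCES Comments(CommentID)  -- FK into the dimension
);
INSERT INTO Comments VALUES
  (1, 'Customer liked the product'), (2, 'Product arrived damaged');
INSERT INTO FactSales VALUES
  (1, 101, 1001, 50.00, 1), (2, 102, 1002, 30.00, 2);
""")

# Comments are pulled in only when a query actually needs them.
result = conn.execute("""
    SELECT f.TransactionID, f.SalesAmount, c.CommentText
    FROM FactSales f JOIN Comments c ON f.CommentID = c.CommentID
    ORDER BY f.TransactionID
""").fetchall()
```

Queries that only aggregate `SalesAmount` never touch the text at all, which is the storage benefit the section describes.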

Certainly! Let's break down the statement and provide an example using a tabular
data structure.

### Statement Breakdown:


1. **To capture both universal standard time, as well as local times in multi-time zone
applications,**
- We want to store timestamps in both Universal Coordinated Time (UTC) and the
local time zones where the events occurred.

2. **dual foreign keys should be placed in the affected fact tables**
   - We need two foreign keys in our fact table: one for UTC and one for local time.

3. **that join to two role-playing date (and potentially time-of-day) dimension
tables.**
   - These foreign keys will link to two separate date dimension tables. Each
dimension table will represent the same date and time information but in different
roles (one for UTC and one for local time).

### Example:

Let's create a scenario with a fact table called `Sales` and two dimension tables called
`DateTimeUTC` and `DateTimeLocal`.

#### Dimension Tables:

1. **DateTimeUTC** (storing dates and times in UTC)

| DateKey | Date       | Time  | DateTimeUTC         |
|---------|------------|-------|---------------------|
| 1       | 2024-07-23 | 13:00 | 2024-07-23 13:00:00 |
| 2       | 2024-07-23 | 14:00 | 2024-07-23 14:00:00 |
| 3       | 2024-07-23 | 15:00 | 2024-07-23 15:00:00 |

2. **DateTimeLocal** (storing dates and times in the local time zone)

| DateKey | Date       | Time  | DateTimeLocal       | TimeZone         |
|---------|------------|-------|---------------------|------------------|
| 101     | 2024-07-23 | 09:00 | 2024-07-23 09:00:00 | America/New_York |
| 102     | 2024-07-23 | 10:00 | 2024-07-23 10:00:00 | America/New_York |
| 103     | 2024-07-23 | 08:00 | 2024-07-23 08:00:00 | America/Chicago  |
#### Fact Table:

3. **Sales** (storing sales transactions with references to both UTC and local date/time)

| SaleID | ProductID | Amount | DateKeyUTC | DateKeyLocal |
|--------|-----------|--------|------------|--------------|
| 1      | 101       | 500    | 1          | 101          |
| 2      | 102       | 300    | 2          | 102          |
| 3      | 103       | 400    | 3          | 103          |

### Explanation with Example:

- **Dimension Tables:**
- `DateTimeUTC` contains standardized dates and times in UTC.
- `DateTimeLocal` contains the same dates and times but adjusted to local time
zones, along with the time zone information.

- **Fact Table:**
- `Sales` records transactions, with each sale linked to both UTC and local time via
the `DateKeyUTC` and `DateKeyLocal` foreign keys.

### Comprehensive Scenario:

Imagine a transaction occurred at 9 AM local time in New York (Eastern Time Zone)
on July 23, 2024. The UTC time for this would be 1 PM (13:00) on the same day.

- In the `DateTimeUTC` table, this time is represented by `DateKey` 1, which
corresponds to `2024-07-23 13:00:00`.
- In the `DateTimeLocal` table, this time is represented by `DateKey` 101, which
corresponds to `2024-07-23 09:00:00` in `America/New_York`.

In the `Sales` fact table:


- A sale with `SaleID` 1 is recorded with `ProductID` 101 and `Amount` 500.
- `DateKeyUTC` is set to 1 (linking to `2024-07-23 13:00:00` in `DateTimeUTC`).
- `DateKeyLocal` is set to 101 (linking to `2024-07-23 09:00:00` in
`DateTimeLocal`).

This dual-key setup allows the fact table to accurately reference both the universal
time and the local time of each transaction, facilitating time zone-aware analysis and
reporting.
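The local-to-UTC conversion behind the dual keys can be sketched with Python's `zoneinfo` (assuming the IANA tz database is available on the system; resolving the timestamps to surrogate keys is an ETL step left out here):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The scenario's local event time: 9 AM in New York on 2024-07-23 (EDT, UTC-4).
local_time = datetime(2024, 7, 23, 9, 0, tzinfo=ZoneInfo("America/New_York"))

# Convert to UTC; during ETL each representation would then be resolved to its
# own surrogate key (DateKeyLocal and DateKeyUTC) in the fact table.
utc_time = local_time.astimezone(timezone.utc)
```

Because the offset is looked up per-date, the same code handles daylight saving transitions without any special casing.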
Creating a measure type dimension to condense a fact table with sparsely populated
rows is generally not recommended. This approach, while reducing empty columns,
significantly increases the fact table's size and complicates intra-column calculations.
It is only suitable when dealing with an extremely large number of potential facts (in
the hundreds), with only a few applicable to each fact table row.

Certainly! Let's break down each line of the description and illustrate it with a
comprehensive tabular data example.

### Description Breakdown

1. **Sequential processes, such as web page events, normally have a separate row in a
transaction fact table for each step in a process.**
- Each step in a sequential process is recorded as a separate row in a transaction fact
table.

2. **To tell where the individual step fits into the overall session, a step dimension is
used.**
- A step dimension is employed to indicate the position of each step within the
overall session.

3. **That shows what step number is represented by the current step and how many
more steps were required to complete the session.**
- The step dimension indicates both the current step number and the total number of
steps required to complete the session.

### Comprehensive Tabular Data Example

#### Transaction Fact Table

| Session ID | Step ID | Event           | Timestamp           | Step Number | Total Steps |
|------------|---------|-----------------|---------------------|-------------|-------------|
| 1          | 101     | Page Load       | 2024-07-25 10:00:00 | 1           | 3           |
| 1          | 102     | Click Link      | 2024-07-25 10:01:00 | 2           | 3           |
| 1          | 103     | Form Submission | 2024-07-25 10:02:00 | 3           | 3           |
| 2          | 201     | Page Load       | 2024-07-25 11:00:00 | 1           | 4           |
| 2          | 202     | Click Link      | 2024-07-25 11:01:00 | 2           | 4           |
| 2          | 203     | Scroll Down     | 2024-07-25 11:02:00 | 3           | 4           |
| 2          | 204     | Form Submission | 2024-07-25 11:03:00 | 4           | 4           |

### Explanation with Example

1. **Separate Rows for Each Step**:
   - Each row represents a unique step within a session. For example, Session 1 has
three steps (Page Load, Click Link, Form Submission), and each is recorded as a
separate row.
2. **Step Dimension**:
- The `Step Number` column shows the sequence of steps within a session. For
Session 1, the steps are numbered 1, 2, and 3, respectively.
- The `Total Steps` column indicates the total number of steps in that session. For
Session 1, there are 3 steps in total.

3. **Current Step and Remaining Steps**:
   - By looking at the `Step Number` and `Total Steps`, one can determine the current
position in the session and how many steps remain. For example, in Session 2:
     - Step 1 (Page Load) is the first step out of four.
     - Step 2 (Click Link) is the second step out of four.
     - Step 3 (Scroll Down) is the third step out of four.
     - Step 4 (Form Submission) is the fourth and final step.

This structure helps in analyzing user behavior by tracking each step of a session,
understanding the sequence of events, and identifying where users might drop off or
complete their sessions.
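Deriving the `Step Number` and `Total Steps` columns from a raw event stream can be sketched in Python. The event list mirrors the example above; the two-pass grouping approach is our own illustration of what ETL would do:

```python
from collections import defaultdict

# Raw (session_id, event) stream, in arrival order, as in the example tables.
events = [
    (1, "Page Load"), (1, "Click Link"), (1, "Form Submission"),
    (2, "Page Load"), (2, "Click Link"), (2, "Scroll Down"), (2, "Form Submission"),
]

# Pass 1: collect each session's steps so the session total is known.
by_session = defaultdict(list)
for session_id, event in events:
    by_session[session_id].append(event)

# Pass 2: emit one fact row per step with its number and the session total.
fact_rows = [
    (session_id, event, step_no, len(steps))
    for session_id, steps in by_session.items()
    for step_no, event in enumerate(steps, start=1)
]
```

Note the step dimension can only be populated once the session is complete, since `Total Steps` is unknown while events are still arriving.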

Sure! Let's break down the concept of hot-swappable dimensions with a


comprehensive example in tabular form. We'll use a scenario with stock ticker quotes
and different investor views.

### Explanation

1. **Hot swappable dimensions are used when the same fact table is alternatively
paired with different copies of the same dimension.**
- **Fact Table**: Contains the main data of interest, such as stock quotes.
- **Dimension Table**: Contains additional attributes related to the data, such as
stock details which can vary for different investors.

2. **For example, a single fact table containing stock ticker quotes could be
simultaneously exposed to multiple separate investors, each of whom has unique and
proprietary attributes assigned to different stocks.**
- Each investor sees the same stock quotes but with different additional information
(attributes) based on their proprietary data.

### Example Tables

#### Fact Table: Stock Quotes

| QuoteID | StockSymbol | Date       | Price   |
|---------|-------------|------------|---------|
| 1       | AAPL        | 2024-07-25 | 150.00  |
| 2       | MSFT        | 2024-07-25 | 280.00  |
| 3       | GOOGL       | 2024-07-25 | 2700.00 |
#### Dimension Table for Investor A

| StockSymbol | InvestorA_Rating | InvestorA_Comments  |
|-------------|------------------|---------------------|
| AAPL        | Buy              | Strong performance  |
| MSFT        | Hold             | Stable growth       |
| GOOGL       | Buy              | Innovative products |

#### Dimension Table for Investor B

| StockSymbol | InvestorB_Rating | InvestorB_Comments    |
|-------------|------------------|-----------------------|
| AAPL        | Hold             | Overvalued currently  |
| MSFT        | Buy              | Good long-term growth |
| GOOGL       | Sell             | High risk investment  |

### Combined Views

#### Investor A's View

| QuoteID | StockSymbol | Date       | Price   | InvestorA_Rating | InvestorA_Comments  |
|---------|-------------|------------|---------|------------------|---------------------|
| 1       | AAPL        | 2024-07-25 | 150.00  | Buy              | Strong performance  |
| 2       | MSFT        | 2024-07-25 | 280.00  | Hold             | Stable growth       |
| 3       | GOOGL       | 2024-07-25 | 2700.00 | Buy              | Innovative products |

#### Investor B's View

| QuoteID | StockSymbol | Date       | Price   | InvestorB_Rating | InvestorB_Comments    |
|---------|-------------|------------|---------|------------------|-----------------------|
| 1       | AAPL        | 2024-07-25 | 150.00  | Hold             | Overvalued currently  |
| 2       | MSFT        | 2024-07-25 | 280.00  | Buy              | Good long-term growth |
| 3       | GOOGL       | 2024-07-25 | 2700.00 | Sell             | High risk investment  |

### Summary

- The **Fact Table** holds the core data: stock ticker quotes.
- Each **Dimension Table** contains unique attributes related to the stocks, speci c
to each investor.
- The fact table can be paired with different dimension tables to create investor-
speci c views.
- This mechanism allows the same underlying data to be "hot-swapped" with
different sets of attributes without altering the core data, providing customized
insights for different investors.
fi
fi
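The swap itself can be sketched in Python: the same fact rows are paired with whichever dimension copy the current investor is entitled to see. The dict-based tables and the `view` helper are our simplification of the real dimension tables:

```python
# Shared fact rows: stock ticker quotes, identical for every investor.
quotes = [
    {"QuoteID": 1, "StockSymbol": "AAPL", "Price": 150.00},
    {"QuoteID": 2, "StockSymbol": "MSFT", "Price": 280.00},
]

# Two copies of the "same" dimension, each with proprietary attributes.
investor_a = {"AAPL": {"Rating": "Buy"}, "MSFT": {"Rating": "Hold"}}
investor_b = {"AAPL": {"Rating": "Hold"}, "MSFT": {"Rating": "Buy"}}

def view(fact_rows, dimension):
    """Pair the shared fact rows with whichever dimension copy is swapped in."""
    return [{**row, **dimension[row["StockSymbol"]]} for row in fact_rows]

view_a = view(quotes, investor_a)
view_b = view(quotes, investor_b)
```

The fact data is never copied or modified; only the dimension reference changes per consumer.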
Abstract generic dimensions, which combine different types of entities (e.g., a single
generic location dimension for stores, warehouses, and customers, or a single person
dimension for employees, customers, and vendors), should be avoided in dimensional
models. This approach can lead to significant issues:

1. **Different Attribute Sets**: Different types of entities (e.g., stores vs. customers)
often have different attributes. Combining them into a single dimension can create
confusion and inefficiency.

2. **Unique Labeling**: Even common attributes (e.g., geographic state) should be
uniquely labeled to distinguish between different contexts (e.g., store's state vs.
customer's state).

3. **Dimension Size**: Combining various entities into a single dimension results in
larger dimension tables, which can negatively impact query performance and
legibility.

While data abstraction may be useful in operational systems or ETL processes, it is
generally detrimental to query performance and clarity in dimensional models.

Sure, let's break down each line and provide a comprehensive example using tabular
data.

### Explanation of Each Line

1. **Fact table row creation in ETL back room:**
   - When a fact table row is created during the ETL (Extract, Transform, Load)
process, it is beneficial to capture metadata related to this process.

2. **Creation of an audit dimension:**
   - An audit dimension is a table that stores metadata about the ETL process, such as
data quality indicators and environment variables.

3. **Basic indicators of data quality:**
   - These indicators may come from an error event schema that logs any data quality
issues encountered during ETL processing.

4. **Environment variables:**
   - Variables that describe the ETL process, such as the versions of the ETL code
used, timestamps of ETL execution, etc.

5. **Compliance and auditing purposes:**
   - These environment variables help in compliance and auditing by allowing BI tools
to trace back which ETL versions created specific fact rows.
### Tabular Data Example

Let's create a simple example involving a fact table and an audit dimension.

#### Fact Table: `SalesFact`

| SaleID | ProductID | CustomerID | SaleDate   | Amount |
|--------|-----------|------------|------------|--------|
| 1      | 101       | 201        | 2023-07-01 | 500.00 |
| 2      | 102       | 202        | 2023-07-02 | 150.00 |
| 3      | 103       | 203        | 2023-07-03 | 200.00 |

#### Audit Dimension: `ETLAudit`

| AuditID | FactTable | RowID | ETLVersion | ErrorIndicator | ETLStartTime        | ETLEndTime          |
|---------|-----------|-------|------------|----------------|---------------------|---------------------|
| 1       | SalesFact | 1     | v1.0       | None           | 2023-07-01 00:00:00 | 2023-07-01 01:00:00 |
| 2       | SalesFact | 2     | v1.0       | None           | 2023-07-02 00:00:00 | 2023-07-02 01:00:00 |
| 3       | SalesFact | 3     | v1.1       | MissingValue   | 2023-07-03 00:00:00 | 2023-07-03 01:00:00 |

### Breakdown of the Audit Dimension Table

1. **AuditID:**
- A unique identifier for each audit entry.

2. **FactTable:**
- The name of the fact table being audited (e.g., `SalesFact`).

3. **RowID:**
- The ID of the row in the fact table that this audit entry pertains to.

4. **ETLVersion:**
- The version of the ETL code that processed this row (e.g., `v1.0`, `v1.1`).

5. **ErrorIndicator:**
- Any data quality issues detected during processing (e.g., `None`, `MissingValue`).

6. **ETLStartTime:**
- The start time of the ETL process for this row.

7. **ETLEndTime:**
- The end time of the ETL process for this row.

### Explanation Using the Example


- For `SaleID` 1:
- An audit entry with `AuditID` 1 is created, indicating that the row in the
`SalesFact` table was processed with ETL version `v1.0`, with no data quality issues,
and the ETL process started and ended between `2023-07-01 00:00:00` and
`2023-07-01 01:00:00`.

- For `SaleID` 2:
- Similar to `SaleID` 1, processed with ETL version `v1.0`, no issues, within the
specified time.

- For `SaleID` 3:
- An audit entry with `AuditID` 3 indicates that the row was processed with ETL
version `v1.1`. There was a `MissingValue` data quality issue, and the ETL process
was executed between `2023-07-03 00:00:00` and `2023-07-03 01:00:00`.

This setup enables BI tools to trace back and analyze the ETL process for each fact
table row, ensuring transparency and aiding in compliance and auditing.
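A minimal sketch of how an ETL loader might populate such an audit dimension; the `load_fact_row` helper and the list-of-dicts storage are hypothetical, for illustration only:

```python
from datetime import datetime

audit_rows = []  # stands in for the ETLAudit dimension table

def load_fact_row(row_id, etl_version, error_indicator, start, end):
    """Hypothetical loader hook: write one audit entry per fact row processed."""
    audit_rows.append({
        "FactTable": "SalesFact",
        "RowID": row_id,
        "ETLVersion": etl_version,
        "ErrorIndicator": error_indicator,
        "ETLStartTime": start,
        "ETLEndTime": end,
    })

# Mirroring the example: row 1 was clean under v1.0, row 3 had an issue under v1.1.
load_fact_row(1, "v1.0", "None",
              datetime(2023, 7, 1, 0, 0), datetime(2023, 7, 1, 1, 0))
load_fact_row(3, "v1.1", "MissingValue",
              datetime(2023, 7, 3, 0, 0), datetime(2023, 7, 3, 1, 0))
```

In a real warehouse the audit row would get its own surrogate key, and the fact row would carry that key as a foreign key.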

To explain the scenario of late-arriving dimension data, let's break it down line by line
with a comprehensive tabular data example. We'll use an inventory depletion scenario
involving customers and products.

### Initial Setup

#### Fact Table: Inventory Depletion

| Fact ID | Date       | Customer Key | Product Key | Quantity |
|---------|------------|--------------|-------------|----------|
| 1       | 2024-07-01 | 1001         | 2001        | 10       |
| 2       | 2024-07-02 | 1002         | 2002        | 5        |

#### Dimension Tables: Customers and Products

Initially, we have partial information, so we create placeholder rows for customers
and products.

#### Dimension Table: Customers

| Customer Key | Customer Name | Other Attributes |
|--------------|---------------|------------------|
| 1001         | Unknown       | Generic Value    |
| 1002         | Unknown       | Generic Value    |

#### Dimension Table: Products

| Product Key | Product Name | Other Attributes |
|-------------|--------------|------------------|
| 2001        | Unknown      | Generic Value    |
| 2002        | Unknown      | Generic Value    |

### Scenario Breakdown

#### Line-by-Line Explanation

1. **Operational Process Arrival Time Variance**:
   - Facts can arrive minutes, hours, days, or weeks before dimension context.
   - Example: Fact ID 1 arrives on 2024-07-01, but the Customer Name and Product
Name are unknown.

2. **Real-Time Data Delivery**:
   - Inventory depletion rows are delivered in real-time.
   - Example: The fact table receives a row with Customer Key 1001 and Product Key
2001 but without full customer or product details.

3. **Posting Rows with Unresolved Natural Keys**:
   - Facts are posted even if dimension context is unresolved.
   - Example: The fact table contains rows with Customer Key 1001 and Product Key
2001 without complete descriptions.

4. **Creation of Placeholder Dimension Rows**:
   - Special dimension rows are created with unresolved keys as attributes.
   - Example: Dimension tables contain "Unknown" for Customer Name and Product
Name.

5. **Presumably, Proper Dimensional Context Follows**:
   - Proper context will arrive later.
   - Example: Customer and product information will be updated later.

6. **Updating Placeholder Rows with Type 1 Overwrites**:
   - Placeholder rows are updated with accurate data once available.
   - Example: When customer and product details arrive, the placeholder rows are
updated.

#### Updated Dimension Tables

When the correct dimensional context arrives, say on 2024-07-05, we update the
tables.

#### Updated Dimension Table: Customers

| Customer Key | Customer Name | Other Attributes |
|--------------|---------------|------------------|
| 1001         | John Doe      | Detailed Value   |
| 1002         | Jane Smith    | Detailed Value   |

#### Updated Dimension Table: Products

| Product Key | Product Name | Other Attributes |
|-------------|--------------|------------------|
| 2001        | Widget A     | Detailed Value   |
| 2002        | Gadget B     | Detailed Value   |

7. **Late Arriving Dimension Data with Type 2 Changes**:
   - Retroactive changes to Type 2 dimension attributes require a new row and fact row
restatement.
   - Example: A change in Customer Key 1001’s attributes on 2024-07-10 needs a new
dimension row.

#### Updated Dimension Table: Customers with Type 2 Changes

| Customer Key | Customer Name | Other Attributes | Effective Date | Expiry Date |
|--------------|---------------|--------------------|----------------|-------------|
| 1001 | John Doe | Old Detailed Value | 2024-07-01 | 2024-07-10 |
| 1003 | John Doe | New Detailed Value | 2024-07-10 | NULL |

#### Updated Fact Table with Restated Rows

| Fact ID | Date       | Customer Key | Product Key | Quantity |
|---------|------------|--------------|-------------|----------|
| 1       | 2024-07-01 | 1001         | 2001        | 10       |
| 2       | 2024-07-02 | 1002         | 2002        | 5        |
| 3       | 2024-07-10 | 1003         | 2001        | 10       |

*Row 3 is restated for the Type 2 change (new Customer Key 1003).*

This example illustrates how real-time ETL systems handle late-arriving dimension
data, placeholder rows, and updates with Type 1 and Type 2 changes.
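The placeholder-then-overwrite flow can be sketched in Python; the `resolve_customer` helper is our own illustration of the ETL lookup step, not a standard API:

```python
customer_dim = {}  # stands in for the Customers dimension table

def resolve_customer(customer_key):
    """Return the dimension row for a key, creating an 'Unknown' placeholder
    when the fact arrives before its dimensional context (late-arriving data)."""
    if customer_key not in customer_dim:
        customer_dim[customer_key] = {
            "Customer Name": "Unknown",
            "Other Attributes": "Generic Value",
        }
    return customer_dim[customer_key]

# A fact row arrives before its customer context: a placeholder row is created.
fact_row = {"Date": "2024-07-01", "Customer Key": 1001, "Quantity": 10}
placeholder = resolve_customer(fact_row["Customer Key"])

# Later the real context arrives: a Type 1 overwrite of the placeholder row.
customer_dim[1001].update(
    {"Customer Name": "John Doe", "Other Attributes": "Detailed Value"}
)
```

Because the fact row keeps pointing at key 1001, the overwrite silently upgrades every query that touches it; only a Type 2 change would force new keys and fact restatement.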

Sure, let's break down the concept of supertype and subtype fact tables in a
comprehensive manner using a tabular data example.

### Explanation

- Businesses have multiple products and services, such as accounts, loans, and
mortgages, each with unique attributes and facts.

- A retail bank's offerings include different account types like checking accounts,
savings accounts, mortgages, and business loans.

- Creating one fact table with all attributes of all products would be impractical due to
the vast number of unique attributes and facts.
- Create a supertype fact table with common facts and attributes across all account
types.
- Create subtype fact tables with specific facts and attributes for each account type.

- Core fact tables (supertype) contain common data.
- Custom fact tables (subtype) contain specific data for each product type.

### Tabular Data Example

#### Supertype Fact Table (Core Fact Table)

| AccountID | CustomerID | Balance  | OpenDate   | AccountType  |
|-----------|------------|----------|------------|--------------|
| 1001      | 2001       | 1500.00  | 2020-01-01 | Checking     |
| 1002      | 2002       | 2500.00  | 2021-03-15 | Savings      |
| 1003      | 2003       | 300000   | 2019-07-22 | Mortgage     |
| 1004      | 2004       | 10000.00 | 2022-05-10 | BusinessLoan |

#### Supertype Dimension Table (Core Dimension Table)

| CustomerID | Name        | Address      | PhoneNumber |
|------------|-------------|--------------|-------------|
| 2001       | John Doe    | 123 Main St. | 555-1234    |
| 2002       | Jane Smith  | 456 Oak St.  | 555-5678    |
| 2003       | Alice Brown | 789 Pine St. | 555-9012    |
| 2004       | Bob Johnson | 321 Elm St.  | 555-3456    |

#### Subtype Fact Table for Checking Accounts (Custom Fact Table)

| AccountID | OverdraftLimit | MonthlyFee |
|-----------|----------------|------------|
| 1001      | 500            | 10.00      |

#### Subtype Fact Table for Savings Accounts (Custom Fact Table)

| AccountID | InterestRate | WithdrawalLimit |
|-----------|--------------|-----------------|
| 1002      | 1.5          | 6               |

#### Subtype Fact Table for Mortgages (Custom Fact Table)

| AccountID | Principal | InterestRate | TermYears |
|-----------|-----------|--------------|-----------|
| 1003      | 300000    | 3.75         | 30        |

#### Subtype Fact Table for Business Loans (Custom Fact Table)

| AccountID | LoanAmount | InterestRate | RepaymentSchedule |
|-----------|------------|--------------|-------------------|
| 1004      | 10000.00   | 5.0          | Monthly           |

### Summary

- **Supertype (Core) Fact Table:** Contains common facts (Balance, OpenDate) and
attributes (CustomerID, AccountType) for all account types.
- **Supertype (Core) Dimension Table:** Contains common attributes (Name,
Address, PhoneNumber) of customers.
- **Subtype (Custom) Fact Tables:** Contain specific facts and attributes unique to
each account type (Checking, Savings, Mortgage, Business Loan).

This approach maintains clarity and efficiency, ensuring that each fact table only
contains relevant data, avoiding the complexity of handling hundreds of incompatible
facts and attributes in a single table.
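
The supertype/subtype split above can be sketched in SQL. This minimal Python + sqlite3 example (column names come from the tables above; the schema itself is purely illustrative) joins the core fact table to one subtype fact table, so a savings report never has to wade through checking- or mortgage-only columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Supertype (core) fact table: facts common to every account type.
cur.execute("""CREATE TABLE account_fact (
    AccountID INTEGER PRIMARY KEY, CustomerID INTEGER,
    Balance REAL, OpenDate TEXT, AccountType TEXT)""")

# Subtype (custom) fact table: facts that apply only to savings accounts.
cur.execute("""CREATE TABLE savings_fact (
    AccountID INTEGER PRIMARY KEY, InterestRate REAL, WithdrawalLimit INTEGER)""")

cur.execute("INSERT INTO account_fact VALUES (1002, 2002, 2500.00, '2021-03-15', 'Savings')")
cur.execute("INSERT INTO savings_fact VALUES (1002, 1.5, 6)")

# A savings-specific report joins the core table to its subtype table.
row = cur.execute("""
    SELECT a.AccountID, a.Balance, s.InterestRate
    FROM account_fact a JOIN savings_fact s ON a.AccountID = s.AccountID
    WHERE a.AccountType = 'Savings'""").fetchone()
print(row)  # (1002, 2500.0, 1.5)
```

The join key is the shared `AccountID`, so each subtype table only ever holds rows for its own account type.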

### Subtype Dimension Tables

#### Dimension Table for Checking Accounts (Custom Dimension Table)

| AccountID | AccountType | AccountFeatures             |
|-----------|-------------|-----------------------------|
| 1001      | Checking    | Free checks, Mobile banking |

#### Dimension Table for Savings Accounts (Custom Dimension Table)

| AccountID | AccountType | InterestType | MinimumBalance |
|-----------|-------------|--------------|----------------|
| 1002      | Savings     | Compound     | 1000.00        |

#### Dimension Table for Mortgages (Custom Dimension Table)

| AccountID | AccountType | PropertyAddress | LoanOfficer    |
|-----------|-------------|-----------------|----------------|
| 1003      | Mortgage    | 123 Maple St.   | Sarah Thompson |

#### Dimension Table for Business Loans (Custom Dimension Table)

| AccountID | AccountType  | BusinessName    | BusinessType |
|-----------|--------------|-----------------|--------------|
| 1004      | BusinessLoan | ABC Enterprises | Retail       |
To explain this concept with a tabular data example, let's consider a retail business
that maintains a sales fact table for business intelligence (BI) reporting.

### Sales Fact Table (Traditional Nightly Batch Process)

Traditionally, this sales fact table might be updated once per day, typically during
off-peak hours at night. This process involves:
- **Source Data:** Daily sales data collected from various stores.
- **Batch Update:** Data is processed and loaded into the fact table at night.
- **Indexes and Aggregations:** Built on the entire fact table for fast querying and
reporting.

**Example Table Structure:**

| Date       | Store_ID | Product_ID | Sales_Amount | Quantity_Sold |
|------------|----------|------------|--------------|---------------|
| 2024-07-25 | 001      | 101        | 500          | 50            |
| 2024-07-25 | 002      | 102        | 300          | 30            |
| 2024-07-24 | 001      | 101        | 450          | 45            |
| 2024-07-24 | 002      | 103        | 600          | 60            |

### Real-Time Updates with "Hot Partition"

To support real-time updates, the sales fact table can be enhanced with a "hot
partition," which is a section of the table that resides in physical memory for fast
access and is updated more frequently.

- **Hot Partition:** This partition contains the most recent data (e.g., sales from the
current day).
- **Deferred Updates:** Allows queries on the rest of the table to run without
interruption while updates are applied to the hot partition.
- **No Indexes/Aggregations:** To speed up the update process, indexes and
aggregations are not built on this partition.

**Example Table Structure with Hot Partition:**

| Date       | Store_ID | Product_ID | Sales_Amount | Quantity_Sold | Partition_Type |
|------------|----------|------------|--------------|---------------|----------------|
| 2024-07-27 | 001      | 101        | 200          | 20            | Hot Partition  |
| 2024-07-27 | 002      | 104        | 100          | 10            | Hot Partition  |
| 2024-07-26 | 001      | 101        | 480          | 48            | Historical     |
| 2024-07-26 | 002      | 102        | 350          | 35            | Historical     |

### Deferred Updating

In the case of deferred updating, the system allows existing queries to run to
completion before applying updates. For example, if a report is being generated at the
same time as new sales data is being loaded, the update will wait until the report
query completes.
**Process Overview:**

1. **Real-Time Data Collection:** Sales data is collected in real time and
temporarily stored in a staging area.
2. **Hot Partition Update:** Data from the staging area is loaded into the hot
partition.
3. **Deferred Updating:** Existing queries on the historical data are allowed to
complete before merging the hot partition data into the main fact table.
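
The deferral step can be sketched with a small in-process coordinator. The class and method names below are illustrative only (not from any real warehouse engine): a merge into the main fact table is simply postponed until every in-flight query has finished.

```python
# Sketch of deferred updating: pending hot-partition batches are merged into
# the main fact table only when no query is reading the historical data.
class DeferredMerger:
    def __init__(self):
        self.active_queries = 0
        self.pending_batches = []   # hot-partition rows waiting to merge
        self.fact_table = []        # the historical fact table

    def begin_query(self):
        self.active_queries += 1

    def end_query(self):
        self.active_queries -= 1
        self._try_merge()           # a finished query may unblock the merge

    def submit_batch(self, rows):
        self.pending_batches.append(rows)
        self._try_merge()

    def _try_merge(self):
        # Defer while any query is still running against the table.
        if self.active_queries == 0:
            for batch in self.pending_batches:
                self.fact_table.extend(batch)
            self.pending_batches.clear()

m = DeferredMerger()
m.begin_query()                                        # a report starts running
m.submit_batch([("2024-07-27", "001", 101, 200, 20)])
print(len(m.fact_table))  # 0 -- merge deferred while the query runs
m.end_query()                                          # report finishes
print(len(m.fact_table))  # 1 -- pending batch merged
```

A production engine would do this with locks or snapshot isolation rather than a counter, but the ordering guarantee is the same: readers drain first, then the update lands.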

### Key Benefits

- **Frequent Updates:** Ensures the fact table is updated frequently, providing more
current data for analysis.
- **Performance:** By keeping the hot partition in memory and avoiding indexes and
aggregations on it, update performance is enhanced.
- **Minimal Disruption:** Deferred updates ensure that ongoing queries are not
disrupted, maintaining the integrity of the BI reporting layer.

### Example Workflow

1. **Morning:** Sales data from 9 AM to 10 AM is collected and loaded into the hot
partition.
2. **Midday Report:** A report is generated using data up to the previous day, as the
hot partition data has not yet been merged.
3. **Afternoon:** Sales data from 10 AM to 11 AM is collected and loaded into the
hot partition.
4. **End of Day:** The hot partition data is merged into the main fact table, and
indexes/aggregations are updated overnight.

This approach ensures that BI reports can access near-real-time data while
maintaining high performance and minimizing disruptions.
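
The daily workflow above can be sketched with two in-memory partitions. The structures and function names are purely illustrative: intraday loads are cheap appends to a "hot" list (no index or aggregate maintenance), and the end-of-day step merges it into the historical partition.

```python
# Sketch of the hot-partition workflow: historical rows are stable and
# indexed; hot rows are appended cheaply and merged at end of day.
historical = [
    ("2024-07-26", "001", 101, 480, 48),
    ("2024-07-26", "002", 102, 350, 35),
]
hot = []

def load_realtime(row):
    # Cheap append: no indexes or aggregations are maintained on the hot partition.
    hot.append(row)

def total_sales(include_hot=False):
    # Reports choose a stable historical view or a near-real-time view.
    rows = historical + hot if include_hot else historical
    return sum(r[3] for r in rows)  # sum of Sales_Amount

load_realtime(("2024-07-27", "001", 101, 200, 20))
load_realtime(("2024-07-27", "002", 104, 100, 10))

print(total_sales())                  # 830  -- historical only
print(total_sales(include_hot=True))  # 1130 -- near-real-time view

# End of day: merge the hot partition, then rebuild indexes/aggregations overnight.
historical.extend(hot)
hot.clear()
```

The midday report in step 2 of the workflow corresponds to `total_sales()` without the hot partition; a near-real-time dashboard would pass `include_hot=True`.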

Let's break down this idea using a tabular data example:

### Explanation
In a data warehouse, ensuring data quality is essential. To do this, data quality screens
or filters are set up to test data as it moves from source systems to the Business
Intelligence (BI) platform. If an error is detected, the event is recorded in a
specialized dimensional schema in the ETL (Extract, Transform, Load) back room.
This schema consists of:

1. **Error Event Fact Table**: Records each individual error event.


2. **Error Event Detail Fact Table**: Records details about each column in each table
that is involved in the error event.
### Example

#### Source Data


Suppose we have the following data in a source system:

| CustomerID | Name        | Age | Email             |
|------------|-------------|-----|-------------------|
| 1          | John Doe    | 25  | [email protected] |
| 2          | Jane Smith  | -30 | [email protected] |
| 3          | Alice Brown | 40  | alice@invalid     |

Here, `CustomerID` 2 has an invalid age (`-30`), and `CustomerID` 3 has an invalid
email (`alice@invalid`).

#### Error Event Fact Table


This table records each error event.

| ErrorEventID | ErrorDate  | TableName | RowID | ErrorType     |
|--------------|------------|-----------|-------|---------------|
| 1            | 2024-07-27 | Customers | 2     | Invalid Age   |
| 2            | 2024-07-27 | Customers | 3     | Invalid Email |

- **ErrorEventID**: Unique identifier for each error event.
- **ErrorDate**: Date when the error was detected.
- **TableName**: Name of the table where the error occurred.
- **RowID**: Identifier of the row with the error.
- **ErrorType**: Description of the error.

#### Error Event Detail Fact Table


This table records details about each column involved in the error event.

| ErrorEventDetailID | ErrorEventID | ColumnName | InvalidValue  | ErrorDescription       |
|--------------------|--------------|------------|---------------|------------------------|
| 1                  | 1            | Age        | -30           | Age cannot be negative |
| 2                  | 2            | Email      | alice@invalid | Invalid email format   |

- **ErrorEventDetailID**: Unique identifier for each error event detail.
- **ErrorEventID**: References the corresponding error event.
- **ColumnName**: Name of the column with the error.
- **InvalidValue**: The value that caused the error.
- **ErrorDescription**: Description of why the value is invalid.

### Summary
When data flows from the source systems to the BI platform, data quality screens
check for errors. If errors are found, they are logged in the **Error Event Fact
Table** and detailed in the **Error Event Detail Fact Table**. This ensures that
errors can be tracked and addressed before they affect the BI platform's outputs.

You might also like